
Università di Pisa

Dottorato di ricerca in Ingegneria dell’Informazione

Enhanced network processing in the Cloud Computing era

Doctoral Thesis

Author

Vincenzo Maffione

Tutor(s)

Prof. Giuseppe Lettieri, Luigi Rizzo

Reviewer(s)

Prof. Sylvia Ratnasamy, Andrew W. Moore

The Coordinator of the PhD Program

Prof. Marco Luise

Pisa, April 2019 (XXXI cycle)


Alla mia famiglia, per tutto il suo supporto ed aiuto. Ad Alessia, per essermi stata sempre accanto, per avermi motivato, e per aver reso ogni mio giorno più bello.


Summary

Cloud Computing has radically changed the way we look at computing hardware resources, with server machines hosting tens to thousands of possibly unrelated applications belonging to different customers. Virtualization technologies provide those isolation guarantees that are necessary for untrusted applications to coexist within the same physical machine. Although hardware support for virtualization is essential to achieve acceptable performance, the success of Cloud Computing is still largely attributable to a wide range of software products, frameworks and libraries that complete and enhance the virtualization capabilities directly provided by the hardware. In particular, Cloud hosts often need to process in software huge amounts of network traffic on behalf of Virtual Machines and application containers, typically attached to a common virtual switch. Even with the many CPUs available on modern machines (100 and beyond), software packet processing at very high speeds (i.e., 1-100 Mpps) may be challenging because of the communication overhead incurred by the processing pipeline when using basic mechanisms such as inter-process notifications, sleeps, or lock-free queues. The current literature lacks models to analyze these low level mechanisms in depth and identify suitable guidelines for the design of high-speed processing systems. This thesis presents a thorough discussion of the impact that queues and synchronization mechanisms have on the performance of I/O processing pipelines, for all the possible operating regimes, and with a particular focus on networking and virtualization. Models for throughput, latency and energy efficiency of producer-consumer systems are introduced and validated experimentally to verify that they are appropriate to represent real systems. Several fast Single Producer Single Consumer queues are discussed and characterized in terms of interactions with the cache coherence system, with two of them being original contributions of this thesis. As an application of how these basic mechanisms can be used in practice, a novel high-performance packet scheduling architecture, suitable for Data Centers, is presented and experimentally validated, showing better throughput, latency and isolation than current solutions. Overall, the analysis presented in this thesis provides suggestions and guidelines to design efficient software datapath components for virtual network switches, hypervisor backends for device emulation, and other I/O processing components relevant to Cloud Computing environments.


Sommario

Il paradigma del Cloud Computing ha radicalmente cambiato il ruolo dei calcolatori e delle loro risorse hardware. Le macchine nel Cloud vengono oggi infatti utilizzate per ospitare centinaia o migliaia di applicazioni, spesso tra loro scorrelate e appartenenti a clienti differenti. Le tecnologie di virtualizzazione forniscono quelle garanzie di isolamento che sono necessarie per la convivenza di applicazioni su una stessa macchina fisica, anche in totale mancanza di reciproca fiducia. Benché il supporto hardware per la virtualizzazione sia essenziale per ottenere prestazioni accettabili, il successo del Cloud Computing è comunque largamente dovuto alla grande varietà di prodotti software, framework e librerie che completano e potenziano le funzionalità di virtualizzazione implementate direttamente in hardware. In particolare, le macchine nel Cloud vengono spesso utilizzate per processare enormi quantità di traffico di rete per conto di Virtual Machine e container, le quali sono solitamente connesse tra loro tramite uno switch virtuale a comune. Nonostante i calcolatori moderni possano essere dotati di centinaia di CPU, l'elaborazione di pacchetti di rete ad alta velocità (per es., 1-100 Mpps) può comunque comportare notevoli difficoltà, a causa dei costi di comunicazione dovuti a meccanismi di base quali notifiche tra processi, operazioni di sincronizzazione e code lock-free. La letteratura corrente manca di modelli per analizzare questi meccanismi di basso livello in profondità, al fine di identificare linee guida opportune per il progetto di sistemi di elaborazione dati ad alte prestazioni. Questa tesi presenta un'estesa discussione dell'impatto di meccanismi di sincronizzazione e code sulle prestazioni delle pipeline che processano dati, prendendo in considerazione tutti i possibili regimi operativi, e focalizzandosi in particolare sul networking e la virtualizzazione. Vengono introdotti modelli per il throughput, la latenza e l'efficienza energetica di sistemi produttore/consumatore, validandoli sperimentalmente per verificare che siano appropriati alla rappresentazione di sistemi reali. Diverse code Single Producer Single Consumer vengono discusse e caratterizzate in termini di interazioni con il sottosistema delle cache; due di queste code costituiscono un contributo originale di questa tesi. Come applicazione dell'utilità pratica di questo studio, la tesi presenta una architettura innovativa per lo scheduling di pacchetti di rete ad alte prestazioni, adatta ad essere utilizzata nei Data Center. Gli esperimenti mostrano come questa architettura comporti vantaggi in termini di throughput, latenza e isolamento rispetto alle soluzioni attuali. Complessivamente, l'analisi presentata in questo lavoro di tesi fornisce suggerimenti e linee guida per il progetto di switch di rete virtuali, emulatori di dispositivi di I/O e altre componenti di elaborazione ad alte prestazioni rilevanti per il Cloud Computing.


List of publications

International Journals

1. Lettieri, G., Maffione, V., & Rizzo, L. (2017). A Study of I/O Performance of Virtual Machines. The Computer Journal, 61(6), 808-831. doi:10.1093/comjnl/bxx092

2. Rizzo, L., Valente, P., Lettieri, G., & Maffione, V. (2018). PSPAT: Software packet scheduling at hardware speed. Computer Communications, 120, 32-45. doi:10.1016/j.comcom.2018.02.018

3. Maffione, V., Lettieri, G., & Rizzo, L. (2018). Cache-aware design of general-purpose Single-Producer-Single-Consumer queues. Software: Practice and Experience. doi:10.1002/spe.2675

International Conferences/Workshops with Peer Review

1. Rizzo, L., Garzarella, S., Lettieri, G., & Maffione, V. (2016). A Study of Speed Mismatches Between Communicating Virtual Machines. Proceedings of the 2016 Symposium on Architectures for Networking and Communications Systems - ANCS '16. doi:10.1145/2881025.2881037

2. Maffione, V., Rizzo, L., & Lettieri, G. (2016). Flexible virtual machine networking using netmap passthrough. 2016 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). doi:10.1109/lanman.2016.7548852

3. Rizzo, L., Lettieri, G., & Maffione, V. (2016). Very high speed link emulation with TLEM. 2016 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). doi:10.1109/lanman.2016.7548841

4. Lettieri, G., Maffione, V., & Rizzo, L. (2017). A Survey of Fast Packet I/O Technologies for Network Function Virtualization. Lecture Notes in Computer Science: High Performance Computing, 579-590. doi:10.1007/978-3-319-67630-2_40


5. Yasukata, K., Huici, F., Maffione, V., Lettieri, G., & Honda, M. (2017). HyperNF. Proceedings of the 2017 Symposium on Cloud Computing - SoCC '17. doi:10.1145/3127479.3127489

Others

1. Lettieri, G., Maffione, V., Honda, M., & Rizzo, L. (2017). The Netmap framework for NFV applications. Full-day tutorial held at ACM SIGCOMM 2017. https://conferences.sigcomm.org/sigcomm/2017/tutorial-netmap-nfv.html


List of Abbreviations

A

API Application Programming Interface. 2, 9

C

CapEx Capital Expenditures. 2

CDF Cumulative Distribution Function. 28, 118

COTS Commercial off-the-shelf. 2, 5

D

DMA Direct Memory Access. 12

DPDK Data Plane Development Kit. XII, 9, 11–13, 17, 19–21, 23

DPI Deep Packet Inspection. 5

E

EPC Evolved Packet Core. 5

G

GRE Generic Routing Encapsulation. 4

I

IaaS Infrastructure as a Service. 1, 2

IMS IP Multimedia Subsystem. 5

IOMMU I/O Memory Management Unit. 2, 3

IP Internet Protocol. 4, 5, 16, 17, 76

IPI Inter-Processor Interrupt. 53, 56, 61

ISP Internet Service Provider. 5

IT Information Technology. 2

K


L

LTE Long Term Evolution. 5

M

MSS Maximum Segment Size. 17, 120

MTU Maximum Transmission Unit. 5, 16, 129

N

NAPI New API. 14, 30, 71

NAT Network Address Translation. 5, 116

NF Network Function. 21

NFV Network Function Virtualization. 5, 6, 8–10, 12–14, 17–21, 23, 31–33, 64, 69, 71, 74, 114, 140, 141

NIC Network Interface Card. 2–4, 8, 10–13, 16– 24, 26, 30, 31, 115–118, 120, 121, 123, 126, 128, 130–132, 135, 136, 138

NSP Network Service Provider. 5

NUMA Non-Uniform Memory Access. 12

O

OpEx Operational Expense. 2

OS Operating System. 4, 6, 8, 9, 11–13, 17, 19, 21, 25, 52–56, 65–67, 114–116, 118, 120, 124, 131, 133

OVS Open vSwitch. 12, 20, 23

OVS-DPDK DPDK-accelerated Open vSwitch. XII, 12, 17–24

P

PaaS Platform as a Service. 1

PCI Peripheral Component Interconnect. 9, 11, 19, 21, 31, 53

PCIe PCI Express. 2, 3, 34

PtP Point-to-point. 20, 23

R

RAM Random Access Memory. 1

RCU Read-Copy Update. 107

REST Representational State Transfer. 2

RX Receive. 11, 21

S

SDN Software Defined Networking. 5


SPSC Single Producer Single Consumer. I, III, XIII, 7, 12, 33, 74–79, 85, 92–96, 100, 101, 103, 106, 108, 111–113, 115, 121, 139–141

SR-IOV Single Root I/O Virtualization. XII, 2, 3, 9, 11, 19–24

SRMT Software-based Redundant Multi-Threading. 74

T

TCP Transmission Control Protocol. 16, 17, 76, 122, 137

TSO TCP Segmentation Offload. 4, 16, 137

TX Transmission. 11, 21

U

UDP User Datagram Protocol. 4, 17, 128–134, 136

V

vCPU virtual CPU. 1, 14, 15

VF Virtual Function. 11, 18, 23

VM Virtual Machine. III, XIII, 1–24, 32–34, 52, 56, 65, 69, 71, 114

VNF Virtual Network Function. 5, 6, 12, 14, 15, 18, 20, 139, 141

VoIP Voice over IP. 4

VPN Virtual Private Network. 5

VPP Vector Packet Processing. 17


Glossary

B

BIFFQ Batched Improved FastForward Queue. XIII, 90–92, 94, 95, 97, 99–107, 109, 110, 113, 141

BLQ Batched Lamport Queue. XIII, 81–85, 89, 91, 92, 95, 97–107, 109, 110, 113, 141

BW Busy waiting. 37, 46–51

C

CL Client List. 122, 123

CM Client Mailbox. 122–124, 128

CSB Communication Status Block. 14, 15

F

FC Fast Consumer. 16, 140

FFQ FastForward Queue. XIII, 85–89, 92, 94, 98–100, 102, 105, 106, 110, 111

FP Fast Producer. 140

I

IFFQ Improved FastForward Queue. XIII, 74, 88–92, 94, 95, 99, 100, 102–107, 109, 110, 113, 125, 126, 141

L

LLQ Lazy Lamport Queue. XIII, 29, 80, 81, 84, 85, 87, 92, 98, 100–106, 109, 110

LQ Lamport Queue. XIII, 78, 80–82, 84, 87, 88, 92, 94, 97, 98, 100, 102, 104–106, 109– 111


M

Mpps Millions of packets per second. I, III

N

nFC Notified Fast Consumer. 42–44, 46, 47, 49– 51, 57

nFP Notified Fast Producer. 42–44, 46, 47, 49– 51, 57

nSCS Notified Slow Consumer Start. 42, 44–47, 51

nSPS Notified Slow Producer Start. 42, 44–47, 51, 70

nSS Notified Slow Start. 42, 44–48, 50, 51, 70

P

PS Packet Scheduler. 117, 127

PSPAT Parallel Scheduling Parallel Transmission. XIV, 73, 76, 77, 88, 93, 115, 116, 118, 120–139, 141, 142

S

SA Scheduling Algorithm. 117–119, 121–124, 127, 128, 131, 135

sFC Sleeping Fast Consumer. 38–41, 47, 49, 51, 57, 65, 66

sFP Sleeping Fast Producer. 38–41, 49, 51, 57, 65

sLS Long Sleeps regime. 38, 40–42, 48, 49, 51, 57, 66

SQ Short Queues regime. 45

T

T-WFI Time Worst-case Fair Index. XIV, 119, 120, 126, 127, 129, 130, 136

TC Traffic Control for the Linux kernel. 25, 26, 115, 128, 130–132, 134–137, 139, 141


Contents

List of Abbreviations VII

Glossary X

1 Introduction 1

1.1 The role of software . . . 2

1.2 Virtualized networking . . . 3

1.2.1 Software Defined Networking . . . 5

1.2.2 Network Function Virtualization . . . 5

1.3 Thesis motivation, contributions and structure . . . 6

2 Technologies for fast network I/O 8

2.1 Background on VirtIO . . . 9

2.2 An overview of current frameworks . . . 10

2.2.1 SR-IOV . . . 11

2.2.2 DPDK-accelerated Open vSwitch (OVS-DPDK) . . . 12

2.2.3 NetVM . . . 12

2.2.4 Netmap . . . 13

2.2.5 Snabb . . . 17

2.2.6 VPP . . . 17

2.2.7 BESS . . . 18

2.2.8 Other related work . . . 18

2.3 Comparing architectures and features . . . 18

2.4 Experimental evaluation . . . 22

2.4.1 vm2vm experiment . . . 23

2.4.2 unifwd experiment . . . 23

2.4.3 Testbed description . . . 23

2.4.4 Throughput analysis . . . 24

2.4.5 CPU utilization analysis . . . 24

2.5 An example packet processing application . . . 24


2.5.2 TLEM architecture . . . 26

2.5.3 Experimental evaluation . . . 29

2.6 Conclusions . . . 31

3 A model for I/O performance of Virtual Machines 32

3.1 System model . . . 34

3.1.1 Polling by busy waiting . . . 37

3.1.2 Polling by sleeping . . . 37

3.1.3 Notification-based regimes . . . 42

3.2 Analysis of Throughput, Latency and Efficiency . . . 45

3.2.1 Throughput . . . 46

3.2.2 Efficiency . . . 48

3.2.3 Latency . . . 50

3.3 Estimating the system parameters . . . 52

3.3.1 Description of the test environment . . . 52

3.3.2 Description of the system under study . . . 53

3.3.3 Estimating sleeping costs . . . 54

3.3.4 Estimating notification costs . . . 56

3.4 Model validation . . . 57

3.4.1 Validation of sleep-based regimes . . . 57

3.4.2 Validation of notification regimes . . . 57

3.5 Relaxing the assumptions . . . 60

3.6 Design strategies . . . 63

3.6.1 Regime identification . . . 64

3.6.2 Fast consumer design . . . 65

3.6.3 Fast producer design . . . 66

3.7 Case studies . . . 67

3.7.1 Fast consumer example . . . 67

3.7.2 Fast producer example . . . 68

3.8 Limitations . . . 69

3.8.1 VM chaining . . . 69

3.8.2 Batching . . . 69

3.9 Related work . . . 70

3.10 Conclusions . . . 71

4 Design of fast Single Producer Single Consumer queues 73

4.1 Problem statement . . . 75

4.1.1 The role of batching operations . . . 76

4.1.2 SPSC queues . . . 76

4.2 Lamport queues . . . 78

4.2.1 Baseline Lamport Queue . . . 78

4.2.2 Lazy Lamport Queue . . . 80

4.2.3 Batched Lamport Queue . . . 81

4.3 Queues based on Fast-Forward . . . 85

4.3.1 FastForward Queue . . . 85

4.3.2 Improved FastForward Queue . . . 88


4.4 Experimental validation . . . 92
4.4.1 Validation methodology . . . 94
4.4.2 Throughput experiments . . . 97
4.4.3 Latency evaluation . . . 103
4.5 An example application . . . 106
4.5.1 Experiment methodology . . . 108
4.5.2 Flooding experiments . . . 108
4.5.3 Request-response experiments . . . 109
4.6 Related works . . . 111
4.7 Conclusion . . . 113

5 A high performance network scheduler 114

5.1 Motivation and background . . . 116

5.1.1 Packet Schedulers . . . 117

5.1.2 Hardware Packet Schedulers (and their limitations) . . . 117

5.1.3 Software Packet Schedulers . . . 118

5.1.4 Memory communication costs . . . 118

5.1.5 Scheduling Algorithms . . . 119
5.2 PSPAT architecture . . . 120
5.2.1 Clients . . . 122
5.2.2 Flows . . . 122
5.2.3 Dispatchers . . . 123
5.2.4 The arbiter . . . 123
5.3 Mailboxes . . . 125
5.4 T-WFI in PSPAT . . . 126
5.4.1 T-WFI examples . . . 126
5.5 Experimental evaluation . . . 127

5.5.1 Two PSPAT implementations . . . 128

5.5.2 Testbed description . . . 128

5.5.3 Experiment methodology and configuration . . . 129

5.5.4 Throughput experiments . . . 130

5.5.5 Stability of rate allocation . . . 135

5.5.6 Measurement of latency distributions . . . 135

5.6 Related work . . . 137

5.7 Conclusions . . . 139

6 Conclusions 140

Appendix A Deriving the cost of memory stalls 142

Appendix B Proof of inequality (3.10) 144


CHAPTER 1

Introduction

The success of Cloud Computing technologies [94] has radically changed the way we look at computing hardware resources. Server machines are no longer dedicated to one or a few cooperating applications; rather, they host tens to thousands of possibly unrelated applications belonging to different customers (tenants). Mutually untrusted applications can coexist on the same physical machine because of the isolation guarantees provided by virtualization technologies. Isolation can be observed from two complementary perspectives, namely namespace and performance. Because of namespace isolation, an application is not able to detect or interact with the virtualized resources (e.g., CPUs, memory address space, networks, storage) assigned to applications belonging to a different tenant. Performance isolation provides guarantees on the minimum amount of a given resource that can be used by a tenant and its applications, such as the number of dedicated CPUs (or amount of CPU time), amount of memory, storage space and network bandwidth, independently of the amount of resources used by the other tenants.

Cloud services come in different flavors, such as Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) [94], with a different balance of degree of resource control, elasticity of resource provisioning, and ease of maintenance. IaaS provides customers with Virtual Machines (VMs) or containers (light virtualization), where they can deploy arbitrary software (operating system included), on which they have complete control. Customers can allocate more VMs or modify the size of an existing VM (number of vCPUs, RAM, etc.) on-demand, and are responsible for configuring their applications to make use of the new resources. PaaS enables the deployment of custom applications that rely on libraries, services and other frameworks supported by the provider. Customers do not have control over the operating system or the provider's software, but the resources allocated to the application can be automatically adapted (scaled up or down) to match the current workload.

Cloud Computing allows companies and institutions to (totally or partially) replace their existing on-site IT infrastructure (e.g., server racks or whole data centers) with remote resources hosted in the cloud provider data centers. The main motivation of this strategy is to cut the Operational Expense (OpEx) and Capital Expenditures (CapEx) related to the on-site resources, including hardware purchase and maintenance, software upgrades, machine configuration and troubleshooting. Cloud customers can pay only for the resources that they actually use or that they choose to allocate, scaling up and down as necessary.

1.1 The role of software

On the one hand, the Cloud Computing paradigm has been fostered by substantial hardware improvements over the last two decades. With hardware-assisted CPU virtualization [68, 101], guest software can run in a VM at nearly bare metal speed, overcoming the performance limitations of instruction emulation and binary translation [30, 132]. Similarly, hardware-assisted I/O virtualization [20] cuts down the cost of I/O operations issued by a VM. As another example, the availability of large Commercial off-the-shelf (COTS) machines with plenty of CPUs (100 and beyond), memory and I/O bandwidth is an excellent opportunity for Cloud providers to amortize the fixed costs involved in maintaining a data center, e.g., power consumption, cabling, cooling, hardware replacement and upgrade.

On the other hand, the Cloud ecosystem thrives because of the fundamental contribution of an extended set of software products, frameworks and libraries that complete and enhance the virtualization capabilities directly supported by the hardware. A large part of this software is related to resource control, configuration, orchestration and elasticity control, which allow for provisioning VMs, application containers, virtual networks and data stores through user-friendly interfaces or REST APIs. Examples of cloud/cluster management software for IaaS are OpenStack [129], Amazon EC2, Microsoft Azure, Google Compute Engine, Docker and Kubernetes [29].

Besides control and orchestration functionalities, it is often the case that virtualized I/O processing for VMs or containers is actually implemented in software, at least in part. Although hardware-assisted solutions exist, such as the IOMMU [39] for passing through devices to a VM, or SR-IOV [37] for NIC hardware virtualization, software-based I/O processing still offers many advantages:

• Software is much more flexible than hardware, enabling faster delivery of new features, bug fixes and performance improvements. As an example, network encapsulation protocols for the data center evolve more quickly than hardware vendors can implement the corresponding offloads. Fast software implementations are therefore required at least until the hardware support is ready.

• Hardware-assisted solutions tend to move I/O processing to devices on the PCIe bus. As a result, the bus can easily become a bottleneck at high I/O rates, since many VMs may create heavy contention on the bus and device controllers [113] (see Chapter 5). Moreover, access to the PCIe bus is normally granted in such a way that all the CPUs have equal bus access, irrespective of the desired scheduling policies. Conversely, implementing I/O processing functionalities on the system CPUs removes the need to move data to the PCIe bus, and enables custom scheduling policies.

• Hardware virtualization resources are often expensive, and therefore there are limitations on the maximum number of devices and/or VMs that they can support. As an example, SR-IOV allows a single physical NIC to expose many logical networking interfaces (and each one can be assigned to a different VM), but the maximum number of logical interfaces is often limited to 32 or 64. Conversely, I/O devices emulated in software (or paravirtualized [25, 123]) do not suffer from these limitations.

• The most convenient trade-off is often a hybrid solution, where a few VMs or applications are directly served by hardware-assisted virtualization (e.g., device passthrough and IOMMU) to benefit from the lower latency and higher predictability, whereas the remaining I/O is handled in software. The hybrid strategy enables scalability to many clients and at the same time optimal performance for the most demanding customers.

Processing virtualized I/O in software is thus a necessity, but it may be challenging, especially at very high rates. Commodity processors and memory subsystems are not designed to provide predictable performance: they rather rely on several optimizations and heuristics that offer good performance in general, but with high variability across the possible workloads. Examples include speculative and out-of-order execution, branch prediction, and above all the cache coherence protocol. Modern software must take these features into careful consideration to achieve better performance and predictability.

General purpose CPUs programmed for a given data processing task do not normally offer the same level of performance offered by dedicated hardware. Nevertheless, current machines have many available CPUs, which makes it possible to spread the processing work over different CPUs and carry out more work. This approach introduces yet more challenges, because the processes or threads running in parallel on different CPUs need to communicate with each other. Communication involves overheads, e.g., notifications, sleeps, cache misses, that must be kept under control to avoid excessive performance degradation.

1.2 Virtualized networking

Software products performing I/O work on behalf of VMs and containers are particularly common and relevant in the networking field. Virtual switches are a fundamental component of any virtualization stack, as they are responsible for moving application network traffic between the VMs and the data center physical network (Figure 1.1).

Figure 1.1: A virtual switch connecting VMs and a physical NIC. The guest OS uses a driver to access the virtual NIC device, which is normally emulated in software by the host hypervisor. The virtual switch acts as a backend for the virtual NIC, and it often uses an encapsulation protocol to implement the "virtual network" abstraction.

The main task of a virtual switch is to implement the virtual network abstraction. All the VMs and containers belonging to the same virtual network can communicate with each other as if they were physically connected to a common hardware Ethernet switch. Conversely, two entities belonging to separate virtual networks cannot see or communicate with each other directly. At the same time, VMs and containers may run on the same host (served by the same virtual switch) or on separate hosts or portions of the data center, irrespective of their membership to the same or different virtual networks. This abstraction is implemented by means of encapsulation: the virtual switch encapsulates the IP traffic generated by a VM within protocols such as GRE [40], VXLAN [89], or UDP. The traffic is then transported through the Cloud provider network to the destination host, where the local virtual switch will decapsulate it and deliver it to the right VM (belonging to the same virtual network). In addition to that, virtual switches may perform a wide range of packet manipulation or scheduling tasks, such as rate limiting, prioritization, firewalling, ciphering and deciphering.
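For concreteness, the sketch below shows the 8-byte VXLAN header defined in RFC 7348 (the VXLAN specification cited above) and a hypothetical helper that prepends it to an inner Ethernet frame. The helper name is illustrative only, and the outer Ethernet/IP/UDP headers that the virtual switch would also add are omitted.

#include <stdint.h>
#include <string.h>

/* VXLAN header (RFC 7348): 8 bytes prepended to the inner Ethernet frame,
 * itself carried inside an outer UDP/IP packet addressed to the peer host. */
struct vxlan_hdr {
    uint8_t flags;        /* bit 0x08 set when the VNI field is valid */
    uint8_t reserved1[3];
    uint8_t vni[3];       /* 24-bit Virtual Network Identifier */
    uint8_t reserved2;
};

/* Illustrative helper: wrap an inner Ethernet frame for virtual network 'vni'. */
static size_t vxlan_encap(uint8_t *out, const uint8_t *frame, size_t len, uint32_t vni)
{
    struct vxlan_hdr h;

    memset(&h, 0, sizeof(h));
    h.flags = 0x08;                 /* VNI present */
    h.vni[0] = (vni >> 16) & 0xff;  /* 24-bit VNI in network byte order */
    h.vni[1] = (vni >> 8) & 0xff;
    h.vni[2] = vni & 0xff;
    memcpy(out, &h, sizeof(h));
    memcpy(out + sizeof(h), frame, len);
    return sizeof(h) + len;
}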

Data center traffic characterization Most of the network traffic in a data center originates from end-users interacting with applications hosted in the data center itself, such as web services, social networks, video streaming, VoIP, or search queries. An end-user request enters the data center and gets routed to the appropriate front-end server, which often needs to decompose it into multiple sub-requests (e.g., tens to hundreds), and forward those to the appropriate back-end servers. Once all the corresponding sub-responses are returned to the front-end server, the latter can send its response back to the end-user. In this common scenario, from the end-user perspective the bottleneck of network performance is normally latency. If the response is large (e.g., because it contains video frames or images), it can be transported over large network packets in order to keep the overall packet rate reasonably low, which is the key to efficient software packet processing. As an example, even with 100 Gbit NICs a packet size of 32 KB results in a maximum packet rate of less than 400000 packets per second (i.e., over 2 µs per packet), which can be easily handled in software using general purpose CPUs. Such a large packet size is possible because of the TCP Segmentation Offload (TSO) [35] capability supported by virtually any NIC; the NIC performs TCP (or UDP) segmentation in hardware, thus exposing to the software a very large apparent MTU, typically up to 64 KB.
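The quoted figure follows from a quick back-of-the-envelope computation (taking 32 KB = 32768 bytes and ignoring framing overhead):

\[
R \;=\; \frac{100 \times 10^{9}\,\text{bit/s}}{32768\,\text{B} \times 8\,\text{bit/B}} \;\approx\; 3.8 \times 10^{5}\ \text{pps},
\qquad
\frac{1}{R} \;\approx\; 2.6\,\mu\text{s per packet}.
\]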

1.2.1 Software Defined Networking

As an important contribution to virtualized networking, recent years have seen the rise of novel paradigms to design and manage the networks of data centers, ISPs and Network Service Providers (NSPs), often combining networking and virtualization. In contrast to more traditional approaches, Software Defined Networking (SDN) [73] promotes a clear and standardized separation between the packet forwarding (and manipulation) functionalities of physical (and virtual) switches in the provider network, and the control functionalities required to program the switches forwarding tables. With SDN, switches are programmed by one or a few logically centralized controllers, using a protocol such as OpenFlow [93]. The main purpose of SDN is to simplify network management, configuration and monitoring [67]. However, SDN also had an impact on the datapath implementation of many virtual switches, because their packet processing logic is usually based on OpenFlow or similar SDN protocols. Examples of virtual switches are Open vSwitch [108], BESS [49], Google’s Andromeda [36], VALE [54,117], Click [71] and Snabb [12], just to name a few.

1.2.2 Network Function Virtualization

In addition to rearranging how network switches are configured and managed, ISPs and NSPs can benefit from virtualization technologies by moving their packet processing functionalities from dedicated hardware appliances to VMs or containers running on COTS hardware. This approach is known as Network Function Virtualization (NFV) [46], which brings the advantages of virtualization to network operators:

• Server consolidation. Running different Virtual Network Functions (VNFs) on VMs hosted by the same physical machine makes it possible to maximize the usage of its physical resources (CPUs, memory, I/O) and reduce power consumption.

• Elastic service provisioning. VMs can be started and stopped dynamically to match the current workload. This allows NSPs to avoid over-provisioning and its costs.

• Improved reliability. VMs running on malfunctioning hardware can be easily migrated to a different physical host, possibly without service disruption and with only a graceful (and temporary) performance degradation.

• Improved flexibility. VNFs can be connected (chained) with each other into arbitrary topologies without any physical intervention. Similarly, the chaining topology can be easily modified to optimize performance or to cope with changing requirements.

Examples of network services that can be turned into VNFs are Network Address Translation (NAT), Deep Packet Inspection (DPI), Virtual Private Network (VPN), IP Multimedia Subsystem (IMS), firewall processing, IP routing, and some of the components of the Evolved Packet Core (EPC) in the context of 4G LTE mobile networks [48].


NFV significantly influenced the design of virtual switches and related software. VNFs may need to process 1–10 million packets per second and beyond, which normally flow through a software switch. Dealing with such high rates is particularly challenging, and is the focus of this thesis. A careful design of the software processing pipeline is therefore required to achieve the desired performance. Examples of frameworks that provide efficient mechanisms to interconnect VNFs are Netmap [112], DPDK [4], NetBricks [104], Snabb [12] and NetVM [56].

1.3 Thesis motivation, contributions and structure

At very high data rates (e.g., millions of operations per second), the performance of I/O processing for virtualized execution environments is significantly influenced by low level synchronization and data transfer mechanisms. Efficient queues are required to connect OS processes or threads belonging to a data processing pipeline with minimal overhead. Synchronization between communicating threads — e.g., OS primitives to sleep, wait for events or wake up other threads — is often expensive, especially in the common case where one of the threads is running within a VM and the other is running in the host OS.

With reference to virtualized network I/O, although for general data center workloads the network I/O bottleneck is usually end-to-end latency (see Sec. 1.2), emerging paradigms such as NFV call for specialized middlebox applications where the bottleneck is usually the packet rate. This thesis focuses on the latter use case.

The throughput and latency behavior of an I/O processing pipeline has a strong dependency on the actual pipeline workload. Different workloads may result in different relative speeds between communicating threads, and thus different operating regimes. Some queue implementations and synchronization schemes may be more favourable than others for a given operating regime, leading to fewer notifications, wake-ups, or cache misses. Although the virtual switches mentioned in Section 1.2, together with hypervisors like QEMU [26] or Xen [25], try to use efficient queues and suppress notifications as much as possible [118, 123], their algorithms do not work well in all the possible situations. It is not uncommon to observe pathological situations where throughput degrades significantly [80].

The current literature lacks models to analyze these low level mechanisms in depth and identify suitable guidelines for the design of high-speed processing systems. This thesis presents a thorough analysis of the impact that queues and synchronization mechanisms have on the throughput, latency and energy efficiency of high-rate virtualized I/O processing, in all the possible operating regimes, and with a particular focus on network I/O. Performance models are introduced and validated experimentally to verify that they are appropriate to represent real systems. The analysis provides suggestions and guidelines to design basic software components for virtual network switches, hypervisor backends for device emulation, and other processing components relevant to Cloud Computing environments.

In more detail, this thesis makes the following contributions:

• An overview of some promising software frameworks for packet processing at very high rates, with reference to the NFV use case. Chapter 2 presents these solutions and compares them against a set of desirable features and performance benchmarks. One of these technologies, the Netmap passthrough, is contributed by this thesis and thus described in more detail (Section 2.2.4).

• An analytical model for throughput, latency and energy efficiency of Producer Consumer systems. Different synchronization mechanisms are studied, namely notifications, sleeping and busy wait. Guidelines are provided to design efficient Producer Consumer systems under different workloads. The model is introduced in Chapter 3.

• The description and analysis of six general purpose Single Producer Single Consumer (SPSC) queues, suitable for very high message rates, with a thorough characterization in terms of interactions with the cache coherence system (Chapter 4). Two of these queues are novel contributions.

• The design and performance evaluation of a novel architecture for high-throughput packet scheduling, suitable for data center machines hosting many VMs or con-tainers. This work is presented in Chapter 5.


CHAPTER 2

Technologies for fast network I/O

Several frameworks and tools have been proposed to deal with the demanding network I/O requirements of NFV [46] deployments. Some of these frameworks were also introduced independently of NFV, as a solution to overcome the performance limitations of traditional OS and hypervisor networking capabilities. Traditional in-kernel network stacks are known to be unable to bear the high traffic loads that are expected on large server machines with high VM density and high-end 10-100 Gbit NICs, severely limiting the maximum packet rate that can be achieved between the different components in the system [42, 112]. In any case, different solutions have been introduced, each one coming with its own degree of flexibility, features and performance limitations, so that there are several aspects that users should take into consideration in order to make an informed decision. As an example, some solutions require hardware NIC drivers to be installed in the VMs, while others do not; some solutions provide a virtual switch (see Section 1.2) to connect VMs on the same physical machine, while others explicitly provide a faster "virtual link" abstraction. An appropriate comparison and classification with respect to different practical aspects is needed to help users choose the most convenient option, according to their needs in terms of performance, flexibility, reusability, NIC support, etc.

This Chapter contains a comparative survey of some existing fast network I/O solutions for NFV, describing and comparing the ones selected as the most promising and/or most used at the time of writing. The discussion is mostly limited to data-plane1 capabilities, and therefore does not consider most of the issues related to the control-plane (e.g., NFV controllers, performance monitoring, optimal resource allocation, etc.), which are completely orthogonal. The purpose of this survey is to give an overview of the desirable features that I/O processing mechanisms should have to satisfy the most demanding performance requirements. Other surveys related to NFV exist, but they either touch only lightly on existing data-plane solutions [60, 82] or they focus on other aspects like resource allocation [53] or security [145].

1 In the networking jargon, the term "data-plane" refers to the set of software and hardware resources that carry out packet processing, such as forwarding and manipulation.

Figure 2.1: Data structures of a VirtIO virtqueue. Each entry in the descriptor table references a buffer in guest memory. Guest buffers can be chained to form scatter-gather lists and exchanged between the guest and the hypervisor through the avail and used rings.

We focus primarily on how VMs or containers running on the same host can be connected with each other and/or with the external network, and how flexible and fast these connections are. The solutions analyzed are hardware-based (PCI passthrough with SR-IOV) or software-based, i.e. Open vSwitch (enhanced with DPDK), NetVM, Netmap, Snabb, VPP and BESS.

2.1 Background on VirtIO

VirtIO [123] is a widely used standard and API for I/O paravirtualization, a term that refers to the guest device driver being aware of running inside a VM. Most hypervisor software (QEMU, bhyve, VirtualBox, Xen, etc.) and guest operating systems (Linux, FreeBSD, Windows) are rapidly converging towards VirtIO as the default I/O API for Virtual Machines. VirtIO is a generic producer-consumer API that allows a guest OS to exchange data with its hypervisor (or other host software). It provides a guest-side API and a hypervisor-side API that are used by the guest and the hypervisor, respectively, to access VirtIO data structures.

Figure 2.2: The Netmap framework described in Section 2.2.4. Netmap is used also in the guest in order to access a passed-through host Netmap port. Different colors denote different protection domains: host kernel (blue), guest kernel (yellow), guest user-space (green) and NIC hardware (purple). The red lines show the (bidirectional) paths that network packets can follow.

The central VirtIO data structure is the virtqueue, which is allocated in memory shared between the guest and the hypervisor. It is composed of two separate circular arrays, the avail ring and the used ring, plus a descriptor table (Figure 2.1). The descriptor table is an array containing buffer descriptors, where a descriptor contains a pointer, a length, and some flags. Each slot in the avail and used ring references the head of a chain of descriptors (i.e., a scatter-gather list). A guest driver inserts scatter-gather lists in the virtqueue avail ring, where the hypervisor can extract them (in FIFO order). Once the hypervisor has consumed a scatter-gather list, it pushes it to the used ring, where the guest can recover it (and possibly do some cleanup). A VirtIO device may be composed of one or more virtqueues. As an example, the VirtIO network device (virtio-net) has at least a virtqueue for packet transmission and another one for packet reception.
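For reference, the layout just described corresponds to the split-virtqueue structures of the VirtIO specification, sketched below in C. The field names follow the Linux vring headers; this is a simplified rendition, not a complete definition.

#include <stdint.h>

/* One entry of the descriptor table. */
struct vring_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;  /* e.g. NEXT (chained), WRITE (device-writable) */
    uint16_t next;   /* index of the next descriptor in the chain */
};

/* Avail ring: the guest publishes heads of descriptor chains here. */
struct vring_avail {
    uint16_t flags;   /* e.g. a "do not notify me" hint */
    uint16_t idx;     /* next free slot (free-running counter) */
    uint16_t ring[];  /* indices into the descriptor table */
};

/* Used ring: the hypervisor returns consumed chains here. */
struct vring_used_elem {
    uint32_t id;      /* head of the completed descriptor chain */
    uint32_t len;     /* bytes written by the device, if any */
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[];
};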

The virtqueue rings and descriptor table, together with the I/O data, can be accessed in shared memory. In addition to those, each virtqueue has a mechanism to let the guest send a notification to the hypervisor and the other way around. Emulated I/O device registers, whose access normally causes a trap into the hypervisor, are only used as guest-to-host notifications. Host-to-guest notifications use hypervisor-specific interrupt injection mechanisms (usually MSI-X interrupts, as they have less overhead than traditional PCI interrupts).

Since notifications are expensive, VirtIO has mechanisms to amortize their cost over multiple I/O operations. Both the guest device driver and the hypervisor suppress notifications while they are actively polling the virtqueue, and enable them only before going to sleep because no more work is pending. This strategy may be very effective at reducing notifications under heavy load, as described in Section 3.1.3.
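The suppression strategy can be summarized by the following sketch. It is a simplified rendition of the general pattern, not the actual VirtIO or QEMU code, and the struct and helper names are illustrative only; the re-check after re-enabling notifications is what prevents a missed wakeup.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct chan {
    pthread_mutex_t lock;
    pthread_cond_t  kick;       /* stands in for the interrupt / doorbell register */
    atomic_bool     need_kick;  /* the "notifications enabled" flag */
    atomic_int      pending;    /* items published but not yet consumed */
};

void consumer_loop(struct chan *c)
{
    for (;;) {
        atomic_store(&c->need_kick, false);        /* polling: suppress kicks */
        while (atomic_load(&c->pending) > 0)
            atomic_fetch_sub(&c->pending, 1);      /* "process" one item */

        atomic_store(&c->need_kick, true);         /* about to sleep: re-enable */
        if (atomic_load(&c->pending) > 0)
            continue;                              /* re-check closes the race */
        pthread_mutex_lock(&c->lock);
        if (atomic_load(&c->pending) == 0)
            pthread_cond_wait(&c->kick, &c->lock); /* sleep until the producer kicks */
        pthread_mutex_unlock(&c->lock);
    }
}

void producer_publish(struct chan *c)
{
    atomic_fetch_add(&c->pending, 1);
    if (atomic_load(&c->need_kick)) {              /* notify only if requested */
        pthread_mutex_lock(&c->lock);
        pthread_cond_signal(&c->kick);
        pthread_mutex_unlock(&c->lock);
    }
}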

2.2 An overview of current frameworks

This section provides a short overview of the selected data-plane solutions. The term host refers to the physical machine hosting the VMs that make up the NFV chain, and it includes the VM hypervisor.



Figure 2.3: The SR-IOV technology outlined in Section 2.2.1. In this example guests deploy traditional socket applications, or use DPDK/Netmap for faster processing. Different colors denote different protection domains: host kernel (blue), guest kernel (yellow), guest user-space (green) and NIC hardware (purple). The red lines show the (bidirectional) paths that network packets can follow.

2.2.1 SR-IOV

PCI passthrough is a widely used technique [15, 64] to pass a host PCI device (a NIC in this case) through to a VM. On its (emulated) PCI bus, the VM OS detects a PCI device of the same NIC model, and uses a driver appropriate for that model. The IOMMU [39] is used to provide the memory protection and address translation functionalities that are necessary to let the VM access a host device without compromising the whole host system. The main advantage of PCI passthrough is that the performance is normally the same as bare metal. However, the host PCI bus can become a bottleneck as it is shared by all the NICs, and the low number of NICs that can be physically attached to a machine clearly limits the per-host VM density. SR-IOV [37, 58] is a standard for hardware-based network I/O sharing, which tries to overcome such density limitations. SR-IOV extends the NIC capabilities and allows a device to expose to the OS multiple instances of itself, known as Virtual Functions (VFs). The OS sees each VF as a separate PCI NIC (with a separate MAC), and each one can be independently passed through to a different VM. A VF is a lightweight version of a fully featured NIC, equipped with its own private TX/RX descriptor rings (needed to support data transfer capabilities), while all the other parts of the hardware (configuration capabilities) are shared with the other VFs. According to the standard, an SR-IOV-capable NIC can create up to 256 VFs, although the real limit can be lower (e.g. 64), because of the need for private hardware resources and the negative performance impact of sharing internal data-path components. SR-IOV largely removes the VM density bottleneck, since a host can support as many VMs as the total number of VFs available in its NICs, as shown in figure 2.3. Inter-VM packet switching between two VFs belonging to the same physical NIC is handled inside the NIC hardware, by means of an internal Ethernet bridge. The switching logic is somewhat limited, as it is usually based on L2 addresses only. If two VFs belong to different physical NICs, external switching is necessary.


Figure 2.4: The Snabb, OVS-DPDK, VPP and BESS frameworks reported in Sections 2.2.2-2.2.7. Guests deploy traditional socket applications, or use DPDK/Netmap for faster processing. Different colors denote different protection domains: host kernel (blue), host user-space (red), guest kernel (yellow), guest user-space (green) and NIC hardware (purple). The red lines show the (bidirectional) paths that network packets can follow.

2.2.2 DPDK-accelerated Open vSwitch (OVS-DPDK)

Open vSwitch (OVS) [108] is a distributed multi-layer virtual switch with extensive support for programmability, as provided by OpenFlow [93]. Due to the wide range of supported features, it is commonly used as a virtual switch to connect together VMs and NICs. The switch data-path is implemented in software, either as a kernel-space module or as a user-space daemon. The most interesting capability of OVS with respect to NFV is the possibility to attach VMs to the switch through DPDK-capable ports, leveraging the vhost-user hypervisor technology [10, 11]. The vhost-user interface allows a user-space program to map the memory of a VM into its own address space, and efficiently exchange packets with the VM through a VirtIO [123] paravirtualized network device, possibly using zero-copy techniques. DPDK-capable OVS ports (including NIC ports) are served by user-space OVS threads, and traffic flowing between them is forwarded through the high performance DPDK framework [4], as shown in figure 2.4. DPDK transmits and receives packets using fast user-space networking techniques, i.e. OS bypass, batch packet processing, preallocated packet buffers, direct access to the NIC DMA capabilities, etc.

2.2.3 NetVM

NetVM [56] is a framework specifically designed for NFV, that builds on DPDK [4] to provide high-level abstractions for developing, deploying and managing chains of VNFs. NetVM relies on DPDK for high-speed NIC I/O and augments it with a shared memory mechanism that allows applications running in trusted VMs (or trusted containers in the more recent OpenNetVM [149]) to exchange packets among them and with the NICs without any data copy. The NetVM threads let NICs DMA data into the hugepages-backed shared memory area and then use lockless SPSC queues to move buffer grants (descriptors) across the chains of VMs, while the data itself is not moved. In addition to zero-copy, NetVM focuses on NUMA-awareness (avoiding accesses to a remote socket if possible) and busy waiting to completely avoid interrupts and other types of notifications, as DPDK already does. Applications must be written in terms of callbacks using a NetVM-specific library. The callbacks instruct NetVM about each packet's fate, e.g., drop, forward to another VM, or transmit to a NIC.
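To fix ideas, a minimal lockless SPSC ring of the kind used to pass buffer descriptors between pipeline stages can be sketched as below. It is essentially Lamport's classic queue written with C11 atomics; the cache-aware variants studied in Chapter 4 refine this basic scheme.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSZ 1024  /* must be a power of two */

struct spsc {
    _Atomic unsigned head;   /* next slot to write, owned by the producer */
    _Atomic unsigned tail;   /* next slot to read, owned by the consumer */
    void *slot[QSZ];
};

/* Producer side: returns false if the queue is full. */
static bool spsc_push(struct spsc *q, void *item)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (h - t == QSZ)
        return false;                       /* full */
    q->slot[h & (QSZ - 1)] = item;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL if the queue is empty. */
static void *spsc_pop(struct spsc *q)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    void *item;

    if (h == t)
        return NULL;                        /* empty */
    item = q->slot[t & (QSZ - 1)];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return item;
}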

2.2.4 Netmap

Netmap [112] is a framework for fast user-space I/O that provides a hardware-independent API for raw I/O on physical NICs and other types of software interfaces. Similarly to DPDK, Netmap achieves high performance by means of OS bypass techniques, such as: (i) batching, since it is possible to send/receive hundreds of packets with a single system call and/or under the same lock; (ii) preallocation of packet buffers, which saves the cost of dynamic allocation; (iii) memory mapping of packet buffers in the application address space, to avoid a packet copy across the user-space/kernel-space boundary. Several extensions have been introduced to support network I/O for VMs and containers: the VALE software switch [54, 117] can connect together NICs and VMs, and operate in batch; Netmap pipes implement fast point-to-point links between two processes; Netmap passthrough [44, 87] allows any host Netmap port to be directly used by a VM by means of a common paravirtualized driver, as described in Section 2.2.4.1; Netmap monitors enable traffic monitoring on any Netmap port. All these different types of port (NIC, VALE virtual port, pipe, passthrough port, etc.) can be accessed with the same Netmap API, so that applications can run unmodified everywhere. Some Netmap features specifically target NFV scenarios: pipes support VM-to-VM virtual links in NFV chains; VALE provides a way to attach many VMs to a physical network; Netmap passthrough is then used to make VALE ports and pipes available inside the VM without the overhead of a virtual NIC emulation layer like VirtIO. Netmap passthrough is explained in more detail in Sections 2.2.4.1-2.2.4.3.
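As an illustration of the batched Netmap API, the following sketch drains the receive rings of a port in a loop. The calls follow the public netmap user API (nm_open, NETMAP_RXRING, etc.), the port name is just an example, and error handling is omitted.

#include <poll.h>
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>

int main(void)
{
    /* Could also be a VALE port ("vale0:p") or a pipe ("netmap:pipe{1"). */
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
    struct pollfd pfd = { .fd = d->fd, .events = POLLIN };

    for (;;) {
        poll(&pfd, 1, -1);                        /* sleep until packets arrive */
        for (int ri = d->first_rx_ring; ri <= d->last_rx_ring; ri++) {
            struct netmap_ring *ring = NETMAP_RXRING(d->nifp, ri);

            while (!nm_ring_empty(ring)) {        /* drain the whole batch */
                struct netmap_slot *slot = &ring->slot[ring->cur];
                char *buf = NETMAP_BUF(ring, slot->buf_idx);

                /* ... process slot->len bytes at buf ... */
                (void)buf;
                ring->head = ring->cur = nm_ring_next(ring, ring->cur);
            }
        }
    }
}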

2.2.4.1 Netmap passthrough

In the guest, Netmap can operate over paravirtualized devices such as virtio-net (see Section 2.1), or over emulated devices like e1000 (Intel 1 Gbps NIC); in this way it can achieve good throughputs, i.e., up to 5-8 Mpps when VMs are attached to a virtual switch like VALE as depicted in Figure 1.1. However, those throughputs are still significantly lower than what is possible for native Netmap applications accessing VALE ports and pipes directly from the host OS (20-50 Mpps). This gap is due to the differences in packet format between the Netmap ports and the devices (either emulated or paravirtualized) made available to the VMs. These differences require format conversions and data copies, which introduce significant slowdowns at the speeds of interest.

Figure 2.5: The ptnet driver used by a VM to pass through a Netmap port of the host. Both guest Netmap applications and the guest network stack access the mapped Netmap rings, and synchronize with the host using the CSB data structure in shared memory.

To overcome this limitation, Netmap passthrough [44] (ptnetmap) has been introduced as a technique to completely avoid hypervisor device emulation in the packet datapath, unlocking the full potential of Netmap also within the VM. With ptnetmap, a Netmap port on the host – a VALE port, a hardware NIC, etc. – is exposed to the guest in a protected way, so that Netmap applications running in the guest can directly access the rings and packet buffers of the host port, avoiding all the extra overhead involved in the emulation of network devices. The guest sees a passed-through Netmap port as a paravirtualized device called ptnet, as illustrated in Figure 2.5. The passthrough is transparent to guest Netmap applications: they don't need modifications to run on ptnet ports. Netmap system calls issued on a ptnet port (e.g., txsync/rxsync) don't operate directly on the corresponding host port. Instead, these requests are forwarded to a pool of kernel threads running in the host – one per ring in the current architecture. To forward requests, the ptnet guest driver exchanges information with the host kernel threads, so that they can synchronize the guest ring state (as seen by the guest Netmap application) with the ring state of the host port. To all intents and purposes, the guest driver and the kernel threads form a producer-consumer system (one per ring). There are many aspects in common with virtio-net devices (see Section 2.1):

• A shared memory data structure called Communication Status Block is used to exchange ring state and notification suppression flags.

• I/O registers and MSI-X interrupts are used to let guest and host Netmap wake up each other on CSB updates.

• Notifications are suppressed when not needed, i.e. while the producer or the consumer is actively polling the CSB to check for more work. From a high-level perspective, the system tries to dynamically switch between polling operation under high load, and interrupt-based operation under lower loads, which is the same idea used by Linux NAPI [126].

Dismissing the kernel threads with HyperNF A recent work from Yasukata et al. [146] pointed out that running VM network I/O in a separate thread (as is the case for ptnetmap) may not be the optimal solution for NFV, especially with the Xen hypervisor [25]. VNF throughput may suffer because of the overhead of the VM notifications to the I/O thread. Running the I/O thread on a separate CPU from the VM vCPU thread2 increases parallelism, but it also introduces an implicit static resource allocation, which may result in CPU underutilization if one of the two threads is mostly idle. Moreover, Xen I/O threads run in a driver domain separated from the VMs, where the hypervisor has no way to differentiate traffic belonging to different VMs; as a result, it is not possible to assign different priorities or CPU shares to the I/O operations of different VMs. Following these considerations, HyperNF proposes to dismiss I/O threads and run the I/O processing functionalities (e.g., the virtual switch code) in the context of the vCPU thread, using a new Xen hypercall (or a VMEXIT on KVM). To trigger packet transmission or reception, a VM executes the HyperNF hypercall, which runs the virtual switch datapath directly within the hypervisor. This removes the overhead of notifying a separate I/O thread and ensures that the I/O cost is attributed to the correct VM, enabling accurate resource allocation. HyperNF has been evaluated on Xen using VALE as a virtual switch and ptnetmap in the VMs. Compared to the traditional approach based on I/O threads, HyperNF shows higher throughput (10-73%, depending on the VNF), accurate CPU resource allocation (with deviations of only 3.5%), and better adaptability to changing workloads. On QEMU/KVM throughput improvements are smaller (5-8%) because the cost of notifying the I/O thread is much cheaper3, and also because KVM is part of the host kernel, with complete visibility into the resources used by the VMs in any case.

2.2.4.2 The ptnet device model

A ptnet interface uses the netmap API as the underlying device model, replacing hardware models (e.g., e1000) or paravirtualized ones (e.g., virtio-net). If a passed-through host Netmap port has multiple TX/RX queues, the corresponding ptnet interface will have the same configuration.

The ptnet CSB is laid out as an array of structures, one per ring. Each structure contains the following information:

• head/cur pointers, reflecting the status of the ring as seen by the guest. These are written by the guest and read by the host.

• hwcur/hwtail pointers, reflecting the status of the ring as seen by the host. These are written by the host and read by the guest.

• Two flags to suppress guest-to-host and host-to-guest notifications, respectively.

A small number of device registers are used to read configuration (number of rings and descriptors per ring, device MAC address, acknowledged features), or to write configuration (CSB physical address, wanted features). A command register is available to start or stop the kernel threads. A full description of the register layout is not reported here for brevity. For each ring, a dedicated kick register is used for guest-to-host notifications. Using different registers is important for performance, since each register is associated with a different kernel thread; using a single register would cause unnecessary wake-ups. A similar approach is used for host-to-guest notifications: a different MSI-X interrupt vector is allocated for each ring, so that different guest applications listening to different rings do not suffer from spurious wake-ups.
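Putting the pieces together, a per-ring CSB entry can be pictured roughly as follows. The field names are illustrative (the exact ptnet layout lives in the Netmap sources), but the ownership rules match the description above.

#include <stdint.h>

/* One CSB entry per ring (illustrative layout, not the exact ptnet one). */
struct ptnet_csb_ring {
    /* Written by the guest, read by the host. */
    uint32_t head;            /* guest ring state (netmap head pointer)  */
    uint32_t cur;             /* guest ring state (netmap cur pointer)   */
    uint32_t guest_need_kick; /* host-to-guest notifications enabled?    */

    /* Written by the host, read by the guest. */
    uint32_t hwcur;           /* host ring state                          */
    uint32_t hwtail;          /* host ring state                          */
    uint32_t host_need_kick;  /* guest-to-host notifications enabled?     */
};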

3This is the case because the I/O threads in KVM run directly within the hypervisor, while Xen I/O threads run in a separate domain, which implies an additional context switch.


Figure 2.6: Throughput of the Linux in-kernel pktgen with variable batch size (ptnet over VALE vs. vhost over TAP). The vhost optimized implementation of VirtIO over TAP is not able to exploit batching, while the Netmap API allows ptnet to propagate the batch through the backend virtual switch, achieving significantly higher throughputs.

Guest applications can use a ptnet network interface through either the Netmap API or the traditional socket API, as shown in figure 2.5. When used by the guest network stack, a ptnet device driver behaves as any other guest Netmap application, accessing the Netmap rings and buffers. Although the main purpose of Netmap passthrough is to run Netmap applications within VMs at native speeds, a ptnet device can bring the benefits of Netmap batched I/O also to the unmodified network stack. An example is shown in Figure 2.6, where the Linux in-kernel pktgen is used to transmit 60-byte packets at maximum speed, in bursts of variable sizes, over a ptnet device backed by a VALE port, or over a virtio-net device backed by a TAP device [13] (the latter is the default configuration in most Linux/KVM deployments).

Since the VirtIO API and TAP have limited support for batching, using batch transmission with virtio-net is not really effective at improving throughput. On the contrary, ptnet performs better when driven with longer bursts, since the batches are preserved across the passed-through Netmap port. The throughput nicely increases with the batch size, reaching its optimum at about 50. After that, throughput decreases. As explained in [116] (Section 3.1.3), the decrease happens because for larger batches the producer-consumer system enters a Fast Consumer (FC) regime. The consumer (ptnetmap kernel thread) becomes more efficient w.r.t. the producer (pktgen guest thread), and thus the producer slows down because it spends more time notifying the consumer.

2.2.4.3 Paravirtualized offloads for ptnet devices

Hardware NICs commonly support TCP/IP offloads, the most useful ones being TCP Segmentation Offload (TSO) [35] and TCP checksum offload. When used together, the network stack can pass to the NIC driver a TCP segment much bigger than the interface MTU, usually up to 64 KB, with the TCP checksum not yet computed. The driver then programs the NIC to perform TCP segmentation and checksum computation in hardware. The performance improvements are twofold: (i) segmentation and checksum processing are done faster in hardware, and the CPU can be used for other purposes; (ii) the network stack is traversed fewer times, since packets are bigger, and so fixed overheads are amortized over more bytes.

When it comes to I/O paravirtualization, offloads can be exploited rather than emulated. When the device emulator receives a large unchecksummed TCP segment, no segmentation or checksum computation is performed. Instead, the packet is directly pushed into the host network stack (or into a VALE switch), where it pops up again as a big unchecksummed TCP segment. If the packet is directed to a paravirtualized network device of another guest on the same host, it is similarly received by the guest driver and injected into the guest stack as is, without further processing. The checksum is unnecessary because the journey of the packet from one guest to another only goes through memory copies. From a high-level perspective, this strategy allows TCP segmentation and checksumming to be performed lazily, and only if necessary, i.e., when the packet needs to travel on a physical link.

VirtIO network devices use this technique, which is key to achieving very high throughput for TCP (or UDP) bulk transfers. Skipping checksum computation also improves latency. In order to implement paravirtualized offloads, however, guest and hypervisor need to exchange some metadata for each packet, including: (i) checksum offset and length, to possibly perform checksumming; (ii) the Maximum Segment Size (MSS), to possibly perform segmentation. The VirtIO standard [122] defines a header carrying this information, to be prepended to each Ethernet frame. The ptnet device supports paravirtualized offloads using this standard header, which is already supported by the VALE switch and QEMU. This opens the door to high-performance TCP/IP networking and ensures full interoperability between ptnet and virtio-net devices.
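For reference, the per-packet metadata header defined by the VirtIO specification [122] has roughly the layout below; this is a simplified C rendering, and the normative definition (including the additional num_buffers field used with mergeable receive buffers) is in the specification itself.

    #include <stdint.h>

    /* Simplified view of the VirtIO network header prepended to each frame. */
    struct virtio_net_hdr {
        uint8_t  flags;        /* e.g. "checksum still needs to be computed"    */
        uint8_t  gso_type;     /* none, TCPv4, TCPv6, UDP: segmentation needed? */
        uint16_t hdr_len;      /* length of Ethernet + IP + transport headers   */
        uint16_t gso_size;     /* MSS to be used if segmentation is performed   */
        uint16_t csum_start;   /* offset where checksumming starts              */
        uint16_t csum_offset;  /* where (from csum_start) to store the checksum */
    };

Since ptnet and virtio-net interpret this header in the same way, a packet produced by one device type can be consumed by the other without any translation.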

2.2.5 Snabb

Snabb [12, 105] is a flexible networking toolkit that allows the programmer to build a custom software packet processing network by connecting together reusable functional blocks, called Apps. Apps can be very simple (mux/demux, repeaters, splitters, etc.) or more complex (learning bridge, IPSec, etc.). Similarly to Click [71], a packet processing system is modeled as a directed graph of Apps, called an AppEngine, which runs in the context of the single-threaded Snabb engine process. Multiple independent engines can be used if needed (see Figure 2.4). Snabb supports NFV in the following ways: (i) the VhostUserApp allows fast data exchange with VMs, leveraging the same vhost-user technology used by OVS-DPDK; and (ii) some Apps (e.g. Intel10G) are available to access NIC hardware, implementing user-space drivers with OS-bypass techniques similar to DPDK and Netmap.

2.2.6 VPP

Vector Packet Processing (VPP) is a feature-rich, modular, extensible packet processing framework that offers production-quality switch and router functionalities [17, 83]. Similarly to Snabb (Sec. 2.2.5) and Click [71], a VPP application defines a processing graph where each node performs some basic task on the ingress packets, such as parsing a protocol header (e.g., Ethernet, IPv4, IPv6, MPLS), looking up IP routing tables, filtering packets depending on the TCP/IP header fields, etc.


VPP runs as a user-space process and can use DPDK, Netmap or raw sockets to perform packet I/O, although it is optimized to run on DPDK. VPP drastically reduces the per-packet cost by processing packets in batches (vectors) along the whole processing graph, from the NIC (or VM) receive queues up to the final destination (e.g., transmit NIC queue, drop, or local delivery). Each node processes the whole vector of packets before passing the ownership of the vector to the next designated node(s). The main advantage of this approach is that it makes very efficient use of the CPU instruction cache, which is usually not big enough to contain the code of the whole forwarding datapath; as a result, for each node, only the first packet in the vector pays the instruction cache misses necessary to load the processing code, while all the other packets do not. This is in sharp contrast with the traditional scalar approach (e.g., used by the Linux and FreeBSD network stacks), where each packet traverses the whole forwarding code before the next one is processed, and thus instruction cache misses are not amortized. Data prefetching is also used aggressively on the next packets to be processed in the vector, in order to hide the corresponding data cache miss latency. Similarly to Snabb, VPP supports NFV as it can communicate with VMs through vhost-user ports. A large number of built-in processing modules is available to the user, covering virtually any network protocol or encapsulation feature that may be needed for NFV, L2 switching and L3 routing.
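The contrast between the two approaches can be sketched in C as follows; this is illustrative code rather than actual VPP code, and a linear pipeline of nodes is assumed for simplicity (the real processing graph is more general).

    #include <stddef.h>

    struct pkt { unsigned char *data; unsigned len; };

    /* One processing function per graph node (parse, lookup, filter, ...). */
    typedef void (*node_fn)(struct pkt *);

    /* Scalar approach: each packet traverses all the nodes before the next
     * packet is handled, so every node's code may be evicted from the
     * instruction cache between two consecutive packets. */
    void run_scalar(node_fn *nodes, size_t n_nodes, struct pkt **pkts, size_t n_pkts)
    {
        for (size_t p = 0; p < n_pkts; p++)
            for (size_t n = 0; n < n_nodes; n++)
                nodes[n](pkts[p]);
    }

    /* Vector approach: each node processes the whole batch before handing it
     * over, so its code is loaded into the instruction cache once per batch;
     * the next packet is prefetched to hide data cache miss latency. */
    void run_vector(node_fn *nodes, size_t n_nodes, struct pkt **pkts, size_t n_pkts)
    {
        for (size_t n = 0; n < n_nodes; n++) {
            for (size_t p = 0; p < n_pkts; p++) {
                if (p + 1 < n_pkts)
                    __builtin_prefetch(pkts[p + 1]->data);
                nodes[n](pkts[p]);
            }
        }
    }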

2.2.7 BESS

BESS [2, 49] is an extensible virtual switch explicitly designed for NFV. Similarly to Click [71], Snabb (Sec. 2.2.5) and VPP (Sec. 2.2.6), BESS models the packet processing functionality as a graph that interconnects VNFs and NICs. Note that this is in contrast with OVS-DPDK (Sec. 2.2.2), which follows a match-action approach based on OpenFlow [93]. BESS runs as a user-space process, relying on DPDK for packet I/O. It can exchange packets with VMs using vhost-user ports. BESS includes an internal task scheduler to share the CPU time among the graph nodes, implement strict priorities, or rate-limit certain nodes.

2.2.8 Other related work

ClickOS [90] uses VALE as a fast Xen hypervisor switch, and a specialized passthrough technique to let the VM map VALE ports in its address space, which is analogous to the current Netmap passthrough; for the purposes of this overview, ClickOS is just an application of Netmap. NetBricks [104] is an NFV architecture that enforces isolation by requiring all functions to be written in a memory safe language (Rust).

2.3 Comparing architectures and features

The solutions described in Section 2.2 are now compared against various aspects deemed important to meet the NFV I/O requirements. NFV aims at replacing hardware-based network appliances with software-based Virtual Functions to achieve several benefits: reducing cost, removing vendor lock-in, and increasing flexibility, while still achieving good performance [46]. Criteria (a), (b) and (c) below evaluate some barriers to vendor lock-in removal; criteria (d), (e) and (f) explore aspects related to performance and cost, while criteria (g), (h) and (i) focus on flexibility.

(a) Network backends and portability of VM images. It is desirable that VMs can run unmodified everywhere, independently of the host hardware, hence minimizing the amount of software required in the VM image for it to be portable. OVS-DPDK, Netmap, Snabb, VPP and BESS are all very portable, as they require only a standard driver to access the virtual interface: ptnet [87] for Netmap, virtio-net [122, 123] for the others. The virtual interface network backend for OVS-DPDK, Snabb, VPP and BESS is a vhost-user port of a virtual switch, where NICs and other software ports can be attached. The backend of a Netmap ptnet interface can be any host Netmap port, i.e. NIC, pipe, VALE port, monitor, etc. NetVM applications are also fully portable to any NetVM deployment, as they are written in terms of callbacks and do not see the virtual interfaces; under the hood, NetVM uses a custom PCI device to access the backend rings. In contrast, SR-IOV is backed by Virtual Functions of physical NICs, and requires the VM to contain a driver for any NIC model that may be passed through as the image is deployed across the ever-changing virtualization infrastructure.

(b) Dependency on specific NIC models. One of the main concerns of NFV is the possibility to deploy applications anywhere, independently of the specific network hardware of the machine that hosts them. Traditional virtualization technologies can provide this decoupling, because the VM virtual interface is emulated in software by the hypervisor. However, the technologies in Sec. 2.2 may be more constrained, and only support a limited range of NICs. Being a PCI passthrough technique, SR-IOV reuses the standard kernel-space drivers shipped with the OS, which are available for virtually any NIC model on the market. On the contrary, the other solutions are required to provide explicit driver support for each NIC to be used. Performance of traditional kernel drivers, even for high-end NICs, is limited by the legacy OS interfaces, which hinder important optimizations like the use of pre-allocated packet buffers and batched I/O (e.g. [42, 112]).4 As a consequence, NIC drivers need to be rewritten from scratch, or at least modified, to achieve optimal performance. DPDK-based frameworks (OVS-DPDK, NetVM, VPP and BESS) and Snabb rely on user-space drivers rewritten from scratch. DPDK supports 1-40 Gbit NICs from many hardware vendors (Amazon, Broadcom, Cavium, Chelsio, Cesnet, Cisco, Emulex, Intel, Mellanox, Netronome, QLogic, etc.) and software devices (virtio-net, Xen, vmxnet3, etc.), while Snabb only supports Intel 10Gbit NICs and virtio-net. Netmap only supports Intel 1-40Gbit NICs, Chelsio 10Gbit NICs, and the virtio-net, veth [14] and ptnet [87] software devices.

(c) Effort required to support more NICs. While SR-IOV reuses standard drivers, the other network I/O frameworks need specialized drivers. It is therefore important to evaluate the development effort needed to add support for future (or not yet supported) NIC models. DPDK-based solutions and Snabb need a whole driver to be written from scratch. Being written in Lua [57], Snabb drivers are quite compact (1-2 Klocs), while DPDK drivers typically require 5-40 Klocs. In contrast, Netmap only needs a relatively small patch (∼ 600 locs) to be applied to the standard kernel driver; the patch mainly implements the OS-bypass I/O routines. It is worth noting that Netmap, Snabb and DPDK are all able to work with the original (unmodified) kernel drivers, although at reduced performance. This is very useful in practice (e.g., for application prototyping), although not interesting when targeting maximum performance.

(d) Provisioning of VM-to-VM virtual links. NFV setups are often described in terms of chained VNFs, logically connected by Point-to-point (PtP) links. This contrasts with the use of virtual switches, where many VMs or containers are attached together with the NIC(s). Although a single switch may be able to implement one or more PtP links, in practice a true PtP mechanism can reach better performance than a virtual switch, because the processing task is simpler and there is no central bottleneck. Snabb is flexible enough that its Apps (e.g. two VhostUser ones) can be connected in a PtP fashion. The same is true also for VPP and BESS, which also model applications as a composition of reusable modules. NetVM explicitly creates chains using dedicated threads to move packets between adjacent stages of the pipeline. Netmap provides pipes and netmap-accelerated veth devices that implement fast PtP links between two VNFs. OVS-DPDK does not provide PtP links: packets must flow through the OVS instance, which can be configured with static OpenFlow [93] rules to forward packets between pairs of ports. SR-IOV cannot provide PtP, because inter-VM switching is done by the NIC hardware.

(e) Synchronization and CPU utilization. Many frameworks for high-speed network I/O rely on busy-wait polling to maximize throughput and minimize latency. This is the case for DPDK-based solutions and Snabb. The motivations behind this approach are (i) the assumption that the system is always under high load; and (ii) the quest for the best possible performance irrespective of CPU utilization and protection of NIC hardware. This is achieved by completely avoiding NIC interrupts and system calls and dedicating CPU cores to NIC queues (physical or VirtIO). However, if the system is not always under high load, or there is significant imbalance between the stages of the processing pipeline, most of the CPU time is wasted on busy waiting. This problem becomes even worse if busy waiting is also used inside the VM (e.g. by a DPDK application on a virtio-net interface), in addition to being used by the host. In contrast, Netmap uses NIC interrupts and standard kernel synchronization mechanisms (e.g. poll()) to block on empty or full NIC queues. This allows the system to be efficient under low load, at the cost of reduced performance under high load. The performance degradation may be small because (i) the cost of system calls and interrupts is usually amortized over very large batches of packets (e.g. 512); and (ii) the per-packet cost of application-specific processing is often at least an order of magnitude higher than the per-packet I/O cost, hence differences in the per-packet I/O cost have limited impact.
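The difference between the two styles can be illustrated with a minimal sketch of a netmap RX consumer, written against the nm_open() helper API; the busy-wait variant keeps one core always running, while the blocking variant sleeps in poll() until the producer posts a notification. The snippet is illustrative and error handling is omitted.

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <sys/ioctl.h>

    /* Busy-wait consumer: minimum latency, but one core is always at 100%. */
    void rx_busy_wait(struct nm_desc *d)
    {
        struct nm_pkthdr h;
        for (;;) {
            ioctl(NETMAP_FD(d), NIOCRXSYNC, NULL);  /* sync RX rings, never blocks */
            while (nm_nextpkt(d, &h) != NULL)
                ;                                   /* consume available packets */
        }
    }

    /* Blocking consumer: sleeps when the rings are empty, so the CPU is free
     * under low load, at the cost of interrupt/notification overhead. */
    void rx_blocking(struct nm_desc *d)
    {
        struct nm_pkthdr h;
        struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
        for (;;) {
            poll(&pfd, 1, -1);                      /* block until packets arrive */
            while (nm_nextpkt(d, &h) != NULL)
                ;                                   /* consume the batch */
        }
    }

Under low load the blocking loop lets the core sleep, while under high load poll() almost never blocks and its cost is amortized over the whole batch returned by nm_nextpkt().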

(f) Zero-copy capabilities. If the VMs of an NFV chain mutually trust each other, they can share a portion of their address space to avoid data copy and save many CPU cycles. Without mutual trust, however, the copy becomes necessary to ensure address space isolation. OVS-DPDK, Snabb, VPP and BESS use the VirtIO interface to isolate VMs from each other. Since packet buffers are allocated by the software running in the VMs, a packet copy is needed between VM memory and host memory. Conversely, NetVM and Netmap pipes support zero-copy using a shared memory area where packets are stored, and only small (e.g., 16 bytes) packet descriptors are copied across the chain. While NetVM only provides zero-copy, Netmap also supports multiple (isolated) shared
