FACOLTÀ DI INGEGNERIA

Computer Engineering

Design and implementation of the Netmap support for Open vSwitch

Supervisor:

Ing. Giuseppe Lettieri

Candidate:

Alessandro Rosetti

Abstract

Open vSwitch is a production quality multilayer software switch that has many applications in the computing world; its network performance is a critical aspect. In this thesis I have extended Open vSwitch by adding a new kind of user-space port that uses the netmap API.

Netmap is a framework for high speed packet I/O that can be used by user-space applications to achieve higher network performance compared to the standard operating system network stack.

I will show the implementation details and how it was designed to obtain high performance. The system has been tested in different configurations and compared to existing solutions and reference implementations. Ovs has a comparable implementation that uses the DPDK framework, another API that, similarly to Netmap, bypasses the traditional network stack.

By using Netmap the virtual switch is able to match the performance of the DPDK implementation while having better safety of operation, better compatibility with other operating systems and greater flexibility, by enabling high performance on a type of virtual network interface controller used in container environments.


Dedication

To my family, to Elisa


Contents

Abstract
Dedication
List of Figures
List of Tables

1 Open vSwitch Introduction
1.1 What is Open vSwitch?
1.2 Terminology
1.3 Open vSwitch features
1.4 Open vSwitch set-up overview
1.4.1 Userspace and kernelspace datapath
1.5 Thesis objective
1.6 Thesis organization

2 Network frameworks
2.1 Traditional OS APIs
2.1.1 Traditional network stack bypass
2.2 The Netmap framework
2.2.1 Data structures
2.2.2 Opening a device in netmap mode
2.2.3 Zero copy support
2.2.4 The extra buffers
2.2.5 How to use netmap
2.3 The DPDK framework
2.3.1 Ovs-dpdk implementation
2.3.2 Dpdk memory allocations
2.4 DPDK and Netmap comparison

3 ovs-vswitchd internal architecture
3.1 Internal architecture of ovs-vswitchd
3.1.1 Packet representation: the dp_packet
3.1.2 Packet batching: the dp_packet_batch
3.1.3 Batch processing: OpenFlow forwarding
3.1.4 Multi-threading: the PMD
3.1.5 Temporary packet buffering
3.1.6 Batch processing in the PMD

4 Implementation
4.1 Basic netdev implementation
4.1.1 Writing a Netdev provider
4.1.2 Netdev callbacks
4.2 How to handle dp_packets
4.2.1 The netmap payload type
4.3 The send callback
4.3.1 Send callback mutual exclusion
4.4 The receive callback
4.4.1 Packet cloning
4.5 Requirements for an allocator of dp_packets
4.5.1 Allocator implementation details
4.5.2 Allocator interface overview
4.6 Constructor and destructor callbacks
4.6.1 Multiple netmap descriptors issues
4.6.2 Local allocator initialization
4.6.3 Handling extra buffers on port creation
4.6.4 Handling extra buffers on port removal
4.7 Overview of the modifications
4.7.1 Other modifications

5 Performance analysis
5.1 Methodology
5.1.1 Performance tuning
5.1.2 Hardware and software configuration
5.1.3 Basic test configuration
5.2 Scenario 1: virtual nic to virtual nic
5.2.1 RTT and Packet rate throughput
5.2.2 Variable transmitter packet rate
5.2.3 Variable packet size
5.3 Scenario 2: physical nic to physical nic port
5.3.1 RTT and Throughput ECDF
5.3.2 Variable transmitter packet rate
5.3.3 Variable packet size
5.4 Results table

6 Conclusions
6.1 Possible future work

Bibliography

A How to install and test ovs-netmap

B Development and verification methodology
B.1 Development tools
B.2 Scripts
B.3 Verification

C Source code
C.0.1 netmap.h
C.0.2 netmap.c
C.0.3 netdev-netmap.h
C.0.4 netdev-netmap.c
C.0.5 Other modifications: ovs-netmap.patch

List of Figures

1.1 Open vSwitch logo
1.2 Virtual switching
1.3 Open vSwitch architecture
1.4 Adding a netmap forwarding mechanism to ovs
2.1 Network architecture
2.2 NICs traditional driver RX/TX path
2.3 NICs with netmap RX/TX path
2.4 NICs with netmap RX/TX path
2.5 DPDK logo
3.1 ovs-vswitchd internal structure
3.2 Three stage flow cache
3.3 Example ovs with two pmd threads
3.4 Callback ordering
4.1 Netdev providers implementations
4.2 Send callback flowchart
4.3 Two ports sending traffic to one port
4.4 Receive callback flowchart
4.5 One port duplicating traffic to two ports
4.6 Block based allocator
4.7 Batch allocation and deallocation
4.8 Global and local allocator
5.1 Basic test configuration
5.2 veth to veth port
5.3 RTT in v2v scenario
5.4 Packet rate in v2v scenario
5.5 Variable tx rate in v2v scenario
5.6 Variable packet size in v2v scenario
5.7 nic to nic
5.8 RTT in n2n scenario
5.9 Packet rate in n2n scenario
5.10 Variable packet size in n2n scenario
5.11 Lost packet rate
5.12 Variable packet size in n2n scenario


List of Tables

4.1 Source code modifications
5.1 v2v measurements
5.2 v2v ovs-netmap performance
5.3 n2n measurements
5.4 n2n ovs-netmap performance


Chapter 1

Open vSwitch Introduction

1.1 What is Open vSwitch?

Figure 1.1: Open vSwitch logo.

In computing, virtualization is defined as the creation of a software version of something that is usually a physical object, like a hardware component. The rise of server virtualization has brought a shift in datacenter networking, and with it the need for new and complex tools. Open vSwitch is a software implementation of a distributed virtual multilayer switch used to provide a virtual network switching stack. It supports standard management interfaces and protocols and enables the forwarding functions to be automated through programmatic extension and control.

Open vSwitch is well suited to work as a production quality virtual switch in multi-server virtualization and container environments. It focuses on automated and dynamic network control in large scale virtualization deployments. It has also been designed to keep the in-kernel code as small as possible and to reuse existing subsystems.

Ovs is usually used to bridge multiple VMs/containers within one host. It manages both physical ports (for example eth0, eth1) and virtual ports (the tap devices of VMs). Ovs has become the de facto standard virtual switch for all major hypervisor platforms and it is an important component in many large scale deployments.

In network virtualization the virtual switch becomes the primary provider of network connectivity. This approach allows the virtual networks to be decoupled from the underlying physical networks, and by taking advantage of the flexibility of general purpose processors we can provide logical network abstractions, services and tools identical to physical networks. The goal is also to bring the performance closer to that of traditional network devices, which, using dedicated hardware resources, are able to achieve line rate even in the worst case [1].

1.2 Terminology

Before proceeding to describe the ovs architecture, I will clarify some terminology that is useful to understand the topics presented.

Virtual switching

Figure 1.2: Virtual switching

Traditionally, servers were physically connected to each other using hardware based switches. Today virtualization technologies create virtual servers that are connected to the rest of the network through virtual switches.

The virtual switch is a software application running on a server that behaves as a physical switch. VM environments, and also container environments, use various kinds of virtual ethernet ports, and these ports are connected to a virtual switch. The virtual switching software allows communication between the ports it handles and intelligently directs each packet to its destination port by looking at its contents. In other words it imitates a physical switch by allowing a VM to communicate with another VM. In practice we can also make our switch able to handle both virtual and physical ports, in order to connect to other parts of the network.

Multilayer switching

A multilayer switch is a networking device that, unlike a traditional switch, can operate at layers higher than the data link layer of the OSI model. It is able to inspect frames deeper into the protocol stack in order to forward traffic. It may also implement QoS (Quality of Service) and VLAN functionality and use UDP or TCP information.

1.3 Open vSwitch features

As I have shown in the introduction, hypervisors need the ability to bridge network traffic between VMs and with the outside world. For example, on Linux operating systems it is possible to use the built-in L2 switch called bridge; it is fast and reliable, so it is reasonable to ask why we need Open vSwitch (see [2]).

The answer is that Open vSwitch is targeted at multi-server virtualization and focuses on the need for automated and dynamic network control in large scale environments. These environments are often highly dynamic and complex to manage, and in such cases the Linux bridge is not well suited.

Open vSwitch supports a great number of features, making it a very versatile component:

• Forwarding layer abstraction: It makes porting to new software and hardware platforms easier.

• Multiple forwarding engines: It has an in-kernel and a user-space datapath engine.

• Multi-table forwarding: Implements multiple forwarding tables with a flow-caching engine.


• OpenFlow protocol support: Since its inception, ovs configuration has been based on this protocol, which permits controlling how the traffic is forwarded in the switch by decoupling the control plane and the forwarding plane. Ovs implements version 1.1 and beyond of the OpenFlow protocol (see [3]). OpenFlow rules can be based on MAC address, IP address, input port or VLAN ID. Forwarding rules are stored in flow tables that are consulted to forward a packet (see the example after this list). Having this kind of functionality adds a lot of complexity to the processing of packets.

• IPv6 support, VLAN, QoS and much more.
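For example, a minimal OpenFlow rule (with illustrative port numbers and addresses) that forwards IPv4 traffic destined to 10.0.0.2 from port 1 to port 2 can be installed with the standard ovs-ofctl tool:

$ ovs-ofctl add-flow br0 in_port=1,dl_type=0x0800,nw_dst=10.0.0.2,actions=output:2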

1.4 Open vSwitch set-up overview

Most network devices implement different planes of operation: the management plane, used to configure and administer the device; the control plane, which creates the information that orchestrates the packets according to specific routing protocols; and finally the forwarding plane, the engine that uses the information generated by the control plane to send incoming packets. In a typical ovs set-up these planes are implemented by separate components. This is a good approach to handle the increasing complexity of network virtualization and allows greater flexibility.

Figure 1.3 shows how a typical Open vSwitch deployment reflects this division into planes of operation:

1. controller: The controller implements the control plane: it decides how traffic is routed and communicates with the switch using the OpenFlow protocol, configuring its flow tables in the form of flow rules. It is also used to implement programmatic control like SDN (Software Defined Networking). For example, Floodlight is a controller that can be used with Ovs.

2. ovsdb-server: This application holds the switch configuration and other options. It speaks to the ovs-vswitchd and controller processes using the OVSDB management protocol, which uses a JSON/RPC mechanism.

3. ovs-vswitchd: It's the core user-space daemon of the virtual switch; it can rely on the ovs kernel module, using the netlink protocol, or use the user-space datapath. It communicates with the upper layers using OpenFlow, and it uses the information from the management and control planes to forward packets between ports.

4. openvswitch.ko: It's the ovs kernel module that enables the kernel-space datapath. It has been included in the mainline Linux sources since version 3.3.

Figure 1.3: Open vSwitch architecture.

1.4.1 Userspace and kernelspace datapath

Figure 1.3 shows two connections to the underlying kernel infrastructure; this is because ovs implements different datapaths, in other words different forwarding engines. In the current version the two main types of datapaths are:

• kernel datapath: This datapath does most of the forwarding in-kernel (fast path) without the need of user-space. It is implemented as a kernel module (openvswitch.ko) and kept as small as possible. If a packet has not yet been classified it is brought to user-space (slow path), where it is inspected by consulting the OpenFlow rules. Apart from the in-kernel forwarding, the other parts are implemented in user-space. It is the default datapath when creating a bridge.


• user-space datapath: It was added in ovs 2.4 and differs from the in-kernel datapath because forwarding and processing are entirely done in user-space. In Ovs it is selected by creating a bridge with the datapath_type set to netdev. For example:

$ ovs-vsctl add-br br0 -- set Bridge br0 datapath_type=netdev

From user-space Ovs can access different kinds of devices, for example Linux or DPDK network interfaces.

The user-space datapath has been created to overcome the performance limits of the in-kernel datapath, which still uses the classical operating system stack. Forwarding from user-space using the standard API while implementing the ovs features results in even worse performance; for this reason the user-space datapath is mainly used with alternative network APIs.

1.5 Thesis objective

The performance problems of the traditional stack have already been solved by using a framework called DPDK. Adding netmap support to Open vSwitch is still worthwhile because there are some improvements that can be tackled regarding safety, flexibility and compatibility.

Netmap offers better safety of operation and better flexibility, by enabling a new type of port (veth pairs), used in container environments, that dpdk does not support. Netmap is also compatible with more operating systems, with support for Windows and native support for BSD. For a complete comparison see section 2.4. The basic idea is to improve Open vSwitch by adding support for the Netmap framework. Porting a new type of user-space port is facilitated by the abstractions that Ovs provides. My objective is to match Dpdk speed while solving these issues.

The implementation has been carried out so as to minimize the modifications to the shared code, in order to simplify the integration into future releases of Ovs.

Figure 1.4 summarizes in which part of the Ovs architecture I have integrated the netmap API as a forwarding mechanism.


Figure 1.4: Adding a netmap forwarding mechanism to ovs.

1.6 Thesis organization

This is how my thesis is organized:

• First chapter: In this chapter I explain the objective of my work and describe the features and applications of Open vSwitch.

• Second chapter: This chapter is an overview of different network frameworks: the traditional networking stack, DPDK and Netmap.

• Third chapter: This chapter describes the internal structure and features of ovs-vswitchd; it is useful to better understand how the implementation has been carried out.

• Fourth chapter: This chapter is about the actual implementation and all the issues that have been resolved in order to obtain good performance.

• Fifth chapter: This chapter describes how the system has been tested against the dpdk implementation. I analyse the results of some use-cases under varying transmit rates and packet sizes.

• The last chapter is for the conclusions and possible future work.


Chapter 2

Network frameworks

Network performance has become increasingly important over the past years. Servers and other network devices commonly use 1 to 100 Gbps physical network interfaces, and with this kind of hardware we started to hit some bottlenecks.

For example, reaching the line rate of a 10 Gbps link requires transmitting 14.88 Mpps (million packets per second) of 64-byte packets, that is one packet every 67.2 ns. The traditional operating system network stack may take 10-20 times longer than that to transfer a packet to the application [4].
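As a sanity check on these numbers: on the wire a minimum-size frame occupies the 64 bytes of the frame itself plus 8 bytes of preamble and 12 bytes of inter-frame gap, that is 84 bytes or 672 bits, so a 10 Gbps link carries at most 10·10^9 / 672 ≈ 14.88·10^6 frames per second, i.e. one frame every 67.2 ns.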

This limitation exists because the network stack was designed for much lower packet rates and fewer resources. The situation can be improved by defining new APIs (Application Programming Interfaces) that eliminate unnecessary overheads.

As we can see the conventional architecture is roughly structured like this:

Figure 2.1: Network architecture

Operating systems are usually divided in two parts: user-space, where the applications run, and kernel-space, which implements the core mechanisms that deal with the hardware in order to serve applications. The kernel structures aren't directly accessible, but there is an interface that can be used to safely perform tasks. In figure 2.1 the application accesses the network through the standard socket library interface and then traverses several layers before reaching the NIC (network interface controller). Note that this operation is repeated for each packet that has to be transmitted.

As an example, a single sendto function call can cost around 1µs. This number can be divided into a chain of many components: the system call cost includes data copies, header setups, checks, driver operations and interrupts. If we send several minimally sized packets with the standard network interface we just can't achieve the throughput that the link is actually capable of. This hurts performance especially on hardware that can achieve throughputs of over 1 Gbps, because it heavily limits the achievable fraction of the theoretical line rate.

This is one of the reasons why in the last years many projects have been created that aim to improve the situation. This is done by bypassing the standard kernel networking stack and designing a new way of interacting with the network interface.

2.1 Traditional OS APIs

The designers of the network architecture adopted hardware and software solutions that accommodated convenience of use, performance and memory usage, which at the time was a scarce resource. At a low level, network packets are represented by descriptors (depending on the OS they are called skbuffs, mbufs or otherwise) that contain some metadata (payload size, flags, etc.) and are associated to the actual payload. Both descriptors and payload data are usually dynamically allocated from a common pool. The operating system and the NIC share a memory region in order to transfer messages to each other. Transmissions and receptions are arranged into circular arrays called NIC rings, and transfers of packets are typically done by DMA operations.

When the controller is opened the device driver allocates the receive (RX) rings with empty buffers and initializes both the RX and TX rings. It also initializes the head/tail indexes in the rings (for example rdh and rdt in the RX path) that identify the status of the descriptors in the rings.

Figure 2.2: NICs traditional driver RX/TX path

For example:

• In the receive path, when a message is received the NIC copies it via DMA into the first available skbuff, updates the head pointer and possibly sends an interrupt to notify the driver. The driver eventually notices the new message and moves the skbuff up the networking stack, replacing it with a new empty skbuff. Then it updates the ring to make the new skbuff available to the NIC. The message is then copied to user-space and the skbuff is discarded. Meanwhile the user still has to copy each message to the application using several system calls.

• In the TX path the ring is initially empty. When user-space sends a message the kernel allocates an skbuff and pushes it down the stack until it is linked in the TX ring, then updates the ring tail. The NIC reads the message via DMA and sends it over the link. When the transmission is completed the NIC updates the ring head and possibly sends an interrupt. The driver eventually notices and deletes the skbuff. Messages can be sent by the driver while other messages are enqueued.

We can redesign the interaction with the NIC reusing the existing hardware and the concept of ring.

2.1.1 Traditional network stack bypass

Given the situation there was a lot of interest in this field of study that resulted in many new network I/O APIs designed to solve performance issues with the traditional stack. I will now illustrate two existing solutions.

2.2 The Netmap framework

Netmap is a framework for fast packet I/O that aims at reducing the cost of moving traffic between the hardware and the application. It was created by Professor Luigi Rizzo at the University of Pisa (http://info.iet.unipi.it/~luigi/netmap/).

Efficiency and speed are achieved without requiring hardware modifications and with little software modification to the original drivers and applications. Safety of operation is still ensured because device registers and other kernel structures are still protected and validated by the operating system.

Netmap was designed in order to identify and solve the main packet processing costs:

1. system call overheads: single system calls are amortized over large batches. The effect is better perceived when transmitting several small packets, because of the per-packet costs.

2. preallocation of resources: per-packet dynamic memory allocations are removed, and preallocated resources are reused when no longer needed.

3. memory copies: amortized by sharing buffers and metadata between kernel and user-space.


Netmap is part of FreeBSD, and it has been ported to Linux and Windows as an external kernel module.

It consists of a kernel module and an API that applications can use to bypass the native operating system network stack, thus achieving higher performance.

Netmap supports a wide range of different ports:

• physical NIC ports: used to access physical network interfaces
• host ports: used to inject packets into the host stack
• VALE ports: a very fast in-kernel software switch
• netmap pipes: a shared memory packet transport channel
• netmap monitors: a mechanism to capture traffic

All these ports are accessed with the same netmap API, which is at least one order of magnitude faster than the standard operating system mechanisms. Using suitably fast hardware (NICs, PCIe, CPUs), packet I/O can saturate a 10 Gbps link, sending or receiving at line rate the required 14.88 Mpps. Other ports can go even faster: 40 Gbps NICs can achieve 35-40 Mpps, and netmap pipes over 100 Mpps. These are huge speed-ups with respect to conventional APIs, which usually can't perform beyond 1 Mpps, and usually much less than that. Ports like VALE switch ports and netmap pipes can provide high speed packet I/O between processes, virtual machines, NICs and the host stack.

The user can dynamically put NICs into netmap mode and transmit and receive raw packets through memory mapped buffers. NIC drivers have to be modified to achieve the best performance; NICs without netmap support can still use the API in emulated mode, which uses unmodified drivers. Several 1 to 40 Gbps NIC drivers have been natively ported to netmap.

2.2.1 Data structures

Netmap works with three basic types of objects: payload buffers, netmap rings and netmap_if structures [5]. All of these objects reside in the same non-pageable memory region shared between all user processes.


Figure 2.3: NICs with netmap RX/TX path

• The netmap buffers have a default fixed size of 2 KB and are shared by the NIC and the user processes. Each buffer has a unique index that can be translated to a virtual address for the process that is using the netmap API and to a physical address for the NIC, which uses DMA operations.

• The netmap rings are circular arrays that contain a number of slots, each associated to a preallocated netmap buffer. A ring also contains head, tail and cur pointers that identify which part of the circular array is owned by the user process and which by the NIC.

• The netmap_if structure contains information describing the interface, such as the number of rings and the memory offsets to the rings (an abridged declaration sketch follows this list).
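For reference, the following is an abridged sketch of how the ring and slot objects are declared in net/netmap.h; some fields are omitted and reordered here, so the actual header should be consulted for the exact layout:

#include <stdint.h>

struct netmap_slot {
    uint32_t buf_idx;   /* index of the netmap buffer holding the payload */
    uint16_t len;       /* length of the payload in the buffer */
    uint16_t flags;     /* e.g. NS_BUF_CHANGED after a buffer swap */
    uint64_t ptr;       /* used by some port types for indirect buffers */
};

struct netmap_ring {
    uint32_t head;      /* first slot owned by the user process */
    uint32_t cur;       /* wakeup point for blocking I/O */
    uint32_t tail;      /* first slot owned by the kernel/NIC */
    uint32_t num_slots; /* number of slots in the ring */
    struct netmap_slot slot[0];  /* the slots themselves follow in memory */
};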

2.2.2 Opening a device in netmap mode

The NIC is disconnected from the host stack and put in netmap mode by opening the special device /dev/netmap and issuing the NIOCREGIF ioctl (ioctl, "input-output control", is a kind of device-specific system call). Subsequently the shared memory region is mapped into the user-space application by calling the mmap system call.

Netmap data structures are set up, rings are preallocated with netmap buffers, and the NIC rings are made to share the same buffers. The NIC only has access to its rings and netmap buffers, while the user-space application has access to the netmap rings and netmap buffers. The ownership of slots in the ring is well defined by indexes, which avoids races: the application can access slots and buffers contained in [head, tail), while the NIC can access slots and buffers in the range [tail, head).

The transmission side requires the application to fill the netmap slots and then synchronize with the OS. The receive side requires the application to first request an update in order to read how many packets have arrived.

Netmap supports blocking or non blocking I/O:

• Non blocking I/O: user-space signals the OS that rings have been processed by issuing two distinct ioctls for RX and TX: NIOCRXSYNC and NIOCTXSYNC (a minimal sketch follows this list).

• Blocking I/O: supported through the select and poll system calls.

NICs with multiple ring pairs result in multiple netmap rings.
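A minimal sketch of the two synchronization styles, assuming a descriptor nmd obtained from nm_open as shown later in this chapter:

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <sys/ioctl.h>

static void sync_rings(struct nm_desc *nmd)
{
    ioctl(nmd->fd, NIOCTXSYNC, NULL);   /* non blocking: flush pending TX slots */
    ioctl(nmd->fd, NIOCRXSYNC, NULL);   /* non blocking: make new RX slots visible */
}

static void wait_for_rx(struct nm_desc *nmd)
{
    struct pollfd pfd = { .fd = nmd->fd, .events = POLLIN };
    poll(&pfd, 1, -1);                  /* blocking: sleep until packets arrive */
}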

In figure 2.4 we can see the full architecture of the netmap rings reflecting the structure of the NIC rings.

Figure 2.4: NICs with netmap RX/TX path


2.2.3 Zero copy support

Using a single memory region to address buffers enables the possibility of zero-copy forwarding between interfaces. This means that different netmap descriptors can exchange netmap buffers, avoiding a copy of the payload, just by swapping buffer index values between each other's netmap rings. For example, a netmap buffer in the transmit ring ends up being swapped into the receive ring of another netmap interface descriptor.

Since the shared memory region can be mapped by different processes the buffer indexes use a relative addressing so that pointers can be calculated from different virtual address spaces.

2.2.4 The extra buffers

When a netmap port is opened the netmap rings are allocated with netmap buffers associated to the slots. The extra buffers are netmap buffers that are not currently associated to the netmap slots contained in the rings. These buffers can be used to replace the buffers linked to any netmap slot using zero copy, and then swapped again into another ring when they are ready to be forwarded.

When using zero-copy it is possible to replace a netmap buffer with an extra buffer in order to make the received message independent of the ring, effectively avoiding a copy of the message. If it wasn't independent, the buffer could be overwritten before it is actually processed.

The extra buffers are obtained when the port is opened and are returned in a linked list. The application can use these buffer indexes freely, and it is supposed to return the buffers to netmap when the port is closed. Netmap has no way to know whether these buffers are actually used by some other application that is using netmap memory, so it is important to return them.

2.2.5 How to use netmap

The developer can access the Netmap APIs by including the following header files:

• net/netmap.h: it contains the bare minimum to access netmap_ring, netmap_slot and netmap_if.

• net/netmap_user.h: this is optional and contains more helper functions and data structures that simplify and allow a convenient way of interacting with netmap.

Here is an example of how to open a netmap port requesting 128 extra buffers, using both header files.

struct nm_desc *nmd = NULL;
char *ifname = "netmap:eth0";
struct nmreq req;

memset(&req, 0, sizeof(req));
req.nr_arg3 = 128;    /* number of extra buffers requested */

nmd = nm_open(ifname, &req, 0, NULL);

The nm_open function implements a common way of opening a netmap port by performing a lot of different operations. It returns an nm_desc struct that contains all the variables needed to reach the netmap rings, slots and buffers. The function basically opens the device file /dev/netmap and binds the NIC to a port by executing a NIOCREGIF ioctl.

The interface name follows a specific syntax that customizes the way the port is opened and for this reason some special characters and suffixes cannot be used to identify a port.

If no other parameters are specified it will also map the netmap memory to the process address space. The memory map has to be skipped in case more than one netmap port has already been opened in the same process.

This is done in two ways:

• It is possible to reuse another netmap descriptor by specifying its nm_desc as the fourth argument; this will copy some information to the new descriptor.

• It is possible to simply skip the memory map in nm_open, but in this case the developer will have to map the memory himself and set up some internal variables in the netmap descriptor. This is done by specifying the NM_OPEN_NO_MMAP flag in the third argument.

As we can see in the example the nmreq struct is used for requesting our 128 extra buffers. These buffer indexes are accessed by scanning a list.


uint32_t idx = nmd->nifp->ni_bufs_head;

/* ring: any netmap ring of this interface, used only for address translation */
for (; idx; idx = *(uint32_t *) NETMAP_BUF(ring, idx)) {
    /* ... store the buffer idx somewhere ... */
}

The ni_bufs_head variable contains the first extra buffer index, and the next one is accessed by reading the first 4 bytes of the current buffer as an extra buffer index.

The transmission is done by accessing the TX rings using the netmap_ring struct. Using this variable we are able to check the free space and access the individual slots, here on the example ring number r. The code is just an example that cycles over all the free slots of the ring and loads the buffers in order to be able to write some data.

/* transmission */
struct netmap_ring *ring_tx = NETMAP_TXRING(nmd->nifp, r);
uint32_t space_tx = nm_ring_space(ring_tx);

while (space_tx--) {
    struct netmap_slot *ts = &ring_tx->slot[ring_tx->head];
    /* operate on the buffer or on the buffer index */
    char *buf = NETMAP_BUF(ring_tx, ts->buf_idx);
    ...
    ts->len = new_size;
    ring_tx->head = nm_ring_next(ring_tx, ring_tx->head);
}

The reception works very similarly by replacing tx with rx. In that case we will take the buffer instead of writing on it.
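For completeness, a minimal sketch of the symmetric receive loop on ring number r, under the same assumptions as the transmission example above:

/* reception */
struct netmap_ring *ring_rx = NETMAP_RXRING(nmd->nifp, r);
uint32_t space_rx = nm_ring_space(ring_rx);

while (space_rx--) {
    struct netmap_slot *rs = &ring_rx->slot[ring_rx->head];
    char *buf = NETMAP_BUF(ring_rx, rs->buf_idx);
    /* consume rs->len bytes from buf */
    ring_rx->head = nm_ring_next(ring_rx, ring_rx->head);
}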

Instead of writing data into the buffer, if we already have a buffer that contains the packet that we want to send we can swap it into the slot. This swap can happen between buffer indexes of two different interfaces or using an extra buffer. In order to do the swap we have to exchange the values of the buffer indexes and signal that the buffer has been changed by setting the slot flags. The following example replaces a slot buffer with an extra buffer:

struct netmap_slot *s = &ring->slot[ring->head];
uint32_t idx = s->buf_idx;

s->buf_idx = extra_buf_idx;
s->flags |= NS_BUF_CHANGED;
/* save idx somewhere, this is now the extra buffer. */


The port is closed by calling the nm_close function. The netmap memory will be unmapped from the process only if it was mapped by the corresponding nm_open call of that netmap descriptor.

Before closing the port we may have to return extra buffers that are not used any more to the descriptor that we are closing. If each opened port requests some extra buffers, it's reasonable to return the same quantity of extra buffers when it is removed. These don't have to be the same extra buffers that were given when the port was opened; the important thing is that when all ports have been closed, all the extra buffers that the application owns have been returned. Otherwise netmap will not be able to know whether other processes are using these buffers, and that memory will be lost. These buffers are returned by writing them again into the linked list accessed through the ni_bufs_head variable.

uint32_t *eb = (uint32_t *) NETMAP_BUF(ring, buf_idx);
*eb = nmd->nifp->ni_bufs_head;
nmd->nifp->ni_bufs_head = buf_idx;

As for the opening, the first four bytes of the buffer are used to store the next buf_idx. The buf_idx is then written in the netmap descriptor and becomes the new head of the list. The port is finally closed by simply specifying the netmap descriptor as argument, as follows:

nm_close(nmd);

2.3 The DPDK framework

Figure 2.5: DPDK logo.

DPDK (Data Plane Development Kit, https://dpdk.org/) is a multi-vendor and multi-architecture set of libraries and NIC drivers that aims at achieving high performance in packet processing. It works in user-space, bypassing the heavy layers of the Linux kernel networking stack and talking directly to the hardware.

It's a fully open-source project that was started by Intel and in April 2017 became part of the Linux Foundation. It was created for the telecom/datacom infrastructure (data communication, transmission) but it's now used in the cloud, data centres, container environments and more. DPDK is mostly written in C; other tools are written in Python.

DPDK achieves its goals by creating an Environment Abstraction Layer (EAL) that hides the environment details and provides a standard programming interface to applications and operating systems. It also provides many additional services.

DPDK accesses devices via polling, to eliminate the performance overhead of interrupt processing. It includes NIC drivers and the following modules:

• Queue manager: implements lock-less queues.
• Buffer manager: pre-allocates fixed size buffers.
• Memory manager: allocates pools of objects in memory and uses a ring to store free objects.
• Poll mode drivers: PMDs are designed to work without asynchronous notifications, reducing overhead.
• Packet framework: a set of libraries that are helpers to develop packet processing.

DPDK includes more than 35 libraries that help the development of high performance applications.

2.3.1 Ovs-dpdk implementation

The variant of Ovs that uses DPDK in user-land is called OVS-DPDK, and its implementation includes support for several types of dpdk ports:

• dpdk: this implements physical NIC ethernet interfaces.
• dpdkr: this implements dpdk ring ports.
• dpdk-vhost-user and dpdk-vhost-user-client: these implement ports used in virtual hosting systems.

Open vSwitch received DPDK support in 2015 (version 2.2). Version 2.4 added support for virtual devices (vHost).


2.3.2 Dpdk memory allocations

The ovs implementation takes advantage of Linux huge-pages. Huge-pages are memory pages that on the most common platforms can have a size from 2MB to 1GB instead of the standard 4KB pages.

As a result performance is increased by reducing cache pressure, i.e. the number of TLB (Translation Lookaside Buffer) misses.

Other low level optimizations are related to memory cache line alignment, aiming at optimal cache use.

2.4 DPDK and Netmap comparison

Dpdk and Netmap achieve similar goals using different approaches; we can highlight some consequences of the designs that characterise these two projects.

Netmap reuses the original drivers with patches, while dpdk rewrites these drivers from scratch in user-space. A process using dpdk in user-space has access to critical kernel memory and device resources, like registers, that could theoretically crash the system. Using netmap, the shared memory does not contain anything other than shared buffers and other structures, and operations on the device are still validated by the kernel, because netmap essentially reuses the original driver running inside the kernel.

In terms of safety of operation, misbehaving processes using netmap cannot cause a kernel crash [5].

In terms of flexibility, both support most physical devices, but netmap also supports the veth pairs that are usually used to interconnect containers. Dpdk would need a rewrite of that driver, while netmap just has to insert a relatively small patch that rewires the stack to netmap rings.

Netmap has better compatibility with other systems, because it is native to BSD and supports Linux and Windows, while dpdk is native to Linux, has basic support for BSD and no support for Windows.


Chapter 3

ovs-vswitchd internal architecture

Open vSwitch is intended to be easily portable to new software platforms thanks to its abstractions and modular structure. The ovs code base is written in the C language and consists of around 100 Kloc; netmap support has been added by writing a patch that adds around 1.2 Kloc. Most of this code is related to the switch application.

3.1 Internal architecture of ovs-vswitchd

As I explained in the first chapter, ovs is composed of at least two main applications: one that handles the configuration (ovsdb-server) and one that implements the switching functionality (ovs-vswitchd). Obviously the main user-space application of the virtual switch (ovs-vswitchd) is the most important one to be modified for my work.

Figure 3.1 shows a high-level internal architecture of ovs-vswitchd:

• ovs-vswitchd: The sources are found under the vswitchd directory. It connects to the ovsdb-server application and reads the configuration over an IPC mechanism. It passes the information to the lower level, the ofproto library.

• ofproto: The sources are found under the lib directory, which also includes all the next items of this list. This library implements the OpenFlow switch: it talks to different types of OpenFlow controllers over the network and to the switch by using several ofproto providers.


The ofproto provider is what ovs uses to control an OpenFlow capable switch.

• dpif: It is part of the ofproto library and it is used for manipulating datapaths (see 1.4.1).

There are two existing dpif-provider implementations, dpif-netlink and dpif-netdev. The first one is a Linux-specific implementation that talks to the ovs kernel module. The second one is a generic dpif that is used to implement ovs user-space switching. As we can see, these are the two datapaths: the kernel-space and the user-space one.

• netdev: This library abstracts the interaction with network devices like ethernet interfaces, and its main component is found in lib/netdev.c. Each different type of interface can be added by specifying a netdev provider. This is actually what we need to do to implement a netmap port.

Figure 3.1: ovs-vswitchd internal structure


The netdev is an abstract module that can be implemented by defining a set of callbacks and a descriptor to hold information specific to the API that we are implementing. DPDK and other types of port implement a new netdev provider that uses this datapath.

This is exactly what has to be done also for the netmap implementation.

3.1.1 Packet representation: the dp_packet

Ovs uses an internal representation of a network packet called dp_packet. The dp_packet is a data structure that contains a reference to the payload and some metadata; it also comes with a set of functions that simplify the handling of packets.

Each instance of a dp_packet has a payload allocation type (an abridged sketch of the corresponding enum follows this list):

1. DPBUF_MALLOC: The payload data is obtained through dynamic memory allocation (malloc()) and consequently has to be released (free()) upon deallocation.

2. DPBUF_STACK: The payload data is obtained from an un-movable stack space or static buffer.

3. DPBUF_STUB: The payload data is obtained from the stack but it may be expanded into the heap.

4. DPBUF_DPDK: The payload data is obtained from DPDK-allocated memory. As we have seen in 2.3.2, DPDK uses huge-pages.

The dpdk implementation heavily modifies the internal structure of the dp_packet and defines some alternative functions that are used when ovs is compiled with dpdk. It also defines a specific initialization function.

3.1.2 Packet batching: the dp_packet_batch

Ovs callbacks operate on bursts of packets called batches (dp_packet_batch) whose size is defined by a constant, NETDEV_MAX_BURST, which has a value of 32. The size of the batch can't be easily changed.


Using bigger batches we would probably have a slightly higher throughput but also higher average packet delays: looking up the output port to which each packet has to be delivered has a cost that increases with the size of the batch. This size is the result of a trade-off between delay and throughput, and it is a limitation that might affect performance. Since it is a costly operation on general purpose processors, the processing of a batch has been heavily optimized; it uses a complicated internal mechanism of pipelined caching of increasing complexity (the EMC, exact match cache, works on hashing algorithms that expect exactly 32 packets).

The dp_packet_batch is implemented as a struct that holds an array of 32 pointers to dp_packets. This structure is used as a parameter to the send and receive callbacks. The receive callback will generate dp_packets; ovs will process them, decide where to send them, and call the send callback. After that the batch can be freed.
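An abridged sketch of the structure, after lib/dp-packet.h (field names may vary slightly between ovs versions):

#include <stdbool.h>
#include <stddef.h>

#define NETDEV_MAX_BURST 32

struct dp_packet;   /* defined in lib/dp-packet.h */

struct dp_packet_batch {
    size_t count;                                 /* packets currently in the batch */
    bool trunc;                                   /* at least one packet needs truncation */
    struct dp_packet *packets[NETDEV_MAX_BURST];  /* fixed-size array of 32 pointers */
};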

3.1.3 Batch processing: OpenFlow forwarding

It is important to understand what happens to the packets that traverse the virtual switch. As I have described, we are dealing with batches of packets; this section describes what happens to a batch in terms of flow classification and caching. When a batch is received it is analysed in order to find the correct destination of each packet it contains. This operation consists in analysing the packets in order to be able to consult the OpenFlow flow tables, and it has been heavily optimized because it is very expensive on general purpose processors. It's important to note that during this time the program is not receiving or sending data. Each packet in the batch is classified, generating information that is used to recognize the same kind of packet quickly the next time it is found [1].

There is a three-layer hierarchical flow classification (see 3.2) used to recognize the action to be applied to each packet. These actions are applied according to the OpenFlow rules that are installed. The exact match cache (EMC) (see [6]) is the first and quickest check, and it uses a small hash table. If the action for the packet can't be found in the EMC, the search continues in the datapath classifier, and if that also fails the last step is to check the full OpenFlow flow tables. Basically, when packets are processed a key (netdev_flow_key) is calculated by concatenation of all the packet's header fields that ovs knows how to handle. This key is then used to look up the EMC to find the corresponding flow (dp_netdev_flow), which may be found or not. The flow structure contains actions that tell how the packet should be processed [6]. Missed packets are batched together and passed to the more expensive lookups, as said before.

Figure 3.2: Three stage flow cache

Profiling the execution (even of the simplest scenario) shows that this operation has a significant processing cost, but it is needed for the flexibility of the OpenFlow features.

3.1.4 Multi-threading: the PMD

Another important feature of Ovs is its multi-threaded approach based on the pmd thread, where pmd stands for poll mode driver. Ovs can create multiple threads, each of which can have multiple ports assigned to it. The pmd runs a busy loop that sequentially checks if there are packets coming from each port. Port types that actually use pmd mode are assigned to a pmd thread that ovs creates dynamically. Each pmd thread runs on a particular CPU core and is allocated by Ovs when a pmd port is added and there is a free core that can be used. There is a bit mask (pmd-cpu-mask in the ovsdb options) that tells ovs which cpu cores are allowed to be used. The following command enables ovs to use cpu0 and cpu1 for pmd execution:


ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x03

The pmd mode of execution of the netdev is currently used only by netdev-dpdk. It is generic enough to be reused by the ovs-netmap implementation. Other port types are called non-pmd and are run by a poll mechanism in the ovs-vswitchd main loop, which handles all the ports by using a select call. The pmd mode of operation results in full cpu utilization for each thread even with no traffic, but has some throughput and latency advantages.

For example, in figure 3.3 a machine with 4 CPUs could be set up to use the first two cores, one for each pmd. The other cores can be used for the applications that generate traffic.

Figure 3.3: Example ovs with two pmd threads

3.1.5 Temporary packet buffering

To complete the description, another feature, called tx-flush, is a very recent ovs addition (Dec. 2017) that permits buffering some packets inside an internal queue in order to achieve a better average batching size.

It works by specifying the maximum time that a packet can be stored in intermediate queues before being sent to the destination port. In the options it is set with other_config:tx-flush-interval, in microseconds, for example 50µs:

ovs-vsctl set Open_vSwitch . other_config:tx-flush-interval=50


This means there can be several receive callbacks (netdev_netmap_rxq_recv) before a batch is actually sent to a port.

Each batch might contain packets that are destined to be forwarded on different ports. This means that each received batch that has to be forwarded outside the switch might be fragmented into several very small batches, resulting in slow performance. In the worst case a batch would contain only one packet to be forwarded to a particular port, effectively defeating the use of batching. Obviously this option influences both throughput and latency depending on various parameters of the scenario; for this reason it is considered an experimental feature.

Figure 3.4: Callback ordering

Figure 3.4 shows that without the flushing option we always have a forwarding of packets after a receive, even if just a few packets of the batch are destined to that port. In the other case packets are buffered and effectively sent in a non-consecutive send callback. The actual forwarding always happens within a specific time boundary, in order to limit the added delay.

3.1.6 Batch processing in the PMD

Having described the basic batch processing, the PMD and the optional feature of internal packet buffering, I can now explain how the batch is actually processed in the threads. It is very important to highlight what happens to the batch and the packets that are received in a pmd thread. The pmd thread performs these actions sequentially:


1. Receiving: The netdev receives a batch that contains 0 to 32 packets from a port it is handling.

2. Processing: If the batch is not empty it is analysed in order to look up the destination of each packet. This operation can create several output batches that will be sent to different ports. In this step some packets might be cloned if they have multiple destinations. It's important to note that during this processing the pmd thread is not receiving or sending packets. This means that the cpu time is not always used to perform network transfers.

3. Forwarding or Buffering: Now each output batch can be sent to its destination port. This is normally done by the same pmd thread that received the batch. Depending on how ovs is configured, two main things can happen:

(a) The batch is forwarded before calling the next receive callback on the same port

(b) The batch forwarding is delayed and it will be sent later. As explained in the tx-flush section, this is done to have a higher average batching size.

If the batch has to be forwarded using a port that is not handled by the current pmd, this action can still be performed from the current thread. The process then continues with the pmd checking the next port that it is handling. As we can see, a batch is forwarded by the same pmd thread that received it, possibly buffered and sent later; very recent developments of ovs are changing this behaviour, and in the future buffered packets might actually be sent by other inactive pmd threads. This information has been clarified with the help of the Ovs mailing list and will be discussed further in the next chapter.


Chapter 4

Implementation

This chapter illustrates the details of the implementation of a new type of port that supports Netmap API.

It has been developed trying to modify as little as possible of the ovs shared code; this helps when rebasing the code onto newer versions of ovs. It also reuses much of the ovs code (for example: the code that implements lists and various wrappers). Ovs has very useful documentation, with a section that explains important internal architectural details and the main requirements for implementing new features [7]. Inspecting the source code has also been useful: for example the implementations of other types of ports, in particular the Linux and Dpdk implementations.

4.1 Basic netdev implementation

The ovs netdev module is the user-space abstraction of a NIC port connected to the virtual switch.

As a first step the objective is to implement a working netmap netdev that successfully transfers traffic. In the next steps I will concentrate on improving performance while ensuring the correctness of the implementation.

4.1.1 Writing a Netdev provider

An ovs netdev provider implementation enables the switch to add ports that take advantage of a particular networking API; forwarding is entirely done from user-space. It's possible to open an instance of a particular netdev port and add it at runtime to the virtual switch.

The interfaces required to implement a netdev are found under lib/netdev-provider.h. Some functions are required to implement OpenFlow, like setting or reporting the Ethernet hardware address of a port, for a minimal correct operation. Other functions are required to implement optional features and can be omitted; some others are not needed in all implementations.

The programming language used is C, but the code base is structured in an object oriented way: we can consider every netdev implementation a derived class of an abstract netdev class.

Figure 4.1: Netdev providers implementations

An ovs port is identified by an instance of the struct netdev_netmap, which contains all the necessary variables, for example the nm_desc of a netmap port. Each netdev implementation is registered by exporting a set of callback functions that are called when the ovs runtime needs to perform some operation, like sending or receiving a batch or modifying an internal property.
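An illustrative sketch of such a descriptor; the field list here is hypothetical and abridged, and the actual struct netdev_netmap in the patch (appendix C) has more fields:

struct netdev_netmap {
    struct netdev up;         /* the base "class", embedded first */
    struct nm_desc *nmd;      /* netmap descriptor returned by nm_open() */
    struct ovs_mutex mutex;   /* protects configuration and concurrent TX */
    /* ... configuration options, statistics, allocator state ... */
};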

4.1.2 Netdev callbacks

The following callbacks have been implemented in order to obtain a correct implementation of the basic OpenFlow features:

static int netdev_netmap_send(struct netdev *netdev, int qid,
                              struct dp_packet_batch *batch,
                              bool concurrent_txq);
static int netdev_netmap_rxq_recv(struct netdev_rxq *rxq,
                                  struct dp_packet_batch *batch);
static struct netdev *netdev_netmap_alloc(void);
static int netdev_netmap_construct(struct netdev *netdev);
static void netdev_netmap_destruct(struct netdev *netdev);
static void netdev_netmap_dealloc(struct netdev *netdev);
static int netdev_netmap_class_init(void);
static int netdev_netmap_reconfigure(struct netdev *netdev);
static int netdev_netmap_get_config(const struct netdev *netdev, struct smap *args);
static int netdev_netmap_set_config(struct netdev *netdev, const struct smap *args, ...);
static int netdev_netmap_get_ifindex(const struct netdev *netdev);
static int netdev_netmap_get_mtu(const struct netdev *netdev, int *mtu);
static int netdev_netmap_set_mtu(struct netdev *netdev, int mtu);
static int netdev_netmap_set_etheraddr(struct netdev *netdev, const struct eth_addr mac);
static int netdev_netmap_get_etheraddr(const struct netdev *netdev, struct eth_addr *mac);
static int netdev_netmap_update_flags(struct netdev *netdev, ...);
static int netdev_netmap_get_carrier(const struct netdev *netdev, bool *carrier);
static int netdev_netmap_get_stats(const struct netdev *netdev, struct netdev_stats *stats);
static int netdev_netmap_get_status(const struct netdev *netdev, struct smap *args);

These callbacks are registered using an array of function pointers. The registration happens during the initialization of ovs-vswitchd, by calling a function named netdev_netmap_register_provider. When the netdev runtime (dpif_netdev) needs to execute one of these callbacks it will use these function pointers. The most important callbacks are the send and receive callbacks, which are called to check the rings for packets or to forward a batch. The construct and destruct callbacks take care of opening and closing a single port while keeping resources consistent. The other callbacks are used for configuration and for setting or reading options.
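A sketch of how the provider could be wired up and registered, following the pattern of the existing netdev-dpdk provider (field names follow lib/netdev-provider.h but are abridged and may differ slightly between ovs versions):

static const struct netdev_class netmap_class = {
    .type = "netmap",
    .is_pmd = true,                     /* ports of this class are polled by pmd threads */
    .alloc = netdev_netmap_alloc,
    .construct = netdev_netmap_construct,
    .destruct = netdev_netmap_destruct,
    .dealloc = netdev_netmap_dealloc,
    .send = netdev_netmap_send,
    .rxq_recv = netdev_netmap_rxq_recv,
    /* ... the remaining callbacks listed above ... */
};

void netdev_netmap_register_provider(void)
{
    netdev_register_provider(&netmap_class);
}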

4.2 How to handle dp_packets

At runtime each port receives messages, and ovs uses dp_packet descriptors to store the payload. Payload and dp_packets could both be allocated using heap memory. To achieve better performance it is better to avoid any continuous system allocations and deallocations. Let's first concentrate on how to handle the payload allocation.

As described in section 3.1.1, each dp_packet can reference its payload data in different ways. As a first try we could use the DPBUF_MALLOC type, meaning that the payload is allocated through the system allocator. This way of handling packets would require performing a copy of the payload from the netmap buffers. The performance loss of a memory copy should be small, but it increases with packet size.

The advantage of this implementation is its simplicity, and the netmap buffer can be reused right away; however, it hurts performance because of the per-packet allocations and copies. We can actually avoid allocating and deallocating payload data altogether, which requires defining a specific payload type for netmap.

4.2.1

The netmap payload type

I can define a new type of payload for the dp packet: DPBUF_NETMAP. Its task is to handle netmap buffers directly. This is simply done by pointing the payload pointer of the dp packet at the netmap buffer obtained from the netmap slot. It also requires adding some information to the dp packet struct that makes it possible to recognise which netmap buffer the payload pointer refers to. This information is the netmap buffer index, a 4-byte integer (not a memory address) that netmap uses to identify a particular buffer.

This introduces a new problem: when we associate a netmap buffer to the newly created dp packet we have to be sure that it won't be overwritten by a later receive callback before it is actually used. The solution is to make each received dp packet independent from the netmap ring by using extra buffers (see 2.2.4).

When the netdev receives packets, each dp packet already holds a free buffer index, because dp packets are preallocated with an empty buffer when the port is opened. That index is swapped with the buffer in the netmap slot we are receiving from. After the swap the dp packet contains the frame and the netmap slot contains the buffer that the dp packet was previously holding, so the ring pointers can be advanced: that slot can be freely reused.
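A minimal sketch of this swap follows; nm_buf_idx is a hypothetical name for the buffer-index field added to struct dp_packet, and dp_packet_use_netmap stands for the set-up helper that fixes the payload pointer and size:

/* Zero-copy receive (sketch): exchange the spare extra buffer held by
 * the preallocated dp_packet with the buffer that holds the new frame. */
struct netmap_slot *slot = &ring->slot[ring->head];
uint32_t frame_idx = slot->buf_idx;

slot->buf_idx = pkt->nm_buf_idx;  /* hand our spare buffer to the ring */
slot->flags |= NS_BUF_CHANGED;    /* tell netmap the slot's buffer changed */

pkt->nm_buf_idx = frame_idx;      /* the dp_packet now owns the frame */
dp_packet_use_netmap(pkt, NETMAP_BUF(ring, frame_idx), slot->len);

/* The ring pointers can now be advanced: the slot holds a valid,
 * unused buffer. */
ring->head = ring->cur = nm_ring_next(ring, ring->head);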


4.3

The send callback

This callback has the task of sending a batch of packets on the netmap port instance that the netdev refers to. The dp packets are leaving the switch and can then be deallocated.

The input arguments are:

1. netdev: the port instance.

2. batch: the struct that contains the dp packets to be sent.

3. concurrent_txq: a boolean that signals whether the port may be concurrently used by other transmitting threads.

The batch might come from any type of port, and it will contain netmap packets (DPBUF_NETMAP) only if the sender is also a netmap port.

The flow chart in figure 4.2 shows the simplified logic of the send callback.

It starts by checking whether the port flags mark the port as active; if not, the callback cannot continue and exits. It may then take a lock, which happens only if concurrent_txq is true. This flag is decided by ovs and is set to true when two different threads may need to use this netdev instance concurrently. After these checks, the variables used to access the netmap rings are set up.

The ring used in the previous callback run is selected first. On that ring it may be possible to send all, some, or none of the packets; in any case we send as much as possible and, if the space is not enough, we switch to the next ring (which becomes the current ring) until all rings have been visited.

Each time a packet is sent on the current ring its type is checked; there are two possibilities:

• It’s not a DPBUF NETMAP: A copy of the payload will be performed from the payload data to the netmap buffer in the current slot. This works because any dp packet type permits to read the payload size and data.

• It’s a DPBUF NETMAP: This means that it comes from a netmap port and it is possible to avoid the copy by performing a zero-copy 2.2.3 that . . . .

(43)

is a swap of the extra buffer contained in the dp packet with the buffer index in the slot.

With zero-copy, two operations happen:

• The buffer index in the netmap slot is swapped with the one held by the dp packet.

• The payload pointer is updated to the memory address of the slot's payload. This operation is not needed between netmap ports, but it enables non-netmap ports to read the payload.

When all packets have been sent, an ioctl with request type NIOCTXSYNC is called on the netmap file descriptor. This operation synchronizes the netmap rings with the NIC rings; packets are then pushed to the wire. After sending all packets I can deallocate the dp packets in the batch. As a last step, the critical section is unlocked if it was previously locked.
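A minimal sketch of the inner send loop under these rules, assuming the current ring has enough free slots, that ring and the netmap file descriptor fd belong to the current port, and reusing the hypothetical nm_buf_idx field from section 4.2.1:

/* Send path (sketch): zero-copy swap for netmap packets, payload copy
 * for every other dp_packet type. */
for (size_t i = 0; i < dp_packet_batch_size(batch); i++) {
    struct dp_packet *pkt = batch->packets[i];
    struct netmap_slot *slot = &ring->slot[ring->head];

    if (pkt->source == DPBUF_NETMAP) {
        uint32_t tmp = slot->buf_idx;        /* swap buffer indexes */
        slot->buf_idx = pkt->nm_buf_idx;
        pkt->nm_buf_idx = tmp;
        slot->flags |= NS_BUF_CHANGED;
    } else {
        memcpy(NETMAP_BUF(ring, slot->buf_idx),
               dp_packet_data(pkt), dp_packet_size(pkt));
    }
    slot->len = dp_packet_size(pkt);
    ring->head = ring->cur = nm_ring_next(ring, ring->head);
}
ioctl(fd, NIOCTXSYNC, NULL);          /* push the frames towards the NIC */
dp_packet_delete_batch(batch, true);  /* the packets have left the switch */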


Figure 4.2: Send callback flowchart


4.3.1

Send callback mutual exclusion

The lock has been implemented using a spinlock instead of a mutex in order to avoid context switching: a spinlock busy-waits by repeatedly checking whether the lock is available, so the thread remains active instead of being descheduled. This is consistent with the fact that we are already running a busy loop.

The lock is needed in use cases like the one presented in figure 4.3, where two ports handled by different threads are concurrently trying to send traffic to the same port.

Figure 4.3: two ports sending traffic to one port
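A minimal sketch of this conditional locking, here with a POSIX spinlock; the tx_lock field name is an assumption:

/* Take the TX lock only when ovs says other threads may also be
 * transmitting on this port. */
if (concurrent_txq) {
    pthread_spin_lock(&dev->tx_lock);
}

/* ... fill the TX rings and issue NIOCTXSYNC as described above ... */

if (concurrent_txq) {
    pthread_spin_unlock(&dev->tx_lock);
}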

4.4

The receive callback

This callback tries to receive a batch of packets from the netmap port that the netdev instance refers to. This means packets are entering the switch.

The input arguments are just two:

1. netdev: the port instance.

2. batch: the batch that the callback fills with newly allocated dp packets.

The flow chart in figure 4.4 shows the simplified logic of the receive callback. As in the send callback, it starts by checking whether the port flags mark the port as active; if not, the callback exits immediately. After this check, the total number of packets awaiting reception in each ring is counted.

Two things can happen:

• There are zero packets: all packets have been consumed or the port is idle. The ioctl NIOCRXSYNC is called in order to synchronize the status with the NIC. After this operation we exit; the next callback will be called shortly after and hopefully there will be some packets in the rings.

If there are still no packets after several callbacks, the port may really be idle: no one is sending data, but since we are running a busy loop we still have to keep checking. The implementation therefore limits the number of ioctls issued using a timer based on the time stamp counter (TSC), so that at most one ioctl is issued every few microseconds. This interval is modifiable at runtime through the ovsdb-server interface and is called netmap-rxsync-intval. It helps reducing the time the pmd thread spends on idle ports in favour of other, probably active, ports (a minimal sketch of this rate limiting follows the list).

• There are some packets: the batch is then filled with at most 32 dp packets (the allocation is discussed in detail in section 4.5). Fewer packets are taken only if all the rings together contain less than 32; in that case the callback returns a non-full batch, but this happens rarely. Since we have previously counted the packets we know exactly how many we can receive, so we visit each ring starting from the one used in the last callback execution and switch to the next ring while there are still packets to be received.
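A minimal sketch of the rate limiting mentioned in the first case; the device fields and the exact cycle counter used are assumptions:

/* Idle-port throttling (sketch): call NIOCRXSYNC at most once every
 * rxsync_intval TSC cycles. */
if (pending == 0) {
    uint64_t now = __rdtsc();                 /* from <x86intrin.h> */
    if (now - dev->last_rxsync >= dev->rxsync_intval) {
        ioctl(dev->fd, NIOCRXSYNC, NULL);     /* sync rings with the NIC */
        dev->last_rxsync = now;
    }
    return EAGAIN;                            /* nothing received */
}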

As each packet is received, its dp packet is initialized by calling the proper dp packet set-up function: the buffer index held by the dp packet is swapped with the buffer index in the netmap slot, and the payload pointer is updated.

After receiving all packets the callback returns, and ovs processes the batch in order to find the destination ports. The batch is then forwarded to those ports, possibly split into several different batches. This has some implications, discussed in the next section (4.4.1).


Figure 4.4: Receive callback flowchart


4.4.1

Packet cloning

When the batch is processed by ovs to decide where to send the packets, it can be fragmented into sub-batches, and it may happen that a dp packet has to be forwarded to multiple ports. In this case one copy is sent as a DPBUF_NETMAP while the other is cloned and becomes a DPBUF_MALLOC: we obviously cannot have the same buffer index in two different netmap rings.

The cloned packet is handled normally, as if it were a packet coming from a non-netmap source. This costs some performance, but it is important that everything keeps working even in this case. Figure 4.5 shows an example use case where this situation can happen:

Figure 4.5: one port duplicating traffic to two ports
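Conceptually, the handling reduces to the following sketch; send_to_port is a stand-in for ovs's output action, not a real function:

/* Duplicating one received frame towards two ports (sketch). */
struct dp_packet *copy = dp_packet_clone(pkt); /* deep copy: DPBUF_MALLOC */
send_to_port(port_a, pkt);   /* zero-copy: the netmap buffer is swapped */
send_to_port(port_b, copy);  /* the clone's payload is copied into a TX slot */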

4.5

Requirements for an allocator of dp packets

In order to achieve better performance it is a good idea to avoid any dynamic memory allocation or deallocation at steady state in the receive and send callbacks.

Payload allocations have already been taken care of by using netmap buffers directly, but the dp packet instances themselves would still be allocated and deallocated continuously; these allocations happen even when netmap zero-copy is used. For each pair of receive and send callbacks we would pay 32 allocations and 32 deallocations. It is crucial to avoid or reduce any per-packet operation that might execute system calls and cause a context switch.

DPDK has its own allocator based on huge pages (see 2.3.2), exposed in its API, which can be used to allocate dp packets and payloads. Netmap does not expose its allocator, so one has to be implemented in the ovs code.
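As a preview of the direction taken in the next subsections, this is a minimal sketch of the kind of recycling pool these requirements call for; all names are assumptions:

/* A LIFO free list of dp_packets: at steady state get/put never touch
 * the system allocator. */
#define POOL_SIZE 1024

struct pkt_pool {
    struct dp_packet *slots[POOL_SIZE];
    unsigned n_free;
};

static struct dp_packet *
pool_get(struct pkt_pool *p)
{
    /* Refill (e.g. with a batched allocation) is assumed elsewhere. */
    return p->n_free ? p->slots[--p->n_free] : NULL;
}

static void
pool_put(struct pkt_pool *p, struct dp_packet *pkt)
{
    if (p->n_free < POOL_SIZE) {
        p->slots[p->n_free++] = pkt;  /* recycle instead of free()ing */
    }
}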
