Faculty of Engineering
Master's Degree in Computer Engineering
Master's Thesis
Extensible paravirtualization
using eBPF
Supervisor:
Ing. Giuseppe Lettieri
Co-supervisor:
Ing. Alessio Bechini
Candidate:
Virtual machines are increasingly important today because of their extensive use in cloud computing. The performance of these systems is often limited by the fact that a lot of relevant information is only known to the guests running in the virtual machines and is opaque to the host. Paravirtualization, i.e., creating explicit communication interfaces between the guests and the host, is the major technique used today to overcome this limitation, but the evolution of these interfaces is limited by the need to keep them compatible with legacy guests. The purpose of this thesis is to design and implement a mechanism to extend paravirtualization using eBPF, allowing the host to inject arbitrary code into the guest kernel, and to apply this mechanism to improve the performance of virtual machines.
Contents

List of Figures
List of Tables
Listings

1 Introduction
  1.1 Monitoring
  1.2 BPF
  1.3 Extensible paravirtualization mechanism

2 State of the art
  2.1 Virtual machines
    2.1.1 Emulation
    2.1.2 Paravirtualization
    2.1.3 Hardware-assisted virtualization
  2.2 QEMU
    2.2.1 QEMU features
    2.2.2 QEMU-KVM
    2.2.3 QEMU architecture
    2.2.4 QEMU event loop
    2.2.5 QEMU virtual CPUs
    2.2.6 QEMU device emulation
  2.3 Linux device driver
  2.4 Kprobes
    2.4.1 Kprobes features
  2.5 Berkeley Packet Filter
    2.5.1 BPF history
    2.5.2 BPF evolution
    2.5.3 BPF program types
    2.5.4 BPF system calls
    2.5.5 BPF maps
    2.5.6 BPF in-kernel verifier
    2.5.7 BPF helpers
    2.5.8 BPF in-tree compiling
    2.5.9 BPF exploitation

3 Extensible paravirtualization
  3.1 Description
  3.2 Architecture
  3.3 Implementation
    3.3.1 Host interface
    3.3.2 Virtual device
    3.3.3 Device driver
    3.3.4 Guest daemon
  3.4 Achievement

4 Use case: Virtual to Physical CPUs affinity
  4.1 Use case specification
  4.2 Implementation
    4.2.1 Extension to the general mechanism
    4.2.2 BPF program
    4.2.3 Hyper-threading support
  4.3 Testing
    4.3.1 Test environment
  4.4 Results
    4.4.1 Virtual CPU pinning
    4.4.2 Virtual Hyper-thread pinning
List of Figures

1.1 Linux subdivision into userspace and kernelspace.
1.2 Kprobe code flow inside a probed function.
2.1 Hypervisor types and architecture w.r.t. host and guest systems.
2.2 Host and guest architecture in paravirtualization.
2.3 CPU modes when running specific domain code.
2.4 QEMU IOThread execution flow, showing lock and unlock.
2.5 QEMU IOThread execution flow, showing lock and unlock.
2.6 QEMU vCPU threads execution flow, showing lock and unlock.
2.7 Linux Device driver interrupt handler.
2.8 Kprobes instruction flow.
3.1 Generic extensible paravirtualization mechanism architecture.
3.2 Overview on host to guest communication in extensible paravirtualization mechanism.
4.1 Dashed arrows represent affinity. Virtual CPUs are threads in the host system and therefore they can have their affinity with host physical CPUs.
4.2 Dashed arrows represent affinity. A guest thread requires to be executed only on vCPU 0 and the request is correctly applied also in the host system, on pCPU 0.
4.3 Dashed arrows represent affinity. Two guest userspace threads are pinned on the same core to exploit hyper-threading benefits. The basic mechanism is not enough to apply this CPU pinning correctly on the host side.
4.4 Dashed arrows represent affinity. Two guest userspace threads are pinned on the same core to exploit hyper-threading benefits. Hyper-thread pinning also applies on the host system.
4.5 eBPF program injection and sched_setaffinity() calls detection with affinity mask forwarding from guest to host, with possible remapping if hyper-threading is used.
4.6 Throughput comparison between standard system and vCPU pinning. High load conditions with various serialization percentage are compared.
4.8 Throughput comparison between standard system and Hyper-thread pinning. No load and lower load conditions are compared.
4.9 Throughput comparison between standard system and vCPU pinning. Higher load conditions are compared.
4.10 Throughput comparison between host system and guest system with
List of Tables

2.1 Classic BPF versus Extended BPF.
2.2 eBPF program types explained.
2.3 eBPF map types explained.
2.4 eBPF map types explained.
3.1 Message structure for extensible paravirtualization mechanism.
3.2 Main message types for extensible paravirtualization mechanism messages.
3.3 QEMU virtual device buffer structure.
4.1 CPU ordering in Hyper-threading Linux systems when running as native, i.e. host system.
4.2 CPU ordering according to QEMU in Hyper-threading Linux systems when running as guest.
4.3 Virtual CPU pinning test parameter range.
Listings

3.1 Definition of message structure
3.2 Host interface code flow summarized
3.3 QEMU virtual device data structure
3.4 QEMU virtual device socket communication initialization
3.5 QEMU virtual device accept handler
3.6 QEMU virtual device connection handler
3.7 Device driver interrupt handler implementation
3.8 Device driver read operation implementation
3.9 Device driver write operation implementation
3.10 Device driver llseek operation implementation
3.11 Device driver ioctl operation implementation
3.12 Guest userspace daemon main loop
3.13 Guest userspace daemon PROGRAM_INJECTION handler
4.1 Remapping function used in hyper-threading host-guest systems
Introduction
This work is about paravirtualization, but not in its strict sense. It is about cooperation between the guest operating system and the host, but it does not involve rewriting the so-called base kernel. This means that we can achieve the benefits of paravirtualization without rewriting the target software. It is an extensible paravirtualization mechanism, meaning that it can be applied to unmodified operating systems. The guest OS will start up as it does in a non-virtualized environment, or in virtualized setups such as hardware-assisted virtualization, and it will be able to exploit the benefits of paravirtualization through the mechanism presented in the following.
Paravirtualization involves modifying the target OS. In the target OS kernel, non-virtualizable instructions are replaced with hypercalls that communicate directly with the host virtualization layer. Hypercalls are based on the same concept as system calls. System calls are used by userspace applications to request functionality from the OS, thus providing an interface between userspace and the kernel. Hypercalls are very similar to system calls, except that the request is serviced by the hypervisor. The hypervisor provides hypercall interfaces for kernel facilities such as memory management and interrupt handling.
One of the requirements of virtualization is that we do not want to change the target software. To fulfill this requirement we cannot adopt a modified version of the guest operating system; in particular, this option would require modifying the guest OS (and recompiling it) every time a new feature is needed, resulting in poor compatibility and portability. It is also important to note that closed source operating systems cannot be modified to exploit paravirtualization, as their source code is simply unavailable.
To avoid any changes to the target software (the guest operating system) we have taken a different path. The guest base kernel, that is, the part of the kernel that is bound into the image that you boot (i.e. all the kernel except the Loadable Kernel Modules), stays completely unmodified. On boot, the host attaches a special device to the virtual machine, which the guest can use through a driver that is itself a Loadable Kernel Module. This way we can develop a device driver that will be loaded into the guest kernel, providing functionality without modifying the guest OS. The Linux structure is shown in Figure 1.1.
[Figure 1.1 diagram: user space (user-level programs, GNU C Library) above the system call interface, kernel space with kernel modules (i.e. drivers), and the physical hardware: CPU, memory, devices.]
Figure 1.1: Linux subdivision into userspace and kernelspace.
The counterpart of the device driver is a userspace daemon, which by definition is a program that runs as a background process rather than being under the direct control of an interactive user. The daemon interacts with the device through its driver. This interaction is needed because the device is not a real one: it does not represent a printer or a network card. It is instead a virtual device that allows the guest to write to it and read from it, but also allows the host to interact with it, so that the host can indirectly communicate with the guest daemon and vice versa. In practice this means that the host can inject commands and/or data into the guest, while the guest can process those commands and reply or send other commands back. Such a mechanism is fairly general, but it still needs the capability to interact with the guest, in particular with the guest kernel. There are many possible interactions that a userspace program can have with the kernel, specified by system calls; each one is a specific request from userspace to the kernel, covering process management, main memory management, file access, networking and device handling.
1.1 Monitoring
It might be interesting to monitor specific system calls or in-kernel functions in order to track what is going on in the kernel. Tracking such functions can provide information that might be exploited to improve performance or even for statistical purposes. A very useful tool provided by the Linux kernel is Kernel Probes [1]. Kprobes enables you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. You can trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit. Kprobes allows running a customized handler when the target function is executed. This way we can detect parameters of system calls, which might be stored or used later, for example by passing them to the host system through the virtual device. This is where the extensible paravirtualization mechanism comes into play. In practice we are able to know the state of the guest system, up to the needed degree. The needed knowledge of the state of the guest system depends on the feature we are willing to implement and can vary for each possible scenario. In the typical case, Kprobes-based instrumentation is packaged as a kernel module. The module's init function installs ("registers") one or more probes, and the exit function unregisters them. The code flow of a kprobed function is clarified in Figure 1.2.
Figure 1.2: Kprobe code flow inside a probed function.
A registration function such as register_kprobe() specifies where the probe is to be inserted and what handler is to be called when the probe is hit. The handler should be provided by the host, specifying each action to be taken for the intended feature to be implemented. Considering the level of knowledge expected from system administrators working in this field, it is also mandatory to consider how error prone this mechanism is. In particular, using a handler provided by the host might work, but even the smallest bug might crash the whole guest system. Working inside the kernel, i.e. in a Loadable Kernel Module, presents difficulties that are linked to the privilege level acquired by being "inside" the kernel. Still, a good system administrator might be able to prepare a well-written handler to inject inside the guest system, to be triggered by kprobes when specific functions are executed.
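As a concrete reference, the following is a minimal sketch of what such a Kprobes-based module looks like; the probed symbol and the handler body are illustrative only, not taken from the mechanism developed in this work.

```c
#include <linux/module.h>
#include <linux/kprobes.h>

/* Pre-handler: runs just before the probed instruction is executed. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    /* Here one could inspect the saved registers (e.g. system call
     * arguments) and store or forward the collected information. */
    pr_info("kprobe hit at %p\n", p->addr);
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "__x64_sys_openat",  /* example target, architecture dependent */
    .pre_handler = handler_pre,
};

static int __init probe_init(void)
{
    return register_kprobe(&kp);
}

static void __exit probe_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");
```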
Although this seems the straightforward way to achieve such a goal, it presents some disadvantages. 1) The handler must be part of a Loadable Kernel Module, which is not a mere detail. It means that a specific kernel module must be dedicated to the kprobe itself, leading to one kernel module per kprobe/feature needed in the guest system. Why? Kprobes can be registered and unregistered at runtime, but their behavior, the handler, is defined by the programmer and therefore it is immutable. This clashes with the initial definition of this work as extensible paravirtualization. 2) Assuming we can tolerate the previous usability issue, a possibly even larger issue is at stake: security. What are the possibilities for a handler issued from outside the guest OS? Potentially unlimited, because being a handler for a kprobe within a kernel module, the maximum privilege level is obtained, and any action can be performed as in any other kernel module. This can be fatal for the guest system if the handler is malicious, or even if it is not malicious but just contains a bug, leading to a possible kernel crash. Previously we referred to this situation as an error prone mechanism. Actually, loadable kernel modules are conceived to be pieces of code that can be loaded and unloaded into the kernel upon demand. They extend the functionality of the kernel without the need to reboot the system. Thus, the privilege level given to that code must be the highest, otherwise it will not be possible to perform what is requested by the kernel module. On the other hand, we must keep in mind that to allow the host system to gather as much information as needed, the privilege level must be high. For a host system that wants to somehow improve its virtual machine performance, or just track it, the most precious information is the one kept at kernel level inside the guest. But not only: some of the information we are looking for is not stored anywhere, it consists of events inside the guest operating system, such as the execution of a system call. As already stated, tools like Kprobes enable tracing in the kernel, but also debugging and event monitoring. A probe handler can modify the environment of the probed function, e.g. by modifying kernel data structures, or by modifying the contents of the pt_regs struct (which are restored to the registers upon return from the breakpoint). So Kprobes can be used, for example, to install a bug fix, to alter the execution of specific functions or to inject faults for testing. Kprobes, of course, has no way to distinguish the deliberately injected faults from the accidental ones.
Up to now, Kprobes offers all the functionality needed for our goal, that is, enabling the host system to grasp the desired information about, or if you prefer to track, the guest system. As said, that information might be as simple as "how many processes are running inside the guest?" or, in a more complex fashion, "what arguments are passed to this system call whenever it is invoked?". All these possible questions from the host system might be useful for specific optimizations or for tracing from the outside. Still, as explained earlier, Kprobes has some disadvantages that do not make it the best choice for this case, which in summary are poor extensibility and security.
1.2 BPF
BPF stands for Berkeley Packet Filter, a technology developed in 1993 and published in The BSD Packet Filter: A New Architecture for User-level Packet Capture, by S. McCanne and V. Jacobson [10], designed to improve the performance of packet capture tools. In 2013, Alexei Starovoitov proposed a major rewrite of BPF [15], which was further developed by Alexei Starovoitov and Daniel Borkmann and included in the Linux kernel in 2014 [2]. Beyond these historical references, this rewrite turned BPF into a general-purpose execution engine that can be used for a variety of things; hence the name eBPF, which stands for extended Berkeley Packet Filter. eBPF allows the kernel to run mini programs on system and application events, such as disk I/O, thereby enabling new system technologies. It makes the kernel fully programmable, empowering users (including non-kernel developers) to customize and control their systems.
eBPF is a flexible and efficient technology composed of an instruction set, storage objects, and helper functions. It can be considered a virtual machine due to its virtual instruction set specification. These instructions are executed by a Linux kernel BPF runtime, which includes an interpreter and a JIT (Just-In-Time) compiler that turns BPF instructions into native instructions, thus increasing execution speed. BPF instructions must first pass through a verifier that checks for safety, ensuring that the BPF program will not crash or corrupt the kernel (it does not, however, prevent the end user from writing illogical programs that may execute but not make sense). eBPF can be used for three classes of problems: networking, observability, and security. This work exploits observability, in order to trace the guest system.
BPF tracing supports multiple sources of events to provide visibility of the entire software stack. One that deserves special mention is dynamic tracing: the ability to insert tracing points into live software. Dynamic tracing costs zero overhead when not in use, as the software runs unmodified. BPF can trace the start and end of kernel and application functions, among the many tens of thousands of functions that are typically found running in a software stack. This provides visibility so deep and comprehensive that it can actually give full knowledge of what is going on inside the system. Even though the BPF architecture will be described in the next chapters, for the sake of highlighting why eBPF is so important for our work we will briefly introduce how it works.
An eBPF program is "attached" to a designated code path in the kernel. When the code path is traversed, any attached eBPF programs are executed. Code paths can be of various kinds, allowing programs to be attached to tracepoints, kprobes, and perf events. This is what triggered interest in BPF with respect to this work. Since eBPF programs can access kernel data structures, developers can write and test new debugging code without having to recompile the kernel. The implications are obvious for busy engineers tracing or debugging issues on live, running systems. So, as it has become more and more evident, what we are trying to do is to exploit the benefits of eBPF to use kprobes safely. Why "safely"? One really important aspect of eBPF is that it makes use of an in-kernel verifier. There are inherent security and stability risks in allowing user-space code to run inside the kernel, so several checks are performed on every eBPF program before it is loaded; they will be further detailed in the following chapter. One check requires the verifier to detect the eBPF program type, to restrict which kernel functions can be called from eBPF programs and which data structures can be accessed. Some program types can directly access network packet data, for example. All program types will be better explained later, but for now it is important to focus on BPF_PROG_TYPE_KPROBE. In this work the kprobe program type is used, but actually any other program type is usable if the desired features fit well with such a program type. In particular, the kprobe program type is the relevant one, as it allows an eBPF program to be fired just like a kprobe while enjoying the benefits of being an eBPF program. Generally, eBPF programs are loaded by a user process and automatically unloaded when the process exits. This realizes what has been called "extensibility", while the eBPF in-kernel verifier realizes what has been called "security".
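One common way for a user process to load and attach such a program is through libbpf; the sketch below shows the basic flow, assuming libbpf 1.x. The object file name and program name are illustrative, not those of the daemon described later.

```c
#include <bpf/libbpf.h>
#include <unistd.h>

int main(void)
{
    /* Open and load a pre-compiled BPF object file (illustrative name). */
    struct bpf_object *obj = bpf_object__open_file("trace_affinity.bpf.o", NULL);
    if (!obj)
        return 1;
    if (bpf_object__load(obj))        /* runs the in-kernel verifier */
        return 1;

    /* Find the program by name and attach it to its kprobe. */
    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "handle_call");
    struct bpf_link *link = bpf_program__attach(prog);
    if (!link)
        return 1;

    pause();                          /* probe stays attached while the process lives */
    bpf_link__destroy(link);
    bpf_object__close(obj);
    return 0;
}
```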
1.3 Extensible paravirtualization mechanism
Back to our system structure, we have a host system that hosts a virtual machine. On boot, a virtual device is attached to the virtual machine so that a driver in the guest operating system can then use such a device. Even though this device is not a real one, the guest is not aware of that. So, the device driver works as a usual device driver in Linux, as well documented and explained in Linux Device Drivers, 3rd Edition [4]. It implements all the required operations, such as read, write, ioctl, seek, open and release, according to the hardware specification, or in our case to the virtual hardware specification. The interrupt handler should also be specified, if needed.
Inside the guest system, a daemon takes care of interacting with the device, using regular system calls such as open, read and write. Reads, in particular, can return data from the device that actually comes from the host. How? The host will use an interface to the QEMU hypervisor. More details about QEMU and its architecture will be provided later, but for the moment it is important to note that QEMU is actually a userspace program, like a simple ls command. This is important because it allows us to do multiple things. The virtual device code, which is part of QEMU, contains a socket listening on a given port, allowing a client, potentially remote with respect to the host of the virtual machine, to connect to QEMU, enabling the device to receive commands and/or data. Data can be stored inside the device buffer and then read by the guest daemon through a standard read system call. The reverse path is also possible. An eBPF program, already compiled for the given guest, can be transferred inside the guest and used: the guest daemon is able to load an eBPF program at will, using the facilities provided by the BPF libraries.
That is how we come full circle: a remote client injects a message, composed of a header and a payload, into the virtual device on the host side, and the guest daemon, through the device driver regulating access to the device, obtains such a message and consumes it according to its content. The message content might be anything, from a simple command to a more complex payload, e.g. an eBPF program containing a kprobe ready to be loaded into the guest kernel. Any information obtained by this eBPF program might be used inside or outside the guest system, possibly being propagated up to the host hypervisor or even to the client that initiated the communication.
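To fix ideas, a message of this kind could be represented by a structure along the following lines; the field names and sizes below are only illustrative, as the actual layout is defined in Chapter 3.

```c
#include <stdint.h>

/* Illustrative message header: the real layout is defined in Chapter 3. */
struct pv_message_header {
    uint32_t type;     /* e.g. a command, a reply, a PROGRAM_INJECTION request... */
    uint32_t length;   /* payload length in bytes */
};

struct pv_message {
    struct pv_message_header header;
    uint8_t payload[]; /* command data or a compiled eBPF object */
};
```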
This mechanism is absolutely general. It allows a remote client to send structured messages to a guest userspace process. The meaning of those messages is up to the developer, who must implement a specific encoding for each message type and associate the appropriate behavior with the portions of code involved in the execution flow. Specifically, the guest daemon, the device driver, but also the device itself should implement the required behavior. This leaves open the possibility of establishing well-defined message types for basic operations, e.g. injecting and loading an eBPF program in the guest kernel, while letting developers further extend this mechanism based on their needs. One could define new message types containing new commands to let the system perform different actions, or to exploit eBPF program tracing information in specific ways.
This mechanism makes it possible to manipulate, to some extent, the guest system and to extract information that might be precious to the host. But keep in mind that the guest is still unmodified software. Being unmodified software allows the host, when possible, to run the virtual machine using hardware-assisted virtualization. There are a lot of benefits in doing so, which are not present in a paravirtualized environment. We can retain the benefits of hardware-assisted virtualization while being able to extract information from a guest that is aware of being virtualized only because it is running a special daemon, and not because it is using a modified version of the target OS that includes hypercalls like a standard paravirtualized system.
State of the art
This chapter summarizes the state of the art regarding virtual machines, hypervisors and the Linux kernel technologies relevant to this work. This gives us an important insight into the working environment that we are going to use to develop the previously described mechanism. A preliminary study phase was necessary to gain a good knowledge of these technologies, so as to be able to use them to create software that was consistent with what already existed and, above all, could take advantage of existing technologies. This is fundamental precisely because, when implementing solutions with production in mind, they need to be well integrated with everything that already exists, in order to guarantee performance and readability of the code. When we refer to Linux, we refer to the Linux kernel. For reference, in all development and test phases Ubuntu 20.04 was used, with kernel version 5.4.0.
2.1 Virtual machines
A virtual machine is a virtual representation of a physical computer. The various virtualization techniques make it possible to create multiple virtual machines inside the same physical computer, called the host. Each of these virtual machines, called guests, contains its own operating system and userspace applications. Virtual machines are typically prevented from interacting directly with the host hardware. It is the job of a software layer called the hypervisor, within the host system, to handle the virtual machines' access to the host's physical computational resources, such as processor, memory, and I/O devices. Virtual machines usually execute unmodified code, as if they were actually running on a physical computer. In doing so, special care must be taken because there are many instructions that can harm the host system; therefore the hypervisor must check those and run them in a specific way, depending on how virtualization is implemented.
There are two types of hypervisors, as shown in Figure 2.1: Type 1 hypervisors (bare-metal hypervisors) run directly on the physical hardware, in fact replacing the host OS. Type 2 hypervisors (hosted hypervisors) run as a userspace application within a host OS. In type 2, the hypervisor relies on the host OS to perform certain operations, such as managing calls to the CPU, network resources, memory and storage. This allows type 2 hypervisors to support a wide range of hardware.
(a) Type 1 (b) Type 2
Figure 2.1: Hypervisor types and architecture w.r.t. host and guest systems.
Kernel-based Virtual Machine (KVM) is an open source virtualization technology built into Linux. Specifically, KVM lets you turn Linux into a hypervisor that allows a host machine to run multiple virtual machines. This open source Linux-based hypervisor is mostly classified as a type 1 hypervisor; at the same time the host OS is still available, allowing the overall system to be categorized as a type 2 hypervisor. KVM is also used as a short name for QEMU-KVM, a fork of the QEMU project, which is a generic and open source machine emulator and virtualizer. At the time of writing, the QEMU-KVM fork and QEMU mainline have been merged, so there are no more differences between the two. Several techniques can be used to carry out virtualization, namely emulation, paravirtualization and hardware-assisted virtualization. QEMU-KVM is a type 2 hypervisor which exploits hardware-assisted virtualization to reach near-native speed due to the reduced intervention of the hypervisor.
2.1.1 Emulation
Emulation is the basic approach: you simply read one instruction at a time and emulate it in the host system. This requires a big switch in the hypervisor, with a case for each instruction. You can also run a guest operating system that has been developed for a different architecture, because each instruction is emulated anyway, meaning that for each guest instruction there is host code that handles it. You need data structures to represent your devices, e.g. the CPU as a structure with register fields and the memory as an array of bytes. Those fields are then manipulated in a big loop that represents the fetch-decode-execute phases of the guest processor. Interrupts can also be handled by declaring in software everything that is needed in a physical system, like an interrupt descriptor table and its pointer. Emulation allows high flexibility with respect to which guest system can be used, but unfortunately introduces a huge overhead in its fetch-decode-execute loop, especially for those instructions that might need more than one host instruction to be emulated. This disadvantage can be mitigated by using binary translation, a technique that first translates blocks of guest code, usually from jump to jump, into host-equivalent code and then executes it. The overhead introduced by the translation phase is hopefully amortized as the translated blocks are executed multiple times.
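As a toy illustration of the fetch-decode-execute loop just described (the instruction set and structure fields below are invented for the example, not taken from any real emulator):

```c
#include <stdint.h>

/* Invented toy guest CPU: a program counter, one register, and a flat memory. */
struct toy_cpu {
    uint32_t pc;
    uint32_t acc;
    uint8_t  mem[65536];
    int      halted;
};

enum { OP_HALT = 0, OP_LOAD_IMM = 1, OP_ADD_IMM = 2 };

static void emulate(struct toy_cpu *cpu)
{
    while (!cpu->halted) {
        uint8_t opcode  = cpu->mem[cpu->pc];       /* fetch  */
        uint8_t operand = cpu->mem[cpu->pc + 1];
        switch (opcode) {                          /* decode + execute */
        case OP_LOAD_IMM: cpu->acc = operand;  cpu->pc += 2; break;
        case OP_ADD_IMM:  cpu->acc += operand; cpu->pc += 2; break;
        case OP_HALT:     cpu->halted = 1;                    break;
        default:          cpu->halted = 1;                    break; /* illegal opcode */
        }
    }
}
```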
2.1.2 Paravirtualization
Paravirtualization is a technique in which the guest OS and the hypervisor communicate. This communication mainly aims at improving performance and efficiency. Paravirtualization requires modifying the guest OS in order to replace non-virtualizable instructions with hypercalls, which are function calls that communicate directly with the host hypervisor, which in turn is designed to handle this communication with the guest OS. In particular, hypercalls provide support for a range of important operations that the kernel requires, such as hardware interactions, examples of which are memory management and interrupt handling. A generic structure is shown in Figure 2.2. The main difference with respect to emulation is that the guest OS is aware of running in a virtualized environment. Paravirtualization offers poor compatibility when it comes to closed source operating systems like Windows: those systems cannot be modified to use hypercalls, so paravirtualization is not available in such situations. It is worth mentioning that paravirtualization can also boost I/O performance, as in VirtIO.

Figure 2.2: Host and guest architecture in paravirtualization.
2.1.3 Hardware-assisted virtualization
With hardware-assisted virtualization the host tries to directly execute guest instructions on the host processor. If a lot of instructions are executed directly, which is often the case, execution speeds up enormously, reaching near-native speed. It is required that the host machine understands all the instructions defined in the guest, which usually means that host and guest have the exact same architecture. The majority of instructions will be directly executed, but there are some that cannot be. To handle this, Intel introduced its extension for hardware-assisted virtualization called Intel VMX. On Intel CPUs, two new modes, orthogonal to the already existing user and system modes, have been added: root and non-root modes. Root mode is intended for the host, while non-root mode is intended to run guest instructions. The differentiation between system and user, of course, still applies. A graphical representation of the CPU state when running specific domain instructions is shown in Figure 2.3.

Figure 2.3: CPU modes when running specific domain code.

The main goal is to put limitations on what the guest software can do, while having no limitations when the host is running. In particular, there are many instructions that should be trapped when executed in non-root mode (by being in the guest code) and emulated via software by the hypervisor. One example are instructions that modify the content of the %CR3 register. The CR3 register contains the physical base address of the page directory table, which is unique for each running process; most importantly, it will of course differ between host and guest system, as they do not share the same memory space. Therefore, any modification to this register made by the guest should not affect in any way the host one, or at least not interfere with the execution flow of the host processes. In practice the CPU can enter non-root mode via the new VMLAUNCH/VMRESUME instructions and return to root mode via a VMEXIT, for various reasons. The virtual machine state is kept across switches between root and non-root mode, and also between the execution of different virtual machines, thanks to a special data structure held by the hypervisor for each virtual machine, called the Virtual Machine Control Structure. The VMCS contains a lot of information, of which the most important are: the guest and host state, which are respectively the guest and host CPU states before entering root and non-root mode; the VM execution controls, which specify the actions that, if performed in non-root mode, will cause a VMEXIT; and the VM exit reason, which specifies the reason that caused the latest VMEXIT, allowing the hypervisor to behave accordingly.
2.2 QEMU
QEMU, short for Quick EMUlator, is an open source emulator that is capable of hardware virtualization. It is a hosted hypervisor (type 2), because it runs on top of an existing operating system inside the host. As a userspace program, QEMU is able to perform a lot of operations with the support of the system calls offered by the operating system. QEMU is capable of emulating a full computer, including its peripherals, with an impressive range of supported guest OSes and instruction sets. Although its documentation is rather poor and many parts are simply missing, going through the code is enlightening enough to discover how QEMU works in depth. In the following we will talk about what was important for this work, and not about the entire architecture of QEMU.
2.2.1 QEMU features
As stated in the QEMU documentation [13], multiple features are offered by QEMU. It is a fast, cross-platform, open source machine emulator which can emulate a large number of hardware architectures. In particular we used QEMU in one of its possible modes, that is, system emulation. In this mode QEMU emulates a full system, i.e. a computer, which can be customized at will by inserting one or more processors, various main memory sizes and several peripherals, from storage disks to network cards, CD-ROM drives, audio interfaces, USB devices and many more. It is even possible to attach customized virtual devices. QEMU can run without a host kernel driver, meaning that it natively uses dynamic translation of guest code to native code to execute it at a reasonable speed. This technique has been previously referred to as binary translation, and further explanation is out of the scope of this work. QEMU is portable to several operating systems, even if in this work it is only used on Linux kernels. When running full system emulation, QEMU can use an in-kernel accelerator. In this work KVM has been used to greatly boost the virtual machine performance, up to near-native speed. Executing most of the guest code directly on the host processor, with few exceptions, makes QEMU extremely fast, which is of course a much appreciated feature. It is possible to mount a directory shared between host and guest for the purpose of testing or to share contents in case of need. Symmetric Multiprocessing (SMP) is supported in QEMU; to actually use more than one host CPU, an in-kernel accelerator is required.
2.2.2 QEMU-KVM
Kernel-based Virtual Machine is a kernel module that provides virtualization facilities to the user. KVM was a separate project until it was included in mainline Linux, as of version 2.6.20. After a while, the KVM userspace components were included in mainline QEMU, as of version 1.3. So, KVM is a kernel module that allows you to operate with ioctl() on a device file, /dev/kvm, which in turn provides the facilities to build your virtual machine. For example, using the KVM_CREATE_VM ioctl request it is possible to retrieve the file descriptor for a newly created virtual machine, and from then on it is possible to further customize it with other ioctls. One can use KVM_CREATE_VCPU to add a virtual CPU and then start it with KVM_RUN. This ioctl is peculiar because the call behaves like a blocking one, returning only when the virtual machine needs to exit. On exit, an exit code is annotated in a specific data structure that allows us to interpret it and perform the most appropriate action to handle the exit reason. After that, one can simply call ioctl with KVM_RUN again, and so on. Nothing like that needs to be done when using QEMU-KVM, because it is already handled in the QEMU internals, but this was just an example of how things actually work inside QEMU-KVM. What is required is that, when launching QEMU from the command line, one passes the accelerator option, like qemu-system-x86_64 -accel kvm; this way QEMU knows that we would like to use the KVM accelerator if present, and we will eventually achieve better performance for the already discussed reasons. In this work we refer to QEMU as QEMU-KVM, because in all development and test phases that was the adopted technology. As explained in Section 1.3, the aim of this work is to still benefit from the near-native speed provided by hardware-assisted virtualization while being able to extend at will the degree of communication between host and guest.
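The raw ioctl flow described above looks roughly like the following sketch; guest memory setup via KVM_SET_USER_MEMORY_REGION, register initialization and error handling are omitted for brevity.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int kvm  = open("/dev/kvm", O_RDWR);
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);      /* file descriptor of the new VM */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);     /* add virtual CPU number 0      */

    /* The kvm_run structure, shared with the kernel, reports the exit reason. */
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);                  /* blocks until the guest exits  */
        switch (run->exit_reason) {
        case KVM_EXIT_IO:                         /* emulate the I/O access ...    */
            break;
        case KVM_EXIT_HLT:                        /* guest halted: stop the loop   */
            return 0;
        default:                                  /* handle or report other exits  */
            return 1;
        }
    }
}
```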
It is possible to mount a directory shared between host and guest in case of need thanks to VirtFS, a paravirtualized filesystem interface designed for improving passthrough technologies in the KVM environment. It is based on the VirtIO framework and uses the 9P protocol. This technique, which is worth mentioning, has been useful throughout this work for testing and file sharing.
2.2.3 QEMU architecture
It is important to explain how QEMU is structured and how it handles various tasks in order to be able to understand how and where to implement the mechanism described in Section 1.3.
2.2.4 QEMU event loop
QEMU is event-based software, implemented as a multi-threaded userspace application. One of the various threads that compose QEMU is the so-called IOThread. As the name suggests, the IOThread is responsible for handling I/O operations, which in general can be viewed as events. Remember that QEMU is a userspace program, so all the system facilities are available, in particular the poll() system call. Poll allows the event loop to efficiently wait for events to come: a program can monitor multiple file descriptors, waiting until one or more of them become "ready" for some class of I/O operation (e.g., input possible). QEMU needs various data structures to run virtual machines, and it is mandatory to keep them consistent. This is why multi-threading can be dangerous and the whole QEMU software must be written in a thread-safe way. That is why QEMU implements a single big lock, the so-called Big QEMU Lock (BQL). This lock prevents the IOThread from violating the consistency of shared data structures that are crucial to QEMU. Unfortunately, using a lock, especially a global one, comes with a performance penalty: sometimes one or more threads might be stuck in the lock acquisition phase instead of running code. It must also be noted that the way the IOThread and the other threads, which are responsible for running the guest virtual CPUs, are implemented should guarantee a very short period during which any thread holds the lock. A better visualization of how the IOThread operates is given in Figure 2.4.
Figure 2.4: QEMU IOThread execution flow, showing lock and unlock.

A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., a read, or a sufficiently small write) without blocking. Each event is associated with a file descriptor that can be submitted to QEMU with a read handler and a write handler, including a void pointer to some handler-specific data structure. This registration is carried out using the in-QEMU function qemu_set_fd_handler(fd, read_handler, write_handler, opaque).
One can add file descriptors at will; they will be considered when the poll system call monitors multiple file descriptors. Having too many file descriptors registered with the QEMU poll system call might increase the number of handlers needed, but due to the structure of the IOThread, better shown in Figure 2.5, the only code executed while the lock is held is the handler itself. It is therefore very important that the handlers registered with QEMU are very simple and, above all, efficiently designed.

Figure 2.5: QEMU IOThread execution flow, showing lock and unlock.

A file descriptor can refer to any resource of the system, such as files or input/output resources like pipes or network sockets. It is also possible to use file descriptors to allow the VM to communicate with the outside world, like the host or even other remote hosts. With QEMU being a userspace program, it is possible to open a socket and submit it to the qemu_set_fd_handler function, resulting in QEMU monitoring that file descriptor as well, making communication with the host possible.
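As a sketch of how a virtual device might plug a listening socket into the event loop (the device state type and all handler names other than qemu_set_fd_handler() are illustrative, not taken from the device of Chapter 3):

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include "qemu/osdep.h"
#include "qemu/main-loop.h"   /* declares qemu_set_fd_handler() */

/* Hypothetical device state: only what this sketch needs. */
typedef struct MyDevState {
    int listen_fd;
} MyDevState;

static void mydev_read_handler(void *opaque);   /* reads commands/data from the client */

static void mydev_accept_handler(void *opaque)
{
    MyDevState *dev = opaque;
    int client = accept(dev->listen_fd, NULL, NULL);
    if (client >= 0) {
        /* Monitor the new connection as well: read handler only, no write handler. */
        qemu_set_fd_handler(client, mydev_read_handler, NULL, dev);
    }
}

static void mydev_listen_init(MyDevState *dev, uint16_t port)
{
    dev->listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    /* ... bind() to the given port and listen() omitted for brevity ... */

    /* Register the listening socket with the event loop: the IOThread will
     * invoke the accept handler when a client connects. */
    qemu_set_fd_handler(dev->listen_fd, mydev_accept_handler, NULL, dev);
}
```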
2.2.5 QEMU virtual CPUs
QEMU allows running virtual machines with one or more processors. QEMU supports Symmetric Multiprocessing, up to 255 CPUs; you just need to pass the -smp cpus=n option when launching QEMU. It is important to note that further specification of the computing structure can be provided: in particular, the smp option allows the user to specify the number of cores and the number of threads. This feature will be useful in Chapter 4, where we will perform various tests on the performance speedup for SMP virtual machines, which may or may not use hyper-threading. Hyper-threading, HT in short, is the name used by Intel for its proprietary technology for simultaneous multithreading, used to improve parallelization of computation on x86 processors. Back to QEMU: as said, it is possible to run virtual machines with multiple processors. But those processors are not physical, they are virtual CPUs. The guest operating system expects to have a CPU that runs its instructions and, being unmodified software, it of course expects this CPU to actually be capable of executing instructions. In QEMU a virtual CPU, or vCPU, is an abstraction that represents a physical CPU (pCPU for short), so that the guest can interact with it in a transparent way. In practice, the virtual CPU is a data structure that contains information such as register status, flags and so on. Instructions can then be emulated according to the information held by QEMU. When the KVM accelerator is in use, things are a little different, as most instructions will be directly executed on the host processor, resulting in very few instructions to be emulated. In any case, QEMU is structured with one thread per virtual CPU. This means that for every virtual CPU allocated to the virtual machine, through the -smp cpus=n option, a userspace thread is spawned. Each thread is associated with its virtual CPU, and this association can be retrieved programmatically by looping over the QEMU CPUState data structures. Those threads are nothing more than userspace threads, meaning that system calls can be invoked from within them and they are scheduled like every other userspace thread in the system. With that being said, there are no guarantees about scheduling, in the sense that each thread can be scheduled on any physical CPU to carry on code execution, which mostly will be guest code, emulated or directly executed when hardware-assisted virtualization is enabled. Threads can of course be moved between processors, as happens for every other thread, through the standard context switch procedure. In Figure 2.6 we can appreciate how virtual CPU threads are structured, showing once again the presence of the Big QEMU Lock. The figure refers to QEMU-KVM, but in case of emulation it is similar. We can see the lock acquisition and release function calls that, as in the IOThread code in Figure 2.5, hold the lock just for handling the VMEXIT reason. This leads to very simple and highly optimized handlers being executed while the lock is held. Considering that a virtual CPU thread will spend most of its time "inside" the kvm_vcpu_ioctl(cpu, KVM_RUN) function, the occurrence of VMEXITs can be expected to be relatively low, resulting in a high probability for the threads, both vCPU and IOThread, to acquire the lock without blocking.

Figure 2.6: QEMU vCPU threads execution flow, showing lock and unlock.
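For reference, the per-vCPU thread association mentioned above can be walked inside QEMU with something along these lines; this is a sketch using the CPU_FOREACH macro and the thread_id field of CPUState as found in recent QEMU sources.

```c
#include "qemu/osdep.h"
#include "hw/core/cpu.h"   /* CPUState, CPU_FOREACH */

/* Print, for each virtual CPU, the host thread that backs it. */
static void dump_vcpu_threads(void)
{
    CPUState *cpu;

    CPU_FOREACH(cpu) {
        /* cpu_index is the vCPU number, thread_id the host TID of its thread:
         * this is the association that vCPU pinning (Chapter 4) relies on. */
        printf("vCPU %d -> host thread %d\n", cpu->cpu_index, cpu->thread_id);
    }
}
```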
2.2.6 QEMU device emulation
In QEMU, devices are just code: they are an abstraction. What the developer has to do is emulate what the guest operating system is expecting to see on bare-metal hardware. The simplest behavior for a device is being a memory map in which software can read and write. But I/O devices are well known for their peculiarity, that is, side effects on the real world. This is because I/O devices can be of any type, from a keyboard to a printer to a network card, and a read or write at a particular address might trigger a packet to be sent, or a page to be printed. The end goal is to let guest userspace applications interact with the device and handle those side effects. A character device will provide a memory space in which it is possible to read and write, and some side effects associated with those operations, which can be nothing or something real, e.g. printing a page. Interrupt generation is possible inside devices and must also be emulated inside a QEMU device. Interrupt emulation is performed via software, but the guest operating system can detect no difference. A data structure must be defined to describe the device and store its relevant information, e.g. a pointer to the memory area and some variables that might be necessary to keep state. Being software, this actually provides the capability to test additional features without the need to build a real hardware device, which is also a nice feature for developers. In short, you have to define functions for device initialization and realization, write and read handlers, interrupt raising and lowering routines and device unregistering. Device initialization and closing are pretty much standard, especially if you use PCI. Instead, what the write and read handlers do is completely up to the developer. They can simply write in the device memory or can even access host resources because, as already said, QEMU is a userspace program and the device is part of it, so it is possible to use all the system calls present in the host operating system to operate on the physical resources of the host machine, from memory to peripherals of any kind.
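In QEMU this typically takes the shape of a MemoryRegionOps structure whose callbacks implement the read and write handlers. The sketch below is illustrative: the device name, register layout and behavior are invented, not those of the device implemented in Chapter 3.

```c
#include "qemu/osdep.h"
#include "hw/sysbus.h"        /* pulls in MemoryRegion and MemoryRegionOps */

/* Hypothetical device state: a tiny register file backing the memory map. */
typedef struct MyDevState {
    MemoryRegion mmio;
    uint32_t regs[4];
} MyDevState;

static uint64_t mydev_read(void *opaque, hwaddr addr, unsigned size)
{
    MyDevState *s = opaque;
    return s->regs[(addr >> 2) & 3];          /* guest read: return a register */
}

static void mydev_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    MyDevState *s = opaque;
    s->regs[(addr >> 2) & 3] = val;           /* guest write: side effects go here */
}

static const MemoryRegionOps mydev_ops = {
    .read = mydev_read,
    .write = mydev_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

/* At realize time the region is created and mapped into the guest, e.g.:
 *   memory_region_init_io(&s->mmio, OBJECT(s), &mydev_ops, s, "mydev", 16);
 */
```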
2.3 Linux device driver
When talking about how Linux handles userspace access to devices, it is mandatory to refer to device drivers. Linux device drivers are usually Loadable Kernel Modules that provide facilities to userspace, allowing smoother access to devices. This way the operating system can hide all the details of the underlying hardware from the user. Devices are seen as files from userspace, and any userspace program can call open(), read(), write(), ioctl() and all the other well-known system calls that operate on files. Every device in the system is represented by a device special file, e.g. /dev/mydevice. Device files are created with the mknod command, specifying major and minor numbers. All devices controlled by the same device driver share the same major number, while the minor number is used to distinguish between different devices and their controllers. A lot of useful explanations are found in Linux Device Drivers, 3rd Edition [4]; this book offers various examples and a lot of interesting insights about Linux and driver programming. For a device that uses PCI, the driver must behave accordingly, providing pci_probe() and pci_remove() functions. A special data structure called file_operations is provided by the kernel and must be filled in, specifying handlers for the various operations. The most common file operations are read, write, llseek, and unlocked_ioctl. Those handlers are simply functions that behave according to the device: e.g. for a read handler the most important function parameters are (char __user *buf, size_t len, loff_t *off), from which you have to operate. In that case you can read from the device using low-level access functions such as ioread32(base + *off), returning that value from kernel space to userspace using the copy_to_user() kernel function. The need for copy_to_user() arises because the driver is a kernel module, operating in kernel space, while the userspace application that issued the read system call is, of course, a userspace program. In the Linux kernel you can recognize userspace pointers because they are marked with the __user macro. Inside one of those handlers you can use kernel facilities to make the invoking userspace process suspend until a certain event arrives. This is the case when you want to implement blocking operations like blocking reads: userspace can try to read the device but, if the required data is not available, the process will suspend. How to detect the availability of new
data? You might be interested in using a flag, which tells you whether data is ready or not. This flag can be reset once the read has been completed, and set by something else, like an interrupt handler. As said before, interrupt raising and lowering is performed on the device side, but in software, because the device is fully emulated and does not have a physical hardware counterpart. For the guest operating system this difference is not noticeable: the device being virtual is completely transparent to it. An interrupt request handler function must be defined in the device driver, and it will be executed every time an interrupt request is raised by the device. Every interrupt request is paired with a number that in practice tells you how to handle that specific case. In the interrupt handler one simply reads the interrupt status, handles it depending on the status and finally resets the interrupt status by writing its value back to the device. The interrupt handler is summarized in Figure 2.7. In this work a special focus is on character devices, which is the same type as the one implemented and documented in Chapter 3.

Figure 2.7: Linux Device driver interrupt handler.
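A minimal sketch of how such a driver wires its handlers into file_operations might look like the following; the register offsets and names are illustrative, not those of the driver documented in Chapter 3.

```c
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/uaccess.h>

static void __iomem *base;   /* device registers, mapped at probe time (illustrative) */

static ssize_t mydev_read(struct file *filp, char __user *buf,
                          size_t len, loff_t *off)
{
    u32 value = ioread32(base + *off);            /* low-level device access    */

    if (copy_to_user(buf, &value, sizeof(value))) /* kernel space -> userspace  */
        return -EFAULT;
    *off += sizeof(value);
    return sizeof(value);
}

static const struct file_operations mydev_fops = {
    .owner = THIS_MODULE,
    .read  = mydev_read,
    /* .write, .llseek, .unlocked_ioctl, .open, .release ... */
};
```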
2.4 Kprobes
Kprobes is a debugging mechanism for the Linux kernel which can also be used for monitoring events inside a production system. You can use it to find performance bottlenecks, log specific events, trace problems, and so on. Kprobes was developed by IBM and has been merged into the Linux kernel. It enables you to dynamically break into almost any kernel routine and collect debugging and performance information non-disruptively. You can trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit. In this work particular attention has been dedicated to kprobes.
2.4.1 Kprobes features
Currently, two types of kprobes exist: kprobe and kretprobe, the latter also known as return probe. A kprobe fires on function call, while a kretprobe fires on function return. Typically, a kprobe is delivered within a kernel module that one might load and unload in the Linux kernel to include specific dynamic tracing functionality.
First, a kprobe needs to be registered, making a copy of the probed instruction and replacing the first bytes of the probed instruction with a breakpoint instruction, i.e. int3 on i386 and x86_64. When a CPU reaches the breakpoint instruction it is trapped, and the CPU's registers are saved. Control passes to Kprobes, which executes the so-called "pre-handler" associated with the kprobe. Next, Kprobes single-steps its copy of the probed instruction, actually executing it. After that, Kprobes executes the "post-handler", if any, associated with the kprobe. The execution flow then resumes from the breakpoint onwards as if nothing had happened. In case a fault occurs while executing a kprobe handler (pre or post), the user can handle the fault by defining a fault handler. A schematic view is reported in Figure 2.8 for simplicity.
Figure 2.8: Kprobes instruction flow.
Many architectures are supported by kprobes, including the most used ones like x86_64 and arm. To use kprobes it is mandatory to build the kernel with CONFIG_KPROBES=y, even if in recent kernels this is the default. As mentioned, not all kernel addresses can be kprobed: Kprobes can probe most of the kernel except itself. This means that there are some functions that kprobes cannot probe; probing them could cause a recursive trap (e.g. a double fault), or the nested probe handler might never be called. Kprobes manages such functions as a blacklist: it checks the given probe address against the blacklist and rejects the registration if the address is blacklisted. One can attach a kprobe to a system call or to a kernel function; you just need to find its name in cat /proc/kallsyms. It is worth mentioning that kprobes allows multiple probes at the same address, and they (or multiple instances of the same handler) can potentially run concurrently on different CPUs since Linux v2.6.15-rc1. Kprobe handlers run with preemption or interrupts disabled, depending on the architecture. The overhead introduced by the usage of kprobes is low: as stated in the Kprobes documentation [1], on a typical CPU in use in 2005 a kprobe hit takes 0.5 to 1.0 microseconds to process, and we can expect even better results on today's CPUs. Kprobes is an excellent tool for debugging and tracing, and it can also be used for performance measuring. Unfortunately, for this work kprobes alone is not enough, and that is why we will introduce the state of the art of the Berkeley Packet Filter technology.
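For completeness, a return probe is registered much like the kprobe module sketched in Chapter 1; the probed symbol below is only an example and depends on the kernel version.

```c
#include <linux/module.h>
#include <linux/kprobes.h>

/* Return handler: runs when the probed function returns. */
static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    /* The function's return value is taken from the saved registers. */
    pr_info("probed function returned %lu\n", regs_return_value(regs));
    return 0;
}

static struct kretprobe krp = {
    .kp.symbol_name = "_do_fork",  /* example target on kernel 5.4; names vary */
    .handler        = ret_handler,
    .maxactive      = 16,          /* concurrent instances allowed */
};

static int __init krp_init(void)  { return register_kretprobe(&krp); }
static void __exit krp_exit(void) { unregister_kretprobe(&krp); }

module_init(krp_init);
module_exit(krp_exit);
MODULE_LICENSE("GPL");
```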
2.5 Berkeley Packet Filter
We will now focus on the main aspects of Berkeley Packet Filter and its extended version known as extended BPF (eBPF).
2.5.1 BPF history
BPF stands for Berkeley Packet Filter, a technology developed in 1993 and published in The BSD Packet Filter: A New Architecture for User-level Packet Capture, by S. McCanne and V. Jacobson [10], designed to improve the performance of packet capture tools. The original BPF was quite limited: in fact it only supported very few memory slots and CPU registers. Dynamic tracing was first based on a technique used by debuggers to insert breakpoints at arbitrary instruction addresses. With dynamic tracing, the target software records information and then automatically continues execution rather than passing control to an interactive debugger. Dynamic tracing tools were developed, and included tracing languages, but these tools remained obscure and little used, in part because they involved considerable risk: dynamic tracing requires modifying instructions in an address space, live, and any error could lead to immediate corruption and process or kernel crashes. Dynamic tracing for kernel functions (kprobes) was finally added to Linux in 2004, although it was still not well known and was still difficult to use. Linux added dynamic instrumentation for user-level functions in 2012, in the form of uprobes. In 2013, Alexei Starovoitov proposed a major rewrite of BPF [15], which was further developed by Alexei Starovoitov and Daniel Borkmann and included in the Linux kernel in 2014 [2], calling it extended BPF. From this point on, we will refer to extended BPF using both the eBPF and BPF terms with the same meaning. A lot of new BPF program types emerged and, from that time onwards, an increasing number of developers have become interested in this technology, leading to a lot of tools for tracing and security. For tracing, the main ones are two IO Visor projects: bcc [3] and bpftrace [14]. Many of those eBPF tracing tools use both kprobes and uprobes for dynamic tracing of the full software stack.
2.5.2 BPF evolution
Referring to the major rewrite that happened in 2013, followed by the merge into the Linux kernel in 2014, Table 2.1 shows the major improvements to BPF, which determined better compatibility with newer architectures such as x86_64. In fact, as technology advanced, classic BPF remained the same, leading to very small programs (see the 16 memory slots) and restricted capabilities, due to the few possible kernel calls, which were also JIT dependent. The register set was also somewhat small, with just two registers. With eBPF the registers grew in number and width, as did the storage space, with a considerably larger stack and, most importantly, potentially unlimited storage using BPF maps.
Factor                   Classic BPF                  Extended BPF
Register count           2: A, X                      10: R0-R9, plus R10 as a read-only frame pointer
Register width           32-bit                       64-bit
Storage                  16 memory slots: M[0-15]     512 bytes of stack space, plus "infinite" map storage
Restricted kernel calls  Very limited, JIT specific   Yes, via the bpf_call instruction
Event targets            Packets, seccomp-BPF         Packets, kernel functions, user functions, tracepoints, user markers, PMCs

Table 2.1: Classic BPF versus Extended BPF.
Given its origin, eBPF is especially suited to writing network programs: it is possible to write programs that attach to a network socket to filter traffic, to classify traffic, and to run network classifier actions. It is even possible to modify the settings of an established network socket with an eBPF program. The eXpress Data Path (XDP) project, in particular, uses eBPF to do high-performance packet processing by running eBPF programs at the lowest level of the network stack, immediately after a packet is received, instead of processing packets at user level. Another type of filtering performed by the kernel is restricting which system calls a process can use; this is done with seccomp BPF. eBPF is also useful for debugging the kernel and carrying out performance analysis: programs can be attached to tracepoints, kprobes, and perf events.
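To make the XDP idea concrete, the following is a minimal sketch of an XDP program that simply lets every received packet continue up the stack. The SEC() macro comes from the kernel's BPF helper headers, whose exact location varies with the kernel version; the program and function names are illustrative only.

#include <linux/bpf.h>
#include "bpf_helpers.h"        /* provides the SEC() macro; header location varies by kernel tree */

/* Runs at the lowest level of the network stack, right after packet reception. */
SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    return XDP_PASS;            /* let every packet continue up the stack */
}

char _license[] SEC("license") = "GPL";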
2.5.3
BPF program types
The eBPF program type determines the subset of kernel helper functions that the program may call (see man 7 bpf-helpers). The program type also determines the program input (context), that is, the format of struct bpf_context, the data blob passed to the eBPF program as its first argument. The main eBPF program types are shown in Table 2.2.
When the loaded program is of type BPF_PROG_TYPE_KPROBE, the context, i.e. the program input, is a struct pt_regs *, a pointer to the register state at the moment the kprobe to which the program is attached fired. This is especially useful because on x86_64 the calling convention passes function arguments in CPU registers in a specific order. Thanks to macros such as PT_REGS_PARM1(ctx), it is possible to retrieve the arguments of the function that triggered the kprobe, in this case the first one.
eBPF Program Type                 Group
BPF_PROG_TYPE_SOCKET_FILTER
BPF_PROG_TYPE_SOCK_OPS
BPF_PROG_TYPE_SK_SKB              Socket-related program types
BPF_PROG_TYPE_XDP                 eXpress Data Path related program types
BPF_PROG_TYPE_KPROBE
BPF_PROG_TYPE_TRACEPOINT
BPF_PROG_TYPE_PERF_EVENT          Tracing program types
BPF_PROG_TYPE_CGROUP_SKB
BPF_PROG_TYPE_CGROUP_SOCK         Cgroups-related program types
BPF_PROG_TYPE_LWT_IN
BPF_PROG_TYPE_LWT_OUT
BPF_PROG_TYPE_LWT_XMIT            Lightweight tunnel program types

Table 2.2: eBPF program types explained.
On the other hand, for a program of type BPF_PROG_TYPE_KPROBE all packet-manipulating bpf helpers are forbidden, and loading a program that contains a call to such a helper results in an error. A minimal sketch of a kprobe-type program is shown below.
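As an illustration, the following sketch shows a kprobe-type program receiving its struct pt_regs * context and extracting the first argument of the probed function. The probed symbol, the header locations and the build setup are assumptions that depend on the kernel version.

#include <linux/ptrace.h>       /* struct pt_regs */
#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC() and helper prototypes; on some trees the
                                 * PT_REGS_PARM*() macros live in bpf_tracing.h  */

SEC("kprobe/do_sys_open")       /* example probe point, any kernel function works */
int probe_entry(struct pt_regs *ctx)
{
    /* On x86_64 the first argument is passed in a register; PT_REGS_PARM1()
     * extracts it from the saved register set pointed to by ctx.            */
    long first_arg = PT_REGS_PARM1(ctx);

    char fmt[] = "first argument: %ld\n";
    bpf_trace_printk(fmt, sizeof(fmt), first_arg);
    return 0;
}

char _license[] SEC("license") = "GPL";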
2.5.4
BPF system calls
Interaction with the BPF mechanism, which resides inside the kernel, is performed through the bpf(cmd) system call. The bpf system call allows a range of eBPF-related operations to be performed: in particular, one can load a bpf program or access shared data structures such as eBPF maps. Many commands cmd can be passed to the bpf system call, the most used being BPF_PROG_LOAD, BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM and BPF_MAP_DELETE_ELEM. There are also wrappers, defined in linux/bpf.h, such as bpf_map_lookup_elem(map, key, value), that provide the userspace programmer with smoother access to the bpf infrastructure, instead of having to use the system call directly every time. Further details can be found in man 2 bpf.
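As a concrete, hedged sketch of what those wrappers hide, the following userspace fragment invokes the bpf() system call directly to create a hash map and to update and look up one element; error handling is reduced to a minimum and the map sizes are arbitrary.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static long sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    union bpf_attr attr;
    __u32 key = 1, value = 42, readback = 0;
    long map_fd;

    /* BPF_MAP_CREATE: hash map with 4-byte keys and 4-byte values */
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(key);
    attr.value_size  = sizeof(value);
    attr.max_entries = 16;
    map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (map_fd < 0) {
        perror("BPF_MAP_CREATE");
        return 1;
    }

    /* BPF_MAP_UPDATE_ELEM: insert (key, value) */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (__u64)(unsigned long)&key;
    attr.value  = (__u64)(unsigned long)&value;
    attr.flags  = BPF_ANY;
    sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

    /* BPF_MAP_LOOKUP_ELEM: read the value back */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (__u64)(unsigned long)&key;
    attr.value  = (__u64)(unsigned long)&readback;
    sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));

    printf("key %u -> value %u\n", key, readback);
    return 0;
}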
2.5.5
BPF maps
Maps are a generic data structure for the storage of different types of data. They allow sharing of data between eBPF kernel programs, as well as between kernel and userspace applications. Each map is characterized by the following attributes: type, maximum number of elements, and key and value size in bytes. The most used map types are summarized in Table 2.3.
eBPF Map Type               Explanation
BPF_MAP_TYPE_HASH           Hash-table map; key/value pairs are allocated and freed by the kernel.
BPF_MAP_TYPE_ARRAY          Array map optimized for the fastest possible lookup; all elements are pre-allocated and zero-initialized at init time.
BPF_MAP_TYPE_PROG_ARRAY     Program array map, containing file descriptors referring to other eBPF programs.

Table 2.3: eBPF map types explained.
The bpf() system call can be invoked with a large range of commands, many of which are used to manage eBPF maps. For every bpf program it is advisable to choose the best suited eBPF map type carefully, as it may impact performance.
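To show how the map attributes look in practice, the following hedged sketch defines a hash map and an array map the way the in-tree samples do, through struct bpf_map_def placed in the "maps" ELF section; newer libbpf versions use BTF-defined maps instead, and the map names and sizes here are arbitrary.

#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC() and struct bpf_map_def; location varies by kernel tree */

/* Hash map: entries are allocated and freed by the kernel on demand. */
struct bpf_map_def SEC("maps") pid_counter = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(__u32),   /* e.g. a PID            */
    .value_size  = sizeof(__u64),   /* e.g. an event counter */
    .max_entries = 1024,
};

/* Array map: all elements pre-allocated and zero-initialized, fastest lookup. */
struct bpf_map_def SEC("maps") config = {
    .type        = BPF_MAP_TYPE_ARRAY,
    .key_size    = sizeof(__u32),   /* array index */
    .value_size  = sizeof(__u64),
    .max_entries = 16,
};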
2.5.6
BPF in-kernel verifier
When a program is loaded, the in-kernel verifier checks it in two stages before accepting it. The first test ensures that the eBPF program terminates and does not contain any loops that could cause the kernel to lock up. This is checked by doing a depth-first search of the program's control flow graph (CFG). Unreachable instructions are strictly prohibited; any program that contains unreachable instructions will fail to load. The second stage is more involved and requires the verifier to simulate the execution of the eBPF program one instruction at a time. The virtual machine state is checked before and after the execution of every instruction to ensure that register and stack state are valid. Out-of-bounds jumps are prohibited, as is accessing out-of-range data. The verifier does not need to walk every path in the program: it is smart enough to know when the current state of the program is a subset of one it has already checked. Since all previous paths must be valid (otherwise the program would already have failed to load), the current path must also be valid. This allows the verifier to "prune" the current branch and skip its simulation, making verification faster overall. The verifier also has a "secure mode" that prohibits pointer arithmetic. Secure mode is enabled whenever a user without the CAP_SYS_ADMIN privilege loads an eBPF program. The idea is to make sure that kernel addresses do not leak to unprivileged users and that pointers cannot be written to memory. If secure mode is not enabled, then pointer arithmetic is allowed, but only after additional checks are performed. For example, all pointer accesses are checked for type, alignment, and bounds violations. Registers with uninitialized contents (those that have never been written to) cannot be read; doing so causes the program load to fail. The contents of registers R0-R5 are marked as unreadable across function calls by storing a special value, to catch any reads of an uninitialized register. Similar checks are done for reading variables on the stack and to make sure that no instruction writes to the read-only frame-pointer register.
Lastly, the verifier uses the eBPF program type (covered in 2.5.3) to restrict which kernel functions can be called from eBPF programs and which data structures can be accessed. Some program types are allowed to directly access network packet data, for example. Programs are loaded using the bpf() system call with the BPF_PROG_LOAD command; the same system call offers many other commands, described in the bpf() man page. eBPF has many possible program types. The type of program loaded with BPF_PROG_LOAD dictates four things: where the program can be attached, which in-kernel helper functions the verifier will allow to be called, whether network packet data can be accessed directly, and the type of object passed as the first argument to the program. In fact, the program type essentially defines an API. New program types have even been created purely to distinguish between different lists of allowed callable functions. In this work BPF_PROG_TYPE_KPROBE is used, but any other program type would be usable if the desired features fit that program type. In particular, the kprobe program type is the relevant one here, as it allows an eBPF program to be fired just like a kprobe while enjoying the benefits of being an eBPF program. Generally, eBPF programs are loaded by a user process and automatically unloaded when the process exits. This realizes what has been called "extensibility", while the in-kernel verifier realizes what has been called "security".
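To give a feel for the first verification stage, the following hedged sketch is a kprobe-type program that the verifier refuses to load: the depth-first search of its control flow graph finds an unbounded loop (a back-edge), and the return statement after it is unreachable at the source level. The probed symbol and the header location are illustrative assumptions.

#include <linux/ptrace.h>
#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("kprobe/kernel_function")   /* illustrative probe point */
int rejected_by_verifier(struct pt_regs *ctx)
{
    /* Unbounded loop: the CFG contains a back-edge, so the load fails. */
    for (;;)
        ;

    /* Unreachable at the source level; programs containing unreachable
     * instructions are rejected as well.                               */
    return 0;
}

char _license[] SEC("license") = "GPL";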
2.5.7
BPF helpers
As of Linux kernel 5.9.0 (very little has changed since 5.4.0), bpf helpers provide facilities to perform tasks that would otherwise be very complicated. The most used are the ones that interact with BPF maps and those that read and write memory regions using pointers obtained through the previously explained PT_REGS_PARM#() macros. Make sure to check which bpf-helpers your Linux kernel version provides, as the documentation is not always up to date. Due to eBPF conventions, a helper cannot have more than five arguments. Internally, eBPF programs call directly into the compiled helper functions without requiring any foreign-function interface; as a result, calling helpers introduces no overhead and offers excellent performance. Helpers allow the bpf programmer to "do more" in a controlled way, depending on the program type. The most important bpf helpers are reported in Table 2.4; a usage sketch follows the table.
BPF helper                                            Explanation
bpf_map_lookup_elem(*bpf_map, *key)                   Perform a lookup in map for the entry associated to key.
bpf_map_update_elem(*bpf_map, *key, *value, flags)    Add or update the value of the entry associated to key in map with value.
bpf_map_delete_elem(*bpf_map, *key)                   Delete the entry with key from map.
bpf_probe_read(*dst, size, *unsafe_ptr)               For tracing programs, safely attempt to read size bytes from the kernel space address unsafe_ptr and store the data in dst.
bpf_trace_printk(...)                                 A "printk()-like" facility for debugging; the output can be read from userspace by invoking the read_trace_pipe() function. For debug purposes only, not to be used in production.

Table 2.4: The most important eBPF helpers explained.
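The following hedged sketch shows how these helpers are typically combined in a tracing program: a pointer argument of the probed function is copied with bpf_probe_read() and printed with bpf_trace_printk(). The probed symbol, the assumption that its second argument is a string pointer, and the header locations are illustrative only.

#include <linux/ptrace.h>
#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC(), helper prototypes; PT_REGS_PARM*() may
                                 * live in bpf_tracing.h on newer trees          */

SEC("kprobe/do_sys_open")       /* illustrative probe point */
int trace_filename(struct pt_regs *ctx)
{
    /* Assumed here: the second argument points to a string. */
    const char *unsafe_ptr = (const char *)PT_REGS_PARM2(ctx);
    char buf[64] = {};

    /* Safely copy from a possibly-faulting address into the BPF stack. */
    bpf_probe_read(buf, sizeof(buf), unsafe_ptr);

    /* Debug output, readable from the kernel trace pipe. */
    char fmt[] = "opened: %s\n";
    bpf_trace_printk(fmt, sizeof(fmt), buf);
    return 0;
}

char _license[] SEC("license") = "GPL";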
2.5.8
BPF in-tree compiling
One way, and also the only one I found to be easily usable, is to perform an in-tree compilation of bpf programs under /samples/bpf, using all the bpf-helpers and also the bpf_load_file() function offered by that kernel version. bpf_load_file() is used inside the userspace program to load an object file as a bpf program inside the kernel; it is a convenience function that loads a bpf program in one line, instead of using the bpf() system call with the BPF_PROG_LOAD command. Compiling a bpf program, which is written in the C language, produces an object file that is then used to load the bpf program inside the kernel. BPF program loading exploits ELF sections, allowing maps and programs to be attached to be defined in the same file but under different ELF sections. An example of an ELF section for an eBPF map is struct bpf_map_def SEC("maps") mymap, while for a bpf program to be executed on a function call, like a kprobe, it is necessary to prepend SEC("kprobe/kernel_function"). What you need to do is obtain the Linux kernel source code and compile the /samples/bpf content using the make M=samples/bpf command; both the userspace program and the bpf program should be compiled this way. Out-of-tree compiling is smooth for higher level tools such as bcc and bpftrace, but it remains complicated and little documented for low level tools like the one we are going to discuss in Chapter 3. A sketch of such a file is shown below.
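Putting the pieces together, a hedged sketch of how a map and a kprobe program coexist in one samples/bpf-style object file is shown here; the file would be dropped into samples/bpf and built from the kernel source root with make M=samples/bpf. The map, the probed symbol and the file name are assumptions for illustration.

/* myprog_kern.c -- compiled in-tree with: make M=samples/bpf */
#include <linux/ptrace.h>
#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Placed in the "maps" ELF section, so the loader creates it as an eBPF map. */
struct bpf_map_def SEC("maps") hits = {
    .type        = BPF_MAP_TYPE_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u64),
    .max_entries = 1,
};

/* Placed in the "kprobe/<symbol>" ELF section, so the loader attaches it
 * as a kprobe on that kernel function when the object file is loaded.    */
SEC("kprobe/kernel_function")
int count_calls(struct pt_regs *ctx)
{
    __u32 key = 0;
    __u64 *v = bpf_map_lookup_elem(&hits, &key);

    if (v)
        __sync_fetch_and_add(v, 1);   /* atomic increment */
    return 0;
}

char _license[] SEC("license") = "GPL";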
2.5.9
BPF exploitation
eBPF enables this work: it allows us to trace the guest kernel in a secure way. By doing so we can use tracing information in the hypervisor to boost virtual machine performance, or simply trace the guest from the outside. The next chapter describes how the generic mechanism is structured and its main aspects.
Extensible paravirtualization
This chapter describes how the extensible paravirtualization mechanism has been implemented. We will first describe it from a high-level perspective, highlighting every important aspect, and then detail its implementation precisely.
3.1
Description
As stated, qemu-kvm has been adopted for all development and testing phases, resulting in an overall system in which hardware-assisted virtualization is in use.