
EMBEDDED COMPUTING SYSTEMS

Temporal and Spatial Isolation in Hypervisors for Multicore Real-Time Systems

Author:

Paolo Modica

Supervisors:

Prof. Giorgio Buttazzo

Dr. Alessandro Biondi

Institute of Communication, Information and Perception Technologies (TeCIP-SSSA) and Department of Information Engineering (UNIPI)


Abstract

The growing demand for new functionality in modern embedded real-time systems has led chip makers to produce modern multicore platforms. This trend has also increased the need for robust and efficient mixed-criticality systems that must share the same hardware platform. Hardware virtualization has established itself as a de-facto solution for realizing such systems, aiming at guaranteeing timing and security requirements.

This thesis addresses the problem of providing spatial and temporal isolation between execution domains in a hypervisor running on an ARM multicore platform. The goal is to achieve predictable interference among domains without relying on any information about their behavior and/or configuration, thus enabling the execution of time-sensitive (and possibly safety-critical) guests that are resilient to misbehaviors, cyber attacks, or excessive demands for computational resources that may affect less critical domains. For instance, the proposed design solution allows the integration of a real-time operating system with a general-purpose operating system (e.g., Linux), which today is a common need in many industrial fields.

Isolation is achieved by carefully managing the two primary shared hardware resources of today's multicore platforms: the last-level cache (LLC) and the DRAM memory controller. The XVISOR open-source hypervisor and the ARM Cortex-A7 platform have been used as reference systems for the purpose of this work.

Spatial partitioning of the LLC has been implemented by means of cache coloring, which allows reserving a given portion of the cache memory to each domain, thus avoiding mutual cache evictions by design. In this work, cache coloring has been tightly integrated with the ARM virtualization extensions (ARM-VE) to deal with the memory virtualization capabilities offered by a two-stage memory management unit (MMU) architecture. Temporal isolation on the DRAM controller has been implemented by realizing a memory bandwidth reservation mechanism, which allows reserving (and contextually limiting) a given number of memory accesses across a periodic time window. The reservation mechanism leverages performance counters and specific interrupt signals available on various ARM platforms and has been integrated with the scheduling logic of XVISOR when managing the execution of the virtual CPUs.

An extensive experimental evaluation has been performed on the Raspberry Pi 2 board, showing the effectiveness of the implemented solutions on a case study composed of multiple Linux guests running state-of-the-art benchmarks. In particular, both cache coloring and memory reservation proved to ensure strong isolation among domains, with a significant improvement in worst-case execution times due to the limited (or null) contention delays introduced by such shared resources. No relevant paybacks in terms of run-time overhead have been observed.

The results of this thesis received considerable attention from the XVISOR community and are going to be integrated into a future release of the hypervisor.


Contents

Abstract i

Introduction 1

1 State of the Art 4

1.1 Virtualization and Hypervisors . . . 4

1.1.1 Types of Virtualization . . . 6

1.1.2 Types of Hypervisors . . . 8

1.2 ARM Virtualization Extension (VE) . . . 9

1.2.1 Memory translation . . . 10

1.2.2 Large Physical Address Extensions (LPAE) . . . 12

1.2.3 GIC support for virtualization . . . 13

2 Xvisor 16

2.1 Architecture . . . 16

2.2 Hypervisor Timer . . . 17

2.3 Hypervisor Manager . . . 18

2.4 Hypervisor Scheduler . . . 20

2.5 Other Modules . . . 22

3 Isolation on multicore platforms 23

3.1 Contention due to shared cache levels . . . 24

3.2 Memory bandwidth contention . . . 25

4 Cache Partitioning 27

4.1 ARM Cache Architecture . . . 28

4.2 Cache Coloring . . . 29

4.3 Implementation details on Xvisor . . . 31

4.3.1 Guest’s memory management on Xvisor . . . . 31

4.3.2 Changes applied to Xvisor . . . 33

5 Memory Throttling 36

5.1 Memory Reservation Architecture Design . . . 37

5.2 Implementation details on Xvisor . . . 39

5.2.1 Guest scheduling management on Xvisor . . . . 39

5.2.2 ARM PMU Support . . . 41

5.2.3 Memory Reservation System implementation on Xvisor . . . 43


6 Tools 49

6.1 ARM Fast Models . . . 49

6.2 ARM DS-5 Development Studio . . . 50

6.3 Hardware . . . 50

6.4 Test Benchmarks . . . 51

7 Experimental Results 52

7.1 Cache Coloring Results . . . 52

7.2 Memory Reservation Results . . . 57

8 Conclusion 59

8.1 Future works . . . 60


List of Figures

1.1 Virtual Machine Map . . . 5

1.2 Mixed OS Environment . . . 6

1.3 Type-1 Hypervisor . . . 8

1.4 Type-2 Hypervisor . . . 9

1.5 ARM Cortex-A with VE Processor’s modes . . . 10

1.6 Stage 2 translation (image from [5]) . . . 12

1.7 Format of long-descriptor table entries (image from [5]) . . . 13

1.8 VA to IPA translation (image from [5]) . . . 13

1.9 GIC with an ARM processor that supports virtualization (image from [6]) . . . 14

2.1 Xvisor Software Architecture (image from [8]) . . . 17

2.2 Xvisor VCPU state machine . . . 19

2.3 Emulated guest IO event on Xvisor (figure from [8]) . . 21

2.4 Host interrupts handling on Xvisor . . . 21

3.1 An example of a two-level cache hierarchy . . . 24

3.2 Test by Lockheed Martin Space Systems on 8-core platform . . . 25

3.3 Simultaneous accesses to main memory by different cores lead to contention . . . 26

4.1 Cache partitioning approaches: (a) index-based (b) way-based . . . 27

4.2 Cache Terminology (image from [5]) . . . 28

4.3 Cortex A7 with 512 KB of L2 shared cache, address bits . . . 29

4.4 Hypervisor’s cache coloring architecture . . . 30

4.5 Guest Address Space Init Flow . . . 32

4.6 Data/Instr. Abort handling flow . . . 32

4.7 New Guest Address Space Init Flow . . . 34

5.1 Memory access interference example . . . 37

5.2 Guests’ State Machine for Memory Reservation . . . 38

5.3 Memory Reservation Scheduling Example . . . 39

5.4 Scheduler Change State Flows . . . 40

5.5 Cortex A7 PMU Architecture Diagram (image from [11]) . . . 41

5.6 PMU setting/reading flow . . . 42

5.7 New Xvisor State Machine . . . 45

5.8 Recharging Timer Interrupt Handler Flow . . . 46


7.1 AVG and MAX overall comparison . . . 53

7.2 AVG and MAX 0-256 KB comparison . . . 55

7.3 AVG and MAX 256-512 KB comparison . . . 55

7.4 Iterations Execution time accessing 120 KB of memory . . . 56

7.5 Iterations Execution time accessing 384 KB of memory . . . 56

7.6 Comparison accessing 5 MB of memory varying the bandwidth . . . 57

7.7 AVG execution times comparison varying the bandwidth . . . 58

7.8 MAX execution times comparison varying the bandwidth . . . 58


Codes

4.1 Guest DTS region example . . . 31

4.2 New Xvisor Region . . . 33

4.3 Xvisor Region Mapping . . . 33

4.4 Check Color function . . . 35


List of Abbreviations

ECU Electronic Control Unit

LLC Last Level Cache

VE Virtualization Extension

VM Virtual Machine

VMM Virtual Machine Monitor

PA Physical Address

IPA Intermediate Physical Address

HVC HyperVisor Call

PL# Privilege Level #

LPAE Large Physical Address Extensions

ASID Address Space ID

VMID Virtual Machine ID

GIC Generic Interrupt Controller

MMU Memory Management Unit

PMMU Paged Memory Management Unit

GCC GNU C Compiler

VCPU Virtual CPU

DTS Device Tree Script

DTC Device Tree Compiler

DTB Device Tree Blob

PRR Priority Round Robin

RM Rate Monotonic

WCET Worst-Case Execution Time

GPA Guest Physical Address

HPA Host Physical Address

WFI Wait For Interrupt


Introduction

Real-time systems are computing systems that must react within precise time constraints to events in the environment. As a consequence, the correct behavior of these systems depends not only on the value of the computation but also on the time at which the results are produced [1]. Results produced too late are useless and in some cases even dangerous for the system. Typically, real-time systems are embedded in larger systems with a control purpose; that is why we often talk about "real-time embedded systems".

Embedded computing systems have grown exponentially in recent years and are used for many types of application. Today, most commonly used objects contain an embedded system; indeed, approximately 98% of the processors in the world are integrated into everyday objects. Some examples of application fields are the following: avionics, automotive, robotics, industrial automation, multimedia systems, consumer electronics, security, smart home, etc.

Every day, embedded systems become more complex because they need more functionality, better performance, better efficiency, and must adapt to the new hardware platforms on the market. To appreciate the remarkable increase in functionality required of new embedded systems, consider the exponential growth of features within cell phones, today called smartphones, from the 80s to today. Another example of this phenomenon is observable in the automotive field: modern cars provide many new features to support the driver and to increase safety and comfort in ways that were unimaginable until a few years ago.

Focusing on the automotive field, which is one of the main application domains of real-time embedded systems, we can observe how, at the moment, each function is managed by a single dedicated ECU (Electronic Control Unit). This choice provides many benefits in the system design and development process, making it easier to implement, test, certify and maintain. It is easy to understand that, with this policy, the number of ECUs in a car is directly proportional to the number of functions; the exponential increase in functionality and ECUs brings problems of space, weight, energy and cost. For these reasons we cannot continue this way: we need to look for alternative solutions that leverage a single ECU to handle more functions, exploiting the potential of the new multicore platforms.


When several functions are integrated on the same platform, however, their behavior can no longer be easily guaranteed; moreover, it is not trivial to test and certify the new system. The main problem is related to the shared resources, which are a source of interference between the functions. Interference can occur for several reasons: temporal interference caused by concurrent accesses to shared resources, spatial interference linked to the use of shared memory spaces (cache, DRAM, MMC), energy-related interference related to battery usage, etc. Each type of interference is important because its presence has a significant impact on efficiency and schedulability and, moreover, reduces predictability, jeopardizing the system's safety.

How to bound such interference on the new multicore platforms is an open problem, and various types of solutions are being investigated. One of the most important techniques used to address this type of problem is virtualization and, in particular, the use of hypervisors.

A hypervisor is a piece of software whose task is to manage hardware resources by assigning them to the various applications that require them. This kind of hardware resource management must be transparent to applications, so the hypervisor has the task of virtualizing the hardware resources that will be assigned to the various guests. In this way, a guest has the impression of using the hardware resource while actually using a virtualized copy of it; it is then the task of the hypervisor to handle the real hardware resource. Using a hypervisor, you can take two different operating systems and run them simultaneously on the same multicore platform, making each believe it is the only one using it.

Given the features listed, a hypervisor would seem the ideal solution, because two or more operating systems (guests), with their applications, could run on the same hardware platform without any modification. At the moment the problem is that hypervisors are not designed for managing software with real-time requirements: the various guests are indeed isolated, but they may interfere with each other through the use of some shared resources. The fundamental resources where guests can create interference are the shared cache, also called the last-level cache (LLC), and the DRAM memory.

This thesis aims to find a solution to the problem of spatial isolation of the shared cache memory and temporal isolation of the DRAM memory. To achieve this target we have studied and compared different techniques already known in the literature, designed for ordinary operating systems, adapting them to be used within an existing hypervisor. As regards the spatial isolation of the shared cache, we chose to implement cache coloring, so as to partition the shared cache and assign each guest a subset of the total cache. For the temporal isolation of the DRAM memory, we chose to apply memory throttling techniques to enforce a maximum bandwidth within a given period for each guest. The hypervisor used to implement these techniques is Xvisor, a type-1 open-source hypervisor. To validate the implemented techniques, various tests were carried out using a quad-core hardware platform on which two independent guests were loaded to use the above resources; we could see how, thanks to the changes made, it is possible to increase the predictability of each guest and limit mutual interference.


Chapter 1

State of the Art

In this chapter, we will discuss virtualization in more detail and how it is currently exploited in computing systems.

In the first part, we will focus on virtualization in general and on the basic concepts needed to understand how a hardware system can be virtualized; after that we will analyze hypervisors in more detail, underlining the differences between the various existing types with their relative advantages and disadvantages.

In the second part, we will look at the ARM VE (Virtualization Extension) architecture, designed to provide native hardware support for running a hypervisor.

1.1 Virtualization and Hypervisors

Virtualization means the possibility of abstracting hardware components to make them available to software in the form of virtual resources. Virtualization began in the 1960s as a method of logically dividing the system resources provided by mainframe computers between different applications. Since then, the meaning of the term has broadened [2].

Hardware virtualization or platform virtualization refers to the creation of a virtual machine that acts like a real computer with an operating system. Software executed on these virtual machines is separated from the underlying hardware resources by a new software layer. The new software layer is called the hypervisor or virtual machine monitor (VMM) and has the task of creating and running virtual machines (VMs). The hardware platform on which a hypervisor runs its virtual machines is called the host machine, and each virtual machine is called a guest machine.

The most important virtualization properties are isolation, encapsulation, and interposition. The isolation property must guarantee fault confinement within a VM, separation of the software running on the various VMs, and also performance isolation through scheduling and resource allocation. The encapsulation property must ensure that the whole state of a VM can be captured and saved in a file, so that an operation on a VM is equivalent to a file modification, and the complexity will be proportional to the virtual hardware model and independent of the guest software configuration. The interposition property means that all guest actions go through the virtualizing software (VMM), which can inspect, modify and deny operations.

Because of the properties just described, system virtualization provides different benefits [3]:

• Multiple secure environments: a system VM provides a sandbox that isolates one system environment from other environments;

• Failure isolation: Virtualization helps isolate the effects of a failure to the VM where the failure occurred;

• Mixed-OS environment: A single hardware platform can support multiple operating systems concurrently;

• Better system utilization: A virtualized system can be (dynamically or statically) re-configured for changing needs.

Formal virtualization definition

Formally, virtualization involves the construction of an isomorphism that maps a virtual guest system to a real host system (see figure 1.1) [4]. The function F maps the guest state to the host state. For each sequence of operations $e_i(S_i)$ that modifies a guest state, there must exist a corresponding sequence $e'_i(S'_i)$ in the host that performs an equivalent modification.

FIGURE 1.1: Virtual Machine Map
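
The isomorphism of figure 1.1 can be restated compactly as a commutativity condition; the following is a sketch of the usual formulation, with F the state mapping and e_i, e'_i the guest and host operation sequences introduced above.

% Applying a guest operation and then mapping the state must equal
% mapping the state first and then applying the equivalent host operation.
\[
  \forall i:\qquad F\bigl(e_i(S_i)\bigr) \;=\; e'_i\bigl(F(S_i)\bigr)
\]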

There are three important properties of the environment created by a VMM:

• Equivalence: The behavior of a running program in a VM must be the same as when executing on a real machine;

• Resource control: All the virtualized resources must be under the control of the VMM, and a VM cannot arbitrarily affect the system resources;


• Efficiency: All innocuous instructions are executed by the hardware directly, with no intervention at all on the part of the control program.

FIGURE 1.2: Mixed OS Environment

Virtualization allows a mixed-OS environment (figure 1.2) in which multiple VMs can be run on the same hardware platform to provide individuals or user groups with their own OS environments. With this model a two-level hierarchical scheduling framework is needed: the VMM layer must have a global scheduler that manages the execution of the VMs, while each VM can have its own local scheduler to handle the execution of its tasks.

1.1.1 Types of Virtualization

We can classify the virtualization methodologies into two main categories: paravirtualization and full virtualization.

Paravirtualization

A paravirtualized VMM provides a virtual hardware abstraction that is similar, but not identical, to the real hardware, which guarantees that a smaller number of privileged CPU instructions need to be executed. In this case, the operating system that runs in a guest VM needs to be modified, so that it is aware of the fact that it is running inside a hypervisor. Typically the paravirtualization of device drivers is also needed.

Advantages

• No need for total hardware emulation, so it is more efficient than full virtualization;


• Virtualized OSes can directly communicate with hardware resources;

Disadvantages

• The guest OS needs to be modified;

• Isolation is more challenging;

Full Virtualization

In full virtualization, the guest OS is unaware of the virtualization system. This type of virtualization can be achieved with different methodologies; the main ones are system emulation, binary translation and hardware-assisted virtualization.

In system emulation all the hardware resources are emulated, and the guest operating system can use the hardware resources only through the hardware emulation layer. In this case, the guest operating system can be run without any modification. The VMM executes the CPU instructions that need more privileges than are available in user space.

Advantages

• Complete isolation between guests;

• VMs are not tied to any specific hardware platform, so we have total portability;

• No modifications to the guest OS are needed;

Disadvantages

• Performance degradation because everything is emulated;

Binary translation is based on intercepting OS code, so at run-time some guest OS instructions are translated. The user-level code is directly executed on the real hardware. Specific device drivers are needed, but the guest OS does not need any modification.

Advantages

• Complete isolation between guests;

• No modifications to the guest OS are needed;

Disadvantages

• Performance degradation because the VMM layer needs to scan the guest code to translate some instructions;


Hardware-assisted virtualization is supported by the new hardware platforms, which provide some extra hardware to enable a more efficient full virtualization and to ease guest isolation. An example of this type of hardware platform is the ARM Virtualization Extension (VE).

Advantages

• Complete isolation between guests;

• No modifications to the guest OS are needed;

• More efficient than classical system emulation;

Disadvantages

• Need specific hardware extension;

1.1.2 Types of Hypervisors

In their article [4], Popek and Goldberg classified hypervisors into two different categories: Type-1, native or bare-metal hypervisors, and Type-2 or hosted hypervisors.

Type-1 hypervisors

These hypervisors run directly on the host's hardware to control the hardware and to manage the guest operating systems (figure 1.3). The first hypervisors, developed by IBM in the 1960s, were type-1 hypervisors. These hypervisors are more efficient because they have total control of the hardware platform, but for the same reason they are hardware dependent. Some examples of modern type-1 hypervisors are Xen, Xvisor, Microsoft Hyper-V and VMware ESX/ESXi.

FIGURE 1.3: Type-1 Hypervisor

Type-2 hypervisors

These hypervisors run on a conventional operating system just as other computer programs do (figure 1.4). A guest operating system runs as a process on the host. The type-2 hypervisor has the task of abstracting the guest operating systems from the host operating system. In this case, the guest OS is slower because of the two layers between it and the hardware. Some type-2 hypervisors are VMware Workstation, VirtualBox, Parallels Desktop and QEMU.

FIGURE 1.4: Type-2 Hypervisor

However, the distinction between these two types is not necessarily clear. Linux's Kernel-based Virtual Machine (KVM) and FreeBSD's bhyve are kernel modules that convert the host operating system into a type-1 hypervisor. At the same time, since Linux distributions and FreeBSD are still general-purpose operating systems, with other applications competing for VM resources, KVM and bhyve can also be categorized as type-2 hypervisors.

1.2 ARM Virtualization Extension (VE)

The Virtualization Extensions to the ARMv7 architecture provide a standard hardware-accelerated implementation enabling the creation of high-performance hypervisors [5]. The ARM Virtualization Extensions make it possible to operate multiple operating systems on the same system, while offering each such operating system the illusion of sole ownership of the system, by introducing new architectural features. These are:

• A hypervisor mode, in addition to the current privileged modes. This PL2 mode is even more privileged than the PL1 modes. Hyp mode is expected to be occupied by hypervisor software managing multiple guest operating systems occupying the PL1 and PL0 modes. Hyp mode only exists in the Normal (Non-secure) world;

• An additional memory translation, called Stage 2, is introduced. Previously, the Virtual Address (VA) was translated to a Physical Address (PA) by the PL1 and PL0 MMU. This translation is now known as Stage 1, and the old physical address is now called the Intermediate Physical Address (IPA). The IPA is subjected to another translation level in Stage 2 to obtain the final PA corresponding to the VA;

• Interrupts can be configured to be taken (routed) to the hypervisor. The hypervisor then takes care of delivering interrupts to the appropriate guest;

• A Hypervisor Call instruction (HVC) for guests to request hypervisor services;

Figure 1.5 shows the possible ARM processor modes with the VE.

FIGURE 1.5: ARM Cortex-A with VE Processor's modes

1.2.1 Memory translation

Using the ARM-VE, a number of memory translation regimes are possible. The term translation regime denotes the combination of privilege and execution mode of the core and the set of translation tables used. The translations are carried out using the MMUs and the translation table structures created by the software that controls the translation. A stage of translation comprises the set of translation tables that translates an input address to an output address; the input and output addresses take different names depending on the stage of translation. The possible translation regimes are the following:

PL1&0 Stage 1

This is the translation regime that a standard operating system usually sets up and controls. This regime is applied when the core is in one of the execution modes that fall under the PL1 (kernel) and PL0 (user) privilege levels. In a conventional system, this regime translates virtual addresses to physical addresses. In a virtualized system, however, this physical address is treated as an Intermediate Physical Address because it is subjected to another stage (Stage 2) of translation, and when Stage 2 is present this translation is qualified as Stage 1. The IPAs cannot be used to address system memory, but to a guest they appear as physical addresses.

PL1&0 Stage 2

This comprises a set of translation tables that the hypervisor sets up for each of the guests it manages. This stage translates the IPA, which was output by Stage 1, into a physical address that can finally be used to address system memory. The Virtualization Extensions add a set of core registers to control the Stage 2 translation tables. The hypervisor saves and restores these registers whenever it schedules a different guest on the core. Individual guests that are managed by the hypervisor have no control over, nor are aware of the presence of, a Stage 2 translation. When virtualization is in effect, all PL1&0 Stage 1 translations are implicitly subjected to this Stage 2 translation. What Stage 2 translation achieves is virtualization of the guest's view of physical memory. Every virtual address used in the PL1&0 stage is first translated to an IPA by Stage 1. This IPA is then translated to the actual physical address by Stage 2. The guest is unaware of, and cannot control, translation by Stage 2 (figure 1.6). By appropriately setting up the Stage 2 translation tables, the hypervisor can manage and allocate physical memory to the guests. This can be thought of as a natural extension of what an operating system does when managing and allocating physical memory to its applications. In that respect, the guests could be considered as applications of the hypervisor. This feature eases the hypervisor's task of isolating the guests' memory spaces.

FIGURE 1.6: Stage 2 translation (image from [5])

PL2

This comprises a set of translation tables that the hypervisor sets up for itself to manage its own virtual memory. This regime translates virtual addresses used in Hyp mode to physical addresses. No additional stages are applied to this translation, therefore it is not qualified by stages. The Virtualization Extensions include a set of registers for the hypervisor to manage its own translation tables, much like an operating system kernel manages its own translation tables.
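
Conceptually, the two stages compose: the guest controls the VA-to-IPA mapping (Stage 1), while the hypervisor controls the IPA-to-PA mapping (Stage 2). The short sketch below illustrates only this composition; the walk functions are illustrative stubs (a flat mapping plus an offset), not ARM or Xvisor code.

#include <stdint.h>

typedef uint32_t va_t;   /* guest virtual address               */
typedef uint64_t ipa_t;  /* intermediate physical address       */
typedef uint64_t pa_t;   /* physical address (40-bit with LPAE) */

/* Stage 1 walk: owned by the guest OS (illustrative flat mapping). */
static ipa_t stage1_walk(va_t va)
{
    return (ipa_t)va;
}

/* Stage 2 walk: owned by the hypervisor, one table set per guest
 * (illustrative: place the guest at a fixed host-memory offset).  */
static pa_t stage2_walk(ipa_t ipa)
{
    const pa_t guest_base = 0x100000000ULL;  /* example offset */
    return guest_base + ipa;
}

/* Address seen by the memory system when the guest issues an access. */
pa_t translate(va_t va)
{
    return stage2_walk(stage1_walk(va));  /* Stage 1, then Stage 2 */
}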

1.2.2 Large Physical Address Extensions (LPAE)

Processors that implement the ARMv7-A Large Physical Address Extension (LPAE) expand the range of accessible physical addresses from 4 GB (2^32 bytes) to 1024 GB (2^40 bytes, a terabyte) by translating 32-bit virtual memory addresses into 40-bit physical memory addresses. To do this they use the long-descriptor format (figure 1.7). The Virtualization Extensions provide an additional second stage of address translation when running virtual machines. The first stage of this translation produces an Intermediate Physical Address (IPA) and the second stage then produces the physical address. The second stage of this conversion process is controlled by the hypervisor. TLB entries can also have an associated Virtual Machine ID (VMID), in addition to an Address Space ID (ASID). It is possible to disable the stage 2 MMU and have a flat mapping from IPA to PA. Long-descriptor format memory management includes the following features:

• 64-bit page descriptors

• Up to three levels of translation tables

• Supports specifying up to 40-bit physical addresses

• 1 GB, 2 MB and 4 KB block or page sizes are supported

• Second-stage memory translation used for virtualization


• An additional access permission setting – Privileged eXecute Never (PXN). This marks a page as containing code that can be executed only in a non-privileged (user) mode.

FIGURE 1.7: Format of long-descriptor table entries (image from [5])

Long-descriptor translation tables output a 40-bit intermediate physical address. The first-level translation table has 4 entries, one entry for each 1 GB of virtual memory, and is indexed by two bits of the VA. The second-level translation table has 512 entries, one entry for each 2 MB of virtual memory (in the 1 GB address range of the first-level table entry), and is indexed by nine bits of the VA. The third-level translation table has 512 entries, one entry for each 4 KB of virtual memory (in the 2 MB address range of the second-level table entry), and is indexed by nine bits of the VA. The translation process is shown in figure 1.8.

FIGURE 1.8: VA to IPA translation (image from [5])
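
To make the index arithmetic above concrete, the following sketch extracts the three table indices and the page offset from a 32-bit VA, following the bit positions described in the text (an illustration, not code from the thesis).

#include <stdint.h>
#include <stdio.h>

/* LPAE long-descriptor walk indices for a 32-bit VA:
 *   bits [31:30] -> 1st-level index (4 entries, 1 GB each)
 *   bits [29:21] -> 2nd-level index (512 entries, 2 MB each)
 *   bits [20:12] -> 3rd-level index (512 entries, 4 KB each)
 *   bits [11:0]  -> offset inside the 4 KB page
 */
int main(void)
{
    uint32_t va = 0x8012A345u;            /* arbitrary example VA */

    unsigned l1  = (va >> 30) & 0x3;      /* 2-bit index   */
    unsigned l2  = (va >> 21) & 0x1FF;    /* 9-bit index   */
    unsigned l3  = (va >> 12) & 0x1FF;    /* 9-bit index   */
    unsigned off = va & 0xFFF;            /* 12-bit offset */

    printf("VA 0x%08X -> L1 %u, L2 %u, L3 %u, offset 0x%03X\n",
           va, l1, l2, l3, off);
    return 0;
}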

1.2.3 GIC support for virtualization

The Virtualization Extensions include the GIC (Generic Interrupt Controller) Virtualization Extensions. A virtual machine running on a processor communicates with a virtual CPU interface on the GIC. The virtual machine receives virtual interrupts from this interface and cannot distinguish these interrupts from physical interrupts [6]. A hypervisor handles all IRQs, translating those destined for a virtual machine into virtual interrupts and, in conjunction with the GIC, manages the virtual interrupts and the associated physical interrupts. It also uses the GIC virtual interface control registers to manage the virtual CPU interface. As part of this control, the hypervisor and the GIC together provide a virtual distributor, which appears to a virtual machine as the physical GIC distributor. The GIC virtual CPU interface signals virtual interrupts to the virtual machine, subject to the usual GIC handling and prioritization rules. Figure 1.9 shows an example of how the GIC handles interrupts in a system with virtualization.

FIGURE 1.9: GIC with an ARM processor that supports virtualization (image from [6])

When the hypervisor receives an IRQ, it determines whether the interrupt is for itself or for a virtual machine. If it is for a virtual machine, it determines which virtual machine must handle the interrupt and generates a virtual interrupt. The GIC Virtualization Extensions provide the following support for a virtual CPU interface:

• GIC virtual interface control registers. These are management registers, accessed by a hypervisor.

• GIC virtual CPU interface registers. These registers provide the virtual CPU interface accessed by the current virtual machine on a connected processor. In general, they have the same format as the GIC physical CPU interface registers, but they operate on the interrupt view defined by the List registers.

A virtual machine communicates with the virtual CPU interface but cannot detect that it is not communicating with a GIC physical CPU interface. The virtual CPU interface and the GIC virtual interface control registers are both in the non-secure memory map. A hypervisor uses the non-secure stage 2 address translations (section 1.2.1) to ensure that the virtual machine cannot access the GIC virtual interface control registers.


Chapter 2

Xvisor

Xvisor is an open-source type-1 hypervisor, which aims at providing a monolithic, light-weight, portable and flexible virtualization solution. It provides a high-performance and low-memory-footprint virtualization solution for different ARM architectures, with or without the Virtualization Extension, and for other CPU architectures. The Xvisor source code is highly portable and can be easily ported to most general-purpose 32-bit or 64-bit architectures, as long as they have a paged memory management unit (PMMU) and a port of the GNU C compiler (GCC). Xvisor primarily supports full virtualization, hence it supports a wide range of unmodified guest operating systems; paravirtualization is also supported. It has most features expected from a modern hypervisor, such as: device-tree-based configuration, tickless and high-resolution timekeeping, a threading framework, a host device driver framework, an IO device emulation framework, runtime loadable modules, pass-through hardware access, dynamic guest creation/destruction, a management terminal, network virtualization, input device virtualization, display device virtualization and more [7].

2.1 Architecture

Xvisor is a hardware-assisted system virtualization software running directly on the host machine. Figure 2.1 shows the Xvisor software architecture. All core components of Xvisor, such as CPU virtualization, guest IO emulation, background threads, para-virtualization services, management services, and device drivers, run as a single software layer with no pre-requisite tool or binary file.

FIGURE 2.1: Xvisor Software Architecture (image from [8])

Inside Xvisor the virtual machine instances are referred to as "Guests", and the instances of virtual CPUs are called "VCPUs". A VCPU belonging to a guest is referred to as a "Normal VCPU" and a VCPU not belonging to any guest is referred to as an "Orphan VCPU". Xvisor creates Orphan VCPUs for various background processing and for running management daemons. The ARM-VE architecture (section 1.2) has three privilege modes: User, Supervisor and Hyp. Xvisor runs normal VCPUs in user and supervisor mode and orphan VCPUs in Hyp mode. Xvisor maintains its configuration in the form of a tree data structure called a device tree, to ease the task of configuring Xvisor on different hardware platforms. The guest configuration is also maintained in the form of a tree data structure; this facilitates easier manipulation of the guest hardware, so no source code changes are required for creating a customized guest for embedded systems. The device tree configuration is written as a device tree script (DTS), which must be compiled using a device tree compiler (DTC) to obtain a DTB file (device tree blob or flattened device tree file) that is passed to Xvisor at boot time.

The most important advantage of Xvisor is its single software layer running with the highest privilege, in which all virtualization-related services are provided. Xvisor's context switches are very light, resulting in fast handling of nested page faults, special instruction traps, host interrupts, and guest IO events. Furthermore, all device drivers run directly as part of Xvisor, with full privilege and without nested page tables, ensuring no degradation in device driver performance.

2.2 Hypervisor Timer

Like any OS, a hypervisor also needs to keep track of passing time using a timekeeping subsystem. Xvisor's timekeeping subsystem is called the hypervisor timer. The hypervisor timer subsystem of Xvisor is highly inspired by the Linux hrtimer subsystem and is completely tickless. It provides the following features:


• 64-bit Timestamp: the timestamp represents nanoseconds elapsed since Xvisor was booted;

• Timer events: we can create or destroy timer events with an associated expiry time in nanoseconds and an expiry callback handler. Timer events are one-shot events; to obtain periodic timer events we have to manually re-start the timer event from its expiry callback handler (a small sketch follows).
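
A minimal sketch of the re-arming pattern mentioned in the last item: a periodic activity is obtained by re-starting the one-shot event from its own expiry handler. The names below are illustrative placeholders, not the actual Xvisor timer API.

#include <stdint.h>

#define PERIOD_NS 1000000ULL               /* 1 ms period (example value) */

struct timer_event {
    uint64_t expiry_ns;                    /* absolute expiry timestamp   */
    void (*handler)(struct timer_event *); /* expiry callback             */
};

/* Illustrative stub: a real hypervisor would insert the event in its
 * timer queue and program the hardware timer for the new deadline.     */
static void timer_event_start(struct timer_event *ev, uint64_t delta_ns)
{
    ev->expiry_ns += delta_ns;
}

/* One-shot expiry callback that re-arms itself, emulating a periodic
 * timer event on top of the one-shot primitive.                        */
void periodic_tick(struct timer_event *ev)
{
    /* ... periodic work (e.g., bookkeeping) ... */
    timer_event_start(ev, PERIOD_NS);      /* fire again one period later */
}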

2.3 Hypervisor Manager

The hypervisor manager module of Xvisor is responsible for the creation and management of VCPUs and guest instances. It also provides routines for VCPU state changes, VCPU statistics, and VCPU host-CPU changes, which are built on top of the hypervisor scheduler routines. A VCPU instance has an architecture-dependent part and an architecture-independent part. The architecture-dependent part is formed by all the registers (general-purpose and private registers). The architecture-independent part is formed by the VCPU context, the scheduler dynamic context and the scheduler static context (details in [9]). A VCPU can be in exactly one state at any given instant of time; the possible states are the following:

• UNKNOWN: the VCPU does not belong to any guest and is not an Orphan VCPU. To enforce a lower memory footprint, Xvisor pre-allocates memory based on the maximum number of VCPUs and puts them in this state;

• RESET: the VCPU is initialized and is waiting for someone to kick it to the READY state. To create a new VCPU, the VCPU scheduler picks up a VCPU in the UNKNOWN state from the pre-allocated VCPUs and initializes it. After initialization the newly created VCPU is put in the RESET state;

• READY: VCPU is ready to run on hardware;

• RUNNING: VCPU is currently running on hardware;

• PAUSED: the VCPU has been stopped and can be resumed later. A VCPU is set to this state when it is detected that the VCPU is idle and can be scheduled out;

• HALTED: the VCPU has been stopped and cannot be resumed. A VCPU is set to this state when some erroneous access is performed by that VCPU;

A VCPU state change can occur for various reasons, but the changes have to strictly follow a finite-state machine (figure 2.2), which is ensured by the hypervisor scheduler.


FIGURE 2.2: Xvisor VCPU state machine

A Guest instance consists of the following:

• ID: Globally unique identification number.

• Device Tree Node: Pointer to Guest device tree node.

• VCPU Count: Number of VCPU instances belonging to this Guest.

• VCPU List: List of VCPU instances belonging to this Guest.

• Guest Address Space Info: Information required for managing the Guest physical address space.

• Arch Private: Architecture-dependent context of this Guest.

A Guest Address Space is an architecture-independent abstraction which consists of the following:

• Device Tree Node: Pointer to the Guest Address Space device tree node.

• Guest: Pointer to the Guest to which this Guest Address Space belongs.

• Region List: A set of "Guest Regions".

• Device Emulation Context: Pointer to private information required by the device emulation framework per Guest Address Space.

Each Guest Region has a unique Guest Physical Address and Physical Size. Further, a Guest Region can take one of three forms (a small data-structure sketch follows the list):


• Real Guest Region: A Real Guest Region gives direct access to a host machine device/memory. This type of region directly maps guest physical addresses to host physical addresses.

• Virtual Guest Region: A Virtual Guest Region gives access to an emulated device. This type of region is typically linked with an emulated device. The architecture-specific code is responsible for redirecting virtual guest region read/write accesses to the Xvisor device emulation framework.

• Aliased Guest Region: An Aliased Guest Region gives access to another Guest Region at an alternate Guest Physical Address.
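
As a rough illustration of the concepts above, the structure below models a guest region with its three possible forms; the field and type names are ours and do not correspond to Xvisor's actual data structures.

#include <stdint.h>

struct emulated_device;   /* handled by the device emulation framework */
struct guest_region;

enum region_kind {
    REGION_REAL,          /* directly backed by host memory or device   */
    REGION_VIRTUAL,       /* backed by an emulated device               */
    REGION_ALIASED        /* alias of another region at a different GPA */
};

struct guest_region {
    uint64_t gpa;         /* unique guest physical address */
    uint64_t size;        /* physical size of the region   */
    enum region_kind kind;
    union {
        uint64_t hpa;                 /* REGION_REAL: host physical address */
        struct emulated_device *dev;  /* REGION_VIRTUAL: emulator instance  */
        struct guest_region *alias;   /* REGION_ALIASED: aliased region     */
    } u;
};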

2.4 Hypervisor Scheduler

The hypervisor scheduler of Xvisor is generic and independent of the scheduling algorithm. It updates the per-CPU ready queues whenever it gets notifications from the hypervisor manager (section 2.3) about VCPU state changes. The hypervisor scheduler uses a per-CPU hypervisor timer event (section 2.2) to allocate a time slice to a VCPU. When a scheduler timer event expires for a CPU, the scheduler finds the next VCPU using some scheduling algorithm and configures the scheduler timer event for the next VCPU.
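
The following sketch summarizes the mechanism just described: the scheduling policy is hidden behind a generic "pick next" hook, and the per-CPU one-shot timer event is re-armed for the time slice of the chosen VCPU. Names are illustrative, not the Xvisor scheduler API.

#include <stdint.h>
#include <stddef.h>

struct vcpu {
    uint64_t time_slice_ns;   /* time slice assigned to this VCPU */
    /* ... architecture and scheduling context ... */
};

/* Per-CPU scheduler state: the algorithm sits behind pick_next(),
 * so the scheduler core stays independent of PRR, RM, etc.        */
struct sched_cpu {
    struct vcpu *current;
    struct vcpu *(*pick_next)(struct sched_cpu *cpu);  /* ready-queue policy */
};

/* Illustrative stubs standing in for context switch and timer APIs. */
static void context_switch(struct vcpu *from, struct vcpu *to) { (void)from; (void)to; }
static void arm_slice_timer(uint64_t delta_ns)                  { (void)delta_ns; }

/* Invoked when the per-CPU scheduler timer event expires. */
void on_slice_expired(struct sched_cpu *cpu)
{
    struct vcpu *next = cpu->pick_next(cpu);   /* algorithm-specific choice */
    if (next != NULL && next != cpu->current) {
        context_switch(cpu->current, next);
        cpu->current = next;
    }
    if (cpu->current != NULL)
        arm_slice_timer(cpu->current->time_slice_ns); /* next preemption point */
}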

For Xvisor a Normal VCPU is a black box, and an exception or interrupt is the only way to get back control. Whenever Xvisor code is executing, it can be in any one of the following contexts:

• IRQ Context: when serving an interrupt generated by some external device of the host machine;

• Normal Context: when emulating some functionality or instruction, or emulating IO, on behalf of a Normal VCPU in Xvisor;

• Orphan Context: when running some part of Xvisor as an Orphan VCPU or Thread;

Xvisor has a special context called Normal Context. The hypervisor is in normal context only when it is doing something on behalf of a normal VCPU, such as handling exceptions or emulating IO. The normal context is non-sleepable, which means a normal VCPU cannot be scheduled out while it is in normal context. In fact, a normal VCPU is only scheduled out when Xvisor comes out of IRQ context or normal context. This helps Xvisor ensure a predictable delay in handling exceptions or emulating IO. For this reason, unlike other hypervisors, Xvisor does not incur any additional scheduling or context switch overhead in emulating guest IO events. As shown in figure 2.3, the scenario starts at (1) when a guest IO event is trapped by Xvisor, and (2) handling it in a non-sleepable normal context ensures a fixed and predictable overhead.

FIGURE 2.3: Emulated guest IO event on Xvisor (figure from [8])

Xvisor also guarantees minimum overhead when handling host interrupts. Xvisor's host device drivers run as part of Xvisor with the highest privilege. Hence, no scheduling or context switch overhead is incurred for processing host interrupts, as shown in figure 2.4. A scheduling overhead is only incurred if the host interrupt is routed to a guest which is not currently running.

FIGURE 2.4: Host interrupts handling on Xvisor

The possible scenarios in which a VCPU context switch is invoked by the scheduler are as follows:

• When the time slice allotted to the current VCPU expires; this is called VCPU preemption.

• When a normal VCPU misbehaves; in this case, it is halted or paused.

• When an Orphan VCPU (or Thread) chooses to voluntarily pause.

• When an Orphan VCPU (or Thread) chooses to voluntarily yield its time slice.

• When the VCPU state is changed from some other VCPU using the hypervisor manager APIs.

The scheduling algorithms are independent of the scheduler module; at the moment those available are Priority Round-Robin (PRR) and Rate Monotonic (RM).


2.5 Other Modules

Xvisor has several other modules that complete the support for virtualized systems.

The Hypervisor Threads module supports the creation of Orphan VCPUs to run code at the Hyp privilege level as part of Xvisor.

The Device Driver Framework is very similar to the Linux kernel device driver model in terms of abstractions and available APIs.

The Device Emulation Framework provides specific virtual hardware for the guests. The Xvisor device emulation framework is designed to be flexible, light-weight and fast.

The Standard I/O subsystem implements various forms of printing and scanning APIs.

The Command Manager provides a transport-independent way of managing and executing commands.

The Storage Virtualization subsystem is simple and light-weight and provides virtualization of block devices.

The Network Virtualization module is provided in the form of a light-weight packet switching framework.


Chapter 3

Isolation on multicore platforms

The need for ever-increasing performance over the years has led to an exponential increase in the number of transistors within single-core CPUs; at some point, however, this process reached a saturation point beyond which it was no longer possible to increase the number of transistors within a single core due to problems of heat dissipation, size, and energy consumption. To overcome this problem, the industry moved to a different model where multiple, but slower, cores are placed into the same CPU.

The introduction of these new architectures brings several problems that need to be analyzed and taken into account:

• Exploiting parallelism: sequential code cannot exploit the potential of the new multicore platforms;

• Resource contention: multiple CPUs run simultaneously and have to contend for access to the devices;

• Interference complexity: computing the interference becomes more complex, due to the unpredictability of the simultaneous accesses to shared resources by the various cores;

• Scheduling complexity: the complexity of scheduling algorithms increases;

• Computing the WCET: computing the worst-case execution time (WCET) becomes more complex; some assumptions valid on single-core platforms are no longer valid on multicore platforms.

Many of these problems are open issues, and different solutions have been proposed. It is easy to understand that these issues have a significant impact on embedded real-time systems. Furthermore, when hypervisors are concerned, the issues are not limited to the ones discussed above; in fact, we need to take into account various aspects necessary to ensure the creation of isolated partitions, such as the memory space isolation of each single partition. In this work, we focused our attention on the problems of:

• Spatial isolation of the shared cache (LLC);

• Temporal isolation on DRAM memory accesses.

3.1 Contention due to shared cache levels

Today, all multicore CPUs have a cache memory hierarchy to improve performance. Typically, the first level of the hierarchy consists of small and fast cache memories reserved for each single core, while the second level is commonly composed of a large cache memory shared between all the cores. Some designs include hierarchies with more than two levels. In general, however, there exists a last level of cache (LLC) that is shared between all the cores. Figure 3.1 shows a typical two-level cache hierarchy.

FIGURE 3.1: An example of a two-level cache hierarchy

In embedded real-time systems, one of the main sources of unpredictability is the CPU cache memory hierarchy. In fact, since the cores run simultaneously, a core can replace the data placed in the LLC by another core, and vice-versa, thus generating a mutual interference that can be highly unpredictable (as well as strongly dependent on the application behavior). For instance, this is a major problem when porting applications from single-core to multicore platforms, as these phenomena were not present in the former design and may strongly jeopardize the application performance (see figure 3.2).

FIGURE 3.2: Test by Lockheed Martin Space Systems on 8-core platform

As far as caches are concerned, the execution time of a real-time task on a multicore CPU can be affected by different types of interference, which can be distinguished as:

• Intra-task interference: intra-task interference occurs when two memory entries in the working set are mapped into the same cache set;

• Intra-core interference: intra-core interference happens locally in a core, specifically when a preempting task evicts the preempted task's cached data;

• Inter-core interference: inter-core interference is present when tasks running on different cores concurrently access a shared level of cache. When this happens, if two lines in the addressing spaces of the running tasks map to the same cache line, these tasks can repeatedly evict each other's data in the cache, leading to complex timing interactions and thus unpredictability.

Considering a hypervisor and assuming that a core is assigned to each partition, we focus on the problem of inter-core interference and on achieving spatial isolation by assigning a smaller piece of the shared cache to each partition, so as to effectively realize a private second level of cache.

3.2 Memory bandwidth contention

Another primary shared resource of a multicore embedded system is the main memory (DRAM). In this case, at the hypervisor level, we have two main problems: spatial isolation and temporal isolation. As for spatial isolation, the hypervisor has to ensure that the partitions have separate memory spaces. To ensure this separation, Xvisor (chapter 2) leverages the two-stage translation capabilities of the MMU provided by the ARM-VE architecture (section 1.2.1). When an access to uncached memory occurs, or an access generates a cache miss, the DRAM memory must be accessed, and at this point the isolation problem appears again. In this case it is a temporal problem, because the DRAM memory controller is unique, hence the overall bandwidth is shared between the cores, which contend for it (figure 3.3).

FIGURE 3.3: Simultaneous accesses to main memory by different cores lead to contention

In a system managed by a hypervisor, this type of problem can introduce unpredictable delays for guests that need real-time guarantees. The problem occurs if one of the partitions, even a non-real-time one, starts to behave abnormally by issuing an unbounded series of accesses to the main memory: even without accessing the memory space of the other partitions, it can cause significant interference, so that a real-time partition may be affected. To limit this type of problem without any prior knowledge of the software that will run in the single partitions, the only possible way is to realize a temporal isolation mechanism. Temporal isolation can be achieved by using a bandwidth reservation mechanism, so that each partition has a maximum number of DRAM accesses guaranteed within a given time window. In other words, each partition has a maximum bandwidth enforced by the hypervisor, thus achieving predictable interference among the guests.
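
A minimal sketch of the reservation idea just described: each partition receives a budget of memory transactions per period; a performance counter is armed to overflow once the budget is consumed, and the overflow handler throttles the partition until the next replenishment. All names are illustrative placeholders and do not correspond to the actual ARM PMU or Xvisor interfaces (the real design is described in chapter 5).

#include <stdint.h>
#include <stdbool.h>

/* Per-partition memory bandwidth reservation: 'budget' memory accesses
 * are allowed within every 'period_ns' window (illustrative sketch).  */
struct mem_reservation {
    uint32_t budget;       /* allowed memory accesses per period */
    uint64_t period_ns;    /* replenishment period               */
    bool     throttled;    /* budget exhausted in this period    */
};

/* Illustrative stubs for platform services. In the real system these
 * would program the PMU, the hypervisor scheduler and a timer.       */
static void pmu_overflow_after(uint32_t events)     { (void)events; }
static void suspend_partition_vcpus(void)           { }
static void resume_partition_vcpus(void)            { }
static void start_replenishment_timer(uint64_t ns)  { (void)ns; }

/* Replenishment: called at the start of each period. */
void on_period_start(struct mem_reservation *r)
{
    r->throttled = false;
    resume_partition_vcpus();
    pmu_overflow_after(r->budget);          /* count LLC-miss/memory events */
    start_replenishment_timer(r->period_ns);
}

/* PMU overflow interrupt: the partition consumed its whole budget. */
void on_budget_exhausted(struct mem_reservation *r)
{
    r->throttled = true;
    suspend_partition_vcpus();              /* stall until next replenishment */
}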


Chapter 4

Cache Partitioning

The inter-core interference problem on a shared cache (section 3.1) can be solved by partitioning the shared cache into smaller subsets. The purpose of cache partitioning is to partition the shared cache into subsets to be assigned to individual cores or single partitions, so as to reduce inter-core interference, increase predictability, and facilitate WCET estimation.

There are two types of cache partitioning: index-based and way-based partitioning. In index-based cache partitioning, the partitions are formed by aggregating cache sets. In way-based cache partitioning, the partitions are formed by aggregating one or more cache ways. Referring to figure 4.1, we can observe the two approaches applied to an n-way set-associative cache. Figure 4.1(a) shows index-based partitioning, also called horizontal slicing, in which one or more cache sets are considered as isolated partitions. Figure 4.1(b) shows way-based partitioning, in which each way is considered as an isolated partition; in this case it is referred to as vertical slicing. Depending on the type of partitioning, a hardware-based or software-based approach is needed; in our case we have chosen index-based partitioning with a software-based approach, modifying the virtual memory management within the hypervisor.

FIGURE 4.1: Cache partitioning approaches: (a) index-based, (b) way-based

4.1 ARM Cache Architecture

In this section we look in detail at the shared cache architecture used in ARM processors, recalling the general concepts and terminology. Let's start with a summary of some of the terms used when talking about caches:

• A line refers to the smallest loadable unit of a cache, a block of contiguous words from main memory;

• The index is the part of a memory address that determines in which line(s) of the cache the address can be found;

• A way is a subdivision of a cache, each way being of equal size and indexed in the same fashion. The lines associated with a particular index value from each cache way, grouped together, form a set.

• The tag is the part of a memory address stored within the cache that identifies the main memory address associated with a line of data.

FIGURE 4.2: Cache Terminology (image from [5])

The main caches of ARM cores are always implemented as set-associative caches. This significantly reduces the likelihood of cache thrashing, improving program execution speed and giving more deterministic execution. With this kind of cache organization, the cache is divided into a number of equally-sized pieces, called ways. A memory location can be mapped to a line in any way. The index field of the address is used to select a particular line and points to an individual line in each way.

In our work, we have used a quad-core ARM Cortex-A7 processor with two cache levels. Each core has a 32 KB L1 2-way set-associative instruction cache, with a 32-byte line length, and a 32 KB L1 4-way set-associative data cache, with a 64-byte line length. All the cores share a second level of cache. The L2 shared cache has the following features:

• 512 KB cache size;

• fixed line length of 64 bytes;

• physically indexed and tagged cache;

• 8-way set-associative cache structure;

• pseudo-random cache replacement policy;

4.2 Cache Coloring

To achieve index-base partitioning we have chosen to implement col-oring page which is a software technique to control the physically indexed set-associative cache. This technique uses the mapping by which index links the physical memory addresses to a cache set; this is done by hardware so that each address is mapped to one set of the cache. Observing that there are bits that overlap between the phys-ical page number and the index of the set, is possible to use these bits like color index, in this way through the hypervisor we can as-sign different colors to different guests. The number and the size of colors are hardware dependent because are linked to the cache ad-dress format that depends on the cache features. Figure 4.3 shows

Cache line offset Set Index Tag Cache line offset Set Index Tag

Physical page # Page Offset

39 31 31 31 0 0 0 6 6 11 12 15 Row select Row select [15 13] L1 Cache (private) L2 Cache (shared)

Bit of colors managed by the hypervisor

FIGURE4.3: Cortex A7 with 512 KB of L2 shared cache,

address bits

We can notice that the physical page number overlaps the L1 cache set index by one bit, which means that the L1 cache can be divided into two colors at most. As for the L2 cache, the overlapping bits are four, so theoretically we can obtain up to 16 colors; however, this would also involve partitioning the L1 cache, which is not interesting for our purposes because it is private to each core. Finally, the bits useful for partitioning the L2 cache are three, giving us the ability to obtain up to 8 colors, each with a size of 8 KB, easily derived from the position of the bits within the address.
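
Putting the numbers together, the color bits can be derived as follows (a minimal sketch based on the cache parameters listed above, assuming the standard 4 KB page size):

\[
\begin{aligned}
\text{L2 sets} &= \frac{512\ \text{KB}}{8\ \text{ways} \times 64\ \text{B}} = 1024 &&\Rightarrow \text{L2 set index: bits } [15{:}6],\\
\text{page offset} &= 12\ \text{bits} &&\Rightarrow \text{overlap with the page number: bits } [15{:}12],\\
\text{L1 data sets} &= \frac{32\ \text{KB}}{4\ \text{ways} \times 64\ \text{B}} = 128 &&\Rightarrow \text{L1 set index: bits } [12{:}6],\\
\text{usable color bits} &= [15{:}13] &&\Rightarrow 2^{3} = 8\ \text{colors},\quad \text{color size} = 2^{13}\ \text{B} = 8\ \text{KB}.
\end{aligned}
\]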

To use colors to assign shared cache partitions to different guests, we have to modify the hypervisor so that it allocates memory using only addresses that belong to the colors chosen for each guest. In this way, the memory addresses of different guests are mapped to different parts of the cache, avoiding mutual interference. In practice, each guest will believe it owns a piece of contiguous memory, which is actually allocated in physical memory in a discontinuous manner, respecting the above rules. To implement this mechanism, we have exploited the double level of address translation, in particular by changing the stage 2 translation tables. Figure 4.4 shows a high-level architecture of the system just described.

FIGURE 4.4: Hypervisor's cache coloring architecture (guest OS virtual pages → guest OS physical pages / ARM IPA → host physical pages; the second translation level, ARM stage 2, maps each guest only onto host pages of its own colors)

Each guest OS has its own virtual address space and is in charge of managing and setting the page tables for the stage 1 translation. The guest OS uses the generated addresses as physical addresses, because it is unaware of running inside a VM; however, the generated addresses are IPAs and are not used to address the host memory directly. Thanks to this scheme, everything that happens in guest memory, including the first level of translation, is unknown to the hypervisor, which has no need to know further details. The only information the hypervisor must know is which guest each IPA belongs to: with this information, and knowing the guest's colors, it can set up the stage 2 translation tables so that the coloring rules are respected. Therefore, without any information on the type of guest OS, the hypervisor can eliminate the interference that guests in different partitions could generate in the shared cache.
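
To make the idea concrete, the following self-contained sketch shows how host frames could be selected so that each guest only receives frames of its own colors. It is only an illustration of the coloring rule, not the actual Xvisor code: the frame pool, its base address and the color masks are hypothetical, and the color bit positions are those of our Cortex-A7 configuration.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t phys_addr_t;

#define COLOR_SHIFT   13          /* lowest L2-only color bit (Cortex-A7) */
#define COLOR_MASK    0x7u        /* three color bits -> 8 colors         */
#define PAGE_SIZE     0x1000ULL   /* 4 KB pages assumed                   */
#define POOL_FRAMES   64

static phys_addr_t pool_base = 0x30000000ULL;   /* hypothetical free area */
static bool        frame_used[POOL_FRAMES];

/* Color of a host physical address (bits [15:13] on this platform). */
static inline uint32_t color_of(phys_addr_t hpa)
{
    return (uint32_t)(hpa >> COLOR_SHIFT) & COLOR_MASK;
}

/* Pick a free host frame whose color is allowed by 'color_mask'. */
static bool alloc_colored_frame(uint32_t color_mask, phys_addr_t *hpa)
{
    for (int i = 0; i < POOL_FRAMES; i++) {
        phys_addr_t cand = pool_base + (phys_addr_t)i * PAGE_SIZE;
        if (!frame_used[i] && ((color_mask >> color_of(cand)) & 0x1u)) {
            frame_used[i] = true;
            *hpa = cand;
            return true;
        }
    }
    return false;   /* no free frame of the requested colors */
}

int main(void)
{
    /* Guest 0 owns colors 0-3, guest 1 owns colors 4-7 (example masks). */
    uint32_t guest0 = 0x0F, guest1 = 0xF0;
    phys_addr_t hpa;

    for (int page = 0; page < 4; page++) {
        if (alloc_colored_frame(guest0, &hpa))
            printf("guest0 page %d -> HPA 0x%llx (color %u)\n",
                   page, (unsigned long long)hpa, color_of(hpa));
        if (alloc_colored_frame(guest1, &hpa))
            printf("guest1 page %d -> HPA 0x%llx (color %u)\n",
                   page, (unsigned long long)hpa, color_of(hpa));
    }
    return 0;
}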


4.3 Implementation details on Xvisor

To modify Xvisor to support cache coloring, we first need to understand how Xvisor manages the memory space of each guest. For this reason, in this section we explain in more detail how this management takes place and which data structures are involved; then we describe how we have modified Xvisor to support cache coloring.

4.3.1 Guest's memory management on Xvisor

In Xvisor, the devices visible to each guest are specified through a DTS file. Inside the device tree there must be an aspace node, representing the sub-tree where all the devices with their characteristics are listed. Each node within the aspace node is mapped by the system to a region, which is of one of the types described in Section 2.3. Code 4.1 is an extract from a DTS file:

aspace {
    guest_irq_count = <2048>;

    MEM0: mem0 {
        manifest_type = "real";
        address_type = "memory";
        guest_physical_addr = <0x80000000>;
        physical_size = <0x00000400>;
        align_order = <21>;
        device_type = "alloced_ram";
    };
    ...
};

CODE 4.1: Guest DTS region example

The regions we are interested in are those that are mapped into memory; these must have the following parameters:

• manifest_type = "real", meaning that the region is real;

• address_type = "memory", meaning that the region will be mapped in DRAM memory;

• device_type = "alloced_ram" or "alloced_rom", meaning that the region represents RAM or ROM memory for the guest.

The regions also have a parameter indicating the physical address that the guest will use to address that particular region, and another parameter indicating its size. When a guest is created, for each region found in the DTS file Xvisor creates a data structure in which all the information about the region is saved; this structure is then stored in memory as a node of a red-black tree, so that it can be searched efficiently. When it comes to a region that needs to go to DRAM memory, Xvisor looks for a contiguous piece of DRAM, of the size specified in the DTS file, that is not yet assigned to any guest. Once the piece of memory is found, it is marked as allocated and is no longer available to other guests, and the initial address of the allocated memory area (host physical address) is saved in the data structure. So, in the data structure, Xvisor has a guest physical address (GPA), a host physical address (HPA) and a size. Figure 4.5 shows the guest address space initialization flow. When the guest attempts to access a GPA for which the stage 2 page table entry is not yet present, an abort (instruction or data) is generated (Figure 4.6); Xvisor then searches for the region data structure, extracts the host physical address, and fills in the stage 2 page table appropriately. This allows the guest to resume and use the DRAM region that was assigned to it.

FIGURE 4.5: Guest Address Space Init Flow
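
The abort-handling path just described can be summarized by the following self-contained sketch. The structure is deliberately simplified (one contiguous host block per region, as in the original design before coloring), and the function names, addresses and the printf-based stage 2 update are placeholders for illustration, not the actual Xvisor code.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t physical_addr_t;
typedef uint64_t physical_size_t;

/* Simplified view of a region: one contiguous host block per region. */
struct region {
    physical_addr_t gphys_addr;   /* guest physical (IPA) base */
    physical_addr_t hphys_addr;   /* host physical base        */
    physical_size_t phys_size;    /* region size in bytes      */
};

/* One example region; Xvisor keeps these in a red-black tree instead. */
static struct region mem0 = { 0x80000000ULL, 0x30000000ULL, 0x04000000ULL };

static struct region *region_find(physical_addr_t gpa)
{
    if (gpa >= mem0.gphys_addr && gpa < mem0.gphys_addr + mem0.phys_size)
        return &mem0;
    return NULL;
}

/* Stand-in for the stage 2 page-table update. */
static bool stage2_map(physical_addr_t gpa, physical_addr_t hpa)
{
    printf("map GPA 0x%llx -> HPA 0x%llx\n",
           (unsigned long long)gpa, (unsigned long long)hpa);
    return true;
}

/* Called on a stage 2 instruction/data abort for address 'gpa'. */
static bool handle_stage2_abort(physical_addr_t gpa)
{
    struct region *reg = region_find(gpa);
    if (!reg)
        return false;   /* access outside any guest region */

    /* HPA = host base + offset of the faulting GPA inside the region. */
    return stage2_map(gpa, reg->hphys_addr + (gpa - reg->gphys_addr));
}

int main(void)
{
    handle_stage2_abort(0x80001234ULL);   /* inside mem0        */
    handle_stage2_abort(0x90000000ULL);   /* outside: rejected  */
    return 0;
}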


The problem with this approach is that each guest is assigned a portion of contiguous physical memory, which means that each guest uses memory that is mapped across the entire shared cache. The consequence is that two guests running simultaneously on two different CPUs generate inter-core interference with each other, due to the continuous eviction of each other's data in the shared cache. To achieve guest coloring, it is necessary to assign to each guest discontiguous pieces of physical memory, chosen so that their addresses only belong to the colors assigned to that guest.

4.3.2 Changes applied to Xvisor

The first thing to change is the structure of the regions: if we want to allocate several pieces of host memory to the same region, a single HPA is no longer enough. The new regions have the following structure:

struct vmm_region {
    struct rb_node head;
    struct dlist phead;
    struct vmm_devtree_node *node;
    struct vmm_guest_aspace *aspace;
    u32 flags;
    physical_addr_t gphys_addr;
    physical_addr_t aphys_addr;
    physical_size_t phys_size;
    u32 align_order;
    u32 map_order;
    u32 maps_count;
    u32 colors;
    struct vmm_region_mapping *maps;
    void *devemu_priv;
    void *priv;
};

CODE 4.2: New Xvisor Region

The fields added to support coloring are map_order, maps_count, colors and the maps pointer. Each region is divided into several pieces, called region maps; each of these maps addresses a portion of host memory. The structure of each single map is as follows:

struct vmm_region_mapping {
    physical_addr_t hphys_addr;
    u32 flags;
};

CODE 4.3: Xvisor Region Mapping

Using the map_order parameter it is possible to vary the size of the maps. This is necessary because a guest can be assigned several contiguous colors; in that case it is like having a single bigger color, so a larger map can be used. It is also useful to support coloring on other architectures, because the size of a color is architecture dependent. The maps_count variable saves the number of maps into which each region is split; it is computed from the size of the region and the size of a map. The maps pointer points to a vector, of size maps_count, in which all the maps that form the host memory of the region are saved. The colors variable records which colors the memory pieces of the region may have. It is a mask and, in our case, the first eight bits are used: each bit indicates whether a color belongs to the guest (value 1) or not (value 0). Part of the guest address space initialization flow also has to be changed: when a memory region is found, instead of searching for a single block of host memory, Xvisor searches for a block of host memory for each map. Each block must be of the size of the map and must belong to one of the colors saved in the region's colors variable. Figure 4.7 shows the new guest address space initialization flow; note that colored and non-colored regions can be managed together, but in our work all the guests are colored.

FIGURE 4.7: New Guest Address Space Init Flow
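
The relation between a region, its maps and the resulting guest-to-host address translation can be sketched as follows. This is a simplified, self-contained model with hypothetical host addresses, not the actual Xvisor code; it only illustrates how maps_count is derived from map_order and how a GPA is translated through the maps vector.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t physical_addr_t;
typedef uint64_t physical_size_t;
typedef uint32_t u32;

struct vmm_region_mapping {
    physical_addr_t hphys_addr;   /* host base of this piece */
    u32 flags;
};

/* Reduced vmm_region: only the fields relevant to the maps mechanism. */
struct vmm_region {
    physical_addr_t gphys_addr;   /* guest physical base (IPA)    */
    physical_size_t phys_size;    /* total region size            */
    u32 map_order;                /* map size = 2^map_order bytes */
    u32 maps_count;               /* phys_size / map size         */
    struct vmm_region_mapping *maps;
};

/* GPA -> HPA through the maps vector: pick the map, keep the offset. */
static physical_addr_t gpa_to_hpa(const struct vmm_region *reg,
                                  physical_addr_t gpa)
{
    physical_addr_t off = gpa - reg->gphys_addr;
    u32 idx = (u32)(off >> reg->map_order);                /* which map */
    physical_addr_t map_off = off & ((1ULL << reg->map_order) - 1);
    return reg->maps[idx].hphys_addr + map_off;
}

int main(void)
{
    /* Example: a 32 KB region split into four 8 KB maps (map_order 13),
       placed at discontiguous host addresses of the same color. */
    struct vmm_region_mapping maps[4] = {
        { 0x30000000ULL, 0 }, { 0x30010000ULL, 0 },
        { 0x30020000ULL, 0 }, { 0x30030000ULL, 0 },
    };
    struct vmm_region reg = { 0x80000000ULL, 0x8000ULL, 13, 4, maps };

    printf("maps_count = %llu\n",
           (unsigned long long)(reg.phys_size >> reg.map_order));
    printf("GPA 0x80003100 -> HPA 0x%llx\n",
           (unsigned long long)gpa_to_hpa(&reg, 0x80003100ULL));
    return 0;
}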

To check whether an address belongs to the colors encoded in the mask, we perform the following steps:

• Extract the color bits from the physical address, in our case three bits;

• The three bits encode a number between zero and seven, one for each color; this number is used to select the bit of the mask to check;

• If the selected bit is one, the color of the physical address belongs to the colors set in the mask; otherwise it does not.


The function below performs the above steps in an optimized way:

bool check_color(physical_addr_t pa, u32 colors)
{
    physical_addr_t color = (pa >> VMM_COLOR_SHIFT) &
                            VMM_MASK_COLOR_BIT;

    if ((colors >> color) & 0x1)
        return true;
    else
        return false;
}

CODE 4.4: Check Color function

All the parameters needed for guest coloring can be passed to Xvisor through the DTS file: we have extended the parser with the new parameters colors and map_order. In this way, no code modification is needed to change the color assignment of the guests.
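
For illustration, a colored memory region could then be described in the guest DTS as follows. The property values below are example values, and the exact syntax of the two new properties is assumed to mirror the existing region attributes; this is not taken from the original configuration:

MEM0: mem0 {
    ...                   /* existing region attributes as in Code 4.1    */
    colors = <0x0F>;      /* bit mask: colors 0-3 assigned to this guest  */
    map_order = <13>;     /* optional: 8 KB maps (2^13 bytes)             */
};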

The colors parameter must be a mask of eight bits in which the bits of the colors assigned to the guest region are set: the lowest bit corresponds to the first color, the highest bit to the eighth color. If the colors parameter is set to zero or is not present, the guest region is not colored.

The map_order parameter is needed to indicate the order of the maps; it is used to calculate the size of the maps with the following formula:

map_size = 2^map_order    (4.1)

If the map_order is not specified, it is calculated taking into account the size of a color and the colors assigned. In the case of a non-colored region, the following relation is guaranteed:


Chapter 5

Memory Throttling

The term throttling is well known and widely used in the server domain to indicate techniques for managing and allocating the shared resources of a host to various guests. Often, servers are machines that are partitioned and assigned to different guests, who believe they have a dedicated machine while actually sharing hardware with others. One of the main problems to be solved in these cases is managing the workload so that it does not exceed the hardware capabilities of the machine. Throttling techniques are needed to monitor the workload and to manage the requests of the various guests so that they do not exceed an assigned threshold.

Focusing on embedded multicore platforms managed by a hypervisor, where many guests share the same hardware, we can see that we are in a scenario very similar to the one described above. In the server case this type of problem has a significant impact on performance, while in real-time embedded systems, besides performance, the functionality of the whole system is jeopardized, because the correctness of these systems depends on timing constraints.

For these reasons, we have decided to analyze and apply throttling techniques to manage a shared device that is essential for each guest: the DRAM memory. In Figure 5.1 we can see an example where a four-core platform is allocated by the hypervisor as follows:

• Core 0 and Core 1 execute the hypervisor's background code;

• Core 2 is dedicated to Guest 0;

• Core 3 is dedicated to Guest 1;

• The DRAM is partitioned between the hypervisor and the guests.

At the bottom of the figure we can observe an example of scheduling where Guest 1, of which we know nothing, starts a long phase of accesses to the DRAM memory; this behavior results in considerable interference for Guest 0. The biggest problem, from the point of view of real-time systems, is the lack of information on the amount of interference that a guest can potentially create.

To handle this problem and obtain predictable guest behavior, we implemented a bandwidth reservation technique, assigning each guest a maximum number of memory accesses in each period. In this way, the memory interference that a guest can generate is bounded by design, independently of its actual behavior.
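
As a sketch of the reservation mechanism (the actual implementation is integrated with the XVISOR vCPU scheduling logic and the ARM performance monitoring unit), the idea can be summarized as follows: a performance counter is armed to overflow after the per-period budget of memory accesses, the overflow interrupt suspends the vCPU, and a periodic timer replenishes the budget. All names below are hypothetical and the platform services are replaced by print statements.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t u32;

/* Hypothetical per-vCPU memory bandwidth reservation state. */
struct mem_budget {
    u32  budget;        /* allowed memory accesses per period    */
    u32  period_us;     /* replenishment period in microseconds  */
    bool throttled;     /* true while the vCPU is suspended      */
};

/* Stand-ins for platform services (PMU programming, vCPU control);
 * in the real system these are provided by the hypervisor and the
 * ARM performance monitoring unit. */
static void pmu_start_counter(u32 events_until_overflow)
{
    printf("PMU armed: overflow after %u memory events\n",
           events_until_overflow);
}
static void vcpu_pause(void)  { printf("vCPU paused (throttled)\n"); }
static void vcpu_resume(void) { printf("vCPU resumed\n"); }

/* Called at the start of each reservation period (periodic timer). */
static void on_period_start(struct mem_budget *b)
{
    if (b->throttled) {
        b->throttled = false;
        vcpu_resume();                 /* give the CPU back to the vCPU */
    }
    /* Count e.g. L2 refills / bus accesses; overflow after 'budget'. */
    pmu_start_counter(b->budget);
}

/* Called from the PMU overflow interrupt: budget exhausted. */
static void on_budget_exhausted(struct mem_budget *b)
{
    b->throttled = true;
    vcpu_pause();                      /* suspend until the next period */
}

int main(void)
{
    struct mem_budget guest0 = { .budget = 10000, .period_us = 1000 };

    on_period_start(&guest0);          /* period 1: budget available   */
    on_budget_exhausted(&guest0);      /* guest exceeded its budget    */
    on_period_start(&guest0);          /* period 2: budget replenished */
    return 0;
}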
