

Information Engineering Department

University of Pisa, Scuola Superiore Sant’Anna

Management and orchestration of real-time containers for time-sensitive cloud and NFV services on Kubernetes

Master Degree in Embedded Computing Systems

Supervisor:

Tommaso Cucinotta

Author:

Stefano Fiori


Abstract

The growing popularity of containers on the one hand and cloud computing on the other is gradually bringing these two worlds together, creating new software that manages computer clusters based on containers as a fundamental distribution unit. In this context, Kubernetes is becoming increasingly popular and adopted as an open-source platform to automate the deployment, right-sizing, and management of containerized applications. Kubernetes is written in Go, a modern programming language particularly suitable for writing efficient distributed applications, thanks to its compiled nature, along with the wide availability of libraries and tools. Kubernetes, however, does not offer any support for the distribution of containers with real-time features. This thesis deals with integrating into Kubernetes advanced real-time scheduling features available on the Linux kernel today, such as SCHED_DEADLINE, a recently available scheduler for real-time tasks based on Earliest Deadline First, and in particular a hierarchical variant of the latter, made available as part of ongoing research that aims to use real-time algorithms to control the performance of complex software in a distributed environment. It becomes possible to specify containers with real-time characteristics, which Kubernetes deploys on a node of the cluster with sufficient real-time computational bandwidth, considering the multi-core characteristics of the nodes and ensuring the predictable execution of such containers.


Acknowledgements

I thank Fosca for being my sheltered beach, my little paradise, at a time when the storm had almost swept me away. It is above all thanks to her constant support and her love that, one step at a time, I have reached this milestone.

I thank my whole family, and in particular my father, who was always unmoved by my defeats, as if each of them were nothing compared to the love he has for me. In this way I too felt my defeats a little lighter. I thank my friends Roberto Tolu and Davide Garau, an inexhaustible source of good humour and encouragement. In these years at university, even though we were far apart, day after day I never stopped feeling their pleasant presence.

I thank my friend Fabio Deriu, because he is always willing to lend me a hand. He helped me find a job, giving more meaning to everything I have studied. I thank my professor Tommaso Cucinotta, who in this experience was also an excellent supervisor. His good humour, his help and his patience made a complex task like this Master's thesis a pleasant one.

Finally, I thank all my friends, near and far, who directly or indirectly contributed to the completion of my journey: I know you believe in me, and your very existence is fundamental to me.


Contents

1 Introduction
1.1 Containers and industrial practices
1.2 Thesis Work
1.3 Thesis Organization

2 Background concepts
2.1 Containers
2.1.1 Orchestration Cloud Platforms
2.1.2 cgroups
2.2 Hierarchical CBS scheduler
2.3 Kubernetes Concepts
2.3.1 Cluster Architecture
2.3.2 Deployment
2.3.3 Software Components
2.4 Wrap up

3 Realized work
3.1 Modified API
3.1.1 Real-Time resources
3.2 Modified Kubelet
3.2.1 Real-Time Policy
3.3 Modified Kube-scheduler
3.3.1 Fit Predicate
3.4 Wrap Up

4 Scenario
4.1 Environment
4.2 Create real-time Pods
4.2.1 Test real-time scheduling
4.2.2 CPUs assignment test
4.3 Adding a Pod
4.4 Unschedulable Pod
4.5 Wrap up

5 Conclusion
5.1 Future work


Chapter 1

Introduction

In recent years cluster computing has become a common strategy to achieve high computing power and high availability, and thus solve very complex tasks. In public cloud computing services, access to large clusters owned by companies like Amazon Web Services (AWS), Google Cloud, and Microsoft Azure is typically provided on demand to customers who would rather invest in their core business than in setting up, maintaining and operating IT infrastructures. Moreover, if the computational needs of the company change significantly over time, these platforms ensure high scalability, i.e. the ability to perform more advanced operations and manage higher amounts of data or, on the contrary, to scale performance down. The advantages that cloud solutions offer, and the savings that can result from their use, allow companies to be more productive and have a leaner structure.

The full adoption of public cloud computing services is prevented by some important problems that not all companies are prepared to tolerate, namely latency and security issues. Latency [9] plays a crucial role in cluster performance. Latency is the time from the source sending a packet to the destination receiving it. Often the user of such clusters is very far away from them, making latency increase to significant levels and/or introducing a significant amount of jitter, which makes the latency unpredictable. Security issues also discourage companies from adopting public cloud computing services: data is stored with a third-party provider and accessed over the Internet, making it vulnerable to data loss, data breaches, distributed denial of service (DDoS) attacks, etc.

More and more companies are moving towards the creation of a so-called private cloud, i.e. one where the physical infrastructure is owned by the company itself. In private clouds, the latency and security problems lose importance, since the user of the cluster is often the owner itself, located close to it. The adoption of private solutions is not only due to the problems of the public cloud: companies of a certain size adopt private cloud computing solutions because they want to obtain the advantages offered by this technology, namely flexible management of resources and therefore better exploitation of computing resources, on data centers they already own.

The possibility for a company to have its own private cloud is mainly due to the birth, in recent years, of open-source software capable of managing, with the given


configuration, a variety of different types of clusters. This software takes the name of Cloud Management & Orchestration, and is responsible for managing the various nodes of the network and scheduling tasks on the cluster. Since this software has to deal with the heterogeneous hardware the various nodes in the cluster offer, it makes massive use of virtualization technologies to provide the user with a transparent interface with respect to the underlying physical architecture.

Among the virtualization mechanisms they make use of, one of the most popular and efficient is container technology. Container technology offers greater efficiency in the use of resources than traditional machine virtualization techniques.

In a context where latency is not an issue, running real-time [5] tasks within the cloud becomes increasingly important. Having the ability to perform time-constrained tasks within a private cloud can help address classes of problems that public clouds cannot.

The work done in this thesis aims to satisfy the need to perform real-time tasks while preserving the transparency that modern virtualization techniques, such as container technology, offer.

1.1 Containers and industrial practices

Interest in real-time systems has grown significantly in recent years, mainly due to the considerable increase in the use of smart technologies and latency-sensitive applications such as cloud gaming, audio/video streaming and smart homes.

Video games have evolved in recent decades into a thriving entertainment industry. With the spread of broadband Internet, one of the key factors in the growth of games has been online gaming. In the past, games using a client-server model required the purchase and maintenance of dedicated on-premise servers or co-located servers to manage the online infrastructure, which only large companies and publishers could afford. With today's cloud-based computing resources, game developers and publishers of any size can request and receive any resource on demand, avoiding costly up-front monetary outlays and the dangers of over- or under-provisioning hardware. However, games are particularly vulnerable to the latency problems mentioned above and, for this reason, they need mechanisms to prevent temporal interference between tasks from degrading the quality of service [19][21].

The Cloud of Things (CoT) paradigm was born from the union of Cloud and Internet of Things (IoT) technologies. The CoT is expected to be a promising enabler of many real-world application scenarios, such as the smart house. However, several issues are still under discussion in CoT system design, including how to effectively manage the heterogeneity of IoT devices and how to support robust, low-latency communications between the cloud and the physical world. For this reason, the interest in scheduling real-time tasks is very high, giving rise to a very active field of research [18][20].

A very interesting field in which the need for real-time in the cloud has surfaced is that of industrial robots. For example, some companies are working on a project called Virtualized Robot Control [29][24], where various parts of a robot's motion control computation can be outsourced, to some extent, to a cloud system. The project aims to replace the hardware (HW) Programmable Logic Controller (PLC) with a software version and run it in a virtualized environment/cloud on commodity HW components. From the cloud platform perspective, one of the main challenges that virtualized control brings is the execution of real-time applications.

1.2 Thesis Work

Among the Cloud Management & Orchestration software mentioned in the introductory section we find Kubernetes, which makes containers its main execution unit. Over the years, Kubernetes has grown steadily in popularity to consistently be one of the most beloved open-source platforms [25] for automating the deployment, scaling, and management of containerized applications. Kubernetes is written in Go [27], also known as golang, an open-source programming language whose development began in 2007 at Google and which was introduced to the public in 2009. Go is a compiled, concurrent, garbage-collected, statically typed language, which makes programs written in Go simple, reliable, and efficient.

By enriching Kubernetes with new features, we want to bring it into the world of the real-time cloud. This has been achieved thanks to the integration in Kubernetes of real-time [5] functionalities available today in the Linux kernel, like SCHED_DEADLINE and its hierarchical variant [16]. Thereby Kubernetes can guarantee the predictable execution of complex software components such as containers. The work done in this thesis makes Kubernetes able to understand real-time requests from containers and to place them on the cluster nodes that have enough real-time (RT) computational bandwidth, considering at the same time the multi-core characteristics of the nodes. The code implementing Kubernetes with the real-time container feature is open-source and available on GitHub [26].

1.3 Thesis Organization

This thesis is organized as follows:

Chapter 2 explains the various concepts and technologies used to realize the work: in particular, concepts about containers, cloud orchestrators, and a patch to the Linux kernel that provides a real-time hierarchical scheduler. This chapter also presents important Kubernetes concepts that are fundamental to understanding the later topics.

Chapter 3 describes the work done in detail. It presents, step by step, the problems encountered and the decisions taken to solve or, when not possible, circumvent these obstacles.

Chapter 4 shows a simple use case of the modified version of Kubernetes, where we deploy some containers and see how CPUs are assigned to them.

Chapter 5 contains various considerations on the work done, as well as ideas for possible future work and improvements.


Chapter 2

Background concepts

Early on, organizations used to run applications on physical servers. Running applications on bare machines caused resource allocation issues: there was no way to define resource boundaries. An application could take up most of the server resources, making other applications underperform. This kind of interference among tasks is called noisy neighbour, and it causes temporal interference, significantly increasing the response times of the various tasks. In addition, having all tasks running on the bare machine does not provide sufficient security: a task can access the data of another task. A solution would be to run each application on a different physical server. But this did not scale, as resources were underutilized; furthermore, it was expensive to maintain many physical servers.

To overcome these problems, virtualization was introduced. Virtualization allows us to run multiple Virtual Machines (VMs) on a single physical machine. Applications running within a VM are isolated from other VMs, providing a level of security, since information cannot be freely accessed by other applications. Virtualization enables a better utilization of resources and better scalability, because applications can be added or updated easily, reducing hardware costs. With virtualization, a set of physical resources can be presented as a cluster of disposable virtual machines. In the case of massively parallel and multi-core machines, VMs are able to solve the problem of the noisy neighbour by dedicating a set of CPUs exclusively to each VM, though this approach results in an under-utilization of resources, especially in the case of real-time and interactive loads such as multimedia tasks. Virtualization does not come for free: it is necessary to virtualize the hardware of a machine, introducing non-negligible overhead. Each VM is a full machine running all the components, including its operating system, on top of the virtualized hardware.

Containers are similar to VMs, but they have relaxed isolation properties to share the Operating System (OS) among the applications. Therefore, containers are considered lightweight. Similar to a VM, a container has its own filesystem, CPU, memory, process space, and more. As they are decoupled from the underlying infrastructure, they are portable across clouds and OS distributions. Containers have grown in popularity mainly because of the reduced virtualization overhead, but they lack the security and robustness that traditional virtualization offers. For example, if the kernel of a containerized machine crashes, all hosted containers will crash. However, the guarantees that containers offer cover many application scenarios, especially in the case of private cloud computing, where the various tenants are not potentially hostile and competing users.

This chapter first explains the concepts and the technologies that make containers possible. After the background on containers, the two macro-components used to realize the work are introduced: the hierarchical scheduler and Kubernetes. The latter gets an in-depth review of its concepts and internals.

2.1 Containers

A container is a set of one or more tasks executing in a self-contained sandbox environment that includes all the dependencies the application needs to run successfully. Such an isolated environment is provided by means of an image, which makes the container portable and coherent among different computing systems.

Containers aim at resource isolation similar to that of virtual machines, but they work differently, because containers virtualize the operating system instead of the hardware. This makes containers more efficient and more portable than virtual machines. Container features include the following:

• Security: running an application in a container limits the damage caused by a possible security breach or violation. Within a container the set of possible actions is restricted, so an intruder is limited to such a set of actions.

• Isolation: containers allow the deployment of applications that operate in an environment different from the host one. For instance, multiple applications running in different containers can bind to the same physical network interface by using distinct IP addresses associated with each container.

• Virtualization and transparency: the general principle behind a container is to avoid changing the environment in which applications are running, except for addressing security or isolation issues. The virtualized environment of a container hides or limits the visibility of the physical devices or system configuration underneath it.

Figure 2.1: A Comparison of Applications Running in a Container to a Virtual Machine


2.1.1 Orchestration Cloud Platforms

In the last few years, the use of cloud computing has increased considerably. From a high-level view, cloud computing looks like a single huge system, capable of managing tasks of every type with very high performance. Although it looks like a single big system, cloud computing is underneath composed of a set of loosely or tightly connected computers, called a cluster. The management of this set of computers is left to cluster management software, which is in charge of managing such computers (nodes) and scheduling tasks on the cluster.

There are very different types of cluster management software, each one with a different goal that characterizes the product itself. This thesis deals with one of them, Kubernetes, which aims to automate the deployment of containers on a cluster.

Kubernetes

Kubernetes [23] (commonly stylized as k8s) is one of the most famous container orchestrators; it is open-source software completely written in Go. Kubernetes [3] was originally designed by Google and was first announced in mid-2014, and its 1.0 version was released in 2015. Along with the Kubernetes v1.0 release, Google partnered with the Linux Foundation to form the Cloud Native Computing Foundation (CNCF) and offered Kubernetes as a seed technology. A lot of companies across industries are pushing to move data and workloads to the cloud using Kubernetes as their Cloud Management & Orchestration software. Among the success stories of Kubernetes adoption we can find Airbnb [7], which has moved from a monolithic architecture to a microservice architecture, and eBay, which used k8s to run about 60 production clusters across 30,000 servers. Many other companies [6] adopted Kubernetes to manage their workloads.

Kubernetes aims to provide a platform for automating deployment, scaling, and operations of application containers across clusters of hosts. It works with a range of container tools, including Docker [13]. Many cloud services offer a Kubernetes-based platform or infrastructure as a service (PaaS or IaaS) on which Kubernetes can be deployed as a platform-providing service. Since the work is realized on Kubernetes, it will be analyzed more deeply in section 2.3.

OpenStack

OpenStack [4] is an open-source platform for building public and private clouds which delivers a massively scalable cloud operating system. OpenStack is more complex than Kubernetes, because it offers more functionality than the latter. It has been designed with the purpose of supporting an endless number of different scenarios, different technologies, different image formats, etc. It has a very active community behind it. OpenStack began in 2010 as a joint project of Rackspace Hosting and NASA. As of 2012, it is managed by the OpenStack Foundation, a non-profit corporate entity established in September 2012 to promote OpenStack software and its community. More than 500 companies have joined the project. Among the organizations that successfully adopted OpenStack to manage their workloads we can find CERN [17][22], which used OpenStack to build its private cloud, the Adobe Advertising Cloud [12], which runs OpenStack in production across six data centers in the US, Europe and Asia, and many others [28]. OpenStack is agnostic to the virtualization tool, hence it can also work with containers as the virtualized resource used to compose the cloud. The thesis work is a follow-up to a research line at the ReTiS Lab, where similar work has been done on OpenStack [10][11].

2.1.2 cgroups

A building block for containers is a Linux kernel feature called Control Groups [15][8], usually referred to as cgroups. This feature allows organizing tasks into hierarchical groups and fine-tuning the resource utilization of such groups. Resource control is implemented as a set of subsystems, each of them supposed to control a certain resource: CPU, memory, disk I/O, and network usage.

When a container runtime is asked to start a container, it creates cgroups for each subsystem it wants to limit, and associates the tasks of the new container with the just-created cgroups. Thanks to this behavior, containers can provide one of their main features, i.e. isolation.
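To make this concrete, the following minimal Go sketch mimics what a runtime does under a cgroup-v1 hierarchy mounted at /sys/fs/cgroup; the function name, group name and limit values are invented for illustration. It uses the standard CFS bandwidth files of the cpu subsystem; the HCBS patch of section 2.2 adds real-time counterparts that are tuned the same way.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// createAndJoinCgroup sketches what a container runtime does when it
// starts a container: create a cgroup under the cpu subsystem, set a
// resource limit, and attach a task to it.
func createAndJoinCgroup(name string, pid int) error {
	cg := filepath.Join("/sys/fs/cgroup/cpu", name)
	if err := os.MkdirAll(cg, 0o755); err != nil {
		return err
	}
	// Limit the group to 50ms of CPU time every 100ms period.
	if err := os.WriteFile(filepath.Join(cg, "cpu.cfs_period_us"), []byte("100000"), 0o644); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(cg, "cpu.cfs_quota_us"), []byte("50000"), 0o644); err != nil {
		return err
	}
	// Writing a pid into cgroup.procs moves the task into the group.
	return os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0o644)
}

func main() {
	if err := createAndJoinCgroup("demo-container", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}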

2.2 Hierarchical CBS scheduler

Hierarchical Constant Bandwidth Server [1][2] (HCBS) is a scheduler that extends the CPU scheduling class SCHED_DEADLINE of the Linux kernel for single-threaded real-time tasks. It allows us to associate a CPU reservation to a group of tasks, expressed in terms of a reservation runtime Q guaranteed on the CPU every reservation period P.

The process of assigning a CPU reservation to a group of tasks is realized by means of the cgroups. The CPU subsystem supports three files:

• cpu.rt_period_us: contains the period reservation in microseconds;

• cpu.rt_runtime_us: contains the runtime reservation in microseconds;

• cpu.rt_multi_runtime_us: contains a runtime reservation for each CPU, in microseconds.

When a group of tasks is assigned to a CPU cgroup, the SCHED_DEADLINE scheduler can select such a group to be scheduled; once selected, the fixed-priority real-time scheduler of the Linux kernel selects one of the tasks of the scheduled control group.

The HCBS patch imposes some rules when writing values in the files mentioned above, and to do so it uses the definition of CPU utilization factor. The CPU utilization factor of a real-time cgroup is defined as U = Q/P, and the rules HCBS imposes are:

• Given a cgroup, the sum of the utilization factors of its child cgroups cannot be greater than its own CPU utilization factor. Each CPU has its own utilization factor inside a cgroup, and this rule is applied for each CPU. So we are free to give a cgroup any period and any runtime we want, as long as we do not exhaust its parent's CPU utilization. A particular case is when a cgroup has a zero runtime reservation: this implies that any cgroup created under such a cgroup will not be able to have any runtime reservation.

• cpu.rt_runtime_us and cpu.rt_multi_runtime_us accept only values greater than zero.
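As a worked example of the first rule (values chosen purely for illustration): suppose a parent cgroup owns, on a given CPU, cpu.rt_period_us = 1000000 and cpu.rt_runtime_us = 500000, i.e. U_parent = 500000/1000000 = 0.5. A first child with Q = 100000 and P = 1000000 (U = 0.1) and a second child with Q = 200000 and P = 500000 (U = 0.4) together consume 0.1 + 0.4 = 0.5, exhausting the parent's utilization on that CPU; writing a non-zero runtime for a third child on the same CPU would then be rejected by the HCBS patch.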

2.3 Kubernetes Concepts

This section goes through Kubernetes concepts and internals. In order to understand the thesis work, it is fundamental to comprehend the Kubernetes concepts, including both the cluster component abstractions and the real pieces of software running on the cluster.

As already mentioned in the previous section, Kubernetes [23] (k8s) is one of the most famous container orchestrators; it is open-source software completely written in Go. It aims to provide a platform for automating deployment, scaling, and operations of application containers across clusters of hosts. It works with a range of container tools, including Docker. Many cloud services offer a Kubernetes-based platform or infrastructure as a service (PaaS or IaaS) on which Kubernetes can be deployed as a platform-providing service.

2.3.1 Cluster Architecture

Kubernetes coordinates a highly available cluster of computers that are connected to work as a single unit. The abstractions in Kubernetes allow you to deploy containerized applications to a cluster without tying them specifically to individual machines. A cluster in Kubernetes is composed of two types of resources:

• The Master coordinates the cluster

• Nodes are the workers that run applications

Master The Master is in charge of managing the cluster. It coordinates all activities in the cluster, such as scheduling applications, maintaining applications' desired state, scaling applications, and rolling out new updates.

Nodes The cluster is composed of a set of worker machines, which k8s calls Nodes. A node contains the services needed to run containerized applications, which include the kubelet, a container runtime (typically Docker [13]), and the kube-proxy. These services are analyzed more in-depth in subsection 2.3.3. Every cluster is composed of at least one node, but typically it has several. A node may be a virtual or physical machine, depending on the cluster.

2.3.2 Deployment

It is possible to run an application on the cluster by employing a Deployment configuration. Such a configuration instructs Kubernetes on how to create and update instances of the application. Once a Deployment is provided, the Kubernetes master schedules the application instances included in that Deployment to run on individual Nodes of the cluster.

Once the application is running, the role of the master is to continuously monitor it to keep it healthy.


Figure 2.2: Kubernetes Cluster

As an example, if the node on which the application is running goes down, the master runs the application on another available node.

Pods To deploy an application, Kubernetes creates Pods, which are the basic execution unit of a Kubernetes application. Each Pod represents a part of a workload that is running on your cluster. A Pod is a Kubernetes abstraction that represents a group of one or more application containers (such as Docker or rkt) and some shared resources for those containers. Those resources include:

• Shared storage, as Volumes

• Networking, as a unique cluster IP address

• Information about how to run each container, such as the container image version or specific ports to use

Containers in a Pod are always co-located and co-scheduled and run in a shared context on the same Node. Fig. 2.4 shows an overview of a Node running the pods of Fig. 2.3.

2.3.3 Software Components

This section introduces the various real components with which Kubernetes realizes the cluster and its functionality seen in the previous subsection.

The Kubernetes Master is a collection of three processes that run on a single node in the cluster; such a node is designated as the master node.

Those processes are: kube-apiserver, kube-controller-manager, and kube-scheduler. Each non-master node in the cluster runs two processes: kubelet and kube-proxy.


Figure 2.3: Pods overview

Figure 2.4: Node running two pods

kube-apiserver

The kube-apiserver is the component that validates and configures data for the API objects. The API Server services REST operations and provides the frontend to the cluster’s shared state through which all other components interact.

etcd

etcd [14] is a strongly consistent, distributed key-value store that provides a reliable way to store data. It is another software developed by the Cloud Native Computing Foundation. Kubernetes uses this software to implement a sort of shared memory, where each component can retrieve a consistent state of the entire cluster.

kube-scheduler

This component watches for newly created Pods with no assigned node, and selects a node for them to run on.

Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and deadlines.


kube-controller-manager

Control Plane component that runs controller processes.

Logically, each controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.

These controllers include:

Node controller: Responsible for noticing and responding when nodes go down.
Replication controller: Responsible for maintaining the correct number of pods for every replication controller object in the system.
Endpoints controller: Populates the Endpoints object (that is, joins Services & Pods).
Service Account & Token controllers: Create default accounts and API access tokens for new namespaces.

cloud-controller-manager

A Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider’s API, and separates out the components that interact with that cloud platform from components that just interact with your cluster.

The cloud-controller-manager only runs controllers that are specific to your cloud provider. If you are running Kubernetes on your own premises, or in a learning environment inside your own PC, the cluster does not have a cloud controller manager. As with the kube-controller-manager, the cloud-controller-manager combines several logically independent control loops into a single binary that you run as a single process. You can scale horizontally (run more than one copy) to improve performance or to help tolerate failures.

The following controllers can have cloud provider dependencies:

Node controller: For checking the cloud provider to determine if a node has been deleted in the cloud after it stops responding.
Route controller: For setting up routes in the underlying cloud infrastructure.
Service controller: For creating, updating and deleting cloud provider load balancers.

kubelet

An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod.

The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes.

kube-proxy

kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.

kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster. kube-proxy uses the operating system packet filtering layer if there is one and it is available. Otherwise, kube-proxy forwards the traffic itself.


Container runtime

The container runtime is the software that is responsible for running containers. Kubernetes supports several container runtimes: Docker, containerd, CRI-O, and any implementation of the Kubernetes CRI (Container Runtime Interface).

Figure 2.5: Components of Kubernetes

2.4 Wrap up

Container technology was born mainly out of the need to have the isolation properties of Virtual Machines but with a lightweight footprint, and hence better performance.

The growing popularity of containers on the one hand and cloud computing on the other combines these two worlds, giving birth to new software that manages computer clusters based on the container as a fundamental unit of distribution. This software is generally referred to as Orchestration Cloud Platforms.

In Linux, containers are possible thanks to a kernel functionality called cgroups: it allows us to manage resource limits, like CPU and memory, of a group of tasks.

Through cgroups, a Linux kernel patch named HCBS has been implemented that provides 2-level hierarchical scheduling. This patch allows associating a CPU reservation to a cgroup, which the Linux scheduler SCHED_DEADLINE uses to decide which group of tasks to schedule. Once a group has been scheduled by SCHED_DEADLINE, another fixed-priority scheduler decides which task of the group to schedule.

Kubernetes concepts and software components have been presented before diving into the thesis work, which aims to adapt the k8s components to exploit the HCBS kernel patch mentioned, enriching Kubernetes with a real-time container feature.


Chapter 3

Realized work

This chapter describes the work done at the architectural, design, and algorithmic level. The order of the sections follows the proposed approach, starting with the modification of the user interface and implementing at each lower layer the mechanisms to make this modification work. At each step, the problems that arise and their possible solutions are described. Each solution has its limits, which are analyzed in detail; then the choice of the adopted solution is justified.

The first section describes the changes made to the Kubernetes API in order to support containers that specify real-time constraints. The second section describes the changes made in the kubelet component, which runs in worker nodes. The third section analyzes the changes made in the kube-scheduler, in order to make it support the scheduling of real-time containers on the cluster nodes.

3.1 Modified API

Each k8s object can be expressed in a YAML format, which is accepted by the various API endpoints. For simplicity’s sake, this format will be used to refer to k8s objects, although you can also use a JSON format to interact with the various endpoints.

As already said, containers are distributed in the cluster through Pods; indeed, we specify containers in a Pod object description. Within the Pod specification, you can specify the amount of resources assigned to each container. Typically the requested resources are CPU and RAM memory, although other resources are defined. There are two ways to assign resources to a container: limits and requests. Limits represent the maximum amount of resources to allocate to the container, while requests represent the minimum amount of resources needed by the container in order to run. Listing 3.1 shows the definition of a Pod in which the container requires a minimum amount of half a CPU and a limit of one CPU.

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"

Listing 3.1: Pod YAML definition with a Container with resource assignments

Kubernetes does not provide any request of type real-time, so we have to add this type of request. The next subsection addresses the addition of three new types of real-time requests.

3.1.1 Real-Time resources

The resources request is the perfect place to add the new resource definitions: rt_period, rt_runtime and rt_cpu. The period and runtime are interpreted as microseconds, which is the unit of measurement used by the HCBS scheduler, while the real-time CPU request is an integer, representing the number of CPUs the container needs to run. With the addition of these new resources, the YAML definition of a Pod looks like:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      requests:
        rt_period: "1000000"
        rt_runtime: "100000"
        rt_cpu: "2"

Listing 3.2: Pod YAML definition with a Container with real-time requests

Translating the request above into human language: deploy a Pod with a container that needs a real-time budget of 100 milliseconds of runtime every 1 second of period, running on two CPUs.

3.2 Modified Kubelet

The kubelet component, which runs on the node, now has new responsibilities:


• if the containers of the new Pod specify a quantity of CPUs, decide on which CPUs each container runs;

• correctly tune the HCBS parameters of the container cgroups.

When the above tasks are executed correctly, the HCBS scheduler will schedule at each moment the container with the most imminent deadline.

Kubernetes creates, under the cgroup directory of the CPU subsystem, a hierarchy of nested cgroups to host all of its Pod requests. Under the root of the Kubernetes cgroups, named kubepods, we can find two folders named burstable and besteffort. At this level, k8s divides the cgroups of the Pods by Quality of Service class. k8s defines three QoS classes, and each Pod belongs to exactly one of them:

• Guaranteed: in order for a Pod to belong to this class:

– each of its Containers must have a memory limit and a memory request, and they must be the same;

– each of its Containers must have a CPU limit and a CPU request, and they must be the same.

• Burstable: in order for a Pod to belong to this class:

– it does not match the criteria for QoS class Guaranteed;

– at least one of its Containers has a memory or CPU request.

• Best-effort: in order for a Pod to belong to this class:

– its Containers must not have any memory or CPU limits or requests.

cgroups for the Guaranteed QoS class are created directly under the root cgroup kubepods, while the cgroups for Pods belonging to the Burstable and Besteffort classes are created under the respective class cgroup, as sketched below.
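Putting the pieces together, the hierarchy under the cpu subsystem looks roughly as follows (pod UIDs and container IDs are placeholders; the concrete paths observed in chapter 4 follow this shape):

/sys/fs/cgroup/cpu/
└── kubepods/
    ├── pod<uid>/                 (Guaranteed Pods, directly under kubepods)
    ├── burstable/
    │   └── pod<uid>/
    │       └── <container-id>/   (one cgroup per container)
    └── besteffort/
        └── pod<uid>/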

Since the new resources are specified only as requests (real-time requirements are rigorous), the new real-time Containers will always belong to the Burstable QoS class.

To enable the new real-time feature, some new flags must be added to the kubelet command, in particular:

• rt-hcbs: makes the kubelet aware that the HCBS patch is available on the local kernel.

• rt-period: makes the kubelet aware of the period reservation capacity it owns for allocating to containers.

• rt-runtime: makes the kubelet aware of the runtime reservation capacity it owns for allocating to containers.


The two flags rt-period and rt-runtime are used to set the respective files of the kubepods cgroup. Moreover, these same flags determine how much capacity the kubelet has for allocating containers.

As an example, if we start the kubelet with --rt-period=1s --rt-runtime=800ms, this implies that the kubelet has a utilization capacity of rt_util = rt_runtime / rt_period = 0.8. As long as the sum of the utilization factors of the Pod cgroups under the kubepods cgroup does not exceed this capacity, the HCBS patch allows us to allocate new groups; this limitation was explained in section 2.2.
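As an illustration of how the per-CPU runtimes end up in the cgroup filesystem, the Go sketch below writes a cpu.rt_multi_runtime_us file. The helper name and the assumption that the file accepts the same space-separated format it reports when read are ours, not the kubelet's actual code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// writeRTMultiRuntime writes a per-CPU runtime reservation (microseconds)
// into the cpu.rt_multi_runtime_us file of a container cgroup, e.g.
// runtimes = [0, 100000, 100000, 0] reserves 100ms per period on CPUs 1 and 2.
func writeRTMultiRuntime(cgroupPath string, runtimes []int64) error {
	vals := make([]string, len(runtimes))
	for i, r := range runtimes {
		vals[i] = fmt.Sprintf("%d", r)
	}
	f := filepath.Join(cgroupPath, "cpu.rt_multi_runtime_us")
	return os.WriteFile(f, []byte(strings.Join(vals, " ")), 0o644)
}

func main() {
	// Hypothetical cgroup path, for illustration only.
	cg := "/sys/fs/cgroup/cpu/kubepods/burstable/pod-example/container-example"
	if err := writeRTMultiRuntime(cg, []int64{0, 100000, 100000, 0}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}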

3.2.1 Real-Time Policy

The standard cpumanager associates CPUs to containers in an exclusive way. However, this policy forbids the feature we are introducing in this thesis: indeed, we aim to run concurrently more than one container on each CPU, each one with a different real-time reservation.

For this reason, a new policy, called real-time policy, has been implemented, which allocates CPUs to containers based on their real-time utilization, following a worst-fit logic. When a container requires a certain number n of CPUs, the cpu-manager using the real-time policy assigns to the container the "freest" n CPUs. Fig. 3.1 shows a case where a container requests a utilization of 0.3 on two CPUs, and the node has four CPUs. The CPUs are already partially occupied by some other containers. As we can see, CPUs 2 and 3 have a lower load than CPUs 1 and 4. Also, CPU 1 would not be able to accommodate the new container: if CPU 1 were assigned to the container, it would imply a utilization factor greater than one, which is not permissible. Recalling section 2.2, the CPU utilization factor is defined as U = Q/P, where Q is the runtime reservation and P is the period reservation.

Figure 3.1: Worst-fit logic
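The following Go sketch captures the worst-fit selection just described; it is an illustrative re-implementation, not the kubelet's actual code, and the admission bound of one utilization unit per CPU follows the figure.

package main

import (
	"fmt"
	"sort"
)

// pickWorstFit returns the indices of the n least-loaded CPUs that can
// still accommodate an extra per-CPU utilization u (u = Q/P).
func pickWorstFit(load []float64, n int, u float64) ([]int, bool) {
	idx := make([]int, len(load))
	for i := range idx {
		idx[i] = i
	}
	// Sort CPUs by increasing load: the "freest" CPUs come first.
	sort.Slice(idx, func(a, b int) bool { return load[idx[a]] < load[idx[b]] })
	if len(idx) < n || load[idx[n-1]]+u > 1.0 {
		return nil, false // not enough CPUs with spare utilization
	}
	return idx[:n], true
}

func main() {
	// The situation of Fig. 3.1: four CPUs, a container asking for
	// utilization 0.3 on two of them.
	load := []float64{0.8, 0.2, 0.3, 0.6}
	cpus, ok := pickWorstFit(load, 2, 0.3)
	fmt.Println(cpus, ok) // [1 2] true: CPUs 2 and 3 in the figure's 1-based numbering
}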

3.3 Modified Kube-scheduler

The kube-scheduler has to take the newly defined resources into account when it decides on which node to deploy a Pod. Periodically the kube-scheduler checks whether new Pods are available for deployment.


When a new Pod is available for deployment, the kube-scheduler, based also on the resources required by the Pod and those available on each Node, decides on which node of the cluster to deploy the Pod. Furthermore, this procedure has to deal with the new real-time resources. The kube-scheduler knows the Pods running on each node and how many CPUs each Pod is using; however, it does not know on which CPUs each container is running. The choice of the CPUs on which to run each container is a kubelet internal, which does not communicate with the outside world. If a node is running two containers, each of which requires two CPUs to run, then the occupied CPUs could overlap partially, totally, or not at all. This situation does not help the kube-scheduler in the choice of the node when deploying a Pod.

Let us take a look at a scenario where, depending on the CPUs assigned to each container, a new Pod may or may not be schedulable. Suppose we have, for each node, a maximum utilization of one per CPU and four available CPUs. Each node is running two containers, and each container has a utilization of 0.4 on 2 CPUs. However, the nodes have assigned the CPUs to the containers differently. Fig. 3.2 shows this case. The new Pod in the scheduling queue requests a utilization equal to 0.4 (the figure reports this value as a percentage) on 3 CPUs. Node one is not able to schedule such a Pod: CPU1 and CPU2 do not have enough capacity, and while CPU3 and CPU4 are free, one more CPU is missing to satisfy the Pod request. Node two, on the other hand, is able to schedule the new Pod, having on each CPU an available capacity equal to 0.6.

Figure 3.2: Scheduling scenario

However, as already said, from the kube-scheduler point of view these nodes look equal. In Fig. 3.3 we can see another case where a Pod requesting a utilization of 0.8 on 2 CPUs is schedulable by node one but not by node two.

Changing the algorithm used to assign CPUs to containers does not solve this problem: indeed, scenarios like the above can also arise after a container removal. The only exact solution to this problem would be for the kube-scheduler to know the available utilization of each CPU.

Given these limits, we can follow a best-effort approach to solve this problem. One way to do so is to compute all possible combinations of CPU assignments to containers: the greater the number of combinations for which the container can be scheduled, the greater the probability that such a container can be scheduled by that node. However, this approach has a very high complexity, depending on the number of CPUs and the number of Containers, so it is not a practicable way.


Figure 3.3: Another scheduling scenario

The approach finally followed is described in the following subsection.

3.3.1 Fit Predicate

In the Kubernetes internals, the logic used to decide whether a node is able to schedule a container is called a fit predicate. We must add to these predicates the logic below, which manages the real-time containers.

Before describing the approach we followed, we must modify the definition of the real-time utilization factor:

U = (rt_runtime / rt_period) · rt_cpu

The only difference from the old definition is that we multiply by rt_cpu. So, in order to compute the actual utilization of a node, we sum the utilization factors of the containers running on that node:

U_Node = Σ_i U_Ci = Σ_i (rt_runtime_i / rt_period_i) · rt_cpu_i

From the new definition we can deduce that the node utilization factor is a number ranging from zero to the number of CPUs of the node.

The approach finally used to decide whether a container can fit on a node is to sum the utilization of the various containers running on the node with the utilization of the new container; if this sum is less than the capacity of the node, then we can deploy the container on the node.

U_Node + U_Pod ≤ U_Capacity

However, this formula is "too optimistic": we are assuming that it is always possible to fill every single CPU of the node, but as we have seen this is only possible for certain configurations of container-to-CPU assignments. Hence we have to relax this limit and introduce a factor to be multiplied by the node capacity.

U_Node + U_Pod ≤ σ · U_Capacity


where σ is a safety threshold factor ranging from zero to one. This way the node will not be filled to the maximum: when the left-hand side of the inequality approaches the right-hand side, the probability that the new container can actually be scheduled remains high, since we keep a margin of error of (1 − σ) · U_Capacity. The lower the safety threshold, the higher the probability that new containers can actually be scheduled when we approach the utilization limit. However, lowering this factor too much would lead to underutilization of the node.
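A minimal Go sketch of the resulting predicate follows; the type and function names are invented here for illustration, and the numbers in main reproduce the check worked out later in section 4.4.

package main

import "fmt"

// rtRequest describes a container's real-time request as defined in
// section 3.1.1 (period and runtime in microseconds).
type rtRequest struct {
	periodUs  int64
	runtimeUs int64
	cpus      int64
}

// utilization implements U = (rt_runtime / rt_period) * rt_cpu.
func (r rtRequest) utilization() float64 {
	return float64(r.runtimeUs) / float64(r.periodUs) * float64(r.cpus)
}

// fits sketches the heuristic fit predicate:
// U_Node + U_Pod <= sigma * U_Capacity.
func fits(running []rtRequest, pod rtRequest, capacity, sigma float64) bool {
	var uNode float64
	for _, c := range running {
		uNode += c.utilization()
	}
	return uNode+pod.utilization() <= sigma*capacity
}

func main() {
	// Three running containers on a node with per-CPU capacity 0.8
	// on 4 CPUs, sigma = 0.8.
	running := []rtRequest{
		{1000000, 100000, 2}, // U = 0.2
		{1000000, 50000, 3},  // U = 0.15
		{1000000, 300000, 3}, // U = 0.9
	}
	pod := rtRequest{1000000, 500000, 3}        // U = 1.5
	fmt.Println(fits(running, pod, 0.8*4, 0.8)) // false: 1.25 + 1.5 > 2.56
}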

Finally, if the kube-scheduler deploys a new container to a node that is not able to schedule it, the Pod returns to the scheduling queue, and the kube-scheduler memorizes that the node was not able to schedule such a container. When the kube-scheduler retries to deploy the Pod, it will try a different node. This behaviour is already implemented by the standard kube-scheduler.

Considering that orchestration software such as Kubernetes is designed to deal with application failures, this is not a serious problem, and the proposed approach is a good trade-off between kube-scheduler algorithm complexity and error probability.

3.4 Wrap Up

The work done follows a top-down approach, starting from the modification of the user interface down to the underlying modifications of the k8s components.

The k8s API, when defining a Pod, has a section to specify resource assignments for the containers. This is the place where three new resources have been introduced to support real-time constraints on containers: rt_period, rt_runtime and rt_cpu, where rt_period expresses the container reservation period in microseconds, rt_runtime expresses the container reservation runtime in microseconds, and rt_cpu expresses the number of CPUs needed by the container.

The introduction of the new resources needs to be managed by the kubelet component. The new kubelet has to decide how to assign CPUs to Pod containers and to correctly initialize the cgroup of each container, by writing in the cpu.rt_multi_runtime_us file the runtime for each CPU used. To assign the CPUs to the containers, the new kubelet uses a newly implemented real-time policy, which assigns CPUs to containers with a worst-fit logic.

The kube-scheduler component was also impacted by the introduction of the new resources. The main limit to scheduling is that the kube-scheduler does not know how each node assigned CPUs to its containers. In this scenario, the best the kube-scheduler can do is to compute a non-exact fit predicate that heuristically designates a container as schedulable or not on a given node. The container could still turn out not to be schedulable by the node chosen by the kube-scheduler. In this case, the container returns to the kube-scheduler's scheduling queue; the kube-scheduler notes that the node was not able to schedule that container and then tries with a different node.


Chapter 4

Scenario

This chapter explains how to execute the modified version of Kubernetes. Furthermore, it shows the creation of new Pods on the cluster and the tests done to check that the containers run with real-time scheduling parameters.

4.1 Environment

To use the real-time version of Kubernetes, we first need to install its dependencies on our local machine:

• the Linux kernel modified with the HCBS patch, needed to assign real-time bandwidth to specific CPUs;

• Kubernetes needs a Container Runtime in order to work correctly. To meet this requirement, we can install Docker [13].

Kubernetes stores cluster objects using etcd as shared memory, so we need to provide it. Fortunately, k8s comes with an etcd binary in its codebase. Before continuing with the next section, let us export the path where k8s can find the etcd binary, by running:

export PATH="./third_party/etcd:${PATH}"

The k8s codebase provides some scripts that aim to ease the development and test tasks. Once we have downloaded the codebase of the modified Kubernetes, we can find such scripts inside the hack directory, inside the base directory of the codebase. Within the hack directory the script local-up-cluster.sh builds and runs locally a Kubernetes cluster with one worker node.

Since this script runs every k8s component without any virtualization method, it gives more control over what is happening inside the system. Given this fact we are going to use this script to test all the added features.

We must pass some variables to the script in order to correctly configure the components to use our modifications:

• ALLOW_PRIVILEGED: consents Kubernetes to deploy containers with extended privileges; this is needed to execute, inside the containers, commands with chrt, which gives real-time attributes to a process.


• enforce-node-allocatable="pods": to configure the overall real-time bandwidth available for all the pods on the node, by writing the runtime and period cgroup entries of the kubepods and QoS cgroups;

• cpu-manager-policy=real-time: to make the kubelet use the new real-time policy described in section 3.2.1.

ALLOW_PRIVILEGED="true" \

KUBELET_FLAGS=’--enforce-node-allocatable="pods" \

--cpu-manager-policy=real-time’ ./hack/local-up-cluster.sh Running the command above set up our test environment.

4.2 Create real-time Pods

We can interact with Kubernetes using a command-line tool named kubectl. This is the preferred way to interact with Kubernetes, rather than interacting with the API directly: kubectl translates each command into an HTTP request and sends it to the Kubernetes API.

To make Kubernetes deploy a new real-time container, we have to first define it, possibly in a YAML format:

apiVersion: v1
kind: Pod
metadata:
  name: test-1000000-100000-2-schedule
spec:
  containers:
  - name: rtcontainer1
    image: busybox
    command: ["/bin/sh", "-c"]
    resources:
      requests:
        rt_period: 1000000
        rt_runtime: 100000
        rt_cpu: 2
    securityContext:
      privileged: true
    args:
    - "chrt -r 1 yes > /dev/null"

Listing 4.1: container running yes with round-robin attributes

We are trying to deploy a Pod with a Container that runs the command yes with round-robin scheduling attributes. The command yes outputs a string repeatedly until killed; if called without arguments it prints y repeatedly, which means it consumes all the CPU at its disposal. The output is redirected to /dev/null, otherwise its log would grow indiscriminately. Its real-time request is 1 second of period and 100 milliseconds of runtime, on two CPUs. To deploy this Pod using kubectl, write the definition in a pod.yaml file and run:

(33)

4.2. CREATE REAL-TIME PODS

kubectl apply -f pod.yaml

Listing 4.2: deploy a Pod with kubectl

This will send a request to the kube-apiserver, which validates the request and then creates the Pod as a shared object of the cluster in etcd. When the kube-scheduler sees this Pod, it will try to schedule it, and since we only have one node, it only has to evaluate whether the new Pod fits on the Node. The Pod requires a utilization of 0.1 on each of its two CPUs (0.2 in total), so it fits on our node.
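Checking the fit predicate of section 3.3.1 by hand, and assuming the same σ = 0.8 and per-CPU capacity of 0.8 used in section 4.4: U_Pod = (100000/1000000) · 2 = 0.2 and, with an empty node, U_Node + U_Pod = 0.2 ≤ σ · U_Capacity = 0.8 · (0.8 · 4) = 2.56, so the Pod passes the test.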

Running the following command will show us the state of the Pod in the cluster:

> kubectl get pods
NAME                             READY   STATUS    RESTARTS   AGE
test-1000000-100000-2-schedule   1/1     Running   0          8s

The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The phases we are interested in are:

• Pending: the Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created. This includes time before being scheduled;

• ContainerCreating: the Pod has been scheduled on the node, which is creating the containers of the Pod;

• Running: the Pod and its containers are running on the node;

4.2.1 Test real-time scheduling

Once the pod is running we can test, using some of the tools provided by Linux, whether the container is scheduled with real-time parameters (round-robin in this case).

> kubectl get pod test-1000000-100000-2-schedule \
  --template="{{(index .status.containerStatuses 0).containerID}}"
docker://553afc90665111c405d0500eff157daf2d77bb2048d014411d795a40d6707f1e

This command returns the ID that docker assigned to the Pod container, which we can use to obtain the process ID on the host:

> docker top 553afc906651 -o pid,cls,%cpu
PID      %CPU   CLS
416943   20.0   RR

The output of this command returns the process ID of the container on the local machine, the total percentage of CPU used, and the scheduling class used to schedule the process. In our case, the command returns the class RR, which states that round-robin scheduling is used for the process. This assures us that the container is being scheduled with real-time parameters. Hence the container must have a reservation in its respective cgroup; otherwise, Linux would not have allowed starting it with the chrt command, returning the error:

chrt: failed to set pid 0’s policy: Operation not permitted


4.2.2 CPUs assignment test

We can see that the container is using 20% of CPU time, which is what we expect, since we have defined a utilization equal to 0.1 on 2 CPUs and the running process is yes, a command that drains the entire bandwidth at its disposal.

We can see the specific CPUs the container is using by running:

> docker inspect --format='{{.HostConfig.CpusetCpus}}' 553afc906651
1-2

In our case, the CPUs that the cpu-manager chose to run this container are CPU 1 and CPU 2. We can see this information because the cpu-manager not only sets the values in the cpu.rt_multi_runtime_us file, but also sets the values for the cpuset subsystem, another cgroup subsystem that allows binding processes to a group of CPUs.

The CPUs on which the Pod is allowed to run do not say much about the CPUs really used: indeed, if one of the two CPUs had runtime zero in the cpu.rt_multi_runtime_us file, the group of tasks would use only one CPU. To verify the real reservation on each CPU, we have to check the cpu.rt_multi_runtime_us file inside the container cgroup. To obtain the cgroup to which the container is assigned, run the command:

> docker inspect --format='{{.HostConfig.CgroupParent}}' 553afc906651
/kubepods/burstable/pod60b5fe66-6b8a-4dae-810a-3f8224bead73

The command returns the cgroup parent of the container cgroup. The cgroup path above is missing the cpu subsystem path, because it depends on where the local machine has mounted this subsystem, but typically it is mounted on /sys/fs/cgroup/cpu/. Joining the subsystem path and the cgroup parent path, we can finally reach the Pod cgroup and read the cpu.rt_multi_runtime_us file.

> CGPOD=kubepods/burstable/pod60b5fe66-6b8a-4dae-810a-3f8224bead73
> cd /sys/fs/cgroup/cpu/$CGPOD
> cat cpu.rt_multi_runtime_us
0 100000 100000 0
> cd 553afc90665111c405d0500eff157daf2d77bb2048d014411d795a40d6707f1e
> cat cpu.rt_multi_runtime_us
100000 100000 0 0

Under the cgroup parent path, docker creates another cgroup named with the container ID; the commands above show that this cgroup is also correctly initialized.

4.3 Adding a Pod

We can test whether another added Pod will be placed on the free CPUs, i.e. CPU 0 and CPU 3. Running listing 4.2 again, but changing the name value, creates another Pod in the cluster; additionally, we change the rt_runtime value to 50000, i.e. 50 milliseconds of runtime on a 1 second period, and the number of CPUs the container needs to 3.


After we have created the new Pod, we can run the steps above again to obtain the CPUs on which the container is running and to see the runtime per CPU. What we expect is that the cpu-manager chooses CPU 0 and CPU 3, plus one CPU between CPU 1 and CPU 2.

> kubectl get pod test-1000000-50000-3-schedule \
  --template="{{(index .status.containerStatuses 0).containerID}}"
docker://ba6c1cb4aecf2ef497e60b31908ccf1672002faf88f5e0fa89d82595925c7dd4
> CID=ba6c1cb4aecf2ef497e60b31908ccf1672002faf88f5e0fa89d82595925c7dd4
> docker inspect --format='{{.HostConfig.CpusetCpus}}' $CID
0-1,3
> docker inspect --format='{{.HostConfig.CgroupParent}}' $CID
/kubepods/burstable/podcc73616e-e7c7-4c89-a372-e42897265e38
> CGPOD=/kubepods/burstable/podcc73616e-e7c7-4c89-a372-e42897265e38
> cd /sys/fs/cgroup/cpu/$CGPOD
> cat cpu.rt_multi_runtime_us
50000 50000 0 50000
> cat $CID/cpu.rt_multi_runtime_us
50000 50000 0 50000

As we can see, in addition to CPU0 and CPU3 the cpu-manager has chosen CPU1; hence, up to now we have a CPU utilization equal to:

\[ U_{CPU_0} = \frac{50000}{1000000} = 0.05 \]
\[ U_{CPU_1} = \frac{50000}{1000000} + \frac{100000}{1000000} = 0.05 + 0.1 = 0.15 \]
\[ U_{CPU_2} = \frac{100000}{1000000} = 0.1 \]
\[ U_{CPU_3} = \frac{50000}{1000000} = 0.05 \]

If we add another container requesting three CPUs, we expect CPU0, CPU2 and CPU3 to be assigned, since they are the three least-loaded CPUs. We can once again modify the Pod definition in listing 4.1 with a different name, requesting three CPUs with a runtime equal to 300000 microseconds:

> kubectl get pod test-1000000-300000-3-schedule \
  --template="{{(index .status.containerStatuses 0).containerID}}"
docker://58649b82b571df94ed9c7846abbc042a874dfd466e67d759509b796ea1da5c5d
> CID=58649b82b571df94ed9c7846abbc042a874dfd466e67d759509b796ea1da5c5d
> docker inspect --format='{{.HostConfig.CpusetCpus}}' $CID
0,2-3
> docker inspect --format='{{.HostConfig.CgroupParent}}' $CID
/kubepods/burstable/podfc9816fa-5b18-4993-89fc-0b49164c60a2
> PID="podfc9816fa-5b18-4993-89fc-0b49164c60a2"
> cd /sys/fs/cgroup/cpu/kubepods/burstable/$PID
> cat cpu.rt_multi_runtime_us
300000 0 300000 300000
> cat $CID/cpu.rt_multi_runtime_us
300000 0 300000 300000

As we can see, we obtained what we expected: the cpu-manager assigned CPUs 0, 2 and 3 to the new Pod.
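
This placement matches the worst-fit logic used by the modified cpu-manager: among the node's CPUs, it picks those with the largest residual utilization. The following is a minimal Go sketch of this selection; names and structure are ours, not the actual cpu-manager code:

package main

import (
	"fmt"
	"sort"
)

// pickWorstFit returns the indices of the n CPUs with the smallest
// allocated utilization, i.e. the largest residual capacity (worst-fit).
func pickWorstFit(util []float64, n int) []int {
	idx := make([]int, len(util))
	for i := range idx {
		idx[i] = i
	}
	// Stable sort by ascending allocated utilization, so ties keep
	// the lower CPU index first.
	sort.SliceStable(idx, func(a, b int) bool { return util[idx[a]] < util[idx[b]] })
	return idx[:n]
}

func main() {
	// Node state before the third Pod (see the utilizations above).
	util := []float64{0.05, 0.15, 0.1, 0.05}
	fmt.Println(pickWorstFit(util, 3)) // prints [0 3 2]
}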

4.4 Unschedulable Pod

Once again we recap the CPU utilizations, so we can finally test a case where the fit-predicate of the kube-scheduler is satisfied but the container will not fit in the node:

\[ U_{CPU_0} = \frac{50000}{1000000} + \frac{300000}{1000000} = 0.35 \]
\[ U_{CPU_1} = \frac{50000}{1000000} + \frac{100000}{1000000} = 0.15 \]
\[ U_{CPU_2} = \frac{100000}{1000000} + \frac{300000}{1000000} = 0.4 \]
\[ U_{CPU_3} = \frac{50000}{1000000} + \frac{300000}{1000000} = 0.35 \]
\[ U = \sum_i U_{CPU_i} = 0.35 + 0.15 + 0.4 + 0.35 = 1.25 \]

Given this state of the CPUs, if we try to start another Pod with a container that requests a utilization of 0.5 on 3 different CPUs, we end up in a case where the fit-predicate would test:

\[ \sigma \cdot U_{Capacity} = 0.8 \cdot (0.8 \cdot 4) = 2.56 \geq U_{Pod} + U_{Node} = 0.5 \cdot 3 + 1.25 = 2.75 \]

The above inequality does not hold, so if we try to run a container with such a request it should remain pending until a node with enough utilization becomes available. We repeat the commands in 4.1 and 4.2, deploying a container with 500000 microseconds of runtime on 3 CPUs, and check whether the container remains pending.

NAME                             READY   STATUS    RESTARTS   AGE
test-1000000-100000-2-schedule   1/1     Running   0          9m37s
test-1000000-300000-3-schedule   1/1     Running   0          8m44s
test-1000000-50000-3-schedule    1/1     Running   0          8m53s
test-1000000-500000-3-schedule   0/1     Pending   0          5m22s

After five minutes the Pod is still in Pending status, but if we remove the container test-1000000-100000-2-schedule the fit-predicate inequality becomes:

\[ \sigma \cdot U_{Capacity} = 0.8 \cdot (0.8 \cdot 4) = 2.56 \geq U_{Pod} + U_{Node} = 0.5 \cdot 3 + 1.25 - 0.1 \cdot 2 = 2.55 \]


Now the fit-predicate is verified, but in reality the Pod cannot be scheduled, since the node does not have three CPUs with an available utilization equal to 0.5:

\[ U_{CPU_0} = \frac{50000}{1000000} + \frac{300000}{1000000} = 0.35 \]
\[ U_{CPU_1} = \frac{50000}{1000000} = 0.05 \]
\[ U_{CPU_2} = \frac{300000}{1000000} = 0.3 \]
\[ U_{CPU_3} = \frac{50000}{1000000} + \frac{300000}{1000000} = 0.35 \]

CPU1 and CPU2 fit the request, but CPU0 and CPU3 only have an available utilization equal to 0.45. Indeed, if we run kubectl get pods once again we will see an error:

NAME                             READY   STATUS              RESTARTS   AGE
test-1000000-300000-3-schedule   1/1     Running             0          8m44s
test-1000000-50000-3-schedule    1/1     Running             0          8m53s
test-1000000-500000-3-schedule   0/1     PreStartHookError   2          2m22s

The PreStartHook is an internal phase of the Pod lifecycle: before the kubelet starts the container, it runs the PreStartHook, which includes the cpu-manager actions. If we look at the kubelet logs we can see that the cpu-manager has not found sufficient CPUs to run the Pod. If the cluster had more nodes, the kube-scheduler would have tried to schedule the Pod on other nodes.
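
In practice, the failure can also be inspected with the standard kubectl describe command, which reports the Pod's events (the exact event text depends on the kubelet build):

> kubectl describe pod test-1000000-500000-3-schedule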

4.5 Wrap up

In this chapter we showed how to use the modified version of Kubernetes, which allows us to run containers with real-time scheduling parameters.

Before starting to use k8s we must prepare our environment by installing:

• Linux kernel with the HCBS patch;

• Docker [13].

Once these two dependencies are installed, we can run the modified k8s with the script local-up-cluster.sh, which locally simulates a cluster composed of a single node. This script does not virtualize the components of k8s, allowing us to test the new RT features.

ALLOW_PRIVILEGED="true" \
KUBELET_FLAGS='--enforce-node-allocatable="pods" \
--cpu-manager-policy=real-time' ./hack/local-up-cluster.sh

Once k8s is running we can test the new features by defining a Pod in YAML format, as shown in listing 4.1, and running:

> kubectl apply -f pod.yaml

This will create a Pod in the cluster and will try to deploy it to a Node. We can check the status of the Pod running:


> kubectl get pods

If the Pod has been deployed correctly, this command will show the status Running for the newly defined Pod.

Finally, we have checked that all the files of the container and Pod cgroups have been set correctly and that k8s behaves as expected. This includes:

• CPUs are assigned to containers following the worst-fit logic;

• Containers are not deployed to a node that does not satisfy the kube-scheduler fit-predicate (see the sketch below);

• When the fit-predicate is satisfied but the cpu-manager cannot find sufficient CPUs, an error is returned to the user.
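
To make the last two checks concrete, here is a minimal Go sketch of the node-level fit-predicate used throughout this chapter; names and structure are ours, not the actual kube-scheduler code:

package main

import "fmt"

// fitPredicate sketches the admission test of section 4.4: a Pod fits a node
// if sigma * (perCPULimit * numCPUs) >= podUtil + nodeUtil.
func fitPredicate(sigma, perCPULimit float64, numCPUs int, podUtil, nodeUtil float64) bool {
	capacity := sigma * (perCPULimit * float64(numCPUs))
	return capacity >= podUtil+nodeUtil
}

func main() {
	// The unschedulable Pod of section 4.4: utilization 0.5 on 3 CPUs,
	// on a node already loaded with U = 1.25.
	fmt.Println(fitPredicate(0.8, 0.8, 4, 0.5*3, 1.25)) // prints false (2.56 < 2.75)
}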


Chapter 5

Conclusion

This thesis work modifies the Kubernetes source code, adding the possibility to deploy containers with real-time characteristics in the cluster. This was achieved successfully by modifying the kubelet and kube-scheduler components, exploiting the features made available by the HCBS patch to the Linux kernel. To demonstrate the correct functioning of these modifications, chapter 4 reports the checks done to ensure everything happens correctly. As we have seen, it was possible to deploy a container so that it could be scheduled according to real-time policies. In a real scenario, such a container would not suffer time interference from other tasks and/or containers. In the contexts presented in the introduction, this leads to the capability of predictable processing for real-time applications, as in the case of gaming, audio/video streaming and smart home scenarios, industrial robotics, and many others.

Note that this thesis work is a prototype, intended to show that we can meet real-time constraints using Kubernetes. It is part of ongoing research, which will require further developments to make a significant impact. Firstly, the kernel's real-time scheduling functionality will need to be refined and possibly absorbed into the kernel mainline, or noticed by a Linux distributor (e.g., RedHat, Canonical) that makes the HCBS kernel available through a simple package manager install. Secondly, some k8s core developers should be convinced that the changes made to k8s are right and useful. Finally, these changes require engineering to make them usable in production scenarios, which goes beyond the scope of a master thesis project.

5.1 Future work

During the course of the work, many obstacles were encountered. Some of them were raised by the HCBS patch: in fact, the patch has some limitations that in turn limit the functionality of the prototype. An improvement of the patch would consequently enable the implementation of more appealing features on Kubernetes as well.

If it were possible to write a negative value as runtime in cpu.rt_runtime_us, to indicate that we are not interested in a runtime check at a certain level of the cgroups hierarchy, then this would simplify the code and logic on QoS and Pod cgroups, making the container management system more dynamic.

A further change that would improve CPU management would be to make the cpu-manager dynamic, periodically moving the containers onto the CPUs that are freer. If the kubelet were able to handle CPUs dynamically, then the kube-scheduler could assume that the containers are always assigned to CPUs with a certain configuration. This would imply that the kube-scheduler can replicate the container-to-CPUs assignment on the node and deduce the configuration. However, this solution would force the kube-scheduler to periodically reproduce the container-to-CPUs assignment configuration for each node, and this is not expected behaviour from a lean component such as the kube-scheduler. Hence, although this solution is more correct than the heuristic solution presented in 3.3.1, it remains impractical.

A desirable feature, already mentioned in 3.2.1, which is independent of the way the kubelet assigns the CPUs to the containers, is that the kube-scheduler be aware of the state of the CPUs on each node, i.e. the available utilization of each CPU. This solution is much more stable and secure; however, it requires deep modifications to the Kubernetes code.

These are the next steps in a possible evolution of the work presented, but other use cases may arise as Kubernetes approaches the world of real-time cloud.


Bibliography

[1] Luca Abeni, Alessio Balsini, and Tommaso Cucinotta. "Container-based real-time scheduling in the Linux kernel". In: ACM SIGBED Review 16.3 (2019), pp. 33–38.

[2] Luca Abeni and Giorgio Buttazzo. "Integrating multimedia applications in hard real-time systems". In: Proceedings 19th IEEE Real-Time Systems Symposium (Cat. No. 98CB36279). IEEE. 1998, pp. 4–13.

[3] Borg, Omega, and Kubernetes - ACM Queue. [Online; accessed 12. Jul. 2020]. Mar. 2016. url: https://queue.acm.org/detail.cfm?id=2898444.

[4] Build the future of Open Infrastructure. [Online; accessed 12. Jul. 2020]. July 2020. url: https://www.openstack.org.

[5] Giorgio C. Buttazzo. Hard real-time computing systems: predictable scheduling algorithms and applications. Vol. 24. Springer Science & Business Media, 2011.

[6] Case Studies. [Online; accessed 13. Jul. 2020]. July 2020. url: https://kubernetes.io/case-studies.

[7] Melanie Cebula. "Develop Hundreds of Kubernetes Services at Scale with Airbnb". In: InfoQ (Apr. 2019). url: https://www.infoq.com/presentations/airbnb-kubernetes-services/?utm_source=youtube&utm_medium=link&utm_campaign=qcontalks.

[8] cgroups kernel documentation. [Online; accessed 12. Jul. 2020]. Feb. 2018. url: https://www.kernel.org/doc/Documentation/cgroup-v2.txt.

[9] Stuart Cheshire. "It's the latency, stupid". In: Accessed: Jan 16 (1996), p. 2019.

[10] Tommaso Cucinotta et al. "Reducing temporal interference in private clouds through real-time containers". In: 2019 IEEE International Conference on Edge Computing (EDGE). IEEE. 2019, pp. 124–131.

[11] Tommaso Cucinotta et al. "Virtual Network Functions as Real-Time Containers in Private Clouds." In: IEEE CLOUD. 2018, pp. 916–919.

[12] Deep Dive into Adobe's Lean Infrastructure & Cost Savings powered by OpenStack. [Online; accessed 13. Jul. 2020]. July 2020. url: https://www.openstack.org/videos/summits/sydney-2017/deep-dive-into-adobes-lean-infrastructure-and-cost-savings-powered-by-openstack.

[13] Empowering App Development for Developers | Docker. [Online; accessed 12. Jul. 2020]. July 2020. url: https://www.docker.com.


[15] Introduction to Control Groups (Cgroups) Red Hat Enterprise Linux 6 | Red Hat Customer Portal. [Online; accessed 12. Jul. 2020]. July 2020. url: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01.

[16] LKML: Alessio Balsini: [RFC PATCH 0/3] RT bandwidth constraints enforced by hierarchical DL scheduling. [Online; accessed 14. Jul. 2020]. July 2020. url: https://lkml.org/lkml/2017/3/31/658.

[17] Pablo Llopis et al. "Integrating HPC into an agile and cloud-focused environment at CERN". In: EPJ Web of Conferences. Vol. 214. EDP Sciences. 2019, p. 07025.

[18] Chenyang Lu. Chenyang Lu's Dependable Industrial Internet of Things, Keynote, Cyber-Physical Systems Week, April 2018. [Online; accessed 13. Jul. 2020]. July 2020. url: https://www.cse.wustl.edu/~lu/talks/cpsweek18_keynote.pdf.

[19] Overview of Cloud Game Infrastructure | Solutions | Google Cloud. [Online; accessed 13. Jul. 2020]. Apr. 2020. url: https://cloud.google.com/solutions/gaming/cloud-game-infrastructure.

[20] Pritee Parwekar. “From internet of things towards cloud of things”. In: 2011 2nd International Conference on Computer and Communication Technology (ICCCT-2011). IEEE. 2011, pp. 329–333.

[21] Jim Poole. "The Future of Cloud Gaming". In: Equinix (June 2020). url: https://blog.equinix.com/blog/2019/09/11/the-future-of-cloud-gaming.

[22] Insticc Portal. Agile Infrastructure at CERN - Moving 9'000 Servers into a Private Cloud. [Online; accessed 13. Jul. 2020]. July 2020. url: https://www.insticc.org/portal/NewsDetails/TabId/246/ArtMID/1130/ArticleID/561/Agile-Infrastructure-at-CERN---Moving-9000-Servers-into-a-Private-Cloud.aspx.

[23] Production-Grade Container Orchestration. [Online; accessed 12. Jul. 2020]. July 2020. url: https://kubernetes.io.

[24] Real-Time Cloud Robotics in Practical Smart City Applications. [Online; accessed 13. Jul. 2020]. July 2020. url: https://www.slideshare.net/c2ro/realtime-cloud-robotics-in-practical-smart-city-applications.

[25] Stack Overflow Developer Survey 2019. [Online; accessed 13. Jul. 2020]. July 2020. url: https://insights.stackoverflow.com/survey/2019#most-loved-dreaded-and-wanted.

[26] stiflerGit. kubernetes. [Online; accessed 13. Jul. 2020]. July 2020. url: https://github.com/stiflerGit/kubernetes.

[27] The Go Programming Language. [Online; accessed 13. Jul. 2020]. July 2020. url: https://golang.org.

[28] User Stories Showing How The World #RunsOnOpenStack. [Online; accessed 13. Jul. 2020]. June 2020. url: https://www.openstack.org/user-stories.


[29] What will 5G bring to industrial robotics. [Online; accessed 13. Jul. 2020]. July 2020. url: https://www.ericsson.com/en/blog/2018/12/what-will-5g-bring-to-industrial-robotics.
