
UNIVERSITY OF PISA

Department of Computer Engineering

Master of Science in Embedded Computing Systems

Adaptive Partitioning Scheduler for Real-Time

Tasks in the Linux Kernel

Supervisor:

Tommaso Cucinotta

Candidate:

Andrea Stevanato


Acknowledgments

At the end of this course of study, which has shaped me not only professionally but also as a person, I feel obliged to thank the people who, with their support and their help, made all of this possible.

My first thanks go to my parents, my sister and Pallina, because they have always supported me, and it is thanks to them that I have been able to follow my passions. Even when nothing was going right and I was failing one exam after another, they were there, they stood by me and kept believing in me.

I would also like to thank Virginia, "my official girlfriend", who during these years has been my everyday life and a point of reference; she has always supported me and advised me in my perennial indecision.

I would like to thank Prof. Tommaso Cucinotta, who followed, guided and advised me throughout the whole thesis. I also want to thank the whole ReTiS Lab, a healthy and comfortable environment, which made me understand that university is not only anger and despair.

A special thanks goes to the "power rangers of the poor": Davide, Walter, Turi and Federico, who shared this journey with me and became my second family. Over these years we have been through a lot together, with our countless belly-busting meals and all the exams we failed together. We grew up and lost our hair together (Walter fortunately decided to shave his head, Davide not yet!), we supported each other, and it is thanks to them that reaching this goal was possible.

Although they did not pass the test for the power rangers of the poor, I would like to thank the other ECS guys who, with their scarce sanity, the card games, Bang! and Exploding Kittens, shared this journey with me.

I would also like to thank the guys I lived with during these years: Davide, Samuele, Saverio, Omer and Paolo, with whom I shared my everyday life. A particular thanks goes to Omer and the fantastic chicken he usually cooks.

Finally, I would like to thank my lifelong friends who, although distance has separated us, have always remained close and present.


And as a kid I too dreamed of having a display case saying that I can do something, ye-yee


Abstract

The work developed for this thesis has involved the design and implementation of variants of the SCHED_DEADLINE scheduler for real-time tasks in the mainline Linux kernel, with the objective of comparing different adaptive partitioning strategies. Specifically, the First Fit and Worst Fit partitioning heuristics have been realized as in-kernel modifications to the SCHED_DEADLINE code. These have been extensively evaluated and compared with the performance of the global EDF algorithm already present in SCHED_DEADLINE. The evaluation has taken into account several tasksets deployed on a multi-core system. The tasksets have been randomly generated so that their overall utilization fits in a specific configuration of the system with a given number of cores. The gathered results are discussed in depth considering the various configurations of the system.


Contents

List of Terms and Acronyms
    Glossary
    Acronyms
1 Introduction
2 Background
    2.1 Hard Constant Bandwidth Server
    2.2 Multiprocessor scheduling
    2.3 Global and Partitioned EDF
    2.4 SCHED_DEADLINE
    2.5 Adaptive Partitioning EDF
    2.6 First Fit and Worst Fit
3 Adaptive Partitioning EDF in Linux
    3.1 Kernel function tracing
    3.2 First Fit and Worst Fit in Linux
        3.2.1 Push operation comparison
        3.2.2 Pull operation comparison
        3.2.3 The select_task_rq_dl() function
        3.2.4 The enqueue_task_dl() function
        3.2.5 What is the invariance?
    3.3 Fallback to global EDF
    3.4 Sysctl Tunables
4 Experimental Evaluation
    4.1 apEDF
    4.2 a2pEDF
    4.3 Worst fit comparison
        4.3.1 Worst Fit Bandwidth
        4.3.2 Worst Fit Deadline
5 Conclusion


List of Terms and Acronyms

Glossary

a2pEDF Adaptive partitioning EDF with pull operation.

XF Used to indicate either First Fit or Worst Fit heuristics.

Acronyms

CBS Constant Bandwidth Server.

CPS Cyber-Physical System.

DVFS Dynamic Voltage and Frequency Scaling.

EDF Earliest Deadline First.

FF First Fit.

FIFO First In First Out.

gEDF Global Earliest Deadline First.

H Hyperperiod.

H-CBS Hard Constant Bandwidth Server.

ISA Instruction Set Architecture.

LCM Least Common Multiple.

OS Operating System.

P-EDF Partitioned EDF.

RQ RunQueue.

RT Real-Time.

SMP Symmetric Multiprocessing.

WC Worst-Case.


Chapter 1

Introduction

In recent years, many real-time scheduling algorithms have been proposed in the literature, among them Rate Monotonic, Deadline Monotonic and Earliest Deadline First. These scheduling algorithms make it possible to schedule applications and tasks on multiple CPUs or cores. Real-time scheduling ensures that tasks and applications finish their execution before their deadline, which, for periodic tasks, often occurs at or before the next activation, identified by the period. A real-time scheduler may schedule periodic hard RT tasks, periodic soft RT tasks, or both. The former are characterized by hard timing constraints, which means that catastrophic consequences may follow if they miss even a single deadline; the latter are characterized by soft timing constraints, which means that a deadline miss only lowers the quality of the provided service. Hard RT tasks can be found in safety-critical systems, where they are typically related to acquiring information from sensors or sending signals to actuators. Soft RT tasks, instead, are mainly related to system-user interactions.

Real-time scheduling algorithms can be divided into two categories: global scheduling and partitioned scheduling. The first one schedules the tasks, which are kept in one single logical queue, across all the CPUs; the second one partitions the tasks among the CPUs, and each CPU handles its local queue independently. The partitioned approach, even though it is simpler, cannot be used if the taskset is not partitionable. Global EDF (gEDF) is one example of the global scheduling approach, while an example of the partitioned scheduling approach is Partitioned EDF (P-EDF). The partitioned approach is not work conserving, because some cores or CPUs can be idle even though in other queues there could be tasks waiting to execute; this might lead to an ineffective use of the processing power. However, although gEDF is work conserving, it is more difficult to analyze and its schedulability analysis is very pessimistic. The schedulability test that has to be performed for a partitioned scheduling algorithm on a multiprocessor system is quite simple, since it is just the uniprocessor schedulability test repeated once for each queue in the system. The schedulability test for a global scheduling algorithm on a multiprocessor system, instead, is generally more complex.

The Adaptive Partitioning EDF (apEDF) scheduling algorithm, proposed in [2], stands somewhat in between gEDF and P-EDF: in that algorithm the partitions are created by means of the First Fit policy and are scheduled through EDF. The First Fit policy tries to put as much workload as it can on the first CPUs, thus potentially leaving the last CPUs without any tasks; this is done by exploiting the push and pull operations. The push and pull operations are used to migrate tasks between the runqueues; each core/CPU in the system has its own runqueue. The push operation is invoked by a CPU that wants to push away a task residing in its runqueue; the pull operation, instead, is invoked by a CPU, left idle, that wants to pull a task onto its runqueue. The algorithms presented in [2] are apEDF and a2pEDF: the apEDF algorithm implements the First Fit policy by means of the push operation only, while a2pEDF also includes the pull operation.

Moreover, an admission test for real-time tasks scheduled onto an SMP platform with Partitioned EDF (P-EDF) has been proposed in [16]. The test is less pessimistic than the one proposed in [15]. The Partitioned EDF algorithm statically assigns tasks to cores, and migrations are not allowed.

Currently, real-time scheduling in the Linux kernel is available through SCHED_FIFO, SCHED_RR and SCHED_DEADLINE. SCHED_FIFO and SCHED_RR implement the POSIX priority-based scheduling, in which the scheduled tasks have a scheduling priority in the range 1 (low) to 99 (high) and the task with the highest priority is scheduled first. For each priority a queue, that is, a list of the tasks with the same priority, is maintained. Whenever there is more than one task in the queue, the first one to be scheduled is chosen with the FIFO or Round Robin (RR) algorithm, according to which scheduler is used: the first one schedules the tasks in a first-in first-out manner, the second one assigns each task a fixed amount of CPU time. SCHED_DEADLINE, instead, implements the global Earliest Deadline First (gEDF) algorithm by default, but it can be configured to run as P-EDF using cpusets and/or task affinities.

The gEDF in SCHED_DEADLINE implements a multiprocessor variant of the Hard Constant Bandwidth Server (H-CBS) on top of an EDF scheduler. The H-CBS is an algorithm that implements the resource reservation mechanism, by which tasks are guaranteed a CPU for a minimum amount of time in every period, so that they can execute. The runtime is the time the task requires for its execution, and it is granted to the task every time a period elapses. The H-CBS solves the deadline aging issue of the original CBS [1] algorithm, under which the server deadline is postponed several times, thus jeopardizing the server execution when some tasks are left free to execute.

Real-time scheduling is challenging for symmetric multiprocessing (SMP) systems as well as for heterogeneous systems. Heterogeneous architectures have different kinds of cores or CPUs, which can be either high-performance or energy-efficient. Furthermore, heterogeneous architectures can also differ in the Instruction Set Architectures (ISA) of their different types of cores or CPUs.

The ARM big.LITTLE is an example of a heterogeneous architecture which is made up of so-called islands. Both islands share the same ISA. The big island is made of high-performance cores, while the little island is made of energy-efficient cores. A typical architecture found on Android smartphones has 4 cores in each island.

Usually, the big island is used for foreground tasks that require intensive computations, while the LITTLE island is used for tasks that execute in background. Since in this case the devices are battery powered, the tasks have to be migrated between the two islands in order to optimize the battery duration: if they were always left executing on the high-performance cores, the consequence would be a short battery life.

In [17] a task scheduler for big.LITTLE platforms is proposed. In this architecture the OS can take decisions that combine the high-performance cores with the energy-efficient cores for task scheduling. The proposed scheduler combines the CBS with a power-aware job migration policy.

This topic will be further explored throughout the AMPERE project [18], which aims at developing a software development environment for low-energy, highly parallel and heterogeneous computing architectures. The purpose of the AMPERE software architecture is the capability of transforming the system model description of cyber-physical systems (CPSs), based on specific model-driven languages, into the parallel programming models supported by the parallel architecture of such systems. The aim of the project is to deliver a new software architecture that will be used for safety-critical automotive and railway systems.


As described above, real-time scheduling is still an active research topic; therefore, in this thesis an implementation of the Adaptive Partitioning Scheduler for real-time tasks has been realized in the Linux kernel, next to the already present gEDF delivered by SCHED_DEADLINE. The adaptive partitioning scheduler was implemented by means of the First Fit and Worst Fit heuristics. The heuristics define how partitions are created, that is, how the tasks are assigned to the runqueues, which are lists of tasks waiting to be scheduled. For both heuristics the push and pull operations have been implemented. In the implementation of the two heuristics the concept of invariance was introduced: the invariance allows choosing whether the heuristic is applied unconditionally. If the invariance is enabled, the heuristic is always applied; otherwise it is applied only when necessary.

Experiments have been executed in order to see how the heuristics behave with respect to the global EDF of SCHED_DEADLINE. Many tasksets with different configurations have been generated with the Randfixedsum algorithm [11]. The tasksets' configurations differ in total utilization and number of tasks.

The experiments were performed with the testing application rt-app, which simulates a periodic real-time load.

The thesis is structured as follows: in Chapter 2 the background notions are explained, covering all the knowledge required for the understanding of the work. Chapter 3 explains how Adaptive Partitioning Earliest Deadline First (apEDF) was implemented in the Linux kernel; the chapter covers the First Fit and Worst Fit heuristics, the fallback to gEDF, all the sysctl tunables introduced with the patch, and a trace of which kernel functions are called for task scheduling. In Chapter 4 the experimental evaluation of apEDF and a2pEDF, and a comparison of the worst fit bandwidth and deadline variants, are discussed. Finally, Chapter 5 draws the conclusions of this work.


Chapter 2

Background

In this chapter all the knowledge required for the understanding of the implementation of Adaptive Partitioning EDF in the Linux kernel is explained.

The chapter is structured as follows: in Section 2.1 the Hard Constant Bandwidth Server (H-CBS) [5, 8] is described, and the differences between the CBS [1] and the H-CBS are highlighted. Section 2.2 explains how multiprocessor scheduling works, highlighting the differences between the global and the partitioned scheduling approaches. In Section 2.3 the differences between global EDF and partitioned EDF are shown. In Section 2.4 the SCHED_DEADLINE [13] Linux real-time scheduler is described. In Section 2.5 the previous work on Adaptive Partitioning EDF is covered. Finally, in Section 2.6 the First Fit and Worst Fit heuristics [6] are explained.

2.1 Hard Constant Bandwidth Server

Before talking about the H-CBS it is useful to introduce how the original Constant Bandwidth Server (CBS) [1] works. The CBS is a dynamic priority server that implements the resource reservation mechanism.

Resource reservation is a mechanism that assigns each task a fraction of the CPU time, the so-called reserved bandwidth, specified in terms of a reservation budget (or runtime). Each task is guaranteed its reserved runtime in each time window of duration equal to the reservation period. Moreover, the task is scheduled in such a way that it cannot execute for more than its reserved runtime in each period. Whenever a new task becomes available, a scheduling deadline is assigned to it and it is inserted in a queue with the other tasks. If a task tries to execute for more than its allowed time, its deadline gets postponed in order to decrease its priority, so that other tasks are able to execute. This aims to reduce the interference with other tasks.

Given a CBS $S$, it is defined by two parameters: $Q_s$, the maximum server budget, and $T_s$, the server period; moreover, $c_s$ is the current server budget. The server utilization can be computed as $U_s = Q_s / T_s$. At each time instant, a fixed deadline $d_{s,k}$ is associated with the server; at the beginning, $d_{s,0} = 0$. The CBS uses the EDF scheduling algorithm for task scheduling. Whenever there is a job that wants to execute for $c_j$ time units, the CBS assigns the current server deadline to the job. Once a job completes its execution or gets preempted, the server budget is decreased by an amount equal to the time the server executed for (i.e. $c_s = c_s - c_j$). When the server budget $c_s$ is exhausted (i.e. $c_s = 0$), it is immediately refilled as $c_s = Q_s$ and the server deadline is postponed to $d_{s,k+1} = d_{s,k} + T_s$. Note that it could happen that the time the job wants to execute for is greater than the available budget of the server (i.e. $c_j > c_s$). In that case, the job is allowed to execute for $c_s$ time units and, once the budget is exhausted (i.e. $c_s = 0$), the budget is refilled and the deadline postponed, as explained above. Once the replenishment happens, the job is scheduled with the new deadline and it is able to execute for the residual budget $c_s$ with that deadline. The CBS wake-up rule defines how the server budget and the absolute deadline change whenever a task, which was previously blocked, unblocks. If only a short time has elapsed since the task blocked, the server keeps the same budget and deadline. However, if a large amount of time has elapsed, such that the task unblocks near its deadline, the server budget is recharged and the deadline is postponed to $now + T_s$. This aims to avoid that the task, unblocking near its deadline, is selected by the EDF scheduler as the highest priority task, causing a delay to any other task on the same CPU.
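As a user-space illustration of the rules above (a minimal sketch, not taken from any kernel code: struct cbs and the helper names are hypothetical, and time is modeled with plain doubles), the budget accounting and the wake-up rule could look as follows; the wake-up check is the standard CBS bandwidth test from [1]:

/* Illustrative CBS server state: Q and T are the reservation
 * parameters; budget and deadline are the runtime variables. */
struct cbs {
    double Q;        /* maximum budget (runtime) */
    double T;        /* server period            */
    double budget;   /* current budget c_s       */
    double deadline; /* absolute deadline d_s    */
};

/* Account for delta time units of execution: on exhaustion the
 * CBS immediately refills the budget and postpones the deadline
 * by one period. */
static void cbs_tick(struct cbs *s, double delta)
{
    s->budget -= delta;
    if (s->budget <= 0) {
        s->budget += s->Q;   /* immediate replenishment */
        s->deadline += s->T; /* deadline postponement   */
    }
}

/* CBS wake-up rule: if the residual (budget, deadline) pair would
 * exceed the reserved bandwidth Q/T, recharge the budget and move
 * the deadline to now + T; otherwise keep both unchanged. */
static void cbs_wakeup(struct cbs *s, double now)
{
    if (s->budget >= (s->deadline - now) * (s->Q / s->T)) {
        s->budget = s->Q;
        s->deadline = now + s->T;
    }
}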

The Hard Constant Bandwidth Server (H-CBS) scheduling algorithm is used to mix hard and soft real-time tasks. The H-CBS can be described by three parameters $(Q_s, D_s, T_s)$, where $Q_s$ is the maximum server budget, $D_s$ is the server relative deadline and $T_s$ is the server period (also known as the reservation period). The server keeps track of the current budget $c_s$ (also known as runtime) and the scheduling deadline $d$ (also known as server absolute deadline). The server can be in one of the following states: IDLE, READY and THROTTLED. In the IDLE state the server is waiting for a ready task to execute and, once there is one, the server state changes from IDLE to READY; whenever the server executes for $\delta$ time units, its current budget is decreased as $c_s = c_s - \delta$. The server transits from READY to THROTTLED when the budget is exhausted (i.e. $c_s = 0$); at the replenishment time $t_s$ the budget is recharged as $c_s = Q_s$ and the deadline is postponed as $d_{s,k+1} = d_{s,k} + D_s$. If $T_s$ is equal to $D_s$, as often occurs, then $t_s$ is equal to $d_s$. At this point, if the server has some tasks to execute its state returns to READY, IDLE otherwise.
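The three states and the deferred replenishment can be modeled with a small state machine; the following user-space sketch is illustrative (the hcbs_* names are hypothetical, $T_s = D_s$ is assumed, and the activation rule on IDLE is one possible choice):

#include <stdbool.h>

enum hcbs_state { HCBS_IDLE, HCBS_READY, HCBS_THROTTLED };

struct hcbs {
    double Q;        /* maximum budget                 */
    double D;        /* relative deadline (= period T) */
    double budget;   /* current budget c_s             */
    double deadline; /* absolute scheduling deadline   */
    enum hcbs_state state;
};

/* A ready task arrives while the server is idle (one possible
 * activation rule, for illustration only). */
static void hcbs_task_ready(struct hcbs *s, double now)
{
    if (s->state == HCBS_IDLE) {
        s->budget = s->Q;
        s->deadline = now + s->D;
        s->state = HCBS_READY;
    }
}

/* Account for delta units of execution; unlike the CBS, the
 * budget is NOT refilled here: the server is throttled until the
 * replenishment instant. */
static void hcbs_tick(struct hcbs *s, double delta)
{
    s->budget -= delta;
    if (s->budget <= 0) {
        s->budget = 0;
        s->state = HCBS_THROTTLED;
    }
}

/* Replenishment timer firing at the replenishment instant
 * (t_s = d_s when T = D): recharge, postpone, then go back to
 * READY if there is pending work, IDLE otherwise. */
static void hcbs_replenish(struct hcbs *s, bool work_pending)
{
    s->budget = s->Q;
    s->deadline += s->D;
    s->state = work_pending ? HCBS_READY : HCBS_IDLE;
}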

At this point the difference between CBS and H-CBS is clear: it concerns when the budget replenishment happens. The H-CBS avoids jeopardizing the execution of hard real-time tasks in the presence of bursty and highly sporadic tasks. Furthermore, the CBS deadline aging [5] issue is solved by H-CBS. In order to understand what deadline aging is, assume there are two tasks $\tau_1$ and $\tau_2$ served by two servers $S_1$ and $S_2$, respectively, and suppose that initially the only active task is $\tau_1$, which executes without being preempted. Therefore, $S_1$ consumes all its budget, postponing its deadline several times. At some point in time the task $\tau_2$ is activated and a deadline is assigned to $S_2$. Since $S_2$ has a shorter deadline than $S_1$, according to EDF it can start to execute. When the budget of $S_2$ is exhausted, its deadline is postponed and the budget is replenished, but since $S_1$'s deadline is further away than $S_2$'s, $S_2$ can continue executing. In other words, some time may pass before $S_1$ can start executing again, and therefore the server is not able to guarantee $Q_{S_1}$ units of execution within $T_{S_1}$ time units. The H-CBS, thanks to its budget replenishment policy, solves this issue.

2.2 Multiprocessor scheduling

The multiprocessor real-time scheduling problem consists in deciding how to schedule $n$ real-time tasks, defined as $\Gamma = \{\tau_i\}$, onto a platform with $M$ identical cores/CPUs. Each real-time task $\tau_i$ can be defined by three parameters $(C_i, P_i, D_i)$: $C_i$ is the Worst-Case Execution Time (WCET), $P_i$ is the period or minimum inter-arrival time and $D_i$ is the relative deadline, where often $D_i = P_i$. Furthermore, the time when the $k$-th job of task $\tau_i$ arrives and becomes ready is denoted by $r_{i,k}$, and $f_{i,k}$ is the time when the job finishes after executing for $c_{i,k}$ time units. Since the relative deadline is known, the absolute deadline can be defined as $d_{i,k} = r_{i,k} + D_i$. The task misses its deadline if $f_{i,k} > d_{i,k}$. However, the exact execution time $c_{i,k}$ of a task $\tau_i$ is not known, but it can be assumed that $C_i \ge c_{i,k} \;\forall k$. The tardiness of task $\tau_i$ is expressed as $\max\{0, f_{i,k} - d_{i,k}\}$.
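These definitions translate directly into code; the following fragment is a minimal sketch (hypothetical names, time as plain doubles):

/* Periodic/sporadic task parameters (C_i, P_i, D_i). */
struct rt_task {
    double wcet;     /* C_i: worst-case execution time        */
    double period;   /* P_i: period / min. inter-arrival time */
    double deadline; /* D_i: relative deadline (often = P_i)  */
};

/* Absolute deadline of the k-th job released at r_ik. */
static double abs_deadline(const struct rt_task *t, double r_ik)
{
    return r_ik + t->deadline;
}

/* Tardiness of a job finishing at f_ik: max{0, f_ik - d_ik}. */
static double tardiness(double f_ik, double d_ik)
{
    return f_ik > d_ik ? f_ik - d_ik : 0.0;
}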

In order to be sure that the taskset is schedulable, a schedulability condition has to be verified. In case of a uniprocessor system the schedulability condition, assuming that $P_i = D_i$, is:

$$\sum_{i=1}^{n} \frac{C_i}{P_i} \le U_{lub},$$

with $U_{lub}$ depending on the scheduling algorithm; $U_{lub} = 1$ when the scheduling algorithm is EDF. In case of a multiprocessor system the schedulability condition depends on the scheduling algorithm, which can be either partitioned or global.

In case of partitioned scheduling, all the tasks that belong to the taskset are partitioned among all the CPUs of which the system is composed, under the assumption that there is one partition for each CPU. Hence, each partition is a list of tasks that can be defined as $\Gamma_j \subset \Gamma$, and it is statically assigned to the single core/CPU $j$. On each core/CPU either a fixed priority scheduling algorithm, like Rate Monotonic (RM), or a dynamic priority scheduling algorithm, like EDF, can be used. Since each core/CPU has a partition of the taskset and its own scheduling algorithm, the schedulability analysis can be performed independently of the other cores/CPUs, which means that the single-core schedulability analysis is performed $M$ times, once for each core/CPU. This approach requires that the tasks are partitionable among the cores/CPUs of which the system is composed, such that:

$$\sum_{i:\ \tau_i \in \Gamma_j} \frac{C_i}{P_i} \le U_{lub} \quad \forall j \in \{1, \dots, M\},$$

where $U_{lub} = 1$ if EDF is used.

In case of global scheduling, all the ready tasks are put in the same queue and, at each time $t$, the $M$ highest priority ready tasks are scheduled. In this case the schedulability test discussed above for a uniprocessor system cannot be directly used. The condition under which a taskset is schedulable with a global scheduling algorithm (e.g. global EDF) is explained in the following section.

2.3 Global and Partitioned EDF

Earliest Deadline First (EDF) is a dynamic priority scheduling algorithm where task priorities are assigned according to the absolute deadlines. In dynamic priority scheduling algorithms the priorities are calculated during the execution. Instead, in fixed priority scheduling algorithms the priorities are known a priori and depend on a parameter of the task. Rate Monotonic is a fixed priority scheduling algorithm which assigns task priorities according to their periods, which means that the task with the shortest period has the highest priority in the system.

A taskset of $n$ tasks is schedulable with the EDF algorithm on a uniprocessor [14] system if and only if:

$$U = \sum_{i=1}^{n} \frac{C_i}{T_i} \le 1,$$

where $C_i$ is the Worst-Case Execution Time (WCET) of task $\tau_i$ and $T_i$ is its period.

The difference between global EDF and partitioned EDF is how the tasks are distributed among the CPUs that make up the system, since both of them use the EDF algorithm. In order to better explain the difference between the two approaches, consider a system with $M$ CPUs. In global EDF all the tasks that are ready to be scheduled are in the same ready queue, which usually is ordered by descending priority. Under this condition, at each time $t$ the first $M$ ready tasks, which have the highest priority, are scheduled. Moreover, the EDF schedulability test for uniprocessor systems explained above has to be changed according to these two different approaches. Under global EDF, in a system made up of $M$ cores/CPUs, the necessary condition for schedulability becomes:

$$U = \sum_{i=1}^{n} \frac{C_i}{T_i} \le M.$$

Furthermore, global EDF guarantees that, even though deadlines can be missed, the tardiness is bounded [10], meaning that the time by which each job of a task finishes beyond its deadline is bounded.

In partitioned EDF, instead, there can be up to $M$ different ready queues, and all the tasks that belong to the taskset are partitioned among these queues. The tasks of each queue are scheduled with the EDF algorithm, such that the highest priority task of each queue is in execution. The partitions can be created through task affinities or cpusets. In order to verify whether the taskset is schedulable, the uniprocessor schedulability test is performed on each core/CPU as if it were a uniprocessor system; if the condition is satisfied for each core/CPU $j$, the taskset is schedulable with partitioned EDF.
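The tests above condense into a few lines of code. The following sketch uses hypothetical helper names and assumes the per-task utilizations $U_i = C_i/T_i$ have been precomputed:

#include <stdbool.h>
#include <stddef.h>

/* Uniprocessor EDF: the total utilization must not exceed 1. */
static bool edf_up_schedulable(const double *u, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += u[i];
    return sum <= 1.0;
}

/* Global EDF on m CPUs: U <= m is only a necessary condition
 * (tardiness stays bounded, but deadlines may be missed). */
static bool gedf_admissible(const double *u, size_t n, int m)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += u[i];
    return sum <= (double)m;
}

/* Partitioned EDF: the uniprocessor test repeated once per
 * partition; part[i] in {0, ..., m-1} is the CPU of task i. */
static bool pedf_schedulable(const double *u, const int *part,
                             size_t n, int m)
{
    for (int j = 0; j < m; j++) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            if (part[i] == j)
                sum += u[i];
        if (sum > 1.0)
            return false;
    }
    return true;
}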


2.4 SCHED_DEADLINE

SCHED_DEADLINE is a real-time CPU scheduler that has been available in Linux since version 3.14. It was born from the collaboration of Evidence srl (https://www.evidence.eu.com) and the ReTiS Lab of Scuola Sant'Anna. The aim of this scheduler was to introduce in Linux the support for dynamic priority real-time scheduling. SCHED_DEADLINE was born within the context of the ACTORS European research project.

Linux has a modular scheduling framework that can be easily extended with additional policies [13]. The scheduler consists of a main function, schedule(), and a set of scheduling classes; the new real-time scheduler was therefore added as a new scheduling class. Each scheduling policy, like SCHED_FIFO, SCHED_RR and SCHED_DEADLINE, has its own scheduling class, which is a data structure containing function pointers that are called by the core scheduler.

SCHED_DEADLINE implements a multiprocessor version of the previously mentioned H-CBS algorithm on top of an EDF scheduler. The H-CBS algorithm implements a resource reservation mechanism which guarantees that the tasks can execute for their reserved runtime in every period. A task is throttled by the scheduler whenever it tries to execute for more than its reserved runtime within a period. Moreover, SCHED_DEADLINE has recently gained the GRUB policy [3, 19] for controlled budget overruns.

In SCHED_DEADLINE the multiprocessor scheduling is achieved with one runqueue for each core/CPU; each runqueue is a list of all the tasks that have to be scheduled on the core/CPU the queue belongs to. Tasks can be migrated to a different queue; the migrations are performed with push and pull operations. A push operation is started by a core/CPU in order to migrate a lower priority task from its runqueue to a runqueue in which it has the highest priority. A pull operation on a queue, instead, is started when a task is throttled by the scheduler or blocks itself; in that case a ready task, having the highest priority among all tasks on such queue, is searched for among all the other queues. Each runqueue is implemented as a red-black tree, a self-balancing binary tree, ordered by increasing absolute deadline, which allows efficient manipulation. In order to speed up the push and pull operations, the deadlines of the deadline tasks in execution on each CPU are stored in a max-heap data structure [12]; furthermore, a bitmask was added that allows knowing on which CPUs deadline tasks are scheduled. The operations on the mentioned heap have been further optimized in [9].

From the Application Programming Interface (API) point of view, the already present system calls sched_setscheduler() and sched_getscheduler() were not extended, in order not to break binary compatibility for existing applications. Therefore, two new system calls, sched_setattr() and sched_getattr(), were added. In order to use these system calls with the new parameters, the sched_attr data structure was added. This structure requires the specification of the maximum runtime that the application can execute, the period over which this runtime should be guaranteed, and the relative deadline.
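Since glibc does not provide a wrapper for sched_setattr(), the system call is usually invoked through syscall(), declaring struct sched_attr by hand as done in the kernel documentation. The following minimal example (the 10ms/100ms reservation values are arbitrary) turns the calling thread into a SCHED_DEADLINE task:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/types.h>

#define SCHED_DEADLINE 6 /* policy number from the uapi headers */

/* Declared by hand, as in the kernel documentation, because glibc
 * does not export it. */
struct sched_attr {
    __u32 size;
    __u32 sched_policy;
    __u64 sched_flags;
    __s32 sched_nice;     /* SCHED_NORMAL, SCHED_BATCH */
    __u32 sched_priority; /* SCHED_FIFO, SCHED_RR      */
    __u64 sched_runtime;  /* SCHED_DEADLINE, in ns     */
    __u64 sched_deadline;
    __u64 sched_period;
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr,
                         unsigned int flags)
{
    return syscall(__NR_sched_setattr, pid, attr, flags);
}

int main(void)
{
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  =  10 * 1000 * 1000, /*  10 ms */
        .sched_deadline = 100 * 1000 * 1000, /* 100 ms */
        .sched_period   = 100 * 1000 * 1000, /* 100 ms */
    };

    if (sched_setattr(0, &attr, 0)) { /* pid 0: calling thread */
        perror("sched_setattr");
        return 1;
    }
    /* From here on the thread is scheduled by SCHED_DEADLINE. */
    return 0;
}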

Since in Linux there are many scheduling classes, each of them has a priority. These priorities are required in order to choose which scheduling class has precedence whenever applications scheduled by different scheduling classes run in the system. The scheduling classes, ordered by descending priority, are: SCHED_DEADLINE, SCHED_RT (realizing POSIX SCHED_RR and SCHED_FIFO), SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE. In order not to jeopardize a system where one or more SCHED_DEADLINE applications run, there are two sysctl tunables that prevent the SCHED_DEADLINE applications, which have the highest priority, from getting the whole CPU time, thus leaving no time for the vital functions of the system. These two tunables, which can be found in /proc/sys/kernel/, are sched_rt_runtime_us and sched_rt_period_us. The first one contains the maximum time, expressed in µs, that can be reserved to SCHED_DEADLINE applications over the period, also expressed in µs, contained in the second one. The default values for these parameters are 950000µs for the runtime and 1000000µs for the period. The runtime limitation can be disabled, allowing to start more applications than the system can handle, by putting −1 in sched_rt_runtime_us. To be more concrete, consider the following example: supposing that the system under study has 2 CPUs, the maximum utilization allowed by SCHED_DEADLINE with the default parameters is $U_{max} = 2 \cdot \frac{950000}{1000000} = 1.9$. Therefore, the scheduler limits the tasks that can be started in the system, meaning that, if $U_{tot}$ is the sum of the utilizations of all tasks, $U_{tot} \le U_{max}$; if the runtime is instead set to −1, the previous limit can be exceeded without the scheduler complaining, allowing to start more tasks no matter how high $U_{tot}$ becomes.


2.5 Adaptive Partitioning EDF

Adaptive Partitioning EDF [2] is a migration strategy based on r-EDF [4] that, as the name suggests, implements partitioned scheduling where the EDF algorithm is used in all the partitions. The apEDF strategy is reminiscent of the First Fit policy, which is explained in the next section, and it operates by means of the push operation. The apEDF strategy can guarantee that the tardiness is bounded if the utilization of the taskset does not exceed the number of cores/CPUs (i.e. $U \le M$); however, the tardiness bound provided by apEDF can be much larger than the one provided by gEDF.

In [2] the pull operation is added to the apEDF strategy; the new strategy is called a2pEDF. With respect to the pull operation already present in the gEDF of SCHED_DEADLINE, the a2pEDF pull operation is executed only from idle CPUs, and tasks are pulled only from overloaded runqueues. A runqueue is overloaded when its bandwidth is greater than one, which means that it holds more tasks than it can serve. The improved strategy, however, is more complex than apEDF due to the presence of the pull operation.

In [2] the experimental evaluation shows that apEDF behaves better than global EDF; the comparison metrics are the deadline miss ratio and the number of task migrations between runqueues. The experiments were performed on 2, 4 and 8 CPUs with 30 tasksets made up of 16 tasks each. Moreover, the experimental evaluation concerning gEDF, apEDF and a2pEDF shows that the latter performs better than apEDF, even though at low utilization they behave almost the same. All these experiments were executed in simulation, which means that, given the same tasksets, they always give the same results. This might not hold in a real system, in which multiple factors may impact the experiments.

2.6 First Fit and Worst Fit

The First Fit (FF) and Worst Fit (WF) heuristics are two partitioning algorithms whose task placement strategies on the runqueues differ; a scheduling algorithm such as EDF can then be used on each runqueue.

In order to explain how these two heuristics work, consider a taskset $\Gamma = \{\tau_i\}$ of $n$ periodic tasks and a system made up of $M$ cores/CPUs, with one runqueue per core/CPU. The tasks can be described by their worst-case utilization, computed as $U_i = C_i / T_i \le 1$. The aim of these two heuristics is to partition the $n$ tasks among the $M$ runqueues using different strategies. The maximum workload that each runqueue can handle is less than or equal to 1. Furthermore, the tasks that have been placed on runqueue $j$ can be identified by $\Gamma_j \subset \Gamma$, and the runqueue utilization can be written as:

$$U_{RQ_j} = \sum_{i:\ \tau_i \in \Gamma_j} \frac{C_i}{T_i} = \sum_{i:\ \tau_i \in \Gamma_j} U_i \le 1, \quad j \in \{1, \dots, M\}.$$

The FF tries to put as much workload as it can on the first queues, potentially leaving the last ones idle. Whenever a new task $\tau_k$ with utilization $U_k$ becomes available, if the task fits in the first runqueue, which means that $U_{RQ_1} + U_k \le 1$, the task is placed in $\Gamma_1$; if the condition is not met, the fit is checked in all subsequent runqueues, from 2 to $M$, and the first one that satisfies the condition is the one where the new task is placed.
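As a user-space illustration (the in-kernel counterpart described in Chapter 3 works on runqueue bandwidth fields rather than on a plain utilization array, so everything here is hypothetical), the first fit selection can be sketched as:

#include <stddef.h>

/* First fit: return the index of the first runqueue whose
 * utilization still accommodates u_new, or -1 if none fits. */
static int first_fit_pick(const double *u_rq, size_t m, double u_new)
{
    for (size_t j = 0; j < m; j++)
        if (u_rq[j] + u_new <= 1.0)
            return (int)j;
    return -1; /* the task does not fit anywhere */
}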

In [15] it is proved that, in a system made up of $M$ CPUs/cores where the tasks are assigned to the queues with the FF heuristic and the EDF scheduling algorithm is used, the worst-case achievable utilization is:

$$U_{wc}^{EDF\text{-}FF} = \frac{M + 1}{2}. \quad (2.1)$$

In other words, if the taskset has utilization $U \le \frac{M+1}{2}$, no deadlines are missed. However, Equation 2.1 is very pessimistic and can be improved as follows:

$$U_{wc}^{EDF\text{-}FF} = \frac{\beta M + 1}{\beta + 1}, \quad (2.2)$$

where $\beta = \left\lfloor \frac{1}{U_{max}} \right\rfloor$. Note that Equation 2.2 is less pessimistic than 2.1 and the two are equal for $U_{max} = 1$. Hence, having a lower $U_{max}$ makes Equation 2.2 greater than Equation 2.1.
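For example, on $M = 4$ CPUs with $U_{max} = 0.5$, we get $\beta = \lfloor 1/0.5 \rfloor = 2$, so Equation 2.2 gives $U_{wc}^{EDF\text{-}FF} = (2 \cdot 4 + 1)/(2 + 1) = 3$, while Equation 2.1 only guarantees $(4 + 1)/2 = 2.5$.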

The WF tries to balance the workload among all the runqueues: whenever a new task $\tau_k$ with utilization $U_k$ becomes available, the worst fit places it in the least loaded runqueue, that is, the runqueue $z$ such that $U_{RQ_z} = \min\{U_{RQ_1}, \dots, U_{RQ_M}\}$; moreover, if two or more runqueues have the same utilization, the first one is chosen.
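Mirroring the first fit sketch above, a hypothetical user-space version of the worst fit selection, in which the first runqueue wins ties, is:

#include <stddef.h>

/* Worst fit: return the index of the least loaded runqueue that
 * can accommodate u_new, preferring the lowest index on ties;
 * -1 if the task fits nowhere. */
static int worst_fit_pick(const double *u_rq, size_t m, double u_new)
{
    int best = -1;
    for (size_t j = 0; j < m; j++) {
        if (u_rq[j] + u_new > 1.0)
            continue; /* would overload this runqueue */
        if (best < 0 || u_rq[j] < u_rq[best])
            best = (int)j;
    }
    return best;
}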

In order to better understand how the first fit and worst fit work, consider two runqueues $RQ_1$ and $RQ_2$ whose utilizations, due to the tasks that have already been placed, are $U_{RQ_1} = 0.7$ and $U_{RQ_2} = 0.5$, respectively. Suppose that two new tasks become available, with utilizations $U_{\tau_1} = 0.1$ and $U_{\tau_2} = 0.3$. The first fit first puts $\tau_1$ in $RQ_1$, which leads to $U_{RQ_1} = 0.8$, and then tries to place $\tau_2$; but, since $U_{RQ_1} + U_{\tau_2} = 0.8 + 0.3 = 1.1 > 1$, the task $\tau_2$ cannot be placed in $RQ_1$. Hence, the first fit checks the second runqueue and, since $U_{RQ_2} + U_{\tau_2} = 0.5 + 0.3 = 0.8 \le 1$, the task $\tau_2$ is placed in $RQ_2$, whose utilization therefore becomes $U_{RQ_2} = 0.8$. In the case of the worst fit policy the placement is different. Initially, since $U_{RQ_2} = 0.5 < 0.7 = U_{RQ_1}$, the least loaded runqueue is $RQ_2$, and therefore $\tau_1$ is placed in $RQ_2$, so that $U_{RQ_2} = 0.6$; the second task is also placed in $RQ_2$, because $U_{RQ_2} = 0.6 < 0.7 = U_{RQ_1}$, and therefore its utilization becomes $U_{RQ_2} = 0.9$.


Chapter 3

Adaptive Partitioning EDF in Linux

In this chapter the main work of this thesis is presented. Adaptive Partitioning EDF (apEDF) has been implemented in the SCHED_DEADLINE scheduling class of Linux, next to the already present gEDF. The two heuristics that have been implemented are First Fit and Worst Fit [6]. The push and pull operations have been implemented, respectively, in the apEDF and a2pEDF patches for the Linux kernel, where a2pEDF is apEDF with the pull operation. The code of both patches has been made available in a GitHub repository¹.

The chapter is organized as follows: Section 3.1 describes the main SCHED_DEADLINE functions and explains the traces of the kernel functions that occur whenever a SCHED_DEADLINE application or task is scheduled; Section 3.2 explains how the First Fit and Worst Fit heuristics have been implemented in Linux; Section 3.3 explains and describes the fallback to gEDF option; finally, in Section 3.4 all the sysctl tunables that have been introduced are described.

¹For more information see: https://github.com/thymbahutymba/linux/tree/

3.1 Kernel function tracing

The assignment of the tasks to the runqueues and the logic behind the task migrations among the runqueues are defined by some of the functions that characterize the scheduling class.

In SCHED_DEADLINE the main operations that decide in which runqueue a task must be placed, either for a newly created task or for one being migrated from another runqueue, are the push and pull operations, defined in the push_dl_task() and pull_dl_task() functions.

The enqueue_task_dl(), balance_dl() and select_task_rq_dl() are some of the functions that define the SCHED_DEADLINE scheduling class and describe the SCHED_DEADLINE behaviour.

Listing 3.1 shows the SCHED_DEADLINE scheduling class data structure, where each field is a pointer to a function that is called by the core of the Linux scheduler.

Listing 3.1: SCHED_DEADLINE scheduling class

const struct sched_class dl_sched_class = {
    .next               = &rt_sched_class,
    .enqueue_task       = enqueue_task_dl,
    .dequeue_task       = dequeue_task_dl,
    .yield_task         = yield_task_dl,
    .check_preempt_curr = check_preempt_curr_dl,
    .pick_next_task     = pick_next_task_dl,
    .put_prev_task      = put_prev_task_dl,
    .set_next_task      = set_next_task_dl,
#ifdef CONFIG_SMP
    .balance            = balance_dl,
    .select_task_rq     = select_task_rq_dl,
    .migrate_task_rq    = migrate_task_rq_dl,
    .set_cpus_allowed   = set_cpus_allowed_dl,
    .rq_online          = rq_online_dl,
    .rq_offline         = rq_offline_dl,
    .task_woken         = task_woken_dl,
#endif
    .task_tick          = task_tick_dl,
    .task_fork          = task_fork_dl,
    .prio_changed       = prio_changed_dl,
    .switched_from      = switched_from_dl,
    .switched_to        = switched_to_dl,
    .update_curr        = update_curr_dl,
};

In order to trace what happens after the sched_setattr() system call changes the current scheduler to SCHED_DEADLINE, many printk() calls were added, so that a trace could be generated and a pattern found. After the sched_setattr(), the SCHED_DEADLINE tasks or applications can be created either by continuing in the same code flow or with an exec() system call or one of its variants. The exec() system call executes a file, replacing the current process with the new one. For both alternatives several cases are analyzed.

The following traces have been generated using the wrap-dl utility, which takes the runtime and the period of the wrapped application as command-line parameters. The rt-loop utility, instead, is one of the possible applications that can be wrapped by wrap-dl: it implements a periodic CPU-intensive application and takes the computational time and the period as parameters.

Listing 3.2 shows the trace resulting from a SCHED_DEADLINE application that is created in the same code flow of the sched_setscheduler(), without the use of the exec() system call. In this case the application scheduled with SCHED_DEADLINE is just a for loop that wastes some CPU cycles for 500ms and then sleeps for 1µs. The absence of exec() makes the trace cleaner. Note that the task is queued just before its execution starts and is dequeued at the end of its execution, waiting for the next period.

Listing 3.2: for loop wasting some time, followed by a 1µs sleep.

enqueue_task_dl
switched_to_dl
push_dl_task
[time difference of runtime]
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference of period - runtime]
dl_task_timer
enqueue_task_dl
push_dl_task
[time difference of runtime]
REPEAT;

Listing 3.3 shows the trace resulting from a SCHED_DEADLINE application that does nothing: it just sleeps for the whole period. Note that, since the task does nothing, it is dequeued immediately after the enqueue, waiting for the next period.

Listing 3.3: Application that only sleeps for its entire period.

enqueue_task_dl
switched_to_dl
push_dl_task
dequeue_task_dl
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference with duration equal to period]
select_task_rq_dl
enqueue_task_dl
task_woken_dl
dequeue_task_dl
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference with duration equal to period]
REPEAT;

Listing 3.4 shows the trace resulting from a SCHED_DEADLINE application which just runs an infinite loop doing nothing; the application can be seen as for(;;);. Note that there is no dequeue, since the application does not suspend itself but is throttled by the kernel. Moreover, through the dl_task_timer() the budget is refilled and the application executes again until it is throttled.

Listing 3.4: Application that just does an infinite loop.

enqueue_task_dl
switched_to_dl
push_dl_task
[time difference of runtime]
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference of period - runtime]
dl_task_timer
enqueue_task_dl
push_dl_task
[time difference of runtime]
REPEAT;

Listing 3.5 shows the trace of the functions that are called by the scheduler when the rt-loop utility is executed. The rt-loop utility is started with wrap-dl, which runs it under the SCHED_DEADLINE scheduler through the exec() system call. The rt-loop has been given a slightly lower computational time than, and the same period as, the parameters of wrap-dl; therefore it suspends itself without being throttled by the scheduler. As visible, after the enqueue of wrap-dl in the deadline runqueue due to the change of scheduling class, the exec() causes the calls to push_dl_task() and select_task_rq_dl(). After each period, the task wakes up, causing a select_task_rq_dl() followed by the enqueue_task_dl() and task_woken_dl(); then, after the task runtime, we see a dequeue when the task returns to sleep waiting for the next period.

Listing 3.5: rt-loop executed with the exec of wrap-dl.

enqueue_task_dl
switched_to_dl
push_dl_task
select_task_rq_dl
balance_dl
dequeue_task_dl
__dequeue_task_dl
enqueue_task_dl
balance_dl
[time difference of wrap-dl runtime]
dequeue_task_dl
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference of wrap-dl period - runtime]
select_task_rq_dl
enqueue_task_dl
task_woken_dl
[time difference of wrap-dl runtime]
REPEAT;

Listing 3.6 shows the execution of the yes binary wrapped with wrap-dl. The yes binary continuously prints a y to stdout, which has been redirected to /dev/null to avoid unneeded suspensions; therefore it does not suspend itself like rt-loop does but, once its budget is exhausted, it is throttled by the kernel. The budget the application has available is the runtime given to wrap-dl, and it is replenished at the end of each period. As visible, the first part, since wrap-dl is used, is equal to the previous one. However, with respect to the previous case, there is no dequeue, since the yes application does not suspend itself but is throttled by the kernel. The dl_task_timer() refills the budget after each period and is followed by the enqueue.

Listing 3.6: yes executed with the exec of wrap-dl.

enqueue_task_dl
switched_to_dl
push_dl_task
select_task_rq_dl
balance_dl
dequeue_task_dl
__dequeue_task_dl
enqueue_task_dl
__dequeue_task_dl
balance_dl
pull_dl_task
select_task_rq_dl
balance_dl
enqueue_task_dl
task_woken_dl
[time difference of runtime]
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference of wrap-dl period - runtime]
dl_task_timer
enqueue_task_dl
push_dl_task
[time difference of runtime]
REPEAT;

Listing 3.7 shows the trace resulting from a SCHED_DEADLINE application that calls the sched_yield() system call. When sched_yield() is used with SCHED_RT (i.e. either SCHED_FIFO or SCHED_RR), the task gives up the CPU and is moved to the end of the queue by the scheduler. In SCHED_DEADLINE, however, the behaviour is different: the task that calls sched_yield()² gives up its remaining runtime and is throttled by the scheduler until the next period.

Listing 3.7: Application that calls sched_yield() in a for loop.

enqueue_task_dl
switched_to_dl
push_dl_task
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference with duration equal to period]
__dequeue_task_dl
balance_dl
pull_dl_task
__dequeue_task_dl
balance_dl
pull_dl_task
[time difference with duration equal to period]
REPEAT;

²The behaviour of the sched_yield() system call used with SCHED_DEADLINE is explained in the official documentation available at https://www.kernel.org/doc/html/latest/scheduler/sched-deadline.html#behavior-of-sched-yield


Note that Listing 3.2 and Listing 3.4 show the same trace, since the two applications behave similarly: in Listing 3.2 the usleep() of 1µs has no effect on the resulting trace.

3.2 First Fit and Worst Fit in Linux

Before going deep into the logic of first fit and worst fit, it is useful to better explain which are the core functions of gEDF in SCHED_DEADLINE. As has been said in the previous chapter, the push and pull operations are the main operations in gEDF, and they are implemented through the push_dl_task() and pull_dl_task() functions.

To realize the apEDF and a2pEDF algorithms, the push and pull functions of gEDF have been renamed to push_dl_task_global_edf() and pull_dl_task_global_edf(). The push function for FF and WF, which is the same for both heuristics, since the difference resides in the auxiliary functions they use, has been named push_dl_task_xf(). For the implementation of a2pEDF and its pull operation, instead, the pull_dl_task_first_fit() and pull_dl_task_worst_fit() functions have been added; in this case there are two different functions, since they differ from each other based on the heuristic.

What follows is a comparison of the push and pull operations of gEDF with the respective ones of the two heuristics; then an explanation is provided of the main changes to the select_task_rq_dl() and the enqueue_task_dl(), which are two of the main functions of the scheduling class. Finally, the concept of invariance is introduced and motivated.

3.2.1 Push operation comparison

First of all, the gEDF push function is examined and some details of how it works under the hood are covered. Leaving aside the retry logic and all the checks performed by push_dl_task_global_edf(), the push function exploits the find_lock_later_rq() function, which is the one actually in charge of finding the correct runqueue where the task has to be placed. The function returns the correct runqueue, called later_rq, with its lock held. Furthermore, find_lock_later_rq() relies on find_later_rq() which, in addition to various checks, tries to find and return the target CPU by exploiting cpudl_find(). However, it could happen that no CPU is found by cpudl_find(); in that case find_later_rq() fails and the retry logic of find_lock_later_rq() is triggered. Figure 3.1 shows the sequence diagram of the functions exploited by push_dl_task_global_edf().

[Sequence diagram: push_dl_task_global_edf() calls find_lock_later_rq(), which calls find_later_rq(), which in turn calls cpudl_find(); the target cpu and the later_rq are returned back up the chain.]

Figure 3.1: Auxiliary functions the push_dl_task_global_edf() exploits.

In the implementation of the first fit and worst fit, the previous chain of functions has been reproduced with different names. Some functions are common to both heuristics, since their logic does not depend on the heuristic itself. These functions are push_dl_task_xf(), find_lock_xf_suitable_rq() and find_xf_suitable_rq(). The first function is in charge of finding the correct runqueue, called suitable_rq, where the task has to be placed, and eventually places the task into the suitable runqueue. The push_dl_task_xf() exploits the find_lock_xf_suitable_rq() in order to find the runqueue. The find_lock_xf_suitable_rq() returns the suitable runqueue with its lock held; this is done by exploiting the find_xf_suitable_rq(), which is the one actually in charge of finding the runqueue through the core functions of the two heuristics. The first_fit_cpu_find() and worst_fit_cpu_find() functions are called by the find_xf_suitable_rq() and implement the logic of the FF and the WF, respectively. The find_xf_suitable_rq() chooses the correct function to call according to a sysctl tunable that defines the current policy. The sysctl tunables are explained in detail in Section 3.4. Even in this case, both the first_fit_cpu_find() and worst_fit_cpu_find() functions may not find any suitable CPU. The reason is that the bandwidth requirement of the task (i.e. the dl.dl_bw field of the task structure) is greater than the bandwidth left on every runqueue of the system, which is calculated with respect to the active and inactive utilization (i.e. the dl.this_bw field of the runqueue). Therefore, the task cannot be placed in any runqueue, since the EDF utilization upper bound (i.e. 1) would be exceeded. Whenever this happens, the find_xf_suitable_rq() returns the −1 error value to the find_lock_xf_suitable_rq(), and the fallback_to_gedf field of the dl_rq structure (i.e. dl.fallback_to_gedf of the runqueue structure) is set in order to try to switch to gEDF, if possible.

Since now there are two different push functions, the push_dl_task() wrapper has been added; this function dispatches to the available alternatives. The global_sched_dl_is_gedf(), global_sched_dl_is_first_fit() and global_sched_dl_is_worst_fit() functions, which read the current value of the sched_dl_policy sysctl tunable, are exploited by the push_dl_task() in order to check the current policy, so that the correct push function is called. The push_dl_task() either calls the push_dl_task_global_edf(), if the sched_dl_policy is set to global EDF, or the push_dl_task_xf(), if the current policy is either the first fit or the worst fit. However, it may happen, with either the FF or the WF, that the task cannot be placed in any runqueue. In this case the find_xf_suitable_rq() sets the dl.fallback_to_gedf field of the runqueue just before the −1 error value is returned to the find_lock_xf_suitable_rq(), so that the fallback to gEDF is triggered. The fallback to gEDF is a mechanism, enabled through a sysctl tunable, that allows switching to gEDF, when the current policy is FF or WF, whenever it is required (i.e. the task must be placed in a runqueue anyway at the end of FF or WF). Hence, the push_dl_task_global_edf() follows as a fallback the push_dl_task_xf() for placing the task when the XF heuristic (XF is used to indicate either FF or WF) cannot.
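The dispatching just described is essentially a switch on the policy tunable. The following user-space model illustrates the wrapper design; only the function names come from the text, the bodies are stand-ins:

#include <stdio.h>

enum dl_policy { DL_GLOBAL_EDF, DL_FIRST_FIT, DL_WORST_FIT };

/* Stand-in for the sched_dl_policy sysctl tunable. */
static enum dl_policy sched_dl_policy = DL_FIRST_FIT;

static int global_sched_dl_is_gedf(void)
{
    return sched_dl_policy == DL_GLOBAL_EDF;
}

static void push_dl_task_global_edf(void) { puts("push: gEDF");  }
static void push_dl_task_xf(void)         { puts("push: FF/WF"); }

/* Wrapper: dispatch to the push implementation selected by the
 * current policy; FF and WF share the same push entry point, and
 * the heuristic is chosen further down, in find_xf_suitable_rq(). */
static void push_dl_task(void)
{
    if (global_sched_dl_is_gedf())
        push_dl_task_global_edf();
    else
        push_dl_task_xf();
}

int main(void)
{
    push_dl_task();
    return 0;
}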

3.2.2 Pull operation comparison

Before explaining how the pull functions of both heuristics work, the pull of gEDF is examined.

The pull_dl_task_global_edf() function, which is the pull of global EDF, checks, for each runqueue, whether there is an earliest pushable task (i.e. a non-running one) that could be pulled away and eventually inserted in the runqueue from which the pull is invoked. The pull function only pulls from runqueues different from the one it is called on. This happens only if some conditions are met:

1. there must be an earliest pushable task in a different runqueue;
2. the task to be pulled must have a deadline lower than a minimum value;
3. the deadline of the task to be pulled must be before the deadline of the task currently running on the runqueue from which the pull is invoked.

The deadline of the target task to pull has to be lower than a minimum value called $d_{min}$. This gives the possibility to pull, for each runqueue, the earliest pushable task: whenever a task is pulled, $d_{min}$ is updated to the deadline of this task, so that the other earliest pushable tasks can be pulled only if they have a deadline occurring earlier than the last one that has been pulled. Moreover, the deadline of the task to be pulled has to be shorter than that of the currently running task, so that the latter gets preempted and the pulled task resumes its execution as soon as it gets migrated by the kernel.

The pull_dl_task_first_fit(), which is the pull function of the first fit, pulls the earliest pushable task starting from the last runqueue down to the runqueue right next to the one of the core/CPU from which the pull is invoked. Even in this case some conditions have to be met. These conditions are the same seen for the pull of gEDF, but in addition the task to be pulled also has to fit in the runqueue from which the pull is called. The task fits in the runqueue if the runqueue has enough bandwidth left for the task.

The pull_dl_task_worst_fit(), which is the pull function of the worst fit, has been implemented in two ways, which differ in one condition. The first implementation pulls the biggest task, in terms of bandwidth, from a different runqueue, while the second implementation pulls the task with the earliest deadline, still from a different runqueue.

Both implementations are divided into two phases. In the first phase, in the first implementation, the biggest task (i.e. the task with the greatest bandwidth) among the earliest pushable tasks of each runqueue is searched for; the search covers all the runqueues other than the one from which the pull is called. In the second implementation, instead, the task with the earliest deadline among all the earliest pushable tasks of each runqueue is searched for. Whenever a task that satisfies the criterion is found, the runqueue it belongs to and, depending on the implementation, either its bandwidth requirement or its deadline are saved for the second phase. In the second phase, if some task has been found, the earliest pushable task of the stored runqueue is taken again, since the task found in the first phase might not be on that runqueue anymore. If the new earliest pushable task fits in the runqueue from which the pull is invoked and it has either the same or a greater bandwidth requirement (first implementation), or the same or an earlier deadline (second implementation) than the task found in the first phase, then it is pulled.
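To make the two phases concrete, here is a user-space model of the bandwidth variant (runqueues are reduced to snapshots of their earliest pushable task; every name is hypothetical, and the locking that motivates the phase-two re-check is omitted):

#include <stddef.h>

struct cand {
    double bw;       /* bandwidth (dl_bw) of the earliest pushable task */
    double deadline; /* its absolute deadline                           */
    int    valid;    /* does this runqueue have a pushable task?        */
};

/* Phase 1: find the runqueue whose earliest pushable task has the
 * greatest bandwidth, skipping the calling CPU. */
static int wf_pull_phase1(const struct cand *rq, int m, int this_cpu)
{
    int best = -1;
    for (int j = 0; j < m; j++) {
        if (j == this_cpu || !rq[j].valid)
            continue;
        if (best < 0 || rq[j].bw > rq[best].bw)
            best = j;
    }
    return best;
}

/* Phase 2: re-read the earliest pushable task of the chosen
 * runqueue (it may have changed in the meantime) and pull it only
 * if it still fits locally and its bandwidth is not smaller than
 * the one recorded in phase 1. */
static int wf_pull_phase2(const struct cand *fresh, double saved_bw,
                          double local_free_bw)
{
    return fresh->valid &&
           fresh->bw <= local_free_bw && /* fits in the local runqueue */
           fresh->bw >= saved_bw;        /* still the biggest task     */
}

The deadline variant follows the same structure, with the comparisons performed on the deadline instead of the bandwidth.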

Even in this case the wrapper function pull_dl_task() has been added, and the global_sched_dl_is_gedf(), global_sched_dl_is_first_fit() and global_sched_dl_is_worst_fit() functions have been exploited, as push_dl_task() does, in order to call the correct pull function.

In order to track the number of migrations due to the pull operation, the /proc/sched_dl_pulls proc entry has been added. It exposes an atomic counter, starting from 0, of how many tasks have been migrated by the pull operation since the kernel booted.
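A minimal sketch of how such a counter can be exposed follows, using proc_create_single(), which is available in recent kernels; this is an approximation of the idea, not the thesis patch verbatim:

    #include <linux/atomic.h>
    #include <linux/init.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>

    static atomic64_t dl_pull_count = ATOMIC64_INIT(0);

    /* To be called from the pull path each time a task is migrated. */
    void sched_dl_account_pull(void)
    {
        atomic64_inc(&dl_pull_count);
    }

    static int sched_dl_pulls_show(struct seq_file *m, void *v)
    {
        seq_printf(m, "%lld\n", (long long)atomic64_read(&dl_pull_count));
        return 0;
    }

    static int __init sched_dl_pulls_init(void)
    {
        proc_create_single("sched_dl_pulls", 0444, NULL,
                           sched_dl_pulls_show);
        return 0;
    }
    late_initcall(sched_dl_pulls_init);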

3.2.3 The select_task_rq_dl() function

The aim of the select_task_rq_dl() function is to find the right CPU for a task; this search may be required after an exec() or upon the wake-up of a blocked task. The core scheduler calls the function associated to the select_task_rq field of the scheduling class data structure whenever a task is created with the exec() or fork() system calls, or when the scheduler wants to wake up a blocked task. Although the function can in principle be reached through a fork(), SCHED_DEADLINE applications are not allowed to fork by definition, so forking after the scheduling policy has been changed results in an error.

The select_task_rq_dl() function behaves differently depending on whether it is called after an exec() or after a wake-up. When gEDF is the current policy, the behaviour has not been changed. When the function follows an exec(), supposing that the policy is either first fit or worst fit, the newly created task has to be placed in a runqueue, therefore the heuristic is always applied. When instead the function is called as a result of a task wake-up, the task already belongs to a runqueue, so it may not be necessary to do anything. In this case, the heuristic has to be applied either when the invariance is enabled or when the runqueue associated to the CPU from which select_task_rq_dl() is called is overloaded. A runqueue is overloaded when it holds more tasks than it can accommodate, i.e. its bandwidth exceeds the allowed limit (U > 1).
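The decision just described can be condensed into a toy model (user-space C, not the kernel source):

    #include <stdbool.h>
    #include <stdio.h>

    /* After an exec() the heuristic always runs; after a wake-up it runs
     * only if the invariance is enabled or the task's current runqueue
     * is overloaded (bandwidth above the allowed limit). */
    enum reason { REASON_EXEC, REASON_WAKEUP };

    static bool must_apply_heuristic(enum reason r, bool invariance,
                                     bool rq_overloaded)
    {
        if (r == REASON_EXEC)
            return true;        /* the task must be placed anyway */
        return invariance || rq_overloaded;
    }

    int main(void)
    {
        printf("exec: %d\n",
               must_apply_heuristic(REASON_EXEC, false, false));
        printf("wakeup: %d\n",
               must_apply_heuristic(REASON_WAKEUP, false, false));
        printf("wakeup (overloaded rq): %d\n",
               must_apply_heuristic(REASON_WAKEUP, false, true));
        return 0;
    }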

3.2.4 The enqueue_task_dl() function

The core scheduler calls enqueue_task_dl() through the enqueue_task function pointer in the scheduling class data structure.

The enqueue_task_dl() function has been modified in order to apply the heuristic after a sched_setscheduler() call, when the policy is either first fit or worst fit. This is necessary whenever the application creates deadline tasks without using the exec() system call, which would instead trigger select_task_rq_dl() as explained before. The tasks that are created and started in the same code flow of the sched_setscheduler() system call, which changes the scheduler to SCHED_DEADLINE, have to be placed in the correct runqueue according to the current heuristic. For this purpose, the new flag ENQUEUE_SETSCHED has been added to the sched_setscheduler() path. With this flag, enqueue_task_dl() knows, from sched_setscheduler(), that there is a new deadline task to be placed in some runqueue, and thus applies the heuristic.

This change has been necessary because a SCHED_DEADLINE application is by default assigned to the core/CPU from which it is started. In other words, if the application is started from the shell, it is assigned to the same core/CPU as the shell, and the heuristic would not be applied.
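A simplified sketch of the mechanism follows; the flag value and the apply_xf_heuristic() helper are hypothetical, used only to illustrate how such a flag could be consumed in enqueue_task_dl():

    /* Hypothetical flag bit: the real value must not collide with the
     * ENQUEUE_* flags already defined in kernel/sched/sched.h. */
    #define ENQUEUE_SETSCHED 0x80

    static void enqueue_task_dl(struct rq *rq, struct task_struct *p,
                                int flags)
    {
        if ((flags & ENQUEUE_SETSCHED) && !global_sched_dl_is_gedf()) {
            /* The task has just become SCHED_DEADLINE through
             * sched_setscheduler(): place it according to the FF/WF
             * heuristic instead of keeping it on the caller's CPU. */
            apply_xf_heuristic(rq, p);  /* hypothetical helper */
            return;
        }
        /* ... usual enqueue path ... */
    }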

3.2.5 What is the invariance?

The invariance is a condition that makes sense only when the current policy is either FF or WF. It allows for choosing whether it is preferable to reduce task migrations at the cost of not strictly applying the XF heuristic. To explain how the invariance works, suppose that the current policy is FF and that the invariance is not enabled: whenever select_task_rq_dl() or a push operation is called, if the bandwidth of the runqueue where the task resides does not exceed the allowed limit, the FF heuristic is not applied, and the task stays in its current runqueue. Nevertheless, there may be a runqueue, belonging to a core/CPU with a lower index than the task's current one, that has enough bandwidth for the task; since the invariance is not enabled, the task is left where it is and the heuristic is not applied.

If the invariance is enabled, the heuristic is always applied, resulting in more task migrations.

The invariance can be enabled through its sysctl tunable, which is explained later.
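The effect of the invariance can be made concrete with a toy first-fit placement model (user-space C, illustrative names and values):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy first-fit placement for one task of utilization 'u' that
     * currently sits on CPU 'cur'.  With the invariance enabled the
     * heuristic always runs; without it, the task stays put as long as
     * its runqueue is not overloaded. */
    static int ff_place(const double *free_bw, int n, int cur, double u,
                        bool invariance, bool cur_overloaded)
    {
        if (!invariance && !cur_overloaded)
            return cur;                 /* leave the task where it is */
        for (int cpu = 0; cpu < n; cpu++)
            if (free_bw[cpu] >= u)
                return cpu;             /* lowest-indexed CPU that fits */
        return -1;                      /* no fit: fallback case */
    }

    int main(void)
    {
        double free_bw[3] = { 0.5, 0.2, 0.6 };
        /* Task of utilization 0.3 on CPU 2, whose rq is not overloaded. */
        printf("invariance off -> CPU %d\n",
               ff_place(free_bw, 3, 2, 0.3, false, false));
        printf("invariance on  -> CPU %d\n",
               ff_place(free_bw, 3, 2, 0.3, true, false));
        return 0;
    }

With the invariance off the task stays on CPU 2; with it on, the task migrates to CPU 0, the lowest-indexed CPU with enough bandwidth.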

3.3 Fallback to global EDF

The fallback to gEDF is a mechanism that allows switching to gEDF whenever the taskset is not partitionable and one or more tasks cannot be placed in any of the runqueues. Before the scheduler can switch to gEDF when necessary, the mechanism has to be enabled through a sysctl tunable, which is disabled by default. Whenever this tunable is set to true and either the FF or the WF heuristic cannot place the task in any of the runqueues, the scheduler switches to gEDF to place the task in one of them.

The fallback is performed by means of the push operation or of select_task_rq_dl(). Whenever the task cannot be placed in any of the runqueues, find_xf_suitable_rq(), on behalf of the push operation, or select_task_rq_dl() itself sets the dl.fallback_to_gedf flag of the runqueue structure. Then, when the sysctl tunable allows the fallback to gEDF and the flag has been set, either the gEDF push (in case of the push operation) or find_later_rq() (in case of the select) is called in order to place the task correctly. Figure 3.2 shows how the fallback to gEDF works.

To better understand when the fallback to gEDF is useful, consider a system made up of 2 cores/CPUs and a taskset of 3 identical tasks, numbered from 1 to 3, each with utilization 0.6: such a taskset is not partitionable on two CPUs, because the utilization of one of the two runqueues would exceed the upper limit of EDF (i.e. 1). In this case it does not matter which heuristic is used. For the sake of simplicity, suppose the tasks arrive in the order of their numbers. The first task is placed in the first runqueue and the second task in the second runqueue; what happens to the third task? It cannot be placed in either runqueue, because its utilization does not fit in any of them. When the XF heuristic realizes that the third task cannot be placed in any runqueue, the dl.fallback_to_gedf field of the runqueue is set by find_xf_suitable_rq() and, if the fallback is enabled by the sysctl tunable, the fallback to gEDF is triggered.
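This example can be reproduced with a toy partitioner (user-space C, purely illustrative):

    #include <stdio.h>

    /* Toy first-fit partitioner for the example above: three tasks of
     * utilization 0.6 each on two CPUs with a per-runqueue bound of 1.0. */
    int main(void)
    {
        double u[3] = { 0.6, 0.6, 0.6 };
        double rq_bw[2] = { 0.0, 0.0 };

        for (int t = 0; t < 3; t++) {
            int placed = -1;
            for (int cpu = 0; cpu < 2; cpu++) {
                if (rq_bw[cpu] + u[t] <= 1.0) {
                    rq_bw[cpu] += u[t];
                    placed = cpu;
                    break;
                }
            }
            if (placed < 0)
                printf("task %d does not fit: fallback to gEDF\n", t + 1);
            else
                printf("task %d -> CPU %d\n", t + 1, placed);
        }
        return 0;
    }

Tasks 1 and 2 land on CPUs 0 and 1 respectively, while task 3 triggers the fallback, exactly as in the description above.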



Figure 3.2: Diagram describing how the fallback to gEDF works.

3.4 Sysctl Tunables

The sysctl tunables have been mentioned throughout the chapter; they are now examined in detail.

All the tunables introduced with the apEDF and a2pEDF patches may be found in the /proc/sys/kernel directory, all with the common prefix sched_dl. The tunables that start with sched_rt were already present in SCHED_DEADLINE. Write access is granted only with root permissions.

The sched_dl_policy tunable allows for choosing the policy to use. The admitted values are: 0 for gEDF, 1 for apEDF First Fit and 2 for apEDF Worst Fit. The default value is 0.

The sched_dl_fallback_to_gedf tunable enables, when set to 1, the fallback to gEDF, which has been explained before. The default value is 0 (disabled).

The sched_dl_xf_invariance tunable enables, when set to 1, the invariance of the XF heuristic; it has no effect when the policy is 0 (i.e. gEDF). The default value is 0 (disabled).

The sched_dl_xf_pull tunable enables, when set to 1, the pull operation of a2pEDF; when disabled, the scheduler behaves as apEDF.

The sched_dl_xf_runtime_us tunable defines the maximum runtime per runqueue when the XF heuristic is in use; it has no meaning for gEDF.

The difference between sched_rt_runtime_us and sched_dl_xf_runtime_us is that the former defines the maximum runtime allowed, to all the SCHED_DEADLINE applications in the system, over a period, which is defined in sched_rt_period_us; the latter, instead, is used to compute the utilization upper bound of each runqueue when the current policy is either first fit or worst fit.
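As an illustration, the tunables can be set from user space by writing the corresponding /proc files, for instance as follows (root permissions and a kernel carrying the apEDF/a2pEDF patches are assumed):

    #include <stdio.h>

    /* Illustrative helper: write one value into a sysctl proc file. */
    static int write_tunable(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fputs(value, f);
        return fclose(f);
    }

    int main(void)
    {
        /* Select apEDF First Fit and enable the fallback to gEDF. */
        write_tunable("/proc/sys/kernel/sched_dl_policy", "1\n");
        write_tunable("/proc/sys/kernel/sched_dl_fallback_to_gedf", "1\n");
        return 0;
    }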

The maximum utilization Umax that the whole system can handle, in a system made up of M CPUs, can be computed as follows:

Umax = M ∗ sched_rt_runtime_us / sched_rt_period_us.

The utilization upper bound of each runqueue can be computed as:

UmaxRQ = sched_dl_xf_runtime_us / sched_rt_period_us.

By default, sched_dl_xf_runtime_us is equal to sched_rt_runtime_us. To clarify this concept, consider the following example: suppose that the system is made up of 2 cores/CPUs, sched_rt_runtime_us is equal to 800000µs and sched_dl_xf_runtime_us is equal to 950000µs (with sched_rt_period_us at its default of 1000000µs). Under this assumption the maximum utilization allowed to SCHED_DEADLINE applications is 2 ∗ 800000/1000000 = 1.6, but the maximum utilization that each runqueue can reach is not 1.6/2 = 0.8 but 950000/1000000 = 0.95.


Chapter 4

Experimental Evaluation

The experiments have been performed using random tasksets generated with the Randfixedsum algorithm [11]. The tasksets have been generated with different sizes and utilizations, which are the taskset parameters; for each possible combination of these parameters, 30 different tasksets have been generated and run on the platform.

The duration of each experiment depends on the taskset itself: each taskset, made up of n tasks, has been executed from time 0 to 2∗H, where H = lcm(P1, . . . , Pn) is the taskset's hyperperiod and Pi is the period of task τi. The hyperperiod is the minimum amount of time after which the schedule repeats itself: once H time units have elapsed, the scheduler schedules the tasks, if the taskset has not changed, in the same order as in the previous hyperperiod.
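For instance, the experiment length 2∗H can be computed from the task periods as follows (illustrative helper, arbitrary example periods):

    #include <stdio.h>

    /* Greatest common divisor and least common multiple helpers. */
    static unsigned long long gcd(unsigned long long a, unsigned long long b)
    {
        while (b) {
            unsigned long long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    static unsigned long long lcm(unsigned long long a, unsigned long long b)
    {
        return a / gcd(a, b) * b;
    }

    int main(void)
    {
        /* Example task periods in microseconds (illustrative values). */
        unsigned long long periods[] = { 10000, 25000, 40000 };
        unsigned long long H = periods[0];
        for (int i = 1; i < 3; i++)
            H = lcm(H, periods[i]);
        printf("hyperperiod H = %llu us, experiment length = %llu us\n",
               H, 2 * H);
        return 0;
    }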

The experiments have been performed on 2 and 4 CPUs, with different tasksets for each configuration. In order to restrict the CPUs where the tasks could be scheduled, a cpuset has been created.

The cpuset is a mechanism that allows assigning cores/CPUs and memory to a set of tasks. Through this mechanism the scheduler is constrained to schedule the tasks or applications only on the resources that belong to the cpuset. The alternative to the cpuset is to take the CPUs that must not be used offline, so that they are not available at all. This is done by echoing 0 into /sys/devices/system/cpu/cpuX/online, where X is the CPU number.

Since the experiments were executed on 2 and 4 CPUs, different kinds of tasksets have been generated. For the case of 2 CPUs, the generated tasksets have utilization ranging from U = 1.4 to U = 1.9; for the case of 4 CPUs, the generated tasksets have utilization ranging from U = 2.4 to U = 3.8. The utilization upper values (i.e. 1.9 and 3.8) correspond to the maximum utilization Umax allowed by the default sched_rt_runtime_us (950000µs over a sched_rt_period_us of 1000000µs), that is M ∗ 0.95 with M = 2 and M = 4 respectively.
