Supporting Parallelism in Multicore Real-Time Computing Systems

PhD Programme in Emerging Digital Technologies
Academic Year 2016-2017
Author: Alessandra Melani

Dissertation submitted to Scuola Superiore Sant’Anna for the degree of Doctor of Philosophy.

Supervisor:
Giorgio Buttazzo, Scuola Superiore Sant’Anna, Italy

Tutor:
Tommaso Cucinotta

Committee:
Marko Bertogna, University of Modena and Reggio Emilia, Italy
Marco Caccamo, University of Illinois at Urbana-Champaign, IL, USA
Luís Almeida, University of Porto, Portugal
Arne Hamann, Robert Bosch GmbH, Germany
Abstract
The exploitation of multicore processors is crucial to the development of advanced real-time embedded systems, since they allow increasing performance with contained energy consumption, which also reflects on cooling, weight and space constraints. At the same time, however, the advent of multicore processors challenges the established approaches to system design and analysis.
On one side, modern high-end embedded systems are increasingly concerned with providing high performance in real-time, which requires an evolution of programming paradigms to combine traditional requirements, i.e., ease of programmability and efficient exploitation of parallel resources, with timing and schedulability analysis techniques.
In the context of safety-critical embedded systems, instead, the development of practical solutions is needed to ease the transition to multicore systems, while addressing the major sources of concern that are currently delaying their adoption. In the avionics domain, for instance, such issues are mainly related to certification directives, resource contention and porting existing applications to different platforms. In particular, the design of predictable multicore systems should be able to cope with the timing unpredictability arising from physical shared resources, such as last-level caches and main memory.
This dissertation addresses some of the challenges related to the integration of parallel execution in modern real-time multicore systems, along three main dimensions. First, the schedulability problem for parallel tasks in a globally-scheduled system is considered, with the goal of reconciling high performance and predictability. The analysis is then instantiated for the particular case of OpenMP, considering the peculiarities of its execution and scheduling model. Second, a scheduling framework based on partitioned scheduling is presented for executing safety-critical software on parallel platforms, with the salient feature of mitigating the sources of concern mentioned above. Finally, a theoretical execution model is introduced to exploit parallelism at job level by leveraging the pipelined execution on different resources, leading to an efficient use of the computing power offered by multicore systems.
Contents
This dissertation is organized into eight chapters. Chapter 1 introduces this work, motivating the increasing diffusion of multicore platforms in the embedded domain and discussing the reasons that are currently delaying their adoption for safety-critical applications. Chapter 2 presents the necessary background material and reviews the related work. Chapter 3 describes the system model and the assumptions used throughout this thesis. In particular, we describe the conditional parallel (cp-task) model, which extends existing parallel models by representing each task as a graph that contains both parallel and conditional nodes. In Chapter 4, efficient ways to compute an upper-bound on the response time of each cp-task are derived using different global scheduling algorithms. Chapter 5 instantiates the general scheduling problem for parallel real-time tasks to one of the most popular parallel programming paradigms, i.e., OpenMP, for which an accurate timing characterization is derived, considering both static and dynamic scheduling strategies. Then, Chapter 6 presents a scheduling framework based on partitioned scheduling to support the development of next-generation avionics systems on multicore platforms. In Chapter 7, smart co-scheduling algorithms are described to exploit parallelism at job level, allowing an easier characterization of the timing interference and a significant schedulability improvement over classic execution models. In Chapter 8, we conclude this dissertation and describe possible directions to carry out relevant future research.
Acknowledgments
This dissertation would not have been possible without the help of many talented people.
First of all, I am deeply grateful to Giorgio Buttazzo, my advisor, who gave me continued help, guidance and motivation over the past three years, making my life as a PhD student very fruitful. He has been a constant source of suggestions and encouragement, and allowed me to benefit a lot from his passion and dedication to work.
I am extremely thankful to Marko Bertogna, who not only taught me the “secrets” of real-time scheduling, but also provided me with many opportunities to work on extremely interesting problems. He was a constant reference point during the development of many works that originated from our productive collaboration.
I would also like to express my sincerest gratitude to Marco Caccamo, who kindly hosted my memorable visit to UIUC, which greatly expanded my knowledge about real-time systems, and computing science research in general. It has been such a pleasure to work with and to learn from Marco’s team.
I especially want to thank Renato Mancuso, for the fruitful exchanges of ideas and brainstorming, and for sharing a great time with me in the United States. He left a profound impression on me of how a successful PhD should ideally be conducted.
Finally, thanks to all my coauthors and colleagues for being part of my academic journey, and for giving me different perspectives on research. This was certainly one of the most instructive experiences of my whole educational path.
Contents
1 Introduction 1
2 Background and Related Work 5
2.1 Multiprocessor architectures . . . 5
2.1.1 The contention problem . . . 6
2.2 Multiprocessor real-time scheduling . . . 7
2.2.1 Task modeling . . . 8
2.2.2 Preemption and migration scheme . . . 8
2.2.3 Prioritization scheme . . . 9
2.2.4 Performance metrics . . . 9
2.2.5 Sequential tasks . . . 10
2.2.6 Parallel tasks . . . 15
2.3 OpenMP programming model . . . 18
2.4 Scheduling for Integrated Modular Avionics (IMA) systems . . . 19
2.5 Co-scheduling techniques . . . 21
3 System Model 24
3.1 The conditional parallel task model . . . 24
3.1.1 Special cases . . . 26
3.2 Platform model . . . 27
3.3 Scheduler model . . . 27
4 Efficient Scheduling of Parallel Real-Time Tasks 28
4.1 Critical interference of cp-tasks . . . 29
4.2 Response time analysis . . . 30
4.2.1 Inter-task interference . . . 30
4.2.2 Intra-task interference . . . 32
4.2.3 Computation of cp-task parameters . . . 33
4.2.4 Schedulability condition . . . 35
4.2.5 Improved upper-bounds on intra-task interference . . . 35
4.3 Experimental characterization . . . 39
4.3.1 Experimental results for cp-tasks . . . 41
4.3.2 Experimental results for classic DAG tasks . . . 46
4.3.3 Evaluation of RTA-FP vs. RTA-EDF . . . 46
5 Timing Characterization of the OpenMP Tasking Model 50
5.1 OpenMP4 tasking model . . . 51
5.1.1 From thread-centric to task-centric model . . . 51
5.1.2 OpenMP4 programming features . . . 52
5.1.3 OpenMP task-to-thread scheduling . . . 54
5.2 Timing characterization of OpenMP4 . . . 56
5.2.1 OpenMP4 tasking model and DAG task model . . . 56
5.2.2 Schedulability problem for OpenMP4 . . . 57
5.3 Schedulability analysis of untied tasks . . . 58
5.4 Impact of tied tasks on scheduling . . . 60
5.4.1 Reduction of available threads . . . 60
5.4.2 Issues on the timing characterization of tied tasks . . . 63
5.5 A static scheduling approach for OpenMP applications . . . 64
5.5.1 Optimal static allocation of OpenMP-DAGs . . . 65
5.5.2 Sub-optimal static allocation of OpenMP-DAGs . . . 68
5.5.3 Evaluation . . . 70
5.6 Summary . . . 73
6 Parallelism Support for Integrated Modular Avionics Systems 75
6.1 Motivating example . . . 78
6.2 System model assumptions . . . 79
6.3 Reclaiming algorithm . . . 81
6.3.1 Budget transfer mechanism . . . 81
6.3.2 Budget assignment policy . . . 85
6.3.3 Evaluation metrics . . . 87
6.4 Schedulability analysis . . . 87
6.4.1 Background . . . 88
6.4.2 Fully-preemptive scheduling . . . 89
6.4.3 Limited-preemptive scheduling . . . 89
6.5 Experimental evaluation . . . 91
6.5.1 Workload generation . . . 91
6.5.2 Simulation results . . . 92
6.6 Summary . . . 99
7 Enhancing Parallelism Through Memory-Processor Co-Scheduling 100
7.1 System model assumptions . . . 102
7.2 Schedulability analysis . . . 103
7.2.1 Critical instant . . . 106
7.2.2 Exact response time analysis . . . 109
7.3 The priority assignment problem . . . 110
7.3.1 Deadline Monotonic (DM) is not optimal . . . 110
7.3.2 Applicability of OPA algorithm . . . 111
7.3.3 OPA-compatible schedulability tests . . . 111
7.3.4 Different priorities for M- and C-phases . . . 112
7.3.5 Two-phase priority assignment . . . 114
7.4 Experimental results . . . 114
7.4.2 Schedulability results . . . 115
7.4.3 Priority assignment results . . . 118
7.5 Summary . . . 120
8 Conclusions 122
List of Figures
1.1 System parallelism. . . 4
1.2 Task parallelism. . . 4
1.3 Job parallelism. . . 4
2.1 Memory-processor performance gap from 1980 to 2010 (from [HP11]). Performance values on the y-axis are in logarithmic scale. . . 6
2.2 Typical memory hierarchy in a multiprocessor architecture. . . 7
2.3 Pseudocode of the OPA algorithm. . . 11
2.4 Example showing the Dhall’s effect. . . 12
2.5 Worst-case scenario to maximize the workload of an interfering task τi in the sequential case for any work-conserving scheduler. . . 13
2.6 Worst-case scenario to maximize the workload of an interfering task τi in the sequential case under G-EDF. . . 14
2.7 MF/GMF models. . . 15
2.8 RB model. . . 15
2.9 Digraph model. . . 15
2.10 Fork-join model. . . 17
2.11 Synchronous parallel model. . . 17
2.12 DAG model. . . 17
3.1 A sample cp-task. Each vertex is labeled with the WCET of the corresponding sub-task. . 25
4.1 A parallel program with conditional execution. . . 29
4.2 Worst-case scenario to maximize the workload of an interfering cp-task τi. Shifting the problem window by cannot increase the interfering workload. . . 31
4.3 Example of cp-task that shows the pessimism of the upper-bound given in Equation (4.4). 36
4.4 Example of cp-task showing the two sources of pessimism in the upper-bound of Algorithm 2. 38
4.5 Graph structure of the Cholesky benchmark. . . 40
4.6 Randomly generated cp-task having three conditional branches with unbalanced workload. 41
4.7 Randomly generated cp-task with a high level of parallelism and a large degree of connectivity. . . 42
4.8 RTA-FP as a function of UT (m = 4, constrained deadlines). . . 43
4.9 RTA-FP as a function of m (UT = 2, constrained deadlines). . . 43
4.10 RTA-FP as a function of n (m = 4, UT = 2, constrained deadlines). . . 43
4.11 RTA-EDF as a function of UT (m = 8, constrained deadlines). . . 45
4.12 RTA-EDF as a function of m (UT = 2, constrained deadlines). . . 45
4.14 RTA-EDF as a function of UT (m = 8, implicit deadlines). . . 47
4.15 RTA-EDF as a function of m (UT = 2, implicit deadlines). . . 47
4.16 RTA-EDF as a function of n (m = 8, UT = 2, implicit deadlines). . . 47
4.17 RTA-FP/EDF as a function of UT (cp-tasks, m = 8, implicit deadlines). . . 48
4.18 RTA-FP/EDF as a function of UT (DAGs, m = 8, implicit deadlines). . . 48
5.1 Example of an OpenMP program using tasking constructs. . . 53
5.2 Example of an OpenMP program using synchronization constructs. . . 55
5.3 DAG corresponding to the program in Figure 5.1. . . 57
5.4 Average makespan as a function of N, with m = 4 and nmax = 8. . . 72
5.5 Average running time as a function of N, with m = 4 and nmax = 8. . . 73
5.6 Average makespan as a function of N, with m = 4 and nmax = 8 for the tied case. . . 74
5.7 Average makespan as a function of N, with m = 4 and nmax = 8 for the untied case. . . 74
6.1 Example demonstrating the schedulability advantage of a Periodic Server (a) in compari-son with a Cyclic Executive scheduling scheme (b). . . 79
6.2 Example showing the budget waste arising from self-donation. . . 82
6.3 Example showing how budget donation can readmit a skipped job. . . 83
6.4 Example showing the rules for updating the worst-case budget and the budget donation mechanism. . . 84
6.5 Critical instant scenario for an unbound task scheduled under a periodic server. . . 88
6.6 Impact of limited preemption in the last server period. . . 90
6.7 Block diagram of the proposed schedulability analysis. . . 91
6.8 System optional workload rate in the case of bound tasks with log-uniform runtimes. . . . 92
6.9 System optional workload rate in the case of unbound tasks with log-uniform runtimes. . 93
6.10 Load balance index in the case of bound tasks with log-uniform runtimes. . . 93
6.11 Load balance index in the case of unbound tasks with log-uniform runtimes. . . 93
6.12 System optional workload rate in the case of bound tasks with uniform runtimes. . . 94
6.13 System optional workload rate in the case of unbound tasks with uniform runtimes. . . . 95
6.14 Load balance index in the case of bound tasks with uniform runtimes. . . 95
6.15 Load balance index in the case of unbound tasks with uniform runtimes. . . 95
6.16 System optional workload rate under reduced optional utilization with bound tasks. . . . 96
6.17 System optional workload rate under reduced optional utilization with unbound tasks. . . 96
6.18 System optional workload rate with bound tasks and m = 2 cores. . . 97
6.19 System optional workload rate with bound tasks and m = 4 cores. . . 97
6.20 System optional workload rate with bound tasks and m = 8 cores. . . 97
6.21 System optional workload rate with varying optional utilization (workload-based approach). 98
6.22 System optional workload rate with varying optional utilization (priority-based approach). 98
6.23 System optional workload rate with varying optional utilization (fairness-based approach). 98
7.1 EDF is not optimal for collections of sporadic M/C tasks. . . 104
7.2 Synchronous release (a) and critical instant configuration (b). . . 108
7.3 Counterexample on the sub-optimality of DM. . . 111
7.4 Experiments varying UT, with fmc log-uniformly distributed in [0.1, 10], and n = 8. . . 116
7.5 Weighted schedulability as a function of fmc, with n = 8, and implicit deadlines. . . 116
7.6 Experiments varying n, with fmc log-uniformly distributed in [0.1, 10], implicit deadlines. 117
7.7 Experiments varying n, with fmc log-uniformly distributed in [0.1, 10], constrained deadlines. 117
7.8 Experiments varying αd, with fmc log-uniformly distributed in [0.1, 10], and n = 8. . . 118
7.9 Evaluation of priority assignments, n = 8 and fmc log-uniformly distributed in [0.1, 10]. . 119
7.10 Evaluation of priority assignments, n = 8 and fmc log-uniformly distributed in [0.1, 10]. . 119
Chapter 1
Introduction
Multicore platforms are rapidly replacing their uniprocessor counterparts in several application domains, since they allow increasing performance with limited energy consumption. Modern embedded applications, such as object tracking and detection, image and video processing, distributed control and even in-car infotainment units, are increasingly characterized by both real-time and high-performance requirements, which can only be met by exploiting the processing capabilities offered by multi- and many-core architectures. This appetite for computational power is often accompanied by other requirements, such as low weight, low power consumption and low heat dissipation, which are all directly addressed by multicore processors. In brief, the main reasons motivating the transition to multicore architectures can be summarized as follows:
• More computing power: the possibility of combining multiple cores on the same die (without significantly increasing the die size) allows achieving parallel processing at a lower CPU clock frequency, due to the fact that no power is required to drive signals external to the chip and some circuitry is shared among the cores;
• Separation of applications: multicore architectures allow for specialization and/or isolation of system components. This may be motivated by legal reasons, for example when different software components originating from multiple suppliers may need to coexist in the same system, or by separation of concerns, because different operating systems or applications with different requirements might need to operate independently on the same platform (e.g., general purpose vs. real-time processing);
• Functional safety: in safety-critical scenarios, it is fundamental to ensure avoidance of mutual interference among different applications, possibly having different criticality levels. Running different application software components on different cores represents one of the possible ways of achieving this. With reference to functional safety, parallel processing also allows redundancy, that is, an identical application can be run on multiple cores to ease fault detection;
• Less power dissipation: the great amount of computing power offered by multicores can be exploited “on-demand”, i.e., active tasks can be packed onto fewer cores, and idle cores might be brought into sleep modes, which allows for less power dissipation and reduced energy consumption.

However, the intrinsic complexity of the hardware, combined with the evolution of programming models, challenges the traditional timing and schedulability analysis techniques, which were designed primarily in the embedded domain to analyze simple software code meant to run on simple and predictable hardware architectures.
Unfortunately, migrating existing software code to multicore platforms often hides serious complications, especially if the code was originally designed under the assumption that the underlying platform does not provide parallel execution. According to the sequential programming model that well fits conventional uniprocessors, applications are automatically serialized by the operating system: even if many tasks appear to run simultaneously, in fact just a single task runs at any point in time on the processor. Conversely, multiple software tasks really run in parallel on a multicore platform, implying that system resources, like last-level caches and main memory, may be accessed simultaneously by different tasks. Therefore, an application developed for a single core will likely be subject to higher interference when executed in a multicore environment. Apart from the impact on the logical correctness of applications, their timing behavior can also be negatively affected, because shared resource contention can lead to significant execution time variations.
Notably, the first general-purpose multicore platforms were broadly commercialized as early as 2006, and reached the embedded market in 2010. Despite their substantial and now predominant availability as computing platforms, it is emblematic that safety-critical systems are still developed using single-core platforms. For example, no civil avionics system currently in production uses a multicore processor to implement core aircraft functionality. Indeed, a set of practical challenges and considerations originating from the avionics industry, which have traditionally been overlooked by the literature, need to be carefully addressed before a safety-critical multicore system can even be conceived.
On one hand, the analysis of multicore systems has been considered as an intrinsically complex problem since 1969, when Liu [Liu69] observed that “Few of the results obtained for a single processor generalize directly to the multiple processor case; bringing in additional processors adds a new dimension to the scheduling problem. The simple fact that a task can use only one processor even when several processors are free at the same time adds a surprising amount of difficulty to the scheduling of multiple processors”. For example, multiprocessor scheduling suffers from timing anomalies, that is, some task sets might be deemed unschedulable regardless of the number of processors used. Hence, it is significantly harder to identify worst-case scenarios and derive guaranteed utilization bounds, which is in sharp contrast with the uniprocessor case where optimal scheduling algorithms and schedulability tests have been derived.
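A concrete instance of this difficulty is Dhall’s effect (illustrated later in Figure 2.4): under global EDF, a task set whose total utilization barely exceeds 1 can be unschedulable no matter how many processors are available. The sketch below, in Python with arbitrarily chosen parameters, reproduces the classic construction:

```python
# Dhall's effect: m "light" tasks with C = 2*eps, T = D = 1, plus one
# "heavy" task with C = 1, T = D = 1 + eps.  All tasks release at t = 0.
# Parameters (m, eps) are arbitrary illustration values.

def total_utilization(m, eps):
    # Total utilization approaches 1 as eps -> 0, regardless of m.
    return m * (2 * eps) / 1.0 + 1.0 / (1.0 + eps)

def heavy_task_supply(eps):
    # Under G-EDF the m light tasks have the earliest absolute deadlines
    # and occupy all m cores during [0, 2*eps], so the heavy task can
    # execute for at most (1 + eps) - 2*eps before its deadline.
    return (1.0 + eps) - 2 * eps

m, eps = 4, 0.01
print(f"U = {total_utilization(m, eps):.3f} out of a capacity of m = {m}")
print(f"heavy task executes for {heavy_task_supply(eps):.2f} < C = 1: deadline miss")
```

Although the platform offers a capacity of m, the heavy task misses its deadline for any m, so no utilization bound better than 1 exists for G-EDF without further restrictions.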
In addition, in a multicore system there is a strong coupling between timing and schedulability analysis, since worst-case execution times (WCETs) should also account for the cross-core interference generated on shared resources [ADI+15]. This is especially detrimental in the context of safety-critical systems, in which a system failure may cause severe damage and even loss of life.
Real-time tasks on multicore processors are traditionally scheduled according to either partitioned or global approaches [DB11b]. Partitioned approaches are based on a static task assignment, that is, a fixed mapping of tasks to processors is computed first, and then scheduling decisions are taken at runtime on each core according to existing scheduling approaches for uniprocessors. In contrast, global scheduling allows a dynamic task assignment, meaning that any task instance can be executed on any core and is free to migrate on different cores during execution. Scheduling decisions are taken by the scheduler assuming a system-wide priority ordering. Traditionally, partitioned scheduling is the favored approach, mainly due to its simplicity and to the possibility of reusing the well-established analytical results for uniprocessors. It has also proved to be effective under popular uniprocessor scheduling algorithms such as Earliest Deadline First (EDF) and Rate Monotonic (RM), and is also widely supported by the industry, e.g., in the automotive and avionics sectors. Nevertheless, partitioning also presents some fundamental drawbacks. First, it is inherently sub-optimal, since the task allocation requires solving a bin-packing problem, which is known to be strongly NP-hard. Typically, partitioning is done using heuristics, which may be unable to schedule task sets that are schedulable using an optimal task assignment. Second, partitioning is not suitable for open systems, in which tasks may dynamically join and leave, because the
arrival of a new task might require re-partitioning the entire system. Third, unused processor capacity cannot be efficiently exploited in a partitioned system, since it cannot be shared among the different processors.
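As a rough illustration of the allocation step (not of any specific method from this thesis), a first-fit decreasing heuristic over task utilizations, with a per-core EDF utilization test, can be sketched as follows; all task parameters are made up:

```python
def first_fit_decreasing(utilizations, m, capacity=1.0):
    """Partition tasks (given by their utilization) onto m cores using the
    first-fit decreasing heuristic.  Returns the per-core assignment, or
    None if the heuristic fails to place some task."""
    cores = [[] for _ in range(m)]
    load = [0.0] * m
    for u in sorted(utilizations, reverse=True):
        for i in range(m):
            if load[i] + u <= capacity + 1e-9:  # per-core EDF test: sum of U <= 1
                cores[i].append(u)
                load[i] += u
                break
        else:
            return None  # heuristic failure: task fits on no core
    return cores

# Six tasks on two cores: feasible for first-fit decreasing.
print(first_fit_decreasing([0.6, 0.4, 0.3, 0.3, 0.2, 0.2], m=2))
# Three tasks of utilization 0.6 cannot be packed onto two unit-capacity
# cores by ANY assignment, even though the total utilization (1.8) is
# below the platform capacity m = 2.
print(first_fit_decreasing([0.6, 0.6, 0.6], m=2))
```

The second example shows the inherent sub-optimality mentioned above: the failure is not the heuristic’s fault, since no partitioning exists, yet a global scheduler could exploit the spare capacity.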
On the other hand, global scheduling strategies allow for an overall higher performance, since the utilization of processing resources is automatically balanced, which leads to lower average response times and to a more efficient reclaiming in case of early task completions. This feature also makes global scheduling more appropriate for open systems. Experimental evaluation with the Linux operating system showed that global configurations are a viable solution for multicore systems [LFCL12]. However, global scheduling policies cannot rely on the wealth of existing results developed for uniprocessors, hence they necessitate new schedulability analysis techniques, which are often sub-optimal or require idealized assumptions (e.g., Proportional fair (P-fair) scheduling [BCPV96]). A further drawback of global scheduling is that it adds uncertainty to the lower-level timing analysis, which may become pessimistic to account for migration costs.
Depending on the context, either global or partitioned scheduling may be more suitably applied to get the desired performance on multicore systems. Indeed, combinations of the two approaches, usually classified as clustered scheduling approaches, have also been proposed to combine their benefits.
Building upon these observations, this thesis focuses on different ways of exploiting parallelism in multicore systems, at three levels of granularity.
At system level. Different tasks can be executed in parallel on different cores, but each task instance (i.e., each job) always runs on the same core (see Figure 1.1).
Given its simplicity, this model of parallelism appears to be suitable for the development of safety-critical applications, for which it is essential to provide predictability as well as flexibility in managing unexpected temporal misbehaviors of multicore platforms. In this thesis, avionic systems are considered as emblematic examples of safety-critical real-time systems. Considering this domain, a scheduling framework is proposed to support the widely used Integrated Modular Avionics (IMA) architecture on multicores. The proposed approach ensures temporal modularity among different applications through a resource reservation mechanism. This method is based on partitioned scheduling, as it facilitates timing analysis and is therefore more compatible with existing safety-critical applications. In this context, parallelism is exploited to handle unexpected overload conditions without sacrificing average-case performance.
At task level. Different parts of the same task can be executed in parallel on different cores, but only one task instance can be executed at any point in time (see Figure 1.2).
This model of parallelism is able to meet the increasing interest in applications with both high-performance and real-time requirements, driven by the availability of multi-/many-core platforms in the embedded market [Kala, Par, Key]. Along with this trend, several classic schedulability results need to be extended to comply with such new platforms, providing new models and tests to guarantee the timing requirements of parallel task systems. In this thesis, a schedulability analysis is derived for globally-scheduled parallel systems with the goal of providing temporal predictability while increasing efficiency. The considered computational model supports a high degree of concurrency among sequential computational units and well captures the parallel nature of modern applications based on programming paradigms such as OpenMP [ope13]. In particular, the execution model of OpenMP is taken as a reference to analyze in detail, from a scheduling perspective, its analogies with and deviations from the parallel real-time scheduling strategy developed in this thesis, as well as from other approaches commonly used in the literature.
At job level. Different consecutive parts (or phases) of the same job can be executed on different system resources, in a pipelined way, allowing different tasks that compete for the same resources to run
simultaneously in the system (see Figure 1.3).
This model of parallelism challenges the traditional way of exploiting parallelism, because it introduces a new dimension to the scheduling problem, i.e., different resources in the system are co-scheduled at the same time. According to this model, different phases of different tasks can be executed simultaneously at any point in time.
The usefulness of this model stems from the fact that raw processing power is not the only constraint on system performance. In fact, the performance advantage of having multiple processing cores is limited if such cores share the same system bus and memory bandwidth. In principle, an application using two cores could even run faster on a single core if memory bandwidth or communication were the limiting factors. Therefore, in the last part of this thesis, a co-scheduling approach is proposed for tasks composed of two phases: a memory phase, where the task content is prefetched from main memory into local memory, and a computation phase, where the task executes on the processor without further accessing memory. The potential of this approach lies in the possibility of overlapping different phases of different tasks, which increases both efficiency and schedulability in single-core systems, and also represents a powerful mechanism to bound resource contention in multicore systems.
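The benefit of overlapping memory (M) and computation (C) phases can be sketched with a small makespan comparison; the two-resource model and the task parameters below are illustrative only:

```python
def makespan_sequential(tasks):
    # No overlap: each task performs its M-phase and then its C-phase
    # while holding the whole system exclusively.
    return sum(m + c for m, c in tasks)

def makespan_pipelined(tasks):
    # Two resources (memory engine, processor).  Task i's C-phase may
    # overlap task i+1's M-phase; each resource serves one phase at a time.
    mem_free = 0.0   # time at which the memory resource becomes free
    cpu_free = 0.0   # time at which the processor becomes free
    for m, c in tasks:
        mem_done = mem_free + m            # M-phase starts as soon as memory is free
        start_c = max(mem_done, cpu_free)  # C-phase needs its data AND the CPU
        cpu_free = start_c + c
        mem_free = mem_done
    return cpu_free

tasks = [(2, 3), (2, 3), (2, 3)]   # (M-phase, C-phase) lengths, made-up values
print(makespan_sequential(tasks))  # phases strictly serialized
print(makespan_pipelined(tasks))   # memory and processor work in parallel
```

With these parameters the pipelined schedule finishes at time 11 instead of 15, because each prefetch is hidden behind the previous task’s computation.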
[Figure 1.1: System parallelism.]
[Figure 1.2: Task parallelism.]
[Figure 1.3: Job parallelism — phases of Task 1 and Task 2 pipelined across Resource 1 (e.g., memory) and Resource 2 (e.g., processor).]
Chapter 2
Background and Related Work
This chapter discusses the state of the art on multiprocessor scheduling and summarizes the background concepts necessary to understand the theoretical results presented in the rest of the thesis. In particular, Section 2.1 presents some architectural considerations about multiprocessor platforms, highlighting the major sources of contention that complicate timing analysis and porting of existing applications. Then, Section 2.2 reviews the theory of multiprocessor real-time scheduling, starting from the classic assumption of sequential tasks and then focusing on the more recent results for parallel task systems. Section 2.3 reviews some existing works on the convergence between high-performance computing (HPC) and parallel real-time computing, taking OpenMP as a representative programming paradigm. Since Chapter 6 is dedicated to safety-critical avionics systems, Section 2.4 presents an overview of the IMA architecture and previous works aimed at improving schedulability in single-core integrated modular avionics systems. Finally, Section 2.5 introduces existing execution models based on prefetching techniques, which provide a mechanism to better characterize the interference due to memory contention and effectively reduce memory transfer latency.
2.1 Multiprocessor architectures
Unlike uniprocessor multitasking systems, multiprocessor systems support physical parallelism. They mainly consist of two or more processing units that share system memory through a communication bus. Whenever multiple processing units are put onto the same chip, the multiprocessor system is referred to as multicore. In this case, processing units deployed onto a single chip are called cores. Massive multicore systems consisting of a large number of cores are often called manycores. To achieve architectural scalability, such systems are often organized into clusters of cores, so that cores in a cluster share cache memory, and different clusters share system memory through a bus. For example, the Adapteva Epiphany-IV has 64 cores [Ada], and the Kalray MPPA [Kalb] has 16 clusters of 16 cores each, interconnected by a Network-on-Chip (NoC).
Based on the type of processors, multiprocessor systems can be classified into three categories:
• Identical multiprocessor: the system is composed of a number of processors with identical speed and processing capabilities;
• Uniform multiprocessor: the platform is composed of processors with distinct speeds, which determine the execution rate of tasks;
• Heterogeneous multiprocessor: the processors can execute different tasks at different speeds, hence the execution rate of each task depends on both the type of processor and the task itself.
2.1.1 The contention problem
One of the major obstacles to improving the performance of current computing systems is the growing divergence between processor speed and memory speed [HP11]. Figure 2.1 shows that since 1980, the memory-processor performance gap has been increasing steadily. In the decade 2000-2010, processor speed has been improving by 60% per year, whereas the access time to DRAM has been improving by barely 10% per year.
Figure 2.1: Memory-processor performance gap from 1980 to 2010 (from [HP11]). Performance values on the y-axis are in logarithmic scale.
For instance, a modern high-end processor such as the Intel Core i7 can generate two data memory references and fetch four instructions per clock cycle. With a 3.2 GHz clock rate, the i7 can generate a peak of 6.4 billion 64-bit data memory references per second, in addition to 3.2 billion 128-bit instruction references, which gives a total peak bandwidth of 102.4 GB/s. Such a value needs to be multiplied by the number of cores used, so four i7 processors could generate a peak bandwidth of 409.6 GB/s. However, the typical peak bandwidth of DDR3 memories is around 20 GB/s. Hence, the coexistence of multiple processing units on the same platform amplifies the already significant performance gap between processing and memory bandwidth.
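The arithmetic behind these figures can be checked directly (values taken from the text above):

```python
# Reworking the peak-bandwidth figures from the Intel Core i7 example.
clock_hz = 3.2e9                    # 3.2 GHz clock rate
data_refs_per_s = 2 * clock_hz      # two 64-bit data references per cycle
instr_refs_per_s = 1 * clock_hz     # four instructions fetched as one 128-bit reference

# 64 bits = 8 bytes, 128 bits = 16 bytes
peak_bytes_per_s = data_refs_per_s * 8 + instr_refs_per_s * 16
print(peak_bytes_per_s / 1e9)       # 102.4 GB/s per processor

four_cpus = 4 * peak_bytes_per_s
print(four_cpus / 1e9)              # 409.6 GB/s demanded by four processors

ddr3_peak = 20e9                    # ~20 GB/s typical DDR3 peak, as in the text
print(four_cpus / ddr3_peak)        # demand exceeds memory supply by ~20x
```

The last line makes the gap explicit: four such processors could demand roughly twenty times the bandwidth a typical DDR3 memory can supply.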
Computer architects have tried to bridge this gap by introducing multilevel memory hierarchies. A memory hierarchy is usually organized into layers, and each of them is slower, larger and cheaper (per byte) than the levels closer to the processor. This allows enhancing average-case performance at the cost of introducing unpredictability. In the presence of cache memories, for instance, the execution time of a task is a function of the number of cache misses, which in turn depends on the current state of the cache as well as its replacement policy.
In uniprocessor systems, cache contention can be analyzed by considering two types of cache conflicts: i) intra-task conflicts, arising when a task evicts its own cache lines, a phenomenon often called self-eviction; ii) inter-task conflicts, arising in a preemptive system when tasks running on the same core evict cache lines of each other. Cache-related preemption delay (CRPD) analyses have been proposed in the literature to cope with cache-related overheads, and are already integrated into existing WCET analysis tools [WEE+08].
In multiprocessor systems, however, the memory hierarchy is typically designed with at least one private and one shared cache level, as shown in Figure 2.2. This greatly complicates timing analysis because, in addition to intra-task and inter-task conflicts, the cache content also depends on
Chapter 2. Background and Related Work
Figure 2.2: Typical memory hierarchy in a multiprocessor architecture.
inter-core conflicts, arising when multiple cores share a cache and tasks executing in parallel on different cores evict each other's cache lines. Additionally, when global schedulers are used, it is difficult to evaluate the so-called cache-related migration delay (CRMD), which estimates the number of cache lines that can be reused from the shared cache when a task migrates to a different core [HP09].
An alternative to cache memory is scratchpad memory (SPM), a particular type of static RAM placed close to the processor (similarly to the L1 cache). Unlike cache memory, which is handled in hardware, the SPM is explicitly managed: its address space is mapped onto a given range of the processor's address space, and memory blocks are moved between main memory and the SPM in software. SPMs are increasingly replacing traditional data caches, since they allow significant energy, space and cost savings compared to caches [MB12]. Another important advantage is that they offer a much more predictable behavior, because access latencies do not depend on the memory access pattern [PP07, WP13]. Typically, modern architectures are equipped with Direct Memory Access (DMA) engines engineered to perform load/unload operations between main memory and the SPM with efficient back-to-back transfers, without stalling the processor.
Similarly to shared caches or SPMs, main memory is also shared, hence subject to contention. In fact, memory requests issued by the cores are mediated by an arbiter, and once a given core gains access to main memory, all other cores are delayed until their requests are served. This effect has a significant impact on task execution times, and it is difficult to bound and analyze for a number of reasons. First, the memory delay depends on the task's memory access pattern; second, the memory bandwidth may be regulated by unfair arbiters, or the arbitration policy may even be undocumented. Finally, most arbitration policies are not based on the notion of priority, leading to pessimistic assumptions when trying to bound the memory contention effect.
2.2 Multiprocessor real-time scheduling
The purpose of multiprocessor scheduling theory is to study how to schedule real-time workload on multiprocessor platforms. For an exhaustive overview of this research topic, the interested reader can refer to a recent book by Baruah et al. [BBB15]. In the following, we recall some basic notions that are commonly used to categorize multiprocessor scheduling algorithms: i) the task model; ii) the preemption and migration scheme; iii) the prioritization scheme. We will also briefly introduce some commonly used metrics to evaluate the performance of schedulability tests.
2.2.1 Task modeling
Any scheduling policy refers to a specific way of abstracting the computational load to be executed, i.e., the task model. The most commonly used model for hard real-time tasks is the classical Liu and Layland model [LL73], which represents a task τi by only two parameters: a WCET Ci and a period
Ti. According to this scheme, tasks generate instances of computation called jobs with the periodicity
specified by the corresponding value of Ti. This model has been later extended by specifying as a third
parameter the relative deadline Di, which represents the time separation between any job release and
its absolute deadline. This parameter, when specified, allows decoupling the task periodicity from its urgency. In fact, based on the relation between Di and Ti, a task set T can be classified as follows:
• Implicit deadline, if all task relative deadlines are equal to their periods, i.e., Di = Ti. This is
exactly the Liu and Layland model;
• Constrained deadline, if all task relative deadlines are smaller than or equal to their periods, i.e., Di≤ Ti;
• Arbitrary deadline, if no specific relation is assumed between Di and Ti.
Additionally, a task model is referred to as periodic if the parameter Ti indicates the exact separation between successive job releases, while it is denoted as sporadic if Ti represents a minimum inter-arrival time between subsequent jobs.
An important parameter of a task is its utilization Ui = Ci/Ti. The total utilization UT of a task set is defined as the sum of individual task utilizations, i.e., UT = Σ_{τi∈T} Ui.
A useful property related to the scheduling algorithm and/or the schedulability test is that of sustainability, introduced by Burns and Baruah [BB08]:
Definition 1 (Sustainability). A scheduling algorithm is sustainable with respect to a task model if and only if any schedulable task set remains schedulable by (i) decreasing execution times, (ii) increasing periods or inter-arrival times, and (iii) increasing deadlines. Similarly, a schedulability test is sustainable if any task set that is schedulable by the test remains schedulable when these changes are applied.
Depending on the possibility of allowing parallelism within one task, two classes of task models can be distinguished: sequential and parallel tasks. In Sections 2.2.5 and 2.2.6, an overview of related literature is presented for the two categories.
2.2.2 Preemption and migration scheme
Scheduling algorithms are usually classified based on their degree of preemption, as follows:
• Preemptive scheduling: the scheduler is allowed to suspend one job in favor of a more urgent job, and resume its execution later in the schedule;
• Non-preemptive scheduling: the scheduler is never allowed to suspend one job in favor of another one, hence, once started, every job executes until completion;
• Limited-preemptive scheduling: it is a hybrid approach between preemptive and non-preemptive scheduling that restricts preemption to take place at specific scheduling points. The reader can refer to a recent survey [BBY13] for a complete discussion on the possible approaches to limited-preemptive scheduling.
Multiprocessor scheduling algorithms also require the specification of their degree of migration, as follows:
• Global scheduling: a job is allowed to execute on any processor, with no restrictions;
• Partitioned scheduling: a job can only execute on a specific processor;
• Clustered scheduling: each task is assigned to a given set of processors, which defines a cluster, and a job is allowed to migrate within its cluster, i.e., global scheduling is applied at the cluster level.
The degree of migration also determines whether a given scheduling algorithm can be regarded as work-conserving or not.
Definition 2 (Work-conserving algorithm). A scheduling algorithm is work-conserving if and only if it never idles processors when there exists at least one task awaiting execution in the system.
By the above definition, global scheduling is generally work-conserving, while partitioned scheduling is not. From a practical point of view, the work-conserving property has the impact of reducing average response times, because processing time is never wasted. As a consequence, unnecessary task preemptions may be avoided under work-conserving strategies.
In this dissertation, we will mainly focus on preemptive scheduling strategies, considering global scheduling in Chapter 4, partitioned scheduling in Chapters 6 and 7, and both strategies in Chapter 5.
2.2.3 Prioritization scheme
Most scheduling schemes are based on the notion of priority, which allows the scheduler to select the jobs that should be executed at any point in time on the available processors. Based on the dynamicity of priorities, scheduling algorithms are classified as follows:
• Fixed task priority (FTP): each task is associated with a unique priority value, and all its jobs inherit its priority. The most common scheduling algorithms belonging to this category are Rate Monotonic (RM) and Deadline Monotonic (DM), where priorities are assigned based on periods and relative deadlines, respectively. A smaller period or relative deadline corresponds to a higher priority;
• Fixed job priority (FJP): jobs of the same task may have different priorities, but the priority of any job never changes once it has been assigned. In other words, priorities may vary across jobs of the same task, but are fixed for each individual job. The most popular example of this type is EDF, in which jobs with an earlier absolute deadline are given a higher priority;
• Dynamic priority (DP): job priority can dynamically change with no restrictions, both at task level and job level. An example of this type is the P-fair [BCPV96] algorithm.
This dissertation is mainly focused on the first two scheduling classes.
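The difference between the FTP and FJP classes can be illustrated on a toy example; the task and job parameters below are hypothetical:

```python
# Each task is (name, C, D, T); all values are illustrative.
tasks = [("t1", 1, 4, 5), ("t2", 2, 7, 7), ("t3", 1, 10, 10)]
deadline = {name: d for name, _, d, _ in tasks}

# FTP: Rate Monotonic sorts tasks by period, Deadline Monotonic by relative
# deadline; every job of a task inherits its task's priority.
rm_order = [name for name, _, _, _ in sorted(tasks, key=lambda x: x[3])]
dm_order = [name for name, _, _, _ in sorted(tasks, key=lambda x: x[2])]

# FJP: EDF picks, among the current jobs (task name, release time), the one
# with the earliest absolute deadline (release + D).
def edf_pick(jobs):
    return min(jobs, key=lambda job: job[1] + deadline[job[0]])

print(rm_order)                                      # ['t1', 't2', 't3']
print(edf_pick([("t1", 8), ("t2", 0), ("t3", 1)]))   # ('t2', 0): deadline 7
```

Note that under EDF the job of t2 wins despite t1 having the shortest period, because priority is decided per job, not per task.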
2.2.4 Performance metrics
The purpose of a schedulability test is to verify before runtime that all deadlines will be met under the considered scheduling algorithm. In the real-time field, it is essential to prove the correctness of any produced schedule; hence, an efficient scheduler is of little use without a tight associated schedulability test.
The utilization bound is one of the most common types of schedulability test.
Definition 3 (Utilization bound). The utilization bound for a scheduling algorithm A is the largest utilization value UA such that all task sets T with total utilization UT ≤ UA are always schedulable by A.
For implicit-deadline task sets, the utilization bound can be directly used as a metric to compare the effectiveness of different scheduling algorithms. For example, if UA1 ≥ UA2 for two scheduling algorithms A1 and A2, it can be concluded that algorithm A1 has a better verifiable performance than A2.
A different metric called speedup factor or resource augmentation bound [KP00] is often used as a theoretical method to compare the performance of a scheduling algorithm with that of an optimal algorithm.
Definition 4 (Speedup factor). A scheduling algorithm A provides a speedup factor (or resource augmentation bound) of b ≥ 1 if it can feasibly schedule on m processors of speed b all task sets that are feasibly scheduled by an optimal scheduler on m processors of unitary speed.
In other words, the speedup factor considers the increase in processing speed that would be required so that all task sets that are schedulable by an optimal algorithm on unit-speed processors become schedulable by A. The speedup factor is more often used as a metric to compare algorithms rather than a schedulability test because the optimal scheduler may be unknown (i.e., hypothetical).
A different metric that is often applied to parallel tasks and can also serve as a schedulability test is called capacity augmentation bound [LALG13]:
Definition 5 (Capacity augmentation bound). A scheduler provides a capacity augmentation bound of b if it can schedule any task set T satisfying the following two conditions:
• the total utilization UT is at most m/b;
• the length of the longest path of each task τi in T is at most Di/b.
A capacity augmentation bound of b also implies a resource augmentation bound of b, because the optimal scheduler cannot schedule any task set that does not satisfy Definition 5. However, differently from a resource augmentation bound, this metric directly provides a schedulability test, because one can easily check the two conditions stated above.
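Indeed, the two conditions of Definition 5 can be checked directly. The sketch below uses hypothetical task records (utilization, longest-path length, relative deadline) and, as a concrete value of b, the G-EDF capacity augmentation bound 4 − 2/m discussed later in this chapter:

```python
def cap_aug_schedulable(tasks, m, b):
    """Schedulability test from Definition 5 for a capacity augmentation
    bound b on m processors. tasks: list of (U_i, L_i, D_i) tuples with
    utilization, longest-path length, and relative deadline."""
    total_u = sum(u for u, _, _ in tasks)
    return total_u <= m / b and all(l <= d / b for _, l, d in tasks)

# Hypothetical task set on m = 4 processors, with b = 4 - 2/m for G-EDF:
m = 4
b = 4 - 2 / m                              # = 3.5
tasks = [(0.5, 1.0, 10.0), (0.4, 2.0, 8.0)]
print(cap_aug_schedulable(tasks, m, b))    # True: 0.9 <= 4/3.5, paths short enough
```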
Another useful metric to compare the relative performance of different algorithms is the so called acceptance ratio, which evaluates the number of randomly generated task sets that each algorithm deems schedulable, and corresponds to the ratio between the number of schedulable task sets and the total number of generated task sets. Specific random generation algorithms such as UUnifast [BB05] can be used to generate uniformly distributed utilization values.
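For reference, UUniFast admits a compact implementation; the sketch below follows the standard formulation from [BB05]:

```python
import random

def uunifast(n, u_total):
    """UUniFast [BB05]: draw n task utilizations uniformly distributed
    over the simplex of vectors summing to u_total."""
    utils = []
    sum_u = u_total
    for i in range(1, n):
        next_sum = sum_u * random.random() ** (1.0 / (n - i))
        utils.append(sum_u - next_sum)     # utilization of the i-th task
        sum_u = next_sum
    utils.append(sum_u)                    # last task takes the remainder
    return utils

random.seed(0)
u = uunifast(5, 2.0)
print(round(sum(u), 9))    # the utilizations always sum to the requested total
```

The generated vector can then be paired with random periods to obtain the synthetic task sets used when measuring acceptance ratios.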
These performance metrics can be applied equally to evaluate scheduling algorithms and schedulability tests.
2.2.5 Sequential tasks
This section covers a set of fundamental results about multiprocessor scheduling of sequential tasks that are particularly significant for our research. The reader can refer to [DB11b, BBB15] for a complete survey about the different scheduling techniques for sequential tasks.
Optimal scheduling algorithms, achieving a schedulable utilization equal to the number of processors, are only known for periodic task sets with implicit deadlines. The earliest example is P-fair [BCPV96], which divides the timeline into quanta of equal length and assigns each task a number of quanta proportional to its utilization. Many variants of P-fair have been proposed with the objective of reducing overheads, e.g., ERFair [AS00], PD [BGP95], PD2 [ABJ01], LLREF [CRJ06], and DP-FAIR [FLS+11]. Another optimal algorithm for multiprocessors is RUN [RLM+11], which stands for
for each priority level k, lowest first {
    for each unassigned task τ {
        if τ is deemed schedulable by S at priority level k,
           with all unassigned tasks assumed to have higher priority {
            assign τ to priority level k
            break
        }
    }
    if no task was assigned to level k, return unschedulable
}
return schedulable
Figure 2.3: Pseudocode of the OPA algorithm.
Reduction to UNiprocessor, as it reduces the multiprocessor scheduling problem to a series of uniprocessor problems scheduled by EDF. Finally, Quasi-Partitioned Scheduling (QPS) [MLR+14] has been proposed as another optimal algorithm that also supports sporadic tasks and dynamic adaptation to system load variations.
All such algorithms require dynamic priorities, and their practical use may be problematic due to the potentially high number of preemptions and migrations. For the general case of sporadic task sets, it has been proven that no optimal online (i.e., non-clairvoyant) preemptive scheduling algorithm can exist [Fis07]. In addition, it has been shown that, when analyzing the schedulability performance of global and partitioned approaches, neither class dominates the other: there are task sets that can be globally scheduled but become unschedulable using any partitioned strategy, and vice versa [LW82, Bar07b]. Therefore, basic results concerning both scheduling classes are recalled.
Partitioned scheduling
Research on partitioned multiprocessor scheduling benefits from the fundamental optimality results for uniprocessor scheduling.
Considering FTP preemptive scheduling, the following results are of great relevance:
• The worst-case response time of a task can be found considering a particular scenario, called “critical instant”, in which the task arrives simultaneously with all higher-priority tasks, and each job is released as soon as possible [LL73]. This situation is often referred to as synchronous periodic arrival sequence;
• Rate Monotonic (RM) priority assignment, which assigns higher priority to tasks with shorter period, is optimal for periodic or sporadic task sets with implicit deadlines [LL73];
• Deadline Monotonic (DM) priority assignment, which assigns higher priority to tasks with shorter relative deadline, is optimal for periodic or sporadic task sets with constrained deadlines [LW82];
• DM is not optimal for task sets with arbitrary deadlines [Leh90], but Audsley's Optimal Priority Assignment (OPA) algorithm [ABRW91, Aud01] is known to be optimal in this case.
Figure 2.3 shows the pseudocode of the OPA algorithm. Starting from the lowest priority level, the algorithm attempts to find for each level a task that is schedulable by the considered schedulability test S. If, for any priority level, no schedulable task can be found among the unassigned ones, then it can be concluded that no schedulable priority assignment exists.
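The loop of Figure 2.3 can be rendered in a few lines, with the schedulability test S passed in as a parameter. The toy test used in the example below is a deliberately crude sufficient condition (each interfering task is assumed to contribute a single job), introduced only to exercise the loop; it is not one of the OPA-compatible tests from the literature:

```python
def opa(tasks, schedulable):
    """Audsley's Optimal Priority Assignment (Figure 2.3).
    schedulable(task, higher) is the test S: True if `task` meets its
    deadline at the lowest priority when all tasks in `higher` are assumed
    to have higher priority. Returns tasks ordered from lowest to highest
    priority, or None if no feasible assignment exists."""
    unassigned = list(tasks)
    assignment = []                          # filled lowest priority first
    while unassigned:
        for task in unassigned:
            others = [t for t in unassigned if t is not task]
            if schedulable(task, others):
                assignment.append(task)
                unassigned.remove(task)
                break                        # move to the next priority level
        else:
            return None                      # no task fits this level
    return assignment

# Toy sufficient test on one processor: C + sum of higher-priority C's <= D,
# assuming each interfering task releases at most one job. (name, C, D):
tasks = [("a", 2, 4), ("b", 2, 9), ("c", 3, 7)]
s = lambda task, higher: task[1] + sum(h[1] for h in higher) <= task[2]
print([t[0] for t in opa(tasks, s)])   # ['b', 'c', 'a'], lowest to highest
```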
This priority assignment scheme is optimal in the sense that if there exists a priority assignment that is schedulable according to schedulability test S, then the OPA algorithm will find it. It has also been
proven efficient, since it requires running the test S at most n(n+1)/2 times, where n is the cardinality of the task set.

Figure 2.4: Example showing Dhall's effect.
Considering FJP preemptive scheduling, EDF, which at any time instant schedules the task that is closest to its absolute deadline, is optimal for periodic or sporadic task sets with arbitrary deadlines [Der74].
Although schedulability analysis for partitioned systems is facilitated by the possible reduction to uniprocessor scheduling, the task allocation problem is analogous to the bin-packing problem, which is known to be NP-hard in the strong sense [GJ79]. Therefore, several bin-packing heuristics such as First-Fit (FF), Next-Fit (NF), Best-Fit (BF), Worst-Fit (WF) may be conveniently used to reduce the complexity of task partitioning [DL78].
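As an illustration, the First-Fit heuristic can be sketched in a few lines; the per-core utilization bound of 1 assumes that EDF is used on each processor:

```python
def first_fit(tasks, m, bound=1.0):
    """First-Fit partitioning heuristic: place each task on the first
    processor whose accumulated utilization stays within `bound`
    (1.0 matches the EDF uniprocessor bound). tasks: list of
    utilizations; returns the per-core assignment, or None on failure."""
    bins = [[] for _ in range(m)]
    load = [0.0] * m
    for u in tasks:
        for p in range(m):
            if load[p] + u <= bound:
                bins[p].append(u)
                load[p] += u
                break
        else:
            return None          # the task fits on no processor
    return bins

print(first_fit([0.7, 0.5, 0.2, 0.4], 2))   # [[0.7, 0.2], [0.5, 0.4]]
```

Next-Fit, Best-Fit and Worst-Fit differ only in how the candidate processor is selected (last used, most loaded that still fits, least loaded, respectively).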
Global scheduling
The fundamental feature that differentiates global scheduling from uniprocessor/partitioned scheduling is that "no finite collection of worst-case job arrival sequences has been identified for the global scheduling of sporadic task systems" [Bar07a]. In addition to the absence of a critical instant scenario, it is also widely recognized that classical optimal scheduling algorithms for uniprocessors perform very poorly when transposed to a globally-scheduled multiprocessor scenario. This is mainly due to a subtle effect, called Dhall's effect [DL78], that arises when a number of light tasks (i.e., with small execution requirements) coexist with a heavy task (i.e., with large execution requirements). As evident from Figure 2.4, a task set with one heavy task and many light tasks, with an overall utilization arbitrarily close to 1, might be unschedulable using either EDF or RM. Intuitively, this is the main reason why the uniprocessor optimality of RM and EDF does not extend to multiprocessors, as correctly predicted by Liu in 1969 [Liu69].
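The construction behind Figure 2.4 can be reproduced numerically. The sketch below uses hypothetical parameters (m light tasks with Ci = 2ε and Ti = Di = 1, plus one heavy task with Ci = 1 and Ti = Di = 1 + ε, all released synchronously at t = 0):

```python
# Numerical illustration of Dhall's effect (hypothetical parameters).
m, eps = 4, 0.01

total_u = m * (2 * eps) + 1 / (1 + eps)
print(round(total_u, 3))     # ~1.07, far below the platform capacity m = 4

# Under G-EDF (or RM) the m light jobs have earlier deadlines, so they
# occupy all m processors until t = 2*eps; the heavy job must then fit
# 1 unit of work into the remaining (1 + eps) - 2*eps time units.
slack = ((1 + eps) - 2 * eps) - 1
print(slack < 0)             # True: the heavy task misses its deadline
```

Shrinking ε drives the total utilization toward 1 while the deadline miss persists, which is why no utilization bound above 1 can hold for G-EDF or G-RM without further restrictions.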
Overall, this weak scheduling framework has led to the development of numerous feasibility tests and utilization bounds that are only sufficient or only necessary. Among those, a sufficient utilization bound for global EDF (G-EDF) has been presented for task sets with implicit deadlines by Goossens et al. [GFB03].
The main drawback of the above tests is that they depend on the maximum execution requirements of tasks, and cannot guarantee schedulability of task sets with at least one heavy task. This is a consequence of Dhall's effect mentioned above. Other utilization bounds independent of the maximum execution requirement of tasks have been developed in [SB02]. The test in [GFB03] has been later generalized in [BFB09] for task sets with implicit deadlines.
Figure 2.5: Worst-case scenario to maximize the workload of an interfering task τi in the sequential case for any work-conserving scheduler.
Similar bounds have also been derived for task sets scheduled with fixed priorities. Baker [Bak06] also presented a general methodology to determine the schedulability of sporadic task sets, which represents a seminal result that had a major influence on the subsequent research (see e.g., [BCL05, Bar07b, BC07, BV08]). The analysis technique developed in [Bak06] is based on the notion of problem window. More specifically, schedulability is checked for one task at a time. Initially, the problem window of a task τk is selected equal to the interval between the
release time and the deadline of one job of τk. Then, the interference of higher-priority jobs is taken
into account in the schedulability analysis. The interference contribution of each interfering task τi in
the problem window is divided between carry-in, body, and carry-out jobs. The carry-in job is the first instance of τi that is part of the problem window and has release time before and deadline within the
problem window. The carry-out job is the last instance of τi executing in the problem window, having
a deadline after the problem window. All other instances of τi are named body jobs.
Baker’s technique has been later used by Bertogna and Cirinei [BC07] to develop an iterative Response Time Analysis (RTA) that applies to any work-conserving algorithm, including G-EDF and global Fixed Priority (G-FP). In [Bar07b], Baruah improved the technique by Baker under G-EDF to limit the number of carry-in tasks to m− 1. Guan et al. [GSYY09] extended this reasoning to the RTA technique, refining the schedulability test for G-FP. It has been shown that the test in [GSYY09] analytically dominates [BC07]. Davis and Burns [DB11a] proposed a schedulability test for G-FP that is more pessimistic than [GSYY09] but is compatible with the OPA algorithm, showing that an optimal priority assignment scheme is able to cope with a weaker schedulability test.
Since the work in [BC07] is particularly relevant for our research on global scheduling described in Chapter 4, a detailed overview of the schedulability test proposed therein is provided next. The considered work assumes sporadic tasks with constrained deadlines.
Response time analysis for globally-scheduled multiprocessors. To analyze the schedulability of a given task τk, the approach adopted in [BC07] defines the workload of a task τi in a time interval of length L as the amount of computation requested by jobs of τi within the interval. Also, the interference suffered by τk from task τi is defined as the cumulative length of the intervals during which jobs of τk have been released but cannot execute due to the execution of τi.
Since the interference contributions are difficult to compute exactly, the approach in [BC07] proposes to upper-bound the interference of τi on τk with its workload. In particular, the authors show that an
upper-bound on the workload of an interfering task τi within a window of length L occurs when the
first job of τi starts executing as late as possible (with a starting time aligned with the beginning of the
problem window) and later jobs are executed as soon as possible (see Figure 2.5).
In this configuration, the workload of τi within a window of length L can be upper-bounded by:
Wi(L) = Ni(L) · Ci + min(Ci, L + Ri − Ci − Ni(L) · Ti),   (2.1)

where Ni(L) = ⌊(L + Ri − Ci)/Ti⌋ represents the maximum number of jobs of τi fully contributing with their Ci inside the problem window.

Figure 2.6: Worst-case scenario to maximize the workload of an interfering task τi in the sequential case under G-EDF.
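The workload bound can be sketched directly from these definitions; the parameter values in the example are hypothetical (Ri denotes the response-time bound of the interfering task):

```python
import math

def workload_bound(L, C, T, R):
    """Upper bound on the workload of an interfering task within a window
    of length L, valid for any work-conserving scheduler (Eq. (2.1), [BC07]).
    C, T, R: WCET, period, and response-time bound of the interfering task."""
    n_full = math.floor((L + R - C) / T)      # jobs contributing a full C
    carry = min(C, L + R - C - n_full * T)    # residual carry-out contribution
    return n_full * C + carry

# Hypothetical interfering task: C = 2, T = 10, R = 4, window L = 25.
print(workload_bound(25, 2, 10, 4))   # 6: two full jobs plus a full carry-out
```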
Note that the bound in Equation (2.1) applies to any work-conserving algorithm. For the particular case of G-EDF, another upper-bound on the interfering contribution of each task is obtained by noting that the absolute deadline of the interfering jobs cannot be later than that of the interfered task. Hence, the interference suffered by τk from τi in a problem window of length L is maximized considering a
scenario in which (i) the absolute deadline of the carry-out job is aligned with the end of the problem window; (ii) jobs of τi are released as late as possible; (iii) the carry-in job executes as close as possible
to its worst-case response time (see Figure 2.6). This configuration leads to the following interference upper-bound:
Ii,k(L) = ⌊L/Ti⌋ · Ci + min(Ci, max(0, (L mod Ti) − Di + Ri)).   (2.2)
Finally, the schedulability of a task system can then be checked by a classic iterative response time analysis method:
Theorem 1 (From [BC07]). An upper bound on the response time of τk can be derived by fixed-point iteration over R^ub_k of the following expression, starting with R^ub_k = Ck:

R^ub_k ← Ck + ⌊ (1/m) Σ_{i≠k} X^ALG_i(R^ub_k) ⌋,

where, with G-FP:

X^ALG_i = X^FP_i = min(Wi(R^ub_k), R^ub_k − Ck + 1) for all i < k, and X^FP_i = 0 otherwise;

with G-EDF:

X^ALG_i = X^EDF_i = min(Wi(R^ub_k), Ii,k(R^ub_k), R^ub_k − Ck + 1);

and X^ALG_i = min(Wi(R^ub_k), R^ub_k − Ck + 1) for any work-conserving scheduler.

The term R^ub_k − Ck + 1 accounts for the fact that if the task under analysis is schedulable, then, for any interfering task, the part of its execution in parallel with τk does not count as interference [BCL05].
In summary, the iterative procedure to carry out the schedulability test works as follows:
i. For each task, Ri is initialized to Di;
ii. The response-time bound of each task is updated according to Theorem 1; if the resulting bound exceeds the task relative deadline, that task is marked as "potentially unschedulable". The task set is deemed schedulable if there are no potentially unschedulable tasks;
iii. Go back to step ii. until no response time bound is updated.
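The overall procedure can be sketched as follows for the generic work-conserving case. This is a simplified illustration with hypothetical task parameters, not the exact test of [BC07]: it only uses the workload bound Wi and the parallel-execution term, and returns None as soon as a bound exceeds a deadline:

```python
import math

def rta(tasks, m, max_iter=1000):
    """Fixed-point response-time bounds in the spirit of Theorem 1 [BC07],
    for any work-conserving scheduler on m processors.
    tasks: list of (C, D, T); returns a list of bounds, or None if some
    task is potentially unschedulable."""
    R = [d for _, d, _ in tasks]              # step i: initialize Ri to Di

    def W(i, L):                              # workload bound, Eq. (2.1)
        C, _, T = tasks[i]
        n = math.floor((L + R[i] - C) / T)
        return n * C + min(C, L + R[i] - C - n * T)

    for _ in range(max_iter):                 # step iii: repeat until stable
        updated = False
        for k, (C, D, _) in enumerate(tasks): # step ii: refine each bound
            r = C
            while True:
                interference = sum(min(W(i, r), r - C + 1)
                                   for i in range(len(tasks)) if i != k)
                new_r = C + math.floor(interference / m)
                if new_r > D:
                    return None               # potentially unschedulable
                if new_r == r:
                    break
                r = new_r
            if r < R[k]:
                R[k] = r
                updated = True
        if not updated:
            return R
    return R

print(rta([(1, 4, 5), (1, 5, 6), (2, 8, 8)], m=2))
```

Refined bounds for the interfering tasks feed back into Wi, which is why iterating the outer loop can only tighten the results.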
2.2.6 Parallel tasks
Some computation-intensive applications, such as synthetic vision, object tracking, and video processing, cannot be executed on single-core processors with guaranteed response time bounds. Therefore, they need to be modeled as parallel applications.
Following this need, several task models have been proposed in the real-time literature to represent concurrency and dependencies between different parts of the same task. The reader can refer to [SY15] for a recent survey on the topic.
Uniprocessor systems
In the context of uniprocessor scheduling, the multiframe (MF) model was proposed by Mok and Chen [MC97]. The MF model generalizes the Liu and Layland model by expressing a task as an infinite succession of frames with a minimum inter-arrival time between subsequent frames (Figure 2.7). This model has been generalized by Baruah et al. [BCGM99] under the name of generalized multiframe (GMF) model, allowing frame deadlines to differ from their minimum inter-arrival times.
The recurring branching (RB) task model [Bar98] further generalizes the GMF model by considering alternative flows of control represented by edges (Figure 2.8). Other models descending from the GMF and RB models have been developed such as the digraph model [SEGY11], which removes the restriction of recurrence by allowing arbitrary directed graphs (even with cycles) to represent the possible release structures of jobs (Figure 2.9).
Feasibility and schedulability analyses have been developed for such models considering a preemptive uniprocessor platform.
Figure 2.7: MF/GMF models.
Figure 2.8: RB model.
Figure 2.9: Digraph model.
All of the models described above are meant to describe all the different release structures and alternative flows of control of sequential programs executed on a uniprocessor platform. Hence, despite their great expressiveness, none of them is able to properly characterize the concurrent execution of parallel programs in a multicore system.
Multiprocessor systems
A first approach to parallel multicore scheduling is Gang scheduling [Fei96, KI09, GB10], which models parallel threads that are forced to execute simultaneously on multiple processors due to concurrent communication between them. Therefore, a parallel task is represented by its WCET and the number of processors it executes on. In [KI09], a preemptive Gang EDF scheduling algorithm was presented. Later, an exact schedulability condition for FTP Gang scheduling was provided in [GB10].
A different parallel execution paradigm is multi-threaded scheduling, in which a parallel task consists of precedence-constrained units of execution (i.e., sub-tasks) that are scheduled on multiple processors. According to this paradigm, which is the focus of our research, it is up to the scheduler whether sub-tasks execute sequentially or in parallel. This scheme is at the basis of parallel programming libraries such as OpenMP.
Fork-join model. One of the first proposals to model this kind of parallelism is the fork-join (FJ) task model (Figure 2.10), introduced by Lakshmanan, Kato and Rajkumar [LKR10]. In this model, a task is represented as an alternating sequence of sequential and parallel segments, and every parallel segment has the same degree of parallelism, which is constrained to be less than or equal to the number of available processors. This model essentially represents a master thread that forks a set of children; once the children have terminated their execution, they join again into the master thread. In [LKR10], the authors investigated the partitioned scheduling of periodic FJ task sets, providing a stretching algorithm that executes tasks sequentially unless strictly necessary, as this is shown to improve schedulability. A more efficient stretching approach has been proposed in [FMQ11], with the purpose of reducing the preemptions and migrations implied by the algorithm. A response time analysis for FJ tasks with arbitrary deadlines has been presented in [AQN+13], assuming partitioned FP scheduling.
Synchronous parallel model. A natural extension of the FJ model is the synchronous parallel (SP) task model (Figure 2.11), allowing consecutive parallel segments and an arbitrary degree of parallelism of every segment. Synchronization is still enforced at the boundary of each segment, in the sense that a sub-task in the new segment may start only after all the sub-tasks in the previous segment have completed. In [SALG11], the scheduling problem of implicit-deadline SP tasks is addressed using a decomposition strategy, that is, each parallel task is transformed into a set of independent sequential tasks by assigning to each segment an intermediate offset and deadline. By decomposing the parallel task into a set of sequential tasks, existing strategies for scheduling sequential tasks on multiprocessors can be reused. The authors also proved resource augmentation bounds of 4 and 5 for the decomposition algorithm under preemptive G-EDF and partitioned Deadline Monotonic (P-DM), respectively. Still relying on a decomposition strategy, an approach to optimize the number of processors for SP task sets is described in [NBGM12].
Andersson and de Niz [Ad12] developed a schedulability analysis for sporadic constrained-deadline SP tasks under G-EDF that does not rely on any decomposition technique. The notions of interference and problem window have been adapted by Chwa et al. [CLP+13] to SP tasks, expressing the relation
between a given task and the interference from higher-priority jobs. Such definitions allowed the authors to identify a worst-case scenario that gives an upper bound to the worst-case response time under G-EDF. Maia et al. [MBNP14] improved over [CLP+13] by providing tighter schedulability conditions and
extending the analysis to G-FP scheduling.
Directed acyclic graph model. A more flexible model of concurrency than the FJ and SP is the Directed Acyclic Graph (DAG) model (Figure 2.12), where a task is represented by a directed acyclic graph in which nodes represent sequential sub-tasks and arcs represent precedence constraints between them. Baruah et al. [BBMS+12] introduced the scheduling problem for DAG tasks by showing that the synchronous periodic arrival is not a critical instant leading to the worst-case behavior. The authors considered a single arbitrary-deadline DAG task, for which they showed a resource augmentation bound of at most 2 under G-EDF, and presented sufficient polynomial and pseudo-polynomial schedulability conditions with a resource augmentation bound of 3 − 1/m. Later, the works in [BMSSW13, LALG13] studied the scheduling problem for multiple DAG tasks, concurrently showing a resource augmentation
Chapter 2. Background and Related Work
bound of 2− 1
m for G-EDF. Moreover, Li et al [LALG13] defined the capacity augmentation bound as a
new metric, similar to a utilization bound, to compare parallel scheduling algorithms. In the same work, later extended in [LLF+15], the authors showed a capacity augmentation bound of 4− 2
m for G-EDF,
which directly serves as a schedulability test. In particular, by Definition 5, a task set can be scheduled by G-EDF on a platform with m processors if (i) the total utilization is at most m
4−2 m
; (ii) the length of the longest path of each task is not greater than 4−12
m of its relative deadline.
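The capacity-augmentation bound above translates directly into a two-condition test. A minimal sketch follows, assuming implicit deadlines and a task representation as (total work C, critical-path length L, deadline D) triples; the function name is illustrative.

```python
def cap_aug_gedf_test(tasks, m):
    """Sufficient G-EDF schedulability test via the capacity
    augmentation bound b = 4 - 2/m for DAG tasks:
    (i) total utilization must not exceed m / b, and
    (ii) each task's critical-path length must not exceed D / b.

    tasks: list of (C, L, D) triples, implicit deadlines (T = D).
    """
    b = 4 - 2 / m
    # Condition (i): bounded total utilization.
    if sum(c / d for c, _, d in tasks) > m / b:
        return False
    # Condition (ii): bounded critical-path length per task.
    return all(l <= d / b for _, l, d in tasks)
```

For m = 4 the bound is b = 3.5, so a task with deadline 7 is accepted only if its critical path is at most 2 time units long.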
The authors of [BMSSW13] also proved a resource augmentation bound of 3 − 1/m for global Deadline Monotonic (G-DM), providing schedulability conditions for both G-EDF and G-DM scheduling. In [Bar14], Baruah proposed an improved schedulability test for G-EDF that analytically dominates the one in [BMSSW13]. Another schedulability test that takes into account the structure of the DAG was proposed in [QFGM13], while a decomposition strategy with a resource augmentation bound of 4 has been proposed in [SFL+14].
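All of the tests above depend on each task's longest (critical) path through the DAG, which can be computed in time linear in the size of the graph. A minimal sketch, with an adjacency-list representation and function names that are assumptions of this example:

```python
def critical_path(wcet, succ, src):
    """Length of the longest path from `src` to any sink of a DAG task,
    where path length is the sum of the WCETs of the traversed nodes.

    wcet: dict mapping node -> worst-case execution time.
    succ: dict mapping node -> list of successor nodes (sinks omitted).
    """
    memo = {}

    def rec(v):
        if v not in memo:
            # A sink contributes only its own WCET (empty max -> 0).
            memo[v] = wcet[v] + max((rec(w) for w in succ.get(v, ())),
                                    default=0)
        return memo[v]

    return rec(src)
```

For example, in a diamond-shaped DAG a → {b, c} → d with WCETs 1, 2, 3, 1, the critical path is a–c–d with length 5.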
A different approach to schedule parallel tasks is the federated scheduling scheme [LCA+14], which is a generalization of partitioned scheduling to parallel tasks. Federated scheduling assigns a dedicated set of cores to any parallel task with utilization greater than or equal to one, and then applies any greedy work-conserving scheduler to schedule the parallel task on the assigned cores. The remaining light tasks are executed sequentially on the remaining cores. In [LCA+14], the following capacity augmentation bounds are proved: 2 for federated scheduling, (3 + √5)/2 for G-EDF and 2 + √3 for global Rate Monotonic (G-RM).
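The core-allocation step of federated scheduling can be sketched as follows, assuming implicit deadlines and tasks given as (work C, critical-path length L, deadline D) triples; the heavy-task core count n = ⌈(C − L)/(D − L)⌉ follows [LCA+14], while the function name and return layout are assumptions of this example.

```python
import math

def federated_assign(tasks, m):
    """Federated scheduling core allocation on m cores.

    Heavy tasks (utilization >= 1) each receive
    n = ceil((C - L) / (D - L)) dedicated cores; light tasks are
    left to run sequentially on the remaining cores.
    Returns (heavy assignments, light tasks, cores left over),
    or None if the heavy tasks alone need more than m cores.
    """
    heavy, light = [], []
    used = 0
    for c, l, d in tasks:
        if c / d >= 1:                      # heavy task
            n = math.ceil((c - l) / (d - l))
            heavy.append(((c, l, d), n))
            used += n
        else:                               # light task, runs sequentially
            light.append((c, l, d))
    if used > m:
        return None  # infeasible under federated scheduling
    return heavy, light, m - used
```

A task with C = 10, L = 2 and D = 5 (utilization 2) thus receives ⌈8/3⌉ = 3 dedicated cores, on which any greedy work-conserving scheduler meets its deadline.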
Finally, a response time analysis approach for sporadic DAGs scheduled by partitioned Fixed Priority (P-FP) scheduling has been proposed in [FNN16].
Figure 2.10: Fork-join model.
Figure 2.11: Synchronous parallel model.
Figure 2.12: DAG model.
Conditional parallel task (cp-task) models. The first attempts at enriching a parallel task model with control-flow information were proposed in the context of uniprocessor systems to provide a more accurate characterization of the worst-case behavior of a task [Bar98, CET01, AEFL08]. In a multicore setting, Fonseca et al. [FNRP15] proposed the multi-DAG model, which represents a parallel task as a collection of DAGs, each representing a different execution flow. The authors proposed a method to combine such flows into a single synchronous parallel task that preserves the execution requirements and the precedence constraints of all the execution flows that can possibly occur at runtime, thus reducing the schedulability problem to a simpler problem for synchronous parallel tasks. A disadvantage of this approach is that it is not scalable with respect to the number of sub-tasks, since the number of different flows through a DAG can be exponential in the number of nodes. Moreover, it adds pessimism in the task transformation process and requires server-based synchronization mechanisms that may be difficult to implement.
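The exponential blow-up in the number of flows is easy to see: a chain of k diamond-shaped conditional constructs already admits 2^k distinct execution flows. The sketch below counts source-to-sink paths with a memoized traversal; all names and the adjacency-list encoding are illustrative.

```python
def count_paths(succ, src, sink):
    """Number of distinct src-to-sink paths in a DAG.

    succ: dict mapping node -> list of successors (sink omitted).
    Memoization keeps the count linear in the graph size, even though
    the number of paths themselves can be exponential.
    """
    memo = {}

    def rec(v):
        if v == sink:
            return 1
        if v not in memo:
            memo[v] = sum(rec(w) for w in succ.get(v, ()))
        return memo[v]

    return rec(src)

def diamond_chain(k):
    """Build a chain of k diamonds: s_i branches to a_i and b_i,
    both rejoining at s_{i+1}; models k nested two-way conditionals."""
    succ = {}
    for i in range(k):
        succ[("s", i)] = [("a", i), ("b", i)]
        succ[("a", i)] = [("s", i + 1)]
        succ[("b", i)] = [("s", i + 1)]
    return succ
```

With only 10 diamonds (31 nodes), the multi-DAG transformation would already have to account for 1024 distinct flows, which is why per-flow enumeration does not scale.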
The accounting of control flows is also common in the field of static program analysis. When detailed control flow information is available, accurate approaches exist that explicitly detect and discard infeasible execution paths, based on the boolean properties tested inside the conditionals. However, infeasible path detection is a difficult problem that in general requires high-complexity machinery, such as the solution