Modelling Simultaneous Multithreading Contribution to Performances and Power Consumption

Academic year: 2021

Condividi "Modelling Simultaneous Multithreading Contribution to Performances and Power Consumption"

Copied!
90
0
0

Testo completo

(1)

Consumption

Federico Umani

Supervisors:

Daniele De Sensi, Marco Danelutto


Abstract: Energy-awareness, self-adaptation and Simultaneous Multithreading (SMT) can coexist to address many of the issues of modern processing units. The increasing computational power comes along with stricter requirements on energy and thermal properties. CPU manufacturers provide more and more sophisticated ways to adapt a machine's computational power to the real application needs in order to save energy. In this thesis we present a new online learning strategy for self-adaptive and power-aware computing, able to predict and control the power consumption and execution time of another application without having any information on the application before it starts. Unlike many other similar algorithms, which only include in their model the clock frequency and the number of cores, ours also explicitly includes the number of SMT contexts. In particular we propose two different models that integrate SMT differently: one, the NxC strategy, simply uses the number of virtual cores, defined as the product of the SMT contexts times the number of physical cores; the other, the SMT strategy, adds the SMT contribution separately.

Both strategies are validated offline and online. In the offline validation we collect the execution time and the power consumption of all the possible configurations of number of cores, frequency and SMT contexts, then we use a small subset of these samples to train the model, testing its accuracy by measuring the error on the remaining samples. We repeat this several times to have a statistically sound evaluation. We find that the average error of the NxC strategy is 2.90 times bigger than the average error of the SMT strategy. The online validation is done by implementing the two strategies in Nornir, a framework for self-adaptive computing. With Nornir it is possible to drive a real application execution by enforcing requirements over a number of different metrics. We use our strategies to enforce requirements on power consumption and execution time for three real applications from the PARSEC Benchmark. We find that the SMT strategy is able to find a solution on average 31.46% closer to the optimal one with respect to the NxC strategy, and 53.78% closer with respect to a strategy that does not make any use of SMT.

Keywords: power-aware computing, parallel computing, simultaneous multithreading, linear regression, self-adaptive computing, online prediction algorithm.


2 Background and Related Work
  2.1 Background
    2.1.1 Energy and Power
    2.1.2 Simultaneous Multithreading
    2.1.3 Self-Adaptation
    2.1.4 Linear Regression
    2.1.5 Nornir
    2.1.6 PARSEC
  2.2 Related Work
3 Models
  3.1 Execution Time Prediction
    3.1.1 NxC Strategy
    3.1.2 SMT Strategy
  3.2 Power Consumption Prediction
    3.2.1 NxC Strategy
    3.2.2 SMT Strategy
4 Results
  4.1 Testing Environment
  4.2 Cross-Validation Testing
5 Nornir Integration
  5.1 Implementation
  5.2 Experiments
6 Conclusions and Future Work
Appendix A


Figure 1.1: Estimation of electricity use in global data centers.

In Figure 1.1 we can see an estimation, according to [29], of the global electricity used by data centers in the next few years. The best case assumes an annual Energy Intensity (EI) improvement of 22.5%, while the expected improvement is 15%. Even in the best case, the expected increase in power consumption is going to overwhelm the energy-efficiency improvements. The energy-demanding trend is not limited to industry, but also concerns consumer devices such as smartphones, smart TVs, wireless networks for faster internet connections and, in general, all Information and Communications Technology (ICT) systems. Figure 1.2 shows an overview of expected and best-case electricity usage for all ICT systems as predicted by [29]. As we can see, while for many other ICT systems the number of terawatt-hours is more or less stable or, at most, slightly growing, data center electricity usage represents the largest portion of the total and is growing exponentially [30]. Since our work could see great usage in data centers, we are going to address one of the most critical problems of ICT.

Figure 1.2: Estimation of energy globally consumed by ICT systems.

The problems caused by energy inefficiency are economical, environmental and heat-related.

Economical: The economical reason behind the necessity of power-aware products is straightforward: energy doesn't come for free. In recent years the U.S. Department of Energy has included data centers in the Better Buildings initiative, introducing a challenge where all the accepting


CO2 emissions) caused by the power plants. The so-called carbon footprint of a system is the total of the indirect greenhouse gas emissions caused by the system. According to a Natural Resources Defense Council (NRDC) projection [35], in 2020 US data centers will require the equivalent annual output of 50 power plants, for a carbon footprint of nearly 100 million metric tons of carbon pollution per year.

Another aspect of the environmental impact of power consumption is the water needed for cooling and energy production. The water footprint of data centers is estimated to be between 1047 and 151,061 m³/TJ [36]. Outbound data center traffic generates a water footprint of 1–205 liters per gigabyte, roughly equal to the water footprint of 1 kg of tomatoes at the higher end [36].

Heat: Part of the energy supplied to a machine is dissipated as heat, and ICT systems are no exception. With the increasing power density in modern processors, managing on-chip temperature can become a major issue. Over the years researchers have proposed many temperature-aware solutions for different architectures [14, 16, 17, 46], because temperature rise causes shortened life, malfunction and failure of CPUs, besides forcing water usage for cooling.

Nevertheless, according to [38], the average utilization of many systems is usually in the range 10%-50%. This opens many possibilities for energy saving by increasing the average utilization of these systems.

Both hardware and software solutions have been found to reduce electricity usage. Software developers try to minimize energy-expensive operations such as memory accesses, for example by increasing locality with cache-friendly programming [32, 33]. Since electrical energy is used to power the system resources, many of the hardware solutions involve using fewer resources or using them for a shorter period of time. Forcing an application to avoid using unnecessary resources can reduce energy consumption. CPU manufacturers provide architectural mechanisms to control physical resources such as cores and clock frequency and to adapt them to the application requirements [31]. The goal is to deactivate all the system resources that are not being used, such as cores or RAM modules, as made possible by many manufacturers. The best way to do this online, i.e. while the application is executing, is by monitoring how many resources the application needs step by step. The application user could define some performance constraint, for example a maximum tolerable execution time, and a software layer could activate only the resources needed to fulfill the requirement.

There are a number of different scenarios in which it may be useful to define a constraint on performance or power consumption, especially when talking about data centers.

Performance Requirements: The user may not need to run the application at maximum performance. For example, let's suppose we have a backup application that every night copies the data elaborated by the company during the day to a colocated data center. The application has an 8-hour time span before the new working day begins, but is probably able to finish much earlier at maximum performance. In this specific situation, however, there is no particular advantage in finishing far ahead of the deadline. On the contrary, the user could profit from incentives on power saving from the data center operator [47, 48]. That said, the user still needs to finish the operation within 8 hours, hence the need for a minimum required performance level [7, 9, 10, 11].

In the case of streaming applications the number of elements to process can vary a lot in a 24-hour time span. A web server could receive more requests during the day and fewer during the night. Being able to match performance and power consumption with the workload is a key feature to enforce Quality of Service (QoS) requirements while containing the costs. Let's take as an example Amazon Autoscaling, which lets the user dynamically scale the number of Virtual Machine (VM) instances used by his application. Since the billing plan is based on the number of allocated instances, by using


outages caused by unpredictable power spikes (i.e. when multiple servers reach their power peak at the same time), with the aggravating circumstance that the consequent workload redistribution to other data centers could produce a catastrophic cascading effect. This could be avoided by defining a maximum tolerable level of power consumption.

The most used hardware knobs to reduce power consumption are the number of active cores and their clock frequency (see section 2.2). We want to also consider a third hardware resource that is often ignored, but which is now implemented in many systems: Simultaneous Multithreading (SMT), or hardware multithreading [13]. This technique allows the processor to keep two or more threads active at the same time without having to alternate among them. It is implemented by duplicating some hardware components, namely those responsible for fetching and decoding the instructions. This is not the same as having two separate cores, since in SMT the execution engine is not duplicated. An SMT processor capable of keeping two active threads is said to have two contexts or to have a 2-way SMT. SMT is present in many modern processors since it provides performance improvements. Figure 1.3 shows how in recent years the number of SMT contexts per core on microprocessors has gradually increased. To take full advantage of modern architectures it is now necessary to understand the SMT contribution to performances and power consumption. Intel®'s implementation of SMT (called Hyper Threading) is present on many of their latest processors [43], combined with other technologies. IBM® also introduced SMT, first on POWER processors with POWER5 [45] and then in z System mainframes with z13, recording throughput improvements in their testing. In Table 1.1 we report some data taken from [44] used to motivate the addition of SMT to the z13 mainframe.

Figure 1.3: A 12-year trend showing the evolution of SMT in the Intel® Xeon, IBM® POWER and AMD® Opteron processor families.

Since its diffusion, Simultaneous Multithreading has been tested in a variety of scenarios to achieve different goals in terms of performance, power consumption or heat control. SMT has been studied especially in comparison to Chip Multiprocessor architectures. Chip multiprocessing (CMP) is the concept of placing multiple processor cores on a single chip. The two techniques have been compared with respect to performance, as well as power consumption [15, 17], thermal properties [14] and both of them [16]. In modern processors both techniques are usually present simultaneously.

SMT, however, is still not used today like CMP to control the performance and power consumption of an application. To the best of our knowledge, an accurate prediction model for controlling performance and power consumption that includes SMT contexts, besides the number of cores and the clock frequency, is still missing. Since SMT contexts are often not considered by prediction algorithms [2, 7, 23], considering any additional context as an additional "virtual core" could seem a feasible solution. In this way, a 2-way SMT processor with one core and a processor with two cores without SMT support are treated the same way by the prediction models. The necessity of a model that explicitly considers the number of SMT contexts is evident from the results shown in section 4.2. When we try to include the SMT contribution by simply multiplying the number of physical cores by the number of contexts per core, the results are unsatisfactory.

In this thesis we propose a model for online prediction of execution time and power consumption that integrates SMT in a different way. We want to build a model that can work online, without having any information on the application when it starts its execution. The model is built to work dynamically by using linear regression, a machine learning method simple enough to be used online. We'll implement our strategy in Nornir, a framework for self-adaptive and power-aware computing, and we'll validate it on three real applications from the PARSEC benchmark [26], controlling their execution by imposing requirements on power consumption and execution time.

When we use our new strategy to choose the best configuration enforcing those requirements, it is able to find a solution that, on average, is 31.46% closer to the optimal one with respect to the strategy that simply considers the virtual cores, and 53.78% closer with respect to the strategy that does not make any use of SMT. Furthermore, while those two strategies fail to find a configuration 28.33% and 31.67% of the times respectively, our strategy fails only 15.56% of the times. In chapter 2 we introduce some theoretical concepts related to our work and discuss some related work; in chapter 3 we describe the theory behind our model and give its mathematical formulation; in chapter 4 we report the error measured for our model and its variants; in chapter 5 we focus on its implementation and finally, in chapter 6, we conclude the work discussing some possible future improvements.


Multithreading, linear regression and Nornir itself. We then illustrate some other works that have one or more aspects in common with ours.

2.1 Background

The goal of this section is to provide enough background knowledge to fully understand the work done in building this model and integrating it in the Nornir framework. To accomplish this goal, this section covers six different topics: after an overview of the basics of computational energy and power consumption, we'll explain what the Simultaneous Multithreading technology is and how it works; we'll overview the main advantages of self-adaptation in respecting Quality of Service requirements and then proceed by giving a general description of the Nornir framework for self-adaptive and power-aware computing. After explaining what linear regression is and how it is useful for our work, we finally introduce the PARSEC benchmark suite, whose applications have been used to validate our algorithm.


2.1.1 Energy and Power

Since being able to predict and control the power consumption of an application is part of our work, here we introduce some background notions on energy and power in computation. An application that is efficient in terms of execution time but happens to have low energy efficiency may produce too much heat or may be too expensive to sustain. Multiple servers reaching their power peak at the same time in an over-subscribed data center could cause a power outage and probably also a service interruption. Moreover, a portion of the energy consumed is dissipated as heat. Excessive heat can be detrimental to the environment and to hardware durability.

We'll focus on situations where the target application is the only one running. The power consumed by the machine is measured in Watts (W) and includes the power consumed by the application itself and the power required by the hardware and by the underlying operating system. The machine would indeed need some power even if it were in an idle state. We'll refer to this idle power as the static power, because it is not dynamic, i.e. it does not depend on how the application uses the resources.

Power depends on various factors, mainly related to voltage domains. A voltage domain is a portion of the architecture which is powered independently from the other voltage domains. All the cores in the same voltage domain share the same power supply. The power consumed by a single voltage domain is usually defined as follows [18, 21]:

P(n, f, v) = P_static + P_dynamic = v·I_leak + α·c_a·v²·f·N(n)    (2.1)

where n is the number of active cores in the machine and f is the clock frequency of the voltage domain. I_leak, c_a and α are respectively the leakage current, the load capacitance and the activity factor. N(n) is the number of active cores within the voltage domain and depends on the total number of active cores in the machine. v, the voltage, can be obtained from N(n) and f once these two quantities are known. We'll also add to P(n, f, v) as defined above a quantity which varies with the number of SMT contexts c. We'll try two different ways of doing this, in sections 3.2.1 and 3.2.2.
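As an illustration, the following sketch evaluates Equation 2.1 for a single voltage domain; the structure and all coefficient values are hypothetical, since in practice I_leak, c_a and α would be obtained by fitting measurements (see section 2.1.4):

#include <iostream>

// Hypothetical coefficients of Equation 2.1 for one voltage domain.
struct PowerModel {
    double iLeak;  // leakage current I_leak
    double ca;     // load capacitance c_a
    double alpha;  // activity factor
};

// P = v*I_leak + alpha*c_a*v^2*f*N(n): static plus dynamic power.
double domainPower(const PowerModel& m, double v, double f, int activeCores) {
    double pStatic  = v * m.iLeak;
    double pDynamic = m.alpha * m.ca * v * v * f * activeCores;
    return pStatic + pDynamic;
}

int main() {
    PowerModel m{0.5, 1e-9, 0.8};                      // made-up values
    std::cout << domainPower(m, 1.1, 2.0e9, 4) << " W\n";
}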

As a general rule, the more resources are used during the computation, the more power is needed (Figure 2.1). This is clearly in contrast with performance needs, as a higher number of resources (together with a good distributed or parallel implementation) can lead to a faster program. The solution to this problem is not obvious, since a good trade-off must be found. Whether to push the application performance or keep it in low energy ranges largely depends on the application domain and on the user's requirements. In real-time computing, for example, the time constraint is so crucial that performance constraints must be prioritized; on the contrary, when dealing with different problems, such as the limited battery problem in a mobile ad-hoc network, energy consumption must be prioritized to the point that some services (even security services) cannot be afforded.

Figure 2.1: The power curve for the Bodytrack application from the PARSEC Benchmark (see chapter 4 for further details). We can see that increasing the number of cores, the number of SMT contexts or the clock frequency leads to more power consumption.

Some general considerations, however, can still be made. By maximizing the parallelism degree and the clock frequency more power is required because of the higher number of resources used, but probably for a shorter period of time, since the computation will end earlier thanks to the improved performance. Moreover, as already anticipated, unused cores still need some power even while idle, simply to keep themselves turned on. So if the goal is to make the application terminate having consumed the least possible amount of energy, maximizing the resource usage to accelerate its termination may be, counter-intuitively, a less energy consuming approach. While there exists no easy solution to this problem, it's clear that execution time plays an important role even when talking about energy. The energy, expressed in Joules (J), can be defined as a function of power consumption and execution time.

Figure 2.2: Example of two applications doing the same thing, where one is less power consuming but slower (A), the other is more power consuming but faster (B). Application B consumes less energy overall.

E = E_0 + P·(t − t_0)    (2.2)

where E_0 is the energy consumption at the instant t_0. Figure 2.2 shows an example where the energy-performance trade-off is not intuitive. Applications A and B perform the same calculation, but application B uses all the available resources to terminate faster and concludes its execution after 4 seconds. Even if application B is more power consuming, its execution takes 8 Joules, overall less than application A. If application B terminated after 6 seconds instead of 4, then it would consume 12 Joules, more than application A. There are situations where it is useful to consider the energy E as a parameter, for example when limited battery is a problem, but the power consumption P is the quantity we want to work on. This is because power peaks can be a major issue in data centers and because energy can be obtained from power and execution time once we're able to predict them.
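In discrete form, Equation 2.2 simply accumulates power times elapsed time over successive measurement intervals. A minimal sketch, with made-up samples reproducing the 2 W × 4 s = 8 J case of application B:

#include <vector>
#include <iostream>

// E = E0 + P*(t - t0), applied over successive measurement intervals.
double totalEnergy(const std::vector<double>& powerWatts, double dtSeconds) {
    double e = 0.0;                 // E0 = 0 at the start of the run
    for (double p : powerWatts)
        e += p * dtSeconds;         // add P*(t - t0) for each interval
    return e;
}

int main() {
    // 2 W sustained over 4 one-second steps -> 8 J, as application B in Figure 2.2.
    std::vector<double> samples{2.0, 2.0, 2.0, 2.0};
    std::cout << totalEnergy(samples, 1.0) << " J\n";
}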

Using all the available resources may be risky and is not always the best solution. A good trade-off must be found by some kind of strategy. An application user may want to define a constraint, for example a maximum power consumption, and try to maximize performance without violating that constraint. Vice versa, the user could instead define a maximum for the execution time.


Figure 2.3: Difference between Temporal Multithreading (a) and Simultaneous Multithreading (b). The front end of the Hyper Threaded CPU can perform some instruction-level parallelism between the red thread and the yellow thread. The unused slots are in white.

2.1.2 Simultaneous Multithreading

Multithreading is today a largely used and well-known technique that was originally proposed as a way to increase throughput by hiding long latencies [42]. Simultaneous Multithreading (SMT), as opposed to other multithreading techniques such as time-division multithreading, was proposed more recently to increase the utilization of out-of-order superscalar processors [13].

This study focuses on the effects of SMT on power consumption and performance. Simultaneous Multithreading is a technology better known by the name of Intel®'s implementation: Hyper Threading. Simultaneous Multithreading is a form of multithreading which exploits hardware support. Multithreading, broadly speaking, can normally be implemented by temporal division, that is, by alternating which thread occupies the core at any given time. Since in this way multiple threads cannot be executed at the same time in the same core, core registers must be updated when the thread changes, by performing an onerous context switch. Simultaneous Multithreading is implemented at hardware level by duplicating some core components, typically the registers storing the architectural state, but sharing the main execution engine among the threads. This allows the core to host more than one execution flow at the same time.

In the diagram in Figure 2.3 we have two systems, one performing time-sliced multithreading and one performing Simultaneous Multithreading (or Hyper Threading). In both cases the execution core is running two threads in parallel, identified by different colours, using seven functional unit pipelines, where each colored square is an instruction. The front end contains the components needed to fetch and decode the instructions. The CPU front end can issue three instructions per clock cycle; the difference is that while in case (a) all three instructions must come from the same thread, in case (b) the instructions can belong to two or more threads (two if it's a two-way Simultaneous Multithreading, and so on). If the temporally-threaded front end cannot find three instructions to issue in parallel to the execution unit in a given cycle, the unused slots (the white boxes in the picture) are wasted. The SMT front end can instead issue some instructions from the other thread.

So instead of having two full cores, with Simultaneous Multithreading we have only one core which can execute more than one thread simultaneously; the number of threads that a Hyper Threaded core can handle is determined by the manufacturer and advertised as the number of contexts or virtual cores. When the application does not suffer from high contention on shared memory by its multiple threads, Simultaneous Multithreading brings an improvement in performance and, even if it is not the same as having multiple cores, it's possible to use both solutions at the same time, by having more than one core with more than one context each. Concerning power consumption, Simultaneous Multithreading involves the usage of a greater number of hardware resources, and this is costly. Predicting and controlling this extra cost is one of the purposes of this study.
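For instance, the total number of virtual cores visible to the operating system (physical cores times SMT contexts) can be queried portably in C++:

#include <iostream>
#include <thread>

int main() {
    // Number of hardware threads (virtual cores): physical cores times
    // SMT contexts, e.g. 8 on a 4-core machine with 2-way SMT.
    std::cout << std::thread::hardware_concurrency() << " virtual cores\n";
}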

2.1.3 Self-Adaptation

An increasingly important requirement for software-intensive systems is the ability to self-manage, adapting at runtime to handle things such as resource variability, changing user needs, and system intrusions or faults [1]. Such a system must continually configure, reconfigure, tune and optimize itself.


computing node and on the application. Such data is aggregated and then passed to the analyze phase. If the measured data reveals that Quality of Service requirements are violated, the plan phase is triggered in order to decide a set of actions to be taken to achieve the goal. Such actions can be very simple (e.g. scale down the clock frequency of the CPU) or could be part of a more complex custom strategy. Eventually, the execute phase applies the reconfiguration plan by using appropriate actuators (the knobs). Knobs may be present on both the computing node and the application. The application can pass through various phases, and each control step could be in a different phase, since the MAPE loop uses results from previous iterations to predict the metrics of future ones.
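The shape of one such control cycle can be sketched as follows; all the types and function bodies here are illustrative stubs, not Nornir's actual API:

#include <chrono>
#include <thread>

// Illustrative stubs: in a real system these would read hardware sensors,
// check user requirements and act on hardware knobs.
struct Sample { double throughput; double watts; };
struct Config { int cores; double frequencyGHz; };

Sample monitor()                { return {100.0, 60.0}; }
bool   analyze(const Sample& s) { return s.watts > 50.0; }  // requirement violated?
Config plan(const Sample&)      { return {4, 1.2}; }        // decide new knob values
void   execute(const Config&)   { /* apply the knobs */ }

int main() {
    for (int step = 0; step < 10; ++step) {                 // control steps
        Sample s = monitor();                               // Monitor
        if (analyze(s))                                     // Analyze
            execute(plan(s));                               // Plan + Execute
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}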

The topic of self-adaptive and self-managing systems has been studied in a variety of application domains, such as user-oriented networks, cyber-physical systems and parallel computing.

Networks: User-oriented networked systems simultaneously exploit a variety of wired and wireless communications, such as those related to the internet of things and domotics. Every different connection may have different Quality of Service requirements. Furthermore, without dynamic monitoring and self-adaptation, it is difficult to guarantee such requirements under unpredictable changes in the workload and in the user needs. That's why self-adaptation is often considered for user-oriented network applications [39, 40].

Cyber-Physical Systems: Self-adaptive algorithms change their parameters, or even their strategy, according to environmental changes. This is a key feature in cyber-physical systems, where sensors and actuators are already fundamental components of the system. Adaptability, realized through feedback loops, is a key requirement to deal with uncertain operating conditions in CPS. Adaptation in CPS is a cross-layer concern, where solutions combine different adaptation mechanisms within and across layers [41].

Figure 2.4: Typical control cycle of a self-adaptive system.

Parallel Computing: In parallel computing, self-adaptation finds its spot in dynamically allocating the needed hardware resources to save energy. Li and Martínez [7] propose a heuristic to dynamically optimize the power consumption of a parallel application executing on a many-core. In a previous work from Danelutto et al. [24] a self-adaptive behaviour was added to the dataflow programming model, a widely used and effective solution to implement parallel programming frameworks. Nornir [25], the framework for power-aware parallel computing used in this thesis, and a number of other frameworks for self-adaptive computing [8, 11, 12] are entirely based on some sort of MAPE self-adaptation cycle.

The importance of an online prediction model for self-adaptive frameworks lies in allowing them to choose the best configuration for their actuators by predicting their impact on the environment. Different forms of machine learning can be used for this task, from simple linear regression to more complex neural networks, according to the performance required by the framework itself. In this work an online prediction model is implemented in the Nornir self-adaptive framework and linear regression is chosen as the regression algorithm.

2.1.4 Linear Regression

Nornir users can specify power constraints or ask Nornir to minimize power consumption while keeping other constraints. Since Nornir is self-adaptive, it must be able to dynamically predict power consumption for all the knob configurations and then adjust any parameter it needs to respect the constraints. One of the strategies that Nornir can use to do this involves online learning. Using a machine learning algorithm while the application is executing [23], Nornir can learn its behaviour by storing some application-specific constants, then use them during the prediction step. Figure 2.5 from [23] shows in more detail how the algorithm works. The algorithm has a calibration phase and a steady phase. During the calibration phase several configurations are visited. Visiting a configuration means setting one or more knob values, such as the frequency or the number of cores, and measuring the resulting power or performance. The model is refined with further configurations until the error estimate is under the tolerance threshold. The model is then ready to be used to predict the power and performance of all the other configurations. The steady phase starts by selecting the best configuration and lasts until the application changes its behaviour, forcing a recalibration.

The training and the predictions are done in practice by building a model for a Multiple Linear Regression [51]. In multiple linear regression we model the relationship between two or more independent variables (called predictors or regressors) and a dependent variable (called response) by fitting a linear equation to observed data (Figure 2.6). Among all the possible functions of the family expressed by the generic linear equation, the one with the minimum distance from the observed points is chosen. The equation is generic because it contains a number of coefficients that cannot be known a priori, so to choose one specific function we simply have to set the various coefficients to specific values, possibly the best ones.

Figure 2.6: Example of fitting. In the picture the generic equation is y = ax + b, where y is the response, x is the predictor and a and b are the unknown coefficients. In this specific case the fitted values are a = 42.84 and b = 12.

In our case, for example, the predictors are the number of cores, the frequency and so on, while the responses are the power consumed in each of these configurations (or the execution time, if we're working on performance). There are a number of possible ways to bind the responses to the regressors, depending on some unknown coefficients, such as I_leak, α and c_a in Equation 2.1. The regression is done to find those coefficients and to select the best fitting function. More precisely, when talking about linear regression, the generic equation must be a linear equation.

Let's suppose we have n observations of the responses y_1, y_2, ..., y_n. Let x_i = (x_{i,1}, x_{i,2}, ..., x_{i,m}) denote the m predictors for observation i. Then we have:

y_i = β_0 + β_1·x_{i,1} + ··· + β_m·x_{i,m} + ε_i    (2.3)

where β_j is a regression coefficient and ε_i is a term representing a random error due to measurement limits or fluctuations in the measured value. The regression coefficients β_j must be determined in order to enable the correct prediction of the responses for unobserved predictors, while training the model on a set of observed values. To fit the model, we use the least squares method, which finds the coefficients by minimizing the sum of the squares of the residuals (the differences between the real responses and the values predicted by the model).

Figure 2.7: Nornir's self-adaptive behaviour. This particular cycle is an instance of the more general MAPE cycle (Figure 2.4).
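For the single-predictor case of Figure 2.6 (y = ax + b), the least squares solution has a closed form. A minimal sketch, with made-up observations chosen to land near the figure's fitted values:

#include <vector>
#include <cstdio>

// Ordinary least squares for y = a*x + b: minimizes the sum of squared residuals.
void fitLine(const std::vector<double>& x, const std::vector<double>& y,
             double& a, double& b) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
    b = (sy - a * sx) / n;                          // intercept
}

int main() {
    std::vector<double> x = {1, 2, 3, 4};
    std::vector<double> y = {55, 98, 141, 183};     // made-up observations
    double a, b;
    fitLine(x, y, a, b);
    std::printf("a = %.2f, b = %.2f\n", a, b);      // about a = 42.70, b = 12.50
}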

The online learning algorithm and the linear regression can be used with many different models, each one using different knobs to predict some metric. We’ll build a model that will use the number of cores, the clock frequency and the number of SMT contexts to predict power and execution time.

2.1.5 Nornir

While there exist a number of ways to evaluate the accuracy of a linear regression model, such as evaluating the error on a test set of observations, it's possible to do it more concretely by using the model to predict and control the behaviour of real applications. However, interfacing an application with the lower levels is not a simple task, also because of the hardware-specific and non-transparent mechanisms that must be employed. Measuring power consumption, for example, largely depends on the means provided by the CPU vendor, and the interface with the underlying system must necessarily consider the hardware on which it is running. Obtaining an evaluation of performance while the application is still running is another difficult task. Implementing an online strategy is not simple under the aforementioned circumstances.

Figure 2.8: Nornir framework general architecture.

Nornir [25] makes this task simpler by providing an interface that hides the low-level interaction and performs application monitoring in a variety of different ways, from the less intrusive ones for blackbox applications to more accurate ones. Nornir is a customizable, self-adaptive C++-based framework which makes use of different strategies to drive an application execution and provides the interface between the model and the applications. Nornir can be used to enforce performance and power consumption constraints on parallel applications running on shared memory multicores. The framework can be customized by algorithm designers to implement new self-adaptive policies. Self-adaptation enables the dynamic selection of the resources to allocate to the application in order to meet performance and power consumption requirements. By dynamically selecting the appropriate number of resources it is possible, for example, to use at each time step the minimum amount of resources needed to process the incoming data. The framework alters some knobs, such as the number of virtual cores or their clock frequency, to find a configuration that is able to respect the constraint (Figure 2.7).

Figure 2.8 depicts Nornir's general architecture. In the top layer there are four different types of applications with a gradually decreasing access level: from those built within Nornir to those available only as an executable. The interface between Nornir and the underlying system is provided by Mammut, an object-oriented C++ library for easy interaction with a variety of different systems. Mammut provides portable and transparent power consumption monitoring and knob management. Sensors and knobs are the lowest level of the Nornir architecture. Nornir couples the application with a Manager, which drives the execution in order to fulfill the requirements expressed by the user. At each sampling interval (the control step), the Manager performs an iteration of the so-called Monitor-Analyze-Plan-Execute (MAPE) loop.

Figure 2.9: Flowchart showing the possible choices for the application programmer to interface the Nornir manager to an application.

As already stated, the Nornir manager can be interfaced to an application in a number of different ways, as depicted in the top layer of the Nornir architecture in Figure 2.8. Figure 2.9 is a more detailed flowchart describing all the possible choices. The simplest solution is to use Nornir on an existing application without any programmer intervention, when there is no need or no possibility to modify the code. This is the blackbox case. The Nornir manager will run in a separate process and will not interact directly with the application. Since it's not possible to obtain detailed metrics such as throughput or latency, performance requirements must be expressed in terms of Instructions Per Second. This approach should be used only if no other alternative is viable. If the code can be modified and recompiled, a more accurate approach is the one provided by instrumentation. With a few calls inserted in the code at the beginning and at the end of the iteration, all the available metrics can be measured and used for the analyze and plan steps. These instrumentation calls are programmed to be as lightweight as possible to avoid altering the measured values. Some frameworks could provide access to additional knobs; if such a framework is supported by Nornir, that's the case of the Framework Interaction. The last and most complete approach is using Nornir's Programming Interface when the application must be built from scratch. Since this work is mainly about a new strategy for the plan step of the MAPE loop, its results could be used with any of the above approaches with the exception of the blackbox case, since it doesn't provide metrics detailed enough to apply our regression algorithms. Let's take the following streaming application as an example of an instrumented application.

1 using nornir::Instrumenter;
2 Instrumenter r("parameters.xml");
3 StreamElement* s;
4 while(s = receive()){
5     r.begin();
6     process(s);
7     r.end();
8 }
9 r.terminate();

The Instrumenter defined at line 2 reads an XML file containing, among other settings, the user requirements. Here is an example of what such an XML file could look like.

1 <?xml version="1.0" encoding="UTF-8"?>
2 <nornirParameters>
3     <requirements>
4         <throughput>MAX</throughput>
5         <latency>30</latency>
6         <powerConsumption>50</powerConsumption>
7     </requirements>
8 </nornirParameters>

The Instrumenter opens a connection towards Nornir's Manager and sends it the XML file. The processing of each iteration is wrapped between the begin() and end() calls (lines 5 and 7). These calls store timestamps that are used to derive performance metrics. The begin() call also checks if there are pending requests from the Manager, which once per control step needs the data monitored by the Instrumenter.

Figure 2.10 summarizes the interaction between the Manager and the Instrumenter. The Manager's control step starts with iteration α and ends with iteration β. When the control step begins, the Manager makes a request for the monitored data. K_begin(i) and K_end(i) are the timestamps associated with the begin() and end() calls. If the number of iterations performed by the application during the control step is defined as β − α, then Nornir derives the throughput as follows:

R = (β − α) / (K_begin(β) − K_begin(α))    (2.4)

i.e. the number of iterations performed by the application between two successive monitoring requests, divided by the time elapsed between such requests. The latency is instead derived as follows:

L(i) = K_end(i) − K_begin(i)    (2.5)

Figure 2.10: Relationship between the Manager and the Instrumenter.

The Manager can then use those metrics to plan a set of actions through the Selector class. A new subclass of the Selector class can be defined, containing a custom strategy that ends up setting the knobs according to the data received by the Instrumenter. In the following example a new Selector, the SelectorDummy, is defined:

1 class SelectorDummy: public Selector{
2     ...
3     KnobsValues getNextKnobsValues(){
4
5         KnobsValues k(KNOB_VALUE_RELATIVE);
6         if(latency < threshold){ // hypothetical check: the condition tests the measured latency
7             k[KNOB_VIRTUAL_CORES] = 25;
8             k[KNOB_FREQUENCY] = 50;
9         }else{
10            k[KNOB_VIRTUAL_CORES] = 80;
11            k[KNOB_FREQUENCY] = 100;
12        }
13        return k;
14    }
15 };

The getNextKnobsValues() function, defined at line 3, must return an array of knob values chosen according to some criterion. The criterion of SelectorDummy is defined at line 6: if the latency is below the threshold then fewer resources are made available, otherwise better performance requires additional cores and speed. The function finally returns the chosen knob values (line 13).

A programmer could write a custom Selector or could use a predefined one. Some of the predefined strategies are based on heuristics [7, 22], others are based on complex offline learning algorithms [6] and others are based on online learning [23]. The SelectorLearner uses the online learning algorithm described in section 2.1.4, making use of various Predictors implementing different models. This work implements a particular Predictor of the SelectorLearner, the PredictorSMT.

2.1.6 PARSEC

In our work we design, implement and validate a linear regression model to make Nornir's online learner able to predict the performance and the power consumption of applications when changing the number of cores, the frequency and the SMT contexts. To test the accuracy of our model in a realistic scenario, we need to test it on real applications. The PARSEC Benchmark Suite [26, 27] provides us this possibility.

PARSEC stands for Princeton Application Repository for Shared-Memory Computers and is a benchmark suite composed of multithreaded applications from a variety of different application domains, including desktop and server applications. Every PARSEC application is parallel, but the parallelization model varies from one application to another. We've selected 3 out of the 13 applications proposed in the PARSEC benchmark.


Bodytrack: Bodytrack is a computer vision application that tracks a human body through an image sequence acquired by multiple cameras, analyzing the foreground and the edges of the images in a sequence. The program extracts the image features from a frame set in multiple progressive steps. Each task is assigned to a thread pool from the main thread and the load is balanced dynamically. Bodytrack is a medium-grained data-parallel application with high data sharing.

Facesim: Facesim is an application from the animation domain that computes a visually realistic animation of the model of a human face by simulating the underlying physics. The parallelization uses a static partitioning of the mesh and performs computationally intensive tasks. Facesim is a data-parallel coarse-grained application with a very large working set.

2.2 Related Work

Self-adaptive systems are able to alter their behavior without human intervention, in order to fulfill some Quality of Service requirement under varying surrounding conditions. Altering the behavior usually implies changing the configuration of an application accordingly. Self-adaptive strategies have been proposed to satisfy constraints on performance [2, 3, 6, 7], power consumption [5] or both of them [4]. Lohrmann et al. [2] propose a reactive data-flow model to enforce latency guarantees with thread packing and concurrency throttling management. A similar approach is proposed by Gedik et al. [3] to reach high throughput while minimizing resource usage. In a work from Li and Martínez [7] a heuristic is used to achieve performance requirements while minimizing power consumption, by dynamically adjusting the number of active cores and their clock frequency. Gandhi et al. [5] aim instead to keep power consumption under a threshold by using forced idleness. Their algorithm alternates two extreme states: the max performance state, where all the resources in terms of cores and frequency are used, and the idle state, where the computation is forcibly stopped. In a work from De Matteis et al. [4] a method based on Model Predictive Control (MPC, [52]), in opposition to reactive methods, has been proposed to enforce latency, throughput and power constraints on data stream applications. In MPC the strategies take into account the behavior of the system along a limited future time horizon to choose the reconfigurations to execute.

All the strategies listed above usually manage to enforce such requirements even in presence of workload fluctuations or external interferences. However, in many cases, these techniques are only implemented for specific application domains and don't focus on the impact of Simultaneous Multithreading. A number of frameworks implementing one or more steps of the MAPE loop have been proposed [8, 9, 10, 11, 12], but Nornir [25] is the only one both publicly available and fully customizable. Concerning Simultaneous Multithreading (SMT [13]), this technique has often been compared to Chip Multiprocessing architectures in order to study its advantages in terms of energy efficiency, performance and thermal properties [14, 15, 16]. However, a structured prediction model on a self-adaptive computing framework is still missing. Mishra et al. [6] propose a probabilistic graphical model-based learning system (LEO) that is compared to other online and offline solutions. The algorithm acts on the number of cores, on the frequency and on the memory controller to minimize the power consumption while enforcing some performance requirements. The number of SMT contexts is also included as an additional possible actuator available to the algorithm. The study, however, is not explicitly focused on showing the model behaviour when varying the number of contexts. Furthermore, this solution needs a precalculated dataset of applications already executed. This has two disadvantages: the first is that this operation is time consuming, the second is that these learning techniques tend to have a bias towards the applications used for the training [25].

This work analyzes the impact of Simultaneous Multithreading on power consumption and performance for different applications from the PARSEC benchmark [26, 27] and tries to define an online model able to predict those metrics, even when varying the number of contexts and without any previous information.


physical and virtual cores (see section 2.1.2 for details on Simultaneous Multithreading). Another model, called the SMT strategy, will instead consider the SMT contexts separately from the number of cores. Hence this chapter has two sections, one dedicated to execution time and the other to power consumption, each in turn divided in two parts, one for each strategy. Our purpose is to show that the SMT strategy is overall preferable to the NxC strategy.

3.1 Execution Time Prediction

The first metric we want to predict is the execution time. To achieve this goal we need to build a model which integrates both the number of cores and the frequency of such cores. When talking about cores it's important to differentiate between physical cores and virtual cores (Figure 3.1). For each physical core there are one or more virtual cores, depending on the number of contexts per core provided by the hardware SMT technology (see section 2.1.2). It's possible, for example, to have two virtual cores but only one physical core in a 2-way SMT.

#VirtualCores = #PhysicalCores × #ContextsPerCore    (3.1)


Figure 3.1: Difference between physical cores, virtual cores and Voltage Domains.

The NxC model doesn't distinguish between virtual and physical cores. In other words, having two virtual cores on one physical core is simply considered as having two physical cores (Figure 3.2). Showing that this is not a good solution is a motivation for further development.

Let's suppose we have a machine supporting SMT. We define n as the number of physical cores and c as the number of contexts per core provided by the machine's hardware. We also denote with f the frequency at which the cores run. Building a model means searching for a function that can predict the execution time and that depends on the knobs n, f and c:

T(n, f, c) = function(n, f, c)    (3.2)

In the case of execution time, we extend the Universal Scalability Law (USL) to include f and c.

3.1.1 NxC Strategy

We first try to only consider virtual cores, as if having an additional context were the same as having an additional core (Figure 3.2). This is a rather simplistic way of modelling SMT. As already stated, the chosen function is a variation of the basic Universal Scalability Law (USL). The USL maps the increment in the number of cores (temporarily denoted with p, before we make further considerations on the SMT integration) to the actual gain in performance (the scalability S(p)), taking into account the limits of parallel computations, such as the contention on shared resources and the overhead due to coherency problems.

Figure 3.2: When the model ignores SMT, the situation (a) is considered as fully equivalent to the situation (b).

In its basic form, the USL is expressed as follows:

S(p) = p / (γ + σ·(p − 1) + χ·p·(p − 1))    (3.3)

In the above formula γ is the concurrency (or "ideal parallelism"), a parameter measuring the ideal gain in performance when increasing p. If, for a given application, all the other parameters were zero and γ were one, that application would simply scale linearly with the number of cores. That's obviously an ideal situation.

S(p) = p,  if γ = 1, σ = 0 and χ = 0    (3.4)

The other two parameters are precisely used to make the model more effective in realistic scenarios. The first, σ, models the portion of the computation that cannot be parallelized (the "serial fraction"). An application suffering from high contention will have a σ parameter closer to one.

The other parameter, χ, describes the overhead due to coherency problems, since parallelism entails frequent communications among the cores (it is also called "cross-talk"). When χ is closer to one, having a higher number of cores could even worsen the performance (Figure 3.3).

Figure 3.3: Plot showing the relationship between scalability and parallelism degree. The ideal case is when both σ and χ are zero, while the Amdahl curve refers to the particular case where only χ is zero. When different from zero, the coefficients in this plot are set to γ = 1, σ = 0.1, χ = 0.1.

The well-known Amdahl's Law is a particular case of the Universal Scalability Law where χ = 0 and γ = 1.

S(p) = p / (1 + σ·(p − 1)),  if γ = 1 and χ = 0    (3.5)

The formula can be rewritten by taking into account that the scalability can be defined as the ratio between the execution time needed by the application when using only 1 core and the time needed by the application when using p cores. In the ideal situation, doubling the parallelism degree would halve the execution time, so S(2) would be equal to 2.

S(p) = T(1) / T(p)    (3.6)
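The behaviour of Equation 3.3 is easy to explore numerically. The sketch below evaluates the USL with the same coefficients used in Figure 3.3 (γ = 1, σ = 0.1, χ = 0.1):

#include <cstdio>

// Universal Scalability Law, Equation 3.3.
double usl(double p, double gamma, double sigma, double chi) {
    return p / (gamma + sigma * (p - 1.0) + chi * p * (p - 1.0));
}

int main() {
    for (int p : {1, 2, 4, 8, 16})
        std::printf("S(%2d) = %.2f\n", p, usl(p, 1.0, 0.1, 0.1));
    // Scalability grows sublinearly and eventually degrades once the
    // cross-talk term chi*p*(p-1) dominates, as in Figure 3.3.
}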


The Universal Scalability Law by itself isn't sufficient, as it doesn't include the frequency f. Before integrating the frequency into the USL, we choose a function that maps the performance gain to the increment in clock frequency. If f_min is the minimum possible clock frequency, then the execution time must decrease proportionally to the ratio between f and f_min. The equation is then a generic hyperbola which basically scales T(n, f_min, c) down when increasing the frequency:

T(n, f, c) = α·T(n, f_min, c)·(f_min / f) + β    (3.9)

T(n, f_min, c) is the execution time T when n × c virtual cores are used and all of them are running at the minimum possible speed. We can then update the USL to include this information:

T(n, f_min, c) = [γ·(1/(n·c)) + σ·((n·c − 1)/(n·c)) + χ·(n·c − 1)]·T(1, f_min, 1)    (3.10)

The above equation, however, still does not include the frequency f as a variable, since f_min is a constant. The final step consists in replacing the term T(n, f_min, c), as defined in (3.10), in (3.9):

T(n, f, c) = [α·γ·(1/(n·c)) + α·σ·((n·c − 1)/(n·c)) + α·χ·(n·c − 1)]·T(1, f_min, 1)·(f_min / f) + β    (3.11)

α is then absorbed by γ, σ and χ, as if we were considering three new variables A, B and C, defined by A = αγ, B = ασ and C = αχ:

T(n, f, c) = [A·(1/(n·c)) + B·((n·c − 1)/(n·c)) + C·(n·c − 1)]·T(1, f_min, 1)·(f_min / f) + β    (3.12)

Finally we proceed with a linear regression over A, B, C and β. We validate this model both offline and online, in chapter 4 and section 5.2.
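Once A, B, C and β have been fitted, evaluating Equation 3.12 for a candidate configuration is a direct computation. A minimal sketch (the function name and parameter layout are ours, not Nornir's):

// Prediction with the NxC model of Equation 3.12. tBase is the measured
// T(1, f_min, 1) and fMin the minimum clock frequency; A, B, C and beta
// come from the linear regression.
double predictTimeNxC(double A, double B, double C, double beta,
                      double tBase, double fMin,
                      int n, int c, double f) {
    double vc = static_cast<double>(n) * c;              // virtual cores n*c
    double usl = A / vc + B * (vc - 1.0) / vc + C * (vc - 1.0);
    return usl * tBase * (fMin / f) + beta;
}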

3.1.2 SMT Strategy

Since having more contexts per core is not the same as having more physical cores, both from a performance and a power consumption perspective, a more effective model could consider n (the number of physical cores) and c (the number of contexts per core) separately. The first problem that comes to mind when simply multiplying physical cores and contexts per core is that distinct configurations are treated the same way by the model. Let's take as an example a situation where we have two different machines, one having a more powerful SMT and one simply having more cores. As depicted in Figure 3.4, machine (a) has only two physical cores with three contexts each, while machine (b) has instead three physical cores, but with only two contexts each. In both cases the number of virtual cores n × c is the same and so is the T(n, f, c) obtained from equation (3.12), but the architectural differences between the two machines are evident. As a simple way to distinguish between physical cores and contexts we limit the first regression to physical cores only, and we later introduce another regression which takes the number of contexts into account. So while the NxC strategy makes use of only one regression, the SMT strategy actually makes use of two regressions. The first regression is the one already explained in (3.12), but with c = 1.

T(n, f, 1) = [A·(1/n) + B·((n − 1)/n) + C·(n − 1)]·T(1, f_min, 1)·(f_min / f) + β    (3.13)

Being fixed, the number of contexts is no longer a regressor in (3.13), but c will be reintroduced more accurately later. In particular, we use the T(n, f, 1) predicted by the first regression to train the other regression, which lets us obtain T(n, f, c).


Figure 3.4: Two machines having two different architectures but treated the same way by the model in section 3.1.1.

As explained in more detail in section 2.1.2, the SMT technology contributes by introducing some instruction-level parallelism using additional hardware components, which let the physical core keep two or more control flows in the core registers to quickly switch among them. This reduces the contention between two threads running on the same physical core compared to the case where SMT is not present. There is still more contention, however, with respect to the case where two threads run on two separate physical cores. Let's take, for example, the scenario depicted in Figure 3.5. We have three different machines: in (a) we have n = 1, c = 1; in (b) we have n = 1, c = 2 and in (c) we have n = 2, c = 1. The front end is dedicated to fetching and decoding thread instructions. When two different threads are executed on machine (a), they immediately suffer from contention due to the fact that processor (a) is not able to issue instructions from both threads at the same time. When the two threads alternate, all the registers must be updated by performing a context switch. This problem is not present in processor (b), but the threads can still suffer from contention in the execution engine. Machine (c) has the best performance, since it does not suffer from contention in the processor (the problem still exists for shared memory accesses, however).


Figure 3.5: Different levels of contention in different architectures. In (a) we have one physical core and one context, in (b) we have one physical core but two contexts and in (c) we have two physical cores.

To model the contribution of the SMT we suppose that there is some overhead T_c, caused by context switching, that is not paid when Simultaneous Multithreading is in play. How large this T_c is depends on the number of contexts per core provided by the SMT. So if c is the number of SMT contexts and T_c the overhead when no SMT is used, the execution time reduction due to SMT, T_saved, can be expressed as follows:

T_saved = T_c·(1 − 1/c)    (3.14)

If there is only one context per core (c = 1) then the time saved T_saved is zero, otherwise it grows with c. In order to better understand this definition, let's consider the simplified situation presented in Figure 3.6. Let's suppose we have a simple architecture where only one instruction per clock cycle is issued and the instructions are executed sequentially; let's also ignore any overhead other than the one due to context switching (presented in the figure as dark grey boxes). For the sake of simplicity, in this example we consider the extreme situation where every thread executes only one instruction and is then replaced. Each box is a different instruction and is divided in two parts: the left half represents the time for fetch and decode, while the right half is the time needed for the execution. In (b) we have 2-way SMT, so the processor can fetch and decode two instructions at the same time. System (a) must pay a context switch every time this happens, while system (b) can easily switch between the green thread and the red thread. Even system (b), however, must pay an overhead to load the blue thread, since a 2-way SMT cannot keep 3 ready contexts.

Figure 3.6: Execution time needed for 4 instructions to complete their execution. Different colours correspond to different threads. The dark boxes represent the context switching overhead.

This means that the execution time as a function of n (number of physical cores), f (core frequency) and c (number of contexts per core) can be extracted from T(n, f, 1) by subtracting Tsaved:

T(n, f, c) = T(n, f, 1) - T_{saved} = T(n, f, 1) - T_c \left( 1 - \frac{1}{c} \right) \qquad (3.15)

Since T (n, f, 1) is predicted using Linear Regression, the only unknown variable is the overhead Tc.

We know that Tc is a portion of T (n, f, 1), but it’s not possible to know how big it is a priori, since it depends largely on the specific application. Thus we define ω ∈ [0, 1) ⊂ R and:

T_c = \omega\, T(n, f, 1) \qquad (3.16)

So the closer ω is to 1, the larger the overhead Tc is in proportion to T(n, f, 1). Then the model can be rewritten by replacing Tc in (3.15):

T(n, f, c) = T(n, f, 1) - \omega\, T(n, f, 1) \left( 1 - \frac{1}{c} \right)
\;\Longrightarrow\;
\frac{T(n, f, c)}{T(n, f, 1)} = 1 - \omega \left( 1 - \frac{1}{c} \right) \qquad (3.17)

The second Linear Regression introduced by the SMT strategy makes use of Equation 3.17 by taking ω as a coefficient. The pseudo-code in Listing 3.1 summarizes the whole regression process.

Step 1. Find constants fmin, T(1, fmin, 1)
Step 2. Find training samples --> [n ; f ; T(n, f, 1)]
Step 3. Tune parameters A, B and C using Eq. 3.13 with training samples.
Step 4. Find test configurations --> [n ; f ; c]
Step 5. Get T(n, f, 1) from Eq. 3.13 with tuned A, B, C for test configs.
Step 6. Find training samples --> [c ; T(n, f, c)]
Step 7. Tune parameter ω using Eq. 3.17 with training samples.
Step 8. Get T(n, f, c) from Eq. 3.17 with tuned ω for test configs.

Listing 3.1: Pseudo-code detailing how the double linear regression is done.

In Step 1 we get the minimum frequency fmin and the execution time of the slowest configuration T(1, fmin, 1). In Step 2 we need to measure the execution time for various configurations of n and f, but with c = 1. These samples will be used to train the model (Step 3) since, knowing the responses T(n, f, 1), we can tune the A, B and C parameters from Equation 3.13. The purpose of the algorithm is to predict the execution time T(n, f, c) for any requested configuration [n; f; c], so in Step 4 we start collecting some of them. We'll call them test configurations to distinguish them from the training samples. In Step 5 we actually use the model trained in Step 3 to get T(n, f, 1) for each test configuration. This information is needed to train the second regression, which uses the training data collected in Step 6 to tune the ω parameter from Equation 3.17 (Step 7). In the last step the second regression can actually be used to perform the prediction, obtaining the execution time T(n, f, c) for all the test configurations; a sketch of this second stage is given below. We validate this model both offline and online in chapter 4 and section 5.2. This strategy shows a higher accuracy with respect to the NxC strategy described in section 3.1.1.
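As a rough sketch of how Steps 5-8 could look in practice (assuming NumPy and reusing the predict_stage1 helper sketched after Equation 3.13; all names are illustrative, not part of Nornir):

import numpy as np

def fit_omega(smt_samples, coeffs, T1, fmin):
    # smt_samples: iterable of (n, f, c, T) tuples with measured T(n, f, c).
    x, y = [], []
    for n, f, c, T in smt_samples:
        T_base = predict_stage1(n, f, coeffs, T1, fmin)  # predicted T(n, f, 1)
        x.append(1.0 - 1.0 / c)        # regressor multiplying omega (Eq. 3.17)
        y.append(1.0 - T / T_base)     # 1 - T(n, f, c) / T(n, f, 1)
    x, y = np.array(x), np.array(y)
    return float(x @ y / (x @ x))      # one-coefficient least squares, no intercept

def predict_time(n, f, c, omega, coeffs, T1, fmin):
    # Eq. 3.17: scale the predicted T(n, f, 1) by the SMT factor.
    T_base = predict_stage1(n, f, coeffs, T1, fmin)
    return T_base * (1.0 - omega * (1.0 - 1.0 / c))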


3.2 Power Consumption Prediction

As already explained for execution time prediction (see section 3.1), when talking about cores it's important to differentiate between physical cores and virtual cores (Figure 3.1).

\#VirtualCores = \#PhysicalCores \times \#ContextsPerCore \qquad (3.19)

In this section, as in the previous one, we'll first build a strategy that doesn't distinguish between virtual and physical cores, called the NxC strategy. Then we'll build another model that integrates SMT in a different way, called the SMT strategy.

3.2.1 NxC Strategy

In this section we'll follow the same procedure adopted for execution time: we'll first try to build a linear regression model capable of predicting power consumption without making any distinction between physical cores and virtual cores. This means that, in this section, when we talk about cores we are always talking about active virtual cores n × c, calculated as the number of active physical cores times the number of contexts per core. Supposing we always place each thread on a different active virtual core, power consumption within a single voltage domain can be modeled [18, 19, 20] as a function of the voltage v, the clock frequency f and the number of active cores in the voltage domain N(n, c). A voltage domain is a portion of the architecture with a common power supply (Figure 3.7). The voltage of modern voltage islands can change dynamically based on the need [21], by activating or deactivating components (such as cores) or by altering the clock frequency. Different voltage domains are powered independently, so each one can have a different voltage.


Figure 3.7: Four voltage domains with four cores each. Each voltage domain has a different number of active cores, since they can receive power supply independently.

So let's consider a single voltage domain. Its power consumption can be defined as:

P(n, f, c, v) = v I_{leak} + \alpha c_a v^2 f N(n, c) \qquad (3.20)

Where v is the voltage, Ileak is the leakage current, ca is the load capacitance and α is the activity factor. The leakage current Ileak is the gradual loss of energy due to current passing through barriers normally considered insulating, such as a turned-off transistor [18]. It's possible, for example, that some electrical energy is transferred through a component without turning it on, because the transistor has a voltage that's under the minimum required threshold. The activity factor α is the average number of power-consuming bit transitions in a clock cycle. We consider α, ca and Ileak as unknown constants and we'll use linear regression to tune these parameters.

Figure 3.8: Picture summarizing the various definitions. N(n, c) is the number of active cores per voltage domain; Ntot is instead the total number of cores per voltage domain. K is the total number of voltage domains, while k is the number of active domains and k̄ is the number of inactive domains.

N(n, c) is the number of active virtual cores in the voltage domain. How many cores we activate in each voltage domain depends on the thread placement strategy. We could, for example, activate a single core in each voltage domain or, if it's more convenient, activate all the cores in a few voltage domains and leave the others fully turned off. We chose to use a linear placement strategy: threads are simply placed linearly, activating every core of a voltage domain before proceeding to the next one. This minimizes the number of active voltage domains. Let Ntot be the total number of cores available in each voltage domain. Unless we have fewer threads than cores in a voltage domain, if a voltage domain is active then on average all of its cores will be active:

N(n, c) = \min(N_{tot},\, nc) \qquad (3.21)

Concerning the voltage v, each voltage domain has a finite number of voltage settings it can assume. Given so, a tabular function V(n, f, c) can be built once and for all for the current machine (programmatically or using the values provided by the CPU vendor). V(n, f, c) returns the voltage when varying the number of cores currently active on the voltage domain N(n, c) and their frequency f. From now on, for a clearer exposition, V(n, f, c) will be written as Vn, and V(0, f, 0) (the case where all the cores have been turned off) as V0, omitting the frequency dependency. In a system with multiple voltage domains the above formula is valid for each of them. So let's consider K = k + k̄, where


K is the total number of available voltage domains, and k and k̄ are the number of active and inactive voltage domains, respectively. A summary of these definitions can be found in Figure 3.8. For inactive voltage domains, N(n, c) is clearly zero so, for a given application, the overall power consumption is:

P(n, f, c, v) = k \left( v I_{leak} + \alpha c_a v^2 f N(n, c) \right) + \bar{k} \left( v I_{leak} \right)

P(n, f, c) = k \left( V_n I_{leak} + \alpha c_a V_n^2 f N(n, c) \right) + (K - k) V_0 I_{leak} \qquad (3.22)

When we follow a linear thread placement, the number of active voltage domains can be expressed as:

k = \left\lceil \frac{nc}{N_{tot}} \right\rceil \qquad (3.23)

Let's take as an example the linear placement strategy in Figure 3.9. In the example we have four cores per voltage domain and a total of ten active virtual cores, so Ntot = 4 and nc = 10. In this specific case, then:

N(n, c) = \min(4, 10) = 4, \qquad k = \left\lceil \frac{10}{4} \right\rceil = 3 \qquad (3.24)

Summarizing: α, Ileak and ca are constants obtained by linear regression; V depends on the machine and is known a priori; f, n and c are the variables, n and c being part of N(n, c); k and N(n, c) define how many voltage domains and how many cores per voltage domain are active, and can be determined easily once the placement strategy is defined; a small helper computing them is sketched below.
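A minimal helper computing N(n, c) and k under the linear placement assumption might look as follows (a Python sketch; names are ours):

import math

def active_domains(n, c, Ntot):
    # Eq. 3.21 and Eq. 3.23 under the linear thread placement strategy.
    N = min(Ntot, n * c)            # active cores per active voltage domain
    k = math.ceil(n * c / Ntot)     # number of active voltage domains
    return N, k

# Worked example of Eq. 3.24: nc = 10 active virtual cores, Ntot = 4.
assert active_domains(5, 2, 4) == (4, 3)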

This strategy has been validated both offline and online in chapter 4 and section 5.2.

3.2.2 SMT Strategy

The goal of this section is to integrate SMT in a more sophisticated way. In the following we no longer consider only active virtual cores, but active physical cores n and contexts c separately. We'll still be using Equation 3.22, but without the c variable, which will be reintroduced in the model differently.

Figure 3.9: Linear placement and nonlinear placement. In the linear placement all the cores in a voltage domain are activated before moving to the next voltage domain.

So let's suppose we map each thread to a different physical core; let's suppose also that we can obtain a tabular function Vn that contains, for each possible frequency f and number of active cores N(n), the relative voltage. Then the power consumption can be modeled [18, 19, 20] as a function of N(n) and f:

P(n, f) = k \left( V_n I_{leak} + \alpha c_a V_n^2 f N(n) \right) + (K - k) V_0 I_{leak},
\qquad
k = \left\lceil \frac{n}{N_{tot}} \right\rceil,
\qquad
N(n) = \min(N_{tot}, n) \qquad (3.25)

Equation (3.25) only tracks the power consumed by the physical cores, but Simultaneous Multithreading makes use of extra hardware resources in order to work, so there is an overhead in power consumption that must be considered. This quantity depends on c and must be added to Equation (3.25). Let's suppose we have a Simultaneous Multithreaded processor where each core has enough hardware components to keep c active contexts. If we call ε the maximum possible overhead due to SMT in the machine, then the actual overhead must be proportional to c, being zero when c = 1 (since we don't use any extra resource) and tending to ε when c grows.


P_{extra} = \varepsilon \left( 1 - \frac{1}{c} \right) \qquad (3.26)

ε is unknown, but can be tuned using linear regression. The final model is then:

P(n, f, c) = k \left( V_n I_{leak} + \alpha c_a V_n^2 f N(n) \right) + (K - k) V_0 I_{leak} + \varepsilon \left( 1 - \frac{1}{c} \right),
\qquad
k = \left\lceil \frac{n}{N_{tot}} \right\rceil,
\qquad
N(n) = \min(N_{tot}, n) \qquad (3.27)

This model is used as detailed in Listing 3.2 to obtain all the information we need.

Step 1. Find constants Ntot, K and voltage table V
Step 2. Find training samples --> [n ; f ; c ; P(n, f, c)]
Step 3. Calculate k and N(n) as in Eq. 3.27.
Step 4. Tune parameters Ileak, α, ca and ε using Eq. 3.27 with training samples.
Step 5. Find test configurations --> [n ; f ; c]
Step 6. Get P(n, f, c) from Eq. 3.27 with tuned parameters for test configs.

Listing 3.2: Pseudo-code detailing the linear regression for power prediction.

In Step 1 we must retrieve Ntot, K and V, three machine-specific constants. In Step 2 it's necessary to collect some labelled samples, each one obtained by setting a value for n, f and c and then measuring the relative power P(n, f, c). For each sample it is then possible to calculate k and N(n), since both depend on n (Step 3). In Step 4 we have to tune Ileak, α, ca and ε using Equation 3.27 and the responses from the training samples. The model is then ready to be used for power prediction, so in Step 5 we get some configurations whose power consumption is unknown and in Step 6 we use our model to obtain the power P(n, f, c) for all these configurations; a sketch of the fitting step is given below. We validate this model both offline and online in chapter 4 and section 5.2. This strategy shows a higher accuracy with respect to the NxC strategy described in section 3.2.1.
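A minimal sketch of the fitting in Steps 2-4, assuming NumPy and a hypothetical voltage table V indexed by (n, f); note that α and ca appear in Equation 3.27 only as a product, so the regression can only identify the combined coefficient α·ca. All names are illustrative.

import math
import numpy as np

def fit_power(samples, V, V0, Ntot, K):
    # samples: iterable of (n, f, c, P) with P the measured power;
    # V[(n, f)] is the tabulated per-domain voltage Vn.
    rows, y = [], []
    for n, f, c, P in samples:
        k = math.ceil(n / Ntot)               # active voltage domains
        N = min(Ntot, n)                      # active cores per active domain
        Vn = V[(n, f)]
        rows.append([k * Vn + (K - k) * V0,   # feature multiplying Ileak
                     k * Vn**2 * f * N,       # feature multiplying alpha*ca
                     1.0 - 1.0 / c])          # feature multiplying epsilon
        y.append(P)
    coeffs, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
    return coeffs                             # [Ileak, alpha*ca, epsilon]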


4 Results

In this chapter we validate the two strategies offline. In section 4.1 we describe the testing environment and the error estimation used; in section 4.2 we'll use cross-validation testing to evaluate the accuracy of the two strategies.

4.1 Testing Environment

Here we describe the machine used for the measurements, the monitored applications, the dataset and the type of error estimation used. The machine in use for all the tests has a little-endian 64-bit PowerPC architecture with four 5-core POWER8E CPUs. Each core supports 8-way Simultaneous Multithreading, for a total of 160 hardware threads. The CPU's clock frequency ranges from 2061 MHz to 3690 MHz in 50 discrete increments. The operating system is Ubuntu Linux with 4.4.0-47-generic kernel release. We have monitored three different applications coming from the well-known PARSEC benchmark [26, 27]: Blackscholes, Bodytrack and Facesim. More details on the PARSEC benchmark can be found in section 2.1.6. Due to the limits of the input set, Facesim can only run with a maximum of 128 threads. In our experiments this reduces the maximum number of active physical cores from 20 to 16, and consequently the number of possible configurations. Offline data has been collected by running each application in all the possible configurations, i.e. for every combination of the number of cores n, the frequency f and the


number of contexts c. The dataset is hence composed of power and execution time measurements when varying n (from 1 to 20, or from 1 to 16 for Facesim), c (from 1 to 8) and f (from 2.061 GHz to 3.69 GHz, in 50 increments), for a total of 8'000 (or 6'400) different combinations (see Figures 4.1 and 4.2). Power measurements are taken with the Mammut¹ library, which for this particular architecture relies on Amester². We'll then use a small subset of the dataset as training samples and try to predict power and execution time for all the other configurations, comparing the predicted values with the actual ones.

The model accuracy has been estimated using the Mean Absolute Percentage Error (MAPE). MAPE is a popular loss function used in regression problems and is defined as follows [28]:

MAPE = \frac{100\%}{k} \sum_{i=1}^{k} \left| \frac{A_i - P_i}{A_i} \right| \qquad (4.1)

Where, for each point i out of a total of k points, Ai is the actual value for that point and Pi is the one predicted by the regression model. The difference between the predicted value and the actual value is normalized with respect to the actual value itself, so that we obtain a relative value. The relative errors are then averaged by summing them up and dividing by k. Using the absolute value instead of the square, as done in other types of error estimation (like Root Mean Square Error), MAPE gives us a more intuitive perception of the error with respect to the actual value. Let's suppose we have predicted, for a given configuration, that the execution time will be 300 s. When we actually measure the execution time for that configuration we find that the actual value is 327 s. Then we can calculate how accurate our model has been for that specific configuration with the Absolute Percentage Error:

APE = \left| \frac{327 - 300}{327} \right| \times 100\% = 0.0826 \times 100\% = 8.26\% \qquad (4.2)

To obtain the MAPE we must repeat this for every point and then take the mean value.
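A direct implementation of Equation 4.1 (a minimal Python sketch, names ours) reproduces the example above:

def mape(actual, predicted):
    # Eq. 4.1: mean of |A_i - P_i| / A_i over all points, as a percentage.
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) \
           / len(actual) * 100.0

print(round(mape([327.0], [300.0]), 2))   # prints 8.26, matching Eq. 4.2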

¹ http://danieledesensi.github.io/mammut/
² https://github.com/open-power/amester


Where fmin and fmax are respectively the minimum and the maximum possible frequency (2061 MHz and 3690 MHz in our case), nmax is the maximum number of physical cores (in our case 20, or 16 for Facesim) and cmax is the maximum number of contexts (8 in our case). At least one random configuration is added to these four fixed configurations, ranging from a total of 5 to a total of 30 different configurations. Furthermore, to have a statistically sound result, each test was repeated a number of times, until the 95% confidence interval of the MAPE was lower than 5% of the sample mean. Figures 4.3, 4.4 and 4.5 contain, for each application and for each strategy, a plot showing the aforementioned analysis for execution time and power consumption. On the x-axis, the number of configurations used for the training phase, from a minimum of 5 to a maximum of 30; on the y-axis, the average MAPE obtained with that number of configurations.
