
Scuola di Ingegneria Industriale e dell’Informazione

Dipartimento di Elettronica, Informazione e Bioingegneria

Corso di Laurea Magistrale in

Computer Science and Engineering

Approximate Energy-aware Framework

to Support Intermittent Computing

Supervisor:

Prof. Luca Mottola

Master Graduation Thesis by:

Francesco Cerizzi

Student ID n. 875583

Academic Year 2018-2019


my life.


Acknowledgements

Thanks to everybody who supported me in this long and difficult travel. I wish to express my deepest gratitude to my advisor, Professor Luca Mottola, who helped me to understand the meaning of scientific research, and encouraged the right attitude during the whole process.

I am very grateful to all my NESlab friends (yeps, we are also colleagues) Andrea, Francesco, Fulvio, and Pietro. Thank you for the good times and for pushing me ahead with this project.

I wish to thank all the ‘Sbagliati’: Antonino, Davide C., Davide L., Federico, Filippo, Gianluca, Lorenzo C., and Lorenzo F., whose assistance was a milestone in the completion of this project and, more in general, in the quest for the degree.

I am indebted to Alessandro that kept me balanced.

Thanks to Massimo, without your unconditional help, this would not have been possible.

Last but not least, thanks to my family, Davide, Ezio, and Rita: your unconditional support is my biggest strength.


Abstract

Approximate Computing (AC) is a computing paradigm that trades output accuracy for energy and computing time. Research on this paradigm roughly started at the beginning of the last decade and gained popularity during the middle of the current one. AC targets computing-intensive applications that do not require an accurate result, allowing an increase in performance.

In parallel, during this decade, Internet of Things (IoT) systems gained momentum, and today we can experience how pervasive they are. Nowadays, there is a new class of green and eco-friendly IoT devices that operate without batteries or a tethered power line. This reduces maintenance costs and allows implementations or architectures that were considered unfeasible due to battery limitations. These devices rely on energy harvested from the environment, which is unpredictable and results in frequent power failures. They are characterized by a small form factor, low computational power, low power consumption, and contained costs. For these reasons, deploying an Operating System (OS) with all the features of a mainstream OS, such as the management of power failures, is not possible; hence, when a power failure happens, all computation progress is lost. The literature therefore describes two kinds of solutions to ensure program forward progress across power failures. Checkpoint-based solutions save the program state into a Non Volatile Memory (NVM) space, to make it persistent across power failures. Task-based solutions require the programmer to divide the program into atomic tasks: each task is re-executed from the beginning upon a power failure and saves its result only when the computation completes, which ensures a transactional semantic. Both solutions guarantee the forward execution of an application in the presence of power failures; we call this new paradigm of computation intermittent computing. Both solutions introduce an overhead: checkpoints are costly operations, in terms of energy for the operations on the NVM and of time for collecting the program state; tasks, instead, may be re-executed if interrupted by power failures. Thus, both solutions, and Transiently Powered Computers (TPCs) in general, may benefit from energy-saving opportunities that arise from AC.

In this document, we propose AMNESIA, an approximate energy-aware framework to support intermittent computing. AMNESIA exploits the Loop Perforation technique of AC, which consists of skipping some loop iterations to reduce energy consumption at the expense of result quality. AMNESIA allows the user to specify energy boundaries and uses them to select the approximation that results in the best trade-off between performance and quality. Starting from a set of possible approximations of an application, the framework performs an offline exploration to select the set of the most promising approximations in terms of performance and quality of the results. Then, when required, the runtime component of the framework runs the approximated version of the application that complies with the externally provided energy boundaries.

AMNESIA reaches its goal and guarantees the execution of an application within the provided energy budget. In addition, since there are different ways to approximate an application, AMNESIA selects approximations of the application that are better than 80% of the total available approximations for that application.


Sommario

Approximate computing is a computing paradigm that sacrifices part of the quality of a result and obtains in exchange a performance gain, such as the use of fewer energy resources. Approximate computing fits all those processes that require high computational power but, at the same time, do not always require an extremely precise result.

In parallel, the spread of the Internet of Things (IoT) has raised the need for ever smaller, cheaper devices with a lower environmental impact, leading to the birth of a new class of devices: intermittently powered devices. They are powered by energy harvested from the environment, which charges a small capacitor that acts as an energy store. The absence of batteries reduces maintenance costs, which become almost zero, and enables new possibilities and services that were previously impractical. Unfortunately, the behavior of ambient energy is not predictable, so this class of devices is subject to continuous shutdowns. A sudden shutdown is not a problem for a modern device, such as our laptop: it has an Operating System (OS) that takes care of handling various malfunctions, including a sudden shutdown. Intermittently powered devices, instead, are characterized by a small form factor, low computational power, low energy consumption, and contained costs. For these reasons, using an operating system like the one on our laptop is not possible; hence there is nothing to mitigate the problems related to a sudden shutdown. This means that, every time a shutdown occurs, all computational progress is lost. Research has focused on solving this problem and allowing an execution to progress even in the face of a sudden shutdown; the literature presents two possible solutions. Checkpoint-based solutions save the execution state of the program in a non-volatile memory so as to guarantee execution across sudden shutdowns. Task-based solutions require the programmer to divide the program into atomic tasks, i.e., functions that must be executed atomically, without shutdowns during their execution. A task, once completed, saves its result in a non-volatile memory; if a shutdown occurs during the execution of a task, the task is re-executed from scratch. This ensures a transactional semantic. These two solutions, which allow execution despite frequent shutdowns, give rise to a new computational paradigm: intermittent computing. Both solutions, however, introduce an overhead: checkpoints are costly operations in terms of energy spent on non-volatile memory operations and of time spent collecting the execution state; tasks may fail to complete because of sudden shutdowns. Therefore, both solutions, and intermittently powered devices in general, can benefit from the energy-saving opportunities offered by approximate computing.

In this document, we propose AMNESIA, an approximate computing framework for intermittently powered devices. AMNESIA produces energy savings by exploiting approximate computing and, in particular, the Loop Perforation technique. This offers the possibility of executing an application while respecting a consumption limit. Starting from a set of possible approximations of an application, the framework is able to select, through an offline exploration procedure, the subset that offers the best trade-off between performance and quality of the result. Then, at run time, given a computation request and an energy consumption limit to be respected, the framework chooses the appropriate approximation.

AMNESIA achieves its energy-saving goals, guaranteeing the execution of an application within the given energy budget. Moreover, AMNESIA selects a subset of approximations that turn out to be better than 80% of all the approximations considered for an application.


Contents

1 Introduction
1.1 The Need for New Paradigms
1.2 Contribution
1.2.1 Domain Analysis
1.2.2 Loop Perforation
1.2.3 AMNESIA
1.3 Document Structure
2 Approximate Computing
2.1 Scope and Challenges
2.1.1 Quality Metrics
2.1.2 Motivation
2.1.3 Challenges
2.2 Strategies for Approximation
2.2.1 Trade-Off
2.2.2 Techniques Classification
2.2.3 Identify Variables or Operations
2.2.4 Strategies
2.3 Framework and Tools for AC
2.3.1 SpeedPress
2.3.2 Green
2.3.3 ACCEPT
2.3.4 Conclusions
3 Intermittent Computation
3.1 Traditional VS Intermittent Computing
3.2 Managing Power Failures
3.3 Amnesia Treatment
3.3.1 Checkpoint-Based Solution
3.3.2 Task-Based Solution
3.3.3 Solution Intersection
4 Approximation Into Transiently Powered Devices
4.1 Opportunities: Zygarde
4.2 Our Goal
4.2.1 Requirements
4.2.2 Approximation Techniques of Interest
5 Loop Perforation
5.1 The Technique
5.2 Iteration Skipping
5.3 Gains
5.4 Iteration Selection
5.4.1 Skipping Policies
5.4.2 Policies' Space
5.4.3 Score Function
6 AMNESIA: a Framework for Approximation in Intermittent Applications
6.1 Explorer
6.1.1 Skipping Policies Generator
6.1.2 Policies Explorer
6.1.3 Policies Encoder
6.2 The Executor
6.2.1 Policy Selector
6.2.2 Runner
6.3 Final Structure
7 AMNESIA Implementation
7.1 Skipping Policies
7.2 Explorer Implementation
7.2.1 Policy Generation
7.2.2 Policy Exploration
7.2.3 Policy Encoder
7.3 Executor Implementation
7.3.1 Policy Selector
7.3.2 The Runner
8 Evaluation
8.1 Evaluation Environment
8.1.1 Evaluation Platform
8.1.2 Evaluation Tool
8.2 Evaluation Inputs
8.2.1 Linear Regression
8.2.2 Dijkstra
8.2.3 Susan
8.3 Evaluation Metrics
8.3.1 Approximation Level
8.3.2 Score Function
8.4 Loop Perforation Evaluation
8.4.1 Linear Regression
8.4.2 Dijkstra
8.4.3 Susan Smoothing
8.4.4 Susan Corner Recognition
8.5 Observations
8.6 Policies Evaluation
8.6.1 K-fold Cross-Validation
8.6.2 The Explorer Validation
8.6.3 Conclusions
9 Conclusion and Future Works
A Python Generator
B Evaluation Data
B.1 Linear Regression
B.2 Dijkstra
B.3 Susan Smoothing


List of Figures

1.1 Intermittent execution pattern. The capacitor charges during the off periods of the device, and when it reaches the activation threshold — the green line — the device wakes up and starts computing, discharging the capacitor until its charge reaches the deactivation threshold — the orange dashed line — then the device is powered off again.
1.2 Overview of AMNESIA
2.1 SpeedPress components [32]
2.2 Architecture of the Green framework [6]
2.3 ACCEPT's workflow schema [1]
3.1 Energy trace of an energy harvester [8] — solar power source
3.2 Intermittent execution pattern over time. The capacitor charges during the off periods of the device, and when it reaches the activation threshold — the green line — the device wakes up and starts computing, discharging the capacitor until its charge reaches the deactivation threshold — the orange line — then the device is powered off again.
3.3 Comparison between a checkpoint-based solution (left) and a task-based solution (right). Checkpoint-based solutions maintain the program structure, saving the program state periodically (see Section 3.3.1); task-based solutions divide the program into subroutines and manage them in a transactional way (see Section 3.3.2).
4.1 A typical Neural Network (NN) schema.
5.1 Arrays used in the Loop Perforation example
5.2 Some of the 45 possible combinations of two elements. The mean is computed skipping the red elements of the array; we recall that the exact mean value is mean = 66.5. We can notice that varying the ignored elements changes the results, thus the quality of the result depends on which elements are considered by the algorithm.
5.3 A graph showing the energy consumption variation after loop perforation of a program of 10 instructions, 3 of them inside a loop. We estimate an energy consumption of 500 nJ per instruction, using the energy consumption per cycle estimated by Ahmed et al. [3] and the mean number of cycles spent per instruction estimated using MSPsim, a simulator of IoT devices detailed in Section 8.1.2.
5.4 Example of a parking occupancy sensor based on image recognition. The picture on the left is the one taken by the sensor; on the right, the area of the picture of interest for the recognition tool.
5.5 Some examples of skipping policies applied to the array mean example. The red elements of the array are the skipped ones; the reader may notice that the red elements correspond to the zeros of the function. We also notice that, by keeping the skipping factor constant, as in the first two figures, it is possible to define different skipping policies.
6.1 Overview of AMNESIA
6.2 Offline components of AMNESIA
6.3 Online components of AMNESIA
6.4 Complete overview of AMNESIA with component details
7.1 A block of 32 bits encoded into one integer.
8.1 The msp430f1611 Micro Controller Unit (MCU)
8.2 The workflow from the firmware source code creation to the energy consumption estimation
8.3 Analysis and results of the linear regression on a sample of the input data set
8.4 Susan's inputs
8.5 On the left, an image without the smoothing filter; on the right, a smoothed image. Each image is associated with its gradient magnitude.
8.6 Graph of the energy consumption and the score vs. the approximation level of the linear regression.
8.7 Comparison between the results of Dijkstra.
8.8 Comparison between the results of Dijkstra.
8.9 Comparison between the two possible score functions of Susan Smoothing
8.10 Results obtained by Susan Smoothing varying the approximation level; the level of approximation is indicated below the images. As the reader may notice, the exact result does not have any noise, while the one with 90% of approximation is noisy.
8.11 Comparison between the results obtained by Susan Smoothing over three different inputs
8.12 The visual effect of approximation on the results for the three inputs of the corner recognition algorithm of Susan. Below each image, the approximation percentage is indicated.
8.13 Comparison between the results obtained by Susan Corner Recognition over three different inputs
8.14 Linear regression: 10-fold cross-validation results. The reader can find the results of the other folds in Appendix B.1
8.15 Dijkstra: 10-fold cross-validation results. The reader can find the results of the other folds in Appendix B.2
8.16 Susan Smoothing: 10-fold cross-validation results. The reader can find the results of the other folds in Appendix B.3
8.17 Susan Corner: 10-fold cross-validation results. The reader can find the results of the other folds in Appendix B.4
B.1 Folds of linear regression from 0 to 4
B.2 Folds of linear regression from 5 to 9
B.3 Folds of Dijkstra from 0 to 4
B.4 Folds of Dijkstra from 5 to 9
B.5 Folds of Susan Smoothing from 0 to 4
B.6 Folds of Susan Smoothing from 5 to 9
B.7 Folds of Susan Corner from 0 to 4


List of Tables

4.1 Approximate Computing Techniques (ACTs) that meet the requirements of Section 4.2.1. We use ✓ when the requirement is met and ✗ when it is not. We highlight the ACT that meets all of our requirements
6.1 Energy consumption table of the approximated versions of an application for the mean calculation


AC Approximate Computing
ACT Approximate Computing Technique
AxNN Approximate Neural Network
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DCT Discrete Cosine Transform [2]
DRAM Dynamic RAM
DNN Deep Neural Network
ECC Error-Correcting Code memory
eDRAM Embedded DRAM
FP Floating Point
FPU Floating Point Unit
GPU Graphics Processing Unit
IoT Internet of Things
LSB Least Significant Bit
MCU Micro Controller Unit
NN Neural Network
NPU Neural Processing Unit
NVM Non Volatile Memory
OS Operating System
PARSEC The Princeton Application Repository for Shared-Memory Computers [9]
PSNR Peak Signal to Noise Ratio
RAM Random Access Memory
RMSD Root-Mean-Square Deviation
SIMD Single Instruction Multiple Data
SRAM Static RAM
SSIM Structural SIMilarity
TPC Transiently Powered Computer


Chapter 1

Introduction

Nowadays, energy saving is becoming relevant to everyday life: there are plenty of devices, such as smartphones, smartwatches, and laptops, that could benefit from being energy aware. Moreover, energy saving also gains prominence in large-scale applications such as scientific computing, social media, and financial analysis; in fact, according to Mittal [44], the computational and storage demands of modern systems have far exceeded the available resources. All these factors make Approximate Computing (AC) more relevant.

AC is a computation paradigm based on the intuitive observation that, while performing exact computation or maintaining peak-level service demand requires a high amount of resources, allowing selective approximation or occasional violation of the specification can provide disproportionate efficiency gains. In addition, AC is based on the assumption that, depending on the application scenario, a high level of accuracy is not always required. For example, consider the MP3 compression algorithm: it saves memory by distorting the original sound provided to the algorithm. This compression algorithm exploits one characteristic of the human ear: since the human ear can nominally hear sounds in the range 20 Hz to 20,000 Hz, all frequencies outside these boundaries are inaudible to humans, so it is possible to cut them off, saving memory space. This is how the MP3 compression algorithm works, and it is an example of AC applied to data that takes advantage of the conditions of use.

In parallel, a new kind of green and eco-friendly Internet of Things (IoT) device was born, and it requires energy awareness. These new devices are called Transiently Powered Computers (TPCs). A TPC is a battery-less small-scale IoT device that uses, as its only power source, the energy harvested from the environment. Having a battery-less system leads to a dramatic reduction of maintenance costs and has the potential to enable applications and services considered impractical due to battery limitations.


Figure 1.1: Intermittent execution pattern. The capacitor charges during the off periods of the device, and when it reaches the activation threshold — the green line — the device wakes up and starts computing, discharging the capacitor until its charge reaches the deactivation threshold — the orange dashed line — then the device is powered off again.

These systems are characterized by a small form factor, low computational power, low power consumption, and contained costs. They rely on a Micro Controller Unit (MCU) — one of the most relevant being the Texas Instruments MSP430 family — with a very limited amount of memory and storage. Instead of having a battery, TPCs store the energy provided by the harvester in a small capacitor. Some battery-less platforms, such as WISP [54] and Flicker [29], are already available.

Thus, this kind of device has an intrinsic challenge: since the energy harvested from the environment is an unpredictable power source, the execution of a TPC can be interrupted at any time. This is not a problem for mainstream platforms, such as our laptops, which are all equipped with an Operating System (OS). The OS takes care of power management and provides features to mitigate the problems related to power failures. Since TPCs are small and low-powered devices, there is no possibility to equip them with a mainstream OS, and, since nothing takes care of power failures, they are a problem. Therefore, the literature presents two solutions, detailed in Section 3.3, that enable forward execution on these devices. Since these devices compute through power failures, a new computational paradigm is born: intermittent computing, as shown in Figure 1.1, a paradigm of computation that suffers from continuous and unpredictable shutdowns. Therefore, in this field, energy awareness and, more generally, energy saving are critical aspects.


1.1 The Need for New Paradigms

In intermittent computing systems, there is a gap when we consider energy-saving features. In fact, this is an emerging field and researchers, as we mentioned in the overview of this chapter, have focused their efforts on enabling intermittent computation on TPCs by finding ways to move the execution forward after a power failure. There are two possible solutions to the forward-execution problem:

checkpoint-based solution:

checkpoint-based solutions rely on the possibility of saving the execution state in a non-volatile memory in order to keep the execution progress saved. In case of a power failure, the device can restart from the last saved checkpoint;

task-based solution:

task-based solutions are a new paradigm of code structure: the programmer divides the application into functions called tasks. Tasks must be atomic, which means that a task must be executed entirely or, if a power failure stops it in the middle of its execution, it must be re-executed completely. This division allows a transactional semantic: each task execution is pipelined on the device, and an application is complete when all its tasks are successfully executed.

Both these solutions achieve the aim of moving an intermittent execution forward, but they do not try to save energy during the application execution.

AC offers large energy-saving opportunities: many different techniques can be applied to trade off application accuracy for performance. There are software techniques [66], which operate independently of the hardware, and hardware techniques [66], which, on the other hand, use dedicated hardware to exploit the AC opportunities.

Researchers focus their attention not only on Approximate Computing Techniques (ACTs) but also on approximation frameworks, tools that allow programmers to take advantage of approximation opportunities to achieve performance gains in their applications. There are three relevant frameworks at this time: Green [6], SpeedPress [32], and ACCEPT [1]; they are all discussed in Section 2.3. These frameworks share the aforementioned aims and some peculiarities, such as the use of a user-provided quality function to evaluate the quality loss of the approximated result.

In conclusion, we have the AC field, which brings energy-saving opportunities at the cost of some result quality, and we have new emerging devices, the TPCs, that need to save as much energy as possible to maximize the opportunities offered by their unpredictable power sources. Therefore, we fill part of this gap by proposing AMNESIA, a new AC framework that aims to allow the execution of an application in accordance with some user-provided energy boundaries in an intermittent computing setting.

1.2 Contribution

AMNESIA, the framework that we propose in this thesis, aims to exploit the energy-saving features of AC, achieving the execution of applications within user-defined energy boundaries.

Our contribution is basically composed of three parts.

Domain analysis:
as we mentioned in the above sections, there are many ACTs, and all of them may achieve energy-saving results. This is discussed in Chapter 4.

Technique analysis:
after the domain analysis, we select the most promising ACT to be used in our framework; this technique is discussed in Chapter 5.

AMNESIA:
the framework that we developed to exploit the opportunities of AC in the new field of TPCs. AMNESIA is discussed in Chapter 6 and Chapter 7, and evaluated in Chapter 8.

1.2.1 Domain Analysis

The application domain of TPCs is vast. We analyze this domain in depth, from a requirements point of view, in Chapter 4 in order to select the best ACT for our aims. We need an ACT that is software-based, that requires no massive program intervention to be implemented or used, that is adjustable — indeed, we would like to control the level of approximation with some knobs — and that covers as many cases of the TPC field as possible.

The most promising technique, which meets all these requirements, is Loop Perforation.

1.2.2 Loop Perforation

Loop Perforation is a software ACT that allows the programmer to skip some iterations of a loop. In other words, this technique relies on the belief that skipping some iterations of an algorithm's loop produces an acceptable result with performance gains. This technique satisfies all the requirements of the domain analysis, but its strength is that it covers many applications in the IoT field. Indeed, analyzing MiBench2 [30], one of the most relevant benchmarks for IoT devices, we discovered that loops are present in almost all the applications.

For instance, let us suppose we want to calculate the mean of fifty integers. The exact value is the sum of all fifty integers divided by fifty, but what happens if we sum only forty elements out of fifty and divide by forty? We obtain a different result, but it will not be dramatically different from the previous one, since we used forty out of the original fifty elements to calculate it. This is an approximated result, and we obtained it by doing forty sums instead of the fifty required by the exact result. This means that we spent less effort on our calculation. The aforementioned example simply explains how loop perforation works and how it saves energy: by skipping loop iterations, it skips some of the calculations that are performed to obtain the exact result.
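The C sketch below makes this example concrete. It is only an illustration of the idea, not AMNESIA code; the "skip one iteration out of every k" policy shown here is just one of many possible skipping policies, and in a real perforated loop the skipped iterations would typically be removed by rewriting the loop bounds or stride rather than by a branch.

```c
#include <stdio.h>

#define N 50

/* Exact mean: every element is visited. */
static double mean_exact(const int *v) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += v[i];
    return (double)sum / N;
}

/* Perforated mean: one iteration out of every `k` is skipped
 * (k = 5 skips 10 of the 50 iterations, as in the text above),
 * so fewer additions are performed and less energy is spent. */
static double mean_perforated(const int *v, int k) {
    long sum = 0;
    int used = 0;
    for (int i = 0; i < N; i++) {
        if (k > 0 && i % k == k - 1)
            continue;               /* skipped iteration */
        sum += v[i];
        used++;
    }
    return (double)sum / used;
}

int main(void) {
    int v[N];
    for (int i = 0; i < N; i++)
        v[i] = i;                   /* toy input */
    printf("exact mean:          %.2f\n", mean_exact(v));
    printf("perforated mean:     %.2f\n", mean_perforated(v, 5));
    return 0;
}
```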

Taking again into account the above example of the mean of fifty elements, we created an approximated version of the mean that skips ten elements out of fifty; but does the selection of the skipped elements make any difference? Yes, it does. In fact, we assume, and we prove in Chapter 8, that the information is not uniformly distributed inside the input of an application. Therefore, since skipping an element always provides performance gains, but skipping different elements produces different results, we extend AMNESIA with this notion and try to select the elements of the input that carry less information, in order to achieve the most accurate result. Thus, an application can be approximated using different levels of approximation, which depend on how many iterations are skipped; moreover, by keeping the level of approximation constant and selecting different iterations to be skipped, it is possible to produce other approximations of the same application.

1.2.3 AMNESIA

AMNESIA stands for ApproxiMate eNergy-aware framEwork Supporting Intermittent Applications, and it allows the execution of an application within user-provided energy bounds. AMNESIA is detailed in two chapters: Chapter 6, which discusses the structure of the framework, and Chapter 7, which covers the implementation of the tool.

Figure 1.2 is a compact and simplified view of the structure of AMNESIA. As the reader can notice, the framework is composed of two components: the offline component, the Explorer, and the online component, the Executor.


Figure 1.2: Overview of AMNESIA

Explorer

The Explorer aims to choose, among the various possible approximations, the ones that fit best. It chooses by using a user-provided training set of inputs and by exploring the possible approximations against that training set. Finally, it provides a list of possible approximations of the application; this list is loaded into the device and used by the runtime component of AMNESIA. Each approximation in the list contains the iterations of the application that must be skipped to apply that approximation.

Executor

The Executor selects, from the set of approximations provided by the Explorer, the approximation that fits the energy boundaries provided as input by the user; then it runs the application using the selected approximation, complying with the energy boundaries.
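A minimal sketch of how such a selection could look on the device is shown below. The struct layout, the field names, and the assumption that the policy list is sorted by increasing energy cost are illustrative choices of ours, not the actual AMNESIA data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding of one approximation produced by the Explorer:
 * an estimated energy cost plus a bitmask of loop iterations to skip. */
typedef struct {
    uint32_t energy_nj;   /* estimated energy consumption, in nJ      */
    uint32_t skip_mask;   /* bit i set = skip the i-th loop iteration */
} policy_t;

/* Return the most accurate policy (assumed here to be the most expensive
 * one) whose estimated cost still fits the user-provided energy budget.
 * The list is assumed sorted by increasing energy cost. */
static const policy_t *select_policy(const policy_t *policies, size_t n,
                                     uint32_t budget_nj)
{
    const policy_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (policies[i].energy_nj <= budget_nj)
            best = &policies[i];   /* still affordable: keep the costlier one */
        else
            break;                 /* sorted list: nothing beyond fits        */
    }
    return best;   /* NULL if even the cheapest policy exceeds the budget */
}
```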

1.3 Document Structure

The rest of this document is structured in five parts: the description of the state of the art of AC, the description of the intermittent computing scenario, the analysis of the opportunities that arise from the union of the AC paradigm and TPCs and how to exploit them, the description of AMNESIA, first from a theoretical point of view and then from an implementation point of view, and finally the evaluation of our proposed framework.


Approximate Computing State of the Art

Chapter 2 details, firstly, the opportunities that arise from AC and, secondly, the challenges that it poses. In this chapter, we analyze how the challenges posed by AC are addressed in the literature, and how and which ACTs are proposed by researchers. We discuss in depth the features of each ACT and we detail the most relevant framework implementations.

Intermittent Computing

In Chapter 3 we analyze the field of intermittent computing, discussing the challenges that this new emerging field poses, and how we can fill the gap that is still present there.

Approximate Intermittent Computation

This theme is covered by two chapters:

• Chapter 4 explores the solutions present in the literature that try to merge the field of AC with transiently powered devices. It details the functionality of those solutions and their aims. Then, this chapter explores the opportunities of AC that our framework could take advantage of, and it analyzes the requirements of TPCs to select the best ACT for our aims.

• Chapter 5 details the pros and cons of the previously identified ACT. This chapter goes deeper into the ACT analysis by making examples and considerations about the energy-saving gains that derive from the usage of this technique, and about the fact that not all the approximations with the same energy savings lead to the same results. Chapter 5 analyzes how the information is distributed inside an input, and it describes the tools that AMNESIA may use to choose the best approximation among all the possible ones with the same performance level.

AMNESIA

AMNESIA is covered in two different chapters: the first one focused on the concepts and notions of the contribution, and the second one focused on the AMNESIA implementation.


• Chapter 6 includes the conceptual contribution; it starts with an overview of the framework and takes the reader by the hand through all the components that build up the tool. Firstly, we discuss the Explorer, the offline component that chooses the list of available approximations; this is carried out by figuring out which approximation could be the most appropriate when there is more than one approximation per approximation level with identical performance but different results (indeed, we mentioned in Section 1.2 the possibility of having different approximations with the same approximation level, obtained by skipping the same number of iterations). Secondly, we analyze the behavior of the online component, which at runtime chooses the approximation that fits the requested energy boundaries. Finally, there is a full usage example of the tool.

• Chapter 7 follows the same structure as Chapter 6, presenting first the offline component and then the online one, but from the implementation point of view.

Evaluation

Chapter 8 contains the evaluation of AMNESIA. We test both the online and the offline components. In the first case, we are interested in the energy-saving gains obtained by using the approximated versions of the application; in the second case, the focus switches to the approximations that the offline component opts for, in order to decide whether they make sense or not.

The AMNESIA evaluation demonstrates that with Loop Perforation it is possible to achieve energy savings on TPCs; moreover, AMNESIA reaches its goal and guarantees the execution of an application within the provided energy budget. Finally, in Chapter 8 we demonstrate not only that there are approximations that perform better than others with the same approximation level, but also that AMNESIA is capable of finding, in the worst case, an approximation that is better than 80% of the total available approximations for that application.


Chapter 2

Approximate Computing

The computational and storage demands of modern systems are exceeding the available resources. This, plus the need to meet hard performance goals, has motivated research into systems that trade accuracy for performance or other benefits (such as robustness, energy savings, etc.).

Approximate Computing (AC) is based on the intuitive observation that, while performing exact computation or maintaining peak-level service demand requires a high amount of resources, allowing selective approximation or occasional violation of the specification can provide disproportionate gains in efficiency. In brief, AC exploits the gap between the level of accuracy required by the applications/users and that provided by the computing system to achieve diverse optimizations. Thus, AC has the potential to benefit a wide range of applications/frameworks, for example data analytics, scientific computing, multimedia and signal processing, machine learning, and so forth. However, although promising, AC is not a panacea. Effective use of AC requires judicious selection of approximable code/data portions and approximation strategies, since uniform approximation can produce unacceptable quality loss [48, 59] and, even worse, approximation in control flow or memory access operations can lead to catastrophic results [70].

Although the term is new, the principle is not: efficiency–accuracy trade-offs are employed in commonly used applications. For example, Floating Point (FP) operations, digital signal processing applications, and lossy compression trade off accuracy in exchange for fast computation or space saving. In this chapter we first analyse the goals and challenges of AC. We propose an overview of the most significant ACTs and we classify them according to the related literature. In conclusion, we explore the world of frameworks dedicated to making AC accessible to users.


2.1 Scope and Challenges

In the following section, we discuss the basis on which AC was born, the opportunities that arise from it, and the challenges we need to address to take advantage of such opportunities.

2.1.1 Quality Metrics

Before dealing with the ‘Scope and Challenges’ of AC, since the concept behind AC is to trade off accuracy for benefit, we need to make a brief detour into what we mean when we talk about accuracy. Behind the concept of accuracy, the literature discusses a metric which is able to express qualitatively how accurate a result is with respect to another one; this metric is called quality of results. Quality of results is evaluated using several metrics — not mutually exclusive — that are defined for each approximated application and for each environment of usage. We give an overview of some quality metrics used in the literature to estimate the quality of results:

Relative Difference

Relative difference can be considered the most general accuracy metric: it compares the exact output with the approximated one by calculating the relative difference between the outputs, and it is a good estimator of result quality, used across a large set of applications [44]. For instance, according to Mittal [44], relative difference is used as a quality metric for approximated applications with different purposes such as physics simulations, signal analysis, and MapReduce programs, e.g., page rank, page length, project popularity, and so forth.
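As a simple illustration, relative difference can be computed as in the sketch below; the guard against a zero reference value is our own choice, since the metric is undefined in that case.

```c
#include <math.h>

/* Relative difference between the exact output and the approximated one:
 * |approx - exact| / |exact|.  A value of 0 means no quality loss. */
static double relative_difference(double exact, double approx)
{
    if (exact == 0.0)
        return fabs(approx);   /* metric undefined for a zero reference */
    return fabs(approx - exact) / fabs(exact);
}
```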

Peak Signal to Noise Ratio (PSNR) and Structural SIMilarity (SSIM)

PSNRandSSIMare two metrics that measure signal quality. WhilePSNRis the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation,SSIM[64] is a perceptual metric: it considers image degradation as changes in structural information of the image itself, taking advantage of known characteristics of the human visual system. These two metrics are used mainly for measuring accuracy in image or video compression tools (e.g., JPEG, HEVC )


Classification or Decision Accuracy

Classification accuracy is a quality metric for applications that classify items — e.g., support vector machines, or streamcluster, a data mining tool from The Princeton Application Repository for Shared-Memory Computers (PARSEC) [9]. It is the ratio of correct classifications to total classifications. Analogously, decision accuracy is the quality metric of decision problems such as image binarization. It is expressed by the ratio between correct decisions and total decisions.

In essence, these metrics seek to compare some form of output — depending on the application, e.g., pixel values, body position, classification decision, execution time — produced by AC with the exact computation's result. This comparison estimates the quality of results, the metric that expresses in numbers how far the approximation is from the exact computation.

2.1.2 Motivation

The goal of AC is to trade off computational accuracy — expressed as quality of results — for some kind of benefit, such as execution speed or energy saving. In some scenarios, AC becomes unavoidable or the opportunity arises spontaneously from the usage context, while in other scenarios AC is specifically implemented in order to obtain some benefit [44].

2.1.2.1 Unavoidable Approximation or Need for Approximation

In the case of naturally noisy input, limited data precision, measurement error, defective hardware, or hard real-time constraints, the usage of approximation techniques is unavoidable; e.g., AC is natural in floating point operations, where rounding happens frequently. In many complex problems, finding an exact solution may not be feasible or, in the worst cases, may not be known, while finding an inexact solution is simple and can be sufficient [37].

2.1.2.2 Error-Resilience of Programs and Users

The limitations of human senses provide opportunities for AC in visual applications. For instance, image compression standards like JPEG use a lossy compression algorithm to achieve a reduction of image size that also reduces the overall image quality. The compression operates by transforming the input image from the 2D spatial domain into the frequency domain, using the Discrete Cosine Transform (DCT) [2], and a perceptual model, loosely based on the human psychovisual system, discards high-frequency information, i.e., sharp transitions in intensity and color hue [63].

Similarly, many programs have portions of code that are not critical, and small errors in these do not affect the quality of results significantly, or at all. For example, according to Esmaeilzadeh et al. [20], in a 3D raytracer application 98% of floating point operations and 91% of data accesses are approximable. In the same way, since lower-order bits have smaller significance than higher-order ones, approximating them has only a small impact on the final result [48, 59, 14, 22, 47, 27, 56].

2.1.2.3 Energy Efficiency

These days, energy efficiency is becoming a key feature of devices, from big data centers to tiny embedded devices. Thus, we can use approximation techniques to reduce energy consumption. For instance, the energy required to read/write memory can be reduced at the cost of a loss in precision [43], by decreasing the Embedded DRAM (eDRAM)/Dynamic RAM (DRAM) refresh rate or the Static RAM (SRAM) voltage supply. Otherwise, there are approximation techniques that alleviate the scalability bottleneck [69], or that improve performance by terminating loop iterations early [58, 32], skipping memory accesses [68], or skipping code blocks [61].

2.1.3 Challenges

AC also presents several challenges that need to be addressed to exploit all the features of this domain.

2.1.3.1 Limited Application Domain and Correctness Issues

Due to the different nature of applications, it is not always possible to use AC approaches as a panacea in order to obtain better performance. We need to evaluate whether an application is promising for approximation and, if it is, which ACT is the best one to use.

The domain of approximable applications is a subdomain of the entire application universe. An example of a set of applications that cannot be approximated is cryptography applications. Indeed, the algorithms at the base of this class of applications do not tolerate any kind of error, and their operations rely only on exact computation.

Moreover, if an approximation strategy works for an application, it is not guaranteed that all the other known approximation strategies, detailed later in Section 2.2.4, are usable on the same application; furthermore, if more than one works, they will not necessarily obtain the same results. For instance, an approximation strategy can prevent program termination or may generate corrupted output data [4]; in addition, it may be the best approximation strategy for one application and, at the same time, the worst for another one. Thus, it is up to the user to choose, on an application basis, an approximation strategy that fits the requirements of that application.

2.1.3.2 Gains

We consider as gains all the performance improvements provided by the usage of any kind of ACT. Gains in AC are bounded. This is a simple concept, but it is important to understand. There are aggressive ACTs that may lead to high gains, but these gains have an upper bound defined by the application itself. In other words, let us suppose we enhance an application with a component — software or hardware — that makes every data operation, magically, consume no energy at the cost of a small approximation. This application will obtain a huge energy saving but, despite our magical component, its energy consumption will not be 0, because we are not changing at all the costs of the other operations of the application, such as loads and stores in memory.

2.1.3.3 Overhead

As we mentioned, there are different approximation strategies, which will be detailed in Section 2.2.4, leading to different ACTs. Each ACT needs to be implemented by the programmer or by an automatic tool in order to take advantage of the approximation gains. This means that the code base of an application is changed to exploit the approximation, and this variation may introduce an overhead of variable size. Thus, when we take advantage of an ACT, we need to take into account that sometimes the overhead introduced by the technique itself may be bigger than the benefits obtained; i.e., the implementation overhead of some ACTs is challenging, especially for systems, such as IoT devices, that have strict memory or energy constraints.

Moreover, we need to consider the ‘human overhead’ that an approximation requires. In fact, the requirement of some kind of programmer intervention may be a problem when we scale from a small application to a big one.

For instance, there is an ACT, discussed in Section 2.2.4, that requires the programmer to write different approximated versions of the same application. This leads to a big waste of memory on the device — the different versions must be stored somewhere — and also to a big effort for the programmer to implement this technique — he/she needs to handwrite dozens of different versions of the application for the same goal.

2.1.3.4 Quality Forecasting and Tunability

Some ACTs offer knobs to tune the approximation level of the technique itself [6]. Therefore, instead of executing every computation at full fidelity, the user chooses how much effort (energy or time) to spend, according to his needs in terms of result quality and the energy or time available.

After a knob adjustment, a good feature would be a prediction of the resulting quality of results, but accuracy forecasting increases the design and verification costs of the implementation beyond the costs of the approximation itself. Quality forecasting, then, may not be feasible, since its overhead can nullify the benefits of the approximation or even make them negative. In the following section we discuss ACTs that aim to address these challenges.

2.2 Strategies for Approximation

In the previous section, we provided a complete picture of the opportunities arising from AC and the challenges of the latter. Now we discuss how such opportunities can be exploited, which strategies of approximation are presented in the literature, and how these strategies may be classified according to Wyse [66].

Since we work on strategies of approximation that are applied to existing programs, we discuss codebase strategies instead of algorithmic ones.

2.2.1 Trade-Off

We bring to the reader's attention that every ACT introduces an implementation overhead into a program, as discussed in Section 2.1.3.3: e.g., let us suppose the existence of two different functions for the same computation, one exact and one approximated. We need to add new lines of code, hence new instructions, which impact power consumption, to choose each time which function must be executed. Thus, when we trade off precision in exchange for performance gains, we must always consider the implementation overhead as a cost that may nullify the gains of an ACT. All the following strategies, Section 2.2.4, are affected by the above-mentioned overhead, but we are not going to discuss for each ACT how this overhead impacts it since, for the proposed techniques, researchers have reduced it until it is negligible.
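A minimal sketch of this overhead, assuming two hypothetical versions of the same computation: the extra flag and the branch executed on every call are exactly the implementation overhead discussed above.

```c
#include <stdbool.h>

/* Exact version: visits every element. */
static double compute_exact(const int *data, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += data[i];
    return (double)sum / n;
}

/* Approximated version: visits every other element. */
static double compute_approx(const int *data, int n)
{
    long sum = 0;
    int used = 0;
    for (int i = 0; i < n; i += 2) {
        sum += data[i];
        used++;
    }
    return (double)sum / used;
}

/* The dispatcher is code the original program did not contain:
 * a flag to keep around and a branch taken on every call. */
static double compute(const int *data, int n, bool approximate)
{
    return approximate ? compute_approx(data, n) : compute_exact(data, n);
}
```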

2.2.2 Techniques Classification

In AC there is a vast number of ACTs, and their number is increasing fast. In Section 2.2.4 we present various significant ACTs covering different approaches to AC. The diversity of approaches can complicate discussions on the topic and obscure common patterns. A single monolithic category, AC, that spans disparate ideas is too broad to establish the general principles of the field. Thus, this section provides a classification of many techniques from the field of AC.

According to Wyse [66], AC techniques may be classified in 4 ways:

• hardware vs software
• determinism
• granularity
• computational resources

Hardware vs Software

This is the first and most basic classification. It distinguishes techniques by whether they require new hardware or not. An ACT is classified as a software technique if it can work on today's hardware. For instance, Loop Perforation, detailed in Section 2.2.4, is considered a software technique because it does not require new hardware components to be implemented. Hardware techniques, on the other hand, are the ones that require the design of new hardware or have some hardware requirements for their implementation. For instance, using inexact hardware such as a reduced-precision multiplier unit (Section 2.2.4) unlocks new opportunities but needs the design of, and hardware support for, a new multiplier.

Determinism

ACTs can be either deterministic or nondeterministic. We call deterministic the ACTs that, given the same inputs, provide the same outputs. Instead, a nondeterministic one may produce different outputs with the same input. In other words, while the results of deterministic approximation techniques are not aleatory, those of nondeterministic ones are. Thus, for a deterministic approximated program we can prove a hard bound, a statement of the form 'for any input, the output can be at most ε from the correct output', while for a nondeterministic one we offer only a statistical bound: 'for any input, the probability that the output exceeds a bound is less than P'. These guarantees are used in different application contexts. The first bound is important in long-running applications, such as scientific computing applications, where errors can compound catastrophically. The second one is appropriate in contexts that compute many independent outputs, for instance video frames or particle positions.
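Formally, writing f for the exact program and f̃ for its approximated counterpart (symbols introduced here only for illustration), the two guarantees above can be stated as:

\[
\text{deterministic: } \lvert \tilde{f}(x) - f(x) \rvert \le \epsilon \;\; \forall x,
\qquad
\text{nondeterministic: } \Pr\bigl[\lvert \tilde{f}(x) - f(x) \rvert > \epsilon\bigr] \le P \;\; \forall x.
\]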

Granularity

Wyse defines the granularity of an ACT as fine-grained or coarse-grained:

• fine-grained techniques apply to individual memory locations, instructions, or data packets;
• coarse-grained techniques affect entire blocks of code, functions, or data structures.

An approximation technique's granularity is a critical factor in its generality and efficiency gains. We consider an ACT as fine-grained if it applies to individual memory locations, instructions, or data packets. A fine-grained ACT can be very general: for example, it can potentially be applied to any multiplication in a program, but it must avoid critical operations that can arbitrarily disrupt the execution. On the opposite end, coarse-grained techniques affect large pieces of code that meet a list of restrictions required by the ACT itself; for instance, supposing we have an NN accelerator, we could only accelerate a portion of code that is an NN.

Computational Resources

ACTs generally offer benefits in resource efficiency and usage. Wyse classifies ACTs taking into account which computational resource benefits from the approximation.

The classification is made according to the classical resources in computer systems:

• compute
• storage


2.2.3 Identify Variables or Operations

Finding approximable variables and operations inside a program is really important, and it is the very first step to implement an approximation strategy. The approximable variables or operations in an algorithm depend on the implementation rather than on the algorithm itself. Thus, each application must be evaluated, looking for approximable portions of code, in order to choose which strategy can be used and where. This is achieved in two possible ways: manually or automatically.

Automatic Identification

The literature on AC proposes some frameworks [52, 13, 46] with the aim of discovering approximable portions of code. These frameworks work in a similar way, in 4 steps:

1. the framework looks for every variable, portion of code (such as loops), and operation in an application;

2. it selects one of the elements identified in the previous step and perturbs it;

3. it runs the perturbed application and evaluates the quality of results, comparing it with the execution without approximation;

4. if the quality of results is considered acceptable, the element that was perturbed is marked as approximable; if not, it is marked as non-approximable. The framework restarts from step 2 until all the elements identified at step 1 have been perturbed one by one.

The existing frameworks differ in how they select elements, which approximation strategy(ies) they choose, and how they compare the quality of the result.

Manual Identification

On the other side, there is language support [55, 67, 12] for AC. Researchers present some new variable types for the variables that can be approximated, or some annotation styles, similar to OpenMP [15], that keep track of the portions of code that can be approximated.

The programmer needs to use the variables or annotations exposed by this new language support, and therefore needs a low-level and deep knowledge of the algorithm to be approximated. This leads to a more precise approximation, but it requires a significant effort from the programmer, who not only needs to manually edit the program but also needs to be well versed in AC.
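As a purely illustrative example of this style in C, the sketch below marks approximable data and code with a hypothetical APPROX macro; the cited languages instead extend the type system or the compiler, so this is only a sketch of the programming model, not their actual syntax.

/* Hypothetical marker: here it expands to nothing and only documents intent;
 * a real language extension would give it semantics in the compiler. */
#define APPROX

/* Pixel luminance values: small errors are tolerable, so the programmer
 * marks both the data and the accumulator as approximable. */
APPROX static unsigned char luma[320 * 240];

int average_luma(void) {
    APPROX int sum = 0;
    for (int i = 0; i < 320 * 240; i++)
        sum += luma[i];            /* approximable arithmetic            */
    return sum / (320 * 240);      /* loop bounds and index remain exact */
}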

2.2.4 Strategies

Once the elements that can be approximated have been selected, it is time to choose an approximation strategy, such as skipping pieces of code, reducing the precision of operations, and so on. Here we present, according to Mittal [44], the most relevant and widely used ACTs in the literature. We discuss the implementation idea and the contexts where they are used. Note that the following strategies are not mutually exclusive, so they can be used in combination.

Memoization

Hardware or Software: Software

Determinism: Deterministic

Granularity: Coarse

Computational Resources: Compute

Classification

Memoization was not born as an ACT; for the sake of simplicity, we first describe how it works before approximation comes into play. The basic concept is to store the results of executions for later reuse with identical inputs. It works by maintaining a lookup table keyed by the tuple (function, input) and storing the result of that tuple. Every time a function is called, the lookup table is checked: if there is an entry with the same function and the same inputs, the result is not computed again but taken immediately from the table (a lookup-table access is O(1)); if the check does not return any result, the function is computed normally and the result is stored in a new entry of the lookup table for later reuse.

This is a common technique to save energy and computation but, as we said, it is not yet an ACT. We now add an approximate component to memoization: we reuse the results stored in the lookup table not only for identical but also for similar inputs to the same function [25, 5]. This is the idea of approximate memoization: it improves the hit rate of the lookup table, resulting in more aggressive power and computation savings at the cost of an approximation, since the results taken from the lookup table are not always the exact ones.
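A minimal C sketch of approximate memoization for a one-argument function follows; the table size, the quantization step used as similarity criterion, and the choice of sin() are illustrative assumptions.

#include <math.h>
#include <stdbool.h>

#define TABLE_SIZE 64
#define QUANTUM    0.01   /* inputs closer than this share one table entry */

typedef struct { bool valid; long key; double result; } entry_t;
static entry_t table[TABLE_SIZE];

/* Approximate memoization of sin(): similar inputs quantize to the same key,
 * so a stored result is reused even when the input is not identical. */
double approx_sin(double x) {
    long key  = lround(x / QUANTUM);       /* quantize the input           */
    long slot = key % TABLE_SIZE;
    if (slot < 0)
        slot += TABLE_SIZE;                /* keep the index in range      */

    if (table[slot].valid && table[slot].key == key)
        return table[slot].result;         /* hit: possibly inexact result */

    double r = sin(x);                     /* miss: compute and store      */
    table[slot].valid  = true;
    table[slot].key    = key;
    table[slot].result = r;
    return r;
}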


Memoization is a powerful technique but it cannot be used in every possible scenario. In fact, there is no guarantee that if an input of a function is close to another one, the results will be close as well. In other words, for a function like sin(x) memoization works really well because for small variations of the parameter x the function assumes similar values: sin(x) ≈ sin(x ± ε). For other functions, for example discontinuous functions, a similar input parameter does not always lead to similar results; in these cases memoization does not work well and must be avoided.

Precision Scaling

Hardware or Software: Software and Hardware

Determinism: Deterministic

Granularity: Fine

Computational Resources: Compute and Storage

Classification

Precision scaling is a group of ACTs that works by changing the precision (the bit width) of input or intermediate operands to reduce storage or computation effort. For example, suppose we have a FP variable that requires 64 bits (double precision) to be stored exactly, and we store it in 32 bits (single precision) instead; the computation will no longer be exact, since we lose part of the fractional value, but it will be faster and consume less power.

According to Yeh et al. [69], precision scaling provides two software optimization opportunities and one hardware optimization opportunity:

complex floating point operations to trivial ones — software: this technique changes the precision of values stored in memory; it may then happen that a complex operation, such as a FP multiplication, becomes a trivial one. E.g., 42 × 1.000002 = 42.000084 is a complex FP operation, but reducing the precision of the second operand turns it into a faster and less power-consuming trivial multiplication by one: 42 × 1 = 42;

memoization improvements — software:

precision scaling, through precision reduction, maps close values to a single one. For instance, given two FP variables, 2.3456 and 2.672, if we change their precision to a 16-bit integer, they both become equal to 2. This increases the coverage of the memoization technique, detailed in Section 2.2.4, achieving all the benefits of that ACT.

hierarchical Floating Point Unit (FPU) architecture — hardware:

Yeh et al. [69] propose a hierarchical FPU. It takes advantage of the fact that different FP operations may require different precisions. Thus, they propose a FPU with different precision levels that can be selected according to the user's needs. This leads to faster computation for operations requiring less precision, because a simpler FPU (one that operates on fewer bits) is chosen.
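A minimal software sketch of precision scaling in C: the same dot-product kernel in an exact double-precision version and in a scaled single-precision version; the kernel and its names are illustrative.

#include <stddef.h>

/* Exact version: 64-bit operands and accumulator. */
double dot_exact(const double *a, const double *b, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Precision-scaled version: operands and accumulator demoted to 32 bits,
 * trading accuracy for halved storage and cheaper arithmetic. */
float dot_scaled(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}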

Load Value Approximation

Hardware or Software: Hardware

Determinism: Deterministic

Granularity: Fine

Computational Resources: Storage, Communication

Classification

Load Value Approximation exploits the occurrence of load misses in the cache. In a computer system, the values needed for computation by the Central Processing Unit (CPU) are stored in small, low-latency memories called caches. Sometimes, not all the needed values can be stored in the caches at the same time; when a value is needed and is not currently stored in the cache, we have what is called a load miss: the processor needs to wait, technically it stalls, until the value is loaded into the cache.

Whenever a cache miss occurs, the required value must be loaded from a lower level of cache or from main memory; this introduces latency, since loads from lower cache levels and from memory are not as quick as a load from the first-level cache.

Load value approximation leverages the approximable nature of applications to estimate load values, introducing a hardware component, the load value predictor, that estimates the value that missed in the cache. Thus, it allows the processor to progress without stalling for a response, hiding the cache miss latency.

For example, Miguel et al. [42] present a new load value technique: compared to a traditional load value predictor, in which a block needs to be fetched on every cache miss to confirm the correctness of the prediction, their technique fetches the blocks only occasionally, just to train the approximator. Thus, fetching a cache block on each cache miss is not required, which significantly reduces memory accesses.
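The following C fragment simulates the mechanism in software, assuming a simple last-value predictor indexed by the load address and two hypothetical hooks for the cache and the background fetch; actual proposals such as [42] implement this in hardware with more elaborate predictors.

#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 256

/* Last-value predictor: remembers the last value seen for each address bucket. */
static uint32_t predictor[PRED_ENTRIES];

extern bool cache_lookup(uintptr_t addr, uint32_t *value);  /* hypothetical: cache hit?                  */
extern void fetch_in_background(uintptr_t addr);            /* hypothetical: trains the predictor later  */

uint32_t approximate_load(uintptr_t addr) {
    uint32_t value;
    unsigned idx = (unsigned)(addr % PRED_ENTRIES);

    if (cache_lookup(addr, &value)) {
        predictor[idx] = value;        /* keep the predictor trained on hits */
        return value;                  /* exact value                        */
    }
    /* On a miss, return the predicted value instead of stalling; the block is
     * fetched only occasionally, in the background, to train the predictor. */
    fetch_in_background(addr);
    return predictor[idx];
}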

Loop Perforation

Hardware or Software: Software

Determinism: Deterministic

Granularity: Coarse

Computational Resources: Compute

Classification

First proposed by Hoffmann [32], Loop Perforation is as simple as it is smart: it is a technique that exploits the loops present in the source code [58]. The idea behind Loop Perforation lies in the observation that not all the iterations of a loop are equally important. Therefore, it is possible to skip some iterations, generating an approximated result that may be good enough for the quality of results requested by the user [32].

This technique assumes that the information in a set is generally not equally distributed and that, even when it is, discarding a part of it still generates an acceptable result, within an approximation.

E.g., let us suppose we have a function that takes a set of numbers and computes their mean by looping over them. Our set S is the set of even numbers in [0, 100), whose exact mean is 49.0. Now let us apply loop perforation and skip 1 item out of every 3 of S: the new calculated mean is 49.52. We obtain this good result because the information in this set is uniformly distributed and we skipped items using a constant pattern.

This is a small taste of Loop Perforation. Since it is our technique of interest, we detail it further in Section 5.1.
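A minimal C sketch of the mean example with a perforation knob; dropping every skip-th iteration is only one of the possible perforation patterns.

#include <stddef.h>

/* Mean of 'data' with loop perforation: every 'skip'-th iteration is dropped.
 * Passing skip == 0 disables perforation and returns the exact mean. */
double perforated_mean(const double *data, size_t n, size_t skip) {
    double sum = 0.0;
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (skip != 0 && i % skip == 0)
            continue;                  /* perforated iteration: skipped */
        sum += data[i];
        used++;
    }
    return used ? sum / used : 0.0;
}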

Task Skipping

Hardware or Software: Software

Determinism: Deterministic

Granularity: Coarse

Computational Resources: Compute


Task Skipping allows code blocks to be skipped during program execution according to a specific boolean condition checked at run-time. Therefore, there are many possible code paths: one complete path that computes the exact solution, and many other incomplete paths that produce approximated solutions while reducing power consumption and computational time.

For example, let us suppose we have a program with two different functions, foo and bar, and that we need to update a value N times: foo calculates a raw increment for value, and bar makes the increment computed by foo more precise. Each time, the user or an automatic tool can choose whether to perform the exact computation or not by controlling a boolean value, in our example exactComputation. Algorithm 1 is an explanatory pseudocode for the example's behavior.

Procedure 1. Task Skipping example

value ← 0
for i = 0 to N do
    value ← value + foo(i)
    if exactComputation is True then
        value ← value + bar(i)
    end if
end for

Multiple Inexact Program Versions

Hardware or Software: Software

Determinism: Deterministic

Granularity: Coarse

Computational Resources: Compute, Storage and Communication

Classification

There are many applications, such as those based on compression algorithms or on math functions, that are not approximable using a single algorithm with an adjustable accuracy-versus-performance knob. Therefore, the literature includes techniques that use multiple versions of the application code with different trade-offs between accuracy and overhead. With this idea of Multiple Inexact Program Versions in mind, some frameworks were born, giving programmers the opportunity to write a program with different levels of approximation sensitivity.


SAGE [53], a framework for Graphics Processing Units (GPUs), chooses at runtime, among various Compute Unified Device Architecture (CUDA) kernels created at compile time, the one that best fits the quality target.

GREEN [6], using multiple program versions provided by the programmer, meets the required level of quality of results at runtime, achieving energy and performance gains.
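A generic C sketch of runtime selection among multiple program versions; the kernels, the version table, and the quality knob are illustrative and do not reproduce the interfaces of SAGE or GREEN.

/* Three hypothetical implementations of the same kernel, ordered from most
 * accurate (and most expensive) to least accurate (and cheapest). */
extern void kernel_exact(const float *in, float *out, int n);
extern void kernel_medium(const float *in, float *out, int n);
extern void kernel_coarse(const float *in, float *out, int n);

typedef void (*kernel_fn)(const float *, float *, int);

static const kernel_fn versions[] = { kernel_exact, kernel_medium, kernel_coarse };

/* quality_level: 0 selects the exact version, higher values trade accuracy
 * for speed and energy by selecting a more approximate version. */
void run_kernel(const float *in, float *out, int n, unsigned quality_level) {
    unsigned last = (unsigned)(sizeof versions / sizeof versions[0]) - 1;
    if (quality_level > last)
        quality_level = last;
    versions[quality_level](in, out, n);
}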

Using Inexact or Faulty Hardware

Hardware or Software: Hardware

Determinism: Deterministic

Granularity: Fine

Computational Resources: Compute and Storage

Classification

This is a large bucket of approximation strategies: it collects all the strategies that enhance performance by designing new hardware. Indeed, the use of inexact or faulty hardware, which takes up a smaller area or has a shorter critical path, unlocks new opportunities for performance gains.

Kulkarni et al. [36] present a reduced-precision 2 × 2 multiplier unit that produces an inexact result (with fewer bits) in 1 case out of 16: it has a 3-bit output, so 3 × 3 yields 7 instead of 9, because 11₂ · 11₂ is approximated as 111₂. This achieves a minimum power saving of 30% compared to an exact multiplier.
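A bit-level C model of this building block, plus a 4 × 4 multiplier assembled from four such blocks with a standard shift-and-add composition; this is a behavioural simulation for illustration and omits the circuit-level optimizations of the original design.

#include <stdint.h>

/* Approximate 2x2 multiplier: exact for 15 of the 16 input pairs; the single
 * inexact case is 3 x 3, which yields 7 (111 in binary) instead of 9, so the
 * output fits in 3 bits instead of 4. */
static uint8_t mul2x2_approx(uint8_t a, uint8_t b) {
    if (a == 3 && b == 3)
        return 7;
    return (uint8_t)(a * b);
}

/* 4x4 approximate multiplier built from four 2x2 blocks: the operands are
 * split into 2-bit halves and the partial products are shifted and added. */
static uint8_t mul4x4_approx(uint8_t a, uint8_t b) {
    uint8_t aH = (a >> 2) & 0x3, aL = a & 0x3;
    uint8_t bH = (b >> 2) & 0x3, bL = b & 0x3;
    return (uint8_t)((mul2x2_approx(aH, bH) << 4) +
                     ((mul2x2_approx(aH, bL) + mul2x2_approx(aL, bH)) << 2) +
                       mul2x2_approx(aL, bL));
}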

Kahng et al. [35] present the design of an inexact adder that avoids the carry chain to reduce the critical-path delay, which can be used to improve performance and/or energy efficiency.

Ganapathy et al. [22] present a technique that works with faulty hardware. They minimize the magnitude of errors when using unreliable memories, in contrast to the Error-Correcting Code memory (ECC) technique, which actually corrects the errors. They store words in memory placing the Least Significant Bits (LSBs) into the faulty memory cells, which reduces the magnitude of the output error. They show that, compared to the use of ECC, their technique achieves significant improvements in latency, power, and area.


Voltage Scaling

Hardware or Software: Hardware

Determinism: Nondeterministic

Granularity: Fine

Computational Resources: Compute

Classification

Voltage scaling is a technique based on reducing the supply voltage of some components. This obviously reduces power consumption, but it may lead to errors or failures in the under-supplied components.

For instance, Sampson et al. [55] reduce the SRAM supply voltage to save leakage energy but, in doing so, increase the probability of read upsets and write failures.

Similarly, Mittal [43] shows that efficiency gains can be achieved by managing the supply voltage of DRAM.

Reducing Branch Divergence in SIMD Architectures

Hardware or Software: Software

Determinism: Deterministic

Granularity: Coarse

Computational Resources: Compute

Classification

The idea behind these strategies is to exploit the fact that, on Single Instruction Multiple Data (SIMD) architectures, multiple threads executing the same set of instructions may sometimes diverge on a branch instruction, for instance an if-else statement or a loop. This leads to slower computation, since execution can no longer proceed fully in parallel. Researchers such as Grigorian et al. [26] and Sartori et al. [57] present software techniques that limit or, even better, avoid these divergences, achieving large gains in performance at the cost of a reduction in the quality of results.


Synchronization Elision

Hardware or Software: Software

Determinism: Nondeterministic

Granularity: Coarse

Computational Resources: Compute

Classification

Renganarayana et al. [50] observe that in parallel programs a large part of the time is spent not on computation but rather on the synchronization of different threads.

They propose a methodology for parallel programming with relaxed synchronization. This methodology can help reduce, and in some cases completely eliminate, the synchronization overhead in parallel programs, sometimes sacrificing the quality of results, depending on the race conditions that occur.

Rinard [51] exploits these ideas and presents approximate data structures with construction algorithms that execute without synchronization. The data races present in these algorithms may cause them to drop inserted or appended elements, producing approximated results. Nevertheless, the algorithms do not crash and may produce a data structure that is accurate enough for its users' requirements.
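A minimal C11 sketch in the spirit of these approximate data structures: a list insertion that elides the usual lock or compare-and-swap retry loop, so concurrent insertions may overwrite each other and silently drop elements, while every element that remains reachable is fully initialized. The structure and function names are illustrative.

#include <stdatomic.h>
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node_t;
static _Atomic(node_t *) head = NULL;

/* Synchronization-elided push: no lock and no retry loop. If two threads push
 * concurrently, one update of 'head' may be lost and that element dropped
 * (an approximate result), but the list never becomes corrupted. */
void approx_push(int value) {
    node_t *n = malloc(sizeof *n);
    if (n == NULL)
        return;                        /* drop the element on allocation failure */
    n->value = value;
    n->next  = atomic_load_explicit(&head, memory_order_acquire);
    atomic_store_explicit(&head, n, memory_order_release);
}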

Approximated Neural Network

Hardware or Software: Software and Hardware

Determinism: Deterministic and Nondeterministic

Granularity: Coarse

Computational Resources: Compute

Classification

Today NNs are becoming very popular and they are present in many systems. Some researchers [62, 71, 17] observe that NNs are usually used in error-tolerant applications. Thus, they propose techniques to create Approximate Neural Networks (AxNNs).

They devise techniques that identify critical and error-resilient neurons, which are the basic units of a NN, and, by approximating the weights and computations of the error-resilient neurons while preserving the critical ones, they reduce energy consumption and execution time with a limited loss in output quality.
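As a simple software illustration of the idea (not the specific designs of [62, 71, 17]), the sketch below quantizes the weights of error-resilient neurons to 8-bit values while leaving the weights of critical neurons in full precision; the resilience map is assumed to come from a prior sensitivity analysis, and the fixed quantization scale is an illustrative choice.

#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Quantize one weight to an 8-bit grid with the given scale, then dequantize:
 * the weight keeps its data type but loses precision. */
static float quantize_q8(float w, float scale) {
    float clamped = fmaxf(-127.0f, fminf(127.0f, w / scale));
    int8_t q = (int8_t)lrintf(clamped);
    return (float)q * scale;
}

/* Approximate only the error-resilient neurons: weights[i] is the weight of
 * neuron i, resilient[i] tells whether neuron i tolerates approximation. */
void approximate_weights(float *weights, const bool *resilient, int n, float scale) {
    for (int i = 0; i < n; i++)
        if (resilient[i])              /* critical neurons keep full precision */
            weights[i] = quantize_q8(weights[i], scale);
}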
