POLITECNICO DI MILANO
Master of Science programme in Computer Science and Engineering, Dipartimento di Elettronica, Informazione e Bioingegneria
Automatic performance tuning with
past knowledge
Moviri, Akamas Research Group
Advisor: Paolo Cremonesi
Co-advisor: Stefano Cereda
Co-advisor: Stefano Doni
Master's thesis by:
Zenari Giovanni, 899141
Abstract
In the context of performance optimization, automatic performance tuning is a trending topic that has gained popularity in recent years. Still, the main approach used today remains hiring an expert to optimize the performance of an IT system. With the continuous increase in the complexity of modern IT stacks, however, manual tuning is becoming infeasible. Many researchers have proposed approaches to automatic performance tuning based on different Artificial Intelligence (AI) techniques.
This thesis investigates the use of Recommender Systems in the field of automatic performance tuning. The main idea that we want to demonstrate is that similar IT systems can be optimized in a similar way. We use Recommender Systems to exploit past knowledge on the optimization of IT systems to speed up the optimization process of our target system. The topic we propose here is still mostly unexplored in the research world; thus, we explore it in its most basic form, laying the foundation for future expansion. From this work we can understand how Recommender Systems can be used in the context of automatic performance optimization with positive results. Moreover, compared to state of the art techniques in automatic performance tuning, we obtain notable improvements thanks to the use of past knowledge.
Keywords: Automatic Performance Optimization, Recommender Systems, Artificial Intelligence, Machine Learning
Sommario
In the context of performance optimization, in recent years much effort has been devoted to making the optimization process automatic. This has been driven by a growing interest in the methods developed in this field and, consequently, in the benefits that can be obtained from them. In the current industrial context, performance optimization still remains a manual process: it involves hiring an expert who carries out the task of optimizing the performance of an IT system. This manual process is becoming less and less practicable because of the ever-increasing complexity of modern IT systems. Many researchers have proposed different approaches to automatic performance optimization, most of which rely on techniques from the field of Artificial Intelligence.
In this thesis we investigate the use of techniques from the field of Recommender Systems for automatic performance optimization. The idea is to use previously acquired knowledge about the performance optimization of various IT systems to optimize a new system never seen before. In this scenario, Recommender Systems have the task of building a notion of similarity between IT systems. The assumption at the basis of this method is that similar IT systems can be optimized in a similar way, and we want to exploit this information to speed up the optimization process through the use of Recommender Systems. The research work carried out in this field is still very limited; the work presented here will also serve as a basis on which new and more complex research in this area can be developed.
From this work it is possible to see how, by using techniques from Recommender Systems in the context of automatic performance optimization, interesting results can be obtained. Compared with the state of the art, the approaches we propose show notable improvements thanks to the use of previously acquired knowledge.
Contents
1 Introduction
2 Background
   2.1 Performance Optimization
      2.1.1 Automatic Performance Optimization
      2.1.2 Bayesian Optimization
      2.1.3 CherryPick
      2.1.4 OpenTuner
      2.1.5 Cobayn
   2.2 Recommender System
      2.2.1 Content Based Filtering
      2.2.2 Collaborative Filtering
3 Dataset
   3.1 Introduction
   3.2 Tools
      3.2.1 DaCapo
      3.2.2 Apache Cassandra
      3.2.3 YCSB
      3.2.4 Amazon Web Services
      3.2.5 Java
      3.2.6 Linux
   3.3 DaCapo
      3.3.1 Settings
      3.3.2 Design of Experiments
      3.3.3 Experimental Settings
      3.3.4 Input Size
      3.3.5 Selected Benchmarks
      3.3.6 Dataset Analysis
      3.3.7 Avrora
   3.4 Cassandra
4 Surrogate Model
   4.1 Introduction
   4.2 Models
      4.2.1 Linear Regression
      4.2.2 K-nearest Neighbour Regressor
      4.2.3 Decision Tree Regressor
      4.2.4 Random Forest Regressor
      4.2.5 Gaussian Process Regression
   4.3 Models Evaluation
      4.3.1 R2
      4.3.2 RMSE
   4.4 Results
5 Proposed Methodology
   5.1 Introduction
   5.2 Surrogate Model Interface
   5.3 Data Preparation
   5.4 Recommender System Components
      5.4.1 Top Popular
      5.4.2 Collaborative Filtering
      5.4.3 Content-Based Filtering
   5.5 Cassandra DaCapo Interface
6 Evaluation Methodology
   6.1 Experimental Setup
   6.2 Evaluation Metrics
   6.3 Hyper-Parameter Tuning
   6.4 Selected Application
   6.5 Starting Configuration
   6.6 Compared Techniques
7 Results
   7.1 Introduction
   7.2 DaCapo - DaCapo (DD)
   7.3 Cassandra - DaCapo (CD)
8 Conclusion
   8.1 Dataset
   8.2 Recommender Systems
A System Metrics
List of Figures
2.1  Example of how BO works. In black we have our objective function. The blue line is the function predicted by the gaussian process regressor and the blue shaded area the confidence interval. The black points are the ones queried so far to obtain this function shape, and the red point is the best point found by BO.
2.2  CherryPick workflow.
2.3  Comparing CherryPick with coordinate descent. The bars show the 10th and 90th percentile.
2.4  Running cost of configurations found by CherryPick and random search. The bars show the 10th and 90th percentile.
2.5  GCC/G++ Flags: Execution time (lower is better) as a function of autotuning time. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles. Note that in (a) the O1/O2/O3 and in (c) the O2/O3 lines are on top of each other and may be difficult to see.
2.6  Normalized performance improvement (NPI) with regard to the RIC model. The performance of the proposed approach is represented with black bars, the RIC technique with white bars. Higher values mean better performance.
3.1  Both figures represent the sampling of 500 points in a domain (0.0, 1.0) for both x and y. The figure on the left samples using a random uniform approach, the one on the right using a Sobol quasi-random sequence. Both plots are decorated with histograms on the x and y axes.
3.2  The first figure represents the cumulative amount of time the JIT spends compiling. The second figure shows the total number of classes loaded by the JVM during execution. On both, the x axis reports the time spent in minutes. The red band represents the region in which we consider the execution to be in a stationary phase and in which we collect the measurements relative to our applications.
3.3  Both figures show the distribution of the duration per benchmark. In the first plot the durations are normalized with respect to the baseline. In the second plot the durations are transformed using the logarithm function.
3.4  The three plots show how the GC impacts the duration of the different benchmarks. The first plot gives a general overview of how, on average, the choice of the GC impacts all the benchmarks. The second and third plots show the impact of the GC subdivided for each benchmark, the third plot using a logarithm transformation on the y axis.
3.5  The three plots show how the THP setting impacts the duration of the different benchmarks. The first plot gives a general overview of how, on average, the choice of the THP impacts all the benchmarks. The second and third plots show the impact of the THP subdivided for each benchmark, the third plot using a logarithm transformation on the y axis.
3.6  The figure shows how the duration is affected by the max heap size parameter, measured in Kilobytes, in the case of the eclipse benchmark. Each box plot represents a range of values for the max heap.
3.7  The figures show the impact of the number of parallel GC threads on the duration distribution, the first plot for Jython, the second for PMD. The numbers of parallel GC threads are binned.
3.8  Matrix showing the Pearson correlation coefficient computed between couples of configuration vectors. The configurations tested are the ones having the best performance on the relative benchmark. The last row/column represents the configuration giving the average best result across all the benchmarks.
4.1  Example of a Decision Tree that evaluates if a person can be considered fit or unfit.
4.2  Distribution of the R2 and RMSE scores subdivided for the various DaCapo benchmarks.
4.3  Distribution of the R2 and RMSE scores subdivided for the various Cassandra workloads.
4.3  Each plot shows the value of the RMSE and R2 scores per DaCapo benchmark. The RMSE is identified by the blue line and the R2 by the orange line. The RMSE score has been normalized by the baseline duration to plot it on the same scale as R2. The two dashed lines help to understand the maximum and minimum values obtained: the green line marks the minimum value found for RMSE, the purple line the maximum value found for R2. These two lines help to visualize whether the R2 or RMSE scores have reached a steady phase.
4.3  For Cassandra we plot the R2 and RMSE scores grouped by workload. In this way it is easier to compare the results of the three workloads. The RMSE is normalized by the baseline throughput.
5.1  Framework schema.
5.2  Example of application of the top popular algorithm to suggest a film to watch.
5.3  Example of the technique that we use to build a knowledge base that needs to work with different types of applications and parameters. The gray columns identify different parameters.
6.1  The orange line represents the R vector. The blue line represents the PI vector obtained from the orange line. This example has the objective of clarifying why the operation described in equation 1.2 is performed.
6.2  Example of what AUC values are when computed on a cosine function. The value of the AUC measure is identified by the green shaded area between 0 and a value k; in this case the value of k is close to 1.
7.1  Performance improvement (PI) of the different approaches tested over 200 iterations. In this plot, for each approach, we plot the median over all the DaCapo benchmarks tested.
7.2  Box plots showing the distribution of PI for different benchmarks at fixed iteration numbers. It is possible to see the variance that each approach has over all the DaCapo benchmarks tested.
7.3  Plots showing the relationship of the ratings metric between DaCapo benchmarks on the same set of configurations. Each point represents one configuration. On the x-axis and y-axis a density plot is present to help read the scatter plot. The black dashed lines correspond to the lines with equation y = 1 and x = 1; these represent the baseline ratings for the two benchmarks. A rating below 1 means that we are improving the performance of the benchmark with respect to the baseline.
7.4  Performance improvement (PI) of the different approaches tested over 200 iterations. In this plot, for each approach, we plot the mean over all the Cassandra workloads.
7.5  Heat-map representing a comparison between the best configurations on different applications. Each cell identifies the Pearson correlation coefficient between the best configurations of the two applications. The best configuration is identified by the configuration with the best performance improvement with respect to the baseline.
7.6  Plots showing the relationship of the ratings metric between DaCapo benchmarks and Cassandra workloads on the same set of configurations. Each point represents one configuration. On the x-axis and y-axis a density plot is present to help read the scatter plot. The black dashed lines correspond to the lines with equation y = 1 and x = 1; these represent the baseline ratings for the two applications. A rating below 1 means that we are improving the performance of the applications with respect to the baseline.
7.7  Performance Improvement (PI) over the three different workloads of Cassandra, for 200 iterations.
7.8  Radar plot synthesizing the similarity measure between Cassandra workloads and DaCapo benchmarks. Each plot represents a workload of Cassandra, each axis a benchmark of DaCapo. The point on each axis is given by the inverse of the similarity computed between the two applications. The similarity is computed with the Cosine Distance using the CK metrics, as done in the CBF with CK metrics.
List of Tables
2.1  Search space sizes in number of possible configurations, as represented in OpenTuner.
3.1  Table summarizing the tools used for the DaCapo dataset.
3.2  Lower bound estimation of the number of parameters present in the JVM and OS layers.
3.3  Compact Routing Example.
3.4  First 5 OS parameters found as the most important by the random forest regressor. The higher the value of the Importance column, the higher the importance of the parameter in estimating the performance metrics.
3.6  OS parameters used in the DaCapo dataset, with a small description of what they control. More information can be found on the Red Hat developer website.
3.7  Tools used to extract the system metrics collected during each experiment. JVM and OS in this case refer to the systems defined in Table 3.1.
7.1  Table summarizing the hyper-parameter tuning results. Each cell represents an AUC value and each table is associated with one of the different proposed approaches having parameters to tune. The rows identify the four distance metrics used: Euclidean Distance (ed), Cosine Distance (cosine), Jensen-Shannon distance (js) and Correlation Coefficient (corr). In bold we underline the best AUC value.
7.2  Table summarizing the hyper-parameter tuning results. Each cell represents an AUC value and each table is associated with one of the different proposed approaches having parameters to tune. The rows identify the four distance metrics used: Euclidean Distance (ed), Cosine Distance (cosine), Jensen-Shannon distance (js) and Correlation Coefficient (corr). In bold we underline the best AUC value.
Chapter 1
Introduction
In a world where technology is everywhere and pervades every aspect of our lives, we grow more and more dependent on it. At the same time, our expectations of its capabilities continuously increase. The study of the performance of computer systems is a key factor for success in the business world. The performance of a computer-based system is often characterized by its ability to perform defined sets of activities at fast rates and with quick response times. Quick response times, speed, and scalability are highly desired attributes of any computer-based system. They are also competitive differentiators, that is, attributes that distinguish a system from other systems with like functionality and make it more attractive to a prospective buyer or user.
If a system component has poor performance, the system as a whole may not be able to function as intended. If a system has poor performance, it will be unattractive to prospective users and buyers, and if the project fails as a result, the investment in building the system will have been wasted. Examples of such systems are: a command and control system, a transaction-based system, an information retrieval system, a video game, an entertainment system, a system for displaying news, or a system for streaming media. The importance of performance may be seen in the following examples:
• A government-run platform for providing services on a grand scale must be able to handle a large volume of transactions from the date it is brought online. If it is not able to do so, it will be regarded as ineffective. In the United States, the federal web site for applying for health insurance mandated by the Affordable Care Act, healthcare.gov, was extremely slow for some time after it was made available to the public. According to press reports and testimony before the United States Congress, functional, capacity, and performance requirements were unclear. Moreover, the system was not subjected to rigorous performance tests before being brought online.
• A medical records system must be able to pull up patient records and images quickly so that retrieval will not take too much of a doctor’s time away from diagnosis and treatment.
• An online securities trading system must be able to handle large num-bers of transactions per second, especially in a volatile market with high trading volume. A brokerage house whose system cannot do this will lose business very quickly because slow execution could lead to missing a valuable trading opportunity.
The foregoing examples show that performance is crucial to the correct func-tioning of a software system and of the application it controls. As such, performance is an attribute of system quality that presents significant busi-ness and engineering risks. In some applications, such as train control and fire alarm systems, it is also an essential ingredient of safety. Performance engineering mitigates these risks by ensuring that adequate attention is paid to them at every stage of the software lifecycle, while improving the capacity of systems, improving their response times, ensuring their scalability, and increasing user productivity. All of these are key competitive differentiators for any software product.
Today, IT stacks come with hundreds of configuration choices impacting application performance and infrastructure costs. Turning the right knobs, also called parameters, can yield dramatic improvements, but the number of possible combinations has grown beyond what any human can manage; indeed, managing configurations, which are sets of parameters, has become one of the main challenges faced by users of modern systems. Most of the time these parameters are left at their default values and are tuned only when required to solve a performance issue. A performance analysis is rarely performed at the development stage of an application, even though it has been demonstrated to yield significant improvements [18]. What blocks the performance tuning of an application is mostly the high complexity of this process. Indeed, there is no overall optimal configuration that can be used; instead, each application must be tuned ad hoc. As a consequence, in the IT sector the entire process of parameter tuning is restricted to high-paying customers with specific needs in terms of performance. Hence the necessity to hire experts in the performance tuning of the specific application, who spend weeks in the task of finding the optimal
configuration. The process that these experts follow can be summarized in the following steps:
• Identification of the conditions that the system is experiencing in a real scenario.
• Replication of the system and of the conditions it is subject to in a controlled environment.
• With a trial and error method, guided by experience and past knowledge, configurations are tested until the optimal configuration is found.
• The optimal configuration found on the replicated system is applied to the real one.
These steps, even if not extremely complex, are time-consuming and require weeks to complete, even for professionals. Still, the available configurations continue to increase, making it harder and harder for humans to carry out these tasks.
Researchers have proposed different solutions to automate the performance optimization process. Most of these approaches try to find the best search techniques to solve the problem faster. Other researchers propose the use of different Machine Learning (ML) algorithms, such as Bayesian Optimization, and some have tried to use past knowledge (previously collected knowledge) to speed up the optimization task. The concept of previously collected knowledge here has a particular meaning: it refers to data collected in the past, containing information on application performance, where performance is expressed by metrics like throughput, response time, execution time and many others.
Our research is based on building an automatic performance optimization approach which makes use of Recommender System (RS) techniques to exploit previously collected knowledge and speed up the optimization task. RS are a subclass of information filtering systems, commonly used on e-commerce web sites, that present information on items and products that are likely to be of interest to the user. In presenting the recommendations, the RS uses details of the user profile and the opinions and habits of its whole community of users, and compares this information to reference characteristics to produce the recommendations. With the rise of YouTube, Amazon, Netflix and many other such web services, Recommender Systems have taken more and more place in our lives, becoming an almost unavoidable element in our daily online journeys. The approach that we come up with tries to optimize
the performance of a new application, never seen before, using previously collected knowledge about other applications that we have already optimized. Data collected in the past on the optimization of different applications can be used in the present to optimize new applications never optimized before. In this context, Recommender Systems arise as a perfect tool for filtering and selecting which part of the previously collected data could be useful or not, building a concept of similarity between applications. The use of RS techniques is the same as in a movie Recommender System, only that in this case we are no longer talking about films and users but about applications and configurations. The main idea of exploiting similarities between users to recommend films to watch is applied in the same way here, only we use similarities between applications.
The proposed approach is tested on the optimization of different benchmarks: the DaCapo benchmark suite and the Cassandra NoSQL database. The objective is to optimize these benchmarks and compare the results obtained by our approach with state of the art techniques. For DaCapo, our approach improves on the state of the art technique in terms of area under the curve, reaching a near-optimal configuration after testing only a few configurations. Also for Cassandra, we were able to improve the area under the curve metric, giving a boost in performance for the initial configurations tested with respect to the state of the art approach.
To evaluate and test the proposed approach we build a dataset collecting information on the optimization of the different benchmarks that we consider. This dataset will be made freely available to other works, research-oriented or not.
Outline
The rest of the work is divided into six main chapters:
• The Background: The Background chapter covers the related work needed to gain some knowledge of the literature and analyzes some relevant papers. First there is the state of the art of automatic tuning systems and, after that, a section dedicated to Recommender Systems.
• The Dataset: The Dataset chapter describes the procedure used to retrieve a proper dataset for the automatic performance tuning objective.
• The Surrogate Model: The time required to build a dataset forced us to build a surrogate model upon it to speed up the work. Here we give the details on how the surrogate model was selected and the methodology we used to evaluate it.
• The Proposed Methodology: Here we formally present our proposed approach, which uses Recommender Systems to speed up automatic performance optimization.
• The Evaluation: In this chapter, we give the details on the method-ology used to evaluate our proposed approach.
• The Results: In the last chapter, we present the results obtained by our proposed approach.
Chapter 2
Background
In this chapter, we present the background necessary to understand our work. We can define two main sections in this chapter: the first gives an overview of the state of the art in the field of automatic performance optimization, with an introduction to some concepts of performance optimization; the second gives the theoretical fundamentals needed to understand how a RS works and the principal techniques that we used.
2.1 Performance Optimization
Performance optimization is the process of modifying some system-related components, i.e. parameters, to achieve an improvement in system performance. We use the term configuration to identify a set of parameters, of any size and kind. When we talk about system performance we must be able to quantify this concept of "performance". This is usually done with key performance indicators (KPIs): measurable attributes of our target system, such as throughput, response time, energy consumption, execution time and many others, which characterize the performance of our system. In this thesis these measurable attributes, or KPIs, are called performance metrics. Performance metrics, together with configurations, are what we look at when we optimize the performance of a system. We are interested in capturing the relationship that exists between configurations and performance metrics, to be able to achieve the best performance. To study this relationship we must perform experiments on the system we want to optimize. An experiment is the execution of a workload, a simulation of a real-world system load, with an associated specific configuration on the target system.
We now move to automatic performance optimization, which is the core of our proposed work, and present the main approaches that have arisen in recent years in this field.
2.1.1 Automatic Performance Optimization
In recent years the field of performance tuning has gained the interest of a continuously growing part of the research community. One technique that seems to have gained the interest of many research teams is Bayesian Optimization (BO). In particular, this technique has been shown to lead to very good results when used as the core technique of a performance tuning framework, in particular in the fields of Big Data [1] and High Performance Computing [47]. Given the relevance that BO has obtained, we present in the following section some theoretical background on its functioning, and we present one framework, CherryPick, which is used to improve the performance of Big Data workflows based on Spark and Hadoop.
BO is not the only technique in use; other techniques are also promising. We present two of them that we consider the most interesting, OpenTuner and Cobayn. The former is based on the use of different search techniques to find the best possible configuration for generic computer programs; the latter uses Bayesian Networks to find optimal compiler flags.
OpenTuner will be used as a comparison technique for the one that we present, while Cobayn is one of the closest techniques in terms of type of approach; thus we will refer to and take inspiration from this work in the rest of the thesis. The use of Bayesian Optimization does not fit our scenario, so we decided not to use it as a comparison technique.
2.1.2 Bayesian Optimization
Many optimization problems in machine learning are black-box optimization problems, where the objective function f(x) is unknown (i.e. we do not have an analytical expression for f). Evaluation of the function is restricted to sampling at a point x and getting a possibly noisy response. If f were cheap to evaluate we could sample at many points (e.g. via grid search, random search or numeric gradient estimation), but usually this is not the case, so we need to resort to other methods to evaluate the function f.
Bayesian Optimization is a framework capable of solving optimization problems in this domain, where the objective function f(x) is unknown beforehand but can be observed through experiments, and we want to find the global optimum in a minimum number of steps. Bayesian Optimization incorporates a prior belief about f and updates the prior with samples drawn from f (the experiments) to get a posterior that approximates f.
Figure 2.1: Example of how BO works. In black we have our objective function. The blue line is the function predicted by the gaussian process regressor and the blue shaded area the confidence interval. The black points are the ones queried so far to obtain this function shape, and the red point is the best point found by BO.
The model used for approximating the objective function is called the surrogate model. A popular surrogate model for Bayesian Optimization is the Gaussian Process (GP). GPs define a prior over functions and we can use them to incorporate prior beliefs about the objective function. By modelling the objective function as a stochastic process, in this case a Gaussian Process, Bayesian Optimization can compute the confidence interval of the objective function. A confidence interval is an area through which the curve of our objective function is most likely (e.g. with 95% probability) passing. For example, in Figure 2.1, the black line is the actual function, and with 5 samples Bayesian Optimization computed the confidence interval marked with a blue shaded area. The confidence interval can guide the selection of the next point to sample, giving information about how confident our model is about its prediction with the current information.
Bayesian Optimization can smartly decide the next point to sample by making use of an acquisition function that directs the sampling towards an area where an improvement over the best observed value is likely. Acquisition functions trade off exploration and exploitation. Exploitation means sampling where the surrogate model predicts a high objective value, and exploration means sampling at locations where the prediction uncertainty is high. Both correspond to high acquisition function values, and the goal is to maximize the acquisition function to determine the next sampling point. Popular acquisition functions are maximum probability of improvement (MPI), expected improvement (EI) and upper confidence bound (UCB).
The Bayesian Optimization procedure can be summarized as an iterative procedure, where $u$ is the acquisition function and $D_{1:t-1}$ are the $t-1$ samples drawn from $f$ so far. For $t = 1, 2, \dots$ repeat:
• Find the next sampling point $x_t$ by optimizing the acquisition function over the GP: $x_t = \operatorname{argmax}_{x} u(x \mid D_{1:t-1})$.
• Obtain a possibly noisy sample $y_t = f(x_t) + \epsilon_t$ from the objective function $f$.
• Add the sample to the previous samples, $D_{1:t} = D_{1:t-1} \cup \{(x_t, y_t)\}$, and update the GP.
A minimal sketch of this loop is shown below.
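To make the loop concrete, here is a minimal sketch of Bayesian Optimization on a one-dimensional toy objective, using a Gaussian Process surrogate and the expected improvement acquisition evaluated on a discrete candidate grid. The objective function, kernel and iteration budget are arbitrary choices made for illustration and do not correspond to the setup of any framework discussed in this chapter.

```python
# Minimal Bayesian Optimization sketch (toy 1-D noisy objective on a grid).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(candidates, gp, y_best):
    # EI for minimization: expected amount by which we beat the best value so far.
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)

def objective(x):
    # Placeholder for an expensive experiment (e.g. running a benchmark).
    return float(np.sin(3 * x) + 0.1 * rng.normal())

candidates = np.linspace(0.0, 2.0, 200).reshape(-1, 1)
X = candidates[rng.choice(len(candidates), 3, replace=False)]   # initial samples
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for t in range(20):
    gp.fit(X, y)
    ei = expected_improvement(candidates, gp, y.min())
    x_next = candidates[np.argmax(ei)]        # maximize the acquisition function
    y_next = objective(x_next[0])             # run the (simulated) experiment
    X, y = np.vstack([X, x_next]), np.append(y, y_next)

print("best configuration:", X[np.argmin(y)], "value:", round(y.min(), 3))
```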
2.1.3 CherryPick
Big data analytics running on clouds are growing rapidly and have become important for almost every industry. Choosing the right configuration for an application is critical to being competitive and can decrease the cost of running a job by 2-3x on average. This task turns out to be very challenging due to the difficulty of achieving:
• High Accuracy: The running time and cost of an application have complex relations to the resources of the cloud instances.
• Low Overhead: Brute-force search for the best configuration is expensive. There are usually a lot of parameters that can be tuned, from instance type to number of CPUs and many others, and this can lead to an expensive search in the space of configurations.
• Adaptivity: Big data applications have diverse internal architectures, and manually learning to model this internal structure is not scalable.

Figure 2.2: CherryPick workflow.

CherryPick [1] is a system that leverages BO to build performance models for various applications that minimize cloud usage cost, guarantee application performance and limit the search overhead. The key idea of CherryPick is to build a model just accurate enough to distinguish near-optimal configurations from the rest; the performance model can also help to understand when to stop searching once we are confident we have found a good configuration. To leverage BO to find good cloud configurations, it is required to choose the prior, or surrogate model, and an acquisition function. As surrogate model, the choice falls on Gaussian Processes, mainly because they are non-parametric models and they are flexible enough to approximate the actual function given enough data samples. As acquisition function, the choice falls on expected improvement (EI), as it has been shown to behave better than the other choices, maximum probability of improvement (MPI) and upper confidence bound (UCB), and it does not have parameters to choose. For a given application and workload, the goal is to find the optimal or near-optimal cloud configuration that satisfies performance requirements and minimizes the total execution cost. Formally:
$$\min_{x \in S} \; C(x) = P(x) \times T(x) \qquad (2.1)$$
$$\text{subject to} \quad T(x) \le T_{\max} \qquad (2.2)$$
where x is the cloud configuration vector, P(x) is the price per unit time for all the VMs in cloud configuration x, C(x) is the total cost of cloud configuration x and T(x) is the running time function for an application and its workload. Equation 2.1 states the objective of the approach, also called the objective function, which is to minimize the cost, i.e. the product of the price per hour of our configuration and the running time. Equation 2.2 represents a constraint that does not allow the model to choose a configuration with a running time greater than a certain constant; this prevents the model from choosing a configuration with a very low cost but a high running time. A small worked example of this selection criterion is given below.
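As a worked example of Equations 2.1 and 2.2, the snippet below picks, among a handful of hypothetical cloud configurations, the cheapest one whose running time respects the constraint. The instance names, prices and running times are invented numbers used only to illustrate the objective and the constraint, not data from CherryPick.

```python
# Worked example of the objective C(x) = P(x) * T(x) under T(x) <= T_max.
# Prices ($/hour) and running times (hours) are invented for illustration.
configs = {
    "4x m4.large":   {"price": 0.40, "time": 3.0},
    "2x m4.xlarge":  {"price": 0.80, "time": 1.6},
    "2x c4.2xlarge": {"price": 1.60, "time": 0.7},
}
T_MAX = 2.0   # hours

def cost(c):
    # C(x) = P(x) * T(x)
    return c["price"] * c["time"]

# Keep only configurations that satisfy the running-time constraint (Eq. 2.2),
# then pick the one with the lowest total cost (Eq. 2.1).
feasible = {name: c for name, c in configs.items() if c["time"] <= T_MAX}
best = min(feasible, key=lambda name: cost(feasible[name]))
print(best, "cost =", round(cost(configs[best]), 2))
```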
To solve this optimization problem, CherryPick implements the workflow shown in Figure 2.2, where Bayesian Optimization is embedded. Starting from an initial cloud configuration, which can be the default one or user-defined, CherryPick leverages Bayesian Optimization to update the confidence interval, that is, the area through which the objective function is most likely passing. After that, CherryPick relies on the acquisition function, EI, to choose the best configuration to run next. At each step it is decided whether to stop the search according to the confidence interval provided by BO: when the expected improvement is less than a threshold and at least N cloud configurations have been observed, the search is stopped. The choice of BO to find optimal cloud configurations for Big Data is due to mainly three factors:
• BO does not limit the function to any predefined format, as it is non-parametric; this makes CherryPick useful for a variety of applications and cloud configurations.
• BO typically needs a small number of samples to find a near-optimal solution because it focuses the search on areas that have the largest expected improvement.
• BO tolerates uncertainty. CherryPick faces two main sources of uncertainty: an imperfect performance model that usually has substantial prediction errors, and the fact that the cloud may not report a stable running time even for the same application, due to resource multiplexing across applications.
CherryPick is evaluated with five different benchmark applications on Spark and Hadoop, exercising different CPU/Disk/RAM/Network resources: TPC-DS, a recent benchmark for big data systems that models a decision support workload; TPC-H, an SQL benchmark that contains many ad-hoc decision support queries processing large amounts of data; TeraSort, a common benchmarking application for big data analytics frameworks, which requires a balance between high IO bandwidth and CPU speed; and SparkReg, a benchmark that consists of machine learning workloads implemented on top of Spark. CherryPick is compared with exhaustive search (which finds the best configuration by running all the configurations) and coordinate descent, which searches one coordinate at a time. The comparison is made using two metrics: the running cost of the configuration, the expense to run a job with the selected configuration, and the search cost, the expense to run all the
sampled configurations. Figure 2.3 shows the median, 10th percentile and 90th percentile of the running time for the configuration picked by CherryPick for each of the five workloads. CherryPick finds the exact optimal configuration with 45-90% chance and finds a configuration within 5% of the optimal configuration at the median. Using exhaustive search requires 6-9 times more search cost and 5-9.5 times more search time compared with CherryPick. The median configuration suggested by coordinate descent is within 7% of the optimal configuration, but in the tail the configurations suggested by coordinate descent can be far from optimal, around 76-78% worse. The results show
Figure 2.3: Comparing CherryPick with coordinate descent. The bars show 10th and 90th percentile.
that CherryPick selects optimal or near-optimal configurations with much lower search cost than existing solutions as shown in Figure 2.4.
2.1.4 OpenTuner
Program auto-tuning is increasingly being used in domains such as high-performance computing and graphics. Rather than optimizing a program directly, the programmer expresses a search space of possible implementations and optimizations, which the auto-tuner searches for an optimal implementation. This makes the entire optimization process more efficient and at the same time provides performance portability, as the auto-tuning process can easily be re-run on new machines which require different sets of optimizations.
Auto-tuners are usually relatively simple and project-specific, due to some challenges that make the development of auto-tuning frameworks difficult:
• Finding the right configuration representation for the problem. One must find a way to represent the complex domain-specific data structures and constraints.
• The size of the valid configuration space. Search spaces can be very large, and exhaustive search of such spaces will not complete in human lifetimes. Thus the space can be pruned, or intelligent techniques can navigate this huge space, avoiding the cost of an exhaustive search.
• The landscape of the configuration space. Based on the shape of the configuration space one can choose the most appropriate search technique; e.g. if the configuration space is a monotonic function, a hill climber will be able to find an optimal configuration. A search technique that is optimal in one type of configuration space may fail to locate an adequate configuration in another, and this makes it difficult to build a robust system that performs well in a variety of situations.

Figure 2.4: Running cost of configurations by CherryPick and random search. The bars show 10th and 90th percentile.

OpenTuner[3] is an open-source framework for building domain-specific multi-objective program auto-tuners. The key capability inside OpenTuner is the use of ensembles of disparate search techniques run simultaneously; techniques that perform well are allocated a larger portion of tests. The ensemble of techniques provides a solution to large and complex search spaces by offering both robustness to any type of large search space and a way to incorporate domain-specific search techniques, which can share results through a common results database and constructively help each other in finding an optimal solution.
Table 2.1: Search space sizes in number of possible configurations, as represented in OpenTuner.

Project          Benchmark    Possible Configurations
GCC/G++ Flags    all          10^806
Halide           Blur         10^25
Halide           Wavelet      10^32
Halide           Bilateral    10^176
HPL              n/a          10^9.9
PetaBricks       Poisson      10^3657
PetaBricks       Sort         10^90
PetaBricks       Strassen     10^188
PetaBricks       TriSolve     10^1559
Stencil          all          10^6.5
Unitary          n/a          10^21
Mario            n/a          10^6328

OpenTuner's core technique is the multi-armed bandit with sliding window, area under the curve credit assignment (AUC Bandit), which is based on an optimal solution of the multi-armed bandit problem [44]. This technique is used to decide which of the different search techniques to use, based on their performance in recent history, the sliding history window. The main idea of this technique is to balance the trade-off between exploration and exploitation. Exploration tends to prefer a technique that has not been chosen frequently. Exploitation tends to choose the techniques that have shown the best performance. The formula describing which technique we decide to use is the following:

$$\operatorname{arg\,max}_t \left( \mathrm{AUC}_t + C \sqrt{\frac{2 \log |H|}{H_t}} \right) \qquad (2.3)$$

where $|H|$ is the length of the sliding history window, $H_t$ is the number of times technique $t$ has been used in that history window, $C$ is the constant controlling the exploration/exploitation trade-off, and $\mathrm{AUC}_t$ is the term quantifying the performance of the technique in the sliding window. The second term of the sum gets smaller and smaller the more often a technique is used; this is what determines the trade-off between exploration and exploitation. A minimal sketch of this selection rule is shown below.
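To illustrate Equation 2.3, the following sketch applies the selection rule over a sliding window of past technique choices. The technique names, the AUC credit values and the constant C are made-up inputs, and OpenTuner's actual sliding-window credit assignment is not reproduced here.

```python
# Sketch of the AUC-bandit selection rule of Equation 2.3 (illustrative only).
import math

def select_technique(window, auc, techniques, C=0.05):
    """window: list of technique names used in the sliding history window.
    auc: dict technique -> AUC_t credit computed over that window."""
    def score(t):
        uses = window.count(t)                 # H_t
        if uses == 0:
            return float("inf")                # try every technique at least once
        exploration = C * math.sqrt(2 * math.log(len(window)) / uses)
        return auc.get(t, 0.0) + exploration   # AUC_t + exploration bonus
    return max(techniques, key=score)

techniques = ["differential_evolution", "nelder_mead", "pattern_search"]
window = ["nelder_mead", "pattern_search", "nelder_mead", "nelder_mead"]
auc = {"nelder_mead": 0.62, "pattern_search": 0.31}
# The unused technique wins here because it gets an infinite exploration bonus.
print(select_technique(window, auc, techniques))
```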
OpenTuner supports multiple user-defined objectives, which can span time, accuracy, energy, size, confidence and many others, and which can be cast as maximization or minimization problems.
OpenTuner was validated by using it to implement auto-tuners for seven distinct projects, presented in Table 2.1. For many of these benchmarks, the size of the search space makes exhaustive search intractable; thus, when optimality is impossible to quantify, the comparison was made against the best performance obtained by the benchmark authors. Here we present only the first five benchmarks, since they show the most interesting results.
The first benchmark tries to optimize the GCC compiler flags in order to minimize the execution time of the resulting program. This benchmark tunes three different types of GCC flags: the optimization levels (-O0, -O1, -O2, -O3), 176 flags that can be turned on or off, and 145 flags that take integer values. The target programs to optimize were a fast Fourier transform in C taken from the SPLASH2 benchmark suite, a C++ template matrix multiply, a C++ ray tracer, and a genetic algorithm solving the travelling salesman problem in C++. These programs span from highly optimized code, the fast Fourier transform, to poorly optimized code. The speedups reached ranged from 1.15x for fft to 2.82x for matrix multiply. In Figure 2.5 we show the plots with the results just described on the GCC compiler flags, as they are the most interesting.

Figure 2.5: GCC/G++ Flags: Execution time (lower is better) as a function of autotuning time, for (a) raytracer.cpp, (b) fft.c, (c) matrixmultiply.cpp, (d) tsp_ga.cpp. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles. Note that in (a) the O1/O2/O3 and in (c) the O2/O3 lines are on top of each other and may be difficult to see.
Halide is a domain-specific language and compiler for image processing and computational photography, specifically targeted towards image processing pipelines. Production uses of Halide have relied on hand-tuned schedules, which define the order of execution and placement of data, written by experts. The auto-tuning problem in Halide is to automatically synthesize schedules and search for high-performance configurations of a given pipeline on a given architecture. Three different benchmarks drawn from Halide's standard demo applications were used as workloads. For all three benchmarks, OpenTuner can find a configuration that matches or beats the hand-optimized schedules in a shorter amount of time.
The High-Performance Linpack (HPL) benchmark is used to evaluate the floating-point performance of machines ranging from small multiprocessors to large scale clusters [20]. The benchmark measures the speed of solving a large random dense linear system of equations using distributed memory. To assist in tuning, HPL includes a built-in auto-tuner that uses exhaustive search over user-provided discrete values of the parameters. OpenTuner obtains a best performance of 86.5% of the theoretical peak while exploring a small fraction of the overall search space.
PetaBricks is an implicitly parallel language and compiler which incorporates the concept of algorithmic choice into the language. The PetaBricks language provides a framework for the programmer to describe multiple ways of solving a problem. The PetaBricks search space covers both low-level optimizations and the choice among different algorithms. In all cases the autotuners arrive at similar solutions, but for Tridiagonal Solver and Sort OpenTuner beats the native PetaBricks autotuner.
Stencil is a generalized system for autotuning memory-bound stencil computations on modern multi-core machines and GPUs. On the Laplacian and divergence kernel benchmarks, OpenTuner reaches the optimal configurations in a small amount of time, exploring less than 2% of the search space.
2.1.5 Cobayn
Software developers usually rely on compiler intelligence for software optimization, but this usually means using some standard optimization level that has proven to be beneficial for performance (or code size) in most cases. Other compiler optimizations that are not included in the standard optimization levels can lead to substantial benefits in several performance metrics such as execution time, code size and power consumption. The two main approaches for tackling the problem of identifying the best compiler optimizations are iterative compilation and machine learning predictive models. The former relies on several recompilation phases and then selects the best set of optimizations; this approach, even if effective, introduces a high overhead as it needs to be evaluated iteratively. The latter collects software features, either online or offline, and then tries to predict the best set of compiler optimizations by training a ML model; the downside is that typically the performance of the final binary is worse than the one found by the iterative approach [23].

Figure 2.6: Normalized performance improvement (NPI) with regard to the RIC model. The performance of the proposed approach is represented with black bars while the RIC technique with white bars. Higher values mean better performance.

COBAYN (Compiler auto-tuning framework using Bayesian Networks) [4] is an approach for compiler auto-tuning that uses machine learning to speed up application performance and to reduce the cost of the compiler optimization phase. The main goal of the approach is to identify the best compiler optimizations to be applied to a target application. COBAYN can be summarized in three steps:
• Program Characterization: this step is done by measuring micro-architecture independent metrics, e.g. instruction mix and branch predictability. These metrics characterize the program, and the evaluation of the approach is carried out by considering both dynamic and static metrics.
• Dimensionality Reduction: in this approach dimensionality reduction is important for two reasons: first, it eliminates the noise that would perturb further analysis; second, it significantly reduces the training time for the model described in the next step. The techniques used are Principal Component Analysis (PCA) [43] and Exploratory Factor Analysis (EFA) [46]. Experimental results show that using EFA brings better results than using PCA (a small sketch of this step is given after this list).
• Bayesian Networks: COBAYN, instead of trying to predict the best optimization sequence, aims to infer its probability distribution. As probabilistic model the choice falls on Bayesian Networks, for their capability of modelling cause-effect dependencies. These dependencies are suitable for this domain because the benefits of some compiler optimizations (effects) are expected to be due to the presence of some application features (causes). The objective is thus to train the Bayesian Network to model the dependencies between configurations and metrics using some training data, and then to feed the network with information collected from the current application being optimized, which was not present in the training data, so as to build a posterior probability that models the optimization space of our specific application.
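As an illustration of the dimensionality reduction step mentioned above, the following sketch reduces a matrix of hypothetical program-characterization features with PCA before it would be fed to a downstream model. The feature values, their number and the retained-variance threshold are assumptions for illustration, not COBAYN's actual setup.

```python
# Illustrative PCA reduction of program characterization features.
# The feature matrix and the 95% variance threshold are hypothetical choices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Rows: programs; columns: micro-architecture independent metrics
# (e.g. instruction-mix fractions, branch predictability, memory reuse).
features = rng.random((50, 12))

# Standardize, then keep enough components to explain 95% of the variance.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(scaled)

print("original dims:", features.shape[1], "reduced dims:", reduced.shape[1])
print("explained variance ratios:", pca.explained_variance_ratio_.round(2))
```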
The evaluation of the COBAYN approach is made by using two major benchmark suites separately: cBench[31] and PolyBench[34]. Each consists of different classes of applications and kernels, ranging from security and cryptography algorithms to office and image-processing applications. The evaluation is carried out by comparing the proposed approach and the Random Iterative Compilation (RIC) technique with reference to the execution time obtained by -O3 (Eref), which is a standard optimization level of the gcc compiler. RIC is a technique that uses straightforward random sampling of compiler flags. In Figure 2.6, where six different benchmark/feature selection methods are presented, it is possible to notice that the proposed approach always has the upper hand in performance with respect to the RIC technique. The NPI figures
show that, in all cases, the proposed method was superior in terms of performance. In terms of extractions, i.e. numbers of configurations tested, COBAYN shows that after 30 extractions it reaches 80% of the optimal performance.
2.2 Recommender System
Recommender Systems are IT tools able to predict the tastes and preferences of a user with the objective of selecting a personalized set of items and services. The suggestion, or recommendation, is applied in multiple decision processes, such as what news to read or which songs to listen to. Item is the general term used to denote what the system suggests to the user. In general, Recommender Systems are based on a set of information from which it is possible to deduce the user's preferences towards certain items. These preferences are usually expressed as ratings. Ratings can be explicitly derived, as the stars in a movie catalog system that are directly set by the user, or can be implicitly deduced from the interaction of the user with the items [15]. Recommender Systems were born at the beginning of the 90s as an information filtering tool, of great utility in domains in which the quantity of items is so large that users are not able to choose the relevant ones. Recommender Systems play an important role in the web sites of big corporations such as Amazon, YouTube, Netflix and Tripadvisor. Netflix, for example, awarded a one million dollar prize to the team able to improve its Recommender System performance [27]. This growth in importance of RS goes in parallel with the growth of the web, which allows users to share content such as photos, videos and music rapidly and in high volumes. This phenomenon has allowed Recommender Systems to find different possible applications, such as the identification of web communities and of the most popular topics in those communities. In this context, two Recommender System techniques principally started to gain popularity: content-based filtering and collaborative filtering.
We now present these two techniques.
2.2.1 Content Based Filtering
Content-based filtering, also referred to as cognitive filtering, recommends items based on a comparison between the content of the items and a user profile. The content of each item is represented as a set of descriptors or terms, such as the words that occur in a document. The user profile is represented with the same terms and built up by analyzing the content of items which have been seen by the user.
Content-based filtering is fairly limited in specific circumstances. In the first instance, it is limited by the descriptors explicitly associated with the items that we intend to recommend; usually, overcoming this issue requires a manual action to identify these descriptors, which is a great limitation from the practical point of view. Another problem is over-specialization, for which the algorithm tends, over time, to suggest items more and more similar to what it has already suggested; this is usually mitigated by the introduction of a random component in the algorithm. One of the main issues, however, is the new user problem: a new user has not seen any item yet, so the algorithm is not able to make any suggestion. A minimal sketch of the content-based idea is given below.
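To make the idea concrete, the following sketch recommends items to a user by comparing item descriptor vectors with a user profile built from the items the user has already liked. The toy items and descriptors are hypothetical and only illustrate the mechanism, not any specific system.

```python
# Toy content-based filtering: cosine similarity between item descriptors
# and a user profile built from previously liked items (illustrative only).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rows: items; columns: descriptors (e.g. genres of a film).
items = {
    "item_A": np.array([1.0, 0.0, 1.0, 0.0]),
    "item_B": np.array([0.0, 1.0, 0.0, 1.0]),
    "item_C": np.array([1.0, 0.0, 0.0, 1.0]),
}
liked = ["item_A"]                      # items the user has already seen and liked

# The user profile is the average of the descriptors of the liked items.
profile = np.mean([items[i] for i in liked], axis=0)

# Rank unseen items by similarity to the profile.
candidates = {k: cosine(profile, v) for k, v in items.items() if k not in liked}
print(sorted(candidates.items(), key=lambda kv: kv[1], reverse=True))
```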
2.2.2 Collaborative Filtering
Collaborative Filtering (CF) suggests items to a user based on the items for which other users have expressed a rating. A music recommender, for example, first finds the users with the music taste most similar to the user for which we want to make the recommendation, and then suggests the most appreciated songs among this set of users. CF is based on the ratings given to items by users with a profile similar to that of the target user, and thus relies on the availability of user profiles containing a history of past ratings. In this way, we do not need any information about the items that we are suggesting. Like the CBF algorithm, CF suffers from the new user problem, and in addition in this case we also have the new item problem. The most promising approaches presented in this field combine CF and CBF to build new hybrid techniques. A minimal sketch of user-based CF is given below.
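The following sketch illustrates user-based collaborative filtering on a tiny, hypothetical user-item rating matrix: it finds the user most similar to the target user and recommends the item that neighbour rated highest among the items the target has not rated yet. It is only a didactic example, not the method used later in this thesis.

```python
# Toy user-based collaborative filtering (illustrative only).
import numpy as np

# Rows: users; columns: items. 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],   # user 0 (target)
    [4, 5, 3, 1],   # user 1
    [1, 1, 0, 5],   # user 2
], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

target = 0
# Similarity of the target user with every other user.
sims = [(u, cosine(ratings[target], ratings[u]))
        for u in range(len(ratings)) if u != target]
neighbour = max(sims, key=lambda s: s[1])[0]

# Recommend the neighbour's best-rated item among those the target has not rated.
unrated = np.where(ratings[target] == 0)[0]
best_item = unrated[np.argmax(ratings[neighbour, unrated])]
print("most similar user:", neighbour, "recommended item:", int(best_item))
```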
Chapter 3
Dataset
3.1 Introduction
The performance optimization of a system is, most of the time, a repetitive process, usually structured as follows. First, we run a performance test, a type of non-functional software testing technique in which the performance of a system is evaluated under simulated load. The result of a performance test is normally a measure of the response time, throughput, or utilization efficiency of one or more components of our system, that is, of the metrics defining the performance of our system. These components may be of any size or complexity. Then follows a step of analysis of the results obtained from the performance test. After the analysis of the experiment is done, the system is tuned and more tests follow, each exploring a possible tuning alternative. Each test can be time-consuming and expensive to run. If we based our approach on the previously discussed workflow, it would take a long time to finish this research work. Instead, we want to have the ability to test the approaches we propose quickly, leaving us space to explore and try different alternatives without being bound to time-consuming tasks. The idea to speed up this process is to run a large number of tests in advance and then save the results we obtain into a dataset. In this way, instead of running a test and collecting results on the system, we simply need to take this information from the previously built dataset. All this allows fast testing and analysis of our approach.
We would like our dataset to collect the performance metrics of multiple applications, in order to present more reliable and trustworthy results. The available datasets with these characteristics are rare; sometimes they are not available at all, not being published anywhere, or, if published, they are incomplete. The only dataset available that is suitable for our purpose is the one shipped within Cobayn [4]. The problem with this dataset is that it focuses on compiler performance tuning, which is not the field on which we want to focus our research; instead, we would like to move into Java application performance tuning. For this specific purpose, there is no dataset available. The lack of a dataset satisfying our requirements led us to build our own dataset, which will also be made publicly available. We decided to build two datasets, to study different types of applications: one studying the performance of different Java applications using the DaCapo benchmark suite [6], and the other studying the performance of a NoSQL database, Cassandra (http://cassandra.apache.org), under different types of workload generated using the Yahoo! Cloud Serving Benchmark (YCSB). Even using the dataset, the time required to run a reasonable number of tests is still too high. We decided to find an alternative way to use the dataset and at the same time reduce the number of experiments that we need to run. We built the dataset intending it to serve as training data for a surrogate model that simulates the real system under test. Using the data collected in the dataset, we can simulate the performance behaviour of Cassandra and DaCapo. The main advantage gained with the use of a surrogate model is the time required to run an experiment, going from minutes for the real system to milliseconds for the simulated one. Also, the number of tests to run decreases greatly.
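A minimal sketch of the surrogate-model idea follows; the regressor choice, file name and column names are illustrative assumptions only, and the actual model is described later in this work.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed schema: numeric tuning parameters plus a "response_time" column.
data = pd.read_csv("dacapo_results.csv")        # hypothetical file name
X = data.drop(columns=["response_time"])        # tuning parameters (assumed numeric)
y = data["response_time"]                       # measured performance metric

# Fit a surrogate that maps a configuration to its expected performance.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Evaluating a candidate configuration now takes milliseconds instead of minutes.
candidate = X.iloc[[0]]
print("predicted response time:", surrogate.predict(candidate)[0])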
In this chapter we describe only how the dataset has been constructed; details on the surrogate model can be found in Chapter 4.
In this thesis work, we describe in detail the steps required for the construction of the DaCapo dataset, while the Cassandra dataset is made available in [22]. The chapter is structured as follows: the next section describes the different tools that have been used to build the dataset, followed by a section describing in detail the steps taken in building the DaCapo dataset. In the end, we give some information on the Cassandra dataset necessary to understand the work that we present.
3.2
Tools
In this section we introduce and describe the tools used to build the DaCapo and Cassandra datasets and we motivate their adoption.
3.2.1
DaCapo
The DaCapo benchmark suite [6] is a set of open source Java benchmarks representing real world applications. The DaCapo benchmark suite is intended as a tool for Java benchmarking by the programming language, memory management and computer architecture communities. DaCapo includes both client and server applications which are selected with the idea of being general purpose and easy to use; indeed, to launch a benchmark you only need to run one command (a minimal sketch is given after the list below) and the DaCapo suite will manage all aspects of the execution for you. The benchmarks composing this suite are:
• Avrora simulates a number of programs run on a grid of AVR microcontrollers;
• Batik produces a number of Scalable Vector Graphics (SVG) images based on the unit tests in Apache Batik;
• Eclipse executes some of the (non-gui) jdt performance tests for the Eclipse IDE;
• Fop takes an XSL-FO file, parses it and formats it, generating a PDF file;
• H2 executes a JDBCbench-like in-memory benchmark, executing a number of transactions against a model of a banking application, replacing the hsqldb benchmark;
• Jython interprets the pybench Python benchmark;
• Luindex uses Lucene to index a set of documents: the works of Shakespeare and the King James Bible;
• Lusearch uses Lucene to do a text search of keywords over a corpus of data comprising the works of Shakespeare and the King James Bible;
• Pmd analyzes a set of Java classes for a range of source code problems;
• Sunflow renders a set of images using ray tracing;
• Tomcat runs a set of queries against a Tomcat server retrieving and verifying the resulting webpages;
• Tradebeans runs the daytrader benchmark via Java Beans to a GERONIMO backend with an in-memory h2 as the underlying database;
• Tradesoap runs the daytrader benchmark via SOAP to a GERONIMO backend with an in-memory h2 as the underlying database;
• Xalan transforms XML documents into HTML.
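As a concrete example of the single command mentioned above, the sketch below launches one benchmark through the DaCapo harness from Python; the jar name refers to the 9.12-bach release and the harness flags may differ in other versions.

import subprocess

DACAPO_JAR = "dacapo-9.12-bach.jar"   # jar name of the 9.12-bach release

def run_benchmark(name: str, iterations: int = 5, size: str = "default") -> str:
    """Run one DaCapo benchmark through the harness and return its output."""
    cmd = ["java", "-jar", DACAPO_JAR, name,
           "-n", str(iterations),    # number of iterations
           "-s", size]               # workload size (small / default / large)
    res = subprocess.run(cmd, capture_output=True, text=True)
    return res.stdout + res.stderr   # the harness may report timings on either stream

print(run_benchmark("avrora"))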
The choice of using DaCapo as the benchmark suite on which to build a dataset was based on multiple factors: the DaCapo benchmarks represent significant use cases, not only representative of real world applications but also diversified among each other; DaCapo is an open-source project freely usable by everyone without the need to buy a license [8]. The other candidates that were evaluated as possible substitutes for DaCapo were all discarded for different reasons. The SPEC benchmark suite was discarded because of the excessive price of its license plan. The Java Grande benchmarks [11] and the Jolden benchmarks (https://github.com/farseerfc/purano/tree/master/src/jolden) were discarded because of the narrow scope of the types of application they represent, the former exploring only programs with large memory demands and the latter only single-threaded Java applications. The only one that could possibly have been a better pick than DaCapo was the Renaissance benchmark suite (https://renaissance.dev), both from the point of view of having a large number of diversified applications and of representing more modern workloads, including multi-threaded applications. Indeed, the authors of the Renaissance benchmark suite present it as the "new DaCapo" [37], able to reveal new performance opportunities for the Java Virtual Machine (JVM). It is also freely available [39]. The Renaissance benchmark suite was discarded because it was made available too late in our project timeline, after four months of work on DaCapo, making the adoption of the new suite not worthwhile. It will be considered in future extensions.
3.2.2
Apache Cassandra
Apache Cassandra is a free and open-source, distributed, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous master-less replication allowing low latency operations for all clients. Cassandra is a Java based system that can be managed and monitored via Java Management Extensions (JMX), which allow collecting different metrics of the database. The choice of Cassandra was made mainly for two reasons: Cassandra represents an I/O intensive benchmark, completely different from the CPU intensive benchmarks of DaCapo, thus giving more variety to the scenarios considered; the second reason is that Cassandra is one of the most popular and used databases (https://db-engines.com/en/ranking), which makes its study interesting for a large audience. The DaCapo suite is made with the intent of performing Java benchmarking and so it is shipped with the tools necessary to run a performance test. On the other hand, Cassandra is a NoSQL database and does not come with any tool to run performance tests. The researchers at Yahoo! developed a tool able to benchmark the Cassandra database, called YCSB [38].
3.2.3
YCSB
The Yahoo! Cloud Serving Benchmark (YCSB) is an open source framework used to facilitate the comparison between cloud storage services. The framework consists of a workload generator and a package of standard workloads that cover interesting parts of the performance space (read-heavy workloads, write-heavy workloads, scan workloads, etc.). An important aspect of the YCSB framework is its extensibility: the workload generator makes it easy to define new workload types, and it is also straightforward to adapt the client to benchmark new data serving systems. The YCSB client is a Java program for generating the data to be loaded into the database. The workload executor drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. The threads also measure the latency and achieved throughput of their operations, and report these measurements to the statistics module. At the end of the experiment, the statistics module aggregates the measurements and reports average, 95th and 99th percentile latencies/throughput, and either a histogram or time series of the latencies/throughput.
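The sketch below shows how the two YCSB phases could be driven from Python for the Cassandra experiments; the installation path, binding name and property values are assumptions and may need to be adapted to the YCSB version in use.

import subprocess

YCSB_HOME = "/opt/ycsb"            # assumed install location
BINDING = "cassandra-cql"          # Cassandra binding; the name varies across YCSB versions
WORKLOAD = "workloads/workloada"   # one of the standard core workloads

def ycsb(phase: str, props: dict) -> str:
    """Run one YCSB phase ('load' or 'run') and return its raw text report."""
    cmd = [f"{YCSB_HOME}/bin/ycsb", phase, BINDING, "-P", f"{YCSB_HOME}/{WORKLOAD}"]
    for key, value in props.items():
        cmd += ["-p", f"{key}={value}"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Load phase populates the database; transaction phase executes the workload.
print(ycsb("load", {"hosts": "127.0.0.1", "recordcount": 100000}))
print(ycsb("run",  {"hosts": "127.0.0.1", "operationcount": 100000, "threadcount": 8}))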
3.2.4
Amazon Web Services
Amazon Web Services (AWS) is a provider of on-demand cloud computing platforms to individuals, companies and governments. In particular, we used a service provided by AWS called Amazon Elastic Compute Cloud (EC2), which allows renting virtual computers on which to run applications. All the benchmarks considered have been run on EC2 instances, as Amazon calls its virtual machines. Amazon EC2 provides a wide selection of instance types optimized to fit different use cases, comprising varying combinations of CPU, memory, storage, and networking capacity. The reasons for using AWS as the cloud provider for computing resources are mainly the following: AWS is considered the most enterprise-ready (https://www.gartner.com/doc/reprints?id=1-1CMAPXNO&ct=190709&st=sb), mission critical hyper-scale provider; it is the cloud provider on which our research team has the most experience and technical knowledge; it is one of the most widely known and used cloud providers, which allows us to guarantee that when running our experiments we are in an environment similar to production ones. These factors led to the adoption of AWS as the cloud provider for our experiments. Using a cloud provider for the computational resources brings a lot of advantages but also some disadvantages. The flexibility and computational power available on the cloud is impressive, but this comes at a cost. The type of virtual machine that we selected as most appropriate for our purpose was the c2.xlarge (https://aws.amazon.com/ec2/instance-types/), which comes at a price of 1000 $/month. To keep the expenses to a minimum we decided to use what is called a Spot Instance, a particular virtual machine that can be killed whenever the cloud provider, AWS in this case, decides to reclaim it. After some preliminary tests we observed that our instances were killed about once a day. This situation forced us to automate every aspect of our work, even moving single files between instances.
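A hedged sketch of how a spot instance could be requested programmatically with boto3 is shown below; the region, AMI id, key pair name and instance type are placeholders, not the actual values used in this work.

import boto3

# All identifiers below (region, AMI, key pair, instance type) are placeholders.
ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.request_spot_instances(
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",   # placeholder AMI with Java, DaCapo and YCSB preinstalled
        "InstanceType": "c5.xlarge",          # placeholder; substitute the instance type actually used
        "KeyName": "benchmark-key",           # placeholder key pair
    },
)
request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
print("Spot request submitted:", request_id)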
3.2.5
Java
Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented (a style of Object-Oriented Programming, OOP), and specifically designed to have as few implementation dependencies as possible. Java applications are compiled into byte-code that can be run on any computer architecture for which a Java Virtual Machine (JVM) is available; the JVM translates the byte-code into specific machine code. The overhead of interpreting byte-code into machine instructions very often makes interpreted programs run more slowly than native executables [2].
One of the main characteristics of the Java programming language is the presence of an automatic garbage collector to manage the memory over the object life-cycle. The programmer determines when objects are created and the Java run-time, a set of software tools for the development of Java applications, is responsible for recovering the memory once objects are no longer in use. The garbage collector is one of the components that affect the performance of a JVM the most [7]. Modern virtual machines offer a multitude of
options for tuning the garbage collector, which can have a significant impact on application performance. To better understand the weight of the garbage collector on the performance of a Java application, it is necessary to make a small digression and explain in broad terms how garbage collectors work. Automatic garbage collection is the process of looking at heap memory, identifying which objects are in use and which are not, and deleting the unused objects. An in-use object, or a referenced object, means that some part of your program still maintains a pointer to that object. An unused object, or unreferenced object, is no longer referenced by any part of your program, allowing the GC to reclaim the memory it uses. This process of deallocating memory performed by the garbage collector can be summarized in two steps. The first step in the process is called marking: this is where the garbage collector identifies which pieces of memory are in use and which are not. The second step is called deletion and has two variants, normal deletion and deletion with compaction. Normal deletion removes unreferenced objects, leaving referenced objects and pointers to free space. To further improve performance, in addition to deleting unreferenced objects, you can also compact the remaining referenced objects. By moving referenced objects together, new memory allocation becomes much easier and faster.
Having to mark and compact all the objects in a JVM is inefficient, but empirical analysis of applications has shown that most objects are short lived [19], and this information learned from the object allocation behaviour can be used to enhance the performance of the JVM. Therefore, the heap is broken up into smaller parts, or generations. The heap parts are: Young Generation, Old or Tenured Generation, and Permanent Generation. The Young Generation is where all new objects are allocated and aged. When the young generation fills up, this causes a minor garbage collection. Minor collections can be optimized assuming a high object mortality rate: a young generation full of dead objects is collected very quickly. Some surviving objects are aged and eventually move to the old generation. The Old Generation is used to store long surviving objects. Typically, a threshold is set for young generation objects and when that age is reached, the object gets moved to the old generation. Eventually the old generation needs to be collected; this event is called a major garbage collection. The Permanent Generation contains the meta-data required by the JVM to describe the classes and methods used in the application. The permanent generation is populated by the JVM at run-time based on the classes in use by the application, and Java SE library classes and methods may be stored here.
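To connect this digression with the tuning problem, the sketch below shows how a garbage-collection configuration could be expressed and translated into standard HotSpot flags; the parameter names and values are illustrative choices, not the exact parameter space explored in this thesis.

# Illustrative GC configuration; the labels on the left are our own names,
# while the flags produced below are standard HotSpot options.
gc_config = {
    "max_heap_mb": 2048,           # -Xmx: maximum heap size
    "new_ratio": 2,                # -XX:NewRatio: old/young generation size ratio
    "survivor_ratio": 8,           # -XX:SurvivorRatio: eden/survivor space ratio
    "max_tenuring_threshold": 15,  # -XX:MaxTenuringThreshold: ages before promotion to old gen
    "collector": "UseParallelGC",  # e.g. UseParallelGC, UseG1GC
}

def to_jvm_flags(cfg: dict) -> list:
    """Translate the configuration above into java command-line flags."""
    return [
        f"-Xmx{cfg['max_heap_mb']}m",
        f"-XX:NewRatio={cfg['new_ratio']}",
        f"-XX:SurvivorRatio={cfg['survivor_ratio']}",
        f"-XX:MaxTenuringThreshold={cfg['max_tenuring_threshold']}",
        f"-XX:+{cfg['collector']}",
    ]

print(to_jvm_flags(gc_config))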
Another characteristic that determines the performance of a Java application is the just-in-time (JIT) compiler. During the execution of an application the byte-code is translated into native code, but this makes the