
University of Pisa
and
Scuola Superiore Sant’Anna

Department of Computer Science
Department of Information Engineering

MASTER’S DEGREE IN
COMPUTER SCIENCE AND NETWORKING
(Class LM-18)

Master’s Degree Thesis

Evaluation of CPU Energy Saving Techniques in Distributed Search Engines

Candidate
Stefano Ceccotti

Supervisor


Contents

1 Introduction
2 Background and Related Work
  2.1 Search Engine Architecture
    2.1.1 Crawling
    2.1.2 Indexing
    2.1.3 Query Processing
  2.2 Data Center Power Usage
  2.3 Data Center Power Consumption Models
  2.4 CPU Energy Management in Search Engines
3 Energy Efficiency Simulator
  3.1 Simulator Architecture
    3.1.1 First Layer: Network
    3.1.2 Second Layer: Resources
    3.1.3 Third Layer: Services
  3.2 Modeling a Search Engine
4 Experiments and Results
  4.1 Experimental Setup
  4.2 Monolithic Environment
  4.3 Distributed Environment
  4.4 Replicated Environment


Abstract

Web search engines continuously crawl and index an immense number of Web pages to return fresh and relevant results to users’ queries. They operate on top of a distributed infrastructure composed of thousands of server nodes, hosted in large data centers.

There are several factors contributing to the overall energy consumption of such data centers. The main source of energy expenditure is the CPU. Hence, several works in the literature have proposed techniques to reduce the CPU energy consumption of search engines without degrading their performance.

In this thesis, we implement a simulator to evaluate the energy savings of state-of-the-art techniques. More specifically, we evaluate the PESOS algorithm when deployed in a monolithic, distributed, and replicated search engine. In particular, we compare it with the industry-level PEGASUS algorithm, deployed by Google.

Our simulations show that, in a monolithic environment, PESOS can reduce its energy consumption by up to ∼16% with respect to its original version. The distributed version of PESOS outperforms PEGASUS by ∼30% in energy consumption without incurring a significant performance degradation. In the replicated environment we can reduce the energy consumption of PESOS by a further ∼2%.


Chapter 1

Introduction

Modern search engines [16] are composed of hundreds to thousands of nodes (i.e., servers) connected by a fast and reliable infrastructure. A set of low-end servers are mounted within a rack and interconnected using a local Ethernet switch. These rack-level switches have a number of up-link connections to one or more cluster-level Ethernet switches. This second-level switching domain can potentially span more than ten thousand individual servers. This aggregation of clusters is generally called a data center, and big companies, such as Google, Amazon or Microsoft, manage several geographically distributed data centers to protect themselves against catastrophic data center failures. These data centers need a huge amount of energy to be productive, with corresponding continuous operating costs. Electricity costs (including power and cooling) form an important part of the operational costs of search engine companies [16]. In 2011, Google’s overall power consumption was reported to be 2.68 million MWh. Carbon dioxide emissions resulting from the combustion of fossil fuels (brown energy) are the main cause of global warming due to the greenhouse effect. Large IT companies have recently increased their efforts in reducing the carbon dioxide footprint originating from their data center electricity consumption [43]. On one hand, better infrastructure and modern hardware allow for a more efficient usage of electric resources. On the other hand, data centers can be powered by renewable sources (green energy) that are both environmentally friendly and economically convenient.

Since energy consumption has an important role in the profitability and environmental impact of Web search engines [1], improving their energy efficiency is an important goal. There are several components inside a data center consuming energy, such as the CPU and the memory. In particular, the most energy-consuming component of a data center is the CPU, as reported in [16]. The focus of this work is to discuss and analyze the state-of-the-art techniques present in the literature aiming at reducing the energy consumption of the servers’ CPUs of a search engine. Improvements and alternative solutions to these techniques will be studied and analyzed.

We will present this work starting with an overview of how modern search engines are structured, namely around three big modules: crawling, indexing and query processing. As the Web index increases in size, a node must answer a query within a few milliseconds to satisfy the user expectation. This objective is achieved by distributing the search engine. Then, we will describe the data center power usage, explaining how it is split up. In this work, we focus on reducing the energy consumption of the servers’ CPUs, which are the most energy-consuming components in search systems [16], without degrading their performance. In the literature we can find several works addressing this problem, such as [1] and [4]. We will describe such solutions along with some possible improvements. Moreover, we will study alternative solutions, analyzing their energy consumption. To this end, we implemented a simulator to run and test the proposed solutions in three different scenarios: a monolithic, a distributed, and a replicated search engine. In the last part of the thesis we will show the results achieved in these different scenarios, highlighting the pros and cons of each solution. Finally, we will conclude by summarizing the work done in this thesis.


Chapter 2

Background and Related Work

In this chapter we will describe the architecture of a modern search engine (Section 2.1). Then, we will discuss the data center power usage (Section 2.2) and describe the power models of the data center’s components (Section 2.3). Finally (Section 2.4), we will describe how to reduce the energy consumption of a search engine.

All the major search engines (Google, Bing, Baidu, Yandex, ...) rely on data centers to satisfy a huge amount of requests. The most important measure that characterizes a search engine is the quality of its search results [27]. There are two possibilities to improve the search quality of a search engine: increasing the computational power of each hardware component, such as servers, or optimizing the performance of each software module. The other important aspect for a search engine is to reduce the energy consumption while maintaining an acceptable query response time.

In this chapter we will discuss the main characteristics of a search engine, describing its hardware and software components. In addition, we will report some existing solutions and proposed improvements to reduce its energy consumption while trying to maintain a good response time.


2.1 Search Engine Architecture

Web search engines [9] are composed of three main elements: the crawler recovers documents from the Web, the indexer indexes the documents collected by the crawler, and the searcher solves user queries by using the generated index and other components required to achieve efficient performance. A typical structure of a search engine architecture is depicted in Figure 2.1:

Figure 2.1: A search engine architecture [9].

Large-scale Web search engines usually distribute their operations over multiple, geographically distant data centers. Consequently, all the employed data structures have to be replicated or distributed across the different data centers. User requests received by a data center are processed internally by clusters of tens to hundreds of computers, according to the nature of the task. A common scalability issue affecting all three systems is the growth of the Web. The Web is constantly growing, both in the amount of Web content and in the number of user requests for that content. Modern search engines have to deal with this constant growth, forcing engineers to study solutions to these scalability problems. The classical solution is to implement a search engine in a distributed manner. In a distributed search engine each of the previous phases, i.e., crawling, indexing and query processing, is distributed and then replicated on hundreds to thousands of nodes. The indexing and processing phases are not much affected by the distributed architecture, since they are fully replicated in each node, while the crawling phase requires some extra precautions that we will discuss later. The goal of distribution is to achieve a good level of scalability, i.e., to be able to operate without problems and, at the same time, to satisfy all the incoming queries within a certain time threshold.

2.1.1 Crawling

A large-scale Web crawler is responsible for two different tasks. The first task is to locate previously undiscovered URLs by iteratively following the hyperlinks between Web pages. Typically a crawler maintains a list of unvisited URLs called the frontier, which is initialized with seed URLs. The selection of a good initial seed is a critical problem and several studies propose different solutions, e.g., focused crawling. The second task is to fetch again from the Web the content of pages that are already stored in the repository. The former task aims to increase the coverage of the Web repository, while the latter task aims to attain a certain level of freshness.

Individual Web crawlers issue their requests first to the high-level fetching system, which then communicates those requests to the low-level fetching system. This hierarchical implementation avoids having to replicate these features in the individual crawlers. The architecture of such a system is shown in Figure 2.2.

Figure 2.2: A two-level Web page fetching system shared by different Web crawlers that are simultaneously operating in the search engine [9].

(9)

The low-level fetching system implements the most basic network operations, such as resolving the DNS of a URL to an IP address and fetching the content of a page from the Web. The high-level fetching system implements various crawling mechanisms and policies that are applicable to all operational crawlers in the search engine.

For the second task, the most naive refreshing technique is to select the pages to be refreshed in a random manner (or with uniform probability). A better alternative is to refresh important pages more often (e.g., those with high link-quality scores or with high impact on search quality). Yet another alternative is to prioritize pages according to their longevity, i.e., the estimated update frequency of the original page. Refreshing may be implemented with a number of disk-based FIFO queues that support the scheduling of pages for refreshing. Each page is then assigned to a queue according to the estimated utility of refreshing the page.

Distributed Crawling

In search engine companies, multiple crawlers are often operational at a given point in time. These crawlers may be operated by different departments that work independently of each other. Search engines usually implement a Web page fetching system that is shared by multiple crawlers. Every time a particular Web crawler needs to fetch a page from the Web, it issues a new fetching request to this system with the appropriate input (e.g., the URL of the page to be downloaded). There are two different strategies when dealing with distributed crawlers: static or dynamic assignment. In a static assignment the Web is statically partitioned and assigned to the crawlers. Each crawler only crawls its part of the Web, with no need for a coordinator and thus no communication. There are two problems with the static assignment: load balancing of URLs and fault tolerance. Load balancing can be solved using some dynamic relocation; for fault tolerance we can rely on Consistent Hashing: this technique reallocates URLs to crawlers by simply mapping crawlers and URLs into a single addressing space. In a dynamic assignment a central coordinator dynamically assigns URLs to crawlers, therefore crawlers need to communicate with the coordinator.

2.1.2 Indexing

Web pages retrieved by a crawler are then processed by the indexing system [9]. One of the important tasks performed by the indexing system is to convert the pages in the Web repository into appropriate index structures that facilitate searching the content of pages. These index structures include an Inverted Index (Figure 2.3) together with some other auxiliary data structures, which are processed each time a query is evaluated. In practice, the time cost of processing these data structures is the dominant factor in the response latency of a search engine. Another important objective of the indexing system is to improve the quality of the results served by the search engine. To this end, the indexing system performs two complementary types of task. The first type of task aims to eliminate the useless content that may end up in the search results presented to the users. The second type of task aims to understand the content of Web pages and enrich them with meta-data using some semantic analysis. A rich set of features is extracted from the content of Web pages. These features are later used by the query processing system to improve the search quality when pages are ranked against queries.


Figure 2.3: Inverted index (image source: https://www.slideshare.net/ChaToX/text-indexing-inverted-indices-56364695).

An inverted index consists of two parts: inverted lists that keep certain statistics about the occurrence of the terms in the pages and a dictionary providing a mapping from the terms to the inverted lists. The dictionary is used to access the inverted list of a given term. Each dictionary entry contains the number of pages containing its respective term, a pointer to the start of the inverted list, and possibly other meta-data about the term. The dictionary can be implemented as a hash table or B+ tree, as a trie data structure, or as a sorted array. Since the dictionary is frequently accessed during the query processing to look up the terms, it is usually retained in the memory if possible.

Index Compression

The inverted index structure can be very large (even terabytes of data) and it cannot be entirely fitted in main memory, forcing the system to access the disk. Distributing such an index can only alleviate this problem. Compression is an important technique employed in indexing systems to reduce the size of an inverted index. A smaller inverted index implies that a larger portion of it can be kept in memory, reducing the need to perform costly disk accesses. Obviously, the information contained in the inverted index has to be decompressed, partially or entirely, before being processed. Therefore, compression leads to some computational overhead during the processing of queries. However, the efficiency gains in accessing the inverted lists justify the use of compression. In the context of inverted index compression, there are three different metrics to evaluate the choice of a compression algorithm: compression rate, compression speed, and decompression speed. Typically the compression rate and the decompression speed are the most important metrics, leading to a trade-off between these two.

Distributed Indexing

In distributed systems, the inverted index can be distributed over the search nodes using either a document-based or a term-based partitioning strategy. In the case of document-based partitioning, the query is communicated to all search nodes; in the case of term-based partitioning, the broker locally maintains some information about the mapping of terms, so that the query is sent only to the nodes that contain a mapping for at least one query term.

In most Web search engines, document-based index partitioning is preferred for several reasons. First, document-based partitioning leads to better computational load balance. Second, the brokers tend to be a bottleneck in the case of term-based partitioning. Third, document-based partitioning provides better scalability in terms of increasing number of nodes and collection sizes. Finally, document-based partitioning provides better fault tolerance than term-based index partitioning, since the result quality is less likely to be affected in the case of node failures.

2.1.3 Query Processing

The objective of query processing is to estimate the relevance of a set of pages to a given user query, rank the pages in decreasing order of their relevance, and present a small set of the most relevant pages to the user [9]. Relevant pages are identified using algorithms that compute the similarity between a given query q and a matching page p. A commonly used technique is to assign an offline-computed weight w(t, p) to every term t in the page, indicating the term’s importance. The similarity s(q, p) between query q and some matching page p is then computed by summing these weights over all terms in the query:

s(q, p) = Σ_{i=1}^{|q|} w(q_i, p)

There are many ways to compute the weight w(t, p). Most techniques combine term frequency values within a page with term frequency values of the entire collection. For example, the TF-IDF (term frequency-inverse document frequency) function computes the weight as:

w(t, p) = (f(t, p) / |p|) · log(M / f(t))

where f(t, p) denotes the number of times term t appears in p, f(t) denotes the number of pages that contain t, |p| denotes the number of unique terms in p, and M denotes the total number of pages in the collection. This scoring strategy cannot be used in practice, since the execution time would be prohibitively high. Therefore, most search engines adopt a two-phase ranking framework [49]. In the first phase, a sufficiently accurate scoring technique is used to select a small subset of potentially relevant pages from the entire page collection. In the second phase, the pages selected in the first phase are re-ranked by a complex but much more accurate ranking model. The final ranking is obtained by sorting the page scores computed in the second phase in decreasing order.
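As a small, concrete illustration of this additive scoring scheme, the following Python sketch computes w(t, p) and s(q, p) over a toy in-memory collection. The data layout and function names are illustrative only and are not taken from the thesis or from [9].

import math
from collections import Counter

def tfidf_weight(term, page_terms, pages_with_term, collection_size):
    # w(t, p) = (f(t, p) / |p|) * log(M / f(t)), as defined above
    f_tp = page_terms.count(term)            # occurrences of t in p
    unique_terms = len(set(page_terms))      # |p|
    if f_tp == 0 or pages_with_term == 0:
        return 0.0
    return (f_tp / unique_terms) * math.log(collection_size / pages_with_term)

def similarity(query_terms, page_terms, df, collection_size):
    # s(q, p): sum of the term weights over all query terms
    return sum(tfidf_weight(t, page_terms, df.get(t, 0), collection_size)
               for t in query_terms)

# Toy collection of M = 2 pages; df holds f(t), the document frequency of each term.
df = Counter({"energy": 2, "search": 1, "engine": 1})
page = ["energy", "efficient", "search", "engine", "energy"]
print(similarity(["energy", "search"], page, df, collection_size=2))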

Distributed Query Processing

As the Web index increases in size, a node must answer a query within a few milliseconds to satisfy the user expectation. This objective is achieved by distributing the inverted index over the nodes of a search cluster and evaluating each query in parallel.

A search cluster, depending on the size of the Web index, may include hundreds or thousands of computers. As the Web index grows in size, the search cluster is grown proportionally by adding new nodes, such that the index size on each node remains roughly the same. A user query is first processed by a front-end server (Figure 2.4), called the broker, which then fans out the query to a large number of leaves, called query servers. The search index is sharded across all of the leaf nodes and each query is processed by every leaf. The broker node is also responsible for aggregating the partial rankings retrieved from the leaf nodes into a final ranking and communicating this ranking to the user.

Figure 2.4: Distributed index behaviour [9].

Google clusters aggressively exploit the inherent parallelism in the application [28], dividing the query stream into multiple streams, each handled by a cluster. By parallelizing the search over many machines, they reduce the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. This aggressive parallelization has the advantage of reducing the workload on a single cluster and possibly reducing the total energy consumption.

Architectural Optimizations

In certain search architectures the size of the search space can be a problem, considering the size of the Web. Different kinds of architectural optimizations, aiming at reducing the search space in order to evaluate queries on smaller portions of the Web index [9], have been proposed. This has the effect of reducing the query workload of the result retrieval system as well as the query processing time, without hurting the quality of the retrieved search results. These architectural optimizations can be classified under three categories: selective search, static index pruning, and tiering, which are illustrated in Figure 2.5 and discussed below.

Figure 2.5: Three different architectural optimizations for efficient query processing. The solid-line circles indicate the steps that are always executed by the technique, while the optional steps are indicated by the dashed-line circles [9].

The selective search architecture involves two phases: an offline clustering phase and an online retrieval phase. In the clustering phase, the pages in the Web repository are grouped into a number of sub-collections. An inverted index is then built on each sub-collection, and each index is distributed over a number of search nodes. In the retrieval phase, a query is first received by a federating node, which fans out the query to a subset of the search nodes. A search node is selected only if its associated sub-collection is estimated to contain some pages relevant to the query. Then, the federating node collects the results retrieved from the search nodes and returns them to the user.

The idea of static index pruning is to construct an inverted index that keeps much less information than the full Web index, yet returns an accurate set of pages [51]. The index pruning techniques rely on removing the least important inverted list entries from the full Web index. In this architecture, queries are first evaluated against the pruned index. Ideally, most queries should be answered by the pruned index without accessing the full Web index. Then, if the results obtained from the pruned index are found unsatisfactory, the full Web index is processed.

In tiering [50], the pages are partitioned into disjoint sets known as tiers, according to their importance, and an index is built on each set of pages. The importance of a page is usually determined by the likelihood of the page appearing in the search results or by a link-based importance score. A query is processed by hitting the tiers in decreasing order of page importance and merging the results obtained from each hit tier. In this architecture, the early tiers are smaller in size and keep more important pages, while the later tiers are larger in size and keep less important pages. After obtaining the results from a tier, a decision is made about whether the next tier should be hit or not. Typically, this fall-through process continues until the desired quality, or number, of retrieved pages is reached.

2.2 Data Center Power Usage

In this section we address the problem of the energy consumption of a search engine [16]. Energy and power usage are also important concerns in the design of warehouse-scale computers (WSCs) because energy-related costs have become an important component of the total cost of ownership of this class of systems. Figure 2.6 shows that the highest amount of power usage (42.0%) is due to the CPU components, while the amount for DRAM and disks is only 11.7% and 14.3% respectively.


The second highest amount of power is for the cooling overhead, a fundamental system needed to keep the heat generated by the data center as low as possible.

Figure 2.6: Power usage in a modern data center [16].

By definition, energy efficiency measures the computation performed per unit of energy used to run a particular workload within a data center. It can be viewed as the product of three factors:

E = C/TE = (1/PUE) · (1/SPUE) · (C/TEEC)    (2.1)

where C stands for Computation, TE stands for Total Energy, and TEEC stands for the Total Energy delivered to the Electronic Components.

The first term (1/PUE) measures facility efficiency. Power usage effectiveness (PUE) reflects the quality of the data center building infrastructure and captures the ratio of total building power to IT power (the power consumed by the actual computing and network equipment, etc.), computed as:

PUE = Facility Power / IT Equipment Power

The second term (1/SPUE) accounts for overheads inside servers or other IT equipment using a metric analogous to PUE. Server PUE (SPUE) consists of the ratio of total server input power to the power consumed by the electronic components involved in the computation (e.g., motherboard, disks, CPUs, DRAM, and I/O cards). Substantial amounts of power may be lost in the server’s power supply, voltage regulator modules, and cooling fans.

The last term (C/TEEC) accounts for how the electricity delivered to the electronic components is actually translated into useful work.

Energy efficiency can be interpreted in another way. Let us assume a system that exhibits a power usage function linear in the utilization u, such as the one below:

P(u) = P_i + u(1 − P_i)

where P_i represents the system’s idle power and the peak power is normalized to 1.0. In such a system, energy efficiency becomes u/P(u), which reduces to the familiar Amdahl Law formulation:

E(u) = 1 / (1 − P_i + P_i/u)    (2.2)

We are interested in values of utilization between 0.1 and 0.5 because, in that range, high values of P_i result in low efficiency. Indeed, if a sub-component of a data center is not highly energy proportional, that sub-component will limit the efficiency of the whole system.
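To make the effect of the idle power concrete, the short sketch below evaluates Equation 2.2 for a few assumed values of P_i; the numbers are purely illustrative and are not measurements from the thesis.

def efficiency(u, p_idle):
    # E(u) = u / P(u), with P(u) = P_i + u * (1 - P_i) and peak power normalized to 1
    return u / (p_idle + u * (1.0 - p_idle))

for p_idle in (0.2, 0.5, 0.8):          # assumed idle-power fractions
    values = [round(efficiency(u, p_idle), 2) for u in (0.1, 0.3, 0.5, 1.0)]
    print("P_i =", p_idle, "-> E(u) at u = 0.1/0.3/0.5/1.0:", values)

Even at 50% utilization, a system with P_i = 0.8 delivers barely more than half of its peak efficiency, which is exactly the energy-proportionality problem discussed above.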

Cooling System

IT equipment consumes electricity to operate, producing heat. The data center is a closed infrastructure and this heat, if not given an outlet, can become very high. That is why the cooling system is a critical part of a data center’s infrastructure, and several approaches are available to maintain the temperature at an acceptable level, letting components run without being damaged or destroyed by the generated heat. Nevertheless, the cooling system is the second highest power consumer in a data center. The goal of cooling is to move heat from the indoor environment (the data center) to the outside, using some hierarchy of loop systems that replace the outgoing warm medium with a cool supply from the outside. The two essential options are to move heat either using air or a liquid (typically water or some form of refrigerant).

2.3 Data Center Power Consumption Models

Here, we discuss some power consumption models of a data center. In particular we focus on how to model the power usage of each data center component. Servers are the source of productive output of a data center system, conducting most of the work [30]. One of the simplest power models was described by Roy et al. [44], which represents the server power as the sum of the CPU and memory power consumption. We represent their power model as:

E(A) = E_cpu(A) + E_memory(A)

where E_cpu(A) and E_memory(A) are the energy consumption of the CPU and of the memory while running the algorithm A. More detailed power models have been created by considering other components of a server such as disks, network peripherals, etc. The server energy consumption model described by Tudor et al. [22] augments the above power model with I/O parameters. Their model can be written as:

E_total = E_cpu + E_memory + E_I/O    (2.3)

where the energy used by the server is expressed as a function of the energy used by the CPU, the memory, and the I/O devices. The power model described in Equation 2.3 can be further expanded as:

E_total = E_cpu + E_memory + E_disk + E_NIC

where E_cpu, E_memory, E_disk, and E_NIC correspond to the energy consumed by the CPU, the memory, the disk, and the network interface card, respectively.


CPU Power Models

In this work, we focus on reducing the energy consumption of the servers’ CPU (second term in Equation 2.1), which is the most energy consuming component in search systems. Indeed, Barroso et al. [21] analyzed Google servers during peak utilization and showed that processors consumed about 57% of the total server power consumption. However, this percentage dropped to 42% in 2007 thanks to the emergence of energy-aware mechanisms. CPU utilization may change over time due to the variation of the workload handled by the CPU. Therefore, CPU utilization can be denoted as a function of time:

E = ∫_{t0}^{t1} P(u(t)) dt    (2.4)

where E is the total energy consumed by a physical node during a time period from t0 to t1 and u(t) corresponds to the CPU utilization, which is a function of time.
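In a simulator, Equation 2.4 is naturally evaluated in discrete time over a sampled utilization trace. The sketch below uses the linear power model introduced earlier; the wattages and the trace are assumed for illustration.

def node_power(u, p_idle=100.0, p_peak=250.0):
    # assumed linear power model in Watts: P(u) = P_idle + u * (P_peak - P_idle)
    return p_idle + u * (p_peak - p_idle)

def node_energy(utilization_trace, dt=1.0):
    # discrete form of E = integral of P(u(t)) dt, one sample every dt seconds
    return sum(node_power(u) * dt for u in utilization_trace)

trace = [0.1, 0.3, 0.8, 0.6, 0.2]        # hypothetical per-second CPU utilization
print(node_energy(trace), "J over", len(trace), "seconds")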

Equation 2.4 can also be represented in discrete time. When a constant speed model is employed for modeling the power consumption of a multi-core processor, the processor’s total energy consumption can be expressed as [25]:

P_n = Σ_{j=1}^{n} P_c(j)

where P_n denotes the power consumption of n cores and P_c(j) corresponds to the power dissipation of a core j. Unfortunately this equation is not suitable to model cores running at different frequencies; therefore, in multi-core systems the processor power consumption P(k) is commonly modeled [7], given an observation period of length k, as follows:

P(k) = P_s + Σ_{i=1}^{n} x_i(k) [P_ind^i + P_d^i(k)]    (2.5)

where P_s denotes the static power of all power consuming components (except the cores). This is approximately a constant and can only be removed by powering off the entire processor. x_i represents the state of core C_i: if a core is active, x_i = 1; otherwise, x_i = 0 and C_i is turned off. The frequency-dependent active power for the core is defined as P_d^i(k) = α_i f_i(k)^{β_i}, where both α_i and β_i are system-dependent parameters. The authors in [40] state that 2 ≤ α ≤ 3 and 1 ≤ β ≤ 2. P_ind^i is the static power of core i and does not depend on the supply voltage and frequency.
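The following sketch evaluates Equation 2.5 for a hypothetical 4-core CPU; the static powers and the α and β parameters are assumed values within the ranges reported above, not figures taken from [7] or [40].

def cpu_power(core_freqs_ghz, p_static=10.0, p_ind=2.0, alpha=2.5, beta=1.5):
    # P(k) = P_s + sum_i x_i(k) * (P_ind^i + alpha_i * f_i(k)^beta_i)   (Equation 2.5)
    # A core whose frequency is None is treated as turned off (x_i = 0).
    power = p_static
    for f in core_freqs_ghz:
        if f is not None:                 # x_i = 1: core is active
            power += p_ind + alpha * (f ** beta)
    return power

# Hypothetical 4-core CPU: two cores at 2.0 GHz, one at 1.2 GHz, one switched off.
print(round(cpu_power([2.0, 2.0, 1.2, None]), 2), "W")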

CPU Power Management

Several power management policies leverage DVFS (Dynamic Voltage and Frequency Scaling) technologies; traditionally, DVFS is controlled by the operating system, which monitors the CPU utilization, raising the frequency when utilization is high and decreasing it when it is low. Linux has a built-in driver, cpufreq, based on DVFS.

The basic dynamic power equation, P = C V^2 α f, where P is power, C is capacitance, V is the supply voltage, f is the clock frequency, and α indicates how often clock ticks lead to switching activity on average [24], clearly shows the significant leverage possible by adjusting voltage and frequency. If we can reduce the voltage by some small factor, we can reduce power by the square of that factor. Reducing the supply voltage, however, might reduce the performance of the system as well. Software-level instructions, typically executed by the OS, can set particular values of (V, f) to control the DVFS behaviour. These DVFS adjustments incur some time and energy cost each time they are applied. The goal is to identify regions for which (V, f) adjustments can be helpful in reducing the energy consumption and, where possible, to amortize the overheads of the DVFS adjustment.

Modern CPU architectures incorporate a power management technology which supports four kinds of power management states: performance states (P-states), throttle states (T-states), idle states (C-states) and sleep states (S-states). P-states are predefined sets of frequency and voltage combinations at which an active core can operate; the various P-states are implemented by using a combination of dynamic frequency scaling (DFS) and dynamic voltage scaling (DVS). A C-state is an idle state in which parts of the processor are powered down to save energy, and a higher numbered C-state indicates more power savings.

The objective of DVFS is to move the CPU to the correct state based on the current workload.

Memory Power Models

The fourth largest power consumer in a server (without considering the cooling overhead) is its main memory, as it can consume about ∼11.7% of the total power. IT equipment such as servers includes a memory hierarchy. The rapid increase of DRAM capacity and bandwidth has led the DRAM memory subsystem to consume a significant portion of the total system power. In its broadest definition, the main memory power consumption is made up of static and dynamic power. Lin et al. [52] employed a simple power model estimating the DRAM power (P_dm) at a given moment as follows:

P_dm = P_static_dm + α_1 μ_read + α_2 μ_write

where α_1 and α_2 are constants that depend on the processor. The static power consumption of the DRAM is denoted by P_static_dm, while the read and write throughput values are denoted by μ_read and μ_write respectively. Another power model for DRAM energy consumption is based on the observation that the energy consumed by the memory bank is directly related to the number of memory read/write operations involved during the time interval of interest [55]. The energy consumption of a DRAM module over the time interval between t0 and t1 is expressed as:

E_mem = ∫_{t0}^{t1} [ ( Σ_{i=1}^{N} C_i(t) + D(t) ) · P_dr + P_ob ] dt

where C_i(t), i = 1, 2, ..., N, is the number of last-level cache misses for each of the N constituent cores of the server when executing jobs, D(t) is the data amount due to disk access or OS support and due to performance improvement for peripheral devices, P_dr is the DRAM read/write power per unit of data, and P_ob represents the activation power and DRAM background power.

Storage Power Models

The Hard Disk Drive (HDD) is currently the main type of secondary storage media used in data center servers; indeed, disks are the third largest power consumer in a data center. An HDD contains disk platters on a rotating spindle and read-write heads floating above the platters. The limited possibility to access the power states of the hard disk drive and the impact of disk hardware caches make the correct evaluation of the power consumed by an HDD difficult. However, Hylick et al. [53, 54] observed that the read energy consumption (E_r) of multiple hard disk drives has a cubic relationship with the hard disk capacity expressed as the Logical Block Number (L), i.e., E_r ∝ L^3. They modeled the total amount of energy consumed by a drive servicing a set of N requests (considering I units of idle time) comprised of S seeks as:

E_total = Σ_{i=0}^{N} E_active + Σ_{i=0}^{S} E_seek + Σ_{i=0}^{I} E_idle

where E_total is the total energy, E_seek is the seek energy and E_idle is the idle energy, in Joules.

2.4 CPU Energy Management in Search Engines

In this section we tackle the problem of how to reduce the energy consumption of a CPU. Section 2.3 introduced DVFS, exploited by the operating system to reduce the energy consumption. Unfortunately, adjusting the frequency based only on the core utilization is not very effective from the point of view of a single application, since DVFS does not account for the single application’s workload and may cause latency violations. Moreover, the highest inefficiency arises when a core runs at high frequencies even under low workload.

One possible solution is to consolidate the load onto a fraction of the servers, turning the rest of them off [3]. Alternatively, we can temporarily delay tasks to create sufficiently long periods of idleness so that deep sleep modes are effective. However, delaying tasks for hundreds of milliseconds for deep sleep modes to be practical would lead to unacceptable latency violations.

This situation calls for a tighter control over the CPU frequency. For this reason, Catena et al. [4] propose to control the frequency of CPU cores based on the utilization of the query processing node rather than on the utilization of the cores. Indeed, in the last few years many tools, called governors, have been made available, such as acpi-cpufreq [23]. This driver allows applications to directly manage the CPU core frequencies, instead of relying on the operating system. For Intel processors, one of the most used tools is RAPL (Running Average Power Limit) [20], available only from the Sandy Bridge/Xeon architectures onwards: it provides a set of hardware performance counters and I/O models estimating the average power and energy consumption of the CPU, and the possibility to set a higher or lower power cap for the entire processor. The RAPL interface allows the user to set an average power limit that the CPU should never exceed. The default time period to average over is 45 ms. RAPL can be programmed dynamically by writing a model specific register (MSR). Changes to the power limit take effect within less than 1 millisecond. The power limit can be set in increments of 0.125 W using a power control knob exported by the interface. RAPL enforces this power limit mainly by scaling the voltage and frequency, but it can also modulate the CPU frequency to achieve average frequencies in between P-state frequencies.

To guarantee the highest user expectation, i.e., not to let users wait too long for a reply, the incoming queries have to be completed under a certain threshold τ, typically in the order of some milliseconds. We also need to avoid returning incomplete or negative answers when a query does not meet its deadline. Each query has an arrival time a_i, when it enters the processing node, and a completion time c_i > a_i, when it leaves the processing node. Therefore each query has an absolute deadline d_i = a_i + τ. In other words, we want the relation c_i ≤ d_i to be satisfied as much as possible.
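Several of the techniques discussed in the rest of this section (PESOS, PEGASUS) act on the CPU through RAPL. As a concrete reference, the sketch below reads the package energy counter and sets a power cap through the Linux powercap sysfs interface exposed for RAPL; the paths are the standard ones on recent Intel/Linux systems but may differ per machine, writing the limit requires root privileges, and this is not the code used in the thesis.

RAPL_DIR = "/sys/class/powercap/intel-rapl:0"     # package 0 power domain

def read_energy_uj():
    # cumulative package energy counter, in microjoules
    with open(RAPL_DIR + "/energy_uj") as f:
        return int(f.read())

def set_power_limit_watts(watts):
    # long-term power limit (constraint 0), expressed in microwatts; requires root
    with open(RAPL_DIR + "/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1_000_000)))

# Example: cap the package at 50 W, then sample the energy consumed in one second.
# set_power_limit_watts(50)
# import time; e0 = read_energy_uj(); time.sleep(1)
# print((read_energy_uj() - e0) / 1e6, "J consumed in ~1 s")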


In addition to the energy consumption, we also measure the tail latency. The tail latency is a better metric than the mean/median latency for Web search engines. In fact, by measuring the tail latency we can state that most of the requests are served within the measured time interval. The lower the tail latency with respect to the threshold, the lower the time waited by the users. We therefore evaluate the 95th percentile tail latency to measure the effects of power management mechanisms on the responsiveness of search systems.

Existing Solutions

In this section we will talk about the existing solutions to reduce the energy consumption of the CPU for query processing.

The simplest algorithm is the one setting the maximum frequency for the entire period. This solution is called perf. From Equation 2.2 we can state that perf has a very low energy efficiency, since queries are executed at the maximum speed and P_i assumes high values with respect to the utilization of the cores. An alternative solution is power [32], which throttles the CPU core frequencies according to the core utilization. A more refined version, called cons [4], makes use of queueing theory to select the most appropriate frequency. It exploits the well-known utilization factor formula ρ = λ/μ, computed as the ratio between the arrival rate and the service rate.
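A minimal sketch of a cons-like selection rule is shown below, assuming the node measures its query arrival rate λ and knows the service rate μ(f) achievable at each frequency; the target utilization, frequencies and service-rate model are illustrative, and the exact rule used in [4] may differ.

def pick_frequency(arrival_rate, service_rate_at, frequencies, target_rho=0.7):
    # return the lowest frequency keeping rho = lambda / mu below the target,
    # falling back to the maximum frequency when no frequency is sufficient
    for f in sorted(frequencies):
        mu = service_rate_at(f)
        if mu > 0 and arrival_rate / mu <= target_rho:
            return f
    return max(frequencies)

# Hypothetical node: service rate grows linearly with frequency (queries/s per GHz).
freqs = [1.2, 1.6, 2.0, 2.4, 2.8]
service_rate = lambda f: 40.0 * f
print(pick_frequency(arrival_rate=60.0, service_rate_at=service_rate, frequencies=freqs))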

Monolithic Environment

The monolithic version of a search engine is composed of just one client and one server. The internal representation of the server is shown in Figure 2.7. When a query reaches the processing node it is dispatched to a query server by a query router. The query router dispatches an incoming query to the least loaded query server.


Figure 2.7: The architecture of a query processing node [1].

In this monolithic environment we test all the algorithms to produce what will be called the baseline, which is necessary to compare these results with the distributed ones. All the measurements - energy consumption and tail latency - are collected directly in the query processing node.

PESOS Algorithm

Let us now introduce an algorithm called Predictive Energy Saving Online Scheduling Algorithm (PESOS) [1]. It is based on query efficiency predictors [33] rather than on the core utilization. Let us consider a generic set of computing jobs J = {J_1, ..., J_n} that must be executed over a time interval [t0, t1). Each job J_i has an arrival time a_i and an arbitrary deadline d_i which are known a priori. The minimum-energy scheduling problem (MESP) aims at finding a feasible schedule - a schedule is said to be feasible if each job in J is completed within its deadline - such that the total energy consumption is minimized, i.e.:

argmin_{S=(ψ,φ)} E(S) = ∫_{t0}^{t1} P(ψ(t)) dt

where S = (ψ, φ) denotes, respectively, the processing speed and the job in execution, both at time t. The YDS algorithm [6] solves the MESP in polynomial time O(n^3), where n is the number of jobs. Since YDS works optimally only in an offline scenario where the set of jobs is fixed, the authors in [48] proposed a heuristic version called OYDS (Online YDS), which recomputes the MESP anytime a query arrives. Catena et al. [1] exploit OYDS using predictors to estimate the termination of the incoming queries and to select the best frequency to complete them while minimizing the energy consumption.

There are two different configurations: time conservative (TC) and energy conservative (EC). We will refer to the former when the predictors are corrected using their RMSE (Root Mean Square Error), and to the latter when there are no corrections.

In addition, given a query q_i with deadline d_i and completion time c_i, they define its tardiness as T_i = max{0, c_i − d_i}. They aim at minimizing the tardiness of late queries by reducing the time budget of on-time queries. A late query will have a tardiness given by the amount of time by which its completion exceeds its deadline. The tardiness is handled as follows: given a queue of queries Q sorted by arrival time, they compute the total tardiness of the late queries in Q when all queries are processed at maximum frequency. Then they compute the shared tardiness H(Q) of the on-time queries in Q by dividing the total tardiness by the number of on-time queries in Q, and they reduce the on-time queries’ deadlines by H(Q). The on-time queries will have less time to be completed, giving more time to the late queries.
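The sketch below reproduces this tardiness-sharing step as we understand it, assuming that the predicted processing time of each queued query at maximum frequency is available; the data layout is hypothetical and the actual PESOS implementation may differ.

def shared_tardiness(queue, now, tau):
    # queue: list of (arrival_time, predicted_time_at_max_frequency)
    # returns H(Q) and the tightened deadlines of the on-time queries
    finish = now
    late_tardiness, on_time = 0.0, []
    for arrival, proc_time in queue:
        finish += proc_time                  # completion if processed back to back
        deadline = arrival + tau
        if finish > deadline:
            late_tardiness += finish - deadline
        else:
            on_time.append(arrival)
    if not on_time:
        return 0.0, []
    h = late_tardiness / len(on_time)        # shared tardiness H(Q)
    tightened = [(arrival, arrival + tau - h) for arrival in on_time]
    return h, tightened

# Hypothetical queue of three queries with tau = 0.5 s.
print(shared_tardiness([(0.0, 0.2), (0.1, 0.2), (0.15, 0.3)], now=0.0, tau=0.5))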

Query Efficiency Predictors

Query efficiency predictors (QEPs) are techniques that estimate the execution time of a query before it is actually processed. Knowing the execution time of queries in advance makes it possible to improve the performance of a search engine.

The authors of PESOS adapt the query efficiency predictors (QEPs) introduced in [33] to estimate the number of scored postings for a query. They divide queries into six query classes according to their number of terms, where queries with six or more terms are considered as queries of class six. They compute 13 term-based features for each term, derived from the ranking of each document relative to the terms, and aggregate them using three different functions: maximum, variance and sum. These functions generate 39 aggregated features per query.


They use a training set to learn a set of linear regressors π_x, one for each query class. Each regressor takes as input the 39 query-based aggregated features from the feature set, and estimates the number of postings. After the training phase, they use the validation set to see how the predictors perform. They then use the RMSE ρ_x computed in the validation phase to correct the value of the predictors. The result of the training and validation phase is a set of predictors Π = {π̃_1, π̃_2, ..., π̃_6+}.

Training processing time predictors

Processing 60,000 queries, they learn a set of single-variable linear regressors σ_x^f that estimate the processing time of a query given the number of its scored postings. These processing times will be used by OYDS to map processing speeds into CPU core frequencies.

Again, they divide the queries into six classes (see Section 2.4). For each query class and each frequency f, they learn a single-variable linear regressor σ_x^f. They measure the RMSE ρ_x^f and the coefficient of determination R^2 on a validation set after the training phase; the validation set is also used to check how well the predictors perform. Finally, the RMSE ρ_x^f computed in the validation phase is used to compensate the predictors’ estimates. The result of the training and validation phases is a set of predictors Σ = {σ̃_1^f, σ̃_2^f, ..., σ̃_6+^f}.

Translating Processing speeds into CPU frequencies

CPU cores can operate at discrete frequencies f ∈ F, where F is a discrete set of available frequencies. As introduced before, OYDS maps processing speeds into CPU core frequencies. To do so, for each frequency f a single-variable linear predictor σ_x^f(q) is trained to predict, through the estimated number of postings, the processing time of a query q composed of x terms:

σ_x^f(q_i) = α_x^f π̃_x(q_i) + β_x^f

where α_x^f and β_x^f are the coefficients learned by the regressors. Thus, for each frequency f a new set Σ of single-variable linear regressors σ_x^f is derived. To compensate for the predictor error, its RMSE ρ_x^f is added, obtaining:

σ̃_x^f = σ_x^f(q_i) + ρ_x^f

Now, Σ can be used to translate processing speeds into CPU core frequencies. First, OYDS computes the required processing time r_i of a query q_i by multiplying the predicted number of scored postings π̃_x(q_i) by the associated processing speed s. Then, the frequency-specific predictors are checked in ascending order of frequency f. If the expected query processing time at frequency f is less than r_i, frequency f is used to process q_i. If no suitable frequency f can be found, the maximum available frequency is used.
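The following sketch shows this translation step for a single query class, assuming the per-frequency regressors have already been trained; the coefficients, RMSE values and frequencies are invented for illustration.

def select_frequency(postings_estimate, required_time_ms, predictors_by_freq):
    # predictors_by_freq maps a frequency to the (alpha, beta, rmse) of the
    # corrected regressor sigma~_x^f for the query's class; return the lowest
    # frequency whose predicted processing time fits within the required time
    for f in sorted(predictors_by_freq):
        alpha, beta, rmse = predictors_by_freq[f]
        predicted_ms = alpha * postings_estimate + beta + rmse    # sigma~_x^f
        if predicted_ms <= required_time_ms:
            return f
    return max(predictors_by_freq)        # no frequency fits: use the maximum

# Hypothetical regressors for one query class: (ms per posting, intercept, RMSE).
predictors = {1.2: (0.004, 5.0, 2.0), 2.0: (0.0025, 3.0, 1.5), 2.8: (0.0018, 2.0, 1.0)}
print(select_frequency(postings_estimate=20_000, required_time_ms=60.0,
                       predictors_by_freq=predictors))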

Load-Sensitive Selective Pruning Strategy

We also evaluated a slightly different solution with respect to PESOS, still based on query efficiency predictors. The authors in [19] propose a set of query processing strategies that compute the expected processing time of a given query using query efficiency predictors. Then, the time budget for the query is computed, such that the processing time of all the queries in the queue is kept below a certain threshold. This technique is used for query pruning, but it can also be applied to query processing: indeed, the bound phase returns a time budget that can be used to select the best frequency according to the latency threshold. From now on this technique will be called LSSP (Load-Sensitive Selective Pruning). The framework has to implement the following two strategies, as shown in Algorithm 1:

• Predict(): Defines a mechanism to predict the processing time of each query in the queue. This mechanism is used to estimate the processing times e_k(q_i) of the available processing strategies.

• Bound(): Defines a method to compute the time budget T(q_1) for the first queued query, given the other queries waiting to be processed. The time budget defines a bound on the processing time that query q_1 will be permitted.

Algorithm 1 Load-Sensitive Selective Pruning Framework.
Input: the queries q_1, ..., q_n; the completion time threshold τ
Output: the computed bound T(q_1) for query q_1
1: for all processing strategies σ_k, k = 1, ..., p
2:   for all enqueued queries q_i, i = 1, ..., n
3:     expected processing time e_k(q_i) ← Predict(σ_k, q_i)
4: Time budget T(q_1) ← Bound(τ, σ_1(q_1), ..., σ_p(q_1))

The method used to evaluate the time budget T(q_1) is what the authors call Altruistic, because of its inherent fairness. It first computes how much time is left to empty the current queue. This is simply the time at which the lastly queued query q_n should be completed (t_n + τ) minus the current time t. Formally, ∆_n, the remaining time to finish processing up to query n, is:

∆_n = (t_n + τ) − t

Then, to compute the maximum time available for q_1, we have to subtract the minimum time necessary to process all the queued queries. This time is simply given by the sum of the estimates e_p(q_i) of the processing times needed by the fastest processing strategy p (by definition, e_p(q_i) = argmin_k e_k(q_i), namely the query q_i is processed as fast as possible). Hence we define the available slack time, ∆̃_n, as:

∆̃_n = ∆_n − Σ_{i=1}^{n} e_p(q_i).

If ∆̃_n > 0, we evenly distribute this extra slack time among the queued queries. In doing so, if some time is left to process all enqueued queries faster than the minimum possible, each one might receive a fair amount of extra processing time, as long as no additional queries are received; in that case the slack time is recomputed. Hence the processing bound for query q_1 becomes e_p(q_1) + ∆̃_n/n.


However, this quantity can exceed ∆_1, resulting in too much extra budget being assigned to query q_1, beyond the time threshold τ. In this case, the processing bound for query q_1 is simply ∆_1. Finally, if ∆̃_n ≤ 0, we process the query as fast as possible:

T(q_1) = min{∆_1, e_p(q_1) + ∆̃_n/n}  if ∆̃_n > 0,  and  T(q_1) = e_p(q_1)  otherwise.
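A compact sketch of the Altruistic bound is given below, assuming the node already holds, for each queued query, its arrival time and the predicted time e_p(q_i) of the fastest strategy; the values are illustrative.

def altruistic_bound(now, arrival_times, fastest_estimates, tau):
    # arrival_times[i] and fastest_estimates[i] refer to the i-th queued query,
    # with index 0 being q1; implements the T(q1) definition above
    n = len(arrival_times)
    delta_n = (arrival_times[-1] + tau) - now        # time left to empty the queue
    slack = delta_n - sum(fastest_estimates)         # available slack time
    delta_1 = (arrival_times[0] + tau) - now         # time left for q1 alone
    if slack > 0:
        return min(delta_1, fastest_estimates[0] + slack / n)
    return fastest_estimates[0]                      # no slack: as fast as possible

# Hypothetical queue of three queries, tau = 0.5 s, current time t = 0.2 s.
print(altruistic_bound(now=0.2, arrival_times=[0.0, 0.1, 0.15],
                       fastest_estimates=[0.05, 0.08, 0.1], tau=0.5))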

Distributed Environment

We evaluated PESOS also in a distributed environment, composed of a single client, a broker node and N server nodes, each containing a shard of the entire index. As discussed in Section 2.1, we decided to adopt the document-based partitioning strategy. This means that the query is sent to all the N nodes and the results are aggregated in the broker, which is responsible for evaluating the tail latency. Variability in the latency distribution of the individual components can result in unpredictable tail latency spikes [42]. Overprovisioning of resources, careful real-time engineering of software, and improved reliability can all be used in all components to reduce the base causes of variability.

PEGASUS

To evaluate the performance of the distributed PESOS, we compared it with another solution called PEGASUS (Power and Energy Gains Automatically Saved from Underutilized Systems) [3]. PEGASUS is a dynamic, feedback-based controller that enforces the iso-latency policy. The authors propose a power management policy called iso-latency that addresses the problems of existing DVFS systems. It monitors the request latencies and adjusts the power cap of all servers so that the given latency threshold is barely met under any load. Instead of changing the frequency, PEGASUS increases or decreases the power limit of the CPU using RAPL. The Application Performance Monitor (see Figure 2.8) monitors the workload latency and reports the amount of headroom to the PEGASUS controller.


Figure 2.8: Block diagram showing high level operation and communication paths for PEGASUS [3].

The PEGASUS controller is responsible for determining the proper power adjustment based on the latency headroom (see Table 2.1), which it then periodically communicates to each of the local agents running on each server. Finally, the local agents set the power limit from the PEGASUS controller using the RAPL interface.

PEGASUS is defined by two distinct components: policy (control parameters) and enforcement (actuation). The policy component is responsible for determining how much to adjust the CPU power limit. The policy portion is workload specific, and depends on the characteristics of the workload, such as the latency metric (mean or tail latency) as well as the sensitivity of the workload to CPU power. The enforcement portion is workload agnostic. It applies the CPU power limit determined by the policy of PEGASUS uniformly across all nodes in the cluster. Once a local agent receives a request from the central controller, it only takes milliseconds before the new power limit is enforced by RAPL.


Input              Action
X > T              Set max power, wait 5 minutes
Y > 1.35T          Set max power
Y > T              Increase power by 7%
0.85T ≤ Y ≤ T      Keep current power
Y < 0.85T          Lower power by 1%
Y < 0.60T          Lower power by 3%

Table 2.1: PEGASUS policy.

PEGASUS selects the correct action based on which input condition has occurred. In particular, it depends on three values: the average latency X over the last 30 seconds, the instantaneous latency Y and the target latency T. The table lists the rules in decreasing priority, i.e., only the first matching rule is applied. The rationale for using both the average latency and the instantaneous latency is that the instantaneous latency is a good early warning signal that lets PEGASUS know if a violation might occur. This gives PEGASUS the opportunity to gracefully increase the power limit to avoid violations before they happen. In addition, using the instantaneous latency allows PEGASUS to update the power limit more frequently, as opposed to waiting for the 30-second moving average to change.
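Table 2.1 can be read as a priority-ordered policy function, sketched below. The thresholds come from the table; the last two rows are interpreted as disjoint ranges (otherwise the 3% rule would never fire), and the returned action labels are purely illustrative.

def pegasus_action(avg_latency_x, inst_latency_y, target_t):
    # return (action, argument) following Table 2.1, first matching rule wins
    if avg_latency_x > target_t:
        return ("set_max_power_and_wait_s", 300)          # wait 5 minutes
    if inst_latency_y > 1.35 * target_t:
        return ("set_max_power", None)
    if inst_latency_y > target_t:
        return ("increase_power_pct", 7)
    if inst_latency_y >= 0.85 * target_t:
        return ("keep_current_power", None)
    if inst_latency_y >= 0.60 * target_t:                 # 0.60T <= Y < 0.85T
        return ("lower_power_pct", 1)
    return ("lower_power_pct", 3)                         # Y < 0.60T

print(pegasus_action(avg_latency_x=40.0, inst_latency_y=52.0, target_t=50.0))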

Their results show that PEGASUS is able to reduce the energy consumption of a small cluster (hundreds of servers) by ∼11% compared to the baseline, with a reduction of up to ∼20% in large clusters (thousands of servers). Here the baseline is simply a cluster which controls the frequency using the DVFS mechanisms.

Replicated Environment

To sustain the worst-case query volume, a data center must always provide a sub-second response time. This has the obvious drawback that the power consumption and electricity costs are not taken into account. In [2] the authors argue for the possibility of dynamically adapting the behaviour of the search engine - according to the variations of the query load - while providing acceptable query latencies and minimising the number of machines used to process the queries.

By estimating the arrival times and processing requirements of future queries, they derive a self-adapting mechanism for the search engine model that can reduce power consumption without negatively impacting efficiency, by means of a dynamic optimization scheme. In order to model this power-latency trade-off, they propose cost combinations of the following type:

g_k(·) = λ P_k(·) + (1 − λ) L_k(·)    (2.6)

where k is the k-th time slot out of N (where N is defined by the user), and P_k(·) and L_k(·) are the power cost function and the latency cost function explained below, defined for various values of λ ∈ [0, 1).

Considering only ON and STANDBY states - with a power consumption of P_on and P_standby respectively - and assuming that at a given time slot k only u_k out of the total M machines are active, the power cost can be written as:

P_on T_s u_k + P_standby T_s (M − u_k)

assuming that all active nodes will always be processing queries. Normalizing this quantity by the maximum consumable energy for M machines, we obtain the following expression:

P_k(·) = P(u_k) = (1 / (M P_on)) [P_on u_k + P_standby (M − u_k)]

For the latency cost function, at a given time slot k, we have the following queries completion time:

T_k = ((x_k + w̄_k) / u_k) · v̄_k

where x_k is the number of queued queries at the beginning of time slot k, w̄_k the estimated number of incoming queries during the slot, u_k the number of active machines, and v̄_k the mean completion time of a query. Normalizing it in the [0, 1) interval, we have:

L¹_k(·) = 1 − exp(α T_k)    (2.7)

L²_k(·) = 0 if T_k ≤ T_s, 1 otherwise    (2.8)

L³_k(·) = 0 if T_k ≤ T_s, 1 − exp(α (T_k − T_s)) otherwise    (2.9)

where T_k is as defined before, T_s is the length of a time slot and α is an exponential parameter < 0.

In this model x_k, which represents the number of queued queries at the beginning of a time slot k, can be considered a state, where a decision u_k can be associated with a transition from state x_k to state f_k(x_k, u_k) = x_k − u_k · T_s / v̄_k + w̄_k, at a cost g_k(x_k, u_k) = λ P_k(x_k, u_k, v̄_k, w̄_k) + (1 − λ) L_k(x_k, u_k, v̄_k, w̄_k). This can be equivalently represented by a graph, as shown in Figure 2.9, where the arcs correspond to transitions between states at successive stages and each arc has an associated cost corresponding to g_k(·).
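The sketch below computes the cost g_k of Equation 2.6 for one candidate number of active machines, using the power cost above and the smooth latency cost L³_k of Equation 2.9, and returns the next state f_k(x_k, u_k); all parameter values are assumed for illustration.

import math

def power_cost(u_k, m, p_on, p_standby):
    # P_k(.) = (P_on * u_k + P_standby * (M - u_k)) / (M * P_on)
    return (p_on * u_k + p_standby * (m - u_k)) / (m * p_on)

def latency_cost(t_k, t_s, alpha=-0.5):
    # L^3_k of Equation 2.9: no penalty within the slot, smooth penalty afterwards
    return 0.0 if t_k <= t_s else 1.0 - math.exp(alpha * (t_k - t_s))

def slot_cost(x_k, w_k, u_k, v_k, m, t_s, lam=0.6, p_on=250.0, p_standby=20.0):
    t_k = (x_k + w_k) / u_k * v_k                   # expected completion time T_k
    g = lam * power_cost(u_k, m, p_on, p_standby) + (1 - lam) * latency_cost(t_k, t_s)
    next_state = max(x_k - u_k * t_s / v_k + w_k, 0.0)   # f_k(x_k, u_k)
    return g, next_state

# Hypothetical slot: 80 queued and 400 incoming queries, 0.05 s mean completion
# time, 10-second slots, 12 active machines out of a cluster of 20.
print(slot_cost(x_k=80, w_k=400, u_k=12, v_k=0.05, m=20, t_s=10.0))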

Figure 2.9: Transition graph for a deterministic problem with 2 machines [2].

Finding the best sequence of decisions is thus equivalent to finding the minimum length path from the initial node (at stage x_0) to the artificial terminal node of the graph. If w̄_k, the estimated number of incoming queries in a time slot k, is estimated using the actual number of incoming queries in the same time slot of a previous day, then the minimum path can be computed a priori, e.g., with the Dijkstra algorithm; we denote this solution as LongTerm. Otherwise, estimating the number of incoming queries using current and historical values, we look only one step ahead and, at each stage, we select the next stage reachable with minimum cost from the current stage. We denote this solution with ShortTerm. Experiments are conducted in comparison to two baselines running queries at the maximum speed: Naïve (which always uses the maximum number of machines) and Threshold (which calculates the number of machines necessary to keep latency values under a threshold). Their results show that this self-adapting model can achieve an energy saving of 33% while only degrading the mean query completion time by 10 ms compared to baselines that provision replicas based on a previous day’s traffic.

Improvements

In this section we will discuss the possible improvements to the previous solutions, in all the considered environments.

• Monolithic Environment: We here propose alternative implementations of PESOS regarding the management of incoming queries.

1) We consider the usage of a central queue in the query router node instead of a queue in each query server node. A query is logically assigned to a core according to the type of scheduler; different types of schedulers are discussed below. When a core asks for a new query, it sends a request to the query router, which looks for the first query assigned to that core and returns it. This solution should outperform the standard one because the central queue is able to re-distribute the queries anytime a query arrives or is completed.

2) We propose different scheduling techniques to assign the incoming queries. By default, a new query is assigned to the core with the lowest number of queries in the queue. This is the classical load balancing strategy, which we refer to as LL (Least Loaded). We then decided to schedule the incoming queries in two further ways, exploiting the query efficiency predictors: the core with the lowest predicted frequency (LPF) and the core with the lowest completion time (LCT). The former strategy assigns the query to the core that predicts the lowest frequency if that query were to be assigned to it. The latter strategy computes the time a core needs to complete all its enqueued queries at the maximum frequency; the query is assigned to the core showing the lowest completion time.

3) Along with these new strategies we implemented a classical technique called Job Stealing: it allows an idle core to literally steal a job (in this case a query) from the queue of another core. This technique should "adjust" the load among the cores in case one is faster than another. A query is stolen from the core currently running at the highest frequency.
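The following sketch illustrates how the three dispatching strategies could be implemented by the query router. The CoreState interface and its methods are hypothetical placeholders for the per-core information kept by the router, not the actual classes of our code base.

import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the three dispatching strategies (LL, LPF, LCT).
// CoreState is a hypothetical view of the per-core state known to the router.
interface CoreState {
    int queueLength();                      // queries currently enqueued on the core
    double predictFrequency(String query);  // frequency predicted if 'query' were assigned to the core
    double completionTimeAtMaxFreq();       // time to drain the queue at the maximum frequency
}

enum Strategy { LEAST_LOADED, LOWEST_PREDICTED_FREQUENCY, LOWEST_COMPLETION_TIME }

final class QueryDispatcher {
    CoreState selectCore(List<CoreState> cores, String query, Strategy s) {
        Comparator<CoreState> order;
        switch (s) {
            case LEAST_LOADED:                // LL: core with the fewest enqueued queries
                order = Comparator.comparingInt(CoreState::queueLength);
                break;
            case LOWEST_PREDICTED_FREQUENCY:  // LPF: lowest predicted frequency for this query
                order = Comparator.comparingDouble(c -> c.predictFrequency(query));
                break;
            default:                          // LCT: earliest completion time at the maximum frequency
                order = Comparator.comparingDouble(CoreState::completionTimeAtMaxFreq);
        }
        return cores.stream().min(order).orElseThrow();
    }
}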

• Distributed Environment: In a distributed context, each shard has a different inverted index, hence a different posting list for each query. Therefore, different shards could lead to different processing times for the same query, resulting in a possible waste of time and energy for the fastest nodes with respect to the slowest ones. What we investigate for the distributed version of PESOS is the possibility of reducing the frequency of a core in case it is running faster than another for the same queries. This difference in completion time produces what we call slack time. To this end, it was necessary to implement a global controller, called PESOS controller, running in the broker node and evaluating the slack time using the following formulas:

$$et(q, n) = \sum_{j \in Queue(n)}^{pos(q,n)} T_{rs}(t, j) \qquad (2.10)$$

$$T_b(t, n) = \min_{\forall q \in Queue(n)} \; \max_{\forall k \neq n} \; et(q, k) \qquad (2.11)$$

where $pos(q, n)$ is the position of query $q$ in the queue of node $n$, $T_{rs}(t, j)$ is the residual service time of query $j$, calculated as the predicted service time of query $j$ at the maximum frequency minus the elapsed time $t - t_j$, and $T_b(t, n)$ is the extra time budget for node $n$. Equation 2.10 returns the extra time budget at time $t$ for the query $q$, summing the residual service times from the first query in the queue until the input query $q$ is found. Equation 2.11 is a bit more involved: it first applies a maximum to get the highest extra time budget for a query among the other nodes ($k \neq n$), then a minimum to guarantee that the extra time budget is not granted for a single query only, but also accounts for the other queries waiting in the queue. This minimum ensures that the slack time satisfies all the latency thresholds. When a new query arrives or a server node completes a query, the PESOS controller evaluates the slack time: for a new query the new slack time is sent to all the server nodes, while in the latter case it is sent only to the nodes which have not yet completed the query. The slack time is then added to the base time budget of the local PESOS controller, resulting in a longer time to answer the query, and the core frequency is re-evaluated according to this new time budget.

In case of clusters with thousands of nodes, this solution can easily be parallelized by assigning a different node to each thread.
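A sketch of how the PESOS controller could evaluate Equations 2.10 and 2.11 is reported below. The Node and QueryInfo abstractions and the residualServiceTime() helper are hypothetical and simply mirror the quantities defined above.

import java.util.List;

// Sketch of the slack-time evaluation performed by the global PESOS controller
// (Equations 2.10 and 2.11). Node and QueryInfo are hypothetical views of the
// per-shard queue state known to the broker.
interface Node { List<QueryInfo> queue(); }
interface QueryInfo { long id(); double residualServiceTime(double now); }

final class PesosController {

    // Equation 2.10: sum of the residual service times of the queries in n's queue,
    // from the first one up to (and including) the input query q.
    static double extraTime(Node n, long queryId, double now) {
        double sum = 0.0;
        for (QueryInfo j : n.queue()) {
            sum += j.residualServiceTime(now);
            if (j.id() == queryId) break;   // stop once the input query is reached
        }
        return sum;
    }

    // Equation 2.11: for every query in n's queue take the largest extra time budget
    // observed on the other nodes (k != n), then take the minimum over the queries so
    // that the granted slack does not violate any latency threshold.
    static double timeBudget(List<Node> nodes, Node n, double now) {
        double budget = Double.POSITIVE_INFINITY;
        for (QueryInfo q : n.queue()) {
            double slowestOther = 0.0;
            for (Node k : nodes) {
                if (k == n) continue;
                slowestOther = Math.max(slowestOther, extraTime(k, q.id(), now));
            }
            budget = Math.min(budget, slowestOther);
        }
        return budget;
    }
}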

• Replicated Environment: In a replicated environment, we decided to explore what happens if we turn off the unused nodes. In Section 2.4 we said that the time to wake up a node is about 200 seconds. This holds if we use the replicated nodes only for query processing. Indeed, while idle, these nodes can be used for any other phase of the search engine, such as indexing or crawling. Therefore, we turn off the unused nodes only to evaluate the energy expenditure of the query processing; that is why we do not account for the wake-up energy consumption. Starting from the idea presented in Section 2.4, we decided to investigate in depth which configuration saves the most energy in a replicated context. The previous solution bases its decisions on a graph, following the shortest path from the source to the ending node. The decision of which arc to choose is based on several factors, such as the number of enqueued queries and the time to complete them according to the CPU's power expenditure.

Therefore, we deployed a slightly different solution which takes into consideration the utilization factor of each server node rather than the mean completion time of the incoming queries. Firstly, we run the monolithic version of the algorithm (whatever it is), computing the energy consumption of the server node and collecting, in a file, its average utilization factor during a previous day (it is preferable to compute the average utilization factor among different days). As in the previous solution, we collect them in time slots of length k. Then, we use this file to run the replicated version of the system in an iterative manner, until we find the best configuration. The number of iterations is limited by the number of time slots. The proposed algorithm works as follows: we store in an array U the utilization factors contained in the file. Then, we build the array U′ sorted in decreasing order of utilization factor. The number of active replicas depends on the current utilization factor with respect to a target parameter T. Such parameter takes values from U′ according to the current iteration, that is, at the i-th iteration we have T = U′[i] (starting from 0). The number of active replica nodes is given by the following formula:

$$f(u, n) = \begin{cases} n + 1 & \text{if } u \geq T \\ n - 1 & \text{otherwise} \end{cases}$$

where u is the utilization factor of the current time slot (i.e., u = U[j] where j is the current time slot) and n is the number of replica nodes used in the previous slot. If the relation u ≥ T is satisfied we increase the number of replicas by 1 (up to the maximum), otherwise we reduce it by 1 (not less than the minimum). At the end of the day we compare the current energy consumption with the current minimum, initialized with the one computed for the monolithic version: if the energy consumption is reduced, we store the current T and move to the next target as previously described, otherwise we stop. The best target is the one chosen in the previous iteration. Note that it may not be possible to reduce the energy consumption of the monolithic version; in that case, the best solution is not to use replicated nodes for query processing.
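The iterative search described above can be summarized by the following sketch. Variable names follow the description, while the DaySimulator interface and its runDayWithReplicas() method are hypothetical placeholders for one simulated day with the given activation rule.

import java.util.Arrays;

// Sketch of the iterative search for the best utilization target T.
// U[j] holds the average utilization factor of time slot j collected from a
// previous (monolithic) run.
final class ReplicaTuner {

    interface DaySimulator {
        // Simulates one day applying the per-slot activation rule with target T
        // and returns the consumed energy (hypothetical placeholder).
        double runDayWithReplicas(double[] U, double T, int minReplicas, int maxReplicas);
    }

    // Per-slot activation rule f(u, n): one more replica if u >= T, one less otherwise,
    // clamped between the minimum and the maximum number of replicas.
    static int nextReplicas(double u, int n, double T, int min, int max) {
        return u >= T ? Math.min(n + 1, max) : Math.max(n - 1, min);
    }

    static double bestTarget(double[] U, int minReplicas, int maxReplicas,
                             double monolithicEnergy, DaySimulator sim) {
        // U': utilization factors sorted in decreasing order.
        double[] sorted = U.clone();
        Arrays.sort(sorted);
        for (int l = 0, r = sorted.length - 1; l < r; l++, r--) {  // reverse to decreasing order
            double tmp = sorted[l]; sorted[l] = sorted[r]; sorted[r] = tmp;
        }

        double bestEnergy = monolithicEnergy;   // minimum initialized with the monolithic run
        double best = Double.NaN;               // NaN: replication never pays off
        for (double T : sorted) {
            double energy = sim.runDayWithReplicas(U, T, minReplicas, maxReplicas);
            if (energy >= bestEnergy) break;    // stop at the first non-improving target
            bestEnergy = energy;
            best = T;                           // best target so far
        }
        return best;
    }
}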


Alternative Solutions

During the study of alternative solutions to PESOS, we found some other interesting works that could be adopted together with, or as an alternative to, PESOS. For instance, some interesting works [11, 13] rely on Queueing Networks (QN). The work in [12] is also based on Model Predictive Control (MPC). Such a model is described by a linear system that continuously configures the adaptation knobs of a QN (i.e., routing probabilities and service rates), tracking the desired performance requirements. We could adapt such a model by setting the frequency and the tail latency as performance requirements. Their solution can also be applied in a distributed environment.

Moreover, Bini et al. [39] consider the possibility of reducing the energy consumption in real-time systems. Tasks are composed of a fraction of time that scales with the frequency and another that does not (e.g., I/O operations). The power-aware processor is characterized by a set of operating modes. Given the task schedule, they compute the optimal frequency to complete the tasks, and they aim at finding the processor mode that minimizes the power consumption according to such optimal frequency and the arrival of tasks. We can adapt this model to a search engine where tasks are queries.
Another work exploits machine learning techniques for data center optimization [15]. It uses a three-layered neural network to reduce the energy consumption of Google's data centers, considering several features such as the total server IT load, the outdoor wind speed, the total number of chilled water injection pumps running, etc. We could adapt this system to improve the prediction of the completion time of queries.

Finally, we think that query processing parallelization [17, 18] could be integrated with PESOS, with the possibility of improving its performance while reducing, at the same time, the energy consumption.


Chapter 3

Energy Efficiency Simulator

In this chapter we describe the simulator used to analyze the discussed algorithms. Section 3.1 presents the architecture of the simulator, while Section 3.2 shows how to model a search engine using it.

Evaluating the energy efficiency of an algorithm requires the use of a CPU, or a cluster of CPUs, for an entire day. In case of a bug, the algorithm must be run again, wasting a lot of time, resources, and energy. To avoid this situation we decided to implement a discrete-event simulator in which the techniques of interest can be evaluated.

Large-scale search engines, as described in Section 2.1, can be seen as multi-component systems whose individual design, implementation, deployment, and operation are in constant evolution, and thus it is of paramount importance to be able to predict their performance with precise and practical methods. We have found discrete-event simulation to be a useful tool in this context because it enables us to both represent the actual system in a one-to-one correspondence with its main components (including user behavior) and simulate the cost of their relevant operations in a precise and high-level manner.
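To fix ideas, the core of a discrete-event simulator is an agenda of time-stamped events processed in chronological order. The following minimal sketch illustrates the principle; it is a generic illustration, not our actual implementation.

import java.util.Comparator;
import java.util.PriorityQueue;

// Minimal discrete-event simulation loop: events are ordered by timestamp and
// processed one at a time; handling an event may schedule new future events.
final class MiniDES {

    interface Event {
        double time();             // simulated time at which the event fires
        void handle(MiniDES sim);  // processing logic; may schedule further events
    }

    private final PriorityQueue<Event> agenda =
            new PriorityQueue<>(Comparator.comparingDouble(Event::time));
    private double clock = 0.0;

    void schedule(Event e) { agenda.add(e); }

    double now() { return clock; }

    void run(double endTime) {
        while (!agenda.isEmpty()) {
            Event e = agenda.poll();
            if (e.time() > endTime) break;  // simulation horizon reached
            clock = e.time();               // advance the simulated clock
            e.handle(this);
        }
    }
}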

In the literature we can find plenty of simulators. One of the most popular simulators is NS2 [35], in which many network protocols are implemented and results can be obtained quickly, so that more ideas can be tested in a smaller time frame. Every element of the network has an associated Agent that lets nodes communicate with each other by means of event generators. Unfortunately, NS2 is not well suited to model real systems with a complicated structure. CloudSim [34] is a well-known toolkit for the simulation of cloud scenarios. It enables users to gain proper insight into cloud scenarios without worrying about the low-level implementation details. All components in CloudSim communicate through message passing.

Our implementation gets the best from the two simulators introduced before: we decided to implement a layered infrastructure where each layer can rely on the services provided by the underlying layer. The architecture of our simulator is explained in depth in the next sections.

3.1 Simulator Architecture

This simulator, written in Java, is basically a network simulator where nodes can act as modules of a system. Once the network has been established, the definition of each module is quite easy.

The architecture of our simulator is made of 3 layers: Network, Resources and Services, as shown in Figure 3.1. This hierarchical representation allows us to handle systems of ever-increasing complexity by adding or changing layers to improve the overall system capability while reusing the components that are already in place.


Figure 3.1: The 3 layers composing the architecture of the simulator (adapted from [34]).

Each layer is responsible for offering a reliable yet simple interface to be accessed by the upper layer. For this reason the creation of a network can be done in a few steps: first we create the topology of the network, then we add an Agent to the nodes we want to control and finally we add the Event-Generators to let nodes communicate with each other. Extending such a network to represent a search engine is straightforward: all the search engine components can be represented by a node with a proper Agent responsible for sending or receiving the incoming messages. Client nodes send queries at a given queries per second (QPS) rate while a server, upon receiving a query, answers with a set of retrieved pages.

The work in [36] explains how to build a search engine simulator. It applies the bulk-synchronous parallel (BSP) model of parallel computation [37] for a macroscopic representation of the system (cluster-of-processors level) and Multi-BSP [38] for a microscopic representation of the system (multi-core processor level). These models provide a well-defined structure of parallel computation that simplifies the determination of hardware and system software costs, as well as the debugging and verification of simulation programs. In BSP, the cluster of processors is seen as P processors with local memory, where communication among them is performed via point-to-point messages. Parallel execution is organized as a sequence of supersteps. In each superstep, the processors can only work on local data and send messages to other processors. The barrier synchronization concludes the superstep: it ensures that all communications are properly concluded.

Our implementation takes inspiration from these models: components capable of processing (i.e., nodes or processors) communicate with each other through a network that routes messages between pairs of such components. The only difference is that we do not need any synchronization mechanism during the parallel computations.

3.1.1 First Layer: Network

The first layer is responsible for modeling the underlying network topology, made of nodes and links. It is also responsible for correctly managing the sending and receiving of the events generated by the various nodes. The network can be instantiated and managed using a NetworkTopology object to which nodes and links can be added. Nodes and links can be generated programmatically, with the appropriate class, or using a JSON file. To explain how they work, let us assume we have a network with just one client and one server connected by a 70 Mb/s link. A node, represented by the NetworkNode class, must contain the following fields:

• id: unique identifier representing a node. It is used by the upper layer to retrieve the associated node;

• name: the name of the node;

• delay: internal delay of the node, i.e. the time to execute a task. It must be expressed in milliseconds.

We must now define the link. A link, represented by the NetworkLink class, is defined by the following fields:

• fromId: the source node identifier;
• destId: the destination node identifier;
