UNIVERSITY OF PISA
SCUOLA SUPERIORE SANT’ANNA

Computer Science & Networking

ENERGY – PERFORMANCE TRADE-OFF IN PAGE RANKING ALGORITHMS ON BIG.LITTLE ARCHITECTURES

Candidate: STEFANO LORIGA
Referee: Dr. MASSIMO COPPOLA
Supervisor:
Academic Year 2016-2017


INDEX

1. Introduction
2. ARM machines tested
   2.1 ODROID-XU4
      2.1.1 Processor's history
      2.1.2 Hardware information
   2.2 ODROID-C2
      2.2.1 Processor information
      2.2.2 Hardware information
3. Problem definition
   3.1 Page ranking
   3.2 The selection of the damping factor
4. PageRank algorithms
   4.1 Power method
   4.2 Arnoldi method
      4.2.1 Residual determination
      4.2.2 Arnoldi PageRank
5. Parallelization
   5.1 Data structure used
   5.2 Data parallel
   5.3 Implementation
6. Parallel programming framework
   6.1 FastFlow
   6.2 pThread
7. Algorithms: C++ codes
   7.1 Codes for ODROID-XU4
      7.1.1 Power method (CSR)
      7.1.2 Power method
      7.1.3 Arnoldi method
   7.2 Codes for ODROID-C2
8. Evaluation libraries
   8.1 MAMMUT: energy profiling
   8.2 Parameters for performance evaluation
9. Tests and results
   9.1 Performance results
   9.2 Energy results
10. Conclusion

ABSTRACT

The thesis discusses experiments on the execution of page ranking algorithms on ARM big.LITTLE heterogeneous architectures, looking for optimal performance and energy consumption. Different algorithms have been implemented with different configurations of the machine (fast cores, slow cores, mix of cores).

Additional experiments have been made on a more classical 64-bit ARM multicore for comparison.


1. INTRODUCTION

The aim of this master thesis is to study the behaviour of ARM big.LITTLE and standard ARM architectures when executing ranking algorithms, focusing on the trade-off between performance and energy consumption.

These machines were tested with different parallel applications solving the PageRank problem, i.e. an iterated matrix-vector product used to rank all the pages in a web network.

The choice of this algorithm is due to the fact that it has two features fundamental to our aim: it stresses both memory and computation.

Clearly, it was not possible to rank millions of pages with the limited memory provided by the machines, but this constraint gives the possibility to analyse the trade-off between memory and performance.

In fact, the different implementations proposed are meant to stress the limited cache memory and to observe the behaviour of processors with different computing power.

These small but capable single-board computers are the Odroid-XU4, which includes an Exynos 5 Octa (5422) processor, and the Odroid-C2, which includes an Amlogic S905.

The processor of the first machine belongs to the Exynos series of ARM-based Systems-on-Chip (SoCs) developed and manufactured by Samsung Electronics, while the second one is produced by Amlogic Inc., an American technology company predominantly focused on designing and selling SoC (System on Chip) integrated circuits.

We have decided to run these algorithms on different architectures to compare the behaviour of ARM big.LITTLE and standard ARM designs, and to see the difference between the 64-bit machine (Odroid-C2) and the 32-bit one (Odroid-XU4); we expect the Odroid-C2 to be more energy efficient than the Odroid-XU4.

The thesis is organized as follows:

• Chapter 2 presents a detailed description of the characteristics of the machines used;
• Chapter 3 defines the page ranking problem from a theoretical point of view;
• Chapter 4 describes two different algorithmic implementations of page ranking;
• Chapter 5 presents the parallelization model, including the data structures and patterns used;
• Chapter 6 describes the parallel programming frameworks exploited;
• Chapter 7 includes the algorithm codes implemented in the C++ programming language;
• Chapter 8 describes the libraries used to evaluate performance and energy consumption.


2. ARM MACHINES TESTED

2.1 ODROID-XU4

Odroid-XU4 is powered by ARM big.LITTLE technology, the Heterogeneous Multi-Processing (HMP) solution.

2.1.1 PROCESSOR’S HISTORY

In 2010, Samsung launched the S5PC110 (now Exynos3 Single) in its Samsung Galaxy S mobile phone, which featured a licensed ARM Cortex-A8 CPU.

In early 2011, Samsung first launched the Exynos 4210 SoC in its Samsung Galaxy S II mobile smartphone. The driver code for the Exynos 4210 was made available in the Linux kernel and support was added in version 3.2 in November 2011.

On 29 September 2011, Samsung introduced Exynos 4212 as a successor to the 4210; it features a higher clock frequency and "50 percent higher 3D graphics performance over the previous processor generation".

Built with a 32nm High-K Metal Gate (HKMG) low-power process, it promises a "30 percent lower power-level over the previous process generation”.

On 30 November 2011, Samsung released information about their upcoming SoC with a dual-core ARM Cortex-A15 CPU, which was initially named "Exynos 5250" and was later renamed to Exynos5 Dual. This SoC has a memory interface providing 12.8 GB/s of memory bandwidth, support for USB 3.0 and SATA 3, can decode full 1080p video at 60 fps along with simultaneously displaying WQXGA-resolution (2560x1600) on a mobile display as well as 1080p over HDMI.

On 26 April 2012, Samsung released the Exynos4 Quad, which powers the Samsung Galaxy S III and Samsung Galaxy Note II. The Exynos4 Quad SoC uses 20% less power than the SoC in Samsung Galaxy S II.

The latest release is the Exynos 5 Octa, a processor offering notable graphics performance thanks to its advanced GPU, combining high processing power with energy efficiency.

This last version, more precisely the Exynos 5422, is the one that has been used. It has an octa-core CPU, composed of a quad-core ARM Cortex-A15 (1.9–2.1 GHz) and a quad-core ARM Cortex-A7 (1.3–1.5 GHz) in ARM big.LITTLE configuration. Its memory technology is a 32-bit, dual-channel 933 MHz LPDDR3e (14.9 GB/s).

Now, let's see in more detail the hardware information of this machine.

2.1.2 HARDWARE INFORMATION

ODROID XU4 is a new generation of computing device with powerful, energy-efficient hardware and a small form factor. Offering open source support, the board can run various flavours of Linux, including the latest Ubuntu 16.04 and Android 4.4 KitKat, 5.0 Lollipop and 7.1 Nougat. By implementing the eMMC 5.0, USB 3.0 and Gigabit Ethernet interfaces, the ODROID-XU4 boasts optimal data transfer speeds, a feature that is increasingly required to support advanced processing power on ARM devices. This allows users an upgrade in computing, with faster booting, web browsing, networking and 3D games.

Its processor is a Samsung Exynos 5 Octa [1] (Exynos 5422) with ARM Cortex™-A15 and Cortex™-A7 CPUs: the Cortex™-A15 is a quad-core with a maximum frequency of 2.1 GHz, while the Cortex™-A7 is a quad-core with a maximum frequency of 1.5 GHz.

These are used together in a big.LITTLE architecture, coupling relatively battery-saving and slower processor cores (LITTLE) with relatively more powerful and power-hungry ones (big). The "big", or faster, cores are used for computation-intensive tasks such as gaming, whereas the "little", or slower, cores are used for less intensive tasks.

It presents a GPU (Graphics Processing Unit) ARM Mali-T628 MP6. The ARM Mali-T628 MP6 is a separate graphics processor solely intended for the accelerated creation of images to be output to a display.

The external interfaces include a dual-channel RAM interface (DDR2, LPDDR2, LPDDR3, DDR3L SDRAM), primary camera support up to 16 MP, H.264 & VP8 video encoding and decoding, and display resolution support up to 2560x1600 pixels.

It presents four indicator LEDs that provide visual feedback:

1. Red LED (Power): hooked up to 5 V power;
2. Blue LED (Alive): solid light indicates that u-boot is running;


All the components described above are visible in the following figure (Fig. 1 [2]).

Fig. 1: ODROID-XU4

2.2 ODROID-C2

ODROID-C2 is a single-board computer from the Korean company Hardkernel. It is considered among the most advanced architectures for mobile devices and embedded 64-bit computing.

The next sections present information about the processor and the hardware characteristics.

2.2.1 PROCESSOR INFORMATION

The Amlogic S9 series is the first 64-bit line-up of Amlogic media-player SoCs. All members of the S9 family were reported to be internally limited to 1.5 GHz instead of the advertised 2.0 GHz clock rate. As of today, it is not clear whether the limitation is due to hardware, firmware or software issues.

2.2.2 HARDWARE INFORMATION

The ODROID-C2 is a 64-bit quad-core single-board computer (SBC) and one of the most cost-effective 64-bit development boards available in the ARM world. It is considered to be the most powerful low-cost single-board computer available, as well as an extremely versatile device.

It can be used as a home theatre set-top box, a general computer for web browsing, gaming and socializing, a compact tool for college or office work, a prototyping device for hardware tinkering, a controller for home automation, a workstation for software development, and much more.

Several modern Linux/Android distributions run on the ODROID-C2, such as Ubuntu, Fedora, Android and Debian.

The ODROID-C2 is an ARM device, and ARM is considered the most advanced architecture for mobile devices and embedded 64-bit computing. The ARM processor's small size, reduced complexity and low power consumption make it very suitable for miniaturized devices such as wearables and embedded controllers.

The processor is an Amlogic S905 [3], a 64-bit ARM quad-core running at 1.5 GHz, paired with 2 GB of DDR3 RAM, Gigabit Ethernet and an IR receiver.

The size of this computer is still only 85x56 mm with a weight of 40g, and offers silent operation, 2~5W average power usage, and instant portability.

The following figure (Fig. 2 [4]) represents all the hardware components of the machine in a block diagram:


3. PROBLEM DEFINITION

PageRank is a link analysis algorithm and it assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references [5].

The numerical weight that it assigns to any given element E is referred to as the PageRank of E itself and denoted by PR(E).

PageRank results from a mathematical algorithm based on the web graph, created by all World Wide Web pages as nodes and hyperlinks as edges, and takes into consideration authority hubs.

The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it (i.e. incoming links). A page that is linked to by many pages with high PageRank receives a high rank itself. In practice, the PageRank concept may be vulnerable to manipulation.

Research has been conducted into identifying falsely influenced PageRank rankings. The goal is to find an effective means of ignoring links from documents with falsely influenced PageRank.

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called iterations, to adjust approximate PageRank values to more closely reflect the theoretical true value.

The PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the number of outbound links [6].

So, assuming a small universe of four web pages A, B, C and D, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by its number of outbound links L(·):

$$PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}$$

Then in general, the PageRank value for a generic page u can be expressed as:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set 𝐵𝑢, that is the set containing all pages linking to page u, divided by the number L(v) of links from page v.
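As a quick worked example with purely illustrative numbers: if pages B, C and D each currently hold a rank of 0.25 and have 2, 1 and 3 outbound links respectively, the formula above gives

$$PR(A) = \frac{0.25}{2} + \frac{0.25}{1} + \frac{0.25}{3} \approx 0.125 + 0.250 + 0.083 = 0.458$$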

The PageRank theory assumes that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set to 0.85.

The damping factor is subtracted from 1 (and in some variations of the algorithm, the result is divided by the number of documents (N) in the collection) and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores.

$$PR(A) = \frac{1-d}{N} + d\left(\frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}\right)$$

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages.

If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.

When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection. Their PageRank scores are therefore divided evenly among all other pages. In other words, to be fair with pages that are not sinks, these random transitions are added to all nodes, with a residual probability d called damping factor, estimated from the frequency that an average surfer uses his or her browser's bookmark feature.

So, let $p_i \in N$ be a web page in the target network $G = (N, E)$ and let its PageRank value be denoted by $PR_{p_i}$. Then $PR_{p_i}$ is evaluated as:

$$PR_{p_i} = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR_{p_j}}{L(p_j)}$$

where $p_1, p_2, \dots, p_N$ are the pages under consideration, $M(p_i)$ is the set of pages that link to $p_i$, $L(p_j)$ is the number of outbound links on page $p_j$, and $N$ is the total number of pages.

The PageRank values are the entries of the dominant right eigenvector of the modified adjacency matrix:

$$R = \begin{bmatrix} PR(p_1) \\ \vdots \\ PR(p_N) \end{bmatrix} = \begin{bmatrix} (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix} + d \begin{bmatrix} l(p_1,p_1) & \cdots & l(p_1,p_N) \\ \vdots & \ddots & \vdots \\ l(p_N,p_1) & \cdots & l(p_N,p_N) \end{bmatrix} R$$

where the adjacency function 𝑙(𝑝𝑖, 𝑝𝑗) is 0, if 𝑝𝑖does not link to 𝑝𝑗, and normalized such that, for each j

$$\sum_{i=1}^{N} l(p_i, p_j) = 1$$

i.e. the elements of each column sum up to 1, so the matrix is a stochastic matrix.

PageRank can be computed either iteratively or algebraically. The iterative method can be viewed as the power iteration method or the power method. The basic mathematical operations performed are identical.


3.1 PAGE RANKING

Given a directed graph G = (N, E), with N being the set of nodes and E the set of edges, we can equivalently represent it by means of its adjacency matrix $A_G = (a_{ij}) \in M_{|N|\times|N|}([0,1])$, in which each entry $a_{ij}$ is defined as:

$$a_{ij} = \begin{cases} 1/O_i & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

where $O_i$ is the number of outlinks of node i.

In this way, the PageRank problem can be seen as a matrix-problem:

$$PR^{(k+1)} = A_G^T\, PR^{(k)}, \qquad k = 1, 2, \dots$$

where the notation $PR^{(k)}$ indicates the value of the PageRank vector at the k-th iteration of the algorithm. This formulation of the problem leads us to the strategy called the Power Method.

The theory behind the PageRank problem relies on the Perron-Frobenius theorem for irreducible and stochastic matrices. The theorem is not stated here, nor are all related details, since this is not the aim of this report. However, it is worth observing that, in general, the matrix $A_G$ might be neither stochastic nor irreducible.

Mathematically, these two problems are solved by introducing some modifications to the matrix in order to guarantee the existence and uniqueness of the PageRank vector.

It has been shown that working with a modified matrix, say $A_G'$, defined as:

$$A_G' \triangleq (1-d)\,\frac{e\,e^T}{|N|} + d\,A_G$$

solves both problems, where $d \in [0,1]$ is the damping factor, i.e. the ability to jump from one node to another regardless of the link structure of the matrix, and $e$ is a vector of all ones.


Just to give an intuition, the fact that we are dividing by |N| solves the problem of having a stochastic-by-row matrix, while the irreducibility is solved by the introduction of the damping factor.

The damping factor represents a crucial parameter for the PageRank problem, since it expresses the well-known trade-off between accuracy of the results and speed of convergence of the algorithm.

3.2 THE SELECTION OF THE DAMPING FACTOR

It may be shown that if our matrix $A_G^T$ has eigenvalues $\{1, \lambda_2, \lambda_3, \dots\}$, our new matrix $A_G'^T$ will have the eigenvalues $\{1, d\lambda_2, d\lambda_3, \dots\}$.

The value d therefore heavily influences our calculations [6]. It can be shown that the Power method converges approximately at the rate $C\,|\lambda_2/\lambda_1|^m$ (with m the iteration number), and as we created our final matrix $A_G'^T$ we scaled down all eigenvalues (as shown above) except the largest one by the factor d, so that $|\lambda_2/\lambda_1| \le d$. Hence:

A small d (≈ 0.7) would therefore mean that the Power method converges quickly. This also means that our result would poorly describe the underlying link structure as we allow the teleportation to heavily influence the result.

A large d (≈ 0.9), on the other hand, means that the Power method converges slowly, but the answer better describes the properties of the real underlying link structure.
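To make this trade-off concrete, assuming the error of the Power method shrinks roughly by a factor d at each iteration (i.e. the asymptotic rate $|\lambda_2/\lambda_1| \approx d$), the number of iterations m needed to reach a tolerance $\varepsilon$ is about

$$m \approx \frac{\ln \varepsilon}{\ln d}\,, \qquad \text{e.g. for } \varepsilon = 10^{-3}: \quad m \approx 19 \text{ when } d = 0.7, \qquad m \approx 66 \text{ when } d = 0.9.$$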


4. PAGERANK ALGORITHMS

Now, we present the algorithms used to test the machines. These are the classical Power method, implemented with two different storage methods, and the Arnoldi method.

4.1 POWER METHOD

According to the considerations of the previous section, the Power Method applied to the PageRank problem has been implemented following this pseudo-code (Algorithm 1).

Algorithm 1 – Power method PageRank
Require: $[A_G^T, d]$
Ensure: $[PR]$
1: initialize $PR^{(0)}$ with a uniform starting value, $[1/N, \dots, 1/N]$
2: for $k = 1, 2, \dots$ do
3:     $PR^{(k)} = \frac{1-d}{|N|}\, e + d\, A_G^T\, PR^{(k-1)}$

The Power Method is a classical and, under some aspects, obsolete way of calculating the dominant eigenvector of a matrix. However, there are some strengths of this method:

• it only requires a matrix-by-vector product for each iteration;
• it does not alter the initial matrix;
• we only need to store the previously computed PageRank vector.

And these are indeed the reasons why Google still makes use of this technique. As mentioned above, the Power method in many cases is an obsolete method.

However, there are some better methods that can be used when we are only interested in a subset of the entire internet, e.g. a university’s or a small country’s link structure. With these smaller matrices, we can use more memory for our calculations.
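As a reference for the parallel, instrumented versions of Chapter 7, a minimal sequential sketch of one Power-method iteration on a dense (already transposed and normalized) matrix could look as follows; the function and variable names are illustrative:

#include <cstddef>
#include <vector>

// One Power-method step: prNew = (1-d)/N * e + d * At * pr,
// where At is the transposed, column-normalized web matrix.
std::vector<double> powerStep(const std::vector<std::vector<double>>& At,
                              const std::vector<double>& pr, double d) {
    const std::size_t N = pr.size();
    std::vector<double> prNew(N, (1.0 - d) / N);   // teleportation term
    for (std::size_t i = 0; i < N; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < N; ++j)
            sum += At[i][j] * pr[j];               // row i of At times the old rank vector
        prNew[i] += d * sum;                       // damped link contribution
    }
    return prNew;                                  // only the previous vector needs to be kept
}

Iterating this step a fixed number of times (or until the difference between two consecutive vectors falls below a threshold) yields the Power method of Algorithm 1.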


4.2 ARNOLDI METHOD

The Arnoldi Method for the PageRank problem is an alternative and advanced way to compute the PageRank vector of a target network. We will see that this method is more computationally expensive with respect to the Power Method, but it requires fewer iterations.

In what follows we first analyse the maths underlying the Arnoldi Method and then describe how this is used for PageRank computations [7].

Definition 1. Let $A \in M_{n \times n}$ be a matrix and $x \neq 0$ a vector of length n; we define the sequence

$$x, Ax, A^2x, A^3x, \dots$$

as the Krylov sequence based on A and x.

Definition 2. Given a Krylov sequence based on A and x, we define the matrix

$$K_k(A, x) = [\,x \mid Ax \mid A^2x \mid \dots \mid A^{k-1}x\,] \in M_{n \times k}$$

as the k-th Krylov matrix.

Definition 3. The k-th Krylov subspace $\mathcal{K}_k(A, x)$ is the subspace spanned by the columns of the k-th Krylov matrix $K_k(A, x)$, i.e.:

$$\mathcal{K}_k(A, x) = \mathrm{Span}(x, Ax, A^2x, \dots, A^{k-1}x)$$

The Arnoldi algorithm is used to generate an orthonormal basis $Q_k = [\,q_0 \mid q_1 \mid \dots \mid q_{k-1}\,] \in M_{n \times k}$ for $\mathcal{K}_k(A, x)$ such that:

$$H_{k,k} = Q_k^T A\, Q_k$$

where $H_{k,k} \in M_{k \times k}(\mathbb{R})$ is an upper-Hessenberg matrix, i.e. a matrix in which all elements below the first subdiagonal are 0. $A$ is the matrix for the Krylov subspace; $q_0$ is the initial vector and it must be non-zero; $a$ is the accuracy value and $k$ is the predefined number of steps, i.e. the order of the Krylov subspace.

Definition 4. The eigenvalues of the matrix $H_{k,k}$, denoted by $\{\theta_1, \theta_2, \dots, \theta_k\}$, are called Ritz values of $A$. The way we can exploit the Arnoldi algorithm for the PageRank problem is the following. We first generate a small Hessenberg matrix $H_{k+1,k}$ starting from the initial matrix $A_G$ and then approximate the eigenvalues of $A_G$ from its Ritz values $\{\theta_i\}_{i=1}^{k}$.


The reason is that the Ritz values will converge, as 𝑘 → ∞ , to the eigenvalues of 𝐴𝐺 , but they are much cheaper to compute with respect to the true eigenvalues.

Then we select the eigenvector (the dominant one) corresponding to the largest Ritz value 𝜃𝑚𝑎𝑥 of 𝐴𝐺 and compute an approximation of the PageRank vector by multiplying this dominant eigenvector with the Krylov basis 𝑄𝑘 .

Algorithm 2 – Arnoldi algorithm
Require: $[A, q_0, a, k]$
Ensure: $[Q_k, H_{k+1,k}]$
1: $q_1 = q_0 / \|q_0\|_2$
2: for $j = 1, 2, \dots, k$ do
3:     $w = A q_j$
4:     for $i = 1, \dots, j$ do
5:         $h_{ij} = w \cdot q_i$
6:         $w = w - h_{ij}\, q_i$
7:     $h_{j+1,j} = \|w\|_2$
8:     if $h_{j+1,j} < a$ then
9:         break
10:    $q_{j+1} = w / h_{j+1,j}$

4.2.1 RESIDUAL DETERMINATION

It has been proved that we do not need to explicitly compute the residual norm between two consecutive PageRank vectors, as done in the Power Method. One of the biggest advantages of the Arnoldi Method for PageRank is that a cheaper residual can be computed as:

$$h_{j+1,j}\,\bigl|e_j^T\, y^{(j)}\bigr|$$

where j is the index of the outermost loop in the Arnoldi procedure, $e_j$ is the vector of all zeros except a 1 in the j-th position, and $y^{(j)}$ is the eigenvector associated to $\theta_{max}$, the largest eigenvalue of $H_{j,j}$ [8].
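In practice this check can be performed with the same Eigen calls used in the code of Section 7.1.3; a minimal sketch (names are illustrative), where H is the current j×j Hessenberg block and normW holds $h_{j+1,j}$:

#include <Eigen/Eigenvalues>
#include <cmath>

// Cheap Arnoldi residual: h_{j+1,j} * |e_j^T y|, with y the eigenvector of H
// associated with the largest (real part) Ritz value.
bool arnoldiConverged(const Eigen::MatrixXf& H, float normW, float tol) {
    Eigen::EigenSolver<Eigen::MatrixXf> es(H);
    int maxIdx = 0;
    for (int i = 1; i < H.rows(); ++i)
        if (es.eigenvalues()[i].real() > es.eigenvalues()[maxIdx].real())
            maxIdx = i;
    // |e_j^T y| is the magnitude of the last component of the dominant eigenvector
    float lastComp = es.eigenvectors().col(maxIdx)(H.rows() - 1).real();
    return normW * std::abs(lastComp) < tol;
}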

Algorithm 3 – StoppedArnoldi
Require: $[A, q_0, a, t]$
Ensure: $[Q_k, H_{k+1,k}]$
1: $q_1 = q_0 / \|q_0\|_2$
2: for $j = 1, 2, \dots$ do
3:     $w = A q_j$
4:     for $i = 1, \dots, j$ do
5:         $h_{ij} = w \cdot q_i$
6:         $w = w - h_{ij}\, q_i$
7:     $h_{j+1,j} = \|w\|_2$
8:     if $h_{j+1,j} < a$ then
9:         break
10:    $q_{j+1} = w / h_{j+1,j}$
11:    compute $H_{j,j}\, y^{(j)} = \theta_{max}\, y^{(j)}$
12:    if $h_{j+1,j}\,|e_j^T y^{(j)}| < t$ then
13:        $k = j$
14:        break

4.2.2 ARNOLDI PAGERANK

We finally present the way Arnoldi is applied to the PageRank problem. Notice that the damping factor d is considered when performing matrix-by-vector multiplication at step (3) of the Arnoldi Stopped procedure.

Algorithm 4 – Arnoldi PageRank
Require: $[A_G^T, a, t]$
Ensure: $[PR]$
1: initialize PR with an arbitrary non-zero starting value, usually uniform
2: $[Q_k, H_{k+1,k}] = \mathrm{StoppedArnoldi}(A_G^T, PR, a, t)$
3: compute $H_{j,j}\, y^{(j)} = \theta_{max}\, y^{(j)}$
4: $PR = Q_k\, y$


The main advantage of the Arnoldi method applied to PageRanking is that we are computing the eigenvalues of a relatively small matrix, since 𝑘 ≪ |𝑁|. However, with respect to the Power Method, it requires not only a matrix-by-vector multiplication, but also inner products and norm computations.

Clearly the cost of the Arnoldi algorithm increases as the number of iterations increases, for a series of reasons:

• the determination of a new vector for the basis means orthogonalizing it against all the others;
• eigenvalues and eigenvectors of the Hessenberg matrix need to be computed at each iteration;
• more inner product and norm computations are performed at each iteration (orthogonalization);
• the Hessenberg matrix grows at each iteration.

On the other hand, we expect that the Arnoldi Method for PageRank will converge within fewer iterations than the Power Method.


5. PARALLELIZATION

In this section, we explain the main points followed when parallelizing the described PageRank algorithms, i.e.:

• which data structures have been implemented and used;
• how to code a fast matrix-by-vector product for sparse input matrices;
• which parallel skeleton has been used.

5.1 DATA STRUCTURE USED

Basically, we need an efficient data structure for storing the sparse web matrix and the PageRank vector.

We wish to be able to multiply two real values, one belonging to our web matrix and the other to the PageRank vector, in an efficient way, thus avoiding traversing the entire row/vector, which is the most expensive operation in PageRank computations.

Therefore, the chosen data structure is a std::vector<> whose elements are Tuple objects.

This object has been implemented to handle the input sparse matrix in a compressed form: a Tuple is composed of the index of the source node of the link, the index of the node reached by the arc, and the arc value itself, i.e. (idxI, idxJ, value).

The data structure for the matrix is built upon the previous observation. We have chosen to store our sparse matrices row-oriented, i.e. the matrix is represented by a sequence of rows. The Compressed Row Storage format is one of the most extensively used storage schemes for general sparse matrices, with minimal storage requirements. Algorithms for many important mathematical operations are easily implemented, for example sparse matrix-vector multiplication (SpMxV) [10].

The decision to adopt a Compressed Row Storage (CSR) scheme, in which we store only the non-zero entries of the web matrix, reduces the space occupied, which becomes proportional to the number of non-zero items.



The other data structure used is an uncompressed matrix. This has been used to test the machines while observing their behaviour with memory full. To obtain this matrix data structure we have used a std::vector<std::vector<double>>.

This implementation aims to stress the memory because all the items are stored, without distinction between zero and non-zero entries. So our data structure, which is a square matrix, will occupy $N^2$ space in memory.
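Under the assumption of 8-byte doubles and 4-byte integers, the two layouts occupy roughly

$$\text{dense: } 8\,N^2 \text{ bytes}, \qquad \text{tuple-based: } (4 + 4 + 8)\,nZ = 16\,nZ \text{ bytes},$$

so the compressed form wins as soon as the number of non-zero entries $nZ$ is smaller than $N^2/2$, which is always the case for sparse web matrices.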

5.2 DATA PARALLEL

The most important and costly operation in PageRank algorithms is the sparse matrix-by-vector multiplication, as already noticed. Thus, we need a parallel skeleton which works well for this kind of computation and for the residual determination.

The most natural way of implementing the algorithm is by the map skeleton, which belongs to the data parallel class of computations.

Data parallel computation is characterized by partitioning of (large) data structures and function replication. Moreover, data replication is sometimes adopted as well.

We will see that data parallelism is a very powerful and flexible paradigm, that can be exploited in several forms according to the strategy of input data partitioning and replication, to the strategy of output data collection and presentation, and to the organization of workers (independent or interacting), though this flexibility is inevitably paid with a more complex implementation of programs and of programming tools. The knowledge of the sequential computation form is necessary, both for functions and for data, and this is the main reason for the parallel program design complexity.

Besides service time and completion time, latency and memory capacity are optimized compared to farms and other stream parallel paradigms, though a potential load unbalance exists.

All workers apply the same function 𝐹 to the corresponding partitions in parallel.

Unless statically allocated, input data are distributed by program through a scatter, and/or a multicast, operation. Output data may be collected through a gather operation, however other possibilities exist, notably reduce operations.


The simplest data parallel computation is the so-called map, in which the workers are fully independent, i.e. each of them operates on its own local data only, without any cooperation during the execution of 𝐹.

The complexity in data parallel program design can be mastered through a systematic and formal approach which, starting from the sequential computation, can derive the basic characteristics of one or more equivalent data parallel computations.

We adopt the so called Virtual Processors approach consisting in two main phases:

1. according to the structure of sequential computation, we derive an abstract representation of the equivalent data parallel computation, which is characterized by the maximum theoretical parallelism compatible with the computation semantics. The modules of this abstract version are called (for historical reasons) virtual processors. Data are partitioned among the virtual processors according to the program semantics. The goal of the virtual processors version is only to capture the essential features of the data parallel computation. Because it is a maximally parallel version, often the grain size of data partitions is the minimal one, but this is not necessarily true in all cases (however, it is of a grain (much) smaller than the actual one);

2. we map the virtual processor computations onto the available processing resources with coarser grain worker modules and coarser grain data partitions. This executable process-level program is parametric in the parallelism degree 𝑛. In order to effectively execute the program, the degree of parallelism must be chosen at loading time, according to the resources that are allocated.

We have used the map paradigm because it is the parallel programming pattern which permits processing and generating big data sets with a parallel, distributed algorithm.


5.2.1 MAP

The map implementation of the skeleton consists of a scatter module, a gather module and a set of worker modules $\{worker_i\}_{i=1}^{n}$.

The scatter is in charge of scattering the data among the workers, such that each of them will work on a partition of the matrix consisting of $g = |N|/n$ rows. The whole PageRank vector is sent to each worker (broadcast).

Each worker executes the matrix-by-vector product on its own data, i.e. it is completely independent from the other workers, and sends the result to the gather module. This is shown in the following implementation graph (Fig. 4):

Fig. 5

Fig. 5 shows how the scatter module executes the partitioning of the starting matrix.

5.3 IMPLEMENTATION

For each implementation of the Power method, the implementation graph represented in the previous paragraph suggests using a map parallel pattern to express the computation described above, based on the scatter-compute-gather paradigm.

The scatter will distribute the input matrix in a static way. The scheduling policy is selected by specifying the grain parameter of the parallel_for method.

Each worker computes the same function on the assigned partition. In a nutshell, the input task is split into multiple sub-tasks, each one computed in parallel and then collected together into one single output task. The computation on the sub-tasks may be completely independent.
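A minimal sketch of this map over the rows, written with the same FastFlow ParallelFor interface used in Chapter 7 (the function name and the dense layout are illustrative):

#include <ff/parallel_for.hpp>
#include <vector>

// Map over the rows of the (transposed, normalized) web matrix: each worker
// computes newPR[k] for its own chunk of rows, independently of the others.
void mapStep(const std::vector<std::vector<double>>& At,
             const std::vector<double>& pr, std::vector<double>& newPR,
             double d, int nWorker) {
    const long N = static_cast<long>(pr.size());
    const long grain = N / nWorker;                  // static partitioning of the rows
    ff::ParallelFor pf(nWorker, false);
    pf.parallel_for_idx(0, N, 1, grain,
        [&](const long start, const long end, const int thid) {
            for (long k = start; k < end; ++k) {
                double sum = 0.0;
                for (long j = 0; j < N; ++j) sum += At[k][j] * pr[j];
                newPR[k] = (1.0 - d) / N + d * sum;  // each worker writes disjoint entries
            }
        });
}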

The loop body of a worker may be a standard function or a C++11 lambda-function. A C++11 lambda-function is a feature of the C++11 standard already supported by many compilers. Lambda-functions are unnamed closures (i.e. function objects that can be constructed and managed like data) that allow functions to be syntactically defined where and when needed. When lambda functions are built, they can capture the state of non-local variables named in the wrapped code either by value or by reference.
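A small, self-contained illustration of the two capture modes (illustrative values):

#include <iostream>

int main() {
    int n = 10;
    auto byValue = [n]()  { return n * 2; };   // captures a private copy of n
    auto byRef   = [&n]() { return n * 2; };   // refers to the original variable
    n = 21;                                    // only the by-reference lambda sees this update
    std::cout << byValue() << " " << byRef() << std::endl;   // prints "20 42"
    return 0;
}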

At the end of the computation, the private results are added to a shared data structure whose accesses are handled by the assigned index.

These three phases are iterated a finite number of times chosen in such a way we may better compare the results of the different algorithms. Consequently, the map pattern is nested inside a loopback pattern (Fig. 6).

The loopback condition is the one already mentioned: the map is executed a finite number of times, which represents our convergence criterion. Considering the whole computation, the time spent before the parallel map thus becomes negligible: according to Amdahl's law, the serial fraction, i.e. the part not parallelized, has a minimal impact on the speed-up [12].
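Recalling Amdahl's law: if $f$ is the fraction of the execution that can be parallelized and $n$ the parallelism degree, the achievable speed-up is

$$S(n) = \frac{1}{(1-f) + f/n}\,,$$

so with, say, $f = 0.99$ and $n = 8$ cores (illustrative numbers), $S(8) \approx 7.5$: a small serial fraction barely limits the speed-up.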

For the Arnoldi method, it is a different story. In fact, we have a loopback-map during the stoppedArnoldi function, i.e. in the phase that builds the two matrices used to calculate the eigenvector, while a plain map pattern is used to calculate the PageRank vector from the eigenvector obtained in the previous phase.

In the next section, we look at the various codes implemented using the C++ language.

Fig. 6: map implementation graph (scatter, workers W1 … Wn, gather)

6. PARALLEL PROGRAMMING FRAMEWORK

6.1 FASTFLOW

The parallel implementation is built on top of FastFlow [13]. FastFlow is a C++ parallel programming framework advocating high-level, pattern-based parallel programming. It chiefly supports streaming and data parallelism, targeting heterogeneous platforms composed of clusters of shared-memory platforms, possibly equipped with computing accelerators such as NVidia GPGPUs, Xeon Phi, Tilera TILE64.

The main design philosophy of FastFlow is to provide application designers with key features for parallel programming (e.g. time-to-market, portability, efficiency and performance portability) via suitable parallel programming abstractions and carefully designed run-time support.

FastFlow comes as a C++ template library designed as a stack of layers that progressively abstract out the programming of parallel applications. The goal of the stack is threefold: portability, extensibility, and performance.

For this, all the three layers are implemented as thin layers of C++ templates that are: 1. seamlessly portable;

2. easily extended via sub-classing;

3. statically compiled and cross-optimised with the application.

The terse design ensures easy portability on a wide range of OSes and CPUs with a C++ compiler. By using the service time as the key measure, we define quantitative parameters to estimate how good the parallel solution we implemented is. The main indicators are efficiency and scalability.
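With $T_{seq}$ the completion time of the best sequential version and $T(n)$ the completion time of the parallel version with $n$ workers, these indicators are commonly defined as

$$\mathrm{scalability}(n) = \frac{T(1)}{T(n)}\,, \qquad \mathrm{efficiency}(n) = \frac{T_{seq}}{n\,T(n)}\,;$$

the exact parameters adopted for the measurements are discussed in Chapter 8.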

The FastFlow run-time support uses several techniques to efficiently support fine grain parallelism (and very high frequency streaming). Among these we have:

• non-blocking multi-threading with lock-less synchronisations;
• zero-copy network messaging (via 0MQ/TCP and RDMA/Infiniband);
• asynchronous data feeding for accelerator offloading.


FastFlow has been adopted by a few research projects and third-party development initiatives, and has thus been tested in a variety of application scenarios: from systems biology to high-frequency trading.

The abstraction process has two main goals:

1. to promote high-level, platform-independent structured parallel programming, and in particular skeletal programming (i.e. pattern-based explicit parallel programming).

2. to support the development of portable and efficient applications for homogeneous and heterogeneous platforms, including multicore, many-core (e.g. GPGPU, FPGA), and distributed clusters of them.

These two goals have been perceived as a dichotomy for many years by many computer practitioners. Researchers in the skeletal and high-level parallel programming community never perceived it this way. Indeed, FastFlow is the latest of several high-level, pattern-based parallel programming frameworks designed by its authors over the last fifteen years.

FastFlow architecture is organised in three main tiers:

1. High-level patterns They are clearly characterised in a specific usage context and are targeted to the parallelisation of sequential (legacy) code. Examples are exploitation of loop parallelism, stream parallelism, data-parallel algorithms, execution of general workflows of tasks, etc. They are typically equipped with self-optimisation capabilities (e.g. load-balancing, grain auto-tuning, parallelism-degree auto-tuning) and exhibit limited nesting capability. Examples are: parallel-for, pipeline, stencil-reduce, MDF (macro-data-flow). Some of them target specific devices (e.g. GPGPUs). They are implemented on top of core patterns.

2. Core patterns They provide a general data parallel programming model with its run-time support, which is designed to be minimal and to reduce to the minimum the typical sources of overhead in parallel programming. At this level, there are two patterns (farm and pipeline) and one pattern-modifier (feedback). They make it possible to build very general (deadlock-free) cyclic process networks. They are not graphs of tasks, they are graphs of parallel executors (processes/threads); tasks or data items flow across them. Overall, the programming model can be envisioned as a shared-memory streaming model, i.e. a shared-memory model equipped with message-passing synchronisations. They are implemented on top of building blocks.

3. Building blocks They provide the basic blocks used to build (and generate via C++ header-only templates) the run-time support of core patterns. Typical objects at this level are queues (e.g. wait-free fence-free SPSC queues, bounded and unbounded), process and thread containers (as C++ classes), and mediator threads/processes (extensible and configurable schedulers and gatherers). The shared-memory run-time support extensively uses nonblocking lock-free (and fence-free) algorithms, the distributed run-time support employs zero-copy messaging, and the GPGPU support exploits asynchrony and SIMT-optimised algorithms.

Fig. 7: different layers

In the above figure, the different layers of FastFlow are represented [14]. They are targeted to support different kinds of programmers.


6.2 PTHREAD

In shared memory multiprocessor architectures, threads can be used to implement parallelism. For UNIX systems, a standardized C language threads programming interface has been specified by the IEEE POSIX 1003.1c standard. Implementations that adhere to this standard are referred to as PThread [15].

PThread, short for Posix Thread, is a library that you link with your program in order to spawn functions as separate threads. So, a thread is spawned by transferring control to a specific function that you have defined.

pThread allows a program to control multiple flows of work that overlap in time. Each flow of work is referred to as a thread, and creation and control over these flows is achieved by making calls to the POSIX Threads API.

All threads share a single executable, a single set of global variables, and a single heap (to support malloc, new), but each thread has its own stack (to host function arguments, private variables).
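A minimal sketch of spawning and joining a single POSIX thread (the identifiers are illustrative; compile with -pthread):

#include <pthread.h>
#include <cstdio>

// Function run by the spawned thread: it receives a void* argument
// and must return a void*.
void* worker(void* arg) {
    int id = *static_cast<int*>(arg);
    std::printf("hello from thread %d\n", id);
    return nullptr;
}

int main() {
    pthread_t tid;
    int id = 1;
    pthread_create(&tid, nullptr, worker, &id);   // spawn the worker thread
    pthread_join(tid, nullptr);                   // wait for its termination
    return 0;
}

The spawned thread shares the globals and the heap of the process but runs on its own stack, as described above.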


7. ALGORITHMS: C++ CODES

7.1 CODES FOR ODROID-XU4

All the algorithms listed below include two libraries: FastFlow (<ff/parallel_for.hpp>), used to parallelize, and MAMMUT (<mammut/mammut.hpp>), used to measure energy consumption. These are described in a later chapter.

Below, we show all the algorithms implemented in C++ language, shown previously in a pseudocode format, i.e. in an informal high-level description.

7.1.1 POWER METHOD (CSR)

1. #include <iostream> 2. #include <fstream> 3. #include <sys/time.h> 4. #include <cmath> 5. #include <vector> 6. #include <string> 7. #include <ff/parallel_for.hpp> 8. #include <mammut/mammut.hpp> 9. 10. #include <cassert> 11. #include <unistd.h> 12.

13. using namespace mammut;

14. using namespace mammut::energy;

15. 16. class Tuple 17. { 18. int i, j; 19. double value; 20. public: 21. Tuple(){}

22. Tuple(int i, int j, double value)

23. { 24. this->i=i; 25. this->j=j; 26. this->value=value; 27. } 28.

29. void print() const

30. {

31. std::cout<<"("<<i<<", "<<j<<", "<<value<<") ";

32. }

33.

34. int getI() const

35. {

36. return i;

37. }

38. int getJ() const

39. {

40. return j;

41. }

42.

43. double getValue() const

44. {

45. return value;

46. }

47.

48. void setValue(double value)

49. {

50. this->value=value;

51. }

52. };

53.

54. int main(int argc, char const *argv[])

55. {

56. Mammut m;

57. Energy* energy = m.getInstanceEnergy();

58. Joules j;

59.

60. Counter* counter = energy->getCounter();

61. if(!counter){

62. cout << "Power counters not available on this machine." << endl;

63. return -1;

64. }

65.

66. timeval startTot, stopTot;

67. double elapsedTimeTot;

68. std::vector<Tuple> matrix;

69. int nodes, nZ, idxI, idxJ, nWorker;

70. double value=0.0;

71. int numCPU = sysconf( _SC_NPROCESSORS_ONLN );

72. nWorker=atoi(argv[1]);

73. if (nWorker > numCPU)

74. {

75. std::cout<< "\nNumber of workers higher than number of cores!\n"<<std::endl;

76. exit(EXIT_FAILURE);

77. }

78.

79. std::string fileName = "C-39.txt";

80. std::ifstream dataSet(fileName);

81. bool first= true;

82. while(!dataSet.eof()) 83. { 84. if (first) 85. { 86. first=false; 87. dataSet >> nodes >> nZ; 88. matrix.reserve(nZ); 89. } 90. else 91. {

92. dataSet >> idxI >> idxJ >> value;

93. matrix.emplace_back(idxI, idxJ, value);

94. }

95. }

96. dataSet.close();

97.

98. //outdegrees of each node non-zero

99. int outDegrees[nodes];

100. for (int i = 0; i < nodes; ++i)

101. { 102. outDegrees[i]=0; 103. } 104. for (int k = 0; k < nZ; ++k) 105. { 106. outDegrees[matrix[k].getI()]++; 107. } 108.


109. //update matrix

110. double newValue=0.0;

111. for (int i = 0; i < nZ; ++i)

112. {

113. newValue=1/(double)outDegrees[matrix[i].getI()];

114. matrix[i].setValue(newValue); 115. } 116. 117. //computation 118. double pageRank[nodes]; 119. double newPageRank[nodes]; 120. 121. // damping factor 122. double d = 0.85; 123. 124. //default rank

125. double defaultRank = (1 - d) / nodes;

126.

127. //initialization of vector rank

128. double init = 1.0 / nodes;

129. for (int i=0; i<nodes; ++i)

130. { 131. pageRank[i]=init; 132. newPageRank[i]=0.0; 133. } 134. 135. long g = (nodes/nWorker);

136. ff::ParallelFor pf(nWorker, false);

137.

138. double sum;

139. counter->reset();

140. gettimeofday(&startTot, NULL);

141. for (int iterator = 0; iterator < 8; ++iterator)

142. {

143. //---//

144. //--- workerF ---//

145. //---//

146. auto workerF = [&matrix, &pageRank, &newPageRank, d, defaultRank, &sum]

147. (const int start, const int end, const int thid)

148. {

149. for (int k = start; k < end; ++k)

150. {

151. double sum = 0.0; // thread-local accumulator: shadows the shared 'sum' to avoid a data race

152. for (int it = 0; it < matrix.size(); ++it)

153. {

154. if (matrix[it].getJ() == k+1)

155. {

156. sum += matrix[it].getValue() * pageRank[matrix[it].getI()-1]; // contribution of the source node's rank (indices are 1-based)

157. }

158. }

159. newPageRank[k] = defaultRank + (d * sum);

160. }

161. };

162.

163. pf.parallel_for_idx(0, nodes, 1, g, workerF);

164.

165. for(int k = 0; k < nodes; k++)

166. { 167. pageRank[k] = newPageRank[k]; 168. } 169. 170. } 171. gettimeofday(&stopTot, NULL); 172. 173. j = counter->getJoules();

174. std::cout << j << " joules consumed." << std::endl;


176. elapsedTimeTot = (stopTot.tv_sec - startTot.tv_sec) * 1000.0;

177. elapsedTimeTot += (stopTot.tv_usec - startTot.tv_usec) / 1000.0;

178.

179. //std::cout<<"\nPageRank made in ";

180. std::cout<<elapsedTimeTot<<std::endl;

181. return 0;

182. }

In this code, a class Tuple is present which permits storing all the non-zero items to compress the weight of the starting sparse matrix, i.e. a Compressed Row Storage format. The drawback is poor computational efficiency, because an indirect addressing step is needed for every single scalar operation of the matrix-vector product or of a preconditioner solve. This increases the calculation time as well as the energy spent.

7.1.2 POWER METHOD

1. #include <iostream> 2. #include <fstream> 3. #include <sys/time.h> 4. #include <cmath> 5. #include <vector> 6. #include <string> 7. 8. #include <ff/parallel_for.hpp> 9. #include <mammut/mammut.hpp> 10. 11. #include <cassert> 12. #include <unistd.h> 13.

14. using namespace mammut;

15. using namespace mammut::energy;

16.

17. int main(int argc, char const *argv[])

18. {

19. Mammut m;

20. Energy* energy = m.getInstanceEnergy();

21. Joules j;

22.

23. Counter* counter = energy->getCounter();

24. if(!counter){

25. std::cout << "Power counters not available on this machine.\n";

26. return -1;

27. }

28.

29.

30. timeval startTot, stopTot, start, stop;

31. double elapsedTimeTot, elapsedTime;

32. int nodes, nZ, idxI, idxJ, nWorker;

33. double value;

34.

35. int numCPU = sysconf( _SC_NPROCESSORS_ONLN );

36. nWorker=atoi(argv[1]);

37. if (nWorker > numCPU)

38. {

39. std::cout<< "\nNumber of workers higher than number of cores!\n"<<std::endl;

40. exit(EXIT_FAILURE);

41. }

42.

43. std::string fileName = "C-39.txt";

44. std::ifstream dataSet(fileName);

45. dataSet >> nodes >> nZ;

46. std::vector<std::vector<double>> matrix(nodes);

47. {

48. std::vector<std::vector<double>> tmp(nodes);

49. for (int i = 0; i < nodes; ++i)

50. {

51. tmp[i].resize(nodes); // resize (not reserve) so that tmp[i][j] below is valid

52. for (int j = 0; j < nodes; ++j)

53. { 54. tmp[i][j] = 0.0; 55. } 56. } 57. while(!dataSet.eof()) 58. {

59. dataSet >> idxI >> idxJ >> value;

60. tmp[idxI-1][idxJ-1] = value;

61. }

62. dataSet.close();

63.

64. //outDegrees of each nodes non-zero

65. int outDegrees[nodes];

66. for (int i = 0; i < nodes; ++i)

67. {

68. outDegrees[i] = 0;

69. }

70. for (int i = 0; i < nodes; ++i)

71. {

72. for (int j = 0; j < nodes; ++j)

73. { 74. if (tmp[i][j] != 0.0) 75. { 76. outDegrees[i]++; 77. } 78. } 79. } 80. //update matrix

81. for (int i = 0; i < nodes; ++i)

82. {

83. for (int j = 0; j < nodes; ++j)

84. {

85. if (tmp[i][j] != 0.0)

86. {

87. tmp[i][j] = 1/(double)outDegrees[i];

88. }

89. }

90. }

91. //transpose

92. for (int i = 0; i < nodes; ++i)

93. {

94. matrix[i].resize(nodes); // resize so that matrix[i][j] can be assigned

95. for (int j = 0; j < nodes; ++j)

96. { 97. matrix[i][j] = tmp[j][i]; 98. } 99. } 100. } 101. //damping factor 102. double d = 0.85; 103. 104. //default rank

105. double defaultRank = (1 - d) / nodes;

106.

107. double pageRank[nodes];

108. double newPageRank[nodes];

109. for (int i = 0; i < nodes; ++i)

110. {

111. pageRank[i] = (double)1/nodes;

112. newPageRank[i] = 0.0;

113. }

114. double sum;

115. long g = (nodes/nWorker);

116. ff::ParallelFor pf(nWorker, false);

117.

118. counter->reset();

119. gettimeofday(&startTot, NULL);

120. for (int iter = 0; iter < 8; ++iter)

121. {

122.

123. //---//

124. //--- workerF ---//

125. //---//

126. auto workerF = [&matrix, &pageRank, &newPageRank, &nodes, &sum, &d, &defaultRank](const int start, const int end, const int thid)

127. {

128. for (int k = start; k < end; ++k)

129. {

130. double sum = 0.0; // thread-local accumulator: shadows the shared 'sum' to avoid a data race

131. for (int j = 0; j < nodes; ++j)

132. {

133. sum += matrix[k][j] * pageRank[j];

134. }

135. newPageRank[k] = defaultRank + (d * sum);

136. }

137. };

138. pf.parallel_for_idx(0, nodes, 1, g, workerF);

139.

140. for (int i = 0; i < nodes; ++i)

141. {

142. pageRank[i] = newPageRank[i];

143. }

144. }

145. gettimeofday(&stopTot, NULL);

146. elapsedTimeTot = (stopTot.tv_sec - startTot.tv_sec) * 1000.0;

147. elapsedTimeTot += (stopTot.tv_usec - startTot.tv_usec) / 1000.0;

148.

149. j = counter->getJoules();

150. std::cout << j << " joules consumed.\n";

151.

152. std::cout<<elapsedTimeTot<<std::endl;

153. return 0;

154. }

This implementation is similar to the previous one with only one difference: the management of the input matrix. Here the matrix is stored without any compression, so it occupies much more memory than the CSR method, as we can see in Fig. 9: roughly 1 GB vs ¼ GB.


Fig. 9: memory usage: not compressed vs CSR

The advantage of this implementation is that we succeed in keeping the matrix in cache memory, and therefore all the costs of handling the indexes are null. This brings an important increase in performance, reducing the completion time.

7.1.3 ARNOLDI METHOD

1. #include <iostream> 2. #include <fstream> 3. #include <sys/time.h> 4. #include <math.h> 5. #include <cmath> 6. #include <vector> 7. #include <string> 8. #include <ff/parallel_for.hpp> 9. #include <mammut/mammut.hpp> 10. #include <Eigen/Eigenvalues> 11. #include <Eigen/Dense> 12. #include <cassert> 13. #include <unistd.h> 14.

15. using namespace mammut;

16. using namespace mammut::energy;

17.

18. using namespace Eigen;

19. 20. 21. struct arnoldiResult 22. { 23. std::vector<std::vector<float>> H; 24. std::vector<std::vector<float>> Q; 25. }; 26. 27.


28. arnoldiResult arnoldi_stopped(std::vector<std::vector<float>>& matrix, int nodes, int nZ)

29. {

30. arnoldiResult ar;

31.

32. //q0 = arbitrary non-zero vector

33. std::vector<float> q0(nodes, 0.0);

34. float norm =0.0;

35. for (int i = 0; i < nodes; ++i)

36. { 37. q0[i] = 1.0/nZ; 38. norm += q0[i]*q0[i]; 39. } 40. 41. norm = sqrt(norm); 42. //q1 = q0/||q0||2

43. std::vector<float> q1(nodes, 0.0);

44. for (int i = 0; i < nodes; ++i)

45. {

46. q1[i] = q0[i] / norm;

47. // std::cout<<q1[i]<<std::endl; 48. } 49. 50. ar.Q.push_back(q0); 51. ar.Q.push_back(q1); 52. 53. //start computation

54. std::vector<float> w(nodes, 0.0);

55. std::vector<float> q(nodes, 0.0);

56. int j=1;

57.

58. while (true)

59. {

60. //w = A vj

61. for (int row = 0; row < nodes; ++row) {

62. for (int col = 0; col < nodes; ++col) {

63. w[row] += matrix[row][col] * ar.Q[j][col];

64. }

65. }

66.

67. float h_ij = 0.0;

68. std::vector<float> h_j(nodes, 0.0);

69.

70. for (int i = 1; i <= j; ++i)

71. {

72. // h_i,j = w * v_i

73. for (int k = 0; k < nodes; ++k)

74. { 75. h_ij += w[k] * ar.Q[i][k]; 76. //std::cout<<ar.Q[i][k]<<std::endl; 77. } 78. // std::cout<<h_ij<<std::endl; 79. h_j[i] = h_ij; 80. 81. // w = w - hi,j vi

82. for (int k = 0; k < nodes; ++k)

83. { 84. w[k] -= h_ij * ar.Q[i][k]; 85. // std::cout<<"w[" << k << "] = " << w[k] <<std::endl; 86. } 87. } 88. 89. float norm2 = 0.0;

90. for (int k = 0; k < nodes; ++k)

91. {

92. norm2 += w[k] * w[k];

93. }
94. norm2 = sqrt(norm2); 95. 96. // h_j+1,j = ||w||2 97. h_j[j+1] = norm2; 98. 99. ar.H.push_back(h_j); 100. 101. //h_j+1,j < a 102. if (norm2 < 1e-30) 103. { 104. break; 105. } 106.

107. for (int k = 0; k < nodes; ++k)

108. { 109. q[k] = w[k] / norm2; 110. } 111. 112. ar.Q.push_back(q); 113. 114. MatrixXf H(j, j); 115. 116. // init

117. for (int i = 0; i < j; ++i)

118. {

119. for (int ii = 0; ii < j; ++ii)

120. {

121. if (i<=ii) H(i,ii)=0.0;

122. else H(i, ii) = ar.H[i][ii];

123. }

124. }

125.

126. EigenSolver<MatrixXf> es(H);

127.

128. // std::cout << "The eigenvalues of A are:" << std::endl << es.eigenva lues() << std::endl;

129. // std::cout << "The matrix of eigenvectors, V, is:" << std::endl << e s.eigenvectors() << std::endl;

130.

131. float maxEigenvalues=-1e200;

132. int maxIndex=0;

133. for (int i = 0; i < j; ++i)

134. { 135. if (es.eigenvalues()[i].real()>maxEigenvalues) 136. { 137. maxEigenvalues = es.eigenvalues()[i].real(); 138. maxIndex=i; 139. } 140. } 141.

142. float residual = norm2 * std::abs((es.eigenvectors().col(maxIndex)[j-1]).real()); 143. // std::cout<<"residual is "<<residual<<std::endl; 144. if (residual<1e-30) 145. { 146. break; 147. } 148. j++; 149. } 150. 151. return ar; 152. } 153. 154.

155. int main(int argc, char const *argv[])

156. {


158. Mammut m;

159. Energy* energy = m.getInstanceEnergy();

160. Joules j;

161.

162. Counter* counter = energy->getCounter();

163. if(!counter){

164. cout << "Power counters not available on this machine." << endl;

165. return -1;

166. }

167.

168. int nodes, nZ, idxI, idxJ;

169. float value;

170.

171. int numCPU = sysconf( _SC_NPROCESSORS_ONLN );

172. int nWorker=atoi(argv[1]);

173. if (nWorker > numCPU)

174. {

175. std::cout<<"Number of workers higher than number of cores!\n"<<std::endl;

176. exit(EXIT_FAILURE); 177. } 178. 179. std::string fileName = "C-39.txt"; 180. std::ifstream dataSet(fileName); 181. dataSet >> nodes >> nZ; 182.

183. std::vector<std::vector<float>> matrix(nodes);

184. {

185. std::vector<std::vector<float>> tmp(nodes);

186. for (int i = 0; i < nodes; ++i)

187. {

188. tmp[i].resize(nodes); // resize (not reserve) so that tmp[i][j] below is valid

189. for (int j = 0; j < nodes; ++j)

190. { 191. tmp[i][j] = 0.0; 192. } 193. } 194. while(!dataSet.eof()) 195. {

196. dataSet >> idxI >> idxJ >> value;

197. tmp[idxI-1][idxJ-1] = value;

198. }

199. dataSet.close();

200.

201. //outDegrees of each nodes non-zero

202. int outDegrees[nodes];

203. for (int i = 0; i < nodes; ++i)

204. {

205. outDegrees[i] = 0;

206. }

207. for (int i = 0; i < nodes; ++i)

208. {

209. for (int j = 0; j < nodes; ++j)

210. { 211. if (tmp[i][j] != 0.0) 212. { 213. outDegrees[i]++; 214. } 215. } 216. } 217. 218. //update matrix

219. for (int i = 0; i < nodes; ++i)

220. {

221. for (int j = 0; j < nodes; ++j)

222. {

223. if (tmp[i][j] != 0.0)

224. {

225. tmp[i][j] = 1/(float)outDegrees[i];

226. }

227. }

228. }

229.

230. //transpose

231. for (int i = 0; i < nodes; ++i)

232. {

233. matrix[i].resize(nodes); // resize so that matrix[i][j] can be assigned

234. for (int j = 0; j < nodes; ++j)

235. { 236. matrix[i][j] = tmp[j][i]; 237. } 238. } 239. } 240. 241. float pageRank[nodes];

242. float startingValue = 1.0 / nodes;

243. for (int it = 0; it < nodes; ++it)

244. {

245. pageRank[it] = startingValue;

246. }

247.

248. auto results = arnoldi_stopped(matrix, nodes, nZ);

249.

250. int k = results.H.size();

251.

252. MatrixXf H(k, k);

253. for (int i = 0; i < k; ++i)

254. { 255. for (int j = 0; j < k; ++j) 256. { 257. H(i,j)=results.H[i][j]; 258. } 259. } 260. 261. int maxIndex; 262. float maxEigenvalues=-1e200; 263. EigenSolver<MatrixXf> es(H);

264. for (int i = 0; i < k; ++i)

265. { 266. if (es.eigenvalues()[i].real()>maxEigenvalues) 267. { 268. maxEigenvalues = es.eigenvalues()[i].real(); 269. maxIndex=i; 270. } 271. } 272. 273. long g = (nodes/nWorker);

274. ff::ParallelFor pf(nWorker, false);

275.

276. auto workerF = [&results, &pageRank, &k, &es, &maxIndex](const int start,

const int end, const int thid)

277. {

278. for (int part = start; part < end; ++part)

279. { 280. float sum = 0.0; 281. for(int j = 0; j < k-1; j++) 282. { 283. sum += results.Q[j][part]*es.eigenvectors().col(maxIndex)[j].real(); 284. } 285. pageRank[part] = sum; 286. } 287. }; 288.

289. timeval startTot, stopTot;

290. double elapsedTimeTot;

291.

292. counter->reset();

293. gettimeofday(&startTot, NULL);

294.

295. pf.parallel_for_idx(0, nodes, 1, g, workerF);

296.

297. gettimeofday(&stopTot, NULL);

298.

299. j = counter->getJoules();

300. std::cout << j << " joules consumed." << std::endl;

301.

302. elapsedTimeTot = (stopTot.tv_sec - startTot.tv_sec) * 1000.0;

303. elapsedTimeTot += (stopTot.tv_usec - startTot.tv_usec) / 1000.0;

304.

305. std::cout<<elapsedTimeTot<<std::endl;

306. return 0;

307. }

308.

The Arnoldi method is a little bit more complicated. First of all, it needs a function implementing the iterative StoppedArnoldi procedure (see Algorithm 3) to compute the dominant eigenvector; the result obtained is then used to calculate the PageRank vector (see Algorithm 4).

The Eigen library has been used to calculate eigenvectors and eigenvalues.

According to the results obtained, the completion time spent in the stoppedArnoldi function is higher than that of the matrix-vector product used to calculate the PageRank vector. Moreover, the number of iterations can increase, and consequently also the completion time spent in the function. Hence, we need to parallelize the stoppedArnoldi phase as well, since it represents a bottleneck; a loopback-map is used to remove it.

57. long grain = (nodes/nWorker);

58. ff::ParallelFor pfor(nWorker, false);

59. 60. while (true) 61. { 62. //w = A vj 63. //---// 64. //--- workerFun ---// 65. //---//

66. auto workerFun = [&matrix, &ar, &nodes, &w, &j](const int start, const int end, const int thid)

67. {

68. for (int k = start; k < end; ++k)

69. {

70. for (int col = 0; col < nodes; ++col)

71. { 72. w[k] += matrix[k][col] * ar.Q[j][col]; 73. } 74. } 75. }; 76.
