
(1)

Parallelizing Loops

Moreno Marzolla

Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

moreno.marzolla@unibo.it

(2)
(3)

Credits

Salvatore Orlando (Univ. Ca' Foscari di Venezia)

Mary Hall (Univ. of Utah)

(4)

Loop optimization

Typically, 90% of the execution time is spent in 10% of the code

Mostly in loops

Loop optimizations

Transform loops while preserving their semantics

Goal

Single-threaded systems: mostly optimize for the memory hierarchy

Multi-threaded and vector systems: parallelize loops

(5)

Data Dependence

Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write

Data-flow (true) dependence: RAW (Read After Write)

Anti dependence: WAR (Write After Read)

Output dependence: WAW (Write After Write)

True dependence (RAW), S1 → S2:

a = b + c;    /* S1 */
d = 2 * a;    /* S2 */

Anti dependence (WAR), S1 → S2:

c = a + b;    /* S1 */
a = 2 * a;    /* S2 */

Output dependence (WAW), S1 → S2:

a = k;        /* S1 */
if (a > 0) {
    ...       /* S2 writes a again */

(6)

Control Dependence

An instruction S2 has a control dependence on S1 if the outcome of S1 determines whether S2 is executed or not

Of course, S1 and S2 cannot be exchanged

This kind of dependence relates the condition of an if-then-else or of a loop to the statements in its body

if (a > 0) {      /* S1 */
    a = 2 * c;    /* S2 */
} else {
    b = 3;        /* S3 */
}

(7)

Dependence

In the following, we always use a simple arrow to denote any type of dependence

S1 → S2 means that S2 depends on S1

(8)

Fundamental Theorem of Dependence

“Any reordering transformation that preserves every dependence in a program preserves the meaning of that program”

Recognizing parallel loops (intuitively)

Find the data dependences in the loop

If no dependence crosses an iteration boundary, the loop can be parallelized

(9)


Example

Each iteration does not depend on previous ones

There are no dependencies crossing iteration boundaries

This loop is fully parallelizable

Loop iterations can be performed concurrently, in any order

Iterations can be split across processors

for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];   /* S1 */
}
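Since no dependence crosses an iteration boundary, the iterations can be distributed among threads. A minimal sketch, assuming OpenMP (the function wrapper and argument types are placeholders, not part of the original slides):

/* Every iteration is independent, so the loop can be split
   across threads. Compile with -fopenmp. */
void vec_add(float *a, const float *b, const float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];   /* S1: no loop-carried dependence */
    }
}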

(10)

Example

Each iteration depends on the previous one (RAW)

Loop-carried dependence

Hence, this loop is not parallelizable with trivial transformations

for (i=1; i<n; i++) {
    a[i] = a[i-1] + b[i];   /* S1 */
}

(11)

Example

There is a loop-carried dependence on s that cannot be removed with trivial loop transformations

but it can be removed with a non-trivial transformation: this is a reduction, indeed!

s = 0;
for (i=0; i<n; i++) {
    s = s + a[i];   /* S1 */
}
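The dependence on s matches the reduction pattern. A minimal sketch, assuming OpenMP's reduction clause (the function wrapper and types are placeholders):

/* Each thread accumulates into a private copy of s; the partial
   sums are combined when the loop ends. */
float array_sum(const float *a, int n)
{
    float s = 0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++) {
        s = s + a[i];
    }
    return s;
}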

(12)

Exercise: draw the dependences among iterations of the following loop

for (i=2; i<n; i++) {
    a[i] = 4 * c[i-1] - 2;    /* S1 */
    b[i] = a[i] * 2;          /* S2 */
    c[i] = a[i-1] + 3;        /* S3 */
    d[i] = b[i] + c[i-2];     /* S4 */
}

(13)

Exercise (cont.)

All dependences in this loop are RAW (true) dependences:

S1 → S2: a[i] is written by S1 and read by S2 in the same iteration

S3 → S1: c[i-1] read by S1 was written by S3 in the previous iteration (loop-carried, distance 1)

S1 → S3: a[i-1] read by S3 was written by S1 in the previous iteration (loop-carried, distance 1)

S2 → S4: b[i] is written by S2 and read by S4 in the same iteration

S3 → S4: c[i-2] read by S4 was written by S3 two iterations earlier (loop-carried, distance 2)

Because of the loop-carried dependences, the iterations of this loop cannot be executed in parallel as written.

(18)


Removing dependences: loop aligning

Dependencies can sometimes be removed by aligning loop iterations

Before:

a[0] = 0;
for (i=0; i<n-1; i++) {
    a[i+1] = b[i] * c[i];   /* S1 */
    d[i]   = a[i] + 2;      /* S2 */
}

After aligning:

a[0] = 0;
d[0] = a[0] + 2;
for (i=1; i<n-1; i++) {
    a[i] = b[i-1] * c[i-1];   /* T1 */
    d[i] = a[i] + 2;          /* T2 */
}
a[n-1] = b[n-2] * c[n-2];
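After aligning, a[i] is produced and consumed within the same iteration, so the loop no longer carries a dependence and can be parallelized. A sketch under the same OpenMP assumption (the function wrapper and types are placeholders):

/* The prologue and epilogue stay sequential; the main loop has no
   loop-carried dependence and can be split across threads. */
void aligned_loop(float *a, const float *b, const float *c, float *d, int n)
{
    a[0] = 0;
    d[0] = a[0] + 2;
    #pragma omp parallel for
    for (int i = 1; i < n-1; i++) {
        a[i] = b[i-1] * c[i-1];
        d[i] = a[i] + 2;
    }
    a[n-1] = b[n-2] * c[n-2];
}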

(19)


Removing dependences: reordering / 1

Some dependencies can be removed by reordering

Before:

for (i=0; i<n-1; i++) {
    a[i+1] = c[i] + k;      /* S1 */
    b[i]   = b[i] + a[i];   /* S2 */
    c[i+1] = d[i] + w;      /* S3 */
}

After reordering:

for (i=0; i<n-1; i++) {
    c[i+1] = d[i] + w;      /* S3 */
    a[i+1] = c[i] + k;      /* S1 */
    b[i]   = b[i] + a[i];   /* S2 */
}

(20)

Reordering / 2

After reordering, we can align the loop iterations

Before (reordered loop):

for (i=0; i<n-1; i++) {
    c[i+1] = d[i] + w;      /* S3 */
    a[i+1] = c[i] + k;      /* S1 */
    b[i]   = b[i] + a[i];   /* S2 */
}

After aligning:

b[0] = b[0] + a[0];
a[1] = c[0] + k;
b[1] = b[1] + a[1];
for (i=1; i<n-2; i++) {
    c[i]   = d[i-1] + w;         /* T3 */
    a[i+1] = c[i] + k;           /* T1 */
    b[i+1] = b[i+1] + a[i+1];    /* T2 */
}
c[n-2] = d[n-3] + w;
a[n-1] = c[n-2] + k;
c[n-1] = d[n-2] + w;

(21)

Loop interchange

Exchanging the loop indexes might allow the outer loop to be parallelized

Why? To use coarse-grained parallelism (if appropriate)

Parallelize the inner loop (the dependence on a[i][j-1] is carried by the outer j loop):

for (j=1; j<m; j++) {
    for (i=0; i<n; i++) {
        a[i][j] = a[i][j-1] + b[i];   /* S1 */
    }
}

Parallelize the outer loop (after the interchange, each row i is independent):

for (i=0; i<n; i++) {
    for (j=1; j<m; j++) {
        a[i][j] = a[i][j-1] + b[i];   /* S1 */
    }
}
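A sketch of the interchanged nest with the outer loop parallelized, assuming OpenMP (the function wrapper and array types are placeholders, not from the original slides):

/* Each row i is independent: the dependence a[i][j] <- a[i][j-1]
   stays inside a single row, so parallelizing the outer loop gives
   coarse-grained parallelism. Compile with -fopenmp. */
void row_update(int n, int m, float a[n][m], const float b[n])
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 1; j < m; j++) {
            a[i][j] = a[i][j-1] + b[i];
        }
    }
}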

(22)

Difficult dependences

for (i=1; i<n; i++) {
    for (j=1; j<m; j++) {
        a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
    }
}

Each element a[i][j] depends on a[i-1][j-1], a[i-1][j], and a[i][j-1]

(23)

Difficult dependences

It is not possible to parallelize the loop iterations, no matter which of the two loops we consider


(25)

Solution

It is possible to parallelize the inner loop by sweeping the matrix diagonally

Wavefront sweep

for (slice=0; slice < n + m - 1; slice++) {
    z1 = slice < m ? 0 : slice - m + 1;
    z2 = slice < n ? 0 : slice - n + 1;
    /* The following loop can be parallelized */
    for (i = slice - z2; i >= z1; i--) {
        j = slice - i;
        /* process a[i][j] ... */
    }
}
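A sketch of the wavefront sweep with the inner loop parallelized, again assuming OpenMP; the function wrapper, the guard on the first row and column, and the reuse of the earlier stencil update are illustrative assumptions:

/* All elements on the same anti-diagonal (slice) are independent,
   so the inner loop can be split across threads. */
void wavefront(int n, int m, float a[n][m])
{
    for (int slice = 0; slice < n + m - 1; slice++) {
        const int z1 = slice < m ? 0 : slice - m + 1;
        const int z2 = slice < n ? 0 : slice - n + 1;
        #pragma omp parallel for
        for (int i = slice - z2; i >= z1; i--) {
            const int j = slice - i;
            if (i > 0 && j > 0) {   /* first row/column are only read */
                a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
            }
        }
    }
}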

(26)

Conclusions

A loop can be parallelized if there are no dependencies crossing loop iterations

Some kinds of dependencies can be eliminated by

Applying code transformations (reordering, aligning)

Using patterns (e.g., scan, reduction)

Sweeping across iterations "diagonally"

Unfortunately, there are situations where dependences cannot be removed, no matter what
