Parallel Computing Architectures


(1)

Parallel Computing Architectures

Moreno Marzolla

Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

http://www.moreno.marzolla.name/

Pacheco, excerpts from Chapter 2

(2)
(3)

Von Neumann architecture

[Figure: the Von Neumann architecture. A CPU containing an ALU, registers R0 … Rn-1, and a program counter (PC) is connected to the memory through a bus with control, address, and data lines.]

(4)

An Abstract Parallel Architecture

How is parallelism handled?

What is the exact physical location of the memories?

What is the topology of the interconnect network?

[Figure: an abstract parallel architecture. Several processors and several memories are attached to an interconnection network.]

(5)

Why are parallel architectures important?

There is no "typical" parallel computer: different vendors use different architectures

There is currently no "universal" programming paradigm that fits all architectures

Parallel programs must be tailored to the underlying parallel architecture

The architecture of a parallel computer limits the choice of the programming paradigm that can be used

(6)

Flynn's Taxonomy

SISD

Single Instruction Stream Single Data Stream

SIMD

Single Instruction Stream Multiple Data Streams

MISD

Multiple Instruction Streams Single Data Stream

MIMD

Multiple Instruction Streams Multiple Data Streams

                             Data Streams
                         Single      Multiple
Instruction     Single   SISD        SIMD
Streams       Multiple   MISD        MIMD

SISD corresponds to the classical Von Neumann architecture.

(7)

SIMD

SIMD instructions apply the same operation (e.g., sum, product, …) to multiple elements, typically 4 or 8 depending on the width of the SIMD registers and the data types of the operands

This means that there must be 4, 8, … independent ALUs

[Figure: four SIMD lanes proceed in lockstep over time; lane i executes LOAD A[i], LOAD B[i], C[i] = A[i] + B[i], STORE C[i], for i = 0 … 3.]
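In scalar C this computation is a simple loop. A minimal sketch (the function name is illustrative) of code that a vectorizing compiler, e.g. gcc at -O3 on an SSE-capable CPU, can turn into SIMD instructions that handle four 32-bit elements per step:

/* Scalar element-wise sum: one addition per iteration.
   With 128-bit SIMD registers and 32-bit ints, a vectorizing
   compiler can process four iterations per instruction. */
void vec_add(const int *A, const int *B, int *C, int n)
{
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}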

(8)

SSE

Streaming SIMD Extensions

Extension to the x86 instruction set

Provide new SIMD instructions operating on small arrays of integer or floating-point numbers

Introduces 8 new 128-bit registers (XMM0 to XMM7)

SSE2 instructions can handle:

2 × 64-bit doubles, or
2 × 64-bit integers, or
4 × 32-bit integers, or
8 × 16-bit integers, or
16 × 8-bit chars

[Figure: a 128-bit XMM register, partitioned into lanes according to the operand type.]

(9)

Example

__m128i a = _mm_set_epi32( 1, 2, 3, 4 );

__m128i b = _mm_set_epi32( 2, 4, 6, 8 );

__m128i s = _mm_add_epi32(a, b);

[Figure: a = {1, 2, 3, 4} and b = {2, 4, 6, 8}, each holding four 32-bit lanes; the four lane-wise additions execute in parallel, producing s = {3, 6, 9, 12}.]
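A complete, runnable version of the example; a minimal sketch in which the store and the printout are additions (note that _mm_set_epi32 lists lanes from most to least significant, so the lowest lane holds 4 + 8). Compile with gcc -msse2:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main( void )
{
    __m128i a = _mm_set_epi32( 1, 2, 3, 4 );
    __m128i b = _mm_set_epi32( 2, 4, 6, 8 );
    __m128i s = _mm_add_epi32(a, b);       /* four 32-bit sums at once */
    int out[4];
    _mm_storeu_si128((__m128i *)out, s);   /* copy the lanes to memory */
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    /* prints: 12 9 6 3 (lowest lane first) */
    return 0;
}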

(10)

MIMD

In MIMD systems there are multiple execution units that can execute multiple sequences of instructions

Multiple Instruction Streams

Each execution unit generally operates on its own input data

Multiple Data Streams

[Figure: over time, several execution units run different instruction streams on different data; e.g. one executes CALL F(); z = 8; y = 1.7, another a = 18; b = 9; if (a > b) c = 7, another w = 7; t = 13; k = G(w,t), and another LOAD A[0]; LOAD B[0]; C[0] = A[0] + B[0].]
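A minimal sketch of MIMD execution on a shared-memory machine using POSIX threads: two threads run different instruction streams on their own data (the values mirror the figure; the function names are illustrative). Compile with gcc -pthread:

#include <pthread.h>
#include <stdio.h>

void *stream1(void *arg)   /* one instruction stream */
{
    int a = 18, b = 9, c = 0;
    if (a > b) c = 7;
    printf("stream1: c = %d\n", c);
    return NULL;
}

void *stream2(void *arg)   /* a different instruction stream */
{
    int w = 7, t = 13;
    printf("stream2: w + t = %d\n", w + t);
    return NULL;
}

int main( void )
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stream1, NULL);
    pthread_create(&t2, NULL, stream2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}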

(11)

MIMD architectures

Shared Memory

A set of processors sharing a common memory space

Each processor can access any memory location

Distributed Memory

A set of compute nodes connected through an interconnection network

The simplest example: a cluster of PCs connected via Ethernet

Nodes can share data through explicit communication

[Figure: shared memory: multiple CPUs access a single memory through an interconnect. Distributed memory: each node pairs a CPU with its own local memory, and the nodes communicate through an interconnect.]
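Since distributed-memory nodes can only share data through explicit communication, a message-passing library such as MPI is typically used. A minimal sketch, assuming a working MPI installation (the value 42 and the two-process setup are illustrative). Compile with mpicc; run with at least two processes, e.g. mpirun -n 2 ./a.out:

#include <mpi.h>
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;   /* data local to node 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}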

(12)
(13)
(14)

Hybrid architectures

Many HPC systems are based on hybrid architectures

Each compute node is a shared-memory multiprocessor

A large number of compute nodes are connected through an interconnection network

[Figure: four compute nodes, each a shared-memory multiprocessor with several CPUs, local memory, and GPUs, connected by an interconnection network.]

(15)

[Screenshot: www.top500.org, June 2020]

(16)

[Screenshot: www.top500.org, June 2020]

(17)

SANDIA ASCI RED
Date: 1996
Peak performance: 1.8 Teraflops
Floor space: 150 m²
Power consumption: 800,000 W

(18)

Sony PLAYSTATION 3
Date: 2006
Peak performance: >1.8 Teraflops
Floor space: 0.08 m²
Power consumption:

SANDIA ASCI RED
Date: 1996
Peak performance: 1.8 Teraflops
Floor space: 150 m²
Power consumption: 800,000 W

(19)

Inside Sony's PS3

(20)

Rules of thumb

When writing parallel applications (especially on distributed-memory architectures) keep in mind that:

Computation is fast

Communication is slow

Input/output is incredibly slow

(21)

Recap

Shared memory

Advantages:

Easier to program

Useful for applications with irregular data access patterns (e.g., graph algorithms)

Disadvantages:

The programmer must take care of race conditions (see the sketch at the end of this recap)

Limited memory bandwidth

Distributed memory

Advantages:

Highly scalable: very high computational power can be obtained by adding more nodes

Useful for applications with strong locality of reference and a high computation/communication ratio

Disadvantages:

Latency of interconnect network

Difficult to program
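To illustrate the race-condition caveat above: a minimal OpenMP sketch (the counter and loop bound are illustrative). Without the atomic directive, threads can read the same old value of the shared counter and overwrite each other's update, losing increments. Compile with gcc -fopenmp:

#include <omp.h>
#include <stdio.h>

int main( void )
{
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        #pragma omp atomic   /* removing this line exposes the race */
        sum++;
    }
    printf("sum = %d\n", sum);   /* 1000000 with the atomic update */
    return 0;
}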
