Parallel Computing
Architectures
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
http://www.moreno.marzolla.name/
Pacheco, excerpts from Chapter 2
Von Neumann architecture
[Figure: the von Neumann architecture. A CPU containing the ALU, the registers R0 ... Rn-1 and the program counter (PC) is connected to the memory through a bus carrying control, data and address signals]
An Abstract Parallel Architecture
● How is parallelism handled?
● What is the exact physical location of the memories?
● What is the topology of the interconnection network?

[Figure: an abstract parallel architecture. Processors and memories are connected through an interconnection network]
Why are parallel architectures important?
● There is no "typical" parallel computer: different vendors use different architectures
● There is currently no "universal" programming paradigm that fits all architectures
– Parallel programs must be tailored to the underlying parallel architecture
– The architecture of a parallel computer limits the choice of the programming paradigm that can be used
Flynn's Taxonomy
SISD: Single Instruction Stream, Single Data Stream
SIMD: Single Instruction Stream, Multiple Data Streams
MISD: Multiple Instruction Streams, Single Data Stream
MIMD: Multiple Instruction Streams, Multiple Data Streams

                                 Data Streams
                                 Single                Multiple
Instruction      Single          SISD (von Neumann)    SIMD
Streams          Multiple        MISD                  MIMD
SIMD
● SIMD instructions apply the same operation (e.g., sum, product, ...) to multiple data elements (typically 4 or 8, depending on the width of the SIMD registers and the data types of the operands)
– This requires 4, 8, ... independent ALUs
Time →
Lane 0:   LOAD A[0]   LOAD B[0]   C[0] = A[0] + B[0]   STORE C[0]
Lane 1:   LOAD A[1]   LOAD B[1]   C[1] = A[1] + B[1]   STORE C[1]
Lane 2:   LOAD A[2]   LOAD B[2]   C[2] = A[2] + B[2]   STORE C[2]
Lane 3:   LOAD A[3]   LOAD B[3]   C[3] = A[3] + B[3]   STORE C[3]
(all lanes execute the same instruction at the same time, each on its own data)
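In C, the diagram above corresponds to an element-wise array sum. The following scalar loop is a minimal sketch; with auto-vectorization enabled (e.g., gcc -O2 -ftree-vectorize), the compiler can turn it into SIMD code that processes 4 or 8 elements per iteration:

/* Element-wise sum: a vectorizing compiler can execute several
   iterations of this loop with a single SIMD instruction. */
void vec_add( const float *a, const float *b, float *c, int n )
{
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}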
SSE
Streaming SIMD Extensions
● Extension to the x86 instruction set
● Provides new SIMD instructions operating on small arrays of integer or floating-point values
● Introduces 8 new 128-bit registers (XMM0–XMM7)
● SSE2 instructions can handle, in a single 128-bit register:
– 2 64-bit doubles, or
– 2 64-bit integers, or
– 4 32-bit integers, or
– 8 16-bit integers, or
– 16 8-bit chars
Example
__m128i a = _mm_set_epi32( 1, 2, 3, 4 );
__m128i b = _mm_set_epi32( 2, 4, 6, 8 );
__m128i s = _mm_add_epi32(a, b);
a:  |  1 |  2 |  3 |  4 |     (four 32-bit lanes)
      +    +    +    +
b:  |  2 |  4 |  6 |  8 |
      =    =    =    =
s:  |  3 |  6 |  9 | 12 |
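A complete, compilable version of this example might look as follows (a sketch; assumes a compiler with SSE2 support, e.g., gcc -msse2). Note that _mm_set_epi32 takes its arguments from the most significant element down to the least significant one:

#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main( void )
{
    __m128i a = _mm_set_epi32( 1, 2, 3, 4 );
    __m128i b = _mm_set_epi32( 2, 4, 6, 8 );
    __m128i s = _mm_add_epi32(a, b);      /* four 32-bit additions at once */
    int v[4];
    _mm_storeu_si128( (__m128i*)v, s );   /* copy the SIMD register to memory */
    printf("%d %d %d %d\n", v[3], v[2], v[1], v[0]); /* prints: 3 6 9 12 */
    return 0;
}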
MIMD
● In MIMD systems there are multiple execution units that can execute multiple sequences of instructions
– Multiple Instruction Streams
● Each execution unit generally operates on its own input data
– Multiple Data Streams
Time →
Unit 1:   CALL F()    z = 8       y = 1.7
Unit 2:   a = 18      b = 9       if ( a>b ) c = 7
Unit 3:   w = 7       t = 13      k = G(w,t)
Unit 4:   LOAD A[0]   LOAD B[0]   C[0] = A[0] + B[0]
(each unit executes its own instruction stream on its own data)
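As a concrete illustration (a sketch, not part of the original slides), two POSIX threads executing different code on different data already form a minimal MIMD program:

#include <stdio.h>
#include <pthread.h>

/* Two different instruction streams, each with its own data */
void *stream1( void *arg )
{
    int a = 18, b = 9, c = 0;
    if ( a > b ) c = 7;
    printf("c = %d\n", c);
    return NULL;
}

void *stream2( void *arg )
{
    int w = 7, t = 13;
    printf("w + t = %d\n", w + t);
    return NULL;
}

int main( void )
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stream1, NULL);
    pthread_create(&t2, NULL, stream2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}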
MIMD architectures
● Shared Memory
– A set of processors sharing a common memory space
– Each processor can access any memory location
● Distributed Memory
– A set of compute nodes connected through an interconnection network
● The simplest example: a cluster of PCs connected via Ethernet
– Nodes can share data through explicit communication (see the MPI sketch below)
[Figure: shared memory (several CPUs connected to a single memory through an interconnect) vs. distributed memory (several CPU + local memory nodes connected through an interconnect)]
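On a distributed-memory system the explicit communication mentioned above typically means message passing. A minimal sketch using MPI (assumes an MPI installation; compile with mpicc, run with mpirun -n 2):

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 42;
        /* explicit communication: send x to process 1 */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received x = %d\n", x);
    }
    MPI_Finalize();
    return 0;
}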
Hybrid architectures
● Many HPC systems are based on hybrid architectures
– Each compute node is a shared-memory multiprocessor (often equipped with GPUs)
– A large number of compute nodes is connected through an interconnection network

[Figure: a hybrid architecture. Several compute nodes, each containing multiple CPUs with a shared local memory and GPUs, are connected through an interconnect]
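Hybrid architectures are often programmed with a matching hybrid model, e.g., MPI across nodes and OpenMP within each node. A minimal sketch (assumes MPI and OpenMP support; compile with mpicc -fopenmp):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main( int argc, char *argv[] )
{
    int rank;
    MPI_Init(&argc, &argv);               /* one MPI process per node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel                      /* one thread per core within the node */
    {
        printf("node %d, thread %d\n", rank, omp_get_thread_num());
    }
    MPI_Finalize();
    return 0;
}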
[Slides: snapshots of the TOP500 list, www.top500.org (June 2020)]
                      SANDIA ASCI RED     Sony PLAYSTATION 3
Date:                 1996                2006
Peak performance:     1.8 Teraflops       >1.8 Teraflops
Floor space:          150 m²              0.08 m²
Power consumption:    800,000 W
[Photo: inside Sony's PS3]
Rules of thumb
● When writing parallel applications (especially on distributed-memory architectures) keep in mind that:
– Computation is fast
– Communication is slow
– Input/output is incredibly slow
Recap
Shared memory
● Advantages:
– Easier to program
– Useful for applications with irregular data access patterns (e.g., graph algorithms)
● Disadvantages:
– The programmer must take care of race conditions
– Limited memory bandwidth
Distributed memory
● Advantages:
– Highly scalable: very high computational power can be provided by adding more nodes
– Useful for applications with strong locality of reference and a high computation/communication ratio
● Disadvantages:
– Latency of interconnect network
– Difficult to program