Parallel Computing
Architectures
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
http://www.moreno.marzolla.name/
Pacheco, excerpts from Chapter 2
Von Neumann architecture
[Figure: the von Neumann architecture. A CPU containing the ALU, the registers R0 ... Rn-1 and the program counter (PC) is connected to the memory through a bus carrying control, data and address signals]
An Abstract Parallel Architecture
● How is parallelism handled?
● What is the exact physical location of the memories?
● What is the topology of the interconnection network?

[Figure: an abstract parallel architecture. Processors and memories are connected through an interconnection network]
Why are parallel architectures important?
● There is no "typical" parallel computer: different vendors use different architectures
● There is currently no "universal" programming paradigm that fits all architectures
– Parallel programs must be tailored to the underlying parallel architecture
– The architecture of a parallel computer limits the choice of the programming paradigm that can be used
Flynn's Taxonomy
SISD: Single Instruction Stream, Single Data Stream
SIMD: Single Instruction Stream, Multiple Data Streams
MISD: Multiple Instruction Streams, Single Data Stream
MIMD: Multiple Instruction Streams, Multiple Data Streams

                                 Data Streams
                                 Single                Multiple
Instruction      Single          SISD (von Neumann)    SIMD
Streams          Multiple        MISD                  MIMD
SIMD
● SIMD instructions apply the same operation (e.g., sum, product, ...) to multiple data elements (typically 4 or 8, depending on the width of the SIMD registers and the data types of the operands)
– This requires 4, 8, ... independent ALUs
Time →
Lane 0:   LOAD A[0]   LOAD B[0]   C[0] = A[0] + B[0]   STORE C[0]
Lane 1:   LOAD A[1]   LOAD B[1]   C[1] = A[1] + B[1]   STORE C[1]
Lane 2:   LOAD A[2]   LOAD B[2]   C[2] = A[2] + B[2]   STORE C[2]
Lane 3:   LOAD A[3]   LOAD B[3]   C[3] = A[3] + B[3]   STORE C[3]
(all lanes execute the same instruction at the same time, each on its own data)
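In C, the diagram above corresponds to an element-wise array sum. The following scalar loop is a minimal sketch; with auto-vectorization enabled (e.g., gcc -O2 -ftree-vectorize), the compiler can turn it into SIMD code that processes 4 or 8 elements per iteration:

/* Element-wise sum: a vectorizing compiler can execute several
   iterations of this loop with a single SIMD instruction. */
void vec_add( const float *a, const float *b, float *c, int n )
{
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}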
SSE
Streaming SIMD Extensions
● Extension to the x86 instruction set
● Provides new SIMD instructions operating on small arrays of integer or floating-point values
● Introduces 8 new 128-bit registers (XMM0–XMM7)
● SSE2 instructions can handle, in a single 128-bit register:
– 2 64-bit doubles, or
– 2 64-bit integers, or
– 4 32-bit integers, or
– 8 16-bit integers, or
– 16 8-bit chars
Example
__m128i a = _mm_set_epi32( 1, 2, 3, 4 );
__m128i b = _mm_set_epi32( 2, 4, 6, 8 );
__m128i s = _mm_add_epi32(a, b);
a:  |  1 |  2 |  3 |  4 |     (four 32-bit lanes)
      +    +    +    +
b:  |  2 |  4 |  6 |  8 |
      =    =    =    =
s:  |  3 |  6 |  9 | 12 |
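A complete, compilable version of this example might look as follows (a sketch; assumes a compiler with SSE2 support, e.g., gcc -msse2). Note that _mm_set_epi32 takes its arguments from the most significant element down to the least significant one:

#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main( void )
{
    __m128i a = _mm_set_epi32( 1, 2, 3, 4 );
    __m128i b = _mm_set_epi32( 2, 4, 6, 8 );
    __m128i s = _mm_add_epi32(a, b);      /* four 32-bit additions at once */
    int v[4];
    _mm_storeu_si128( (__m128i*)v, s );   /* copy the SIMD register to memory */
    printf("%d %d %d %d\n", v[3], v[2], v[1], v[0]); /* prints: 3 6 9 12 */
    return 0;
}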
MIMD
● In MIMD systems there are multiple execution units that can execute multiple sequences of instructions
– Multiple Instruction Streams
● Each execution unit generally operates on its own input data
– Multiple Data Streams
Time →
Unit 1:   CALL F()    z = 8       y = 1.7
Unit 2:   a = 18      b = 9       if ( a>b ) c = 7
Unit 3:   w = 7       t = 13      k = G(w,t)
Unit 4:   LOAD A[0]   LOAD B[0]   C[0] = A[0] + B[0]
(each unit executes its own instruction stream on its own data)
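As a concrete illustration (a sketch, not part of the original slides), two POSIX threads executing different code on different data already form a minimal MIMD program:

#include <stdio.h>
#include <pthread.h>

/* Two different instruction streams, each with its own data */
void *stream1( void *arg )
{
    int a = 18, b = 9, c = 0;
    if ( a > b ) c = 7;
    printf("c = %d\n", c);
    return NULL;
}

void *stream2( void *arg )
{
    int w = 7, t = 13;
    printf("w + t = %d\n", w + t);
    return NULL;
}

int main( void )
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stream1, NULL);
    pthread_create(&t2, NULL, stream2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}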
MIMD architectures
● Shared Memory
– A set of processors sharing a common memory space
– Each processor can access any memory location
● Distributed Memory
– A set of compute nodes connected through an interconnection network
● The simplest example: a cluster of PCs connected via Ethernet
– Nodes can share data through explicit communication (see the MPI sketch below)
[Figure: shared memory (several CPUs connected to a single memory through an interconnect) vs. distributed memory (several CPU + local memory nodes connected through an interconnect)]
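On a distributed-memory system the explicit communication mentioned above typically means message passing. A minimal sketch using MPI (assumes an MPI installation; compile with mpicc, run with mpirun -n 2):

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 42;
        /* explicit communication: send x to process 1 */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received x = %d\n", x);
    }
    MPI_Finalize();
    return 0;
}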
Hybrid architectures
● Many HPC systems are based on hybrid architectures
– Each compute node is a shared-memory multiprocessor (often equipped with GPUs)
– A large number of compute nodes is connected through an interconnection network

[Figure: a hybrid architecture. Several compute nodes, each containing multiple CPUs with a shared local memory and GPUs, are connected through an interconnect]
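Hybrid architectures are often programmed with a matching hybrid model, e.g., MPI across nodes and OpenMP within each node. A minimal sketch (assumes MPI and OpenMP support; compile with mpicc -fopenmp):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main( int argc, char *argv[] )
{
    int rank;
    MPI_Init(&argc, &argv);               /* one MPI process per node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel                      /* one thread per core within the node */
    {
        printf("node %d, thread %d\n", rank, omp_get_thread_num());
    }
    MPI_Finalize();
    return 0;
}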
[Slides: snapshots of the TOP500 list, www.top500.org (June 2020)]
                      SANDIA ASCI RED     Sony PLAYSTATION 3
Date:                 1996                2006
Peak performance:     1.8 Teraflops       >1.8 Teraflops
Floor space:          150 m²              0.08 m²
Power consumption:    800,000 W
[Photo: inside Sony's PS3]
Rules of thumb
● When writing parallel applications (especially on distributed-memory architectures) keep in mind that:
– Computation is fast
– Communication is slow
– Input/output is incredibly slow
Recap
Shared memory
● Advantages:
– Easier to program
– Useful for applications with irregular data access patterns (e.g., graph algorithms)
● Disadvantages:
– The programmer must take care of race conditions
– Limited memory bandwidth
Distributed memory
● Advantages:
– Highly scalable: very high computational power can be provided by adding more nodes
– Useful for applications with strong locality of reference and a high computation/communication ratio
● Disadvantages:
– Latency of interconnect network
– Difficult to program