Parallel Programming Trends in Extremely Scalable Architectures
Carlo Cavazzoni, HPC department, CINECA
CINECA
CINECA is a non-profit consortium made up of 50 Italian universities, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).
CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.
Why parallel programming?
Solve larger problems
Run memory demanding codes
Solve problems with greater speed
Modern Parallel Architectures
Two basic architectural schemes:
• Distributed Memory
• Shared Memory
Now most computers have a mixed architecture
+ accelerators -> hybrid architectures
Distributed Memory
[Diagram: several nodes, each with its own CPU and local memory, connected through a network]
Shared Memory
[Diagram: several CPUs accessing a common memory]
Real Shared: all CPUs access the memory banks through a common system bus.
Virtual Shared: each node (CPU + local memory + HUB) is connected through a network; the distributed memories are addressed as a single shared memory.
Mixed Architectures
[Diagram: several nodes, each with multiple CPUs sharing a local memory, connected through a network]
Most Common Networks
• Cube, hypercube, n-cube
• Torus in 1, 2, ..., N dimensions
• Switched
• Fat Tree
HPC Trends
Top500 [performance growth chart]
Paradigm Change in HPC
What about applications?
The next HPC system installed at CINECA will have 200,000 cores.
Roadmap to Exascale
(architectural trends)
Dennard Scaling law (MOSFET):
L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L^2 = 4 * D
P' = P
...does not hold anymore!
The power crisis!
L' = L / 2
V' = ~V
F' = ~F * 2
D' = 1 / L^2 = 4 * D
P' = 4 * P
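A back-of-the-envelope way to see where the 4x comes from (a sketch, assuming the usual dynamic-power model in which the power per transistor scales as C * V^2 * F and the capacitance C scales with the feature size L):

With classic Dennard scaling (C' = C/2, V' = V/2, F' = 2*F):
  P_transistor' = (C/2) * (V/2)^2 * (2*F) = P_transistor / 4
  4x more transistors per chip  ->  chip power P' = P
Without voltage scaling (C' = C/2, V' = V, F' = 2*F):
  P_transistor' = (C/2) * V^2 * (2*F) = P_transistor
  4x more transistors per chip  ->  chip power P' = 4 * P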
Core frequency and performance no longer grow following Moore's law.
CPU + accelerator architectures are used to keep the evolution of systems on the Moore's law track.
Programming crisis!
Where Watts are burnt?
Today (at 40 nm) moving the three 64-bit operands needed by a 64-bit floating-point FMA takes 4.7x the energy of the FMA operation itself.
Operands A, B, C:  D = A + B * C
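For reference, the FMA (fused multiply-add) referred to above is exposed in C99 as fma() in <math.h>; a minimal illustration (the values are arbitrary):

#include <math.h>    /* fma(x, y, z) computes x * y + z as one fused operation (C99) */
#include <stdio.h>

int main(void) {
    double A = 1.0, B = 2.0, C = 3.0;
    double D = fma(B, C, A);        /* D = A + B * C: three operands in, one result out */
    printf("D = %f\n", D);          /* prints 7.000000 */
    return 0;
}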
MPP System
When: 2012
Performance: > 2 PFlop/s
Power: > 1 MWatt
Cores: > 150,000
Threads: > 500,000
Architecture option: BG/Q
A set (one or more) of very simple execution units that can perform only a few operations (with respect to a standard CPU), but with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni)
[Diagram: CPU coupled to an accelerator (ACC), either as a separate device or physically integrated; the CPU targets single-thread performance, the accelerator targets throughput]
nVIDIA GPU: the Fermi implementation packs 512 processor cores.
ATI FireStream, AMD GPU: 2012, the new Graphics Core Next ("GCN") architecture, with a new instruction set and new ...
Intel MIC (Knights Ferry)
What about parallel applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
maximum speedup tends to 1 / (1 - P)
P = parallel fraction
With 1,000,000 cores, even a very small serial fraction limits the speedup: for P = 0.9999 the speedup can never exceed 10,000.
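A quick sanity check of these numbers (a minimal C sketch; the finite-core form of Amdahl's law is S(N) = 1 / ((1 - P) + P / N)):

#include <stdio.h>

/* Amdahl's law: speedup on n cores for parallel fraction p */
static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double n = 1e6;  /* one million cores */
    double fractions[] = { 0.99, 0.999, 0.9999, 0.999999 };
    for (int i = 0; i < 4; ++i) {
        double p = fractions[i];
        printf("P = %-8g  speedup on 1M cores = %10.1f  (limit 1/(1-P) = %10.1f)\n",
               p, amdahl_speedup(p, n), 1.0 / (1.0 - p));
    }
    return 0;
}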
Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space Programming (PGAS) Languages
UPC, Coarray Fortran, Titanium
• Next Generation Programming Languages and Models
Chapel, X10, Fortress
• Languages and Paradigm for Hardware Accelerators
CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
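To make the hybrid model above concrete, a minimal sketch (illustrative names, not a production code): MPI distributes the work across nodes, while OpenMP splits the local loop across the cores of each node.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided, rank, nranks;
    /* ask for an MPI library able to cope with threads */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP: intra-node parallelism over the locally owned iterations */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; ++i)
        local_sum += 1.0;           /* placeholder for real per-element work */

    /* MPI: inter-node parallelism, combine the per-rank results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks = %d, threads/rank = %d, sum = %f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}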
Trends
• Vector
• Distributed memory
• Shared memory
• MPP systems, Message Passing: MPI
• Multi-core nodes: OpenMP
• Accelerators (GPGPU, FPGA): CUDA, OpenCL
• Scalar application
Message Passing
domain decomposition
[Diagram: the domain is partitioned into sub-domains, one per node (each node with its own CPU and memory), connected through an internal high-performance network]
Ghost Cells - Data exchange
[Diagram: a 5-point stencil (i,j), (i-1,j), (i+1,j), (i,j-1), (i,j+1) straddling the sub-domain boundary between Processor 1 and Processor 2; the cells just beyond the boundary are replicated locally as ghost cells]
Ghost cells are exchanged between processors at every update.
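A minimal sketch of such a ghost-cell (halo) exchange with MPI, assuming a 1D decomposition and a hypothetical local array u whose first and last elements are the ghost cells:

#include <mpi.h>

/* Exchange ghost cells with the left and right neighbours (1D decomposition).
   u holds nloc interior points in u[1..nloc]; u[0] and u[nloc+1] are ghosts. */
void exchange_ghosts(double *u, int nloc, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first interior point to the left, receive my right ghost from the right */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[nloc + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* send my last interior point to the right, receive my left ghost from the left */
    MPI_Sendrecv(&u[nloc],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}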
Message Passing: MPI
Main Characteristics
• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partition
• Distributed memory
• Almost all HPC parallel applications
Open Issues
• Latency
• OS jitter
• Scalability
Shared memory
[Diagram: one node with a single shared memory; its CPUs run threads 0-3, which all access the same shared variables x and y]
Shared Memory: OpenMP
Main Characteristics
• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC applications
Open Issues
• Thread creation overhead
• Memory/core affinity
• Interface with MPI
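A minimal sketch of the loop/iteration-partition style in C (illustrative names); the Fortran pseudo-code on the next slide applies the same idea to a 3D FFT.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    double dot = 0.0;

    /* the loop iterations are partitioned among the threads of the team */
    #pragma omp parallel for reduction(+ : dot)
    for (int i = 0; i < N; ++i) {
        c[i] = a[i] + b[i];
        dot += a[i] * b[i];
    }

    printf("max threads: %d, dot = %f\n", omp_get_max_threads(), dot);
    return 0;
}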
OpenMP
!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do
call fw_scatter ( . . . )
!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
Accelerator/GPGPU
Element-wise sum of two 1D arrays - CUDA sample:
// CPU version: one loop over all elements
void CPUCode( int* input1, int* input2, int* output, int length ) {
    for ( int i = 0; i < length; ++i ) {
        output[ i ] = input1[ i ] + input2[ i ];
    }
}

// GPU version: one thread per element
__global__ void GPUCode( int* input1, int* input2, int* output, int length ) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if ( idx < length ) {
        output[ idx ] = input1[ idx ] + input2[ idx ];
    }
}
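For completeness, a sketch of how the GPUCode kernel above might be launched from the host (standard CUDA runtime calls; the helper name and the block size of 256 are illustrative choices):

// Host-side driver (sketch): allocate, copy, launch, copy back
void RunGPUCode( int* h_in1, int* h_in2, int* h_out, int length ) {
    int *d_in1, *d_in2, *d_out;
    size_t bytes = length * sizeof(int);

    cudaMalloc( (void**)&d_in1, bytes );
    cudaMalloc( (void**)&d_in2, bytes );
    cudaMalloc( (void**)&d_out, bytes );

    cudaMemcpy( d_in1, h_in1, bytes, cudaMemcpyHostToDevice );
    cudaMemcpy( d_in2, h_in2, bytes, cudaMemcpyHostToDevice );

    int threadsPerBlock = 256;                                   // illustrative choice
    int blocks = ( length + threadsPerBlock - 1 ) / threadsPerBlock;
    GPUCode<<< blocks, threadsPerBlock >>>( d_in1, d_in2, d_out, length );

    cudaMemcpy( h_out, d_out, bytes, cudaMemcpyDeviceToHost );

    cudaFree( d_in1 );
    cudaFree( d_in2 );
    cudaFree( d_out );
}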