Parallel Programming Trends in Extremely Scalable Architectures
Carlo Cavazzoni, HPC department, CINECA
CINECA
CINECA is a non-profit consortium made up of 50 Italian universities, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).
CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.
Why parallel programming?
Solve larger problems
Run memory demanding codes
Solve problems with greater speed
Modern Parallel Architectures
Two basic architectural schemes:
• Distributed Memory
• Shared Memory
Now most computers have a mixed architecture
+ accelerators -> hybrid architectures
Distributed Memory
[Diagram: several nodes, each with its own CPU and local memory, connected through a network]
Shared Memory
[Diagram: several CPUs accessing a common memory]
Real Shared: all CPUs access the memory banks through a common system bus.
Virtual Shared: each node (CPU + local memory + HUB) is connected through a network; the distributed memories are addressed as a single shared memory.
Mixed Architectures
[Diagram: several nodes, each with multiple CPUs sharing a local memory, connected through a network]
Most Common Networks
• Cube, hypercube, n-cube
• Torus in 1, 2, ..., N dimensions
• Switched
• Fat Tree
HPC Trends
Top500 [performance growth chart]
Paradigm Change in HPC
What about applications?
The next HPC system installed at CINECA will have 200,000 cores.
Roadmap to Exascale
(architectural trends)
Dennard Scaling law (MOSFET):
L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L^2 = 4 * D
P' = P
...does not hold anymore!
The power crisis!
L' = L / 2
V' = ~V
F' = ~F * 2
D' = 1 / L^2 = 4 * D
P' = 4 * P
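A back-of-the-envelope way to see where the 4x comes from (a sketch, assuming the usual dynamic-power model in which the power per transistor scales as C * V^2 * F and the capacitance C scales with the feature size L):

With classic Dennard scaling (C' = C/2, V' = V/2, F' = 2*F):
  P_transistor' = (C/2) * (V/2)^2 * (2*F) = P_transistor / 4
  4x more transistors per chip  ->  chip power P' = P
Without voltage scaling (C' = C/2, V' = V, F' = 2*F):
  P_transistor' = (C/2) * V^2 * (2*F) = P_transistor
  4x more transistors per chip  ->  chip power P' = 4 * P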
Core frequency and performance no longer grow following Moore's law.
CPU + accelerator architectures are used to keep the evolution of systems on the Moore's law track.
Programming crisis!
Where Watts are burnt?
Today (at 40 nm) moving the three 64-bit operands needed by a 64-bit floating-point FMA takes 4.7x the energy of the FMA operation itself.
Operands A, B, C:  D = A + B * C
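For reference, the FMA (fused multiply-add) referred to above is exposed in C99 as fma() in <math.h>; a minimal illustration (the values are arbitrary):

#include <math.h>    /* fma(x, y, z) computes x * y + z as one fused operation (C99) */
#include <stdio.h>

int main(void) {
    double A = 1.0, B = 2.0, C = 3.0;
    double D = fma(B, C, A);        /* D = A + B * C: three operands in, one result out */
    printf("D = %f\n", D);          /* prints 7.000000 */
    return 0;
}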
MPP System
When: 2012
Performance: > 2 PFlop/s
Power: > 1 MWatt
Cores: > 150,000
Threads: > 500,000
Architecture option: BG/Q
A set (one or more) of very simple execution units that can perform only a few operations (with respect to a standard CPU), but with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni)
[Diagram: CPU coupled to an accelerator (ACC), either as a separate device or physically integrated; the CPU targets single-thread performance, the accelerator targets throughput]
nVIDIA GPU: the Fermi implementation packs 512 processor cores.
ATI FireStream, AMD GPU: 2012, the new Graphics Core Next ("GCN") architecture, with a new instruction set and new ...
Intel MIC (Knights Ferry)
What about parallel applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
maximum speedup tends to 1 / (1 - P)
P = parallel fraction
With 1,000,000 cores, even a very small serial fraction limits the speedup: for P = 0.9999 the speedup can never exceed 10,000.
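A quick sanity check of these numbers (a minimal C sketch; the finite-core form of Amdahl's law is S(N) = 1 / ((1 - P) + P / N)):

#include <stdio.h>

/* Amdahl's law: speedup on n cores for parallel fraction p */
static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double n = 1e6;  /* one million cores */
    double fractions[] = { 0.99, 0.999, 0.9999, 0.999999 };
    for (int i = 0; i < 4; ++i) {
        double p = fractions[i];
        printf("P = %-8g  speedup on 1M cores = %10.1f  (limit 1/(1-P) = %10.1f)\n",
               p, amdahl_speedup(p, n), 1.0 / (1.0 - p));
    }
    return 0;
}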
Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space Programming (PGAS) Languages
UPC, Coarray Fortran, Titanium
• Next Generation Programming Languages and Models
Chapel, X10, Fortress
• Languages and Paradigm for Hardware Accelerators
CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
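To make the hybrid model above concrete, a minimal sketch (illustrative names, not a production code): MPI distributes the work across nodes, while OpenMP splits the local loop across the cores of each node.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided, rank, nranks;
    /* ask for an MPI library able to cope with threads */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP: intra-node parallelism over the locally owned iterations */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; ++i)
        local_sum += 1.0;           /* placeholder for real per-element work */

    /* MPI: inter-node parallelism, combine the per-rank results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks = %d, threads/rank = %d, sum = %f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}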
Trends
• Vector
• Distributed memory
• Shared memory
• MPP systems, Message Passing: MPI
• Multi-core nodes: OpenMP
• Accelerators (GPGPU, FPGA): CUDA, OpenCL
• Scalar application
Message Passing
domain decomposition
[Diagram: the domain is partitioned into sub-domains, one per node (each node with its own CPU and memory), connected through an internal high-performance network]
Ghost Cells - Data exchange
[Diagram: a 5-point stencil (i,j), (i-1,j), (i+1,j), (i,j-1), (i,j+1) straddling the sub-domain boundary between Processor 1 and Processor 2; the cells just beyond the boundary are replicated locally as ghost cells]
Ghost cells are exchanged between processors at every update.
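A minimal sketch of such a ghost-cell (halo) exchange with MPI, assuming a 1D decomposition and a hypothetical local array u whose first and last elements are the ghost cells:

#include <mpi.h>

/* Exchange ghost cells with the left and right neighbours (1D decomposition).
   u holds nloc interior points in u[1..nloc]; u[0] and u[nloc+1] are ghosts. */
void exchange_ghosts(double *u, int nloc, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first interior point to the left, receive my right ghost from the right */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[nloc + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* send my last interior point to the right, receive my left ghost from the left */
    MPI_Sendrecv(&u[nloc],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}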
Message Passing: MPI
Main Characteristics
• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partition
• Distributed memory
• Almost all HPC parallel applications
Open Issues
• Latency
• OS jitter
• Scalability
Shared memory
[Diagram: one node with a single shared memory; its CPUs run threads 0-3, which all access the same shared variables x and y]
Shared Memory: OpenMP
Main Characteristics
• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC applications
Open Issues
• Thread creation overhead
• Memory/core affinity
• Interface with MPI
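A minimal sketch of the loop/iteration-partition style in C (illustrative names); the Fortran pseudo-code on the next slide applies the same idea to a 3D FFT.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    double dot = 0.0;

    /* the loop iterations are partitioned among the threads of the team */
    #pragma omp parallel for reduction(+ : dot)
    for (int i = 0; i < N; ++i) {
        c[i] = a[i] + b[i];
        dot += a[i] * b[i];
    }

    printf("max threads: %d, dot = %f\n", omp_get_max_threads(), dot);
    return 0;
}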
OpenMP
!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do
call fw_scatter ( . . . )
!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
Accelerator/GPGPU
Element-wise sum of two 1D arrays - CUDA sample:
// CPU version: one loop over all elements
void CPUCode( int* input1, int* input2, int* output, int length ) {
    for ( int i = 0; i < length; ++i ) {
        output[ i ] = input1[ i ] + input2[ i ];
    }
}

// GPU version: one thread per element
__global__ void GPUCode( int* input1, int* input2, int* output, int length ) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if ( idx < length ) {
        output[ idx ] = input1[ idx ] + input2[ idx ];
    }
}
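For completeness, a sketch of how the GPUCode kernel above might be launched from the host (standard CUDA runtime calls; the helper name and the block size of 256 are illustrative choices):

// Host-side driver (sketch): allocate, copy, launch, copy back
void RunGPUCode( int* h_in1, int* h_in2, int* h_out, int length ) {
    int *d_in1, *d_in2, *d_out;
    size_t bytes = length * sizeof(int);

    cudaMalloc( (void**)&d_in1, bytes );
    cudaMalloc( (void**)&d_in2, bytes );
    cudaMalloc( (void**)&d_out, bytes );

    cudaMemcpy( d_in1, h_in1, bytes, cudaMemcpyHostToDevice );
    cudaMemcpy( d_in2, h_in2, bytes, cudaMemcpyHostToDevice );

    int threadsPerBlock = 256;                                   // illustrative choice
    int blocks = ( length + threadsPerBlock - 1 ) / threadsPerBlock;
    GPUCode<<< blocks, threadsPerBlock >>>( d_in1, d_in2, d_out, length );

    cudaMemcpy( h_out, d_out, bytes, cudaMemcpyDeviceToHost );

    cudaFree( d_in1 );
    cudaFree( d_in2 );
    cudaFree( d_out );
}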