
(1)

Parallel Programming Trends in Extremely Scalable Architectures

Carlo Cavazzoni, HPC department, CINECA

(2)

CINECA

CINECA is a non-profit consortium made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).

CINECA is the largest Italian computing centre and one of the most important worldwide.

The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology-transfer initiatives for industry.

(3)

Why parallel programming?

Solve larger problems

Run memory-demanding codes

Solve problems with greater speed

(4)

Modern Parallel Architectures

Two basic architectural schemes:

• Distributed memory
• Shared memory

Most computers now have a mixed architecture; adding accelerators leads to hybrid architectures.

(5)

Distributed Memory

[Figure: nodes, each with a CPU and its own local memory, connected by a network]

(6)

Shared Memory

[Figure: several CPUs attached to a single shared memory]

(7)

Real Shared Memory

[Figure: CPUs connected through a system bus to shared memory banks]

(8)

Virtual Shared Memory

[Figure: nodes (CPU + local memory) connected through HUBs to a network; memory appears shared across the interconnect]

(9)

Mixed Architectures

[Figure: nodes, each with several CPUs sharing local memory, connected by a network]

(10)

Most Common Networks

• Cube, hypercube, n-cube
• Torus in 1, 2, ..., N dimensions
• Switched
• Fat tree

(11)

HPC Trends

(12)

Top500

Paradigm change in HPC: what about applications?

The next HPC system installed at CINECA will have 200,000 cores.

(13)

Roadmap to Exascale

(architectural trends)

(14)

Dennard Scaling law (MOSFET)

Classic scaling: L' = L / 2, V' = V / 2, F' = 2 F, D' = 1 / L'^2 = 4 D, P' = P

These rules do not hold anymore: the power crisis!

Current scaling: L' = L / 2, V' ≈ V, F' ≈ 2 F, D' = 1 / L'^2 = 4 D, P' = 4 P

The core frequency and performance no longer grow following Moore's law.

CPU + accelerator architectures are adopted to keep system performance on the Moore's law track.

The programming crisis!
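As a sketch of the reasoning behind these numbers (an addition, not on the original slide), the standard dynamic-power relation for CMOS can be scaled under both regimes:

% Dynamic power per transistor (activity factor assumed constant)
\[ P_{\mathrm{dyn}} \propto C\,V^2 F \]
% Dennard scaling: C' = C/2, V' = V/2, F' = 2F
\[ P' \propto \tfrac{C}{2}\cdot\tfrac{V^2}{4}\cdot 2F = \tfrac{P}{4}, \qquad D' = 4D \;\Rightarrow\; \text{chip power unchanged } (P' = P) \]
% Post-Dennard: C' = C/2, V' \approx V, F' \approx 2F
\[ P' \propto \tfrac{C}{2}\cdot V^2\cdot 2F = P, \qquad D' = 4D \;\Rightarrow\; \text{chip power } P' = 4P \]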

(15)

Where Watts are burnt?

Today (at 40 nm), moving the three 64-bit operands needed to compute a 64-bit floating-point FMA takes 4.7x the energy of the FMA operation itself.

D = A + B × C (operands A, B, C)

(16)

MPP System

When: 2012
Peak performance: > 2 PFlop/s
Power: > 1 MW
Cores: > 150,000
Threads: > 500,000
Architecture option: BG/Q

(17)

Accelerator

A set (one or more) of very simple execution units that can perform a few operations (compared with a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni)

[Figure: CPU plus accelerator, with increasing physical integration; trade-off between single-thread performance and throughput]

(18)

nVIDIA GPU

The Fermi implementation will pack 512 processor cores.

(19)

ATI FireStream, AMD GPU

2012

New Graphics Core Next "GCN", with a new instruction set and new …

(20)

Intel MIC (Knights Ferry)

(21)

What about parallel applications?

In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.

At 1,000,000 cores, even a tiny serial fraction limits the achievable speedup.
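A minimal sketch (in C, not part of the original slides) that evaluates Amdahl's speedup S(N) = 1 / ((1 − P) + P / N) for a hypothetical 1,000,000-core machine:

#include <stdio.h>

/* Amdahl's law: speedup on n cores for parallel fraction p (illustrative sketch). */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double n = 1.0e6;                       /* 1,000,000 cores, as on the slide */
    const double fractions[] = { 0.99, 0.999, 0.999999 };

    for (int i = 0; i < 3; ++i) {
        double p = fractions[i];
        printf("P = %-9g  S(1e6 cores) = %12.1f  S(max) = %12.1f\n",
               p, amdahl_speedup(p, n), 1.0 / (1.0 - p));
    }
    return 0;
}

Even with P = 0.999, the speedup saturates near 1,000, far below the core count; this is the scalability limit the slide refers to.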

(22)

Programming Models

• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
• Next-generation programming languages and models: Chapel, X10, Fortress
• Languages and paradigms for hardware accelerators: CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL

(23)

Trends

• Vector
• Distributed memory
• Shared memory
• MPP systems, message passing: MPI
• Multi-core nodes: OpenMP
• Accelerators (GPGPU, FPGA): CUDA, OpenCL
• Scalar applications

(24)

Message Passing

Domain decomposition

[Figure: six nodes, each with a CPU and local memory, connected by an internal high-performance network; the domain is partitioned across the nodes]

(25)

Ghost Cells - Data exchange

[Figure: 5-point stencil (i,j), (i±1,j), (i,j±1) across a sub-domain boundary; ghost cells are exchanged between Processor 1 and Processor 2 at every update]
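As an illustration of the exchange sketched above (an addition, not from the slides), here is a hedged one-dimensional halo exchange in C with MPI; the slide shows a 2D stencil, and names such as u and nloc are hypothetical:

#include <mpi.h>

/* Illustrative 1D halo (ghost-cell) exchange between neighbouring ranks.
   u holds nloc interior points plus one ghost cell at each end: u[0] and u[nloc+1]. */
void exchange_ghost_cells(double *u, int nloc, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior point to the left neighbour;
       receive the right neighbour's first interior point into my right ghost cell. */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[nloc + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send my last interior point to the right neighbour;
       receive the left neighbour's last interior point into my left ghost cell. */
    MPI_Sendrecv(&u[nloc], 1, MPI_DOUBLE, right, 1,
                 &u[0],    1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}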

(26)

Message Passing: MPI

Main characteristics:

• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partitioning
• Distributed memory
• Almost all HPC parallel applications

Open issues:

• Latency
• OS jitter
• Scalability

(27)

Shared memory

[Figure: a node with several CPUs sharing one memory; threads 0–3 operate on shared variables x and y]

(28)

Shared Memory: OpenMP

Main characteristics:

• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partitioning
• Shared memory
• Many HPC applications

Open issues:

• Thread creation overhead
• Memory/core affinity
• Interface with MPI

(29)

OpenMP

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl

   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do

   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do

end do
!$omp end parallel
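The pseudocode above calls application-specific FFT routines. As a complement (an addition, not from the slides), here is a minimal self-contained sketch of the same loop-level OpenMP pattern, written in C with a hypothetical array-sum workload:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* Each thread fills a chunk of the array in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = 0.001 * i;

    /* Parallel reduction over the shared array. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += a[i];

    printf("threads = %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}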

(30)

Accelerator/GPGPU

Sum of 1D arrays

[Figure: element-wise addition of two 1D arrays]

(31)

CUDA sample

void CPUCode( int* input1, int* input2, int* output, int length) { for ( int i = 0; i < length; ++i ) {

output[ i ] = input1[ i ] + input2[ i ];

} }

__global__void GPUCode( int* input1, int*input2, int* output, int length) { int idx = blockDim.x * blockIdx.x + threadIdx.x;

if ( idx < length ) {

output[ idx ] = input1[ idx ] + input2[ idx ];

} }
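For completeness, a hedged sketch (not part of the original sample) of how such a kernel is typically launched from the host: device allocation, host-to-device copies, launch, and copy back. Error checking is omitted and the helper name RunOnGPU is hypothetical.

/* Host-side driver for GPUCode: allocate, copy, launch, copy back. */
void RunOnGPU( int* h_in1, int* h_in2, int* h_out, int length ) {
    int *d_in1, *d_in2, *d_out;
    size_t bytes = length * sizeof(int);

    cudaMalloc( (void**)&d_in1, bytes );
    cudaMalloc( (void**)&d_in2, bytes );
    cudaMalloc( (void**)&d_out, bytes );

    cudaMemcpy( d_in1, h_in1, bytes, cudaMemcpyHostToDevice );
    cudaMemcpy( d_in2, h_in2, bytes, cudaMemcpyHostToDevice );

    int threadsPerBlock = 256;
    int blocks = ( length + threadsPerBlock - 1 ) / threadsPerBlock;
    GPUCode<<< blocks, threadsPerBlock >>>( d_in1, d_in2, d_out, length );

    cudaMemcpy( h_out, d_out, bytes, cudaMemcpyDeviceToHost );

    cudaFree( d_in1 ); cudaFree( d_in2 ); cudaFree( d_out );
}

The explicit host-device copies illustrate the "memory copy" open issue listed on the next slide.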

(32)

CUDA

OpenCL

Main characteristics:

• Ad-hoc compiler
• Fine grain
• Offload parallelization (GPU)
• Single-iteration parallelization
• Ad-hoc memory
• Few HPC applications

Open issues:

• Memory copy
• Standards
• Tools
• Integration with other languages

(33)

Hybrid (MPI + OpenMP + CUDA + … + Python)

Take the best of all models and exploit the memory hierarchy.

Many HPC applications are adopting this model, mainly due to developer inertia: it is hard to rewrite millions of lines of source code.

(34)

Hybrid parallel programming

• MPI: domain partitioning
• OpenMP: outer-loop partitioning
• CUDA: inner-loop iterations assigned to GPU threads
• Python: ensemble simulations

Example: Quantum ESPRESSO (http://www.qe-forge.org/)
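A minimal sketch of this layering (an addition, not from the slides), written in C: MPI ranks own sub-domains and OpenMP threads partition the local loop; the GPU-offload and Python ensemble layers are only indicated in comments, and names such as nloc and u are hypothetical.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nloc = 1000000;                           /* MPI: local sub-domain size */
    double *u = malloc(nloc * sizeof(double));
    double local = 0.0, global = 0.0;

    #pragma omp parallel for reduction(+:local)   /* OpenMP: local loop partition */
    for (int i = 0; i < nloc; ++i) {
        u[i] = 1.0 / (rank + 1);                  /* inner work: candidate for a CUDA/OpenCL kernel */
        local += u[i];
    }

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);
    /* A Python layer could launch many such runs as an ensemble. */

    free(u);
    MPI_Finalize();
    return 0;
}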

(35)

Storage I/O

• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On-the-fly analysis and statistics
• Disks only for archiving
• Scratch on non-volatile memory ("close to RAM")

(36)

PRACE

The PRACE Research Infrastructure (www.prace-ri.eu) is the top level of the European HPC ecosystem.

The vision of PRACE is to enable and support European global leadership in public and private research and development.

CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system.

Tier 0: European (PRACE)
Tier 1: National (CINECA today)
Tier 2: Local

(37)

FERMI @ CINECA

PRACE Tier-0 System

Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2, 1.6 GHz
Computing cores: 163,840
Computing nodes: 10,240
RAM: 1 GByte per core
Internal network: 5D torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s

(38)

Conclusion

• Exploit millions of ALUs
• Hybrid hardware
• Hybrid codes
• Memory hierarchy
• Flops/Watt (more than Flops/s)
• I/O subsystem
• Non-volatile memory
• Fault tolerance!
