
High-level lock-less programming for multi-core

Fabio Tordini, Marco Aldinucci, and Massimo Torquati
University of Torino, Italy
University of Pisa, Italy

THE PROBLEMS

• Peak performance is hard to achieve on cache-coherent multi-core architectures and requires substantial programming and tuning effort. Performance portability is even harder.

• Performance is (often) not about Flops; it is about data movement.

• Coarse-grain concurrency is nearly exhausted. Programming systems should be designed to support fast data movement and enforce locality. They should be efficient at fine grain.

• Non-blocking algorithms coupled with concurrent data structures can be fast, but they are complex to exploit. They can hardly be composed and should be abstracted away.

• High-level approaches should be used to provide effective abstractions. A computer language is not a computing model. A library is not a computing model. A litmus test: do system programmers use the techniques they advocate?

THE APPROACH

• Dijkstra’s structured programming approach (“go-to statement considered harmful”).
  ◦ Are send/receive, lock/unlock and CAS any less harmful than go-to?

• Manage communications and synchronizations by way of high-level constructs.
  ◦ Skeletons & patterns [Col89], exploited in several frameworks, inter alia Intel TBB, FastFlow [ADKT12], and Google’s MapReduce.
  ◦ Patterns are typically used to discipline true dependencies and the process-to-CPU mapping.

• Message passing vs. shared memory is not a dichotomy. They can be coupled in a richer programming model, e.g. messages for synchronizations and shared memory for data exchanges (a sketch follows below).
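As a minimal illustration of this last point (plain C++11, not FastFlow code; the single-slot mailbox and the payload size are illustrative choices only), the sketch below lets producer and consumer synchronize through a one-pointer "message", while the data itself stays in shared memory and is never copied:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Illustrative single-slot SPSC "mailbox": the message is just a pointer,
// the payload it points to is exchanged through shared memory (zero copy).
static std::atomic<std::vector<double>*> mailbox{nullptr};

int main() {
    std::thread producer([] {
        for (int i = 0; i < 5; ++i) {
            auto* v = new std::vector<double>(1024, i);        // payload lives in shared memory
            while (mailbox.load(std::memory_order_acquire) != nullptr)
                ;                                              // wait for an empty slot
            mailbox.store(v, std::memory_order_release);       // send the pointer (the message)
        }
    });
    std::thread consumer([] {
        for (int i = 0; i < 5; ++i) {
            std::vector<double>* v;
            while ((v = mailbox.load(std::memory_order_acquire)) == nullptr)
                ;                                              // wait for a message
            mailbox.store(nullptr, std::memory_order_release); // free the slot
            std::printf("got %zu doubles, first = %.1f\n", v->size(), (*v)[0]);
            delete v;
        }
    });
    producer.join();
    consumer.join();
    return 0;
}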

FASTFLOW: THE BIG PICTURE – http://mc-fastflow.sourceforge.net

class ff_node {
protected:
    virtual bool push(void* data) { return qout->push(data); }
    virtual bool pop(void** data) { return qin->pop(data); }
public:
    virtual void* svc(void* task) = 0;
    virtual int  svc_init() { return 0; }
    virtual void svc_end() {}
    ...
private:
    SPSC* qin;
    SPSC* qout;
};

The arbitrary streaming networks layer implements ff_node, i.e. the building block of the networks, which serves as a container for business code and for "mediators". Cyclic networks use unbounded (uSPSC) queues to avoid deadlocks.

[Figure: the core patterns (pipeline, farm, Divide&Conquer) and their composition.]

#include <iostream>
#include <string>
#include <ff/pipeline.hpp>
using namespace std;

class Stage1: public ff::ff_node {
private:
    int c;
public:
    Stage1(): c(0) {}
    // produce three "Hello" strings, then return NULL to end the stream
    void* svc(void* task) {
        string* s = NULL;
        if (c++ < 3) s = new string("Hello");
        return s;
    }
};

class Stage2: public ff::ff_node {
public:
    // append " world!" to each incoming string, print it and free it
    void* svc(void* task) {
        string* s = (string*) task;
        cout << s->append(" world!\n");
        delete s;
        return GO_ON;
    }
};

int main(int argc, char* argv[]) {
    ff::ff_pipeline pipe;
    pipe.add_stage(new Stage1());
    pipe.add_stage(new Stage2());
    pipe.run_and_wait_end();
    return 0;
}

The streaming network patterns layer provides (streaming) parallel programming patterns: farm, farm-with-feedback (i.e. Divide&Conquer), pipeline, and their arbitrary nesting and composition. Patterns discipline true dependencies, whereas data is moved via shared memory.
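To make nesting and composition concrete, here is a rough sketch written in the same style as the pipeline example above; the ff_farm<> class and its add_workers/add_collector calls are assumptions about that era's FastFlow API and may differ across releases. A generator stage feeds a four-worker farm that squares each stream item; the farm's collector prints the results, and the whole farm is plugged into the pipeline as a single stage:

#include <cstdio>
#include <vector>
#include <ff/pipeline.hpp>
#include <ff/farm.hpp>
using namespace ff;

// First pipeline stage: generates the stream 1..100, then ends it.
struct Generator: ff_node {
    long i;
    Generator(): i(0) {}
    void* svc(void*) {
        if (++i > 100) return NULL;   // NULL signals end-of-stream
        return new long(i);
    }
};

// Farm worker: squares each item in place.
struct Square: ff_node {
    void* svc(void* t) { long* n = (long*) t; *n = (*n) * (*n); return n; }
};

// Farm collector: prints and frees the results.
struct Printer: ff_node {
    void* svc(void* t) {
        long* n = (long*) t;
        std::printf("%ld\n", *n);
        delete n;
        return GO_ON;
    }
};

int main() {
    ff_farm<> farm;
    std::vector<ff_node*> workers;
    for (int i = 0; i < 4; ++i) workers.push_back(new Square());
    farm.add_workers(workers);
    farm.add_collector(new Printer());  // gathers the workers' results

    ff_pipeline pipe;
    pipe.add_stage(new Generator());
    pipe.add_stage(&farm);              // nesting: the farm is used as a pipeline stage
    pipe.run_and_wait_end();
    return 0;
}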

bool push(void* data) {
    if (buf[pwrite] == NULL) {              // empty slot available
        WMB();                              // write memory barrier
        buf[pwrite] = data;
        pwrite += (pwrite+1 >= size) ? (1-size) : 1;   // circular increment
        return true;
    }
    return false;
}

bool pop(void** data) {
    if (buf[pread] == NULL) return false;   // queue empty
    *data = buf[pread];
    buf[pread] = NULL;                      // mark the slot as free
    pread += (pread+1 >= size) ? (1-size) : 1;
    return true;
}

The simple streaming networks layer implements efficient SPSC bounded and unbounded wait-free queues. They require no CAS and no fences under TSO (a single WMB under weaker, write-ordered consistency). The bounded SPSC queue is about 2x faster than the Lamport queue, and the unbounded uSPSC queue is about 20x faster than the Michael-Scott linked list [ADKM12].
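These queues can also be used directly, outside the pattern layers. The following is a usage sketch only: it assumes the bounded SPSC queue is exported as SWSR_Ptr_Buffer in <ff/buffer.hpp> with init/push/pop methods matching the code above; the exact class name, header and signatures may differ across FastFlow releases. Note that NULL cannot be enqueued, since NULL marks an empty slot:

#include <cstdio>
#include <thread>
#include <ff/buffer.hpp>   // assumed header of the bounded wait-free SPSC queue

int main() {
    ff::SWSR_Ptr_Buffer q(1024);        // capacity in pointers (assumed constructor)
    q.init();                           // allocates the internal circular buffer

    void* const sentinel = (void*)-1;   // user-level end-of-stream marker (non-NULL)

    std::thread producer([&] {
        for (long i = 0; i < 100; ++i) {
            long* msg = new long(i);
            while (!q.push(msg)) ;      // busy-wait while the buffer is full
        }
        while (!q.push(sentinel)) ;
    });
    std::thread consumer([&] {
        void* data;
        while (true) {
            while (!q.pop(&data)) ;     // busy-wait while the buffer is empty
            if (data == sentinel) break;
            std::printf("%ld\n", *(long*)data);
            delete (long*)data;
        }
    });
    producer.join();
    consumer.join();
    return 0;
}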

[Figure: the bounded FF SPSC queue is a circular buffer with head/tail indices; the unbounded FF uSPSC queue is a linked list of circular buffers. Plot: latency (ns) vs. buffer size (64-8k) on Nehalem E5@2.2GHz, Nehalem E7@2.0GHz, Sandy Bridge E5@2.0GHz, Opteron Magny-Cours@2.3GHz and Cortex A9@1.0GHz.]

[Figure: the FastFlow layered design.]
• Applications on multicore and manycore: efficient and portable, designed with high-level patterns.
• Streaming network patterns: skeletons such as pipeline, map, farm, reduce, D&C, ...
• Arbitrary streaming networks: lock-free SPSC/MPMC queues + FF nodes.
• Simple streaming networks: lock-free SPSC queues + threading model.
• Multicore and manycore hardware: cc-UMA & cc-NUMA, TSO and WO consistency.

Applications are built using FastFlow patterns or by extending them in an OO style. Synchronizations are hidden within the library.

For comparison, the MPI core-to-core latency on the E7 @ 2 GHz is about 190 ns.

[Figure: speedup vs. number of cores (1-8) on a synthetic benchmark for task granularities Tc = 0.5 µs (fine grain), 5 µs (medium grain) and 50 µs (coarse grain); FastFlow compared against the ideal speedup, TBB, OpenMP and Cilk.]

EXAMPLE: A STREAM-ORIENTED MEM ALLOCATOR BUILT WITH FASTFLOW

[Figure: FFalloc: the producer P and the consumers C1..Cn, each with an FFalloc component, connected by uSPSC queues. P allocates (malloc) and dispatches the objects round-robin; each Ci works on the object and then frees it.]

Producer P:   for (i = 0; i < 10M; i++) { pi = malloc(rnd(size)); *pi = ...; dispatch_RR(pi); }
Consumer Ci:  while (pi = get()) { do_work(1µs, pi); free(pi); }

[Figure: completion time (s) vs. number of deallocating threads (1-32), for 10M alloc/dealloc of 32B objects with 1 µs tasks on a 32-core Intel E7 @ 2 GHz; FastFlow compared against Hoard-3.9, libc-6, TBB-4.0 and the ideal time.]
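For reference, the benchmark above can itself be written as a FastFlow farm in which the emitter plays the producer P and the workers play the consumers Ci. The sketch below uses plain malloc/free where the poster uses FFalloc (whose interface is not shown here), and the same caveats about the assumed ff_farm API apply; sizes and counts are scaled down for illustration:

#include <cstdlib>
#include <vector>
#include <ff/farm.hpp>
using namespace ff;

// Emitter = producer P: allocates the objects and dispatches them to the workers.
struct Producer: ff_node {
    void* svc(void*) {
        for (long i = 0; i < 100000; ++i) {     // the poster uses 10M objects
            char* pi = (char*) std::malloc(32); // 32B objects, as in the plot
            pi[0] = 'x';                        // touch the object
            ff_send_out(pi);                    // hand it over to a worker
        }
        return NULL;                            // end of stream
    }
};

// Worker = consumer Ci: works on the object, then frees it remotely.
struct Consumer: ff_node {
    void* svc(void* task) {
        // do_work(1us, task) would go here
        std::free(task);                        // remote free: the costly case
        return GO_ON;                           // nothing to send downstream
    }
};

int main() {
    ff_farm<> farm;
    farm.add_emitter(new Producer());
    std::vector<ff_node*> workers;
    for (int i = 0; i < 4; ++i) workers.push_back(new Consumer());
    farm.add_workers(workers);                  // no collector: results are not gathered
    farm.run_and_wait_end();
    return 0;
}

The poster's point is that replacing malloc/free with the pattern-specialized FFalloc in this producer-consumer structure is what yields the speed edge shown in the plot.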

FUTURE DIRECTIONS

A significant speed edge over state-of-the-art parallel allocators can be achieved by specializing a (relatively simple, built on top of the SLAB allocator) memory allocator with high-level patterns. The allocation technique gets its advantage from the low overhead of the run-time (based on lock-free uSPSC queues) and from the knowledge of high-level semantics (producer-consumer). We believe the approach can be improved in both directions, i.e.:

Memory Affinity concerns the mapping and allocation of data structures in memory. Data structures can be coupled with parallel patterns (also thanks to specialized allocation strategies). Experimentation is feasible thanks to the lock-free allocator already implemented in FastFlow.

Lock-free run-time support can be extended with more (location-aware) Multiple-Producer/Multiple-Consumer data structures and with transactional primitives.

REFERENCES

[ADKM12] M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. An efficient unbounded lock-free queue for multi-core systems. In Proc. of Euro-Par, LNCS, Rhodes Island, Greece, Aug. 2012. Springer.

[ADKT12] M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati. FastFlow: high-level and efficient streaming on multi-core. In Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, chap. 13. Wiley, 2012.

[Col89] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. Research Monographs in Parallel and Distributed Computing. Pitman/MIT Press, 1989.
