High-level lock-less programming for multicore

(1)

High-level lock-less programming for multi-core

Fabio Tordini

◦

, Marco Aldinucci

◦

, and Massimo Torquati

†

◦

University of Torino, Italy

†

University of Pisa, Italy

T

HE

P

ROBLEMS

Peak performance is hard to achieve on cache-coherent multi-core architectures and requires

sub-stantial programming and tuning efforts. Performance portability is even harder.

Performance is (often) not about Flops, it is about data movement.

Coarse Grain Concurrency is nearly exhausted. Programming systems should be designed to

sup-port fast data movement and enforce locality. They should be efficient at fine grain.

Non-blocking algorithms coupled with concurrent data structures can be fast but are complex to

be exploited. They can be hardly composed and should be abstracted out.

High-level approaches should be used to provide effective abstractions. A computer language

is not a computing model. A library is not a computing model. A litmus paper: system programmers use the techniques they advocate?

T

HE

A

PPROACH

Dijkstra’s structured programming approach (“go-to statement considered harmful”) ◦ Are send/receive, lock/unlock and CAS harmless than go-to?

Manage communications and synchronizations by way of high-level constructs

◦ Skeletons & Patterns [Col89]. Exploited in several frameworks, inter alia: Intel TBB, Fast-Flow [ADKT12], Google’s MapReduce.

◦ Patterns typically used to discipline true dependencies and process-to-CPUs mapping.

 Message-passing/shared-memory is not a dichotomy. They can be coupled in a richer

program-ming model, e.g. messages for synchronizations and shared-memory for data exchanges.

F

AST

F

LOW

: T

HE

B

IG

P

ICTURE

– http://mc-fastflow.sourceforge.net

class ﬀ node _{ protected:

virtual bool push(void_{∗ data) {} return qout_{−>push(data);}

}

virtual bool pop(void_{∗∗ data) {} return qin_{−>pop(data);}

}

public:

virtual void_{∗ svc(void ∗ task) = 0;} virtual int svc init () _{

return 0; };

virtual void svc end() _{} ...

private:

SPSC_{∗ qin;} SPSC_{∗ qout;} };

Arbitrary streaming networks layer

implements the ff_node, i.e. the building block of networks and serves as a container for business code and "mediators". Cyclic networks use uSPSC to avoid deadlocks.

...

pipeline farm _{Divide&Conquer}

+

... their composition #include <iostream> #include <string> #include <ﬀ/pipeline.hpp> using namespace std;

class Stage1: public ﬀ :: ﬀ node _{

private: int c ; public:

Stage1(): c(0) _{};

void _{∗ svc(void ∗ task) {}

string _{∗s = NULL;}

if (c++<3) s = new string(”Hello”); return s;

} };

class Stage2: public ﬀ :: ﬀ node _{

public:

void _{∗ svc(void ∗ task) {}

string _{∗s = (string ∗) task;}

cout << s_{−>append(” world!\n”);} delete (s) ;

return GO ON; }

};

int main(int argc, char _{∗ argv[]) {}

ﬀ :: ﬀ pipeline pipe;

pipe.add stage(new Stage1()); pipe.add stage(new Stage2()); pipe.run and wait end();

}

Streaming network patterns layer provides (streaming) parallel

programming patterns: farm, farm-with-feedback (i.e. Divide&Conquer), pipeline, and their arbitrary nesting and composition. Patterns discipline true dependencies whereas data is moved via shared memory.

bool push(void_{∗ data) {}

if (buf[pwrite]==NULL) _{ WMB(); // write_{−memory−barrier} buf[pwrite] = data; pwrite+=(pwrite+1>=size)?(1_−size):1; return true; } return false; }

bool pop(void_{∗∗ data) {}

if (buf[pread]==NULL) return false; ∗data = buf[pread]; buf[pread]=NULL; pread+=(pread+1>=size)?(1_−size):1; return true; }

Simple streaming networks layer

implements efficient SPSC bound and unbound wait-free queues. They requires no CAS and no fences (TSO, a WMB for WO consistency). SPSC: 2x faster than Lamport queue, uSPSC 20x faster than Michael-Scott linked-list [ADKM12].

0 20 40 60 80 100 64 1k 8k Latency (ns) Nehalem E5@2.2GHz Nehalem E7@2.0GHz Sandy Bridge E5@2.0GHz Opteron Magny-Cours@2.3GHz Cortex A9@1.0GHz

Buffer size

Unbound FF uSPSC

linked list of circular buffers

head tail Bound FF SPSC circular buffer head tail

+

Streaming network patterns

Skeletons: pipeline, map farm, reduce, D&C, ...

Arbitrary streaming networks

Lock-free SPSC/MPMC queues + FF nodes

Simple streaming networks

Lock-free SPSC queues + threading model

FastFlow

Multicore and manycore

cc-UMA & cc-NUMA - TSO and WO consistency

Applications on multicore and manycore

Efficient and portable - designed with high-level patterns

Applications are build using Fastflow

patterns or extending them in a OO style. Synchronizations are hidden within the library.

MPI core2core on E7@2Ghz: ~190 ns

0 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Speedup N. of cores Tc = 5 µs (medium grain) Ideal FastFlow TBB OpenMP Cilk 0 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Speedup N. of cores Tc = 0.5 µs (fine grain) Ideal FastFlow TBB OpenMP Cilk 0 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Speedup N. of cores Tc = 50 µs (coarse grain) Ideal FastFlow TBB OpenMP Cilk

E

XAMPLE

:

A

S

TREAM

-

ORIENTED

M

EM

A

LLOCATOR

B

UILT WITH

F

AST

F

LOW

FFalloc FFalloc FFalloc FFalloc P Cn C2 C1 Producer P: for(i=0;i<10M;i++){ pi = malloc(rnd(size)); *pi=...; dispatch_RR pi; }

...

Consumer Ci: while (pi=get()) do_work(1μs,pi); free(pi); } uSPSC queue P Cn C2 C1

...

FFalloc FFalloc FFalloc FFalloc

...

uSPSC queue

+

=

malloc free free free 0 2 4 6 8 10 12 14 1 4 8 12 16 20 24 28 32 Time (s) N. of (dealloc) threads

10M alloc/dealloc (32B) - 1µs tasks - 32-core Intel E7 2Ghz Hoard-3.9 libc-6 TBB-4.0 FastFlow Ideal

F

UTURE

D

IRECTIONS

A significant speed edge over state-of-the-art parallel allocators can be achieved by specializing a (rel-atively simple, built on top of the SLAB allocator) memory allocator with high-level patterns. The al-location technique get advantages from the low-overhead of the run-time (based on lock-free uSPSC) and the knowledge of high-level semantics (producer-consumer). We believe the approach can be improved on both directions, i.e.

Memory Affinity concerns mapping and allocation of data structures in memory. Data structures can

be coupled with parallel patterns (also thanks to specialized allocation strategies). Experimen-tation is feasible thanks to the lock-free allocator already implemented in FastFlow.

Lock-free run-time support can be extended with more (location-aware)

Multiple-Producer/Multiple-Consumer data structures and transactional primitives.

R

EFERENCES

ADKM12 M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. An efficient

un-bounded lock-free queue for multi-core systems. In Proc. of Euro-Par, LNCS, Rhodes Island, Greece, Aug. 2012. Springer.

ADKT12 M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati. FastFlow: high-level and efficient

streaming on multi-core. In Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, chap. 13. Wiley, 2012.

Col89 M. Cole: Algorithmic Skeletons: Structured Management of Parallel Computations. Research