State access patterns in tree structured parallel computations


DEPARTMENT OF COMPUTER SCIENCE

Master Degree in Computer Science

Master Thesis

State access patterns in tree structured parallel computations

Author

Supervisor


Abstract

Pattern-based frameworks for parallel programming provide a set of parallel patterns that solve recurrent problems. Typically these patterns are studied and implemented in their basic stateless form. In our work we classify and describe a set of access patterns for the data structures that constitute the shared state in tree structured parallel computations. These state management patterns can be composed with the computational ones to help the application programmer abstract from the state declaration and access scheduling issues related to the concurrent execution of parallel activities. We analyse the main properties of the computational structure and we present a general characterization of four state access patterns, according to the extent and way in which the state resource is structured and accessed. We study the parallelization opportunities offered by the introduction of each of the proposed state patterns into the computational pattern at hand, discussing possible implementation and synchronization schemas. We propose and develop a generic C++ parallel interface with stateful variants. Keeping the focus on the extracted properties of the computational structure and of the state management, we identify a comprehensive set of synthetic applications and we run them to study the effects of their potential and meaningful combinations. The experimental implementation and tests, performed on three state-of-the-art shared memory multicore architectures, assess the applicability and efficiency of the proposed approach in the context of tree structured parallel computations.


Contents

1 Introduction
1.1 Thesis outline
2 Background
2.1 Parallelism exploitation
2.2 Algorithmic skeletons and parallel design patterns
2.3 Programming language: C++
2.4 Parallel architectures
3 Problem characterization
3.1 Tree structured computations
3.1.1 Divide and Conquer pattern
3.1.2 Branch and Bound pattern
3.2 State access patterns definition
3.2.1 Read-only
3.2.2 Resource
3.2.3 Accumulator
3.2.4 Fully partitioned
4 Properties description
4.1 Tree structure properties
4.1.1 Computational tree shape
4.1.2 Computational regularity of phases
4.1.3 Partitioning regularity
4.2 State access properties
4.2.1 Occurrences of accesses
4.2.2 Functional grain
4.2.3 Relative order of operations
4.2.5 Role and impact
4.3 Summary
5 Implementation analysis
5.1 Overview
5.2 Tree structured computations parallelization
5.2.1 Skeleton implementation
5.2.2 Parallel executor instantiation
5.3 State access patterns integration
6 Applications
6.1 Synthetic applications
6.1.1 Divide and Conquer application
6.1.2 Branch and Bound application
6.2 Use cases
7 Experiments
7.1 Target machines
7.2 Tests
7.2.1 Performances
7.2.2 Classification of use cases
7.2.3 Test results summary
8 Conclusions
8.1 Future work
List of Figures
List of Tables
List of Acronyms
Bibliography

Chapter 1

Introduction

With the objective of providing the programmer of parallel applications with an effective programming environment, two main approaches have been proposed and developed in completely disjoint research frameworks: algorithmic skeletons by the High Performance Computing community and parallel design patterns by software engineering researchers. The two proposals have many similarities, addressed at different levels of abstraction. Algorithmic skeletons are aimed at directly providing the application programmer with pre-defined, efficient building blocks for parallel computations, whereas parallel design patterns propose directions, suggestions, examples and concrete ways to program those building blocks in different contexts.

The first branch of the research in efficient parallel programming implementations is represented by the study of algorithmic skeletons [12, 13, 17], introduced by Cole in his PhD thesis [11]. Algorithmic skeletons were defined as pre-determined models encapsulating the structure of a parallel computation, provided to the user as building blocks to be used to write applications. Later on, Cole published his "skeleton manifesto" [12] where skeletons are described in a fairly more abstract way. The definition of algorithmic skeletons has evolved over time; we can summarise that an algorithmic skeleton is a parametric, reusable and portable programming abstraction modelling a known, common and efficient parallelism exploitation pattern. The key idea behind skeletons is to bridge the gap between software quality and performance by exploiting common organisation models for parallel code.

Parallel design patterns have been introduced in the early 2000s [32], for the purpose of modelling solutions to recurring parallel problems. They have been proposed with the same aims as (sequential) design patterns [24], namely to describe solutions to recurrent problems in the context of interest, where the context of interest is parallel software design rather than generic object-oriented software design [43, 31, 30]. These patterns represent a software engineering perspective on parallel computing problems.

Parallel design patterns have been used to implement structured parallel programming frameworks [25, 30, 31, 43] and a notable research activity has been generated that eventually led to the recognition of the fact that parallel design patterns may be a key point in the resolution of problems related to the development of efficient parallel software, as clearly stated in the Berkeley report [4].

Indeed, structured parallel programming models support the design and implementation of parallel applications. They provide the parallel application programmer with a set of predefined, ready to use parallel pattern abstractions that may be directly instantiated, alone or in composition, to model the complete parallel behaviour of the application at hand. This raises the level of abstraction by ensuring that the application programmer need not be concerned with architectural and parallelism exploitation issues during application development. These issues are rather handled efficiently by the framework programmer, using state-of-the-art techniques.

In the context of parallel design patterns, some classical structured problems have been studied. Several recent parallel programming frameworks adopted design patterns concepts, such as Intel Threading Building Blocks (TBB) [37, 28] and Microsoft Task Parallel Library (TPL) [29]. Usually patterns are studied and offered in their stateless form, i.e. as patterns where the parallel activities do not share an internal state nor support access to some more generalised notion of “pattern” global state.

Thus, application programmers themselves must manage state declarations and access scheduling within the pattern business logic code, which is notably error prone and may impair the benefits of the pattern philosophy. The concurrent access to shared data structures also represents one of the classical sources of bottlenecks in existing parallel applications. In fact, it makes up the serial fraction that limits, by Amdahl's Law [2], the speedup achievable in the parallel application. With the growing and ubiquitous presence of shared memory multi/many core architectures, the problem of being able to manage some kind of shared global state is becoming again more and more interesting.

The research in this context has many possibilities of development still open. The global aim of this research branch is to help the design of stateful parallel applications by introducing a set of classified state access patterns that can be easily combined in an orthogonal way with the computational parallel patterns, already studied, commonly used and encoded in several structured parallel programming frameworks, including Muesli [10], SKEPU [22], GrPPI [38] and FastFlow [1].

A recent work in this field has been conducted in the context of stateful embarrassingly parallel computations, by Danelutto et al. in [18]: they introduced a classification schema for stateful stream and data parallel computations and identified the conditions under which meaningful scalability may be obtained on state-of-the-art shared memory architectures for each of the detected classes. That analysis is focused on and limited to the study of stateful variations of embarrassingly parallel applications, the ones that can be modelled with a global emitter of tasks and a set of identical worker processing units, such as the task farm and map patterns.

In this thesis, we extend the research to the case of stateful tree structured parallel computations, exploring some well-known design patterns that exploit this shape. We first study the characterization of such computational design patterns in the parallel context, then we present a classification of the state patterns to be combined with them, according to the extent and way in which the state can be structured, managed and accessed. In particular, we identify a range of interesting and well-separated models and we list some characteristics that may guide the programmer's choice of the state pattern that best fits the application at hand. After the description of the patterns, we discuss possible implementation schemes and opportunities of parallelization and adaptivity.

We conduct preliminary experimental tests on three different state-of-the-art multicore shared memory architectures, with C++ implementations of the state patterns applied to many variants of synthetic applications, trying to make an exhaustive analysis of the meaningful combinations of computational structure and state management properties. We report the most interesting results, which assess the feasibility and efficiency of the proposed approach.

Following the terminology introduced to describe the structural properties of the problem, we identify and describe some concrete applications that may map to the experimental ones.

An interesting result of this work is the effectiveness of applying the same classes of state access patterns identified for embarrassingly parallel applications also to the case of tree structured parallel computations. This is a notable result in the context of an approach to the design of parallel skeleton programming environments that tends to identify and define a small set of simple, general purpose and composable models.

1.1 Thesis outline

In chapter 2 we introduce some background notions useful to understand the analysis in the context of parallel programming, we present the development framework and we describe the state-of-the-art development environments. A characterization of the problem under analysis and the main classification of the four identified state access patterns are presented in chapter 3. In chapter 4 we analyse the structural and behavioural properties that may impact the state access pattern choice and the performance of specific applications. Then, in chapter 5, we study the parallelization opportunities of the structured problem at hand and the synchronization mechanisms needed to implement correctly and efficiently the semantics of the state management patterns. In chapter 6 we present the synthetic applications implemented for the experimental sessions and we describe some concrete use cases that map to them. In chapter 7, after a brief description of the three target machine architectures, we report the results of the tests performed, analysing and comparing the performance of a range of problem setting variants. Finally, in chapter 8 we recap the main results and contributions of this work in the research field of stateful application design and we discuss some proposals for future development.

Chapter 2

Background

2.1 Parallelism exploitation

A parallel program is composed of simultaneously executing processes/threads. Several forms of parallelization across multiple processors can be exploited by applications running on parallel computing environments. An application can be decomposed according to different orthogonal formulations. In [8] the authors outline a structured parallel programming model that clearly separates data parallel and control parallel mechanisms in a way they can be orthogonally composed to build more complex parallelism exploitation patterns.

Control parallelism, also known as task parallelism, refers to the concurrent execution of tasks originated by the computational control structure of an application, thus derived from its control dependencies graph. Control parallelism focuses on distributing the execution across the available processors, performing portions of the whole computational work independently and concurrently on different processes or threads. A further useful distinction is the degree of regularity in the dependencies between tasks: in regular control parallelism tasks are similar and have predictable dependencies, while in irregular control parallelism tasks are dissimilar in a way that creates unpredictable dependencies. An example of the first type is the decomposition of a dense matrix multiplication into a set of dot products, while sparse matrix multiplication may be less regular. Branch and bound optimization over a decision tree for a chess program is an even more irregular case. In Flynn's taxonomy [23] (a classic categorization of parallel processors based on whether they have multiple flows of control or multiple streams of data), control parallelism is usually classified as MIMD/MPMD (multiple program, multiple data) or MISD (multiple instruction, single data). Typical control parallel patterns include speculative conditionals as well as loops.

Data parallelism, instead, focuses on distributing the application data across the available resources, which operate on the assigned portion in parallel. This formulation is used to speed up a single computation by splitting it into parallel sub-tasks. This form of parallelization is especially suitable for regular data structures, like arrays and matrices, exploiting the work on each element in parallel: a set of tasks will operate independently on disjoint partitions of the original data. In a multiprocessor system executing a single set of instructions, data parallelism is achieved when each processor performs the same task on different distributed data. In Flynn's taxonomy, data parallelism is usually classified as MIMD/SPMD (multiple instruction, multiple data / single program, multiple data) or SIMD (single instruction, multiple data). Typical data parallel patterns include map, reduce and parallel prefix.
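As a simple illustration of the map pattern (a minimal sketch of ours, not taken from any of the frameworks cited here), the same function can be applied elementwise to disjoint partitions of a vector, one partition per thread:

```cpp
// Data parallel map: each worker owns a disjoint partition of the array
// and applies the same function f to every element of its partition.
#include <algorithm>
#include <thread>
#include <vector>

int main() {
  std::vector<double> data(1 << 20, 1.0);
  const unsigned nw = std::max(1u, std::thread::hardware_concurrency());
  const std::size_t chunk = data.size() / nw;
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < nw; ++w) {
    const std::size_t lo = w * chunk;
    const std::size_t hi = (w == nw - 1) ? data.size() : lo + chunk;
    workers.emplace_back([&data, lo, hi] {   // independent, no sharing
      for (std::size_t i = lo; i < hi; ++i)
        data[i] = data[i] * 2.0;             // f applied elementwise
    });
  }
  for (auto& t : workers) t.join();
  return 0;
}
```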

While control parallelism is about running at the same time many (different) tasks originated by the control structure of the problem, data parallelism involves running the same task concurrently on different portions of the same original data. We can summarise that control parallelism is an approach more oriented around functional decomposition, while data parallelism focuses on the partitioning of the original data. P3L, the Pisa Parallel Programming Language [5], is a skeleton based language providing a mix of task and data parallel skeletons [17, 36].

Another form of parallelization is stream parallelism. Stream parallelism exploits parallelism among computations relative to different, independent data items appearing on the application input channel at different points in time. Each independent execution ends with the delivery of the computation result on the program output stream. Common stream parallel patterns are pipeline and task farm. In recent years, data stream processing (DaSP) research has grown [3, 19], driven by the increasing interest in applications dealing with the efficient real-time processing of large amounts of streaming data, generated from sources typically external to the application and presented at the application input. Indeed, several important on-line and real-time applications can be modelled as DaSP, including network traffic analysis, financial trading, data mining, and many others.

It is important to outline that data parallel computations do not necessarily concern stream parallelism. It may be the case, however, that a data parallel computation is used to implement the computation relative to a single stream item in a stream parallel program. In this case, data parallelism is used to speed up the computation of all the items belonging to the input stream.

2.2 Algorithmic skeletons and parallel design patterns

The knowledge needed to implement efficiently the parallel behaviour of an application is completely different from the knowledge needed to program the business logic of the application, that is the code actually computing the result(s) of the problem, as distinct from the non-functional code modelling, for example, the implementation of parallelism in the application.

The first studies on the issues related to the implementation of efficient parallel applications on target parallel architectures observed that the patterns used to exploit parallelism in these applications were not a large number. Moreover, researchers pointed out that the correct and efficient programming of these parallel patterns – which includes their design, coding, debugging and fine tuning – constitutes the major programming effort when implementing efficient parallel applications. Applications clearly separating the code used to implement parallel patterns from the functional code were much easier to develop, debug and tune, with respect to the ones where parallelism implementation code and functional code were fully interleaved.

Starting from these observations, researchers tried to identify, design and develop "parallel patterns", that is, parallelism exploitation patterns studied without taking into account the final application functional code, which is assumed to be provided as a pattern parameter.

This led, in different times and within different research communities, to the development of two distinct research branches:

• algorithmic skeletons, introduced in the late '80s [11] by the High Performance Computing community, and

• parallel design patterns, evolved by software engineering researchers in the early 2000s [32].

Algorithmic skeletons were introduced as Cole [11] recognized that most of the parallel applications make use of well-known parallelism exploitation patterns. This formulation had the aim of providing the patterns as building blocks that the programmers could instantiate by supplying the business logic of their application. In his algorithmic skeleton "manifesto" [12] Cole lists four principles which "should guide the design and development of skeletal programming systems". Following Cole's last and more general perspective [13], algorithmic skeletons can be defined as higher order functions encapsulating parallelism exploitation patterns, whose functional parameters provide the business logic of the application. In the original definition of Cole, there was no mention of the fact that skeletons could be nested. Therefore, any skeleton application was the result of the instantiation of a single algorithmic skeleton and thus, in turn, of a single parallelism exploitation pattern. Most of the algorithmic skeleton frameworks developed so far include the possibility to nest skeletons. However, different frameworks offer different composition/nesting opportunities.

Parallel design patterns [32] have been introduced with the intent to model solutions to recurring parallel problems. They represent a software engineering perspective on parallel computing problems. As in all the other studies in the field of design pattern definition [24], parallel design patterns capture solutions that have developed and evolved over time. The way they are presented – grouped in layers, with each layer including patterns of interest when looking at the parallel programming problems from a particular abstraction level – represents a kind of methodology suitable to support the parallel application developer in the design of efficient applications.

Therefore, a design pattern is not a programming construct, as it happens in the case of algorithmic skeletons. Rather, design patterns can be viewed as "recipes" that can be followed to achieve efficient solutions to common programming problems.

Sample parallel design patterns solve typical parallelism exploitation problems, such as divide and conquer, pipeline and master-worker, modelling respectively the problem of efficient implementation of recursive divide and conquer computations, of staged computations and of computations split into several independent tasks.

Overall, the programmer using design patterns to structure the parallel skeleton of his application will produce more robust and maintainable code than the programmer using a more traditional – non-design pattern based – approach.

Parallel design patterns have been used to implement structured parallel programming frameworks very similar to the algorithmic skeleton ones [25, 30, 31, 43], and a notable research activity has been generated that eventually led to the recognition of the fact that parallel design patterns may be a key point in the solution of the problems related to the development of efficient parallel software. The Berkeley report [4], published in 2006, clearly states the fundamental role of the research community for the good exploitation of the parallel computing opportunities, and lists some principles to be followed to efficiently address the problem of designing and developing software for parallel machines.

Algorithmic skeletons and design patterns tackle this problem from different perspectives: high performance computing and software engineering, respectively. To make a comparison, we can note that algorithmic skeletons are aimed at directly providing pre-defined, efficient building blocks for parallel applications to the application programmer, while design patterns are aimed at providing specific instructions to program those building blocks, at different levels of abstraction. A large number of algorithmic skeleton frameworks are available, including Muesli [10], SKEPU [22], GrPPI [38] and FastFlow [1], while design pattern related methodologies are used in different – apparently non-structured – parallel programming frameworks. Several recent parallel programming frameworks have adopted design pattern concepts, such as Intel Threading Building Blocks (TBB) [37, 28] and Microsoft Task Parallel Library (TPL) [29].

All these frameworks have always considered, designed and provided the final user with stateless skeletons, that is skeletons whose functional behaviour may be completely modelled through higher order functions. This is mainly due to the class of parallel architectures available when the first skeleton frameworks were conceived and implemented. At that time, "parallel" architectures was a synonym of "distributed memory" architectures. Clusters, as well as ad hoc multiprocessors, did not provide any kind of shared memory mechanism and therefore separated processing elements had to be programmed to access disjoint, rather than shared, data structures. This situation has changed, as nowadays a number of multi/many core architectures provide shared memory mechanisms [34, 35], symmetric (UMA, uniform memory access) or distributed (NUMA, non-uniform memory access), taking also advantage of the memory hierarchy organization [6].

The concept of a shared state may be useful to model parallel computations. In some cases, it may be exploited to maintain in a state variable some counters relative to the current parallel computation; in other cases it is worth allowing concurrent activities to access some centralized and shared data structure. Managing and scheduling complex access patterns to a shared data structure is a complicated and error prone activity, requiring proper planning of shared data access scheduling and synchronization. It also represents one of the classical sources of bottlenecks in existing parallel applications: when shared data structures must be accessed concurrently, the concurrent activities rarely achieve ideal speedup. The accesses to the shared data structure, in fact, make up the serial fraction that, as Amdahl [2] argued, limits the speedup achievable by the parallel application. Therefore, the study of parallel shared data access has long been discouraged and ignored in structured parallel programming models based on the algorithmic skeleton concept. With the growing and ubiquitous presence of shared memory multi/many core architectures, however, the problem of being able to manage some kind of shared global state is becoming again more and more interesting. In our work, we start from the structure exposed through the skeleton composition used to model parallel applications to improve the efficiency in dealing with shared state within parallel computations.

2.3 Programming language: C++

For the implementation task we choose the C++ high-level programming language. We exploit its abstraction and encapsulation features, as offered by the object-oriented programming model. Moreover, C++ includes built-in support for threads, mutual exclusion and condition variables (thread support library: https://en.cppreference.com/w/cpp/thread), useful in the context at hand.

In particular, for the implementation of the concurrent processing entities, we make use of the pthread library, the native C POSIX threading model, wrapped in our C++ code (The Single UNIX Specification, version 2: pthread.h, https://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread.h.html; POSIX Threads, http://pubs.opengroup.org/onlinepubs/007908799/xsh/threads.html). The POSIX thread libraries are a standards-based thread API for C/C++. Threads define system interfaces and functionalities to support the source portability of applications. The thread library allows the program to spawn multiple flows of control within the main process. It is most effective on multi-processor or multi-core systems, where each flow can be scheduled to run on a different processor, thus gaining speed through parallel or distributed processing. Threads require less overhead than forking or spawning a new process, because the system does not initialize a new virtual memory space and environment for the process. A thread is spawned by defining and providing a function which will be processed in its control flow.

We also use the built-in support for synchronization provided by the library:

• mutexes: mutual exclusion locks which enforce exclusive access by a thread to a variable or set of variables;

• condition variables: objects used to suspend the thread execution while waiting for a condition to be fulfilled;

• joins: make a thread wait until the termination of the joined thread.

Mutexes are used to prevent data inconsistencies due to race conditions. A race condition occurs when two or more threads perform operations on the same memory area, but the result of the computation depends on the order in which these operations are performed. Mutexes are used for serializing access to such shared resources, to protect a segment of memory, a critical region, from other threads. Anytime a global resource is accessed by more than one thread, the resource should be protected by a mutex associated with it.

A condition variable is an object able to block the calling thread until notified to resume. The condition variable mechanism allows threads to suspend execution and to release the processor until some condition is satisfied. A condition variable must always be associated with a mutex to avoid deadlocks.

A join is performed when a concurrent entity wants to wait for another thread to finish. A calling thread may launch multiple threads and then wait for them to complete and collect their results.
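The following minimal sketch (our illustration, using the C++ standard thread support library that wraps the native threading model) puts the three mechanisms together: a spawned thread updates a shared counter inside a critical region, a condition variable signals the result, and a join waits for termination:

```cpp
// Spawn, mutex-protected critical region, condition variable, join.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int main() {
  std::mutex m;                    // protects the shared state below
  std::condition_variable cv;      // signals that the result is ready
  bool ready = false;
  int shared_counter = 0;

  std::thread worker([&] {         // spawn: the lambda is the thread's flow
    std::lock_guard<std::mutex> lk(m);   // critical region
    ++shared_counter;
    ready = true;
    cv.notify_one();
  });

  {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return ready; });  // suspend until the condition holds
    std::cout << "counter = " << shared_counter << '\n';
  }
  worker.join();                   // wait for the worker's termination
  return 0;
}
```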

In some of the developed tests, we also exploit the C++ atomic library to model the shared state: it provides components for fine-grained atomic operations allowing for lockless concurrent programming. These atomic operations can optionally specify the minimum memory visibility constraints required for an operation. Each atomic operation is indivisible with regard to any other atomic operation that involves the same object; thus atomic objects are free of data races.
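As a minimal sketch of this lockless style (ours, not the actual test code), several threads can update a shared counter through indivisible fetch_add operations, with no mutex involved:

```cpp
// Lockless shared state: each fetch_add is indivisible, so the final
// value is deterministic regardless of the thread interleaving.
#include <atomic>
#include <thread>
#include <vector>

int main() {
  std::atomic<long> state{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; ++t)
    threads.emplace_back([&] {
      for (int i = 0; i < 1000; ++i)
        state.fetch_add(1, std::memory_order_relaxed);  // atomic update
    });
  for (auto& th : threads) th.join();
  // state.load() == 4000, with no data race on state
  return 0;
}
```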


In addition to the presence of these built-in features, useful to manage concurrent entities and to model data sharing mechanisms, our choice of the C++ programming language was also driven by the objective of performance measurement and analysis. C++ is known to deliver better performance for this kind of multithreaded application, with respect, for example, to the Java programming language. Most template libraries, like TBB, extensions, like Intel Cilk Plus [39], and standards, like the OpenMP API [14], support or are programmed in C++. Thus, the choice to use C++ for our prototype implementation of stateful skeletons opens the possibility of a meaningful comparison with similar versions implemented with (the introduction of) the cited frameworks, as future work related to this thesis.

2.4 Parallel architectures

The class of parallel architectures available when the first skeleton frameworks were conceived and implemented influenced their development, in particular for the design of the concept of shared data structures in parallel applications. At that time, "parallel" architectures meant "distributed memory" architectures: they did not provide any kind of shared memory mechanism and therefore separated processing elements had to be programmed to access disjoint, rather than shared, data structures. Nowadays, several multi/many core architectures provide shared memory mechanisms, with advantageous memory hierarchy organizations. A general fact of multicore exploitation is that the improvement in performance gained using multiple processors depends a lot on the software algorithms used and on their implementation. With the widespread diffusion of shared memory multi/many core architectures, however, the problem of being able to manage efficiently some kind of shared global state has become more interesting.

As target systems of our tests and experiments we use three different state-of-the-art multicore shared memory architectures:

• architecture 1, with 24 2-way hyperthreading cores, processor model Intel® Xeon® CPU E5-2670 v3 (Haswell) @ 2.30 GHz;

• architecture 2, with 16 2-way hyperthreading cores, processor model Intel® Xeon® CPU E5-2650 0 (Sandy Bridge EP) @ 2.00 GHz;

• architecture 3, with 64 2-way hyperthreading cores, processor model AMD® EPYC™.


Figure 2.1: The topology, extracted with the hwloc tool, of one of the target NUMA architectures used for the experiments. Processor model: Intel® Xeon® CPU E5-2670 v3 (Haswell) @ 2.30 GHz.

We will describe the processors' characteristics in more detail in the Experiments section (chapter 7).

We report in Figure 2.1 the topology of one of the target machines, extracted with the tools of the hwloc package. The others have similar layouts.

We note here that the machines present a NUMA memory hierarchy organization. In the HW1 and HW2 machines the available cores are grouped in two sockets, thus two NUMA nodes are present. In HW3 there are two packages of 32 cores, grouped in four sockets per package. The first and second cache levels are local to each core. In HW3 the third level cache is shared by half of the cores of a NUMA node; in the other architectures it is shared between all the cores on the same CPU socket. Processors can access their own memory directly and the memory attached to other processors across the communication channels. The cache coherency protocol is extended across all cache levels. However, access to memory attached to a remote processor is slower – it has higher latency and also typically reduced bandwidth – than access to local memory. This results in non-uniform memory access: this topology makes memory access time depend on the location of the data. The constants involved are not small: access to main memory can be hundreds of times slower than access to cache. Ideally, threads should be placed on cores close to the data they process, or vice versa. In this kind of system, advanced techniques have to be used to enforce data locality and therefore achieve complete exploitation of the architecture's peculiar features.

In our implementations we exploit the default cache coherence protocols as offered by the system producer: our experiments have a fine grain with respect to the state definition and management, but a coarse grain with respect to the exploitation of memory and architecture features. Possible developments of the work presented in this thesis could cover further tests to study the behaviour of the proposed models when the memory access and management of the system is taken into greater account, also referring to non-uniform memory access architectures.

Chapter 3

Problem characterization

In the context of parallel applications, we focus on structured problems that exploit a tree shaped computational structure. We discuss the introduction of some of the most general stateful pattern instances, the management mechanisms needed by each kind of state and their possible impact on the computational work.

In this chapter we identify a set of interesting general state access and management behaviours, giving a formal characterization of the properties of each case. We extend their description to their suitability for the specific type of structured applications under analysis. This classification schema will be the starting point for the definition and implementation of state patterns and for the study of the impact of the introduction of some kind of state management in tree shaped parallel problems.

3.1 Tree structured computations

When considering tree structured computations, we will hereafter refer to the shape of the concurrent activities graph of the application, a formalism aimed at modelling the concurrent activities needed to implement a given parallel application. The concurrent activities graph is a logical model of the problem and is defined as a graph G = (N, A), where the nodes in N represent (macro) functions/activities to be executed by the application on some data, and the edges in A express control and data dependencies among such functions during the computation. Control dependencies are those dependencies explicitly given by the programmer to establish an ordering in the execution of concurrent activities. Data dependencies are those dependencies modelling the fact that a data item produced by a concurrent action, modelled by an activity node, is needed to perform the concurrent activity modelled by another node. A general example is presented in Figure 3.1.

G = (N, A)
N = {a, b, c, d, e, f, g, h, i}
A = {(a, b), (b, c), (b, d), (c, e), (d, e), (e, f), (e, g), (e, h), (f, i)}

Figure 3.1: Concurrent activities graph. Nodes represent all the activities that have to be performed to complete the execution; edges model control and data dependencies between such activities.

Usually, the activity graph of a pattern-modelled application does not have an arbitrary structure, but presents a shape that depends on the computational patterns exploited and can refer to a standard schema. This representation of a problem is particularly useful to identify the program parts that can run in parallel: they are the ones with no dependencies – neither control nor data dependencies – among them. Activities/nodes linked by a control or data dependency must be executed sequentially, although they can be executed in parallel with other concurrent activities. In the example shown in Figure 3.1, activity b cannot be executed until the completion of a, but after the execution of b, activities c and d can run in parallel. The results of both are needed to be able to perform task e. Then, the activities represented by nodes f, g and h can run in parallel. After the completion of f, node i can also be executed, concurrently with g and h possibly still running.

Within the concurrent activities graph, we also identify the critical path, as the "longest" path of activities connected by control or data dependencies from an input node to an output node. "Longest" in this context has to be read in terms of performance. The critical path is the natural candidate when looking for optimizations in a parallel application. In fact, as a critical path starts with the availability of the input data and application results may only be delivered when all the output activities are terminated, the best time we can achieve in the execution of our parallel application is the time taken in the execution of the whole critical path. In the example in Figure 3.1 the critical path is:

a → b → max{c, d} → e → f → i.

G = (N, A)
N = {a, b, c, d, e, f, g, h}
A = {(a, b), (a, c), (b, d), (b, e), (c, f), (c, g), (c, h)}

Figure 3.2: Tree shaped concurrent activities graph. All the dependencies between tasks form a tree of relations.

In our context, we will have a tree shaped concurrent activities graph (Figure 3.2) where the input working data, usually fully available at the beginning of the computation, sits at the source of the flow and thus represents the input to the root node of the computational tree. This data will be split according to some partitioning function, generating a number of children nodes that will operate on the – possibly overlapping – partitions. The branching continues recursively until leaf functions, where the fulfilment of some conditions, specified by the application logic, stops the splitting of the working data. At this point, some "base case" results are sequentially computed.

In a more general view of the problem under analysis, instead of a tree we can think of a graph shaped data flow, where the concurrent activities graph has a well-defined shape: the computation starts as described above, with recursive partitioning, but it does not end at the "leaves". Instead, these "special" internal (ex-leaf) nodes are collapsed again to a final sink node through the execution of a series of combination steps of the partitioned data that follow the sibling relations of the nodes of origin of the data, thus creating a tree identical to the top-down one, visited/generated in reverse order with respect to the above part. We call this particular shape, sketched in Figure 3.3, the "mirrored tree" shape.

G = (N, A)
N = {a, b, c, d, e, a', b', c'}
A = {(a, b), (a, c), (b, d), (b, e), (d, b'), (e, b'), (c, c'), (b', a'), (c', a')}

Figure 3.3: "Mirrored tree" shaped concurrent activities graph. The activities and dependencies of the above part of the tree are replicated, in reverse order, on the below part.

The general tree shape outlined above is exploited by well-known patterns such as Divide and Conquer and Branch and Bound.

It can be the case that an application following these computational patterns needs to maintain and manage some kind of internal state. We study some of the most general ways to manage the state in stateful applications of this kind, trying to identify the main properties of the problem structure, and the conditions of applicability of the different state patterns. Analysing the identified typologies of state access and management, we consider the specific parallelization opportunities that can be applied in the context of the computational pattern taken into account, and we test them over synthetic applications to study the scalability benefits that can be obtained in different settings.

(23)

Figure 3.4: General pipeline: the first stage computes function f on the input data, the second stage computes function g on the result of the execution of the previous stage.

In our analysis we take into account some meaningful macro-cases:

• read-only state;
• resource state;
• accumulator state;
• fully partitioned state.

This range of cases has also been studied for embarrassingly parallel computations in [18], with good results for the feasibility and performance of the approach.

In our study of the effects of managing some kind of state in a parallel application, we started from the analysis of the case of pipelined applications, which was not investigated by the work of Danelutto et al.

This kind of parallel application is characterized by a stream of input data that arrives, possibly at different rates, at the source stage of the pipeline. Indeed, the pipeline skeleton is typically used to model computations expressed in stages: each input item is processed in sequence by the functional stages, and the output of a stage is the input of the subsequent stage.

In pipelined computations the parallelization is exploited by the execution of different stages onto different items of the input stream at the same time. For example, if we have a two stage pipeline (see Figure 3.4), with the source stage computing function f and stage two computing function g, and if x_i and x_{i+1} happen to be consecutive items of the input stream, then f(x_{i+1}) and g(f(x_i)) will be computed in parallel, for any i.

We mainly study the properties of tracking algorithms, analysing applications about object tracking and feature extraction. In this kind of application the sequence of stages of the tracking pipeline is executed over a stream of input video frames. The state is represented by a vector of features describing the object to be tracked, and it is usually a global state. In other cases, there can be many different states, but local to each functional stage.


Figure 3.5: General scheme of dedup application pipeline (from [20], with permission).

As a stateful pipeline problem, we also analyse the structure of the dedup application included in the PARSEC benchmark suite [7]. PARSEC (Princeton Application Repository for Shared-Memory Computers) is a collection of various multithreaded programs with high system requirements. This suite has been used for stressing shared-memory architectures [41], and moreover it has been recently used to assess the expressive power of emerging parallel programming frameworks [9]. PARSEC consists of 13 programs from different areas of computing. One of them is dedup: a streaming application that compresses a data stream with a combination of global and local compression phases called deduplication. The dedup benchmark can be modelled using different nestings of pipe and farm patterns. Figure 3.5 shows a representation of the dedup pipeline general scheme.

The state of this application, a hash table shared between two stages, ends up being modelled as a fully partitioned state, with entry-based synchronization mechanisms.

In this preliminary analysis of the context of pipelined applications, we did not find significant opportunities related to the implementation of different state management policies. The static structure of this pattern and its fixed flow of control do not seem, to us, to offer a sufficiently rich range of behaviours and concurrency possibilities.

So, we moved to the case of tree structured parallel problems: these kinds of computational structures seem more interesting and challenging for the study of the management support and of the impact of state pattern variants. In this work we focus on problems of this class, referring to well-known patterns that exploit this shape, like the Divide and Conquer pattern and the Branch and Bound pattern.

3.1.1 Divide and Conquer pattern

Divide and Conquer is a common algorithm design pattern that applies when a problem can be divided into smaller subproblems recursively. The recursive division proceeds until a base case is reached that can be solved serially, and the final result is rebuilt out of the partial results of the recursive subproblems.

This paradigm is often used to find a time-optimal solution to a problem. The main schema proposes to decompose a given problem into two or more simpler subproblems, to solve them in place of the original instance, and to compose the subproblem solutions to obtain the final result. With this approach, only problems considered simple enough are solved directly.

In the tree representation of this pattern, starting from the root containing the input problem, a divide function logically creates branches of new subpartitions of the input task. Each subpartition has to be evaluated: according to the base case test, if the instance is simple enough it is solved directly using a base case function, and thus that instance is a leaf of the concurrent activities tree. Otherwise, it is recursively divided, originating a new subtree. A combine function then aggregates the subtask solutions to produce the subtree results and, finally, the global result of the root problem.

Thus, we can summarize that a divide and conquer algorithm operates according to these main principles and operations (a minimal interface sketch follows the list):

• divide phase: recursively splits the problem into two or more smaller subproblems, until simple enough instances are reached;

• base case solution: directly solves the problem, after a base case test that checks if the problem instance is simple enough;

• conquer phase: combines sibling results, to give a solution to the original, root problem.
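The following sketch shows how such a generic interface can look; the function names and the naive std::async-based parallelization are our illustrative assumptions, not the framework implementation discussed in chapter 5:

```cpp
// Generic divide and conquer interface (hypothetical names). The business
// logic is supplied through the four functional parameters.
#include <future>
#include <vector>

template <typename Problem, typename Result>
Result divide_and_conquer(
    const Problem& p,
    bool (*base_test)(const Problem&),               // base case test
    Result (*base_solve)(const Problem&),            // base case function
    std::vector<Problem> (*divide)(const Problem&),  // divide function
    Result (*combine)(const std::vector<Result>&)) { // combine function
  if (base_test(p)) return base_solve(p);  // leaf of the activities tree
  // Branch: spawn one asynchronous task per subproblem. A real executor
  // would bound the parallelism degree instead of spawning freely.
  std::vector<std::future<Result>> futures;
  for (const Problem& sub : divide(p))
    futures.push_back(std::async(std::launch::async, [=] {
      return divide_and_conquer(sub, base_test, base_solve, divide, combine);
    }));
  std::vector<Result> partial;                       // conquer phase
  for (auto& f : futures) partial.push_back(f.get());
  return combine(partial);
}
```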

Analysing the pattern, we can outline that the depth of the tree of partitions logically generated by the divide and conquer skeleton is in general unknown when looking only at the input data, and therefore some dynamic behaviour must be implemented to be able to run arbitrary depth sub-partition trees. Moreover, also the partition distribution, and thus the sub-tree balance, cannot be easily inferred from the input data only, as it depends on the business logic of the specific application that exploits the framework.

We will see in the next chapters how these structural properties are related to the parallel exploitation of the computational and state access patterns.

3.1.2 Branch and Bound pattern

Branch and Bound is a common algorithm design pattern based on a tree shaped exploration of the solution space, consisting of a systematic enumeration of candidate solutions.

In the tree representation of this pattern, the root node is responsible for the full solution space. The algorithm visits branches of this tree, exploring subsets of the solution space. Each branch is checked on the basis of an estimate of the optimal solution of the objective function, and it is not further explored if it cannot produce a better solution than the best one found so far by the algorithm. Otherwise, the candidate solutions of that branch are enumerated and checked. The subproblems/branches are generated with a recursive hierarchical relation: a parent problem is divided into two or more disjoint child problems, and the solution of the parent problem corresponds to the optimal solution of one of the child problems.

The goal of a branch and bound algorithm is to find a value x that maximizes or minimizes the value of an objective function f(x), among the search space set S of admissible or candidate solutions.

A branch and bound algorithm operates according to these main principles and operations:

• branching phase: recursively splits the search space into two or more smaller spaces, subsets of the feasible region, then minimizing or maximizing f(x) on these smaller spaces;

• bounding phase: to improve on the performance of brute-force enumeration and testing of candidate solutions, the algorithm keeps track of bounds on the optimum it is trying to find. These bounds are used to "prune" the search space, that is, to avoid exploring branches that certainly will not improve the local optimum;

• solution test: determines whether an item of the explored space represents a single candidate solution.

Using these operations, a branch and bound algorithm performs a top-down recursive search through the tree of instances generated by the recursive branching. During the visit, the algorithm checks whether the bound of each branch improves on the global bound found up to that moment of the search space exploration. If not, the branch is safely discarded from the search and the recursion along that path stops. This pruning step is usually implemented by maintaining a global variable that records the best bound seen among all instances examined so far.
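As an illustration, assuming a minimization problem and a C++ implementation (our sketch, not the framework code), this global variable can be kept in a std::atomic and updated with a compare-and-exchange loop:

```cpp
// Global bound for a minimization branch and bound: concurrent explorers
// prune against it and update it when a better candidate is found.
#include <atomic>
#include <limits>

std::atomic<double> best_bound{std::numeric_limits<double>::infinity()};

// Returns true if the branch may still improve on the global optimum.
bool worth_exploring(double branch_lower_bound) {
  return branch_lower_bound < best_bound.load(std::memory_order_relaxed);
}

// Called when a candidate solution is found: records it if it is better.
void try_update(double candidate) {
  double cur = best_bound.load();
  while (candidate < cur &&
         !best_bound.compare_exchange_weak(cur, candidate)) {
    // cur is refreshed by compare_exchange_weak on failure; retry
  }
}
```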

3.2 State access patterns definition

Various situations can arise when the state is taken into account in a tree shaped computation executed with parallel activities. According to the way in which the state is logically used in the application, and depending on the access pattern semantics, many specific possibilities of management and parallelization can be identified. The consequences of the usage of state variables within structured parallel computations depend on the kind of shared state and, with a relatively smaller weight, on the kind of computational parallel pattern used.

The general characteristics of each state access pattern make it possible to extract a parallel model for the state management of the main application it applies to. The definition of a state access pattern classification, combined with the analysis of further structural properties of the actual stateful application to be parallelized, turns out to support non-serial implementations of the state concept within a problem. It provides a general approach to the state characterization and to the related management opportunities, gives an idea of the execution behaviours and theoretical scalability boundaries, and moreover offers some general guidelines to support the implementation task, especially for data structure and synchronization choices.

In the following sections, we present the main state access patterns we studied in this work from a theoretical perspective.

3.2.1 Read-only

A read-only state is a static resource that is accessed by parallel activities only for read operations. This is not actually a shared state, as no modification is made to the shared structure. The only concurrent accesses are those aimed at reading (parts of) the shared structure. This kind of state does not need a specific model for the structure and access policies, because read operations can always happen in parallel. Moreover, the memory location of the state data structure is fixed and known by all entities, and the content of the state can be loaded and cached at the first access, with an advantage for memory access and time performance. Read-only state may be handled by copying. In case the data structure is large and shared memory is supported, accesses to the shared data structure do not need any kind of synchronization, nor does cache coherence need to be ensured on the cached copies of the shared data structure (parts). Indeed, if the whole state data structure is not too big and fits the cache space assigned to the parallel activity, no cache faults will happen after the first load, leading to notable benefits for frequently accessed states.

An application with a read-only state can run exploiting the tree structured computational pattern with no need of further mechanisms or synchronization restrictions.

In the Experiments section (chapter 7) we will use the timing results of tests with this kind of state management as a baseline against which to compare all the other, more advanced patterns.

3.2.2 Resource

The case of resource state models the situation in which the whole state data structure has to be accessed serially. This is the kind of shared state access that mimics exclusive access to a shared resource. It corresponds to obtaining a lock on the shared state variable, then modifying the variable and eventually releasing the lock.

The state update relative to an input update u_i is computed as a function S of the value of the update and of the value of the state at time i: s_{i+1} = S(u_i, s_i).

Usually, we apply this pattern when we have no further information about the properties of the state update function S or about the state representation, thus we enforce the strongest assumptions for its management.

An important property of this access pattern is that the updates are applied in their strict arrival order, meaning that for two consecutive updates u_i, u_{i+1} we compute the new state s_{i+2} as S(u_{i+1}, S(u_i, s_i)), where s_i is the value of the state present at time i, the arrival time of update u_i.

Resource state variables should be handled with full synchronization in order to guarantee the exclusive access to the state values. To provide consistent access to the state and to ensure the order of execution of input updates, the state is managed by a single logical component. This entity logically provides an external communication channel to receive state updates and is fully responsible for all synchronization mechanisms. All the other active entities of the parallel execution can access and update the state data only using the interfaces exposed by the state manager.
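A minimal sketch of such a manager, with names of our own invention rather than the framework's, serializes every update s_{i+1} = S(u_i, s_i) behind a mutex:

```cpp
// Resource state manager: a single logical component owning the state;
// every access goes through its interface, so updates take effect in
// strict arrival order under exclusive access.
#include <mutex>
#include <utility>

template <typename State, typename Update>
class ResourceState {
  State s_;
  mutable std::mutex m_;
 public:
  explicit ResourceState(State init) : s_(std::move(init)) {}

  // Applies s_{i+1} = S(u_i, s_i) under the lock.
  template <typename Fn>
  void update(const Update& u, Fn S) {
    std::lock_guard<std::mutex> lk(m_);
    s_ = S(u, s_);
  }

  State read() const {  // consistent snapshot for reader activities
    std::lock_guard<std::mutex> lk(m_);
    return s_;
  }
};
```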

Concurrent reader activities may follow different policies. They can have a read-only reference to the global state to be consistently accessed at every need, or they can ask the manager for new state updates. Depending on the application logic, read operations can be performed in strict order with the interleaved writes, or can be served independently from the actual updates.

If additional synchronization between the effectiveness of updates and the execution of concurrent activities is needed for the correctness of the application result, it has to be considered and handled explicitly by the application programmer in the parallel structure definition. We will see some examples in the applications section (chapter 6).

3.2.3 Accumulator

In the accumulator state access pattern the shared state data is generally represented as a non-structured value, accessible by more than one concurrent entity, which is always updated by applying a function $S$ restricted to be of the form:

$S(u_i, s_{curr}) = u_i \oplus s_{curr}$

where $\oplus$ (read as o-plus) has to be an associative and commutative operator, and $s_{curr}$ is the value of the internal state at the moment when the update data $u_i$ is provided.

Indeed, to apply this kind of state definition and management we have to know some specific properties of the state update function.

Relying on the associativity and commutativity of the $\oplus$ operator, many optimizations of the state sharing policies become possible. They can simplify the state management, in particular removing part of the contention for state access by replicating the state data structure along the computation. In our context of tree structured parallel computations, the application can specify a replication policy based on some property of the tree structure, for example on the level of the nodes as shown in Figure 3.6, so that each subtree-local state accumulates the updates coming only from a specific subset of the parallel competitors.

A different concurrent manager can be set up for each local state: it will be responsible only for the updates and accesses to that specific replica. If needed, a “mirrored” tree of accumulations of the replicated states can be executed in parallel, to obtain the final global state comprising all the updates.
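The following C++ sketch illustrates the idea in its simplest form, with one replica per worker and a final sequential combination; the sum operator, the number of workers and the update stream are assumptions made for the example.

#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Accumulator pattern with replicated local states: each worker
// accumulates into its own replica with no synchronization; the replicas
// are then combined with the same operator, which is legal because (+)
// is associative and commutative.
int main() {
    constexpr int n_workers = 4;
    std::vector<long> local_state(n_workers, 0);   // one replica per worker

    std::vector<std::thread> workers;
    for (int w = 0; w < n_workers; ++w) {
        workers.emplace_back([w, &local_state] {
            for (long u = w; u < 1000; u += n_workers)
                local_state[w] = local_state[w] + u;  // s = u (+) s, locally
        });
    }
    for (auto& t : workers) t.join();

    // Final combination of the replicas; here it is sequential, but it
    // could itself be organized as a mirrored tree of parallel merges.
    std::cout << std::accumulate(local_state.begin(),
                                 local_state.end(), 0L) << '\n';
}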

Figure 3.6: Accumulator state access pattern. Tree of accumulator states: state $s_1$ is responsible for all the nodes of the subtree rooted at node $n$.

3.2.4 Fully partitioned

To apply this kind of state definition and management, the state of the application has to be partitionable. This means that we have to know at least some properties of the state data structure and of the specific use that the application makes of its internal state.

The partitionable state can be modelled as a vector $v$ of values of primitive type, with length $N$ equal to the total number of partitions in which it can be split.

A hash function $H$ can be specified that applies some policy to map each received update to a specific position in $[0, N-1]$, thus to an identified partition of the state vector. For example, the mapping to the vector entry can depend on the producer of the update or on a characteristic of the generated value. An update event $u_i$ always maps to a single partition, identified by $v[H(u_i)]$, contributing only to the modification of that portion of the global state. The actual modification of the vector entry is computed by a function of the update data and of the partition value present at its arrival.

In this situation, to improve performance we can set up an independent manager for each state partition: it will be responsible only for the accesses and the updates to that specific subpart of the state.

The parallel execution only needs the synchronization of the components that access the same state partition. Here we exploit the property that the state entries of a partition are neither needed nor modified by an update directed to another partition.

These characteristics make it possible to extract a parallel model for the state management of the main application.
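A possible shape of such a partitioned manager is sketched below, assuming a lock per partition and a user-provided hash policy; both the types and the locking granularity are illustrative choices, not our framework's actual interface.

#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

// Fully partitioned state: N independent slots, each protected by its own
// lock, so only updates hashing to the same partition contend with each
// other. The hash policy H and the entry type T are placeholders.
template <typename T, typename U>
class partitioned_state {
    struct slot {
        T value{};
        std::mutex mx;
    };
    std::vector<slot> slots_;
    std::function<std::size_t(const U&)> hash_;  // H: update -> [0, N-1]
public:
    partitioned_state(std::size_t n, std::function<std::size_t(const U&)> h)
        : slots_(n), hash_(std::move(h)) {}

    // Applies update_fn to the single partition v[H(u)]; other partitions
    // are never read or written, as required by the pattern.
    template <typename F>
    void update(const U& u, F update_fn) {
        slot& s = slots_[hash_(u) % slots_.size()];
        std::lock_guard<std::mutex> lk(s.mx);
        s.value = update_fn(u, s.value);
    }
};

A tree visit can then route every state update through update(), contending only with the activities that hit the same partition.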


Chapter 4

Properties description

In this chapter we describe the main properties of the subjects of our problem. We analyse the composition of tree structured computations and the features related to the management of a shared state. Each one of the state access patterns identified in the problem characterization can exploit more than one combination of the options of these properties. This theoretical investigation helps us to focus on the abstract dimensions of the problem, and it will drive both the implementation of the stateful parallel skeleton and the sessions of experiments.

4.1 Tree structure properties

In the scenario of tree structured computations, we focus on parallel applications that need to maintain and manage some kind of internal state. We call them stateful parallel applications.

Many cases can arise: for example, in the design patterns described in the previous section, the introduction of a state can be useful to represent the global bound for the branching, or to store some data that has to be shared among subtrees to support decision phases.

Before going into the details of each state access pattern, we want to define the main properties that distinguish the actual shapes and internal compositions of such computational trees, introducing some dimensions that can help to characterize the behaviour of a stateful parallel application. These differences may reflect on the choice of the state representation and of the state access pattern, and can help to select the ones that better fit the specific case.

In the following sections we introduce a set of dimensions to identify such structural properties of the computation. We use the variants that originate from the meaningful combinations of the defined dimensions to produce an exhaustive set of synthetic tests and to analyse how the differences reflect on behaviours and performances.

4.1.1 Computational tree shape

In a tree structured computation some fixed phases can be identified to describe the general pattern of execution.

In the previous chapter we listed the main operations and properties that can be clearly identified in the divide and conquer and in the branch and bound design patterns. In the most general case, in a tree shaped application we can identify the following main phases:

• split phase: top-down generation of new subtasks from root input data or from parent internal node data;

• base case: the branching stops along the path and instance data are not split any further;

• combine phase (optional): bottom-up aggregation of the subcase results, up to the root node of the original problem.

The flow of data and activities generated by these phases can be represented as a tree or, more in general, as a graph composed of two identical trees (see the example in Figure 4.1): the upper part represents the partitioning tasks, the lower part the nodes of the respective combine steps. We already named this particular case the “mirrored tree” shape (Figure 3.3).

The split phase maps to internal nodes, growing the upper part of the tree from the root; base cases represent the leaves of this upper tree and, eventually, they are also the subcases that have to be aggregated by the combine phase, whose steps map to the internal nodes of the respective lower tree.

Going from the root to the leaves, the size of the single subcase data decreases until the end of the branching, when the stop condition is met. The optional combine phase follows the reverse path, aggregating subtree results up to the root node, which computes the last combination of data to produce the final result of the computation. In the absence of this phase, the solution is available as soon as the tree visit reaches a base case leaf node.
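The general pattern can be rendered as the following sequential C++ sketch, instantiated on a toy summation problem; the phase functions mirror the list above and each recursive call marks a candidate parallel activity. Names and problem are illustrative only, not our framework interface.

#include <iostream>
#include <vector>

using problem = std::vector<int>;

bool is_base_case(const problem& p) { return p.size() <= 2; }

int solve_base(const problem& p) {              // base case: no further split
    int s = 0;
    for (int x : p) s += x;
    return s;
}

std::vector<problem> split(const problem& p) {  // top-down phase
    std::size_t half = p.size() / 2;
    return { problem(p.begin(), p.begin() + half),
             problem(p.begin() + half, p.end()) };
}

int combine(const std::vector<int>& partials) { // bottom-up phase
    int s = 0;
    for (int x : partials) s += x;
    return s;
}

int visit(const problem& p) {
    if (is_base_case(p)) return solve_base(p);
    std::vector<int> partials;
    for (const problem& sub : split(p))         // each subtree visit is a
        partials.push_back(visit(sub));         // candidate parallel activity
    return combine(partials);
}

int main() {
    std::cout << visit({1, 2, 3, 4, 5, 6, 7, 8}) << '\n';  // prints 36
}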

Figure 4.1: General shape of the computational graph with the split tree, the base case nodes and the combine tree.

4.1.2 Computational regularity of phases

According to the general definition of the phases of a tree shaped computation presented in the previous section, many situations may occur in a real application of this kind.

For example, the depicted phases (split, base case solution, combine) can implement significantly different functions with different amounts of work, and this imbalance leads to the predominance of one phase over the others in the total execution time. We refer to this property as the computational regularity of the tree structure.

To define these irregularities with more precision, we formally model the amount of work that has to be done by each phase. We assign to each phase a multiplication factor, which expresses the proportionality of the execution time of the phase with respect to the size of the data processed by the task. To obtain an estimate of the execution time of a specific step of the tree visit, we multiply the phase factor by the size of the task data to be split, combined, or solved. As an example, the well-known merge sort algorithm presents a high multiplication factor for the combine phase and a much smaller one for the split steps; the opposite situation, with the highest factor for the partitioning phase, can be observed in the quick sort algorithm.
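This cost model can be written down directly; in the following sketch both the struct layout and the factor values are arbitrary examples, chosen only to show the two opposite profiles just mentioned.

#include <cstddef>
#include <iostream>

// Toy model of the phase multiplication factors: the estimated execution
// time of a visit step is the phase factor times the size of the task data.
struct phase_factors {
    double split;
    double solve;     // base case resolution
    double combine;
};

double estimated_step_time(double factor, std::size_t task_size) {
    return factor * static_cast<double>(task_size);
}

int main() {
    constexpr phase_factors merge_sort_like{0.1, 1.0, 5.0};  // heavy combine
    constexpr phase_factors quick_sort_like{5.0, 1.0, 0.1};  // heavy split
    std::cout << estimated_step_time(merge_sort_like.combine, 1 << 20) << ' '
              << estimated_step_time(quick_sort_like.split, 1 << 20) << '\n';
}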

More specifically, these factors may be related to each other in many proportions, originating a variety of distributions of the execution times of the tree visit steps. In our analysis we focus on the following cases:

• balanced phases: all the main phases of the tree visit have around the same execution time multiplication factors, meaning that, given a length, splitting, solving, or combining a task of that length takes around the same time;

• unbalanced base case phase: the base case resolution phase has an execution time multiplication factor considerably greater than the others, while the split and combine phases have similar multiplication factors;

• unbalanced split phase: the split phase execution time multiplication factor is considerably greater than the combine one, leading to an unbalance in the top-down visit of the tree of tasks (application example: QuickSort);

• unbalanced combine phase: the combine phase execution time multiplication factor is considerably greater than the divide one, leading to an unbalance in the bottom-up visit of the tree of tasks (application example: MergeSort).

4.1.3 Partitioning regularity

Another property of the presented general pattern that can vary in real application scenarios is related to the definition of the branching function that splits a supertask into smaller subtasks.

The split phase, or more in general an internal node of the tree, may generate a number of partitions, possibly different at every division, ranging from a minimum of 2 upwards.

These subtasks may be generated by partitioning the parent task in a balanced or in an unbalanced way; a sketch of both policies is shown after the list. We define these two cases as:

• regular partitions: the generated subtasks have around the same size, resulting from the supertask size divided by the number of newly generated partitions;

• irregular partitions: the generated subtasks have considerably different sizes, resulting from a number of unbalanced fractions of the supertask size.
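As announced, a minimal C++ sketch of the two splitting policies follows; the number of parts and the 70/30 proportion of the irregular case are arbitrary examples.

#include <cstddef>
#include <iostream>
#include <vector>

// Sizes of the subtasks produced by splitting a task of a given size.
std::vector<std::size_t> regular_split(std::size_t size, std::size_t parts) {
    // Each subtask gets around size / parts elements (remainder to the last).
    std::vector<std::size_t> out(parts, size / parts);
    out.back() += size % parts;
    return out;
}

std::vector<std::size_t> irregular_split(std::size_t size) {
    // Two subtasks of considerably different sizes.
    std::size_t big = size * 7 / 10;
    return {big, size - big};
}

int main() {
    for (auto s : regular_split(100, 4)) std::cout << s << ' ';  // 25 25 25 25
    std::cout << '\n';
    for (auto s : irregular_split(100)) std::cout << s << ' ';   // 70 30
    std::cout << '\n';
}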


4.2 State access properties

Given the algorithm design skeleton sketched in the previous sections, our analysis proceeds by focusing on additional properties related to the introduction of the state access patterns into a tree shaped computation.

4.2.1 Occurrences of accesses

Some of the primary properties that can impact the application definition and execution are related to the occurrences of the state accesses along the tree structure.

We refer to these parameters as:

• frequency: the amount of state accesses and updates, expressed as a percentage of the total work, i.e. the number of these kinds of operations with respect to the whole number of functional steps (split, base case and combine) of the application;

• position: the state accesses and updates can be spread along the tree or located in an identifiable region, for example at the leaves or on a specific subset of the internal nodes following a defined policy;

• distribution: the state accesses and updates can be uniformly or non-uniformly distributed over the computational tree structure, causing a similar distribution of the cost of the state management during the tree visit.

All of these properties are related to each other and can produce many combinations and settings exploited by real applications. For example, a per-leaf access to the state (an exploitation of the position property) denotes a scenario where the position of the update operations is predefined, well identifiable and located on a specific part of the tree; moreover, the number and frequency of the updates can be inferred if we know some other property of the algorithm logic. All of this provides information useful to make specific optimization choices in the state access pattern and in the related synchronization mechanisms, as the sketch below suggests.
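The per-leaf case can be made concrete with a small C++ sketch, where a trivial counter plays the role of the shared state and is touched exactly once per base case; all names are illustrative assumptions.

#include <atomic>
#include <iostream>
#include <vector>

// "Position" property: all state accesses happen at the leaves of the
// visit, one update per base case, so both the position and the frequency
// of the updates are known in advance.
std::atomic<long> leaf_updates{0};

long visit(const std::vector<int>& p) {
    if (p.size() <= 1) {                    // base case: the only point
        leaf_updates.fetch_add(1);          // where the state is accessed
        return p.empty() ? 0 : p[0];
    }
    std::size_t half = p.size() / 2;
    std::vector<int> left(p.begin(), p.begin() + half);
    std::vector<int> right(p.begin() + half, p.end());
    return visit(left) + visit(right);
}

int main() {
    std::cout << visit({3, 1, 4, 1, 5, 9, 2, 6}) << " computed with "
              << leaf_updates.load() << " leaf updates\n";
}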

4.2.2 Functional grain

The state of an application can be accessed at different grains. Depending on the knowledge that we have of the application logic and properties, the state
