University of Pisa

Scuola superiore Sant’Anna

A DSL tool for parallel application design

Author:

Leonardo Gazzarri

Supervisor:

Marco Danelutto

A thesis submitted for the degree of

MSc in Computer Science and Networking

Abstract

This thesis describes the development of RPL-Shell, a DSL-based tool that aims to facilitate the design of structured parallel applications. The tool, written in C++ and intended to be delivered along with the parallel programming framework FastFlow, gives the designer the possibility to explore a space of functionally equivalent, alternative parallel solutions for structured parallel applications. Starting from a simple parallel expression, the user can generate a list of functionally equivalent expressions, change their non-functional parameters by hand or by applying optimizers, modify global environment variables, and ask for approximate evaluations of performance measures. Finally, the user can pick the "best" solution and ask the shell to generate FastFlow code using a prototype compiler.


Acknowledgements

I would like to express my profound gratitude to my supervisor Marco Danelutto for his useful comments, remarks and engagement throughout the learning process of this master's thesis.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Our contribution
  1.4 Thesis structure

2 Background
  2.1 Algorithmic skeletons
    2.1.1 Definitions
    2.1.2 Classification of algorithmic skeletons
      Stream parallel skeletons
      Data parallel skeletons
      Resolution skeletons
      Control skeletons
    2.1.3 Algorithmic skeleton nesting
      Skeleton trees
      Full nesting
      Two-tier model
    2.1.4 Skeletons as higher-order functions with parallel semantics
      Pipe skeleton
      Farm skeleton
      Map skeleton
      Reduce skeleton
    2.1.5 Stateless and stateful skeletons
    2.1.6 A survey on algorithmic skeleton frameworks
  2.2 Performance models
    2.2.1 Performance measures
    2.2.2 Approximate performance models
      Pipe cost model
      Farm cost model
      Map cost model
      Reduce cost model
  2.3 Skeleton transformation rules
    2.3.1 Refining the semantics for Pipe and Farm skeletons
    2.3.2 Rewriting rules
  2.4 FastFlow
    2.4.1 ff_node
    2.4.2 ff_pipeline
    2.4.3 ff_farm
    2.4.4 Data parallelism in FastFlow

3 RPL-Shell: Design
  3.1 Requirements and functionalities
  3.2 Design overview
  3.3 The shell
    3.3.1 Domain specific language
    3.3.2 Separate semantics for seq nodes
    3.3.3 Shell user interface
  3.4 Management of non functional properties
    3.4.1 Annotators
    3.4.2 Evaluators
    3.4.3 Optimizers
    3.4.4 Rewriters
  3.5 Code generation

4 RPL-Shell: Implementation
  4.1 Shell DSL implementation
    4.1.1 Lexing and parsing
    4.1.2 RPL tree representation
    4.1.3 Interpretation
  4.2 Annotators
  4.3 Evaluators
  4.4 Optimizers
  4.5 Rewriters
    4.5.1 Rewriting Rules
    4.5.2 Tree pattern matching implementation
  4.6 How to extend the shell
    4.6.1 Extending annotators: typein and typeout
    4.6.2 Extending evaluators: communications
    4.6.3 Extending optimizers: two-tier
    4.6.4 Extending rewriting rules: farmofmap
    4.6.5 Extending verbs: ushow
    4.6.6 Further considerations

5 RPL-Shell: Experiments
  5.1 Video filtering
  5.2 Mandelbrot

6 Conclusions
  6.1 Summary
  6.2 Future work

A RPL-Shell: User Manual
  A.1 Installation & Run
  A.2 Domain specific language
    A.2.1 Patterns composition
    A.2.2 Verbs and shell management
  A.3 Usage of the toolchain
    A.3.1 How to write business logic code
      Wrapping stream source code
      Wrapping code for stream output consumption
      Wrapping sequential functional code
    A.3.2 How the tool is intended to be used

List of Figures

2.1 Example of a skeleton tree
2.2 Pipe skeleton
2.3 Farm skeleton with emitter and collector
2.4 Map skeleton with scatter and gather and four workers for an input task composed of four subtasks
2.5 Reduce computation on tree
2.6 FastFlow architecture (taken from [29])
3.1 A high level view of RPL-Shell components
3.2 Two skeleton trees. Service times for computing a single element are 10 and 20 respectively for s1 and s2. For each node: in bold the number of workers, in red the input size of the input data structure, in blue the service time. Input size is 10.
3.3 The shell inner working
4.1 A view of the scanner and parser
4.2 RPL tree for the assignment "a = pipe(seq(5),seq(10),b)"
4.3 Three relevant trees for the application of the mapofcomp rule on a given skeleton tree. Colors show the correspondences between the placeholders ∆ and the matched subtrees of ∆sk (represented by subfigure (A))

List of Tables

3.1 Current skeleton set supported
3.2 Current set of the evaluators
3.3 Example of two evaluators: "is" is a shortcut for "inputsize", while "nw" is a shortcut for "number of workers"
3.4 Table of rewriting rules
4.1 Summary of the extensions
5.1 Profiling of the sequential video filter application (time in milliseconds)
5.2 Video filter results (time in milliseconds)
5.3 Video filter pattern expressions
5.4 Profiling of the sequential mandelbrot application (time in milliseconds)
5.5 Mandelbrot pattern expressions
5.6 Bad mandelbrot results (time in milliseconds)


Chapter 1

Introduction

1.1 Context

Computer evolution in the past two decades has seen a migration from uniprocessor systems towards multi-core and many-core systems. This revolution affected high-end systems like supercomputers (thousands to millions of cores) as well as personal low-end computing facilities like laptops. Even the smallest modern computers, like phones, have multi-core processors inside. The increasing power of parallel computers naturally led to an increasing demand for experts able to write parallel applications and for efficient parallel programming paradigms/models taking advantage of the current hardware.

Although nowadays parallelism is everywhere, most application programmers are not experts in writing good parallel applications, and parallel programming is still considered a complex, error-prone, ad-hoc process. Different steps are needed to build a parallel application solving a given problem:

• find the concurrent activities, i.e. those activities that can be performed concurrently in order to solve the problem

• coordinate these activities, i.e. define how these activities communicate and synchronize with each other

• implement the parallel activities and the related coordination mechanisms

The classical way to implement a parallel application consists in using a framework providing the low-level primitives needed to exploit parallelism, e.g. proper calls to set up parallel activities as threads or processes, communication mechanisms such as shared memory or message queues, and synchronization mechanisms such as locks and semaphores. Orchestrating a parallel application with a significant number of parallel activities using this kind of low-level primitives may be


problematic, in particular when there are dependencies between parallel tasks. These dependencies usually require synchronizations that, if not implemented correctly, may lead the application into well-known problems like deadlocks or race conditions, in addition to scalability issues (see Amdahl's law). These kinds of problems, together with intrinsically non-deterministic behavior, make a parallel application difficult to test and debug.

The structured parallel programming approach hides such complexity by encouraging the programmer to use and combine different parallel patterns. Parallel patterns, also known as algorithmic skeletons [1], can be defined as higher-order functions encapsulating/modeling valuable and common parallel structures used in effective and efficient parallel applications, whose functional parameters provide the business logic of the application. A parallel programming framework based on algorithmic skeletons provides a set of skeletons that the user can instantiate with proper business logic code, according to some parallel structure identified in their application.

Algorithmic skeletons simplify parallel programming by hiding all the implementation details related to the application's parallelism exploitation, and especially by relieving the programmer of any concern about communication and synchronization mechanisms. Another positive consequence is that the correctness and efficiency of parallelism exploitation become a concern of the framework and no longer of the application programmer, thus making testing and debugging activities much easier.

1.2 Motivation

Although structured parallel programming simplifies parallel programming, exploring functionally equivalent parallel solutions in order to find the "best" one with respect to some non-functional properties of interest remains a complex task, for several reasons:

• the programmer needs to know how to derive functionally equivalent pattern combinations from other patterns

• the programmer needs to know how to evaluate this space of solutions in order to pick the best one, either by profiling or by estimation using the performance models associated with the skeletons


• the programmer needs to know how to set the values of non-functional parameters of a pattern (e.g. the parallelism degree) and of other valuable parameters (e.g. the grain size) in order to get the best solution for the given non-functional properties of interest

All the previous points require relevant expertise, and significant coding activity is needed to test/profile each solution in the space. The coding effort depends strongly on the chosen parallel programming framework, but it is typically a lengthy and tedious process.

1.3 Our contribution

This thesis work is based on the work in [2] and [3] and consists in the implementation of a DSL-based tool that re-implements and extends a previous proof-of-concept prototype written in OCaml, with the intent to provide a tool, to be delivered along with the parallel programming framework FastFlow [4], that facilitates the process of structured parallel application development before the actual coding phase.

This thesis work therefore attempts to overcome the previously mentioned complications in structured parallel application design by designing and implementing RPL-Shell, a DSL-based tool supporting the exploration of a space of equivalent alternative parallel implementations of an application at a high level of abstraction. RPL-Shell's contributions are the following:

• RPL, a domain-specific language (DSL) for describing the parallel structure of an application;

• a scripting language that supports assignments of RPL expressions to environment names and a set of verbs for annotating, optimizing and refactoring expressions;

• an introductory version of a compiler that generates high-level FastFlow code from the RPL representation of the parallel application.

Starting from a simple parallel expression, the user (properly using the offered shell verbs) can generate a list of functionally equivalent expressions, change their non-functional parameters by hand or by applying optimizers, modify global environment variables, ask for approximate evaluations of performance measures and eventually pick the best solution.


1.4 Thesis structure

The rest of this thesis is organized as follows:

• Chapter 2 provides the necessary background. In particular, algorithmic skeletons, the associated performance models and rewriting rules will be discussed. A brief excursus on parallel programming frameworks, with a special focus on FastFlow, is also given.

• Chapter 3 presents the main design choices of RPL-Shell, showing a general picture of what has been implemented at a high level of abstraction.

• Chapter 4 provides the main implementation details of RPL-Shell, showing the principal algorithms and representations. Moreover, it provides details on how the tool can be extended to support, for example, new rewriting rules or optimizers.

• Chapter 5 reports the experimental validation of the tool, in particular experiments on generated FastFlow code designed to test the effectiveness of our approach.

• Chapter 6 discusses related work, outlines future directions and draws concluding remarks.


Chapter 2

Background

This chapter introduces the background necessary to better understand the thesis work. After presenting the concept of algorithmic skeleton, which is central to a correct understanding of the possibilities offered by RPL-Shell and of the included patterns, we describe in separate sections the performance models and the rewriting rules for parallel patterns. These two concepts are particularly important: performance models will be used to implement evaluators, i.e. procedures giving the user (and the tool itself) the possibility to evaluate alternative and functionally equivalent pattern compositions, while rewriting rules will be used inside the system to provide the user with ways to generate alternative forms of a starting pattern modeling the application. Finally, we describe FastFlow, the target parallel programming environment of RPL-Shell, by giving an excursus of some of its core characteristics that will be useful to understand the design of our introductory version of the code generator.

2.1 Algorithmic skeletons

2.1.1 Definitions

Murray Cole laid the foundations of the algorithmic skeleton concept in 1989 [1] by defining a skeleton programming framework as a programming model where the programmer is presented with a selection of specialised higher-order functions, from which only one must be selected as the outermost function of the program. In this way the chosen higher-order function defines a procedure template, specifying the overall parallel structure of the computation. He called such a set of functions "algorithmic skeletons" because they describe the computational skeleton of an algorithm without overspecifying the details. In this kind of model the programmer must describe a solution for a specific problem by choosing one of the algorithmic skeletons available in the framework, without concerning themselves with the lower-level details of the program. With structured parallel programming in mind, each of these higher-order functions should describe a computational structure for which an efficient parallel implementation is possible, i.e. it should correspond to a "good parallel algorithmic technique".

Summarizing Cole's definition, we can understand an algorithmic skeleton as a higher-order function with associated parallel semantics that takes as parameters one or more portions of sequential code (the business logic of the application). This means that the same skeleton, provided with different business logic, models different applications having the same parallel structure. In this first definition there is no mention of skeleton nesting, namely the ability to hierarchically compose patterns, so any skeleton-based application is the expression of a single parallelism exploitation pattern.
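To make this view concrete, the following C++ sketch is ours, not taken from any specific framework (all names are illustrative): it renders a map-like skeleton as a higher-order function, where the structure is fixed by the skeleton and the business logic f is the functional parameter.

#include <vector>

// A sequential rendering of a map-like skeleton as a higher-order
// function: applying f independently to each element is the fixed
// structure; f itself is the business logic supplied by the user.
template <typename A, typename F>
auto map_skeleton(const std::vector<A>& in, F f) {
    std::vector<decltype(f(in.front()))> out;
    out.reserve(in.size());
    for (const auto& x : in)
        out.push_back(f(x)); // a parallel implementation would run
                             // these applications concurrently
    return out;
}

The same map_skeleton, instantiated with different f, models different applications sharing the same parallel structure.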

During the '90s and the beginning of the '00s several research groups evolved the skeleton concept, and Cole's definition evolved with it [5, 6]. Although different definitions exist, a widely accepted one can be summarized by the following definition, which can be found in [7, 8]:

Definition 2.1 An algorithmic skeleton is a parametric, reusable and portable programming abstraction modeling a known, common and efficient parallelism exploitation pattern.

Such a definition leads to the following considerations on its embedded features:

• it is parametric in the sense that it accepts as parameters computations specializing what the skeleton computes (the business logic) rather than how the parallel computation is performed. Such parameters may be either sequential skeletons (i.e. skeletons modeling sequential portions of code) or even other skeletons [9].

• it is reusable in the sense that it can be reused in different contexts. For this reason a parallel programming model based on skeletons (a parallel language or a library/framework) should enable composability of the proposed patterns.

• it is portable in the sense that efficient implementations of the skeleton should exist for different architectures. This is important because a skeleton-based parallel programming model should decouple the skeleton semantics from the parallel implementation on the target architecture.

• it models common and efficient parallelism exploitation patterns. It has been recognized that most parallel applications exploit well-known parallel structures; for this reason we are interested in having skeletons modeling these structures, for which efficient implementations exist on a range of architectures. Moreover, a skeleton should be recognizable as a pattern by decoupling the behaviour from the structure, i.e. it should specify a problem-independent structure and note where the problem-specific logic must be supplied [10].

2.1.2 Classification of algorithmic skeletons

Starting from the introductory work of Cole, several algorithmic skeletons modeling well-known parallelism exploitation patterns have been identified, and different skeleton-based programming environments [11, 12, 13, 14] have been developed. Although these frameworks expose different sets of skeletons, a common subset may be found in most of them. Based on their functionality, skeletons can be categorized as stream (task) parallel, data parallel and resolution skeletons [15]. Together with these categories, all exploiting parallelism patterns, we can add control skeletons, which express structured sequential/coordination patterns needed to support/coordinate parallel skeletons. In the following these categories are explained and some notable representative skeletons are briefly presented. In 2.1.4 some of these patterns are explained in more detail.

Stream parallel skeletons

Stream/task parallel skeletons are meaningful only for computations processing some kind of input stream. A stream is a possibly infinite sequence of tasks, for example images represented as matrices of some known size. In this kind of skeleton, parallelism is exploited by computing different input tasks in parallel. Examples of stream parallel skeletons are:

• Pipe, models computations expressed in stages. Parallelism is achieved by computing stages in parallel on different consecutive inputs (and on the same input at different times).


• Farm, also known as Master-Worker, models embarrassingly parallel computations (i.e. computations where little or no effort is needed to separate the problem into different parallel activities). Parallelism is achieved by proper scheduling of independent tasks to different parallel activities computing at the same time.

Data parallel skeletons

Data parallel skeletons are able to operate either on single values or on streams. In this kind of skeleton, parallelism is exploited by computing different elements of an input data structure in parallel. Examples of data parallel skeletons are:

• Map, based on the well-known higher-order function map typical of many functional languages. Parallelism is achieved by applying the same function to each element of the input data structure simultaneously.

• Reduce, based on the well-known higher-order function fold of functional languages. Parallelism is achieved by using parallel activities organized in a tree: each activity applies an associative operator (given as functional parameter) to the results produced by its children activities. The leaf activities compute the reduce operation locally over a partition of two or more elements of the input data structure.

• Data parallel (or Fork), with semantics similar to map's, but different functions are applied simultaneously to the different elements of the input data structure.

• Stencil, models computations that update "array" elements according to the result of a function applied to a fixed neighborhood pattern (the stencil). Parallelism is achieved by updating these elements in parallel (typically requiring data exchange among the parallel activities).

Resolution skeletons

Resolution skeletons represent/model algorithmic methods for the solution of a family of problems. A notable example of resolution skeleton is:

• Divide & Conquer [16] that recursively splits the original problem into subproblems until these become simple enough to be solved directly, then computes the solution of the original problem from the solutions of the 2Where little or no effort is needed to separate the problem in different parallel activities,

(23)

2.1. Algorithmic skeletons 9

subproblems. In this skeleton parallelism may be exploited in the divide phase by recursively forking different parallel activities. When no further recursions are performed, the results at each level are combined in a merge parallel phase.

Control skeletons

Control skeletons do not exploit a parallelism structure themselves; rather, they express structured sequential or coordination patterns needed to support or coordinate other parallel skeletons. Notable control skeletons are:

• Sequential, also known as Seq, wraps sequential portions of code (business logic code) and terminates the recursive nesting of skeletons.

• Sequence, also known as Comp, computes a list of skeletons sequentially one after the other.

• Conditional, also known as If, provides conditional branching. Given a boolean condition and two skeletons, only one of them is executed, depending on the value of the condition.

2.1.3 Algorithmic skeleton nesting

Skeleton nesting is the capability to hierarchically compose skeletons (i.e. use skeletons as parameters of other skeletons); it is important in order to build complex patterns from simple ones. Composability in structured parallel programming plays a big role in expressiveness, flexibility [5], reusability [17] and portability.

Skeleton trees

Skeleton nesting implies that complex patterns are built from the set of skeletons provided by the chosen parallel programming environment. Such complex patterns, and thus skeleton programs³, are naturally representable as trees whose nodes represent algorithmic skeletons and whose leaves are sequential nodes. Such trees are called skeleton trees. Figure 2.1 shows an example of a skeleton tree.

³In the end, a skeleton program is a complex pattern and vice versa.


Figure 2.1: Example of a skeleton tree
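A skeleton tree maps naturally onto a recursive data type. The following minimal C++ sketch is ours and is illustrative only; the actual RPL tree representation is described in Chapter 4:

#include <memory>
#include <vector>

// A skeleton tree: internal nodes are pattern combinators, leaves
// (Seq) wrap sequential business-logic code.
struct SkelNode {
    enum class Kind { Seq, Comp, Pipe, Farm, Map, Reduce };
    Kind kind;
    std::vector<std::unique_ptr<SkelNode>> children; // empty for Seq
};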

Full nesting

A parallel programming environment based on algorithmic skeletons provides full nesting when each one of its skeletons can be a code parameter of any other skeleton (excluding sequential skeletons modeling sequential portions of code). This property is also known as flat nesting and leads to the following important consideration:

• given two skeletons $\Delta_1$ and $\Delta_2$, both compositions $\Delta_1(\Delta_2)$ and $\Delta_2(\Delta_1)$ should make sense and should not lead to incompatibilities or inefficiencies.

Two-tier model

During the '90s, the P3L [11] design experience showed that some skeleton compositions don't actually make sense: real task- and data-parallel applications exhibit a two-tier structure in which task parallelism is exploited at a coarse level among groups of processes exploiting data parallelism [18]. This observation led to the definition of a two-tier composition rule stating that a skeleton tree is legal (and worthwhile) if and only if it has the following structure:

• at the upper level only stream parallel skeletons are used;
• at the lower level only data parallel skeletons are used;
• at the leaves level only sequential skeletons are used.

Figure 2.1 shows an example of a legal tree according to the two-tier model⁴.

⁴Comp of Seq nodes are actually sequential skeletons.

2.1.4 Skeletons as higher-order functions with parallel semantics

In 2.1.1 we gave a definition of algorithmic skeletons, saying that they can essentially be seen as higher-order functions modeling parallelism exploitation patterns; in 2.1.2, instead, we classified algorithmic skeletons on the basis of the type of computation of interest (stream or single data), giving brief and incomplete descriptions of some notable skeletons. In the following we give a better description of the most representative parallel skeletons (i.e. pipe, farm, map and reduce), providing functional and parallel semantics for each of them.

Pipe skeleton

Pipe skeleton can be used to model an application computing a function f that is a composition of n ≥ 1 functions as follows:

$$f(x) = (f_n \circ f_{n-1} \circ \dots \circ f_2 \circ f_1)(x) = f_n(f_{n-1}(\dots f_2(f_1(x))))$$

The parallelism exploitation pattern modeled is given in figure 2.2 by the graph of parallel activities, here called stages, one for each function.

Figure 2.2: Pipe skeleton

By defining $f^n$ as follows:
$$f^n(x) = \begin{cases} f_n(f^{n-1}(x)) & \text{if } n \geq 2 \\ f_1(x) & \text{if } n = 1 \end{cases}$$

we can define a pipe in the following way:

Definition 2.2 A pipe skeleton is a higher-order function taking in input

• a stream of elements $\{x_1, x_2, \dots, x_m\}$, each one having type $\alpha_1$
• a sequence of $n$ functions $\{f_1, f_2, \dots, f_n\}$, with $f_i$ having type $f_i : \alpha_i \to \alpha_{i+1}$,

and producing in output a stream of elements $\{f^n(x_1), f^n(x_2), \dots, f^n(x_m)\}$ by computing in parallel the following functions⁵ for any $i$:
$$f^n(x_i),\quad f^{n-1}(x_{i+1}),\quad \dots,\quad f^2(x_{i+n-2}),\quad f^1(x_{i+n-1})$$

Farm skeleton

Farm skeleton is used to model embarrassingly parallel stream computations, i.e. those computations in which no dependency is present between the parallel activities computing different input stream tasks. In order to visually represent the parallel exploitation pattern modeled by the farm skeleton, figure 2.3 shows a classical picture of a well-known farm template⁶.

Figure 2.3: Farm skeleton with emitter and collector

As we can see from the above figure, the farm skeleton is based on functional replication: the function f is replicated among a set of parallel activities, called workers, and parallelism is achieved by the independent computation of independent tasks, properly scheduled among the workers. Notice that figure 2.3 shows just one of the possible implementations of the farm skeleton, which is itself just an abstraction. Now we can define a farm as follows:

⁵Suppose for simplicity that, for stream indices $j$ outside $[1, m]$ and $1 \leq k \leq n$, $f^k(x_j)$ represents an admissible computation over undefined input.

⁶Simply speaking, a template is an implementation of a skeleton on a given architecture: it defines a set of parallel activities and how they communicate, plus an associated performance model. Given a parallel skeleton, multiple templates implementing that skeleton may exist.

Definition 2.3 A farm is a higher-order function taking in input

• a stream of elements $\{x_1, x_2, \dots, x_m\}$, each one having type $\alpha$
• a function $f$ of type $\alpha \to \beta$
• an integer $n$

and producing in output a stream of elements $\{f(x_1), f(x_2), \dots, f(x_m)\}$ by computing in parallel $n$ consecutive tasks of the input stream using $n$ parallel activities, called workers.

In definition 2.3 the parameter $n$ represents a well-known non-functional parameter called parallelism degree. Even if requiring a parallelism degree in the farm skeleton goes against the idea that a skeleton is just a programming abstraction hiding all the details related to the exploitation of parallelism, most skeleton-based parallel programming environments allow programmers to choose the desired parallelism degree. A feasible alternative is not to leave such an important parameter under the programmer's responsibility and let the runtime support of the framework make the proper choice, possibly applying some well-known performance models (see 2.2).

Map skeleton

Map skeleton can be used to model computations in which different sub-tasks can be derived from a single task and no (or little) dependency is present in computing the sub-tasks. Data parallel skeletons like map may operate on streams, but parallelism exploitation concerns the proper decomposition of a single task. Figure 2.4 shows a possible map template:


Figure 2.4: Map skeleton with scatter and gather and four workers for an input task composed of four subtasks

The task is basically a collection of sub-tasks; for the sake of simplicity, in the following it will be referred to simply as a "vector of elements". Now we can define a map as follows:

Definition 2.4 A map is a higher-order function taking in input

• a single vector $x = [x_1, x_2, \dots, x_m]$ having type vec<$\alpha$>
• a function $f$ of type $\alpha \to \beta$

and producing in output a vector $y = [f(x_1), f(x_2), \dots, f(x_m)]$ of type vec<$\beta$>, where the computation of the elements of the vector is performed in parallel using $m$ parallel activities called workers.

Using this definition we can operate on a stream of vectors by using a map as a functional parameter of a pipe, but we can't have a pipe inside a map (two-tier model approach). However, it's possible to overcome this "problem" either by extending the map to operate directly on streams, or by redefining pipes and farms as simpler higher-order functions (without parallel semantics) operating no longer on streams but on single values, and then using them as parameters of some other higher-order function implementing stream parallelism. In 2.3 we will return to this problem and define the new semantics for pipes and farms.


It is worth pointing out that the last definition lacks the parallelism degree parameter that appears in the farm case. Indeed, explicitly choosing a parallelism degree for map computations is rarer. This is because the dimension of the input data collection is usually known and all its elements are available at the same time. In the farm case the situation is different: the stream dimension is typically unknown and stream items do not all exist at the same time.

Reduce skeleton

Reduce skeleton is used to model fold/reduce computations, i.e. computations in which, given a collection of items $v = [v_1, v_2, \dots, v_m]$ (e.g. a vector) and a commutative and associative operator $\otimes$, the result is the scalar $x = v_1 \otimes v_2 \otimes \dots \otimes v_m$. Figure 2.5 shows how parallelism exploitation is achieved for the reduce skeleton, assuming that each $\otimes$ is computed by a different parallel agent.


Figure 2.5: Reduce computation on tree

Reduce skeleton is therefore defined as follows:

Definition 2.5 A reduce skeleton is a higher-order function taking in input

• a single vector $x = [x_1, x_2, \dots, x_m]$, with each element of the same type $\alpha$
• an associative and commutative function $\otimes : \alpha \times \alpha \to \alpha$

and producing in output a scalar $y = x_1 \otimes \dots \otimes x_m$ of type $\alpha$, where the computation is performed in parallel using $2^{m-1} - 1$ agents organized in a binary tree: each leaf agent computes directly over the elements of its assigned partition, while the other agents compute over the partial results received from their children.

Actually, most skeleton-based frameworks make it possible to specify the maximum number of agents participating in the reduce operation, i.e. its parallelism degree.

2.1.5 Stateless and stateful skeletons

Most skeleton-based frameworks offer a parallel programming model based on stateless skeletons. With the functional semantics shown in 2.1.4, it is easy to understand that skeletons may be implemented completely as higher-order functions with a declarative meaning that is independent of particular implementations. In such "pure" functional frameworks the concept of mutable state has typically been forbidden for essentially two reasons:

• due to historical reasons related to the fact that at the beginning of the work on algorithmic skeletons, parallel machines were actually distributed memory architectures

• due to the fact that shared state is a typical "enemy" of parallel code: it is indeed a typical source of scalability issues, as well as of bugs related to race conditions

However, with the growing presence of shared-memory multi/many-core architectures, the idea of supporting some sort of global shared state becomes a point of interest, since in practice there are cases where a mutable global state greatly simplifies programming. A discussion of how to introduce stateful skeletons and how to modify the semantics given in 2.1.4 to support state goes beyond our present purposes, but it is actually possible and may be the object of future work in the RPL-Shell context, since FastFlow offers mechanisms exploiting shared state.

2.1.6 A survey on algorithmic skeleton frameworks

We already said in this section that several frameworks based on algorithmic skeletons have been developed over the years. Here we present a very short survey (very much inspired by [15]) of some active parallel programming environments, since the methodology developed in our tool is general and may possibly be extended to target frameworks other than FastFlow.


Mallba [19] is a library of skeletons for combinatorial optimization that, in contrast with other classic skeleton-based frameworks, offers skeletons as parametric search strategies rather than parametric parallel constructs. The skeletons are not nestable and are implemented as C++ classes that represent an abstraction of the entities participating in the resolution method. Parallelism is achieved via a proper custom MPI sublayer, and the library targets COW/NOW architectures.

Muesli [13] is a library that offers algorithmic skeletons as C++ objects. Muesli offers a stateless skeleton set including patterns for stream (e.g. farm and pipeline) and data parallelism (e.g. map and others). Skeleton nesting is limited by the two-tier model. It targets COW/NOW architectures via MPI and multi-cores via OpenMP.

OSL [20] is a C++ library that offers data parallel skeletons such as map and reduce. The skeletons are stateless and implemented as C++ functions operating on distributed data structures called distributed arrays. It is built on top of MPI and targets distributed memory architectures.

SkelCL [21] is a C++ library that offers data parallel skeletons such as map and reduce. The skeletons provided are implemented as operators on data parallel containers (SkelCL::Vector). The library is implemented using OpenCL and targets GPU and multi-GPU systems.

SkePU [22] is a C++ library that offers data parallel skeletons such as map, reduce and others. It targets multi-core, GPU (mostly) and multi-GPU systems. Each skeleton has several different implementations: sequential C, OpenMP, OpenCL, CUDA, and multi-GPU OpenCL and CUDA. Similarly to SkelCL, it uses data parallel containers, and skeletons are operators on these data structures.

SkeTo [23] is a C++ library offering several data parallel skeletons for lists, matrices, and trees. It does not offer nestable parallelism patterns but rather operations on parallel data structures, and the implemented skeletons are stateless. It is built on top of MPI and therefore intended for distributed environments such as PC clusters. An important feature is that it provides an automatic fusion rule merging two successive function invocations into a single one [15].

2.2 Performance models

When designing and implementing parallel applications, performance is the main non-functional concern⁸ of interest. In structured parallel programming we are interested in using parallel patterns characterized by performance models that allow the designer to evaluate the performance metrics of interest as a function of typical non-functional parameters⁹ (e.g. sequential service time or parallelism degree), which may depend either on the parallel pattern or on the underlying architecture. The existence of performance models is important because the designer can:

• predict the performance of a possible parallel implementation;

• compare different equivalent (see 2.3) parallel structures before the coding phase;

• tune non-functional parameters (such as the parallelism degree) properly, to avoid wasting resources.

In the following we show the typical performance measures of interest for a parallel application and give performance models for the pipe, farm, map and reduce skeletons.

2.2.1 Performance measures

Given an application, be it sequential or parallel inside, we are interested in the following measures:

• Latency ($L$) is the time spent between the moment a given activity receives an input and the moment it returns an output;

• Service time ($T_s$) is the time between the acceptance of two consecutive inputs;

• Bandwidth ($B$) is the inverse of $T_s$;

• Completion time ($T_c$) is the overall latency of an application on a given input data set, i.e. the time between the acceptance of the first input and the delivery of the result for the last input.

⁸A non-functional concern relates to how results are computed, not to what is computed by an application.

⁹A non-functional parameter is a skeleton parameter not related to the functional business logic.

We can observe that for a sequential application $L = T_s$, while in a parallel one service time and latency may differ and depend on the parallelism degree. Notice that parallel applications may exhibit lower service times and higher latencies with respect to their sequential versions: this is not a problem in stream-based computations, where the service time is the key parameter, but it may be a problem in single-data computations.

Let us introduce a sequential application $M$ having service time $T_{seq}$, and its parallel version $\Sigma(n)$, with parallelism degree equal to $n$, having service time $T_{\Sigma}(n)$. We are interested in derived measures giving us a quantitative metric of how "good" $\Sigma(n)$ is with respect to the sequential version $M$. Before introducing these measures we have to define the concept of ideal service time as the ratio between the sequential time and the parallelism degree:
$$T_{id}(n) = \frac{T_{seq}}{n}$$

Now we can define the following derived measures for the parallel computation $\Sigma(n)$:

• Efficiency ($\epsilon$) is defined as the ratio between the ideal service time and the service time of the parallel version:
$$\epsilon_{\Sigma}(n) = \frac{T_{id}(n)}{T_{\Sigma}(n)}$$

• Speedup ($sp$) is defined as the ratio between the sequential time and the service time of the parallel version:
$$sp_{\Sigma}(n) = \frac{T_{seq}}{T_{\Sigma}(n)}$$

• Scalability ($sc$) is defined as the ratio between the service times of $\Sigma(1)$ and $\Sigma(n)$:
$$sc_{\Sigma}(n) = \frac{T_{\Sigma}(1)}{T_{\Sigma}(n)}$$

Over these definitions we can make the following considerations:

• Efficiency ($0 < \epsilon \leq 1$) tells us how close the effective performance is to the ideal one. In this sense it gives an idea of how well the resources are utilized. Observe that efficiency is a relative measure: some parallel systems may have low efficiency yet be very fast, others may have high efficiency yet be very slow.

• Speedup measures how good the parallel implementation is with respect to the sequential version.

• Scalability measures how good the parallel version is at achieving better performance with larger parallelism degrees. Scalability does not compare against the "best" sequential time, so situations with high scalability and low speedup are possible.
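As a worked example (numbers ours): suppose $T_{seq} = 100$ ms and the parallel version with $n = 8$ has $T_{\Sigma}(8) = 20$ ms and $T_{\Sigma}(1) = 120$ ms. Then $T_{id}(8) = 12.5$ ms, so $\epsilon_{\Sigma}(8) = 12.5/20 = 0.625$, $sp_{\Sigma}(8) = 100/20 = 5$ and $sc_{\Sigma}(8) = 120/20 = 6$: scalability exceeds speedup exactly because $\Sigma(1)$ is slower than the sequential version $M$.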

2.2.2 Approximate performance models

In the following we give approximate performance models for the skeletons of interest presented in 2.1.4. These models are "approximate" in the sense that they abstract away most of the characteristics of the underlying template or architecture, such as the overheads introduced by communications.

Pipe cost model

Let us assume our pipe has $n$ stages, where the $i$-th stage has service time $T_i$ and latency $L_i$. The service time of the pipe will be
$$T_{pipe} = \max_{1 \leq j \leq n}\{T_j\}$$
The pipe latency instead is:
$$L_{pipe} = \sum_{j=1}^{n} L_j$$

Notice that the pipe skeleton doesn't offer latency benefits with respect to a single sequential stage computing the composition of the functions. Actually, although the approximate model doesn't capture it, the latency increases due to the added communication overheads between stages.

Let $m$ be the length of a finite stream. The model for the completion time of a pipeline of $n$ stages, with an effective service time $T_{pipe}$ per stage, is defined as follows:
$$T_c = (n - 1) \cdot T_{pipe} + m \cdot T_{pipe}$$
The first addend represents the filling transient phase of the pipe, while the second one represents the duration of the steady-state phase plus the emptying transient phase. When $m \gg n$ we can approximate the formula as follows:
$$T_c \approx m \cdot T_{pipe}$$
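As a quick numerical illustration (numbers ours): a pipe with three stages having service times 5, 10 and 7 ms has $T_{pipe} = 10$ ms and $L_{pipe} = 22$ ms; on a stream of $m = 1000$ tasks, $T_c = (3 - 1) \cdot 10 + 1000 \cdot 10 \approx 10$ s. The slowest stage alone determines the steady-state throughput.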

Farm cost model

Let us assume the classical farm template consisting of emitter/collector processes and a set of $n$ workers (as shown in figure 2.3). In this template we can recognize a pipe of three stages (the emitter, an internal parallel stage of workers and the collector), so the service time of the farm will be
$$T_{farm}(n) = \max\left\{T_{emitter},\; \frac{T_{worker}}{n},\; T_{collector}\right\}$$
The farm latency instead is:
$$L_{farm}(n) = L_{emitter} + L_{worker} + L_{collector}$$
As we can see, the latency is greater than the sequential latency ($L_{worker}$).

Let $m$ be the length of a finite stream. The completion time is quite similar to the pipe case, therefore we can approximate it as follows:
$$T_c = m \cdot T_{farm}(n)$$

We eventually want to find the optimal parallelism degree for the farm, considering the mean interarrival time $T_A$ between two consecutive tasks coming from the input stream. The farm should accommodate its number of workers either according to the interarrival time (if $T_{emitter} \leq T_A$) or to the emitter/collector times (if $T_A \leq T_{emitter}$). So we can define $n_{opt}$ as the number of workers maximizing the bandwidth:
$$n_{opt} = \left\lceil \frac{T_{worker}}{\max\{T_A,\, T_{emitter},\, T_{collector}\}} \right\rceil$$
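As a quick numerical illustration (numbers ours): with $T_{worker} = 100$ ms, $T_A = 12$ ms, $T_{emitter} = 2$ ms and $T_{collector} = 3$ ms, the interarrival time dominates and $n_{opt} = \lceil 100/12 \rceil = 9$: any additional worker would sit idle waiting for input tasks.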

Map cost model

Let us assume a map template consisting of scatter/gather processes and a set of $n$ workers (as shown in figure 2.4). The scatter process is in charge of dividing the input data structure into different partitions and scheduling them to different workers. This "divide & schedule" phase can actually be implemented using an active entity or a passive one (a shared task queue); nevertheless it has a cost that is typically linear in the number of partitions/workers. The gather process is in charge of assembling the partitions back into a single data structure, and it too has a cost linear in the number of partitions/workers¹⁰. So, given a map having $n$ workers implementing some function $f$ with computation time $T_f$ and operating on an input data structure of size $M$, we can assume that each worker will work on data partitions of size $g = M/n$. Then the latency (and service time) of the map will be:
$$L_{map}(n, g) = L_{scatter}(n, g) + g \cdot T_f + L_{gather}(n, g)$$
Assuming linear scatter and gather implementations we can (1) simplify the model by assuming scatter and gather have the same latency and (2) approximate the scatter latency as follows:
$$L_{scatter}(n, g) = T_{divide}(g) + n \cdot T_{emitter} \approx n \cdot T_{emitter}$$
So we can rewrite the latency of the map as follows:
$$L_{map}(n, g) \approx 2 \cdot L_{scatter}(n, g) + g \cdot T_f \approx 2n \cdot T_{emitter} + \frac{M \cdot T_f}{n}$$
We eventually want to find the optimal parallelism degree:
$$\frac{dL_{map}(n, g)}{dn} = 2 \cdot T_{emitter} - \frac{M \cdot T_f}{n^2} = 0$$
from which we obtain:
$$n_{opt} = \left\lceil \sqrt{\frac{M \cdot T_f}{2\,T_{emitter}}} \right\rceil$$

¹⁰Actually a pipeline effect may exist, and the gather phase could be hidden in the final stage.
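As a quick numerical illustration (numbers ours): with $M = 10^6$ elements, $T_f = 1\,\mu s$ and $T_{emitter} = 50\,\mu s$, we get $n_{opt} = \lceil\sqrt{10^6/(2 \cdot 50)}\rceil = \lceil\sqrt{10^4}\rceil = 100$ workers, the point where the linear scatter/gather overhead balances the shrinking per-worker computation time.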

Reduce cost model

Let us assume a reduce template composed of $n$ workers organized in a vector. We can map the binary tree representing the reduce computation (see 2.1.4) onto the vector. Assuming that $T_\otimes$ is the time needed to apply the commutative and associative operator $\otimes$ to two elements, the latency (and service time) of a reduce over a vector of size $M$ using $n$ workers will be:
$$L_{reduce}(n, M) = T_\otimes \cdot \left(\frac{M}{n} - 1\right) + T_\otimes \cdot \log_2(n) \approx T_\otimes \cdot \frac{M}{n} + T_\otimes \cdot \log_2(n)$$
We may also be interested in finding the optimal parallelism degree for the reduce:
$$\frac{dL_{reduce}(n, M)}{dn} = \frac{T_\otimes}{n \ln(2)} - \frac{T_\otimes \cdot M}{n^2} = 0$$
from which:
$$n_{opt} = \lceil M \cdot \ln(2) \rceil$$
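As a quick numerical illustration (numbers ours): for $M = 1024$ the model gives $n_{opt} = \lceil 1024 \cdot \ln 2 \rceil = 710$. Such a large value mainly shows how idealized the model is: being approximate, it ignores the actual number of cores and all communication overheads, which in practice cap the useful parallelism degree much earlier.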

2.3 Skeleton transformation rules

Rewriting techniques have been exploited in a variety of contexts, from logics to linguistics. It is known that functional equivalences can be established between different skeleton nestings [24, 25, 26]. In the following we redefine the functional semantics of the farm and pipe skeletons in order to match the considerations made in 2.1.4 (when the map skeleton was introduced), and then we present some of the most common rewriting rules.

2.3.1 Refining the semantics for Pipe and Farm skeletons

Using the semantics defined in 2.1.4 we can't have compositions of skeletons such as $Map(Pipe(\Delta_1, \Delta_2))$ for type-incompatibility reasons. In the following we redefine the functional semantics of the pipe and farm skeletons as much simpler higher-order functions, and we introduce another higher-order function, streamer, modeling stream parallelism.

Definition 2.6 A pipe is a higher-order function taking in input

• an element $x$ having type $\alpha_1$
• a sequence of $n$ functions $\{f_1, f_2, \dots, f_n\}$, where $f_i$ has type $\alpha_i \to \alpha_{i+1}$,

and producing in output an element $y = f_n(\dots f_2(f_1(x)))$ of type $\alpha_{n+1}$.

Definition 2.7 A farm is a higher-order function taking in input

• an element $x$ having type $\alpha$
• a function $f$ with type $\alpha \to \beta$,

and producing in output an element $y = f(x)$ of type $\beta$.

The higher-order function streamer can be defined, with an associated parallel semantics, as follows:

Definition 2.8 A streamer is a higher-order function taking in input

• a stream of elements $\{x_1, x_2, \dots, x_m\}$, where each element has type $\alpha$
• a function $f$ with type $\alpha \to \beta$,

and producing in output a stream of elements $\{f(x_1), f(x_2), \dots, f(x_m)\}$ by computing $f$ on the elements of the input stream in parallel.

2.3.2 Rewriting rules

In the following we consider skeleton trees having pipe and comp nodes with at most two children. This is not a limitation, because we can have "pipes of pipes". We can now give the following rules establishing equivalences between stateless skeleton expressions.

• Farm introduction/elimination: at any point where we have a skeleton tree $\Delta$ we can insert a farm, and vice versa:
$$\Delta \equiv Farm(\Delta)$$

• Pipe introduction/elimination: pipes and comps are functionally equivalent, since both model compositions of functions:
$$Pipe(\Delta_1, \Delta_2) \equiv Comp(\Delta_1, \Delta_2)$$

• Comp associativity: a Comp is associative:
$$Comp(\Delta_1, Comp(\Delta_2, \Delta_3)) \equiv Comp(Comp(\Delta_1, \Delta_2), \Delta_3)$$

• Pipe associativity: a Pipe is associative:
$$Pipe(\Delta_1, Pipe(\Delta_2, \Delta_3)) \equiv Pipe(Pipe(\Delta_1, \Delta_2), \Delta_3)$$

• Map distributivity: we can always rewrite a map of a sequence as a sequence of maps:
$$Map(Comp(\Delta_1, \Delta_2)) \equiv Comp(Map(\Delta_1), Map(\Delta_2))$$

Notice that from the previous rules we can derive other rules, such as the following one:
$$Map(Pipe(\Delta_1, \Delta_2)) \equiv Pipe(Map(\Delta_1), Map(\Delta_2))$$
which is obtained from the following sequence of rewritings:
$$Map(Pipe(\Delta_1, \Delta_2)) \equiv Map(Comp(\Delta_1, \Delta_2)) \equiv Comp(Map(\Delta_1), Map(\Delta_2)) \equiv Pipe(Map(\Delta_1), Map(\Delta_2))$$

We can also give an idea of the general validity of the rewriting rules by informally proving that, with the functional semantics defined, given a rewriting rule $r$, the computation of $lhs(r)$¹¹ on some input $x$ leads to the same output as the computation of $rhs(r)$¹² on the same input $x$. Let us consider the basic farm introduction rule:
$$\Delta \equiv Farm(\Delta)$$
We know that $\Delta$ is a skeleton (possibly a composition of skeletons) modeling some higher-order function $f : \alpha \to \beta$ (possibly a composition of functions). Given an input element $x$ of type $\alpha$, the application of the skeleton expression $\Delta$ to $x$ yields $y = f(x)$ of type $\beta$. $Farm(\Delta)$, according to the semantics given in definition 2.7, models the same function $f : \alpha \to \beta$; therefore the computation of $Farm(\Delta)$ on the same input $x$ leads to the same result $y = f(x)$ of type $\beta$.

¹¹Left-hand side of the rewriting rule $r$.
¹²Right-hand side of the rewriting rule $r$.
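The mechanical nature of such rules is what makes them implementable as tree transformations. As a minimal sketch (ours, reusing the SkelNode type sketched in 2.1.3; the actual RPL-Shell rewriters are described in Chapter 4), farm introduction simply wraps a subtree in a new Farm node:

#include <memory>
#include <utility>

// Farm introduction: Delta == Farm(Delta). Wrapping any subtree in a
// Farm node preserves the modeled function; farm elimination is the
// inverse transformation (unwrapping a Farm with a single child).
std::unique_ptr<SkelNode> farm_intro(std::unique_ptr<SkelNode> delta) {
    auto farm = std::make_unique<SkelNode>();
    farm->kind = SkelNode::Kind::Farm;
    farm->children.push_back(std::move(delta));
    return farm;
}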

2.4 FastFlow

Figure 2.6: FastFlow architecture (taken from [29])

The framework used to test the final results of our methodology is FastFlow [4, 27, 28]. FastFlow is an algorithmic skeleton framework developed and maintained at the Departments of Computer Science of the Universities of Pisa and Torino. The framework provides programmers with efficient parallel exploitation patterns suitable for implementing stream-parallel applications. FastFlow has been designed according to the layered architecture shown in figure 2.6. The three layers composing FastFlow are the "building blocks" layer, offering basic low-level mechanisms for communication, synchronization and code wrapping; the "core patterns" layer, providing common stream-parallel patterns; and the "high-level patterns" layer, which provides other skeletons built on top of the core patterns, such as divide&conquer and parallel_for skeletons. In the following we give some FastFlow concepts for targeting stream and data parallelism in heterogeneous multi/many-core platforms. Together with the concepts we also give some code exemplifying their usage. For an effective guide, look at the tutorial in [29].

2.4.1 ff_node

The ff_node abstraction provides the parallel application programmer with a way to define sequential code, i.e. business logic code. On a multi-core, an ff_node object represents an active parallel activity. The programmer can use the class by simply writing a proper subclass of ff_node and implementing its svc method. A FastFlow ff_node can be defined as follows:

#include <ff/node.hpp>
using namespace ff;
struct myStage: ff_node {
    int svc_init() { // not mandatory
        // initialize the stage
        return 0;    // returning non zero means error!
    }
    void *svc(void *task) {
        // business code here, working on the input task
        return task; // may return a task or EOS, GO_ON, GO_OUT, EOS_NOFREEZE
    }
    void svc_end() { // not mandatory
        // finalize the stage, free resources, ...
    }
};

Listing 2.1: ff_node structure (taken from [29])

The svc method wraps the body of the concurrent activity: it is called every time a new input task is available (passed via the void* parameter) and produces one or more outputs (via the void* return value or via invocations of the ff_send_out method). If NULL is returned, the concurrent activity terminates itself.

2.4.2 ff_pipeline

The ff_pipeline abstraction represents a convenient way to instantiate pipe skeletons. The simplest way to create a pipeline is to define the stages as proper instantiations of ff_node and add them in the proper order to the pipeline by calling its add_stage method. The following example shows the usage of ff_pipeline:

#include <ff/node.hpp>
#include <ff/pipeline.hpp>
...
// first stage
struct stage1: ff_node { ... };
// second stage
struct stage2: ff_node { ... };
// third stage
struct stage3: ff_node { ... };

int main() {
    ...
    stage1 s1;
    stage2 s2;
    stage3 s3;

    ff_pipeline pipe;
    pipe.add_stage(&s1);
    pipe.add_stage(&s2);
    pipe.add_stage(&s3);

    pipe.run_and_wait_end();
    ...
}

Listing 2.2: ff_pipeline example of usage

2.4.3 ff_farm

The ff_farm abstraction provides the programmer with the possibility to instantiate a farm skeleton: the standard way to create a farm with n workers is to pack n pointers to ff_node objects into a std::vector and pass it as a parameter to the ff_farm constructor. The following example shows the usage of ff_farm:

#include <ff/farm.hpp>
using namespace ff;
...
int main() {
    ...
    std::vector<ff_node*> workers;
    for (int i = 0; i < n; i++)
        workers.push_back(new Worker());
    ff_farm<> farm(workers);
    ...
}

Listing 2.3: ff_farm example of usage

2.4.4 Data parallelism in FastFlow

Convenient abstractions for map-like and reduce-like patterns are FastFlow's ParallelFor and ParallelForReduce patterns.

• The ParallelFor pattern can be used to parallelize loops with independent iterations. It offers different parallel_for methods;

• The ParallelForReduce pattern can be used to parallelize loops with independent iterations and having reduction variables. It offers several parallel_reduce and parallel_for methods;
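As an illustration, a minimal self-contained use of ParallelFor might look as follows (a sketch based on the FastFlow API; the vector contents and the number of workers are our arbitrary choices):

#include <vector>
#include <ff/parallel_for.hpp>
using namespace ff;

int main() {
    const long N = 1000;
    std::vector<double> data(N, 1.0);

    ParallelFor pf(4); // at most 4 worker threads
    // the iterations are independent, so they can run in parallel
    pf.parallel_for(0, N, [&](const long i) {
        data[i] *= 2.0;
    });
    return 0;
}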

In order to combine stream and data parallel computations in a two-tier fashion, a convenient abstraction is given by the ff_Map node, which is just an ff_node_t wrapping a ParallelForReduce pattern. The following example shows the usage of the ff_Map abstraction:

struct mapStage: ff_Map<task_t> {
    mapStage() : ff_Map<task_t>(ff_realNumCores()) {}
    task_t *svc(task_t *task) {

        // this is the parallel_for provided by the ff_Map class
        ff_Map<task_t>::parallel_for(0, SIZE, [&task](const long i) {
            task->first.operator[](i) += task->second.operator[](i);
        });

        for (size_t i = 0; i < SIZE; ++i)
            std::cout << task->first.operator[](i) << " ";
        std::cout << "\n";
        delete task;
        return GO_ON;
    }
};

Listing 2.4: ff_Map example (taken from [29])

The use of an ff_Map rather than a plain ff_node wrapping a ParallelForReduce is strongly suggested, because in the former case the runtime knows that the node is internally parallel and can apply more optimizations, such as disabling/enabling the scheduler or a better mapping of threads to cores.


Chapter 3

RPL-Shell: Design

This chapter introduces the logical design of RPL-Shell, our DSL-based tool supporting the design of parallel applications. After presenting the requirements and functionalities of the tool, we first give a high-level overview of the project, and then provide a description of its founding components.

3.1 Requirements and functionalities

RPL-Shell aims to provide the programmer with a tool facilitating the design and optimization of structured parallel applications. The tool takes the form of a command-line (textual) shell and offers a scripting language that allows the user to:

• express parallel pattern expressions (nestings of algorithmic skeletons)
• compute non-functional properties exposed by the patterns
• import/generate C++ code (still preliminary)

The skeleton set currently managed by RPL-Shell is composed of some of the most common skeletons, already introduced in Chapter 2 and available in different skeleton-based programming frameworks¹ [13, 30, 22, 21]. These skeletons are briefly summarized in table 3.1.

¹At least a subset of these skeletons may be found in most skeleton-based programming environments.

Pattern   Description
Seq       Sequential code wrapper
Comp      Sequential composition
Pipe      Stage-parallel computation on consecutive items of a stream
Farm      Embarrassingly stream-parallel computation
Map       Embarrassingly data-parallel computation
Reduce    Tree-shaped parallel computation of the "sum" of all the items in a collection

Table 3.1: Current skeleton set supported

Given this skeleton set, a wide range of applications can be modeled, including:

• video stream processing: pipe(videosource, filter1, filter2, videodrain)
• map-reduce applications: pipe(map(...), reduce(...))

Given a simple parallel pattern expression, the user is guided in the design phase by proper mechanisms for non-functional property management and for the exploration of the space of functionally-equivalent solutions. Such mechanisms include the possibility to:

• annotate patterns in order to modify values of non functional parameters of interest (such as servicetime for seq, or pardegree for farm)

• evaluate non functional properties of patterns, i.e. get/compute values associated to a pattern for non functional parameters of interest (e.g. get/compute the servicetime for a pipe).

• optimize patterns, i.e. tune the values of non functional parameters associated to a pattern in order to get the best estimated performance (according to some model). They may also change the structure of the patterns (e.g. merge the fastest consecutive stages of a pipe on condition that the overall service time will not increase).

• rewrite patterns, i.e. apply one or more rewriting rules to generate new functionally equivalent patterns.

• modify parameters of interest such as the dimension of the stream or the maximum number of available resources in the system.


Figure 3.1: A high level view of RPL-Shell components

With such functionalities available, the user can explore the space of functionally equivalent solutions, tune their non functional parameters by hand or through the use of known optimizations, ask for approximate evaluations and eventually pick the best solution. Finally, he can generate the code automatically, or write it by hand with the evergreen coding activity.

3.2 Design overview

In order to meet the requirements mentioned in the previous section, the system needs to implement a domain specific language capable of describing the parallel structure of applications, and to provide the user with all the mechanisms supporting annotations, evaluations, optimizations, refactoring, import/generation of code and, finally, modification of the work environment. All these mechanisms have a direct correspondence with specific constructs of the language that will be introduced in 3.3.1. Figure 3.1 instead shows a high-level view of the principal logical components of RPL-Shell, each one dealing with a different kind of mechanism. As we can see, the principal logical components of the system are:

• Shell implements the "front-end" of the system by directly interacting with the user; it uses three different components, the lexer, the parser and the interpreter, which actually implement the domain specific language.


• Annotators are procedures assigning an annotation for some non-functional parameter of interest to the outer node of the skeleton expression associated with some name. They implement annotation mechanisms.

• Evaluators are procedures computing the value of a property of a given non-functional concern for one or more skeleton expressions associated with a name. They implement evaluation mechanisms.

• Optimizers are procedures optimizing the value of a property of a given non functional concern (e.g. the service time) for some name in the environment, either by modifying non functional parameters that impact the computation of that value (evaluation), or by changing the structure of the skeleton tree associated to that name. They implement optimization mechanisms.

• Rewriters are procedures changing the structure of a skeleton expression by applying one or more known rewriting rules. They implement rewriting mechanisms.

• Gencode is a procedure that generates FastFlow code starting from a parallel pattern expression whose sequential parameters represent some imported business logic.

In 3.4 we will discuss these components in more detail. An important part of the game that is not shown in the picture is the environment, where associations between names and skeleton trees representing pattern expressions are stored, so that the shell user can subsequently reuse them.

3.3 The shell

The shell represents the main logical component of the system. It is actually composed of four components: rplsh, lexer, parser and interpreter. The first implements a basic command-loop, while the latter three implement the language itself. In the following we will show the syntax of the domain specific language implemented in the shell and then a high-level view of the inner shell structure.


3.3.1 Domain specific language

Here we will introduce the language proposed and the main design choices. The syntax used to describe the language is expressed in a pseudo Extended Backus-Naur Form (EBNF) with the following assumptions:

• non terminal symbols are delimited by < and > symbols;

• the curly brackets { and } denote a term that may be present just once, or not at all;

• the curly brackets { and } followed by the repetition symbol * denote a term that may be present multiple times, or not at all;

The domain specific language has been designed by separating the actual language expressing pattern composition from the specific verbs implementing the required mechanisms introduced in the previous sections. Listings 3.1 and 3.2 show respectively the grammars for pattern expressions and verbs. Assignments (not shown in the listings) have the usual syntax and they introduce names in the environment with an associated skeleton tree (that is, the tree representation of the parallel pattern expression).

Listing 3.1: Syntax for composing the parallel patterns

<pattern>     ::= <seq-patt> | <stream-patt> | <data-patt>
<seq-patt>    ::= seq({<number> {,<boolean>}})
                | source({<number>})
                | drain({<number>})
                | comp(<pattern>{,<pattern>}*)
<stream-patt> ::= pipe(<pattern>{,<pattern>}*)
                | farm(<pattern>{,<integer>})
<data-patt>   ::= map(<pattern>{,<integer>})
                | reduce(<pattern>{,<integer>})
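As an illustration, the following assignments (whose syntax is the usual one, as noted above) are plausible well-formed expressions with respect to this grammar; the names and the numeric annotations are arbitrary, and the exact lexical form of the <boolean> literal is detailed in Appendix A:

a = seq(10)
b = seq(5, true)
p = pipe(a, farm(b, 4))
m = map(comp(a, b))

Here a is a sequential wrapper with an annotated time of 10, b declares a per-item time of 5 (see 3.3.2), p is a two-stage pipe whose second stage is a farm with parallelism degree 4, and m nests a sequential composition inside a map.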

We decided to keep an effective and concise syntax for the representation of the skeletons in order to give the user the possibility to nest any pattern inside any other pattern² in an intuitive way. However, in stream and data parallel patterns we can also annotate the parallelism degree (non terminal symbol integer) while defining the starting parallel expression. Particular attention needs to be given to the parameters of sequential nodes:


• non terminal symbol <number> specifies a ground-annotation for the sequential service time (or latency)

• non terminal symbol <boolean> specifies whether the provided time has to be intended as the time to compute the whole input data structure (e.g. a vector) or only a single item of the input data structure (e.g. an element of a vector).

The last point is particularly important since, in order to model data-parallel computations, we need to differentiate the behavior of a sequential node as a function of the size of the data structure. This choice will be explained later in 3.3.2.

Listing 3.2: Syntax for the shell verbs

<verbs>    ::= <set> | <show> | <annotate> | <rewrite> | <optimize>
             | <history> | <import> | <gencode>
<set>      ::= set <global-environment-param> with <number>
<show>     ::= show <global-environment-param>
             | show <id-name> {by <param>{,<param>}* {+<number>}}
<annotate> ::= annotate <id-name> with <param> <value>
<rewrite>  ::= rewrite <id-name> with <rrule>{,<rrule>}*
<optimize> ::= optimize <id-name> with <orule>{,<orule>}*
<history>  ::= history {<id-name>}
<import>   ::= import "<file-name>"
<gencode>  ::= gencode <id-name>

We decided to have a more verbose grammar for verbs, possibly close to "natural language", in order to give the user an immediate sense of what he is writing as well as of the semantics of the actions. The choice to keep distinct verbs for the distinct mechanisms, and the use of complete self-explanatory names for parameters and rules, also moves in this direction. Details about the semantics of the verbs and their parameters lie outside the context of this chapter and will be provided separately (see Appendix A: User Manual). However, 3.4 will show the correspondence between the logical components and the verbs of the language, giving a quite clear picture of their semantics.
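Just to give the flavor of the verbs, a short hypothetical session is sketched below, assuming a and b are previously defined sequential wrappers; servicetime is a parameter name and pipelim a rewriting rule name, both appearing elsewhere in this document, while the complete sets of parameters and rules are listed in Appendix A:

p = pipe(farm(a), farm(b))
annotate a with servicetime 10
show p by servicetime
rewrite p with pipelim
gencode p

The first command binds the name p in the environment; annotate sets the servicetime annotation on the seq node a; show evaluates the service time of p; rewrite applies the pipelim rewriting rule to generate functionally equivalent alternatives; gencode finally asks for FastFlow code generation.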


3.3.2 Separate semantics for seq nodes

Let us assume to have two sequential skeletons s1 and s2 (candidates to be used inside a map) with annotations for their servicetime (see 3.4.1); we would like to model data parallel computations with an input data structure of size greater than one, and in such a way that the annotated servicetime values for the sequential nodes s1 and s2 are measurements of the computation for a single item of the data structure, or better, of a data structure of size equal to one. In such a case we can legitimately assume that the sequential time of the two sequential patterns can also be modeled as a function of the input size, and that a map computation is functionally equivalent to the one made sequentially³.

For such seq nodes we can introduce a mapelim rewriting rule:

mapelim : map(∆) → ∆

Now we can use the same sequential skeleton both in stream and data parallel computations, provided that the computation of the service time of an expression takes into account also the input size of the data structure. Figure 3.2 shows the intuitive results of two visits, one top-down assigning the input size to each node of the skeleton tree representing the expression, the other bottom-up computing the service time.

(a) Farm [nw=2, size=10, Ts=30]
      Map [nw=5, size=10, Ts=60]
        Comp [size=2, Ts=60]
          s1 [size=2, Ts=20]
          s2 [size=2, Ts=40]

(b) Farm [nw=2, size=10, Ts=150]
      Comp [size=10, Ts=300]
        s1 [size=10, Ts=100]
        s2 [size=10, Ts=200]

Figure 3.2: Two skeleton trees. Service times for computing a single element are 10 and 20 respectively for s1 and s2. For each node, nw specifies the number of workers, size the size of the input data structure and Ts the servicetime. Input size is 10.
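The numbers in the figure can be reproduced by hand. In tree (b) the comp works on the whole data structure of size 10, so Ts(s1) = 10 x 10 = 100 and Ts(s2) = 10 x 20 = 200, giving Ts(comp) = 300; the farm with 2 workers then yields Ts = 300/2 = 150. In tree (a) the map with 5 workers partitions the 10 items into chunks of size 2, so the inner comp computes a chunk in 2 x 10 + 2 x 20 = 60, which is also the map service time; the outer farm with 2 workers yields Ts = 60/2 = 30.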

³In this case we can see seq nodes s1 and s2 as functional maps (or functional reduces) taking in input a data structure of some dimension and embedding a function (for which we know the computation time on a single value) that knows how to update each element of the structure.


Let us now assume that a and b are sequential wrappers, that the input size is greater than one, and consider the following parallel pattern expression

p = pipe(a, map(b))

This is a stream parallel program with data parallelism inside; the user knows that b is code that updates an element of the input data structure (e.g. a cell of a matrix) and so he can put an estimation for it, but he may know absolutely nothing about a⁴ except for its service time obtained by profiling. In this case the user needs an annotation mechanism to distinguish which sequential wrappers are suitable to be data parallel parameters and which are not: the mechanism proposed is to annotate a sequential wrapper with a boolean flag datap (see 3.3.1 and 3.4.1) indicating whether the servicetime provided refers to the time to compute a sub-element of the input data structure or the whole data structure composed by inputsize sub-elements.
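For instance, following the annotate grammar of Listing 3.2, marking b as suitable for data parallel use would plausibly read as follows (the concrete boolean literal is an assumption; Appendix A gives the exact form):

annotate b with datap true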

3.3.3 Shell user interface

The user interacts with the shell by providing commands. Commands may be assignments (e.g. p = pipe(farm(a), farm(b))) or verbs (e.g. rewrite p with pipelim). Figure 3.3 shows the general picture:

Figure 3.3: The shell inner working

The shell-loop takes in input a command line and performs the following steps:

⁴It could be a wrapper for some unknown library routine.
