
Università degli Studi di Pisa, Scuola Superiore Sant’Anna

Master’s Degree in Computer Science and Networking

Master Thesis

Multi-Core Approach for Real-Time Decoding in Fiber Optic Sensing Systems

Candidate: Filippo Venuti
Supervisor: Marco Vanneschi
Supervisor: Fabrizio Di Pasquale

Academic Year 2010/2011


Contents

1 Introduction

2 Distributed Fiber Optic Sensor
 2.1 Sensing Principles
  2.1.1 Distributed Sensing
 2.2 Coding
 2.3 Decoding Architecture

3 Decoding Matrix
 3.1 The Coding Matrix
 3.2 The Decoding Matrix
  3.2.1 Circulant Property
  3.2.2 Sparsity Property
  3.2.3 Sparse-Circulant Matrix

4 Sparse Matrix-Vector Multiplication
 4.1 SpMxV Data Storage Schemes
  4.1.1 Compressed Storage Row
  4.1.2 Compressed Storage Column
  4.1.3 Compressed Coordinate Format
 4.2 Sequential Algorithm
 4.3 Performance Degradations
  4.3.1 Memory Intensity
  4.3.2 Indirect Memory References
  4.3.3 Row Length
 4.4 Optimized Sequential Algorithm

5 Parallelization Techniques
  5.0.1 Matrix Decomposition
 5.1 Data Parallel
  5.1.1 Map
   Cost Model
  5.1.2 Map-Reduce
   Cost Model
 5.2 Stream Parallelism
  5.2.1 Stream of arrays of size N
  5.2.2 Stream of single elements
   Output Reconstruction
  5.2.3 Stream of subsets of X
   Cost Model

6 Parallel Implementation
 6.1 Intel TBB
 6.2 Data Parallel Implementations
  6.2.1 Map
  6.2.2 Map-Reduce
 6.3 Stream Parallel Implementations
  6.3.1 Stream of X
  6.3.2 Stream of Integers
  6.3.3 Stream of Subsets

7 Experimental Results
 7.1 Performance Parameters
 7.2 Test Methodology
 7.3 Data-Parallel Results
  7.3.1 Fixed Input Size
  7.3.2 Fixed Code Size
 7.4 Stream Parallelism Results
  7.4.1 Fixed Input Size
  7.4.2 Fixed Code Length

Conclusions

Appendices

B Software Interface
 B.1 Single Operation
 B.2 Test Execution


List of Figures

2.1 Distributed Sensing - Example
2.2 Multipoint Sensing - Example
2.3 Spectra of scattered light in optical fiber
2.4 Distributed Temperature Sensing - Single Pulse vs Coded (127 bits) [10]
2.5 Fiber Temperature Sensing with Coding - Result [11]
2.6 System Overview
2.7 Decoder Input - Example
2.8 Decoder Output - Example
4.1 Example of CSR storage format
4.2 Example of CSC storage format
4.3 Example of COO storage format - Assuming row-major ordering
5.1 Matrix Data Decomposition Schemes
5.2 Virtual Processor Reduce Operation
5.3 Logarithmic implementation of a generic reduction operation
5.4 Farm with an input stream made by arrays of size N
5.5 Reduce with an input stream made by single elements of X
5.6 Single Element Stream - Example
5.7 Data involved in the Output Reconstruction operation
5.8 Division of input vector Y in sparse areas in case of N = 8 and s = 2
5.9 Output Reconstruction operation example - Assuming N = 8, code size = 4
5.10 Decoding Matrix - Assuming N = 6, code size = 3
5.11 Farm with an input stream made by subsets of X, case N = 6 and code size = 3
7.1 Code Length effect on Scalability in case of N = N_max = 60000 and 1D matrix decomposition
7.3 Completion Time comparison with increasing code length and fixed input size N = N_max = 60000
7.4 Code length effect on efficiency - Input size fixed
7.5 Code length effect on efficiency - Input size fixed
7.6 Input size effect on scalability - 1D matrix decomposition
7.7 Input size effect on scalability - 2D matrix decomposition
7.8 Completion Time comparison with increasing input size and fixed code length of code size = 3000
7.9 Code size effect on scalability - MODE1
7.10 Code size effect on scalability - MODE1 ALT
7.11 Code size effect on scalability - MODE3
7.12 Code size effect on scalability - Overview
7.13 Code size effect on completion time - Stream Parallelism
7.14 Input Size effect on scalability - MODE1
7.15 Input Size effect on scalability - MODE1 ALT
7.16 Input Size effect on scalability - MODE3
7.17 Input Size effect on scalability - Overview


Acknowledgments

I would like to thank both my advisors, Prof. Marco Vanneschi and Prof. Fabrizio Di Pasquale, for giving me the opportunity to work on this subject, and for their always inspiring suggestions and comments throughout the writing of this dissertation. I am sure it would not have been possible without their help.

I also want to thank Tiziano Nanniperi and Alessandro Signorini of the Photonics Laboratory of Scuola Superiore Sant’Anna, who helped me in the initial phase of the thesis, and the people of the High Performance Computing Laboratory of the Department of Computer Science of the University of Pisa, who gave me the opportunity to use their equipment for the experimental tests.

I would also like to thank all my friends, both the new ones I made in these years and the old ones. Their support along the whole path has been incredible; without them, this would not have been possible.

Last but not least, I thank my parents and my brother for being such an outstanding example, which gave me the strength to carry on and never give up.


Chapter 1

Introduction

This thesis deals with the design, testing and implementation of a decoding algorithm on a multi-core architecture, to provide real-time detection capabilities to an optical fiber sensing system.

Optical fiber sensors are attracting the interest of many fields, ranging from civil and geo-technical engineering to the oil and gas industry, from energy management to railway, highway and structural health monitoring. Optical fiber sensors take advantage of optical pulse coding techniques to extend the sensing range of the fiber while keeping meter or sub-meter scale spatial resolution.

The introduction of coding techniques required the addition of a decoding process applied to the coded traces. This decoding process is needed to reconstruct the single-pulse response of the fiber and, from it, the physical parameters of the fiber. The decoding process is ideally applied to a data stream resulting from an analog-to-digital converter, characterized by stringent temporal constraints of about a fraction of a second. Presently, due to the amount of calculations required by the decoding process, a large processing time hinders an acceptable sensing performance. Note that some specific sensing application fields, such as point or distributed dynamic strain measurements, would require a real-time decoding process [1] [2].

In this scenario, the thesis investigates possible solutions to limit the time needed by the decoding process, based on a multi-core processor architecture. The structure of the thesis resembles the overall process of development. In detail, the analysis started with an overview of the fiber optic sensing technology, as reported in Chapter 2. In this chapter the main characteristics of distributed sensing are introduced, and some examples of how coding techniques can be exploited to improve the system performance are provided. The chapter ends with the description of the decoding architecture that is going to be modelled.

The analysis continues in Chapter 3, in which the decoding matrix used to perform the decoding is analyzed. The basic procedure of matrix calculation is first presented and then analyzed, focusing on the characteristics that could be exploited in the actual implementation of the decoding procedure.

Chapter 4 exploits the structure of the matrix derived in Chapter 3 to provide an efficient sequential algorithm. Moreover, this chapter introduces the main known limitations of such an operation on modern multi-processor architectures. Subsequently, the sequential algorithm is further optimized taking those limitations into account.

Chapter 5 investigates the parallelization techniques that could be applied to execute the decoding operation in parallel. At first, two possible ways to implement a parallel decoding operation are analyzed. Afterwards, possible schemes supporting stream processing, which could be implemented in actual systems to increase the decoding performance, are considered.

Chapter 6 presents the actual implementation of the schemes presented in Chapter 5. The chapter focuses mainly on the presentation of the programming tool used, Intel TBB, and on how it is used to implement the schemes, following the same structure as the previous chapter. For each implementation the corresponding code is also provided, omitting however all the auxiliary functions.

Chapter 7 presents the experimental results of the schemes presented in Chapter 5 and implemented in Chapter 6. The test methodology and a short overview of the performance parameters used are provided before presenting and describing the achieved results.


Chapter 2

Distributed Fiber Optic Sensor

Optical fibers have many features that make them well suited for sensing [14]. The main advantage is that an optical fiber can be used to make distributed measurements over long distances. Furthermore, optical fiber sensors only require a power supply at the interrogation unit and are immune to electromagnetic interference.

2.1 Sensing Principles

In [14], two types of optical fiber sensing technology are identified: the first one is called distributed sensing and the second one is called multipoint sensing.

Most distributed sensing technology is based on the Optical Time Domain Reflectometer. Many different approaches have been proposed to realize distributed sensing; for example, it is possible to measure the strain and/or the temperature applied to an optical fiber through the so-called Brillouin optical time domain reflectometry (B-OTDR) technique. On the other hand, Raman OTDR (R-OTDR) allows distributed temperature measurement only. All these techniques rely upon different properties of the light that is guided into the fiber: the first one relies upon the so-called Brillouin effect, while the latter relies upon the Raman effect. In the following these effects will not be further analyzed, and only a brief overview will be provided.

Multipoint optical fiber sensing can also be performed using Fiber Bragg Gratings. This kind of optical sensors are called point sensors, and they allow for both static and dynamic measurements of physical parameters such as temperature, strain, pressure, humidity, magnetic and electric fields, etc.


Figure 2.1: Distributed Sensing - Example

Figure 2.2: Multipoint Sensing - Example

Figures 2.1 and 2.2 show the basic principle of the distributed and multipoint techniques. The main difference lies in the physical effect exploited and in the way in which the sensing is performed: the distributed approach provides a continuous measurement along the whole fiber length, while the multipoint approach provides measurements only at predefined parts of the fiber.


The aim of this work is to provide a suitable architecture to enable real-time decoding for distributed sensing systems based on pulse coding. In the next section the basic principles of distributed sensing will be presented.

2.1.1 Distributed Sensing

Most distributed sensing techniques exploit the so-called Optical Time Domain Reflectometer, or simply OTDR. The working principle of an OTDR consists of measuring the backscattered light coming back from distributed points along the fiber.

It is important to underline the fact that the backscattered light is just a small portion of the input light. In general, light travelling in an optical fiber is subject to linear and non-linear effects causing it to be scattered. Notice that scattering can be induced by many different physical events, and a scattering event can reflect light in any direction. The OTDR measures only the portion of light that has been reflected backwards, or backscattered, which is a very small portion of the overall light.

There are three types of scattered light generated in an optical fiber:

• Rayleigh backscattered light, at the same frequency as the input signal, let us call it ν;

• Brillouin backscattered light, at about ν ± 10 GHz, for silica fibers;

• Raman backscattered light, at about ν ± 13 THz, for silica fibers.


Figure 2.3: Spectra of scattered light in optical fiber (Rayleigh, Brillouin and Raman components vs. wavelength; the intensity of the scattered light varies depending on temperature, while its frequency varies depending on strain)

The various backscattering processes provide different sensing capabilities. While the Brillouin backscattered light can be used to measure strain and/or temperature, since its frequency varies depending on both these physical parameters, the Raman backscattered light can only be used to measure temperature, since only its intensity varies depending on the temperature.

These effects are what enables distributed sensing. As shown in Figure 2.3, the main limitation of this sensing technology is the intensity of the backscattered light: even if the input signal at frequency ν is strong, the backscattered light is very low in power.

2.2 Coding

All OTDR-based technologies and applications are affected by a trade-off between SNR (Signal-to-Noise Ratio) and spatial resolution.

Coding techniques have been introduced to overcome this limitation. The working principle of coding is to utilize coded sequences of unit short pulses to obtain a higher trace level, and thus a better SNR from the increased effective pulse-width/power of the probe, while keeping the spatial resolution that characterizes short pulses. Coding, however, requires an additional step to obtain a meaningful result: the final OTDR trace, in conventional shape but with the improved SNR, can be restored from the coded traces by a decoding process.

The use of coding in optical fiber sensors improves the performance of the system in terms of resolution and maximum sensing distance. For example, Figure 2.4, extracted from [11], reports the improvement achieved by a 255-bit S-code for temperature sensing with respect to a conventional system. Without going into detail on the experimental set-up and the results, we can say that the improvement, in both the stability of the measured temperature and the maximum reachable length, is significant. Other studies confirming similar results are presented in [12], [13], [4] and [5].

Another advantage of coding is related to the improvement in resolution, as shown in [10]. Figure 2.5 compares a traditional OTDR trace and a coded OTDR trace, clearly pointing out SNR improvement when using coding.

Figure 2.4: Distributed Temperature Sensing - Single Pulse vs Coded (127 bits) [10]

Figure 2.5: Fiber Temperature Sensing with Coding - Result [11]

Note that coding techniques exploit the basic idea of sending codewords instead of single pulses to improve the measurement SNR within the same measurement time.

2.3 Decoding Architecture

As introduced in Section 2.2, the decoding operation is needed to recover the final OTDR trace. The generic scheme of the decoding procedure is shown in Figure 2.6.

Figure 2.6: System Overview

The scheme is characterized by various elements. In detail, three main blocks can be identified:

• An Analog-to-Digital Converter, or ADC, at the input of the system;

• An Average module;

• A Decoder module.

The ADC module is in charge of executing an analog-to-digital conversion. The optical signal, once detected, is an analog signal, since it is continuous in time, and thus it is necessary to convert it into a flow of digital values. This operation is done by the ADC module. For our purposes, an ADC module is characterized by two parameters: the rate and the resolution.

The rate parameter characterizes the speed at which new digital values are sampled from the analog signal; it is called the sampling rate or sampling frequency of the converter, f_s. The sampling rate f_s defines the number of samples per unit of time taken from a continuous signal to make a discrete signal. For time-domain signals, as in our case, the unit of the sampling rate is hertz, defined as s^-1; it is sometimes also noted as Sa/s, samples per second. Today's technology is capable of providing ADCs of about 500 MSa/s.

The resolution parameter indicates the number of discrete values that can be produced over the range of analog values. The values are usually stored electronically in binary form, so the resolution is usually expressed in bits. As a consequence, the number of available discrete values, or "levels", is a power of two. For example, an ADC with a resolution of 16 bits can encode an analog input to one of 2^16 different levels, thus 65536 different discrete values. As introduced, the integer values are stored electronically in binary form; the actual format depends upon the device used. The most common schemes are:

• Signed Binary Integers;

• Gray Integer Scheme.

The Average module is self-explanatory: it simply provides averaging among different traces.

The Decoder module is in charge of decoding the received trace. This thesis studies a suitable architecture to provide decoding through the exploitation of multi-core systems, enabling decoding in the order of milliseconds. This module is characterized by three inputs:

• u_bit is an integer value that represents the number of bits to be used to transform the received bit sequence into an actual integer value;

• code_pattern is an input over which the code pattern used to encode the signal is provided. This parameter is needed to produce the decoding matrix used for the decoding;

• s, which represents the value

  s = N / code_size    (2.1)

and describes the relation between the dimension N of the input and that of the code. This information is needed by the decoder to calculate the decoding matrix. For example, with N = 60000 samples and code_size = 3000, s = 20.

Some upper bounds on the system parameters can be identified. These are useful to determine the maximum performance of the system.

Variable   Value    Unit
N_max      60000    Samples
n_bit      ≤ 16     bit
f_s        ≤ 500    MSa/s

The operation that the decoder needs to implement is a matrix-vector multiplication. Indeed, the received data X is a coded vector. Graphically, the receiver receives a trace like the one shown in Figure 2.7; this data is meaningless until decoded. The decoder module needs to obtain a resulting trace as shown in Figure 2.8. Such an operation is described by a matrix-vector multiplication, or MxV.

Figure 2.7: Decoder Input - Example

Figure 2.8: Decoder Output - Example

The MxV operation is the kernel of the decoder. To provide real-time, i.e. millisecond-scale, sensing capabilities, the decoding operation must be executed exploiting current computer architectures. In the following, we study a possible decoder based on shared-memory architectures.


Chapter 3

Decoding Matrix

As introduced in Chapter 1, the operation of the decoder is a matrix-vector multiplication, MxV in the following. This chapter analyzes the structure of the matrix.

Before starting the matrix analysis, some notation needs to be introduced:

• D represents the decoding matrix;

• C represents the coding matrix;

• p represents the code pattern;

• i represents the code pattern size;

• n represents the number of samples parameter;

• s represents the following ratio:

  s = n / i    (3.1)


The decoding matrix D is formally defined as:

$$D = \begin{pmatrix}
C^{-1}_{1,1}I & C^{-1}_{1,2}I & \dots & C^{-1}_{1,i}I \\
C^{-1}_{2,1}I & C^{-1}_{2,2}I & \dots & C^{-1}_{2,i}I \\
C^{-1}_{3,1}I & C^{-1}_{3,2}I & \dots & C^{-1}_{3,i}I \\
\vdots & \vdots & \ddots & \vdots \\
C^{-1}_{i,1}I & C^{-1}_{i,2}I & \dots & C^{-1}_{i,i}I
\end{pmatrix} \quad (3.2)$$

in which I is an identity matrix of size s and C^{-1}_{i,j} is the element in position (i, j) of the matrix C^{-1}.

In the following the various elements of D are going to be analyzed.

3.1 The Coding Matrix

The coding matrix C is generated from the code pattern p. The code pattern, defined in Equation 3.3, is a one-dimensional array of size i that represents the actual coding used in the OTDR trace.

$$p = \begin{pmatrix} d_1 & \dots & d_i \end{pmatrix} \quad (3.3)$$

This is the only information needed to calculate the coding matrix C. Indeed, as shown in Equation 3.4, each row of the coding matrix C is composed of exactly the elements of the code pattern; furthermore, the i-th row of C contains the code pattern right-rotated by i.

$$C = \begin{pmatrix}
d_1 & d_2 & \dots & d_i \\
d_i & d_1 & \dots & d_{i-1} \\
d_{i-1} & d_i & \dots & d_{i-2} \\
\vdots & \vdots & \ddots & \vdots \\
d_2 & d_3 & \dots & d_1
\end{pmatrix} \quad (3.4)$$
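As a small illustrative sketch (0-based indexing; the helper build_coding_matrix is hypothetical, not part of the thesis code), the construction of C from p by right rotations can be written as:

    #include <vector>

    // Row r of C is the code pattern p right-rotated by r positions (cf. Equation 3.4).
    std::vector<std::vector<int>> build_coding_matrix(const std::vector<int>& p) {
        const int n = static_cast<int>(p.size());
        std::vector<std::vector<int>> C(n, std::vector<int>(n));
        for (int r = 0; r < n; r++)
            for (int c = 0; c < n; c++)
                C[r][c] = p[(c - r + n) % n];   // right rotation by r
        return C;
    }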



The structure of C is periodic and similar to a Toeplitz matrix, or diagonal-constant matrix: a matrix in which each diagonal descending from left to right is constant. An example is shown in Equation 3.5.

$$\begin{pmatrix}
a & b & c & d & e \\
f & a & b & c & d \\
g & f & a & b & c \\
h & g & f & a & b \\
i & h & g & f & a
\end{pmatrix} \quad (3.5)$$

Notice that, besides being a Toeplitz matrix, another characteristic can be identified from its structure. Indeed, differently from a generic Toeplitz matrix like the one shown in Equation 3.5, our coding matrix can also be defined as a circulant matrix, due to the relation that exists between its rows.

A useful result is shown in [7]: it has been proved that the inverse of a circulant matrix is itself a circulant matrix. Thus not only C is circulant, but also its inverse C^{-1} is circulant.

3.2 The Decoding Matrix

Due to the results of Section 3.1, it can be inferred that the decoding matrix D is a circulant matrix as well. Previously, D has been defined as:

$$D = \begin{pmatrix}
C^{-1}_{1,1}I & C^{-1}_{1,2}I & \dots & C^{-1}_{1,i}I \\
C^{-1}_{2,1}I & C^{-1}_{2,2}I & \dots & C^{-1}_{2,i}I \\
C^{-1}_{3,1}I & C^{-1}_{3,2}I & \dots & C^{-1}_{3,i}I \\
\vdots & \vdots & \ddots & \vdots \\
C^{-1}_{i,1}I & C^{-1}_{i,2}I & \dots & C^{-1}_{i,i}I
\end{pmatrix} \quad (3.6)$$


3.2.1 Circulant Property

From the observation that C and C^{-1} are both circulant matrices, the decoding matrix D can be simplified as shown in Equation 3.7.

D =        C1,1−1I C1,2−1I . . . C1,i−1I C1,i−1I C1,1−1I . . . C1,i−1−1 I .. . ... . .. ... C1,2−1I C1,3−1I . . . C1,1−1I        (3.7)

This representation of D is immediately identified as a circulant matrix, since the j-th row of D is the first row of D right-rotated by j.

The circulant nature of D implies that in order to have full information on the decoding matrix it is enough to have access to a sub-matrix, let’s call it H, defined as:

$$H = \begin{pmatrix} C^{-1}_{1,1}I & C^{-1}_{1,2}I & \dots & C^{-1}_{1,i}I \end{pmatrix} \quad (3.8)$$

The H matrix provides the same information as the full matrix D. Indeed, to obtain a generic row j of D it is enough to apply a right-rotation of j positions to H.

The advantage of this simplification is that storing the full D requires (is)^2 elements in memory, while storing only the sub-matrix H requires s · is = is^2 elements. For example, with i = 3 and s = 2, the full D needs 36 elements, while H needs only 12.

3.2.2 Sparsity Property

If we analyze the way in which H is built, we find out that each element C^{-1}_{i,j} of the inverse coding matrix is multiplied by an identity matrix I of size s. This means that the resulting matrix C^{-1}_{i,j} I has elements different from zero only on the diagonal, as shown in Equation 3.9.

$$C^{-1}_{1,1} I = \begin{pmatrix}
C^{-1}_{1,1} & 0 & \dots & 0 \\
0 & C^{-1}_{1,1} & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & C^{-1}_{1,1}
\end{pmatrix} \quad (3.9)$$

Thus it is a diagonal matrix, meaning that each C^{-1}_{i,j} I contains only s non-zero elements, while a generic square matrix of size s requires storing s^2 elements. Furthermore, since H is composed of i matrices of type C^{-1}_{i,j} I, the overall number of elements that need to be stored is i · s, i.e. only those different from zero.

Let's make an example assuming a matrix H in the case of s = 2 and i = 3. The resulting H is shown in the following equation:

$$H = \begin{pmatrix}
C^{-1}_{1,1} & & C^{-1}_{1,2} & & C^{-1}_{1,3} & \\
 & C^{-1}_{1,1} & & C^{-1}_{1,2} & & C^{-1}_{1,3}
\end{pmatrix} \quad (3.10)$$

in which zeros are left blank, to underline the fact that there is no need to store them in memory.

Looking at the periodic structure of H, it is possible to notice that even H can be further simplified. Indeed, H is itself circulant; thus, to fully store the H matrix it is enough to store its first row.

3.2.3 Sparse-Circulant Matrix

The properties presented in Section 3.2.1 and Section 3.2.2 are useful to identify periodicities in the matrix. Indeed, both properties are fully mathematically defined. For example, let's assume a matrix D with s = 2 and i = 3, as shown in Equation 3.11.

$$D = \begin{pmatrix}
C^{-1}_{1,1} & & C^{-1}_{1,2} & & C^{-1}_{1,3} & \\
 & C^{-1}_{1,1} & & C^{-1}_{1,2} & & C^{-1}_{1,3} \\
C^{-1}_{1,3} & & C^{-1}_{1,1} & & C^{-1}_{1,2} & \\
 & C^{-1}_{1,3} & & C^{-1}_{1,1} & & C^{-1}_{1,2} \\
C^{-1}_{1,2} & & C^{-1}_{1,3} & & C^{-1}_{1,1} & \\
 & C^{-1}_{1,2} & & C^{-1}_{1,3} & & C^{-1}_{1,1}
\end{pmatrix} \quad (3.11)$$

The sparsity property is easily identified: non-zero elements are all spaced s − 1 zeros apart. Moreover, these elements can be exactly located: each row contains exactly i non-zero elements, all equally spaced; in the example of Equation 3.11, the non-zero columns of row r are r mod s, r mod s + s and r mod s + 2s. The same relation can be observed for the columns.

The circulant property is easily identified too. This, as already explained, can be useful to strongly limit the memory usage. As for the sparsity property, this feature can be observed on both rows and columns.

Both these properties will be analyzed in Chapter 4 to identify the most appropriate way to store the matrix in memory.


Chapter 4

Sparse Matrix-Vector Multiplication

In Chapter 3 we have shown that the decoding matrix can be defined as a sparse circulant matrix. Operations on this kind of matrix constitute one of the most important computational kernels in scientific applications, ranging from physics to financial problems.

The Sparse Matrix-Vector multiplication, SpMxV in the following, is usually characterized by low performance on modern multi-processor architectures. In [17] it has been shown that SpMxV is a memory-bound operation. This means that the ratio between the operations executed and the amount of data on which they are executed is very low. Indeed, in the case of a classical MxM operation there are O(n^3) operations over O(n^2) data, while in the case of SpMxV there are O(n^2) operations over O(n^2) data. From this observation, it is possible to infer that the data reuse experienced by the SpMxV operation is very limited.

The SpMxV operation can also be analyzed from another point of view. Indeed, the sparsity property introduces large overheads if the matrix is stored completely in memory. Two main performance degradations can be identified:

(34)

• Excessive memory usage, since all zero elements are stored in memory even if their contribution to the final result is null;

• Limited temporal locality.

These two problems are highly related. The sparsity property implies that the access to the data might not be sequential. This means that using the standard data representation for a matrix implies a memory overhead and additional computational steps. To avoid these sources of inefficiency, different data structures can be used to limit the number of useless computations and the waste of memory needed to store zero elements.

4.1 SpMxV Data Storage Schemes

Various schemes are available in the literature to perform data storage:

• Compressed Storage Row ;

• Compressed Storage Column;

• Compressed Coordinate Format.

All the above schemes target the same problem: storing all the non-zero elements contiguously in memory. All of them, however, are affected by various overheads. The main one is due to the additional data structures needed to traverse the matrix and the vector: accessing the elements of the matrix may require additional memory accesses, increasing the overall memory bandwidth required by the SpMxV operation.


The Compressed Storage Row, or simply CSR, stores all non-zero elements contiguously in memory, in a single data structure. This storage format requires two additional data structures: a row pointer array, row_ptr, in charge of storing a pointer to the first element of each row, and a column index array, col_ind, storing the column index of each non-zero element of the matrix.

The Compressed Storage Column, or simply CSC, is similar to the previous one, with the only difference that the non-zero elements of the matrix are stored in column-major order. Like the previous one, the CSC representation requires additional data structures: a column pointer array, col_ptr, in charge of storing the pointer to the first element of each column, and a row index array, row_ind, storing the row index of each non-zero element of the matrix.

The Compressed Coordinate Format, or simply COO, is a generalization of the two previously presented data formats. COO is a general representation since, for an arbitrary sparsity pattern, the required storage is always proportional to the number of non-zero elements. As with the other schemes, it requires additional data structures: a row index array, row_ind, to store the row indices, and a column index array, col_ind, to store the column indices. Notice that the non-zero elements of the matrix can be stored in any order.

A lot of research has been done on optimization techniques [19], [21], [8] and [9]. The results show that, in order to obtain an efficient SpMxV, both the matrix structure and the processor architecture need to be exploited.


4.1.1 Compressed Storage Row

The Compressed Storage Row scheme [18], also called CSR, is the most used storage format. A sparse matrix of n rows is characterized by nnz non-zero elements, which are stored contiguously in memory in row-major order. This storage format is characterized by two additional data structures, col_ind and row_ptr, both arrays:

• the col_ind array is of size nnz and stores the column of each element in the original matrix;

• the row_ptr array is of size n + 1 and stores a pointer to the first element of each row.

Figure 4.1 shows an example of a generic 6 × 6 sparse matrix stored in the CSR format.

Figure 4.1: Example of CSR storage format
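To make the layout concrete, here is a minimal sketch with a hypothetical 3 × 3 matrix (not the matrix of Figure 4.1) using 0-based indices:

    #include <cstdio>

    //     | 5 0 1 |
    // A = | 0 2 0 |     nnz = 5
    //     | 3 0 4 |
    double a[]       = {5, 1, 2, 3, 4};    // non-zero values, row-major
    int    col_ind[] = {0, 2, 1, 0, 2};    // column index of each non-zero
    int    row_ptr[] = {0, 2, 3, 5};       // first non-zero of each row; row_ptr[n] == nnz

    int main() {
        // Traverse row by row to verify the layout.
        for (int i = 0; i < 3; i++)
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                std::printf("A[%d][%d] = %g\n", i, col_ind[j], a[j]);
        return 0;
    }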

4.1.2 Compressed Storage Column

The Compressed Storage Column scheme [18], also called CSC, is less used than the CSR scheme. A sparse matrix of n rows is characterized by nnz non-zero elements, which are stored contiguously in memory in column-major order. This storage format is characterized by two additional data structures, col_ptr and row_ind, both arrays:

• the row_ind array is of size nnz and stores the row of each element in the original matrix;

• the col_ptr array is of size n + 1 and stores a pointer to the first element of each column.

Figure 4.2 shows the same matrix as in Figure 4.1, but stored in a CSC format.


Figure 4.2: Example of CSC storage format

4.1.3 Compressed Coordinate Format

The Compressed Coordinate Format scheme [18], also called COO, can be seen as a generalization of the two previous schemes. A sparse matrix of n rows is characterized by nnz non-zero elements, which are stored contiguously in memory in any ordering. This storage format is characterized by two additional data structures, col_ind and row_ind, both arrays:

• the row_ind array is of size nnz and stores the row of each element in the original matrix;

(38)

• the col_ind array is of size nnz and stores the column of each element in the original matrix.

Figure 4.3 shows the same matrix as the previous examples of Figure 4.1 and Figure 4.2.

Figure 4.3: Example of COO storage format - Assuming row-major ordering

It is easy to identify this format as a general format, from which the CSR and CSC formats have been derived. In particular, this format is general since it does not impose any ordering on the data: any ordering, even unordered data, is acceptable. The main drawback of the scheme is that both the row_ind and col_ind arrays are fully stored in memory, increasing the amount of memory needed to store the whole matrix.

4.2 Sequential Algorithm

As shown in Section 4.1, various schemes are available to store the decoding matrix in memory. In the following, the Compressed Storage Row data structure presented above will be used.

The classical algorithm used to perform a Matrix-Vector multiplication is shown in Code 1. This algorithm assumes a classical matrix stored as an array of arrays.

(39)

    /*
      size ==> dimension of the matrix
      x    ==> the input vector
      y    ==> the output vector

      Matrix D is stored in an array of pointers d[-][-]
    */
    for (i = 0; i < size; i++) {        /* for any row i */
        for (j = 0; j < size; j++) {    /* any column j */
            y[i] += d[i][j] * x[j];
        }
    }

Code 1: Basic Sequential Algorithm

In case other formats are adopted to store the data structure, even the sequential algorithm changes slightly. Indeed, in the case of the CSR data format the sequential algorithm is shown in Code 2.

    /*
      size ==> dimension of the matrix
      nnz  ==> number of non-zero elements in the matrix
      x    ==> the input vector
      y    ==> the output vector

      Matrix D is stored as a single array d[nnz]
    */
    for (i = 0; i < size; i++) {
        for (j = row_ptr[i]; j < row_ptr[i+1]; j++) {
            y[i] += d[j] * x[col_ind[j]];
        }
    }

Code 2: CSR data-format for Matrix-Vector multiplication

4.3 Performance Degradations

This section briefly presents the various performance degradations affecting a generic SpMxV operation using the CSR storage scheme.


4.3.1 Memory Intensity

The memory intensity degradation can also be referred to as a lack of temporal locality. Indeed, in a SpMxV operation each matrix element participates in only two operations. This implies that the ratio between memory accesses and operations significantly increases, limiting the overall system performance.

In this condition the performance is not determined by the processor used, but by the memory sub-system [17]. In [8] this issue has been analyzed: the results show that lowering the amount of memory needed to store the various data structures provides a significant performance improvement on many test matrices.

Furthermore, the operation itself exhibits no reuse of the data structures used to represent the matrix. Thus, assuming a matrix M stored in CSR format, there is no data reuse in M, col_ind and row_ptr.

4.3.2 Indirect Memory References

Two sources of indirect memory references are present in the case of the CSR data format: one in row_ptr, to determine the bounds of the inner loop, and one in the access to x, due to the col_ind accesses. [?] shows that the main source of performance degradation is the indirect access to the col_ind structure. Indeed, an access to x is burdened by one additional access to the col_ind data structure, which unavoidably increases the number of memory accesses.

Differently, the row_ptr accesses affect the performance less. This is a predictable result: the accesses to the row_ptr structure are very rare and replace an already existing overhead in the inner loop initialization.


4.3.3 Row Length

This source of performance degradation arises from the fact that, in general, there could exist rows with very few non-zero elements. This condition is not of interest in our case: our matrix has the same number of non-zero elements in every row.

4.4 Optimized Sequential Algorithm

The previous analysis of the various performance degradations led to an effort to better optimize the sequential algorithm, taking advantage of the periodic structure of the matrix.

The main performance issue that has been addressed is the indirect memory access problem. The optimized algorithm, based on the results of Section 3.2.3, is shown in Code 3.

    /*
      size   ==> dimension of the matrix
      nnz    ==> number of non-zero elements in the matrix
      s      ==> size / code_size
      x      ==> the input vector
      y      ==> the output vector
      d[nnz] ==> contiguous array storing the matrix
    */
    for (int i = 0; i < size; i++) {
        int start = row_ptr[i];
        int end = start + code_size;
        int row_pos = col_ind[start];
        for (int j = start; j < end; j++) {
            y[i] += d[j] * x[row_pos];
            row_pos += s;
        }
    }

Code 3: Optimized sequential algorithm

This optimized sequential algorithm exploits the periodic structure of the matrix. Indeed, the various indirect memory accesses of Code 2 have been limited. In detail, it has been observed that:

• each row of the decoding matrix is composed of exactly code_size elements;

• the elements of X needed to calculate y_i are all separated by a fixed stride s, defined as:

  s = size / code_size    (4.1)

For example, with size = 6 and code_size = 3 (thus s = 2), the computation of y_0 accesses x[0], x[2] and x[4]. These observations allowed us to limit the indirect memory references to only one per outermost loop iteration.


Chapter 5

Parallelization Techniques

This chapter analyzes the various parallelization techniques that can be used to parallelize the decoding procedure, i.e. the matrix-vector product. In general, two main classes of parallelization can be identified, based on the kind of input the system works with:

• Data-Parallel, for the case in which the input is a single data structure;

• Stream-Parallel, for the case in which the input is a finite or infinite sequence of elements.

Each of these classes of parallelization is characterized by different computation schemes.

5.0.1 Matrix Decomposition

Another aspect that has been taken into account in the parallelization of the algorithm is related to the data structures involved in the operation, and in particular to the decoding matrix. This analysis has been performed at the beginning of the parallelization study, since it is a recurring aspect in all the successive implementations.

In the literature, three main decomposition patterns are usually presented:

• row decomposition (1D Matrix Decomposition);

• column decomposition (1D Matrix Decomposition);

• block decomposition (2D Matrix Decomposition).

These decomposition patterns are shown graphically in Figure 5.1.

Figure 5.1: Matrix Data Decomposition Schemes

These different approaches will be taken into account during the parallelization.
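As a minimal sketch of how the row (1D) decomposition can be expressed, the following hypothetical helper assigns a contiguous block of rows to each of n workers; the function name and signature are illustrative, not part of the thesis code.

    #include <algorithm>

    // 1D (row) decomposition of an M-row matrix among n workers:
    // worker w gets the contiguous row range [begin, end).
    void row_block(int M, int n, int w, int& begin, int& end) {
        int g = (M + n - 1) / n;           // partition size g = ceil(M / n)
        begin = std::min(w * g, M);
        end   = std::min(begin + g, M);    // the last worker may get fewer rows
    }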

5.1 Data Parallel

Following the data parallel paradigm, different possible schemes can be identified:

• Map;

• Stencil;

• Reduce;


• Divide and Conquer.

The map scheme is characterized by the parallelization of the sequential computation by identifying independent sub-computations. If we assume a set of workers executing a defined function F, the parallel computation is called a map if and only if the workers are fully independent, i.e. no cooperation between workers is present during the execution of F.

The stencil scheme is a complex computation in which workers cooperate through data exchange.

The reduce scheme is an extension of the sequential reduce operation. The reduce operation is a function defined as

x = reduce(A, op) (5.1)

in which op is an associative operator and A is an array of M elements over which op is defined, thus:

reduce(A, op) = A[0] op A[1] op ... op A[M − 1] (5.2)

In case of parallel computations, a reduce operation can be parallelized by exploiting the associativity of op.
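For instance, with op = + and A = [3, 1, 4, 1], associativity allows evaluating reduce(A, +) = (3 + 1) + (4 + 1) = 9: the two parenthesized partial sums can be computed by two workers in parallel and then combined in a final step.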

The divide and conquer approach is the parallel implementation of the well-known divide and conquer of the sequential world.

To describe a data parallel program, a systematic approach called Virtual Processors is usually adopted. With this approach it is possible to describe the complexity of a data parallel program, focusing on:


• Output Data collection;

Moreover, such an approach is capable of providing useful information to understand which data parallel model is applicable.

In the next two sections, two possible parallelizations of the data parallel computation are presented.

The first one is based on the map criterion. Indeed, looking at the sequential algorithm presented in Section 4.4, the simplest way to parallelize the computation is to assign to each Virtual Processor one iteration of the outer loop. This kind of computation is independent, meaning that concurrent executions of the outer loop are possible, since there is no need for data exchange.

The second one is based on a map-reduce criterion. The idea is to assign each element of the matrix to a single Virtual Processor. Calling the generic Virtual Processor VP[i][j], each one is in charge of:

1. Execute the multiplication between D[i][j] and X[j].

2. Take part in a reduce operation to calculate the final result Y[i].

5.1.1 Map

Looking at Code 3 in Section 4.4, the parallelism can be exploited in the outermost loop of the matrix-vector multiplication. Since the number of times the loop is executed is size, this is also the total number of Virtual Processors; namely, an array of Virtual Processors of size size is defined, formally VP[size].

Each Virtual Processor in the map paradigm needs to be capable of executing one function independently from the other Virtual Processors. Instantiating this definition in our system, such an independent function can be identified as the row-column multiplication executed by each iteration of the outermost loop. Notice that, to ensure independence, each Virtual Processor needs to have access to:

• The element in which to store the result;

• All the code_size elements of the i-th row of the decoding matrix;

• All size elements of the input vector X.

The pseudo-code describing the Virtual Processor program is shown in Code 4.

    parallel VP[M];            // definition of M Virtual Processors
    int D[M][N], X[N], Y[M];   // decoding matrix, input and output respectively

    /*
      Virtual Processor definition
      &
      private variables definition
    */

    VP[i] :: D[i][*], X[*], Y[i];
    /*
      Behaviour of each Virtual Processor
    */
    for (j = 0; j < N; j++) {
        Y[i] += D[i][j] * X[j];
    }

Code 4: Pseudo-code of the Virtual Processor approach for the map scheme
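To anticipate how the map scheme can be instantiated on the shared-memory architecture targeted by this thesis, the following is a minimal sketch using Intel TBB's parallel_for (the tool adopted in Chapter 6). The function name decode_map and the exact partitioning are illustrative assumptions; d, row_ptr, col_ind, code_size and s are the CSR quantities of Code 3.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    void decode_map(const double* d, const int* row_ptr, const int* col_ind,
                    const double* x, double* y,
                    int size, int code_size, int s) {
        tbb::parallel_for(tbb::blocked_range<int>(0, size),
            [=](const tbb::blocked_range<int>& r) {
                // Each task plays the role of a group of Virtual Processors,
                // i.e. it computes a contiguous block of rows independently.
                for (int i = r.begin(); i != r.end(); ++i) {
                    double acc = 0.0;
                    int row_pos = col_ind[row_ptr[i]];
                    for (int j = row_ptr[i]; j < row_ptr[i] + code_size; ++j) {
                        acc += d[j] * x[row_pos];
                        row_pos += s;   // fixed stride of the sparse-circulant matrix
                    }
                    y[i] = acc;
                }
            });
    }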


Cost Model

The cost model of a map parallelization is needed to obtain the optimum number of workers. Let’s first define some variables that will be used later:

• M the dimension of the matrix;

• n the number of workers;

• k, the size of the partitioned data structure;

• g, the partition size in which the decoding matrix is divided, defined as:

  g = M / n    (5.3)

• T_F, the time needed to execute the single operation, i.e. the operation described by a Virtual Processor.

Assuming the availability of n workers, the calculation time of each worker can be expressed as:

  T_w-calc = g T_F    (5.4)

The overall completion time needs to take into account two extra steps. Indeed, a scatter operation is needed to provide the workers with the information, and a gather operation is needed to provide back the result.

The cost of a scatter operation is defined as:

  T_scatter = n (T_send + g T_transm)    (5.5)

The cost of a gather operation can be approximated to:

  T_gather ≈ T_scatter    (5.6)

Once these two extra operations are defined, the overall completion time can be evaluated as:

  T_C = T_scatter + T_gather + T_w-calc = 2 T_scatter + T_w-calc    (5.7)

Thus:

  T_C = 2n (T_send + g T_transm) + g T_F = 2n T_setup + 2k T_transm + T_F M / n    (5.8)

This result can be further simplified taking into account the shared memory architecture that is going to be used. Indeed, there is no need to explicitly pass all the data to the workers: the data needed to execute the operation can be retrieved independently by each worker from the shared memory. This leads to the simplified formula:

  T_C = 2n T_setup + T_F M / n    (5.9)

To conclude the analysis of the cost model, the T_F contribution must be evaluated in detail. This term, as already explained, takes into account the time needed by each worker to execute the operation described by a single Virtual Processor. Instantiating this definition in our system, T_F is related to the row-column multiplication. Thus the execution time can be evaluated as:

  T_F = code_size (T_mult + 2 T_sum)    (5.10)

Letting T_sum = T_mult = T_op, the Virtual Processor execution time becomes

  T_F = 3 T_op code_size    (5.11)

The optimal parallelism degree can be obtained from Equation 5.9 by setting the first derivative of the completion time with respect to the number of workers to zero, resulting in

  n_opt = ⌈ √( M T_F / (2 T_setup) ) ⌉    (5.12)
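For clarity, this is the derivative step behind Equation 5.12 (treating n as continuous):

$$\frac{dT_C}{dn} = 2T_{setup} - \frac{M\,T_F}{n^2} = 0 \;\Longrightarrow\; n = \sqrt{\frac{M\,T_F}{2\,T_{setup}}}$$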

Further simplifications can be applied to Equation 5.9. Indeed, a pipeline effect can be observed in the collective operations: all the send procedures executed by the workers once they have completed their work are overlapped, except the last one. This refinement of the completion time results in

  T_C = (n + 1) T_setup + g T_F    (5.13)

from which a more refined optimal degree of parallelism can be obtained following the same procedure explained above, ending up with

  n_opt = ⌈ √( M T_F / T_setup ) ⌉    (5.14)

5.1.2 Map-Reduce

An alternative solution to the previously presented map approach is a combination of a map with a reduction. As already introduced, the idea is to assign each element of the decoding matrix to a single worker. This means that each worker is in charge of:

1. Executing the multiplication between D[i][j] and X[j];

2. Taking part in a reduce operation to calculate the final result Y[i].

Differently from the previous parallel computation, this approach requires an additional step to obtain a correct result. This is due to the fact that a single worker is assigned to a single element of the decoding matrix, meaning that, to obtain the correct output, the map operation needs to be followed by a reduction operation.

To develop the Virtual Processor approach already used for the pure map scheme, the presence of a matrix of Virtual Processors must be assumed; let's call them VP[i][j]. Each Virtual Processor encapsulates a single element of the decoding matrix and needs to execute a two-step computation:

1. Calculate D[i][j] ∗ x[j], in which D[i][j] represents the element of the decoding matrix assigned to VP[i][j], and store the partial result locally;

2. Participate in a reduction operation to calculate the final result Y[i].

From the description of the work executed by each Virtual Processor, it can be noticed that the reduction operation is needed only between Virtual Processors that are in the same row.

As an example, let's assume a square decoding matrix of size 3. Following the Virtual Processor approach, we end up defining a matrix of Virtual Processors as shown in Figure 5.2, in which each one is assigned a single element of the matrix. Each Virtual Processor is in charge of executing a single multiplication between one element of the matrix and one element of the input vector. Once the first operation is concluded, the reduction operation is performed.

Figure 5.2: Virtual Processor Reduce Operation

In Figure 5.2 the reduction operation is represented by arcs and labels. Each arc represents a logical dependence between Virtual Processors, and its label represents the operation that needs to be performed between the two VPs. Notice that in our case we are dealing with a sum operation, thus the ordering in which the operations are carried out is not of interest, due to the commutative property of the sum, and does not affect the final result. Differently, the last arc, labelled with =, provides the same meaning as:

$$Y[i] = \sum_{j=0}^{size} VP[i][j] \quad (5.15)$$

in which VP[i][j] indicates the result of the multiplication performed in the first step of the computation.

The reduction operation, too, can be described with the Virtual Processor approach. Let's focus on a single reduction operation of a generic row i. The reduction operation can be implemented as a tree, as shown in Figure 5.3.


Figure 5.3: Logarithmic implementation of a generic reduction operation

Since the reduction operation is assumed to be executed after the map computation, there is no need for data scattering among Virtual Processors; indeed, each one holds its own calculated value. The reduction proceeds as shown in Code 5.

    /*
      The reduction operation takes log(N) steps
    */
    for k = 1, ..., log(N) {    // log base 2

        /*
          At each step, each Virtual Processor
          taking part in the reduction operation
          needs to perform:
        */
        parallel VP[N];    // definition of N parallel processes
        VP[i]::            // behaviour of each Virtual Processor
            if ((i + 1) mod 2^k == 0) {
                Y[i] += Y[i - 2^(k-1)];
            }
    } // end for

Code 5: Pseudo-code of the Virtual Processor approach for the reduction operation
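For concreteness, the same $\log_2(N)$ schedule can be simulated sequentially in C++ (a sketch assuming N is a power of two; the outer loop plays the role of the parallel step, so only the index arithmetic is of interest):

    #include <cstddef>
    #include <vector>

    // Simulates the tree reduction of Code 5 on a vector of partial results.
    // At step k the stride is 2^(k-1); after log2(N) steps the total sits in Y[N-1].
    double treeReduce(std::vector<double> Y) {
        const std::size_t N = Y.size();                        // assumed power of two
        for (std::size_t stride = 1; stride < N; stride *= 2)  // step k
            for (std::size_t i = 2 * stride - 1; i < N; i += 2 * stride)
                Y[i] += Y[i - stride];                         // VP[i] adds VP[i - 2^(k-1)]
        return Y[N - 1];
    }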

Cost Model

To provide a mapping of the presented Virtual Processor computation into a parametric version with parallelism degree p, the methodology with which the Virtual Processors are mapped onto actual processing units must first be provided.

Let's first define some variables:

• p, as the parallelism degree, defined as:

$p = s \cdot q$   (5.16)

• m, as the number of rows of the Virtual Processor matrix, thus also of the decoding matrix;

• n, as the number of columns of the Virtual Processor matrix, thus also of the decoding matrix.

In general, the Virtual Processor matrix can be seen as composed of sub-matrices:

$\begin{pmatrix} A_{0,0} & \cdots & A_{0,q-1} \\ \vdots & \ddots & \vdots \\ A_{s-1,0} & \cdots & A_{s-1,q-1} \end{pmatrix}$   (5.17)

In which each sub-matrix is composed of elements:

$\begin{pmatrix} a_{i_0,j_0} & \cdots & a_{i_0,j_{l-1}} \\ \vdots & \ddots & \vdots \\ a_{i_{k-1},j_0} & \cdots & a_{i_{k-1},j_{l-1}} \end{pmatrix}$   (5.18)

where $k = \lceil m/s \rceil$ and $l = \lceil n/q \rceil$.

Given this decomposition scheme of the Virtual Processor matrix, the grain of the decomposition, or partition size, g can be defined as:

$g = k \cdot l$   (5.19)

From the general description of the computation, the completion time can be expressed as:

$T_c = p\,T_{scatter} + g\,T_F + T_{reduce}$   (5.20)

in which, similarly to the pure map approach, $T_{scatter}$ represents the time needed to provide the workers with the data and $T_F$ the calculation time of the operation executed by one Virtual Processor. Differently from the map approach, there is no gather operation, because in our case the gathering is carried out by the reduction operation.

As already introduced, each row represents a different reduction operation; thus, given this decomposition scheme, each processing node participates in k different reduction operations. Equation 5.20 can therefore be refined as:

$T_c = p\,T_{scatter} + g\,T_F + k\,T_{reduce}$   (5.21)

This formula shows that the k reduction operations in which each worker takes part are performed sequentially.

The reduction operation itself can be mapped onto a tree computation, as shown in Figure 5.3. The parallel reduction must be expressed as an operation with parallelism degree q and grain l. Notice that q represents the number of column blocks each row is composed of, while l represents the size of each column block. With these parameters, and assuming a tree computation, the reduction operation can be expressed as:

$T_{reduce}(q, l) = (l - 1)\,T_{sum} + \log_2 q\,(T_{send} + T_{sum})$   (5.23)

in which $T_{sum}$ expresses the time needed to execute a single sum between two elements.

Substituting Equation 5.23 into Equation 5.21, the overall completion time can be expressed as:

$T_c = p\,T_{scatter} + g\,T_F + k\left[(l - 1)\,T_{sum} + \log_2 q\,(T_{send} + T_{sum})\right]$   (5.24)
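As an illustration only, Equation 5.24 can be evaluated numerically for candidate decompositions; the sketch below assumes all times are expressed in a common unit (the helper name and parameters are hypothetical):

    #include <cmath>

    // Evaluates Equation 5.24 for a decomposition in s row blocks and q column blocks.
    double completionTime(double m, double n, double s, double q,
                          double Tscatter, double TF, double Tsum, double Tsend) {
        const double p = s * q;                 // parallelism degree (Eq. 5.16)
        const double k = std::ceil(m / s);      // rows handled by each worker
        const double l = std::ceil(n / q);      // columns handled by each worker
        const double g = k * l;                 // grain of the decomposition (Eq. 5.19)
        const double Treduce = (l - 1.0) * Tsum + std::log2(q) * (Tsend + Tsum);
        return p * Tscatter + g * TF + k * Treduce;
    }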

This formula can be further simplified if a shared memory architecture is assumed, as already done for the proposed map approach, leading to a completion time of:

$T_c = p\,T_{setup} + g\,T_F + k\left[(l - 1)\,T_{sum} + \log_2 q\,(T_{setup} + T_{sum})\right]$   (5.25)

Notice that at the end of the reduction the result is held by the last worker. Thus, at the end of each reduction operation, an additional communication must be performed to provide the result back. Taking this into account, Equation 5.25 becomes:

$T_c = p\,T_{setup} + g\,T_F + k\left[(l - 1)\,T_{sum} + \log_2 q\,(T_{setup} + T_{sum}) + T_{setup}\right]$
$\phantom{T_c} = (p + k)\,T_{setup} + g\,T_F + k\left[(l - 1)\,T_{sum} + \log_2 q\,(T_{setup} + T_{sum})\right]$   (5.26)

To better quantify the formula, the $T_F$ term must then be analyzed. As already introduced, this term accounts for the actual computation done by each Virtual Processor. From the computational scheme presented before,


each Virtual Processor executes only one multiplication before the reduction takes place. Thus, assuming $T_{mult}$ as the time taken to execute a single multiplication, Equation 5.26 becomes:

$T_c = (p + k)\,T_{setup} + g\,T_{mult} + k\left[(l - 1)\,T_{sum} + \log_2 q\,(T_{setup} + T_{sum})\right]$   (5.27)

Moreover, it can be assumed that the cost of a multiplication and of a sum are equal, i.e. $T_{op} = T_{mult} = T_{sum}$. Substituting this into Equation 5.27:

$T_c = (p + k)\,T_{setup} + g\,T_{op} + k\left[(l - 1)\,T_{op} + \log_2 q\,(T_{setup} + T_{op})\right]$   (5.28)

To provide the optimum number of workers, $p_{opt}$, a two-step analysis must be executed: due to the complexity of the scheme, it is preferable to analyze the two steps of the computation separately. The completion time of the first step can be approximated as:

$T_{c_1} = p\,T_{setup} + g\,T_F$   (5.29)

To obtain the optimum number of workers it is enough, as done before, to calculate the derivative of the completion time with respect to p and set it to zero. The result is the following relation:

$p_{opt} = \left\lceil \sqrt{\frac{m\,n\,T_F}{T_{setup}}} \right\rceil$   (5.30)
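For completeness, the derivation step left implicit above can be written out (assuming $g = mn/p$, i.e. the $mn$ matrix elements evenly divided among the p workers, and treating p as continuous):

$T_{c_1}(p) = p\,T_{setup} + \frac{mn}{p}\,T_F, \qquad \frac{dT_{c_1}}{dp} = T_{setup} - \frac{mn\,T_F}{p^2} = 0 \;\Longrightarrow\; p = \sqrt{\frac{mn\,T_F}{T_{setup}}}$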

For the reduce operation, the parallelism degree that needs to be optimized is q. The procedure is similar to the one presented before. The optimum value for q is equal to:

$q_{opt} = \left\lceil \frac{n\,T_{op}\,\log 2}{T_{setup}} \right\rceil$   (5.31)
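The corresponding step for the reduction is sketched below (assuming $l = n/q$, $T_{setup} \gg T_{op}$, and treating q as continuous):

$T_{reduce}(q) = \left(\frac{n}{q} - 1\right)T_{op} + \log_2 q\,(T_{setup} + T_{op}), \qquad \frac{dT_{reduce}}{dq} = -\frac{n\,T_{op}}{q^2} + \frac{T_{setup} + T_{op}}{q \ln 2} = 0 \;\Longrightarrow\; q = \frac{n\,T_{op}\ln 2}{T_{setup} + T_{op}} \approx \frac{n\,T_{op}\log 2}{T_{setup}}$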

Notice that, having fixed $T_{op}$, n, m and $T_{setup}$, it is possible to observe that $p_{opt} \gg q_{opt}$. For example, assuming:

• $T_{op} = \tau$, so that a basic operation, such as a sum or a multiplication, takes one clock cycle;

• n = 10000;

• m = 10000;

• $T_{setup} = 10000\tau$, i.e. the time needed by the system to start the execution of a newly created computation.

The optimal parallelism degrees can be evaluated as:

$p_{opt} = 317, \qquad q_{opt} = 3$   (5.32)

These results provide two pieces of information. The first is related to the overall optimal number of processing nodes for the first operation. The second is related to the optimal number of parts into which the columns must be divided to obtain optimal performance in the reduction operation.

Combining the two results, it is possible to provide an overall parallelism degree. From the previous results it can be inferred that, to obtain the optimal completion time, it is desirable to have 317 processing nodes with a particular mapping: to also obtain an optimal reduction, each row must be divided among 3 processing nodes only.


5.2 Stream Parallelism

Following the stream parallelism paradigm, different schemes can be identified:

• Pipeline;

• Farm;

• Data-Flow.

The pipeline scheme is characterized by a decomposition of the computation into stages. If it is possible to express a computation as a composition of n functions, a pipeline parallelization among n stages is possible. The main benefit of this model, shared with the other models, is the lowering of the service time.

The farm scheme is characterized by the replication of a function among a set of workers. The general implementation of a farm is shown in Figure 5.4, in which three kinds of modules are present: emitter, workers, and collector. Many slightly different farm approaches are possible, such as functional replication with state, and all rely upon this farm definition.

The data-flow scheme relies upon the so-called Bernstein conditions: given a sequential computation, a partial ordering of the operations can be obtained by applying these conditions. The resulting partial ordering can then be represented by a data-flow graph.

Before presenting the various proposed schemes, it is necessary to characterize the stream that the system is going to work with. In this stage of the design the input stream is left free, meaning that no specific communication pattern is forced. This implies that the decoding procedure can impose the input stream structure that is best from a decoding performance point of view.

Before proceeding with the analysis of the various proposed schemes, it is useful to define some common names that will be used in the following sections:

• N, that represents the input size;

• code size, that represents the length of the code;

• f, that represents the frequency of the A/D converter;

• s = N/code size;

• X, that represents the input;

• Y, that represents the output;

• D, that represents the decoding matrix.

To find out the best model of computation to solve the problem, three different types of input streams are considered (a small numeric illustration follows the list):

• An input stream with inter-arrival time $T_A = \frac{1}{f}N$ in which each element is the full X;

• An input stream with inter-arrival time $T_A = \frac{1}{f}$ in which each element is a single element of X;

• An input stream with inter-arrival time $T_A = \frac{1}{f}N \cdot \frac{1}{s}$ in which each element is a particular subset of X: the one that contains all the elements needed to calculate a single element of Y.
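As a small numeric illustration of the three cases (the parameter values below are hypothetical, not those of the actual system):

    #include <cstdio>

    int main() {
        const double f = 100.0e6;       // A/D converter frequency [Hz] (assumed value)
        const double N = 65536.0;       // input size (assumed value)
        const double code_size = 512.0; // code length (assumed value)
        const double s = N / code_size;

        const double TA_full   = (1.0 / f) * N;      // one full X per arrival
        const double TA_single = 1.0 / f;            // one element of X per arrival
        const double TA_subset = (1.0 / f) * N / s;  // one subset of X per arrival

        std::printf("TA (full X) = %g s\n", TA_full);
        std::printf("TA (single) = %g s\n", TA_single);
        std::printf("TA (subset) = %g s\n", TA_subset);
        return 0;
    }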



Figure 5.4: Farm with an input stream made by arrays of size N

5.2.1 Stream of arrays of size N

This approach is shown schematically in Figure 5.4. It is characterized by an input stream composed of arrays of size N with an inter-arrival time $T_A$ of:

$T_A = \frac{1}{f}N$   (5.33)

The role of the workers is to execute the actual decoding of the input array X and provide the collector with the decoded vector Y.

By queuing theory, we have that the emitter is a bottleneck in the farm if the following holds:

$T_A < T_E$   (5.34)

in which $T_E$ is the service time of the emitter. Notice that, if a non-ordering emitter is assumed, the latency of the emitter can be approximated by the communication latency $L_{com}$. Thus Relation 5.34 can be rewritten as
