

2.5.3 The Experiment

In some cases the host-to-device transfer slows down the process, in which case the serial part of the code also runs on the device. Having analyzed the application profile, we applied either Amdahl's or Gustafson's law to estimate an upper limit of the achievable speedup. Once we had located a hotspot in the application profile, we used the Thrust library to expose the parallelism in that portion of the code as a call to an external function. We then executed this external function on the GPU and retrieved the results without requiring major changes to the rest of the application.
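For reference, the standard forms of the two laws (quoted here in textbook notation; they are not reported explicitly in this section) are, with p the parallelizable fraction of the runtime and N the number of parallel processing units:

S_Amdahl(N) = 1 / ((1 - p) + p/N)    (fixed problem size)

S_Gustafson(N) = (1 - p) + p N    (problem size scaled with N)

Amdahl's law bounds the speedup for a fixed workload, whereas Gustafson's law describes the scaled speedup when the problem size grows with the number of processors; the profiling step mentioned above provides the estimate of p.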

We adopted Thrust-based code optimization in order to gain, at the cost of a lower speedup, rapid development and better code readability. There are three high-level optimization techniques that we employed to yield significant performance speedups when using Thrust:

1. Fusion: In computations with low arithmetic intensity, the ratio of calculations to memory accesses is constrained by the available memory bandwidth and does not fully exploit the GPU. One technique for increasing the computational intensity of an algorithm is to fuse multiple pipeline stages together into a single one. Rather than applying two functions f and g as separate passes over the data, in Thrust a better approach is to fuse them into a single operation g(f(x)) and halve the number of memory transactions. Unless f and g are computationally expensive operations, the fused implementation will run approximately twice as fast as the two-pass approach. Fusing a transformation with other algorithms is therefore a worthwhile optimization. Thrust provides a transform iterator which allows transformations to be fused with any algorithm (see the sketch after this list);

2. Structure of Arrays (SoA): An alternative way to improve memory efficiency is to ensure that all memory accesses benefit from coalescing, since coalesced memory access patterns are considerably faster than non-coalesced transactions.

The most common violation of the memory coalescing rules arises when using an Array of Structures (AoS) data layout. An alternative to the AoS layout is the SoA approach, where the components of each structure are stored in separate arrays. The advantage of the SoA method is that regular access to the components of a given vector can be coalesced;

3. Implicit Sequences: the use of implicit ranges, i.e. ranges whose values are defined programmatically and not stored anywhere in memory. Thrust provides a counting iterator, which acts like an explicit range of values but does not carry any storage overhead. Specifically, when a counting iterator is dereferenced, it generates the appropriate value on the fly and yields it to the caller. All three techniques are illustrated in the sketch below.
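To make the three techniques above concrete, the following minimal, self-contained Thrust sketch may help; it is not taken from the GAME source code, and the functors fused_fg and norm2 are our own illustrative names. A counting iterator supplies an implicit index sequence, a transform iterator fuses g(f(x)) into a single pass over the data, and a zip iterator reads a Structure-of-Arrays layout so that each component array is accessed with coalesced transactions.

// Minimal sketch of the three Thrust optimizations discussed above
// (illustrative only, not part of the GAME code base).
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>

// 1. Fusion: f(x) = 2x + 1 and g(y) = y*y are fused into g(f(x)),
//    so the intermediate result never touches global memory.
struct fused_fg {
  __host__ __device__ float operator()(float x) const {
    float fx = 2.0f * x + 1.0f;  // f(x)
    return fx * fx;              // g(f(x))
  }
};

// 2. SoA: the two components of a 2D point are stored in separate
//    arrays and combined on the fly through a zip iterator.
struct norm2 {
  template <typename Tuple>
  __host__ __device__ float operator()(const Tuple& p) const {
    float x = thrust::get<0>(p);
    float y = thrust::get<1>(p);
    return x * x + y * y;
  }
};

int main() {
  const int n = 1 << 20;

  // 3. Implicit sequence: 0, 1, 2, ... generated on the fly, no storage.
  thrust::counting_iterator<int> first(0), last(n);

  // Fusion in action: g(f(i)) is evaluated inside a single reduction;
  // the transform iterator removes the extra pass over memory.
  float sum = thrust::reduce(
      thrust::make_transform_iterator(first, fused_fg()),
      thrust::make_transform_iterator(last, fused_fg()),
      0.0f);

  // SoA in action: x and y live in separate (coalesced) device arrays.
  thrust::device_vector<float> x(n, 1.0f), y(n, 2.0f), r(n);
  thrust::transform(
      thrust::make_zip_iterator(thrust::make_tuple(x.begin(), y.begin())),
      thrust::make_zip_iterator(thrust::make_tuple(x.end(), y.end())),
      r.begin(), norm2());

  (void)sum;
  return 0;
}

With this structure, the number of kernel launches and of passes over global memory is kept to a minimum, which is exactly the effect that the fusion and implicit-sequence techniques aim for.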

All tests were run on a 2.0 GHz Intel Core i7 2630QM quad-core CPU running 64-bit Windows 7 Home Premium SP1. The CPU code was compiled using the Microsoft C/C++ Optimizing Compiler version 16.00, and the GPU benchmarks were performed using the NVIDIA CUDA programming toolkit version 4.1 running on an NVIDIA GeForce GT540M device.

The execution parameters were chosen as combinations of:

• Max number of iterations: 1000, 2000, 4000, 10000, 20000 and 40000;

• Order (max degree) of polynomial expansion: 1, 2, 4 and 8;

The other parameters remained unchanged for all tests:

• Random mode for initial population: normal distribution in [-1,+1];

• Type of error function (fitness): Threshold Mean Square Error (TMSE);

• TMSE threshold: 0.49;

• Selection criterion: both ranking and roulette;

• Training error threshold: 0.001, used as a stopping criterion;

• Crossover application probability rate: 0.9;

• Mutation application probability rate: 0.2;

• Number of tournament chromosomes at each selection stage: 4;

• Elitism chromosomes at each evolution: 2.

The experiments were performed using three implementations of the GAME model:

• serial: the CPU-based original implementation (serial code);

• Opt: an intermediate version, obtained during the APOD process application (optimized serial code);

• GPU: the parallelized version, as obtained at the end of an entire APOD process application.

For the scope of the present experiment, we preliminarily verified the perfect correspondence between the CPU- and GPU-based implementations in terms of classification performance. In fact, the scientific results of the serial version have already been evaluated and documented in a recent paper (Brescia et al., 2012a), where the serial version of GAME was originally included in a set of machine learning models developed within our team and compared with traditional analytical methods.

Referring to the best results for the three versions of GAME implementation, we obtained the percentages shown in Tab. 2.1.

The results were evaluated by means of three statistical figures of merit, namely completeness, contamination and accuracy. However, these terms are defined differently by astronomers and data miners.

type of experiment   missing features    figure of merit    serial   Opt    GPU
complete patterns    –                   class. accuracy    82.1     82.2   81.9
                                         completeness       73.3     73.0   72.9
                                         contamination      18.7     18.5   18.8
without feat. 11     11                  class. accuracy    81.9     82.1   81.7
                                         completeness       79.3     79.1   78.9
                                         contamination      19.6     19.5   19.8
only optical         8, 9, 10, 11        class. accuracy    86.4     86.3   86.0
                                         completeness       78.9     78.6   78.4
                                         contamination      13.9     13.7   14.1
mixed                5, 8, 9, 10, 11     class. accuracy    86.7     86.9   86.5
                                         completeness       81.5     81.4   81.2
                                         contamination      16.6     16.2   16.7

Table 2.1: Summary of the performances (in percentage) of the three versions of the GAME classifier. Results are reported for the four main dataset feature-pruning experiments: with all 11 features, without the last structural parameter (the tidal radius), with optical features only, and without 5 features of mixed optical and structural type.

In this case, by classification accuracy we mean the fraction of patterns (objects) which are correctly classified (either GCs or non-GCs) with respect to the total number of objects in the sample;

the completeness is the fraction of GCs which are correctly classified as such, and finally the contamination is the fraction of non-GC objects which are erroneously classified as GCs. In terms of accuracy and completeness, by using all available features except number 11 (the tidal radius), we obtain marginally better results, as can be expected given the high noise present in this last parameter, which is affected by the large background due to the host galaxy light. In terms of contamination, better results are obtained by removing structural parameters, demonstrating the relevance of the information carried by the optical features of the observed objects (in particular, the isophotal and aperture magnitudes and the FWHM of the image were recognized as the most relevant by all pruning tests). Moreover, these experiments also have the advantage of reducing the number of features within the patterns without affecting the classification performance. The fewer features each pattern contains, the shorter the execution time of the training phase, thus providing a benefit to the overall computational cost of the experiments.
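In a more compact notation, introduced here only for clarity (it is not used in the original text), let TP and FN be the GCs classified as GCs and as non-GCs respectively, and TN and FP the non-GCs classified as non-GCs and as GCs respectively. The three figures of merit defined above then read:

class. accuracy = (TP + TN) / (TP + TN + FP + FN)

completeness = TP / (TP + FN)

contamination = FP / (FP + TN)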

Finally, it is worth pointing out that the performance results quoted in Tab. 2.1 all refer to the test samples (without considering the training results) and do not include possible biases affecting the KB itself. Hence they rely on the assumption that the KB is a fair and complete representation of the real population that we want to identify.

From the computational point of view, as expected, we obtained consistent results for all three versions of GAME, varying slightly in the less significant digits; this is trivially explained by the intrinsic randomness of genetic algorithms and by the precision imposed by the different processing units. We recall, in fact, that the GPU devices used for the experiments are optimized for single precision and may show a slight loss of performance in the case of double-precision calculations.

GPU devices certified for double precision were expected to become available for testing in the first quarter of 2013, when optimized Kepler-class GPU devices would be commercially distributed.

After having verified the computational consistency among the three implementations, we analyzed their performance in terms of execution speed.

We performed a series of tests on the parallelized version of the model in order to evaluate its scalability with respect to the data volume. These tests make use of the dataset used for the scientific experiments (see Sec. 4.2), extended by simply replicating its rows several times, thus obtaining a uniform group of datasets of increasing size.

By considering a GPU device with 96 cores, 2 GB of dedicated global memory and a theoretical throughput of about 14 GB/sec on the PCI-E bus connecting the CPU (hereinafter Host) and the GPU (hereinafter Device), we compared the execution of a complete training phase by varying the degree of the polynomial expansion (function polytrigo) and the size of the input dataset.

Tab. 2.2 reports the results, where the training cycles have been fixed to 4000 iterations for simplicity. The derived quantities are the following:

• HtoD (Host to Device): time elapsed during the data transfer from Host to Device;

• DtoH (Device to Host): time elapsed during the data transfer from Device to Host;

• DtoD (Device to Device): time elapsed during the data transfers internally to the device global memory;

• DP (Device Processing): processing time on the GPU side;

• HP (Host Processing): processing time on the CPU side;

• P (Processing): total duration of the training processing phase (excluding data transfer time);

• T (Transfer): total duration of various data transfers between Host and Device;

• TOT (Total): total duration of a training execution.

The relationships among these terms are the following:

P = DP + HP (2.34)

T = HtoD + DtoH + DtoD (2.35)

TOT = P + T (2.36)

The total execution time of a training phase, given by Eq. 2.36, can be obtained by adding the processing time (on both Host and Device), given by Eq. 2.34, to the data transfer time (Eq. 2.35). In principle, however, Eq. 2.35 should also account for the time elapsed during data transfers through the shared memory of the Device, i.e. the data flow among all threads of the GPU blocks. In our case this mechanism is automatically optimized by Thrust itself, which takes care of the flow, thus hiding the thread communication setup from the programmer.
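As a quick consistency check of Eqs. 2.34-2.36 against the measured values, consider for instance the degree = 1, 16 MB row of Tab. 2.2:

P = DP + HP = 152.12 + 65.20 = 217.32 sec

T = HtoD + DtoH + DtoD = 10.00 + 10.00 + 2.68 = 22.68 sec

TOT = P + T = 217.32 + 22.68 = 240.00 sec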

By comparing the time spent to transfer data between Host and Device (the sum of columns HtoD and DtoH) with the processing time (column P), a non-negligible, though expected, contribution of the transfer time appears; this is well known to be one of the most significant bottlenecks for the current generation of GPUs (i.e. before the Kepler technology). In such cases, a particular effort is required in the code design to minimize this data flow as much as possible. In our case, as the data size increases, this contribution remains almost constant (approximately 9%), thus confirming the correctness of our approach in keeping this mechanism under control.
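For instance, for degree = 1 the ratio (HtoD + DtoH)/P computed from Tab. 2.2 is (10.00 + 10.00)/217.32 ≈ 9.2% at 16 MB, (82.00 + 81.20)/1701.80 ≈ 9.6% at 128 MB and (322.00 + 322.00)/7028.00 ≈ 9.2% at 512 MB, which illustrates the roughly constant ∼9% contribution quoted above.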

In terms of work distribution between Host and Device, by comparing the processing times of the two (respectively, columns HP and DP), the percentage of GPU calculation time remains almost the same for all data sizes (approximately 70% of the whole processing time, given in column P). Furthermore, by evaluating the average processing throughput at different polynomial degrees, as the size grows from 0.15 to 512 MB, we obtain on average ∼ 0.058 MB/sec with degree = 1, ∼ 0.055 MB/sec with degree = 2, ∼ 0.046 MB/sec with degree = 4, and ∼ 0.036 MB/sec with degree = 8. As expected, increasing the polynomial degree lowers the throughput, but by an amount which is small with respect to the growth of the data size.
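For instance, the degree = 1 figure is consistent with the mean of the ratios between the data size and the processing time P in Tab. 2.2: 0.15/14.75 ≈ 0.010, 16/217.32 ≈ 0.074, 128/1701.80 ≈ 0.075 and 512/7028.00 ≈ 0.073 MB/sec, whose average is ≈ 0.058 MB/sec.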

In conclusion, by considering both the scientific and computational performance of the GPU-based model, a polynomial expansion of degree 8 turns out to be a good compromise between the approximation of the desired training error and the processing time needed to reach it, while maintaining good performance also in the case of large volumes of training data.

The three versions of GAME, namely the serial, the serial optimized (Opt) and the parallel one, have been tested under the same setup conditions. As expected, while the two CPU-based versions, serial and Opt, appear comparable, there is an evident difference with respect to the GPU version. The diagrams shown in Fig. 2.8 report the direct comparisons among the three GAME versions for increasing degrees of the polynomial expansion, which represents the evaluation function of the chromosomes.

The trends show that, once the polynomial degree is fixed, the execution time always increases linearly with the number of iterations. This is what we expected, because the algorithm repeats the same operations at each iteration. The execution time of the GPU version is always at least one order of magnitude lower than that of the other two

degree   data size   DP        HP        HtoD     DtoH     DtoD    P          T        TOT
1        0.15 MB     9.59      5.16      0.11     0.11     0.03    14.75      0.25     15.00
1        16 MB       152.12    65.20     10.00    10.00    2.68    217.32     22.68    240.00
1        128 MB      1191.26   510.54    82.00    81.20    5.50    1701.80    168.70   1870.50
1        512 MB      4919.60   2108.40   322.00   322.00   8.00    7028.00    652.00   7680.00
2        0.15 MB     10.38     5.35      0.12     0.12     0.04    15.73      0.27     16.00
2        16 MB       161.74    69.32     11.00    11.00    2.95    231.05     24.95    256.00
2        128 MB      1265.59   542.39    90.20    89.32    6.05    1807.98    185.57   1993.55
2        512 MB      5232.36   2242.44   354.20   354.20   8.80    7474.80    717.20   8192.00
4        0.15 MB     12.72     5.98      0.13     0.13     0.04    18.70      0.30     19.00
4        16 MB       193.59    82.97     12.10    12.10    3.24    276.56     27.44    304.00
4        128 MB      1517.58   650.39    99.22    98.25    6.66    2167.98    204.13   2372.11
4        512 MB      6257.36   2681.72   389.62   389.62   9.68    8939.08    788.92   9728.00
8        0.15 MB     16.33     7.34      0.14     0.14     0.04    23.67      0.33     24.00
8        16 MB       247.67    106.14    13.31    13.31    3.57    353.81     30.19    384.00
8        128 MB      1947.10   834.47    109.14   108.08   7.32    2781.58    224.54   3006.12
8        512 MB      7994.13   3426.06   428.58   428.58   10.65   11420.19   867.81   12288.00

Table 2.2: Summary of the time performances of the GPU version of the GAME classifier. The figures were obtained by varying the degree of the polynomial function polytrigo (first column) and the size of the input dataset (second column). The training cycles have been fixed to 4000 iterations. For each degree value, the first row (with data size = 0.15 MB) refers to the original size of the dataset used for the scientific experiments; the other rows refer to the same dataset artificially extended by replicating its rows a number of times. Columns 3-10 are time values expressed in seconds. Their meaning is defined as follows: DP (Device Processing) is the processing time on the GPU side; HP (Host Processing) is the processing time on the CPU side; HtoD (Host to Device) is the time elapsed during the data transfer from Host to Device; DtoH (Device to Host) is the time elapsed during the data transfer from Device to Host; DtoD (Device to Device) is the time elapsed during the data transfers internal to the device global memory; P is the sum of columns DP and HP; T is the sum of columns HtoD, DtoH and DtoD; and finally the last column, TOT, is the sum of columns P and T.

Figure 2.8: Comparison among the three GAME implementations with the polynomial degree 1 (a), 2 (b), 4 (c) and 8 (d). The squares represent the serial version; rhombi represent the Opt version, while triangles are used for the GPU version.

Figure 2.9: Speedup of the GPU version of GAME with respect to the two CPU implementations (serial and Opt).

implementations. We also remark that the classification performance of the GAME model increases with the polynomial degree, starting to reach good results from a degree equal to 4, which is exactly where the difference between the CPU and GPU versions reaches two orders of magnitude.

In the diagram of Fig. 2.9, the GPU version is compared against the CPU implementations. As shown, the speedup increases with the polynomial degree. The diagram, which refers to the average speed over a range of iterations from 1000 to 40000, shows that the GPU algorithm exploits data parallelism, since more data are processed simultaneously. As previously mentioned, an increase in the maximum degree of the polynomial expansion leads to an increase in the number of genes and consequently to a larger population matrix. The GPU algorithm outperforms the CPU by a factor ranging from 8× to 200× against the serial version and from 6× to 125× against the Opt version, enabling an intensive and highly scalable use of the algorithm that was previously impossible to achieve with a CPU.