

3.3.5 DQ, DQM and scalability issues

DQ and DQM are computing intensive, and their computational cost grows quickly with the size and complexity of the data to be analyzed. In what follows we briefly describe how Graphics Processing Units (GPUs) could offer an effective and inexpensive way to deal with this problem, even in the framework of a mission as complex as Euclid.

In the Euclid SGS data warehouse, the scientific quality control mainly concerns the data and metadata related to both images and spectra. Most of the machine learning based KDD techniques that could be directly employed on this kind of data are naturally parallel in terms of their computation.

As an example, let us consider the Multi Objective Genetic Algorithm (MOGA), based on the linkage between feature selection and association rules, which is one of the key concepts in the DQ methodology. The main motivation for using GAs in the discovery of high-level prediction rules, and a frequent reason for their adoption in DM problems, is that they perform a global search and cope better with attribute interaction. A parallel GA therefore further improves the computing performance, which is particularly needed for the quality control of a massive data warehouse. A traditional parallel computing environment is very difficult and expensive to set up. This can be circumvented by resorting to graphics hardware, which is inexpensive, powerful, and perfectly comparable with far more complex HPC mainframes in terms of computing power (many GPU-based systems are already included in the ranking of the top 500 supercomputers worldwide).
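To make the parallelism concrete, the sketch below implements the simplest (master-slave) parallel GA scheme in Python, in which the fitness evaluation, usually the dominant cost, is distributed over worker processes. The bit-string encoding and the toy fitness function are illustrative placeholders, not the DAME MOGA:

```python
import random
from multiprocessing import Pool

# Toy fitness: the number of set bits. In a real rule-discovery GA this
# would be a rule-quality measure (e.g. support/confidence of an
# association rule), which is far more expensive to evaluate.
def fitness(individual):
    return sum(individual)

def evolve(pop_size=64, n_genes=32, generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    with Pool() as pool:
        for _ in range(generations):
            # Fitness evaluation is embarrassingly parallel: each
            # individual is scored independently on a worker process.
            scores = pool.map(fitness, pop)
            ranked = [ind for _, ind in
                      sorted(zip(scores, pop), key=lambda t: -t[0])]
            parents = ranked[: pop_size // 2]          # truncation selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, n_genes)
                child = a[:cut] + b[cut:]              # one-point crossover
                if rng.random() < 0.05:                # occasional mutation
                    i = rng.randrange(n_genes)
                    child[i] ^= 1
                children.append(child)
            pop = parents + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    print(evolve())
```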

The DAME Program (see Section 3.1) has already started the investigation on the design and implementation of a hierarchical parallel genetic algorithm, implemented on the multi-core Graphics Processing Units (GPUs) provided by NVIDIA, using the Compute Unified Device Architecture (CUDA) parallel programming SDK. CUDA is a platform for massively parallel high-performance computing on NVIDIA's GPUs. At its core are three key abstractions: (a) a hierarchy of thread groups, (b) shared memories, and (c) barrier synchronization, which are exposed to the programmer as a minimal set of language extensions. These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism.
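The three abstractions can be illustrated with a classic block-level parallel reduction. The sketch below uses the Numba CUDA bindings for Python purely for illustration (an assumption: the actual work targets the CUDA SDK directly) and requires an NVIDIA GPU:

```python
import numpy as np
from numba import cuda, float32

THREADS = 128  # threads per block: the "thread group" level of the hierarchy

@cuda.jit
def block_sum(x, partial):
    # (a) thread hierarchy: threadIdx/blockIdx locate a thread in block/grid
    tid = cuda.threadIdx.x
    i = cuda.blockIdx.x * cuda.blockDim.x + tid
    # (b) shared memory: a fast on-chip buffer visible to the whole block
    buf = cuda.shared.array(THREADS, dtype=float32)
    buf[tid] = x[i] if i < x.size else 0.0
    # (c) barrier synchronization: wait until every thread has filled buf
    cuda.syncthreads()
    stride = THREADS // 2
    while stride > 0:                 # tree reduction within the block
        if tid < stride:
            buf[tid] += buf[tid + stride]
        cuda.syncthreads()
        stride //= 2
    if tid == 0:
        partial[cuda.blockIdx.x] = buf[0]

x = np.random.rand(1 << 20).astype(np.float32)
blocks = (x.size + THREADS - 1) // THREADS
partial = np.zeros(blocks, dtype=np.float32)
block_sum[blocks, THREADS](x, partial)   # launch a grid of `blocks` blocks
print(partial.sum(), "~", x.sum())       # equal to float32 precision
```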

According to the above concepts, we intend to contribute to the DQ control of the SGS data warehouse by introducing and managing a specific Work Package (WP) devoted to the quality control of the data incoming to the SGS. This could be achieved by providing a DQM framework able to support data quality management within each system team involved in data handling (in terms of storage, retrieval and maintenance). To this end, we can provide both software and specific hardware infrastructures, located at the official EC SDCs in order to minimize the data flow requirements, and available to end users via web interfaces.

The DQ WG should interact closely with many other WGs, since it needs input from many different teams. A detailed WBS for DQ and DQM management will be provided in the near future and will be progressively adjusted to match the evolution of the project organization structure.

Classification Problems

“Oh Be A Fine Girl Kiss Me (Right Now Sweetheart)”

Henry Norris Russell.

In this chapter I present three different problems encountered during my PhD, which are gathered together because they were approached with the same data mining functionality: classification (see Sec. 2.1). Classification problems are among the most relevant and common problems in astronomy.

A typical, and crucial, example is the extraction of scientifically useful information from wide field astronomical images (both photographic plates and CCD frames), the recognition of the objects against a noisy background (a problem which in image processing is also known as image segmentation), and their classification into unresolved (star-like) and resolved (galaxy-like) objects.

This chapter is structured as follows:

In Section 4.1 I show how the use of neural networks for object classification, together with the novel SPREAD_MODEL parameter, pushes the possibility of performing a reliable star/galaxy separation down to the limiting magnitude. In Section 4.2 I present an application to the identification of candidate Globular Clusters in deep, wide-field, single band HST images on a small dataset, while in Section 4.3 I show an application to the detection of Active Galactic Nuclei on a medium-sized dataset.

4.1 Comparison of source extraction software

This section is largely extracted from: Annunziatella, M.; Mercurio, A.; Brescia, M.; Cavuoti, S.; Longo, G.; 2013, Inside catalogs: a comparison of source extraction software, PASP, 125, 68.

In order to obtain a better understanding of the data, part of my work was to compare the catalog extraction performance obtained using the new combination of SExtractor with PSFEx against the more traditional and widespread application of DAOPHOT with ALLSTAR; therefore, the present section may provide a guide for the selection of the most suitable catalog extraction software. Both software packages were tested on two kinds of simulated images having, respectively, a uniform spatial distribution of sources and an overdensity in the center. In both cases, SExtractor is able to generate a deeper catalog than DAOPHOT. Moreover, the use of neural networks for object classification, together with the novel SPREAD_MODEL parameter, pushes the possibility of star/galaxy separation down to the limiting magnitude. DAOPHOT and ALLSTAR provide an optimal solution for point-source photometry in stellar fields and very accurate and reliable PSF photometry, with robust star/galaxy separation. However, they are not useful for galaxy characterization and do not generate catalogs that are very complete for faint sources. On the other hand, SExtractor, with its new capability to derive PSF photometry, turns out to be competitive and returns accurate photometry for galaxies as well. We can assess that the new version of SExtractor, used in conjunction with PSFEx, represents a very powerful software package for source extraction, with performance comparable to that of DAOPHOT. Finally, by comparing the results obtained in the case of a uniform and of an overdense spatial distribution of stars, we notice, for both software packages, a decline in the quality of the results produced in terms of magnitudes and centroids in the latter case.
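In practice, once SExtractor has been run with a PSFEx model, the SPREAD_MODEL-based star/galaxy separation reduces to a simple cut on the output catalog. The following minimal Python sketch illustrates the idea; the catalog file name is hypothetical and the threshold is only a commonly quoted order of magnitude that must be calibrated on each data set:

```python
import numpy as np
from astropy.io import fits

# Illustrative threshold: point sources have SPREAD_MODEL close to zero,
# extended sources have larger values.
SPREAD_CUT = 0.005

# Hypothetical SExtractor catalog written as a FITS table.
with fits.open("final_catalog.fits") as hdul:
    cat = hdul[1].data

spread = cat["SPREAD_MODEL"]
star_like = np.abs(spread) < SPREAD_CUT
print(f"{star_like.sum()} star-like and {(~star_like).sum()} extended sources")
```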

When extracting a catalog of objects from an astronomical image, the main aspects to take into account are: to detect as many sources as possible; to minimize the contribution of spurious objects; to correctly separate sources into their classes (e.g. star/galaxy classification); to produce accurate measurements of photometric quantities; and, finally, to obtain accurate estimates of the positions of the centroids of the sources.
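The first two aspects are commonly quantified as completeness (the fraction of true sources that are recovered) and reliability (the fraction of detections that correspond to true sources rather than spurious objects). Since simulated images are used here, the truth list is known and both quantities can be measured by positional cross-matching; the following Python sketch, with an assumed matching tolerance and toy coordinates, shows one way to do it:

```python
import numpy as np
from scipy.spatial import cKDTree

def completeness_reliability(true_xy, det_xy, tol=1.0):
    # Nearest-neighbour match of each detection to the simulated input
    # list, within `tol` pixels (unmatched queries return dist = inf).
    tree = cKDTree(true_xy)
    dist, idx = tree.query(det_xy, distance_upper_bound=tol)
    matched = dist < tol
    n_true_recovered = np.unique(idx[matched]).size
    completeness = n_true_recovered / len(true_xy)
    reliability = matched.sum() / len(det_xy)   # 1 - spurious fraction
    return completeness, reliability

# Toy usage with fake pixel coordinates: 450 recovered sources with
# small centroid errors, plus 30 purely spurious detections.
rng = np.random.default_rng(0)
truth = rng.uniform(0, 1000, size=(500, 2))
detections = np.vstack([truth[:450] + rng.normal(0, 0.2, (450, 2)),
                        rng.uniform(0, 1000, size=(30, 2))])
print(completeness_reliability(truth, detections))
```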

Among the main source extraction software packages used by the astronomical community are SExtractor (Bertin & Arnouts, 1996) and DAOPHOT II (Stetson, 1987), the latter often used in combination with its companion tool ALLSTAR (Stetson, 1994). SExtractor is commonly used in extragalactic astronomy and has been designed to extract a list of measured properties from images, for both stars and galaxies. DAOPHOT and ALLSTAR were designed to perform mainly stellar photometry. For a long time, only DAOPHOT II, used together with ALLSTAR, was able to produce accurate photometry for stellar objects using the Point Spread Function (PSF) fitting technique, while PSF fitting photometry in SExtractor has become possible only in recent years. The first attempt was in the late 1990s, when the PSFEx (PSF Extractor) software package became available within the TERAPIX consortium. This tool extracts precise models of the PSF from images processed by SExtractor. Only after 2010, through the public release of PSFEx (Bertin, 2011)1, and with the recent evolution of computing power, has PSF fitting photometry become fully available in SExtractor.

1Available at http://www.astromatic.net/software/psfex.
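The resulting SExtractor/PSFEx interplay works in three steps: a first SExtractor pass produces a catalog of source vignettes, PSFEx fits a PSF model to it, and a second SExtractor pass uses that model for PSF-fitting photometry. The Python sketch below is only a hedged outline of this loop: the executable names (sex, psfex) and configuration parameters follow the AstrOmatic documentation, while all file names are hypothetical:

```python
import subprocess

# 1. First SExtractor pass: save a FITS_LDAC catalog with image
#    vignettes around the detected sources, as required by PSFEx.
subprocess.run(["sex", "image.fits", "-c", "default.sex",
                "-CATALOG_TYPE", "FITS_LDAC",
                "-CATALOG_NAME", "prepsfex.cat"], check=True)

# 2. PSFEx builds a (possibly position-dependent) PSF model from it.
subprocess.run(["psfex", "prepsfex.cat", "-c", "default.psfex"],
               check=True)

# 3. Second SExtractor pass: PSF-fitting photometry (e.g. MAG_PSF and
#    SPREAD_MODEL) using the model produced in step 2.
subprocess.run(["sex", "image.fits", "-c", "default.sex",
                "-PSF_NAME", "prepsfex.psf",
                "-CATALOG_NAME", "final.cat"], check=True)
```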

The scope of this section is to compare the results obtained using the combination of SExtractor with PSFEx against those of DAOPHOT with ALLSTAR, focusing in particular on the completeness and reliability of the extracted catalogs, on the accuracy of the photometry, and on the determination of the centroids, both with aperture and PSF-fitting photometry. A previous comparison among extraction software tools was performed by Becker et al. (2007), who, in pursuit of the LSST science requirements, compared DAOPHOT, two versions of SExtractor (2.3.2 and 2.4.4, respectively), and DoPhot (Mateo & Schechter, 1989). However, differently from the present work, where simulations are used, they took as “true” values the measurements obtained with the SDSS imaging pipeline photo (Lupton et al., 2001).

Furthermore, we wish to stress that their results were biased by the fact that in 2007 the PSF fitting feature had not yet been implemented in SExtractor.

The present work performs, for the first time, a comparison between DAOPHOT and SExtractor PSF photometry, providing a guide for the selection of the most suitable catalog extraction software packages.

The simulations used for the comparison are described in Sect. 4.1.1. The main input parameters of the software packages are overviewed in Sect. 4.1.2, and the adopted values are specified in Sect. 4.1.3. The results obtained are shown in Sect. 4.1.4. In order to better evaluate the performance of both software packages on crowded fields, in Sect. 4.1.5 we describe a test performed on an image showing an overdensity in the center. Finally, the results are summarized in Sect. 4.1.6, together with our conclusions.