

3.3 Euclid Data Quality

3.3.2 Technical Aspects of Data Quality

More technically speaking, in a warehouse based on intensive and diversified engineering and scientific data governance, such as Euclid, errors and inconsistencies may arise (in raw images and data, metadata, processed data) in terms of:

• Blank, NaN and Null values;

• Hidden relations, missing or bad (in terms of their ontology) foreign keys;

• Wrong or out-of-range values and constants;

• Unknown values (symbols);

• Misleading instrument signature parameters and/or metadata;

• Duplicate fields, parameters, matrix rows and columns, values;

Such quality bugs are very general, but they represent a good starting point to clarify the concept of general DQ in a data warehouse. Of course, they can be extended and refined over both the different data types (images, telemetry data, metadata, etc.) and the project definition time (especially according to the Euclidisation process). Samples of requirements/targets for a first-stage DQ framework are the following:

• Quickly Browsing Data Structures: the goal is to understand metadata for columns (size, comment) and table keys, including mechanisms to detect and identify missing or erroneously documented data;

• Getting an Overview of Database Content: the goal is to get statistics regarding data volumes. This feature familiarizes the user with a database, regardless of its complexity, without the need for a guide. It also lets the user detect some modeling errors, such as the absence of keys.

• Do Columns Contain Null or Blank Values?: the goal is to detect null and empty values in the data volume and to clean them from the data (a minimal sketch of such checks is given after this list).

• About Redundant Values in a Column: the goal is to verify that there is one and only one record for each value in the loaded column.

• Is a Max/Min Value for a Column Expected?: the goal is to quickly detect data anomalies. It should also offer functionalities to set data thresholds controlling max/min values.

• What is the best data sample?: given a user-customized function (mathematical, statistical, empirical, etc.), it should be possible to locate the winning datum in the volume.

• Using Statistics: options to get a quick, global view of the frequency of the data and to identify potential errors or misleading information inside it. It can also provide frequency statistics which let the user identify the most frequent values in the data, or check advanced statistical indicators for a more complex analysis.

• Analyzing Intervals in Numeric Data: the goal is to get a frequency table with a specified aggregation of interval values. It can also create frequency tables for numerical columns by specifying the desired number of bins.

• Targeting Intervals: in a commercial context, the goal would be to understand customer distribution to better target advertising; here it enables showing the distribution of any numerical variable, detecting its min and max as well as the most frequent data intervals.

• Identifying and Correcting Bad Data: of course, this requires some prerequisites, such as an understanding of pattern usage and its standardization (Euclidisation, VO-like or other).

• Getting a Column Pattern: the goal is to identify a data column structure and/or pattern. After having defined a pattern frequency indicator, column values can be grouped in intervals.

• Detecting Keys in Tables: to discover all candidate keys in a table. This could be used as information to optimize data distribution.

• Are There Duplicate Records in Data?: the goal is to detect duplicate records and clean the data from them, possibly starting from prerequisites based upon a default/custom correlation analysis. This is in principle a crucial function, due to the huge amount of data to be collected inside the Euclid data centres.

• Column Comparison Analysis: the goal is to check that no redundancy exists among columns and to optimize the space reserved for data, for example to ensure that values used in a table are present in the related dimension tables, or to discover new foreign keys.

• Recursive Relationships: this functionality is connected with the exploration of possible relations inside the data that can occur recursively at different levels of the data processing system.

• Creating a Report: the DQ system should be able to summarize any created and performed analysis, also specifying the desired level of detail. Of course, its prerequisite is the creation of a DQ database (TBD).

• Monitoring: monitoring helps supervise and control data quality over time. It also allows tracking the duration of an analysis and identifying those that take longer than normal. The monitoring can include:

– Reporting on Data Quality: monitor how the data quality indicators evolve over time. Its prerequisite is that the scheduled reports are executed regularly on the data warehouse. With a report on changes in data quality, it is easier to detect any change in the data without re-running the entire set of analyses one after the other.

– Tracking Data: track development and revise old data. This feature lets users control data quality over time.

– Alert: the DQ system should generate alerts when certain values fall outside a range defined by custom rules or thresholds. This feature can be automated and concluded by generating a report.

• Mining Rules for Data Quality procedures: the goal is to provide data exploration, analysis and mining rules in order to control and/or test the accuracy of the data and to help decision making. These rules can in principle be based on deterministic, statistical or machine learning methods able to find correlations.
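As a purely illustrative sketch of some of the first-stage checks listed above (null/blank detection, duplicate records, min/max thresholds, interval analysis and a simple alert), the following Python fragment uses pandas on a toy table; the column names, values and limits are assumptions made only for the example, not part of the Euclid schema.

```python
import pandas as pd

# Toy stand-in for a catalogue extract; column names, values and limits are
# illustrative assumptions, not the actual Euclid schema.
df = pd.DataFrame({
    "source_id": [1, 2, 2, 4, 5],
    "ra":        [10.5, 370.0, 42.1, None, 180.0],   # one out-of-range, one null
    "dec":       [-12.3, 5.0, 5.0, 88.9, -95.0],     # one out-of-range
    "mag_vis":   [21.4, 23.1, 23.1, 19.8, 25.0],
    "flag":      ["ok", "", "ok", "ok", "ok"],        # one blank value
})

report = {}

# 1. Null / blank values per column ("Do Columns Contain Null or Blank Values?")
report["null_counts"] = df.isna().sum()
report["blank_counts"] = (
    df.select_dtypes(include="object").apply(lambda c: c.str.strip().eq("").sum())
)

# 2. Duplicate records ("Are There Duplicate Records in Data?")
report["duplicate_ids"] = int(df["source_id"].duplicated().sum())

# 3. Min/max thresholds ("Is a Max/Min Value for a Column Expected?")
limits = {"ra": (0.0, 360.0), "dec": (-90.0, 90.0), "mag_vis": (10.0, 30.0)}
report["out_of_range"] = {
    col: int(((~df[col].between(lo, hi)) & df[col].notna()).sum())
    for col, (lo, hi) in limits.items()
}

# 4. Frequency table on binned intervals ("Analyzing Intervals in Numeric Data")
report["mag_vis_bins"] = pd.cut(df["mag_vis"], bins=3).value_counts().sort_index()

# 5. Simple alert rule (cf. the "Alert" monitoring item)
if report["out_of_range"]["ra"] > 0:
    print("ALERT: right ascension values outside the expected range")

for name, value in report.items():
    print(f"--- {name} ---")
    print(value)
```

In a real framework, each of these checks would of course be driven by configurable rules and would feed the DQ report and monitoring features described above.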

The last element was intentionally placed at the end of the above list, because it represents the most complex, but at the same time the most interesting and powerful, functionality for data quality control in a PB-scale project such as Euclid. It directly connects the flexibility of potential customized data quality analyses with the powerful methods usually employed in Data Mining (DM), or equivalently in Knowledge Discovery in Databases (KDD). Their inclusion into the data quality domain permits overcoming, or at least optimizing, the usual DQ analysis rules, which are based on the following (a minimal sketch of such a rule is given after the list):

• Image (science & calibration) analysis

– Detection of abnormal patterns

– Anomalous instrumental signature

• Column analysis

– Simple analysis

– Simple statistics evolution

– Frequency statistics

– Pattern matching

• Overview analysis

– Connection analysis

– Catalogue analysis

– Scheme analysis

• Redundancy analysis

• Duplication analysis
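As a minimal sketch of the kind of statistical mining rule mentioned above, the fragment below flags outliers in a numeric column with a robust MAD-based criterion and probes for strongly correlated column pairs (a naive form of hidden-relation detection); the column names, the injected anomalies and the 5-sigma cut are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

def mad_outliers(values: pd.Series, n_sigma: float = 5.0) -> pd.Series:
    """Flag values deviating from the median by more than n_sigma robust sigmas."""
    med = values.median()
    mad = (values - med).abs().median()
    robust_sigma = 1.4826 * mad          # MAD -> Gaussian sigma conversion
    if robust_sigma == 0:
        return pd.Series(False, index=values.index)
    return (values - med).abs() > n_sigma * robust_sigma

# Hypothetical columns; a real rule would be defined on actual Euclid parameters.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "flux": rng.normal(100.0, 5.0, 1000),
    "background": rng.normal(10.0, 1.0, 1000),
})
df.loc[df.index[::200], "flux"] = 500.0   # inject a few artificial anomalies

flags = mad_outliers(df["flux"])
print(f"flagged {int(flags.sum())} anomalous flux values")

# A very simple 'hidden relation' probe: report strongly correlated column pairs
# (empty output here, since the toy columns are generated independently).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(upper[upper > 0.8])
```

In production, such deterministic or statistical rules would be complemented by machine learning methods trained on real instrument and catalogue data.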

Here we would like to mention the technical data requirements, as they arise from the SRS document of the mission, that we intend to take strongly into consideration throughout the DQ analysis and design.

Concerning the general data processing and archiving system specifications, there are important issues to be verified and validated by DQ systems, directly coming from the requirements:

• All information stored in the official catalogue database should be accessible through standard SQL (or equivalent) queries (an illustrative query sketch is given after this list);

• It should be possible, through SQL queries, to search and retrieve all Euclid stacked images and/or individual exposures;

• The pipelined data should always be composed of catalogues of objects (selected by photometry, weak lensing, spectroscopy in wide or deep surveys) within a standard queryable database. For instance, for the cluster catalogues, the database should always include consistent information, such as sky coordinates (RA/Dec), best estimate of redshift (either spectroscopic or photometric), richness, S/N ratio, number of objects, velocity dispersion, magnitude in different bands and morphology;

• The data analysis should include source detection and photometry extraction for time-series data, including DQ flags used for validation;

• The results of any data analysis should be delivered in a queryable database providing master variable source catalogues as well as all relevant metadata;

• The database should include verification and validation tools to check the presence and consistency of periodogram spectral densities for periodic variables, flags indicating object variability, and other particular features.
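As a purely illustrative rendering of the catalogue-access requirements above, the snippet below runs a standard SQL query against a toy in-memory database; the table and column names (cluster_catalogue, ra, dec, z_best, richness, snr) are assumptions, not the actual Euclid schema.

```python
import sqlite3

# Build a tiny in-memory stand-in for a cluster catalogue; names and values
# are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cluster_catalogue (ra REAL, dec REAL, z_best REAL, "
    "richness REAL, snr REAL)"
)
conn.executemany(
    "INSERT INTO cluster_catalogue VALUES (?, ?, ?, ?, ?)",
    [
        (150.12, 2.21, 0.45, 35.0, 8.2),
        (150.80, 2.05, 1.10, 20.0, 4.1),   # below the S/N cut
        (151.33, 1.90, 0.90, 50.0, 12.7),
    ],
)

# The kind of standard SQL query the requirements foresee for catalogue access.
query = """
SELECT ra, dec, z_best, richness, snr
FROM   cluster_catalogue
WHERE  snr > 5.0
  AND  z_best BETWEEN 0.2 AND 1.5
ORDER BY snr DESC
LIMIT 10;
"""

for row in conn.execute(query):
    print(row)

conn.close()
```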