An attempt to infer some useful cuts of the outliers objects

zspecClass   calib_depth_mag
0            407 (12%)
1            761 (22%)
2            1,938 (54%)
3            406 (12%)

tables 3.4, 3.5, 3.7, the new parameter space including magnitudes and colors and the new split percentages, together with the new decay parameter fixed at 0.1, led to better performance not only in terms of σ₆₈ but also, as far as the spec-z statistics on the PDF are concerned, in terms of the percentage of samples whose spec-z falls within 1 bin of the PDF peak, which increased from 27% to 34%.

3.6 An attempt to infer some useful cuts of the outliers objects

This section describes a series of steps aimed at gaining a deeper insight into the features of the PDFs of the test set, for which the spectroscopic information is available. The approach was to divide the samples of the test set into outliers and non-outliers, this time using the best-estimate photo-z as calculated by the PDF algorithm (cf. Sec. 2.4). Recall that this value does not necessarily correspond to photo-z⁰, i.e. to the photo-z estimate for the unperturbed test set. The division follows the well-known normalized conditions:

• |best-estimate photo-z − spec-z| / (1 + spec-z) > 0.15 for outliers, which are:

#326/3512 (∼9%)

• |best-estimate photo-z − spec-z| / (1 + spec-z) < 0.15 (see footnote 2) for non-outliers, which are:

#3186/3512 (∼91%)
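The two conditions above can be sketched in Python as follows. This is only an illustration, not the code used in this work: the function name `flag_outliers`, the array names and the toy values are assumptions; only the normalized residual and the 0.15 threshold come from the text.

```python
import numpy as np

def flag_outliers(photo_z, spec_z, threshold=0.15):
    """Return a boolean mask, True where |photo_z - spec_z| / (1 + spec_z) > threshold."""
    photo_z = np.asarray(photo_z, dtype=float)
    spec_z = np.asarray(spec_z, dtype=float)
    norm_residual = np.abs(photo_z - spec_z) / (1.0 + spec_z)
    return norm_residual > threshold

# Toy values (hypothetical, not taken from the catalog)
best_estimate = np.array([0.50, 1.20, 0.30, 2.00])
spec = np.array([0.52, 0.60, 0.31, 1.00])

is_outlier = flag_outliers(best_estimate, spec)
outlier_fraction = is_outlier.mean()  # fraction of flagged objects
```

On the real test set, the same mask would give the 326 outliers and 3186 non-outliers quoted above.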

This division of the samples into outliers and non-outliers was combined with some PDF features (such as pdfWidth, pdfNBins, and so on), already described in Sec. 2.5.2, in order to look for correlations and/or specific trends among outliers, non-outliers, and these features.

The goal was also to find useful cuts to apply to the data in order to remove most of the outliers, for which the estimate of the PDF is unreliable, while preserving, at the same time, a sizeable number of non-outliers.

2 The 0.15 value is commonly adopted in the photo-z literature to define outliers and was initially derived from simulations.

Table 3.7 Number of objects per zspecClass in the two subsets of outliers and non-outliers for the test set of the calib_depth_mag catalog. *The first percentage refers to the test subset (outliers or non-outliers), the second to the whole test set.

zspecClass   Outliers (#326/3512)   Non-Outliers (#3186/3512)
0            0                      407 (13%-11%)*
1            0                      761 (24%-22%)*
2            218 (67%-6%)*          1,720 (54%-49%)*
3            108 (33%-3%)*          298 (9%-8%)*

Table 3.8 Statistics of the 4 indicators of the PDF features quoted above.

                    Outliers                      Non-Outliers
PDF features        mean   SD     MIN   MAX       mean   SD     MIN   MAX
pdfWidth            1.63   1.11   0.04  4.34      0.61   0.69   0.02  4.44
pdfNBins            34.41  18     3     78        18.27  12.21  2     75
pdfPeakHeight       0.13   0.14   0.03  0.84      0.23   0.4    0.03  0.93
pdfNearPeakWidth    0.24   0.17   0     0.78      0.23   0.12   0     0.72

The conditions/cuts that identify objects with a reliable PDF (i.e. with values of the PDF features chosen so as to minimize the outliers) will be used to flag the “verif” samples (for which, we recall, no spectroscopic information is available) as “useful”, i.e. endowed with a reliable PDF: these are the samples whose PDF features fulfil the same conditions/cuts found for the test set PDFs. A reliable PDF for the verif catalog, to be returned for the challenge, is flagged with a 1 in the column “USE” (see Sec. 3.2.1).

First of all, table 3.7 reports the number of objects per zspecClass for outliers and non-outliers.

As expected, there are no outliers of zspecClass 0 and 1. As anticipated above, as far as the reliability of the PDF is concerned, we decided to introduce some new indicators (PDF features), among which we expected a certain degree of correlation to exist: pdfWidth, pdfNBins, pdfPeakHeight and pdfNearPeakWidth (already defined in Sec. 2.5.2). The statistics of these 4 parameters are given in table 3.8 for outliers and non-outliers.
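A breakdown per zspecClass and a per-population summary of one PDF feature (mean, SD, MIN, MAX) could be computed along the following lines. This is a hedged sketch: the function names `class_breakdown` and `summarize`, the toy arrays, and the choice of the population standard deviation (`ddof=0`) are assumptions, since the text does not specify how the SD was computed.

```python
import numpy as np

def class_breakdown(zspec_class, is_outlier):
    """Count objects per zspecClass among outliers and non-outliers."""
    zspec_class = np.asarray(zspec_class)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    table = {}
    for cls in np.unique(zspec_class):
        in_class = zspec_class == cls
        table[int(cls)] = {
            "outliers": int(np.sum(in_class & is_outlier)),
            "non_outliers": int(np.sum(in_class & ~is_outlier)),
        }
    return table

def summarize(feature, is_outlier):
    """Mean, SD, min and max of one PDF feature for each population."""
    feature = np.asarray(feature, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    stats = {}
    for name, mask in (("outliers", is_outlier), ("non_outliers", ~is_outlier)):
        values = feature[mask]
        stats[name] = {
            "mean": float(values.mean()),
            "sd": float(values.std()),  # population SD (ddof=0): an assumption
            "min": float(values.min()),
            "max": float(values.max()),
        }
    return stats

# Toy data (hypothetical, not taken from the catalog)
classes = np.array([0, 1, 2, 2, 3, 3])
outlier = np.array([False, False, True, False, True, False])
width = np.array([0.2, 0.4, 2.0, 0.6, 1.0, 0.8])

counts = class_breakdown(classes, outlier)
width_stats = summarize(width, outlier)
```

Both functions assume that each population contains at least one object.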

Looking at the mean values of the PDF features in table 3.8, we can see the differences between the two populations: as expected, outliers have wider PDFs, with a higher number of bins (intervals of width ∆z = 0.02 in which the PDF is non-zero) and lower peaks than the non-outliers.
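As an illustration of the bin count described above, the sketch below counts the non-zero bins of a binned PDF. Reading pdfPeakHeight as the value of the highest bin is an assumption made here for illustration; the actual definitions are those of Sec. 2.5.2. The function names and the two toy PDFs are hypothetical.

```python
import numpy as np

def pdf_n_bins(pdf):
    """Number of bins (each of width dz = 0.02 in the text) where the PDF is non-zero."""
    return int(np.count_nonzero(np.asarray(pdf, dtype=float)))

def pdf_peak_height(pdf):
    """Value of the highest bin (assumed reading of pdfPeakHeight)."""
    return float(np.max(np.asarray(pdf, dtype=float)))

# Two hypothetical binned PDFs over the same redshift bins
narrow = np.array([0.0, 0.1, 0.8, 0.1, 0.0, 0.0])
broad = np.array([0.1, 0.2, 0.2, 0.2, 0.2, 0.1])

narrow_features = (pdf_n_bins(narrow), pdf_peak_height(narrow))
broad_features = (pdf_n_bins(broad), pdf_peak_height(broad))
```

The broad toy PDF occupies more bins and has a lower peak than the narrow one, which is the qualitative behaviour reported for outliers in table 3.8.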


3.6.1 Outlier cuts

Figures 3.2, 3.3 and 3.5 show the scatter plots for the following pairs of parameters used to evaluate the reliability of the PDF (see previous section).

Precisely:

1. Figure 3.2: scatter plot of pdfNBins vs pdfWidth. We expected a strong correlation between these two parameters, with the outliers located in the region of higher pdfWidth and pdfNBins; to a certain extent, this is indeed visible in figure 3.2. Several different cuts were tried in order to remove most of the outliers and to see whether recalculating the normalized statistical parameters on the test set led to an improvement. In figure 3.2, the red straight line corresponds to:

$$\mathrm{pdfNBins} - 35 \times \mathrm{pdfWidth} + 35 = 0 \tag{3.5}$$

By keeping the objects for which the left-hand side of Eq. 3.5 is ≥ 0, we removed 27% of the outliers (2% of the whole sample);

Fig. 3.2 pdfNBins vs pdfWidth.

2. Figure 3.3: scatter plot of pdfNearPeakWidth vs pdfWidth. We can note that ∼39% of the outliers (4% of the whole sample) lie under the parabolic branch whose equation is:

$$\mathrm{pdfNearPeakWidth} - 0.199 \times \sqrt{\mathrm{pdfWidth}} = 0 \tag{3.6}$$

Therefore, by keeping the objects for which the left-hand side of Eq. 3.6 is ≥ 0, we kept a sizeable number of non-outliers while removing the quoted fraction of outliers. Moreover, removing the samples to the right of the vertical straight line

$$\mathrm{pdfWidth} = 2 \tag{3.7}$$

i.e. keeping pdfWidth < 2, removed 35% of the outliers (3% of the whole sample). Finally, removing the samples above the horizontal line

$$\mathrm{pdfNearPeakWidth} = 0.44 \tag{3.8}$$

i.e. keeping pdfNearPeakWidth < 0.44, removed 15% of the outliers (1% of the whole sample). These last two cuts (Eqs. 3.7 and 3.8) are more clearly visible in the distributions of the corresponding parameters, shown for both outliers and non-outliers in figure 3.4.

Fig. 3.3 pdfNearPeakWidth vs pdfWidth.

3. Figure 3.5: scatter plot of pdfPeakHeight vs pdfWidth. We expected a strong anti-correlation between these two parameters, although it is no longer visible above a certain width threshold. In any case, selecting the region between the black straight line

$$\mathrm{pdfPeakHeight} = 0.09 \tag{3.9}$$

and the hyperbolic branch

$$\mathrm{pdfPeakHeight} - \frac{0.13}{\mathrm{pdfWidth}} - 0.11 = 0 \tag{3.10}$$

i.e. requiring pdfPeakHeight > 0.09 and the left-hand side of Eq. 3.10 to be ≤ 0, removed most of the outliers (∼59%, 5% of the whole sample).
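The cuts of Eqs. 3.5 to 3.10 can be written as boolean "keep" masks, as in the sketch below. The coefficients and inequalities are those quoted in the list above; the function names `cut_masks` and `removed_fraction`, the dictionary keys and the toy values are assumptions introduced for illustration. The sketch also assumes pdfWidth > 0, since Eq. 3.10 divides by it.

```python
import numpy as np

def cut_masks(width, n_bins, near_peak_width, peak_height):
    """Boolean 'keep' masks for the cuts of Eqs. 3.5-3.10 (True = object kept)."""
    width = np.asarray(width, dtype=float)
    n_bins = np.asarray(n_bins, dtype=float)
    near_peak_width = np.asarray(near_peak_width, dtype=float)
    peak_height = np.asarray(peak_height, dtype=float)
    return {
        "eq3.5": n_bins - 35.0 * width + 35.0 >= 0,
        "eq3.6": near_peak_width - 0.199 * np.sqrt(width) >= 0,
        "eq3.7": width < 2.0,
        "eq3.8": near_peak_width < 0.44,
        # region between the line of Eq. 3.9 and the hyperbolic branch of Eq. 3.10
        "eq3.9_3.10": (peak_height > 0.09)
        & (peak_height - 0.13 / width - 0.11 <= 0),
    }

def removed_fraction(keep, is_outlier):
    """Fraction of the outliers that a cut rejects."""
    keep = np.asarray(keep, dtype=bool)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    return float(np.mean(~keep[is_outlier]))

# Three hypothetical objects (not taken from the catalog)
masks = cut_masks(
    width=[0.5, 3.0, 1.0],
    n_bins=[15, 40, 20],
    near_peak_width=[0.2, 0.1, 0.5],
    peak_height=[0.3, 0.05, 0.5],
)
is_outlier = np.array([False, True, True])
rejected = removed_fraction(masks["eq3.9_3.10"], is_outlier)
```

Applied to the real test set, `removed_fraction` would be the quantity quoted after each cut (for example 27% for Eq. 3.5).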


Fig. 3.4 Distribution of pdfWidth (top panel) and of pdfNearPeakWidth (bottom panel) for outliers and non-outliers in the test set: cutting the samples with pdfWidth higher than 2 and pdfNearPeakWidth higher than 0.44 is a compromise between keeping a sizeable number of non-outliers and removing most of the outliers.

Fig. 3.5 pdfPeakHeight vs pdfWidth.

All the cuts on the PDF features will be combined in several ways, and the statistics on the test set recalculated in the following section, in order to meet the requirements of the Euclid Data Challenge 2.
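One possible way to combine cuts and derive the “USE” flag is sketched below. Combining the selected cuts with a logical AND is an assumption made for illustration (the text only states that the cuts are combined in several ways); the function name `use_flag`, the mask names and the toy masks are hypothetical.

```python
import numpy as np

def use_flag(masks, selected):
    """AND together the selected 'keep' masks and return 0/1 flags."""
    stacked = [np.asarray(masks[name], dtype=bool) for name in selected]
    combined = np.logical_and.reduce(stacked)
    return combined.astype(int)

# Hypothetical 'keep' masks for four objects
masks = {
    "width": np.array([True, False, True, True]),
    "near_peak": np.array([True, True, False, True]),
    "peak_height": np.array([True, True, True, False]),
}

use_two = use_flag(masks, ["width", "near_peak"])
use_all = use_flag(masks, list(masks))
```

Because the masks depend only on the PDF features, the same function can be applied to objects without spectroscopic information, which is what the “USE” column of the verif catalog requires.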

From the document: Machine Learning based Probability Density Functions of photometric redshifts and their application to cosmology in the era of dark Universe (pages 70-75)