

4.3 Photometric AGN Classification

4.3.2 The experiments

CLASS                CATALOGUE     Exp.1    Exp.2    Exp.3
NotAGN               All           Class0
Type1                Sorrentino    Class1   Class1
Type2                Sorrentino    Class1   Class0
Mix-LINER            Kauffman      Class0
Mix-Seyfert          Kauffman      Class0
Pure-LINER           Kauffman      Class1            Class0
Pure-Seyfert         Kauffman      Class1            Class1
Mix-LINER-Type1      overlap       Class0
Mix-Seyfert-Type1    overlap       Class0
Pure-LINER-Type1     overlap       Class1   Class1   Class0
Pure-Seyfert-Type1   overlap       Class1   Class1   Class1
Mix-LINER-Type2      overlap       Class0
Mix-Seyfert-Type2    overlap       Class0
Pure-LINER-Type2     overlap       Class1   Class0   Class0
Pure-Seyfert-Type2   overlap       Class1   Class0   Class1
SIZE                 Sorrentino: 24293     84885    1570     30380
                     Kauffman: 88178

Table 4.9. The dataset composition after merging the original catalogues. Empty fields indicate the unused typologies. The division between class 0 and class 1 refers to the target vector (used during training). The final sizes of the three experiment datasets are obtained after matching with the D’Abrusco photo-z catalogue and the removal of all patterns containing NaN values.

2. Type 1 vs Type 2

3. Seyferts vs LINERs

The dataset for the first experiment is the whole dataset itself, with 84885 objects. The dataset for the second experiment contains only the objects that belong to the catalogue of Sorrentino and are pure AGN, resulting in 1570 objects. The last dataset contains the objects, belonging to the catalogue of Kauffman, that are pure AGN divided into LINERs and Seyferts, totalling 30380 objects. The datasets obtained are summarized in table 4.9.

• completeness of a class: the ratio between the number of correctly classified objects of that class and the total number of objects of that class in the dataset.

• contamination of a class: the dual of the pureness, i.e. the ratio between the misclassified objects assigned to a class and the total number of objects classified in that class.
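The estimators above can be written compactly. The following sketch (a hypothetical helper, assuming NumPy arrays of integer class labels) computes them for one class:

```python
import numpy as np

def class_metrics(y_true, y_pred, cls):
    """Completeness, pureness and contamination of class `cls`,
    following the definitions above."""
    in_cls = (y_true == cls)      # objects of that class in the dataset
    pred_cls = (y_pred == cls)    # objects classified in that class
    correct = np.sum(in_cls & pred_cls)
    completeness = correct / in_cls.sum()
    pureness = correct / pred_cls.sum()
    # contamination is the dual of the pureness
    return completeness, pureness, 1.0 - pureness
```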

Each experiment was performed by dividing the dataset into two parts: a train set containing 80% of the dataset and a test set containing the remaining 20%. All the results presented in the following sections are evaluated on the test set, which is never used to train the ML models.
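A minimal sketch of such an 80/20 random split (hypothetical helper, assuming the patterns and targets are NumPy arrays):

```python
import numpy as np

def split_train_test(X, y, train_frac=0.8, seed=0):
    """Randomly split the patterns into a train set (train_frac of the
    dataset) and a test set (the remainder)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle the object indices
    n_train = int(train_frac * len(X))
    train, test = idx[:n_train], idx[n_train:]
    return X[train], y[train], X[test], y[test]
```

With 84885 objects this fraction gives a 67908/16977 split.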

The dataset

In order to perform our experiments we feed our nets with the following groups of parameters (also referred to as input features):

1. petroR50_u

2. petroR50_g

3. petroR50_r

4. petroR50_i

5. petroR50_z

6. concentration index r

7. z_phot_corr

8. fibermag_r

9. (u − g) dered

10. (g − r) dered

11. (r − i) dered

12. (i − z) dered

13. dered_r

Parameters 1-5 are the radii which contain 50% of the Petrosian flux in each of the five SDSS bands.

The SDSS has adopted a modified form of the Petrosian (1976) system, by measuring galaxy fluxes within a circular aperture whose radius is defined by the shape of the azimuthally averaged light profile.

In order to define the Petrosian radius, it is first necessary to define the Petrosian ratio. The Petrosian ratio $\mathcal{R}_P$ at a radius $r$ from the center of an object is the ratio of the local surface brightness in an annulus at $r$ to the mean surface brightness within $r$ (Blanton et al., 2001; Yasuda et al., 2001):

$$
\mathcal{R}_P(r) = \frac{\displaystyle\int_{0.8r}^{1.25r} dr'\, 2\pi r'\, I(r') \,\Big/\, \left[\pi\left(1.25^2 - 0.8^2\right) r^2\right]}{\displaystyle\int_{0}^{r} dr'\, 2\pi r'\, I(r') \,\Big/\, \left(\pi r^2\right)} \qquad (4.9)
$$

where $I(r)$ is the azimuthally averaged surface brightness profile.

The Petrosian radius $r_P$ is defined as the radius at which $\mathcal{R}_P(r_P)$ equals some specified value $\mathcal{R}_{P,\mathrm{lim}}$, set to 0.2 in our case. The Petrosian flux in any band is then defined as the flux within a certain number $N_P$ (equal to 2.0 in our case) of Petrosian radii:

$$
F_P = \int_{0}^{N_P r_P} 2\pi r'\, dr'\, I(r')
$$
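As an illustration, eq. (4.9) and the $r_P$ condition can be evaluated numerically; the profile used below is an assumed toy exponential disc, not SDSS data, and all names are hypothetical:

```python
import numpy as np

def _integrate(f, a, b, n=4000):
    """Simple trapezoidal integration of f on [a, b]."""
    x = np.linspace(a, b, n)
    y = f(x)
    return float(((y[:-1] + y[1:]) / 2 * np.diff(x)).sum())

def petrosian_ratio(I, r):
    """R_P(r) of eq. (4.9): surface brightness in the annulus
    [0.8r, 1.25r] over the mean surface brightness within r."""
    annulus = _integrate(lambda x: 2 * np.pi * x * I(x), 0.8 * r, 1.25 * r) \
        / (np.pi * (1.25**2 - 0.8**2) * r**2)
    mean = _integrate(lambda x: 2 * np.pi * x * I(x), 0.0, r) / (np.pi * r**2)
    return annulus / mean

def petrosian_radius(I, lim=0.2, lo=0.01, hi=50.0):
    """r_P such that R_P(r_P) = lim (0.2 in our case), by bisection;
    assumes R_P decreases monotonically with r."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if petrosian_ratio(I, mid) > lim:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For an exponential profile I(r) = exp(−r/r_e) the ratio falls below 0.2 at a few scale lengths.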

The concentration index, parameter 6, is the ratio between the radii that contain 50% and 90% of the total flux from the object, under the approximation of a symmetric source; it is a good indicator of how much the flux is concentrated in the center of the source.

The photometric redshift, parameter 7, is the redshift derived from the catalogue of D’Abrusco et al. (2007).

The fiber magnitude in the r-band, parameter 8, corresponds to the flux contained within the 3" fiber aperture.

Parameters 9 to 12 are dereddened colors, while parameter 13 is the dereddened r-band magnitude.
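As an illustration, the dereddened colors (parameters 9-12) can be built from the observed magnitudes once the Galactic extinction correction per band is known; the function and dictionary names here are hypothetical:

```python
def dereddened_colors(mag, ext):
    """mag, ext: dicts of observed magnitude and Galactic extinction
    per SDSS band (u, g, r, i, z). Returns the four adjacent-band
    dereddened colors (u-g, g-r, r-i, i-z)."""
    dered = {b: mag[b] - ext[b] for b in "ugriz"}  # dereddened magnitudes
    pairs = [("u", "g"), ("g", "r"), ("r", "i"), ("i", "z")]
    return [dered[a] - dered[b] for a, b in pairs]
```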

Unmeasured parameters, usually reported as “Not a Number” or NaN, can mislead our models, so we remove every object that has at least one unmeasured parameter.
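This NaN pruning can be sketched as follows (assuming the patterns sit in a NumPy array with one object per row):

```python
import numpy as np

def drop_nan_patterns(X):
    """Remove every object (row) with at least one unmeasured (NaN) feature."""
    keep = ~np.isnan(X).any(axis=1)  # rows that are fully measured
    return X[keep]
```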

Experiment Nr. 1 - AGN Classification

Concerning the first of the three mentioned experiments, i.e. the classification between AGN and not AGN, the described ML models are fed with a target vector whose values are labeled as 1 for each object above the Kewley line (pure AGN) and 0 for the objects below it (mixed zone and certainly not AGN), resulting in 84885 objects after the removal of the patterns affected by NaN values. According to the mentioned strategy, the train set (80% of the whole dataset) contains 67908 patterns, while the test set (20% of the dataset) contains 16977 patterns.
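This labeling can be sketched as below, assuming the standard form of the Kewley et al. (2001) demarcation line in the BPT plane; the function name and array arguments are hypothetical (in practice the target vector was built from the catalogue classifications of table 4.9):

```python
import numpy as np

def kewley_target(log_nii_ha, log_oiii_hb):
    """1 = above the Kewley line (pure AGN), 0 = below (mixed / not AGN).
    Standard form assumed: y = 0.61 / (x - 0.47) + 1.19."""
    x = np.asarray(log_nii_ha, dtype=float)
    y = np.asarray(log_oiii_hb, dtype=float)
    with np.errstate(divide="ignore"):
        line = 0.61 / (x - 0.47) + 1.19
    # right of the asymptote (x >= 0.47) every object lies above the line
    return np.where((x >= 0.47) | (y > line), 1, 0)
```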

The best result has been obtained by the MLP with the Quasi Newton learning rule.

The following values summarize the validation performances after the training for the best experiment.

• total efficiency: e = 75.99%

• AGN pureness: p_AGN = 71.38%

• AGN completeness: c_AGN = 55.64%

• not-AGN completeness: c_mix = 87.44%

• AGN contamination: 1 − p_AGN = 28.62%

Figure 4.21. Ratio of correctly classified AGN to misclassified not-AGN, first half of the Test Set.

The percentage of false positives that we know spectroscopically to be surely not AGN is 0.89%; the percentage of objects that are spectroscopically certain not-AGN and become false positives is 0.82%.

As we have seen, the contamination due to galaxies is very small, and the contamination caused by objects in the mixed zone could be partially attributable to unrecognized AGN. Nevertheless, we tried to maximize the pureness of the AGN class. This was done by varying the confidence threshold value. The MLP classifier is not crisp, so its output represents the probability of belonging to a class; an output value greater than the confidence threshold (0.5 by default) is assigned class label 1 (in this case, class label 1 stands for AGN objects). In order to avoid statistical fluctuations, the results on the test set have been split by considering alternately the first half, the second half and then the whole test set, as shown in figures 4.21, 4.22 and 4.23 respectively.

By following this strategy we found a common maximum at a threshold of around 0.837; with such a threshold we obtained a very low completeness, about 9.13%, but a high pureness, about 89.14%.
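The threshold scan can be sketched as follows (hypothetical names; `prob` stands for the MLP output probability of class 1):

```python
import numpy as np

def pureness_completeness(prob, y_true, threshold):
    """AGN pureness and completeness when objects with output
    probability above `threshold` are labeled as AGN (class 1)."""
    pred = np.asarray(prob) >= threshold
    true = (np.asarray(y_true) == 1)
    tp = np.sum(pred & true)                       # correctly labeled AGN
    pureness = tp / pred.sum() if pred.sum() else 0.0
    completeness = tp / true.sum()
    return pureness, completeness

def best_threshold(prob, y_true, grid):
    """Scan candidate thresholds and keep the one maximizing pureness."""
    return max(grid, key=lambda t: pureness_completeness(prob, y_true, t)[0])
```

Raising the threshold trades completeness for pureness, which is exactly the behavior exploited here.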

With the SVM we reached a lower efficiency, equal to 75.76% in the best experiment, obtained with C = 32768 and γ = 0.001953125; see figure 4.24.

A single SVM training experiment, in this case, had a huge computational cost (about one week); thus, in order to be able to perform the foreseen 110 experiments, as described in section 2.3.1, we launched them in parallel by exploiting the available resources of the SCoPE GRID infrastructure (Brescia et al., 2009; http://www.scope.unina.it).
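The 110 configurations follow from the search ranges quoted in the figure captions (C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3), i.e. an 11 × 10 grid; a sketch of its generation, where each point is an independent job suitable for parallel submission:

```python
import itertools

# C = 2^-5, 2^-3, ..., 2^15 (11 values); gamma = 2^-15, ..., 2^3 (10 values)
Cs = [2.0 ** k for k in range(-5, 16, 2)]
gammas = [2.0 ** k for k in range(-15, 4, 2)]
grid = list(itertools.product(Cs, gammas))  # 110 (C, gamma) jobs
```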


Figure 4.22. Ratio of correctly classified AGN to misclassified not-AGN, second half of the Test Set.

Figure 4.23. Ratio of correctly classified AGN to misclassified not-AGN, whole Test Set.

learning rule   total eff.   AGN compl.   not-AGN compl.   AGN pureness   AGN contam.
CG              75.54%       55.63%       86.74%           68.49%         31.51%
SCG             75.74%       55.12%       87.23%           68.36%         31.64%
QNA             75.99%       55.64%       87.44%           71.38%         28.62%
SVM             75.76%       55.41%       87.20%           70.91%         29.09%

Table 4.10. Experiment nr. 1 results; the first column is the model used, while the others are, respectively, the percentages of total efficiency, AGN completeness, not-AGN completeness, AGN pureness and AGN contamination.

Figure 4.24. AGN classification, contour plot of the grid search for the best configuration of C and γ; C varies as 2^-5, 2^-3, ..., 2^15 and γ as 2^-15, 2^-13, ..., 2^3.

Table 4.10 reports the complete results obtained with the three MLP learning rules and the SVM.

Experiment Nr. 2 - Type 1 - Type 2 AGNs separation

Concerning the second of the three mentioned experiments, i.e. the classification between type 1 and type 2 objects, the described ML models are fed with a target vector whose values are labeled as 1 for each object classified as Seyfert 1 in the catalogue of Sorrentino et al. (2006) and as 0 for the objects classified as Seyfert 2, resulting in 1570 objects after the removal of the patterns affected by NaN values.

According to the mentioned strategy the train set (80% of the whole dataset) contains 1256 patterns while the test set (20% of the dataset) contains 314 patterns.

The best result has been obtained by the MLP with the Quasi Newton learning rule.

The following values summarize the validation performances after the training for the best experiment.

• total efficiency: e = 99.4%

• type 1 pureness: p_type1 = 98.4%

• type 2 pureness: p_type2 = 100%

learning rule   total efficiency   type 1 %   type 2 %
CG              97.1%              94.7%      98.9%
SCG             95.9%              93.1%      97.8%
QNA             99.4%              98.4%      100%
SVM             81.5%              75.8%      75.7%

Table 4.11. Experiment nr. 2 results; the first column is the model used, while the others are, respectively, the total efficiency, type 1 pureness and type 2 pureness.

• type 1 completeness: c_type1 = 100%

• type 2 completeness: c_type2 = 98.9%

Using the SVM, the best result produced a total efficiency equal to 81.5%, obtained with C = 512 and γ = 0.03125; see figure 4.25.

Table 4.11 reports the complete results obtained with the three MLP learning rules and the SVM.

Experiment Nr. 3 - Seyferts - LINERs separation

Concerning the last experiment, i.e. the classification between Seyfert and LINER objects, the described ML models are fed with a target vector whose values are labeled as 1 for objects below the Heckman line and 0 for objects above it, resulting in 30380 objects after the removal of the patterns affected by NaN values.

According to the mentioned strategy the train set (80% of the whole dataset) contains 24304 patterns while the test set (20% of the dataset) contains 6076 patterns.

The best result has been obtained by the MLP with the Quasi Newton learning rule.

The following values summarize the validation performances after the training for the best experiment.

• total efficiency: e = 79.69%

• Seyfert pureness: p_Seyfert = 74.76%

• LINER pureness: p_LINER = 81.09%

• Seyfert completeness: c_Seyfert = 52.77%

• LINER completeness: c_LINER = 91.69%

Using the SVM we reached a total efficiency equal to 78.18%, obtained with C = 8192 and γ = 0.03125; see figure 4.26.

Table 4.12 reports the complete results obtained with the three MLP learning rules and the SVM.

Figure 4.25. Type 1 - Type 2 separation, contour plot of the grid search for the best configuration of C and γ; C varies as 2^-5, 2^-3, ..., 2^15 and γ as 2^-15, 2^-13, ..., 2^3.

learning rule   total efficiency   Seyfert %   LINER %
CG              78.09%             72.34%      79.53%
SCG             79.36%             74.37%      80.74%
QNA             79.69%             74.76%      81.09%
SVM             78.18%             71.01%      80.24%

Table 4.12. Experiment nr. 3 results; the first column is the model used, while the others are, respectively, the total efficiency, Seyfert pureness and LINER pureness.

Figure 4.26. Seyferts - LINERs separation, contour plot of the grid search for the best configuration of C and γ; C varies as 2^-5, 2^-3, ..., 2^15 and γ as 2^-15, 2^-13, ..., 2^3.