Mixed Mode Data Clustering: An Approach Based on Tetrachoric Correlations


Isabella Morlini
Dipartimento di Scienze Sociali, Cognitive e Quantitative, Università di Modena e Reggio Emilia, 42100 Reggio Emilia, Italy
email: isabella.morlini@unimore.it

In: B. Fichet et al. (eds.), Classification and Multivariate Analysis for Complex Data Structures, Studies in Classification, Data Analysis, and Knowledge Organization, DOI 10.1007/978-3-642-13312-1_9, © Springer-Verlag Berlin Heidelberg 2011

Abstract In this paper we face the problem of clustering mixed mode data by assuming that the observed binary variables are generated from latent continuous variables. We perform a principal components analysis on the matrix of tetrachoric correlations, we then estimate the scores of each latent variable and construct a data matrix with continuous variables to be used in fully Gaussian mixture models or in k-means cluster analysis. The calculation of the expected a posteriori (EAP) estimates may proceed by simply considering a limited number of quadrature points. Main results on a simulation study and on a real data set are reported.

1 Introduction

One possible approach to cluster analysis is the mixture maximum likelihood method, in which the data to be clustered are assumed to come from a finite mixture of populations. The method has been well developed and much used for the case of normal populations. A main advantage of using Gaussian distributions is that a number of possible restrictions on the covariance matrices have been proposed in the literature (e.g., [1, 3]) to deal with different local dependencies and, at the same time, to alleviate the problem of the rapid growth in the number of parameters with the data dimension and with the number of clusters. A large range of Gaussian models is available, from the simple spherical one to the least parsimonious, in which all elements of the covariance matrix are allowed to vary across clusters. Practical applications, however, often involve mixtures of categorical and continuous variables. Everitt [4] and Everitt and Merette [5] extended the normal model to deal with mixed mode data, but the computation involved in their model is so extensive that it is only feasible for data with very few categorical variables.



Lawrence and Krzanowski [7] and Vermunt and Magidson [12] propose conditional Gaussian models with local dependencies between pairs of categorical variables and between pairs of continuous variables, which are dealt with via joint multinomial and multivariate normal distributions. In the "Latent Gold" package [11], the dependence between a categorical and a continuous variable may be handled with a sort of "trick", by doubling the categorical variable and treating the variable also as a covariate. The estimated dependence, however, may not vary between groups. The mixture model for large data sets implemented in the package SPSS is also based on joint multinomial and Gaussian distributions and postulates local independence between a categorical and a continuous variable.

Here we face the problem of clustering data with different scales, allowing local dependencies also between a categorical and a continuous variable, by assuming that each observed categorical variable is generated from a latent continuous variable and by estimating the scores of these latent variables. In economics, such latent variables are called utility functions and the assumption is that the response (which may be, for example, the presence or the absence of a public service or a public utility) is determined by the crossing of certain thresholds in these functions (see, among others, [8]). Heckman [6] models whether or not American states have introduced fair-employment legislation and describes the corresponding latent response as the "sentiment" favoring fair-employment legislation. In genetics, the latent response is interpreted as the "liability" to develop a qualitative trait or phenotype. There are also examples of continuous variables which are sampled as binary (among others, bit data originated by electric voltages). Skrondal and Rabe-Hesketh [10], pp. 16–17, report various interpretations of these latent variables and also state that assuming a latent continuous variable may be useful regardless of whether the latent response can be given a real meaning.

This work represents the first step in the construction of fully Gaussian models for classification, in which correlations among variables may vary across groups and variable selection may also be handled differently in each group. Here we estimate the scores of each latent variable and obtain a data matrix with all continuous variables to be used in these models. An application shows that some benefits may be gained in k-means cluster analysis by using a data matrix with all continuous variables instead of a mixed mode data matrix.

2 From Binary Variables to Continuous Variables

The essential feature of the method described in this section is that the observed categorical variables are generated from underlying latent continuous variables according to the values of a set of thresholds. Here we formalize results for binary variables, but the theory may be extended to multinomial variables by estimating the matrix of polychoric correlations. Given p vectors of binary variables observed for a sample of size n, a contingency table for each couple of variables $X_k$ and $X_j$ may be constructed:


            x_k = 0    x_k = 1
  x_j = 0   e_jk       b_jk
  x_j = 1   c_jk       d_jk

The estimated value for the threshold generating the variable $X_k$ is the value $h_k$ satisfying $\Phi(h_k) = (e_{jk} + c_{jk})/n$. For variable $X_j$ it is the value $h_j$ satisfying $\Phi(h_j) = (e_{jk} + b_{jk})/n$, where $\Phi$ is the standard normal cumulative distribution function. We then estimate the tetrachoric correlation coefficient $r_{jk}$, conditional on these thresholds, via maximum likelihood. The solution may be found iteratively or by using the following approximate analytic solution:

$$r_{jk} = \sin\left\{ \frac{\pi}{2} \left[ 1 + \frac{4\, e_{jk}\, b_{jk}\, c_{jk}\, d_{jk}\, n^2}{(e_{jk} d_{jk} - b_{jk} c_{jk})^2\, (e_{jk} + d_{jk})(b_{jk} + c_{jk})} \right]^{-1/2} \right\} \qquad (1)$$
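As a concrete illustration, the following Python sketch (not taken from the paper; the function name `tetrachoric_approx`, the use of NumPy/SciPy and the numerical guard are our own assumptions) computes the two thresholds and the approximation (1) from a pair of binary vectors, applying the 0.5 correction to empty cells.

```python
import numpy as np
from scipy.stats import norm

def tetrachoric_approx(xj, xk):
    """Thresholds and approximate tetrachoric correlation, formula (1),
    for two binary vectors xj, xk coded 0/1 (illustrative sketch)."""
    xj, xk = np.asarray(xj), np.asarray(xk)
    n = len(xj)
    # 2x2 contingency table: e, b on the row xj = 0; c, d on the row xj = 1
    e = np.sum((xj == 0) & (xk == 0))
    b = np.sum((xj == 0) & (xk == 1))
    c = np.sum((xj == 1) & (xk == 0))
    d = np.sum((xj == 1) & (xk == 1))
    e, b, c, d = (max(float(v), 0.5) for v in (e, b, c, d))   # zero-frequency correction
    # thresholds: Phi(h_k) = (e + c)/n, Phi(h_j) = (e + b)/n
    h_k = norm.ppf((e + c) / n)
    h_j = norm.ppf((e + b) / n)
    # approximate analytic solution (1); as written it returns a value in (0, 1],
    # so the sign of the association is not recovered here
    denom = max((e * d - b * c) ** 2, 1e-12) * (e + d) * (b + c)   # guard for e*d == b*c
    ratio = 4 * e * b * c * d * n ** 2 / denom
    r_jk = np.sin(np.pi / 2 * (1 + ratio) ** (-0.5))
    return r_jk, h_j, h_k
```

Applying such a function to every pair of columns would yield the p × p matrix of tetrachoric correlations used in the following steps.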

In tables with zero frequencies, zero values are set to 0.5. In a simulation study with 5,000 different data sets of size $(100 \times 6)$ generated from 10 multivariate normal populations, the estimator (1) has been shown to give better results than the other estimators based on approximate analytic solutions of the likelihood function. The $(n \times p)$ matrix of the scores of the p latent continuous variables is obtained with expected a posteriori (EAP) estimates. In order to obtain semiparametric estimates, we consider a model based on principal components rather than on factors (see, for example, [2] and [9], for EAP estimates obtained by considering a fully parametric model in which also the thresholds, eigenvalues and eigenvectors associated with each factor are estimated by maximizing the likelihood function). We perform a principal component analysis on the matrix of tetrachoric correlations (which does not require previous smoothing if the matrix is not positive definite) and consider the following model:

$$t_{ij} = a_{j1} y_{i1} + a_{j2} y_{i2} + \dots + a_{jk} y_{ik} + \dots + a_{jp} y_{ip} \qquad (2)$$

where $t_{ij}$ is the score of principal component $j$ for case $i$, $a_{jk}$ are the loadings (eigenvectors) and $y_{ik}$ is the score for case $i$ of the $k$th latent variable, associated with the observed categorical variable $x_k$ as follows: $x_{ik} = 1$ if $y_{ik} \ge h_k$ and $x_{ik} = 0$ if $y_{ik} < h_k$. As assumed for the threshold estimates, $y \sim N(0, I)$ and $t \sim N(0, \Lambda)$, where $\Lambda$ is a diagonal matrix with elements $\lambda_j^2 = \sum_{k=1}^{p} a_{jk}^2$ equal to the eigenvalues. The EAP estimator of the $j$th principal component score is the mean of the posterior distribution of $t_j$, which is expressed by:

$$\tilde t_{ij} = E(t_{ij} \mid x_i; w) = \int t_j\, f(t_j \mid x_i; w)\, dt_j = \int t_j\, \frac{f(x_i \mid t_j; w)\, g(t_j \mid w)}{\int f(x_i \mid t_j; w)\, g(t_j \mid w)\, dt_j}\, dt_j \qquad (3)$$


In the following equations, for economy of space, $w$ will be omitted. Given $\sigma_{jk}^2 = \lambda_j^2 - a_{jk}^2 = \sum_{h \ne k} a_{jh}^2$, then

$$P(x_{ik} = 1 \mid t_j) = \frac{1}{\sigma_{jk}\sqrt{2\pi}} \int_{h_k}^{+\infty} e^{-(t_{ij} - a_{jk} y_{ik})^2 / 2\sigma_{jk}^2}\, dy_{ik} \qquad (4)$$

Introducing the change of variable:

$$P(x_{ik} = 1 \mid t_j) = \frac{1}{a_{jk}\sqrt{2\pi}} \int_{-\infty}^{(t_{ij} - a_{jk} h_k)/\sigma_{jk}} e^{-z^2/2}\, dz \qquad (a_{jk} > 0) \qquad (5)$$

$$P(x_{ik} = 1 \mid t_j) = \frac{1}{-a_{jk}\sqrt{2\pi}} \int_{(t_{ij} - a_{jk} h_k)/\sigma_{jk}}^{+\infty} e^{-z^2/2}\, dz \qquad (a_{jk} < 0) \qquad (6)$$
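For completeness, the substitution behind (5), shown here for the case $a_{jk} > 0$ (this intermediate step is ours, filling in what the text leaves implicit), is $z = (t_{ij} - a_{jk} y_{ik})/\sigma_{jk}$:

$$\frac{1}{\sigma_{jk}\sqrt{2\pi}} \int_{h_k}^{+\infty} e^{-(t_{ij} - a_{jk} y_{ik})^2/2\sigma_{jk}^2}\, dy_{ik}
= \frac{1}{\sigma_{jk}\sqrt{2\pi}} \cdot \frac{\sigma_{jk}}{a_{jk}} \int_{-\infty}^{(t_{ij} - a_{jk} h_k)/\sigma_{jk}} e^{-z^2/2}\, dz
= \frac{1}{a_{jk}\sqrt{2\pi}} \int_{-\infty}^{(t_{ij} - a_{jk} h_k)/\sigma_{jk}} e^{-z^2/2}\, dz,$$

since $dy_{ik} = -(\sigma_{jk}/a_{jk})\, dz$ and the limits $y_{ik} = h_k$ and $y_{ik} \to +\infty$ map to $z = (t_{ij} - a_{jk} h_k)/\sigma_{jk}$ and $z \to -\infty$. The case $a_{jk} < 0$ gives (6) analogously.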

Letting $z_{jk} = (t_{ij} - a_{jk} h_k)/\sigma_{jk}$ and $F_{jk}(t_j) = (a_{jk})^{-1}\Phi(z_{jk})$ when $a_{jk} > 0$, $F_{jk}(t_j) = |a_{jk}|^{-1}(1 - \Phi(z_{jk}))$ when $a_{jk} < 0$, and assuming the independence of the binary variables $x_k$ conditional on each component $t_j$, it results that

$$f(x_i \mid t_j) = \prod_{k=1}^{p} F_{jk}(t_j)^{x_{ik}} \left[1 - F_{jk}(t_j)\right]^{1 - x_{ik}} \qquad (7)$$

We consider S quadrature points and estimate the scores as follows:

$$\tilde t_{ij} = \frac{\sum_{s=1}^{S} t_{sj}\, \phi(t_{sj}) \prod_{k=1}^{p} F_{jk}(t_{sj})^{x_{ik}} \left[1 - F_{jk}(t_{sj})\right]^{1 - x_{ik}}}{\sum_{s=1}^{S} \phi(t_{sj}) \prod_{k=1}^{p} F_{jk}(t_{sj})^{x_{ik}} \left[1 - F_{jk}(t_{sj})\right]^{1 - x_{ik}}} \qquad (8)$$

where the $t_{sj}$ are equally spaced points in $[-z_j, z_j]$ with $\Phi(-z_j/\lambda_j) = 0.001$, and $\phi(t_{sj})$ are the density values of these points under the $N(0, \lambda_j^2)$ curve times the interval size.
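To make this step concrete, here is a hedged Python sketch (our own code, not the author's; the name `eap_component_scores`, the default number of quadrature points S and the numerical clipping are assumptions). It performs the principal component analysis on the tetrachoric correlation matrix R, builds the loadings $a_{jk}$ and the eigenvalues $\lambda_j^2$, and evaluates the quadrature of (8) for every case and component.

```python
import numpy as np
from scipy.stats import norm

def eap_component_scores(R, h, X, S=30):
    """EAP estimates of the principal component scores t_ij, Eq. (8) (sketch)."""
    n, p = X.shape
    eigval, eigvec = np.linalg.eigh(R)
    order = np.argsort(eigval)[::-1]                       # components by decreasing eigenvalue
    lam2 = np.clip(eigval[order], 1e-6, None)              # lambda_j^2 (floored for stability)
    A = (eigvec[:, order] * np.sqrt(lam2)).T               # loadings a_jk, shape (p, p)

    T = np.zeros((n, p))
    for j in range(p):
        lam_j = np.sqrt(lam2[j])
        zj = lam_j * norm.ppf(0.999)                       # Phi(-z_j / lambda_j) = 0.001
        ts = np.linspace(-zj, zj, S)                       # equally spaced quadrature points
        phi = norm.pdf(ts, scale=lam_j) * (ts[1] - ts[0])  # density times interval size
        a = A[j]
        safe_a = np.where(np.abs(a) < 1e-8, 1e-8, a)       # avoid division by zero (assumption)
        sigma = np.sqrt(np.maximum(lam2[j] - a ** 2, 1e-12))          # sigma_jk
        z = (ts[:, None] - a * h) / sigma                              # z_jk, shape (S, p)
        F = np.where(a > 0, norm.cdf(z) / safe_a, (1 - norm.cdf(z)) / np.abs(safe_a))
        F = np.clip(F, 1e-12, 1 - 1e-12)                   # keep the products well defined
        like = np.exp(X @ np.log(F).T + (1 - X) @ np.log(1 - F).T)    # Eq. (7), shape (n, S)
        T[:, j] = (like * (ts * phi)).sum(axis=1) / (like * phi).sum(axis=1)
    return T, A, lam2
```

With X the binary data matrix, h the vector of thresholds and R built pairwise as in the previous sketch, T plays the role of the scores $\tilde t_{ij}$ of (8).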

Given the estimates $\tilde t_{ij}$, the EAP estimates $\tilde y_{ik}$ of the latent variables may then be obtained through analogous steps. The EAP estimator of the $k$th variable scores is the mean of the posterior distribution of $y_k$, which is expressed by:

$$\tilde y_{ik} = E(y_{ik} \mid x_{ik}; t_i) = \int y_k\, f(y_k \mid x_{ik}; t_i)\, dy_k = \int y_k\, \frac{f(x_{ik} \mid y_k; t_i)\, g(y_k)}{\int f(x_{ik} \mid y_k; t_i)\, g(y_k)\, dy_k}\, dy_k \qquad (9)$$

Let $y_{ik}^{+}$ denote the values $y_{ik} \ge h_k$ and $y_{ik}^{-}$ the values $y_{ik} < h_k$; then


$$P(x_{ik} = 1 \mid y_k; \tilde t_{ij}) = \frac{1}{a_{jk}\sqrt{2\pi}} \int_{-\infty}^{(\tilde t_{ij} - a_{jk} y_{ik}^{+})/\sigma_{jk}} e^{-z^2/2}\, dz \qquad (a_{jk} > 0) \qquad (10)$$

$$P(x_{ik} = 1 \mid y_k; \tilde t_{ij}) = \frac{1}{|a_{jk}|\sqrt{2\pi}} \int_{(\tilde t_{ij} - a_{jk} y_{ik}^{+})/\sigma_{jk}}^{+\infty} e^{-z^2/2}\, dz \qquad (a_{jk} < 0) \qquad (11)$$

$$P(x_{ik} = 0 \mid y_k; \tilde t_{ij}) = \frac{1}{|a_{jk}|\sqrt{2\pi}} \int_{-\infty}^{(\tilde t_{ij} - a_{jk} y_{ik}^{-})/\sigma_{jk}} e^{-z^2/2}\, dz \qquad (a_{jk} < 0) \qquad (12)$$

$$P(x_{ik} = 0 \mid y_k; \tilde t_{ij}) = \frac{1}{a_{jk}\sqrt{2\pi}} \int_{(\tilde t_{ij} - a_{jk} y_{ik}^{-})/\sigma_{jk}}^{+\infty} e^{-z^2/2}\, dz \qquad (a_{jk} > 0) \qquad (13)$$

Let

$$z_{jk}^{+} = \frac{\tilde t_{ij} - a_{jk} y_{ik}^{+}}{\sigma_{jk}}, \qquad z_{jk}^{-} = \frac{\tilde t_{ij} - a_{jk} y_{ik}^{-}}{\sigma_{jk}} \qquad (14)$$

and $F_{jk}^{+}(y_k) = (a_{jk})^{-1}\Phi(z_{jk}^{+})$ when $a_{jk} > 0$, $F_{jk}^{+}(y_k) = |a_{jk}|^{-1}(1 - \Phi(z_{jk}^{+}))$ when $a_{jk} < 0$, $F_{jk}^{-}(y_k) = |a_{jk}|^{-1}\Phi(z_{jk}^{-})$ when $a_{jk} < 0$, and $F_{jk}^{-}(y_k) = (a_{jk})^{-1}(1 - \Phi(z_{jk}^{-}))$ when $a_{jk} > 0$. Then

$$f(x_{ik} \mid y_k; t_i) = \prod_{j=1}^{p} F_{jk}^{+}(y_k)^{x_{ik}}\, F_{jk}^{-}(y_k)^{1 - x_{ik}} \times \phi(\tilde t_{ij}).$$

Considering S quadrature points, we estimate the scores as follows:

$$\tilde y_{ik} = \frac{\sum_{s=1}^{S} y_{sk}\, \phi(y_{sk}) \prod_{j=1}^{p} \left( F_{jk}^{+}(y_s)^{x_{ik}}\, F_{jk}^{-}(y_s)^{1 - x_{ik}} \times \phi(\tilde t_{ij}) \right)}{\sum_{s=1}^{S} \phi(y_{sk}) \prod_{j=1}^{p} \left( F_{jk}^{+}(y_s)^{x_{ik}}\, F_{jk}^{-}(y_s)^{1 - x_{ik}} \times \phi(\tilde t_{ij}) \right)} \qquad (15)$$

where the $y_{sk}$ are equally spaced points in $[-z_j,\, h_k]$ when $x_{ik} = 0$ and in $[h_k,\, z_j]$ when $x_{ik} = 1$, with $\Phi(-z_j) = 0.001$, and $\phi(y_{sk})$ are the density values of these points under the $N(0, 1)$ curve times the interval size.
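The second quadrature, Eq. (15), can be sketched along the same lines (again illustrative code of ours, not the paper's; `eap_latent_scores`, the grid size S and the numerical safeguards are assumptions). The factor $\phi(\tilde t_{ij})$ appears in both the numerator and the denominator of (15) and therefore cancels, so it is omitted in the sketch.

```python
import numpy as np
from scipy.stats import norm

def eap_latent_scores(X, h, A, lam2, T, S=30):
    """EAP estimates of the latent variable scores y_ik, Eq. (15) (sketch)."""
    n, p = X.shape
    zj = norm.ppf(0.999)                                         # Phi(-z_j) = 0.001, N(0,1) scale
    sigma = np.sqrt(np.maximum(lam2[:, None] - A ** 2, 1e-12))   # sigma_jk, shape (p, p)
    safe_A = np.where(np.abs(A) < 1e-8, 1e-8, A)                 # avoid division by zero
    Y = np.zeros((n, p))
    for i in range(n):
        for k in range(p):
            lo, hi = (h[k], zj) if X[i, k] == 1 else (-zj, h[k])
            ys = np.linspace(lo, hi, S)                          # points on the admissible side of h_k
            phi_y = norm.pdf(ys) * (ys[1] - ys[0])               # N(0,1) density times interval size
            z = (T[i, :, None] - A[:, k, None] * ys) / sigma[:, k, None]   # z_jk^{+/-}, shape (p, S)
            if X[i, k] == 1:                                     # F_jk^{+}
                F = np.where(A[:, k, None] > 0, norm.cdf(z) / safe_A[:, k, None],
                             (1 - norm.cdf(z)) / np.abs(safe_A[:, k, None]))
            else:                                                # F_jk^{-}
                F = np.where(A[:, k, None] > 0, (1 - norm.cdf(z)) / safe_A[:, k, None],
                             norm.cdf(z) / np.abs(safe_A[:, k, None]))
            w = np.exp(np.log(np.clip(F, 1e-12, None)).sum(axis=0)) * phi_y
            Y[i, k] = (ys * w).sum() / w.sum()
    return Y
```

Here A, lam2 and T are the loadings, eigenvalues and component scores returned by the previous sketch; the resulting matrix Y of estimated latent scores can then replace the binary columns in the clustering step.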

3 Main Results on a Simulation Study and on a Real Data Set

A simulation study is used to evaluate the accuracy of the tetrachoric correlations and of the scores estimates. From 10 standard multivariate normal populations with correlation matrices P having equal off-diagonal elements $\rho_{rc}$, $r \ne c$, ranging from 0.0 to 0.95, we generate 5,000 data sets (500 from each population) of size $(100 \times 6)$. We then dichotomize the 6 variables by imposing random thresholds drawn from a uniform distribution in the interval [−2, +2]. The mean absolute errors (MAEs) of the threshold estimates for each variable (averaged over the 5,000 data sets and the 100 observations of each set) are always less than 0.06. Considering "difficult variables", originated by thresholds outside the interval [−1, +1], the MAEs increase to 0.11. These less accurate estimates also lead to larger errors in the scores estimates.
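A minimal reconstruction of this simulation design (our own sketch, with an arbitrary seed and with the thresholds kept fixed across replications for brevity; the paper's exact design may differ) is the following:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)                       # arbitrary seed (assumption)
n, p, rho = 100, 6, 0.5                              # one of the settings described above
P = np.full((p, p), rho) + (1 - rho) * np.eye(p)     # equicorrelation matrix
h_true = rng.uniform(-2, 2, size=p)                  # random thresholds in [-2, +2]

errors = []
for _ in range(500):                                 # 500 data sets for this correlation matrix
    Y = rng.multivariate_normal(np.zeros(p), P, size=n)
    X = (Y >= h_true).astype(int)                    # observed binary variables
    prop_zero = np.clip((X == 0).mean(axis=0), 1e-4, 1 - 1e-4)
    h_hat = norm.ppf(prop_zero)                      # Phi(h_k) = proportion of zeros
    errors.append(np.abs(h_hat - h_true))
print("MAE of the threshold estimates:", float(np.mean(errors)))
```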


Using (1), the mean absolute errors (MAEs) obtained for the estimates of ρ, averaged over the 500 data sets generated with each correlation matrix, the 100 observations of each set and the 15 correlation coefficients, are:

ρ       0.0    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.95
MAE     0.07   0.06   0.06   0.06   0.06   0.06   0.06   0.06   0.05   0.03

Results seem particularly accurate for all values of ρ. Mean errors also decrease as the real correlations among the variables increase. Boxplots of the MAEs for the eigenvalues of the principal components, calculated between the eigenvalues of the correlation matrix P used to generate the data and the correlation matrix R of the generated data, are reported in Fig. 1.

Fig. 1 Boxplots of the mean absolute errors of the eigenvalues, plotted against the original correlations $\rho_{rc}$. In the left-hand boxes, errors are calculated between the eigenvalues of the tetrachoric correlation matrix and the eigenvalues of the matrix R of the generated data. In the right-hand boxes, errors are calculated between the eigenvalues of the tetrachoric correlation matrix and the eigenvalues of the matrix P used to generate the data. In the upper boxes, errors are averaged over the six eigenvalues. In the lower boxes, errors are calculated only for the first eigenvalue.


For values of $\rho_{rc}$ not exceeding 0.8, the estimates of all the eigenvalues better recover the computed correlation matrix R, rather than the matrix P used to generate the data. This is not true for the first eigenvalue: when it is large (and the correlations are larger than 0.8), the estimates better recover the first eigenvalue of the matrix P. We then estimate the scores of each latent variable and of the principal components. The MAEs, averaged over the 500 data sets generated for each correlation matrix and over the six variables and the 100 observations of each set, are:

ρ                        0.0    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.95
MAE ($\tilde t_{ij}$)    0.87   0.70   0.69   0.65   0.64   0.60   0.57   0.51   0.45   0.42
MAE ($\tilde y_{ik}$)    0.59   0.59   0.58   0.58   0.58   0.59   0.58   0.58   0.58   0.59

As the correlation among the variables increases, there is an improvement in the principal component estimates. On the contrary, the results regarding the latent variables do not seem to depend on ρ. Estimates of the scores of the latent variables show improvements in average accuracy when the generated thresholds are close to zero, that is, close to the mean (and the median) of the latent variables. When the thresholds are beyond the range [−1, +1], average errors are significantly greater. Average errors, however, are always less than the variance of each variable and the results seem sufficiently accurate. Table 1 reports the MAEs of the latent variable scores obtained in a further study. Here the 6 binary variables are obtained by generating a (5000 × 6) data set from the same zero-mean multivariate normal populations as before, but with fixed thresholds: −2, −0.5, 0, 0.2, 0.5, 2. Average errors (in the last row) show that the accuracy of the EAP estimates increases as the threshold approaches zero. On the other hand, considering the errors computed for different values of the true scores, we note that the minimum average errors are obtained for values near the thresholds. The worst fittings are obtained for large positive values when the threshold is −2 and for large negative values when the threshold is +2. For the variables with thresholds −0.5, 0, 0.2 and 0.5, the correlations between real and estimated scores are 0.74, 0.78, 0.78 and 0.75, respectively.

Table 1  Mean absolute errors for the estimates of the 6 latent variable scores, divided into 9 groups based on the magnitude of the true score values

true scores      thresh.=−2   thresh.=−0.5   thresh.=0.0   thresh.=0.2   thresh.=0.5   thresh.=2
< −1.3              0.62         1.02           1.46          1.62          1.89          2.74
[−1.3, −0.8)        0.20         0.32           0.77          0.92          1.21          1.99
[−0.8, −0.5)        0.17         0.11           0.40          0.56          0.84          1.60
[−0.5, −0.3)        0.48         0.23           0.11          0.27          0.54          1.28
[−0.3, 0.0)         0.75         0.08           0.17          0.08          0.29          1.00
[0.0, 0.3)          1.02         0.29           0.15          0.23          0.08          0.73
[0.3, 0.5)          1.30         0.55           0.13          0.09          0.19          0.47
[0.5, 0.8)          1.61         0.83           0.41          0.22          0.13          0.18
≥ 0.8               2.40         1.54           1.14          0.96          0.64          0.41
average             0.77         0.53           0.48          0.48          0.50          0.75


The real data set comes from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). The features encode the geometry of the image as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The cluster membership of each image is known (the clusters are: advertisement or not advertisement). After removing instances with missing values and selecting the binary variables with relative frequencies higher than 0.1, we reach a data set with 2,359 instances, 3 continuous variables and 10 binary variables. We perform a k-means cluster analysis and we run the mixture model implemented in SPSS, first with mixed mode variables (normalizing the continuous variables in the interval [0, 1]) and then with all continuous variables (with the estimated scores replacing the binary ones). The classification error rate decreases from 33 to 30% with k-means and from 35 to 32% with the mixture model.
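For reference, the comparison just described can be sketched as follows (illustrative code only; `compare_clusterings`, the scikit-learn KMeans settings and the simple two-cluster label matching are our assumptions, and X_cont, X_bin, Y_scores and labels stand for the continuous block, the binary block, the estimated latent scores and the known memberships).

```python
import numpy as np
from sklearn.cluster import KMeans

def compare_clusterings(X_cont, X_bin, Y_scores, labels, seed=0):
    """k-means error rate on the mixed mode matrix vs. the all-continuous matrix."""
    # continuous variables normalised to [0, 1], as described in the text
    X01 = (X_cont - X_cont.min(axis=0)) / (X_cont.max(axis=0) - X_cont.min(axis=0))

    def error_rate(M):
        pred = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(M)
        err = np.mean(pred != labels)
        return min(err, 1 - err)            # two clusters: best of the two label matchings

    mixed = error_rate(np.hstack([X01, X_bin]))           # continuous in [0,1] + raw binary
    continuous = error_rate(np.hstack([X01, Y_scores]))   # continuous + estimated latent scores
    return mixed, continuous
```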

4 Concluding Remarks

Although it is clearly impossible to generalize from the results presented, it does appear that estimating the scores of the latent continuous variables generating the binary values may improve the clustering results and, above all, it allows fully Gaussian models, with different correlations among the variables in each group, to be used for classification. This paper describes an initial investigation into the feasibility of estimating the scores of each latent continuous variable. In the literature, only EAP estimates of the most relevant factors have been presented, for the different aims of estimating composite items that are assumed to represent a particular set of constructs and of data reduction. Here the aim is to obtain a continuous data matrix with the same dimensions as the original one. Possible variations and improvements to the proposed method are relevant topics for future research. Future simulations involve data generated from distributions other than the normal, to explore whether the EAP estimates work well also in these cases. Indeed, although the threshold estimates are based on the normal distribution and the $t_{ij}$ and the $y_{ik}$ are supposed to be Gaussian, the EAP estimates are little affected by the choice of this distribution since loadings and eigenvalues are not estimated by maximum likelihood.

References

1. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 48, 803–821 (1993)

2. Bock, R.D., Mislevy, R.J.: Adaptive EAP estimation of ability in a microcomputer environment. Appl. Psychol. Meas. 6(4), 431–434 (1982)

3. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28, 781–793 (1995)

4. Everitt, B.S.: A finite mixture model for the clustering of mixed mode data. Stat. Probab. Lett. 6, 305–309 (1988)

5. Everitt, B.S., Merette, C.: The clustering of mixed-mode data: a comparison of possible approaches. J. Appl. Stat. 17(3), 284–297 (1990)


6. Heckman, J.J.: Dummy endogenous variables in a simultaneous equation system. Econometrica 47, 153–161 (1978)

7. Lawrence, C.J., Krzanowski, W.J.: Mixture separation for mixed-mode data. Stat. Comput. 6, 85–92 (1996)

8. Manski, C.: Identification of binary response models. J. Am. Stat. Assoc. 83, 729–738 (1988)

9. Muraki, E., Engelhard, G.: Full-information item factor analysis: application of EAP scores. Appl. Psychol. Meas. 9(4), 417–430 (1985)

10. Skrondal, A., Rabe-Hesketh, S.: Generalized Latent Variable Modeling. Chapman & Hall, London (2004)

11. Vermunt, J.K., Magidson, J.: Latent Gold User’s Guide. Statistical Innovation, Belmont, MA (2000)

12. Vermunt, J.K., Magidson, J.: Latent class cluster analysis. In: Hagenaars, J.A., McCutcheon, A.L. (eds.) Applied Latent Class Analysis, pp. 89–106. Cambridge University Press, Cambridge (2002)
