
© Physica-Verlag 2004

The Forward Search and Data Visualisation

Anthony C. Atkinson¹ and Marco Riani²

¹ The London School of Economics, London WC2A 2AE, UK [email protected]

² Sezione di Statistica e Informatica, Dipartimento di Economia, Università di Parma, Italy, mriani@unipr.it

Summary

A statistical analysis using the forward search produces many graphs. For multivariate data an appreciable proportion of these are a variety of plots of the Mahalanobis distances of the individual observations during the search.

Each unit, originally a point in v-dimensional space, is then represented by a curve in two dimensions connecting the almost n values of the distance for each unit calculated during the search. Our task is now to recognise and classify these curves: we may find several clusters of data, or outliers or some unexpected, non-normal, structure. We look at the plots from five data sets. Statistical techniques include cluster analysis and transformations to multivariate normality.

Keywords: clustering; forward plot; graphics for multivariate data; Mahalanobis distance; outliers; power transformation; robustness; very robust methods.

1 Introduction

The forward search is a powerful robust statistical method for exploring the relationship between data and fitted models. It relies on the interpretation of a large number of graphs. Atkinson and Riani (2000) describe its use in linear and nonlinear regression, response transformation and in generalized linear models, where the emphasis is on the detection of unidentified subsets of the data and of multiple masked outliers and of their effect on inferences.

In this paper we extend the method to the analysis of multivariate data. Our emphasis here is rather more on the data and less on the multivariate normal model.

The forward search orders the observations by closeness to the assumed model, starting from a small subset of the data and increasing the number of observations m used for fitting the model. Outliers and small unidentified subsets of observations enter at the end of the search. Even if there are a number of groups, as in cluster analysis, we start by fitting one multivariate normal distribution to the data. An important graphical tool is a variety of plots of the Mahalanobis distances of the individual observations during the search. Each unit, originally a point in v-dimensional space, is then represented by a curve in two dimensions connecting the almost n values of the distance for that unit calculated during the search. Our task is now to classify these curves.

Section 2 defines the search. Thereafter we proceed by examples, all of which are of multivariate data. The first is of measurements on Swiss heads which we use to introduce forward plots of Mahalanobis distances. These plots indicate some outliers in the milk data analysed in §4. We then use a synthetic example in §5 to calibrate our plots, showing the effect of clusters. Forward plots of individual Mahalanobis distances and of the trace of the estimated covariance matrix are also helpful in revealing unsuspected structure. These ideas are illustrated in §6 where more structure is discovered in the milk data. Sections 7 and 8 are both concerned with data with several groups. In §8 we outline the use of the forward search in cluster analysis.

Because we use Mahalanobis distances, it is important that the data are approximately normal. We therefore introduce in §9 a multivariate form of the Box-Cox family of transformations. We then use forward plots of tests for transformation to emphasize the importance of the forward search in detecting the effect of just a few observations on inferences. We conclude with some comments on computation and on extensions of our graphical presentation to larger data sets.

2 The Forward Search

The main diagnostic tools that we use are various plots of Mahalanobis distances. The squared distances for the sample are defined as

$$d_i^2 = \{y_i - \hat\mu\}^T \hat\Sigma^{-1} \{y_i - \hat\mu\}, \qquad (i = 1, \ldots, n), \tag{1}$$

where $\hat\mu$ is the vector of means of the n observations and $\hat\Sigma$ is the unbiased estimator of the population covariance matrix.

When the parameters are instead estimated from a subset of m observations, giving $\hat\mu(m)$ and $\hat\Sigma(m)$, we obtain n squared Mahalanobis distances

$$d_i^2(m) = \{y_i - \hat\mu(m)\}^T \hat\Sigma^{-1}(m) \{y_i - \hat\mu(m)\}, \qquad (i = 1, \ldots, n). \tag{2}$$

We often start the search with a small subset of $m_0$ observations, chosen from the robust bivariate boxplots of Zani, Riani, and Corbellini (1998) to exclude outlying observations in any two-dimensional plot of the data. The content of the contours is adjusted to give an initial subset of the required size. The search is not sensitive to the exact choice of this subset.

When m observations are used in fitting, the optimum subset $S^*(m)$ yields n squared distances $d_i^2(m^*)$. We order these squared distances and take the observations corresponding to the m + 1 smallest as the new subset $S^*(m + 1)$. Usually this process augments the subset by only one observation, but sometimes two or more observations enter as one or more leave. Due to the form of the search, outliers, if any, tend to enter as m approaches n.
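The step just described can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `squared_distances` and `forward_search` are hypothetical, the initial subset is supplied by the caller (the robust bivariate boxplot method is not reproduced here), and the initial subset is assumed to contain more than v units so that the covariance estimate is non-singular.

```python
import numpy as np

def squared_distances(Y, subset):
    """Squared Mahalanobis distances of all n rows of Y, as in eq. (2), using
    the mean and unbiased covariance estimated from the rows in `subset`."""
    S = Y[subset]
    mu = S.mean(axis=0)
    Sigma = np.atleast_2d(np.cov(S, rowvar=False))  # divisor m - 1 (unbiased)
    R = Y - mu
    # Solve Sigma x = R^T instead of forming the inverse explicitly.
    return np.einsum("ij,ji->i", R, np.linalg.solve(Sigma, R.T))

def forward_search(Y, initial_subset):
    """Hypothetical helper: run the search from `initial_subset` up to m = n.
    Returns the subset used at each step and an array of distances with one
    row per step and one column per unit."""
    n = Y.shape[0]
    subset = np.sort(np.asarray(initial_subset))
    subsets, distances = [], []
    while True:
        d2 = squared_distances(Y, subset)
        subsets.append(subset)
        distances.append(np.sqrt(np.maximum(d2, 0.0)))
        m = len(subset)
        if m == n:
            break
        # New subset: the m + 1 units with the smallest squared distances,
        # so units may leave as well as enter.
        subset = np.sort(np.argsort(d2, kind="stable")[: m + 1])
    return subsets, np.array(distances)
```

Each column of the returned distance array is the trajectory of one unit, which is what a forward plot of Mahalanobis distances displays against the subset size m.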

In our examples we look at forward plots of the distances $d_i(m^*)$. These distances tend to decrease as m increases. If interest is in the latter part of the search we may also look at scaled distances

$$d_i(m^*) \times \left( |\hat\Sigma(m^*)| / |\hat\Sigma(n)| \right)^{1/2v}, \tag{3}$$

where v is the dimension of the observations y and $\hat\Sigma(n)$ is the estimate of $\Sigma$ at the end of the search.
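As a small illustration of the scaling in (3), the following sketch multiplies a vector of distances by the ratio of determinants raised to the power 1/(2v). The function name `scaled_distances` is hypothetical, and the exponent is read as 1/(2v); since a determinant behaves like a variance raised to the power v, this power puts the factor on the same scale as a distance.

```python
import numpy as np

def scaled_distances(d, Sigma_m, Sigma_n):
    """Scale distances d_i(m*) by (|Sigma_hat(m*)| / |Sigma_hat(n)|)^(1/(2v)).

    d        : distances computed with the subset estimates
    Sigma_m  : covariance estimate from the subset of size m
    Sigma_n  : covariance estimate at the end of the search (all n units)
    """
    v = Sigma_m.shape[0]
    # Work with log-determinants to avoid overflow or underflow.
    _, logdet_m = np.linalg.slogdet(Sigma_m)
    _, logdet_n = np.linalg.slogdet(Sigma_n)
    return d * np.exp((logdet_m - logdet_n) / (2 * v))
```

Early in the search the subset covariance is typically much smaller than the final one, so the factor is below one and the early part of the plot is compressed.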

Such a rescaling increases emphasis on the later parts of forward plots. Thus, for detecting initial clusters, the original distances may be better, whereas for a confirmatory analysis, where we are interested in the possible presence of outliers or undetected small clusters, we might prefer scaled distances.

If there are clusters, the confirmatory part of our analysis includes a forward search fitting an individual model for each cluster. For all unassigned observations we calculate the distance from each cluster centre. Observations are included in the subset of the cluster to which they are nearest and the distances to all cluster centres are monitored.

Unfortunately we may be comparing Mahalanobis distances for compact and dispersed clusters. A forward allocation can then lead to a dispersed group "invading" a compact group with a wrong allocation of several observations. We therefore introduce standardized distances, adjusted for the variance of the individual cluster. The customary squared Mahalanobis distance for the ith observation from the gth group at step m is

$$d_{gi}^2(m) = \{y_{gi} - \hat\mu_g(m)\}^T \hat\Sigma_g^{-1}(m) \{y_{gi} - \hat\mu_g(m)\}, \tag{4}$$

where $\mu$ and $\Sigma$ are estimated for each group. The standardized distance is

$$d^{\mathrm{s}}_{gi}(m) = d_{gi}(m)\, |\hat\Sigma_g(m)|^{1/2v}, \tag{5}$$

in which the effect of differing variances between groups has been completely eliminated. These distances will produce measures close to Euclidean distance, with the consequence that compact clusters may tend to "invade" dispersed ones, the reverse of the behaviour with the usual Mahalanobis distance. We generally need to look at plots of both distances.
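The contrast between the two distances can be seen in a toy calculation. The sketch below is an assumption-laden illustration: the function name `group_distances`, the two group centres and the spherical covariance matrices are all invented for the example, and the exponent in (5) is read as 1/(2v).

```python
import numpy as np

def group_distances(y, mu_g, Sigma_g):
    """Return (usual, standardized) distances of the point y from one group.

    usual        : square root of eq. (4)
    standardized : usual * |Sigma_g|^(1/(2v)), as in eq. (5)
    """
    v = len(mu_g)
    r = np.asarray(y, dtype=float) - np.asarray(mu_g, dtype=float)
    d = np.sqrt(r @ np.linalg.solve(Sigma_g, r))
    _, logdet = np.linalg.slogdet(Sigma_g)
    return d, d * np.exp(logdet / (2 * v))

# Invented example: a compact group at the origin, a dispersed group at (4, 0).
y = [1.5, 0.0]
compact = group_distances(y, [0.0, 0.0], 0.25 * np.eye(2))
dispersed = group_distances(y, [4.0, 0.0], 4.0 * np.eye(2))
```

In this invented example the usual distances are 3 from the compact group and 1.25 from the dispersed group, so the point would be allocated to the dispersed group even though it is closer in Euclidean terms to the compact one. The standardized distances are 1.5 and 2.5, which reverses the allocation; with spherical covariance matrices they coincide with Euclidean distances.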

3 Swiss Heads

As a first example of the use of forward plots we start with data which seem to have a single multivariate normal distribution. The data, given by Flury and Riedwyl (1988, p. 218), are six readings on the dimensions of the heads of 200 twenty year old Swiss soldiers. The forward plot of scaled distances in Figure 1 shows little structure. The rising diagonal white band separates those units which are in the subset from those that are not. At the end of the search there seem to be two outliers, observations 104 and 111.

Figure 1: Swiss heads: forward plot of scaled Mahalanobis distances. Units 104 and 111 are outlying towards the end of the search

Figure 2 repeats Figure 1 with the distances for units 104 and 111 highlighted. These are the largest distances at the end of the search, even though they decrease in the final two steps, when the units concerned join the subset.

This is an example of masking, which is so slight as not to be misleading.

Although the distances for the two units are the largest in the last thirty steps of the search, they rank much lower on size earlier on. This atypical

Figure 2: Swiss heads: forward plot of scaled Mahalanobis distances. The trajectories for units 104 and 111 are highlighted

Figure 3 is a scatterplot matrix of the observations. As in Figure 2, we highlight units 104 and 111. It is clear that the search has identified two units with the largest values of y4. What is not clear is whether these units affect any inferences drawn from the data. We return to this later, where we monitor some other quantities during the search.

4 Milk Data

We now consider a slightly more complicated example, to which we shall again return for a fuller analysis.

Daudin, Duby, and Trecourt (1988) give data on the composition of 85 containers of milk, on each of which eight measurements were made. A scatterplot matrix of the data is in Figure 4. The panel for y5 and y6 shows clearly that one unit is remote in this bivariate projection. Otherwise, several panels show a strong rising diagonal structure. There is one gross outlier in Figure 5, the forward plot of scaled Mahalanobis distances. Until m = n the distance for unit 69 is off the top of the plot. Unit 44 becomes less outlying as the search progresses and is the first of a group of four moderately outlying units to enter the subset. The others, units 1, 2 and 41, are clearly outlying until m = 81 when unit 44 joins. The distances then decrease until there is some masking at the end of the search, with unit 77 having the third largest distance.

Figure 3: Swiss heads: scatterplot matrix of the six measurements on 200 heads. Units 104 and 111 are plotted as dots

We now look in more detail at the data. The two scatterplots of Figure 6 are details of Figure 4 with the five points highlighted. Unit 69, at the top of both plots, is the last to enter and is the clear outlier already mentioned. The group of four units, 1, 2, 41 and 44, are particularly evident in the right-hand panel of the figure. What also seems apparent in this figure is a separated group of seven observations in the lower left-hand corner of the plot of y6 against y4. This group has no obvious effect on the forward plot of distances.

Figure 4: Milk data: scatterplot matrix. There is a strong diagonal structure in some panels

5 Three Clusters, Two Outliers

We now consider the analysis of a simulated data set, partly in order to train our eyes to the interpretation of further plots of the type we have already seen. There are three new features of this example and its analysis. The first is that the data contain more than one cluster, although we initially fit a single multivariate normal distribution. The second is that we can choose our starting point to be in each cluster in turn, and the third is that we introduce several powerful new plots.

There are only two variables. The data contain three clusters and two outliers. The sizes of the groups are 60 in the tighter cluster, 80 in the more dispersed cluster, 18 in the third cluster and 2 outliers. Figure 7 shows the data. Since there are only two variables, the structure of the data is apparent from the scatterplot. We now see how it is reflected in the plots from the forward search. We begin the forward search with $m_0 = 28$, the observations being chosen by the method of robust boxplots with elliptical contours. The starting ellipse has mostly chosen observations from the group of 60. The few observations from the diffuse group are eliminated at the first forward step and the search then continues solely with observations from the compact group until m = 61 when observations from the diffuse group enter. Just before all the diffuse observations are included, the two outliers, observations 159 and 160, enter the subset, although they are soon rejected, rejoining again at the very end of the search.

Figure 5: Milk data: forward plot of scaled Mahalanobis distances. The trace for unit 69 is off the plot until the last step

Figure 6: Milk data: scatterplots of y6 against y3 and y6 against y4. The outlying units in Figure 5 are labelled

Figure 7: Three Clusters, Two Outliers: scatterplot of the data

Fig. 8 is a forward plot of the scaled Mahalanobis distances, in which we have used different line types for the three clusters and the two outliers. This plot indicates all the structure in the data, which is particularly clear up to m = 60: the second group is clearly separated from the first, the distances of the compact group of 18 are evident at the top of the plot and the two outliers follow an independent path. Once the second group starts to enter at m = 61, the smaller distances are not so readily interpreted, although the group of 18 remains distinct. The two outliers re-emerge at the very end.

Figure 8: Three Clusters, Two Outliers: forward plot of scaled Mahalanobis distances, with line type indicating membership of the clusters and the outliers. The initial subset, found by the method of robust ellipses, consists mostly of units in the compact group

Figure 9: Three Clusters, Two Outliers: forward plot of summary of scaled Mahalanobis distances for the groups in Fig. 8; 5%, median and 95% points of the distribution of distances for each m. Group 4 consists of the two outliers

Figure 10: Three Clusters, Two Outliers: forward plot of the minimum Mahalanobis distances amongst units not in the subset: again $m_0 = 28$, found by the method of robust ellipses

We do not have to plot all the Mahalanobis distances. Figure 9 summarises the distribution of distances in Fig. 8. For each of the three clusters the figure shows the 5%, median and 95% points of the distribution of distances at each m. Group 4 consists of the two outliers. The panels clearly show that the trajectories of the distances in the different groups are distinct. Figure 10 is the forward plot of the minimum Mahalanobis distances amongst units not in the subset. This distance will be large for a single outlier. If there are several outliers the distance will be large just before the first one enters the subset. Once it has entered, the estimates of the parameters will change and the distance to the remaining outliers may be less. This is true of a cluster of outliers and is true of the sharp spike at m = 60 which clearly shows the end of the first cluster. The second spike at m = 142 is a rather less clear indication of the second group, although the spike at the end of the plot clearly shows the presence of, now, two outliers.
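The quantity plotted in Figure 10 is simple to extract at each step once the distances are available. The following is a minimal sketch, assuming the distances for all n units at one step are held in an array and the current subset is given as a list of indices; the function name `min_distance_outside` is hypothetical.

```python
import numpy as np

def min_distance_outside(distances, subset):
    """Smallest Mahalanobis distance among units not in the subset.

    Returns (unit index, distance), or None at the final step when every
    unit is already in the subset."""
    n = len(distances)
    outside = np.setdiff1d(np.arange(n), subset)
    if outside.size == 0:
        return None
    j = outside[np.argmin(distances[outside])]
    return int(j), float(distances[j])
```

Evaluating this at every step of a search and plotting the distance against m gives the kind of forward plot described above, in which a spike marks the step just before an outlier or a new cluster enters.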

A third new plot, shown in Figure 11, is the forward plot of the scaled trace of the estimated covariance matrix. The scaling gives a value of one to the trace at the end of the search. This is small up to m = 60, corresponding to fitting units in the tight cluster. It then increases steadily up to m = 80. At this point units from both groups are in the subset, with the mean of the observations in between the two groups. The value of the trace increases only gradually until the third group and outliers become important after m = 140.
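The scaled trace is equally easy to compute. The sketch below assumes the covariance estimates from successive steps of one search have been collected in a list, with the last entry being the estimate from all n units; the function name `scaled_trace` is hypothetical.

```python
import numpy as np

def scaled_trace(covariances):
    """Trace of each estimated covariance matrix divided by the trace at the
    end of the search, so that the final value is one."""
    traces = np.array([np.trace(S) for S in covariances])
    return traces / traces[-1]
```

A sustained rise in this sequence indicates that the units entering the subset are extending the range of the data, as happens when a second cluster begins to be fitted.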

To illustrate the effect of the starting point of the search we now begin with 20 observations from the cluster of 80 indicated in Figure 8. The forward plot of scaled Mahalanobis distances is in Figure 12. It is amazingly informative. The first part of the plot, up to m = 80, seems to show two groups and two outliers. Something is clearly wrong with fitting a single model around m = 100: there are perhaps two clusters, of different sizes than before. The transition between these two parts of the plot is fascinating [...] becomes the group of 60 with the smallest distances in the latter part of the plot. Thus the plot reveals the three groups and two outliers, which reappear at the end of the search. This one plot shows all the structure that we have built into the data. The indication is that we may need to run more than one forward search, constraining the starting point by the information that we have obtained from earlier searches.

Figure 11: Three Clusters, Two Outliers: forward plot of the scaled trace of the covariance matrix for the search also shown in Figures 8 and 10

Figure 12: Three Clusters, Two Outliers: forward plot of scaled Mahalanobis distances starting in the largest group, with line type indicating membership of the clusters and the outliers, as in Figure 8. All the structure is revealed

6 Milk Data Again

So far we have detected the effect of one gross outlier and found a further cluster of four outliers from the forward plot of Mahalanobis distances in Figure 5. Figure 13 shows the trace of the covariance matrix. There is an appreciable increase starting at m = 71 when observation 11 enters. The other observations entering during this rising period are 76, 15, 14, 12 and 13. This behaviour is reminiscent of that we saw in Figure 11 as a second cluster entered the subset in the example with three clusters and two outliers.

Figure 13: Milk data: forward plot of the scaled trace of the covariance matrix

The scatterplot of the data, with unit 69 removed, is in Figure 14. We have highlighted the six observations entering during this rising period. In all panels they cause an extension of the range of the data and so of the variances of the individual responses. They are six of the seven we previously noted in the lower left-hand corner of the plot of y6 against y4 of Figure 6. The panel for y1 against y7 in Figure 14 clearly separates the seventh observation from the other six. Although this group of six observations has no obvious effect on the forward plot of distances, we have been able to identify its effect on the estimated covariance matrix.

Figure 14: Milk data: scatterplot matrix. The six observations highlighted enter during the rising period from m = 71 in Figure 13

7 Swiss Bank Note Data

As a second example with at least two clusters we look at readings on six dimensions of 200 Swiss bank notes, 100 of which may be genuine and 100 forged. All notes have been withdrawn from circulation, so some of the notes in either group may have been misclassified. Also, the forged notes may not form a homogeneous group. For example, there may be more than one forger at work. The data, and a reproduction of the bank note, are given by Flury and Riedwyl (1988, pp. 4-8).

Figure 15 shows the scaled Mahalanobis distances from a forward search starting with 20 observations on notes believed genuine. The traces of distances for the forgeries are shown with dotted lines. In the first part of the search, up to m = 93, the observations seem to fall into two groups. One has small distances and is composed of observations within or shortly to join the subset. Above these there are some outliers and then, higher still, a concentrated band of outliers, all of which are behaving similarly. The structure of this plot has much in common with that of Figure 12, although in that example there are three clusters. Here the two groups are apparent, the forward search yielding the same 100 forgeries as Flury and Riedwyl (1988). However, the latter part of the search in Figure 15 does suggest there is something further to explain: units 1 and 40, shown by continuous lines, are outlying from Group 1 and fall among a much larger number of outlying units all from Group 2.

Figure 15: Swiss bank note data, forward plot of scaled Mahalanobis distances for both groups: the dotted lines are for the forgeries

The structure of the group of forgeries is also readily revealed by the forward plot of scaled Mahalanobis distances just for the forgeries shown in Figure 16. In the centre of the plot, around m = 70, this shows a clear structure of a central group, one outlier from that group and a second group of 15 outliers. As successive units from this cluster enter after m = 85, they become less remote and the distances decrease.

In this example the forward search clearly indicates not only the presence of two groups of notes, but also that the group of forgeries is not homogeneous, itself consisting of two subgroups. Once we know what we are looking for, this third group can be identified on the scatterplot matrix of the observations. Figure 17 is one panel, that of y6 against y4, in which the three groups are most clearly seen.

Figure 16: Swiss bank note data, forward plot of scaled Mahalanobis distances: the forgeries, showing evidence of a third group of 15 units plotted with continuous lines

8 Cluster Analysis and the Diabetes Data

In cluster analysis we seek to divide the data into homogeneous groups, or clusters, without knowing how many clusters there are. As an example of how the forward search can help we look at 145 observations on diabetes patients, which have been used in the statistical literature as a difficult example of cluster analysis. A discussion is given, for example, by Fraley and Raftery (1998). The data were introduced by Reaven and Miller (1979). There are three measurements on each patient: y1 and y2 are respectively plasma glucose and plasma insulin response to oral glucose; y3 is the degree of insulin resistance. The observations were classified into three clusters by doctors, but we ignore this information in our analysis.

Figure 18 is a scatterplot matrix of the data. There seems to be a central cluster and two 'arms' forming separate clusters. The first cluster is appreciably more compact than the other two. There would seem to be no obvious breaks between clusters, so that we can expect our plots to yield less sharp answers than those for the previous examples with two or more groups.

We proceed as before, first fitting a single d i s t r i b u t i o n to all the data.

Figure 17: Swiss bank note data. Scatterplot of y6 against y4 which reveals most of the structure of all 200 observations: there are three groups and an outlier from Group 1, the crosses

The forward plot of Mahalanobis distances contains some gaps, although not as clear as those in Figures 8 and 12, which do suggest three groups. If there are groups of observations, their distances will tend to increase and decrease together. The information in such forward plots of distances can then be summarized by looking at the changes in distances. Figure 19 shows such changes in the forward plot of Mahalanobis distances, ordered by first appearance in the subset, with black representing an increase. The third group are the last to enter this forward search, and show clearly at the top of the plot. The second group is certainly different, entering immediately before the third group and so coming lower in the plot. The division between the first and second groups is not so clear: it might be anywhere between 55 and 75 on the scale of ordered units.
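A summary of this kind can be sketched as follows. The code is an illustration only: it assumes the distances from one search are stored with one row per step and one column per unit, and that the step at which each unit first joins the subset is known; the names `distance_changes`, `D` and `first_entry` are hypothetical.

```python
import numpy as np

def distance_changes(D, first_entry):
    """Indicator of increases in distance between successive steps.

    D           : array of shape (steps, n), distances during the search
    first_entry : array of length n, step at which each unit first enters
    Returns (changes, order): `changes` has one row per unit, ordered by
    first appearance in the subset, and one column per pair of successive
    steps; True marks an increase (plotted in black)."""
    increase = np.diff(D, axis=0) > 0
    order = np.argsort(first_entry, kind="stable")
    return increase[:, order].T, order
```

Displaying the returned boolean array as an image, with the ordered units on the vertical axis and m on the horizontal axis, gives a plot in which units whose distances rise and fall together appear as horizontal bands.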

These preliminary groups can be refined by further forward searches, for example starting with units believed to be correctly classified. Group 2 is rather heterogeneous, being formed from units we did not assign to Groups 1 or 3 in the preliminary analysis. Despite this, the forward plot of scaled Mahalanobis distances in Figure 20 does show a common structure for all units. Initially there is much activity as the units from Group 2 are introduced. Then Group 1 enters, providing a period of stable growth in nearly all distances. At the end, Group 3 enters and the behaviour is once again less homogeneous. The trajectories of five units have been highlighted on the plot. These seem rather different from the other trajectories and we return to these units in a moment.

Figure 18: Diabetes data: scatterplot matrix. There may be three overlapping clusters

Similar analyses of individual Mahalanobis distances from the other groups lead to a division into certainly assigned units and those about which we are less sure. We then switch to forward searches fitting three ellipsoids. Since the variances of the groups are not equal, we use standardized distances with all seemingly certainly allocated observations fitted before any of the unassigned observations are allocated. Many of these units do not change their group membership and so can be allocated with certainty.

Our final forward search is represented in Figure 21, for m = 117 to 145. As a result of our earlier analyses, 116 units were considered classified with certainty. The bottom three lines of the figure are the key and the first column our provisional classification. The lower section of the figure above the key shows the behaviour of units we considered well classified, but which were subject to reclassification during the search. Using standardized distances, units are attracted to less dispersed groups. The figure shows several units from Group 3 attracted to Group 2, which has a smaller dispersion, although larger than that of Group 1. An example is Unit 131, highlighted in Figure 20, which remained throughout in Group 2, although our other analyses indicated membership of Group 3. The top part of the figure shows units about whose membership we are uncertain. Three of the other units highlighted in Figure 20 are classified in Group 1. Despite all our analyses, we are unable to decide about the first five units in the figure.

Figure 19: Diabetes data: forward plot of changes in Mahalanobis distances (black is an increase), suggesting membership of the three groups

Figure 22 is the scatterplot matrix of the final allocation from the forward search, with uncertain units highlighted: these units fall at the intersections of clusters. Comparison with the doctors' original allocation shows that our clustering provides groups which are more coherent and compact.

[Figure 20 plot; horizontal axis: Subset size m]

Figure 20: Diabetes data, starting from Group 2: forward plot of scaled Mahalanobis distances for units provisionally in Group 2; the highlighted units are not certainly classified

9 Finding a Transformation with the Forward Search

Transformations of the data can be used to help satisfy the assumptions of multivariate normality. We have found that extension of the Box and Cox (1964) family to multivariate responses often leads to a significant increase in normality, and so to a simplified analysis of the data. We let $y_{ij}$ be the $i$-th observation on response $j$; $j = 1, \ldots, v$. The normalized transformation of $y_{ij}$ is

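$$z_{ij}(\lambda_j) = \begin{cases} \dfrac{y_{ij}^{\lambda_j} - 1}{\lambda_j G_j^{\lambda_j - 1}} & (\lambda_j \neq 0) \\[2ex] G_j \log y_{ij} & (\lambda_j = 0) \end{cases},$$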

where $G_j$ is the geometric mean of the $j$-th variable. The values $\lambda_j = 1$, $j = 1, \ldots, v$, correspond to no transformation of any of the responses. If the transformed observations are normally distributed with mean $\mu(\lambda)$ and covariance matrix $\Sigma(\lambda)$, the loglikelihood of the observations is given by

$$l(\lambda) = -\frac{n}{2} \log |2\pi\Sigma(\lambda)| - \frac{1}{2} \sum_{i=1}^{n} \{z_i - \mu(\lambda)\}^T \Sigma^{-1}(\lambda) \{z_i - \mu(\lambda)\}, \qquad (6)$$

Figure 21: Diabetes data: final forward search. The classification of units in the top panel is uncertain

where $z_i = (z_{i1}, \ldots, z_{iv})^T$ is the $v \times 1$ vector which denotes the transformed data for unit $i$. Substituting the maximum likelihood estimates $\hat{\mu}(\lambda)$ and $\hat{\Sigma}(\lambda)$ in equation (6), the maximized loglikelihood can be written as

$$l(\lambda) = \text{constant} - \frac{n}{2} \log |\hat{\Sigma}(\lambda)|. \qquad (7)$$

To test the hypothesis $\lambda = \lambda_0$, the likelihood ratio test

$$T_{LR} = n \log\{ |\hat{\Sigma}(\lambda_0)| / |\hat{\Sigma}(\hat{\lambda})| \} \qquad (8)$$

can be compared with the $\chi^2$ distribution on $v$ degrees of freedom. In equation (8) the maximum likelihood estimate $\hat{\lambda}$ is found by numerical search.
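The step from (6) to (7) and (8) can be sketched as follows. This is only an outline, and it assumes the usual maximum likelihood estimates with divisor $n$, $\hat{\mu}(\lambda) = n^{-1}\sum_i z_i$ and $\hat{\Sigma}(\lambda) = n^{-1}\sum_i \{z_i - \hat{\mu}(\lambda)\}\{z_i - \hat{\mu}(\lambda)\}^T$, which are not written out in the text. With these estimates the quadratic form in (6) reduces to a constant:

$$\sum_{i=1}^{n} \{z_i - \hat{\mu}(\lambda)\}^T \hat{\Sigma}^{-1}(\lambda) \{z_i - \hat{\mu}(\lambda)\} = \mathrm{tr}\Big[\hat{\Sigma}^{-1}(\lambda) \sum_{i=1}^{n} \{z_i - \hat{\mu}(\lambda)\}\{z_i - \hat{\mu}(\lambda)\}^T\Big] = \mathrm{tr}(n I_v) = nv.$$

Since $|2\pi\hat{\Sigma}(\lambda)| = (2\pi)^v |\hat{\Sigma}(\lambda)|$, equation (6) then becomes

$$l(\lambda) = -\frac{nv}{2}(\log 2\pi + 1) - \frac{n}{2} \log |\hat{\Sigma}(\lambda)|,$$

which is (7) with the constant made explicit. Twice the difference of the maximized loglikelihoods gives the statistic:

$$T_{LR} = 2\{l(\hat{\lambda}) - l(\lambda_0)\} = n \log\{ |\hat{\Sigma}(\lambda_0)| / |\hat{\Sigma}(\hat{\lambda})| \}.$$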

Riani and Atkinson (2000) use the forward search to find satisfactory transformations for a single variable, if such exist, and the observations that are influential in their choice. For multivariate transformations Riani and Atkinson (2001) monitor a sequence of forward plots of the square root of the likelihood ratio statistic (8) to obtain transformations which describe most of the data, with the outliers entering at the end of the searches. Results on the distribution of the test statistic for univariate transformations in regression are in Atkinson and Riani (2002). These show that asymptotic results are a good guide to significance, provided there is a strong relationship between the mean and the explanatory variables.
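The likelihood ratio statistic (8) monitored in these forward plots can be sketched in a few lines. The following is a minimal illustration in Python, not the authors' Gauss code. The function names, the coordinate-wise grid search standing in for the "numerical search" for $\hat{\lambda}$, and the synthetic log-normal data are all assumptions made for the example.

```python
import math
import random


def normalized_boxcox(column, lam):
    """Normalized Box-Cox transform of one strictly positive variable."""
    n = len(column)
    geo_mean = math.exp(sum(math.log(y) for y in column) / n)  # G_j
    if abs(lam) < 1e-12:
        return [geo_mean * math.log(y) for y in column]
    scale = lam * geo_mean ** (lam - 1.0)
    return [(y ** lam - 1.0) / scale for y in column]


def log_det_sigma(columns, lam):
    """log |Sigma_hat(lambda)| using the ML (divisor n) covariance estimate."""
    n = len(columns[0])
    centred = []
    for col, l in zip(columns, lam):
        z = normalized_boxcox(col, l)
        mean = sum(z) / n
        centred.append([x - mean for x in z])
    a = [[sum(p * q for p, q in zip(zi, zj)) / n for zj in centred] for zi in centred]
    # Log-determinant by Gaussian elimination (pivots are positive if a is positive definite).
    total = 0.0
    v = len(a)
    for k in range(v):
        pivot = a[k][k]
        if pivot <= 0.0:
            raise ValueError("covariance matrix is not positive definite")
        total += math.log(pivot)
        for i in range(k + 1, v):
            factor = a[i][k] / pivot
            for j in range(k, v):
                a[i][j] -= factor * a[k][j]
    return total


def lr_statistic(columns, lam0, lam_hat):
    """T_LR = n log{ |Sigma_hat(lam0)| / |Sigma_hat(lam_hat)| }, as in equation (8)."""
    n = len(columns[0])
    return n * (log_det_sigma(columns, lam0) - log_det_sigma(columns, lam_hat))


def grid_search_lambda(columns, grid, start, sweeps=3):
    """Crude stand-in for the numerical search: coordinate-wise minimisation
    of log |Sigma_hat(lambda)| over a grid (equivalent to maximising (7))."""
    lam = list(start)
    for _ in range(sweeps):
        for j in range(len(lam)):
            def objective(g, j=j):
                return log_det_sigma(columns, lam[:j] + [g] + lam[j + 1:])
            best = min(grid, key=objective)
            if objective(best) < log_det_sigma(columns, lam):
                lam[j] = best  # accept only strict improvements
    return lam


# Synthetic, positively skewed data (NOT from the paper): two correlated log-normal responses.
rng = random.Random(0)
y1, y2 = [], []
for _ in range(100):
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y1.append(math.exp(a))
    y2.append(math.exp(0.5 * a + b))
columns = [y1, y2]

lam0 = [1.0, 1.0]                       # hypothesis of no transformation
grid = [k / 4.0 for k in range(-4, 7)]  # -1.0, -0.75, ..., 1.5
lam_hat = grid_search_lambda(columns, grid, start=lam0)
t_lr = lr_statistic(columns, lam0, lam_hat)
print("lambda_hat =", lam_hat, " T_LR =", round(t_lr, 3))
```

Under these assumptions, applying `lr_statistic` to the first m units in the order of entry would give one point of a forward plot. A proper optimiser would replace the grid search in any serious use.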

[Figure 22 scatterplot matrix; panels: y1 = plasma glucose, y2 = plasma insulin, y3 = insulin resistance]

Figure 22: Diabetes data: scatterplot matrix, showing units not certainly classified

10 Swiss Heads Again

One purpose of an analysis using the forward search is to establish the connection between individual units and inferences. We conclude with a dramatic example of the way in which overall statistics can be misleading about the greater part of the data, using the data on Swiss heads introduced earlier.

We showed earlier that there were two outliers, units 104 and 111, which were the last two to enter the forward search. They did not seem to have any effect on inferences. However, Figure 23 is a forward plot of the likelihood ratio test for testing that all six values of $\lambda$ are equal to one. This is based on a search on untransformed data, so that the order of entry of the units is the same as in our earlier analysis. In particular, units 104 and 111 are the last two to enter. The figure shows the enormous impact these two observations have on the evidence for transforming the data. At m = 198 the value of the statistic is 7.218, only slightly above the expected value for a $\chi^2_6$ random variable. This rises to 15.84 after the two outliers have entered, a value above the 95% point of the distribution, which is indicated on the plot. Without the information provided by the forward plot it would be easy to be misled

[Figure 23 plot; horizontal axis: Subset size m]

Figure 23: Swiss heads: forward plot of the likelihood ratio test for the hypothesis of no transformation. The horizontal line is the 95% point of $\chi^2_6$. The last two units to enter provide all the evidence for a transformation

into believing that the data need transformation.

Evidence for a transformation is provided by a skewed distribution. The only skew distribution in the scatterplots of the data such as those in Figure 3 is the marginal distribution of $y_4$, caused by the outlying values of units 104 and 111. To test whether all the evidence for the transformation is due to $y_4$ we repeat the calculation of the likelihood ratio, but now only testing whether $\lambda_4 = 1$. The other five values of $\lambda$ are kept at one, both in the null parameter vector $\lambda_0$ and in the m.l.e. $\hat{\lambda}$. The search is therefore the same as before, but now gives rise to Figure 24, showing a test statistic to be compared with $\chi^2_1$.

It is now even clearer that all evidence for transformation of $y_4$ is provided by the inclusion of units 104 and 111. At the end of the search the test statistic has a value of 8.789, compared with 15.84 in Figure 23 for transforming all six variables. The difference, 7.05, is not significant for a $\chi^2_5$ random variable, so that the evidence of the tests at the end of the search is that $\lambda_4 \neq 1$, whereas the transformation parameters for all other variables are equal to one.
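The arithmetic behind these comparisons can be checked with a short sketch. The helper `chi2_sf` below is an illustrative function written for this example (it is not part of the authors' software). It evaluates the upper tail of a chi-squared distribution with integer degrees of freedom from the standard closed-form series, so that only the Python standard library is needed. The three statistics are the values quoted in the text.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-squared variable on integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
        return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))
    # Odd df = 2m + 1: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{m} (x/2)^(k - 1/2) / Gamma(k + 1/2)
    series = sum(half ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, df // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series


# Values quoted in the text.
t_all_m198 = 7.218  # all six lambdas, m = 198 (before units 104 and 111 enter)
t_all_end = 15.84   # all six lambdas, end of the search
t_y4_end = 8.789    # lambda_4 only, end of the search
difference = t_all_end - t_y4_end  # evidence for transforming the other five variables

checks = [
    ("all six, m = 198, 6 df", t_all_m198, 6),
    ("all six, end of search, 6 df", t_all_end, 6),
    ("y4 only, end of search, 1 df", t_y4_end, 1),
    ("difference, 5 df", difference, 5),
]
for label, stat, df in checks:
    print(f"{label}: statistic = {stat:.3f}, tail probability = {chi2_sf(stat, df):.4f}")
```

A tail probability below 0.05 corresponds to a statistic above the 95% point drawn on the plots.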

This example shows the importance of the forward search in discovering influential observations. It also shows the power of combining statistical modelling with graphical presentation.

[Figure 24 plot; horizontal axis: Subset size m]

Figure 24: Swiss heads: forward plot of the likelihood ratio test for the hypothesis of no transformation of $y_4$. The horizontal lines are the 95% and 99% points of $\chi^2_1$. The last two units to enter also provide all the evidence for this transformation

11 Discussion

11.1 Computing

Our examples are calculated using Gauss, which provides a combination of numerical programming and working graphics that we have found invaluable in developing the methods of the forward search. Most of our production plots are produced in S-Plus. The methods for univariate data (regression and generalized linear models) have been programmed in a Graphical User Interface (GUI) for S-Plus, which is available from the website for Atkinson and Riani (2000). A second book, Atkinson, Riani, and Cerioli (2003), on the analysis of multivariate data, will give details of the methods sketched in this paper.

Software will be made available on the same site http://www.riani.it/at

11.2 Graphics

Forward plots of Mahalanobis distances, such as Figure 1, are taken directly from the output of our Gauss program. On the screen the variety of line types and colours, together with the ability to zoom, makes it possible to follow the trajectories of individual units in a way which is impossible on the printed page. Procedures for enhancing the plots include highlighting the