### UNIVERSITAT DE BARCELONA

A DISTANCE BASED APPROACH TO DISCRIMINANT ANALYSIS AND ITS PROPERTIES

by

C. M. Cuadras

AMS Subject Classification: 62H30


Mathematics Preprint Series No. 90


A DISTANCE BASED APPROACH TO DISCRIMINANT ANALYSIS AND ITS PROPERTIES

C.M. Cuadras

Departament d'Estadística, Diagonal 645, 08028 Barcelona, Spain

ABSTRACT

A general distance based method for allocating an observation to one of several known populations, on the basis of both continuous and discrete explanatory variables, is proposed and studied. The method depends on a given statistical distance between observations and leads to some classic discriminant rules when suitable distances are taken. The error rates can be easily computed and, unlike other distance classification rules, the prior probabilities can be taken into account.

Key words and phrases: Linear and quadratic discriminant functions; Location model; Statistical distances; Given marginals.

AMS subject classification: 62H30


1. INTRODUCTION

The problem of allocating an individual to one of two populations was proposed by Fisher (1936), who obtained the linear discriminant function (LDF). Optimality of LDF under the assumption of multivariate normality, with a common covariance matrix, is a well-known property. LDF gives a good performance even when the above assumptions are violated, and several studies have been carried out on LDF as well as on alternative methods (Krzanowski, 1977; Seber, 1984; Raveh, 1989). The discrimination problem leads to a quadratic discriminant function (QDF) when the covariance matrices are not equal.

LDF is essentially based on continuous variables (Wald, 1944; Anderson, 1958). However, most problems in applied statistics fall in the mixed case, i.e., the variables are both continuous and discrete. An interesting approach, based on the location model (LM), has been given by Krzanowski (1975, 1986, 1987). LM is applicable when the data contain both continuous and binary variables. This method computes an LDF for each pattern of the binary variables, is optimal under the assumption of conditional normality, and can be extended to the multistate case for discrete variables (Lachenbruch and Goldstein, 1979). However, the LM approach needs a considerable computational effort and has not been implemented in the standard packages (Knoke, 1982). Moreover, LM cannot be used when there are not enough data for many locations (Vlachonikolis and Marriott, 1982).

Logistic regression (LR) is another interesting approach, which includes quite a wide class of models and also allows discrimination with mixed variables (Anderson, 1972). LR also requires a large amount of computing.

LDF, QDF, LM and LR are all based on the ratio of probability density functions, i.e., on the maximum likelihood rule (ML). The ML rule is a particular case of the Bayes discriminant rule (BR), the rule obtained after supposing that the populations have prior probabilities. BR provides a general approach to discriminant analysis which has certain optimal properties from a theoretical point of view.

Another general approach, first studied by Matusita (1956), is based on the concept of distance. Let $d(\omega, \pi_i)$ be a distance between $\omega$ and $\pi_i$, $i = 1,2$, where $\omega$ is an individual to be allocated. The allocation rule is: allocate $\omega$ to the nearest population $\pi_i$.

LDF is closely related to the Mahalanobis distance, and both approaches, ML and distance, are equivalent for multivariate normal data (Mardia et al., 1979).

If we substitute the identity matrix for the covariance matrix, we obtain the Euclidean distance classifier (EDF), which is an alternative to LDF (Marco et al., 1987) and has advantages when the number of variables is large relative to the training sample size. EDF is a distance rule based on the Euclidean distance. The distance rule is also equivalent to BR (including the ML rule) for the LM model, as is proved by Krzanowski (1986, 1987).

A distance-based approach (DB) for regression and classification with mixed variables was proposed by Cuadras (1989) and Cuadras and Arenas (1990).

Although considerable research has been done on discrimination, and the state of the art of allocation has reached such a high level that it seems difficult to furnish original ideas, this paper has the aim of examining further aspects and applications of discrimination by the above mentioned DB approach.

Given every pair of individuals, the DB method uses a distance $d(\omega,\omega')$ to compute discriminant functions. The assumption underlying this method is the following: in some circumstances (mixed data, for example), it is easier to obtain a distance $d(\cdot,\cdot)$ than a probability density function $p(\cdot)$.

2. A DISTANCE-BASED CLASSIFICATION RULE IN TWO POPULATIONS OBTAINED FROM SAMPLE DATA

Let $\pi_1$, $\pi_2$ be two populations and suppose that $\omega$ is an individual to be allocated. Suppose that a distance function $\delta(\cdot,\cdot)$ is defined on the basis of several (mixed) variables. The distance-based rule (1), or DB rule, is introduced and some of its properties are studied by Cuadras (1989).

2.1 Training sets

Suppose that samples $C_1$ and $C_2$ of sizes $n_1$ and $n_2$ are available from $\pi_1$ and $\pi_2$, respectively. Suppose that an $n_1 \times n_1$ intradistance matrix $D_1 = (\delta_{ij}(1))$ can be computed on $C_1$ by using the distance $\delta(\cdot,\cdot)$, and let $\delta_i(1)$, $i = 1,\cdots,n_1$, be the distances from $\omega$ to the elements of $C_1$; $D_2 = (\delta_{ij}(2))$ and $\delta_i(2)$, $i = 1,\cdots,n_2$, are analogously introduced. Let us define the discriminant functions as follows:

$$f_k(\omega) = \frac{1}{n_k}\sum_{i=1}^{n_k}\delta_i^2(k) - \frac{1}{2n_k^2}\sum_{i,j=1}^{n_k}\delta_{ij}^2(k), \qquad k = 1,2. \qquad (1)$$

A decision rule for allocation of $\omega$ is:

allocate $\omega$ to $\pi_i$ if $f_i(\omega) = \min\{f_1(\omega), f_2(\omega)\}. \qquad (2)$

This rule leads to a minimum-distance classification rule, as a consequence of Theorem 1.
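To fix ideas, the discriminant functions (1) and rule (2) can be computed from pairwise distances alone. The following sketch (the helper names are ours, not the paper's) uses squared distances stored in plain Python lists; with a Euclidean distance, `db_score` reproduces the squared distance from the new observation to the sample mean, in agreement with Theorem 1 below.

```python
# Sketch of the DB discriminant function (1): mean squared distance from the
# new individual to the sample C_k, minus half the mean squared intradistance.
def db_score(d_to_sample, intra):
    # d_to_sample: squared distances delta_i^2(k) from omega to C_k
    # intra: n_k x n_k matrix of squared intradistances delta_ij^2(k)
    n = len(d_to_sample)
    mean_to = sum(d_to_sample) / n
    mean_intra = sum(sum(row) for row in intra) / (n * n)
    return mean_to - 0.5 * mean_intra

# Rule (2): allocate to the population with the smallest f_k.
def allocate(d1, intra1, d2, intra2):
    f1 = db_score(d1, intra1)
    f2 = db_score(d2, intra2)
    return 1 if f1 <= f2 else 2
```

For one-dimensional samples $C_1 = \{0,1,2\}$ with squared Euclidean distances, `db_score` for the point $x = 3$ equals $(3-1)^2 = 4$, the squared distance to the sample mean.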

Let us denote $C = \{1,2,\cdots,n\}$. $D = (\delta_{ij})$ is an $n \times n$ distance matrix, $(n+1)$ is a new individual, $\delta_1,\cdots,\delta_n$ are the distances from $(n+1)$ to each element of $C$, and $f(n+1)$ is defined according to (1). Then $(n+1)$ and $C$ can be represented by the points $P, P_1,\cdots,P_n \in \mathbb{R}^r \times g\mathbb{R}^s$, where $g = \sqrt{-1}$ and $r + s = n - 1$. The coordinates are given by

$$P = (x', gy', z), \qquad P_i = (x_i', gy_i', 0), \quad i = 1,\cdots,n,$$

and it is satisfied that

$$\delta_{ij}^2 = d^2(P_i,P_j) = \|x_i - x_j\|^2 - \|y_i - y_j\|^2,$$

where $\|\cdot\|$ denotes the Euclidean norm. An explicit construction of $P, P_1,\cdots,P_n$ is given in the following:

Theorem 1

Set $\bar P = (\bar x', g\bar y', 0)$, where $\bar x = (\sum_i x_i)/n$ and $\bar y = (\sum_i y_i)/n$. Then

$$f(n+1) = d^2(P, \bar P).$$

In particular, if $D$ is a Euclidean distance matrix and $n+1$, $C$ are represented by $x, x_1,\cdots,x_n \in \mathbb{R}^p$, then

$$f(\omega) = d^2(x,\bar x) = \|x - \bar x\|^2.$$

Proof. Let $B = (b_{ij}) = HAH$, where $H$ is the centring matrix and $A = (a_{ij}) = (-\frac{1}{2}\delta_{ij}^2)$. If $B \ge 0$, i.e. $D$ is a Euclidean distance matrix, let $\Lambda_x$ be the diagonal matrix of positive eigenvalues and $X$ the matrix of corresponding eigenvectors. The rows of $X$, $x_1',\cdots,x_n'$, verify $\delta_{ij}^2 = \|x_i - x_j\|^2$ and constitute the classical solution of MDS (Mardia et al., 1979).

In general ($D$ need not be Euclidean), we must also consider a diagonal matrix $\Lambda_y$ of negative eigenvalues of $B$, whose corresponding eigenvectors are the columns of the matrix $Y$, and $y_1',\cdots,y_n'$ are the rows of $Y$ (Lingoes, 1971). Each element $i$ (say) of $C$ can be represented by $P_i = (x_i', gy_i', 0)$ and $(n+1)$ by $(x', gy', z)$. Note that, as $\mathbf{1}$ is an eigenvector of $B$ of eigenvalue $0$, then $X'\mathbf{1} = 0$ and $Y'\mathbf{1} = 0$, so $\bar x = (\sum x_i)/n = 0$ and $\bar y = (\sum y_i)/n = 0$. Gower (1968) proves that if $B \ge 0$, the coordinates of $(n+1)$ are given by

$$x = \tfrac{1}{2}\Lambda_x^{-1}X'(b - \delta), \qquad z^2 = \frac{1}{n}\sum_i \delta_i^2 - \frac{1}{n}\,\mathrm{tr}\,B - x'x,$$

where $b = (b_{11},\cdots,b_{nn})'$ and $\delta = (\delta_1^2,\cdots,\delta_n^2)'$. If $D$ is a non-Euclidean distance matrix, it is easily proved that

$$y = \tfrac{1}{2}\Lambda_y^{-1}Y'(b - \delta), \qquad z^2 = \frac{1}{n}\sum_i \delta_i^2 - \frac{1}{n}\,\mathrm{tr}\,B - (x'x - y'y).$$

On the other hand

$$\sum_{i,j}\delta_{ij}^2 = 2n\,\mathrm{tr}\,B,$$

hence

$$f(n+1) = z^2 + (x'x - y'y).$$

Since $\bar x = 0$ and $\bar y = 0$, all coordinates of $\bar P$ are null. Hence the squared distance between $\bar P$ and $P = (x', gy', z)$ is $f(n+1)$. $\square$

Now it is clear that decision rule (2) is equivalent to:

allocate $\omega$ to $\pi_1$ if $d^2(P,\bar P) < d^2(Q,\bar Q)$, otherwise to $\pi_2$,

where $P$, $\bar P$, $Q$ and $\bar Q$ are suitably constructed. Note that this construction is not needed to obtain $f_1(\omega)$ and $f_2(\omega)$.

Decision rule (2) leads to standard discriminant functions for continuous data and usual distances. Let $x_{k1},\cdots,x_{kn_k}$, $k = 1,2$, be the training samples and $x$ the observation to be allocated. Then (2) is based on the LDF

$$L(x) = \left(x - \tfrac{1}{2}(\bar x_1 + \bar x_2)\right)' S^{-1}(\bar x_1 - \bar x_2),$$

where $S$ is the pooled sample covariance matrix, provided that the square distance is the Mahalanobis distance

$$\delta_{ij}^2(k) = (x_{ki} - x_{kj})' S^{-1}(x_{ki} - x_{kj}). \qquad (3)$$

If the covariance matrices need not be equal and we replace $S$ by $S_k$ in (3), then rule (2) is based on a QDF.

Finally, if we choose the Euclidean distance, the decision rule is based on the EDF

$$E(x) = \left(x - \tfrac{1}{2}(\bar x_1 + \bar x_2)\right)'(\bar x_1 - \bar x_2).$$

2.2 Error rates

The leaving-one-out method to estimate the probability of misclassification (Lachenbruch, 1975) can be applied with simple additional computation. Let us denote $A = (a_{ij})$, $B = (b_{ij})$ and $C = (c_{ij})$, where

$$a_{ij} = \delta_{ij}^2(1), \qquad b_{ij} = \delta_{ij}^2(2), \qquad c_{ij} = \delta^2(i,j), \quad i \in C_1,\ j \in C_2.$$

$A$, $B$ and $C$ are $n_1 \times n_1$, $n_2 \times n_2$ and $n_1 \times n_2$ matrices respectively. If

$$a = \sum_{i<j} a_{ij}, \qquad a_{i\cdot} = \sum_{r=1}^{n_1} a_{ir}, \qquad c_{i\cdot} = \sum_{r=1}^{n_2} c_{ir},$$
and

_{£>,&_, and}

_{c.j}

_{are}

_{similarly}

_{defined,}

_{to}

_{allocate i}

_{G}

_{C\}

_{we use}

_{the}

discriminant function
/i(0 = (ni - 1)

### 1

a, - (ni - 1)### 2

(a- a,)and

/2(i) = nj

### 1c¿.

—### nj2&.

Hence

$$m_1 = \mathrm{freq}_{i \in C_1}\left[f_1(i) - f_2(i) > 0\right] \qquad (4)$$

is the frequency of individuals misclassified in $C_1$. The frequency $m_2$ of misclassification in $C_2$ is analogously obtained as a function of $b$, $b_{j\cdot}$ and $c_{\cdot j}$. If the $n = n_1 + n_2$ observations are random samples from $\pi_1 \cup \pi_2$, the error rate is estimated by

$$\hat e = \frac{m_1 + m_2}{n_1 + n_2}. \qquad (5)$$

Thus $\hat e$ is directly computed from $A$, $B$ and $C$.
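As an illustration, $m_1$ in (4) can be obtained with a few lines of code operating only on $A$, $B$ and $C$ (a sketch with our own function names, not the paper's):

```python
# Leave-one-out misclassification count m1 for the first sample, computed
# from the matrices A = (a_ij), B = (b_ij), C = (c_ij) of squared distances,
# using the formulas for f1(i) and f2(i) above.
def loo_misclassified(A, B, C):
    n1, n2 = len(A), len(B)
    a = sum(A[i][j] for i in range(n1) for j in range(i + 1, n1))
    b = sum(B[i][j] for i in range(n2) for j in range(i + 1, n2))
    m1 = 0
    for i in range(n1):
        f1 = sum(A[i]) / (n1 - 1) - (a - sum(A[i])) / (n1 - 1) ** 2
        f2 = sum(C[i]) / n2 - b / n2 ** 2
        if f1 - f2 > 0:      # i would be allocated to pi_2: an error
            m1 += 1
    return m1
```

$m_2$ follows by symmetry, applying the same function to $B$, $A$ and the transpose of $C$; the estimate $\hat e$ in (5) is then $(m_1 + m_2)/(n_1 + n_2)$.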

Now suppose that $C_1$ and $C_2$ are nonoverlapping classes, that is,

$$d^2(x_i, \bar x_{-i}) < d^2(x_i^*, \bar y) \qquad \forall i \in C_1, \qquad (6)$$

$$d^2(y_j, \bar y_{-j}) < d^2(y_j^*, \bar x) \qquad \forall j \in C_2, \qquad (7)$$

where $x_i$, $x_i^*$ are the Euclidean representations of $i$ as a member of $C_1$, $C_2$, respectively, and $\bar x_{-i}$ is the mean on $C_1$ after leaving out individual $i$. Similarly we have $y_j$, $y_j^*$ and $\bar y_{-j}$. It is easily proved that

$$f_1(i) = d^2(x_i, \bar x_{-i}), \qquad f_2(i) = d^2(x_i^*, \bar y).$$

Thus, if both (6) and (7) apply, $m_1 = m_2 = 0$ and a perfect classification for nonoverlapping classes is obtained.

2.3 Distance between samples

Suppose that the observable variables are both quantitative and qualitative (nominal, ordinal, dichotomous) and that the distribution of the variables is unknown. Then we may use any distance function chosen among the wide repertory available. One candidate is the square distance

$$d_{ij}^2 = 1 - s_{ij},$$

where $s_{ij}$ is the all-purpose measure of similarity proposed by Gower (1971) and well described by Seber (1984). This distance provides a Euclidean distance matrix and gives good results in regression with mixed data, see Cuadras and Arenas (1990). When some values are missing, $d_{ij}^2$ can also be obtained, but it could be a non-Euclidean distance. However, rule (2) still applies. Therefore, this DB method can be used for handling missing values in discriminant analysis.

For continuous variables, letting $(x_{i1},\cdots,x_{ip})$, $(x_{j1},\cdots,x_{jp})$ be two observations for individuals $i$ and $j$, we also use the square distance

$$d_{ij}^2 = |x_{i1} - x_{j1}| + \cdots + |x_{ip} - x_{jp}|, \qquad (8)$$

which will be compared with the Mahalanobis and Euclidean distances in the examples given in section 6.
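For illustration only, here is a minimal Gower-type similarity with $d_{ij}^2 = 1 - s_{ij}$. This is our simplified version (range-normalised differences for continuous variables, simple matching for qualitative ones); the paper relies on the exact definition in Gower (1971) and Seber (1984).

```python
# Sketch of a Gower-type similarity for mixed data: s_ij averages a
# per-variable score, and the square distance is d^2_ij = 1 - s_ij.
def gower_sq_dist(x, y, kinds, ranges):
    # kinds[k] is "num" or "cat"; ranges[k] is the observed range of
    # variable k (used only for "num").  Missing values (None) are
    # skipped, which may make the distance matrix non-Euclidean.
    num, den = 0.0, 0
    for xk, yk, kind, rk in zip(x, y, kinds, ranges):
        if xk is None or yk is None:
            continue
        num += (1 - abs(xk - yk) / rk) if kind == "num" else float(xk == yk)
        den += 1
    return 1.0 - num / den
```

Skipping missing values, as above, is what may make the resulting distance matrix non-Euclidean; rule (2) still applies in that case.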


3. CLASSIFICATION WHEN THE DISTRIBUTIONS ARE KNOWN

Suppose that the observable variables are related to a random vector with a probability density $p_i(x)$ with respect to a suitable measure $\lambda$, if $x$ comes from $\pi_i$, $i = 1,2$.

Let $x_0$ be an observation to be allocated and $\delta(\cdot,\cdot)$ a distance function. The discriminant function which generalizes (1) is given by

$$f_i(x_0) = \int \delta^2(x_0,x)\,p_i(x)\,d\lambda(x) - \frac{1}{2}\iint \delta^2(x,y)\,p_i(x)\,p_i(y)\,d\lambda(x)\,d\lambda(y) = H_{i0} - \tfrac{1}{2}H_i. \qquad (9)$$

$H_{i0}$ is the expectation of the random variable $\delta^2(x_0,x)$ on $\pi_i$. We assume $x$, $y$ to be independent to obtain the expectation of $\delta^2(x,y)$ on $\pi_i \times \pi_i$. The decision rule for allocating $x_0$ is:

allocate $x_0$ to $\pi_i$ if $f_i(x_0) = \min\{f_1(x_0), f_2(x_0)\}. \qquad (10)$

Before presenting some properties of the DB rule (10), we introduce a distance between individuals based on the so-called Rao distance (Rao, 1945), studied later by Burbea and Rao (1982), Oller and Cuadras (1985) and others.

3.1 A distance between individuals

Suppose that the random vector $X$ is related to a statistical model $S = \{p(x;\theta)\}$, where $\theta$ is an $n$-dimensional vector parameter and $p(x;\theta)$ is a probability density. Assume the usual regularity conditions and consider the efficient score

$$Z_\theta = \frac{\partial}{\partial\theta}\log p(X;\theta). \qquad (11)$$

Then $E(Z_\theta) = 0$ and $G = E(Z_\theta Z_\theta')$ is the Fisher information matrix, where $Z_\theta$ is interpreted as a column vector.


Transform two observations $x_1, x_2$ to $z_1, z_2$ by using (11). The square distance between $x_1$, $x_2$ with respect to the statistical model $S = \{p(x;\theta)\}$ is given by

$$\delta^2(x_1,x_2) = (z_1 - z_2)'G^{-1}(z_1 - z_2). \qquad (12)$$

This distance was introduced by Cuadras (1988, 1989) and especially Oller (1989), who provides further theoretical justification.

If $p(x;\theta)$ is the $N(\mu,\Sigma)$ distribution ($\Sigma$ being fixed), then $Z_\mu = \Sigma^{-1}(X - \mu)$, $G = \Sigma^{-1}$, and (12) reduces to the familiar Mahalanobis distance. If $X = (x_1,\cdots,x_m)$ has the multinomial distribution $\prod_k p_k^{x_k}$, where $x_k \in \{0,1\}$, $k = 1,\cdots,m$, $x_1$ falls in the cell $r$ and $x_2$ falls in the cell $s$, we obtain the square distance

$$\delta^2(x_1,x_2) = (1 - \delta_{rs})\left(p_r^{-1} + p_s^{-1}\right), \qquad (13)$$

where $\delta_{rs}$ is the Kronecker delta.
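To make (12) concrete in the simplest case, consider a univariate $N(\mu,\sigma^2)$ with $\sigma^2$ fixed. The sketch below (our example, not the paper's) shows that the score-based distance reduces to the Mahalanobis distance $(x_1 - x_2)^2/\sigma^2$, independently of $\mu$.

```python
# Squared score distance (12) for a univariate normal with fixed variance:
# the efficient score is z = (x - mu)/sigma2 and the Fisher information is
# G = 1/sigma2, so (z1 - z2)' G^{-1} (z1 - z2) = (x1 - x2)^2 / sigma2.
def score_distance_sq(x1, x2, mu, sigma2):
    z1 = (x1 - mu) / sigma2    # efficient score at x1
    z2 = (x2 - mu) / sigma2    # efficient score at x2
    g_inv = sigma2             # G^{-1}
    return (z1 - z2) * g_inv * (z1 - z2)
```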

In general, distance (12) is related to the quadratic form $Z_\theta' G^{-1} Z_\theta$, where $Z_\theta$ is given by (11), which satisfies (Oller, 1989)

$$E\left(Z_\theta' G^{-1} Z_\theta\right) = E\left(\mathrm{tr}\, G^{-1} Z_\theta Z_\theta'\right) = \mathrm{tr}\left(G^{-1}G\right) = n. \qquad (14)$$

Suppose next that the random variables are related to a random vector $W = (X_1, X_2)$, where $X_i$ is a random vector with density $p_i(x_i,\theta_i)$ with respect to a measure $\lambda_i$, $i = 1,2$. Suppose that there is a density $p(w,\theta_1,\theta_2)$ with respect to a suitable measure $\mu$, with marginals $p_i(x_i,\theta_i)$, $i = 1,2$. Letting

$$Z_{\theta_i} = \frac{\partial}{\partial\theta_i}\log p_i(x_i,\theta_i),$$

let us define $G_{kj} = E(Z_{\theta_k} Z_{\theta_j}')$ and the symmetric matrix

$$G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix},$$

the expectation taken with respect to $p(w,\theta_1,\theta_2)$. Since

$$G_{11} = \iint Z_{\theta_1} Z_{\theta_1}'\, p(x_1,x_2,\theta_1,\theta_2)\, d\lambda_1(x_1)\, d\lambda_2(x_2) = \int Z_{\theta_1} Z_{\theta_1}'\, p_1(x_1,\theta_1)\, d\lambda_1(x_1),$$

it is clear that $G_{ii}$ is the Fisher information matrix for $p_i(x_i,\theta_i)$, $i = 1,2$. However, $G$ is not a Fisher information matrix for $p(w,\theta_1,\theta_2)$.

Assume that $G_{12}$ is a suitable fixed matrix and $w_1 = (x_{11}, x_{12})$, $w_2 = (x_{21}, x_{22})$ are two observations transformed to $z_1 = (z_{11}, z_{12})$, $z_2 = (z_{21}, z_{22})$ by means of (11), i.e.,

$$z_{ij} = \frac{\partial}{\partial\theta_j}\log p_j(x_{ij},\theta_j).$$

We define the square distance between $w_1$ and $w_2$ as

$$\delta^2(w_1,w_2) = (z_1 - z_2)'G^{-1}(z_1 - z_2). \qquad (15)$$

This distance can be used when $w$ is a mixed random vector, for example, $x_1$ of the continuous type and $x_2$ of the discrete type.

When both $X_1$ and $X_2$ have absolutely continuous distributions with respect to the Lebesgue measure, the pdf $p(w,\theta_1,\theta_2)$ exists.

Theorem 2

Suppose that (11) gives a transformation from $x_i$ to $z_{\theta_i}$ which is one-to-one and satisfies the condition of the change of variables for multiple integrals. Then (under some restrictive conditions) there exists a pdf $p(w,\theta_1,\theta_2)$ with marginals $p_i(x_i,\theta_i)$, $i = 1,2$, such that $G_{12} = E(Z_{\theta_1} Z_{\theta_2}')$.

This result can be obtained as a consequence of the problem of constructing probability distributions with given multivariate marginals and a given intercorrelation matrix. However, this construction is quite complicated. A solution is given by Cuadras (1990).

Although the density $p(w,\theta_1,\theta_2)$ exists, the true function is unknown in practice. Nevertheless, distance (15) can be computed. Consequently, a DB classification rule is available for mixed variables if only the marginal distributions are known. This, again, is the main idea of this paper.

3.2 Parameter estimation with given marginals

Parameters $\theta_1$ and $\theta_2$ are unknown in applications. Suppose that $w_i = (x_{i1}, x_{i2})$, $i = 1,\cdots,N$, is a random sample from a population with pdf $p(w,\theta_1,\theta_2)$. However, only the marginal pdfs $p_j(x_j,\theta_j)$, $j = 1,2$, are known.

Let $\hat\theta = (\hat\theta_1, \hat\theta_2)$ be the ML estimation obtained considering the product pdf $p_1(x_1,\theta_1)\,p_2(x_2,\theta_2)$, i.e., the solution of

$$\sum_{i=1}^{N}\sum_{j=1}^{2} u_j(x_{ij},\hat\theta_j) = 0,$$

where

$$u_j(x_j,\theta_j) = \frac{\partial}{\partial\theta_j}\log p_j(x_j,\theta_j).$$

The true value $\theta_0$ is characterized by $E\{u(w,\theta_0)\} = 0$ (Kent, 1982; Royall, 1986), where the expectation is taken with respect to $p(w,\theta_1,\theta_2)$. Since

$$\int u_i(x_i,\theta_i)\,p_i(x_i,\theta_i)\,dx_i = 0$$

and similarly for $u_2$, $\hat\theta$ is a consistent estimate of $\theta_0$ (see also Huster et al., 1989). Finally, a consistent estimation of $G_{kj}$ is given by

$$\hat G_{kj} = \frac{1}{N}\sum_{i=1}^{N} u_k(x_{ik},\hat\theta_k)\,u_j'(x_{ij},\hat\theta_j).$$

3.3 Some properties of the classification rule

The discriminant function (9) and the decision rule (10) satisfy certain properties.

1) $H_i$ is the diversity coefficient of $\pi_i$ and $H_{i0}$ is the average difference between $\pi_0$ and $\pi_i$, where $\pi_0 = \{x_0\}$. Since $H_0 = 0$ we find that

$$f_i(x_0) = H_{i0} - \tfrac{1}{2}(H_i + H_0)$$

is the Jensen difference between $\pi_i$ and $\pi_0$.

2) Suppose that $x_1$ and $x_2$ are independent random vectors and the distances $\delta_1^2(\cdot,\cdot)$, $\delta_2^2(\cdot,\cdot)$ are defined by using $x_1$, $x_2$ respectively. According to Oller (1989), let us define the square distance between $(x_{11}, x_{12})$ and $(x_{21}, x_{22})$ by

$$\delta^2(x_{11},x_{12};x_{21},x_{22}) = \delta_1^2(x_{11},x_{21}) + \delta_2^2(x_{12},x_{22}),$$

and let $(x_{01}, x_{02})$ be an observation to be allocated. Then, using an obvious notation, it is verified that

$$f_i(x_{01},x_{02}) = f_{i1}(x_{01}) + f_{i2}(x_{02}). \qquad (16)$$

3) If $X$ is $N(\mu_i,\Sigma)$ on $\pi_i$ and the Mahalanobis distance is adopted, then

$$f_i(x_0) = (x_0 - \mu_i)'\Sigma^{-1}(x_0 - \mu_i),$$

and rule (10) leads to a minimum-distance rule. This result is even valid for nonnormal data, see Theorem 3.

4) If $X$ has the multinomial distribution $\prod_k p_{ik}^{x_k}$ when $X$ comes from $\pi_i$, and if we use distance (13), then

$$f_i(x_0) = (1 - p_{ik})/p_{ik} \qquad (17)$$

if $x_0$ falls in the cell $k$. Thus, the classification rule is: allocate $x_0$ to $\pi_i$ if $p_{ik} = \max\{p_{1k}, p_{2k}\}$.

Let us now suppose that, given $x_1, x_2 \in \pi_i$, there exists $\varphi : \pi_i \to \mathbb{R}^q$, $t_1 = \varphi(x_1)$, $t_2 = \varphi(x_2)$, such that

$$\delta^2(x_1,x_2) = (t_1 - t_2)'(t_1 - t_2), \qquad t_1, t_2 \in \mathbb{R}^q,$$

i.e., $\delta(\cdot,\cdot)$ is a $q$-dimensional Euclidean distance. One way to define $t_i = \varphi(x_i)$ is $t_i = G^{-1/2}z_i$, where $z_i$ is obtained from (11) and $G$ is the Fisher information matrix. Generally, if the $\delta_{ij}$ are given and $\delta(\cdot,\cdot)$ is a Euclidean distance, $t_1,\cdots,t_n$ can be obtained by metric scaling. Assume further that $\mu_i = E_{\pi_i}[\varphi(x)]$ exists and $E_{\pi_i}\{\|t - \mu_i\|^2\} < \infty$, where $E_{\pi_i}$ means expectation with respect to $p_i(x)$.

Theorem 3

Let $x_0$ be an observation to be allocated, $t_0 = \varphi(x_0)$ and $f_i(x_0)$ given by (9). Then

$$f_i(x_0) = \|t_0 - \mu_i\|^2. \qquad (18)$$

Proof. Let $x_1,\cdots,x_N$ be iid random vectors from $\pi_i$ and let $t_i = \varphi(x_i)$, $i = 1,\cdots,N$. Then

$$\delta^2(x_r,x_s) = (t_r - t_s)'(t_r - t_s)$$

and, after some algebra, we find

$$\frac{1}{N^2}\sum_{r,s}\delta^2(x_r,x_s) = \frac{2}{N}\sum_r \|t_r - \bar t\|^2,$$

where $\bar t = (\sum_r t_r)/N$. From the law of large numbers it follows that, as $N \to \infty$, the above expression tends to

$$E_{\pi_i}\{\delta^2(x,y)\} = 2E_{\pi_i}\{\|t - \mu_i\|^2\}.$$

Similarly

$$\frac{1}{N}\sum_r \delta^2(x_0,x_r) = \|t_0 - \bar t\|^2 + \frac{1}{N}\sum_r \|t_r - \bar t\|^2,$$

and, as $N \to \infty$, this expression tends to

$$E_{\pi_i}\{\delta^2(x_0,x)\} = \|t_0 - \mu_i\|^2 + E_{\pi_i}\{\|t - \mu_i\|^2\},$$

and (18) follows. $\square$

For distance (15) an alternative proof of Theorem 3 is given. Using a suitable notation:

$$\delta^2(x_0,x) = (z_0 - z)'G^{-1}(z_0 - z), \qquad E(z'G^{-1}z) = E\left(\mathrm{tr}\, G^{-1}zz'\right) = \mathrm{tr}\left(G^{-1}G\right) = n.$$

Hence, because $E(z) = 0$,

$$E\left\{\delta^2(x_0,x)\right\} = z_0'G^{-1}z_0 + n.$$

Moreover, since $\delta^2(x,y) = (z_x - z_y)'G^{-1}(z_x - z_y)$,

$$E\left\{\delta^2(x,y)\right\} = E(z_x'G^{-1}z_x) + E(z_y'G^{-1}z_y) - 2E(z_x'G^{-1}z_y) = 2n.$$

We conclude that the discriminant function is given by

$$f(x_0) = z_0'G^{-1}z_0,$$

the square distance between $z_0 = \frac{\partial}{\partial\theta}\log p(x_0,\theta)$ and $E(z) = 0$.

4. THE BAYESIAN APPROACH

The Bayes discriminant rule (BR) allocates $x_0$ to the population for which $q_ip_i(x)$ is greatest, where $q_1$ and $q_2$ are the prior probabilities of drawing an observation from $\pi_1$ and $\pi_2$ respectively. BR leads to ML when $q_1 = q_2 = 1/2$.

If $\pi_i$ is $N(\mu_i,\Sigma)$, $i = 1,2$, ML is equivalent to the minimum distance rule provided that the Mahalanobis distance is adopted. For mixed data and using the location model LM, Krzanowski (1986), using a distance based on Matusita affinity, proves that LM is also equivalent to a minimum distance rule. However, neither the LM approach nor the Matusita approach takes the prior probabilities into account. In contrast, this is possible in the DB method.

4.1 Incorporating prior probabilities

Suppose that $\omega_0$ is known to belong to $\pi_i$ with probability $q_i$, $i = 1,2$. Let us consider discriminant function (9) with distance (13). Given $\omega \in \pi_1$ we find

$$\delta^2(\omega_0,\omega) = 0 \ \text{ if } \omega_0 \in \pi_1, \qquad \delta^2(\omega_0,\omega) = q_1^{-1} + q_2^{-1} \ \text{ if } \omega_0 \in \pi_2.$$

Hence

$$H_{10} = q_1\cdot 0 + q_2\left(q_1^{-1} + q_2^{-1}\right) = q_1^{-1}.$$

Moreover $H_1 = 0$, so we obtain the function $f_1(\omega_0) = q_1^{-1}$; similarly $f_2(\omega_0) = q_2^{-1}$. However, rule (10) remains invariant if we add the same constant to $f_1$ and $f_2$. Therefore, in order to construct a function that is a square distance, let us introduce the prior discriminant function

$$f_i^0(\omega_0) = q_i^{-1} - 1, \qquad i = 1,2.$$

Note that $f_i^0(\omega_0)$ is consistent with the discriminant function (17). Thus rule (10) allocates $\omega_0$ to the population for which the prior probability is greater.

Suppose now that a vector observation $X$ is known, with density $p_i(x)$. Taking into account property (16), let us introduce the posterior discriminant functions

$$f_i^B(x_0) = f_i(x_0) + q_i^{-1} - 1, \qquad i = 1,2, \qquad (19)$$

where $H_{i0}$, $H_i$ entering $f_i$ are defined in (9). A decision rule for allocating $x_0$ is:

allocate $x_0$ to $\pi_i$ if $f_i^B(x_0) = \min\{f_1^B(x_0), f_2^B(x_0)\}. \qquad (20)$

Discriminant function (19) is constructed by using distance (15), but it also applies for any other distance $\delta(\cdot,\cdot)$.

However, note that rule (10) is invariant after multiplying $\delta(\cdot,\cdot)$ by a positive constant, but the decision taken from rule (20) could be affected. In order to avoid this arbitrariness, it is necessary to impose a standardizing condition similar to (14) on the quadratic form related to $\delta(\cdot,\cdot)$.

4.2 Multivariate normal

The DB rule based on (18) and the BR are closely related. Suppose that $\pi_i$ is $N(\mu_i,\Sigma)$, $i = 1,2$. Then ML is based on the LDF

$$V(x) = (\mu_1 - \mu_2)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2),$$

BR is based on

$$B(x) = V(x) + \log\left(\theta/(1-\theta)\right),$$

and DB is based on

$$D(x) = V(x) + (\theta - \tfrac{1}{2})/[\theta(1-\theta)],$$

where $\theta$ and $1-\theta$ are the prior probabilities of drawing an observation from $\pi_1$ and $\pi_2$ respectively.

For $\theta = \frac{1}{2}$ we obtain $V(x) = B(x) = D(x)$; otherwise the functions $B(x)$ and $D(x)$ are different. However, the difference between $\log(\theta/(1-\theta))$ and $(\theta - \frac{1}{2})/[\theta(1-\theta)]$ is not too marked in the interval $(0.2, 0.8)$. See Fig. 1.

Figure 1. Graphical comparison between the distance based rule term $(\theta - 1/2)/[\theta(1-\theta)]$ (continuous curve) and the Bayes rule term $\log(\theta/(1-\theta))$ (broken curve), for $0 < \theta < 1$, in the multivariate normal case. $\theta$ and $1-\theta$ are the prior probabilities.
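The two additive prior terms compared in Fig. 1 are easy to tabulate (a sketch, with our own function names). Both vanish at $\theta = 1/2$ and always share the same sign, which is why the two rules rarely disagree.

```python
import math

def bayes_term(theta):
    # Term added by the Bayes rule: the log prior odds.
    return math.log(theta / (1 - theta))

def db_term(theta):
    # Term added by the DB rule: (theta - 1/2)/[theta(1-theta)].
    return (theta - 0.5) / (theta * (1 - theta))
```

For example, at $\theta = 0.8$ the Bayes term is $\log 4 \approx 1.386$ while the DB term is $1.875$; the signs agree for every $\theta \in (0,1)$.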

4.3 Multinomial distribution

Suppose that a random event $A$ has the conditional probabilities $P(A|\pi_i) = p_i$, $i = 1,2$. If $\omega_0 \in A$, the Bayes decision rule is: allocate $\omega_0$ to $\pi_1$ if $p_1\theta > p_2(1-\theta)$, otherwise to $\pi_2$. Therefore BR is based on

$$b = \log p_1 - \log p_2 + \log\left(\theta/(1-\theta)\right)$$

and, from (17) and (19), DB is based on

$$d = p_2^{-1} - p_1^{-1} + (1-\theta)^{-1} - \theta^{-1}.$$

For $\theta = \frac{1}{2}$, $b$ and $d$ are equivalent; otherwise the decision could be different. To study this difference, let us consider the Borel set $K \subset (0,1)^3$,

$$K = \{(p_1,p_2,\theta)\,|\,b\cdot d > 0\}.$$

We take the same decision as long as $(p_1,p_2,\theta) \in K$. After some tedious computations, the Lebesgue measure of $K$ is

$$\mu(K) = 0.971,$$

which is very close to $1$, the Lebesgue measure of $(0,1)^3$. Thereby, the decision taken after using either BR or DB is almost the same. This is quite interesting because, although the two decision rules are based on different criteria, their results coincide in practice.
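The value $\mu(K) = 0.971$ can be checked by Monte Carlo integration over the unit cube. The sketch below is ours, with $b$ as above and $d = p_2^{-1} - p_1^{-1} + (1-\theta)^{-1} - \theta^{-1}$ reconstructed from (17) and (19).

```python
import math
import random

def same_decision_fraction(n_samples, seed=1):
    # Estimate the Lebesgue measure of K = {(p1, p2, theta) : b*d > 0}
    # by uniform sampling of the unit cube (0,1)^3.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        p1, p2, theta = rng.random(), rng.random(), rng.random()
        b = math.log(p1) - math.log(p2) + math.log(theta / (1 - theta))
        d = 1 / p2 - 1 / p1 + 1 / (1 - theta) - 1 / theta
        if b * d > 0:
            hits += 1
    return hits / n_samples
```

With $10^5$ samples the estimate should be close to the reported value 0.971.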

5. CLASSIFICATION INTO SEVERAL POPULATIONS

Suppose that we have $k$ mutually exclusive populations $\pi_1,\cdots,\pi_k$. The theory developed is easily extended to $k > 2$. Let $C_i$ be a sample from $\pi_i$ and $D_i$ a distance matrix. If $\omega_0$ is an individual to be allocated and $\delta_j(i)$ is the distance from $\omega_0$ to $j \in C_i$, the discriminant function $f_i(\omega)$ is defined as in (1), and the decision rule for allocating $\omega$ is:

allocate $\omega$ to $\pi_i$ if $f_i(\omega) = \min\{f_1(\omega),\cdots,f_k(\omega)\}. \qquad (21)$

It is obvious that Theorem 1 also applies here, so (21) is a minimum-distance discriminant rule. In addition, the error rate computations are easily obtained.

When the distributions are known (up to parameters), extensions of (9) and (10), as well as Theorem 3, do not present further difficulties.

Finally, if prior probabilities $q_1,\cdots,q_k$ are known, we may use the posterior discriminant functions

$$f_i^B(x_0) = f_i(x_0) + q_i^{-1} - 1, \qquad i = 1,\cdots,k.$$

6. SOME EXAMPLES

Three real data examples are used to compare the DB approach with the LDF, QDF, EDF and LM approaches.

Data set 1 is the "advanced breast cancer data" used by Krzanowski (1975, 1986, 1987) to illustrate the LM. In this data set, of 186 cases of ablative surgery for advanced breast cancer, 99 were classified as "successful or intermediate" ($\pi_1$) and 87 as "failure" ($\pi_2$). Gower's distance (section 2.3) is used on the basis of 6 continuous variables and 3 binary variables for the DB method, but we take ranks on the continuous variables instead of numerical values, as the range of most variables was too large.

Data set 2 is the well-known Fisher's Iris data (Fisher, 1936), which consists of $n_1 = n_2 = n_3 = 50$ observations on three species of iris, I. setosa ($\pi_1$), I. versicolor ($\pi_2$) and I. virginica ($\pi_3$), with the discrimination problem on the basis of 4 continuous variables. The DB method uses the Euclidean distance (which yields the EDF) and distance (8).

Data set 3, taken from Mardia et al. (1979, p. 328), is concerned with the problem of discriminating between the species Chaetocnema concinna ($\pi_1$) and Ch. heikertingeri ($\pi_2$) on the basis of 2 continuous variables and samples of sizes $n_1 = n_2 = 18$. Gower's distance (version for continuous variables) is also used.

The leaving-one-out method was used, as described in section 2.2, for computing the error rates. Table 1 summarizes the results obtained.

Table 1

Numbers of misclassifications obtained with the leaving-one-out method.

1. Cancer data ($\pi_1$, $\pi_2$, total):
   LM   34  27  61
   LDF  41  31  72
   EDF  45  43  88
   DB   33  32  65

2. Iris data ($\pi_1$, $\pi_2$, $\pi_3$, total):
   LDF  0  2  1   3
   EDF  1  5  7  13
   DB   0  2  3   5

3. Chaetocnema data ($\pi_1$, $\pi_2$, total):
   LDF  0  1   1
   QDF  1  1   2
   EDF  2  3   5
   DB   0  0   0

7. SUMMARY AND CONCLUSIONS

A distance based approach (DB), which uses a statistical distance between observations, is described and its theoretical properties are presented. This method reduces to linear discrimination (LDF), or even quadratic discrimination (QDF), when the sample Mahalanobis distance is used. Moreover, the DB approach yields a general discriminant function which reduces to the Euclidean discriminant function (EDF) when the sample Euclidean distance is used.

This method offers a rather simple algebraic way to consider the discrimination problem with mixed variables as well as the missing values case, provided that a suitable distance is adopted. No restriction seems to be necessary on the number of binary variables, so the DB method may be an alternative to discrimination based on the location model (LM).

The leaving-one-out method for computing the probability of misclassification can be applied following an elementary matrix computation in this approach.

Finally, unlike other methods based on distances, the prior probabilities can be taken into account in the DB method, with results quite similar to those given by the Bayes decision rule.

8. REFERENCES

Anderson, T.W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.

Anderson, J.A. (1972). Separate sample logistic discrimination. Biometrika, 59, 19-35.

Burbea, J. & Rao, C.R. (1982). Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multiv. Anal., 12, 575-596.

Cuadras, C.M. (1988). Distancias Estadísticas. Estadística Española, 30 (119), 295-378.

Cuadras, C.M. (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Statistical Data Analysis and Inference (Y. Dodge, ed.). North-Holland Pub. Co., 459-473.

Cuadras, C.M. (1990). Multivariate distributions with given multivariate marginals and given dependence structure. Math. Preprint Series No. 80, Univ. of Barcelona.

Cuadras, C.M. & Arenas, A. (1990). A distance based regression model for prediction with mixed data. Commun. Statist.-Theory Meth., 19 (6), 2261-2279.

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen., 7, 179-188.

Gower, J.C. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55, 582-585.

Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-871.

Huster, W.J., Brookmeyer, R. & Self, S.G. (1989). Modelling paired survival data with covariates. Biometrics, 45, 145-156.

Kent, J.T. (1982). Robust properties of likelihood ratio tests. Biometrika, 69, 19-27.

Knoke, J.D. (1982). Discriminant analysis with discrete and continuous variables. Biometrics, 38, 191-200.

Krzanowski, W.J. (1975). Discrimination and classification using both binary and continuous variables. J. Am. Stat. Assoc., 70, 782-790.

Krzanowski, W.J. (1977). The performance of Fisher's linear discriminant function under non-optimal conditions. Technometrics, 19, 191-200.

Krzanowski, W.J. (1986). Multiple discriminant analysis in the presence of mixed continuous and categorical data. Comp. & Maths. with Appls., 12A, 179-185.

Krzanowski, W.J. (1987). A comparison between two distance-based discriminant principles. J. Classification, 4, 73-84.

Lachenbruch, P.A. (1975). Discriminant Analysis. Hafner, New York.

Lachenbruch, P.A. & Goldstein, M. (1979). Discriminant analysis. Biometrics, 35, 69-85.

Lingoes, J.C. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36, 195-203.

Marco, V.R., Young, D.M. & Turner, D.W. (1987). The Euclidean distance classifier: an alternative to the linear discriminant function. Commun. Statist.-Simula., 16 (2), 485-505.

Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.

Matusita, K. (1956). Decision rule, based on the distance, for the classification problem. Ann. Inst. Stat. Math., 8, 67-77.

Oller, J.M. (1989). Some geometrical aspects of data analysis and statistics. In: Statistical Data Analysis and Inference (Y. Dodge, ed.). North-Holland Pub. Co., 41-58.

Oller, J.M. & Cuadras, C.M. (1985). Rao's distance for negative multinomial distributions. Sankhya, 47, 75-83.

Rao, C.R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37, 81-91.

Raveh, A. (1989). A nonmetric approach to linear discriminant analysis. J. Am. Stat. Assoc., 84, 176-183.

Royall, R.M. (1986). Model robust confidence intervals using maximum likelihood estimators. International Statistical Review, 54, 221-226.

Schmitz, P.I.M., Habbema, J.D.F., Hermans, J. & Raatgever, J.W. (1983). Comparative performance of four discriminant analysis methods for mixtures of continuous and discrete variables. Commun. Statist.-Simula. Computa., 12 (6), 727-751.

Seber, G.A.F. (1984). Multivariate Observations. J. Wiley, New York.

Vlachonikolis, I.G. & Marriott, F.H.C. (1982). Discrimination with mixed binary and continuous data. Appl. Statist., 31, 23-31.

Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups. Ann. Math. Statist., 15, 145-163.

REMARK

A computer program for PC to perform a distance based