UNIVERSITAT DE BARCELONA
A DISTANCE BASED APPROACH TO DISCRIMINANT
ANALYSIS AND ITS PROPERTIES
by
C. M. Cuadras
AMS Subject Classification: 62H30
Mathematics Preprint Series No. 90
A DISTANCE BASED APPROACH TO DISCRIMINANT
ANALYSIS AND ITS PROPERTIES
C.M. Cuadras
Departament d'Estadística, Universitat de Barcelona
Diagonal, 645
08028 Barcelona, Spain
ABSTRACT
A general distance based method for allocating an observation to one of several known populations, on the basis of both continuous and discrete explanatory variables, is proposed and studied. This method depends on a given statistical distance between observations, and leads to some classic discriminant rules by taking suitable distances. The error rates can be easily computed and, unlike other distance classification rules, the prior probabilities can be taken into account.
Key words and phrases: Linear and quadratic discriminant functions; Location model; Statistical distances; Given marginals.
AMS subject classification: 62H30
1. INTRODUCTION
The problem of allocating an individual to one of two populations was proposed by Fisher (1936), who obtained the linear discriminant function (LDF). Optimality of LDF under the assumption of multivariate normality, with a common covariance matrix, is a well-known property. LDF gives a good performance even when the above assumptions are violated, and several studies have been carried out on LDF as well as on alternative methods (Krzanowski, 1977, Seber, 1984, Raveh, 1989). The discrimination problem leads to a quadratic discriminant function (QDF) when the covariance matrices are not equal.
LDF is essentially based on continuous variables (Wald, 1944, Anderson, 1958). However, most problems in applied statistics fall in the mixed case, i.e., the variables are both continuous and discrete. An interesting approach, based on the location model (LM), has been given by Krzanowski (1975, 1986, 1987). LM is applicable when the data contain both continuous and binary variables. This method computes an LDF for each pattern of the binary variables, is optimal under the assumption of conditional normality, and can be extended to the multistate case for discrete variables (Lachenbruch and Goldstein, 1979). However, the LM approach needs a considerable computational effort and has not been implemented in the standard packages (Knoke, 1982). Moreover, LM cannot be used when there are not enough data for many locations (Vlachonikolis and Marriott, 1982).
Logistic regression (LR) is another interesting approach, which includes quite a wide class of models and also allows discrimination with mixed variables (Anderson, 1972). LR also requires a large amount of computing.
LDF, QDF, LM and LR are all based on the ratio of probability density functions, the maximum likelihood rule (ML). The ML rule is a particular case of the Bayes discriminant rule (BR), the rule obtained after supposing that the populations have prior probabilities. BR provides a general approach to discriminant analysis which has certain optimal properties from a theoretical point of view.

Another general approach, first studied by Matusita (1956), is based on the concept of distance. Let d(ω, π_i) be a distance between ω and π_i, i = 1, 2, where ω is an individual to be allocated. The allocation rule is: allocate ω to the nearest population π_i.
LDF is closely related to the Mahalanobis distance, and both approaches, ML and distance, are equivalent for multivariate normal data (Mardia et al., 1979). If we substitute the covariance matrix by the identity matrix, we obtain the Euclidean distance classifier (EDF), which is an alternative to LDF (Marco et al., 1987) and has advantages when the number of variables is large relative to the training sample size. EDF is a distance rule based on the Euclidean distance. The distance rule is also equivalent to BR (including the ML rule) for the LM model, as is proved by Krzanowski (1986, 1987).
A distance-based approach (DB) for regression and classification with mixed variables was proposed by Cuadras (1989) and Cuadras and Arenas
(1990).
Although considerable research has been done on discrimination, and the state of the art of allocation has reached such a high level that it seems difficult to furnish original ideas, this paper has the aim of examining further aspects and applications of discrimination by the above mentioned DB approach.
Given every couple of individuals ω, ω′, the DB method uses a distance d(ω, ω′) to compute discriminant functions. The assumption underlying this method is the following: in some circumstances (mixed data, for example), it is easier to obtain a distance d(·,·) than a probability density function p(·).
2. A DISTANCE-BASED CLASSIFICATION RULE IN TWO
POPULATIONS OBTAINED FROM SAMPLE DATA
Let π_1, π_2 be two populations and suppose that ω is an individual to be allocated. Suppose that a distance function δ(·,·) is defined on the basis of several (mixed) variables. The distance-based rule, or DB rule, based on (1) below, was introduced and some of its properties studied by Cuadras (1989).

2.1 Training sets
Suppose that samples C_1 and C_2 of sizes n_1 and n_2 are available from π_1 and π_2, respectively. Suppose that an n_1 × n_1 intradistance matrix D_1 = (δ_ij(1)) can be computed on C_1 by using the distance δ(·,·), and let δ_i(1), i = 1, ..., n_1, be the distances from ω to the elements of C_1. D_2 = (δ_ij(2)) and δ_i(2), i = 1, ..., n_2, are analogously introduced. Let us define the discriminant functions as follows:

$$f_k(\omega) = \frac{1}{n_k}\sum_{i=1}^{n_k}\delta_i^2(k) - \frac{1}{2n_k^2}\sum_{i,j=1}^{n_k}\delta_{ij}^2(k), \qquad k = 1, 2. \qquad (1)$$
A decision rule for allocation of ω is:

allocate ω to π_i if f_i(ω) = min{f_1(ω), f_2(ω)}.   (2)

This rule leads to a minimum-distance classification rule, as a consequence of theorem 1.
Let us denote C = {1, 2, ..., n}. D = (δ_ij) is an n × n distance matrix, (n + 1) is a new individual, δ_1, ..., δ_n are the distances from (n + 1) to each element of C, and f(n + 1) is defined according to (1). Then (n + 1) and C can be represented by points P, P_1, ..., P_n ∈ R^r × gR^s, where g = √(−1) and r + s = n − 1. The coordinates are given by

$$P = (x', gy', z), \qquad P_i = (x_i', gy_i', 0), \quad i = 1, \ldots, n,$$

and it is satisfied that

$$\delta_{ij}^2 = d^2(P_i, P_j) = \|x_i - x_j\|^2 - \|y_i - y_j\|^2,$$

where ||·|| means the Euclidean norm. An explicit construction of P, P_1, ..., P_n is given in the following:
Theorem 1

Set P̄ = (x̄', ȳ', 0), where x̄ = (Σ_i x_i)/n, ȳ = (Σ_i y_i)/n. Then

$$f(\omega) = d^2(P, \bar P).$$

In particular, if D is a Euclidean distance matrix and (n + 1), C are represented by x, x_1, ..., x_n ∈ R^p, then

$$f(\omega) = d^2(x, \bar x) = \|x - \bar x\|^2.$$
Proof. Let B = (b_ij) = HAH, where H is the centring matrix and A = (a_ij) = (−½ δ_ij²). If B ≥ 0, i.e. D is a Euclidean distance matrix, let Λ_x be the diagonal matrix of positive eigenvalues of B and X the matrix of corresponding eigenvectors. The rows x_1', ..., x_n' of X verify δ_ij² = ||x_i − x_j||² and constitute the classical solution of MDS (Mardia et al., 1979).
In general (D need not be Euclidean), we must also consider a diagonal matrix Λ_y of negative eigenvalues of B, whose corresponding eigenvectors are the columns of the matrix Y; the y_i' are the rows of Y (Lingoes, 1971). Each element i (say) of C can be represented by P_i = (x_i', gy_i', 0) and (n + 1) by P = (x', gy', z). Note that, as 1 is an eigenvector of B of eigenvalue 0, then X'1 = 0 and Y'1 = 0, so x̄ = (Σ x_i)/n = 0, ȳ = (Σ y_i)/n = 0. Gower (1968) proves that if B ≥ 0, the coordinates of (n + 1) are given by

$$x = \tfrac12 \Lambda_x^{-1} X'(b - d), \qquad z^2 = \frac{1}{n}\sum_i \delta_i^2 - \frac{1}{n}\,\mathrm{tr}\,B - x'x,$$

where b = (b_11, ..., b_nn)' and d = (δ_1², ..., δ_n²)'. If D is a non-Euclidean distance matrix, it is easily proved that

$$y = \tfrac12 \Lambda_y^{-1} Y'(b - d), \qquad z^2 = \frac{1}{n}\sum_i \delta_i^2 - \frac{1}{n}\,\mathrm{tr}\,B - x'x + y'y.$$

On the other hand

$$\sum_{i,j} \delta_{ij}^2 = 2n\,\mathrm{tr}\,B,$$

hence

$$f(n + 1) = z^2 + (x'x - y'y).$$
Since x̄ = 0 and ȳ = 0, all coordinates of P̄ are null. Hence the distance between P̄ and P = (x', gy', z) is f(n + 1). □

Now it is clear that decision rule (2) is equivalent to: allocate ω to π_1 if

$$d^2(P, \bar P) < d^2(Q, \bar Q),$$

otherwise to π_2, where P, P̄, Q and Q̄ are suitably constructed. Note that this construction is not needed to obtain f_1(ω) and f_2(ω).
Decision rule (2) leads to standard discriminant functions for continuous data and usual distances. Let x_ki, i = 1, ..., n_k, k = 1, 2, be the training samples and x the observation to be allocated. Then (2) is based on the LDF

$$L(x) = \left[x - \tfrac12(\bar x_1 + \bar x_2)\right]' S^{-1} (\bar x_1 - \bar x_2),$$

where S is the pooled sample covariance matrix, provided that the square distance is the Mahalanobis distance

$$(x_{ki} - x_{kj})' S^{-1} (x_{ki} - x_{kj}). \qquad (3)$$

If the covariance matrices need not be equal and we replace S by S_k in (3), then rule (2) is based on a QDF. Finally, if we choose the Euclidean distance, this decision rule is based on the EDF

$$E(x) = \left[x - \tfrac12(\bar x_1 + \bar x_2)\right]' (\bar x_1 - \bar x_2).$$
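As a quick numerical illustration of this equivalence (simulated data, an assumption of this sketch): with squared Euclidean distances, theorem 1 gives f_k(ω) = ||x − x̄_k||², so rule (2) must agree in sign with E(x):

```python
import numpy as np
# Assumes db_discriminant / db_allocate from the sketch in section 2.1.

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (20, 4))   # training sample from pi_1
X2 = rng.normal(1.0, 1.0, (20, 4))   # training sample from pi_2
x = rng.normal(0.5, 1.0, 4)          # observation to be allocated

def sq_euclid(A, B):
    """Squared Euclidean distances between the rows of A and of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

k = db_allocate(sq_euclid(x[None, :], X1)[0], sq_euclid(X1, X1),
                sq_euclid(x[None, :], X2)[0], sq_euclid(X2, X2))
E = (x - 0.5 * (X1.mean(0) + X2.mean(0))) @ (X1.mean(0) - X2.mean(0))
print(k, E)   # k == 1 exactly when E(x) > 0, since f1 - f2 = -2 E(x)
```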
2.2 Error rates
The leaving-one-out method to estimate the probability of misclassification (Lachenbruch, 1975) can be applied with a simple additional computation. Let us denote A = (a_ij), B = (b_ij) and C = (c_ij), where a_ij = δ_ij²(1), b_ij = δ_ij²(2) and c_ij = δ²(i, j), with i ∈ C_1 and j ∈ C_2. A, B and C are n_1 × n_1, n_2 × n_2 and n_1 × n_2 matrices respectively. If

$$a = \sum_{i<j} a_{ij}, \qquad \bar a_i = \sum_{r=1}^{n_1} a_{ir}, \qquad \bar c_{i\cdot} = \sum_{r=1}^{n_2} c_{ir},$$

and b, b̄_j and c̄_·j are similarly defined, to allocate i ∈ C_1 we use the discriminant functions

$$f_1(i) = (n_1 - 1)^{-1}\,\bar a_i - (n_1 - 1)^{-2}\,(a - \bar a_i)$$

and

$$f_2(i) = n_2^{-1}\,\bar c_{i\cdot} - n_2^{-2}\,b.$$

Hence

$$m_1 = \mathrm{freq}_{i \in C_1}\left[f_1(i) - f_2(i) > 0\right] \qquad (4)$$

is the frequency of individuals misclassified in C_1. The frequency m_2 of misclassification in C_2 is analogously obtained as a function of b, b̄_j and c̄_·j. If the n = n_1 + n_2 observations are random samples from π_1 ∪ π_2, the error rate is estimated by

$$\hat e = \frac{m_1 + m_2}{n_1 + n_2}. \qquad (5)$$

Thus ê is directly computed from A, B and C.
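A sketch of this computation (Python/NumPy; the helper name is illustrative), taking the squared-distance matrices A, B and C and returning ê of (5):

```python
import numpy as np

def loo_error_rate(A, B, C):
    """Leave-one-out error rate (5). A (n1 x n1) and B (n2 x n2) hold the
    squared intradistances within C1 and C2; C (n1 x n2) holds the squared
    distances between C1 and C2."""
    n1, n2 = C.shape
    a, b = A.sum() / 2.0, B.sum() / 2.0     # sums over pairs i < j
    a_i = A.sum(axis=1)                     # row sums: bar a_i
    b_j = B.sum(axis=1)                     # row sums: bar b_j
    # (4): i in C1 is misclassified when f1(i) - f2(i) > 0
    f1 = a_i / (n1 - 1) - (a - a_i) / (n1 - 1) ** 2
    f2 = C.sum(axis=1) / n2 - b / n2 ** 2
    m1 = int(np.sum(f1 - f2 > 0))
    # the analogous functions for j in C2
    g2 = b_j / (n2 - 1) - (b - b_j) / (n2 - 1) ** 2
    g1 = C.sum(axis=0) / n1 - a / n1 ** 2
    m2 = int(np.sum(g2 - g1 > 0))
    return (m1 + m2) / (n1 + n2)
```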
Now suppose that C_1 and C_2 are nonoverlapping classes, that is,

$$d^2(x_i, \bar x_{-i}) < d^2(x_i^*, \bar y) \quad \forall i \in C_1, \qquad (6)$$

$$d^2(y_j, \bar y_{-j}) < d^2(y_j^*, \bar x) \quad \forall j \in C_2, \qquad (7)$$

where x_i, x_i^* are the Euclidean representations of i as a member of C_1, C_2, respectively, and x̄_{-i} is the mean on C_1 after leaving out individual i. Similarly we have y_j, y_j^* and ȳ_{-j}. It is easily proved that

$$f_1(i) = d^2(x_i, \bar x_{-i}), \qquad f_2(i) = d^2(x_i^*, \bar y).$$

Thus, if both (6) and (7) apply, m_1 = m_2 = 0 and a perfect classification for nonoverlapping classes is obtained.
2.3 Distance between samples
Suppose that the observable variables are both quantitative and qualitative (nominal, ordinal, dichotomous) and that the distribution of the variables is unknown. Then we may use any distance function chosen among the wide repertory available. One candidate is the square distance d_ij² = 1 − s_ij, where s_ij is the all-purpose measure of similarity proposed by Gower (1971) and well described by Seber (1984). This distance provides a Euclidean distance matrix and gives good results in regression with mixed data, see Cuadras and Arenas (1990). When some values are missing, d_ij² can also be obtained, but it could be a non-Euclidean distance. However, rule (2) still applies. Therefore, this DB method can be used for handling missing values in discriminant analysis.
For continuous variables, letting (x_i1, ..., x_ip), (x_j1, ..., x_jp) be two observations for individuals i and j, we also use the square distance

$$d_{ij}^2 = |x_{i1} - x_{j1}| + \cdots + |x_{ip} - x_{jp}|, \qquad (8)$$

which will be confronted with the Mahalanobis and Euclidean distances in the examples given in section 6.
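In code, (8) is simply (a one-line sketch; note that the sum of absolute differences plays the role of the squared distance):

```python
def cityblock_sq_dist(x, y):
    """Square distance (8): d_ij^2 = |x_i1 - x_j1| + ... + |x_ip - x_jp|."""
    return sum(abs(a - b) for a, b in zip(x, y))
```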
3. CLASSIFICATION WHEN THE DISTRIBUTIONS ARE
KNOWN
Suppose that the observable variables are related to a random vector with a probability density p_i(x) with respect to a suitable measure λ, if x comes from π_i, i = 1, 2.

Let x_0 be an observation to be allocated and δ(·,·) a distance function. The discriminant function which generalizes (1) is given by

$$f_i(x_0) = \int \delta^2(x_0, x)\,p_i(x)\,d\lambda(x) - \tfrac12 \iint \delta^2(x, y)\,p_i(x)\,p_i(y)\,d\lambda(x)\,d\lambda(y) = H_{i0} - \tfrac12 H_i. \qquad (9)$$

H_i0 is the expectation of the random variable δ²(x_0, x) on π_i. We assume x, y to be independent to obtain the expectation of δ²(x, y) on π_i × π_i. The decision rule for allocating x_0 is:

allocate x_0 to π_i if f_i(x_0) = min{f_1(x_0), f_2(x_0)}.   (10)
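When the integrals in (9) are intractable, H_i0 and H_i are plain expectations and can be approximated by Monte Carlo. A sketch under the assumption that a sampler for p_i and a squared-distance function are available:

```python
import numpy as np

def f_i_mc(x0, sampler, delta2, n=10_000, seed=0):
    """Monte Carlo approximation of (9): H_i0 - H_i / 2.
    sampler(n, rng) must return n draws from p_i and delta2(x, y) the
    squared distance; both are assumptions of this illustration."""
    rng = np.random.default_rng(seed)
    X = sampler(n, rng)              # sample used for H_i0
    Y = sampler(n, rng)              # independent copy used for H_i
    H_i0 = np.mean([delta2(x0, x) for x in X])
    H_i = np.mean([delta2(x, y) for x, y in zip(X, Y)])
    return H_i0 - 0.5 * H_i
```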
Before presenting some properties of the DB rule (10), we introduce a distance between individuals based on the so-called Rao distance (Rao, 1945), studied later by Burbea and Rao (1982), Oller and Cuadras (1985) and others.
3.1 A distance between individuals
Suppose that the random vector X is related to a statistical model S = {p(x; θ)}, where θ is an n-dimensional vector parameter and p(x; θ) is a probability density. Assume the usual regularity conditions and consider the efficient score

$$Z_\theta = \frac{\partial}{\partial\theta} \log p(X; \theta). \qquad (11)$$

Then E(Z_θ) = 0 and G = E(Z_θ Z_θ') is the Fisher information matrix, where Z_θ is interpreted as a column vector.
Transform two observations x_1, x_2 to z_1, z_2 by using (11). The square distance between x_1, x_2 with respect to the statistical model S = {p(x; θ)} is given by

$$\delta^2(x_1, x_2) = (z_1 - z_2)' G^{-1} (z_1 - z_2). \qquad (12)$$

This distance was introduced by Cuadras (1988, 1989) and especially Oller (1989), who provides further theoretical justification.
If p(x; θ) is the N(μ, Σ) distribution (Σ being fixed), then Z_μ = Σ^{-1}(X − μ), G = Σ^{-1}, and (12) reduces to the familiar Mahalanobis distance. If X = (x_1, ..., x_m) has the multinomial distribution Π_k p_k^{x_k}, where x_k ∈ {0, 1}, k = 1, ..., m, x_1 falls in the cell r and x_2 falls in the cell s, we obtain the square distance

$$\delta^2(x_1, x_2) = (1 - \delta_{rs})\,(p_r^{-1} + p_s^{-1}), \qquad (13)$$

where δ_rs is the Kronecker delta.
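As a check, (13) can be verified numerically from (12) for a small multinomial, say m = 3 cells with parameter vector (p_1, p_2) and p_3 = 1 − p_1 − p_2 (an assumed toy example):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])

def score(cell):
    """z = (d/dp_1, d/dp_2) log p(x; p) for an observation in a cell."""
    x = np.eye(3)[cell]
    return x[:2] / p[:2] - x[2] / p[2]

G = np.diag(1.0 / p[:2]) + 1.0 / p[2]     # Fisher information matrix
r, s = 0, 1
u = score(r) - score(s)
print(u @ np.linalg.solve(G, u))          # (12): 8.333...
print(1 / p[r] + 1 / p[s])                # (13): the same value
```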
In general, distance (12) is related to the quadratic form Z_θ' G^{-1} Z_θ, where Z_θ is given by (11), which satisfies (Oller, 1989)

$$E(Z_\theta' G^{-1} Z_\theta) = E(\mathrm{tr}\, G^{-1} Z_\theta Z_\theta') = \mathrm{tr}(G^{-1} G) = n. \qquad (14)$$
Suppose next that the random variables are related to a random vector W = (X_1, X_2), where X_t is a random vector with density p_t(x_t, θ_t) with respect to a measure λ_t, t = 1, 2. Suppose that there is a density p(w, θ_1, θ_2) with respect to a suitable measure μ, with marginals p_t(x_t, θ_t), t = 1, 2. Letting

$$Z_{\theta_t} = \frac{\partial}{\partial\theta_t} \log p_t(x_t, \theta_t),$$

let us define G_{kj} = E(Z_{θ_k} Z'_{θ_j}) and the symmetric matrix

$$G = \begin{pmatrix} G_{11} & G_{12} \\ G_{12}' & G_{22} \end{pmatrix},$$

the expectation taken with respect to p(w, θ_1, θ_2). Since

$$G_{tt} = \iint Z_{\theta_t} Z_{\theta_t}'\, p(x_1, x_2, \theta_1, \theta_2)\, d\lambda_1(x_1)\, d\lambda_2(x_2) = \int Z_{\theta_t} Z_{\theta_t}'\, p_t(x_t, \theta_t)\, d\lambda_t(x_t),$$

it is clear that G_{tt} is the Fisher information matrix for p_t(x_t, θ_t), t = 1, 2. However, G is not a Fisher information matrix for p(w, θ_1, θ_2).
Assume that G_12 is a suitable fixed matrix and w_1 = (x_11, x_12), w_2 = (x_21, x_22) are two observations transformed to z_1 = (z_11, z_12), z_2 = (z_21, z_22) by means of (11), i.e.,

$$z_{ij} = \frac{\partial}{\partial\theta_j} \log p_j(x_{ij}, \theta_j).$$

We define the square distance between w_1 and w_2:

$$\delta^2(w_1, w_2) = (z_1 - z_2)' G^{-1} (z_1 - z_2). \qquad (15)$$

This distance can be used when w is a mixed random vector; for example, x_1 is of the continuous type and x_2 is of the discrete type.
When both X_1 and X_2 have absolutely continuous distributions with respect to the Lebesgue measure, the pdf p(w, θ_1, θ_2) exists.

Theorem 2

Suppose that (11) gives a transformation from x_t to z_θt which is one-to-one and satisfies the condition of the change of variables for multiple integrals. Then (under some restrictive conditions) there exists a pdf p(w, θ_1, θ_2) with marginals p_t(x_t, θ_t), t = 1, 2, such that G_12 = E(Z_{θ_1} Z'_{θ_2}).

This result can be obtained as a consequence of the problem of constructing probability distributions with given multivariate marginals and a given intercorrelation matrix. However, this construction is quite complicated. A solution is given by Cuadras (1990).
Although the density p(w, θ_1, θ_2) exists, the true function is unknown in practice. Nevertheless, distance (15) can be computed. Consequently, a DB classification rule is available for mixed variables if only the marginal distributions are known. This, again, is the main idea of this paper.
3.2 Parameter estimation with given marginals
Parameters θ_1 and θ_2 are unknown in applications. Suppose that w_i = (x_i1, x_i2), i = 1, ..., N, is a random sample from a population with pdf p(w, θ_1, θ_2). However, only the marginal pdfs p_t(x_t, θ_t), t = 1, 2, are known.

Let θ̂ = (θ̂_1, θ̂_2) be the ML estimation obtained considering the product pdf p_1(x_1, θ_1) · p_2(x_2, θ_2), i.e., by solving

$$\sum_{i=1}^N \sum_{t=1}^2 u_t(x_{it}, \theta_t) = 0,$$

where u_t(x_t, θ_t) = (∂/∂θ_t) log p_t(x_t, θ_t). The true value θ_0 is characterized by (Kent, 1982, Royall, 1986)

$$E\left[u_t(x_t, \theta_t)\right] = 0, \qquad t = 1, 2,$$

where the expectation is taken with respect to p(w, θ_1, θ_2). Since

$$E\left[u_t(x_t, \theta_t)\right] = \int u_t(x_t, \theta_t)\, p_t(x_t, \theta_t)\, d\lambda_t(x_t) = 0$$

for t = 1, and similarly for t = 2, θ̂ is a consistent estimate of θ_0 (see also Huster et al., 1989). Finally, a consistent estimation of G_kj is given by

$$\hat G_{kj} = \frac{1}{N} \sum_{i=1}^N z_{ik}\, z_{ij}',$$

where z_it = u_t(x_it, θ̂_t).
3.3 Some properties of the classification rule
The discriminant function (9) and the decision rule (10) satisfy certain properties.
1) H_i is the diversity coefficient of π_i and H_i0 is the average difference between π_0 and π_i, where π_0 = {x_0}. Since H_0 = 0, we find that

$$f_i(x_0) = H_{i0} - \tfrac12 (H_i + H_0)$$

is the Jensen difference between π_i and π_0.
2) Suppose that x_1 and x_2 are independent random vectors and the distances δ_1²(·,·), δ_2²(·,·) are defined by using x_1, x_2 respectively. According to Oller (1989), let us define the square distance between (x_11, x_12) and (x_21, x_22) by

$$\delta^2(x_{11}, x_{12}; x_{21}, x_{22}) = \delta_1^2(x_{11}, x_{21}) + \delta_2^2(x_{12}, x_{22}),$$

and let (x_01, x_02) be an observation to be allocated. Then, using an obvious notation, it is verified that

$$f_i(x_{01}, x_{02}) = f_{i1}(x_{01}) + f_{i2}(x_{02}). \qquad (16)$$
3) If X is N(μ_i, Σ) on π_i and the Mahalanobis distance is adopted, then

$$f_i(x_0) = (x_0 - \mu_i)' \Sigma^{-1} (x_0 - \mu_i)$$

and rule (10) leads to a minimum-distance rule. This result is even valid for nonnormal data, see theorem 3.
4) If X has the multinomial distribution Π_k p_ik^{x_k} when X comes from π_i, and if we use distance (13), then

$$f_i(x_0) = (1 - p_{ik})/p_{ik} \qquad (17)$$

if x_0 falls in the cell k. Thus, the classification rule is: allocate x_0 to π_i if p_ik = max{p_1k, p_2k}.

Let us now suppose that, given x_1, x_2 ∈ π_i, there exists φ : π_i → R^q, t_1 = φ(x_1), t_2 = φ(x_2), such that

$$\delta^2(x_1, x_2) = (t_1 - t_2)'(t_1 - t_2), \qquad t_1, t_2 \in R^q,$$

i.e., δ(·,·) is a q-dimensional Euclidean distance. One way to define t_i = φ(x_i) is

$$t_i = G^{-1/2} z_i,$$

where z_i is obtained from (11) and G is the Fisher information matrix. Generally, if x_1, ..., x_n are given and δ(·,·) is a Euclidean distance, t_1, ..., t_n can be obtained by metric scaling. Assume further that

$$\mu_i = E_{\pi_i}\left[\varphi(x)\right]$$

exists and E_{π_i}(||φ(x) − μ_i||²) < ∞, where E_{π_i} means expectation with respect to p_i(x).

Theorem 3

Let x_0 be an observation to be allocated, t_0 = φ(x_0) and f_i(x_0) given by (9). Then

$$f_i(x_0) = \|t_0 - \mu_i\|^2. \qquad (18)$$
Proof. Let x_1, ..., x_N be iid random vectors from π_i and let t_r = φ(x_r), r = 1, ..., N. Then

$$\delta^2(x_r, x_s) = (t_r - t_s)'(t_r - t_s)$$

and, after some algebra, we find

$$\frac{1}{2N^2} \sum_{r,s} \delta^2(x_r, x_s) = \frac{1}{N} \sum_r \|t_r - \bar t\|^2,$$

where t̄ = (Σ_r t_r)/N. From the law of large numbers it follows that as N → ∞ the above expression tends to

$$\tfrac12 E_{\pi_i}\{\delta^2(x, y)\} = E_{\pi_i}\{\|t - \mu_i\|^2\}.$$

Similarly

$$\frac{1}{N} \sum_r \delta^2(x_0, x_r) = \|t_0 - \bar t\|^2 + \frac{1}{N} \sum_r (t_r - \bar t)'(t_r - \bar t)$$

and, as N → ∞, this expression tends to

$$E_{\pi_i}\{\delta^2(x_0, x)\} = \|t_0 - \mu_i\|^2 + E_{\pi_i}\{\|t - \mu_i\|^2\}$$
and (18) follows. □

For distance (15) an alternative proof of theorem 3 can be given. Using a suitable notation,

$$\delta^2(x_0, x) = (z_0 - z)' G^{-1} (z_0 - z), \qquad E(z' G^{-1} z) = E(\mathrm{tr}\, G^{-1} z z') = \mathrm{tr}(G^{-1} G) = n.$$

Hence, because E(z) = 0,

$$E\left[\delta^2(x_0, x)\right] = z_0' G^{-1} z_0 + n.$$

Also

$$\delta^2(x, y) = (z_x - z_y)' G^{-1} (z_x - z_y),$$

$$E\left[\delta^2(x, y)\right] = E(z_x' G^{-1} z_x) + E(z_y' G^{-1} z_y) - 2 E(z_x' G^{-1} z_y) = 2n.$$

We conclude that the discriminant function is given by

$$f(x_0) = z_0' G^{-1} z_0,$$

the square distance between z_0 = (∂/∂θ) log p(x_0, θ) and E(z) = 0.
4. THE BAYESIAN APPROACH
The Bayes discriminant rule (BR) allocates x_0 to the population for which q_i p_i(x) is greatest, where q_1 and q_2 are the prior probabilities of drawing an observation from π_1 and π_2 respectively. BR leads to ML when q_1 = q_2 = 1/2.

If π_i is N(μ_i, Σ), i = 1, 2, ML is equivalent to the minimum distance rule provided that the Mahalanobis distance is adopted. For mixed data and using the location model LM, Krzanowski (1986), using a distance based on the Matusita affinity, proves that LM is also equivalent to a minimum distance rule. However, neither the LM approach nor the Matusita approach takes the prior probabilities into account. In contrast, this is possible in the DB method.
4.1 Incorporating prior probabilities
Suppose that ω_0 is known to belong to π_i with probability q_i, i = 1, 2. Let us consider discriminant function (9) with distance (13). Given ω ∈ π_1 we find

$$\delta^2(\omega_0, \omega) = 0 \ \text{ if } \omega_0 \in \pi_1, \qquad = q_1^{-1} + q_2^{-1} \ \text{ if } \omega_0 \in \pi_2.$$

Hence H_10 = q_1 · 0 + q_2 (q_1^{-1} + q_2^{-1}) = q_1^{-1}. Moreover H_1 = 0, so we obtain the function f_1(ω_0) = q_1^{-1}. Similarly f_2(ω_0) = q_2^{-1}. However, rule (10) remains invariant if we add the same constant to f_1 and f_2. Therefore, in order to construct a function that is a square distance, let us introduce the prior discriminant function

$$f_i^q(\omega_0) = q_i^{-1} - 1, \qquad i = 1, 2.$$

Note that f_i^q(ω_0) is consistent with the discriminant function (17).
Thus rule (10) allocates ω_0 to the population for which the prior probability is greater.
Suppose now that a vector observation X is known, with density p_i(x) on π_i. Taking into account property (16), let us introduce the posterior discriminant functions

$$f_i^B(x_0) = f_i(x_0) + q_i^{-1} - 1, \qquad i = 1, 2, \qquad (19)$$

where H_i0 and H_i are defined in (9). A decision rule for allocating x_0 is:

allocate x_0 to π_i if f_i^B(x_0) = min{f_1^B(x_0), f_2^B(x_0)}.   (20)

Discriminant function (19) is constructed by using distance (15), but it also applies for any other distance δ(·,·). However, note that rule (10) is invariant after multiplying δ(·,·) by a positive constant, whereas the decision taken from rule (20) could be affected. In order to avoid this arbitrariness, it is necessary to impose a standardizing condition similar to (14) on the quadratic form related to δ(·,·).
4.2 Multivariate normal
The DB rule based on (18)-(19) and the BR are closely related. Suppose that π_i is N(μ_i, Σ), i = 1, 2. Then ML is based on the LDF

$$V(x) = (\mu_1 - \mu_2)' \Sigma^{-1} x - \tfrac12 (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 + \mu_2).$$

BR is based on B(x) = V(x) + log(θ/(1 − θ)), and DB is based on

$$D(x) = V(x) + (\theta - \tfrac12)/\left[\theta(1 - \theta)\right],$$

where θ and 1 − θ are the prior probabilities of drawing an observation from π_1 and π_2 respectively.
For θ = ½ we obtain V(x) = B(x) = D(x); otherwise the functions B(x) and D(x) are different. However, the difference between log(θ/(1 − θ)) and (θ − ½)/[θ(1 − θ)] is not too marked in the interval (0.2, 0.8). See Fig. 1.
[Figure 1: curves of (θ − 1/2)/[θ(1 − θ)] and ln(θ/(1 − θ)) for 0 < θ < 1.]

Figure 1. Graphical comparison between the distance based rule (continuous curve) and the Bayes rule (discrete curve), for the multivariate normal distribution. θ and 1 − θ are the prior probabilities.
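The size of this discrepancy is easy to quantify numerically (a sketch):

```python
import numpy as np

t = np.linspace(0.2, 0.8, 601)
gap = np.abs(np.log(t / (1 - t)) - (t - 0.5) / (t * (1 - t)))
print(gap.max())   # largest gap between the two prior terms on (0.2, 0.8)
```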
4.3 Multinomial distribution
Suppose that a random event A has the conditional probabilities P(A|π_i) = p_i, i = 1, 2. If ω_0 ∈ A, the Bayes decision rule is: allocate ω_0 to π_1 if p_1 θ > p_2 (1 − θ), otherwise to π_2. Therefore BR is based on

$$b = \log p_1 - \log p_2 + \log(\theta/(1 - \theta)),$$

and DB is based on

$$d = \left(p_2^{-1} + (1 - \theta)^{-1}\right) - \left(p_1^{-1} + \theta^{-1}\right).$$

For θ = ½, b and d are equivalent; otherwise the decision could be different. To study this difference, let us consider the Borel set K ⊂ (0, 1)³,

$$K = \{(p_1, p_2, \theta)\,:\, b \cdot d > 0\}.$$

We take the same decision as long as (p_1, p_2, θ) ∈ K. After some tedious computations, the Lebesgue measure of K is

$$\mu(K) = 0.971,$$

which is very close to 1, the Lebesgue measure of (0, 1)³. Thereby, the decision taken after using either BR or DB is almost the same. This is quite interesting, because although the two decision rules are based on different criteria, their results coincide in practice.
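The reported value can be checked approximately by Monte Carlo integration over the unit cube (a sketch; b and d are taken as reconstructed above, so the estimate should come out near 0.971 if that reconstruction is right):

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, th = rng.uniform(size=(3, 1_000_000))
b = np.log(p1) - np.log(p2) + np.log(th / (1 - th))
d = (1 / p2 + 1 / (1 - th)) - (1 / p1 + 1 / th)
print(np.mean(b * d > 0))   # Monte Carlo estimate of mu(K)
```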
5. CLASSIFICATION INTO SEVERAL POPULATIONS
Suppose that we have k mutually exclusive populations π_1, ..., π_k. The theory developed is easily extended to k > 2. Let C_i be a sample from π_i and D_i a distance matrix. If ω_0 is an individual to be allocated and δ_j(i) is the distance from ω_0 to j ∈ C_i, the discriminant function f_i(ω) is defined as in (1), and the decision rule for allocating ω is:

allocate ω to π_i if f_i(ω) = min{f_1(ω), ..., f_k(ω)}.   (21)

It is obvious that theorem 1 also applies here, so (21) is a minimum-distance discriminant rule. In addition, the error rate computations are easily obtained.

When the distributions are known (up to parameters), extensions of (9) and (10), as well as theorem 3, do not present further difficulties.

Finally, if prior probabilities q_1, ..., q_k are known, we may use the posterior discriminant functions
$$f_i^B(x_0) = f_i(x_0) + q_i^{-1} - 1, \qquad i = 1, \ldots, k.$$
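Extending the two-population sketch of section 2.1 to rule (21) is immediate (illustrative code):

```python
import numpy as np
# Assumes db_discriminant from the sketch in section 2.1.

def db_allocate_k(delta2_list, D2_list):
    """Rule (21): allocate omega to the population with the smallest f_i.
    delta2_list[i]: squared distances from omega to the sample C_i;
    D2_list[i]: squared intradistance matrix of C_i."""
    f = [db_discriminant(d2, D2) for d2, D2 in zip(delta2_list, D2_list)]
    return int(np.argmin(f)) + 1
```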
6. SOME EXAMPLES
Three real data examples are used to compare the DB approach with the LDF, QDF, EDF and LM approaches.
Data set 1 is the "advanced breast cancer data" used by Krzanowski (1975, 1986, 1987) to illustrate the LM. In this data set, of 186 cases of ablative surgery for advanced breast cancer, 99 were classified as "successful or intermediate" (π_1) and 87 as "failure" (π_2). Gower's distance (section 2.3) is used on the basis of 6 continuous variables and 3 binary variables for the DB method, but we take ranks on the continuous variables instead of numerical values, as the range of most variables was too large.
Data set 2 is the well-known Fisher's Iris data (Fisher, 1936), which consists of n_1 = n_2 = n_3 = 50 observations on three species of iris, I. setosa (π_1), I. versicolor (π_2) and I. virginica (π_3), and poses the discrimination problem on the basis of 4 continuous variables. The DB method uses the Euclidean distance (which yields the EDF) and distance (8).
Data set 3, taken from Mardia et al. (1979, p. 328), is concerned with the problem of discriminating between the species Chaetocnema concinna (π_1) and Ch. heikertingeri (π_2) on the basis of 2 continuous variables and samples of sizes n_1 = n_2 = 18. Gower's distance (version for continuous variables) is also used.
The leaving-one-out method was used, as described in section 2.2, for computing the error rates. Table 1 summarizes the results obtained.
Table 1
Numbers of misclassifications obtained with the leaving-one-out method.

1. Cancer data         π_1   π_2   Total
   LM                   34    27     61
   LDF                  41    31     72
   EDF                  45    43     88
   DB                   33    32     65

2. Iris data           π_1   π_2   π_3   Total
   LDF                   0     2     1      3
   EDF                   1     5     7     13
   DB                    0     2     3      5

3. Chaetocnema data    π_1   π_2   Total
   LDF                   0     1      1
   QDF                   1     1      2
   EDF                   2     3      5
   DB                    0     0      0
7. SUMMARY AND CONCLUSIONS
A distance based approach (DB), which uses a statistical distance between observations, is described and theoretical properties are presented. This method reduces to linear discrimination (LDF), or even quadratic discrimination (QDF), when the sample Mahalanobis distance is used. Moreover, the DB approach yields a general discriminant function which reduces to the Euclidean discriminant function (EDF) when the Euclidean sample distance is used.

This method offers a rather simple algebraic way to consider the discrimination problem with mixed variables as well as the missing values case, provided that a suitable distance is adopted. No restriction seems to be necessary on the numbers of binary variables, so the DB method may be an alternative to discrimination based on the location model (LM).

The leaving-one-out method for computing the probability of misclassification can be applied following an elementary matrix computation in this approach.

Finally, unlike other methods based on distances, the prior probabilities can be taken into account in the DB method, with results quite similar to those given by the Bayes decision rule.
8. REFERENCES
Anderson, T.W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Anderson, J.A. (1972). Separate sample logistic discrimination. Biometrika, 59, 19-35.
Burbea, J. & Rao, C.R. (1982). Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multiv. Anal., 12, 575-596.
Cuadras, C.M. (1988). Distancias Estadísticas. Estadística Española, 30
(119), 295-378.
Cuadras, C.M. (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Statistical Data Analysis and Inference (Y. Dodge, ed.). North-Holland Pub. Co., 459-473.
Cuadras, C.M. (1990). Multivariate distributions with given multivariate marginals and given dependence structure. Math. Preprint Series No. 80, Univ. of Barcelona.
Cuadras, C.M. & Arenas, A. (1990). A distance based regression model for prediction with mixed data. Commun. Statist.-Theory Meth., 19 (6), 2261-2279.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen., 7, 179-188.
Gower, J.C. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55, 582-585.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-871.
Huster, W.J., Brookmeyer, R. & Self, S.G. (1989). Modelling paired survival data with covariates. Biometrics, 45, 145-156.
Kent, J.T. (1982). Robust properties of likelihood ratio tests. Biometrika, 69, 19-27.
Knoke, J.D. (1982). Discriminant analysis with discrete and continuous variables. Biometrics, 38, 191-200.
Krzanowski, W.J. (1975). Discrimination and classification using both binary and continuous variables. J. Am. Stat. Assoc., 70, 782-790.
Krzanowski, W.J. (1977). The performance of Fisher's linear discriminant function under non-optimal conditions. Technometrics, 19, 191-200.
Krzanowski, W.J. (1986). Multiple discriminant analysis in the presence of mixed continuous and categorical data. Comp. & Maths. with Appls., 12A, 179-185.
Krzanowski, W.J. (1987). A comparison between two distance-based discriminant principles. J. Classification, 4, 73-84.
Lachenbruch, P.A. (1975). Discriminant Analysis. Hafner, N. York.
Lachenbruch, P.A. & Goldstein, M. (1979). Discriminant analysis. Biometrics, 35, 69-85.
Lingoes, J.C. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36, 195-203.
Marco, V.R., Young, D.M. & Turner, D.W. (1987). The Euclidean distance classifier: an alternative to the linear discriminant function. Commun. Statist.-Simula., 16 (2), 485-505.
Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.
Matusita, K. (1956). Decision rule, based on the distance, for the classification problem. Ann. Inst. Stat. Math., 8, 67-77.
Oller, J.M. (1989). Some geometrical aspects of data analysis and statistics. In: Statistical Data Analysis and Inference (Y. Dodge, ed.). North-Holland Pub. Co., 41-58.
Oller, J.M. & Cuadras, C.M. (1985). Rao's distance for negative multinomial distributions. Sankhya, 47, 75-83.
Rao, C.R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37, 81-91.
Raveh, A. (1989). A nonmetric approach to linear discriminant analysis. J. Am. Stat. Assoc., 84, 176-183.
Royall, R.M. (1986). Model robust confidence intervals using maximum likelihood estimators. International Statistical Review, 54, 221-226.
Schmitz, P.I.M., Habbema, J.D.F., Hermans, J. & Raatgever, J.W. (1983). Comparative performance of four discriminant analysis methods for mixtures of continuous and discrete variables. Commun. Statist.-Simula. Computa., 12 (6), 727-751.
Seber, G.A.F. (1984). Multivariate Observations. J. Wiley, New York.
Vlachonikolis, I.G. & Marriott, F.H.C. (1982). Discrimination with mixed binary and continuous data. Appl. Statist., 31, 23-31.
Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups. Ann. Math. Statist., 15, 145-163.
REMARK
A computer program for PC to perform a distance based discriminant analysis is available from the author.