UNIVERSITY OF TRENTO
DEPARTMENT OF INFORMATION AND COMMUNICATION TECHNOLOGY
38050 Povo – Trento (Italy), Via Sommarive 14
http://www.dit.unitn.it
ADAPTIVE MODEL SELECTION
FOR DIGITAL LINEAR CLASSIFIERS
Andrea Boni
2002
Technical Report # DIT-02-0034
Also: to be published in Proceedings of the International Conference on
Artificial Neural Networks 2002
Adaptive Model Selection for Digital Linear Classifiers

Andrea Boni
University of Trento
Department of Information and Communication Technologies
Via Sommarive 14,
38100 Povo (Trento), Italy
andrea.boni@ing.unitn.it
Abstract. Adaptive model selection can be defined as the process through which an optimal classifier h* is automatically selected from a function class H by using only a given set of examples z. Such a process is particularly critical when the number of examples in z is low, because the classical splitting of z into training, test and validation sets is impossible. In this work we show that the joint investigation of two bounds on the prediction error of the classifier can be useful to select h* by using z for both model selection and training. Our learning algorithm is a simple kernel-based Perceptron that can be easily implemented in counter-based digital hardware. Experiments on two real-world data sets show the validity of the proposed method.
1 Introduction
In the framework of classification problems, the machine learning community refers to model selection as the task through which an optimal function class H_k is selected from a given set H, with the goal of optimizing the prediction error of the classifier h* extracted from H_k. Usually such a task involves a tradeoff between the capability of the model to learn the given set of labeled examples z = {(x_i, y_i)}_{i=1}^m, and a measure of the complexity of the model itself [11]. Each (x_i, y_i) is supposed to be drawn independently and identically distributed according to an unknown distribution P; we use the following notation: x_i ∈ X, y_i ∈ Y, z_i = (x_i, y_i) ∈ X × Y = Z; z ∈ Z^m; X ⊆ ℝ^r; Y = {−1, 1}. In this paper we consider digital learning algorithms A_k as operators that map z to a set of parameters A, A_k : Z^m → A, A ⊆ (ℕ^+)^m; in particular we consider the Digital Kernel Perceptron with Pocket (DKPP), which leads to the following function class: H = {h : K × A × X → Y : h(α, x) = sgn(∑_{j=1}^m α_j y_j K(x_j, x)), K ∈ K, α ∈ A, x ∈ X}. K(·,·) : X × X → ℝ is a kernel function that realizes a dot product in the feature space [1]. The choice of the kernel perceptron (KP) is justified by recent studies that have revalued its role in classification tasks, showing the relation between its generalization ability and the sparsity of the final solution. In particular, we exploit uniform (over the distribution P) and data-dependent bounds for the error rate of the KP [9] and for a given learning algorithm [3], respectively. In the next section we briefly introduce the DKPP learning algorithm, while in section 3 we describe uniform and data-dependent bounds; based on such bounds we propose a procedure for adaptive model selection. In the end, in section 4 we comment on some experiments on two real-world data sets.
2 The DKPP algorithm
It is well known that the KP algorithm belongs to the class of kernel-based approaches, in which a non-linear classifier is built after a mapping of the input space to a higher, possibly infinite, feature space via a function Φ. This is a well-known theory that exploits the Reproducing Kernel Hilbert Space (RKHS) framework [1], and it has recently been applied with success by the machine learning community. Support Vector Machines (SVMs), designed by Vapnik [11], represent one of the most successful examples of a learning machine based on a kernel method. In practice, the kernel function K(·,·) acts as an operator that hides the non-linear map in the feature space, thus avoiding working directly on Φ. For classification purposes, the most used kernel functions are: K(x_i, x_j) = x_i · x_j (linear), K(x_i, x_j) = (1 + x_i · x_j / r)^p (polynomial) and K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²) (Gaussian); p and σ² are usually indicated as the hyperparameters of the problem. The KP has been applied with success to classification and function approximation tasks [6], [7], and its generalization capabilities have been reported for example in [9]. It can be considered as the dual formulation of Rosenblatt's learning algorithm which, as known, can be applied with any choice of the updating step η. The advantage of the dual formulation consists in the fact that one can exploit the theory of kernels, thus allowing an implicit non-linear transformation; furthermore the choice η = 1 ensures that α ∈ A. Table 1 sketches the flow of our algorithm, called DKPP. We implemented the pocket algorithm [8] in order to accept errors on the training set. This is a crucial point for optimal model selection, as will be clear in the next sections. It is important to point out that the on-going research on kernel-based methods suggests that many classifiers perform comparably, provided that a good kernel has been chosen [9]; therefore classifiers like the DKPP are very appealing and show their superiority when targeting VLSI implementations, as simple counter-based architectures can be designed. Figure 1 shows such an architecture, where q_{i,j} = y_i y_j K(x_i, x_j) and MSB_i indicates the Most Significant Bit of the i-th accumulator, in 2's complement coding. In this paper we explore the way of automatically tuning the model of the DKPP.
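To make the flow of Table 1 concrete, the following minimal Python sketch implements the DKPP: a kernel perceptron whose coefficients are updated by unit increments (so that α stays in (ℕ^+)^m) and a pocket register that keeps the best coefficient vector seen so far. It illustrates the algorithm as described above, not the hardware of Figure 1; the function names are ours, and the pocket test is performed once per epoch for brevity.

import numpy as np

def gaussian_kernel(xi, xj, sigma2=0.5):
    """Gaussian kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma2))

def dkpp_train(X, y, kernel, max_epochs=100):
    """Digital Kernel Perceptron with Pocket (sketch of Table 1).
    X: (m, r) array of inputs; y: (m,) array of labels in {-1, +1}.
    Returns the pocketed integer vector alpha_opt and its error count."""
    m = len(y)
    # Precompute q_ij = y_i y_j K(x_i, x_j), the terms stored in the RAM of Fig. 1.
    Q = np.array([[y[i] * y[j] * kernel(X[i], X[j]) for j in range(m)]
                  for i in range(m)])
    alpha = np.zeros(m, dtype=int)          # integer coefficients, eta = 1
    alpha_opt, nerr = alpha.copy(), m       # pocket: best alpha seen so far
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            # A mistake corresponds to MSB_i flagging a non-positive activation;
            # a zero activation is counted as a mistake so that training can
            # start from alpha = 0.
            if Q[i] @ alpha <= 0:
                alpha[i] += 1
                mistakes += 1
        err = int(np.sum(Q @ alpha <= 0))   # training errors of current alpha
        if err < nerr:                      # pocket update
            alpha_opt, nerr = alpha.copy(), err
        if mistakes == 0:
            break
    return alpha_opt, nerr

def dkpp_predict(alpha, X, y, x, kernel):
    """h(alpha, x) = sgn(sum_j alpha_j y_j K(x_j, x))."""
    s = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X))
    return 1 if s >= 0 else -1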
3 Uniform and data-dependent bounds for the DKPP
[Figure: a RAM, a memory controller and an array of counter-based processing elements (one counter per coefficient α_j and one for err, with mux, clock enable and an accumulator whose MSB_i flags a mistake).]
Fig. 1. A counter-based architecture for the DKPP.
The prediction error of the classifier is defined as er_P(h(α_{k,z}, x), y) = Pr{(x, y) ∈ Z : h(α_{k,z}, x) ≠ y} [2], where α_{k,z} = A_k(z) is the vector obtained by A_k after the observation of z and over the function class H_k. It is well known from the Probably Approximately Correct (PAC) learning framework that the behavior of er_P can be represented in terms of a bound which holds with a given probability; in practice, one usually asserts that the following inequality:

    er_P(h(α_{k,z}, x), y) ≤ êr_P(h_{α_{k,z}}, z) + ε(m, d_k, δ)    (1)

holds with a given probability 1 − δ; êr_P(h_{α_{k,z}}, z) is the error rate of h_{α_{k,z}} on z (also known as the empirical error rate), that is:

    êr_P(h_{α_{k,z}}, z) = (1/m) |{i : 1 ≤ i ≤ m, h(α_{k,z}, x_i) ≠ y_i}|

and d_k is a measure of the complexity of the function class H_k (for example the Vapnik-Chervonenkis dimension [11])¹. Exploiting the well-known compression scheme by Warmuth et al. [5], a uniform bound for the DKPP has been recently proposed [9]; according to such a theorem, with probability at least 1 − δ, the prediction error of the DKPP, as long as êr_P = 0, satisfies:

    er_P(h_{α_{k,z}}) ≤ (1/m) [ d_k ln(m/d_k) + ln m + ln(1/δ) ]    (2)
¹ For simplicity, throughout the text we will use the notation er_P(h_{α_{k,z}}) = er_P(h(α_{k,z}, x), y) and êr_P(h_{α_{k,z}}) = êr_P(h_{α_{k,z}}, z) to indicate the prediction error and the empirical error rate of h_{α_{k,z}}, respectively.
We refer to (2) as B_1. In the agnostic case (êr_P ≠ 0), one can use a more general result which permits the calculus of ε in a data-dependent way, thanks to the introduction of a measure, called penalized complexity, that can be directly estimated on the basis of the given training set z [3]:

    er_P(h_{α_{k,z}}) ≤ êr_P(h_{α_{k,z}}) + d_k + √( (9/2m) ln(1/δ) )    (3)

where, in this case, d_k = max_{h ∈ H_k} [ êr_P^{(1)}(h) − êr_P^{(2)}(h) ], with:

    êr_P^{(1)}(h) = (2/m) |{i : 1 ≤ i ≤ m/2, h(α, x_i) ≠ y_i}|,
    êr_P^{(2)}(h) = (2/m) |{i : m/2 + 1 ≤ i ≤ m, h(α, x_i) ≠ y_i}|.

We refer to (3) as B_2. The measure d_k can be easily computed by applying the DKPP to a new training set z′, obtained by splitting z in two halves and by flipping the labels of the second half [3]. In practice, it is easy to see that:

    d_k = 1 − 2 êr_P(h_{α_{k,z′}}, z′)    (4)
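As an illustration, the maximal-discrepancy estimate (4) and the two bounds can be computed along the following lines. This is a sketch under our own naming (it reuses dkpp_train from the earlier sketch); for B_1 we take d_k to be the number of nonzero coefficients of the solution, which is our reading of the sparsity-based compression argument of [9].

import numpy as np

def penalized_complexity(X, y, kernel):
    """d_k via maximal discrepancy, eq. (4): retrain the DKPP on z',
    obtained by flipping the labels of the second half of z."""
    m = len(y)
    y_flip = np.array(y).copy()
    y_flip[m // 2:] *= -1                   # z': second half of labels flipped
    _, nerr = dkpp_train(X, y_flip, kernel)
    return 1.0 - 2.0 * (nerr / m)           # d_k = 1 - 2 * er_hat(h, z')

def bound_B1(m, d_k, delta=0.05):
    """Uniform compression bound, eq. (2); applies only when er_hat = 0."""
    return (d_k * np.log(m / d_k) + np.log(m) + np.log(1.0 / delta)) / m

def bound_B2(emp_err, m, d_k, delta=0.05):
    """Data-dependent bound, eq. (3): er_hat + d_k + sqrt((9/2m) ln(1/delta))."""
    return emp_err + d_k + np.sqrt(9.0 / (2.0 * m) * np.log(1.0 / delta))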
The main goal in designing an optimal learning system is to bound the right side of equation (1), in order to bound the prediction error. This leads to the Structural Risk Minimization (SRM) principle; according to SRM, one looks for a function h having low empirical error over a fixed function class H_k, characterized by the complexity d_k. From a formal point of view, one should build several function classes of increasing size, that is of increasing complexity (H_1 ⊆ ··· ⊆ H_k), and then pick a function h which has small training error and comes from a class H_k having low complexity (that is, low ε according to (1)). Our idea for adaptive model selection consists in building several H_k and then measuring their richness on the basis of B_{1,2}. In practice we choose the function h* = argmin_k [ êr_P(h_{α_{k,z}}) + ε(m, d_k, δ) ]. The general procedure is sketched in Table 2.
Note that, theoretically, one could use only B_2. Actually, it is known that several function classes, such as the Gaussian one, typically have a high complexity, and the corresponding measure d_k, found according to the maximal discrepancy estimate, does not give any useful information on the problem (d_k = 1); on the other hand, in these cases we have êr_P = 0, thus allowing the use of B_1. Our experiments show that the joint investigation of B_{1,2} can assure an optimal model selection for different real-world problems.
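In code, the procedure of Table 2 reduces to a loop over candidate function classes. The sketch below, which reuses the functions of the previous sketches, is one possible reading of the selection rule: B_1 when the training set is separated, B_2 when d_k is informative, and no vote otherwise; the candidate grid shown here covers only the Gaussian kernels and is our own choice.

import numpy as np
from functools import partial

def select_model(X, y, delta=0.05):
    """Adaptive model selection (Table 2): h* = argmin_k B_{k,{1,2}}."""
    m = len(y)
    # Candidate classes H_k: one per (kernel, hyperparameter) pair.
    candidates = [("gaussian", s2, partial(gaussian_kernel, sigma2=s2))
                  for s2 in (0.1, 0.5, 1.0)]
    # Polynomial candidates (p = 2, 3, 4) would be appended analogously.
    best = None
    for name, hp, kernel in candidates:
        alpha, nerr = dkpp_train(X, y, kernel)
        emp_err = nerr / m
        d_k = penalized_complexity(X, y, kernel)
        if emp_err == 0:
            # Separable case: the uniform bound B_1 applies; here d_k is the
            # sparsity of the solution (our assumption, see above).
            sparsity = max(int(np.count_nonzero(alpha)), 1)
            bound = bound_B1(m, sparsity, delta)
        elif d_k < 1:
            # Agnostic case with an informative discrepancy: use B_2.
            bound = bound_B2(emp_err, m, d_k, delta)
        else:
            continue                        # neither bound is informative
        if best is None or bound < best[0]:
            best = (bound, name, hp, alpha)
    return best   # (bound value, kernel name, hyperparameter, alpha)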
4 Experiments
We test our approach on two well-known data sets from [4, 10], that is sonar (SNR) and pima indian diabetes (PIMA), respectively. The results are summarized in Table 3. Both data sets are two-class problems, with r = 60 and r = 8 input features respectively, whilst the number of training samples is m = 104 [...]

Table 1. The DKPP algorithm.

Step  Description
1.    Set α = 0, α_opt = 0, nerr = m
2.    Repeat until no mistakes occur or a maximum number of steps has been reached:
3.      for i = 1 to m do
4.        compute MSB_i = y_i sgn( ∑_j α_j y_j K(x_i, x_j) )
5.        if MSB_i < 0 then
6.          α_i = α_i + 1
7.        compute err = |{i : y_i sgn( ∑_j α_j y_j K(x_i, x_j) ) < 0}|
8.        if err < nerr then α_opt = α, nerr = err
9.      endfor
10.   α_{k,z} = α_opt
Table 2. A procedure for adaptive model selection.
Step  Description
1.    Set k = 1
2.    Select a kernel function (e.g. linear, polynomial or Gaussian)
3.    Select a value for the hyperparameter (build H_k)
4.    Apply the DKPP and find α_{k,z} = A_k(z), d_k
5.    Compute B_{k,{1,2}} = êr_P(h_{α_{k,z}}) + ε(m, d_k, δ)
6.    k = k + 1
7.    Choose h* = argmin_k B_{k,{1,2}}
Table 3. Experiments for the sonar and pima indian diabetes data sets.
SNR      σ²=0.1     σ²=0.5    σ²=1.0     p=2        p=3        p=4
B1 (TS)  1.86 (11)  1.82 (7)  1.96 (10)  6.58 (19)  4.8 (15)   4.8 (13)
B2 (TS)  –          –         –          –          –          –

PIMA     σ²=0.05    σ²=0.1    σ²=0.5     p=2        p=3        p=4
B1 (TS)  1.41 (53)  1.6 (50)  2.16 (60)  –          –          –
B2 (TS)  –          –         –          0.5 (41)   0.51 (40)  0.6 (43)
Table 3 reports, for each value of the hyperparameter (σ² or p), the value of the bound and, in parentheses, the corresponding number of errors on the test set (TS). As expected, there is a substantial agreement between the bounds B_{1,2} and TS; as reported in Table 3, our criterion chooses the best model, that is a Gaussian kernel with σ² = 0.5 for SNR and a polynomial kernel with p = 2 or p = 3 for PIMA. As a final remark, note that '–' means that the corresponding bound does not give any useful information for model selection. For example, in the case of B_2, it means that d_k = 1 for every value of the hyperparameter, while in the case of B_1 it simply means that we cannot apply equation (2), as we are in the presence of an agnostic case.
5 Conclusion
In this paper we have proposed an efficient method for adaptive model selection. We used a digital kernel-based perceptron with pocket, but our approach could be applied to other learning systems as well, as long as good bounds on the prediction error are provided; in particular, in this work we have investigated the use of two different kinds of bounds: a uniform bound, that can be applied to systems characterized by a high level of complexity, and a data-dependent bound. Experiments on two real-world problems have demonstrated the validity of the method.
References
1. Aizerman, M.A., Braverman, E.M. and Rozonoer, L.I.: Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control, 25 (1964) 821-837.
2. Bartlett, P.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
3. Bartlett, P.L., Boucheron, S., and Lugosi, G.: Model Selection and Error Estimation. Machine Learning, 48 (2002) 85-113.
4. Blake, C., Keogh, E., and Merz, C.J.: UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html.
5. Floyd, S. and Warmuth, M.: Sample Compression, Learnability and the Vapnik-Chervonenkis Dimension. Machine Learning, 21 (1995) 269-304.
6. Freund, Y. and Schapire, R.E.: Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37 (1999) 277-296.
7. Friess, T.T. and Harrison, R.F.: A Kernel-Based Adaline for Function Approximation. Intelligent Data Analysis, 3 (1999) 307-313.
8. Gallant, S.I.: Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, 1 (1990) 179-191.
9. Herbrich, R., Graepel, T., and Williamson, R.: From Margin to Sparsity. NIPS 13, 2001.
10. Torres Moreno, J.M. and Gordon, M.B.: Characterization of the Sonar Signals Benchmark. Neural Processing Letters, 7 (1998) 1-4.
11. Vapnik, V.N.: The Nature of Statistical Learning Theory. John Wiley & Sons, NY, 1995.