
UNIVERSITY OF TRENTO
DEPARTMENT OF INFORMATION AND COMMUNICATION TECHNOLOGY
38050 Povo – Trento (Italy), Via Sommarive 14
http://www.dit.unitn.it

ADAPTIVE MODEL SELECTION FOR DIGITAL LINEAR CLASSIFIERS

Andrea Boni

2002

Technical Report # DIT-02-0034

Also: to be published in Proceedings of the International Conference on Artificial Neural Networks 2002


Adaptive Model Selection for Digital Linear Classifiers

Andrea Boni

University of Trento
Department of Information and Communication Technologies
Via Sommarive 14, 38100 Povo (Trento), Italy
andrea.boni@ing.unitn.it

Abstract. Adaptive model selection can be defined as the process by which an optimal classifier h* is automatically selected from a function class H by using only a given set of examples z. Such a process is particularly critical when the number of examples in z is low, because the classical splitting of z into training + test + validation sets is impossible. In this work we show that the joint investigation of two bounds on the prediction error of the classifier can be useful to select h* by using z for both model selection and training. Our learning algorithm is a simple kernel-based Perceptron that can be easily implemented in counter-based digital hardware. Experiments on two real-world data sets show the validity of the proposed method.

1 Introduction

In the framework of classification problems, the machine learning community refers to model selection as the task through which an optimal function class H_k is selected from a given set H with the goal of optimizing the prediction error of the classifier h* extracted from H_k. Usually such a task involves a tradeoff between the capability of the model to learn the given set of labeled examples z = {(x_i, y_i)}_{i=1}^m and a measure of the complexity of the model itself [11]. Each (x_i, y_i) is supposed to be drawn independently and identically distributed from an unknown distribution P; we use the following notation: x_i ∈ X, y_i ∈ Y, z_i = (x_i, y_i) ∈ X × Y = Z; z ∈ Z^m; X ⊆ ℝ^r; Y = {−1, 1}. In this paper we consider digital learning algorithms A_k as operators that map z to a set of parameters A, A_k: Z^m → A, A ⊆ (ℕ^+)^m; in particular we consider the Digital Kernel Perceptron with Pocket (DKPP), which leads to the following function class: H = {h: K × A × X → Y : h(α, x) = sgn(Σ_{j=1}^m α_j y_j K(x_j, x)), K ∈ K, α ∈ A, x ∈ X}. K(·,·): X × X → ℝ is a kernel function that realizes a dot product in the feature space [1]. The choice of the kernel perceptron (KP) is justified by recent studies that have re-evaluated its role in classification tasks, showing the relation between its generalization ability and the sparsity of the final solution.
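As a concrete illustration of the function class H defined above, the following is a minimal sketch in Python (not part of the original paper; the function name is ours) of the decision rule h(α, x) with integer coefficients α_j:

```python
def decision(alpha, X, y, x, kernel):
    """h(alpha, x) = sgn(sum_j alpha_j * y_j * K(x_j, x)).

    alpha  : non-negative integer coefficients produced by a digital
             learning algorithm A_k (one per training example).
    X, y   : training examples x_j and labels y_j in {-1, +1}.
    kernel : a kernel function K(., .), e.g. the dot product for the
             linear case.
    """
    s = sum(a_j * y_j * kernel(x_j, x) for a_j, y_j, x_j in zip(alpha, y, X))
    return 1 if s >= 0 else -1
```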

Our approach to adaptive model selection relies on two kinds of bounds on the prediction error: uniform (over the distribution P) and data-dependent bounds, for the error rate of the KP [9] and for a given learning algorithm [3], respectively. In the next section we briefly introduce the DKPP learning algorithm, while in section 3 we describe the uniform and data-dependent bounds; based on such bounds we propose a procedure for adaptive model selection. Finally, in section 4 we comment on some experiments on two real-world data sets.

2 The DKPP algorithm

It is well known that the KP algorithm belongs to the class of kernel-based approaches, in which a non-linear classifier is built after a mapping of the input space to a higher, possibly infinite, dimensional feature space via a function Φ. This is a well-known theory that exploits the Reproducing Kernel Hilbert Space (RKHS) framework [1], and it has recently been applied with success by the machine learning community. Support Vector Machines (SVMs), designed by Vapnik [11], represent one of the most successful examples of a learning machine based on a kernel method. In practice, the kernel function K(·,·) acts as an operator that hides the non-linear map in the feature space, thus avoiding working directly on Φ. For classification purposes, the most used kernel functions are: K(x_i, x_j) = x_i · x_j (linear), K(x_i, x_j) = (1 + x_i · x_j / r)^p (polynomial) and K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²) (Gaussian); p and σ² are usually referred to as the hyperparameters of the problem. The KP has been applied with success to classification and function approximation tasks [6], [7], and its generalization capabilities have been reported, for example, in [9].
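For concreteness, the three kernel functions listed above can be written as follows (a small sketch in Python/NumPy, not taken from the paper; the default hyperparameter values are ours):

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi . xj
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, r=1.0, p=3):
    # K(xi, xj) = (1 + xi . xj / r)^p
    return float(1.0 + np.dot(xi, xj) / r) ** p

def gaussian_kernel(xi, xj, sigma2=0.5):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))
    d2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma2)))
```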

The KP can be considered as the dual formulation of Rosenblatt's learning algorithm which, as known, can be applied to any choice of the updating step. The advantage of the dual formulation consists in the fact that one can exploit the theory of kernels, thus allowing an implicit non-linear transformation; furthermore, the choice of a unit updating step leads to the fact that α ∈ A. Table 1 sketches the flow of our algorithm, called DKPP. We implemented the pocket algorithm [8] in order to accept errors on the training set. This is a crucial point for optimal model selection, as will be clear in the next sections. It is important to point out that the on-going research on kernel-based methods suggests that many classifiers perform comparably, provided that a good kernel has been chosen [9]; therefore classifiers like the DKPP are very appealing and show their superiority when targeting VLSI implementations, as simple counter-based architectures can be designed. Figure 1 shows such an architecture, where q_ij = y_i y_j K(x_i, x_j) and MSB_i indicates the Most Significant Bit of the accumulator, in 2's complement coding. In this paper we explore the automatic tuning of the model of the DKPP.
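The loop of Table 1 can be modeled in software as follows (a sketch assuming a unit updating step and the precomputed matrix q_ij = y_i y_j K(x_i, x_j) of Figure 1; it mimics the counter-based hardware but is not a description of it):

```python
import numpy as np

def dkpp(Q, max_epochs=100):
    """Digital Kernel Perceptron with Pocket (software model).

    Q : m x m matrix with Q[i, j] = y_i * y_j * K(x_i, x_j), precomputed.
    Returns the pocketed integer coefficients alpha_opt and the number of
    training errors nerr they make.
    """
    m = Q.shape[0]
    alpha = np.zeros(m, dtype=int)
    alpha_opt, nerr = alpha.copy(), m
    for _ in range(max_epochs):
        mistakes = False
        for i in range(m):
            # MSB_i < 0 (negative accumulator) <=> example i is misclassified;
            # the zero margin of the all-zero vector is also treated as a mistake.
            if Q[i] @ alpha <= 0:
                alpha[i] += 1              # counter-based (integer) update
                mistakes = True
        err = int(np.sum(Q @ alpha <= 0))  # training errors of the current alpha
        if err < nerr:                     # pocket: keep the best alpha seen so far
            alpha_opt, nerr = alpha.copy(), err
        if not mistakes:
            break
    return alpha_opt, nerr
```

With a linear kernel, for instance, Q can be built in one line as np.outer(y, y) * (X @ X.T).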

3 Uniform and data-dependent bounds for the DKPP


[Figure 1 (block diagram): a RAM holding the precomputed terms q_{i,j}, a memory controller, a processing element, and counter-based registers (with enable/mux/clock logic) for the coefficients α_1, ..., α_n, the accumulator whose MSB_i flags a mistake, and the error counter err.]

Fig. 1. A counter-based architecture for the DKPP.

The quantity of interest is the prediction error er_P(h(α_{k,z}, x), y) = Pr{(x, y) ∈ Z : h(α_{k,z}, x) ≠ y} [2], where α_{k,z} = A_k(z) is the vector obtained by A_k after the observation of z, over the function class H_k. It is well known from the Probably Approximately Correct (PAC) learning framework that the behavior of er_P can be represented in terms of a bound which holds with a given probability; in practice, one usually asserts that the following inequality

er_P(h(α_{k,z}, x), y) ≤ êr_P(h_{k,z}, z) + ε(m, d_k, δ)    (1)

holds with a given probability 1 − δ; êr_P(h_{k,z}, z) is the error rate of h_{k,z} on z (also known as the empirical error rate), that is

êr_P(h_{k,z}, z) = (1/m) |{i : 1 ≤ i ≤ m, h(α_{k,z}, x_i) ≠ y_i}|,

and d_k is a measure of the complexity of the function class H_k (for example the Vapnik-Chervonenkis dimension [11])¹. Exploiting the well-known compression scheme by Warmuth et al. [5], a uniform bound for the DKPP has been recently proposed [9]; according to such a theorem, with probability at least 1 − δ, the prediction error of the DKPP, as long as êr_P = 0, satisfies

er_P(h_{k,z}) ≤ (1 / (m − d_k)) [ d_k ln(m / d_k) + ln m + ln(1/δ) ].    (2)

We refer to (2) as B_1.

¹ For simplicity, throughout the text we use the notation er_P(h_{k,z}) = er_P(h(α_{k,z}, x), y) and êr_P(h_{k,z}) = êr_P(h_{k,z}, z) to indicate the prediction error and the empirical error rate of h_{k,z}, respectively.
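A direct transcription of bound (2) might look as follows (a sketch, not from the paper; it assumes that for this compression-style bound d_k is taken as the number of non-zero coefficients of the DKPP solution, and that the empirical error is zero):

```python
import math

def bound_B1(m, d_k, delta=0.05):
    """Uniform bound of eq. (2), applicable only when the empirical error is 0.

    m     : number of training examples.
    d_k   : complexity measure (assumed here: number of non-zero alpha_j).
    delta : confidence parameter (the bound holds with probability 1 - delta).
    """
    assert 0 < d_k < m, "the bound is informative only for 0 < d_k < m"
    return (d_k * math.log(m / d_k) + math.log(m)
            + math.log(1.0 / delta)) / (m - d_k)
```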

In the agnostic case (êr_P ≠ 0), one can use a more general result which permits the computation of ε in a data-dependent way, thanks to the introduction of a measure, called penalized complexity, that can be directly estimated on the basis of the given training set z [3]:

er_P(h_{k,z}) ≤ êr_P(h_{k,z}) + d_k + √( (9 / 2m) ln(1/δ) )    (3)

where, in this case, d_k = max_{h ∈ H_k} ( êr⁽¹⁾_P(h) − êr⁽²⁾_P(h) ), with

êr⁽¹⁾_P(h) = (2/m) |{i : 1 ≤ i ≤ m/2, h(α, x_i) ≠ y_i}|,
êr⁽²⁾_P(h) = (2/m) |{i : m/2 + 1 ≤ i ≤ m, h(α, x_i) ≠ y_i}|.

We refer to (3) as B_2. The measure d_k can be easily computed by applying the DKPP to a new training set z′, obtained by splitting z into two halves and flipping the labels of the second half [3]. In practice, it is easy to see that

d_k = 1 − 2 êr_P(h_{k,z′}, z′).    (4)
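The label-flipping estimate of d_k in (4) and the resulting bound (3) can be sketched as follows (hypothetical helper names; dkpp is the routine sketched in section 2 and kernel is any of the kernel functions above):

```python
import math
import numpy as np

def penalized_complexity(X, y, kernel, dkpp_train):
    """Estimate d_k via eq. (4): train on z' (z with the labels of the second
    half flipped) and use d_k = 1 - 2 * (training error rate on z')."""
    m = len(y)
    y_flip = np.array(y, dtype=float)
    y_flip[m // 2:] *= -1.0                      # flip labels of the second half
    Q = np.array([[yi * yj * kernel(xi, xj)      # q_ij for the flipped set z'
                   for xj, yj in zip(X, y_flip)]
                  for xi, yi in zip(X, y_flip)])
    _, nerr = dkpp_train(Q)                      # training errors on z'
    return 1.0 - 2.0 * nerr / m

def bound_B2(emp_err, d_k, m, delta=0.05):
    """Data-dependent bound of eq. (3)."""
    return emp_err + d_k + math.sqrt(9.0 / (2.0 * m) * math.log(1.0 / delta))
```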

The main goal in designing an optimal learning system is to bound the right-hand side of equation (1), in order to bound the prediction error. This leads to the Structural Risk Minimization (SRM) principle; according to SRM, one looks for a function h having low empirical error over a fixed function class H_k characterized by the complexity d_k. From a formal point of view, one should build several function classes of increasing size, that is of increasing complexity (H_1 ⊆ H_2 ⊆ ... ⊆ H_k), and then pick a function h which has small training error and comes from a class H_k having low complexity (that is, low ε according to (1)). Our idea for adaptive model selection consists in building several H_k and then measuring their richness on the basis of B_{1,2}. In practice we choose the function h* = argmin_k [ êr_P(h_{k,z}) + ε(m, d_k, δ) ]. The general procedure is sketched in Table 2.

Note that, theoretically, one could use only B_2. Actually, it is known that several function classes, such as the Gaussian one, typically have a high complexity, and the corresponding measure d_k, found according to the maximal discrepancy estimate, does not give any useful information on the problem (d_k = 1); on the other hand, in these cases we have êr_P = 0, thus allowing the use of B_1. Our experiments show that the joint investigation of B_{1,2} can assure an optimal model selection for different real-world problems.
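Putting the pieces together, the selection rule of Table 2 can be sketched as glue code (our own illustrative names, building on the dkpp, penalized_complexity, bound_B1 and bound_B2 sketches above; B_1 is used when the training error is zero, B_2 when the maximal-discrepancy estimate is informative, i.e. d_k < 1):

```python
import numpy as np

def select_model(candidates, X, y, delta=0.05):
    """candidates: list of (name, kernel) pairs, one per function class H_k.
    Returns the candidate whose bound on the prediction error is smallest."""
    m = len(y)
    best_name, best_bound = None, float("inf")
    for name, kernel in candidates:
        Q = np.array([[yi * yj * kernel(xi, xj) for xj, yj in zip(X, y)]
                      for xi, yi in zip(X, y)])
        alpha, nerr = dkpp(Q)                       # train the DKPP on z
        emp_err = nerr / m
        sparsity = int(np.count_nonzero(alpha))
        if emp_err == 0 and 0 < sparsity < m:
            b = bound_B1(m, sparsity, delta)        # uniform bound (2)
        else:
            d_k = penalized_complexity(X, y, kernel, dkpp)
            if d_k >= 1.0:
                continue                            # '-' case: bound uninformative
            b = bound_B2(emp_err, d_k, m, delta)    # data-dependent bound (3)
        if b < best_bound:
            best_name, best_bound = name, b
    return best_name, best_bound
```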

4 Experiments

We test our approach on two well-known data sets from [4, 10], namely sonar (SNR) and Pima Indian diabetes (PIMA). The results are summarized in Table 3. Both data sets are two-class problems, with r = 60 and r = 8 input features respectively, whilst the number of training samples used for SNR is m = 104.

Table 1. The DKPP algorithm.

Step | Description
1.   | set α = 0, α_opt = 0, nerr = m
2.   | repeat until no mistakes occur or a maximum number of steps has been reached:
3.   |   for i = 1 to m do
4.   |     compute MSB_i = y_i · sgn( Σ_j α_j y_j K(x_i, x_j) )
5.   |     if MSB_i < 0 then
6.   |       α_i = α_i + 1
7.   |   compute err = |{i : y_i · sgn( Σ_j α_j y_j K(x_i, x_j) ) < 0}|
8.   |   if err < nerr then α_opt = α, nerr = err
9.   |   end for
10.  | α_{k,z} = α_opt

Table 2. A procedure for adaptive model selection.

Step | Description
1.   | set k = 1
2.   | select a kernel function (e.g. linear, polynomial or Gaussian)
3.   | select a value for the hyperparameter (build H_k)
4.   | apply the DKPP and find α_{k,z} = A_k(z), d_k
5.   | compute B_{k,{1,2}} = êr_P(h_{k,z}) + ε(m, d_k, δ)
6.   | k = k + 1
7.   | choose h* = argmin_k B_{k,{1,2}}

Table 3. Experiments for the sonar and Pima Indian diabetes data sets.

SNR     | σ² = 0.1  | σ² = 0.5 | σ² = 1.0  | p = 2     | p = 3     | p = 4
B1 (TS) | 1.86 (11) | 1.82 (7) | 1.96 (10) | 6.58 (19) | 4.8 (15)  | 4.8 (13)
B2 (TS) | –         | –        | –         | –         | –         | –

PIMA    | σ² = 0.05 | σ² = 0.1 | σ² = 0.5  | p = 2     | p = 3     | p = 4
B1 (TS) | 1.41 (53) | 1.6 (50) | 2.16 (60) | –         | –         | –
B2 (TS) | –         | –        | –         | 0.5 (41)  | 0.51 (40) | 0.6 (43)


Table 3 reports, for each data set, the values of the bounds B_1 and B_2 together with the corresponding test-set performance TS (in parentheses) for each value of the hyperparameter (σ² or p). As expected, there is a substantial agreement between the bounds B_{1,2} and TS; as reported in Table 3, our criterion chooses the best model, that is a Gaussian kernel with σ² = 0.5 for SNR and a polynomial kernel with p = 2 or p = 3 for PIMA. As a final remark, note that with '–' we mean that the corresponding bound does not give any useful information for model selection. For example, in the case of B_2 it means that d_k = 1 for every value of the hyperparameter, while in the case of B_1 it simply means that we cannot apply equation (2), as we are in the presence of an agnostic case.

5 Conclusion

In this paper we have proposed an efficient method for adaptive model selection. We used a digital kernel-based perceptron with pocket, but our approach could be applied to other learning systems as well, as long as good bounds on the prediction error are available; in particular, in this work we have investigated the use of two different kinds of bounds: a uniform bound, which can be applied to systems characterized by a high level of complexity, and a data-dependent bound. Experiments on two real-world problems have demonstrated the validity of the method.

References

1. Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I.: Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control, 25 (1964) 821-837.
2. Bartlett, P.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
3. Bartlett, P. L., Boucheron, S., and Lugosi, G.: Model Selection and Error Estimation. Machine Learning, 48 (2002) 85-113.
4. Blake, C., Keogh, E., and Merz, C. J.: UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/mlearn/MLRepository.html.
5. Floyd, S. and Warmuth, M.: Sample Compression, Learnability and the Vapnik-Chervonenkis Dimension. Machine Learning, 21 (1995) 269-304.
6. Freund, Y. and Schapire, R. E.: Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37 (1999) 277-296.
7. Friess, T. T. and Harrison, R. F.: A Kernel-Based Adaline for Function Approximation. Intelligent Data Analysis, 3 (1999) 307-313.
8. Gallant, S. I.: Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, 1 (1990) 179-191.
9. Herbrich, R., Graepel, T., and Williamson, R.: From Margin to Sparsity. NIPS 13, 2001.
10. Torres Moreno, J. M. and Gordon, M. B.: Characterization of the Sonar Signals Benchmark. Neural Processing Letters, 7 (1998) 1-4.
11. Vapnik, V. N.: The Nature of Statistical Learning Theory. John Wiley & Sons, NY.
