UNIVERSITY OF TRENTO
DEPARTMENT OF INFORMATION AND COMMUNICATION TECHNOLOGY
38050 Povo – Trento (Italy), Via Sommarive 14
http://www.dit.unitn.it
ADAPTIVE MODEL SELECTION
FOR DIGITAL LINEAR CLASSIFIERS
Andrea Boni
2002
Technical Report # DIT-02-0034
Also: to be published in Proceedings of the International Conference on
Artificial Neural Networks 2002
Adaptive Model Selection for Digital Linear Classifiers

Andrea Boni
University of Trento
Department of Information and Communication Technologies
Via Sommarive 14,
38100 Povo (Trento), Italy
andrea.boni@ing.unitn.it
Abstract. Adaptive model selection can be defined as the process through which an optimal classifier h* is automatically selected from a function class H by using only a given set of examples z. Such a process is particularly critical when the number of examples in z is low, because the classical splitting of z into training, test and validation sets is impossible. In this work we show that the joint investigation of two bounds on the prediction error of the classifier can be useful to select h* by using z for both model selection and training. Our learning algorithm is a simple kernel-based Perceptron that can be easily implemented in counter-based digital hardware. Experiments on two real-world data sets show the validity of the proposed method.
1 Introduction
In the framework of classification problems, the machine learning community refers to model selection as the task through which an optimal function class H_k is selected from a given set H, with the goal of optimizing the prediction error of the classifier h* extracted from H_k. Usually such a task involves a tradeoff between the capability of the model to learn the given set of labeled examples z = {(x_i, y_i)}_{i=1}^m, and a measure of the complexity of the model itself [11]. Each (x_i, y_i) is supposed to be drawn independently and identically distributed according to an unknown distribution P; we use the following notation: x_i ∈ X, y_i ∈ Y, z_i = (x_i, y_i) ∈ X × Y = Z; z ∈ Z^m; X ⊆ ℝ^r; Y = {−1, 1}. In this paper we consider digital learning algorithms A_k as operators that map z to a set of parameters A, A_k : Z^m → A, A ⊆ (ℕ^+)^m; in particular we consider the Digital Kernel Perceptron with Pocket (DKPP), which leads to the following function class: H = {h : K × A × X → Y : h(α, x) = sgn(∑_{j=1}^m α_j y_j K(x_j, x)), K ∈ K, α ∈ A, x ∈ X}. K(·,·) : X × X → ℝ is a kernel function that realizes a dot product in the feature space [1]. The choice of the kernel perceptron (KP) is justified by recent studies that have revalued its role in classification tasks, showing the relation between its generalization ability and the sparsity of the final solution. In particular, we exploit uniform (over the distribution P) and data-dependent bounds for the error rate of the KP [9] and for a given learning algorithm [3], respectively. In the next section we briefly introduce the DKPP learning algorithm, while in section 3 we describe uniform and data-dependent bounds; based on such bounds we propose a procedure for adaptive model selection. In the end, in section 4 we comment on some experiments on two real-world data sets.
2 The DKPP algorithm
It is well known that the KP algorithm belongs to the class of kernel-based approaches, in which a non-linear classifier is built after a mapping of the input space to a higher, possibly infinite, feature space via a function Φ. This is a well-known theory that exploits the Reproducing Kernel Hilbert Space (RKHS) framework [1], and it has recently been applied with success by the machine learning community. Support Vector Machines (SVMs), designed by Vapnik [11], represent one of the most successful examples of a learning machine based on a kernel method. In practice, the kernel function K(·,·) acts as an operator that hides the non-linear map in the feature space, thus avoiding working directly on Φ. For classification purposes, the most used kernel functions are: K(x_i, x_j) = x_i · x_j (linear), K(x_i, x_j) = (1 + x_i · x_j / r)^p (polynomial) and K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²) (Gaussian); p and σ² are usually indicated as the hyperparameters of the problem. The KP has been applied with success to classification and function approximation tasks [6], [7], and its generalization capabilities have been reported for example in [9]. It can be considered as the dual formulation of Rosenblatt's learning algorithm which, as known, can be applied with any choice of the updating step η. The advantage of the dual formulation consists in the fact that one can exploit the theory of kernels, thus allowing an implicit non-linear transformation; furthermore the choice η = 1 ensures that α ∈ A. Table 1 sketches the flow of our algorithm, called DKPP. We implemented the pocket algorithm [8] in order to accept errors on the training set. This is a crucial point for optimal model selection, as will be clear in the next sections. It is important to point out that the on-going research on kernel-based methods suggests that many classifiers perform comparably, provided that a good kernel has been chosen [9]; therefore classifiers like the DKPP are very appealing and show their superiority when targeting VLSI implementations, as simple counter-based architectures can be designed. Figure 1 shows such an architecture, where q_{i,j} = y_i y_j K(x_i, x_j) and MSB_i indicates the Most Significant Bit of the i-th accumulator, in 2's complement coding. In this paper we explore the way of automatically tuning the model of the DKPP.
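To make the flow of Table 1 concrete, the following minimal Python sketch implements the DKPP: a kernel perceptron whose coefficients are updated by unit increments (so that α stays in (ℕ^+)^m) and a pocket register that keeps the best coefficient vector seen so far. It illustrates the algorithm as described above, not the hardware of Figure 1; the function names are ours, and the pocket test is performed once per epoch for brevity.

import numpy as np

def gaussian_kernel(xi, xj, sigma2=0.5):
    """Gaussian kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma2))

def dkpp_train(X, y, kernel, max_epochs=100):
    """Digital Kernel Perceptron with Pocket (sketch of Table 1).
    X: (m, r) array of inputs; y: (m,) array of labels in {-1, +1}.
    Returns the pocketed integer vector alpha_opt and its error count."""
    m = len(y)
    # Precompute q_ij = y_i y_j K(x_i, x_j), the terms stored in the RAM of Fig. 1.
    Q = np.array([[y[i] * y[j] * kernel(X[i], X[j]) for j in range(m)]
                  for i in range(m)])
    alpha = np.zeros(m, dtype=int)          # integer coefficients, eta = 1
    alpha_opt, nerr = alpha.copy(), m       # pocket: best alpha seen so far
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            # A mistake corresponds to MSB_i flagging a non-positive activation;
            # a zero activation is counted as a mistake so that training can
            # start from alpha = 0.
            if Q[i] @ alpha <= 0:
                alpha[i] += 1
                mistakes += 1
        err = int(np.sum(Q @ alpha <= 0))   # training errors of current alpha
        if err < nerr:                      # pocket update
            alpha_opt, nerr = alpha.copy(), err
        if mistakes == 0:
            break
    return alpha_opt, nerr

def dkpp_predict(alpha, X, y, x, kernel):
    """h(alpha, x) = sgn(sum_j alpha_j y_j K(x_j, x))."""
    s = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X))
    return 1 if s >= 0 else -1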
3 Uniform and data-dependent bounds for the DKPP
[Figure: a RAM, a memory controller and an array of counter-based processing elements (one counter per coefficient α_j and one for err, with mux, clock enable and an accumulator whose MSB_i flags a mistake).]
Fig. 1. A counter-based architecture for the DKPP.
The prediction error of the classifier is defined as er_P(h(α_{k,z}, x), y) = Pr{(x, y) ∈ Z : h(α_{k,z}, x) ≠ y} [2], where α_{k,z} = A_k(z) is the vector obtained by A_k after the observation of z and over the function class H_k. It is well known from the Probably Approximately Correct (PAC) learning framework that the behavior of er_P can be represented in terms of a bound which holds with a given probability; in practice, one usually asserts that the following inequality:

    er_P(h(α_{k,z}, x), y) ≤ êr_P(h_{α_{k,z}}, z) + ε(m, d_k, δ)    (1)

holds with a given probability 1 − δ; êr_P(h_{α_{k,z}}, z) is the error rate of h_{α_{k,z}} on z (also known as the empirical error rate), that is:

    êr_P(h_{α_{k,z}}, z) = (1/m) |{i : 1 ≤ i ≤ m, h(α_{k,z}, x_i) ≠ y_i}|

and d_k is a measure of the complexity of the function class H_k (for example the Vapnik-Chervonenkis dimension [11])¹. Exploiting the well-known compression scheme by Warmuth et al. [5], a uniform bound for the DKPP has been recently proposed [9]; according to such a theorem, with probability at least 1 − δ, the prediction error of the DKPP, as long as êr_P = 0, satisfies:

    er_P(h_{α_{k,z}}) ≤ (1/m) [ d_k ln(m/d_k) + ln m + ln(1/δ) ]    (2)
¹ For simplicity, throughout the text we will use the notation er_P(h_{α_{k,z}}) = er_P(h(α_{k,z}, x), y) and êr_P(h_{α_{k,z}}) = êr_P(h_{α_{k,z}}, z) to indicate the prediction error and the empirical error rate of h_{α_{k,z}}, respectively.
We refer to (2) as B_1. In the agnostic case (êr_P ≠ 0), one can use a more general result which permits the calculus of ε in a data-dependent way, thanks to the introduction of a measure, called penalized complexity, that can be directly estimated on the basis of the given training set z [3]:

    er_P(h_{α_{k,z}}) ≤ êr_P(h_{α_{k,z}}) + d_k + √( (9/2m) ln(1/δ) )    (3)

where, in this case, d_k = max_{h ∈ H_k} [ êr_P^{(1)}(h) − êr_P^{(2)}(h) ], with:

    êr_P^{(1)}(h) = (2/m) |{i : 1 ≤ i ≤ m/2, h(α, x_i) ≠ y_i}|,
    êr_P^{(2)}(h) = (2/m) |{i : m/2 + 1 ≤ i ≤ m, h(α, x_i) ≠ y_i}|.

We refer to (3) as B_2. The measure d_k can be easily computed by applying the DKPP to a new training set z′, obtained by splitting z in two halves and by flipping the labels of the second half [3]. In practice, it is easy to see that:

    d_k = 1 − 2 êr_P(h_{α_{k,z′}}, z′)    (4)
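As an illustration, the maximal-discrepancy estimate (4) and the two bounds can be computed along the following lines. This is a sketch under our own naming (it reuses dkpp_train from the earlier sketch); for B_1 we take d_k to be the number of nonzero coefficients of the solution, which is our reading of the sparsity-based compression argument of [9].

import numpy as np

def penalized_complexity(X, y, kernel):
    """d_k via maximal discrepancy, eq. (4): retrain the DKPP on z',
    obtained by flipping the labels of the second half of z."""
    m = len(y)
    y_flip = np.array(y).copy()
    y_flip[m // 2:] *= -1                   # z': second half of labels flipped
    _, nerr = dkpp_train(X, y_flip, kernel)
    return 1.0 - 2.0 * (nerr / m)           # d_k = 1 - 2 * er_hat(h, z')

def bound_B1(m, d_k, delta=0.05):
    """Uniform compression bound, eq. (2); applies only when er_hat = 0."""
    return (d_k * np.log(m / d_k) + np.log(m) + np.log(1.0 / delta)) / m

def bound_B2(emp_err, m, d_k, delta=0.05):
    """Data-dependent bound, eq. (3): er_hat + d_k + sqrt((9/2m) ln(1/delta))."""
    return emp_err + d_k + np.sqrt(9.0 / (2.0 * m) * np.log(1.0 / delta))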
The main goal in designing an optimal learning system is to bound the right side of equation (1), in order to bound the prediction error. This leads to the Structural Risk Minimization (SRM) principle; according to SRM, one looks for a function h having low empirical error over a fixed function class H_k, characterized by the complexity d_k. From a formal point of view, one should build several function classes of increasing size, that is of increasing complexity (H_1 ⊆ ··· ⊆ H_k), and then pick a function h which has small training error and comes from a class H_k having low complexity (that is, low ε according to (1)). Our idea for adaptive model selection consists in building several H_k and then measuring their richness on the basis of B_{1,2}. In practice we choose the function h* = argmin_k [ êr_P(h_{α_{k,z}}) + ε(m, d_k, δ) ]. The general procedure is sketched in Table 2.
Note that, theoretically, one could use only B_2. Actually, it is known that several function classes, such as the Gaussian one, typically have a high complexity, and the corresponding measure d_k, found according to the maximal discrepancy estimate, does not give any useful information on the problem (d_k = 1); on the other hand, in these cases we have êr_P = 0, thus allowing the use of B_1. Our experiments show that the joint investigation of B_{1,2} can assure an optimal model selection for different real-world problems.
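In code, the procedure of Table 2 reduces to a loop over candidate function classes. The sketch below, which reuses the functions of the previous sketches, is one possible reading of the selection rule: B_1 when the training set is separated, B_2 when d_k is informative, and no vote otherwise; the candidate grid shown here covers only the Gaussian kernels and is our own choice.

import numpy as np
from functools import partial

def select_model(X, y, delta=0.05):
    """Adaptive model selection (Table 2): h* = argmin_k B_{k,{1,2}}."""
    m = len(y)
    # Candidate classes H_k: one per (kernel, hyperparameter) pair.
    candidates = [("gaussian", s2, partial(gaussian_kernel, sigma2=s2))
                  for s2 in (0.1, 0.5, 1.0)]
    # Polynomial candidates (p = 2, 3, 4) would be appended analogously.
    best = None
    for name, hp, kernel in candidates:
        alpha, nerr = dkpp_train(X, y, kernel)
        emp_err = nerr / m
        d_k = penalized_complexity(X, y, kernel)
        if emp_err == 0:
            # Separable case: the uniform bound B_1 applies; here d_k is the
            # sparsity of the solution (our assumption, see above).
            sparsity = max(int(np.count_nonzero(alpha)), 1)
            bound = bound_B1(m, sparsity, delta)
        elif d_k < 1:
            # Agnostic case with an informative discrepancy: use B_2.
            bound = bound_B2(emp_err, m, d_k, delta)
        else:
            continue                        # neither bound is informative
        if best is None or bound < best[0]:
            best = (bound, name, hp, alpha)
    return best   # (bound value, kernel name, hyperparameter, alpha)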
4 Experiments
We test our approach on two well-known data sets from [4, 10], that is sonar (SNR) and pima indian diabetes (PIMA), respectively. The results are summarized in Table 3. Both data sets are two-class problems, with r = 60 and r = 8 input features respectively, whilst the number of training samples is m = 104 [...]

Table 1. The DKPP algorithm.

Step  Description
1.    Set α = 0, α_opt = 0, nerr = m
2.    Repeat until no mistakes occur or a maximum number of steps has been reached:
3.      for i = 1 to m do
4.        compute MSB_i = y_i sgn( ∑_j α_j y_j K(x_i, x_j) )
5.        if MSB_i < 0 then
6.          α_i = α_i + 1
7.        compute err = |{i : y_i sgn( ∑_j α_j y_j K(x_i, x_j) ) < 0}|
8.        if err < nerr then α_opt = α, nerr = err
9.      endfor
10.   α_{k,z} = α_opt
Table 2. A procedure for adaptive model selection.
Step  Description
1.    Set k = 1
2.    Select a kernel function (e.g. linear, polynomial or Gaussian)
3.    Select a value for the hyperparameter (build H_k)
4.    Apply the DKPP and find α_{k,z} = A_k(z), d_k
5.    Compute B_{k,{1,2}} = êr_P(h_{α_{k,z}}) + ε(m, d_k, δ)
6.    k = k + 1
7.    Choose h* = argmin_k B_{k,{1,2}}
Table 3. Experiments for the sonar and pima indian diabetes data sets.
SNR      σ²=0.1     σ²=0.5    σ²=1.0     p=2        p=3        p=4
B1 (TS)  1.86 (11)  1.82 (7)  1.96 (10)  6.58 (19)  4.8 (15)   4.8 (13)
B2 (TS)  –          –         –          –          –          –

PIMA     σ²=0.05    σ²=0.1    σ²=0.5     p=2        p=3        p=4
B1 (TS)  1.41 (53)  1.6 (50)  2.16 (60)  –          –          –
B2 (TS)  –          –         –          0.5 (41)   0.51 (40)  0.6 (43)
Table 3 reports, for each value of the hyperparameter (σ² or p), the value of the bound and, in parentheses, the corresponding number of errors on the test set (TS). As expected, there is a substantial agreement between the bounds B_{1,2} and TS; as reported in Table 3, our criterion chooses the best model, that is a Gaussian kernel with σ² = 0.5 for SNR and a polynomial kernel with p = 2 or p = 3 for PIMA. As a final remark, note that '–' means that the corresponding bound does not give any useful information for model selection. For example, in the case of B_2, it means that d_k = 1 for every value of the hyperparameter, while in the case of B_1 it simply means that we cannot apply equation (2), as we are in the presence of an agnostic case.
5 Conclusion
In this paper we have proposed an efficient method for adaptive model selection. We used a digital kernel-based perceptron with pocket, but our approach could be applied to other learning systems as well, as long as good bounds on the prediction error are provided; in particular, in this work we have investigated the use of two different kinds of bounds: a uniform bound, that can be applied to systems characterized by a high level of complexity, and a data-dependent bound. Experiments on two real-world problems have demonstrated the validity of the method.
References
1. Aizerman, M.A., Braverman, E.M. and Rozonoer, L.I.: Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control, 25 (1964) 821-837.
2. Bartlett, P.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
3. Bartlett, P.L., Boucheron, S., and Lugosi, G.: Model Selection and Error Estimation. Machine Learning, 48 (2002) 85-113.
4. Blake, C., Keogh, E., and Merz, C.J.: UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html.
5. Floyd, S. and Warmuth, M.: Sample Compression, Learnability and the Vapnik-Chervonenkis Dimension. Machine Learning, 21 (1995) 269-304.
6. Freund, Y. and Schapire, R.E.: Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37 (1999) 277-296.
7. Friess, T.T. and Harrison, R.F.: A Kernel-Based Adaline for Function Approximation. Intelligent Data Analysis, 3 (1999) 307-313.
8. Gallant, S.I.: Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, 1 (1990) 179-191.
9. Herbrich, R., Graepel, T., and Williamson, R.: From Margin to Sparsity. NIPS 13, 2001.
10. Torres Moreno, J.M. and Gordon, M.B.: Characterization of the Sonar Signals Benchmark. Neural Processing Letters, 7 (1998) 1-4.
11. Vapnik, V.N.: The Nature of Statistical Learning Theory. John Wiley & Sons, NY, 1995.