Towards incremental parsing of natural language using recursive neural networks
F. Costa¹, P. Frasconi¹, V. Lombardo², and G. Soda¹

¹ Dept. of Systems and Computer Science, University of Florence, via di Santa Marta 3, I-50139 Firenze (Italy)
² DISTA, Università del Piemonte Orientale "A. Avogadro", c.so Borsalino 54, I-15100 Alessandria (Italy)

April 7, 2000
Abstract

In this paper we develop novel algorithmic ideas for building a natural language parser grounded upon the hypothesis of incrementality. Although widely accepted and experimentally supported from a cognitive perspective as a model of the human parser, the incrementality assumption has never been exploited for building automatic parsers of unconstrained real texts. The essentials of the hypothesis are that words are processed in a left-to-right fashion, and the syntactic structure is kept totally connected at each step.

Our proposal relies on a machine learning technique for predicting the correctness of partial syntactic structures that are built during the parsing process. A recursive neural network architecture is employed for computing predictions after a training phase on examples drawn from a corpus of parsed sentences, the Penn Treebank. Our results indicate the viability of the approach and lay out the premises for a novel generation of algorithms for natural language processing which more closely model human parsing. These algorithms may prove very useful in the development of efficient parsers and have an immediate application in the construction of semiautomatic annotation tools.
1 Introduction
Incremental processing of the human parser is a widely held assumption in psycholinguistics. This assumption accounts for the intuitive fact that language is processed from left to right: this is obviously true for speech, which is obligatorily fed to the processor in a sequential fashion, while in reading the bidimensional presentation of text (as opposed to the monodimensional speech) allows an inspection of linguistic material that goes beyond the left-to-right sequential presentation. Incremental processing also accounts for a fact which has received much experimental support, i.e. that humans interpret language without reaching the end of the input sentence, that is, they are able to assign a meaning to "almost any" initial (left) fragment of a sentence (Marslen-Wilson, 1973).
Although incremental processing of natural language is uncontroversially supported by many psycholinguistic experiments, effective models for parsing unconstrained language are rarely based on the incrementality hypothesis. In this area, most solutions consist of stochastic versions of bottom-up parsers, relying on models grounded upon context-free rule probabilities (see, e.g., (Collins, 1996)). The incremental approach is at work only in psycholinguistic models of the human parser, mostly tested on artificially generated sentences that display this or that form of ambiguity. Probably, the major reason is the incomplete knowledge about the human parser, which makes a reliable algorithmic design considerably hard.
Nevertheless, some of the discovered features of the human parser have been exploited in NLP system implementations. For example, a number of preference heuristics (like Late Closure (Kimball, 1973) or Minimal Attachment (Frazier & Fodor, 1978), see Section 5), devised in psycholinguistic settings to describe some aspects of the human parser, have been implemented to solve local ambiguities (see, e.g., (Hobbs & Bear, 1990)). After the emergence of corpus-based approaches, NLP systems often involve a combination of heuristics and statistics to guide the parser decisions (see, e.g., (Nagao, 1994)). In this paper we push further the similarities with the human parser, by adopting the novel assumption that the parsing strategy incorporates an architectural aspect of the human parser, namely its incremental nature. To our knowledge, this is the first attempt in which a psycholinguistic model of the human parser is applied to parsing real texts. The sense of "toward" in the paper title is that the result presented here is not a fully functional system yet, but a concrete assessment of the proposed methodology.
An operational account of the incrementality hypothesis, called strong incrementality, is at the core of a number of computational models of the human parser, as a parsimonious algorithmic version of incrementality (Stabler, 1994). In this version, the parser maintains a totally connected parse, while scanning the input words from left to right, with no input stored in a disconnected state. The major difficulty of implementing a strongly incremental strategy is that the connection of the next input word to the existing structure (what is called the left context) requires the construction of a substructure called the connection path (see Figure 1). In parsing naturally occurring text, this results in a high number of candidate connection paths, which yields a hard search problem, affecting the parser accuracy and speed.

Figure 1: The connection path for "steeper" in "He thinks steeper prices are to come."
In this paper we propose an incremental parsing model able to process unconstrained naturally occurring texts. At each step of parsing, the correct connection path linking the next word to the incrementally built parse tree is predicted using a neural network model. Such a prediction effectively acts like a heuristic that can guide the search process, and significantly reduce computational efforts. Designing accurate heuristics for this problem is a difficult task but, on the other hand, the availability of databases of parsed sentences (treebanks) makes the problem interesting from a machine learning perspective. In our experiments, examples are drawn from a subset of parsed sentences in the Penn Treebank (Marcus et al., 1993).
As far as we know, there exists one previous attempt at applying machine learning techniques to formulate parse decisions. Hermjakob and Mooney (Hermjakob & Mooney, 1997) used a modified ID3 algorithm to learn the moves of a shift-reduce parser. The problem is cast into a representation which relies on a rich set of morphological, syntactic and semantic features (around 200). The result is a parsing model that provides an output consisting of a phrase structure tree and a predicate-argument structure fed to a machine translation module. Though similar in spirit, the work described in this paper relies on a different machine learning technique, a recursive neural network vs. the ID3 algorithm. Also, the goal of this paper is the assessment of a methodology rather than a fully functional system: the numerical data we provide result from a reduced learning domain which is exclusively syntactic (vs. the 200 features from several sources), in the form of syntactic structures. We are aware that a complete parsing model cannot be realized without the contribution of other knowledge sources, like a semantic knowledge base and a set of subcategorization frames (or predicate-argument structures), but the results we have obtained with the skeletal framework clearly indicate the effectiveness of the approach.
Instances in our learning domain are naturally represented as incremental trees, obtained by joining a candidate connection path to some left context (another incremental tree). Uncertainty about the selection of the correct candidate can be naturally modeled in a probabilistic setting. In particular, the prediction task consists of estimating the probability that a given attachment between a given incremental tree and a candidate path is correct. As a major advantage of this formulation, the learner outputs a continuous ranking of alternatives so that best candidates can be tried first.
The probabilistic and supervised nature of the learning task (correct candidates are known for parsed sentences in the training set) suggests a solution based on connectionist models. The structured nature of instances requires the adoption of a recently introduced class of connectionist architectures, called recursive neural networks (Goller & Kuchler, 1996; Sperduti & Starita, 1997; Frasconi et al., 1998). Unlike feedforward neural networks, which are grounded on attribute-value representations of data, recursive networks can adaptively process labeled graphs, effectively taking advantage of relations among atomic entities that constitute a single instance. Representation and processing of structured information is not a recent area of interest in connectionism. Early attempts include distributed reduced descriptors by Hinton (1990), Recursive Auto-Associative Memories (RAAM) by Pollack (1990) and Holographic Reduced Representations by Plate (Plate, 1995). Recursive neural networks, however, are the first architecture that can exploit the supervised learning paradigm on structured data. They have been successfully applied to chemistry (Bianucci et al., 2000), pattern recognition (Francesconi et al., 1997), and learning search heuristics for guiding a theorem prover (Goller, 1997). In this paper we describe a novel and specialized version of these networks that enables them to learn how to rank a forest of alternatives.
The paper is organized as follows: Section 2 introduces the basic notions of incremental natural language processing. In Section 3 we formulate the learning task. Section 4 reviews recursive neural networks and describes the specialized version we employ to solve the learning problem of the present paper. In Section 5 we explain in detail data preparation and the experimental setup, with a discussion of the results obtained so far.
2 Incremental parsing
Incremental processing of natural language (incrementality for short) is an intuitively plausible hypothesis upon the human language processor.

This incrementality hypothesis has received large experimental support in the psycholinguistic community over the years: the shadowing experiments (Marslen-Wilson, 1973) proved that people are able to interpret speech word-by-word with a latency time of just 250 ms¹; data about head-final language processing² suggest that the human parser assembles substructures eagerly before reaching the structure head (Bader & Lasser, 1994; Frazier, 1987; Kamide & Mitchell, 1999; Yamashita, 1994); finally, eye-movement studies of visual object recognition show that the referent in a visual context of a noun phrase with pre-modifiers is computed well before reaching the nominal head (Eberhard et al., 1995).
The incrementality hypothesis poses some constraints upon the general parsing strategy. In its general form, the hypothesis implies that the semantic interpretation of some fragment of the sentence is available as the scan of the input material proceeds from left to right. This does not involve any specific constraint on how the interaction between the syntactic analysis and the semantic interpretation can be implemented at the algorithmic level. One possible (and parsimonious) solution to the interaction problem is offered by the strongly incremental version of the hypothesis (Lombardo et al., 1998; Milward, 1995; Stabler, 1994; Steedman, 1989; Sturt & Crocker, 1996). The interaction occurs at the word level, and compels the syntactic analyzer to keep a totally connected structure at all times (that is, the semantic interpreter has no means to work on disconnected items³). Parsing proceeds to the right. Here are the formal definitions⁴.

¹ In a shadowing experiment, subjects listen to passages from a novel and say words out loud as fast as they can. Subjects do not know that certain words in the passage are mispronounced. In many cases, subjects tend to utter words correctly with no loss of time with respect to the correct pronunciation. These and other data have been interpreted as a proof that people tend to interpret the meaning of sentences as soon as they can, probably on a word-by-word basis.

² Head-final languages are languages where constituents feature the head word in the rightmost position. Examples are Dutch, German and Japanese.

³ This is an assumption upon the semantic interpreter: it could well be that the human processor has some means of interpreting the disconnected pieces of structure in memory. However, the latter is an unparsimonious assumption.
Given a sentence $s = w_0 w_1 \cdots w_{|s|-1}$ and a tree $T$ for it (whose internal nodes are labeled by nonterminal symbols and whose leaves are labeled by words), we define recursively the incremental trees $T_i$ ($i = 0, 1, \ldots, |s|-1$) spanning $w_0 \cdots w_i$ as follows:

- $T_0$ consists of the chain of nodes and edges from $w_0$ to its maximal projection;
- $T_i$ consists of all the nodes and edges in $T_{i-1}$ and the chain of nodes and edges from $w_i$ to $N$, where $N$ is
  - either a node of $T_{i-1}$,
  - or the lowest node of $T$ dominating both the root of $T_{i-1}$ and $w_i$ (in this case $T_i$ also includes the edge from $N$ to the root of $T_{i-1}$).

Given two incremental trees $T_1$ and $T_2$, we define the difference between $T_1$ and $T_2$ as the tree formed by all the edges which are in $T_1$ and not in $T_2$, and all the nodes touched by such edges.

Now, given a sentence $s = w_0 w_1 \cdots w_{|s|-1}$ and a tree $T$ for it, the connection path for $w_i$ is the difference between the incremental trees $T_i$ and $T_{i-1}$. Moreover,

- a node both in $T_i$ and in $T_{i-1}$ is called an anchor (that is, a node where the connection path anchors to $T_{i-1}$);
- the node labeled by the POS tag of $w_i$ is called a foot.
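To make the tree-difference definition concrete, here is a minimal sketch in Python, under the assumption (ours, not the paper's) that a tree is encoded as a set of (parent, child) edges with uniquely identified nodes:

```python
def connection_path(t_i, t_prev):
    """Connection path for w_i: the edges in T_i and not in T_{i-1},
    plus all the nodes touched by such edges (trees are sets of edges)."""
    edges = t_i - t_prev
    nodes = {n for edge in edges for n in edge}
    return edges, nodes

def anchors(path_nodes, t_prev):
    """Anchors: nodes belonging both to the connection path and to T_{i-1}."""
    prev_nodes = {n for edge in t_prev for n in edge}
    return path_nodes & prev_nodes
```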
In Figure 2, we show the sequence of incremental trees for a sentence of the corpus.
When the parser tries to connect an input word $w_i$ to the incremental tree $T_{i-1}$ (i.e. the structure that spans all the words that precede the current word), it computes a connection path, that is the structural portion that must be built in order to yield $T_i$. However, the number of paths that allow linking $w_i$ with $T_{i-1}$ can be high (see Section 5 for a quantitative estimation on the corpus). Ambiguity comes in two forms: the number of possible anchors on the right edge of $T_{i-1}$, and the number of alternative connection paths departing from some anchor (see Figure 3).
⁴ These definitions introduce two simplifications with respect to (Lombardo & Sturt, 1999): first, we include in the connection path the edges that link a word and its maximal projection, that is the substructure headed by that word; second, we neglect the presence of empty nodes (traces) between the lexical elements $w_{i-1}$ and $w_i$. Both simplifications make the design of the learning domain easier, because we are not considering the distinction between headed and headless nodes in connection paths (see Section 5) and co-referring symbols, respectively.
Figure 2: The incremental trees of the sentence "It has no bearing on the performance of our company stock." Nodes are labeled with the incremental tree that includes them for the first time.
Figure 3: Legal attachment of a connection path to incremental trees. The sentence in this example is "It has no bearing on the performance of our company stock." (a) An incremental tree $T_{i-1}$; nodes marked as "anchor" are potential anchor points for joining a connection path. (b) After one of the potential anchor points has been selected (NP in the example), several alternative paths can be joined to form $T_i$; in the example, these are all the paths having NP as an anchor and IN (the POS tag of "on") as foot. The correct path for this sentence is highlighted.
A selection procedure chooses the best connection path and anchor for continuation, and instantiates it to generate the new incremental tree $T_i$. The better the selection procedure, the fewer parsing decisions are wrong.
The success of a selection depends on the ultimate task of the parsing model. If the goal is to simulate the human parser (this is interesting from a psycholinguistic standpoint), then the selection procedure must return the connection path that would be chosen by a human, even if it is wrong. In fact, it can happen that first-pass analyses lead to a parsing breakdown, and the control goes to a reanalysis module, a quasi-necessary component of a psycholinguistic model (Fodor & Ferreira, 1998). On the contrary, if the goal is the design of an efficient automatic parser, then the selection procedure must minimize the number of failures due to the wrong choice of the connection path. The goal of this paper is to evaluate whether machine learning algorithms can effectively solve the local ambiguity due to the selection of the connection path. Thus, the reference task considered in this paper is the design of an efficient automatic parser. In the next section we formalize the learning task.
3 Formal definition of the learning problem
In the following, we assume that a treebank (a corpus of sentences with associated parse trees) is available. A treebank is denoted as $B = \{(s^{(p)}, T(s^{(p)})); \; p = 1, \ldots, P\}$, where $s^{(p)}$ is a generic sentence and $T(s^{(p)})$ its parse tree.
3.1 Universe of connection paths
The universe of connection paths $U(B)$ is the set of all connection paths that can be extracted from $T(s^{(p)})$, $p = 1, \ldots, P$. This set of paths can be obtained by running a simulation algorithm that computes all the incremental trees associated with the sentences in the corpus. The algorithm scans the sentence covered by each tree in a word-by-word fashion from left to right and, for each word $w_i$, finds the legal connection paths which would be built in order to attach $w_i$ to the current left context $T_{i-1}$. The algorithm simulates the building of structure by marking, during the scan of the tree, the branches which would be built by a perfectly informed incremental parser. More details are explained in (Lombardo & Sturt, 1999). Given a large enough corpus, we can expect that $U(B)$ will provide an approximately complete coverage of rules for covering all the sentences in the language under consideration.
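A possible shape for this simulation is sketched below, under our own assumptions (incremental trees are precomputed per sentence as sets of edges, and an anchor_and_foot helper extracts the labels used for indexing; neither is an API from the paper):

```python
from collections import defaultdict

def universe_of_paths(corpus_trees, anchor_and_foot):
    """Collect U(B): every connection path observed while scanning each parsed
    sentence left to right, indexed by (anchor label, foot label).
    corpus_trees: for each sentence, the list [T_0, ..., T_{|s|-1}] of
    incremental trees, each represented as a frozenset of edges."""
    universe = defaultdict(set)
    for trees in corpus_trees:
        for i in range(1, len(trees)):
            path = frozenset(trees[i] - trees[i - 1])  # tree difference
            universe[anchor_and_foot(path, trees[i - 1])].add(path)
    return universe
```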
3.2 The forests of candidates
Suppose we are given a new sentence $s = w_0, \ldots, w_{|s|-1}$ (not in the corpus $B$), and suppose that at stage $i$ of parsing we know the correct incremental tree $T_{i-1}(s)$ spanning $w_0, \ldots, w_{i-1}$. The goal of an incremental parser is then to compute the next tree $T_i(s)$ in order to accommodate the next word $w_i$. Assuming that $U(B)$ has a rich enough coverage, we know that $T_i(s)$ can be obtained by joining $T_{i-1}(s)$ to one of the paths in $U(B)$. However, other trees spanning $w_0, \ldots, w_i$ can be generated by legally attaching other connection paths. In this regard, recall from Section 2 that an attachment is legal if the anchor matches one of the nodes on the right edge of the tree and the foot matches the POS tag of $w_i$ (see Figure 3). Legal foots can be easily obtained by running a part-of-speech tagger on the sentence. The set of trees obtained by legally attaching $T_{i-1}(s)$ to any path in $U(B)$ is called the forest of candidates for word $w_i$ within sentence $s$, denoted $F_i(s) = \{T_{i,1}(s), \ldots, T_{i,m_i}(s)\}$. Of course, only one of the trees in $F_i(s)$, $T_{i,j^*}(s)$, is the correct one (i.e., it is a subgraph of the actual parse tree $T$ for the sentence), but for unparsed sentences it is unknown. If an oracle were available to select the correct incremental tree from the candidate set, then incremental parsing would become a trivial procedure. The machine learning algorithms in this paper aim to accurately approximate such an oracle.
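In code, forest construction reduces to a double loop over legal anchors and indexed paths; a sketch assuming the universe built above and a hypothetical attach(tree, anchor, path) join operation supplied by the caller:

```python
def candidate_forest(t_prev, right_frontier, foot_tag, universe, attach):
    """Build F_i(s): legally attach to T_{i-1} every path in U(B) whose anchor
    label matches a node on the right frontier of T_{i-1} and whose foot
    matches the POS tag of w_i. right_frontier: (node, label) pairs."""
    forest = []
    for anchor_node, anchor_label in right_frontier:
        for path in universe.get((anchor_label, foot_tag), ()):
            forest.append(attach(t_prev, anchor_node, path))
    return forest
```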
3.3 The learning task
We propose that the learning algorithm rely on a statistical model in charge of assigning probabilities of correctness to each candidate tree. Parameters of the model will be estimated from examples. The main advantage of a probabilistic formulation is that during recall, when unseen sentences are processed, the model can be effectively employed to rank alternative trees, sorting them by decreasing probability of correctness. In this way, the parser can try first candidate trees with higher probability of correctness. If we look at parsing as a search problem (where the goal is the full parse tree for $s$) then ranking candidates can be seen as a heuristic evaluation function that guides the search. A perfect predictor (i.e., one that always assigns the highest probability to the correct tree) would give a perfect heuristic, reducing the branching factor of the search problem to 1. This means that two learners should not be compared by simply computing the number of failures to assign the highest probability to the correct tree (as in a winner-take-all scheme). In fact, we need the number of candidates ranked above the correct tree to be small. Hence, one performance measure for the predictor can be the average position of the correct tree in the list of candidates, after trees have been sorted by decreasing probability of correctness.
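For concreteness, this measure (together with the fraction of forests whose correct tree is ranked first, used later in Section 5.3) can be computed as follows; the (scores, j_star) input format is our assumption:

```python
def rank_metrics(forests):
    """forests: list of (scores, j_star) pairs, where scores[j] is the
    predicted probability of candidate j and j_star is the index of the
    correct tree. Returns R (average 1-based rank of the correct tree after
    sorting by decreasing score) and F (fraction of forests ranked first)."""
    ranks = []
    for scores, j_star in forests:
        order = sorted(range(len(scores)), key=lambda j: -scores[j])
        ranks.append(order.index(j_star) + 1)
    R = sum(ranks) / len(ranks)
    F = sum(r == 1 for r in ranks) / len(ranks)
    return R, F
```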
In this paper, we assume that the model does not make use of lexical information, that is, correctness is assessed by only observing nonterminal symbols in each tree, discarding all the leaves (which are labeled by words). There are two reasons for this choice. First, incorporating lexical knowledge in a connectionist model necessarily involves a large number of parameters to be estimated, requiring large amounts of training data. In our experiments (see Section 5) we show that even a limited amount of training sentences is sufficient to achieve interesting performance, a valuable feature of the model for integration in a semi-automatic annotation tool. Second, we are interested in showing that the pure syntactic level contains significant information under the incrementality assumption. In the framework we are outlining, lexical knowledge can be employed after learning to detect parsing breakdowns and to implement recovery mechanisms.
There are two possible formulations of the learning task:

- Each instance is a candidate incremental tree. The instance space is then partitioned into positive and negative instances. Candidate $T_{i,j}(s)$ is a positive instance if and only if it is the correct incremental tree for $s$ at position $i$. In this formulation, the learner must classify labeled trees into one of two classes.
- Each instance is the whole forest of candidate trees $F_i(s)$. The task consists of learning the correct member of the forest, which can be identified by a multinomial variable with realizations in $\{1, \ldots, m_i(s)\}$. Training examples are pairs $(F_i(s), j^*_i(s))$, where the input portion is the candidate forest and the output portion (supervision) is the integer $j^*_i(s) \in [1, m_i(s)]$ identifying the correct tree.

While the first formulation relies on the standard concept learning framework, the second formulation requires more attention because it does not involve classification as commonly intended. In fact, one instance is a "bag" of objects (trees), in a way that loosely resembles multi-instance learning (Auer, 1997; Dietterich et al., 1997). In our case (as a major difference with respect to multi-instance learning), each bag contains exactly one positive instance.
The importance of our second formulation stems from the fact that it more accurately describes the kind of uncertainty we are dealing with. To explain this, let us consider the probabilistic interpretations of the two formulations. In the first case, the learner makes decisions about boolean output variables $O_{i,j}(s)$, each indicating whether a given candidate incremental tree is correct for position $i$ of sentence $s$:

$$y_{i,j}(s) = P(O_{i,j}(s) = \text{correct} \mid T_{i,j}(s)). \qquad (1)$$

When computing this probability the learner is not informed about competing candidate trees. In the second formulation, the learner makes decisions about a multinomial output variable $O_i(s)$ whose realizations are integers in $[1, \ldots, m_i(s)]$ identifying the correct tree in the bag associated with word $w_i$ in sentence $s$:

$$y_{i,j}(s) = P(O_i(s) = j \mid F_i(s)) \qquad (2)$$

where $O_i(s) = j$ means that $T_{i,j}(s)$ is the correct tree. In this case, predictions are conditional on the whole set of candidates, thus introducing competition among trees in the forest. Assuming that the universe of paths can cover the sentence $s$ (i.e. the grammar is sufficiently expressive), we have

$$\sum_{j=1}^{m_i(s)} y_{i,j}(s) = 1. \qquad (3)$$

This is because one and only one candidate tree is the correct one (but which one is unknown). If the grammar is not sufficiently expressive, predictions are meaningless but, on the other hand, in that case parsing would fail anyway. Hence, as far as the learning task is concerned, we can assume that (3) holds.
In the following section we shall explain how this learning task can be solved using a connectionist approach. To simplify notation, reference to the particular sentence $s$ will be omitted in the following discussion.
4 Neural network architecture
Neural network architectures for supervised learning are well known in the case of attribute-value representations. However, as explained in Section 3, the learning task that can instruct parse decisions involves making predictions about structured objects (namely, labeled trees). Feedforward neural networks cannot take labeled trees as input because they cannot deal with inputs of variable size and structure, such as the relations among constituents. In principle, recurrent neural networks (Kolen & Kremer, 2000) might be employed by converting the tree into a sequence (e.g., by visiting it in preorder). However, such an approach would lead to difficult learning tasks for at least two important reasons. First, the exponential relation between the number of nodes and the height of a tree leads to very long sequences even for trees of modest height. This has a very negative impact on learning because of vanishing gradient problems (Bengio et al., 1994). Second, once a tree is linearized into a sequence, hierarchical relations are blurred and intermixed, making generalization to new instances hard (Frasconi et al., 1998). These considerations motivate a different class of architectures, called recursive neural networks, that generalize recurrent neural networks to deal with labeled directed acyclic graphs (Goller & Kuchler, 1996; Sperduti & Starita, 1997; Frasconi et al., 1998). In Section 4.1 we shortly review the essential architectural and algorithmic ideas for supervised classification of data structures; more details can be found in (Frasconi et al., 1998). In Section 4.2 we reformulate the basic classification model for ranking alternative incremental trees according to the second formulation outlined in Section 3.
4.1 Recursive neural networks
From a connectionist (or statistical) standpoint, supervised learning for classification is formalized as the problem of estimating the conditional probability $P(O \mid I)$. The output $O$ is a discrete variable associated with the class and $I$ is a generic input object. The general theory developed in (Frasconi et al., 1998) allows $I$ to be a directed acyclic graph with a supersource. Here we are interested in the case of the input $I$ being a labeled ordered $m$-ary tree. By ordered we mean that, for each vertex $v$, a total order is defined on the $m$ children of $v$. Specifically, $ch[v]$ denotes the ordered $m$-tuple of vertices whose elements are $v$'s children. If $v$ has $k < m$ children, then the last $m - k$ elements of $ch[v]$ are filled with a special entry nil, denoting a missing child. $I(v)$ denotes the label attached to vertex $v$. Although the general theory allows mixing discrete, continuous, and possibly multivariate labels, in this paper we are only interested in the case of discrete labels, i.e. random variables with realizations in a finite alphabet $\mathcal{I} = \{I_1, \ldots, I_N\}$ (in our application, this is the set of nonterminal symbols in the treebank).

The basic neural network architecture is based on the following recursive state space representation:

$$x(v) = f(x(ch[v]), I(v))$$
$$o = g(x(r)). \qquad (4)$$

Figure 4: The recursive neural network model.
In the above equation, $x(v) \in \mathbb{R}^n$ denotes the state vector associated with node $v$ and $x(ch[v]) \in \mathbb{R}^{mn}$ is a vector obtained by concatenating the components of the state vectors contained in $v$'s children. $f: \mathcal{I} \times \mathbb{R}^{mn} \to \mathbb{R}^n$ is the state transition function that maps the states at $v$'s children and the label at $v$ into the state vector at $v$. $g: \mathbb{R}^n \to [0,1]$ is the output function, which maps the state $x(r)$ (at the root $r$ of the tree) into the probability $P(O \mid I)$ that the input tree is a positive example. States in Eq. (4) are updated bottom-up, traversing the tree in post-order. If a child is missing, the corresponding entries in $x(ch[v])$ are filled with the frontier state $\bar{x}$, which is associated with the base step of the recursion. A typical choice is $\bar{x} = 0$. This computational scheme closely resembles the one used by frontier-to-root tree automata (Thacher, 1973), but in that case states are discrete whereas here states are real vectors.
The transition function $f$ is implemented by a feedforward neural network, according to the following scheme (see Figure 4):

$$a_h(v) = \omega_{h,0} + \sum_{j=1}^{N} \omega_{hj}\, z_j(I(v)) + \sum_{k=1}^{m} \sum_{\ell=1}^{n} w_{hk\ell}\, x_\ell(ch_k[v]) \qquad (5)$$

$$x_h(v) = \tanh(a_h(v)), \qquad h = 1, \ldots, n \qquad (6)$$
where $x_h(v)$ denotes the $h$-th component of the state vector at vertex $v$, $z_j(I_q) = 1$ if $j = q$ and zero otherwise (i.e., we are using a one-hot encoding of symbols), $ch_k[v]$ denotes the $k$-th child of $v$, and $\omega_{hj}, w_{hk\ell}$ are adjustable weights. It is important to remark that the weights in the transition network are independent of the node $v$ at which the transition function is applied. This can be seen as a form of weight sharing.
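The recursion in Eqs. (4)-(6) can be written down compactly; the following numpy sketch uses the dimensions reported in Section 5 and random untrained weights, with data structures of our own choosing rather than the authors' implementation:

```python
import numpy as np

N, m, n = 71, 14, 40      # label alphabet, max outdegree, state size (Section 5)
rng = np.random.default_rng(0)
W_label = rng.normal(scale=0.1, size=(n, N))      # weights omega_{hj} on z(I(v))
W_state = rng.normal(scale=0.1, size=(n, m * n))  # weights w_{hkl} on x(ch[v])
bias = np.zeros(n)                                # omega_{h,0}

class Node:
    def __init__(self, label, children=()):
        self.label = label                # index of the nonterminal, 0..N-1
        self.children = list(children)    # at most m children

def forward(v):
    """Bottom-up state update, Eqs. (4)-(6); missing children contribute
    the frontier state 0, the base step of the recursion."""
    x_ch = np.zeros(m * n)
    for k, child in enumerate(v.children):
        x_ch[k * n:(k + 1) * n] = forward(child)
    z = np.zeros(N)
    z[v.label] = 1.0                      # one-hot encoding of the label I(v)
    return np.tanh(bias + W_label @ z + W_state @ x_ch)   # Eqs. (5)-(6)

x_root = forward(Node(0, [Node(1), Node(2)]))     # tiny usage example
```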
Figure 5: Network unrolling and backpropagation through structure.
The output function $g$ is implemented by another network:

$$o = \frac{1}{1 + e^{-a}} \qquad (7)$$

where

$$a = w_0 + \sum_{\ell=1}^{n} w_\ell\, x_\ell(r) \qquad (8)$$

$w_\ell$ being adjustable weights.
Supervised learning is based on the maximum likelihood principle as in many other neural network models. In the case of classification, the likelihood function has the form of a cross-entropy. Let $D = \{(I^{(k)}, t^{(k)}); \; k = 1, \ldots, K\}$ be a training set of labeled trees (where $t^{(k)}$ denotes the class of $I^{(k)}$). Then the cost function is the negative log-likelihood:

$$J(D) = -\sum_{k=1}^{K} \left( t^{(k)} \log o^{(k)} + (1 - t^{(k)}) \log(1 - o^{(k)}) \right) \qquad (9)$$

$o^{(k)}$ being the output of the network after processing the $k$-th tree. Optimization is solved by gradient descent. In this case, gradients are computed by a special form of backpropagation on the feedforward network obtained by unrolling the state transition network according to the topology of the input tree $I$. The algorithm was first proposed in (Goller & Kuchler, 1996) and is referred to as backpropagation through structure (BPTS). The main idea is depicted in Figure 5: on the left, an example input tree is shown. On the right, the output network $g(\cdot)$ is attached to the replica of $f(\cdot)$ associated with the root. After recursion (4) is completed, the state vector $x(r)$ at the root $r$ contains an adaptive encoding of the whole tree. This is similar to the encoding performed by recursive auto-associative memories first proposed by Pollack (Pollack, 1990), but here the encoding results from a supervised learning mechanism. Forward propagation (dark arrows) is bottom-up and follows Eq. (4). Backward propagation (light arrows) is employed to compute the gradients and proceeds from the root to the leaves. Note that gradient contributions must be summed over all the replicas of the transition network to correctly implement weight sharing.
4.2 Recursive networks for ranking alternatives
As explained in Section 3, an alternative formulation of the learning task establishes that the input portion of each example is a forest of candidate trees. Such an object is not connected and thus not readily suitable as the input to a recursive neural network. In this formulation, moreover, we do not have a standard classification task. The network has to pick one out of $m_i$ candidates, and the number of candidates depends on the particular instance (this is somewhat similar to having a variable number of classes). In this section we describe the required modifications to the basic model presented above, to make it suitable for solving the natural language learning task at hand.
The basic idea is rather simple and consists of processing each tree in the forest separately, and then integrating the results with a normalization step to satisfy Eqs. (2) and (3). As a first step, we modify the output function $g(\cdot)$ (see Eqs. (7) and (8)) of the recursive model as follows:

$$o = w_0 + \sum_{\ell=1}^{n} w_\ell\, x_\ell(r)$$

i.e. the output network reduces to a (linear) weighted sum of the state components at the root of each tree. In this way we obtain a vector $o_i = [o_{i,1}, \ldots, o_{i,m_i}]$ whose components are real numbers in $(-\infty, +\infty)$ associated with each candidate in $F_i$. Correctness probabilities are finally obtained by taking the softmax function of $o_i$:

$$y_{i,j} = \frac{e^{o_{i,j}}}{\sum_{\ell=1}^{m_i} e^{o_{i,\ell}}} \qquad (10)$$

In this way we satisfy the multinomial model expressed by Eqs. (2) and (3), and $y_{i,j}$ can be interpreted as the probability that the $j$-th candidate tree in $F_i$ is the correct tree. The same network is used to process each tree in the forest, so the resulting unrolled network is a forest of neural networks sharing the same weights for the transition function $f(\cdot)$ and the output function $g(\cdot)$. An immediate consequence of this choice is that the probabilities $y_{i,j}$ are not affected by a permutation of the trees in the candidate set, a property which is clearly desirable at this level.
The cost function (the negative log-likelihood according to the multinomial model) is written as follows:

$$J(D) = -\sum_{p=1}^{P} \sum_{i=0}^{|s^{(p)}|-1} \log y_{i,j^*} \qquad (11)$$

where $D$ denotes the training set, the first sum ranges over sentences in the training set, the second sum ranges over words within each sentence, and $j^*$ is the index of the correct incremental tree in the candidate list (which is known in the training set).

At this point, the calculation of the delta errors needed to compute gradients is straightforward. It follows the classic scheme of multinomial classification described in (Rumelhart et al., 1995), which we report here for completeness. Denoting

$$\delta_{i,j} := -\frac{\partial J(D)}{\partial o_{i,j}}$$

we have

$$\delta_{i,j} = \begin{cases} 1 - y_{i,j} & \text{if } j = j^* \\ -y_{i,j} & \text{if } j \neq j^* \end{cases}$$

For each network $j$ in the forest, $\delta_{i,j}$ is injected as the delta output error, and subsequently passed to the BPTS algorithm. Of course, since the networks in the forest share the same weights, gradients are computed by summing all the contributions relative to the same shared weight.
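Putting Eqs. (10) and (11) together with the delta errors, the ranking head is only a few lines on top of the recursive encoder; a sketch assuming the root state vectors have already been computed (e.g., with a forward pass like the one in Section 4.1), with names of our own choosing:

```python
import numpy as np

def rank_forest(root_states, w, w0=0.0, j_star=None):
    """root_states: (m_i, n) array of root state vectors, one per candidate.
    Returns the softmax correctness probabilities of Eq. (10) and, if the
    index j_star of the correct tree is given, the delta errors for BPTS."""
    o = w0 + root_states @ w          # linear output, one score per tree
    o = o - o.max()                   # shift for numerical stability
    y = np.exp(o) / np.exp(o).sum()   # Eq. (10)
    if j_star is None:
        return y
    delta = -y.copy()                 # -y_{i,j} for j != j*
    delta[j_star] += 1.0              # 1 - y_{i,j*} for the correct tree
    return y, delta
```

Sorting the candidates by decreasing y then gives the ranking evaluated in Section 5.3.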
5 Implementation and results
5.1 Data preparation and experimental setup
The experiments were run on a set $B$ of 2,000 parsed sentences, randomly extracted from the Wall Street Journal section of the Penn II Treebank. The universe of paths $U(B)$ was extracted from these sentences using the incremental parsing simulation algorithm described in Section 2. This resulted in a set of 1,675 connection paths, which we kept indexed by anchor and foot. Then, we selected three disjoint subsets of 100, 100, and 500 sentences. These three sets were used to construct the training set, the validation set⁵, and the test set, respectively. Each sentence in these sets was processed to extract a forest of candidate incremental trees as follows. A connection path is valid for $T_i$ if its anchor and its foot match the corresponding entities in $T_i$. Therefore, for every incremental tree we have to identify an anchor (that can be any node on the right frontier of $T_{i-1}$) and a foot (given by the rightmost leaf of $T_i$). At this point, the universe $U(B)$ is searched for legal paths matching every anchor-foot pair (see Figure 3). In turn, legal connection paths are then joined to $T_{i-1}$ to form the forest $F_i$. On the available data, the average cardinality of a candidate set $F_i$ is 56, while the maximum cardinality is 388. In the training set, the total number of words is 2,685 and the total number of candidates is 183,841. In the test set, the total number of words is 11,664 and the total number of candidates is 698,735.
Nodes are labeled with nonterminal symbols (syntactic categories) of the context-free grammar used in the Penn II Treebank. We have decided to use a subset of the nonterminal symbols in order to filter out information that is not relevant to our task, or that would cause some rules to be learned on the basis of specific and rare examples. As a result, we reduced the nonterminal symbols to a subset of 71 elements, collapsing some of the nonterminal variations into their base form. Table 1 shows the details of the merge we have performed.
The first recursive network used for the experiments implements the model described in Section 4.1, and has the following specifications: $m = 14$ (maximum outdegree), $N = 71$ labels (one-hot encoded), and $n = 40$ state components, yielding a total of about 25,000 adjustable parameters. When using 0-1 classification, the ratio of positive to negative examples turns out to be very small (2,685/181,156 = 0.015). This follows from the fact that for each positive example there is a large number of competing candidate incremental trees. The network is trained to minimize the number of misclassifications, but this goal would be easily achieved by always predicting 0 (regardless of the input), since the base accuracy⁶ for this learning problem is 98.5%. This theoretical expectation has been practically verified, and our experiments using all the available examples in each epoch did not produce any useful results (the network output was always very close to zero). We found that a viable way of dealing with this problem is to present the negative examples in the training phase with a lesser frequency than the positive ones, in order to re-establish an average ratio of 1:1 positive to negative examples per training epoch.
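One simple way to realize such rebalancing (the paper does not spell out its exact sampling scheme, so this is only a plausible sketch) is to redraw a negative subsample at every epoch:

```python
import random

def balanced_epoch(positives, negatives, seed=None):
    """Yield a 1:1 mix of all positive examples and an equally sized random
    subsample of the negatives, reshuffled for one training epoch."""
    rng = random.Random(seed)
    epoch = list(positives) + rng.sample(list(negatives), k=len(positives))
    rng.shuffle(epoch)
    return epoch
```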
⁵ The validation set was employed to estimate generalization performance, in order to stop gradient descent and prevent overfitting.

⁶ The base accuracy is the fraction of correctly classified instances when always predicting the most frequent class.
ADJP: ADJP-ADV, ADJP-CLR, ADJP-PRD
ADVP: ADVP-CLR, ADVP-DIR, ADVP-DIR-CLR, ADVP-EXT, ADVP-LOC, ADVP-LOC-CLR, ADVP-LOC-PRD, ADVP-MNR, ADVP-PRP, ADVP-PUT, ADVP-TMP
NP: NP-ADV, NP-BNF, NP-CLR, NP-DIR, NP-EXT, NP-HLN, NP-LGS, NP-LOC, NP-LOC-PRD, NP-MNR, NP-PRD, NP-SBJ, NP-TMP, NP-TTL
PP: PP-BNF, PP-CLR, PP-DIR, PP-DIR-CLR, PP-DTV, PP-EXT, PP-LGS, PP-LOC, PP-LOC-CLR, PP-LOC-PRD, PP-MNR, PP-PRD, PP-PRP, PP-PUT

Table 1: Reduction of nonterminal symbols. The variations listed after each base symbol are merged into that symbol.
In a second experiment we implemented the model described in Section 4.2. In this case, each tree in the forest is processed separately by a recursive network with one linear output unit. The resulting numbers are then normalized with the softmax function. The recursive network used in this experiment had the same size as the one described above (i.e., about 25,000 weights). In this case, since the input to the network is an entire forest of candidates, the unbalance problem of positive/negative examples does not occur. All the trees (both the correct and the incorrect ones) are grouped into 2,685 forests and are employed to train the network in each epoch.

In all the experiments, we estimated generalization (the average position $R$ of the correct candidate) by running the partially trained networks in recall mode on the 100 validation sentences. Gradient descent was stopped when the performance on the validation set was maximum. This early stopping is expected to be useful in order to prevent overfitting. In the case of 0-1 classification, training was stopped after 1,000 epochs. In the case of multinomial ranking, 90 epochs were sufficient (note, however, that fewer trees per epoch are processed in the first case, due to negative example subsampling).
5.2 Comparison with psycholinguistic preferences

A fair comparison with other approaches to scoring parse decisions can involve the psycholinguistic preferences mentioned in the introduction, Late Closure (LC) and Minimal Attachment (MA), which are at work in many parsing models because of their immediate implementation and the reasonable results they yield. Briefly, LC is the preference for attaching a new constituent at the lowest legal site on the right edge; MA is the preference for the attachment that involves the creation of the minimum number of new nodes.
The LC preference explains the surprise effect in the interpretation of a sentence. For example, consider "The president declared that he will resign yesterday." According to LC, "yesterday" is initially attached to the lower verb "will resign," but this attachment is then rejected on the basis of semantic knowledge ("yesterday" is incompatible with the future tense of "will resign"); on the contrary, the higher attachment to "declared" is semantically plausible. MA expresses a preference for flatter structures. In "John wrote a letter to Mary," "to Mary" is preferably attached to "wrote" rather than to "a letter," because of the smaller number of nodes involved. Notice that in this case MA overrides LC, which would predict an attachment to "a letter."
Though it is common in the psycholinguistic literature to describe initial parsing decisions as the result of the application of a number of preferences, (almost) everybody does still believe that preferences for recency (as in LC) and structural simplicity (as in MA) have an influence on parsing decisions in some form or other. The reason for considering LC and MA in the context of this paper is that such preferences are of an exclusively syntactic nature, and (though overridden by lexical and semantic heuristics) they are invoked when other sophisticated methods do not produce any valuable result. For example, Weischedel et al. (Weischedel et al., 1993) report a 75% correctness of the LC preference on 166 PP-attachments in the Penn Treebank which remained unsolved after a run of a partial parser based on a probabilistic model⁷. Lombardo and Sturt (Lombardo & Sturt, 1999) report on a sophisticated form of MA in the context of a strongly incremental parser. The sophistication is due to the distinction of syntactic nodes between headed and headless: headed nodes are those nodes for which the head word has already been scanned; otherwise, they are headless. Headedness reformulates MA from the number of nodes in a constituent to the number of headless nodes in a constituent: this reformulation is more consistent with the view that syntactic structure cannot be adequately explained without including lexical knowledge in a theory of natural language syntax. For example, a rule of a context-free grammar for natural language may exist only if there exists a word (in the lexicon) that licenses it. The results on a sample of about 50,000 words from the Wall Street Journal section of the Penn Treebank were that 82% of connection paths had 0 headless nodes, 15% had 1 headless node, 2% had 2 headless nodes, and 1% had 3 headless nodes.

⁷ Note that this 75% correctness cannot be compared with the results described in this paper (Section 5.3), since they are based on two different parsing models.
Since LC and MA are often in conflict with each other (see the example "John wrote a letter to Mary" above), we have collected the data for the two preferences by alternating which one overrides the other: in one trial, LC overrides MA (LC over MA), that is, incremental trees are ranked higher if the connection path anchors low and then with a minimum number of nodes; in the other trial, MA overrides LC (MA over LC), that is, incremental trees are ranked higher if they have a minimum number of nodes and then the connection path anchors low. In the next section we report the results.
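Both orderings amount to sorting by a lexicographic key; a minimal sketch, assuming each candidate records the depth of its anchor on the right edge (larger meaning lower in the tree) and the number of new nodes in its connection path (both field names are ours):

```python
def rank_lc_over_ma(candidates):
    """LC over MA: prefer the lowest anchor, break ties with fewer new nodes."""
    return sorted(candidates, key=lambda c: (-c.anchor_depth, c.new_nodes))

def rank_ma_over_lc(candidates):
    """MA over LC: prefer fewer new nodes, break ties with the lowest anchor."""
    return sorted(candidates, key=lambda c: (c.new_nodes, -c.anchor_depth))
```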
5.3 Results
After training, the network assigns a probability to each candidate incremental tree. A parser can employ the network output to rank candidate trees (by decreasing probability of correctness) for continuation. Hence, the performance measure of our approach to parse decisions is based on the rank assigned to the correct tree in each forest $F_i$.
A synthesis of the results on the 500 test sentences is reported in Table 2. In the table, $R$ is the average position of the correct tree, after having sorted the candidates according to the network predictions. $R = 1$ would correspond to a perfect predictor that always ranks the correct tree in first position. The next column, $F$, is the percentage of times the correct tree is ranked in first position. Global results are reported in the first row of the table. We can see that the multinomial ranking ("ANN multin.") outperforms 0-1 classification by about one position. Of course, when the number of candidates is higher, we can expect that it is more difficult to rank the correct one in the first positions. To give a more detailed overview of prediction quality, we report $R$ separately for each size of the candidate set $F_i$. For conciseness, results are averaged over bins associated with ranges of candidate list lengths. The total number of forests in each bin is also reported in the last column. The results on the mean value $R$ clearly demonstrate that the ranking produced by the network significantly reduces the number of candidate trees at each incremental parsing step.
m_i       ANN multin.      ANN 0-1         LC over MA      MA over LC      forests
          R      F         R      F        R      F        R      F
2-388     3.25   52.4%     4.27   31.6%    6.73   44.8%    7.63   39.9%    11,664
2-3       1.25   77%       1.41   63%      1.22   82%      1.18   82%      384
4-5       1.54   60%       2.05   33%      2.52   29%      2.15   35%      349
6         1.92   59%       2.10   38%      3.08   14%      2.53   19%      191
7         1.85   50%       2.29   33%      2.64   53%      2.10   52%      320
8         1.49   76%       1.81   65%      2.74   56%      2.14   67%      254
9         2.45   34%       2.42   44%      3.69   46%      3.10   45%      250
10        1.79   60%       2.54   43%      3.94   32%      2.72   32%      305
11-13     2.31   56%       2.44   49%      5.93   21%      4.07   20%      371
14-16     1.88   65%       2.41   47%      5.33   33%      3.88   33%      391
17        2.29   49%       2.97   30%      6.56   13%      5.32   11%      154
18        1.64   79%       1.74   77%      2.90   73%      2.41   73%      344
19-20     2.88   50%       3.09   28%      6.32   38%      4.34   33%      252
21-23     2.56   48%       3.05   28%      6.49   25%      5.45   22%      385
24-27     3.25   47%       3.48   31%      5.00   42%      5.16   38%      387
28-30     2.29   54%       2.80   26%      5.80   36%      3.72   32%      353
31-34     3.12   49%       3.75   29%      6.17   38%      5.57   31%      379
35-38     2.99   52%       3.70   32%      5.54   44%      4.77   41%      381
39-41     3.14   51%       4.05   24%      5.18   45%      5.47   41%      330
42-45     2.41   59%       3.16   32%      6.48   45%      5.09   41%      373
46-49     2.67   54%       3.45   32%      7.38   52%      5.69   46%      301
50-53     3.34   53%       4.16   27%      5.63   37%      6.78   30%      370
54-58     2.84   52%       3.44   30%      7.70   51%      6.36   47%      378
59-64     3.72   51%       4.40   31%      7.23   52%      6.90   44%      357
65-69     3.57   50%       5.21   19%      8.50   54%      6.65   46%      400
70-74     5.39   49%       5.82   24%      6.68   48%      7.47   42%      380
75-81     3.31   57%       4.52   27%      7.10   58%      8.65   51%      385
82-89     4.01   46%       5.26   21%      6.20   57%      8.94   49%      353
90-99     4.29   50%       5.51   23%      8.61   50%      9.38   38%      372
100-111   4.17   42%       5.19   23%      7.13   53%      8.94   45%      397
112-126   4.80   51%       6.90   20%      8.89   51%      12.24  42%      391
127-145   4.44   49%       5.85   20%      10.63  47%      13.37  39%      388
146-173   4.76   42%       7.33   17%      9.12   49%      14.17  41%      400
174-222   7.14   33%       11.21  14%      11.69  41%      20.75  29%      398
223-388   7.02   32%       13.60  11%      17.74  23%      30.80  15%      241

Table 2: Prediction performance (see text for explanation).
Figure 6: Experimental comparison of knowledge-based heuristics (Late Closure, Minimal Attachment) and neural network predictions (multinomial ranking and 0-1 classification). The graph reports the percentage of times that, after sorting candidates by one of the heuristics, the correct tree is found within the first $k$ trees, for $k = 1, \ldots, 20$.
For example, when the number of candidates is between 100 and 111, the correct tree is ranked in position 4.17 (on average).
Columns 6-9 in Table 2 report $R$ and $F$ for the two strategies (LC over MA, and MA over LC) described in Section 5.2. The results obtained with the neural network model compare very favorably with the psycholinguistic preferences. The only exception is when there are only 2 or 3 alternative trees (in this case the MA and LC preferences slightly outperform the neural network). Moreover, multinomial ranking performs significantly better than the 0-1 approach. This can be expected theoretically, since the architecture proposed in Section 4.2 is a more accurate model of the task at hand. This can also be explained intuitively: if the correct candidate is not ranked in first position, the contribution to cost function (11) is large. On the other hand, the contribution to cost function (9) could be small if the score of the correct candidate is close to 1 but other competing candidates achieve higher scores. This explains why $F$ is consistently higher for the multinomial ranking method.
In Figure 6 we report the cumulative histogram for the four ranking methods. The graph represents the percentage of times that the correct tree is ranked within the first $k$ positions.
5.4 Discussion
Although too weak for a complete parsing model, the predictions summarized in Table 2 and Figure 6 have an immediate application in a semiautomatic module for the construction of a treebank, a resource currently missing for many languages, and of fundamental importance for corpus-based approaches. The module works by proposing incremental parses to a human annotator. It has syntactic knowledge in the form of a basic context-free grammar, which is updated when there is evidence for some new rule in the corpus. The parsing algorithm works incrementally and, for each word, displays the candidate incremental trees to the human annotator, sorted by network prediction. The annotator would then only need to check 10 or fewer candidates 94.5% of the time (on average). Using psycholinguistic preferences, on the other hand, checking 10 or fewer candidates is sufficient only 80% of the time. Moreover, if the neural network tool presents to the human annotator only the 5 best alternatives, then the correct tree is listed with probability 87.3%. Using psycholinguistic preferences, this probability decreases to 70.5%.
We can comment on these results in the NLP setting. A general comment concerns the confirmation in our data of the complexity of natural language parsing. In the strongly incremental version adopted in this work, where the problem is coded as the selection of the best incremental tree, the data on the Penn Treebank sample show that the search space has an average branching factor of 56, with a maximum factor of 388. These data confirm the necessity of having some form of sophisticated elaboration of parse decisions.

The rank position of the correct incremental tree is constantly low in the candidate list. Although we could not at the moment develop a deterministic parser with such a predictor, the result is particularly positive, especially if taken in its naive conception. Consider that corpus-based models in the literature usually take into account several cues for reaching a parse decision; on the contrary, in the experiments described here only syntactic knowledge is taken into account. Lexical knowledge, which is at the core of many parsing models (see, e.g., (Srinivas & Joshi, 1999)), is totally neglected here. Connection paths include at most POS tags. On the contrary, most models are trained on relations established between the head words of syntactic phrases; the use of syntactic categories is limited to those cases when data about lexical relations are sparse. Also, semantics is taken into account when available for limited domains, including subcategorization frames and a number of semantic features because of their relevance for disambiguation.
6 Conclusions
This paper has presented a novel methodology for parsing unrestricted texts, based upon the incrementality hypothesis, a widely held assumption about the human parser. A recursive neural network, trained on tree portions extracted from the Penn Treebank via a simulation of the incremental strategy, carries out the parse decisions in case of ambiguity. The decision takes the form of the selection of the correct incremental tree in the list of possible candidates for continuation in left-to-right processing. The learning task is formulated as the identification of the correct incremental tree in the list of candidates; the prediction is a ranking of the candidates according to the conditional probability of being the correct one. The results, although the method currently takes into account only the syntactic structure labeled with nonterminal categories, are very positive. The comparison with well-known parsing preferences in the psycholinguistic literature clearly indicates the superiority of the parse predictions provided by the network. On the architectural side, a novel contribution of this paper is the proposal of a neural network approach for learning to rank candidates. The same architecture can, in principle, be applied to other tasks in which each object is a collection of instances among which a "winner" is to be probabilistically selected.
The future applications of our results to parsing are immediate: a long-term project consists in the development of an efficient incremental parser which is informed by the network architecture in taking decisions about attachment ambiguity; a short-term project consists in the development of an interactive tool which supports human treebank annotators (Lombardo et al., 2000). In the latter case, the software tool parses the sentence from left to right (as humans do), and interacts with the annotator in case of ambiguity. The tool presents the possible alternatives ranked according to the network result, and the annotator chooses the correct one. An immediate improvement to the methodology presented here is the insertion of other features in the definition of the learning task. Corpus-based approaches usually rely on lexical and thematic knowledge for the formulation of parse decisions. Purely syntactic knowledge is empirically demonstrated to be insufficient in a realistic parsing process.
Acknowledgments

We would like to thank Patrick Sturt for providing the processed corpus of sentences used in the experiments, and for comments on an early draft of this paper.
References
Auer, P. (1997). On learning from multi-instance examples: Empirical evaluation of a theoretical approach. In Fisher, D. H. (Ed.), Proc. 14th Int. Conf. Machine Learning, pp. 21-29. Morgan Kaufmann.

Bader, M., & Lasser, I. (1994). German verb-final clauses and sentence processing. In Clifton, C., Frazier, L., & Reyner, K. (Eds.), Perspectives on Sentence Processing. Lawrence Erlbaum Associates.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.

Bianucci, A., Micheli, A., Sperduti, A., & Starita, A. (2000). Application of cascade-correlation networks for structures to chemistry. Applied Intelligence, 12, 115-145.

Collins, M. (1996). A new statistical parser based on bigram lexical dependencies. In Proc. of 34th ACL, pp. 184-191.

Dietterich, T. G., Lathrop, R. H., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2), 31-71.

Eberhard, K. M., Spivey-Knowlton, M. J., Sedivy, J., & Tanenhaus, M. K. (1995). Eye movements as a window into real-time spoken language comprehension in natural contexts. Journal of Psycholinguistic Research, 24, 409-436.

Fodor, J. D., & Ferreira, F. (Eds.). (1998). Reanalysis in Sentence Processing. Kluwer Academic Publishers.

Francesconi, E., Frasconi, P., Gori, M., Marinai, S., Sheng, J., Soda, G., & Sperduti, A. (1997). Logo recognition by recursive neural networks. In Kasturi, R., & Tombre, K. (Eds.), Graphics Recognition: Algorithms and Systems. Springer Verlag.

Frasconi, P., Gori, M., & Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Trans. on Neural Networks, 9(5), 768-786.

Frazier, L. (1987). Syntactic processing: Evidence from Dutch. Natural Language and Linguistic Theory, 5, 519-559.

Frazier, L., & Fodor, J. D. (1978). The sausage machine: A new two-stage parsing model. Cognition, 6, 291-325.

Goller, C. (1997). A Connectionist Approach for Learning Search-Control Heuristics for Automated Deduction Systems. Ph.D. thesis, Tech. Univ. Munich, Computer Science.

Goller, C., & Kuchler, A. (1996). Learning task-dependent distributed structure-representations by backpropagation through structure. In IEEE International Conference on Neural Networks, pp. 347-352.

Hermjakob, U., & Mooney, R. J. (1997). Learning parse and translation decisions from examples with rich context. In Proceedings of ACL97, pp. 482-489.

Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46, 47-75.

Hobbs, J., & Bear, J. (1990). Two principles of parse preference. In Proceedings of COLING90, pp. 162-167.

Kamide, Y., & Mitchell, D. C. (1999). Incremental pre-head attachment in Japanese parsing. Language and Cognitive Processes, 14(5-6), 631-662.

Kimball, J. P. (1973). Seven principles of surface structure parsing in natural language. Cognition, 2, 15-47.

Kolen, J., & Kremer, S. (Eds.). (2000). A Field Guide to Dynamical Recurrent Networks. IEEE Press.

Lombardo, V., Bosco, C., & Vassallo, D. (2000). Treebank annotation and psycholinguistic modeling. Submitted.

Lombardo, V., Lesmo, L., Ferraris, L., & Seidenari, C. (1998). Incremental processing and lexicalized grammars. In Proceedings of the XXI Annual Meeting of the Cognitive Science Society, pp. 621-626.

Lombardo, V., & Sturt, P. (1999). Incrementality and lexicalism: A treebank study. In Stevenson, S., & Merlo, P. (Eds.), Lexical Representations in Sentence Processing. John Benjamins.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313-330.

Marslen-Wilson, W. (1973). Linguistic structure and speech shadowing at very short latencies. Nature, 244, 522-533.

Milward, D. (1995). Incremental interpretation of categorial grammar. In Proceedings of EACL95.

Nagao, M. (1994). Varieties of heuristics in sentence processing. In Current Issues in Natural Language Processing: In Honour of Don Walker. Giardini with Kluwer.

Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 623-641.

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1-2), 77-106.

Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (1995). Backpropagation: The basic theory. In Backpropagation: Theory, Architectures and Applications, pp. 1-34. Lawrence Erlbaum Associates, Hillsdale, NJ.

Sperduti, A., & Starita, A. (1997). Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3).

Srinivas, B., & Joshi, A. (1999). Supertagging: An approach to almost parsing. Computational Linguistics, 25(2), 237-265.

Stabler, E. P. (1994). The finite connectivity of linguistic structure. In Clifton, C., Frazier, L., & Reyner, K. (Eds.), Perspectives on Sentence Processing, pp. 303-336. Lawrence Erlbaum Associates.

Steedman, M. J. (1989). Grammar, interpretation and processing from the lexicon. In Marslen-Wilson, W. M. (Ed.), Lexical Representation and Process, pp. 463-504. MIT Press.

Sturt, P., & Crocker, M. (1996). Monotonic syntactic processing: A cross-linguistic study of attachment and reanalysis. Language and Cognitive Processes, 11(5), 449-494.

Thacher, J. (1973). Tree automata: An informal survey. In Aho, A. (Ed.), Currents in the Theory of Computing, pp. 143-172. Prentice-Hall Inc., Englewood Cliffs.

Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L., & Palmucci, J. (1993). Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2), 359-382.

Yamashita, K. (1994). Processing of Japanese and Korean. Ph.D. thesis, Ohio State University, Columbus, Ohio.