Towards incremental parsing of natural language using recursive neural networks
F. Costa¹, P. Frasconi¹, V. Lombardo², and G. Soda¹

¹ Dept. of Systems and Computer Science, University of Florence, via di Santa Marta 3, I-50139 Firenze (Italy)
² DISTA, Università del Piemonte Orientale "A. Avogadro", c.so Borsalino 54, I-15100 Alessandria (Italy)

April 7, 2000
Abstract

In this paper we develop novel algorithmic ideas for building a natural language parser grounded upon the hypothesis of incrementality. Although widely accepted and experimentally supported from a cognitive perspective as a model of the human parser, the incrementality assumption has never been exploited for building automatic parsers of unconstrained real texts. The essentials of the hypothesis are that words are processed in a left-to-right fashion, and the syntactic structure is kept totally connected at each step.

Our proposal relies on a machine learning technique for predicting the correctness of partial syntactic structures that are built during the parsing process. A recursive neural network architecture is employed for computing predictions after a training phase on examples drawn from a corpus of parsed sentences, the Penn Treebank. Our results indicate the viability of the approach and lay out the premises for a novel generation of algorithms for natural language processing which more closely model human parsing. These algorithms may prove very useful in the development of efficient parsers and have an immediate application in the construction of semiautomatic annotation tools.
1 Introduction
Incremental processing of the human parser is a widely held assumption in psycholinguistics. This assumption accounts for the intuitive fact that language is processed from left to right: this is obviously true for speech, which is obligatorily fed to the processor in a sequential fashion, while in reading the bidimensional presentation of text (as opposed to the monodimensional speech) allows an inspection of linguistic material that goes beyond the left-to-right sequential presentation. Incremental processing also accounts for a fact which has received much experimental support, i.e. that humans interpret language without reaching the end of the input sentence, that is, they are able to assign a meaning to "almost any" initial (left) fragment of a sentence (Marslen-Wilson, 1973).
Although incremental processing of natural language is uncontroversially supported by many psycholinguistic experiments, effective models for parsing unconstrained language are rarely based on the incrementality hypothesis. In this area, most solutions consist of stochastic versions of bottom-up parsers, relying on models grounded upon context-free rule probabilities (see, e.g., (Collins, 1996)). The incremental approach is at work only in psycholinguistic models of the human parser, mostly tested on artificially generated sentences that display this or that form of ambiguity. Probably, the major reason is the incomplete knowledge about the human parser, which makes a reliable algorithmic design considerably hard.
Nevertheless, some of the discovered features of the human parser have been exploited in NLP system implementations. For example, a number of preference heuristics (like Late Closure (Kimball, 1973) or Minimal Attachment (Frazier & Fodor, 1978), see Section 5), devised in psycholinguistic settings to describe some aspects of the human parser, have been implemented to solve local ambiguities (see, e.g., (Hobbs & Bear, 1990)). After the emergence of corpus-based approaches, NLP systems often involve a combination of heuristics and statistics to guide the parser decisions (see, e.g., (Nagao, 1994)). In this paper we push further the similarities with the human parser, by adopting the novel assumption that the parsing strategy incorporates an architectural aspect of the human parser, namely its incremental nature. To our knowledge, this is the first attempt in which a psycholinguistic model of the human parser is applied to parsing real texts. The sense of "toward" in the paper title is that the result presented here is not a fully functional system yet, but a concrete assessment of the proposed methodology.
An operational account of the incrementality hypothesis, called strong incrementality, is at the core of a number of computational models of the human parser, as a parsimonious algorithmic version of incrementality (Stabler, 1994). In this version, the parser maintains a totally connected parse, while scanning the input words from left to right, with no input stored in a disconnected state. The major difficulty of implementing a strongly incremental strategy is that the connection of the next input word to the existing structure (what is called the left context) requires the construction of a substructure called the connection path (see Figure 1). In parsing naturally occurring text, this results in a high number of candidate connection paths, which yields a hard search problem, affecting the parser accuracy and speed.

Figure 1: The connection path for "steeper" in "He thinks steeper prices are to come."
In this paper we propose an incremental parsing model able to process unconstrained naturally occurring texts. At each step of parsing, the correct connection path linking the next word to the incrementally built parse tree is predicted using a neural network model. Such a prediction effectively acts like a heuristic that can guide the search process, and significantly reduce computational efforts. Designing accurate heuristics for this problem is a difficult task but, on the other hand, the availability of databases of parsed sentences (treebanks) makes the problem interesting from a machine learning perspective. In our experiments, examples are drawn from a subset of parsed sentences in the Penn Treebank (Marcus et al., 1993).
As far as we know, there exists one previous attempt at applying machine learning techniques to formulate parse decisions. Hermjakob and Mooney (Hermjakob & Mooney, 1997) used a modified ID3 algorithm to learn the moves of a shift-reduce parser. The problem is cast into a representation which relies on a rich set of morphological, syntactic and semantic features (around 200). The result is a parsing model that provides an output consisting of a phrase structure tree and a predicate-argument structure fed to a machine translation module. Though similar in spirit, the work described in this paper relies on a different machine learning technique, a recursive neural network vs. the ID3 algorithm. Also, the goal of this paper is the assessment of a methodology rather than a fully functional system: the numerical data we provide result from a reduced learning domain which is exclusively syntactic (vs. the 200 features from several sources), in the form of syntactic structures. We are aware that a complete parsing model cannot be realized without the contribution of other knowledge sources, like a semantic knowledge base and a set of subcategorization frames (or predicate-argument structures), but the results we have obtained with the skeletal framework clearly indicate the effectiveness of the approach.
Instances in our learning domain are naturally represented as incremental trees, obtained by joining a candidate connection path to some left context (another incremental tree). Uncertainty about the selection of the correct candidate can be naturally modeled in a probabilistic setting. In particular, the prediction task consists of estimating the probability that a given attachment between a given incremental tree and a candidate path is correct. As a major advantage of this formulation, the learner outputs a continuous ranking of alternatives so that best candidates can be tried first.
The probabilistic and supervised nature of the learning task (correct candidates are known for parsed sentences in the training set) suggests a solution based on connectionist models. The structured nature of instances requires the adoption of a recently introduced class of connectionist architectures, called recursive neural networks (Goller & Kuchler, 1996; Sperduti & Starita, 1997; Frasconi et al., 1998). Unlike feedforward neural networks, which are grounded on attribute-value representations of data, recursive networks can adaptively process labeled graphs, effectively taking advantage of relations among atomic entities that constitute a single instance. Representation and processing of structured information is not a recent area of interest in connectionism. Early attempts include distributed reduced descriptors by Hinton (1990), Recursive Auto-Associative Memories (RAAM) by Pollack (1990) and Holographic Reduced Representations by Plate (Plate, 1995). Recursive neural networks, however, are the first architecture that can exploit the supervised learning paradigm on structured data. They have been successfully applied to chemistry (Bianucci et al., 2000), pattern recognition (Francesconi et al., 1997), and learning search heuristics for guiding a theorem prover (Goller, 1997). In this paper we describe a novel and specialized version of these networks that enables them to learn how to rank a forest of alternatives.
The paper is organized as follows: Section 2 introduces the basic notions of incremental natural language processing. In Section 3 we formulate the learning task. Section 4 reviews recursive neural networks and describes the specialized version we employ to solve the learning problem of the present paper. In Section 5 we explain in detail data preparation and the experimental setup, with a discussion of the results obtained so far.
2 Incremental parsing
Incremental processing of natural language (incrementality for short) is an intuitively plausible hypothesis upon the human language processor.

This incrementality hypothesis has received large experimental support in the psycholinguistic community over the years: the shadowing experiments (Marslen-Wilson, 1973) proved that people are able to interpret speech word-by-word with a latency time of just 250 ms¹; data about head-final language processing² suggest that the human parser assembles substructures eagerly before reaching the structure head (Bader & Lasser, 1994; Frazier, 1987; Kamide & Mitchell, 1999; Yamashita, 1994); finally, eye-movement studies of visual object recognition show that the referent in a visual context of a noun phrase with pre-modifiers is computed well before reaching the nominal head (Eberhard et al., 1995).
The incrementality hypothesis poses some constraints upon the general parsing strategy. In its general form, the hypothesis implies that the semantic interpretation of some fragment of the sentence is available as the scan of the input material proceeds from left to right. This does not involve any specific constraint on how the interaction between the syntactic analysis and the semantic interpretation can be implemented at the algorithmic level. One possible (and parsimonious) solution to the interaction problem is offered by the strongly incremental version of the hypothesis (Lombardo et al., 1998; Milward, 1995; Stabler, 1994; Steedman, 1989; Sturt & Crocker, 1996). The interaction occurs at the word level, and compels the syntactic analyzer to keep a totally connected structure at all times (that is, the semantic interpreter has no means to work on disconnected items³). Parsing proceeds to the right. Here are the formal definitions⁴.

¹ In a shadowing experiment, subjects listen to passages from a novel and say words out loud as fast as they can. Subjects do not know that certain words in the passage are mispronounced. In many cases, subjects tend to utter words correctly with no loss of time with respect to the correct pronunciation. These and other data have been interpreted as a proof that people tend to interpret the meaning of sentences as soon as they can, probably on a word-by-word basis.

² Head-final languages are languages where constituents feature the head word in the rightmost position. Examples are Dutch, German and Japanese.

³ This is an assumption upon the semantic interpreter: it could well be that the human processor has some means of interpreting the disconnected pieces of structure in memory. However, the latter is an unparsimonious assumption.
Given a sentence $s = w_0 w_1 \cdots w_{|s|-1}$ and a tree $T$ for it (whose internal nodes are labeled by nonterminal symbols and whose leaves are labeled by words), we define recursively the incremental trees $T_i$ ($i = 0, 1, \ldots, |s|-1$) spanning $w_0 \cdots w_i$ as follows:

- $T_0$ consists of the chain of nodes and edges from $w_0$ to its maximal projection;
- $T_i$ consists of all the nodes and edges in $T_{i-1}$ and the chain of nodes and edges from $w_i$ to $N$, where $N$ is
  - either a node of $T_{i-1}$,
  - or the lowest node of $T$ dominating both the root of $T_{i-1}$ and $w_i$ (in this case $T_i$ also includes the edge from $N$ to the root of $T_{i-1}$).

Given two incremental trees $T_1$ and $T_2$, we define the difference between $T_1$ and $T_2$ as the tree formed by all the edges which are in $T_1$ and not in $T_2$, and all the nodes touched by such edges.

Now, given a sentence $s = w_0 w_1 \cdots w_{|s|-1}$ and a tree $T$ for it, the connection path for $w_i$ is the difference between the incremental trees $T_i$ and $T_{i-1}$. Moreover,

- a node both in $T_i$ and in $T_{i-1}$ is called an anchor (that is, a node where the connection path anchors to $T_{i-1}$);
- the node labeled by the POS tag of $w_i$ is called a foot.
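To make the tree-difference definition concrete, here is a minimal sketch in Python, under the assumption (ours, not the paper's) that a tree is encoded as a set of (parent, child) edges with uniquely identified nodes:

```python
def connection_path(t_i, t_prev):
    """Connection path for w_i: the edges in T_i and not in T_{i-1},
    plus all the nodes touched by such edges (trees are sets of edges)."""
    edges = t_i - t_prev
    nodes = {n for edge in edges for n in edge}
    return edges, nodes

def anchors(path_nodes, t_prev):
    """Anchors: nodes belonging both to the connection path and to T_{i-1}."""
    prev_nodes = {n for edge in t_prev for n in edge}
    return path_nodes & prev_nodes
```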
In Figure 2, we show the sequence of incremental trees for a sentence of the corpus.
When the parser tries to connect an input word $w_i$ to the incremental tree $T_{i-1}$ (i.e. the structure that spans all the words that precede the current word), it computes a connection path, that is the structural portion that must be built in order to yield $T_i$. However, the number of paths that allow linking $w_i$ with $T_{i-1}$ can be high (see Section 5 for a quantitative estimation on the corpus). Ambiguity comes in two forms: the number of possible anchors on the right edge of $T_{i-1}$, and the number of alternative connection paths departing from some anchor (see Figure 3).
⁴ These definitions introduce two simplifications with respect to (Lombardo & Sturt, 1999): first, we include in the connection path the edges that link a word and its maximal projection, that is the substructure headed by that word; second, we neglect the presence of empty nodes (traces) between the lexical elements $w_{i-1}$ and $w_i$. Both simplifications make the design of the learning domain easier, because we are not considering the distinction between headed and headless nodes in connection paths (see Section 5) and co-referring symbols, respectively.
Figure 2: The incremental trees of the sentence "It has no bearing on the performance of our company stock." Nodes are labeled with the incremental tree that includes them for the first time.
Figure 3: Legal attachment of a connection path to incremental trees. The sentence in this example is "It has no bearing on the performance of our company stock." (a) An incremental tree $T_{i-1}$; nodes marked as "anchor" are potential anchor points for joining a connection path. (b) After one of the potential anchor points has been selected (NP in the example), several alternative paths can be joined to form $T_i$; in the example, these are all the paths having NP as an anchor and IN (the POS tag of "on") as foot. The correct path for this sentence is highlighted.
A selection procedure chooses the best connection path and anchor for continuation, and instantiates it to generate the new incremental tree $T_i$. The better the selection procedure, the fewer parsing decisions are wrong.
The success of a selection depends on the ultimate task of the parsing model. If the goal is to simulate the human parser (this is interesting from a psycholinguistic standpoint), then the selection procedure must return the connection path that would be chosen by a human, even if it is wrong. In fact, it can happen that first-pass analyses lead to a parsing breakdown, and the control goes to a reanalysis module, a quasi-necessary component of a psycholinguistic model (Fodor & Ferreira, 1998). On the contrary, if the goal is the design of an efficient automatic parser, then the selection procedure must minimize the number of failures due to the wrong choice of the connection path. The goal of this paper is to evaluate whether machine learning algorithms can effectively solve the local ambiguity due to the selection of the connection path. Thus, the reference task considered in this paper is the design of an efficient automatic parser. In the next section we formalize the learning task.
3 Formal definition of the learning problem
In the following, we assume that a treebank (a corpus of sentences with associated parse trees) is available. A treebank is denoted as $B = \{(s^{(p)}, T(s^{(p)})); \; p = 1, \ldots, P\}$, where $s^{(p)}$ is a generic sentence and $T(s^{(p)})$ its parse tree.
3.1 Universe of connection paths
The universe of connection paths $U(B)$ is the set of all connection paths that can be extracted from $T(s^{(p)})$, $p = 1, \ldots, P$. This set of paths can be obtained by running a simulation algorithm that computes all the incremental trees associated with the sentences in the corpus. The algorithm scans the sentence covered by each tree in a word-by-word fashion from left to right and, for each word $w_i$, finds the legal connection paths which would be built in order to attach $w_i$ to the current left context $T_{i-1}$. The algorithm simulates the building of structure by marking, during the scan of the tree, the branches which would be built by a perfectly informed incremental parser. More details are explained in (Lombardo & Sturt, 1999). Given a large enough corpus, we can expect that $U(B)$ will provide an approximately complete coverage of rules for covering all the sentences in the language under consideration.
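A possible shape for this simulation is sketched below, under our own assumptions (incremental trees are precomputed per sentence as sets of edges, and an anchor_and_foot helper extracts the labels used for indexing; neither is an API from the paper):

```python
from collections import defaultdict

def universe_of_paths(corpus_trees, anchor_and_foot):
    """Collect U(B): every connection path observed while scanning each parsed
    sentence left to right, indexed by (anchor label, foot label).
    corpus_trees: for each sentence, the list [T_0, ..., T_{|s|-1}] of
    incremental trees, each represented as a frozenset of edges."""
    universe = defaultdict(set)
    for trees in corpus_trees:
        for i in range(1, len(trees)):
            path = frozenset(trees[i] - trees[i - 1])  # tree difference
            universe[anchor_and_foot(path, trees[i - 1])].add(path)
    return universe
```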
3.2 The forests of candidates
Suppose we are given a new sentence $s = w_0, \ldots, w_{|s|-1}$ (not in the corpus $B$), and suppose that at stage $i$ of parsing we know the correct incremental tree $T_{i-1}(s)$ spanning $w_0, \ldots, w_{i-1}$. The goal of an incremental parser is then to compute the next tree $T_i(s)$ in order to accommodate the next word $w_i$. Assuming that $U(B)$ has a rich enough coverage, we know that $T_i(s)$ can be obtained by joining $T_{i-1}(s)$ to one of the paths in $U(B)$. However, other trees spanning $w_0, \ldots, w_i$ can be generated by legally attaching other connection paths. In this regard, recall from Section 2 that an attachment is legal if the anchor matches one of the nodes on the right edge of the tree and the foot matches the POS tag of $w_i$ (see Figure 3). Legal foots can be easily obtained by running a part-of-speech tagger on the sentence. The set of trees obtained by legally attaching $T_{i-1}(s)$ to any path in $U(B)$ is called the forest of candidates for word $w_i$ within sentence $s$, denoted $F_i(s) = \{T_{i,1}(s), \ldots, T_{i,m_i}(s)\}$. Of course, only one of the trees in $F_i(s)$, $T_{i,j^*}(s)$, is the correct one (i.e., it is a subgraph of the actual parse tree $T$ for the sentence), but for unparsed sentences it is unknown. If an oracle were available to select the correct incremental tree from the candidate set, then incremental parsing would become a trivial procedure. The machine learning algorithms in this paper aim to accurately approximate such an oracle.
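In code, forest construction reduces to a double loop over legal anchors and indexed paths; a sketch assuming the universe built above and a hypothetical attach(tree, anchor, path) join operation supplied by the caller:

```python
def candidate_forest(t_prev, right_frontier, foot_tag, universe, attach):
    """Build F_i(s): legally attach to T_{i-1} every path in U(B) whose anchor
    label matches a node on the right frontier of T_{i-1} and whose foot
    matches the POS tag of w_i. right_frontier: (node, label) pairs."""
    forest = []
    for anchor_node, anchor_label in right_frontier:
        for path in universe.get((anchor_label, foot_tag), ()):
            forest.append(attach(t_prev, anchor_node, path))
    return forest
```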
3.3 The learning task
We propose that the learning algorithm rely on a statistical model in charge of assigning probabilities of correctness to each candidate tree. Parameters of the model will be estimated from examples. The main advantage of a probabilistic formulation is that during recall, when unseen sentences are processed, the model can be effectively employed to rank alternative trees, sorting them by decreasing probability of correctness. In this way, the parser can try first candidate trees with higher probability of correctness. If we look at parsing as a search problem (where the goal is the full parse tree for $s$) then ranking candidates can be seen as a heuristic evaluation function that guides the search. A perfect predictor (i.e., one that always assigns the highest probability to the correct tree) would give a perfect heuristic, reducing the branching factor of the search problem to 1. This means that two learners should not be compared by simply computing the number of failures to assign the highest probability to the correct tree (as in a winner-take-all scheme). In fact, we need the number of candidates ranked above the correct tree to be small. Hence, one performance measure for the predictor can be the average position of the correct tree in the list of candidates, after trees have been sorted by decreasing probability of correctness.
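For concreteness, this measure (together with the fraction of forests whose correct tree is ranked first, used later in Section 5.3) can be computed as follows; the (scores, j_star) input format is our assumption:

```python
def rank_metrics(forests):
    """forests: list of (scores, j_star) pairs, where scores[j] is the
    predicted probability of candidate j and j_star is the index of the
    correct tree. Returns R (average 1-based rank of the correct tree after
    sorting by decreasing score) and F (fraction of forests ranked first)."""
    ranks = []
    for scores, j_star in forests:
        order = sorted(range(len(scores)), key=lambda j: -scores[j])
        ranks.append(order.index(j_star) + 1)
    R = sum(ranks) / len(ranks)
    F = sum(r == 1 for r in ranks) / len(ranks)
    return R, F
```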
In this paper, we assume that the model does not make use of lexical information, that is, correctness is assessed by only observing nonterminal symbols in each tree, discarding all the leaves (which are labeled by words). There are two reasons for this choice. First, incorporating lexical knowledge in a connectionist model necessarily involves a large number of parameters to be estimated, requiring large amounts of training data. In our experiments (see Section 5) we show that even a limited amount of training sentences is sufficient to achieve interesting performance, a valuable feature of the model for integration in a semi-automatic annotation tool. Second, we are interested in showing that the pure syntactic level contains significant information under the incrementality assumption. In the framework we are outlining, lexical knowledge can be employed after learning to detect parsing breakdowns and to implement recovery mechanisms.
There are two possible formulations of the learning task:

- Each instance is a candidate incremental tree. The instance space is then partitioned into positive and negative instances. Candidate $T_{i,j}(s)$ is a positive instance if and only if it is the correct incremental tree for $s$ at position $i$. In this formulation, the learner must classify labeled trees into one of two classes.
- Each instance is the whole forest of candidate trees $F_i(s)$. The task consists of learning the correct member of the forest, which can be identified by a multinomial variable with realizations in $\{1, \ldots, m_i(s)\}$. Training examples are pairs $(F_i(s), j^*_i(s))$, where the input portion is the candidate forest and the output portion (supervision) is the integer $j^*_i(s) \in [1, m_i(s)]$ identifying the correct tree.

While the first formulation relies on the standard concept learning framework, the second formulation requires more attention because it does not involve classification as commonly intended. In fact, one instance is a "bag" of objects (trees), in a way that loosely resembles multi-instance learning (Auer, 1997; Dietterich et al., 1997). In our case (as a major difference with respect to multi-instance learning), each bag contains exactly one positive instance.
The importance of our second formulation stems from the fact that it more accurately describes the kind of uncertainty we are dealing with. To explain this, let us consider the probabilistic interpretations of the two formulations. In the first case, the learner makes decisions about boolean output variables $O_{i,j}(s)$, each indicating whether a given candidate incremental tree is correct for position $i$ of sentence $s$:

$$y_{i,j}(s) = P(O_{i,j}(s) = \text{correct} \mid T_{i,j}(s)). \qquad (1)$$

When computing this probability the learner is not informed about competing candidate trees. In the second formulation, the learner makes decisions about a multinomial output variable $O_i(s)$ whose realizations are integers in $[1, \ldots, m_i(s)]$ identifying the correct tree in the bag associated with word $w_i$ in sentence $s$:

$$y_{i,j}(s) = P(O_i(s) = j \mid F_i(s)) \qquad (2)$$

where $O_i(s) = j$ means that $T_{i,j}(s)$ is the correct tree. In this case, predictions are conditional on the whole set of candidates, thus introducing competition among trees in the forest. Assuming that the universe of paths can cover the sentence $s$ (i.e. the grammar is sufficiently expressive), we have

$$\sum_{j=1}^{m_i(s)} y_{i,j}(s) = 1. \qquad (3)$$

This is because one and only one candidate tree is the correct one (but which one is unknown). If the grammar is not sufficiently expressive, predictions are meaningless but, on the other hand, in that case parsing would fail anyway. Hence, as far as the learning task is concerned, we can assume that (3) holds.
In the following section we shall explain how this learning task can be solved using a connectionist approach. To simplify notation, reference to the particular sentence $s$ will be omitted in the following discussion.
4 Neural network architecture
Neural network architectures for supervised learning are well known in the case of attribute-value representations. However, as explained in Section 3, the learning task that can instruct parse decisions involves making predictions about structured objects (namely, labeled trees). Feedforward neural networks cannot take labeled trees as input because they cannot deal with inputs of variable size and structure, such as the relations among constituents. In principle, recurrent neural networks (Kolen & Kremer, 2000) might be employed by converting the tree into a sequence (e.g., by visiting it in preorder). However, such an approach would lead to difficult learning tasks for at least two important reasons. First, the exponential relation between the number of nodes and the height of a tree leads to very long sequences even for trees of modest height. This has a very negative impact on learning because of vanishing gradient problems (Bengio et al., 1994). Second, once a tree is linearized into a sequence, hierarchical relations are blurred and intermixed, making generalization to new instances hard (Frasconi et al., 1998). These considerations motivate a different class of architectures, called recursive neural networks, that generalize recurrent neural networks to deal with labeled directed acyclic graphs (Goller & Kuchler, 1996; Sperduti & Starita, 1997; Frasconi et al., 1998). In Section 4.1 we shortly review the essential architectural and algorithmic ideas for supervised classification of data structures; more details can be found in (Frasconi et al., 1998). In Section 4.2 we reformulate the basic classification model for ranking alternative incremental trees according to the second formulation outlined in Section 3.
4.1 Recursive neural networks
From a connectionist (or statistical) standpoint, supervised learning for classification is formalized as the problem of estimating the conditional probability $P(O \mid I)$. The output $O$ is a discrete variable associated with the class and $I$ is a generic input object. The general theory developed in (Frasconi et al., 1998) allows $I$ to be a directed acyclic graph with a supersource. Here we are interested in the case of the input $I$ being a labeled ordered $m$-ary tree. By ordered we mean that, for each vertex $v$, a total order is defined on the $m$ children of $v$. Specifically, $ch[v]$ denotes the ordered $m$-tuple of vertices whose elements are $v$'s children. If $v$ has $k < m$ children, then the last $m - k$ elements of $ch[v]$ are filled with a special entry nil, denoting a missing child. $I(v)$ denotes the label attached to vertex $v$. Although the general theory allows mixing discrete, continuous, and possibly multivariate labels, in this paper we are only interested in the case of discrete labels, i.e. random variables with realizations in a finite alphabet $\mathcal{I} = \{I_1, \ldots, I_N\}$ (in our application, this is the set of nonterminal symbols in the treebank).

The basic neural network architecture is based on the following recursive state space representation:

$$x(v) = f(x(ch[v]), I(v))$$
$$o = g(x(r)). \qquad (4)$$

Figure 4: The recursive neural network model.
In the above equation, $x(v) \in \mathbb{R}^n$ denotes the state vector associated with node $v$ and $x(ch[v]) \in \mathbb{R}^{mn}$ is a vector obtained by concatenating the components of the state vectors contained in $v$'s children. $f: \mathcal{I} \times \mathbb{R}^{mn} \to \mathbb{R}^n$ is the state transition function that maps the states at $v$'s children and the label at $v$ into the state vector at $v$. $g: \mathbb{R}^n \to [0,1]$ is the output function, which maps the state $x(r)$ (at the root $r$ of the tree) into the probability $P(O \mid I)$ that the input tree is a positive example. States in Eq. (4) are updated bottom-up, traversing the tree in post-order. If a child is missing, the corresponding entries in $x(ch[v])$ are filled with the frontier state $\bar{x}$, which is associated with the base step of the recursion. A typical choice is $\bar{x} = 0$. This computational scheme closely resembles the one used by frontier-to-root tree automata (Thacher, 1973), but in that case states are discrete whereas here states are real vectors.
The transition function $f$ is implemented by a feedforward neural network, according to the following scheme (see Figure 4):

$$a_h(v) = \omega_{h,0} + \sum_{j=1}^{N} \omega_{hj}\, z_j(I(v)) + \sum_{k=1}^{m} \sum_{\ell=1}^{n} w_{hk\ell}\, x_\ell(ch_k[v]) \qquad (5)$$

$$x_h(v) = \tanh(a_h(v)), \qquad h = 1, \ldots, n \qquad (6)$$
where $x_h(v)$ denotes the $h$-th component of the state vector at vertex $v$, $z_j(I_q) = 1$ if $j = q$ and zero otherwise (i.e., we are using a one-hot encoding of symbols), $ch_k[v]$ denotes the $k$-th child of $v$, and $\omega_{hj}, w_{hk\ell}$ are adjustable weights. It is important to remark that the weights in the transition network are independent of the node $v$ at which the transition function is applied. This can be seen as a form of weight sharing.
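The recursion in Eqs. (4)-(6) can be written down compactly; the following numpy sketch uses the dimensions reported in Section 5 and random untrained weights, with data structures of our own choosing rather than the authors' implementation:

```python
import numpy as np

N, m, n = 71, 14, 40      # label alphabet, max outdegree, state size (Section 5)
rng = np.random.default_rng(0)
W_label = rng.normal(scale=0.1, size=(n, N))      # weights omega_{hj} on z(I(v))
W_state = rng.normal(scale=0.1, size=(n, m * n))  # weights w_{hkl} on x(ch[v])
bias = np.zeros(n)                                # omega_{h,0}

class Node:
    def __init__(self, label, children=()):
        self.label = label                # index of the nonterminal, 0..N-1
        self.children = list(children)    # at most m children

def forward(v):
    """Bottom-up state update, Eqs. (4)-(6); missing children contribute
    the frontier state 0, the base step of the recursion."""
    x_ch = np.zeros(m * n)
    for k, child in enumerate(v.children):
        x_ch[k * n:(k + 1) * n] = forward(child)
    z = np.zeros(N)
    z[v.label] = 1.0                      # one-hot encoding of the label I(v)
    return np.tanh(bias + W_label @ z + W_state @ x_ch)   # Eqs. (5)-(6)

x_root = forward(Node(0, [Node(1), Node(2)]))     # tiny usage example
```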
Figure 5: Network unrolling and backpropagation through structure.
The output function $g$ is implemented by another network:

$$o = \frac{1}{1 + e^{-a}} \qquad (7)$$

where

$$a = w_0 + \sum_{\ell=1}^{n} w_\ell\, x_\ell(r) \qquad (8)$$

$w_\ell$ being adjustable weights.
Supervised learning is based on the maximum likelihood principle as in many other neural network models. In the case of classification, the likelihood function has the form of a cross-entropy. Let $D = \{(I^{(k)}, t^{(k)}); \; k = 1, \ldots, K\}$ be a training set of labeled trees (where $t^{(k)}$ denotes the class of $I^{(k)}$). Then the cost function is the negative log-likelihood:

$$J(D) = -\sum_{k=1}^{K} \left( t^{(k)} \log o^{(k)} + (1 - t^{(k)}) \log(1 - o^{(k)}) \right) \qquad (9)$$

$o^{(k)}$ being the output of the network after processing the $k$-th tree. Optimization is solved by gradient descent. In this case, gradients are computed by a special form of backpropagation on the feedforward network obtained by unrolling the state transition network according to the topology of the input tree $I$. The algorithm was first proposed in (Goller & Kuchler, 1996) and is referred to as backpropagation through structure (BPTS). The main idea is depicted in Figure 5: on the left, an example input tree is shown. On the right, the output network $g(\cdot)$ is attached to the replica of $f(\cdot)$ associated with the root. After recursion (4) is completed, the state vector $x(r)$ at the root $r$ contains an adaptive encoding of the whole tree. This is similar to the encoding performed by recursive auto-associative memories first proposed by Pollack (Pollack, 1990), but here the encoding results from a supervised learning mechanism. Forward propagation (dark arrows) is bottom-up and follows Eq. (4). Backward propagation (light arrows) is employed to compute the gradients and proceeds from the root to the leaves. Note that gradient contributions must be summed over all the replicas of the transition network to correctly implement weight sharing.
4.2 Recursive networks for ranking alternatives
As explained in Section 3, an alternative formulation of the learning task establishes that the input portion of each example is a forest of candidate trees. Such an object is not connected and thus not readily suitable as the input to a recursive neural network. In this formulation, moreover, we do not have a standard classification task. The network has to pick one out of $m_i$ candidates, and the number of candidates depends on the particular instance (this is somewhat similar to having a variable number of classes). In this section we describe the required modifications to the basic model presented above, to make it suitable for solving the natural language learning task at hand.
The basic idea is rather simple and consists of processing each tree in the forest separately, and then integrating the results with a normalization step to satisfy Eqs. (2) and (3). As a first step, we modify the output function $g(\cdot)$ (see Eqs. (7) and (8)) of the recursive model as follows:

$$o = w_0 + \sum_{\ell=1}^{n} w_\ell\, x_\ell(r)$$

i.e. the output network reduces to a (linear) weighted sum of the state components at the root of each tree. In this way we obtain a vector $o_i = [o_{i,1}, \ldots, o_{i,m_i}]$ whose components are real numbers in $(-\infty, +\infty)$ associated with each candidate in $F_i$. Correctness probabilities are finally obtained by taking the softmax function of $o_i$:

$$y_{i,j} = \frac{e^{o_{i,j}}}{\sum_{\ell=1}^{m_i} e^{o_{i,\ell}}} \qquad (10)$$

In this way we satisfy the multinomial model expressed by Eqs. (2) and (3), and $y_{i,j}$ can be interpreted as the probability that the $j$-th candidate tree in $F_i$ is the correct tree. The same network is used to process each tree in the forest, so the resulting unrolled network is a forest of neural networks sharing the same weights for the transition function $f(\cdot)$ and the output function $g(\cdot)$. An immediate consequence of this choice is that the probabilities $y_{i,j}$ are not affected by a permutation of the trees in the candidate set, a property which is clearly desirable at this level.
The cost function (the negative log-likelihood according to the multinomial model) is written as follows:

$$J(D) = -\sum_{p=1}^{P} \sum_{i=0}^{|s^{(p)}|-1} \log y_{i,j^*} \qquad (11)$$

where $D$ denotes the training set, the first sum ranges over sentences in the training set, the second sum ranges over words within each sentence, and $j^*$ is the index of the correct incremental tree in the candidate list (which is known in the training set).

At this point, the calculation of the delta errors needed to compute gradients is straightforward. It follows the classic scheme of multinomial classification described in (Rumelhart et al., 1995), which we report here for completeness. Denoting

$$\delta_{i,j} := -\frac{\partial J(D)}{\partial o_{i,j}}$$

we have

$$\delta_{i,j} = \begin{cases} 1 - y_{i,j} & \text{if } j = j^* \\ -y_{i,j} & \text{if } j \neq j^* \end{cases}$$

For each network $j$ in the forest, $\delta_{i,j}$ is injected as the delta output error, and subsequently passed to the BPTS algorithm. Of course, since the networks in the forest share the same weights, gradients are computed by summing all the contributions relative to the same shared weight.
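Putting Eqs. (10) and (11) together with the delta errors, the ranking head is only a few lines on top of the recursive encoder; a sketch assuming the root state vectors have already been computed (e.g., with a forward pass like the one in Section 4.1), with names of our own choosing:

```python
import numpy as np

def rank_forest(root_states, w, w0=0.0, j_star=None):
    """root_states: (m_i, n) array of root state vectors, one per candidate.
    Returns the softmax correctness probabilities of Eq. (10) and, if the
    index j_star of the correct tree is given, the delta errors for BPTS."""
    o = w0 + root_states @ w          # linear output, one score per tree
    o = o - o.max()                   # shift for numerical stability
    y = np.exp(o) / np.exp(o).sum()   # Eq. (10)
    if j_star is None:
        return y
    delta = -y.copy()                 # -y_{i,j} for j != j*
    delta[j_star] += 1.0              # 1 - y_{i,j*} for the correct tree
    return y, delta
```

Sorting the candidates by decreasing y then gives the ranking evaluated in Section 5.3.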
5 Implementation and results
5.1 Data preparation and experimental setup
The experiments were run on a set $B$ of 2,000 parsed sentences, randomly extracted from the Wall Street Journal section of the Penn II Treebank. The universe of paths $U(B)$ was extracted from these sentences using the incremental parsing simulation algorithm described in Section 2. This resulted in a set of 1,675 connection paths, which we kept indexed by anchor and foot. Then, we selected three disjoint subsets of 100, 100, and 500 sentences. These three sets were used to construct the training set, the validation set⁵, and the test set, respectively. Each sentence in these sets was processed to extract a forest of candidate incremental trees as follows. A connection path is valid for $T_i$ if its anchor and its foot match the corresponding entities in $T_i$. Therefore, for every incremental tree we have to identify an anchor (that can be any node on the right frontier of $T_{i-1}$) and a foot (given by the rightmost leaf of $T_i$). At this point, the universe $U(B)$ is searched for legal paths matching every anchor-foot pair (see Figure 3). In turn, legal connection paths are then joined to $T_{i-1}$ to form the forest $F_i$. On the available data, the average cardinality of a candidate set $F_i$ is 56, while the maximum cardinality is 388. In the training set, the total number of words is 2,685 and the total number of candidates is 183,841. In the test set, the total number of words is 11,664 and the total number of candidates is 698,735.
Nodes are labeled with nonterminal symbols (syntactic categories) of the context-free grammar used in the Penn II Treebank. We have decided to use a subset of the nonterminal symbols in order to filter out information that is not relevant to our task, or that would cause some rules to be learned on the basis of specific and rare examples. As a result, we reduced the nonterminal symbols to a subset of 71 elements, collapsing some of the nonterminal variations into their base form. Table 1 shows the details of the merge we have performed.
The first recursive network used for the experiments implements the model described in Section 4.1, and has the following specifications: $m = 14$ (maximum outdegree), $N = 71$ labels (one-hot encoded), and $n = 40$ state components, yielding a total of about 25,000 adjustable parameters. When using 0-1 classification, the ratio of positive to negative examples turns out to be very small (2,685/181,156 = 0.015). This follows from the fact that for each positive example there is a large number of competing candidate incremental trees. The network is trained to minimize the number of misclassifications, but this goal would be easily achieved by always predicting 0 (regardless of the input), since the base accuracy⁶ for this learning problem is 98.5%. This theoretical expectation has been practically verified, and our experiments using all the available examples in each epoch did not produce any useful results (the network output was always very close to zero). We found that a viable way of dealing with this problem is to present the negative examples in the training phase with a lesser frequency than the positive ones, in order to re-establish an average ratio of 1:1 positive to negative examples per training epoch.
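One simple way to realize such rebalancing (the paper does not spell out its exact sampling scheme, so this is only a plausible sketch) is to redraw a negative subsample at every epoch:

```python
import random

def balanced_epoch(positives, negatives, seed=None):
    """Yield a 1:1 mix of all positive examples and an equally sized random
    subsample of the negatives, reshuffled for one training epoch."""
    rng = random.Random(seed)
    epoch = list(positives) + rng.sample(list(negatives), k=len(positives))
    rng.shuffle(epoch)
    return epoch
```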
⁵ The validation set was employed to estimate generalization performance, in order to stop gradient descent and prevent overfitting.

⁶ The base accuracy is the fraction of correctly classified instances when always predicting the most frequent class.
ADJP: ADJP-ADV, ADJP-CLR, ADJP-PRD
ADVP: ADVP-CLR, ADVP-DIR, ADVP-DIR-CLR, ADVP-EXT, ADVP-LOC, ADVP-LOC-CLR, ADVP-LOC-PRD, ADVP-MNR, ADVP-PRP, ADVP-PUT, ADVP-TMP
NP: NP-ADV, NP-BNF, NP-CLR, NP-DIR, NP-EXT, NP-HLN, NP-LGS, NP-LOC, NP-LOC-PRD, NP-MNR, NP-PRD, NP-SBJ, NP-TMP, NP-TTL
PP: PP-BNF, PP-CLR, PP-DIR, PP-DIR-CLR, PP-DTV, PP-EXT, PP-LGS, PP-LOC, PP-LOC-CLR, PP-LOC-PRD, PP-MNR, PP-PRD, PP-PRP, PP-PUT

Table 1: Reduction of nonterminal symbols. The variations listed after each base symbol are merged into that symbol.
In a second experiment we implemented the model described in Section 4.2. In this case, each tree in the forest is processed separately by a recursive network with one linear output unit. The resulting numbers are then normalized with the softmax function. The recursive network used in this experiment had the same size as the one described above (i.e., about 25,000 weights). In this case, since the input to the network is an entire forest of candidates, the unbalance problem of positive/negative examples does not occur. All the trees (both the correct and the incorrect ones) are grouped into 2,685 forests and are employed to train the network in each epoch.

In all the experiments, we estimated generalization (the average position $R$ of the correct candidate) by running the partially trained networks in recall mode on the 100 validation sentences. Gradient descent was stopped when the performance on the validation set was maximum. This early stopping is expected to be useful in order to prevent overfitting. In the case of 0-1 classification, training was stopped after 1,000 epochs. In the case of multinomial ranking, 90 epochs were sufficient (note, however, that fewer trees per epoch are processed in the first case, due to negative example subsampling).
5.2 Comparison with psycholinguistic preferences

A fair comparison with other approaches to scoring parse decisions can involve the psycholinguistic preferences mentioned in the introduction, Late Closure (LC) and Minimal Attachment (MA), which are at work in many parsing models because of their immediate implementation and the reasonable results they yield. Briefly, LC is the preference for attaching a new constituent at the lowest legal site on the right edge; MA is the preference for the attachment that involves the creation of the minimum number of new nodes.
The LC preference explains the surprise effect in the interpretation of a sentence. For example, consider "The president declared that he will resign yesterday." According to LC, "yesterday" is initially attached to the lower verb "will resign," but this attachment is then rejected on the basis of semantic knowledge ("yesterday" is incompatible with the future tense of "will resign"); on the contrary, the higher attachment to "declared" is semantically plausible. MA expresses a preference for flatter structures. In "John wrote a letter to Mary," "to Mary" is preferably attached to "wrote" rather than to "a letter," because of the smaller number of nodes involved. Notice that in this case MA overrides LC, which would predict an attachment to "a letter."
Though it is common in the psycholinguistic literature to describe initial parsing decisions as the result of the application of a number of preferences, (almost) everybody does still believe that preferences for recency (as in LC) and structural simplicity (as in MA) have an influence on parsing decisions in some form or other. The reason for considering LC and MA in the context of this paper is that such preferences are of an exclusively syntactic nature, and (though overridden by lexical and semantic heuristics) they are invoked when other sophisticated methods do not produce any valuable result. For example, Weischedel et al. (Weischedel et al., 1993) report a 75% correctness of the LC preference on 166 PP-attachments in the Penn Treebank which remained unsolved after a run of a partial parser based on a probabilistic model⁷. Lombardo and Sturt (Lombardo & Sturt, 1999) report on a sophisticated form of MA in the context of a strongly incremental parser. The sophistication is due to the distinction of syntactic nodes between headed and headless: headed nodes are those nodes for which the head word has already been scanned; otherwise, they are headless. Headedness reformulates MA from the number of nodes in a constituent to the number of headless nodes in a constituent: this reformulation is more consistent with the view that syntactic structure cannot be adequately explained without including lexical knowledge in a theory of natural language syntax. For example, a rule of a context-free grammar for natural language may exist only if there exists a word (in the lexicon) that licenses it. The results on a sample of about 50,000 words from the Wall Street Journal section of the Penn Treebank were that 82% of connection paths had 0 headless nodes, 15% had 1 headless node, 2% had 2 headless nodes, and 1% had 3 headless nodes.

⁷ Note that this 75% correctness cannot be compared with the results described in this paper (Section 5.3), since they are based on two different parsing models.
Since LC and MA are often in conflict with each other (see the example "John wrote a letter to Mary" above), we have collected the data for the two preferences by alternating which one overrides the other: in one trial, LC overrides MA (LC over MA), that is, incremental trees are ranked higher if the connection path anchors low and then with a minimum number of nodes; in the other trial, MA overrides LC (MA over LC), that is, incremental trees are ranked higher if they have a minimum number of nodes and then the connection path anchors low. In the next section we report the results.
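Both orderings amount to sorting by a lexicographic key; a minimal sketch, assuming each candidate records the depth of its anchor on the right edge (larger meaning lower in the tree) and the number of new nodes in its connection path (both field names are ours):

```python
def rank_lc_over_ma(candidates):
    """LC over MA: prefer the lowest anchor, break ties with fewer new nodes."""
    return sorted(candidates, key=lambda c: (-c.anchor_depth, c.new_nodes))

def rank_ma_over_lc(candidates):
    """MA over LC: prefer fewer new nodes, break ties with the lowest anchor."""
    return sorted(candidates, key=lambda c: (c.new_nodes, -c.anchor_depth))
```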
5.3 Results
After training, the network assigns a probability to each candidate incremental tree. A parser can employ the network output to rank candidate trees (by decreasing probability of correctness) for continuation. Hence, the performance measure of our approach to parse decisions is based on the rank assigned to the correct tree in each forest $F_i$.
A synthesis of the results on the 500 test sentences is reported in Table 2. In the table, $R$ is the average position of the correct tree, after having sorted the candidates according to the network predictions. $R = 1$ would correspond to a perfect predictor that always ranks the correct tree in first position. The next column, $F$, is the percentage of times the correct tree is ranked in first position. Global results are reported in the first row of the table. We can see that the multinomial ranking ("ANN multin.") outperforms 0-1 classification by about one position. Of course, when the number of candidates is higher, we can expect that it is more difficult to rank the correct one in the first positions. To give a more detailed overview of prediction quality, we report $R$ separately for each size of the candidate set $F_i$. For conciseness, results are averaged over bins associated with ranges of candidate list lengths. The total number of forests in each bin is also reported in the last column. The results on the mean value $R$ clearly demonstrate that the ranking produced by the network significantly reduces the number of candidate trees at each incremental parsing step.
m_i       ANN multin.      ANN 0-1         LC over MA      MA over LC      forests
          R      F         R      F        R      F        R      F
2-388     3.25   52.4%     4.27   31.6%    6.73   44.8%    7.63   39.9%    11,664
2-3       1.25   77%       1.41   63%      1.22   82%      1.18   82%      384
4-5       1.54   60%       2.05   33%      2.52   29%      2.15   35%      349
6         1.92   59%       2.10   38%      3.08   14%      2.53   19%      191
7         1.85   50%       2.29   33%      2.64   53%      2.10   52%      320
8         1.49   76%       1.81   65%      2.74   56%      2.14   67%      254
9         2.45   34%       2.42   44%      3.69   46%      3.10   45%      250
10        1.79   60%       2.54   43%      3.94   32%      2.72   32%      305
11-13     2.31   56%       2.44   49%      5.93   21%      4.07   20%      371
14-16     1.88   65%       2.41   47%      5.33   33%      3.88   33%      391
17        2.29   49%       2.97   30%      6.56   13%      5.32   11%      154
18        1.64   79%       1.74   77%      2.90   73%      2.41   73%      344
19-20     2.88   50%       3.09   28%      6.32   38%      4.34   33%      252
21-23     2.56   48%       3.05   28%      6.49   25%      5.45   22%      385
24-27     3.25   47%       3.48   31%      5.00   42%      5.16   38%      387
28-30     2.29   54%       2.80   26%      5.80   36%      3.72   32%      353
31-34     3.12   49%       3.75   29%      6.17   38%      5.57   31%      379
35-38     2.99   52%       3.70   32%      5.54   44%      4.77   41%      381
39-41     3.14   51%       4.05   24%      5.18   45%      5.47   41%      330
42-45     2.41   59%       3.16   32%      6.48   45%      5.09   41%      373
46-49     2.67   54%       3.45   32%      7.38   52%      5.69   46%      301
50-53     3.34   53%       4.16   27%      5.63   37%      6.78   30%      370
54-58     2.84   52%       3.44   30%      7.70   51%      6.36   47%      378
59-64     3.72   51%       4.40   31%      7.23   52%      6.90   44%      357
65-69     3.57   50%       5.21   19%      8.50   54%      6.65   46%      400
70-74     5.39   49%       5.82   24%      6.68   48%      7.47   42%      380
75-81     3.31   57%       4.52   27%      7.10   58%      8.65   51%      385
82-89     4.01   46%       5.26   21%      6.20   57%      8.94   49%      353
90-99     4.29   50%       5.51   23%      8.61   50%      9.38   38%      372
100-111   4.17   42%       5.19   23%      7.13   53%      8.94   45%      397
112-126   4.80   51%       6.90   20%      8.89   51%      12.24  42%      391
127-145   4.44   49%       5.85   20%      10.63  47%      13.37  39%      388
146-173   4.76   42%       7.33   17%      9.12   49%      14.17  41%      400
174-222   7.14   33%       11.21  14%      11.69  41%      20.75  29%      398
223-388   7.02   32%       13.60  11%      17.74  23%      30.80  15%      241

Table 2: Prediction performance (see text for explanation).
Figure 6: Experimental comparison of knowledge-based heuristics (Late Closure, Minimal Attachment) and neural network predictions (multinomial ranking and 0-1 classification). The graph reports the percentage of times that, after sorting candidates by one of the heuristics, the correct tree is found within the first $k$ trees, for $k = 1, \ldots, 20$.
For example, when the number of candidates is between 100 and 111, the correct tree is ranked in position 4.17 (on average).
Columns 6-9 in Table 2 report $R$ and $F$ for the two strategies (LC over MA, and MA over LC) described in Section 5.2. The results obtained with the neural network model compare very favorably with the psycholinguistic preferences. The only exception is when there are only 2 or 3 alternative trees (in this case the MA and LC preferences slightly outperform the neural network). Moreover, multinomial ranking performs significantly better than the 0-1 approach. This can be expected theoretically, since the architecture proposed in Section 4.2 is a more accurate model of the task at hand. This can also be explained intuitively: if the correct candidate is not ranked in first position, the contribution to cost function (11) is large. On the other hand, the contribution to cost function (9) could be small if the score of the correct candidate is close to 1 but other competing candidates achieve higher scores. This explains why $F$ is consistently higher for the multinomial ranking method.
In Figure 6 we report the cumulative histogram for the four ranking methods. The graph represents the percentage of times that the correct tree is ranked within the first $k$ positions.
5.4 Discussion
Although too weak for a complete parsing model, the predictions summarized in Table 2 and Figure 6 have an immediate application in a semiautomatic module for the construction of a treebank, a resource currently missing for many languages, and of fundamental importance for corpus-based approaches. The module works by proposing incremental parses to a human annotator. It has syntactic knowledge in the form of a basic context-free grammar, which is updated when there is evidence for some new rule in the corpus. The parsing algorithm works incrementally and, for each word, displays the candidate incremental trees to the human annotator, sorted by network prediction. The annotator would then only need to check 10 or fewer candidates 94.5% of the time (on average). Using psycholinguistic preferences, on the other hand, checking 10 or fewer candidates is sufficient only 80% of the time. Moreover, if the neural network tool presents to the human annotator only the 5 best alternatives, then the correct tree is listed with probability 87.3%. Using psycholinguistic preferences, this probability decreases to 70.5%.
We can comment on these results in the NLP setting. A general comment concerns the confirmation in our data of the complexity of natural language parsing. In the strongly incremental version adopted in this work, where the problem is coded as the selection of the best incremental tree, the data on the Penn Treebank sample show that the search space has an average branching factor of 56, with a maximum factor of 388. These data confirm the necessity of having some form of sophisticated elaboration of parse decisions.

The rank position of the correct incremental tree is constantly low in the candidate list. Although we could not at the moment develop a deterministic parser with such a predictor, the result is particularly positive, especially if taken in its naive conception. Consider that corpus-based models in the literature usually take into account several cues for reaching a parse decision; on the contrary, in the experiments described here only syntactic knowledge is taken into account. Lexical knowledge, which is at the core of many parsing models (see, e.g., (Srinivas & Joshi, 1999)), is totally neglected here. Connection paths include at most POS tags. On the contrary, most models are trained on relations established between the head words of syntactic phrases; the use of syntactic categories is limited to those cases when data about lexical relations are sparse. Also, semantics is taken into account when available for limited domains, including subcategorization frames and a number of semantic features because of their relevance for disambiguation.
6 Conclusions
This paper has presented a novel methodology for parsing unrestricted texts, based upon the incrementality hypothesis, a widely held assumption about the human parser. A recursive neural network, trained on tree portions extracted from the Penn Treebank via a simulation of the incremental strategy, carries out the parse decisions in case of ambiguity. The decision takes the form of the selection of the correct incremental tree in the list of possible candidates for continuation in left-to-right processing. The learning task is formulated as the identification of the correct incremental tree in the list of candidates; the prediction is a ranking of the candidates according to the conditional probability of being the correct one. The results, although the method currently takes into account only the syntactic structure labeled with nonterminal categories, are very positive. The comparison with well-known parsing preferences in the psycholinguistic literature clearly indicates the superiority of the parse predictions provided by the network. On the architectural side, a novel contribution of this paper is the proposal of a neural network approach for learning to rank candidates. The same architecture can, in principle, be applied to other tasks in which each object is a collection of instances among which a "winner" is to be probabilistically selected.
The future applications of our results to parsing are immediate: a long-term project consists in the development of an efficient incremental parser which is informed by the network architecture in taking decisions about attachment ambiguity; a short-term project consists in the development of an interactive tool which supports human treebank annotators (Lombardo et al., 2000). In the latter case, the software tool parses the sentence from left to right (as humans do), and interacts with the annotator in case of ambiguity. The tool presents the possible alternatives ranked according to the network result, and the annotator chooses the correct one. An immediate improvement to the methodology presented here is the insertion of other features in the definition of the learning task. Corpus-based approaches usually rely on lexical and thematic knowledge for the formulation of parse decisions. Purely syntactic knowledge is empirically demonstrated to be insufficient in a realistic parsing process.
Acknowledgments

We would like to thank Patrick Sturt for providing the processed corpus of sentences used in the experiments, and for comments on an early draft of this paper.
References
Auer, P. (1997). On learning from multi-instance examples: Empirical evaluation of a theoretical approach. In Fisher, D. H. (Ed.), Proc. 14th Int. Conf. Machine Learning, pp. 21-29. Morgan Kaufmann.

Bader, M., & Lasser, I. (1994). German verb-final clauses and sentence processing. In Clifton, C., Frazier, L., & Reyner, K. (Eds.), Perspectives on Sentence Processing. Lawrence Erlbaum Associates.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.

Bianucci, A., Micheli, A., Sperduti, A., & Starita, A. (2000). Application of cascade-correlation networks for structures to chemistry. Applied Intelligence, 12, 115-145.

Collins, M. (1996). A new statistical parser based on bigram lexical dependencies. In Proc. of 34th ACL, pp. 184-191.

Dietterich, T. G., Lathrop, R. H., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2), 31-71.

Eberhard, K. M., Spivey-Knowlton, M. J., Sedivy, J., & Tanenhaus, M. K. (1995). Eye movements as a window into real-time spoken language comprehension in natural contexts. Journal of Psycholinguistic Research, 24, 409-436.

Fodor, J. D., & Ferreira, F. (Eds.). (1998). Reanalysis in Sentence Processing. Kluwer Academic Publishers.

Francesconi, E., Frasconi, P., Gori, M., Marinai, S., Sheng, J., Soda, G., & Sperduti, A. (1997). Logo recognition by recursive neural networks. In Kasturi, R., & Tombre, K. (Eds.), Graphics Recognition: Algorithms and Systems. Springer Verlag.

Frasconi, P., Gori, M., & Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Trans. on Neural Networks, 9(5), 768-786.

Frazier, L. (1987). Syntactic processing: Evidence from Dutch. Natural Language and Linguistic Theory, 5, 519-559.

Frazier, L., & Fodor, J. D. (1978). The sausage machine: A new two-stage parsing model. Cognition, 6, 291-325.

Goller, C. (1997). A Connectionist Approach for Learning Search-Control Heuristics for Automated Deduction Systems. Ph.D. thesis, Tech. Univ. Munich, Computer Science.

Goller, C., & Kuchler, A. (1996). Learning task-dependent distributed structure-representations by backpropagation through structure. In IEEE International Conference on Neural Networks, pp. 347-352.

Hermjakob, U., & Mooney, R. J. (1997). Learning parse and translation decisions from examples with rich context. In Proceedings of ACL97, pp. 482-489.

Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46, 47-75.

Hobbs, J., & Bear, J. (1990). Two principles of parse preference. In Proceedings of COLING90, pp. 162-167.

Kamide, Y., & Mitchell, D. C. (1999). Incremental pre-head attachment in Japanese parsing. Language and Cognitive Processes, 14(5-6), 631-662.

Kimball, J. P. (1973). Seven principles of surface structure parsing in natural language. Cognition, 2, 15-47.

Kolen, J., & Kremer, S. (Eds.). (2000). A Field Guide to Dynamical Recurrent Networks. IEEE Press.

Lombardo, V., Bosco, C., & Vassallo, D. (2000). Treebank annotation and psycholinguistic modeling. Submitted.

Lombardo, V., Lesmo, L., Ferraris, L., & Seidenari, C. (1998). Incremental processing and lexicalized grammars. In Proceedings of the XXI Annual Meeting of the Cognitive Science Society, pp. 621-626.

Lombardo, V., & Sturt, P. (1999). Incrementality and lexicalism: A treebank study. In Stevenson, S., & Merlo, P. (Eds.), Lexical Representations in Sentence Processing. John Benjamins.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313-330.

Marslen-Wilson, W. (1973). Linguistic structure and speech shadowing at very short latencies. Nature, 244, 522-533.

Milward, D. (1995). Incremental interpretation of categorial grammar. In Proceedings of EACL95.

Nagao, M. (1994). Varieties of heuristics in sentence processing. In Current Issues in Natural Language Processing: In Honour of Don Walker. Giardini with Kluwer.

Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 623-641.

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1-2), 77-106.

Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (1995). Backpropagation: The basic theory. In Backpropagation: Theory, Architectures and Applications, pp. 1-34. Lawrence Erlbaum Associates, Hillsdale, NJ.

Sperduti, A., & Starita, A. (1997). Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3).

Srinivas, B., & Joshi, A. (1999). Supertagging: An approach to almost parsing. Computational Linguistics, 25(2), 237-265.

Stabler, E. P. (1994). The finite connectivity of linguistic structure. In Clifton, C., Frazier, L., & Reyner, K. (Eds.), Perspectives on Sentence Processing, pp. 303-336. Lawrence Erlbaum Associates.

Steedman, M. J. (1989). Grammar, interpretation and processing from the lexicon. In Marslen-Wilson, W. M. (Ed.), Lexical Representation and Process, pp. 463-504. MIT Press.

Sturt, P., & Crocker, M. (1996). Monotonic syntactic processing: A cross-linguistic study of attachment and reanalysis. Language and Cognitive Processes, 11(5), 449-494.

Thacher, J. (1973). Tree automata: An informal survey. In Aho, A. (Ed.), Currents in the Theory of Computing, pp. 143-172. Prentice-Hall Inc., Englewood Cliffs.

Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L., & Palmucci, J. (1993). Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2), 359-382.

Yamashita, K. (1994). Processing of Japanese and Korean. Ph.D. thesis, Ohio State University, Columbus, Ohio.