Extracting Semantic Annotations from Legal Texts

(1)

Extracting Semantic Annotations from Legal Texts

L. Lesmo, A. Mazzei and D. P. Radicioni

(2)

Daniele Radicioni - HT2009 - Extracting Semantic Annotations from Legal Texts

research carried out by the Interaction Models Group of the CS Department (University of Turin) in partnership with the Modelling Legal Informatics Resource Group of CIRSFID (University of Bologna)

(3)

outline

• the problem at hand: automating semantic annotation for consolidating legal texts

• the proposed approach

• experimental results and open issues

(4)

definition of the problem

(5)

semantic mark-up

• semantic mark-up is commonly acknowledged as a costly and complex matter

• much work has been carried out in various fields, such as Information Extraction, with the aim of automatically extracting salient information from texts

(6)

restricted domains

• no mature technology to cope with unrestricted language exists yet

• some advances have been obtained over restricted (more regular) domains, such as the legal field

• systems have been built that automatically identify and classify structural portions of legal documents and their intra- and inter-references

• other investigations are being carried out to produce semantic analysis

(7)

XML standards

• various initiatives to devise XML standards for describing legal sources and schemas to identify legal documents

• text editors exist (e.g., NIR NormaEditor) to mark up structural partitions and normative references in a supervised fashion

• unfortunately, the human annotation process is expensive and error-prone

• such efforts will be viable only in conjunction with tools that automatically extract the structural and semantic data from legal texts

(8)

hypertexts and legal texts

• documents containing modificatory provisions are relevant in this perspective, in that they are explicitly concerned with describing how some other legal text (or another part of the current one) has to be modified

“at the article 29, comma 2 of law #212/2007, the words «free press» are substituted with «silence please», ...”.

DOCUMENT 1: modification α

a modification or, in the legal jargon, a modificatory provision, is a change made to one or more clauses within a text or to an entire text

(9)

hypertexts and legal texts

modification β: “at the article 33, comma 1 of law #212/2007, the word «forbidden» is substituted by «allowed», ...”.

modification γ: “at the article 1, of law #55/2009, the word «gate» is repealed.”

DOCUMENT 1: modification α, modification β, modification γ

(10)

hypertexts and legal texts

[Figure: DOCUMENT 1 (modification α, modification β, modification γ, modification ...) linked to the texts it modifies. law #212/2007: "... free press ..." becomes "... silence please ..." and "... forbidden ..." becomes "... allowed ...". law #55/2009: "... gate ..." is repealed.]

(11)

hypertexts and legal texts

[Figure: DOCUMENT 1 (modification α, modification β, modification γ, modification ...), DOCUMENT 2 (modification δ) and DOCUMENT 3 (modification ε) all act on law #55/2009, whose text ends up with canceled, added and modified fragments.]

how can we build the text that counts as law (the consolidated text)?

(12)

consolidation of legal texts

• the consolidated text is the updated version of a normative text, the version embodying the changes

• uncertainty about the effects of normative modifications would undermine the certainty of the law, making it hard to clearly understand what the law is, or which one of several versions of a provision counts as law

[Figure: consolidated law #212/2007, where "free press" is replaced by "silence please" and "forbidden" by "allowed".]

(13)

consolidation of legal texts

• automating the semantic annotation of modificatory clauses and provisions would be of great help in simplifying the legal system and in consolidating texts of law

(14)

semantic annotation of modificatory provisions

(15)

to qualify a modificatory provision

• one has to recognize the following:

- the type of the modification (e.g., repeal, modification, substitution);

- the document being modified (another law, decree, etc.);

- the portion of such document affected by the modification (be it a structural part, such as “article 22”, or a text fragment, such as “the words ‘six months’...”);

• and to generate a set of metadata that compactly describe such modification
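The three pieces of information listed above can be pictured as one small record per modification. The sketch below is only illustrative: the class and field names (`ModificationMetadata`, `target_document`, and so on) are assumptions made for this example and are not taken from the system described in the talk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ModificationType(Enum):
    # Only the example types named on the slide; the full taxonomy is larger.
    REPEAL = "repeal"
    MODIFICATION = "modification"
    SUBSTITUTION = "substitution"


@dataclass(frozen=True)
class ModificationMetadata:
    """Hypothetical compact description of one modificatory provision."""
    mod_type: ModificationType             # the type of the modification
    target_document: str                   # the document being modified
    structural_part: Optional[str] = None  # e.g. "article 22"
    text_fragment: Optional[str] = None    # e.g. "six months"


example = ModificationMetadata(
    mod_type=ModificationType.SUBSTITUTION,
    target_document="law 28/12/2005 # 262",
    structural_part="article 40, comma 1",
    text_fragment="six months",
)
print(example)
```

Making the record immutable (`frozen=True`) is a design choice of the sketch: it lets annotations be hashed and compared as whole values.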

(16)

taxonomy of modificatory provisions

[Figure: taxonomy tree of modificatory provisions. Node labels: MOD. OF THE CONTENT; MOD. OF THE RANGE; MOD. OF THE TIME; NORMATIVE SYSTEM MODIFICATION; TEXT; MEANING; CHANGES OF FORCE; CHANGES OF EFFICACY; REPEAL; SUBSTITUTION; INTEGRATION; RELOCATION; MODIFICATION OF TERMS; INTERPRETATION; MEANING CHANGE; DEROGATION; EXTENSION; ANNULMENT; POSTPONEMENT OF ENTRY INTO FORCE; ENTRY INTO FORCE; PROROGATION OF FORCE; RE-ENACTMENT; POSTPONEMENT OF START OF EFFICACY; SUSPENSION; DISAPPLICATION; PROROGATION OF EFFICACY; RETROACTIVITY; CONVERSION; TRANSPOSE; IMPLEMENT; RATIFICATION; DELEGATION OF POWER; DEREGULATION]

(17)

taxonomy of modificatory provisions

[Figure: taxonomy tree of modificatory provisions. Node labels: MOD. OF THE CONTENT; MOD. OF THE RANGE; MOD. OF THE TIME; NORMATIVE SYSTEM MODIFICATION; TEXT; MEANING; CHANGES OF FORCE; CHANGES OF EFFICACY; REPEAL; SUBSTITUTION; INTEGRATION; RELOCATION; MODIFICATION OF TERMS; INTERPRETATION; MEANING CHANGE; DEROGATION; EXTENSION; ANNULMENT; POSTPONEMENT OF ENTRY INTO FORCE; ENTRY INTO FORCE; PROROGATION OF FORCE; RE-ENACTMENT; POSTPONEMENT OF START OF EFFICACY; SUSPENSION; DISAPPLICATION; PROROGATION OF EFFICACY; RETROACTIVITY; CONVERSION; TRANSPOSE; IMPLEMENT; RATIFICATION; DELEGATION OF POWER]

(18)

input representation

• let us consider the following input sentence:

“All'articolo 40, comma 1, della legge 28 dicembre 2005, n. 262, le parole: «sei mesi» sono sostituite dalle seguenti: «dodici mesi».”

“At article 40, comma 1 of law 28/12/2005 # 262, the words: «six months» are substituted by the following: «twelve months».”

(19)

<corpo>
  All'
  <rif id="rif9" xlink:href="urn:nir:[...]">
    articolo 40, comma 1, della legge 28 dicembre 2005, n. 262
  </rif>
  , le parole:
  <virgolette tipo="parola" id="vir1">
    "sei mesi"
  </virgolette>
  sono sostituite dalle seguenti:
  <virgolette tipo="parola" id="vir2">
    "dodici mesi"
  </virgolette>
  .
</corpo>

Input to the parser:

All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.

At rif9, the words: vir1 are substituted by the following ones: vir2.

<dsp:sostituzione>
  <dsp:pos xlink:href="#art1-com4" />
  <dsp:norma xlink:href="urn:nir:[...]">
    <dsp:pos xlink:href="#rif9" />
  </dsp:norma>
  <dsp:novella>
    <dsp:pos xlink:href="#vir2" />
  </dsp:novella>
  <dsp:novellando>
    <dsp:pos xlink:href="#vir1" />
  </dsp:novellando>
</dsp:sostituzione>

• dsp:sostituzione: modification of type substitution

• dsp:norma: position of the norm being modified

• dsp:novella: quoted text, the new text

• dsp:novellando: quoted text, the text being modified


(20)

input: XML NIR

<corpo>
  All'
  <rif id="rif9" xlink:href="urn:nir:[...]">
    articolo 40, comma 1, della legge 28 dicembre 2005, n. 262
  </rif>
  , le parole:
  <virgolette tipo="parola" id="vir1">
    "sei mesi"
  </virgolette>
  sono sostituite dalle seguenti:
  <virgolette tipo="parola" id="vir2">
    "dodici mesi"
  </virgolette>
  .
</corpo>

Input to the parser:

All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.

At rif9, the words: vir1 are substituted by the following ones: vir2.

<dsp:sostituzione>
  <dsp:norma xlink:href="urn:nir:[...]">
    <dsp:pos xlink:href="#rif9" />
  </dsp:norma>
  <dsp:novella>
    <dsp:pos xlink:href="#vir2" />
  </dsp:novella>
  <dsp:novellando>
    <dsp:pos xlink:href="#vir1" />
  </dsp:novellando>
</dsp:sostituzione>

• dsp:sostituzione: modification of type substitution

• dsp:novella: quoted text, the new text

• dsp:novellando: quoted text, the text being modified

Figure 3: Left: XML encoding the sentence (extracted from the tag <corpo>) “All’articolo 40 [, comma 1, della legge 28 dicembre 2005, n. 262,] le parole: ‘sei mesi’ sono sostituite dalle seguenti: ‘dodici mesi’.” (At article 40 [...] the words: ‘six months’ are substituted by the following: ‘twelve months’.). Top: the meta-description of the modification. Bottom: the content of the tag <corpo>, as it is given in input to the parser after the rewriting of constants rif (i.e., reference) and virgolette (quotes).

FRAME sostituire (substitute)

legalCategory: SUBSTITUTION

referenceDocument: RIF9

modifiedText: VIR1

modifyingText: VIR2

Tests on the dependents of the root verb sostituite: SUBJ contains VIR? OBJ contains VIR? MODs contain RIF?

Figure 4: Three tests are performed on the parse tree, and the appropriate slots are filled.

Further rules are designed to account for complex linguistic constructions, such as coordination. Let us consider the following sentence: All’articolo 1, comma 2, sono soppresse le lettere d) ed f); [At article 1, comma 2, the letters d) and f) are deleted;]. It is conveniently converted into: All’RIF16, sono soppresse le RIF34) ed RIF35), [At RIF16, the RIF34) and RIF35) are deleted]. The TUP recognizes coordination and marks the edge between the conjuncts. The semantic frame corresponding to the deletion of RIF34 is filled similarly to the case of the substitution; additionally, the semantic interpreter recognizes the presence of a conjunct. In fact, the reference RIF34 has a descendant node RIF35 in a subtree that is connected by the COORD-labeled edge.

Cases involving two conjuncts, such as RIF34 and RIF35, are rather straightforward because we are handling homogeneous objects (both are RIFs), and because there is no ambiguity about the fact that RIF34 and RIF35 are coordinated. To extend the coverage, we consider some further sorts of coordination:

• Al RIF22 le parole: VIR4 e VIR7 sostituiscono le parole: VIR8 e VIR11, rispettivamente; (rough translation: At the RIF22 the words: VIR4 and VIR7 substitute the words: VIR8 and VIR11, respectively;);

• Al RIF1 le parole VIR1 sostituiscono VIR4 e VIR5 e al RIF2 le VIR12 sostituiscono VIR13, VIR14 e VIR15 (rough translation: At RIF1, the words VIR1 replace the VIR4 and VIR5, and at RIF2 the VIR12 replaces the VIR13, VIR14 and VIR15).

As a general strategy, to handle such cases we bound the combinatorial explosion that threatens the computation by assuming that only homogeneous elements can be coordinated (that is, a VIR cannot be coordinated with a RIF).
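The homogeneity assumption can be illustrated with a small check over the constants produced by preprocessing. This is a sketch only: the function names and the token pattern (`RIF` or `VIR` followed by digits) are assumptions for the example, not the rules actually implemented in the system.

```python
import re
from typing import List, Optional

# Assumed shape of the rewritten constants, e.g. "RIF34" or "VIR4".
_CONSTANT = re.compile(r"(RIF|VIR)\d+")


def constant_kind(token: str) -> Optional[str]:
    """Return "RIF" or "VIR" for a rewritten constant, None for anything else."""
    match = _CONSTANT.fullmatch(token)
    return match.group(1) if match else None


def is_homogeneous(conjuncts: List[str]) -> bool:
    """True when every conjunct is a constant and all share the same kind."""
    kinds = {constant_kind(token) for token in conjuncts}
    return len(kinds) == 1 and None not in kinds


print(is_homogeneous(["RIF34", "RIF35"]))           # both are RIFs
print(is_homogeneous(["VIR13", "VIR14", "VIR15"]))  # all are VIRs
print(is_homogeneous(["VIR4", "RIF22"]))            # mixed kinds are rejected
```

Rejecting mixed candidates up front means that only same-kind groupings ever need to be enumerated, which is the point of the bound stated above.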

3. EXPERIMENTATION

From a practical viewpoint, our research is intended to assist human annotators in identifying legal modificatory provisions and qualifying them with as many details as possible. At the current stage of development we deal with modificatory provisions of either integration, substitution or deletion type. For the experimentation we used a dataset composed of 180 files, containing overall 11,944 XML corpo elements (see Section 2) and 2,306 modificatory provisions (namely, 809 integrations, 894 substitutions and 603 deletions) hand-annotated by the CIRSFID legal experts.⁶ The system accuracy is computed as the percentage of modificatory provisions correctly computed as the tuple ⟨type, position, novella, novellando⟩, where type is one in {integration, substitution, deletion}, position is the constant identifying the position in a given document where the modification occurs, and novella and novellando are both excerpts of quoted text (see Section 2). Our system obtains 83.0% precision and 71.7% recall. In particular, the figures we obtain for the recall on integration, substitution and deletion are 77.8%, 77.4% and 55.1%, respectively.

The overall result gives an estimate of the robustness of the approach, and of the implemented system as well: as far as we know, no experimentation has been conducted on datasets as large as the present one. Considering that the present experimentation was performed with a preliminary release of the system, in the annotation of integrations and substitutions we obtain encouraging results, while only a poor accuracy is achieved in the case of deletions.

Most errors fall into one of the following classes: 1. preprocessing errors, such as misspelled words, or errors in the XML annotation; 2. discourse manager errors, occurring when chains of complex phrases are met; 3. parser errors, e.g., too complex syntactic structures, or unknown words (such as the verb anteporre, to place sth. before); 4. semantic interpretation errors, in cases in which the semantic interpreter is not able to extract the salient information.

⁶ The dataset is available for download at the URL: http://www.di.unito.it/~radicion/datasets/hypertext09/

• position of the norm being modified

• quoted text (novellando): text being modified

• quoted text (novella): new text

• corpo: Italian for body

(21)

output: generated metadata

<dsp:sostituzione>
  <dsp:norma xlink:href="urn:nir:[...]">
    <dsp:pos xlink:href="#rif9" />
  </dsp:norma>
  <dsp:novella>
    <dsp:pos xlink:href="#vir2" />
  </dsp:novella>
  <dsp:novellando>
    <dsp:pos xlink:href="#vir1" />
  </dsp:novellando>
</dsp:sostituzione>

• dsp:sostituzione: modification of type substitution

• dsp:novella: quoted text, the new text

• dsp:novellando: quoted text, the text being modified
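As an illustration of how such metadata could be emitted once the position and the two quoted texts are known, the sketch below builds the element as a plain string. The function name and its parameters are hypothetical; the element names are copied from the slide, and no namespace declarations are produced because the slide does not show them.

```python
from xml.sax.saxutils import quoteattr


def substitution_metadata(norm_urn: str, position_id: str,
                          novella_id: str, novellando_id: str) -> str:
    """Serialize one substitution as a dsp:sostituzione fragment (sketch)."""

    def pos(ref_id: str) -> str:
        # quoteattr escapes the value and adds the surrounding quotes.
        return f"<dsp:pos xlink:href={quoteattr('#' + ref_id)} />"

    lines = [
        "<dsp:sostituzione>",
        f"  <dsp:norma xlink:href={quoteattr(norm_urn)}>",
        f"    {pos(position_id)}",
        "  </dsp:norma>",
        "  <dsp:novella>",
        f"    {pos(novella_id)}",
        "  </dsp:novella>",
        "  <dsp:novellando>",
        f"    {pos(novellando_id)}",
        "  </dsp:novellando>",
        "</dsp:sostituzione>",
    ]
    return "\n".join(lines)


# The URN is elided on the slide, so the same placeholder is used here.
print(substitution_metadata("urn:nir:[...]", "rif9", "vir2", "vir1"))
```

A real serializer would more likely use an XML library with the proper namespaces; string building is used here only to keep the example self-contained.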

(22)

three-step process

• input preprocessing: we retrieve the possible locations of modificatory provisions within the document, and we simplify the input sentences, so as to prune text fragments that do not convey relevant information

• in the second step we perform the syntactic analysis of the retrieved sentences

• in the third step we semantically annotate the retrieved provisions through a tree matching approach

(23)

architecture of the system

[Figure: pipeline of the system. step 1: INPUT PREPROCESSING turns XML NIR into the input text. step 2: MORPHOLOGY & POS-TAGGING (resources: morphological tables, dictionary) produces a sequence of lexical items; CHUNKING (resource: chunking rules) produces verbs + partial chunks; COORDINATION ANALYSIS produces verbs + final chunks; ANALYSIS OF VERBAL DEPENDENTS (resource: subcategorization frames) produces the parse tree. step 3: the SEMANTIC INTERPRETER (resource: modifications taxonomy) processes the parse tree.]

(24)

step 1: preprocessing step

• based on the XML structure, we retain the text excerpts contained between the tags <corpo> (Italian for body), which is where the modifications may be found

• we then rewrite the text tagged by <rif> (short for reference) and <virgolette> (Italian for quotes) with their respective IDs, thereby identifying the position where a modification occurs and the quoted text fragments

INPUT TO THE PARSER

All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.

At rif9, the words: vir1 are substituted by the following ones: vir2.
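The rewriting step can be sketched with a few regular expressions. This is a simplification under stated assumptions: the function name `preprocess` is made up, regular expressions stand in for proper XML parsing, and the whitespace clean-up is chosen only to reproduce the parser input shown on the slide.

```python
import re
from typing import List

# Text between <corpo> tags is where modifications may be found.
CORPO = re.compile(r"<corpo>(.*?)</corpo>", re.DOTALL)
# A <rif> or <virgolette> element carrying an id attribute.
TAGGED = re.compile(
    r'<(rif|virgolette)\b[^>]*\bid="([^"]+)"[^>]*>.*?</\1>', re.DOTALL
)


def preprocess(xml_text: str) -> List[str]:
    """Return one simplified sentence per <corpo> element."""
    sentences = []
    for body in CORPO.findall(xml_text):
        text = TAGGED.sub(lambda m: m.group(2), body)  # element -> its id
        text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
        text = re.sub(r"\s+([,.;:])", r"\1", text)     # no space before punctuation
        text = re.sub(r"'\s+", "'", text)              # re-attach elided articles
        sentences.append(text)
    return sentences


sample = """<corpo>
All'
<rif id="rif9" xlink:href="urn:nir:[...]">
articolo 40, comma 1, della legge 28 dicembre 2005, n. 262 </rif>
, le parole:
<virgolette tipo="parola" id="vir1">
"sei mesi"
</virgolette>
sono sostituite dalle seguenti:
<virgolette tipo="parola" id="vir2">
"dodici mesi"
</virgolette>
.</corpo>"""

print(preprocess(sample))
# ["All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2."]
```

Replacing each reference and quotation by a short constant keeps long citations and quoted passages out of the parser's way while preserving a handle back to the original element.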

(25)

step 2: parsing step

• the Turin University Parser (TUP) is a rule-based parser that returns the syntactic structure of sentences in dependency format

• dependencies are binary relations (e.g., the subject relation) between a dominant word (the head, e.g., the verb) and a dominated word (the dependent, e.g., the noun acting as subject)

• after two preliminary steps (morphological analysis and part-of-speech (PoS) tagging), the sequence of words goes through three phases: chunking, syntactic analysis of the coordination, and verbal subcategorization

(26)

resulting parse tree

8 sostituite
  RMOD: 1 All'-A
    PREP-ARG: 1.1 All'-IL
      DET-ARG: 2 RIF9
  SEPARATOR: 3 ,
  OBJ: 4 le
    DET-ARG: 5 parole
      NOUN-APPOSITION: 6 VIR1
  AUX: 7 sono
  SUBJ: 9 dalle-DA
    PREP-ARG: 9.1 dalle-IL
      DET-ARG: 11 VIR2
        ADJC-RMOD: 10 seguenti
  END: 12 .

INPUT TO THE PARSER

All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.

At rif9, the words: vir1 are substituted by the following ones: vir2.

(27)

step 3: semantic interpretation

• we have mentioned that the relevant pieces of information to be identified to qualify a modificatory provision are

1. the modification type;

2. the position (be it specified as a sequence of words or relative to the document);

3. the quoted text, both in the form of the textual fragment being inserted/added (novella), and the text fragment affected by the normative modification (novellando)

(28)

step 3: semantic interpretation

• modifications are represented by means of semantic frames, composed of slots, such as legalCategory, referenceDocument, modifyingText and modifiedText

• frames are associated with the legal categories: for instance, the verbs belonging to the legalCategory substitution, such as replace, substitute, change, modify, etc., all have the same slots

• in this way it is possible to add further verbs to the legal categories by taking advantage of their shared semantic frames

(29)

step 3: semantic interpretation

• the semantic interpreter is a rule-based algorithm. the rules handle two sorts of information: the parse trees, and the domain knowledge encoding the legal modifications taxonomy

• a main rule tests whether the root node of the syntactic tree is a verb, and whether it belongs to the modificatory provisions taxonomy. e.g., given the root verb insert, we take the verb lemma to insert, search for it in the knowledge base, and find that it is a possible instantiation of the legalCategory integration, together with the verbs to add, to incorporate, etc.
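The main rule amounts to a lookup of the root lemma in the taxonomy. The sketch below uses a hypothetical in-memory dictionary as the knowledge base, filled only with the English verb glosses mentioned in the slides; the actual system works on Italian lemmas and on a richer taxonomy.

```python
from typing import Dict, Optional, Set

# Hypothetical knowledge base: legalCategory -> verb lemmas that instantiate it.
TAXONOMY: Dict[str, Set[str]] = {
    "integration": {"insert", "add", "incorporate"},
    "substitution": {"replace", "substitute", "change", "modify"},
}


def legal_category(root_pos: str, root_lemma: str) -> Optional[str]:
    """Return the legalCategory of the root verb, or None if the rule fails."""
    if root_pos != "VERB":
        return None  # the root of the parse tree must be a verb
    for category, lemmas in TAXONOMY.items():
        if root_lemma in lemmas:
            return category
    return None  # a verb, but not one listed in the taxonomy


print(legal_category("VERB", "insert"))      # integration
print(legal_category("VERB", "substitute"))  # substitution
print(legal_category("NOUN", "insert"))      # None
```

Because all verbs of a category share one frame, extending coverage to a new verb only requires adding its lemma to the corresponding set.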

(30)

step 3: tree matching

• the next step is inspecting the dependents of the verb, looking for the position of the modification and for the novella and novellando arguments (modifyingText and modifiedText, respectively)

• modificatory provisions often span different adjacent phrases (e.g., verbal, nominal, adjectival, prepositional, adverbial phrases) and adjacent sentences, so that it is necessary to put together such fragments in order to collect all pieces of information

• to do so, we devised a discourse manager

(31)

step 3: discourse manager

• the discourse manager interacts with the parser in two phases. first, each phrase is analyzed by the parser to discover the type of the phrase; second, the discourse manager puts the phrases together according to some schema (e.g., prepositional phrase (PP) + verbal phrase (VP)), sending the resulting phrase to the parser

• given the sentence ‘at the article 127: the words «six months» are replaced by the words: «twelve months»’, the discourse manager

• sends the text fragments separately to the parser;

• then, considering the roots of the parse trees, it puts them together according to the schema [PP + VP + NP], and sends them to the parser again

(32)

step 3: tree matching

• in this setting, filling a modification frame corresponds to searching for a parse tree that fits the slots of a given type, and then to finding an appropriate mapping between (tree) dependents and (frame) slots

8 sostituite
  RMOD: 1 All'-A
    PREP-ARG: 1.1 All'-IL
      DET-ARG: 2 RIF9
  SEPARATOR: 3 ,
  OBJ: 4 le
    DET-ARG: 5 parole
      NOUN-APPOSITION: 6 VIR1
  AUX: 7 sono
  SUBJ: 9 dalle-DA
    PREP-ARG: 9.1 dalle-IL
      DET-ARG: 11 VIR2
        ADJC-RMOD: 10 seguenti
  END: 12 .

Figure 2: Syntactic analysis of the sentence “All’RIF9, le parole: VIR1 sono sostituite dalle seguenti: VIR2” (At RIF9 the words: VIR1 are substituted by the following: VIR2).

Semantic Interpretation

The semantic interpreter is a rule-based algorithm. The rules handle two sorts of information: the parse trees, and the domain knowledge encoding the legal modifications taxonomy.

To illustrate the main traits of modificatory provisions and how they are extracted, let us consider the modification in Fig. 3. To qualify a modification, it is necessary to extract information about the following points. i) The modification type: we presently consider only integration, substitution and deletion, but in the taxonomy there are many other classes. ii) The position, which can be specified as a sequence of words (e.g., “before the words: ‘six months’”) or, alternatively, relative to the document (e.g., “at article 40”). iii) The quoted text, which can be of two types. Legal experts denote the textual fragment being inserted/added as the novella of a modificatory provision, while they denote by novellando the text fragment affected by the normative modification. For the sake of clarity, we will borrow that terminology. The quoted text identifies both novella and novellando. However, the structure of normative modifications can differ. In fact, only the novellando may be present (e.g., “[at the end of article 40] the words: ‘six months’ are deleted”), or only the novella (e.g., “[at the end of article 40] the words ‘goodnight’ are added”).

After having described the meaningful elements of normative modifications, let us consider how the components of modificatory provisions are encoded within the system. Modifications are represented by means of semantic frames, composed of slots [1], such as legalCategory, referenceDocument, modifyingText and modifiedText. Frames are associated with the legal categories: for instance, the verbs belonging to the legalCategory substitution, such as replace, substitute, change, modify, etc., all have the same slots. In this way we can add further verbs to the legal categories by taking advantage of their shared semantic frames. Integration, substitution and deletion can differ as regards the slots corresponding to the quoted text. For example, a deletion possibly has a novellando (in this case, the slot modifiedText should be filled) but no novella, and accordingly no modifyingText slot. Vice versa, both substitution and integration can have either novella or novellando, both of them or none.

A main rule tests whether the root node of the syntactic tree is a verb, and whether it belongs to the modificatory provisions taxonomy. For example, given the root verb, we take the verb lemma inserire (insert), we search for it in the knowledge base, and find that it is a possible instantiation of the legalCategory integration, together with the verbs aggiungere (add), integrare (incorporate), etc. In this case we have a fundamental cue that the sentence being analyzed contains a modificatory provision. Also, based on the taxonomy, we are informed about the legalCategory of the modification at hand. The next step is inspecting the dependents of the verb, looking for the position of the modification, and for the novella and novellando arguments (modifyingText and modifiedText, respectively).

Since modificatory provisions often span different adjacent phrases (e.g., verbal, nominal, adjectival, prepositional, adverbial phrases), it is necessary to put together such fragments in order to collect all pieces of information. To do so we implemented a discourse manager. Moreover, modificatory provisions can span multiple sentences, which considerably increases the difficulty of both the syntactic analysis and the semantic interpretation. Punctuation (such as colon, semicolon and period) may also determine the need to consider multiple sentences. In particular, the discourse manager interacts with the parser in two phases. First, each phrase is analyzed by the parser to discover the type of the phrase; second, the discourse manager puts the phrases together according to some schema (e.g., prepositional phrase + verbal phrase), sending the resulting phrase to the parser. For example, the colon is commonly employed in locutions such as “at the article 127: the words ‘six months’ are replaced by the words: ‘twelve months’”. Initially, the discourse manager sends the text fragments separately to the parser. Then, considering the roots of the resulting parse trees, the discourse manager puts them together according to the schema prepositional phrase + verbal phrase + noun phrase and sends them to the parser again.

In this setting, filling a modification frame corresponds to searching for a parse tree that fits the slots of a given type, and then to finding an appropriate mapping between (tree) dependents and (frame) slots.

Table 1: Example of tests on the verb dependents.

IF

- the subtree attached to the verb by a RMOD label (that is, modifier) contains a RIF₁ constant; AND

- the subtree attached to the verb by a SUBJ label (that is, subject) contains a VIR₁ constant; AND

- the subtree attached to the verb by an OBJ label (that is, object) contains a VIR₂ constant

THEN

- fill referenceDocument with the RIF₁; AND

- fill modifiedText with the VIR₂; AND

- fill modifyingText with the VIR₁

Tree matching. To perform the tree matching, the other rules test the content of the verb arguments and the verb modifiers to fill the slots of the current frame. In particular, the rules check whether a constant such as RIF or VIR is present in syntactic arguments like subject and object, or in any modifier. For example, the syntactic tree corresponding to the substitution modification in Fig. 2 has as root node the verb “sostituire” (substitute). This verb is present in the knowledge base, which causes a frame associated with the legalCategory substitution to be instantiated. Subsequently, a set of tests is executed on the verb dependents (i.e., the children nodes) to fill the appropriate slots (Table 1). The rule maps a syntactic pattern onto a set of semantic slots, performing a test on the subtrees rooted in the subject, in the object and in the modifiers (Fig. 4).
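The rule of Table 1 can be sketched as a search for constants in the subtrees below the root verb. Everything named here is an assumption made for illustration: the `Node` class, the helper functions, and the attachment of the RMOD, OBJ and SUBJ edges in the example tree, which follows one reading of Figure 2 (OBJ over "le parole VIR1", SUBJ over "dalle seguenti VIR2").

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    word: str
    label: str = "ROOT"  # label of the edge linking this node to its head
    children: List["Node"] = field(default_factory=list)


def find_constant(node: Node, kind: str) -> Optional[str]:
    """Depth-first search for the first constant of the given kind."""
    if re.fullmatch(kind + r"\d+", node.word):
        return node.word
    for child in node.children:
        found = find_constant(child, kind)
        if found is not None:
            return found
    return None


def fill_substitution_frame(root: Node) -> Optional[Dict[str, str]]:
    """Apply the three tests of Table 1 to the dependents of the root verb."""
    by_label = {child.label: child for child in root.children}
    if not all(label in by_label for label in ("RMOD", "SUBJ", "OBJ")):
        return None
    rif = find_constant(by_label["RMOD"], "RIF")
    modifying = find_constant(by_label["SUBJ"], "VIR")
    modified = find_constant(by_label["OBJ"], "VIR")
    if rif is None or modifying is None or modified is None:
        return None
    return {
        "legalCategory": "substitution",
        "referenceDocument": rif,
        "modifyingText": modifying,
        "modifiedText": modified,
    }


tree = Node("sostituite", children=[
    Node("All'", "RMOD", [Node("RIF9", "DET-ARG")]),
    Node("le", "OBJ", [Node("parole", "DET-ARG", [Node("VIR1", "NOUN-APPOSITION")])]),
    Node("sono", "AUX"),
    Node("dalle", "SUBJ", [Node("VIR2", "DET-ARG", [Node("seguenti", "ADJC-RMOD")])]),
])
print(fill_substitution_frame(tree))
```

With this tree the frame is filled with RIF9 as referenceDocument, VIR2 as modifyingText (the novella) and VIR1 as modifiedText (the novellando), in line with the metadata generated for the example sentence.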

(33)

step 3: tree matching

• the rule maps a syntactic pattern onto a set of semantic slots, performing a test on the subtrees rooted in the subject, in the object, and in the modifiers

[Figure: FRAME sostituire (substitute), with slots legalCategory: SUBSTITUTION; referenceDocument: RIF9; modifiedText: VIR1; modifyingText: VIR2. The slots are filled by testing the dependents of the root verb sostituite: SUBJ contains VIR? OBJ contains VIR? MODs contain RIF?]

(34)

experimental results

(35)

dataset

• at the current stage of development we deal with modificatory provisions of either integration, substitution or deletion type

• the dataset was composed of 180 files, containing overall 11,944 XML corpo elements and 2,306 modificatory provisions (809 integrations, 894 substitutions and 603 deletions) hand-annotated by the CIRSFID legal experts

(36)

accuracy metrics

• the system accuracy is computed as the percentage of modificatory provisions correctly computed as the tuple <type, position, novella, novellando>, where

• type is one in {integration, substitution, deletion},

• position is the constant identifying the position in a given document where the modification occurs, and

• novella and novellando are both excerpts of quoted text
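Under this definition a provision counts as correct only when all four components of the tuple match the hand annotation. The following sketch shows one way to compute precision and recall on that basis; the function name and the toy annotations are invented for illustration and are not data from the experiment.

```python
from typing import Iterable, Tuple

# (type, position, novella, novellando); an empty string marks an absent text.
Annotation = Tuple[str, str, str, str]


def precision_recall(predicted: Iterable[Annotation],
                     gold: Iterable[Annotation]) -> Tuple[float, float]:
    """Exact-match precision and recall over whole annotation tuples."""
    predicted_set, gold_set = set(predicted), set(gold)
    correct = len(predicted_set & gold_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy data: the second prediction picks the wrong novellando.
gold = [
    ("substitution", "rif9", "vir2", "vir1"),
    ("deletion", "rif16", "", "vir3"),
]
predicted = [
    ("substitution", "rif9", "vir2", "vir1"),
    ("deletion", "rif16", "", "vir4"),
]
precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # 0.5 0.5
```

Exact matching is strict: a prediction with the right type and position but a wrong quoted text contributes nothing to either figure.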

(37)

results

• we obtain 83.0% precision and 71.7% recall

• the figures we obtain for the recall on integration, substitution and deletion are 77.8%, 77.4% and 55.1%, respectively
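As a consistency check, the overall recall can be recovered from the per-type figures, assuming it is the mean of the per-type recalls weighted by the number of hand-annotated provisions of each type (809, 894 and 603, as reported for the dataset). The variable names below are arbitrary.

```python
# Hand-annotated provisions per type and the reported per-type recall.
counts = {"integration": 809, "substitution": 894, "deletion": 603}
recall = {"integration": 0.778, "substitution": 0.774, "deletion": 0.551}

total = sum(counts.values())
overall = sum(counts[kind] * recall[kind] for kind in counts) / total

print(total)             # 2306
print(f"{overall:.1%}")  # 71.7%
```

The weighted mean agrees with the reported 71.7%, and it makes visible how much the weaker deletion figure pulls the overall recall down.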

(38)

errors analysis

• most errors fall into a few main classes:

• preprocessing errors, such as misspelled words, or errors in the XML annotation;

• discourse manager errors, occurring when chains of complex phrases are met;

• parser errors, e.g., too complex syntactic structures, or unknown words (such as the verb anteporre, to place sth. before);

• semantic interpretation errors, in cases in which the semantic interpreter is not able to extract the salient information.

(39)

errors analysis

• preprocessing errors are by far the most frequent ones: in particular, the system skipped a large number of corpo elements (namely, 1,772 out of 11,944, i.e., 14.8% of the text fragments)

• if we restrict to the modificatory provisions that are actually analyzed by the parser, the recall of the system rises to 82.0% (namely 85.1%, 85.7% and 70.5% on integrations, substitutions and deletions, respectively)
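The share of skipped fragments follows directly from the two counts on the slide; the short computation below only restates that arithmetic, with arbitrary variable names.

```python
skipped, total = 1772, 11944  # corpo elements skipped / in the dataset

skip_rate = skipped / total
analyzed = total - skipped

print(f"{skip_rate:.1%} skipped, {analyzed} corpo elements analyzed")
# 14.8% skipped, 10172 corpo elements analyzed
```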

(40)

future work

• improving the pre-processing filtering phase

• tuning the set of rules for the tree matching step, also considering ML approaches such as SSL

• extending the system’s coverage to further types of modificatory provisions

(41)

• thank you for your attention

• please address correspondence to radicion@di.unito.it

(42)

(43)

(44)

unused slides

(45)

hypertexts and legal texts

“at the article 29, comma 2 of law #212/2007, the words «free press» are substituted with «silence please», ...”.

[Figure: modification α in DOCUMENT 1 acts on DOCUMENT 2, where "... free press ..." becomes "... silence please ...".]

(46)

hypertexts and legal texts

• legal texts often refer either to other legal texts or to another part of the same document in a non-sequential fashion

• hence, legal texts can be considered as special cases of hypertext, i.e., a machine-readable text that is not sequential but is organized so that related items of information are connected [wordnet: hypertext]

(47)

step 3: semantic interpretation

• however, the structure of normative modifications can differ in a number of ways

• for example, only the novellando may be present, e.g., “at the end of article 40 the words: ‘six months’ are deleted”,

• or only the novella may be present, e.g., “[at the end of article 40] the words ‘goodnight’ are added”.