
Extracting Semantic Annotations from Legal Texts

Academic year: 2022

(1)

Extracting Semantic Annotations from Legal Texts

L. Lesmo, A. Mazzei and D. P. Radicioni

(2)

Daniele Radicioni - HT2009 - Extracting Semantic Annotations from Legal Texts

research carried out by the Interaction Models Group of the CS Department (University of Turin) in partnership with the Modelling Legal Informatics Resource Group of Cirsfid (University of Bologna)


(3)

outline

• the problem at hand: automating semantic annotation for consolidating legal texts

• the proposed approach

• experimental results and open issues

(4)

definition of the problem

(5)

semantic mark-up

semantic mark-up is commonly acknowledged as a costly and complex matter

much work has been carried out in various fields, such as Information Extraction, with the aim of automatically extracting salient information from texts

(6)


restricted domains

no mature technology to cope with unrestricted language exists yet

some advances have been obtained over restricted (more regular) domains, such as the legal field

systems have been built that automatically identify and classify structural portions of legal documents and their intra- and inter-references

other investigations are being carried out to produce semantic analysis


(7)

XML standards

various initiatives to devise XML standards for describing legal sources and schemas to identify legal documents

text editors exist (e.g., NIR NormaEditor) to mark up structural partitions and normative references in a supervised fashion

unfortunately, the human annotation process is expensive and error-prone

such efforts will be viable only in conjunction with tools that automatically extract the structural and semantic data from legal texts

(8)


hypertexts and legal texts

documents containing modificatory provisions are relevant in this perspective, in that they are explicitly concerned with describing how some other legal text (or another part of the current one) has to be modified

DOCUMENT 1, modification α: “at the article 29, comma 2 of law #212/2007, the words «free press» are substituted with «silence please», ...”

a modification or, in the legal jargon, a modificatory provision, is a change made to one or more clauses within a text or to an entire text

(9)

hypertexts and legal texts

modification β: “at the article 33, comma 1 of law #212/2007, the word «forbidden» is substituted by «allowed», ...”

modification γ: “at the article 1, of law #55/2009, the word «gate» is repealed.”

DOCUMENT 1: modification α, modification β, modification γ

(10)

hypertexts and legal texts

DOCUMENT 1: modification α, modification β, modification γ, modification ...

law #212/2007, before: ... free press ... forbidden ...

law #212/2007, after: ... silence please ... allowed ...

law #55/2009, before: ... gate ...

law #55/2009, after: ... gate (repealed) ...

(11)

hypertexts and legal texts

DOCUMENT 1: modification α, modification β, modification γ, modification ...

DOCUMENT 2: modification δ

DOCUMENT 3: modification ε

law #55/2009: ... gate ... canceled ... added ... modified text ...

how can we build the text that counts as law (the consolidated text)?

(12)


consolidation of legal texts

the consolidated text is the updated version of a normative text, the version embodying the changes

uncertainty about the effects of normative modifications would undermine the certainty of the law, making it hard to understand clearly what the law is, or which one of several versions of a provision counts as law

law #212/2007 (consolidated): ... «free press» replaced by «silence please» ... «forbidden» replaced by «allowed» ...

(13)

consolidation of legal texts

• automating the process of semantically annotating modificatory clauses and provisions would be of great help in simplifying the legal system and in consolidating texts of law
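To make the idea of consolidation concrete, here is a minimal Python sketch that applies already-annotated modifications to toy law texts. It is only an illustration under stated assumptions: the `Modification` record, the `consolidate` function and the sample texts are hypothetical (only the quoted words come from the slides), and real consolidation would operate on structured documents rather than on plain strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Modification:
    kind: str                       # "substitution" or "repeal"
    target: str                     # key of the law being modified
    old_text: str                   # words affected by the modification
    new_text: Optional[str] = None  # replacing words (substitution only)


def consolidate(laws: Dict[str, str], mods: List[Modification]) -> Dict[str, str]:
    """Return a copy of `laws` with every modification applied in order."""
    result = dict(laws)
    for mod in mods:
        text = result[mod.target]
        if mod.old_text not in text:
            raise ValueError(f"{mod.old_text!r} not found in {mod.target}")
        if mod.kind == "substitution":
            if mod.new_text is None:
                raise ValueError("a substitution needs new_text")
            text = text.replace(mod.old_text, mod.new_text, 1)
        elif mod.kind == "repeal":
            text = text.replace(mod.old_text, "", 1)
        else:
            raise ValueError(f"unsupported modification type: {mod.kind}")
        result[mod.target] = " ".join(text.split())  # tidy leftover spaces
    return result


# Toy texts: only the quoted words come from the slides, the rest is invented.
laws = {
    "law 212/2007": "art. 29: free press; art. 33: entry forbidden",
    "law 55/2009": "art. 1: main gate entrance",
}
mods = [
    Modification("substitution", "law 212/2007", "free press", "silence please"),
    Modification("substitution", "law 212/2007", "forbidden", "allowed"),
    Modification("repeal", "law 55/2009", "gate"),
]
consolidated = consolidate(laws, mods)
print(consolidated["law 212/2007"])
print(consolidated["law 55/2009"])
```

The sketch assumes that the hard part, namely recognizing type, target and quoted text of each modification, has already been done; that recognition is what the approach in these slides aims to automate.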

(14)

semantic annotation of modificatory provisions

(15)

to qualify a modificatory provision one has to recognize the following:

- the type of the modification (e.g., repeal, modification, substitution);

- the document being modified (another law, decree, etc.);

- the portion of such document affected by the modification (be it a structural part, such as “article 22”, or a text fragment, such as “the words ‘six months’...”);

and to generate a set of metadata that compactly describe such modification

(16)

taxonomy of modificatory provisions

MOD. OF THE CONTENT
  TEXT: REPEAL, SUBSTITUTION, INTEGRATION, RELOCATION
  MEANING: MODIFICATION OF TERMS, INTERPRETATION, MEANING CHANGE

MOD. OF THE RANGE: DEROGATION, EXTENSION

MOD. OF THE TIME
  CHANGES OF FORCE: ANNULMENT, POSTPONEMENT OF ENTRY INTO FORCE, ENTRY INTO FORCE, PROROGATION OF FORCE, RE-ENACTMENT
  CHANGES OF EFFICACY: POSTPONEMENT OF START OF EFFICACY, SUSPENSION, DISAPPLICATION, PROROGATION OF EFFICACY, RETROACTIVITY

NORMATIVE SYSTEM MODIFICATION: CONVERSION, TRANSPOSE, IMPLEMENT, RATIFICATION, DELEGATION OF POWER, DEREGULATION


(18)

input representation

let us consider the following input sentence:

“All'articolo 40, comma 1, della legge 28 dicembre 2005, n. 262, le parole: «sei mesi» sono sostituite dalle seguenti: «dodici mesi».”

“At article 40, comma 1 of law 28/12/2005 # 262, the words: «six months» are substituted by the following: «twelve months».”



(20)

input: XML NIR

<!-- INFORMATION -->
<corpo>
  All'
  <rif id="rif9" xlink:href="urn:nir:[...]">
    articolo 40, comma 1, della legge 28 dicembre 2005, n. 262
  </rif>
  , le parole:
  <virgolette tipo="parola" id="vir1">
    "sei mesi"
  </virgolette>
  sono sostituite dalle seguenti:
  <virgolette tipo="parola" id="vir2">
    "dodici mesi"
  </virgolette>
  .
</corpo>

Input to the parser:

All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.

At rif9, the words: vir1 are substituted by the following ones: vir2.

<!-- META INFORMATION -->
<dsp:sostituzione>
  <dsp:pos xlink:href="#art1-com4" />
  <dsp:norma xlink:href="urn:nir:[...]">
    <dsp:pos xlink:href="#rif9" />
  </dsp:norma>
  <dsp:novella>
    <dsp:pos xlink:href="#vir2" />
  </dsp:novella>
  <dsp:novellando>
    <dsp:pos xlink:href="#vir1" />
  </dsp:novellando>
</dsp:sostituzione>

dsp:sostituzione: modification of type substitution
dsp:norma: position of the norm being modified
dsp:novella: quoted text, the new text
dsp:novellando: quoted text, the text being modified

Figure 3: Left: XML encoding the sentence (extracted from the tag <corpo>) “All’articolo 40 [, comma 1, della legge 28 dicembre 2005, n. 262,] le parole: ‘sei mesi’ sono sostituite dalle seguenti: ‘dodici mesi’.” (At article 40 [...] the words: ‘six months’ are substituted by the following: ‘twelve months’.). Top: the meta-description of the modification. Bottom: the content of the tag <corpo>, as it is given in input to the parser after the rewriting of constants rif (i.e., reference) and virgolette (quotes).

FRAME sostituire (substitute)
  legalCategory: SUBSTITUTION
  referenceDocument: RIF9
  modifiedText: VIR1
  modifyingText: VIR2

tests on the dependents of the root verb sostituite:
  SUBJ: contains VIR?
  OBJ: contains VIR?
  MODs: contains RIF?

Figure 4: Three tests are performed on the parse tree, and the appropriate slots are filled.

Further rules are designed to account for complex linguistic constructions, such as coordination. Let us consider the following sentence: All’articolo 1, comma 2, sono soppresse le lettere d) ed f); [At the article 1, comma 2, the letters d) and f) are deleted;]. It is conveniently converted into: All’RIF16, sono soppresse le RIF34) ed RIF35), [At RIF16, the RIF34) and RIF35) are deleted]. The TUP recognizes coordination and marks the edge between the conjuncts. The semantic frame corresponding to the deletion of RIF34 is filled similarly to the case of the substitution; additionally, the semantic interpreter recognizes the presence of a conjunct. In fact, the reference RIF34 has a descendant node RIF35 in a subtree that is connected by the COORD-labeled edge.

Cases involving two conjuncts, such as RIF34 and RIF35, are rather straightforward because we are handling homogeneous objects (both are RIFs), and because there is no ambiguity about the fact that RIF34 and RIF35 are coordinated. To extend the coverage, we consider some further types of coordination:

• Al RIF22 le parole: VIR4 e VIR7 sostituiscono le parole: VIR8 e VIR11, rispettivamente; (rough translation: At the RIF22 the words: VIR4 and VIR7 substitute the words: VIR8 and VIR11, respectively;);

• Al RIF1 le parole VIR1 sostituiscono VIR4 e VIR5 e al RIF2 le VIR12 sostituiscono VIR13, VIR14 e VIR15 (rough translation: At RIF1, the words VIR1 replace the VIR4 and VIR5, and at RIF2 the VIR12 replaces the VIR13, VIR14 and VIR15).

As a general strategy, to handle such cases we bound the combinatorics that threatens the computation by assuming that conjuncts can only be homogeneous (that is, a VIR cannot be coordinated to a RIF).
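The homogeneity assumption can be illustrated with a short Python sketch. It is not the authors' implementation: the function names and the slot names `position` and `target` are hypothetical, and the sketch only shows how coordinated constants of the same kind could be expanded into one frame each, while mixed RIF/VIR coordination is rejected.

```python
from typing import Dict, List


def kind_of(constant: str) -> str:
    """Return 'RIF' or 'VIR' for constants such as 'RIF34' or 'VIR4'."""
    for prefix in ("RIF", "VIR"):
        if constant.startswith(prefix) and constant[len(prefix):].isdigit():
            return prefix
    raise ValueError(f"not a RIF/VIR constant: {constant!r}")


def expand_conjuncts(category: str, position: str,
                     conjuncts: List[str]) -> List[Dict[str, str]]:
    """Build one frame per conjunct, accepting only homogeneous conjuncts."""
    kinds = {kind_of(c) for c in conjuncts}
    if len(kinds) != 1:
        raise ValueError("only homogeneous coordination is handled")
    return [
        {"legalCategory": category, "position": position, "target": c}
        for c in conjuncts
    ]


# "At RIF16, the RIF34) and RIF35) are deleted"
frames = expand_conjuncts("deletion", "RIF16", ["RIF34", "RIF35"])
print(frames)
```

Rejecting mixed conjuncts up front keeps the number of candidate pairings small, which is the point of the assumption stated above.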

3. EXPERIMENTATION

From a practical viewpoint, our research is intended to assist human annotators in identifying legal modificatory provisions and qualifying them with as many details as possible. At the current stage of development we deal with modificatory provisions of integration, substitution or deletion type. For the experimentation we used a dataset composed of 180 files, containing overall 11,944 XML corpo elements (see Section 2) and 2,306 modificatory provisions (namely, 809 integrations, 894 substitutions and 603 deletions) hand-annotated by the CIRSFID legal experts.6 The system accuracy is computed as the percentage of modificatory provisions correctly computed as the tuple <type, position, novella, novellando>, where type is one in {integration, substitution, deletion}, position is the constant identifying the position in a given document where the modification occurs, and novella and novellando are both excerpts of quoted text (see Section 2). Our system obtains 83.0% precision and 71.7% recall. In particular, the figures we obtain for the recall on integration, substitution and deletion are 77.8%, 77.4% and 55.1%, respectively.

The overall result gives an estimate of the robustness of the approach, and of the implemented system as well: as far as we know, no experimentation has been conducted on datasets as large as the present one. Considering that the present experimentation was performed with a preliminary release of the system, in the annotation of integrations and substitutions we obtain encouraging results, while only a poor accuracy is achieved in the case of deletions.

Most errors fall into one of the following classes: 1. preprocessing errors, such as misspelled words, or errors in the XML annotation; 2. discourse manager errors, occurring when chains of complex phrases are met; 3. parser errors, e.g., overly complex syntactic structures, or unknown words (such as the verb anteporre, to place something before); 4. semantic interpretation errors, in cases in which the semantic interpreter is not able to extract the salient information.

6 The dataset is available for download at the URL: http://www.di.unito.it/~radicion/datasets/hypertext09/

rif: position of the norm being modified
virgolette vir1: quoted text (novellando), the text being modified
virgolette vir2: quoted text (novella), the new text
corpo: Italian for body

(21)

output: generated metadata

<dsp:sostituzione>
  <dsp:pos xlink:href="#art1-com4" />
  <dsp:norma xlink:href="urn:nir:[...]">
    <dsp:pos xlink:href="#rif9" />
  </dsp:norma>
  <dsp:novella>
    <dsp:pos xlink:href="#vir2" />
  </dsp:novella>
  <dsp:novellando>
    <dsp:pos xlink:href="#vir1" />
  </dsp:novellando>
</dsp:sostituzione>

dsp:sostituzione: modification of type substitution
dsp:norma: position of the norm being modified
dsp:novella: quoted text, the new text
dsp:novellando: quoted text, the text being modified

(22)

three-step process

step 1, input preprocessing: we retrieve the possible locations of modificatory provisions within the document, and we simplify the input sentences so as to prune text fragments that do not convey relevant information

step 2: we perform the syntactic analysis of the retrieved sentences

step 3: we semantically annotate the retrieved provisions through a tree matching approach

(23)

architecture of the system

step 1: XML NIR → INPUT PREPROCESSING → input text

step 2: input text → MORPHOLOGY & POS-TAGGING (morphological tables, dictionary) → sequence of lexical items → CHUNKING (chunking rules) → verbs + partial chunks → COORDINATION ANALYSIS → verbs + final chunks → ANALYSIS OF VERBAL DEPENDENTS (subcategorization frames) → parse tree

step 3: parse tree → SEMANTIC INTERPRETER (modifications taxonomy) → XML NIR

(24)

step 1: preprocessing step

based on the XML structure, we retain the text excerpts contained between the <corpo> tags (corpo is Italian for body), which is where the modifications may be found

we then rewrite the text tagged by <rif> (short for reference) and <virgolette> (Italian for quotes), which identify the position where a modification occurs and a quoted text fragment, replacing each with its respective ID

INPUT TO THE PARSER

All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.

At rif9, the words: vir1 are substituted by the following ones: vir2.
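The rewriting described above can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. This is a minimal illustration, not the system's actual preprocessor: the function name `rewrite_corpo` is hypothetical, the `xmlns:xlink` declaration is added only so that the fragment parses on its own, and the elided URN is kept as it appears in the slides.

```python
import re
import xml.etree.ElementTree as ET

# Running example; the xmlns:xlink declaration is an assumption added so the
# fragment is well-formed in isolation.
SAMPLE = (
    '<corpo xmlns:xlink="http://www.w3.org/1999/xlink">All\''
    '<rif id="rif9" xlink:href="urn:nir:[...]">articolo 40, comma 1, '
    'della legge 28 dicembre 2005, n. 262</rif>, le parole: '
    '<virgolette tipo="parola" id="vir1">"sei mesi"</virgolette> '
    'sono sostituite dalle seguenti: '
    '<virgolette tipo="parola" id="vir2">"dodici mesi"</virgolette>.</corpo>'
)


def rewrite_corpo(corpo: ET.Element) -> str:
    """Replace every <rif> and <virgolette> child with its id attribute."""
    parts = [corpo.text or ""]
    for child in corpo:
        if child.tag in ("rif", "virgolette"):
            parts.append(child.get("id", ""))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text between this child and the next
    return re.sub(r"\s+", " ", "".join(parts)).strip()


parser_input = rewrite_corpo(ET.fromstring(SAMPLE))
print(parser_input)
```

Keeping the ids as constants means the parser never sees the quoted text or the full reference, which shortens the sentences it has to analyze while preserving the link back to the XML.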

(25)

step 2: parsing step

the Turin University Parser (TUP) is a rule-based parser that returns the syntactic structure of sentences in dependency format

dependencies are binary relations (e.g., the subject relation) between a dominant word (the head, e.g., the verb) and a dominated word (the dependent, e.g., the noun subject)

after two preliminary steps (morphological analysis and part-of-speech (PoS) tagging), the sequence of words goes through three phases: chunking, syntactic analysis of the coordination, and verbal subcategorization
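As an illustration of the dependency format, binary relations can be stored as (head, label, dependent) triples. The Python sketch below is only a toy representation, not TUP's data model: the triples are a partial, hand-made transcription of the running example, and using bare words as node identifiers is a simplifying assumption that would break on repeated words.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Dependency = Tuple[str, str, str]  # (head, label, dependent)

# Partial, hand-transcribed edges for
# "All'RIF9, le parole: VIR1 sono sostituite dalle seguenti: VIR2."
DEPENDENCIES: List[Dependency] = [
    ("sostituite", "RMOD", "All'"),
    ("sostituite", "OBJ", "le"),
    ("sostituite", "AUX", "sono"),
    ("sostituite", "SUBJ", "dalle"),
    ("le", "DET-ARG", "parole"),
    ("parole", "NOUN-APPOSITION", "VIR1"),
]


def find_root(deps: List[Dependency]) -> str:
    """The root is the only head that is never a dependent."""
    heads = {head for head, _, _ in deps}
    dependents = {dep for _, _, dep in deps}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {sorted(roots)}")
    return roots.pop()


def dependents_by_label(deps: List[Dependency], head: str) -> Dict[str, List[str]]:
    """Group the direct dependents of `head` by dependency label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for h, label, dep in deps:
        if h == head:
            grouped[label].append(dep)
    return dict(grouped)


root = find_root(DEPENDENCIES)
print(root, dependents_by_label(DEPENDENCIES, root))
```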

(26)

resulting parse tree

8 sostituite (root)
  RMOD: 1 All'-A
    PREP-ARG: 1.1 All'-IL
      DET-ARG: 2 RIF9
  SEPARATOR: 3 ,
  OBJ: 4 le
    DET-ARG: 5 parole
      NOUN-APPOSITION: 6 VIR1
  AUX: 7 sono
  SUBJ: 9 dalle-DA
    PREP-ARG: 9.1 dalle-IL
      DET-ARG: 11 VIR2
        ADJC-RMOD: 10 seguenti
  END: 12 .

INPUT TO THE PARSER

All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.

At rif9, the words: vir1 are substituted by the following ones: vir2.

(27)

step 3: semantic interpretation

we have mentioned that the relevant pieces of information to be identified to qualify a modificatory provision are:

1. the modification type;

2. the position (be it specified as a sequence of words or relative to the document);

3. the quoted text, both in the form of the textual fragment being inserted/added (novella), and the text fragment affected by the normative modification (novellando)

(28)

step 3: semantic interpretation

modifications are represented by means of semantic frames, composed of slots such as legalCategory, referenceDocument, modifyingText and modifiedText

frames are associated with the legal categories: for instance, the verbs belonging to the legalCategory substitution, such as replace, substitute, change, modify, etc., all have the same slots

in this way it is possible to add further verbs to the legal categories by taking advantage of their shared semantic frames
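A small Python sketch may help to see why sharing frames per category makes adding verbs cheap. The data layout and the function `new_frame` are hypothetical; the slot names and the substitution verbs come from the text, the slots listed for deletion follow the source's remark that a deletion has no modifyingText, and the verb "swap" is an invented example of a later addition.

```python
from typing import Dict, Optional, Tuple

# Slots are attached to the legal category, not to the individual verb.
SLOTS_BY_CATEGORY: Dict[str, Tuple[str, ...]] = {
    "substitution": ("referenceDocument", "modifyingText", "modifiedText"),
    "deletion": ("referenceDocument", "modifiedText"),
}

CATEGORY_BY_VERB: Dict[str, str] = {
    "replace": "substitution",
    "substitute": "substitution",
    "change": "substitution",
    "modify": "substitution",
}


def new_frame(verb: str) -> Dict[str, Optional[str]]:
    """Instantiate an empty frame for the legal category of `verb`."""
    category = CATEGORY_BY_VERB[verb]
    frame: Dict[str, Optional[str]] = {"legalCategory": category}
    frame.update({slot: None for slot in SLOTS_BY_CATEGORY[category]})
    return frame


# Adding a verb is a single table entry; "swap" is a hypothetical example.
CATEGORY_BY_VERB["swap"] = "substitution"
print(new_frame("swap"))
```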

(29)

step 3: semantic interpretation

the semantic interpreter is a rule-based algorithm.

the rules handle two sorts of information: the parse trees, and the domain knowledge encoding the legal modifications taxonomy;

a main rule tests whether the root node of the syntactic tree is a verb, and whether it belongs to the modificatory provisions taxonomy

e.g., given the root verb insert, we take the verb lemma to insert, search for it in the knowledge base, and find that it is a possible instantiation of the legalCategory integration, together with the verbs to add, to incorporate, etc.
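The main rule can be sketched as a lookup on the root node. The following Python fragment is an assumption-laden illustration: the node layout (a dict with `pos` and `lemma` keys), the tag value `"VERB"` and the function name are hypothetical, and the knowledge base is reduced to the few English verb lemmas mentioned in the slides.

```python
from typing import Dict, Optional, Set

# Tiny stand-in for the knowledge base: legalCategory -> verb lemmas.
TAXONOMY: Dict[str, Set[str]] = {
    "integration": {"insert", "add", "incorporate"},
    "substitution": {"replace", "substitute", "change", "modify"},
}


def legal_category(root: Dict[str, str]) -> Optional[str]:
    """Return the legalCategory of the root verb, or None if the test fails."""
    if root.get("pos") != "VERB":
        return None  # the root is not a verb: no modificatory provision here
    lemma = root.get("lemma")
    for category, lemmas in TAXONOMY.items():
        if lemma in lemmas:
            return category
    return None  # a verb, but not one listed in the taxonomy


print(legal_category({"pos": "VERB", "lemma": "insert"}))
```

A `None` result simply means the sentence is not treated as a modificatory provision by this rule; a non-`None` result tells the interpreter which frame to instantiate next.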

(30)

step 3: tree matching

the next step is inspecting the dependents of the verb, looking for the position of the modification and for the novella and novellando arguments (modifyingText and modifiedText, respectively)

modificatory provisions often span different adjacent phrases (e.g., verbal, nominal, adjectival, prepositional, adverbial phrases) and adjacent sentences, so that it is necessary to put such fragments together in order to collect all pieces of information

to do so, we devised a discourse manager

(31)

step 3: discourse manager

the discourse manager interacts with the parser in two phases: first, each phrase is analyzed by the parser to discover the type of the phrase; second, the discourse manager puts the phrases together according to some schema (e.g., prepositional phrase (PP) + verbal phrase (VP)), sending the resulting phrase to the parser

given the sentence ‘at the article 127: the words «six months» are replaced by the words: «twelve months»’, the discourse manager sends the text fragments separately to the parser; then, considering the roots of the parse trees, it puts them together according to the schema [PP + VP + NP], and sends them to the parser again
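The second phase of the discourse manager can be pictured with the following Python sketch. It assumes the first parser pass has already labelled each fragment with a phrase type; the function name `assemble`, the schema list and the way the example sentence is split into three fragments are illustrative assumptions, not details given in the slides.

```python
from typing import List, Optional, Tuple

Fragment = Tuple[str, str]  # (text, phrase type found by the first parser pass)

# Admissible sequences of phrase types; both are mentioned in the slides.
SCHEMAS = [("PP", "VP", "NP"), ("PP", "VP")]


def assemble(fragments: List[Fragment]) -> Optional[str]:
    """Join fragments whose phrase types match a known schema.

    Returns the joined text to be sent to the parser again, or None when
    no schema matches.
    """
    types = tuple(phrase_type for _, phrase_type in fragments)
    if types in SCHEMAS:
        return " ".join(text for text, _ in fragments)
    return None


fragments = [
    ("at the article 127:", "PP"),
    ("the words «six months» are replaced by the words:", "VP"),
    ("«twelve months»", "NP"),
]
print(assemble(fragments))
```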

(32)

step 3: tree matching

in this setting, filling a modification frame corresponds to searching for a parse tree that fits the slots of a given type, and then finding an appropriate mapping between (tree) dependents and (frame) slots

8 sostituite (root)
  RMOD: 1 All'-A
    PREP-ARG: 1.1 All'-IL
      DET-ARG: 2 RIF9
  SEPARATOR: 3 ,
  OBJ: 4 le
    DET-ARG: 5 parole
      NOUN-APPOSITION: 6 VIR1
  AUX: 7 sono
  SUBJ: 9 dalle-DA
    PREP-ARG: 9.1 dalle-IL
      DET-ARG: 11 VIR2
        ADJC-RMOD: 10 seguenti
  END: 12 .

Figure 2: Syntactic analysis of the sentence “All’RIF9, le parole: VIR1 sono sostituite dalle seguenti: VIR2”, (At RIF9 the words: VIR1 are substituted by the following: VIR2).

Semantic Interpretation

The semantic interpreter is a rule-based algorithm. The rules handle two sorts of information: the parse trees, and the domain knowledge encoding the legal modifications taxonomy.

To consider the main traits of modificatory provisions and how they are extracted, let us consider the modification in Fig. 3. To qualify a modification, it is necessary to extract information about the following points: i) the modification type: we presently consider only integration, substitution and deletion, but in the taxonomy there are many other classes; ii) the position, which can be specified as a sequence of words (e.g., “before the words: ‘six months’ ”) or, alternatively, relative to the document (e.g., “at article 40”); iii) the quoted text, which can be of two types. Legal experts denote the textual fragment being inserted/added as the novella of a modificatory provision, while they denote by novellando the text fragment affected by the normative modification. For the sake of clarity, we will borrow that terminology. The quoted text identifies both novella and novellando. However, the structure of normative modifications can differ. In fact, only the novellando may be present (e.g., “[at the end of article 40] the words: ‘six months’ are deleted”), or only the novella (e.g., “[at the end of article 40] the words ‘goodnight’ are added”).

After having described the meaningful elements of normative modifications, let us consider how the components of modificatory provisions are encoded within the system. Modifications are represented by means of semantic frames, composed of slots [1], such as legalCategory, referenceDocument, modifyingText and modifiedText. Frames are associated with the legal categories: for instance, the verbs belonging to the legalCategory substitution, such as replace, substitute, change, modify, etc., all have the same slots. In this way we can add further verbs to the legal categories by taking advantage of their shared semantic frames. Integration, substitution and deletion can differ as regards the slots corresponding to the quoted text. For example, a deletion possibly has a novellando (in this case, the slot modifiedText should be filled) but no novella, and accordingly no modifyingText slot. Vice versa, both substitution and integration can have either novella or novellando, both of them or none.

A main rule tests whether the root node of the syntactic tree is a verb, and whether it belongs to the modificatory provisions taxonomy. For example, given the root verb, we take the verb lemma inserire (insert), we search for it in the knowledge base, and find that it is a possible instantiation of the legalCategory integration, together with the verbs aggiungere (add), integrare (incorporate), etc. In this case we have a fundamental cue that the sentence being analyzed contains a modificatory provision. Also, based on the taxonomy, we are informed about the legalCategory of the modification at hand. The next step is inspecting the dependents of the verb, looking for the position of the modification, and for the novella and novellando (modifyingText and modifiedText, respectively) arguments.

Since modificatory provisions often span different adjacent phrases (e.g., verbal, nominal, adjectival, prepositional, adverbial phrases), it is necessary to put such fragments together in order to collect all pieces of information. To do so we implemented a discourse manager. Moreover, modificatory provisions can span multiple sentences, which considerably increases the difficulty of both the syntactic analysis and the semantic interpretation. Punctuation (such as colon, semicolon and period) may also determine the need to consider multiple sentences. In particular, the discourse manager interacts with the parser in two phases. First, each phrase is analyzed by the parser to discover the type of the phrase; second, the discourse manager puts them together according to some schema (e.g., prepositional phrase + verbal phrase), sending the resulting phrase to the parser. For example, the colon is commonly employed in locutions such as “at the article 127: the words ‘six months’ are replaced by the words: ‘twelve months’ ”. Initially, the discourse manager sends the text fragments separately to the parser. Then, considering the roots of the resulting parse trees, the discourse manager puts them together according to the schema prepositional phrase + verbal phrase + noun phrase and sends them to the parser again.

In this setting, filling a modification frame corresponds to searching for a parse tree that fits the slots of a given type, and then finding an appropriate mapping between (tree) dependents and (frame) slots.

Table 1: Example of tests on the verb dependents.

IF
- the subtree attached to the verb by a RMOD label (that is, modifier) contains a RIF1 constant; AND
- the subtree attached to the verb by a SUBJ label (that is, subject) contains a VIR1 constant; AND
- the subtree attached to the verb by an OBJ label (that is, object) contains a VIR2 constant
THEN
- fill referenceDocument with the RIF1; AND
- fill modifiedText with the VIR2; AND
- fill modifyingText with the VIR1

Tree matching. To perform the tree matching, the other rules test the content of the verb arguments and the verb modifiers to fill the slots of the current frame. In particular, the rules check whether a constant such as RIF or VIR is present in syntactic arguments like subject or object, or in any modifier. For example, the syntactic tree corresponding to the substitution modification in Fig. 2 has as root node the verb “sostituire” (substitute). This verb is present in the knowledge base, which causes a frame associated with the legalCategory substitution to be instantiated. Subsequently, a set of tests is executed on the verb dependents, i.e., the children nodes, to fill the appropriate slots (Table 1). The rule maps a syntactic pattern onto a set of semantic slots, performing a test on the subtrees rooted in the subject, in the object and in the modifiers (Fig. 4).
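The rule in Table 1 can be sketched in Python as a search for constants in the subtrees hanging from the root verb. This is a simplified illustration rather than the authors' code: the nested-dict tree, the helper names and the `RULE` table are hypothetical, the tree is a hand transcription of Fig. 2 without punctuation nodes, and only the first matching constant in each subtree is used.

```python
import re
from typing import Dict, Optional


def node(word: str, *deps) -> dict:
    """Build a tree node; deps are (label, child) pairs."""
    return {"word": word, "deps": list(deps)}


def first_constant(tree: dict, prefix: str) -> Optional[str]:
    """Depth-first search for the first RIFn or VIRn constant in a subtree."""
    if re.fullmatch(prefix + r"\d+", tree["word"]):
        return tree["word"]
    for _, child in tree["deps"]:
        found = first_constant(child, prefix)
        if found:
            return found
    return None


# Dependency label -> (slot to fill, kind of constant expected), as in Table 1.
RULE = {
    "RMOD": ("referenceDocument", "RIF"),
    "SUBJ": ("modifyingText", "VIR"),
    "OBJ": ("modifiedText", "VIR"),
}


def fill_frame(root: dict, category: str) -> Dict[str, Optional[str]]:
    """Fill the frame slots by testing the subtrees of the root verb."""
    frame: Dict[str, Optional[str]] = {
        "legalCategory": category,
        "referenceDocument": None,
        "modifyingText": None,
        "modifiedText": None,
    }
    for label, child in root["deps"]:
        if label in RULE:
            slot, prefix = RULE[label]
            constant = first_constant(child, prefix)
            if constant and frame[slot] is None:
                frame[slot] = constant
    return frame


# Hand-transcribed tree for "All'RIF9, le parole: VIR1 sono sostituite dalle seguenti: VIR2".
rmod = node("All'", ("PREP-ARG", node("All'", ("DET-ARG", node("RIF9")))))
obj = node("le", ("DET-ARG", node("parole", ("NOUN-APPOSITION", node("VIR1")))))
subj = node("dalle", ("PREP-ARG", node("dalle", ("DET-ARG", node("VIR2", ("ADJC-RMOD", node("seguenti")))))))
tree = node("sostituite", ("RMOD", rmod), ("OBJ", obj), ("AUX", node("sono")), ("SUBJ", subj))

frame = fill_frame(tree, "substitution")
print(frame)
```

With this tree the object subtree yields the text being modified and the subject subtree the new text, which agrees with the novellando and novella references in the generated metadata.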

(33)

step 3: tree matching

the rule maps a syntactic pattern onto a set of semantic slots, performing a test on the subtrees rooted in the subject, in the object, and in the modifiers

FRAME sostituire (substitute)
  legalCategory: SUBSTITUTION
  referenceDocument: RIF9
  modifiedText: VIR1
  modifyingText: VIR2

tests on the dependents of the root verb sostituite:
  SUBJ: contains VIR?
  OBJ: contains VIR?
  MODs: contains RIF?

(34)

experimental results

(35)

dataset

at the current stage of development we deal with modificatory provisions of the integration, substitution or deletion type

the dataset used for the experimentation comprised 180 files, containing overall 11,944 XML corpo elements and 2,306 modificatory provisions (809 integrations, 894 substitutions and 603 deletions) hand-annotated by the CIRSFID legal experts

(36)


accuracy metrics

the system accuracy is computed as the percentage of modificatory provisions correctly computed as the tuple <type,position,novella,novellando>, where

type is one in {integration, substitution, repeal},

position is the constant identifying the position in a given document where the modification occurs, and

novella and novellando are both excerpts of quoted text
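A minimal sketch of how such a tuple-based score could be computed is given below. It rests on an assumption the slides do not spell out: a predicted provision counts as correct only if all four fields match a hand-annotated one exactly. The `Provision` type, the function name and the toy data are hypothetical.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Provision(NamedTuple):
    """The tuple <type, position, novella, novellando>."""
    type: str
    position: str
    novella: Optional[str]
    novellando: Optional[str]


def precision_recall(predicted: Set[Provision], gold: Set[Provision]) -> Tuple[float, float]:
    """Exact-match precision and recall over whole tuples (assumed criterion)."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {
    Provision("substitution", "RIF1", "silence please", "free press"),
    Provision("deletion", "RIF2", None, "six months"),
    Provision("integration", "RIF3", "goodnight", None),
}
predicted = {
    Provision("substitution", "RIF1", "silence please", "free press"),
    Provision("deletion", "RIF2", None, "six weeks"),  # wrong novellando
}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

With exact matching, a provision whose type and position are right but whose quoted text is wrong counts as an error for both precision and recall.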


(37)

results

we obtain 83.0% precision and 71.7% recall.

the recall figures we obtain on integration, substitution and deletion are 77.8%, 77.4% and 55.1%, respectively.
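As a consistency check, the per-type recall figures can be combined with the dataset counts. The sketch assumes that each per-type recall is measured over all hand-annotated provisions of that type and that the overall recall is their count-weighted average; under these assumptions the reported 71.7% is recovered.

```python
# Counts of hand-annotated provisions per type (from the dataset slide).
counts = {"integration": 809, "substitution": 894, "deletion": 603}
# Reported per-type recall, in percent.
recall = {"integration": 77.8, "substitution": 77.4, "deletion": 55.1}

total = sum(counts.values())
# Count-weighted average of the per-type recall values.
overall_recall = sum(counts[t] * recall[t] for t in counts) / total
print(total, round(overall_recall, 1))  # 2306 71.7
```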

(38)


error analysis

most errors fall into a few main classes:

preprocessing errors, such as misspelled words, or errors in the XML annotation;

discourse manager errors, occurring when chains of complex phrases are met;

parser errors, e.g., too complex syntactic structures, or unknown words (such as the verb anteporre, to place something before);

semantic interpretation errors, in cases in which the semantic interpreter is not able to extract the salient information.


(39)

error analysis

preprocessing errors are by far the most frequent ones: in particular, the system skipped a large number of corpo elements (namely, 1,772 out of 11,944, i.e., 14.8% of the text fragments)

if we restrict to the modificatory provisions that are actually analyzed by the parser, the recall of the system rises to 82.0% (namely 85.1%, 85.7% and 70.5% on I/S/D, respectively).

(40)


future work

improving the pre-processing filtering phase

tuning the set of rules for the tree matching step, also considering ML approaches such as SSL

extending the system’s coverage to further types of modificatory provisions


(41)

• thank you for your attention

• please address correspondence to

radicion@di.unito.it

(42)


(43)
(44)

unused slides

(45)

hypertexts and legal texts

“at article 29, comma 2 of law #212/2007, the words «free press» are substituted with «silence please», ...”.

[figure: DOCUMENT 1 contains modification α, which turns the text "... free press ..." of DOCUMENT 2 into "... silence please ..."]

(46)


hypertexts and legal texts

legal texts often refer either to other legal texts or to another part of the same document in a non-sequential fashion

hence, legal texts can be considered as special cases of hypertext, i.e., a machine-readable text that is not sequential but is organized so that related items of information are connected [wordnet: hypertext]


(47)

step 3: semantic interpretation

however, the structure of normative modifications can differ in a number of ways

for example, only the novellando may be present, e.g., “at the end of article 40 the words: `six months' are deleted”,

or only the novella may be present, e.g., “[at the end of article 40] the words `goodnight' are added”.
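One way to accommodate this variability is to make the quoted-text slots optional and to check them against the modification type. The sketch below is an assumption-laden illustration: the `Modification` class, the `EXPECTED_SLOTS` table and the function name are hypothetical, and the mapping from types to required slots is inferred only from the examples on these slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Modification:
    type: str
    position: str
    novella: Optional[str] = None      # the new text, if any
    novellando: Optional[str] = None   # the text being replaced or deleted, if any


# Assumed for illustration: which quoted-text slots each type needs.
EXPECTED_SLOTS: Dict[str, Set[str]] = {
    "substitution": {"novella", "novellando"},
    "deletion": {"novellando"},
    "integration": {"novella"},
}


def missing_slots(mod: Modification) -> Set[str]:
    """Return the expected slots that are still unfilled."""
    return {s for s in EXPECTED_SLOTS[mod.type] if getattr(mod, s) is None}


deletion = Modification("deletion", "end of article 40", novellando="six months")
integration = Modification("integration", "end of article 40", novella="goodnight")
incomplete = Modification("substitution", "article 29, comma 2", novellando="free press")

print(missing_slots(deletion), missing_slots(integration), missing_slots(incomplete))
```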
