Extracting Semantic Annotations from Legal Texts
L. Lesmo, A. Mazzei and D. P. Radicioni
Daniele Radicioni - HT2009 - Extracting Semantic Annotations from Legal Texts
research carried out by the Interaction Models
Group of the CS Department (University of Turin) in partnership with Modelling Legal Informatics
Resource Group of Cirsfid (University of Bologna)
outline
• the problem at hand: automating semantic annotation for consolidating legal texts
• the proposed approach
• experimental results and open issues
definition of the problem
semantic mark-up
• semantic mark-up is commonly acknowledged as a costly and complex matter
• much work has been carried out in various fields, such as Information Extraction, with the aim of automatically extracting salient information from texts
restricted domains
• no mature technology to cope with unrestricted language exists yet
• some advances have been obtained over restricted (more regular) domains, such as the legal field
• systems have been built that automatically identify and classify structural portions of legal documents and their intra- and inter-references
• other investigations are being carried out to produce semantic analysis
XML standards
• various initiatives aim to devise XML standards for describing legal sources and schemas to identify legal documents
• text editors exist (e.g., NIR NormaEditor) to mark up structural partitions and normative references in a supervised fashion
• unfortunately, the human annotation process is expensive and error-prone
• such efforts will be viable only in conjunction with tools that automatically extract the structural and semantic data from legal texts
hypertexts and legal texts
• documents containing modificatory provisions are relevant in this perspective, in that they explicitly describe how some other legal text (or another part of the current one) has to be modified
“at the article 29, comma 2 of law #212/2007, the words «free press» are
substituted with «silence please», ...”.
DOCUMENT 1: modification α
a modification or, in legal jargon, a modificatory provision is a change made to one or more clauses within a text or to an entire text
hypertexts and legal texts
“at the article 33, comma 1 of law #212/2007, the word «forbidden» is substituted by «allowed», ...”.
“at the article 1 of law #55/2009, the word «gate» is repealed.”
DOCUMENT 1: modification α, modification β, modification γ
hypertexts and legal texts
DOCUMENT 1: modification α, modification β, modification γ, modification ...
law #212/2007 (before): ... free press ... forbidden ...
law #212/2007 (after): ... silence please ... allowed ...
law #55/2009 (before): ... gate ...
law #55/2009 (after; «gate» repealed): ... gate ...
...
hypertexts and legal texts
DOCUMENT 1: modification α, modification β, modification γ, modification ...
DOCUMENT 2: modification δ
DOCUMENT 3: modification ε
law #55/2009: ... gate ... canceled ... added ... modified text ...
how can we build the text that
counts as law (the consolidated text)?
consolidation of legal texts
• the consolidated text is the updated version of a normative text, the version embodying the changes;
• uncertainty about the effects of normative modifications would undermine the certainty of the law, making it hard to clearly understand what the law is, or which one of several versions of a provision counts as law
law #212/2007 (consolidated): ... «free press» replaced by «silence please» ... «forbidden» replaced by «allowed» ...
consolidation of legal texts
• automating the process of semantically annotating modificatory clauses and provisions would be of great help in simplifying the legal system and in consolidating texts of law
semantic annotation of modificatory provisions
to qualify a modificatory provision
• one has to recognize the following:
  - the type of the modification (e.g., repeal, modification, substitution);
  - the document being modified (another law, decree, etc.);
  - the portion of that document affected by the modification (be it a structural part, such as “article 22”, or a text fragment, such as “the words ‘six months’ ...”);
• and to generate a set of metadata that compactly describes the modification
taxonomy of modificatory provisions
MOD. OF THE CONTENT
  TEXT { REPEAL, SUBSTITUTION, INTEGRATION, RELOCATION }
  MEANING { MODIFICATION OF TERMS, INTERPRETATION, MEANING CHANGE }
MOD. OF THE RANGE { DEROGATION, EXTENSION }
MOD. OF THE TIME
  CHANGES OF FORCE { ANNULMENT, POSTPONEMENT OF ENTRY INTO FORCE, ENTRY INTO FORCE, PROROGATION OF FORCE, RE-ENACTMENT }
  CHANGES OF EFFICACY { POSTPONEMENT OF START OF EFFICACY, SUSPENSION, DISAPPLICATION, PROROGATION OF EFFICACY, RETROACTIVITY }
NORMATIVE SYSTEM MODIFICATION { CONVERSION, TRANSPOSE, IMPLEMENT, RATIFICATION, DELEGATION OF POWER, DEREGULATION }
input representation
• let us consider the following input sentence:
“All'articolo 40, comma 1, della legge 28 dicembre 2005, n. 262, le parole: «sei mesi» sono sostituite dalle seguenti: «dodici mesi».”
“At article 40, comma 1 of law 28/12/2005 # 262, the words: «six months» are substituted by the following: «twelve months».”
input: XML NIR
<!-- INFORMATION -->
<corpo>
All'
<rif id="rif9" xlink:href="urn:nir:[...]">
articolo 40, comma 1, della legge 28 dicembre 2005, n. 262 </rif>
, le parole:
<virgolette tipo="parola" id="vir1">
"sei mesi"
</virgolette>
sono sostituite dalle seguenti:
<virgolette tipo="parola" id="vir2">
"dodici mesi"
</virgolette>
.</corpo>
Input to the parser
All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.
At rif9, the words: vir1 are substituted by the following ones: vir2.
<!-- META INFORMATION -->
<dsp:sostituzione>
<dsp:pos xlink:href="#art1-com4" />
<dsp:norma xlink:href="urn:nir:[...]">
<dsp:pos xlink:href="#rif9" />
</dsp:norma>
<dsp:novella>
<dsp:pos xlink:href="#vir2" />
</dsp:novella>
<dsp:novellando>
<dsp:pos xlink:href="#vir1" />
</dsp:novellando>
</dsp:sostituzione>
modification of type substitution (<dsp:sostituzione>)
position of the norm being modified (<dsp:norma>)
quoted text, novella: the new text (<dsp:novella>)
quoted text, novellando: the text being modified (<dsp:novellando>)
Figure 3: Left: XML encoding the sentence (extracted from the tag <corpo>) “All’articolo 40 [, comma 1, della legge 28 dicembre 2005, n. 262,] le parole: ‘sei mesi’ sono sostituite dalle seguenti: ‘dodici mesi’.” (At article 40 [...] the words: ‘six months’ are substituted by the following: ‘twelve months’.). Top: the meta-description of the modification. Bottom: the content of the tag <corpo>, as it is given in input to the parser after the rewriting of constants rif (i.e., reference) and virgolette (quotes).
FRAME sostituire (substitute)
  legalCategory: substitution
  referenceDocument: RIF9
  modifiedText: VIR1
  modifyingText: VIR2
tests on the dependents of the root verb “sostituite”: SUBJ contains VIR? OBJ contains VIR? MODs contain RIF?
Figure 4: Three tests are performed on the parse tree, and the appropriate slots are filled.
Further rules are designed to account for complex linguistic constructions, such as coordination. Let us consider the following sentence: All’articolo 1, comma 2, sono soppresse le lettere d) ed f); [At the article 1, comma 2, the letters d) and f) are deleted;]. It is conveniently converted into: All’RIF16, sono soppresse le RIF34) ed RIF35), [At RIF16, the RIF34) and RIF35) are deleted]. The TUP recognizes the coordination and marks the edge between the conjuncts. The semantic frame corresponding to the deletion of RIF34 is filled as in the case of the substitution; additionally, the semantic interpreter recognizes the presence of a conjunct. In fact, the reference RIF34 has a descendant node RIF35 in a subtree that is connected by the COORD-labeled edge.
Cases involving two conjuncts, such as RIF34 and RIF35, are rather straightforward because we are handling homogeneous objects (both are RIFs), and because there is no ambiguity about the fact that RIF34 and RIF35 are coordinated. To extend the coverage, we also consider further types of coordination:
• Al RIF22 le parole: VIR4 e VIR7 sostituiscono le parole: VIR8 e VIR11, rispettivamente; (rough translation: At the RIF22 the words: VIR4 and VIR7 substitute the words: VIR8 and VIR11, respectively;);
• Al RIF1 le parole VIR1 sostituiscono VIR4 e VIR5 e al RIF2 le VIR12 sostituiscono VIR13, VIR14 e VIR15 (rough translation: At RIF1, the words VIR1 replace the VIR4 and VIR5, and at RIF2 the VIR12 replaces the VIR13, VIR14 and VIR15).
As a general strategy, to handle such cases we bound the combinatorics that would otherwise burden the computation by assuming that only homogeneous items can be coordinated (that is, a VIR cannot be coordinated with a RIF).
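The homogeneity assumption can be illustrated with a minimal sketch. It is not the system's implementation: the function names, the tuple layout, and the pairwise reading of “rispettivamente” (respectively) are assumptions made for illustration only.

```python
import re


def kind(constant: str) -> str:
    """Return the kind of a rewritten constant, e.g. 'RIF34' -> 'RIF'."""
    match = re.fullmatch(r"([A-Za-z]+)\d+", constant)
    if match is None:
        raise ValueError(f"not a rewritten constant: {constant!r}")
    return match.group(1).upper()


def is_homogeneous(conjuncts: list[str]) -> bool:
    """Conjuncts may be coordinated only if they are all RIFs or all VIRs."""
    return len({kind(c) for c in conjuncts}) <= 1


def expand_deletion(position: str, targets: list[str]) -> list[tuple]:
    """One deletion per coordinated target (hypothetical tuple layout)."""
    if not is_homogeneous(targets):
        raise ValueError("heterogeneous conjuncts are not coordinated")
    return [("deletion", position, target) for target in targets]


def expand_substitution(position: str, new_texts: list[str],
                        old_texts: list[str]) -> list[tuple]:
    """Pair the i-th new text with the i-th old text ('respectively')."""
    if not (is_homogeneous(new_texts) and is_homogeneous(old_texts)):
        raise ValueError("heterogeneous conjuncts are not coordinated")
    if len(new_texts) != len(old_texts):
        raise ValueError("cannot pair conjunct lists of different length")
    return [("substitution", position, new, old)
            for new, old in zip(new_texts, old_texts)]


# "All'RIF16, sono soppresse le RIF34) ed RIF35)"
print(expand_deletion("RIF16", ["RIF34", "RIF35"]))
# "Al RIF22 le parole: VIR4 e VIR7 sostituiscono le parole: VIR8 e VIR11, rispettivamente"
print(expand_substitution("RIF22", ["VIR4", "VIR7"], ["VIR8", "VIR11"]))
```

Rejecting mixed RIF/VIR conjunct lists up front keeps the number of candidate pairings small, which is the point of the restriction.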
3. EXPERIMENTATION
From a practical viewpoint, our research is intended to assist human annotators in individuating legal modificatory provisions and qualifying them with as many details as possible. At the current stage of development we deal with modificatory provisions of either integration, substitution or deletion type. For the experimentation we used a dataset composed of 180 files, containing overall 11,944 XML corpo elements (see Section 2) and 2,306 modificatory provisions (namely, 809 integrations, 894 substitutions and 603 deletions) hand-annotated by the CIRSFID legal experts.6 The system accuracy is computed as the percentage of modificatory provisions correctly computed as the tuple <type, position, novella, novellando>, where type is one of {integration, substitution, deletion}, position is the constant identifying the position in a given document where the modification occurs, and novella and novellando are both excerpts of quoted text (see Section 2). Our system obtains 83.0% precision and 71.7% recall. In particular, the recall figures for integration, substitution and deletion are 77.8%, 77.4% and 55.1%, respectively.
The overall result gives an estimate of the robustness of the approach, and of the implemented system as well: as far as we know, no experimentation has been conducted on datasets as large as the present one. Considering that the present experimentation was performed with a preliminary release of the system, we obtain encouraging results in the annotation of integrations and substitutions, while only poor accuracy is achieved in the case of deletions.
Most errors fall into one of the following classes: 1. preprocessing errors, such as misspelled words, or errors in the XML annotation; 2. discourse manager errors, occurring when chains of complex phrases are met; 3. parser errors, e.g., too complex syntactic structures, or unknown words (such as the verb anteporre, to place sth. before); 4. semantic interpretation errors, in cases in which the semantic interpreter is not able to extract the salient information.
6 The dataset is available for download at the URL: http://www.di.unito.it/~radicion/datasets/hypertext09/
<rif>: position of the norm being modified
<virgolette>: quoted text, either the novellando (text being modified) or the novella (new text)
<corpo>: Italian for body
output: generated metadata
<dsp:sostituzione>
<dsp:pos xlink:href="#art1-com4" />
<dsp:norma xlink:href="urn:nir:[...]">
<dsp:pos xlink:href="#rif9" />
</dsp:norma>
<dsp:novella>
<dsp:pos xlink:href="#vir2" />
</dsp:novella>
<dsp:novellando>
<dsp:pos xlink:href="#vir1" />
</dsp:novellando>
</dsp:sostituzione>
modification of type substitution (<dsp:sostituzione>)
position of the norm being modified (<dsp:norma>)
quoted text, novella: the new text (<dsp:novella>)
quoted text, novellando: the text being modified (<dsp:novellando>)
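As an illustration of how such metadata could be emitted, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element names come from the slide; the dsp namespace URI is a placeholder (the slide does not give it), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DSP = "urn:example:dsp"  # placeholder: the real dsp namespace URI is not given in the slides
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("dsp", DSP)
ET.register_namespace("xlink", XLINK)


def add_pos(parent: ET.Element, target: str) -> ET.Element:
    """Append a <dsp:pos xlink:href="..."/> child pointing at `target`."""
    return ET.SubElement(parent, f"{{{DSP}}}pos", {f"{{{XLINK}}}href": target})


def substitution_metadata(provision_id: str, norm_urn: str, reference_id: str,
                          novella_id: str, novellando_id: str) -> ET.Element:
    """Build the <dsp:sostituzione> description of one substitution."""
    root = ET.Element(f"{{{DSP}}}sostituzione")
    add_pos(root, "#" + provision_id)  # where the provision itself occurs
    norma = ET.SubElement(root, f"{{{DSP}}}norma", {f"{{{XLINK}}}href": norm_urn})
    add_pos(norma, "#" + reference_id)  # position of the norm being modified
    add_pos(ET.SubElement(root, f"{{{DSP}}}novella"), "#" + novella_id)        # new text
    add_pos(ET.SubElement(root, f"{{{DSP}}}novellando"), "#" + novellando_id)  # text being modified
    return root


root = substitution_metadata("art1-com4", "urn:nir:[...]", "rif9", "vir2", "vir1")
print(ET.tostring(root, encoding="unicode"))
```

The URN is left elided exactly as in the slide; only the identifiers rif9, vir1 and vir2 carry over from the rewritten input.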
three steps process
• input preprocessing: we retrieve the possible locations of modificatory provisions within the document, and we simplify the input sentences, so as to prune text fragments that do not convey relevant information
• in the second step we perform the syntactic analysis of the retrieved sentences
• in the third step we semantically annotate the retrieved provisions through a tree matching approach
architecture of the system
step 1: XML NIR → INPUT PREPROCESSING → input text
step 2: MORPHOLOGY & POS-TAGGING (morphological tables, dictionary) → sequence of lexical items → CHUNKING (chunking rules) → verbs + partial chunks → COORDINATION ANALYSIS → verbs + final chunks → ANALYSIS OF VERBAL DEPENDENTS (subcategorization frames) → parse tree
step 3: SEMANTIC INTERPRETER (modifications taxonomy) → XML NIR
step 1: preprocessing step
• based on the XML structure, we retain the text excerpts contained between the tags <corpo> (Italian for body), which is where the modifications may be found
• we then rewrite the text tagged by <rif> (short for reference) and <virgolette> (Italian for quotes), replacing the position where a modification occurs and each quoted text fragment with their respective IDs
INPUT TO THE PARSER
All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.
At rif9, the words: vir1 are substituted by the following ones: vir2.
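The rewriting step can be sketched as follows, assuming the <corpo> fragment is well-formed XML with the xlink prefix declared (the declaration is added here so the fragment parses on its own). The function name is hypothetical, and the sketch is not the system's actual preprocessor.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<corpo xmlns:xlink="http://www.w3.org/1999/xlink">
All'<rif id="rif9" xlink:href="urn:nir:[...]">articolo 40, comma 1, della legge 28 dicembre 2005, n. 262</rif>, le parole:
<virgolette tipo="parola" id="vir1">"sei mesi"</virgolette>
sono sostituite dalle seguenti:
<virgolette tipo="parola" id="vir2">"dodici mesi"</virgolette>.</corpo>"""


def rewrite_corpo(xml_fragment: str) -> str:
    """Replace each <rif> / <virgolette> element by its id attribute."""
    corpo = ET.fromstring(xml_fragment)
    parts = [corpo.text or ""]
    for child in corpo:
        if child.tag in ("rif", "virgolette"):
            parts.append(child.get("id", ""))         # constant such as rif9, vir1
        else:
            parts.append("".join(child.itertext()))   # keep other markup as plain text
        parts.append(child.tail or "")                # text following the element
    return " ".join("".join(parts).split())           # normalize whitespace


print(rewrite_corpo(SAMPLE))
```

On the sample above this yields the same string shown as input to the parser.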
step 2: parsing step
• the Turin University Parser (TUP) is a rule-based parser that returns the syntactic structure of sentences in dependency format
• dependencies are binary relations (e.g., the subject relation) between a dominant word (the head, e.g., the verb) and a dominated word (the dependent, e.g., the noun acting as subject)
• after two preliminary steps (morphological analysis and part-of-speech (PoS) tagging), the sequence of words goes through three phases: chunking, syntactic analysis of the coordination, and verbal subcategorization
resulting parse tree
8 sostituite
  RMOD: 1 All'-A
    PREP-ARG: 1.1 All'-IL
      DET-ARG: 2 RIF9
  SEPARATOR: 3 ,
  OBJ: 4 le
    DET-ARG: 5 parole
      NOUN-APPOSITION: 6 VIR1
  AUX: 7 sono
  SUBJ: 9 dalle-DA
    PREP-ARG: 9.1 dalle-IL
      DET-ARG: 11 VIR2
        ADJC-RMOD: 10 seguenti
  END: 12 .
INPUT TO THE PARSER
All'rif9, le parole: vir1 sono sostituite dalle seguenti: vir2.
At rif9, the words: vir1 are substituted by the following ones: vir2.
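Since dependencies are binary relations, the parse tree can be stored as a list of (head, label, dependent) triples. The sketch below encodes one reading of the figure for this sentence; the edge list is an assumption reconstructed from the figure's labels, and the helper names are hypothetical (this is not TUP's output format).

```python
from collections import defaultdict

# Word ids follow the numbering used in the figure.
WORDS = {"1": "All'", "1.1": "All'", "2": "RIF9", "3": ",", "4": "le",
         "5": "parole", "6": "VIR1", "7": "sono", "8": "sostituite",
         "9": "dalle", "9.1": "dalle", "10": "seguenti", "11": "VIR2", "12": "."}

# (head id, dependency label, dependent id)
EDGES = [("8", "RMOD", "1"), ("1", "PREP-ARG", "1.1"), ("1.1", "DET-ARG", "2"),
         ("8", "SEPARATOR", "3"),
         ("8", "OBJ", "4"), ("4", "DET-ARG", "5"), ("5", "NOUN-APPOSITION", "6"),
         ("8", "AUX", "7"),
         ("8", "SUBJ", "9"), ("9", "PREP-ARG", "9.1"), ("9.1", "DET-ARG", "11"),
         ("11", "ADJC-RMOD", "10"),
         ("8", "END", "12")]


def index_by_head(edges):
    """Map each head id to its list of (label, dependent id) pairs."""
    index = defaultdict(list)
    for head, label, dependent in edges:
        index[head].append((label, dependent))
    return index


def find_root(edges):
    """The root is the only word that is never a dependent."""
    heads = {head for head, _, _ in edges}
    dependents = {dependent for _, _, dependent in edges}
    (root,) = heads - dependents
    return root


def subtree_words(node, index):
    """Words of the subtree rooted at `node`, in depth-first order."""
    words = [WORDS[node]]
    for _, dependent in index[node]:
        words.extend(subtree_words(dependent, index))
    return words


index = index_by_head(EDGES)
print(WORDS[find_root(EDGES)])      # root verb
print(subtree_words("4", index))    # OBJ subtree
```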
step 3: semantic interpretation
• we have mentioned that the relevant pieces of information to be individuated to qualify a modificatory provision are
  1. the modification type;
  2. the position (be it specified as a sequence of words or relative to the document);
  3. the quoted text, both in the form of the textual fragment being inserted/added (novella), and the text fragment affected by the normative modification (novellando)
step 3: semantic interpretation
• modifications are represented by means of semantic frames, composed of slots such as legalCategory, referenceDocument, modifyingText and modifiedText
• frames are associated with the legal categories: for instance, the verbs belonging to the legalCategory substitution, such as replace, substitute, change, modify, etc., all have the same slots
• in this way it is possible to add further verbs to the legal categories by taking advantage of their shared semantic frames
step 3: semantic interpretation
• the semantic interpreter is a rule-based algorithm. the rules handle two sorts of information: the parse trees, and the domain knowledge encoding the legal modifications taxonomy;
• a main rule tests whether the root node of the syntactic tree is a verb, and whether it belongs to the modificatory provisions taxonomy. e.g., given the root verb insert, we take the verb lemma to insert, search for it in the knowledge base, and find that it is a possible instantiation of the legalCategory integration, together with the verbs to add, to incorporate, etc.
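The main rule and the shared frames can be sketched as below. The verb lists and slot names are those mentioned on the slides; the dictionary layout, the "VERB" tag and the function name are assumptions made for illustration, not the system's knowledge base.

```python
from dataclasses import dataclass
from typing import Optional

# Toy knowledge base: verb lemma -> legalCategory (verbs taken from the slides).
LEMMA_TO_CATEGORY = {
    "insert": "integration", "add": "integration", "incorporate": "integration",
    "replace": "substitution", "substitute": "substitution",
    "change": "substitution", "modify": "substitution",
}


@dataclass
class Frame:
    """Semantic frame shared by all verbs of one legalCategory."""
    legalCategory: str
    referenceDocument: Optional[str] = None
    modifyingText: Optional[str] = None   # novella
    modifiedText: Optional[str] = None    # novellando


def main_rule(root_lemma: str, root_pos: str) -> Optional[Frame]:
    """Instantiate a frame only if the root is a verb known to the taxonomy."""
    if root_pos != "VERB":
        return None
    category = LEMMA_TO_CATEGORY.get(root_lemma)
    return Frame(category) if category is not None else None


print(main_rule("insert", "VERB"))
print(main_rule("insert", "NOUN"))
```

Adding a verb to a category is then a one-line change to the dictionary, since the frame is attached to the category rather than to the verb.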
step 3: tree matching
• the next step is inspecting the dependents of the verb, looking for the position of the modification and for the novella and novellando arguments (modifyingText and modifiedText, respectively)
• modificatory provisions often span different adjacent phrases (e.g., verbal, nominal, adjectival, prepositional, adverbial phrases) and adjacent sentences, so it is necessary to put such fragments together in order to collect all pieces of information
• to do so, we devised a discourse manager
step 3: discourse manager
• the discourse manager interacts with the parser in two phases. first, each phrase is analyzed by the parser to discover its type; second, the discourse manager puts the phrases together according to some schema (e.g., prepositional phrase (PP) + verbal phrase (VP)), sending the resulting phrase to the parser
• given the sentence ‘at the article 127: the words «six months» are replaced by the words: «twelve months»’, the discourse manager
  • sends the text fragments separately to the parser;
  • then, considering the roots of the parse trees, it puts them together according to the schema [PP + VP + NP], and sends them to the parser again
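The two phases can be mimicked with a toy sketch. Everything here is an assumption for illustration: fragments are obtained by splitting at colons, and a keyword heuristic stands in for the parser's phrase typing, which in the real system comes from the roots of the parse trees.

```python
# Schemas the (toy) discourse manager knows how to reassemble.
SCHEMAS = [("PP", "VP", "NP"), ("PP", "VP")]


def toy_phrase_type(fragment: str) -> str:
    """Stand-in for the parser: guess the phrase type of a fragment."""
    words = fragment.lower().split()
    if words and words[0] in {"at", "in", "to"}:
        return "PP"   # starts with a preposition
    if any(word in {"is", "are"} for word in words):
        return "VP"   # contains a finite verb
    return "NP"


def assemble(text: str, phrase_type=toy_phrase_type):
    """Phase 1: type each fragment. Phase 2: join them if a schema matches."""
    fragments = [part.strip() for part in text.split(":") if part.strip()]
    types = tuple(phrase_type(fragment) for fragment in fragments)
    if types in SCHEMAS:
        return " ".join(fragments)   # would be sent to the parser again
    return None


sentence = ("at the article 127: the words «six months» are replaced "
            "by the words: «twelve months»")
print(assemble(sentence))
```

Passing the phrase typer as a parameter keeps the schema logic separate from whatever component actually classifies the fragments.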
step 3: tree matching
• in this setting, filling a modification frame corresponds to searching for a parse tree that fits slots of a given type, and then finding an appropriate mapping between (tree) dependents and (frame) slots
[parse tree rooted at 8 sostituite; its RMOD, OBJ and SUBJ subtrees contain RIF9, VIR1 and VIR2, respectively]
Figure 2: Syntactic analysis of the sentence “All’RIF9, le parole: VIR1 sono sostituite dalle seguenti: VIR2” (At RIF9 the words: VIR1 are substituted by the following: VIR2).
Semantic Interpretation
The semantic interpreter is a rule-based algorithm. The rules handle two sorts of information: the parse trees, and the domain knowledge encoding the legal modifications taxonomy.
To illustrate the main traits of modificatory provisions and how they are extracted, let us consider the modification in Fig. 3. To qualify a modification, it is necessary to extract the following information: i) the modification type: we presently consider only integration, substitution and deletion, but the taxonomy contains many other classes; ii) the position, which can be specified as a sequence of words (e.g., “before the words: ‘six months’ ”) or relative to the document (e.g., “at article 40”); iii) the quoted text, which can be of two types. Legal experts call the textual fragment being inserted/added the novella of a modificatory provision, and the text fragment affected by the normative modification the novellando. For the sake of clarity, we borrow that terminology. The quoted text identifies both novella and novellando. However, the structure of normative modifications can vary. In fact, only the novellando may be present (e.g., “[at the end of article 40] the words: ‘six months’ are deleted”), or only the novella (e.g., “[at the end of article 40] the words ‘goodnight’ are added”).
Having described the meaningful elements of normative modifications, let us consider how the components of modificatory provisions are encoded within the system. Modifications are represented by means of semantic frames, composed of slots [1], such as legalCategory, referenceDocument, modifyingText and modifiedText. Frames are associated with the legal categories: for instance, the verbs belonging to the legalCategory substitution, such as replace, substitute, change, modify, etc., all have the same slots. In this way we can add further verbs to the legal categories by taking advantage of their shared semantic frames. Integration, substitution and deletion can differ as regards the slots corresponding to the quoted text. For example, a deletion possibly has a novellando (in which case the slot modifiedText should be filled) but no novella, and accordingly no modifyingText slot. Vice versa, both substitution and integration can have either novella or novellando, both of them, or none.
A main rule tests whether the root node of the syntactic tree is a verb, and whether it belongs to the modificatory provisions taxonomy. For example, given the root verb, we take the verb lemma inserire (insert), search for it in the knowledge base, and find that it is a possible instantiation of the legalCategory integration, together with the verbs aggiungere (add), integrare (incorporate), etc. In this case we have a fundamental cue that the sentence being analyzed contains a modificatory provision. Also, based on the taxonomy, we are informed about the legalCategory of the modification at hand. The next step is inspecting the dependents of the verb, looking for the position of the modification, and for the novella and novellando arguments (modifyingText and modifiedText, respectively).
Since modificatory provisions often span different adjacent phrases (e.g., verbal, nominal, adjectival, prepositional, adverbial phrases), it is necessary to put such fragments together in order to collect all pieces of information. To do so we implemented a discourse manager. Moreover, modificatory provisions can span multiple sentences, which considerably increases the difficulty of both the syntactic analysis and the semantic interpretation. Punctuation (such as colon, semicolon and period) may also determine the need to consider multiple sentences. In particular, the discourse manager interacts with the parser in two phases. First, each phrase is analyzed by the parser to discover its type; second, the discourse manager puts the phrases together according to some schema (e.g., prepositional phrase + verbal phrase), sending the resulting phrase to the parser. For example, the colon is commonly employed in locutions such as “at the article 127: the words ‘six months’ are replaced by the words: ‘twelve months’ ”. Initially, the discourse manager sends the text fragments separately to the parser. Then, considering the roots of the resulting parse trees, the discourse manager puts them together according to the schema prepositional phrase + verbal phrase + noun phrase and sends them to the parser again.
In this setting, filling a modification frame corresponds to searching for a parse tree that fits slots of a given type, and then finding an appropriate mapping between (tree) dependents and (frame) slots.
Table 1: Example of tests on the verb dependents.
IF
- the subtree attached to the verb by a RMOD label (that is, modifier) contains a RIF1 constant; AND
- the subtree attached to the verb by a SUBJ label (that is, subject) contains a VIR1 constant; AND
- the subtree attached to the verb by an OBJ label (that is, object) contains a VIR2 constant
THEN
- fill referenceDocument with the RIF1; AND
- fill modifiedText with the VIR2; AND
- fill modifyingText with the VIR1
Tree matching. To perform the tree matching, the other rules test the content of the verb arguments and the verb modifiers to fill the slots of the current frame. In particular, the rules check whether a constant such as RIF or VIR is present in syntactic arguments like subject or object, or in any modifier. For example, the syntactic tree corresponding to the substitution in Fig. 2 has as root node the verb “sostituire” (substitute). This verb is present in the knowledge base, which causes a frame associated with the legalCategory substitution to be instantiated. Subsequently, a set of tests is executed on the verb dependents (i.e., the children nodes) to fill the appropriate slots (Table 1). The rule maps a syntactic pattern onto a set of semantic slots, performing a test on the subtrees rooted in the subject, in the object and in the modifiers (Fig. 4).
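The rule of Table 1 can be sketched on a small tree structure. The Node class, the function names and the dictionary output are assumptions for illustration; the tree is one reading of Fig. 2, in which the subject subtree fills modifyingText and the object subtree fills modifiedText, as in Table 1.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    word: str
    label: str = "TOP"                      # label of the edge to the parent
    children: list = field(default_factory=list)


def find_constant(node: Node, prefix: str) -> Optional[str]:
    """Depth-first search for a constant such as RIF9 or VIR1 in a subtree."""
    if re.fullmatch(rf"{prefix}\d+", node.word, re.IGNORECASE):
        return node.word
    for child in node.children:
        found = find_constant(child, prefix)
        if found is not None:
            return found
    return None


def fill_substitution_frame(root: Node) -> Optional[dict]:
    """Table 1: test the RMOD, SUBJ and OBJ subtrees, then fill the slots."""
    by_label = {child.label: child for child in root.children}
    rmod, subj, obj = (by_label.get(label) for label in ("RMOD", "SUBJ", "OBJ"))
    if rmod is None or subj is None or obj is None:
        return None
    reference = find_constant(rmod, "RIF")
    modifying = find_constant(subj, "VIR")
    modified = find_constant(obj, "VIR")
    if reference is None or modifying is None or modified is None:
        return None
    return {"legalCategory": "substitution", "referenceDocument": reference,
            "modifyingText": modifying, "modifiedText": modified}


tree = Node("sostituite", children=[
    Node("All'", "RMOD", [Node("All'", "PREP-ARG", [Node("RIF9", "DET-ARG")])]),
    Node("le", "OBJ", [Node("parole", "DET-ARG", [Node("VIR1", "NOUN-APPOSITION")])]),
    Node("sono", "AUX"),
    Node("dalle", "SUBJ", [Node("dalle", "PREP-ARG",
                                [Node("VIR2", "DET-ARG", [Node("seguenti", "ADJC-RMOD")])])]),
])
print(fill_substitution_frame(tree))
```

With this tree, modifyingText receives VIR2 and modifiedText receives VIR1, which agrees with the novella and novellando pointers in the generated metadata.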
step 3: tree matching
• the rule maps a syntactic pattern onto a set of semantic slots, performing a test on the subtrees rooted in the subject, in the object, and in the modifiers
FRAME sostituire (substitute)
  legalCategory: substitution
  referenceDocument: RIF9
  modifiedText: VIR1
  modifyingText: VIR2
tests on the dependents of the root verb “sostituite”: SUBJ contains VIR? OBJ contains VIR? MODs contain RIF?
experimental results
dataset
• at the current stage of development we deal with modificatory provisions of either integration, substitution or deletion type
• the dataset was composed of 180 files, containing overall 11,944 XML corpo elements and 2,306 modificatory provisions (809 integrations, 894 substitutions and 603 deletions) hand-annotated by the CIRSFID legal experts
accuracy metrics
• the system accuracy is computed as the percentage of modificatory provisions correctly computed as the tuple <type, position, novella, novellando>, where
  • type is one of {integration, substitution, deletion},
  • position is the constant identifying the position in a given document where the modification occurs, and
  • novella and novellando are both excerpts of quoted text
results
• we obtain 83.0% precision and 71.7% recall
• the recall figures for integration, substitution and deletion are 77.8%, 77.4% and 55.1%, respectively
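A minimal sketch of how precision and recall over such tuples can be computed, assuming exact tuple match counts as correct. The toy gold and predicted sets are invented for illustration; only the class counts and per-class recall figures come from the slides. The last lines check that the overall recall is consistent with the per-class recalls weighted by class size.

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """A predicted tuple is correct only if it matches a gold tuple exactly."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: (type, position, novella, novellando)
gold = {("substitution", "rif9", "vir2", "vir1"),
        ("deletion", "rif16", None, "vir3"),
        ("integration", "rif20", "vir5", None)}
predicted = {("substitution", "rif9", "vir2", "vir1"),
             ("deletion", "rif16", None, "vir4")}   # wrong novellando
print(precision_recall(predicted, gold))

# Figures reported on the slides.
counts = {"integration": 809, "substitution": 894, "deletion": 603}
recalls = {"integration": 0.778, "substitution": 0.774, "deletion": 0.551}
overall_recall = sum(counts[c] * recalls[c] for c in counts) / sum(counts.values())
print(round(overall_recall, 3))
```

The weighted mean comes out at about 0.717, matching the reported overall recall of 71.7%.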
errors analysis
• most errors fall into a few main classes:
  • preprocessing errors, such as misspelled words, or errors in the XML annotation;
  • discourse manager errors, occurring when chains of complex phrases are met;
  • parser errors, e.g., too complex syntactic structures, or unknown words (such as the verb anteporre, to place sth. before);
  • semantic interpretation errors, in cases in which the semantic interpreter is not able to extract the salient information.
errors analysis
• preprocessing errors are by far the most frequent ones: in particular, the system skipped a large number of corpo elements (namely, 1,772 out of 11,944, i.e., 14.8% of the text fragments)
• if we restrict to the modificatory provisions that are actually analyzed by the parser, the recall of the system rises to 82.0% (namely 85.1%, 85.7% and 70.5% on integrations, substitutions and deletions, respectively)
future work
• improving the pre-processing filtering phase
• tuning the set of rules for the tree matching step, also considering ML approaches such as SSL
• extending the system’s coverage to further types of modificatory provisions
• thank you for your attention
• please address correspondence to
radicion@di.unito.it
unused slides
hypertexts and legal texts
“at the article 29, comma 2 of law #212/2007, the words «free press» are
substituted with «silence please», ...”.
DOCUMENT 1: modification α
DOCUMENT 2 (before): ... free press ...
DOCUMENT 2 (after): ... silence please ...
hypertexts and legal texts
• legal texts often refer, in a non-sequential fashion, either to other legal texts or to another part of the same document
• hence, legal texts can be considered as special cases of hypertext, i.e., a machine-readable text that is not sequential but is organized so that related items of information are connected [wordnet: hypertext]
step 3: semantic interpretation
• however, the structure of normative modifications can differ in a number of ways
• for example, only the novellando may be present, e.g., “at the end of article 40 the words: ‘six months’ are deleted”,
• or only the novella may be present, e.g., “[at the end of article 40] the words ‘goodnight’ are added”.