Changes in Tree-Based Textual Documents

Angelo Di Iorio1, Michele Schirinzi1, Fabio Vitali1, and Carlo Marchetti2,3

1 Dept. of Computer Science, University of Bologna Mura Anteo Zamboni 7, 40127, Bologna, Italy

2 Dip. di Informatica e Sistemistica, University of Rome "La Sapienza"

Via Ariosto 22, Rome, Italy

3 Senato Della Repubblica Italiana, Palazzo Madama, Rome, Italy

Abstract. Several efficient and very powerful algorithms exist for detecting changes in tree-based textual documents, such as those encoded in XML. One important aspect is still underestimated in their design and implementation: the quality of the output, in terms of readability, clearness and accuracy for human users. This requirement is particularly relevant when diff-ing literary documents, such as books, articles, reviews, acts, and so on. This paper introduces the concept of 'naturalness' in diff-ing tree-based textual documents, and discusses a new extensible set of changes which can and should be detected. A naturalness-based algorithm is presented, as well as its application to diff-ing XML-encoded legislative documents. The algorithm, called JNDiff, proved to detect significantly better matchings (since new operations are recognized) and to be very efficient.

Keywords: XML Diff-ing, Changes detection, Naturalness, Data management.

1 Introduction

The way users create, store and edit many kinds of data has been changing in recent years. The boundaries between structured data (where information is organized in tuples and records) and textual documents (where information is encoded as a stream of text) have progressively been fading. A leading role in this process has been played by XML, with its strong emphasis on the coexistence of human- and machine-readable documents. Users are often interested not only in the current version of XML-encoded documents but also in their history and changes. The automatic detection of differences among them is thus destined to become more and more important.

Although XML is used to encode both literary documents and database dumps, diff-ing an XML-encoded literary document is 'something different' from diff-ing an XML-encoded database. Two observations support our idea, from different perspectives. First, the output of a diff on literary documents needs to be as faithful as possible to the output of a 'manual' diff. This property, which is desirable in any context, is much more relevant for literary resources because they are primarily meant to be read by humans. The second and more important point is about the editing model of literary documents: they are usually modified according to patterns and rules different from those adopted in changing databases. This behavior can then be exploited to produce high-quality, natural outputs.

J. Filipe and J. Cordeiro (Eds.): ICEIS 2009, LNBIP 24, pp. 90–101, 2009.

© Springer-Verlag Berlin Heidelberg 2009


This paper proposes a new dimension along which to evaluate, design and implement algorithms for diff-ing XML documents: naturalness. In a sense, naturalness indicates the capability of an algorithm to identify those changes which would be identified by a manual approach. This work addresses various aspects related to the notion of naturalness, at different levels. In particular, we:

– discuss an extensible set of natural changes which an algorithm specialized for literary documents can and should detect.

– present an algorithm, called JNDiff, able to detect most of them. We also stress the modularity and high configurability of JNDiff.

– describe a Java system based on these ideas. We also present a case-study on detecting changes between XML-encoded legislative bills, and focus on the benefits of such a natural approach for their editing/publishing workflow.

The rest of the paper is structured as follows. Section 2 describes related work; section 3 introduces the concept of naturalness. JNDiff is analyzed in section 4, while section 5 briefly describes JNMerge, a tool to interpret and re-build the output of JNDiff. The evaluation of both these tools is presented in section 6, while section 7 is about the case-study on legislative documents.

2 Related Work

A variety of tools exist which calculate differences among XML documents. Generally, comparison methods are divided into two groups: systems that operate on generic text content (such as GNU Diff [2]) and systems specifically designed for XML content. We define here a further breakdown of the XML-specific methods, distinguishing: (i) diffs working on XML documents that express database dumps, and (ii) diffs working on XML documents that express literary documents.

Cobéna et al. [7] proposed XyDiff to detect changes between two ordered XML trees T1 and T2. This algorithm uses a two-pass technique: XyDiff starts by matching nodes through their ID attributes; next, it computes a signature and a weight for each node of both trees in a bottom-up traversal. A top-down phase, repeated in order for each subtree, completes the algorithm. XyDiff achieves O(n log n) complexity in execution time and generates fairly good results in many cases. However, it cannot guarantee any form of optimal or near-optimal result because of the greedy rules used in the algorithm.

X-diff [11] detects changes on the parsed unordered labeled tree of an XML document. X-diff finds the equivalent second-level sub-trees and compares the nodes using the structural information denoted as a signature. In order to detect move operations (i.e., a node moved from position i in the old tree to position j in the new tree), an ordered tree is needed.

Zhang and Shasha proposed a fast algorithm to find the minimum-cost editing distance between two ordered labeled trees [12]. Given two ordered trees T1 and T2, in which each node has an associated label, their algorithm finds an optimal edit script in time O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2))), which is the best known result for the general tree-to-tree correction problem.

There are very few algorithms customized for literary XML documents. The AT&T Internet Difference Engine [3,4] uses an internal module to determine the differences between two HTML pages. That module treats two HTML pages as two sequences of tokens (a token is either a sentence-breaking markup or a sentence) and uses a weighted LCS algorithm [8] to find the best matching between the two sequences. DeltaXML [5,6], developed by La Fontaine, provides a plug-in solution for detecting and displaying changes between two versions of an XML literary document. It represents changes in a merged delta file by adding additional attributes to the original XML file.

3 Naturalness in Diff-ing Literary Documents

There is one aspect still underestimated in the design and implementation of XML diff algorithms: the quality of the output, in terms of readability, clearness and accuracy for human users. This aspect is particularly relevant for literary documents such as books, reviews, reports and acts, when encoded in XML. The point is that most of the existing diff-ing algorithms capture changes on the trees representing these documents rather than changes on the documents themselves.

Let us explain such issue with a simple example. Consider an author who merges two paragraphs in a document and deletes a fragment of the second one.

Fig. 1. Merging two paragraphs

Fig. 1 shows such a case in HTML (the input files are followed by two possible diff-ing outputs). In the best case, all the algorithms we are aware of detect a couple of independent changes: the deletion of the whole second paragraph, and the insertion of some words in the first one. Such output is technically correct, but it would undoubtedly be more useful to detect a restructuring of the paragraphs. Syntactical details are not relevant here, but it is important to understand how to improve quality when dealing with literary documents.

The authors of such documents do not strictly operate on the underpinning tree structure of the documents themselves. They basically edit, delete, change and move higher-level structures. For instance, they may insert/remove nodes along a path (wrapping text with a bold or italic effect), restructure text nodes into hierarchical sub-trees (dividing an act into sub-acts, paragraphs, etc.), flatten a structured paragraph into a text node (removing formatting), rename elements (translating labels of a document), and so on.


Our idea is then to study 'meaningful and common' operations/changes which reflect what authors actually do on literary documents. We then introduce the notion of 'naturalness' of a diff-ing algorithm. The 'naturalness' indicates the 'capability of an algorithm to identify real changes, i.e. changes which would be identified by a manual approach'.

In order to define the set of ’natural’ operations, we propose a layered approach: it consists of exploiting regularities and patterns of editing processes, and reformulating high-level changes as combinations of atomic ones.

Basically, the existing algorithms identify four operations: (i) insertion of a subtree, (ii) deletion of a subtree, (iii) update of a node and (iv) move of a subtree. Although these operations fully express changes in the documents' trees, they hardly express the higher-level changes mentioned so far (merging paragraphs, adding formatting, re-factoring text blocks, and so on). What we propose is to extend this traditional model into an open one, able to capture all those very common editing actions.
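The relation between the four traditional atomic operations and an open set of higher-level 'natural' changes can be sketched as follows. This is an illustrative Python sketch (JNDiff itself is implemented in Java); all class names and fields are our invention, not JNDiff's actual vocabulary:

```python
from dataclasses import dataclass, field

# The four traditional atomic tree-diff operations (illustrative names).
@dataclass
class InsertSubtree:
    parent: str
    subtree: str

@dataclass
class DeleteSubtree:
    node: str

@dataclass
class UpdateNode:
    node: str
    new_value: str

@dataclass
class MoveSubtree:
    node: str
    new_parent: str

@dataclass
class NaturalChange:
    """A higher-level change (e.g. a paragraph merge) expressed as a named
    aggregation of atomic operations; new kinds of natural change can be
    added freely, which is what makes the model 'open'."""
    name: str
    atoms: list = field(default_factory=list)

# The merge of Fig. 1 seen as one natural change instead of two
# apparently unrelated atomic ones.
merge = NaturalChange("merge-paragraphs",
                      [DeleteSubtree("p2"), UpdateNode("p1", "merged text")])
```

The point of the aggregation is precisely the one made above: the same two atomic operations, reported in isolation, read as an unrelated deletion and insertion.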

3.1 A New Set of Natural Operations

Paradoxically, the first relevant aspect of the set of operations we propose is its incompleteness. Our current model is the result of a deep analysis of the most common editing processes. On the other hand, we expect to discover new meaningful changes, tailored for specific contexts, specific classes of documents and specific editing patterns.

A preliminary set of operations is listed below:

Insertion/Deletion. Inevitably our model comprises the operations of insertion and deletion, as any textual diff-ing algorithm does. They represent the 'bricks' upon which other complex operations are built. Moreover, we define both the insertion/deletion of a subtree and of an element along a path.

Move. The movement of a subtree is another natural change which should be detected. Moving fragments of a document is in fact a very common operation on literary documents (suffice it to mention the cut&paste operations we perform thousands of times), already detected by most existing algorithms.

Downgrade. The downgrade operation occurs when adding nodes along a path, as shown in figure 2. Actually two subtrees are downgraded, after adding an intermediate element. The downgrade operation is a very good example of natural diff on literary documents. While in a database context such an operation is quite uncommon (records can be moved or updated, but their subcomponents are hardly pushed down in a hierarchical structure), it is very common when dealing with hierarchical texts and managing mixed content-models, containers and sub-containers, and logical document structures.

Upgrade. As expected, the opposite upgrade operation occurs when a node is removed along a path. Figure 2 can be read from right to left in order to picture an upgrade operation. There is actually a deletion of an element that does not involve its whole subtree but only the element itself.

Even upgrades are very common when editing literary documents, since authors tend to change, polish and re-organize (sub-)structures of a document, without working on whole subtrees but on 'connections' between them.


Fig. 2. The downgrade operation

Refactoring. We define the refactoring operation as a structural modification of 'contiguous' blocks in a textual content. Refactoring is a very frequent operation in document editing. Suffice it to mention how many times we divide a paragraph or insert and remove emphasis on fragments.


Fig. 3. The refactoring operation

Figure 3 shows an example: a single text node (bottom-right of the tree) is split in two, while a composite text node is created by deleting an in-line element. The refactoring is a high-level operation that aggregates some elementary changes, considered as a whole action. Some of them are: (i) Join(X,Y) (merging the content of the X and Y nodes), (ii) Split(X,n) (dividing the content of node X at offset n into two split text nodes), (iii) InsNode (inserting a new single node), (iv) DelNode (deleting a single node). What is important is the logical consistency of a document before and after the application of a refactoring operation, as a combination of all these changes.
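Under the simplifying assumption that text nodes are plain strings, the Split and Join elementary changes above can be sketched as follows (an illustrative Python sketch, not JNDiff's code):

```python
def split(text, n):
    # Split(X, n): divide the content of text node X at offset n
    # into two text nodes.
    return [text[:n], text[n:]]

def join(x, y):
    # Join(X, Y): merge the content of two adjacent text nodes into one.
    return x + y

# A refactoring aggregates such elementary changes into one logical action;
# the document must stay consistent before and after it: splitting a text
# node and re-joining the two halves gives back the original content.
parts = split("Hello world", 5)     # -> ["Hello", " world"]
restored = join(parts[0], parts[1])  # -> "Hello world"
```

InsNode and DelNode, the remaining two elementary changes, operate on single element nodes rather than on text content.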

Element Update. A very common operation on XML documents is the update of an element, a change which does not involve its content-model (and name) but only its attributes. Insertion/deletion of an attribute, and modification of its value, are possible sub-steps of such a complex change. For instance, style and formatting adjustments on literary documents may correspond to such updates.

Text Node Update. One of the most common editing changes is the text-node update, i.e. the insertion/deletion of substrings in a text. We propose to aggregate elementary changes (insertion/deletion of a substring of a given length at a given position) into such a complex update operation. Detecting this type of change is very important in literary documents because it allows us to perform a very exhaustive analysis of text changes, down to the single-word level.

4 An Optimized Algorithm for Natural (XML) Diff-ing: JNDiff

The set of changes described in the previous section is independent from any diff-ing algorithm. A first goal of our work, in fact, was investigating a new diff-ing approach for literary tree-based documents, and defining a set of natural operations to be recognized.

The second and more important step is designing an actual algorithm able to detect those changes. Since any complex operation is actually a combination of atomic ones, we could have extended existing algorithms with an interpretation phase able to re-arrange the output in terms of higher-level changes. Instead, we have implemented a native naturalness-based algorithm using specific data structures and rules to directly detect natural changes. We called it JNDiff, as it is implemented in Java.

Intuitively, we tried to 'simulate' in JNDiff our experience in diff-ing literary documents. What we humans do is try to understand what the relationships between parts of the two input documents are. We find the different relationships with an iterative process: we usually first identify those parts which remain unchanged, in order to have a sort of pivot around which other changes are detected and classified. Then, we identify as 'moved' those parts which do not change but are in a different position, or as 'updated' those parts which have been slightly modified but do not change their position (similarly, we can identify as 'upgraded' or 'downgraded' those unmodified parts which have been pushed/pulled downward/upward in the tree structure of the document).

JNDiff adopts a similar approach: it is a modular algorithm which first detects a set of relationships between the documents' parts (basically a list of insertions/deletions) and iteratively refines them, through cascading phases. Each phase is in charge of detecting a specific class of changes.

The modularity is one of the most important and innovative aspects of JNDiff. It allows us to customize the algorithm on the basis of users’ preferences and needs.

Although the current implementation of JNDiff works much better with literary documents, the algorithm can be easily specialized for different application domains. For instance, we can obtain much better results when diff-ing database dumps by deactivating the modules for upgrade/downgrade detection, since these operations are very uncommon in that context. Similarly, other configurations can be set up for different scenarios. Moreover, new modules able to detect new operations can be implemented (as we said before, the set of changes we propose here is a partial list meant to be extended and polished) and easily activated. The algorithm is then extremely powerful and flexible.


In the rest of the paper we discuss the current configuration/implementation of JNDiff, which has been tailored for naturalness-based diff-ing. It consists of five independent phases:

Phase 0: Linearization. The preliminary phase, called linearization, is mandatory and independent from the set of changes we need to detect. For each input document, it creates a 'smart' data structure, called VTree, which makes it easy and fast to identify and compare documents' elements and subtrees. Basically, a VTree is an array of records built with a pre-order depth-first visit of a document. Each record represents a node and contains: a hash-value which identifies that node (and its attributes), a hash-value which identifies the whole subtree rooted in that node (derived from the hash-values of its children and itself), a pointer to that node in the document, and other application-dependent information we do not describe here, due to space limits. Such a data structure plays an essential role for JNDiff, since it allows elements to be compared by simply comparing integer numbers, in constant time (in particular, whole subtrees can be matched by matching two integers). Note also that the construction of a VTree is linear in the number of nodes, since JNDiff basically visits a tree twice and properly synthesizes meaningful hash-values.
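A minimal sketch of such a linearization is given below. Dict-based trees and MD5-derived integers stand in for JNDiff's actual DOM pointers, record layout and hash function, all of which are assumptions of the sketch:

```python
import hashlib

def _h(s):
    # Stable 64-bit integer hash (Python's built-in hash() is
    # randomized per run, so it would not be reproducible).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def build_vtree(node, out):
    """Linearize a tree (dicts with 'name', optional 'attrs' and
    'children') into a VTree-like array of records via a pre-order
    depth-first visit. Each record carries a node hash and a subtree
    hash, so whole subtrees compare in constant time by comparing
    two integers."""
    rec = {"node": node,
           "node_hash": _h(node["name"] +
                           repr(sorted(node.get("attrs", {}).items())))}
    out.append(rec)  # pre-order: the record is emitted before the children
    child_hashes = [build_vtree(c, out) for c in node.get("children", [])]
    # The subtree hash is synthesized from the node's own hash and the
    # subtree hashes of its children.
    rec["subtree_hash"] = _h(repr((rec["node_hash"], child_hashes)))
    return rec["subtree_hash"]

doc = {"name": "book", "children": [{"name": "chapter", "attrs": {"n": "1"}}]}
vtree = []
build_vtree(doc, vtree)  # vtree now holds one record per node, in pre-order
```

Because the visit is pre-order and depth-first, a subtree always occupies a contiguous interval of the array, which is what the partitioning phase below relies on.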

Phase 1: Partitioning. The partitioning phase consists of finding the unchanged parts of the two documents. As expected, such matching greatly benefits from the VTree data structures. The search for those subtrees, in fact, becomes a search for an LCS (Longest Common Subsequence) between two arrays. Note also that a subtree corresponds to a contiguous interval of a VTree, because of the pre-order depth-first visit, and many comparisons can be skipped. JNDiff finds an LCS as an ordered concatenation of LCCSs (Longest Common Contiguous Subsequences). Actually, other algorithms for the LCS could be used, such as Myers's one [10].

At this stage, the unchanged parts are identified and connected by a single type of matching relation. The remaining nodes are then considered inserted, if they only appear in the second document, or deleted, if they only appear in the first one.
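Assuming each document is reduced to its sequence of per-node hash-values, the partitioning step can be approximated as follows. Python's difflib (which also matches the longest contiguous runs first) stands in for JNDiff's LCCS search; the function name and the list-based representation are illustrative:

```python
from difflib import SequenceMatcher

def partition(hashes_a, hashes_b):
    """Phase 1 sketch: match unchanged contiguous runs between two
    linearized documents, represented by their per-node hash sequences.
    Unmatched positions in A are provisionally 'deleted', unmatched
    positions in B are provisionally 'inserted'."""
    sm = SequenceMatcher(a=hashes_a, b=hashes_b, autojunk=False)
    matched_a, matched_b = set(), set()
    for m in sm.get_matching_blocks():
        matched_a.update(range(m.a, m.a + m.size))
        matched_b.update(range(m.b, m.b + m.size))
    deleted = [i for i in range(len(hashes_a)) if i not in matched_a]
    inserted = [j for j in range(len(hashes_b)) if j not in matched_b]
    return deleted, inserted
```

For example, `partition([1, 2, 3, 4], [1, 2, 9, 4])` reports position 2 as deleted in A and position 2 as inserted in B; the later phases then refine such provisional classifications.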

Phase 2: Text Updates Detection. Phase 2 is the first optional step of JNDiff, and detects text-node updates. Two criteria are followed to do that: a principle of locality and a similarity threshold. They respectively state that two text nodes are considered 'updated' if (i) they lie between two subtrees matched in phase 1 (intuitively, they belong to the same document part) and (ii) the amount of text changes crosses a threshold passed as a parameter. These tests are in fact designed to simulate the 'natural' behavior described at the beginning of the section (updated nodes do not change their position and only slightly change their content).

The parametrization of the similarity threshold is another key aspect to be remarked: once again, JNDiff is highly configurable, and different classes of updates can be easily detected by changing that threshold.
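The two criteria can be sketched as a single predicate. Here difflib's similarity ratio and the 0.6 default are illustrative choices, not JNDiff's actual similarity measure or default threshold, and the locality test is passed in as a pre-computed boolean:

```python
from difflib import SequenceMatcher

def is_text_update(old_text, new_text, same_partition, threshold=0.6):
    """Phase 2 sketch: two unmatched text nodes are classified as an
    'update' of one another when (i) they lie between the same pair of
    matched subtrees (principle of locality, pre-computed by the caller)
    and (ii) their similarity crosses a configurable threshold."""
    if not same_partition:
        return False
    return SequenceMatcher(a=old_text, b=new_text).ratio() >= threshold
```

Raising the threshold only pairs near-identical texts; lowering it pairs more aggressively, which is exactly the configurability remarked above.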

Phase 3: Moves Detection. Phase 3 is a further optional step which detects moves. Basically, JNDiff measures the distance between matched subtrees (by counting the matching partitions and nodes between them) and classifies as 'moved' those nodes whose distance is under a given threshold. As happens for updates, such a solution has a twofold goal: (i) it simulates manual change detection (nodes are very often moved within a limited range in a document) and (ii) it is highly configurable and can be adapted to very different contexts.
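A simplified version of this test follows. The distance here is just the absolute difference of partition positions, whereas JNDiff counts the matching partitions and nodes in between; the default threshold is likewise illustrative:

```python
def detect_moves(matches, max_distance=3):
    """Phase 3 sketch: each matched pair (position_in_A, position_in_B)
    is a candidate move when its position changed; it is only classified
    as 'moved' when the displacement stays within a configurable range,
    reflecting the observation that fragments are usually moved within
    a limited portion of a document."""
    return [(a, b) for a, b in matches if 0 < abs(a - b) <= max_distance]
```

For instance, `detect_moves([(0, 0), (1, 3), (2, 9)])` keeps only `(1, 3)`: the first pair did not change position and the last one is displaced beyond the threshold.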

Phase 4: Matches Expansion. The last phase we have implemented is called matches expansion and propagates changes bottom-up, in order to improve the quality of the output. It is optional but highly recommended in order to obtain more natural results.

Intuitively, JNDiff 'goes up' from the leaves to the root and fixes some 'imperfections' in the interpretation of insertions/deletions. Some elements, in fact, are still recognized as inserted/deleted even if they are actually unchanged. The reason is that their (VTree) hash-values have changed because something has changed in their descendants (for instance, the book or chapter elements of the sample). What JNDiff does is then remove those false positives among insertions/deletions and polish the output. Moreover, by analyzing some specific VTree fields, it detects updates of elements' attributes.
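The core of the false-positive test can be reduced to a small predicate. This is a strong simplification of our own devising: JNDiff also inspects specific VTree fields (e.g. to detect attribute updates), which the sketch omits:

```python
def expand_match(label_a, label_b, children_matched):
    """Phase 4 sketch: an element still flagged as deleted-in-A /
    inserted-in-B only because its subtree hash was 'dirtied' by changed
    descendants is reclassified as unchanged when its own label is
    identical and at least one of its children is already matched."""
    return label_a == label_b and any(children_matched)
```

Applied bottom-up, the test lets a match at the leaves propagate through otherwise-identical ancestors such as the chapter and book elements mentioned above.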

Advanced Phases. The phases discussed so far were designed to obtain natural diff-ing outputs. In fact, most of the changes defined in section 3.1 are currently detected by JNDiff: inserted/deleted subtrees belong to the residual class of unmatched nodes, downgraded/upgraded nodes have a deleted/inserted element among their ancestors, text-updates and moves are recognized by phases 2 and 3, and elements' updates are recognized during phase 4.

Equally important is the modularity of JNDiff. In fact, we plan to work on new modules to detect more sophisticated changes. In particular, we aim at detecting combined operations such as refactoring (as described in section 3.1), or identifying nodes which have been simultaneously moved (or upgraded/downgraded) and slightly updated. The current engine, in fact, does not detect such tangled changes (only one match-relation is found, according to the order and parameters of the phases).

5 Expressing Detected Changes: JNMerge and JNApply

Currently JNDiff is implemented as a Java application, running on common web servers or executable from the command line. Both these interfaces allow users to specify parameters useful to customize the algorithm. The output of JNDiff is an XML file which lists the changes to be applied to the input document A in order to obtain the input document B.

Such output (Δ) expresses very fine-grained and diversified differences, but it is quite complex and application-dependent. Syntactical details are not relevant here (moreover, they will probably be changed). What is important is the fact that JNDiff alone is not enough to clearly show users how documents changed. We therefore implemented a related tool which generates a document A + Δ that clearly expresses those changes. We called it JNMerge.

What JNMerge does is scan such output and embed the changes in document A. For each detected operation, it adds appropriate attributes and other markup to the original document. Basically, JNMerge uses the indexes and pointers of a VTree in order to access and modify the document.


By exploiting that information, the original document A is progressively transformed into A + Δ. Let us discuss a simple case: when adding a node along a path (InsertNode), that node is added to its new parent, all children of its parent are 'adopted' by the new child, and all references are accordingly updated. Similarly, each operation implies a well-defined set of modifications.
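The InsertNode case just described can be sketched on dict-based trees (JNMerge actually works on the DOM and embeds change markup, which the sketch omits; the function name is ours):

```python
def apply_insert_node(parent, new_label):
    """JNMerge sketch for InsertNode (a node added along a path): the new
    node is added to its parent, and all of the parent's previous children
    are 'adopted' by the new child. Trees here are plain dicts with 'name'
    and 'children'."""
    new_node = {"name": new_label, "children": parent.get("children", [])}
    parent["children"] = [new_node]
    return new_node
```

After `apply_insert_node(section, "div")`, the former children of `section` hang from the new `div` element, exactly as in the downgrade scenario of figure 2.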

As expected, such re-building is a very complex process. The document JNMerge deals with, in fact, is significantly different from the document JNDiff evaluated, since information about previous changes has already been embedded, and offsets and positions are broken. However, JNMerge rebuilds a correct and meaningful A + Δ. Details of this process are out of the scope of this paper, whose main topic is JNDiff. Syntactical details of the JNMerge markup are also not relevant now, since they can (and will) be easily changed.

We also implemented JNApply, a tool that directly generates the second document B, taking A and Δ as input. It is an optimization which could have been coded by extending JNMerge with transformations on the embedded delta.

6 Computational Complexity of JNDiff and JNMerge

The most important goal of JNDiff was improving the quality of the output. A first application confirmed that very good results can be achieved, as discussed in the case-study of the next section. On the other hand, it is important to evaluate the algorithm in terms of computational costs. In fact, JNDiff is admittedly less efficient than other algorithms, just because it aims first of all at maximizing 'naturalness'. By considering the overall result, however, we conclude that JNDiff is a very good trade-off between naturalness and complexity.

Defining the complexity of JNDiff is quite difficult, since that measure strongly depends on which phases are actually executed. That is why we briefly discuss each phase separately. We will express complexity in terms of four parameters: n (number of nodes of document A), m (number of nodes of document B), p (number of leaves of document A), q (number of leaves of document B).

The VTree linearization (Section 4) is realized by simply visiting the DOM trees, so it costs Θ(n + m). The partitioning phase (Section 4) consists of finding an LCS in the linearized VTrees. The best known algorithm is Myers's [10], which finds an LCS in time O((n + m)D), where D is the length of a shortest edit script. JNDiff uses a different approach in order to find an LCCS, an LCS with the longest contiguous substrings. As mentioned before, this approach captures the largest and most 'natural' subtrees, especially for literary documents. It finds an LCCS in O(n × m), but it has an Ω(1) lower bound. In fact, the internal structure of the VTrees makes it easy and fast to compare identical subtrees.

Thus, JNDiff works very well on documents with slight differences, while it yields worse results when processing very different documents. However, we expect literary documents to have many more unmodified nodes and subtrees than changed ones. When editing literary documents, in fact, users tend to modify limited parts of them, by adding/removing text fragments, re-organizing parts, adding/removing nodes along paths, and so on.

Similar considerations apply to the optional phases of JNDiff. For instance, the complexity of text-updates detection (Section 4) is O(p × q × f(p, q)), where f is the function that calculates the similarity between the nodes p and q (f costs O(length(p) × length(q)) because it uses our LCCS algorithm, but can be sped up by using Myers's one, to the detriment of naturalness). Similarly, the moves-detection phase (Section 4) costs O(n × m) when all nodes need to be scanned. In the same way, the complexity of the matches-expansion phase (Section 4) is O(min(p, q) × log(n)), since JNDiff has to climb the tree for log(n) nodes.

The total complexity of JNDiff is then O(n × m). In practice, since each phase works only on nodes not yet connected, and the majority is connected after the initial phases, the computational cost progressively decreases at each iteration.

7 A Practical Application: Detecting Changes in Legislative Documents

As laid down by the Italian Constitution, the Italian Parliament consists of two Houses: the Senate of the Republic and the Chamber of Deputies. According to the principle of full bicameralism, these two Houses perform identical functions. The law-making function is thus performed jointly by the two Houses: a legislative bill becomes an act only after it has been passed by both Houses of Parliament in the same exact wording.

As a consequence, it is important to provide Senators and Deputies with effective means for analyzing the modifications applied to legislative bills in one House after they have been modified in the other. The main tool provided by Senate employees to Senators for this purpose are documents containing the so-called 'Testo a Fronte' (TAF in the following, which literally means 'facing text'). A TAF document is a two-column page-layout document that represents the differences between the original version of a legislative document and the modified version. In a TAF document, the left column is used to represent the original version of a legislative bill, while the right column represents the modified one. However, in order to highlight modifications, the two columns do not merely list the aligned content of the two versions: they rather re-present these versions according to well-defined TAF presentation rules, which permit a quick identification of the differences by readers (some real-world TAF examples are shown on our demo web site http://twprojects.cs.unibo.it:8080/taf/).

A TAF presentation rule is determined on the basis of the applied modification, which can span either the overall structure of the document (e.g. articles, commas) or simply a portion of the text (e.g. rephrasing of a comma). For the sake of synthesis, we describe a simplified version of some of these rules:

1. if some words within a comma (or its sub-parts) are deleted in the new version, then the text on the left column is printed in bold face and the right column is left unmodified,

2. if some words within a comma (or its sub-parts) are inserted in the new version, then the text on the right column is printed in bold face and the left column is left unmodified,

3. if an article is suppressed in the new version, then all the article text on the left column is printed in bold face, while the one on the right is substituted with a blank paragraph starting with "Soppresso." (i.e., suppressed),

4. if an article is left unchanged in the new version, then the text on the left column remains identical, while the one on the right is substituted with a blank paragraph starting with "Identico." (i.e., identical).
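The four simplified rules above can be sketched as a mapping from a detected change class to the two column renderings. The change labels and the `**…**` bold marker are illustrative inventions, not those of the actual TAF-1.0 XSL transformations:

```python
def taf_columns(change, left_text, right_text):
    """Sketch of the simplified TAF presentation rules: given the class of
    the detected modification and the two versions of the text, return the
    (left, right) column renderings; '**...**' marks bold face."""
    if change == "words_deleted":       # rule 1
        return ("**" + left_text + "**", right_text)
    if change == "words_inserted":      # rule 2
        return (left_text, "**" + right_text + "**")
    if change == "article_suppressed":  # rule 3
        return ("**" + left_text + "**", "Soppresso.")
    if change == "article_unchanged":   # rule 4
        return (left_text, "Identico.")
    raise ValueError("unknown change type: " + change)
```

The change class itself is exactly what a naturalness-based diff provides: rules 1 and 2 consume text-node updates, while rules 3 and 4 consume structural changes.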


Note that rules 1 and 2 deal with modifications to the text only, while rules 3 and 4 deal with modifications to the overall structure of the legislative bill. Let us also remark that these examples of TAF presentation rules are actually oversimplified: for the sake of synthesis, they omit, among others, several important details on alignment and punctuation.

The process for producing the TAF documents summarized above is intrinsically error-prone and can be very costly in the absence of appropriate tools (considering the length of some legislative documents). It is very useful to automate such a process of evaluating and printing TAF documents in the Senate. Indeed, an application permitting the approximation of a good-quality TAF document (namely, TAF-1.0) is already under evaluation in some drafting Offices of the Italian Senate. TAF-1.0 is obtained by engineering and integrating an implementation of JNDiff, augmented with a sophisticated set of XSL transformations, to obtain a printable and human-readable TAF document. The overall application is delivered as a Java servlet. In order to apply JNDiff to legislative documents, these must be valid XML. TAF-1.0 obtains the XML versions of a legislative bill using the xmLeges marker component [1], a syntactical parser able to transform the flat text of a bill into a structured XML document compliant with the NIR (Norme in Rete) Italian standard for the markup of legislative acts [9]. NIR XML documents have a fine-grained markup that enables JNDiff to produce very meaningful sets of revealed differences. By implementing the TAF presentation rules, the XSL transformations of TAF-1.0 then complete the job: they are used to transform the JNDiff output and the NIR XML bill versions into an HTML TAF document, which is provided to the users of the drafting offices for further refinements, integrations, final formatting, and printing using normal word processors. We recently deployed a first version of TAF-1.0 and the results are more than promising. Our application is available at http://twprojects.cs.unibo.it:8080/taf/, along with some demonstration documents.

8 Conclusions

Our research on naturalness and JNDiff is not complete yet. However, the current implementation of JNDiff is a reliable, modular and open-source application available at http://jndiff.sourceforge.net/. The related tool JNMerge is also coded in Java and available at the same web site.

Our next step will be a further investigation of the set of changes proposed here, in order to (i) implement modules which detect still-unsupported changes and (ii) identify more complex high-level operations. We also plan to perform deeper tests on the thresholds and parameters passed to JNDiff. Further evaluations of computational costs and resource consumption will be carried out as well.

References

1. Agnoloni, T., Francesconi, E., Spinosa, P.: xmLeges Editor, an OpenSource visual XML editor for supporting Legal National Standards. In: Proceedings of V Legislative XML Workshop, Florence, Italy (2007)

2. Eggert, P.: Free Software Foundation: GNU Diff (2006), http://www.gnu.org/software/diffutils/diffutils.html


3. Ball, T., Douglis, F.: Tracking and viewing changes on the web. In: 1996 USENIX Annual Technical Conference (1996)

4. Chen, Y.F., Douglis, F., Ball, T., Koutsofios, E.: The at&t internet difference engine: Tracking and viewing changes on the web. World Wide Web 1(1), 27–44 (1998)

5. Fontaine, R.L.: A delta format for xml: identifying changes in xml files and representing the changes in xml. In: XML Europe 2001 (May 2001)

6. Fontaine, R.L.: Xml files: a new approach providing intelligent merge of xml data sets. In: XML Europe 2002 (May 2002)

7. Marian, A., Cobena, G., Abiteboul, S.: Detecting changes in xml documents. In: The 18th International Conference on Data Engineering, February 2002, pp. 493–504 (2002)

8. Hirschberg, D.S.: Algorithms for the longest common subsequence problem. Journal of the ACM 24(4), 664–675 (1977)

9. Lupo, C., Aini, F.: Norme in rete (1999), http://www.normeinrete.it/

10. Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1(2), 251–266 (1986)

11. Cai, J., Wang, Y., DeWitt, D.: X-diff: an effective change detection algorithm for xml documents. Technical Report, University of Wisconsin (2001)

12. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing 18(6), 1245–1262 (1989)
