
POLITECNICO DI MILANO

Scuola di Ingegneria Industriale e dell’Informazione

Corso di Laurea Magistrale in Ingegneria Informatica

Dipartimento di Elettronica, Informazione e Bioingegneria

A Tree-Based Blocking Technique for Record Linkage in Data

Integration

Supervisor:

PROF. LETIZIA TANCA

Assistant Supervisor:

FABIO AZZALINI

Master Graduation Thesis by:

NICCOLÒ POZZOLINI

Student ID n.

883974


Abstract

Entity Resolution (ER) is the task of identifying different entity profiles that describe the same real-world object. It is a core task for Data Integration, applicable to any kind of data, from structured sources like relational databases to semi-structured ones like the Linked Open Data Cloud and unstructured ones like free text. From a high-level perspective, ER consists of two parts:

- the search step, which determines the entities that will be compared

- the decision step, which compares the selected entities to determine whether they represent the same real-world object.

The latter step is also called Entity Matching and involves time-consuming operations, called pairwise comparisons, which typically apply string similarity measures to pairs of entities, dominating the overall cost of ER.

Blocking is a technique that takes care of the search step, which is the crucial part of the process with respect to efficiency and scalability. The block building process receives as input one or more entity collections and produces as output a block collection B; only the elements within the same block will be compared with each other. Without it, ER suffers from a quadratic time complexity, i.e., O(n²), as every element has to be compared with all others.

We found that a tree-based approach was used only by a single filtering algorithm, named Trie-Join.

The starting point of this thesis was to investigate the strengths and weaknesses of a tree-based blocking algorithm, trying to boost its performance to a point where it could be used in a real-world scenario.

In more detail, we first tried to use the Query Reverse Engineering (QRE) framework to perform the blocking step, in order to discover the hidden relations behind each formed block, providing a data-analysis tool for the end user.

To reach good performance it proved necessary to move away from a strict tree-based approach, adopting more relaxed constraints. The algorithm developed next is TIS (Tree Index Similarity, Section 4.2.2), which introduces a new index-based similarity metric that proved very effective despite requiring an expensive preprocessing step. Keeping this metric, the preprocessing mechanism was then extended and the execution part revised accordingly: the final result is TIS Scoring, a blocking algorithm with decent performance, and with a working mechanism quite different from a tree-based approach.


Sommario

Entity Resolution (ER) is defined as the process of identifying the different entities that refer to the same real-world object. It is a fundamental task for Data Integration, applicable to any kind of data, from structured sources such as relational databases to semi-structured ones such as the Linked Open Data Cloud and unstructured ones such as free text. From a high-level perspective, ER consists of two parts:

- the search step, which determines which entities will be compared with each other

- the decision step, which compares the selected entities to determine whether they represent the same real-world object.

The latter step is also called Entity Matching and involves time-consuming operations. This time cost is dominated by the many pairwise comparisons to be performed, which apply similarity measures that can be computationally expensive at scale. Blocking is a technique that takes care of the first phase, the search step, which is the crucial part of the process in terms of efficiency and scalability. The block building process receives as input one or more entity collections and produces as output a block collection B; only the elements within the same block will be compared with each other in the decision step. Without blocking, ER suffers from a quadratic time complexity, i.e., O(n²), since every element has to be compared with all the others.

We noticed that, in the landscape of available algorithms, the tree-based approach was used exclusively by a single filtering technique, called Trie-Join.

The starting point of this thesis was to investigate the strengths and weaknesses of a tree-based algorithm applied to blocking, subsequently trying to increase its performance until obtaining an algorithm usable in a real-world scenario.

In more detail, we first tried to use the Query Reverse Engineering (QRE) framework for the creation of the blocks: this makes it possible to discover possible hidden relations behind each built block, providing a data-analysis tool for the end user. Given a database D and a result table T, the goal of QRE is to derive one or more queries that, when executed on D, return T. To do this, QRE builds a decision tree, translating the various splits into query predicates.

This technique proved to be rather ineffective, and to reach good performance it was necessary to move away from a strict tree-based approach, adopting more relaxed constraints. The algorithm developed next is TIS (Tree Index Similarity, Section 4.2.2), which introduces a new index-based similarity metric that proved very effective despite suffering from an expensive preprocessing step.

Keeping this new metric, the preprocessing mechanism was subsequently extended and the execution part was revised accordingly: the final result is TIS Scoring, a blocking algorithm with decent performance and a working mechanism quite different from a tree-based approach.


List of Tables

● TABLE 1: TIS Preprocessing step 1

● TABLE 2: TIS Preprocessing step 2

● TABLE 3: TIS Preprocessing step 3

● TABLE 4: TIS Preprocessing step 4

● TABLE 5: TIS Preprocessing step 5

● TABLE 6: TIS Preprocessing step 6

● TABLE 7: Cora Dataset attributes analysis

● TABLE 8: Restaurants Dataset attributes analysis

● TABLE 9: Census Dataset attributes analysis

● TABLE 10: Proposed Algorithms Test Results

List of Figures

● FIGURE 1: Algorithm 1: Trivial Decision Tree Classifier blocking function

● FIGURE 2: Algorithm 2: TIS blocking function

● FIGURE 3: Similarity Function for TIS, TIS Scoring


1- Intro

1.1- Motivations and goals

While some algorithms use tree-based approaches for the Filtering problem, the blocking field lacks any such method. A tree-based approach would offer additional side benefits, such as formally describing blocks through queries, making it easier to investigate and tweak the process.

The starting point was therefore to investigate the effectiveness of QRE techniques in the field of blocking and proceed in that direction if the results were satisfying, or otherwise analyze and correct their weaknesses in order to develop a well-performing new algorithm.

1.2- Contents

The thesis is organized into the following chapters:

● Chapter 2: Preliminaries. In this chapter we analyze the technologies and processes used throughout our study. First we describe the three phases of Data Integration: Schema Alignment, Entity Resolution, and Data Fusion. After that we dig into the second phase, focusing on the blocking technique. Finally, we give a brief description of QRE, the starting point (and, for a short time, the main focus) of this thesis.

● Chapter 3: State-of-the-art blocking algorithms. This chapter presents the best-performing blocking algorithms currently available, including some historic algorithms that evolved into many of today's state-of-the-art methods. The algorithms are grouped into three categories: hash-based, sort-based, and hybrid methods.

● Chapter 4: Proposed methodology. This chapter describes the proposed algorithms, decomposing their description into three parts: preprocessing, execution, and parameter tuning.

● Chapter 5: Experimental results and evaluation. First, the datasets used for testing and the metrics involved are described in detail; in the last section the performance of the proposed algorithms is compared with that of Sorted Blocks, a state-of-the-art algorithm.


2- Preliminaries

2.1- Data Integration

The whole process of data integration can be divided into three distinct steps: schema alignment, entity resolution, and data fusion. The entity resolution step can be further divided into:

● Record Linkage, if dealing with two distinct clean datasets

● Deduplication, if dealing with a single dirty dataset

2.1.1- Schema alignment

Schema alignment takes care of creating a new integrated schema that can represent all data from a given set of source schemata. All data should be represented in a semantically correct way, yet completeness is not always necessary. Two important goals of schema alignment are minimality and understandability, meaning that the integrated schema should be minimal with respect to the number of relations and attributes, while still being easy to understand. Various schema integration procedures have been proposed in the literature, but most of them share these main steps [1]:

● Preintegration — An analysis of schemas is carried out before integration to decide upon some integration policy. This governs the choice of schemas to be integrated, the order of integration, and a possible assignment of preferences to entire schemas or portions of schemas.

● Comparison of the Schemas — Schemas are analyzed and compared to determine the correspondences among concepts and detect possible conflicts. We can distinguish between naming conflicts and structural conflicts. Naming conflicts can be of two types: homonyms, when the same name is used for two different concepts, and synonyms, when the same concept is described by two or more names. Structural conflicts include conflicts that arise as a result of a different choice of modeling constructs or integrity constraints. Among structural conflicts we can distinguish between:


○ Type Conflicts. These arise when the same concept is represented by different modeling constructs in different schemas. This is the case when, for example, a class of objects is represented as an entity in one schema and as an attribute in another schema

○ Dependency Conflicts. These arise when a group of concepts are related among themselves with different dependencies in different schemas. For example, the relationship Marriage between Man and Woman is 1:1 in one schema, but m:n in another, accounting for a marriage history

○ Key Conflicts. Different keys are assigned to the same concept in different schemas. For example, SS# and Empid may be the keys of Employee in two component schemas

○ Behavioral Conflicts. These arise when different insertion/deletion policies are associated with the same class of objects in distinct schemas. For example, in one schema a department may be allowed to exist without employees, whereas in another, deleting the last employee associated with a department leads to the deletion of the department itself. Note that these conflicts may arise only when the data model allows for the representation of behavioral properties of objects. Interschema properties may be discovered while comparing schemas.

● Conforming the Schemas — Once conflicts are detected, an effort is made to resolve them so that the merging of various schemas is possible.

● Merging and Restructuring — Now the schemas are ready to be superimposed, giving rise to some intermediate integrated schema(s). The intermediate results are analyzed and, if necessary, restructured in order to achieve our qualities of interest:

○ Completeness. The integrated schema must contain all concepts present in any component schema correctly

○ Minimality. If the same concept is represented in more than one component schema, it must be represented only once in the integrated schema

○ Understandability. The integrated schema should be easy to understand for the designer and the end user. This implies that, among the several possible representations of the results of integration allowed by a data model, the one that is (qualitatively) the most understandable should be chosen.

2.1.2- Entity Resolution

To describe entity resolution we first need to define the entity profile: it is a set of attribute-value pairs that constitutes the description of a unique real-world object.

Assuming infinite sets of attribute names N, attribute values V, and unique identifiers I, an entity profile is formally defined as follows [2,3]:

Definition 1 (Entity Profile)

An entity profile e_id is a tuple ⟨id, A_id⟩, where id ∈ I is a unique identifier and A_id is a set of attribute-value pairs ⟨n, v⟩, with n ∈ N and v ∈ (V ∪ I).

An entity collection E is a set of entity profiles.

Definition 2 (Entity Resolution)

We say that two entity profiles e_i and e_j match if they refer to the same real-world entity, and we denote this by e_i ≡ e_j. Matching entities are also referred to as duplicates if they come from the same source. The task of Entity Resolution (ER) is to find all matching entities within an entity collection or across two or more entity collections.

We then need to distinguish the Clean-Clean ER and Dirty ER cases:

● Clean-Clean ER receives as input two duplicate-free entity collections E1 and E2, and produces as output the set of all pairs of matching entity profiles between them, i.e., D(E1, E2) = {(e_i, e_j) : e_i ∈ E1, e_j ∈ E2, e_i ≡ e_j}.

● Dirty ER instead receives as input a single entity collection E and produces as output the set of all pairs of matching entity profiles within E, i.e., D(E) = {(e_i, e_j) : e_i ∈ E, e_j ∈ E, e_i ≡ e_j}.

In the context of structured (i.e., relational) data, Clean-Clean ER is also known as Record Linkage, whereas Dirty ER is also called Deduplication.
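To make these definitions concrete, here is a minimal Python sketch (the toy profiles and function name are illustrative assumptions, not taken from the thesis) that represents entity profiles as attribute-value mappings and enumerates the candidate pairs of Dirty ER by brute force, i.e., the quadratic workload that blocking aims to avoid:

```python
from itertools import combinations

# A hypothetical entity collection: each profile is an id plus attribute-value pairs.
E = {
    "e1": {"name": "john smith", "city": "milan"},
    "e2": {"name": "j. smith",   "city": "milan"},
    "e3": {"name": "mary jones", "city": "rome"},
}

def brute_force_candidates(collection):
    """Dirty ER without blocking: every profile is compared with all others,
    yielding |E|*(|E|-1)/2 candidate pairs."""
    return list(combinations(collection.keys(), 2))

pairs = brute_force_candidates(E)
print(pairs)       # [('e1', 'e2'), ('e1', 'e3'), ('e2', 'e3')]
print(len(pairs))  # 3 = 3*2/2
```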


2.1.3- Data Fusion

Now that we have identified the records that refer to the same real-world object, we need a way to retrieve the correct values of their attributes; this step is called data fusion. It is a combination of techniques that aims to resolve conflicts in a collection of sources and to find the truth that reflects the real world. It is a field that has emerged recently, and its motivation is precisely the veracity of data: the Web has made it easy to publish and spread false and imprecise information across multiple sources, which makes the identification of wrong information critical for presenting high-quality data [4].

There are three techniques used in Data Fusion, often used together: Copy detection, Voting and Source quality.

● Copy Detection: detects copier sources and reduces their weight

● Voting: detects the most common value for each attribute

● Source Quality: after voting, more weight is given to knowledgeable sources (those that have the highest number of common attributes)

2.2 - Blocking

Blocking tackles the issue of the quadratic nature of pairwise comparisons, providing a means to trade effectiveness for efficiency. The idea is to reduce the number of comparisons to be performed, while missing as few matches as possible.

Blocking involves a super-linear but sub-quadratic time complexity, lying between the two extremes of the ideal and the brute-force solution.

Block Building is the process that receives as input one or more entity collections and provides as output a block collection B. The blocking scheme guides the process by determining how entity profiles are assigned to blocks. This scheme is typically composed of two parts. Every entity profile is initially processed to extract signatures, be they tokens or q-grams, such that the similarity of signatures correlates well with the similarity of the corresponding entity profiles. Based on these signatures, each entity is subsequently mapped to one or more blocks.


Blocking applies to both Dirty ER and Clean-Clean ER.

2.2.1 - Blocking Definition

Let P(S) denote the power set of a set S, and let K denote the universe of signatures appearing in entity profiles. We formally define a blocking scheme as follows [5]:

Definition (Blocking Scheme). Given an entity collection E, a blocking scheme is a function f_B : E → P(B) that maps entity profiles to blocks. It is composed of two functions:

1. A transformation function f_T : E → P(K) that maps an entity profile to a set of signatures (also called blocking keys)

2. An assignment function f_A : K → P(B) that maps each signature to one or more blocks.

This definition applies to Dirty ER, but can be easily extended to Clean-Clean ER. The set of comparisons in the resulting block collection B is called comparison collection and is denoted by C(B). Every comparison c_ij ∈ C(B) belongs to one of the following types [6,7]:

• Matching comparison, if e_i and e_j match.

• Superfluous comparison, if e_i and e_j do not match.

• Redundant comparison, if e_i and e_j have already been compared in a previous block.

We collectively call the last two types unnecessary comparisons, as their execution brings no gain.

Note that once the input entity profiles are mapped to blocks, an inverted index is constructed that associates the id of each entity profile with the ids of the blocks that contain it. For this reason, Block Building is also called Indexing in the literature [8,9].
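As a hedged illustration of the blocking scheme definition above, the sketch below implements a simple token-blocking scheme: the transformation function f_T extracts lowercase tokens as blocking keys and the assignment function f_A maps each key to one block, producing the inverted index just mentioned. The record contents and function names are illustrative assumptions, not thesis code.

```python
from collections import defaultdict

def transformation(profile):
    """f_T: map an entity profile to its set of signatures (token blocking keys)."""
    return {tok.lower() for value in profile.values() for tok in str(value).split()}

def build_blocks(collection):
    """f_A: map each signature to a block; returns the block collection
    as an inverted index signature -> set of entity ids."""
    blocks = defaultdict(set)
    for eid, profile in collection.items():
        for key in transformation(profile):
            blocks[key].add(eid)
    return blocks

E = {
    "e1": {"name": "john smith", "city": "milan"},
    "e2": {"name": "j. smith",   "city": "milan"},
    "e3": {"name": "mary jones", "city": "rome"},
}

B = build_blocks(E)
# Only entities sharing a block are compared, e.g. block 'smith' -> {'e1', 'e2'}.
print(dict(B))
```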

2.2.2 - Blocking Algorithms Taxonomy

Now we introduce a taxonomy for blocking algorithms, in order to be able to better capture similarities and differences among the techniques we’ll describe in the following chapter [5].

Key selection distinguishes between rule-based and ML-based methods. The former methods assign entities to blocks based on rules derived from expert knowledge or mere heuristics. The latter methods require a training set and use Machine Learning (ML) techniques to learn the best blocking keys for mapping entities to blocks.

Schema-awareness distinguishes between schema-aware and schema-agnostic methods. The former methods presume schema knowledge for mapping entities into blocks, i.e., they extract blocking keys from specific attributes which are considered to be more appropriate for matching (e.g., more distinctive or less noisy). The latter methods do not rely on schema knowledge, extracting blocking keys from all attributes.

Key type distinguishes between hash- or equality-based methods, which map a pair of entities to the same block if they have a common key, and sort- or similarity-based methods, which map a pair of entities to the same block if they have a similar key. There exist also hybrid methods, which combine hash- with sort-based functionality.

Redundancy-awareness distinguishes methods into three categories based on the relation between the blocks they create. Redundancy-free methods assign every entity to a single block, thus creating disjoint blocks. Redundancy-positive methods place every entity into multiple blocks, yielding overlapping blocks. The more blocks two entities share, the more similar their profiles are. The number of blocks shared by a pair of entities is thus proportional to the likelihood that they are matching. Redundancy-neutral methods create overlapping blocks, where most pairs of entities share the same number of blocks, or the degree of redundancy is arbitrary, having no implications.

Constraint-awareness distinguishes blocking methods into lazy, which impose no constraints on the blocks they create, and proactive, which enforce one or more constraints on their blocks, such as maximum block size, or try to refine their comparisons, discarding unnecessary ones.

Matching-awareness distinguishes between static methods, which are independent of the subsequent matching process, producing a static block collection, and dynamic methods, which intertwine Block Building with Entity Matching, updating or processing their blocks dynamically, as more duplicates are detected.

Domain refers to the type of data a block building method is crafted for, which can be structured (e.g., relational), semi-structured (e.g., XML, RDF), or both (i.e., any data type).

2.2.3 - Blocking performance metrics

The performance of an ER workflow can be characterized by two main aspects: its effectiveness and its efficiency. The former refers to how many of the actual duplicates are detected, while the latter refers to the computational cost for detecting them [5].

This cost is typically measured by the number of performed comparisons between entity profiles. We refer to this as cardinality and denote it by ||E||. The naive, brute-force approach performs all pairwise comparisons between entity profiles.

For Clean-Clean ER, ||E|| = |E1| × |E2|, while for Dirty ER, ||E|| = |E| · (|E| − 1)/2.

In other words, the brute-force approach has quadratic complexity, which is not efficient and does not scale to large entity collections.

In more detail, a blocking method employs a blocking scheme which, when applied to one or more entity collections, yields a set of blocks B, also referred to as block collection. The cardinality ||B|| of a block collection B denotes the number of comparisons entailed in B, taking into consideration that only entity pairs within the same block are compared, i.e., ||B|| = Σ_{b_i ∈ B} ||b_i||, where ||b_i|| stands for the number of comparisons contained in block b_i. We denote the set of detectable duplicates in B as D(B), while D(E) denotes all existing duplicates. Since B reduces the number of performed comparisons, D(B) ⊆ D(E).

There is a clear trade-off between the effectiveness and the efficiency of a blocking scheme [9]: the more comparisons are contained in the resulting block collection B (i.e., the higher ||B||), the more duplicate pairs will be detected (i.e., the higher |D(B)|). In this way, effectiveness rises at the cost of lower efficiency, and vice versa. A blocking scheme is, therefore, considered successful as long as it achieves a good balance between these two competing objectives. This trade-off is commonly captured by the following two measures:


● Pair Completeness (PC) corresponds to recall, estimating the portion of the detectable duplicates with respect to the existing ones. Given a block collection B, it is defined as PC(B) = |D(B)| / |D(E)|. PC takes values in the interval [0, 1], with higher values indicating higher effectiveness of the blocking scheme.

● Reduction Ratio (RR) measures the reduction in the number of pairwise comparisons contained in a block collection B with respect to the brute-force approach. It is defined as RR(B, E) = 1 − ||B|| / ||E||, thus taking values in the interval [0, 1], with higher values denoting higher efficiency of the blocking scheme.

Note that PC provides an optimistic estimation of recall, presuming the existence of an oracle.
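Given a ground truth of duplicate pairs, PC and RR are straightforward to compute; the following sketch (illustrative code, not from the thesis) counts the comparisons entailed by a block collection according to ||B|| = Σ ||b_i|| and derives both measures for the Dirty ER case.

```python
from itertools import combinations

def block_cardinality(blocks):
    """||B||: sum over blocks of the comparisons each block entails."""
    return sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())

def detectable_duplicates(blocks, true_duplicates):
    """D(B): ground-truth pairs that co-occur in at least one block."""
    dup = {frozenset(p) for p in true_duplicates}
    found = set()
    for entities in blocks.values():
        found.update(frozenset(p) for p in combinations(entities, 2)
                     if frozenset(p) in dup)
    return found

def pc_rr(blocks, true_duplicates, n_entities):
    """Pair Completeness and Reduction Ratio for Dirty ER."""
    dup = {frozenset(p) for p in true_duplicates}
    pc = len(detectable_duplicates(blocks, true_duplicates)) / len(dup)
    brute_force = n_entities * (n_entities - 1) / 2   # ||E|| for Dirty ER
    rr = 1 - block_cardinality(blocks) / brute_force
    return pc, rr

blocks = {"smith": {"e1", "e2"}, "milan": {"e1", "e2"}, "rome": {"e3"}}
print(pc_rr(blocks, [("e1", "e2")], n_entities=3))   # (1.0, 0.333...)
```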

Based on the above, we can provide a further definition of blocking: given an entity collection E, Blocking clusters similar entities into a block collection B such that PC(B) and RR(B, E) are simultaneously maximized.

Simultaneously maximizing PC and RR necessitates that the enhancements in time efficiency do not affect the effectiveness of Blocking, i.e., the only way they can be improved together is by carefully removing comparisons between non-matching entities. Thus, conceptually, Blocking can be viewed as an optimization task. However, this implies that the real duplicate collection D(E) is known, which is actually what ER tries to compute. Hence, in practice, Blocking is typically treated as an engineering task that aims to provide an approximate solution for the data at hand.

A blocking-based ER workflow may comprise several stages. In addition to Block Building (BlBu), which applies a blocking scheme to produce a block collection B from the given entity collection(s), there is often a second, optional stage called Block Processing. Its purpose is to refine B through additional optimizations that further reduce the number of performed comparisons. This may involve discarding entire blocks that primarily contain unnecessary comparisons, referred to as Block Cleaning (BlCl), and/or discarding individual comparisons within certain blocks, referred to as Comparison Cleaning (CoCl).


Block Building may be repeated several times on the same input in order to apply multiple blocking schemes to achieve a more robust performance in the context of highly noisy data. Similarly, Block Cleaning may be applied repeatedly, each time enforcing a different, complementary method to discard blocks.

2.3 - Query Reverse Engineering

The QRE framework is based on decision trees. A decision tree is a simple classification algorithm: a supervised machine learning model in which the data is repeatedly split according to a certain parameter.

Let's first formally define the QRE framework. Given a database D and a result table T (the output of some known or unknown query Q on D), the goal of QRE is to reverse-engineer a query Q such that the output of query Q on database D (denoted by Q(D)) is equal to T (i.e., Q(D) = T). The QRE problem has useful applications in database usability, data analysis, and data security [10]. In the context of data analysis, for example, it can be used to discover hidden relations among a set of data extracted from a database, explaining how the result set has been obtained or, in other words, why it is made up of those specific tuples [11].

In this thesis, the QRE framework is used to retrieve the set of records that are most similar to the single record provided as input; this set will compose a block. The full process would be:

- Provide as input a random record

- Generate a query to select that record

- Generalize the query to include similar records

- Put the records in a block, remove them from the dataset and repeat till all records are put in a block

This full process would not be very efficient, since the steps of generating and generalizing the query are not strictly necessary. Since the query gets generated according to a tree approach, in my implementation I simply stop the tree building a few steps early.

This process can be seen in Section 4.2.1, 'Trivial tree-based blocking'; in the other algorithms I dropped the QRE approach, since its performance was not satisfying.


3 - State of the Art blocking algorithms

There is no single best algorithm; each dataset is a problem in itself: it is useful to analyze the records, identify recurring patterns in the values of the individual attributes, and choose the most appropriate algorithm.

In this section we will analyze the best-performing algorithms. Since there are numerous blocking algorithms, we will limit our survey to the schema-aware ones, which is the target use case of our algorithm, even though it can also be extended to the schema-agnostic case.

3.1 - Hash-based Methods

The seminal method Standard Blocking (SB) [12] belongs to this category.

SB applies the simplest form of blocking: an expert selects the most suitable attributes, and a transformation function concatenates (parts of) their values to form blocking keys. For every distinct key, a block is created containing all corresponding entities. In short, SB operates as a hash function, conveying two main advantages:

● It yields disjoint blocks, which contain no redundant comparisons

● It has a linear time complexity, O(|E|).

On the flip side, PC depends on the noise in the blocking keys, as the slightest difference in the blocking keys of duplicates places them in different blocks, or in no block at all. Additionally, SB is a lazy method that imposes no limit on the size of the resulting blocks. Depending on the frequency distribution of attribute values, it may yield large blocks with many superfluous comparisons, i.e., low PQ (Pairs Quality, the fraction of executed comparisons that are actual matches) and RR.

To address these issues, Suffix Arrays Blocking (SA) [13] converts each blocking key of SB into the list of its suffixes that are longer than a predetermined minimum length l_min. Then, it defines a block for every suffix that does not exceed a predetermined frequency threshold b_max, which specifies the maximum block size. This proactive functionality is necessary, as very frequent suffixes result in large blocks that are dominated by unnecessary comparisons.


SA is a popular technique that has spawned a whole family of blocking methods, due to two major advantages [14]:

● It is very efficient, as it results in a small but relevant set of candidate matches, and it has low time complexity, O(|E|·log|E|).

● It is very effective, due to its robustness to noise at the beginning of blocking keys and to the high levels of redundancy (i.e., it places every entity into multiple blocks).

On the downside, SA does not handle noise at the end of SB blocking keys.
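As a hedged sketch of SA's key generation (the parameter names l_min and b_max follow the text; the code itself is an illustration, not the original implementation), the following snippet builds suffix-based blocks and drops suffixes that exceed the frequency threshold:

```python
from collections import defaultdict

def suffixes(key, l_min):
    """All suffixes of a blocking key longer than l_min (including the key itself)."""
    return [key[i:] for i in range(len(key)) if len(key) - i > l_min]

def suffix_array_blocks(keys, l_min=3, b_max=5):
    """Map entity ids to blocks keyed by suffix, discarding too-frequent suffixes."""
    blocks = defaultdict(set)
    for eid, key in keys.items():
        for s in suffixes(key, l_min):
            blocks[s].add(eid)
    return {s: ids for s, ids in blocks.items() if len(ids) <= b_max}

keys = {"e1": "johnsmith", "e2": "jonsmith", "e3": "maryjones"}
print(suffix_array_blocks(keys))   # suffixes like 'smith' and 'nsmith' group e1 and e2
```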

A more advanced solution is provided by Improved Suffix Arrays Blocking (ISA) [15], which employs the same blocking keys as SA, but sorts them in alphabetical order, comparing the consecutive ones with a string similarity measure. If the similarity of two suffixes exceeds a predetermined threshold, the corresponding blocks are merged (note that the frequency threshold b_max is not applied after the merge of two suffixes). This allows for detecting duplicates even when there is noise at the end of SB keys, or when their sole common key is too frequent. For example, ISA detects the high string similarity of the keys "JohnSnith" and "JohnSmith", placing the corresponding entities into the same block. It is theoretically proven that ISA results in a PC greater than or equal to that of SA, though at the cost of more comparisons and, thus, lower PQ and RR.

Q-grams Blocking [9]. Its transformation function converts the blocking keys of SB into sub-sequences of q characters (q-grams) and defines a block for every distinct q-gram. For example, for q=3, the key 'france' is transformed into the trigrams 'fra', 'ran', 'anc', 'nce'. This approach is more resilient to noise than SB, yielding higher PC, but results in more and larger blocks, thus decreasing both PQ and RR.
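The q-gram transformation is a one-liner; the snippet below (an illustrative helper, not thesis code) reproduces the 'france' example:

```python
def qgrams(key, q=3):
    """Split a blocking key into its overlapping sub-sequences of q characters."""
    return [key[i:i + q] for i in range(len(key) - q + 1)]

print(qgrams("france"))  # ['fra', 'ran', 'anc', 'nce']
```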

A more advanced q-gram-based approach is MFIBlocks [16]. Its transformation function concatenates keys of Q-Grams Blocking into itemsets and uses a maximal frequent itemset algorithm to define new blocking keys. Only the most frequent itemsets that exceed a predetermined support threshold are treated as blocking keys. This significantly reduces the number of blocks and matching candidates (i.e., high PQ and RR), but it may come at the cost of missed matches (lower PC) in case the resulting blocking keys are too restrictive for matches with noisy descriptions.


3.2 - Sort-based Methods

The cornerstone method of this category is Sorted Neighborhood (SN) [17], which operates as follows: first, it sorts all blocking keys in alphabetical order and arranges the associated entities accordingly. Subsequently, a window of fixed size w slides over the sorted list of entities and compares the entity at the last position with all other entities placed within the same window. The underlying assumption is that the closer the blocking keys of two entities are in the lexicographical order, the more likely they are to be matching.

SN has three major advantages [9]:

● It has low time complexity, O(|E|·log|E|)

● Since it only scans the dataset once, it results in linear ER complexity, O(w · |E|)

● It is robust to noise, supporting errors at the end of blocking keys

On the flip side, it may place two entities in the same block even if their keys are dissimilar (e.g., the keys "alphabet" and "apple" are considered similar if no other key intervenes between them lexicographically). Its performance also depends heavily on the window size w. Configuring w is quite difficult, especially in Dirty ER, where the matching entities form clusters of varying size; a small w leads to high PQ and RR but low PC, and vice versa for a large w. To ease the effect of the fixed window size, various techniques have been proposed in the literature. The typical solution is the multi-pass version of the algorithm in [18], i.e., applying the core algorithm multiple times, using a different transformation function in each iteration. In this way, more matches can be identified, even if the window is set to a low size.
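The core SN loop can be sketched in a few lines; the hedged example below (illustrative code, not the reference implementation of [17]) sorts the entities by blocking key and emits the candidate pairs produced by a window of size w:

```python
def sorted_neighborhood(keys, w=3):
    """Sorted Neighborhood: sort by blocking key, then compare each entity
    with the w-1 entities that precede it in the sorted order."""
    order = sorted(keys, key=keys.get)          # entity ids sorted by key
    candidates = set()
    for i, eid in enumerate(order):
        for other in order[max(0, i - w + 1):i]:
            candidates.add(frozenset((eid, other)))
    return candidates

keys = {"e1": "smith john", "e2": "smith jon", "e3": "jones mary"}
print(sorted_neighborhood(keys, w=2))   # pairs of entities adjacent in key order
```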

More advanced strategies adapt the window size dynamically in order to optimize the balance between PC and PQ/RR; we can group them into three main categories [19]:

1. Key similarity strategy. The window size increases if the similarity of the blocking keys exceeds a predetermined threshold, which indicates that more similar entities should be expected.

2. Entity similarity strategy. The window size relies on the similarity of the entities within the current window. Incrementally Adaptive SN [20] increments the window size as long as the distance between the first and the last element in the current window is smaller than a predetermined threshold. The actual increase of the window size depends on its current value and the selected threshold. Accumulative Adaptive SN [20] creates windows with a single overlapping entity and exploits transitivity to group multiple adjacent windows into the same block, as long as the last entity of one window is a potential duplicate of the last entity in the next adjacent window. After expanding the window, both algorithms apply a retrenchment phase that decreases the window size until all entities within the block are potential duplicates.

3. Dynamic strategy. The window size fluctuates according to the number of identified duplicates, based on the assumption that the more duplicates are found within a window, the more are expected to be found by increasing the current window. Duplicate Count Strategy (DCS) [21] defines a window of fixed size w for every entity of SN's sorted list. All comparisons are executed within a window and the ratio d/c is estimated, where d denotes the newly detected duplicates and c the executed comparisons. The window size is then incremented by one position at a time as long as d/c ≥ ϕ, where ϕ ∈ (0, 1) is a predetermined threshold that expresses the average number of duplicates per comparison. DCS++ [21] improves DCS by increasing the window size with the next w − 1 entities, even if the new ratio becomes lower than ϕ. It also exploits transitive closure in order to skip some windows, saving part of the comparisons. It is theoretically proven that with an appropriate value for ϕ, DCS++ misses no matches, while being at least as efficient as SN.

3.3 - Hybrid methods

Sorted Blocks [22] combines the benefits of SB's hash-based with SN's sort-based functionality. First, it sorts all blocking keys and their corresponding entities in lexicographical order, just like SN. Then, it partitions the sorted entities into disjoint blocks, just like SB, using a prefix of the blocking keys. Next, all pairwise comparisons are executed within each block. To avoid missing any matches, an overlap parameter o is used to define a sliding window of fixed size. Sorted Blocks is a lazy approach that does not restrict the size of its blocks. Thus, it may result in large blocks that dominate its processing time. To address this, two proactive variants set a limit on the maximum block size. Sorted Blocks New Partition [22] simply creates a new block when the maximum block size is reached for a specific (prefix of a) blocking key; the overlap between the blocks ensures that every entity is compared with its predecessors and successors in the sorting order. Sorted Blocks Sliding Window [22] avoids executing all comparisons within a block whose size is larger than the upper limit; instead, it slides a window of size equal to the maximum block size over the entities of the current block. Sorted Blocks New Partition outperforms all other algorithms, including SB and SN. However, Sorted Blocks and its variants include more parameters than SN, involving a more complex configuration.

4- Proposed methodology

4.1- Research strategy

The starting point was to adopt a tree-based approach, which the literature lacked. By applying Query Reverse Engineering techniques it was possible, having access to the various splits performed, to obtain not only the blocks but also an explanation of each one. In this sense, a tree-based approach constituted a blocking tool that could also be used to perform data analysis. Once the first pure tree algorithm was created, we were able to analyze its limitations with respect to the advantages a QRE approach entails. With the subsequent iterations we tried to sacrifice the tree approach as little as possible, while keeping the focus on increasing performance.

4.2- Proposed algorithm

By definition a tree-based approach imposes some constraints; I'll describe them according to the previously provided taxonomy:

- Key selection must be rule-based, since machine learning is not involved in a vanilla implementation of a decision tree classifier

- Key type must be equality-based, since decision tree classifiers compare values directly instead of operating on hashes of them

I developed three distinct algorithms. The first is a trivial implementation of scipy's decision tree algorithm, adapted to our use case. The next two algorithms differ significantly in implementation, while trying to detach themselves as little as possible from the tree approach. While with TIS I managed to maintain a relationship with the tree approach, with the TIS Scoring evolution I completely sacrificed the tree method in favor of higher performance.

4.2.1 - Trivial tree based blocking

The first approach was to adopt a standard decision tree algorithm. The main problem with this choice is that these algorithms are effective for numerical attributes, while:

1. In databases, numerical values are often in the minority

2. The noise in most cases is at the level of string perturbation, not perturbation of the numerical value (for example “1992”, ”1992b”, ”october 1992”, “october” may refer to the same attribute of the same real world entity)

Not wanting to apply particular tweaks in this first test, I decided to opt for a well-performing data mining library: scipy.

Using an off-the-shelf library I was not able to apply particular split strategies (for example, special treatment of outliers), but this test was useful for understanding the initial direction to follow. A negative aspect of the tree approach is the inability to manage missing values, so I had to perform imputation with the most-frequent policy, and this probably contributed to lower performance. The order of the attributes is decided by the algorithm according to metrics such as the Gini index or entropy, which are not very indicative for our purpose. In the subsequent algorithms, developed ad hoc, the attributes will be sorted in ascending order of missing values, or in a user-specified order.


The algorithm used is, therefore, the following: for each block, I select a random tuple from the dataset [line 5], which will constitute the training set of the decision tree; the algorithm splits according to each attribute until the tuples that meet the split criteria are of the desired number. At this point, the block is formed [line 11] and the tuples involved are removed from the dataset. These steps are repeated until the dataset is empty, i.e., all tuples are placed in some block.

It is important to underline that the size of the blocks is not rigorously decided beforehand; only a lower bound is provided: when a split generates a block that is too small, the block is returned without considering the last split. It can be observed experimentally that the blocks contain a similar number of elements.
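A minimal sketch of this procedure is shown below, under two assumptions that go beyond the text: the off-the-shelf tree is taken to be scikit-learn's DecisionTreeClassifier (a common stand-in for the library mentioned above), and the dataset has already been imputed and one-hot encoded into a numeric matrix. Names and parameter choices are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def trivial_tree_blocking(X, block_min=10, seed=0):
    """Sketch of the trivial tree-based blocking: repeatedly pick a random
    'positive' record, grow a tree that tries to isolate it, and take the
    records falling in the same leaf as a block (leaf size >= block_min)."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(X))
    blocks = []
    while len(remaining) > 0:
        if len(remaining) <= block_min:
            blocks.append(remaining.tolist())
            break
        pos = rng.choice(remaining)                    # random training tuple
        y = (remaining == pos).astype(int)             # 1 only for the positive record
        tree = DecisionTreeClassifier(min_samples_leaf=block_min)
        tree.fit(X[remaining], y)
        leaves = tree.apply(X[remaining])              # leaf id of every record
        pos_leaf = leaves[remaining == pos][0]
        block = remaining[leaves == pos_leaf]          # records sharing the leaf
        blocks.append(block.tolist())
        remaining = remaining[leaves != pos_leaf]      # remove blocked tuples
    return blocks

X = np.random.default_rng(1).integers(0, 2, size=(100, 8)).astype(float)
print([len(b) for b in trivial_tree_blocking(X)])
```

This sketch produces disjoint blocks; the overlapping variant discussed next would keep blocked records available when building later blocks.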

The first variation that I implemented on this algorithm was to move to a redundancy-positive approach, in which the blocks can share one or more elements. This variation increases the number of blocks, so the Reduction Ratio is slightly lower, but the Pair Completeness is definitely increased. Overall the performance improved considerably, showing an average PC increase of 22%, at the cost of an average RR loss of 5.6%.

I then ran some more tests on the 'cora' dataset; it is the most realistic dataset I used, so it was the hardest to work with. First, I performed a deep preprocessing of the Author field, since it is one of the dirtiest ones despite its huge semantic importance; PC was raised by 3.1%. Another change was to apply one-hot encoding to every single word in the Author field, it being a field that globally contains a limited number of words and is at the same time extremely important. As an example, a tuple having "Al Charles" as Author will generate two distinct boolean attributes: "al" and "charles".

Performance improved moderately, leading to a PC increase of 7.7%.

Despite this one-hot encoding step, it is still possible to trace a corresponding query with a syntax of the type "AND 'x' IN Authors", thus keeping the QRE benefits.

The complexities will be analyzed later.

4.2.2 - TIS Algorithm

Having acquired knowledge of the strengths and weaknesses of this tree approach, I moved on to the development of a new algorithm: TIS - Tree Index Similarity.

The strength of this approach is the adoption of a new record similarity metric. The tree philosophy is still present in the execution of the algorithm; however, to increase performance, I had to sacrifice the possibility of obtaining a query describing the block. TIS, in fact, performs a preprocessing in which the dataset is completely encoded, losing the content of the individual fields: each field is transformed into a numeric value, an index to be exact.

4.2.2.1 - TIS preprocessing

The description of the preprocessing will be supported by a visual example. The following table contains our simulated original dataset:


[TABLE 1: TIS Preprocessing step 1]

We’ll process the columns one by one, so let each attribute have its own data structure.

[TABLE 2: TIS Preprocessing step 2]

The first step consists of sorting the words alphabetically in each cell of the dataset. The next algorithm, TIS Scoring, will replace this single processing function with a set of user designed ones, working in parallel.


[TABLE 3: TIS Preprocessing step 3]

Now each table’s records will be sorted according to the attribute’s column.

[TABLE 4: TIS Preprocessing step 4]

Now the attribute values are replaced by the record's position. In the same table, the same values will have the same index. In this way, two fields with similar prefixes will have close indices.


[TABLE 5: TIS Preprocessing step 5]

Finally, we can merge together the data structure to create our processed dataset.

[TABLE 6: TIS Preprocessing step 6]

The output is, therefore, a dataset of the same size n×m, whose fields are replaced by indices.
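A hedged sketch of this indexing preprocessing for a single attribute, written in plain Python with illustrative names, could look as follows:

```python
def tis_index_column(values):
    """TIS preprocessing for one attribute: sort the words inside each cell,
    sort the cells alphabetically, and replace each cell with its rank
    (identical cells share the same index, similar prefixes get close indices)."""
    normalized = [" ".join(sorted(str(v).split())) for v in values]
    ranking = {cell: rank for rank, cell in enumerate(sorted(set(normalized)))}
    return [ranking[cell] for cell in normalized]

def tis_preprocess(records, attributes):
    """Apply the indexing column by column; returns one indexed row per record."""
    columns = {a: tis_index_column([r[a] for r in records]) for a in attributes}
    return [[columns[a][i] for a in attributes] for i in range(len(records))]

records = [
    {"author": "m fahle s edelman", "year": "1992"},
    {"author": "fahle m edelman s", "year": "1992b"},
    {"author": "charles a grant",   "year": "1995"},
]
print(tis_preprocess(records, ["author", "year"]))
# the first two rows share the author index, since their sorted words coincide
```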


4.2.2.2 - TIS Execution

Given the significant performance increase at low cost provided by the switch to the overlapping strategy in the tree algorithm (+22% PC), in the development of the subsequent algorithms I continued to allow blocks to overlap.

Here the similarity metric of two records is simply the absolute value of the difference between their indices, summed over all attributes.

A vector of size N is created, initially containing all the IDs: it indicates which tuples have not yet been inserted in any block. This list will be called "missing_list" [line 1].

Each block is built like this:

The first record in the missing_list is selected as positive: the block that is about to be built will consist of records similar to this one. I will refer to this record, used as the training set for a block, as the "positive record".

The first iteration of the block building process only considers the first attribute: the records whose first attribute value does not differ from that of the positive record by more than DiffThresh are added to the block [lines 7-12].

For each subsequent attribute, I remove from the block the records that do not meet the same criteria.

When a split produces a block smaller than BlockThresh, the algorithm returns the block relative to the previous split [lines 23-25].
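A hedged sketch of the described loop, using the parameter names from the text (DiffThresh, BlockThresh) on the indexed dataset produced by the preprocessing, with everything else assumed for illustration:

```python
def tis_blocking(indexed, diff_thresh=2, block_thresh=3):
    """TIS execution sketch: pick the first unblocked record as 'positive',
    keep the records whose first-attribute index is within diff_thresh,
    then filter attribute by attribute, stopping before the block gets
    smaller than block_thresh."""
    n, m = len(indexed), len(indexed[0])
    missing = list(range(n))
    blocks = []
    while missing:
        pos = missing[0]
        block = [i for i in range(n)
                 if abs(indexed[i][0] - indexed[pos][0]) <= diff_thresh]
        for a in range(1, m):
            filtered = [i for i in block
                        if abs(indexed[i][a] - indexed[pos][a]) <= diff_thresh]
            if len(filtered) < block_thresh:
                break                       # keep the block from the previous split
            block = filtered
        blocks.append(block)
        missing = [i for i in missing if i not in block]   # overlap across blocks allowed
    return blocks

indexed = [[1, 0], [1, 1], [0, 2], [5, 4], [6, 4]]
print(tis_blocking(indexed, diff_thresh=1, block_thresh=2))
```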


With this approach, I am also free to define arbitrary policies for missing values. By carrying out various tests, it emerged that the missing-as-negative policy performs better than the missing-as-positive policy on the datasets used (+12.9% PC on average); however, it is plausible that particularly sparse datasets can benefit from the latter. The missing-as-negative policy could also be obtained in the trivial tree algorithm, by applying a preprocessing that replaces the missing values with random strings. The order in which the attributes are chosen is left to a domain expert to increase performance. In the expert's absence, however, the algorithm can be operated in an agnostic mode: the attributes are used in increasing order of missing values. This approach has a twofold significance: on the one hand, more data will be used; on the other, a smaller number of missing values tends to underline the semantic importance of that attribute, which is, therefore, more deserving of guiding the split. To perform the tests I opted for this agnostic mode.

4.2.3 - TIS Scoring

TIS is the basic implementation of the index difference similarity metric.

The only preprocessing feature introduced is sorting the words alphabetically within the individual fields of each record. This technique, although useful in most cases, can be penalizing in others: the addition of a word in a field may throw the corresponding index completely off.

TIS is ineffective for fields that, once sorted, differ in their first letters, typically due to the addition of a word that comes first in alphabetical order. This problem is solved in the TIS Scoring variant, which expands the preprocessing step to increase PC at the expense of a marginal loss of RR.


TIS Scoring generalizes the TIS approach: while maintaining its similarity metric, it makes preprocessing expandable and drastically changes the process of creating blocks.

4.2.3.1 - TIS Scoring preprocessing

The creation of the indexed dataset is the same as for TIS. The difference lies in the fact that here the process is repeated several times, each time varying the preprocessing function applied to the individual fields. By reversing the strings contained in the fields, for example, we obtain high similarity between two records that are similar in their last letters. In TIS the words in the fields were sorted alphabetically; in TIS Scoring the preprocessing functions can be chosen arbitrarily based on an analysis of the database by a domain expert. Implementing several functions in parallel therefore requires creating a number of data structures equal to the number of functions used.

I implemented four preprocessing functions that work quite well together. I'll describe them assuming we have two tuples T1 and T2 that refer to the same real-world entity.

- Sorted: same as in TIS. Effective in case of word shifts, for example a different order of name and surname: T1,1 = "m fahle s edelman", T2,1 = "fahle m edelman s". They will both become T1,2 = T2,2 = "edelman fahle m s". If one or more words are contained in just one of the two records, the "sorted" preprocessing function is only effective if these words are of higher alphabetical order, as in the following example: T1,1 = "fahle m edelman s", T2,1 = "fahle m edelman s poggio t". They become T1,2 = "edelman fahle m s", T2,2 = "edelman fahle m s poggio t", so high similarity will be assigned to them.

- Sorted rev: same as before, but at the end the string is reversed. As just shown, the "sorted" preprocessing function is vulnerable to word insertion. "Sorted rev", like the other two functions, tries to compensate for this problem: part of the matches missed by "sorted" will be caught by these other functions. For example, "sorted rev" can be useful in a situation like this: T1,1 = "fahle m edelman s", T2,1 = "fahle m edelman s charles grant" will become T1,2 = "edelman fahle m s", T2,2 = "charles edelman fahle grant m s"; "sorted rev" will assign high similarity to these two records. That is the core philosophy of TIS Scoring: to introduce multiple cheap preprocessing functions capable of catching all the different ways in which two strings may differ while still belonging to tuples that refer to the same real-world entity. The same considerations apply to "normal" and "normal rev".

- Normal: the string is not changed.

- Normal rev: the string is only reversed.

The weakness of the index-like metric is that it only gives importance to one side of the word; using the inverted string in parallel significantly mitigates the problem.
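A minimal sketch of the four preprocessing functions, assuming that "inverted"/"reversed" means character-wise reversal of the whole field string (function names are illustrative):

```python
def sorted_fn(value):
    """'Sorted': alphabetically sort the words inside the field (as in TIS)."""
    return " ".join(sorted(value.split()))

def sorted_rev_fn(value):
    """'Sorted rev': sort the words, then reverse the whole string character-wise,
    so that fields sharing their ending get close indices."""
    return sorted_fn(value)[::-1]

def normal_fn(value):
    """'Normal': leave the field unchanged."""
    return value

def normal_rev_fn(value):
    """'Normal rev': only reverse the string character-wise."""
    return value[::-1]

PREPROCESSING_FUNCTIONS = [sorted_fn, sorted_rev_fn, normal_fn, normal_rev_fn]

# Each function produces its own indexed structure, e.g. by reusing tis_preprocess
# from the previous sketch on the transformed values.
print(sorted_fn("m fahle s edelman"))       # 'edelman fahle m s'
print(sorted_rev_fn("m fahle s edelman"))   # 's m elhaf namlede'
```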

This process mimics a winning approach in the blocking landscape: the execution of multiple cheap blocking functions. The execution of TIS Scoring is not exactly cheap, but the use of multiple indexed structures only penalizes the execution linearly, while performance significantly increases. The ability to expand and customize the preprocessing makes TIS Scoring particularly suitable for datasets with multi-word text fields.

4.2.3.2 - TIS Scoring Execution

Having described how the indexed structures are created, I can now introduce the record similarity metric: given I indexed structures D_i, each with K attributes a_k, and two records T1 and T2, their similarity is computed as follows:

S(T1, T2) = Σ_{D_i ∈ I} Σ_{a_k ∈ K} |value(T1, D_i, a_k) − value(T2, D_i, a_k)|

Now I'll illustrate TIS Scoring's execution, referring to the line numbers of its code listing.

A vector of size N is created, initially containing all the IDs: it indicates which tuples have not yet been inserted in any block. This list will be called "missing_list".


The first record in the missing_list is selected as positive: the block that is about to be built will consist of records similar to this one. I will refer to this record, used as the training set for a block, as the "positive record" [line 5].

The list of the IDs in the dataset is scanned in its entirety, computing the similarity of each record against the positive one: for each record ID [line 7], each indexed structure is read one by one [line 8] to compute the similarity according to each attribute. Once every data structure has been considered, all the scores are summed up to compose the record similarity. Two records are similar if their similarity tends to zero. The BlockSize records with the highest scores will form the block and will therefore also be removed from the missing_list [lines 18-23].

As with TIS, the block building process finishes when there are no more records in the missing_list.


Initially, I tried to simply add up the differences between the indices, letting statistics do its job, but this solution proved to be too noisy and poorly performing: PC was 0.82 on average. Therefore, I implemented a threshold (DiffThresh) below which the score is not increased. By eliminating this noise, great results are obtained.
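The exact direction of the threshold is left slightly open by the text; the hedged sketch below adopts one consistent reading, in which a record's score counts the (indexed structure, attribute) combinations whose index difference stays within DiffThresh, so that the BlockSize records with the highest scores form the block. Both the code and this reading are illustrative assumptions, not the thesis implementation.

```python
def tis_scoring_blocking(indexed_structures, diff_thresh=2, block_size=4):
    """TIS Scoring sketch under one reading of the text: for every unblocked
    'positive' record, score each record by counting the (structure, attribute)
    pairs whose index difference is within diff_thresh, then take the
    block_size highest-scoring records as the block."""
    n = len(indexed_structures[0])
    missing = list(range(n))
    blocks = []
    while missing:
        pos = missing[0]
        scores = []
        for i in range(n):
            score = sum(1
                        for structure in indexed_structures    # one per preprocessing function
                        for a in range(len(structure[0]))
                        if abs(structure[i][a] - structure[pos][a]) <= diff_thresh)
            scores.append((score, i))
        block = [i for _, i in sorted(scores, reverse=True)[:block_size]]
        blocks.append(block)
        missing = [i for i in missing if i not in block]
    return blocks

# Two indexed structures (e.g. 'sorted' and 'sorted rev') over 6 records, 2 attributes each.
structure_a = [[0, 1], [1, 1], [9, 8], [8, 9], [2, 0], [7, 7]]
structure_b = [[1, 0], [0, 1], [8, 9], [9, 8], [1, 2], [7, 6]]
print(tis_scoring_blocking([structure_a, structure_b], diff_thresh=1, block_size=3))
```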

I tried to implement an alternative to the BlockSize variable: a block would no longer contain the records with the #BlockSize highest scores, but the records with a score greater than or equal to ScoreThresh. This approach leads to blocks of overly variable size and poor performance, so I adopted a hybrid approach in which both the BlockSize and ScoreThresh constraints must be respected for the creation of a block. Again, performance turned out to be worse.

An important advantage of TIS Scoring is that it is possible to integrate the domain expert's knowledge through two sets of weights: one for the indexed structures and one for the attributes.

As for TIS, the order in which the attributes are chosen is left to a domain expert to increase performance. In the expert's absence, however, the algorithm can be operated in an agnostic mode: the attributes are used in increasing order of missing values. In testing, I stuck to this agnostic mode.

4.2.3.3 - TIS / TIS Scoring parameter tuning

TIS and TIS Scoring use two parameters that need tuning: DiffThresh and BlockSize / BlockThresh. Having a pool of duplicate records available, a domain expert will be able to analyze the indexed tables and extrapolate the best values. Alternatively, multiple executions varying the parameters can be a viable solution. The best approach is probably a mix of these two techniques.

4.3 - Complexity analysis

Time complexity is fundamental in a blocking algorithm, and a main concern for the TIS algorithm family. Let's analyze the preprocessing and the execution parts independently.

Let's first focus on the preprocessing. Being d the number of attributes in the dataset, k the number of preprocessing tables (the indexed structures; k = 1 for TIS), and n the number of tuples in the dataset, the overall complexity is O(d · k · sort(n)) = O(d · k · n · log(n)) = O(n · log(n)), since d and k are small constants and sorting dominates. The preprocessing complexity is the same for TIS and TIS Scoring.

When it comes to execution we need to consider TIS and TIS Scoring independently. Given the redundancy-positive approach, I am considering O(number_of_blocks_generated) = O(log(n)).

TIS:

For each block:

- we scan the whole dataset to check the first attribute

- we then scan the block multiple times to filter the records according to the other attributes

Being b the number of blocks, f a term representing the filtering passes, and n the number of tuples in the dataset, the time complexity is O(b · (n + f)).

Assumptions:

● I am considering O(b) = O(log(n))

● f represents the filtering passes after the first attribute has been evaluated. At first the block contains the records whose index differs from the positive record's by less than DiffThresh. This number of records surely depends on n, but arguably less than linearly; I am therefore considering O(f) = O(log(n)).

Time complexity = O(b · (n + f)) = O(n · log(n))

TIS Scoring:

For each block we scan the whole dataset to compute the scores, and for each record we compare its score with the scores of the records in the block.

Being b the number of blocks, s the block size, k the number of preprocessing tables (the indexed structures), and n the number of tuples in the dataset, the time complexity is O(b · s · n · k) = O(n · log(n)).

TIS and TIS Scoring both have an O(n · log(n)) time complexity, which is not linear but still acceptable in a wide variety of cases.


5- Experimental results and evaluation

5.1- Datasets

5.1.1- Cora dataset

What it is about: scientific publications.

Length: 1879 records.

Number of attributes: 13.

Attributes analysis:

[TABLE 7: Cora Dataset attributes analysis]

5.1.2- Restaurant dataset

What it is about: basic information about restaurants.

Length: 864 records.

Number of attributes: 6.

Attributes analysis:


[TABLE 8: Restaurants Dataset attributes analysis]

5.1.3- Census dataset

What it is about: basic demographic information.

Length: 841 records.

Number of attributes: 6.

Attributes analysis:

[TABLE 9: Census Dataset attributes analysis]


5.2- Test results

[TABLE 10: Proposed algorithms test results, compared with a state of the art algorithm’s performances]

Sorted Blocks' parameters:

● u: the size of the sliding window

● key_attr_substr: how many initial characters of each attribute are used to build the key. Not all the attributes are used.

As I mentioned before, the tests have been conducted using TIS and TIS Scoring in a schema-agnostic way, which means choosing the attributes in order of increasing number of missing values. Usually most of these attributes are the best ones to pick, but a configuration by a domain expert would still be the best option.

Results are not bad, but not astonishing either: given the time complexity sacrifice dictated by the heavy preprocessing, performance should probably be a little higher to justify a real-world application of TIS and TIS Scoring.

We can notice that TIS Scoring performs significantly better than its ancestor TIS, but tuning the parameters to reach a high level of pair completeness may lead to a rather low reduction ratio. An important point to notice is the inconsistency of TIS Scoring's performance compared to Sorted Blocks: it performs well on the easier datasets (census, restaurants), but rather badly on cora, the most realistic dataset of the three (in terms of diversity of artifacts in the records). To get a satisfying value of PC I had to increase the block size a lot, leading to a very low RR. A wiser choice of preprocessing functions would surely lead to higher metrics, but it would require a deep analysis of the dataset and quite a bit of trial and error, and even then positive results would not be guaranteed. TIS Scoring's performance is fairly good compared to the state of the art, but at the cost of a higher configuration and tuning effort that may not be worth it.

6- Conclusions and future research direction

I think the approach behind TIS Scoring is interesting and, if its implementation is revisited, it may lead to a well-performing technique. Until then, given the wide panorama of well-performing algorithms, I see too many limitations to suggest a real-world application of TIS Scoring.

An interesting research direction would be the tuning process. Tuning TIS and TIS Scoring is quite different from tuning Sorted Blocks, even though the number of tuning parameters is the same. For Sorted Blocks we have the window parameter 'u' and 'key_attr_substr', which is the length of the substring of each attribute used to compute the token.

For TIS and TIS Scoring we have to tune the two thresholds for block size and record similarity (measured in indexes distance).

Tuning Sorted Blocks' parameters is easier, the main reason being that 'u' and 'key_attr_substr' are largely independent. That is not the case for TIS and TIS Scoring, where the tuning of the two parameters must be orchestrated together. Another reason is that the search space for TIS and TIS Scoring parameters is significantly bigger.

For this reason, an interesting research direction would be the study of a function that, by analyzing the indexed data structures, outputs a well-performing set of parameters.


Bibliography

[1] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323–364, 1986. DOI: 10.1145/27633.27634

[2] V. Christophides, V. Efthymiou, and K. Stefanidis. Entity Resolution in the Web of Data. Morgan & Claypool Publishers, 2015.

[3] T. Papenbrock, A. Heise, and F. Naumann. Progressive duplicate detection. IEEE TKDE., 27(5):1316–1329, 2015.

[4] X. L. Dong and D. Srivastava. Big Data Integration. 2017.

[5] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas. A survey of blocking and filtering techniques for entity resolution. 2019.

[6] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée, andW. Nejdl. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE TKDE, 25(12):2665–2682, 2013.

[7] G. Papadakis, G. Koutrika, T. Palpanas, andW. Nejdl. Meta-blocking: Taking entity resolution to the next level. IEEE TKDE, 26(8):1946–1960, 2014.

[8] P. Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.

[9] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE TKDE, 24(9):1537–1555, 2012.

[10] Q. T. Tran, C. Y. Chan, and S. Parthasarathy. Query reverse engineering. The VLDB Journal, 23:721–746, 2014.

[11] A. Pasquali and A. Perini. Explaining query answers using query reverse engineering. 2019.

[12] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[13] A. N. Aizawa and K. Oyama. A fast linkage detection scheme for multi-source information integration. In International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), pages 30–39, 2005.

[14] T. de Vries, H. Ke, S. Chawla, and P. Christen. Robust record linkage blocking using suffix arrays. In CIKM, pages 305–314, 2009.

[15] T. de Vries, H. Ke, S. Chawla, and P. Christen. Robust record linkage blocking using suffix arrays. In CIKM, pages 305–314, 2009.

[16] B. Kenig and A. Gal. Mfiblocks: An effective blocking algorithm for entity resolution. Inf. Syst., 38(6):908–926, 2013.

[17] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, pages 127–138, 1995.


[18] M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2(1):9–37, 1998.

[19] K. Ma, F. Dong, and B. Yang. Large-scale schema-free data deduplication approach with adaptive sliding window using mapreduce. Comput. J., 58(11):3187–3201, 2015.

[20] S. Yan, D. Lee, M. Kan, and C. L. Giles. Adaptive sorted neighborhood methods for efficient record linkage. In JCDL, pages 185–194, 2007.

[21] U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg. Adaptive windows for duplicate detection. In ICDE, pages 1073–1083, 2012.

[22] U. Draisbach and F. Naumann. A generalization of blocking and windowing algorithms for duplicate detection. In ICDKE, pages 18–24, 2011.
