
DIPARTIMENTO DI FILOLOGIA, LETTERATURA E LINGUISTICA
CORSO DI LAUREA IN INFORMATICA UMANISTICA

MASTER'S THESIS

Risk Minimization Models for Technology-Assisted Review and their Application to e-Discovery

Candidate:

Alessio Molinari

Supervisors:

Andrea Esuli

Fabrizio Sebastiani

Academic Year 2018/2019


Abstract

In several subfields of data science, the term “review” refers to the activities, carried out by one or more human annotators, of checking the correctness of the class labels attributed by an automatic process to unlabelled data items, and of replacing wrong labels with the correct labels.

Given a set D of unlabelled items that are automatically classified, only a subset of such items are usually reviewed (otherwise, the previous automatic classification step would be useless). This is due to the fact that reviewing comes at a cost, and is even more true when the size of the automatically classified dataset is large, which is increasingly often the case in many application domains. The amount of data items that are reviewed depends (among other factors) on the annotation budget, i.e., the maximum amount of data items that the annotator is willing / has time to review, or the maximum cost one is willing to pay an annotator for having the data reviewed.

In order to support the activity of human annotators, it would be important to identify which data items in D should be reviewed and which should not, in such a way that overall costs are minimized and the accuracy of the resulting labels is maximized. Exactly identifying these data items is the task of technology-assisted review (TAR) algorithms. Essentially, these algorithms attempt to strike an optimal tradeoff between the contrasting goals of minimizing the cost of human intervention and maximizing the accuracy of the resulting labelled data, by focusing on those data items that are most likely to have been misclassified by the automatic classifier, and/or those data items for which review would bring about the highest benefit.

In this work we introduce and test three major modifications of MINECORE, a recently proposed TAR algorithm whose goal is that of minimizing the expected costs of review for topical relevance (also known as "responsiveness") and sensitivity (also known as "privilege") in e-discovery, the latter being defined as the task of determining, within a civil lawsuit in legal systems based on common law, which documents need to be produced by one party to the other party in the lawsuit.

The first modification consists in attempting to increase the quality of the "posterior" probabilities that are input to MINECORE. We attempt to do this via an instance of the EM algorithm which leverages the transductive nature of e-discovery, i.e., the fact that the set of documents that must be classified is finite and available at training time.

The second modification relaxes a simplifying assumption on which MINECORE relies, i.e., that we should “trust” the posterior probabilities output by our machine-learned classifiers.

The third modification relaxes yet another simplifying assumption on which MINECORE relies, i.e., the assumption that human annotators always attribute the correct class label to the data items they review.

We report experimental results obtained on a large (≈ 800K) dataset of textual documents, on which we experimentally compare the original version of MINECORE with our three variants.


Contents

1 Introduction

2 MINECORE: A risk minimization framework for e-discovery
  2.1 What is e-discovery?
  2.2 MINECORE: A semi-automated predictive coding system
      2.2.1 Two extreme solutions: The fully automated and the fully manual solutions
      2.2.2 A hybrid solution: The MINECORE framework
  2.3 Applying MINECORE beyond e-discovery: The case of systematic reviews

3 Our experimental setting
  3.1 The dataset
  3.2 The cost structures

4 Improving the posterior probabilities via EMQ
  4.1 Prior probabilities, posterior probabilities, and their mutual dependence
  4.2 Experiments

5 Smoothing the posterior probabilities returned by the classifiers
  5.1 Smoothing posterior probabilities
  5.2 Experiments

6 Modelling the fallibility of human annotators
  6.1 Modelling annotator fallibility in MINECORE
  6.2 Experiments

7 Conclusion and future directions


1 Introduction

In several subfields of data science, the term "review" generally refers to the activity, carried out by a human annotator (also called reviewer, or coder), of checking the correctness of the class labels attributed by an automatic process (usually: a machine-learned classifier) to unlabelled data items, and of replacing wrong labels with the correct ones (Cormack and Grossman, 2015).

Given a set D of unlabelled items that are automatically classified, only a subset of such items are usually reviewed (otherwise, the previous automatic classification step would be useless). This is due to the fact that reviewing comes at a cost, and is even more true when the size of the automatically classified dataset is large, which is increasingly often the case in many application domains. The amount of data items that are reviewed depends (among other factors) on the annotation budget, i.e., the maximum amount of data items that the annotators are willing / have time to review, or the maximum cost one is willing to pay annotators for having the data reviewed. The goal of Technology-Assisted Review (TAR) systems (Cormack and Grossman, 2015; Saha et al., 2015) is to identify exactly which data items should be reviewed, in such a way that annotation costs are minimized and the accuracy of the resulting labels is maximized. Typically, these data items are those that are most likely to have been misclassified by the automatic classifier, or those for which review would bring about the highest benefit, or those for which a combination of these two criteria holds.

TAR systems thus belong to the "human in the loop" category of software systems: that is, the system relies, in certain pre-defined steps, on having humans carry out part of the work, the results of which will then be used for further processing by the machine. From a more general point of view, we might say that such systems involve and are based on man-machine collaboration, where both entities are of critical importance and mutually necessary for achieving the final goal. These systems can generate synergy between the human and the automatic component: Grossman and Cormack (2011, p. 3) even claim that "a technology-assisted process, in which humans examine only a small fraction of the document collection, can yield higher recall and/or precision than an exhaustive manual


review process, in which humans code and examine the entire document collection". This thesis focuses on defining and experimentally testing three main modifications aimed at improving MINECORE (Oard et al., 2019), a TAR system originally developed for e-discovery applications. E-discovery is a process especially relevant in civil litigation in the United States of America and in other jurisdictions based on common law; e-discovery consists of a review phase in which a set D of digital documents that may contain evidence of interest in a specific lawsuit is reviewed to identify those which are "responsive" (i.e., relevant) to a request made by one of the parties.

We actually rely on annotators in different steps of our framework (see Section 2):

• First of all, e-discovery training sets are usually built via active learning (Settles, 2012) (even though in the MINECORE experimental setup a preexisting dataset was employed, see Section 3), a process in which the annotators are presented with several documents they are asked to evaluate, so that the algorithm can learn which documents, out of the entire set, are most appropriate to be put in the training set (this is thus common to all TAR systems);

• Secondly, MINECORE asks reviewers to annotate a subset of the total number of documents to help it take the best possible decisions, in a risk-minimization fashion.

Regarding the second point, we will see how MINECORE tries to strike an optimal balance between machine and human work (i.e., load-balancing): indeed, this is, to the best of our knowledge, the first framework for e-discovery which takes into consideration both misclassification and annotation costs, making use of decision theory techniques to avoid extremely risky courses of action (i.e., taking decisions which may lead to higher costs), leveraging both automated text classifiers and human reviewers. Moreover, the very large (and ever-growing) size of the datasets in basically any TAR setting, as well as the critical importance of achieving high recall and precision, make man-machine collaboration not only desirable, but inevitable: indeed, the amount of documents to be manually reviewed has increased enormously, making it almost impossible for a company to bear the costs of such work; in fact, Grossman and Cormack (2013, p. 4) point out how the entities involved in discovery cases "may find that the cost of searching and producing is so great that settling the lawsuit may be the only way out of an otherwise impossible situation". On the other hand, even though humans cannot carry out such an enormous amount of work on their own, blindly trusting a classifier can be very dangerous and lead to severe misclassification costs; this is the very reason behind our research work, i.e., reaching the best possible balance of human and machine workloads, as pointed out before.

That said, much of our effort will go not only into explaining as clearly as possible how we propose to modify MINECORE, but also into how we carry out our experiments: we believe this to be an extremely important aspect of any experimental work, as it allows readers to reproduce all of our tests and obtain the same results1.

Indeed, reproducibility is a constantly growing concern in scientific studies: it is often the case that the results obtained in a research paper are difficult (or sometimes impossible) to replicate and, as a consequence, the validity of the study itself is hard to estimate. This is why, in the first place, we are going to publish all of our source code on a public web platform, so as to make it easy for everyone to download, read and also review any part of the framework: we strongly foster the idea that "anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility" (Ince et al., 2012).

Our efforts towards reaching this goal consist of:

• Publishing all Python code on Github, free for anyone to read and fork;

• Using and implementing a publicly available dataset (which, for reasons we will see later, is not an e-discovery dataset). These two operations aim towards what Goodman et al. have called “Results reproducibility”, that is “obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible.” (Goodman et al., 2016);

• Explaining in a detailed way how and what we do to modify the framework, be it its internal details or its input data, as well as explaining how we prepare and carry out each of our experiments, also showing pseudo-code of the most central algorithms so that it is not necessary to go and read the source code;

• Explaining how and what we do to analyze and plot our results. These last two points aim to achieve “Method reproducibility”, that is “the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated” (Goodman et al., 2016).

We will show in the following section how MINECORE works and which aspects we can work on to improve it; we then describe and test our hypotheses in Section 4 and Section 5. Section 7 concludes, summing up our results and sketching some avenues for possible future research.


2 MINECORE: A risk minimization framework for e-discovery

2.1 What is e-discovery?

When a civil lawsuit is filed in the United States of America, the judge may request one of the involved parties to produce to the other party any evidence relevant to the lawsuit. Upon receiving such a request, the producing party must search in their electronically stored information for any document which might be relevant to the case, in order to disclose it to the other party; this task usually goes under the name of e-discovery. In order to do so, it is typically the case that the producing party asks junior lawyers to review (i.e., annotate) for “responsiveness” (i.e., topical relevance to the case) the candidate documents, and then asks senior lawyers to review for “privilege” (i.e., presence of sensitive content, which would allow the party to rightfully withhold the document) the documents that have been deemed responsive.

As a consequence of this two-phase review, each document is classified into one of the three following classes:

• cP (which stands for "Produce"): The document has been deemed responsive and not privileged; as such, it should be produced to the receiving party;

• cL (which stands for "Log"): The document has been deemed responsive and privileged; as such, it should be entered into a "privilege log" (i.e., a repository open to inspection by the judge) and should not be disclosed to the receiving party;

• cW (which stands for "Withhold"): The document has been deemed not responsive, which means that the producing party should not produce it.

In such a scenario, the producing party may incur two types of costs, i.e., annotation costs (since lawyers need to be paid for their reviewing work) and misclassification costs (which arise when an inappropriate class – e.g., "Produce" instead of "Log" – is chosen for a document).


Given the enormous amount of digital documents that should undergo review in many practical cases, a need for automation of the review process has arisen. Several techniques for technology-assisted review (TAR), which combine methods from information retrieval and machine learning, have thus been proposed in the last ten years (Oard and Webber, 2013; Baron et al., 2016; Cormack and Grossman, 2014). The aim of TAR algorithms is that of supporting the review process, minimizing the overall costs deriving from annotation efforts and misclassifications. As a matter of fact, we should notice that training a perfect classifier is not possible and, most likely, the error rate of a well-trained text classifier would be worse than what human reviewers can achieve: in this case, further manual review of the documents is necessary to decrease the overall error rate (but, be mindful, this increases the cost of the annotation process). Notice that we are not interested in improving the accuracy of the classifier itself but rather of the classification decisions: this approach makes sense because e-discovery datasets are usually finite and not expected to grow (that is, we are in a transductive setting – see, e.g., Joachims (2006)).

The focus of MINECORE is on this final step: since some documents will need manual review, which documents in our dataset D should we ask the reviewers to annotate and which should we not?

2.2 MINECORE: A semi-automated predictive coding system

MINECORE (Oard et al., 2019) is a recently proposed decision-theoretic algorithm for minimizing the expected costs of review for responsiveness and privilege in e-discovery. Given a set D of documents that must each be assigned a class in C = {cP, cL, cW} based on whether they belong or not to the class cr of responsive documents and/or to the class cp of privileged documents, the goal of MINECORE is to determine, for each document d in D, whether manually reviewing d for responsiveness and/or privilege is expected to be cost-effective or not.

This determination is based:


Table 1: Contingency table D (a) and cost matrix Λ^m (b) for our cost-sensitive, single-label classification problem.

(a) Contingency table D:

                          actual
                   c_P      c_L      c_W
 pred     c_P      D_PP     D_PL     D_PW
          c_L      D_LP     D_LL     D_LW
          c_W      D_WP     D_WL     D_WW

(b) Cost matrix Λ^m:

                          actual
                   c_P        c_L        c_W
 pred     c_P      0          λ^m_PL     λ^m_PW
          c_L      λ^m_LP     0          λ^m_LW
          c_W      λ^m_WP     λ^m_WL     0

1. on the posterior probabilities Pr(c_r|d) and Pr(c_p|d) (hereafter called the posteriors) returned by the automated classifiers h_r (that classifies documents by responsiveness) and h_p (that classifies documents by privilege);

2. on the unit costs of manually checking a document for responsiveness (λ^a_r) or for privilege (λ^a_p), where the superscript a stands for "annotation";

3. on the costs λ^m_ij incurred when mistakenly assigning class c_i to a document which should be assigned class c_j, where c_i, c_j ∈ C and the superscript m stands for "misclassification".

Such a classification task brings about the contingency table shown in Table 1(a) as well as the cost matrix Λ^m (Table 1(b)): in the former, D_ij is the number of documents d ∈ D whose predicted class h(d) is c_i and whose true class (which we denote by y(d)) is c_j (Oard et al., 2019, p. 4). It should also be noted that the classes c_P, c_L, c_W can actually be defined in terms of relevance and privilege (classes c_r and c_p); indeed we can define c_P ≡ c_r ∩ \bar{c}_p, c_L ≡ c_r ∩ c_p, and c_W ≡ \bar{c}_r, where \bar{c}_j denotes the complement in D of class c_j. On the other hand, we should notice that our problem should be framed as a cost-sensitive single-label classification problem: that is, in e-discovery different classification errors result in different penalties, with higher or lower costs. As a consequence, we need


a misclassification cost matrix Λ^m = {λ^m_ij} (for i, j ∈ {P, L, W}), as shown in Table 1(b), where each entry λ^m_ij (a unit cost) is a non-negative value representing the cost incurred when misclassifying a document as c_i when its true class was actually c_j. Moreover, we denote with λ^a_r and λ^a_p the annotation costs for responsiveness and privilege, respectively (we will see how these costs have been defined in Section 3).

2.2.1 Two extreme solutions: The fully automated and the fully manual solutions

To solve our classification problem, we may hypothesize two "extreme" (and suboptimal) solutions, which we will now examine in detail.

The fully automated solution  In this solution, we train two automated classifiers h_r (which classifies for responsiveness) and h_p (which classifies for privilege) and we apply them to D. These classifiers can be generated independently of each other, since responsiveness and privilege are two standalone qualities of a document. The training documents are usually selected via "active learning" (Settles, 2012; Cormack and Grossman, 2016, 2015): notice that in the original MINECORE paper (and in this work as well), training the two classifiers is assumed to bring about no cost at all. These classifiers generate, for each document d ∈ D, two posterior probabilities Pr(c_r|d) and Pr(c_p|d), which respectively represent the probability of a document being relevant or privileged: for instance, a Pr(c_r|d) value of 1 represents total certainty (of the h_r classifier) that d ∈ c_r. We can legitimately assume that the two probabilities are stochastically independent, and as a consequence it holds that

\Pr(c_P|d) = \Pr(c_r|d) \cdot \Pr(\bar{c}_p|d)
\Pr(c_L|d) = \Pr(c_r|d) \cdot \Pr(c_p|d)                                     (1)
\Pr(c_W|d) = \Pr(\bar{c}_r|d)

We take a risk minimization approach (Oard et al., 2019, p. 6) (see also Anand, 1993; Berardi et al., 2015b,a) and classify each document d into the class

h(d) = \arg\min_{c_i \in C} R(d, c_i)                                        (2)

where R(d, c_i) is the risk associated with assigning d to class c_i, defined as

R(d, c_i) = \sum_{j \in \{P, L, W\}} \lambda^m_{ij} \Pr(c_j|d)               (3)

We can express the global risk brought about by this classification as

R(D) = \sum_{d \in D} R(d, h(d))                                             (4)

This approach allows us to assign to each document d the class that brings about the minimum expected misclassification cost: we avoid courses of action for which a combination of probability of occurrence and cost is high. The function for measuring our misclassification costs is described in Oard et al. (2019, p. 6) as

K^m(D) = \sum_{i, j \in \{P, L, W\}} \lambda^m_{ij} D_{ij}                   (5)

where the superscript m stands for "misclassification".
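To make the decision rule concrete, the following Python fragment is a minimal sketch of how Equations 1-3 can be applied to an entire collection; the array names post_r, post_p and cost_structure_1 are hypothetical placeholders for the classifier posteriors and for the misclassification matrix Λ^m, and the snippet simply vectorizes the formulas above.

import numpy as np

# A minimal sketch of the Phase-1 decision rule of Equations 1-3, assuming hypothetical
# arrays `post_r` and `post_p` with the classifier posteriors Pr(c_r|d) and Pr(c_p|d)
# for every d in D, and `cost_m`, the 3x3 misclassification matrix Lambda^m with rows
# (predicted class) and columns (actual class) ordered as (c_P, c_L, c_W).

def phase1_decisions(post_r, post_p, cost_m):
    # Equation 1: three-class posteriors under the independence assumption
    pr_P = post_r * (1.0 - post_p)        # responsive and not privileged
    pr_L = post_r * post_p                # responsive and privileged
    pr_W = 1.0 - post_r                   # not responsive
    posteriors = np.stack([pr_P, pr_L, pr_W], axis=1)   # shape (|D|, 3)

    # Equation 3: R(d, c_i) = sum_j lambda^m_ij * Pr(c_j|d)
    risks = posteriors @ cost_m.T                        # shape (|D|, 3)

    # Equation 2: assign each document the class of minimum expected misclassification cost
    return risks.argmin(axis=1), risks                   # 0 = c_P, 1 = c_L, 2 = c_W

# Lambda^m of Cost Structure 1 in Table 2 (annotation costs are not part of this matrix):
cost_structure_1 = np.array([[  0.0, 600.0,  5.0],
                             [150.0,   0.0,  3.0],
                             [ 15.0,  15.0,  0.0]])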

The fully manual solution  In this solution, a reviewer (usually a junior lawyer) annotates all documents in D for responsiveness: the subset of documents which have been deemed responsive are then forwarded to a senior lawyer, who annotates them for privilege, while all the others are withheld (the c_W class we explained before). The original MINECORE paper makes the simplifying assumption that reviewers do not make annotation errors (an assumption we will relax in Section 5). The total cost of the annotation process can then be defined as follows: let the pair Λ^a = (λ^a_r, λ^a_p) denote the costs of annotating a single document for responsiveness (λ^a_r) and privilege (λ^a_p). The annotation costs can then be measured with the following function:

K^a(D) = \lambda^a_r \tau_r + \lambda^a_p \tau_p                             (6)

where τ_r and τ_p are the numbers of documents annotated for responsiveness and privilege, respectively (notice that in this case τ_r = |D|).

2.2.2 A hybrid solution: The MINECORE framework


The two extreme solutions described above have complementary drawbacks:

• The fully automated solution has the advantage of zero annotation costs, but may very well be subject to substantial misclassification costs (we will present our cost structure in Table 2 of Section 3);

• The fully manual solution has the advantage of zero misclassification costs (if we assume our reviewers to be infallible), but is very expensive (especially if our dataset is very large) and is not always feasible (it could be impossible to manually review hundreds of thousands of documents).

The original MINECORE was then designed as a hybrid three-phase model (which, as we will see, can actually be extended to a number N of phases):

1. firstly, for each document d ∈ D we compute the probabilities Pr(c_r|d) and Pr(c_p|d) via our h_r and h_p classifiers. We then assign d to a class h(d) ∈ {c_P, c_L, c_W} via Equation 2;

2. then, a junior reviewer annotates a subset D_r of D for responsiveness, where for every d ∈ D_r:

   (a) we set Pr(c_r|d) to 1 or 0 according to whether the reviewer has deemed the document to be responsive or not;

   (b) h(d) is recomputed, which may cause the document to fall into a different class than before;

3. finally, a senior reviewer annotates a subset D_p of D for privilege and, as for responsiveness, we update our Pr(c_p|d) probabilities and recompute h(d) ∈ {c_P, c_L, c_W}.

We should remark that there is no relationship between D_r and D_p (nor between c_r and c_p): a document may belong to just one of the two subsets, to both, or to neither. Notice that our goal is now to strike an optimal balance: we need to decide which documents should be annotated by the reviewers in Phase 2 and which in Phase 3, but also which documents are safe to leave unchecked.


First of all, we now have an overall cost for the hybrid solution, which can be quantified as

K^o(D) = K^m(D) + K^a(D)                                                     (7)

where the superscript o stands for "overall". Notice that K^o(D) is linear, since both K^m(D) and K^a(D) are linear; we can thus focus on the cost of a single document:

K^o(d) = K^m(d) + K^a(d)                                                     (8)

First phase: classification.  As already explained, in this phase we automatically classify all documents d ∈ D and generate the posterior probabilities Pr_1(c_r|d) and Pr_1(c_p|d) for each document. Using these probabilities, we assign a class h_1(d) ∈ {c_P, c_L, c_W} to each d ∈ D using Equation 2.

Second phase: annotating for responsiveness.  During the second phase, we rank the documents in D and the junior reviewer annotates the top-ranked τ_r documents: notice that, since the reviewers are considered infallible, annotating a document d has the effect of completely eliminating uncertainty; we therefore set Pr_2(c_r|d) = 1 if d is deemed responsive, and Pr_2(c_r|d) = 0 otherwise (notice that Pr_2(c_p|d) = Pr_1(c_p|d)). The probabilities of the documents from rank position (τ_r + 1) onwards are left unchanged.

We now need both an optimal ranking of our documents and an optimal threshold τ_r. First of all, we indicate with K^m_φ(d) the misclassification cost brought about by attributing class h_φ(d) to d, and define the difference in overall cost that annotating d for responsiveness brings about as in Equation 9 (Oard et al., 2019, p. 10):

\Delta^o_r(d) = K^o_2(d) - K^o_1(d)
              = K^m_2(d) + K^a_2(d) - K^m_1(d) - K^a_1(d)                    (9)
              = K^m_2(d) + \lambda^a_r - K^m_1(d)


Of course, the true class y(d) of d is not known, and the K^m terms thus cannot be computed directly; as a consequence we need to compute an expectation of ∆^o_r(d) over y (Oard et al., 2019, p. 11):

E_y[\Delta^o_r(d)] = E_y[K^m_2(d) + \lambda^a_r - K^m_1(d)]
                   = E_y[K^m_2(d)] + \lambda^a_r - E_y[K^m_1(d)]             (10)
                   = R_2(d, h_2(d)) + \lambda^a_r - R_1(d, h_1(d))

As a matter of fact, we do not know y_r(d) either (the binary variable telling us whether the annotator would deem d responsive or not): we thus need to take the expectation of E_y[∆^o_r(d)] over the random variable y_r(d) as well (Oard et al., 2019, p. 10):

E_{y_r y}[\Delta^o_r(d)] = E_{y_r}[R_2(d, h_2(d)) + \lambda^a_r - R_1(d, h_1(d))]
                         = E_{y_r}[R_2(d, h_2(d))] + \lambda^a_r - R_1(d, h_1(d))    (11)

To compute E_{y_r}[R_2(d, h_2(d))] we assign probabilities to the events c_r (i.e., "the reviewer annotates d as responsive") and \bar{c}_r ("the reviewer annotates d as non-responsive"): in Oard et al. (2019) the posterior probabilities Pr_1(c_r|d) and Pr_1(\bar{c}_r|d) coming from the classifiers (where the subscript 1 indicates that they come from Phase 1) are taken at face value and used to obtain the expectation as

E_{y_r}[R_2(d, h_2(d))] = R_2(d, h_2(d)|c_r) \cdot \Pr_1(c_r|d) + R_2(d, h_2(d)|\bar{c}_r) \cdot \Pr_1(\bar{c}_r|d)    (12)

where R_2(d, h_2(d)|c_r) indicates the misclassification risk resulting from setting Pr_2(c_r|d) = 1, and vice versa for R_2(d, h_2(d)|\bar{c}_r). Finally, we can rank the documents in D by increasing E_{y_r y}[∆^o_r(d)] value, so that those with the lowest (i.e., most negative) scores are ranked first. Annotation then proceeds down this ranking as long as E_{y_r y}[∆^o_r(d)] < 0, and stops as soon as this quantity becomes non-negative (Oard et al., 2019, p. 12).
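As a rough illustration of how this ranking criterion can be computed in practice, the following sketch (which reuses the same hypothetical post_r, post_p and cost_m arrays as the earlier snippet) vectorizes Equations 11 and 12; it is only meant to convey the mechanics, not to reproduce the exact implementation of Oard et al. (2019).

import numpy as np

# A minimal sketch of the Phase-2 ranking criterion (Equations 9-12), under the same
# hypothetical inputs as the previous snippet: `post_r`, `post_p` are the classifier
# posteriors and `cost_m` is the misclassification matrix Lambda^m.

def min_risk(post_r, post_p, cost_m):
    """Risk R(d, h(d)) of the minimum-cost class, for each document."""
    posteriors = np.stack([post_r * (1 - post_p), post_r * post_p, 1 - post_r], axis=1)
    return (posteriors @ cost_m.T).min(axis=1)

def phase2_expected_delta(post_r, post_p, cost_m, lambda_a_r):
    r1 = min_risk(post_r, post_p, cost_m)                            # R_1(d, h_1(d))
    r2_if_resp = min_risk(np.ones_like(post_r), post_p, cost_m)      # Pr_2(c_r|d) = 1
    r2_if_nonresp = min_risk(np.zeros_like(post_r), post_p, cost_m)  # Pr_2(c_r|d) = 0
    # Equation 12: expectation over the reviewer's (unknown) judgment y_r(d)
    exp_r2 = post_r * r2_if_resp + (1 - post_r) * r2_if_nonresp
    # Equation 11: expected change in overall cost brought about by annotating d
    return exp_r2 + lambda_a_r - r1

# Documents are ranked by increasing expected delta; annotation proceeds while it is < 0:
#   delta = phase2_expected_delta(post_r, post_p, cost_m, lambda_a_r=1.0)
#   ranking = np.argsort(delta)
#   to_review = ranking[delta[ranking] < 0]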

Third phase: annotating for privilege.  In this phase, the senior reviewers annotate the top-ranked τ_p documents for privilege. The algorithm shows basically no differences with respect to Phase 2, except that the computation now relies on the quantities coming from Phases 1 and 2:

\Delta^o_p(d) = K^o_3(d) - K^o_2(d)                                          (13)
              = K^m_3(d) + K^a_3(d) - K^m_2(d) - K^a_2(d)                    (14)
              = K^m_3(d) + \lambda^a_p - K^m_2(d)                            (15)

Note that, if τ_r documents are reviewed for responsiveness and τ_p documents are reviewed for privilege, the overall cost of the entire process may be described as

K^o(D) = K^m(D) + K^a(D)
       = K^m(D) + \lambda^a_r \tau_r + \lambda^a_p \tau_p                    (16)

We show the pseudo-code of the MINECORE algorithm, as presented by its authors, in Algorithm 1. A more thorough explanation can be found in Oard et al. (2019, pp. 3-17). It is important to notice that, despite being originally devised as a three-phase algorithm, MINECORE could easily be extended to a number N of phases, the first of which would always be an automatic phase, whereas the remaining (N − 1) phases would be carried out no differently from Phases 2 and 3: of course, this means that we would deal with a number (N − 1) of possible classes, where each class is stochastically independent of all the others. Finally, we should notice here that this work focuses on two important aspects of the MINECORE framework:

1. MINECORE strongly depends on the quality of the posterior probabilities generated by our classifiers: we would like to improve such probabilities, especially by leveraging the transductive nature of e-discovery datasets (see Section 4), which has not yet been taken into consideration;

2. the algorithm is based on two simplifying assumptions, namely that we should take the classifiers' output probabilities at face value (that is, we completely trust our classifiers) and that the reviewers are infallible. We will address these assumptions by modifying the MINECORE algorithm, giving a clear mathematical definition of how and when we should actually trust the classifiers and our reviewers (see Section 5).


Algorithm 1: The MINECORE algorithm (Oard et al., 2019).

Input:  A training set Tr_r of documents labeled for responsiveness;
        a training set Tr_p of documents labeled for privilege;
        documents D to be analysed for possible production to the requesting party;
        cost structure Λ = (Λ^m, Λ^a).
Output: A partition of D into the following three sets:
        – set D_P of documents to be produced to the receiving party;
        – set D_L of documents to be put on the privilege log;
        – set D_W of documents to be withheld;
        annotation cost K^a(D) incurred in the process.

/* Phase 1 */
Train classifiers h_r and h_p from Tr_r and Tr_p, respectively;
for d ∈ D do
    compute Pr_1(c_r|d) by means of h_r and Pr_1(c_p|d) by means of h_p;
    compute h_1(d) via Equation 2;
end

/* Phase 2 */
for d ∈ D do
    Pr_2(c_r|d) ← Pr_1(c_r|d); Pr_2(c_p|d) ← Pr_1(c_p|d);
    compute E_{y_r y}[Δ^o_r(d)] using Equation 11;
end
Generate a ranking R^r_D of D in increasing order of E_{y_r y}[Δ^o_r(d)];
/* R^r_D(k) denotes the document at the k-th rank position in R^r_D */
k ← 1; τ_r ← 0;
while E_{y_r y}[Δ^o_r(R^r_D(k))] < 0 do
    have the reviewer annotate document R^r_D(k) for responsiveness;
    if R^r_D(k) has been judged responsive by the reviewer then
        Pr_2(c_r|R^r_D(k)) ← 1
    else
        Pr_2(c_r|R^r_D(k)) ← 0
    end
    τ_r ← τ_r + 1; k ← k + 1;
end
for d ∈ D do
    compute h_2(d) using Equation 2;
end

/* Phase 3 */
for d ∈ D do
    Pr_3(c_r|d) ← Pr_2(c_r|d); Pr_3(c_p|d) ← Pr_2(c_p|d);
    compute E_{y_p y}[Δ^o_p(d)] using Equation 11;
end
Generate a ranking R^p_D of D in increasing order of E_{y_p y}[Δ^o_p(d)];
/* R^p_D(k) denotes the document at the k-th rank position in R^p_D */
k ← 1; τ_p ← 0;
while E_{y_p y}[Δ^o_p(R^p_D(k))] < 0 do
    have the reviewer annotate document R^p_D(k) for privilege;
    if R^p_D(k) has been judged privileged by the reviewer then
        Pr_3(c_p|R^p_D(k)) ← 1
    else
        Pr_3(c_p|R^p_D(k)) ← 0
    end
    τ_p ← τ_p + 1; k ← k + 1;
end
for d ∈ D do
    compute h_3(d) using Equation 2;
end
D_P ← {d | h_3(d) = c_P}; D_L ← {d | h_3(d) = c_L}; D_W ← {d | h_3(d) = c_W};
Compute K^a(D) using Equation 6.
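For completeness, the following sketch shows how Phases 2 and 3 of Algorithm 1 can be simulated on top of the hypothetical helpers introduced earlier (phase1_decisions, min_risk, phase2_expected_delta); y_r and y_p stand for the judgments of the (assumed infallible) reviewers, which in our experiments are simulated by the ground truth, and the unit annotation costs in the usage example are those of Cost Structure 1.

import numpy as np

# A compact sketch of Phases 2 and 3 of Algorithm 1, reusing the hypothetical helpers
# `min_risk`, `phase2_expected_delta` and `phase1_decisions` sketched earlier; `y_r` and
# `y_p` are boolean arrays with the reviewers' judgments, standing in for the junior and
# senior lawyers, who are assumed infallible.

def phase3_expected_delta(post_r, post_p, cost_m, lambda_a_p):
    r2 = min_risk(post_r, post_p, cost_m)                            # R_2(d, h_2(d))
    r3_if_priv = min_risk(post_r, np.ones_like(post_p), cost_m)      # Pr_3(c_p|d) = 1
    r3_if_nonpriv = min_risk(post_r, np.zeros_like(post_p), cost_m)  # Pr_3(c_p|d) = 0
    return post_p * r3_if_priv + (1 - post_p) * r3_if_nonpriv + lambda_a_p - r2

def review(post, delta, judgments):
    """Annotate documents in increasing order of delta, while the expected saving is positive."""
    post, reviewed = post.copy(), 0
    for d in np.argsort(delta):
        if delta[d] >= 0:                          # stopping condition of Algorithm 1
            break
        post[d] = 1.0 if judgments[d] else 0.0     # infallible reviewer: certainty replaces the posterior
        reviewed += 1
    return post, reviewed

# Phase 2, then Phase 3, then the final assignment h_3(d) via Equation 2:
#   post_r, tau_r = review(post_r, phase2_expected_delta(post_r, post_p, cost_m, 1.0), y_r)
#   post_p, tau_p = review(post_p, phase3_expected_delta(post_r, post_p, cost_m, 5.0), y_p)
#   decisions, _ = phase1_decisions(post_r, post_p, cost_m)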


2.3 Applying MINECORE beyond e-discovery: The case of systematic reviews

Despite being born as an e-discovery framework, MINECORE can actually easily be adapted to work in different contexts: supporting the production of systematic reviews is just an example of where it could be employed, but usage in any process which requires human reviewers to select and annotate a set of documents is also possible.

Systematic review tasks are very widespread in the medical field (Khan et al., 2011), but Burgers et al. (2019) have very recently remarked how systematic literature reviews can be fundamental for any interdisciplinary research (see also Shemilt et al., 2016). Systematic reviews are "a key tool of Evidence-based Medicine, which seeks to inform health decisions using all available relevant evidence. Systematic reviews identify, describe, critique, and synthesize empirical studies relevant to a well-defined clinical question" (Lease et al., 2016, p. 1). In order to do this, it is necessary to formulate "a precise clinical question to be addressed; [. . . ] designing and executing a high-recall Boolean query to retrieve potentially relevant citations; [. . . ] screening query results for eligibility for inclusion in the review [. . . ]" (Lease et al., 2016, p. 1).

As can be easily guessed, such a task (be it in medicine or in any other discipline) has to deal with an overwhelming number of documents, which need to be selected and reviewed: TAR algorithms can thus be extremely helpful in reducing human workloads2. As a matter of fact, Lease et al. (2016) have noticed how e-discovery and systematic reviews are two extremely similar tasks: while practitioners of the former, however, are already leveraging automated systems, the community around the latter is still convinced that fully manual reviewing is the best way to achieve trustworthy results; both tasks, moreover, require vital manual reviewing work, i.e., for detecting privilege in e-discovery and for information extraction in systematic reviews. A future goal would be that of showing how TAR systems can substantially improve the reviewing work and its results, in e-discovery as much as in systematic reviews.

2 Miwa et al. (2014) have proposed the employment of active learning to reduce the screening workload in systematic reviews.


From the point of view of research, it would be extremely interesting to combine methods and ideas of these two different areas: Lease et al. (2016) have indeed set up a research challenge to "tackle shared problems and explore common opportunities", cross-pollinating the two research topics.


3 Our experimental setting

3.1 The dataset

Following the original MINECORE paper (Oard et al., 2019, pp. 17-27), we employed the RCV1-v2 dataset in all of our tests and experiments, using MINECORE's results as a baseline for our work. Indeed, since MINECORE was shown to achieve better results than the other methods therein proposed (see Oard et al., 2019, Section 4.4), we did not run our experiments against other baselines.

The RCV1-v2 dataset consists of 804,414 news stories produced by Reuters; moreover, the dataset is multilabel, meaning that a document may belong to multiple classes: this actually makes it suitable for the MINECORE scenario, since we need documents to be potentially both relevant and privileged. Furthermore, we wanted our experiments to be as close as possible to those conducted in the MINECORE paper. This is why we adopted the same training/test split used in the original experimental setting: the training set consists of the first 23,149 documents, whereas the following chunk of 199,328 documents was used as our test set. Moreover, we selected the same 120 pairs of classes used for the original experiments, which were selected based on their prevalence in the entire RCV1-v2 collection:

• the classes which had their prevalence in the range [0.03, 0.07] were selected to play the cr role;

• the classes which instead had a prevalence in the range [0.01, 0.20] in the responsive documents were selected to play the cp role.

This selection was made for several reasons: firstly, no e-discovery dataset is actually available, due to privacy concerns; such a dataset would indeed consist of personal emails, chats and sensitive documents which companies cannot or do not want to disclose to the public. The prevalence ranges by which the selection was made therefore try to resemble the typical prevalence of relevant documents and, among those, of privileged ones. Even though this is a reasonable approach, the RCV1-v2 dataset brings about an issue, which we will address more thoroughly in Section 4: the distribution drift (or shift) between the class prevalences in the training set and in the test set is basically non-existent; we computed, for each class in our pairs, its prevalence in the training and test sets, and the average difference is around 0.0019. Such a low distribution shift does not resemble, in practice, the typical e-discovery scenario: as a matter of fact, the dataset is usually built via active learning, which tends to insert as many positives as possible into the training collection, leading to a reasonably high shift between training and test prevalences, where the number of positive examples is way higher in the former than in the latter: this is called prior probability shift (see Storkey, 2009, Section 1.5) and should be taken into account, as it may lead to a biased classifier, something which was not considered in the MINECORE experiments. In Section 4 we will show how we created a new dataset starting from RCV1-v2 via active learning, and how much this method impacts the prior probability shift between the training and test collections.

3.2 The cost structures

As we explained in Section 2, different misclassification errors in e-discovery bring about different costs: for instance, if we mistakenly classify a document into the c_P class when it actually belongs to the c_L class, this means that we are disclosing sensitive information to the other party, resulting in a truly severe error. The MINECORE algorithm therefore needs a cost structure which can be used to estimate the effective loss we would incur when a misclassification error happens: the original authors (Oard et al., 2019, Section 4.3) asked three senior members of the e-discovery community to devise possible cost structures resembling real cases they have had experience with. The three resulting cost structures can be seen in Table 2.

For all of our experiments, we mainly referred to the first cost structure, following the assumption that improving MINECORE's results on one cost structure would automatically result in better results on the others as well. Finally, it should be noticed that for every experiment we ran, we retrained the two classifiers and reran the MINECORE algorithm using their a posteriori probabilities (or posteriors): this was done to ensure and underline the reproducibility of our experiments and their outcomes. The Python code used for the experiments can be found in the online git repository at


Table 2: Cost structures used in the MINECORE experiments (Table 5 in Oard et al., 2019). Each individual cost is expressed in US$.

                  λ^a_r   λ^a_p   λ^m_PL    λ^m_PW   λ^m_LP   λ^m_LW   λ^m_WP   λ^m_WL
CostStructure1    1.00    5.00    600.00    5.00     150.00   3.00     15.00    15.00
CostStructure2    1.00    5.00    100.00    0.03     10.00    2.00     8.00     8.00
CostStructure3    1.00    5.00    1000.00   0.10     1.00     1.00     1.00     1.00


4 Improving the posterior probabilities via EMQ

4.1 Prior probabilities, posterior probabilities, and their mutual dependence

As explained in Section 2, MINECORE's cost minimization algorithm strongly depends on the quality of the posterior probabilities returned by the classifiers: the better these probabilities are, the better we will be able to estimate risks in the three phases of the framework. Furthermore, one aspect of e-discovery scenarios which, to the best of our knowledge, has not yet been addressed in the literature is their transductive nature: this means that the set of documents which need to be labeled by our classifiers is finite and known at training time. This is an important detail which, as we will see, we can exploit in order to improve our posterior probabilities. Saerens et al. (2002) presented "a new procedure for a priori and a posteriori probabilities adjustment, based on the EM algorithm (Dempster et al., 1977; McLachlan & Krishnan, 1997)", which Gao and Sebastiani (2015) have renamed EMQ.

The EMQ algorithm iteratively updates the posterior probabilities by using the prior probabilities, where the latter are initialized to the frequencies of the classes in the training set (Saerens et al., 2002, p. 27). If we denote by \hat{\Pr}^{(s)}(c_j) and \hat{\Pr}^{(s)}(c_j|d) the estimates of the priors and the posteriors at step s, and by \hat{\Pr}_t(c_j) and \hat{\Pr}_t(c_j|d) the training prevalences and the posteriors output by the classifier, the iterative procedure goes as follows:

\hat{\Pr}^{(0)}(c_j) \leftarrow \hat{\Pr}_t(c_j)

\hat{\Pr}^{(s)}(c_j|d) \leftarrow \frac{\dfrac{\hat{\Pr}^{(s)}(c_j)}{\hat{\Pr}_t(c_j)} \, \hat{\Pr}_t(c_j|d)}{\sum_{k} \dfrac{\hat{\Pr}^{(s)}(c_k)}{\hat{\Pr}_t(c_k)} \, \hat{\Pr}_t(c_k|d)}                    (17)

\hat{\Pr}^{(s+1)}(c_j) \leftarrow \frac{1}{|D|} \sum_{d \in D} \hat{\Pr}^{(s)}(c_j|d)

When convergence is reached, we can feed the new posteriors obtained via the EMQ algorithm (whose pseudo-code is reported later in this section) to MINECORE's formulas.


In practice, we obtain the posterior probabilities Pr(c_r|d) and Pr(c_p|d), for each document d ∈ D, from our linear SVM classifiers3, and we then update them via the EMQ procedure. The resulting new a priori and a posteriori probabilities then replace all instances of the previous ones in MINECORE.
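The following is a minimal sketch of the EMQ update of Equation 17 for one binary class at a time (c_r or c_p); the names post and train_prev are hypothetical, and the loop simply alternates the E-step and M-step described above until the prior estimates stabilize.

import numpy as np

# A minimal sketch of EMQ (Equation 17 / Saerens et al., 2002) for a single binary class c,
# assuming `post` holds the classifier posteriors Pr(c|d) for all test documents and
# `train_prev` is the prevalence of c in the training set; both names are hypothetical.

def emq(post, train_prev, max_iter=100, eps=1e-6):
    post = np.column_stack([1.0 - post, post])          # columns: (not c, c)
    prev_t = np.array([1.0 - train_prev, train_prev])   # training priors, kept fixed
    prev_s = prev_t.copy()                              # current prior estimates
    for _ in range(max_iter):
        # E-step: rescale the original posteriors by the ratio of current to training priors
        rescaled = post * (prev_s / prev_t)
        rescaled /= rescaled.sum(axis=1, keepdims=True)
        # M-step: re-estimate the priors as the mean of the updated posteriors
        new_prev = rescaled.mean(axis=0)
        if np.abs(new_prev - prev_s).max() < eps:       # convergence check
            prev_s = new_prev
            break
        prev_s = new_prev
    return rescaled[:, 1], prev_s[1]                    # updated Pr(c|d) and estimated prevalence

# Example: post_r_emq, prev_r_emq = emq(post_r, train_prev=0.05)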

Finally, as pointed out in Section 3, we are always going to use the results obtained in Oard et al. (2019) as a baseline: to measure whether there is an improvement in MINECORE's estimated costs, the evaluation function we use is K^o(D) from Equation 16. We measure the ratio

\Delta K^o(D) = \frac{K^o_{MLE}(D)}{K^o_{EMQ}(D)}                            (18)

between K^o(D) as deriving from MINECORE_MLE and K^o(D) as deriving from MINECORE_EMQ; if this ratio is > 1, this means that EMQ delivers an improvement, and that our experiment has been successful.

4.2 Experiments

Our first attempt, however, does not achieve positive results, actually worsening the overall costs achieved in Oard et al. (2019): we show the outcome of our experiment in Figure 1, where we plot ∆K^o(D) on the Y axis and the RCV1-v2 pairs on the X axis (sorted by their ∆K^o(D) value in order to improve the readability of the plot). A red line has been drawn in order to help the reader recognize when the ratio is 1 (that is, when neither an improvement nor a worsening has occurred). As a matter of fact, Figure 1 clearly shows that 60% of the pairs chosen from the RCV1 dataset have increased in costs. These disappointing results call for an analysis of the possible underlying reasons for such a regression. As already stated before, an improvement in the quality of the posterior probabilities should result in a better outcome in terms of expected costs.

Hence, it seems reasonable, as a first approach, to compare the accuracy obtained by the two sets of posteriors Pr_MLE(c|d) and Pr_EMQ(c|d): since we are dealing with probabilities, we compute a "soft accuracy" using probability masses. The contingency table is built as shown in Table 3. The results, presented in Table 4, report the average soft-accuracy

3 Recall that two classifiers h_r (for responsiveness) and h_p (for privilege) are trained and later used in MINECORE.


Pseudo-code of the EMQ algorithm (Saerens et al., 2002):

Input:  Posterior probabilities p(c|x), for all c ∈ C and for all x ∈ U;
Output: Updated posterior probabilities p(c|x), for all c ∈ C and for all x ∈ U;

/* Initialization */
s ← 0;
for c ∈ C do
    for x ∈ U do
        p^(s)(c|x) ← p(c|x);
    end
end
/* Main iteration cycle */
while stopping condition = false do
    s ← s + 1;
    for c ∈ C do
        /* Compute/update estimates of prior probabilities */
        P̂r^(s)_U(c) ← (1 / |U|) Σ_{x∈U} p^(s−1)(c|x);
        for x ∈ U do
            /* Update posterior probabilities */
            p^(s)(c|x) ← [ (P̂r^(s)_U(c) / P̂r^(s−1)_U(c)) · p^(s−1)(c|x) ] / [ Σ_{c′∈C} (P̂r^(s)_U(c′) / P̂r^(s−1)_U(c′)) · p^(s−1)(c′|x) ];
        end
    end
end
/* Generate output */
for c ∈ C do
    for x ∈ U do
        p(c|x) ← p^(s)(c|x);
    end
end


Figure 1: Cost ratio between two runs of MINECORE: one uses probabilities coming from the classifiers, whereas the other uses EMQ posterior probabilities as input. A dot in the positive area of the graph means EMQ improved the overall costs, see Equation 18.

value computed over all the selected RCV1 pairs: the accuracies are basically identical, even though they are slightly better for MLE if we consider more than 3 decimal places (we decided not to show more than 3 in our tables).

It should however be noticed that EMQ achieves higher true positives and lower false negatives, but has worse performance with respect to false positives and true negatives. This actually raises doubts about why the new probabilities performed so much worse than expected, especially knowing that the difference in the accuracies is so small: if we look at the cost structures presented in Table 2, we notice that the highest cost is always assigned to λ^m_PL, that is, the misclassification of a document as to be Produced when it actually was to be Logged (see Oard et al. (2019, pp. 4, 5) and Section 2.2), followed by, in two cases out of three, λ^m_LP.

If the MINECORE algorithm thus gets a false negative in the third phase and misclassifies a document as relevant but not privileged, we incur higher costs. For this reason, it might be interesting to compute the accuracies for c_r and c_p separately, and see whether any difference emerges.


                                  Ground truth
                        Yes                                  No
Classifier   Yes   TPM = Σ_{d∈c} Pr(c|d)      (19)     FPM = Σ_{d∈c̄} Pr(c|d)      (20)
             No    FNM = Σ_{d∈c} (1 − Pr(c|d)) (21)    TNM = Σ_{d∈c̄} (1 − Pr(c|d)) (22)

Table 3: The "soft-accuracy" contingency table.
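The soft contingency cells of Table 3 can be computed with a few lines of code; the sketch below assumes hypothetical arrays post (the posteriors Pr(c|d)) and truth (the ground-truth membership in c).

import numpy as np

# A small sketch of the "soft" contingency cells of Table 3 (Equations 19-22), assuming
# `post` holds the posteriors Pr(c|d) and `truth` is a boolean array marking the documents
# that truly belong to c; both names are hypothetical.

def soft_contingency(post, truth):
    tp = post[truth].sum()            # Equation 19
    fp = post[~truth].sum()           # Equation 20
    fn = (1.0 - post[truth]).sum()    # Equation 21
    tn = (1.0 - post[~truth]).sum()   # Equation 22
    soft_accuracy = (tp + tn) / (tp + fp + fn + tn)
    return tp, fp, fn, tn, soft_accuracy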

The results, shown in Table 5 (where c_r values are on the left and c_p values are on the right), unfortunately do not reveal any real difference between these and the previous ones. Finally, we should also notice that, among the selected pairs of labels (for the pair selection criteria, see Section 3 and Oard et al. (2019, p. 18)), it sometimes happens that the same class is used both for relevance and for privilege, which may arguably cause a single false negative to count double. On the other hand, accuracy is not as important a measure in cost-sensitive contexts as it is in standard classification tasks: as exemplified before, a greater accuracy might still result in higher costs, and cannot thus be sufficient to explain the failure of our experiment. To prove this, we can rebuild the graph shown before (Figure 1), emphasizing the difference in accuracy. To do so, we first keep track, for every label of every pair, of whether EMQ has achieved a better or worse accuracy. As a consequence, when we draw the cost ratio dots on the graph (using Equation 18), we can color each one according to the following legend:

• Green: when EMQ accuracy is better than MLE for both cr and cp;

• Blue: when EMQ accuracy is better for cr only;


              MLE       EMQ
TPM           14794     14815
FNM           3822      3801
FPM           1391      1425
TNM           179319    179285
Cost average  77692     96247
Accuracy      0.973     0.973

Table 4: The "soft-accuracy" averages over all pairs. Cost averages are in US dollars. MLE probabilities seem to be slightly better than EMQ's.

c_r:                                c_p:
            MLE       EMQ                       MLE       EMQ
TPM         8034      8042          TPM         15245     15259
FNM         3524      3516          FNM         3709      3695
FPM         1068      1094          FPM         1401      1427
TNM         186700    186674        TNM         178971    178945
Accuracy    0.976     0.976         Accuracy    0.974     0.974

Table 5: The "soft-accuracy" averages over all labels, for c_r (on the left) and c_p (on the right).


Figure 2: Cost ratio between the two probabilities, where every dot has been colored to emphasize accuracy differences.

• Red: when EMQ accuracy is worse for both classes;

By looking at Figure 2, we notice indeed that the placement of the points seems to be completely unrelated to the accuracy: it is remarkable that both the worst- and best-performing pairs (namely, M12-M131 and GVIO-C21) achieve better accuracies for both the c_r and c_p classes, while nonetheless leading to completely different results in the MINECORE algorithm's output.

One thing we have not considered so far is that the EMQ algorithm should not only improve the posterior probabilities, but actually update the priors as well: to be fair, Saerens et al. (2002) was actually mainly focused on the a priori probabilities, and most of its results indicated an improvement on such probabilities (see Saerens et al., 2002, Section 4). Indeed, it would be extremely interesting to see whether these new a priori probabilities have actually improved with respect to the previous ones. Moreover, the priors coming from our classifier are naively based on the assumption that the frequency of a class c in the training set will remain unchanged in the test set: in other words, we are not taking into consideration the distribution drift that may very well occur in an e-discovery task (Cormack and Grossman, 2014, 2015). Indeed, the EM instance proposed by Saerens et al. (2002) was effectively meant to be applied in a learning context where we expect this drift to be relevant4.

We are first going to measure the similarity between the estimated probabilities and the true ones, as resulting from the test set, with the Absolute Error (AE) shown in the following equation and explained in Sebastiani (2019), in the context of text quantification:

AE(p, \hat{p}) = \frac{1}{|C|} \sum_{c \in C} |\hat{p}(c) - p(c)|            (23)
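Equation 23 amounts to a one-line computation; a small sketch, assuming hypothetical dictionaries true_prev and est_prev mapping each class to its true and estimated prevalence, is the following.

# A tiny sketch of the Absolute Error of Equation 23, assuming `true_prev` and `est_prev`
# are dictionaries mapping each class to its true / estimated prevalence.

def absolute_error(true_prev, est_prev):
    return sum(abs(est_prev[c] - true_prev[c]) for c in true_prev) / len(true_prev)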

MLE prior probabilities obtained a score of 0.191, whereas EMQ achieved a score of 0.082 (notice: the lower, the better). Such a result is very much unexpected: EMQ improved the a priori probabilities by a non-negligible margin; nevertheless, the better priors did not bring better posteriors.

As hinted before, one of the reasons why we are witnessing such an unexpected behaviour might lie in the very low distribution drift in our data: as a matter of fact, the average difference between the prevalences of a class in the training and test sets is no higher than 0.6%. Having considered this, we decided to generate artificial drift in the RCV1 dataset by removing increasingly large quantities of negative examples from the training set: this should allow the EM iterative algorithm to achieve better results, both for the a priori and the a posteriori probabilities, reducing the overall expected costs.

Since the example removal is completely random, we repeated the experiment several times, first removing up to 60% of the negatives and eventually up to 80%. However, the results are once again discrediting our assumptions: the cost differences remain basically unchanged, showing a deterioration of a few percentage points with respect to the previous attempts.

One last experiment, before closing the EMQ chapter, might be worth the effort: as explained in Section 2 and Section 3, e-discovery datasets in real-world scenarios are usually built via active learning. This technique naturally brings about a distribution shift of the prior probabilities as a consequence, and could thus be extremely well suited for the

4 In the conclusions of that paper, the authors note that "We also observed that adjusting the outputs of the classifier when not needed (i.e., when the a priori probabilities of the training set and the real-world data do not differ) can result in a decrease in classification accuracy" (Saerens et al., 2002, p. 35).


Figure 3: Cost ratio between MLE and EMQ after 60% of the negatives have been removed. First run.

Figure 4: Cost ratio between MLE and EMQ after 60% of the negatives have been removed. Second run.


Figure 5: Cost ratio between MLE and EMQ after 60% of the negatives have been removed. Third run.

Figure 6: Cost ratio between MLE and EMQ after 80% of the negatives have been removed. First run.


Figure 7: Cost ratio between MLE and EMQ after 80% of the negatives have been removed. Second run.

Figure 8: Cost ratio between MLE and EMQ after 80% of the negatives have been removed. Third run.


EMQ algorithm. Indeed, this could allow us to work in a context much closer to a possible real one and it could definitely deliver interesting results. To generate such a dataset we decided to use RCV1-v2 as a starting point, using the ground truth to mimic the human evaluation of the documents brought up for assessment by the active learning technique. The procedure we used goes as follows:

1. we selected 1000 random documents from the RCV1-v2 collection;

2. we inserted all the remaining documents in the test set and trained our classifiers. As a result, we get the posterior probabilities for all documents d ∈ Tr_0, where Tr_0 is the training set at iteration 0, consisting of 1000 documents;

3. at each iteration, we use relevance sampling to choose the next set of documents to add to our training set;

4. when we reach the desired number of documents in the training set (23,149, to keep the dataset consistent with the previous experiments), we update the posterior probabilities of the training documents leveraging the ground truth, to mimic what a human annotator would have done in an active learning application.

The pseudo-code is shown below. The resulting dataset consists of 23,149 documents in the training collection plus the remaining 781,265 documents in the test collection. To see whether we actually reached our goal of generating a typical active learning dataset, we can compute the prevalences of all our classes c ∈ C and measure the shift δ_p(c) (where p(c) denotes the prevalence of class c) between the training and the test sets: to do so, we can simply plot, for each class c ∈ C, the absolute difference between the two prevalences, δ_p(c) = |p_u(c) − p_l(c)|, where p_u(c) and p_l(c) are the prevalences of c in the test (unlabeled) and training (labeled) sets, respectively.

Indeed, as can be seen in Figure 10, the distribution shift is consistent with our hypothesis: the differences between the training and test set prevalences are in some cases higher than 0.9, whereas those of the RCV1-v2 dataset (Figure 9) barely reach the 0.025 mark. We can now run the EMQ algorithm in order to see whether such distribution drift can help bring better results. Conversely to what we expected, however, Table 6 shows a


Pseudo-code of the procedure used to build the active-learning dataset:

Input:  Posterior probabilities Pr(c|d)_0, for all c ∈ C and for all d ∈ Tr_0;
Output: Posterior probabilities Pr(c|d)_i, for all c ∈ C and for all d ∈ Tr_i;

/* We will collect the training documents in a set */
train_indexes ← set();
i ← 0;
/* Keep adding documents to the set until we reach the desired number */
while train_indexes.length < final_train_length do
    /* Add a maximum number of batch_size docs via relevance sampling */
    to_add ← min(batch_size, final_train_length − train_indexes.length);
    rank_indexes ← relevance_sampling(Pr(c|d)_i);
    train_indexes ← update_set(train_indexes, rank_indexes, to_add);
    /* Learn the classifier using the updated training collection */
    classifier ← learn_classifier(train_indexes);
    Pr(c|d)_i ← get_proba_from_calibration(classifier);
    i ← i + 1;
end
/* Update probabilities using the ground truth to mimic the human annotator */
update_proba(train_indexes, Pr(c|d)_i, ground_truth);
return Pr(c|d)_i;
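A compact, runnable rendition of the loop above is sketched below; the feature matrix X, the label vector y and the size parameters are hypothetical, and a logistic regression (whose predict_proba provides the posteriors) stands in for the calibrated linear SVMs actually used in our experiments.

import numpy as np
from sklearn.linear_model import LogisticRegression

# A minimal sketch of the relevance-sampling loop above, assuming a hypothetical feature
# matrix `X` and a binary label vector `y` (the RCV1-v2 ground truth, standing in for the
# human annotator); it also assumes that the random seed sample contains both positive
# and negative examples, so that a classifier can be trained at the first iteration.

def build_al_training_set(X, y, seed_size=1000, batch_size=1000, final_size=23149, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = rng.choice(X.shape[0], size=seed_size, replace=False).tolist()
    while len(train_idx) < final_size:
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        post = clf.predict_proba(X)[:, 1]            # Pr(c|d) for the whole collection
        in_train = np.zeros(X.shape[0], dtype=bool)
        in_train[train_idx] = True
        ranked = np.argsort(-post)                   # relevance sampling: most positive first
        candidates = ranked[~in_train[ranked]]
        to_add = min(batch_size, final_size - len(train_idx))
        train_idx.extend(candidates[:to_add].tolist())
    return np.asarray(train_idx)                     # indices of the training documents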


            MLE      EMQ
Accuracy    0.97     0.93
AE          22.68    1.95

Table 6: Accuracy (Table 3) of the posteriors and Absolute Error (Equation 23) of the priors on the active learning dataset.

very disappointing deterioration in the accuracy of the EMQ posterior probabilities. Moreover, the same table shows that the EMQ algorithm actually improved the quality of the a priori probabilities by a very consistent margin: as a matter of fact, this confirms the hypothesis made by Saerens et al. (2002) that the higher the disparity between the training and test set priors, the more EMQ will improve their estimation. On the other hand, these results are consistent with what we have already seen in the previous experimental runs of EMQ: the algorithm seems to improve the prior probabilities, but does not deliver better posteriors. As we might expect, these new probabilities bring no positive results when used as MINECORE input: as can be seen from Figure 11, all our 120 pairs scored a worse result with respect to MINECORE_MLE. The fact that we keep getting such disappointing results actually raises doubts on the effectiveness and capability of the EMQ algorithm to improve the posterior probabilities, especially because we are seeing, on the other hand, a substantial improvement of the a priori probabilities (whose quality should be strictly correlated with that of the posteriors). We can actually run one last sanity-check test: if we had the perfect prior probabilities, would they bring about better posteriors? That is, if we used the true a priori probabilities from the ground truth, would we be able to improve the quality of our posterior probabilities? Let p_u(c) be the true prior probabilities of the test (unlabeled) set, as computed from the ground truth, and let p_l(c) be the prior probabilities on the training (labeled) set; for every class c ∈ C, we can then compute the new posterior probabilities as

\Pr'(c|d) = \Pr(c|d) \, \frac{p_u(c)}{p_l(c)}                                (24)

As a matter of fact, these new posterior probabilities score an accuracy of 0.9738249, which is very slightly worse than that of the MLE posteriors (0.9738419).


Figure 11: Cost ratio between MLE and EMQ with the dataset generated via active learning. The deterioration of the EMQ probabilities led to a consistent worsening over all the pairs.

An interesting research direction would be that of studying why this happens, as it seems to go against the widely held opinion in the literature that good prior probabilities should bring about good posterior probabilities as well (Saerens et al., 2002): such a study is, however, out of the scope of this master thesis work.

In conclusion, since the EMQ algorithm does not seem to bring about substantial benefits in terms of expected costs, we have decided not to run further experiments with its posterior probabilities in our MINECORE_RX revised framework, whose details can be found in the following sections.


5 Smoothing the posterior probabilities returned by the classifiers

In previous sections of this work, we have pointed out how the original MINECORE algorithm was relying on two main simplifying assumptions:

1. The first one is that we should take the classifiers' posterior probabilities at face value, that is, if the classifier believes that a document d has a probability Pr(c_i|d) of 1, we should share the same belief. In this section we will try to relax this assumption and see whether this brings improvements to MINECORE or not (Section 5.1);

2. Secondly, MINECORE does not take into consideration the possibility of failure of the human annotators: when a junior lawyer deems a document d to be relevant, MINECORE's algorithm upgrades the posterior probability into certainty, Pr(c_r|d) = 1. Of course, this cannot be the case in reality, where human errors can and usually do happen. We will later devise a possible solution to model the annotators' fallibility (Section 6.1).

5.1    Smoothing posterior probabilities

Blindly trusting an automated classifier can bring about unpleasant consequences, especially when we would have to share its certainty that a document d belongs to a class ci with probability Pr(ci|d) = 1 or Pr(ci|d) = 0. What we would like to do, instead, is to model a certain amount of distrust in our algorithm, so that if the classifier is certain that X is the case we do not share the same certainty, although we may still give X a high probability of being the right answer.

In other words, we are actually speaking of two different sets of probabilities:

1. Prs(X), which represents the classifier's confidence in X being the case, i.e., these are our system probabilities (the superscript s stands indeed for system);

2. Pru(X), which models our confidence in X being the right answer, where the superscript u stands for user.

But how do we create this new set of probabilities, given the already available classifier posteriors Prs(X)? We could devise a very simple solution, modelling a causal relationship from the Prs to the Pru probabilities: that is, we could restrain the classifier confidence in certain "extreme" cases, with the following equation:

Pru(X) = 0.95       if Prs(X) ≥ 0.95
       = 0.05       if Prs(X) ≤ 0.05
       = Prs(X)     otherwise        (25)
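A minimal sketch of this clipping rule (Equation 25), assuming the posteriors are stored in a NumPy array:

import numpy as np

def clip_posteriors(pr_s, low=0.05, high=0.95):
    """Equation 25: restrain the classifier confidence in the extreme cases."""
    return np.clip(pr_s, low, high)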

We could think, however, of a much better and less naive solution: if the classifiers were to be completely trusted, we should take their posterior probabilities at face value, i.e., Pru(X) = Prs(X), whereas if they were totally untrustworthy we should fall back on the a priori probability of a class, which we could estimate on the training set, i.e., Pru(X) = p(c). The right solution, however, probably lies in the middle ground: we could model the trustworthiness of our classifier as a value 0 < α < 1. As a consequence, we could obtain our Pru(X) probabilities as a linear combination of our trust and distrust in the Prs(X) classifier probabilities, that is:

Pru(X) = α Prs(X) + (1 − α) p(c)        (26)
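A minimal sketch of the smoothing of Equation 26; here prior stands for an estimate of p(c) computed on the training set, and the function name is purely illustrative:

def smooth_posterior(pr_s, prior, alpha):
    """Equation 26: linear combination of the classifier posterior (weighted by our
    trust alpha) and the class prior (weighted by our distrust 1 - alpha)."""
    return alpha * pr_s + (1.0 - alpha) * prior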

In order to find the best possible value for the α parameter we can opt for cross-validation, choosing the value of α that minimizes MINECORE's overall costs. As a result, we believe that this "probability smoothing" operation should deliver lower expected costs: if we consider the elevated costs the algorithm may incur when classifying a document as to be Produced when it actually was to be Logged, then smoothing the probabilities Prs(cr|d) and Prs(cp|d) could avoid extremely risky courses of action, especially when these probabilities are very close to 1 (and, as such, in the original algorithm would have probably been excluded from the τr or τp top documents⁵).

⁵ Recall from Section 2 that τr and τp represent the numbers of documents to be manually reviewed for relevance and for privilege, respectively.

Following Equation 26, and recalling that cr and cp are two stochastically independent variables (see Oard et al., 2019), we can rewrite the probabilities of our three possible target classes in the following manner:

Pru(cP|d) = Pru(cr|d) · (1 − Pru(cp|d))
Pru(cL|d) = Pru(cr|d) · Pru(cp|d)
Pru(cW|d) = 1 − Pru(cr|d)        (27)
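Given the smoothed relevance and privilege posteriors, Equation 27 translates into a few lines of code; this is only a sketch, with pr_r and pr_p standing for Pru(cr|d) and Pru(cp|d):

def combine_posteriors(pr_r, pr_p):
    """Equation 27: from the (relevance, privilege) posteriors to the posteriors
    of the three target classes Produce / Log / Withhold."""
    pr_P = pr_r * (1.0 - pr_p)   # relevant and not privileged
    pr_L = pr_r * pr_p           # relevant and privileged
    pr_W = 1.0 - pr_r            # not relevant
    return pr_P, pr_L, pr_W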

Of course, this has effects on our risk-minimization function: as a matter of fact, the R(d, ci) function (Equation 3) now has to take into consideration the new Pru probabilities, translating into

Ru(d, ci) = Σ_{j∈{P,L,W}} λmij Pru(cj|d)        (28)

where the u superscript stands once again for "user". As a consequence, the global risk brought about by this classification is:

Ru(D) = Σ_{d∈D} Ru(d, h(d))        (29)
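For concreteness, Equations 28 and 29 can be computed as follows, assuming a 3×3 matrix lambda_m of unit misclassification costs indexed over (predicted class, true class) in the order (P, L, W), and assuming that h(d) is the minimum-risk decision; the cost values below are purely illustrative:

import numpy as np

# lambda_m[i, j]: illustrative cost of deciding class i when j is the correct class
lambda_m = np.array([
    [0.0, 50.0,  5.0],   # decide Produce
    [10.0, 0.0,  5.0],   # decide Log
    [20.0, 20.0, 0.0],   # decide Withhold
])

def risk(posteriors_plw, lambda_m):
    """Equation 28: expected cost of each of the three possible decisions for one document.
    posteriors_plw is the vector (Pru(cP|d), Pru(cL|d), Pru(cW|d))."""
    return lambda_m @ posteriors_plw

def global_risk(all_posteriors, lambda_m):
    """Equation 29: sum over all documents of the risk of the chosen (minimum-risk) decision."""
    per_decision_risks = all_posteriors @ lambda_m.T   # shape (n_docs, 3)
    return per_decision_risks.min(axis=1).sum()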

Now, by expanding on Equation 3 we might follow two possible routes: on the one hand, we will introduce a mathematically elegant solution which, unfortunately, has the downside of not allowing us to optimize the α parameter in the best possible manner; on the other hand, we will show a much more raw formulation which, however, allows for a more specific cross-validation optimization. Recall from Equation 26 that we can express our Pru probabilities as:

Pru(cj|d) = α Prs(cj|d) + (1 − α) p̂(cj)        (30)

where p̂(cj) is an estimate of the class prevalence of the test set, computed on the training set (we could use either a maximum likelihood estimation or the EMQ approach⁶ shown in Section 4).

⁶ Even though EMQ proved to be less effective than we initially expected, it still holds that its prior probability estimates are better than the MLE ones.

We could then rewrite our risk minimization function as:

Ru(d, ci) = Σ_{j∈{P,L,W}} λmij Pru(cj|d)
          = Σ_{j∈{P,L,W}} λmij (α Prs(cj|d) + (1 − α) p̂(cj))
          = α Σ_{j∈{P,L,W}} λmij Prs(cj|d) + (1 − α) Σ_{j∈{P,L,W}} λmij p̂(cj)
          = α · Rs(d, ci) + βi        (31)

where

Rs(d, ci) ≡ Σ_{j∈{P,L,W}} λmij Prs(cj|d)        (32)

is the risk that the system associates with assigning d to class ci, while

βi ≡ (1 − α) Σ_{j∈{P,L,W}} λmij p̂(cj)        (33)

is a factor (here called the risk prior of class ci) that depends on the unit costs of wrongly predicting ci when another class should have been predicted.
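The decomposition in Equations 31-33 can be verified numerically with a few lines of code (all values below are illustrative; priors_hat stands for p̂(cj)):

import numpy as np

lambda_m = np.array([[0.0, 50.0, 5.0],
                     [10.0, 0.0, 5.0],
                     [20.0, 20.0, 0.0]])     # illustrative unit costs
pr_s = np.array([0.6, 0.3, 0.1])             # system posteriors Prs(cj|d), j in (P, L, W)
priors_hat = np.array([0.2, 0.1, 0.7])       # estimated class prevalences p_hat(cj)
alpha = 0.8

pr_u = alpha * pr_s + (1 - alpha) * priors_hat     # Equation 30
risk_u = lambda_m @ pr_u                           # Ru(d, ci) for every ci (Equation 31)
risk_s = lambda_m @ pr_s                           # Rs(d, ci) (Equation 32)
beta = (1 - alpha) * (lambda_m @ priors_hat)       # beta_i (Equation 33)
assert np.allclose(risk_u, alpha * risk_s + beta)  # the decomposition of Equation 31 holds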

As one may notice, this solution presents a rather elegant mathematical formulation. Unfortunately, it has one major pitfall: we cannot estimate the values of α for cr and cp separately. Indeed, this can be shown by looking at Equation 30 and Equation 31: remember that the classifiers do not provide us with the probabilities of a document d being in P, L, W, but instead give us the probabilities Prs(cr|d) and Prs(cp|d), which can then be combined in order to obtain Pr(cj|d) with j ∈ {P, L, W} (see Equation 1). This means that we can use one and only one value of α in Equation 30: this affects the risk minimization equation (Equation 31) as well, and as a result we are not able to perform a cross-validation optimization for both Pru(cr|d) and Pru(cp|d). As a consequence, we are going to introduce an alternative formulation which we can use to perform such an optimization. First, we are going to define two values of our α parameter: αr for relevance and αp for privilege. Our posterior probabilities can then be rewritten as:

Pru(cr|d) = Prs(cr|d) · αr + (1 − αr) · p(cr)
Pru(cp|d) = Prs(cp|d) · αp + (1 − αp) · p(cp)        (34)

These probabilities will then be combined to obtain the Pru(ci|d) with i ∈ {P, L, W}, as seen in Equation 27.

As a result, we can use these probabilities directly in Ru(d, ci), which we will not expand like we did in the first formulation:

Ru(d, ci) = Σ_{j∈{P,L,W}} λmij Pru(cj|d)        (35)

As one may notice, we have thus dropped the βi of Equation 33, which would be impossible to compute as it needs a unique value of α for both cr and cp. In the next section, we will see how we can implement these new {αr, αp} values in MINECORE and we will show the results of our experiments.

5.2    Experiments

In this section we present the results of our experiments, first explaining how we cross-validated our {αr, αp} values and then showing the cost ratios with respect to the baseline (in a similar fashion to what we did for EMQ, Section 4). Recall once again that the dataset we have chosen for the experiments is RCV1-v2 (see Section 3 for more details), the same exploited by Oard et al. (2019): we are going to work on 120 selected pairs, performing a cross-validation of αr and αp for every single pair.

Let S be our set of pairs. For every {cr, cp} pair s ∈ S, we are going to have two grids of values (alphaGrids in Algorithm 12), one for αr and one for αp, each ranging from 0.0 to 1.0; we are then going to test every possible combination of the grids (i.e., their Cartesian product; Warner, 1990, p. 6), taking the misclassification costs (Equation 5) as the evaluation measure. That is, we are going to have 11² = 121 combinations for every pair. Finally, we are going to keep the best {αr, αp} tuple for every s ∈ S.

Storing the best values for every pair is even more necessary in our case: remember that, since in our set S an RCV1 label can play, in different pairs, both the role of the cr and of the cp class, it would not make sense to assign a unique α value to a label. Rather, we give every {cr, cp} pair its best possible combined {αr, αp} values: this process is outlined in Algorithm 12.

Algorithm 12: Cross-validation of the {αr, αp} values for every pair.

Input : {αr, αp} ∈ alphaGrids, pairs, trainingSet;
Output: Optimized alpha values for every pair s ∈ S;

/* Map with pairs as index and, as values, a HashMap storing MINECORE costs for every {αr, αp} */
1   alphaCosts ← HashMap();
2   kFoldIterator ← KFold(trainingSet, k = 10);
3   for {train, test} ∈ kFoldIterator do
4       classifiers ← learnClassifiers(train, test);
5       results ← runMinecore(pairs, classifiers, αr, αp);
6       costs ← results.getMisclassCost();
7       for {pair, missCosts} ∈ costs do
            /* Update the accumulated cost for this pair and these {αr, αp} values */
8           previousTotalCost ← alphaCosts[pair][{αr, αp}];
9           alphaCosts[pair][{αr, αp}] ← previousTotalCost + missCosts;
10      end
11  end
    /* Transform the alphaCosts data structure into {pair: [αr, αp]} */
12  return getBestAlphaValues(alphaCosts);
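A compact Python sketch of Algorithm 12 is given below, using scikit-learn's KFold; learn_classifiers and run_minecore are placeholders for the corresponding steps of the actual experimental pipeline, run_minecore is assumed to return a dictionary mapping each pair to its misclassification cost (Equation 5), and the returned dictionary maps every pair to its best {αr, αp} combination:

from collections import defaultdict
from itertools import product

import numpy as np
from sklearn.model_selection import KFold

def cross_validate_alphas(training_set, pairs, alpha_grid=np.linspace(0.0, 1.0, 11)):
    """Accumulate, for every pair, the MINECORE misclassification costs of every
    (alpha_r, alpha_p) combination over 10 folds, and keep the cheapest combination."""
    alpha_costs = defaultdict(lambda: defaultdict(float))
    for train_idx, test_idx in KFold(n_splits=10).split(training_set):
        classifiers = learn_classifiers(training_set, train_idx, test_idx)  # placeholder
        for alpha_r, alpha_p in product(alpha_grid, alpha_grid):
            results = run_minecore(pairs, classifiers, alpha_r, alpha_p)    # placeholder
            for pair, miss_cost in results.items():
                alpha_costs[pair][(alpha_r, alpha_p)] += miss_cost
    # For every pair, return the (alpha_r, alpha_p) combination with the lowest total cost
    return {pair: min(costs, key=costs.get) for pair, costs in alpha_costs.items()}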


Moreover, in order to get a fine-grained output (that is, an optimization to the second decimal place), we can run Algorithm 12 twice: first, we input two grids g1αr and g1αp, each ranging from 0.0 to 1.0, as explained before; once we obtain the cross-validation results o (which stands for output), we generate 2l grids, with l being the number of pairs in our set S (as a matter of fact, for every pair we need one grid for the cr class and one for the cp class).

Let αo be any optimal value of α, coming from the results o, for a class (cr or cp) in a pair: each of the refined grids giα is then

giα = [αo − 0.05, αo + 0.05]        (36)

It is trivial yet important to notice that, when αo = 1, the range is going to be from 0.95 to 1.0, as going above 1 would not make sense (similarly, when αo = 0, we take values in the range [0.0, 0.05]).
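A sketch of how the refined grids of Equation 36 can be generated, clipping at the [0, 1] boundaries as just described (function and parameter names are illustrative):

import numpy as np

def refined_grid(alpha_opt, half_width=0.05, step=0.01):
    """Equation 36: a fine-grained grid centred on the coarse optimum alpha_opt,
    clipped so that it never leaves the [0, 1] interval."""
    low = max(0.0, alpha_opt - half_width)
    high = min(1.0, alpha_opt + half_width)
    return np.round(np.arange(low, high + step / 2, step), 2)

# e.g. refined_grid(1.0) -> [0.95, 0.96, ..., 1.0]; refined_grid(0.0) -> [0.0, ..., 0.05]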

Even though this is computationally expensive (indeed, in our experiments it took several days of computation to complete), it also allows us to really get the best possible α values. Finally, the resulting data structure can easily be stored on disk, so that it is quickly available for later use in the MINECORE algorithm.

We now move on to the discussion of our experimental results, mainly shown as the cost ratio between the baseline and the new "Relaxed" MINECORE (which we will call MINECORERX). The cost ratio has been computed using Equation 18, explained in Section 4. As can be seen from Figure 12, the new MINECORERX has actually improved on about 55% of our pairs. The overall ratio between MINECORERX and the baseline costs is instead 1.013 (see Equation 18), meaning that we achieved an improvement of slightly more than 1% in terms of real costs. Although it is definitely not a huge result, it still shows that the adjustments we proposed here have a positive impact on the algorithm's outcomes.

As previously said, it could be worth substituting the prior probabilities used in Equation 34 with the EMQ priors, since that algorithm has proven to deliver better prior probabilities (see Section 4). To do this, we can simply run EMQ and save the resulting prior probabilities only: remember that, with the introduction of the {αr, αp} values, we use prior probabilities to obtain the Pru probabilities (Equation 34); the lower the α value (be it αr or αp), the more we rely on the prior probabilities and, as a consequence, the more we could benefit from the EMQ results.

Figure 12: Cost ratio between the baseline and MINECORERX. Overall, the new algorithm has obtained lower costs and slightly improved with respect to the baseline.

Figure 13 shows that, with respect to Figure 12, a higher number of pairs has improved its overall costs; however, it should be noticed that the best pair (M12-M131) does not obtain a result as good as the previous one, which, together with the poorer results of other pairs, brings about a slightly lower overall improvement (the overall ratio is 1.007); indeed, we could say that using the EMQ priors brings almost no benefit at all.

Most importantly, it should be noticed in both graphs that, as we had already seen in the EMQ experiments, there are two outliers which should be addressed in our analysis, namely the M12-M131 and the M11-ECAT pairs:

• We could check whether there are remarkably strong differences in the accuracy of these labels' probabilities, which may have caused such a discrepancy between the scores;

• The M12-M131 pair is extremely interesting if we consider that it is the same pair that was behaving as an outlier in the EMQ experiments, albeit in those experiments
