Metadata of the chapter that will be visualized in SpringerLink Book Title

(1)

Metadata of the chapter that will be visualized in SpringerLink

Book Title Bioinformatics and Biomedical Engineering Series Title

Chapter Title The Complexity of Some Pattern Problems in the Logical Analysis of Large Genomic Data Sets

Copyright Year 2016

Copyright HolderName Springer International Publishing Switzerland

Corresponding Author Family Name Lancia

Particle

Given Name Giuseppe

Prefix Suffix

Division Department of Mathematics and Computer Science Organization University of Udine

Address Udine, Italy

Email [email protected]

Author Family Name Serafini

Particle

Given Name Paolo

Prefix Suffix

Division Department of Mathematics and Computer Science Organization University of Udine

Address Udine, Italy

Email

Abstract Many biomedical experiments produce large data sets in the form of binary matrices, with features labeling the columns and individuals (samples) associated to the rows. An important case is when the rows are also labeled into two groups, namely the positive (or healthy) and the negative (or diseased) samples. The Logical Analysis of Data (LAD) is a procedure aimed at identifying relevant features and building boolean formulas (rules) which can be used to classify new samples as positive or negative. These rules are said to explain the data set. Each rule can be represented by a string over {0,1,-}, called a pattern. A data set can be explained by alternative sets of patterns, and many computational problems arise related to the choice of a particular set of patterns for a given instance. In this paper we study the computational complexity of these pattern problems and show that they are, in general, very hard. We give an integer programming formulation for the problem of determining if two sets of patterns are equivalent. We also prove computational complexity results which imply that there should be no simple ILP model for finding a minimal set of patterns explaining a given data set.

(2)

in the Logical Analysis of Large Genomic Data Sets

Giuseppe Lancia⁽B⁾and Paolo Seraﬁni

Department of Mathematics and Computer Science, University of Udine, Udine, Italy [email protected]

Abstract. Many biomedical experiments produce large data sets in the form of binary matrices, with features labeling the columns and individuals (samples) associated to the rows. An important case is when the rows are also labeled into two groups, namely the positive (or healthy) and the negative (or diseased) samples. The Logical Analysis of Data (LAD) is a procedure aimed at identifying relevant features and building boolean formulas (rules) which can be used to classify new samples as positive or negative. These rules are said to explain the data set. Each rule can be represented by a string over{0,1,-}, called a pattern. A data set can be explained by alternative sets of patterns, and many computational problems arise related to the choice of a particular set of patterns for a given instance. In this paper we study the computational complexity of these pattern problems and show that they are, in general, very hard. We give an integer programming formulation for the problem of determining if two sets of patterns are equivalent. We also prove computational complexity results which imply that there should be no simple ILP model

for finding a minimal set of patterns explaining a given data set. ^AQ1

1 Introduction

The recent advances in technology together with the massive use of computers have made modern molecular biology a science where the huge size of the input and/or output data of most experiments is by itself a serious computational problem. Very large data sets can be found everywhere in computational molecular biology. For instance, the European Molecular Biology Laboratory Nucleotide Sequence Database has grown exponentially in the past ten years, and, currently, the archive comprises around two billion records covering almost two trillion base pairs [2]. A similar growth, albeit slower, has interested the Protein Data Bank in which the number of protein structures deposited (each of which is a large data set by itself) is today around 100,000 [3]. Microarray experiments produce GigaBytes of information about the expression of hundreds of thousands of genes in hundreds of individuals at once [6]. These are only a few examples of data sets and experiments where the size of the information hinders the analyis and interpretation of the information itself.

Springer International Publishing Switzerland 2016c

F. Ortu˜no and I. Rojas (Eds.): IWBBIO 2016, LNBI 9656, pp. 1–10, 2016.

DOI: 10.1007/978-3-319-31744-1 1

Author Proof

(3)

In many biological experiments, the data have the form of a two dimensional array, where the rows correspond to individuals and the columns are associated to some biological features. This is the case, for instance of microarrays, bi- dimensional grids in which each column is associated to a gene, and each row to some tissue cells [17]. Each entry (i, j) of the grid contains a number measuring the level of expression of genej in cells i. The dimension of the data sets makes imperative the use of computer algorithms in order to understand what is relevant information and what, on the other hand, is background noise and extra information of little practical use.

Classification and Feature Selection. In many experiments the rows are partitioned in classes, where each class contains cells that are “of the same type” but diﬀerent from the other classes. A typical and prominent example is the case of two classes in the study of a disease: the healthy cells vs the infected cells.

When the individuals are divided into two or more classes associated with their characteristics, it is of interest to identify some rules that put the values of the features of an individual in relation with its class. Such rules can be used to try and understand the hidden underlying properties deﬁning each class (e.g., one may discover a certain combination of some over-expressed genes with other under-expressed is common to all infected individuals while it is not found in any of the healthy ones). These type of rules can then be used to classify new individuals, i.e., to predict the class of an individual of unknown class.

The general goal in data-analysis studies for large and complex data sets is to identify the relevant features (i.e., to perform Feature Selection) as well as their mutual relations in order to build simple (hopefully) rules that allow to discriminate between individuals that belong to diﬀerent classes (i.e., to perform Classification).

Feature Selection and Classification techniques have been extensively used in the field of bioinformatics [4,8,9,12,15]. Since both problems are computa- tionally hard, many times they are studied separately but this is not the best approach. For instance, some optimization approaches to feature selection are meant at minimizing the number of features (see, for example, the barcoding problem [14]), but this objective may not be appropriate when the true goal is that of using the selected features for classification purposes. For one, a classifier based only on only a few features may be not very robust. But the main weakness of this type of approach is that the number of features becomes more important than their mutual relations: it could happen, e.g., that classification should be based on three biologically correlated features, but this will never be discovered if there is a way to use two other, perhaps unrelated, features in their place.

Logical Analysis of Data. Consider a data set consisting in a binary matrix ofm rows and n columns, in which m< m rows are labeled as positive while the remaining rows are labeled as negative. This is a typical data set for many biol- ogy experiments in which the rows are associated, e.g., to healthy and diseased samples. If the data comes from sequencing, the columns would be associated to

Author Proof

(4)

nucleotides of a reference sequence, and a 0 in a row would mean “no mutation”

while a 1 would mean “a mutation occurred”. If the data comes from a microarray experiment, the values 0 and 1 would be related to the level of expression of each gene, with 0 meaning “normal” and 1 meaning “abnormal” [11,16].

A technique known as Logical Analysis of Data (LAD [1,7,13]) has been often applied to data sets of this type with the objective of discovering a set of simple boolean formulas (or “rules”) that can be used to classify new binary vectors (b1, . . . , bn). Each rule speciﬁes the values that some bits must have in order to classify a vector as positive or negative. For instance, a “positive rule”

could be

(b2= 0)∧ (b5= 1)∧ (b9= 1)

meaning that any vector that has a 0 in the 2nd component, and a 1 in the 5th and 9th component is classified as positive. Similarly, “negative rules” will specify how to classify a vector as negative. The goal of LAD is to derive positive and negative rules from the data in such a way that (i) each positive row satisfies at least one of the positive rules, while no negative row does and (ii) each negative row satisfies at least one of the negative rules, while no positive row does.

The techinque has been successfully applied to many biomedical contexts, such as the detection of rules that can be related to ovarian cancer, breast cancer and lymphoma [13]. The application of LAD to bioinformatics/biomedical data is based on the solution of optimization problems corresponding to large set covering instances.

Boolean Formulas as Patterns. Let us deﬁne an n-pattern (or simply a pattern) p as a string of n symbols where each symbol can take the values 0, 1 or −, i.e.,

pi∈ {0, 1, −} , i = 1, . . . n.

We say that a pattern p covers, or generates, a binary string b = (b1, . . . , bn) if bk =pk for each k such that pk ∈ {0, 1}. Let us denote the set of all binary strings generated by the pattern p by

S(p) = {s ∈ {0, 1}ⁿ:si=pi ifpi∈ {0, 1}}

Given a setP of patterns the set S(P ) generated by P is the set S(P ) =

p∈P

S(p)

The condition of being covered by a pattern can be interpreted as a boolean formula akin to the positive and negative rules of LAD, namely,

k:pk∈{0,1}

(b_k =p_k)

For instance, if n = 10, the rule (b2= 0)∧ (b5= 1)∧ (b9= 1) is represented by the pattern

-0--1---1-

Author Proof

(5)

In view of the equivalence of rules and patterns, we can consider LAD as the problem of ﬁnding positive and negative patterns in order to explain a set of positive and negative vectors.

For a given instance of LAD, there might be many alternative sets of patterns explaining the data. A widely accepted objective, when more theories can be adopted to explain a phenomenon, is to choose the simplest one (a principle known as Okkam’s razor). In this case, the simplest solution would be a set with a minimum number of patterns. Finding a min-size set of patterns is computation- ally demanding (see in [18] a branch-and-price procedure), and so also heuristic procedures have been proposed and used [5]. Some interesting computational problems emerge, such as

1. SIMPLE PATTERN FEASIBILITY: given a setA of strings and a set P of patterns,A = S(P )?

2. SIMPLE PATTERN MINIMALITY: given a set A of strings and a constant K, does there exist a set of patterns P such that A = S(P ) and |P | ≤ K?

3. PATTERN EQUIVALENCE: given two setsP , Pof patterns, are they equivalent, i.e.,S(P ) = S(P)?

4. PATTERN MINIMALITY: given a setP of patterns and a constant K < |P |, does there exist an equivalent set of patternsQ such that |Q| ≤ K?

5. FULL PATTERN COVERAGE: given a setP of patterns, S(P ) = {0, 1}ⁿ? 6. PARTIAL PATTERN COVERAGE: given a set P of patterns, does there

exist a strings ∈ {0, 1}ⁿ such thats /∈ S(P )?

All these questions are motivated when LAD is used for biomedical analysis.

For instance, one might be interested in knowing if a solution could be improved upon, or if two solutions obtained by different heuristics are equivalent. Fur- thermore, consider this situation: Given a set of patternsP and a set of strings Y , find a set of patterns Q such that S(Q) = S(P ) ∪ Y . This is a problem that arises when one wants to update a classifier in light of new data that has become available. Notice how this problem is substantially different from the “regular”

LAD problem: in LAD, we are given a set of strings and we want to ﬁnd a set of patterns that generate them. Here, we are given a set of patterns (and strings) and we want an equivalent set. The study of this problem motivates both questions 2. and 3. above, since we would certainly look for a set Q such that|Q| < |P | + |Y |. Problems 5. and 6. are complement of each other and their interest is more theoretical in studying the other problems.

Paper Contribution and Organization. In this paper we study the compu- tational complexity of pattern problems arising in the logical analysis of bidi- mensional data with positive/negative samples (such as biomedical data coming from microarray experiments). We show that these problems are, in general, quite complex. We give an integer programming formulation for PATTERN EQUIVA- LENCE. Formulating an ILP model for PATTERN MINIMALITY seems quite challenging due to the computational complexity results of the next section.

Author Proof

(6)

2 Computational Complexity Results

Proposition 1. SIMPLE PATTERN FEASIBILITY is polynomial.

Proof: For each a ∈ A and each p ∈ P we may easily check if a is generated by p. Hence in time O(n |A| |P |) we may decide whether A ⊆ S(P ) or not. In order to decide whether S(P ) ⊆ A or not, we start generating all strings in S(P ). For each generated string s, deciding whether s ∈ A or not can be done in time O(n |A|). We don’t need to generate more than |A| · |P | strings. If a generated string does not belong toA we have found that S(P ) ⊆ A. Otherwise each generated string belongs to A. Since a string can be generated by more than one pattern, i.e., in the worst case by|P | patterns, no more than |A| · |P | generated strings can belong to A. If there are more strings to be generated, these do not belong to A. Hence we may decide whether S(P ) ⊆ A or not in

timeO(n |A|²|P |).

Proposition 2. SIMPLE PATTERN MINIMALITY is NP-complete.

Proof: It suﬃces to observe that SIMPLE PATTERN MINIMALITY is exactly MINIMUM DISJUNCTIVE NORMAL FORM (see [10] p. 261), which we repeat here for the sake of completeness. A setU = {u1, u2, . . . , un} of variables, a set A ⊆ {T, F }ⁿ of “truth assigments”, and a positive integerK are given. We ask whether there exists a disjunctive normal form expression E over U, having no more than K disjuncts, such that E is true for precisely those truth assigments in A, and no others.

We may replace the variablesu_iwith the symbols in the strings, each element a ∈ A is a particular binary string, each disjunct is a pattern with a 1 for each variable, a 0 for each negated variable and a− for each variable not present in

the disjunct.

Proposition 3. PARTIAL PATTERN COVERAGE is NP-complete.

Proof: The proof goes via a transformation from SAT. Given an instance of SAT with n literals and m clauses we generate a pattern from each clause as follows: if the literal xi is present in the clause we setpi = 0, if¬xi is present we set pi= 1, and if neitherxi nor¬xi are present we setpi=−.

Now assume SAT is satisfiable and letx be the truth assignment that makes the instance satisfiable. Then we form the strings_i= 1 ifx_i= TRUE ands_i= 0 ifx_i = FALSE. Since at least one of the literals of each clause must be true, then, for eachp ∈ P , at least one of the symbols s_i corresponding to 0, 1 positions of p must be different from p_i and therefore cannot be inS(p). Conversely, given a string not inS(P ) we may just reverse the reasoning and find a truth assignment that makes the SAT instance satisfiable.

PARTIAL PATTERN COVERAGE is also in NP since it suﬃces to show a string not inS(P ) and the veriﬁcation that it does not belong to any S(p) takes

polynomial time.

Author Proof

(7)

Since PARTIAL PATTERN COVERAGE is complement to FULL PAT- TERN COVERAGE we clearly have:

Corollary 1. FULL PATTERN COVERAGE is co-NP-complete. Proposition 4. PATTERN EQUIVALENCE is co-NP-complete.

Proof: We transform FULL PATTERN COVERAGE into PATTERN EQUIV- ALENCE, by taking P the single pattern{− − . . . −} generating {0, 1}ⁿ. Fur- thermore for a no-instance there exists a strings ∈ S(P ) and s /∈ S(P), or vice

versa, and this string is a succinct certiﬁcate.

Proposition 5. PATTERN MINIMALITY is co-NP-hard.

Proof: The proof goes via a transformation from FULL PATTERN COVER- AGE. Given an instance of FULL PATTERN COVERAGE we deﬁne a similar instance of PATTERN MINIMALITY by choosing K = 1. Note that we may freely assume that for eachi, the pi’s,p ∈ P , are not all equal, otherwise we may drop each position with all symbols equal and reduce the instance to an equivalent one. Hence the only pattern that can be equivalent toP is {− − . . . −}. Since SIMPLE PATTERN MINIMALITY is a particular case of PATTERN MINIMALITY we have by Proposition2:

Proposition 6. PATTERN MINIMALITY is NP-hard. By putting together Propositions5and6we see that it is very unilkely that PATTERN MINIMALITY is in NP or in coNP.

Our main problem is ﬁnding a smaller pattern set generating the same strings of a given pattern set. The previous results show that, even if we have available the two pattern sets, the simple check that they generate the same set of binary strings is NP-hard. This observation seems to rule out ILP models in which patterns are represented in some way and the check is left to a polynomial system of equalities/inequalities on integer variables.

3 ILP Models

Two setsP and Q of patterns are given. Consider the following models

v = min

q∈Q

zq

i:qi=0

xi+

i:qi=1

(1− xi)≥ 1 − zq q ∈ Q

y_p≤ 1 − x_i i : p_i= 0, p ∈ P

yp≤ xi i : pi= 1, p ∈ P

p∈P

yp≥ 1

x_i∈ {0, 1} , y_p∈ {0, 1} , z_q ≥ 0 integer

(1)

Author Proof

(8)

w = min

p∈P

y_p

i:pi=0

x_i+

i:pi=1

(1− x_i)≥ 1 − y_p p ∈ P

zq≤ 1 − xi i : qi= 0, q ∈ Q

z_q≤ x_i i : q_i= 1, q ∈ Q

q∈Q

z_q ≥ 1

xi∈ {0, 1} , zq ∈ {0, 1} , yp≥ 0 integer

(2)

In the sequel⊂ means strict inclusion.

Proposition 7

– S(P ) ⊂ S(Q) if and only if v > 0 and w = 0;

– S(Q) ⊂ S(P ) if and only if w > 0 and v = 0;

– S(Q) = S(P ) if and only if v > 0 and w > 0.

Proof: It is enough to show thatS(P ) ⊆ S(Q) if and only if v > 0. If y_p= 1 then x is generated by p ∈ P . The constraint

p∈Py_p≥ 1 forces x to be generated by at least one pattern inP . Hence feasible x are in S(P ). Consider any x ∈ {0, 1}ⁿ. Ifx is generated by q ∈ Q then z_q = 1. Ifx is not generated by q ∈ Q then z_q = 0 is feasible (along with possible integer valuesz_q ≥ 1). The objective forces z_q to be zero in this case.

Hencev = 0 if and only if x ∈ S(P ) and x /∈ S(Q). If v > 0, for any pattern x ∈ S(P ) we have that x ∈ S(Q), i.e., S(P ) ⊆ S(Q). Note that, ifS(P ) ⊆ S(Q), i.e., when v = 0, model (1) exhibits also a string x in S(P ) that is not in S(Q), whereas if S(P ) ⊆ S(Q), i.e., when v > 0, model (1) exhibits also a stringx in S(P ) that is necessarily also in S(Q). Similarly if S(Q) ⊆ S(P ), i.e., when w = 0, model (2) exhibits also a stringx in S(Q) that is not in S(P ), whereas if S(Q) ⊆ S(P ), i.e., when w > 0, model (2) exhibits also a stringx in S(Q) that is necessarily also in S(P ).

We may further distinguish the casew = 0, v = 0, via the following model w = min wˆ p+wq

y_p≤ 1 − x_i i : p_i= 0, p ∈ P yp≤ xi i : pi= 1, p ∈ P

p∈P

yp≥ 1 − wp

z_q≤ 1 − x_i i : q_i= 0, q ∈ Q zq≤ xi i : qi= 1, q ∈ Q

q∈Q

zq ≥ 1 − wq

(3)

Author Proof

(9)

Proposition 8. S(P ) and S(Q) are disjoint if and only if ˆw > 0. IfS(P ) and S(Q) are not disjoint then the model exhibits a string x shared by both sets. Note that the problems FULL PATTERN COVERAGE and PAR- TIAL PATTERN COVERAGE are solved by

u = min

p∈P

yp

i:pi=0

xi+

i:pi=1

(1− xi)≥ 1 − yp p ∈ P xi ∈ {0, 1} , yp≥ 0 integer

Proposition 9. S(P ) = {0, 1}ⁿ if and only if u > 0. As a simple example of the previous results suppose we are given the two following sets of patterns. These are ﬁctitious instances andP has been obtained fromQ by simply replacing some “0” or some “1”with a “-” so that we know in advance thatS(Q) ⊂ S(P ).

Q =

⎡

⎢⎢

⎣

1 1 - - 1 1 1 1 - - 1 1 - - 1 1 1 1 - - 1 1 - - 1 1 1 1 - - 0 1 - - 1 1 1 1 - - 1 1 - 1 0 1 1 1 1 - 1 1 - 1 0 1 1 1 1 - 1 1 - - 1 1 1 1 - - 1 1 - - 1 0 1 1 - - 1 0 - 1 - - 1 - 0 1 1 0 - 1 - - 1 - 0 1 1 1 - - 1 1 1 1 - - 1 1 - - 1 1 1 1 - - 1 0 1 0 1 - 0 - 1 1 1 0 1 0 1 - 0 - 1 1 1 1 - - 1 1 0 1 - - 1 1 - - 1 1 1 1 - - 1 1 1 1 - 1 - 1 - 1 1 1 1 1 - 1 - 1 - 1 1 1 - - 1 1 1 1 - - 1 1 - - 0 1 1 1 - - 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 - 1 0 - 1 - - 0 - 0 0 1 0 - 1 - - 0 - 0 0 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 1 0 0 - 0 - 1 - - 0 1 0 0 - 0 - 1 - - 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 1 0 - 1 - - 0 - 0 0 1 0 - 1 - - 0 - 0 0 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 -

⎤

⎥⎥

⎦

P =

⎡

⎢⎢

⎣

1 1 - - 1 1 1 1 - - 1 1 - - 1 1 1 1 - - 1 1 - - 1 - 1 1 - - 0 1 - - 1 1 1 1 - - 1 1 - 1 0 1 1 1 1 - 1 1 - 1 0 1 1 1 1 - 1 1 - - 1 1 1 1 - - 1 1 - - 1 0 1 - - - 1 0 - 1 - - 1 - 0 - 1 0 - 1 - - 1 - 0 1 1 1 - - 1 1 1 1 - - 1 1 - - 1 1 1 1 - - 1 0 1 0 1 - 0 - 1 1 - 0 1 0 1 - 0 - 1 1 1 1 - - 1 1 0 1 - - 1 1 - - 1 1 1 1 - - 1 1 1 1 - 1 - 1 - 1 1 1 1 1 - 1 - 1 - 1 1 1 - - 1 1 1 1 - - 1 1 - - 0 - 1 1 - - 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 - 1 0 - 1 - - 0 - 0 0 1 0 - - - - 0 - 0 0 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 1 0 0 - 0 - 1 - - 0 1 0 0 - 0 - - - - 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 - - 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 0 - 1 0 1 - 0 - 0 1 1 0 - - - - 0 - 0 0 1 0 - 1 - - 0 - 0 0 0 0 - 1 0 0 0 - 0 - 0 0 - 1 0 0 0 - 0 -

⎤

⎥⎥

⎦

We run in sequence (1) and (2) and obtain v = 0 and w > 0, that implies S(Q) ⊂ S(P ), according to Proposition7. In this case there is no need of running (3). The two programs take less than one second on a Mac OS personal computer.

As a byproduct we obtain from (1) and (2) respectively the strings x¹= 1 0 1 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1 x²= 1 1 0 0 1 1 1 1 0 0 1 1 0 0 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1

Author Proof

(10)

One can easily check thatx¹ ∈ S(P )andx¹ /∈ S(Q)and also thatx²∈ S(Q) ⊂ S(P ). Now suppose the ﬁrst symbol of the ﬁrst pattern in P is changed to “0”.

We run (1) and (2) and obtainv = 0 and w = 0, so that we know that neither S(P ) ⊂ S(Q) nor S(Q) ⊂ S(P ). We have to run (3) and obtain ˆw = 0 so that we know that the two sets are not disjoint according to Proposition 8. Also in this the running time is negligible. As byproducts we obtain respectively

x¹= 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 0 1 1 1 1 0 0 x²= 1 0 1 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1 x³= 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 One can easily check that x¹ ∈ S(P ) and x¹ /∈ S(Q), that x² ∈ S(Q) and x² /∈ S(P ) and that x³∈ S(P ) ∩ S(Q).

4 Conclusions

The use of software tools for feature selection and LAD poses some challenging problems with respect to the solutions (sets of patterns) that migh be obtained.

In this paper we have studied the computational complexity of these problems and have show that they are, in general, very hard. One consequence of our complexity results is that there should be no simple ILP model for ﬁnding a minimal set of patterns explaining a given data set. On the other hand, we have given integer programming formulations for the problem of determining if two sets of patterns are equivalent. A line of future research would be to try and use these models in order to solve the PATTERN MINIMALITY problem.

References

1. Alexe, G., Alexe, S., Bonates, T.O., Kogan, A.: Logical analysis of data - the vision of Peter L. Hammer. Ann. Math. Artif. Intell.49, 265–312 (2007)

2. Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G., Tuli, M.A.: The EMBL nucleotide sequence database. Nucleic Acid Res.28, 19–23 (2000)

3. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acid Res. 28, 235–242 (2000)

4. Bertolazzi, P., Felici, G., Festa, P., Lancia, G.: Logic classification and feature selection for biomedical data. Comput. Math. Appl.55, 889–899 (2008)

5. Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Trans. Knowl. Data Eng. 12, 292–306 (2000)

6. Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S., Fodor, S.P.A.: Accessing genetic information with high-density DNA arrays. Science274, 610–614 (1996)

7. Chikalov, I., Lozin, V., Lozina, I., Moshkov, M., Son Nguyen, H., Skowron, A., Zielosko, B.:Logical analysis of data: theory, methodology and applications. In:

Three Approaches to Data Analysis, pp. 147–192. Springer (2013)

Author Proof

(11)

8. Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal.1, 131–156 (1997)

9. Felici, G., de Angelis, V., Mancinelli, G.: Feature selection for data mining.

In: Felici, G., Triantaphyllou, E. (eds.) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, pp. 227–252. Springer (2006) 10. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory

of NP-Completeness. W.H. Freeman and Company, San Francisco (1979)

11. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science286, 531–537 (1999)

12. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn.46, 389–422 (2002)

13. Hammer, P., Bonates, T.: Logical analysis of data: from combinatorial optimization to medical applications. RUTCOR Research Report, 10-05, Rutgers University, NJ (2005)

14. Hebert, P.D.N., Cywinska, A., Ball, S.L., de Waard, J.R.: Biological identifications through DNA barcodes. Proc. R. Soc. Lond. B270, 313–321 (2003)

15. Hu, H., Li, J., Plank, A., Wang, H., Daggard, G.: A comparative study of classification methods for microarray data analysis. In: Proceedings of the fifth Australasian Conference on Data Mining and Analystics, vol. 61, pp. 33–37, Sydney, Australia (2006)

16. Li, T., Zhang, C., Ogihara, M.: A comparative study of feature selection and mul- ticlass classification methods for tissue classification based on gene expression.

Bioinformatics20, 2429–2437 (2004)

17. Montgomery, D., Undem, B.L.: CombiMatrix’ customizable DNA microarrays- Tutoial: In situ computer-aided synthesis of custom oligo microarrays. Genetic Eng. News22, 32–33 (2002). Drug Discovery

18. Serafini, P.: Classifying negative and positive points by optimal box clustering.

Discrete Appl. Math.165(270–282), 10 (2014)

Author Proof

(12)

Chapter 1

Query Refs.

Details Required Author’s

response AQ1 Please confirm if the corresponding author is correctly

identified. Amend if necessary.

Author Proof

(13)

MARKED PROOF

Please correct and return this set

Instruction to printer

Leave unchanged under matter to remain

through single character, rule or underline

New matter followed by or

or

or or or or

and/or

and/or e.g.

e.g.

under character

over character new character new characters through all characters to be deleted

through letter or

through characters under matter to be changed under matter to be changed under matter to be changed under matter to be changed under matter to be changed Encircle matter to be changed (As above)

Metadata of the chapter that will be visualized in SpringerLink Book Title