Practical approaches for mining frequent patterns in molecular datasets

(1)

Introduction

In the last decade, various information-rich resources have become available to study organisms on a systems-wide scale. The rapid accumulation of complex biological data in extensive compendia demands powerful and specialized pattern mining techniques.1–4_{A popular group of pattern mining techniques} are itemset mining and their derivative, association rule min-ing. These methods are typically known for their ability to detect frequently co-occurring products in lists of customer supermarket baskets, effectively identifying the patterns in customers’ shopping behavior.5_{In this context, the shopping} cart is formally known as a transaction, while the individual products are the items. The discovery of sets of correlated items (ie, itemsets) is the goal of this data mining approach, which can be highly relevant in the context of life sciences. For example, one can investigate which genes are often coex-pressed in tissue samples or which mutations often occur together in cancer tumors of a given type.

Frequent itemset mining has proven especially use-ful in capturing and summarizing the characteristics of complex datasets to their important and most interesting aspects. Frequent patterns can be converted into rules with a

discriminatory value that can, in turn, be used to build trans-parent classifications. For example, if a gene C is always upreg-ulated when genes A and B are downregupreg-ulated, the frequent itemset {A|Down, B|Down, C|Up} can be rewritten as the rule {A|Down, B|Down}  {C|Up}, where the left-hand side (antecedent) of the rule leads to the consequent (right-hand side) of the rule. Rules of this type can be used to distinguish between tumor types, gene clusters, and various other biologi-cal contrasts. The advantage of this approach is that the rules immediately explain why a particular label was given, which is an advantage over machine learning methods such as neural networks that act as a black box. The strengths of frequent itemset mining have been consequently demonstrated in a broad range of bioinformatics applications, ranging from gene expression data,6–8_{annotation mining,}9,10_{and combinations} thereof11,12_{to interaction networks.}13_{A comprehensive} over-view of the broad range of implementations and bioinformat-ics applications of frequent itemset mining techniques was recently published.14

Despite their demonstrated suitability to address various bioinformatics problems, frequent itemset mining techniques have not been generally adopted in day-to-day omics data

in Molecular Datasets

stefan naulaerts

1,2

_{, sandy moens}

1

_{, Kristof Engelen}

3

_{, Wim Vanden Berghe}

4

_{, Bart goethals}

1

_,

Kris laukens

1,2

_{and Pieter meysman}

1,2

1_{Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium.}2_{Biomedical Informatics Research Center} Antwerpen (Biomina), University of Antwerp/Antwerp University Hospital, Antwerp, Belgium. 3_{Department of Computational Biology,} Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy. 4_{Department of Biomedical Sciences, Laboratory of Protein Science,} Proteomics and Epigenetic Signaling (PPES), University of Antwerp, Antwerp, Belgium.

AbstrAct: Pattern detection is an inherent task in the analysis and interpretation of complex and continuously accumulating biological data. Numerous

itemset mining algorithms have been developed in the last decade to efficiently detect specific pattern classes in data. Although many of these have proven their value for addressing bioinformatics problems, several factors still slow down promising algorithms from gaining popularity in the life science commu-nity. Many of these issues stem from the low user-friendliness of these tools and the complexity of their output, which is often large, static, and consequently hard to interpret. Here, we apply three software implementations on common bioinformatics problems and illustrate some of the advantages and disadvan-tages of each, as well as inherent pitfalls of biological data mining. Frequent itemset mining exists in many different flavors, and users should decide their software choice based on their research question, programming proficiency, and added value of extra features.

Keywords: frequent itemset mining, protein domain structure, protein–protein interaction, gene expression, Mycobacterium tuberculosis

CitAtion: naulaerts et al. Practical approaches for mining frequent Patterns in molecular datasets. Bioinformatics and Biology Insights 2016:10 37–47 doi: 10.4137/BBi.s38419. tYPE: review

RECEivED: January 19, 2016. RESubMittED: march 13, 2016. ACCEPtED FoR PubliCAtion: march 16, 2016.

ACADEMiC EDitoR: dr thomas dandekar, associate Editor

PEER REviEw: five peer reviewers contributed to the peer review report. reviewers’ reports totaled 1404 words, excluding any confidential comments to the academic editor.

FunDing: this project is supported by research foundation – flanders (fWo-Vlaanderen) project “instant interactive data exploration” and “Evolving graphs”; the University of antwerp Bof [docPro Phd grant to sn]; and sBo grant “insPEctor” (120025) of the flemish agency for innovation by science and technology (iWt). the authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.

CoMPEting intEREStS: Authors disclose no potential conflicts of interest. CoRRESPonDEnCE: pieter.meysman@uantwerpen.be

CoPYRight: © the authors, publisher and licensee libertas academica limited. this is an open-access article distributed under the terms of the creative commons cc-By-nc 3.0 license.

Paper subject to independent expert blind peer review. all editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. this journal is a member of the committee on Publication Ethics (coPE). Provenance: the authors were invited to submit this paper.

(2)

analysis workflows, and their popularity is only slowly gaining traction. This can be partially attributed to a number of short-comings in the existing implementations. First of all, most are command line tools that often need to be compiled from the source code, and clear documentation regarding their instal-lation is often lacking. This lack of user-friendliness poses a serious entrance barrier that daunts many life scientists. Sec-ond, the output of the implementations is often presented in a format that is not readily interpretable by domain experts. The results of the mining process are typically long pattern lists containing flat text files. However, these lists are often very lengthy and highly redundant. This is caused, in part, by the fact that if a set is frequent, any of the smaller sub-sets that it contains will also be frequent. This is also known as the apriori principle. For many pattern mining applications, there is often a so-called pattern explosion with results that list millions of patterns. Due to the verbose nature of these lists, user-friendly tools to process, query, and visualize this output are indispensible.

Convenient prioritization, filtering, cleaning, and inter-pretation of pattern result lists require certain functionalities that are rarely covered by existing implementations. Third, iterative optimization of the pattern list and browsing through the output of these algorithms is often hard, as they create static output that needs to be processed and converted to a compatible format before the next step in the iterative min-ing process can start. This can make result prioritization, an inherent part of many pattern discovery projects, a very cumbersome process.14

To address some of these limitations, software frameworks have been developed for interactive visual pattern mining, such as the MIME tool.15_{Such toolboxes offer intuitive access to} interestingness measures, mining algorithms, and postpro-cessing algorithms to assist in identifying interesting patterns. By enabling interactive mining, it allows the user to com-bine their subjective interestingness measure and background knowledge with a wide variety of objective measures to eas-ily and quickly mine the most important and interesting patterns. In this article, we demonstrate the opportunities

of frequent itemset mining in real-world bioinformatics scenarios and describe the application of three commonly used methods, namely, Apriori,5_arules,16_{and MIME.}15_This comparison is based on three representative bioinformatics use cases, ie, domain co-occurrence within proteins, interactions between domains in interacting proteins, and the response of the pathogen Mycobacterium tuberculosis to several drug treatments. For this purpose, we utilize data from Uniprot,2 IntAct,4_{and Colombos.}1_{The data files and step-by-step} tuto-rials on how to install and run the three presented tools on the three use cases are available in Supplementary Files 1–5. The goal of this study is to explore how interesting and biologi-cally relevant patterns can be effectively generated with differ-ent tools and provide the community with some guidance on how frequent itemset mining tools can be used in complex life science scenarios.

Materials and Methods

Frequent itemset mining. For datasets with large amounts of objects, features, or observations, it is not tractable to check all possible combinations and find correlated annota-tion terms or biological entities. Frequent itemset mining is composed of a set of tools that are able to find co-occurring terms, known as items, in big data. These items can be any entity, ranging from genes and RNAs to proteins or drugs, and thus allow, for example, to identify coexpressed genes or proteins. The mining algorithms typically start from transac-tional databases, as shown in Figure 1.

A transaction is simply a set of items, and a transactional database layout contains one transaction per row, in which each transaction refers to an observation, such as the collec-tion of domains associated with a protein (Fig. 1) or the aggre-gation of supermarket articles found in a shopping basket that was checked out by a single customer. These algorithms use heuristics to reduce the search space. For example, a priori-based methods use the Apriori principle, which states that if an itemset is not frequent, all supersets of these items will also not be frequent. A pattern is frequent when its items appear more than expected, which means that its support, the absolute

Transaction database Protein 1: Protein 3: Protein 4: Protein 2: Pattern list 3 3 2 2 2 2 2 Domain pattern Support

Figure 1. toy example of frequent itemset mining. the input of a frequent itemset mining approach is a transaction database (shown to the left). the

(3)

number of times that a pattern occurs, needs to exceed a user-defined minimal threshold. It is worth noting that frequency and support are often interexchanged, with frequency thresh-olds stating the relative minimum (often 10%) threshold that needs to be exceeded. A hot topic in this field is the inability of frequent itemset mining to identify co-occurrence of continu-ous values, as most methods require the user to conduct well-chosen discretization steps that define the items evaluated by the algorithm. We have described our discretization steps where needed in the case studies described subsequently.

datasets. In order to compare the benefits of each fre-quent itemset method, several datasets were created from public resources, more specifically using InterPro, Colombos, and IntAct. For each of the case studies, a transaction data-set was constructed based on the data downloaded from the database and saved in a simple space- or tab-delimited file. All of the transaction data files have been made available in Supplementary Files 1–3). A tutorial on how to practically execute the mining process has also been included (Supple-mentary File 4), as well as a small Python script that compares the mining results for the Borgelt’s Apriori implementation (Supplementary File 5).

Protein domain analysis. For the first use case, we downloaded all known proteins of the human reference pro-teome on February 19, 2014 from UniProt2_{and retained only} those that contained at least one InterPro domain.17_{In total,} this resulted in a set consisting of 20,636 proteins. In the second use case, we mapped this information on top of the IntAct protein–protein network,4_{which was downloaded on} the same day.

colombos. All the microarray information was obtained from Colombos (a collection of microarrays for bacterial organisms),1_{which contains numerous renormalized gene} expression experiments extracted from the Gene Expression Omnibus18_{and ArrayExpress.}19_{Using the advanced search} option, we created a dataset for M. tuberculosis composed of several experiments in which the bacteria responded to anti-biotics added to their growth medium. Next, the information in this dataset was discretized based on the fold change. Tra-ditionally, log2-fold changes ranging from 1 to 1.5 are used to define the differential regulatory state of a gene. Here, we used a log2-fold change of 1.2, which results in a dataset that includes 25% of the protein-coding genes in M. tuber-culosis that were differentially expressed in at least one con-dition contrast. All log2-fold changes greater than 1.2 were considered upregulated, while those smaller than −1.2 were labeled as being downregulated. Fold changes between these two values were excluded from the dataset. For each gene, the log-fold change was simplified to a discretized state (up or down) and appended as a suffix to the contrast informa-tion. For example, the gene Rv0823c has been identified to be downregulated (fold change −1.43) in study GSE1642, in which the treatment is an addition of 1 µM valinomycin. The combination of all this information into one label results in

Rv0823c|Down, which is a discrete item that can be used in the mining process.

tools. Apriori. Apriori is one of the oldest, most simple, and popular frequent itemset mining algorithms.20_{It is} avail-able in various implementations that vary from command line tools to parts of data analysis software suites. In this article, we use Christian Borgelt’s Apriori implementation, which is available at http://www.borgelt.net/apriori.html.21_We com-piled the C source code and ran the implementation using the synthaxis./apriori [options] infile [outfile] from the terminal on a Macintosh running OS 10.9 Mavericks.

One of the major advantages of Borgelt’s Apriori version, other than its improved pattern identification efficiency, is its support for distinctly different types of itemsets, includ-ing maximal, closed, and open itemsets. By default, Apriori generates all possible itemsets (open), which are typically far too many to analyze. The information contained by these patterns is largely redundant, which means that the result-ing pattern list can be efficiently reduced with a minimal loss of information content. For example, only itemsets that have no frequent superset can be retained (maximal itemsets). This results in a major reduction in patterns that cannot be justified in every scenario. A more balanced general approach that still effectively reduces the output of the mining process consists of mining only closed patterns. These are formally defined as itemsets that have no immediate superset with the same sup-port value. Although all these three pattern classes have been used to approach various life science problems, closed itemsets have been the most prominent in analyses that deal with the typical genome- or proteome-wide scale datasets.14

Arules. In addition to command line tools, several efforts have been made to bring frequent itemset mining to other platforms. One such project is the R package arules, which also contains the Apriori implementation by Borgelt in addi-tion to other mining algorithms.16

Arules provides an entire toolbox for the representation, manipulation, and analysis of frequent itemsets and association rules in R. It contains several scoring metrics and allows the calculation of various properties, such as dissimilarity. Arules is also compatible with arulesViz,22_{which allows rapid} visual-ization of the mining results. Arules is available for download through the R interface as a CRAN distribution.

MIME. The third tool covered in this study is MIME.15 MIME provides a dynamic and interactive graphical environ-ment to interact with the dataset and retrieve patterns. It hosts several popular pattern mining algorithms, such as Apriori,5 Eclat,23_tiling,24_{top-K mining,}25_{the OPUS Miner,}26_and Carpenter,6_{as well as various classification algorithms that} are based on association rule mining.27–29_{Similar to arules,} MIME has various scoring measures (so-called interestingness metrics) and, furthermore, supports iterative mining. The software is written in Java and depends on two Java libraries, namely, QTJambi (http://qt-jambi.org) and WEKA,30_{and is} available from http://adrem.ua.ac.be/mime.

(4)

results and discussion

By means of practical use cases, we will go through increas-ingly complex life sciences scenarios that at each step require additional preprocessing steps to translate the problem into a format recognizable by the tools.

case study 1: analyzing domain co-occurrence within proteins. The simplest use case for frequent itemset mining is the analysis of domain co-occurrence within proteins. The transaction dataset contains all human proteins documented in InterPro with at least one protein domain. Herein, each protein is considered as a transaction, which equals a single line in the input file containing the domains (items). In total, 20,636 proteins were present in the dataset. All three tools were able to find the most frequent individual domains by set-ting the maximal number of items in an itemset to 1. MIME delivers this information without requiring an extra step. This step allows us to identify which domains occur the most in the dataset and which is important for the calculation of several quality measures, such as lift value. It also gives the researcher a quick initial idea of which domains are most likely to appear in frequent itemsets. If these domains associate in an aspe-cific manner with another domain and can be expected by random combination (low lift value), the pattern is not inter-esting for a biologist who tries to identify domains that are functionally correlated.

The most frequent domain was identified to be IPR027417, a P-loop containing nucleoside triphosphate hydrolase, which occurred 1329 times (6.44% of all proteins). Other than this domain, only IPR013783 (4.44%), IPR011009 (4.31%), IPR000719 (3.96%), and IPR015943 (3.19%) occur in more than 3% of the human proteins. IPR000719 and IPR011009 (protein kinase-like domain) occurred together in the majority of proteins they occurred in, as represented by the correspond-ing support value (3.94%). This pattern was the most frequent of all associated InterPro terms, with the highest support and a high lift value, which suggests a nonrandom association. How-ever, this can easily be attributed to the fact that the former is a protein kinase domain and the latter is a protein kinase-like domain, with IPR000719 being a subset of IPR011009 and kinases being a family containing a significant number of proteins. Interestingly, one would then expect IPR000719 (819 proteins) to always co-occur with IPR011009, but this was apparently not the case for five proteins, namely, C9JE15 (aarF domain-containing protein kinase 2), D6RHX9 (cal-cium/calmodulin-dependent protein kinase type II subunit alpha), E9PPN3 (N-terminal kinase-like protein), F8W0N2 (serine/threonine-protein kinase receptor R3), and H0YAH6 (epithelial discoidin domain-containing receptor 1). In the following updates of Uniprot, these five proteins had the IPR011009 domain added, which illustrates that even basic frequent itemset mining can be applied to detect inconsisten-cies in annotations.

A combination of the low cutoff percentages and the com-plexity of biological data make setting a minimal threshold

a rather daunting task. One approach is iteratively lowering the threshold and keeping it above the value where the number of patterns explodes. This parameter fine-tuning may be time consuming, and the arbitrary threshold can be hard to justify biologically. In this case, it can be useful to simply search for 100 most frequent patterns without much knowledge of their individual protein abundance. In many settings, the manual evaluation of the pattern results will limit itself to those pat-terns that have the highest support value, as these will be most frequent in the data set and often the most relevant or interest-ing. For such a purpose, top-K mining would be much more straightforward as the only parameter it requires is setting K to 100. Of the three implementations addressed, this is only possible with MIME. Top-K mining with the other solu-tions requires knowledge of Java or can be cumbersome.31,32 Figure 2 shows the output of the three itemset mining imple-mentations. For apriori and arules, we iteratively optimized a lower threshold in order to display these results. An over-view of the most frequent itemsets obtained with MIME is listed in Table 1.

The 100 most frequent itemsets feature combinations of 63 unique protein domains. IPR013783 (immunoglobulin-like fold), IPR001881 (epidermal growth factor [EGF]-(immunoglobulin-like calcium binding domain), IPR000152 (EGF-like aspartate/ asparagine hydroxylation site), IPR018097 (EGF-like cal-cium binding), IPR013032 (EGF-like conserved site), and IPR000742 (EGF-like domain) appear in over 11 unique pat-terns of these 100 most frequent itemsets.

The use of frequent itemset mining techniques, without accounting for the hierarchical structure of annotations, will typically lead to this kind of frequent but trivial patterns with no true informative value. For example, IPR007087 (zinc fin-ger, C2H2) is a child of the CH2-like zinc fingers (IPR015880). Patterns of this kind create additional overhead that needs to be filtered out to separate it from patterns with true informative value. This problem has already been elaborately described in literature, especially for Gene Ontology analysis.10,33_Therefore, we grouped domains into functional categories, based on the structure of the InterPro annotation tree, to remove the redun-dant annotations and found immunoglobulin-like domains, CH2 zinc fingers, protein kinase-like domains, and WD40 repeats to be the most prevalent in human proteins when look-ing at individual domains. However, the 100 most frequent itemsets then consisted of vastly different frequent patterns at much lower frequencies. For example, some of the most prevalent patterns were the co-occurrence of protein kinase-like domains (IPR011009) and the ATP- binding domain (3.14%). Co-occurrence within tyrosine kinases and SH2-domains was also prevalent. A full overview of the 100 most frequent itemsets is shown in Supplementary File 6. Overall, frequent itemset mining can be used to cluster proteins based on their domain annotations and explore the structure and relationships that may exist in this dataset. Figure 3 shows the mined patterns in a pie chart representation, from three broad

(5)

categories that exist in the dataset, namely, the kinase-related, helicase-related, and WD40-related patterns. Each pattern in Figure 3 is represented as a path from the center to the exterior rings. Neither tool required extensive programming skills to tackle this use case.

case study 2: mining for co-occurring domains between interacting proteins. A topic of great biological interest is the identification domains that are likely to play

key roles in protein–protein interactions, in, for example, drug discovery.34_{We mapped the InterPro domains}17 cor-responding to each UniProt identifier2_{in the human IntAct} protein–protein interaction network4_{into transactions. Each} transaction then represents a pair of proteins, and the items within a transaction correspond to the union of domains for both proteins in the pair. We then removed all excess protein domains using the grouping method described in case 1 to

Figure 2. input (left) and output (right) of frequent itemset mining in popular frequent itemset mining implementations. the upper part shows the look

and feel of the terminal-based apriori-Borgelt implementation. the output of this tool shows each pattern in turn on each line with the support value as a frequency between brackets. in the middle, a similar figure is shown for the arules package. the arules output is an r data object with the items that make up a pattern in one column and the support in the second column. the bottom of the figure features the input and output of mimE. note here that the red dots in the upper whitespace indicate the items, which are described by their name and their individual support (between brackets). these items can be dragged and dropped in the larger white area below to modify the mining output.

(6)

prevent noninformative patterns from hierarchical annotations from appearing. However some patterns may still result from intraprotein domain co-occurrences, as we did not consider the protein of origin for each of the domains. Therefore, the intraprotein patterns obtained in case 1 were subtracted from the combined list. For Borgelt’s Apriori implementation, we had to perform the mining process for both datasets using the lowest possible threshold to avoid the pattern explosion and then removed all the patterns appearing in the list intersec-tion with a Python script. In arules, the results of both mining processes could be compared using data frames within the R

framework,35_{with a basic understanding of R language} syn-tax. MIME required no such scripting effort and allowed to compare the result files from the two mining processes directly using the compare worksheets function.

Either tool was able to find large amounts of frequently recurring motifs within proteins that have been described in literature, with notable examples such as interactions between the 14-3-3 domain (IPR000308) and S/T protein kinase domain (IPR002290).36_{In the IntAct network, we identified a} total of 272 interactions that exhibited this particular pattern, with a total of 76 unique proteins being involved (42 kinases). Figure 4 shows the distribution of the functional annotations of the proteins corresponding to the domains in the interac-tion dataset. As can be seen in Figure 4, kinase interacinterac-tions seem to be the most common in the human interactive data-set, with all of the top functions referring to this regulatory mechanism. An overview of the 100 most frequent patterns is shown in Supplementary File 7. Frequent itemsets may also be used to detect complexes, as indicated by the domain associa-tions between Skp1-containing proteins and F-box proteins, which have been previously described.37,38

We grouped all the proteins corresponding to a rule and looked for potential biological enrichments using clas-sic overrepresentation based on hypergeometric testing with Benjamini–Hochberg correction (P-value , 0.05). The results, visualized in Cytoscape,39_{are shown in Figure 5. There is} a clear functional coherence in interacting proteins, which is naturally reflected in the combination of elements at the domain level. For example, many proteins involved in apop-tosis were characterized as containing the death-like domain (IPR011029) and binding sites for TNF (IPR006035) and formed a more densely interconnected protein network with their associated kinases.

Compared with the textual output of apriori and arules, navigating through the dataset using MIME is more flexible. For example, a domain of interest could be dragged into the pat-tern browser window and extended by adding other domains in a drag and drop fashion. MIME automatically recalculates all quality measures it contains each time a modification is made to the pattern list. This means that the lift values, area, coverage, support, and many more measures are consistently up to date. This enables a more dynamic approach to manual supervision of the pattern list, which can speed up the discovery of interest-ing properties in a dataset.

It should be mentioned that specialized frequent item-set mining approaches exist to rapidly find patterns that dis-criminate between sets or more formally defined, ie, patterns whose support value increase significantly from one dataset compared with the other. Discriminative or emergent patterns have been suggested to have a higher value for use in classifica-tion models and have been extensively reviewed.40_{In this case} study, such methods could be used to more rapidly identify the patterns that are more likely to be the result of interacting pro-teins or, by extension, distinguish between protein families.

table 1. the 30 most frequent interPro intraprotein patterns.

itEMSEt FREquEnCY SuPPoRt

iPr000719 iPr011009 3,94 814 iPr007087 iPr015880 2,76 569 iPr007110 iPr013783 2,72 562 iPr017986 iPr015943 2,71 559 iPr001680 iPr015943 2,55 526 iPr001680 iPr017986 2,49 514

iPr001680 iPr017986 iPr015943 2,47 510

iPr013087 iPr007087 2,36 487

iPr013087 iPr015880 2,30 474

iPr013087 iPr007087 iPr015880 2,28 471

iPr017441 iPr011009 2,21 457

iPr017441 iPr000719 iPr011009 2,19 452

iPr000504 iPr012677 2,18 450

iPr011989 iPr016024 1,78 368

iPr003599 iPr013783 1,76 363

iPr008271 iPr000719 1,73 357

iPr008271 iPr000719 iPr011009 1,73 356

iPr001849 iPr011993 1,72 355

iPr002110 iPr020683 1,62 334

iPr003599 iPr007110 iPr013783 1,50 310

iPr002048 iPr011992 1,39 286

iPr003961 iPr013783 1,33 275

iPr019775 iPr001680 1,24 255

iPr019775 iPr001680 iPr017986 1,23 254

iPr019775 iPr001680 iPr015943 1,23 254

iPr019775 iPr001680 iPr017986

iPr015943 1,23 253 iPr001841 iPr013083 1,22 251 iPr013032 iPr000742 1,18 243 iPr001806 iPr027417 1,12 232 iPr003598 iPr013783 1,11 229 iPr003598 iPr007110 1,10 227 iPr000008 iPr008973 1,05 217 iPr013098 iPr013783 1,04 214

(7)

case study 3: M. tuberculosis response to stress. In the third case study, we downloaded gene expression profiles of M. tuberculosis in several antibiotic studies from the microbial gene expression compendium Colombos.1_{We discretized the fold} changes of the expression into upregulated (fold change .1.2) and downregulated (fold change ,−1.2). Discretization is essential for traditional frequent itemset mining. In this case, the transactions consist of the different expression experimen-tal contrasts, and their items are the up- or downregulated

genes within these contrasts. In total, the dataset contains 47 contrasts, and the most frequent differentially expressed items were WhiB7|Down (21% or 47%), esxS|Up (16% or 34%), higB|Down (16% or 34%), PE20|Down (16% or 34%), and esxH|Up (15% or 32%). Supplementary File 8 shows the resulting mining output.

The most frequent itemsets have a support value of 14, and several underlying types of frequent itemsets are immedi-ately apparent, either in the graphical display of MIME or in

Figure 3. Visual representation of pattern clustering. Each pie chart starts from a single item in the center and expands outward with those items that

were found in associated patterns. the size of each piece in the pie chart indicates the support value of this pattern and its ancestor, giving a relative image of how frequent the items are compared with the others. For the singletons, this equals their individual frequency (upper circular plot, first layer). only patterns with a length of two items are shown for legibility reasons in the figure, but this can be extended to virtually any itemset size. detail plots of the size-2 itemsets are also shown (right, bottom left, bottom right) and indicate if a given item appears (dark blue node central in plot), how likely it will be associated with any of the other terms. the purple pie chart contains kinase-related patterns, green consists of gtPases and helicases, and red is associated with Wd40 repeats.

(8)

the textual output of arules and apriori. A first type of pattern refers to the nature of transcription in prokaryotes. For exam-ple, esxH|Up (16) and esxS|Up (15) were strongly correlated in an itemset with a frequency of 30% (14). In the two cases, both genes did not occur together, due to marginally falling outside the boundaries set by the discretization procedure. esxH and esxS are part of the same transcription unit, which codes a Type VII secreted compound (EsxH) that impairs trafficking in the host organism.41_{In order to fulfill its purpose, it acts in concert} with esxG, which is in the same transcription unit and thus also upregulated. esxG appeared nine times in our dataset, each time upregulated (7) or downregulated (2) when the other ele-ments of the operon were as well.

Another key pattern turned out to be PE20|Down and whiB7|Down, also with 14 as the support value. WhiB7 is a transcriptional activator important for antibiotic resistance,42 and PE20 is a largely uncharacterized protein with a length of 99 amino acids that has an effect on pathogenicity,43_but their support value, in comparison with the large amount of rules these two genes appear in, suggests a key role in M. tuber-culosis response to several antibiotics. This pattern could be further extended to include higB|Down, esxH|Up, esxR|Up, and kasB|Up with a support value of 10 (21%). This was espe-cially easy to do using MIME and the best extension of bas-ket functionality, which greatly facilitates pattern exploration with the otherwise often laborious task of browsing through the often vast sets of patterns originating from more extensive

datasets. We extended the itemset to the maximal length at a minimal support value of 7, resulting in a set of 11 genes enriched in the generation of precursor metabolites and energy metabolism (GO:0006091) that is strongly interconnected at transcriptional levels (Fig. 6). We then investigated which anti-biotic treatments contributed to it, as seven transactions sup-ported it (and thus maximally seven different drug treatments). We found these contrasts to be Streptomycin:5 (GSE1642), Amikacin:10 (GSE1642), Streptomycin:5 (GSE1642), Ami-kacin:5 (GSE1642), AmiAmi-kacin:5 (GSE1642), Tetracycline:10 (GSE1642), and Capreomycin:10 (GSE1642), respectively. Amikacin, capreomycin, and streptomycin are aminoglycoside antibiotics that tend to target the initiation of protein syn-thesis by causing mistranslations.44_{Interestingly, tetracycline,} which acts by disruption of normal mRNA recognition by the tRNA anti-codon, was associated with the very same itemset even though the antibiotic has a vastly different structure and mode of action. As such, frequent itemsets that combine essen-tial disturbances of genes may prove interesting to identify novel compounds that are vastly different, but exhibit a simi-lar downstream effect on gene expression. Exploratory analysis with user-friendly pattern mining software can be a great assis-tance for this task.

conclusions

Several implementations, which do not require exten-sive programming skills, exist to perform biological data

Figure 4. Prevalence of several gene ontology terms that could be associated with the top rules. the importance of regulatory mechanisms becomes

immediately clear when looking at the most frequent terms. the top term has a support value of 808 (go:0005524) refers to atP-binding functions, while the second directly names protein phosphorylation (go:0006468) as the underlying mechanism of this figure. the other terms, such as go:0004674 (protein serine/threonine kinase activity), go:0004672 (protein kinase activity), and go:0018108 (peptidyl-tyrosine activity) only further strengthen this observation. overall, we can conclude that in the human interactome dataset, interactions with kinases seem to be the most prevalent, with kinases co-occurring with an extremely diverse amount of substrate domains. However, the most specific substrate domains are only found at lower support values.

(9)

mining with frequent itemsets. In three typical bioinfor-matics experiments, we found each method to excel at dif-ferent aspects. Borgelt’s Apriori implementation was very fast using the command line, but lacked the flexibility that arules has in the R environment. However, arules requires the user to have an understanding of the R scripting envi-ronment, and this is characterized by a steeply increasing learning curve for more complex operations, such as com-paring results from two mining processes. MIME showed its value when subsequent mining steps were required, as the output of one mining step did not have to be processed to be usable in the next mining iteration. This is espe-cially useful in exploratory analysis of complex biological datasets. The presence of additional mining algorithms in a graphical environment in which patterns can be modified by simply dragging and dropping items further strengthens its ease of use.

Author contributions

Conceived and designed the experiments: SN, SM, BG, KL. Analyzed the data: SN, KE, WVB, KL, PM. Wrote the first draft of the manuscript: SN, PM. Contributed to the writ-ing of the manuscript: SN, SM, KE, WVB, BG, KL, PM. Agreed with manuscript results and conclusions: SN, SM, KE, WVB, BG, KL, PM. Jointly developed the structure and arguments for the paper: SN, KL, PM. Made critical revi-sions and approved the final version: SN, SM, KE, WVB, BG, KL, PM. All the authors reviewed and approved the final manuscript.

supplementary Materials

supplementary File 1. Intra-protein dataset. This file contains all transactions containing protein domain presence that was extracted from the InterPro database. Each protein (shown before semicolon) and its transaction (after colon) are 0015

0015 IPR00PP 0001500000000001500000IPR00044001501500044999999990000000000000004404499999 IPR00062000000000000000000626666IPR00906009090909099060 IPR01995191919IPIPIPIPIPIPIPR0IPR0PPPPPPPPPPPPPR0PR0P999 000000006000000

IPR00021PR00021PR000000000 IP IPPRPPPRPPPRPP 00201 IP I 00000000000000000000000000000 999 IPR0042404040442444 1 IPR0022121 IPR0 2000 IPR00220022 IP IP I IPP I IPP IPR0R00R0R000200000000000200200220000000200200000000000020000000000000000200002000000020200020000222221111111111 827 827 8822 8 8 8222727777 822222222 82727227222 8 8 8888 8 8 IPR00PR0 IP IPR0R0R000000 IPRPR0PPPPPPPPPPPPPRRR0R0R00RRRRRRRRR0RRR0RRR000000000000000 IP IPPPPPPPPPPPPPPPPRPPRPPRPRRRRRRR I I IPPPPPPPPPPRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR000RR00R0R0R0RRRRRRR00000000000000000000000000000000000 IP IPPPPPPPPPPP I II IPPPPP I IPRPPPRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR I I IPPPPPPPPPPPP I IPPP IPP I IRR0RRR0R0RR0RRRRR0RRRR0RRRR0R000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000080000800800080000000000008000000088888 1111 IPR00140R000100000001110010001011010101401141111111111114011114011111111114404444440400000000444444444444444444444444444444444444444444444 PR00352 P P22222212222222222222222212222221222222222222222212222222222222222RRR212122R222121222000000 3003500000000 2000000000300335333335353533333353333333333333533333333 25225555555555555555555555552222222222 0222 02 022222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222122222222222212222222222111111111111111111111111 0 0 02 0222222222222222222 02 02 0 02222 02 02 0 02222222222222222222222222 02 02 02 02 0 0 0 0 02 02 02 02 0 0 02 02 00 0 0 0 0 0 0 0 0 0 0 02 02222 0 0 02 02 0 022 0 02 0 0 0 0 0 0 0 0 02 0222 0222222222222222 02 02 02222222222222222222222222 02222 0 022 0 0 0 0 0 0 00 0 0 0 0 0 0 0

IPR00903PRRR00009009990090000900000000000000000009000999999999099999999909099999990999999999IPIPIPIPPRPRPRPR00 _IPR0_IPR0_IPR02_PR0_R0_R0_R02₀₂₀₂₀₀₂₀₂₀₂₂₂₇₇₂₂₂₂₇₂₂₂₂₂₂₂₂₂₇₂₂₂₂₂₂₂₂₂₂₇₇₂₇₄₁₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₂₇₂₂₂₂₂₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₄₁₇₇₄₁₇₄₁₇₄₁₇₄₇₇₄₇₇₄₁₇₄₁₇₄₇₄₇₇₇₇₇₇₇₇₇₇₄₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₇₄₇₄₄₁₇₇₇₇₇ IPR019951111111IIPIIPIPIIIIIIIPIPIPR0IPR0IIPPPPRPPPR5000044000000

IPR02057PR 05RR0R02R02002222222220220520050000000000500570055 IPPR02R0R I

IPIIIPIPR00096PP20202202220020020202202022220R0R0 0RR0R0RRRR0RRR0000000 60000000IPR00PPPPPPPPPPPPPPPPPPPPPP 0PPR0000000000000000000000000000000000R00RRRRRR0RRRR0R0R0R0R0RR0R0R0R0RRR0RR00RR0RRRRRRRRRRRRRRRRRRRRRRRRRRR00090000000000009099055555555555555500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000009999999996000000000000000002020000020020000000000000000000000000000000000000000060000066666666666660066660002020200000000600000000000000000020000000000200000002222222222221122292222222299299900 IPR00171000000000000010000000100 5 IPR01234IPIPIPPR012PR0R0112341112121112 00 IPR02055 IPR000 I R02PR02PR0RR0R0R0R0RRR020022000005555 IPRR0RRRRR00000IPIIIPIIPIPIPIIIIIPIPIPIIPR00IIPR00PPPPPPPPPPPRPR2PPPPPPPRPPPPPPPPPPPPPPRPPPRPP222222222PRP2PRPR20PPP2202PP220PPPPP2PP20RRRRRRRRRRR0RRRRRRRRRRR0RRRR0R0R0RRRRRRRRRR0RRRRRRRR0IPIIII00IPIPIPIPIP00IP000IPIPIIP00IIIIPIPIPIPIIIIIIIPIPIPIPIPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP0000000000000000000000000000000000000000000RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR000000000 IPR020682 IP 222PPPPPR0202068 I I 0606 5 5 5 5 5 500000000000PRP0000000000000020022999999999202222220202020222220000 333 IPR00133PR0PPR000000100001001000001000000013333311 184 18 184 1 1 1 IPR00PR00PR000 I RR I IPR0RRR0R0RRR0RRR000000000 IP IPPRPPPPPPPPR IP IPPR0RRR00R0R0RR0IIPIPIPR00000IPIPPRPPPPPPRPRPPRPRPR00000000000000000P000000000P0PPPPPRPR00001101000000010000000101010000010000001000000010000000010000000000100000100000000010101000100000RRRRRRRRR0RRRRRRRRRRRR0000R0R00RRR11818118181111841111111111111818118411110000000848848884888888888488848884848888848484848840000000000000000000000000000044099999999999000000000088888 IPR02 IPR IPPR0P 02PPRPPRPRPRPRPR0R0R0RR000020002 IPRPPPPPR02RRRR0R0RR02022 IPR0PPPPRR02045040004045000040004040444454444445555555555 IPR00PPRPPPPPR002220222020RR20222022020R202200200000000000000000000000000000000000000000000000000020002020202020000002000200202IIIPIIPIPPPPPPPPPPPP22222222222222 99999999IPIPR00147PPPRPPPPR00PR00RR00014714714147 00 00000 0 00000 9 99 9 9 9 9 99 9 9 9 888888₀₁₀₀₀₀₀₁₀₀₀₁ ₈₈₈ IPR02907 IPR02PR02

IPR02PR02R02IPR00202907IPR01594PR012929229222229229222229292229R0R0R01990799999999999999999079999090790779079079907907907799090900000015901000707707151591559415515151511515159151151515115115151577777755 4455555559455555945959455555944555555IIPIPIPIPIP111111PPPPPPPPPPP999994999999999444444444000000000000000IPR01789IPIPIP 0204522020222022022022222202022222222220452220422022222022202PPPP0000000000400404000400000000404004000000040000450004500R0R0RR0R0R00R45010000000000000005555555555517117777777778778787877777777777777777778989899999898899999 ₃₃₃₃₃₃₃₃₃₃00000000 IPR00311PRPR000000031030000300003030300003033000000303333303000303030330303113333333113333333333333333331133333113333331333111111111666666666666666666666666

IPR00621066666666666066660606066666666211666666666666666666666666621626216216266266216621621662166216621111111111111111 IPIIPIIIIIPIPIIPR0IPR0IPIIIIIIPIIIIIIIIIIIIIIII RIIIIPIIPIIPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPR0PPPPPPPPRPPPPPPPPPPPPPPPPPPPPPRPPPPPPRR0RRRRR0RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR0RRRRRR0RRRRRRRRRRRRR0RRRRRRRRRRRRR0RRRRRRR0RRRRRR0RRRR0RRRRRRRRRRRRRRR0RR0RRRRRRRRRRR0RR0RR0RRRRR0RRRRRR000000000000000000000000000000000000000000000000001101101010100011000111000000000001011000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111 01111111111111100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000999999999999999 0988 098 0988 0999 099988888 09 0999 0999988888888 0 0 0 0999 09 09 0 098 099998 0 0 099999998 0 0 0999 09 0 0999999999 0 0 0 0 0 0 0 098999888888888888888888888 0 0 000 0 0 0 0 0 0 0 0 0 0 0 00 00 0 00 0 0 0 IPR00 IPR I I I 0 I I IPRPRRR I I I I I I I I I RRRRRRRR00 IPPRRR I I IPRPRRRRRRRR0RRRRR IP I II IPRRR I I RR0RRRRR IPRR IPPPRPRPRR0RRRRRRRRR0RRRRRR0RRR0 I I IP I I I I II I I IPRPPRPRRR0RR0RRRRRR0RRRRRRRRR00 I I I I I I IPR I I RR I I IPRRRR I RPRPRRRRRRRRRRRRR0R0RRRRRRRRRRRRRRRRRRR0RRRR00000000000 IPRR IPRPPRRRRRRRRRR I I I I II RR00RRRRR0RRR00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000 IPR016240162416666666666666616161666616666666626626666624662662626246246246242242424224444444455555555555555555555555555 IPR00621RR0006620006622121212 IPR00621R06622 IPR00R0000606060600006000660600606006666666666066660606000600600666666666666662166262662162262666626666211666666661222222222222222222222222222222222 124 12442442422422424244444444 1 1 124242224444 1224224444444 1 122222222222222222222242444244242224242244444 1 1 1222222222224222444 1 111 1 1 1 1 1 1 111 1 1111 1 IPR0000 IPPR0 IP IP IPR0PPP IPR0R IPR0PR0 I I I IPR0 IPPPPRP 0 I IPR0000 IPR IPR0PR0PRPRPR0PPPPPPRPRR0RRRRRRRRRRRRRRRRR0R0RRRRR0RR0RRRR0R0RRR0R0RR0R0RR000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 IPP 0P 000000000000000000010010010000000000000000000000000000000000000000000000000000000000000000000000000000000100001000000001100000000000000000000100000000100001010111111111111111111111111111111111111111 IPR000490 9 IPR00049R0000049499999 IP I R IPR0004R000 IP I R IP IPR00PR00PRPRPR0R00R00R00R0R0RRR0000000000000000000000000000000000000000000000000000000000000049000440400000000000000040000000049004040404904904900000000000400444949449449494444444994949999999999999999999999999999999999999999999999999999999999999999999999999999944444444444444444444444444444444444444444444444444444444444444444444444444 IPR00071R0 07 I R0R0R0RRR0000 07 IP IPRPRR0R00000000000 070710070707007070710707070771777777717777771777177171777777777777777177777711111 IP IPR0PRRRRRRR0R0RR00000 IP IPPR0PPR0RRR0R0R0RR0 IPR IPPRR IP IPPPR0RRRRRRRRRRRRRR0RRR0RR IPR I22PPRPPPPPRPRPPRR0R0RRRRRRRRRRRRRR0RRRRRR00000000000000000000000 000000070700070707070707070700070707771777771777777777177777111111111111111111199999999999999999999999999999999999999999999999 IPR00074R000000070000000000000000000000000074000742222 IPR00048PR0000000000000 8 IPR001310100131311 55 IPR01303P 0131 IP 303 IPR0PPR01001013111113131111033222222 IPR01809R01880111188818118111818188880809888 77 IPR0001500000000000000000000000005222 IPR0116001111111111111111111000 IPR02903PR02922 IPRR02202222229292229222922229999999300 IPR00188001000101181881111111 11 IPR00019 IP 0001 IPR00RR00000000000 IPR0R0R0R00 I IPP 00000000000000000000 88 145 1 14 145 14 1 145 14 1445 1 144445 1 1 14 144 14 1 1454 1 14544 145444 1 145 1 1 11 1 1 1 1 1 111 1 1 1 1 1 11 1 IPR00PPPRPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPRPPPRPPPPPPPPPPPPPPPPRPPPPPPPPPPPPPPPPPR0RR0RR0RRRRRR0R0RRRRR0RRRRRRRRRRRRRRR0R0RR0RRR0RRR0RR0000000000000000000000 IPPRPPPPPPPPPPPPPPPRPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPR0R0R0RRRRRRRRRR0R000000000000000000000000

IP 00000000000000000000000000000000000000000001000000100101010001000111111111111111111IIPIPR01798PPR0PRPR01R0RR0RR0R0RR0RR02R2R222220000100170010171711117171717171117717777797987779877 666_{IP 0}_IPR01_IPR01331_P ₀₁₃₁₁₃₁₃₁ ₅ IPR00893 I I RR00R0000000008080808080000808008080083666 IPR00165R00001650000111001101010100001001010001651661111165111666500 IPR00168 I R 0PR00RR0000 IPR00 I R00R000000100000000000101000000000 0 IPR01151PRPR01RRRR0R0RR01111111 IPPPP 001011111 IPP IPPR0PR0PPRPPPPPRPRR0RRRRRRRR000000011151

IPR0PRPPRRRR0R0R1111111111111111111111111111IIIIIIPRIPIP1PPRPPPPRPRPPPRPRPRPRPPPPRPR5150000000000001111 IPR00033000000000000000000 0 IPR00172 IP 07 IPR IPR0PR00 I IPRRR00R0R000 IPRP IPRPRR 1 IP IPR0R000000000000010000000100 0 IPR011021 I 0101010111111111111111111111111111111111111102111110 999999 IPR00029PPR00000000000000000020000000000202002999999999 IPR01974R0R0RRR0R001019000001019019000000001011191191119191111919199974999 IPR00602006006060 0 IPR01435011001111411111141411111144 2 IPR01980PRPR0R0198R0101901010111991111919191911111999111191191199999888055 IPR01974001 IP IP 19199999191919119999999999997948888 IPR002112 IPR000002000020200022 0 IPR00359PPPPPPPPR003R00033333330303030033333333333333333333359333333333535933333333593593359335333333335959I99IPIPI99IPIPIPIPPRPPPRPPPPRPPPRPPPPPRP4_I_IP4444444_IP_IP_I_IP_I_I_P_P_P_PR IPR0018000118001001001001 66 IPR00158000010000 99 IPR00395 IP 003000303030300303030303039599 IPR00522PR00522050050505005 55 IPR0012101010000121 4 IPR0018400010100000100010000100000000100001100000000100000001100000010011111111181 IPR0110201010111111111111111111111 2 IPR000690000000669000000600000000000000000000000006006988 IPR01102PRR0101011000100111111111111111111111111 11 IPR01475114141114444444444444444744 3 IPR01475PPPPR01PRRRRR 1RR0110101141111141141411111414472 IPR017861717177777777777 677777786787867644444444

IPR0021102022020202022020211022221121177777 _IPR01_IPR01161_IPR01_R01₀₁₁₁_1111111₁₁₁₁₁₁₁₁₁₁₁₁ ₅ IPR010991010100099000990000000000999111 IPR01387PR01P 010011111313131131131138872 IPR01234R012340111 021111111212111121212111 6 IPR008960000000000080808000000000808080000000008 7 IPR01153R00101111111111111111115359 IPR00045000000000000000400040455111 IPR00290PR002PR00002200000 9000229290 IPRR0000020020220202022000200000202020202000200000202000222209999 475 4 4 4 4 4 4 4 475 4 4 47 4 4 4 4 4 4 47 475 IPR01R01000141414111111411114444444444444444444444444444444444444444444 6 3 333 3 33 3 3 IPR01310111311131311 6 IPR00216P 0000000002020200000202020002026655 IPR00162 I R000 IP IPR 0R00R0000001000000000010010000100010111677 3595595955599 3595959999999999 3555559 3 3 3 33 3 3 33 35555559555559 3999999 3 3 0 03 0333333595555595955595599999999999999 0000 0 0 0 IPR00 IPRP IPPRPPRPPPR0R II IPPPP I IPPRR000 I I I IPP 00 IPR0PPPRRR0 I IPPR0 IPR0PRPRRRRR0 II R00PPRRRR0R0R0000 IPPPPPPPPPPP I IPPR0PRPPRPRRRRRRRRRRRRRRRRR0000003000000000000000000000000300303300000030033300030000000000003000000033030300030303030003000000030000030000000003030033333333333333 88888888888888888888888 131533315 1 51155 13 131313333333333333333 5331511151111151555555555555555555555555555555555 R010011 R0 R0 R R IPR IP IPPRR I IIP IIPPR0PPRRRRR I I I IPPRPRRR I R0RRRRR01003355555 I IIPP 1113333333331311111111131311111111111131333333333333333 06363 00333 06663 06333 06 063 06636 06 06636636333 06 066666666663633 0 00 0 00 066666636666666666333333 0 0 06 0666 0 0 0 0 0 0 0 0 06 0 066 0 000 00 033333 06 0 0 0 0 0 06 0 0 0 00 06366633 06 06 0 06666 0 0 0 0 0 0 0 00 0 0 00 0 0 0 0 00 0 0 00 00 0 0 0 00 0666663636663636636366633333333 0 0 0 0333 0 0 0 00000 0 0 0 000 00 00 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 00 0 0 IPR02R00R0002 IPPR0R0R000 IPP 020 IPPPRPRPRPRPPPPR0RRRRRRRRRRRRRRRRRR0R0R0R000002000000000 IP IP IIP IPPRR0RRR0RRR000000002002000000000002 IPRPP 0PPRPPR0PR0R0R00000000020020200200020000002020000002000020020202000002022000200000220000220000000200020200000000000000000200200000000000000000000000000000000000000000000000000000000000000000IPR00826IPIIIIPIPR0IPIIIIIIIIIII R0IIIIIIIPIIIIPPPPPPR0PPPPPPRPPPPP 0PPPPPPPPRPPRPPPPPPPPPPRPPPPRPPPPPPRPPPPPPPRPPRPPRPPPPPPPPPRPPPR202222222222022222202220202222222222202020222222202022222222222222200000222222222200222222222222222222202222222222222222200RRRRR0R0RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR0RRRRRR0R0R0RRRRRRR0RRRRR0RRRRR0R0R0RRRRRRRRRR00RRRRRRRRRRRR0R0R0R0RRRRRR0RRRRRRRRRRRRRRRRRRRR0RRRRRR00RRRRRRRRR0RRR0R0R0 82RR0RRRRRRRRRRRRRR00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008260000000000000000000000000888888882888888888888888888888888888882828828882888888888888888222222222222666666

624224 6 644 624 6624 62 62 62 622 6 6 6 6 66 6622 6 6 62 62 62 62 6 PR010 PR00 P PRR P P P P P P PRRRRRRRR01R00000000000111111 I I IP IP IPR0PPRR0R0RR000000062626626266666666624222224 IP IPP 0111621111111111666246666666666666666666242224 IP IPPP 0011162462666666624 IP IPPPPP IP I I I IPP IPR0PPPPR0RRR0R0001166116161161116116116611111111111111111161111116111166161611111111666661166666611111111111111111661161111111111111111611666666161161111161111611111111111111111111111111111666666666666662466662466666666666246626666666666666666666666666666666666662242443333333333333333333333333333333333333 0002400 0 02400 0 000000240000000020020024242222222222244444444 00020 40 400200222242224242222242242444444444444444444 PR00R0R0R0R0RRR PRR000 IPRPPPPPP IPPPPPPPR00PPRR0R0R000RRRRRRRRRRR000000000000000000000000 4 IP 0000000004 IPR00PPPPRPPPPRPPPRP 00R0RR00RRRRRRR0R00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 2222222222222222222222 3787 377 3 3 3 3 3787 3 3787 377 3 3 33 3 3 3 3 33 R0 R R0 R R R IPR0RRRR000 I R0 IPRR IP IPR0RRRRRR0RR0RRRRRR0R00RRR000100100013000100000000000113133333333 33 620 62 R016166666666 IPR01620R01620011116111111616116116116111666611611616111666616116166666666666666666666666666666666666666666666666666666666662066266626266666206662066666662222000001111111111111 1 1 111 1 1 1 1 7111 71 7 1 7 7 7 7 7 7 7 711 71111 71 71 7 71 7 7 7 7 7 71111 711 71 71 7111 7 7111 71 71 7 71 00 0 00 000 0 PR P IPRRR0RR0RRRR00000 IPRPRR00RRR0000000000 IPPRR IPRP I I I IPRPPR0RRR IPP IPPR0PPR0PRPPPRR0RRR0R0R0R0RR0R0RRRRRR0R0R0R0RR000700000000000000070000000000000000000000070700000000700000000000000000070700000077707000000000000000070000707007000000700777070700000700000007000000700000000777777777777777777777777 000000000000000000000000 9 9 9 9 9 9 99 9 9 30909999999 3 3 3 3 3 3 3 3 01 0 IPR0 IPR0 IPRPR IP IP I IPR IPRP IPR013R013133030303309303033333330009900999999999 IPR IPRP IP IP IPR0 30900000030930933309333093333333333333330933333309909009090090909909999999999999999999999999999999999999999999999999999999 I 1313131313131131313131313133333133331333333333333333333333333333333 8888888888888888888888888888888888888888888888 902022 90 902020202 90 90 9 900 9 9 290999909999000 IPR02902PR02020292929222929299292222292929299999292222299999292929299299999999999999999999999999999999999990990909090090999090999909999999990999900200000000221111111 3 33 3 6131 613 61 61 61 61 6 613 613 6 66 IPR016131616166661366666666666666666666666666666666666616613666666 3666333333000000000000000 396 396 396 39 3 3 3 3 3 IPR00396030300030300333333333333333333333333333333333333333333333339633333333396396333393333396339333333 11111

IPR00038000000000000000000000000000000000000000030000003880380380003803803038388888888777777 IIPIPRIPIPI RIPR0II RIIIPIIPRIPIPIIPR01501IPPPPR0PRR0R0RRRRRR0RRRRRRRRRRRRRRRRR01R0RRRRRR0RRR0RRR01RRRRRR0RRRR0000000000000000001500000000000100000100100111151151115111511115111111151111111111111151515151511115050155501505555555555555555555555555555555001115555 IPR01893P 188188888888181811888888888888888888898 6 594 5594 5 5 5 5949 5 5 IPR01011 IPP 0R00 IPPR0R I I 1 I I I I RRR I I I I RRR0R0RRRRRRR0R0000001500000001111115115155151151511515115115551111111155151111115555555945 45595555945 4559555 3333 822 018218222 0 0 0 000 00 0 00 00 00 PR00RR0R000 PRR PRRR0R0RRR00000 IP IPPP I I IP I I I II IPR0PPRRRR0000 IPP IPR0PPPPRPPRRRR0000000000000000000000000 IP I IPR0PRPPR0R0RR000 IP IPPPR00R0000000 IPR0PRPRR00R00 IPPP IP IP IPR0PRR00RR00 IP IP I IPR0PRPPRR0R0R00000000000000000000000000000000000000100100000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000085285 44 PR028R0R02000002202000000002002020222222822 PR0R0R0R0000000000000000000000 PR P IPR IP IPPRPR0PRP IPPP IPR I IP IPPR0R0R0R0R000000028228000000002222828888228228222828282888888835956 IPR0RR IPR00RR0000 IPR0RRR0RR0RR0R000 IPR0R0R00030000000000000000000000000000000003000000000000000003000000000 9 40000 IPR01PRPR01R001RR 14R0111114114111411111111111111111111 9 R000400 IPR0PR 0R000R00R000000000000000000000000000 33 15 1 311 3 IPR003R 033333333333313151111111555555 82 82 8444 52 5266IPIIPR0IPR0₀₃₃₃₃₀₃₀₃₀₃₀₃RRRRRR0R0RRR0R0RRRRRRRRR0R0R0R0RRRR₃₃₃₃0000010001401400000000100014111114111111411111111141411114440444444444444444444444444440₁₁₁₁ 3151 031 R003 R 11151 IPR00315PR0000000300 10003003103333333151511111515555 4 4 ₀₀₀₀₀₀₃₀₃₀₃₀₀₀₀₃₀₀₃₀₃₀₃₃₃₃₀₃₀₀₀₀₀₀₀₀₀₃₀₃₃₃₃₃₃₃ ₂₂₂₂ 730 7 0 IPR01 I R001 I I 1 IP 1 I IPPPPP 017017 8 8 88 8 8 8 111717117111171IIIIIPIPIIIIPIIIPIPIPPR0PPPR0P7RR000000000400000440 IPR00667060606666666 11 IPR00361PR 0000300030300 6 IPR004360000400040400 7 IPR02905229229292999 2 IPR0182011811818188 0 IPR013760111311131311 3 IPR001390101000000 4

IPR00772000700070700 228IPR006188 0606060600606 6 IPR02888R 222282228228 9 IPR00484P 0000400040400 3 IPR00246 IPPR000020202022 0 0 02 0 02 022 0 00 02 02 002 0022 000 4 4011 4 IPR0140 IPR011114414141414144404444 44444444 IPR00396396 II R000396030030303003 00 IPR01400PR0 IPR0PR0PPPPPPRR010144001001141 011111144414141111411444404 00111 01154 0 R00 R R IPR IPR IPRPPPRR0RRRR0R010100001010000111111111111111111111111 5 R01170 R0 R0 R R R PR P IP IP 1PRPRR01 IPPRRR 111111111111111111111111 9 IPR00750R0R00R00070000707000000707070000007 2 IPR01199 IPR0111111111111111111191991111 1 IPR00359R00 IPR00PR00000300000303000 3 IPR0035700 IPR00PR000300000303000 8 IPR01793PR0R0117171717177 0 IPR00137PR0001300011301001011000100101113733 IPR01615R011011 IPR I 161111116161111 8 IPR00100R0000001000001000 5 IPR00135R000 I 01000000 6 IPR01955 IP 0PPR0P 19PP 00111911119111111111111111111111 9 IPR01615R01011161111616111 7 IPR01797011711717177 0 IPR016151161666161616661566 9 IPR0178811711717178877 4 IPR009050000909999009009099990909909 7 IPR013081131131313033 7 IPR00161P 0160101100101600 IPR00166R00000100000000 0 IPR021122212222111 9 IPR00001R0000000000000000 4 IPR01306113113131 9 IPR023412322322232322 0 IPR01159115111111111 8 IPR0137601

IP 1R01131111313111 7 IPR0070800000700070700IPR01133701111111111 3IPR000300000000000000 8 IPR0137613713131313133 1IPR01588151515555 0 IPR01151 I 01111111111 00 IPR0002100000000000000 _{0 IPR02340}₂₂₃₂₃₂₂₂₃₂₃₂₂ ₉ 308 3008 3080 30 3 30 30 33080 3330 30 30 R01 R IPR011 IPR0R0RRR0R0R0R 10011331311111333111311131331131133313113111133 333 196 19 IPR001PRR000 IPR00100 IR0 96R00R0000000000010000100100195555 61313 01613 01 R0 R PRR IPPP 11 IPR016RRR IPPP 0 613111 I RIPIIIIIIPIPPPPPPPR00PR0PPPPR0PR0PPP016RR00R01661611166161111161166600000000000000600000000060000000000000000000000000000006000005 8888 IPR02331 IPR02R02R0222223222222232322222 3 IPR02377PP 0223770237222222232323222232232222IPR0177777R01R01R016199999010101011111111116161111111161 7 000950 R000 PR PR IPR000RR00000000000000000000000000000000000000000000000000053 R023782 IPR02378 IPR23 IP I 2223333323232333232333333333333800 IPR01745P 01011171111717111 5555 IPR017980111171111717111 4 IPR0131013131333133IPR013025R01301113131113131311 6 IPR013320 IPR01199PPR01010111111111111 0 IPR0197319973191999191919999 444IPIPIPIPPPPP IPR0011001001111110110111111 33 IPR001300 I 00 IP IPR 1 IPR00PR00 IPR00P 00010000000001000000 9 6122 IPR01612161616661666666666666129999999 IPR01591011511151111515111555555 7 IPR002130202020202021222222222212122222 888 _IPR00144₀₀₀₁₀₁₀₁₀₀₀₁₀₀ ₀ IPR00172 IPR00R00000000010000000001000000 3 IPR003870030030300303030 7 IPR00898R008 88988 1332 133233232 13 133 13 13 13 1333 0000 0 0 ₀₈₀₀₈₀₀₀₀₀₀ ₅ R0018700 R000 P P 8 8 8 8 8 8 8 8555501000001000 0 IPR013080130013080113131133113313133333333333083388888 IPR018350181181181818118 5 IPR00894R0080080808008080808008888 6 IPR00162 IPR0010011 IP 00PP 00000000000000100000001010 8 _IPR00208₀₀₂₀₀₂₀₂₀ ₃ IPR01101P 001010111111111111111111111011111 R0003100031 R IPR0003R 0RR00R0000000000000000000000303103003003115555555555555555555555 IPR0013701001 00 355 001351 PR001R 100 IPR001P 000000000000000100000000111111135557777 IPR01790111717171177777777700777 IPR00897IPR01R08008000080800197880119919197799999979978944777777 8955 IPR01895188181888958888 77 IPR019781191191919999 66 IPR00030000000000000000 6 IPR001160001000000 3 IPR012670112112121 7 PR01092R110110101 0 IPR00308000300030300 IIPP

IPR000280000_PR02380_P00280000000000000000000000002800000000_R000002028₂₂₃₂₂₂₃₂₃888888 _I_I_P_P IPR00050₀₀₀₀₀₀₀₉₂₀₉₂₀₀₀₀₀₀000IIIIIPR0IIIIIIPR0PR0PR00000

IPR00300I030303 IPPP ₂₃88888888868888888868PPR00664666₃₃6₃66₃₃666666₃₃₃₈₀₃₈₀_IP_II_I_I_IP_PR_PR01_P_P_P_P_P_P_PR01_PR_P_PR_P_P_P_P060606_R_R_R_R_R_R_R_R01_R0_R_R_R_R_R₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁₁II₀₀₀₀₀₀P₀₀₀₀₀₀₀P₀ IPR02659222622262622 IIPP IPR02903R 2PR0PR0PP II0000 00300 003000000303030030303030333300303300₂₂₉0000₂₉₂₉₂IIPIPIIIIPIPIPIPPPPPPPPPPPPPPPPPP 00000000_P_P000000000000 IPR01301IPR0011311313131313011301113131330130133333330133330010100199999999999999999999999999999999999991 2 IPR008980000008008000088008080008889IPR01379IPRR01R01014113111113131111 0 IPR000250000000000 3 IPR00361 I R00P I I RR00R00R 0300000000030300000 9 IPR01785011 511771717177117171 57878778778555555 IPR02842020202228 IPRPR I 28222222822822222 6 IPR00204 IPR00PPPRPRRRR00RRRRR002RRRRRR02020020202020208 IPR00009 I IPR00000000000000000000000 1 IPR02461PRPR022424222242424222444444 3 IPR01824R01011811181811 7 IPR00022PPR00000000000000000 5 IPR01602116116161 1 IPR01328P 111311131311333 4 IPR019771111191191911 55II 2047 222 2 IPR02P IPR0204702 IP IIPP IPP 002022044747477 1 1 1 1 1 1 315 31511 3 31111 31 3 3 311111122222222222222222222222022222022222220 2222 IPR01199R0119 IPR01PPRPR01RRRRR0R01RRRRRRRRRR01R01 9 IPRPRRIPIPIIPR01198PRP0R11111111111119919111990000101190101010101111111111111111111111111111111111111111IPIIIPIPIIPIIIIIPIIIIIIP111991119PPPRPRPRPRPRPRPPPPPPR2299999999998RRR99 602 6 602 6 6 6 6 60 60 6 6 60 6 6 6 6 6 6 60 6 66 6 602 66 6 6 6 6 6 6 6 6 6 6 6 6 6 62 6 66 66 6 PR01R00001 IPR0R IP IPR0R11 IP 01001000161161616111611111661616161116111161611161116161616111616166666666666666666666666666666666666666666666 44444444444444444444444444444

Figure 5. frequent itemsets were mapped to a cytoscape network that displays which domains are often found to be associated in interacting proteins.

the width of the edges indicates the betweenness centrality, while the node size is representative of the degree of the node. central in this network seem to be iPr001452 (sh3 domain), iPr013783 (immunoglobulin-like fold), and iPr027417 (P-loop ntPase), which is the most prevalent domain in nucleotide-binding proteins. however, the items iPr011009 and iPr000719 were present in the highest number of distinct itemsets. these terms refer to the protein kinase-like domain and the protein kinase domain, respectively, and their importance supports the observations made in figure 4 that indicate kinase interactions form a vast part of the human interactome. it is interesting to note that of the 1753 proteins featuring a Wd40/yVtn containing domain (iPr015943) in human beings, only associations with the kinase domain (iPr000719) and the kinase-like domain (iPr011009) were retained. the concanavalin a-like lectin/glucanases superfamily (iPr008985) is also shown to have a relatively high centrality in this network.

(10)

listed to make retracing patterns easier. To use in frequent itemset mining, strip everything before the transactions.

supplementary File 2. Protein domains in binary inter-action dataset. This file contains all InterPro domains that were mapped to interactions that were extracted from IntAct. The UniProt accessions of both interaction partners in the binary interaction are shown before the colon. Strip this part to only retain the interactions. Making the contrast between the mining results of Supplementary File 1 and those in this file will remove the protein-specific patterns, as well as several patterns that result from the ontology itself.

supplementary File 3. M. tuberculosis dataset. This file contains all transactions extracted from the Colombos dataset. Each row represents the response to an antibi-otic, in which only the genes that were significantly up- or downregulated according to the logFC discretization step were retained. Each item combines the gene name with its individual response.

supplementary File 4. Tutorial. This brief tutorial describes the practical use of each of the three frequent itemset implementations on the case studies featured in this article.

supplementary File 5. Python script. This file contains a simple Python script that compares the output of the Apriori Borgelt implementation to another file with similar structure. This can be used to find overlaps and differences between the mining output of Case 1 and Case 2.

supplementary File 6. 100 most frequent InterPro intra-protein patterns. The 100 most frequent intra-intra-protein itemsets that contain domain associations. Most itemsets consist of

combinations of kinases, patterns resulting from the ontology tree, immunoglobulins, WD40 repeats and others. This file extends Table 1.

supplementary File 7. 100 most frequent InterPro inter-protein domain patterns. This table shows the 100 most fre-quent domain associations in protein-protein interaction data. We obtained this output by subtracting the intra-protein pat-terns. We find that the top 100 length 2 rules mostly consist of combinations of only 45 distinct domains. Most of these terms refer to kinase cascades (associations between serine/threonine and tyrosine kinases), interactions between SH2 and SH3 domains and kinases, as well as a large amount of domains that is involved in ubiquitinylation and immunological response. All patterns that could be produced by the InterPro ontology were filtered out, which leaves the interaction between the SH2 and the SH3 domain as the most abundant.

supplementary File 8. 100 most frequent itemsets in the M. tuberculosis use case. In this table the 100 itemsets are shown that represent the gene regulation response to several antibiotics included in Colombos. Every item is a combination of a gene name and its regulatory state, which was the result of a discretization step (logFC .1.2: upregulated, ,−1.2: down-regulated. Several genes, such as whiB7, higB, and PE20, are strongly co-regulated. We were also able to identify transcrip-tional units, such as the esx-unit.

reFerences

1. Meysman P, Sonego P, Bianco L, et al. COLOMBOS v2.0: an ever expanding col-lection of bacterial expression compendia. Nucleic Acids Res. 2013;42(1):D649–53.

Figure 6. transcriptional networks created from the colombos web portal. here, we mapped the contents of a maximal length items with minimal support

of 7 to the current state-of-the-art knowledge. We found that our set of 11 genes traced back to an interconnected network of 15 genes that were subject to shared transcriptional regulation, related to generation of precursor metabolites and energy metabolism (go:0006091).