
University of Pisa

and

Scuola Superiore Sant’Anna

Master Degree in Computer Science and Networking

Master Thesis

User-Centric Focused Web Crawling

Candidate

Luigi Caiazza

Supervisors

Nicola Tonellotto

Matteo Catena

Academic Year 2016/2017


Abstract

Search engines are the main hub of information on the Web. They crawl and index Web contents to allow their users to satisfy their information needs. At the same time, many other users create new Web contents every day or modify existing ones. This continuous growth of the Web poses a challenge to search engines. In fact, due to the evolution of the Web and to hardware limitations, it is impossible for search engines to crawl the Web in its entirety. Consequently, new crawling approaches are needed to limit the amount of Web pages a search engine needs to fetch.

In this work, we give a literature review of the state of the art on Web crawling, with particular attention to two optimization paradigms: focused Web crawling and user-centric Web crawling. The former aims at fetching only Web contents regarding a particular topic (e.g., sports, games, etc.), while the latter tries to fetch only those contents related to the information needed by the users (e.g., by analyzing query logs). Both approaches can crawl high-quality contents with smaller hardware requirements than traditional Web crawling. To the best of our knowledge, however, no attempt has been made to combine their strengths. Therefore, in this thesis we present a novel paradigm for Web crawling optimization that combines the two aforementioned approaches. Through extensive experimentation on the TREC ClueWeb09 (cat. B) Web corpus and on the MSN 2006 query log, we show that our hybrid paradigm outperforms existing focused Web crawlers and user-centric Web crawlers.


Contents

1 Introduction
  1.1 Contributions of this work
  1.2 Chapters overview
2 Web crawling
  2.1 Basic Web crawling
    2.1.1 Crawler architecture
    2.1.2 Crawl ordering
  2.2 Focused Web crawling
    2.2.1 Focused crawler architecture
    2.2.2 Evaluation of focused crawling
    2.2.3 Frontier management
  2.3 User-centric Web crawling
    2.3.1 Page importance measures
    2.3.2 The objective function
    2.3.3 User-centric crawler architecture
    2.3.4 Evaluation of user-centric crawling
3 Related work
  3.1 Focused crawling
    3.1.1 Seed selection
    3.1.2 Classification of the crawled pages
    3.1.3 Classification of the discovered links
    3.1.4 URL selection
  3.2 User-centric Web crawling
    3.2.1 Query-overlap model
    3.2.2 Random-walk model
4 Problem formulation and solution
  4.1 Design issues
    4.1.1 The Focused Random Walk model
    4.1.2 Frontier ordering and management
  4.2 Architecture and algorithm description
  4.3 Implementation
    4.3.1 Our code
5 Experiments and results
  5.1 Experimental setup
    5.1.1 Topic definition and seed selection
    5.1.2 Page classifications
    5.1.3 URL classifications
    5.1.4 Query log filtering
  5.2 Experiments
    5.2.1 Evaluation metrics
  5.3 Conclusion
6 Conclusions
  6.1 Future work
Bibliography

1 Introduction

The Web can be defined as an immense repository of contents, created by billions of publishers and hosted by hundreds of millions of Web servers. This mass publishing would be of little use without a mechanism that allows users to discover these contents, which motivates the existence of Web search engines.

A search engine is a system designed to efficiently and effectively retrieve Web pages that answer user queries. To realize this goal, it first explores the Web and stores all the fetched contents in a local repository. Then, such contents are organized in a large index that supports efficient text searching. When the user submits a query, usually expressed as a small set of terms that represents her needs, the search engine retrieves the documents in the repository whose contents match the query, and finally presents them in the order of their relevance. As an example, Figure 1 depicts the top-10 results presented by the Google search engine for an input query.

The first crucial operation performed by the search engine is to populate the local repository, i.e., to crawl contents to allow their retrieval. By representing the Web as a graph, in which Web pages are nodes and hyperlinks are directed edges between two pages, the crawling activity can be seen as a periodic traversal of the Web graph in order to discover nodes, download their content and build a corpus of Web pages.

A Web crawler, sometimes referred to as a spider, is the software component which performs the crawling activity. A basic crawler executes a simple algorithm with the aim of building the Web page corpus: starting from an input set of seed Uniform Resource Locators (URLs), it fetches all the relative Web pages and stores them in the repository, then extracts the contained hyperlinks, and finally follows all or part of them and repeats the process. The crawling stops when there are no remaining Web pages to discover, or according to some other stopping criterion. For instance, the crawling activity may terminate when a certain number of Web pages have been fetched or after a certain execution time.

Figure 1: A top-10 results page for a query from the Google search engine.

For a search engine, the repository returned by the crawler represents the snapshot of the Web over which all the user queries are processed. The implication is that the quality of query answers reflects that of the available contents. The aim of the crawling activity is to build a repository that covers as many contents as possible. Due to the dynamics of the Web, these contents usually must also be periodically re-crawled in order to synchronize their copies with the real-world updates, i.e., to maintain their freshness.

The enormous size of the Web and the evolution of its structure over time are significant obstacles for Web crawling [10, 54]. Nowadays a good coverage of the Web graph, in combination with an acceptable refresh rate, has huge requirements in terms of hardware and network resources. Hence, a series of challenges have to be taken into account during the design phase of a Web crawler:

• Coverage and quality: due to the actual size of the Web and to the limitations imposed by the hardware, a crawling process can cover only a subset of the indexable Web. Therefore, one of the desirable goals of the crawler is to discard irrelevant or redundant pages in favour of the acquisition of the Web contents that are more likely to be interesting for the users.

• Robustness: some Web pages may limit the crawler in the exploration of the graph, inadvertently or with malice. These kinds of pages are called spider traps. For instance, a trap may consist in a dynamic Web page that generates an infinite number of links to follow. To prevent the consequent waste of resources, and to avoid the failure of the crawling system, the crawler must be resilient to these traps.

• Freshness: in order to provide up-to-date contents, the crawler should continuously obtain fresh copies of previously fetched pages, with a frequency that ideally approximates their update rate.

• Politeness: if a crawler tries to fetch a huge number of pages from a single Web server, it may cause congestion on the content provider's network. Thus, there must be a threshold that limits the number of requests towards a single provider. The provider itself can define some policies to regulate the rate at which a crawler can visit it, e.g., in the robots.txt [74], and a crawler must respect these requirements.

• Scalability: it is difficult for a single process to discover a meaningful number of Web pages in a small amount of time. A possible approach to increase this rate is to replicate the crawling infrastructure across multiple instances and to partition the URL space across them. In this way, each process can limit its crawl to certain boundaries of the Web and can work on its own copy of the data structures with a minimal need of synchronization with the concurrent crawlers [20,76].

Given these challenges, the huge investment needed to deploy and to maintain an efficient and effective Web search engine tends to be affordable only for a few large companies. To reduce such investment, and to increase the opportunities of Web crawling also for smaller companies, some improvements to the basic crawling activity have been investigated in the literature [38,60,77].

One such improvement is focused crawling (or scoped crawling). The purpose of a focused crawler is to collect Web contents only if they respect some scope, for instance:

• Web contents that treat information pertaining to a specific topic [18]. In the literature, this kind of crawling is also referred to as topical crawling.

• Web contents with a specific geographic scope, that sometimes does not coincide with the geographical zone of the hosting Web server [31].

• Web contents whose data, e.g., events or facts, fall into a given temporal scope, to perform a temporal crawling [66].

• Web contents belonging to Web forums, to build an intelligent forum crawler able to efficiently traverse the structures of forums [11,36].

• Sentimental and opinionated Web contents, to provide an opportunity for sentiment analysis over the Web [81].

• Web contents located in hidden points of the Web, e.g., those that can be discovered only by filling and submitting a form and not by following hyperlinks. This kind of crawling is also called deep web crawling or form-focused crawling [5,85].

• Web contents discussed in social networks, to efficiently monitor and get relevant and fresh contents from social media streams [9,83].

• Web contents about specific business entities, to locate interesting information for competitive intelligence [19,62].

• Academic documents over the Web, to build digital libraries of scientific papers [71,87].

Intuitively, focused Web crawling facilitates the existence of domain-specific search engines. As an example, we may think about a Web portal on food that provides an autonomous, vertical search engine on cooking recipes and reviews about restaurants retrieved over the Web. However, this idea can also be exploited by general-purpose search engines, by replacing a single exhaustive crawler with multiple specialized focused crawlers, each in charge of exploring in depth only the portion of the Web relative to its focus. Such a divide-and-conquer crawling strategy reduces the complexity of discovering and refreshing Web contents for large-scale crawling, and reduces the hardware investments [18]. Moreover, focused crawling can also be exploited for building vertical indexes, making it possible to search for relevant results for focused queries among a restricted set of pages [43].

An alternative improvement to basic Web crawling is user-centric crawling (or impact-driven crawling) [60]. Social experiments reveal that the top-ranked results presented by the search engine have a large influence on the user's satisfaction, while the lower-ranked pages are often ignored [38,45,59]. Given this observation, a user-centric Web crawler prioritizes the fetching of the Web pages which would be in the top-k results for the most likely future queries, over other Web pages that are expected to be ranked in lower positions for the same queries.

Let us now consider crawling the Web for a vertical search engine about a broad topic, e.g., Sports. Assume that the users of the search engine are aware of this definition and, consequently, nearly every submitted query is related to such broad topic. Now, notice that the broad topic can be decomposed into many narrow, distinct sub-topics, e.g., Basketball, Soccer, Tennis. This implies that the user needs can reflect very different interests with respect to the sub-topics. For instance, given a query log to the search engine referring to a certain time period, it can be detected that Basketball is very popular, Soccer is less popular, and contents about Tennis are not queried at all. If the frontier is managed without taking into account any feature about such popularity, as happens in state-of-the-art focused crawling, then the crawler does not discriminate between the contents of the sub-topics, since all of them belong to the same broad topic Sports. The crawler may thus explore the subgraphs of contents about Tennis with the same depth as (or worse, more deeply than) the subgraphs about Basketball. If, instead, the frontier is managed only by estimating the top-ranked discovered but uncrawled Web contents for a set of queries, as happens in user-centric crawling, then the off-topic contents have a chance to be crawled prior to the estimated in-topic ones. As a matter of fact, in any of the above cases, the resources for the crawling are wasted on pages that would be interesting for few users when presented in their query's results page, at the expense of discovered but uncrawled pages that would have been retrieved in the search results of many users.


A straightforward approach to this problem is to assign a large amount of hardware resources to a state-of-the-art Web crawler, i.e., to obtain a larger set of crawled pages. In fact, by augmenting the coverage of the Web graph, we increase the completeness of the search index and, implicitly, the probability of satisfying any related query. However, as the coverage of the Web enlarges, the complexity of maintaining the freshness of the repository increases as well. Moreover, this approach may not be affordable if the crawler is a module of a small or medium commercial search engine running on a limited set of resources.

A more effective solution would be to crawl the Web using an approach that takes the best from both focused crawling and user-centric crawling. The ideal behavior of such a hybrid approach is to crawl Web contents that are both in-topic and popular for the most frequent and likely future queries to the search engine. To the best of our knowledge, none of the existing crawling paradigms follows this proposed approach.

1.1 Contributions of this work

The work for this thesis starts from the above observation. We studied a novel user-centric focused crawling approach. We will show that if the crawler is aware of the interests of the users during the crawling session and can adapt its visiting priorities accordingly, then it can improve the quality of the crawled information. In this scenario, the fetching of pages related to distinct in-focus concepts is balanced according to the expected satisfaction of the search engine's users, ideally without affecting the performance measures related to the focus. Therefore, when the crawling ends, we expect that the increased coverage of popular contents will produce a more accurate top-k list of results for the most frequent queries to the search engine.

To experiment with our paradigm, we implemented a prototype of a user-centric focused crawler. From its point of view, the focus is defined as a broad topic. It does not require the definition of the set of narrow topics as well. The coverage of such topics is regulated by exploiting some mathematical tools and some known properties of the Web graph.


1.2 Chapters overview

The rest of the thesis is organized as follows. In Chapter 2, we provide a literature review of the basic Web crawling, the focused Web crawling and the user-centric Web crawling approaches. Chapter 3 describes the related work on focused crawling and user-centric crawling. Chapter 4 introduces the paradigm of user-centric focused Web crawling by formally presenting the problem, provides an abstract user-centric focused crawler architecture and a base crawling algorithm, and describes our implementation of a user-centric focused crawler prototype. Chapter 5 describes how we tested our prototype and shows the comparisons with existing baselines for some crawling simulations. Finally, Chapter 6 provides a summary of this work and outlines future work.

2 Web crawling

In this Chapter, we cover the necessary background to understand our work. Specifically, in Section 2.1 we introduce some concepts about the basic Web crawling approach, i.e., the architecture of a basic crawler and the crawling algorithm. Section 2.2 reviews the state of the art in focused Web crawling. Section 2.3 reports the state of the art in user-centric Web crawling.

2.1 Basic Web crawling

This Section describes Web crawling by following the work in [35]. It represents the basic approach to crawl the Web, i.e., without pursuing any optimization related to the coverage of the Web.

At any point in time during the crawling process, a Web page may be in one of the following three states: crawled, discovered but not yet crawled, or undiscovered. The Web pages in the crawled state are stored in a repository. To manage the set of Web pages in the discovered but not yet crawled state, a Web crawler uses two data structures. The first one is the frontier, a queue of discovered links to be crawled. Its implementation must support two operations: adding a link to the queue and selecting the next link to fetch, which depends on the frontier ordering. The frontier ordering affects the order in which Web pages are visited. The second data structure is the duplicate URL eliminator, which maintains the set of discovered URLs during the current execution of the crawler, without any further distinction (i.e., crawled or not crawled Web pages). The purpose of this data structure, whose implementation may rely on hash tables or Bloom filters, is to avert enqueuing the same link more than once in the frontier, and consequently to prevent the possibility of multiple downloads of the same Web page.
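As a concrete illustration of these two data structures, the following minimal Python sketch (our own simplification, not the thesis prototype) models the frontier as a FIFO queue and the duplicate URL eliminator as an in-memory set; a Bloom filter could replace the set when memory is a concern.

from collections import deque

class Frontier:
    """FIFO frontier: links are fetched in the same order as their discovery."""
    def __init__(self, seeds):
        self.queue = deque(seeds)

    def put_url(self, url):
        self.queue.append(url)

    def get_next_url(self):
        return self.queue.popleft()

    def __len__(self):
        return len(self.queue)


class DuplicateURLEliminator:
    """Remembers every URL discovered so far, whether crawled or not."""
    def __init__(self):
        self.seen = set()  # a Bloom filter could replace this set to save memory

    def contains(self, url):
        return url in self.seen

    def store(self, url):
        self.seen.add(url)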


2.1.1 Crawler architecture

The architecture of a general-purpose Web crawler is composed of a group of independent layers, depicted in Figure 2. Their cooperation forms the crawling procedure, whose pseudocode is shown in Algorithm 1.

Figure 2: Representation of a general-purpose Web crawler architecture.

The crawling algorithm is cyclic. A single execution of the algorithm, from the first to the last crawling cycle, forms a crawling session. Prior to the first cycle, the URL frontier data structure is initialized with a set of seed URLs taken as input (line 2). Each crawling cycle begins with the extraction of the URL of the next Web page to fetch from the frontier (line 5). Different mechanisms to extract the URLs from the frontier are discussed in Section 2.1.2. The Domain Name Service resolver module uses the extracted URL to determine the IP address of the hosting Web server (line 6). Requests to the DNS servers can take up to seconds, reducing the fetching rate of the crawler. Therefore, the module speeds up the process by maintaining a cache of recent DNS lookups. The IP address is then passed to the HTTP fetcher component, which contacts the relative Web server to get the Web page (line 7). Before downloading the content, the HTTP fetcher may consult a robots.txt file placed in the root of the Web site hierarchy.


1:  function Crawl(S):                        . S = set of seed URLs
2:      Frontier ← S;
3:      Repository ← ∅; DuplicateURLElim ← ∅;
4:      while Frontier ≠ ∅ do
5:          url ← Frontier.GetNextURL();
6:          address ← DNSResolver.Resolve(url);
7:          page ← HTTPFetcher.Fetch(address);
8:          if Repository.Contains(page) then
9:              go to 4;
10:         Repository.Store(page);
11:         U ← Parser.ExtractOutlinks(page);
12:         for all outlinks ui ∈ U do
13:             if DuplicateURLElim.Contains(ui) then
14:                 discard ui;
15:             else
16:                 DuplicateURLElim.Store(ui);
17:                 Frontier.PutURL(ui);
18:     return Repository;

Algorithm 1: Basic pseudocode of the crawling algorithm.

This file contains possible crawling rules, defined following the Robots Exclusion Protocol standard [74]. For instance, a content provider can use crawling rules to forbid the crawling of certain Web pages, and the HTTP fetcher must adhere to these limitations. The fetched contents are analyzed to check if the page, associated with a different URL, has already been fetched in a previous cycle (line 8). If so, the page is not further processed and a new crawling cycle begins (line 9). Otherwise, the parser module extracts the outgoing links (line 11). In the last step of the crawling cycle, every link is analyzed by the duplicate URL eliminator module to check if the crawler has already encountered it before (line 13). If so, the URL is discarded (line 14); otherwise it is marked as discovered and then pushed into the frontier (lines 16, 17). Finally, the crawling session returns the set of Web pages that have been downloaded during the crawling cycles (line 18).
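To make this politeness step concrete, the sketch below shows one way an HTTP fetcher could honor the Robots Exclusion Protocol with Python's standard urllib.robotparser module; the user agent string and the comment about skipping URLs are placeholders of ours.

import urllib.robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "ExampleCrawler"  # hypothetical user agent string
_robots_cache = {}             # one parsed robots.txt per host

def allowed_by_robots(url):
    """Return True if the robots.txt of the target host permits fetching url."""
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if host not in _robots_cache:
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(urljoin(host, "/robots.txt"))
        try:
            parser.read()          # downloads and parses robots.txt
        except OSError:
            parser = None          # robots.txt unreachable: be permissive here
        _robots_cache[host] = parser
    parser = _robots_cache[host]
    return True if parser is None else parser.can_fetch(USER_AGENT, url)

# The HTTP fetcher would skip any URL for which allowed_by_robots(url) is False.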


2.1.2 Crawl ordering

It is possible to implement different policies to regulate the selection of the URLs in the frontier.

The simplest selection strategy is breadth-first search: it defines the frontier as a First-In First-Out (FIFO) queue, therefore the crawling process fetches the Web pages in the same order as their discovery [67]. The intuitive advantage of this strategy is that the Web crawler remains easy to implement, since there is no need to define any component in addition to those presented in Section 2.1.1. Despite its simplicity, this crawling strategy fetches high-quality pages at the beginning of the spidering (according to connectivity-based measures of quality such as PageRank [58]). However, the quality of the visited contents progressively diminishes as the spidering proceeds [56].

More complex selection policies can be adopted to get a different coverage of the Web graph. This requires extending the architecture presented so far with an intermediary manager between the extraction layer and the frontier (see Section 2.2.3).

2.2 Focused Web crawling

This Section summarizes the state of the art of the first optimization for Web crawling that we consider in this dissertation, i.e., focused crawling. This optimization aims to solve the following problem.

Problem 1. Given a focus F denoting a possible state of the contents of a Web page, crawl the Web such that the repository contains as many Web pages as possible whose contents are in F.

In other words, focused crawling aims at fetching the unvisited Web pages which are relevant to its focus. The measure of relevance for a page p with respect to F is given by a binary score R(p, F) ∈ {0, 1} defined as:

R(p, F) = \begin{cases} 1 & \text{if } p \text{ is relevant to } F \text{ given its contents} \\ 0 & \text{otherwise} \end{cases}    (1)


The relevance of p is computed by measuring how similar its contents are to a given text, e.g., a description of the focus or the contents of another Web page whose relevance to the focus is already known. Some methods to measure such relevance are reported in Section 3.1.2.

The concept of focus increases the complexity of the crawling strategy. In its most basic formulation, the focused crawler receives a set of seed URLs and uses them to initialize the frontier and to learn some good examples of pages to crawl. This knowledge is then used in the crawling phase to determine if a discovered URL has to be fetched according to the focus, and to assign a relevance score to prioritize the URLs that belong to the focus over the others. Finally, URLs in the frontier are selected according to a specific strategy, e.g., in decreasing order of their estimated relevance.

However, focused crawlers must make relevance measurements before fetching the page contents. As a consequence, it is necessary to estimate, for each unvisited Web page, the probability that it would be relevant if downloaded. Of course, some predictions could be wrong, and the set of crawled pages may include some unimportant contents or miss some pertinent pages. The purpose of the components in the intelligence layer is to minimize these prediction errors.

2.2.1 Focused crawler architecture

A basic, high-level architecture of a focused Web crawler can be an extension of that of the general crawler, as presented in Section 2.1.1, without any need to redefine the inherited modules. In addition, an intelligence layer of components is in charge of selecting and prioritizing the discovered but not crawled links, aiming to respect the focus F with the best effort. Such architecture is depicted in Figure 3. Algorithm 2 shows the pseudocode of the focused crawling procedure.

The analysis of the fetched documents and the relevance predictions of the uncrawled contents with respect to F are done by the relevance classifier module. Initially, it determines whether the contents of the crawled Web page are relevant to F (line 11).


Figure 3: Representation of a focused Web crawler architecture.

Section 3.1.2 overviews the techniques for this judgment. If the outcome of such classification is negative, then the page is discarded and the current crawling cycle terminates (line 13). Otherwise, URLs contained in the page are extracted and their probabilities of relevance to F are estimated according to some content-independent features (line 21). The reader is referred to Section 3.1.3 for more details about such estimations.

Given the URLs in the frontier, the frontier manager is responsible for the selection of the URL for the next crawling cycle (line 5). By associating each discovered URL with its estimated relevance score to F (lines 2, 22), the frontier manager strategically organizes them into the frontier (line 23). For instance, it could sort the URLs in decreasing order of their scores to prioritize the most promising links according to the previous classifications. Section 2.2.3 reports the current state of the art in frontier management.
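As an illustration of such a score-ordered frontier, the following minimal Python sketch (our own, not the thesis prototype) keeps the discovered URLs in a priority queue so that GetNextURL always returns the link with the highest estimated relevance.

import heapq

class PriorityFrontier:
    """Frontier ordered by decreasing estimated relevance score."""
    def __init__(self):
        self._heap = []     # heapq is a min-heap, so scores are negated
        self._counter = 0   # tie-breaker preserving insertion order for equal scores

    def put_url(self, url, score):
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def get_next_url(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

# Usage: the link with the highest estimated relevance is fetched first.
frontier = PriorityFrontier()
frontier.put_url("http://example.org/basketball", 0.9)
frontier.put_url("http://example.org/weather", 0.2)
assert frontier.get_next_url() == "http://example.org/basketball"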


1:  function Crawl_F(S, t, F):                . S = set of seed URLs
                                              . t = relevance threshold for a Web page
                                              . F = the focus
2:      Frontier ← (S, ∞);                    . insertion with maximum score
3:      Repository ← ∅; DuplicateURLElim ← ∅;
4:      while Frontier ≠ ∅ do
5:          url ← FrontierManager.GetNextURL(Frontier);
6:          address ← DNSResolver.Resolve(url);
7:          page ← HTTPFetcher.Fetch(address);
8:          if Repository.Contains(page) then
9:              go to 4;
10:         contents ← Parser.ExtractContents(page);
11:         r ← RelClassifier.ClassifyPage(contents, F);
12:         if r < t then
13:             go to 4;
14:         Repository.Store(page);
15:         U ← Parser.ExtractOutlinks(page);
16:         for all outlinks ui ∈ U do
17:             if DuplicateURLElim.Contains(ui) then
18:                 discard ui;
19:             else
20:                 DuplicateURLElim.Store(ui);
21:                 ri ← RelClassifier.ClassifyURL(ui, F);
22:                 Frontier.PutURL(ui, ri);
23:         FrontierManager.Order(Frontier, F);
24:     return Repository;

Algorithm 2: Basic pseudocode of the focused crawling algorithm.

2.2.2 Evaluation of focused crawling

We report some possible metrics to evaluate a focused crawling procedure and to compare different focused crawling policies. The evaluations make it possible to estimate the advantages and the weaknesses of a focused crawler implementation. Furthermore, evaluations make it feasible to automatically assess the quality of the crawling over time and to optimize its policy.

Harvest rate, loss rate

One of the main metrics to evaluate a focused crawling is the harvest rate (or pertinency) metric, which nowadays represents a standard measure in the information retrieval discipline [42]. The harvest rate is also the objective function of the focused crawling.

Definition 2.1. Let C be the set of crawled pages. Given F as the focus and R(p, F) as the relevance function of a page p with respect to F as in Equation 1, the harvest rate H(C, F) is

H(C, F) = \frac{1}{|C|} \sum_{p \in C} R(p, F) .

Therefore, the harvest rate H(C, F) is the sum of the relevances to F of each Web page in C, divided by the size of C. We can state that the aim of focused crawling is to find the set C that maximizes H. The higher H is, the better the performance of the focused crawler according to this metric.

The previous metric can also be seen from a dual perspective, i.e., as the loss rate.

Definition 2.2. Let C be the set of crawled pages. Given F as the focus and H(C, F) as the harvest rate of C with respect to F as in Definition 2.1, the loss rate L(C, F) is

L(C, F) = 1 - H(C, F) .

Therefore, the loss rate L(C, F) is the complement of the harvest rate H(C, F). Hence, we can state that the aim of focused crawling is to find the set C that minimizes L. The lower L is, the better the performance of the focused crawler according to this metric.
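As a small worked example of Definitions 2.1 and 2.2, with made-up relevance judgments, the snippet below computes the harvest and loss rates of a crawled set:

def harvest_rate(relevances):
    """Harvest rate: fraction of crawled pages judged relevant to the focus."""
    return sum(relevances) / len(relevances)

def loss_rate(relevances):
    """Loss rate: complement of the harvest rate."""
    return 1.0 - harvest_rate(relevances)

# Hypothetical crawl of 8 pages; R(p, F) is 1 for relevant pages, 0 otherwise.
R = [1, 1, 0, 1, 0, 1, 1, 0]
print(harvest_rate(R))  # 5/8 = 0.625
print(loss_rate(R))     # 0.375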

Recall

Another meaningful performance metric for focused crawling is the recall [42]. This metric quantifies the percentage of retrieved relevant documents over all the existing relevant documents.


Definition 2.3. Let C be the set of retrieved pages, and let F be the set of all the pages of the Web graph that are in the focus. The recall R(C, F) is:

R(C, F) = \frac{|F \cap C|}{|F|} .

If the size of F is not quantifiable, then an approximation of the recall can be used, e.g., based on the intersection of distinct crawled sets of pages produced by concurrent executions of the same crawler starting from different seeds [57].

2.2.3 Frontier management

Given the URLs enqueued in the frontier and their estimated relevance scores, the frontier management (or crawl ordering) task determines the sequence of candidates to pick from the frontier, i.e., how to explore the Web graph.

The main issue in the frontier management is how to deal with the relevance estimations in order to select a URL to fetch in the next cycle. A greedy approach to URL selection emphasizes the exploitation of the content-independent information gathered so far in the crawling session. The structure of the entire Web graph shows that the majority of the existing Web pages are strongly connected [54]. Unfortunately, this is not the case for the Web pages of interest to a focused crawler. Their interconnections form relatively small patchy regions of relevant contents over the whole graph [80]. As a matter of fact, a greedy policy restricts the discovery of new Web pages to these specific areas and does not discover some good content that could be found only after visiting uninteresting pages. Therefore, in most cases the optimal harvest rate is not guaranteed by the greedy approach.

A good URL selection procedure should sometimes break away from the greedy approach in order to perform a better exploration of the Web graph [8,65]. On the other hand, the less the quality estimations are taken into account to select the next page, the lower the harvest rate expected at the end of the discovery.


Accordingly, the more explorative the crawling policy is, the more the behavior of the focused crawler tends to resemble that of a general-purpose crawler.

If the focused crawling has to cover only a very small part of the Web, the selection should be exploitative [65]. Otherwise, a successful URL selection policy must balance exploitation and exploration. It is impossible to define the best balancing strategy as long as the topology of the Web graph is not completely known, which is of course never the case during the Web crawling. When the crawler is designed to periodically recrawl the same regions of the Web from scratch, this problem can be attacked by manually tuning the exploitation and exploration rates for the new crawling session according to the results of past sessions. An automatic approach that is independent from past sessions is to use an adaptive policy to select the links. A detailed overview of the state of the art in adaptive selection is reported in Section 3.1.4.
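One simple way to mix exploitation and exploration, shown below as an assumption of ours rather than a method taken from the literature reviewed here, is an epsilon-greedy rule: with probability 1 − ε the crawler picks the highest-scored URL in the frontier, and with probability ε it picks a random one.

import random

def select_next_url(frontier, epsilon=0.1, rng=random):
    """frontier: non-empty list of (url, estimated_relevance) pairs.
    Removes and returns the selected URL."""
    if rng.random() < epsilon:
        index = rng.randrange(len(frontier))       # exploration: random link
    else:
        index = max(range(len(frontier)),
                    key=lambda i: frontier[i][1])  # exploitation: best-scored link
    url, _ = frontier.pop(index)
    return url

frontier = [("http://example.org/a", 0.9), ("http://example.org/b", 0.4)]
print(select_next_url(frontier))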

2.3 User-centric Web crawling

This Section deals with user-centric Web crawling. This is an alternative approach to the focused Web crawling presented in Section 2.2 for selectively crawling the Web. A user-centric crawler tackles the following problem.

Problem 2. Given a representation of the user needs, expressed by a set Q of queries to the search engine, crawl the Web such that the repository contains as many Web pages as possible whose contents are relevant for the queries in Q.

In other words, the objective of this approach is to maximize the quality of the experience with the search engine that the user perceives after submitting a query. To do so, a user-centric crawler behaves as a general-purpose Web crawler until it receives a set of queries from an external software module, e.g., from a query logger. Then, according to the quality of the contents that match such queries and to the links that are in the frontier, it reorders the Web pages to crawl in the next cycles according to the estimated impact they will have on the top-k results page for the received queries. Similar to focused crawling, the main challenge in user-centric crawling is to estimate how interesting a Web page is for the user before accessing its contents.

2.3.1 Page importance measures

In this paradigm, the importance of a page is measured as the impact it has on the user experience with the search engine. We review some of its definitions.

Query impact

A query impact definition for a Web page has been proposed in [61]. Given a query q, a query sketch for q is defined as the set of all the discovered pages (either crawled or enqueued in the frontier) that are relevant to q, together with their associated relevance scores for q. Then, from such a query sketch we can estimate the importance of crawling the page p with respect to the query q, i.e., its query impact, as:

I_Q(p, q) = V(R(p, q)),    (2)

where R(p, q) is the rank of p in the search results for q and V(r) is the probability that a user is interested in a page ranked at position r on average. Such probability increases as r gets closer to the first position [38,45,59]. For an example of how to compute R(p, q), we refer the reader to Section 3.2.1.

The query impact as in Equation 2 can be extended by considering a query workload Q instead of just one query. In the workload, each query q ∈ Q is associated with a frequency of occurrence f(q) ∈ [0, 1]. In this way, the query impact of p with respect to Q is the sum of the frequencies of the queries for which it will be impactful, multiplied by such impact, i.e.,

I_Q(p, Q) = \sum_{q \in Q} f(q) \cdot I_Q(p, q) = \sum_{q \in Q} f(q) \cdot V(R(p, q)) .    (3)
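The following short Python sketch illustrates Equation 3 under assumptions of ours: the ranks are toy values and the view-probability curve V(r) is a DCG-style discount chosen purely for illustration, not necessarily the one used in [61].

import math

def view_probability(rank):
    """V(r): probability of a user inspecting rank r (illustrative DCG-style decay)."""
    return 1.0 / math.log2(rank + 1)

def query_impact(page, workload, rank_of):
    """workload: dict query -> frequency f(q); rank_of(page, query) -> rank R(p, q)."""
    return sum(freq * view_probability(rank_of(page, q))
               for q, freq in workload.items())

# Toy example: hypothetical ranks of page "p1" for two queries.
ranks = {("p1", "basketball scores"): 1, ("p1", "nba playoffs"): 4}
workload = {"basketball scores": 0.7, "nba playoffs": 0.3}
print(query_impact("p1", workload, lambda p, q: ranks[(p, q)]))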

Click impact, view impact

Alternative impact definitions have been proposed in the literature. In [29], it is supposed that a page has a large impact if it receives many clicks when its link is presented in the search engine results pages. Therefore, a click impact definition of a page p is given as:

I_C(p, t) = C(p, t),    (4)

where t is an observation time period and C(p, t) is the number of times p has been clicked from any search engine results page during the time t. In [79], a similar definition of the impact of p is given as the view impact, i.e.,

I_V(p, t) = V(p, t),    (5)

where V(p, t) counts how many times p has been viewed in the top-k list of any search engine results page during t. Notice that, due to the more general nature of both these impact definitions, they do not consider or estimate any ranking score for the page in any of the present or future queries, as the query impact does. The concept behind these two definitions is that all the Web pages that are largely clicked and/or viewed are to be considered equally important to crawl. This observation speeds up the process of click or view impact estimation for an uncrawled Web page, when compared to a query impact estimation, since it is no longer necessary to estimate the rank of the page in the top-k if crawled.

2.3.2 The objective function

A user-centric crawling policy pursues the following objective function. Given a generic impact definition I that assigns some impact to the Web pages, crawl the set C of the most impactful pages, i.e.,

I(C, ·) = \sum_{p \in C} I(p, ·),    (6)

or, in expectation,

E[I(·)] = E\left[ \sum_{p} X_p \cdot I(p, ·) \right],    (7)

where X_p is a binary random variable defining whether p has been crawled (X_p = 1) or not (X_p = 0).

2.3.3 User-centric crawler architecture

Similar to what we have done in Section 2.2.1 for the focused crawling, we derive the basic architecture of a user-centric Web crawler by extending that of the basic Web crawler, presented in Section 2.1.1, with an intelligence layer. The intelligent components for user-centric Web crawling have to estimate the impact of fetching the uncrawled Web contents given a set of queries. Then, according to such estimations, they have to prioritize the fetching of the most impactful links that are enqueued in the frontier. Figure 4 illustrates the modules of this architecture and their cooperation, which we are going to discuss. Algorithm 3 shows the pseudocode of their interaction.

The architecture and the algorithm of the user-centric crawler are similar to those of the basic Web crawler, reported in Section 2.1.1. The crawl ordering policy starts when the queries receiver module receives a representative workload of queries to the search engine, e.g., from a query logger (lines 5, 6). From such queries, the impact estimator component estimates the impact of fetching the discovered but uncrawled pages, given the actual coverage of the Web contents relevant to the queries (line 7). For instance, it gets from the repository the top-ranked Web pages that are actually returned by the search engine for those queries. Then, it uses such documents to infer the importance of crawling the discovered but uncrawled Web contents, by calculating their probability of being relevant to the input queries and ranked in the top-k if crawled. Section 3.2 summarizes the state of the art of this estimation procedure. Finally, the links in the frontier are ordered by the frontier manager component according to such estimated impacts (line 8) and the next links to fetch are selected accordingly (line 9). In the rest of the algorithm, the crawler behaves similarly to the basic Web crawler (see Section 2.1.1).

Figure 4: Representation of a user-centric Web crawler architecture.

2.3.4 Evaluation of user-centric crawling

We describe some evaluation metrics for user-centric Web crawling. Similar to the metrics used to evaluate a focused crawler, presented in Section 2.2.2, their purpose is to measure the profit and the drawbacks of a user-centric crawling policy with respect to other policies of the same kind, or to the basic Web crawling policy.


1:  function Crawl_UC(S):                     . S = set of seed URLs
2:      Frontier ← S;
3:      Repository ← ∅; DuplicateURLElim ← ∅;
4:      while Frontier ≠ ∅ do
5:          if QueriesReceiver.HasNewQueries() then
6:              Q ← QueriesReceiver.GetNewQueries();
7:              I ← ImpactEstim.GetImpacts(Repository, Q);
8:              FrontierManager.Order(Frontier, I);
9:          url ← FrontierManager.GetNextURL(Frontier);
10:         address ← DNSResolver.Resolve(url);
11:         page ← HTTPFetcher.Fetch(address);
12:         if Repository.Contains(page) then
13:             go to 4;
14:         Repository.Store(page);
15:         U ← Parser.ExtractOutlinks(page);
16:         for all outlinks ui ∈ U do
17:             if DuplicateURLElim.Contains(ui) then
18:                 discard ui;
19:             else
20:                 DuplicateURLElim.Store(ui);
21:                 Frontier.PutURL(ui);
22:     return Repository;

Algorithm 3: Basic pseudocode of the user-centric crawling algorithm.

Impact accrued

The most intuitive evaluation metric for user-centric crawling is the impact accrued. Given the crawled Web pages that are selected through a user-centric policy, this metric simply averages their impacts after an observation period. A well-performing user-centric crawler obtains large values of such average over time. The curve of a plot for this metric can be traced according to the time when the pages have been crawled. In existing work, the crawling policy reorders the frontier a single time, and the cumulative impact accrued is measured according to the percentage of fetched pages over the pages in the frontier at the time of the re-sort [61,79].


Click-through rate

If information about what the search engine users view and click in the query results is available, another interesting quality metric is the click-through rate (CTR). For a crawled page, it consists of the ratio of the number of users who click on its link in a results page to the number of users who view such link. In the context of user-centric crawling, we expect from a good set of crawled Web pages that the contained information is interesting enough to be often clicked when presented to the users.
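As a small illustrative computation with made-up numbers, a crawled page whose link was viewed 500 times and clicked 25 times would have:

\mathrm{CTR}(p) = \frac{\text{clicks on the link of } p}{\text{views of the link of } p} = \frac{25}{500} = 0.05 .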

3 Related work

This Chapter reports the related work in focused Web crawling and in user-centric Web crawling.

3.1 Focused crawling

3.1.1 Seed selection

As already stated, a crawler uses a set of seed URLs as starting points for the crawling process. The quality of the set of crawled pages is largely influenced by the quality of the input seeds [86]. This aspect is further accentuated in the case of focused crawling [80]. To the best of our knowledge, despite the importance of seed selection, commonly accepted guidelines for this activity are missing from the scientific literature.

A manual seed selection requires a specialized user with good knowledge of the topic. A possible source of seeds for some topics is given by Web directories such as the Open Directory Project (http://www.dmoz-odp.org). This strategy has been used in many works, e.g., [17,18,52,63]. However, when the Web directory does not index a topic, or if the provided list is not rich or accurate, then the performance of the focused crawler can be negatively affected [80].

Often, selecting seeds without taking care of their location in the Web graph leads to a superficial coverage of the relevant regions of the Web graph [80]. During the selection process, it would be better to have an approximate knowledge of the topology of the Web graph, and then to select seeds located in sparse regions of such graph. Recently, a framework that accurately and automatically selects a large and sparse set of seeds has been proposed [80]. It builds such a set by strategically generating and issuing multiple queries to a major, general-purpose search engine.

The top-ranked pages returned by the search engine for each query are downloaded and classified to estimate the most rewarding seeds. A query consists of focus-specific terms and is composed through an adaptive strategy that builds a language model for the topic. Such a model is iteratively built starting from an input set of representative terms, and then enriched in a pseudo-relevance feedback process that selects new representative terms from the rewarding seeds discovered in previous iterations.

3.1.2 Classification of the crawled pages

One of the challenges for a focused crawler is to classify a crawled Web page pc to check if it is relevant to a given focus. In this problem, pc can be assessed using its content. As we discuss in Section 3.1.3, a careful evaluation may be used as a content-independent feature to classify the discovered but uncrawled pages that are near pc in the Web graph. Moreover, this kind of classification is useful in measuring the harvest rate as in Definition 2.1.

Besides the evaluation of the downloaded contents in focused crawling, Web page classification has several other applications. For this reason, many studies and exhaustive reports about this task are available in the literature, e.g., [70]. In the following, we briefly discuss the major features and algorithms.

Features

The content of pc is composed of HTML elements, or tags. Some elements contain a portion of the human-readable information, i.e., the textual content rendered by a Web browser. Other elements contain additional information, e.g., hyperlinks (relationships of pc with other Web pages), rendering instructions for the browser, and so on. We mention some features that can be extracted for the content-based classification of a Web page.

A way to obtain features for the classification of pc is to consider its textual content. A possible feature space is the set of unique terms or sentences obtainable from the bag-of-words representation of the text [48]. A more accurate but higher-dimensional feature space can be extracted from the n-gram representation of the text. Different feature selection techniques to automatically reduce the dimension of the features with a negligible loss of accuracy have been investigated in [30,84].

The main difference between Web page classification and plain text categorization is the possibility of extracting a feature space from sources other than just text, i.e., from the HTML tags. In principle, all the tags may produce a feature space for the categorization of pc. For instance, a mechanism that weights the terms of a Web page according to the header tag (<h1>, <h2>, etc.) in which they appear is adopted in [44].

When pc exhibits poor contents in terms of the quantity of plain text or of HTML tags, an accurate classification of the page in isolation is not possible. A way to overcome this problem is based on the assumption that if pc is surrounded by many Web pages of a given category, it is probable that pc belongs to the same category [22]. Therefore, in this case the classification of pc is based on the classifications of its neighborhood. Since this kind of classification coincides with the content-independent link classification in focused crawling, we cover the possible features extractable from the neighbors in Section 3.1.3.

Algorithms

The features extracted from pc are used as input for the page classifier to perform the actual categorization, i.e., to check if the contents of pc are relevant, given the focus F. In the following, we review some classification algorithms.

One of the simplest approaches to classify pc computes a centroid to represent F. A centroid consists of a vector of terms, each weighted by its importance with respect to the others [75]. The terms for a centroid can be extracted from a textual description of F or from the text of some representative Web pages. Usually, the weight assigned to a term is derived from its frequency in the original text, e.g., by the term frequency-inverse document frequency (tf-idf) function [75]. Given the centroid of F and the text of pc, the relevance of pc is measured according to well-known metrics, such as cosine similarity [48]. Such a method is easy to implement and can be effective if the purpose of the classification is limited to the measurement of the harvest rate [33]. Nevertheless, its performance is usually poor [68]. In focused crawling, this kind of classification may negatively affect the harvest rate if its outcome is used as a feature to classify the links in the frontier [53].
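A minimal sketch of this centroid approach, assuming scikit-learn is available and using a handful of made-up texts to describe the focus, could look as follows:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Representative texts for the focus F (hypothetical examples).
focus_examples = [
    "basketball nba playoffs team score court",
    "soccer league goal match championship",
]

vectorizer = TfidfVectorizer()
focus_vectors = vectorizer.fit_transform(focus_examples)
centroid = np.asarray(focus_vectors.mean(axis=0))  # tf-idf centroid of the focus

def relevance(page_text):
    """Cosine similarity between the page and the focus centroid, in [0, 1]."""
    page_vector = vectorizer.transform([page_text])
    return float(cosine_similarity(page_vector, centroid)[0, 0])

print(relevance("the nba basketball team won the playoffs"))  # high score
print(relevance("recipe for chocolate cake with vanilla"))     # close to zero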

To obtain better classification quality, it is very common to adopt more sophisticated approaches. Often, Web page classifiers rely on machine learning algorithms. In most cases, the page classifier is trained to categorize a Web page as relevant or not relevant to F by means of some supervised learning. The training set typically consists of a set of keywords and/or of known high-quality Web pages. Each keyword or Web page is manually or automatically categorized as a positive example if it pertains to F, or as a negative example otherwise. After the training phase, the resulting classification model is used to categorize pc as either relevant or not relevant according to its extracted features. Consequently, the effectiveness of the classification is mainly dependent on the quality of the provided examples.

A study about the classification of Web pages suggests that the size of the negative set of examples should be twice the size of the positive set [63]. If few examples are available, the model of the page classifier can be updated in the future according to a larger training set. Unfortunately, a simple automatic approach that enlarges the training set with the classification outcomes of the first thousand crawled pages is not effective [63].

Two popular algorithms in Web page classification are Naive Bayes (NB) and Support Vector Machines (SVMs). NB works under the assumption that all features are independent of each other, and computes the probability of being in a given class based on maximal probability [46]. An SVM maps the training examples as points to be separated by one or more hyperplanes, according to the positive or negative label assigned. Then, it finalizes the training by solving the quadratic programming problem of selecting the optimal hyperplane that minimizes the probability of wrong classification [78]. A strength of this method is the capability of handling a large (and possibly sparse) feature space. Such capability reduces the need for feature selection, which is often complex in text categorization [37]. Despite the different natures of NB and SVM, the quality of their results in automatic classification of Web documents is equally competitive [63,68].
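As an illustration of this supervised setup, the sketch below (our own toy example, assuming scikit-learn) trains a linear SVM over tf-idf features on a few labeled pages and then judges a freshly crawled one; a Naive Bayes model could be plugged in the same way.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set: 1 = relevant to the focus, 0 = not relevant.
train_texts = [
    "nba basketball playoffs final score",        # positive example
    "soccer champions league goal highlights",    # positive example
    "chocolate cake recipe with vanilla icing",   # negative example
    "stock market interest rates inflation",      # negative example
]
train_labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)

# Classify a freshly crawled page (hypothetical text).
print(classifier.predict(["tennis grand slam tournament results"]))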

In another approach to relevance computation, F is described by means of ontologies. Given the ontology, a knowledge base of ontological concepts is built. Then, at crawling time the relevance of pc is computed by comparing the ontological concepts extracted from its contents to those belonging to the ontology [26].

A way to get a relevance score from the above comparison is to count the occurrences of each ontological entity in pc and then to multiply the result by the TF-IDF weight assigned in the ontology. Then, the results for each entity are summed up and divided by the text length of pc [26]. Alternative strategies are summarized in [25].
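A minimal sketch of this scoring rule, with a made-up set of ontology entities and weights, might be:

def ontology_relevance(page_tokens, ontology_weights):
    """Sum of (occurrences of each ontology entity * its weight),
    normalized by the page length, following the scoring rule described above."""
    score = sum(page_tokens.count(entity) * weight
                for entity, weight in ontology_weights.items())
    return score / len(page_tokens)

# Hypothetical ontology for a cooking focus: entity -> weight in the ontology.
ontology = {"recipe": 2.1, "ingredient": 1.7, "oven": 1.3}
page = "mix the ingredient list then bake the recipe in the oven".split()
print(ontology_relevance(page, ontology))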

3.1.3 Classification of the discovered links

This task consists of a relevance classification of the set of links L to enqueue in the frontier. During a link classification, the contents of the linked page are unknown, since it has only been discovered but not (yet) crawled. Therefore, for each link l ∈ L, the relevance of the discovered Web page pd referred to by l must be predicted by relying on content-independent features. Unlike the content-based Web page categorization of Section 3.1.2, this kind of categorization is less popular in the literature, probably due to a more restricted number of applications.

Features

Various features to predict the category of pd have been proposed in the literature. We report the most relevant ones.

A set of features can be extracted by exploiting all the information we have about pd in isolation at the time of its classification, notably:

• the URL: the structure and/or the set of tokens contained in the URL of pd can be useful to classify pd with good efficiency and accuracy [2,7,40]. For example, if the classifier has been properly trained to focus on links about microcontrollers, then a URL that contains the token Arduino provides a clue that l links to an interesting pd. This concept can be extended to hostnames and to domains. If the classifier knows their scopes, then it can take an accurate decision just by recognizing them from the URL.

• the anchor context: the anchor text of l and a certain area of surrounding text could summarize the content of pd. Words before and after the anchor are also considered, since anchor contents can be composed of few terms (e.g., "click here", or the URL of l), or of no terms at all (e.g., an image). In general, the more terms surrounding the anchor of l are considered, the more expressive the anchor context is. However, it is important to define a limit for such context: a large set of terms could be noisy for this feature. Empirical results from different works state that stemmed link contexts without stop words and with an average length of 11 terms give a good summary of pd on average [15,22]. A minimal sketch of extracting URL tokens and an anchor context window is given after this list.

• the proximity of the link to a token: another way to classify links takes into account statistical properties of the position of specific tokens with respect to the links. A token has many possible definitions. The simplest one is a given textual term [2]. Alternatively, we can consider the HTML contents of the parent page of pd, properly encoded in a Document Object Model (DOM) tree [82]. Such a structure is powerful enough to allow further definitions of a token. For instance, it can be a given font color, the n-th item of a list, and so on [17]. The basic idea is to train the classifier from a model that considers the textual contents, or the DOM trees, of a known set of Web pages. The features of the model are the distances of links to known pages and common tokens. A distance is computed as the number of terms, or DOM elements, that separate the token and the anchor element of the link. Then, given the page from which l has been discovered, if a token is detected then its distance to l is computed and the link classifier predicts the usefulness of pd accordingly.

• general properties from the discovered Web graph: the structure of the Web graph at the time of the prediction can be used to derive further features. For instance, the classifier may assign a relevance to pd according to its input degree, i.e., the number of its incoming links [21], its estimated PageRank score [21] or a derived form of it [1,28], or its authority and hub scores [18,32]. Experiments showed that predictions based on the aforementioned properties can be effective when the purpose of the crawler is to provide a wide coverage of a small area of the Web graph [3]. Conversely, for large-scale crawling these kinds of metrics are not effective [53]. A motivation is that the topology of the discovered Web graph changes continuously as the crawling proceeds, and its properties have to be periodically recomputed. When the graph becomes sufficiently large, the time complexity of such updates could degrade the efficiency of the crawler, especially for the properties computed by iterative algorithms. Another observation is that any Web graph as seen by the crawler is always a small subset of the real one. The natural consequence is that the extracted properties may not reflect the real ones, and thus they may not be reliable for good predictions. A third reason is related to the nature of these properties, which may be too general to be helpful for most applications of focused crawling. In fact, the favored Web pages are just connected to many other Web pages, or to some authoritative Web pages, without taking care of their contents.
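The sketch below, our own simplification, extracts two of the features above for a discovered link: the tokens of its URL and an anchor context window of up to 11 terms around the anchor text.

import re

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (hostname, path words, ...)."""
    return [t for t in re.split(r"[^a-zA-Z0-9]+", url.lower()) if t]

def anchor_context(page_terms, anchor_start, anchor_len, window=11):
    """Return up to `window` terms of context centered on the anchor text."""
    left = max(0, anchor_start - (window - anchor_len) // 2)
    return page_terms[left:left + window]

url = "http://www.example.org/arduino/microcontroller-tutorial.html"
print(url_tokens(url))  # ['http', 'www', 'example', 'org', 'arduino', ...]

terms = "read our short tutorial on the Arduino board power supply and shields".split()
print(anchor_context(terms, anchor_start=6, anchor_len=2))  # window around "Arduino board"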

Another set of features can be derived from the fact that the overall link structure of the Web graph is not random. Various studies have shown that in most cases relevant Web pages link to other relevant Web pages [16,22,49,64]. Based on these empirical results, such a hypothesis has been formalized in [50].

Consider, for instance, a Web page about C++ programming. In general, it is very rare that the Web surfer reaches it by following a link on a Web page about TV series. It is easier to imagine that such a page can be discovered from a computer science community. Similarly, it is unusual that the author of such a page places a link to a Web page about mad cow disease. It is more common to find a link to a C++ library Web page. In other words, the average behavior of Web page authors is to link to pages that could be interesting for the same reader. As a consequence for the link classifier, the nearer an unknown page is to a known, high-quality page in the Web graph, the higher the quality that can reasonably be assigned to it.

The assumption of content locality leads to the definition of the following content-independent features for the discovered page pd:

• the relevance of the ancestor Web pages: a Web page is considered as an ancestor of pd if it has been previously crawled, its contents have been classified (as in Section 3.1.2), and it has a path of links connecting it to pd.

Given the content locality property mentioned before, the relevance of the contents of the ancestors can provide an estimation of the relevance of pd. Among the set of ancestors, we can distinguish the parents of pd as the pages that contain l in their set of outlinks. Experiments showed that the parent pages provide a more accurate estimation than higher-generation ancestor pages [39].

• the relevance of the sibling Web pages: given the Web subgraph surrounding the node of pd, a feature to estimate its relevance is the analysis of the classification of its neighbors [2,14,69]. In focused crawling, the most relevant neighbors of pd are the sibling Web pages that have already been crawled and evaluated. This feature exploits the evidence that if the common parent page has many crawled outlinks whose contents have been judged as (not) relevant, then the contents of the target page pd are likely to be (not) relevant as well (a minimal sketch combining parent and sibling evidence is given after this list).

Figure 5: Neighbors of a target page.
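As an illustration of how the two neighborhood signals could be combined, the following sketch estimates a prior relevance for pd as a weighted mixture of the average relevance of its crawled parents and of its evaluated siblings. The mixing weights and the function name are illustrative assumptions, not taken from a specific published crawler, although weighting parents more heavily reflects the evidence reported in [39].

```python
def neighborhood_prior(parent_scores, sibling_scores, w_parent=0.7, w_sibling=0.3):
    """Estimate a prior relevance for a discovered page p_d from the relevance
    of its crawled parents and siblings (content locality assumption).

    parent_scores  : relevance scores of the crawled parents of p_d
    sibling_scores : relevance scores of the already-evaluated siblings of p_d
    w_parent, w_sibling : illustrative mixing weights (assumed values)
    """
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    parent_part = mean(parent_scores)
    sibling_part = mean(sibling_scores)
    if not sibling_scores:          # no evaluated siblings: fall back on parents only
        return parent_part
    return w_parent * parent_part + w_sibling * sibling_part

# Toy usage: two relevant parents, three siblings with mixed judgments.
print(neighborhood_prior([0.9, 0.8], [0.7, 0.1, 0.6]))
```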

Since Web contents do not have a homogeneous organization, the assumption of content locality does not always hold. Sometimes, contents are organized in categories and can be accessed only after a walk over a path of non-relevant pages. A study over half a million Web pages reported that pairs of relevant pages that are not connected by a direct link are separated by 1 to 12 non-relevant pages, most commonly 5 [8]. There are many possible ways to build paths of pages. However, in some cases we may be interested in crawling over links that are likely to form common paths.

As an example, consider a typical digital library website that contains research papers. In order to get a publication from the Homepage, we commonly follow the Journals link, which leads to a page with a list of links to different journals. Then, we select the desired Journal link, obtaining a page with multiple links related to the issues of that specific journal. Next, we follow an Issue link to get the page with the corresponding list of publications. Eventually, by following a discovered link from the current page, we reach the desired Publication. For a focused crawler about research papers, such a traversal

Homepage → Journals → Journal → Issue → Publication

requires downloading some pages that are useless for the main purpose of the crawler, i.e., building a collection of publications. However, they lead to the discovery of the Issue Web page, which represents a gold mine of useful links to increase the harvest rate.

In this case, it is useful to estimate the membership and the position of pd in a known sequence of pages. A link classifier may then assign a weight to l according to the following content-independent features [24,47,72]:

• the linkage relations to the crawled ancestors: this feature can be useful for the crawler to detect whether it is classifying a link belonging to a notable path. A possible definition is the combination of the anchor text of l, the directories in its URL, its position in the parent page, and the observed membership of the crawled ancestors to the same path.

• the state of the path traversal: with respect to the previous feature, this one defines a notion of state for the current traversal in order to estimate the reward of pd. A state can be defined according to specific goals. For instance, it can be the distance of pd from the last relevant ancestor. Another possible definition of the state is the expected number of remaining hops to reach a relevant page, given the estimated position of pd in a common hierarchy (a minimal sketch of the first definition is given after this list).
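A minimal way to realize the first definition of state is to carry, with every frontier link, the number of hops elapsed since the last relevant ancestor and to stop extending a path once a tolerated number of consecutive irrelevant pages is exceeded. The sketch below is only an illustration under these assumptions; the threshold and all names are invented, and systems such as [24,47,72] learn richer state representations.

```python
from collections import namedtuple

# State carried along a path: hops elapsed since the last relevant ancestor.
PathState = namedtuple("PathState", ["hops_since_relevant"])

MAX_IRRELEVANT_HOPS = 3  # illustrative tolerance for "tunnelling" through irrelevant pages

def child_state(parent_state, parent_is_relevant):
    """Return the path state to attach to the outlinks of a crawled page."""
    if parent_is_relevant:
        return PathState(hops_since_relevant=0)   # the path restarts from a relevant page
    return PathState(hops_since_relevant=parent_state.hops_since_relevant + 1)

def should_enqueue(state):
    """Discard links whose run of consecutive irrelevant pages is too long."""
    return state.hops_since_relevant <= MAX_IRRELEVANT_HOPS

# Toy usage: Homepage (irrelevant) -> Journals (irrelevant) -> Issue (irrelevant).
state = PathState(0)
for page_is_relevant in [False, False, False]:
    state = child_state(state, page_is_relevant)
    print(state, "enqueue:", should_enqueue(state))
```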

Algorithms

In Section 3.1.2, we mentioned some approaches to compute the relevance of the crawled page pc given its features. In link classification, similar strategies are adopted to assign a crawling priority to the discovered page pd. We review some strategies for the link classification task.

The first work on focused crawling implements the link classification task in a component called distiller [18]. It periodically analyzes the discovered Web graph and identifies the crawled Web pages that frequently link to relevant Web pages, i.e., the hubs. Another approach assigns a priority to the outlinks based on the inherited, content-based relevance score of the parent page [24]. A focused crawler that estimates the relevance of pd using the local features only does not implement a link classifier module at all. In this scenario, the page classifier is solely responsible for assigning an importance to each of the discovered outlinks. It is easy to deduce that this strategy does not perform well, as in general the outlinks of a page can be very different in semantics, even when they have common parents and siblings [47]. For this reason, the link classifier is usually implemented as a separate, independent module that properly weights some of the local features mentioned above.

The remaining features of pd are extracted in the form of text from the crawled contents. For this reason, the link classification task is treated as a text categorization task, similar to the page classification one (a minimal sketch of this view is given below).
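Concretely, the anchor text of a link, the tokens of its URL and the text surrounding the anchor can be concatenated into a small pseudo-document and fed to any off-the-shelf text classifier. The sketch below illustrates this view with a bag-of-words Naive Bayes model from scikit-learn; the training examples, labels and names are invented for the example and do not correspond to any dataset used in this thesis.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def link_context(anchor_text, url, surrounding_text):
    """Build the pseudo-document describing a link: anchor text,
    URL tokens and the text around the anchor."""
    url_tokens = " ".join(re.findall(r"[a-zA-Z]+", url))
    return " ".join([anchor_text, url_tokens, surrounding_text])

# Invented training examples: link contexts labeled as leading (1) or
# not leading (0) to pages relevant for a "programming" focus.
contexts = [
    link_context("C++ templates tutorial", "http://ex.org/cpp/templates", "generic programming"),
    link_context("STL reference", "http://ex.org/cpp/stl", "containers and algorithms"),
    link_context("weather forecast", "http://ex.org/news/weather", "rain expected tomorrow"),
    link_context("tv series finale", "http://ex.org/tv/series", "season recap"),
]
labels = [1, 1, 0, 0]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(contexts, labels)

new_link = link_context("Boost libraries", "http://ex.org/cpp/boost", "C++ source code")
print(classifier.predict_proba([new_link]))  # probability of leading to a relevant page
```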

3.1.4 URL selection

In the following, we illustrate some well-known strategies proposed in the literature for the URL selection. When useful, we also consider the exploitation versus exploration issue reported in Section 2.2.3.

Static selection

The concept of crawl ordering has been formulated prior to the focused crawling idea. At that time, the goal was to improve the precision and the recall of a general-purpose search engine [21,23,34]. Nevertheless, the principle of frontier management in focused crawling is the same, and the quality concept can be easily defined by the scores computed by the link classifier [18,52]. We refer to this kind of URL picking as static selection, since it treats the distribution of the links over the Web as if it were homogeneous. As a consequence, the harvest rate of a crawl adopting static selection is sensitive to the quality of the starting seeds and to the effectiveness of content locality over the Web graph.

The approach par excellence to exploit static content locality is the best-first search [21]. This paradigm is very simple: the next link to pick from the frontier is always the one with the highest relevance score. The assumption here is that high quality pages exhibit better features than other pages, provided that their content-independent estimations are accurate. Empirical studies showed that best-first selection can outperform breadth-first search [4,6,21,53]. The selection procedure can be generalized in such a way that the N best links are selected in a batch, rather than just the first one. The parameter N tunes the greediness of the selection: when a small N is set, the crawler tends to be exploitative; conversely, with a large N the crawler behaves more exploratively.
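A possible realization of best-first selection is a max-priority queue over the scores produced by the link classifier, from which batches of N links are popped at a time. The sketch below is an illustrative implementation under these assumptions; class and method names are not taken from any existing crawler.

```python
import heapq

class BestFirstFrontier:
    """Frontier that always returns the highest-scored links first."""

    def __init__(self):
        self._heap = []    # heapq is a min-heap, so scores are stored negated
        self._counter = 0  # tie-breaker preserving insertion order among equal scores

    def add(self, url, score):
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def next_batch(self, n=1):
        """Pop the N best links: small N -> exploitative, large N -> explorative."""
        batch = []
        while self._heap and len(batch) < n:
            neg_score, _, url = heapq.heappop(self._heap)
            batch.append((url, -neg_score))
        return batch

# Toy usage.
frontier = BestFirstFrontier()
frontier.add("http://ex.org/cpp", 0.9)
frontier.add("http://ex.org/weather", 0.2)
frontier.add("http://ex.org/stl", 0.8)
print(frontier.next_batch(n=2))  # [('http://ex.org/cpp', 0.9), ('http://ex.org/stl', 0.8)]
```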

Two alternative approaches to best-first search that exploit the relevance clustering over the Web are the fish-search [23] and its successor, the shark-search [34]. The idea is to detect the regions of the discovered Web graph that contain relevant pages and to perform an extensive crawling of those regions. In the meanwhile, the irrelevant sub-graphs encountered during the crawling are not deeply explored, but also not completely discarded, as they would be in best-first search. This can lead to the discovery of new relevant pages after a short path of non-relevant ones. For such purposes, the selection of the next URL to fetch is not guided solely by the score produced by the link classifier. Instead, every link inherits an additional score from the ancestor pages, i.e., the inherited relevance score. The inherited score to propagate from the seeds is initialized to a predefined value. If a crawled page pc is not relevant for the page classifier, then the inherited score of its discovered outlinks is computed as the inherited score of pc multiplied by a decay factor δ ∈ (0, 1). Otherwise, their inherited score is obtained by multiplying the relevance score of pc from the page classifier with δ. The score associated with the link l in the frontier is computed as:

score(l) = γ inh_score(l) + (1−γ) rel_score(l),

where inh_score(l) is the inherited score of l, rel_score(l) is the relevance score produced by the link classifier for l, and γ ∈ [0, 1] is a predefined constant weighting the influence of the two scores. To limit the traversal of consecutive irrelevant pages, a depth value d ∈ N is defined for each crawled page pc. The authors of shark-search set d = 3 in all of their experiments [34]. Such a depth value is associated with the seeds and with pc if it is relevant for the page classifier. If pc is not relevant, its depth value is set to the depth value of the parent page decremented by one. If the depth value of pc becomes zero, none of its outlinks is classified and added to the frontier; in this way, the current path is not explored any further. As in best-first search, exploitation and exploration can be balanced simply by crawling a batch of the N top links from the frontier at each step, rather than just the first one (a minimal sketch of this scoring scheme follows).
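The following sketch condenses the shark-search rules described above: the inherited score decays by δ, the frontier score mixes inherited and link-classifier scores through γ, and the depth counter cuts off paths of consecutive irrelevant pages. The values of δ and γ are arbitrary placeholders; only the default d = 3 follows [34].

```python
DELTA = 0.5   # decay factor delta in (0, 1), illustrative value
GAMMA = 0.6   # weight gamma in [0, 1] balancing inherited vs. link-classifier score

def child_inherited_score(parent_inherited, parent_relevance, parent_is_relevant):
    """Inherited score propagated to the outlinks of a crawled page p_c."""
    if parent_is_relevant:
        return DELTA * parent_relevance   # propagate the decayed page classifier score
    return DELTA * parent_inherited       # propagate the decayed inherited score

def child_depth(parent_depth, parent_is_relevant, d_max=3):
    """Depth budget of p_c: reset to d_max when relevant, otherwise decremented."""
    return d_max if parent_is_relevant else parent_depth - 1

def frontier_score(inherited_score, link_relevance_score):
    """score(l) = gamma * inh_score(l) + (1 - gamma) * rel_score(l)."""
    return GAMMA * inherited_score + (1 - GAMMA) * link_relevance_score

# Toy usage: an irrelevant page whose parent had inherited score 0.8 and depth 2.
inh = child_inherited_score(parent_inherited=0.8, parent_relevance=0.1,
                            parent_is_relevant=False)
depth = child_depth(parent_depth=2, parent_is_relevant=False)
if depth > 0:  # otherwise the path is abandoned and no outlink is enqueued
    print(frontier_score(inh, link_relevance_score=0.4), "remaining depth:", depth)
```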

Adaptive selection

Other paradigms for link selection allow the crawling policy to be adapted during its execution. The purpose of adaptive selection is to improve the average quality of the picked URLs as the crawling proceeds, possibly at the expense of a lower harvest rate at the beginning of the spidering. An adaptation is triggered by some detected events, such as a variation in the curve of the harvest rate or a refinement of the focus definition during the crawling session [55]. This kind of frontier management gives the system the ability of self-learning, i.e., a periodic retraining of its classifiers to select the best features and their weights at that epoch.

In the state of the art, a popular adaptive strategy is InfoSpiders [51]. This approach distributes the search for relevant pages among different parallel and independent modules, called agents. Each agent works on a local frontier, i.e., an assigned partition of the frontier. The agent is initialized with a predefined score, called energy. Each agent estimates the quality of the links it discovers using its own link classifier. According to the relevance scores, the agent prioritizes such links in the local frontier. When the crawler selects a link from the local frontier of an agent, the page is downloaded and evaluated by the page classifier according to its contents. If the prediction was correct, the agent receives an increment of its energy as a reward. If, instead, the prediction was wrong, the agent loses some of its energy as a penalty. An agent reproduces if its energy gets over an upper threshold; the local frontier of the new agent is half of the local frontier of its parent. On the contrary, an agent dies if its energy gets below a lower threshold. In this case, the failure is propagated backwards to the ancestors of the agent, which retrain their own link classifiers accordingly. In InfoSpiders, the selection of the link to fetch is regulated by a parametric stochastic function, which also contains some tunable parameters to balance exploitation and exploration.
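To illustrate the energy mechanism, the sketch below rewards or penalizes an agent after each page evaluation and checks the reproduction and death thresholds. The constants and the splitting policy are simplified assumptions and do not reproduce the exact rules of [51].

```python
REWARD = 0.2        # energy gained for a correct prediction (illustrative)
PENALTY = 0.1       # energy lost for a wrong prediction (illustrative)
REPRODUCE_AT = 2.0  # upper threshold triggering reproduction
DIE_AT = 0.0        # lower threshold triggering the agent's death

class Agent:
    """A simplified InfoSpiders-like agent with its own local frontier."""

    def __init__(self, energy, local_frontier):
        self.energy = energy
        self.local_frontier = local_frontier  # list of (url, predicted_score)

    def feedback(self, prediction_was_correct):
        """Update the energy after the page classifier judged a fetched page."""
        self.energy += REWARD if prediction_was_correct else -PENALTY

    def step(self):
        """Return 'reproduce', 'die' or 'continue' according to the current energy."""
        if self.energy >= REPRODUCE_AT:
            return "reproduce"  # the parent hands half of its frontier to the child
        if self.energy <= DIE_AT:
            return "die"        # the failure is propagated back to the ancestors
        return "continue"

    def reproduce(self):
        """Split the local frontier in half and spawn a child agent."""
        half = len(self.local_frontier) // 2
        child = Agent(self.energy / 2, self.local_frontier[half:])
        self.local_frontier = self.local_frontier[:half]
        self.energy /= 2
        return child

# Toy usage: an agent that keeps making correct predictions eventually reproduces.
agent = Agent(energy=1.0, local_frontier=[("http://ex.org/%d" % i, 0.5) for i in range(6)])
while agent.step() == "continue":
    agent.feedback(prediction_was_correct=True)
print(agent.step(), len(agent.reproduce().local_frontier))
```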

Another family of adaptive crawl ordering schemes has been introduced as intelligent crawling [2]. The crawler is initialized with some arbitrary, user-defined predicates such as keywords or topical queries. It does not require any representative example to train the relevance classifier. Then, starting from the seeds, it begins to crawl like a general-purpose crawler. As the repository of crawled Web pages grows, it gradually learns how to estimate and how to select the URLs that satisfy the requested predicates. The curve of the resulting harvest rate gradually improves as well.

The intelligent crawling approach is improved in [17]. The main improvement is given by the linear combination of two classifiers. The first one is a binary classifier called baseline. It requires an offline training with some labeled page contents as examples. At the beginning of the crawling, the system discovers, fetches and evaluates new Web pages according to the outcomes of such classifier. In the meanwhile, a second classifier called apprentice is continuously trained online using the links from which the Web pages have been discovered, according to the trend of the harvest rate. Specifically, it learns to predict, from the features of a link and of its surrounding context, how likely the linked page is to be judged relevant by the baseline classifier.
