Information Retrieval - University of Pisa
Crawling the web
Marco Cornolti
Aim: crawling
Crawling: downloading pages by following links
● Web structure: a graph (nodes = pages, edges = HTML links)
● Limit the crawling depth
(Figure: example web graph with pages P1–P6; links reach depth d=1, d=2, d=3 from the entry node P1)
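The depth-limited traversal sketched in the figure can be written as a breadth-first visit. The toy in-memory graph below stands in for the web (the page names and links are illustrative, mirroring the figure, not real crawling):

```python
from collections import deque

# Toy web graph: page -> outgoing links (illustrative, mirrors the figure)
GRAPH = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P5"],
    "P4": ["P6"],
    "P5": [],
    "P6": [],
}

def crawl(start, max_depth):
    """Breadth-first visit; pages more than max_depth links away are skipped."""
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in GRAPH[page]:
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

print(crawl("P1", 2))  # -> ['P1', 'P2', 'P3', 'P4', 'P5']
```

A real crawler replaces the dictionary lookup with an HTTP download plus link extraction, which is exactly what scrapy does below.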
Exercise: crawling IMDB
Starting from a specific link on IMDB
1. Download all movie pages within a certain distance
2. Extract information from them
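Step 2 (information extraction) might look like the sketch below, here run on an inline, well-formed XHTML snippet so the stdlib parser can read it; the markup is invented for illustration, and on real downloaded pages you would use lxml or scrapy selectors, which tolerate messy HTML:

```python
import xml.etree.ElementTree as ET

# Invented miniature of a movie page (well-formed on purpose)
page = ET.fromstring(
    "<html><body>"
    "<h1>The Shawshank Redemption</h1>"
    "<span id='year'>1994</span>"
    "</body></html>"
)

# XPath-style queries, as also used with scrapy's response.xpath
title = page.find(".//h1").text
year = page.find(".//span[@id='year']").text
print(title, year)  # -> The Shawshank Redemption 1994
```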
Environment
● Create a directory ir/pages/
From terminal:
sudo apt-get install python3-scrapy python3-lxml
Crawling the web with scrapy
edit imdb_crawl.py:
import scrapy

class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]  # domain restriction
    start_urls = [
        "http://www.imdb.com/chart/top/"]  # entry web node

    def parse(self, response):
        # response body (generally HTML code)
        print("Body:", response.body)
run the crawler:
scrapy runspider imdb_crawl.py
(alternatively, if that doesn’t work:)
python -m scrapy.cmdline runspider imdb_crawl.py
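The depth limit from the aim slide can be enforced by scrapy itself through its DEPTH_LIMIT setting, passed on the command line with -s (the value 2 here is illustrative):

```shell
scrapy runspider imdb_crawl.py -s DEPTH_LIMIT=2
```

With DEPTH_LIMIT set, scrapy stops following links found on pages more than that many hops from the start URL.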
Writing webpages to file
import scrapy
import re
import os

STORAGE_DIR = "pages"  # output directory (must be created before running)
URL_REGEX = r"http://www.imdb.com/title/(tt\d{7})/\?.*"

class ImdbSpider(scrapy.Spider):
    name = "imdb"  # same as before
    allowed_domains = ["www.imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    def parse(self, response):
        # keep HTML only (the header value is bytes)
        if not response.headers.get("Content-Type", b"").startswith(b"text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            # filenames look like tt0123456.html
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
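A quick sanity check of URL_REGEX (the example URLs are illustrative): only movie pages with a query string after /tt…/ match, because of the trailing \?.* part, so chart pages are crawled for links but never stored.

```python
import re

URL_REGEX = r"http://www.imdb.com/title/(tt\d{7})/\?.*"

# A movie URL matches; group(1) captures the title id used as the filename
m = re.match(URL_REGEX, "http://www.imdb.com/title/tt0111161/?ref_=chttp_t_1")
print(m.group(1))  # -> tt0111161

# The chart page does not match, so it is not written to disk
print(re.match(URL_REGEX, "http://www.imdb.com/chart/top"))  # -> None
```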
Following links
add to imdb_crawl.py:
[...]
    def parse(self, response):
        if not response.headers.get("Content-Type", b"").startswith(b"text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
        # only follow links that lead to movies
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            if re.match(URL_REGEX, url):
                yield scrapy.Request(url)
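response.urljoin resolves a relative href against the URL of the page it was found on; the stdlib urllib.parse.urljoin applies the same resolution rules (the href below is an invented example):

```python
from urllib.parse import urljoin

# A root-relative movie link resolved against the chart page it appears on
base = "http://www.imdb.com/chart/top"
href = "/title/tt0111161/?ref_=chttp_t_1"
print(urljoin(base, href))
# -> http://www.imdb.com/title/tt0111161/?ref_=chttp_t_1
```

Resolving before the re.match test matters: the raw href is relative and would never match URL_REGEX, which expects an absolute URL.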