Information Retrieval - University of Pisa
Crawling the web
Marco Cornolti
Aim: crawling
Crawling: downloading pages by following links
● Web structure: a graph (nodes = pages, edges = HTML links)
● Limit the crawling depth
(Figure: example web graph with pages P1–P6; links reach depth d=1, d=2, d=3 from the entry node P1)
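The depth-limited traversal sketched in the figure can be written as a breadth-first visit. The toy in-memory graph below stands in for the web (the page names and links are illustrative, mirroring the figure, not real crawling):

```python
from collections import deque

# Toy web graph: page -> outgoing links (illustrative, mirrors the figure)
GRAPH = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P5"],
    "P4": ["P6"],
    "P5": [],
    "P6": [],
}

def crawl(start, max_depth):
    """Breadth-first visit; pages more than max_depth links away are skipped."""
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in GRAPH[page]:
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

print(crawl("P1", 2))  # -> ['P1', 'P2', 'P3', 'P4', 'P5']
```

A real crawler replaces the dictionary lookup with an HTTP download plus link extraction, which is exactly what scrapy does below.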
Exercise: crawling IMDB
Starting from a specific link on IMDB
1. Download all movie pages within a certain distance
2. Extract information from them
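Step 2 (information extraction) might look like the sketch below, here run on an inline, well-formed XHTML snippet so the stdlib parser can read it; the markup is invented for illustration, and on real downloaded pages you would use lxml or scrapy selectors, which tolerate messy HTML:

```python
import xml.etree.ElementTree as ET

# Invented miniature of a movie page (well-formed on purpose)
page = ET.fromstring(
    "<html><body>"
    "<h1>The Shawshank Redemption</h1>"
    "<span id='year'>1994</span>"
    "</body></html>"
)

# XPath-style queries, as also used with scrapy's response.xpath
title = page.find(".//h1").text
year = page.find(".//span[@id='year']").text
print(title, year)  # -> The Shawshank Redemption 1994
```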
Environment
● Create a directory ir/pages/
From terminal:
sudo apt-get install python3-scrapy python3-lxml
Crawling the web with scrapy
edit imdb_crawl.py:
import scrapy

class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]  # domain restriction
    start_urls = [
        "http://www.imdb.com/chart/top/"]  # entry web node

    def parse(self, response):
        # response body (generally HTML code)
        print("Body:", response.body)
run the crawler:
scrapy runspider imdb_crawl.py
(alternatively, if that doesn’t work:)
python -m scrapy.cmdline runspider imdb_crawl.py
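The depth limit from the aim slide can be enforced by scrapy itself through its DEPTH_LIMIT setting, passed on the command line with -s (the value 2 here is illustrative):

```shell
scrapy runspider imdb_crawl.py -s DEPTH_LIMIT=2
```

With DEPTH_LIMIT set, scrapy stops following links found on pages more than that many hops from the start URL.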
Writing webpages to file
import scrapy
import re
import os

STORAGE_DIR = "pages"  # output directory (must be created before running)
URL_REGEX = r"http://www.imdb.com/title/(tt\d{7})/\?.*"

class ImdbSpider(scrapy.Spider):
    name = "imdb"  # same as before
    allowed_domains = ["www.imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    def parse(self, response):
        # keep HTML only (the header value is bytes)
        if not response.headers.get("Content-Type", b"").startswith(b"text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            # filenames look like tt0123456.html
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
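A quick sanity check of URL_REGEX (the example URLs are illustrative): only movie pages with a query string after /tt…/ match, because of the trailing \?.* part, so chart pages are crawled for links but never stored.

```python
import re

URL_REGEX = r"http://www.imdb.com/title/(tt\d{7})/\?.*"

# A movie URL matches; group(1) captures the title id used as the filename
m = re.match(URL_REGEX, "http://www.imdb.com/title/tt0111161/?ref_=chttp_t_1")
print(m.group(1))  # -> tt0111161

# The chart page does not match, so it is not written to disk
print(re.match(URL_REGEX, "http://www.imdb.com/chart/top"))  # -> None
```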
Following links
add to imdb_crawl.py:
[...]
    def parse(self, response):
        if not response.headers.get("Content-Type", b"").startswith(b"text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
        # only follow links that lead to movies
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            if re.match(URL_REGEX, url):
                yield scrapy.Request(url)
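response.urljoin resolves a relative href against the URL of the page it was found on; the stdlib urllib.parse.urljoin applies the same resolution rules (the href below is an invented example):

```python
from urllib.parse import urljoin

# A root-relative movie link resolved against the chart page it appears on
base = "http://www.imdb.com/chart/top"
href = "/title/tt0111161/?ref_=chttp_t_1"
print(urljoin(base, href))
# -> http://www.imdb.com/title/tt0111161/?ref_=chttp_t_1
```

Resolving before the re.match test matters: the raw href is relative and would never match URL_REGEX, which expects an absolute URL.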