
Crawling the web


(1)

Information Retrieval - University of Pisa

Crawling the web

Marco Cornolti

(2)

Aim: crawling

Crawling: downloading pages by following links

● Web structure: a graph (nodes = pages, edges = HTML links)

● Limit the crawling depth (see the sketch after the figure)

[Figure: example web graph with pages P1-P6 linked to each other; each page is annotated with its distance d=1, d=2, d=3 from the entry page]
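To make the depth limit concrete, here is a minimal sketch of breadth-first, depth-limited crawling in plain Python (no scrapy; fetch_links is a hypothetical helper that downloads a page and returns the URLs it links to):

def crawl(entry_url, max_depth):
    visited = set([entry_url])          # pages already seen
    frontier = [(entry_url, 0)]         # breadth-first queue of (url, depth)
    while frontier:
        url, depth = frontier.pop(0)
        if depth >= max_depth:
            continue                    # do not follow links past the limit
        for link in fetch_links(url):   # hypothetical: download url, extract its links
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return visited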

(3)

Exercise: crawling IMDB

Starting from a specific link on IMDB

1. Download all movie pages within a certain distance

2. Extract information from them

(4)

Environment

● Create a directory ir/pages/

From terminal:

sudo apt-get install python-scrapy python-lxml
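The directory can be created from the terminal as well, e.g.:

mkdir -p ir/pages

(the spider in the next slides writes to the relative path pages, so it is presumably meant to be run from inside ir/)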

(5)

Crawling the web with scrapy

edit imdb_crawl.py:

import scrapy

class ImdbSpider(scrapy.Spider):
    name = "imdb"
    # domain restriction
    allowed_domains = ["www.imdb.com"]
    # entry web node
    start_urls = ["http://www.imdb.com/chart/top/"]

    def parse(self, response):
        # response body (generally HTML code)
        print "Body:", response.body

run the crawler:

scrapy runspider imdb_crawl.py

(alternatively, if that doesn’t work:)

python -m scrapy.cmdline runspider imdb_crawl.py

(6)

Writing webpages to file

import scrapy
import re
import os

# output directory (must be created before running)
STORAGE_DIR = "pages"
URL_REGEX = "http://www.imdb.com/title/(tt\d{7})/\?.*"

class ImdbSpider(scrapy.Spider):
    # same as before
    name = "imdb"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    def parse(self, response):
        # keep HTML only
        if not response.headers.get("Content-Type").startswith("text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            # filenames look like tt0123456.html
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
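As a worked example of URL_REGEX (the query string is hypothetical; tt0091042 is the title id used later in these slides):

import re
URL_REGEX = "http://www.imdb.com/title/(tt\d{7})/\?.*"
m = re.match(URL_REGEX, "http://www.imdb.com/title/tt0091042/?ref_=chttp_t_1")
print m.group(1)   # prints: tt0091042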

(7)

Following links

add to imdb_crawl.py:

[...]
    def parse(self, response):
        # same as before: keep HTML only, save movie pages to file
        if not response.headers.get("Content-Type").startswith("text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
        # get HTML links; only follow links that lead to movies
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            if re.match(URL_REGEX, url):
                yield scrapy.Request(url)

run the crawler from terminal:

scrapy runspider imdb_crawl.py -s DEPTH_LIMIT=1
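response.urljoin resolves relative links against the page's own URL; it behaves like the standard library's urljoin. A quick illustration (the relative path and query string are made up):

from urlparse import urljoin   # Python 2 standard library

print urljoin("http://www.imdb.com/chart/top", "/title/tt0091042/?pf_rd_m=x")
# prints: http://www.imdb.com/title/tt0091042/?pf_rd_m=x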

(8)

From HTML to data

● HTML contains lots of stuff we are not interested in:

○ links

○ references to images

○ formatting

○ JavaScript code

○ CSS styles

We need to keep interesting data only.
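For instance, a rough sketch of how one could drop the uninteresting parts with lxml (visible_text is a hypothetical helper, not part of the exercise):

from lxml import etree

def visible_text(html_file):
    tree = etree.parse(html_file, etree.HTMLParser())
    # drop <script> and <style> subtrees entirely
    etree.strip_elements(tree, "script", "style", with_tail=False)
    # concatenate the text of the remaining nodes
    return " ".join(tree.getroot().itertext())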

(9)

HTML tree

<html>
  <body>
    <script>...</script>
    <h1>The Godfather</h1>
    <style>...</style>
    <p itemprop="description">
      The <b>aging patriarch</b> of an organized crime dynasty...
    </p>
  </body>
</html>

[Figure: the same document as a tree — <html> has child <body>; <body> has children <script>, <h1>, <style>, <p>; <b> is a child of <p>. Each node carries its text, its tail (the text following its closing tag), and its children.]
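A quick illustration of text vs. tail, using the <p> element from the snippet above:

from lxml import etree

p = etree.fromstring(
    '<p itemprop="description">The <b>aging patriarch</b>'
    ' of an organized crime dynasty...</p>')
print p.text        # 'The '  (text before the first child)
b = p[0]            # the <b> child
print b.text        # 'aging patriarch'
print b.tail        # ' of an organized crime dynasty...'  (text after </b>)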

(10)

Data extraction

edit processhtml.py:

from lxml import etree

def html_to_data(html_file):
    parser = etree.HTMLParser()
    tree = etree.parse(html_file, parser)
    if tree.getroot() is None:
        return None
    title = None
    # what we extract depends on what we need
    nodes_title = tree.xpath('//meta[@property="og:title"]/@content')
    if nodes_title:
        title = nodes_title[0]
    return title

edit parse_test.py:

from processhtml import *

if __name__ == "__main__":
    # just an example page
    print html_to_data("pages/tt0091042.html")

(11)

Exploring the file system

let’s rewrite parse_test.py:

import os
from processhtml import *

PAGES_DIR = "pages"

for filename in os.listdir(PAGES_DIR):
    print html_to_data(os.path.join(PAGES_DIR, filename))
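A small robustness tweak (an assumption, not part of the original exercise): skip files that are not crawled .html pages, in case the directory contains anything else:

for filename in os.listdir(PAGES_DIR):
    if filename.endswith(".html"):   # assumption: only crawled pages end in .html
        print html_to_data(os.path.join(PAGES_DIR, filename))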

(12)

And More...

Extract more fields from IMDB pages: director, description, runtime, vote, genre, etc.
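For instance, a possible sketch for the description field, based on the itemprop="description" markup shown on the HTML tree slide (IMDB's real markup may differ, so this XPath is an assumption):

# inside html_to_data, after extracting the title:
description = None
nodes_descr = tree.xpath('//p[@itemprop="description"]')
if nodes_descr:
    # join the text of the <p> node and its children (e.g. <b>)
    description = "".join(nodes_descr[0].itertext()).strip()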

(optional) download more IMDB pages from:

http://bit.ly/1SYuht7

(13)

Pills of XPath

● High-level introduction

● W3C Tutorial.

● Extensions for Chrome and Firefox.

● XPath exercises.
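As a quick reference, the two XPath expressions already used in these slides:

● //a/@href — the href attribute of every <a> element, anywhere in the document

● //meta[@property="og:title"]/@content — the content attribute of the <meta> elements whose property attribute equals "og:title"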
