Project #2:
Compressing a large web collection
The problem consists of grouping Web pages and then compressing each group individually with the LZMA2 compressor.
The goal is to find the grouping strategy that minimizes the total compressed size. This problem has important applications in the storage of text collections in search engines.
Two main paradigms (LZMA2 over the original order, or over a URL-sorted order):
• Find a serialization of the collection in a file, then compress it with LZMA2.
• Cluster the documents, then compress each cluster with LZMA2.
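A minimal sketch of the first paradigm, using Python's standard-library lzma module (which emits LZMA2-compressed .xz streams). The URLs and page contents below are invented toy data standing in for the real crawl; the point is only that the serialization order fed to the compressor is the tunable knob:

```python
import lzma

# Toy corpus of (url, html) pairs; the real input would be the Gov2 crawl.
pages = [
    ("http://b.gov/page1", "<html>agency b page one</html>"),
    ("http://a.gov/page1", "<html>agency a page one</html>"),
    ("http://a.gov/page2", "<html>agency a page two</html>"),
]

def compressed_size(docs):
    """Serialize the documents in the given order, compress with LZMA2 (xz)."""
    blob = "\n".join(html for _, html in docs).encode("utf-8")
    return len(lzma.compress(blob, preset=9))

baseline = compressed_size(pages)            # original crawl order
url_sorted = compressed_size(sorted(pages))  # URL-sorted order
print(baseline, url_sorted)
```

URL sorting tends to help because pages from the same site share boilerplate, so near-duplicate text ends up inside the compressor's window; on a corpus this tiny the effect may not show.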
A few key issues:
• Which clustering? Which similarity? Which cluster size?
• What about graph-based clustering: Just graph construction?
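For the similarity question, a common choice is the Jaccard similarity of character shingles, estimated at scale with MinHash (the idea behind the LSH library linked below). This is a minimal stdlib-only sketch; shingle length and the number of hash functions are arbitrary illustrative parameters:

```python
import hashlib

def shingles(text, k=8):
    """Set of k-character shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_signature(sh, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in sh))
    return sig

def minhash_similarity(sig1, sig2):
    """Fraction of agreeing positions: an estimator of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Documents whose estimated similarity exceeds a threshold would then be placed in the same cluster before per-cluster compression.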
Datasets: part of the Gov2 crawl – 2.7 GB 7-zipped.
Software and Libraries
Sorting large datasets: Unix sort or STXXL
http://stxxl.sourceforge.net/
Library for LSH implementation and shingling
http://www.mit.edu/~andoni/LSH/
http://research.microsoft.com/en-us/downloads/4e0d0535-ff4c-4259-99fa-ab34f3f57d67/
Weka - implements k-means and k-means++
http://www.cs.waikato.ac.nz/~ml/weka/
Graph partition
http://algo2.iti.kit.edu/documents/kahip/index.html
LZMA2 compressor
http://www.7-zip.org/sdk.html
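Putting the second paradigm together: cluster the documents (with k-means, LSH, or graph partitioning from the tools above), then compress each cluster separately and compare the total against compressing everything as one block. A toy sketch with an invented two-cluster grouping, again using the stdlib lzma module in place of the 7-Zip SDK:

```python
import lzma

# Toy documents; real clusters would come from k-means, LSH, or KaHIP.
docs = [
    "energy department annual report 2004",
    "energy department annual report 2005",
    "census bureau population estimates",
    "census bureau housing statistics",
]

def compress_cluster(cluster):
    """Concatenate a cluster's documents and compress them together."""
    return lzma.compress("\n".join(cluster).encode("utf-8"), preset=9)

# Hypothetical grouping: two clusters of mutually similar documents.
clusters = [docs[:2], docs[2:]]
clustered_size = sum(len(compress_cluster(c)) for c in clusters)
single_size = len(compress_cluster(docs))
print(clustered_size, single_size)
```

Per-cluster compression trades some cross-cluster redundancy for the ability to decompress one group at a time, which is what a search engine's document store needs; the project's question is which grouping makes that trade cheapest.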
Three papers to read (skipping the proofs) to find inspiration for the coding.