Project #2:
Compressing a large web collection
The problem consists of grouping Web pages and then compressing each group individually with the LZMA2 compressor.
The goal is to find the grouping strategy that minimizes the total compressed size. This problem has important applications in the storage of text collections in search engines.
Two main paradigms (LZMA2 over the original order, or over a URL-sorted order):
• Find a serialization of the collection in a file, then compress it with LZMA2.
• Cluster the documents, then compress each cluster with LZMA2.
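A minimal sketch of the first paradigm, using Python's standard-library lzma module (which emits LZMA2-compressed .xz streams). The URLs and page contents below are invented toy data standing in for the real crawl; the point is only that the serialization order fed to the compressor is the tunable knob:

```python
import lzma

# Toy corpus of (url, html) pairs; the real input would be the Gov2 crawl.
pages = [
    ("http://b.gov/page1", "<html>agency b page one</html>"),
    ("http://a.gov/page1", "<html>agency a page one</html>"),
    ("http://a.gov/page2", "<html>agency a page two</html>"),
]

def compressed_size(docs):
    """Serialize the documents in the given order, compress with LZMA2 (xz)."""
    blob = "\n".join(html for _, html in docs).encode("utf-8")
    return len(lzma.compress(blob, preset=9))

baseline = compressed_size(pages)            # original crawl order
url_sorted = compressed_size(sorted(pages))  # URL-sorted order
print(baseline, url_sorted)
```

URL sorting tends to help because pages from the same site share boilerplate, so near-duplicate text ends up inside the compressor's window; on a corpus this tiny the effect may not show.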
A few key issues:
• Which clustering? Which similarity? Which cluster size?
• What about graph-based clustering: Just graph construction?
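For the similarity question, a common choice is the Jaccard similarity of character shingles, estimated at scale with MinHash (the idea behind the LSH library linked below). This is a minimal stdlib-only sketch; shingle length and the number of hash functions are arbitrary illustrative parameters:

```python
import hashlib

def shingles(text, k=8):
    """Set of k-character shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_signature(sh, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in sh))
    return sig

def minhash_similarity(sig1, sig2):
    """Fraction of agreeing positions: an estimator of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Documents whose estimated similarity exceeds a threshold would then be placed in the same cluster before per-cluster compression.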
Datasets: part of the Gov2 crawl – 2.7 GB 7-zipped.
Software and Libraries
Sorting large datasets: Unix sort or STXXL
http://stxxl.sourceforge.net/
Library for LSH implementation and shingling
http://www.mit.edu/~andoni/LSH/
http://research.microsoft.com/en-us/downloads/4e0d0535-ff4c-4259-99fa-ab34f3f57d67/
Weka - implements k-means and k-means++
http://www.cs.waikato.ac.nz/~ml/weka/
Graph partition
http://algo2.iti.kit.edu/documents/kahip/index.html
LZMA2 compressor
http://www.7-zip.org/sdk.html
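Putting the second paradigm together: cluster the documents (with k-means, LSH, or graph partitioning from the tools above), then compress each cluster separately and compare the total against compressing everything as one block. A toy sketch with an invented two-cluster grouping, again using the stdlib lzma module in place of the 7-Zip SDK:

```python
import lzma

# Toy documents; real clusters would come from k-means, LSH, or KaHIP.
docs = [
    "energy department annual report 2004",
    "energy department annual report 2005",
    "census bureau population estimates",
    "census bureau housing statistics",
]

def compress_cluster(cluster):
    """Concatenate a cluster's documents and compress them together."""
    return lzma.compress("\n".join(cluster).encode("utf-8"), preset=9)

# Hypothetical grouping: two clusters of mutually similar documents.
clusters = [docs[:2], docs[2:]]
clustered_size = sum(len(compress_cluster(c)) for c in clusters)
single_size = len(compress_cluster(docs))
print(clustered_size, single_size)
```

Per-cluster compression trades some cross-cluster redundancy for the ability to decompress one group at a time, which is what a search engine's document store needs; the project's question is which grouping makes that trade cheapest.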
Three papers to read (skipping the proofs) to find inspiration for the coding.