• Non ci sono risultati.

Project #2: Compressing a large web collection

N/A
N/A
Protected

Academic year: 2021

Condividi "Project #2: Compressing a large web collection"

Copied!
2
0
0

Testo completo

(1)

Project #2:

Compressing a large web collection

The problem consists of grouping Web pages and then compressing each group individually via Lzma2 compressor.

The goal is to find the grouping strategy that minimizes the achieved compression. This problem has important applications in storage of text collections in search engines.

Two main paradigms (LZMA2 over original or over URL sorted):

• Find a serialization in a file, and then compress with LZMA2.

• Cluster docs, and then compress each cluster with Lzma2.

Few key issues:

• Which clustering ? Which similarity ? Which cluster size ?

• What about graph-based clustering: Just graph construction?

(2)

Project #2:

Compressing a large web collection

Datasets: Part of Gov2 crawl – 2.7 Gb 7zipped.

Software and Libraries

 Sorting large datasets: Unix sort or STXXL

http://stxxl.sourceforge.net/

 Library for LSH implementation and shingling

http://www.mit.edu/~andoni/LSH/

http://research.microsoft.com/en-us/downloads/4e0d0535-ff4c-4259-99f a-ab34f3f57d67

/

 Weka - implements k-means and k-means++

http://www.cs.waikato.ac.nz/~ml/weka/

 Graph partition

http://algo2.iti.kit.edu/documents/kahip/index.html

 Lzma2 compressor

 http://www.7-zip.org/sdk.html

3 papers to read (no proofs) and find inspiration for coding

Riferimenti

Documenti correlati

All these groups are taken from the Appendix A of [3- 2] except for a group that we modify because it has the same value of the previous group.. Really those group are not the

Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information.. Furthermore, Google is a complete architecture

● We say that a strongly connected component (SCC) in a directed graph is a subset of the nodes such that:. – (i) every node in the subset has a path to every

All you need is any iPod, from the early classic iPod to the latest iPod Nano, the smallest iPod Shuffle to the largest iPod Photo, and a digital camera. Just

PC proper- ties for each group are identified to support grouping and read-across:  size  shape  surface chemistry  solubility  composition  crystal structure  CNTs 

Group registration not only implies that supplies of goods and services exchanged between the members of the group are outside the scope of VAT, as regards taxable persons with

I soci di Lucca pensavano, dunque, che contandosi il debito del re nei confronti della compagnia - un po’ più di 40000 marchi - ed il debito di suo fratello Edmund, conte di

L’azione del vento su un edificio a torre può es- sere vista come somma di due contributi: da un lato “un’azione locale” che può essere schematiz- zata valutando i picchi