• Non ci sono risultati.

Data stream sta)s)cs

N/A
N/A
Protected

Academic year: 2021

Condividi "Data stream sta)s)cs"

Copied!
12
0
0

Testo completo

(1)

Data stream sta)s)cs

Filippo Geraci, CNR, Pisa

(2)

Cache and Bloom filters

•  Consider you have a proxy that caches web

pages. You may want not to cache a page that will be visited only once

– Solu)on: use a bloom filter. Once you have a

request first check whether it has already be seen.

If YES cache the page, otherwise NO. ANYWAY add the page to the Bloom filter.

(3)

Es)ma)on of the number of

occurrences

(4)

What about compu)ng distribu)ons?

•  Given highly skewed data I want to measure the frequency at least of the top elements

•  Facts:

– Counters are expected to be higher because of the contribu)on of other elements

– CM returns the counter with less noise

•  Idea

– Es)mate the contribu)on of noise for a specific counter

(5)

CMM – Count Mean-Min sketch

(6)

Heavy hi;ers

•  All the above data structures allow coun)ng or membership evalua)on.

•  How to know the most represented keys in a stream?

•  Un)l now:

– I can count how many keys exist,

– I can check if a par)cular key is present – I can count the number of its occurrences – …but I can’t do anything if I don’t know it

(7)

Bad news

•  Naïve solu)on:

– Sort data

•  There is no algorithm that solves the Heavy

Hi;ers problems in one pass while using a

sublinear amount of auxiliary space

(8)

A simple algorithm

•  Problem: find the elements that occur more than N/k )mes (N is the stream length, k is a free

parameter)

•  Solu)on:

–  Maintain a CM and a max-heap (with k elements) of the top elements

•  Process:

1.  Add the element in the CM and es)mate its frequency

2.  If frequency >= N/k insert the element in the heap 3.  Note: the number of elements in the heap must be

at most k

(9)

The Space saving algorithm - build

(10)

The Space saving algorithm - query

Note that counts are sorted in descending order in this implementa)on

(11)

References

•  New Es)ma)on Algorithms for Streaming Data: Count- min Can Do More

–  h;p://citeseerx.ist.psu.edu/viewdoc/download?

doi=10.1.1.420.449&rep=rep1&type=pdf

•  Efficient Computa)on of Frequent and Top-k Elements in Data Streams

–  h;p://citeseerx.ist.psu.edu/viewdoc/download?

doi=10.1.1.94.8360&rep=rep1&type=pdf

•  PROBABILISTIC DATA STRUCTURES FOR WEB ANALYTICS AND DATA MINING

–  h;ps://highlyscalable.wordpress.com/2012/05/01/

probabilis)c-structures-web-analy)cs-data-mining/

(12)

Datasets

•  Free Twi;er datasets

– h;p://followthehashtag.com/datasets/

•  Stackexchange Q&A website

– h;ps://archive.org/download/stackexchange

Riferimenti

Documenti correlati

•  Bloom filters reduce to linear counting when using only one hash function... Improving

–  I strongly think the element is in set –  Definitely not in set.. The bloom filter version of the spell checker. Number of words in

Use the 2-gram index to rank all lexicon terms according to the number of matching 2-grams (moon).

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over

Therefore implementation of one modern non-destructive control methods – metal magnetic memory – allows to solve one of the most important problems – flaw detection in

For example, suppose that a client c makes a read request cr with consistency level one (meaning that the request coordinator only needs to wait for the complete data from a

In this work we present NuChart-II, a software that allows the user to annotate and visualize a list of input genes with information relying on Hi-C data, integrating knowledge

Ante la eliminación total de los santos tras la reforma de la fe, se advirtió la necesidad de elaborar una martirología protestante, que pudiera tener efecto constrictor