• Non ci sono risultati.

The rsync algorithm

N/A
N/A
Protected

Academic year: 2021

Condividi "The rsync algorithm"

Copied!
22
0
0

Testo completo

(1)

The rsync algorithm

h1ps://rsync.samba.org/tech_report/tech_report.html

(2)

An easy problem

•  I have two files A and B. I want to make B equals to A

•  What is the cost?

– CPU

– Data moved (reads, writes)

(3)

The problem of rsync

•  A is stored in computer alpha and B in computer beta

•  The network link can be slow (at least it is much slower than CPU)

•  How can I save bandwidth?

(4)

A naïve approach

•  Beta compute a hash of the file B and send it to alpha

•  Alpha compute the hash of A and send back to beta either the hash (if the two hash are the same) or the content of A if they differ

•  Beta check if the message is the hash or has to update B

•  What is the cost?

•  What is the hash funcNon?

(5)

Cryptographic hash

1.  DeterminisNc

2.  Quick to compute

3.  Infeasible to generate a message from the hash

4.  A small change in the message should drasNcally change the hash

5.  It is infeasible to find collisions

(6)

Cryptographic hash

1.  DeterminisNc

2.  Quick to compute

3.  Infeasible to generate a message from the hash

4.  A small change in the message should drasNcally change the hash

5.  It is infeasible to find collisions

(7)

Can I do be1er?

•  Can I save bandwidth when A and B are similar?

(8)

SoluNon 1 - buckeNng

•  Weakness?

•  Can I do be1er?

(9)

SoluNon 2 - rolling

…and the green ones as well

(10)

Can I do be1er?

•  Intense use of cpu in alpha

(11)

SoluNon 3 – rolling hashing

•  A two hashing strategy

!"#$%&'( = !!,!! … !!!

!

! !, ! = ! !!

!

!!!

!"#!!!

! !, ! = ! (! − ! + 1)!!

!

!!!

!"#!!!

! !, ! = !! !, ! +!2!"!!(!, !)!

!

(12)

SoluNon 3 – rolling hashing

•  Is it M=216 a good idea?

•  Collisions?

! ! + 1, ! + 1 = !! !, ! +!!!!! −!!! !!"#!!!

! ! ! + 1, ! + 1 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

= ! ! !, ! !! − ! − ! + 1 !!!!!!!!!!!!!!

+ !(! + 1, ! + 1) !!"#!!!

!

•  A convenient way to derive next hash

(13)

QuesNons?

1.  What is the difference with the Rabin fingerprint?

2.  What is the difference with the KarpRabin searching algorithm?

(14)

SoluNon 4 - rsync

•  Use two hash funcNons

•  The rolling hashing for each possible offset

•  A stronger 128bit hash in case a collision is detected

– Rsync uses MD4

(15)

SoluNon 4 - rsync

•  Use two hash funcNons

•  The rolling hashing for each possible offset

•  A stronger 128bit hash in case a collision is detected

– Rsync uses MD4

•  How to generate collisions in MD4

– h1ps://eprint.iacr.org/2005/151.pdf

(16)

Checksum searching

•  Beta sent several checksums

•  For each test alpha performs a search on these checksums

•  Is linear scanning an opNon?

(17)

Checksum searching

•  Beta sent several checksums

•  For each test alpha performs a search on these checksums

•  Is linear scanning an opNon?

Binary search

Perfect hashing

•  What is the preprocessing and querying cost in terms of CPU and memory?

(18)

The rsync three way test

16bit

Rolling checksum

216 enNes

•  Search for a match in the table

– If nul the block is not found

(19)

The rsync three way test

•  Scan the sorted list to find a match with the second half of the checksum

16bit

Rolling checksum

216 enNes

(20)

The rsync three way test

•  Use the strong fingerprint to

confirm the match

16bit

Rolling checksum

216 enNes

(21)

The rsync three way test

•  What happens if two blocks in B have the same fingerprint?

•  How the list of blocks can be organized?

•  Is it possible to copy a corrupted file?

(22)

Things you may want to try and discuss next week

•  Test the karpRabin algorithm

•  Test binary search or perfect hashing

•  Test the impact of the length of the block

•  Small vs huge files

Riferimenti

Documenti correlati

• Con un appropriato dimensionamento della tabella hash, è possibile avere nel caso medio un costo (1) per tutte le operazioni. ‣ Intuitivamente, le liste di trabocco sono

Since algebraic and geometric multiplicities differ and hence there is no basis of R 4 consisting of eigenvectors of A, the matrix A cannot be

Otherwise indicate why diagonalization is not possible1. Otherwise indicate why diagonalization is

• quando, in seguito a un inserimento, si ha α = 1 (cioè ogni lista contiene in media 1 elemento), si crea una nuova tabella di dimensione 2M (cioè doppia), vi si trasferiscono i

2 h(k) 6= l , i.e., the element is in the middle of a chain ⇒ Save the pointer to the next element, erase the content of this slot and push it to the free slots stack. ,→ As for

 For each hash table entry, a list of elements stores all data items having the same hash function value. 

Data una tabella hash T in cui le collisioni sono risolte con le liste di trabocco, definire una procedura di ricerca di una chiave che sposti la chiave cercata, se presente

[r]