BHUPI RAJPUT: hash collition

Calculating the Probability of a Hash Collision

There are many choices of hash function, and the creation of a good hash function is still an active area of research. Some hash functions are fast; others are slow. Some distribute hash values evenly across the available range; others don’t. If you’re interested in the real-world performance of a few known hash functions, Charles Bloom and strchr.com offer some comparisons. For our purposes, let’s assume the hash function is pretty good — it distributes hash values evenly across the available range.
In this case, generating hash values for a collection of inputs is a lot like generating a collection of random numbers. Our question, then, translates into the following:

Given $k$ randomly generated values, where each value is a non-negative integer less than $N$ , what is the probability that at least two of them are equal?

It turns out it’s actually a bit simpler to start with the reverse question: What is the probability that they are all unique? Whatever the answer to the reverse question, we can just subtract it from one, and we’ll have the answer to our original question.
Given a space of $N$ possible hash values, suppose you’ve already picked a single value. After that, there are $N-1$ remaining values (out of a possible $N$ ) that are unique from the first. Therefore, the probability of randomly generating two integers that are unique from each other is $\frac{N-1}{N}$ .
After that, there are $N-2$ remaining values (out of a possible $N$ ) that are unique from the first two, which means that the probability of randomly generating three integers that are all unique is $\frac{N-1}{N}\times\frac{N-2}{N}$ . (We can multiply the probabilities together because each random number generation is an independent event.)
In general, the probability of randomly generating $k$ integers that are all unique is:

$\frac{N-1}{N}\times\frac{N-2}{N}\times\dots\times\frac{N-(k-2)}{N}\times\frac{N-(k-1)}{N}$

On a computer, this can be quite slow to evaluate for large k. Luckily, the above expression is approximately equal to:

$e^{\frac{-k(k-1)}{2N}}$

which is a lot faster to compute. How do we know this is a good approximation? Well, it can be shown analytically, using the Taylor expansion of $e^x$ and an epsilon-delta proof, that the approximation error tends to zero as $N$ increases. Or, you can just compute both values and compare them. Run the following Python script with different $N$ , and you’ll get a feeling for just how accurate the approximation is:

import math
N = 1000000
probUnique = 1.0
for k in xrange(1, 2000):
    probUnique = probUnique * (N - (k - 1)) / N
    print k, 1 - probUnique, 1 - math.exp(-0.5 * k * (k - 1) / N)

Great, so this magic expression serves as our probability that all values are unique. Subtract it from one, and you have the probability of a hash collision:

$1 - e^{\frac{-k(k-1)}{2N}}$

Here is a graph for $N = 2^{32}$ . This illustrates the probability of collision when using 32-bit hash values. It’s worth noting that a 50% chance of collision occurs when the number of hashes is 77163. Also note that the graph takes the same S-curved shape for any value of $N$ .

BHUPI RAJPUT

Tuesday, 11 September 2012

hash collition

Calculating the Probability of a Hash Collision

No comments:

Post a Comment

About Me