Calculating the Probability of a Hash Collision
There are many choices of hash function, and the creation of a good hash function is still an active area of research. Some hash functions are fast; others are slow. Some distribute hash values evenly across the available range; others don’t. If you’re interested in the real-world performance of a few known hash functions, Charles Bloom and strchr.com offer some comparisons. For our purposes, let’s assume the hash function is pretty good — it distributes hash values evenly across the available range.In this case, generating hash values for a collection of inputs is a lot like generating a collection of random numbers. Our question, then, translates into the following:
GivenIt turns out it’s actually a bit simpler to start with the reverse question: What is the probability that they are all unique? Whatever the answer to the reverse question, we can just subtract it from one, and we’ll have the answer to our original question.randomly generated values, where each value is a non-negative integer less than
, what is the probability that at least two of them are equal?
Given a space of
After that, there are
In general, the probability of randomly generating
import math
N = 1000000
probUnique = 1.0
for k in xrange(1, 2000):
probUnique = probUnique * (N - (k - 1)) / N
print k, 1 - probUnique, 1 - math.exp(-0.5 * k * (k - 1) / N)
Great, so this magic expression serves as our probability that all
values are unique. Subtract it from one, and you have the probability of
a hash collision:
No comments:
Post a Comment