Sunday, May 27, 2012

The trouble with small UID keys

The main database for a project I recently joined was designed with 5 character alpha-numeric random generated primary keys.

This provides just 62 degrees of freedom per character (26 lowercase + 26 uppercase + 10 numbers). In this case, to find the odds of picking any particular number, you take 62 to the power of 5 (62^5=916,132,832). This shows that the probability you will generate 2 random numbers from this pool ,and they both match, is about 1 out of 916 million. While that is very unlikely to happen, trouble starts once you begin adding more rows to the database. Because now, you are no longer testing for matches between 2 particular pairs of numbers. You are testing for ANY 2 pairs to match.

Small, random alpha-numeric values like this are not suitable for primary keys because they will quickly collide due to the phenomenon known as the Birthday Paradox.


The Birthday Paradox states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.


The reason this is so surprising is because we are used to comparing our particular birthdays with others. For example, if you meet someone randomly and ask him what his birthday is, the chance of the two of you having the same birthday is only 1/365 (0.27%). In other words, the probability of any two individuals having the same birthday is extremely low. Even if you ask 23 people, the probability is still low -- about 5%. So we feel like it is very rare to meet anyone with the same birthday as our own.


When you put 23 people in a room, however, the thing that changes is the fact that each of the 23 people is now asking each of the other 22 people about their birthdays. Each individual person only has a small (less than 5%) chance of success, but each person is trying it 22 times. That increases the probability dramatically. [source]


A shortcut to estimate how many people can be in a room before there is roughly 50% chance that 2 people will have the same birthday is to take the square root of the number of possible birthdays:
sqrt(365)=19.1 


In the case of the 5 digit alpha-numeric key the numbers break down as follows (using the rough approximation method):


> 62^5
[1] 916132832
> sqrt(916132832)
[1] 30267.69

So we can only expect to enter 30 thousand records into the database before the odds favor a collision.

Prolog:

In this case there was an even worse problem because I found the MySQL database was not setup to do binary key comparisons on the keys. Effectively making the keys case insensitive and lowering the degrees of freedom from 62 to 33.

> 33^5
[1] 39135393
> sqrt(39135393)
[1] 6255.829

Every 6 thousand rows we would likely see a primary key collision. And as predicted- it happened.

For most DBA'a and experienced developers, it is probably common sense that a 5 digit sequence won't provide a collision resistant key for even a toy application. That's the trouble with small UID keys.