multi-core workouts: 2012

Thursday, August 09, 2012

Rotate log tables with a MySQL EVENT

Unix crontab can be used for database tasks that need to be run on a schedule, but since MySQL version 5.1.6 there is a better way to handle periodic database tasks - Events.

http://dev.mysql.com/doc/refman/5.1/en/events.html

Events in MySQL are like a stored procedure that can run automatically at a set time. The two main advantages over using a crontab for scheduled jobs are:

No need to put the database password in the crontab file for the connection.
When you move your database, the scheduled EVENTS go with it.

Rotate MySQL log tables

Be sure to add global event_scheduler=1 to your my.ini config file so events will be enabled at database startup.

set global event_scheduler=1;

DROP EVENT IF EXISTS log_clean_daily;

CREATE EVENT log_clean_daily
ON SCHEDULE EVERY 1 day
DO CALL rotate_logs;

The rotate_logs stored procedure:

The two versions I'll show here, are for a regular MySQL server and for an RDS MySQL server.

The regular MySQL procedure for rotate_logs

DROP PROCEDURE IF EXISTS rotate_logs;
DELIMITER |
CREATE PROCEDURE rotate_logs()
BEGIN
drop table if exists general_log_tmp;
drop table if exists general_log_old;
CREATE TABLE general_log_tmp LIKE general_log;
RENAME TABLE general_log TO general_log_old, general_log_tmp TO general_log;
drop table if exists slow_log_tmp;
drop table if exists slow_log_old;
CREATE TABLE slow_log_tmp LIKE slow_log;
RENAME TABLE slow_log TO slow_log_old, slow_log_tmp TO slow_log;
END
|
DELIMITER ;

The Amazon RDS MySQL procedure for rotate_logs

DROP PROCEDURE IF EXISTS rotate_logs;

DELIMITER |
CREATE PROCEDURE rotate_logs()
BEGIN
call mysql.rds_rotate_general_log;
call mysql.rds_rotate_slow_log;
END
|
DELIMITER ;

Make sure to enable the event_scheduler for your parameter group since RDS does not allow setting the event_scheduler global variable directly.

rds-modify-db-parameter-group mygroup --parameters "name=event_scheduler, value=ON, method=immediate"

Tuesday, August 07, 2012

BigInt 64bit UID generation for MYSQL

Unique IDs and GUIDs can easily be represented in 128 bits. However, because there are no commonly available 128-bit processors- it is not practical to work with these values for large scale data processing.

We are left with 64-bit numbers. These are the largest values we can efficiently work with because that's the width on AMD and Intel registers. Also, the largest integer datatype in MySQL is the 64-bit BigInt. This is the most efficient size for a UID when 32-bit auto increment is not an option.

Goals:

The IDs need to remain unique (per table) for the expected life of the application.
Needs to be efficient for record storage and retrieval in MySQL.
Any server should be able to generate the UIDs in any common language.

MySQL Size Constraint:

The largest unsigned integer datatype in MySQL is the 64-bit BigInt which is a 20 digit number.

> 2^64
[1] 1.844674e+19

To reduce random IO for inserts we want our UIDs to be sequential and ordered by time. ID's will ideally be k-sorted to within 1 millisecond.

We can break the 20 digit number into a time part and random part to ensure no duplicate keys.

If we select the first 12 digits as the timestamp in milliseconds we can only store 5 years before key space runs out. However with 13 digits, we can keep the key going for 58 years (give or take a few leap days and leap seconds). Since this fits within our expected application lifetime, we choose 13 digits.

31536000000 milliseconds per year * 58 years

> 31536000000*58
[1] 1.829088e+12

_____________ ______ _
13 Digit Time 6 Rand S

Time: Take current time in milliseconds - Jan 1 2011 (or whenever your application started).
Rand: 6 digit random number
S: 1 digit Server ID

Example BigInt UIDs:

00000000000001234561
Generated exactly at epoch.
Random number=123456
Server ID=1

00315360000001234561
Generated about 1 year after epoch.
Random number=123456
Server ID=1

18290880000001234561

Generated about 58 years after epoch.
Random number=123456
Server ID=1

Example Java code to generate UID:

public static String getUID() {
long time = System.currentTimeMillis() - EPOCH;
StringBuilder sb = new StringBuilder();
sb.append(String.format("%d", time));
sb.append(String.format("%06d", generator.nextInt(1000000)));//0-999999
sb.append(hostIndex);//should be a 1 digit integer
return sb.toString();
}

Reference: http://mindprod.com/jgloss/pseudorandom.html#LOWHIGH

Example SQL code to generate UID:

For batch inserting lots of records fast, we need to generate the key in SQL.
We use the uuid function in MySQL instead of the 3 digit millisecond value and 6 digit random number. From the first 8 characters of uuid, convert from hex to decimal and take the right 9 digits.

___________ _________
11 digit sec. 9 digit uuid

set @epoch=1293858000; #epoch in seconds (Jan 1, 2011)
set autocommit=0;
insert test (id, a, b)
values
(concat(unix_timestamp()-@epoch, right(conv(substring(uuid(),1,8),16,10)),9) , 'a1', 'b1'),
(concat(unix_timestamp()-@epoch, right(conv(substring(uuid(),1,8),16,10)),9) , 'a2', 'b2'),
(concat(unix_timestamp()-@epoch, right(conv(substring(uuid(),1,8),16,10)),9) , 'a3', 'b3')
;
set autocommit=1;

Of course, with timestamp being a major component of the UID, both server and database should be on UTC time and NTP. This will keep clocks synchronized across servers and prevent time repeating as happens during daylight savings.

For an alternate 64-bit UID see the Twitter Snowflake algorithm:
https://github.com/twitter/snowflake
It provides greater collision safety for a high volumen of inserts/second and a higher number of server ID values but the tradeoff is that it requires a ticket server.

Sunday, May 27, 2012

The trouble with small UID keys

The main database for a project I recently joined was designed with 5 character alpha-numeric random generated primary keys.

This provides just 62 degrees of freedom per character (26 lowercase + 26 uppercase + 10 numbers). In this case, to find the odds of picking any particular number, you take 62 to the power of 5 (62^5=916,132,832). This shows that the probability you will generate 2 random numbers from this pool ,and they both match, is about 1 out of 916 million. While that is very unlikely to happen, trouble starts once you begin adding more rows to the database. Because now, you are no longer testing for matches between 2 particular pairs of numbers. You are testing for ANY 2 pairs to match.

Small, random alpha-numeric values like this are not suitable for primary keys because they will quickly collide due to the phenomenon known as the Birthday Paradox.

The Birthday Paradox states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.

The reason this is so surprising is because we are used to comparing our particular birthdays with others. For example, if you meet someone randomly and ask him what his birthday is, the chance of the two of you having the same birthday is only 1/365 (0.27%). In other words, the probability of any two individuals having the same birthday is extremely low. Even if you ask 23 people, the probability is still low -- about 5%. So we feel like it is very rare to meet anyone with the same birthday as our own.

When you put 23 people in a room, however, the thing that changes is the fact that each of the 23 people is now asking each of the other 22 people about their birthdays. Each individual person only has a small (less than 5%) chance of success, but each person is trying it 22 times. That increases the probability dramatically. [source]

A shortcut to estimate how many people can be in a room before there is roughly 50% chance that 2 people will have the same birthday is to take the square root of the number of possible birthdays:
sqrt(365)=19.1

In the case of the 5 digit alpha-numeric key the numbers break down as follows (using the rough approximation method):

> 62^5
[1] 916132832
> sqrt(916132832)
[1] 30267.69

So we can only expect to enter 30 thousand records into the database before the odds favor a collision.

Prolog:

In this case there was an even worse problem because I found the MySQL database was not setup to do binary key comparisons on the keys. Effectively making the keys case insensitive and lowering the degrees of freedom from 62 to 33.

> 33^5
[1] 39135393
> sqrt(39135393)
[1] 6255.829

Every 6 thousand rows we would likely see a primary key collision. And as predicted- it happened.

For most DBA'a and experienced developers, it is probably common sense that a 5 digit sequence won't provide a collision resistant key for even a toy application. That's the trouble with small UID keys.