Thursday, August 09, 2012

Rotate log tables with a MySQL EVENT



Unix crontab can be used for database tasks that need to run on a schedule, but since MySQL 5.1.6 there is a better way to handle periodic database tasks: Events.

Events in MySQL are like stored procedures that run automatically at a set time. The two main advantages over using a crontab for scheduled jobs are:
  1. No need to put the database password in the crontab file for the connection.
  2. When you move your database, the scheduled EVENTS go with it.

Rotate MySQL log tables


Be sure to add event_scheduler=ON to the [mysqld] section of your my.cnf or my.ini config file so events are enabled at database startup. To turn the scheduler on for a running server:

set global event_scheduler=1;

DROP EVENT IF EXISTS log_clean_daily;

CREATE EVENT log_clean_daily
ON SCHEDULE EVERY 1 day
DO CALL rotate_logs;
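
A couple of quick checks to confirm the scheduler is running and the event is registered:

SHOW VARIABLES LIKE 'event_scheduler';  -- should report ON
SHOW EVENTS;                            -- lists the events in the current database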


The rotate_logs stored procedure:


The two versions I'll show here are for a regular MySQL server and for an Amazon RDS MySQL server.


The regular MySQL procedure for rotate_logs


DROP PROCEDURE IF EXISTS rotate_logs;
DELIMITER |
CREATE PROCEDURE rotate_logs()
BEGIN
  -- Rotates the general_log and slow_log tables in the current (mysql) schema;
  -- logs must be going to tables (log_output=TABLE).
  DROP TABLE IF EXISTS general_log_tmp;
  DROP TABLE IF EXISTS general_log_old;
  CREATE TABLE general_log_tmp LIKE general_log;
  RENAME TABLE general_log TO general_log_old, general_log_tmp TO general_log;
  DROP TABLE IF EXISTS slow_log_tmp;
  DROP TABLE IF EXISTS slow_log_old;
  CREATE TABLE slow_log_tmp LIKE slow_log;
  RENAME TABLE slow_log TO slow_log_old, slow_log_tmp TO slow_log;
END
|
DELIMITER ;
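
Note that the general_log and slow_log tables live in the mysql schema and are only populated when logs are written to tables. A minimal sketch for testing the rotation by hand (assuming the procedure was created in the mysql schema):

USE mysql;
SET GLOBAL log_output = 'TABLE';  -- write logs to tables instead of files
SET GLOBAL general_log = 'ON';    -- enable the general query log
CALL rotate_logs;                 -- rotate once by hand
SHOW TABLES LIKE 'general_log%';  -- expect general_log and general_log_old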

The Amazon RDS MySQL procedure for rotate_logs 


DROP PROCEDURE IF EXISTS rotate_logs;
DELIMITER |
CREATE PROCEDURE rotate_logs()
BEGIN
  call mysql.rds_rotate_general_log;
  call mysql.rds_rotate_slow_log;
END
|
DELIMITER ;
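
On RDS the rotated rows are moved to backup tables rather than _old tables, so verifying a rotation looks slightly different (per my reading of the RDS helper procedures; check the RDS docs for your version):

CALL rotate_logs;
SELECT COUNT(*) FROM mysql.general_log_backup;  -- rows moved by the last rotation
SELECT COUNT(*) FROM mysql.slow_log_backup;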


Make sure to enable event_scheduler in your DB parameter group, since RDS does not allow setting the event_scheduler global variable directly:
rds-modify-db-parameter-group mygroup --parameters "name=event_scheduler, value=ON, method=immediate"

Tuesday, August 07, 2012

BigInt 64-bit UID generation for MySQL


Unique IDs and GUIDs can easily be represented in 128 bits. However, because there are no commonly available 128-bit processors, it is not practical to work with values this large for large-scale data processing.

We are left with 64-bit numbers. These are the largest values we can work with efficiently because that is the register width on AMD and Intel processors. The largest integer datatype in MySQL is also the 64-bit BIGINT, making it the most efficient size for a UID when a 32-bit auto-increment is not an option.


Goals:


  • The IDs must remain unique (per table) for the expected life of the application.
  • They must be efficient for record storage and retrieval in MySQL.
  • Any server should be able to generate them in any common language.


MySQL Size Constraint:

The largest unsigned integer datatype in MySQL is the 64-bit BIGINT UNSIGNED, which holds 20-digit numbers.

> 2^64
[1] 1.844674e+19

To reduce random I/O on inserts, we want our UIDs to be roughly sequential, ordered by time; ideally the IDs will be k-sorted to within 1 millisecond.

We can break the 20-digit number into a time part and a random part to ensure no duplicate keys.

If we use the first 12 digits for a timestamp in milliseconds, the 8 digits that remain push the full 20-digit value past the BIGINT UNSIGNED ceiling after only about 5 years. With 13 digits of time (and 7 digits remaining), the key space lasts 58 years (give or take a few leap days and leap seconds). Since this fits within our expected application lifetime, we choose 13 digits.

31536000000 milliseconds per year * 58 years

> 31536000000*58
[1] 1.829088e+12
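
You can sanity-check the 58-year budget right in MySQL: ~0 is the largest BIGINT UNSIGNED value, and dividing off the 7 trailing digits (6 random + 1 server) leaves the largest legal 13-digit time prefix.

SELECT ~0;                              -- 18446744073709551615, the BIGINT UNSIGNED ceiling
SELECT ~0 DIV 10000000;                 -- 1844674407370, the largest 13-digit time prefix
SELECT (~0 DIV 10000000) / 31536000000; -- about 58.5 years of milliseconds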

_____________ ______ _
13-digit time 6 rand S

Time: the current time in milliseconds minus the epoch of Jan 1, 2011 (or whenever your application started).
Rand: a 6-digit random number.
S: a 1-digit server ID.

Example BigInt UIDs:

00000000000001234561
Generated exactly at epoch.
Random number=123456
Server ID=1

00315360000001234561
Generated about 1 year after epoch.
Random number=123456
Server ID=1

18290880000001234561
Generated about 58 years after epoch.
Random number=123456
Server ID=1
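
As a quick sanity check, even the 58-year example still parses cleanly as an unsigned 64-bit value:

SELECT CAST('18290880000001234561' AS UNSIGNED);  -- fits below the 18446744073709551615 ceiling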


Example Java code to generate a UID. The EPOCH, generator, and hostIndex fields are filled in here so the snippet compiles; EPOCH is Jan 1, 2011 in milliseconds, matching the SQL example below.

import java.util.Random;

public class UidGenerator {
    private static final long EPOCH = 1293858000000L; // Jan 1, 2011 (1293858000 epoch seconds)
    private static final Random generator = new Random();
    private static final int hostIndex = 1;           // 1-digit server ID

    public static String getUID() {
        long time = System.currentTimeMillis() - EPOCH;               // up to 13-digit time part
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%d", time));
        sb.append(String.format("%06d", generator.nextInt(1000000))); // 6-digit random: 0-999999
        sb.append(hostIndex);                                         // 1-digit server ID
        return sb.toString();
    }
}

  Reference: http://mindprod.com/jgloss/pseudorandom.html#LOWHIGH

Example SQL code to generate UID:

For batch inserting lots of records fast, we need to generate the key in SQL.
Instead of millisecond precision and a 6-digit random number, we use an 11-digit timestamp in seconds plus 9 digits derived from MySQL's uuid() function: take the first 8 characters of the uuid, convert them from hex to decimal, and keep the rightmost 9 digits.

___________ _________
11-digit sec 9-digit uuid
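
The batch insert below assumes a table along these lines (a hypothetical sketch; all that matters here is that id is a BIGINT UNSIGNED primary key):

CREATE TABLE test (
  id BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- the UID
  a VARCHAR(10),
  b VARCHAR(10)
) ENGINE=InnoDB;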

set @epoch=1293858000; # epoch in seconds (Jan 1, 2011)
set autocommit=0;
insert into test (id, a, b)
values
(concat(unix_timestamp()-@epoch, right(conv(substring(uuid(),1,8),16,10),9)), 'a1', 'b1'),
(concat(unix_timestamp()-@epoch, right(conv(substring(uuid(),1,8),16,10),9)), 'a2', 'b2'),
(concat(unix_timestamp()-@epoch, right(conv(substring(uuid(),1,8),16,10),9)), 'a3', 'b3')
;
commit;
set autocommit=1;


Of course, with the timestamp being a major component of the UID, both the servers and the database should run on UTC and be synchronized with NTP. This keeps clocks consistent across machines and prevents time from repeating, as happens during daylight saving time changes.

For an alternate 64-bit UID scheme, see the Twitter Snowflake algorithm:
https://github.com/twitter/snowflake
It provides greater collision safety for a high volume of inserts per second and supports more server ID values, but the tradeoff is that it requires a ticket server.

Sunday, May 27, 2012

The trouble with small UID keys

The main database for a project I recently joined was designed with 5-character alphanumeric, randomly generated primary keys.

This provides just 62 possible values per character (26 lowercase + 26 uppercase + 10 digits), for a total key space of 62^5 = 916,132,832. So the probability that two randomly generated keys match is about 1 in 916 million. While that sounds very unlikely, trouble starts once you begin adding more rows to the database, because you are no longer testing whether two particular keys match. You are testing whether ANY two keys match.

Small, random alphanumeric values like this are not suitable for primary keys because they will quickly collide due to the phenomenon known as the Birthday Paradox.


The Birthday Paradox states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.


The reason this is so surprising is because we are used to comparing our particular birthdays with others. For example, if you meet someone randomly and ask him what his birthday is, the chance of the two of you having the same birthday is only 1/365 (0.27%). In other words, the probability of any two individuals having the same birthday is extremely low. Even if you ask 23 people, the probability is still low -- about 5%. So we feel like it is very rare to meet anyone with the same birthday as our own.


When you put 23 people in a room, however, the thing that changes is the fact that each of the 23 people is now asking each of the other 22 people about their birthdays. Each individual person only has a small (less than 5%) chance of success, but each person is trying it 22 times. That increases the probability dramatically. [source]


A shortcut to estimate how many people can be in a room before there is roughly a 50% chance that two of them share a birthday is to take the square root of the number of possible birthdays:
sqrt(365) = 19.1


In the case of the 5-character alphanumeric key, the numbers break down as follows (using the square-root approximation):


> 62^5
[1] 916132832
> sqrt(916132832)
[1] 30267.69

So we can expect to enter only about 30 thousand records into the database before the odds favor a collision.
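
For a tighter estimate, the standard birthday-bound approximation puts the 50% collision point at n ≈ sqrt(2N ln 2), which you can evaluate in MySQL itself:

SELECT SQRT(2 * POW(62,5) * LN(2));  -- about 35,637 keys before a collision is more likely than not

Either way, the key space is exhausted for practical purposes after a few tens of thousands of rows.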

Epilogue:

In this case there was an even worse problem: I found the MySQL database was not set up to do binary comparisons on the keys, effectively making them case insensitive and lowering the number of values per character from 62 to 33.

> 33^5
[1] 39135393
> sqrt(39135393)
[1] 6255.829

Every 6 thousand rows we would likely see a primary key collision. And, as predicted, it happened.
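
The collation pitfall is easy to demonstrate: with MySQL's default case-insensitive collations, string comparisons fold case unless you force a binary comparison (a generic illustration, not the project's actual schema):

SELECT 'abc' = 'ABC';         -- 1: under a _ci collation these keys collide
SELECT 'abc' = BINARY 'ABC';  -- 0: a binary comparison keeps them distinct

Declaring the key column with a binary collation (for example, VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_bin) restores the full 62 values per character.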

For most DBAs and experienced developers, it is probably common sense that a 5-character random key won't be collision resistant for even a toy application. That's the trouble with small UID keys.



Wednesday, November 23, 2011

ngrep2curl - How to replay a browser web request


Many times I have found myself downloading a file twice because cookies or other required browser headers prevent curl from doing a direct download from the command line.

To simplify this process, I wrote a script that:
1) Listens on port 80 to grab the HTTP headers for a download started in the browser
2) Generates a curl command using the captured headers.

Old workflow (file transferred twice):
Web Browser -> browse to file -> download to local machine
sftp the file to the remote server

New workflow (file transferred just once):
run ngrep2curl on the local machine
Web Browser -> browse to file -> download just the beginning of the file -> cancel download
run the generated curl command on the remote server

#!/bin/sh
ngrep port 80 -Wbyline |
  grep -vE '^#|^T ' |
  head -20 |
  sed 's/\.$//' |
  awk '/GET/{url=$2}
       /Host:/{host=$2}
       /User-Agent:/{gsub("User-Agent: ","");ua=$0}
       /Cookie:/{gsub("Cookie: ","");cookie=$0}
       /Referer:/{gsub("Referer: ","");ref=$0}
       /Accept:/{accept=$0}
       END{print "curl -L","\x27"host url"\x27","-A","\""ua"\"","-b\""cookie"\"","-e \x27"ref"\x27","-H","\x27"accept"\x27"}'

Monday, September 26, 2011

High Performance Text Processing - Taco Bell style

Following the Taco Bell programming methodology, we will process a huge amount of data using only a few simple ingredients (i.e. unix command line tools).

Most people won't think twice about writing thousands of lines of code to accomplish what a line or two of bash script will handle.

Some anti-patterns to avoid come to mind:
NIH (Not Invented Here)
Golden Hammer (treating every new problem like it's a nail)
Re-inventing the wheel

Text processing is composed of four primary activities: filtering, transforming, sorting, and joining.

To achieve the fastest processing speed, try grouping all of your filtering, transforming, and joining tasks together in one pipeline.

Stream processing tasks (filtering, transforming, joining) are limited mainly by disk I/O, so take advantage of each disk scan and apply all operations as co-routines while you read the file.

Let's say I need to apply four regular expressions to a file:

Example (as co-routines; two equally fast ways):

time cat bigfile \
|grep -vE "[^a-z0-9 ][^a-z0-9 ]|[^a-z0-9] [^a-z0-9]|\. |[a-z]' " \
> bigfile.clean

OR

time cat bigfile \
|grep -v '[^a-z0-9 ][^a-z0-9 ]' \
|grep -v '[^a-z0-9] [^a-z0-9]' \
|grep -v '\. ' \
|grep -v "[a-z]' " \
> bigfile.clean

Another example (same thing, but several times slower):
time cat bigfile|grep -v '[^a-z0-9 ][^a-z0-9 ]'>tmpfile1
time cat tmpfile1|grep -v '[^a-z0-9] [^a-z0-9]'>tmpfile2
time cat tmpfile2|grep -v '\. '>tmpfile3
time cat tmpfile3|grep -v "[a-z]' ">bigfile.clean

Using temp files here causes the equivalent of four full scans over the data when you should really only be reading it once.


Wednesday, September 07, 2011

Select 'Text Processing' from UNIX like 'a BOSS'


1) Learn the commands, avoid re-inventing wheels
Bootstrap method:
First, read every man page and understand that each command can be fed into any other command. There are around 2000 commands on a typical UNIX box, but you only need to know a few hundred of them. If you have a specific task, you can usually figure out which commands to use with the "apropos" command.

Read the entire man page for important commands, but just skim the top description for the others (like encryption or security programs).


Here's how to get a list of commands available on your system (logged in as root):
(The script command will capture the screen session into a file named typescript)
Press the tab key twice to see the available commands:
#script
-tab- -tab-

Display all 2612 possibilities? (y or n)
y
!                                                icc2ps                                           ppmtobmp
./                                               icclink                                          ppmtoeyuv
411toppm                                         icctrans                                         ppmtogif
7za                                              iceauth                                          ppmtoicr
:                                                icnctrl                                          ppmtoilbm
Kobil_mIDentity_switch                           icontopbm                                        ppmtojpeg
MAKEDEV                                          iconv                                            ppmtoleaf
--More--

...
# ctrl-d

# exit
Script done, file is typescript

Now that you have captured all the commands available on your system, process that file to extract the commands that have man pages.
Get rid of the DOS newlines:
# dos2unix typescript
Manually clean up any "--More--" lines if needed using the emacs or vi editor:
# emacs typescript
Loop over the commands and keep only the ones with a man page:
# (for k in `cat typescript`; do echo -n "$k="; man $k|wc -l ; echo; done 2>/dev/null)|sort|grep '=[1-9]' >command.list
Look over the list
awk=1612
badblocks=119
baobab=28
base64=60
basename=52
bash=4910
batch=177
bc=794
bccmd=116
bdftopcf=71



2) Start "Reading The Fine Manpages"
# man man

man(1) Manual pager utils man(1)
NAME
man - an interface to the on-line reference manuals
SYNOPSIS
man [-c|-w|-tZHT device] [-adhu7V] [-m system[,...]] [-L locale] [-p string] [-M path] [-P pager] [-r prompt] [-S list] [-e extension] [[section] page ...] ...
man -l [-7] [-tZHT device] [-p string] [-P pager] [-r prompt] file ...
man -k [apropos options] regexp ...
man -f [whatis options] page ...
DESCRIPTION
man is the system’s manual pager. Each page argument given to man is normally the name of a program, utility or function. The manual page associated with each of these arguments is then found and displayed. A section, if provided, will direct man to look only in that section of the manual. The default action is to search in all of the available sections, following a pre-defined order and to show only the first page found, even if page exists in several sections.
...

Commands (assuming more as the pager):

ENTER - scroll down
"b" - go back one page
"/ keyword" - search the man page for a word

Finding Commands:

# apropos CPU
top (1) - display top CPU processes
# whatis apropos
apropos (1) - search the manual page names and descriptions
NOTE: apropos is man -k (for "keyword search") and whatis is man -f
(for "fullword search")

3) Study and practice combining these key text processing tools:
I can't even tell you how important this is.
AWK, SED, TR, GREP, SORT, UNIQ, CUT, PASTE, JOIN, HEAD, TAIL, CAT, TAC

4) Learn how to manage running processes
PS, TOP, FG, BG, KILL, vmstat, iostat, netstat, lsof, fuser, strace
ctrl-d (end of input)
ctrl-c (interrupt a running process)
ctrl-z (suspend a process)


5) Learn the BASH shell
UNIX pipes combine and process streams. Each process gets three standard file descriptors:

Integer value   Name
0               Standard input (stdin)
1               Standard output (stdout)
2               Standard error (stderr)



6) Keep going
Check out history, wget, curl, xml2, xmlformat, and GNU parallel... and tons more.
(You'll be glad you did.)