Monday, September 26, 2011

High Performance Text Processing - Taco Bell style

Following the Taco Bell programming methodology, we will process a huge amount of data using only a few simple ingredients (i.e. unix command line tools).

Most people won't think twice about writing thousands of lines of code to accomplish what a line or two of bash script will handle.

Some anti-patterns to avoid come to mind:
NIH (Not Invented Here)
Golden Hammer (treating every new problem like it's a nail)
Reinventing the wheel

Text processing is composed of four primary activities: filtering, transforming, sorting, and joining.
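Concretely, each of the four activities maps onto a classic unix tool. A minimal sketch (the filenames and sample data here are made up for illustration):

```shell
# Tiny sample data (illustrative):
printf 'b foo\na foo\na baz\n' > input.txt

grep 'foo' input.txt       > kept.txt    # filtering: keep only matching lines
sed 's/foo/bar/' input.txt > xform.txt   # transforming: rewrite each line
sort -k1,1 input.txt       > sorted.txt  # sorting: order by the first field

# joining: merge two files on a shared key (both must be sorted on it)
printf '1 x\n2 y\n' > a.sorted
printf '1 u\n2 v\n' > b.sorted
join a.sorted b.sorted > joined.txt
```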

To achieve the fastest processing speed, group all of your filtering, transforming, and joining tasks together in one pipeline.

Stream processing tasks (filtering, transforming, joining) are limited only by disk I/O, so take advantage of each disk scan: the processes in a pipeline run as co-routines, so every operation is applied at the time you read the file.
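For instance, one pipeline can filter, transform, and join in a single scan of the input. A sketch of the idea (the filenames, patterns, and the assumption that records.txt is already sorted on its key are all made up for illustration):

```shell
# Illustrative input: a commented record file (already sorted on the key)
# and a pre-sorted lookup table
printf '# header\na one\nb two\n' > records.txt
printf 'a left\nb right\n'        > lookup.sorted

# Filter, transform, and join in a single scan of records.txt:
grep -v '^#' records.txt \
  | awk '{ print $1, toupper($2) }' \
  | join - lookup.sorted \
  > out.txt
```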

Let's say I need to apply 5 regular expressions to a file:

Example (as co-routines; two equally fast ways):

time cat bigfile \
|grep -vE "[^a-z0-9 ][^a-z0-9 ]|[^a-z0-9] [^a-z0-9]||\. |[a-z]' " \
> bigfile.clean

OR

time cat bigfile \
|grep -v '[^a-z0-9 ][^a-z0-9 ]' \
|grep -v '[^a-z0-9] [^a-z0-9]' \
|grep -v '' \
|grep -v '\. ' \
|grep -v "[a-z]' " \
> bigfile.clean

Another example (the same thing, but 5 times slower):
time cat bigfile|grep -v '[^a-z0-9 ][^a-z0-9 ]'>tmpfile1
time cat tmpfile1|grep -v '[^a-z0-9] [^a-z0-9]'>tmpfile2
time cat tmpfile2|grep -v '' >tmpfile3
time cat tmpfile3|grep -v '\. '>tmpfile4
time cat tmpfile4|grep -v "[a-z]' " >bigfile.clean

Using temp files here causes the equivalent of 5 full scans over the data, when you should really only be reading it once.
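The same one-scan principle extends beyond chained greps: when a step both filters and transforms, a single awk process can do both in the same pass. A minimal sketch (the input file and patterns are made up for illustration):

```shell
# Illustrative input:
printf 'Keep This\nDrop 123\nALSO KEEP\n' > bigfile

# One scan: drop lines containing digits and lowercase everything else,
# collapsing a filter step and a transform step into a single process
awk '!/[0-9]/ { print tolower($0) }' bigfile > bigfile.clean
```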

