Monday, September 26, 2011

High Performance Text Processing -Taco Bell style

Following the Taco Bell programming methodology, we will process a huge amount of data using only a few simple ingredients (i.e. unix command line tools).

Most people won't think twice about writing thousands of lines of code to accomplish what a line or two of bash script will handle.

Some anti-pattterns to avoid come to mind:
NIH (Not Invented Here)
Golden Hammer (Treat every new problem like it's a nail.)
Re-inventing the wheel

Text processing is composed of four primary activities. Filtering, Transforming, Sorting and Joining.

To achieve the fastest processing speed, you should try grouping all of your filtering, transforming and joining tasks together in one pipeline.

Stream processing tasks (filtering, transforming, joining) are limited by disk io only so take advantage of your disk scan and apply all operations as co-routines at the time you read the file.

Lets say I need to apply 5 regular expressions to a file:

Example (as co-routines- equally fast ways):

time cat bigfile \
|grep -vE "[^a-z0-9 ][^a-z0-9 ]|[^a-z0-9] [^a-z0-9]||\. |[a-z]' " \
> bigfile.clean

OR

time cat bigfile \
|grep -v '[^a-z0-9 ][^a-z0-9 ]' \
|grep -v '[^a-z0-9] [^a-z0-9]' \
|grep -v '' \
|grep -v '\. ' \
|grep -v "[a-z]' " \
> bigfile.clean

Another example (same thing- but 5 times slower):
time cat bigfile|grep -v '[^a-z0-9 ][^a-z0-9 ]'>tmpfile1
time cat tmpfile1|grep -v '[^a-z0-9] [^a-z0-9]'>tmpfile2
time cat tmpfile2|grep -v '' >tmpfile3
time cat tmpfile3|grep -v '\. '>tmpfile4
time cat tmpfile4|grep -v "[a-z]' " >bigfile.clean

Using temp files here causes the equivalent of 5 full text scans on the data when you should really only be reading the data once.


Wednesday, September 07, 2011

Select 'Text Processing' from UNIX like 'a BOSS'


1) Learn the commands, avoid re-inventing wheels
Bootstrap method:
First, read every man page and understand that each command can be fed into any other command. There are around 2000 commands on a typical UNIX box but you only need to know a few hundred of these. If you have a specific task you can usually figure out which commands to use with the "apropos" command

Read the entire man page for important commands but just skim the top description for the others (Like encryption or security programs)


Here's how to get a list of commands available on your system (logged as root):
(The script command will screen capture into a file named typescript)
Press the tab key twice to see available commands-
#script
-tab- -tab-

Display all 2612 possibilities? (y or n)
y
!                                                icc2ps                                           ppmtobmp
./                                               icclink                                          ppmtoeyuv
411toppm                                         icctrans                                         ppmtogif
7za                                              iceauth                                          ppmtoicr
:                                                icnctrl                                          ppmtoilbm
Kobil_mIDentity_switch                           icontopbm                                        ppmtojpeg
MAKEDEV                                          iconv                                            ppmtoleaf
--More--

...
#ctl-d

# exit
Script done, file is typescript

Now that you have captured all the commands available on your system, process that file to extract commands with man pages.
Get rid of dos newlines
#unix2dos typescript
Manually cleanup "--More--" if needed using emacs or vi editor.
#emacs typescript
Loop over the commands and keep only commands with a man page
# (for k in `cat typescript`; do echo -n "$k="; man $k|wc -l ; echo; done 2>/dev/null)|sort|grep '=[1-9]' >command.list
Look over the list
awk=1612
badblocks=119
baobab=28
base64=60
basename=52
bash=4910
batch=177
bc=794
bccmd=116
bdftopcf=71



2) Start "Reading The Fine Manpages"
# man man

man(1) Manual pager utils man(1)
NAME
man - an interface to the on-line reference manuals
SYNOPSIS
man [-c|-w|-tZHT device] [-adhu7V] [-m system[,...]] [-L locale] [-p string] [-M path] [-P pager] [-r prompt] [-S list] [-e extension] [[section] page ...] ...
man -l [-7] [-tZHT device] [-p string] [-P pager] [-r prompt] file ...
man -k [apropos options] regexp ...
man -f [whatis options] page ...
DESCRIPTION
man is the system’s manual pager. Each page argument given to man is normally the name of a program, utility or function. The manual page associated with each of these arguments is then found and displayed. A section, if provided, will direct man to look only in that section of the manual. The default action is to search in all of the available sections, following a pre-defined order and to show only the first page found, even if page exists in several sections.
...

Commands:

ENTER – scroll down
”b” – take you back
”/ keyword” – search the man page for a word
asumming more as page reader

Finding Commands:

# apropos CPU
top (1) - display top CPU processes
# whatis apropos
apropos (1) - search the manual page names and descriptions
NOTE: apropos is man -k (for ”keyword search”) and whatis is man -f
(for ”fullword search”)

3) Study and practice combining these key text processing tools:
I can't even tell you how important this is.
AWK, SED, TR, GREP, SORT, UNIQ, CUT,PASTE,JOIN, HEAD, TAIL, CAT, TAC

4) Learn how to manage running processes
PS, TOP, FG, BG, KILL, vmstat, iostat, netstat, lsof, fuser, strace
ctrl-d
ctrl-c
ctrl-z


5) Learn the BASH shell
UNIX pipes combine and process streams:

Integer valueName
0Standard Input (stdin)
1Standard Output (stdout)
2Standard Error (stderr)



6) Keep going
Check out history, wget, curl, xml2, xmlformat and GNU parallel.. tons more....
(You'll be glad you did)

Friday, September 02, 2011

NCD for news article data mining

The Normalized Compression Distance (NCD) is a powerful formula that can be used to discover hidden relationships in almost any data.



C=compressor program like Zip
x and y are the 2 strings you want to compare



I works by simply compressing three things. The first text string C(x), the second C(y) and then the concatenation both strings together c(xy).


Apply the formula and you get a distance score between the 2 strings from 0 to 1 (plus a slight error amount). 0 means the articles are identical, and 1 means they are absolutely dissimilar.



NCD was discovered by Rudi Cilibrasi and Paul Vitanyi as described in thier 2005 paper "Clustering by Compression"
http://arxiv.org/PS_cache/cs/pdf/0312/0312044v2.pdf



Consider the following article:

What If He Had Gone on Vacation-

Jack Kilby describes how he developed the world's first integrated circuits.
"After several interviews, I was hired by Willis Adcock of Texas Instruments. My duties were not precisely defined, but it was understood that I would work in the general area of microminiaturization. Soon after starting at TI in May 1958, I realized that since the company made transistors, resistors, and capacitors, a repackaging effort might provide an effective alternative to the Micro-Module. I therefore designed an IF amplifier using components in a tubular format and built a prototype. We also performed a detailed cost analysis, which was completed just a few days before the plant shut down for a mass vacation.
"As a new employee, I had no vacation time coming and was left alone to ponder the results of the IF amplifier exercise. The cost analysis gave me my first insight into the cost structure of a semiconductor house. The numbers were high — very high — and I felt it likely that I would be assigned to work on a proposal for the Micro-Module program when vacation was over unless I came up with a good idea very quickly. In my discouraged mood, I began to feel that the only thing a semiconductor house could make in a cost-effective way was a semiconductor. Further thought led me to the conclusion that semiconductors were all that were really required — that resistors and capacitors, in particular, could be made from the same material as the active devices.
"I also realized that, since all of the components could be made of a single material, they could also be made in situ, interconnected to form a complete circuit. I then quickly sketched a proposed design for a flip-flop using these components. Resistors were provided by bulk effect in the silicon, and capacitors by p-n junctions.
"These sketches were quickly completed, and I showed them to Adcock upon his return from vacation. He was enthused but skeptical and asked for some proof that circuits made entirely of semiconductors would work. I therefore built up a circuit using discrete silicon elements. Packaged grown-junction transistors were used. Resistors were formed by cutting small bars of silicon and etching to value. Capacitors were cut from diffused silicon power transistor wafers, metallized on both sides. This unit was assembled and demonstrated to Adcock on August 28, 1958.
"Although this test showed that circuits could be built with all semiconductor elements, it was not integrated. I immediately attempted to build an integrated structure as initially planned. I obtained several wafers, diffused and with contacts in place. By choosing the circuit, I was able to lay out two structures that would use the existing contacts on the wafers. The first circuit attempted was a phase-shift oscillator, a favorite demonstration vehicle for linear circuits at that time.
"On September 12, 1958, the first three oscillators of this type were completed. When power was applied, the first unit oscillated at about 1.3 megacycles.
"The concept was publicly announced at a press conference in New York on March 6, 1959. Mark Shepherd said, "I consider this to be the most significant development by Texas Instruments since we divulged the commercial availability of the silicon transistor." Pat Haggerty predicted the circuits first would be applied to the further miniaturization of electronic computers, missiles, and space vehicles and said that anyapplication to consumer goods such as radio and television receivers would be several years away."
television program in 1997 said about the integrated circuit and Jack Kilby, "One invention we can say is one of the most significant in history -- the microchip, which has made possible endless numbers of other inventions. For the past 40 years, Kilby has watched his invention change the world. Jack Kilby — one of the few people who can look around the globe and say to himself 'I changed how the world functions.'" 



Can NCD work as a relationship classifier in news articles?

First, I ran a part of speech tagger on the article text and then looked-up all entities (proper nouns) in Wikipedia. The Wikipedia articles were then used to build a related word-list for each entity. Next I compute the normalized compression distance (NCD) pair-wise for each entity to get a distance matrix. Cluster the matrix and a nice binary graph appears using dot and the relationships begin to appear.

From the 50 entities mapped below, a few relations are apparent. (Probably should have just picked the top 25 entities though, because the clustering works better with fewer items.)

A) Mark Shepherd and Willis Adcock are related through Jack Kilby
B) The integrated circuit, transistor, capacitor and resistor are related by silicon. 
C) Concepts and ideas are related through thought.