Monday, September 26, 2011

High Performance Text Processing -Taco Bell style

Following the Taco Bell programming methodology, we will process a huge amount of data using only a few simple ingredients (i.e. unix command line tools).

Most people won't think twice about writing thousands of lines of code to accomplish what a line or two of bash script will handle.

Some anti-patterns to avoid come to mind:
NIH (Not Invented Here)
Golden Hammer (Treat every new problem like it's a nail.)
Re-inventing the wheel

Text processing is composed of four primary activities: filtering, transforming, sorting and joining.

To achieve the fastest processing speed, you should try grouping all of your filtering, transforming and joining tasks together in one pipeline.

Stream processing tasks (filtering, transforming, joining) are limited only by disk I/O, so take advantage of your disk scan and apply all operations as co-routines at the time you read the file.

Let's say I need to apply 5 regular expressions to a file:

Example (as co-routines- equally fast ways):

time cat bigfile \
|grep -vE "[^a-z0-9 ][^a-z0-9 ]|[^a-z0-9] [^a-z0-9]||\. |[a-z]' " \
> bigfile.clean

Another example (same thing- but 5 times slower):
time cat bigfile|grep -v '[^a-z0-9 ][^a-z0-9 ]'>tmpfile1
time cat tmpfile1|grep -v '[^a-z0-9] [^a-z0-9]'>tmpfile2
time cat tmpfile2|grep -v '' >tmpfile3
time cat tmpfile3|grep -v '\. '>tmpfile4
time cat tmpfile4|grep -v "[a-z]' " >bigfile.clean

Using temp files here causes the equivalent of 5 full text scans on the data when you should really only be reading the data once.
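
If one giant alternation gets hard to read, the same single-scan behaviour can be had by chaining the filters with pipes instead of temp files. This is only a sketch: the function name clean_stream is made up for illustration, and it uses four of the patterns from the example above. Each grep runs as its own process and the data is read from disk once.

```shell
# clean_stream: hypothetical helper that chains several grep filters.
# Every stage runs concurrently, connected by pipes, so no temp files
# are written and the input is scanned from disk only once.
clean_stream() {
  grep -v '[^a-z0-9 ][^a-z0-9 ]' \
    | grep -v '[^a-z0-9] [^a-z0-9]' \
    | grep -v '\. ' \
    | grep -v "[a-z]' "
}

# Small demonstration on three sample lines; only the clean line survives.
printf 'hello world\nbad!! line\nend. next\n' | clean_stream
```

On a real file this would be used as: clean_stream < bigfile > bigfile.clean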

Friday, September 16, 2011

Word.

Wednesday, September 07, 2011

Select 'Text Processing' from UNIX like 'a BOSS'

1) Learn the commands, avoid re-inventing wheels
Bootstrap method:
First, read every man page and understand that each command can be fed into any other command. There are around 2000 commands on a typical UNIX box but you only need to know a few hundred of these. If you have a specific task you can usually figure out which commands to use with the "apropos" command.

Read the entire man page for important commands but just skim the top description for the others (Like encryption or security programs)

Here's how to get a list of commands available on your system (logged as root):
(The script command will screen capture into a file named typescript)
Press the tab key twice to see available commands-
#script
-tab- -tab-

Display all 2612 possibilities? (y or n)
y
! icc2ps ppmtobmp
./ icclink ppmtoeyuv
411toppm icctrans ppmtogif
7za iceauth ppmtoicr
: icnctrl ppmtoilbm
Kobil_mIDentity_switch icontopbm ppmtojpeg
MAKEDEV iconv ppmtoleaf
--More--

...
#ctl-d

# exit
Script done, file is typescript

Now that you have captured all the commands available on your system, process that file to extract commands with man pages.
Get rid of dos newlines
#dos2unix typescript
Manually cleanup "--More--" if needed using emacs or vi editor.
#emacs typescript
Loop over the commands and keep only commands with a man page
# (for k in `cat typescript`; do echo -n "$k="; man $k|wc -l ; echo; done 2>/dev/null)|sort|grep '=[1-9]' >command.list
Look over the list
awk=1612
badblocks=119
baobab=28
base64=60
basename=52
bash=4910
batch=177
bc=794
bccmd=116
bdftopcf=71

2) Start "Reading The Fine Manpages"

# man man

man(1) Manual pager utils man(1)
NAME
man - an interface to the on-line reference manuals
SYNOPSIS
man [-c|-w|-tZHT device] [-adhu7V] [-m system[,...]] [-L locale] [-p string] [-M path] [-P pager] [-r prompt] [-S list] [-e extension] [[section] page ...] ...
man -l [-7] [-tZHT device] [-p string] [-P pager] [-r prompt] file ...
man -k [apropos options] regexp ...
man -f [whatis options] page ...
DESCRIPTION
man is the system’s manual pager. Each page argument given to man is normally the name of a program, utility or function. The manual page associated with each of these arguments is then found and displayed. A section, if provided, will direct man to look only in that section of the manual. The default action is to search in all of the available sections, following a pre-defined order and to show only the first page found, even if page exists in several sections.
...

Commands:

ENTER – scroll down
”b” – take you back
”/ keyword” – search the man page for a word
(assuming more is the page reader)

Finding Commands:

# apropos CPU
top (1) - display top CPU processes
# whatis apropos
apropos (1) - search the manual page names and descriptions
NOTE: apropos is man -k (for ”keyword search”) and whatis is man -f
(for ”fullword search”)

3) Study and practice combining these key text processing tools:
I can't even tell you how important this is.
AWK, SED, TR, GREP, SORT, UNIQ, CUT,PASTE,JOIN, HEAD, TAIL, CAT, TAC
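
As a small practice exercise in combining these tools, here is a sketch of the classic word-frequency pipeline. The function name word_freq and its optional argument are my own invention for this example, not a standard command.

```shell
# word_freq [N]: print the N most frequent words on stdin (default 10).
# tr splits the text into one word per line and lowercases it,
# sort groups identical words, uniq -c counts each group,
# and the final sort/head keeps the biggest counts.
word_freq() {
  tr -cs 'A-Za-z' '\n' \
    | tr 'A-Z' 'a-z' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n "${1:-10}"
}

printf 'a b a\nb a c\n' | word_freq 3
```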

4) Learn how to manage running processes
PS, TOP, FG, BG, KILL, vmstat, iostat, netstat, lsof, fuser, strace
ctrl-d
ctrl-c
ctrl-z

5) Learn the BASH shell
UNIX pipes combine and process streams:

Integer value	Name
0	Standard Input (stdin)
1	Standard Output (stdout)
2	Standard Error (stderr)
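
A quick sketch of how those three numbers show up in redirections. The function emit is just a stand-in for any command that writes to both stdout and stderr.

```shell
# emit: toy command that writes one line to stdout (1) and one to stderr (2)
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Keep stdout, throw stderr away
only_out=$(emit 2>/dev/null)

# Keep stderr only: point 2 at the capture first, then discard 1
only_err=$(emit 2>&1 >/dev/null)

# Merge stderr into stdout so a pipe or capture sees both
both=$(emit 2>&1)

echo "$only_out"
echo "$only_err"
echo "$both"
```

The order of redirections matters: 2>&1 >/dev/null and >/dev/null 2>&1 do different things.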

6) Keep going
Check out history, wget, curl, xml2, xmlformat and GNU parallel.. tons more....

(You'll be glad you did)

Friday, September 02, 2011

NCD for news article data mining

The Normalized Compression Distance (NCD) is a powerful formula that can be used to discover hidden relationships in almost any data.

C=compressor program like Zip
x and y are the 2 strings you want to compare

It works by simply compressing three things: the first text string C(x), the second C(y), and then the concatenation of both strings together C(xy).

Apply the formula and you get a distance score between the 2 strings from 0 to 1 (plus a slight error amount). 0 means the articles are identical, and 1 means they are absolutely dissimilar.
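
For reference, the formula from the Cilibrasi and Vitanyi paper is NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size in bytes. Below is a rough sketch of it in shell. It assumes gzip as the compressor C, and the function names csize and ncd are made up for this example. On very short strings the gzip header overhead distorts the score, so treat the numbers as relative.

```shell
# csize: compressed size in bytes of whatever arrives on stdin (gzip as C)
csize() {
  gzip -c | wc -c | tr -d ' '
}

# ncd X Y: print the normalized compression distance between two strings
ncd() {
  local cx cy cxy min max
  cx=$(printf '%s' "$1" | csize)
  cy=$(printf '%s' "$2" | csize)
  cxy=$(printf '%s%s' "$1" "$2" | csize)
  if [ "$cx" -lt "$cy" ]; then min=$cx; max=$cy; else min=$cy; max=$cx; fi
  awk -v cxy="$cxy" -v min="$min" -v max="$max" \
    'BEGIN { printf "%.3f\n", (cxy - min) / max }'
}

# Identical strings should score lower than unrelated ones
ncd "the quick brown fox jumps over the lazy dog" "the quick brown fox jumps over the lazy dog"
ncd "the quick brown fox jumps over the lazy dog" "9f3kz 77qpl xmw02 vvbte 1c8ry uu4hd zzq0a"
```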

NCD was discovered by Rudi Cilibrasi and Paul Vitanyi as described in their 2005 paper "Clustering by Compression"
http://arxiv.org/PS_cache/cs/pdf/0312/0312044v2.pdf

Consider the following article:

What If He Had Gone on Vacation-

Jack Kilby describes how he developed the world's first integrated circuits.

"After several interviews, I was hired by Willis Adcock of Texas Instruments. My duties were not precisely defined, but it was understood that I would work in the general area of microminiaturization. Soon after starting at TI in May 1958, I realized that since the company made transistors, resistors, and capacitors, a repackaging effort might provide an effective alternative to the Micro-Module. I therefore designed an IF amplifier using components in a tubular format and built a prototype. We also performed a detailed cost analysis, which was completed just a few days before the plant shut down for a mass vacation.

"As a new employee, I had no vacation time coming and was left alone to ponder the results of the IF amplifier exercise. The cost analysis gave me my first insight into the cost structure of a semiconductor house. The numbers were high — very high — and I felt it likely that I would be assigned to work on a proposal for the Micro-Module program when vacation was over unless I came up with a good idea very quickly. In my discouraged mood, I began to feel that the only thing a semiconductor house could make in a cost-effective way was a semiconductor. Further thought led me to the conclusion that semiconductors were all that were really required — that resistors and capacitors, in particular, could be made from the same material as the active devices.

"I also realized that, since all of the components could be made of a single material, they could also be made in situ, interconnected to form a complete circuit. I then quickly sketched a proposed design for a flip-flop using these components. Resistors were provided by bulk effect in the silicon, and capacitors by p-n junctions.

"These sketches were quickly completed, and I showed them to Adcock upon his return from vacation. He was enthused but skeptical and asked for some proof that circuits made entirely of semiconductors would work. I therefore built up a circuit using discrete silicon elements. Packaged grown-junction transistors were used. Resistors were formed by cutting small bars of silicon and etching to value. Capacitors were cut from diffused silicon power transistor wafers, metallized on both sides. This unit was assembled and demonstrated to Adcock on August 28, 1958.

"Although this test showed that circuits could be built with all semiconductor elements, it was not integrated. I immediately attempted to build an integrated structure as initially planned. I obtained several wafers, diffused and with contacts in place. By choosing the circuit, I was able to lay out two structures that would use the existing contacts on the wafers. The first circuit attempted was a phase-shift oscillator, a favorite demonstration vehicle for linear circuits at that time.

"On September 12, 1958, the first three oscillators of this type were completed. When power was applied, the first unit oscillated at about 1.3 megacycles.

"The concept was publicly announced at a press conference in New York on March 6, 1959. Mark Shepherd said, "I consider this to be the most significant development by Texas Instruments since we divulged the commercial availability of the silicon transistor." Pat Haggerty predicted the circuits first would be applied to the further miniaturization of electronic computers, missiles, and space vehicles and said that anyapplication to consumer goods such as radio and television receivers would be several years away."

A television program in 1997 said about the integrated circuit and Jack Kilby, "One invention we can say is one of the most significant in history -- the microchip, which has made possible endless numbers of other inventions. For the past 40 years, Kilby has watched his invention change the world. Jack Kilby — one of the few people who can look around the globe and say to himself 'I changed how the world functions.'"

Can NCD work as a relationship classifier in news articles?

First, I ran a part-of-speech tagger on the article text and then looked up all entities (proper nouns) in Wikipedia. The Wikipedia articles were then used to build a related word list for each entity. Next, I computed the normalized compression distance (NCD) pair-wise for each entity to get a distance matrix. Clustering the matrix and drawing the result with dot produces a nice binary graph in which the relationships begin to appear.

From the 50 entities mapped below, a few relations are apparent. (Probably should have just picked the top 25 entities though, because the clustering works better with fewer items.)

A) Mark Shepherd and Willis Adcock are related through Jack Kilby

B) The integrated circuit, transistor, capacitor and resistor are related by silicon.

C) Concepts and ideas are related through thought.

Monday, August 15, 2011

Move your web content to Rackspace Cloudfiles

I'm gonna make this quick

To start hosting your web content on Rackspace Cloudfiles, you need to complete the following tasks:

Create Cloudfile containers to hold your web content ( plan on storing up to 10K items per container so you can use cloudfuse to mount your containers to the filesystem). You need to come up with an organizational structure that makes sense for your site. For example, you could use a container for your audio named c_audio and c_pics for pictures. If you know you will have over 10K files you should try to break them up like this: c-pics000,c-pics001, c-pics002, ...
Click "Publish to CDN" - This will give you your CDN URL.
Add cname entries into your DNS to create a predictable name for your containers.
Write a PHP script to query your database and copy all content listed in your database into the cloud. Then save the new cloud location for the content back to your database. For example, if you use auto-increment and have 30 thousand files you can put the first 10k into c-pics000 container, second 10K into c-pics001 container and so on.
For user contributed content- update your website file upload script to save the content to the cloud instead of the filesystem and save the new content location in the database. For images you may want to create the different sizes you will need before uploading to the cloud. Example: me.jpg and me_thumb.jpg
Connect your Cloudfile containers to your filesystem with cloudfuse.

Add the following entry to /etc/fstab:

cloudfuse /mnt/cloud fuse defaults,allow_other,username=your_username,api_key=your_api_key

Then mount /mnt/cloud and you will see all your containers as directories. This is a great way to see what you have using ls and also to make backups of your cloudfiles to another hard-drive on your server.
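
The 10K-per-container bucketing from the copy step can be sketched like this. I described doing it in PHP; this is the same arithmetic in shell, and the function name container_for is hypothetical. It assumes ids start at 1 and uses the c-pics000 naming from above.

```shell
# container_for ID: map a 1-based auto-increment id to a container name,
# 10,000 items per container:
#   ids 1..10000     -> c-pics000
#   ids 10001..20000 -> c-pics001, and so on
container_for() {
  printf 'c-pics%03d\n' $(( ($1 - 1) / 10000 ))
}

container_for 1
container_for 10001
container_for 30000
```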

You're done!

Linux virtualization with chroot

Big savings with virtualization (Exponential?)

According to Moore's Law we get better, and generally cheaper servers every year. In the data center, we can take advantage of this by leasing servers instead of buying- if we are using Linux virtualization.

There are many virtualization containers that work perfectly well for Linux. The three I have used most over the years are Xen, VMware and Oracle VM VirtualBox. Each has its own quirks you need to learn and requires a software install to use. The one disadvantage they all share, though, is that they add complexity and there is a cost to switching containers. For a pure Linux shop there is a simpler way to virtualize without adding software.

Good old chroot will do the trick

As long as the host and the virtualized OS will run on the same kernel, you don't need any special software to virtualize in Linux. Using chroot, you can build an OS inside an OS. I currently run CentOS as the host OS and Gentoo for my virtualized instances.

Here is how I set it up:
# cat /etc/redhat-release
CentOS release 5.5 (Final)

#mkdir /mnt/gentoo
#cd /mnt/gentoo

First: Download a tar ball of a running Gentoo OS and unpack it
#wget http://distro.ibiblio.org/pub/linux/distributions/gentoo/releases/amd64/autobuilds/current-install-amd64-minimal/stage3-amd64-20110811.tar.bz2

#bunzip2 stage3-amd64-20110811.tar.bz2
#tar xvf stage3-amd64-20110811.tar

Next: Setup networking and devices, copy these 3 lines into the /etc/rc.local of your host OS (CentOS in my case)
mount -t proc none /mnt/gentoo/proc
cp -L /etc/resolv.conf /mnt/gentoo/etc/resolv.conf
chroot /mnt/gentoo su -c '/etc/init.d/start_gentoo.sh' -

Last: add a shell script into your virtualized OS (/mnt/gentoo) to start your webservers, databases and any other programs you need to run on a reboot.

To login to your new virtualized server
#chroot /mnt/gentoo

Come back in 1 year

To copy it to a new leased machine in a year once prices have dropped and CPU cores have doubled:
Stop the running programs (whatever you put into start_gentoo.sh) so that you don't copy things like your database tables during a write.
#tar cvfz /mnt/gentoo.20110815.tar.gz /mnt/gentoo

You can now ftp this to your new server, make a new /mnt/gentoo directory, untar the tar ball, setup your start script and /etc/rc.local and you are up and running on your new, faster machine.
#chroot /mnt/gentoo

Improving a good thing

So that's it! I have had this setup up and working for the past 5 years and every year I change to new servers it gets easier (Hint- if you use the cloud to host your web content, upgrading to new servers gets even easier- see my next post on this topic).

Now I am planning to switch my setup from Gentoo to Ubuntu because Ubuntu has better support in the community. I will write that up once I get it working (I started using Gentoo initially because it is a source distribution, and due to that, there are tar balls that unpack and work easily in chroot). If anyone has tips on how to get Ubuntu running in a chroot environment, please leave a comment.

Thursday, February 05, 2009

HOWTO - Simple Parallel Sort in Linux

More data than memory

I had duplicate records in a MySQL table with 180 million rows and a varchar index. One requirement I had in de-duping the records was that I needed to keep the row with the highest value in a certain column. This is a VERY common scenario for developers and DBAs but there seems to be no good way to do this in MySQL or PostgreSQL. A simple "group by order by" does not work in MySQL (5.0.22) for example because it does not know how to presort rows in a group by.

This means that if you want to show the top seller for each day by using a "group by" it will actually pick the first row it finds instead of sorting first to give you the top seller for the day.

I read tons of posts and tried several ways to dedupe the table. Basically everything works and everyone has great advice and tips. But try de-duping a hundred million records that don't fit into memory with whatever query you want on a database, it's just too slow.

Trying MySQL

Since "group by" would not work without an inner query, my idea was to copy the table structure, add a unique key constraint to the primary key on the new table and then "insert ignore" (ordering by my value to get the highest in first) from old table to new. This would leave me with the new table correctly deduped and I would avoid an expensive subquery. I ran this query for 1 week though and it was still killing the server and not even half finished. (I was not batching these inserts so that is something to try next time.)

Try Postgres

I tried PostgreSQL at this point thinking I could take advantage of the DISTINCT ON(field) construct that they have. This seemed like a good way to de-dupe the records so I loaded the data. After getting the data in though, I discovered that building the indexes to just get started was actually slower than in MySQL... (I'm not saying PostgreSQL in general is slower than MySQL, it could be faster but it was not going to save the day here)

On to Linux - The Sort Command

I decided to try presorting and deduping the records in Linux PRIOR to loading in the database.

Hardware: 2 Dual-Core AMD Opteron(tm) Processor 2214 HE
OS: 2.6.18-92.1.22.el5 #1 SMP x86_64 GNU/Linux

I have 2 servers like this. Each has 4 cores and 2G ram to work with.

First thing to do is clean up the data and remove any junk rows before starting the sort. You want the file small enough to fit into memory if possible.

step1: left pad any numbers so they sort correctly (integers need 10 places)
awk -F',' '{printf "%s,%s,%10d\n",$3,$4,$5}' file.csv >file.padded.csv

step2: remove non-alphanumeric and junk rows (whatever you can do to get the filesize down)
grep -v '[^a-zA-Z,0-9 ]' file.padded.csv | grep -vE '^junk,|,junk,|^foo,|,foo,' >file.padded.clean.nojunk.csv

step3: sort the rows and keep only uniques (I used reverse sort to keep the max value rows)
sort -fru -S1800M file.padded.clean.nojunk.csv > file.padded.clean.nojunk.sorted.csv

The job completed in 12 hours on one server, blowing away the performance of the databases.
But it could be faster.

Linux sorting in parallel (Distributed, Multi-Core)
With 2 servers and 4 cores available to each, it doesn't make sense to run a large sort on only one.

Split the file to be sorted into 4:
wc -l file.padded.clean.nojunk.csv
140000000
split -l 35000000 file.padded.clean.nojunk.csv

files:
xaa
xab
xac
xad

Copy 2 files to the second machine for processing then begin sorting: (4 parallel sorts)
Machine#1
nohup sort -fr -S900M xaa > xaa.sorted &
nohup sort -fr -S900M xab > xab.sorted &

Machine#2
nohup sort -fr -S900M xac > xac.sorted &
nohup sort -fr -S900M xad > xad.sorted &

4 Hours later - merge the 2 sort results in parallel (merges are very fast anyway)
Machine#1
nohup sort -frmu xaa.sorted xab.sorted > xa.ab.sorted

Machine#2
nohup sort -frmu xac.sorted xad.sorted > xa.cd.sorted

4 Minutes later - copy the result from server2 to server1 and complete the merge.

Machine#1
nohup sort -frmu xa.ab.sorted xa.cd.sorted > file.padded.clean.nojunk.sorted.csv

All Done

Took just a little over 4 hours this way and with more machines you could get the sort time down to minutes. You are mainly limited by disk write speeds as long as you have gigabit ethernet between servers.

Time everything and make sure to experiment with memory allocation. I noticed that sort merges were much slower when I allocated a lot of extra memory for them.
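
On a single multi-core box, the split / sort / merge steps above can be wrapped up roughly like this. It is a sketch, not what I actually ran: the function name parallel_sort_u is made up, it assumes bash and a non-empty input file, and it leaves out the -S memory setting so you can tune that yourself.

```shell
# parallel_sort_u INPUT OUTPUT [CHUNKS]
# Split INPUT into CHUNKS pieces, sort the pieces in parallel in the
# background, then merge the sorted pieces and drop duplicates.
parallel_sort_u() {
  local in=$1 out=$2 chunks=${3:-4} tmp lines per f
  tmp=$(mktemp -d)
  lines=$(wc -l < "$in")
  per=$(( (lines + chunks - 1) / chunks ))   # lines per chunk, rounded up
  [ "$per" -gt 0 ] || per=1
  split -l "$per" "$in" "$tmp/part."
  for f in "$tmp"/part.*; do
    sort -fr "$f" > "$f.sorted" &            # one background sort per chunk
  done
  wait                                       # block until every sort is done
  sort -frmu "$tmp"/part.*.sorted > "$out"   # -m merges already-sorted files
  rm -rf "$tmp"
}
```

Usage would look like: parallel_sort_u file.padded.clean.nojunk.csv file.padded.clean.nojunk.sorted.csv 4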

Follow-up notes:

The main takeaway from this is to process the data until you can fit it into memory. Splitting the data on a Multi-Core server into subsets will allow you to process data in parallel.

For a speed boost change your locale from en_US.UTF-8 to C with export LC_ALL=C
Thanks to Tapajyoti Das for this tip. http://tdas.wordpress.com/2008/02/03/speed-up-grep/
(this also changes the way results are filtered: under the C locale a range such as a-z matches only plain ASCII letters and no accented characters, which is often what you want. A range written a-Z is not valid in the C locale, so spell it out as a-zA-Z.)
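
A tiny sketch of what the C locale does to ordering: comparison is by raw byte value, so all uppercase ASCII letters sort before lowercase. The wrapper name c_sort is just for this example.

```shell
# c_sort: sort stdin using plain byte order (C locale) for this one command
c_sort() {
  LC_ALL=C sort
}

# Prints A, B, a, b on separate lines: uppercase bytes come first
printf 'b\nA\na\nB\n' | c_sort
```

Setting LC_ALL=C in front of a single command like this avoids changing the locale for the rest of your session.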

Wednesday, December 03, 2008

Absolute Best Way to Embed Flash in HTML

*update- doesn't seem to work with Google browser... not sure about safari. i'll update if i get a fix

This is the best way to embed flash into a webpage without javascript. It is a Markup Only solution and because it does not use the embed tag, it will validate.

The HTML object method (nested objects + IE conditional comments):

<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" id="smix" codebase="http://fpdownload.macromedia.com/get/flashplayer/current/swflash.cab" height="30" width="100">
<param name="movie" value="file.swf">
<param name="wmode" value="transparent">
<param name="allowScriptAccess" value="sameDomain">
<!--[if !IE]>-->
<object name="smix" type="application/x-shockwave-flash" data="file.swf" height="30" width="100">
<param name="wmode" value="transparent">
<param name="allowScriptAccess" value="sameDomain">
</object>
<!--<![endif]-->
</object>

Here is a compatibility chart for this method from the flash test suite by Bobby van der Sluis.

Browser	IE 5, 5.5, 6	IE 7, 8b2	FF 1, 2, 3; Moz 1.7	Saf 1.3, 2, 3*; Chrome 0.2	Opera 9, 9.5
Basic	yes	yes	yes	yes	yes
Streaming	yes	yes	yes	yes	yes
Params	yes	yes	yes	yes	yes
Communication	yes	yes	yes	no fscommand	no fscommand

Since I require javascript communication, I was a little worried about Safari support but then realized that I use ExternalInterface instead of the old flash way to communicate with javascript called fscommand.

Flash 8 uses ExternalInterface to do javascript communication without using SWLiveConnect. This was always a problem with Safari using the Object code above but no more with Flash 8 and up.

Recommendations to use SWFObject are all over the forums and it is well promoted, but it is too much overhead for my taste. Just remember that IE7 and under will require an extra click from the users unless you use one of the javascript methods. This is due to the stupid Eolas patent that claimed rights to HTML embed technology.

Wednesday, April 09, 2008

ImageMagick - How to shrink images in batch mode

find . -size +1000k -name \*.png | awk '{print $1,$1}' | xargs -n2 convert -scale 1024x\>

Friday, February 01, 2008

Setup a Future Rescue with the 'at' Command

With a leased server, you never want to lock yourself out of your box. If you do get locked out, KVM style console access can save you but it's not cheap. It usually costs $30 per use. And that's if your ISP even supports it.

There are 3 risky things that can get you locked out of your server.
1) service iptables restart
2) service network restart
3) changing etc files like /etc/sudoers ... and then rebooting

For me, iptables is the riskiest to edit because one bad rule can lock you out of ssh on port 22.

To protect yourself, use 'at' to revert the file change 5 minutes in the future:
echo 'mv ifcfg-eth0.save ifcfg-eth0' | at now + 5 minutes

*this will work even if you are kicked out (as long as the server is running).

--update-- here is another way to prevent getting locked out by iptables during testing:

*/10 * * * * iptables -P INPUT ACCEPT; iptables -P OUTPUT ACCEPT; iptables -P FORWARD ACCEPT; /sbin/iptables -F

This will flush all the rules every 10 minutes, just in case you lock yourself out. When you're happy with the results of your work, remove the line from your crontab, and you're in business.

Saturday, May 05, 2007

Consolidate your web architecture -w- vmware

Good and Cheap Hardware
CPU: AMD64 Dual Core Processor 4400+
Clock: 2.2 GHz
Ram: 1 Gig

1) Select a CentOS 4.4 install (With 5 usable ip addresses)
2) Install vmware server (See last week's post)
3) Download CentOS 5 virtual appliance and cp into /vm directory
http://www.vmware.com/vmtn/appliances/directory/820
4) Download MySQL Appliance and cp into /vm directory
5) Download memcached Appliance and cp into /vm directory

6) Edit /etc/sysconfig/network-scripts/ifcfg-eth0-range0 to free up 3 ip addresses
## START
#IPADDR_START=72.xxx.xxx.19
IPADDR_START=72.xxx.xxx.22
IPADDR_END=72.xxx.xxx.22
CLONENUM_START=0
##END

7) service network restart
New ifconfig will show IPs up on eth0 and one ethernet alias eth0:0

8) run vmware-config.pl
for networking, select vmnet0 to bridge on eth0
set vmware default appliance directory to /vm

9) From a pc start vmware client and connect to the remote server.
10) "Open existing virtual machine" and browse to each appliance to install.
11) Start up both vm appliances and setup networking.
ifconfig eth0 72.xxx.xxx.19
route add default gw [use the same gateway that is used on the host server]
make changes permanent

/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
HWADDR=00:0c:22:32:s0:d4
IPADDR=72.xxx.xxx.19
NETMASK=255.255.255.248
ONBOOT=yes

# vi /etc/sysconfig/network
Append/modify configuration as follows:

NETWORKING=yes
HOSTNAME=www1.fubar.com
GATEWAY=72.xxx.xxx.xxx

Set the correct DNS server defined in /etc/resolv.conf file:
# vi /etc/resolv.conf
nameserver 72.yyy.yyy.yyy

service network restart
ping google.com

12) Assign DNS names to the new servers

13) Install vmware tools. This is not required if you will usually login to the server via ssh.
http://www.thoughtpolice.co.uk/vmware/howto/centos-5-vmware-tools-install.html

Sunday, April 29, 2007

Install Vmware on 64 bit CENTOS 4.4

Here are detailed instructions on how to successfully install vmware on a 64 bit server running CENTOS 4.4.

The server:
OS: CentOS 5 x86_64
Hardware: AMD 64 Athlon 4400 DC/1024MB/160GB SATA

# uname -a
Linux xx.xx.xx.xx 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 09:40:21 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

# arch -a
x86_64

#cat /etc/redhat-release
CentOS release 5 (Final) *

[Thanks to my hosting provider LayeredTech. They Rock!]

First, download and install the latest VMware Server. Since there are no x64 rpm files, I used the x86 one. I hope it will use this CPU efficiently.

sh-3.00# rpm -ivh VMware-server-1.0.3-44356.i386.rpm

sh-3.00# vmware-config.pl
The correct version of one or more libraries needed to run VMware Server may be
missing. This is the output of ldd /usr/bin/vmware:
linux-gate.so.1 => (0xffffe000)
libm.so.6 => /lib/tls/libm.so.6 (0xf7fd0000)
libdl.so.2 => /lib/libdl.so.2 (0xf7fcc000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0xf7fba000)
libX11.so.6 => not found
libXtst.so.6 => not found
libXext.so.6 => not found
libXt.so.6 => not found
libICE.so.6 => not found
libSM.so.6 => not found
libXrender.so.1 => not found
libz.so.1 => not found
libc.so.6 => /lib/tls/libc.so.6 (0x001e6000)
/lib/ld-linux.so.2 (0x001cd000)

This program cannot tell for sure, but you may need to upgrade libc5 to glibc
before you can run VMware Server.

Hit enter to continue.

This is the first hint of trouble. Next after a few more prompts you will see an error message about a missing X library file.

/usr/lib/vmware/bin/vmware-vmx: error while loading shared libraries: libX11.so.6: cannot open shared object file: No such file or directory
Please enter your 20-character serial number.

This is because vmware server is compiled against X libraries on the x86 architecture. To me this is an odd dependency because most servers do not run Xwindows. Personally, I never use Xwindows on servers, but to get past the vmware-config.pl installation you must install the xorg-x11-libs for x86 platform (even if you are on a x64).

Here is how it went:
rpm -ivh xorg-x11-libs-6.8.2-1.EL.13.37.7.i386.rpm
error: Failed dependencies:
libGL.so.1 is needed by xorg-x11-libs-6.8.2-1.EL.13.37.7.i386
libexpat.so.0 is needed by xorg-x11-libs-6.8.2-1.EL.13.37.7.i386
libfontconfig.so.1 is needed by xorg-x11-libs-6.8.2-1.EL.13.37.7.i386
libfreetype.so.6 is needed by xorg-x11-libs-6.8.2-1.EL.13.37.7.i386
libz.so.1 is needed by xorg-x11-libs-6.8.2-1.EL.13.37.7.i386

OK, so first I need to download the libraries that are needed.

wget ftp://ftp.nluug.nl/pub/os/Linux/distr/smeserver/releases/testing/7.2/smeos/i386/CentOS/RPMS/xorg-x11-libs-6.8.2-1.EL.13.37.7.i386.rpm
wget ftp://ftp.nluug.nl/pub/os/Linux/distr/smeserver/releases/7.1/smeos/i386/CentOS/RPMS/zlib-1.2.1.2-1.2.i386.rpm
wget ftp://ftp.nluug.nl/pub/metalab/distributions/smeserver/releases/7.1/smeos/i386/CentOS/RPMS/expat-1.95.7-4.i386.rpm
wget ftp://ftp.nluug.nl/pub/ibiblio/distributions/smeserver/releases/7.1/smeos/i386/CentOS/RPMS/fontconfig-2.2.3-7.centos4.i386.rpm
wget ftp://ftp.nluug.nl/pub/os/Linux/distr/CentOS/4.4/updates/i386/RPMS/freetype-devel-2.1.9-5.el4.i386.rpm
wget ftp://ftp.nluug.nl/pub/sunsite/distributions/smeserver/releases/testing/7.2/smeos/i386/CentOS/RPMS/xorg-x11-Mesa-libGL-6.8.2-1.EL.13.37.7.i386.rpm
wget ftp://ftp.nluug.nl/pub/os/Linux/distr/CentOS/4.4/updates/x86_64/RPMS/freetype-2.1.9-5.el4.i386.rpm

rpm -ivh xorg-x11-libs-6.8.2-1.EL.13.37.7.i386.rpm xorg-x11-Mesa-libGL-6.8.2-1.EL.13.37.7.i386.rpm expat-1.95.7-4.i386.rpm fontconfig-2.2.3-7.centos4.i386.rpm freetype-devel-2.1.9-5.el4.i386.rpm zlib-1.2.1.2-1.2.i386.rpm freetype-2.1.9-5.el4.i386.rpm
Preparing... ########################################### [100%]
1:zlib ########################################### [ 14%]
2:freetype ########################################### [ 29%]
3:expat ########################################### [ 43%]
4:fontconfig ########################################### [ 57%]
5:freetype-devel ########################################### [ 71%]
6:xorg-x11-Mesa-libGL ########################################### [ 86%]
7:xorg-x11-libs ########################################### [100%]

OK, so now that I have the xorg-x11 libraries installed I can retry the vmware-config.pl script.

Please enter your 20-character serial number.

Type XXXXX-XXXXX-XXXXX-XXXXX or 'Enter' to cancel:

Starting VMware services:
Virtual machine monitor [ OK ]
Virtual ethernet [ OK ]
Bridged networking on /dev/vmnet0 [ OK ]

The configuration of VMware Server 1.0.3 build-44356 for Linux for this running
kernel completed successfully.

Success!!

Thanks to Karanbir Singh for his blog post on installing vmware on CENTOS 4.4.
vmware_server_on_x86_64_centos4_redhat_e

Friday, April 20, 2007

Easiest possible way to port forward

Say you have a mysql server that you do not want to make public. Port 3306 is blocked by your firewall but port 22 is open for ssh.

Just connect to your mysql server from putty on port 22 and modify one setting.
Connection->SSH->Tunnels add "L3306 127.0.0.1:3306"

After you make this change you can Click on Session and save your changes.

Once you connect through putty, you will be able to access the remote mysql server by localhost:3306.

Warning: Don't forget that this port is local now! If you forget that port 3306 is forwarded, you may wonder why you have a mysql server running on your machine.
If you decide later to install mysql on your local machine you should delete the port forward from the putty config to avoid unintentional changes to your remote server.

Saturday, August 19, 2006

Awking Mysql

How do you perform awk type functions in Mysql?

Say you have a list of urls and you want to find the most common domain names hosting gif files.

http://www.fileden.com/123/foo.gif
http://h1.ripway.com/11.gif
http://www.hostmyfile.info/22/web.gif
http://www.fileden.com/24/fun.gif

If this list is in a file you can use standard linux commands to find the most common domains.

grep '\.gif' my_pictures |awk -F'/' '{print $3}' |sort |uniq -c|sort -nr
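
A quick way to check the domain-counting idea is to feed it the four sample URLs. The sketch below wraps the pipeline in a function I named top_domains (the name is mine) and adds a final awk step only to strip the padding that uniq -c puts in front of the counts.

```shell
#!/bin/sh
# Sketch: count the domains hosting gif files, reading URLs from stdin.
top_domains() {
    grep '\.gif$' |            # keep only URLs ending in .gif
    awk -F/ '{print $3}' |     # field 3 of http://host/path is the host
    sort | uniq -c |           # count each host
    sort -k1,1nr -k2 |         # highest count first, ties by name
    awk '{print $1, $2}'       # normalise the padding from uniq -c
}

printf '%s\n' \
    http://www.fileden.com/123/foo.gif \
    http://h1.ripway.com/11.gif \
    http://www.hostmyfile.info/22/web.gif \
    http://www.fileden.com/24/fun.gif | top_domains
```

With this sample, www.fileden.com should come out on top with a count of 2.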

If the list is in a mysql database you can use locate() and substring() to accomplish the same thing!

SELECT count(*) as count, substring(url,1,locate('/',url,9)) as f FROM `my_pictures` WHERE url IS NOT NULL and url !='' and filename like '%.gif' group by f order by count desc

Happy Awking!

Tuesday, July 11, 2006

phpMyAdmin #1045 - Access denied for user 'root'@'localhost' (using password: NO)

When installing phpMyAdmin the most common error is not being able to connect to the database after installation.

#1045 - Access denied for user 'root'@'localhost' (using password: NO)

Here is the solution:
Make sure your config file in the phpMyAdmin directory is named "config.inc.php"!

The installation instructions say to copy config.default.php into the phpMyAdmin directory and then customize it (usually the only change needed is switching 'auth_type' from 'config' to 'http'). Make sure to rename config.default.php to config.inc.php.

Tuesday, July 04, 2006

Get the Latitude and Longitude from an IP

ip2location shell script

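# each entry in ip_list has the form id_ip (an id, an underscore, a dotted-quad IP)
for k in `cat ip_list`;
do
  id=`echo $k|awk -F_ '{print $1}'`;
  # convert the dotted quad into the single integer used by ip_from / ip_to
  ip_num=`echo $k|awk -F_ '{print $2}'|awk -F. '{ printf "%10d\n",$1 * 16777216 + $2 * 65536 + $3 * 256 + $4 }'`;
  echo id=$id;
  echo ip_num=$ip_num;
  # look up the range that contains this IP number
  echo "SELECT ip_latitude,ip_longitude,ip_city FROM ip2location WHERE (ip_from <= $ip_num) AND (ip_to >= $ip_num) LIMIT 1"|mysql -u doug crostel_store;
done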
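
The IP number arithmetic in the loop can be sketched on its own. This version uses shell arithmetic instead of awk; the function name ip_to_num is mine, and it assumes a well-formed dotted quad with no leading zeros in any octet.

```shell
#!/bin/sh
# Sketch: turn a dotted-quad IP into the integer form used by ip_from / ip_to.
ip_to_num() {
    # split the argument on "." so the four octets land in $1..$4
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    # a.b.c.d = a*256^3 + b*256^2 + c*256 + d
    echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

ip_to_num 65.10.150.245
```

For 65.10.150.245 this gives 65*16777216 + 10*65536 + 150*256 + 245 = 1091213045.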

UPDATE my_space_users u, ip2location i
SET u.user_location = i.ip_city
WHERE (ip_from <= 1091213045) AND (ip_to >= 1091213045)
LIMIT 2

mysql> update my_space_users u, ip2location l set u.ip_country=l.country_short, u.ip_region=l.ip_region, u.ip_city=l.ip_city, u.ip_latitude=l.ip_latitude, u.ip_longitude=l.ip_longitude, u.ip_isp=l.ip_isp, u.ip_domain=l.ip_domain WHERE (l.ip_from <= u.reg_ip_num) AND (l.ip_to >= u.reg_ip_num);

Thursday, April 13, 2006

Using Wikipedia as a natural language corpus

1) Download the English Yahoo abstracts from

http://en.wikipedia.org/wiki/Sort-Merge_Join

wiki.txt
Anarchism is derived from the Greek αναρχία ("without archons (ruler, chief, king)"). Anarchism as a political philosophy, is the belief that rulers, governments, and hierarchal social relationships are unnecessary and should be abolished, although there are differing interpretations of what this means.
Albedo is the measure of reflectivity of a surface or body. It is the ratio of electromagnetic radiation (EM radiation) reflected to the amount incident upon it.
Abu Dhabi (Arabic: أبو ظبي ʼAbū Ẓaby) is the largest of the seven emirates that comprise the United Arab Emirates and was also the largest of the former Trucial States. Abu Dhabi is also a city of the same name within the Emirate that is the capital of the country, in north central UAE.

wiki.txt1
Anarchism
is
derived
from
the

wiki.txt2
is
derived
from
the
Greek

wiki.txt12
Anarchism is
is derived
derived from
from the
the Greek
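
The three sample files above fit a shift-and-paste approach: put one word per line, make a copy shifted up by one word, then paste the two side by side. A minimal sketch, assuming that is how the files were built (the function name make_bigrams is mine):

```shell
#!/bin/sh
# Sketch: build word pairs from a text file by shifting and pasting.
make_bigrams() {
    src=$1
    # one word per line (wiki.txt1)
    tr -s '[:space:]' '\n' <"$src" >"${src}1"
    # the same list shifted up by one word (wiki.txt2)
    tail -n +2 "${src}1" >"${src}2"
    # word N next to word N+1 (wiki.txt12)
    paste -d ' ' "${src}1" "${src}2" >"${src}12"
}

# usage: make_bigrams wiki.txt
```

The last line of the pasted file holds only the final word, because the shifted file is one line shorter.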

wiki.txt.sorted
wiki.txt.sorted.1
wiki.txt.sorted.nonums
wiki.txt.sorted.nonums1
wiki.txt.sorted.nonums1.uniq
wiki.txt.sorted.nonums1.uniq.2words

export LC_ALL=C

cat
wordnet.index
wordnet.index.sorted
wordnet.index.sorted.uniq
wiki.txt.sorted.nonums1.uniq.2words.cols
wiki.txt.sorted.nonums1.uniq.2words.cols.LC_ALL
wordnet.index.sorted.uniq.LC_ALL
wiki-wordnet.join

for k in `cat wordnet.index.sorted.uniq.LC_ALL`; do look $k wiki.txt.sorted.nonums1.uniq.2words.cols.LC_ALL; done >/tar-backups/wiki-wordnet.join

3 word n-grams
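grep -v '[^[:print:]]' wiki.txt1 >wiki1.txt
tail -n +2 wiki1.txt >wiki2.txt
tail -n +2 wiki2.txt >wiki3.txt
# combine the three shifted word lists into tab-separated columns
paste wiki1.txt wiki2.txt wiki3.txt >wiki123.txt

# -P (GNU grep) is needed for \t to mean a tab; plain grep reads it as a literal t
grep -P "[[:print:]]+\t[[:print:]]+\t[[:print:]]" wiki123.txt >wiki123.txt3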

Find the most used 3 word phrases in a text file
grep -P ".+\t.+\t.+" rewrite123.txt >rewrite123.txt3
sort rewrite123.txt3 >rewrite123.txt3.sorted
uniq -dc rewrite123.txt3.sorted >rewrite123.txt3.sorted.uniqd
sort -nr rewrite123.txt3.sorted.uniqd >rewrite123.txt3.sorted.uniqd.sorted
more rewrite123.txt3.sorted.uniqd.sorted

3 be traveling with
2 you want to
2 want to have
2 student group bookings.
2 part of their
2 on the other
2 notarized letter of
2 made with tap
2 entertainment , activities
2 and want to
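
The same count can be sketched as a single pipeline with no intermediate files. This is an alternative to the steps above, not the method used to produce that output: awk remembers the previous two words instead of pasting shifted files, and uniq -c is used without -d, so phrases that occur only once are listed too. The function name top_trigrams is mine.

```shell
#!/bin/sh
# Sketch: most frequent 3 word phrases in the text on stdin, in one pipeline.
top_trigrams() {
    tr -s '[:space:]' '\n' |                                       # one word per line
    awk 'NR > 2 { print p2, p1, $0 } { p2 = p1; p1 = $0 }' |       # previous two words + current
    sort | uniq -c |                                               # count each phrase
    sort -nr                                                       # most frequent first
}

echo "you want to have what you want to see" | top_trigrams | head -n 3
```

In this sample the only phrase that repeats is "you want to", so it should appear first with a count of 2.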

Saturday, March 18, 2006

Plot a wav file - wav2png

#!/bin/sh
#usage: wav2png.sh file.wav

BASE=${1%.wav}
PNG=$BASE.png
WAV=$BASE.wav
DAT=$BASE.dat

#echo $BASE,$PNG,$WAV,$DAT
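# sox writes a text .dat file whose header lines start with ';'
sox "$WAV" "$DAT"
grep -v '^;' "$DAT" >"$DAT.clean"
# use the first header line (minus the ';') as the plot title
FREQ=`head -n 1 "$DAT"|tr -d ';'`

echo "set terminal png;set title '$FREQ';set output '$PNG'; plot '$DAT.clean'" |gnuplot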

Split a midi file into separate tracks

Get the program:

wget http://interglacial.com/~sburke/pub/midi_splitter.pl

perl midi_splitter.pl /tmp/vanilla.mid
file vanilla*.mid
midi2abc vanilla.mid
midi2abc vanilla_t01_c00.mid >vanilla_t01.abc

Thursday, March 16, 2006

Access internet hotspots in client mode with WRT54G

Linksys WRT54G (below version 5, because Linksys later switched to VxWorks)

Load the sveasoft firmware. (I am using Alchemy-pre5.3 v2.04.4.8sv)

Put it in client mode with loopback off.

Here is a good tutorial:
http://www.engadget.com/2005/05/24/how-to-connect-your-linksys-wrt54g-network-to-the-internet/

ssh into the router and type:
~ # wl scan
~ # wl scanresults|tr -d '\n' |sed 's/noise/\n/g' |tr ']' '\n'|grep ^SSID|awk '{print $(NF-1),$2}' |sed 's/Mode://'|sort -n
-96 "zeus"
-94 "LF-X1U.00014A10A7C8"
-94 "wireless"
-89 "Doug"
-88 "GHOST"
-86 "COR-MDA-LAN"
-86 "linksys"
-86 "linksys"
-83 "default"
-83 "hpsetup"
-83 "linksys"
-78 "hpsetup"
-78 "linksys"
-74 "hirsh
-51 "link"

You will get a list of the access points in range, sorted by signal strength with the most powerful ones at the bottom.