More data than memory
I had duplicate records in a MySQL table with 180 million rows and a varchar index. One requirement I had in de-duping the records was that I needed to keep the row with the highest value in a certain column. This is a VERY common scenario for developers and DBAs but there seems to be no good way to do this in MySQL or Postgreql. A simple "group by order by" does not work in MySQL (5.0.22) for example because it does not know how to presort rows in a group by.
This means that if you want to show the top seller for each day by using a "group by" it will actually pick the first row it finds instead of sorting first to give you the top seller for the day.
I read tons of posts and tried several ways to dedupe the table. Basically everything works and everyone has great advice and tips. But try de-duping a hundred million records that don't fit into memory with whatever query you want on a database, it's just too slow.
Trying MySQL
Since "group by" would not work without an inner query, my idea was to copy the table structure, add a unique key constraint to the primary key on the new table and then "insert ignore" (ordering by my value to get the highest in first) from old table to new. This would leave me with the new table correctly deduped and I would avoid an expensive subquery. I ran this query for 1 week though and it was still killing the server and not even half finished. (I was not batching these inserts so that is something to try next time.)
Try Postgres
I tried postgresql at this point thinking I could take advantage of the DISTINCT ON(field) construct that they have. This seemed like a good way to de-dupe the records so I loaded the data. After getting the data in though, I discovered that building the indexes to just get started was actully slower than in my sql... (I'm not saying postgresql in general is slower than MySQL, it could be faster but it was not going to save the day here)
On to Linux - The Sort Command
I decided to try presorting and deduping the records in Linux PRIOR to loading in the database.
Hardware: 2 Dual-Core AMD Opteron(tm) Processor 2214 HE
OS: 2.6.18-92.1.22.el5 #1 SMP x86_64 GNU/Linux
I have 2 servers like this. Each has 4 cores and 2G ram to work with.
First thing to do is clean up the data and remove any junk rows before starting the sort. You want the file small enough to fit into memory if possible.
step1: left pad any numbers so they sort correctly (integers need 10 places)
awk -F',' '{printf "%s,%s,%10d\n",$3,$4,$5}' file.csv >file.padded.csv
step2: remove non-alphanumeric and junk rows (whatever you can do to get the filesize down)
grep -v '[^a-Z,0-9 ]' file.padded.csv | grep -vE '^junk,|,junk,|^foo,|,foo,' >file.padded.clean.nojunk.csv
step3: sort the rows and keep only uniques (I used reverse sort to keep the max value rows)
sort -fru -S1800M file.padded.clean.nojunk.csv > file.padded.clean.nojunk.sorted.csv
The Job completed in 12 hours on one server. Blowing away the performance of the databases.
But it could be faster.
Linux sorting in parallel (Distributed, Muti-Core)
With 2 servers and 4 cores available to each, it doesn't make sense to run a large sort on only one.
Split the file to be sorted into 4:
wc -l file.padded.clean.nojunk.csv
140000000
split -l 35000000 file.padded.clean.nojunk.csv
files:
xaa
xab
xac
xad
Copy 2 files to the second machine for processing then begin sorting: (4 parallel sorts)
Machine#1
nohup sort -fr -S900M xaa > xaa.sorted
nohup sort -fr -S900M xab > xab.sorted
Machine#2
nohup sort -fr -S900M xac > xac.sorted
nohup sort -fr -S900M xad > xad.sorted
4 Hours later - merge the 2 sort results in parallel (merges are very fast anyway)
Machine#1
nohup sort -frmu xaa.sorted xab.sorted > xa.ab.sorted
Machine#2
nohup sort -frmu xac.sorted xad.sorted > xa.cd.sorted
4 Minutes later - copy the result from server2 to server1 and complete the merge.
Machine#1
nohup sort -frmu xa.ab.sorted xa.cd.sorted > file.padded.clean.nojunk.sorted.csv
All Done
Took just a little over 4 hours this way and with more machines you could get the sort time down to minutes. You are mainly limited by disk write speeds as long as you have gigabit ethernet between servers.
Time everything and make sure to experiment with memory allocation. I noticed that sort merges were much slower when I allocated a lot of extra memory for them.
Follow-up notes:
The main takeaway from this is to process the data until you can fit it into memory. Splitting the data on a Multi-Core server into subsets will allow you to process data in parallel.
For a speed boost change your locale from en_US.UTF-8 to C with export LC_ALL=C
Thanks to Tapajyoti Das for this tip. http://tdas.wordpress.com/2008/02/03/speed-up-grep/
(this also will change the way results are filtered, for example a-Z will match only simple ascii and no accented characters under Locale C- this is often what you want)