Quick and easy multicore sort using Bash one-liners

In my line of work I often encounter the need to sort multi-gigabyte files that contain some form of tabulated text data. Usually, one would do this using a single unix sort command, like this:

sort data.tsv -o data.tsv.sorted

Even with an adequate machine, this takes 21 minutes for a 7.4GB file with 115M objects. But like most moderate-sized work machines these days, we have multiple cores (2x quad-core Intel Xeon), abundant memory (24GB) and a fast disk (15K RPM), so running a single sort command on a file is a serious underutilization of resources!

Now there’s all sorts of fancy commercial / non-commercial tools that can optimize this, but I just wanted something quick and dirty. Here’s a quick few lines that I end up using often that I thought would be useful to share:

split -l 5000000 data.tsv _tmp
ls -1 _tmp* | while read -r FILE; do sort "$FILE" -o "$FILE" & done

Once all the background sort processes have finished, follow up with:
sort -m _tmp* -o data.tsv.sorted

This takes less than 5 minutes!

How this works: The process is straightforward; you’re essentially:
1. splitting the file into pieces
2. sorting all the split pieces in parallel to take advantage of multiple cores (note the “&”, enabling background processes)
3. then merging the sorted files back

Because the speedup from using several cores outweighs the cost of the extra disk I/O from splitting and merging, the whole sort finishes much faster. Note the use of while read; this ensures that just-created files don’t get considered, avoiding infinite loops.
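If you'd rather not babysit the two steps, the three stages above can be sketched as a single Bash function. This is only a rough sketch under a few assumptions: the function name parallel_sort, its arguments and the chunk_ prefix are made up for illustration, it swaps the ls / while read loop for a plain glob (which is expanded once, before any sort starts) and it adds a wait so the merge only begins after every background sort has finished. It also starts one sort per chunk regardless of how many cores you have, and it assumes a non-empty input file.

```shell
#!/usr/bin/env bash
# Sketch only: parallel_sort, its arguments and the chunk_ prefix are
# hypothetical names, not part of the original one-liners.
parallel_sort() {
    local input=$1 output=$2 lines_per_chunk=${3:-5000000}
    local tmpdir chunk
    tmpdir=$(mktemp -d) || return 1

    # 1. Split the input into pieces inside a private scratch directory.
    split -l "$lines_per_chunk" "$input" "$tmpdir/chunk_"

    # 2. Sort every piece in place, one background process per piece.
    for chunk in "$tmpdir"/chunk_*; do
        sort -o "$chunk" "$chunk" &
    done

    # Block until all of the background sorts have finished.
    wait

    # 3. Merge the already-sorted pieces into the final output file.
    sort -m -o "$output" "$tmpdir"/chunk_*
    rm -rf "$tmpdir"
}
```

Usage would look something like: parallel_sort data.tsv data.tsv.sorted 5000000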

For fun, here’s a screen cap of what I’d like to call the “Muhahahahaha moment” — when the CPU gets bombarded with sort processes, saturating all cores:

(see the video version of this; you’ll need to skip to the 00:36 mark)

brand new day?

Microsoft seems to be flaunting a new attitude, one that says "Yes we're evil. Now let us try to become good". Rob Scoble is almost 100% of the time using this as his basic evangelistic premise; new language technologies ASP and .Net support unix-friendly PerlScript; and the open-source-like pre-pre-pre-alpha release of Longhorn is full of this apparently positive approach.


joint interest

Aadisht writes about the mysterious interest-invoking properties of marijuana. Perhaps the perfect solution to cure my hatred against the Unix Network Programming paper?



It's been almost a month now since college started. I've become a "4th yearie" now, and along with the privilege of being part of the most senior student batch in college comes the burden of selecting 5 electives out of a list of 11 specialized subjects. Here's what I chose:

  1. Artificial Intelligence
  2. Data Mining
  3. Advanced Internet Technology
  4. Unix Network Programming
  5. Cost Accounting and HRM

Out of these, AI and Data Mining are my two favourite subjects, and it's really cool to learn about logic, knowledge and stuff.


now on mt2.5

Finally finished upgrading to Version 2.5. This is the reason I moved to php from perl. I uploaded all the files, ran the script, and what do I get? The Internal Server Error 500. My webserver can't handle perl scripts in DOS format, so they have to be converted to UNIX format. It took me an entire week of research to get this thing right - 6 days to find the batch converter, and 20 seconds to convert all the files.
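For anyone hitting the same wall, here is a rough sketch of what a DOS-to-UNIX conversion boils down to on a machine with a Bash shell. This is not the batch converter mentioned above; convert_crlf_to_lf is a made-up helper name, and it assumes the files are plain text scripts where every carriage return is part of a DOS line ending.

```shell
#!/usr/bin/env bash
# Sketch only: convert_crlf_to_lf is a hypothetical helper name.
# Assumes plain text files in which every carriage return belongs to
# a DOS (CRLF) line ending, since tr deletes all of them.
convert_crlf_to_lf() {
    local file tmp
    for file in "$@"; do
        tmp=$(mktemp) || return 1
        # Strip the carriage returns into a scratch copy, then write the
        # result back over the original so its permissions are kept.
        tr -d '\r' < "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    done
}
```

Something like convert_crlf_to_lf *.cgi *.pl would then handle a whole directory of scripts in one go.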