In my line of work I often encounter the need to sort multi-gigabyte files containing some form of tabulated text data. Usually, one would do this with a single Unix sort command, like this:
sort data.tsv -o data.tsv.sorted
Even on an adequate machine, this takes 21 minutes for a 7.4 GB file with 115M records. But like most moderately sized work machines these days, we have multiple cores (2x quad-core Intel Xeon), abundant memory (24 GB), and a fast disk (15K RPM), so running a single sort command on the file is a serious underutilization of resources!
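(For reference, the baseline number is just the plain command run under time; nproc, from GNU coreutils, tells you how many cores you have to throw at the problem:)

nproc                                    # how many cores are available
time sort data.tsv -o data.tsv.sorted    # single-command baseline: ~21 minutes here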
Now, there are all sorts of fancy commercial and non-commercial tools that can optimize this, but I just wanted something quick and dirty. Here are the few lines I end up using often, which I thought would be worth sharing:
split -l5000000 data.tsv '_tmp';
ls -1 _tmp* | { while read FILE; do sort "$FILE" -o "$FILE" & done; wait; };
sort -m _tmp* -o data.tsv.sorted
This takes less than 5 minutes!
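If you find yourself doing this a lot, it can be worth wrapping the recipe in a small script. Here's a sketch of a slightly more defensive version; it assumes bash and GNU coreutils, and the psort name, argument handling, and _tmp prefix are just illustrative choices, not anything standard:

#!/usr/bin/env bash
# psort: split / parallel-sort / merge a text file (a sketch, not battle-tested)
set -euo pipefail

IN=$1                    # input file, e.g. data.tsv
OUT=$2                   # output file, e.g. data.tsv.sorted
LINES=${3:-5000000}      # lines per split piece

split -l "$LINES" "$IN" '_tmp'

# Sort each piece in place as a background job, then wait for all of them.
for f in _tmp*; do
    sort "$f" -o "$f" &
done
wait

# Merge the already-sorted pieces and clean up the temporaries.
sort -m _tmp* -o "$OUT"
rm -f _tmp*

In a script, a plain for loop over the glob works just as well as the ls | while read form: the glob is expanded once up front, and since the loop runs in the main shell, wait sees the background jobs directly.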
How this works: The process is straightforward; you’re essentially:
1. splitting the file into 5M-line pieces (split -l5000000)
2. sorting all the split pieces in parallel to take advantage of multiple cores (note the "&", which runs each sort as a background process, and the wait, which blocks until they have all finished)
3. then merging the sorted pieces back together (sort -m does a cheap linear merge of inputs that are already sorted, rather than a full re-sort; see the small demo below)
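To see what sort -m buys you, here's a tiny example (the file names x and y are made up for illustration):

printf 'a\nc\n' > x
printf 'b\nd\n' > y
sort -m x y    # prints a, b, c, d in one pass, interleaving the two sorted inputs

Because each input is already sorted, sort -m only ever compares the current head of each file, so the merge is linear in the total number of lines.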
Since the speedup from using all the cores outweighs the cost of the extra disk I/O from splitting and merging, you get a much faster sort overall. Note the use of ls | while read: the _tmp* glob is expanded once, when the pipeline starts, so files created afterwards don't get picked up, avoiding infinite loops. The wait sits inside the braces because the piped while loop runs in a subshell; a wait placed after the pipeline would return immediately instead of blocking until the background sorts finish.
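If you want to convince yourself of the snapshot behavior, here's a contrived example (the _demo names are made up); files created inside the loop never get iterated, because the glob was already expanded when the pipeline started:

touch _demo_a _demo_b
ls -1 _demo* | while read F; do
    echo "processing $F"
    cp "$F" "$F.copy"    # creates new files matching _demo* mid-loop
done
# prints only _demo_a and _demo_b; the .copy files are never picked up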
For fun, here's a screen capture of what I'd like to call the "Muhahahahaha moment": when the CPU gets bombarded with sort processes, saturating all the cores:
(see the video version of this; you'll need to skip to the 00:36 mark)