Quick and easy multicore sort using Bash oneliners

In my line of work I often encounter the need to sort multi-gigabyte files that contain some form of tabulated text data. Usually, one would do this using a single unix sort command, like this:

sort data.tsv -o data.tsv.sorted

Even with an adequate machine, this takes 21 minutes for a 7.4GB file with 115M objects. But like most moderate-sized work machines these days, we have multiple cores(2xQuad Intel Xeon), abundant memory(24G) and a fast disk(15K RPM), so running a single sort command on a file is serious underutilization of resources!

Now there’s all sorts of fancy commercial / non-commercial tools that can optimize this, but I just wanted something quick and dirty. Here’s a quick few lines that I end up using often that I thought would be useful to share:

split -l5000000 data.tsv '_tmp';
ls -1 _tmp* | while read FILE; do sort $FILE -o $FILE & done;

Followed by:
sort -m _tmp* -o data.tsv.sorted

This takes less than 5 minutes!

How this works: The process is straightforward; you’re essentially:
1. splitting the file into pieces
2. sorting all the split pieces in parallel to take advantage of multiple cores (note the “&”, enabling background processes)
3. then merging the sorted files back

Since the speedup from the number of cores outweighs the cost of increased disk I/O from splitting / merging, you have a much faster sort. Note the use of while read; this ensures that just-created files don’t get considered, avoiding infinite loops.

For fun, here’s a screen cap of what I’d like to call the “Muhahahahaha moment” — when the CPU gets bombarded with sort processes, saturating all cores:

(see video version of this , you’ll need to skip to 00:36s mark)

|

What other people have to say:

what command do you use to show the last screen? is it some special option for top?

It’s an alternative process viewer called htop — I highly recommend it!

Man page said “sort -m” does not do sort; if so, the merged results are not sorted.

If you fire off more processes than cores, don’t you get a lot of swapping (certainly each time sort does i/o)?

I just used your trick (which was neat!), but split my files into 4 pieces because I have 4 cores and didn’t want to swap; still get 100% across the board, of course!

@anon 1: Indeed, “sort -m” does not sort, it merges already sorted results. Since we start off with sorted results (from the previous sort), the merged results are thus also sorted. You can verify this by running a “sort -c”.

@anon 2: Yes, you don’t want too many processes. The ideal number of processes vs number of cores relationship isn’t always 1:1, so the best bet is to try it a few times and pick the fastest config.

About the author:

Arnab Nandi is an Assistant Professor in the Department of Computer Science and Engineering at The Ohio State University. You can read more about him here.


August 2002 : 9 posts September 2002 : 16 posts October 2002 : 7 posts November 2002 : 21 posts December 2002 : 25 posts January 2003 : 8 posts February 2003 : 11 posts March 2003 : 7 posts April 2003 : 21 posts May 2003 : 14 posts June 2003 : 15 posts July 2003 : 4 posts August 2003 : 16 posts September 2003 : 25 posts October 2003 : 15 posts November 2003 : 24 posts December 2003 : 17 posts January 2004 : 6 posts February 2004 : 8 posts March 2004 : 6 posts April 2004 : 5 posts May 2004 : 29 posts June 2004 : 3 posts July 2004 : 17 posts August 2004 : 19 posts September 2004 : 3 posts October 2004 : 4 posts December 2004 : 1 posts February 2005 : 14 posts March 2005 : 17 posts April 2005 : 8 posts May 2005 : 27 posts June 2005 : 73 posts July 2005 : 44 posts August 2005 : 13 posts September 2005 : 3 posts October 2005 : 9 posts November 2005 : 20 posts December 2005 : 6 posts January 2006 : 25 posts February 2006 : 23 posts March 2006 : 36 posts April 2006 : 35 posts May 2006 : 7 posts June 2006 : 22 posts July 2006 : 20 posts August 2006 : 27 posts September 2006 : 15 posts October 2006 : 6 posts November 2006 : 19 posts December 2006 : 3 posts January 2007 : 4 posts February 2007 : 1 posts March 2007 : 3 posts May 2007 : 5 posts June 2007 : 2 posts July 2007 : 1 posts August 2007 : 13 posts September 2007 : 1 posts October 2007 : 21 posts November 2007 : 7 posts December 2007 : 9 posts January 2008 : 4 posts February 2008 : 13 posts March 2008 : 14 posts April 2008 : 11 posts May 2008 : 12 posts June 2008 : 12 posts July 2008 : 5 posts August 2008 : 10 posts September 2008 : 11 posts October 2008 : 10 posts November 2008 : 8 posts December 2008 : 4 posts January 2009 : 6 posts February 2009 : 13 posts March 2009 : 7 posts April 2009 : 7 posts May 2009 : 2 posts June 2009 : 3 posts July 2009 : 4 posts August 2009 : 4 posts September 2009 : 6 posts October 2009 : 4 posts November 2009 : 7 posts December 2009 : 10 posts January 2010 : 3 posts February 2010 : 2 posts April 2010 : 5 posts May 2010 : 1 posts July 2010 : 4 posts August 2010 : 3 posts September 2010 : 4 posts October 2010 : 1 posts November 2010 : 2 posts December 2010 : 3 posts June 2011 : 1 posts August 2011 : 1 posts November 2011 : 1 posts December 2011 : 1 posts February 2012 : 1 posts May 2012 : 2 posts December 2012 : 1 posts June 2013 : 1 posts August 2013 : 1 posts October 2013 : 2 posts September 2014 : 1 posts November 2014 : 1 posts November 2015 : 2 posts January 2016 : 1 posts January 2017 : 1 posts April 2017 : 2 posts