I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.
I'm not sure whether Sawzall allows UDFs, and Pig lets you link in .jar files and call them from the language. But the Microsoft SCOPE implementation is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written out in C# right under the SQL -- no pre-compiling / including necessary.
Here's how simple SCOPE is. Note the #CS / #ENDCS code block that contains the C#:
R1 = SELECT A+C AS ac, B.Trim() AS B1
     FROM R
     WHERE StringOccurs(C, "xyz") > 2;

#CS
public static int StringOccurs(string str, string ptrn)
{
    int cnt = 0;
    int pos = -1;
    while (pos + 1 < str.Length)
    {
        pos = str.IndexOf(ptrn, pos + 1);
        if (pos < 0) break;
        cnt++;
    }
    return cnt;
}
#ENDCS
Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?
Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:
-- Script calculates average length of queries at each hour of the day

raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
      AS (user:chararray, time:chararray, query:chararray);

houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;

hour_group = GROUP houred BY hour;

hour_frequency = FOREACH hour_group GENERATE group as hour,
                 baconsnake.AvgLength($1.query) as count;

DUMP hour_frequency;

-- The excite query log timestamp format is YYMMDDHHMMSS
-- This function extracts the hour, HH
def ExtractHour(timestamp):
    return timestamp[6:8]

-- Returns average length of query in a bag
def AvgLength(grp):
    sum = 0
    for item in grp:
        if len(item) > 0:
            sum = sum + len(item[0])
    return str(sum / len(grp))
Everything in this file is normal Pig, except the Python definitions and the calls to them.
It's actually pretty simple under the hood. BaconSnake generates a wrapper Pig UDF that takes the Python source as input along with the parameters. Jython 2.5 is used to embed the Python runtime into Pig and call the functions.
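To picture what the wrapper does, here's a minimal sketch in plain Python: compile the UDF source once, look the function up by name, then call it per tuple. (This is an illustration only -- the real wrapper is a Java Pig UDF driving Jython's PythonInterpreter; `make_udf` and the use of CPython's `exec` are stand-ins.)

```python
# Illustrative sketch: turn UDF source text into a compiled, callable
# function once, then invoke it per tuple. BaconSnake does the equivalent
# from Java via Jython's PythonInterpreter; exec() stands in for it here.

UDF_SOURCE = """
def ExtractHour(timestamp):
    return timestamp[6:8]
"""

def make_udf(source, name):
    namespace = {}
    exec(source, namespace)   # compile and run the def once
    return namespace[name]    # look the compiled function up by name

extract_hour = make_udf(UDF_SOURCE, "ExtractHour")
print(extract_hour("970916083045"))  # YYMMDDHHMMSS -> prints "08"
```

The point of compiling once and reusing the function object is that the per-tuple cost is just a function call, not a re-parse of the source.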
Using this is easy: you convert the nice-looking "baconsnake" file above (the .bs file :P) into regular Pig Latin and run it, like so:
cat scripts/histogram.bs | python scripts/bs2pig.py > scripts/histogram.pig
java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig
Behind the scenes, the BaconSnake Python preprocessor script pulls in the Jython runtime and BaconSnake's wrappers, and emits valid Pig Latin which can then be run on Hadoop or locally.
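The shape of that preprocessing step can be sketched roughly as follows. (This is not the real bs2pig.py -- the wrapper class name `org.baconsnake.PyEval` is made up for illustration, and the actual emitted Pig Latin differs; the sketch just shows the idea of splitting out the Python defs and rewriting each `baconsnake.Name(...)` call to pass the function's source to a wrapper UDF.)

```python
import re

# Hypothetical wrapper UDF class name -- the real one differs.
WRAPPER = "org.baconsnake.PyEval"

def bs2pig(bs_text):
    """Illustrative sketch of a .bs -> .pig transformation."""
    # Treat everything from the first top-level "def " as Python source.
    m = re.search(r"^def ", bs_text, flags=re.MULTILINE)
    if m is None:
        return bs_text  # no Python defs, already plain Pig
    pig_part, py_part = bs_text[:m.start()], bs_text[m.start():]

    # Index each function's source by its name.
    funcs = {re.match(r"def (\w+)", f).group(1): f
             for f in re.split(r"\n(?=def )", py_part)}

    # Rewrite baconsnake.Name(args) -> WRAPPER('<source>', args).
    def rewrite(match):
        src = funcs[match.group(1)].strip().replace("'", "\\'")
        return "%s('%s', " % (WRAPPER, src)

    return re.sub(r"baconsnake\.(\w+)\(", rewrite, pig_part)
```

In other words, the preprocessor's job is purely textual: the Python never runs at preprocessing time; it is shipped as a string and compiled by Jython inside the wrapper UDF at query time.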
Important notes: this is PURELY a proof-of-concept, written only for entertainment purposes. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mapper) and DataBag-to-String (Reducer) functions are supported -- you're welcome to extend it to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested in extending it!
Go check out BaconSnake at Google Code!
Update: My roommate Eytan convinced me to waste another hour of my time and include support for DataBags, which are exposed as Python lists. I've updated the relevant text and code.
Update (Apr 2010): Looks like BaconSnake's spirit is slowly slithering into Pig Core! Also some attention from the Hive parallel universe.
What other people have to say:
I received some comments from friends comparing BaconSnake with Pig Streaming. I think the core differences are threefold. First, BaconSnake works by generating a compiled Jython function and calling it for each input, as opposed to piping data through a single stream. Second, you can perform correct per-tuple iterations with multiple functions, e.g. "FOREACH A GENERATE baconsnake.FirstFunc(q), baconsnake.SecondFunc(q);". Third, your functions are called inside the same runtime, so not only are they (arguably) more efficient, they also have access to all of the Java codebase that you included using "register .jar".
Good job Arnab, I like it.
— amr
This is awesome, Arnab. Thanks for putting this together and publishing it!