I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.
I'm not sure whether Sawzall allows UDFs, and Pig lets you link in .jar files and call them from the language. But the Microsoft SCOPE implementation is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written out in C# right under the SQL -- no pre-compiling / including necessary.
Here's how simple SCOPE is. Note the #CS / #ENDCS code block that contains the C#:
R1 = SELECT A+C AS ac, B.Trim() AS B1
     FROM R
     WHERE StringOccurs(C, "xyz") > 2;

#CS
public static int StringOccurs(string str, string ptrn)
{
    int cnt = 0;
    int pos = -1;
    while (pos + 1 < str.Length)
    {
        pos = str.IndexOf(ptrn, pos + 1);
        if (pos < 0) break;
        cnt++;
    }
    return cnt;
}
#ENDCS
Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?
Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:
-- Script calculates average length of queries at each hour of the day

raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
      AS (user:chararray, time:chararray, query:chararray);

houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;

hour_group = GROUP houred BY hour;

hour_frequency = FOREACH hour_group GENERATE group as hour,
                 baconsnake.AvgLength($1.query) as count;

DUMP hour_frequency;

-- The excite query log timestamp format is YYMMDDHHMMSS
-- This function extracts the hour, HH
def ExtractHour(timestamp):
    return timestamp[6:8]

-- Returns average length of query in a bag
def AvgLength(grp):
    sum = 0
    for item in grp:
        if len(item) > 0:
            sum = sum + len(item[0])
    return str(sum / len(grp))
Everything in this file is normal Pig, except the Python definitions and the calls to them.
It's actually pretty simple under the hood. BaconSnake generates a wrapper Pig UDF that takes the Python source as input along with the parameters. Jython 2.5 is used to embed the Python runtime into Pig and call the functions.
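To picture what the wrapper does, here's a minimal sketch in plain Python: compile the UDF source once, look the function up by name, then call it per tuple. (This is an illustration only -- the real wrapper is a Java Pig UDF driving Jython's PythonInterpreter; `make_udf` and the use of CPython's `exec` are stand-ins.)

```python
# Illustrative sketch: turn UDF source text into a compiled, callable
# function once, then invoke it per tuple. BaconSnake does the equivalent
# from Java via Jython's PythonInterpreter; exec() stands in for it here.

UDF_SOURCE = """
def ExtractHour(timestamp):
    return timestamp[6:8]
"""

def make_udf(source, name):
    namespace = {}
    exec(source, namespace)   # compile and run the def once
    return namespace[name]    # look the compiled function up by name

extract_hour = make_udf(UDF_SOURCE, "ExtractHour")
print(extract_hour("970916083045"))  # YYMMDDHHMMSS -> prints "08"
```

The point of compiling once and reusing the function object is that the per-tuple cost is just a function call, not a re-parse of the source.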
Using this is easy: you convert the nice-looking "baconsnake" file above (the .bs file :P) into regular Pig Latin and run it, like so:
cat scripts/histogram.bs | python scripts/bs2pig.py > scripts/histogram.pig
java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig
Behind the scenes, the BaconSnake Python preprocessor script pulls in the Jython runtime and BaconSnake's wrappers, and emits valid Pig Latin which can then be run on Hadoop or locally.
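The shape of that preprocessing step can be sketched roughly as follows. (This is not the real bs2pig.py -- the wrapper class name `org.baconsnake.PyEval` is made up for illustration, and the actual emitted Pig Latin differs; the sketch just shows the idea of splitting out the Python defs and rewriting each `baconsnake.Name(...)` call to pass the function's source to a wrapper UDF.)

```python
import re

# Hypothetical wrapper UDF class name -- the real one differs.
WRAPPER = "org.baconsnake.PyEval"

def bs2pig(bs_text):
    """Illustrative sketch of a .bs -> .pig transformation."""
    # Treat everything from the first top-level "def " as Python source.
    m = re.search(r"^def ", bs_text, flags=re.MULTILINE)
    if m is None:
        return bs_text  # no Python defs, already plain Pig
    pig_part, py_part = bs_text[:m.start()], bs_text[m.start():]

    # Index each function's source by its name.
    funcs = {re.match(r"def (\w+)", f).group(1): f
             for f in re.split(r"\n(?=def )", py_part)}

    # Rewrite baconsnake.Name(args) -> WRAPPER('<source>', args).
    def rewrite(match):
        src = funcs[match.group(1)].strip().replace("'", "\\'")
        return "%s('%s', " % (WRAPPER, src)

    return re.sub(r"baconsnake\.(\w+)\(", rewrite, pig_part)
```

In other words, the preprocessor's job is purely textual: the Python never runs at preprocessing time; it is shipped as a string and compiled by Jython inside the wrapper UDF at query time.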
Important notes: this is PURELY a proof-of-concept, written only for entertainment purposes. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mapper) and DataBag-to-String (Reducer) functions are supported -- you're welcome to extend it to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested in extending it!
Go check out BaconSnake at Google Code!
Update: My roommate Eytan convinced me to waste another hour of my time and include support for DataBags, which are exposed as Python lists. I've updated the relevant text and code.
Update (Apr 2010): Looks like BaconSnake's spirit is slowly slithering into Pig Core! Also some attention from the Hive parallel universe.
What other people have to say:
I received some comments from friends comparing BaconSnake with Pig Streaming. I think the core differences are threefold. First, BaconSnake works by generating a compiled Jython function and calling it for each input, as opposed to piping data through a single stream. Second, you can perform correct per-tuple iterations with multiple functions, e.g. "FOREACH A GENERATE baconsnake.FirstFunc(q), baconsnake.SecondFunc(q);". Third, your functions are called inside the same runtime, so not only are they (arguably) more efficient, they also have access to all of the Java codebase that you included using "register .jar".
Good job Arnab, I like it.
— amr
This is awesome, Arnab. Thanks for putting this together and publishing it!