BaconSnake to core: Pig 0.8 released!

My contribution to Python UDFs in Pig has finally been released as part of the new and shiny 0.8 release! I’ve been meaning to blog about how this came about when I had time, but Dmitriy saved me the work, so I’ll just quote him instead:

This is the outcome of PIG-928; it was quite a pleasure to watch this develop over time — while most Pig tickets wind up getting worked on by at most one or two people, this turned into a collaboration of quite a few developers, many of them new to the project — Kishore Gopalakrishna’s patch was the initial conversation starter, which was then hacked on or merged into similar work by Woody Anderson, Arnab Nandi, Julien Le Dem, Ashutosh Chauhan and Aniket Mokashi (Aniket deserves an extra shout-out for patiently working to incorporate everyone’s feedback and pushing the patch through the last mile).

Yay Open Source! Documentation is available on the Pig website, but here’s a one-line example of what you can now do:

register '' using jython as myfuncs;

Notice that we’re using a Python script instead of a jar. All Python functions in that script become available as UDFs to Pig!
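
For illustration, a script registered this way might look like the sketch below. The function names and schema strings are made up for this example and are not from the original script. As far as I know, Pig's Jython engine supplies the outputSchema decorator when the script is registered; the fallback definition is only an assumption that lets the file run under plain Python as well.

```python
# Hypothetical contents of a script registered with "using jython as myfuncs".
# Pig's Jython engine is expected to supply outputSchema; the fallback below
# is an assumption that lets the file also run under plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema  # keep the schema string on the function
            return func
        return decorate


@outputSchema("hour:chararray")
def extract_hour(timestamp):
    # Timestamps are assumed to look like YYMMDDHHMMSS; return the HH part.
    return timestamp[6:8]


@outputSchema("length:int")
def query_length(query):
    # Treat a missing (null) query as length 0.
    if query is None:
        return 0
    return len(query)
```

From a Pig script these would then be called as myfuncs.extract_hour(time) and myfuncs.query_length(query).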

Going forward, there are two things I didn’t get to do, which I’d love any of my dear readers to pick up on. First, we’re still not on par with SCOPE’s ease of use, since UDFs can’t be inlined yet (a la BaconSnake). This is currently open as a separate Apache JIRA issue, and it would be great to see patches there!

The second feature that I really wanted to make part of this patch submission, and didn’t have time for, was Java source UDF support as a JaninoScriptEngine. Currently, Pig users have to bundle their UDF code as jars, which quickly becomes painful. The goal of the ScriptEngine architecture is to let multiple languages be implemented and plugged in, so I really hope someone takes the time to look at the Janino project, which would allow Java source UDFs with zero performance penalty. This would also make it the first dynamically compiled query system over Hadoop (as far as I know), and would open up a world of query optimization work to explore.

BaconSnake: Inlined Python UDFs for Pig

I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.

I'm not sure whether Sawzall allows UDFs, and Pig only lets you link .jar files and call them from the language. The Microsoft SCOPE implementation, by contrast, is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written out in C# right under the SQL -- no pre-compiling or including necessary.

Here's how simple SCOPE is. Note the #CS / #ENDCS codeblock that contains the C#:

R1 = SELECT A+C AS ac, B.Trim() AS B1 FROM R WHERE StringOccurs(C, "xyz") > 2

#CS
public static int StringOccurs(string str, string ptrn) {
   // Count the occurrences of ptrn in str
   int cnt = 0;
   int pos = -1;
   while (pos + 1 < str.Length) {
        pos = str.IndexOf(ptrn, pos + 1);
        if (pos < 0) break;
        cnt++;
   }
   return cnt;
}
#ENDCS

Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?

Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:

-- Script calculates average length of queries at each hour of the day

raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
           AS (user:chararray, time:chararray, query:chararray);

houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;

hour_group = GROUP houred BY hour;

hour_frequency = FOREACH hour_group 
                           GENERATE group as hour,
                                    baconsnake.AvgLength($1.query) as count;

DUMP hour_frequency;

-- The excite query log timestamp format is YYMMDDHHMMSS
-- This function extracts the hour, HH
def ExtractHour(timestamp):
	return timestamp[6:8]

-- Returns average length of query in a bag
def AvgLength(grp):
	total = 0
	for item in grp:
		if len(item) > 0:
			total = total + len(item[0])
	return str(total / len(grp))

Everything in this file is normal Pig except the baconsnake.* calls and the def blocks at the bottom -- those are Python calls and definitions.

It's actually pretty simple under the hood. BaconSnake creates a wrapper function, written as a regular Pig UDF, that takes the Python source as input along with the parameter. Jython 2.5 embeds the Python runtime in Pig and calls the functions.
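
To make the wrapper idea concrete, here is a minimal plain-Python sketch. It is not BaconSnake's actual code, which runs Jython inside a Java UDF. The helper name call_python_udf is made up for illustration, and // is used so that the division behaves under Python 3 the way / did on integers under Jython 2.5.

```python
def call_python_udf(source, func_name, argument):
    """Run `source`, then call the function named `func_name` on `argument`."""
    namespace = {}
    exec(source, namespace)  # define the UDFs from their source text
    return namespace[func_name](argument)


# The two UDFs from the script above, passed around as plain source text.
UDF_SOURCE = '''
def ExtractHour(timestamp):
    return timestamp[6:8]

def AvgLength(grp):
    total = 0
    for item in grp:
        if len(item) > 0:
            total = total + len(item[0])
    return str(total // len(grp))
'''

# A String-to-String call (a "Mapper") and a bag-to-String call (a "Reducer").
# The bag is modelled as a list of one-field tuples, which is an assumption.
hour = call_python_udf(UDF_SOURCE, "ExtractHour", "970916072134")
avg = call_python_udf(UDF_SOURCE, "AvgLength", [("pig",), ("hello",)])
```

Re-running the source on every call keeps the sketch short; a real wrapper would presumably compile it once and reuse the resulting functions.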

Using this is easy: you convert the nice-looking "baconsnake" file above (the .bs file :P) to Pig Latin and run it like so:

cat scripts/ | python scripts/ > scripts/histogram.pig
java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig

Behind the scenes, the BaconSnake Python preprocessor script includes the Jython runtime and BaconSnake's wrappers, and emits valid Pig Latin which can then be run on Hadoop or locally.
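
The first step of such a preprocessor, separating the Python definitions from the Pig statements, could look roughly like the sketch below. This is an assumption about the approach, not the actual BaconSnake preprocessor: split_bs is a hypothetical helper, and the step that emits Pig Latin referencing the Jython runtime and wrappers is not reproduced here.

```python
def split_bs(text):
    """Split a .bs script into (pig_lines, python_lines)."""
    pig_lines, python_lines = [], []
    in_def = False
    for line in text.splitlines():
        if line.startswith("def "):
            in_def = True   # a Python function starts here
        elif in_def and line.strip() and not line[0].isspace():
            in_def = False  # unindented text ends the function body
        if in_def:
            python_lines.append(line)
        else:
            pig_lines.append(line)
    return pig_lines, python_lines


SAMPLE = """houred = FOREACH raw GENERATE baconsnake.ExtractHour(time) as hour;
DUMP houred;

-- extracts the hour
def ExtractHour(timestamp):
    return timestamp[6:8]
"""

pig, py = split_bs(SAMPLE)
```

Here py ends up holding the two lines of ExtractHour, while the Pig statements and the -- comment stay in pig.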

Important Notes: This is PURELY a proof of concept, written only for entertainment purposes. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mappers) and DataBag-to-String (Reducers) functions are supported -- you're welcome to extend this to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested and would like to extend it!

Go check out BaconSnake at Google Code!

Update: My roommate Eytan convinced me to waste another hour of my time and include support for DataBags, which are exposed as Python lists. I've updated the relevant text and code.

Update (Apr 2010): Looks like BaconSnake's spirit is slowly slithering into Pig Core! Also some attention from the Hive parallel universe.

bags, balls and boyfriends

Links today brought to you by Red Bull™, my abusive friend in a can pushing me through a rather crazy day.

  • Lego Schoolbag : If I were a 10-year-old girl, I would give away my younger brother for this one.
  • Every expression in this picture is priceless. I like how our hero has resigned himself to prayer.
  • A hilarious sketch from Snuff Box, starring Matt Berry, who also stars in the hilarious britcom The IT Crowd.
  • Adobe is opening up the SWF and FLV formats with the Open Screen project. (No sir, this is not about Single White Females or Fine Looking Virgins.) Flash has been sort of open for a while, with projects like SWFTools and GNash, but this takes things to a whole new level, with a slew of bigwig corporate backers. Flash and FLV have been, in my opinion, the critical enablers of the online video revolution, and this is definitely a great step ahead. I’m curious to know what Microsoft’s Silverlight team is thinking, as well as the folks at Sun (who just opened up all of Java). And of course, let’s not forget the Android folks, who have a very pretty stack, but tacking on some Flash magic would definitely be a very big deal. Considering the significant overlap between supporters of the Adobe effort and the Google effort, this is going to be fun to watch.

Attacked by the Lucene Bugzilla

Lucene’s Bugzilla was apparently migrated over the weekend, resulting in hundreds of emails being sent to the lucene-developer mailing list, all of which were totally useless and a royal pain to delete using a web-based interface (I was using Gmail). Considering there are at least a hundred people on that mailing list (I guess), multiply that by close to a thousand emails, and you have a lot of useless email that could have been avoided if someone had just turned off email notification before the bulk updates / migration.

Also, in case you haven’t noticed, Lucene has now become a top-level Apache project, putting it in the same league as the HTTP Server, SpamAssassin, Jakarta, and Struts. The Lucene project now comprises Lucene Java, Nutch, and Lucene4c. I’m really looking forward to updates on Lucene4c; the webpages seem to be comatose.


Making Programs Talk to each other using XML-RPC

The Internet has changed the way applications are built today. In the last few years, we have seen a sudden burst in Internet software (instant messengers, online gaming, etc.), all based on the client-server architecture. With all this client-server technology, we also have to ensure compatibility between languages and operating systems. XML-RPC is one way to do this.
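
As a small taste of what that looks like, the sketch below uses Python's standard xmlrpc.client module to encode a call as XML and decode it again, with no network involved. The method name add and its arguments are made up for illustration.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote method "add" with two integer arguments.
request_xml = xmlrpc.client.dumps((2, 3), methodname="add")

# The payload is plain XML, so an XML-RPC library in any language or on any
# operating system can decode it. Here we simply decode it again in Python.
params, method = xmlrpc.client.loads(request_xml)
```

The language neutrality comes from the wire format: both sides only need to agree on the XML, not on each other's implementation language.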


A Coder in Courierland:

Once upon a time, I was a coder not unlike yourself. My day consisted of coffee, perl and java hacking, meetings, and e-mail. I had a cubicle with fluorescent lighting, my own bookshelf and two computers. And I traded it all in.

Even before Office Space, white collar workers peered out the window (if they were so lucky) and imagined a more romantic life doing real work out under the sun.

Well, having no children, no great career ambition and no financial obligations more pressing than a crippling student loan, a year and a half ago, I decided to live this dream.
