Java

BaconSnake to core: Pig 0.8 released!

My contribution to Python UDFs in Pig has finally released as part of the new and shiny 0.8 release! I’ve been meaning to blog about how this came about when I had time, but Dmitriy saved me the work, so I’ll just quote him instead:

This is the outcome of PIG-928; it was quite a pleasure to watch this develop over time — while most Pig tickets wind up getting worked on by at most one or two people, this turned into a collaboration of quite a few developers, many of them new to the project — Kishore Gopalakrishna’s patch was the initial conversation starter, which was then hacked on or merged into similar work by Woody Anderson, Arnab Nandi, Julien Le Dem, Ashutosh Chauhan and Aniket Mokashi (Aniket deserves an extra shout-out for patiently working to incorporate everyone’s feedback and pushing the patch through the last mile).

Yay Open Source! Documentation is available on the Pig website, but here’s a one-line example of what you can now do:

register 'test.py' using jython as myfuncs;

Notice that we’re using a python script instead of a jar. All python functions in test.py become available as UDFs to Pig!

Going forward, there are two things I didn’t get to do, which I’d love any of my dear readers to pick up on. First, we’re still not at par with SCOPE’s ease-of-use since UDFs can’t be inlined yet (a la BaconSnake). This is currently open as a separate Apache JIRA issue, would be great to see patches here!

The second feature that I really wanted to make part of this patch submission, and didn’t have time for, was Java source UDF support as a JaninoScriptEngine. Currently, Pig users have to bundle their UDF code as jars, which quickly becomes painful. The goal of the ScriptEngine architecture is to let multiple languages be implemented and plugged in, so I really hope someone takes the time to look at the Janino project, which will allow Java source UDFs with zero performance penalty. This would also make it the first dynamically compiled query system over Hadoop (as far as I know), and opens up a world of query optimization work to explore.

BaconSnake: Inlined Python UDFs for Pig

I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.

I'm not aware if Sawzall allows UDFs, and Pig allows you to link any .jar files and call them from the language. But the Microsoft SCOPE implementation is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written out in C# right under the SQL -- no pre-compiling / including necessary.

Here's how simple SCOPE is. Note the #CS / #ENDCS codeblock that contains the C#:

R1 = SELECT A+C AS ac, B.Trim() AS B1 FROM R WHERE StringOccurs(C, “xyz”) > 2 

#CS 
public static int StringOccurs(string str, string ptrn) {
   int cnt=0; 
   int pos=-1; 
   while (pos+1 < str.Length) {
        pos = str.IndexOf(ptrn, pos+1) ;
        if (pos < 0) break; cnt++; 
   } return cnt;
}
#ENDCS

Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?

Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:

-- Script calculates average length of queries at each hour of the day

raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
           AS (user:chararray, time:chararray, query:chararray);

houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;

hour_group = GROUP houred BY hour;

hour_frequency = FOREACH hour_group 
                           GENERATE group as hour,
                                    baconsnake.AvgLength($1.query) as count;

DUMP hour_frequency;

-- The excite query log timestamp format is YYMMDDHHMMSS
-- This function extracts the hour, HH
def ExtractHour(timestamp):
	return timestamp[6:8]

-- Returns average length of query in a bag
def AvgLength(grp):
	sum = 0
	for item in grp:
		if len(item) > 0:
			sum = sum + len(item[0])	
	return str(sum / len(grp))

Everything in this file in normal Pig, except the highlighted parts -- they're Python definitions and calls.

It's pretty simple under the hood actually. BaconSnake creates a wrapper function using the Pig UDFs, that takes python source as input along with the parameter. Jython 2.5 is used to embed the Python runtime into Pig and call the functions.

Using this is easy, you basically convert the nice-looking "baconsnake" file above ( the .bs file :P ) and run it like so:

cat scripts/histogram.bs | python scripts/bs2pig.py > scripts/histogram.pig
java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig

Behind the scenes, the BaconSnake python preprocessor script includes the jython runtime and baconsnake's wrappers and emits valid Pig Latin which can then be run on Hadoop or locally.

Important Notes: Note that this is PURELY a proof-of-concept written only for entertainment purposes. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mappers) and DataBag-to-String (Reducers) functions are supported -- you're welcome to extend this to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested and would like to extend it!

Go checkout BaconSnake at Google Code!

Update: My roommate Eytan convinced me to waste another hour of my time and include support for Databags, which are exposed as Python lists. I've updated the relevant text and code.

Update (Apr 2010): Looks like BaconSnake's spirit is slowly slithering into Pig Core! Also some attention from the Hive parallel universe.

Making Programs Talk to each other using XML-RPC

The Internet has changed the way applications are built today. In the last few years, we have seen a sudden burst in Internet software – Instant Messengers, Online Gaming, etc; all based on the client-server architecture. With all this client server technology, we also have to ensure compatibility between languages, and operating systems. XML-RPC is one way to do this.