Major

BaconSnake: Inlined Python UDFs for Pig

I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.

I'm not aware if Sawzall allows UDFs, and Pig allows you to link any .jar files and call them from the language. But the Microsoft SCOPE implementation is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written out in C# right under the SQL -- no pre-compiling / including necessary.

Here's how simple SCOPE is. Note the #CS / #ENDCS codeblock that contains the C#:

R1 = SELECT A+C AS ac, B.Trim() AS B1 FROM R WHERE StringOccurs(C, “xyz”) > 2 

#CS 
public static int StringOccurs(string str, string ptrn) {
   int cnt=0; 
   int pos=-1; 
   while (pos+1 < str.Length) {
        pos = str.IndexOf(ptrn, pos+1) ;
        if (pos < 0) break; cnt++; 
   } return cnt;
}
#ENDCS

Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?

Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:

-- Script calculates average length of queries at each hour of the day

raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
           AS (user:chararray, time:chararray, query:chararray);

houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;

hour_group = GROUP houred BY hour;

hour_frequency = FOREACH hour_group 
                           GENERATE group as hour,
                                    baconsnake.AvgLength($1.query) as count;

DUMP hour_frequency;

-- The excite query log timestamp format is YYMMDDHHMMSS
-- This function extracts the hour, HH
def ExtractHour(timestamp):
	return timestamp[6:8]

-- Returns average length of query in a bag
def AvgLength(grp):
	sum = 0
	for item in grp:
		if len(item) > 0:
			sum = sum + len(item[0])	
	return str(sum / len(grp))

Everything in this file in normal Pig, except the highlighted parts -- they're Python definitions and calls.

It's pretty simple under the hood actually. BaconSnake creates a wrapper function using the Pig UDFs, that takes python source as input along with the parameter. Jython 2.5 is used to embed the Python runtime into Pig and call the functions.

Using this is easy, you basically convert the nice-looking "baconsnake" file above ( the .bs file :P ) and run it like so:

cat scripts/histogram.bs | python scripts/bs2pig.py > scripts/histogram.pig
java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig

Behind the scenes, the BaconSnake python preprocessor script includes the jython runtime and baconsnake's wrappers and emits valid Pig Latin which can then be run on Hadoop or locally.

Important Notes: Note that this is PURELY a proof-of-concept written only for entertainment purposes. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mappers) and DataBag-to-String (Reducers) functions are supported -- you're welcome to extend this to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested and would like to extend it!

Go checkout BaconSnake at Google Code!

Update: My roommate Eytan convinced me to waste another hour of my time and include support for Databags, which are exposed as Python lists. I've updated the relevant text and code.

y!Vmail - voice mail for your Yahoo! Mail

Yesterday Dan, Pradeep and I presented “y!Vmail: voicemail for your Yahoo! Mail” at the Yahoo! University Hack Day Contest, winning the award for the 2nd best Hack! (jump to the demo video )


Our team with judges Paul Tarjan and Rasmus Lerdorf

The adventure started when I heard about Yahoo!‘s Hack U event:

Join Yahoo! web experts including Rasmus Lerdorf, the creator of PHP, for a week of learning, hacking and fun! You’ll hear interesting tech talks, hacking tips and lessons, and get hands-on coding workshops where you’ll work with cutting-edge technology. The week’s events will culminate with our University Hack Day competition—a day-long festival of coding, camaraderie, demos, awards, food, music and jollity (it’s a real word, look it up).

Years ago when I was in my teens, I was an avid participant on the school / college tech fest circuit. Almost every major institution in and around Delhi would organize annual technical festivals, hosting programming contests and software demo competitions. This was where I got a chance to showcase my creations and meet other hackers. Winning these events became a good way for me to pay off those telephone bills — web development in the dial-up age was an expensive hobby!

I decided to enter the Hack Day contest just for fun; it had been a while since I participated in one of these. It wasn’t about winning this time; I just wanted to do the whole “idea to execution to demo” thing with a group of friends, and spend hours screaming at each other over STUPID hard-to-find bugs that are actually staring at you in the face, high-fiving every hour as a feature milestone was scratched off the todo-list. The reward: to be able to stand in front of a group of people and say “Hey guys, look what I made!.” (If it’s hard to appreciate what this feels like, this video might help.)


Yahoo! gave away a bunch of t-shirts, this was on one of them

3 days before the Hack Day, I had an idea about building a phone-based interface for email. The idea was simple enough to build in a day, but fun enough to make an enjoyable demo. The only problem: I was already in the midst of a “hack” daymonth of my own; VLDB was due 3 hours before the start of the Hack Day, and I was already sacrificing sleep for LaTeX and Python for more than a week. There was no way I was going to be able to do this alone. Enter fellow grad students Dan and Pradeep. I told them about the contest and my idea. While they are both expert hackers, I totally forgot about the fact that people in Operating Systems research don’t really do a lot of Web Programming: “PHP....? I’ve never…” said Dan. I pointed them to the Yahoo Developer Network site and returned to my research paper writing madness. Hopefully by Friday evening, I would have a web-savvy hack team.

On Friday, I took a quick nap after my paper deadline, and walked over to the Hack Fest area to meet my team (who had become PHP and telephony wizards by now) and load up on caffeine and sugar that the Yahoo! folks had set up for us.


They even had my favorite candy !

We split the work into two parts; Dan would build the phone interface while Pradeep and I would figure out the email and contacts API to write an email client backend. 7 hours later, we had the first version of our product up and running. We could call in and read emails. Happy with our progress, we decided that it would be wiser to go home and show up early next day. We ended up wasting a few hours the next morning worrying about the presentation: the lecture hall had spotty cellphone coverage, a deal-killer for a phone demo! Pradeep made a breakthrough here, discovering that an obscure panel on the wall was actually a secret speakerphone. Having resolved demo issues, we resumed coding and plugged in the remaining features: navigating through emails, email summarization, and email prioritization. The friendly timestamps feature (“4 minutes ago”) was stolen from my blog’s code (i.e. the Status header of this blog).

Around 3:30pm on Saturday, we updated our hackday entry:

y!Vmail

by Arnab Nandi, Daniel Peek, Pradeep Padala

“Not everyone has a computer, but everyone has a phone.”

This hack allows people to access their Yahoo! mail through a 1-800 number, using ANY touch-tone phone.
Press 0 to open, * and # to navigate, 7 to delete. We figure out which emails are important, and read them first. We summarize long emails so that you dont have to listen to all of it. If you want to talk to the person, just press 5 — we’ll connect you.

APIs used: BBAuth, OpenMail, Contacts API, Term Extraction API

Hack presentations started at 4:00pm on Saturday. I started with a 20-second powerpoint pitch, followed by a rather entertaining demo. Using the lecture hall’s speakerphone we had the lecture hall call our service. Entering the correct PIN logged me in, which resulted in an entire roomful of people were now hearing the words “Welcome to y!Vmail. You have 5 new emails…”


Me pushing numbers on the phone


Here’s a short video walk through of our app:

More details at http://yvmail.info

A few minutes after the presentation ended, the prizes were announced. We ranked second. The winning hack was Brandon Kwaselow’s “Points of WOE”; a native iPhone app that allowed browsing and creation of placemarks on Yahoo! Maps. Congratulations, Brandon!

Overall, this was a very exciting and enjoyable event; I had a rocking good time hanging out with the Yahoo! folks and getting a cool project out the door with around 15 hours of work. I end with some lessons, acquired over years of doing demo contests:

  • Be creative, but avoid feature creep.
  • Split up into sub-teams, but make sure you’re pair programming most of the time.
  • Get Version 0 done Super Super Early. Then polish, polish, polish.
  • Reuse (with attribution) as much code as you can.
  • Take lots of breaks, make friends, and have fun.

Image credits: Rasmus, Erik
Shout outs: Folks at Twilio for making the coolest telephony API in the universe!

building simple systems

XML co-inventor Tim Bray talks about over-designing systems:

Programmers experience soaring joy when they can rip through code deleting functions and declarations, screens-full into the bit bucket, with the steady drumbeat of tests-fail-then-pass.

So maybe I didn’t build one to throw away, but I built one that needed major amputations out of the box.

He concludes with a quote from Fred Brook’s The Mythical Man Month, one that I strongly believe in:

“Where a new system concept or new technology is used, one has to build a system to throw away, for even the best planning is not so omniscient as to get it right the first time. Hence plan to throw one away; you will, anyhow.”

|

practical knowledge

The Compiler Design + Multimedia Applications practical exam went off without any major mishaps. Minor mishaps: The keyboard of the computer behind me got stuck with the chair, and flew off the desk when I turned. I had to do a clumsy dive-under-the-table action to save it. The external examiner gave me a weird look.

Then there was the multimedia program whose menus *just don't work*. Just one of those things that happen to perfect programs in crisis situations. I know my code, and I'm dead sure it was perfect, but it just didn't work! So I go out to drink water, and confirm the code with this friend, and the Compilers teacher sees me talking. Great, first of all, your code gives up on you, and then it also makes you look bad in front of everyone.

| |

oh dang didly doo!

Google buys Pyra! This is major. Ev, Jason and the gang rollerblading in the GooglePlex... hmm.

Everyone seems to be wondering why this happened - as in, what use is a content hosting service to a search company?
A lot of people are saying that there's going to be a "Blogs" tab now, beside the News and other tabs. Some people are also wondering if Blogspot sites will get a better pagerank.

But why buy Pyra? Why not just get 20 people to build Blogger in a month? Here's what I think: It's all about the data™. In Page and Brin's paper - The Anatomy of a Large-Scale Hypertextual Web Search Engine - the base assumption was that the WWW is a huge interlinked connection of information. And you know what's incredibly funny? The Blogosphere is just that - a small scale WWW.