BaconSnake to core: Pig 0.8 released!

My contribution to Python UDFs in Pig has finally been released as part of the new and shiny 0.8 release! I’ve been meaning to blog about how this came about when I had time, but Dmitriy saved me the work, so I’ll just quote him instead:

This is the outcome of PIG-928; it was quite a pleasure to watch this develop over time — while most Pig tickets wind up getting worked on by at most one or two people, this turned into a collaboration of quite a few developers, many of them new to the project — Kishore Gopalakrishna’s patch was the initial conversation starter, which was then hacked on or merged into similar work by Woody Anderson, Arnab Nandi, Julien Le Dem, Ashutosh Chauhan and Aniket Mokashi (Aniket deserves an extra shout-out for patiently working to incorporate everyone’s feedback and pushing the patch through the last mile).

Yay Open Source! Documentation is available on the Pig website, but here’s a one-line example of what you can now do:

register '' using jython as myfuncs;

Notice that we’re using a Python script instead of a jar. All Python functions in that script become available as UDFs to Pig!
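
To make that concrete, here is a minimal sketch of what such a script could look like. The file name myfuncs.py and the functions shout and word_length are made up for illustration. As far as I understand the Pig documentation, the outputSchema decorator is supplied by Pig’s Jython engine when the script is registered, so the fallback definition below is only an assumption that lets the file also run under plain Python.

```python
# myfuncs.py (hypothetical file name, for illustration only)

# Assumption: Pig's Jython engine injects `outputSchema` at registration time.
# The fallback below only exists so the file can be run and tested outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            # Remember the declared schema on the function object.
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("word:chararray")
def shout(word):
    # Pig passes null fields as None, so pass them straight through.
    if word is None:
        return None
    return word.upper() + "!"


@outputSchema("length:int")
def word_length(word):
    if word is None:
        return None
    return len(word)
```

With a script like this registered under the myfuncs namespace, the functions should be callable from Pig Latin as myfuncs.shout(name) and myfuncs.word_length(name), where name stands for whatever chararray field your relation has.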

Going forward, there are two things I didn’t get to do, which I’d love any of my dear readers to pick up on. First, we’re still not on par with SCOPE’s ease of use, since UDFs can’t be inlined yet (a la BaconSnake). This is currently open as a separate Apache JIRA issue, and it would be great to see patches here!

The second feature that I really wanted to make part of this patch submission, and didn’t have time for, was Java source UDF support as a JaninoScriptEngine. Currently, Pig users have to bundle their UDF code as jars, which quickly becomes painful. The goal of the ScriptEngine architecture is to let multiple languages be implemented and plugged in, so I really hope someone takes the time to look at the Janino project, which would allow Java source UDFs with zero performance penalty. This would also make it the first dynamically compiled query system over Hadoop (as far as I know), and open up a world of query optimization work to explore.

Friend-based throttling in Facebook News Feeds

This dialog in my Facebook feed options seemed interesting:

[Screenshot of the Facebook feed options dialog, taken 2010-08-13]

Notice how it asks how many friends I want my Live Feed to draw from. It seems the default is 250 friends. What this means is that when you click “Recent Posts”, you’re getting recent posts from only your top 250 friends; all other friends are being ignored.

Obviously this is a problem only if you have more than 250 friends. I’ve heard the average is 150, but I’m sure there are a lot of people who are affected by this. This option caught my eye for two reasons:

From a technical perspective, news feeds are massive publish-subscribe systems. You subscribe to your friends’ posts, which, when posted, are published to your feed. The 250 friend limit sets up a convenient soft limit for the system, reducing the stress on Facebook’s servers. Twitter doesn’t have such limits, and I can imagine this is one reason why its servers get overloaded. It’s a smart design from this perspective, but I wish Facebook were more transparent about the limit!

From a social perspective, I think this is a very primitive way to throttle friends. My understanding of the Feed was that “Top Posts” ranked recent posts so that I had a high-level view of my feed, and “Recent Posts” gave me access to everything. It seems this belief is incorrect. When I increased this number to 1000 (i.e. include ALL my friends), I suddenly started seeing updates from many friends I had totally forgotten about or lost touch with. Since I don’t see updates from them, I don’t interact with them on Facebook, leading to a self-reinforcing “poor get poorer” effect. I am assuming there’s some “Friendness” ranking going on here. If so, friends in my bottom 50 will never make it to my top 250 friends on Facebook. The use of a self-reinforcing ranking function is risky, especially when the stability of the ranking depends on human input. I wonder if the Feed team has done anything smart to introduce “compensators” based on interactions with bottom-50 friends, similar to the random reset in PageRank. The issue here is that unlike hyperlink edges, we’re dealing with a vocabulary of “Likes” and other social cues which are not well understood. It seems like this can be an excellent subject for a machine learning / information retrieval paper or two.
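
The “poor get poorer” loop, and what a PageRank-style random reset might do to it, can be sketched as a toy simulation. Everything in it is an assumption made up for illustration: the function name simulate, the scoring rule (one point per round for every friend shown), and the reset_prob parameter. It says nothing about how Facebook actually ranks friends.

```python
import random


def simulate(num_friends=300, feed_size=250, rounds=50, reset_prob=0.0, seed=0):
    """Toy model: return how many distinct friends ever appear in the feed."""
    rng = random.Random(seed)
    # Hypothetical starting "Friendness" scores; friend 0 starts highest.
    scores = [float(num_friends - i) for i in range(num_friends)]
    ever_shown = set()
    for _ in range(rounds):
        ranked = sorted(range(num_friends), key=lambda f: scores[f], reverse=True)
        shown = ranked[:feed_size]
        hidden = ranked[feed_size:]
        # Random reset: occasionally give a feed slot to a hidden friend.
        for slot in range(feed_size):
            if hidden and rng.random() < reset_prob:
                pick = rng.randrange(len(hidden))
                shown[slot], hidden[pick] = hidden[pick], shown[slot]
        # Being shown leads to interaction, which raises the score further.
        for friend in shown:
            scores[friend] += 1.0
            ever_shown.add(friend)
    return len(ever_shown)


if __name__ == "__main__":
    print("no reset:  ", simulate(reset_prob=0.0))
    print("with reset:", simulate(reset_prob=0.05))
```

Without the reset, the same 250 friends are shown every round and the other 50 never get a chance to earn a point. With even a small reset probability, hidden friends surface now and then, which is the kind of compensator I have in mind.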

Update: Horseman of the Interwebs Hung Truong points out Dunbar’s Number:

Dunbar’s number is a theoretical cognitive limit to the number of people with whom one can maintain stable social relationships. These are relationships in which an individual knows who each person is, and how each person relates to every other person. Proponents assert that numbers larger than this generally require more restrictive rules, laws, and enforced norms to maintain a stable, cohesive group. No precise value has been proposed for Dunbar’s number. It lies between 100 and 230, but a commonly detected value is 150.

This puts Facebook’s default threshold at a great place. However, Dunbar’s number is meant for offline relationships, i.e. the Dunbar number for ephemeral, online “feed” style relationships could arguably be much higher. It appears Dunbar has been working on this, and I’m looking forward to a publication from his group soon.

Upcoming VLDB Trip: Lyon, France

I’m looking forward to my talk at VLDB 2009 in Lyon, France. I will be presenting “HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching”, which is joint work I did with Phil Bernstein during my internship at Microsoft Research. The talk is scheduled for Tuesday, August 25, 2009 at 2pm in the Rhône 2 room at the conference venue.

Also look out for my labmate Bin Liu’s paper with our advisor, “Using Trees to Depict a Forest”.


Day Two

Cool things at work:

  • [cannot talk about thanks to NDA]
  • [cannot talk about because it’s too boring]
  • [cannot talk about because I don’t understand it]

Things I can talk about: Had dinner with fellow intern Nilanjan at Udupi Palace. Good food, good price. One great thing about Bellevue / Redmond is the large Indian population here, resulting in a competitive and thriving market for Indian restaurants and grocery stores. Looking forward to checking out Mayuri next time we go out for Indian food.


training starts today!

Now that the VLDB insanities and St. Patrick’s Day festivities are behind us, I begin my uphill climb towards getting rid of all the evil life-shortening substances I have put in my body over the last few months. The goal is to not embarrass myself at the Dexter-Ann Arbor 10K run, which is on the 1st of June. I have 2 months and 12 days to do this, so hopefully this isn’t an impossible task.

The plan is to detox and switch to a strictly healthy diet first. So dear friends, if you ever see me eating anything that says “McDonalds” or “Milky Way” on it, please feel free to slap me in the face and shake some sense back into me. That’s what friends are for, after all. The second plan of attack is to start running on Mondays, Wednesdays, and Fridays along with strength exercises, and go swimming on Tuesdays, Thursdays, and Saturdays. Sunday is rest day. For March, let’s keep running at 3 miles a day, and swimming at 500m. Thankfully I’ve been doing 60 push-ups every day already, and have been running once in a while, so my body should not collapse by April.

To be honest, I’m not looking forward to the sugar and fried food cravings. But I’ll do anything for a T-shirt, and this should be worth the trouble. Expect weekly updates on this front!



In light of the American Media Machine, I find this article very disturbing:

When University of Michigan social psychologist Norbert Schwarz had volunteers read the CDC flier, however, he found that within 30 minutes, older people misremembered 28 percent of the false statements as true. Three days later, they remembered 40 percent of the myths as factual.

Younger people did better at first, but three days later they made as many errors as older people did after 30 minutes. Most troubling was that people of all ages now felt that the source of their false beliefs was the respected CDC.

I’m really looking forward to the day when they’ll have a “Top Story” about how eating organic food inside hybrid vehicles causes certain chemical reactions in the food that trigger “bouts of homosexuality”.

sigmod plans

I’m heading off to SIGMOD 06 in a few hours. Looking forward to meeting a lot of superstars and familiar faces!


review on the d movie

Got this by email:

Movie review of D written by a friend:


Haha. Will have to watch it to corroborate. Next on my movie watching list: Parineeta and Paheli. Both are reportedly non-mainstream movies, and I’m looking forward to them.


Supersize Me

I’m starting a page cluster called Drupal, Supersized to discuss, collect and experiment with scalability and high-performance issues in Drupal. I’m hoping people will come forward and contribute to this. Let’s see how this turns out.


Attacked by the Lucene Bugzilla

Lucene’s Bugzilla was apparently migrated over the weekend, resulting in hundreds of emails being sent to the lucene-developer mailing list, all of which were totally useless and a royal pain to delete using a web-based interface (I was using Gmail). Considering there are at least some hundred people on that mailing list (I guess), multiply that by the close to a thousand emails, and you have a lot of useless email that could have been avoided if someone had just turned off email notification before the bulk updates / migration.

Also, in case you haven’t noticed, Lucene has now become a top-level Apache project, putting it in the same league as the HTTP Server, SpamAssassin, Jakarta, and Struts. The Lucene project now comprises Lucene Java, Nutch, and Lucene4c. I’m really looking forward to updates on Lucene4c; the webpages seem to be comatose.