Archive - 2009

July 29th

Yahoo: Just like the old times

I’m excited to go to work today, knowing that I will be witness, first hand, to one of the more incredible business deals being announced in the valley: Microsoft powering Yahoo Search.

There’s a lot that I want to say about this, but for now, I will leave you with this image. This is from when Yahoo! used to be powered by Google. (Many people believe that powering Yahoo was what made Google popular with the mainstream audience, and that Google owes what it is today to Yahoo.)

An excerpt from Wikipedia:

In 2002, they bought Inktomi, a “behind the scenes” or OEM search engine provider, whose results are shown on other companies’ websites and powered Yahoo! in its earlier days. In 2003, they purchased Overture Services, Inc., which owned the AlltheWeb and AltaVista search engines.

AlltheWeb, Altavista, Overture, Inktomi. That’s a lot of heritage.


July 11th

Microsoft Research's Data-related Launches

Microsoft Research is making a bunch of cool data analysis-related launches at the upcoming Faculty Summit.

First, there’s the academic release of Dryad and DryadLINQ:

Dryad is a high-performance, general-purpose, distributed-computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows® operating system. DryadLINQ enables developers to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. The academic release of Dryad and DryadLINQ provides the software necessary to develop DryadLINQ applications and to run them on a Windows HPC Server 2008 cluster. The academic release includes documentation and code samples.

They also launched Project Trident, a workflow workbench, which is available for download:

Project Trident: A Scientific Workflow Workbench is a set of tools—based on the Windows Workflow Foundation—for creating and running data analysis workflows. It addresses scientists’ need for a flexible and powerful way to analyze large and diverse datasets, and share their results. Trident Management Studio provides graphical tools for running, managing, and sharing workflows. It manages the Trident Registry, schedules workflow jobs, and monitors local or remote workflow execution. For large data sets, Trident can run multiple workflows in parallel on a Windows HPC Server 2008 cluster. Trident provides a framework to add runtime services and comes with services such as provenance and workflow monitoring. The Trident security model supports users and roles that allows scientists to control access rights to their workflows.

Then there’s GrayWulf:

GrayWulf builds on the work of Jim Gray, a Microsoft Research scientist and pioneer in database and transaction processing research. It also pays homage to Beowulf, the original computer cluster developed at NASA using “off-the-shelf” computer hardware.

July 4th

BaconSnake: Inlined Python UDFs for Pig

I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.

I don't know whether Sawzall allows UDFs; Pig lets you link any .jar files and call them from the language. But the Microsoft SCOPE implementation is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written out in C# right under the SQL -- no pre-compiling / including necessary.

Here's how simple SCOPE is. Note the #CS / #ENDCS codeblock that contains the C#:

R1 = SELECT A+C AS ac, B.Trim() AS B1 FROM R WHERE StringOccurs(C, "xyz") > 2

#CS
// Counts how many times ptrn occurs in str (overlapping matches included)
public static int StringOccurs(string str, string ptrn) {
    int cnt = 0;
    int pos = -1;
    while (pos + 1 < str.Length) {
        pos = str.IndexOf(ptrn, pos + 1);
        if (pos < 0) break;
        cnt++;
    }
    return cnt;
}
#ENDCS
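For readers who don't read C#, here is a rough Python port of the same helper. It is only a sketch to show what the UDF computes; the name string_occurs and the sample strings are mine, not part of SCOPE.

```python
def string_occurs(text, pattern):
    """Count occurrences of pattern in text, overlapping matches included."""
    count = 0
    pos = -1
    # Same loop as the C# version: restart the search one character
    # past the previous hit until nothing more is found.
    while pos + 1 < len(text):
        pos = text.find(pattern, pos + 1)
        if pos < 0:
            break
        count += 1
    return count


print(string_occurs("xyz-xyz-xyz", "xyz"))  # prints 3
print(string_occurs("aaa", "aa"))           # prints 2 (overlapping matches)
```

Note that this differs from Python's built-in str.count, which only counts non-overlapping matches.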

Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?

Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:

-- Script calculates average length of queries at each hour of the day

raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
           AS (user:chararray, time:chararray, query:chararray);

houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;

hour_group = GROUP houred BY hour;

hour_frequency = FOREACH hour_group 
                           GENERATE group as hour,
                                    baconsnake.AvgLength($1.query) as count;

DUMP hour_frequency;

-- The excite query log timestamp format is YYMMDDHHMMSS
-- This function extracts the hour, HH
def ExtractHour(timestamp):
	return timestamp[6:8]

-- Returns average length of query in a bag
def AvgLength(grp):
	sum = 0
	for item in grp:
		if len(item) > 0:
			sum = sum + len(item[0])	
	return str(sum / len(grp))

Everything in this file is normal Pig, except the baconsnake.* calls and the def blocks at the bottom -- they're Python calls and definitions.
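To see what the script computes without setting up Pig, the two functions can be exercised in plain Python. This is a sketch under assumptions: the three sample rows are made up (they are not from the excite log), the grouping is a stand-in for Pig's GROUP BY, and I use // because the original runs on Jython 2.5, where / on integers already truncates.

```python
from collections import defaultdict


def extract_hour(timestamp):
    # Timestamp format is YYMMDDHHMMSS, so characters 6..7 are the hour
    return timestamp[6:8]


def avg_length(bag):
    # A bag is modelled as a list of one-field tuples: [(query,), ...]
    total = 0
    for item in bag:
        if len(item) > 0:
            total += len(item[0])
    return str(total // len(bag))


# Hypothetical rows in (user, time, query) form
rows = [
    ("u1", "970916105432", "pig latin"),
    ("u2", "970916101500", "hadoop"),
    ("u3", "970916233000", "python udf"),
]

# Stand-in for: GROUP houred BY hour
groups = defaultdict(list)
for user, time, query in rows:
    groups[extract_hour(time)].append((query,))

averages = {hour: avg_length(bag) for hour, bag in sorted(groups.items())}
print(averages)  # prints {'10': '7', '23': '10'}
```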

It's pretty simple under the hood, actually. BaconSnake wraps each call in a Pig UDF that takes the Python source as input along with the parameter. Jython 2.5 is used to embed the Python runtime into Pig and call the functions.

Using this is easy: you convert the nice-looking "baconsnake" file above (the .bs file :P) to Pig Latin and run it like so:

cat scripts/histogram.bs | python scripts/bs2pig.py > scripts/histogram.pig
java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig

Behind the scenes, the BaconSnake Python preprocessor script includes the Jython runtime and BaconSnake's wrappers and emits valid Pig Latin, which can then be run on Hadoop or locally.

Important note: this is PURELY a proof-of-concept written only for entertainment purposes. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mappers) and DataBag-to-String (Reducers) functions are supported -- you're welcome to extend this to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested and would like to extend it!

Go check out BaconSnake at Google Code!

Update: My roommate Eytan convinced me to waste another hour of my time and include support for DataBags, which are exposed as Python lists. I've updated the relevant text and code.
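A preprocessor of this kind could look roughly like the sketch below. It is not the actual bs2pig.py: the splitting rule, the wrapper name hypothetical.PyWrapper, its argument order, and the string escaping are all assumptions made up for illustration.

```python
import re

CALL = re.compile(r"baconsnake\.(\w+)\(")


def split_bs(text):
    """Split a .bs file into (pig_lines, python_lines) at the first def."""
    lines = text.splitlines()
    start = next((i for i, l in enumerate(lines) if l.startswith("def ")),
                 len(lines))
    # Pull the '--' comment lines directly above the first def along
    while start > 0 and lines[start - 1].startswith("--"):
        start -= 1
    return lines[:start], lines[start:]


def to_pig(text, wrapper="hypothetical.PyWrapper"):
    """Rewrite baconsnake.F(x) into wrapper('F', '<python source>', x)."""
    pig_lines, py_lines = split_bs(text)
    # '--' comments are Pig syntax, so drop them from the Python source
    source = "\n".join(l for l in py_lines if not l.startswith("--"))
    # Assumed escaping for a single-quoted Pig string literal
    literal = (source.replace("\\", "\\\\")
                     .replace("'", "\\'")
                     .replace("\n", "\\n"))

    def rewrite(match):
        return "%s('%s', '%s', " % (wrapper, match.group(1), literal)

    return "\n".join(CALL.sub(rewrite, line) for line in pig_lines)


BS = """raw = LOAD 'q.log' AS (time:chararray);
houred = FOREACH raw GENERATE baconsnake.ExtractHour(time) AS hour;
-- extracts HH
def ExtractHour(timestamp):
    return timestamp[6:8]
"""

print(to_pig(BS))
```

The output is two Pig statements, with the call on the second line carrying the function name and the escaped Python source as extra string arguments.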

Update (Apr 2010): Looks like BaconSnake's spirit is slowly slithering into Pig Core! Also some attention from the Hive parallel universe.

June 28th

PrivatePond: Outsourced Management of Web Corpuses

This paper was presented at WebDB 2009 in Providence, Rhode Island. The PDF version is available here.

My colleague from the database research group, Dan Fabbri, just presented our work, “PrivatePond”, at WebDB 2009. This paper is a clear example of the research environment at Michigan. Dan works on database security, while I work on database search. Given that we sit across from each other at the lab, there is a constant amount of crosstalk. Add in a few brainstorming sessions and a few work-intense weekends, and you have a secure database search paper!

The core idea of the paper is simple. Everybody uses Google (or Yahoo! or Bing). They’re fast, they’re easy to use, and they’re free. Now let’s say you had some secure information, like your prescription information from your psychiatrist. Obviously you don’t want Google to know about it, because they can do bad, bad things with it. So you encrypt it. But you still want it to be searchable. But you can’t search encrypted data! So what do we do?

Enter PrivatePond. Basically, we’re encrypting private data just enough that it’s possible to search with decent ranking, while still keeping it secure.

We call this the “Secure Indexable Representation”, and we study how increasing the encryption decreases the quality of search, and vice versa.

Update: We actually have a demo of our system. If you would like to see it, please contact me!

Here are the slides for the talk:


June 21st

The difference between Google and Yahoo!

Time for some good ol’ flamebait!

State-of-the-art lawnmowing technology at Google:

State-of-the-art lawnmowing technology at Yahoo!:

As you can clearly see, Yahoo! is cuter.


June 1st

Manhattanhenge

This weekend was Manhattanhenge:

Manhattanhenge (sometimes referred to as Manhattan Solstice) is a biannual occurrence in which the setting sun aligns with the east-west streets of Manhattan’s main street grid.

It results in quite an intriguing view, since you get to watch the Sun set while standing on the east side of the island. I took the above photo from the eastern end of 14th Street, where a small group of us had gathered to witness the event.

May 17th

New York, New York!

It’s been a little over 2 weeks since I moved to the Big Apple for the summer. Much like my last visit to NYC in 2007, this time I’m working for another dotcom company.

NYC is a seriously awesome place. There’s so much stuff going on that it became a little hard to pick exactly what to do. Being the lazy person that I am, I decided to crowdsource this decision. So I put up a Facebook status update: Arnab wonders what to do in NYC. Suggestions? Within a few hours, I had 16 suggestions! Here’s an edited list. I’ve also added a few of my own suggestions:

The plan is to strike each one of these off my list (in addition to the usual stuff like going to see a Broadway play, etc.). Let’s see how this works out!

May 7th

Nerds are the new Rock Stars

We’re seeing a new breed of rock stars these days: Scientists.

Apparently there is a Night Club for Nerdy People in the Big Apple:

The crowd is young and hip, mostly in their 20s and 30s, eager to gain entry to tonight’s hot-ticket entertainment event. Once the doors open, about 50 lucky people secure chairs, while another 50 stand four-deep around the room, and another 50 are gently turned away at the door.
“This is the third time I haven’t made it in,” a disappointed young woman sighs.
A mixtape of music plays through the speakers and the audience sips drinks from plastic cups while waiting for the featured act to begin. It won’t be the latest indie band, or an up-and-coming comedian. This is not the typical New York club scene. This is the monthly meeting of the Secret Science Club.

Then there’s DorkBot, which has branches everywhere:

the main goals of dorkbot are: to create an informal, friendly environment in which people can talk, […] to give us all an opportunity to see the strange things our neighbors are doing with electricity.

Meanwhile, in Cambridge, Massachusetts, “Dr. Evil” and the “Mexican Multiplier” have dueled it out till the very end, in an attempt to write the largest number on a chalkboard.

Finally, here’s an awesome ad from Intel’s amazing marketing team:

April 24th

Tapbots goes fulltime

The Tapbots duo are quitting their day jobs to work fulltime on their iPhone app company:

Longer term we aren’t looking to get any VC funding, grow to 100s of employees or get bought out by some big corporation. We may get help with support, testing and/or marketing, but development and design is going to just be us two for the foreseeable future. We think that’s the best way to keep the quality of our applications at the level that everyone expects. Our goal is to produce about 4 applications a year. We aren’t going to shovel out crap-ware to cash-in on our names. We aren’t going to write the next Office or Filemaker. We are going to write simple but incredibly polished applications that are created specifically for the iPhone/Touch devices. Two guys, lot’s of passion and a lot of hard work, that’s the Tapbots way.




Two guys, two popular iPhone apps (“Weightbot sold 100k copies in its first 100 days, Convertbot is selling at about twice that rate.”), one mission to make quality apps. Good luck, guys!

April 18th

Organizing ideas

Anand Sarwate classifies research ideas he has into 5 convenient categories:

1. Pie-in-the-sky : wow, that seems interesting… I should think about it for more than 5 minutes…
2. Nebulous hand-waving : there’s a problem there, but what is the right framework?
3. Percolating : ok, but what is the actual formal problem?
4. In progress : finding the proof in the pudding.
5. Writing : oh no, a deadline! Gotta figure out how to fit the page limit!
