Archive

September 15th, 2009

Microsoft Style


September 3rd

At the Yahoo! Key Scientific Challenges Graduate Student Summit

I’m at the Yahoo! Graduate Student Summit today and tomorrow. About the event:

On September 3 and 4 the Academic Relations team will host 21 exceptional PhD students at the Key Scientific Challenges Graduate Student Summit. These students are winners of this year’s KSC program, and over the course of the two day summit they will be attending tech talks and workshops, presenting their work, and discussing research trends with top researchers from Yahoo! Labs. These 21 students will also be joined by the program’s past winners and Yahoo! Student Fellows.

Thought I’d share notes:

  • Great spread of grad students in terms of research areas: HCI researchers, economists, and social scientists, in addition to the typical CS crowd.
  • Presenters for Thursday:

    Welcome & Overview of Yahoo! Labs
    Prabhakar Raghavan, Head, Yahoo! Labs

    Search Technologies Overview
    Andrew Tomkins, Chief Scientist, Yahoo! Search

    Machine Learning & Statistics Research Overview
    Sathiya Keerthi Selvaraj, Senior Research Scientist

    Economics and Social Systems Research Overview
    Elizabeth Churchill, Principal Research Scientist

    Computational Advertising Research Overview
    Andrei Broder, Fellow and VP, Computational Advertising

    Web Information Management Research Overview
    Brian Cooper, Senior Research Scientist

  • Posters for the poster sessions look pretty awesome!

    Vowpal Wabbit now Open Source Project

    I was writing a longer post about VW a few weeks ago but ran out of time, so I’ll just post the initial few paragraphs for now.

    There’s probably a limit to how many times one is allowed to use the word “awesome” in a day — I feel like I’ve hit my quota, but I need to use it just once more before I hit the sack:

    I think it’s awesome that Yahoo! Research lets researchers open source their projects.

    [Image: I’m pretty sure John did not make this image]

    A few days ago, the amazing John Langford released his fast online learning tool, Vowpal Wabbit, to the world as an open source project. Note the word project. That means all further development will happen out in the wild. A bunch of people have questioned the origin of the name “Vowpal Wabbit” — “What is this undecipherable mess of vowels and consonants!?,” you ask. “That’s how Elmer Fudd would pronounce Vorpal Rabbit,” John answers. “Vorpal? What does that mean?!,” you ask again. Which is where I cite the singular font of human knowledge and quote a few lines from Lewis Carroll’s Jabberwocky:

    He took his vorpal sword in hand

    and, later:
    One, two! One, two! And through and through
    The vorpal blade went snicker-snack!
    He left it dead, and with its head
    He went galumphing back.

    If the back story hasn’t made it clear to you yet, let me paraphrase it for you: This stuff is fast. Wicked fast. Like, voodoo fast. How? That’s best left for another post.


    August 25th

    HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

    Just got done with the HAMSTER presentation; here is the paper, and here are my abstract and slides:

    We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
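The core matching signal from the abstract — comparing the query distributions behind two schema elements — can be sketched roughly as below. The function names, the example click counts, and the use of cosine similarity are my own stand-ins for illustration; the paper's actual similarity measure may differ.

```python
from math import sqrt

def query_distribution(clicks):
    # Normalize raw per-query click counts into a probability distribution.
    total = float(sum(clicks.values()))
    return {q: c / total for q, c in clicks.items()}

def cosine_similarity(p, q):
    # Cosine similarity between two sparse query distributions.
    dot = sum(p[k] * q.get(k, 0.0) for k in p)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Hypothetical click counts for two schema elements being matched.
a = query_distribution({"pokemon": 8, "trading cards": 2})
b = query_distribution({"pokemon": 6, "collector items": 4})
print(cosine_similarity(a, b))
```

Two elements whose instances attract similar query mixes score close to 1 and become match candidates.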

    I received a few questions after the talk, hence I thought I’d put up a quick FAQ:

    Q: Doesn’t the time period of the clicklog affect your integration quality?

    A: Yes. And we consider this a good thing. This allows trend information to come into the system, e.g. a surge of “pokemon” queries will come in and merge “japanese toys” with “children’s collector items”. Unpopular items that are not searched for may not generate a mapping, but then again, this may be ok since the end goal was to integrate searched-for items.

    Q: You use clicklogs. I am a little old company/website owner X. Since my company’s name doesn’t start with G, M or Y, I don’t have clicklogs. How do I use your method?

    A: You already have clicklogs. Let’s say you are trying to merge your company/website X’s data with company Y’s data. Since both you (X) and Y have websites, you both run HTTP servers, which have the facility to log requests. Look through your HTTP server referral logs for strings like:
    URL: http://x.com
    REFERRER: http://www.google.com/?q=$search_string$

    This is your clicklog. The URL http://x.com has the query $search_string$. You can grep both websites’ logs to create clicklogs, which can then be used for integration.
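Extracting those (URL, query) pairs from a referrer log can be sketched like this. The log line format and the regex are hypothetical; real HTTP server log formats vary, so adapt the pattern to yours.

```python
import re

# Hypothetical referrer-log lines; real server log formats vary.
log_lines = [
    'URL: http://x.com/item/42 REFERRER: http://www.google.com/?q=red+widgets',
    'URL: http://x.com/about REFERRER: http://x.com/',
]

# Pull (landing URL, search query) pairs out of search-engine referrals.
pattern = re.compile(r'URL: (\S+) REFERRER: \S*google\.com/\?q=(\S+)')

clicklog = []
for line in log_lines:
    m = pattern.search(line)
    if m:
        url, query = m.group(1), m.group(2).replace('+', ' ')
        clicklog.append((url, query))

print(clicklog)  # [('http://x.com/item/42', 'red widgets')]
```

Only the line referred from a search engine survives; internal navigation is ignored.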

    Q: My website is not very popular and I don’t have that many clicks from search engines. What do I do?

    A: Yup, this is a very real case. Specifically, you might have a lot of queries for some of your items, but not for others. This can be balanced out. See the section in our paper about Surrogate Clicklogs. Basically you can use a popular website’s clicklog as a “surrogate” log for your database. From the paper:

    …we propose a method by which we identify surrogate clicklogs for any data source without significant web presence. For each candidate entity in the feed that does not have a significant presence in the clicklogs (i.e. clicklog volume is less than a threshold), we look for an entity in our collection of feeds that is most similar to the candidate, and use its clicklog data to generate a query distribution for the candidate object.
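A toy sketch of that surrogate selection follows. The name-based Jaccard similarity and the click-count threshold here are my own stand-ins, not the paper's actual similarity function:

```python
def jaccard(a, b):
    # Token-set Jaccard similarity between two entity names (a stand-in measure).
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / float(len(sa | sb))

def surrogate_clicklog(entity, clicklogs, threshold=5):
    # If an entity has too few clicks of its own, borrow the clicklog
    # of the most similar entity as a surrogate.
    own = clicklogs.get(entity, [])
    if len(own) >= threshold:
        return own
    best = max((e for e in clicklogs if e != entity),
               key=lambda e: jaccard(entity, e), default=None)
    return clicklogs.get(best, [])

# Hypothetical data: a popular entity and a sparse one.
logs = {
    "harry potter book": ["harry potter"] * 10,
    "rare harry potter book": ["rare hp"],
}
print(surrogate_clicklog("rare harry potter book", logs))
```

The sparse entity inherits the popular entity's query distribution instead of going unmatched.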

    Q: I am an academic and do not have access to a public clicklog, or a public website to get clicklogs from. How do I use this technique?

    A: Participate in the Lemur project and get your friends to participate too.


    August 22nd

    Upcoming VLDB Trip : Lyon, France

    I’m looking forward to my talk at VLDB 2009 in Lyon, France. I will be presenting “HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching”, which is joint work I did with Phil Bernstein during my internship at Microsoft Research. The talk is scheduled for Tuesday, August 25, 2009 at 2pm in the Rhône 2 room at the conference venue.

    Also look out for my labmate Bin Liu’s paper with our advisor, “Using Trees to Depict a Forest”.


    August 20th

    "My pledges as a reviewer"

    CUHK Professor Yufei Tao’s homepage has this interesting tidbit:

    My pledges as a reviewer:

    • I will treat your work with respect.
    • I will spend enough time with your paper. I will not make any decision without a good understanding.
    • In case I decide to recommend rejection, I will do so on solid grounds. I do not reject papers based on subjective and vacuous statements such as “I don’t like this idea”.
    • I will write reviews in a courteous manner. I have seen harsh reviews by other people which heavily mention my publications, and thus make people feel I was the reviewer. I will never do anything like this.

    August 3rd

    Brim

    Standing by, watching sighs
    Escape from passersby
    Feelings collect, rise up, and in a while
    reflect, give up, and run dry.

    One day the brim will mean something.
    Till then, we’ll survive.


    July 29th

    Yahoo: Just like the old times

    I’m excited to go to work today, knowing that I will be witness, first hand, to one of the more incredible business deals being announced in the valley: Microsoft powering Yahoo Search.

    There’s a lot that I want to say about this, but for now, I will leave you with this image. This is from when Yahoo! used to be powered by Google. (Many people believe that powering Yahoo was what made Google popular with the mainstream audience, and that Google owes who it is today to Yahoo.)

    An excerpt from Wikipedia:

    In 2002, they bought Inktomi, a “behind the scenes” or OEM search engine provider, whose results are shown on other companies’ websites and powered Yahoo! in its earlier days. In 2003, they purchased Overture Services, Inc., which owned the AlltheWeb and AltaVista search engines.

    AlltheWeb, Altavista, Overture, Inktomi. That’s a lot of heritage.


    July 11th

    Microsoft Research's Data-related Launches

    Microsoft Research has been announcing a bunch of cool data-analysis-related launches at the upcoming Faculty Summit.

    First, there’s the academic release of Dryad and DryadLINQ:

    Dryad is a high-performance, general-purpose, distributed-computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows® operating system. DryadLINQ enables developers to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. The academic release of Dryad and DryadLINQ provides the software necessary to develop DryadLINQ applications and to run them on a Windows HPC Server 2008 cluster. The academic release includes documentation and code samples.

    They also launched Project Trident, a workflow workbench, which is available for download:

    Project Trident: A Scientific Workflow Workbench is a set of tools—based on the Windows Workflow Foundation—for creating and running data analysis workflows. It addresses scientists’ need for a flexible and powerful way to analyze large and diverse datasets, and share their results. Trident Management Studio provides graphical tools for running, managing, and sharing workflows. It manages the Trident Registry, schedules workflow jobs, and monitors local or remote workflow execution. For large data sets, Trident can run multiple workflows in parallel on a Windows HPC Server 2008 cluster. Trident provides a framework to add runtime services and comes with services such as provenance and workflow monitoring. The Trident security model supports users and roles that allows scientists to control access rights to their workflows.

    Then there’s GrayWulf:

    GrayWulf builds on the work of Jim Gray, a Microsoft Research scientist and pioneer in database and transaction processing research. It also pays homage to Beowulf, the original computer cluster developed at NASA using “off-the-shelf” computer hardware.

    July 4th

    BaconSnake: Inlined Python UDFs for Pig

    I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.

    I'm not sure whether Sawzall allows UDFs, and Pig lets you link in .jar files and call them from the language. But the Microsoft SCOPE implementation is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written in C# right under the SQL -- no pre-compiling / including necessary.

    Here's how simple SCOPE is. Note the #CS / #ENDCS codeblock that contains the C#:

    R1 = SELECT A+C AS ac, B.Trim() AS B1 FROM R WHERE StringOccurs(C, "xyz") > 2

    #CS
    public static int StringOccurs(string str, string ptrn) {
        int cnt = 0;
        int pos = -1;
        while (pos + 1 < str.Length) {
            pos = str.IndexOf(ptrn, pos + 1);
            if (pos < 0) break;
            cnt++;
        }
        return cnt;
    }
    #ENDCS
    

    Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?

    Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:

    -- Script calculates average length of queries at each hour of the day
    
    raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
               AS (user:chararray, time:chararray, query:chararray);
    
    houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;
    
    hour_group = GROUP houred BY hour;
    
    hour_frequency = FOREACH hour_group 
                               GENERATE group as hour,
                                        baconsnake.AvgLength($1.query) as count;
    
    DUMP hour_frequency;
    
    -- The excite query log timestamp format is YYMMDDHHMMSS
    -- This function extracts the hour, HH
    def ExtractHour(timestamp):
    	return timestamp[6:8]
    
    -- Returns average length of query in a bag
    def AvgLength(grp):
    	sum = 0
    	for item in grp:
    		if len(item) > 0:
    			sum = sum + len(item[0])	
    	return str(sum / len(grp))
    

    Everything in this file is normal Pig, except the highlighted parts -- they're Python definitions and calls.

    It's pretty simple under the hood, actually. BaconSnake creates a wrapper Pig UDF that takes the Python source as input along with the parameters. Jython 2.5 is used to embed the Python runtime into Pig and call the functions.
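The wrapper idea can be sketched in plain Python. The real implementation does the equivalent via Jython inside the JVM, and the helper name here is hypothetical:

```python
def make_udf(python_source, func_name):
    # Compile a Python source snippet and return the named function --
    # roughly what the BaconSnake wrapper does through Jython inside Pig.
    namespace = {}
    exec(python_source, namespace)
    return namespace[func_name]

# The inline UDF from the example above, shipped to the wrapper as a string.
source = "def ExtractHour(timestamp):\n    return timestamp[6:8]"
extract_hour = make_udf(source, "ExtractHour")
print(extract_hour("970916083000"))  # YYMMDDHHMMSS -> '08'
```

Each call then just dispatches the tuple's fields to the compiled function.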

    Using this is easy: you convert the nice-looking "baconsnake" file above (the .bs file :P) to Pig Latin and run it like so:

    cat scripts/histogram.bs | python scripts/bs2pig.py > scripts/histogram.pig
    java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig
    

    Behind the scenes, the BaconSnake preprocessor script includes the Jython runtime and BaconSnake's wrappers and emits valid Pig Latin, which can then be run on Hadoop or locally.
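The preprocessing step -- separating the inline Python defs from the Pig statements before emitting Pig Latin -- can be sketched as below. This is a toy version for illustration; the real bs2pig.py works differently:

```python
def split_bs(source):
    # Separate inline Python function definitions from the Pig statements.
    # A line starting with "def " opens a Python block; the block runs
    # until the next non-indented line.
    pig_lines, py_lines = [], []
    in_def = False
    for line in source.splitlines():
        if line.startswith("def "):
            in_def = True
        elif in_def and line and not line[0].isspace():
            in_def = False
        (py_lines if in_def else pig_lines).append(line)
    return "\n".join(pig_lines), "\n".join(py_lines)

# Tiny hypothetical .bs input mixing Pig and Python.
bs_source = "raw = LOAD 'data';\ndef F(t):\n    return t\nDUMP raw;"
pig, py = split_bs(bs_source)
print(pig)
print(py)
```

The Python half is then handed to the Jython-backed wrappers, and the Pig half becomes the emitted script.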

    Important Notes: Note that this is PURELY a proof-of-concept written only for entertainment purposes. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mappers) and DataBag-to-String (Reducers) functions are supported -- you're welcome to extend this to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested and would like to extend it!

    Go checkout BaconSnake at Google Code!

    Update: My roommate Eytan convinced me to waste another hour of my time and include support for Databags, which are exposed as Python lists. I've updated the relevant text and code.

    Update (Apr 2010): Looks like BaconSnake's spirit is slowly slithering into Pig Core! Also some attention from the Hive parallel universe.