Technology

Larry Ellison on Cloud Computing

Oracle Man Larry Ellison gives his two cents about the new Cloud Computing hype:

“Do you think they run on Water vapor? It’s Databases, and Operating Systems, and Memory, and Microprocessors and the Internet! What are you talking about!”

Also see: the 60 Minutes episode where he pitches the Network Computer.

At the Yahoo! Key Scientific Challenges Graduate Student Summit

I’m at the Yahoo! Graduate Student Summit today and tomorrow. About the event:

On September 3 and 4 the Academic Relations team will host 21 exceptional PhD students at the Key Scientific Challenges Graduate Student Summit. These students are winners of this year’s KSC program, and over the course of the two day summit they will be attending tech talks and workshops, presenting their work, and discussing research trends with top researchers from Yahoo! Labs. These 21 students will also be joined by the program’s past winners and Yahoo! Student Fellows.

Thought I’d share notes:

  • Great spread of grad students in terms of research areas: HCI researchers, economists, and social scientists, apart from the typical CS people.
  • Presenters for Thursday:

    Welcome & Overview of Yahoo! Labs
    Prabhakar Raghavan, Head, Yahoo! Labs

    Search Technologies Overview
    Andrew Tomkins, Chief Scientist, Yahoo! Search

    Machine Learning & Statistics Research Overview
    Sathiya Keerthi Selvaraj, Senior Research Scientist

    Economics and Social Systems Research Overview
    Elizabeth Churchill, Principal Research Scientist

    Computational Advertising Research Overview
    Andrei Broder, Fellow and VP, Computational Advertising

    Web Information Management Research Overview
    Brian Cooper, Senior Research Scientist

  • Posters for the poster sessions look pretty awesome!

    Vowpal Wabbit now Open Source Project

    I was writing a longer post about VW a few weeks ago but ran out of time, so I’ll just post the initial few paragraphs for now.

    There’s probably a limit to how many times one is allowed to use the word “awesome” in a day — I feel like I’ve hit my quota, but I need to use it just once more before I hit the sack:

    I think it’s awesome that Yahoo! Research lets researchers open source their projects.

    I'm pretty sure John did not make this image

    A few days ago, the amazing John Langford released his fast online learning tool, Vowpal Wabbit, to the world as an open source project. Note the word project. That means all further development will happen out in the wild. A bunch of people have questioned the origin of the name “Vowpal Wabbit” — “What is this undecipherable mess of vowels and consonants!?,” you ask. “That’s how Elmer Fudd would pronounce Vorpal Rabbit,” John answers. “Vorpal? Whatdoesthatmean?!,” you ask again. Which is where I cite the singular font of human knowledge and quote a few lines from Lewis Carroll’s Jabberwocky:

    He took his vorpal sword in hand (, and later,)
    One, two! One, two! And through and through
    The vorpal blade went snicker-snack!
    He left it dead, and with its head
    He went galumphing back.

    If the back story hasn’t made it clear to you yet, let me paraphrase it for you: This stuff is fast. Wicked fast. Like, voodoo fast. How? That’s best left for another post.


    HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

    Just got done with the HAMSTER presentation; here is the paper, and here are my abstract and slides:

    We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distribution of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
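
    To make the "similar query distributions" idea concrete, here is a minimal sketch of how two schema elements could be compared. This is not the paper's implementation: the use of Jensen-Shannon divergence, the function names, and the toy data are all my assumptions for illustration, and the paper's actual similarity measure may differ.

```python
from collections import Counter
from math import log2


def query_distribution(queries):
    """Turn a list of keyword queries into a normalized frequency distribution."""
    counts = Counter(q.strip().lower() for q in queries)
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions.

    Assumption: JS divergence stands in for whatever measure the paper uses.
    """
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # m[k] > 0 whenever a[k] > 0, so the ratio is always defined.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def best_match(source_queries, warehouse):
    """Pick the warehouse element whose query distribution is closest to the source's.

    `warehouse` is a hypothetical dict: element name -> list of click-through queries.
    """
    src = query_distribution(source_queries)
    scores = {
        name: js_divergence(src, query_distribution(qs))
        for name, qs in warehouse.items()
    }
    return min(scores, key=scores.get), scores


# Toy data, invented for this example.
warehouse = {
    "laptops": ["cheap laptop", "dell laptop", "laptop deals"],
    "cameras": ["digital camera", "nikon dslr"],
}
match, scores = best_match(["dell laptop", "cheap laptop", "cheap laptop"], warehouse)
```

    Here the incoming feed element shares queries with "laptops" and none with "cameras", so "laptops" gets the lower divergence and wins the match.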

    I received a few questions after the talk, hence I thought I’d put up a quick FAQ:

    Q: Doesn’t the time period of the clicklog affect your integration quality?

    A: Yes. And we consider this a good thing. This allows trend information to come into the system, e.g. “pokemon” queries will start coming in, and merge “japanese toys” with “children’s collector items”. Unpopular items that are not searched for may not generate a mapping, but then again, this may be ok since the end goal was to integrate searched-for items.

    Q: You use clicklogs. I am a little old company/website owner X. Since my company’s name doesn’t start with G, M or Y, I don’t have clicklogs. How do I use your method?

    A: You already have clicklogs. Let’s say you are trying to merge your company/website X’s data with company Y’s data. Since both you (X) and Y have websites, you both run HTTP servers, which have the facility to log requests. Look through your HTTP server referral logs for strings like:
    URL: http://x.com
    REFERRER: http://www.google.com/?q=$search_string$

    This is your clicklog. The URL http://x.com has the query $search_string$. You can grep the logs of both websites to create clicklogs, which can then be used for integration.
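
    If grep feels too blunt, the same extraction can be sketched in a few lines of Python. This is a rough sketch under some assumptions: the (URL, referrer) pairs have already been pulled out of your server's log format, the search engine puts the query in a `q` parameter as in the example above (other engines may use a different parameter name), and the function names are mine.

```python
from urllib.parse import urlsplit, parse_qs


def extract_query(referrer, params=("q",)):
    """Return the search query embedded in a referrer URL, or None if there is none.

    `params` lists the query-string parameter names to look for; "q" matches
    the example above and is an assumption for any other search engine.
    """
    qs = parse_qs(urlsplit(referrer).query)  # decodes '+' and %-escapes
    for name in params:
        values = qs.get(name)
        if values and values[0].strip():
            return values[0].strip().lower()
    return None


def build_clicklog(hits, params=("q",)):
    """Group search queries by the URL they led to.

    `hits` is an iterable of (url, referrer) pairs taken from the HTTP log.
    """
    clicklog = {}
    for url, referrer in hits:
        query = extract_query(referrer, params)
        if query is not None:
            clicklog.setdefault(url, []).append(query)
    return clicklog


# Toy log entries, invented for this example.
hits = [
    ("http://x.com/toys/42", "http://www.google.com/?q=Pokemon+cards"),
    ("http://x.com/toys/42", "-"),                        # no referrer recorded
    ("http://x.com/about", "http://example.org/page"),    # not a search click
]
clicklog = build_clicklog(hits)
```

    Only the first hit carries a search query, so the resulting clicklog has a single entry for http://x.com/toys/42.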

    Q: My website is not very popular and I don’t have that many clicks from search engines. What do I do?

    A: Yup, this is a very real case. Specifically, you might have a lot of queries for some of your items, but not for others. This can be balanced out. See the section in our paper about Surrogate Clicklogs. Basically you can use a popular website’s clicklog as a “surrogate” log for your database. From the paper:

    …we propose a method by which we identify surrogate clicklogs for any data source without significant web presence. For each candidate entity in the feed that does not have a significant presence in the clicklogs (i.e. clicklog volume is less than a threshold), we look for an entity in our collection of feeds that is most similar to the candidate, and use its clicklog data to generate a query distribution for the candidate object.
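
    A minimal sketch of that fallback, with heavy caveats: the paper does not say (in this excerpt) how "most similar" is computed, so the token-overlap (Jaccard) similarity on entity names, the threshold value, and all names below are my own placeholders.

```python
def jaccard(a, b):
    """Token-overlap similarity between two entity names (a placeholder measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def with_surrogates(clicklog, entities, threshold=5):
    """Give every entity a list of queries, borrowing one where its own log is too thin.

    `clicklog` maps entity name -> list of queries. Entities with fewer than
    `threshold` queries borrow the clicklog of the most similar well-covered
    entity. The threshold of 5 is arbitrary.
    """
    rich = [e for e in entities if len(clicklog.get(e, [])) >= threshold]
    result = {}
    for e in entities:
        queries = clicklog.get(e, [])
        if len(queries) >= threshold or not rich:
            result[e] = list(queries)  # enough data, or nothing to borrow from
        else:
            surrogate = max(rich, key=lambda r: jaccard(e, r))
            result[e] = list(clicklog[surrogate])
    return result


# Toy data, invented for this example.
entities = ["canon eos 450d camera", "canon eos 500d camera", "garden hose"]
clicklog = {
    "canon eos 450d camera": ["canon dslr"] * 6,
    "canon eos 500d camera": ["eos 500d"],
    "garden hose": ["hose"] * 7,
}
filled = with_surrogates(clicklog, entities, threshold=5)
```

    In this toy run the 500d has only one query of its own, so it inherits the 450d's queries, which can then be turned into a query distribution for matching.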

    Q: I am an academic and do not have access to a public clicklog, or a public website to get clicklogs from. How do I use this technique?

    A: Participate in the Lemur project and get your friends to participate too.


    Yahoo: Just like the old times

    I’m excited to go to work today, knowing that I will be witness, first hand, to one of the more incredible business deals being announced in the valley: Microsoft powering Yahoo Search.

    There’s a lot that I want to say about this, but for now, I will leave you with this image. This is from when Yahoo! used to be powered by Google. (Many people believe that powering Yahoo was what made Google popular with the mainstream audience, and that Google owes what it is today to Yahoo.)

    An excerpt from Wikipedia:

    In 2002, they bought Inktomi, a “behind the scenes” or OEM search engine provider, whose results are shown on other companies’ websites and powered Yahoo! in its earlier days. In 2003, they purchased Overture Services, Inc., which owned the AlltheWeb and AltaVista search engines.

    AlltheWeb, Altavista, Overture, Inktomi. That’s a lot of heritage.


    Microsoft Research's Data-related Launches

    Microsoft Research is making a bunch of cool data analysis-related launches at the upcoming Faculty Summit.

    First, there’s the academic release of Dryad and DryadLINQ:

    Dryad is a high-performance, general-purpose, distributed-computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows® operating system. DryadLINQ enables developers to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. The academic release of Dryad and DryadLINQ provides the software necessary to develop DryadLINQ applications and to run them on a Windows HPC Server 2008 cluster. The academic release includes documentation and code samples.

    They also launched Project Trident, a workflow workbench, which is available for download:

    Project Trident: A Scientific Workflow Workbench is a set of tools—based on the Windows Workflow Foundation—for creating and running data analysis workflows. It addresses scientists’ need for a flexible and powerful way to analyze large and diverse datasets, and share their results. Trident Management Studio provides graphical tools for running, managing, and sharing workflows. It manages the Trident Registry, schedules workflow jobs, and monitors local or remote workflow execution. For large data sets, Trident can run multiple workflows in parallel on a Windows HPC Server 2008 cluster. Trident provides a framework to add runtime services and comes with services such as provenance and workflow monitoring. The Trident security model supports users and roles that allows scientists to control access rights to their workflows.

    Then there’s GrayWulf:

    GrayWulf builds on the work of Jim Gray, a Microsoft Research scientist and pioneer in database and transaction processing research. It also pays homage to Beowulf, the original computer cluster developed at NASA using “off-the-shelf” computer hardware.

    Nerds are the new Rock Stars

    We’re seeing a new breed of rock stars these days: Scientists.

    Apparently there is a Night Club for Nerdy People in the Big Apple:

    The crowd is young and hip, mostly in their 20s and 30s, eager to gain entry to tonight’s hot-ticket entertainment event. Once the doors open, about 50 lucky people secure chairs, while another 50 stand four-deep around the room, and another 50 are gently turned away at the door.
    “This is the third time I haven’t made it in,” a disappointed young woman sighs.
    A mixtape of music plays through the speakers and the audience sips drinks from plastic cups while waiting for the featured act to begin. It won’t be the latest indie band, or an up-and-coming comedian. This is not the typical New York club scene. This is the monthly meeting of the Secret Science Club.

    Then there’s DorkBot, which has branches everywhere:

    the main goals of dorkbot are: to create an informal, friendly environment in which people can talk, […] to give us all an opportunity to see the strange things our neighbors are doing with electricity.

    Meanwhile, in Cambridge, Massachusetts, “Dr. Evil” and the “Mexican Multiplier” have dueled it out till the very end, in an attempt to write the largest number on a chalkboard.

    Finally, here’s an awesome ad from Intel’s amazing marketing team:

    Tapbots goes fulltime

    The Tapbots duo are quitting their day jobs to work fulltime on their iPhone app company:

    Longer term we aren’t looking to get any VC funding, grow to 100s of employees or get bought out by some big corporation. We may get help with support, testing and/or marketing, but development and design is going to just be us two for the foreseeable future. We think that’s the best way to keep the quality of our applications at the level that everyone expects. Our goal is to produce about 4 applications a year. We aren’t going to shovel out crap-ware to cash-in on our names. We aren’t going to write the next Office or Filemaker. We are going to write simple but incredibly polished applications that are created specifically for the iPhone/Touch devices. Two guys, lot’s of passion and a lot of hard work, that’s the Tapbots way.




    Two guys, two popular iPhone apps (“Weightbot sold 100k copies in its first 100 days, Convertbot is selling at about twice that rate.”), one mission to make quality apps. Good luck, guys!


    y!Vmail - voice mail for your Yahoo! Mail

    Yesterday Dan, Pradeep and I presented “y!Vmail: voicemail for your Yahoo! Mail” at the Yahoo! University Hack Day Contest, winning the award for the 2nd best Hack! (jump to the demo video)


    Our team with judges Paul Tarjan and Rasmus Lerdorf

    The adventure started when I heard about Yahoo!’s Hack U event:

    Join Yahoo! web experts including Rasmus Lerdorf, the creator of PHP, for a week of learning, hacking and fun! You’ll hear interesting tech talks, hacking tips and lessons, and get hands-on coding workshops where you’ll work with cutting-edge technology. The week’s events will culminate with our University Hack Day competition—a day-long festival of coding, camaraderie, demos, awards, food, music and jollity (it’s a real word, look it up).

    Years ago when I was in my teens, I was an avid participant on the school / college tech fest circuit. Almost every major institution in and around Delhi would organize annual technical festivals, hosting programming contests and software demo competitions. This was where I got a chance to showcase my creations and meet other hackers. Winning these events became a good way for me to pay off those telephone bills — web development in the dial-up age was an expensive hobby!

    I decided to enter the Hack Day contest just for fun; it had been a while since I participated in one of these. It wasn’t about winning this time; I just wanted to do the whole “idea to execution to demo” thing with a group of friends, and spend hours screaming at each other over STUPID hard-to-find bugs that are actually staring you in the face, high-fiving every hour as a feature milestone was scratched off the todo-list. The reward: to be able to stand in front of a group of people and say “Hey guys, look what I made!” (If it’s hard to appreciate what this feels like, this video might help.)


    Yahoo! gave away a bunch of t-shirts; this was on one of them

    3 days before the Hack Day, I had an idea about building a phone-based interface for email. The idea was simple enough to build in a day, but fun enough to make an enjoyable demo. The only problem: I was already in the midst of a “hack” month of my own; my VLDB paper was due 3 hours before the start of the Hack Day, and I had been sacrificing sleep for LaTeX and Python for more than a week. There was no way I was going to be able to do this alone. Enter fellow grad students Dan and Pradeep. I told them about the contest and my idea. While they are both expert hackers, I had totally forgotten that people in Operating Systems research don’t really do a lot of Web Programming: “PHP….? I’ve never…” said Dan. I pointed them to the Yahoo Developer Network site and returned to my research paper writing madness. Hopefully by Friday evening, I would have a web-savvy hack team.

    On Friday, I took a quick nap after my paper deadline, and walked over to the Hack Fest area to meet my team (who had become PHP and telephony wizards by now) and load up on caffeine and sugar that the Yahoo! folks had set up for us.


    They even had my favorite candy!

    We split the work into two parts: Dan would build the phone interface while Pradeep and I would figure out the email and contacts API to write an email client backend. 7 hours later, we had the first version of our product up and running. We could call in and read emails. Happy with our progress, we decided that it would be wiser to go home and show up early the next day. We ended up wasting a few hours the next morning worrying about the presentation: the lecture hall had spotty cellphone coverage, a deal-killer for a phone demo! Pradeep made a breakthrough here, discovering that an obscure panel on the wall was actually a secret speakerphone. Having resolved demo issues, we resumed coding and plugged in the remaining features: navigating through emails, email summarization, and email prioritization. The friendly timestamps feature (“4 minutes ago”) was stolen from my blog’s code (i.e. the Status header of this blog).

    Around 3:30pm on Saturday, we updated our hackday entry:

    y!Vmail

    by Arnab Nandi, Daniel Peek, Pradeep Padala

    “Not everyone has a computer, but everyone has a phone.”

    This hack allows people to access their Yahoo! mail through a 1-800 number, using ANY touch-tone phone.
    Press 0 to open, * and # to navigate, 7 to delete. We figure out which emails are important, and read them first. We summarize long emails so that you don’t have to listen to all of it. If you want to talk to the person, just press 5 — we’ll connect you.

    APIs used: BBAuth, OpenMail, Contacts API, Term Extraction API

    Hack presentations started at 4:00pm on Saturday. I started with a 20-second PowerPoint pitch, followed by a rather entertaining demo. Using the lecture hall’s speakerphone, we called our service. Entering the correct PIN logged me in, and an entire roomful of people was now hearing the words “Welcome to y!Vmail. You have 5 new emails…”


    Me pushing numbers on the phone


    Here’s a short video walkthrough of our app:

    More details at http://yvmail.info

    A few minutes after the presentation ended, the prizes were announced. We ranked second. The winning hack was Brandon Kwaselow’s “Points of WOE”, a native iPhone app that allowed browsing and creation of placemarks on Yahoo! Maps. Congratulations, Brandon!

    Overall, this was a very exciting and enjoyable event; I had a rocking good time hanging out with the Yahoo! folks and getting a cool project out the door with around 15 hours of work. I end with some lessons, acquired over years of doing demo contests:

    • Be creative, but avoid feature creep.
    • Split up into sub-teams, but make sure you’re pair programming most of the time.
    • Get Version 0 done Super Super Early. Then polish, polish, polish.
    • Reuse (with attribution) as much code as you can.
    • Take lots of breaks, make friends, and have fun.

    Image credits: Rasmus, Erik
    Shout outs: Folks at Twilio for making the coolest telephony API in the universe!

    Maintained Relationships on Facebook

    Facebook Research Scientist Cameron Marlow has some interesting thoughts about Maintained Relationships, people who often stalk each other’s feeds, but don’t necessarily talk that much:


    In the diagram, the red line shows the number of reciprocal relationships, the green line shows the one-way relationships, and the blue line shows the passive relationships as a function of your network size. This graph shows the same data as the first graph, only combined for both genders. What it shows is that, as a function of the people a Facebook user actively communicate with, you are passively engaging with between 2 and 2.5 times more people in their network. I’m sure many people have had this feeling, but these data make this effect more transparent.

    I’m really jealous of the Facebook Data Team. They get to play with all that data!