HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Just got done with the HAMSTER presentation; here is the paper, and here are my abstract and slides:

We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distribution of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
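To make the core idea concrete, here is a minimal sketch of matching one feed category against warehouse categories by comparing query distributions. It is only an illustration: the cosine similarity measure, the function names and the toy data are my assumptions here, not necessarily what the paper uses.

```python
from collections import Counter
from math import sqrt


def query_distribution(queries):
    """Turn a list of clicked-through queries into a normalized frequency map."""
    counts = Counter(queries)
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse distributions stored as dicts."""
    dot = sum(w * q.get(k, 0.0) for k, w in p.items())
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def best_match(source_queries, warehouse_clicklogs):
    """Return (warehouse_element, score) for the closest query distribution."""
    src = query_distribution(source_queries)
    scored = [
        (name, cosine_similarity(src, query_distribution(qs)))
        for name, qs in warehouse_clicklogs.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: queries that led to clicks on each category's items.
warehouse = {
    "laptops": ["cheap laptop", "thinkpad", "laptop deals", "thinkpad"],
    "cameras": ["dslr", "canon camera", "dslr"],
}
feed_category = ["thinkpad", "laptop deals", "netbook"]

match, score = best_match(feed_category, warehouse)
print(match, round(score, 3))
```

Any measure that compares two distributions could be swapped in for the cosine here; the point is only that the matching signal comes from the queries, not from the schema names or the instance values.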

I received a few questions after the talk, hence I thought I’d put up a quick FAQ:

Q: Doesn’t the time period of the clicklog affect your integration quality?

A: Yes, and we consider this a good thing, because it lets trend information come into the system. For example, once “pokemon” queries start coming in, they will merge “japanese toys” with “children’s collector items”. Unpopular items that are not searched for may not generate a mapping, but then again, this may be OK, since the end goal is to integrate searched-for items.

Q: You use clicklogs. I am a little old company/website owner X. Since my company’s name doesn’t start with G, M or Y, I don’t have clicklogs. How do I use your method?

A: You already have clicklogs. Let’s say you are trying to merge your company/website X’s data with company Y’s data. Since both you (X) and Y have websites, you both run HTTP servers, which have the facility to log requests. Look through your HTTP server referral logs for strings like:

This is your clicklog. The URL contains the query $search_string$. You can grep the logs of both websites to create clicklogs, which can then be used for integration.
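If you would rather script it than grep, here is a rough sketch of pulling (query, clicked page) pairs out of an access log. It assumes the common Apache/nginx "combined" log format and that the search string sits in a referrer parameter named q, query or p; both are assumptions you will need to adjust for your server and for the search engines that send you traffic. The sample log line is made up.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: "combined" log format, where the quoted referrer follows the
# status code and response size. Adjust the pattern for your server.
LOG_LINE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)

# Assumption: referrer parameters that commonly carry the search string.
QUERY_PARAMS = ("q", "query", "p")


def extract_click(line):
    """Return (search_string, clicked_path), or None if no query is found."""
    m = LOG_LINE.search(line)
    if not m:
        return None
    params = parse_qs(urlparse(m.group("referrer")).query)
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0].strip().lower(), m.group("path")
    return None


# Hypothetical log line for illustration only.
sample = (
    '203.0.113.7 - - [10/Oct/2009:13:55:36 -0700] '
    '"GET /products/thinkpad-x200 HTTP/1.1" 200 2326 '
    '"http://www.example.com/search?q=cheap+thinkpad" "Mozilla/5.0"'
)
print(extract_click(sample))
```

Run this over the logs of both sites and you get two lists of (query, page) pairs, which is all the clicklog you need.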

Q: My website is not very popular and I don’t have that many clicks from search engines. What do I do?

A: Yup, this is a very real case. Specifically, you might have a lot of queries for some of your items, but not for others. This can be balanced out. See the section in our paper about Surrogate Clicklogs. Basically you can use a popular website’s clicklog as a “surrogate” log for your database. From the paper:

…we propose a method by which we identify surrogate clicklogs for any data source without significant web presence. For each candidate entity in the feed that does not have a significant presence in the clicklogs (i.e. clicklog volume is less than a threshold), we look for an entity in our collection of feeds that is most similar to the candidate, and use its clicklog data to generate a query distribution for the candidate object.
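A small sketch of that idea, under loud assumptions: entity similarity here is plain token overlap (Jaccard) on entity names, the threshold is an arbitrary min_clicks value, and the data is made up. The paper does not prescribe any of these in the passage above, so treat this as one possible reading, not the method itself.

```python
from collections import Counter


def token_jaccard(a, b):
    """Assumed similarity: Jaccard overlap of lowercase name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def with_surrogates(clicklogs, min_clicks=5):
    """Map each entity to query counts, borrowing from a surrogate if sparse.

    clicklogs: {entity_name: [queries that led to clicks on it]}
    min_clicks: hypothetical volume threshold below which an entity is sparse.
    """
    rich = {e: qs for e, qs in clicklogs.items() if len(qs) >= min_clicks}
    result = {}
    for entity, queries in clicklogs.items():
        if entity in rich or not rich:
            # Enough data of its own, or nothing available to borrow from.
            result[entity] = Counter(queries)
            continue
        # Borrow the clicklog of the most similar well-covered entity.
        surrogate = max(rich, key=lambda other: token_jaccard(entity, other))
        result[entity] = Counter(rich[surrogate])
    return result


# Hypothetical toy data.
logs = {
    "canon eos 450d camera": ["dslr"] * 6,
    "thinkpad x200 laptop": ["thinkpad"] * 7,
    "canon eos 500d camera": ["eos 500d"],  # too few clicks
}
filled = with_surrogates(logs)
print(filled["canon eos 500d camera"])
```

In practice you would compare entities on more than their names (attributes, descriptions and so on), but the control flow stays the same: check the volume, find the nearest well-covered neighbour, and reuse its queries.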

Q: I am an academic and do not have access to a public clicklog, or a public website to get clicklogs from. How do I use this technique?

A: Participate in the Lemur project and get your friends to participate too.


tales from the blogout

You can see photos from the Blogout trip here

Some memorable quotes:

“Hi, what’s your URL?”— Me to Anuja, a non-blogger, who then made a very strange face.
“Tulsi, see? it was capsizing because of you” — Sathish VJ, 2 minutes before his canoe capsized for a third time (much to my chagrin, having substituted Tulsi as Sathish’s co-canoeist)
“Anybody got a deodorant?” — Me, starting off a rather exciting ten minutes of attempting to kindle a rain-soaked campfire with the alcohol-based blow torches otherwise known as Axe and Fa. The campfire didn’t start, but I did get very nice smelling hands.
“Faaxe” — Venky (or was it Suman or Sathish?), christening the new scent.
“Mix.” — The rather curiously humored camp-coordinator Ganesh, on being asked if the container contained tea or coffee.
“I don’t care! – I want female” — The pig-headed ticket collector who refused to believe that one of us had an accidental “F” in place of an “M” on his ticket, despite being shown the guy’s driving license. Don’t we all, sir, don’t we all.