
Visualizations for Navigation: Experiments on my blog

This is a meta post describing two features on this blog that I don't think I've documented before. Apologies for the navel-gazing; I hope there's enough useful information here to make it worth reading.

Most folks read my blog through the RSS feed, but those who peruse the web version get to see many different navigational aids to help them around the website. Since the blog runs on Drupal, I get to deploy all sorts of fun stuff. One example is the Similar Entries module, which uses MySQL's FULLTEXT similarity to show possibly related posts [1]. This lets you jump around the website reading posts similar to each other, which is especially useful for readers who come in from a search engine result page. For example, they may come in looking for Magic Bus for the iPhone, but given that they're probably iPhone users, they may also be interested in the amusing DIY iPhone Speakers post.
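For the curious, here's roughly what such a query can look like. This is a minimal sketch of my own, assuming a Drupal 6-style database layer and a simplified posts table with a FULLTEXT index on (title, body); the actual Similar Entries module's schema and query differ.

<?php
// Hypothetical, simplified "similar posts" lookup (not the module's code).
// Assumes: a `posts` table with a FULLTEXT index on (title, body).
function similar_posts($post_id, $limit = 5) {
  $post = db_fetch_object(db_query(
    "SELECT title, body FROM posts WHERE id = %d", $post_id));
  // MATCH ... AGAINST scores every other post by textual similarity
  // to the current post's title and body.
  $result = db_query_range(
    "SELECT id, title, MATCH(title, body) AGAINST ('%s') AS score
     FROM posts WHERE id <> %d
     HAVING score > 0
     ORDER BY score DESC",
    $post->title . ' ' . $post->body, $post_id, 0, $limit);
  $links = array();
  while ($row = db_fetch_object($result)) {
    $links[] = l($row->title, 'node/' . $row->id);
  }
  return $links;
}
?>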

The Timeline Footer

However, given that this blog has amassed about a thousand posts over seven years now, it is hard to expose an "overview" of that much information to the reader in a concise manner. Serendipitous browsing can only go so far. Since this is a personal blog, the chronological aspect of the posts is worth appreciating. Many blogs have a "calendar archive" for this, but somehow I find them unappealing; they occupy too much screen space for the amount of information they deliver. My answer is a chronological histogram, which shows the frequency of posts over time:

Each bar represents the number of blog posts I posted that month, starting from August 2002 until now [2]. Moving your mouse over each bar tells you which month it is. This visualization presents many interesting bits of information. On a personal note, it clearly represents many stages of my life. June of 2005 was a great month for my blog; it had the highest number of posts, possibly related to the fact that I had just moved to Bangalore, a city with an active blogging community. There are noticeable dips that reflect extended periods of travel and bigger projects.

In the background, this is all done by a simple SELECT COUNT(*) FROM nodes GROUP BY month type of query. Some smoothing is applied to the counts because of their high variance; for my usage, height = log base 4 (frequency) gave pretty good results. This goes into a PHP block, which is then displayed in the footer of every blog page. The Drupal PHP snippets section is a great place to start for things like this. Note that the chart is pure HTML / CSS; there is no Javascript involved [3].
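Here's an illustrative sketch of what that block could look like; the table, column, and class names are my assumptions (a Drupal 6-style {node} table with a Unix created timestamp), not the actual block code.

<?php
// Sketch of the timeline footer block. Counts posts per month,
// then draws one CSS-sized <div> bar per month.
// Note: %% escapes a literal % inside Drupal's db_query().
$result = db_query(
  "SELECT DATE_FORMAT(FROM_UNIXTIME(created), '%%Y-%%m') AS month,
          COUNT(*) AS freq
   FROM {node}
   WHERE type = 'blog' AND status = 1
   GROUP BY month ORDER BY month");
$bars = '';
while ($row = db_fetch_object($result)) {
  // Smooth the high variance: height grows with log base 4 of the count.
  $height = 2 + round(log($row->freq, 4) * 20);
  $bars .= '<div class="bar" style="height: ' . $height . 'px;"'
         . ' title="' . $row->month . ' (' . $row->freq . ' posts)"></div>';
}
print '<div class="timeline">' . $bars . '</div>';
?>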

The Dot Header

Many of my posts are manually categorized using Drupal's excellent taxonomy system. A traditional solution is to create sections, so that the user can easily browse through all my Poems or my nerdy posts. The problem is that this blog contains notes and links to things that I think are "interesting", a classification that has constantly evolved as my interests have changed over the past decade. Not only is it hard for me to box myself into a fixed set of categories, but maintaining the evolution of those categories across 7+ years is not something I want to deal with every day.

This is where tags and automatic term extraction come in. As you can see in the header of the blog mainpage, each dot is a topic, automatically extracted from all posts on the website. I list the top 60 topics in alphabetical order, where each topic is also a valid taxonomy term. The aesthetics are inspired by the RaphaelJS dots demo, but just like the previous visualization, it is done using pure CSS + HTML. The size and color of each dot are based on the number of posts that contain that term. Hovering over a dot gives you its label and count; clicking it takes you to an index of posts with that term. This gives me a concise and maintainable way to tell the user what kinds of things I write about. It also addresses a problem that a lot of my readers have: they either care only about the tech-related posts (click on the biggest purple dot!), or only about the non-tech posts (look for the "poetry" dot in the last row!).

This visualization works by first automatically extracting terms from each post. This is done using the OpenCalais module (I previously used Yahoo's Term Extractor, but switched since it seems Yahoo!'s extractor is scheduled to be decommissioned soon). The visualization is updated constantly using a cached GROUP BY block similar to the previous one, this time grouped on the taxonomy term. This lets me add new posts as often as I like; tags are automatically generated and reflected in the visualization without me having to do anything.
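As a rough sketch of how such a block can be put together (again with my own assumed names, using Drupal 6's term_data / term_node tables rather than the site's actual code):

<?php
// Sketch of the dot header: take the 60 most-used taxonomy terms,
// then render them alphabetically as CSS-sized dots linking to
// their term pages.
$result = db_query_range(
  "SELECT td.tid, td.name, COUNT(tn.nid) AS freq
   FROM {term_data} td INNER JOIN {term_node} tn ON td.tid = tn.tid
   GROUP BY td.tid, td.name ORDER BY freq DESC", 0, 60);
$terms = array();
while ($row = db_fetch_object($result)) {
  $terms[$row->name] = $row;
}
ksort($terms);  // alphabetical order for display
$dots = '';
foreach ($terms as $term) {
  // Dot diameter (and, via CSS, color) grows with the term's post count.
  $size = 6 + round(log($term->freq, 4) * 6);
  $dots .= '<a href="/taxonomy/term/' . $term->tid . '" class="dot"'
         . ' style="width: ' . $size . 'px; height: ' . $size . 'px;"'
         . ' title="' . check_plain($term->name) . ' (' . $term->freq . ')"></a>';
}
print '<div class="dots">' . $dots . '</div>';
?>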

So that's it: two simple graphical ways to represent content. I know the two visualizations aren't the best thing since sliced bread and probably won't solve world peace, but they're an attempt to encourage discoverability of content on the site. Comments are welcome!


Footnotes:

[1] I actually created that module (and the CAPTCHA module) over four years ago; they've been maintained and overhauled by other good folks since.

[2] Arnab's World is older than that (possibly 1997, hence the childish name!), but that's the oldest blog post I could recover.

[3] I have nothing against Javascript; it's just that CSS tends to be easier to manage and usually more responsive. Also, the HTML generated is probably not valid and is SUPER inefficient + ugly. Hopefully I will have time to clean this up sometime in the future.

HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Just got done with the HAMSTER presentation; here is the paper, and here are my abstract and slides:

We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine's data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine's clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets showing that the new technique has much better accuracy than traditional techniques.
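To make the matching criterion concrete, here is my own toy illustration (not the paper's implementation): represent each schema element by the frequency distribution of queries that led to clicks on its instances, and score a candidate match by, say, the cosine similarity of the two distributions.

<?php
// Toy example: match schema elements by comparing their clicklog
// query distributions (here via cosine similarity; the paper's
// actual distance measure and pipeline are more involved).
function cosine_similarity($dist_a, $dist_b) {
  $dot = 0.0; $norm_a = 0.0; $norm_b = 0.0;
  foreach ($dist_a as $query => $freq) {
    if (isset($dist_b[$query])) {
      $dot += $freq * $dist_b[$query];
    }
    $norm_a += $freq * $freq;
  }
  foreach ($dist_b as $freq) {
    $norm_b += $freq * $freq;
  }
  if ($norm_a == 0 || $norm_b == 0) return 0.0;
  return $dot / (sqrt($norm_a) * sqrt($norm_b));
}

// Hypothetical query frequencies for two schema elements:
$japanese_toys   = array('pokemon' => 120, 'gundam' => 40, 'totoro plush' => 15);
$collector_items = array('pokemon' => 90, 'gundam' => 25, 'beanie babies' => 30);
print cosine_similarity($japanese_toys, $collector_items);  // high score => match
?>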

I received a few questions after the talk, so I thought I'd put up a quick FAQ:

Q: Doesn't the time period of the clicklog affect your integration quality?

A: Yes, and we consider this a good thing. This allows trend information to come into the system; e.g., "pokemon" queries will start coming in and merge "japanese toys" with "children's collector items". Unpopular items that are not searched for may not generate a mapping, but then again, this may be OK, since the end goal is to integrate searched-for items.

Q: You use clicklogs. I am little old company/website owner X. Since my company's name doesn't start with G, M or Y, I don't have clicklogs. How do I use your method?

A: You already have clicklogs. Let’s say you are trying to merge your company/website X’s data with company Y’s data. Since both you (X) and Y have websites, you both run HTTP servers, which have the facility to log requests. Look through your HTTP server referral logs for strings like:
URL: http://x.com
REFERRER: http://www.google.com/?q=$search_string$

This is your clicklog. The URL http://x.com has the query $search_string$. You can grep both websites' logs to create clicklogs, which can then be used for the integration.
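As a rough sketch, here is one way to mine such a clicklog from an Apache-style "combined" access log; the field positions and the q parameter are assumptions about your particular server and the referring engine:

<?php
// Extract (url, search query) pairs from an Apache combined log.
// Assumes the standard combined format with quoted request and referrer.
$clicklog = array();
foreach (file('access.log') as $line) {
  if (!preg_match('/"(?:GET|POST) ([^ "]+)[^"]*" \d+ \S+ "([^"]*)"/', $line, $m)) {
    continue;
  }
  list(, $url, $referrer) = $m;
  $query_string = parse_url($referrer, PHP_URL_QUERY);
  if (empty($query_string)) continue;
  parse_str($query_string, $params);
  // Google (and many other engines) put the search string in "q";
  // other engines may use a different parameter name.
  if (strpos($referrer, 'google.') !== FALSE && !empty($params['q'])) {
    $clicklog[] = array('url' => $url, 'query' => $params['q']);
  }
}
?>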

Q: My website is not very popular and I don’t have that many clicks from search engines. What do I do?

A: Yup, this is a very real case. Specifically, you might have a lot of queries for some of your items, but not for others. This can be balanced out; see the section in our paper on Surrogate Clicklogs. Basically, you can use a popular website's clicklog as a "surrogate" log for your database. From the paper:

…we propose a method by which we identify surrogate clicklogs for any data source without significant web presence. For each candidate entity in the feed that does not have a significant presence in the clicklogs (i.e. clicklog volume is less than a threshold), we look for an entity in our collection of feeds that is most similar to the candidate, and use its clicklog data to generate a query distribution for the candidate object.
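In sketch form, under my own assumptions (the similarity() helper is hypothetical; the paper details how candidate entities are actually compared):

<?php
// Surrogate clicklogs, sketched: when an entity has too little
// clicklog volume, borrow the query distribution of the most
// similar entity that is well covered.
function query_distribution($entity, $all_entities, $clicklogs, $threshold) {
  if (clicklog_volume($clicklogs, $entity) >= $threshold) {
    return $clicklogs[$entity];  // enough direct evidence
  }
  $best = NULL; $best_score = -1;
  foreach ($all_entities as $other) {
    if (clicklog_volume($clicklogs, $other) < $threshold) continue;
    $score = similarity($entity, $other);  // hypothetical helper
    if ($score > $best_score) {
      $best_score = $score;
      $best = $other;
    }
  }
  return $best === NULL ? array() : $clicklogs[$best];
}

function clicklog_volume($clicklogs, $entity) {
  // Total clicks recorded for this entity, 0 if unseen.
  return isset($clicklogs[$entity]) ? array_sum($clicklogs[$entity]) : 0;
}
?>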

Q: I am an academic and do not have access to a public clicklog, or a public website to get clicklogs from. How do I use this technique?

A: Participate in the Lemur project and get your friends to participate too.


switch

From Greg (http://glinden.blogspot.com/2006/02/motivating-switching-from-google.html):

As Mike points out, one way to get people to switch is to be obviously better. That’s what Google did to Altavista to steal the crown.

No, that's not true. The real reason Google is popular is that Yahoo! switched from Inktomi (and, in a way, AV) to Google as their search engine, complete with "powered by Google" signage. So you had the #1 website on the Interweb telling everyone that Search = Google. Then, when search became important (because the Internet exploded in terms of content, and hence the usefulness of web search), people had no trouble switching from a huge portal to a dedicated search engine they already used.

on the importance of timepass

A friend said this, I somehow think this is a nontrivial sentence:

“I am not looking for quality here. I am looking for passing time.”

Another quote from one of the Brit ads on Virgin Radio, which could be such an apt slogan for a search engine company:

“Why go anywhere else? Search Me!”

yahoo vs google

Yahoo dumps Google. They're powered by their own search engine now, thanks to the acquisitions they made last year. It's amusing to see how the search engine wars have turned out. In the very beginning, we had the following players:

Altavista - One-time leader in web search. Committed hara-kiri by transforming into a portal, among other things.
AllTheWeb - Indexes a LOT of pages. I used it for pages I couldn't find anywhere else. But that's about all I used it for.
Yahoo - Search, mostly dependent on the then-huge human-managed directory. Plus points: search quality, and portal features.

google book search

Our favourite search engine joins the fray. via kottke


[this is cool]

Search Engine Decoder: gives you the skinny on the who's-feeding-who dynamics of the search engine world. via Ev


nutch

I found this new search engine project called Nutch via DaveNet:

Nutch is a nascent effort to implement an open-source web search engine. (It) provides a transparent alternative to commercial web search engines. Only open source search results can be fully trusted to be without bias. (Or at least their bias is public.)

IMHO, this is like finding an RSA encryption algorithm for information retrieval. While many encryption systems base themselves on the premise that their inner algorithms are kept secret, RSA and other publicly described methods are strong because the actual algorithm is tough to crack even when it is public.


oh dang didly doo!

Google buys Pyra! This is major. Ev, Jason and the gang rollerblading in the GooglePlex... hmm.

Everyone seems to be wondering why this happened, as in, what use is a content hosting service to a search company? A lot of people are saying that there's going to be a "Blogs" tab now, beside the News and other tabs. Some people are also wondering if Blogspot sites will get a better pagerank.

But why buy Pyra? Why not just get 20 people to build Blogger in a month? Here's what I think: it's all about the data™. In Brin and Page's paper - The Anatomy of a Large-Scale Hypertextual Web Search Engine - the base assumption was that the WWW is a huge interlinked collection of information. And you know what's incredibly funny? The Blogosphere is just that: a small-scale WWW.

towards a blog~wiki

Nilesh wrote about Wikis, and the need for keyword-based browsing in a weblog. A discussion followed in the comments on that entry, and it seems there are quite a few people who like the Wiki concept but are intimidated by actually using one.

Wikis are really useful for learning, because of the way content is divided into sizeable, highly interlinked chunks of information. Jumping from one chunk to another is a single action: a simple click. Readers who want to "learn" stuff on a weblog do have the search engine at hand, but it's still too complex: yet another text box to fill, and a button to press.
