Technology

Visualizations for Navigation: Experiments on my blog

This is a meta post describing two features on this blog that I don’t think I’ve documented before. Apologies for the navel-gazing; I hope there’s enough useful information here to make it worth reading.

Most folks read my blog through the RSS feed, but those who peruse the web version get to see several navigational aids that help them get around the website. Since the blog runs on Drupal, I get to deploy all sorts of fun stuff. One example is the Similar Entries module, which uses MySQL’s FULLTEXT similarity to show possibly related posts [1]. This lets you jump around the website reading posts similar to each other, which is especially useful for readers who come in from a search engine result page. For example, they may come in looking for Magic Bus for the iPhone, but given that they’re probably iPhone users, they may also be interested in the amusing DIY iPhone Speakers post.

The Timeline Footer

However, given that this blog has amassed about a thousand posts over seven years now, it becomes hard to expose an “overview” of that much information to the reader in a concise manner. Serendipitous browsing can only go so far. Since this is a personal blog, it is interesting to appreciate the chronological aspect of posts. Many blogs have a “calendar archive” to do this, but somehow I find them unappealing; they occupy too much screen space for the amount of information they deliver. My answer to this is a chronological histogram, which shows the frequency of posts over time:

Each bar represents the number of blog posts I posted that month, starting from August 2002 until now [2]. Moving your mouse over each bar tells you which month it is. This visualization presents many interesting bits of information. On a personal note, it clearly represents many stages of my life. June of 2005 was a great month for my blog: it had the highest number of posts, possibly related to the fact that I had just moved to Bangalore, a city with an active blogging community. There are noticeable dips that reflect extended periods of travel and bigger projects.

In the background, this is all done by a simple SELECT COUNT(*) FROM nodes GROUP BY month type query. Because the counts have high variance, some smoothing is applied; for my usage, Height = Log base 4 (frequency) gave me pretty good results. This goes into a PHP block, which is then displayed at the footer of every blog page. The Drupal PHP snippets section is a great place to start to do things like this. Note that the chart is pure HTML / CSS; there is no Javascript involved [3].
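As a rough illustration of the idea (not the actual PHP block), here is a minimal Python sketch. It assumes the post dates are already available as a list; the function names, the `px_per_unit` scale, the CSS class names, and the 1px floor for single-post months are all assumptions made for the example.

```python
import math
from collections import Counter
from datetime import date


def month_range(first, last):
    """Yield every (year, month) from first to last inclusive, so empty months still get a bar."""
    year, month = first
    while (year, month) <= last:
        yield (year, month)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def bar_height(count, px_per_unit=10):
    """Height in pixels: log base 4 of the monthly count, scaled by an assumed px_per_unit."""
    if count == 0:
        return 0
    # log4(1) is 0, so clamp to 1px to keep single-post months visible (an assumption).
    return max(1, round(px_per_unit * math.log(count, 4)))


def render_histogram(post_dates, px_per_unit=10):
    """Return an HTML string with one <span> per month, sized purely through inline CSS."""
    if not post_dates:
        return '<div class="timeline"></div>'
    # Equivalent of SELECT COUNT(*) ... GROUP BY month.
    counts = Counter((d.year, d.month) for d in post_dates)
    bars = []
    for year, month in month_range(min(counts), max(counts)):
        n = counts[(year, month)]
        bars.append(
            f'<span class="bar" title="{year}-{month:02d}: {n} posts" '
            f'style="height:{bar_height(n, px_per_unit)}px"></span>'
        )
    return '<div class="timeline">' + "".join(bars) + "</div>"
```

Calling `render_histogram(dates)` gives one bar per month between the first and last post, with the `title` attribute providing the hover label. The log scale keeps an unusually busy month from dwarfing all the others.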

The Dot Header

Many of my posts are manually categorized using Drupal’s excellent taxonomy system. A traditional solution to this is to create sections, so that the user can easily browse through all my Poems or my nerdy posts. The problem is that this blog contains notes and links to things that I think are “interesting”, a classification that has constantly evolved as my interests have changed over the past decade. Not only is it hard for me to box myself into a fixed set of categories, maintaining the evolution of these categories across 7+ years is not something I want to deal with every day.

This is where tags and automatic term extraction come in. As you can see at the top of the blog main page, each dot is a topic, automatically extracted from all posts on the website. I list the top 60 topics in alphabetical order, where each topic is also a valid taxonomy term. The aesthetics are inspired by the RaphaelJS dots demo, but just like the previous visualization, it is done using pure CSS + HTML. The size and color of each dot are based on the number of items that contain that term. Hovering over a dot gives you its label and count; clicking it takes you to an index of posts with that term. This gives me a concise and maintainable way to tell the user what kinds of things I write about. It also addresses a problem that a lot of my readers have: they either care only about the tech-related posts (click on the biggest purple dot!), or only about the non-tech posts (look for the “poetry” dot in the last row!).

This visualization works by first automatically extracting terms from each post. This is done using the OpenCalais module (I used to use Yahoo’s Term Extractor, but switched since it seems Yahoo!’s extractor is scheduled to be decommissioned soon). The visualization is updated constantly using a cached GROUP BY block similar to the previous visualization, this time grouped on the taxonomy term. This lets me add new posts as often as I like; tags are automatically generated and reflected in the visualization without me having to do anything.
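The counting and sizing step can be sketched roughly as follows in Python. This is only an illustration under assumptions: terms are taken as already extracted (one list per post), the `/tag/...` link pattern, the pixel sizes, and the function names are hypothetical, and color is left out for brevity.

```python
from collections import Counter
from html import escape
from urllib.parse import quote


def top_terms(posts_terms, limit=60):
    """Return the `limit` most used terms as (term, post_count) pairs, in alphabetical order."""
    # set() ensures a term is counted once per post, like grouping on the taxonomy term.
    counts = Counter(term for terms in posts_terms for term in set(terms))
    return sorted(counts.most_common(limit), key=lambda pair: pair[0].lower())


def dot_size(count, max_count, min_px=6, max_px=30):
    """Scale a dot's diameter linearly between min_px and max_px (assumed sizes)."""
    return round(min_px + (max_px - min_px) * count / max_count)


def render_dots(posts_terms, limit=60):
    """Return one <a> per term; the title attribute supplies the hover label and count."""
    terms = top_terms(posts_terms, limit)
    if not terms:
        return ""
    max_count = max(count for _, count in terms)
    dots = []
    for term, count in terms:
        size = dot_size(count, max_count)
        # The /tag/ URL is a placeholder, not the site's real path.
        dots.append(
            f'<a class="dot" href="/tag/{quote(term)}" title="{escape(term)} ({count})" '
            f'style="width:{size}px;height:{size}px"></a>'
        )
    return "".join(dots)
```

Rounding the dots and laying them out in rows is left to a stylesheet, which keeps the generated markup free of script.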

So that’s it, two simple graphical ways to represent content. I know that the two visualizations aren’t the best thing since sliced bread and probably won’t solve World Peace, but it’s an attempt to encourage discoverability of content on the site. Comments are welcome!


Footnotes:

1 I actually created that module (and the CAPTCHA module) over four years ago; they’ve been maintained and overhauled by other good folks since.

2 Arnab’s World is older than that (possibly 1997 — hence the childish name!), but that’s the oldest blog post I could recover.

3 I have nothing against Javascript, it’s just that CSS tends to be easier to manage and usually more responsive. Also, the HTML generated is probably not valid and is SUPER inefficient + ugly. Hopefully I will have time to clean this up sometime in the future.

Inaction

He sat there staring at a blank terminal screen. He tried to remember exactly what it was that he was going to do next.

“Wow, this Twitter and Facebook habit has totally eliminated my ability to concentrate,” he thought.

Instinctively looking at the clock, he was alarmed at what time it was.

“2:31am… Wow, it’s tomorrow now… November 28th. Hmm.”

He smiled at the embarrassing memories. He remembered looking at the lone curled lock of hair that used to hang from the side of her forehead; the ill-fitting skirt; the smile. He remembered having conversations with her and getting distracted by the cuteness of her ever so slightly snubbed nose. He remembered being the new boy.

1997 was a confusing year. A new city, a new school, a new set of friends. The itinerant lifestyle had made it easier to compartmentalize relationships with people. It wasn’t something he preferred. Someone once had quipped that children of IAS officers were successful in life because of their ability to make friends quickly, and he had accepted that as a commiseration.

“A quick login to the social networks, I guess…”

It had become a habit — any empty moment was occupied by “socializing” with a website. At least he had an excuse this time.

“Dear Julie, wish you a very Happy Birthday! Hope “ he wrote.

Backspace.

“Dear Julie, wish you a very Happy Birthday!”

It was 4 years since he’d broken up with her. It was painful but amicable; and they’d both moved on since. They had been great friends once, and they stayed friends since. The breakup left him in a strange place where he wasn’t quite sure exactly how much affection is too much. Especially on a Facebook wall. Better safe than sorry, he guessed.

He met Julie at a party in the first year of college. Common t-shirt colors led to a conversation about what else was in common. Not a lot, just states where they grew up, Zodiac signs and an uncanny interest in Lucky Ali. He liked her from the first time he met her, but he remembered her because of the irony in her birthdate. It was exactly the same as Divya’s.

“...Sharma. Divya Sharma. Roll Number 32” he remembered, revisiting a seldom-visited corner of his memories. Those memories were forgotten for good reason. Unlike college, which was a blast, he didn’t quite like it when he joined Crescent Public.

The new high school was an absurdity. He had never met a more cacophonous bunch of kids before. Maybe this is a culture thing, but he’d much rather go back to his well-behaved alma mater back in Bokaro. And somehow it seemed she knew exactly what he was thinking.

“You’ll get used to it. We’re not all that bad.”

“Well, I…”

“I’m Divya, by the way.”

He quickly found out that she was right. It was loud, but most of the kids were alright. More importantly, he had his first interaction with someone at school, and it was Divya. Amidst all the newness, he desperately needed some sense of familiarity, some sense of closeness. And when he found none, Divya became an easy substitute, even if she was that girl who sat in front of him and sometimes said Hi during break, even if he couldn’t come up with a single word to respond with. Weeks go by quickly when you have a pile of unfamiliar homework and a cute little puppy crush. And then one day Dad walks into the study room.

“Son, we have some good news. Mom mentioned how you were having trouble fitting in at your current school. We talked to the folks at this other school we think you’ll really like. I know it’s 3 weeks into the school year, but they’re willing to let you join.”

Lather Rinse Repeat. New uniform, new school bus, new school anthem that he would have to mumble through pretending to know the words.

The new school turned out to be yet another experience. It was still different from the Jesuit education imparted to him over the last 10 years and 4 schools, but he quickly found himself making a connection with the place. New interests were kindled, new friends were made, life went on.

And yet, the ponderous doodle on his notebook still said “Divya”. With a dot repeatedly penciled in so many times that it made a hole into the next page. It had been 3 months. He had new friends now! November 28th came by; he had astronomy camp at school that night. While everyone lay there on the school ground looking at the stars, he lay there thinking about parallel universes.

“#[Share]#” “Your wall message has been posted.”

“Hmm. I wonder where she is now….” he murmured as he typed in “Divya Sharma” into the search box. “There’s probably a million of them, hope I don’t have to wade through this for hours.”

Five minutes later, he was staring at the profile picture of the Divya Sharma he knew, with the same nose and the very same dangling lock of hair. In her wedding dress, with her new husband.

He smiled and stared at the browser window for a while. He clicked the “Request as friend” button, and began writing an introductory message. For some reason, the words after “Hey! Is this the Divya Sharma from Crescent Public School? Oh, btw, Happy Birthday!”

“Hey! Is this the Divya Sharma from Crescent Public School?”

“Hey!”

He smiled again, canceled the request and closed the browser window.

Some memories were best left untouched.


The Scientist and his legacy

I’m not going to explain this one:

Well, ok here’s an explanation: It’s a picture of astrophysicist Neil deGrasse Tyson and Pluto looking for Pluto the ex-planet. Neil is famous, amongst other things, for being a prominent documenter of Pluto’s (the celestial body) demise as the 9th planet of our solar system.


Larry Ellison on Cloud Computing

Oracle Man Larry Ellison gives his two cents about the new Cloud Computing hype:

“Do you think they run on Water vapor? It’s Databases, and Operating Systems, and Memory, and Microprocessors and the Internet! What are you talking about!”

Also see: the 60 Minutes episode where he pitches the Network Computer.

At the Yahoo! Key Scientific Challenges Graduate Student Summit

I’m at the Yahoo! Graduate Student summit for today and tomorrow. About the event:

On September 3 and 4 the Academic Relations team will host 21 exceptional PhD students at the Key Scientific Challenges Graduate Student Summit. These students are winners of this year’s KSC program, and over the course of the two day summit they will be attending tech talks and workshops, presenting their work, and discussing research trends with top researchers from Yahoo! Labs. These 21 students will also be joined by the program’s past winners and Yahoo! Student Fellows.

Thought I’d share notes:

  • Great spread of grad students in terms of research areas. HCI, Economists, Social Scientists, apart from typical CS people.
  • Presenters for Thursday:

    Welcome & Overview of Yahoo! Labs
    Prabhakar Raghavan, Head, Yahoo! Labs

    Search Technologies Overview
    Andrew Tomkins, Chief Scientist, Yahoo! Search

    Machine Learning & Statistics Research Overview
    Sathiya Keerthi Selvaraj, Senior Research Scientist

    Economics and Social Systems Research Overview
    Elizabeth Churchill, Principal Research Scientist

    Computational Advertising Research Overview
    Andrei Broder, Fellow and VP, Computational Advertising

    Web Information Management Research Overview
    Brian Cooper, Senior Research Scientist

  • Posters for the poster sessions look pretty awesome!

    Vowpal Wabbit now Open Source Project

    I was writing a longer post about VW a few weeks ago but ran out of time, so I’ll just post the initial few paragraphs for now.

    There’s probably a limit to how many times one is allowed to use the word “awesome” in a day — I feel like I’ve hit my quota, but I need to use it just once more before I hit the sack:

    I think it’s awesome that Yahoo! Research lets researchers open source their projects.

    I'm pretty sure John did not make this image

    A few days ago, the amazing John Langford released his fast online learning tool, Vowpal Wabbit, to the world as an open source project. Note the word project. That means all further development will happen out in the wild. A bunch of people have questioned the origin of the name “Vowpal Wabbit”: “What is this undecipherable mess of vowels and consonants!?,” you ask. “That’s how Elmer Fudd would pronounce Vorpal Rabbit,” John answers. “Vorpal? Whatdoesthatmean?!,” you ask again. Which is where I cite the singular font of human knowledge and quote a few lines from Lewis Carroll’s Jabberwocky:

    He took his vorpal sword in hand (and, later,)
    One, two! One, two! And through and through
    The vorpal blade went snicker-snack!
    He left it dead, and with its head
    He went galumphing back.

    If the back story hasn’t made it clear to you yet, let me paraphrase it for you: This stuff is fast. Wicked fast. Like, voodoo fast. How? That’s best left for another post.


    HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

    Just got done with the HAMSTER presentation; here is the paper, and here are my slides:

    I received a few questions after the talk, hence I thought I’d put up a quick FAQ:

    Q: You use clicklogs. I am a little old company/website owner X. Since my company’s name doesn’t start with G, M or Y, I don’t have clicklogs. How do I use your method?

    A: You already have clicklogs. Let’s say you are trying to merge your company/website X’s data with company Y’s data. Since both you (X) and Y have websites, you both run HTTP servers, which have the facility to log requests. Look through your HTTP server referral logs for strings like:
    URL: http://x.com
    REFERRER: http://www.google.com/?q=$search_string$

    This is your clicklog. The url http://x.com has the query $search_string$. You can grep both websites’ logs to create clicklogs, which can then be used for integration.
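A rough Python sketch of this extraction step, under stated assumptions: the server log has already been parsed into (url, referrer) pairs, the search engine passes the query in a `q` parameter, and the `SEARCH_HOSTS` list and function names are made up for the example. Real log formats and engine parameters vary.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Assumed substrings identifying search engine hosts; extend as needed.
SEARCH_HOSTS = ("google.", "bing.", "yahoo.")


def query_from_referrer(referrer, param="q"):
    """Return the normalized search query in a referrer URL, or None if it is not a search click."""
    parsed = urlparse(referrer)
    host = parsed.hostname or ""
    if not any(name in host for name in SEARCH_HOSTS):
        return None
    values = parse_qs(parsed.query).get(param)
    return values[0].strip().lower() if values else None


def build_clicklog(requests):
    """Map each requested URL to a Counter of the search queries that led to it."""
    clicklog = {}
    for url, referrer in requests:
        query = query_from_referrer(referrer)
        if query is not None:
            clicklog.setdefault(url, Counter())[query] += 1
    return clicklog
```

Running `build_clicklog` over the request logs of both X and Y yields two query-to-URL tables of the kind the answer describes, which can then feed the matching step.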

    Q: My website is not very popular and I don’t have that many clicks from search engines. What do I do?

    A: Yup, this is a very real case. Specifically, you might have a lot of queries for some of your items, but not for others. This can be balanced out. See the section in our paper about Surrogate Clicklogs. Basically you can use a popular website’s clicklog as a “surrogate” log for your database.

    Q: Doesn’t the time period of the clicklog affect your integration quality?

    A: Yes. And we consider this a good thing. This allows trend information to come into the system, e.g. “pokemon” queries will start coming in, and merge “japanese toys” with “children’s collector items”. Unpopular items that are not searched for may not generate a mapping, but then again, this may be ok since the end goal was to integrate searched-for items.

    Q: I am an academic and do not have access to a public clicklog, or a public website to get clicklogs from. How do I use this technique?

    A: Participate in the Lemur project and get your friends to participate too.

    Yahoo: Just like the old times

    I’m excited to go to work today, knowing that I will be witness, first hand, to one of the more incredible business deals being announced in the valley: Microsoft powering Yahoo Search.

    There’s a lot that I want to say about this, but for now, I will leave you with this image. This is from when Yahoo! used to be powered by Google. (Many people believe that powering Yahoo was what made Google popular with the mainstream audience, and that Google owes what it is today to Yahoo.)

    An excerpt from Wikipedia:

    In 2002, they bought Inktomi, a “behind the scenes” or OEM search engine provider, whose results are shown on other companies’ websites and powered Yahoo! in its earlier days. In 2003, they purchased Overture Services, Inc., which owned the AlltheWeb and AltaVista search engines.

    AlltheWeb, Altavista, Overture, Inktomi. That’s a lot of heritage.


    Microsoft Research's Data-related Launches

    Microsoft Research has been making a bunch of cool data analysis-related launches at the upcoming Faculty Summit.

    First, there’s the academic release of Dryad and DryadLINQ:

    Dryad is a high-performance, general-purpose, distributed-computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows® operating system. DryadLINQ enables developers to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. The academic release of Dryad and DryadLINQ provides the software necessary to develop DryadLINQ applications and to run them on a Windows HPC Server 2008 cluster. The academic release includes documentation and code samples.

    They also launched Project Trident, a workflow workbench, which is available for download:

    Project Trident: A Scientific Workflow Workbench is a set of tools—based on the Windows Workflow Foundation—for creating and running data analysis workflows. It addresses scientists’ need for a flexible and powerful way to analyze large and diverse datasets, and share their results. Trident Management Studio provides graphical tools for running, managing, and sharing workflows. It manages the Trident Registry, schedules workflow jobs, and monitors local or remote workflow execution. For large data sets, Trident can run multiple workflows in parallel on a Windows HPC Server 2008 cluster. Trident provides a framework to add runtime services and comes with services such as provenance and workflow monitoring. The Trident security model supports users and roles that allows scientists to control access rights to their workflows.

    Then there’s GrayWulf:

    GrayWulf builds on the work of Jim Gray, a Microsoft Research scientist and pioneer in database and transaction processing research. It also pays homage to Beowulf, the original computer cluster developed at NASA using “off-the-shelf” computer hardware.

    Nerds are the new Rock Stars

    We’re seeing a new breed of rock stars these days: Scientists.

    Apparently there is a Night Club for Nerdy People in the Big Apple:

    The crowd is young and hip, mostly in their 20s and 30s, eager to gain entry to tonight’s hot-ticket entertainment event. Once the doors open, about 50 lucky people secure chairs, while another 50 stand four-deep around the room, and another 50 are gently turned away at the door.
    “This is the third time I haven’t made it in,” a disappointed young woman sighs.
    A mixtape of music plays through the speakers and the audience sips drinks from plastic cups while waiting for the featured act to begin. It won’t be the latest indie band, or an up-and-coming comedian. This is not the typical New York club scene. This is the monthly meeting of the Secret Science Club.

    Then there’s DorkBot, which has branches everywhere:

    the main goals of dorkbot are: to create an informal, friendly environment in which people can talk, [...] to give us all an opportunity to see the strange things our neighbors are doing with electricity.

    Meanwhile, in Cambridge, Massachusetts, “Dr. Evil” and the “Mexican Multiplier” have dueled it out till the very end, in an attempt to write the largest number on a chalkboard.

    Finally, here’s an awesome ad from Intel’s amazing marketing team: