Thoughts on Scribe

As someone who works on autocompletion, this week has been a good one. Google launched two products relevant to my research; the first was Google Scribe, a Labs experiment that uses Web n-grams to assist in sentence construction. This system solves the same problem addressed in my VLDB’07 paper, “Effective Phrase Prediction” (paper, slides). The paper proposes a data structure called FussyTree to efficiently serve phrase suggestions, and provides a metric I called the “Total Profit Metric” (TPM) to evaluate phrase prediction systems. Google Scribe looks quite promising, and I thought I’d share my observations.
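For readers unfamiliar with the paper, here is a heavily simplified sketch of the idea behind FussyTree: a word-level trie over phrase suffixes that serves the most frequent continuation. (The real structure adds significance-based pruning and careful node marking; treat this as an illustration, not the paper’s algorithm.)

```python
class PhraseTrie:
    """Simplified word-level trie for phrase completion.

    Hypothetical sketch: counts phrase frequencies and greedily
    suggests the most frequent continuation of a prefix.
    """
    def __init__(self):
        self.children = {}
        self.count = 0

    def insert(self, words):
        node = self
        for w in words:
            node = node.children.setdefault(w, PhraseTrie())
            node.count += 1

    def suggest(self, prefix, max_len=3):
        node = self
        for w in prefix:
            if w not in node.children:
                return []
            node = node.children[w]
        # Greedily follow the most frequent child at each step.
        completion = []
        while node.children and len(completion) < max_len:
            w, node = max(node.children.items(), key=lambda kv: kv[1].count)
            completion.append(w)
        return completion

trie = PhraseTrie()
for phrase in ["please find attached the document",
               "please find attached the file",
               "please let me know"]:
    words = phrase.split()
    for i in range(len(words)):   # index every suffix, not just full phrases
        trie.insert(words[i:])

print(trie.suggest(["find", "attached"]))
```

Indexing all suffixes is what lets the structure answer mid-sentence queries, at the cost of a larger trie, which is exactly where the pruning in the real FussyTree earns its keep.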

To simplify writing, let’s quickly define the problem using a slide from the slide deck:

Query Time:
Latency while typing is quite impressive. There is no evidence of speculative caching (à la Google Instant), but interaction is fairly fluid, despite the fact that an HTTP GET is sent to a Google Frontend Server on every keystroke. I’m a little surprised that there isn’t a latency check (or, if there is one, its threshold is too low) — GET requests are made even when I’m typing too fast for the UI to keep up, rendering many of the results useless even before the server has responded to them.
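The latency check I have in mind can be as simple as a leading-edge debounce: send the GET only if some quiet interval has passed since the previous keystroke. A minimal sketch (the 150 ms threshold is my invention):

```python
import time

class Debouncer:
    """Suppress requests while the user is typing faster than the UI
    can usefully display results. 150 ms is a hypothetical threshold."""
    def __init__(self, quiet_ms=150):
        self.quiet_s = quiet_ms / 1000.0
        self.last_keystroke = 0.0

    def on_keystroke(self):
        now = time.monotonic()
        fire = (now - self.last_keystroke) >= self.quiet_s
        self.last_keystroke = now
        return fire  # True: send the GET; False: skip this keystroke

d = Debouncer(quiet_ms=150)
print(d.on_keystroke())  # first keystroke after a long pause → True
print(d.on_keystroke())  # immediate second keystroke → False
```

A trailing-edge variant (fire once the user pauses) would save even more requests, at the cost of slightly later suggestions.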

Length of Completion:
My experience with Google Scribe is that the length of completion is quite small; I was expecting it to produce large completions as I gave it more data, but I couldn’t get it to suggest beyond three words.

Length of Prefix+Context:
It looks like the length of the prefix/context (context being the text before the prefix, used to bias completions) is 40 characters, with no special treatment of word endings. At every keystroke, the previous 40 characters are sent to the server, with completions coming back in return. So as I typed a sentence, this is what the requests looked like:

this is a forty character sentence and i
his is a forty character sentence and it
is is a forty character sentence and it
s is a forty character sentence and it i
_(and so on)_

I’m not sure what the benefit of sending requests for partial words is. It’s hard to discern the prefix from the context by inspection, but the prefix seems to be quite small (2–3 words), which sounds right.
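The sliding window above is easy to model: on every keystroke, take the last 40 characters of whatever has been typed so far. A sketch reproducing the requests shown:

```python
def context_window(text, size=40):
    """Return the last `size` characters of what the user has typed —
    my guess at the payload Scribe sends on each keystroke."""
    return text[-size:]

typed = ""
for ch in "this is a forty character sentence and it":
    typed += ch
    payload = context_window(typed)   # one request per keystroke

print(len(payload), repr(payload))
```

Note that the window is cut mid-word with no snapping to a word boundary, which matches the partial-word requests observed above.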

Prediction Confidence:
Google Scribe always displays a list of completions. This isn’t ideal, since it’s often making arbitrary low-confidence predictions. This makes sense from a demo perspective, but since there is a distraction cost associated with the completions, it would be valuable to show completions only when they are high-confidence. Confidence can either be calculated using TPM or learned from usage data (which I hope Scribe is collecting!).
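A minimal sketch of what confidence gating could look like (the threshold and the candidate scores are made up for illustration):

```python
def should_show(candidates, min_confidence=0.6):
    """Surface a completion only when the model is confident enough.
    0.6 is a hypothetical threshold; in practice it would be tuned
    with TPM or learned from click-through data."""
    best, conf = max(candidates, key=lambda c: c[1])
    return best if conf >= min_confidence else None

print(should_show([("of their respective owners", 0.82),
                   ("and", 0.11)]))               # confident → shown
print(should_show([("the", 0.21), ("a", 0.19)]))  # low confidence → None
```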

Prediction Quality:
People playing with Scribe have produced sentences such as “hell yea it is a good idea to have a look at the new version of the Macromedia Flash Player to view this video” and “Designated trademarks and brands are the property of their respective owners and are”. I find these sentences interesting because they are both very topical; i.e., they look more like outliers from counting boilerplate text on webpages than the “generic” sentences you’d find in, say, an email. To produce more “generic” completions, one solution is to cluster the corpus into multiple topic domains and ensure that a completion is popular in more than one isolated domain.
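One way to implement that filter, as a sketch (the domain counts and the two-domain threshold are made up):

```python
def generic_enough(phrase_counts_by_domain, min_domains=2):
    """Assumed filter: keep a completion only if it is popular in at
    least `min_domains` topic clusters, so that boilerplate occurring
    only on, say, video-player pages gets dropped."""
    popular_in = sum(1 for c in phrase_counts_by_domain.values() if c > 0)
    return popular_in >= min_domains

# Boilerplate phrase: huge count, but only in one topic cluster.
print(generic_enough({"video": 9120, "email": 0, "news": 0}))
# Generic phrase: modest counts spread across clusters.
print(generic_enough({"video": 40, "email": 55, "news": 12}))
```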

I was also interested in knowing, “How many keystrokes will this save?”. To measure this, we can use TPM. In these two slides, I describe the TPM metric with an example calculation:
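As a rough stand-in for the slides, here is the flavor of the calculation under a simplified reading of TPM (keystrokes saved by accepted suggestions, minus a per-suggestion distraction penalty d, normalized by document length; the precise formula in the paper differs in its details):

```python
def tpm(events, d=0.5, total_chars=100):
    """Simplified Total Profit Metric sketch (assumed form).

    events: list of (accepted, chars_saved) pairs, one per suggestion
    shown. Every suggestion shown costs a distraction penalty d;
    accepted ones earn back the characters they saved.
    """
    profit = sum(saved - d for accepted, saved in events if accepted)
    return profit / total_chars

# 5 suggestions shown over a 100-character document;
# two were accepted, saving 12 and 8 characters.
events = [(True, 12), (True, 8), (True, 0), (True, 0), (True, 0)]
print(tpm(events, d=0.5))  # (12 + 8 - 5*0.5) / 100 = 0.175
```

The d parameter is what lets the metric punish systems that pop up a suggestion on every word, which is exactly the Scribe behavior criticized above.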

While it would be nice to see a comparison of the FussyTree method vs Google Scribe in terms of Precision, Recall and TPM, constructing such an experiment is hard, since training FussyTree over web-sized corpora would require some significant instrumentation. Based on a few minutes of playing with it, I think Scribe will outperform the FussyTree method in Recall due to the small window size — i.e. it will produce small suggestions that are often correct. However, if we take into account the distraction factor from the suggestion itself, then Scribe in its current form will do poorly, since it pulls up a suggestion for every word. This can be fixed by making longer suggestions, and considering prediction confidence.

Overall, I am really glad that systems like these are making it into the mainstream. The more exposure these systems get, the better and more accurate they can become, saving us time and letting us interact with computers more effectively!


Google Search's Speed-based Ranking, Baking and Frying

I am looking for confirmation and corroborating details from other Drupal developers. Comments are welcome here. PHBs need not worry: your Drupal site is just fine.

This post is about an inherent problem with Google’s recently announced “Speed-as-a-ranking-feature” and its interaction with content-management systems like Drupal and WordPress. For an auto-generated website, Google is often the first and only visitor to a lot of pages. Since Drupal spends a lot of time on the first render of a page, Google will likely see this delay. This is due both to a problem with how Drupal generates pages and to Google’s metric.

Google recently announced that, as part of its quest to make the web a faster place, it will penalize slow websites in its ranking:

today we’re including a new signal in our search ranking algorithms: site speed. Site speed reflects how quickly a website responds to web requests.

Since Google’s nice enough to provide webmaster tools, I looked up how my site was doing, and got this disappointing set of numbers:

(Screenshot: Webmaster Tools site speed numbers, 2010-04-11.)

I’m aware 3 seconds is too long. Other Drupal folks have reported ~600 ms averages. My current site does under 1 second on average, based on my measurements; the gap is probably because I occasionally have funky experiments running in parts of the site that issue expensive queries. Still, some of the other results were surprising:

Investigating further, it looks like there are 3 problems:

(Screenshot: the three reported problems, 2010-04-11.)

DNS issues & Multiple CSS: Since Google Analytics is on a large number of websites, I’m expecting its DNS entry to be prefetched. CSS is not an issue, since the two files are client media-specific (print/screen).

GZip Compression: Now this is very odd. I’m pretty sure I have gzip compression enabled in Drupal (Admin > Performance > Compression), so why is Google reporting a lack of compression? To check, I ran some tests and discovered that since Google usually sees a page before it’s cached, it gets a non-gzipped version. This happens because of the way Drupal’s cache behaves, and it is fixable. Ordinarily this is a small problem, since uncached pages are rendered only for the first visitor. But since Google is the first visitor to a majority of the pages on a less popular site, it concludes that the entire site is uncompressed. I’ve started a bug report for the uncached-page gzip problem.
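To make the symptom concrete, here is a toy model of the behavior I believe I’m seeing (this is my reading of it, not Drupal’s actual code path): the cached copy is served gzipped, but a cache miss renders the page and returns it uncompressed.

```python
import gzip

def drupal_like_response(page_cached):
    """Toy model (an assumption, not Drupal's code): cached pages are
    served pre-gzipped; a cache miss returns the page uncompressed."""
    html = b"<html>...</html>"
    if page_cached:
        return {"Content-Encoding": "gzip"}, gzip.compress(html)
    return {}, html  # first visitor (often Googlebot) gets this

headers, body = drupal_like_response(page_cached=False)
print("gzipped?", headers.get("Content-Encoding") == "gzip")
```

The fix is simply to compress on the miss path too, before the rendered page is handed to the client and written to cache.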

A flawed metric: The other problem is that Drupal (and WordPress, etc.) use a fry model: pages are generated on the fly, per request. Movable Type and others, by contrast, bake their pages beforehand, so anything served up doesn’t go through the CMS. Caching in fry-based systems is typically done on first render: the first visit to a page is generated from scratch and written to the database/filesystem, and any subsequent visitor to that page is served from the cache.
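The first-render caching pattern looks roughly like this (a generic sketch, not Drupal’s actual page-cache API):

```python
cache = {}
renders = []

def render_page(path):
    renders.append(path)          # expensive: full CMS bootstrap + queries
    return f"<html>{path}</html>"

def serve(path):
    """Fry-model caching: the first visitor pays for the render (and
    the cache write); everyone after reads the cached copy."""
    if path not in cache:
        cache[path] = render_page(path)   # first visit: render + write
    return cache[path]

serve("/node/1")     # Googlebot: slow path, triggers the render
serve("/node/1")     # second visitor: fast path, cache hit
print(len(renders))  # 1 render despite 2 requests
```

The problem described below follows directly: on a small site the crawler takes the slow path almost every time, so it only ever measures the first line of `serve`.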

Since the Googlebot is usually the first (and only) visitor to many pages on a small site, the average crawl hits a large number of pages where Drupal is writing to the cache for the next visitor. This means every page Googlebot visits costs a write to the database. And while, as far as I know, Drupal runs page_set_cache after rendering the entire page (so the user experience stays snappy), I’m assuming Google counts time to connection close rather than to the closing </html> tag, resulting in a bad rendering-time evaluation.

This means that Google’s Site Speed is not representative of the average user (i.e., the second, third, fourth visitors, who read from the cache); it only represents the absolute worst case for the website, which is hardly a fair metric. (Note that this is based on my speculation about what Site Speed means, given the existing documentation.)

Web 2.0 and the relational database

Yes, this is yet another rant about how people incorrectly dismiss state-of-the-art databases. (Famous people have done it; why shouldn’t I?) It’s amazing how much the Web 2.0 crowd abhors relational databases. Some people have declared real SQL-based databases dead, while others have proclaimed them not cool any more. Amazon’s SimpleDB, Google’s BigTable and Apache’s CouchDB are trendy, bloggable ideas that, to be honest, are ideal for very specific, specialized scenarios. Most of the other use cases (and that comprises 95 out of 100 web startups) can do just fine with a memcached + Postgres setup, but there seems to be a constant attitude of “nooooo, if we don’t write our code like Google they will never buy us…!” that just doesn’t go away, spreading like a malignant cancer through the web development community. The constant arguments are “scaling to thousands of machines” and “machines are cheap”. What about the argument “I just spent an entire day implementing the equivalent of a join and group by using my glorified key-value-pair library”? And what about the mantra “smaller code that does more”?
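To make the “glorified key-value-pair library” complaint concrete, here is what hand-rolling a one-line SQL JOIN + GROUP BY looks like in application code (toy data, with plain dicts standing in for the KV client):

```python
from collections import defaultdict

# Toy key-value data: users keyed by id, orders as opaque records.
users = {1: {"name": "ann"}, 2: {"name": "bob"}}
orders = [{"user_id": 1, "total": 30},
          {"user_id": 1, "total": 20},
          {"user_id": 2, "total": 5}]

# Hand-rolled equivalent of:
#   SELECT u.name, SUM(o.total) FROM users u
#   JOIN orders o ON o.user_id = u.id GROUP BY u.name;
totals = defaultdict(int)
for o in orders:                        # the "join": a lookup per row
    user = users[o["user_id"]]
    totals[user["name"]] += o["total"]  # the "group by": manual aggregation

print(dict(totals))  # → {'ann': 50, 'bob': 5}
```

And this is the easy case: add an ORDER BY, a HAVING clause, or a second join, and the hand-rolled version grows while the SQL stays one statement the planner can optimize.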

Jon Holland (who shares his name with the father of genetic algorithms) performs a simple analysis which points out a probable cause: People are just too stupid to properly use declarative query languages, and hence would rather roll their own reinvention of the data management wheel, congratulating themselves on having solved the “scaling” problem because their code is ten times simpler. It’s also a hundred times less useful, but that fact is quickly shoved under the rug.

It’s not that all Web-related / Open Source code is terrible. If you look at Drupal code, you’ll notice the amount of sane coding that goes on inside the system: JOINs used where needed, caching / throttling assumed as part of core, and a schema flexible enough to do fun stuff. (Not to say I don’t have a bone to pick with Drupal core devs; the whole “views” and “workflow” ideas are soon going to snowball into a reinvention of Postgres’s ADTs, all written in PHP on top of a database-abstracted Postgres setup.)

If Drupal can do this, why can’t everyone else? Dear Web 2.0, I have a humble request. Pick up the Cow book if you have access to a library, or attend a database course in your school. I don’t care if you use an RDBMS after that, but at least you’ll reinvent the whole thing in a proper way.

Implementing Caching in Drupal

The idea is to use memcached in place of Drupal’s MySQL based cache to make things more efficient.

Supposedly Linuxjournal already uses it, and Darix already has a modified version for this.

Another post details extra memcached parameters, for Ragga-jungle.

The problem area is caching $node, which is more complicated than $user, so we need to see how to take care of this.
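As a starting point, here is a generic sketch of the read path, with a dict standing in for the memcached client (the function names and TTL are my placeholders, not the Drupal API):

```python
import time

store = {}  # stand-in for a memcached client (get/set with expiry)

def cache_set(key, value, ttl=300):
    store[key] = (value, time.monotonic() + ttl)

def cache_get(key):
    hit = store.get(key)
    if hit and hit[1] > time.monotonic():
        return hit[0]
    return None

def node_load(nid, load_from_db):
    """Hypothetical $node caching: unlike $user, a node aggregates
    comments, taxonomy, etc., so the cache must be invalidated when
    any of those change — this sketches only the read path."""
    node = cache_get(f"node:{nid}")
    if node is None:
        node = load_from_db(nid)
        cache_set(f"node:{nid}", node)
    return node

calls = []
def load_from_db(nid):
    calls.append(nid)
    return {"nid": nid, "title": "hello"}

node_load(1, load_from_db)
node_load(1, load_from_db)
print(len(calls))  # → 1 database load for 2 requests
```

The invalidation side is the hard part hinted at above: every comment post, taxonomy edit, or node update has to expire the right keys, which is where $node gets messier than $user.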