Google Search's Speed-based Ranking, Baking and Frying

I am looking for confirmation and corroborating details from other Drupal developers. Comments are welcome here. PHBs need not worry: your Drupal site is just fine.

This post is about an inherent problem with Google’s recently announced “Speed-as-a-ranking-feature” and how it interacts with content-management systems like Drupal and Wordpress. For an auto-generated website, Google is often the first and only visitor to a lot of pages. Since Drupal spends a lot of time in the first render of a page, Google will likely see this delay. This is due both to how Drupal generates pages and to Google’s metric.

Google recently announced that, as a part of its quest to make the web a faster place, it will penalize slow websites in its ranking:

today we’re including a new signal in our search ranking algorithms: site speed. Site speed reflects how quickly a website responds to web requests.

Since Google’s nice enough to provide webmaster tools, I looked up how my site was doing, and got this disappointing set of numbers:

[Screenshot, 2010-04-11: Webmaster Tools site speed numbers]

I’m aware 3 seconds is too long. Other Drupal folks have reported ~600ms averages. My current site takes under 1 second on average, based on my measurements. The gap is probably because I occasionally have some funky experiments going on in some parts of the site that run expensive queries. Still, some other results were surprising:

Investigating further, it looks like there are 3 problems:

[Screenshot, 2010-04-11: reported performance issues]

DNS issues & Multiple CSS: Google Analytics is on a large number of websites, so I’m expecting their DNS to be prefetched. CSS is not an issue since the 2 files are client-media specific (print / screen).

GZip Compression: Now this is very odd. I’m pretty sure I have gzip compression enabled in Drupal (Admin > Performance > Compression). Why is Google reporting lack of compression? To check, I ran some tests, and discovered that since Google usually sees the page before it’s cached, it’s getting a non-gzipped version. This happens due to the way Drupal’s cache behaves, and is fixable. Ordinarily, this is a small problem, since uncached pages are rendered for only the first visitor. But since Google is the first visitor to a majority of the pages in a less popular site, it thinks the entire site is uncompressed. I’ve started a bug report for the uncached page gzip problem.
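For anyone who wants to repeat the check, here is a minimal sketch of one way to do it with curl, not necessarily the test I ran. The function names `is_gzipped` and `fetch_headers` are made up for illustration, and example.com is a placeholder for a page on your own site.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the HTTP response headers read from
# stdin declare a gzip Content-Encoding.
is_gzipped() {
  # Strip carriage returns, then look for the header case-insensitively.
  tr -d '\r' | grep -qi '^content-encoding:.*gzip'
}

# Hypothetical helper: print only the response headers for a URL while
# advertising gzip support. It issues a normal GET and discards the body,
# so the server goes through its usual page-rendering path.
fetch_headers() {
  curl --silent --output /dev/null --dump-header - \
       --header 'Accept-Encoding: gzip' "$1"
}

# Usage (placeholder URL):
#   fetch_headers http://example.com/node/1 | is_gzipped && echo gzip || echo "no gzip"
```

Running the usage line twice against a page that has not been visited since the cache was last cleared should show whether the first (uncached) response differs from the second (cached) one.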

A flawed metric: The other problem is that Drupal (and Wordpress etc.) use a fry model: pages are generated on the fly per request. On the other hand, Movable Type and similar systems bake their pages beforehand, so anything served up doesn’t go through the CMS. Caching in fry-based systems is typically done on the first render, i.e. the first visit to a page is generated from scratch and written to the database/filesystem; any successive visitor to that page will see a render from the cache.

Since the Googlebot is usually the first (and only) visitor to many pages in a small site, the average crawl would hit a large number of pages where Drupal is writing things to cache for the next visitor. This means every page Googlebot visits costs a write to the database. While afaik Drupal runs page_set_cache after rendering the entire page and hence the user experience is snappy, I’m assuming Google counts time to connection close and not the closing </html> tag, resulting in a bad rendering time evaluation.

This means that Google’s Site Speed is not representative of the average user (i.e. the second, third, fourth etc. visitors, who read from the cache); it only represents the absolute worst-case situation for the website, which is hardly a fair metric. (Note that this is based on my speculation of what Site Speed means, based on the existing documentation.)
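A rough way to see the first-visitor penalty on your own site is to time two back-to-back requests for a page that is not yet in the cache. This is only a sketch under that assumption: `time_request` and `ms_diff` are hypothetical helper names, the URL is a placeholder, and curl’s `time_total` measures the whole transfer, which may or may not match what Google measures.

```shell
#!/bin/bash
# Hypothetical helper: print the total time, in seconds, of one GET request.
time_request() {
  curl --silent --output /dev/null --write-out '%{time_total}\n' "$1"
}

# Hypothetical helper: difference between two timings (seconds),
# printed as a rounded number of milliseconds.
ms_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a - b) * 1000 }'
}

# Usage (placeholder URL, on a page not visited since the cache was cleared):
#   cold=$(time_request http://example.com/node/1)   # first visit: render + cache write
#   warm=$(time_request http://example.com/node/1)   # second visit: served from cache
#   echo "first visit cost an extra $(ms_diff "$cold" "$warm") ms"
```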

What other people have to say:

On the plus side, maybe this means that TechCrunch will disappear from the Google SERPs altogether!

I use this cron script to pre-load the cache to solve this problem.
#!/bin/bash
# Pre-load the Drupal page cache by running cron and then crawling each site.
cd /home/guts/bin/no-delete-this-crawling-tmp/ || exit 1

delay=1
tmp="downloads"
sites="example.com test.example.com"
log="log.txt"

# $sites is deliberately unquoted so it splits into one host per iteration
for site in $sites
do

# run cron
/usr/bin/wget \
--output-document=- \
--quiet \
--tries=1 \
"http://$site/cron.php"
sleep "$delay"

# crawl the site to juice cache, but do it slowly to not overload the server
/usr/bin/wget \
--recursive \
--wait="$delay" \
--domains="$site" \
--level=inf \
--directory-prefix="$tmp" \
--force-directories \
--delete-after \
--output-file="$log" \
--no-verbose \
"http://$site/"

# When the crawl is done, the download files are removed.
rm -rf "$tmp"

done

exit

Interesting stuff. One question. Why do you assume that Googlebot is always the first user to a page? Don’t they constantly re-crawl all the pages, and maybe they use an average or some other weighting to determine the page’s true access speed?

Doesn’t the Boost module cover this with its own cron-based crawler? Just curious to see what other people’s thoughts are.

I’m assuming Google counts time to connection close and not the closing </html> tag, resulting in a bad rendering time evaluation.

Actually, it’s not looking at the time to retrieve the HTML document; it’s looking at total render time. So the time that it gives is the time till window.ready, I believe.

Yes, boost does have a multi-threaded crawler. You can also use a recursive wget call if you don’t want to use boost. Something like this:

wget -r -nd -l20 --delete-after http://www.example.com/

Yes, it’s the window.ready time. Thus I don’t think Drupal improvements help much. A few hundred ms on the server side don’t count as much as a banner.

Google says fewer than 1% of search results are affected. I take that to mean the slowest 1% of sites are penalized. 3 sec is okay; my site is at 10 sec (slower than 95% of others, 1000 point test). I have 18 sec for the most visited page (about 2000 pageviews/day and well cached by boost) because it has too many comments. So server-side optimization is nearly useless. I’ve just switched to a smaller comment pagination and will wait…

@Hung now if it only weren’t linked to by thousands of twitter bots (humans or otherwise).

@Pradeep: Bots read many pages humans never search for, like my 2003 post about college life etc. Crawlers tend to crawl new pages aggressively (based on new links found etc.) and recrawl infrequently, on the order of once every few months. Caches may expire before that (due to LRU cache size, cache pruning policy etc.).

John, mikeytown2 and @Jamie, thanks for the pointers to the boost module and the recursive wget idea. Precaching is a good idea and maybe this should go into Drupal core, i.e. trigger precaching for all pages created due to any content change (Movable Type used to do that). However, with automatic metadata (Calais) and vocabulary tags / terms in Drupal, even a small site will have a huge number of facets to view the same content, and hence could have hundreds of thousands of “pages”, which may be undesirable for webhosts with small databases.

dalin and jcisio: According to SearchEngineLand, their metric depends on 2 factors:
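One way to keep precaching bounded on a site with many facets is to warm only an explicit list of URLs (say, the pages touched by a recent content change) instead of crawling everything. The sketch below assumes you can produce such a list yourself. `warm_urls`, `FETCH_CMD` and `DELAY` are names I made up for illustration; nothing here is part of Drupal or boost.

```shell
#!/bin/bash
# Hypothetical helper: read URLs from stdin, one per line, and fetch each
# once so the CMS renders the page and writes it to its cache.
# DELAY (seconds between requests) and FETCH_CMD (the fetch command) are
# optional overrides; both names are illustrative.
warm_urls() {
  local delay="${DELAY:-1}"
  local url
  while IFS= read -r url; do
    # Skip blank lines and comment lines.
    case "$url" in
      ''|'#'*) continue ;;
    esac
    # FETCH_CMD is intentionally unquoted so it splits into command and flags.
    # stdin is redirected so the fetch command cannot consume the URL list.
    ${FETCH_CMD:-wget --quiet --tries=1 --output-document=/dev/null} "$url" < /dev/null
    sleep "$delay"
  done
}

# Usage (changed-urls.txt is a placeholder file you would generate yourself):
#   warm_urls < changed-urls.txt
```

The delay keeps the warm-up from overloading a small server, for the same reason the cron script above crawls slowly.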

1. How a page responds to Googlebot
2. Load time as measured by the Google Toolbar

I don’t expect Googlebot to do a full page render for all my pages. There are some anecdotal reports of Googlebot-side JS execution for a fraction of pages, but not a full render.
For the second, load time assumes your site is frequented by people with Google Toolbar, and that number is too low for my site to get any reliable numbers (which they admit in their Webmaster Tools report).
Regardless, it turns out that database writes (cache population) are slow enough (~100s of ms) on my server to be considered significant compared to render time (~1000ms).

@jcisio: Since your render time is >10s, I can see why server-side optimization doesn’t matter in your case. Also, given that you have precaching already implemented, you’ve already performed the optimization we’re worrying about! :) Would love to see what your server-side numbers look like without boost.

About the author:

Arnab Nandi is an Assistant Professor in the Department of Computer Science and Engineering at The Ohio State University. You can read more about him here.

