Web 2.0 and the relational database

Yes, this is yet another rant about how people incorrectly dismiss state-of-the-art databases. (Famous people have done it, why shouldn’t I?) It’s amazing how much the Web 2.0 crowd abhors relational databases. Some people have declared real SQL-based databases dead, while others have proclaimed them to be not cool any more. Amazon’s SimpleDB, Google’s BigTable and Apache’s CouchDB are trendy, bloggable ideas that, to be honest, are ideal only for very specific, specialized scenarios. Most of the other use cases, and that comprises 95 out of 100 web startups, can do just fine with a memcached + Postgres setup, but there seems to be a constant attitude of “nooooo if we don’t write our code like google they will never buy us…!” that just doesn’t seem to go away, spreading like a malignant cancer throughout the web development community. The constant arguments are “scaling to thousands of machines” and “machines are cheap”. What about the argument “I just spent an entire day implementing the equivalent of a join and group by using my glorified key-value-pair library”? And what about the mantra “smaller code that does more”?
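To make the “join and group by” complaint concrete, here is a minimal sketch in Python. The `users` and `posts` tables, their contents and both function names are made up for illustration, and plain dicts stand in for the key-value library; this is not anybody's real schema or API. It counts posts per user twice: once with a single declarative query through the standard `sqlite3` module, and once by hand.

```python
import sqlite3
from collections import defaultdict

# Hypothetical data: which user wrote which post.
USERS = [(1, "alice"), (2, "bob"), (3, "carol")]
POSTS = [(10, 1), (11, 1), (12, 2)]  # (post_id, user_id)


def posts_per_user_sql():
    """Declarative version: one JOIN plus GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", USERS)
    conn.executemany("INSERT INTO posts VALUES (?, ?)", POSTS)
    rows = conn.execute(
        """
        SELECT u.name, COUNT(*) AS n
        FROM users u JOIN posts p ON p.user_id = u.id
        GROUP BY u.name
        ORDER BY u.name
        """
    ).fetchall()
    conn.close()
    return rows


def posts_per_user_kv():
    """Hand-rolled version over dicts standing in for a key-value store."""
    users = dict(USERS)  # user_id -> name
    posts = dict(POSTS)  # post_id -> user_id
    counts = defaultdict(int)
    for user_id in posts.values():  # the "join": look up each post's author
        name = users.get(user_id)
        if name is not None:
            counts[name] += 1  # the "group by": bucket and count by name
    return sorted(counts.items())


if __name__ == "__main__":
    print(posts_per_user_sql())
    print(posts_per_user_kv())
```

Both functions return the same rows for this toy data. The difference shows up when the question changes: the SQL version needs a different query string, while the hand-rolled version needs a different loop, and probably a hand-built index once the data stops fitting in one dict.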

Jon Holland (who shares his name with the father of genetic algorithms) performs a simple analysis which points out a probable cause: People are just too stupid to properly use declarative query languages, and hence would rather roll their own reinvention of the data management wheel, congratulating themselves on having solved the “scaling” problem because their code is ten times simpler. It’s also a hundred times less useful, but that fact is quickly shoved under the rug.

It’s not that all Web-related / Open Source code is terrible. If you look at Drupal code, you’ll notice the amount of sane coding that goes on inside the system. JOINs are used where needed, caching / throttling is assumed as part of core, and the schema allows for flexibility to do fun stuff. (Not to say I don’t have a bone to pick with Drupal core devs; the whole “views” and “workflow” ideas are soon going to snowball into the reinvention of Postgres’s ADTs, all written in PHP running on top of a database-layer-abstracted Postgres setup.)

If Drupal can do this, why can’t everyone else? Dear Web 2.0, I have a humble request. Pick up the Cow book if you have access to a library, or attend a database course in your school. I don’t care if you use an RDBMS after that, but at least you’ll reinvent the whole thing in a proper way.

When will they ever learn...

Facebook just launched a new Google Trends-esque toy called Lexicon:

Today we’re announcing the launch of Facebook Lexicon, a tool where you can see the buzz surrounding different words and phrases on Facebook Walls. Lexicon pulls from the wealth of data on Facebook without collecting any personal information in order to respect everyone’s privacy.

Basically they look at what everyone types on their walls, and then report popularity across time. Here’s a graph of the phrases “party tonight” vs “hangover”. This is probably the funniest phase shift I have seen in 2 dimensions.

how many computers does google have?

One of the first things I did outside of work at Google was to find out how many computers the company has. It’s a fairly secret number; it’s not quite a topic that people in the Googz like to talk about.

It took me a week to piece together the answer, and a few months to come to terms with my discovery. It’s hard to talk to people outside of the big G about the kind of stuff they pull off there, and I’m not talking about making ball pits out of directors’ offices.

I can finally talk about this, now that this information is explicitly public, published in an article by MapReduce Gods Jeff Dean and Sanjay Ghemawat (bloggy synopsis here). In the paper, they talk of 11,081 machine years of computation used in Sept 2007 alone, for a subset of their MapReduce work. That’s 132,972 machine months of CPU used in one month. Assuming all the computers were running at 100% capacity, without failure, without any break for the entire month, that’s roughly a hundred and thirty-three thousand machines’ worth of computing used in September Oh Seven.

In other words, Google has about one hundred and thirty-three thousand computers that are reported here.
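The back-of-the-envelope conversion can be checked in a few lines of Python. Only the 11,081 machine-years figure comes from the paper. The second estimate is my own refinement, and its 365-day year and 30-day September are assumptions, not anything Google published.

```python
MACHINE_YEARS = 11_081  # figure quoted from the paper for Sept 2007

# Simple conversion used in the post: 12 machine-months per machine-year.
# Work done in a single month, so this is also the machine count.
machine_months = MACHINE_YEARS * 12

# Slightly finer estimate (an assumption, not from the paper): a 365-day
# year and a 30-day September, rounded to whole machines.
machines_30_day = round(MACHINE_YEARS * 365 / 30)

print(machine_months)   # 132972
print(machines_30_day)  # 134819
```

Either way the answer lands around 133,000 to 135,000 machines running flat out, and that is a lower bound, since real machines are never 100% utilized.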

But does that account for ALL the computers at Google?

To find out, go ask a Google employee to violate his NDA today!

for your information, this may not be the right number. it should be obvious why. for example, they never said anything about not using hamsters. hamsters are 10x faster than computers, which would mean they could just have 10,000 hamsters and it would be fine.

startup idea #4984

Here’s an idea I thought of a while ago. You have the Storm botnet, which is apparently now capable of being the world’s most powerful supercomputer:

The Storm botnet, or Storm worm botnet, is a massive network of computers linked by the Storm worm Trojan horse in a botnet, a group of “zombie” computers controlled remotely. It is estimated to run on as many as 1,000,000 to 50,000,000 infected and compromised computer systems as of September 2007. Its formation began around January, 2007, when the Storm worm at one point accounted for 8% of all infections on all Microsoft Windows computers.

The botnet reportedly is powerful enough as of September 2007 to force entire countries off of the Internet, and is estimated to be able to potentially execute more instructions per second than some of the world’s top supercomputers.

Obviously, having a large supercomputer is big business these days. So what you do is have a legal version of this. Let’s say you sell computers at 70% of their real price. The only catch is that people will have to run this special software as part of the system. The special software is basically a remote compute client similar to Folding@Home or Google Compute.

Once you have sold enough computers, you essentially have a large army of computers at your beck and call, for 30% of the price you would otherwise have to invest. Of course, someone else owns the machines, but while they are doing lightweight tasks such as checking email and chatting, you are folding proteins, running simulations and cracking ciphers.
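Here is a rough sketch of that arithmetic in Python. The function name `grid_cost`, the fleet size and the list price are all hypothetical, picked only to make the numbers round. It also ignores the fact that you only get the owners' spare cycles, so treat it as the optimistic case.

```python
def grid_cost(machines, list_price, discount_pct=30):
    """Return (subsidy_paid, outright_cost) in whole currency units.

    subsidy_paid is what the discount costs the seller; outright_cost is
    what buying the same machines at list price would cost.
    """
    outright = machines * list_price
    subsidy = outright * discount_pct // 100
    return subsidy, outright


if __name__ == "__main__":
    # Hypothetical fleet: 10,000 machines listed at 1,000 each.
    subsidy, outright = grid_cost(machines=10_000, list_price=1_000)
    print(subsidy, outright)  # 3000000 10000000
```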

Now here’s the best part of the deal: the most expensive part of a grid is not the hardware, but the electricity that it uses. And guess who’s paying for this electricity? The customer, not you!

So there you have it. A cheap, one-time cost for an everlasting free CPU grid. Awesome ainnit?

note: This idea is under this license.


google sky

Google Earth. Now with Sky.


there goes flickr

Google introduces Picasa Web Albums: lots of nice things like image preloading, etc. I’m curious to see what the Yahoo Photos beta and Flickr have to offer in response to this.


double standards

Google’s double standards.


how a blog is worth 10K

How a blog is worth over 9000 a month. Of course, the assumption is that Google AdWords is a realistic price estimate of things, which I feel is a slightly specious standard.


google is building the metaverse

Here’s a set of comparisons between the big G and things from the book Snow Crash:

What next? Take a look at Second Life, and the talk they gave at Google NY. I’m not quite sure of the implementation itself, but this does seem like the next piece in the Snow Crash puzzle.

… and then we’ll soon have Google merging with the Library of Congress, and Yahoo, Amazon, MS, and eBay merging with the mafia, and the picture will be complete.


bus schedules

MagicBus tracks UofM buses in real time using GPS and plots them on a Google Map.