The Story of Stuff: Electronics

This short film should be mandatory viewing for anyone who buys electronics! From their website:

The Story of Electronics, released on November 9th, 2010 at storyofelectronics.org, takes on the electronics industry’s “design for the dump” mentality and champions product take back to spur companies to make less toxic, more easily recyclable and longer lasting products. Produced by Free Range Studios and hosted by Annie Leonard, the eight-minute film explains ‘planned obsolescence’—products designed to be replaced as quickly as possible—and its often hidden consequences for tech workers, the environment and us. The film concludes with an opportunity for viewers to send a message to electronics companies demanding that they “make ‘em safe, make ‘em last, and take ‘em back.”



iFixit's for-profit approach to solving this problem is pretty great: by making every device on the planet repairable, we save so many devices from being trashed.


Fixing Facebook's Font Size Frustration

I don't know whether I'm part of an A/B test or whether Facebook rolled this out to all of its hundreds of millions of users, but it recently reduced the font size of all status messages on its homepage without any explanation:


As a heavy user, I've found this causes a lot of eye strain, so I decided to write a quick script to fix it.

Note: I've tested it in Safari, Chrome and Firefox; for other browsers, fix it yourself! :)

Also, standard and obvious disclaimer: I am not responsible for any code you run.
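For the curious, the fix amounts to injecting a user-stylesheet rule. Here's a minimal sketch of such a script, not the original one: the `.messageBody` selector and the 13px size are assumptions about Facebook's markup at the time, so inspect the page to find the right selector for your browser.

```javascript
// Build a CSS override rule. The "!important" flag makes the rule win
// over Facebook's own stylesheet.
function fontFixCss(selector, sizePx) {
  return selector + " { font-size: " + sizePx + "px !important; }";
}

// In a bookmarklet or userscript context, inject the rule into the page.
// (Guarded so the function above can also be exercised outside a browser.)
if (typeof document !== "undefined") {
  var style = document.createElement("style");
  style.textContent = fontFixCss(".messageBody", 13); // assumed selector/size
  document.head.appendChild(style);
}
```

To use it, paste the snippet into the browser console (or wrap it as a `javascript:` bookmarklet) while on the Facebook homepage.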


XKCD, the NLP remix

Just saw this old XKCD strip and came up with a more real-world-applicable NLP version of it:

[ original image © Randall Munroe ]

For my non-NLP readers, I might as well explain the terms:

Wrapper Induction is “a technique for automatically constructing wrappers from labeled examples of a resource’s content.” More details here.

CRF stands for Conditional Random Field and “is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.”


Adwords CPC Dips: Google Instant and Ad Pricing

I was explaining Google Instant to my housemate yesterday and had this thought1:

Are Google Ads and SEO going to be targeted on prefixes of popular words now?

For example, let’s consider the word “insurance”. A lot of people are bidding on the whole word, and a lot of people are bidding on the word “in”. Since Google Instant can show ads at every keystroke2, perhaps it would be cheaper to buy ads on the word “insura”, where the number of searches will be just as high, but since fewer people are bidding on it, the CPC would be lower?

Here’s some data I pulled from the Google AdWords Estimator:

The charts superimpose CPC, ad position (a proxy for competition), daily clicks and monthly searches for prefixes of four words: “target”, “insurance”, “doctor” and “lawyer”. Note the dips in CPC at various prefix lengths, and the fact that they’re not always correlated with ad position or search volume. I expect these numbers to change rapidly over the next few months as instant search gets rolled out, uncovering interesting arbitrage opportunities for those looking hard enough!

1 Disclaimer: I am not an expert on ads or auction markets; this stuff is just fun to think about.

2 While Instant can show ads at every keystroke, AdWords may decide not to, based on various confidence metrics.

| |

Thoughts on Scribe

As someone who works with autocompletion, this week has been a good one. Google launched two products relevant to my research: the first was Google Scribe, a Labs experiment that uses Web n-grams to assist in sentence construction. This system tackles the same problem addressed in my VLDB’07 paper, “Effective Phrase Prediction” (paper, slides). The paper proposes a data structure called FussyTree to efficiently serve phrase suggestions, and provides a metric I called the “Total Profit Metric” (TPM) to evaluate phrase prediction systems. Google Scribe looks quite promising, and I thought I’d share my observations.

To simplify writing, let’s quickly define the problem using a slide from the slide deck:

Query Time:
Latency while typing is quite impressive. There is no evidence of speculative caching (à la Google Instant), but interaction is fairly fluid, despite the fact that an HTTP GET is sent to a Google frontend server on every keystroke. I’m a little surprised that there isn’t a latency check (or, if there is one, its threshold is too low): GET requests are made even when I’m typing too fast for the UI to keep up, rendering many of the results useless even before the server has responded to them.

Length of Completion:
My experience with Google Scribe is that the length of completion is quite small; I was expecting it to produce large completions as I gave it more data, but I couldn’t get it to suggest beyond three words.

Length of Prefix+Context:
It looks like the length of the prefix+context (context being the text before the prefix, used to bias completions) is 40 characters, with no special treatment of word endings. At every keystroke, the previous 40 characters are sent to the server, and completions come back. So as I typed a sentence, the requests looked like this:

this is a forty character sentence and i
his is a forty character sentence and it
is is a forty character sentence and it
s is a forty character sentence and it i
_(and so on)_

I’m not sure what the benefit of sending requests for partial words is. It’s hard to discern the prefix from the context by inspection, but the prefix seems to be quite small (2-3 words), which sounds right.
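The client-side behavior described above amounts to something like this sketch; the function name is mine, and only the 40-character window comes from observation:

```javascript
// On every keystroke, the client sends the trailing `size` characters
// of the text buffer as prefix+context. For buffers shorter than the
// window, the whole buffer is sent.
function contextWindow(buffer, size) {
  return buffer.length <= size ? buffer : buffer.slice(-size);
}

var text = "this is a forty character sentence and it";
// The payload for the current keystroke, i.e. the second request above:
var payload = contextWindow(text, 40);
// -> "his is a forty character sentence and it"
```

Each keystroke slides this window forward by one character, which matches the request sequence shown above.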

Prediction Confidence:
Google Scribe always displays a list of completions. This isn’t ideal, since it’s often making arbitrary low-confidence predictions. That makes sense from a demo perspective, but since completions carry a distraction cost, it would be valuable to show them only when they are high-confidence. Confidence can either be calculated using TPM or learned from usage data (which I hope Scribe is collecting!)

Prediction Quality:
People playing with Scribe produced sentences such as “hell yea it is a good idea to have a look at the new version of the Macromedia Flash Player to view this video” and “Designated trademarks and brands are the property of their respective owners and are”. I find these sentences interesting because they are both very topical; i.e. they seem more like outliers from counting boilerplate text on webpages than “generic” sentences you’d find in, say, an email. To solve this issue and produce more “generic” completions, one solution is to cluster the corpus into multiple topic domains, and ensure that a completion is not popular in just one isolated domain.

I was also interested in knowing: how many keystrokes will this save? To measure this, we can use TPM. In these two slides, I describe the TPM metric with an example calculation:

While it would be nice to see a comparison of the FussyTree method vs Google Scribe in terms of Precision, Recall and TPM, constructing such an experiment is hard, since training FussyTree over web-sized corpora would require some significant instrumentation. Based on a few minutes of playing with it, I think Scribe will outperform the FussyTree method in Recall due to the small window size — i.e. it will produce small suggestions that are often correct. However, if we take into account the distraction factor from the suggestion itself, then Scribe in its current form will do poorly, since it pulls up a suggestion for every word. This can be fixed by making longer suggestions, and considering prediction confidence.

Overall, I am really glad that systems like these are making it into the mainstream. The more exposure they get, the better and more accurate they can become, saving us time and letting us interact with computers more effectively!


Quick and easy multicore sort using Bash oneliners

In my line of work I often need to sort multi-gigabyte files containing some form of tabulated text data. Usually, one would do this with a single Unix sort command, like this:

sort data.tsv -o data.tsv.sorted

Even with an adequate machine, this takes 21 minutes for a 7.4GB file with 115M objects. But like most moderate-sized work machines these days, ours has multiple cores (2x quad-core Intel Xeon), abundant memory (24GB) and a fast disk (15K RPM), so running a single sort command on one file seriously underutilizes those resources!

Now there are all sorts of fancy commercial and non-commercial tools that can optimize this, but I just wanted something quick and dirty. Here are a few lines I end up using often, which I thought would be useful to share:

split -l5000000 data.tsv '_tmp';
ls -1 _tmp* | while read FILE; do sort $FILE -o $FILE & done;

Then, once all the background sorts have finished (the wait builtin blocks until they do):
sort -m _tmp* -o data.tsv.sorted

This takes less than 5 minutes!

How this works: The process is straightforward; you’re essentially:
1. splitting the file into pieces
2. sorting all the split pieces in parallel to take advantage of multiple cores (note the “&”, enabling background processes)
3. then merging the sorted files back

Since the speedup from using all the cores outweighs the cost of the extra disk I/O from splitting and merging, you get a much faster sort. Note the use of while read; it ensures that just-created files don’t get picked up mid-loop, avoiding infinite loops.
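Putting the three steps together, here's the whole thing as one reusable shell function. The function name and the chunk-size argument are my additions; it assumes GNU coreutils and no stray _tmp* files in the working directory:

```shell
# parallel_sort INPUT OUTPUT [LINES_PER_CHUNK]
# Split, sort chunks in parallel, then merge the pre-sorted chunks.
parallel_sort() {
  split -l "${3:-5000000}" "$1" _tmp   # 1. split into fixed-size chunks
  for f in _tmp*; do
    sort "$f" -o "$f" &                # 2. one sort process per chunk
  done
  wait                                 #    block until every sort finishes
  sort -m _tmp* -o "$2"                # 3. cheap merge of sorted chunks
  rm -f _tmp*                          # clean up the temporary chunks
}
```

Usage: `parallel_sort data.tsv data.tsv.sorted` reproduces the commands above, with the wait step made explicit.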

For fun, here’s a screen cap of what I’d like to call the “Muhahahahaha moment” — when the CPU gets bombarded with sort processes, saturating all cores:

(see the video version of this; you’ll need to skip to the 00:36 mark)

Deceiving Users with the Facebook Like Button

Update: I've written a followup to this post, which you may also find interesting.

Facebook just launched a super-easy widget called the "Facebook Like Button". Website owners can add a simple iframe snippet to their HTML, enabling a nice "Like" button with a count of other people's "Like"s and, if any of them are your friends, a row of their faces. The advantage of this new tool is that you don't need any fancy coding: just fill out a simple wizard and paste the embed code in, just like you do with YouTube, etc.

However, this simplicity has a cost: Users can be tricked into "Like"ing pages they're not at.

For example, try pressing this "Like" button below:

This is what happened to my Facebook feed when I pressed it:


I used BritneySpears.com as an example here to be work/family-safe; you're free to come up with examples of other sites you wouldn't want on your Facebook profile! :)

Important note: Removing the feed item from your newsfeed does not remove your like -- it stays in your profile. You have to click the button again to remove the "Like" relationship.

This works because the iframe lets me set any URL I want. Because of cross-domain browser security, the "Like Button" iframe has no way to communicate with the page it's embedded in. Facebook's "Connect" system solved this with a cross-domain proxy, which requires uploading a file, etc. The new button trades away that security for convenience.
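To make the mechanics concrete, here's a sketch of how a deceptive page would construct the embed. The like.php endpoint and href parameter mirror the widget wizard's embed code of the time, but treat the exact parameter names as assumptions:

```javascript
// The Like button iframe's src carries the URL being "liked" as a query
// parameter. Facebook only ever sees this parameter -- never the page
// that actually hosts the iframe -- so the hosting page can claim to be
// any URL at all.
function likeButtonSrc(claimedUrl) {
  return "https://www.facebook.com/plugins/like.php?href=" +
         encodeURIComponent(claimedUrl);
}

// A spammer's page would then embed:
//   <iframe src="..."></iframe>
// with src = likeButtonSrc("http://www.britneyspears.com/")
```

Nothing in this flow ties the claimed URL to the embedding page, which is exactly the gap the deception exploits.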

An argument in Facebook's favor is that no self-respecting webmaster would want to deceive their visitors! True -- the motivation to deceive isn't very strong -- but an enterprising spammer could set up content farms posing as humble websites and use those "Like" buttons to sell, say, teeth-whitening formulas to visitors' friends. Or a warez / pirated-movie site could trick you with overlays, opacity tricks and other spam techniques, selling your click on an "innocent" movie review page to a porn site, similar to what is done with CAPTCHAs. I'm going to call this new form of spam Newsfeed Spam.

This is scary because anyone who falls victim to it will immediately become wary of social networking buttons, and may even stay away from a "Share on Twitter" button because "bad things have happened in the past with these newfangled things"!

I don't have a good solution to this problem; this sort of spam would be hard to detect or police, since Facebook never sees the parent page.

• One weak solution is to use the iframe's HTTP_REFERER to prohibit cross-domain Likes. I'm not sure how reliable this is; it depends on the browser's security policies.

• Yet another solution is to provide the user with information about the target of the Like. e.g. it can be:

  • Shown in the initial text, i.e. "and 2,025 others like this" becomes "and 2,025 others like "Britney Spears"...". The downside is that it can't be shown in the compact form of the button.
  • Shown upon clicking, i.e. "You just liked BritneySpears.com"
  • (my favorite) Shown on mouseover: the button expands to show the domain, "Click to Like britneyspears.com/...."

This problem is an interesting mix of privacy and usability; would love to see a good solution!


My vintage iPad case

Just built this yesterday; it was well worth the effort!

Made with a 1926 yearbook I found at an antique book store, suede leather (left panel / screen cover), an elastic band, duct tape, and plastic sheeting from IKEA frames (for mounting the elastic).

My Vintage iPad Case!

My apologies to Drury College’s class of 1926 whose yearbook I pillaged for this project. I plan to scan the contents of the book into the iPad, so that it remains true to its origin!


Life before Google