Archive - 2010

December 19th

BaconSnake to core: Pig 0.8 released!

My contribution to Python UDFs in Pig has finally been released as part of the new and shiny 0.8 release! I've been meaning to blog about how this came about when I had time, but Dmitriy saved me the work, so I'll just quote him instead:

This is the outcome of PIG-928; it was quite a pleasure to watch this develop over time — while most Pig tickets wind up getting worked on by at most one or two people, this turned into a collaboration of quite a few developers, many of them new to the project — Kishore Gopalakrishna’s patch was the initial conversation starter, which was then hacked on or merged into similar work by Woody Anderson, Arnab Nandi, Julien Le Dem, Ashutosh Chauhan and Aniket Mokashi (Aniket deserves an extra shout-out for patiently working to incorporate everyone’s feedback and pushing the patch through the last mile).

Yay Open Source! Documentation is available on the Pig website, but here’s a one-line example of what you can now do:

register 'test.py' using jython as myfuncs;

Notice that we're registering a Python script instead of a jar. All Python functions in test.py become available to Pig as UDFs!
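For the curious, here's a minimal sketch of what test.py might contain. The function names and schemas are purely illustrative, and the @outputSchema decorator (described in the Pig docs, and made available to the script by Pig's Jython engine) declares the schema of each return value:

# test.py -- a hypothetical Jython UDF script for Pig 0.8
# @outputSchema declares the schema of the returned value; without it,
# the result is treated as a bytearray.

@outputSchema("word:chararray")
def reverse(word):
    # e.g. 'pig' -> 'gip'
    return word[::-1]

@outputSchema("length:long")
def word_length(word):
    return len(word)

After the register statement above, these become callable from Pig Latin as myfuncs.reverse(...) and myfuncs.word_length(...) inside a FOREACH ... GENERATE.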

Going forward, there are two things I didn't get to do, which I'd love any of my dear readers to pick up on. First, we're still not on par with SCOPE's ease of use, since UDFs can't be inlined yet (a la BaconSnake). This is currently open as a separate Apache JIRA issue; it would be great to see patches there!

The second feature that I really wanted to make part of this patch submission, but didn't have time for, was Java source UDF support via a JaninoScriptEngine. Currently, Pig users have to bundle their UDF code as jars, which quickly becomes painful. The goal of the ScriptEngine architecture is to let multiple languages be implemented and plugged in, so I really hope someone takes the time to look at the Janino project, which would allow Java source UDFs with zero performance penalty. This would also make Pig the first dynamically compiled query system over Hadoop (as far as I know), and would open up a world of query optimization work to explore.

December 6th

United States of Autocomplete, the Live Version

This blog post by Dorothy at Very Small Array caught my attention; it's very cute. Eytan suggested I do an automated version. So here we go: a live, JavaScript version of The United States of Autocomplete, personalized to what you see in Google Suggest!

(note: it seems to work only on Google Chrome / Firefox with SVG support for now; I'll fix that later)

Click here to see the live map!


November 10th

The Story of Stuff: Electronics

This short film should be mandatory viewing for anyone who buys electronics! From their website:

The Story of Electronics, released on November 9th, 2010 at storyofelectronics.org, takes on the electronics industry’s “design for the dump” mentality and champions product take back to spur companies to make less toxic, more easily recyclable and longer lasting products. Produced by Free Range Studios and hosted by Annie Leonard, the eight-minute film explains ‘planned obsolescence’—products designed to be replaced as quickly as possible—and its often hidden consequences for tech workers, the environment and us. The film concludes with an opportunity for viewers to send a message to electronics companies demanding that they “make ‘em safe, make ‘em last, and take ‘em back.”


iFixit's for-profit approach to solving this problem is pretty great: by making every device on the planet repairable, we save so many of them from being trashed.


November 4th

Fixing Facebook's Font Size Frustration

I don't know if I'm part of an A/B test or if Facebook rolled this out to all of its hundreds of millions of users, but it recently reduced the font size of all status messages on the homepage without any explanation:

[ screenshot: fixfont ]

As a heavy user, I've been getting a lot of eye strain from this. So I decided to create a quick script to fix it. You can either:

Note: I've tested it on Safari, Chrome and Firefox; for other browsers, fix it yourself! :)

Also, standard and obvious disclaimer: I am not responsible for any code you run.


October 27th

XKCD, the NLP remix

Just saw this old XKCD strip and came up with a more real-world-applicable, NLP version:

[ original image © Randall Munroe ]

For my non-NLP readers, I might as well explain the terms:

Wrapper Induction is “a technique for automatically constructing wrappers from labeled examples of a resource’s content.” More details here.

CRF stands for Conditional Random Field and “is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.”


September 15th

Shadow Tables using MySQL Triggers

One popular principle behind database-driven applications (websites, services, etc.) is to never throw anything away. For example, if you're writing a “To-do list” application, you might be tempted to run a DELETE FROM TODO WHERE ID = 123 whenever things are checked off.

However, this means that you’ve lost the data forever, so if you wanted to mine the data to gain insights, or wanted to provide features like Undo, you’re out of luck. One way to solve this problem is to always use UPDATE to set a “deleted” flag on the tuple, but this means your deleted data and your active data are in the same table, and you have to rewrite all your SELECTs to include the delete flag.

An alternative is to move all deleted tuples into a shadow1 table using a TRIGGER. Here's how:


CREATE TRIGGER todo_shadow
BEFORE DELETE ON todo
FOR EACH ROW
INSERT DELAYED INTO _todo_shadow values(OLD.col1,OLD.col2,...);

Now, every time a tuple is deleted from your “todo” table, it first gets inserted into the “_todo_shadow” table. “_todo_shadow” is identical to “todo”, but without any keys, constraints or indexes; MyISAM would work well here, since we don't plan to delete or update rows in this table. Note the use of DELAYED, an optional optimization2 to defer shadow inserts.

While I use MySQL as the example, you can tweak this to work with Postgres, SQLite and SQL Server as well; all of them support triggers. Note that triggers have varying impacts on sharding, indexes and transactions, so make sure you read up on table locks before you deploy this at a nuclear plant!
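If you'd like to play with the idea end to end without a MySQL install, here's a rough, self-contained sketch using Python's built-in sqlite3 module. SQLite has no DELAYED, the table and column names are illustrative rather than taken from the example above, and the deleted_at column anticipates the timestamp trick from footnote 1:

import sqlite3

# In-memory database, just for demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE todo (id INTEGER PRIMARY KEY, task TEXT);
    -- shadow copy: same columns, no keys or indexes, plus a deletion timestamp
    CREATE TABLE _todo_shadow (id INTEGER, task TEXT,
                               deleted_at TEXT DEFAULT CURRENT_TIMESTAMP);
    CREATE TRIGGER todo_shadow
    BEFORE DELETE ON todo
    FOR EACH ROW
    BEGIN
        INSERT INTO _todo_shadow (id, task) VALUES (OLD.id, OLD.task);
    END;
""")

conn.execute("INSERT INTO todo (task) VALUES ('write blog post')")
conn.execute("DELETE FROM todo WHERE id = 1")  # fires the trigger
print(conn.execute("SELECT * FROM _todo_shadow").fetchall())
# e.g. [(1, 'write blog post', '2010-09-15 12:00:00')]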

1 Shadow tables usually store all provenance, not just deletes. For full support, you can create an additional trigger for updates as well. You can even add a timestamp attribute that defaults to NOW() to store the time of deletion.

2 See docs for reasons why you may want to omit DELAYED.


September 10th

Adwords CPC Dips: Google Instant and Ad Pricing

I was explaining Google Instant to my housemate yesterday and had this thought1:

Are Google Ads and SEO going to be targeted on prefixes of popular words now?

For example, let's consider the word “insurance”. There are a lot of people bidding on the whole word, and a lot of people bidding on the word “in”. Since Google Instant can show ads on every keystroke2, perhaps it would be cheaper to buy ads on the prefix “insura”: the number of searches passing through it should be just as high, but since fewer people are bidding on it, wouldn't the CPC be lower?

Here's some data I pulled from the Google Adwords Estimator:
[ chart: cpcdips ]

The charts superimpose CPC, ad position (evidence of competition), daily clicks and monthly searches for prefixes of four words: “target”, “insurance”, “doctor” and “lawyer”. Note the dips in CPC at various prefix lengths, and the fact that they're not always correlated with ad position or search volume. I'm assuming these numbers will change rapidly over the next few months as Instant search rolls out, uncovering interesting arbitrage opportunities for those who are looking hard enough!

1 Disclaimer: I am not an expert on ads or auction markets; this stuff is just fun to think about.

2 While Google Instant can show ads on every keystroke, Adwords may choose not to show them, based on various confidence metrics.


Thoughts on Scribe

As someone who works on autocompletion, I've had a good week: Google launched two products relevant to my research. The first one was Google Scribe, a Labs experiment that uses Web n-grams to assist in sentence construction. This system solves the same problem addressed in my VLDB'07 paper, “Effective Phrase Prediction” (paper, slides). The paper proposes a data structure called the FussyTree to efficiently serve phrase suggestions, and provides a metric I called the “Total Profit Metric” (TPM) to evaluate phrase prediction systems. Google Scribe looks quite promising, and I thought I'd share my observations.

To simplify things, let's quickly define the problem using a slide from the slide deck:

Query Time:
Latency while typing is quite impressive. There is no evidence of speculative caching (a la Google Instant), but interaction is fairly fluid, despite the fact that an HTTP GET is sent to a Google Frontend Server on every keystroke. I'm a little surprised that there isn't a latency check (or if there is one, its threshold is too low): GET requests are made even when I'm typing too fast for the UI to keep up, rendering many of the results useless before the server has even responded to them.

Length of Completion:
In my experience, Google Scribe's completions are quite short; I was expecting it to produce longer completions as I gave it more text, but I couldn't get it to suggest more than three words.

Length of Prefix+Context:
It looks like the length of the prefix/context (context being the text before the prefix, used to bias completions) is 40 characters, with no special treatment of word boundaries. At every keystroke, the previous 40 characters are sent to the server, and completions come back. So as I typed in a sentence, this is what the requests looked like:

this is a forty character sentence and i
his is a forty character sentence and it
is is a forty character sentence and it
s is a forty character sentence and it i
(and so on)
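In other words, the client appears to just send a trailing 40-character window of whatever has been typed so far. A rough reconstruction of that request pattern (purely illustrative; this is not Scribe's actual client code, which runs in JavaScript):

# Print the trailing 40-character window after each keystroke, i.e. what
# (apparently) gets sent to the server as the query.
text = "this is a forty character sentence and it is"
WINDOW = 40
for keystroke in range(WINDOW, len(text) + 1):
    typed_so_far = text[:keystroke]
    print(typed_so_far[-WINDOW:])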

I'm not sure what the benefit of sending requests for partial words is. It's hard to discern the prefix from the context by inspection, but the prefix seems to be quite small (2-3 words), which sounds right.

Prediction Confidence:
Google Scribe always displays a list of completions. This isn't ideal, since it's often making arbitrary low-confidence predictions. That makes sense from a demo perspective, but since there is a distraction cost associated with completions, it would be more valuable to show completions only when they are high-confidence. Confidence can either be calculated using TPM or learned from usage data (which I hope Scribe is collecting!).

Prediction Quality:
People playing with Scribe produced sentences such as “hell yea it is a good idea to have a look at the new version of the Macromedia Flash Player to view this video” and “Designated trademarks and brands are the property of their respective owners and are”. I find these sentences interesting because they are both very topical; i.e., they seem more like outliers from counting boilerplate text on webpages than the “generic” sentences you'd find in, say, an email. To solve this and produce more “generic” completions, one approach is to cluster the corpus into multiple topic domains and ensure that a completion is not just popular in one isolated domain.

I was also interested in knowing, “How many keystrokes will this save?” To measure this, we can use TPM. In these two slides, I describe the TPM metric with an example calculation:

While it would be nice to see a comparison of the FussyTree method vs. Google Scribe in terms of Precision, Recall and TPM, constructing such an experiment is hard, since training a FussyTree over web-sized corpora would require significant instrumentation. Based on a few minutes of playing with it, I think Scribe will outperform the FussyTree method in Recall due to the small window size; i.e., it will produce short suggestions that are often correct. However, if we take into account the distraction factor of the suggestions themselves, then Scribe in its current form will do poorly, since it pulls up a suggestion for every word. This can be fixed by making longer suggestions and by considering prediction confidence.

Overall, I am really glad that systems like these are making it into the mainstream. The more exposure they get, the more chance they have to become better and more accurate, saving us time and helping us interact with computers better!


August 23rd

Rollover Tweets!

Twitter restricts you to 140 characters per tweet. But what if you don't use all 140? What happens to all the leftover characters?

Inspired by AT&T's Rollover® Minutes, I whipped up a quick JavaScript widget that calculates your monthly Rollover® Tweets -- how many extra tweets you could write by using up the leftover characters from your previous Twitter messages. Simply enter your Twitter username and hit "Calculate", and you'll have an answer in a few minutes! (This is how it looks for me.)



Developer Notes:
0. Twitter is a little flaky about reporting data, and I'm not doing a lot of exception handling; so if the script doesn't work, refresh/rerun! :P
1. What you see here is pure cross-domain JavaScript using JSONP -- there is no backend.
2. I only look at the latest 1000 tweets.
3. Formula: Sigma ( Floor(Leftover characters per month / 140) ) over the latest 1000 tweets.
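Reading note 3 as “pool each month's leftover characters and count how many whole 140-character tweets fit in them”, here is a rough sketch of just that arithmetic (the real widget does this client-side in JavaScript; fetching the tweets from Twitter is not shown):

from collections import defaultdict

LIMIT = 140

def rollover_tweets(tweets):
    # tweets: (month, text) pairs, newest first, e.g. ('2010-08', 'hello world')
    leftover = defaultdict(int)
    for month, text in tweets[:1000]:  # latest 1000 tweets only
        leftover[month] += max(LIMIT - len(text), 0)
    # one rollover tweet per full 140 leftover characters in a month
    return {month: chars // LIMIT for month, chars in leftover.items()}

print(rollover_tweets([('2010-08', 'short tweet'), ('2010-08', 'x' * 120)]))
# {'2010-08': 1} -- 129 + 20 = 149 leftover characters: enough for one tweet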


Fine print: Rollover® is a trademark of AT&T. AT&T's lawyers are cute and look like teddy bears.

August 13th

Friend-based throttling in Facebook News Feeds

This dialog in my Facebook feed options seemed interesting:

[ screenshot: Facebook feed options, 2010-08-13 ]

Notice how it asks how many friends I want my Live Feed to include. The default seems to be 250 friends. This means that when you click “Recent Posts”, you're getting recent posts from only your top 250 friends; all other friends are being ignored.

Obviously this is a problem only if you have more than 250 friends. I’ve heard the average is 150, but I’m sure there are a lot of people who are affected by this. This option caught my eye for two reasons:

From a technical perspective, news feeds are massive publish-subscribe systems. You subscribe to your friends' posts, which, when posted, are published to your feed. The 250-friend cap sets up a convenient soft limit for the system, reducing the stress on Facebook's servers. Twitter doesn't have such limits, and I can imagine this is one reason why its servers get overloaded. It's a smart design from this perspective, but I wish Facebook were more transparent about the limit!

From a social perspective, I think this is a very primitive way to throttle friends. My understanding of the Feed was that “Top Posts” ranked recent posts so that I had a high-level view of my feed, and “Recent Posts” gave me access to everything. It seems this belief was incorrect. When I increased the number to 1000 (i.e., to include ALL my friends), I suddenly started seeing updates from many friends I had totally forgotten about or lost touch with. Since I don't see updates from them, I don't interact with them on Facebook, leading to a self-reinforcing “poor get poorer” effect. I am assuming there's some “friendness” ranking going on here; this way, friends in my bottom 50 will never make it into my top 250 friends on Facebook. The use of a self-reinforcing ranking function is risky, especially when the stability of the ranking depends on human input. I wonder if the Feed team has done anything smart to introduce “compensators” based on interactions with bottom-50 friends, similar to the random reset in PageRank. The issue here is that, unlike hyperlink edges, we're dealing with a vocabulary of “Likes” and other social cues that are not well understood. This seems like it could be an excellent subject for a machine learning / information retrieval paper or two.

update: Horseman of the Interwebs Hung Truong points out Dunbar’s Number:

Dunbar’s number is a theoretical cognitive limit to the number of people with whom one can maintain stable social relationships. These are relationships in which an individual knows who each person is, and how each person relates to every other person. Proponents assert that numbers larger than this generally require more restrictive rules, laws, and enforced norms to maintain a stable, cohesive group. No precise value has been proposed for Dunbar’s number. It lies between 100 and 230, but a commonly detected value is 150.

This puts Facebook's default threshold in a great place. However, Dunbar's number is meant for offline relationships; the equivalent number for ephemeral, online, “feed”-style relationships could arguably be much higher. It appears Dunbar has been working on this; I'm looking forward to a publication from his group soon.