
Reputation Misrepresentation, Trail Paranoia and other side effects of Liking the World

[Figure: traffic spike as the post spread]

A few months ago, I wrote up some quick observations about Facebook's then freshly launched "Like" button, pitching "Newsfeed Spam" as a problem exacerbated by the new Like buttons. The post went "viral", so to speak, bouncing off Techmeme, ReadWriteWeb / NYTimes, and even German news websites. Obviously this is nothing compared to "real" traffic on the Internet, but it was fun to watch the link spread. This post is a follow-up, based on thoughts I've had since.

In this post, I'll be writing about five "issues" with the Like button, followed by four "solutions" to them. Since this is a slightly long post, here's an outline:

  • Big Deal!
  • Issues with the Like Button: Reputation Misrepresentation, Browse Trail Inference, Newsfeed Spam, "Likejacking" and Like Switching
  • "Solutions": Facebook's approach, Secure Likes, a browser-based approach and My Current Solution


Big Deal!


[Figure: Facebook usage statistics]

The Facebook Like button has been a huge success. With over 3 billion buttons served, and major players such as IMDB and CNN signing up to integrate the button (and other social plugins) into their websites, the chance of encountering a Facebook Like button while browsing the web is quite high, if not near-certain.

Many folks have questioned whether this is a big deal -- IFRAME and javascript based widgets have been around for a long time (shameless self-plug: Blogsnob used a javascript-based widget to cross-pollinate blogs across the internet as early as 8 years ago). Using the social concept of showing familiar faces to readers isn't new either; MyBlogLog has been doing it for a while. So why is this silly little button such an issue? The answer is persistent user engagement. With 500 million users, half of whom log into Facebook on any given day, you're looking at an audience of 250 million users. If you're logged into Facebook while browsing any website with a social plugin, the logged-in session is used. And if you're like me, you probably have "remember me" checked at login, which means you're always logged into Facebook. What this means is that on any given day, Facebook has the opportunity to reach 250 million people throughout their web browsing experience, not just when they're on Facebook.com[1].

So clearly, from a company's perspective, this is important. It is a pretty big deal! But why is this something Facebook users need to be educated about? Onwards to the next section!

Issues with the Like Button


Readers should note the use of the word "Issues", as opposed to "Security Vulnerability", "Privacy Leak", "Design Flaw", "Cruel Price of Technology", or "Horrible Transgression Against Humankind". Each issue has its own kind of impact on the user; you're welcome to decide which is which!


To better understand the issues with the Like button, let's understand what the Like button provides:
1) It provides a count of the number of people who currently "Like" something.
2) It provides a list of people you know who have liked said object, with profile pictures.
3) It provides the ability to click the button and instantaneously "Like" something, triggering an update on your newsfeed.
All of this is done using an embedded IFRAME -- a little Facebook page within the main page that displays the button.

In the next few paragraphs, we'll see some implications of this button on the web.

Reputation Misrepresentation


The concept of reputation misrepresentation is quite simple:
a not-so-popular website can use another website's reputation to make itself seem more reputable or established to the user.

Here's a quick diagram to explain it:

[Diagram: reputation misrepresentation]

Simply put, as of now, any website (e.g. a web store) can claim it is popular (especially with your friends) to gain your trust. Since Facebook doesn't check referrer information against the URL being liked, it really doesn't have the power to do anything about this either. A possible solution is to include verifying information inside the Like button, though that ruins the simplicity of it all.
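To make this concrete, here's a minimal Python sketch of how a Like button embed is put together (the like.php parameter names reflect my understanding of the 2010-era plugin; treat the exact names as assumptions). The point is that the URL being "Liked" is just a query parameter, chosen by whoever writes the embedding page:

from urllib.parse import quote

def like_button_iframe(claimed_url, width=450, height=80):
    """Return the iframe HTML that any page -- however shady -- can embed."""
    # Nothing forces claimed_url to match the page the iframe actually
    # sits in; the button happily shows claimed_url's counts and faces.
    src = ("http://www.facebook.com/plugins/like.php?href=%s"
           "&layout=standard&show_faces=true" % quote(claimed_url, safe=""))
    return ('<iframe src="%s" width="%d" height="%d" '
            'frameborder="0" scrolling="no"></iframe>' % (src, width, height))

# A no-name web store borrowing BritneySpears.com's reputation:
print(like_button_iframe("http://www.britneyspears.com/"))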

Browse Trail Inference


This one is a more paranoid concept, but I've noticed that people don't realize it until I spell it out for them:
Facebook is indirectly collecting your entire browsing history for all websites that have Facebook widgets. You don't have to click any Like buttons; just visiting sites like IMDB.com or CNN.com or BritneySpears.com will enable this.

Here's how it works:

[Diagram: Jane's browse trail being recorded]

Here, our favorite user Jane is logged into Facebook, visits 2 pages on IMDB.com, checks the news on CNN, and then heads to Yelp to figure out where to eat. Interestingly enough, Facebook receives a record of all of this, can tie it to her Facebook profile, and can thus come up with inferences like "Jane likes Romantic Movies, International News and Thai Food -- let's show her some ads for romantic getaways to Bali!"

(Even worse, if Jane unwittingly visits a nefarious website which coincidentally happens to have the Like button, Facebook gets to know about that too!)

Most modern browsers send the parent document's URL as HTTP_REFERER information to Facebook when loading the Like IFRAME, which allows Facebook to implicitly record a fraction of your browsing history. Since this information is much more voluminous than your explicit "Likes", a lot more can be data-mined from it, which can then be used for "Good" (i.e. adding value to Facebook) or "Evil" (i.e. Ads! Market data!)
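Here's a small Python sketch of what this looks like from the collecting side. The log entries are hypothetical, but the mechanism is exactly as described: every Like iframe load arrives at Facebook carrying the user's session cookie and, via the Referer header, the URL of the page she's reading:

from collections import defaultdict

# (facebook_user, referer) pairs as the plugin endpoint would see them.
# Hypothetical entries -- note that Jane never clicked a single Like.
plugin_requests = [
    ("jane", "http://www.imdb.com/title/a-romantic-movie/"),
    ("jane", "http://www.imdb.com/title/another-romantic-movie/"),
    ("jane", "http://www.cnn.com/WORLD/"),
    ("jane", "http://www.yelp.com/search?find_desc=thai+food"),
]

trails = defaultdict(list)
for user, referer in plugin_requests:
    trails[user].append(referer)

# A slice of Jane's browsing history, tied to her profile and ready
# for inference: romantic movies, international news, Thai food.
print(trails["jane"])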

What I like about this is that it's an ingenious system for tracking users' browsing behavior. Currently, companies like Google, Yahoo and Microsoft (Bing/Live/MSN) have to convince you to install a browser toolbar, which has a minuscule clause in its agreement saying that you share back ALL your browsing history, which can be used to better understand the Web (and make more money, etc. etc.). Since Facebook is getting all websites to install this, it gets the job done without getting you to install a toolbar! I'll discuss how I deal with this in the last section, "My Current Solution".

Newsfeed Spam


In a previous post, I demonstrated how users could be tricked into "Liking" things they didn't intend to, leading to spam in their friends' newsfeeds. A month later, security firm Sophos reported an example of this, where users were virally tricked into spreading a trojan through Facebook Likes -- something that could just as easily be initiated by Like buttons across the web, where you can be tricked into liking arbitrary things.

Again, this issue has the same root cause as Reputation Misrepresentation: since all the Like button shows you is a user count, some pictures and the button itself, there really is no way to know what you're liking. One workaround is to do your liking through a bookmarklet in your browser, which is under your control.

"Likejacking"


This interesting demo by Eric Kerr demonstrates how to force unwitting users into clicking arbitrary Like buttons. It works by making a transparent Like button and moving it along with the user's mouse cursor. Since the user is bound to click somewhere on the page at some point, they end up clicking the invisible Like button instead.

Like Switching


[Diagram: like switching]

Like switching is an alternative take on Likejacking -- the difference is that the user is explicitly shown a Like button with a prestigious like count and familiar friends first. When the user reaches out to click on it, the Like button is swapped out for a different one, triggered by an onmouseover event on the rectangle around the button.

"Solutions"

Given these issues, let's discuss some solutions, responses and fixes. Note the use of quotes -- many people can argue that nothing is broken, so we don't need solutions! Regardless, one piece of good news is that the W3C is aware of the extensive use of IFRAMEs on the web, and has introduced a new "sandbox" attribute for IFRAMEs. This will allow more fine-grained control over social widgets; for example, if we can then set our browsers to force "sandbox" settings for all Facebook IFRAMEs, we can avoid handing over our browsing history to Facebook.
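As a toy illustration, here's a Python sketch of the kind of rewrite rule a filtering proxy could apply, forcing a bare sandbox attribute onto Facebook IFRAMEs before the page reaches your browser. (A regex is no way to parse real HTML, and browser support for sandbox was only just arriving, so treat this purely as a sketch.)

import re

def sandbox_facebook_iframes(html):
    """Add a bare sandbox attribute to any facebook.com iframe, which
    denies the embedded page scripts and same-origin privileges."""
    return re.sub(r'<iframe([^>]*facebook\.com[^>]*)>',
                  r'<iframe sandbox\1>', html)

page = '<iframe src="http://www.facebook.com/plugins/like.php?href=x">'
print(sandbox_facebook_iframes(page))
# prints: <iframe sandbox src="http://www.facebook.com/plugins/like.php?href=x">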


Facebook's approach


While I don't expect companies to rationalize every design decision with their users, I am glad that some Facebook engineers are reaching out via online discussions. Clearly this is not representative of the whole company, but here's a snippet:
Also, in case it wasn't clear, as soon as we identify a domain or url to be bad, it's impossible to reach it via any click on facebook, so even if something becomes bad after people have liked it, we still retroactively protect users.

I like this approach because it fits in well with the rest of the security infrastructure that large companies have: the moment a URL is deemed insecure anywhere on the site, all future users are protected from that website. However, this approach doesn't solve problems with user trust -- it relies on Facebook having flagged every evil website in the world before you chance upon it, something I wouldn't bet my peace of mind on. It's as if the police told you, "We will pursue serial killers only after the first murder!" Would you sleep better knowing that? In essence, this approach is great when you're looking at it from the side of protecting 500 million users. But as one of the 500 million, it kinda leaves you in the dark!


Secure Likes

As mentioned in the Reputation Misrepresentation section, another interesting improvement would be to include some indication of the URL being "Liked" inside the button itself. One option is to display the URL as a tooltip when the user hovers the cursor over the button, especially if it disagrees with the parent frame's URL. Obviously, placing the whole URL there would make the button large and ugly. A possible compromise is to include the favicon (the icon that shows up for each site in your browser) right inside the Like button. The user can simply check that the icon in the browser's address bar is the same as the one on the Like button to make sure it's safe. This way, if a website wants to (mis)use BritneySpears.com's Like button, it will be forced to use BritneySpears.com's favicon too! Here's a mockup of what a "Secure Like" would look like for IMDB:

[Mockup: a "Secure Like" button for IMDB, with favicon]
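In code, the check the user is being asked to make is just a favicon-origin comparison. A quick Python sketch (the function names are made up for illustration):

from urllib.parse import urlparse

def favicon_url(page_url):
    """Favicons conventionally live at /favicon.ico on the site's host."""
    return "http://%s/favicon.ico" % urlparse(page_url).netloc

def secure_like_consistent(liked_url, embedding_page_url):
    # The icon inside the button comes from the liked URL's host; the
    # icon in the address bar comes from the page actually being viewed.
    return favicon_url(liked_url) == favicon_url(embedding_page_url)

# A shady store embedding BritneySpears.com's button gets caught:
print(secure_like_consistent("http://www.britneyspears.com/",
                             "http://shady-store.example/"))  # False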


A browser-based approach



This approach, best exemplified by the "social web" browser Flock and recently acknowledged by folks at Mozilla, makes you log into the browser, not a website. All user-sensitive actions (such as "Liking" a page) have to go through the browser, making them inherently more secure.

My Current Solution


[Screenshot: my dock -- GMail and Facebook run as separate apps alongside Chrome]

At this point, I guess it's best to conclude with my solution for dealing with all these issues. It's simple: I run Google and Facebook services in their own browsers, separate from my general web surfing. As you can see from the picture of my dock, my GMail and Facebook are separate from my Chrome browser. That way, I appear logged out[2] to Google Search and Facebook Likes when I surf the web or search for things. On a Mac, you can do this using Fluid.app; on Windows you can do this using Mozilla Prism.

And that brings us to the end of this rather long-winded discussion about such a simple "Like" button! Comments are welcome. Until the next post -- surf safe, and surf smart!

 

 

Footnotes:
[1] To my knowledge, there is only one other company that has this level of persistent engagement: Google's GMail remembers logins more aggressively than Facebook. When you're logged into Gmail, you're also logged into Google Search, which means they log your search history as a recognized user. This is usually a good thing for the user, since Google then has a chance to personalize your search. Google actually takes it a step further and personalizes even for non-logged in users.

[2] Yes, they can still get me by my IP, but that's unlikely since I'm usually behind firewalls.

 

Cite this post:


@article{reputationmisrepresentation,
title={{Reputation Misrepresentation, Trail Paranoia and other side effects of Liking the World}},
author={Nandi, A.},
year={2010},
journal={{Arnab's World}}
}

Upcoming VLDB Trip : Lyon, France

I'm looking forward to my talk at VLDB 2009 in Lyon, France. I will be presenting "HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching", which is joint work I did with Phil Bernstein during my internship at Microsoft Research. The talk is scheduled for Tuesday, August 25, 2009 at 2pm in the Rhône 2 room at the conference venue.

Also look out for my labmate Bin Liu's paper with our advisor, "Using Trees to Depict a Forest".


Yahoo: Just like the old times

I’m excited to go to work today, knowing that I will be witness, first hand, to one of the more incredible business deals being announced in the valley: Microsoft powering Yahoo Search.

There's a lot that I want to say about this, but for now, I will leave you with this image, from when Yahoo! used to be powered by Google. (Many people believe that powering Yahoo was what made Google popular with the mainstream audience, and that Google owes who it is today to Yahoo.)

An excerpt from Wikipedia:

In 2002, they bought Inktomi, a “behind the scenes” or OEM search engine provider, whose results are shown on other companies’ websites and powered Yahoo! in its earlier days. In 2003, they purchased Overture Services, Inc., which owned the AlltheWeb and AltaVista search engines.

AlltheWeb, Altavista, Overture, Inktomi. That’s a lot of heritage.


Microsoft Research's Data-related Launches

Microsoft Research is announcing a bunch of cool data-analysis-related launches at the upcoming Faculty Summit.

First, there's the academic release of Dryad and DryadLINQ:

Dryad is a high-performance, general-purpose, distributed-computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows® operating system. DryadLINQ enables developers to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. The academic release of Dryad and DryadLINQ provides the software necessary to develop DryadLINQ applications and to run them on a Windows HPC Server 2008 cluster. The academic release includes documentation and code samples.

They also launched Project Trident, a workflow workbench, which is available for download:

Project Trident: A Scientific Workflow Workbench is a set of tools—based on the Windows Workflow Foundation—for creating and running data analysis workflows. It addresses scientists’ need for a flexible and powerful way to analyze large and diverse datasets, and share their results. Trident Management Studio provides graphical tools for running, managing, and sharing workflows. It manages the Trident Registry, schedules workflow jobs, and monitors local or remote workflow execution. For large data sets, Trident can run multiple workflows in parallel on a Windows HPC Server 2008 cluster. Trident provides a framework to add runtime services and comes with services such as provenance and workflow monitoring. The Trident security model supports users and roles that allows scientists to control access rights to their workflows.

Then there's GrayWulf:

GrayWulf builds on the work of Jim Gray, a Microsoft Research scientist and pioneer in database and transaction processing research. It also pays homage to Beowulf, the original computer cluster developed at NASA using “off-the-shelf” computer hardware.

BaconSnake: Inlined Python UDFs for Pig

I was at SIGMOD last week, and had a great time learning about new research, discussing various research problems, meeting up with old friends and making new ones. I don't recall exactly, but at one point I got into a discussion with someone about how I'm probably one of the few people who've actually had the privilege of using three of the major distributed scripting languages in production: Google's Sawzall, Microsoft's SCOPE and Yahoo's Pig. The obvious question then came up -- Which one do I like best? I thought for a bit, and my answer surprised me -- it was SCOPE, for the sole reason that it allowed inline UDFs, i.e. User Defined Functions defined in the same code file as the script.

I'm not sure whether Sawzall allows UDFs; Pig lets you link in arbitrary .jar files and call them from the language. But the Microsoft SCOPE implementation is extremely usable: the SQL forms the framework of your MapReduce chains, while the Mapper, Reducer and Combiner definitions can be written out in C# right under the SQL -- no pre-compiling / including necessary.

Here's how simple SCOPE is. Note the #CS / #ENDCS codeblock that contains the C#:

R1 = SELECT A+C AS ac, B.Trim() AS B1
     FROM R
     WHERE StringOccurs(C, "xyz") > 2

#CS
// Counts the occurrences of ptrn in str (all starting positions).
public static int StringOccurs(string str, string ptrn)
{
    int cnt = 0;
    int pos = -1;
    while (pos + 1 < str.Length)
    {
        pos = str.IndexOf(ptrn, pos + 1);
        if (pos < 0) break;
        cnt++;
    }
    return cnt;
}
#ENDCS

Since I'm working at Yahoo! Research this summer, and I missed this feature so much, I thought -- why not scratch this itch and fix the problem for Pig? Also, while we're at it, maybe we can use a cleaner language than Java to write the UDFs?

Enter BaconSnake (available here), which lets you write your Pig UDFs in Python! Here's an example:

-- Script calculates average length of queries at each hour of the day

raw = LOAD 'data/excite-small.log' USING PigStorage('\t')
           AS (user:chararray, time:chararray, query:chararray);

houred = FOREACH raw GENERATE user, baconsnake.ExtractHour(time) as hour, query;

hour_group = GROUP houred BY hour;

hour_frequency = FOREACH hour_group 
                           GENERATE group as hour,
                                    baconsnake.AvgLength($1.query) as count;

DUMP hour_frequency;

-- The excite query log timestamp format is YYMMDDHHMMSS
-- This function extracts the hour, HH
def ExtractHour(timestamp):
	return timestamp[6:8]

-- Returns average length of query in a bag
def AvgLength(grp):
	sum = 0
	for item in grp:
		if len(item) > 0:
			sum = sum + len(item[0])	
	return str(sum / len(grp))

Everything in this file is normal Pig, except the highlighted parts -- they're Python definitions and calls.

It's pretty simple under the hood, actually. BaconSnake creates a wrapper function using Pig's UDF mechanism that takes the Python source as input along with the parameters. Jython 2.5 is used to embed the Python runtime into Pig and call the functions.

Using this is easy: you basically convert the nice-looking "baconsnake" file above (the .bs file :P) into regular Pig Latin and run it like so:

cat scripts/histogram.bs | python scripts/bs2pig.py > scripts/histogram.pig
java -jar lib/pig-0.3.0-core.jar -x local scripts/histogram.pig

Behind the scenes, the BaconSnake python preprocessor script includes the jython runtime and baconsnake's wrappers and emits valid Pig Latin which can then be run on Hadoop or locally.
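If you're curious, here's a rough Python sketch of what a preprocessor along these lines has to do (this is not the actual bs2pig.py, and baconsnake.PyFunc is a made-up wrapper name): pull the def blocks out of the .bs file, hand each function's source to the Jython-backed wrapper, and pass the remaining Pig Latin through untouched.

import re
import sys

# A "def" line plus its indented (or blank) continuation lines.
PYDEF = re.compile(r"(def (\w+)\([^)]*\):\n(?:(?:[ \t].*)?\n)*)")

def preprocess(bs_source):
    defines = []
    for block, name in PYDEF.findall(bs_source):
        # Ship the function source into a wrapper UDF as a string literal.
        defines.append("DEFINE %s baconsnake.PyFunc('%s');"
                       % (name, block.replace("'", "\\'")))
    return "\n".join(defines) + "\n" + PYDEF.sub("", bs_source)

if __name__ == "__main__":
    sys.stdout.write(preprocess(sys.stdin.read()))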

Important Notes: This is PURELY a proof-of-concept written for entertainment purposes only. It is meant only to demonstrate the ease of use of inline functions in a simple scripting language. Only simple String-to-String (Mappers) and DataBag-to-String (Reducers) functions are supported -- you're welcome to extend this to support other datatypes, or even write Algebraic UDFs that will work as Reducers / Combiners. Just drop me a line if you're interested and would like to extend it!

Go checkout BaconSnake at Google Code!

Update: My roommate Eytan convinced me to waste another hour of my time and include support for Databags, which are exposed as Python lists. I've updated the relevant text and code.

Update (Apr 2010): Looks like BaconSnake's spirit is slowly slithering into Pig Core! Also some attention from the Hive parallel universe.

new version of hotmail

An interesting sighting on the Microsoft campus: Hotmail (aka Windows Live Mail Ultimate Express Service Pack xp Developer Network Edition) just launched their latest version of mail services:

[Screenshot: the new version of Hotmail]

First Day at Microsoft Research

And so it begins; yet another crazy summer, this time at the world’s largest Computer Science department outside of a university. Today was my first day of my internship with the Database Group at MSR. Some lessons and observations:

  • MS Recruiting is a refined and well-oiled machine. HR departments everywhere should learn from them.
  • Apparently it rains in June in Redmond. That’s ALL of June. Every single hour. So much for the picture-perfect weather I saw a couple weeks ago. As my mentor put it, “That was the demo”.
  • Seattle traffic is rather well behaved. Being a driving newbie making millions of driving-etiquette faux pas per second during rush hour would have got me into serious trouble in Michigan.
  • You will shake hands with luminaries. Many of them. In a span of 10 minutes. Try not to jump around and look like a giggling preteen.
  • This one’s important: Before you start working for Microsoft, it is advisable to at least try using Windows. Going from a life completely devoid of Internet Exploder or LookOut Express into a world where EVERYTHING carries the Microsoft logo, is not ideal for the brain. My eyes literally had trouble focusing for the first few hours. Running a Vista / Outlook / IE / Studio stack after an elegantly simple life in unix-land is like jumping from a well-oiled bicycle into the cockpit of a thundering fighter jet.
  • “Google” is not a verb. Live Search. Live Maps. Live Live Live!

life update

Was in Seattle last week. Highlights:


Microsoft Research Building (above); UofM CS Building (below)

Microsoft Research’s new building is fancier than our new department building, but our offices are much better.

Lunching at the Space Needle is worth the high price. You get to sit and eat your lunch while downtown Seattle, Mt. Rainier and the lake revolve around you. The food is good, the view is spectacular on a good day and the service is eager and efficient.


Jidori Chicken with Roasted Garlic-Mashed Potatoes, White Zinfandel

There is an On Stage interactive exhibit at The Experience Music Project that lets you pretend you are a rock star, complete with simulated audience, a fully set-up stage (with monitors) and instruments that play themselves. The $20 video they took of me was a little pricey in my opinion, but if you're going as a group, this is totally worth it.


Art installation at the EMP

photocredits: me, rob scoble, eduardod

bags, balls and boyfriends

Links today brought to you by Red Bull™, my abusive friend in a can pushing me through a rather crazy day.

  • Lego Schoolbag: If I were a 10-year-old girl, I would give away my younger brother for this one.
  • Every expression in this picture is priceless. I like how our hero has resigned himself to prayer.
  • A hilarious sketch from Snuff Box, starring Matt Berry, who also stars in the britcom The IT Crowd.
  • Adobe is opening up the SWF and FLV formats with the Open Screen project. (No sir, this is not about Single White Females or Fine Looking Virgins.) Flash has been sort of open for a while, with projects like SWFTools and GNash, but this takes things to a whole new level, with a slew of bigwig corporate backers. Flash and FLV have been, in my opinion, the critical enablers of the online video revolution, and this is definitely a great step ahead. I'm curious to know what Microsoft's Silverlight team is thinking, as well as the folks at Sun (who just opened up all of Java). And of course, let's not forget the Android folks, who have a very pretty stack; tacking on some Flash magic would definitely be a very big deal. Considering the significant overlap between supporters of the Adobe effort and the Google effort, this is going to be fun to watch.

startup idea #4984

Here’s an idea I thought of a while ago. You have the storm botnet, which is apparently now capable of being the world’s most powerful supercomputer:

The Storm botnet, or Storm worm botnet, is a massive network of computers linked by the Storm worm Trojan horse in a botnet, a group of “zombie” computers controlled remotely. It is estimated to run on as many as 1,000,000 to 50,000,000 infected and compromised computer systems as of September 2007. Its formation began around January, 2007, when the Storm worm at one point accounted for 8% of all infections on all Microsoft Windows computers.

The botnet reportedly is powerful enough as of September 2007 to force entire countries off of the Internet, and is estimated to be able to potentially execute more instructions per second than some of the world’s top supercomputers.

Obviously, having a large supercomputer is big business these days. So what you do is build a legal version of this. Let's say you sell computers at 70% of their real price. The only catch is that buyers have to run a special piece of software as part of the system -- basically a remote compute client similar to Folding@Home or Google Compute.

Once you have sold enough computers, you essentially have a large army of machines at your beck and call, for 30% of the price you would otherwise have to invest. Of course, someone else owns the machines, but while they are doing lightweight tasks such as checking email and chatting, you are folding proteins, running simulations and cracking ciphers.
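Software-wise, the bundled client is not much more than a work-fetching loop. A minimal Python sketch, with an entirely hypothetical grid operator endpoint:

import time
import urllib.request

GRID = "http://grid.example.com"  # hypothetical operator endpoint

def crunch(work_unit):
    """Stand-in for the real payload: folding proteins, running
    simulations, cracking ciphers..."""
    return work_unit[::-1]

def run_forever():
    while True:
        work = urllib.request.urlopen(GRID + "/work").read()
        result = crunch(work)
        # POST the result back to the operator.
        urllib.request.urlopen(GRID + "/result", data=result)
        time.sleep(1)  # stay lightweight; the owner is checking email

if __name__ == "__main__":
    run_forever()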

Now here's the best part of the deal: the most expensive part of a grid is not the hardware, but the electricity it uses. And guess who's paying for the electricity? The customer, not you!

So there you have it. A cheap, one-time cost for an everlasting free CPU grid. Awesome ainnit?

note: This idea is under this license.
