Archive - 2012


December 29th

My Year in Cities, 2012

Continuing the long-running tradition, here’s my list of cities for 2012:

New Job: I’ve meant to write so, so many blog posts about this, but this job is intense enough that there’s been absolutely no time to catalog it. Short version of the story: Landed on a new island, on track to building my little empire.

  • Columbus, OH — Very pleasantly surprised by how nice this city is. The cultural scene is impressive, the area is flat and great for biking and running, and Ohio State is immense and brimming with possibilities.

Conferences, Panels, Random Life Things:

  • Washington, D.C. — I’d never been to D.C. before, but this year I ended up visiting thrice, for all sorts of reasons.
  • Istanbul, Turkey — VLDB 2012. What a wonderful city. Spending an afternoon ferrying back and forth between Europe and Asia is pretty awesome.
  • Ann Arbor, MI — hard not to drive here a few times. It’s good to be able to visit and participate in panels as “Distinguished Alumni”.
  • Atlanta, GA — visiting a friend at the very impressive Shepherd Center. Sadly, didn’t get to actually spend a lot of time touring the place. Seemed like a warm and welcoming city though. Some other time, maybe.
  • Pittsburgh, PA — Visited Carnegie Mellon. Very pretty campus. The drive from Columbus is very pleasant too.
  • Reykjavík, Iceland — “Surreal” is a good word to use here. So amazing.

8 cities, 3 continents. Not quite the count for 2011, but then again, a lot more going on. Looking forward to 2013!


May 27th

Introducing the CUBE operator for Apache Pig

Guest post by Prasanth Jayachandran, who has been working on implementing CUBE support for Pig, as part of the large-scale distributed cubing effort.

Update: As per Dmitriy’s tweet:
…the naive implementation is in. The scalable count distinct impl is pending 0.11 branching, will go into 0.12.

The next version of Apache Pig will support the CUBE operator (patch available here). The CUBE operator represents grouped aggregations on all possible combinations of the input dimensions, for a given input measure.

This patch adds syntactic sugar on top of the existing built-in CubeDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using the CUBE operator. The following example illustrates its usage in Apache Pig:

Basic Usage:
inp = LOAD '/pig/data/salesdata' USING PigStorage() AS (product:chararray, location:chararray, year:int, sales:long);
cubed_inp = CUBE inp BY (product, location, year);
out = FOREACH cubed_inp GENERATE FLATTEN(group) AS (product, location, year), COUNT(cube) AS total, SUM(cube.sales) AS sales;

Sample output:
For a sample input tuple (ipod, miami, 2012, 200000), the above query generates all combinations of the tuple:
(ipod, miami, 2012, 1, 200000)
(ipod, NULL, NULL, 1, 200000)
(NULL, miami, NULL, 1, 200000)
(NULL, NULL, 2012, 1, 200000)
(ipod, miami, NULL, 1, 200000)
(ipod, NULL, 2012, 1, 200000)
(NULL, miami, 2012, 1, 200000)
(NULL, NULL, NULL, 1, 200000)
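The 2^n groupings shown above can be sketched in plain Python. This is an illustrative stand-in for the CubeDimensions behaviour, not the Pig implementation; the function name is hypothetical:

```python
# Sketch: enumerate every CUBE grouping of one input tuple by either
# keeping or NULLing each dimension, yielding 2**len(dims) combinations.
from itertools import product

def cube_combinations(dims):
    """Yield all kept/NULLed combinations of the dimension values."""
    for mask in product([True, False], repeat=len(dims)):
        yield tuple(d if keep else None for d, keep in zip(dims, mask))

rows = list(cube_combinations(("ipod", "miami", 2012)))
# 8 combinations, from the fully specified tuple down to all-NULL
```

Each yielded tuple corresponds to one of the eight output groups above (with `None` standing in for NULL).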

Output Schema for CUBE operator:
grunt> describe cubed_inp;
cubed_inp: {group: (dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int), cube: {(dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int,sales: long)}}

Note the second column of cubed_inp: the ‘cube’ field is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ from the load statement is pushed down into the CUBE statement, so that it can be referenced later when computing aggregates over the measure (SUM, in this case).

Upcoming Enhancements:
The current implementation is equivalent to the naive implementation in MRCube. The following are the core features I am planning to implement as part of the Google Summer of Code 2012 program:

  • Optimize naive implementation
  • Support for hierarchical dimensions
  • Support for ROLLUP/GROUPING SETS operations, similar to SQL Server/Oracle
  • Distributed cubing for holistic measures

All these features should be available by the end of this summer. Keep an eye on PIG-2167 (and all its sub-tasks) for more updates!


May 22nd

Skimmer: Rapid Scrolling of Relational Query Results

This paper was presented at SIGMOD 2012. [pdf] [slides]

On occasion of SIGMOD 2012, I thought I'd write up a short post about a new project, Skimmer.

When considering challenges of ad-hoc, end-user interaction with databases, user actions can be broadly categorized into three groups: (1) explicit, articulate querying of the database, (2) searching through the database, and (3) browsing through the database. Prior work in the area of database usability has recognized (2: Searching) and (3: Browsing) as being significant challenges. My dissertation work attempted to solve (2: Searching) using a combination of autocompletion and qunits. The Skimmer project looks at (3: Browsing). In our SIGMOD 2012 paper, we introduce a method that makes it easier for users to browse large datasets. Here's the abstract and the intuition behind it, followed by a cool demo video of a new implementation of Skimmer:

A relational database often yields a large set of tuples as the result of a query. Users browse this result set to find the information they require. If the result set is large, there may be many pages of data to browse. Since results comprise tuples of alphanumeric values that have few visual markers, it is hard to browse the data quickly, even if it is sorted.

In this paper, we describe the design of a system for browsing relational data by scrolling through it at a high speed. Rather than showing the user a fast-changing blur, the system presents the user with a small number of representative tuples. Representative tuples are selected to provide a “good impression” of the query result. We show that the information loss to the user is limited, even at high scrolling speeds, and that our algorithms can pick good representatives fast enough to provide for real-time, high-speed scrolling over large datasets.

The intuition behind picking representative tuples is simple: since pages are consumed in succession, we present a sampled overview of the current page, but bias the samples to avoid (re)showing information already provided on prior pages. Thus, we promote tuples that increase the information delivered to the user relative to the prior pages, i.e., minimize the information loss. The paper evaluates this idea against varying scrolling speed, page size, number of attributes, and information quality.
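A toy sketch of this biasing idea (not the paper’s actual algorithm; function names and sample data are made up): greedily prefer tuples whose attribute values have not already been shown on prior pages.

```python
# Toy sketch: pick k "representative" tuples from a page, scoring each
# tuple by how many of its attribute values the user has not yet seen.
def pick_representatives(page, seen_values, k):
    def novelty(tup):
        # count of attribute values that would be new to the user
        return sum(1 for v in tup if v not in seen_values)
    reps = sorted(page, key=novelty, reverse=True)[:k]
    for tup in reps:
        seen_values.update(tup)  # remember what was shown
    return reps

seen = set()
page1 = [("ipod", "miami"), ("ipad", "miami"), ("ipod", "tampa")]
reps1 = pick_representatives(page1, seen, 2)
# On the next page, tuples that only repeat earlier values score lower:
page2 = [("ipod", "miami"), ("zune", "boise")]
reps2 = pick_representatives(page2, seen, 1)  # -> [("zune", "boise")]
```

The real system must of course do this fast enough to keep up with high-speed scrolling, which is the algorithmic contribution of the paper.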

The video above demonstrates the new version of Skimmer we've been working on at Ohio State. Like the original Skimmer system, it initially shows only representative tuples from a page, loading and rendering the rest of the data lazily. Non-rendered tuples are shown as grey bands, for clarity.

This project falls under the "Surfacing of Insights" section of the roadmap laid out in the Guided Interaction VLDB 2011 vision paper, and is only the first step in what I hope is an engaging conversation in the area of large-scale data browsing.

And of course, if you are a current student at Ohio State looking to work on cool projects such as this one, I am looking for students to join my research group. Drop me a line!


February 8th

Calling all hackers

I put up this post on Twitter and in a few places around the university:

“The data team is looking for hackers. 614-f14cfb5abaa0ee055e1222018cc1ec4f”

A few hints:

* If 614-f14… doesn’t work for you, also try 614-ba18b632839188a00b6f0e3784a14bd9. Both have the same desired outcome. (ba18… is what I’d initially posted and is harder to solve for, imho)
* I later tweeted: “Python solution fits in a tweet; takes 5 min to write, <5 to run.”
* 614 is the Columbus phone area code.

Looking forward to a few people cracking this code!

Update: Here’s the solution in both Python and SQL:

from hashlib import md5
[x for x in range(10**7) if md5(b"%03d-%04d" % (x // 10000, x % 10000)).hexdigest() == 'f14cfb5abaa0ee055e1222018cc1ec4f']

SQL (assumes a table “n” containing a single column n, with tuples 0 through 9):
SELECT *, MD5(CONCAT(n1.n, n2.n, n3.n, '-', n4.n, n5.n, n6.n, n7.n)) AS x
FROM n AS n1, n AS n2, n AS n3, n AS n4, n AS n5, n AS n6, n AS n7
HAVING x = 'f14cfb5abaa0ee055e1222018cc1ec4f';