Archive - May 2012

  • All
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31

May 27th

Introducing the CUBE operator for Apache Pig

Guest post by Prasanth Jayachandran , who has been working on implementing CUBE support for Pig, as part of the large-scale distributed cubing effort.

Update: As per Dmitriy’s tweet:
…the naive implementation is in. The scalable count distinct impl is pending 0.11 branching, will go into 0.12.

The next version of Apache Pig will support the CUBE operator ( patch available here ). The CUBE operator represents grouped aggregations on all possible combinations of the input dimensions, for a given input measure.

This patch adds syntactic sugar to the already existing built-in CubeDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. The following example illustrates the CUBE operator usage in Apache Pig:

Basic Usage:
inp = LOAD ‘/pig/data/salesdata’ USING PigStorage() AS(product:chararray,location:chararray, year:int, sales:long);
cubed_inp = CUBE inp BY (product,location,year);
out = FOREACH cubed_inp GENERATE FLATTEN AS (product, location,year), COUNT as total, SUM as sales;

Sample output:
For a sample input tuple (ipod, miami, 2012, 200000), the above query generates all combinations of the tuple:
(ipod, miami, 2012, 1, 200000)
(ipod, NULL, NULL, 1, 200000)
(NULL, miami, NULL, 1, 200000)
(NULL, NULL, 2012, 1, 200000)
(ipod, miami, NULL, 1, 200000)
(ipod, NULL, 2012, 1, 200000)
(NULL, miami, 2012, 1, 200000)
(NULL, NULL, NULL, 1, 200000)

Output Schema for CUBE operator:
grunt> describe cubed_inp;
cubed_inp: {group: (dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int), cube: {(dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int,sales: long)}}

Note the second column in cubed_input bag ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ in load statement is pushed down to CUBE statement so that it can be referenced later while computing aggregates on the measure like in this case SUM

Upcoming Enhancements:
The current implementation is equivalent to the naive implementation in MRCube. Following are the core features that I am planning to implement as a part of the Google Summer of Code 2012 program:

  • Optimize naive implementation
  • Support for hierarchical dimension
  • Support for ROLLUP/GROUPING SETS operation similar to SQL/Oracle server
  • Distributed cubing for holistic measures

All these features should be available by end of this summer. Keep an eye on PIG-2167 (and all its sub-tasks) for more updates!


May 22nd

Skimmer: Rapid Scrolling of Relational Query Results

This paper was presented at SIGMOD 2012. [pdf] [slides]

On occasion of SIGMOD 2012, I thought I'd write up a short post about a new project, Skimmer.

When considering challenges of ad-hoc, end-user interaction with databases, user actions can be broadly categorized into three groups: (1) explicit, articulate querying of the database, (2) searching through the database, and (3) browsing through the database. Prior work in the area of database usability has recognized (2: Searching) and (3: Browsing) as being significant challenges. My dissertation work attempted to solve (2: Searching) using a combination of autocompletion and qunits. The Skimmer project looks at (3: Browsing). In our SIGMOD 2012 paper, we introduce a method that makes it easier for users to browse large datasets. Here's the abstract and the intuition behind it, followed by a cool demo video of a new implementation of Skimmer:

A relational database often yields a large set of tuples as the result of a query. Users browse this result set to find the information they require. If the result set is large, there may be many pages of data to browse. Since results comprise tuples of alphanumeric values that have few visual markers, it is hard to browse the data quickly, even if it is sorted.

In this paper, we describe the design of a system for browsing relational data by scrolling through it at a high speed. Rather than showing the user a fast-changing blur, the system presents the user with a small number of representative tuples. Representative tuples are selected to provide a “good impression” of the query result. We show that the information loss to the user is limited, even at high scrolling speeds, and that our algorithms can pick good representatives fast enough to provide for real-time, high-speed scrolling over large datasets.

The intuition behind picking representative tuples is simple: Since pages will be consumed in succession, we represent a sampled overview of the current page, but it is a good idea to bias the samples such that we avoid (re)showing information provided in the prior pages. Thus, we promote tuples that increase the information provided to the user compared to the prior pages: i.e. minimize the information loss. The paper evaluates this idea against varying scrolling speed, page size, number of attributes and information quality.

The video above demonstrates the new version of Skimmer we've been working on at Ohio State. Similar to the original Skimmer system, it initially only shows you representative tuples from a page, loading and rendering in the rest of the data in a lazy fashion. Non-rendered tuples are shown as grey bands, for clarity.

This project falls under the "Surfacing of Insights" section of the roadmap laid out in the Guided Interaction VLDB 2011 vision paper, and is only the first step in what I hope is an engaging conversation in the area of large-scale data browsing.

And of course, if you are a current student at Ohio State looking to work on cool projects such as this one, I am looking for students to join my research group. Drop me a line!