Introducing the CUBE operator for Apache Pig

Guest post by Prasanth Jayachandran , who has been working on implementing CUBE support for Pig, as part of the large-scale distributed cubing effort.

Update: As per Dmitriy’s tweet:
…the naive implementation is in. The scalable count distinct impl is pending 0.11 branching, will go into 0.12.

The next version of Apache Pig will support the CUBE operator ( patch available here ). The CUBE operator represents grouped aggregations on all possible combinations of the input dimensions, for a given input measure.

This patch adds syntactic sugar to the already existing built-in CubeDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. The following example illustrates the CUBE operator usage in Apache Pig:

Basic Usage:
inp = LOAD ‘/pig/data/salesdata’ USING PigStorage() AS(product:chararray,location:chararray, year:int, sales:long);
cubed_inp = CUBE inp BY (product,location,year);
out = FOREACH cubed_inp GENERATE FLATTEN AS (product, location,year), COUNT as total, SUM as sales;

Sample output:
For a sample input tuple (ipod, miami, 2012, 200000), the above query generates all combinations of the tuple:
(ipod, miami, 2012, 1, 200000)
(ipod, NULL, NULL, 1, 200000)
(NULL, miami, NULL, 1, 200000)
(NULL, NULL, 2012, 1, 200000)
(ipod, miami, NULL, 1, 200000)
(ipod, NULL, 2012, 1, 200000)
(NULL, miami, 2012, 1, 200000)
(NULL, NULL, NULL, 1, 200000)

Output Schema for CUBE operator:
grunt> describe cubed_inp;
cubed_inp: {group: (dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int), cube: {(dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int,sales: long)}}

Note the second column in cubed_input bag ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ in load statement is pushed down to CUBE statement so that it can be referenced later while computing aggregates on the measure like in this case SUM

Upcoming Enhancements:
The current implementation is equivalent to the naive implementation in MRCube. Following are the core features that I am planning to implement as a part of the Google Summer of Code 2012 program:

  • Optimize naive implementation
  • Support for hierarchical dimension
  • Support for ROLLUP/GROUPING SETS operation similar to SQL/Oracle server
  • Distributed cubing for holistic measures

All these features should be available by end of this summer. Keep an eye on PIG-2167 (and all its sub-tasks) for more updates!

|

About the author:

Arnab Nandi is an Assistant Professor in the Department of Computer Science and Engineering at The Ohio State University. You can read more about him here.


August 2002 : 9 posts September 2002 : 16 posts October 2002 : 7 posts November 2002 : 21 posts December 2002 : 25 posts January 2003 : 8 posts February 2003 : 11 posts March 2003 : 7 posts April 2003 : 21 posts May 2003 : 14 posts June 2003 : 15 posts July 2003 : 4 posts August 2003 : 16 posts September 2003 : 25 posts October 2003 : 15 posts November 2003 : 24 posts December 2003 : 17 posts January 2004 : 6 posts February 2004 : 8 posts March 2004 : 6 posts April 2004 : 5 posts May 2004 : 29 posts June 2004 : 3 posts July 2004 : 17 posts August 2004 : 19 posts September 2004 : 3 posts October 2004 : 4 posts December 2004 : 1 posts February 2005 : 15 posts March 2005 : 18 posts April 2005 : 8 posts May 2005 : 27 posts June 2005 : 73 posts July 2005 : 45 posts August 2005 : 13 posts September 2005 : 3 posts October 2005 : 9 posts November 2005 : 20 posts December 2005 : 6 posts January 2006 : 25 posts February 2006 : 24 posts March 2006 : 37 posts April 2006 : 35 posts May 2006 : 7 posts June 2006 : 22 posts July 2006 : 20 posts August 2006 : 27 posts September 2006 : 15 posts October 2006 : 6 posts November 2006 : 19 posts December 2006 : 4 posts January 2007 : 4 posts February 2007 : 1 posts March 2007 : 3 posts May 2007 : 5 posts June 2007 : 2 posts July 2007 : 1 posts August 2007 : 13 posts September 2007 : 2 posts October 2007 : 21 posts November 2007 : 7 posts December 2007 : 9 posts January 2008 : 4 posts February 2008 : 14 posts March 2008 : 14 posts April 2008 : 11 posts May 2008 : 12 posts June 2008 : 12 posts July 2008 : 5 posts August 2008 : 10 posts September 2008 : 11 posts October 2008 : 10 posts November 2008 : 8 posts December 2008 : 4 posts January 2009 : 6 posts February 2009 : 13 posts March 2009 : 7 posts April 2009 : 7 posts May 2009 : 2 posts June 2009 : 3 posts July 2009 : 4 posts August 2009 : 4 posts September 2009 : 6 posts October 2009 : 4 posts November 2009 : 7 posts December 2009 : 10 posts January 2010 : 3 posts February 2010 : 2 posts April 2010 : 5 posts May 2010 : 1 posts July 2010 : 4 posts August 2010 : 3 posts September 2010 : 4 posts October 2010 : 1 posts November 2010 : 2 posts December 2010 : 3 posts June 2011 : 1 posts August 2011 : 1 posts November 2011 : 1 posts December 2011 : 1 posts February 2012 : 1 posts May 2012 : 2 posts December 2012 : 1 posts June 2013 : 1 posts August 2013 : 1 posts October 2013 : 2 posts