I am always looking for undergraduate, PhD students and post-docs to join us in building the next generation of interactive database systems. My group works on fun, fast-paced research that asks a simple question: “How do we let humans interact with data more effectively?” The answers often lie in a wide range of topics including database systems, human computer interaction, information visualization, statistical sampling, and information retrieval.
Please contact me if you would like to work with my group. Before contacting me, be sure to read at least one of my publications, and provide include at least a few comments / questions in your email about my research. This is a little more effort than simply describing yourself, but it helps me have a better appreciation of your abilities. I try to respond to most students, but avoid replying to students who send out mass-copy-paste-emails without even considering the recipient.
Here are some links and resources for upcoming applicants that I think are useful to know and may help improve your applications.
Eugene Wu has a nice list of qualities for an ideal graduate student. (Eugene and I are collaborators on the Perceptvis project, you can apply to either of us to work in this area.)
Vijay Chidambaram prospective students lists some common expectations (from both the advisor and the advisee) that are worth asking your prospective advisor about.
Matt Might’s blog is a really great collection, and has a section devoted to graduate school, including a nice HOWTO.
Several students often make their graduate decisions based on US News / other simplistic rankings. This is typically a bad idea: please spend more time looking into your specific program. If you are planning to spend half a decade of your life somewhere, shouldn’t you be spending some more time engaging with your prospective advisor / peers?
If you are aspiring towards being a professor after your PhD, make sure you read Clauset et al.‘s study into faculty hiring networks that explains the current realities of faculty hiring.
Trip report is authored by Ph.D. students in my research group Niranjan Kamat and Lilong Jiang, who presented at the VLDB 2014 conference in Hangzhou, China.
We recently returned from VLDB 2014 in Hangzhou, China where we presented our recent contributions to the GestureDB and DICE projects. VLDB is one of the top conferences in database research and we were very excited to meet so many DB researchers, discuss different projects, and listen to interesting talks. The organizers had also planned an evening lights show in WestLake that was incredible.
Lilong presented his full paper on Gestural Query Specification. In this paper, we ask, “Can gestures be used to articulate queries on a database?” We describe various components of our multitouch iPad-based prototype (that we presented as a demo in VDLB 2013) — the classifier, gestural language and architecture for query specification. We discuss our usability studies and show that gestural querying is indeed easy to use and more intuitive when compared to traditional database query interfaces.
In addition to the in-memory database keynote by Hasso Plattner (co-founder of SAP) and Volker Markl’s keynote about declarative querying, there was an interesting keynote by our grand-advisor Prof. H.V. Jagadish at the Data4U workshop. Prof. Jagadish provided numerous insights in the database usability area. The key idea was the fact that query processing time is mainly divided into query authoring time and query execution time. For decades, researchers have spent countless hours working on query execution side of things. It looks like a good idea to spend more efforts now on the query authoring aspect. In addition to Data4U, there were two sessions for database usability, reinforcing the above observation.
Another interesting joint keynote was by Dr. Shivkumar Venkatraman (VP Engineering Ads and Commerce at Google) who presented the changes the ads backend at Google went through as it progressed from a company with a revenue of a few hundred millions to the behemoth of today and by Prof. Divy Agrawal, who presented his views on treating datacenters as computers.
There was an interesting panel moderated by Dr. C. Mohan of IBM on the Role of Databases in the era of Big Data. Are programming languages and distributed systems luminaries winning the battle against the DB community over the future of data? The panel included Michael Carey, Surajit Chaudhari, Ashish Gupta, Wolfgang Lehner, Chris Re, Gera Shegalov. It was a fascinating panel punctuated at the end by inputs from Divy.
Some talks that we liked
There were several great papers being presented at VLDB and we tried to attend as many of the presentations as possible. Here is a small subset of some of the papers we enjoyed:
A TV ad in the 2010 elections claimed that Jim Marshall, a Democratic incumbent from Georgia “voted the same as Republican leaders 65 percent of the time. This comparison was made with Republican Leader John Boehner over the votes in 2010. If we look at the history since 2007, however, the number would have been only 56 percent, which is not very high considering the fact that even the Democratic Whip, Jim Clyburn, voted 44 percent of the time with Boehner during that period. Basically, many votes in Congress are not as controversial as the public would think!
The authors describe their fact-checking framework that works by “perturbing” the parameterized query formulations of the claims in interesting ways.
Exemplar Queries: Give me an Example of What You Need: In this paper, when a user inputs a query in search engine, it considers the query an example instead of a query and tries to return similar examples to the user. The paper describes an implementation of this for graph-modeled data and provide a nice evaluation using Freebase.
The Case for Data Visualization Management Systems This is a very interesting vision paper that was in the same session as our “Gestural Query Specification” paper. in this paper, the author tries to combine the visualization with DBMS together and utilize the existing DBMS to serve visualization, with examples of techniques from imMens, Tableau, nanocubes, bigvis, M4 and DICE, amongst others.
A Sampling Algebra for Aggregate Estimation: This paper extends Generalized Uniform Sampling to allow for sampling operators to commute with selections and joins, which is a very powerful contribution. Since it is not currently possible to obtain a random sample of a join by sampling both the relations, the paper discussed techniques to obtain the sample variance even without having a random sample. We are excited about doing similar work in DICE to further optimize our query plans.
ClusterJoin: A Similarity Joins Framework using MapReduce: The paper addresses this challenge by partitioning the space using random samples as partition centers to reduce the probable pairs. This method first rules out and prunes impossible candidate pairs without performing actual comparisons by using bisector-based reasoning, and proposes a load balancing scheme to avoid “the curse of the last reducer”. The paper has some interesting evaluations over spatial and document data.
The conference was a lot of fun and a great learning experience. We are now working on our next contributions to DICE and GestureDB, and hope to present them at upcoming venues.
A friend introduced me to the game 24, a simple game you can play with a deck of playing cards. Over the course of playing the game, I realized that either my friend was really good at this game, or that I wasn't!
Given how simple the game was, I figured that I could write a program that was good at it. Scoping the problem out, it was a perfect weekend hack project, and a good break from all the grant writing, research and administrivia of my job. Also, I have always been fascinated by all the cool things my colleagues in Computer Vision do, so it was a good way to learn about some of the fundamentals.
Here's a description of the game, from the Wikipedia article:
The 24 Game is an arithmetical card game in which the object is to find a way to manipulate four integers so that the end result is 24. Addition, subtraction, multiplication, or division, and sometimes other operations, may be used to make four digits from one to nine equal 24. For example, card with the numbers 4,7,8,8, a possible solution is the following: (7-(8/8))*4=24.
The game is fun to play 1-on-1, where the first person to come up with a valid expression gets to keep the cards. Once the deck is exhausted, the person with the most cards is declared the winner.
An Automatic Solution
My vision for an automated version of the game was simple. Players sit across a table on which the cards are laid out. My program would take a picture of the cards and recognize them. It would then generate valid expression that yielded 24, and then project the answer on to the table. Before going into the details, here's a video of how the system would work:
Parts of the Problem
Let's think through the system and break it down into parts. The heart of the problem is to search for a valid expression that yields 24. To get the digits, we have to look at the cards using a camera, and identify that they are cards. Then, we need to process each card and recognize what the digits are. Then, once we get the expression, we project them onto the table surface using a projector.
Searching for Expressions:
Before getting into the computer vision and image processing bits, let's get the math out of the way. Consider the following set of cards. The digits here are: 3,7,8,9.
For these 4 digits, the search space (i.e. the set of all the possible candidates) can be enumerated as:
All the permutations of the digits (e.g. [3,7,8,9], [9,8,7,3] and so on)
All possible operators insertions for each permutation (e.g. [3 + 7 + 8 + 9], [3 * 7 - 8 + 9])
All possible parenthesizations of each operator configuration (e.g. 3 * [ 7 - 8 + 9 ] )
As it turns out, the number of possible candidates is bounded (for the curious, see Catalan Numbers) and hence tractable to brute force. Optimizations such as memoization (e.g. [3 * (7 + 8 + 9)] and [3 / (7 + 8 + 9)] both have the same subexpression (7 + 8 + 9) which can be computed and cached) and taking into account the associativity of operators (e.g. 3 * [ 7 - 8 + 9 ] does not need a nested parenthesis since - and + are associative.) help reduce the search space, speeding up the computation.
Note that not all digits generate valid 24 expressions, so there is a chance we'll have to say "Expression impossible".
Identifying and Registering Cards
Now that we have the solution, let's turn to the fun part -- seeing the cards! As we can see in the image above, we have a dark background for a tabletop, and the cards are laid out in a 2x2 row. They are however not quite vertically aligned, and the picture itself is at a perspective. Thus, we first need to perform image segmentation to pick out each card by itself, and then image registration to line up the picture of each card with a flat, rectangular representation of a card.
The python API for OpenCV is very easy to use for these tasks, and there's a ton of informative blog posts that help. We first read the picture and preprocess(greyscale, blur and threshold) it to get rid of artifacts so that we can focus on the real stuff:
We then find all contours in the image. These can be edges of the cards themselves, or contours of the figures and letters in the cards. Thus, we look for the four contours that span the most area -- which have to be the four cards themselves.
This takes the original image and identifies the top 4 contours (aka the four cards), which we can then cycle through one by one. Note the grey lines in the following image -- they are the contours identified, which include the symbols on the cards themselves.
Now, for each card, we need to register each card into a rectangular representation. To do this, we identify the rectangle representation of each card. This is done by approximating a polynomial from the contour (which is simply a vector of points) and then finding the minimum rotation-free bounding box:
for i in range(numcards):
card = contours[i]
peri = cv2.arcLength(card,True)
approx = cv2.approxPolyDP(card,0.02*peri,True)
rect = cv2.minAreaRect(contours)
r = cv2.cv.BoxPoints(rect)
Then, for each rectangle, we perform an Affine Transform to transform the image of each card into a rectangular representation, so that it can be easily recognized:
Now that we have a registered representation of the card, The next step is to recognize it. For this, we use a training deck -- to tell our code what each card looks like in the first place.
We pipe this image through the same code as above, creating registered representations of each card, and then labeling each card by hand. In my code, I created a simple tab-separated-values file that stored the card information in order of how the registered cards were listed by the code:
55 * *
54 8 S
53 K S
52 Q S
Note that "*" is used to represent Joker / other cards, which were also in the deck.
Now that we know what each card looks like, the next step is to match it up with the incoming candidate card. Both are lined up / registered, and both have been through the same preprocessing. There are a lot of robust algorithms to this problem such as SIFT and SURF which can be used for this problem: We could recognize each digit / letter using character recognition, or even recognize the suit symbols on each card and count them to get the digit.
The simplest option here, however, is to simply ask -- how different is the picture of the incoming card from each of the training card? We can simply write this as:
The steps in the above lines of code are: First transform the image to a grayscale colorspace and then blur it to get rid of artifacts that arise from lighting and camera noise. Then, perform an adaptive thresholding -- i.e. highlight differences in black / white, which articulates the symbols and inscriptions in the image. To make up for slight differences in alignment, we blur the image again, before and after performing an absolute difference of one image from another, and then summing the intensity of pixels that are different.
Here is the card 8 of spades being compared against Ace of Hearts. The right third of the image is the result. As you can see, the subtraction leads to a lot of artifacts:
In contrast, here's what happens when we compare the registered picture of 8 of spades from our perspective image above, against the picture of 8 of spades from the training deck. As you can see, the result is almost fully black, which means the difference is minimal!
Thus recognizing each card becomes a simple process of comparing each incoming card against each card against the deck, and going with the one with the minimum difference. As noted before, there are many more complex and robust approaches (including ones that involve the use of classifiers). However, since this is a fun project and I only had a few hours to write code, we're going to go with the version that takes the least amount of time!
One thing to note is that the card identification process could be reused for any card game -- not just this one.
Putting it Together
Given all the parts of the system, we can now get pictures from the webcam, and identify the cards from the picture. We then compute the solution for the numbers, and then print it out back onto the table by projecting it on the projector, which is connected to the computer.
training = img.get_training(training_labels_filename,training_image_filename,num_training_cards,avoid_cards)
webcam = cv2.VideoCapture(0)
if webcam.isOpened(): # try to get the first frame
rval = False
rval, im = webcam.read()
cards = [img.find_closest_card(training,c) for c in img.getCards(im,num_cards)]
cards = ['1' if c == 'A' else c for c in cards]
And there you have it, an end-to-end, automated player for the game 24! If you would like to try running the code yourself, the card recognition code, training deck image and test images are available on Github.
New Job: I’ve meant to write so, so many blog posts about this, but this job is intense enough that there’s been absolutely no time to catalog it. Short version of the story: Landed on a new island, on track to building my little empire.
Columbus, OH — Very pleasantly surprised by how nice this city is. The cultural scene is impressive, the area is flat and great for biking and running, and Ohio State is immense and brimming with possibilities.
Conferences, Panels, Random Life Things:
Washington, D.C. — I’d never been to DC ever before, but this year I ended up visiting D.C. thrice, for all sorts of reasons.
Istanbul, Turkey — VLDB 2012. What a wonderful city. Spending an afternoon ferrying back and forth between Europe and Asia is pretty awesome.
Ann Arbor, MI — hard to not drive here a few times. It’s good to be able to visit and participate in panels as “Distinguished Alumni”.
Atlanta, GA — visiting a friend at the very impressive Shepherd Center. Sadly, didn’t get to actually spend a lot of time touring the place. Seemed like a warm and welcoming city though. Some other time, maybe.
Pittsburgh, PA — Visited Carnegie Mellon. Very pretty campus. The drive from Columbus is very pleasant too.
Guest post by Prasanth Jayachandran , who has been working on implementing CUBE support for Pig, as part of the large-scale distributed cubing effort.
Update: As per Dmitriy’s tweet: …the naive implementation is in. The scalable count distinct impl is pending 0.11 branching, will go into 0.12.
The next version of Apache Pig will support the CUBE operator ( patch available here ). The CUBE operator represents grouped aggregations on all possible combinations of the input dimensions, for a given input measure.
This patch adds syntactic sugar to the already existing built-in CubeDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. The following example illustrates the CUBE operator usage in Apache Pig:
inp = LOAD ‘/pig/data/salesdata’ USING PigStorage() AS(product:chararray,location:chararray, year:int, sales:long);
cubed_inp = CUBE inp BY (product,location,year);
out = FOREACH cubed_inp GENERATEFLATTEN AS (product, location,year), COUNT as total, SUM as sales;
For a sample input tuple (ipod, miami, 2012, 200000), the above query generates all combinations of the tuple:
(ipod, miami, 2012, 1, 200000)
(ipod, NULL, NULL, 1, 200000)
(NULL, miami, NULL, 1, 200000)
(NULL, NULL, 2012, 1, 200000)
(ipod, miami, NULL, 1, 200000)
(ipod, NULL, 2012, 1, 200000)
(NULL, miami, 2012, 1, 200000)
(NULL, NULL, NULL, 1, 200000)
Note the second column in cubed_input bag ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ in load statement is pushed down to CUBE statement so that it can be referenced later while computing aggregates on the measure like in this case SUM
The current implementation is equivalent to the naive implementation in MRCube. Following are the core features that I am planning to implement as a part of the Google Summer of Code 2012 program:
Optimize naive implementation
Support for hierarchical dimension
Support for ROLLUP/GROUPING SETS operation similar to SQL/Oracle server
Distributed cubing for holistic measures
All these features should be available by end of this summer. Keep an eye on PIG-2167 (and all its sub-tasks) for more updates!
On occasion of SIGMOD 2012, I thought I'd write up a short post about a new project, Skimmer.
When considering challenges of ad-hoc, end-user interaction with databases, user actions can be broadly categorized into three groups: (1) explicit, articulate querying of the database, (2) searching through the database, and (3) browsing through the database. Prior work in the area of database usability has recognized (2: Searching) and (3: Browsing) as being significant challenges. My dissertation work attempted to solve (2: Searching) using a combination of autocompletion and qunits. The Skimmer project looks at (3: Browsing). In our SIGMOD 2012 paper, we introduce a method that makes it easier for users to browse large datasets. Here's the abstract and the intuition behind it, followed by a cool demo video of a new implementation of Skimmer:
A relational database often yields a large set of tuples as the result of a query. Users browse this result set to find the information they require. If the result set is large, there may be many pages of data to browse. Since results comprise tuples of alphanumeric values that have few visual markers, it is hard to browse the data quickly, even if it is sorted.
In this paper, we describe the design of a system for browsing relational data by scrolling through it at a high speed. Rather than showing the user a fast-changing blur, the system presents the user with a small number of representative tuples. Representative tuples are selected to provide a “good impression” of the query result. We show that the information loss to the user is limited, even at high scrolling speeds, and that our algorithms can pick good representatives fast enough to provide for real-time, high-speed scrolling over large datasets.
The intuition behind picking representative tuples is simple: Since pages will be consumed in succession, we represent a sampled overview of the current page, but it is a good idea to bias the samples such that we avoid (re)showing information provided in the prior pages. Thus, we promote tuples that increase the information provided to the user compared to the prior pages: i.e. minimize the information loss. The paper evaluates this idea against varying scrolling speed, page size, number of attributes and information quality.
The video above demonstrates the new version of Skimmer we've been working on at Ohio State. Similar to the original Skimmer system, it initially only shows you representative tuples from a page, loading and rendering in the rest of the data in a lazy fashion. Non-rendered tuples are shown as grey bands, for clarity.
This project falls under the "Surfacing of Insights" section of the roadmap laid out in the Guided Interaction VLDB 2011 vision paper, and is only the first step in what I hope is an engaging conversation in the area of large-scale data browsing.
And of course, if you are a current student at Ohio State looking to work on cool projects such as this one, I am looking for students to join my research group. Drop me a line!
I put up this post on Twitter / a few places in the University:
“The data team is looking for hackers. 614-f14cfb5abaa0ee055e1222018cc1ec4f”
A few hints:
* If 614-f14… doesn’t work for you, also try 614-ba18b632839188a00b6f0e3784a14bd9. Both have the same desired outcome. (ba18… is what I’d initially posted and is harder to solve for, imho)
* I later tweeted: “Python solution fits in a tweet; takes 5 min to write, <5 to run.”
* 614 is the Columbus phone area code.
Looking forward to a few people cracking this code!
Update: Here’s the solution in both Python and SQL:
from hashlib import md5;
[x for x in range(10**7) if md5(”%0.3d-%0.4d” % (x/10000,x%10000)).hexdigest()== ‘f14cfb5abaa0ee055e1222018cc1ec4f’]
SQL (assumes table “n” containing a single column n, with tuples, 0 through 9): SELECT * , MD5 ( CONCAT ( n1.n, n2.n, n3.n, “-”, n4.n, n5.n, n6.n, n7.n ) ) AS x FROM n AS n1, n AS n2, n AS n3, n AS n4, n AS n5, n AS n6, n AS n7 HAVING x = “f14cfb5abaa0ee055e1222018cc1ec4f”