The Tapbots duo are quitting their day jobs to work fulltime on their iPhone app company:
Longer term we aren’t looking to get any VC funding, grow to 100s of employees or get bought out by some big corporation. We may get help with support, testing and/or marketing, but development and design is going to just be us two for the foreseeable future. We think that’s the best way to keep the quality of our applications at the level that everyone expects. Our goal is to produce about 4 applications a year. We aren’t going to shovel out crap-ware to cash-in on our names. We aren’t going to write the next Office or Filemaker. We are going to write simple but incredibly polished applications that are created specifically for the iPhone/Touch devices. Two guys, lot’s of passion and a lot of hard work, that’s the Tapbots way.
Two guys, two popular iphone apps (“Weightbot sold 100k copies in its first 100 days, Convertbot is selling at about twice that rate.”), one mission to make quality apps. Good luck, guys!
Arnab Nandi
Dept. of EECS
University of Michigan, Ann Arbor
H. V. Jagadish
Dept. of EECS
University of Michigan, Ann Arbor
ABSTRACT
We demonstrate a novel query interface that enables users to
construct a rich search query without any prior knowledge
of the underlying schema or data. The interface, which is
in the form of a single text input box, interacts in real-time
with the users as they type, guiding them through the query
construction. We discuss the issues of schema and data complexity, result size estimation, and query validity; and provide novel approaches to solving these problems. We demonstrate our query interface on two popular applications; an
enterprise-wide personnel search, and a biological information database.
1. INTRODUCTION
The problem of searching for information in large databases
has always been a daunting task. In current database sys-
tems, the user has to overcome a multitude of challenges.
To illustrate these problems, we take the example of a user
searching for the employee record of an American “Emily
Davies” in an enterprise database. The first major chal-
lenge is that of schema complexity: large organizations may
have employee records in varied schema, typical to each de-
partment. Second, the user may not be aware of the exact
values of the selection predicates, and may provide only a
partial or misspelled attribute value (as is the case with our
example, where the correct spelling is “Davis”). Third, we
would like the user to issue queries that are meaningful in
terms of result size a query listing all employees in the or-
ganization would not be helpful to the user, and could be
expensive for the system to compute. Lastly, we do not ex-
pect the user to be proficient in any complex database query
language to access the database.
2. THE INSTANT-RESPONSE INTERFACE
The instant-response interface is similar in interaction to
various services such as auto-completion in word processors
and mobile phones, and keyword query suggestion in search
engines. The user is presented with a text box, and is ex-
pected to type in keywords, a simple interaction paradigm
that is considered the standard in search systems. As soon
as the user starts typing text into the textbox, the system
instantly begins to provide suggestions, as shown in Figure.
1. The first set of suggestions is a set of schema level param-
eters; i.e. a ranked list of keys, or entity names; which are
simplified names for attributes and table names in a rela-
tional database, or element tags in an XML database. This
allows the user to discover the schema as an ongoing im-
plicit process. The next set of suggestions is a ranked list of
text fragments from the data, which provides the user with
an insight into valid predicate values. Each text fragment
has an associated icon corresponding to the entity type it
belongs to, which are expanded to a full key:value pair on se-
lection. This allows the user to disambiguate between text
values based on their type. Clicking on the icons in the
textbox will wrap the text selection around the associated
key name and hence trigger the drop down to modify the
key, if needed. In addition to these suggestion sets that are
integrated into a single list, the estimated number of tu-
ples, or result objects from the resultant constructed query
is provided on the right of each suggestion. If the estimated
number of results becomes zero due to a value string that
never occurs, or a query that violates the schema, further
suggestions are stopped and the background of the textbox
changes color, depicting an invalid query. The user is then
expected to reformulate the query by changing the terms
entered.
2.1 User Experience
In the case of our example, the user first types in “US”, to
specify that he is considering only the United States offices
of the organization. Since “US” came up as a valid sug-
gestion, the user selected it from the drop down box, which
expands the query to country:US. This is beneficial for the
database, since it can now do an exact attribute match in-
stead of looking through multiple attributes. Then, the user
types in “em”; the suggestion mechanism returns a list of
suggestions, with two key values, and two data values. The
user notices that there are two employee records which have
“Emily Davis” in them, and selects the respective sugges-
tion. Since the query so far will still return a result set of
cardinality two, the user decides to refine the query even
further, typing in “department:”, which provides the sug-
gestions “Sales”, and “HR”, one for each result tuple. By
selecting “HR”, the user is able to locate the single employee
record of interest, and directly navigate to it. It should be
noted that the complete interaction occurs over just a few
seconds, as all suggestions are provided in real time as the
user is typing out the query.
3. PROBLEM DEFINITION
We adopt the standard conjunctive attribute-value query
language, where queries are of the form
Q = key1 : value1 · key2 : value2 · key3 : value3 · . . .
Where keyi denotes the key or attribute of a particular ob-
ject, and valuei denotes a specific value for that attribute
or key. This form of querying can be considered an anno-
tated form of basic keyword querying that is simple enough
to be effective and yet rich enough to efficiently query the
database. It is used widely in mainstream search engines,
and the biomedical community. Keyword identifiers are con-
sidered optional, so that a pure keyword query with no at-
tribute name specifications is also acceptable. Any such
value is expected to match any attribute of the selected ob-
ject. Conversely, a single key identifier without a data value
is also acceptable, and would be interpreted as the parent
type of the expected results. Depending on the schema, the
key attributes specified may not belong to the object of in-
terest, but rather to a “closely related” object. In such a
case the query is interpreted using the techniques suggested
in Schema-free XQuery[5].
3.1 Challenges
In contrast to visual query systems[8, 2], the idea of in-
crementally and instantaneously formulating queries using
database schema and data brings forth its own set of prob-
lems. Unlike form-based querying where the users informa-
tion need is explicit, we need to infer what data the user is
expecting to find while the user is typing in his query. Also,
while recent developments in completion based search sys-
tems[4] attempt to apply efficient index structures for basic
keyword querying; this is only a small part of the challenges
faced in our schema and data based language query system.
Suggestion Quality
There is a subtle element of distraction introduced while
proposing suggestions to the user, which may overwhelm the
user[1]. Hence, we have to keep in mind that the quality of
the suggestions need to be high enough such that it is clear
that the suggestion mechanism is worth the negative effects
of UI-based distraction. In addition to this, we acknowledge
the well known human mind’s limit[6] to perceive and con-
sider choices, taken to be approximately seven. Given this
fixed limit, we use a ranking strategy based on acceptance
likelihood, to prioritize suggestions and display only a top-k
set of suggestions.
Performance
The notion of suggesting information to the user at the time
of input also depends critically on the “instantaneous” ap-
pearance of the suggestions. It has been shown[7] that a
maximum delay of 100ms can be considered for user inter-
faces to seem instantaneous; clearly all our suggestions need
to generated and displayed to the user in less time than this.
This poses challenges for our system, which is required to be
efficient enough to afford extremely fast querying even for
large databases with diverse schema.
These two challenges must each be addressed in the con-
text of attribute name suggestions, data value suggestions,
and size estimation. We describe our solutions to these chal-
lenges in the following section.
4. IMPLEMENTATION
4.1 Schema based suggestions
The first few entries in the suggestion list are a ranked
list of suggestions for possible keys. This is done by using
a suffix tree to generate all possible attributes that are a
completion of the current word typed in. All candidate keys
are then ranked according to distances in the schema graph,
between the candidate keys and the currently selected keys.
Since the user may not know actual attribute names, or the
attribute names may be too cryptic to be used as keywords,
we maintain a metadata map that stores all the convenient
words used for the field name. We augment this metadata
with a subset of the WordNet dataset for further unsuper-
vised synonym-based expansion.
4.2 Schema based validation
We use the schema to construct a reachability index for de-
tecting validity of the query. An n×n matrix is constructed,
where entry(i,j) is true if element name i occurs at least as
an nth level descendent of element name j in any data object
in the database, where n is a upper bound parameter set dur-
ing database set up. For example, an attribute “employee”
could contain attributes “name” and “country” in its data,
while “designation”(an employee attribute) and “building-
type”(an office building’s attribute) will not have any com-
mon 1st or 2nd level ancestors in a personnel database. The
query is considered valid with respect to the schema if there
exists at least one row in the matrix that contains true en-
tries for each of the attributes in the query. To define de-
scendents in relational data, we develop a hierarchy with
attributes at the leaves, tuples as their parents, table as the
grandparent, and joins amongst tables as higher levels of
ancestry.
4.3 Data fragment suggestions
All human-readable data in the database is first frag-
mented into simple chunks of phrases using statistical meth-
ods. We then construct two tries; in the first, each phrase
is entered with its attribute as a prefix; in the second the
attribute is entered as a suffix. The first trie is used for
attribute-specific completion, where the attribute name has
been provided by the user. The second trie is used for generic
completions, where the attributes are stored as suffixes of
the phrase string, allowing us to deduce the attribute name,
and hence the icon for the suggestion. In the case of at-
tributes with a large number of distinct phrases, we prefer
to reuse the generic completion trie in place of the attribute-
specific trie, for efficiency reasons. The leaf node of each trie
is also appended with the count information, which is used
to rank the suggestions in descending order of popularity.
4.4 Size estimation
Each item in the database is provided a unique ID based
on its ancestry, in the form x.y.z, where x is the parent of
y, which in turn is the parent of z. Ancestry is easy to
understand for XML data. For relational data, we consider
the same hierarchy concepts as Sec. 4.2. We use a fielded
inverted index to store these unique identifiers. At query
time, we merge the query-generated attribute lists using a
simple ancestry check; which allows us to come up with the
number of actual items that match the query. Since the
query process is incremental, both attribute lists and merged
lists are cached. In addition, we do not consider large lists
for the sake of efficiency and time-constraints in our query
processing. This is done by ignoring merges of any lists
that are beyond a certain threshold count, and reporting
an estimate of size based on an incomplete independence
assumption.
5. SYSTEM ARCHITECTURE
In contrast to currently available iterative query build-
ing systems, our system employs a continuously updating
interface to provide real-time query responses, allowing the
user to incorporate suggestions and modify the query in real
time, as opposed to going through a time-consuming cycle
of iterative query building. Beyond the visible user inter-
face, our system involves a backend server that handles the
continuous query requests from the interface, and replies to
it. At the core of the system lie the three modules that pro-
vide suggestion inputs, as described in the previous section.
We use Lucene[3] as our inverted index for size estimation
purposes. For the fragment suggestions, we implement in-
memory trie-based data structures. The third module pro-
vides all schema-related functions, such as query validation
and attribute-name suggestion. The information from these
modules is then aggregated and combined to be presented
as suggestions, which are served off the UI backend server.
We implement a caching layer as an integral part of our sys-
tem, caching not only requests for suggestions, but also the
underlying size estimates, etc. Beyond this, the results are
then passed on to the suggestion server, which serves the
update calls of the user interface.
6. DEMONSTRATION
In our demonstration, we present to the user an inter-
face visible as a single text box in a web browser. This
javascript based interface communicates with the main Java-
based backend suggestion server, which can be run either
locally or on another system. We offer users a chance to
pose queries to the database loaded in the server. The users
are guided through the search process by the instantaneous
suggestions, and are then presented with results upon query
execution. We also demonstrate the “plug and play” na-
ture of our system, where we allow any existing relational or
XML database to be loaded quickly for use with the instant-
response interface, by simply pointing to the database loca-
tion. The schema and other metadata information are au-
tomatically read from the database, and are used to load
the database contents, creating the inverted index, phrase
tries and schema graph. An optional step is provided to
the user for annotating the schema with application-specific
metadata (e.g. attribute “EMPZ” could be annotated with
“Employees”).
We demonstrate our application on two datasets. The
first dataset is based on the IMDB dataset. This comprises
of movie information with over 100,000 records, including
actors, movie titles and other film related information. For
our second dataset, we use the MiMI molecular interaction
database, which is an XML dataset over 1 gigabyte in size,
containing over 48 million entities including data collected
from external databases.
7. REFERENCES
[1] B. Bailey, J. Konstan, and J. Carlis. The Effects of
Interruptions on Task Performance, Annoyance, and
Anxiety in the User Interface. Proc. INTERACT, 2001.
[2] T. Catarci, M. Costabile, S. Levialdi, and C. Batini.
Visual Query Systems for Databases: A Survey.
Journal of Visual Languages and Computing, 1997.
[3] E. Hatcher and O. Gospodnetic. Lucene in Action.
Manning Publications, 2004.
[4] I. W. Holger Bast. Type Less, Find More: Fast
Autocompletion Search with a Succinct Index. Proc.
SIGIR, 2006.
[5] Y. Li, C. Yu, and H. Jagadish. Schema-Free XQuery.
Proc. Conf. Very Large Databases, 2004.
[6] G. Miller et al. The Magical Number Seven, Plus Or
Minus Two: Some Limits on Our Capacity for
Processing Information. Bobbs-Merrill, 1956.
[7] R. Miller. Response time in man-computer
conversational transactions. Proceedings of the AFIPS
Fall Joint Computer Conference, 1968.
[8] A. Sengupta and A. Dillon. Query by templates: A
generalized approach for visual query formulation for
text dominated databases. Proc. Symposium on
Advanced Digital Libraries, 1997.
There has been a lot of debate and noise about the 2008 US election. Politics, ethics and opinions aside, let’s think about this from a simple facts-and-numbers perspective. Ignore all the controversy of popular vote vs electoral college, etc etc. Let’s consider the contest in the way it’s defined right now.
You have finite resources, and you need to win the election. It costs a fixed amount of effort to convince each person. What’s the most optimal way to win an election?
So you run an optimized campaign. You strategize and make a campaign that gives you a much better “bang for buck”. In a perfect world, you should spend enough time to convince as little over 50% of each constituency and then move on, since you’ve won there and you should spend your resources elsewhere. This should be the primary objective function to measure a campaign’s efficacy.
Using numbers from the Wikipedia :
Obama : 66,495,308 votes, 365 electoral votes. 365/66 = 5.5
McCain : 58,123,419 votes, 162 electoral votes. 162/58= 2.79
Hence, we can see that Obama’s campaign was TWICE more efficient than McCain’s.
If the ability to lead a campaign is any reflection of leadership of the country, this does seem like a decisive victory in leadership skills.
Now, in case someone argues that Obama had more money, let’s look at spending reports from Opensecrets.org: Obama : 640M$ = 1.75M$ per electoral vote McCain : 370M$ = 2.28M$ per electoral vote
Again, Obama was 1.3 times more efficient with his donation money.
So overall, 2x efficiency in campaign effort, and 1.3x efficiency in use of money. What I see is a great contest in management, with a clear winner.
so i
walked around
the crowded conclave of
these markets and people
and i
was so lost
about where those notes
were coming from and
who made that music
that had touched and then frozen
my sense of direction
in a
moment of perfection
i was so lost
when i saw
your fingers
strumming so gently
six strings in those patterns
that made me stand still
for a moment or two
or a million, i would never know
for i
was so lost.
here i was
spending my time seeking spirit
lending my life towards yearning
for a reason to breathe
looking for a hunger to appease
and there you were
sitting
with a calm up on your face
lips, silent
fingers, manipulating
strings of lonely hearts
tearing them apart
one by one.
i’m sure there is a reason
for each day in every season
to be spent in the hope of
some goal that
i can only wish
that one day you will find
but for now
i hope you do not mind, if
i could sit with you all day long
writing lyrics to your songs
like this one.
Russel Davies painted his laptop to work as a blackboard. I think the acrylic casing for the iBook makes an excellent whiteboard too.
Friend and mentor Cong Yu just got an honorable mention in the SIGMOD Dissertation Award:
...Two other nominees receive Honorable Mention recognizing their outstanding work on theoretical foundations and development of algorithms with great impact on important practical problems: Cong Yu, for his dissertation on “Managing Complex Databases in a Schema Management Framework” at the University of Michigan, and, Nilesh Dalvi, for his dissertation on “Managing Uncertainty Using Probabilistic Databases” at the University of Washington.
It’s interesting to see the hiring trends : the Award was won by now-MSR researcher Ariel Fuxman. Nilesh and Cong are both Yahoo! Researchers.
India’s first amusement park, Appu Ghar is closing down. This is sad. I admit that many of the rides there were either broken or dated the last time I visited it, but it’s still a big part of my memories, just like every other Delhiite. The land will apparently be used to extend the Supreme Court and Metro Rail facilities. I don’t care for the court, but if the metro folks could just implement a loop) in their train tracks through the place, I would be very happy.
One of the first things I did outside of work at Google was to find out how many computers the company has. It’s a fairly secret number; it’s not quite a topic that people in the Googz like to talk about.
It took me a week to piece together the answer; and a few months to come to terms with my discovery. It’s hard to talk to people outside of the big G about the kind of stuff they pull off there, and I’m not talking about making ball pits out of director’s offices.
I can finally talk about this, now that this information is explicitly public, published in an article by MapReduce Gods Jeff Dean and Sanjay Ghemawat (bloggy synopsis here). In the paper, they talk of 11,081 machine years of computation used in Sept 2007 alone, for a subset of their MapReduce work. That’s 132972 machine months of CPU used in one month. Assuming all the computers were running at 100% capacity, without failure, without any break for the entire month, that’s almost a hundred and fifty thousand machines worth of computing used in September Oh Seven.
In other words, Google has about one hundred and fifty thousand computers that are reported here.
But does that account for ALL the computers at Google?
To find out, go ask a Google employee to violate his NDA today!
for your information, this may not be the right number. it should be obvious why. for example, they never said anything about not using hamsters. hamsters are 10x faster than computers, which would mean they could just have 10,000 hamsters and it would be fine.
While I haven’t been able to get my hands on one yet(I’m waiting on Rhode to donate hers when she gets it), this whole Microsoft Zune thing seems like as if it was designed to be unspectacular and mediocre. Microsoft usually has a way about spewing awe, even if it’s for vaporware. For example, this Longhorn concept video, makes me want to run out and buy a copy. Why don’t I feel like that for the Zune? MacObserver is running an article with similar thoughts, and I agree that Microsoft isn’t really trying to compete here. There’s no way they can catch up with Apple’s 6-year lead instantly, so they’ve decided that they’ll simply launch a product that will dilute the market. Since the Zune looks(atleast from a distance) just like an iPod, it’s prevalence is going to take the “ooh, look he has an iPod” perception to “meh, it’s a Zune like thing; and that Zune is lame. What a loser.”. One may argue that millions of other mp3 players have not diluted iPod’s branding, but the difference is that this is Microsoft.
The next year is going to be a fun one for handheld devices. I’m saying handheld devices because I’m waiting for the real video iPod — a 9-inch UMPC tablet running a lightweight OSX — The Rebirth of the The Newton, I guess!
Update: After spending 2 minutes with Rhode’s Zune, I think it’s safe to say that it doesn’t really suck. It’s basically a fat iPod, a successor to the Portable Windows Media Players. The perfect analogy would be the characters from the “Hi, I’m a Mac” campaign — while the iPod is slim and suave, the Zune is quirky, plump and primitive; but definitely quite likable.
Oh-me, oh-my, oh-you
Whatever shall I do
Hallelujah, the question is peculiar
I’d give a lot of dough
If only I could know
The answer to my question
Is it yes or is it no
Does your chewing gum lose its flavour
On the bedpost overnight
If your mother says don’t chew it
Do you swallow it in spite
Can you catch it on your tonsils
Can you heave it left and right
Does your chewing gum lose its flavour
On the bedpost overnight
Now the nation rise as one
To send their honoured son
To the White House
The nation’s only White House
To voice their discontent
Unto the Pres-I-dent
To ask this burning question
That has swept this continent
Does your chewing gum lose its flavour
On the bedpost overnight
If you pull it out like rubber
Will it snap right back and bite
If you paste on the left side
Will you find it on the right?
Does your chewing gum lose its flavour
On the bedpost overnight
Here comes a blushing bride
The groom is by her side
Up to the altar
As steady as Gibraltar
The groom has got the ring
It’s such a pretty thing
As he slips it on her finger
The choir begins to sing
Does your chewing gum lose its flavour
On the bedpost overnight
Would you use it on your collar
When your button’s not in sight?
Put your hand beneath your seat
And you will find it there all right!
Does your chewing gum lose its flavour
On the bedpost overnight
Google Sketchup: A 3-D editor for the Metaverse’s real estate concept
What next? Take a look at Second Life, and the talk they gave at Google NY — I’m not quite sure of the implementation itself, but this does seem like the next piece in the Snow crash puzzle.
... and then we’ll soon have Google merging with the Library of Congress, and Yahoo, Amazon, MS, and eBay merging with the mafia, and the picture will be complete.