
Assisted Querying using Instant-Response Interfaces

This paper was presented at SIGMOD 2007, Beijing. The PDF version is available (ACM mirror).



Arnab Nandi
Dept. of EECS
University of Michigan, Ann Arbor

H. V. Jagadish
Dept. of EECS
University of Michigan, Ann Arbor


We demonstrate a novel query interface that enables users to
construct a rich search query without any prior knowledge
of the underlying schema or data. The interface, which is
in the form of a single text input box, interacts in real-time
with the users as they type, guiding them through the query
construction. We discuss the issues of schema and data complexity, result size estimation, and query validity, and provide novel approaches to solving these problems. We demonstrate our query interface on two popular applications: an
enterprise-wide personnel search, and a biological information database.


The problem of searching for information in large databases
has always been a daunting task. In current database sys-
tems, the user has to overcome a multitude of challenges.
To illustrate these problems, we take the example of a user
searching for the employee record of an American “Emily
Davies” in an enterprise database. The first major chal-
lenge is that of schema complexity: large organizations may
have employee records in varied schema, typical to each de-
partment. Second, the user may not be aware of the exact
values of the selection predicates, and may provide only a
partial or misspelled attribute value (as is the case with our
example, where the correct spelling is “Davis”). Third, we
would like the user to issue queries that are meaningful in
terms of result size: a query listing all employees in the
organization would not be helpful to the user, and could be
expensive for the system to compute. Lastly, we do not ex-
pect the user to be proficient in any complex database query
language to access the database.


The instant-response interface is similar in interaction to
various services such as auto-completion in word processors
and mobile phones, and keyword query suggestion in search
engines. The user is presented with a text box, and is ex-
pected to type in keywords, a simple interaction paradigm
that is considered the standard in search systems. As soon
as the user starts typing text into the textbox, the system
instantly begins to provide suggestions, as shown in Figure 1.
The first set of suggestions is a set of schema-level parameters,
i.e., a ranked list of keys, or entity names, which are
simplified names for attributes and table names in a rela-
tional database, or element tags in an XML database. This
allows the user to discover the schema as an ongoing im-
plicit process. The next set of suggestions is a ranked list of
text fragments from the data, which provides the user with
an insight into valid predicate values. Each text fragment
has an associated icon corresponding to the entity type it
belongs to, which are expanded to a full key:value pair on se-
lection. This allows the user to disambiguate between text
values based on their type. Clicking on the icons in the
textbox will wrap the text selection around the associated
key name and hence trigger the drop down to modify the
key, if needed. In addition to these suggestion sets that are
integrated into a single list, the estimated number of tu-
ples, or result objects from the resultant constructed query
is provided on the right of each suggestion. If the estimated
number of results becomes zero due to a value string that
never occurs, or a query that violates the schema, further
suggestions are stopped and the background of the textbox
changes color, depicting an invalid query. The user is then
expected to reformulate the query by changing the terms.

2.1 User Experience
In the case of our example, the user first types in “US”, to
specify that he is considering only the United States offices
of the organization. Since “US” came up as a valid sug-
gestion, the user selected it from the drop down box, which
expands the query to country:US. This is beneficial for the
database, since it can now do an exact attribute match in-
stead of looking through multiple attributes. Then, the user
types in “em”; the suggestion mechanism returns a list of
suggestions, with two key values, and two data values. The
user notices that there are two employee records which have
“Emily Davis” in them, and selects the respective sugges-
tion. Since the query so far will still return a result set of
cardinality two, the user decides to refine the query even
further, typing in “department:”, which provides the sug-
gestions “Sales”, and “HR”, one for each result tuple. By
selecting “HR”, the user is able to locate the single employee
record of interest, and directly navigate to it. It should be
noted that the complete interaction occurs over just a few
seconds, as all suggestions are provided in real time as the
user is typing out the query.


We adopt the standard conjunctive attribute-value query
language, where queries are of the form
Q = key1:value1 · key2:value2 · key3:value3 · …
where key_i denotes the key or attribute of a particular object,
and value_i denotes a specific value for that attribute
or key. This form of querying can be considered an anno-
tated form of basic keyword querying that is simple enough
to be effective and yet rich enough to efficiently query the
database. It is used widely in mainstream search engines,
and the biomedical community. Keyword identifiers are con-
sidered optional, so that a pure keyword query with no at-
tribute name specifications is also acceptable. Any such
value is expected to match any attribute of the selected ob-
ject. Conversely, a single key identifier without a data value
is also acceptable, and would be interpreted as the parent
type of the expected results. Depending on the schema, the
key attributes specified may not belong to the object of in-
terest, but rather to a “closely related” object. In such a
case the query is interpreted using the techniques suggested
in Schema-free XQuery[5].
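As a concrete illustration, a query string in this language can be parsed with a few lines of code. This sketch is ours, not the paper's implementation; the tuple representation and the handling of bare keys/values are simplifying assumptions.

```python
def parse_query(text):
    """Split a conjunctive attribute-value query into (key, value) pairs.

    A bare keyword becomes (None, word): it may match any attribute.
    A trailing "key:" becomes (key, None): it names the expected result type.
    """
    terms = []
    for token in text.split():
        if ":" in token:
            key, _, value = token.partition(":")
            terms.append((key or None, value or None))
        else:
            terms.append((None, token))
    return terms
```

For example, `parse_query("country:US Emily department:")` yields `[("country", "US"), (None, "Emily"), ("department", None)]`, matching the three term forms described above.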

3.1 Challenges
In contrast to visual query systems[8, 2], the idea of in-
crementally and instantaneously formulating queries using
database schema and data brings forth its own set of prob-
lems. Unlike form-based querying, where the user's information
need is explicit, we need to infer what data the user is
expecting to find while the user is typing in his query. Also,
while recent developments in completion-based search systems[4]
attempt to apply efficient index structures to basic
keyword querying, this addresses only a small part of the
challenges faced in our schema- and data-based query language system.
Suggestion Quality
There is a subtle element of distraction introduced while
proposing suggestions to the user, which may overwhelm the
user[1]. Hence, we have to keep in mind that the quality of
the suggestions needs to be high enough that it is clear

that the suggestion mechanism is worth the negative effects
of UI-based distraction. In addition to this, we acknowledge
the well known human mind’s limit[6] to perceive and con-
sider choices, taken to be approximately seven. Given this
fixed limit, we use a ranking strategy based on acceptance
likelihood, to prioritize suggestions and display only a top-k
set of suggestions.
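The top-k selection step described above can be sketched in a few lines; the likelihood scores are assumed to come from an upstream ranking model and are invented for this example.

```python
import heapq

def top_k_suggestions(candidates, k=7):
    """Keep only the k suggestions with the highest acceptance likelihood.

    candidates: iterable of (suggestion, likelihood) pairs.
    k defaults to seven, following the perceptual limit discussed above.
    """
    return heapq.nlargest(k, candidates, key=lambda c: c[1])
```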
The notion of suggesting information to the user at the time
of input also depends critically on the “instantaneous” ap-
pearance of the suggestions. It has been shown[7] that a
maximum delay of 100ms can be considered for user inter-
faces to seem instantaneous; clearly all our suggestions need
to be generated and displayed to the user in less time than this.
This poses challenges for our system, which is required to be
efficient enough to afford extremely fast querying even for
large databases with diverse schema.
These two challenges must each be addressed in the con-
text of attribute name suggestions, data value suggestions,
and size estimation. We describe our solutions to these chal-
lenges in the following section.


4.1 Schema based suggestions

The first few entries in the suggestion list are a ranked
list of suggestions for possible keys. This is done by using
a suffix tree to generate all possible attributes that are a
completion of the current word typed in. All candidate keys
are then ranked according to distances in the schema graph,
between the candidate keys and the currently selected keys.
Since the user may not know actual attribute names, or the
attribute names may be too cryptic to be used as keywords,
we maintain a metadata map that stores all the convenient
words used for the field name. We augment this metadata
with a subset of the WordNet dataset for further unsuper-
vised synonym-based expansion.
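A minimal sketch of this attribute-name completion, with a sorted list standing in for the suffix tree and a tiny hand-made map standing in for the metadata/WordNet synonym expansion; the attribute names and synonyms here are illustrative only.

```python
import bisect

# Hypothetical synonym entries; the real system derives these from the
# metadata map and WordNet.
SYNONYMS = {"designation": ["title", "position"]}

def attribute_completions(prefix, attributes):
    """Return all attribute names (or their synonyms) completing `prefix`."""
    names = sorted(set(attributes) |
                   {s for a in attributes for s in SYNONYMS.get(a, [])})
    i = bisect.bisect_left(names, prefix)
    out = []
    while i < len(names) and names[i].startswith(prefix):
        out.append(names[i])
        i += 1
    return out
```

Here a user typing "ti" would be offered "title" even though the underlying attribute is the more cryptic "designation".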
4.2 Schema based validation
We use the schema to construct a reachability index for de-
tecting the validity of the query. An n×n matrix is constructed,
where entry (i,j) is true if element name i occurs as a descendant
of element name j, within d levels, in any data object
in the database; here d is an upper-bound parameter set during
database setup. For example, an attribute “employee”
could contain attributes “name” and “country” in its data,
while “designation” (an employee attribute) and “building-type”
(an office building’s attribute) will not have any common
1st- or 2nd-level ancestors in a personnel database. The
query is considered valid with respect to the schema if there
exists at least one row in the matrix that contains true en-
tries for each of the attributes in the query. To define
descendants in relational data, we develop a hierarchy with
attributes at the leaves, tuples as their parents, the table as the
grandparent, and joins amongst tables as higher levels of the hierarchy.
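The validity check can be sketched as follows; the toy schema and the set-based encoding of the reachability matrix are our own simplifications, not the paper's data structures.

```python
# Each entry maps an element name to the set of element names reachable
# as its descendants (hand-built here for a toy personnel schema; the
# real system derives this from the data during setup).
REACHABLE = {
    "employee": {"employee", "name", "country", "designation", "department"},
    "building": {"building", "building-type", "country"},
}

def valid_query(keys):
    """A query is schema-valid if some element's descendant set
    covers every attribute the query mentions."""
    return any(set(keys) <= desc for desc in REACHABLE.values())
```

So a query mentioning "name" and "designation" is valid (both live under "employee"), while one mixing "designation" and "building-type" is rejected, mirroring the example above.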
4.3 Data fragment suggestions
All human-readable data in the database is first frag-
mented into simple chunks of phrases using statistical meth-
ods. We then construct two tries; in the first, each phrase
is entered with its attribute as a prefix; in the second the

attribute is entered as a suffix. The first trie is used for
attribute-specific completion, where the attribute name has
been provided by the user. The second trie is used for generic
completions, where the attributes are stored as suffixes of
the phrase string, allowing us to deduce the attribute name,
and hence the icon for the suggestion. In the case of at-
tributes with a large number of distinct phrases, we prefer
to reuse the generic completion trie in place of the attribute-
specific trie, for efficiency reasons. The leaf node of each trie
is annotated with count information, which is used
to rank the suggestions in descending order of popularity.
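A count-annotated trie in the spirit of this section can be sketched as follows; the node layout and API are our own simplification, and the phrases and counts are invented.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0        # number of times a phrase ends at this node

class PhraseTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, phrase, count=1):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def complete(self, prefix, k=7):
        """Return up to k completions of `prefix`, most popular first."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []

        def walk(n, suffix):
            if n.count:
                results.append((prefix + suffix, n.count))
            for ch, child in n.children.items():
                walk(child, suffix + ch)

        walk(node, "")
        # rank by descending popularity, as described above
        return sorted(results, key=lambda r: -r[1])[:k]
```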
4.4 Size estimation
Each item in the database is provided a unique ID based
on its ancestry, in the form x.y.z, where x is the parent of
y, which in turn is the parent of z. Ancestry is easy to
understand for XML data. For relational data, we consider
the same hierarchy concepts as Sec. 4.2. We use a fielded
inverted index to store these unique identifiers. At query
time, we merge the query-generated attribute lists using a
simple ancestry check, which allows us to come up with the
number of actual items that match the query. Since the
query process is incremental, both attribute lists and merged
lists are cached. In addition, for the sake of efficiency and the
time constraints on our query processing, we do not consider
large lists. This is done by ignoring merges of any lists
that are beyond a certain threshold count, and reporting
an estimate of size based on an independence assumption instead.
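The ancestry-based merge can be sketched as below; the dotted-ID convention follows the x.y.z scheme above, but the two-function design and the example postings are our own simplification (the real system merges fielded inverted lists).

```python
def parent(ident):
    """Parent of a dotted ancestry ID: "table.tuple.attr" -> "table.tuple"."""
    return ident.rsplit(".", 1)[0]

def result_count(posting_lists):
    """Count objects matched by every query term: two postings refer to
    the same object when their IDs share the same parent prefix."""
    parent_sets = [{parent(i) for i in plist} for plist in posting_lists]
    return len(set.intersection(*parent_sets))
```

For example, postings `["p.1.name", "p.2.name"]` and `["p.1.country"]` share only the parent `p.1`, giving an estimated result size of one.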


In contrast to currently available iterative query build-
ing systems, our system employs a continuously updating
interface to provide real-time query responses, allowing the
user to incorporate suggestions and modify the query in real
time, as opposed to going through a time-consuming cycle
of iterative query building. Beyond the visible user inter-
face, our system involves a backend server that handles the
continuous query requests from the interface, and replies to
it. At the core of the system lie the three modules that pro-
vide suggestion inputs, as described in the previous section.
We use Lucene[3] as our inverted index for size estimation
purposes. For the fragment suggestions, we implement in-
memory trie-based data structures. The third module pro-
vides all schema-related functions, such as query validation
and attribute-name suggestion. The information from these
modules is then aggregated and combined to be presented
as suggestions, which are served off the UI backend server.
We implement a caching layer as an integral part of our sys-
tem, caching not only requests for suggestions, but also the
underlying size estimates, etc. Beyond this, the results are
then passed on to the suggestion server, which serves the
update calls of the user interface.


In our demonstration, we present to the user an inter-
face visible as a single text box in a web browser. This
javascript based interface communicates with the main Java-
based backend suggestion server, which can be run either
locally or on another system. We offer users a chance to
pose queries to the database loaded in the server. The users
are guided through the search process by the instantaneous
suggestions, and are then presented with results upon query
execution. We also demonstrate the “plug and play” na-
ture of our system, where we allow any existing relational or
XML database to be loaded quickly for use with the instant-
response interface, by simply pointing to the database loca-
tion. The schema and other metadata information are au-
tomatically read from the database, and are used to load
the database contents, creating the inverted index, phrase
tries and schema graph. An optional step is provided to
the user for annotating the schema with application-specific
metadata (e.g., attribute “EMPZ” could be annotated with a
more convenient, human-readable name).

We demonstrate our application on two datasets. The
first dataset is based on the IMDB dataset, which comprises
movie information with over 100,000 records, including
actors, movie titles and other film-related information. For
our second dataset, we use the MiMI molecular interaction
database, which is an XML dataset over 1 gigabyte in size,
containing over 48 million entities including data collected
from external databases.


[1] B. Bailey, J. Konstan, and J. Carlis. The Effects of
Interruptions on Task Performance, Annoyance, and
Anxiety in the User Interface. Proc. INTERACT, 2001.
[2] T. Catarci, M. Costabile, S. Levialdi, and C. Batini.
Visual Query Systems for Databases: A Survey.
Journal of Visual Languages and Computing, 1997.
[3] E. Hatcher and O. Gospodnetic. Lucene in Action.
Manning Publications, 2004.
[4] H. Bast and I. Weber. Type Less, Find More: Fast
Autocompletion Search with a Succinct Index. Proc.
SIGIR, 2006.
[5] Y. Li, C. Yu, and H. Jagadish. Schema-Free XQuery.
Proc. Conf. Very Large Databases, 2004.
[6] G. A. Miller. The Magical Number Seven, Plus or
Minus Two: Some Limits on Our Capacity for
Processing Information. Psychological Review, 1956.
[7] R. Miller. Response time in man-computer
conversational transactions. Proceedings of the AFIPS
Fall Joint Computer Conference, 1968.
[8] A. Sengupta and A. Dillon. Query by templates: A
generalized approach for visual query formulation for
text dominated databases. Proc. Symposium on
Advanced Digital Libraries, 1997.

Effective Phrase Prediction

The following paper was presented at VLDB 2007, Vienna, Austria. The PDF version is available here.

Arnab Nandi

Dept. of EECS

University of Michigan, Ann Arbor

H. V. Jagadish

Dept. of EECS

University of Michigan, Ann Arbor

21 March 2006

Autocompletion is a widely deployed facility in systems that require user
input. Having the system complete a partially typed “word” can save user time
and effort.

In this paper, we study the problem of autocompletion not just at the level of
a single “word”, but at the level of a multi-word “phrase”. There are two
main challenges: one is that the number of phrases (both the number possible and
the number actually observed in a corpus) is combinatorially larger than the
number of words; the second is that a “phrase”, unlike a “word”, does not
have a well-defined boundary, so that the autocompletion system has to decide not just
what to predict, but also how far.

We introduce a FussyTree structure to address the first challenge and the
concept of a significant phrase to address the second.
We develop a probabilistically driven multiple completion choice model, and
exploit features such as frequency distributions to improve the quality of our suffix completions. We
experimentally demonstrate the practicability and value of our technique for an
email composition application and show that we can save approximately a fifth of the keystrokes typed.

1  Introduction

Text-input interfaces have been undergoing a sea change in the last few years.
The concept of automatic completion, or autocompletion has become
increasingly pervasive. An autocompletion mechanism unobtrusively prompts the
user with a set of suggestions, each of which is a suffix, or completion, of the
user’s current input. This allows the user to avoid unnecessary typing, hence
saving not just time but also user cognitive burden. Autocompletion is finding
applications in specialized input fields such as file location
fields [30], email address fields [30] and URL address
bars [25], as well as new, more aggressive applications such as
dictionary-based word / phrase completion [23] and search query
suggestion, available now in mainstream web browsers [25]. With the recent increase in user-interface to server communication
paradigms such as AJAX [32], and as users become more receptive to the
idea of autocompletion, it will find many more applications, giving rise to
increased expectations from autocompletion systems.

Current autocompletion systems are typically restricted to single-word
completion. While being a very helpful feature, single word completion does
not take advantage of collocations in natural language – humans have a tendency
to reuse groups of words or “phrases” to express meanings beyond the simple
sum of the parts. From our observations, such collocations
include not only proper noun phrases, such as
“Enron Incorporated”, but also commonly used phrases such as
“can you please send me the” or “please let me know if you
have any questions”. Such phrases become much more common when
attached to a personalized context such as email, or personal writing, due to
the tendency of an individual to mention the same phrases during
repetitive communication. Many current systems incorporate multiple word
occurrences by encoding them as single words; for example “New York"
is converted to “New_York". However, such schemes do not scale
beyond a few words, and would bloat the index (e.g. suffix-tree) sizes for
applications where we desire to index a large number of phrases.

It is this multi-word autocompletion problem that we seek to address in this paper.

We note that autocompletion is only useful if it appears “instantaneous" in the
human timescale, which has been observed [5, 27] to be a time upper bound of
approximately 100ms. This poses a stringent time constraint
on our problem. For single word completion, typical techniques involve building a dictionary of all words and
possibly coding this as a trie (or suffix-tree), with each node representing one
character and each root-leaf path depicting a word (or a suffix). Such
techniques cannot be used directly in the multi-word case because one cannot
construct a finite dictionary of multi-word phrases. Even if we limit the length
of phrases we will consider to a few words at most, we still need an
“alphabet” comprising all possible words, and the size of the dictionary is
several orders of magnitude larger than for the single word case.

It goes without saying that autocompletion is useful only when suggestions
offered are correct (in that they are selected by the user). Offering
inappropriate suggestions is worse than offering no suggestions at all, distracting
the user, and increasing user anxiety [1]. As we move from word
completion to phrase completion, we find that this correctness requirement
becomes considerably more challenging. Given the first 5 letters of a long
word, for example, it is often possible to predict which word it is: in
contrast, given the first 5 words in a sentence, it is usually very hard to
predict the rest of the sentence. We therefore define our problem as one of
completing not the full sentence, but rather a multi-word phrase, going
forward as many words as we can with reasonable certainty. Note that our use of
the word “phrase” does not refer to a construct from English grammar, but
rather an arbitrary sequence of words. At any point in the input, there is no
absolute definition of the end of the current phrase (unlike word or sentence,
which have well-specified endings) – rather the phrase is as long as we are
able to predict, and determining the length of the phrase itself becomes a
problem to address.
For example, given the prefix please let, one possible phrase
completion is please let me know; a longer one is please let
me know if you have any problems
. The former is more likely to be a correct
prediction, but the latter is more valuable if we get it correct (in that more
user keystrokes are saved). Choosing between these, and other such options (such as
please let me know when), is one problem we address in this paper.
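The trade-off in this example can be made concrete by scoring candidates by expected keystrokes saved, i.e. the probability of acceptance times the completion length. The scoring rule and probabilities below are illustrative assumptions, not the paper's exact model.

```python
def expected_savings(candidates):
    """Pick the completion maximizing P(correct) * length.

    candidates: (completion, probability) pairs for one prefix;
    the probabilities here would come from corpus statistics.
    """
    return max(candidates, key=lambda c: c[1] * len(c[0]))
```

Under this rule, a likely short completion such as "me know" (say, p = 0.6) can beat a long but unlikely one such as "me know if you have any problems" (p = 0.1), since 0.6 × 7 > 0.1 × 32.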

Traditional methods of autocompletion have provided only a single completion per
query. Current autocompletion paradigms and user interface
mechanisms do allow for multiple suggestions to be given
to the user, for example by cyclic tab-based
completions [39] in command line shells, or by drop-down
completions in search engines. However, this facility is
typically used to present a comprehensive list of possible completions without
any notion of ranking or summarization.

There are dozens of applications that could benefit from multi-word
autocompletion, and the techniques we develop in this paper should be applicable
irrespective of the application context. Nonetheless, to keep matters concrete,
we will focus on email composition as our motivating application. We all spend
a significant fraction of our work day composing emails. A tool can become
valuable even if it helps decrease this time only slightly. We
hypothesize that typical official email conversation is repetitive
and contains many standard phrases.
For each user, there is typically available a long archive of sent mail, which
we can use to train an autocompletion system. And finally, the high speed at
which most of us type out emails makes it essential that autocompletion queries
have very short response times if they are to help rather than hinder the user,
leading to tight performance requirements which we will attempt to address.

1.1  Our contributions

We introduce a query model that takes into account the multiplicity of
completion choices, and propose a way to provide top-k ranked
completions, paying attention to the nature of the results returned.

We introduce the concept of a significant phrase, which is used to
demarcate frequent phrase boundaries. It is possible for significant phrases to
overlap. It is also possible for there to be two (or more) significant phrases
of differing lengths for any completion point. To evaluate systems that provide multiple
ranked autocompletion results, we define a novel total profit metric for ranked autocompletion
results, that takes into account the cost of distraction due to incorrect suggestions.

At the physical level, we propose the construction of a word-based suffix tree structure we call the FussyTree,
where every node in the tree corresponds to a word instance in the corpus, and
root-leaf paths depict phrases. Queries are executed upon receipt of a prefix phrase from the user-interface, and the suffix-tree is then traversed to
provide the results. We develop techniques to minimize resource requirements in
this context, such as tree size and construction time.
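A word-level tree in the spirit of the FussyTree can be sketched as nested maps, where each node is a word and completions are read off the subtree under the user's prefix phrase. This is a bare sketch: the significance test, pruning, and frequency counts of the real structure are omitted, and the representation is our own.

```python
from collections import defaultdict

def _node():
    """A tree node: mapping from word -> child node."""
    return defaultdict(_node)

def build_tree(phrases):
    """Insert each phrase as a root-to-leaf word path."""
    root = _node()
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node[word]
    return root

def complete(root, prefix):
    """Walk the prefix phrase, then enumerate all suffixes below it."""
    node = root
    for word in prefix.split():
        if word not in node:
            return []
        node = node[word]
    out = []

    def walk(n, words):
        if not n:  # leaf: a full stored suffix ends here
            out.append(" ".join(words))
        for w, child in n.items():
            walk(child, words + [w])

    walk(node, [])
    return out
```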

The next section discusses the challenges and motivations for our work, and
details some of the existing solutions to the problems at hand. In Section 3,
we describe our data model and the notion of significance.
The FussyTree data structure, its construction and querying are defined in
Section 4. This is followed by Section 5, where we discuss evaluation strategies and metrics for
our experiments. We report our experiments and findings in
Section 6, and suggest extensions to our work in Section 7. We conclude with a discussion of the problem in Section 8.

2  Motivation and Challenges

2.1  Pervasiveness of Autocompletion

The concept of autocompletion has become prevalent in current user interfaces.
It has found its way into mobile phones [38], OS-level file and web
browsers [25], desktop search engines such as Google
Desktop [22], GNOME Beagle [21] and Apple
Spotlight [20], various integrated development environments,
email clients such as Microsoft Outlook [23], and even word processors
such as Microsoft Word [23] and OpenOffice [2]. The latest
versions of Microsoft Internet Explorer [24] and Mozilla
Firefox [25] implement autocompletion to suggest search queries to the
user. This widespread adoption of autocompletion demands more from the backend
mechanisms that support it in terms of fast response times with large amounts of
suggestion data. While improvements in underlying computer hardware and
software do allow for implementation of more responsive user interfaces, the
basic computational challenges involving autocompletion still need to be solved.

2.2  Related Work

The concept of autocompletion is not new. Implementations such as the “Reactive
Keyboard” developed by Darragh and Witten [6] have been
available since the advent of modern user interfaces, where the interface
attempts to predict future keystrokes based on past interactions.
Learning-based assistive technologies have been very successfully used to help
users with writing disabilities [18], and this technology is now
being extended to accelerate text input for the average user.

There have been successful implementations for autocompletion in controlled
language environments such as command line shells developed by Motoda and
Yoshida [29], and by Davison and Hirsh [7]. Jacobs and
Blockeel have addressed the problem of Unix command prediction using variable
memory Markov models. Integrated Development environments such as Microsoft
Visual Studio also exploit the limited nature of controlled languages to deliver
accurate completion mechanisms that speed up user input.

As in these controlled languages, there is high predictability in natural human
language text, as studied by Shannon [37], making a strong case
for automated natural-language input assistance. Phrase prediction and
disambiguation for natural language have been active areas of research in the
speech recognition and signal processing community [11], providing
reliable high-level domain knowledge to increase the accuracy of low-level audio
processing. Phrase level prediction and disambiguation have also found great
use in the area of machine translation to improve result quality [42].
Similar ideas involving language modelling have been used for spelling
correction [17]. The exact task of sentence completion has been
tackled using different approaches. Grabski et al. [12] have
developed an information retrieval based set of techniques, using cosine
similarity as a metric to retrieve sentences similar to the query. Bickel et
al. [3] take on a more language model-based approach, by estimating
parameters in linearly interpolated n-gram models.

While there have been significant advances in the level of phrase prediction for
natural language, there appears to be very little attention paid to the
efficiency of the prediction itself. Most models and systems are built
for offline use, and are not designed for real-time use at the speed of human
typing: several queries a second. Also, the data models implemented by most
are agnostic of actual implementation issues such as system memory. We attempt
to bridge this gap, using a data-structures approach to solving this problem.
We conceive the phrase completion problem as a suffix tree implementation, and
institute methods and metrics to ensure result quality and fast query response times.

In addition to the nature of the queries, we recognize parameters that have not
been considered before in traditional approaches to the problem setting. While
the concept of contextual user interaction is quite prevalent [36],
the context of the query phrase has been ignored in surveyed
text-prediction literature. This is an important feature, since most human
textual input is highly contextual in nature. In the following sections, we
evaluate the possibility of improving result quality using the query context.

The problem of frequent phrase detection has also been addressed in the context
of “phrase browsing" in which documents are indexed according to a hierarchy of constituent phrases.
The Sequitur algorithm used in these techniques [31, 28, 33]
suffers from the problem of aggressive phrase creation and is not ideal
for cases where we wish to index only the n-most frequent phrases, as opposed
to all the repeated phrases in the document. Additionally, the
in-memory data structure created is extremely large, making it
infeasible for large datasets.

Efficient data structures for prefix-text indexing have been studied. For example, trie variants such as burst tries [13, 43] have been shown to be effective for indexing word sequences in large corpora. However, these techniques still require memory resident structures and furthermore do not consider phrase boundaries or phrase frequencies, and hence cannot be used for our application.

2.3  Vocabulary and size

There has been considerable work [40, 41, 34] that
considers the problem of frequent phrase storage using suffix trees for both
small and large data sets. However, traditional techniques that study efficient
suffix tree construction make an assumption that the vocabulary (tree alphabet)
involved is small. Since we consider natural language phrase autocompletion as
our target application, our vocabulary is that of all the possible words in the
text, which is a very large number. Hence, the average fanout of each node in
the suffix tree is expected to be quite high, that is, each word will have a high
number of distinct words following it. In addition, the trees are expected to
be sparse and extremely varied in their structure. This high fanout and sparse,
varied structure challenges current scalable construction algorithms,
since sub-tasks in these algorithms, such as the identification
of common substrings, become significantly more expensive.

Such flat, wide suffix-trees also make querying difficult due to the increased
costs in searching through the nodes at each level.
A possible solution is to maintain each list of children in lexicographic
order, which affects the lower bound on suffix tree construction,
taking it from O(n) to O(n log σ), where σ is the size of the vocabulary and n
the size of our input text. Farach et al. [9] address this very
problem and propose a linear-time algorithm for suffix tree construction.
However, this algorithm involves multiple scans of the input string to
construct and splice together parts of the tree, which makes it impossible to
implement applications that are incremental or stream-like in nature, or where
the corpus size is very large.

In other related work, the estimation of phrase frequencies has
been studied as an index estimation problem. Krishnan et al. [16] discuss the
estimation techniques in the presence of wildcards, while Jagadish et
al. [14] use various properties such as the short-memory property
of text to estimate frequency for substrings. However, both papers consider
only low vocabulary applications. For example, the construction of pruned count
suffix trees, as suggested in these papers, is infeasible for phrases due to the large intermediate size of the
trees, even with in-construction pruning.

3  Data Model

In the context of the problem setting described above, we now formalize our autocompletion problem.

Let a document be represented as a sequence of words, w1, w2, …,
wN. A phrase r in the document is an occurrence of consecutive words,
wi, wi+1, …, wi+x−1, for some starting position i in [1, N−x+1].
We call x the length of phrase r, and write it as len(r) = x.

The autocompletion problem is, given w1, w2, … wi−1, to predict completions of the form: wi, wi+1, …, wi+c−1, where we would like the likelihood of this completion being correct to be as high as possible, and for the length of the completion to be as large as possible.

Of course, the entire document preceding word i can be very long, and most of it may have low relevance to the predicted completion. Therefore, we will consider a phrase to comprise a prefix of length p and a completion of length c. Values for both p and c will be determined experimentally.
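The definitions above can be made concrete with a short sketch. The helper names below are ours, not the paper's, and the particular values of p and c are illustrative only:

```python
def phrases(words, x):
    """All phrases of length x: runs of x consecutive words."""
    return [tuple(words[i:i + x]) for i in range(len(words) - x + 1)]

def split_phrase(phrase, p):
    """View a phrase as a prefix of length p plus a completion."""
    return phrase[:p], phrase[p:]

doc = "please call me asap".split()
# every phrase of length 2 in the document
pairs = phrases(doc, 2)
# a length-4 phrase seen as prefix (p = 2) and completion
prefix, completion = split_phrase(("please", "call", "me", "asap"), 2)
```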

We now introduce an example to help explain concepts throughout the rest
of this paper: a sample of a multi-document text stream for email, as shown in Table 1. We also present the n-gram frequency table, in this example choosing a training sentence size N = 2. In other words, we determine frequencies for words and word pairs, but not for word triples or longer sequences. We are not interested in phrases, or phrase components, that occur infrequently. In this example, we have set a threshold parameter τ = 2. The last three bigrams (“call if”, “call asap”, “you call”) have frequency below this threshold, and are hence ignored.

Doc 1: please call me asap
Doc 2: please call if you
Doc 3: please call asap
Doc 4: if you call me asap

Unigram   Freq        Bigram        Freq
please    3           please call   3
call      4           call me       2
me        2           if you        2
if        2           me asap       2
you       2           call if       1
asap      3           call asap     1
                      you call      1

Table 1: An example of a multi-document collection
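The frequency table in Table 1 can be reproduced in a few lines. This is a sketch (the function name is ours), not the paper's implementation:

```python
from collections import Counter

def ngram_table(docs, max_n, tau):
    """Count all n-grams (n = 1 .. max_n) across documents, then
    prune those whose frequency falls below the threshold tau."""
    counts = Counter()
    for doc in docs:
        words = doc.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= tau}

docs = ["please call me asap", "please call if you",
        "please call asap", "if you call me asap"]
table = ngram_table(docs, max_n=2, tau=2)
# ("please", "call") survives with count 3; ("call", "if") is pruned
```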

3.1  Significance

A phrase, for us, is any sequence of words, not necessarily a grammatical
construct. This means that there are no explicit phrase
boundaries; determining such boundaries is a first requirement.
In practice, this means that we have to decide how many words ahead we
wish to predict in making a suggestion to the user.

On the one hand, we could err by providing suggestions that are too specific:
for example, a certain prefix of the sentence may be a valid completion with a very high
probability of being correct, while the entire suggestion in its completeness
has a lower chance of being accepted. Conversely, the suggestions may be too
conservative, losing an opportunity to autocomplete a longer phrase.

We use the following definition to balance these requirements:


A phrase "AB" is said to be significant if it satisfies the following four conditions:

  • frequency : The phrase "AB" occurs with a threshold frequency of at least τ in the corpus.
  • co-occurrence : "AB" provides additional information
    over "A", i.e. its observed joint probability is higher than that of
    independent occurrence:

    P("AB") > P("A") · P("B")     (1)

  • comparability : "AB" has a likelihood of occurrence that is “comparable" to that of "A". Let z ≥ 1 be a comparability factor. We write this formally as:

    P("AB") ≥ (1/z) · P("A")     (2)

  • uniqueness : For every choice of "C", "AB" is much more
    likely than "ABC". Let y ≥ 1 be a uniqueness factor such that, for all C:

    P("AB") ≥ y · P("ABC")     (3)

z and y are tuning parameters: probabilities within a factor of z are considered comparable, and probabilities more than a factor of y apart are considered very different.

In our example, we set the frequency threshold τ = 2, the comparability factor z = 2, and the uniqueness factor y =
3. Consider Doc. 1 in the document table. Notice that the phrase “please call” meets all four conditions
of frequency, co-occurrence, comparability, and uniqueness, and is a
significant phrase. On the other hand, “please call me”
fails to meet the uniqueness requirement, since “please call me asap” has the same frequency. In essence, we are trying to locate phrases that represent sequences of words that occur together more frequently than by chance, and that cannot be extended to longer sequences without lowering the probability substantially. It is possible for multiple significant phrases to share the same prefix; e.g., both “call me” and “call me asap” are significant in the example above.
This notion of significant phrases is central to providing effective suggestions for autocompletion.
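The four conditions can be checked mechanically. The sketch below estimates P(·) as an n-gram's count divided by the total number of n-grams of that length, and splits a longer phrase as (all-but-last-word, last-word) for the co-occurrence test; both are our assumptions, since the paper does not pin down the normalization:

```python
from collections import Counter

docs = ["please call me asap", "please call if you",
        "please call asap", "if you call me asap"]

counts, totals = Counter(), Counter()
for doc in docs:
    words = doc.split()
    for n in range(1, 4):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
            totals[n] += 1

def prob(gram):
    return counts[gram] / totals[len(gram)]

def is_significant(phrase, tau=2, z=2, y=3):
    """Frequency, co-occurrence, comparability, uniqueness checks."""
    a, b = phrase[:-1], phrase[-1:]
    if counts[phrase] < tau:                      # frequency
        return False
    if prob(phrase) <= prob(a) * prob(b):         # co-occurrence
        return False
    if prob(phrase) < prob(a) / z:                # comparability
        return False
    longer = [g for g in counts
              if len(g) == len(phrase) + 1 and g[:len(phrase)] == phrase]
    return all(counts[phrase] >= y * counts[g] for g in longer)  # uniqueness
```

With the example corpus, "please call" passes all four conditions, while "please call me" is rejected.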

4  The FussyTree

Suffix trees are widely used, and are ideal data structures to determine
completions of given strings of characters. Since our concern is to find
multi-word phrases, we propose to define a suffix tree data structure over an alphabet of
words. Thus, each node in the tree represents a word. Such a phrase
completion suffix tree
is complementary with respect to a standard
character-based suffix tree, which could still be used for intra-word
completions using standard techniques. We refer to our data structure as a FussyTree, in that it is fussy about the strings added to it.

In this section we introduce the FussyTree as a variant of the pruned count suffix tree, particularly suited for phrase completion.

4.1  Data Structure

Since suffix trees can grow very large, a pruned count suffix
tree [16] (PCST) is often suggested for applications such as ours.
In such a tree, a count is maintained with each node, representing the number of
times the phrase corresponding to the node occurs in the corpus. Only nodes
with sufficiently high counts are retained, yielding a significant savings in
tree size by removing low-count nodes that are not likely to matter for the
results produced. We use a PCST data structure, where every node is the
dictionary hash of a word and each path from the root to a leaf represents a
frequent phrase. The depth of the tree is bounded by the size of the largest
frequent phrase h, according to the hth order Markov assumption, which
constrains wt to depend on at most the words wt−1, …, wt−h.
Since there are no given demarcations to signify the beginning and end of
frequent phrases, we are required to store every possible frequent substring,
which is clearly possible with suffix trees. Also, the structure allows for fast,
constant-time retrieval of phrase completions, given fixed bounds on the maximum
frequent phrase length and the number of suggestions per query.

Hence, a constructed tree has a single root, with all paths leading to a leaf
being considered frequent phrases. In the Significance FussyTree variant, we additionally add a marker to each significant node, to denote its importance in the tree. By setting the pruning count
threshold to τ, we can assume that all pruned nodes are not significant, in that they do not
satisfy the frequency condition for significance. However, not all retained nodes are significant. Since we
are only interested in significant phrases, we can prune any leaf nodes of the
ordinary PCST that are not significant. (We note that this is infrequent. By
definition, a leaf node u has a count greater than the pruning threshold, and
each of its children a count less than the threshold. As such, it is likely that
the uniqueness condition is satisfied. Also, since the count is high enough for
any node not pruned, the comparability test is also likely to be satisfied. The
co-occurrence test is satisfied by most pairs of consecutive words in natural
language.)

4.2  Querying: The Probe Function

The standard way to query a suffix tree is to begin at the root and traverse
down, matching one query character in turn at each step. Let node u be the
current node in this traversal at the point that the query is exhausted in that
all of its characters have been matched. The sub-tree rooted at node u
comprises all possible completions of the given query string.

For the FussyTree, we have to make a few modifications to the above procedure
to take into account the lack of well-defined phrase boundaries. There are
precisely two points of concern – we have to choose the beginning of the prefix
phrase for our query as well as the end point of the completions.

Recall that no query is explicitly issued to our system – the user is
typing away, and we have to decide how far back we would like to go to define
the prefix of the current phrase, and use as the query to the suffix tree. This
is a balancing act. If we choose too long a prefix, we will needlessly
constrain our search by including irrelevant words from the “distant” past
into the current phrase. If we choose too short a prefix, we may lose crucial
information that should impact likely completions. Previous studies [14] have
demonstrated a “short memory property” in natural language, suggesting that
the probability distribution of a word is conditioned on the preceding 2-3
words, but not much more. In our scenario, we
experimentally found 2 to be the optimum choice, as seen in section 6.5.2. So we always use the last two
words typed in as the query for the suffix tree.

Once we have traversed to the node corresponding to the prefix, all descendants
of this node are possible phrase completions. In fact, there are many
additional phrase completions corresponding to nodes that have been pruned away
in the threshold-based pruning. We wish to find not all possible completions,
but only the significant ones. Since we have already marked nodes as significant,
this is a simple matter of traversing the sub-tree and returning all
significant nodes (which represent phrases) found. In general there will be more than one of them.
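A minimal sketch of this probe, with our own class and function names and with significance flags assumed to be already set on the nodes:

```python
class Node:
    def __init__(self):
        self.children = {}        # word -> Node
        self.significant = False

def insert(root, phrase, significant_depths=()):
    """Add a phrase as a path; flag the listed depths as significant."""
    node = root
    for depth, word in enumerate(phrase, start=1):
        node = node.children.setdefault(word, Node())
        if depth in significant_depths:
            node.significant = True

def probe(root, prefix):
    """Walk down along the typed prefix, then return the completions
    corresponding to every significant descendant."""
    node = root
    for word in prefix:
        if word not in node.children:
            return []
        node = node.children[word]
    completions = []
    def collect(n, path):
        for word, child in n.children.items():
            if child.significant:
                completions.append(path + [word])
            collect(child, path + [word])
    collect(node, [])
    return completions

root = Node()
insert(root, ("please", "call", "me", "asap"), significant_depths={2, 4})
```

Probing with the prefix ("please", "call") returns the one significant completion below it, ("me", "asap").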

4.3  Tree Construction

4.3.1  Naive algorithm

The construction of suffix trees typically uses a
sliding window approach. The corpus is considered a stream of
word-level tokens in a sliding window of size N, where N is the order of the Markovian assumption. We consider each window instance as a separate string, and attempt to insert all its prefixes into the PCST. For
every node that already exists in the tree, the node count is incremented by
one. The tree is then pruned of infrequent items by deleting all nodes
(and subsequent descendants) whose count is lower than the frequency
threshold τ.

[Algorithm listing garbled in source: the corpus is scanned with a sliding window; each windowed phrase f produced by Sliding-Window(P, i), for window sizes i = 1 … maxfreq, is checked with Frequent-Table-Contains(f); if present, the phrase is aligned to the relevant node in the existing suffix tree and all the remaining words are added as a new path.]

4.3.2  Simple FussyTree Construction algorithm

The naive algorithm for PCST construction described above does not scale well
since the entire suffix tree including infrequent phrases is constructed as an
intermediate result, effectively storing a multiple of the corpus size in the
tree.  [16] suggests calling the prune() function at
frequent intervals during the construction of the suffix tree using a scaled
threshold parameter. However, our experiments showed that this strategy either
produced highly inaccurate trees in the case of aggressive thresholding, or even the intermediate trees
simply grew too large to be handled by the system with less aggressive thresholding.

To remedy this, we propose a basic Simple FussyTree Construction algorithm in which we
separate the tree construction step into two parts. In the first step we use a
sliding window to create an n-gram frequency table for the corpus, where n =
1,...,N, such that N is the training sentence size. We then prune all phrases below the
threshold parameter and store the rest. In our second step, we add phrases to
the tree in a fussy manner, using another sliding window pass to scan
through the corpus. This step stores all the prefixes of the window, given
that the phrase exists in the frequency table generated in the first step. The
testing of the phrase for frequency is done in an incremental manner, using the
property that for any bigram to be frequent, both its constituent unigrams need
to be frequent, and so on. Since the constituent n-grams are queried more
than once for a given region, due to the sliding-window nature of our
insertion, we implement an LRU cache to optimize the isFrequent() lookups.
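The two passes can be compressed into a short sketch. Nested dicts stand in for tree nodes, the names are ours, and Python's `lru_cache` plays the role of the LRU cache mentioned above:

```python
from collections import Counter
from functools import lru_cache

def build_fussytree(docs, N, tau):
    """Pass 1: n-gram frequency table, pruned at threshold tau.
    Pass 2: sliding window over the corpus, inserting only the
    window prefixes that survive in the table."""
    streams = [d.split() for d in docs]
    freq = Counter()
    for words in streams:
        for n in range(1, N + 1):
            for i in range(len(words) - n + 1):
                freq[tuple(words[i:i + n])] += 1
    frequent = {g for g, c in freq.items() if c >= tau}

    @lru_cache(maxsize=1024)
    def is_frequent(gram):
        return gram in frequent

    tree = {}  # word -> nested subtree
    for words in streams:
        for i in range(len(words)):
            window = tuple(words[i:i + N])
            node = tree
            # incremental check: stop at the first infrequent prefix,
            # since a phrase cannot be frequent if one of its prefixes is not
            for j in range(1, len(window) + 1):
                if not is_frequent(window[:j]):
                    break
                node = node.setdefault(window[j - 1], {})
    return tree

docs = ["please call me asap", "please call if you",
        "please call asap", "if you call me asap"]
tree = build_fussytree(docs, N=2, tau=2)
```

On the Table 1 corpus, the frequent bigram "call me" is inserted, while the infrequent "call if" is not.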

An example illustrating the FussyTree algorithm

We use the data from the example provided in Table 1 to construct our
aggregate tokenized stream:

(please, call, me, asap, -:END:-,
please, call, if, you, …)

where the -:END:- marker implies a frequency
of zero for any phrase containing it. We now begin the construction of our tree
using a sliding window of size 4. The first phrase to be added is (please,
call, me, asap),
followed by its prefixes, (please, call, me),
(please, call) and (please). We then shift the window to
(call, me, asap, -:END:-), and add its
prefixes (call, me, asap), (call, me) and (call), and
so on. All phrases apart from those that contain either of (call, if),
(call, asap), (you, call) or (-:END:-) will meet the frequency
requirements, resulting in Fig. 1. Significant phrases are marked with a *. Note that all leaves are significant (due to the uniqueness clause in our definition of significance), but some internal nodes are significant too.


Figure 1: An example of a constructed suffix-tree

4.3.3  Analysis

We now provide a brief complexity analysis of our tree construction algorithm. It uses a sliding-window approach over the corpus, adding the windowed phrases to the tree. Given a corpus of size S and a window of fixed size h, we perform frequency checks on each of the phrases incrementally. Hence, the worst-case number of actual frequency checks performed is S × 2h. Note that h is a small number (usually under 8), and hence does not pose any cause for concern. Additionally, the LRU cache for frequency checks brings this number down greatly, guaranteeing that our corpus scanning is done in O(S). Further, each phrase can in the worst case create a completely new branch for itself, which can have a maximum depth of h, the size of the window. Thus, the number of tree operations is also linear in the corpus size, indicating that our construction time will grow linearly with the size of the corpus. The constant-time lookup for the querying stage is explained thus: a query can at worst force us to walk down the longest branch in the tree, which is of maximum depth h, a fixed number.

We now look into the correctness of our tree construction with respect to our definition of significance. In our implementation, the FussyTree is constructed with only those phrases possessing a frequency greater than τ, ensuring that all nodes marked significant in the tree represent frequent phrases. This asserts the correctness of the frequency rule in the notion of significance. The co-occurrence and comparability conditions are also rendered correctly, since we possess all the count information needed to mark them. However, to assert uniqueness at the leaf nodes of our tree, we note the loss of count information for the leaf nodes’ children. Obviously, the frequency for the children of the leaf nodes is less than the threshold, and hence is most likely low enough to satisfy the uniqueness clause. However, there may be a case in which a child has only just missed the threshold, making the leaf node ineligible for the uniqueness clause. In this case, we conservatively mark all leaf nodes as significant.

4.3.4  Training sentence size determination

The choice of the training sentence size N is a significant parameter decision.
The cost of tree construction goes up with N, as longer frequent phrases are
considered during construction. The n-gram frequency tables (where n = 1 … N) that must be stored also take up considerable resources, both in terms of storage and of query time to check for frequency. The cost of phrase look-up in the resultant
tree also increases for this reason. Hence, in the interest of
efficiency, we would like to minimize the value of N. Yet, using too small a
value of N will induce errors in our frequency estimates, by ignoring crucial
dependence information, and hence lead to poor choices of completions. Keeping this in mind, we experimentally derive the value of N in Sec. 6.5.1.

We note similar ideas in  [41], where frequency information is recorded
only if there is a significant divergence from the inferred conditional
probabilities. However we point out that the frequency counts there are
considered on a per-item basis, as opposed to our approach, which is much faster
to execute since there are no requirements to do a recurring frequency check
during the actual creation of the FussyTree structure.

4.4  Telescoping

Telescoping [26] is a very effective space-compression
method in suffix trees (and tries), and involves collapsing any single-child
node (one with no siblings) into its parent node.
In our case, since each node possesses a unique count, telescoping would result
in a loss of information, and so cannot be performed directly.
Given the large amount of storage required to keep all these counts, we seek
techniques to reduce it through mechanisms akin to telescoping.

To achieve this, we replace the multi-byte count element with a single-bit flag that marks whether a node is
“significant", in each node at depth greater than one. All linear paths between significant nodes are considered safe
to telescope. In other words, phrases that are not significant are ignored
when determining whether to telescope, and we do permit the collapsing together
of parent and child nodes with somewhat different counts, as long as the parent
has no other significant children. To estimate the frequency of each phrase, we
do not discard the count information from the significant nodes directly
adjacent to the root node. By the comparability rule in the definition
of significance, this provides an upper bound for the frequency
of all nodes, which we use in place of the actual frequency in the
probe function. We call the resulting telescoped structure a Significance FussyTree with (offline) significance marking.
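One way to realize this compaction is sketched below, over nested-dict nodes of our own design (counts are omitted entirely here; the real structure keeps them for the root's direct children). Every run of non-significant, single-child nodes is absorbed into the edge label leading to the next significant node or branch point:

```python
def telescope(node):
    """Collapse runs of non-significant single-child nodes, so that
    explicit nodes remain only at branch points and at significant
    phrases; edge labels absorb the collapsed words."""
    merged = {}
    for label, child in node["children"].items():
        while len(child["children"]) == 1 and not child["significant"]:
            (word, only_child), = child["children"].items()
            label = label + " " + word
            child = only_child
        telescope(child)
        merged[label] = child
    node["children"] = merged

# a single chain: please -> call* -> me -> asap*   (* = significant)
leaf = {"significant": True, "children": {}}
me = {"significant": False, "children": {"asap": leaf}}
call = {"significant": True, "children": {"me": me}}
please = {"significant": False, "children": {"call": call}}
root = {"significant": False, "children": {"please": please}}
telescope(root)
```

After telescoping, the chain collapses to two edges, "please call" and "me asap", each ending at a significant node.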

Online significance marking

An obvious method to mark the significance of nodes would be to traverse the
constructed suffix tree in a post-processing step, testing and marking each node.
However, this method requires an additional pass over the entire trie. To avoid this, we propose a
heuristic that performs an on-the-fly marking of significance, which enables us to
perform a significance-based tree compaction without having to traverse the entire tree again.

Given the addition of phrase ABCDE, only E is considered for promotion to
flagged status, using the current frequency estimates, and D is considered
for demotion of its status. This ensures at least 2-word pipelining. In the
case of the addition of ABCXY to a path ABCDE where E is significant, the
branch point C is considered for flag promotion, and the immediate descendant
significant nodes are considered for demotion. This ensures that common
parts of various phrases are considered for significance.

[Algorithm listing garbled in source: the phrase is aligned to the relevant node in the existing suffix tree, using a cursor c that marks the last point of alignment; the remaining words are then added as a new path, with c again at the last point of insertion; finally, Consider-Demotion(c → parent) is invoked, and each child in c → children is considered in turn.]
We call the resulting structure a Significance FussyTree with online significance marking.

5  Evaluation Metrics

Current phrase / sentence completion algorithms discussed in related work
consider only single completions; hence those algorithms are evaluated using the
following metrics:

Precision = n(accepted completions) / n(predicted completions)

Recall = n(accepted completions) / n(queries, i.e. initial word sequences)

where n(accepted completions) denotes the number of accepted completions, and so on.

In the light of multiple suggestions per query, the notion of an accepted
completion is not boolean any more, and hence needs to be quantified. Since our
results are a ranked list, we use a scoring metric based on the inverse rank of
our results, similar to the Mean and Total Reciprocal Rank scores
described in [35], which are widely used in the evaluation of information
retrieval systems with ranked results. Also, if we envisage the interface as a
drop-down list of suggestions, the value of each suggestion is
inversely proportional to its rank, since it requires 1 keypress to select the
top-ranked option, 2 keypresses to select the second-ranked one, and so on. Hence
our precision and recall measures are redefined as:

Precision = Σ (1/rank of accepted completion) / n(predicted completions)

Recall = Σ (1/rank of accepted completion) / n(queries, i.e. initial word sequences)

To this end, we propose an additional metric based on a “profit” model, to quantify the number
of keystrokes saved by the autocompletion suggestions. We define the income as a
function of the length of the correct suggestion for a query, and the cost as a
function of a distraction cost d and the rank of the suggestion. Hence, we define the Total Profit Metric (TPM) as:

TPM(d) = Σ ((sug. length × isCorrect) − d) / (length of document)

where isCorrect is a boolean value in our sliding-window test, and d
is the value of the distraction parameter. The TPM metric measures the
effectiveness of our suggestion mechanism, while the precision and
recall metrics refer to the quality of the
suggestions themselves. Note that the suggestions are
expected to be completely unobtrusive to user input: the value given to the
distraction parameter is subjective, and depends solely on how much a user is
affected by having an autocompletion suggestion pop up while she is typing. The
metric TPM(0) corresponds to a user who does not mind the distraction at all, and computes
exactly the fraction of keystrokes saved as a result of the autocompletion. The
ratio of TPM(0) to the total typed document length tells us what fraction of the
human typing effort was eliminated as a result of autocompletion. TPM(1) is an
extreme case where we consider every suggestion (right or wrong) to be a blocking
factor that costs us one keystroke. We conjecture that the average, real-world
user distraction value would be closer to 0 than 1.
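The redefined metrics and TPM can be sketched as follows. The function names are ours; we also take n(predicted completions) to mean the number of queries for which a prediction was offered, and charge the distraction cost d once per suggestion shown, both of which are our assumptions about details the text leaves open:

```python
def rr_metrics(accepted_ranks, total_queries):
    """accepted_ranks: for each query that produced suggestions, the
    1-based rank of the accepted suggestion, or None if none was
    accepted. Returns reciprocal-rank (precision, recall)."""
    score = sum(1.0 / r for r in accepted_ranks if r is not None)
    precision = score / len(accepted_ranks)
    recall = score / total_queries
    return precision, recall

def tpm(suggestions, doc_length, d):
    """suggestions: (suggestion_length, was_accepted) pairs.
    Income: keystrokes saved by accepted suggestions; cost: the
    distraction d, charged for every suggestion shown."""
    profit = sum((length if accepted else 0) - d
                 for length, accepted in suggestions)
    return profit / doc_length
```

With three answered queries out of ten (accepted at ranks 1 and 2, one miss), precision is 1.5/3 and recall 1.5/10; TPM(0) of one accepted 5-character suggestion in a 100-character document is 0.05.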

We hence consider three main algorithms for comparison: (1) we take the naive pruned count suffix
tree algorithm (Basic PCST) as our baseline; (2) we consider the basic version of the FussyTree construction algorithm, called Simple FussyTree Construction, as an initial comparison, where we use frequency counts to construct our suffix tree; (3) our third variant, Significance FussyTree Construction, is used to create a significance-based FussyTree with significance marked using the online heuristic.

6  Experiments

We use three datasets for the evaluation, with increasing degrees of size
and language heterogeneity. The first is a subset of the Enron
corpus [15], comprising emails sent by a single person, which spans
366 emails or 250K characters. We use an email collection of multiple Enron
employees, with 20,842 emails / 16M characters, as our second collection. We use
a more heterogeneous random subset of Wikipedia for our third dataset, comprising 40,000 documents / 53M characters. All documents were preprocessed
and transformed from their native formats into lowercase plaintext ASCII.
Special invalid tokens (invalid Unicode transformations, base64 fragments from
email) were removed, as was all punctuation, so that we can concentrate on
simple words for our analysis. All experiments were implemented in the Java
programming language and run on a 3GHz x86-based computer with 2 gigabytes of
memory and inexpensive disk I/O, running the Ubuntu Linux operating system.

To evaluate the phrase completion algorithms, we employ a sliding window based
test-train strategy using a partitioned dataset. We consider multi-document
corpora, aggregated and processed into a tokenized word stream. The sliding
window approach works in the following manner: we assume a fixed window of size
n, larger than or equal to the training sentence size. We then scan
through the corpus, using the first α words of each window as the query, and the
β words preceding the window as our context. We then retrieve
a ranked list of suggestions using the suggestion algorithm we are testing, and
compare the predicted phrases against the remaining n−α words in the window,
which are considered the “true” completion. A suggested completion is accepted
if it is a prefix of the “true” completion.
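The test strategy above amounts to a short loop; `suggest` stands for whatever suggestion algorithm is under test (its signature is our assumption):

```python
def evaluate(tokens, suggest, n=10, alpha=2, beta=3):
    """Slide a window of size n over the token stream. The beta words
    before the window form the context, the first alpha words of the
    window are the query prefix, and the remaining n - alpha words are
    the 'true' completion. A suggestion counts as accepted iff it is
    a prefix of the true completion."""
    accepted = total = 0
    for i in range(beta, len(tokens) - n + 1):
        context = tokens[i - beta:i]
        prefix = tokens[i:i + alpha]
        truth = tokens[i + alpha:i + n]
        total += 1
        for completion in suggest(context, prefix):
            if truth[:len(completion)] == completion:
                accepted += 1
                break
    return accepted, total
```

For instance, on the stream "a b c a b c a b c" with a toy algorithm that suggests ["c"] whenever the prefix is ["a", "b"], two of the six windows are completed correctly.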

6.1  Tree Construction

We consider the three algorithms described at the end of Sec. 5 for analysis. We ran the three
algorithms over all three candidate corpora, testing for construction
time and tree size. We used the standard PCST construction algorithm [14] as a baseline. We found the values of comparability parameter z = 2 and uniqueness parameter y = 2 in computing significance to be optimal in all cases we tested; therefore, we use these values in all experiments reported. Additionally, the default threshold values for the three corpora were kept at 1.5 × 10−5 of the corpus size, the value of the training sentence size N was kept at 8, and the trees were queried with a prefix size of 2. The threshold, training sentence size and prefix size parameters were experimentally derived, as shown in the following sections.

We present the construction times for all three datasets in Table 2.
The PCST data structure does not perform well for large sized data – the
construction process failed to terminate after an overnight run for the
Wikipedia dataset. The FussyTree algorithms scale much better
with respect to data size, taking just over half an hour to construct the tree
for the Wikipedia dataset. As we can see from Table 2, the Significance FussyTree algorithm is faster than offline significance marking on the Simple FussyTree, and is only slightly slower to construct than the Simple FussyTree algorithm.

Algorithm              Small    Large Enron    Wikipedia
Basic PCST             13181    1592153        –
Simple FussyTree       16287    1024560        2073373
Offline Significance   19295    1143115        2342171
Online Significance    17210    1038358        2118618

Table 2: Construction Times (ms)

6.2  Prediction Quality

We now evaluate the prediction quality of our algorithms in the following
manner. We simulate the typing of a user by using a sliding window of size 10
over the test corpus, using the first three words of the window as the context,
the next two words as the prefix argument, and the remaining five as the
true completion. We then call the probe function for each prefix,
evaluating the suggestions against the true completion. If a suggestion is
accepted, the sliding window is transposed accordingly to the last completed
word, akin to standard typing behavior.

In Table 4 we present the TPM(0) and TPM(1) scores we obtained for the three datasets, along with the recall and precision of the suggestions. We compare our two tree construction algorithms, the Simple FussyTree and the Significance FussyTree. The Basic PCST algorithm is not compared, since it does not scale, and because the resultant tree would anyway be similar to the tree produced by the Simple FussyTree algorithm. We see (from the TPM(0) score) that across all the techniques and all the data sets, we save approximately a fifth of the keystrokes typed. Given how much time each of us puts into composing text every day, this is a substantial saving.

To put these numbers in perspective, we consider modern word processors such as Microsoft
Word [23], where autocompletion is done on a single-suggestion basis
using a dictionary of named entities. We use the main text of the Lord of The Rings page on the English
Wikipedia as our token corpus, spanning 43888 characters. In a very
optimistic and unrealistic scenario, let us assume that all named
entities on that page (for our purposes, all text that is linked to another
Wikipedia entry) are part of this dictionary (this could be automated using a
named-entity tagger), and also assume that we have perfect prediction
quality given a unit prefix size. Given this scenario, there are 226 multiple-word
named entity instances, which, under our optimistic assumptions, would
result in an ideal-case completion profit of 2631 characters, giving us a TPM(0) of
5.99%. Our phrase completion techniques do better than this by a factor of 3.
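The back-of-envelope figure can be checked directly:

```python
# Ideal-case profit of the named-entity dictionary baseline:
# 2631 characters completed out of a 43888-character page.
tpm0 = 2631 / 43888
assert round(100 * tpm0, 2) == 5.99   # TPM(0) of 5.99%
```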

Despite the discarding of information through significance marking and telescoping, we observe that the Significance FussyTree algorithm results in a much smaller tree (as seen in Table 3), and still performs comparably with the baseline Simple FussyTree in terms of result quality. We observe an overall increase in recall and precision on adding significance, and a slight reduction in the TPM metrics, especially in high-variance corpora (as opposed to single-author text). We consider the TPM drop to be acceptable in light of the reduction in tree size.

Algorithm          Small   Large Enron   Wikipedia
Simple FussyTree   8299    12498         25536
Sigf. FussyTree    4159    8035          14503

Table 3: Tree Sizes (in nodes) with default threshold values

Corpus: Enron Small

Algorithm          Recall   Precision   TPM(0)   TPM(1)
Simple FussyTree   26.67%   80.17%      22.37%   18.09%
Sigf. FussyTree    43.24%   86.74%      21.66%   16.64%

Corpus: Enron Large

Algorithm          Recall   Precision   TPM(0)   TPM(1)
Simple FussyTree   16.59%   83.10%      13.77%   8.03%
Sigf. FussyTree    26.58%   86.86%      11.75%   5.98%

Corpus: Wikipedia

Algorithm          Recall   Precision   TPM(0)   TPM(1)
Simple FussyTree   28.71%   91.08%      17.26%   14.78%
Sigf. FussyTree    41.16%   93.19%      8.90%    4.6%

Table 4: Quality Metrics

Looking across the data sets, we observe a few weak trends worth noticing. If we just look at the simple FussyTree scores for recall and precision, we find that there is a slight improvement as the size of the corpus increases. This is to be expected – the larger the corpus, the more robust our statistics. However, the TPM scores do not follow suit, and in fact show an inverse trend. We note that the difference between TPM and recall/precision is that the latter additionally takes into account the length of suggestions made (and accepted), whereas the former only considers their number and rank. What this suggests is that the average length of accepted suggestion is highest for the more focused corpus, with emails of a single individual, and that this length is lowest for the heterogeneous encyclopedia. Finally, the recall and precision trends are mixed for the significance algorithms. This is because these algorithms already take significance and length of suggestion into account, merging the two effects described above. As expected, the TPM trends are clear, and all in the same direction, for all the algorithms.

6.3  Response Time

We measure the average query response times for the various algorithms and datasets.
As Table 5 shows, we are well within the 100ms time limit for an “instantaneous"
response, for all algorithms.

Algorithm          Small   Large Enron   Wikipedia
Simple FussyTree   0.020   0.02          0.02
Sigf. FussyTree    0.021   0.22          0.20
Sigf. + POS        0.30    0.23          0.20

Table 5: Average Query Times (ms)

6.4  Online Significance Marking

We test the accuracy of the online significance marking heuristic by comparing
the tree it generates against the offline-constructed gold standard. The
quality metrics, as shown in Table 6, show that this method is near perfect in terms of accuracy, yielding excellent precision and recall scores in a shorter tree construction time.

Dataset       Accuracy   Precision   Recall
Enron Small   99.62%     97.86%      98.30%
Enron Large   99.57%     99.78%      99.39%

Table 6: Online Significance Marking heuristic quality, against offline baseline

6.5  Tuning Parameters

6.5.1  Varying training sentence size

As described in Sec. 4.3.4, the choice of the training sentence size N is crucial to the quality of the predictions. We vary the value of this training sentence size N while constructing the Significance FussyTree for the Small Enron dataset, and report the TPM scores in Fig. 2. We infer that the ideal value for N for this dataset is 8.


Figure 2: Effect of varying (a) training sentence size (b) prefix length on TPM in Small Enron dataset

6.5.2  Varying prefix length

The prefix length in the probe function is a tuning parameter. We study the
effects of suggestion quality across varying lengths of query prefixes, and present them in Fig. 2. We see that the quality of results is maximized at length = 2.

6.5.3  Varying frequency thresholds

To provide insight into the size of the FussyTree (in nodes), we vary the frequency
threshold during the construction of the FussyTree for the Small Enron and Large
Enron datasets, and observe the change in tree size, as shown in Fig. 3.


Figure 3: Effect of varying threshold on FussyTree size in (a) Small Enron (b) Large Enron datasets

6.5.4  Varying tree size

We now use the frequency threshold as a way to scale up the size of the tree for
the Significance FussyTree. We plot the TPM values (0 and 1) against the tree size for the Small Enron dataset. As can be seen in Fig. 4, the decrease in tree size initially causes a very gradual decrease in TPM, which then begins to drop quite sharply at around threshold = 5 and above. Hence, we prefer to use threshold = 4 in this case.


Figure 4: Effect of varying FussyTree size on TPM for Small Enron dataset, t labels show threshold values for the respective tree sizes

7  Possible Extensions

In the experiments over our test corpora in Sec. 6.3, we observe that the probe
function takes a small fraction of a millisecond on average to execute and
deliver results. Studies by Card [5] and Miller [27] have shown
that the appearance of “instantaneous” results is achieved within a response
time span of 100ms. This validates our premise and shows that there is a
clear window of opportunity, in terms of query time, to include mechanisms that improve result quality
using additional information. We discuss some such methods in this section.

We base our extensions on the premise that the context of a phrase affects the choice of suitable phrase
completions. The context of a phrase w_i, w_{i+1}, … is defined as the set of words
w_{i−y}, w_{i−y+1}, …, w_{i−1} preceding it.
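In code, the context window can be read off directly from the word sequence; a minimal sketch (the function and variable names here are ours, not from the paper):

```python
def context_of(words, i, y):
    """Return the context w[i-y] .. w[i-1] of the phrase starting at index i."""
    return words[max(0, i - y):i]

sentence = "please find the annual reports attached".split()
# The phrase "annual reports attached" starts at index 3; with y = 2,
# its context is the two preceding words.
print(context_of(sentence, 3, 2))  # ['find', 'the']
```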


The context of a prefix can be used to determine additional semantic
properties, such as semantic relatedness of the context to the suggestion, and
the grammatical appropriateness based on a part-of-speech determination. We
conjecture that these features abstract the language model in a way so as to
efficiently rank suggestions by their validity in light of the context. This
context of the query is used to rerank the suggestions provided by the
FussyTree’s probe function. The reranking takes place as a
postprocessing step after the query evaluation.

7.1  Reranking using part-of-speech

Corpus: Enron Small

Algorithm      Recall    Precision   TPM(0)    TPM(1)
Simple + POS   25.0%     77.09%      22.22%    17.93%
Sigf. + POS    40.44%    81.13%      21.25%    16.23%

Corpus: Enron Large

Algorithm      Recall    Precision   TPM(0)    TPM(1)
Simple + POS   16.13%    85.32%      15.05%    9.01%
Sigf. + POS    23.65%    84.99%      12.88%    6.16%

Corpus: Wikipedia

Algorithm      Recall    Precision   TPM(0)    TPM(1)
Simple + POS   22.87%    87.66%      17.44%    14.96%
Sigf. + POS    41.16%    95.3%       8.88%     4.6%

Table 7: Quality Metrics, querying with reranking

We use a hash-map dictionary based on the Brill part-of-speech (POS) tagger [4] to map words to their part of speech. A part-of-speech automaton is trained on the entire text during the sliding-window tree
construction, where each node is a part of speech, and each n-gram (up to a
certain length, say 5) is a distinct path, with edge weights proportional to
the frequency. Hence the phrase “submit the annual reports” would be
considered as an increment to the “verb-det-adj-noun” path weights.
The reasoning is that part-of-speech n-grams are assumed to be a good
approximation for judging a good completion in running text. At query time, the probe function is modified to perform a successive rerank step, where the suggestions R are ranked in
descending order of prob(POS_c, POS_p, POS_ri), where POS_c, POS_p and
POS_ri are the POS of the context, prefix and suggestion, respectively. We
then evaluate the precision, recall, TPM and execution time
of the new reranking-based probe function, as reported in Table 7. We observe that the reranking has little effect on average Recall, Precision and TPM. We note, however, that on any individual query the impact can be significantly positive or negative; we are working towards a characterization of the cases where the effect is positive.
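The training and rerank steps above can be sketched as follows. This is a toy illustration, not the paper's implementation: the POS dictionary stands in for the Brill-tagger-backed hash map, and the two-sentence corpus is invented.

```python
from collections import Counter

# Toy stand-in for the Brill-tagger-backed dictionary; tags are illustrative.
POS = {"submit": "verb", "sign": "verb", "the": "det", "this": "det",
       "annual": "adj", "new": "adj", "reports": "noun", "contract": "noun",
       "documents": "noun", "attached": "adj"}

def pos_ngrams(words, n=4):
    """All POS n-grams (paths in the automaton) of a word sequence."""
    tags = [POS.get(w, "other") for w in words]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Accumulate path weights during a pass over the corpus, as in the
# sliding-window tree construction.
weights = Counter()
for sent in ["submit the annual reports", "sign this new contract"]:
    weights.update(pos_ngrams(sent.split()))

def rerank(prefix_words, suggestions):
    """Order two-word suggestions by the weight of their POS path."""
    def score(s):
        path = tuple(POS.get(w, "other") for w in prefix_words + s.split())
        return weights.get(path, 0)
    return sorted(suggestions, key=score, reverse=True)

print(rerank(["submit", "the"], ["documents attached", "annual reports"]))
# ['annual reports', 'documents attached']
```

“annual reports” wins because its verb-det-adj-noun path was reinforced by both training sentences, while “documents attached” follows a verb-det-noun-adj path never seen in training.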

7.2  Reranking using semantics


Figure 5: Bipartite graph - synset co-occurrences

Another source of contextual information is the set of meanings of the prefix’s
word neighborhood. We reason that some word meanings would have a higher
probability to co-occur with certain others. For example, for any sentence
mentioning “hammer” in its context and “the” as prefix, it
would be more probable to have “nail properly” as a possible completion
than “documents attached”, even if the latter is a much more frequent
phrase overall, disregarding context. A simple bipartite classifier is used to
consider prob(A|B), where A is the set of WordNet [10] synsets, or
word meanings in the result set, and B is the set of synsets in the query, as shown in Fig.
 5. The classifier returns a collection of synsets for every synset set provided in the context and
prefix. Suggestions are then ranked based on the number of synsets mapped from
each suggestion. However, experiments show that the benefit due to semantic reranking was statistically insignificant, and no greater than that due to POS reranking. In addition, it has substantial resource requirements: a large amount of memory to store the synset classes, and a high amount of computation for each bipartite classification of synsets. We believe this is because the co-occurrence frequency computed with the prefix already contains most of the semantic information we seek. We thus do not consider semantic reranking a worthwhile improvement to our system.

7.3  One-pass algorithm

The FussyTree algorithm involves a frequency counting step before the actual construction of the data structure. This preprocessing step can be incorporated into the tree construction phase, using on-line frequency estimation techniques, such as those in  [19, 8] to determine
frequent phrases with good accuracy. A single pass algorithm also allows us to
extend our application domain to stream-like text data, where all indexing
occurs incrementally. We analyze various features of frequent phrases, such as
part-of-speech, capitalization, and semantics. We show how the part-of-speech
information of these phrases is a good feature for predicting frequent phrases
with near 100% recall, by building a simple naive Bayesian classifier to
predict frequency. Since the part-of-speech features of frequent strings are likely to be much less sensitive to changes in corpus than the strings themselves, we reason that it is viable to create a POS-based classifier using
an initial bootstrap corpus. The classifier can then be used to build suffix
trees, while at the same time train itself. This is a very convenient technique
to use for frequent re-indexing of data, for example an overnight re-index of
all personal email that improves the autocompletion for the next day. Our
experiment proceeds as follows. We train a Naive Bayes classifier to classify a
phrase as frequent or infrequent based on the POS n-grams,
which are generated using the process described in section 7.1. We then test
the classifier on the training data, which in our case is the online trading
emails from the Enron dataset. We also report numbers from tests on other
datasets: first, a separate sample of emails from the Enron collection, and
second, the sample of text from the Wikipedia.
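A minimal version of such a classifier can be written with plain counts. This is a sketch under invented training data: the tag sequences and labels below are illustrative, and the real classifier is trained on POS n-grams generated as in Sec. 7.1.

```python
import math
from collections import Counter

# (POS tags of a phrase, observed-frequent?) -- illustrative training pairs.
train = [
    (("verb", "det", "noun"), True),
    (("verb", "det", "adj", "noun"), True),
    (("prep", "prep"), False),
    (("det", "prep"), False),
]

prior = Counter(label for _, label in train)
tag_counts = {True: Counter(), False: Counter()}
for tags, label in train:
    tag_counts[label].update(tags)
vocab = len({t for c in tag_counts.values() for t in c})

def log_posterior(tags, label, alpha=1.0):
    """log P(label) + sum over tags of log P(tag | label), Laplace-smoothed."""
    total = sum(tag_counts[label].values())
    lp = math.log(prior[label] / len(train))
    for t in tags:
        lp += math.log((tag_counts[label][t] + alpha) / (total + alpha * vocab))
    return lp

def predict_frequent(tags):
    return log_posterior(tags, True) > log_posterior(tags, False)

print(predict_frequent(("verb", "det", "noun")))  # True
```

To favor recall over precision, as the experiments below do, the decision boundary can be shifted by adding a constant margin to the frequent-class log posterior.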


Figure 6: Recall vs precision for POS-based frequency classifier.

On the conservative side of the precision-recall curve, experiments on the large
Enron collection report precision scores of 30%, 27% and 20%, for 80%,
85% and 99% recall respectively. We focus on this part of the
recall-precision curve, since it is acceptable to have a large number of
prospective frequent phrases, which are pruned later when updated with real frequency counts.

8  Conclusion and Future work

We have introduced an effective technique for multi-word
autocompletion. To do so, we have overcome multiple challenges.
First, there is no fixed end-point for a multi-word “phrase”. We
have introduced the notion of significance to address this.
Second, there often is more than one reasonable phrase completion at a point. We have established the need for ranking in completion results, and have introduced a framework
for suggesting a ranked list of autocompletions, and developed
efficient techniques to take the query context into account for this purpose. Third, there is no dictionary of possible multi-word phrases, and the number of possible combinations is extremely large. To this end, we have
devised a novel FussyTree data structure, defined as a variant of a
count suffix tree, along with a memory-efficient tree construction
algorithm for this purpose.
Finally, we have introduced a new evaluation metric, TPM, which
measures the net benefit provided by an autocompletion system much
better than the traditional measures of precision and recall.
Our experiments show that the techniques presented in this paper can
decrease the number of keystrokes typed by up to 20% for email
composition and for developing an encyclopedia entry.

Today, there are widely used word-completion algorithms, such as
T9 [38]. We have shown that phrase completion can save at least as many keystrokes as word completion. However, word completion and phrase completion are complementary rather than competing. In terms of actual implementation, phrase completion can be triggered at every word boundary (e.g., when the user types a space), while word
completion can be queried at all other times.

As possible extensions, we discussed the notion of reranking suggestions using sentence context, and one-pass construction of the FussyTree. As future work, the idea of text autocompletion can be
extended to various other applications. The current state of the art in genomics
considers only 1-8 bases as atomic units; no studies have been done on longer, variable-length
strings of bases, which would require suffix tree implementations capable of handling large vocabularies. Another possible application is the inference of XML schema,
which can be done by training the FussyTree with XML paths as input. The
single-pass version of the construction algorithm allows us to create a
FussyTree for trend detection in streams, where the significance concept can be
used to mark the start and end of various trends.



B. Bailey, J. Konstan, and J. Carlis.
The Effects of Interruptions on Task Performance, Annoyance, and
Anxiety in the User Interface.
Proceedings of INTERACT 2001.

C. Benson, M. Muller-Prove, and J. Mzourek.
Professional usability in open source projects.
Conference on Human Factors in Computing Systems, pages
1083–1084, 2004.

S. Bickel, P. Haider, and T. Scheffer.
Learning to Complete Sentences.
16th European Conference on Machine Learning (ECML ’05), pages
497–504, 2005.

E. Brill.
A simple rule-based part of speech tagger.
Proceedings of the Third Conference on Applied Natural Language
Processing, pages 152–155, 1992.

S. Card, G. Robertson, and J. Mackinlay.
The information visualizer, an information workspace.
Proceedings of the SIGCHI conference on Human factors in
computing systems: Reaching through technology
, pages 181–186, 1991.

J. Darragh, I. Witten, and M. James.
The Reactive Keyboard: a predictive typing aid.
Computer, 23(11):41–49, 1990.

B. Davison and H. Hirsh.
Predicting sequences of user actions.
Notes of the AAAI/ICML 1998 Workshop on Predicting the Future:
AI Approaches to Time-Series Analysis
, 1998.

R. Fagin, A. Lotem, and M. Naor.
Optimal aggregation algorithms for middleware.
Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium
on Principles of database systems
, pages 102–113, 2001.

M. Farach.
Optimal suffix tree construction with large alphabets.
Proceedings of the 38th Annual Symposium on Foundations of Computer
Science, pages 137–143, 1997.

C. Fellbaum.
Wordnet: An Electronic Lexical Database.
Bradford Book, 1998.

E. Giachin and T. CSELT.
Phrase bigrams for continuous speech recognition.
Proceedings of the 1995 International Conference on Acoustics, Speech,
and Signal Processing (ICASSP-95), vol. 1, 1995.

K. Grabski and T. Scheffer.
Sentence completion.
Proceedings of the 27th Annual International Conference on
Research and Development in Information Retrieval, pages 433–439, 2004.

S. Heinz, J. Zobel, and H. Williams.
Burst tries: a fast, efficient data structure for string keys.
ACM Transactions on Information Systems, 20(2):192–223, 2002.

H. Jagadish, R. Ng, and D. Srivastava.
Substring selectivity estimation.
Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium
on Principles of database systems
, pages 249–260, 1999.

B. Klimt and Y. Yang.
The Enron Corpus: A New Dataset for Email Classification Research.
Proceedings of the European Conference on Machine Learning, 2004.

P. Krishnan, J. Vitter, and B. Iyer.
Estimating alphanumeric selectivity in the presence of wildcards.
Proceedings of the 1996 ACM SIGMOD international conference on
Management of data
, pages 282–293, 1996.

K. Kukich.
Technique for automatically correcting words in text.
ACM Computing Surveys (CSUR), 24(4):377–439, 1992.

C. MacArthur.
Word Processing with Speech Synthesis and Word Prediction: Effects
on the Dialogue Journal Writing of Students with Learning Disabilities.
Learning Disability Quarterly, 21(2):151–166, 1998.

G. Manku and R. Motwani.
Approximate frequency counts over data streams.
Proceedings of the Twenty-Eighth International Conference on
Very Large Data Bases
, 2002.

Apple Inc.
Spotlight Search

Gnome Foundation.
Beagle Desktop Search

Google Inc.
Google Desktop Search

Microsoft Inc.
Microsoft Office Suite

Microsoft Inc.
Windows Explorer

Mozilla Foundation.
Mozilla Firefox version 2

E. M. McCreight.
A space-economical suffix tree construction algorithm.
J. ACM, 23(2):262–272, 1976.

R. Miller.
Response time in man-computer conversational transactions.
Proceedings of the AFIPS Fall Joint Computer Conference,
33:267–277, 1968.

A. Moffat and R. Wan.
Re-Store: A System for Compressing, Browsing, and Searching Large
Proceedings of the International Symposium on String Processing
and Information Retrieval, IEEE Computer Society, 2001.

H. Motoda and K. Yoshida.
Machine learning techniques to make computers easier to use.
Artificial Intelligence, 103(1):295–321, 1998.

B. Myers, S. Hudson, and R. Pausch.
Past, present, and future of user interface software tools.
ACM Transactions on Computer-Human Interaction (TOCHI),
7(1):3–28, 2000.

C. Nevill-Manning, I. Witten, and G. Paynter.
Browsing in digital libraries: a phrase-based approach.
Proceedings of the second ACM international conference on
Digital libraries
, pages 230–236, 1997.

L. Paulson.
Building rich web applications with Ajax.
Computer, 38(10):14–17, 2005.

G. Paynter, I. Witten, S. Cunningham, and G. Buchanan.
Scalable browsing for large collections: a case study.
Proceedings of the fifth ACM conference on Digital libraries,
pages 215–223, 2000.

A. Pienimaki.
Indexing Music Databases Using Automatic Extraction of Frequent
Proceedings of the International Conference on Music Information
Retrieval, pages 25–30, 2002.

D. Radev, H. Qi, H. Wu, and W. Fan.
Evaluating Web-based Question Answering Systems.
Proceedings of LREC, 2002.

T. Selker and W. Burleson.
Context-aware design and interaction in computer systems.
IBM Systems Journal, 39(3):880–891, 2000.

C. Shannon.
Prediction and entropy of printed English.
Bell System Technical Journal, 30(1):50–64, 1951.

M. Silfverberg, I. S. MacKenzie, and P. Korhonen.
Predicting text entry speed on mobile phones.
In CHI ’00: Proceedings of the SIGCHI conference on Human
factors in computing systems
, pages 9–16, New York, NY, USA, 2000. ACM.

A. Taddei.
Shell Choice: A shell comparison.
Technical report, Guide CN/DCI/162, CERN, Geneva, September 1994.

S. Tata, R. Hankins, and J. Patel.
Practical Suffix Tree Construction.
Proceedings of the Thirtieth International Conference on Very
Large Data Bases, Toronto, Canada, pages 36–47, 2004.

S. Tata, J. Patel, J. Friedman, and A. Swaroop.
Towards Declarative Querying for Biological Sequences.
Technical report, Technical Report CSE-TR-508-05, University of
Michigan, 2005.

D. Vickrey, L. Biewald, M. Teyssier, and D. Koller.
Word-Sense Disambiguation for Machine Translation.
HLT/EMNLP, 2005.

H. E. Williams, J. Zobel, and D. Bahle.
Fast phrase querying with combined indexes.
ACM Trans. Inf. Syst., 22(4):573–594, 2004.

Supported in part by NSF grant number 0438909 and NIH grant number R01 LM008106.
WordNet 2.1 has 117597 synsets, and 207016 word-sense pairs
English Wikipedia:
percentage predictions that are correct
percentage significant marked nodes that are correct
percentage of truly significant nodes correctly predicted

Qunits: Queried Units for Database Search

The following paper was published in the 4th Biennial Conference on Innovative Data Systems Research (CIDR) January 4-7, 2009, Asilomar, California, USA.
The PDF version is available (CIDR mirror).

Arnab Nandi
Computer Science, EECS
University of Michigan, Ann Arbor
H.V. Jagadish
Computer Science, EECS
University of Michigan, Ann Arbor

March 26, 2008

Abstract: Keyword search against structured databases has become a popular topic of investigation, since many users find structured queries too hard to express, and enjoy the freedom of a “Google-like” query box into which search terms can be entered. Attempts to address this problem face a fundamental dilemma. Database querying is based on the logic of predicate evaluation, with a precisely defined answer set for a given query. On the other hand, in an information retrieval approach, ranked query results have long been accepted as far superior to results based on boolean query evaluation. As a consequence, when keyword queries are attempted against databases, relatively ad-hoc ranking mechanisms are invented (if ranking is used at all), and there is little leverage from the large body of IR literature regarding how to rank query results.

Our proposal is to create a clear separation between ranking and database querying. This divides the problem into two parts, and allows us to address these separately. The first task is to represent the database, conceptually, as a collection of independent “queried units”, or qunits, each of which represents the desired result for some query against the database. The second task is to evaluate keyword queries against a collection of qunits, which can be treated as independent documents for query purposes, thereby permitting the use of standard IR techniques. We provide insights that encourage the use of this query paradigm, and discuss preliminary investigations into the efficacy of a qunits-based framework based on a prototype implementation.

1  Introduction

Keyword search in databases has received great attention in the recent past, to accommodate queries from users who do not know the structure of the database or are not able to write a formal structured query. Projects such as BANKS [3], Discover [13], ObjectRank [2] and XRank [10] have all focused on providing better algorithms for ranking elements in a database given a keyword query. These algorithms use various data models for the purposes of scoring results. For example, BANKS considers candidate results that form a minimal spanning tree of the keywords. ObjectRank, on the other hand, combines tuple-level PageRank from a pre-computed data graph with keyword matching. A common theme across such efforts is the great attention paid to different ranking models for search.

Figure 1: The Qunits Search Paradigm : The database is translated into a collection of independent qunits, which can be treated as documents for standard IR-like document retrieval.

This paradigm is similar to one adopted by the information retrieval community who strive to improve document search. For them, the task is to create an algorithm to find the best document given a keyword query. Various approaches, such as TF/IDF and PageRank, have emerged as popular approaches to solving the search problem, where the user is presented with a list of snippets from top-ranking documents.

However, unlike document collections, large schemas do not have the luxury of a clear definition of what a document is, or what a search result comprises. Even if one assumes that the ranking of the results is correct, it remains unclear what information should be presented to the user, i.e., what the snippet of a database search result should contain. Records in a schema-rich database have many possible sub-records, parent records, and joinable records, each of which is a candidate for inclusion in the result, but only a few of which are potentially useful to the average user.

For example, consider the keyword query george clooney movies. When issued on a standard document search engine, the challenge is straightforward: pick the best document corresponding to the query and return it. However, for a database, the same task becomes nontrivial. In the case of a relational database, should we return just singular tuples that contain all the words from the query? Do we perform some joins to denormalize the result before presenting it to the user? If yes, how many tables do we include, and how much information do we present from them? How do we determine what the important attributes are? The title of a movie isn’t unique due to remakes and sequels, and hence it will not be a key. How do we determine it is more useful to the user than the primary key, movie_id? Clearly, there are no obvious answers to these questions.

In other words, a database search system requires not just a mechanism to find matches and rank them, but also a systematic method to demarcate the extent of the results. It may appear that simply including the complete spanning tree of joins used to perform the match (in the case of relational systems, such as BANKS), or including the complete sub-tree rooted at the least common ancestor of matching nodes (in the case of XML systems, such as XSearch), will suffice. However these simple techniques are insufficient, often including both too much unwanted information and too little desired information.

One cause of too much information is the chaining of joins through unimportant references, which happens frequently in normalized databases. One reason for too little information, particularly in relational systems, is the prevalence of id attributes due to normalization. If this were the only issue, it could be addressed by performing a value join every time an internal “id” element is encountered in the result. However, the use of simple rules such as this also appears insufficient.

Figure 2: A simplified example schema

Consider, as shown in Fig. 2, part of the example schema for the Internet Movie Database. The schema comprises primarily two entities, person and movie. Additional tables, such as cast, establish relationships, while tables such as genre are used to normalize common strings. Given a keyword query such as george clooney movies, a natural solution is to perform a join of “name = george clooney” from the person table, with the movies table via cast. The movies table is normalized and contains id pointers to the locations, genre, and info table. There is nothing in terms of database structure to distinguish between these three references through id. Yet it is probably the case that resolving the id references to the shooting locations and genre is desired, but including the lengthy plot outline (located in the info table) in the search result is not.

In addition to the challenge of demarcating an actual result, the problem of identifying which result to return for a query still remains. Based on prior work in the area of search query log analysis [19, 29], and analyses described in Sections 5.1 and 5.2, we assert that:

  • A majority of users’ queries are underspecified
  • The mapping between true information need and the actual query terms is frequently not one-to-one
  • The challenge of mapping a query to a specific fragment of the database is nontrivial

In this paper, we introduce the notion of a “qunit”, representing a quantified unit of information in response to a user’s query. The search problem then becomes one of choosing the most appropriate qunit(s) to return, in ranked order, a problem that is much cleaner to address than the traditional approach of first finding a match in the database and then determining how much to return. The fundamentals of this concept are described in Sec. 2. Sec. 3 describes search inside a qunits-based system, while Sec. 4 presents different techniques for identifying qunits in a database.

We study the nature of search behavior, and then experimentally evaluate the quality of search results from our qunits-based approaches in Sec. 5. We close with a review of related work in Sec. 6 and some concluding remarks in Sec.7.

2  Qunit as a conceptual unit

We now define the notions of a qunit and qunit utility:

Qunit: A qunit is the basic, independent semantic unit of information in a database.

Every user has a “mental map” of how a database is organized. Qunits are an embodiment of this perceived organization. The organization may not be the most efficient representation, in terms of storage, or other computational considerations. For example, while the user may like to think of the IMDb as a collection of actor profiles and movie listings, the underlying relational database would store normalized “fact” tables and “entity” tables, for superior performance. In other words, a qunit can be considered as a view on a database, which we call the base expression, with an additional conversion expression that determines the presentation of the qunit.

Qunits do not need to be constrained by the underlying data representation model. This allows qunits to be much richer than views. For example, consider the IMDb database. We would like to create a qunit definition corresponding to the information need “cast”. A cast is defined as the people associated with a movie, and rather than have the name of the movie repeated with each tuple, we may prefer to have a nested presentation with the movie title on top and one tuple for each cast member. The base data in IMDb is relational, and against its schema, we would write the base expression in SQL with the conversion expression in XSL-like markup as follows:

SELECT * FROM person, cast, movie
WHERE cast.movie_id = movie.movie_id AND
cast.person_id = person.person_id AND
movie.title = "$x"
<cast movie="$x">

The combination of these two expressions forms our qunit definition. On applying this definition to a database, we derive qunit instances, one per movie.
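The two-part definition can be exercised end to end on toy data. The sketch below is ours, not the paper's system: the join keys (movie.movie_id, person.person_id) and the sample rows are assumptions, and the cast table is named cast_ only because CAST is a SQL keyword.

```python
import sqlite3

# Toy IMDb-style schema and data (assumed, for illustration only).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE movie  (movie_id INTEGER, title TEXT);
    CREATE TABLE person (person_id INTEGER, name TEXT);
    CREATE TABLE cast_  (movie_id INTEGER, person_id INTEGER);
    INSERT INTO movie  VALUES (1, 'Star Wars');
    INSERT INTO person VALUES (10, 'Mark Hamill'), (11, 'Carrie Fisher');
    INSERT INTO cast_  VALUES (1, 10), (1, 11);
""")

# Base expression: the view that selects the qunit's content.
BASE = """SELECT person.name FROM person, cast_, movie
          WHERE cast_.movie_id = movie.movie_id
            AND cast_.person_id = person.person_id
            AND movie.title = ?"""

def cast_qunit(title):
    """Conversion expression: nest the result rows under the movie title."""
    rows = db.execute(BASE, (title,)).fetchall()
    members = "".join('  <member>%s</member>\n' % name for (name,) in rows)
    return '<cast movie="%s">\n%s</cast>' % (title, members)

print(cast_qunit("Star Wars"))
```

Each call on a distinct title yields one qunit instance, matching the "one per movie" derivation described above.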

Qunit Utility: The utility of a qunit is the importance of a qunit to a user query, in the context of the overall intuitive organization of the database. This applies to both qunit definitions and qunit instances.

The number of possible views in a database is very large. Hence, the total number of candidate qunit definitions is an inordinately large number. The utility score of a qunit is required for selecting a top-ranking set of useful qunits from the large pool of candidate qunits. From a user’s perspective, this process captures the most salient and relevant concepts that should be returned for a particular query.

It should be noted that qunit utility is a subjective metric. It is dependent on the intent and nature of the user, and is not necessarily uniform for all users. For our purpose, we approximate this metric with clearly defined, objective surrogates. This is similar in spirit to measuring document relevance in information retrieval, where TF/IDF is a well-defined and widely accepted objective metric used to approximate document relevance.

Once qunits have been defined, we will model the database as a flat collection of independent qunits. Qunits may overlap, and in fact one qunit may even completely include another. However, these overlaps will not be given any special attention. Similarly, references (links, joins, etc) are expected to have been resolved to the extent desired at the time that qunits were defined. Thus in our movie database, a qunit such as movie can list its actors, but there is no explicit link between a movie qunit and an actor qunit. In subsequent querying, each qunit is treated as an independent entity.

Note, however, that context information, not part of the qunit presented to the user, may often be useful for purposes of search and ranking. For example, link structures may be used for network analysis. Our model explicitly allows for this.

3  Qunit-based Search

Consider the user query, star wars cast, as shown in Fig. 1. Queries are first processed to identify entities using standard query segmentation techniques [30].

In our case one high-ranking segmentation is “[star wars] [cast]”, and this has a very high overlap with the qunit definition that involves a join between “movie” and “cast”. Now, standard IR techniques can be used to evaluate this query against qunit instances of the identified type, each considered independently even if they contain elements in common. The qunit instance describing the cast of the movie Star Wars is chosen as the appropriate result.
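The retrieval step reduces to ordinary document ranking over the instances. A simplified sketch, with a plain term-overlap score standing in for standard IR ranking and invented qunit instances:

```python
# Each qunit instance is treated as an independent flat document.
instances = [
    "cast of star wars : mark hamill carrie fisher harrison ford",
    "cast of casablanca : humphrey bogart ingrid bergman",
    "movie star wars 1977 science fiction",
]

def best_qunit(query, docs):
    """Return the instance matching the most query terms (toy IR ranking)."""
    terms = set(query.lower().split())
    return max(docs, key=lambda d: len(terms & set(d.lower().split())))

print(best_qunit("star wars cast", instances))
```

In a real system the overlap score would be replaced by TF/IDF or any other standard IR ranking function; the point is that the ranker never inspects the database structure, only the qunit documents.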

As we can see, the qunits-based approach is a far cleaner way to model database search. In current models of keyword search in databases, several heuristics are applied to leverage the database structure to construct a result on the fly. These heuristics are often based on the assumptions that the structure within the database reflects the semantics assumed by the user (though data / link cardinality is not necessarily evidence of importance), and that all structure is actually relevant to ranking (though internal id fields were never really meant for search).

The benefit of maintaining a clear separation between ranking and database content is that structured information can be considered as one source of information amongst many others. This makes our system easier to extend and enhance with additional IR methods for ranking, such as relevance feedback. Additionally, it allows us to concentrate on making the database more efficient using indices and query optimization, without having to worry about extraneous issues such as search and ranking.

It is important to note that this conceptual demarcation of rankings and results does not imply materialization of all qunits. In fact, there is no requirement that qunits be materialized, and we expect that most qunits will not be materialized in most implementations.

4  Qunit Derivation

In the previous section, we discussed how to execute a query in a database that already had qunits identified. One possibility is for the database creator to identify qunits manually at the time of database creation. Since the subject matter expert is likely to have the best knowledge of the data in the database, such expert human qunit identification is likely to be superior to anything that automated techniques can provide. Note that identifying qunits merely involves writing a set of view definitions for commonly expected query result types and the manual effort involved is likely to be only a small part of the total cost of database design. If a forms-based database interface has been designed, the set of possible returned results constitute a good human-specified set of qunits.

Even though manual expert identification of qunits is the best, it may not always be feasible. Certainly, legacy systems have already been created without qunits being created. As such, automated techniques for finding qunits in a database are important. In this section, we discuss the automated derivation of qunits from a database.

Given a database, there are several possible sources of information that can be used to infer qunits. Of course, there is the database schema. Since identifying qunits requires us to write base expressions defining them, and writing these expressions requires schema knowledge, the schema is a natural starting point. In addition, there are three independent sources of information worth considering. The first and most obvious is the data contained in the database itself. A second source is the history of keyword queries posed to the system by previous searches against the database, also known as a query log. The final source is evidence from outside the database, such as published results and reports that could be based on information from the database in question, or a similar data source. We consider each of these three sources in turn.

4.1  Using Schema and Data

Inspired by previous efforts [2, 18] that model the database as a graph, we utilize the concept of the queriability of a schema, described in [15], to infer the important schema entities and attributes in a database. Queriability is defined as the likelihood of a schema element being used in a query, and is computed using the cardinality of the data that the schema represents. Qunit base expressions are generated by taking the top schema entities in descending order of queriability score. Each of the top k1 schema entities is then expanded to include its top k2 neighboring entities as a join, where k1 and k2 are tunable parameters. While this method is intuitive and self-contained within the database, there are many instances where using just the data as an arbiter for qunit derivation is suboptimal. Specifically, for underspecified queries such as “george clooney”, with the example schema in Fig. 2, creating a qunit for “person” would result in the inclusion of both the important movie “genre” table and the unimportant movie “location” table, since every movie has a genre and a location. However, while many users are interested in “george clooney” as an actor in romantic comedies, there is very little interest in the locations he has been filmed at.
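The top-k expansion just described can be sketched as follows; the toy schema entities, queriability scores, and parameter values below are invented for illustration, not taken from the paper's experiments.

```python
# Sketch of queriability-driven qunit derivation: take the k1 entities
# with the highest queriability, and join each with its k2 highest-
# scoring neighbors. All names and scores below are invented.

def derive_qunit_expressions(queriability, neighbors, k1, k2):
    top = sorted(queriability, key=queriability.get, reverse=True)[:k1]
    qunits = []
    for entity in top:
        joins = sorted(neighbors.get(entity, []),
                       key=lambda n: queriability.get(n, 0.0),
                       reverse=True)[:k2]
        qunits.append((entity, joins))
    return qunits

# Toy movie schema in the spirit of Fig. 2:
queriability = {"movie": 0.9, "person": 0.8, "cast": 0.6,
                "genre": 0.5, "location": 0.2}
neighbors = {"movie": ["cast", "genre", "location"],
             "person": ["cast"]}

print(derive_qunit_expressions(queriability, neighbors, k1=2, k2=2))
# [('movie', ['cast', 'genre']), ('person', ['cast'])]
```

Note that a purely cardinality-driven score would also pull in "location", which is exactly the failure mode described above.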

4.2  Using Query Logs

We use a query rollup strategy for query logs, inspired by the observation that keyword queries are inherently under-specified, and hence the qunit definition for an under-specified query is an aggregation of the qunit definitions of its specializations. For example, if the expected qunit for the query “george clooney” is a personality profile about the actor George Clooney, it can be constructed by considering the popular specialized variations of this query, such as “george clooney actor”, “george clooney movies”, and so on.

We begin by sampling the database for entities and looking them up in the search query log. The matching query log entries are then collected, along with the number of times they occur in the log (query frequency). For each query, we then map each recognized entity onto the schema, constructing simple join plans. The popular plan fragments are then considered for the qunit definitions.

As an example, consider the schema element “person.name”. Instances of this element, such as “george clooney” and “tom hanks”, are looked up in the query log, where we find search queries such as “george clooney actor”, “george clooney batman” and “tom hanks castaway”. Additionally, searches for the “movie.title” instances “batman” and “castaway”, and the “cast.role” instance “actor”, would also lead to the same three queries. Given these three queries, we have thus built an annotated set of schema links, where “person.name” links to “cast.role” once, and to “movie.title” twice. This suggests that the rollup of the qunit representing “person.name” should contain “movie.title” and “cast.role”, in that order.
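The bookkeeping in this rollup example reduces to a co-occurrence count over the typed query log; a minimal sketch, assuming the queries have already been matched to schema elements with names like person.name and movie.title (these names are illustrative):

```python
from collections import Counter

# Sketch of the query-rollup step: each log entry has already been
# mapped to the set of schema elements its tokens matched. The element
# names and the three toy queries follow the running example.

def rollup(typed_queries, head):
    """Count how often other schema elements co-occur with `head`;
    the descending count order defines the rollup."""
    links = Counter()
    for elements in typed_queries:
        if head in elements:
            links.update(e for e in elements if e != head)
    return [e for e, _ in links.most_common()]

typed_queries = [
    {"person.name", "cast.role"},     # "george clooney actor"
    {"person.name", "movie.title"},   # "george clooney batman"
    {"person.name", "movie.title"},   # "tom hanks castaway"
]

print(rollup(typed_queries, "person.name"))
# ['movie.title', 'cast.role'] -- movie.title linked twice, cast.role once
```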

4.3  Using External Evidence

There is a wealth of useful information available in the form of external evidence. We now propose a third method that uses such evidence to create qunits for the database. External evidence can take the form of existing “reports” – published results of queries to the database – or relevant web pages that present parts of the data. Such evidence is common in an enterprise setting, where reports and web pages may be published and exchanged even though the queries behind them are not. By considering each piece of evidence as a qunit instance, the goal is to learn qunit definitions.

In our running example, our movie search engine is aware of the large amount of relevant, organized information on the Internet. Movie information from sources such as Wikipedia is well organized and popular. Since the actual information in Wikipedia has a great overlap with that in IMDb, our goal is to learn the organization of this overlapping data from Wikipedia.

By using methods similar to query rollup, we use records in the database to identify entities in documents. We then compute “signatures” for each web page, utilizing the DOM tree and the frequency of each occurrence. For the cast page of a movie on Wikipedia, the type signature would suggest using the movie title as a label field, followed by a “foreach” over the cast members, based on the relative cardinality in the signature and the number of tuples generated in our qunit base expression. By aggregating the type signatures over a collection of pages, we can infer the appropriate qunit definition.
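One simple way to aggregate such signatures is to treat each page as a multiset of recognized schema types: a type occurring roughly once per page behaves like a label, while a type that repeats suggests a "foreach". A sketch under that assumption, with invented type names and an invented threshold:

```python
from collections import Counter

# Sketch of qunit inference from page signatures: each page is reduced
# to the multiset of schema types its recognized entities map to. Type
# names and the 1.5 threshold are invented assumptions.

def infer_definition(pages):
    """Types averaging ~1 occurrence per page act as label fields;
    types that repeat per page go into a "foreach" group."""
    totals = Counter()
    for page in pages:
        totals.update(page)
    n = len(pages)
    label = [t for t in totals if totals[t] / n <= 1.5]
    foreach = [t for t in totals if totals[t] / n > 1.5]
    return label, foreach

# Two toy Wikipedia-style cast pages: one movie title, many cast names.
pages = [
    ["movie.title"] + ["person.name"] * 20,
    ["movie.title"] + ["person.name"] * 35,
]
print(infer_definition(pages))
# (['movie.title'], ['person.name'])
```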

5  Experiments and Evaluation

In this section, we begin with a brief pre-experiment exploring the nature of keyword searches that real users pose against a structured database. The following subsections describe the use of a real-world query log to evaluate the efficacy of qunit-based methods.

5.1  Understanding Search

The Internet Movie Database (IMDb) is a well-known repository of movie-related information on the Internet. To gain a deeper understanding of search behavior, we performed a user study with five users, all familiar with IMDb and all with a moderate interest in movies. The subjects had a large variance in knowledge about databases: two were graduate students specializing in databases, while the other three were non-computer-science majors. The subjects were asked to consider a hypothetical movie database that could answer all movie-related queries. Given this, the users were asked to come up with five “information needs”, and the corresponding keyword queries that they would use to query the database.

As expected, nearly all our subjects look for the summary page of a movie. But this information need is expressed in five different ways by the users. Similarly, the cast of a movie and finding connections between two actors are also common interests. Once again these are expressed in many different ways. Conversely, a query that only specifies the title of the movie may be specified on account of four different information needs (first column of the table). Similarly, users may specify an actor’s name when they mean to look for either of two different pieces of information – the actor’s filmography, or information about co-actors. Clearly, there exists a many-to-many relationship between information need and queries.

info. need       | [title] | [title] box office | [actor] | [award] [year] | [actor] [actor] | [genre] | [title] ost | [title] cast | [title] [freetext] | movie [freetext] | [title] year | [title] posters | [title] plot | don't know
movie summary    | a,c     |                    |         |                |                 |         |             |              | a                  | d                | b            |                 | d            |
cast             | e       |                    |         |                |                 |         |             | c,d          |                    |                  |              |                 |              |
filmography      |         |                    | e,a     |                |                 |         |             |              |                    |                  |              |                 |              |
coactorship      |         |                    | a,c     |                | b               |         |             |              |                    |                  |              |                 |              |
posters          |         |                    |         |                |                 |         |             |              |                    |                  |              | b               |              |
related movies   | e       |                    |         |                |                 |         |             |              |                    |                  |              |                 |              |
awards           |         |                    |         | a              |                 |         |             |              |                    |                  |              |                 |              |
movies of period |         |                    |         |                |                 |         |             |              |                    |                  |              |                 |              | b
charts / lists   |         |                    |         |                |                 | e       |             |              | d                  |                  |              |                 |              |
recommendations  |         |                    |         |                |                 |         |             |              |                    | b                |              |                 |              | e
soundtracks      |         |                    |         |                |                 |         | c           |              |                    |                  |              |                 |              |
box office       |         | d                  |         |                |                 |         |             |              |                    |                  |              |                 |              |
Table 1: Information Needs vs Keyword Queries. Five users (a, b, c, d, e) were each asked for their movie-related information needs, and what queries they would use to search for them.

Another key observation is that 10 of the 25 queries here are single-entity queries, 8 of which are underspecified – the query could be written better by adding additional predicates. This happened even though we reassured users that we could answer all movie-related queries.

The results of our interviews are displayed in Table 1. Each row in this table is an information need suggested by one or more users. Each column is the query structure the user thought to use to obtain an answer for this information need. The users themselves stated specific examples; what we present is the conceptual abstraction. For example, if a user said they would query for “star wars cast”, we abstract it to the query type [title] cast. Notice that the unmatched portion of the query (cast) is still relevant to the schema structure and is hence considered an attribute. Conversely, users often issue queries with words that are non-structural details about the result, such as “movie space transponders”. We consider such words free-form text in our query analysis.

Entries in this matrix identify data points from individual subjects in our experiment. Note that some users came up with multiple queries to satisfy the same information need, and hence are entered more than once in the corresponding rows.

5.2  Movie Querylog Benchmark

To construct a typical workload, we use a real-world dataset of web search engine query logs [26] spanning 650K users and 20M queries. All query strings are first aggregated to combine all identities into a single anonymous crowd, and only queries that resulted in a navigation to the IMDb domain are considered, resulting in 98,549 queries, of which 46,901 are unique. We consider this to be our base query log for the IMDb dataset. Additionally, we were able to identify movie-related terms in about 93% of the unique queries (calculated by sampling).

We then construct a benchmark query log by first classifying the base query log into various types. Tokens in the query log are first replaced with schema “types” by looking for the largest possible string overlaps with entities in the database. This leaves us with typed templates, such as “[name] movies” for “george clooney movies”. We then randomly pick two queries matching each of the 14 most frequent templates, giving us 28 queries that we use as a workload for qualitative assessment.
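The templating step can be sketched as a greedy longest-match replacement of entity strings by their schema types; the tiny entity table below is an invented stand-in for the IMDb data:

```python
from collections import Counter

# Sketch of benchmark construction: rewrite each query by replacing the
# largest entity-string overlap with its schema type, then count the
# resulting templates. The entity table and log are toy assumptions.

entities = {"george clooney": "[name]", "tom hanks": "[name]",
            "terminator": "[title]"}

def to_template(query):
    # Try longer entity strings first so the largest overlap wins.
    for ent in sorted(entities, key=len, reverse=True):
        if ent in query:
            query = query.replace(ent, entities[ent])
    return query

log = ["george clooney movies", "tom hanks movies", "terminator cast"]
templates = Counter(to_template(q) for q in log)
print(templates.most_common())
# [('[name] movies', 2), ('[title] cast', 1)]
```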

We observed that our dataset reflects properties consistent with previous reports on query logs. At least 36% of the distinct queries to the search engine were “single entity queries” that were just the name of an actor, or the title of a movie, while 20% were “entity attribute queries”, such as “terminator cast”. Approximately 2% of the queries contained more than one entity such as “angelina jolie tombraider”, while less than 2% of the queries contained a complex query structure involving aggregate functions such as “highest box office revenue”.

5.3  Evaluating Result Quality

The result quality of a search system is measured by its ability to satisfy a user’s information need. This metric is subjective due to the diversity of user intent, and cannot be evaluated against a single hand-crafted gold standard. We therefore conducted a result relevance study, using a real-world search query log as described in the following subsection, against the Internet Movie Database. We asked 20 users to compare the results returned by each search algorithm for 25 different search queries, rating each result between 0 (result is wrong or irrelevant) and 1 (result is correct and relevant).

For our experiments, we created a survey using 25 of the 28 queries from the movie querylog benchmark. The workload generated using the query log is first issued against each of the competing algorithms and their results are collected. For the algorithms described in Sec. 4, we implement a prototype in Java, using the database (converted using IMDbPy into 15 tables and 34M tuples) and the base query log as our derivation data. To avoid the influence of presentation format on our results, all information was converted by hand into a paragraph of simplified natural English with short phrases. To remove bias, the phrases were collated from two independent sources. 20 users were then sourced using the Amazon Mechanical Turk service, all moderate to advanced users of search engines with moderate to high interest in movies. Users were first primed with a sample “information need” and “query” combination: “need to find out more about julio iglesias” being the need, and “julio iglesias” being the search query term. Users were then presented with a set of possible answers from a search engine, and were asked to rate each answer with one of the options listed in Table 2.

score | option
0   | provides incorrect information
0   | provides no information above the query
0.5 | provides correct, but incomplete information
0.5 | provides correct, but excessive information
1.0 | provides correct information
Table 2: Survey Options

Users were then asked to repeat this task for the 25 search queries mentioned above. The table also shows the score we internally assigned to each option. If the answer is incorrect or uninformative, it should obviously be scored 0; if it is the correct answer, it should be scored 1. Where an answer is partially correct (incomplete or excessive), it deserves a score between 0 and 1 depending on how correct it is; we use the midpoint value of 0.5.
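The scoring then amounts to mapping each Table 2 option to its score and averaging per algorithm; a minimal sketch with invented shorthand labels and ratings:

```python
# Sketch of the relevance scoring: map each Table 2 option to its score
# and average per algorithm. The option labels and the sample ratings
# below are shorthand inventions for illustration.

SCORES = {"incorrect": 0.0, "no information": 0.0,
          "incomplete": 0.5, "excessive": 0.5, "correct": 1.0}

def average_relevance(ratings):
    """ratings: one Table 2 option label per (user, query) judgment."""
    return sum(SCORES[r] for r in ratings) / len(ratings)

print(average_relevance(["correct", "incomplete", "incorrect", "correct"]))
# 0.625
```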

To provide an objective example of a qunit-based system, we utilize the structure and layout of the IMDb website as an expert-determined qunit set. Each page on the website is considered a unique qunit instance, identified by a unique URL format. A list of qunits is generated by performing a breadth-first crawl of 100,000 pages of the website, starting at the homepage, and clustering the different types of URLs. Qunit definitions were then created by hand for each type of URL, and queried against the test workload. Users were observed to be in agreement with each other, with a third of the questions having an 80% or higher majority for the winning answer.
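The URL clustering step can be approximated by normalizing identifier-like path segments to a wildcard, so that each resulting pattern corresponds to one candidate qunit type; the URL shapes below are invented for illustration:

```python
import re
from collections import Counter

# Sketch of clustering crawled URLs by format: replace id-like path
# segments with "*" and bucket URLs by the resulting pattern. The URL
# shapes and the id regex are assumptions for the example.

def url_pattern(url):
    path = url.split("://", 1)[-1].split("/", 1)[-1]
    return "/".join("*" if re.fullmatch(r"[a-z]{2}\d+|\d+", seg) else seg
                    for seg in path.split("/"))

urls = ["http://example.org/title/tt0478087/",
        "http://example.org/title/tt0367882/",
        "http://example.org/name/nm0000123/"]
clusters = Counter(url_pattern(u) for u in urls)
print(clusters.most_common())
# [('title/*/', 2), ('name/*/', 1)]
```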

We now compare the performance of currently available approaches against the qunits derived in Sec. 4, using a prototype based on the ideas from Sec. 3. To do this, we first ran all the queries on the BANKS [3] online demonstration. A crawl of the website was converted to XML to retrieve the LCA (Lowest Common Ancestor) and the MLCA [20] (Meaningful Lowest Common Ancestor). The MLCA operator is unique in that it ensures that the LCA derived is unique to the combination of queried nodes that connect to it, improving result relevance. In addition to these algorithms, we also include a data point for the theoretical maximum performance in keyword search, where the user rates every search result from an algorithm as a perfect match.

Results are presented in Fig. 3 by considering the average relevance score for each algorithm across the query workload. As we can see, we are still quite far away from reaching the theoretical maximum for result quality. Yet, qunit-based querying clearly outperforms existing methods.

6  Related Work

We review several bodies of relevant prior work in this section.

Previous efforts towards solving keyword search for databases have concentrated predominantly on ranking. Initial work such as BANKS [3, 18] and DBXplorer [1] uses inverted indices on databases to return tuple joins in the form of spanning trees of matched query words, and discusses efficient methods of deriving such trees. In the case of XML, one strategy is to identify the smallest element that contains some or all keywords [10], or the smallest element that is meaningful [20]. Later work [2] in this area concentrates further on the ranking aspects of the problem, using a metric of authority transfer on a data graph to improve result quality. While these methods improve ranked results, the composition of the results themselves is not explored.

Figure 3: Comparing Result Quality against Traditional Methods : “Human” indicates hand-generated qunits.

Work by Koutrika, Simitsis and Ioannidis on Précis [28] receives special mention in this context. Précis responses to keyword queries are natural language compositions of “information directly related to the query selections” along with “information implicitly related to them in various ways”. Results are not intended to be complete, and instead concentrate on being “comprehensive”, akin to the idea of providing users with the most probable next browse-points in a browsable database. Their contributions include a rich data model to express facts for the sake of representation, and semi-automated methods to infer these structures by annotating a schema graph with weights, along with detailed analysis of execution time and user satisfaction. We embrace and extend some of the methods involved, as described in Sec. 3.

A major challenge for Précis, and other keyword query systems for databases, is that new database-specific ranking algorithms have to be invented. These are often very complex, and yet tend not to match the quality of ranking algorithms in IR, where ranking has been studied much more intensively and for much longer. With the qunits proposal, our primary goal is to satisfy the user’s immediate information need. We address this problem by separating the problem of qunit identification from the problem of query evaluation and ranking. Our aim is to answer the user’s query intent with an atomic unit of information, not to guide the user to the relevant sections of the database. Unlike Précis results, qunit results are independent of data models. The qunits paradigm first partitions the database into actual pieces of information called qunits, making it amenable to standard IR techniques.

Our solution to this problem can be considered a method of organizing information so that it is more amenable to a user’s information needs. Work has been done on organizing information along individual perspectives using faceted search, where each “facet” is a distinct attribute that users can use to search and browse records. Design considerations for building faceted search systems are discussed in [6, 12]; [31] approaches the faceted search problem by picking hierarchies based on user search patterns, and [27] surveys many techniques of dynamic taxonomies and combines them. However, faceted search is best suited to homogeneous data with multiple attribute constraints, not to a database with many different kinds of information in it.

Work has also been done on building search mechanisms that are aware of the application logic layer above the database. [5] utilizes the architecture patterns in application logic to better index dynamic web applications, while [9] exploits exposed web service interfaces to provide a unified concept-search interface.

Applications of database-backed query answering are slowly appearing in mainstream search engines. These services have been successful at implementing template-based search, where a keyword query is identified and rerouted as a structured query to a database, answering questions such as “what is the population of china”. With qunits, we propose to take this idea front and center, and make it the central paradigm for all search.

The management of complex databases has also been studied in the context of schema summarization [33], using characteristics of the schema. The resulting schema clusters can then be queried [34] with low overhead. However, summarizing a schema does not take into account the nature of the query workload or the multiple ways to organize data. A related body of research on view selection [11, 24] attempts to select the set of views that minimizes the query execution time of a workload given resource costs, but does not consider the utility of the information provided by each view.

The broader concept of making databases more accessible using a presentation data model has been discussed by Jagadish et al. [14]. Work on exposing the database query interface through automatically generated forms is discussed in [16, 17], while [25] uses autocompletion to provide the user with database-specific information. However, neither method considers the option of using clues from outside the database to improve the query process.

7  Conclusion

Keyword queries against structured data are a recognized hard problem. Many researchers in both the database and information retrieval communities have attacked this problem and made considerable progress. Nonetheless, an incompatibility between the two paradigms has remained: IR techniques are not designed to deal with structured, inter-linked data, and database techniques are not designed to produce results for queries that are under-specified and vague. This paper bridges this gap through the concept of a qunit.

Additionally, this paradigm allows both techniques to co-exist under the same ranking model. Search engines have started to provide interesting ways [8, 32] to simultaneously search over a heterogeneous collection of documents and structured data, with interesting implications for search relevance and ranking. Qunits provide a clean approach to this problem.

For the keyword query front end, the structured database is nothing more than a collection of independent qunits, so standard information retrieval techniques can be applied to choose the appropriate qunits, rank them, and return them to the user. For the structured database back end, each qunit is nothing more than a view definition, with specific instance tuples in the view computed on demand; there is no issue of underspecified or vague queries.

In this paper, we presented multiple techniques to derive the qunits for a structured database and determine their utility for various query types; we presented an algorithm to evaluate keyword queries against such a database of qunits, based on typifying the query; we experimentally evaluated these algorithms, and showed that users find qunit-based query results substantially superior to results from the best keyword-based database query systems available today.

We believe that the concept of a qunit is powerful, and that this is merely a first paper introducing this idea. In future work, we expect to be able to substantially improve upon the qunit finding and utility assignment algorithms presented above; we expect to deal with qunit evolution over time as user interests mutate during the life of a database system; we expect to extend qunit notions to databases with substantial mixed text content and to use IR techniques to query the text content in conjunction with the database structure.


S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: a system for keyword-based search over relational databases. Proc. ICDE, pages 5–16, 2002.
A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. Proc. VLDB, 4:564–575, 2004.
G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. Proc. ICDE, pages 431–440, 2002.
A. Doan and A. Halevy. Semantic integration research in the database community: A brief survey. AI Magazine, 26(1):83–94, 2005.
C. Duda, D. Graf, and D. Kossmann. Predicate-based Indexing of Enterprise Web Applications, 2007.
J. English, M. Hearst, R. Sinha, K. Swearingen, and K. Yee. Hierarchical faceted metadata in site search interfaces. Conference on Human Factors in Computing Systems, pages 628–639, 2002.
R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity Search in Databases. Proc. VLDB, pages 26–37, 1998.
Google, Inc. Google Universal Search.
J. Graupmann and G. Weikum. The Role of Web Services in Information Search. IEEE Data Engineering Bulletin, 25(4):60–65, 2002.
L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: ranked keyword search over XML documents. Proc. ACM SIGMOD, pages 16–27, 2003.
H. Gupta. Selection of Views to Materialize in a Data Warehouse. Proc. ICDT, 1997.
M. Hearst. Design Recommendations for Hierarchical Faceted Search Interfaces. SIGIR Workshop on Faceted Search, 2006.
V. Hristidis and Y. Papakonstantinou. Discover: keyword search in relational databases. Proc. VLDB, pages 670–681, 2002.
H. V. Jagadish, A. Chapman, A. Elkiss, M. Jayapandian, Y. Li, A. Nandi, and C. Yu. Making database systems usable. Proc. ACM SIGMOD, pages 13–24, 2007.
M. Jayapandian and H. V. Jagadish. Automated Creation of a Forms-based Database Query Interface. Proc. VLDB, 2008.
M. Jayapandian and H. V. Jagadish. Automating the Design and Construction of Query Forms. Proc. ICDE, 2008.
M. Jayapandian and H. V. Jagadish. Expressive Query Specification through Form Customization. Proc. EDBT, 2008.
V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. Proc. VLDB, pages 505–516, 2005.
T. Lau and E. Horvitz. Patterns of Search: Analyzing and Modeling Web Query Refinement. Intl. Conf. on User Modeling, 1999.
Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. Proc. VLDB, 2004.
L. Lim, H. Wang, and M. Wang. Unifying data and domain knowledge using virtual views. Proc. VLDB, pages 255–266, 2007.
Z. Liu and Y. Chen. Identifying meaningful return information for XML keyword search. Proc. SIGMOD, pages 329–340, 2007.
M. Miah, G. Das, V. Hristidis, and H. Mannila. Standing Out in a Crowd: Selecting Attributes for Maximum Visibility. Proc. ICDE, pages 356–365, 2008.
H. Mistry, P. Roy, S. Sudarshan, and K. Ramamritham. Materialized view selection and maintenance using multi-query optimization. ACM SIGMOD Record, 30(2):307–318, 2001.
A. Nandi and H. V. Jagadish. Assisted querying using instant-response interfaces. Proc. ACM SIGMOD, pages 1156–1158, 2007.
G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. Proc. Infoscale, 2006.
G. Sacco. Some Research Results in Dynamic Taxonomy and Faceted Search Systems. SIGIR Workshop on Faceted Search, 2006.
A. Simitsis, G. Koutrika, and Y. Ioannidis. Précis: from unstructured keywords as queries to structured databases as answers. VLDB Journal, pages 117–149, 2008.
R. Song, Z. Luo, J. Wen, Y. Yu, and H. Hon. Identifying ambiguous queries in web search. In Proc. WWW, pages 1169–1170. ACM Press New York, NY, USA, 2007.
B. Tan and F. Peng. Unsupervised query segmentation using generative language models and Wikipedia. Proc. WWW, 2008.
D. Tunkelang. Dynamic category sets: An approach for faceted search. SIGIR Workshop on Faceted Search, 2006.
Yahoo! Inc. Yahoo Glue.
C. Yu and H. V. Jagadish. Schema summarization. Proc. VLDB, pages 319–330, 2006.
C. Yu and H. V. Jagadish. Querying Complex Structured Databases. Proc. VLDB, 2007.
C. Yu and L. Popa. Semantic adaptation of schema mappings when schemas evolve. Proc. VLDB, 2005.

Supported in part by NSF grants IIS-0438909 and IIS-0741620, and by a Yahoo! Research Fellowship.



Captcha module for Drupal


Enables a CAPTCHA for the comment and user registration forms in Drupal. Users are asked to type in the word shown in an image, and are allowed to proceed if the code entered is correct.

Project Homepage:


GNU General Public License

Similar Entries module for Drupal


A module that displays a block with the 5 most similar nodes to the one currently being viewed, based on the title and body fields.

This module uses MySQL’s FULLTEXT indexing, and requires you to add a fulltext index on the node table, like this:

ALTER TABLE node ADD FULLTEXT(title, body);

statutory disclaimer: This module works for me, and I’m sharing it with you. If you shoot yourself in the foot with it and burn down your house in the process, it’s not my problem.


GNU General Public License