Let us consider a situation (and a possible way of approaching this problem) in which when a user enters the first few letters of a search query, he/she is presented with some suggestions that have as their prefix, the string that the user has typed. Furthermore, these suggestions should be ordered by some score that is associated with each such suggestion.
Our first attempt at solving this would probably involve keeping the initial list of suggestions sorted in lexicographic order so that a simple binary search can give us the 2 ends of the list of strings that serve as candidate suggestions. These are all the strings that have the user's search query as a prefix. We now need to sort all these candidates by their associated score in non-increasing order and return the first 6 (say). We will always return a very small subset (say 6) of the candidates because it is not feasible to show all candidates since the user's screen is of bounded size and we don't want to overload the user with too many options. The user will get better suggestions as he/she types in more letters into the query input box.
We immediately notice that if the candidate list (for small query prefixes say of length 3) is large (a few thousand), then we will be spending a lot of time sorting these candidates by their associated score. The cost of sorting is O(n log n) since the candidate list may be as large as the original list in the worst case. Hence, this is the total cost of the approch. Apache's solr uses this approach. Even if we keep the scores bound within a certain range and use bucket sort, the cost is still going to be O(n). We should definitely try to do better than this.
One way of speeding things up is to use a Trie and store (pointers or references to) the top 6 suggestions at or below that node in the node itself. This idea is mentioned here. This results in O(m) query time, where m is the length of the prefix (or user's search query).
However, this results in too much wasted space because:
- Tries are wasteful of space and
- You need to store (pointers or references to) 6 suggestions at each node which results in a lot of redundancy of data
We can mitigate (1) by using Radix(or Patricia) Trees instead of Tries.
There are also other approaches to auto-completion such as prefix expansion that are using by systems such as redis. However, these approaches use up memory proportional to the square of the size of each suggestion (string). The easy way to get around this is to store all the suggestions as a linear string (buffer) and represent each suggestion as an (index,offset) pair into that buffer. For example, suppose you have the strings:
- hello world
- hell breaks lose
Then your buffer would look like this:
hello worldhell breaks losewhiteboardblackboard
And the 4 strings are actually represented as:
(0,11), (11,16), (27,10), (37,10)
Similarly, each prefix of the suggestion "whiteboard" is:
- w => (27,1)
- wh => (27,2)
- whi => (27,3)
- whit => (27,4)
- white => (27,5)
- whiteb => (27,6)
- and so on...
We can do better by using Segment (or Interval) Trees. The idea is to keep the suggestions sorted (as in approach-1), but have an additional data structure called a Segment Tree which allows us to perform range searches very quickly. You can query a range of elements in Segment Tree very efficiently. Typically queries such as min, max, sum, etc... on ranges in a segment tree can be answered in O(log n) where n is the number of leaf nodes in the Segment Tree. So, once we have the 2 ends of the candidate list, we perform a range search to get the element with the highest score in the list of candidates. Once we get this node, we insert this range (with the maximum score in that range as the key) into the priority queue. The top element in the queue is popped and split at the location where the element with the highest score occurs and the scores for the 2 resulting ranges are computed and pushed back into the priority queue. This continues till we have popped 6 elements from the priority queue. It is easy to see that we will have never considered more than 2k ranges (here k = 6).
Hence, the complexity of the whole process is the sum of:
- The complexity for the range calculation: O(log n) (omitting prefix match cost) and
- The complexity for a range search on a Segment Tree performed 2k times: O(2k log n) (since the candidate list can be at most 'n' in length)
Giving a total complexity of:
O(log n) + O(2k log n)
which is: O(k log n)
Update (29th October, 2010): I have implemented the approach described above in the form of a stand-alone auto-complete server using Pyhton and Mongrel2.
Some links I found online:
LingPipe - Text Processing and Computationsal Linguistics Tool Kit
What is the best open source solution for implementing fast auto-complete?
Suggest Trees (SourceForge): Though these look promising, treaps in practice can degenerate into a linear list.
Autocomplete Data Structures
Why is the autocomplete for Quora so fast?
Autocomplete at Wikipedia
How to implement a simple auto-complete functionality? (Stack Overflow)
Auto Compete Server (google code)
Three Autocomplete implementations compared
Trie Data Structure for Autocomplete?
What is the best autocomplete/suggest algorithm,datastructure [C++/C] (Stack Overflow)
Efficient auto-complete with a ternary search tree