In recent lectures, we’ve discussed several different data structures that can be used for the purposes of
searching. Each of these data structures is organized differently, and each has different performance
characteristics (measured in terms of time and memory usage). But they all share one goal: Given a key that
we’re searching for, find that key and any information that’s been associated with it.
Broadly, we say that a set is a kind of data structure that stores a collection of unique keys, allowing us to add
new keys, remove existing keys, and determine whether a particular key is stored. We say that a map is a
similar kind of data structure, except that we associate a value with each key, so that keys function as the tool
that lets us identify other information uniquely and find it easily. In lecture, we discussed several approaches to
solving the problems of implementing search structures like sets and maps.
We talked about using single-dimension search structures like arrays, vectors, and linked lists, but all had
significant limitations. Either they required linear time to search (by looking at every element in
sequence), or they could be searched more quickly (e.g., using binary search) but required linear time to
add or remove a key.
We discovered that binary search trees offer the ability to do searches, insertions, and removals in
logarithmic time, which is a significant improvement on the linear-time bound on at least some operations
on single-dimension search structures. But there was a catch: We only got logarithmic-time operations if
the tree stayed balanced, but it could fall out of balance in circumstances that weren’t necessarily rare in
practice (such as adding keys in ascending or descending order).
We found that all was not lost, because there are well-known algorithms for detecting and correcting
binary search trees that have fallen out of balance. In particular, we learned about a binary search tree
variant called AVL trees, which guarantee logarithmic-time searches, adds, and removals, by rearranging
the nodes in the tree whenever a significant imbalance is detected.
We explored skip lists, a very different solution that allows a variant of binary search to be performed on a
linked list. It employs randomness to achieve its goal of providing logarithmic-time searches, adds, and
removals, by storing some keys in two places, a few others in three places, a handful of others in four
places, and so on. While skip lists don’t actually guarantee logarithmic-time operations, if a good random
number generator is used, skip lists have an extremely high probability of performing their operations in
logarithmic-time, especially for large collections of keys.
Finally, we discussed how hash tables offer the possibility of not having to search. Every key has a place
where it “belongs,” so if we only ever store keys where they belong, we’ll never have to look anywhere
except there. Of course, in practice, it doesn’t turn out be quite as magical as that in practice, but it can
perform very well, provided that we spread keys around so not too many of them belong in any one
As we often see when there are multiple solutions to the same kind of problem, there are important tradeoffs
here. At first blush, hash tables with “good” hash functions appear to be the clear winner, as their search times
tend toward Θ(1). But it’s not always that simple. For example, you might not want to choose a hash table if you
don’t know enough about your keys to choose a suitable hash function. There’s another reason, too, why you
might prefer a balanced binary search tree or a skip list over a hash table: Since they impose ordering on their
keys, binary search trees and skip lists can be iterated in ascending order of their keys in linear time. Hash
tables can’t, since they impose no meaningful ordering on their keys.
Spell checkers, as a case study of search structures
In this project, you’ll build portions of a spell checker, similar to one you might find in a word processor, a text
editor, or an email client. A spell checker’s job is relatively simple: Given a list of correctly-spelled words (which
we’ll call a word set) and the input text, a spell checker reports all of the misspellings (i.e., words that are not
found in the word set) in the input. When a misspelling is found in the input, a good spell checker will also
suggest words appearing in the word set that are somewhat like each of the misspelled words, since there’s a
reasonably good chance that one of them will be the word the writer intended.
As an example, suppose that the word set contains the following words:
bake cake main rain vase
If the input text contains the word vake, a good spell checker will state that vake is misspelled, since it does not
appear in the word set, and will suggest that perhaps bake, cake, or vase was the word that was actually
intended, since all three of these words are reasonably similar to the word vake, differing by only one letter
each. (An even smarter spell checker might employ contextual clues to guess the word that is most appropriate
in the sentence that surrounds it, but this is beyond the scope of our work here.)
Of course, a spell checker’s task is centered around searching for words in a potentially large word set. An
efficient search structure is critical to the performance of the spell checker, since it will be necessary to search
not only for the words in the input, but possibly hundreds of potential suggestions that it might like to make for
each misspelling. As you will see, a poor technique can render a spell checker — or any system that requires a
large number of insertions and searches to be performed on a large search structure — effectively useless.