From charlesreid1

Whoosh: How It Works

Because the task of building a search engine is extremely broad in scope and can cover a reall ywide range of things, Whoosh is highly configurable. Unfortunately that means it's often hard to understand what the documentation is talking about because it stays at too high a level.

This page collects notes on a Whoosh setup to create a search index for something nice and concrete: a pile of Markdown files.


How Search Engines Work

Let's take a step back and talk about how Whoosh works at a high level.

A search engine takes a given document and determines all the terms that a user might enter into a search engine to find that document. This search term, like "tuba", can be turned into a number, like 001010111010110101010101010101010101, in a way that the computer can do very fast and very consistently, and the search result of that term, "tuba", can be stored in a lookup table, or a slot labeled with that number. Inside of that slot we store references to any places where that term occurred.

This is a data structure known as a hash map. It has incredible properties, and without it, computing as we know it would be a miserable failure.

The magic of the hash map comes from the hash function, a one-way function with some peculiar statistical properties. The hash function is what can turn a string into a number/address.

Simple Example

To start with a super simple search engine, documents might be indexed by the search engine using only a set of tags, so that search engine users would only be able to match words that were added as a tag. If I add the tags "music" and "tuba" to the document "docs/music/instruments/tuba.md", and I only implement a keyword-based search engine, then there are only two user inputs that will lead to that document being returned as a result: "music" and "tuba".

The search engine has a way to turn that search term into a number, and turn that number into a really fast lookup table, so that if we have predicted that the user might search for "music", and they do, then we have the result ready instantly in that lookup table. That's the hash function.

Because the number of tags that will occur is bound to be finite, such a lookup table is possible to build.

Complications

There are already complications about whether the tags are case-sensitive or multi-word, or whether certain kinds of punctuation are allowed or not, or whether there are length restrictions or if you can use a whole sentence as a tag.

This is where Whoosh not making any decisions for you can be useful, because it provides ways for you to make various decisions about these things (most of which you probably did not think of).

Realistic Example: Words and Grams

Considering a more realistic example than the tag search engine, consider a document search engine, where we are interested in matching words and phrases. Then we need to build an index of words occurring in the document, but also of n-grams (n words in a row). There are approx. as many n-grams as there are words in the document, for small n.

The procedure from above is repeated: the location of each occurrence of each word is added to the hash map (input is the word the user may search for, output is where we store the locations it occurred).

We then do that with bi-grams, two word combinations, likewise feeding the bi-gram to the hash function. And so on with tri-grams, and 4-grams, up to the maximum the user wishes.

Like the number of words in a document, the number of n-grams that actually appear in a corpus of documents will be far smaller than the number of possible n-grams, making this approach storage-efficient.


Flags