Full-Text Search – Word Breakers and Stemmers
There are numerous components in the Full-Text Search (FTS) subsystem in SQL Server that help provide efficient, relevant answers to queries. Full-Text Search is a little complex, and as I’ve been working with the system in an effort to learn more about it, I decided to document how a few things work.
Word breakers and stemmers are two interesting parts of the FTS system. They handle language-specific operations that help searches work better. They are related in that the two are loaded together if you use third-party word breakers and stemmers.
I haven’t seen third-party word breakers, but some from other products work. As an example, here’s a post on loading the Greek word breaker and stemmer from SharePoint Server if you are on SQL Server 2008. The Greek word breaker is included in SQL Server 2012.
Let’s start with word breakers, which do just what the term implies: they break text into words. It would seem obvious that spaces are the word boundaries, and they are in English, but not necessarily in all languages. Asian languages like Japanese and Chinese are a good example: they are written without spaces between words, so you can’t count on spaces as the word boundaries in those languages.
Word breakers use the lexical rules of the language to determine word boundaries. Essentially they find what the words are, and then further action can be taken in building the FTS index or processing the query.
The “words” that the word breaker spits out are seen as “tokens” by the FTS index, and each can then be processed by stemmers, stoplists, the thesaurus, etc.
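To make the word boundary problem concrete, here is a small Python sketch. It is only an illustration and not how SQL Server's word breakers are implemented: `naive_break` is a made-up helper that splits on anything that isn't a letter or digit, which is roughly the "spaces are boundaries" assumption.

```python
import re

def naive_break(text):
    """Toy word breaker (hypothetical): lowercase the text, then split on
    any run of characters that are not letters, digits, or underscores."""
    return [token for token in re.split(r"[^\w]+", text.lower()) if token]

english = "The quick brown fox runs."
japanese = "私は毎日走ります"  # "I run every day", written without spaces

print(naive_break(english))        # ['the', 'quick', 'brown', 'fox', 'runs']
print(len(naive_break(japanese)))  # 1: the whole sentence comes back as one "word"
```

The English sentence splits cleanly into tokens, while the Japanese sentence comes back as a single token. That is the gap a language-aware word breaker fills by applying the lexical rules of the language.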
Stemmers are an interesting part of the full-text search system. They remind me of my high school Latin classes, where we had to conjugate words. A stemmer takes a word and generates inflectional forms, or conjugations. The example in Books Online, and an easy one to understand, is “run”. There are various forms of “run” that we would want to consider as equivalent when performing a search. For example, you would want to consider:
- runner (perhaps)
The same could be said for “lay”. That would generate
This is one of the big advantages over the LIKE predicate: stemmers let a search match these other forms of the word being searched for. In effect, all of these are related back to the core, base word.
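Here is a rough Python sketch of that difference. Everything in it is an assumption for illustration: the `INFLECTIONS` table is hand-written (a real stemmer derives the forms from the rules of the language), `like_match` only approximates a LIKE '%term%' pattern, and none of this reflects SQL Server internals.

```python
# Hypothetical, hand-written table of inflectional forms for illustration only.
INFLECTIONS = {
    "run": {"run", "runs", "running", "ran"},
}

def like_match(term, document):
    """Rough analogue of LIKE '%term%': a plain substring test."""
    return term in document.lower()

def stemmed_match(term, document):
    """Match if any whitespace-separated token is a known form of the term."""
    forms = INFLECTIONS.get(term, {term})
    return any(token in forms for token in document.lower().split())

print(like_match("run", "She ran home yesterday"))     # False: "ran" does not contain "run"
print(stemmed_match("run", "She ran home yesterday"))  # True: "ran" is a form of "run"
print(like_match("run", "Prune the tree"))             # True: false hit inside "prune"
print(stemmed_match("run", "Prune the tree"))          # False
```

The substring test misses “ran” entirely and also gets a false hit on “prune”, while the lookup against the inflectional forms handles both cases.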
The Books Online page for Word Breakers and Stemmers has technical information on checking what’s installed, some troubleshooting, language settings, and some drier documentation on what you can do with word breakers, but not a lot of explanatory detail.
I used a little of the information in Books Online, and some from Pro Full-Text Search in SQL Server 2008. You can read more, but unfortunately I haven’t found a lot more documentation on the details of how things work.
You would probably learn more if you wrote your own word breaker and stemmer, and there is a sample in the Windows SDK to get you started, but that’s beyond what I want to do.