Skip to content

Full-Text Search – Stoplists in SQL Server

December 31, 2012

Full-text search is an interesting subsystem in SQL Server. It allows you to implement searches through a variety of text formats stored in SQL Server. This is a one of a series of posts that looks at different facets of full-text search.

What is a Stoplist?

A stoplist is a list of stopwords that SQL Server should not include in a full-text index.  These are words that are seen as not adding any value to the full-text index. We typically see these words as important in language for structure, but not for content. Examples of stopwords are:

  • the
  • a
  • an
  • is
  • are

In previous versions of SQL Server, these were also known as noise words and a noise word list. You can read about the topic in Books Online.

How are they used?

Stoplists are used when building the index. The words that are contained in the text, and also in the stoplist are ignored and not populated inside the index. This makes for a smaller index, and it also means that the stopwords are not

The position of these words in the text being indexed, however, still do matter. This is to be sure that searches using proximity are still correctly carried out.

Each index can have a specific stoplist associated with it. You can associate a stoplist at index creation time, or alter the index later to add or change the stoplist.

Creating a Stoplist

For each language supported in the full-text system, there is a stoplist installed with SQL Server. These are the commonly used words that should be ignored for each language.

You can, however, create your own stoplist of word with the CREATE FULLTEXT STOPLIST command. The creation can be for a new stoplist, or you can copy an existing stoplist.This includes system stoplists, which you can use as a basis for your custom stoplist. The commands are simple, and they are well documented in BOL.

To add or remove words from a stoplist, the ALTER FULLTEXT STOPLIST command is used with the ADD or DROP parameters. Alterations to a stoplist must be for a specific language, which is specified with the LCID or name of the language.

Practical Points

The stoplists are important for limiting the size of the stoplist and making a more efficient index. Full-text indexes are very efficient and scalable in SQL Server, but the less data that needs to be indexed and searched, the most efficient the system will operate.

System stoplists works well for many natural language searches, but are not necessarily adequate for domain specific searches. For example, if I were indexing all white papers on SQL Server, I might want to ignore extremely common words or phrases that are in all documents. For example, I might consider “SQL” to be so common as to be useless in searches. Rather than bloat the size of the index with this word, I may add this to a stoplist for the full-text index and assume it’s a word like “the”, which I would not use for searches of these documents.

I haven’t necessarily found a reason to use custom stoplists in the past, but if my full-text index were extremely large or I had a large volume of searches, I might consider using stoplists to prune down my indexes.

If you have used these in your system, I’d be interested in knowing the reasons and effects.

About these ads

From → Blog

4 Comments
  1. It can also be useful to remove words from StopLists. For example, in our organisation we commonly search CVs (or résumés) for technical skills. By removing the “word” C we are allowed to search for C# or C++ or even C itself. An equivalent scenario might present itself if one wanted to search for (say) A-level grades.

    The downside to this is, of course, that the full-text catalog will be larger.

    Pete Moore
    IT Director
    Evolution Recruitment Solutions

    • That’s a great point. Hadn’t considered it, but there definitely are cases where removal makes sense.

      I’m not sure the size is a big consideration these days, but it’s all relative. Disk is cheap and the index is much more efficient than in the past.

  2. I did a Proof-of-Concept using FTS to act as a “Matching Engine” so the use of the custom Stoplist was very helpful since the index data had many words to be excluded that are not in the system stoplist like: publisher, music etc.

    We are using FTS to take incoming data and find matches in our database. We index over 30 million rows of data and we are getting approx. 6 matches per second so the speed is quite good.

    FTS is a gift.

    • That’s good to hear. 30mm isn’t that large these days. I saw some tests from MS saying they’d been working at 300mm documents.

Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 4,640 other followers

%d bloggers like this: