About Alice

This demo illustrates the different analyzers available in Azure Search and, by extension, in our Unearth product. The source code for this demo, and the code that created the underlying index, are on GitHub. Comments, corrections and questions are welcome there.

This is an ASP.NET Core 2 application using Razor Pages and field-scoped Azure Search queries. The source is a fun read (for a programmer). Hint: put your cursor over the purple word 'magic' below. The index contains three copies of the text of Lewis Carroll's book 'Alice in Wonderland', each copy indexed with a different analyzer.

Azure Search Analyzers

Analyzers convert text into tokens. When documents are indexed, the tokens are created and stored as terms in an inverted index. When a query is made against a field, the query text is converted into tokens by the same analyzer, and those tokens are used to look up that index. It can get a bit more complex than that, but that’s basically what happens.
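
The index-then-query flow can be sketched in a few lines of Python. This is a toy model only, not how Azure Search is implemented: `simple_analyzer`, `build_index` and `search` are hypothetical names, and the sample sentences are made up for illustration.

```python
import re
from collections import defaultdict


def simple_analyzer(text):
    """Toy analyzer: split into word-like tokens and lower-case them."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9']+", text)]


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyzer):
    """Analyze the query with the same analyzer, then look up each token."""
    hits = set()
    for token in analyzer(query):
        hits |= index.get(token, set())
    return sorted(hits)


# Illustrative sample documents (not the demo's real index).
docs = {
    1: "Alice was beginning to get very tired",
    2: "Alice's Adventures in Wonderland",
    3: "The Rabbit took a watch out of its pocket",
}
index = build_index(docs, simple_analyzer)
print(search(index, "ALICE", simple_analyzer))
```

Because this toy analyzer keeps "alice's" as a single token, a query for "ALICE" matches only document 1. That is the same kind of miss on possessives that an analyzer without possessive removal would show.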

There are three standard analyzer types. We can also create custom analyzers for specific field types or for bespoke processing.

Standard Lucene: breaks text into tokens following the Unicode Consortium text segmentation rules, then converts all characters to their lower-case form. We use this for data fields in formal documents.

Language-specific Lucene: extends the standard analyzer by adding language-specific transforms, usually including stemming. Azure Search currently provides 35 language-specific Lucene analyzers. The English Lucene analyzer removes possessives (trailing 's) from words, applies stemming using the Porter stemming algorithm, and removes English stop words.

Language-specific Microsoft: performs lemmatization instead of stemming to find the base form of words in the target language (the lemma). Azure Search provides 50 language-specific Microsoft analyzers, originally developed for Microsoft Office and Bing. These use Natural Language Processing (NLP) tools to create ‘better’ tokens. Lemmatizing involves the use of a dictionary/vocabulary and morphological analysis (parts of speech and context) of words. It does more work, so indexing takes longer than with a stemming analyzer. With Azure Search, however, we have effectively unlimited computational resources, so in Unearth we use the Microsoft language-specific analyzers for all fields that contain human language.
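
The difference between stemming and lemmatization can be illustrated with a deliberately crude sketch. Note the assumptions: `naive_stem` is a simple suffix stripper and is not the Porter algorithm, and `LEMMAS` is a tiny hand-written lookup table standing in for the dictionary and morphological analysis that a real lemmatizer uses.

```python
# Hand-written stand-in for a real lemma dictionary (illustrative entries only).
LEMMAS = {
    "knives": "knife",
    "thinking": "think",
    "thinks": "think",
    "thought": "think",
}


def naive_stem(word):
    """Strip a few common suffixes. A crude stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the toy dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("knives", "knife", "thinking", "thought"):
    print(f"{word}: stem={naive_stem(word)} lemma={toy_lemma(word)}")
```

In this sketch the stemmer reduces "knives" to "kniv" but leaves "knife" alone, so the two never meet in the index. The dictionary lookup maps both to "knife", and also connects the irregular form "thought" to "think", which suffix stripping cannot do.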

Custom Analyzers: We can use these to do magic. For example, we can use a custom analyzer to identify keywords in text as it is being tokenized, or use regular expressions to identify and decorate particularly interesting tokens. We may extend this demo to include a custom analyzer at a later date.

The demo shows the difference between the three standard analyzer types in Azure Search.
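
An index that holds the same text once per analyzer might be defined roughly as follows. This is a sketch rather than the demo's actual index definition: the index name and field names are hypothetical, while `standard.lucene`, `en.lucene` and `en.microsoft` are the analyzer names Azure Search uses for the three types described above.

```python
import json

# Hypothetical field names, one per analyzer under comparison.
ANALYZERS = {
    "text_standard": "standard.lucene",
    "text_en_lucene": "en.lucene",
    "text_en_microsoft": "en.microsoft",
}


def build_index_definition(index_name):
    """Build an index definition with one searchable field per analyzer."""
    fields = [
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False}
    ]
    for field_name, analyzer in ANALYZERS.items():
        fields.append(
            {
                "name": field_name,
                "type": "Edm.String",
                "searchable": True,
                "analyzer": analyzer,
            }
        )
    return {"name": index_name, "fields": fields}


definition = build_index_definition("alice-demo")  # hypothetical index name
print(json.dumps(definition, indent=2))
```

Each document would then carry the same passage in all three text fields, and a query scoped to one field is analyzed with that field's analyzer.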

Searching ‘Alice in Wonderland’

For ‘alice’:
Standard Lucene:  377 hits  - fails to find possessives (alice’s).
English Lucene:  387 hits  - stems to ‘alic’, removes possessives and searches for ‘alic’.
English Microsoft:  387 hits  - knows ‘alice’ is a noun, removes possessives, searches for ‘alice’.

For ‘thinking’:
Standard Lucene:   11 hits  - just looks for ‘thinking’.
English Lucene:   64 hits  - stems to ‘think’ and searches for ‘think’. Finds ‘thinking’, ‘thinks’, ‘think!’.
English Microsoft:  136 hits  - knows ‘thinking’ may be an adjective or a noun, and knows its closely related synonyms in each of its usages. Finds ‘think’, ‘think!’, ‘thinks’, ‘thinking’ and ‘thought’.

For ‘knives’:
Standard Lucene:   0 hits  - just looks for ‘knives’ which is not a word in the text.
English Lucene:   0 hits  - stems to ‘kniv’ and searches for ‘kniv’. Doesn't find anything.
English Microsoft:  3 hits  - knows ‘knives’ is the plural of 'knife'. Finds ‘knife’.
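
To see the tokens an analyzer produces for a given word, Azure Search offers an Analyze Text REST endpoint on each index. The sketch below only builds the request and does not send it. The service name, index name and key are placeholders, and the API version shown is an assumption that should be checked against the version your service supports.

```python
import json
import urllib.request


def build_analyze_request(service, index_name, api_key, text, analyzer,
                          api_version="2020-06-30"):
    """Build (but do not send) a POST request to the Analyze Text endpoint."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index_name}/analyze?api-version={api_version}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values: substitute a real service, index and key.
req = build_analyze_request(
    "my-service", "alice-demo", "<admin-key>", "knives", "en.microsoft"
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then parse the JSON response.
```

Sending the same word with each of the three analyzer names is a quick way to check which terms a query would actually look up.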