Analyzers convert text into tokens. When documents are indexed, the tokens are created and stored in the index as n-grams. When a query is made against an indexed field, the query text is converted into tokens by the same analyzer, and those tokens are used to look up the index. It can get a bit more complex than that, but that’s basically what happens.
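To make the index-time and query-time symmetry concrete, here is a minimal sketch in Python. It is not how Azure Search is implemented; the `analyze`, `build_index` and `search` functions and the sample sentences are illustrative assumptions, and the toy analyzer only lower-cases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer (assumption): lower-case the text and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Run the query through the same analyzer, then return docs containing every token."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Alice was beginning to get very tired",
    2: "The Rabbit took a watch out of its pocket",
}
index = build_index(docs)
matches = search(index, "ALICE")  # query case differs from the indexed text
```

Because the query goes through the same `analyze` function as the documents, ‘ALICE’ and ‘Alice’ end up as the same token and the lookup succeeds.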
There are three standard analyzer types. We can also create custom analyzers for specific field types or custom implementations.
Standard Lucene: breaks text into tokens following the Unicode Consortium text segmentation rules, then converts all characters to their lower-case form. We use this for data fields in formal documents.
Language-specific Lucene: extends the standard analyzer by adding language-specific transforms, usually including stemming. Azure Search currently provides 35 language-specific analyzers. The English Lucene analyzer removes possessives (trailing 's) from words, applies stemming using the Porter stemming algorithm, and removes English stop words.
Language-specific Microsoft: performs lemmatization instead of stemming to find the base form of words in the target language (the lemma). Azure Search provides 50 language-specific analyzers originally developed for Microsoft Office and Bing. These use Natural Language Processing (NLP) tools to create ‘better’ tokens. Lemmatizing involves the use of a dictionary/vocabulary and morphological analysis (parts of speech and context) of words. It does more work, so indexing takes longer than with a stemming analyzer. With Azure Search, however, we have effectively unlimited computational resources, so in Unearth we use the Microsoft language-specific analyzers for all fields that contain human language.
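The difference between the standard analyzer and a stemming analyzer can be sketched in a few lines of Python. This is a rough illustration only: `toy_stem` strips two suffixes and is not the Porter algorithm, the stop-word list is a tiny made-up sample, and the sentences are invented. Lemmatization is not modelled at all, since it needs a dictionary and part-of-speech analysis.

```python
import re

STOP_WORDS = {"a", "the", "it", "was"}  # tiny illustrative sample, not the real list
SUFFIXES = ("ing", "s")                 # toy suffix list, NOT the Porter algorithm

def standard_tokens(text):
    """Lower-case and split; an apostrophe stays inside the word ("alice's" is one token)."""
    return re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", text.lower())

def toy_stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def english_like_tokens(text):
    """Remove possessives, drop stop words, then stem each remaining token."""
    tokens = []
    for token in standard_tokens(text):
        token = re.sub(r"['’]s$", "", token)  # drop a trailing possessive
        if token in STOP_WORDS:
            continue
        tokens.append(toy_stem(token))
    return tokens

def count_hits(sentences, query, analyzer):
    """Count sentences that share at least one token with the analyzed query."""
    query_tokens = set(analyzer(query))
    return sum(1 for s in sentences if query_tokens & set(analyzer(s)))

# Hypothetical sample sentences.
sentences = [
    "Alice was thinking about the garden.",
    "Alice's sister thinks it was a dream.",
    "She thought about it for a while.",
]
standard_hits = count_hits(sentences, "thinking", standard_tokens)
stemmed_hits = count_hits(sentences, "thinking", english_like_tokens)
```

With these sample sentences the standard pipeline matches only the sentence containing ‘thinking’, while the stemming pipeline also matches ‘thinks’. Neither matches ‘thought’, which is the kind of relationship a lemmatizing analyzer is designed to pick up.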
Custom Analyzers: We can use these to do magic. If, for example, we know that a field in a particular document type contains part numbers, we could create a custom analyzer for that implementation that looks up the part numbers while they are being tokenized and stores both the part numbers and the part names in the index as n-grams. So, if you searched for ‘Widget’, you would get hits on ‘Big Widget’, ‘Little Widget’ and the part numbers for both.
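The part-number idea can be sketched conceptually as follows. This is plain Python rather than the Azure Search custom analyzer API, and the `PART_CATALOG` contents, the part-number format and the function names are all hypothetical.

```python
import re

# Hypothetical part catalog; a real implementation would query a parts database.
PART_CATALOG = {
    "pn-1001": "Big Widget",
    "pn-1002": "Little Widget",
}

def tokenize(text):
    """Lower-case and split, keeping hyphenated part numbers such as 'pn-1001' whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def part_number_analyzer(text):
    """Emit each token and, for known part numbers, the tokens of the part name too."""
    tokens = []
    for token in tokenize(text):
        tokens.append(token)
        name = PART_CATALOG.get(token)
        if name:
            tokens.extend(tokenize(name))
    return tokens

tokens = part_number_analyzer("Order PN-1001 and PN-1002")
```

A field indexed this way contains ‘widget’ as a token even though the original text only held part numbers, so a search for ‘Widget’ would match it.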
The demo shows the difference between the three standard analyzer types in Azure Search. We have indexed some well-known (now public domain) English language texts with the standard Lucene, English Lucene, and English Microsoft analyzers. Each search is performed on all three indexes and the results are presented in score order (best score first).
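As a rough sketch of how three such indexes might be declared, the Python below builds index definitions in the JSON shape used by the Azure Search REST API, differing only in the `analyzer` set on the searchable field. The index names and the `id` and `sentence` field names are assumptions for illustration, not the demo's actual schema; `standard.lucene`, `en.lucene` and `en.microsoft` are the analyzer names Azure Search uses for these three analyzers.

```python
import json

# Hypothetical index names mapped to Azure Search analyzer names.
ANALYZERS = {
    "demo-standard-lucene": "standard.lucene",
    "demo-english-lucene": "en.lucene",
    "demo-english-microsoft": "en.microsoft",
}

def index_definition(index_name, analyzer_name):
    """Build an index definition whose text field uses the given analyzer."""
    return {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            {
                "name": "sentence",  # hypothetical field holding one sentence of text
                "type": "Edm.String",
                "searchable": True,
                "analyzer": analyzer_name,
            },
        ],
    }

definitions = [index_definition(name, analyzer) for name, analyzer in ANALYZERS.items()]
print(json.dumps(definitions[0], indent=2))
```

Keeping everything except the analyzer identical is what lets the hit counts for the same query be compared across the three indexes.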
| Analyzer | Hits | Notes |
|---|---|---|
| Standard Lucene | 352 hits | Fails to find possessives (alice’s). |
| English Lucene | 360 hits | Stems to ‘alic’, removes possessives and searches for ‘alic’. |
| English Microsoft | 360 hits | Knows ‘alice’ is a noun, removes possessives, searches for ‘alice’. |

| Analyzer | Hits | Notes |
|---|---|---|
| Standard Lucene | 11 hits | Just looks for ‘thinking’. |
| English Lucene | 60 hits | Stems to ‘think’ and searches for ‘think’. Finds ‘thinking’, ‘thinks’, ‘think!’. |
| English Microsoft | 130 hits | Knows ‘thinking’ may be an adjective or a noun, and knows its closely related synonyms in each of its usages. Finds ‘think’, ‘think!’, ‘thinks’, ‘thinking’ and ‘thought’. |
Mouse over the ‘nn hits’ badges for a breakdown of the terms found. Note that the numbers won't always add up, because ‘hits’ means sentences found, and one sentence can contain multiple terms.
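The reason the term breakdown can exceed the hit count is easy to see in a small sketch. The function name, the sample sentences and the matched-term set below are made up for illustration and are not taken from the demo.

```python
import re
from collections import Counter

def term_breakdown(sentences, terms):
    """Return (number of sentences with a match, per-term occurrence counts)."""
    hits = 0
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        found = [t for t in tokens if t in terms]
        if found:
            hits += 1              # one hit per sentence, however many terms it holds
            counts.update(found)   # every matching term is counted individually
    return hits, counts

# Hypothetical sample sentences and matched terms.
sentences = [
    "I think he thinks too much.",
    "She was thinking.",
    "No match here.",
]
hits, counts = term_breakdown(sentences, {"think", "thinks", "thinking"})
```

Here two sentences match, but three terms are found, because the first sentence contains both ‘think’ and ‘thinks’.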