How We Index
To understand searching and the search results you can expect, it is important to understand how Canopy indexes text. Canopy runs text through the analysis process depicted below:
```mermaid
flowchart TD
Text --> Analysis -- token --> Index
subgraph Analysis
direction TB
Tokenizers --> token[Token Filters]
end
```
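As a rough mental model, the pipeline in the diagram can be sketched in a few lines of Python. This is purely illustrative and not Canopy's actual implementation; the function and variable names are made up for the example.

```python
# Conceptual sketch of the analysis pipeline: a tokenizer splits text into
# tokens, then each token filter transforms the token stream before indexing.
def analyze(text, tokenizer, token_filters):
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

# Example: whitespace tokenization followed by a lowercase filter.
def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

whitespace_tokenizer = str.split

print(analyze("Quick Brown-Foxes", whitespace_tokenizer, [lowercase_filter]))
# -> ['quick', 'brown-foxes']
```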
To ensure high-quality, consistent indexing of text data, Canopy uses a primary Tokenizer alongside specialized Language Tokenizers optimized for the linguistic nuances of specific languages.
- Release Version 4.0.0 and later: Canopy uses the Whitespace Tokenizer as the primary tokenizer, along with the following Language Tokenizers: English, French, German, Italian, Kuromoji (Japanese), Nori (Korean), and Smart Chinese.
- Release Version 3.0.0 and earlier: Canopy uses the Standard Tokenizer as the primary tokenizer, along with the same set of Language Tokenizers as in Version 4.0.0 and later.
The Whitespace Tokenizer breaks text into tokens based on whitespace characters, such as spaces, tabs, and newlines. It treats any sequence of non-whitespace characters as a single token, making it a straightforward and efficient way to tokenize text.
As a result, any symbols, punctuation marks, and other non-alphanumeric characters are fully indexed and preserved during tokenization.
[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]
Tokenizing the above text produces the following tokens:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
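For instance, Python's built-in str.split() behaves much like a whitespace tokenizer and reproduces the token list above (an illustrative sketch, not Canopy's code):

```python
# str.split() with no arguments splits on any run of spaces, tabs, or newlines
# and leaves punctuation and symbols inside the tokens untouched.
text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(text.split())
# -> ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```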
In Version 4.0.0 and later, the default search behavior on the Documents page is powered by the Whitespace Tokenizer.
If you wish to restrict your search to the text indexed by the Whitespace Tokenizer, use the Field Mapping: precision:<search term>
The Standard Tokenizer utilizes a sophisticated, grammar-aware method to break down text into searchable units known as tokens. It follows the Unicode Text Segmentation standard (Unicode Standard Annex #29), allowing it to accurately identify word boundaries in English as well as many other languages.
This tokenizer primarily indexes alphanumeric characters, that is, letters and numbers. As a result, most symbols, punctuation marks, and other non-alphanumeric characters that usually act as delimiters are not indexed and are effectively removed during tokenization.
[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]
Tokenizing the above text produces the following tokens:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
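For simple English text like the example above, the Standard Tokenizer's output can be roughly approximated with a regular expression. The real tokenizer implements the full Unicode Text Segmentation rules (UAX #29), so treat this only as a sketch of the behavior:

```python
import re

# Rough approximation: runs of letters/digits, optionally followed by an
# apostrophe segment (so "dog's" stays one token while "Brown-Foxes" splits).
text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text))
# -> ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone']
```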
To search for the text indexed specifically by the Standard Tokenizer, use the Field Mapping: content.text:<search term>
The Language Tokenizers are designed to handle the unique characteristics of specific languages. They are particularly useful for languages that have different rules for word formation, punctuation, and other linguistic features. These include:
The English Tokenizer is designed to handle the unique characteristics of the English language. Similar to the Standard Tokenizer, the English Tokenizer indexes text by applying linguistic rules that normalize English words and by removing symbols, punctuation marks, and other non-alphanumeric characters during tokenization.
To restrict your search to the text indexed by the English Tokenizer, use one of the following Field Mappings:
- recall:<search term> (for Recall Search in Version 4.0.0 and later)
- content.text_english:<search term> (for Version 3.0.0 and earlier)
The French Tokenizer is optimized for indexing French text, handling elisions and other French-specific features.
To restrict your search to the text indexed by the French Tokenizer, use the Field Mapping: content.text_french:<search term>
The German Tokenizer is optimized for indexing German text, including normalization of special characters.
To restrict your search to the text indexed by the German Tokenizer, use the Field Mapping: content.text_german:<search term>
The Italian Tokenizer is optimized for indexing Italian text, including handling of elisions.
To restrict your search to the text indexed by the Italian Tokenizer, use the Field Mapping: content.text_italian:<search term>
The Kuromoji Tokenizer is optimized for indexing Japanese text.
To restrict your search to the text indexed by the Kuromoji Tokenizer, use the Field Mapping: content.text_japanese:<search term>
The Nori Tokenizer is optimized for indexing Korean text, providing morphological analysis, text normalization, and handling of Korean-specific features.
To restrict your search to the text indexed by the Nori Tokenizer, use the Field Mapping: content.text_korean:<search term>
The Smart Chinese Tokenizer is optimized for indexing Chinese or mixed Chinese-English text. This tokenizer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text.
The text is first broken down into sentences, then each sentence is tokenized into words. The Smart Chinese Tokenizer uses a dictionary-based approach to identify the most likely word boundaries, taking into account the context of the surrounding characters.
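The sketch below illustrates the dictionary-driven idea with a toy longest-match segmenter. The real Smart Chinese Tokenizer uses probabilistic models rather than greedy matching, and the dictionary entries here are hypothetical:

```python
# Toy longest-match word segmentation over a tiny hypothetical dictionary.
DICTIONARY = {"我们", "喜欢", "喝", "咖啡"}

def segment(text, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("我们喜欢喝咖啡"))
# -> ['我们', '喜欢', '喝', '咖啡']
```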
To restrict your search to the text indexed by the Smart Chinese Tokenizer, use the Field Mapping: content.text_chinese:<search term>
Token filters accept a stream of tokens from a tokenizer and can modify, delete, or add tokens.
Different tokenizers use different token filters.
Canopy’s Whitespace Tokenizer uses the Lowercase Filter. This filter converts each token to lowercase. For example,
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
would output:
[ the, 2, quick, brown-foxes, jumped, over, the, lazy, dog's, bone. ]
Canopy’s Standard Tokenizer also uses the Lowercase Filter. For example,
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
would output:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
The English Tokenizer uses Lowercase Filter, English Stemmer, English Stopwords and English Possessive Stemmer.
Canopy specifies the English stemmer.
The English Possessive Stemmer removes the trailing English possessive (’s) from each token.
For example,
O'Neil's
would output:
[ O, Neil ]
The following English stopwords are removed when indexing:
| Letter | Stopwords |
|---|---|
| A | “a”, “an”, “and”, “are”, “as”, “at” |
| B | “be”, “but”, “by” |
| F | “for” |
| I | “if”, “in”, “into”, “is”, “it” |
| N | “no”, “not” |
| O | “of”, “on”, “or” |
| S | “such” |
| T | “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
| W | “was”, “will”, “with” |
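A minimal sketch of the possessive stripping and stopword removal described above follows; the real English Tokenizer also applies stemming (for example, foxes → fox), which this sketch omits:

```python
# English stopwords from the table above.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def strip_possessive(token):
    # Drop a trailing English possessive ('s or ’s).
    for suffix in ("'s", "\u2019s"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

def english_filters(tokens):
    stripped = (strip_possessive(t) for t in tokens)
    return [t for t in stripped if t not in ENGLISH_STOPWORDS]

print(english_filters(["the", "2", "quick", "brown", "foxes", "jumped",
                       "over", "the", "lazy", "dog's", "bone"]))
# -> ['2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 'bone']
```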
The French Tokenizer uses Lowercase Filter, French Elision, French Stemmer, and French Stopwords.
The French Elision filter removes the following elisions from tokens:
[l', m', t', qu', n', s', j', d', c', jusqu', quoiqu', lorsqu', puisqu']
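The effect can be sketched as stripping these prefixes from the start of each token (an illustration only, not the actual filter implementation):

```python
# Strip a leading elision (l', d', qu', ...) from each token.
FRENCH_ELISIONS = ("l'", "m'", "t'", "qu'", "n'", "s'", "j'", "d'", "c'",
                   "jusqu'", "quoiqu'", "lorsqu'", "puisqu'")

def remove_elision(token):
    for prefix in FRENCH_ELISIONS:
        if token.startswith(prefix):
            return token[len(prefix):]
    return token

print([remove_elision(t) for t in ["l'avion", "d'accord", "qu'il", "paris"]])
# -> ['avion', 'accord', 'il', 'paris']
```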
Canopy specifies the French stemmer.
French stopwords are removed when indexing.
The German Tokenizer uses Lowercase Filter, German Normalization, German Stemmer, and German Stopwords.
This filter normalizes German-specific characters within tokens by accounting for common variations in how German words are written. For example, it allows for the fact that ä, ö, and ü are sometimes written as ae, oe, and ue:
- 'ß' is replaced by 'ss'.
- 'ä', 'ö', and 'ü' are replaced by 'a', 'o', and 'u', respectively.
- 'ae' and 'oe' are replaced by 'a' and 'o', respectively.
- 'ue' is replaced by 'u' when it does not follow a vowel or 'q'.
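The rules above can be sketched directly in Python (a simplified illustration of the normalization, not the exact implementation):

```python
import re

def german_normalize(token):
    token = token.replace("ß", "ss")
    token = token.replace("ä", "a").replace("ö", "o").replace("ü", "u")
    token = token.replace("ae", "a").replace("oe", "o")
    # 'ue' becomes 'u' only when it does not follow a vowel or 'q'.
    return re.sub(r"(?<![aeiouq])ue", "u", token)

print(german_normalize("straße"))   # -> strasse
print(german_normalize("mueller"))  # -> muller
print(german_normalize("quelle"))   # -> quelle ('ue' follows 'q', so it is kept)
```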
Canopy specifies the German stemmer.
German stopwords are removed when indexing.
The Italian Tokenizer uses Lowercase Filter, Italian Elision, Italian Stemmer, and Italian Stopwords.
The Italian Elision filter removes the following elisions from tokens:
["c", "l", "all", "dall", "dell", "nell", "sull", "coll", "pell", "gl", "agl", "dagl", "degl", "negl", "sugl", "un", "m", "t", "s", "v", "d"]
Canopy specifies the Italian stemmer.
Italian stopwords are removed when indexing.
The Kuromoji Tokenizer for Japanese text uses Lowercase Filter, Kuromoji Part of Speech, Kuromoji Base Form, Kuromoji Stemmer, and Japanese Stopwords.
Canopy removes the following stopwords when indexing Japanese text:
| Letter | Stopwords |
|---|---|
| A | “a”, “and”, “are”, “as”, “at” |
| B | “be”, “but”, “by” |
| F | “for” |
| I | “if”, “in”, “into”, “is”, “it” |
| N | “no”, “not” |
| O | “of”, “on”, “or” |
| S | “such”, “s” |
| T | “t”, “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
| W | “was”, “will”, “with”, “www” |
The Nori Tokenizer for Korean text uses Lowercase Filter, Nori Part of Speech, Nori Reading Form, and Nori Number.
The Nori Part of Speech Token Filter removes tokens that match a specified set of part-of-speech tags. By default, the removed tags include:
[ "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC", "SE", "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV" ]
The Nori Reading Form Token Filter converts tokens written in Hanja to Hangul form.
For example, the input token 鄕歌 will have an output token of 향가.
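In other words, the filter looks up each Hanja token and emits its Hangul reading. The one-entry mapping below is a hypothetical stand-in for the dictionary data the real filter relies on:

```python
# Tiny hypothetical reading table; the example pair is taken from the text above.
HANJA_READINGS = {"鄕歌": "향가"}

def to_reading_form(token):
    return HANJA_READINGS.get(token, token)

print(to_reading_form("鄕歌"))  # -> 향가
```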
The Nori Number Token Filter normalizes Korean numbers written in Hangul to regular Arabic decimal numbers in half-width characters. Below are some examples of how the Nori Number Token Filter works:
| Hangul (untokenized text input) | Arabic (tokenized text output) |
|---|---|
| 영영칠 | 7 |
| 일영영영 | 1000 |
| 삼천2백2십삼 | 3223 |
| 일조육백만오천일 | 1000006005001 |
| 3.2천 | 3200 |
| 1.2만345.67 | 12345.67 |
| 4,647.100 | 4647.1 |
This normalization makes it possible for users to type 3200 in the search bar and retrieve documents that contain 3.2천 in the text.
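The sketch below covers only the simplest case from the table above, digit-by-digit numerals such as 영영칠; positional units like 천 (thousand) or 만 (ten thousand) require the full filter and are not handled here:

```python
# Map each Hangul digit to its Arabic counterpart, then strip leading zeros.
HANGUL_DIGITS = {"영": "0", "일": "1", "이": "2", "삼": "3", "사": "4",
                 "오": "5", "육": "6", "칠": "7", "팔": "8", "구": "9"}

def normalize_hangul_digits(token):
    digits = "".join(HANGUL_DIGITS.get(ch, ch) for ch in token)
    return str(int(digits)) if digits.isdigit() else token

print(normalize_hangul_digits("영영칠"))    # -> 7
print(normalize_hangul_digits("일영영영"))  # -> 1000
```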
The Smart Chinese Tokenizer for Chinese text uses Chinese Stopwords.
Canopy removes the following stopwords when indexing Chinese text:
| Letter | Stopwords |
|---|---|
| A | “a”, “and”, “are”, “as”, “at” |
| B | “be”, “but”, “by” |
| F | “for” |
| I | “if”, “in”, “into”, “is”, “it” |
| N | “no”, “not” |
| O | “of”, “on”, “or” |
| S | “such”, “s” |
| T | “t”, “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
| W | “was”, “will”, “with”, “www” |