How We Index
To understand searching in Canopy and the results you can expect, it helps to know how Canopy indexes text. Every document passes through the analysis process depicted below:
```mermaid
flowchart TD
    Text --> Analysis -- token --> Index
    subgraph Analysis
        direction TB
        Tokenizers --> token[Token Filters]
    end
```
To ensure high-quality, consistent indexing of text, Canopy uses the Standard Tokenizer together with a number of specialized Language Tokenizers tuned to the linguistic nuances of specific languages.
The Standard Tokenizer utilizes a sophisticated, grammar-aware method to break down text into searchable units known as tokens. It follows the Unicode Text Segmentation standard (Unicode Standard Annex #29), allowing it to accurately identify word boundaries in English as well as many other languages.
This tokenizer indexes alphanumeric characters, that is, letters and numbers. Most symbols, punctuation marks, and other non-alphanumeric characters, which usually act as delimiters, are therefore removed during tokenization and are not indexed.
[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]
The above text would produce the following tokens:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
To search for the text indexed specifically by the Standard Tokenizer, use the Field Mapping: content.text:<text indexed by Standard Tokenizer>
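For intuition, here is a rough Python sketch that approximates the Standard Tokenizer's behavior on plain English text. Canopy's actual tokenizer follows the full Unicode Text Segmentation rules, so treat this only as an illustration.

```python
import re

def standard_tokenize(text: str) -> list[str]:
    """Rough approximation of the Standard Tokenizer for English text:
    keep runs of letters/digits, and keep apostrophes inside words
    (so "dog's" stays a single token). Punctuation acts as a delimiter."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

print(standard_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone']
```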
The Language Tokenizers are designed to handle the unique characteristics of specific languages. They are particularly useful for languages that have different rules for word formation, punctuation, and other linguistic features. These include:
The English Tokenizer is designed to handle the unique characteristics of the English language, including contractions and possessives.
To restrict your search to the text indexed by the English Tokenizer, use the Field Mapping: content.text_english:<text indexed by English Tokenizer>
The French Tokenizer is optimized for indexing French text, handling elisions and other French-specific features.
To restrict your search to the text indexed by the French Tokenizer, use the Field Mapping: content.text_french:<text indexed by French Tokenizer>
The German Tokenizer is optimized for indexing German text, including normalization of special characters.
To restrict your search to the text indexed by the German Tokenizer, use the Field Mapping: content.text_german:<text indexed by German Tokenizer>
The Italian Tokenizer is optimized for indexing Italian text, including handling of elisions.
To restrict your search to the text indexed by the Italian Tokenizer, use the Field Mapping: content.text_italian:<text indexed by Italian Tokenizer>
The Kuromoji Tokenizer is optimized for indexing Japanese text.
To restrict your search to the text indexed by the Kuromoji Tokenizer, use the Field Mapping: content.text_japanese:<text indexed by Japanese Tokenizer>
The Nori Tokenizer is optimized for indexing Korean text, providing morphological analysis, text normalization, and handling of Korean-specific features.
To restrict your search to the text indexed by the Nori Tokenizer, use the Field Mapping: content.text_korean:<text indexed by Korean Tokenizer>
The Smart Chinese Analyzer is optimized for indexing Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text.
The text is first broken down into sentences, then each sentence is tokenized into words. The Smart Chinese Tokenizer uses a dictionary-based approach to identify the most likely word boundaries, taking into account the context of the surrounding characters.
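As a rough illustration of the dictionary-based idea, the sketch below segments a sentence by greedy longest match against a toy dictionary. The real Smart Chinese Analyzer uses probabilistic scoring over a much larger dictionary, so this is illustrative only.

```python
# Toy dictionary for illustration; the real analyzer ships a large one.
DICTIONARY = {"我们", "喜欢", "搜索"}

def greedy_segment(sentence: str) -> list[str]:
    """Greedy longest-match segmentation over a toy dictionary.
    Unknown characters fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in DICTIONARY or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

print(greedy_segment("我们喜欢搜索"))  # ['我们', '喜欢', '搜索'] ("we like searching")
```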
To restrict your search to the text indexed by the Smart Chinese Tokenizer, use the Field Mapping: content.text_chinese:<text indexed by Chinese Tokenizer>
Token filters accept a stream of tokens from a tokenizer, and they have the ability to modify tokens, delete tokens, or add tokens.
Different tokenizers use different token filters.
Canopy’s Standard Tokenizer uses the Lowercase Filter, which converts each token to lowercase. For example,
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
would output:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
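Conceptually, a token filter is just a function from one token stream to another. A minimal sketch of a lowercase filter (illustrative only, not Canopy's implementation):

```python
def lowercase_filter(tokens: list[str]) -> list[str]:
    """Lowercase every token in the stream."""
    return [token.lower() for token in tokens]

print(lowercase_filter(["The", "2", "QUICK", "Brown", "Foxes"]))
# ['the', '2', 'quick', 'brown', 'foxes']
```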
The English Tokenizer uses Lowercase Filter, English Stemmer, English Stopwords and English Possessive Stemmer.
Canopy specifies the English stemmer.
The English Possessive Stemmer removes the English possessive (’s) from each token.
For example,
O'Neil's
would output:
[ O, Neil ]
The following English stopwords are removed when indexing:
Letter | Stopwords |
---|---|
A | “a”, “an”, “and”, “are”, “as”, “at” |
B | “be”, “but”, “by” |
F | “for” |
I | “if”, “in”, “into”, “is”, “it” |
N | “no”, “not” |
O | “of”, “on”, “or” |
S | “such” |
T | “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
W | “was”, “will”, “with” |
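Putting the English filters together (and leaving stemming aside), a simplified sketch of the chain could look like the following. The stopword set is the table above; the code is illustrative only, not Canopy's implementation.

```python
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def strip_possessive(token: str) -> str:
    """English Possessive Stemmer: drop a trailing 's or ’s."""
    for suffix in ("'s", "\u2019s"):
        if token.lower().endswith(suffix):
            return token[: -len(suffix)]
    return token

def english_filter_chain(tokens: list[str]) -> list[str]:
    """Possessive stemming, lowercasing, and stopword removal
    (the English Stemmer itself is left out of this sketch)."""
    out = []
    for token in tokens:
        token = strip_possessive(token).lower()
        if token not in ENGLISH_STOPWORDS:
            out.append(token)
    return out

print(english_filter_chain(
    ["The", "2", "QUICK", "Brown", "Foxes", "jumped", "over",
     "the", "lazy", "dog's", "bone"]))
# ['2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 'bone']
```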
The French Tokenizer uses Lowercase Filter, French Elision, French Stemmer, and French Stopwords.
The French Elision filter removes the following elisions from tokens:
[l', m', t', qu', n', s', j', d', c', jusqu', quoiqu', lorsqu', puisqu']
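A minimal sketch of elision removal, assuming the elided form is separated from the rest of the token by an apostrophe (illustrative only, not Canopy's implementation):

```python
FRENCH_ELISIONS = {"l", "m", "t", "qu", "n", "s", "j", "d", "c",
                   "jusqu", "quoiqu", "lorsqu", "puisqu"}

def remove_elision(token: str) -> str:
    """Strip a leading elision such as l' or qu' from a token."""
    prefix, apostrophe, rest = token.partition("'")
    if apostrophe and prefix.lower() in FRENCH_ELISIONS and rest:
        return rest
    return token

print(remove_elision("l'avion"))  # avion
print(remove_elision("qu'il"))    # il
print(remove_elision("livre"))    # livre (unchanged)
```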
Canopy specifies the French stemmer.
French stopwords are also removed when indexing.
The German Tokenizer uses Lowercase Filter, German Normalization, German Stemmer, and German Stopwords.
The German Normalization filter normalizes German-specific characters within tokens, accounting for common variations in how German words are written; for example, ä, ö, and ü are sometimes written as ae, oe, and ue. It applies the following rules:
'ß' is replaced by 'ss'
'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
'ae' and 'oe' are replaced by 'a', and 'o', respectively.
'ue' is replaced by 'u', when not following a vowel or q.
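A rough Python sketch of these rules, applied to already-lowercased tokens (illustrative only, not Canopy's implementation):

```python
import re

def german_normalize(token: str) -> str:
    """Approximate the German Normalization rules listed above."""
    token = token.replace("ß", "ss")
    token = token.replace("ä", "a").replace("ö", "o").replace("ü", "u")
    token = token.replace("ae", "a").replace("oe", "o")
    # 'ue' -> 'u' only when it does not follow a vowel or 'q'
    token = re.sub(r"(?<![aeiouq])ue", "u", token)
    return token

print(german_normalize("müller"))   # muller
print(german_normalize("mueller"))  # muller
print(german_normalize("straße"))   # strasse
```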
Canopy specifies the German stemmer.
German stopwords are also removed when indexing.
The Italian Analyzer uses Lowercase Filter, Italian Elision, Italian Stemmer, and Italian Stopwords.
The Italian Elision filter removes the following elisions from tokens:
["c", "l", "all", "dall", "dell", "nell", "sull", "coll", "pell", "gl", "agl", "dagl", "degl", "negl", "sugl", "un", "m", "t", "s", "v", "d"]
Canopy specifies the Italian stemmer.
Italian stopwords are also removed when indexing.
The Kuromoji Analyzer for Japanese text uses Lowercase Filter, Kuromoji Part of Speech, Kuromoji Base Form, Kuromoji Stemmer, and Japanese Stopwords.
Canopy removes the following stopwords when indexing Japanese text:
Letter | Stopwords |
---|---|
A | “a”, “and”, “are”, “as”, “at” |
B | “be”, “but”, “by” |
F | “for” |
I | “if”, “in”, “into”, “is”, “it” |
N | “no”, “not” |
O | “of”, “on”, “or” |
S | “such”, “s” |
T | “t”, that", “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
W | “was”, “will”, “with”, “www” |
The Nori Analyzer for Korean text uses Lowercase Filter, Nori Part of Speech, Nori Reading Form, and Nori Number.
The Nori Part of Speech Token Filter removes tokens that match specified part-of-speech tags. By default, the following part-of-speech tags are removed:
[ "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC", "SE", "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV" ]
The Nori Reading Form Token Filter converts tokens written in Hanja to Hangul form.
For example, the input token 鄕歌 produces the output token 향가.
The Nori Number Token Filter normalizes Korean numbers written in Hangul to regular Arabic decimal numbers in half-width characters. Below are some examples of how the Nori Number Token Filter works:
Hangul (untokenized text input) | Arabic (tokenized text output) |
---|---|
영영칠 | 7 |
일영영영 | 1000 |
삼천2백2십삼 | 3223 |
일조육백만오천일 | 1000006005001 |
3.2천 | 3200 |
1.2만345.67 | 12345.67 |
4,647.100 | 4647.1 |
This normalization makes it possible for a user to type 3200 in the search bar and get back documents that contain 3.2천 in their text.
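For intuition, the sketch below normalizes whole numbers written with Hangul digits and units; it deliberately ignores the decimal-point and comma cases from the table and is not Canopy's actual implementation.

```python
DIGITS = {"영": 0, "일": 1, "이": 2, "삼": 3, "사": 4,
          "오": 5, "육": 6, "칠": 7, "팔": 8, "구": 9}
SMALL_UNITS = {"십": 10, "백": 100, "천": 1_000}
LARGE_UNITS = {"만": 10**4, "억": 10**8, "조": 10**12}

def normalize_korean_number(text: str) -> int:
    """Convert a whole number written with Hangul digits and units
    (Arabic digits may be mixed in) to an integer."""
    total = section = current = 0
    for ch in text:
        if ch.isdigit():
            current = current * 10 + int(ch)
        elif ch in DIGITS:
            current = current * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (current or 1) * SMALL_UNITS[ch]
            current = 0
        elif ch in LARGE_UNITS:
            total += ((section + current) or 1) * LARGE_UNITS[ch]
            section = current = 0
    return total + section + current

print(normalize_korean_number("영영칠"))           # 7
print(normalize_korean_number("삼천2백2십삼"))      # 3223
print(normalize_korean_number("일조육백만오천일"))   # 1000006005001
```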
The Smart Chinese Tokenizer for Chinese text uses Chinese Stopwords.
Canopy removes the following stopwords when indexing Chinese text:
Letter | Stopwords |
---|---|
A | “a”, “and”, “are”, “as”, “at” |
B | “be”, “but”, “by” |
F | “for” |
I | “if”, “in”, “into”, “is”, “it” |
N | “no”, “not” |
O | “of”, “on”, “or” |
S | “such”, “s” |
T | “t”, that", “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
W | “was”, “will”, “with”, “www” |