How We Index
To understand searching in Canopy and the results you can expect, it helps to know how Canopy indexes text. Every document passes through the analysis process depicted below:
```mermaid
flowchart TD
    Text --> Analysis -- token --> Index
    subgraph Analysis
        direction TB
        Tokenizers --> token[Token Filters]
    end
```
To ensure high-quality, consistent indexing of text, Canopy uses the Standard Tokenizer together with a number of specialized Language Tokenizers tuned to the linguistic nuances of specific languages.
The Standard Tokenizer utilizes a sophisticated, grammar-aware method to break down text into searchable units known as tokens. It follows the Unicode Text Segmentation standard (Unicode Standard Annex #29), allowing it to accurately identify word boundaries in English as well as many other languages.
This tokenizer indexes alphanumeric characters, that is, letters and numbers. Most symbols, punctuation marks, and other non-alphanumeric characters, which usually act as delimiters, are therefore removed during tokenization and are not indexed.
[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]
The above text would produce the following tokens:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
To search for the text indexed specifically by the Standard Tokenizer, use the Field Mapping: content.text:<text indexed by Standard Tokenizer>
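For intuition, here is a rough Python sketch that approximates the Standard Tokenizer's behavior on plain English text. Canopy's actual tokenizer follows the full Unicode Text Segmentation rules, so treat this only as an illustration.

```python
import re

def standard_tokenize(text: str) -> list[str]:
    """Rough approximation of the Standard Tokenizer for English text:
    keep runs of letters/digits, and keep apostrophes inside words
    (so "dog's" stays a single token). Punctuation acts as a delimiter."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

print(standard_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone']
```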
The Language Tokenizers are designed to handle the unique characteristics of specific languages. They are particularly useful for languages that have different rules for word formation, punctuation, and other linguistic features. These include:
The English Tokenizer is designed to handle the unique characteristics of the English language, including contractions and possessives.
To restrict your search to the text indexed by the English Tokenizer, use the Field Mapping: content.text_english:<text indexed by English Tokenizer>
The French Tokenizer is optimized for indexing French text, handling elisions and other French-specific features.
To restrict your search to the text indexed by the French Tokenizer, use the Field Mapping: content.text_french:<text indexed by French Tokenizer>
The German Tokenizer is optimized for indexing German text, including normalization of special characters.
To restrict your search to the text indexed by the German Tokenizer, use the Field Mapping: content.text_german:<text indexed by German Tokenizer>
The Italian Tokenizer is optimized for indexing Italian text, including handling of elisions.
To restrict your search to the text indexed by the Italian Tokenizer, use the Field Mapping: content.text_italian:<text indexed by Italian Tokenizer>
The Kuromoji Tokenizer is optimized for indexing Japanese text.
To restrict your search to the text indexed by the Kuromoji Tokenizer, use the Field Mapping: content.text_japanese:<text indexed by Japanese Tokenizer>
The Nori Tokenizer is optimized for indexing Korean text, providing morphological analysis, text normalization, and handling of Korean-specific features.
To restrict your search to the text indexed by the Nori Tokenizer, use the Field Mapping: content.text_korean:<text indexed by Korean Tokenizer>
The Smart Chinese Analyzer is optimized for indexing Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text.
The text is first broken down into sentences, then each sentence is tokenized into words. The Smart Chinese Tokenizer uses a dictionary-based approach to identify the most likely word boundaries, taking into account the context of the surrounding characters.
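As a rough illustration of the dictionary-based idea, the sketch below segments a sentence by greedy longest match against a toy dictionary. The real Smart Chinese Analyzer uses probabilistic scoring over a much larger dictionary, so this is illustrative only.

```python
# Toy dictionary for illustration; the real analyzer ships a large one.
DICTIONARY = {"我们", "喜欢", "搜索"}

def greedy_segment(sentence: str) -> list[str]:
    """Greedy longest-match segmentation over a toy dictionary.
    Unknown characters fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in DICTIONARY or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

print(greedy_segment("我们喜欢搜索"))  # ['我们', '喜欢', '搜索'] ("we like searching")
```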
To restrict your search to the text indexed by the Smart Chinese Tokenizer, use the Field Mapping: content.text_chinese:<text indexed by Chinese Tokenizer>
Token filters accept a stream of tokens from a tokenizer, and they have the ability to modify tokens, delete tokens, or add tokens.
Different tokenizers use different token filters.
Canopy’s Standard Tokenizer uses the Lowercase Filter, which converts each token to lowercase. For example,
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
would output:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
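Conceptually, a token filter is just a function from one token stream to another. A minimal sketch of a lowercase filter (illustrative only, not Canopy's implementation):

```python
def lowercase_filter(tokens: list[str]) -> list[str]:
    """Lowercase every token in the stream."""
    return [token.lower() for token in tokens]

print(lowercase_filter(["The", "2", "QUICK", "Brown", "Foxes"]))
# ['the', '2', 'quick', 'brown', 'foxes']
```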
The English Tokenizer uses Lowercase Filter, English Stemmer, English Stopwords and English Possessive Stemmer.
Canopy specifies the English stemmer.
The English Possessive Stemmer removes the English possessive (’s) from each token.
For example,
O'Neil's
would output:
[ O, Neil ]
The following English stopwords are removed when indexing:
Letter | Stopwords |
---|---|
A | “a”, “an”, “and”, “are”, “as”, “at” |
B | “be”, “but”, “by” |
F | “for” |
I | “if”, “in”, “into”, “is”, “it” |
N | “no”, “not” |
O | “of”, “on”, “or” |
S | “such” |
T | “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
W | “was”, “will”, “with” |
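Putting the English filters together (and leaving stemming aside), a simplified sketch of the chain could look like the following. The stopword set is the table above; the code is illustrative only, not Canopy's implementation.

```python
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def strip_possessive(token: str) -> str:
    """English Possessive Stemmer: drop a trailing 's or ’s."""
    for suffix in ("'s", "\u2019s"):
        if token.lower().endswith(suffix):
            return token[: -len(suffix)]
    return token

def english_filter_chain(tokens: list[str]) -> list[str]:
    """Possessive stemming, lowercasing, and stopword removal
    (the English Stemmer itself is left out of this sketch)."""
    out = []
    for token in tokens:
        token = strip_possessive(token).lower()
        if token not in ENGLISH_STOPWORDS:
            out.append(token)
    return out

print(english_filter_chain(
    ["The", "2", "QUICK", "Brown", "Foxes", "jumped", "over",
     "the", "lazy", "dog's", "bone"]))
# ['2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 'bone']
```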
The French Tokenizer uses Lowercase Filter, French Elision, French Stemmer, and French Stopwords.
The French Elision filter removes the following elisions from tokens:
[l', m', t', qu', n', s', j', d', c', jusqu', quoiqu', lorsqu', puisqu']
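A minimal sketch of elision removal, assuming the elided form is separated from the rest of the token by an apostrophe (illustrative only, not Canopy's implementation):

```python
FRENCH_ELISIONS = {"l", "m", "t", "qu", "n", "s", "j", "d", "c",
                   "jusqu", "quoiqu", "lorsqu", "puisqu"}

def remove_elision(token: str) -> str:
    """Strip a leading elision such as l' or qu' from a token."""
    prefix, apostrophe, rest = token.partition("'")
    if apostrophe and prefix.lower() in FRENCH_ELISIONS and rest:
        return rest
    return token

print(remove_elision("l'avion"))  # avion
print(remove_elision("qu'il"))    # il
print(remove_elision("livre"))    # livre (unchanged)
```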
Canopy specifies the French stemmer.
French stopwords are also removed when indexing.
The German Tokenizer uses Lowercase Filter, German Normalization, German Stemmer, and German Stopwords.
The German Normalization filter normalizes German-specific characters within tokens, accounting for common variations in how German words are written; for example, ä, ö, and ü are sometimes written as ae, oe, and ue. It applies the following rules:
'ß' is replaced by 'ss'
'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
'ae' and 'oe' are replaced by 'a', and 'o', respectively.
'ue' is replaced by 'u', when not following a vowel or q.
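A rough Python sketch of these rules, applied to already-lowercased tokens (illustrative only, not Canopy's implementation):

```python
import re

def german_normalize(token: str) -> str:
    """Approximate the German Normalization rules listed above."""
    token = token.replace("ß", "ss")
    token = token.replace("ä", "a").replace("ö", "o").replace("ü", "u")
    token = token.replace("ae", "a").replace("oe", "o")
    # 'ue' -> 'u' only when it does not follow a vowel or 'q'
    token = re.sub(r"(?<![aeiouq])ue", "u", token)
    return token

print(german_normalize("müller"))   # muller
print(german_normalize("mueller"))  # muller
print(german_normalize("straße"))   # strasse
```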
Canopy specifies the German stemmer.
German stopwords are also removed when indexing.
The Italian Analyzer uses Lowercase Filter, Italian Elision, Italian Stemmer, and Italian Stopwords.
The Italian Elision filter removes the following elisions from tokens:
["c", "l", "all", "dall", "dell", "nell", "sull", "coll", "pell", "gl", "agl", "dagl", "degl", "negl", "sugl", "un", "m", "t", "s", "v", "d"]
Canopy specifies the Italian stemmer.
Italian stopwords are also removed when indexing.
The Kuromoji Analyzer for Japanese text uses Lowercase Filter, Kuromoji Part of Speech, Kuromoji Base Form, Kuromoji Stemmer, and Japanese Stopwords.
Canopy removes the following stopwords when indexing Japanese text:
Letter | Stopwords |
---|---|
A | “a”, “and”, “are”, “as”, “at” |
B | “be”, “but”, “by” |
F | “for” |
I | “if”, “in”, “into”, “is”, “it” |
N | “no”, “not” |
O | “of”, “on”, “or” |
S | “such”, “s” |
T | “t”, that", “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
W | “was”, “will”, “with”, “www” |
The Nori Analyzer for Korean text uses Lowercase Filter, Nori Part of Speech, Nori Reading Form, and Nori Number.
The Nori Part of Speech Token Filter removes tokens that match specified part-of-speech tags. By default, the following part-of-speech tags are removed:
[ "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC", "SE", "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV" ]
The Nori Reading Form Token Filter converts tokens written in Hanja to Hangul form.
For example, the input token 鄕歌 produces the output token 향가.
The Nori Number Token Filter normalizes Korean numbers written in Hangul to regular Arabic decimal numbers in half-width characters. Below are some examples of how the Nori Number Token Filter works:
Hangul (untokenized text input) | Arabic (tokenized text output) |
---|---|
영영칠 | 7 |
일영영영 | 1000 |
삼천2백2십삼 | 3223 |
일조육백만오천일 | 1000006005001 |
3.2천 | 3200 |
1.2만345.67 | 12345.67 |
4,647.100 | 4647.1 |
This normalization makes it possible for a user to type 3200 in the search bar and get back documents that contain 3.2천 in their text.
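For intuition, the sketch below normalizes whole numbers written with Hangul digits and units; it deliberately ignores the decimal-point and comma cases from the table and is not Canopy's actual implementation.

```python
DIGITS = {"영": 0, "일": 1, "이": 2, "삼": 3, "사": 4,
          "오": 5, "육": 6, "칠": 7, "팔": 8, "구": 9}
SMALL_UNITS = {"십": 10, "백": 100, "천": 1_000}
LARGE_UNITS = {"만": 10**4, "억": 10**8, "조": 10**12}

def normalize_korean_number(text: str) -> int:
    """Convert a whole number written with Hangul digits and units
    (Arabic digits may be mixed in) to an integer."""
    total = section = current = 0
    for ch in text:
        if ch.isdigit():
            current = current * 10 + int(ch)
        elif ch in DIGITS:
            current = current * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (current or 1) * SMALL_UNITS[ch]
            current = 0
        elif ch in LARGE_UNITS:
            total += ((section + current) or 1) * LARGE_UNITS[ch]
            section = current = 0
    return total + section + current

print(normalize_korean_number("영영칠"))           # 7
print(normalize_korean_number("삼천2백2십삼"))      # 3223
print(normalize_korean_number("일조육백만오천일"))   # 1000006005001
```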
The Smart Chinese Tokenizer for Chinese text uses Chinese Stopwords.
Canopy removes the following stopwords when indexing Chinese text:
Letter | Stopwords |
---|---|
A | “a”, “and”, “are”, “as”, “at” |
B | “be”, “but”, “by” |
F | “for” |
I | “if”, “in”, “into”, “is”, “it” |
N | “no”, “not” |
O | “of”, “on”, “or” |
S | “such”, “s” |
T | “t”, that", “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
W | “was”, “will”, “with”, “www” |