Search Syntax

Canopy’s search engine is built on the powerful Apache Lucene library, which allows for a wide range of search capabilities, including fuzzy searches, wildcard searches, proximity searches, and more. In this guide, we will cover the important query string syntax and search operators that you can use to refine your searches.

Overview

Query String Syntax commonly consists of Terms, Fields, and Operators.

Terms are single words or phrases you want to search for.

For example, tree and work are single search terms that allow you to search for documents with the words “tree” and “work”. When running a query, search Terms will be entered into a Field.

Fields: When performing a search, you may select a Field from the Fields guide. If no field is specified, Canopy searches across all relevant text fields containing extracted or OCRed text. This search is optimized to be as complete as possible.
Operators: Operators allow you to customize your search. Common Operators include AND, OR, and NOT (all of which must be capitalized). You can also use + for AND and - for NOT.

Basic Search

Version 4.0.0 and later

In version 4.0.0 and later, Canopy’s advanced search capabilities allow users to tailor their search strategy to their specific needs by leveraging two new modes:

Precision Search provides exact, literal search results, ideal for users who need to find specific terms that contain special characters and stopwords.

Recall Search offers a broader search approach, capturing variations of words and phrases to ensure comprehensive results.

Below are the key differences between Precision Search and Recall Search:

Precision Search	Recall Search	Search Behavior
Index text by splitting on spaces, tabs, and newlines	Index text by applying linguistic rules that normalize English words
Case insensitive	Case insensitive	`Apple`, `APPLE` or `AppLE` will match `apple` for both search modes
Can search on Stopwords	Cannot search on Stopwords	Precision Search: A search for `"Account No"` will return documents containing the exact phrase `"Account No"`. Recall Search: A search for `"Account No"` will not return documents containing the exact phrase `"Account No"` because `No` is a stopword, and it is removed when indexing.
No English stemming	English stemming	Precision Search: A search for `Account` will return documents containing the exact word `Account`. Recall Search: A search for `Account` will return documents containing variations of the word `Account`, e.g., `Accounts`, `Accounting`, etc., due to stemming.
Can search on possessive and contraction (’)	Cannot search on possessive and contraction (’)	Precision Search: A search for `Account’s` will return documents containing the exact word `Account’s`. Recall Search: A search for `Account’s` will not return documents containing `Account’s`. Instead, it will return documents with a variation of the word `Account`.
Can search on Symbols and Punctuation	Cannot search on Symbols and Punctuation	Precision Search: A search for `"Account #"` will return documents containing the exact phrase `"Account #"`. Recall Search: A search for `"Account #"` will not return documents containing the exact phrase `"Account #"` because `#` is removed when indexing.

By default, all searches use Precision Search. To switch to Recall Search, add the recall: prefix to your search query. The precision: prefix is optional, but it is available if you want to be explicit about using Precision Search. For example:

report returns Precision Search results for report
recall:report returns Recall Search results for report, e.g., reports, reporting, etc.
precision:report returns Precision Search results for report
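
As a minimal sketch of how a script might build these prefixed queries before pasting them into the search bar, the helper below simply prepends the mode name. The function name `with_mode` is hypothetical and is not part of Canopy; only the `precision:` and `recall:` prefixes come from this guide.

```python
def with_mode(query: str, mode: str = "precision") -> str:
    """Prefix a query string with one of the two search modes from this guide."""
    if mode not in ("precision", "recall"):
        raise ValueError(f"unknown search mode: {mode}")
    return f"{mode}:{query}"


print(with_mode("report"))            # precision:report
print(with_mode("report", "recall"))  # recall:report
```

Because Precision Search is the default, `with_mode("report")` is equivalent to searching for report on its own.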

Version 3.0.0 and earlier

Keyword Search: In earlier versions, Canopy uses a built-in English stemmer to find documents based on root and base words. Typing run in the search bar will return documents containing variations such as “running,” “runner,” “ran,” etc.
Phrase Search:
- Use double quotes to search for an exact phrase and its close variations. For example, "red delicious apple" (in quotes) returns results containing a variation of the phrase, such as "red delicious apple" or "red delicious apples", due to stemming.
- Without quotes, the search will return documents containing each word in any order. For example, red delicious apple (without quotes) returns documents containing each word ("red", "delicious", "apple") as well as phrases containing those words in any order ("delicious apple", "delicious red apple", etc.).

These basic searches are not case-sensitive. Whether you type “apple,” “Apple,” or “APPLE,” you’ll get the same results.

Wildcard Search

You can use wildcards to search for partial terms. Wildcards are useful when you are unsure of the spelling or want to find variations of a word.

Wildcard Operators	Description	Examples
`*`	Matches zero or more characters	`appl*` matches “apple”, “apples”, etc.
`?`	Matches exactly one character	`ex?mple` matches “example”, etc.

* and ? can be used at the beginning, middle, or end of a term.
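
To get a feel for what the two wildcards match, the sketch below emulates them with Python's standard `fnmatch` module, which happens to give `*` and `?` the same meaning. This is only an illustration, not how Canopy evaluates wildcards internally; `wildcard_match` is a hypothetical helper, and note that `fnmatch` also treats square brackets specially, which this guide does not describe.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, term: str) -> bool:
    """Return True if term matches a pattern that uses * and ? wildcards."""
    # Lowercase both sides to mimic case-insensitive search.
    return fnmatchcase(term.lower(), pattern.lower())


print(wildcard_match("appl*", "apples"))    # * covers "es"
print(wildcard_match("appl*", "appl"))      # * may also match zero characters
print(wildcard_match("ex?mple", "example")) # ? stands for exactly one character
print(wildcard_match("ex?mple", "exmple"))  # no character for ? to match
```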

Fuzzy Search

You can use the fuzzy operator to search for terms that are similar but not an exact match. This is useful for finding documents with misspelled words or variations of a term.

Use ~ after a term to enable fuzzy matching (e.g., aple~ matches “apple”, and tre~ wrk~ matches “tree work”).
You can specify the edit distance (e.g., aple~2 finds words that are up to 2 edits away from “aple”). The default edit distance is 2.
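
An "edit" here is a single-character insertion, deletion, substitution, or swap of two adjacent characters. The sketch below computes that distance with the optimal string alignment variant of Damerau-Levenshtein, on the assumption that Canopy inherits the usual Lucene fuzzy-matching behavior; the function names are hypothetical and the code is illustrative rather than Canopy's implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(query: str, term: str, max_edits: int = 2) -> bool:
    """True if term is within max_edits edits of query (like query~max_edits)."""
    return edit_distance(query.lower(), term.lower()) <= max_edits


print(edit_distance("aple", "apple"))  # one inserted "p"
print(fuzzy_match("wrk", "work"))
```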

Proximity Search

Proximity Search finds two or more words within a specific distance of each other in a document or field. To use Proximity Search, specify the maximum word distance for a phrase by using the tilde (~) or double tilde (~~) operator followed by a number.

Use the tilde (~) operator to find words within a certain distance of each other, regardless of their order (e.g., "data breach"~3 returns documents with the words “data” and “breach” within 3 words of each other, in either order).
Use the double tilde (~~) operator to find words within a certain distance of each other, in the specified order (e.g., "data breach"~~3 returns documents with the words “data” and “breach” within 3 words of each other, where “data” appears before “breach”).
To use Field Proximity Search, specify the field name followed by a colon before the phrase (e.g., file_name:"data breach"~3 returns documents whose File Name field contains the words “data” and “breach” within 3 words of each other).
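
The difference between ~ and ~~ can be sketched with a simple position check over whitespace-separated tokens. This is a deliberate simplification: it measures distance as the difference in token positions, which may not match exactly how the search engine counts distance, and the helper `within` is hypothetical.

```python
def within(text: str, first: str, second: str,
           max_distance: int, ordered: bool = False) -> bool:
    """True if first and second occur within max_distance token positions.

    ordered=False mimics "first second"~N (either order);
    ordered=True mimics "first second"~~N (first must come before second).
    """
    tokens = text.lower().split()
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered and 0 < gap <= max_distance:
                return True
            if not ordered and 0 < abs(gap) <= max_distance:
                return True
    return False


doc = "breach of customer data"
print(within(doc, "data", "breach", 3))                # either order is accepted
print(within(doc, "data", "breach", 3, ordered=True))  # "data" must come first
```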

Proximity Search with Regular Expressions

You can combine Regular Expressions (RegEx) with Proximity Search to locate specific text patterns near other terms within your documents. This feature is handy for finding structured data, such as Social Security Numbers, phone numbers, or dates, alongside a relevant keyword.

To use RegEx in a proximity search, you must use the order-preserving operator (~~) followed by the maximum word distance.

"term/phrase regex_pattern"~~N or "regex_pattern term/phrase"~~N, where N is the maximum word distance.

For example, to find documents where the word “ssn” appears within 5 words before a pattern that matches a Social Security Number (e.g., 123-45-6789), use one of the following queries:

"ssn [0-9]{3}-[0-9]{2}-[0-9]{4}"~~5 or "ssn \d{3}-\d{2}-\d{4}"~~5

Canopy’s Proximity Search with RegEx requires the double tilde (~~) operator followed by a number (e.g., ~~5), which preserves word order. The single tilde (~) operator is not supported in this context.

For details on Regular Expressions, please refer to the Regular Expressions (RegEx) Syntax guide.
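
The ordered term-then-pattern behavior can be approximated offline with Python's `re` module, which is useful for checking a pattern against sample text before running the search. The sketch assumes whitespace tokenization and position-based distance, and `term_before_pattern` is a hypothetical helper, not a Canopy function.

```python
import re


def term_before_pattern(text: str, term: str, pattern: str,
                        max_distance: int) -> bool:
    """True if a token fully matching pattern appears within max_distance
    tokens after term (the order-preserving ~~ behavior)."""
    tokens = text.split()
    regex = re.compile(pattern)
    for i, token in enumerate(tokens):
        if token.lower() != term.lower():
            continue
        # Only look forward, because ~~ requires the term to come first.
        window = tokens[i + 1 : i + 1 + max_distance]
        if any(regex.fullmatch(candidate) for candidate in window):
            return True
    return False


ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(term_before_pattern("SSN on file 123-45-6789", "ssn", ssn_pattern, 5))
print(term_before_pattern("123-45-6789 is the ssn", "ssn", ssn_pattern, 5))
```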

Field Search

Click here for more information on Field Search

Analyzer-Specific Field Search

To gain more control over your search results and achieve greater precision, you can target your searches to specific fields based on the Analyzer used to index the text data within those fields.

Understanding Analyzers and Fields

To ensure high-quality, consistent indexing of text data, Canopy uses a primary Analyzer alongside specialized Language Analyzers optimized for the linguistic nuances of specific languages.

Release Version 4.0.0 and later: Canopy uses the Whitespace Analyzer as the primary analyzer, along with the following Language Analyzers: English, French, German, Italian, Kuromoji (Japanese), Nori (Korean), and Smart Chinese.
Release Version 3.0.0 and earlier: Canopy uses the Standard Analyzer as the primary analyzer, along with the same set of Language Analyzers as in Version 4.0.0 and later.

Different Analyzers handle text indexing and searching differently.

The Whitespace Analyzer (Version 4.0.0 and later) breaks text into tokens based on whitespace characters, such as spaces, tabs and newlines, preserving the tokens’ original form.
The Standard Analyzer (Version 3.0.0 and earlier) breaks text into tokens based on whitespace and alphanumeric characters. This means that punctuation and special characters are removed during tokenization.
Language Analyzers (e.g., English, French) perform similar tokenization but also apply language-specific token filters, such as:
- Stemming: Reducing words to their root form (e.g., “running” becomes “run”).
- Stopword Removal: Ignoring common words that have little semantic value (e.g., “the,” “a,” “is”).
- Lowercase: Converting all characters to lowercase.
- Other linguistic normalizations: Handling elisions, case variations, etc.
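
The contrast between whitespace tokenization and a language analyzer can be sketched in a few lines. The code below is a toy model only: the stopword list is a tiny hypothetical sample, and the suffix-stripping "stemmer" is a crude stand-in for a real English stemmer, so it should not be read as the analyzers' actual output.

```python
import re
from typing import List

# Hypothetical, tiny stopword sample for illustration; real lists are longer.
STOPWORDS = {"no", "the", "a", "is"}


def precision_tokens(text: str) -> List[str]:
    """Whitespace-style analysis: split on whitespace and lowercase.
    Symbols, apostrophes, and stopwords are kept."""
    return text.lower().split()


def recall_tokens(text: str) -> List[str]:
    """Toy English-style analysis: drop possessives, symbols, and stopwords,
    then strip a common suffix as a crude stand-in for stemming."""
    tokens = []
    for raw in text.lower().split():
        word = re.sub(r"['’]s$", "", raw)      # account's -> account
        word = re.sub(r"[^a-z0-9]", "", word)  # drop symbols and punctuation
        if not word or word in STOPWORDS:
            continue
        for suffix in ("ing", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


for sample in ("Account No", "Accounting #", "Account’s"):
    print(sample, precision_tokens(sample), recall_tokens(sample))
```

Under this toy model, all three samples reduce to the single token account on the recall side, while the precision side keeps each sample distinct.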

Click here to learn more about How we Index the text data in Canopy

The following table shows the mapping of the different Analyzers used by Canopy and the corresponding data fields:

Analyzer	Field Mapping
Whitespace Analyzer (Version 4.0.0 or later)	`precision:<search term>`
Standard Analyzer (Version 3.0.0 or earlier)	`content.text`
English Analyzer	`recall:<search term>` (Version 4.0.0 or later) or, `content.text_english` (Version 3.0.0 or earlier)
French Analyzer	`content.text_french`
German Analyzer	`content.text_german`
Italian Analyzer	`content.text_italian`
Kuromoji Analyzer	`content.text_japanese`
Nori Analyzer	`content.text_korean`
Smart Chinese Analyzer	`content.text_chinese`

Default Search: When you enter a search term without specifying a field, Canopy (Version 4.0.0 or later) searches the text indexed by the Whitespace Analyzer, producing Precision Search results. In Version 3.0.0 or earlier, it searches all fields regardless of the Analyzer used.

Analyzer-Specific Field Search: When you specify a field mapping in your search query, you instruct Canopy to search only within the text indexed by the Analyzer associated with that specific field. This allows you to leverage each Analyzer’s unique processing capabilities for more targeted results.

Example Search Behavior

Consider documents with the following content in different fields:

Account
Accounting
Account No
Accounting No
Account #
Accounting #

Version 4.0.0 or later

Search Term	Expected Result	Explanation
`account`	1	The Whitespace Analyzer preserves the word in its original form, providing an exact result.
`account*` e.g., `accounting`	1, 2, 3, 4, 5, 6	Using a wildcard allows users to search for variations of the word “account”.
`"account no"`	3	The Whitespace Analyzer indexes stopwords, providing an exact result for “account no”.
`"account* no"` e.g., `"accounting no"`	3, 4	Using a wildcard allows users to search for variations of the word “account” that appear in an exact phrase with the word “no”.
`"account #"`	5	The Whitespace Analyzer indexes special characters, providing an exact result for “account #”.
`"account* #"` e.g., `"accounting #"`	5, 6	Using a wildcard allows users to search for variations of the word “account” that appear in an exact phrase with the special character “#”.
`recall:account`	1, 2, 3, 4, 5, 6	The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”.
`recall:accounting`	1, 2, 3, 4, 5, 6	The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”.
`recall:"account no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed while indexing. This search term yields the same result as `recall:account`.
`recall:"accounting no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed while indexing. This search term yields the same result as `recall:accounting`.
`recall:"account #"`	1, 2, 3, 4, 5, 6	The symbol # is removed while indexing. This search term yields the same result as `recall:account`.
`recall:"accounting #"`	1, 2, 3, 4, 5, 6	The symbol # is removed while indexing. This search term yields the same result as `recall:accounting`.

Version 3.0.0 or earlier

Search Term	Expected Result	Explanation
`account`	1, 2, 3, 4, 5, 6	Searches across all text fields for variations of the word “account”.
`account*` e.g., `accounting`	1, 2, 3, 4, 5, 6	Searches across all text fields for variations of the word “account”.
`"account no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed while indexing. This search term yields the same result as “account”.
`"account* no"` e.g., `"accounting no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed while indexing. This search term yields the same result as “accounting”.
`"account #"`	1, 2, 3, 4, 5, 6	The symbol # is removed while indexing. This search term yields the same result as “account”.
`"account* #"` e.g., `"accounting #"`	1, 2, 3, 4, 5, 6	The symbol # is removed while indexing. This search term yields the same result as “accounting”.
`content.text:account`	1, 3, 5	Searches across all text fields for the exact word “account”
`content.text:accounting`	2, 4, 6	Searches across all text fields for the exact word “accounting”
`content.text:"account no"`	3	Searches across all text fields for the exact phrase “account no”. Standard Analyzer does not remove the stopword “no”
`content.text:"accounting no"`	4	Searches across all text fields for the exact phrase “accounting no”. Standard Analyzer does not remove the stopword “no”
`content.text:"account #"`	1, 3, 5	The symbol # is removed while indexing. This search term yields the same result as `content.text:account`.
`content.text:"accounting #"`	2, 4, 6	The symbol # is removed while indexing. This search term yields the same result as `content.text:accounting`.
`content.text_english:account`	1, 2, 3, 4, 5, 6	The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”.
`content.text_english:accounting`	1, 2, 3, 4, 5, 6	The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”.
`content.text_english:"account no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed while indexing. This search term yields the same result as `content.text_english:account`.
`content.text_english:"accounting no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed while indexing. This search term yields the same result as `content.text_english:accounting`.
`content.text_english:"account #"`	1, 2, 3, 4, 5, 6	The symbol # is removed while indexing. This search term yields the same result as `content.text_english:account`.
`content.text_english:"accounting #"`	1, 2, 3, 4, 5, 6	The symbol # is removed while indexing. This search term yields the same result as `content.text_english:accounting`.

Regular Expressions (regex)

Click here for more information on Regular Expression Syntax

Ranges

You can use different brackets to denote specific ranges for date, numeric, and string fields.

Ranges Search	Examples
Use square brackets to specify inclusive ranges [min TO max]	`date:["2018-01-01 00:00:00.000" TO "2018-12-31 00:00:00.000"]` searches for all days in 2018. `count:[100 TO *]` searches for numbers from 100 upwards.
Use curly brackets to specify exclusive ranges {min TO max}	`tag:{delta TO sigma}` searches for tags between `delta` and `sigma`, excluding `delta` and `sigma`. `date:{* TO "2018-01-01 00:00:00.000"}` searches for all dates before 2018.
Combine curly and square brackets	`count:[1 TO 8}` searches for numbers from 1 up to but not including 8.
Range search with one side unbounded	`age:>30` searches for ages greater than 30. `age:>=30` searches for ages greater than or equal to 30.
Combine a lower-bounded and an upper-bounded range, joined by the AND operator	`age:(>=10 AND <30)` `age:(+>=10 +<30)`
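
When range clauses are generated by a script, the bracket choice is easy to get wrong. The sketch below assembles a clause from explicit inclusive and exclusive flags; `range_clause` is a hypothetical helper, and it leaves quoting of values such as dates to the caller.

```python
def range_clause(field, lower=None, upper=None,
                 include_lower=True, include_upper=True):
    """Build field:[lower TO upper] with [ ] for inclusive ends,
    { } for exclusive ends, and * for an unbounded side."""
    left = "[" if include_lower else "{"
    right = "]" if include_upper else "}"
    low = "*" if lower is None else str(lower)
    high = "*" if upper is None else str(upper)
    return f"{field}:{left}{low} TO {high}{right}"


print(range_clause("count", 100))                         # count:[100 TO *]
print(range_clause("count", 1, 8, include_upper=False))   # count:[1 TO 8}
print(range_clause("tag", "delta", "sigma", False, False))  # tag:{delta TO sigma}
```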

Boost Operator

The boost operator (^) allows you to increase the relevance of a term or phrase in your search results. By default, all terms are given equal weight, but you can adjust the weight of specific terms to prioritize them in the search results.

Boost Operator can be used on	Examples
Individual Terms	`sugar^2 maple`
Phrases	`"tree work"^2`
Groups of Terms	`(sugar maple)^4`

Although the default boost value is 1, it can be any positive floating point number. Boosts between 0 and 1 reduce relevance.

Boolean Operators

Boolean operators are used to combine or exclude keywords in a search, allowing you to refine your search results.

Boolean operators include + (this term must be included) and - (this term must not be included), while all other terms are optional. For example, sugar maple +tree -work states that:

tree must be included
work must not be included
sugar and maple are optional; their inclusion increases relevance
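
The three rules above can be modeled with a small function. This is a simplified sketch: it treats a document as a set of lowercase words and uses the number of matched optional terms as a stand-in for relevance, which is far cruder than real relevance scoring. The name `evaluate` is hypothetical.

```python
def evaluate(doc_text, required, prohibited, optional):
    """Return None if the document does not match; otherwise return the
    number of optional terms found (a crude stand-in for relevance)."""
    words = set(doc_text.lower().split())
    if not all(term in words for term in required):
        return None  # a + term is missing
    if any(term in words for term in prohibited):
        return None  # a - term is present
    return sum(1 for term in optional if term in words)


# Models the query: sugar maple +tree -work
query = dict(required=["tree"], prohibited=["work"], optional=["sugar", "maple"])
for doc in ("sugar maple tree", "tree", "sugar maple", "tree work"):
    print(repr(doc), evaluate(doc, **query))
```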

Users may also use Operators such as AND, OR and NOT (also written &&, || and ! respectively) to combine or exclude keywords in a search. Some important rules to note:

NOT takes precedence over AND, which takes precedence over OR.
+ and - only affect the term to the operator’s right. However, AND and OR affect the terms to the left and right.

For example, the following are attempts to rewrite the query sugar maple +tree -work using AND, OR, and NOT:

sugar OR maple AND tree AND NOT work. This rewrite is incorrect because maple is now a required term.
(sugar OR maple) AND tree AND NOT work. This rewrite is incorrect because at least one of sugar or maple is now required, and the search for those terms is scored differently from the original query.
((sugar AND tree) OR (maple AND tree) OR tree) AND NOT work. This rewrite replicates the logic of the original query, but the relevance scoring will not match that of the original query.

The operators AND, NOT, and OR must be in upper case.

Grouping

You can group terms or phrases using parentheses () to form sub-queries (e.g., (sugar OR maple) AND tree).

Groups can be used to focus on a particular field or boost results of a sub-query (e.g., piitag:(name OR phone) title:(full text search)^3).
Groups can be used to find a list of values in a field (e.g., id:(2FG2G55FGF OR 2FG2G55CGF OR 3FG2G55FGF) or alternately: id:(2FG2G55FGF 2FG2G55CGF 3FG2G55FGF)).
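
For long value lists, the grouped form is easier to generate than to type. A minimal sketch, assuming the values contain no spaces or special characters that would need quoting; `field_any` is a hypothetical helper name.

```python
def field_any(field, values):
    """Build field:(v1 OR v2 OR ...) to match any of the listed values."""
    if not values:
        raise ValueError("at least one value is required")
    return f"{field}:({' OR '.join(values)})"


print(field_any("id", ["2FG2G55FGF", "2FG2G55CGF", "3FG2G55FGF"]))
# id:(2FG2G55FGF OR 2FG2G55CGF OR 3FG2G55FGF)
```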