Search Syntax
Canopy’s search engine is built on the powerful Apache Lucene library, which allows for a wide range of search capabilities, including fuzzy searches, wildcard searches, proximity searches, and more. In this guide, we will cover the important query string syntax and search operators that you can use to refine your searches.
Query String Syntax commonly consists of Terms, Fields, and Operators.
- Terms are single words or phrases you want to search for. For example, `tree` and `work` are single search terms that allow you to search for documents with the words “tree” and “work”. When running a query, search Terms are entered into a Field.
- Fields: When performing a search, you may select a Field from the Fields guide. If no field is specified, Canopy searches across all relevant text fields containing extracted or OCRed text. This search is optimized to be as complete as possible.
- Operators allow you to customize your search. Common Operators include `AND`, `OR`, and `NOT` (must be capitalized). You can also use `+` for `AND` and `-` for `NOT`.
In version 4.0.0 and later, Canopy’s advanced search capabilities allow users to tailor their search strategy to their specific needs by leveraging two new modes:
- Precision Search provides exact, literal search results, ideal for users who need to find specific terms that contain special characters and stopwords.
- Recall Search offers a broader search approach, capturing variations of words and phrases to ensure comprehensive results.
Below are the key differences between Precision Search and Recall Search:
| Precision Search | Recall Search | Search Behavior |
|---|---|---|
| Index text by splitting on spaces, tabs, and newlines | Index text by applying linguistic rules that normalize English words | |
| Case insensitive | Case insensitive | Apple, APPLE or AppLE will match apple for both search modes |
| Can search on Stopwords | Cannot search on Stopwords | Precision Search: A search for "Account No" will return documents containing the exact phrase "Account No". Recall Search: A search for "Account No" will not return documents containing the exact phrase "Account No" because No is a stopword, and it is removed when indexing. |
| No English stemming | English stemming | Precision Search: A search for Account will return documents containing the exact word Account. Recall Search: A search for Account will return documents containing variations of the word Account, e.g., Accounts, Accounting, etc., due to stemming. |
| Can search on possessive and contraction (’) | Cannot search on possessive and contraction (’) | Precision Search: A search for Account’s will return documents containing the exact word Account’s. Recall Search: A search for Account’s will not return documents containing Account’s. Instead, it will return documents with a variation of the word Account. |
| Can search on Symbols and Punctuation | Cannot search on Symbols and Punctuation | Precision Search: A search for "Account #" will return documents containing the exact phrase "Account #". Recall Search: A search for "Account #" will not return documents containing the exact phrase "Account #" because # is removed when indexing. |
By default, all searches use Precision Search. To switch to Recall Search, add the `recall:` prefix to your search query. The `precision:` prefix is optional, but it is available if you want to be explicit about using Precision Search. Some examples:
- `report` returns Precision Search results for `"report"`.
- `recall:report` returns Recall Search results for `report`, e.g., `reports`, `reporting`, etc.
- `precision:report` returns Precision Search results for `report`.
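If queries are assembled in a script before being pasted into the search bar, the mode prefix can be added with a small helper. The sketch below is only an illustration: `with_mode` is a hypothetical function name, not part of Canopy, and it does nothing more than build the query string described above.

```python
def with_mode(query: str, mode: str = "precision") -> str:
    """Prefix a query with the search-mode keyword (hypothetical helper)."""
    if mode not in ("precision", "recall"):
        raise ValueError(f"unknown search mode: {mode!r}")
    return f"{mode}:{query}"


print(with_mode("report"))            # precision:report
print(with_mode("report", "recall"))  # recall:report
```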
- Keyword Search: In earlier versions, Canopy uses a built-in English stemmer to find documents based on root and base words. Typing `run` in the search bar will return documents containing variations such as “running,” “runner,” “ran,” etc.
- Phrase Search:
  - Use double quotes to search for an exact phrase and its close variations. For example, `"red delicious apple"` (in quotes) returns results containing a variation of the phrase `"red delicious apple"`, such as `"red delicious apple"` or `"red delicious apples"`, due to stemming.
  - Without quotes, the search returns documents containing each word in any order. For example, `red delicious apple` (without quotes) returns documents containing each word (`"red"`, `"delicious"`, `"apple"`) and phrases containing those words in any order (`"delicious apple"`, `"delicious red apple"`, etc.).
These basic searches are not case-sensitive. Whether you type “apple,” “Apple,” or “APPLE,” you’ll get the same results.
You can use wildcards to search for partial terms. Wildcards are useful when you are unsure of the spelling or want to find variations of a word.
| Wildcard Operators | Description | Examples |
|---|---|---|
| `*` | Matches zero, one, or multiple characters | appl* matches “apple”, “apples”, etc. |
| `?` | Matches exactly one character | ex?mple matches “example”, etc. |
* and ? can be used at the beginning, middle, or end of a term.
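To get a feel for which terms a wildcard pattern would cover, the matching rules can be approximated in Python with the standard `fnmatch` module. This is a rough sketch, not Canopy's implementation: `wildcard_match` is a hypothetical helper, and `fnmatch` also gives special meaning to square brackets, which the search syntax here does not.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, term: str) -> bool:
    """Return True if `term` matches a pattern that uses * and ? wildcards."""
    # Lowercase both sides to mirror case-insensitive searching.
    return fnmatchcase(term.lower(), pattern.lower())


print(wildcard_match("appl*", "apples"))     # True
print(wildcard_match("appl*", "appl"))       # True: * also matches zero characters
print(wildcard_match("ex?mple", "example"))  # True
print(wildcard_match("ex?mple", "exmple"))   # False: ? needs exactly one character
```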
You can use the fuzzy operator to search for terms that are similar but not an exact match. This is useful for finding documents with misspelled words or variations of a term.
- Use `~` after a term to enable fuzzy matching (e.g., `aple~` matches “apple”, and `tre~ wrk~` matches “tree work”).
- You can specify the edit distance (e.g., `aple~2` finds words that are up to 2 edits away from “aple”). The default edit distance is 2 characters.
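The notion of "edits" can be made concrete with a short Python sketch of the classic Levenshtein distance, which counts single-character insertions, deletions, and substitutions. It is an illustration only: `edit_distance` and `fuzzy_match` are hypothetical helpers, and a search engine may count edits slightly differently (for example, treating a swap of two adjacent characters as one edit, which this sketch does not model).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_match(query: str, term: str, max_edits: int = 2) -> bool:
    """True if `term` is within `max_edits` edits of `query`."""
    return edit_distance(query.lower(), term.lower()) <= max_edits


print(edit_distance("aple", "apple"))  # 1
print(fuzzy_match("tre", "tree"))      # True
print(fuzzy_match("aple", "banana"))   # False
```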
Proximity Search finds two or more words within a specified distance of each other in a document or field. To use Proximity Search, specify the maximum distance between the words in a phrase by using the tilde (~) or double tilde (~~) operator followed by a number.
- Use the tilde (~) operator to find words within a certain distance of each other, regardless of their order (e.g., `"data breach"~3` returns documents with the words “data” and “breach” within 3 words of each other, in either order).
- Use the double tilde (~~) operator to find words within a certain distance of each other, in the specified order (e.g., `"data breach"~~3` returns documents with the words “data” and “breach” within 3 words of each other, where “data” appears before “breach”).
- To use Field Proximity Search, specify the field name followed by a colon before the phrase (e.g., `file_name:"data breach"~3` returns documents whose File Name field contains the word “data” and the word “breach” within 3 words of each other).
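The difference between the unordered (~) and ordered (~~) forms can be sketched in Python. This is a simplified model under stated assumptions: `within_distance` is a hypothetical helper, text is split on whitespace only, and distance is measured as the difference in word positions, which may not be exactly how the engine counts it.

```python
def within_distance(text: str, first: str, second: str,
                    max_distance: int, ordered: bool = False) -> bool:
    """True if `first` and `second` occur within `max_distance` word positions."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue  # the same occurrence cannot match both terms
            if ordered and j < i:
                continue  # ordered mode: `first` must come before `second`
            if abs(j - i) <= max_distance:
                return True
    return False


text = "the breach exposed customer data"
print(within_distance(text, "data", "breach", 3))                # True
print(within_distance(text, "data", "breach", 3, ordered=True))  # False
```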
You can combine Regular Expressions (RegEx) with Proximity Search to locate specific text patterns near other terms within your documents. This feature is handy for finding structured data, such as Social Security Numbers, phone numbers, or dates, alongside a relevant keyword.
To use RegEx in a proximity search, you must use the order-preserving operator (~~) followed by the maximum word distance.
"term/phrase regex_pattern"~~N or "regex_pattern term/phrase"~~N, where N is the maximum word distance.
For example, to find documents where the word “ssn” appears within 5 words before a pattern that matches a Social Security Number (e.g., 123-45-6789), use one of the following queries:
`"ssn [0-9]{3}-[0-9]{2}-[0-9]{4}"~~5` or `"ssn \d{3}-\d{2}-\d{4}"~~5`
Canopy’s Proximity Search with RegEx requires the double tilde (~~) operator followed by a number (e.g., ~~5), which preserves word order. The single tilde (~) operator is not supported in this context.
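Before running such a query, it can help to check the pattern and the ordering logic on a sample sentence. The Python sketch below is an approximation under assumptions: `keyword_then_pattern` is a hypothetical helper, text is split on whitespace, distance is counted in following word positions, and Python's `re` syntax is not guaranteed to match the RegEx dialect used by the search engine.

```python
import re

# The same Social Security Number pattern used in the example query.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def keyword_then_pattern(text: str, keyword: str,
                         pattern: re.Pattern, max_distance: int) -> bool:
    """True if a token matching `pattern` follows `keyword` within `max_distance` words."""
    tokens = text.lower().split()
    for i, token in enumerate(tokens):
        if token != keyword.lower():
            continue
        # Order-preserving: look only at the tokens after the keyword.
        for candidate in tokens[i + 1 : i + 1 + max_distance]:
            if pattern.fullmatch(candidate):
                return True
    return False


print(keyword_then_pattern("Employee SSN on file is 123-45-6789", "ssn", SSN_PATTERN, 5))  # True
print(keyword_then_pattern("123-45-6789 is the SSN", "ssn", SSN_PATTERN, 5))               # False
```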
For details on Regular Expressions, please refer to the Regular Expressions (RegEx) Syntax guide.
Click here for more information on Field Search
To gain more control over your search results and achieve greater precision, you can target your searches to specific fields based on the Analyzer used to index the text data within those fields.
To ensure high-quality, consistent indexing of text data, Canopy uses a primary Analyzer alongside specialized Language Analyzers optimized for the linguistic nuances of specific languages.
- Release Version 4.0.0 and later: Canopy uses the Whitespace Analyzer as the primary analyzer, along with the following Language Analyzers: English, French, German, Italian, Kuromoji (Japanese), Nori (Korean), and Smart Chinese.
- Release Version 3.0.0 and earlier: Canopy uses the Standard Analyzer as the primary analyzer, along with the same set of Language Analyzers as in Version 4.0.0 and later.
Different Analyzers handle text indexing and searching differently.
- The Whitespace Analyzer (Version 4.0.0 and later) breaks text into tokens based on whitespace characters, such as spaces, tabs and newlines, preserving the tokens’ original form.
- The Standard Analyzer (Version 3.0.0 and earlier) breaks text into tokens based on whitespace and alphanumeric characters. This means that punctuation and special characters are removed during tokenization.
- Language Analyzers (e.g., English, French) perform similar tokenization but also apply language-specific token filters, such as:
  - Stemming: Reducing words to their root form (e.g., “running” becomes “run”).
  - Stopword Removal: Ignoring common words that have little semantic value (e.g., “the,” “a,” “is”).
  - Lowercasing: Converting all characters to lowercase.
  - Other linguistic normalizations: Handling elisions, case variations, etc.
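The contrast between whitespace tokenization and a language analyzer can be sketched with a toy Python model. This is not Canopy's or Lucene's analyzer code: the stopword list is a tiny illustrative subset, `toy_stem` is a crude stand-in for a real English stemmer, and lowercasing in `whitespace_tokens` is an assumption made here to mirror case-insensitive matching.

```python
import re

STOPWORDS = {"a", "an", "is", "no", "the"}  # tiny illustrative subset


def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only; symbols and stopwords survive."""
    return text.lower().split()


def toy_stem(word: str) -> str:
    """Crude suffix stripping, standing in for a real English stemmer."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo a doubled final consonant, e.g. "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def english_like_tokens(text: str) -> list[str]:
    """Lowercase, drop possessives, symbols and stopwords, then stem."""
    cleaned = re.sub(r"['’]s\b", "", text.lower())
    words = re.findall(r"[a-z0-9]+", cleaned)
    return [toy_stem(w) for w in words if w not in STOPWORDS]


print(whitespace_tokens("Accounting No"))    # ['accounting', 'no']
print(english_like_tokens("Accounting No"))  # ['account']
print(whitespace_tokens("Account #"))        # ['account', '#']
print(english_like_tokens("Account #"))      # ['account']
```

In this model, the phrase "Account #" keeps its symbol under whitespace tokenization but collapses to the single token "account" under the English-like pipeline, which is why the two kinds of index answer the same query differently.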
Click here to learn more about How we Index the text data in Canopy
The following table shows the mapping of the different Analyzers used by Canopy and the corresponding data fields:
| Analyzer | Field Mapping |
|---|---|
| Whitespace Analyzer (Version 4.0.0 or later) | precision:<search term> |
| Standard Analyzer (Version 3.0.0 or earlier) | content.text |
| English Analyzer | recall:<search term> (Version 4.0.0 or later), or content.text_english (Version 3.0.0 or earlier) |
| French Analyzer | content.text_french |
| German Analyzer | content.text_german |
| Italian Analyzer | content.text_italian |
| Kuromoji Analyzer | content.text_japanese |
| Nori Analyzer | content.text_korean |
| Smart Chinese Analyzer | content.text_chinese |
Default Search: When you enter a search term without specifying a field, Canopy searches on the Whitespace Analyzer (Version 4.0.0 or later), producing Precision Search results. In Version 3.0.0 or earlier, it searches on all fields regardless of the Analyzer used.
Analyzer-Specific Field Search: When you specify a field mapping in your search query, you instruct Canopy to search only within the text indexed by the Analyzer associated with that specific field. This allows you to leverage each Analyzer’s unique processing capabilities for more targeted results.
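As an illustration of analyzer-specific field search, the sketch below builds field-scoped query strings from the language field names in the mapping table. The helper `language_query` and the dictionary name are hypothetical, and the sample search terms are made up for the example.

```python
# Field names taken from the Analyzer mapping table.
LANGUAGE_FIELDS = {
    "french": "content.text_french",
    "german": "content.text_german",
    "italian": "content.text_italian",
    "japanese": "content.text_japanese",
    "korean": "content.text_korean",
    "chinese": "content.text_chinese",
}


def language_query(language: str, term: str) -> str:
    """Build a query restricted to one language-specific field (hypothetical helper)."""
    field = LANGUAGE_FIELDS[language.lower()]
    # Quote multi-word input so it is treated as a phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{field}:{term}"


print(language_query("French", "compte"))           # content.text_french:compte
print(language_query("german", "offene rechnung"))  # content.text_german:"offene rechnung"
```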
Consider documents with the following content in different fields:
1. Account
2. Accounting
3. Account No
4. Accounting No
5. Account #
6. Accounting #
Version 4.0.0 or later
| Search Term | Expected Result | Explanation |
|---|---|---|
| account | 1 | The Whitespace Analyzer preserves the word in its original form, providing an exact result. |
| account* e.g., accounting | 1, 2, 3, 4, 5, 6 | Using a wildcard allows users to search for variations of the word “account”. |
| "account no" | 3 | The Whitespace Analyzer indexes stopwords, providing an exact result for “account no”. |
| "account* no" e.g., "accounting no" | 3, 4 | Using a wildcard allows users to search for variations of the word “account” that appear in an exact phrase with the word “no”. |
| "account #" | 5 | The Whitespace Analyzer indexes special characters, providing an exact result for “account #”. |
| "account* #" e.g., "accounting #" | 5, 6 | Using a wildcard allows users to search for variations of the word “account” that appear in an exact phrase with the special character “#”. |
| recall:account | 1, 2, 3, 4, 5, 6 | The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”. |
| recall:accounting | 1, 2, 3, 4, 5, 6 | The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”. |
| recall:"account no" | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as “recall:account”. |
| recall:"accounting no" | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as “recall:accounting”. |
| recall:"account #" | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as “recall:account”. |
| recall:"accounting #" | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as “recall:accounting”. |
Version 3.0.0 or earlier
| Search Term | Expected Result | Explanation |
|---|---|---|
| account | 1, 2, 3, 4, 5, 6 | Searches across all text fields for variations of the word “account”. |
| account* e.g., accounting | 1, 2, 3, 4, 5, 6 | Searches across all text fields for variations of the word “account”. |
| "account no" | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as “account”. |
| "account* no" e.g., "accounting no" | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as “accounting”. |
| "account #" | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as “account”. |
| "account* #" e.g., "accounting #" | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as “accounting”. |
| content.text:account | 1, 3, 5 | Searches across all text fields for the exact word “account”. |
| content.text:accounting | 2, 4, 6 | Searches across all text fields for the exact word “accounting”. |
| content.text:"account no" | 3 | Searches across all text fields for the exact phrase “account no”. The Standard Analyzer does not remove the stopword “no”. |
| content.text:"accounting no" | 4 | Searches across all text fields for the exact phrase “accounting no”. The Standard Analyzer does not remove the stopword “no”. |
| content.text:"account #" | 1, 3, 5 | The symbol # is removed while indexing. This search term yields the same result as content.text:account. |
| content.text:"accounting #" | 2, 4, 6 | The symbol # is removed while indexing. This search term yields the same result as content.text:accounting. |
| content.text_english:account | 1, 2, 3, 4, 5, 6 | The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”. |
| content.text_english:accounting | 1, 2, 3, 4, 5, 6 | The English Analyzer uses a stemmer, so this syntax searches across all text fields for variations of the word “account”. |
| content.text_english:"account no" | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as “content.text_english:account”. |
| content.text_english:"accounting no" | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as “content.text_english:accounting”. |
| content.text_english:"account #" | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as “content.text_english:account”. |
| content.text_english:"accounting #" | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as “content.text_english:accounting”. |
Click here for more information on Regular Expression Syntax
You can use different brackets to denote specific ranges for date, numeric, and string fields.
| Ranges Search | Examples |
|---|---|
| Use square brackets to specify inclusive ranges: [min TO max] | date:["2018-01-01 00:00:00.000" TO "2018-12-31 00:00:00.000"] searches for all days in 2018. count:[100 TO *] searches for numbers from 100 upwards. |
| Use curly brackets to specify exclusive ranges: {min TO max} | tag:{delta TO sigma} searches for tags between delta and sigma, excluding delta and sigma. date:{* TO "2018-01-01 00:00:00.000"} searches for all dates before 2018. |
| Combine curly and square brackets | count:[1 TO 8} searches for numbers from 1 up to but not including 8. |
| Range search with one side unbounded | age:>30 searches for ages greater than 30. age:>=30 searches for ages greater than or equal to 30. |
| Combine a lower bound and an upper bound by joining them with the AND operator | age:(>=10 AND <30) or age:(+>=10 +<30) |
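The bracket rules in the table can be captured in a small query builder. The sketch below is illustrative only: `range_query` is a hypothetical helper that produces the bracketed range strings shown above, using `*` for an unbounded side and quoting values that contain spaces.

```python
def range_query(field, lower=None, upper=None,
                include_lower=True, include_upper=True) -> str:
    """Build a range query string; None means the side is unbounded (*)."""
    def fmt(value):
        if value is None:
            return "*"
        text = str(value)
        # Values with spaces (e.g., timestamps) need double quotes.
        return f'"{text}"' if " " in text else text

    open_bracket = "[" if include_lower else "{"    # [ inclusive, { exclusive
    close_bracket = "]" if include_upper else "}"
    return f"{field}:{open_bracket}{fmt(lower)} TO {fmt(upper)}{close_bracket}"


print(range_query("count", 100))                         # count:[100 TO *]
print(range_query("count", 1, 8, include_upper=False))   # count:[1 TO 8}
print(range_query("tag", "delta", "sigma", False, False))  # tag:{delta TO sigma}
```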
The boost operator (^) allows you to increase the relevance of a term or phrase in your search results. By default, all terms are given equal weight, but you can adjust the weight of specific terms to prioritize them in the search results.
| Boost Operator can be used on | Examples |
|---|---|
| Individual Terms | sugar^2 maple |
| Phrases | "tree work"^2 |
| Groups of Terms | (sugar maple)^4 |
Although the default boost value is 1, it can be any positive floating point number. Boosts between 0 and 1 reduce relevance.
Boolean operators are used to combine or exclude keywords in a search, allowing you to refine your search results.
Boolean operators include + (this term must be included) and - (this term must not be included), while all other terms are optional. For example, sugar maple +tree -work states that:
- `tree` must be included
- `work` must not be included
- `sugar` and `maple` are optional; their inclusion increases relevance
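The required, prohibited, and optional roles in `sugar maple +tree -work` can be modeled with a few lines of Python. This is a toy sketch: `evaluate` is a hypothetical function, documents are split on whitespace, and the score of one point per matching term is an assumption that only stands in for real relevance scoring.

```python
def evaluate(document: str, required=(), prohibited=(), optional=()):
    """Return (matches, score) for a query such as: sugar maple +tree -work."""
    words = set(document.lower().split())
    if any(term in words for term in prohibited):
        return False, 0   # a "-" term is present
    if not all(term in words for term in required):
        return False, 0   # a "+" term is missing
    # Toy relevance: one point per matching required or optional term.
    score = sum(term in words for term in (*required, *optional))
    # With no required terms, at least one optional term has to match.
    if not required and score == 0:
        return False, 0
    return True, score


query = {"required": ["tree"], "prohibited": ["work"], "optional": ["sugar", "maple"]}
print(evaluate("sugar maple tree", **query))  # (True, 3)
print(evaluate("oak tree", **query))          # (True, 1)
print(evaluate("tree work", **query))         # (False, 0)
print(evaluate("sugar maple", **query))       # (False, 0)
```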
Users may also use Operators such as AND, OR and NOT (also written &&, || and ! respectively) to combine or exclude keywords in a search. Some important rules to note:
- NOT takes precedence over AND, which takes precedence over OR.
- `+` and `-` only affect the term to the operator’s right. However, AND and OR affect the terms to the left and right.
For example, each of the following rewrites the query `sugar maple +tree -work` using AND, OR, and NOT:
- `sugar OR maple AND tree AND NOT work`: This example will yield an inaccurate result because `maple` is now a required term.
- `(sugar OR maple) AND tree AND NOT work`: This example will yield an inaccurate result because at least one of `sugar` or `maple` is now required, and those terms are scored differently than in the original query.
- `((sugar AND tree) OR (maple AND tree) OR tree) AND NOT work`: This example replicates the logic of the original query, but the relevance scoring will not match that of the original query.
The operators AND, NOT, and OR must be in upper case.
You can group terms or phrases using parentheses () to form sub-queries (e.g., `(sugar OR maple) AND tree`).
- Groups can be used to focus on a particular field or boost the results of a sub-query (e.g., `piitag:(name OR phone) title:(full text search)^3`).
- Groups can be used to find a list of values in a field (e.g., `id:(2FG2G55FGF OR 2FG2G55CGF OR 3FG2G55FGF)`, or alternatively `id:(2FG2G55FGF 2FG2G55CGF 3FG2G55FGF)`).
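When the list of values is long, the grouped query can be generated rather than typed. The following sketch assumes a hypothetical helper, `field_values`, that joins the values inside a single field group in either of the two forms shown above.

```python
def field_values(field: str, values, explicit_or: bool = True) -> str:
    """Build a grouped field query that lists several values (hypothetical helper)."""
    separator = " OR " if explicit_or else " "
    return f"{field}:({separator.join(values)})"


ids = ["2FG2G55FGF", "2FG2G55CGF", "3FG2G55FGF"]
print(field_values("id", ids))                     # id:(2FG2G55FGF OR 2FG2G55CGF OR 3FG2G55FGF)
print(field_values("id", ids, explicit_or=False))  # id:(2FG2G55FGF 2FG2G55CGF 3FG2G55FGF)
```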