{% extends base_template %} {% block title_page %}AL - Search help{% endblock %} {% block body_param %}data-target=".bs-docs-sidebar" data-scroll-offset="60" data-spy="scroll"{% endblock %} {% block content %}

Search Help

Query Syntax

Overview

The query string “mini-language” is used by the Search API. Operators allow you to customize the search; the available options are explained below:

Basic Search

The query string is parsed into a series of terms and operators. A term can be a single word, like quick or brown, or a phrase surrounded by double quotes, like "quick brown", which searches for all the words in the phrase, in the same order.

Examples:

"quick brown fox"
quick OR brown OR fox

Field names

Basic search only looks for the search terms in the default field, but other fields can be specified in the query syntax:

Examples:

  • where the status field contains active
    status:active
  • where the title field contains quick or brown
    title:(quick OR brown)
  • where the author field contains the exact phrase "john smith"
    author:"John Smith"
  • where any of the fields book.title, book.content or book.date contains quick or brown (note how we need to escape the * with a backslash):
    book.\*:(quick OR brown)
  • where the field title has any non-null value:
    _exists_:title

The Assemblyline datastore is divided into multiple indexes. Each index has the following fields available for searches. Default fields are copied to the default search field, which is the field queried when you only use basic search. Stored fields are returned by the index when a query is issued.

Important: When using the search engine in the UI, your search query is performed on all these indexes at the same time.

{% for index, fields in field_list.items() %}

{{ index.title() }} index:

Field name    Type    Attributes
{% for field, val in fields %}
{{ field }}    {{ val.type }}    {% if val.stored %}stored {% endif %}{% if val.default %}default {% endif %}{% if val.list %}list{% endif %}
{% endfor %}
{% endfor %}

Wildcards

Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters:

qu?ck bro*

Be aware that wildcard queries can use an enormous amount of memory and perform very badly: just think how many terms need to be queried to match the query string "a* b* c*".

Note: Allowing a wildcard at the beginning of a word (e.g. "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match.
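
To build intuition for the cost described above, the sketch below emulates wildcard matching over a small in-memory term list with Python's standard fnmatch module. It is only an illustration: the term list is made up, and the datastore does not evaluate wildcards this way internally.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed terms, for illustration only.
TERMS = ["quick", "quack", "brown", "brother", "fox", "running", "ring"]


def expand_wildcard(pattern, terms=TERMS):
    """Return every term matched by a pattern that uses ? and * wildcards."""
    # Every term is tested here. A real index can skip terms that do not
    # share the literal prefix of the pattern, which is why a pattern with
    # a leading wildcard (no literal prefix) is the worst case.
    return [term for term in terms if fnmatchcase(term, pattern)]


print(expand_wildcard("qu?ck"))  # ? stands for exactly one character
print(expand_wildcard("bro*"))   # * stands for zero or more characters
print(expand_wildcard("*ing"))   # leading wildcard: no prefix to narrow the scan
```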

Regular Expressions

Regular expression patterns can be embedded in the query string by wrapping them in forward-slashes ("/"):

name:/joh?n(ath[oa]n)/

Warning

Nothing prevents a regular expression from starting with a wildcard. A query string such as the following forces the datastore to visit every term in the index, so use it with caution:

/.*n/

The supported regular expression syntax is the following:

Anchoring

Most regular expression engines allow you to match any part of a string. If you want the regexp pattern to start at the beginning of the string or finish at the end of the string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to indicate the end.

Lucene’s patterns are always anchored. The pattern provided must match the entire string. For string "abcde":

ab.*     # match
abcd     # no match
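
The difference between anchored and unanchored matching can be tried out with Python's re module, used here purely as a stand-in for the datastore's regular expression engine. re.search behaves like an unanchored engine, while re.fullmatch requires the whole string to match, as described above. The two helper names are made up for this sketch.

```python
import re


def matches_anywhere(pattern, text):
    """Unanchored: the pattern may match any part of the text."""
    return re.search(pattern, text) is not None


def matches_anchored(pattern, text):
    """Anchored: the pattern must match the entire text."""
    return re.fullmatch(pattern, text) is not None


# An unanchored engine finds "abcd" inside "abcde".
print(matches_anywhere("abcd", "abcde"))
# Anchored matching fails because the trailing "e" is not covered.
print(matches_anchored("abcd", "abcde"))
# Anchored matching succeeds because ".*" covers "cde".
print(matches_anchored("ab.*", "abcde"))
```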

Allowed characters

Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \

Any reserved character can be escaped with a backslash, for example "\*", including a literal backslash character: "\\"

Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:

john"@smith.com"

Match any character

The period "." can be used to represent any character. For string "abcde":

ab...   # match
a.c.e   # match

One-or-more

The plus sign "+" can be used to repeat the preceding shortest pattern one or more times. For string "aaabbb":

a+b+        # match
aa+bb+      # match
a+.+        # match
aa+bbb+     # match

Zero-or-more

The asterisk "*" can be used to match the preceding shortest pattern zero-or-more times. For string "aaabbb":

a*b*        # match
a*b*c*      # match
.*bbb.*     # match
aaa*bbb*    # match

Zero-or-one

The question mark "?" makes the preceding shortest pattern optional. It matches zero or one times. For string "aaabbb":

aaa?bbb?    # match
aaaa?bbbb?  # match
.....?.?    # match
aa?bb?      # no match

Min-to-max

Curly brackets "{}" can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:

{5}     # repeat exactly 5 times
{2,5}   # repeat at least twice and at most 5 times
{2,}    # repeat at least twice

For string "aaabbb":

a{3}b{3}        # match
a{2,4}b{2,4}    # match
a{2,}b{2,}      # match
.{3}.{3}        # match
a{4}b{4}        # no match
a{4,6}b{4,6}    # no match
a{4,}b{4,}      # no match

Grouping

Parentheses "()" can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string "ababab":

(ab)+       # match
ab(ab)+     # match
(..)+       # match
(...)+      # match
(ab)*       # match
abab(ab)?   # match
ab(ab)?     # no match
(ab){3}     # match
(ab){1,2}   # no match

Alternation

The pipe symbol "|" acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string "aabb":

aabb|bbaa   # match
aacc|bb     # no match
aa(cc|bb)   # match
a+|b+       # no match
a+b+|b+a+   # match
a+(b|c)+    # match

Character classes

Ranges of potential characters may be represented as character classes by enclosing them in square brackets "[]". A leading ^ negates the character class. The allowed forms are:

[abc]   # 'a' or 'b' or 'c'
[a-c]   # 'a' or 'b' or 'c'
[-abc]  # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc]  # any character except 'a' or 'b' or 'c'
[^a-c]  # any character except 'a' or 'b' or 'c'
[^-abc]  # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash "-" indicates a range of characters, unless it is the first character or if it is escaped with a backslash.

For string "abcd":

ab[cd]+     # match
[a-d]+      # match
[^a-d]+     # no match
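
A selection of the examples above can be checked with a short script. Python's re module is not the engine used by the datastore, but for the basic operators covered here it is assumed to behave the same way once the pattern is forced to match the entire string with re.fullmatch. The helper name fully_matches is made up for this sketch.

```python
import re

# (pattern, string, expected result) triples taken from the sections above.
CASES = [
    ("ab...", "abcde", True),
    ("a+b+", "aaabbb", True),
    ("aa+bbb+", "aaabbb", True),
    ("a*b*c*", "aaabbb", True),
    ("aa?bb?", "aaabbb", False),
    ("a{2,4}b{2,4}", "aaabbb", True),
    ("a{4,}b{4,}", "aaabbb", False),
    ("ab(ab)?", "ababab", False),
    ("(ab){1,2}", "ababab", False),
    ("a+|b+", "aabb", False),
    ("aa(cc|bb)", "aabb", True),
    ("[^a-d]+", "abcd", False),
]


def fully_matches(pattern, text):
    """Return True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None


for pattern, text, expected in CASES:
    result = fully_matches(pattern, text)
    # Stop with the offending pattern if an example disagrees.
    assert result == expected, pattern
    print(pattern, text, "match" if result else "no match")
```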

Fuzziness

We can search for terms that are similar to, but not exactly like our search terms, using the “fuzzy” operator:

quikc~ brwn~ foks~

This uses the Damerau-Levenshtein distance to find all terms with a maximum of two changes, where a change is the insertion, deletion or substitution of a single character, or transposition of two adjacent characters.

The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as:

quikc~1
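
To see how the edit distance is counted, here is a minimal sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). It is written for illustration only; the search engine uses its own implementation, and the function name is made up.

```python
def edit_distance(a, b):
    """Count insertions, deletions, substitutions and adjacent transpositions."""
    # dist[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution, or a free match
            )
            swapped = (
                i > 1 and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            )
            if swapped:
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[len(a)][len(b)]


for typo, word in [("quikc", "quick"), ("brwn", "brown"), ("foks", "fox")]:
    print(typo, word, edit_distance(typo, word))
```

With this measure, quikc and brwn are one edit away from quick and brown, while foks is two edits away from fox, so foks~1 would not be expected to find fox.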

Proximity Searches

While a phrase query (e.g. "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. In the same way that fuzzy queries can specify a maximum edit distance for characters in a word, a proximity search allows us to specify a maximum edit distance of words in a phrase:

"fox quick"~5

The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox".

Ranges

Ranges can be specified for date, numeric or string fields. Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}.

  • All days in 2012:
    date:[2012-01-01 TO 2012-12-31]
  • Everything this year: (Using DateMath syntax)
    date:[now/y TO now/y+1y]
  • Everything since the start of the year of a specific date: (Using DateMath syntax)
    date:[2012-06-08||/y TO now]
  • Numbers 1..5
    count:[1 TO 5]
  • Tags between alpha and omega, excluding alpha and omega:
    tag:{alpha TO omega}
  • Numbers from 10 upwards
    count:[10 TO *]
  • Dates before 2012
    date:{* TO 2012-01-01}

Curly and square brackets can be combined:

  • Numbers from 1 up to but not including 5
    count:[1 TO 5}

Ranges with one side unbounded can use the following syntax:

age:>10
age:>=10
age:<10
age:<=10

To combine an upper and lower bound with the simplified syntax, you would need to join two clauses with an AND operator:

age:(>=10 AND <20)
age:(+>=10 +<20)
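
When queries are assembled in code, the bracket rules above can be wrapped in a small helper. The sketch below is a hypothetical convenience function, not part of any Assemblyline API; it only builds the query text.

```python
def range_clause(field, low="*", high="*", include_low=True, include_high=True):
    """Build a range clause; * means the side is unbounded (hypothetical helper)."""
    open_bracket = "[" if include_low else "{"     # square bracket = inclusive
    close_bracket = "]" if include_high else "}"   # curly bracket = exclusive
    return f"{field}:{open_bracket}{low} TO {high}{close_bracket}"


print(range_clause("count", 1, 5))
print(range_clause("count", 1, 5, include_high=False))
print(range_clause("tag", "alpha", "omega", include_low=False, include_high=False))
print(range_clause("count", 10))
```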

DateMath Syntax

The expression starts with an anchor date, which can either be now, or a date string ending with ||. This anchor date can optionally be followed by one or more maths expressions:

  • +1h: Add one hour
  • -1d: Subtract one day
  • /d: Round down to the nearest day

These time units are not the same as the time units used for durations. The supported units are:

y    =>    Years
M    =>    Months
w    =>    Weeks
d    =>    Days
h    =>    Hours
H    =>    Hours
m    =>    Minutes
s    =>    Seconds

Assuming now is 2001-01-01 12:00:00, some examples are:

now+1h             =>  now in milliseconds plus one hour. Resolves to: 2001-01-01 13:00:00
now-1h             =>  now in milliseconds minus one hour. Resolves to: 2001-01-01 11:00:00
now-1h/d           =>  now in milliseconds minus one hour, rounded down to the start of the day. Resolves to: 2001-01-01 00:00:00
2001-02-01||+1M/d  =>  2001-02-01 in milliseconds plus one month, rounded down to the start of the day. Resolves to: 2001-03-01 00:00:00
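
The way these expressions resolve can be reproduced with a small evaluator. This is a simplified sketch written for illustration, not the datastore's parser: it assumes ISO-formatted anchor dates, treats Monday as the start of the week, and does not handle month arithmetic that lands on a missing day (such as January 31 plus one month). The function names are made up.

```python
import re
from datetime import datetime, timedelta

# Fixed-length units mapped to timedelta keyword arguments.
_DELTA_UNITS = {"w": "weeks", "d": "days", "h": "hours", "H": "hours",
                "m": "minutes", "s": "seconds"}
_OPERATION = re.compile(r"([+-])(\d+)([yMwdhHms])|/([yMwdhHms])")


def _add(dt, amount, unit):
    """Add a signed number of units to dt."""
    if unit in ("y", "M"):
        months = dt.month - 1 + amount * (12 if unit == "y" else 1)
        # Simplification: raises ValueError if the day does not exist.
        return dt.replace(year=dt.year + months // 12, month=months % 12 + 1)
    return dt + timedelta(**{_DELTA_UNITS[unit]: amount})


def _round_down(dt, unit):
    """Reset every field smaller than the given unit."""
    if unit == "w":
        dt = dt - timedelta(days=dt.weekday())  # back to Monday (assumption)
        unit = "d"
    fields = ["month", "day", "hour", "minute", "second", "microsecond"]
    keep = {"y": 0, "M": 1, "d": 2, "h": 3, "H": 3, "m": 4, "s": 5}[unit]
    reset = {name: (1 if name in ("month", "day") else 0) for name in fields[keep:]}
    return dt.replace(**reset)


def resolve(expression, now):
    """Resolve a DateMath expression such as now-1h/d to a datetime."""
    if expression.startswith("now"):
        current, rest = now, expression[3:]
    else:
        anchor, _, rest = expression.partition("||")
        current = datetime.fromisoformat(anchor)
    position = 0
    while position < len(rest):
        match = _OPERATION.match(rest, position)
        if match is None:
            raise ValueError("unsupported expression: " + expression)
        if match.group(4):
            current = _round_down(current, match.group(4))
        else:
            sign = 1 if match.group(1) == "+" else -1
            current = _add(current, sign * int(match.group(2)), match.group(3))
        position = match.end()
    return current


now = datetime(2001, 1, 1, 12, 0, 0)
for expression in ["now+1h", "now-1h", "now-1h/d", "2001-02-01||+1M/d"]:
    print(expression, "=>", resolve(expression, now))
```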

Boolean Operators

By default, all terms are optional, as long as one term matches. A search for foo bar baz will find any document that contains one or more of foo or bar or baz. Boolean operators can also be used in the query string itself to provide more control.

The preferred operators are + (this term must be present) and - (this term must not be present). All other terms are optional. For example, this query:

quick brown +fox -news

states that:

  • fox must be present
  • news must not be present
  • quick and brown are optional — their presence increases the relevance

The familiar boolean operators AND, OR and NOT (also written &&, || and !) are also supported, but beware that they do not honor the usual precedence rules, so parentheses should be used whenever multiple operators are used together. For instance, the previous query could be rewritten as:

((quick AND fox) OR (brown AND fox) OR fox) AND NOT news

This form now replicates the logic from the original query correctly, but the relevance scoring bears little resemblance to the original.
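
The claim that both forms select the same documents can be checked by brute force. The sketch below models a document as the set of terms it contains and compares the two forms over every combination of the four terms. It only models which documents match, not how they are scored, and the function names are made up.

```python
from itertools import combinations

TERMS = ["quick", "brown", "fox", "news"]


def plus_minus_form(doc):
    """quick brown +fox -news: fox is required, news is forbidden."""
    # quick and brown are optional, so they do not affect whether doc matches.
    return "fox" in doc and "news" not in doc


def boolean_form(doc):
    """((quick AND fox) OR (brown AND fox) OR fox) AND NOT news"""
    quick, brown, fox, news = (term in doc for term in TERMS)
    return ((quick and fox) or (brown and fox) or fox) and not news


# Every possible document: each subset of the four terms.
all_docs = [
    set(subset)
    for size in range(len(TERMS) + 1)
    for subset in combinations(TERMS, size)
]

same = all(plus_minus_form(doc) == boolean_form(doc) for doc in all_docs)
print("forms agree on all", len(all_docs), "documents:", same)
```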

Grouping

Multiple terms or clauses can be grouped together with parentheses, to form sub-queries:

(quick OR brown) AND fox

Groups can also be used to target a particular field:

status:(active OR pending) title:(+full -"text search")

Reserved Characters

If you need to use any of the characters which function as operators in your query itself (and not as operators), then you should escape them with a leading backslash.

For instance, to search for (1+1)=2, you would need to write your query as:

\(1\+1\)\=2

The reserved characters are:

+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

Failing to escape these special characters correctly could lead to a syntax error which prevents your query from running.

Note: < and > can’t be escaped at all. The only way to prevent them from attempting to create a range query is to remove them from the query string entirely.
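
When user-supplied text has to be searched literally, the escaping rules above can be applied in code. The helper below is a hypothetical sketch: it removes < and >, then puts a backslash in front of every other reserved character. Treating && and || as single units that need only one backslash is an assumption made for this example.

```python
import re

# Reserved characters from the list above; && and || are treated as units.
# < and > are left out of the pattern because they cannot be escaped.
_RESERVED = re.compile(r'&&|\|\||[-+=!(){}\[\]^"~*?:\\/]')


def escape_query_text(text):
    """Escape reserved characters so that text is searched literally."""
    # < and > cannot be escaped, so they are dropped from the text.
    text = text.replace("<", "").replace(">", "")
    return _RESERVED.sub(lambda match: "\\" + match.group(0), text)


print(escape_query_text("(1+1)=2"))
print(escape_query_text("a<b"))
```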

{% endblock %}