The SnQL query language

This document describes the Snuba Query Language (SnQL). For more details on how to actually send a query to Snuba see Querying Snuba.

This is the query structure.:

MATCH simple | join | subquery
SELECT [expressions] | [aggregations BY expressions]
ARRAY JOIN [column]
WHERE condition [[AND | OR] condition]*
HAVING condition [[AND | OR] condition]*
ORDER BY expression ASC|DESC [, expression ASC|DESC]*
LIMIT n BY [expressions]
LIMIT n
OFFSET n
GRANULARITY n
TOTALS boolean

These queries are sent as strings to the /:dataset/snql endpoint encoded in a JSON body of the form below.:

{
    "query": "<query>",
    "dataset": "<dataset>",
    "consistent": bool,
    "turbo": bool,
    "debug": bool,
}

The dataset is implied through the url used for the query. All of the fields except for query are optional in the JSON body.

MATCH

Our data model is represented by a graph of entities. This clause identifies the pattern of the subgraphs we are querying on. There are three types of MATCH clause that are currently supported:

Simple:

MATCH (<entity> [SAMPLE n])

Or for datasets without entities:

MATCH (STORAGE(<storage>) [SAMPLE n])

This is equivalent to all of our current queries. This either queries data from a single entity or a single storage directly. It is possible to add an optional sample to the query by adding it with the entity.

Example:

MATCH (events) # for entity queries
MATCH (STORAGE(profile_chunks)) # for storage queries.

Subquery:

MATCH { <query> }

Inside the curly braces can be another SnQL query in its entirety. Anything in the SELECT/BY clause of the subquery will be exposed in the outer query using the aliases specified.

Example:

MATCH {
    MATCH (transactions)
    SELECT avg(duration) AS avg_d BY transaction
}
SELECT max(avg_d)

Join:

MATCH (<alias>: <entity> [SAMPLE n]) -[<join>]-> (<alias>: <entity> [SAMPLE n])

A join represents a multi node subgraph is a subgraph that includes multiple relationships between different nodes. We only support 1..n, n..1 and 1..1 directed relationships between nodes.

With JOINs every entity must have an alias, which is a unique string. Sampling can also be applied to any of the entities in the join. The <join> is a string that is specified in the Entity in Snuba, and is a short hand for a set of join conditions. It’s possible to have more than one join clause, separated by commas.

Example:

MATCH
    (e: events) -[grouped]-> (g: groupedmessage),
    (e: events) -[assigned]-> (a: groupassignee)
SELECT count() AS tot BY e.project_id, g.id
WHERE a.user_id = "somebody"

The type of join (left/inner) and the join key are part of the data model and not part of the query. They are hard coded in the entity code. This is because not entity can be safely joined with any other entity in the distributed version of the underlying database.

The tuples provided by the match clause to the where clause look exactly like the ones produced by conventional join clause.:

[
    {"e.project_id": 1,  "g.id": 10}
    {"e.project_id": 1,  "g.id": 11}
    {"e.project_id": 2,  "g.id": 20}
    ...
]

SELECT .. BY

This clause specifies which results should be returned in the output. If there is an aggregation, when everything in the BY clause is treated as a grouping key. It is possible to have aggregations without a BY clause if we want to aggregate across the entire result set, but, in such case, nothing other than the aggregation can be in the SELECT. It’s not valid to have an empty SELECT clause, even if there is a BY clause.

Expressions in the SELECT clause can be columns, arithmetic, functions or any combination of the three. If the query is a join, then each column must have a qualifying alias that matches one of the entity aliases in the MATCH clause.

WHERE

This is the filter of the query that happens before aggregations (like the WHERE in SQL).

Conditions are infix expressions of the form LHS OP RHS*, where LHS and RHS are literal values or expressions. OP refers to a specific operator to compare the two values. These operators are one of =, !=, <, <=, >, >=, IN, NOT IN, LIKE, NOT LIKE, IS NULL, IS NOT NULL. Note that the RHS is optional when using an operator like IS NULL.

Conditions can be combined using the boolean keywords AND or OR. They can also be grouped using ().

Some conditions will be mandatory to provide a valid query depending on the entity. For example the Transactions entity requires a project id condition and a time range condition.

HAVING

Works like the WHERE clause but it is applied after the aggregations declared in the SELECT clause. So we can apply conditions on the result of an aggregation function here.

ORDER BY

Specify the expression(s) to order the result set on.

LIMIT BY/LIMIT/OFFSET

Pretty self explanatory, they take integers and set the corresponding values in the Clickhouse query. If a query doesn’t specify the limit or offset, they will be defaulted to 1000 and 0 respectively.

GRANULARITY

An integer representing the granularity to group time based results.

Some of the entities in Snuba provides a magic column that you can use to group data by. The column gives a floored time value for each row so that rows in the same minute/hour/day/etc. can be grouped.

The magic column for a given entity can be found by finding the TimeSeriesProcessor for the entity. Example, for errors, you can find the TimeSeriesProcessor defined here. You can see that the magic column is time and it uses the timestamp column for grouping.

Granularity determines the number of seconds in each of these time buckets. Eg, to count the number of events by hour, you would do

Example:

MATCH(events) count(event_id) AS event_count
BY time
WHERE timestamp >= toDateTime('2022-01-15T00:00:00.000000') AND timestamp < toDateTime('2022-01-21T00:00:00.000000')
GRANULARITY 3600

TOTALS

If set to True, the response from Snuba will have a "totals" key that contains the total values across all the selected rows.

SAMPLE

If a sampling rate isn’t provided by a node in the MATCH clause, then it can be specified here. In this case, Snuba will assign the sample right to one of the nodes in the query. A sample can be either a float between 0 and 1, representing a percentage of rows to sample.

Or it can be an integer greater 1 which represents the number of rows to sample.