================
Snuba Data Model
================

This section explains how data is organized in Snuba and how user-facing
data is mapped to the underlying database (ClickHouse in this case).

The Snuba data model is divided horizontally into a **logical model** and
a **physical model**. The logical data model is what is visible to the Snuba
clients through the Snuba query language. Elements in this model may or may
not map 1:1 to tables in the database. The physical model, instead, maps 1:1
to database concepts (like tables and views).

The reasoning behind this division is that it allows Snuba to expose a
stable interface through the logical data model and perform complex mapping
internally to execute a query on different tables (part of the physical
model) to improve performance in a way that is transparent to the client.

The rest of this section outlines the concepts that compose the two models
and how they are connected to each other.

The main concepts, described below, are dataset, entity and storage.

.. image:: /_static/architecture/datamodel.png

Datasets
========

A Dataset is a name space over Snuba data. It provides its own schema and
it is independent from other datasets both in terms of logical model and
physical model.

Examples of datasets are discover, outcomes and sessions. There is no
relationship between them.

A Dataset can be seen as a container for the components that define its
logical data model and its physical data model, both described below.

In terms of query language, every Snuba query targets one and only one
Dataset, and the Dataset can provide extensions to the query language.

Entities and Entity Types
=========================

The fundamental block of the logical data model Snuba exposes to the client
is the Entity. In the logical model an entity represents an instance of an
abstract concept (like a transaction or an error). In practice an *Entity*
corresponds to a row in a table in the database.  The *Entity Type* is the
class of the Entity (like Error\ **s** or Transaction\ **s**).

The logical data model is composed of a set of *Entity Types* and their
relationships.

Each *Entity Type* has a schema which is defined by a list of fields with
their associated abstract data types. The schemas of all the *Entity Types*
of a Dataset (there can be several) compose the logical data model that is
visible to the Snuba client and against which Snuba Queries are validated.
No lower level concept is supposed to be exposed.

Each Entity Type belongs to exactly one Dataset. An Entity Type cannot
be present in multiple Datasets.
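
As an illustration only, the containment and validation rules above can be
sketched with plain Python dataclasses. The names used here (``EntityType``,
``Dataset``, ``add_entity_type``, ``unknown_fields``) and the abstract type
names are hypothetical and do not reflect Snuba's actual classes.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional


    @dataclass
    class EntityType:
        """Hypothetical Entity Type: a name and a field -> abstract type schema."""

        name: str
        schema: Dict[str, str]
        dataset: Optional[str] = None  # set once, when a Dataset claims it


    @dataclass
    class Dataset:
        """Hypothetical Dataset: a namespace that owns its Entity Types."""

        name: str
        entity_types: Dict[str, EntityType] = field(default_factory=dict)

        def add_entity_type(self, entity_type: EntityType) -> None:
            # An Entity Type cannot be present in multiple Datasets.
            if entity_type.dataset is not None:
                raise ValueError(
                    f"{entity_type.name} already belongs to {entity_type.dataset}"
                )
            entity_type.dataset = self.name
            self.entity_types[entity_type.name] = entity_type

        def unknown_fields(self, entity_name: str, fields: List[str]) -> List[str]:
            """Return the queried fields that the logical schema does not define."""
            schema = self.entity_types[entity_name].schema
            return [f for f in fields if f not in schema]

In this sketch a query is validated only against the logical schema: a query
on ``errors`` that references a field missing from the schema is reported by
``unknown_fields`` without any knowledge of the underlying tables.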

Relationships between Entity Types
----------------------------------

Entity Types in a Dataset are logically related. There are two types of
relationships we support:

- Entity Set Relationship. This mimics foreign keys. This relationship is
  meant to allow joins between Entity Types. It only supports one-to-one
  and one-to-many relationships at this point in time.
- Inheritance Relationship. This mimics nominal subtyping. A group of Entity
  Types can share a parent Entity Type. Subtypes inherit the schema from the
  parent type. Semantically the parent Entity Type must represent the union
  of all the Entities whose types inherit from it. It must also be possible
  to query the parent Entity Type, so this cannot be just a logical relationship.
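
The two relationship kinds can be sketched as follows. This is a minimal
model under stated assumptions, not Snuba's implementation: ``Cardinality``,
``EntityType``, ``EntitySetRelationship`` and all field names are hypothetical.
Inheritance is modelled by merging the parent schema into the subtype schema,
and the join relationship only accepts the two supported cardinalities.

.. code-block:: python

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Dict, Optional


    class Cardinality(Enum):
        # Only the cardinalities the text says are supported.
        ONE_TO_ONE = "one-to-one"
        ONE_TO_MANY = "one-to-many"


    @dataclass
    class EntityType:
        name: str
        own_fields: Dict[str, str] = field(default_factory=dict)
        parent: Optional["EntityType"] = None  # Inheritance Relationship

        def schema(self) -> Dict[str, str]:
            """Subtypes inherit the parent's schema and add their own fields."""
            inherited = self.parent.schema() if self.parent else {}
            return {**inherited, **self.own_fields}


    @dataclass
    class EntitySetRelationship:
        """Foreign-key-like link that allows joining two Entity Types."""

        left: EntityType
        right: EntityType
        left_key: str
        right_key: str
        cardinality: Cardinality

        def __post_init__(self) -> None:
            # Both join keys must exist in the logical schemas.
            if self.left_key not in self.left.schema():
                raise ValueError(f"{self.left.name} has no field {self.left_key}")
            if self.right_key not in self.right.schema():
                raise ValueError(f"{self.right.name} has no field {self.right_key}")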

Entity Type and consistency
---------------------------

The Entity Type is the largest unit where Snuba **can** provide some strong
data consistency guarantees. Specifically it is possible to query an Entity
Type expecting Serializable Consistency (please don't use that. Seriously,
if you think you need that, you probably don't). This does not extend to
any query that spans multiple Entity Types where, at best, we will have
eventual consistency.

This also has an impact on Subscription queries. These can only work on one
Entity Type at a time since, otherwise, they would require consistency between
Entity Types, which we do not support.

.. ATTENTION::
    To be precise, the unit of consistency (depending on the Entity Type)
    can be even smaller and depend on how the data ingestion topics
    are partitioned (by project_id, for example). The Entity Type is the
    maximum Snuba allows. More details are (ok, will be) provided in
    the Ingestion section of this guide.

Storage
=======

Storages represent and define the physical data model of a Dataset. Each
Storage is materialized in a physical database concept like a table or a
materialized view. As a consequence, each Storage has a schema, defined by
fields with their types, that reflects the physical schema of the DB
table/view the Storage maps to. The Storage is also able to provide all the
details needed to generate the DDL statements that build the tables on the
database.

Storages are able to map the logical concepts in the logical model discussed
above to the physical concepts of the database, thus each Storage needs to be
related to an Entity Type. Specifically:

- Each Entity Type must be backed by at least one Readable Storage (a Storage
  we can run queries on), but can be backed by multiple Storages (for example a
  pre-aggregated materialized view). Multiple Storages per Entity Type are meant
  to allow query optimizations.
- Each Entity Type must be backed by one and only one Writable
  Storage that is used to ingest data and fill in the database tables.
- Each Storage backs exactly one Entity Type.
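
The three rules above can be expressed as a small consistency check. This is
a sketch under stated assumptions: ``Storage``, ``check_storages`` and the
storage names in the usage note are hypothetical and are not part of Snuba's
API. The third rule is enforced structurally, because each ``Storage`` holds a
single ``entity_type``.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List


    @dataclass(frozen=True)
    class Storage:
        name: str
        entity_type: str  # a Storage backs exactly one Entity Type
        readable: bool = True
        writable: bool = False


    def check_storages(entity_types: List[str], storages: List[Storage]) -> List[str]:
        """Return one message per violated rule; an empty list means valid."""
        problems = []
        for entity_type in entity_types:
            backing = [s for s in storages if s.entity_type == entity_type]
            # Rule 1: at least one Readable Storage.
            if not any(s.readable for s in backing):
                problems.append(f"{entity_type}: no readable storage")
            # Rule 2: one and only one Writable Storage.
            writable = sum(1 for s in backing if s.writable)
            if writable != 1:
                problems.append(
                    f"{entity_type}: expected exactly one writable storage, "
                    f"found {writable}"
                )
        return problems

For example, an Entity Type backed by a writable raw table and a read-only
pre-aggregated view passes the check, while removing the raw table reports a
missing writable storage.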


Examples
========

This section provides some examples of how the Snuba data model can represent
some real world models.

These case studies do not necessarily reflect the current Sentry production
model, nor are they part of the same deployment. They have to be considered as
examples taken in isolation.

Single Entity Dataset
---------------------

This looks like the Outcomes dataset used by Sentry. It does not actually
reflect Outcomes as of April 2020, but it is the design Outcomes should
move towards.

.. image:: /_static/architecture/singleentity.png

This Dataset has only one Entity Type, which represents an individual Outcome
ingested by the Dataset. Querying raw Outcomes is painfully slow, so we have
two Storages: the Raw storage, which reflects the data we ingest, and a
materialized view that computes hourly aggregations, which are much more
efficient to query. The Query Planner picks the storage depending on whether
the query can be executed on the aggregated data or not.
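
The storage selection can be sketched as below. The criterion used here (a
query can run on the aggregated data when every column it references exists
in the aggregated view) is an assumption made for illustration, as are the
function name ``pick_storage``, the storage names and the column names; the
actual Query Planner logic is not described in this document.

.. code-block:: python

    from typing import Set

    # Hypothetical columns exposed by the hourly materialized view.
    HOURLY_COLUMNS = {"org_id", "project_id", "outcome", "hour", "times_seen"}


    def pick_storage(query_columns: Set[str], aggregated_columns: Set[str]) -> str:
        """Prefer the aggregated storage when it can answer the whole query."""
        if query_columns <= aggregated_columns:
            return "outcomes_hourly"
        # Fall back to the raw storage, which holds every ingested field.
        return "outcomes_raw"

The client issues the same query against the Outcomes Entity Type in both
cases; only the storage chosen internally differs.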

Multi Entity Type Dataset
-------------------------

The canonical example of this Dataset is the Discover dataset.

.. image:: /_static/architecture/multientity.png

This has three Entity Types: Errors, Transactions and Events. Errors and
Transactions both inherit from Events. These form the logical data model,
thus querying the Events Entity Type gives the union of Transactions and
Errors, but only fields common to the two can be present in the query.
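
A minimal sketch of this behaviour, using in-memory rows instead of database
tables: the Events schema is taken as the fields shared by both subtypes, and
a query on Events returns rows from both while rejecting any other field.
The helper names (``common_schema``, ``query_events``) and the schemas in the
usage note are hypothetical.

.. code-block:: python

    from typing import Dict, List

    Schema = Dict[str, str]
    Row = Dict[str, object]


    def common_schema(first: Schema, *others: Schema) -> Schema:
        """Keep the fields that every schema defines with the same type."""
        return {
            name: type_
            for name, type_ in first.items()
            if all(other.get(name) == type_ for other in others)
        }


    def query_events(
        columns: List[str],
        events_schema: Schema,
        errors: List[Row],
        transactions: List[Row],
    ) -> List[Row]:
        """Select columns from the union of Errors and Transactions."""
        unknown = [c for c in columns if c not in events_schema]
        if unknown:
            raise ValueError(f"not available on Events: {unknown}")
        return [{c: row[c] for c in columns} for row in errors + transactions]

With an Errors schema that adds ``message`` and a Transactions schema that
adds ``duration`` to a shared ``event_id`` and ``timestamp``, a query on
Events may select ``event_id`` but not ``duration``.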

The Errors Entity Type is backed by two Storages for performance reasons.
One is the main Errors Storage that is used to ingest data; the other is a
read-only view that puts less load on ClickHouse when querying but
offers lower consistency guarantees. Transactions have only one Storage,
and there is a Merge Table to serve Events (which is essentially a view over
the union of the two tables).

Joining Entity Types
--------------------

This is a simple example of a dataset that includes multiple Entity Types
that can be joined together in a query.

.. image:: /_static/architecture/joins.png

GroupedMessage and GroupAssignee can be part of a left join query with Errors.
The rest is similar to what was discussed in the previous examples.
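
To make the left join semantics concrete, here is a small in-memory sketch.
It is not how Snuba executes joins. The function ``left_join``, the join keys
(``group_id`` on Errors, ``id`` on GroupedMessage) and the column prefix are
assumptions chosen for the example.

.. code-block:: python

    from typing import Dict, List

    Row = Dict[str, object]


    def left_join(
        left_rows: List[Row],
        right_rows: List[Row],
        left_key: str,
        right_key: str,
        prefix: str,
    ) -> List[Row]:
        """Keep every left row; add prefixed right columns when a match exists."""
        index: Dict[object, List[Row]] = {}
        for row in right_rows:
            index.setdefault(row[right_key], []).append(row)

        joined = []
        for left in left_rows:
            # An unmatched left row is emitted once, without right columns.
            for right in index.get(left[left_key]) or [{}]:
                merged = dict(left)
                for column, value in right.items():
                    merged[f"{prefix}.{column}"] = value
                joined.append(merged)
        return joined

Joining Errors rows to GroupedMessage rows this way keeps errors whose group
has no matching row, which is the property that distinguishes a left join
from an inner join.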