Snuba Data Slicing (under development)

This feature is under active development and is subject to change

To support a higher volume of data, we are building out support for datasets and storages that share a single schema but span multiple physical resources (Kafka clusters, Redis instances, Postgres databases, ClickHouse clusters, etc.). Across Sentry, data records will have a logical partition assignment based on the data's organization_id. In Snuba, we maintain a mapping of logical partitions to physical slices in settings.LOGICAL_PARTITION_MAPPING.

In a future revision, settings.LOGICAL_PARTITION_MAPPING will be used along with settings.SLICED_STORAGE_SETS to route queries and incoming consumer data to different ClickHouse clusters via a (StorageSetKey, slice_id) pair defined in configuration.

Configuring a slice

Mapping logical partitions to physical slices

To add a slice to a storage set's logical:physical mapping (that is, to repartition), increment the slice count for the relevant storage set in settings.SLICED_STORAGE_SETS, then update that storage set's logical partition assignments in settings.LOGICAL_PARTITION_MAPPING. Every logical partition must be assigned to a slice, and valid slice values are in the range [0, settings.SLICED_STORAGE_SETS[storage_set]).
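For illustration, a minimal sketch of what these settings could look like, assuming a hypothetical generic_metrics_sets storage set with 256 logical partitions split across two slices (the names and counts are examples, not a prescribed configuration):

# Two slices for a hypothetical generic_metrics_sets storage set.
SLICED_STORAGE_SETS = {
    "generic_metrics_sets": 2,
}

# Every logical partition must map to a slice in the range [0, 2).
LOGICAL_PARTITION_MAPPING = {
    "generic_metrics_sets": {
        **{p: 0 for p in range(0, 128)},    # logical partitions 0-127 -> slice 0
        **{p: 1 for p in range(128, 256)},  # logical partitions 128-255 -> slice 1
    },
}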

Defining ClickHouse clusters in a sliced environment

Given a storage set, there can be three different cases:

  1. The storage set is not sliced

  2. The storage set is sliced and no mega-cluster is needed

  3. The storage set is sliced and a mega-cluster is needed

A mega-cluster is needed when partial data may reside on different sliced ClickHouse clusters. This can happen, for example, when a logical partition:slice mapping changes: writes of new data are routed to the new slice, but reads must span multiple clusters. Because queries now need to work across different slices, a mega-cluster query node is required.

For each of the cases above, different types of ClickHouse cluster configuration will be needed.

For case 1, we simply define clusters as per usual in settings.CLUSTERS.

For cases 2 and 3:

To add a sliced cluster with an associated (storage set key, slice) pair, add cluster definitions to settings.SLICED_CLUSTERS in the desired environment's settings, following the same structure as the regular cluster definitions in settings.CLUSTERS. In the storage_set_slices field, sliced storage sets should be added in the form (StorageSetKey, slice_id), where slice_id is in the range [0, settings.SLICED_STORAGE_SETS[storage_set]) for the relevant StorageSetKey.
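As a sketch only, a sliced cluster definition could look like the following, with two single-node clusters serving slices 0 and 1 of a hypothetical generic_metrics_sets storage set; the hosts, ports, and credentials are placeholders, and the remaining fields should mirror whatever your settings.CLUSTERS entries use:

SLICED_CLUSTERS = [
    {
        "host": "clickhouse-slice-0",  # placeholder host for slice 0
        "port": 9000,
        "user": "default",
        "password": "",
        "database": "default",
        "http_port": 8123,
        "storage_set_slices": {("generic_metrics_sets", 0)},
        "single_node": True,
    },
    {
        "host": "clickhouse-slice-1",  # placeholder host for slice 1
        "port": 9000,
        "user": "default",
        "password": "",
        "database": "default",
        "http_port": 8123,
        "storage_set_slices": {("generic_metrics_sets", 1)},
        "single_node": True,
    },
]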

Preparing the storage for sharding

A storage that should be sharded requires setting the partition key column that will be used to calculate the logical partition and, ultimately, the slice ID that determines where the destination data is queried.

This is done with the partition_key_column_name property in the storage schema (sharded storages are not supported for non-YAML based entities). You can see an example of sharding by organization_id in the generic_metrics_sets and generic_metrics_distributions dataset YAML files.
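Conceptually, the value of the partition key column is reduced (for example, by a modulo or hash) to a logical partition, which the mapping described above then resolves to a slice. The helper names, modulo scheme, and partition count below are illustrative rather than the exact implementation:

NUM_LOGICAL_PARTITIONS = 256  # illustrative logical partition count

# Illustrative mapping: logical partitions 0-127 -> slice 0, 128-255 -> slice 1.
LOGICAL_PARTITION_MAPPING = {
    "generic_metrics_sets": {p: (0 if p < 128 else 1) for p in range(256)},
}

def map_to_logical_partition(partition_key_value: int) -> int:
    # partition_key_value is read from the column named by
    # partition_key_column_name (e.g. an organization_id).
    return partition_key_value % NUM_LOGICAL_PARTITIONS

def map_to_slice(storage_set: str, logical_partition: int) -> int:
    return LOGICAL_PARTITION_MAPPING[storage_set][logical_partition]

# organization_id 1234 -> logical partition 210 -> slice 1
slice_id = map_to_slice("generic_metrics_sets", map_to_logical_partition(1234))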

Adding sliced Kafka topics

In order to define a "sliced" Kafka topic, add (default logical topic name, slice id) to settings.SLICED_KAFKA_TOPIC_MAP. This tuple should map to a custom physical topic name of the form logical_topic_name-slice_id. Make sure to add the corresponding broker configuration details to settings.SLICED_KAFKA_BROKER_CONFIG, using the same (default logical topic name, slice id) tuple as the key and the broker configuration as the value.

Example configurations:

SLICED_KAFKA_TOPIC_MAP = {("snuba-generic-metrics", 1): "snuba-generic-metrics-1"}

SLICED_KAFKA_BROKER_CONFIG = {("snuba-generic-metrics", 1): BROKER_CONFIG}
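The BROKER_CONFIG referenced above is an ordinary Kafka broker configuration mapping for the cluster backing that slice; a minimal sketch, with a placeholder broker address:

BROKER_CONFIG = {
    "bootstrap.servers": "kafka-slice-1:9092",  # placeholder broker for slice 1
}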

The following types of topics can be "sliced": raw topics, replacements topics, commit log topics, and subscription scheduler topics. Note that the slicing boundary stops at this point: the results topics for subscriptions cannot be sliced.

Working in a Sliced Environment

Starting a sliced consumer

First, ensure that your slicing configuration is set up properly: SLICED_STORAGE_SETS, SLICED_CLUSTERS, SLICED_KAFKA_TOPIC_MAP, and SLICED_KAFKA_BROKER_CONFIG. See above for details.

Start up the Snuba consumer as usual, adding the --slice-id flag set to the slice number you are reading from.
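For example, a hypothetical invocation for a consumer reading from slice 1 (the storage and consumer group names are placeholders; all flags other than --slice-id follow your usual consumer invocation):

snuba consumer --storage generic_metrics_distributions_raw --consumer-group generic-metrics-consumers --slice-id 1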

TODO: handling subscriptions, scheduler and executor, etc.