Sentry Streams is a distributed platform that, like most streaming platforms, is designed to handle real-time unbounded data streams.
It was built primarily to support the creation of Sentry ingestion pipelines, but the API it provides is fully independent of the Sentry product and can be used to build any streaming application.
The main features are:
Kafka sources and multiple sinks. Ingestion pipelines take data from Kafka and write enriched data into multiple data stores.
Dataflow API support. This allows streaming applications to be written in terms of application logic and pipeline topology rather than the underlying dataflow engine.
Support for stateful and stateless transformations. State storage is provided by the platform rather than being part of the application (see the sketch after this list).
Distributed execution. The primitives used to build the application can be distributed across multiple nodes through configuration.
Kafka details, such as commit policy and topic partitioning, are hidden from the application.
Out-of-the-box support for streaming application best practices: DLQ, monitoring, health checks, etc.
Support for Rust and Python applications.
Support for multiple runtimes.
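To make the stateful transformation point above concrete, here is a minimal, purely illustrative plain-Python sketch of the kind of per-key windowed state such a transformation needs to keep; none of these names come from the Sentry Streams API. When you use the platform's stateful primitives, this state is held and managed by the platform instead of living in your application code.

from collections import defaultdict
from typing import DefaultDict, Tuple

# Illustrative only: count events per key over one-minute tumbling windows.
# This mutable dict is the kind of state the platform manages for you.
WINDOW_SECONDS = 60
counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)


def count_event(key: str, timestamp: float) -> int:
    # Bucket the event into its tumbling window and bump the counter.
    window = int(timestamp // WINDOW_SECONDS)
    counts[(key, window)] += 1
    return counts[(key, window)]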
Design principles
This streaming platform, in the context of Sentry ingestion, is designed with a few principles in mind:
Fully self-service, to reduce the time to reach production when building pipelines.
Abstract infrastructure aspects away (Kafka, delivery guarantees, schemas, scale, etc.) to improve stability and scalability.
Opinionated in the abstractions provided to build ingestion, to push for best practices and hide the inner workings of streaming applications.
Treat the pipeline as a system for tuning, capacity management, and understanding the architecture.
Getting Started
In order to build a streaming application and run it on top of the Sentry Arroyo runtime, follow these steps:
Run a Kafka broker locally.
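One convenient option, assuming Docker is installed, is the official Apache Kafka image; any local broker listening on localhost:9092 works just as well.
docker run -p 9092:9092 apache/kafka:latest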
Create a new Python project and a dev environment.
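For example, with a standard virtual environment (any Python environment manager works):
python -m venv .venv
source .venv/bin/activate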
Install sentry_streams:
pip install sentry_streams
Create a new Python module for your streaming application:
from json import JSONDecodeError, dumps, loads
from typing import Any, Mapping, cast

from sentry_streams.pipeline import Filter, Map, streaming_source


def parse(msg: str) -> Mapping[str, Any]:
    try:
        parsed = loads(msg)
    except JSONDecodeError:
        return {"type": "invalid"}

    return cast(Mapping[str, Any], parsed)


def filter_not_event(msg: Mapping[str, Any]) -> bool:
    return bool(msg["type"] == "event")


pipeline = (
    streaming_source(
        name="myinput",
        stream_name="events",
    )
    .apply("mymap", Map(function=parse))
    .apply("myfilter", Filter(function=filter_not_event))
    .apply("serializer", Map(function=lambda msg: dumps(msg)))
    .sink(
        "myoutput",
        stream_name="transformed-events",
    )
)
This is a simple pipeline that takes a stream of JSON messages, parses them, filters out the ones that are not events, serializes them back to JSON, and produces the result to another topic.
Run the pipeline:
SEGMENT_ID=0 python -m sentry_streams.runner \
-n Batch \
--config sentry_streams/deployment_config/<YOUR CONFIG FILE>.yaml \
--adapter arroyo \
<YOUR PIPELINE FILE>
Produce events on the events topic and consume them from the transformed-events topic.
echo '{"type": "event", "data": {"foo": "bar"}}' | kcat -b localhost:9092 -P -t events
kcat -b localhost:9092 -G test transformed-events
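If you prefer producing test messages from Python rather than kcat, here is a minimal sketch using the confluent_kafka client; that client is an assumption of this example, not a requirement of Sentry Streams.

from json import dumps

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
# Send one test event to the topic the pipeline reads from.
producer.produce(
    "events",
    value=dumps({"type": "event", "data": {"foo": "bar"}}).encode("utf-8"),
)
producer.flush()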
Look for more examples in the sentry_streams/examples folder of the repository.