relay_server/lib.rs

//! The Sentry relay server application.
//!
//! This module contains the [`run`] function which starts the relay server. It responds on
//! multiple supported endpoints, serves queries from downstream relays, and sends received events
//! to the upstream.
//!
//! See the [`Config`] documentation for more information on configuration options.
//!
//! # Path of an Event through Relay
//!
//! ## Overview
//!
//! Simplified overview of event ingestion (ignores snuba/postprocessing):
//!
//! ```mermaid
//! graph LR
//!
//! loadbalancer(Load Balancer)
//! relay(Relay)
//! projectconfigs("Project config endpoint (in Sentry)")
//! ingestconsumer(Ingest Consumer)
//! outcomesconsumer(Outcomes Consumer)
//! preprocess{"<code>preprocess_event</code><br>(just a function call now)"}
//! process(<code>process_event</code>)
//! save(<code>save_event</code>)
//!
//! loadbalancer-->relay
//! relay---projectconfigs
//! relay-->ingestconsumer
//! relay-->outcomesconsumer
//! ingestconsumer-->preprocess
//! preprocess-->process
//! preprocess-->save
//! process-->save
//!
//! ```
//!
//! ## Processing enabled vs not?
//!
//! Relay can run as part of a Sentry installation, such as within `sentry.io`'s
//! infrastructure, or next to the application as a forwarding proxy. Many of the
//! steps described here are skipped or run in a limited form when Relay is *not*
//! running with processing enabled:
//!
//! *  Event normalization does fewer things.
//!
//! *  In certain modes, project config is not fetched from Sentry at all (but
//!    rather from disk or filled out with defaults).
//!
//! *  Events are forwarded to an HTTP endpoint instead of being written to Kafka.
//!
//! *  Rate limits are not calculated using Redis; instead, Relay just honors `429`s
//!    from the previously mentioned endpoint.
//!
//! *  Filters are not applied at all.
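//!
//! For intuition, the split boils down to a switch on the processing flag. The following is a
//! minimal, purely illustrative sketch; the types and names are made up for this example and
//! are not Relay's actual API:
//!
//! ```ignore
//! // Illustrative only: choose where accepted events go based on the processing flag.
//! enum EventSink {
//!     /// Processing enabled: events and outcomes are written to Kafka.
//!     Kafka { topic: String },
//!     /// Processing disabled: envelopes are forwarded to the upstream HTTP endpoint.
//!     HttpForward { upstream_url: String },
//! }
//!
//! fn choose_sink(processing_enabled: bool) -> EventSink {
//!     if processing_enabled {
//!         EventSink::Kafka { topic: "ingest-events".to_owned() }
//!     } else {
//!         EventSink::HttpForward { upstream_url: "https://upstream.example/".to_owned() }
//!     }
//! }
//! ```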
//!
//! ## Inside the endpoint
//!
//! When an SDK hits `/api/X/store` on Relay, the code in
//! `server/src/endpoints/store.rs` is called before returning an HTTP response.
//!
//! That code looks into an in-memory cache to answer basic questions about a project, such as:
//!
//! *  Does it exist? Is it suspended/disabled?
//!
//! *  Is it rate limited right now? If so, which key is rate limited?
//!
//! *  Which DSNs are valid for this project?
//!
//! Some of the data for this cache comes from the [projectconfigs
//! endpoint](https://github.com/getsentry/sentry/blob/c868def30e013177383f8ca5909090c8bdbd8f6f/src/sentry/api/endpoints/relay_projectconfigs.py).
//! It is refreshed every couple of minutes, depending on configuration (`project_expiry`).
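//!
//! As a mental model, each cache entry remembers when it was fetched and is only trusted while
//! it is younger than `project_expiry`. A minimal sketch with made-up types (Relay's real
//! project cache is more involved):
//!
//! ```ignore
//! use std::time::{Duration, Instant};
//!
//! // Illustrative only: a cached project entry with a freshness check.
//! struct CachedProjectState {
//!     fetched_at: Instant,
//!     disabled: bool,
//!     // ... rate limits, valid DSNs, and so on
//! }
//!
//! impl CachedProjectState {
//!     fn is_fresh(&self, project_expiry: Duration) -> bool {
//!         self.fetched_at.elapsed() < project_expiry
//!     }
//! }
//! ```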
//!
//! If the cache is fresh, we may return a `429` for rate limits or a `4xx` for
//! invalid auth information.
//!
//! That cache might be empty or stale. If that is the case, Relay does not
//! actually attempt to populate it at this stage. **It just returns a `200` even
//! though the event might be dropped later.** This implies:
//!
//! *  The first store request that runs into a rate limit doesn't actually result
//!    in a `429`, but a subsequent request will (because by that time the project
//!    cache will have been updated).
//!
//! *  A store request for a non-existent project may result in a `200`, but
//!    subsequent ones will not.
//!
//! *  A store request with wrong auth information may result in a `200`, but
//!    subsequent ones will not.
//!
//! *  Filters are also not applied at this stage, so **a filtered event will
//!    always result in a `200`**. This has matched the Python behavior for [a while
//!    now](https://github.com/getsentry/sentry/pull/14561).
//!
//! These examples assume that a project receives one event at a time. In practice
//! one may observe that a highly concurrent burst of store requests for a single
//! project results in `200 OK`s only. However, a multi-second flood of incoming
//! events quickly converges to the correct status codes.
//!
//! The response is completed at this point. All expensive work (such as talking to
//! external services) is deferred into a background task. Except for responding to
//! the HTTP request, there is no I/O done in the endpoint in any form. We didn't
//! even hit Redis to calculate rate limits.
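//!
//! Putting the fast path together, the decision looks roughly like the following sketch. All
//! types here are made up for illustration and are not Relay's actual code:
//!
//! ```ignore
//! // Illustrative only: the endpoint answers purely from cached state, without any I/O.
//! struct CachedProject {
//!     fresh: bool,
//!     rate_limited: bool,
//!     valid_auth: bool,
//! }
//!
//! enum StoreResponse {
//!     RateLimited,                   // 429: a cached rate limit is still active
//!     InvalidAuth,                   // 4xx: cached state says the key/DSN is invalid
//!     Accepted { event_id: String }, // 200: queue the event for asynchronous processing
//! }
//!
//! fn fast_path(cached: Option<&CachedProject>, event_id: String) -> StoreResponse {
//!     match cached {
//!         // Fresh cache: reject with the correct status code right away.
//!         Some(p) if p.fresh && p.rate_limited => StoreResponse::RateLimited,
//!         Some(p) if p.fresh && !p.valid_auth => StoreResponse::InvalidAuth,
//!         // Missing or stale cache: accept optimistically; the background task decides later.
//!         _ => StoreResponse::Accepted { event_id },
//!     }
//! }
//! ```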
//!
//! ### Summary
//!
//! The HTTP response returned is just a best-effort guess at what the actual
//! outcome of the event is going to be. We only return a `4xx` code if we already
//! know (based on cached information) that the event will be rejected; otherwise we
//! return a `200` and continue to process the event asynchronously. This asynchronous
//! processing used to happen synchronously in the Python implementation of
//! `StoreView`.
//!
//! The effect of this is that the server responds much faster than before, but
//! we might return a `200` for events that will ultimately not be accepted.
//!
//! Generally, Relay returns a `200` in many more situations than the old
//! `StoreView` did.
//! ## The background task
//!
//! The HTTP response is out by now. The rest of what used to happen synchronously in the
//! Python `StoreView` is done asynchronously, but still in the same process.
//!
//! So, now to the real work (a simplified sketch of the whole pipeline follows this list):
//!
//! 1.  **Project config is fetched.** If the project cache is stale or missing, we
//!     fetch it. We may wait a couple of milliseconds (`batch_interval`) here so that
//!     multiple project config fetches can be batched into the same HTTP request and
//!     Sentry is not overloaded.
//!
//!     At this stage Relay may drop the event because it realized that the DSN was
//!     invalid or the project didn't even exist. The next incoming event will get a
//!     proper `4xx` status code.
//!
//! 1.  **The event is parsed.** In the endpoint we only did decompression, a basic
//!     JSON syntax check, and extraction of the event ID to be able to return it as
//!     part of the response.
//!
//!     Now we create an `Event` struct, which is conceptually equivalent to
//!     parsing it into a Python dictionary: we allocate more memory.
//!
//! 1.  **The event is normalized.** Event normalization is probably the most
//!     CPU-intensive task running in Relay. It discards invalid data, moves data
//!     from deprecated fields to newer fields, and generally just does schema
//!     validation.
//!
//! 1.  **Filters ("inbound filters") are applied.** The event may be discarded because
//!     of IP addresses, patterns in the error message, or known web crawlers.
//!
//! 1.  **Exact rate limits ("quotas") are applied.** `is_rate_limited.lua` is
//!     executed on Redis. The input parameters for `is_rate_limited.lua` ("quota
//!     objects") are part of the project config. See [this pull
//!     request](https://github.com/getsentry/sentry/pull/14558) for an explanation
//!     of what quota objects are.
//!
//!     The event may be discarded here. If so, we write the rate limit info
//!     (reason and expiration timestamp) into the in-memory project cache so that
//!     the next store request returns a `429` in the endpoint and doesn't hit Redis
//!     at all.
//!
//!     This contraption has the advantage that suspended or permanently
//!     rate-limited projects are very cheap to handle, and do not involve external
//!     services (ignoring the polling of the project config every couple of
//!     minutes).
//!
//! 1.  **The event is datascrubbed.** We have a PII config (new format) and a
//!     datascrubbing config (old format, converted to the new format on the fly) as
//!     part of the project config fetched from Sentry.
//!
//! 1.  **The event is written to Kafka.**
//!
//! **Note:** If we discard an event at any point, an outcome is written to Kafka
//! if Relay is configured to do so.
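//!
//! The shape of that pipeline, as a heavily simplified and purely illustrative sketch (every
//! type and function below is made up for the example; the real implementation lives in the
//! services of this crate):
//!
//! ```ignore
//! // Illustrative only: a linear view of the background processing steps.
//! struct RawEvent;
//! struct Event;
//! struct ProjectConfig;
//!
//! enum Discard {
//!     Filtered,
//!     RateLimited,
//! }
//!
//! fn parse(_raw: RawEvent) -> Event { Event }
//! fn normalize(_event: &mut Event, _project: &ProjectConfig) {}
//! fn apply_inbound_filters(_e: &Event, _p: &ProjectConfig) -> Result<(), Discard> { Ok(()) }
//! fn check_quotas(_e: &Event, _p: &ProjectConfig) -> Result<(), Discard> { Ok(()) }
//! fn scrub_pii(_event: &mut Event, _project: &ProjectConfig) {}
//! fn write_to_kafka(_event: Event) {}
//!
//! // Assumes the project config (step 1) has already been fetched.
//! fn process(raw: RawEvent, project: &ProjectConfig) -> Result<(), Discard> {
//!     let mut event = parse(raw);               // 2. allocate the `Event` struct
//!     normalize(&mut event, project);           // 3. schema validation, field migration
//!     apply_inbound_filters(&event, project)?;  // 4. IP, error message, web crawler filters
//!     check_quotas(&event, project)?;           // 5. Redis-backed quotas (`is_rate_limited.lua`)
//!     scrub_pii(&mut event, project);           // 6. PII config + legacy datascrubbing config
//!     write_to_kafka(event);                    // 7. hand the accepted event to Kafka
//!     Ok(())
//! }
//!
//! // On any `Err(Discard)`, an outcome is emitted instead (if configured).
//! ```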
//!
//! ### Summary
//!
//! For events that returned a `200` we spawn an in-process background task
//! that does the rest of what the old `StoreView` did.
//!
//! This task updates in-memory state for rate limits and disabled
//! projects/keys.
//!
//! ## The outcomes consumer
//!
//! Outcomes are small messages in Kafka that contain an event ID and information
//! about whether that event was rejected, and if so, why.
//!
//! The outcomes consumer is mostly responsible for updating (user-visible)
//! counters in Sentry (buffers/counters and tsdb, which are two separate systems).
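//!
//! As a rough mental model, an outcome is not much more than the following (the field names
//! are illustrative; the real payload carries additional metadata):
//!
//! ```ignore
//! // Illustrative only: the essence of an outcome message.
//! struct Outcome {
//!     event_id: Option<String>,
//!     /// `None` if the event was accepted; otherwise the rejection reason,
//!     /// e.g. a rate limit or an inbound filter.
//!     rejection_reason: Option<String>,
//! }
//! ```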
//!
//! ## The ingest consumer
//!
//! The ingest consumer reads accepted events from Kafka, and also updates some
//! stats. Some of *those* stats are billing-relevant.
//!
//! Its main purpose is to do what `insert_data_to_database` in the Python store did:
//! call `preprocess_event`, after which comes sourcemap processing, native
//! symbolication, grouping, snuba and all that other stuff that is of no concern
//! to Relay.
//!
//! ## Sequence diagram of components inside Relay
//!
//! ```mermaid
//! sequenceDiagram
//! participant sdk as SDK
//! participant endpoint as Endpoint
//! participant projectcache as ProjectCache
//! participant envelopemanager as EnvelopeManager
//! participant cpupool as CPU Pool
//!
//! sdk->>endpoint:POST /api/42/store
//! activate endpoint
//! endpoint->>projectcache: get project (cached only)
//! activate projectcache
//! projectcache-->>endpoint: return project
//! deactivate projectcache
//! Note over endpoint: Checking rate limits and auth (fast path)
//! endpoint->>envelopemanager: queue event
//!
//! activate envelopemanager
//! envelopemanager-->>endpoint:event ID
//! endpoint-->>sdk:200 OK
//! deactivate endpoint
//!
//! envelopemanager->>projectcache:fetch project
//! activate projectcache
//! Note over envelopemanager,projectcache: web request (batched with other projects)
//! projectcache-->>envelopemanager: return project
//! deactivate projectcache
//!
//! envelopemanager->>cpupool: .
//! activate cpupool
//! Note over envelopemanager,cpupool: normalization, datascrubbing, redis rate limits, ...
//! cpupool-->>envelopemanager: .
//! deactivate cpupool
//!
//! Note over envelopemanager: Send event to kafka
//!
//! deactivate envelopemanager
//! ```
//!
//! <script src="https://cdn.jsdelivr.net/npm/mermaid@8.8.4/dist/mermaid.min.js"></script>
//! <script>
//! mermaid.init({}, ".language-mermaid code");
//! // Could not get dark mode in mermaid to work
//! Array.from(document.getElementsByTagName('svg')).map(x => x.style.background = "white")
//! </script>
#![warn(missing_docs)]
#![doc(
    html_logo_url = "https://raw.githubusercontent.com/getsentry/relay/master/artwork/relay-icon.png",
    html_favicon_url = "https://raw.githubusercontent.com/getsentry/relay/master/artwork/relay-icon.png"
)]
#![allow(clippy::derive_partial_eq_without_eq)]

mod constants;
mod endpoints;
mod envelope;
mod extractors;
mod http;
mod metrics;
mod metrics_extraction;
mod middlewares;
mod service;
mod services;
mod statsd;
mod utils;

pub use self::envelope::Envelope; // pub for benchmarks
pub use self::services::buffer::{
    EnvelopeStack, PolymorphicEnvelopeBuffer, SqliteEnvelopeStack, SqliteEnvelopeStore,
}; // pub for benchmarks
pub use self::utils::{MemoryChecker, MemoryStat}; // pub for benchmarks

#[cfg(test)]
mod testutils;

use std::sync::Arc;

use relay_config::Config;
use relay_system::{Controller, ServiceSpawnExt as _};

use crate::service::ServiceState;
use crate::services::server::HttpServer;

/// Runs a relay web server and spawns all internal worker threads.
///
/// This effectively boots the entire server application. It blocks the current thread until a
/// shutdown signal is received or a fatal error happens. The behavior of the server is determined
/// by the `config` passed into this function.
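///
/// # Example
///
/// A minimal usage sketch; it assumes that a default [`Config`] is good enough for the
/// environment (real deployments load the configuration from a config folder first):
///
/// ```ignore
/// let config = relay_config::Config::default();
/// relay_server::run(config)?;
/// ```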
pub fn run(config: Config) -> anyhow::Result<()> {
    let config = Arc::new(config);
    relay_log::info!("relay server starting");

    // Create the main runtime.
    let runtime = crate::service::create_runtime("main-rt", config.cpu_concurrency());
    let handle = runtime.handle().clone();

    // Run the system and block until a shutdown signal is sent to this process. Inside, start a
    // web server and run all relevant services. See the `services` module documentation for more
    // information on all services.
    runtime.block_on(async {
        Controller::start(config.shutdown_timeout());

        let mut services = handle.service_set();

        let state = ServiceState::start(&handle, &services, config.clone()).await?;
        services.start(HttpServer::new(config, state.clone())?);

        tokio::select! {
            _ = services.join() => {},
            // NOTE: when every service implements a shutdown listener,
            // awaiting on `finished` becomes unnecessary: We can simply join() and guarantee
            // that every service finished its main task.
            // See also https://github.com/getsentry/relay/issues/4050.
            _ = Controller::shutdown_handle().finished() => {}
        }

        anyhow::Ok(())
    })?;

    drop(runtime);

    relay_log::info!("relay shutdown complete");
    Ok(())
}