1 Why high-ingest real-time analytics is hard (and worth it)
In today’s world of digital services, sensors, user interactions, and complex systems, the demand isn’t just for storing massive volumes of data — it’s for making sense of it immediately. You might ingest billions of events per day, but if your system only surfaces insights hours (or worse, days) later, you’ve missed the moment. In this article, we walk through why high-ingest, real-time analytics is both hard and worth the effort, then introduce three leading engine candidates: ClickHouse, Apache Cassandra, and ScyllaDB.
1.1 Problem spaces
Let’s contextualize with concrete workload types and why they present special challenges.
1.1.1 Event telemetry (product analytics, observability, IoT)
Imagine you run a product used by millions of users, or you monitor fleets of IoT devices streaming telemetry every second. Each interaction or sensor reading becomes an event you want to ingest, store, query, and derive insights from. For example:
- A mobile application generates events like “user X clicked button Y at timestamp T, on session S, with attributes {locale=en, version=1.4.3}”.
- An observability pipeline collects logs, traces, metrics: “Service A latency = 120 ms at timestamp T on host H”.
- IoT devices push telemetry: “device D at location L, temperature = 35°C, timestamp T”.
In each case, you have many, many events arriving per second, and you want to ask questions such as:
- “How many users clicked feature Z in the last 60 seconds across tenants?”
- “Which hosts have error rates above threshold X in the last 5 minutes and correlate with latency spikes?”
- “What devices in region R reported temperature deviations > 5 °C in the last hour?”
The ingestion side needs to keep up; the query side needs to slice and aggregate fast; and late or out-of-order events complicate correct analytics.
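Questions like these all reduce to grouping events into time buckets and counting within them. A minimal, illustrative Python sketch (the events and field names are made up):

```python
# Toy sketch: time-bucketed distinct-user counts with a selective filter.
from collections import defaultdict

def bucket(ts: float, width_s: int = 60) -> int:
    """Floor a timestamp (in seconds) to the start of its bucket."""
    return int(ts // width_s) * width_s

events = [
    {"ts": 5.0,  "feature": "Z", "user": "u1"},
    {"ts": 42.0, "feature": "Z", "user": "u2"},
    {"ts": 65.0, "feature": "Y", "user": "u1"},
]

users_per_bucket = defaultdict(set)  # bucket start -> distinct users
for e in events:
    if e["feature"] == "Z":          # selective filter
        users_per_bucket[bucket(e["ts"])].add(e["user"])

per_bucket = {b: len(u) for b, u in users_per_bucket.items()}
```

At scale, a database engine performs exactly this grouping, but over billions of rows per bucket.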
1.1.2 Fraud detection (low-latency joins, sliding windows, enrichment)
Another scenario: financial transactions, ad-impressions, gameplays, reward systems. You might ingest events at high volume and within milliseconds want to decide “is this fraudulent?” or “should we serve this ad?” That means:
- Real-time enrichment: join incoming event with user profile, risk score, historical behavior.
- Sliding windows: e.g., “If user made > 10 transactions in last 30 s and location changed, flag.”
- Low p99 latency: you can’t delay until batch processed — you need near immediate results.
- Complexity of queries: heavy joins, filters, and time-based logic.
These workloads blur the line between operational (OLTP) and analytical (OLAP) processing. They demand high ingest, fast reads and writes, sometimes mixed patterns.
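The sliding-window rule sketched above ("more than N transactions in W seconds plus a location change") fits in a few lines; this toy version keeps a per-user deque, and all thresholds and locations are invented for illustration:

```python
# Toy sliding-window fraud rule: flag a user with more than `max_txns`
# transactions in the last `window_s` seconds whose location also changed.
from collections import deque

class FraudWindow:
    def __init__(self, window_s: float = 30.0, max_txns: int = 10):
        self.window_s, self.max_txns = window_s, max_txns
        self.txns = {}  # user -> deque of (ts, location)

    def observe(self, user: str, ts: float, location: str) -> bool:
        q = self.txns.setdefault(user, deque())
        q.append((ts, location))
        while q and q[0][0] < ts - self.window_s:  # evict expired entries
            q.popleft()
        locations = {loc for _, loc in q}
        return len(q) > self.max_txns and len(locations) > 1

w = FraudWindow(window_s=30, max_txns=3)
flags = [w.observe("u1", t, "US" if t < 10 else "BR") for t in range(12)]
```

A production system does the same thing, but with the window state kept in a stream processor or a low-latency store rather than process memory.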
1.1.3 Time-bucketed OLAP (dashboards, anomaly detection, ad-hoc pivots)
Once you’ve ingested and optionally enriched your events, you’ll want to build dashboards, anomaly detectors, ad-hoc pivots:
- Example: “Number of active sessions per minute for each tenant in last 24 hours, grouped by country, broken down by platform.”
- Example: “Top 10 users by error rate over last hour, excluding known bots.”
- Example: “Detect unusual drop in uptime for service ‘X’ relative to baseline.”
These queries tend to access large volumes of data, often aggregated over time buckets (1 minute, 5 minutes, hourly). They often benefit from column-oriented scanning and pre-aggregated rollups. But to support real-time, you may also need fresh data (ingested just seconds ago).
1.2 Workload signatures
Across those problem spaces, certain workload “signatures” reappear. Recognizing them helps pick the right database engine and architecture.
1.2.1 100K–5M events/sec ingestion, mostly appends; late/out-of-order events
At scale, ingestion volumes may reach hundreds of thousands to millions of events per second. The pattern is mostly append-only — new events arriving with increasing timestamps — but not strictly ordered: events can arrive late, or out of order (e.g., network lag, retries). You need the ingestion pipeline and database storage to handle very high write-throughput and ingest concurrency.
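One common way to tolerate out-of-order arrival is a reorder buffer: hold events until a watermark (the highest timestamp seen minus an allowed lateness) passes them, then emit in timestamp order. A minimal sketch, illustrative only:

```python
# Toy reorder buffer: emits events in timestamp order, tolerating
# out-of-order arrival up to `max_lateness` seconds.
import heapq

class ReorderBuffer:
    def __init__(self, max_lateness: float):
        self.max_lateness = max_lateness
        self.heap = []                  # min-heap keyed by event timestamp
        self.max_seen = float("-inf")

    def push(self, ts: float, payload) -> list:
        self.max_seen = max(self.max_seen, ts)
        heapq.heappush(self.heap, (ts, payload))
        ready = []
        # An event is safe to emit once the watermark passes its timestamp.
        while self.heap and self.heap[0][0] <= self.max_seen - self.max_lateness:
            ready.append(heapq.heappop(self.heap))
        return ready

buf = ReorderBuffer(max_lateness=5.0)
out = []
for ts in [10.0, 12.0, 9.0, 20.0]:   # 9.0 arrives late
    out.extend(buf.push(ts, None))
emitted = [ts for ts, _ in out]
```

Stream processors implement the same idea as event-time watermarks (covered later in the pipeline section).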
1.2.2 Hot time ranges + selective filters; heavy rollups; backfill and replays
Once data is there, you see “hot” time windows — recent minutes or seconds are queried heavily: dashboards updating every minute, alerts firing. Queries often include selective filters (tenant, country, platform, feature flag) and heavy aggregation (group by time buckets, filter by top-N). Also: you may need to backfill data (for missed ingestion windows) or replay events (for bug fixes), meaning the system must handle bursts of older data ingestion and recompute rollups.
1.2.3 Write/read amplification, compaction/merge costs, TTLs and retention
The underlying storage system must handle not only simple writes but the operational tasks: compaction of data files, merging segments, dealing with tombstones or expired data (TTLs). Especially in time-series or event stores you’ll often keep only recent data (e.g., last 30 days) and evict old ones. That creates churn. Also, read amplification (many segments to scan or merge) and write amplification (multiple media writes per event) must be monitored — they cost I/O, CPU, and latency.
1.3 The three candidates at a glance
Here’s a quick summary of three high-ingest engine candidates — we’ll go deep later, but for now the overview.
1.3.1 ClickHouse: columnar, MergeTree family, MV/rollups, data-skipping indices; strong for OLAP and time-series analytics at scale
ClickHouse is an open-source, column-oriented analytics engine designed for high-volume ingestion and scan-heavy queries across large data sets. It leverages the MergeTree engine family, supports materialized views for rollups, and offers data-skipping indexes (e.g., minmax, token Bloom filters) for selective scans. It is optimized for scan-heavy, aggregate queries (time-bucketed analytics) rather than ultra-low-latency individual lookups.
1.3.2 Apache Cassandra: wide-column with tunable consistency, SAI indexing (5.0), durable write path, geo-replication; great for high-throughput operational reads/writes
Apache Cassandra offers a wide-column NoSQL architecture: partition key → node, clustering columns → order within partition. It has a masterless peer-to-peer architecture with tunable consistency (you decide consistency/availability trade-off). With Cassandra 5.0 the SAI (Storage-Attached Index) feature improves secondary indexing. It is well suited for workloads requiring high write throughput, multi-region replication, and operational lookups rather than heavy OLAP.
1.3.3 ScyllaDB: Cassandra-compatible, shard-per-core/Seastar engine for very low p99 latency and high throughput; DynamoDB-compatible Alternator for API flexibility
ScyllaDB is a drop-in, high-performance replacement for Cassandra: same wide-column data model (CQL compatibility) but built on C++ and the Seastar framework with one shard per CPU core, minimizing cross-thread/lock overhead. The result: extremely high throughput and very low tail latency (p99). It also supports a DynamoDB-compatible API (“Alternator”) if that suits your stack.
In short — if your priorities are real-time time-series analytics and heavy aggregation, think ClickHouse. If your priority is very high write throughput, operational key-value or per-entity lookups, think Cassandra/Scylla. And if you need Cassandra API but better throughput/latency, lean Scylla.
2 Columnar vs wide-column: demystifying the trade-offs
One of the biggest misunderstandings when choosing among these engines is conflating columnar (analytical) architecture with wide-column (NoSQL) architecture. They sound similar but are very different. Getting clarity here will help you pick the right tool for the right job.
2.1 Storage and IO patterns
Let’s compare how data is physically stored and how IO works — this drives performance characteristics.
2.1.1 Columnar (ClickHouse): compressed column files, vectorized scans, primary-key order, data skipping (min/max, Bloom-style token filters)
In a columnar engine like ClickHouse, data is stored by column rather than by row. That means each column’s values are held together on disk — enabling heavy compression (because similar data types/values cluster) and efficient vectorized reads (you scan only the columns you need).
Here’s how typical IO works: you issue a query like “SUM(event_count) WHERE timestamp >= X AND tenant = ‘T’”. The engine locates the relevant data segments (via partitioning or index), reads only the event_count, tenant, and timestamp columns, decompresses them, applies vectorized operations, uses indices (min/max, Bloom) to skip segments that can’t match (data skipping), and returns the aggregated result. Because you avoid reading full rows, IO is efficient.
Columns are often ordered by primary-key or sort key (in MergeTree, ORDER BY some tuple) so that related data (e.g., by date) co-locates, enabling efficient range scans and pruning. Skip-indexes reduce IO further.
The trade-off: writes tend to be batched, merges/compactions are heavy; random point updates or single row lookups are less efficient (because the engine is optimized for scans). Mutations and frequent updates can reduce performance.
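The data-skipping idea can be illustrated with a toy model: each granule keeps a min/max summary of its timestamp column, so whole granules are skipped without reading their values (granule sizes and values here are made up):

```python
# Toy columnar layout with min/max skip data: each "granule" stores column
# arrays plus an implicit (min, max) summary of its timestamp column.
granules = [
    {"ts": [100, 110, 120], "metric": [1.0, 2.0, 3.0]},
    {"ts": [200, 210, 220], "metric": [4.0, 5.0, 6.0]},
    {"ts": [300, 310, 320], "metric": [7.0, 8.0, 9.0]},
]

def sum_metric_since(granules, ts_min: int):
    total, scanned = 0.0, 0
    for g in granules:
        if max(g["ts"]) < ts_min:   # min/max check: skip the whole granule
            continue
        scanned += 1
        total += sum(m for t, m in zip(g["ts"], g["metric"]) if t >= ts_min)
    return total, scanned

total, scanned = sum_metric_since(granules, 205)
```

In a real engine the min/max summaries are tiny metadata entries read long before the (compressed) column data, which is why skipping pays off.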
2.1.2 Wide-column (Cassandra/Scylla): partition key → node, clustering columns order within partition; reads are row-oriented, penalize scatter-gathers
In a wide-column store (like Cassandra / Scylla), data is stored as partitions defined by a partition key; within each partition, clustering columns define ordering. A “row” here is a key mapped to a collection of columns. The data model is optimized for key-based access, high write throughput, and horizontal scalability. The “column family” concept groups columns by row key rather than storing each column across the table separately.
When you write, you append to a memtable; flush to SSTables; compaction merges SSTables. Reads for a partition key are efficient (single node). But when your query requires scanning many partitions, or many rows across partitions, performance degrades (scatter/gather across nodes). Furthermore, selective filters on non-primary key attributes tend to be inefficient unless proper indexing is used.
The IO pattern is row-oriented: you retrieve columns related to a particular row or partition quickly, but scanning large volumes across partitions is less efficient than a true column store.
Trade-off: Wide-column excels at high-throughput writes, key lookups, and predictable query patterns. It is less ideal for large ad-hoc scans/aggregations over many rows.
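The partition/clustering access pattern can be approximated with an ordinary sorted structure: one sorted list per partition key, read newest-first. A toy sketch (field names invented):

```python
# Toy wide-column layout: one partition per key, rows kept in clustering
# (timestamp) order, served newest-first like `CLUSTERING ORDER BY ... DESC`.
from bisect import insort

class Partition:
    def __init__(self):
        self.rows = []  # kept sorted ascending by timestamp

    def insert(self, ts, value):
        insort(self.rows, (ts, value))

    def latest(self, n):
        return list(reversed(self.rows[-n:]))  # newest first

table = {}
for ts, user, amount in [(1, "u1", 10), (3, "u1", 30), (2, "u1", 20)]:
    table.setdefault(user, Partition()).insert(ts, amount)

last_two = table["u1"].latest(2)
```

A query for one key touches exactly one partition; a filter over all users would have to walk every partition, which is the scatter/gather penalty described above.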
2.2 Query shapes & latency expectations
Based on the storage/IO patterns, different engines favour different query types. Let’s map query shapes to expected latency.
2.2.1 Time-bucket aggregations, top-N, group-bys → columnar wins
Imagine a use case: “For each 5-minute window in the last hour, how many users performed event X, grouped by country and platform, top 10 countries by count.” This is an aggregation over a vast number of rows, selecting a subset of columns, grouping, ordering. This fits columnar storage perfectly: scan event rows by time, filter country/platform, only load the relevant columns, group and order.
Columnar engines like ClickHouse excel here — low latency for large aggregate queries, heavily compressed IO, vectorized execution, and skip-indexes that further reduce IO. They are optimized for analytic workloads at scale.
2.2.2 Primary-key lookups, per-entity timelines, low-p99 writes → wide-column wins
Contrast: “Fetch the last 50 transactions for user U,” or “Insert 100K events per second for user entity E, and serve live views with <10 ms latency.” These are key-based lookups or time-ordered timelines per entity. Wide-column databases shine here: you model partition key = user U, clustering by timestamp DESC, write incoming events to that partition, and serve recent rows efficiently.
In those workloads you care about per-entity fast access and ability to handle extremely high write rates with consistent low tail latency. Wide-column (Cassandra/Scylla) favor that pattern.
Latency expectation table:
| Query type | Engine | Typical latency |
|---|---|---|
| Bulk scan & aggregate over many rows | Columnar | tens to hundreds of ms (depending on size) |
| Key-lookup, small result set | Wide-column | single-digit to low-tens ms p99 |
| Mixed (high ingest + ad-hoc analytics) | Hybrid architecture (both engines) | varies |
2.3 Operational behaviour: compaction/merges vs memtables/LSM
Under the hood, both systems perform background work — but different models.
- In columnar systems (ClickHouse), writes go to new segments, and background merges/compactions consolidate segments, build indexes, prune data. Over time, merges cost IO and CPU; ingestion bursts or too many segments slow queries/ingest. Designing partitioning and merge concurrency is important.
- In wide-column stores (Cassandra/Scylla), writes go to memtables, flush to SSTables; compaction (different strategies: LCS, STCS, TWCS) merges SSTables. Late-arriving data or wide partitions can create tombstones, and compactions also cost IO/CPU. Tuning memtable flush thresholds, compaction strategy, tombstone thresholds, partition size is key.
Understanding these operational costs is critical when dealing with “100k-5M events/sec ingestion” plus “backfill and replays” patterns. If you fail to manage compactions or merges, latency and throughput suffer.
2.4 Secondary indexing and search
Another important aspect: how well you can search or filter beyond primary keys, and what kinds of indexing are available.
2.4.1 ClickHouse: skip indexes (e.g., tokenbf_v1) and new text index for selective scans; not B-trees
In ClickHouse, traditional B-tree style indexes are rare. Instead, it uses skip indexes (for example minmax, tokenbf_v1) to skip reading data ranges where the filter can’t match. For fuzzy text or token searches there are dedicated text indexes. Because the engine expects large scan workloads, the indexes are optimized for selective scans rather than point lookups. Designing ORDER BY and partitioning correctly, and applying skip indexes on high-cardinality filter columns, is critical.
2.4.2 Cassandra 5.0 SAI: column-level indexes incl. vector embeddings & boolean logic; far lower footprint than older SASI/secondary indexes
In Cassandra (from version 5.0 onwards), the SAI (Storage-Attached Index) provides more flexible secondary indexing: you can index non-primary key columns, boolean logic, vector embeddings (for search-type workloads), and do range scans rather than just equality. This improves filtering capability beyond primary key; but using too many indexes or large sets can impact write path and storage overhead. The modeling advice remains: push as much filtering into partition key/clustering columns or pre-computed tables as possible.
Trade-off: While both engines support indexing beyond primary key, the cost (in terms of write overhead, storage overhead, latency) and the query shape they optimize for differ significantly. Choose wisely based on query patterns.
2.5 Consistency, transactions & concurrency (what you get and don’t)
When evaluating engines, you must not neglect how they handle consistency, concurrency, transactions.
- ClickHouse: Primarily designed for analytics; transactions are minimal or not fully ACID (especially in distributed setups). Concurrency model supports many readers and decent write concurrency, but single-row update heavy workloads or full ACID semantics may be suboptimal. You typically accept “eventual consistency” or near-real-time freshness for analytic views. Some features like atomic inserts/merges exist, but rich transactional support is limited.
- Cassandra/Scylla: Both provide strong support for distributed consistency tuning. Cassandra’s tunable consistency levels (e.g., LOCAL_QUORUM, EACH_QUORUM) let you decide the trade-off between consistency, performance and availability. They provide replication across data-centers, high availability, and masterless design. Cassandra (and Scylla) are optimized for concurrent writes/reads, multi-region replication, and operational workloads. However, they are not typically used for complex multi-row ACID transactions (although lightweight transactions exist).
Hence: if your workload demands strict transactional semantics (multi-row atomic updates, joins across tables) you might need a different tool. If you care about scale, availability, ingestion latency, real-time lookups, wide-column may suit. For bulk analytics and aggregated freshness, columnar may fit.
2.6 Decision heuristics table: map workload traits to engine choice
Here’s a simplified decision matrix to guide your thinking:
| Workload trait | Prefer columnar (ClickHouse) | Prefer wide-column (Cassandra/Scylla) |
|---|---|---|
| High-volume analytics, scans over billions of rows, heavy aggregations, top-N, time-bucket rollups | ✅ | ⚠️ (possible but less efficient) |
| Very high ingestion rate (100k-5M events/sec) + low-latency key lookups/presence checks | ⚠️ (ingest heavy may hit merge/back-pressure) | ✅ |
| Queries grouped/time-bucketed by date, many rows, few columns | ✅ | ⚠️ |
| Queries are per-entity timelines, recent events, lookups by key | ⚠️ | ✅ |
| Secondary/complex filtering across non-key attributes (ad-hoc) | ✅ (via skip indexes) | ✅ (via SAI but costs) |
| Multi-region operational writes + availability + low latency | ⚠️ (less optimized) | ✅ |
| Ad-hoc exploratory analytics by many users | ✅ | ⚠️ |
| Frequent updates/mutations of individual records | ⚠️ (less efficient) | ✅ |
Of course, actual decisions depend on more nuance (operational maturity, team skills, costs, ecosystem), but this heuristics table gives you a starting point.
Takeaways:
- Columnar systems shine when your workload is scan-heavy, aggregated, time-bucketed, analytics-driven.
- Wide-column systems shine when your workload is key-based, high-ingest writes, operational lookups, per-entity state.
- Mixing workloads often means combining engines (dual-write or streaming to an analytics store) or adopting a hybrid architecture.
3 End-to-end ingestion pipelines (Kafka/MassTransit) you can reproduce
By now, we’ve defined why high-ingest workloads exist and what each engine does best. The next step is making them work together — building real, reproducible ingestion pipelines. Whether your stack is JVM, Go, or .NET, the principles remain the same: stream → process → persist → serve. This section details reference architectures and configurations for Kafka-based ingestion pipelines, including MassTransit integration for .NET teams, schema management, and operational best practices.
3.1 Architectural patterns
At scale, ingestion pipelines must decouple event producers from consumers, allow schema evolution, tolerate bursts, and ensure at-least-once semantics. Apache Kafka remains the most common backbone for this due to its durability, ordering, partitioning, and wide connector ecosystem.
3.1.1 Kafka topics → stream processors → sinks (OLAP store + serving index)
A standard ingestion topology starts with Kafka topics partitioned by a natural key such as tenant, device, or region. Stream processors (e.g., Kafka Streams, Flink, KSQL, or custom microservices) enrich or aggregate data, then forward to sinks — typically one OLAP store (ClickHouse) for analytical queries and one serving index (Cassandra or ScyllaDB) for low-latency lookups.
# Example topic setup for telemetry data
kafka-topics.sh --create \
--topic telemetry.events.raw \
--partitions 48 \
--replication-factor 3 \
--config retention.ms=86400000
kafka-topics.sh --create \
--topic telemetry.events.aggregated \
--partitions 12 \
--replication-factor 3
Each stage of this pipeline can be scaled independently. Producers publish events, processors perform enrichment or deduplication, and sinks consume in bulk. You can visualize this as:
[Apps/Devices] → [Kafka] → [Stream Processor] → [ClickHouse + Cassandra/ScyllaDB]
Stream processors often perform windowed operations, deduplication by event ID, or partition-aware aggregation. This ensures OLAP tables store clean, time-bucketed data while operational stores retain entity timelines for live queries.
3.1.2 Handling late & out-of-order with watermarking/time buckets
Late or out-of-order events are inevitable when ingesting from distributed systems. Kafka and stream processors handle this using event-time watermarking — a mechanism that defines how long to wait before closing a time window.
In Flink or Kafka Streams, watermarks ensure correctness without unbounded buffering. A watermark of 5 minutes means “I’ll wait up to 5 minutes for late data.” Any event arriving later is considered too late and can be routed to a dead-letter topic for reprocessing.
// Example: Kafka Streams one-minute hopping windows with a five-minute grace period for late events
TimeWindows.ofSizeAndGrace(Duration.ofMinutes(1), Duration.ofMinutes(5))
.advanceBy(Duration.ofSeconds(30));
In ClickHouse, late events can still be merged later through materialized view rollups. By using partition keys on time buckets (e.g., per hour), you can re-ingest late events into the same partition, and the MergeTree family will reconcile aggregates during background merges.
3.2 Kafka + ClickHouse integration options
ClickHouse offers multiple ingestion strategies — each optimized for a different operational model. The two most common are Kafka Table Engine and Kafka Connect sink.
3.2.1 Kafka Table Engine + Materialized View fan-in to MergeTree targets (with dead-letter queue for bad messages)
The Kafka Table Engine lets ClickHouse directly subscribe to Kafka topics. Each Kafka message becomes a row in the table, and materialized views can transform and persist data into MergeTree tables. This pattern minimizes dependencies — ClickHouse acts as the consumer.
CREATE TABLE kafka_events
(
    raw String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka-0:9092,kafka-1:9092',
    kafka_topic_list = 'telemetry.events.raw',
    kafka_group_name = 'ch_consumer',
    kafka_format = 'JSONAsString',
    kafka_num_consumers = 8;

CREATE TABLE telemetry_raw
(
    event_time DateTime,
    tenant_id String,
    session_id String,
    metric Float64,
    meta String
)
ENGINE = MergeTree()
PARTITION BY toDate(event_time)
ORDER BY (tenant_id, event_time);

CREATE MATERIALIZED VIEW mv_telemetry_raw
TO telemetry_raw
AS SELECT
    parseDateTimeBestEffort(JSONExtractString(raw, 'event_time')) AS event_time,
    JSONExtractString(raw, 'tenant_id') AS tenant_id,
    JSONExtractString(raw, 'session_id') AS session_id,
    JSONExtractFloat(raw, 'metric') AS metric,
    raw AS meta
FROM kafka_events;
If a message fails to parse or violates schema, it can be redirected to a dead-letter queue (DLQ) — another Kafka topic monitored for correction.
This native approach is great for smaller clusters or where latency < 1 s is critical, but large production deployments may prefer managed ingestion (Kafka Connect) for fault isolation.
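The DLQ pattern itself is transport-agnostic: validate each record, forward the good ones, and wrap failures with error metadata on a dead-letter topic. In this sketch, `produce`, the expected fields, and the topic names are all stand-ins for a real Kafka producer and your own schema:

```python
# Sketch: validate-or-dead-letter routing. `produce` stands in for a real
# Kafka producer; `sink` collects messages per topic for illustration.
import json

def produce(topic: str, message: bytes, sink: dict):
    sink.setdefault(topic, []).append(message)

def route(raw: bytes, sink: dict):
    try:
        event = json.loads(raw)
        for field in ("event_time", "tenant_id", "metric"):
            if field not in event:
                raise ValueError(f"missing field: {field}")
        produce("telemetry.events.valid", raw, sink)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        wrapped = json.dumps({"error": str(exc),
                              "payload": raw.decode("utf-8", "replace")})
        produce("telemetry.events.dlq", wrapped.encode(), sink)

sink = {}
route(b'{"event_time": 1, "tenant_id": "t1", "metric": 2.5}', sink)
route(b'not json', sink)
```

Keeping the original payload plus the error cause in the DLQ record is what makes later correction and replay practical.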
3.2.2 Kafka Connect sink: official ClickHouse sink (self-managed) and Confluent Cloud’s fully-managed sink (production shortcut)
The official ClickHouse Kafka Connect Sink provides a managed ingestion layer decoupled from the database. Each record is mapped via topic → table, with schema derived from Avro, JSON Schema, or manually defined mappings.
{
"name": "clickhouse-sink",
"config": {
"connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
"topics": "telemetry.events.raw",
"clickhouse.url": "http://clickhouse:8123",
"clickhouse.database": "telemetry",
"clickhouse.table": "telemetry_raw",
"tasks.max": "8",
"insert.mode": "insert",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true"
}
}
With Confluent Cloud, the managed ClickHouse sink eliminates operational burden — you configure it through UI or API, and Confluent handles retries, offsets, and delivery guarantees. This option is ideal when you want a cloud-managed ingestion path without maintaining custom connectors.
3.3 Kafka → Cassandra
Cassandra’s ingestion from Kafka benefits from partition-aware mapping, ensuring write load distribution aligns with token ring placement.
3.3.1 DataStax Kafka Connector (OSS) with partition-aware writes, flexible mapping
The DataStax Kafka Connector (open source) writes Kafka messages into Cassandra tables using flexible field mappings and partition-aware batching. It supports both plain JSON and Avro, as well as schema evolution.
{
  "name": "cassandra-sink",
  "config": {
    "connector.class": "com.datastax.oss.kafka.sink.CassandraSinkConnector",
    "topics": "fraud.events",
    "contactPoints": "cassandra-0.cassandra:9042",
    "loadBalancing.localDc": "eastus",
    "topic.fraud.events.fraud.transactions.mapping": "user_id=value.user_id, amount=value.amount, event_time=value.event_time",
    "topic.fraud.events.fraud.transactions.consistencyLevel": "LOCAL_QUORUM",
    "maxConcurrentRequests": "500"
  }
}
The connector maps Kafka partitions to Cassandra partition keys to avoid hotspots. You can also add an epoch or bucket ID to clustering columns for better write distribution.
Cassandra’s LWT (lightweight transaction) feature isn’t used here — inserts are idempotent via primary key deduplication. For failed writes or schema mismatches, the connector emits records to a DLQ topic for investigation.
3.4 Kafka ↔ ScyllaDB
ScyllaDB’s compatibility with Cassandra’s CQL makes ingestion straightforward, but it adds CDC (Change Data Capture) and Alternator (DynamoDB API) options that enable bidirectional streaming.
3.4.1 ScyllaDB CDC Source Connector (Debezium-based) for change streaming to Kafka; patterns for backfill + dual-writes
The ScyllaDB CDC Source Connector captures row-level changes (inserts, updates, deletes) from tables and streams them to Kafka topics. It’s Debezium-based, ensuring ordering and schema consistency.
{
  "name": "scylla-cdc-source",
  "config": {
    "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
    "tasks.max": "1",
    "scylla.name": "fraud-cluster",
    "scylla.cluster.ip.addresses": "scylla-0.scylla:9042",
    "scylla.table.names": "fraud.transactions"
  }
}
This pattern is crucial for dual-writes or backfill workflows: Cassandra/Scylla as operational store → Kafka → ClickHouse as analytics store. When backfilling, ensure ordering by primary key and include idempotency keys to prevent duplication.
Scylla’s shard-per-core architecture allows each connector task to align with a shard for optimal throughput — especially critical under high ingest rates.
3.5 MassTransit (for .NET) with Kafka
For .NET ecosystems, MassTransit provides a powerful abstraction for message-based workflows, including Kafka riders for easy producer/consumer integration.
3.5.1 Producer/consumer topology, riders, retries/backoff; connecting to community Kafka or Confluent Cloud; sample repo pointers
With MassTransit, you define message contracts, configure riders for Kafka transport, and rely on built-in retries, backoffs, and DLQ routing.
public record TelemetryEvent(string TenantId, DateTime EventTime, double Metric);
public class TelemetryConsumer : IConsumer<TelemetryEvent>
{
    public Task Consume(ConsumeContext<TelemetryEvent> context)
    {
        Console.WriteLine($"Event for {context.Message.TenantId}: {context.Message.Metric}");
        return Task.CompletedTask;
    }
}
Kafka configuration is concise:
builder.Services.AddMassTransit(x =>
{
    // Kafka support in MassTransit is a rider attached to a bus transport
    x.UsingInMemory((context, cfg) => cfg.ConfigureEndpoints(context));

    x.AddRider(rider =>
    {
        rider.AddConsumer<TelemetryConsumer>();

        rider.UsingKafka((context, k) =>
        {
            k.Host("broker:9092");

            k.TopicEndpoint<TelemetryEvent>("telemetry.events.raw", "telemetry-group", e =>
            {
                e.AutoOffsetReset = AutoOffsetReset.Earliest;
                e.ConcurrentMessageLimit = 64;
                e.ConfigureConsumer<TelemetryConsumer>(context);
            });
        });
    });
});
MassTransit’s retry and backoff policies handle transient Kafka errors:
e.UseMessageRetry(r => r.Interval(3, TimeSpan.FromSeconds(2)));
e.UseInMemoryOutbox();
This approach works identically for Confluent Cloud by providing SASL credentials. Sample implementations exist in the MassTransit Kafka Rider samples.
3.6 Schema & serialization
Schema consistency is crucial when multiple producers and consumers evolve over time.
3.6.1 Avro/Protobuf/JSON; schema evolution strategies; idempotent upserts
- Avro (with Confluent Schema Registry) remains the standard for event schemas. It supports evolution via backward/forward compatibility.
- Protobuf provides better binary efficiency but requires more tooling alignment.
- JSON is flexible for early stages or prototyping but can be costly under heavy throughput.
Example Avro schema for telemetry:
{
  "type": "record",
  "name": "TelemetryEvent",
  "fields": [
    {"name": "tenant_id", "type": "string"},
    {"name": "event_time", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "metric", "type": "double"}
  ]
}
For idempotency, include an event ID or checksum to prevent duplicate inserts after retries:
# Example idempotent insert key computation
import hashlib, json
def event_key(event):
return hashlib.md5(json.dumps(event, sort_keys=True).encode()).hexdigest()
3.7 Operational concerns
3.7.1 Exactly-once “enough”: at-least-once + idempotency; DLQs; poison pill handling
True exactly-once semantics across Kafka + DB is rare. In practice, aim for exactly-once-enough — typically at-least-once delivery plus idempotent processing in the sink. Use DLQs (Dead Letter Queues) for records that can’t be parsed, fail validation, or exceed retries. Poison pills — bad records that always fail — should be quarantined with metadata (offset, partition, error cause) for debugging.
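The at-least-once-plus-idempotency contract can be sketched as a sink that remembers an idempotency key per event; here the key store is an in-memory set, whereas a real sink would lean on the primary key or a dedicated dedup table:

```python
# Sketch: at-least-once delivery made effectively exactly-once by an
# idempotency key the sink remembers (in-memory set for illustration).
import hashlib
import json

class IdempotentSink:
    def __init__(self):
        self.seen = set()
        self.rows = []

    def write(self, event: dict) -> bool:
        key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
        if key in self.seen:        # duplicate redelivery: drop silently
            return False
        self.seen.add(key)
        self.rows.append(event)
        return True

sink = IdempotentSink()
evt = {"id": 1, "metric": 2.0}
first = sink.write(evt)
second = sink.write(evt)  # redelivery after a retry
```

The producer side can then retry freely: redeliveries are absorbed at the sink instead of corrupting counts.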
3.7.2 Throughput tuning: partitions, batch.size/linger.ms, acks, compression
Kafka producer and consumer configuration drives ingest performance.
| Setting | Purpose | Typical value |
|---|---|---|
| batch.size | Bytes per producer batch | 64 KB–256 KB |
| linger.ms | Wait time to form a batch | 10–50 ms |
| acks | Durability vs latency | 1 (fast) or all (safer) |
| compression.type | Network efficiency | lz4 or snappy |
| num.partitions | Parallelism | 3–5× consumer instances |
Always benchmark per cluster and adjust for p99 latency, not just throughput.
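A rough way to reason about partition count is to divide target throughput by measured per-partition throughput and add headroom. Both inputs below are assumed numbers that you must benchmark on your own cluster:

```python
# Back-of-envelope partition sizing: partitions needed to hit a target
# ingest rate, given measured per-partition throughput plus headroom.
import math

def partitions_needed(target_events_per_s: int,
                      events_per_partition_s: int,
                      headroom: float = 1.5) -> int:
    return math.ceil(target_events_per_s * headroom / events_per_partition_s)

# Hypothetical: 1M events/s target, 50k events/s per partition measured.
p = partitions_needed(1_000_000, 50_000)
```

Treat the result as a starting point; too many partitions also has costs (more open files, slower rebalances), so verify against p99 latency, not just throughput.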
4 Schema design that survives peak ingest (with examples)
Once ingestion is reliable, schema design determines whether your system remains fast under pressure. The patterns differ for ClickHouse, Cassandra, and ScyllaDB — each has its own “physics.”
4.1 Telemetry/events in ClickHouse
4.1.1 Raw table (MergeTree) ORDER BY (date, tenant, event_time, …); codecs; partition by day/hour
Raw events should be ingested into a MergeTree table ordered for efficient range scans. Partitioning by day (or hour for extreme rates) localizes merges.
CREATE TABLE telemetry_raw
(
event_date Date DEFAULT toDate(event_time),
event_time DateTime,
tenant String,
session_id String,
metric Float64,
tags Map(String, String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (tenant, event_time)
SETTINGS index_granularity = 8192,
min_bytes_for_wide_part = 0;
Use codecs (e.g., ZSTD, LZ4HC) on large text or JSON columns to reduce IO.
Time-based partitioning allows efficient TTLs:
ALTER TABLE telemetry_raw MODIFY TTL event_time + INTERVAL 30 DAY DELETE;
4.1.2 Rollups with Materialized Views → Summing/AggregatingMergeTree; cascading MVs for multi-granularity buckets (1m, 5m, 1h)
For fast dashboards, pre-aggregate with cascading materialized views.
-- Target table: SummingMergeTree adds up events and metric_sum during merges
CREATE TABLE telemetry_1m
(
    tenant String,
    bucket DateTime,
    events UInt64,
    metric_sum Float64
)
ENGINE = SummingMergeTree
ORDER BY (tenant, bucket);

CREATE MATERIALIZED VIEW mv_1m
TO telemetry_1m
AS SELECT
    tenant,
    toStartOfMinute(event_time) AS bucket,
    count() AS events,
    sum(metric) AS metric_sum  -- avg() isn't mergeable; compute metric_sum / events at query time
FROM telemetry_raw
GROUP BY tenant, bucket;
Then build 5-minute and 1-hour rollups on top of the 1-minute table. This cascade reduces query latency from seconds to milliseconds while maintaining freshness.
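The cascade is just re-bucketing: each coarser level aggregates the level below it instead of re-reading the raw table. A minimal sketch of the arithmetic (bucket keys and values invented):

```python
# Sketch of cascading rollups: 1-minute buckets feed 5-minute buckets, so
# the coarser level never re-reads raw events.
from collections import defaultdict

def rollup(buckets: dict, width_s: int) -> dict:
    out = defaultdict(lambda: [0, 0.0])  # bucket start -> [events, metric_sum]
    for b, (events, metric_sum) in buckets.items():
        coarse = (b // width_s) * width_s
        out[coarse][0] += events
        out[coarse][1] += metric_sum
    return dict(out)

one_min = {0: (5, 10.0), 60: (3, 6.0), 300: (2, 4.0)}  # 1-minute buckets
five_min = rollup(one_min, 300)
```

This only works because the aggregates are mergeable (counts and sums compose); averages and percentiles must be carried as sum/count pairs or aggregate states.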
4.1.3 Data skipping indices: tokenbf_v1 on user_id/session, minmax on ts; selective text index for fuzzy search fields
Data skipping indices drastically reduce scan volume.
ALTER TABLE telemetry_raw ADD INDEX idx_session session_id TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;
ALTER TABLE telemetry_raw ADD INDEX idx_time event_time TYPE minmax GRANULARITY 1;
For fuzzy text search:
ALTER TABLE telemetry_raw ADD INDEX idx_message text('english_stemmer') TYPE text;
These indices allow filtering billions of rows efficiently.
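The intuition behind a minmax skipping index fits in a few lines: store a per-granule (min, max) of the indexed column and skip every granule whose range cannot match the predicate. A sketch with made-up granule size and data:

```python
def build_minmax(values, granule):
    """Per-granule (min, max) summaries, as a minmax skipping index would store."""
    return [
        (min(values[i:i + granule]), max(values[i:i + granule]))
        for i in range(0, len(values), granule)
    ]

def granules_to_scan(index, lo, hi):
    """Indices of granules whose [min, max] range intersects the query range [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(index) if mx >= lo and mn <= hi]

# event_time as epoch seconds, already sorted by the table's ORDER BY
times = list(range(0, 4000, 10))           # 400 rows
idx = build_minmax(times, granule=100)     # 4 granules of 100 rows each
hit = granules_to_scan(idx, 1500, 1600)    # a narrow time-range predicate
assert hit == [1]                          # only 1 of 4 granules needs reading
```

This is why aligning ORDER BY with common filters matters: when the column is sorted, granule ranges barely overlap and most granules are skipped outright.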
4.2 Fraud/serving use-cases in Cassandra
4.2.1 Query-first modeling: primary partitions per entity or shard key; clustering on event_time DESC; bucketing for “hot” rows; TTL strategy per table
Cassandra schemas must start from the query pattern. Suppose you need “fetch last 50 transactions per user.”
CREATE TABLE transactions (
user_id TEXT,
bucket_id INT,
event_time TIMESTAMP,
txn_id UUID,
amount DECIMAL,
PRIMARY KEY ((user_id, bucket_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND default_time_to_live = 2592000;
bucket_id (e.g., day-of-month) prevents hot partitions for extremely active users. TTLs handle automatic data expiration.
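Clients compute the composite key deterministically. A sketch of the day-of-month bucket suggested above (any stable function of the timestamp works, as long as readers can reproduce it):

```python
from datetime import datetime, timezone

def partition_key(user_id: str, event_time: datetime) -> tuple:
    # Day-of-month bucket: a single hot user's rows spread over up to 31
    # partitions. Readers fan out over the buckets covering their time range.
    return (user_id, event_time.day)

ts = datetime(2025, 10, 21, 12, 0, tzinfo=timezone.utc)
assert partition_key("u1", ts) == ("u1", 21)
```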
4.2.2 SAI for flexible filters (ranges, IN, boolean combos, vector query) where feasible; what still belongs in pre-computed tables
Cassandra 5.0’s SAI (Storage-Attached Index) enables flexible secondary queries:
CREATE CUSTOM INDEX idx_amount ON transactions (amount)
USING 'StorageAttachedIndex';
You can then query without ALLOW FILTERING, since SAI serves the range predicate:
SELECT * FROM transactions
WHERE user_id = 'u1' AND bucket_id = 21 AND amount > 1000;
However, don’t overuse SAI on high-churn columns; pre-compute or materialize frequent query shapes into dedicated tables.
4.2.3 Materialized Views: when acceptable; async propagation caveats; guardrails
Materialized Views (MVs) automatically replicate data under a different key but propagate asynchronously. For fraud workloads, only use MVs for small, low-latency aggregates.
CREATE MATERIALIZED VIEW txn_by_amount AS
SELECT * FROM transactions
WHERE amount IS NOT NULL AND user_id IS NOT NULL
  AND bucket_id IS NOT NULL AND event_time IS NOT NULL
PRIMARY KEY ((amount), user_id, bucket_id, event_time);
A view's key must include every primary-key column of the base table plus at most one other column, and each key column must be filtered with IS NOT NULL.
Guardrails: avoid multi-level MVs; monitor for pending view mutations; consider asynchronous backfill if consistency is critical.
4.3 ScyllaDB specifics
4.3.1 Same data model as Cassandra, but design to exploit shard-per-core: token/driver routing, prepared statements, per-shard concurrency
ScyllaDB uses the same schema definitions but scales differently. Each shard handles a portion of the token range — you can optimize by aligning driver connections per shard and using prepared statements.
# Python example using the shard-aware driver (pip install scylla-driver)
import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(['scylla-node1'])
session = cluster.connect('fraud')
ps = session.prepare(
    "INSERT INTO transactions (user_id, bucket_id, event_time, txn_id, amount) "
    "VALUES (?, ?, ?, ?, ?)")
session.execute(ps, ('u1', 10, datetime.now(timezone.utc), uuid.uuid4(), 52.3))
Set core_connections_per_host equal to CPU cores for maximum throughput.
4.3.2 MVs & secondary indexes in ScyllaDB: capabilities and async update implications; tombstones and repair strategies
ScyllaDB supports MVs and secondary indexes but handles them asynchronously. Use Scylla Manager to monitor pending updates and repairs.
Tombstones from expired TTLs can accumulate — periodically run nodetool compact or tune gc_grace_seconds to manage space reclamation.
4.4 Cross-store considerations
4.4.1 Upserts vs inserts; dedupe keys; event idempotency keys
When writing from Kafka to multiple sinks (ClickHouse + Cassandra/Scylla), ensure consistent idempotency. Use unique event IDs as natural dedupe keys.
PRIMARY KEY ((tenant_id), event_time, event_id)
ClickHouse’s ReplacingMergeTree deduplicates rows that share the same sorting key during background merges, so put event_id into ORDER BY:
ENGINE = ReplacingMergeTree
ORDER BY (tenant_id, event_time, event_id)
Deduplication is eventual (applied only when parts merge); queries that need exactness should read with FINAL or aggregate around duplicates.
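On the consumer side, the same event_id makes retries and replays safe before data ever reaches a sink. A minimal in-process sketch of an idempotent writer (a real deployment would keep seen IDs in the store itself or in a TTL'd cache, not an unbounded set):

```python
class IdempotentSink:
    """Drops duplicate events by event_id before handing them to the store."""

    def __init__(self, store):
        self.store = store
        self.seen = set()

    def write(self, event: dict) -> bool:
        eid = event["event_id"]
        if eid in self.seen:
            return False          # duplicate delivery (e.g., Kafka retry): skip
        self.seen.add(eid)
        self.store.append(event)
        return True

store = []
sink = IdempotentSink(store)
assert sink.write({"event_id": "e1", "metric": 1.0}) is True
assert sink.write({"event_id": "e1", "metric": 1.0}) is False  # retry ignored
assert len(store) == 1
```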
4.4.2 “Cold path” backfills and replays: keeping rollups correct
When reprocessing historical data (e.g., bug fix or enrichment), isolate to a “cold path.” Load older data into temporary tables, validate aggregates, then merge with production tables:
INSERT INTO telemetry_raw SELECT * FROM telemetry_backfill
WHERE event_time BETWEEN '2025-10-01' AND '2025-10-10';
To rebuild rollups, recreate the materialized view and backfill its target table with an explicit INSERT ... SELECT over the affected range; note that POPULATE misses rows inserted while the view is being built, so prefer explicit backfills on hot tables.
5 Kubernetes/AKS deployment patterns for cost & performance
5.1 Operators you’ll likely use (and why)
5.1.1 ClickHouse: Altinity ClickHouse Operator (Helm), Keeper support, production features
The Altinity ClickHouse Operator automates cluster lifecycle: installation, scaling, backups, and ZooKeeper/Keeper integration. It supports CRDs defining clusters, shards, replicas, and persistent volumes.
helm repo add clickhouse https://altinity.github.io/clickhouse-operator
helm install clickhouse clickhouse/clickhouse-operator
Key production features include automatic schema sync, TLS, and integration with object storage (S3/Blob).
5.1.2 Cassandra: K8ssandra Operator (multi-cluster/region, Reaper/Medusa)
K8ssandra Operator simplifies Cassandra management across regions. It bundles:
- Reaper for repair scheduling.
- Medusa for backup/restore.
- Stargate for APIs (REST/GraphQL).
helm repo add k8ssandra https://helm.k8ssandra.io/
helm install cassandra k8ssandra/k8ssandra
Supports cross-DC replication and integrates with Prometheus metrics for auto-healing decisions.
5.1.3 ScyllaDB: Scylla Operator (scaling, upgrades; new analyze tooling)
Scylla Operator provides declarative scaling, rolling upgrades, and integrates with Scylla Manager. It’s optimized for CPU isolation and host-network mode for performance.
helm repo add scylla https://scylla-operator-charts.storage.googleapis.com/stable
helm install scylla scylla/scylla-operator
Newer versions include shard-level metrics export and I/O scheduler analysis tooling.
5.2 AKS cost optimization essentials for high-ingest DBs
5.2.1 Right-size requests/limits; autoscaler tuning; workload bin-packing; Spot for stateless tiers only
For databases, right-sizing is non-negotiable. Assign guaranteed CPU and memory for stateful pods. Use Spot instances only for stateless tiers like stream processors or Kafka Connect workers.
5.2.2 Storage: Premium SSD v2 (independent IOPS/throughput provisioning) for Kafka/DB logs; snapshot costs; regional price variance
Use Azure Premium SSD v2, which allows tuning IOPS and throughput independently. Allocate separate volumes for WAL/commit logs. Automate snapshot pruning to control costs.
5.2.3 Region selection, node pool separation (brokers vs DB vs compute)
Separate node pools per tier:
- Pool A: Kafka brokers (I/O heavy)
- Pool B: Databases (storage + CPU heavy)
- Pool C: Stream processors (CPU heavy)
Region choice affects latency; co-locate all tiers within the same AZ where possible.
5.2.4 Tooling for FinOps visibility (PerfectScale, Mirantis, Zesty, etc.)
Use FinOps tools to monitor real utilization. For example, PerfectScale automatically rightsizes pods; Mirantis Lens and Zesty track real-time cost trends for persistent disks and compute usage.
5.3 Kafka on K8s for ingestion
5.3.1 StatefulSets, persistent volumes, advertised.listeners, monitoring with Prometheus/JMX exporters; Helm chart choices
Deploy Kafka as a StatefulSet with persistent volumes to maintain offsets and logs. Helm charts from Bitnami or Strimzi simplify management.
helm install kafka bitnami/kafka --set persistence.size=500Gi
Expose proper advertised.listeners for internal/external communication and add JMX exporters for metrics. Integrate with Prometheus for monitoring under sustained ingest.
5.4 Network & IO tuning
5.4.1 Host-networking trade-offs, CPU pinning/IRQ isolation for ScyllaDB, fsync/IO schedulers, hugepages (when)
- Use host networking for ScyllaDB to minimize latency; pin shards per CPU core.
- Configure IRQ isolation and the noop or mq-deadline IO schedulers for predictable IO.
- Enable hugepages if workloads benefit from large memory segments (e.g., ClickHouse merges).
- Avoid overcommitting vCPUs — each ScyllaDB shard must map to a dedicated core.
5.5 Security & multi-tenancy
5.5.1 mTLS for Kafka, Azure Key Vault CSI for secrets, network policies
Secure pipelines end-to-end:
- Enable mTLS between Kafka producers, brokers, and consumers.
- Mount secrets using Azure Key Vault CSI driver rather than inline secrets.
- Use NetworkPolicies to restrict namespace cross-traffic.
These ensure compliance while maintaining low latency, a necessity for regulated sectors like finance and healthcare.
With these infrastructure and schema patterns, you now have a reproducible, performant ingestion-to-analytics pipeline across ClickHouse, Cassandra, and ScyllaDB — ready for production workloads that ingest millions of events per second and still deliver sub-second insights.
6 Reproducible benchmark plan (so readers can replicate)
Building a benchmark that others can reproduce isn’t just about measuring numbers — it’s about transparency, repeatability, and fairness. Whether you’re comparing ClickHouse, Cassandra, or ScyllaDB, your goal is to observe patterns, not chase vanity metrics. This section lays out a complete plan for setting up, running, and interpreting reproducible high-ingest benchmarks that mirror production-like conditions.
6.1 Goals and methodology
Benchmarking distributed databases requires a balance between precision and realism. The objective is not to find an artificial “winner,” but to reveal where each database thrives under specific workloads.
6.1.1 Transparent workloads, infra spec, configs, and scripts (GitHub repo layout)
Every benchmark should include all relevant assets — infrastructure code, workload definitions, configuration files, and seed datasets — versioned and openly documented. A clean repository layout might look like this:
/benchmarks
/infra
terraform/
aks-clusters/
/workloads
tsbs_clickhouse/
nosqlbench_cassandra/
scylla_bench/
/scripts
seed_data.py
ingest_test.sh
query_test.sh
/dashboards
grafana/
README.md
Each folder defines not just “how to run,” but also why the parameters matter. The repo should include infrastructure specifications (VM sizes, disk types, storage classes, regions) and explicit SLOs — e.g., p95 read latency < 150 ms, p99 ingest latency < 500 ms. Publishing these enables community replication and trust.
6.1.2 Repeatable data generation and seed scripts; publish SLOs (p95/p99)
Generate deterministic data using scripts or standard datasets so that multiple runs yield comparable results. Seed scripts should log generation seeds, timestamps, and schema versions.
# Example: deterministic event generator for telemetry
import random, json, time, hashlib
random.seed(42)
for i in range(10_000_000):
tenant = f"t{random.randint(1,1000)}"
metric = random.random() * 100
ts = int(time.time() * 1000)
event = {"tenant": tenant, "metric": metric, "event_time": ts}
key = hashlib.md5(json.dumps(event, sort_keys=True).encode()).hexdigest()
print(json.dumps({"event_id": key, **event}))
Define SLOs at the start and publish them in README:
- Ingest throughput target: 1 M events/sec per node
- Latency SLOs: p95 < 150 ms read, p99 < 400 ms ingest
- Durability SLO: no data loss under single-node failure
These boundaries convert benchmark numbers into actionable insights.
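Publishing p95/p99 numbers also requires agreeing on how they are computed. A sketch using the nearest-rank method on raw latency samples (the sample values are made up; the 150 ms threshold is the read SLO stated above):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100]; samples need not be sorted."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [5, 7, 9, 12, 15, 18, 40, 60, 120, 350]  # hypothetical read samples
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
assert p95 == 350  # with 10 samples, rank ceil(9.5) = 10 -> the worst sample
assert p99 == 350
print(f"p95={p95}ms -> SLO(p95 < 150ms) {'PASS' if p95 < 150 else 'FAIL'}")
```

Nearest-rank is deliberately pessimistic on small sample counts, which is exactly what you want when a single slow outlier decides whether an SLO holds.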
6.2 Workload generators (OSS)
Workload generators provide consistency and standardization. Using mature OSS tools eliminates bias and offers well-tested scenarios.
6.2.1 TSBS for time-series style OLAP in ClickHouse; queries: rollups, top-K, filters
The Time Series Benchmark Suite (TSBS) is ideal for evaluating ClickHouse’s OLAP efficiency. It simulates IoT and telemetry workloads with flexible schema generation and prebuilt queries.
Example setup (flag names per the TSBS generators/loaders; piping raw CSV straight into clickhouse-client skips TSBS's schema creation, so use the dedicated loader):
# Generate 12h of simulated metrics, then bulk-load with the TSBS ClickHouse loader
tsbs_generate_data --use-case="cpu-only" --format="clickhouse" \
    --seed=42 --scale=1000 \
    --log-interval="10s" \
    --timestamp-start="2025-10-21T00:00:00Z" \
    --timestamp-end="2025-10-21T12:00:00Z" > /tmp/cpu.dat
tsbs_load_clickhouse --host=clickhouse --workers=8 --file=/tmp/cpu.dat
Query patterns include:
- Rollups: aggregate CPU usage by host over 5-minute windows
- Top-K: find top 10 hosts by average load
- Filters: per-tenant or tag-based selections
TSBS reports throughput, query latency distributions, and system resource metrics via Prometheus.
6.2.2 NoSQLBench (NB5) for Cassandra/Scylla; standard scenarios + tlp-stress for targeted patterns and adaptive SLO-driven throughput finding
NoSQLBench 5 (NB5) is the canonical tool for Cassandra-like workloads. It supports flexible YAML-based workloads for ingest, reads, and mixed patterns.
# workload.yaml
blocks:
- name: insert
ops:
- insert: |
insert into transactions(user_id, bucket_id, event_time, txn_id, amount)
values({user_id}, {bucket_id}, toTimestamp(now()), uuid(), {amount})
bindings:
user_id: Mod(1000000)
bucket_id: Mod(31)
amount: Uniform(1..10000)
Run ingest with SLO control:
nb5 run driver=cql workload=workload.yaml \
cycles=10M threads=100 tags=phase:ingest
Then add tlp-stress, a companion tool to find the maximum sustainable throughput for a given SLO. It automatically adapts concurrency to maintain target p99 latency.
6.2.3 Optional: scylla-bench if you want a native baseline; include CDC workloads
Scylla-bench provides native performance baselines and can test features like CDC.
scylla-bench -workload=sequential -mode=write -replication-factor=3 \
-cl=LOCAL_QUORUM -rows-per-request=1000 -duration=10m
For CDC validation, enable the cdc flag and measure additional latency overhead compared to plain inserts.
6.3 Test matrix
Benchmarks must cover ingestion, query, and mixed phases. A well-designed matrix ensures balanced comparison.
6.3.1 Ingest only; ingest+query (concurrency sweep); query-only on hot/cold ranges
Define phases:
- Ingest only: Stress-test write path and measure sustained throughput.
- Ingest + query: Simulate real operations where dashboards query while ingest continues. Sweep concurrency (10 → 500 threads) and record tail latencies.
- Query-only: Evaluate read performance on both “hot” (recent) and “cold” (historical) partitions.
6.3.2 Data shapes: narrow events, wide JSON, high-cardinality dimensions, skewed keys
To capture realistic scenarios, vary data shape and cardinality:
- Narrow rows: fixed schema of 5–10 columns, perfect for columnar engines.
- Wide JSON: semi-structured data up to 50 fields; tests serialization overhead.
- High-cardinality: many unique tenant IDs or device IDs — stresses index efficiency.
- Skewed keys: simulate “hot tenants” causing partition imbalance in Cassandra/Scylla.
Example NB5 binding for skewed distribution:
bindings:
tenant_id: Zipf(1..100000, 1.2)
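To see how skewed a Zipf(s = 1.2) tenant distribution really is, compute the probability mass of the hottest keys directly (pure Python; 1,000 keys here instead of 100,000 to keep the loop fast — skew only increases with more keys):

```python
N, s = 1000, 1.2
weights = [1.0 / k ** s for k in range(1, N + 1)]
total = sum(weights)

hottest_share = weights[0] / total
top10_share = sum(weights[:10]) / total
print(f"hottest key: {hottest_share:.1%}, top 10 keys: {top10_share:.1%}")

# A single tenant absorbs roughly a fifth of all traffic: a classic hot partition.
assert hottest_share > 0.15
assert top10_share > 0.4
```

This is why skewed-key runs belong in the test matrix: uniform-key benchmarks hide exactly the partition imbalance that takes Cassandra/Scylla clusters down in production.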
6.3.3 Retention/TTL scenarios and compaction pressure tests
Simulate data expiration and compaction load. For Cassandra/Scylla, vary TTL (1d, 7d, 30d) and measure disk utilization, tombstones, and compaction latency. For ClickHouse, benchmark merges triggered by TTL deletions to gauge merge debt under heavy churn.
6.4 Environment & configs to publish
6.4.1 AKS versions, node SKUs, disk SKUs/IOPS, operators’ versions and CRs
Always include:
- AKS version (e.g., 1.30)
- Node type (e.g., Standard E16s v5, 16 vCPU, 128 GB RAM)
- Disk SKU (Premium SSD v2, 32K IOPS, 1 TB)
- Operator versions (ClickHouse 1.4.1, K8ssandra 1.8, Scylla 1.12)
- Configuration resources (Helm values, CRDs)
Publishing these ensures anyone can recreate the same infra baseline.
6.4.2 Kafka settings (partitions, acks, compression), client batch sizes
Document every producer/consumer configuration:
| Parameter | Value | Note |
|---|---|---|
| num.partitions | 48 | Parallelism per topic |
| acks | all | Durability |
| compression.type | lz4 | Low CPU overhead |
| batch.size | 131072 | 128 KB batch size |
| linger.ms | 20 | Optimize network efficiency |
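batch.size and linger.ms interact: a producer sends a batch when it fills or when the linger timer expires, whichever comes first. A quick model makes the trade-off concrete (the throughput figures are hypothetical):

```python
def batch_trigger_ms(batch_bytes, msg_bytes, msgs_per_sec, linger_ms):
    """Milliseconds until a producer batch is sent: fill time vs linger timeout."""
    fill_ms = 1000.0 * batch_bytes / (msg_bytes * msgs_per_sec)
    return min(fill_ms, linger_ms)

# At 50k msgs/s of 1 KiB each, a 128 KiB batch fills in ~2.6 ms: linger never fires.
assert abs(batch_trigger_ms(131072, 1024, 50_000, 20) - 2.56) < 0.01
# At 1k msgs/s the batch would take 128 ms to fill, so linger (20 ms) wins.
assert batch_trigger_ms(131072, 1024, 1_000, 20) == 20
```

In other words, linger.ms only adds latency on low-throughput topics; at benchmark ingest rates the batch fills long before the timer matters.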
6.4.3 ClickHouse MergeTree & MV DDLs; Cassandra/Scylla schemas and consistency levels
Include actual DDLs in appendices for full reproducibility.
For Cassandra/Scylla, explicitly state consistency level per test (LOCAL_QUORUM or ONE).
For ClickHouse, include MergeTree settings (index granularity, part size) and materialized view definitions used during aggregation queries.
6.5 Metrics & observability
6.5.1 Throughput, end-to-end ingest latency, p95/p99 read latencies, CPU/IOPS, compaction/merge debt, storage footprint
Instrument everything via Prometheus + Grafana dashboards. Export database metrics, Kafka metrics, and host-level system stats. Track:
- Throughput (events/sec) from producers and confirmed writes.
- End-to-end latency (produce → persist → query-visible).
- p95/p99 latency for both reads and writes.
- CPU/IOPS utilization per node.
- Merge/compaction debt: number of pending merges in ClickHouse; pending compactions in Cassandra/Scylla.
- Storage footprint: ratio of logical vs physical data (compression efficiency).
Example Prometheus expression for pending ClickHouse merges (exact metric names depend on the exporter; with ClickHouse's built-in Prometheus endpoint):
ClickHouseMetrics_BackgroundMergesAndMutationsPoolTask > 0
6.6 Validity & fairness checks
6.6.1 Warm caches vs cold; preload phases; anti-windfall measures; deterministic seeds
Avoid misleading numbers by ensuring fairness:
- Warm cache: measure steady-state after caches stabilize.
- Cold cache: include an explicit cache-flush phase (e.g., sync; echo 3 > /proc/sys/vm/drop_caches, or a service restart) to test cold performance.
- Anti-windfall: randomize operation order between runs to prevent any single engine from benefiting from data layout.
- Deterministic seeds: fix random seeds across workloads to produce comparable datasets.
6.7 Interpreting results: where each engine shines/falters (and why)
After collecting data, interpret beyond raw numbers:
| Metric | ClickHouse | Cassandra | ScyllaDB |
|---|---|---|---|
| Ingest throughput | High (batched) | Very high | Extremely high |
| Query latency (aggregations) | Excellent | Moderate | Moderate |
| Point lookup latency | Moderate | Excellent | Excellent |
| Merge/Compaction cost | Heavy under churn | Steady | Low due to shard-per-core |
| Operational complexity | Medium | High | Moderate |
| Storage efficiency | High (columnar compression) | Medium | Medium |
ClickHouse dominates scan-heavy analytics, Cassandra handles multi-region operational writes, and ScyllaDB wins in ultra-low tail latency ingest. Understanding why — architecture, IO patterns, compaction design — is the real outcome of reproducible benchmarking.
7 Performance tuning & operations cookbook
Once you understand where each database stands, tuning determines how close you can get to its theoretical limits. This section distills proven tuning patterns for sustained high-ingest performance.
7.1 ClickHouse
7.1.1 Table engines: MergeTree vs Summing/AggregatingMergeTree for rollups; cascade MVs; join reordering; skipping indices; partitions vs order-by
Start with MergeTree for raw ingestion, then layer Summing or AggregatingMergeTree for pre-aggregated rollups.
CREATE TABLE telemetry_hourly
(
    tenant String,
    bucket DateTime,
    events UInt64,
    sum_metric Float64
)
ENGINE = SummingMergeTree((events, sum_metric))
PARTITION BY toDate(bucket)
ORDER BY (tenant, bucket);
Note the rollup stores sum_metric, not an average: SummingMergeTree adds column values when it collapses rows, so derive averages as sum_metric / events at query time.
Enable join reordering for complex analytics joins:
SET join_algorithm = 'auto';
SET join_use_nulls = 0;
Always align ORDER BY with your most common filter columns; otherwise, merges become random and slow. Use tokenbf_v1 indices for selective filtering fields.
7.1.2 Ingestion: Kafka Engine vs Kafka Connect vs ClickPipes (managed) selection; dead-letter queues; background merges concurrency
Choose ingestion mode by operational control level:
- Kafka Engine → lowest latency, highest coupling.
- Kafka Connect → decoupled, horizontally scalable.
- ClickPipes (managed service) → fully managed ingestion in cloud deployments.
Tune background merge concurrency in the server configuration (these are server-level settings, not per-session SETs):
<background_pool_size>32</background_pool_size>
<background_schedule_pool_size>32</background_schedule_pool_size>
Monitor merge lag (system.merges) and set up DLQs for bad messages.
7.2 Cassandra (5.0)
7.2.1 Data modeling for time series (query-first), partition sizing, clustering order; SAI index usage guidelines; LCS vs STCS
Design partitions to limit row width (ideally ≤ 100 MB). Use TimeWindowCompactionStrategy (TWCS) for time-bucketed data and LeveledCompactionStrategy (LCS) for read-heavy tables.
compaction:
class: TimeWindowCompactionStrategy
compaction_window_unit: HOURS
compaction_window_size: 1
For SAI usage:
- Index low-cardinality or filter-heavy columns.
- Avoid indexing columns that frequently mutate.
- Validate on-disk index size with nodetool tablestats before scaling to production.
7.2.2 Consistency level tuning (LOCAL_QUORUM defaults), repair/rebuild cadence, MVs guardrails
LOCAL_QUORUM gives strong balance of consistency and latency. Use QUORUM only for cross-DC queries. Automate nodetool repair weekly; daily incremental repairs on hot partitions prevent entropy buildup.
MVs should remain single-table, async; monitor via system.views_builds_in_progress.
7.3 ScyllaDB
7.3.1 Shard-per-core: concurrency per shard, driver routing, prepared statements; IO schedulers and scheduling groups; avoiding cross-shard hot spots
Each shard in ScyllaDB owns part of the token range — meaning per-core isolation. Use shard-aware drivers to maintain routing efficiency.
session = cluster.connect('telemetry')
ps = session.prepare("INSERT INTO events (...) VALUES (...)")
for event in events:
session.execute(ps, event) # Routed per shard automatically
Let Scylla benchmark the disks and generate the IO scheduler configuration (io_properties.yaml) used by its scheduling groups:
scylla_io_setup
Avoid hot spots by salting partition keys (user_id + random(0..7)).
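A deterministic salt derived from something already in the row (here the transaction ID, via CRC32) keeps writes spread while letting readers recompute which sub-partition holds a row; the 8 buckets match the (user_id + random(0..7)) suggestion above:

```python
import zlib

SALT_BUCKETS = 8

def salted_partition(user_id: str, txn_id: str) -> tuple:
    # CRC32 of the txn id picks one of 8 sub-partitions deterministically,
    # so a hot user's writes land on up to 8 shards instead of one.
    salt = zlib.crc32(txn_id.encode()) % SALT_BUCKETS
    return (user_id, salt)

keys = {salted_partition("hot-user", f"txn-{i}") for i in range(1000)}
assert len(keys) == SALT_BUCKETS            # all 8 sub-partitions are used
assert all(0 <= s < SALT_BUCKETS for _, s in keys)
```

The cost is read fan-out: a "last N transactions" query must now hit all eight salted partitions and merge, so keep the bucket count small.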
7.3.2 CDC at scale; MV/secondary indexes reality; tombstone hygiene
Enable CDC with cautious TTLs and adequate disk overhead (~10–20 % additional). Use Scylla Manager to track lag. Clean tombstones with periodic nodetool compact and keep gc_grace_seconds near the repair interval (1–2 days).
7.4 Kafka tier
7.4.1 Producer acks/batching, consumer max-poll, partitions sizing; ISR metrics, under-replicated partitions alarms
Kafka stability underpins database ingest performance. Tune producer settings:
acks=all
linger.ms=30
batch.size=131072
compression.type=lz4
max.in.flight.requests.per.connection=5
Monitor:
- UnderReplicatedPartitions (should be 0).
- ISR size via the Prometheus metric kafka_cluster_partition_underreplicated_count.
Adjust partition count for expected throughput (e.g., one partition per 50 MB/s).
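The sizing heuristic is easy to encode. A sketch that also rounds up to a multiple of the broker count so partition leadership balances evenly (the 50 MB/s per-partition figure is the rule of thumb from the text; the 6-broker default is an assumption):

```python
import math

def partitions_for(target_mb_s: float, per_partition_mb_s: float = 50.0,
                   brokers: int = 6) -> int:
    """Partition count for a target ingest rate, rounded up to fill brokers evenly."""
    needed = math.ceil(target_mb_s / per_partition_mb_s)
    return math.ceil(needed / brokers) * brokers

assert partitions_for(2000) == 42  # 40 partitions needed -> next 6-broker multiple
assert partitions_for(100) == 6    # even small topics get one partition per broker
```

Remember partitions are hard to shrink later, so size for peak plus headroom rather than average throughput.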
7.5 Backup/restore & DR
7.5.1 ClickHouse object-storage snapshots; Cassandra Medusa; Scylla Manager + Operator workflows
For ClickHouse, back up directly to S3-compatible object storage:
BACKUP TABLE telemetry_raw TO S3('s3://backups/clickhouse/', 'AKI...', 'SECRET...');
Cassandra uses Medusa for consistent incremental backups; schedule via K8ssandra Operator. Scylla integrates with Scylla Manager, automating backups, restores, and repairs — crucial for multi-shard resilience.
7.6 Multi-region + hybrid patterns
7.6.1 Cassandra/Scylla multi-DC replication vs ClickHouse DR/replication strategies
Cassandra and Scylla natively support multi-data-center replication with per-DC consistency control:
CREATE KEYSPACE telemetry
WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'eastus': 3, 'westus': 3};
ClickHouse uses replicated MergeTree tables combined with Keeper (ZooKeeper/ClickHouse Keeper) for consistency. For hybrid clouds, use async replication via object storage snapshots or Kafka CDC pipelines.
In summary, sustained real-time analytics performance depends less on a single “fastest engine” and more on aligning architecture, schema, and operations. A reproducible benchmark validates your tuning decisions; disciplined operations keep them true under real production chaos.
8 Putting it together: decision framework, reference designs, and a starter kit
After exploring architecture, ingestion, schema design, and performance, the natural next question is: which database should you actually choose? And more importantly, how do you integrate it cleanly into a repeatable, production-grade system? This final section unifies everything into a practical decision framework, concrete blueprints, and a reproducible lab setup you can clone and extend.
8.1 Decision framework (matrix)
Choosing between ClickHouse, Cassandra, and ScyllaDB depends on your dominant workload shape, latency requirements, and operational model. The following decision points summarize what the previous benchmarks, schema patterns, and operational findings reveal.
8.1.1 “If you need sub-second rollups across billions of rows” → ClickHouse
ClickHouse dominates analytic use cases that demand fast aggregations over massive datasets. Its columnar layout and vectorized execution let you process billions of rows with sub-second latency, ideal for telemetry dashboards, observability, and business analytics.
Typical patterns include:
- Real-time metrics ingestion through Kafka → ClickHouse Table Engine.
- Cascading Materialized Views for minute/hour rollups.
- Interactive dashboards using Grafana or Superset.
Strengths:
- Exceptional compression and scan performance.
- Rich SQL semantics and aggregation functions.
- High freshness for streaming OLAP.
Limitations:
- Not ideal for random writes or high-cardinality point lookups.
- Merge overhead under ultra-high churn if partitioning is poor.
Example:
SELECT tenant, toStartOfMinute(event_time) AS bucket,
count() AS events, avg(metric) AS avg_value
FROM telemetry_raw
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY tenant, bucket
ORDER BY bucket DESC
LIMIT 100;
8.1.2 “If you need ultra-low-latency key lookups + multi-region writes” → Cassandra/Scylla
If your workload involves millions of small writes and sub-10 ms lookups, such as fraud detection, session state, or personalization, Cassandra and ScyllaDB excel. They’re wide-column, distributed databases with tunable consistency and native multi-region replication.
Cassandra offers mature tooling, strong ecosystem integration (DataStax, K8ssandra), and a vast knowledge base. ScyllaDB takes the same API and squeezes every last microsecond out of hardware.
When consistency and regional resilience matter, use their NetworkTopologyStrategy and LOCAL_QUORUM reads/writes for balance.
Example query pattern:
SELECT * FROM transactions
WHERE user_id='u42' AND bucket_id=1024
ORDER BY event_time DESC
LIMIT 20;
8.1.3 “If you need Cassandra API but lower tail latency” → ScyllaDB
ScyllaDB’s shard-per-core design means predictable latency and throughput even under high concurrency. Choose it when you want:
- Drop-in compatibility with Cassandra (CQL, drivers).
- Lower p99 latency for mixed ingest + query workloads.
- Better hardware utilization on dedicated nodes or Kubernetes with pinned CPUs.
It’s particularly compelling when every millisecond counts — for example, in ad bidding, in-game telemetry, or sensor processing.
8.2 Reference blueprints
Once you know which database aligns with your needs, the next step is architecture. The following blueprints are distilled from real-world deployments, combining ingestion, storage, and analytics tiers in reproducible, Kubernetes-native patterns.
8.2.1 Real-time telemetry stack (Kafka → ClickHouse MVs + Materialized Kafka sinks; dashboarding)
This stack supports observability, product analytics, and IoT telemetry. Flow: Applications → Kafka → ClickHouse → Grafana
Components:
- Kafka (48 partitions): handles 1–5 M events/sec.
- ClickHouse Kafka Engine consumes topics and writes to telemetry_raw.
- Materialized Views roll up data into 1 m and 1 h buckets.
- Grafana dashboards query aggregates for near-real-time visualization.
Example topology:
[Producers] → [Kafka Topic: telemetry.events.raw] → [ClickHouse Kafka Engine]
→ [telemetry_raw] → [mv_rollup_1m → telemetry_1m]
Deployment hint: Use the Altinity Operator for declarative ClickHouse clusters. Add object storage (Azure Blob or S3) for snapshot backups.
Example ClickHouse DDL snippet:
CREATE MATERIALIZED VIEW mv_rollup_1m
TO telemetry_1m
AS SELECT tenant, toStartOfMinute(event_time) AS bucket,
count() AS count, avg(metric) AS avg_value
FROM telemetry_raw
GROUP BY tenant, bucket;
Grafana integration:
datasource:
type: clickhouse
url: http://clickhouse:8123
default: true
This architecture gives continuous aggregation and interactive analytics with ~1 s freshness.
8.2.2 Fraud scoring stack (Cassandra/Scylla serving with SAI lookups; optional ClickHouse for offline OLAP)
Fraud workloads demand low-latency lookups and real-time correlation.
Flow: Kafka → Stream Enrichment → Cassandra/ScyllaDB → ClickHouse (optional OLAP)
- Cassandra/Scylla stores the serving layer:
CREATE TABLE transactions (
    user_id TEXT,
    bucket_id INT,
    event_time TIMESTAMP,
    txn_id UUID,
    amount DECIMAL,
    geo TEXT,
    PRIMARY KEY ((user_id, bucket_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
SAI indexes enable range filters:
CREATE CUSTOM INDEX idx_geo ON transactions (geo) USING 'StorageAttachedIndex';
- Stream processor (Flink or Spark Structured Streaming) enriches events with user profiles and historical aggregates.
- Optional ClickHouse sink receives CDC streams for offline analytics:
kafka-topics.sh --create --topic fraud.transactions.cdc
Rollups in ClickHouse enable heatmaps of suspicious regions or users.
This dual-system design decouples real-time serving (Cassandra/Scylla) from analytical reporting (ClickHouse), minimizing query contention.
8.2.3 Mixed architecture (dual-write or CDC replicate to ClickHouse for OLAP)
Many production systems require both high-ingest operational storage and fast analytical querying. A pragmatic design: dual-write Kafka consumers or CDC replication from Cassandra/Scylla → ClickHouse.
Option A: Dual-write from stream processor
for event in stream:
cassandra_session.execute(insert_stmt, event)
clickhouse_client.insert('telemetry_raw', event)
Option B: CDC-driven replication (Scylla, or Cassandra 5.0 via its CDC log). A Kafka Connect configuration for the ScyllaDB CDC Source Connector (property names are illustrative; confirm them against the connector version you deploy):
{
  "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
  "scylla.cluster.ip.addresses": "scylla-0:9042",
  "scylla.name": "fraud-cluster",
  "scylla.table.names": "fraud.transactions"
}
ClickHouse subscribes to these CDC topics via Kafka Engine, providing analytical rollups while Cassandra/Scylla handle real-time decisioning.
Result: You get operational latency (≤ 10 ms) and analytical flexibility (sub-second rollups) in one architecture, without compromising durability or scaling.
8.3 The “reproducible lab” repo contents to ship with the article
To make these blueprints reproducible, the reference repository includes Terraform, Helm, workload scripts, and dashboards ready to deploy on Azure Kubernetes Service (AKS) or any managed Kubernetes platform.
8.3.1 Terraform + AKS, Helm charts (Altinity Operator, K8ssandra, Scylla Operator), Kafka, sample producers, dashboards
Infrastructure (Terraform):
module "aks" {
source = "Azure/aks/azurerm"
cluster_name = "realtime-bench"
node_count = 6
vm_size = "Standard_E16s_v5"
}
Helm installs:
helm install clickhouse altinity/clickhouse-operator
helm install cassandra k8ssandra/k8ssandra
helm install scylla scylla/scylla-operator
helm install kafka bitnami/kafka
Sample producer:
from kafka import KafkaProducer
import json, random, time
producer = KafkaProducer(
bootstrap_servers='kafka:9092',
value_serializer=lambda v: json.dumps(v).encode()
)
while True:
msg = {"tenant": "t1", "metric": random.random()*100, "event_time": int(time.time())}
producer.send('telemetry.events.raw', msg)
time.sleep(0.001)
Dashboards: Prebuilt Grafana panels for:
- Ingest throughput and error rates.
- Merge/compaction lag.
- p95/p99 latency for each database.
8.3.2 Workload scripts (TSBS, NoSQLBench, tlp-stress), Grafana boards, how-to run
Each workload script is parameterized for easy automation:
- TSBS → ClickHouse (time-series OLAP).
- NoSQLBench → Cassandra/Scylla (high-ingest + lookup).
- tlp-stress → adaptive throughput finder.
Example invocation:
nb5 run driver=cql workload=transactions.yaml cycles=10M threads=200 \
tags=phase:ingest host=cassandra.default.svc.cluster.local
Grafana dashboards visualize results live; scripts export summary CSVs for post-run comparison.
The repo includes a make bench target to automate full runs:
make bench ENGINE=cassandra SCALE=medium
Each benchmark logs configuration metadata (node types, partitions, schema) ensuring reproducibility.
8.4 Common pitfalls checklist & final recommendations
A final checklist to keep your deployments healthy and predictable:
| Category | Common Pitfall | Fix |
|---|---|---|
| Schema | Unbounded partitions (Cassandra/Scylla) | Add bucket or shard key |
| Schema | Poor ORDER BY alignment (ClickHouse) | Align with frequent filters |
| Ingestion | Overwhelming merges/compactions | Control batch size, tune merge concurrency |
| Benchmarking | Comparing warm cache vs cold | Separate test phases, document results |
| Operations | Under-provisioned disks for compaction | Size 2–3× expected data footprint |
| Replication | Async replication lag misinterpreted as loss | Monitor LSN/offset lag separately |
| Security | Exposed Kafka brokers without mTLS | Enforce mutual TLS and SASL |
Final recommendations:
- Start from workload shape, not hype. Use the decision matrix to select the right database for the core path.
- Keep ingestion stateless, storage stateful. Kafka and stream processors can autoscale; databases require careful provisioning.
- Measure, don’t assume. Always validate performance under your schema and real data skew.
- Document everything. Reproducibility is your most powerful optimization.
- Embrace hybrid architectures. For many production systems, ClickHouse + Cassandra/Scylla in tandem delivers the best of both worlds — real-time analytics and operational consistency.
By following this blueprint, you not only gain a clear decision path for choosing among ClickHouse, Cassandra, and ScyllaDB but also a fully reproducible lab to test, tune, and evolve your architecture as your data grows.