Commit bfb3122

Merge pull request #1463 from Adez017/patch-1
blog for batch vs stream processing
2 parents a3f2a6f + 1e6758b commit bfb3122

4 files changed

Lines changed: 284 additions & 1 deletion

Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
1+
---
2+
title: "Why We Rolled Back Our Kafka Pipeline to Batch After 6 Months"
3+
authors: [Aditya-Singh-Rathore]
4+
sidebar_label: "Why We Rolled Back Our Kafka Pipeline to Batch After 6 Months"
5+
tags: [batch-processing, stream-processing, data-engineering, apache-kafka, apache-flink, apache-spark, data-pipeline, real-time, azure, medallion-architecture, data-architecture]
6+
date: 2026-05-06
7+
8+
description: Everyone talks about the benefits of streaming pipelines — real-time insights, millisecond latency, live dashboards. Nobody talks about what it actually costs you. I rebuilt a working batch pipeline as a streaming system. Here's what I learned the hard way.
9+
10+
draft: false
11+
canonical_url: https://www.recodehive.com/blog/hidden-cost-of-streaming-pipelines
12+
13+
meta:
14+
- name: "robots"
15+
content: "index, follow"
16+
- property: "og:title"
17+
content: "The Hidden Cost of Streaming Pipelines Nobody Talks About"
18+
- property: "og:description"
19+
content: "Everyone talks about the benefits of real-time streaming. Nobody talks about what it actually costs. Here's the honest breakdown from someone who built both."
20+
- property: "og:type"
21+
content: "article"
22+
- property: "og:url"
23+
content: "https://www.recodehive.com/blog/hidden-cost-of-streaming-pipelines"
24+
- property: "og:image"
25+
content: "./img/streaming-hidden-cost-cover.png"
26+
- name: "twitter:card"
27+
content: "summary_large_image"
28+
- name: "twitter:title"
29+
content: "The Hidden Cost of Streaming Pipelines Nobody Talks About"
30+
- name: "twitter:description"
31+
content: "Everyone talks about real-time streaming benefits. Nobody talks about what it costs. Here's the honest breakdown."
32+
- name: "twitter:image"
33+
content: "./img/streaming-hidden-cost-cover.png"
34+
35+
---
36+
37+
<!-- truncate -->
38+
39+
# The Hidden Cost of Streaming Pipelines Nobody Talks About
40+
41+
Everyone in data engineering is obsessed with real time.
42+
43+
Kafka. Flink. Event-driven architectures. Millisecond latency. Live dashboards. It's the direction every conference talk points, every job description asks for, every architecture diagram proudly features.
44+
45+
And I bought into it completely.
46+
47+
About a year into my data engineering career, our product team came to us with a request: customers wanted to see their order status update in real time. Our existing batch pipeline ran at 2am every night, and customers were calling support to ask where their orders were.
48+
49+
Reasonable ask. So we rebuilt the pipeline as a streaming system.
50+
51+
Six months later, I had learned more about the real cost of streaming than any blog post or conference talk had ever prepared me for.
52+
53+
This is that story — and the honest breakdown I wish someone had given me before I started.
54+
55+
56+
## What We Had Before (And Why It Worked)
57+
58+
Our original order pipeline was batch. It ran every night at 2am via Azure Data Factory, pulled 24 hours of orders from our SQL database, ran a Spark transformation job, and wrote clean Delta tables to ADLS Gen2.
59+
60+
```
61+
Every night at 2am:
62+
63+
ADF Pipeline triggers
64+
65+
Pull all orders from the last 24 hours
66+
67+
Spark: clean → deduplicate → join product catalog
68+
69+
Write to Silver layer (Delta table on ADLS Gen2)
70+
71+
Aggregate into Gold layer
72+
73+
Power BI refreshes — customers see updated status
74+
```
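
The Spark step in that flow (clean, deduplicate, join against the product catalog) can be sketched in plain Python. This is only an illustration of the logic, not our actual Spark job: the field names (`order_id`, `product_id`, `updated_at`) and the "keep the latest row per order" rule are assumptions made for the example.

```python
def transform_orders(raw_orders, product_catalog):
    """Clean, deduplicate, and enrich one batch of orders (illustrative only)."""
    # Clean: drop rows missing the keys we need (hypothetical field names)
    cleaned = [o for o in raw_orders if o.get("order_id") and o.get("product_id")]

    # Deduplicate: keep the most recently updated row per order_id.
    # ISO-8601 timestamps in the same format compare correctly as strings.
    latest = {}
    for order in cleaned:
        current = latest.get(order["order_id"])
        if current is None or order["updated_at"] > current["updated_at"]:
            latest[order["order_id"]] = order

    # Join: attach the product name from the catalog (None if not found)
    return [
        {**order, "product_name": product_catalog.get(order["product_id"])}
        for order in latest.values()
    ]
```

Because the input is a bounded set of rows, rerunning this function on the same input always produces the same output, which is what makes "fix it and replay" a safe recovery strategy.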
75+
76+
It ran in 45 minutes. Our Spark cluster spun up, did its job, and shut down. We paid for 45 minutes of compute per day. The pipeline was simple, debuggable, and recoverable: if something broke, we fixed it and replayed from Bronze.
77+
78+
The only problem: customers saw data that was 6 to 30 hours old depending on when they ordered.
79+
80+
For most use cases, that's fine. For order status, it wasn't.
81+
82+
83+
## Hidden Cost #1 - Infrastructure That Never Sleeps
84+
85+
The first thing that surprised me about our streaming pipeline was the infrastructure bill.
86+
87+
Our batch Spark cluster ran 45 minutes a day. Our Kafka + Flink setup runs **every minute of every day** - 24 hours, 7 days a week, whether there are 10 events per second or 10,000.
88+
89+
Streaming infrastructure requires 24/7 uptime. You can't spin it down overnight to save money. You can't schedule it during off-peak hours. The pipeline is always on, always consuming resources, always incurring cost.
90+
91+
For our team, the monthly compute cost for the streaming pipeline was roughly **4x** what the equivalent batch job cost, and that was before accounting for the additional engineering time to maintain it.
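
To put those numbers side by side, here is a rough back-of-the-envelope calculation. It uses only the figures above (45 minutes a day versus always-on, and a 4x cost ratio); the 30-day month is an assumption, and the function name is made up for this sketch.

```python
def uptime_comparison(batch_minutes_per_day=45, cost_ratio=4.0, days_per_month=30):
    """Compare monthly compute hours for batch vs always-on streaming."""
    batch_hours = batch_minutes_per_day / 60 * days_per_month
    streaming_hours = 24 * days_per_month
    uptime_ratio = streaming_hours / batch_hours
    # If streaming runs uptime_ratio times longer but costs only cost_ratio
    # times more, its average hourly rate is this fraction of the batch rate.
    implied_hourly_rate_ratio = cost_ratio / uptime_ratio
    return batch_hours, streaming_hours, uptime_ratio, implied_hourly_rate_ratio


batch_hours, streaming_hours, uptime_ratio, rate_ratio = uptime_comparison()
print(f"Batch: {batch_hours} h/month, streaming: {streaming_hours} h/month")
print(f"Streaming runs {uptime_ratio:.0f}x as many hours")
print(f"Implied streaming hourly rate: {rate_ratio:.3f} of the batch rate")
```

Under these assumptions the streaming setup is up about 32 times as many hours per month. A 4x bill therefore implies the always-on infrastructure costs roughly one eighth as much per hour as the batch cluster, yet the total is still higher because it never shuts down.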
92+
93+
> **The question to ask before going streaming:** Is the business value of real-time data worth 4x the infrastructure cost? Sometimes the answer is yes. Often it isn't.
94+
95+
96+
## Hidden Cost #2 - Late-Arriving Data Will Break Your Logic
97+
98+
In a batch pipeline, late data is not a problem. If an event arrives 3 hours late, it's in the next batch. The pipeline processes it, life goes on.
99+
100+
In a streaming pipeline, late-arriving data is one of the hardest problems in distributed systems.
101+
102+
Events can arrive out of order due to network delays, retries, or clock skew between services. Your Flink job is processing event #1,000 when event #987 suddenly arrives 45 seconds late. What do you do?
103+
104+
The answer involves **watermarking**, telling your stream processor "wait X seconds after the event time before closing a window, to account for late arrivals." But choosing the right watermark is a balance:
105+
106+
- Too short: you miss late-arriving events, your aggregations are wrong
107+
- Too long: you hold state in memory longer, increasing latency and memory pressure
108+
109+
We got this wrong twice before landing on a configuration that worked. Both times, our order counts were silently off by 1-3%, small enough to look like noise, large enough to cause problems in financial reconciliation.
110+
111+
```
112+
Late data problem illustrated:
113+
114+
Event time: 10:00 10:01 10:02 10:03 10:04
115+
Arrived at: 10:00 10:01 10:04 10:03 10:05
116+
117+
event #3 arrived 2 minutes late
118+
— already missed the 10:02 window
119+
— your aggregate is wrong
120+
```
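
The same five events can be run through a toy simulation to show how the allowed delay changes the result. This is a simplified model in plain Python, not Flink code: the function name and the rule "watermark = highest event time seen so far minus the allowed delay" are assumptions chosen to mirror the idea, and real stream processors differ in the details.

```python
from collections import defaultdict

# (event_time, arrival_time) in minutes past 10:00, taken from the illustration
EVENTS = [(0, 0), (1, 1), (2, 4), (3, 3), (4, 5)]


def count_per_window(events, allowed_delay, window_size=1):
    """Count events per tumbling window; drop events whose window has closed."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    # Process events in arrival order, as the stream processor would see them
    for event_time, _arrival in sorted(events, key=lambda e: e[1]):
        if max_event_time is None:
            watermark = float("-inf")
        else:
            watermark = max_event_time - allowed_delay
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end <= watermark:
            dropped.append(event_time)  # window already closed: event is lost
        else:
            counts[window_start] += 1
        max_event_time = event_time if max_event_time is None else max(max_event_time, event_time)
    return dict(counts), dropped


print(count_per_window(EVENTS, allowed_delay=0))
print(count_per_window(EVENTS, allowed_delay=2))
```

With no allowed delay, the 10:02 event is dropped because the 10:03 event has already pushed the watermark past its window. With a two-minute allowance it is counted, at the price of keeping each window open (and in state) for two extra minutes.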
121+
122+
In batch, this doesn't exist as a problem. In streaming, it's a constant engineering challenge.
123+
124+
125+
## Hidden Cost #3 - Exactly-Once Is Harder Than It Sounds
126+
Handling failures in batch pipelines is usually predictable.
127+
If a batch job fails, you typically resolve the issue and rerun the pipeline from the beginning. Since the processing happens on bounded data, recovery is relatively straightforward.
128+
129+
Streaming systems work very differently.
130+
131+
In platforms like Kafka and Flink, data is continuously flowing through the system. If a streaming job crashes midway through processing, recovery becomes much more complex than simply restarting the job.
132+
133+
For example, after recovery:
134+
- Should previously processed events be replayed?
135+
- Could some records get skipped unintentionally?
136+
- Is there a possibility that certain events are processed more than once?
137+
138+
This challenge is commonly addressed through **exactly-once processing guarantees**, where the goal is to ensure that every event affects the system exactly one time even during failures and restarts.
139+
140+
Achieving reliable exactly-once behavior usually depends on several components working together correctly:
141+
142+
- Proper Kafka offset management
143+
- Reliable Flink checkpointing and state recovery
144+
- Idempotent writes to downstream systems
145+
- Consistent state synchronization during failover scenarios
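
Of these, idempotent writes are the easiest to illustrate in isolation. The sketch below is a toy in-memory sink, not a Kafka or Flink API: the class name, the `event_id` field, and the notification counter are all hypothetical, and a real sink would persist the set of applied IDs together with the data.

```python
class IdempotentOrderSink:
    """Toy sink where applying the same event twice has no extra effect."""

    def __init__(self):
        self.status_by_order = {}
        self.applied_event_ids = set()
        self.notifications_sent = 0

    def apply(self, event):
        # A replayed event carries an ID we have already seen: skip it
        if event["event_id"] in self.applied_event_ids:
            return False
        self.applied_event_ids.add(event["event_id"])
        self.status_by_order[event["order_id"]] = event["status"]
        self.notifications_sent += 1
        return True


events = [
    {"event_id": "e1", "order_id": "o1", "status": "PACKED"},
    {"event_id": "e2", "order_id": "o1", "status": "SHIPPED"},
]

sink = IdempotentOrderSink()
for event in events:
    sink.apply(event)

# Simulate a restart that resumes from an earlier offset: e2 is delivered again
for event in events[1:]:
    sink.apply(event)

print(sink.status_by_order, sink.notifications_sent)
```

The replayed event is ignored, so the customer is notified twice in total (once per real status change) rather than three times. Without the ID check, every restart that rewinds the offset would repeat side effects.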
146+
147+
In practice, recovery bugs in streaming systems can have real operational impact. A single restart issue can lead to duplicate event processing, inconsistent downstream data, repeated customer notifications, or inaccurate analytics until the state is corrected.
148+
149+
Unlike batch systems, where failures often leave datasets untouched until rerun, streaming failures can leave systems in partially updated states that are significantly harder to debug and recover from.
150+
151+
152+
## Hidden Cost #4 - Testing Is a Different Discipline
153+
154+
Testing a batch pipeline is relatively straightforward. You have a dataset, you run the transformation, you check the output. Deterministic, reproducible, easy to validate.
155+
156+
Testing a streaming pipeline requires simulating event streams with realistic timing, ordering, and volume. You need to test:
157+
158+
- What happens when events arrive out of order?
159+
- What happens when a consumer crashes and restarts?
160+
- What happens when Kafka lag builds up during a traffic spike?
161+
- What happens when an upstream service sends a malformed event?
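
The out-of-order case is the one that can be checked cheaply without a live stream. As a minimal sketch, assuming a small reducer that picks an order's latest status by event time (the function and field names are made up for this example), a test can feed it every possible arrival order and confirm the answer never changes:

```python
from itertools import permutations


def latest_status(events):
    """Return the status with the greatest event_time, whatever the arrival order."""
    latest = None
    for event in events:
        if latest is None or event["event_time"] > latest["event_time"]:
            latest = event
    return latest["status"] if latest else None


ORDER_EVENTS = [
    {"event_time": 1, "status": "PLACED"},
    {"event_time": 2, "status": "PACKED"},
    {"event_time": 3, "status": "SHIPPED"},
]


def statuses_across_all_orderings(events):
    """Run the reducer over every arrival order and collect the distinct results."""
    return {latest_status(list(ordering)) for ordering in permutations(events)}


print(statuses_across_all_orderings(ORDER_EVENTS))
```

If the set contains more than one status, the logic depends on arrival order. Exhaustive permutations only scale to a handful of events, and this approach says nothing about crashes, lag, or malformed input, which still need stream-level tests.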
162+
163+
We discovered most of our edge cases in production, not in testing. Not because we were careless, but because accurately simulating a live event stream in a test environment is genuinely difficult.
164+
165+
Our batch pipeline had a test suite that ran in 8 minutes. Our streaming pipeline's test suite took 40 minutes and still missed three production bugs in the first month.
166+
167+
168+
169+
## Hidden Cost #5 - Your Team Needs Streaming-Specific Skills
170+
171+
This one is easy to underestimate.
172+
173+
Batch data engineering skills - Spark, SQL, dbt, ADF - are well-understood, well-documented, and widely held. If someone on your team leaves, finding a replacement with those skills is manageable.
174+
175+
Streaming-specific skills - Kafka internals, Flink state management, watermarking strategies, consumer group management, exactly-once configuration - are genuinely harder to find and take longer to develop.
176+
177+
When we hit our first major Flink issue (a state backend misconfiguration causing memory pressure under load), our team spent three days debugging something that an experienced Flink engineer would have spotted in 20 minutes. We didn't have one. We learned on the job, which is fine, but it was expensive learning.
178+
179+
> Before committing to a streaming architecture, ask: does your team have the skills to maintain it? And if not, what's the cost of developing those skills or hiring them?
180+
181+
182+
183+
## So When Is Streaming Actually Worth It?
184+
185+
None of this means streaming is wrong. It means streaming has a real cost that should be weighed against a real business need.
186+
187+
Streaming is worth it when the business problem **genuinely cannot tolerate batch latency.** Here's a clear test:
188+
189+
**Reach for streaming when:**
190+
- Fraud needs to be detected **before** a transaction completes — batch latency means the fraud already happened
191+
- A customer's app needs to reflect a change **within seconds** of it occurring
192+
- A system needs to **react** to an event automatically — alerts, triggers, automated responses
193+
- You're processing IoT sensor data where stale readings are dangerous, not just inconvenient
194+
195+
**Stick with batch when:**
196+
- You're building monthly reports, financial summaries, or historical analyses
197+
- Your stakeholders check dashboards in the morning, not every second
198+
- Your transformations involve complex aggregations over large historical datasets
199+
- Your team is small and operational simplicity matters more than latency
200+
201+
The tech industry is currently obsessed with "real-time," which has led many organizations to over-engineer their stacks, implementing complex stream-processing frameworks where a simple batch job would have sufficed. A well-built batch pipeline is more reliable, cheaper, and easier to maintain than a poorly-justified streaming one.
202+
203+
## The Architecture That Actually Works: Both
204+
205+
Here's what I'd tell myself before starting that project:
206+
207+
**You probably need both, not either/or.**
208+
209+
Our final architecture uses batch for everything that can tolerate it, and streaming only for the specific cases that genuinely can't:
210+
211+
```
212+
Streaming layer (Kafka + Flink):
213+
Order events → real-time status updates (Cassandra)
214+
Fraud signals → real-time alerts (notification service)
215+
216+
Batch layer (Spark + ADF):
217+
Nightly order aggregations → Silver → Gold (Power BI)
218+
Monthly revenue reports (finance team)
219+
ML training datasets (data science team)
220+
```
221+
222+
![Side-by-side architecture diagram showing batch and streaming layers working together. Streaming layer on top handles real-time events via Kafka + Flink into Cassandra. Batch layer below handles nightly Spark jobs into ADLS Gen2 Silver and Gold. Both layers feed into the same OneLake.](./img/batch-streaming-combined-architecture.png)
223+
224+
225+
The streaming layer handles the 5% of use cases where seconds matter. The batch layer handles the 95% where they don't: more reliably, more cheaply, with less operational overhead.
226+
227+
[Microsoft Fabric](https://www.recodehive.com/blog/microsoft-fabric-explained) is built around exactly this pattern: Eventstreams for real-time ingestion, ADF Pipelines and Spark Notebooks for batch transformation, both writing to the same OneLake. You don't have to choose one architecture. You choose the right tool for each use case within the same platform.
228+
229+
230+
## The Honest Summary
231+
232+
| | Batch | Streaming |
233+
|---|---|---|
234+
| **Infrastructure cost** | Low - runs on schedule | High - always on |
235+
| **Latency** | Minutes to hours | Milliseconds to seconds |
236+
| **Late data** | Not a problem | Significant engineering challenge |
237+
| **Failure recovery** | Fix and rerun | Complex - risk of duplicates or data loss |
238+
| **Testing** | Straightforward | Requires stream simulation |
239+
| **Team skills needed** | Spark, SQL, ADF | Kafka, Flink, state management |
240+
| **Best for** | Analytics, reporting, ML | Fraud detection, live status, alerts |
241+
| **Operational complexity** | Low | High |
242+
243+
Streaming pipelines are powerful. They enable product experiences that batch simply can't deliver.
244+
245+
But they come with real costs - infrastructure that never sleeps, late-data handling that never stops being tricky, failure recovery that's genuinely hard to get right, and a skills requirement that's easy to underestimate.
246+
247+
The next time someone on your team says "we should make this real time", ask the question first:
248+
249+
**How long can the business actually wait for this data?**
250+
251+
If the honest answer is "overnight is fine" — keep the batch job. It's not boring. It's the right call.
252+
253+
254+
## References & Further Reading
255+
256+
- [Databricks - Batch vs Streaming](https://docs.databricks.com/aws/en/data-engineering/batch-vs-streaming)
257+
- [Apache Flink - Watermarks and Late Data](https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/time/)
258+
- [Apache Kafka Documentation](https://kafka.apache.org/documentation/)
259+
- [Microsoft Fabric - Real-Time Intelligence](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/overview)
260+
- [RecodeHive - How Netflix Handles Millions of Events Every Minute](https://www.recodehive.com/blog/netflix-data-engineering)
261+
- [RecodeHive - Medallion Architecture Explained](https://www.recodehive.com/blog/medallion-architecture)
262+
- [RecodeHive - Microsoft Fabric: One Platform, One Lake](https://www.recodehive.com/blog/microsoft-fabric-explained)
263+
264+
265+
## About the Author
266+
267+
I'm **Aditya Singh Rathore**, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on [RecodeHive](https://www.recodehive.com/), turning hard-won lessons into content anyone can learn from.
268+
269+
🔗 [LinkedIn](https://www.linkedin.com/in/aditya-singh-rathore0017/) | [GitHub](https://github.com/Adez017)
270+
271+
📩 Have you been burned by a streaming pipeline that didn't need to be? Drop it in the comments.
272+
273+
<GiscusComments/>

src/database/blogs/index.tsx

Lines changed: 11 additions & 1 deletion
@@ -154,7 +154,7 @@ const blogs: Blog[] = [
154154
authors: ["Aditya-Singh-Rathore"],
155155
category: "data engineering",
156156
tags: ["Azure", "Storage", "Data Lake", "ADLS Gen2", "Big Data", "Scalability", "Event Handling", "Technology", "Architecture", "Data Engineering"],
157-
},
157+
},
158158
{
159159
id: 14,
160160
title: "Medallion Architecture: How to Stop Your Data Pipeline from Becoming a Nightmare",
@@ -167,6 +167,16 @@ const blogs: Blog[] = [
167167
tags: ["Medallion Architecture", "Data Pipeline", "Data Management", "Data Quality", "Data Governance", "Scalability", "Data Engineering"],
168168
},
169169
{
170+
id: 16,
171+
title: "Why We Rolled Back Our Kafka Pipeline to Batch After 6 Months",
172+
image: "/img/blogs/batch-vs-stream-cover.png",
173+
description:
174+
"Streaming pipelines are powerful for real-time data processing, but they come with hidden costs that are often overlooked. These costs include increased complexity, higher resource consumption, and potential challenges in maintaining data consistency and reliability. This article explores these hidden costs and provides insights on how to mitigate them.",
175+
slug: "batch-vs-stream-processing",
176+
authors: ["Aditya-Singh-Rathore"],
177+
category: "data engineering",
178+
tags: ["Streaming Pipelines", "Real-Time Data Processing", "Data Consistency", "Data Reliability", "Resource Consumption", "Complexity", "Data Engineering"],
179+
170180
id: 15,
171181
title: "Azure Synapse Analytics: When to Use It (And When to Choose Fabric Instead)",
172182
image: "/img/blogs/azure-synapse-cover.png",