Hybrid Ingestion Architecture: From Batch to Real-Time
SCALE: HEAVY INDUSTRIAL LOGISTICS
PERFORMANCE IMPACT
BEFORE Legacy Single-Node Handlers — 10+ Events/min
AFTER Optimized Distributed Spark Handlers — 10+ Events/min
REDUCTION -64% Memory Overhead
LATENCY: MICRO-BATCH (NEAR REAL-TIME) — P95: 3 MINUTES
ARCHITECTURE: DUAL-PATH FLOW
graph TD
subgraph src ["Source Systems"]
S1[Weigh Stations]
S2[Compliance Systems]
S3[Transport Logistics]
end
subgraph hot ["Hot Path - Real-Time"]
P[AKS Producers] -->|CDC| SB[Service Bus]
SB -->|Events| C[AKS Consumers]
C -->|Upsert| SL[Silver Layer]
end
subgraph cold ["Cold Path - Batch"]
ADF[Data Factory] -->|Incremental| B[Bronze Layer]
B -->|Spark Notebooks| SL
end
S1 --> P
S2 --> ADF
S3 --> ADF
SL --> GL[Gold Layer]
GL --> PBI[Power BI]
BUSINESS PROCESS
The bulk commodity supply chain generates high-volume operational data from weigh stations, compliance systems, and transport logistics. Legacy batch processing incurred significant memory overhead on every run and delivered data with latencies of more than four hours.
We implemented a Dual-Path Architecture:
HOT PATH
Event-driven producers on AKS capture source changes in real time and publish them to Service Bus. Consumers pick up the messages with sub-second latency and upsert them into the Delta Lake Silver layer using distributed Spark handlers.
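As an illustration of the hot path's message contract, the sketch below shows one way a change event could be serialized by a producer and parsed by a consumer. The class name `ChangeEvent` and every field name are hypothetical and are not taken from the platform's schema; the actual Service Bus publish call is left out so the example stays self-contained.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any


@dataclass(frozen=True)
class ChangeEvent:
    # All field names here are illustrative assumptions.
    source: str               # originating system, e.g. "weigh_station"
    key: str                  # primary key of the changed row
    op: str                   # "upsert" or "delete"
    sequence: int             # increases with every change to the same key
    payload: dict[str, Any]   # changed column values

    def to_message(self) -> str:
        """Serialize to the JSON body a producer would publish."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_message(body: str) -> "ChangeEvent":
        """Parse a message body back into an event on the consumer side."""
        return ChangeEvent(**json.loads(body))
```

Carrying a per-key sequence number in the message lets a consumer discard stale or redelivered events, which keeps the downstream upsert idempotent.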
COLD PATH
Azure Data Factory orchestrates scheduled batch ingestion. Synapse Spark notebooks handle complex aggregations and historical snapshots for analytics.
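The diagram labels the Data Factory to Bronze edge as "Incremental". One common way to implement incremental loads is a high-watermark filter, sketched below in plain Python. The column name `modified_at` and the function name `incremental_slice` are assumptions made for illustration; the source does not state which incremental strategy the pipelines use.

```python
from datetime import datetime
from typing import Any


def incremental_slice(
    rows: list[dict[str, Any]], watermark: datetime
) -> tuple[list[dict[str, Any]], datetime]:
    """Return the rows changed after `watermark` and the next watermark.

    `modified_at` is a hypothetical last-modified column on the source table.
    """
    fresh = [r for r in rows if r["modified_at"] > watermark]
    # If nothing changed, keep the old watermark so the next run re-checks
    # from the same point.
    next_watermark = max((r["modified_at"] for r in fresh), default=watermark)
    return fresh, next_watermark
```

Each scheduled run would persist the returned watermark and pass it to the next run, so only rows changed since the previous load are copied into Bronze.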
GOLD LAYER
Aggregated views power live Power BI dashboards for operational monitoring and executive reporting.
TECH STACK
Azure Data Factory, Azure Kubernetes Service, Azure Service Bus, Azure Synapse, Delta Lake, Apache Kafka / Service Bus, Apache Spark, Power BI, Azure DevOps
TECHNICAL ARCHITECTURE
The platform implements a Dual-Path Architecture to balance high-velocity data needs with complex batch processing. The Cold Path utilizes Azure Data Factory to orchestrate scheduled ingestion into a Medallion Lakehouse (Bronze-Silver-Gold).
The Hot Path is powered by a custom-built Producer-Consumer pattern running on Azure Kubernetes Service (AKS).
- Producers: Lightweight workers capture Change Data Capture (CDC) events from source databases and publish them as JSON messages to Azure Service Bus topics.
- Consumers: Horizontally scalable listeners that process incoming messages, apply business enrichment logic, and perform ACID-compliant upserts via Delta Lake. The consumers were moved from single-node delta-rs native handlers to distributed Spark handlers to resolve memory bottlenecks encountered during complex merge operations at scale.
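The consumer-side upsert can be sketched roughly as follows, assuming PySpark with the `delta-spark` package. The column names `key` and `sequence`, the function names, and the flat event shape are hypothetical; this is not the platform's actual handler. The batch is first reduced to one event per key, because a Delta `MERGE` fails when several source rows try to modify the same target row.

```python
from typing import Any


def latest_per_key(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the highest-sequence event per key within one micro-batch."""
    latest: dict[str, dict[str, Any]] = {}
    for event in events:
        current = latest.get(event["key"])
        if current is None or event["sequence"] > current["sequence"]:
            latest[event["key"]] = event
    return list(latest.values())


def upsert_batch(spark, events: list[dict[str, Any]], silver_path: str) -> int:
    """Merge one micro-batch of flat event dicts into a Delta table.

    `spark` is an active SparkSession configured for Delta Lake, and
    `silver_path` is a placeholder for the Silver table location.
    """
    # Imported lazily so the pure-Python helper above works without Spark.
    from delta.tables import DeltaTable

    rows = latest_per_key(events)
    if not rows:
        return 0
    source = spark.createDataFrame(rows)
    target = DeltaTable.forPath(spark, silver_path)
    (
        target.alias("t")
        .merge(source.alias("s"), "t.key = s.key")
        # Only overwrite when the incoming event is newer than the stored row.
        .whenMatchedUpdateAll(condition="s.sequence > t.sequence")
        .whenNotMatchedInsertAll()
        .execute()
    )
    return len(rows)
```

The sequence guard on the matched branch makes redelivered or out-of-order messages harmless, since an older event never overwrites a newer row.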