Big Data · 12 min read · ~27 min study · intermediate
Big Data Pipelines in Finance
How financial firms process massive datasets — batch and streaming architectures, ETL patterns, data lakes, and the tools that power modern data infrastructure.
The Data Challenge in Finance
Financial firms are drowning in data. Market data feeds deliver millions of events per second. Trade records accumulate by the millions daily. Alternative data sources — satellite imagery, social media sentiment, web traffic — add another layer of volume. Regulatory reporting requires comprehensive historical data access.
Processing this data reliably, at scale, and on time is what data engineering is about. Get it right and your quant researchers have clean, timely data to build models on. Get it wrong and everyone downstream suffers: models train on stale data, reports are late, and risk calculations are incomplete.
Batch vs Streaming
Data pipelines fall into two categories, and most financial systems use both.
Batch Processing
Process large volumes of data at scheduled intervals. "Every night at midnight, calculate the end-of-day positions and P&L for every account."
# Simplified batch pipeline
def daily_eod_pipeline(date: str):
    # Extract
    trades = read_trades_from_database(date)
    market_data = read_closing_prices(date)
    positions = read_current_positions()

    # Transform
    enriched_trades = enrich_with_market_data(trades, market_data)
    daily_pnl = calculate_pnl(enriched_trades, positions, market_data)
    updated_positions = update_positions(positions, trades)

    # Load
    write_to_data_warehouse(daily_pnl)
    write_to_reporting_database(updated_positions)
    generate_regulatory_report(daily_pnl, updated_positions)
Batch is appropriate for: end-of-day processing, historical analysis, regulatory reporting, model training.
Stream Processing
Process data as it arrives, in real time or near real time. "As each trade executes, update the position, recalculate risk, and check compliance."
# Simplified streaming pipeline using Kafka
from kafka import KafkaConsumer

consumer = KafkaConsumer('trades', bootstrap_servers='kafka:9092')

for message in consumer:
    trade = deserialise(message.value)
    update_position(trade)
    updated_risk = recalculate_risk(trade.symbol)
    if updated_risk > risk_limit:
        send_alert(trade.symbol, updated_risk)
    publish_to_dashboard(trade, updated_risk)
Streaming is appropriate for: real-time risk monitoring, live P&L, trade surveillance, market data processing.
The Modern Data Stack
Apache Kafka: The Message Bus
Kafka is the backbone of most streaming architectures in finance. It is a distributed message queue that can handle millions of messages per second with built-in durability and replayability.
Producers publish messages to topics. Consumers subscribe to topics and process messages. Messages are persisted, so if a consumer fails, it can restart and pick up where it left off.
Key concepts:
- Topics — named channels (trades, market-data, risk-alerts)
- Partitions — the unit of parallelism within a topic
- Consumer groups — multiple consumers sharing the load
- Retention — messages kept for a configurable time (days, weeks, forever)
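To make the producer side concrete, here is a minimal sketch using the kafka-python client; the topic name, broker address, and message fields are illustrative assumptions rather than a prescribed schema.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# Keying by symbol routes every message for that symbol to the same partition,
# which preserves per-symbol ordering for downstream consumers
trade = {"symbol": "AAPL", "price": 185.20, "qty": 100, "side": "BUY"}
producer.send('trades', key=trade["symbol"].encode('utf-8'), value=trade)
producer.flush()

Adding a group_id when constructing the consumer shown earlier turns independent consumers into a consumer group that splits the topic's partitions among themselves.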
Apache Spark: Batch and Micro-Batch
Spark processes large datasets in parallel across a cluster. It is the standard for batch processing at scale and can handle streaming via Spark Structured Streaming.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("DailyPnL").getOrCreate()

# Read trades from data lake
trades = spark.read.parquet("s3://data-lake/trades/date=2024-01-15/")

# Aggregate P&L by account and symbol
pnl = trades.groupBy("account_id", "symbol").agg(
    F.sum((F.col("exit_price") - F.col("entry_price")) * F.col("quantity")).alias("pnl"),
    F.count("*").alias("trade_count"),
    F.sum("quantity").alias("total_volume"),
)

# Write results to data warehouse
pnl.write.mode("overwrite").parquet("s3://warehouse/daily-pnl/date=2024-01-15/")
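Since Spark also handles streaming via Structured Streaming, here is a hedged sketch of the same trades topic consumed as a micro-batch stream; the schema, checkpoint path, and console sink are assumptions for illustration.

from pyspark.sql.types import StructType, StringType, DoubleType, LongType

schema = (StructType()
          .add("symbol", StringType())
          .add("price", DoubleType())
          .add("qty", LongType()))

# Kafka delivers raw bytes; parse the JSON payload into typed columns
trades_stream = (spark.readStream.format("kafka")
                 .option("kafka.bootstrap.servers", "kafka:9092")
                 .option("subscribe", "trades")
                 .load()
                 .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
                 .select("t.*"))

# Running notional traded per symbol, refreshed every micro-batch
notional = trades_stream.groupBy("symbol").agg(
    F.sum(F.col("price") * F.col("qty")).alias("notional"))

query = (notional.writeStream
         .outputMode("complete")
         .format("console")  # stand-in for a real sink
         .option("checkpointLocation", "/tmp/checkpoints/notional")  # path assumed
         .start())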
Apache Airflow: Orchestration
Airflow schedules and monitors pipelines. It defines workflows as directed acyclic graphs (DAGs) — ensuring tasks run in the right order and handling retries and failures.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

dag = DAG(
    'daily_eod_pipeline',
    schedule_interval='0 22 * * 1-5',  # 10 PM on weekdays
    start_date=datetime(2024, 1, 1),
)

extract = PythonOperator(task_id='extract_trades', python_callable=extract_trades, dag=dag)
transform = PythonOperator(task_id='calculate_pnl', python_callable=calculate_pnl, dag=dag)
load = PythonOperator(task_id='write_reports', python_callable=write_reports, dag=dag)

extract >> transform >> load  # Define execution order
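The DAG above runs tasks in order but does not show the retry handling mentioned earlier. A common pattern is to configure it through default_args; the values below are illustrative assumptions, not recommended settings.

from datetime import timedelta

default_args = {
    'retries': 3,                         # rerun a failed task up to three times
    'retry_delay': timedelta(minutes=5),  # wait between attempts
    'email_on_failure': True,             # alert the on-call data engineer
}

dag = DAG(
    'daily_eod_pipeline',
    default_args=default_args,
    schedule_interval='0 22 * * 1-5',
    start_date=datetime(2024, 1, 1),
)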
The Data Lake Pattern
A data lake stores raw data in its original format — no upfront schema required. Data is structured and cleaned when it is read (schema-on-read) rather than when it is written (schema-on-write).
Data Lake (S3 / Azure Blob)
├── raw/                  # Raw data as received
│   ├── trades/
│   ├── market-data/
│   └── reference-data/
├── processed/            # Cleaned and normalized
│   └── market-data/
└── curated/              # Business-ready datasets
    ├── daily-pnl/
    ├── positions/
    └── risk-reports/
The lake stores data in Parquet format for analytical workloads, partitioned by date for efficient querying. Cloud object storage (S3) provides virtually unlimited, cheap, durable storage.
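As a sketch of how this plays out in Spark (paths and column names are assumptions): writing with partitionBy lays files out in date=YYYY-MM-DD directories, and a filter on the partition column lets the engine prune the scan to a single directory instead of reading the whole lake.

# Write: one directory per trading day, e.g. .../trades/date=2024-01-15/
(enriched_trades.write
 .partitionBy("date")
 .mode("append")
 .parquet("s3://data-lake/processed/trades/"))

# Read: the filter on the partition column prunes the scan to one directory
one_day = (spark.read.parquet("s3://data-lake/processed/trades/")
           .filter(F.col("date") == "2024-01-15"))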
Data Quality
Bad data in, bad decisions out. Data quality checks are not optional in finance:
from datetime import datetime

def validate_trade_data(df):
    checks = []

    # Completeness
    null_counts = df.isnull().sum()
    checks.append(("No nulls in required fields",
                   null_counts[["symbol", "price", "qty"]].sum() == 0))

    # Range checks
    checks.append(("All prices positive", (df["price"] > 0).all()))
    checks.append(("All quantities non-zero", (df["qty"] != 0).all()))

    # Consistency
    checks.append(("Sides valid", df["side"].isin(["BUY", "SELL"]).all()))

    # Freshness
    latest = df["timestamp"].max()
    checks.append(("Data is recent",
                   (datetime.now() - latest).total_seconds() < 3600))  # one-hour threshold assumed

    return checks
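A hypothetical driver for these checks, under the assumption that a failed check should stop the pipeline rather than let bad data flow downstream:

results = validate_trade_data(trades_df)
failed = [name for name, passed in results if not passed]
if failed:
    # Fail loudly: a late report beats a wrong one
    raise ValueError(f"Data quality checks failed: {failed}")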
Keep Reading
- [Data Formats for Financial Systems: CSV, JSON, Parquet, and Beyond](/quant-knowledge/data/data-formats-for-financial-systems) — A practical comparison of data formats used in finance — when to use CSV, JSON, Parquet, or columnar storage, and why the choice matters more than you think.
- [Database Design for Trading Systems](/quant-knowledge/data/database-design-for-trading-systems) — How to structure databases for trading platforms — normalization, schema design, indexing strategies, and the tradeoffs that matter in financial systems.
- [AWS Essentials for Financial Services](/quant-knowledge/cloud/aws-essentials-for-financial-services) — The core AWS services that matter for finance — EC2, S3, RDS, Lambda, and the architectural patterns used in trading platforms and data pipelines.
- [Time Series Databases in Finance: When Relational Is Not Enough](/quant-knowledge/data/time-series-databases-in-finance) — Why financial firms use specialized time series databases for market data, tick storage, and monitoring — and when you should consider one.
What You Will Learn
- Explain the data challenge in finance.
- Compare batch and streaming processing and choose between them.
- Describe the components of the modern data stack.
- Explain the data lake pattern.
- Design basic data quality checks.
Prerequisites
- SQL essentials — see SQL for Financial Data in the study path below.
- Comfort reading code and basic statistical notation.
- Curiosity about how the topic shows up in a US trading firm.
Mental Model
Big data in finance is mostly tick data — billions of rows that look almost identical. The job is to compress, partition, and query them at the speed of business questions, not at the speed of a single laptop. Frame Big Data Pipelines in Finance as the infrastructure that turns raw feeds into usable datasets — the batch and streaming architectures, ETL jobs, and data lakes covered above — and ask what would break if you removed it from the workflow.
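A back-of-the-envelope sketch of why a single laptop is not enough; every number here is an illustrative assumption, not a quoted statistic.

msgs_per_day = 10_000_000_000   # assumed: one heavy US market data feed
bytes_per_msg = 40              # assumed: per message after columnar compression
daily_tb = msgs_per_day * bytes_per_msg / 1e12
print(f"~{daily_tb:.1f} TB/day, ~{daily_tb * 252:.0f} TB/year")  # ~0.4 TB/day, ~101 TB/year

Multiply by several feeds and several years of history and petabytes arrive quickly.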
Why This Matters in US Markets
Petabyte-scale market data is a US industry. CME ESS captures every CME futures tick; SIP and direct feeds capture every US equities print; OPRA captures every options quote. US firms — Citadel, Two Sigma, Renaissance, Jane Street — build proprietary stacks on top of Spark, Snowflake, ClickHouse, and home-grown columnar engines.
In US markets, Big Data Pipelines in Finance tends to surface during onboarding, code review, and the first incident a junior quant gets pulled into. Questions on this material recur in interviews at Citadel, Two Sigma, Jane Street, HRT, Jump, DRW, IMC, Optiver, and the major bulge-bracket banks.
Common Mistakes
- Designing a streaming pipeline when batch would be cheaper and just as timely.
- Repartitioning unnecessarily and triggering a full shuffle.
- Treating late events as failures instead of routing them through a side output.
- Treating Big Data Pipelines in Finance as a one-off topic rather than the foundation it becomes once you ship code.
- Skipping the US-market context — copying European or Asian conventions and getting bitten by US tick sizes, settlement, or regulator expectations.
- Optimizing for elegance instead of auditability; trading regulators care about reproducibility, not cleverness.
- Confusing model output with reality — the tape is the source of truth, the model is a hypothesis.
Practice Questions
- Why does shuffle dominate Spark job costs on tick data?
- Define a watermark in a streaming aggregation, and explain why it matters for late-arriving CME prints.
- What is the practical difference between Iceberg and Delta Lake for a US tick warehouse?
- Why do bid/ask quote tables compress 50× while trade tables compress 5-10×?
- When is a streaming-only architecture wrong for a US risk job?
Answers and Explanations
- Tick data is wide and joins on (symbol, timestamp) move enormous amounts of data across the network; partitioning data on those keys at write time is the standard fix.
- A watermark is the cutoff time after which late events are dropped or routed to a side output; it lets the system emit windowed aggregates without waiting forever for stragglers (see the sketch after this list).
- Both add ACID tables on top of Parquet on object storage. Iceberg has stronger multi-engine support (Trino, Spark, Flink, Snowflake); Delta is tighter with Databricks. Pick by the engine your firm already runs.
- Quote messages are dominated by tiny price increments and repeated identifiers, which dictionary and run-length encoding crush; trade messages have higher entropy in price and size and compress less.
- When the calculation requires a global aggregate (intraday VaR across the whole portfolio) that depends on out-of-order or revised inputs; lambda or kappa architectures combining batch and stream are usually the fit.
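A minimal watermark sketch in Spark Structured Streaming, assuming a streaming DataFrame with an event_time timestamp column and one-minute volume windows:

windowed_volume = (trades_stream
    .withWatermark("event_time", "10 minutes")  # tolerate ten minutes of lateness
    .groupBy(F.window("event_time", "1 minute"), "symbol")
    .agg(F.sum("qty").alias("volume")))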
Glossary
- Shuffle — the all-to-all data movement step in distributed jobs; usually the dominant cost.
- Partition — a chunk of a dataset stored together; partition keys decide what gets shuffled.
- Skew — uneven partition sizes that bottleneck a job.
- Lakehouse — a data lake (object storage + open formats) combined with table-level transactions (Delta, Iceberg, Hudi).
- Catalog — a metadata service tracking tables, schemas, and partitions (Glue, Hive Metastore, Unity Catalog).
- Streaming — processing data as it arrives instead of in scheduled batches.
- Watermark — the cutoff time after which late-arriving events are dropped or sent to a side output.
- Backpressure — slowing producers when consumers cannot keep up.
Further Study Path
- Python for Quant Finance: Fundamentals — Variables, functions, data structures, classes, and error handling — the core Python every quant role expects.
- Advanced Python for Financial Applications — Decorators, generators, and context managers — the patterns that separate beginner Python from production quant code.
- NumPy for Quantitative Finance — Why array operations power everything from portfolio risk to Monte Carlo — and why they outpace plain Python.
- Pandas for Financial Data Analysis — Loading market data, calculating returns, handling time series, and avoiding the common pitfalls.
- SQL for Financial Data — Querying trade data, aggregating positions, joining reference data — the SQL fundamentals that matter for finance.
Key Learning Outcomes
- Explain the data challenge in finance.
- Choose between batch and streaming processing.
- Recognize the modern data stack.
- Describe the data lake pattern.
- Walk through data quality checks.
- Identify how core big-data concepts apply to pipelines in finance.
- Articulate how ETL applies to big data pipelines in finance.
- Trace how streaming applies to big data pipelines in finance.
- Map how big data pipelines surface at Citadel, Two Sigma, Jane Street, or HRT.
- Pinpoint the US regulatory framing — SEC, CFTC, FINRA — relevant to big data pipelines in finance.
- Deliver a single-paragraph elevator pitch for big data pipelines in finance suitable for an interviewer.
- Describe one common production failure mode of the techniques in big data pipelines in finance.
- Recognize when a big data pipeline is the wrong tool and what to use instead.
- Describe how big data pipelines interact with the order management and risk gates in a US trading stack.
- Walk through a back-of-the-envelope sanity check that shows your implementation is roughly right.
- Identify which US firms publicly hire against the skills covered here.
- Choose a follow-up topic from this knowledge base that deepens big data pipelines in finance.
- Trace how big data pipelines in finance would appear in a phone screen or onsite interview at a US quant shop.
- Map the day-one mistake a junior would make on a big data pipeline and the senior's fix.
- Defend a design choice involving big data pipelines in finance in a code review.