Big Data · 12 min read · ~27 min study · intermediate
Big Data Pipelines in Finance
How financial firms process massive datasets — batch and streaming architectures, ETL patterns, data lakes, and the tools that power modern data infrastructure.
The Data Challenge in Finance
Financial firms are drowning in data. Market data feeds deliver millions of events per second. Trade records accumulate by the millions daily. Alternative data sources — satellite imagery, social media sentiment, web traffic — add another layer of volume. Regulatory reporting requires comprehensive historical data access.
Processing this data reliably, at scale, and on time is what data engineering is about. Get it right and your quant researchers have clean, timely data to build models on. Get it wrong and everyone downstream suffers: models train on stale data, reports are late, and risk calculations are incomplete.
Batch vs Streaming
Data pipelines fall into two categories, and most financial systems use both.
Batch Processing
Process large volumes of data at scheduled intervals. "Every night at midnight, calculate the end-of-day positions and P&L for every account."
# Simplified batch pipeline
def daily_eod_pipeline(date: str):
    # Extract
    trades = read_trades_from_database(date)
    market_data = read_closing_prices(date)
    positions = read_current_positions()

    # Transform
    enriched_trades = enrich_with_market_data(trades, market_data)
    daily_pnl = calculate_pnl(enriched_trades, positions, market_data)
    updated_positions = update_positions(positions, trades)

    # Load
    write_to_data_warehouse(daily_pnl)
    write_to_reporting_database(updated_positions)
    generate_regulatory_report(daily_pnl, updated_positions)
Batch is appropriate for: end-of-day processing, historical analysis, regulatory reporting, model training.
Stream Processing
Process data as it arrives, in real time or near real time. "As each trade executes, update the position, recalculate risk, and check compliance."
# Simplified streaming pipeline using Kafka
from kafka import KafkaConsumer

consumer = KafkaConsumer('trades', bootstrap_servers='kafka:9092')

for message in consumer:
    trade = deserialise(message.value)
    update_position(trade)
    updated_risk = recalculate_risk(trade.symbol)
    if updated_risk > risk_limit:
        send_alert(trade.symbol, updated_risk)
    publish_to_dashboard(trade, updated_risk)
Streaming is appropriate for: real-time risk monitoring, live P&L, trade surveillance, market data processing.
The Modern Data Stack
Apache Kafka: The Message Bus
Kafka is the backbone of most streaming architectures in finance. It is a distributed message queue that can handle millions of messages per second with built-in durability and replayability.
Producers publish messages to topics. Consumers subscribe to topics and process messages. Messages are persisted, so if a consumer fails, it can restart and pick up where it left off.
Key concepts:
- Topics — named channels (trades, market-data, risk-alerts)
- Partitions — the unit of parallelism within a topic
- Consumer groups — multiple consumers sharing the load
- Retention — messages kept for a configurable time (days, weeks, forever)
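To make the producer side concrete, here is a minimal sketch using the kafka-python client; the topic name, broker address, and message fields are illustrative assumptions rather than a prescribed schema.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# Keying by symbol routes every message for that symbol to the same partition,
# which preserves per-symbol ordering for downstream consumers
trade = {"symbol": "AAPL", "price": 185.20, "qty": 100, "side": "BUY"}
producer.send('trades', key=trade["symbol"].encode('utf-8'), value=trade)
producer.flush()

Adding a group_id when constructing the consumer shown earlier turns independent consumers into a consumer group that splits the topic's partitions among themselves.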
Apache Spark: Batch and Micro-Batch
Spark processes large datasets in parallel across a cluster. It is the standard for batch processing at scale and can handle streaming via Spark Structured Streaming.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("DailyPnL").getOrCreate()

# Read trades from data lake
trades = spark.read.parquet("s3://data-lake/trades/date=2024-01-15/")

# Aggregate P&L by account and symbol
pnl = trades.groupBy("account_id", "symbol").agg(
    F.sum((F.col("exit_price") - F.col("entry_price")) * F.col("quantity")).alias("pnl"),
    F.count("*").alias("trade_count"),
    F.sum("quantity").alias("total_volume"),
)

# Write results to data warehouse
pnl.write.mode("overwrite").parquet("s3://warehouse/daily-pnl/date=2024-01-15/")
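Since Spark also handles streaming via Structured Streaming, here is a hedged sketch of the same trades topic consumed as a micro-batch stream; the schema, checkpoint path, and console sink are assumptions for illustration.

from pyspark.sql.types import StructType, StringType, DoubleType, LongType

schema = (StructType()
          .add("symbol", StringType())
          .add("price", DoubleType())
          .add("qty", LongType()))

# Kafka delivers raw bytes; parse the JSON payload into typed columns
trades_stream = (spark.readStream.format("kafka")
                 .option("kafka.bootstrap.servers", "kafka:9092")
                 .option("subscribe", "trades")
                 .load()
                 .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
                 .select("t.*"))

# Running notional traded per symbol, refreshed every micro-batch
notional = trades_stream.groupBy("symbol").agg(
    F.sum(F.col("price") * F.col("qty")).alias("notional"))

query = (notional.writeStream
         .outputMode("complete")
         .format("console")  # stand-in for a real sink
         .option("checkpointLocation", "/tmp/checkpoints/notional")  # path assumed
         .start())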
Apache Airflow: Orchestration
Airflow schedules and monitors pipelines. It defines workflows as directed acyclic graphs (DAGs) — ensuring tasks run in the right order and handling retries and failures.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

dag = DAG(
    'daily_eod_pipeline',
    schedule_interval='0 22 * * 1-5',  # 10 PM on weekdays
    start_date=datetime(2024, 1, 1),
)

extract = PythonOperator(task_id='extract_trades', python_callable=extract_trades, dag=dag)
transform = PythonOperator(task_id='calculate_pnl', python_callable=calculate_pnl, dag=dag)
load = PythonOperator(task_id='write_reports', python_callable=write_reports, dag=dag)

extract >> transform >> load  # Define execution order
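The DAG above runs tasks in order but does not show the retry handling mentioned earlier. A common pattern is to configure it through default_args; the values below are illustrative assumptions, not recommended settings.

from datetime import timedelta

default_args = {
    'retries': 3,                         # rerun a failed task up to three times
    'retry_delay': timedelta(minutes=5),  # wait between attempts
    'email_on_failure': True,             # alert the on-call data engineer
}

dag = DAG(
    'daily_eod_pipeline',
    default_args=default_args,
    schedule_interval='0 22 * * 1-5',
    start_date=datetime(2024, 1, 1),
)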
The Data Lake Pattern
A data lake stores raw data in its original format — no upfront schema required. Data is structured and cleaned when it is read (schema-on-read) rather than when it is written (schema-on-write).
Data Lake (S3 / Azure Blob)
├── raw/                  # Raw data as received
│   ├── trades/
│   ├── market-data/
│   └── reference-data/
├── processed/            # Cleaned and normalized
│   └── market-data/
└── curated/              # Business-ready datasets
    ├── daily-pnl/
    ├── positions/
    └── risk-reports/
The lake stores data in Parquet format for analytical workloads, partitioned by date for efficient querying. Cloud object storage (S3) provides virtually unlimited, cheap, durable storage.
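As a sketch of how this plays out in Spark (paths and column names are assumptions): writing with partitionBy lays files out in date=YYYY-MM-DD directories, and a filter on the partition column lets the engine prune the scan to a single directory instead of reading the whole lake.

# Write: one directory per trading day, e.g. .../trades/date=2024-01-15/
(enriched_trades.write
 .partitionBy("date")
 .mode("append")
 .parquet("s3://data-lake/processed/trades/"))

# Read: the filter on the partition column prunes the scan to one directory
one_day = (spark.read.parquet("s3://data-lake/processed/trades/")
           .filter(F.col("date") == "2024-01-15"))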
Data Quality
Bad data in, bad decisions out. Data quality checks are not optional in finance:
from datetime import datetime

def validate_trade_data(df):
    checks = []

    # Completeness
    null_counts = df.isnull().sum()
    checks.append(("No nulls in required fields",
                   null_counts[["symbol", "price", "qty"]].sum() == 0))

    # Range checks
    checks.append(("All prices positive", (df["price"] > 0).all()))
    checks.append(("All quantities non-zero", (df["qty"] != 0).all()))

    # Consistency
    checks.append(("Sides valid", df["side"].isin(["BUY", "SELL"]).all()))

    # Freshness
    latest = df["timestamp"].max()
    checks.append(("Data is recent",
                   (datetime.now() - latest).total_seconds() < 3600))  # one-hour threshold assumed

    return checks
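A hypothetical driver for these checks, under the assumption that a failed check should stop the pipeline rather than let bad data flow downstream:

results = validate_trade_data(trades_df)
failed = [name for name, passed in results if not passed]
if failed:
    # Fail loudly: a late report beats a wrong one
    raise ValueError(f"Data quality checks failed: {failed}")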
Keep Reading
- [Data Formats for Financial Systems: CSV, JSON, Parquet, and Beyond](/quant-knowledge/data/data-formats-for-financial-systems) — A practical comparison of data formats used in finance — when to use CSV, JSON, Parquet, or columnar storage, and why the choice matters more than you think.
- [Database Design for Trading Systems](/quant-knowledge/data/database-design-for-trading-systems) — How to structure databases for trading platforms — normalization, schema design, indexing strategies, and the tradeoffs that matter in financial systems.
- [AWS Essentials for Financial Services](/quant-knowledge/cloud/aws-essentials-for-financial-services) — The core AWS services that matter for finance — EC2, S3, RDS, Lambda, and the architectural patterns used in trading platforms and data pipelines.
- [Time Series Databases in Finance: When Relational Is Not Enough](/quant-knowledge/data/time-series-databases-in-finance) — Why financial firms use specialized time series databases for market data, tick storage, and monitoring — and when you should consider one.
What You Will Learn
- Explain the data challenge in finance.
- Compare batch and streaming processing and choose between them.
- Describe the components of the modern data stack.
- Explain the data lake pattern.
- Design basic data quality checks.
Prerequisites
- SQL essentials — see SQL for Financial Data in the study path below.
- Comfort reading code and basic statistical notation.
- Curiosity about how the topic shows up in a US trading firm.
Mental Model
Big data in finance is mostly tick data — billions of rows that look almost identical. The job is to compress, partition, and query them at the speed of business questions, not at the speed of a single laptop. Frame Big Data Pipelines in Finance as the infrastructure that turns raw feeds into usable datasets — the batch and streaming architectures, ETL jobs, and data lakes covered above — and ask what would break if you removed it from the workflow.
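A back-of-the-envelope sketch of why a single laptop is not enough; every number here is an illustrative assumption, not a quoted statistic.

msgs_per_day = 10_000_000_000   # assumed: one heavy US market data feed
bytes_per_msg = 40              # assumed: per message after columnar compression
daily_tb = msgs_per_day * bytes_per_msg / 1e12
print(f"~{daily_tb:.1f} TB/day, ~{daily_tb * 252:.0f} TB/year")  # ~0.4 TB/day, ~101 TB/year

Multiply by several feeds and several years of history and petabytes arrive quickly.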
Why This Matters in US Markets
Petabyte-scale market data is a US industry. CME ESS captures every CME futures tick; SIP and direct feeds capture every US equities print; OPRA captures every options quote. US firms — Citadel, Two Sigma, Renaissance, Jane Street — build proprietary stacks on top of Spark, Snowflake, ClickHouse, and home-grown columnar engines.
In US markets, Big Data Pipelines in Finance tends to surface during onboarding, code review, and the first incident a junior quant gets pulled into. Questions on this material recur in interviews at Citadel, Two Sigma, Jane Street, HRT, Jump, DRW, IMC, Optiver, and the major bulge-bracket banks.
Common Mistakes
- Designing a streaming pipeline when batch would be cheaper and just as timely.
- Repartitioning unnecessarily and triggering a full shuffle.
- Treating late events as failures instead of routing them through a side output.
- Treating Big Data Pipelines in Finance as a one-off topic rather than the foundation it becomes once you ship code.
- Skipping the US-market context — copying European or Asian conventions and getting bitten by US tick sizes, settlement, or regulator expectations.
- Optimizing for elegance instead of auditability; trading regulators care about reproducibility, not cleverness.
- Confusing model output with reality — the tape is the source of truth, the model is a hypothesis.
Practice Questions
- Why does shuffle dominate Spark job costs on tick data?
- Define a watermark in a streaming aggregation, and explain why it matters for late-arriving CME prints.
- What is the practical difference between Iceberg and Delta Lake for a US tick warehouse?
- Why do bid/ask quote tables compress 50× while trade tables compress 5-10×?
- When is a streaming-only architecture wrong for a US risk job?
Answers and Explanations
- Tick data is wide and joins on (symbol, timestamp) move enormous amounts of data across the network; partitioning data on those keys at write time is the standard fix.
- A watermark is the cutoff time after which late events are dropped or routed to a side output; it lets the system emit windowed aggregates without waiting forever for stragglers (see the sketch after this list).
- Both add ACID tables on top of Parquet on object storage. Iceberg has stronger multi-engine support (Trino, Spark, Flink, Snowflake); Delta is tighter with Databricks. Pick by the engine your firm already runs.
- Quote messages are dominated by tiny price increments and repeated identifiers, which dictionary and run-length encoding crush; trade messages have higher entropy in price and size and compress less.
- When the calculation requires a global aggregate (intraday VaR across the whole portfolio) that depends on out-of-order or revised inputs; lambda or kappa architectures combining batch and stream are usually the fit.
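A minimal watermark sketch in Spark Structured Streaming, assuming a streaming DataFrame with an event_time timestamp column and one-minute volume windows:

windowed_volume = (trades_stream
    .withWatermark("event_time", "10 minutes")  # tolerate ten minutes of lateness
    .groupBy(F.window("event_time", "1 minute"), "symbol")
    .agg(F.sum("qty").alias("volume")))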
Glossary
- Shuffle — the all-to-all data movement step in distributed jobs; usually the dominant cost.
- Partition — a chunk of a dataset stored together; partition keys decide what gets shuffled.
- Skew — uneven partition sizes that bottleneck a job.
- Lakehouse — a data lake (object storage + open formats) combined with table-level transactions (Delta, Iceberg, Hudi).
- Catalog — a metadata service tracking tables, schemas, and partitions (Glue, Hive Metastore, Unity Catalog).
- Streaming — processing data as it arrives instead of in scheduled batches.
- Watermark — the cutoff time after which late-arriving events are dropped or sent to a side output.
- Backpressure — slowing producers when consumers cannot keep up.
Further Study Path
- Python for Quant Finance: Fundamentals — Variables, functions, data structures, classes, and error handling — the core Python every quant role expects.
- Advanced Python for Financial Applications — Decorators, generators, and context managers — the patterns that separate beginner Python from production quant code.
- NumPy for Quantitative Finance — Why array operations power everything from portfolio risk to Monte Carlo — and why they outpace plain Python.
- Pandas for Financial Data Analysis — Loading market data, calculating returns, handling time series, and avoiding the common pitfalls.
- SQL for Financial Data — Querying trade data, aggregating positions, joining reference data — the SQL fundamentals that matter for finance.
Key Learning Outcomes
- Explain the data challenge in finance.
- Choose between batch and streaming processing.
- Recognize the modern data stack.
- Describe the data lake pattern.
- Walk through data quality checks.
- Identify how core big-data concepts apply to pipelines in finance.
- Articulate how ETL applies to big data pipelines in finance.
- Trace how streaming applies to big data pipelines in finance.
- Map how big data pipelines surface at Citadel, Two Sigma, Jane Street, or HRT.
- Pinpoint the US regulatory framing — SEC, CFTC, FINRA — relevant to big data pipelines in finance.
- Deliver a single-paragraph elevator pitch for big data pipelines in finance suitable for an interviewer.
- Describe one common production failure mode of the techniques in big data pipelines in finance.
- Recognize when a big data pipeline is the wrong tool and what to use instead.
- Describe how big data pipelines interact with the order management and risk gates in a US trading stack.
- Walk through a back-of-the-envelope sanity check that shows your implementation is roughly right.
- Identify which US firms publicly hire against the skills covered here.
- Choose a follow-up topic from this knowledge base that deepens big data pipelines in finance.
- Trace how big data pipelines in finance would appear in a phone screen or onsite interview at a US quant shop.
- Map the day-one mistake a junior would make on a big data pipeline and the senior's fix.
- Defend a design choice involving big data pipelines in finance in a code review.