
Data & Databases · 10 min read · ~25 min study · intermediate

# Data Formats for Financial Systems: CSV, JSON, Parquet, and Beyond

A practical comparison of data formats used in finance — when to use CSV, JSON, Parquet, or columnar storage, and why the choice matters more than you think.

## Why Format Choice Matters

In finance, you move a lot of data around. Market data feeds, trade records, risk reports, model outputs, regulatory filings — it all needs to be stored somewhere and read back efficiently. The format you choose directly impacts storage costs, read and write speed, data integrity, and pipeline complexity.

Choosing the wrong format for a large dataset can turn a 30-second pipeline step into a 30-minute one. Getting this right is one of those quiet decisions that compounds over time.


## The Common Formats

### CSV: The Universal Default

Everyone knows CSV. It is human-readable, universally supported, and dead simple. You can open it in Excel, edit it in a text editor, and every programming language can parse it.

```csv
date,symbol,price,volume
2024-01-15,AAPL,185.60,52340000
2024-01-15,GOOGL,140.25,28100000
```

**Strengths:** Universal compatibility, human readable, great for small datasets and debugging.

**Weaknesses:** No type information (is "100" a number or a string?), no compression, slow to parse at scale, ambiguous handling of special characters and encodings, no schema enforcement.

For [loading into Pandas](/quant-knowledge/python/pandas-for-financial-data-analysis), CSV works fine for files under ~100MB. Beyond that, you should be thinking about better options.
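
If you do reach for CSV, it is worth pinning down types yourself rather than trusting inference. A minimal sketch with pandas, assuming a hypothetical `trades.csv` laid out like the sample above:

```python
import pandas as pd

# Hypothetical file with the columns shown above: date, symbol, price, volume.
# Without explicit dtypes, pandas guesses — a symbol like "NAN" or a
# zero-padded identifier can be silently mangled.
df = pd.read_csv(
    "trades.csv",
    dtype={"symbol": "string", "price": "float64", "volume": "int64"},
    parse_dates=["date"],
)

print(df.dtypes)  # confirm each column landed with the type you intended
```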

### JSON: Structured and Flexible

JSON represents each record as a nested, self-describing document:

```json
{
  "date": "2024-01-15",
  "symbol": "AAPL",
  "price": 185.60,
  "volume": 52340000,
  "metadata": {
    "exchange": "NASDAQ",
    "currency": "USD"
  }
}
```

**Strengths:** Handles nested and semi-structured data well, self-describing, standard for [REST APIs](/quant-knowledge/software-engineering/apis-and-rest-for-financial-data), good type support (numbers vs strings vs booleans).

**Weaknesses:** Verbose (lots of repeated keys), slow to parse large files, not efficient for purely tabular data.

JSON is the right choice for API communication, configuration files, and semi-structured data. It is the wrong choice for storing a billion trade records.
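
As a rough illustration of why JSON's type support and nesting are useful, here is a small sketch using the standard `json` module and pandas' `json_normalize` on a record shaped like the one above (the literal string is just an inline stand-in for an API response):

```python
import json
import pandas as pd

raw = '''{"date": "2024-01-15", "symbol": "AAPL", "price": 185.60,
          "volume": 52340000,
          "metadata": {"exchange": "NASDAQ", "currency": "USD"}}'''

record = json.loads(raw)          # types survive: price is a float, volume an int
print(type(record["price"]))      # <class 'float'>

# Flatten the nested metadata into columns for tabular work
df = pd.json_normalize(record)
print(df.columns.tolist())
# ['date', 'symbol', 'price', 'volume', 'metadata.exchange', 'metadata.currency']
```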

### Parquet: The Modern Standard for Analytics

Parquet is a columnar binary format that has become the de facto standard for data engineering. It is compressed, fast, and preserves full type information.

```python
import pandas as pd

# Write: typically 5-10x smaller than equivalent CSV
df.to_parquet("trades.parquet", compression="snappy")

# Read: 5-10x faster than CSV
df = pd.read_parquet("trades.parquet")

# Read only specific columns — a major advantage
prices = pd.read_parquet("trades.parquet", columns=["symbol", "price", "date"])
```

**Strengths:** Excellent compression, very fast reads (especially column-selective reads), full type preservation, schema embedded in the file, widely supported across the data ecosystem.

**Weaknesses:** Not human-readable, requires libraries to work with, slight overhead for very small files.
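
Because the schema travels with the file, you can inspect a Parquet dataset without reading any rows. A small sketch assuming `pyarrow` (the default Parquet engine for pandas) and the hypothetical `trades.parquet` file from above:

```python
import pyarrow.parquet as pq

# Read only the file metadata — no row data is loaded
schema = pq.read_schema("trades.parquet")
print(schema)  # column names and types embedded in the file

meta = pq.ParquetFile("trades.parquet").metadata
print(meta.num_rows, meta.num_row_groups)
```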

### Other Formats Worth Knowing

**Avro** — row-oriented binary format with schema evolution support. Popular in streaming systems like Kafka.

**ORC** — similar to Parquet, common in the Hadoop ecosystem.

**Protocol Buffers / FlatBuffers** — extremely fast serialisation for inter-service communication, common in [low-latency systems](/quant-knowledge/networking/network-speeds-and-latency-in-financial-systems).

---

## Columnar vs Row Storage

This distinction is important and applies to both file formats and databases.

**Row storage** (CSV, JSON, traditional databases) keeps all fields for a record together. Reading one row is fast. Reading one column across millions of rows is slow because you read everything.

**Columnar storage** (Parquet, ORC, ClickHouse) keeps all values for one column together. Reading specific columns is extremely fast. Compression is better because similar values are adjacent.

For analytics — "give me the average price across 10 million trades" — columnar is dramatically faster because it only touches the price column, ignoring the dozens of other fields. This same principle applies at the database level: row-oriented databases like PostgreSQL versus columnar engines like ClickHouse or Redshift. Your [database design choices](/quant-knowledge/data/database-design-for-trading-systems) should factor this in.
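
A quick, unscientific way to feel the difference is to time the same single-column aggregation against both layouts. The sketch below assumes the same trades table saved as both `trades.csv` and `trades.parquet`; exact numbers depend on hardware and file size, but the gap widens as the table gets wider:

```python
import time
import pandas as pd

def timed(label, fn):
    # Crude wall-clock timer for a single call
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Row-oriented: the whole file is parsed even though we want one column
csv_mean = timed("CSV mean price",
                 lambda: pd.read_csv("trades.csv")["price"].mean())

# Columnar: only the price column is read from disk
pq_mean = timed("Parquet mean price",
                lambda: pd.read_parquet("trades.parquet", columns=["price"])["price"].mean())
```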

---

## Practical Recommendations

| Use Case | Recommended Format |
| --- | --- |
| Small datasets, prototyping | CSV |
| API requests and responses | JSON |
| Data pipeline storage | Parquet |
| Streaming events | Avro or JSON |
| Inter-service communication | Protocol Buffers |
| High-frequency market data | Binary formats or [TSDB](/quant-knowledge/data/time-series-databases-in-finance) |

The general rule: if data will be written once and read many times (which is most analytical data in finance), Parquet is almost always the right choice. The compression saves storage costs, the columnar layout speeds up queries, and the type information prevents subtle bugs.
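
One common write-once-read-many pattern is Hive-style partitioned Parquet, which pandas and pyarrow support directly via `partition_cols`. A minimal sketch, using a toy DataFrame and a hypothetical `trades_partitioned` output directory:

```python
import pandas as pd

# A tiny stand-in for a trades table (columns as in the CSV example above)
df = pd.DataFrame({
    "date": ["2024-01-15", "2024-01-15"],
    "symbol": ["AAPL", "GOOGL"],
    "price": [185.60, 140.25],
    "volume": [52_340_000, 28_100_000],
})

# Hive-style partitioning: one directory per symbol/date pair, so a query
# for "AAPL on 2024-01-15" touches a single small file.
df.to_parquet(
    "trades_partitioned",            # a directory, not a single file
    engine="pyarrow",
    partition_cols=["symbol", "date"],
    compression="snappy",
)

# Readers prune partitions with filters instead of scanning everything
aapl = pd.read_parquet("trades_partitioned", filters=[("symbol", "=", "AAPL")])
```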

How you [build pipelines](/quant-knowledge/big-data/big-data-pipelines-in-finance) around these formats and how you [load them into analysis tools](/quant-knowledge/python/pandas-for-financial-data-analysis) determines whether your data workflows are fast or frustrating. It is one of those decisions worth getting right early.


## Keep Reading

- [Pandas for Financial Data Analysis: Getting Started](/quant-knowledge/python/pandas-for-financial-data-analysis) — How to use Pandas DataFrames for real financial workflows — loading market data, calculating returns, handling time series, and avoiding common pitfalls.
- [Big Data Pipelines in Finance](/quant-knowledge/big-data/big-data-pipelines-in-finance) — How financial firms process massive datasets — batch and streaming architectures, ETL patterns, data lakes, and the tools that power modern data infrastructure.
- [NumPy for Quantitative Finance: A Practical Introduction](/quant-knowledge/python/numpy-for-quantitative-finance) — How NumPy array operations power everything from portfolio risk calculations to Monte Carlo simulations — and why it is so much faster than plain Python.
- [Database Design for Trading Systems](/quant-knowledge/data/database-design-for-trading-systems) — How to structure databases for trading platforms — normalization, schema design, indexing strategies, and the tradeoffs that matter in financial systems.

<!-- KB_ENHANCED_BLOCK_START -->

## What You Will Learn

- Explain why format choice matters.
- Compare the common formats and their tradeoffs.
- Distinguish columnar from row storage.
- Apply the practical recommendations to real pipelines.
- Apply the ideas in *Data Formats: CSV, JSON, Parquet & Beyond* to a US-market quant problem.

## Prerequisites

- Basic SQL — see [Basic SQL](/quant-knowledge/data/sql-for-financial-data).
- Python data tooling — see [Python data tooling](/quant-knowledge/python/pandas-for-financial-data-analysis).
- Comfort reading code and basic statistical notation.
- Curiosity about how the topic shows up in a US trading firm.

## Mental Model

Financial data is heavy, append-only, and queried at every horizon — milliseconds for risk, hours for research, years for compliance. The right storage format is the one that aligns its access pattern with the dominant query, not the one that scales the highest in a vendor pitch. For *Data Formats: CSV, JSON, Parquet & Beyond*, frame the topic as deciding when to use CSV, JSON, Parquet, or columnar storage — and why the choice matters more than you think — and ask what would break if you removed that decision from the workflow.

## Why This Matters in US Markets

US trading firms run on KDB+, ClickHouse, Snowflake, Postgres, and Parquet on S3. Tick stores at Citadel and Jane Street hold petabytes of CME and US equities data going back decades; risk groups at GS and JPM run intraday VaR off Snowflake or Vertica. Knowing how to model and query this data is the line between research and ops.

In US markets, *Data Formats: CSV, JSON, Parquet & Beyond* tends to surface during onboarding, code review, and the first incident a junior quant gets pulled into. Questions on this material recur in interviews at Citadel, Two Sigma, Jane Street, HRT, Jump, DRW, IMC, Optiver, and the major bulge-bracket banks.

## Common Mistakes

- Not partitioning by (symbol, date) and watching every query do a full scan.
- Treating timestamps without timezones — US markets are America/New_York, not UTC, on the wire (see the sketch after this list).
- Joining on price columns subject to floating-point drift instead of integer ticks.
- Treating *Data Formats: CSV, JSON, Parquet & Beyond* as a one-off topic rather than the foundation it becomes once you ship code.
- Skipping the US-market context — copying European or Asian conventions and getting bitten by US tick sizes, settlement, or regulator expectations.
- Optimizing for elegance instead of auditability; trading regulators care about reproducibility, not cleverness.
- Confusing model output with reality — the tape is the source of truth, the model is a hypothesis.
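
To make the timezone point concrete, here is a small sketch converting UTC wire timestamps to America/New_York before deriving the trade date used for partitioning (the timestamps are made up for illustration):

```python
import pandas as pd

# Feeds are often stamped in UTC; US session logic (open/close, trade date)
# lives in America/New_York.
ts_utc = pd.to_datetime(["2024-01-15 14:30:00", "2024-01-15 21:00:00"], utc=True)

ts_ny = ts_utc.tz_convert("America/New_York")
print(ts_ny)  # 09:30 and 16:00 Eastern — the regular session open and close

# Derive the partition/trade date from the NY-local time, otherwise
# late-session trades can land in the wrong date partition.
trade_dates = ts_ny.date
```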

## Practice Questions

1. Design a SQL query that returns the rolling 20-day VWAP for SPY by trade date.
2. Why do most US tick stores partition by date and symbol rather than purely by date?
3. Compare row-oriented Postgres and column-oriented ClickHouse for storing 5 years of NBBO quotes.
4. Define an idempotent UPSERT in Postgres for a daily P&L table keyed by (account, trade_date).
5. What is the difference between a CTE and a subquery from a query-planner perspective?

## Answers and Explanations

1. Use a window function: `SELECT trade_date, SUM(price*volume) OVER w / SUM(volume) OVER w AS vwap_20 FROM trades WHERE symbol='SPY' WINDOW w AS (ORDER BY trade_date ROWS 19 PRECEDING)`. The window function avoids self-joins and runs in a single pass.
2. Because the dominant query is `WHERE symbol = ? AND trade_date BETWEEN ? AND ?`; double partitioning prunes most files at the planner stage. Pure date partitioning forces every per-symbol query to scan every file in the date range.
3. Postgres handles the workload but is 10-100× slower on aggregations because every row must be read even when only `bid` and `ask` are projected. ClickHouse projects only the needed columns and applies block-level compression, so a sum-of-spread over a year completes in seconds rather than minutes.
4. `INSERT INTO pnl(account, trade_date, gross, fees, net) VALUES (...) ON CONFLICT (account, trade_date) DO UPDATE SET gross=EXCLUDED.gross, fees=EXCLUDED.fees, net=EXCLUDED.net;` — running it twice produces the same final state.
5. A CTE is sometimes materialized (as in older Postgres) and sometimes inlined (modern Postgres, Snowflake, BigQuery), while a correlated subquery is typically inlined and may execute once per outer row. The semantic effect is the same; the performance profile is not.

## Glossary

- **OLTP** — Online Transaction Processing; row-oriented, optimized for many small writes (order books, ledgers).
- **OLAP** — Online Analytical Processing; column-oriented, optimized for big aggregate queries (P&L, risk).
- **Index** — an auxiliary data structure (B-tree, hash, bitmap) that speeds up lookups at the cost of write throughput.
- **Normalization** — splitting data across tables to remove redundancy; trades disk for joins.
- **Denormalization** — collapsing tables for read speed; standard for time-series tick stores.
- **CTE** — Common Table Expression; a named subquery defined with `WITH`, often used for readability and recursive logic.
- **Window function** — SQL function that computes over a sliding subset (rolling means, lagged returns).
- **Parquet** — a columnar file format with compression and predicate pushdown; the de-facto research format on S3.

## Further Study Path

- [SQL for Financial Data](/quant-knowledge/data/sql-for-financial-data) — Querying trade data, aggregating positions, joining reference data — the SQL fundamentals that matter for finance.
- [Advanced SQL for Financial Systems](/quant-knowledge/data/advanced-sql-techniques-for-financial-systems) — CTEs, window functions, query optimization — the SQL patterns used in real trading platforms.
- [Database Design for Trading Systems](/quant-knowledge/data/database-design-for-trading-systems) — Schema design, normalization, indexing, and the tradeoffs that matter when latency and correctness collide.
- [Python for Quant Finance: Fundamentals](/quant-knowledge/python/python-for-quant-finance-fundamentals) — Variables, functions, data structures, classes, and error handling — the core Python every quant role expects.
- [Advanced Python for Financial Applications](/quant-knowledge/python/advanced-python-techniques-for-financial-applications) — Decorators, generators, and context managers — the patterns that separate beginner Python from production quant code.

## Key Learning Outcomes

- Explain why format choice matters.
- Apply the common formats appropriately.
- Recognize the difference between columnar and row storage.
- Describe the practical recommendations and when each format fits.
- Walk through how data moves between CSV, JSON, and Parquet in a research pipeline.
- Identify which format suits a given financial dataset.
- Articulate what Parquet's columnar layout buys you in practice.
- Trace how format choices surface at firms like Citadel, Two Sigma, Jane Street, or HRT.
- Map the US regulatory framing — SEC, CFTC, FINRA — onto data-retention and format decisions.
- Give a single-paragraph elevator pitch for the topic suitable for an interviewer.
- Explain one common production failure mode of these formats.
- Recognize when a given format is the wrong tool and what to use instead.
- Describe how format choices interact with the order management and risk gates in a US trading stack.
- Sketch a back-of-the-envelope sanity check that your storage choice is roughly right.
- Name which US firms publicly hire against the skills covered here.
- Identify a follow-up topic from this knowledge base that deepens the material.
- Articulate how this material would appear in a phone screen or onsite interview at a US quant shop.
- Anticipate the day-one mistake a junior would make here and the senior's fix.
- Defend a format or storage design choice in a code review.
- Offer a fresh US-market perspective on the material.

<!-- KB_ENHANCED_BLOCK_END -->