Scaling Python Financial Models on AWS: A Serverless Batch Processing Approach
How to take a Python financial model from running 150 scenarios in a Lambda function to processing over a million using AWS Step Functions, Batch, and Fargate — without managing a single server.
From Laptop to a Million Scenarios
You've built a financial model in Python. It runs beautifully on your laptop — perhaps a portfolio stress-testing model that grinds through a few hundred economic scenarios. You've shown it to the right people, they're suitably impressed, and now someone has uttered the dreaded words: "Can we scale this up?"
Suddenly, 150 scenarios isn't enough. They want a million. And they want it running in the cloud. Serverless, naturally, because nobody wants the overhead of managing dedicated infrastructure.
This article walks through how to architect a large-scale serverless batch processing pipeline on AWS — one that takes a Python financial model and runs it at serious scale. Having taken this exact journey, from a humble Lambda function processing 150 scenarios to over a million using the setup described below, I can confirm: it works, and it's rather satisfying when it does.
The Problem (and Why Lambda Won't Cut It)
Let's set the scene. You have a portfolio model. It takes a set of financial scenarios — interest rate paths, equity shocks, credit spread movements, and so on — and runs your portfolio through each one. At small scale, AWS Lambda is perfectly adequate. Cheap, simple, no infrastructure to worry about.
But Lambda has limits. A 15-minute execution timeout, 10GB of memory, and no sensible way to coordinate thousands of parallel invocations without building your own orchestration layer. Once you need to process hundreds of thousands (or millions) of scenarios, you need something more industrial.
The answer is a serverless batch processing architecture using AWS Step Functions, AWS Batch, and AWS Fargate. Think of it as the next step up from Lambda — purpose-built for exactly this kind of workload.
The Architecture: A Bird's Eye View
Here's how the pieces fit together, end to end.
API Gateway and Authentication
Everything starts with an API call. AWS API Gateway provides the HTTPS endpoint, and you bolt on authentication — whether that's Cognito, API keys, or a custom authoriser. This is the entry point to the pipeline, and authentication ensures only authorised clients can trigger a run.
Step Functions: The Orchestrator
AWS Step Functions is the conductor of this entire orchestra. It manages the workflow as a state machine: what runs first, what runs in parallel, and what happens when something fails — which, in distributed computing, you should always plan for.
The workflow runs in three stages:
Stage 1 — Shard the data. A Fargate task splits your scenario dataset into manageable chunks, which we'll call "shards." If you have a million scenarios and want each processing task to handle 500, that's 2,000 shards. This task writes the shard definitions (essentially pointers to subsets of your input data) and registers them for processing.
Stage 2 — Process in parallel. Step Functions triggers AWS Batch, which orchestrates the spinning up of thousands of concurrent Fargate tasks. Each task picks up a shard of scenarios, runs your Python model against them, and writes the results. AWS Batch handles the scheduling, queuing, and retry logic. You just tell it how many jobs you need and let it get on with things.
Stage 3 — Collate the results. Once every shard is processed, Step Functions calls a final Fargate task that stitches the individual outputs back together into one coherent dataset. This is where your aggregated risk figures, portfolio P&L distributions, or whatever your model produces come together in a single, consumable output.
It's a classic scatter-gather pattern, and Step Functions handles the coordination beautifully — including retries, error handling, and giving you visibility into exactly which step you're on at any given moment.
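To make Stage 1 concrete, here is a minimal sketch of the sharding task in Python with boto3. The bucket, table, and key schema are illustrative assumptions, not the production setup:

```python
import json

import boto3

# Illustrative names -- substitute your own resources.
BUCKET = "scenario-runs"
TABLE = "shard-tracking"
SHARD_SIZE = 500  # scenarios per processing task

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(TABLE)


def shard_scenarios(run_id: str, scenario_ids: list) -> int:
    """Split the scenario set into shards, write each shard manifest
    to S3, and seed a 'pending' tracking row per shard in DynamoDB."""
    shards = [
        scenario_ids[i : i + SHARD_SIZE]
        for i in range(0, len(scenario_ids), SHARD_SIZE)
    ]
    for idx, shard in enumerate(shards):
        key = f"{run_id}/shards/{idx:05d}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(shard))
        table.put_item(Item={
            "run_id": run_id,    # partition key (assumed schema)
            "shard_id": idx,     # sort key (assumed schema)
            "status": "pending",
            "manifest_key": key,
        })
    return len(shards)  # 1,000,000 scenarios / 500 = 2,000 shards
```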
AWS Batch and Fargate: The Heavy Lifting
AWS Batch is the job scheduler. You tell it "I need 2,000 containers to run this Docker image with these parameters," and it works out how to make that happen. Fargate provides the compute — serverless containers that spin up on demand without you provisioning a single server.
Each Fargate task runs your Python model against its allocated shard of scenarios. When it's done, it writes the results and reports back. No servers to patch, no clusters to manage, and no on-call overhead for the underlying infrastructure.
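One convenient way to express "run this image 2,000 times" is an AWS Batch array job: a single submit_job call with arrayProperties, after which each child task reads its own index from the AWS_BATCH_JOB_ARRAY_INDEX environment variable. A hedged sketch, with queue and job-definition names assumed:

```python
import os

import boto3

batch = boto3.client("batch")

# --- Submitter side: one call fans out into 2,000 child tasks ---
response = batch.submit_job(
    jobName="portfolio-run",
    jobQueue="fargate-spot-queue",      # illustrative queue name
    jobDefinition="scenario-worker:3",  # illustrative job definition
    arrayProperties={"size": 2000},     # one child task per shard
    containerOverrides={
        "environment": [{"name": "RUN_ID", "value": "run-001"}],
    },
)
print("Submitted array job:", response["jobId"])

# --- Worker side: each container derives its shard from its index ---
shard_id = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
```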
Tracking Progress with DynamoDB
When you've got thousands of tasks running concurrently, you need to know what's going on. A DynamoDB table serves as the tracking layer — each shard gets a row recording its status (pending, running, complete, failed), the task ID processing it, and the output location.
DynamoDB is ideal here because it handles thousands of concurrent writes with ease, scales automatically, and you pay per request. It gives you a real-time view of progress: a quick query tells you that 1,847 of 2,000 shards are complete, 150 are still running, and 3 have failed and need retrying. It also stores the output object ID for each shard, linking the tracking data to the actual results.
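As a sketch against the same assumed table schema, each worker flips its shard's row when it finishes, and a query over the run's rows yields the progress snapshot described above:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("shard-tracking")  # assumed name


def mark_complete(run_id: str, shard_id: int, output_key: str) -> None:
    """Record a finished shard and where its output landed in S3."""
    table.update_item(
        Key={"run_id": run_id, "shard_id": shard_id},
        UpdateExpression="SET #s = :done, output_key = :out",
        ExpressionAttributeNames={"#s": "status"},  # 'status' is reserved
        ExpressionAttributeValues={":done": "complete", ":out": output_key},
    )


def progress(run_id: str) -> dict:
    """Count shards by status, e.g. {'complete': 1847, 'running': 150, 'failed': 3}."""
    counts = {}
    kwargs = {"KeyConditionExpression": Key("run_id").eq(run_id)}
    while True:
        resp = table.query(**kwargs)
        for item in resp["Items"]:
            counts[item["status"]] = counts.get(item["status"], 0) + 1
        if "LastEvaluatedKey" not in resp:
            return counts
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]  # page through
```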
Data Storage: S3 and DynamoDB Working Together
For storing actual results data — scenario outputs, intermediate calculations, final aggregated datasets — the pattern that works well is a combination of S3 and DynamoDB:
- S3 stores the blob objects — the actual data files (Parquet, CSV, JSON, or whatever format your model outputs).
- DynamoDB stores the metadata — what each object is, when it was created, which run it belongs to, where it lives in S3, and any summary statistics.
This separation keeps things clean. S3 is phenomenally cheap for storing large objects and handles virtually unlimited data. DynamoDB gives you fast, indexed lookups on the metadata without having to list S3 buckets — which, at scale, is both slow and surprisingly expensive.
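A sketch of the write path under this split, with the bucket and table names assumed: the Parquet file goes to S3, and a small metadata row goes to DynamoDB pointing at it.

```python
import io
import time

import boto3
import pandas as pd

s3 = boto3.client("s3")
meta = boto3.resource("dynamodb").Table("result-metadata")  # assumed table


def save_results(run_id: str, shard_id: int, df: pd.DataFrame) -> str:
    """Write shard results to S3 as Parquet and index them in DynamoDB."""
    key = f"{run_id}/results/{shard_id:05d}.parquet"
    buf = io.BytesIO()
    df.to_parquet(buf)  # needs pyarrow or fastparquet installed
    s3.put_object(Bucket="scenario-runs", Key=key, Body=buf.getvalue())
    meta.put_item(Item={
        "run_id": run_id,
        "object_id": key,              # where the blob lives in S3
        "created_at": int(time.time()),
        "row_count": len(df),          # a cheap summary statistic
    })
    return key
```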
What About SQL?
You could absolutely use a relational database — RDS PostgreSQL or Aurora — instead of DynamoDB for the metadata layer. SQL gives you richer querying, joins, and a more familiar interface for most developers.
The trade-off is that RDS requires provisioned capacity (or Aurora Serverless, which has its own quirks around cold starts), and it won't handle thousands of concurrent writes as gracefully as DynamoDB without careful connection pooling. For pure metadata tracking in a high-concurrency batch pipeline, DynamoDB tends to be the better fit. For downstream analytics and reporting on the results, SQL is often more practical.
Many setups use both — DynamoDB during the run, then ETL the results into a SQL database for analysis afterwards. Belt and braces, as they say.
Parallelism Within Each Fargate Task
Your Fargate tasks don't have to be single-threaded. If you allocate a task with 4 vCPUs, you can use Python's multiprocessing module (or concurrent.futures.ProcessPoolExecutor) to process multiple scenarios simultaneously within each container.
This gives you two levels of parallelism: AWS Batch handles the horizontal scaling across thousands of containers, and multiprocessing handles the vertical scaling within each container. The sweet spot depends on your model — CPU-bound financial models benefit enormously from multiprocessing, while I/O-bound workloads might get more from threading or async approaches.
One word of caution: Python's Global Interpreter Lock means that threading won't give you genuine parallelism for CPU-bound work. Use multiprocessing instead. This is one of those Python quirks that catches people out at least once, so worth being aware of from the start.
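A minimal sketch of the in-container fan-out, assuming a pure-Python, CPU-bound run_scenario function (a placeholder for your actual model):

```python
import os
from concurrent.futures import ProcessPoolExecutor


def run_scenario(scenario: dict) -> dict:
    """Placeholder for the real financial model -- CPU-bound work."""
    raise NotImplementedError


def process_shard(scenarios: list) -> list:
    """Run the shard's scenarios across all vCPUs in this Fargate task."""
    # Processes, not threads: the GIL blocks thread-level parallelism
    # for CPU-bound work, so use one worker process per vCPU.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(run_scenario, scenarios, chunksize=10))
```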
Fargate Spot vs On-Demand: The Cost Question
AWS Fargate Spot instances run on spare capacity and cost up to 70% less than on-demand pricing. The catch? AWS can reclaim them with two minutes' notice. For batch processing, Spot is often an excellent choice. Your tasks are short-lived, idempotent (they can be safely retried if interrupted), and you've got DynamoDB tracking which shards are complete. If a Spot instance gets reclaimed mid-shard, the shard simply gets reprocessed. You might lose a few minutes of compute, but you save a small fortune over the course of a large run.
The pragmatic approach: run most of your tasks on Spot and keep a smaller on-demand allocation as a fallback for the stragglers. AWS Batch supports mixed compute environments, so you can configure this without too much fuss.
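Sketched with boto3 (environment names, sizing, and network details are all placeholder assumptions), the mixed setup is two compute environments attached to one job queue, ordered so that Spot is tried first:

```python
import boto3

batch = boto3.client("batch")
network = {  # placeholder subnets and security groups
    "subnets": ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"],
    "securityGroupIds": ["sg-0123456789abcdef0"],
}

# Two compute environments: Spot for the bulk, on-demand as a fallback.
for name, capacity, max_vcpus in [
    ("scenario-spot", "FARGATE_SPOT", 8000),
    ("scenario-ondemand", "FARGATE", 1000),
]:
    batch.create_compute_environment(
        computeEnvironmentName=name,
        type="MANAGED",
        state="ENABLED",
        computeResources={"type": capacity, "maxvCpus": max_vcpus, **network},
    )

# The queue tries environments in order: Spot first, then on-demand.
batch.create_job_queue(
    jobQueueName="fargate-spot-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "scenario-spot"},
        {"order": 2, "computeEnvironment": "scenario-ondemand"},
    ],
)
```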
Networking: The Bit Everyone Forgets
To run Fargate tasks, you need a VPC with subnets. This sounds straightforward enough, but there's a detail that trips up nearly everyone on their first large-scale run: each Fargate task consumes one IP address in the subnet.
If you're spinning up 2,000 concurrent tasks, you need at least 2,000 available IP addresses across your subnets. A /24 CIDR block gives you 251 usable addresses — nowhere near enough. You'll want a /20 or larger (4,091 usable addresses) to give yourself headroom. Running out of IP addresses mid-run is not a situation you want to find yourself in.
Plan your VPC CIDR ranges and subnet sizing carefully from the start. Expanding them later is possible but tedious, and it's much easier to get right the first time.
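The arithmetic is easy to sanity-check with the standard library's ipaddress module, remembering that AWS reserves five addresses in every subnet:

```python
import ipaddress

AWS_RESERVED = 5  # network, VPC router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Usable addresses in an AWS subnet of the given CIDR block."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED


print(usable_ips("10.0.0.0/24"))  # 251  -- not enough for 2,000 tasks
print(usable_ips("10.0.0.0/20"))  # 4091 -- comfortable headroom
```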
Spreading Across Availability Zones and Regions
To access enough Fargate Spot capacity for large runs, you'll want subnets spread across multiple Availability Zones within a region. AWS Batch can distribute tasks across AZs automatically, which both improves Spot availability and gives you resilience if one AZ has issues.
For truly massive scale, you can even look at running across multiple AWS regions. This adds meaningful complexity — you'll need to replicate your container images to ECR in each region, coordinate job submissions across regions, and aggregate results back to a central location — but it gives you access to a substantially larger pool of Spot capacity. Most setups won't need this, but it's worth knowing the option exists when scenario counts grow beyond what a single region can comfortably support.
The Results: From 150 to Over a Million
This architecture has been used to scale a financial model from running roughly 150 scenarios inside a single Lambda function to processing over a million scenarios in a single batch run. The model code itself barely changed — the same Python, packaged into a Docker container, running in parallel across thousands of Fargate tasks.
The scaling is largely linear: double the scenarios, double the tasks, roughly double the cost, same wall-clock time. That's the beauty of embarrassingly parallel workloads — they scale predictably and efficiently.
Wrapping Up
If you've got a Python financial model that's outgrown Lambda and you need to process scenarios at serious scale, the combination of Step Functions, AWS Batch, and Fargate gives you a fully serverless architecture that can handle millions of scenarios without you managing a single server.
The key ingredients: API Gateway as the entry point, Step Functions for orchestration, AWS Batch and Fargate for elastic compute, DynamoDB for progress tracking and metadata, S3 for data storage, and careful VPC planning so you don't run out of IP addresses at the worst possible moment.
It's not trivial to set up — there are a good number of moving parts, and you'll spend more time than you'd like reading AWS documentation about subnet CIDR ranges. But once it's running, it's remarkably robust, cost-effective with Spot pricing, and scales well beyond what any single machine could handle.
Keep Reading
- [AWS Essentials for Financial Services](/quant-knowledge/cloud/aws-essentials-for-financial-services) — The core AWS services that matter for finance — EC2, S3, RDS, Lambda, and the architectural patterns used in trading platforms and data pipelines.
- [Cloud Computing for Finance: Getting Started](/quant-knowledge/cloud/cloud-computing-for-finance) — What cloud computing means for financial services — the major providers, core services, cost models, and why finance firms are migrating to the cloud.
- Docker and Containers for Financial Applications — How containerisation works, why finance teams use Docker, and practical patterns for packaging and deploying trading system components.
- [Big Data Pipelines in Finance](/quant-knowledge/big-data/big-data-pipelines-in-finance) — How financial firms process massive datasets — batch and streaming architectures, ETL patterns, data lakes, and the tools that power modern data infrastructure.
What You Will Learn
- Explain the journey from a laptop model running 150 scenarios to a million in the cloud.
- Understand why Lambda alone won't cut it at scale.
- Describe the architecture end to end: API Gateway, Step Functions, AWS Batch, and Fargate.
- Track progress across thousands of concurrent tasks with DynamoDB.
- Design data storage with S3 and DynamoDB working together.
- Apply parallelism within each Fargate task.
Prerequisites
- Cloud fundamentals — see Cloud Computing for Finance: Getting Started.
- Containers — see Docker and Containers for Financial Applications.
- Comfort reading code and basic statistical notation.
- Curiosity about how the topic shows up in a US trading firm.
Mental Model
The cloud is a way to rent capacity by the second instead of buying servers by the rack. The trick is knowing what to keep on-prem (latency-sensitive, regulated) and what to push to the cloud (elastic research, batch risk, archival data). For Scaling Python Financial Models on AWS, frame the topic as the piece that takes you from 150 scenarios in a Lambda to a million via Step Functions, Batch, and Fargate, and ask what would break if you removed it from the workflow.
Why This Matters in US Markets
US capital markets cloud adoption is regulator-blessed: the SEC and FINRA have published cloud guidance, and Goldman, JPMorgan, and Morgan Stanley have publicly migrated huge portions of risk and research to AWS or Azure. Latency-critical paths still sit in Equinix NY4, NY5, and Chicago CH2 — but everything else is moving to elastic capacity.
In US markets, Scaling Python Financial Models on AWS tends to surface during onboarding, code review, and the first incident a junior quant gets pulled into. Questions on this material recur in interviews at Citadel, Two Sigma, Jane Street, HRT, Jump, DRW, IMC, Optiver, and the major bulge-bracket banks.
Common Mistakes
- Storing tick data in us-west-2 while compute lives in us-east-1 and paying egress on every join.
- Granting wildcard (*) IAM permissions in dev and forgetting to tighten them in prod.
- Ignoring cold-start latency on a synchronous Lambda that fronts a 30 ms SLA.
- Treating Scaling Python Financial Models on AWS as a one-off topic rather than the foundation it becomes once you ship code.
- Skipping the US-market context — copying European or Asian conventions and getting bitten by US tick sizes, settlement, or regulator expectations.
- Optimizing for elegance instead of auditability; trading regulators care about reproducibility, not cleverness.
- Confusing model output with reality — the tape is the source of truth, the model is a hypothesis.
Practice Questions
- Why do US trading firms keep execution paths on-prem in NY4/CH2 but push research to AWS or Azure?
- What is a VPC peering connection and why does it matter for a market-data Kinesis pipeline?
- When does a Lambda cold start become a real problem and how do you mitigate it?
- Why is egress cost the most-watched cloud bill line in a research org?
- Describe a regulator-acceptable disaster-recovery setup for a US broker-dealer in the cloud.
Answers and Explanations
- Because execution is bound by physics (light speed to the matching engine) while research is bound by elastic capacity; the cost structure flips between the two workloads.
- It is a private route between two VPCs; for market data, it keeps traffic off the public internet, simplifies IAM, and avoids public-internet egress charges on cross-VPC data transfers.
- When the function fronts a synchronous request that has a low-latency SLA; mitigations are provisioned concurrency, smaller packages, and lighter runtimes (Rust, Go) — or replacing Lambda with a long-running service.
- Because moving market data out of a region (back to a researcher's desk, a partner, or another cloud) costs $0.05-$0.09 per GB; on petabyte-scale tick data, that line dwarfs compute.
- Active-active across two regions (us-east-1 + us-west-2) with synchronous replication for trade records, asynchronous for analytics, RTO < 4 hours, RPO < 15 minutes, audited annually under FINRA Rule 4370.
Glossary
- IaaS — Infrastructure as a Service (EC2, GCE, Azure Virtual Machines); you manage the OS and up.
- PaaS — Platform as a Service (App Runner, App Engine); the platform manages the OS.
- Serverless — billing per invocation; Lambda, Cloud Run, Azure Functions.
- VPC — Virtual Private Cloud; the network isolation boundary in AWS.
- IAM — Identity and Access Management; the permission model.
- Region / AZ — geographically separate data center groups, used for disaster recovery and latency.
- Cold start — the latency penalty when a serverless function spins up from zero.
- Egress cost — the per-GB fee to send data out of the cloud; often the dominant cloud bill line in research.
Further Study Path
- Docker and Containers for Financial Applications — How containerisation works, why finance teams use Docker, and patterns for packaging trading components.
- Cloud Computing for Finance: Getting Started — What cloud means for financial services — providers, services, cost models, and why finance is migrating.
- AWS Essentials for Financial Services — EC2, S3, RDS, Lambda — the core AWS services and architectures used in trading platforms and data pipelines.
- Python for Quant Finance: Fundamentals — Variables, functions, data structures, classes, and error handling — the core Python every quant role expects.
- Advanced Python for Financial Applications — Decorators, generators, and context managers — the patterns that separate beginner Python from production quant code.
Key Learning Outcomes
- Explain the journey from 150 scenarios in a single Lambda to over a million in a batch run.
- Describe the end-to-end architecture: API Gateway, Step Functions, AWS Batch, and Fargate.
- Track shard progress across thousands of concurrent tasks with DynamoDB.
- Design a storage layer that pairs S3 blobs with DynamoDB metadata.
- Exploit parallelism within each Fargate task alongside horizontal scaling across containers.
- Weigh Fargate Spot against on-demand pricing for interruptible, idempotent batch work.
- Apply the US regulatory framing — SEC, CFTC, FINRA — relevant to running financial models in the cloud.
- Give a single-paragraph elevator pitch for the architecture suitable for an interviewer at a US quant shop.
- Trace which US firms — Citadel, Two Sigma, Jane Street, HRT, and others — publicly hire against these skills.
- Describe a common production failure mode of the techniques covered here.
- Explain when this serverless batch approach is the wrong tool and what to use instead.
- Identify how large-scale scenario processing interacts with the order management and risk gates in a US trading stack.
- Sketch a back-of-the-envelope sanity check that an implementation is roughly right.
- Map follow-up topics from this knowledge base that deepen the material.