Finance · 8 min read · ~23 min study · beginner
Lazy Prices, Lazy Investors - and the 22% Alpha Hidden in 10-Ks That Nobody Reads
Cohen, Malloy and Nguyen's Lazy Prices paper found that small year-on-year changes in 10-K filings predict large negative returns. Here is what the paper actually says, and how Snowflake Cortex AI and Semantic Views collapse the original eight-year engineering pipeline into an afternoon's work.
On 23 February 2010, Baxter International filed its annual report with the SEC. The stock did not budge. Two months later, the New York Times broke a story about an FDA crackdown on infusion pumps. Baxter fell more than 20% in two weeks and never recovered. The thing is - Baxter had told everyone it was coming. They just buried it in their 10-K.
We were sitting in the upstairs room at Snowflake's New York office last month, at one of the New York Quant Group evening seminars, listening to a talk about a paper most of us had heard of but few had actually re-read in years: Cohen, Malloy and Nguyen's Lazy Prices.
The original paper is from 2019. The result is, by quant-academic standards, almost rude in how big it is. A long-short portfolio that buys companies whose annual reports look like last year's and shorts companies whose annual reports have been quietly rewritten earns somewhere between 30 and 58 basis points per month in value-weighted abnormal returns - about 7% a year, risk-adjusted. Drill into changes concentrated in the Risk Factors section and the alpha climbs to 188 basis points per month, or over 22% per year, with a t-statistic of 2.76.
Numbers like that do not normally survive years of academic scrutiny on a sample that contains every publicly traded firm in the United States. This one does.
What we had not appreciated until that talk was that the entire pipeline - the bit that took Cohen and his co-authors years to build, between FOIA requests to the SEC, raw 10-K parsing, custom diff tooling, and a small army of research assistants - now fits comfortably inside a single Snowflake account, with the LLM work done by Cortex AI and the whole thing exposed as a Semantic View that an analyst can query in plain English.
It is worth thinking about why.
What the paper actually says
Lazy Prices is a behavioral finance result dressed up as a textual-analysis paper. The core claim is simple: investors have stopped reading 10-Ks carefully because 10-Ks have become roughly six times longer in twenty years, and twelve times more textually volatile year-on-year. Loughran and McDonald estimate the average public company's 10-K is downloaded from EDGAR roughly 28 times in the days after filing. Twenty-eight. For the entire investing public.
So when a company changes its 10-K - when management quietly inserts a paragraph about increased FDA scrutiny, or rewrites the Risk Factors section, or stops reassuring you that no further charges related to a particular product are likely - almost nobody notices. The signal is hiding in plain text, but the cost of reading thousands of filings carefully every quarter is so high that the market under-prices it. This is, in the end, a story about a violation of the efficient market hypothesis driven by attention costs rather than information costs.
Cohen, Malloy and Nguyen test the idea with four off-the-shelf textual similarity measures (cosine similarity on bag-of-words term frequencies, Jaccard, minimum edit distance, and a simple word-level diff), all of which give similar answers. They then sort firms each month into quintiles by year-on-year similarity and look at forward returns.
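All four measures are cheap to compute. Here is a toy Python sketch of them, assuming plain whitespace tokenisation and none of the paper's actual preprocessing (stemming, stop-word handling, the Loughran-McDonald wordlists) - the exact normalisation choices are theirs, not reproduced here:

```python
import difflib
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity on bag-of-words term frequencies."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm

def jaccard_sim(a: str, b: str) -> float:
    """Jaccard similarity on the sets of distinct words."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def edit_sim(a: str, b: str) -> float:
    """1 minus word-level Levenshtein distance, normalised by the longer text."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        curr = [i]
        for j, wy in enumerate(y, 1):
            curr.append(min(prev[j] + 1,            # delete a word
                            curr[j - 1] + 1,        # insert a word
                            prev[j - 1] + (wx != wy)))  # substitute
        prev = curr
    return 1 - prev[-1] / max(len(x), len(y))

def diff_sim(a: str, b: str) -> float:
    """Similarity from a simple word-level diff (matching-block ratio)."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()
```

On identical filings all four return 1.0; a quietly rewritten Risk Factors paragraph drags every one of them below it, which is the raw material for the quintile sorts.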
The empirical fingerprint is unusual. There is no announcement-day return - investors do not react to the filing. The drift accrues gradually over the next 6 to 18 months, and it does not reverse. That is not the signature of overreaction, or of a typical underreaction story like post-earnings announcement drift, where the price jumps and then drifts in the same direction. This is a population of investors who have simply not bothered to compare this year's text to last year's.
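That drift profile is easy to eyeball once returns are put in event time. A minimal pandas sketch, with hypothetical column names standing in for a real event-time panel - the Lazy Prices fingerprint is a flat month zero followed by a slow, one-directional slide for the changers:

```python
import pandas as pd

def event_time_drift(panel: pd.DataFrame) -> pd.Series:
    """Average cumulative abnormal return by months since filing, per
    similarity quintile. Columns ('quintile', 'months_since_filing',
    'abn_ret') are illustrative stand-ins, not from the paper's data.
    """
    avg = (panel.groupby(["quintile", "months_since_filing"])["abn_ret"]
                .mean()
                .sort_index())
    # Cumulate within each quintile to get the drift curve
    return avg.groupby(level="quintile").cumsum()
```

Plotting the bottom-quintile curve against the top-quintile curve is the standard way to check that the spread accrues gradually rather than jumping at the filing date.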
Two more findings stuck with us:
- 86% of the textual changes in 10-Ks are negative in sentiment. The market's failure to read is not symmetric. Companies that have something good to say tend to broadcast it through other channels; companies that have something bad to say will, if they can, bury it in a Risk Factors update.
- The effect is strongest among firms that do not include explicit comparative phrases ("compared to last year", "relative to prior year EBITDA"). When management actively draws attention to year-on-year changes, prices respond more or less efficiently. When they do not, it takes the market roughly half a year to figure it out.
That second point is the behavioral mechanism. The information is there. The cognitive cost of finding it is what drives the alpha. In quant trading strategy terms, you can think of the Lazy Prices signal as a textual-similarity factor that prices a slow-moving, attention-driven anomaly.
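The portfolio construction behind those headline numbers is a plain monthly quintile sort. A pandas sketch, with hypothetical column names ('month', 'sim', 'fwd_ret') standing in for a real CRSP-linked panel, and equal weights where the paper uses value weights:

```python
import pandas as pd

def lazy_prices_spread(df: pd.DataFrame) -> pd.Series:
    """Monthly long-short spread: long high-similarity firms (the
    non-changers), short low-similarity firms (the changers).
    Expects one row per firm-month with columns 'month', 'sim'
    (year-on-year similarity) and 'fwd_ret' (next-month return).
    """
    df = df.copy()
    # Quintile 1 = biggest textual changes, quintile 5 = near-identical filings
    df["q"] = df.groupby("month")["sim"].transform(
        lambda s: pd.qcut(s, 5, labels=False, duplicates="drop") + 1
    )
    by_q = df.groupby(["month", "q"])["fwd_ret"].mean().unstack()
    return by_q[5] - by_q[1]  # equal-weighted spread, one value per month
```

Under the paper's result, the time-series mean of that spread is the 30-60 basis points per month; restricting 'sim' to the Risk Factors item is what pushes it toward the headline figure.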
Why this was hard to replicate in 2019, and why it is not now
If you had wanted to build the Lazy Prices signal yourself in 2019 - and many people did try - you were looking at a multi-month engineering project before you even got to the trading rules.
You needed to: scrape every 10-K and 10-Q from EDGAR back to 1995, strip out the HTML, XBRL, embedded PDFs, exhibits and tables; identify the boundaries of each Item (1A Risk Factors, 7 MD&A, etc.) using regex against wildly inconsistent filing formats; pair each filing with its prior-year equivalent; compute four different similarity measures on documents that can run to 180,000 words; join the result back to CRSP returns, Compustat fundamentals and IBES forecasts; and then, finally, run the actual portfolio sorts.
The textual layer alone is the kind of thing that quietly consumes a graduate student for a year.
What we saw at the talk - and this is the part that genuinely surprised us - is how much of that pipeline now collapses into a Snowflake-native workflow. Specifically, two things have changed since the paper was published:
- The data is just there. S&P Global Market Intelligence, Cybersyn and others publish parsed SEC filings as native Snowflake Marketplace shares. You do not ingest them; you query them. The raw text of every 10-K, already chunked and tagged by Item, lives one SELECT away.
- The NLP is just there too. Snowflake Cortex has put EMBED_TEXT_1024, VECTOR_COSINE_SIMILARITY, AI_SIMILARITY and COMPLETE behind a SQL function call, charged per token. The LLM infrastructure problem - embedding millions of document chunks, storing them, querying them - has been compressed into something a quant can write in an afternoon.
The interesting framing from the talk was not "use Cortex to do the embeddings". That is the obvious bit. The interesting framing was using Semantic Views as the abstraction.
The Semantic View as the alpha layer
A Semantic View in Snowflake is, roughly, a governed metadata layer that sits on top of your tables and turns them into a set of well-defined business concepts: dimensions, measures, relationships, synonyms. It is the thing that lets Cortex Analyst translate "show me companies whose Risk Factors section changed the most last quarter" into actual SQL without hallucinating column names.
The point - and this is the bit worth slowing down on - is that Lazy Prices is fundamentally a question about a comparison between two documents. Not a single forecast, not a single ticker, not a single number. It is "how different is this filing from its prior-year analogue, in this section, weighted this way?"
Once you express that as a Semantic View, every downstream question becomes a one-line query. A taste of what that pipeline looks like in practice:
-- 1. Embed every parsed 10-K Item (Risk Factors, MD&A, etc.)
CREATE OR REPLACE TABLE filings_embedded AS
SELECT
    cik,
    ticker,
    filing_date,
    fiscal_year,
    item_id,     -- e.g. '1A', '7'
    item_text,
    SNOWFLAKE.CORTEX.EMBED_TEXT_1024(
        'snowflake-arctic-embed-l-v2.0',
        item_text
    ) AS item_embedding
FROM sec_filings.parsed_items
WHERE form_type IN ('10-K', '10-Q');
-- 2. Compute year-on-year similarity per Item, per firm
CREATE OR REPLACE TABLE filings_similarity AS
SELECT
    curr.cik,
    curr.ticker,
    curr.filing_date,
    curr.item_id,
    VECTOR_COSINE_SIMILARITY(
        curr.item_embedding,
        prev.item_embedding
    ) AS sim_cosine,
    curr.fiscal_year
FROM filings_embedded curr
JOIN filings_embedded prev
    ON curr.cik = prev.cik
    AND curr.item_id = prev.item_id
    AND curr.fiscal_year = prev.fiscal_year + 1;
That is the entire Lazy Prices similarity layer for one of the four measures, in two queries, on the full universe of US public companies. No ETL, no parsing, no infrastructure.
The Semantic View on top of this is where it gets interesting:
CREATE OR REPLACE SEMANTIC VIEW lazy_prices_signal AS
    TABLES ( filings_similarity, monthly_returns, fundamentals )
    DIMENSIONS (
        ticker, fiscal_year, filing_date,
        item_id WITH SYNONYMS = ('section', 'risk factors', 'MD&A')
    )
    METRICS (
        similarity_cosine AS AVG(sim_cosine),
        forward_alpha_3m AS ...,
        quintile_rank AS NTILE(5) OVER (
            PARTITION BY filing_date ORDER BY sim_cosine
        )
    );
Now an analyst - or, more likely now, an LLM-driven agent sitting in front of Cortex Analyst - can ask "What was the value-weighted return of the bottom-quintile Risk Factors changers in 2024?" and get a correct answer, on the entire CRSP-linked universe, without writing any SQL at all.
That is a meaningfully different research workflow from the one Cohen, Malloy and Nguyen had to build by hand.
Where this goes wrong
We do not want to be glib about this. Two things to keep an eye on.
First, the original paper measured similarity on bag-of-words term frequencies. Cortex embeddings measure semantic similarity. These are not the same thing - and in some ways, the embedding-based version is a worse fit for the original behavioral story. If the mechanism is "investors do not notice that the literal words have changed," then a paraphrase of the same content (which a transformer model would correctly score as highly similar) is precisely the kind of change a lazy investor would miss. We would want to run the classic Loughran-McDonald cosine on tokenised text in parallel with VECTOR_COSINE_SIMILARITY and treat any divergence as a research question, not a confirmation.
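One cheap diagnostic for that research question, sketched below with hypothetical column names and thresholds: score every filing pair on both measures and pull out the cases where they disagree - the population of paraphrase-style rewrites that a lexical reader flags but an embedding model waves through.

```python
import pandas as pd

def paraphrase_suspects(sims: pd.DataFrame,
                        semantic_hi: float = 0.95,
                        lexical_lo: float = 0.80) -> pd.DataFrame:
    """Filings whose embedding similarity says 'unchanged' while the
    bag-of-words cosine says 'rewritten'. Column names ('sim_semantic',
    'sim_lexical') and thresholds are illustrative, not from the paper.
    """
    mask = ((sims["sim_semantic"] >= semantic_hi) &
            (sims["sim_lexical"] <= lexical_lo))
    return sims[mask]
```

Whether those flagged filings carry the original drift, or none at all, is exactly the divergence worth studying before swapping embeddings in for term frequencies.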
Second, this is now a crowded trade. The original paper was published in 2020 in the Journal of Finance, has been cited many hundreds of times, and is on every quant hedge fund's reading list. The fact that the cost of replicating it has collapsed by an order of magnitude does not mean the alpha has survived. What it does mean is that the next version of this idea - the cross-language equivalent on European filings, the application to bond covenants, the cross-document version that compares 10-Ks with the corresponding earnings call transcripts and 8-Ks for inconsistencies - is now extremely tractable. S&P's own follow-up work, Questioning the Answers: LLMs enter the Boardroom, is doing exactly that on the transcript side, scoring executives on how on-topic and proactive their answers are. There is a statistical arbitrage flavour to all of this - these are noisy, slow-moving, low-Sharpe signals that need to be combined carefully - but the building blocks are no longer the bottleneck.
The original Lazy Prices result was a paper about investor inattention. The 2026 version is a paper about which fund had the engineering pipeline to act on the signal first.
What this teaches us
The Knight Capital story we wrote about a few weeks ago was a parable about software systems eating risk management. This one is the inverse: data infrastructure eating research.
For most of the history of quant finance, the moat was the data - getting it, cleaning it, parsing it, joining it. The actual statistical idea on top was often surprisingly simple. A good chunk of the original Lazy Prices result is, mathematically, year-on-year cosine similarity sorted into quintiles. The hard part was not the maths. The hard part was the eight years of pipeline plumbing.
That moat is now extremely shallow. The infrastructure that used to take a graduate student a year takes an afternoon. The differentiator is moving back to where it always belonged: the quality of the question you are asking, the rigour of the backtest, the discipline of the risk management around the live signal, and a working understanding of why the alpha exists in the first place. Cohen, Malloy and Nguyen's answer - "because investors are lazy and 10-Ks are long" - is a real economic story, not a statistical artefact, which is why the result has held up.
If you want to break into modern quant work, the technology stack is part of the job now. Understanding how Cortex, Semantic Views, vector databases and the LLM tool-chain fit together with the maths of pricing and the economics of high-frequency trading is no longer optional. The advantage is no longer in having the pipeline. It is in knowing what to point it at.