The Claude mark emitting light into five out-of-sample test windows, each beating the S&P 500

$25,000 Public Portfolio Challenge · Episode 10

Claude Fable 5 built me a live options strategy. It DESTROYED the market out of sample. Then the government banned it.

Anthropic called Fable 5 their strongest finance model. I handed it a hedge-fund-grade validation runbook, fixed every engine bug it surfaced, and ended with a strategy now trading real money.

Austin Starks ✦ Founder, NexusTrade ✦ June 13, 2026 ✦ 10 min read

What the process certified: +137.5% (true out-of-sample, every fold passed)
The book I chose to deploy: +4,609% (5-year, incl. 2022 · 66% max drawdown)
Trading live, fully public: $25,000
Live embed: the $25,000 Public Portfolio Challenge book, with current positions and performance.
Not a screenshot. This is the live book, updating in real time. Watch it trade →

When Anthropic shipped Claude Fable 5, it came with an unusual amount of hype and fanfare.

They called it Mythos, and for weeks it was the model almost no one could touch: their most capable system, kept behind heavy safeguards and a tightly limited release. Then they finally shipped the version the rest of us could use, the watered-down one: Claude Fable 5. Mythological. And, they claimed, "the strongest finance-first model we've tested, both on general finance and reasoning."

I assumed it was hype.

Anthropic's Claude Fable 5 announcement, highlighting the line: the longer and more complex the task, the larger Fable 5's lead over our other models.
Straight from Anthropic's Claude Fable 5 announcement. A $25,000 walk-forward campaign is exactly the long, complex task where they say the lead widens.

From my own experience, bigger models do not produce significantly better trading strategies. I have run that bakeoff more than once, and the cheap models keep winning.

But this is the fabled Mythos model. Anthropic subsidizes it inside Claude Code, so I connected NexusTrade over MCP, handed it a runbook that is two years of hard lessons in one file, and told it to build me a trading strategy.

I want to be 100% clear: nothing here is a black box. The whole runbook is open-source. You can copy and paste the prompt and get the same result I did.

austin-starks / public-portfolio-challenge The open runbook. Real money. AI agents that have to prove it out of sample before they deploy. github.com/austin-starks/public-portfolio-challenge →

Well, you could have yesterday. Not anymore. Claude Fable 5 was so capable that it became the first AI model in history the US government banned. Let me tell you how a now-banned model built the strategy that is currently trading over $28,000.

First, I turned the model into a quant

If you are AI-savvy, you already know the dirty secret: a language model cannot actually develop a trading strategy.

I mean, it can describe one. But it cannot test one. It needs tools, and a real engine behind them. I built one from scratch and exposed it over MCP.

Figure: How any model drives NexusTrade. Any AI agent (Fable, Opus, GPT, Composer, or your own) connects over MCP to the NexusTrade agent harness and engine and calls its tools in plain English, with no code: backtest, walk-forward certify, optimize and sweep, screen stocks, options chains, and clone and deploy. Deployment goes to the live $25,000 book, with a real broker and real fills.
The model decides; NexusTrade executes. Swap the model, keep the tools.

By connecting the NexusTrade MCP server, I handed Claude Fable the same instruments a real quant uses. It could screen the market, pull live options chains, write a strategy in plain English, run it through a simulation engine, optimize it across thousands of variants, and certify it on data it had never seen. No code. Every one of those actions was a single tool call.

Each of these tools was meticulously designed over the course of three years. Some are basic primitives, like backtesting a single strategy. Others automate an entire quant trading workflow end to end, the funnel below: search cheaply and often, kill almost everything at the gate, and deploy the rare survivor.

Figure: Search wide, gate hard, deploy once.
Search (fast, cheap, and iterative; about 16 designs tried): research, design, backtest, optimize, repeated until one survives, using screen_stocks, backtest_portfolio, optimize_portfolio, and systematic_sweep.
The gate (8 gates, single-touch lockbox; almost everything dies here): 9 certification runs with run_walk_forward_study.
Deploy (1 of dozens): one survivor goes to the live $25,000 book and is then monitored live with query_portfolio_events.
Search wide and cheap. Almost nothing survives the gate. One book in dozens reaches the live account.

What it caught before I could

Before it designed anything, the very first thing the runbook demands is an engine sanity check: take one fold's out-of-sample result and independently re-run that exact window as a standalone backtest. The two numbers must reconcile to the digit. If they do not, the rule is one word: stop.

They did not reconcile. Fable ran the cross-check, found that the walk-forward path and a plain re-run of the identical window disagreed by 79 basis points at the out-of-sample boundary, classified it as an engine defect rather than noise, and stopped. It documented exactly where the two paths diverged: my engine was filling the first order a day early, inflating every fold's headline number.

From the runbook it was following, the rule it actually honored

"The fold's oosStatistics MUST match the manual backtest within rounding. Mismatch = engine defect = STOP, document, no campaign."

It turned out to be two separate bugs, on two different code paths. I fixed both, re-ran the check until the walk-forward and standalone numbers matched to the digit, and only then let it start building the strategy.
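The reconciliation rule is simple enough to sketch in a few lines. This is a minimal illustration and not NexusTrade's code: the function name, the exception, and the 1 bp tolerance are assumptions (the runbook only says "within rounding"), and returns are expressed as decimal fractions.

```python
class EngineDefect(Exception):
    """Raised when two paths over the same window disagree beyond rounding."""


def reconcile(fold_oos_return, standalone_return, tolerance_bps=1.0):
    """Compare a fold's OOS return with a standalone re-run of the same window.

    Both inputs are decimal fractions (0.12 means +12%). Returns the gap in
    basis points, or raises EngineDefect when the gap exceeds the tolerance.
    The tolerance is an assumed stand-in for "within rounding".
    """
    gap_bps = abs(fold_oos_return - standalone_return) * 10_000
    if gap_bps > tolerance_bps:
        raise EngineDefect(
            f"walk-forward and standalone differ by {gap_bps:.1f} bps: STOP"
        )
    return gap_bps
```

With hypothetical inputs 79 basis points apart, such as `reconcile(0.1279, 0.1200)`, the check raises instead of letting the campaign continue, which is the behavior the runbook asks for.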

Then it designed the strategy

Fable did not pull a winning strategy out of thin air. It engineered one, the same way I would: build a version, backtest it, read why it failed, fix that, repeat. It went through about sixteen designs, and each dead end taught it the next move.

The first breakthrough was a deletion. Its early versions constantly rotated capital into whatever name was top-ranked that day, and Fable diagnosed that the rotation itself, the endless selling and rebuying as ranks shuffled, was the single biggest thing destroying returns. The fix was to stop rotating: hold a wide set of the strongest names and let them run. That one change unlocked the whole thing.

Figure: 16 variants tried, every one logged; each breakthrough traced to the exact variant that proved it.
Stop rotating. Hold the leaders. Proved by v4a (6a2b57a8): wide-limit, no rank rotation. “rotation churn was the return killer”, the single biggest drag on returns. Best OOS fold: +192.8%.
Take profit at +100%. Proved by v11 (6a2b5aef): added a hard +100% take-profit. “banks gains before the crash”, fixed the fold the book kept giving back. Fold 1 (was the weak fold): +114.5%.
Regime-tiered sizing, 28 / 10 / 30. Proved by v14 (6a2b5d1e), the finalist: calm core, lean into dips, buy crashes hard. The book that went on to certify. +49.8% in the 2022 bear.
Source: github.com/austin-starks/public-portfolio-challenge · episode-10/FABLE_CAMPAIGN.MD
Sixteen variants, three breakthroughs, every dead end logged. Read the exact kill-log on GitHub →

From there it kept tightening the mechanism. It added a hard take-profit at +100% (sell a position once it doubles), which banked gains before drawdowns and stopped the book from over-committing. It added regime-tiered sizing: a calm core allocation when the market is near its highs, then leaning in harder when the market is well off its highs, buying weakness aggressively. The result was a momentum book expressed entirely as long-dated call options: holding the leaders, taking profit on doublings, sizing up into dips. (The exact rules come later in this post.)

That design is the easy half to be impressed by and the wrong half to trust. A good-looking backtest of a design you just invented is worth nothing. The real work was proving it.

Then it had to prove the strategy out of sample

A model with a backtest button will happily hand you a fantasy that looks incredible and falls apart the moment real money touches it. The discipline that separates the two is one step: walk-forward certification. It tiles history into folds. Each fold tunes on its own past, then is scored on a stretch of future the optimizer never sees. A held-out lockbox at the very end is opened exactly once. And one rule sits above everything: never lead with a backtest number. Out-of-sample is the only headline.

Figure: Timeline from 2022 to the run date, split into four folds (fold 1 to fold 4), each with train, validation, and OOS (held out per fold) segments, followed by a lockbox (touched once).
Four anchored walk-forward folds + a single-touch held-out lockbox.
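For readers who want to see the tiling concretely, here is a rough sketch of an anchored layout over bar indices. It is not the runbook's fold calendar: the function name, the segment lengths, and the exact role of the validation slice are assumptions. "Anchored" here means every fold's training window starts at bar 0 and grows.

```python
from typing import NamedTuple


class Fold(NamedTuple):
    train: range        # bars the optimizer may tune on
    validation: range   # bars used to choose among tuned candidates (assumed role)
    oos: range          # bars scored after tuning, never tuned on


def anchored_folds(n_bars, n_folds=4, val_len=60, oos_len=120, lockbox_len=120):
    """Tile history into anchored folds plus one held-out lockbox at the end."""
    usable = n_bars - lockbox_len               # the lockbox is carved off first
    first_oos_start = usable - n_folds * oos_len
    if first_oos_start - val_len <= 0:
        raise ValueError("not enough history for the requested layout")
    folds = []
    for k in range(n_folds):
        oos_start = first_oos_start + k * oos_len
        folds.append(Fold(
            train=range(0, oos_start - val_len),            # always starts at bar 0
            validation=range(oos_start - val_len, oos_start),
            oos=range(oos_start, oos_start + oos_len),
        ))
    lockbox = range(usable, n_bars)             # opened exactly once, at the end
    return folds, lockbox
```

Each fold's OOS slice begins where its validation slice ends, and the last OOS slice ends where the lockbox begins, so no bar is ever both tuned on and scored.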

That is the bar: not one lucky backtest, but a positive result on every unseen fold, confirmed exactly once on the held-out lockbox. Fable built a book that cleared it. On the true out-of-sample fold, the stretch the optimizer was never allowed to look at, it returned +137.5%, and on a 2026 window it had never seen, +107.4% against the market's +9%, with its worst drawdown held to about 25%.

That is the number that counts. Anyone can show you a backtest scored on the same data it was tuned on. This is the result on data the model was forbidden to look at, which is the only result that says anything about tomorrow.

I went to run it again

The strategy was beyond outstanding. To make sure it was not a fluke, I opened a new session to run the whole thing again, from scratch.

Access revoked.

Claude Code model selector showing Claude Fable 5 is currently unavailable, linking to anthropic.com/news/fable-mythos-access
Claude Code, the next morning. Fable 5: currently unavailable.

Fable 5 would not load. I assumed I had broken my own setup. I had not. I read their announcement, and I was genuinely shocked.

The model that built it is gone

Fable 5 is the public face of Anthropic's Mythos-class model: the same engine, with the cyber and bio guardrails that never touch finance left off. It launched on June 9. Three days later, on June 12, the US government placed Fable 5 and Mythos 5 under export controls, barring access for all foreign nationals. To comply, Anthropic disabled both models entirely, for everyone.

"The US government, citing national security authorities, has issued an export control directive... We are complying with the government's legal directive and are removing access to Fable 5 and Mythos 5 for all users." Anthropic's statement, June 12, 2026.

It is the first export-control directive the US has ever issued for a language model, and the first time a major lab has taken a deployed model offline at the government's instruction. The trigger was national security: a claimed jailbreak of Mythos's cyber capabilities, nothing to do with markets. Fable was collateral, and Anthropic disagrees that "a narrow potential jailbreak should be cause for recalling a commercial model." On Hebbia's finance benchmark, Fable 5 had just ranked first. The strongest finance model anyone has measured built me a market-beating strategy, and three days into its life became a model none of us can run.

So I tried to recreate it with every other model

No matter. If the strategy was real, another model should be able to find it. So I tried to reproduce it, fairly: fresh sessions with no memory of the Fable run, the same open runbook, the same 20-name universe, the same eight gates. I handed it to the three best agents I still had, at the same time, and let them run start to finish on their own: Opus 4.8 in Claude Code, GPT-5.5 in Codex, and Composer 2.5 in Cursor.

None of them beat the baseline. All three reached the same honest verdict, do not deploy, and not one of them touched the held-out lockbox.

Agent | Verdict | Best out-of-sample book | Lockbox
Claude Fable 5 | Certified, deployable | +137.5% F3, +107.4% on 2026 | opened once
Opus 4.8 (Claude Code) | No deploy | +53.7% mean / Sortino 2.50 | untouched
GPT-5.5 (Codex) | No deploy | ~+32% mean, fold-0 flat | untouched
Composer 2.5 (Cursor) | No deploy | +33.9% mean / Sortino 1.91 | untouched

These are not bad strategies. Opus 4.8's book returns +53.7% and beats the S&P. They are the kind of result most people would happily publish. But they all stayed in the safe corner of the search, kept the gates on, rotated to cash when the market sagged, and that corner caps out below the baseline. Fable was the only one that cleared the bar, and it is the one model I can no longer open. What Fable found in that search, and the call I made with it, is the last part of the story.

Then I tried one more thing

The certified book, the cautious one you just watched clear every fold, was deployable exactly as it stood. That was the safe call, and the disciplined one. But it was not the only book Fable left in the logs. One of the tools I gave it is a genetic optimizer: it works like evolution, spinning up hundreds of variations of a strategy, scoring each on data it never saw, keeping the winners, breeding them, mutating them, and repeating until the strongest survive.

Figure: How the optimizer breeds a strategy. It evolves only the knobs below; the 20-name universe and the long-dated-call structure stay fixed.
The genome, the six knobs that evolve: rank by 252d ROC · size / name 8% · budget 30% · regime gate 0.92 × hi · take-profit +100% · roll 45 DTE.
1 · Select: in a tournament, score genomes on data they never saw; the fitter ones win the right to breed. Genome A (Sortino 4.6) breeds, genome B (Sortino 3.9) breeds, genome C (Sortino 1.2) dies.
2 · Crossover: for every knob, the child takes the value from one parent or the other.
Knob | Parent A | Parent B | Child
rank | 252d ROC | 126d ROC | 252d ROC (A)
size | 8% | 14% | 14% (B)
budget | 22% | 34% | 22% (A)
gate | 0.92 × hi | 0.85 × hi | 0.85 × hi (B)
take-profit | +100% | +150% | +150% (B)
roll | 45 DTE | 35 DTE | 45 DTE (A)
3 · Mutate: then nudge one knob at random. Most mutants are worse and die; a rare few are better and take over. Example: take-profit +100% becomes take-profit +356%. One mutation let winners run far longer. Another threw the regime gate wide open. That pair of tweaks is the book that went live.
Repeat for many generations, and it lands on settings no human would think to try, for better and for worse.
The optimizer evolves the strategy's knobs, not its idea. Select on unseen data, breed the survivors, mutate, repeat. The deployed book is what that search produced when it was allowed to push the take-profit and the gate past anything I would have set by hand.
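The select, crossover, mutate loop can be sketched as below. This is a toy illustration and not the NexusTrade optimizer: the knob names and ranges are assumptions loosely based on the figure, knobs are treated as continuous for simplicity, mutation re-draws one knob where the real optimizer may nudge it, elitism (carrying the best genome forward) is added here for stability, and `fitness` stands in for an out-of-sample backtest score.

```python
import random

# Six knobs, loosely following the figure; the ranges are illustrative assumptions.
KNOB_RANGES = {
    "roc_lookback_days": (63, 252),
    "size_per_name": (0.04, 0.16),
    "rebalance_budget": (0.10, 0.40),
    "regime_gate": (0.80, 1.00),
    "take_profit": (0.5, 4.0),
    "roll_dte": (20, 60),
}


def random_genome(rng):
    """Draw every knob uniformly from its allowed range."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in KNOB_RANGES.items()}


def tournament(population, fitness, rng, k=3):
    """Pick k genomes at random and return the fittest of them."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Uniform crossover: each knob is copied from one parent or the other."""
    return {k: (a if rng.random() < 0.5 else b)[k] for k in KNOB_RANGES}


def mutate(genome, rng):
    """Re-draw one randomly chosen knob inside its allowed range."""
    child = dict(genome)
    knob = rng.choice(list(KNOB_RANGES))
    lo, hi = KNOB_RANGES[knob]
    child[knob] = rng.uniform(lo, hi)
    return child


def evolve(fitness, generations=20, pop_size=30, seed=0):
    """Run the loop and return the fittest genome of the final generation."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(population, key=fitness)]  # elitism: keep the current best
        while len(children) < pop_size:
            a = tournament(population, fitness, rng)
            b = tournament(population, fitness, rng)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness)
```

In a real campaign the fitness call is the expensive part, since each evaluation is a backtest on unseen data. A cheap stand-in such as `lambda g: -abs(g["take_profit"] - 3.56)` is enough to watch the loop converge.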

That search had bred a variant returning far more than the cautious book, and the process had set it aside, because the search itself was unstable: it crowned a different winner on every fold, and one of those folds blew up out of sample, exactly the kind of result the validation gates exist to refuse. I went back through the logs and read it carefully: the variant, its numbers, and the case for rejecting it.

I did not fully buy that rejection. I understood it, and I wrote the rule myself: the runbook is strict, fail out of sample and you do not deploy. But a real strategy does not win every single window, and one losing fold is not the same as a broken book. So rather than take the rejection at face value, I tested the rejected book in the places the campaign never had. I ran it through a full 2022 bear market, the one regime the walk-forward never graded out of sample, to see whether the gates it had thrown away were load-bearing or dead weight. I ran it across the entire five-year cycle that contains that bear. And I re-ran it at minute resolution, to be sure the numbers were not an artifact of daily bars.

It held up everywhere I pushed it: bruised in the bear, never broken, and extraordinary across the full cycle. The fold that had spooked the process came from a bear-only training window elsewhere in the search; this book, the one I would actually deploy, came back positive on every window I tested. That was enough. I made the call the process would not, and deployed the aggressive one.

Figure: Through the 2022 bear, the regime it was never graded on. It bent hard and held: Jan 2022 to Jan 2023, the deployed book vs the S&P 500. The book falls 65% peak to trough and finishes +5%; the S&P finishes -14%.
The 2022 bear, the regime the walk-forward never graded out of sample. The deployed book ran up, gave back 64.7% peak to trough, and still finished positive while the S&P fell. Bruised, not broken, and the clearest picture of the risk that comes with it.
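"Peak to trough" has a precise meaning: the largest decline from any running high to a later low. A minimal sketch of that calculation follows. The function name is an assumption, and the sample curve is made up to show how a book can finish positive after a deep drawdown; it is not the deployed book's equity curve.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # decline from that high
    return worst


# Illustrative curve: runs up to 150, falls to 60, recovers to 105.
curve = [100, 150, 60, 105]
drawdown = max_drawdown(curve)        # decline from 150 to 60
total_return = curve[-1] / curve[0] - 1
```

A curve can end above where it started and still carry a drawdown this large, because the drawdown is measured from the peak, not from the starting value.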

I want to be straight about what this is. Choosing the stress tests after I had already seen the candidate is the researcher's freedom that walk-forward exists to remove, and re-running a rejected book until it looks good is not the same as it passing clean. The certified book is what the discipline actually endorsed. The live book is my judgment placed over the discipline, and the risk in that is mine, not the method's.

The two books are the same idea tuned two ways. The genetic optimizer found that the regime throttle, the rule that pulls the book to cash when the market sags, was costing more than it saved: the strongest momentum names powered through the dips it was trying to dodge. So it threw the gates open and stayed fully invested in the leaders, and it pushed the take-profit far out, from the cautious book's +100% to +356%, letting winners run instead of banking them early. Same 20 names, same momentum signal, same long-dated calls. It just stopped sitting in cash, and stopped selling its winners early.

Figure: One knob, how far to open the gates. Same names, same signal, same options; the big change is how invested it stays.
Gates closed, the cautious book (now paper): sits in cash when the market sags, ~28% median deployed, 5-year return +591%, worst drawdown 40%. Smoother. Leaves the big runs on the table.
Gates relaxed by the optimizer, the book that is live: stays in the leaders, rarely steps out, ~80% median deployed, near fully invested, 5-year return +4,609%, worst drawdown 66%. About 8x the return. And a hole you have to be able to sit through.
The change that drives it is how invested it stays. Open the gates and the same idea goes from ~7x to ~47x over five years, and from a 40% drawdown to a 66% one.
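The conversion from percent return to a money multiple, and the cost of a drawdown, are plain arithmetic. A small sketch follows; the helper names are assumptions.

```python
def multiple(total_return_pct):
    """Turn a total return in percent into a multiple of starting capital."""
    return 1 + total_return_pct / 100


def gain_to_recover(drawdown_pct):
    """Percent gain needed to get back to the prior peak after a drawdown."""
    return (1 / (1 - drawdown_pct / 100) - 1) * 100


cautious = multiple(591)       # about 6.9x
live = multiple(4609)          # about 47.1x
ending_value = 25_000 * live   # about $1.18 million on a $25,000 account
```

The drawdown side is less intuitive: recovering from a 40% drawdown takes a gain of about 67%, while recovering from a 66% drawdown takes a gain of about 194%, which is why the second hole is so much harder to sit through.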

Over five years, that shift turns the same idea into a different order of magnitude, and a different order of risk. Here is exactly what it returned, and exactly how much it can hurt.

The out-of-sample performance is unreal

Here is the live book measured directly, the exact object that is trading, across the regimes that actually matter: a brutal bear, two out-of-sample bull windows it was never tuned on, and the full five-year cycle that contains both.

Test window | Return | Max drawdown | Sortino
2022 · the bear (stress test) | +4.8% | 64.7% | 0.75
F3 · true out-of-sample | +249.9% | 31.2% | 4.63
2026 · never seen | +122.7% (SPY +9.2%) | 39.1% | 3.53
5 years · includes the 2022 bear | +4,609% | 65.8% | 2.32

Read the bottom row twice, because both halves of it are true. Over five years that contain a real bear market, the same $25,000 turns into roughly $1.18 million, about 47x. And the cost of that is a 65.8% drawdown: at its worst the book gave back two-thirds of its value before recovering. Same book, same row. Against the market it is not close:

Figure: The live book vs the S&P 500 on the two windows it was never tuned on; it wins both by more than twelve to one. F3 (true out-of-sample): live book +249.9%, S&P 500 +15.6%. 2026 (never seen): live book +122.7%, S&P 500 +9.2%. Over five years including the 2022 bear, the same edge compounds to +4,609%.
On the two windows it never saw, the live book beat the S&P by more than twelve to one. Over the full five years, that edge compounds to roughly 47x.

The real risk, because honesty is the whole point

The risk is not concentration. The book puts about 8% of the account into each of the top names, so no single position can sink it. The risk is that it has no brake. With the gates relaxed it stays invested in the leaders almost all the time, around 80% deployed and trending toward fully invested, and nothing pulls it to cash when the market turns. That is exactly why it compounds the way it does, and exactly why its worst drawdown is 65.8%. In the 2022 bear it stayed deployed and rode a 64.7% drawdown to finish slightly positive, at +4.8%. A bear worse than 2022 could cut deeper than that, and these are convex options, so a sustained downturn can hurt fast.

What the strategy actually does

Strip the jargon. Here is exactly what the live book does, in plain English.

It trades the same 20 names, only as long-dated call options (LEAPS, 150 to 365 days out), bought through an affordability ladder that steps from a plain call down to tighter spreads when a name is expensive. The signal is momentum: it ranks the names by their 252-day rate of change and buys the strongest. The one big choice the optimizer made was how invested to stay, and it chose: almost always.
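One way to picture the affordability ladder is as an ordered list of structures, tried from the plain call downward until one fits the per-name budget. The sketch below is a guess at the shape of that logic, not the engine's implementation: the `Rung` type, the rung names, the prices, and the contract multiplier of 100 are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    name: str     # e.g. "ATM call" or "call spread" (hypothetical labels)
    debit: float  # net premium per share for this structure


def pick_rung(ladder, account_value, size_per_name=0.08, multiplier=100):
    """Return (rung, contracts) for the first structure one contract of which fits.

    `ladder` is ordered from the plain call down to progressively tighter
    spreads. Returns None when even the cheapest rung exceeds the budget.
    """
    budget = account_value * size_per_name
    for rung in ladder:
        cost = rung.debit * multiplier  # cost of one contract
        if cost <= budget:
            return rung, int(budget // cost)
    return None


# Hypothetical prices for an expensive name on a $25,000 account (budget $2,000).
ladder = [Rung("ATM call", 35.00), Rung("wide call spread", 18.00),
          Rung("tight call spread", 9.00)]
choice = pick_rung(ladder, 25_000)
```

Here one plain call would cost $3,500, above the $2,000 budget, so the ladder steps down to the wide spread at $1,800 and buys a single contract.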

The deployed book, rule by rule

Universe and signal: the same 20 names, ranked by 252-day rate of change. It buys the strongest and keeps holding them rather than churning the list.
Sizing: about 8% of the account per name, up to roughly 30% put to work on each rebalance, bought as long calls 150 to 365 days out through an affordability ladder (a plain ATM call, stepping down to tighter call spreads when a name is expensive, that is, when one contract's premium would blow past its per-name budget).
Stays invested: the regime gate is relaxed wide open, so instead of rotating to cash when the market sags, it stays in the leaders. Median deployment lands around 80% and trends toward fully invested.
Exits: two, and both are profit-based. Take the gain on a runner at +356%, and within 29 days of expiry, trim a position that is up at least +48.6%. There is no time-based roll and no regime valve, nothing forces it out on a schedule or pulls it to cash.
The odd numbers are the tell: +356% and +48.6% are the optimizer's thresholds, not round human ones. This is the machine's book, run exactly as the machine wrote it.
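The signal and the two exits above can be sketched as a few small functions. This is an illustration under assumptions, not the deployed strategy's code: the function names and return labels are invented, gains are decimal fractions (3.56 means +356%), and "trim" is reduced to a label because the source does not say how much is sold.

```python
def rate_of_change(closes, lookback=252):
    """Fractional price change over the last `lookback` bars."""
    return closes[-1] / closes[-1 - lookback] - 1


def rank_by_momentum(price_history, lookback=252):
    """Return symbols sorted from strongest to weakest rate of change."""
    scores = {
        symbol: rate_of_change(closes, lookback)
        for symbol, closes in price_history.items()
        if len(closes) > lookback  # skip names without enough history
    }
    return sorted(scores, key=scores.get, reverse=True)


def exit_signal(gain, days_to_expiry, take_profit=3.56,
                late_trim_gain=0.486, late_window_days=29):
    """Return which profit-based exit fires, or None to keep holding."""
    if gain >= take_profit:
        return "take_profit"
    if days_to_expiry <= late_window_days and gain >= late_trim_gain:
        return "late_trim"
    return None  # no time-based roll and no regime valve
```

Both exits need a gain to fire: a losing position held into expiry triggers neither rule, which is part of what "no brake" means in practice.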

That is the whole thing. Momentum names, expressed as convex long calls, held through trends, profit taken on the big runs, and kept fully invested almost all of the time. Nothing exotic. The edge was never a secret indicator, it was finding that staying invested in the leaders beats stepping aside, and being willing to live with the drawdown that comes with it.

What is, and is not, proven

What is proven: the deployed book is live on the $25,000 public account, cloned from a backtest that reproduces bit-for-bit, and it returned +249.9% on the true out-of-sample fold and +122.7% on a 2026 window the optimizer never saw, while staying roughly 80% deployed.

What is not proven: the future. Out-of-sample history is the strongest evidence there is short of forward returns, and it is still not a guarantee, especially for a book with no downside throttle.

Two more honest caveats. The 20 names were frozen by hand to a set of current leaders, so a multi-year backtest carries hindsight in its very symbol list; the live, forward record is the only test fully free of that. And these are long-dated options, where the bid/ask spread is a real cost, so the backtests model it honestly: every buy pays the real historical ask and every sell takes the real historical bid, from recorded NBBO quotes, with a tight synthetic spread only filling the rare gap where no quote was recorded. Slippage is priced in, not assumed away, though live fills are still the final proof. The forward results are public, in real time, position by position.
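The fill rule described in the caveats (buys pay the ask, sells take the bid, with a synthetic spread only when no quote exists) can be sketched as follows. The function name, the `mid_estimate` fallback, and the 2% synthetic spread width are assumptions; the source says only that the synthetic spread is tight.

```python
def fill_price(side, bid=None, ask=None, mid_estimate=None, synthetic_spread=0.02):
    """Price a simulated fill conservatively against the quoted market.

    When no quote was recorded, build a synthetic bid and ask around
    `mid_estimate`, separated by `synthetic_spread` as a fraction of the mid.
    """
    if bid is None or ask is None:
        if mid_estimate is None:
            raise ValueError("no recorded quote and no estimate to synthesize one")
        half = mid_estimate * synthetic_spread / 2
        bid, ask = mid_estimate - half, mid_estimate + half
    if side == "buy":
        return ask   # buyers cross the spread and pay the ask
    if side == "sell":
        return bid   # sellers cross the spread and take the bid
    raise ValueError(f"unknown side: {side!r}")
```

Under this rule a round trip at an unchanged quote always loses the full spread, so a backtest cannot profit from trading at mid prices that were never available.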

What I am left with

I built an entire methodology to stop myself from getting fooled by backtests, and the machine ran it more honestly than I used to. It caught a bug in my own engine before I did, it measured every version on data it had never seen, and the book it certified, +137.5% out of sample with every fold clean, was deployable exactly as it stood.

The book that is actually live is not that one. It is the aggressive variant the process refused to trust, the one I deployed over its objection: it came back positive on every window I tested, but it runs with no brake and a 66% drawdown I chose to accept. The discipline gave me the safe answer. I overrode it on purpose, and that call is mine to own.

One honest thing about the book it built: it is not designed to be safe. It stays fully invested in the leaders, it has no rule that pulls it to cash, and it will take the full hit of any real downturn. Its worst drawdown on record is 66%. It was never meant to be all-weather. It was built to win, and I deployed it knowing exactly what that costs.

Figure: Built to win, not to be all-weather. Sideways chop: stays invested, just grinds, no cash cushion. Selloff and bear: stays in, takes the full hit, 66% drawdown. Strong uptrend: rides the leaders hard, wins big.
Built to win in trends. It stays fully invested through everything, so when a real downturn comes it takes the full hit. That is the trade, made on purpose.

And you do not have to take it on faith that a model did all of this. Every decision, every backtest, every dead end is public:

Claude Fable 5 · the full campaign log Every backtest, optimization, gate, engine fix, and dead end, written down as the model ran it. github.com/austin-starks/public-portfolio-challenge →

The best tool I have ever pointed at this problem is one I can no longer run. But the strategy it built is still trading, the method that proved it is still here, and every line of both is in the open. You do not have to take my word for any number above. You can go run the same thing yourself.

Build your own strategy, the same way

The entire runbook is open: the gates, the lockbox rules, the fold calendar, all of it. You can run the exact same campaign on your own idea and watch it pass or fail out of sample.

Fork the challenge. Run the same discipline.

Write your strategy, put it through the same walk-forward validation, and let the lockbox decide. If it survives, deploy it. If it does not, you found that out before risking a dollar.

github.com/austin-starks/public-portfolio-challenge →

Or just watch the book Fable built, every position and every fill, in real time: the live $25,000 portfolio →

Prefer the cautious version, with the risk gates left on? I am tracking it in the open too: the gated book, in paper →
