$25,000 Public Portfolio Challenge · Episode 10
Claude Fable 5 built me a live options strategy. It DESTROYED the market out of sample. Then the government banned it.
Anthropic called Fable 5 their strongest finance model. I handed it a hedge-fund-grade validation runbook, fixed every engine bug it surfaced, and ended with a strategy now trading real money.
When Anthropic shipped Claude Fable 5, it came with an unusual amount of hype and fanfare.
They called it Mythos, and for weeks it was the model almost no one could touch: their most capable system, kept behind heavy safeguards and a tightly limited release. Then they finally shipped the version the rest of us could use, the watered-down one: Claude Fable 5. Mythological. And, they claimed, "the strongest finance-first model we've tested, both on general finance and reasoning."
I assumed it was hype.
From my own experience, bigger models do not produce significantly better trading strategies. I have run that bakeoff more than once, and the cheap models keep winning.
But this is the fabled Mythos model. Anthropic subsidizes it inside Claude Code, so I connected NexusTrade over MCP, handed it a runbook that is two years of hard lessons in one file, and told it to build me a trading strategy.
I want to be 100% clear: nothing here is a black box. The whole runbook is open-source. You can copy and paste the prompt and get the same result I did.
austin-starks / public-portfolio-challenge: The open runbook. Real money. AI agents that have to prove it out of sample before they deploy. github.com/austin-starks/public-portfolio-challenge

Well, you could have yesterday. Not anymore. Claude Fable 5 was so capable that it became the first AI model in history the US government banned. Let me tell you how a now-banned model built the strategy that is currently trading over $28,000.
First, I turned the model into a quant
If you are AI-savvy, you already know the dirty secret: a language model cannot actually develop a trading strategy.
I mean, it can describe one. But it cannot test one. It needs tools, and a real engine behind them. I built one from scratch and exposed it over MCP.
By connecting the NexusTrade MCP server, I handed Claude Fable the same instruments a real quant uses. It could screen the market, pull live options chains, write a strategy in plain English, run it through a simulation engine, optimize it across thousands of variants, and certify it on data it had never seen. No code. Every one of those actions was a single tool call.
Each of these tools was meticulously designed over the course of three years. Some are basic primitives, like backtesting a single strategy. Others automate an entire quant trading workflow end to end, as a funnel: search cheaply and often, kill almost everything at the gate, and deploy the rare survivor.
What it caught before I could
Before it designed anything, the very first thing the runbook demands is an engine sanity check: take one fold's out-of-sample result and independently re-run that exact window as a standalone backtest. The two numbers must reconcile to the digit. If they do not, the rule is one word: stop.
They did not reconcile. Fable ran the cross-check, found that the walk-forward path and a plain re-run of the identical window disagreed by 79 basis points at the out-of-sample boundary, classified it as an engine defect rather than noise, and stopped. It documented exactly where the two paths diverged: my engine was filling the first order a day early, inflating every fold's headline number.
"The fold's oosStatistics MUST match the manual backtest within rounding. Mismatch = engine defect = STOP, document, no campaign."
It turned out to be two separate bugs, on two different code paths. I fixed both, re-ran the check until the walk-forward and standalone numbers matched to the digit, and only then let it start building the strategy.
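A minimal sketch of that reconciliation rule, not the engine's actual code. `WindowResult`, `EngineDefect`, `reconcile`, and the half-basis-point tolerance are all hypothetical names and values chosen for illustration; the only thing taken from the runbook is the rule itself: same window, two paths, numbers must match within rounding, otherwise stop.

```python
from dataclasses import dataclass


class EngineDefect(Exception):
    """Raised when two runs of the same window disagree."""


@dataclass(frozen=True)
class WindowResult:
    start: str           # ISO date of the first out-of-sample bar
    end: str             # ISO date of the last out-of-sample bar
    total_return: float  # fractional return, e.g. 0.2079 for +20.79%


def reconcile(fold: WindowResult, standalone: WindowResult, tol_bps: float = 0.5) -> float:
    """Return the gap in basis points, or raise if the two runs disagree."""
    if (fold.start, fold.end) != (standalone.start, standalone.end):
        raise ValueError("windows differ; the comparison is meaningless")
    gap_bps = abs(fold.total_return - standalone.total_return) * 10_000
    if gap_bps > tol_bps:
        # Mismatch = engine defect = STOP. No campaign until it is fixed.
        raise EngineDefect(f"{gap_bps:.1f} bps gap on {fold.start}..{fold.end}")
    return gap_bps
```

Raising an exception rather than returning a flag is deliberate in this sketch: a mismatch should halt the campaign, not become a warning that a later step can ignore.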
Then it designed the strategy
Fable did not pull a winning strategy out of thin air. It engineered one, the same way I would: build a version, backtest it, read why it failed, fix that, repeat. It went through about sixteen designs, and each dead end taught it the next move.
The first breakthrough was a deletion. Its early versions constantly rotated capital into whatever name was top-ranked that day, and Fable diagnosed that the rotation itself, the endless selling and rebuying as ranks shuffled, was the single biggest thing destroying returns. The fix was to stop rotating: hold a wide set of the strongest names and let them run. That one change unlocked the whole thing.
From there it kept tightening the mechanism. It added a hard take-profit at +100% (sell a position once it doubles), which banked gains before drawdowns and stopped the book from over-committing. It added regime-tiered sizing: a calm core allocation when the market is near its highs, and a heavier one when the market is well off its highs, buying weakness aggressively. The result was a momentum book expressed entirely as long-dated call options: holding the leaders, taking profit on doublings, sizing up into dips. (The exact rules are in the final section.)
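The two mechanisms above can be sketched in a few lines. This is an illustration only: the tier thresholds and deployment fractions in `TIERS` are hypothetical, since the post does not state the cautious book's actual numbers. Only the +100% take-profit and the "deploy more when the market is further off its highs" shape come from the text.

```python
def drawdown_from_high(prices: list[float]) -> float:
    """Fraction the last price sits below the running high (0.0 = at the high)."""
    if not prices:
        raise ValueError("need at least one price")
    return 1.0 - prices[-1] / max(prices)


# Hypothetical tiers: (minimum drawdown from the high, fraction of cash to deploy).
# Ordered from deepest drawdown to shallowest.
TIERS = [(0.20, 0.60), (0.10, 0.40), (0.00, 0.20)]


def deploy_fraction(market_prices: list[float]) -> float:
    """Calm core allocation near the highs, larger allocation into weakness."""
    dd = drawdown_from_high(market_prices)
    for min_dd, fraction in TIERS:
        if dd >= min_dd:
            return fraction
    return TIERS[-1][1]


def hit_take_profit(entry_price: float, current_price: float, target: float = 1.00) -> bool:
    """True once the position has gained `target` (1.00 = +100%, i.e. doubled)."""
    return current_price / entry_price - 1.0 >= target
```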
That design is the easy half to be impressed by and the wrong half to trust. A good-looking backtest of a design you just invented is worth nothing. The real work was proving it.
Then it had to prove the strategy out of sample
A model with a backtest button will happily hand you a fantasy that looks incredible and falls apart the moment real money touches it. The discipline that separates the two is one step: walk-forward certification. It tiles history into folds. Each fold tunes on its own past, then is scored on a stretch of future the optimizer never sees. A held-out lockbox at the very end is opened exactly once. And one rule sits above everything: never lead with a backtest number. Out-of-sample is the only headline.
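To make the fold-and-lockbox layout concrete, here is a minimal sketch that tiles a history of bars into rolling folds and reserves the tail as the lockbox. It assumes fixed-length rolling training windows that slide forward by one test window; the runbook's actual fold calendar may differ, and `Fold` and `walk_forward` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fold:
    train: range  # bar indices the optimizer may tune on
    test: range   # bar indices scored once, never tuned on


def walk_forward(
    n_bars: int, train_len: int, test_len: int, lockbox_len: int
) -> tuple[list[Fold], range]:
    """Tile [0, n_bars - lockbox_len) into folds; reserve the tail as the lockbox."""
    usable = n_bars - lockbox_len
    if min(train_len, test_len, lockbox_len) <= 0 or usable < train_len + test_len:
        raise ValueError("not enough history for one fold plus the lockbox")
    folds = []
    start = 0
    while start + train_len + test_len <= usable:
        split = start + train_len
        # Each test window begins exactly where its training window ends.
        folds.append(Fold(range(start, split), range(split, split + test_len)))
        start += test_len  # slide forward by one test window
    return folds, range(usable, n_bars)
```

Because the lockbox range is carved off before any fold is built, no fold's training or test window can overlap it by construction.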
That is the bar: not one lucky backtest, but a positive result on every unseen fold, confirmed exactly once on the held-out lockbox. Fable built a book that cleared it. On the true out-of-sample fold, the stretch the optimizer was never allowed to look at, it returned +137.5%, and on a 2026 window it had never seen, +107.4% against the market's +9%, with its worst drawdown held to about 25%.
That is the number that counts. Anyone can show you a backtest scored on the same data it was tuned on. This is the result on data the model was forbidden to look at, which is the only result that says anything about tomorrow.
I went to run it again
The strategy was beyond outstanding. To make sure it was not a fluke, I opened a new session to run the whole thing again, from scratch.
Access revoked.
Fable 5 would not load. I assumed I had broken my own setup. I had not. I read their announcement, and I was genuinely shocked.
The model that built it is gone
Fable 5 is the public face of Anthropic's Mythos-class model: the same engine, with the cyber and bio guardrails that never touch finance left off. It launched on June 9. Three days later, on June 12, the US government placed Fable 5 and Mythos 5 under export controls, barring access for all foreign nationals. To comply, Anthropic disabled both models entirely, for everyone.
"The US government, citing national security authorities, has issued an export control directive... We are complying with the government's legal directive and are removing access to Fable 5 and Mythos 5 for all users." Anthropic's statement, June 12, 2026.
It is the first export-control directive the US has ever issued for a language model, and the first time a major lab has taken a deployed model offline at the government's instruction. The trigger was national security: a claimed jailbreak of Mythos's cyber capabilities, nothing to do with markets. Fable was collateral, and Anthropic disagrees that "a narrow potential jailbreak should be cause for recalling a commercial model." On Hebbia's finance benchmark, Fable 5 had just ranked first. The strongest finance model anyone has measured built me a market-beating strategy, and three days into its life became a model none of us can run.
So I tried to recreate it with every other model
No matter. If the strategy was real, another model should be able to find it. So I tried to reproduce it, fairly: fresh sessions with no memory of the Fable run, the same open runbook, the same 20-name universe, the same eight gates. I handed it to the three best agents I still had, at the same time, and let them run start to finish on their own: Opus 4.8 in Claude Code, GPT-5.5 in Codex, and Composer 2.5 in Cursor.
None of them beat the baseline. All three reached the same honest verdict, do not deploy, and not one of them touched the held-out lockbox.
| Agent | Verdict | Best out-of-sample book | Lockbox |
|---|---|---|---|
| Claude Fable 5 | Certified, deployable | +137.5% F3, +107.4% on 2026 | opened once |
| Opus 4.8 (Claude Code) | No deploy | +53.7% mean / Sortino 2.50 | untouched |
| GPT-5.5 (Codex) | No deploy | ~+32% mean, fold-0 flat | untouched |
| Composer 2.5 (Cursor) | No deploy | +33.9% mean / Sortino 1.91 | untouched |
These are not bad strategies. Opus 4.8's book returns +53.7% and beats the S&P. They are the kind of result most people would happily publish. But they all stayed in the safe corner of the search, kept the gates on, rotated to cash when the market sagged, and that corner caps out below the baseline. Fable was the only one that cleared the bar, and it is the one model I can no longer open. What Fable found in that search, and the call I made with it, is the last part of the story.
Then I tried one more thing
The certified book, the cautious one you just watched clear every fold, was deployable exactly as it stood. That was the safe call, and the disciplined one. But it was not the only book Fable left in the logs. One of the tools I gave it is a genetic optimizer: it works like evolution, spinning up hundreds of variations of a strategy, scoring each on data it never saw, keeping the winners, breeding them, mutating them, and repeating until the strongest survive.
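For readers who have not seen one, a genetic optimizer fits in about forty lines. The sketch below is a generic toy, not the NexusTrade optimizer: the parameter names and bounds in `BOUNDS` are hypothetical, and `toy_score` is a stand-in for what would really be an out-of-sample backtest score.

```python
import random
from typing import Callable

Params = dict[str, float]

# Hypothetical search bounds for two knobs of the kind discussed in this post.
BOUNDS = {"take_profit": (0.5, 5.0), "per_name_weight": (0.02, 0.15)}


def random_params(rng: random.Random) -> Params:
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def crossover(a: Params, b: Params, rng: random.Random) -> Params:
    # Breeding: each gene comes from one parent, chosen at random.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}


def mutate(p: Params, rng: random.Random, rate: float = 0.2) -> Params:
    child = dict(p)
    for k, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            nudged = child[k] + rng.gauss(0.0, 0.1 * (hi - lo))
            child[k] = min(hi, max(lo, nudged))  # clamp back into bounds
    return child


def evolve(
    score: Callable[[Params], float],
    generations: int = 30,
    pop_size: int = 40,
    keep: int = 10,
    seed: int = 0,
) -> Params:
    rng = random.Random(seed)
    population = [random_params(rng) for _ in range(pop_size)]
    for _ in range(generations):
        survivors = sorted(population, key=score, reverse=True)[:keep]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - keep)
        ]
        population = survivors + children  # winners carry over unchanged
    return max(population, key=score)


def toy_score(p: Params) -> float:
    # Stand-in objective with a single peak; a real run would score a backtest.
    return -((p["take_profit"] - 3.0) ** 2) - ((p["per_name_weight"] - 0.08) ** 2)


best = evolve(toy_score)
```

Carrying the survivors over unchanged means the best score can never get worse from one generation to the next. It does not protect against the instability that matters in practice: a different winner on each fold.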
That search had bred a variant returning far more than the cautious book, and the process had set it aside, because the search itself was unstable: it crowned a different winner on every fold, and one of those folds blew up out of sample, exactly the kind of result the validation gates exist to refuse. I went back through the logs and read it carefully: the variant, its numbers, and the case for rejecting it.
I did not fully buy that rejection. I understood it, and I wrote the rule myself: the runbook is strict, fail out of sample and you do not deploy. But a real strategy does not win every single window, and one losing fold is not the same as a broken book. So rather than take the rejection at face value, I tested the rejected book in the places the campaign never had. I ran it through a full 2022 bear market, the one regime the walk-forward never graded out of sample, to see whether the gates it had thrown away were load-bearing or dead weight. I ran it across the entire five-year cycle that contains that bear. And I re-ran it at minute resolution, to be sure the numbers were not an artifact of daily bars.
It held up everywhere I pushed it: bruised in the bear, never broken, and extraordinary across the full cycle. The fold that had spooked the process came from a bear-only training window elsewhere in the search; this book, the one I would actually deploy, came back positive on every window I tested. That was enough. I made the call the process would not, and deployed the aggressive one.
I want to be straight about what this is. Choosing the stress tests after I had already seen the candidate is the researcher's freedom that walk-forward exists to remove, and re-running a rejected book until it looks good is not the same as it passing clean. The certified book is what the discipline actually endorsed. The live book is my judgment placed over the discipline, and the risk in that is mine, not the method's.
The two books are the same idea tuned two ways. The genetic optimizer found that the regime throttle, the rule that pulls the book to cash when the market sags, was costing more than it saved: the strongest momentum names powered through the dips it was trying to dodge. So it threw the gates open and stayed fully invested in the leaders, and it pushed the take-profit far out, from the cautious book's +100% to +356%, letting winners run instead of banking them early. Same 20 names, same momentum signal, same long-dated calls. It just stopped sitting in cash, and stopped selling its winners short.
Over five years, that shift turns the same idea into a different order of magnitude, and a different order of risk. Here is exactly what it returned, and exactly how much it can hurt.
The out-of-sample performance is unreal
Here is the live book measured directly, the exact object that is trading, across the regimes that actually matter: a brutal bear, two out-of-sample bull windows it was never tuned on, and the full five-year cycle that contains both.
| Test window | Return | maxDD | Sortino |
|---|---|---|---|
| 2022 · the bear (stress test) | +4.8% | 64.7% | 0.75 |
| F3 · true out-of-sample | +249.9% | 31.2% | 4.63 |
| 2026 · never seen | +122.7% (SPY +9.2%) | 39.1% | 3.53 |
| 5 years · includes the 2022 bear | +4,609% | 65.8% | 2.32 |
Read the bottom row twice, because both halves of it are true. Over five years that contain a real bear market, the same $25,000 turns into roughly $1.18 million, about 47x. And the cost of that is a 65.8% drawdown: at its worst the book gave back about two-thirds of its value before recovering. Same book, same row. Against the market it is not close.
The risk is not concentration. The book puts about 8% of the account into each of the top names, so no single position can sink it. The risk is that it has no brake. With the gates relaxed it stays invested in the leaders almost all the time, around 80% deployed and trending toward fully invested, and nothing pulls it to cash when the market turns. That is exactly why it compounds the way it does, and exactly why its worst drawdown is 65.8%. In the 2022 bear it stayed deployed and rode a 64.7% drawdown to finish roughly flat. A bear worse than 2022 could cut deeper than that, and these are convex options, so a sustained downturn can hurt fast.
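The five-year row is easy to check by hand, and the arithmetic also shows what a drawdown of that size demands on the way back. The sketch below uses only the figures from the table; the annualized number assumes the window is exactly five years, which the post does not state to the day.

```python
def ending_value(start: float, total_return_pct: float) -> float:
    """Ending balance after a total return quoted in percent (+4,609% -> 4609.0)."""
    return start * (1.0 + total_return_pct / 100.0)


def gain_to_recover(drawdown_pct: float) -> float:
    """Percent gain needed to get back to the prior peak after a drawdown."""
    remaining = 1.0 - drawdown_pct / 100.0
    return (1.0 / remaining - 1.0) * 100.0


def annualized_pct(total_return_pct: float, years: float) -> float:
    """Compound annual growth rate implied by a total return over `years`."""
    return ((1.0 + total_return_pct / 100.0) ** (1.0 / years) - 1.0) * 100.0


final = ending_value(25_000, 4_609)  # about $1,177,250
multiple = final / 25_000            # about 47.09x
recover = gain_to_recover(65.8)      # about +192% needed to regain the peak
cagr = annualized_pct(4_609, 5)      # roughly +116% a year, if exactly 5 years
```

The recovery figure is the one to sit with: after losing 65.8%, the book has to nearly triple just to get back to where it was.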
What the strategy actually does
Strip the jargon. Here is exactly what the live book does, in plain English.
It trades the same 20 names, only as long-dated call options (LEAPS, 150 to 365 days out), bought through an affordability ladder that steps from a plain call down to tighter spreads when a name is expensive. The signal is momentum: it ranks the names by their 252-day rate of change and buys the strongest. The one big choice the optimizer made was how invested to stay, and it chose: almost always.
Universe and signal: the same 20 names, ranked by 252-day rate of change. It buys the strongest and keeps holding them rather than churning the list.
Sizing: about 8% of the account per name, up to roughly 30% put to work on each rebalance, bought as long calls 150 to 365 days out through an affordability ladder (a plain ATM call, stepping down to tighter call spreads when a name is expensive, that is, when one contract's premium would blow past its per-name budget).
Stays invested: the regime gate is relaxed wide open, so instead of rotating to cash when the market sags, it stays in the leaders. Median deployment lands around 80% and trends toward fully invested.
Exits: two, and both are profit-based. Take the gain on a runner at +356%, and within 29 days of expiry, trim a position that is up at least +48.6%. There is no time-based roll and no regime valve: nothing forces it out on a schedule or pulls it to cash.
The odd numbers are the tell: +356% and +48.6% are the optimizer's thresholds, not round human ones. This is the machine's book, run exactly as the machine wrote it.
That is the whole thing. Momentum names, expressed as convex long calls, held through trends, profit taken on the big runs, and kept fully invested almost all of the time. Nothing exotic. The edge was never a secret indicator; it was finding that staying invested in the leaders beats stepping aside, and being willing to live with the drawdown that comes with it.
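The signal and the two exits described above reduce to a few small functions. This is a sketch of the stated rules, not the deployed strategy's code: the function names are hypothetical, rate of change is taken as `close[-1] / close[-1 - lookback] - 1` on daily closes, and the option selection and affordability ladder are left out.

```python
from typing import Optional


def rate_of_change(closes: list[float], lookback: int = 252) -> float:
    """Fractional change over `lookback` bars, measured from the latest close."""
    if len(closes) <= lookback:
        raise ValueError("need more than `lookback` closes")
    return closes[-1] / closes[-1 - lookback] - 1.0


def rank_by_momentum(history: dict[str, list[float]], lookback: int = 252) -> list[str]:
    """Tickers sorted strongest first by rate of change."""
    return sorted(
        history,
        key=lambda ticker: rate_of_change(history[ticker], lookback),
        reverse=True,
    )


def exit_signal(gain_pct: float, days_to_expiry: int) -> Optional[str]:
    """Apply the two profit-based exits using the thresholds quoted in the post."""
    if gain_pct >= 356.0:
        return "take_profit"  # runner: bank the gain
    if days_to_expiry <= 29 and gain_pct >= 48.6:
        return "trim"         # near expiry and comfortably up
    return None               # no time-based roll, no regime valve
```

Note what `exit_signal` never does: it has no branch for a losing position. That is the "no brake" property expressed in code.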
What is proven: the deployed book is live on the $25,000 public account, cloned from a backtest that reproduces bit-for-bit, and it returned +249.9% on the true out-of-sample fold and +122.7% on a 2026 window the optimizer never saw, while staying roughly 80% deployed. What is not proven: the future. Out-of-sample history is the strongest evidence there is short of forward returns, and it is still not a guarantee, especially for a book with no downside throttle. Two more honest caveats. The 20 names were frozen by hand to a set of current leaders, so a multi-year backtest carries hindsight in its very symbol list; the live, forward record is the only test fully free of that. And these are long-dated options, where the bid/ask spread is a real cost, so the backtests model it honestly: every buy pays the real historical ask and every sell takes the real historical bid, from recorded NBBO quotes, with a tight synthetic spread only filling the rare gap where no quote was recorded. Slippage is priced in, not assumed away, though live fills are still the final proof. The forward results are public, in real time, position by position.
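The fill rule in that last caveat is simple enough to write down. A minimal sketch, assuming a hypothetical `Quote` type and a hypothetical 2% synthetic spread around the last known mid when no quote was recorded; the post says only that the synthetic spread is "tight", so that number is an illustration, not the engine's setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quote:
    bid: float
    ask: float


def fill_price(
    side: str,
    quote: Optional[Quote],
    last_mid: float,
    synthetic_spread_pct: float = 0.02,
) -> float:
    """Price a fill conservatively: buys pay the ask, sells receive the bid."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    if quote is None:
        # No recorded quote: build a synthetic one around the last known mid.
        half = last_mid * synthetic_spread_pct / 2.0
        quote = Quote(bid=last_mid - half, ask=last_mid + half)
    return quote.ask if side == "buy" else quote.bid
```

Under this rule a round trip at an unchanged quote always loses the full spread, which is the cost a mid-price backtest quietly ignores.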
What I am left with
I built an entire methodology to stop myself from getting fooled by backtests, and the machine ran it more honestly than I used to. It caught a bug in my own engine before I did, it measured every version on data it had never seen, and the book it certified, +137.5% out of sample with every fold clean, was deployable exactly as it stood.
The book that is actually live is not that one. It is the aggressive variant the process refused to trust, the one I deployed over its objection: it came back positive on every window I tested, but it runs with no brake and a 66% drawdown I chose to accept. The discipline gave me the safe answer. I overrode it on purpose, and that call is mine to own.
One honest thing about the book it built: it is not designed to be safe. It stays fully invested in the leaders, it has no rule that pulls it to cash, and it will take the full hit of any real downturn. Its worst drawdown on record is 66%. It was never meant to be all-weather. It was built to win, and I deployed it knowing exactly what that costs.
And you do not have to take it on faith that a model did all of this. Every decision, every backtest, every dead end is public:
Claude Fable 5 · the full campaign log: Every backtest, optimization, gate, engine fix, and dead end, written down as the model ran it. github.com/austin-starks/public-portfolio-challenge

The best tool I have ever pointed at this problem is one I can no longer run. But the strategy it built is still trading, the method that proved it is still here, and every line of both is in the open. You do not have to take my word for any number above. You can go run the same thing yourself.
Build your own strategy, the same way
The entire runbook, the gates, the lockbox rules, the fold calendar, all of it is open. You do not have to take my word for any number in this post. You can run the exact same campaign on your own idea and watch it pass or fail out of sample.
Write your strategy, put it through the same walk-forward validation, and let the lockbox decide. If it survives, deploy it. If it does not, you found that out before risking a dollar.
github.com/austin-starks/public-portfolio-challenge

Or just watch the book Fable built, every position and every fill, in real time: the live $25,000 portfolio →
Prefer the cautious version, with the risk gates left on? I am tracking it in the open too: the gated book, in paper →