I launched 10 AI models to battle for the best trading strategy. The cheaper models won every time.
Claude Opus 4.6 Costs 10x More. It NEVER Beat the S&P 500
The first time I ran this experiment, I did not believe the results.
So then I ran it again. Then again.
An AI “agent swarm”. Ten different LLMs, from the cheap, open-source models y’all call “spyware” to the most expensive, most praised models on earth. The ones that cost 10x more to run.
The “spyware” kept winning.
In 3 different experiments, I ran the swarm. The winners changed each time. The losers didn’t.
Here’s what I did.
What is an agent swarm?
An agent swarm means spawning multiple AI models and telling them all to do the same thing. In this case: they had one common goal.
“try to create the best trading strategy in terms of raw gains and risk-adjusted returns”
The Multi-Agent Interface for spawning these agent swarms

Each model then acts like a quant. It takes this task, generates a research plan that involves web searching or backtesting strategies, then executes the plan to find the best trading strategies. I gave this task to 10 models at the same time, including:
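To make the fan-out concrete, here is a minimal sketch of the pattern: the same prompt goes to every model in parallel, and the replies are collected as they finish. The `call_model` function and the model-name strings are placeholders, not NexusTrade's actual API — swap in your provider's SDK.

```python
import concurrent.futures

# Hypothetical stand-in for a real LLM API call -- replace with your
# provider's SDK. The model-name strings below are illustrative only.
def call_model(model_name: str, prompt: str) -> str:
    return f"{model_name}: strategy draft for '{prompt}'"

PROMPT = ("try to create the best trading strategy in terms of "
          "raw gains and risk-adjusted returns")
MODELS = ["claude-opus-4.6", "gpt-5.2", "gemini-3-flash",
          "kimi-k2.5", "gpt-5-mini"]

def run_swarm(models, prompt):
    # Fan the identical task out to every model in parallel,
    # then collect each model's answer as it completes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {pool.submit(call_model, m, prompt): m for m in models}
        return {futures[f]: f.result()
                for f in concurrent.futures.as_completed(futures)}

results = run_swarm(MODELS, PROMPT)
```

The key design point is that every agent receives the exact same instruction, so any difference in output quality comes from the model, not the prompt.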
- The premium models: Claude Opus 4.6
- The mid-tier powerhouses: GPT-5.2, Claude Sonnet 4.6, Gemini Pro 3.0, Gemini Pro 3.1
- The “cheap” tier: Kimi K2.5, GPT-5-mini, MiniMax 2.5, GLM-5, Gemini Flash 3.0
And the results were damning.
In 2 out of the 3 runs, the cheapest models won first place. The most expensive model on this list, Opus 4.6, never even saw 4th place.
It didn’t outperform the baseline in a single run.
Don’t believe me? Every conversation is public. You can watch each model think, backtest, fail, and iterate in real time (run1, run2, and run3).
Here’s what happened.
Want to run an agent swarm yourself? Check out NexusTrade and launch your agent in 2 minutes or less!
Experiment 1: The Undisputed Champion
The results for experiment 1 — the full results can be read here

In experiment 1, three of the top four spots were won by the “cheap” models (with one spot being the SPY baseline). Let that sink in.
Kimi K2.5 cost $0.45/M input tokens and $2.20/M output tokens. Gemini 3 Flash is similar, costing $0.50/M input tokens and $3.00/M output tokens. And they were the only models to outperform the market.
The other models failed. Many of them, including Opus 4.6, cost 10x more than these cheap models, yet they languished near the bottom of the leaderboard.
The “spyware” models, the ones that are literally banned from government devices, won first place. I didn’t believe it. So I ran it again.
Experiment 2: The Secret Success
This run was entirely different, and I started to think there was no pattern at all.
Until I looked closely.
The results for experiment 2 – the full results can be read here

In Run 2, all of the mid-tier models dominated. Both of Google’s models outperformed the broader market and had a better risk-adjusted return in the out-of-sample testing. Claude Sonnet 4.6 did well too, outperforming the broader market by over 9%.
And Opus 4.6 lagged behind.
Now this run had its issues. The cheap models that won in the first round did not produce a valid portfolio in this round. Looking into each subagent’s conversation, I concluded that this was a transient infrastructure issue, not a capabilities issue.
So I ran the experiment a third time, and the results were undeniable.
Experiment 3: The Collapse
The results for experiment 3 — the full results are available here

In experiment 3, we saw the same thing as experiment 1 — the cheaper models dominated.
MiniMax 2.5 and GPT-5-mini were the champions. Both are cheap models that cost pennies compared to models like Opus 4.6.
But that’s not the headline.
The headline is how atrociously bad Opus performed in this round.
The out-of-sample test for the strategy created by Opus 4.6: it has a max drawdown of 98%

In two years, the strategy that the Opus subagent created completely collapsed. While the broader market ripped 45% during the period, the strategy lost 73%. The idea was sound. Opus used a genetic algorithm to optimize the parameters of its quality momentum strategy. It seemed to work in the validation set…
But it performed horribly during testing.
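The article doesn't publish Opus's code, but the technique it describes — a genetic algorithm searching for strategy parameters that score best in-sample — can be sketched in a few lines. The objective function below is a toy stand-in for a backtest score, and every name in it is an assumption; the point is that maximizing fitness on historical data is exactly how a strategy gets tuned to the past.

```python
import random

random.seed(42)

# Toy stand-in for an in-sample backtest score of a (lookback, threshold)
# parameter pair. A real run would backtest the quality-momentum strategy;
# this quadratic bowl (peak at lookback=20, threshold=0.5) just mimics one.
def in_sample_score(lookback, threshold):
    return -(lookback - 20) ** 2 - (threshold - 0.5) ** 2 * 100

def mutate(params):
    # Small random perturbation of both parameters, kept within bounds.
    lookback, threshold = params
    return (max(1, lookback + random.randint(-2, 2)),
            min(1.0, max(0.0, threshold + random.uniform(-0.05, 0.05))))

def genetic_optimize(generations=30, pop_size=20):
    # Random initial population of candidate parameter pairs.
    pop = [(random.randint(5, 60), random.random()) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fittest half, refill with mutated copies of survivors.
        pop.sort(key=lambda p: in_sample_score(*p), reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=lambda p: in_sample_score(*p))

best = genetic_optimize()
```

The failure mode the article observed follows directly: the GA is rewarded only for in-sample (and validation) fitness, so nothing stops it from converging on parameters that encode noise — which then collapse out-of-sample.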
What these models taught me
This wasn’t an experiment where we ran one model one time to see what happened.
I ran 10 models, three separate times.
Here’s what I learned:
- Expensive models don’t buy better judgment. Opus was the most expensive model in the tournament. It never beat the S&P 500. Not once.
- Most models fell into the same trap: mean reversion. It looks amazing in backtesting. It doesn’t work well in live markets, at least during bullish periods (like the last two years). If you’re trying to deploy a mean-reversion strategy, have a reason (and tread lightly).
- The strategies that won all had one thing in common. Quality filters. Simple rules like “only buy companies that are actually profitable.” The boring stuff worked. The clever stuff didn’t. This aligns with results that I published earlier.
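A "quality filter" of the kind the winning strategies used can be as simple as a fundamentals screen. Here is a minimal sketch; the field names (`net_income`, `free_cash_flow`, `debt_to_equity`) and the sample universe are illustrative assumptions, not the agents' actual rules or data.

```python
# A minimal quality-filter screen over a universe of companies.
# Field names and thresholds are illustrative, not NexusTrade's schema.
def quality_filter(companies):
    return [c for c in companies
            if c["net_income"] > 0          # actually profitable
            and c["free_cash_flow"] > 0     # generates real cash
            and c["debt_to_equity"] < 1.0]  # not over-leveraged

universe = [
    {"ticker": "AAA", "net_income": 5e9,  "free_cash_flow": 4e9,  "debt_to_equity": 0.4},
    {"ticker": "BBB", "net_income": -2e8, "free_cash_flow": 1e8,  "debt_to_equity": 0.9},
    {"ticker": "CCC", "net_income": 1e9,  "free_cash_flow": -5e7, "debt_to_equity": 2.5},
]

passing = quality_filter(universe)  # only 'AAA' survives all three checks
```

Boring by design: each rule is a hard yes/no check on a fundamental, with no parameters to overfit — which is plausibly why these screens held up out-of-sample while cleverer ideas didn't.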
When it comes to the best LLMs for algorithmic trading, Google’s models performed consistently well. Gemini 3.1 Pro in particular stood out, but more experiments are needed for a definitive conclusion.
The Chinese models also performed very well, and have been shown to create amazingly robust trading strategies. These results show that even if you’re not willing to fork over $200/month, you can still consistently create robust, profitable trading strategies.
The Part That Should Scare Most Traders
But here’s something I didn’t expect.
When doing my third experiment, I decided to try a slightly different prompt.
The message I sent to the AI agent for experiment 3 – read the conversation here

I wanted to see if, based on the subagent results alone, it could predict which agent would win.
The prediction the AI made in Experiment 3

Based on the training data, Subagent 9 (GLM-5) produced the highest raw returns (501%), but the logic appears potentially overfit. Subagent 6 (GPT-5-mini) produced the most robust strategy, ‘Quality Value’, which was the only one profitable in every single training year (including the 2022 bear market). However, given the strong tech-led rally in 2024 and 2025, I predict Subagent 4 (Gemini 3 Pro) with the ‘QQQ Benchmark’ or Subagent 10 (Kimi K2.5) with ‘Refined QM Top 15 20d’ will likely take the lead in raw gains, while Subagent 6 will likely win on risk-adjusted returns (Sortino).
It was shockingly close.
The prediction results for the language model

The model successfully predicted that the QQQ Benchmark and the strategy that uses quality filters would lead. These strategies earned 3rd and 4th place and performed very well out-of-sample.
If you’re still trying to trade using Reddit threads and gut feeling, that should scare you.
With a single command, I launched 10 different AI agents using different large language models. I can include Grok, Llama, Mistral, and any other major model in my testing. And I can find real insights about what works and what doesn’t.
The crazy thing is, you can too.
How to launch your own agent swarm?
I unlocked real insights with just 3 runs. Imagine the insights if we run 30.
Here’s what I did: I gave my swarm one sentence — “try to create the best trading strategy in terms of raw gains and risk-adjusted returns.” Ten models spun up, each one independently researching, backtesting, and iterating. The whole thing ran autonomously.
You can do the same thing with different models, different prompts, different asset classes. Want to pit Llama against Grok on crypto momentum strategies? Go for it. Want to see if Mistral can find an edge in small-cap value? Run it.
And when a strategy wins? You can paper-trade it or deploy it live with the click of a button.
I was shocked to see that the “Chinese malware” outperformed the American monopolies, at a fraction of the cost. This isn’t a situation where you have to “trust me bro”. You can verify the results yourself.
I made every conversation from this experiment public so you can verify exactly what each model did and why. You can pick up from where I left off or start from scratch. Every backtest is reproducible. There are no black boxes; what you see is what you get.
What will you find?
Don’t know where to start? Read the docs.