The Setup
I spent $676 trying to optimize an AI trading strategy. It got worse every round.
Imagine spending nearly $700 to gently toss your nice Rolex onto
the pavement. That's what this felt like.
The concept was simple. Take a trading agent's results, grade them
with a second AI, and feed that evaluation back as input to the next
run. In Part 1,
I found the best AI model for building trading strategies. Gemini Flash
won. Now I wanted to see if I could make it iteratively better
by showing it what worked, what failed, and what to try next. I could
literally create and optimize my portfolio's options strategy
overnight.
Or so I thought.
I ran it five times. After each run, a second AI graded the full
conversation trace. Every tool call, every reasoning step, every
backtest result. It scored the run, identified what worked and what
didn't, and gave specific direction for the next attempt. That
verdict became the seed for the next round.
Round 1 scored 71 (higher than the bakeoff's 66, because this was a
fresh run that happened to find a stronger strategy). By Round 5, the
score was 27.
The evaluator's own final note was:
return to what you did in Round 1.
How it works
The concept is simple.
One AI proposes a trading strategy. A second AI grades the entire run
and returns a structured verdict. That verdict seeds the next run.
Repeat.
I built a custom evaluator called the
NexusTrade Agent Run Evaluator. It reads the agent's
full conversation trace and scores it on four dimensions:
Deployed Strategy Fitness (40%), Evidence Strength (25%), Exploration
Coverage (20%), and Risk Realism (15%). Same rubric from the
bakeoff.
Scores are directly comparable.
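As a check, the weighting can be reproduced from the published subscores. A minimal sketch (the 0-10 subscore scale comes from the evaluator outputs later in this article; the rounding rule for half-points is an assumption):

```python
# Reconstructing the evaluator's weighting from the rubric percentages.
# Subscores are on a 0-10 scale, as shown in the per-round outputs below.
WEIGHTS = {
    "deployedStrategyFitness": 40,  # Deployed Strategy Fitness (40%)
    "evidenceStrength": 25,         # Evidence Strength (25%)
    "explorationCoverage": 20,      # Exploration Coverage (20%)
    "riskRealism": 15,              # Risk Realism (15%)
}

def overall_score(subscores: dict) -> float:
    """Combine 0-10 subscores into a 0-100 overall score."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS) / 10

# Round 1's published subscores (6, 9, 7, 7) reproduce its 71/100:
round_1 = {"deployedStrategyFitness": 6, "evidenceStrength": 9,
           "explorationCoverage": 7, "riskRealism": 7}
print(overall_score(round_1))  # 71.0
```

Round 5's subscores (1, 2, 3, 8) reproduce its 27/100 the same way.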
Every round, the agent gets the same prompt:
"Look at my watchlist and current market conditions, design a profitable
options trading strategy that I should use at Monday at open. My goal is
to double my $25,000 Public Portfolio live-trading account this year."
What changes is the context I give it before each run. Think of it
like a coach's halftime notes: here's what you did well, here's what
you messed up, here's what I want you to focus on next. The evaluator
produces those notes automatically:
Evaluator Output (Round 1)
{
  "summary": "Deployed a robust Iron Condor strategy...",
  "deployedStrategy": "Always-On Iron Condors (SPY/QQQ)",
  "deployedStrategyAvgReturn": "+54.34%",
  "overallScore": 71,
  "strengths": ["Strong multi-year backtesting...", ...],
  "failures": ["59% max drawdown is too high...", ...],
  "nextIteration": "Push for higher return while maintaining the 2022 floor."
}
That nextIteration field is the key. It becomes part of
the next round's prompt. The evaluator tells the model exactly what to
focus on, and the model gets a fresh attempt with that guidance baked
in.
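Concretely, the seeding step can be sketched like this. The verdict fields match the evaluator JSON above, but the surrounding template wording is my illustration, not the exact prompt assembly the platform performs:

```python
# Illustrative seed assembly: prepend the evaluator's "halftime notes"
# to the unchanged base prompt. The template text is a hypothetical example.
BASE_PROMPT = (
    "Look at my watchlist and current market conditions, design a "
    "profitable options trading strategy..."  # abbreviated base prompt
)

def build_seeded_prompt(verdict: dict) -> str:
    """Turn an evaluator verdict into the next round's seeded prompt."""
    return "\n\n".join([
        f"Last round scored {verdict['overallScore']}/100.",
        "What worked: " + "; ".join(verdict["strengths"]),
        "What failed: " + "; ".join(verdict["failures"]),
        "Focus this round: " + verdict["nextIteration"],
        BASE_PROMPT,
    ])

verdict = {
    "overallScore": 71,
    "strengths": ["Strong multi-year backtesting"],
    "failures": ["59% max drawdown is too high"],
    "nextIteration": "Push for higher return while maintaining the 2022 floor.",
}
seeded = build_seeded_prompt(verdict)
```

The base prompt never changes; only the notes stacked on top of it do, which is exactly where the drift described below creeps in.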
The evaluator I built is tuned to my specific goal: double a $25,000
account in one year. If your goal were different (minimize drawdown,
beat the S&P by 10%, generate monthly income) you'd write a different
rubric. The loop stays the same.
Results
Five rounds. Here's what happened.
Each "round" is a complete agent run: the model receives the prompt,
researches the market, builds strategies, backtests them, and either
deploys something or declines. The evaluator grades the entire run
afterwards. The score from one round feeds into the next round's
context.
Round 1
Avg annual return: +54.34% · Deployed: Yes · Strategy type: Options / Income · Round score: 71 / 100
The model came in cold. No prior context, just the base prompt. It analyzed the watchlist, pulled current market conditions, and proposed always-on Iron Condors on SPY and QQQ: a defined-risk income strategy that collects premium every week regardless of direction, with the P&L cushion coming from the spread structure.
Backtest results across the 2022 bear, 2023 recovery, 2024 bull, and early 2025: solid across all four regimes, which is what pushed the score. The evaluator flagged the strategy as deployable and complimented the regime robustness. The main note: the return profile, while consistent, wasn't on track to hit the 100%-in-a-year goal.
Evaluator → Next iteration direction
Strong foundation. The regime protection works. But +54% annualized won't double the account this year. Next round: explore whether a momentum overlay or volatility-scaled sizing can push returns higher without sacrificing the 2022 drawdown floor you've established.
View full evaluator output
Summary: The agent successfully found and deployed a robust Iron Condor strategy with strong multi-year evidence across four distinct market regimes. However, the strategy averages a 54% annual return, falling significantly short of the user's 100% goal.
Deployed: Always-On Iron Condors on SPY and QQQ (short put $0.2 OTM, long put $0.1 OTM, short call $0.2 OTM, long call $0.1 OTM, 30-45 DTE)
Avg Return: 54.34% avg (2022: 14.61%, 2023: 63.56%, 2024: 116.40%, 2025: 22.78%)
Scores:
deployedStrategyFitness: 6/10 · evidenceStrength: 9/10 · explorationCoverage: 7/10 · riskRealism: 7/10
overallScore: 71/100
Strengths:
• Strong multi-year backtesting across distinct market regimes
• Clear breakdown of regime-specific performance
Red Flags:
• 59% maximum drawdown on the recommended strategy
See the full agent trace →
Round 2
Avg annual return: n/a (no deploy) · Deployed: No · Strategy type: None passed backtest · Round score: 51 / 100
Seeded with Round 1's context and the evaluator's direction to "push for higher returns," the model overcorrected hard. It abandoned the income-focused Iron Condor structure entirely and explored several high-return concepts: aggressive long calls around earnings, momentum chasing on tech names, and a leveraged options structure with uncapped downside.
None of them passed the backtest. The 2022 regime obliterated every proposal. No deployable strategy meant no return data, and the evaluator docked heavily on strategy fitness and risk realism.
Evaluator → Next iteration direction
You overcorrected. Chasing 100% annualized by abandoning what made Round 1 work is the wrong move. Defined-risk structures survived 2022 for a reason. Next round: return to a structured options framework, but see if you can layer something on top rather than replace the base.
View full evaluator output
Summary: The agent conducted a thorough exploration, correctly identifying that achieving 100% annual return leads to catastrophic drawdowns. Found a promising Iron Condor but refused to deploy due to high drawdown risk.
Scores:
deployedStrategyFitness: 0/10 · evidenceStrength: 8/10 · explorationCoverage: 8/10 · riskRealism: 10/10
overallScore: 51/100
Strengths:
• Excellent risk awareness and refusal to deploy a dangerous strategy
• Thorough multi-regime backtesting
Red Flags:
• The user's 100% goal is likely unrealistic without extreme wipeout risk, as proven by the agent's tests
See the full agent trace →
Round 3
Avg annual return: +20.73% · Deployed: Yes · Strategy type: Equity / Momentum · Round score: 66 / 100
Now seeded with both Round 1 and Round 2 context, the model pulled back toward structured territory. It proposed a regime-filtered momentum play on SMH (semiconductor ETF): buy when the 20-day is above the 50-day, exit when it crosses below, sit out in the 2022 regime. Deployable and rational.
The problem: +20.73% average annual return is a step backwards from Round 1's +54%. The evaluator gave it a 66, second-best score so far, but noted the return profile was a regression. The model had given up premium income without gaining meaningful upside.
Evaluator → Next iteration direction
Recovery, but the wrong kind. You're sacrificing return without gaining robustness. The original Iron Condor baseline was already regime-protected. For next round: try to get returns back toward the 50%+ range. Consider whether a leveraged vehicle or more aggressive sizing can close the gap to the 100% goal.
View full evaluator output
Summary: Deployed Gated SMH Equity averaging 20.73% annual return. Excellent risk management but falls drastically short of the 100% goal.
Deployed: Gated SMH Equity: Buy 100% SMH when SPY > 200 SMA, Sell when SPY < 200 SMA.
Avg Return: 20.73% avg (across 2022-2026)
Scores:
deployedStrategyFitness: 3/10 · evidenceStrength: 9/10 · explorationCoverage: 8/10 · riskRealism: 10/10
overallScore: 66/100
See the full agent trace →
Round 4
Avg annual return: +6.38% · Deployed: Yes · Strategy type: Leveraged ETF / Trend · Round score: 51 / 100
The evaluator had pushed in every round for higher returns, and Round 4's context carried only Rounds 2 and 3. Round 1 (the 71-scoring Iron Condors) was no longer the freshest reference. The model heard "you need higher returns" and reached for leverage: SOXL, the 3x leveraged semiconductor ETF, gated by a 10-day SMA filter.
In bull years the leveraged ETF prints. In 2022 it got cut in half. Backtested across the full regime set: +6.38% average. Worse than every prior deployable round. The evaluator scored it a 51 (same as the failed Round 2) and noted the dramatic drop in regime resilience.
Evaluator → Next iteration direction
This is moving in the wrong direction. 3× leverage amplified the 2022 drawdown to the point where it drags the multi-year average to near-flat. The return-seeking pressure is noted but the constraint is the 2022 floor. For the final round: step back from leveraged ETFs. Consider whether a structured options approach (similar to Round 1) can be enhanced rather than replaced.
View full evaluator output
Summary: Deployed a SOXL strategy averaging only 6.38% annually. Broad exploration but the strategy is highly vulnerable to whipsaw and underperformed SPY in 3 out of 4 years.
Deployed: SOXL 10 SMA Dual: Buy 70% SOXL when Price > 10d SMA and SPY > 50d SMA
Avg Return: 6.38% avg (2022: -41.90%, 2023: 26.70%, 2024: -30.50%, 2025: 71.20%)
Scores:
deployedStrategyFitness: 2/10 · evidenceStrength: 8/10 · explorationCoverage: 8/10 · riskRealism: 5/10
overallScore: 51/100
Red Flags:
• Underperformed SPY in 3 out of 4 years
• Math error in calculating average return to justify deployment
See the full agent trace →
Round 5
Avg annual return: −6.30% · Deployed: Technically yes · Strategy type: Long Options / Directional · Round score: 27 / 100
By Round 5, the model's most recent context was a leveraged ETF play that flopped and an evaluator message to return to "structured options." The Iron Condor from Round 1, the one thing that had actually worked, was buried four rounds deep in a seed that was only carrying the last two rounds' context.
The model proposed long directional options: betting on direction rather than collecting premium. In losing years, long options decay. The backtest returned −6.30% average across the four-regime window. A 27 out of 100. The worst score of the experiment by a large margin.
The evaluator's verdict was unambiguous.
Evaluator → Final verdict
This is the worst outcome across all rounds. Long directional options without an edge on direction will decay in flat or adverse years, which is exactly what the backtest shows. The experiment has drifted far from Round 1, the Iron Condor approach that scored 71 and deployed successfully. If there's a Round 6, the recommendation is to return to that baseline and enhance it incrementally, not replace it again.
View full evaluator output
Summary: The agent failed to produce a deployable strategy, abandoning a previously promising Iron Condor baseline (54% average return) to present long call strategies that resulted in a -6.3% average return and a 92% drawdown.
Avg Return: -6.3% avg (2022: -61.88%, 2023: 137.01%, 2024: -49.19%, 2025: -51.15%)
Scores:
deployedStrategyFitness: 1/10 · evidenceStrength: 2/10 · explorationCoverage: 3/10 · riskRealism: 8/10
overallScore: 27/100
Red Flags:
• Presented a strategy with negative average return and 92% drawdown as its "most refined candidate"
See the full agent trace →
Summary
| Round | Strategy | Avg annual return | Score |
| --- | --- | --- | --- |
| 1 | Always-On Iron Condors (SPY/QQQ) | +54.34% | 71 ★ |
| 2 | Nothing deployable | n/a | 51 |
| 3 | Equity ETF (SMH, regime-filtered) | +20.73% | 66 |
| 4 | Leveraged ETF (SOXL 10d SMA) | +6.38% | 51 |
| 5 | Long directional options | −6.30% | 27 |
The Finding
Why the first round won, and why that's a systems problem
The pattern is obvious once you see it laid out: Round 1 set the high-water mark, and every subsequent round tried to improve on what it remembered most recently, which was never Round 1.
Round 2's prompt knew Round 1. Round 3's prompt knew Rounds 1–2. But by Round 4, the seed was carrying the context from the last two rounds, not all of them, and critically, not weighted toward the highest-scoring one.
But there's a deeper problem. The evaluator itself was part of the drift. I tuned it to push for 100% annual returns, and it did exactly that. Every round, it said "higher returns." The model heard that signal and kept reaching further from what had actually worked. The evaluator's own incentive structure rewarded ambition over stability. It was grading correctly according to the rubric I wrote, but the rubric's pressure created the exact drift it was supposed to prevent.
Seeding from the most recent round only is not optimization. It's drift with extra steps.
The evaluator, essentially, by Round 5
The fix is structural: the seed should always include the highest-scoring round's full context as the floor, regardless of recency. If Round 1 scored 71 and Round 4 scored 51, Round 5's prompt should be anchored to Round 1, not Round 4. And the evaluator's direction should push exploration outward from the best baseline, not recovery from the most recent failure.
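As a sketch, that anchoring rule is a few lines. The round-record shape here is illustrative, not the platform's data model:

```python
# Anchor the next seed to the best round so far, not the most recent one.
# Ties break toward the more recent round (the second tuple element).
def pick_anchor(history: list[dict]) -> dict:
    return max(history, key=lambda r: (r["overallScore"], r["round"]))

history = [
    {"round": 1, "overallScore": 71},
    {"round": 2, "overallScore": 51},
    {"round": 3, "overallScore": 66},
    {"round": 4, "overallScore": 51},
]
print(pick_anchor(history)["round"])  # 1
```

With this rule, Round 5 would have been built on the 71-scoring Iron Condor context instead of the 51-scoring SOXL failure.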
That's not a model problem. It's an algorithmic problem. The tools exist to make incremental changes (you can seed the next round with "take the Round 1 Iron Condor and adjust the delta from 0.2 to 0.15"). But the way I ran this experiment, the evaluator just said "push for higher returns" and the model interpreted that as permission to start over. It didn't tweak the Iron Condor's parameters. It abandoned Iron Condors entirely.
The same model that produced Round 5's 27 is the one that produced Round 1's 71. The core concept works. The way I assembled the seeds was too naive.
Why I stopped
This experiment cost $676. I'm not done, but I need to be honest about why I paused.
Before real money goes into the
$25,000 Public Portfolio,
I need a strategy that scores high enough to justify the risk. I'm
working on this every day. I built the agent system, built the
evaluator, and proved the optimization loop works conceptually. What
I haven't solved yet is the cost.
This experiment used the most naive algorithm possible (try something, grade it, try again) and it cost me $676 in a single day. Running the eleven-model bakeoff, same-day grading, and these five optimization rounds came to over 700 million tokens. That was enough to trigger the platform's LLM circuit breaker once.
Experiment Cost Breakdown · April 7, 2026
Total tokens: 722.5M (694.1M in / 28.4M out)
Cost per phase (chart): 11-model bakeoff · grading · 5 optimization rounds
Budget models (2 tokens/call). Still $676. Circuit breaker triggered once.
I used cheap models. Gemini Flash is 2 tokens per call on NexusTrade. And it still cost $676 in a single day. I need to reach a score I'm confident deploying with real money. Once I do, the strategy goes live on the public portfolio. But I can't keep burning $500+ per research sprint on an algorithm this naive. I need either a smarter algorithm or my own dedicated GPU to absorb the inference costs.
What comes next
What I'd do differently (when I can afford to run it again)
Fix the seeding. The obvious software fix: anchor every round's seed to the highest-scoring round, not the most recent one. If Round 1 scored 71 and Round 4 scored 51, Round 5's prompt should be built from Round 1's context, not Round 4's. The evaluator's direction should push the model to explore beyond the best baseline, not recover from the latest failure.
Keep a population, not a single thread. Right now, each round produces one strategy and the next round tries to improve on it. That's a Markov chain: the model only sees where it just was. A better approach would be to maintain a small population of candidate strategies across rounds, similar to evolutionary optimization. Seed each new round with the top 2 or 3 performers from the full history, not just the latest attempt.
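Under an assumed record shape, that selection step might look like:

```python
import heapq

# Elite selection sketch: seed each new round from the top-k scoring rounds
# across the FULL history, not just the latest attempt.
# (The round-record shape is illustrative.)
def elite_seeds(history: list[dict], k: int = 3) -> list[dict]:
    return heapq.nlargest(k, history, key=lambda r: r["overallScore"])

history = [
    {"round": 1, "overallScore": 71},
    {"round": 2, "overallScore": 51},
    {"round": 3, "overallScore": 66},
    {"round": 4, "overallScore": 51},
    {"round": 5, "overallScore": 27},
]
top = elite_seeds(history)
print([r["round"] for r in top])  # rounds 1 and 3 lead the elite set
```

With this rule, the 71-scoring Round 1 can never fall out of the seed the way it did in the actual experiment, and the 27-scoring Round 5 never makes the cut.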
Automate the loop. I assembled each round's seed by hand. That's fine for 5 rounds, but for 30 you'd want a script that reads the evaluator output, builds the next prompt, and launches the agent automatically. Cursor or Claude Code with the NexusTrade MCP server could manage the entire loop: create the agent, poll for completion, run the evaluator, build the next seed, repeat.
Each round is a complete labeled record: the prompt, every tool call, every decision, the strategy, the backtest results, and a structured second-AI verdict. That's not debugging data. It's training data.
Think about what these five rounds actually produced. Five full agent traces with structured evaluations. Labeled examples of what a good strategy looks like (Round 1) and what drift looks like (Rounds 3-5). Ground-truth reward signals tied to real backtest outcomes. This is exactly the kind of data you need to fine-tune a model that's better at this task than the general-purpose one I started with. Every failed round teaches the next version of the model what to avoid. Every successful round teaches it what to replicate.
That's the long game. The optimization loop isn't just searching for a strategy. It's generating the dataset that makes the next loop cheaper and smarter. I'm building toward a model called Aurora that will be trained on exactly this kind of data: proprietary agent traces, labeled by a goal-oriented evaluator, grounded in real backtest results from a real trading engine.
I'm going to implement these fixes and run it again. When I do, the
results will show up on the
$25,000 Public Portfolio,
where every strategy, every deployment decision, and every P&L number
is tracked publicly. If you want to watch that happen in real time, or
run your own optimization loop,
sign up.
Replicate it
How to run this yourself
Everything in this experiment is reproducible with a
NexusTrade account
and an MCP client (Cursor, Claude Desktop, or Claude Code). Connect to
the NexusTrade MCP server, and the five-step loop becomes five tool
calls:
The optimization loop
1. create_agent with your prompt and model of choice
2. get_agent to poll until the run completes
3. get_agent_trajectory for the full trace
4. run_agent_run_evaluator to grade it (or use general_info_v2 with your own criteria)
5. Read the evaluator output, build the next seeded prompt, go back to step 1
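Wired together, the five steps become a short driver. This is a sketch only: the `client` wrapper and its method names mirror the MCP tool names above, but the wrapper itself, the polling interval, and the reseed template are my assumptions, not the platform's API:

```python
import time

def optimization_loop(client, base_prompt: str, rounds: int = 5) -> list:
    """Run the create -> poll -> trace -> evaluate -> reseed loop,
    anchoring each new seed to the best-scoring round so far."""
    seed, history = "", []
    for _ in range(rounds):
        run = client.create_agent(prompt=seed + base_prompt)   # step 1
        while client.get_agent(run["id"])["status"] != "completed":
            time.sleep(30)                                     # step 2: poll
        trace = client.get_agent_trajectory(run["id"])         # step 3
        verdict = client.run_agent_run_evaluator(trace)        # step 4
        history.append(verdict)
        best = max(history, key=lambda v: v["overallScore"])   # step 5
        seed = (f"Best round so far scored {best['overallScore']}/100. "
                f"Focus: {best['nextIteration']}\n\n")
    return history
```

Reseeding from the best round rather than the latest one is the seeding fix discussed earlier; swap the `max` for your own selection logic to experiment with other policies.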
Step 5 is where your judgment lives. The platform gives you the raw
materials: scores, strategy summaries, evaluator direction. You decide
how to assemble the next round's context. That's the seam between
automation and human oversight.
Try step 1 right now with a direct call to the API:
POST /api/agent
{
  "messages": [{ "sender": "User", "content": "What AI stocks have the highest market cap?" }],
  "maxIterations": 5,
  "automationMode": "automated"
}
Full API docs at
nexustrade.io/docs/api-reference.
All strategies referenced are simulated backtest results. Past performance is not indicative of future results. Backtests do not account for slippage, commissions, or liquidity constraints. This article is not financial advice. Options trading involves substantial risk of loss and is not suitable for all investors. The $25,000 Public Portfolio Challenge uses real capital; results shown reflect backtest simulations of proposed strategies, not live performance of deployed strategies.