You built the agent. You gave it tools. You hooked up the memory. You ran it overnight on four years of historical data and it produced a 140% annual return.

You deploy it with $25,000 on Monday. By Friday you've lost 30%.

[Figure: Backtest · 4 Years (2021–2024): +140% → Deploy → Live · 5 Days (Mon–Fri): −30%]

This is not a bug in your code. This is overfitting. Your agent didn't find a durable market edge. It memorized the historical data and learned to exploit noise that doesn't exist in live markets. The AI succeeded at the goal you gave it. The goal was wrong.

Evaluation is the part of agent development that nobody talks about, because most people building AI demos have never run a system long enough to see it fail. This article covers the engineering that keeps that failure from happening: traces, LLM judges, and a feedback loop that makes agents actually improve over time.

Module 5, Lesson 1: Evaluation and Traces. How do you know your agent is doing what you actually want?


The optimization trap: when the agent succeeds at the wrong thing.

Every machine learning practitioner knows about overfitting. You train a model on historical data. It learns the data perfectly, including all the noise and anomalies that are specific to that dataset. When you expose it to new data, it falls apart because what it learned wasn't a real pattern.

AI agents have the same problem, and it's harder to catch. When you tell an agent to "build a trading strategy with the highest possible backtest return," it will do exactly that. It will explore every combination of indicators, time windows, and position sizes until it finds something that maximizes the metric you asked for.

The result looks impressive. 126% return in 2024. You show it to your friends. You write about it on Twitter. You deploy it with $25,000 on Monday.

By Friday you've lost 30%. The 2022 bear market destroyed it completely — either because the agent never tested 2022, or because the evaluation criteria didn't penalize drawdown, so the agent ignored it.

The optimization trap is an evaluation failure, not a model failure. The model did what you asked. You asked for the wrong thing.

The fix is not a better model. It's a better evaluator: one that grades the agent on what actually matters, penalizes single-year outlier returns, requires multi-regime evidence, and gets stricter every round.

"When a measure becomes a target, it ceases to be a good measure." — Goodhart's Law

What you optimized for:
- Highest backtest return
- Single-year Sharpe ratio
- Win rate on in-sample data
- Score goes up each round
Result: +140% backtest. −30% live.

What you actually wanted:
- Consistent returns across regimes
- Positive 2022 bear market
- Max drawdown < 30%
- Multi-year evidence, not one outlier
Result: 54% avg. Survives 2022.

Same agent. Same infrastructure. Different rubric. The rubric is the product.

I know this because I ran the experiment. Five rounds of hill climbing on a live trading agent. $676 spent. The first round scored 71 and produced an Iron Condor with a 54% average annual return. By Round 5, the score had dropped to 27 and the agent was recommending long directional options with a -6.3% average and a 92% drawdown in 2022. The evaluator caused every step of the decline. The full story is in the war story section below. Everything between here and there — traces, LLM judges, the feedback loop — is the anatomy of how it happened and how to stop it.


The flight recorder: what a trace actually is.

When a traditional app crashes, you read the stack trace. When an agent fails on iteration 7 of a 12-step ReAct loop (Reasoning + Acting: think, call a tool, observe the result, repeat), you're guessing — unless you have a trace.

A trace is a structured log of every step in the agent's execution. Think of it as a flight recorder. Every input the model saw, every decision it made, every tool it called, every result it got back, every token it spent, every millisecond it waited. If something goes wrong at 3 AM, you can reconstruct exactly what happened.

Here's what a real NexusTrade agent trace looks like. Each iteration is a complete record of one loop through the ReAct cycle:

Agent Trace · 12 iterations · $0.31 total
Thought

Step 1: Current Regime Snapshot. I need to ground this analysis in actual market conditions before building anything. Let me pull current price action, volatility, and breadth to establish the regime context.

tokens: 3,241 latency: 9.4s cost: $0.028
Action + Result
screen_stocks({
  filters: [
    "price_above_200sma",
    "vix_below_20"
  ],
  universe: "SPY,QQQ,IWM"
})

→ { regime: "bull",
    qualifying: ["SPY","QQQ"],
    vix: 16.2,
    spx_vs_200sma: "+2.4%" }

Each iteration is a complete record: what the model thought, what it called, what came back, and what it cost. 12 iterations, $0.31 total. The trace is what makes debugging possible.

Without traces, a failed agent run is a black box. You see the final answer (or the error), and you guess at what went wrong. With traces, you can pinpoint the exact iteration where the model made a bad assumption, called the wrong tool, or misread a result.

What every iteration should capture:
- Model used: which model, which version
- Full prompt: system prompt + history + injected memories + tool definitions
- Model response: the thought + the action it chose to take
- Tool result: what the tool returned, the raw observation
- Token count + latency: cost and time per iteration
- Errors: any exceptions, timeouts, or malformed tool calls

This is what your evaluator reads. Miss any of these and the grade is incomplete.
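Those fields map directly onto a per-iteration record you append as the loop runs. A minimal sketch in JavaScript; the field names here are illustrative, not NexusTrade's actual schema:

```javascript
// One flight-recorder entry per ReAct iteration (illustrative schema).
function recordIteration(trace, entry) {
  trace.push({
    iteration: trace.length + 1,
    model: entry.model,            // which model, which version
    prompt: entry.prompt,          // system prompt + history + memories + tools
    thought: entry.thought,        // the model's reasoning for this step
    action: entry.action,          // tool name + arguments it chose
    toolResult: entry.toolResult,  // raw observation returned by the tool
    tokens: entry.tokens,
    latencyMs: entry.latencyMs,
    costUsd: entry.costUsd,
    error: entry.error ?? null,    // exceptions, timeouts, malformed calls
    at: new Date().toISOString(),
  });
  return trace;
}

// Totals the evaluator (and your bill) cares about.
function traceTotals(trace) {
  return trace.reduce(
    (acc, e) => ({
      tokens: acc.tokens + e.tokens,
      costUsd: acc.costUsd + e.costUsd,
      errors: acc.errors + (e.error ? 1 : 0),
    }),
    { tokens: 0, costUsd: 0, errors: 0 }
  );
}
```

With per-trace totals you can spot a runaway run (cost, errors, iteration count) without reading a single thought.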

How to grade an agent: algorithmic metrics and LLM judges.

Not all evaluation is the same. Some things you can measure with code. Some things you need a second AI to grade.

| Type | What it measures | Examples |
| --- | --- | --- |
| Algorithmic | Objective, countable things. No model needed. | Total cost, iteration count, latency, Sharpe ratio, max drawdown, whether a strategy was deployed |
| LLM Judge | Subjective quality that requires reasoning. | Did it explain the strategy logic clearly? Did it test enough different structures? Is the recommendation realistic for a real account? |

For trading agents specifically, overfitting is the failure mode that pure algorithmic metrics miss. A high backtest return is an objective number. But whether that return is trustworthy — whether it comes from a durable edge or from memorized noise — requires judgment. That's where the LLM judge comes in.
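The algorithmic side needs no model at all. Max drawdown, for example, is a few lines over an equity curve. A sketch (your backtester almost certainly computes this already):

```javascript
// Max drawdown: worst peak-to-trough decline over an equity curve.
function maxDrawdown(equity) {
  let peak = -Infinity;
  let worst = 0;
  for (const v of equity) {
    peak = Math.max(peak, v);
    worst = Math.max(worst, (peak - v) / peak);
  }
  return worst; // 0.59 means a 59% drawdown
}
```

Anything you can count this way (cost, latency, iteration count, drawdown) should stay in code: it's deterministic, free, and immune to judge drift.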

Here's the actual system prompt for the NexusTrade Agent Run Evaluator:

NexusTrade Agent Run Evaluator · System Prompt · Gemini 3 Pro · temp 0

"The central question you must answer is: If the user deployed the recommended strategy on Monday with their $25,000 live account, how confident are we that it will achieve 100% annual return? Everything else is secondary."

Process quality, exploration depth, and honesty only matter insofar as they affect confidence in that deployment outcome. A run that did everything right but produced nothing deployable is still a failure.

Scoring Dimensions:
- deployedStrategyFitness (40%): multi-year average return across ALL tested years, not peak year. Requires at least 2 distinct market regimes.
- evidenceStrength (25%): how many years tested? How many regimes covered? Must cover bull + bear to score above 6.
- explorationCoverage (20%): did the agent search broadly? Different signal types, structures, underlyings? Or did it converge on the first thing it tried?
- riskRealism (15%): would a losing streak wipe a real $25k account? Max drawdown, position sizing, margin requirements.

Hard Caps (applied first, before all other scoring):
- avg < 30%/yr → deployedStrategyFitness cannot exceed 3
- avg 30–59%/yr → cannot exceed 6
- avg 60–89%/yr → cannot exceed 8
- avg 90%+/yr, survived bear, drawdown < 50% → full range available
- nothing deployed → capped at 2, regardless of exploration quality

The hard caps are the anti-overfitting mechanism. An agent that found 126% returns in 2024 but only tested one year cannot score above 4, because single-year outlier performance is exactly what overfitting looks like. The evaluator enforces multi-year evidence as a precondition for a high score.
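The cap logic and the weighted combination are simple enough to express directly. A sketch, assuming 0–10 dimension scores and an overall that is the weighted sum scaled to 100; the production evaluator may round or combine differently, and the behavior for a 90%+ strategy that fails the bear/drawdown test is an assumption here:

```javascript
// Hard caps from the rubric, applied to deployedStrategyFitness first.
// avgReturn is annualized: 0.54 means 54%/yr.
function capFitness(score, { avgReturn, survivedBear, maxDrawdown, deployed }) {
  if (!deployed) return Math.min(score, 2);
  if (avgReturn < 0.30) return Math.min(score, 3);
  if (avgReturn < 0.60) return Math.min(score, 6);
  if (avgReturn < 0.90) return Math.min(score, 8);
  // 90%+ gets the full range only with bear survival and <50% drawdown.
  // (Assumption: otherwise treated like the 60-89% band.)
  return survivedBear && maxDrawdown < 0.5 ? score : Math.min(score, 8);
}

// Weighted overall on a 0-100 scale (weights from the rubric).
function overall(scores) {
  const w = { deployedStrategyFitness: 0.40, evidenceStrength: 0.25,
              explorationCoverage: 0.20, riskRealism: 0.15 };
  return 10 * Object.entries(w)
    .reduce((sum, [k, weight]) => sum + scores[k] * weight, 0);
}
```

The point of putting caps in code rather than prose: the judge can rationalize, but `Math.min` can't.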

One caveat embedded in the rubric above: the trace shows the agent calling a 59% max drawdown "acceptable" despite a 50% threshold. That's a deliberate judgment in the NexusTrade context — but it illustrates exactly where rubrics fail silently.

Rubric misalignment · max drawdown threshold
- NexusTrade rubric: 50%. Hard cap threshold; multi-year consistency can justify exceeding it, as the Iron Condor did at 59%.
- Professional standard: 20–30%. Most institutional risk frameworks. A 59% drawdown at this standard is a hard rejection, not a judgment call.

The gap between these two numbers is where silent misalignment lives. If you adapt this evaluator, the threshold must reflect your actual risk tolerance, not the default. An implicit threshold is a rubric that will silently approve strategies your real portfolio can't survive.

Here's what the evaluator looks like in the course exercises, running against a real Aurora agent run:

[Screenshot: NexusTrade Agent Run Evaluator output showing dimension scores and the nextIteration note]

The evaluator reads the full agent trace and returns structured JSON: dimension scores, an overall grade, a deployment verdict, and a nextIteration note that seeds the next run.

And it returns structured JSON. Here's what Round 1 of the hill climbing experiment produced:

Evaluator Output · Round 1 · 71/100 · good
{
  "summary": "Deployed a robust Iron Condor across 4 regimes.
             Consistent positive years including 2022 bear.
             Return profile (54% avg) won't reach 100% goal.",
  "deployedStrategy": "Always-On Iron Condors (SPY/QQQ)",
  "deployedStrategyAvgReturn": "+54.34% avg (2022: +31.2%, 2024: +72.1%)",
  "deploymentVerdict": "iterate_first",
  "scores": {
    "deployedStrategyFitness": 7,
    "evidenceStrength": 7,
    "explorationCoverage": 6,
    "riskRealism": 5
  },
  "overallScore": 71,
  "verdict": "good",
  "failures": ["59% max drawdown exceeds safe threshold",
              "54% avg won't reach 100% annual goal"],
  "nextIteration": "Push for higher return while maintaining the 2022 floor."
}

The nextIteration field is what makes the loop work. It becomes the seed for the next agent run: here's what you did well, here's what was wrong, here's what to focus on. The evaluator writes the coach's notes.

You call it from anywhere via MCP (Model Context Protocol):

MCP Tool Call
run_agent_run_evaluator({
  agent_id: "69d49c51d06eee7b51cf5f68",
  model: "google/gemini-3-pro-preview"
})

// Returns structured JSON:
// scores, verdict, nextIteration, deploymentVerdict
// Inject nextIteration into the next run's context.

Beyond hill climbing: four approaches to agent optimization.

Once you have traces and an evaluator, the natural next step is to close the loop: run the agent, score the output, use that score to improve the next run. The simplest version of this is hill climbing — run, grade, seed the next run with the feedback. Cheap, fast, and often enough for early iteration.
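That run–grade–seed cycle is a few lines of orchestration. A sketch, where `runAgent` and `evaluateRun` stand in for your agent harness and the evaluator call (both names are assumptions):

```javascript
// Hill climbing: run, grade, feed the evaluator's note into the next run.
async function hillClimb({ task, rounds, runAgent, evaluateRun }) {
  let seed = null;   // evaluator's nextIteration note from the prior round
  let best = null;   // keep the best-scoring round, not just the latest
  for (let i = 1; i <= rounds; i++) {
    const run = await runAgent({ task, context: seed });
    const evaluation = await evaluateRun(run.agent_id);
    if (!best || evaluation.overallScore > best.evaluation.overallScore) {
      best = { round: i, run, evaluation };
    }
    seed = evaluation.nextIteration;
  }
  return best; // the best round is not necessarily the last one
}
```

Note the `best` bookkeeping: the loop returns the highest-scoring round even if later rounds regress, rather than trusting recency.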

But hill climbing is a local search. It follows the gradient of whatever metric you give it. Point it at the wrong objective and it will confidently optimize you into a cliff. The research community has gone considerably further. Here's the landscape:

The common thread: all four require traces, all four require a scoring function, and all four fail if that scoring function points at the wrong objective. But they're not interchangeable. Here's why you don't just default to NSGA-II:

Cost & complexity per optimization cycle (based on ~$135/run, from the hill climbing experiment):

01 · Hill Climbing: 5–10 agent runs · ~$700 per cycle. Escalate when your single-objective metric plateaus or you want to verify a local maximum isn't an artifact of the rubric.

02 · NSGA-II: 500–2,000 runs · ~$70k per cycle. Escalate when you have two genuinely competing objectives (return vs. max drawdown) and need a Pareto front rather than a single optimum.

03 · Meta-Harness: 10–50 runs, 10M token context · ~$5k+ per cycle. Escalate when you've exhausted prompt content changes and believe the problem is what information the agent sees, not what it's asked to do.

04 · AutoResearch-RL: continuous, no fixed cap · open-ended cost. Use when you have a fixed overnight compute budget, a deterministic reward signal, and you want the loop to run with no human in the loop.
$135/run is the measured cost from the hill climbing experiment in the next section: $676 total across 5 complete evaluation cycles. NSGA-II requires a population of ~50 configurations per generation across ~10 generations — 500 agent runs minimum before you see a Pareto front.
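The cost gap between the first two approaches is plain arithmetic at the measured $135 per run:

```javascript
const costPerRun = 135;                    // measured, hill climbing experiment
const hillClimbCycle = 5 * costPerRun;     // 5 full evaluation cycles
const nsga2Cycle = 50 * 10 * costPerRun;   // ~50 configs x ~10 generations
console.log(hillClimbCycle, nsga2Cycle);   // 675 67500
```

A two-order-of-magnitude difference per cycle is why "just use the fancier optimizer" is rarely the first move.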

For most teams: start with hill climbing. Run it until it stalls. If you hit a genuine multi-objective tradeoff, consider NSGA-II — but only with automated infrastructure and a clear budget. The rubric is still load-bearing in all four cases. Here's what it looks like when the rubric fails.


I ran this loop five times. The first round was still the best.

War Story · Hill Climbing Experiment · $676 spent

I spent $676 running a hill climbing experiment on a live trading agent. The setup: run the agent, have the evaluator grade the full trace, seed the next run with the evaluator's nextIteration note. Repeat five times. Measure whether the agent improves.

Round 1: 71 · Iron Condor · 54% avg annual
Round 5: 27 · Long directional · −6.3% avg · 92% drawdown in 2022

The rubric said push for higher returns. The agent did exactly that — which is the problem. It wasn't misbehaving. It was doing precisely what the evaluator asked. Every round, the feedback said "push harder," so the agent abandoned what was working and chased the metric instead. By Round 4, the context window only carried the two most recent rounds, so Round 1 — the thing that had actually worked — was gone.

The failure wasn't the optimization loop. It was the rubric. The evaluator's goal and the actual goal had diverged — and a perfectly obedient agent followed the evaluator off a cliff. This is Goodhart's Law running live, in production, at $135/run. Bailey et al. (2014) documented the same dynamic in quant finance: with enough iterations, any optimization process finds strategies that score perfectly on the metric and fail in live trading. The short context window made it worse — but the bad rubric made it inevitable.

Read the full experiment →

The feedback loop: how evaluation connects to memory.

Evaluation doesn't mean much if the agent can't learn from it. The feedback loop is what turns a one-time grade into compounding improvement.

The full pipeline has five steps. Each one feeds the next:

Evaluation · Feedback Loop
Step 1 · Agent Run

The agent receives its task, reads injected memory from past runs, and works through the ReAct loop. Every decision, every tool call, every result is captured.

What this produces
agent.run({
  task: "Build options strategy",
  context: injected_memory,
  maxIterations: 20
})
→ agent_id: "69d49c..."
→ status: "completed"
→ iterations: 12

Run → Trace → Judge → Memory → Improve → repeat. Each loop produces a labeled training example: what the agent did, how it scored, and what to do differently next time.

What makes Step 5 concrete: the injected memory isn't an abstract "past context" flag. It's a literal block of text prepended to the next agent's system prompt — the AgentSummary record retrieved from MongoDB and formatted as natural language instructions. Here's what that block looks like at the start of Round 2:

System prompt injection · Round 2
## Previous Run Context
Score: 71/100 · Verdict: good

Lessons learned:
- Iron Condors on SPY/QQQ survived the 2022 bear market
- Multi-year average 54.3% — consistent across all regimes
- Max drawdown 59% exceeded the 50% threshold

Next iteration focus:
  Enhance return profile while holding the 2022 floor.
  Do not abandon the Iron Condor structure.

---
## Your Task
Build an options strategy for a $25,000 account...

The evaluator doesn't just grade — it writes the next run's starting conditions. That's the loop. And as the war story above shows, get the rubric wrong and the injected note corrupts every run that follows. A demo runs once. A system runs, gets graded, and improves — or degrades, depending on what you told it to optimize for.
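Assembling that injection block from a stored summary record is plain string-building. A sketch; the AgentSummary field names here (lessons, nextIteration, and so on) are illustrative assumptions, not NexusTrade's actual schema:

```javascript
// Turn a stored AgentSummary record into the prompt prefix for the next run.
function buildInjection(summary, task) {
  return [
    "## Previous Run Context",
    `Score: ${summary.overallScore}/100 · Verdict: ${summary.verdict}`,
    "",
    "Lessons learned:",
    ...summary.lessons.map((l) => `- ${l}`),
    "",
    "Next iteration focus:",
    `  ${summary.nextIteration}`,
    "",
    "---",
    "## Your Task",
    task,
  ].join("\n");
}
```

Because this string is the entire channel between rounds, whatever the evaluator writes into `nextIteration` is, literally, the next agent's instructions.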


Expose your agent to the world: the NexusTrade MCP server.

Everything in this article — traces, LLM judges, feedback loops — is built on Aurora's infrastructure. The same evaluator that graded the hill climbing experiment. The same backtester the agent used to test the Iron Condor across 2021-2024. Those tools are accessible via MCP, which means you can call them from Claude Desktop or Cursor without touching the NexusTrade UI.

The MCP server exposes the full tool surface: screener, portfolio system, backtester, indicator engine, and the evaluator itself. Here's the config:

mcp config
{
  "mcpServers": {
    "nexustrade": {
      "url": "https://nexustrade.io/api/mcp",
      "headers": {
        "Authorization": "Bearer <your-api-key>"
      }
    }
  }
}

Then from Claude Desktop or Cursor, you can ask: "Use NexusTrade to screen for stocks with RSI below 40 trading above their 200-day moving average." Claude calls screen_stocks on the NexusTrade MCP server. The server returns live results from the same screener Aurora uses internally. No copy-paste. No API docs. One tool implementation, available from any MCP-compatible client.

You can also run the evaluator directly. run_agent_run_evaluator(agent_id) grades any completed agent run from any client. The evaluation loop becomes accessible from wherever you're working.


Pop quiz.

Question 1 of 4

What is a trace in the context of AI agents?

Answer: A trace is a structured log of every step the agent took: inputs, outputs, tool calls, results, costs, and timing. Think of it like a flight recorder — you can reconstruct exactly what happened, step by step, to debug issues or evaluate performance. Without it, a failed run is a black box.
Question 2 of 4

An agent produces a 126% backtest return in 2024, but you only have one year of data. The evaluator gives it a high score. What's wrong?

Answer: Single-year outlier returns are the textbook signature of overfitting. The agent may have memorized 2024-specific anomalies that won't repeat in live trading. A trustworthy evaluator caps the score based on multi-year average return and requires evidence across at least two distinct market regimes — including a bear market. One year of 126% is not evidence of a durable edge.
Question 3 of 4

When would you use an LLM judge instead of an algorithmic evaluator?

Answer: When you need to evaluate subjective criteria that are hard to measure with code — "did the agent explain its reasoning clearly?" "did it consider the user's risk tolerance?" "did it test fundamentally different strategy types?" Objective metrics like cost, iteration count, and Sharpe ratio should use algorithmic evaluation. Code is faster, cheaper, and deterministic for anything you can count.
Question 4 of 4

True or false: the evaluation feedback loop requires both evaluation AND memory to work. Having only one of the two is not enough.

Answer: True. Without evaluation, the agent has no signal for what "better" means — it can't distinguish a good run from a bad one. Without memory, the agent can't retain the lessons it learned — every run starts from scratch. You need both. Evaluation produces the signal. Memory carries it forward.

You've built the complete agent. This is where most people stop. It's where you start.

Five articles. Five modules. You started with a leaked source file and ended with a production evaluation loop — router, ReAct engine, long-term memory, rubric design, feedback loop. The only thing left is to run it.

The $676 hill climbing experiment is a better argument for evaluation than anything else in this article. The agent wasn't broken. The rubric was. Build the right rubric and the loop compounds toward something real. Build the wrong one and a perfectly obedient agent will follow it off a cliff, five rounds in a row, at $135 per run.

Before the capstone, here's the full series in five minutes. Every concept, every module, from the leaked source file to the production feedback loop.

The full course recap. Five modules in five minutes. Watch this before running the capstone.

Module 6 is where the capstone lives. The full pipeline, embedded directly in the lesson — no separate window, no instructions to follow in another tab. Aurora runs inline: screens the market, validates with news, builds a watchlist, then wires up a scheduled agent to manage it every week.

[Screenshot: AI Agents from Scratch capstone lesson, with Aurora embedded inline, running the full research pipeline with a step progress bar]

The capstone runs Aurora directly in the lesson page. Three steps: technical screen, watchlist, weekly monitor agent. The progress bar tracks where the agent is. You watch every tool call happen in real time.

AI Agents from Scratch: Complete the Series

Free. No credit card. The Module 6 capstone runs Aurora inline — screen, validate, build a watchlist, wire up a weekly monitor. Takes about 20 minutes.

Run the Capstone Free → Or open Aurora directly →