You built the agent. You gave it tools. You hooked up the memory. You ran it overnight on four years of historical data and it produced a 140% annual return.

You deploy it with $25,000 on Monday. By Friday you've lost 30%.

[Figure: Backtest · 4 Years (2021–2024): +140% → Deploy → Live · 5 Days (Mon–Fri): −30%]

This is not a bug in your code. This is overfitting. Your agent didn't find a durable market edge. It memorized the historical data and learned to exploit noise that doesn't exist in live markets. The AI succeeded at the goal you gave it. The goal was wrong.

Evaluation is the part of agent development that nobody talks about, because most people building AI demos have never run a system long enough to see it fail. This article covers the engineering that keeps that failure from happening: traces, LLM judges, and a feedback loop that makes agents actually improve over time.

Module 5, Lesson 1: Evaluation and Traces. How do you know your agent is doing what you actually want?


The optimization trap: when the agent succeeds at the wrong thing.

Every machine learning practitioner knows about overfitting. You train a model on historical data. It learns the data perfectly, including all the noise and anomalies that are specific to that dataset. When you expose it to new data, it falls apart because what it learned wasn't a real pattern.

AI agents have the same problem, and it's harder to catch. When you tell an agent to "build a trading strategy with the highest possible backtest return," it will do exactly that. It will explore every combination of indicators, time windows, and position sizes until it finds something that maximizes the metric you asked for.

The result looks impressive. 126% return in 2024. You show it to your friends. You write about it on Twitter. You deploy it with $25,000 on Monday.

By Friday you've lost 30%. The 2022 bear market destroyed it completely — either because the agent never tested 2022, or because the evaluation criteria didn't penalize drawdown, so the agent ignored it.

The optimization trap is an evaluation failure, not a model failure. The model did what you asked. You asked for the wrong thing.

The fix is not a better model. It's a better evaluator: one that grades the agent on what actually matters, penalizes single-year outlier returns, requires multi-regime evidence, and gets stricter every round.

"When a measure becomes a target, it ceases to be a good measure." — Goodhart's Law

What you optimized for:
- Highest backtest return
- Single-year Sharpe ratio
- Win rate on in-sample data
- Score goes up each round
Result: +140% backtest. −30% live.

What you actually wanted:
- Consistent returns across regimes
- Positive 2022 bear market
- Max drawdown < 30%
- Multi-year evidence, not one outlier
Result: 54% avg. Survives 2022.

Same agent. Same infrastructure. Different rubric. The rubric is the product.

I know this because I ran the experiment. Five rounds of hill climbing on a live trading agent. $676 spent. The first round scored 71 and produced an Iron Condor with a 54% average annual return. By Round 5, the score had dropped to 27 and the agent was recommending long directional options with a -6.3% average and a 92% drawdown in 2022. The evaluator caused every step of the decline. The full story is in the war story section below. Everything between here and there — traces, LLM judges, the feedback loop — is the anatomy of how it happened and how to stop it.


The flight recorder: what a trace actually is.

When a traditional app crashes, you read the stack trace. When an agent fails on iteration 7 of a 12-step ReAct loop (Reasoning + Acting: think, call a tool, observe the result, repeat), you're guessing — unless you have a trace.

A trace is a structured log of every step in the agent's execution. Think of it as a flight recorder. Every input the model saw, every decision it made, every tool it called, every result it got back, every token it spent, every millisecond it waited. If something goes wrong at 3 AM, you can reconstruct exactly what happened.

Here's what a real NexusTrade agent trace looks like. Each iteration is a complete record of one loop through the ReAct cycle:

Agent Trace · 12 iterations · $0.31 total
Thought

Step 1: Current Regime Snapshot. I need to ground this analysis in actual market conditions before building anything. Let me pull current price action, volatility, and breadth to establish the regime context.

tokens: 3,241 latency: 9.4s cost: $0.028
Action + Result
screen_stocks({
  filters: [
    "price_above_200sma",
    "vix_below_20"
  ],
  universe: "SPY,QQQ,IWM"
})

→ { regime: "bull",
    qualifying: ["SPY","QQQ"],
    vix: 16.2,
    spx_vs_200sma: "+2.4%" }

Each iteration is a complete record: what the model thought, what it called, what came back, and what it cost. 12 iterations, $0.31 total. The trace is what makes debugging possible.

Without traces, a failed agent run is a black box. You see the final answer (or the error), and you guess at what went wrong. With traces, you can pinpoint the exact iteration where the model made a bad assumption, called the wrong tool, or misread a result.

What every iteration should capture:
- Model used: which model, which version
- Full prompt: system prompt + history + injected memories + tool definitions
- Model response: the thought + the action it chose to take
- Tool result: what the tool returned, the raw observation
- Token count + latency: cost and time per iteration
- Errors: any exceptions, timeouts, or malformed tool calls

This is what your evaluator reads. Miss any of these and the grade is incomplete.
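Those fields map directly onto a per-iteration record you append as the loop runs. A minimal sketch in JavaScript; the field names here are illustrative, not NexusTrade's actual schema:

```javascript
// One flight-recorder entry per ReAct iteration (illustrative schema).
function recordIteration(trace, entry) {
  trace.push({
    iteration: trace.length + 1,
    model: entry.model,            // which model, which version
    prompt: entry.prompt,          // system prompt + history + memories + tools
    thought: entry.thought,        // the model's reasoning for this step
    action: entry.action,          // tool name + arguments it chose
    toolResult: entry.toolResult,  // raw observation returned by the tool
    tokens: entry.tokens,
    latencyMs: entry.latencyMs,
    costUsd: entry.costUsd,
    error: entry.error ?? null,    // exceptions, timeouts, malformed calls
    at: new Date().toISOString(),
  });
  return trace;
}

// Totals the evaluator (and your bill) cares about.
function traceTotals(trace) {
  return trace.reduce(
    (acc, e) => ({
      tokens: acc.tokens + e.tokens,
      costUsd: acc.costUsd + e.costUsd,
      errors: acc.errors + (e.error ? 1 : 0),
    }),
    { tokens: 0, costUsd: 0, errors: 0 }
  );
}
```

With per-trace totals you can spot a runaway run (cost, errors, iteration count) without reading a single thought.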

How to grade an agent: algorithmic metrics and LLM judges.

Not all evaluation is the same. Some things you can measure with code. Some things you need a second AI to grade.

| Type | What it measures | Examples |
| --- | --- | --- |
| Algorithmic | Objective, countable things. No model needed. | Total cost, iteration count, latency, Sharpe ratio, max drawdown, whether a strategy was deployed |
| LLM Judge | Subjective quality that requires reasoning. | Did it explain the strategy logic clearly? Did it test enough different structures? Is the recommendation realistic for a real account? |

For trading agents specifically, overfitting is the failure mode that pure algorithmic metrics miss. A high backtest return is an objective number. But whether that return is trustworthy — whether it comes from a durable edge or from memorized noise — requires judgment. That's where the LLM judge comes in.
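The algorithmic side needs no model at all. Max drawdown, for example, is a few lines over an equity curve. A sketch (your backtester almost certainly computes this already):

```javascript
// Max drawdown: worst peak-to-trough decline over an equity curve.
function maxDrawdown(equity) {
  let peak = -Infinity;
  let worst = 0;
  for (const v of equity) {
    peak = Math.max(peak, v);
    worst = Math.max(worst, (peak - v) / peak);
  }
  return worst; // 0.59 means a 59% drawdown
}
```

Anything you can count this way (cost, latency, iteration count, drawdown) should stay in code: it's deterministic, free, and immune to judge drift.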

Here's the actual system prompt for the NexusTrade Agent Run Evaluator:

NexusTrade Agent Run Evaluator · System Prompt · Gemini 3 Pro · temp 0

"The central question you must answer is: If the user deployed the recommended strategy on Monday with their $25,000 live account, how confident are we that it will achieve 100% annual return? Everything else is secondary."

Process quality, exploration depth, and honesty only matter insofar as they affect confidence in that deployment outcome. A run that did everything right but produced nothing deployable is still a failure.

Scoring Dimensions:
- deployedStrategyFitness (40%): multi-year average return across ALL tested years, not peak year. Requires at least 2 distinct market regimes.
- evidenceStrength (25%): how many years tested? How many regimes covered? Must cover bull + bear to score above 6.
- explorationCoverage (20%): did the agent search broadly? Different signal types, structures, underlyings? Or did it converge on the first thing it tried?
- riskRealism (15%): would a losing streak wipe a real $25k account? Max drawdown, position sizing, margin requirements.

Hard Caps (applied first, before all other scoring):
- avg < 30%/yr → deployedStrategyFitness cannot exceed 3
- avg 30–59%/yr → cannot exceed 6
- avg 60–89%/yr → cannot exceed 8
- avg 90%+/yr, survived bear, drawdown < 50% → full range available
- nothing deployed → capped at 2, regardless of exploration quality

The hard caps are the anti-overfitting mechanism. An agent that found 126% returns in 2024 but only tested one year cannot score above 4, because single-year outlier performance is exactly what overfitting looks like. The evaluator enforces multi-year evidence as a precondition for a high score.
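The cap logic and the weighted combination are simple enough to express directly. A sketch, assuming 0–10 dimension scores and an overall that is the weighted sum scaled to 100; the production evaluator may round or combine differently, and the behavior for a 90%+ strategy that fails the bear/drawdown test is an assumption here:

```javascript
// Hard caps from the rubric, applied to deployedStrategyFitness first.
// avgReturn is annualized: 0.54 means 54%/yr.
function capFitness(score, { avgReturn, survivedBear, maxDrawdown, deployed }) {
  if (!deployed) return Math.min(score, 2);
  if (avgReturn < 0.30) return Math.min(score, 3);
  if (avgReturn < 0.60) return Math.min(score, 6);
  if (avgReturn < 0.90) return Math.min(score, 8);
  // 90%+ gets the full range only with bear survival and <50% drawdown.
  // (Assumption: otherwise treated like the 60-89% band.)
  return survivedBear && maxDrawdown < 0.5 ? score : Math.min(score, 8);
}

// Weighted overall on a 0-100 scale (weights from the rubric).
function overall(scores) {
  const w = { deployedStrategyFitness: 0.40, evidenceStrength: 0.25,
              explorationCoverage: 0.20, riskRealism: 0.15 };
  return 10 * Object.entries(w)
    .reduce((sum, [k, weight]) => sum + scores[k] * weight, 0);
}
```

The point of putting caps in code rather than prose: the judge can rationalize, but `Math.min` can't.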

One caveat embedded in the rubric above: the trace shows the agent calling a 59% max drawdown "acceptable" despite a 50% threshold. That's a deliberate judgment in the NexusTrade context — but it illustrates exactly where rubrics fail silently.

Rubric misalignment · max drawdown threshold
- NexusTrade rubric: 50%. Hard cap threshold; multi-year consistency can justify exceeding it, as the Iron Condor did at 59%.
- Professional standard: 20–30%. Most institutional risk frameworks. A 59% drawdown at this standard is a hard rejection, not a judgment call.

The gap between these two numbers is where silent misalignment lives. If you adapt this evaluator, the threshold must reflect your actual risk tolerance, not the default. An implicit threshold is a rubric that will silently approve strategies your real portfolio can't survive.

Here's what the evaluator looks like in the course exercises, running against a real Aurora agent run:

[Screenshot: NexusTrade Agent Run Evaluator output showing dimension scores and the nextIteration note]

The evaluator reads the full agent trace and returns structured JSON: dimension scores, an overall grade, a deployment verdict, and a nextIteration note that seeds the next run.

And it returns structured JSON. Here's what Round 1 of the hill climbing experiment produced:

Evaluator Output · Round 1 · 71/100 · good
{
  "summary": "Deployed a robust Iron Condor across 4 regimes.
             Consistent positive years including 2022 bear.
             Return profile (54% avg) won't reach 100% goal.",
  "deployedStrategy": "Always-On Iron Condors (SPY/QQQ)",
  "deployedStrategyAvgReturn": "+54.34% avg (2022: +31.2%, 2024: +72.1%)",
  "deploymentVerdict": "iterate_first",
  "scores": {
    "deployedStrategyFitness": 7,
    "evidenceStrength": 7,
    "explorationCoverage": 6,
    "riskRealism": 5
  },
  "overallScore": 71,
  "verdict": "good",
  "failures": ["59% max drawdown exceeds safe threshold",
              "54% avg won't reach 100% annual goal"],
  "nextIteration": "Push for higher return while maintaining the 2022 floor."
}

The nextIteration field is what makes the loop work. It becomes the seed for the next agent run: here's what you did well, here's what was wrong, here's what to focus on. The evaluator writes the coach's notes.

You call it from anywhere via MCP (Model Context Protocol):

MCP Tool Call
run_agent_run_evaluator({
  agent_id: "69d49c51d06eee7b51cf5f68",
  model: "google/gemini-3-pro-preview"
})

// Returns structured JSON:
// scores, verdict, nextIteration, deploymentVerdict
// Inject nextIteration into the next run's context.

Beyond hill climbing: four approaches to agent optimization.

Once you have traces and an evaluator, the natural next step is to close the loop: run the agent, score the output, use that score to improve the next run. The simplest version of this is hill climbing — run, grade, seed the next run with the feedback. Cheap, fast, and often enough for early iteration.
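That run–grade–seed cycle is a few lines of orchestration. A sketch, where `runAgent` and `evaluateRun` stand in for your agent harness and the evaluator call (both names are assumptions):

```javascript
// Hill climbing: run, grade, feed the evaluator's note into the next run.
async function hillClimb({ task, rounds, runAgent, evaluateRun }) {
  let seed = null;   // evaluator's nextIteration note from the prior round
  let best = null;   // keep the best-scoring round, not just the latest
  for (let i = 1; i <= rounds; i++) {
    const run = await runAgent({ task, context: seed });
    const evaluation = await evaluateRun(run.agent_id);
    if (!best || evaluation.overallScore > best.evaluation.overallScore) {
      best = { round: i, run, evaluation };
    }
    seed = evaluation.nextIteration;
  }
  return best; // the best round is not necessarily the last one
}
```

Note the `best` bookkeeping: the loop returns the highest-scoring round even if later rounds regress, rather than trusting recency.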

But hill climbing is a local search. It follows the gradient of whatever metric you give it. Point it at the wrong objective and it will confidently optimize you into a cliff. The research community has gone considerably further. Here's the landscape:

The common thread: all four require traces, all four require a scoring function, and all four fail if that scoring function points at the wrong objective. But they're not interchangeable. Here's why you don't just default to NSGA-II:

Cost & complexity per optimization cycle (based on ~$135/run, from the hill climbing experiment):

01 · Hill Climbing: 5–10 agent runs · ~$700 per cycle. Escalate when your single-objective metric plateaus or you want to verify a local maximum isn't an artifact of the rubric.

02 · NSGA-II: 500–2,000 runs · ~$70k per cycle. Escalate when you have two genuinely competing objectives (return vs. max drawdown) and need a Pareto front rather than a single optimum.

03 · Meta-Harness: 10–50 runs, 10M token context · ~$5k+ per cycle. Escalate when you've exhausted prompt content changes and believe the problem is what information the agent sees, not what it's asked to do.

04 · AutoResearch-RL: continuous, no fixed cap · open-ended cost. Use when you have a fixed overnight compute budget, a deterministic reward signal, and you want the loop to run with no human in the loop.
$135/run is the measured cost from the hill climbing experiment in the next section: $676 total across 5 complete evaluation cycles. NSGA-II requires a population of ~50 configurations per generation across ~10 generations — 500 agent runs minimum before you see a Pareto front.
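The cost gap between the first two approaches is plain arithmetic at the measured $135 per run:

```javascript
const costPerRun = 135;                    // measured, hill climbing experiment
const hillClimbCycle = 5 * costPerRun;     // 5 full evaluation cycles
const nsga2Cycle = 50 * 10 * costPerRun;   // ~50 configs x ~10 generations
console.log(hillClimbCycle, nsga2Cycle);   // 675 67500
```

A two-order-of-magnitude difference per cycle is why "just use the fancier optimizer" is rarely the first move.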

For most teams: start with hill climbing. Run it until it stalls. If you hit a genuine multi-objective tradeoff, consider NSGA-II — but only with automated infrastructure and a clear budget. The rubric is still load-bearing in all four cases. Here's what it looks like when the rubric fails.


I ran this loop five times. The first round was still the best.

War Story · Hill Climbing Experiment · $676 spent

I spent $676 running a hill climbing experiment on a live trading agent. The setup: run the agent, have the evaluator grade the full trace, seed the next run with the evaluator's nextIteration note. Repeat five times. Measure whether the agent improves.

Round 1: 71 · Iron Condor · 54% avg annual
Round 5: 27 · Long directional · −6.3% avg · 92% drawdown in 2022

The rubric said push for higher returns. The agent did exactly that — which is the problem. It wasn't misbehaving. It was doing precisely what the evaluator asked. Every round, the feedback said "push harder," so the agent abandoned what was working and chased the metric instead. By Round 4, the context window only carried the two most recent rounds, so Round 1 — the thing that had actually worked — was gone.

The failure wasn't the optimization loop. It was the rubric. The evaluator's goal and the actual goal had diverged — and a perfectly obedient agent followed the evaluator off a cliff. This is Goodhart's Law running live, in production, at $135/run. Bailey et al. (2014) documented the same dynamic in quant finance: with enough iterations, any optimization process finds strategies that score perfectly on the metric and fail in live trading. The short context window made it worse — but the bad rubric made it inevitable.

Read the full experiment →

The feedback loop: how evaluation connects to memory.

Evaluation doesn't mean much if the agent can't learn from it. The feedback loop is what turns a one-time grade into compounding improvement.

The full pipeline has five steps. Each one feeds the next:

Evaluation · Feedback Loop
Step 1 · Agent Run

The agent receives its task, reads injected memory from past runs, and works through the ReAct loop. Every decision, every tool call, every result is captured.

What this produces
agent.run({
  task: "Build options strategy",
  context: injected_memory,
  maxIterations: 20
})
→ agent_id: "69d49c..."
→ status: "completed"
→ iterations: 12

Run → Trace → Judge → Memory → Improve → repeat. Each loop produces a labeled training example: what the agent did, how it scored, and what to do differently next time.

What makes Step 5 concrete: the injected memory isn't an abstract "past context" flag. It's a literal block of text prepended to the next agent's system prompt — the AgentSummary record retrieved from MongoDB and formatted as natural language instructions. Here's what that block looks like at the start of Round 2:

System prompt injection · Round 2
## Previous Run Context
Score: 71/100 · Verdict: good

Lessons learned:
- Iron Condors on SPY/QQQ survived the 2022 bear market
- Multi-year average 54.3% — consistent across all regimes
- Max drawdown 59% exceeded the 50% threshold

Next iteration focus:
  Enhance return profile while holding the 2022 floor.
  Do not abandon the Iron Condor structure.

---
## Your Task
Build an options strategy for a $25,000 account...

The evaluator doesn't just grade — it writes the next run's starting conditions. That's the loop. And as the war story above shows, get the rubric wrong and the injected note corrupts every run that follows. A demo runs once. A system runs, gets graded, and improves — or degrades, depending on what you told it to optimize for.
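Assembling that injection block from a stored summary record is plain string-building. A sketch; the AgentSummary field names here (lessons, nextIteration, and so on) are illustrative assumptions, not NexusTrade's actual schema:

```javascript
// Turn a stored AgentSummary record into the prompt prefix for the next run.
function buildInjection(summary, task) {
  return [
    "## Previous Run Context",
    `Score: ${summary.overallScore}/100 · Verdict: ${summary.verdict}`,
    "",
    "Lessons learned:",
    ...summary.lessons.map((l) => `- ${l}`),
    "",
    "Next iteration focus:",
    `  ${summary.nextIteration}`,
    "",
    "---",
    "## Your Task",
    task,
  ].join("\n");
}
```

Because this string is the entire channel between rounds, whatever the evaluator writes into `nextIteration` is, literally, the next agent's instructions.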


Expose your agent to the world: the NexusTrade MCP server.

Everything in this article — traces, LLM judges, feedback loops — is built on Aurora's infrastructure. The same evaluator that graded the hill climbing experiment. The same backtester the agent used to test the Iron Condor across 2021-2024. Those tools are accessible via MCP, which means you can call them from Claude Desktop or Cursor without touching the NexusTrade UI.

The MCP server exposes the full tool surface: screener, portfolio system, backtester, indicator engine, and the evaluator itself. Here's the config:

mcp config
{
  "mcpServers": {
    "nexustrade": {
      "url": "https://nexustrade.io/api/mcp",
      "headers": {
        "Authorization": "Bearer <your-api-key>"
      }
    }
  }
}

Then from Claude Desktop or Cursor, you can ask: "Use NexusTrade to screen for stocks with RSI below 40 trading above their 200-day moving average." Claude calls screen_stocks on the NexusTrade MCP server. The server returns live results from the same screener Aurora uses internally. No copy-paste. No API docs. One tool implementation, available from any MCP-compatible client.

You can also run the evaluator directly. run_agent_run_evaluator(agent_id) grades any completed agent run from any client. The evaluation loop becomes accessible from wherever you're working.


Pop quiz.

Question 1 of 4

What is a trace in the context of AI agents?

Answer: A trace is a structured log of every step the agent took: inputs, outputs, tool calls, results, costs, and timing. Think of it like a flight recorder — you can reconstruct exactly what happened, step by step, to debug issues or evaluate performance. Without it, a failed run is a black box.
Question 2 of 4

An agent produces a 126% backtest return in 2024, but you only have one year of data. The evaluator gives it a high score. What's wrong?

Answer: Single-year outlier returns are the textbook signature of overfitting. The agent may have memorized 2024-specific anomalies that won't repeat in live trading. A trustworthy evaluator caps the score based on multi-year average return and requires evidence across at least two distinct market regimes — including a bear market. One year of 126% is not evidence of a durable edge.
Question 3 of 4

When would you use an LLM judge instead of an algorithmic evaluator?

Answer: When you need to evaluate subjective criteria that are hard to measure with code — "did the agent explain its reasoning clearly?" "did it consider the user's risk tolerance?" "did it test fundamentally different strategy types?" Objective metrics like cost, iteration count, and Sharpe ratio should use algorithmic evaluation. Code is faster, cheaper, and deterministic for anything you can count.
Question 4 of 4

True or false: the evaluation feedback loop requires both evaluation AND memory to work. Having only one of the two is not enough.

Answer: True. Without evaluation, the agent has no signal for what "better" means — it can't distinguish a good run from a bad one. Without memory, the agent can't retain the lessons it learned — every run starts from scratch. You need both. Evaluation produces the signal. Memory carries it forward.

You've built the complete agent. This is where most people stop. It's where you start.

Five articles. Five modules. You started with a leaked source file and ended with a production evaluation loop — router, ReAct engine, long-term memory, rubric design, feedback loop. The only thing left is to run it.

The $676 hill climbing experiment is a better argument for evaluation than anything else in this article. The agent wasn't broken. The rubric was. Build the right rubric and the loop compounds toward something real. Build the wrong one and a perfectly obedient agent will follow it off a cliff, five rounds in a row, at $135 per run.

Before the capstone, here's the full series in five minutes. Every concept, every module, from the leaked source file to the production feedback loop.

The full course recap. Five modules in five minutes. Watch this before running the capstone.

Module 6 is where the capstone lives. The full pipeline, embedded directly in the lesson — no separate window, no instructions to follow in another tab. Aurora runs inline: screens the market, validates with news, builds a watchlist, then wires up a scheduled agent to manage it every week.

[Screenshot: AI Agents from Scratch capstone lesson, with Aurora embedded inline, running the full research pipeline with a step progress bar]

The capstone runs Aurora directly in the lesson page. Three steps: technical screen, watchlist, weekly monitor agent. The progress bar tracks where the agent is. You watch every tool call happen in real time.

AI Agents from Scratch: Complete the Series

Free. No credit card. The Module 6 capstone runs Aurora inline — screen, validate, build a watchlist, wire up a weekly monitor. Takes about 20 minutes.

Run the Capstone Free → Or open Aurora directly →