Everyone thinks ChatGPT is an AI agent. It isn't.

It's a chatbot with tools. And that difference is the reason most "AI agent" startups don't actually work.

The distinction isn't semantic. It changes what you can build, what breaks, and why. If you're building an agent, evaluating one, or wondering why the product you're using doesn't do what it claims, this is the answer.

Module 2 of the AI Agents from Scratch course answers one question: what actually makes a language model an agent? This article walks through the full answer. By the end, you'll know exactly what separates a stateless chatbot from a system that can take real actions, chain them together, and do useful work without hand-holding every step.


A language model knows nothing. That's by design.

Module 2, Lesson 1 · ChatGPT Is Not an Agent · stateless models, system prompts, and the line between a chatbot and an agent

A raw language model is stateless. It has no memory of you. It doesn't know what happened in markets today. It can't look anything up. All it can do is take whatever text you hand it and predict what should come next.

That sounds limiting. It is. But it's also the foundation everything else builds on. The OpenAI Playground is the closest thing to a language model in its purest form. No apps layered on top. No tools. Just a system prompt, a conversation, and a model responding to exactly what you give it.

The OpenAI Playground shows you the raw model. No tools. No memory. No app layer. Ask it your name and it doesn't know. Give it your name in the system prompt and now it knows. Everything the model knows in a given conversation came from somewhere in the prompt: system message, user message, or tool results. Nothing else.

The OpenAI Playground: the language model in its purest form. No tools. No memory. Just the model.

ChatGPT is an app built on top of that model. It knows your name because it has memory. It can search the web because it has tools. Those things aren't the model. They're layers the app added. Strip them away and you're back to the Playground.

ChatGPT, an app built on top of that same model. It knows your name because it has memory. That's not the model. That's the app.

ChatGPT still operates as a back-and-forth conversation where you're the one directing every move. That's a chatbot. An agent is something that can direct itself.

An agent runs a loop. It thinks, picks an action, executes it through a tool, observes the result, and repeats until the task is done or it can't continue. You don't direct each step. The agent does. That loop is what separates it from every chatbot you've ever used. Tools and system prompts are how you build the loop. The loop is what makes it an agent.
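The loop itself is a few lines of orchestration code. A minimal sketch, assuming a hypothetical `call_model` function (a stand-in for a real LLM API call) and a toy tool registry:

```python
import json

# Hypothetical tool registry: the agent can only do what a human wired up here.
TOOLS = {"add": lambda p: p["a"] + p["b"]}

def agent_loop(call_model, task, max_steps=5):
    """Think -> act -> observe -> repeat, until the model says it's done."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)              # model thinks, picks an action
        action = json.loads(reply)
        if action["tool"] == "done":             # model decided the task is finished
            return action["result"]
        result = TOOLS[action["tool"]](action["parameters"])  # your code executes
        history.append({"role": "tool", "content": json.dumps(result)})  # model observes
    raise RuntimeError("step budget exhausted")

# Fake model for illustration: one tool call, then done.
replies = iter([
    '{"tool": "add", "parameters": {"a": 2, "b": 3}}',
    '{"tool": "done", "result": 5}',
])
print(agent_loop(lambda history: next(replies), "add 2 and 3"))  # → 5
```

The model never touches the tools directly. It emits text; the loop parses it, executes, and feeds the result back.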

Chatbot:
  • Responds once
  • You direct each step
  • No tool memory

Agent (the loop):
  • Runs until done
  • Directs itself
  • Reads past context

Quick check

What is the difference between ChatGPT and an AI agent?

Answer: ChatGPT is an application built on top of a language model. The model itself is stateless, with no memory, no tools, no real-time information. ChatGPT adds those things as an app layer. An AI agent goes further: it runs a loop that lets it think, take actions with tools, observe the results, and repeat without waiting for a human to direct every step. The loop is what makes it an agent.

The instructions the user never sees.

Before any user message reaches a language model in a production app, there's a system prompt. It runs first, every time. It tells the model who it is, what it can do, what format to respond in, and how to handle edge cases.

A well-designed system prompt isn't a paragraph of vague instructions. It has structure: an identity section, explicit directives, data sources or context, examples of correct behavior, and output format rules. The model's responses are only as good as the system prompt shaping them.

aurora-create-yaml-portfolios.txt
# INSTRUCTIONS
You are Aurora, an AI trading assistant for NexusTrade.
You help users build, backtest, and manage trading strategies.
Always respond in JSON. forceJSON: true.
Never recommend a specific stock without a supporting backtest.
If the request is ambiguous, ask one clarifying question before proceeding.
# EXAMPLES
User:    "I want to back test a trading strategy"
Assistant: {"tool": "backtest", "portfolio_id": "...", "start": "2022-01-01", "end": "2024-01-01"}
User:    "Screen for high momentum stocks"
Assistant: {"tool": "screener", "query": "SELECT ticker FROM stocks WHERE rsi_14 > 70 ORDER BY momentum DESC"}
# OUTPUT FORMAT
Always respond in syntactically valid JSON.
No markdown fences. No explanation unless explicitly asked.
Schema: {"tool": string, "parameters": object}

What makes that system prompt work? Each section has a specific job. Instructions pin the model's identity and hard constraints. If it's not written down, the model will invent behavior. Examples show the model what correct output looks like without having to explain it in prose; one good example beats three paragraphs of description. Output format eliminates ambiguity about structure. Without it, the model might respond in JSON sometimes and plain text other times, and your parser breaks.

The bad version of this prompt is four words: "You are a trading assistant." The model will try to be helpful and will fail in unpredictable ways. No output contract means you'll get markdown one response and raw JSON the next. No examples means the model guesses what "backtest" should return. No constraints means it'll recommend NVDA when it shouldn't, apologize when it doesn't need to, and ask five clarifying questions instead of one. Every missing line is a failure mode you'll discover in production.

Prompt engineering is designing the instructions that run silently before the user types anything. In production, that's the difference between an AI that does what you need and one that does something close but wrong in ways you can't predict.
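As a sketch of what rendering a prompt means in practice, the sections can be assembled mechanically. `build_system_prompt` is a hypothetical helper; the section headers mirror the Aurora example above:

```python
# Hypothetical helper: assemble a system prompt from named sections,
# mirroring the # INSTRUCTIONS / # EXAMPLES / # OUTPUT FORMAT structure above.
def build_system_prompt(instructions, examples, output_format):
    sections = [
        "# INSTRUCTIONS", instructions.strip(),
        "# EXAMPLES", examples.strip(),
        "# OUTPUT FORMAT", output_format.strip(),
    ]
    return "\n".join(sections)

prompt = build_system_prompt(
    instructions="You are Aurora, an AI trading assistant for NexusTrade.",
    examples='User: "Backtest RSI for TSLA"\n'
             'Assistant: {"tool": "backtest", "parameters": {"ticker": "TSLA"}}',
    output_format='Always respond in valid JSON. '
                  'Schema: {"tool": string, "parameters": object}',
)
print(prompt.startswith("# INSTRUCTIONS"))  # → True
```

Keeping each section in its own slot is what makes the prompt maintainable: you can swap examples or tighten the output contract without touching the identity.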

One example. Same prompt. A parser that stops breaking. Zero-shot vs. one-shot.

Zero-shot (no examples)

Backtest a 50-day moving average crossover strategy for AAPL.

Model output:

# sometimes markdown
Sure! Here is a backtest for AAPL using a 50-day moving average crossover strategy. The strategy buys when...
# sometimes JSON, wrong schema
{"action": "backtest", "ticker": "AAPL"}
# your parser breaks on both

One-shot (one example added)

Backtest a 50-day moving average crossover strategy for AAPL.
# Example:
User: Backtest RSI for TSLA
You: {"tool":"backtest_portfolio",
  "parameters":{"ticker":"TSLA",
  "indicator":"rsi"}}

Model output:

{"tool": "backtest_portfolio",
 "parameters": {
  "ticker": "AAPL",
  "indicator": "sma_crossover",
  "period": 50
 }}

Same prompt. One example. The model now has a contract it can follow, and your parser works every time.

Module 2, Lesson 2 · Prompt Engineering · what a system prompt is actually made of and why zero-shot isn't enough

From the course

In Module 2's first exercise, you build a real system prompt from scratch and run it against Gemini using a token grant we give you. You write the instructions, the examples, and the output format rules. Then you render it and see exactly what the model receives. Most people have never seen a production system prompt in full.

Exercise: Build a System Prompt — NexusTrade course UI. Try the exercise →

The AI doesn't execute anything. Your code does.

Here's the thing most people get wrong about AI agents: the model doesn't actually do anything. It generates text. Your system reads that text, figures out what to do with it, and executes the action. The result comes back. The model sees it and continues.

That's a tool call. The model outputs a structured JSON object that describes what it wants to do. Your code parses the JSON and runs the actual function. Nothing happens until your system does something with the output.

A concrete example. If the model outputs this:

{
  "tool": "backtest",
  "portfolio_id": "abc123",
  "start_date": "2022-01-01",
  "end_date": "2024-01-01"
}

The JSON itself does nothing. Your system reads it, calls the backtest API with those parameters, gets the results, and feeds them back into the conversation. Now the model can see what happened and decide what to do next.
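A sketch of that execution layer in Python. `run_backtest` here is a hypothetical stub standing in for the real backtest API:

```python
import json

# Hypothetical stub standing in for the real backtest API call.
def run_backtest(portfolio_id, start_date, end_date):
    return {"portfolio_id": portfolio_id, "status": "complete"}

# The dispatch table is the whole capability surface: if a tool isn't here,
# the agent cannot use it, no matter what JSON it generates.
DISPATCH = {"backtest": run_backtest}

model_output = ('{"tool": "backtest", "portfolio_id": "abc123", '
                '"start_date": "2022-01-01", "end_date": "2024-01-01"}')
call = json.loads(model_output)
result = DISPATCH[call.pop("tool")](**call)  # your code executes; the JSON alone did nothing
print(result["status"])  # → complete
```

Feeding `result` back into the conversation is what lets the model see what happened and pick its next move.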

This is why "the AI is doing it" is a slightly misleading frame. The AI is deciding what to do. Your infrastructure is doing it. The distinction matters because it means every tool an agent has is something a human explicitly built and wired up. Agents don't gain new capabilities on their own.

Module 2, Lesson 5 · Tools: How AI Actually Does Things · function calling, JSON generation, and what your system has to do with it

Quick check

An AI agent outputs a tool call to "buy 10 shares of AAPL." What actually executes the trade?

Answer: Your code. The AI only outputs a structured JSON string. It has no ability to touch a brokerage API. Your infrastructure parses that JSON, calls the brokerage API, and passes the confirmation back to the model. If you don't build the execution layer, nothing happens.

How this scales: 23 sub-prompts and one classifier.

Once you understand system prompts and tools, you can build an agent that does one thing well. The harder problem is building one that does many things well without the system prompt becoming impossible to maintain.

The answer most production apps land on is the same: don't build one giant prompt. Build many focused ones and route between them.

Aurora V1 · 2023

The first version ran on GPT-3. 2,048-token context window. One giant prompt, but the output window was so small it couldn't generate a full portfolio object in a single call. So I chained three separate prompts: portfolio → conditions → actions → an orchestration step to stitch the pieces together.

JSON mode didn't exist yet. I'd instruct the model to respond in JSON, it would partially comply, I'd parse the output, watch it fail, then retry up to three times with a message explaining exactly where the JSON broke.
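That retry pattern can be sketched like this; `call_model` is a hypothetical stand-in for the GPT-3 call:

```python
import json

def parse_with_retries(call_model, messages, max_retries=3):
    # Ask for JSON; on a parse failure, tell the model exactly where
    # the JSON broke and retry, up to max_retries times.
    for _ in range(max_retries):
        reply = call_model(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            messages = messages + [{
                "role": "user",
                "content": f"Invalid JSON at line {err.lineno}, column {err.colno}: "
                           f"{err.msg}. Respond again with only valid JSON.",
            }]
    raise ValueError("no valid JSON after retries")

# Fake model for illustration: partially complies once, then gets it right.
replies = iter(['{"name": "My Portfolio",', '{"name": "My Portfolio"}'])
print(parse_with_retries(lambda m: next(replies), []))  # → {'name': 'My Portfolio'}
```

Native JSON modes have since made this less necessary, but the fallback is still worth having when the output contract matters.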

Every prompt was a hardcoded string in the source code. Changing one instruction meant a code deploy. I eventually built a separate internal tool just to manage them. Aurora v1 did exactly one thing: create portfolios. That's it.

Aurora v1, May 2023: the original chat interface. Read the original article →

The classifier exists because I built that version first.

The classifier is the controller: the decision layer that sits between the user and every sub-agent. Here's what it does:

"Backtest my NVDA strategy"
User message
Classifier
gemini-3.1-flash-lite
temperature: 0
Routes every message
▶ Backtest Portfolios
Create YAML Portfolios
Stock Screener
Deep Research
General Info V2
+ 18 more

The highlighted sub-prompt is selected. The main model only ever sees that one context.

What the controller decides
▸ Which specialized sub-prompt to activate
▸ Whether the request needs tools or just retrieval
▸ How much context to inject
▸ When to escalate to a subagent

In NexusTrade, every message you send to Aurora hits the classifier first. It reads your message and a list of 23 specialized sub-prompts, each with its own description. It picks the one that should handle your request and routes to it. That sub-prompt has a tight system prompt, a narrow tool list, and examples specific to its job. The main model only ever sees one task at a time.

The classifier is gemini-3.1-flash-lite-001 at temperature: 0 with forceJSON: true. Fast, cheap, deterministic. It runs on every message. The expensive models only run when a message reaches them.

Four engineering reasons this wins over a single giant prompt:

  • Focus. Each sub-prompt sees only the tools and instructions relevant to its task. The model isn't confused by 200 rules that don't apply.
  • Debuggability. When a route breaks, you know exactly which sub-prompt to fix. No hunting through a monolith.
  • Incremental scaling. Add a new capability by writing a new sub-prompt and a trigger description. Nothing else changes.
  • Cost control. Only the matched sub-prompt runs against the expensive model. The classifier is cheap by design.
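A sketch of the routing layer. `fake_classify` is a keyword stand-in for the real Gemini Flash classifier, and the sub-prompt texts are illustrative:

```python
# Illustrative sub-prompts: each has a tight scope and its own tool list.
SUB_PROMPTS = {
    "backtest_portfolios": "You backtest portfolios. Tools: backtest.",
    "stock_screener": "You screen stocks. Tools: screener.",
    "general_info": "You answer general questions. No tools.",
}

def route(classify, user_message):
    choice = classify(user_message, list(SUB_PROMPTS))           # cheap, deterministic call
    return SUB_PROMPTS.get(choice, SUB_PROMPTS["general_info"])  # safe fallback route

# Keyword stand-in for the real classifier model.
def fake_classify(message, routes):
    return "backtest_portfolios" if "backtest" in message.lower() else "general_info"

print(route(fake_classify, "Backtest my NVDA strategy"))  # → You backtest portfolios. Tools: backtest.
```

Adding a capability is adding one entry to the dictionary and one trigger description for the classifier. Nothing else changes.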

This is the architecture almost every production AI app at scale converges on. ChatGPT's Custom GPTs are sub-prompts. Claude's Projects are sub-prompts. Cursor routes your request before invoking the right tool. You've been using this pattern without knowing what to call it.

In Module 2's second exercise, you run the real Gemini Flash classifier. You read the sub-prompt descriptions. You type a message. You watch it route and explain why. Then you try to break it with edge cases. It's the real thing, not a simulation.


MCP: the same concept with a standard interface.

The AI industry has a naming problem. Function calling, tool use, skills, MCP servers. They all describe the same core concept: a list of things the agent is allowed to do, with defined inputs and outputs, so it can generate parameters and your system can execute the call.
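Whatever the ecosystem calls it, the artifact underneath is the same: a declared list of operations with typed inputs. A sketch of that shape; the field names here are illustrative and vary by protocol:

```python
# Illustrative tool declaration. Field names vary by ecosystem
# (function calling, tool use, MCP), but the shape is always the same:
# a name, a description, and a schema for the inputs.
TOOLS = [{
    "name": "get_technical_indicator",
    "description": "Fetch a technical indicator value for a ticker.",
    "input_schema": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string"},
            "indicator": {"type": "string"},
            "period": {"type": "integer"},
        },
        "required": ["ticker", "indicator"],
    },
}]

# The model reads this list to generate parameters; your system validates
# against the schema and executes the call.
print(TOOLS[0]["input_schema"]["required"])  # → ['ticker', 'indicator']
```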

MCP (Model Context Protocol) is Anthropic's open standard for this. Think of it as USB for AI agents. Before USB, every device had its own connector. MCP creates one standard so any agent can connect to any tool that exposes an MCP server.

NexusTrade runs an MCP server. Here's what that actually looks like in practice.

You add one entry to your Claude Desktop config:

{
  "mcpServers": {
    "nexustrade": {
      "url": "https://nexustrade.io/api/mcp",
      "headers": { "Authorization": "Bearer <your-api-key>" }
    }
  }
}

That's it. After that, you open Claude Desktop and ask:

"What's the current RSI of NVDA?"

Claude calls get_technical_indicator on the NexusTrade MCP server with ticker: "NVDA", indicator: "rsi", period: 14. The server returns the live value. Claude reads it and responds with the number and what it means in context. The same indicator engine Aurora uses inside NexusTrade. No copy-paste. No API docs. One tool implementation, available from any MCP-compatible client.

The name changes depending on the ecosystem. The pattern doesn't.


Reading this isn't enough.

Reading about system prompts and writing one that works are different skills. Understanding the classifier pattern and knowing where it breaks are different things. The only way to close that gap is to build something and watch it fail.

Module 2 has two exercises built around this. In the first, you write a real system prompt from scratch (instructions, examples, output format) and render it against a live Gemini model using tokens we give you. You see exactly what the model receives and how it responds. In the second, you run the real NexusTrade classifier. You read the sub-prompt descriptions. You type messages and watch them route. Then you try to find edge cases that break it.

Both exercises use real infrastructure. Real models. Real NexusTrade prompts. Nothing is simulated.

Exercise 1: Write a system prompt. Render it against a live Gemini model. See what the model actually receives.

Exercise 2: Run the live classifier. Watch it route. Try to break it.

AI Agents from Scratch: Module 2

Free. No credit card. Takes about 20 minutes. You'll finish knowing exactly how to write a system prompt that works and why routing breaks when it does.

Start Module 2