GPT-5 Mini is the BEST AI Model for Complex Reasoning Tasks

GPT-5 may have disappointed. But its "mini" model punches above its weight class


Like the entire world, I too was disappointed in GPT-5.

I tried my best to like GPT-5. I just can't. It sucks.

OpenAI lied to us, over-promised, and (severely) under-delivered I had very high hopes for GPT-5. In my defense, they…

nexustrade.io

GPT-5 under-delivered. While not a bad model by any metric, it’s roughly equivalent to the current state of the art.

There’s nothing new here.

However, in my evaluations, I found a hidden secret. One model stood out for being reliable, dirt cheap, and immensely capable, outperforming models 10x the cost in complex reasoning tasks.

Hidden amongst the hype is the best budget model ever created. GPT-5-mini is absolutely outstanding, and is the best AI model for normal daily usage.

Here’s the proof.

An Inexpensive and Insanely Powerful Daily Usage Model

I want to be very clear about what “category” GPT-5-mini is in.

It’s not state-of-the-art. It’s a budget-friendly alternative.

It’s in the category of models that I would call “dirt-cheap”. You can use it for tasks that require millions of tokens and barely spend a few bucks. The other models in its weight class include Gemini 2.5 Flash and OpenAI’s GPT-4o mini.

Comparing the cost of other budget-friendly large language models

Unlike a model like Claude Opus 4.1 (which costs $15/M input tokens and $75/M output tokens), these budget models cost roughly 1% of that, making them suitable for real-time, high-volume use cases.
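To make the price gap concrete, here is a rough back-of-the-envelope sketch. The Opus prices are the ones quoted above. The budget prices simply apply the "1% of that" figure rather than any official price sheet, and the token volumes are a made-up workload.

```python
# Prices in USD per one million tokens.
OPUS_INPUT, OPUS_OUTPUT = 15.00, 75.00        # quoted in this article
BUDGET_INPUT, BUDGET_OUTPUT = 0.15, 0.75      # assumption: 1% of the Opus prices


def job_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Return the cost in USD of a job, given per-million-token prices."""
    return ((input_tokens / 1_000_000) * input_price
            + (output_tokens / 1_000_000) * output_price)


# Hypothetical workload: 10M input tokens and 2M output tokens.
opus_cost = job_cost(10_000_000, 2_000_000, OPUS_INPUT, OPUS_OUTPUT)
budget_cost = job_cost(10_000_000, 2_000_000, BUDGET_INPUT, BUDGET_OUTPUT)

print(f"Opus-class model: ${opus_cost:.2f}")
print(f"Budget model:     ${budget_cost:.2f}")
```

Under these assumptions, the same job drops from hundreds of dollars to a few dollars.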

For over four months, Gemini was the undisputed best model for daily usage. However, with the launch of GPT-5-mini, there is another clear winner.

Despite all of the hype, Google BEATS OpenAI and remains the best AI company in the world.

O3, O4-mini and 4.1 are not enough to dethrone Google

medium.com

Here’s how I determined this.

Defining the Complex Reasoning Tasks

To objectively evaluate each large language model, I created highly personalized benchmarks for two different real-world complex reasoning tasks.

At a high level, these tasks are SQL Query Generation and JSON Object Generation, but in truth, the real tasks are more nuanced.

Allow me to dig deeper.

Testing Each Model in SQL Query Generation

One of the most important reasoning tasks is generating syntactically and semantically valid SQL queries, so the first benchmark measures how good each model is at exactly that.

This is extremely important. If a model can convert plain English questions into SQL Queries, non-technical investors can extract real insights without needing to know how to code.

I just tried OpenAI’s updated o1 model. This technology will BREAK Wall Street

All of my articles are 100% free to read! Non-members can read for free by checking out this link.

medium.datadriveninvestor.com

For example, if a user wants a list of fundamentally strong biotechnology stocks, a model that is good at generating SQL Queries can help users find accurate, data-backed answers – not from its training data, but from actual comprehensive stock analysis.

A list of fundamentally strong biotechnology stocks. Read the full list here

The process for getting these results is as follows.

I first uploaded financial data from a high-quality data provider (EODHD) into an analytics database
I then created a system prompt capable of querying the database
I added high-quality examples with few-shot prompting and retrieval-augmented generation
Finally, I iterated until I had a highly accurate SQL Query Generator
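As a rough illustration of the few-shot and retrieval steps, here is a minimal sketch that picks the most relevant stored examples for a question and splices them into a prompt. Everything in it is a placeholder: the example store, the table and column names, the prompt wording, and the word-overlap similarity (a real setup would more likely use embeddings). It is not the actual NexusTrade prompt.

```python
import re

# Hypothetical example store of (question, SQL) pairs.
# Table and column names are invented for illustration.
EXAMPLES = [
    ("What stocks have the highest market cap?",
     "SELECT ticker, market_cap FROM stocks ORDER BY market_cap DESC LIMIT 10"),
    ("Which stocks have the lowest P/E ratio?",
     "SELECT ticker, pe_ratio FROM stocks ORDER BY pe_ratio ASC LIMIT 10"),
    ("What is the average revenue of biotechnology stocks?",
     "SELECT AVG(revenue) FROM stocks WHERE industry = 'Biotechnology'"),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9/]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, a crude stand-in for embedding similarity."""
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k stored examples most similar to the question."""
    ranked = sorted(EXAMPLES, key=lambda ex: similarity(question, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, k: int = 2) -> str:
    """Assemble a few-shot prompt from the retrieved examples."""
    lines = ["You translate financial questions into SQL queries.", ""]
    for example_question, sql in retrieve(question, k):
        lines += [f"Question: {example_question}", f"SQL: {sql}", ""]
    lines += [f"Question: {question}", "SQL:"]
    return "\n".join(lines)


print(build_prompt("What biotechnology stocks have the highest market cap?"))
```

The last step, iterating until the generator is accurate, is where a benchmark becomes necessary.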

Want accurate financial data for your use-cases? Sign up for EODHD today!

Parts of the system prompt for generating accurate SQL Queries

To then objectively evaluate each AI model, I created an open-source benchmark called EvaluateGPT. This benchmark takes each language model, asks it a list of 90 financial questions, and has it generate a SQL query to answer each one.

Some of these questions are as follows. For a full list, check out the code repo here:

What AI stocks have the highest market cap?
Find stocks trading below their 200-day moving average
What is the average P/E ratio, P/S ratio, and FcF of Meta, Amazon, Google, Netflix, Nvidia, and Microsoft?

The query is then executed and the results are fed into 3 powerful LLMs (OpenAI o4-mini, Claude Sonnet 4, and Gemini 2.5 Pro), which grade the output on a scale from 0 to 1. The scores are averaged, and all of the models are sorted by their accuracy score.
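The averaging and sorting step can be sketched in a few lines. The model names, judge names, and scores below are invented for illustration and are not taken from EvaluateGPT's code or results.

```python
from statistics import mean

# Hypothetical judge scores (0 to 1) for a single question.
judge_scores = {
    "model-a": {"judge-1": 1.0, "judge-2": 0.8, "judge-3": 0.9},
    "model-b": {"judge-1": 0.6, "judge-2": 0.7, "judge-3": 0.5},
    "model-c": {"judge-1": 0.0, "judge-2": 0.0, "judge-3": 0.0},
}


def rank_models(scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Average the judges' scores per model, then sort best-first."""
    averaged = {model: mean(per_judge.values()) for model, per_judge in scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models(judge_scores)
for model, score in ranking:
    print(f"{model}: {score:.3f}")
```

Using several judges from different vendors is a hedge against any single grader favoring its own family of models.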

A Diagram Depicting the Evaluation Pipeline for Creating SQL Queries from Natural Language

The results from this benchmark are summarized in this table.

A table showing the model name, median score, average score, distribution, and execution time for each model evaluated

Just as my previous articles suggest, Gemini 2.5 Pro is at the top of the list. That’s not surprising. What is surprising is seeing how GPT-5-mini performed.

And the answer is… exceptional.

GPT-5-mini is among the cheapest models on this list. Yet it significantly outperforms Gemini 2.5 Flash, getting a higher median score (0.933 vs 0.90), a higher average score (0.717 vs 0.657), and an equivalent query execution success rate (78.65%). It even performs on par with GPT-5, while costing 80% less. When I saw this, I was shocked.
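If you are wondering how a median of 0.933 can sit next to an average of 0.717, one plausible reading is that queries which fail to execute pull the average down while leaving the median untouched. The sketch below shows the effect with invented scores. It assumes a failed query is graded 0, which is my simplification here and not necessarily how the benchmark handles failures.

```python
from statistics import mean, median

# Made-up per-question scores for one model.
# None marks a query that failed to execute.
scores = [1.0, 0.9, 1.0, 0.95, None, 0.9, 1.0, None, 0.85, 1.0]

# Assumption: a failed query is graded 0.
graded = [0.0 if s is None else s for s in scores]

# Share of queries that ran without an error.
success_rate = sum(s is not None for s in scores) / len(scores)

print(f"median: {median(graded):.3f}")
print(f"average: {mean(graded):.3f}")
print(f"execution success rate: {success_rate:.0%}")
```

Two failures out of ten barely move the median but cost the average a fifth of its value, which is why I report both.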

In fact, if we extrapolate the results from my previous articles, we can conclude that GPT-5-mini outperforms expensive reasoning powerhouses, such as Claude Opus 4 (which had a median score of 0.70), Grok 4 (with a median score of 0.70), and OpenAI o3 (with a median score of 0.90). All at a tiny fraction of the cost.

An older table showing the scores of different AI models for this SQL Query Generation task

A model that costs a few cents for millions of tokens outperforming a model that costs 100x more? Imagine the possibilities.

I tested every AI Model on a complex SQL Query Generation Task. Here’s where Grok 4 stands

I spent $200 to test every language model. Here’s the best

ai.plainenglish.io

But it’s not just SQL Query Generation where GPT-5-mini shines. It’s also fantastic at generating nuanced JSON objects.

Testing Each Model in JSON Object Generation

While SQL Query generation is important for allowing non-technical users to interact with data, another equally important use-case is the ability to create JSON objects.

Creating JSON objects from natural language is incredibly useful. A powerful JSON generator can allow non-technical people to create complex configurations using natural language. For example, with a JSON generator, I can fetch a list of fundamentally strong stocks, and instantly backtest these stocks on historical data without having to write a single line of code.

An algorithmic trading strategy that rebalances the Magnificent 7 based on the log of their P/E ratios. Read the full conversation here.

Similar to EvaluateGPT, I created a benchmark for generating, evaluating, and ranking all of the best models for generating these nuanced JSON objects.

Now, creating these JSON objects isn’t as simple as it sounds. We’re creating extremely large, deeply-nested JSON configurations. The process includes:

Portfolio generation: AI generates an outline of a “portfolio”, which includes a description of the strategy and an initial value.
Trading strategy generation: the AI then generates each trading strategy. This includes an action and a condition. These sub-components have their own highly-specialized system prompts
Final assembly: we then combine all of the parts and assemble the fully generated portfolio of trading strategies
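The three stages above can be sketched as a chain of small generators. In this sketch each generate_* function stands in for a separate LLM call with its own system prompt and just returns canned data. The function names and JSON field names are illustrative guesses, not the real NexusTrade schema.

```python
import json


def generate_portfolio_outline(request: str) -> dict:
    """Stage 1: outline of the portfolio (description and initial value)."""
    return {"name": "Example portfolio", "description": request, "initialValue": 10_000}


def generate_condition(strategy_hint: str) -> dict:
    """Sub-component: when the strategy should trigger."""
    return {"type": "indicator", "description": f"condition for: {strategy_hint}"}


def generate_action(strategy_hint: str) -> dict:
    """Sub-component: what the strategy does when it triggers."""
    return {"type": "buy", "description": f"action for: {strategy_hint}"}


def generate_strategy(strategy_hint: str) -> dict:
    """Stage 2: one trading strategy, built from an action and a condition."""
    return {
        "name": strategy_hint,
        "condition": generate_condition(strategy_hint),
        "action": generate_action(strategy_hint),
    }


def assemble_portfolio(request: str, strategy_hints: list[str]) -> dict:
    """Stage 3: combine the outline and the strategies into one nested object."""
    portfolio = generate_portfolio_outline(request)
    portfolio["strategies"] = [generate_strategy(hint) for hint in strategy_hints]
    return portfolio


portfolio = assemble_portfolio(
    "Buy TQQQ when momentum is strong, sell when it fades",
    ["buy on strength", "sell on weakness"],
)
print(json.dumps(portfolio, indent=2))
```

Because every stage is its own model call, a mistake in any one of them can invalidate the whole object, which is part of what makes this task hard.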

The process is depicted in the following diagram and described in detail in this article.

The diagram for creating algorithmic trading strategies (i.e., complex JSON objects) using natural language

Because creating these objects requires so many steps, it’s fairly expensive. As such, I ran the 14 best models through a list of 12 financial questions, such as:

Create 3 portfolios for TQQQ. One SMA-based strategy. One EMA-based strategy. And one RSI-based strategy. It should have reasonable buy and sell rules.
Create a strategy that rebalances the MAG7 by 1/the square root of their P/E ratio
Create a portfolio that maintains a 2:1 UPRO/GLD ratio every single day
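To give a sense of what a model has to understand here, the second prompt boils down to weights proportional to 1 divided by the square root of each stock's P/E ratio, normalized to sum to 1. A minimal sketch of that arithmetic, using placeholder tickers and invented P/E values rather than real market data:

```python
from math import sqrt

# Invented P/E ratios for placeholder tickers; not real market data.
pe_ratios = {"AAA": 25.0, "BBB": 100.0, "CCC": 16.0}


def inverse_sqrt_weights(pe: dict[str, float]) -> dict[str, float]:
    """Weight each ticker by 1/sqrt(P/E), normalized so the weights sum to 1."""
    raw = {ticker: 1 / sqrt(value) for ticker, value in pe.items()}
    total = sum(raw.values())
    return {ticker: r / total for ticker, r in raw.items()}


weights = inverse_sqrt_weights(pe_ratios)
for ticker, weight in weights.items():
    print(f"{ticker}: {weight:.1%}")
```

A stock with a lower P/E ratio gets a larger share of the portfolio, and the model has to encode that rule in the generated JSON instead of just computing it once.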

I then used two powerful AI models, Gemini 2.5 Pro and GPT-5, to grade the objects that were generated. The grades are assigned based on whether the AI truly understands the semantics and the purpose of the trading strategies.

I then compiled a list of scores and formatted it into a table. As in the SQL Query Generation task, GPT-5-mini does outstandingly well.

A table showing the best models for JSON object generation

In this task, Claude Opus 4.1 and Claude Opus 4 do the very best. These are by far the most expensive models in this list, so this makes sense.

But then the results get crazy.

GPT-5 and Gemini 2.5 Pro are right after these. Then, outperforming every other expensive model is GPT-5-mini.

This model has a median accuracy of 0.933 and an average score of 0.717. Its median beats that of much more expensive models, including Grok 4 (median 0.700, average 0.723) and Claude Sonnet 4 (median 0.70, average 0.684), and of course that of Gemini 2.5 Flash (median 0.825, average 0.746), although Grok 4 and Gemini 2.5 Flash come out slightly ahead on average score.

I’ve said it once, and I’ll say it again. Being able to go toe-to-toe with models that cost 10 to 100x more is absolutely insane.

Concluding Thoughts

GPT-5-mini has shattered my expectations of what a budget AI model can achieve. While GPT-5 itself may have underwhelmed with its incremental improvements, GPT-5-mini represents something far more revolutionary: democratized AI excellence.

In both SQL Query Generation and JSON Object Generation tasks, this model consistently punched above its weight class, matching or exceeding models that cost 10–100x more. With median scores of 0.933 in SQL tasks and comparable performance in JSON generation, it’s competing with, and even beating, more premium models.

The implications are profound. For developers building AI-powered applications, for businesses looking to integrate intelligent features, and for individuals wanting to leverage AI in their daily workflows, GPT-5-mini removes the cost barrier without sacrificing quality. You no longer need to choose between affordability and capability.

When you can process millions of tokens for pennies instead of hundreds of dollars, entirely new applications become viable. Real-time analysis, continuous monitoring, massive data processing — all suddenly within reach. The implications are insane.

Ready to Put GPT-5-mini to Work?

The benchmarks speak for themselves, but the real magic happens when you start using these models for your own financial analysis and trading strategies.

Try NexusTrade today and experience firsthand how GPT-5-mini can transform your investment research. Generate complex SQL queries to analyze fundamentals, create sophisticated trading strategies with natural language, and backtest your ideas — all powered by the most cost-effective AI model ever created.

Don’t let the “mini” fool you. This is enterprise-grade intelligence at a fraction of the cost. This model is as powerful as the very best model of 2024. And it’s accessible to you today for free.

You don’t have to believe me. Take 5 minutes and try it out yourself. Algorithmic trading is accessible to you right now for free.

NexusTrade AI Chat - Talk with Aurora

Talk to an AI assistant to perform financial analysis and create algorithmic trading strategies.