The Chinese OBLITERATED OpenAI. A side-by-side comparison of DeepSeek R1 vs OpenAI O1 for Finance
Before today, I thought the OpenAI O1 model was the best thing to happen to the field of AI since ChatGPT.
The O1 models are “reasoning models” — instead of responding instantly like a traditional model, they take their time “thinking”, which results in much better outcomes.
And MUCH higher prices.
A full day’s usage of OpenAI’s most powerful models
In fact, these models are so expensive that only the premium users of my AI app had access. Not because I wanted to restrict my users, but because I quite literally could not afford to subsidize such an expensive model.
The relative cost
However, thanks to the Chinese, my users can now experience the full power of the next generation of language models.
And they can do it at 2% of the price. This is not a joke.
The Chinese ChatGPT – like OpenAI and Meta had a baby
DeepSeek is the Chinese OpenAI, with a few important caveats. Unlike OpenAI, DeepSeek releases all of their models to the open-source community. This includes their code, architecture, and even model-weights — all available for anybody to download.
Ironically, this makes them more open than OpenAI.
DeepSeek R1 is their latest model. Just like OpenAI’s O1, R1 is a reasoning model, capable of thinking about the question before giving an answer.
And just like OpenAI, this “thinking process” is mind-blowing.
A side-by-side comparison of DeepSeek R1, OpenAI o1, and the original DeepSeek-V3
R1 matches or surpasses O1 on a variety of different benchmarks. To look at these benchmarks, check out their GitHub page. Additionally, in my experience, it’s faster, cheaper, and comparably accurate.
In fact, if you compare it apples-to-apples, R1 isn’t just a little cheaper; it’s MUCH cheaper.
- R1: $0.55/M input tokens | $2.19/M output tokens
- O1: $15.00/M input tokens | $60.00/M output tokens
Cost of DeepSeek R1 vs OpenAI O1
At comparable benchmark performance, R1 is roughly 27x cheaper per token than OpenAI’s O1 ($15.00 / $0.55 ≈ 27 for input, $60.00 / $2.19 ≈ 27 for output). That’s insane.
But that’s just benchmarks. Does the R1 model actually perform well for complex real-world tasks?
Spoiler alert: yes it does.
A side-by-side comparison of R1 to O1
In a previous article, I compared OpenAI’s O1 model to Anthropic’s Claude 3.5 Sonnet. In that article, I showed that O1 dominates Claude, and is capable of performing complex real-world tasks such as generating SQL queries. In contrast, Claude struggled.
The SQL that is generated by the model is subsequently executed, and then the results are sent back to the model for further processing and summarization.
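To make that loop concrete, here is a minimal sketch of how it can be wired up. It assumes an OpenAI-compatible client (DeepSeek exposes one) and a local SQLite database of daily prices; the model name, database file, and prompts are illustrative placeholders, not the exact setup behind my app.

```python
import sqlite3
from openai import OpenAI

# Hypothetical client setup; DeepSeek's API is OpenAI-compatible.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def answer(question: str) -> str:
    # 1. Ask the reasoning model to write a SQL query for the question.
    sql = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": "Return a single SQLite query that answers the question. SQL only."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # 2. Execute the generated SQL against the (hypothetical) price database.
    with sqlite3.connect("prices.db") as db:
        rows = db.execute(sql).fetchall()

    # 3. Send the raw results back to the model for processing and summarization.
    return client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nQuery results: {rows}\nSummarize the findings.",
        }],
    ).choices[0].message.content
```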
A diagram showing the process of using LLMs for financial research
I decided to replicate this exact same test with R1. Specifically, I asked the following questions:
- Since Jan 1st 2000, how many times has SPY fallen 5% in a 7-day period?
- From each of these start dates, what was the average max drawdown within the next 180 days? What about the next 365 days?
- From each of these end dates, what was the average 180 day return and the average 365 day return, and how does it compare to the 7 day percent drop?
- Create a specific algorithmic trading strategy based on these results.
To view the exact conversation, duplicate it, and continue from where I left off, check out the following link.
Using R1 and O1 for complex financial analysis – a comparison
Let’s start with the first question, which essentially asks the model how often SPY experiences drastic falls.
The exact question was:
Since Jan 1st 2000, how many times has SPY fallen 5% in a 7-day period? In other words, at time t, how many times has the percent return at time (t + 7 days) been -5% or more.
Note, I’m asking 7 calendar days, not 7 trading days.
In the results, include the data ranges of these drops and show the percent return. Also, format these results in a markdown table.
Here was its response.
DeepSeek’s response to the drastic fall question
Let’s compare that to OpenAI o1’s response.
OpenAI’s response to the drastic fall question
Both responses include a SQL query that we can inspect.
SQL query that R1 generated
We can inspect the exact queries by viewing the full conversations and clicking the info icon at the bottom of the message.
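The generated queries themselves live in the linked conversations rather than being reproduced here, but the underlying computation is simple to describe. As a rough illustration (not R1’s actual SQL), here is the same 7-calendar-day check written in pandas, assuming a Series of daily SPY closes indexed by date:

```python
import pandas as pd

def seven_day_drops(spy: pd.Series, threshold: float = -0.05) -> pd.DataFrame:
    """spy: daily SPY closing prices with a DatetimeIndex."""
    spy = spy.sort_index()
    # Close exactly 7 *calendar* days later; NaN when that date isn't a trading
    # day, so those rows fall out at the filter below (mirroring a self-join on date + 7).
    future = spy.reindex(spy.index + pd.Timedelta(days=7)).to_numpy()
    out = pd.DataFrame({
        "start_date": spy.index,
        "end_date": spy.index + pd.Timedelta(days=7),
        "pct_return": future / spy.to_numpy() - 1,
    })
    return out[out["pct_return"] <= threshold]  # drops of 5% or more
```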
If we look closely, we notice that both models’ responses are 100% correct.
The differences between them are:
- O1's response includes a total occurrences field, which is technically more correct (I did ask “how many times has this happened?”)
- O1's response was also not truncated. In contrast, R1’s response was abridged for the markdown table, making it hard to see the full list of returns
OpenAI’s response was a little bit better, but not by much. Both models answered accurately, and R1’s response was completely fine in terms of extracting real-world insights.
Let’s move on to the next question.
From this, what is the average 180 day max drawdown, the average 365 day max drawdown, and how does it compare to the 7 day percent drop?
The R1 model responded as follows:
R1’s response for the average 180 day max drawdown, 365 day max drawdown, and how it compares to the 7-day drop
In contrast, here is how O1 responded.
O1’s response for the average 180 day max drawdown, 365 day max drawdown, and how it compares to the 7-day drop
In this example, R1’s answer was actually better! It answered the question of “how does it compare to the 7-day drop” by including a ratio in the response.
Other than that, the answers were nearly exactly the same.
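As a side note, the metric behind this question is easy to state precisely. Here is a short pandas sketch of what “max drawdown within the next N days” means in this context; it illustrates the metric itself, not the SQL either model actually produced:

```python
import pandas as pd

def max_drawdown_after(spy: pd.Series, start: pd.Timestamp, days: int) -> float:
    """Worst peak-to-trough decline within `days` calendar days after `start`."""
    window = spy.sort_index().loc[start : start + pd.Timedelta(days=days)]
    running_peak = window.cummax()         # highest close seen so far in the window
    drawdown = window / running_peak - 1   # always <= 0
    return drawdown.min()                  # the most negative value is the max drawdown

# Averaging this over all of the drop dates found earlier is what the question asks for:
# avg_dd_180 = sum(max_drawdown_after(spy, d, 180) for d in drop_dates) / len(drop_dates)
```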
For the next question, we asked the following:
What was the average 180 day return and the average 365 day return, and how does it compare to the 7 day percent drop?