I used to be a better software engineer than AI. Claude Opus 4.7 changed that.
I'd made the backtest engine as good as I could make it. One weekend with Opus 4.7 cut the disk bill 7×, flattened its scaling curve, and made warm backtests 4× faster.
I was quite proud of my backtesting system.
Lightning fast. Deterministic to the bit. And expensive.
Very expensive. At peak, over $1,000 a month across compute and disk, with $230 of that going to duplicated market history on rent even when the machines were idle. Every machine that served a backtest carried its own private 200 GB copy: prices, options chains, fundamentals. Two machines meant 200 GB of identical bytes billed twice. It was the best design I could build with Claude Opus 4.1 and Gemini 2.5 Pro over about a week of focused work last September.
Then, last week, Anthropic released Opus 4.7 at the same $5/$25 per million token price as Opus 4.6, with 90% off warm reads via prompt caching. On the API, Bedrock, Vertex, and Copilot on day one. The numbers:
Ten points over Opus 4.6 in six months. Seven-point lead over GPT-5.4. Anthropic also reports 3× the production-engineering task resolution rate of Opus 4.6 on Rakuten's internal SWE-bench. Benchmarks only go so far.
What I actually cared about was whether it could implement a large, complicated feature end-to-end. I wrote out the plan: the target architecture, the fingerprint test as the correctness gate, the data types to migrate, the staged cutover. Then I handed it to Opus 4.7 and stepped back.
Every thirty minutes or so I'd check in: diffs looked right, the fingerprint held, commits read coherently. Otherwise I trusted the automated code reviews and my own review pass at the end to catch anything I missed.
Unlike the week I spent driving the mmap build as the lead architect, this time I mostly let it cook.
Every load-bearing architectural decision came from Opus: the three pruning layers, the concurrent chunk fetcher sharing one prefetched metadata, the condvar-backed appendable reader, the four nested retry layers. I read the diffs and ran the fingerprint. That's the difference this time.
And honestly, I was shocked.
The NexusTrade backtest engine, briefly
NexusTrade is the AI trading platform I've been building since 2020. Describe a trading idea in plain English. Aurora (our AI agent) creates a full algorithmic trading strategy: entry rules, exit rules, position sizing, options legs when relevant. The backtest engine runs that portfolio over real market history and reports what would have happened. Determinism is non-negotiable: identical inputs must produce identical outputs, bit-for-bit. Sub-second backtests feel magical; ten-second ones feel broken.
The magic is the goal.
That engine is what I handed Opus 4.7 this weekend. By Sunday night the architecture I'd been running since last September was gone.
200 GB of data on every worker, billed every time
For about seven months, the engine ran an mmap architecture: a Rust hydrator downloaded Parquet shards from GCS on startup, decoded them into consolidated .bin files on a local persistent disk, and the backtest process memory-mapped those files at request time. The hot path was fast because, after the first read of each file, the OS kept those bytes in RAM, so every subsequent read hit memory, not disk. Every worker carried the same 200 GB of market history.
The deployment fought back at every stage. Live trading had to be split off on day one (a long-running backtest can't share a CPU with a process that needs to place real orders in milliseconds), so two machines with their own 200 GB copies was the starting point. Then optimizations started hogging the CPU on the backtest box, so optimization moved onto its own Render machine. As user load grew, I had to stand up additional backtest Render instances by hand, each with its own 200 GB disk, because Render persistent disks can't be shared or auto-scaled.
By the time I'd moved compute to Fly, the fleet was three Fly backtest workers (one always-on primary + two reserves), three Fly optimizer reserves (no primary: optimization was never on the hot path for user traffic), and one Render live-trading instance. Seven machines, each with its own 200 GB disk: 1.4 TB of duplicated market history on rent. At Fly's $0.15/GB/month for volumes and Render's $0.25/GB/month for SSDs, that came out to $230/month purely for duplicated disk. Scale-to-zero saved nothing on that line: volumes bill whether the machine is running or not.
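The $230/month figure is straight arithmetic over the quoted per-GB prices and the fleet shape. A minimal sketch (the function name is mine; prices and machine counts are the ones stated above):

```rust
// Worked arithmetic for the duplicated-disk bill: six Fly volumes at
// $0.15/GB/month plus one Render SSD at $0.25/GB/month, 200 GB each.
fn monthly_disk_cost(fly_disks: u32, render_disks: u32, gb_per_disk: u32) -> f64 {
    const FLY_PER_GB: f64 = 0.15; // Fly volume pricing, $/GB/month
    const RENDER_PER_GB: f64 = 0.25; // Render SSD pricing, $/GB/month
    let fly = fly_disks as f64 * gb_per_disk as f64 * FLY_PER_GB;
    let render = render_disks as f64 * gb_per_disk as f64 * RENDER_PER_GB;
    fly + render
}
```

Six Fly disks (3 backtest + 3 optimizer) and one Render disk at 200 GB each comes out to $180 + $50 = $230/month, billed whether or not the machines run.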
The bill got untenable. I eventually dropped the always-on backtest primary and one optimizer reserve just to survive, trading peak throughput for a cost I could stomach. That's where the architecture sat when Opus 4.7 dropped.
How mmap served a backtest
Why that stopped working
Four infra stages tried to scale the design without dropping the "fast hot path" property. Every stage ran into the same wall: the 200 GB footprint was per-machine, and machines were multiplying.
- Stage 1 · Render · day one · 400 GB. 1 Render box for backtest + optimization, 1 for live trading. Optimizations hogged CPU and starved user backtests.
- Stage 2 · Render · split + scaled · 800 GB. Optimization gets its own box. To handle more users I manually stood up additional backtest instances, each with its own 200 GB disk (Render persistent disks can't be shared or auto-scaled). 4 Render instances, same data × 4.
- Stage 3 · Fly + mmap · peak · 1.4 TB. 3 Fly backtest (1 primary + 2 reserves), 3 Fly optimizer reserves, 1 Render live-trade. Seven 200 GB disks, each one its own copy of the same data.
- Stage 4 · Fly + mmap · cost cuts · 1.0 TB. Dropped the always-on backtest primary and one optimizer reserve to survive Fly's volume bill. Worse throughput at peak load, survivable cost.
- Stage 5 · Today · 0 GB. Stateless workers. Scale-to-zero on compute AND disk. One shared source on Tigris, streamed on demand.
So how did Opus 4.7 replace it? Start with the shape.
One shared data layer, no required local disk
Every Fly worker now boots with no required persistent state. Tigris (S3-compatible, Fly-hosted) is the source of truth for Parquet shards. No mandatory binary cache on disk, no hydration at startup. All thirteen data types (stock intraday, daily OHLC, options intraday, options daily, fundamentals, dividends, splits, economic, index signals, reports, financials, earnings, crypto) go through the streaming path.
Warm data loads in pointer-bump time. A new worker serves its first backtest in seconds, not hours of hydration. One always-on primary carries an optional disk-LRU tier for second-touch wins on options and intraday workloads (covered in the postscript to the trade-offs section below); every other worker runs with zero local disk. But how is any of that possible?
The new architecture, end to end
Here's the full path of a cold backtest on the streaming stack: a fresh worker that has never served a request, no local disk, nothing cached. The walkthrough follows the request from "strategy + time window" all the way to "backtest result."
Step 1 · Backtest request arrives
The engine receives a request: strategy, symbols, and a time window (start_ts, end_ts). This window determines which data needs to be fetched. Nothing is on disk yet.
That's the operational walkthrough. To understand why each of those steps is fast enough to add up to a sub-second warm backtest, four levers do the work: cut bytes, hide the network, keep the warm path instant, survive failures.
The first lever: don't fetch bytes you don't need
Four sections of deep mechanics follow. If that's not what you're here for, skip ahead to the takeaway.
A minute of SPY intraday in 2025 is a few hundred thousand option contracts. A month is tens of millions. The reader that tries to download all of it for a four-day backtest has already lost. Three mechanisms attack the same problem from three directions: change the storage layout so filters read 80× less data, prune whole row groups before they leave Tigris, and narrow the columns on data types where the backtest only needs one.
Why the decoder stores columns, not rows
The biggest single performance characteristic of the new decoder: every field gets its own tight buffer instead of 80-byte interleaved row structs. The options resolver's filter chains (find all calls in a DTE window with strike near $450) touch one column at a time. In the old row-form layout a single 1-byte option_type check pulled 80 bytes per row into cache. In columnar, it pulls one. Same data, 80× less memory bandwidth.

The public interfaces (MarketData, the iterator surface, the options resolver's call signature) are unchanged, so nothing upstream of the data layer had to know the migration happened.
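The row-vs-column trade-off above can be made concrete. A minimal sketch, with illustrative types that are not the engine's actual structs: an 80-byte interleaved row next to the same fields split into per-field buffers, where the filter step only ever touches the 1-byte column it cares about.

```rust
// Row form: one 80-byte struct per contract row. A filter on
// option_type drags the other 79 bytes through cache with it.
#[allow(dead_code)]
#[repr(C)]
struct OptionRow {
    timestamp: i64,
    strike: f64,
    bid: f64,
    ask: f64,
    volume: i64,
    open_interest: i64,
    underlying_px: f64,
    expiry: i64,
    contract_id: u64,
    option_type: u8, // the 1 byte the filter actually wants
    _pad: [u8; 7],
}

// Columnar form: each field in its own tight buffer.
#[allow(dead_code)]
struct OptionCols {
    option_type: Vec<u8>,
    strike: Vec<f64>,
    // ...one Vec per remaining field
}

impl OptionCols {
    // A filter-chain step scans exactly one column: 1 byte of memory
    // traffic per row instead of 80.
    fn count_calls(&self) -> usize {
        self.option_type.iter().filter(|&&t| t == b'C').count()
    }
}
```

The 80× bandwidth claim falls out of the layout: the row struct is 80 bytes, the column element is 1.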
Three layers of row-group pruning
A Parquet shard on Tigris is split into dozens of row groups, each a few thousand rows. The reader's job is to avoid downloading the ones it doesn't need. Three independent filters run before any bytes leave the bucket, and each catches a different access pattern the others miss.
Timestamp min/max. Every row group's footer records the min and max timestamp it contains. A backtest on March 5th skips every row group whose range doesn't overlap the day. This is the coarse filter; it's what makes "give me one day out of a ten-year shard" free.
Symbol min/max. The writer emits shards with ORDER BY (ticker, date) for daily types and ORDER BY (underlying, timestamp) for intraday options. That ordering turns the per-row-group ticker min/max into a real filter: a {SPY} backtest reads the row groups whose ticker range touches SPY and skips the rest. Without the writer's ORDER BY, every row group's min/max straddles the full alphabet and the filter is a no-op. The reader and the writer have to agree for this to work.
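Both of those layers are plain interval tests against the row-group footer stats. A minimal sketch (struct and field names are mine, not the engine's): a group survives only if its timestamp range overlaps the request window and its ticker range could contain a requested symbol.

```rust
// Per-row-group footer stats: min/max timestamp plus min/max ticker.
struct RowGroupStats<'a> {
    min_ts: i64,
    max_ts: i64,
    min_ticker: &'a str,
    max_ticker: &'a str,
}

fn survives(stats: &RowGroupStats, start_ts: i64, end_ts: i64, universe: &[&str]) -> bool {
    // Coarse filter: does the group's time range overlap the request window?
    let time_ok = stats.min_ts <= end_ts && stats.max_ts >= start_ts;
    // Symbol filter: only meaningful because the writer's ORDER BY keeps
    // each group's lexical ticker range narrow.
    let ticker_ok = universe
        .iter()
        .any(|t| *t >= stats.min_ticker && *t <= stats.max_ticker);
    time_ok && ticker_ok
}
```

A group covering tickers AAPL..MSFT is skipped outright for a {SPY} universe, no bytes downloaded.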
Bloom filter. Min/max is a range check: it admits row groups whose range contains the ticker you asked for even if that specific ticker never appears in the group. A Bloom filter in the row-group metadata answers the precise question "could this ticker be here?" with false positives but no false negatives. For narrow universes on fat shards, this is the layer that drops the last unnecessary download.
The writer adds AAPL, MSFT, NVDA. Each ticker hashes to K=3 bit positions. Seven bits end up set (AAPL and MSFT both hash to 9, AAPL and NVDA both hash to 14, so the positions overlap).
(Figure: the 16-bit bloom filter after the write. Seven bits are set to 1; the rest are 0.)
SPY hashes to {4, 7, 10}. Check those bits. Any one being 0 proves SPY was never added.
bit[4] = 0, bit[7] = 0, bit[10] = 0 · SKIP · SPY is definitely not in this row group.
MSFT hashes to {2, 9, 12}. Check those bits. All three are 1, so MSFT might be in this row group.
bit[2] = 1, bit[9] = 1, bit[12] = 1 · DOWNLOAD · MSFT might be in this row group.
"All 1s" doesn't prove MSFT is actually stored: hash collisions mean another ticker could have set all three bits. False positives (unnecessary downloads) are acceptable; false negatives (missing a ticker that's actually there) would be a correctness bug. At Parquet's default SBBF sizing (10 bits per distinct ticker, 7 hashes), the false positive rate lands around 0.83%FPR ≈ (1 − e−k·n/m)k. With m/n = 10 bits/key and k = 7 (classical optimum): (1 − e−0.7)7 ≈ 0.0083 → ~0.83%..
A row group that survives all three filters still might contain rows for tickers outside the universe, because min/max and bloom are group-granular, not row-granular. So one more filter runs after the decode, in memory: a ColumnFamily::filter_rows_by_asset_ids step that permutes each Arc<[T]> column through a kept-index list. It sits after the LRU publish, not before it, so a {SPY} backtest and a {QQQ} backtest on the same machine share one cached row-group buffer and each applies its own filter to the same Arc<Cols>. The cache-key space stays universe-agnostic.
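The kept-index pass can be sketched in two small functions (names are mine; the real step permutes shared Arc<[T]> columns rather than plain slices): compute the surviving row indices once from the asset-id column, then run every other column through the same list.

```rust
// Build the kept-index list once from the asset-id column.
fn kept_indices(asset_ids: &[u32], universe: &[u32]) -> Vec<usize> {
    asset_ids
        .iter()
        .enumerate()
        .filter_map(|(i, id)| if universe.contains(id) { Some(i) } else { None })
        .collect()
}

// Permute any column through that list. Because the cached buffer is
// shared, each backtest produces its own filtered view without
// mutating the published row group.
fn permute<T: Copy>(col: &[T], keep: &[usize]) -> Vec<T> {
    keep.iter().map(|&i| col[i]).collect()
}
```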
stock_intraday narrows to a lookup table for options DAY backtests
One last byte-reduction trick, narrower but meaningful. An options backtest at daily granularity uses stock_intraday only to resolve the underlying's close at decision time. It doesn't need open, high, low, or volume. Loading all five columns for the sole purpose of reading one is 40–60% wasted bytes.

When the backtest is options-only at DAY granularity, the reader narrows the projection to ticker, timestamp, and close. The other OHLCV columns are NaN-poisoned rather than decoded. MINUTE and HOUR backtests still read the full schema; the narrowing only kicks in where it's semantically safe. Same fingerprint, less bandwidth.
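The NaN-poisoning shape is simple enough to sketch (struct and function names are illustrative, not the engine's): only close is decoded, and the undecoded columns are filled with NaN so an accidental read is loudly wrong rather than silently stale.

```rust
// Narrowed projection for options-only DAY backtests: decode `close`,
// poison the columns that were never fetched.
struct StockIntradayCols {
    close: Vec<f64>,
    open: Vec<f64>,
    high: Vec<f64>,
    low: Vec<f64>,
    volume: Vec<f64>,
}

fn narrow_to_close(close: Vec<f64>) -> StockIntradayCols {
    let n = close.len();
    StockIntradayCols {
        close,
        open: vec![f64::NAN; n],
        high: vec![f64::NAN; n],
        low: vec![f64::NAN; n],
        volume: vec![f64::NAN; n],
    }
}
```

NaN as a poison value is a deliberate choice: it propagates through any arithmetic that mistakenly touches an undecoded column, so a bug shows up as NaN output instead of a plausible-looking number.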
Reducing work alone isn't enough. Even after pruning to two row groups, those two groups still have to come over the network. The next cluster is about making that network cost invisible.
The second lever: overlap the network with everything else
The bytes we do need still have to travel from Tigris to the worker. Two mechanisms keep that travel from showing up as wall time: fan the fetches out in parallel, and let the backtest start running before the full dataset has arrived.
Concurrent row-group fetches per shard
parquet-rs's ParquetRecordBatchStream reads row groups sequentially. It asks for group 1, waits, asks for group 2, waits. For a shard with 40 surviving row groups on a 50 ms RTT link, that's two seconds of pure round-trip time even if each individual fetch is free.

The reader fans the work out itself. After pruning, the surviving row groups for a shard are split into N chunks. N builders are constructed against the same Arc<ArrowReaderMetadata> (the footer is prefetched once and shared), and each builder streams its chunk on its own task. HTTP/2 multiplexes the concurrent requests onto a single connection to Tigris; the bucket sees the fan-out, the process sees N times less wall time. Decoded row groups land in the LRU in whatever order they arrive; the singleflight (covered in the next cluster) means a late arrival for a group another task already published is a cheap no-op. Default N is 4, tunable by env var.
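The split itself is mechanical. A sketch of one reasonable chunking scheme (ceil-divided chunk sizes; the function name is mine and the real reader may divide differently): surviving row-group indices go into at most N runs, order preserved, ready for one builder each.

```rust
// Divide the surviving row-group indices into at most n chunks,
// one per concurrent builder. Ceil division keeps chunks near-equal;
// short inputs can yield fewer than n chunks.
fn split_into_chunks(row_groups: &[usize], n: usize) -> Vec<Vec<usize>> {
    if row_groups.is_empty() {
        return Vec::new();
    }
    let n = n.clamp(1, row_groups.len());
    let chunk = (row_groups.len() + n - 1) / n; // ceil(len / n)
    row_groups.chunks(chunk).map(|c| c.to_vec()).collect()
}
```

Seven surviving groups at N = 4 become chunks of 2, 2, 2, and 1, each streamed on its own task against the shared footer metadata.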
Pipelined feeding: data loads while the backtest runs
A twelve-month intraday backtest reads data in timestamp order. The iterator consumes month 1 entirely before it touches a single row in month 2. So month 2 doesn't need to be fully decoded before exec starts; it just needs to be ready by the time the cursor crosses into its range. The architecture exploits that.
AppendableColumnarReader is the primitive. It implements the same IndexedReader trait the merge iterator already knows how to consume, but its internal state is a Mutex<ReaderState> + Condvar. When the iterator asks for a row past the currently-loaded tail, it parks on the condvar. When a background fetch task calls append_chunk, the condvar wakes. A 120-second watchdog fails loud if a producer dies.
```rust
// Consumer side: called by KWayMergeIterator, blocks until data lands
fn visit_point_at(&self, idx: usize, visitor: &mut impl Visitor<T>) {
    let mut state = self.state.lock().unwrap();
    loop {
        if let Some(block) = state.block_for(idx) {
            return block.visit(idx, visitor);
        }
        if state.is_complete {
            return; // producer said "no more"
        }
        if let Some(err) = &state.error {
            panic!(...);
        }
        let (g, t) = self.cond.wait_timeout(state, DEADLOCK_GUARD).unwrap();
        if t.timed_out() {
            panic!("appendable reader stuck >120s");
        }
        state = g;
    }
}

// Producer side: called from the tokio::spawn feeder task
pub fn append_chunk(&self, cols: C) { /* push + notify_all */ }
pub fn mark_complete(&self) { /* set flag + notify_all */ }
pub fn mark_error(&self, msg: String) { /* set err + notify_all */ }
```
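The park/wake mechanics above reduce to std's Mutex + Condvar. A self-contained miniature (my own simplified types, not the engine's: a Vec of rows stands in for ReaderState, and a short timeout stands in for the 120-second watchdog):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Minimal appendable reader: the consumer parks on the condvar until the
// producer has appended past the index it needs.
struct Appendable {
    state: Mutex<Vec<f64>>, // loaded rows so far
    cond: Condvar,
}

impl Appendable {
    // Blocks until row `idx` exists; panics if nothing lands in time,
    // mirroring the fail-loud watchdog.
    fn get(&self, idx: usize, deadline: Duration) -> f64 {
        let mut rows = self.state.lock().unwrap();
        while rows.len() <= idx {
            let (guard, timeout) = self.cond.wait_timeout(rows, deadline).unwrap();
            assert!(!timeout.timed_out(), "appendable reader stuck");
            rows = guard;
        }
        rows[idx]
    }

    // Producer: push a chunk, then wake every parked consumer.
    fn append(&self, chunk: &[f64]) {
        self.state.lock().unwrap().extend_from_slice(chunk);
        self.cond.notify_all();
    }
}
```

A consumer asking for row 2 before any data exists simply blocks; the producer's append wakes it and the read completes.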
The seed-then-spawn pattern is the orchestrator. The first month's fetch is awaited synchronously because the iterator needs something to emit. For every subsequent month, the reader wraps an empty AppendableColumnarReader whose min_timestamp is parsed directly from the "YYYY-MM" key string, so the MultiMonthReader's chronological sort is correct before any chunk has landed. Then it spawns the feeder and returns.
```rust
// seed: first month blocks, iterator needs it to start emitting ticks
let (seed_key, seed_cols) = fetch_one_key_cols_owned(keys[0], ...).await?;
let seed_reader = build_columnar_reader(seed_cols, ...);
let mut readers = vec![(seed_key, seed_reader)];

// pending: every other month. declare min_timestamp from the key string,
// wrap an empty appendable reader, spawn the feeder
for key in &keys[1..] {
    let min_ts = min_timestamp_from_key(key)?; // parses "2024-03" → ns
    let reader = AppendableColumnarReader::new_empty(min_ts);
    let handle = reader.clone_handle();
    tokio::spawn(async move {
        match fetch_one_key_cols_owned(key, ...).await {
            Ok((_, cols)) => {
                handle.append_chunk(cols);
                handle.mark_complete();
            }
            Err(e) => handle.mark_error(e.to_string()),
        }
    });
    readers.push((key.clone(), AnyReader::AppendableColumnar(reader)));
}

MultiMonthReader::from_sorted(readers) // sorts on declared min_timestamp
```
The wait function changes shape. Instead of Σ(per-month), the backtest waits max(yearly_types, seed_month) plus whatever the iterator races against as it walks forward in time. On a twelve-month intraday bench that's roughly twenty seconds of serial decode collapsed to one or two seconds of seed wait, with everything else hidden under exec. Options pipelining works the same way, with NBBO enrichment moved inside each per-month task so there's no cross-key post-decode barrier.
Cold is the first backtest on a fresh worker. Everything after that should feel free. The next cluster is about why.
The third lever: make every backtest after the first instant
The cold-path tricks above still pay one network round-trip per row group the first time. The process-global cache is what makes the second backtest to touch the same data read it from RAM. Two pieces work together: an LRU that holds decoded columns across requests, and a prewarm pass that fills it at server startup for the types every backtest needs.
The process-global LRU is the warm-path secret
Decoded row groups live in a process-global LRU keyed by (path, row_group_idx) with a byte budget. A given row group is fetched, decoded, and sort-deduped exactly once per worker lifetime, not per backtest. The second backtest to touch SPY March 5th reads from memory.

Two backtests can ask for the same row group at the same time. Without coordination they'd both fetch it, both decode it, and both race to publish into the cache. The cache uses a singleflight pattern: a get_or_init call whose first caller actually does the work while every concurrent caller parks on the same future. N concurrent backtests on a cold cache pay exactly one fetch per row group, not N.
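The singleflight shape can be sketched with std alone (type and method names are mine; the real cache is async and also enforces a byte budget with LRU eviction): one shared OnceLock per cache key, so the first caller's fetch runs exactly once and every later or concurrent caller gets the published Arc.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Singleflight cache keyed by (path, row_group_idx): concurrent callers
// for the same key share one OnceLock, so the fetch closure runs once.
struct SingleflightCache<V> {
    slots: Mutex<HashMap<(String, usize), Arc<OnceLock<Arc<V>>>>>,
}

impl<V> SingleflightCache<V> {
    fn new() -> Self {
        Self { slots: Mutex::new(HashMap::new()) }
    }

    fn get_or_fetch(&self, path: &str, row_group: usize, fetch: impl FnOnce() -> V) -> Arc<V> {
        // Grab (or create) the slot under the map lock, then release the
        // lock so the fetch itself never blocks unrelated keys.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            slots
                .entry((path.to_string(), row_group))
                .or_insert_with(|| Arc::new(OnceLock::new()))
                .clone()
        };
        // First caller runs `fetch`; everyone else parks until it lands.
        slot.get_or_init(|| Arc::new(fetch())).clone()
    }
}
```

OnceLock guarantees the init closure executes at most once even under contention, which is exactly the "N backtests, one fetch" property.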
Prewarm cache: daily types never touch network
Eight data types are small enough to fit comfortably in memory and are touched by almost every backtest: daily_ohlc, dividends, reports, economic, index_signals, financials, earnings, and crypto_daily. The worker loads all eight into the columnar cache at server startup, before it accepts any backtest traffic.
The upshot for most users: a daily backtest (stock or crypto, which together make up the vast majority of what Aurora generates) touches exactly these eight prewarmed types. First backtest on a freshly-booted worker runs at RAM speed, zero network, same as the hundredth. Only options and intraday backtests (stock intraday and crypto intraday) have to fetch anything from Tigris on cold. Even those see their second run at LRU pointer-bump speed, because the process-global LRU keeps decoded row groups around once they're touched. The cold-path trade-off documented later in this post is real, but it affects the minority of workloads, not the majority.
Three levers, one obvious problem: this whole stack now depends on Tigris being reachable. What happens when it isn't?
The fourth lever: survive the hiccups
Tigris is reliable but not perfect. Cold-path correctness now depends on the bucket being reachable every time a row group hasn't been cached yet. The retry stack is what catches transient failures before they turn into a user-visible backtest error.
Four layers of retry, each catching something different
The hiccups to survive: connection resets, stale connection-pool entries, the occasional 5xx, a chunk that silently stops streaming. The retry stack has four layers because one global retry knob can't catch everything without also hiding real failures.
- object_store internal retries. The client itself retries 5xxs and dropped connections with exponential backoff. This is the default HTTP-level resilience; we don't configure it beyond defaults.
- with_retry at the chunk level. Wraps a single chunk fetch (one builder streaming its slice of the row groups). Two attempts, 0 ms → 1 s backoff. Catches a stale connection-pool entry that object_store's own retry missed.
- with_feeder_retry at the whole-key level. Wraps the entire per-month feed. Two attempts, 5 seconds apart. Catches the case where every chunk for a shard got a stale-pool error at once. Rare, but the failure mode when it happens is "month 2 never loads" and you want one more swing at the whole thing.
- APPEND_WAIT_TIMEOUT consumer watchdog. The backstop. If the iterator has been parked on an appendable reader's condvar for 120 seconds with no chunks landing, it panics. Not a retry; a loud fail. A genuinely stuck feeder should crash the task, not quietly return partial data.
The layering matters. Layer 1 catches network blips. Layer 2 catches one stale pool entry. Layer 3 catches a gust of stale pool entries. Layer 4 catches code bugs and environments we haven't seen yet. A failure that makes it past all four is something we genuinely want to see as an error in the logs.
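The attempt-budget-plus-backoff shape shared by layers 2 and 3 is easy to sketch. A synchronous stand-in (the real with_retry wraps an async fetch; the function name here mirrors the text but the signature is mine): one slot in the backoff schedule per retry, first success wins, last error surfaces.

```rust
use std::thread::sleep;
use std::time::Duration;

// Retry an operation with a fixed backoff schedule: `backoffs` holds the
// delay before each retry, so the total attempt count is backoffs.len() + 1.
fn with_retry<T, E>(
    backoffs: &[Duration],
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for attempt in 0..=backoffs.len() {
        if attempt > 0 {
            sleep(backoffs[attempt - 1]); // back off before retrying
        }
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("at least one attempt always runs"))
}
```

The chunk-level layer in the text is this with a one-entry schedule (0 ms first try, 1 s before the second); the feeder-level layer is the same shape with a 5 s entry around a bigger unit of work.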
What did those four choices cost me, and what did they buy?
What the old architecture still did better
mmap was not wrong. For seven months it served every backtest at RAM speed off local disk, with zero network dependency and zero runtime variance. That is a real property. Saying streaming Parquet replaced it without saying what mmap gave up would be dishonest.
- Options and intraday first-run is slower. Daily stock and crypto backtests are unaffected because their data is prewarmed into the LRU at server boot. But for options and intraday workloads (which must fetch from Tigris on cold), the first backtest on a ready worker is measurably slower now. On my dev laptop (residential WAN), mmap ran the first options backtest in ~50s; streaming takes ~137s because it has to fetch and decompress Parquet pages before anything runs. Every subsequent backtest recovers this with interest (4× faster warm) but the cold loss is real on these specific workloads. Flamegraph data shows ZSTD decompression alone is ~25% of streaming-cold CPU, which means Fly won't rescue this the way I originally speculated: shared-tenant VMs typically have slower per-core CPU than a modern dev laptop, so the decode cost likely grows, not shrinks, in production.
- No runtime network dependency. Tigris could go down entirely and backtests kept running. Streaming needs Tigris reachable every cold fetch.
- Predictable latency. Local SSD access doesn't vary with WAN conditions or object-store tail latency.
- No cache eviction surprises. The OS page cache is one of the best-tuned caching systems ever shipped. The streaming LRU is custom code that can have bugs.
And the streaming side of the same ledger:
- No bootstrap. New worker serves its first backtest in seconds, not hours. Horizontal scale is immediate.
- No duplicated storage. One copy on Tigris, not seven across backtest + optimizer + livetrade. $230/month recurring disk cost drops to $34 (the L2 volume on the primary), 7× lower and flat instead of sloped with concurrency.
- Scale-to-zero on disk. Fly volumes bill even when the machine is stopped. Tigris-as-source-of-truth has no idle cost per-worker.
- Columnar decoder makes execution 4× faster. Not a streaming-vs-mmap thing per se, but it shipped alongside. The warm-path backtest is faster end-to-end than mmap ever was.
- No risk of format bit-rot. Every reader reads the same canonical Parquet bytes from Tigris. There's no worker-local binary format that can drift out of sync with the code. Adding a column is a non-event; the mmap path had to version its .bin files and handle migrations carefully every time the schema changed.
Net for my operational model: spiky traffic, a small number of concurrent users per worker, marginal warm-path latency differences are acceptable, and $230/month of duplicated storage isn't. Streaming wins. For a system that needed sub-millisecond cold-start determinism or ran without network access, mmap would still be the right answer.
Writing this section made me spot the fix. The options and intraday cold-path loss read worse on the page than in the benchmark. That was the bullet I couldn't shrug off, so I flipped the roles for one subsystem.
The key property: it's completely optional. A single env var (STREAMING_DISK_LRU_PATH) gates the whole tier; unset means pure streaming with zero overhead. I turned it on for exactly one machine: the always-on primary with a 225 GB Fly volume. Every other worker (burst reserves, dev laptops, CI) keeps scale-to-zero intact. A micro-tweak, not a core architectural change.
Inside that machine: a 200 GB LRU between RAM and Tigris. My first instinct for the on-disk format was raw binary or bincode. Opus pushed back and suggested Arrow IPC (Feather v2), which is what shipped. A disk hit mmap's the file and Arc-wraps the columns with no ZSTD and no Arrow decode, RAM-tier latency off local NVMe. Schema validation falls out of the format for free; a format-version byte in the header is the global invalidation hammer. Freshness is the Parquet ETag recorded per file, revalidated with a HEAD on every hit.
If the shape sounds familiar, it's a demand-populated, LRU-bounded version of the old mmap architecture. Same decoded-columns-on-local-disk idea, without the 2-hour bootstrap or the 200 GB per-worker duplication. ~$34/month on Fly: 7× lower than the $230 the old seven-worker design was paying for duplicated disk.
I spotted the trade-off and made the call. Opus drove the implementation, same as everywhere else. That's the part of the trade-off I didn't accept, and the closest I came to architecting anything this weekend.
Now: with the trade-offs on the table, here are the receipts.
What the move actually bought
Two wins, two separate mechanisms. The columnar decoder replaced the old 80-byte row struct, so each filter step now pulls one column into cache instead of the full row. That's where the 4× warm-backtest speedup comes from. The streaming architecture replaced the local 200 GB .bin cache with Tigris-as-source-of-truth, which is what cut the $230/month duplicated-disk bill down to $34, deleted the bootstrap, and let workers scale to zero. That $34 is one optional L2 volume on the always-on primary — every other worker stays at $0, and the code falls through cleanly on any machine without the volume mounted. Both shipped together over the weekend, but the speedups and the savings aren't coming from the same piece.
There's a third, subtler win on top: the process-global LRU turns every second-and-later backtest's data load into pointer bumps, so Aurora sessions that hit the same universe stay fast across variants.
SPY minute-level bull call spread, 11 business days (2025-10-13 → 2025-10-24), measured on a dev laptop against Tigris over residential WAN. Mmap numbers come from the last pre-streaming commit; streaming numbers from current HEAD with BENCH_RUNS=3 in-process. I haven't benched on Fly, so every number here is dev-laptop only. Treat the shape as the signal, not the absolute times.
Architecture                     Data load   Backtest exec   Total
──────────────────────────────────────────────────────────────────
mmap (warm OS page cache)        0.33 s      41.1 s          41.5 s
Streaming Parquet (warm LRU)     0.29 s      10.3 s          10.6 s
Same window, same SPY universe, minute-level bull-call-spread on both architectures (streaming run adds an RSI gate for entry, which is additive execution work). Load cost is basically identical once caches are warm. The 4× end-to-end win is the columnar decoder's execution speedup. Bit-for-bit fingerprint parity was gated during the migration via a broader regression suite, not re-verified in this post-hoc bench, since the two benches invoke slightly different strategy stacks.
Architecture                          Cold first-run total
──────────────────────────────────────────────────────────
mmap (ready worker, OS cache cold)    ~50 s
Streaming Parquet (residential WAN)   ~137 s
Streaming Parquet (Fly)               unmeasured
"Cold" here means a worker that's already up and ready, running its first backtest. On residential WAN, streaming cold is slower than mmap cold because the first backtest pays full Tigris round-trips per shard and then decompresses every page. Flamegraphs show ZSTD decompression alone accounts for roughly 25% of streaming-cold CPU time, so this isn't a pure network cost: a Fly VM colocated with Tigris cuts the network portion but still pays the same decompression bill, on what's typically slower per-core CPU than a dev laptop. I haven't measured Fly end-to-end, so the honest answer is I don't know the production number. What I do know: residential-mmap beats residential-streaming on first-backtest, and the gap on Fly is probably meaningful too. The ~2 hour GCS→.bin hydration mmap needed before a worker was ever "ready" is a separate thing, handled by the no-bootstrap property, not by cold-path speed.
Aurora (our AI agent) generates several backtests against the same universe in quick succession: "try it with RSI(20), now RSI(30), now swap the bull call for a bull put spread." Under mmap each variant paid the same warm-OS-cache cost, and execution was bound by the 80-byte row struct. Under streaming, the first variant pays the network cost once; every variant after that reads decoded columns from the process-global LRU at pointer-bump speed, and the columnar decoder makes execution 4× faster too. Daily OHLC is prewarmed at server boot, so it's free on every backtest regardless. Net: Aurora's typical session is dominated by backtests that hit cached data, and those backtests are meaningfully faster than mmap ever was.
* Trades, pct-change, win rate, max drawdown, profit factor, and the displayed four-digit Sharpe are bit-identical to the mmap baseline. The hashed sharpe_ratio and sortino_ratio each shift by 1 ULP because SymbolRegistry's intern order changed under the new filter code and floating-point accumulation isn't associative. Documented in manager.rs.
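The 1-ULP shift is ordinary floating-point behavior, not a bug: addition isn't associative, so accumulating the same values in a different intern order can move the last bit of the sum. A two-line demonstration:

```rust
// Left-to-right accumulation over a slice, like a ratio's running sum.
fn sum_in_order(xs: &[f64]) -> f64 {
    xs.iter().fold(0.0, |acc, x| acc + x)
}
```

Summing {0.1, 0.2, 0.3} forward gives 0.6000000000000001; summing the same three values in reverse gives 0.6. Same inputs, different order, one ULP apart, which is exactly the failure mode a reordered SymbolRegistry exposes.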
4× faster warm, 2-hour bootstrap gone, disk bill 7× lower and flat with concurrency. That's the engineering ledger. The other one worth talking about is the bill for the AI that shipped it.
The whole migration ran inside a $200/month subscription.
I'm on Claude Code's $200-a-month plan. The whole migration (streaming rewrite, columnar decoder, writer-side clustering, cold-path work, pipelined feeders, retry stack, review pass) ran inside that subscription. Prompt caching does heavy lifting on a long session in one repo because file contents get reread across tool calls, so a good chunk of the input tokens landed at the 90%-off cached rate. Zero marginal cost. The cost of the model was not a variable I had to think about.
Now flip to the infra side of the ledger. The Render era was $275 a month for one Pro Max + disk, $550 for two, capped at two because Render disks don't horizontally scale. Fly broke that ceiling, but every additional worker brought another 200 GB volume with it. More traffic meant more disks meant more duplicated bytes. The line was sloped wrong.
Tigris replaces that line at $0.02 per GB-month with zero egress fees. The ZSTD-compressed Parquet dataset is a fraction of the 200 GB decoded footprint, so the storage line comes out to single-digit dollars a month. Request fees are $0.0005 per 1,000 GETs and $0.005 per 1,000 PUTs, which round to fractions of a cent at backtest volume. One bucket serves every worker, every role, every region. Adding the tenth backtest machine adds zero to this line. That's a flat cost curve instead of a sloped one.
This migration deletes six of the seven duplicated disks (the seventh stays as the primary's L2), deletes the bootstrap, and makes scale-to-zero actually cost zero on every replica. A stopped reserve worker now bills nothing. I could provision a fleet of a thousand reserve backtest machines, keep them all stopped, and pay zero dollars until one wakes up to serve a request. The one disk that stays costs $34 a month and doesn't scale with traffic.
The cost story is the boring answer. The more interesting one is who actually built this.
Opus 4.7 didn't help me build this. It built it.
I started the weekend skeptical. I've been building this engine since 2020. I know every line of the hot path. Every condition, every indicator, every corner of how the iterator merges data types. So when I say Opus 4.7 built this migration, I mean it literally.
A week ago I didn't know what a Bloom filter was. I couldn't have told you how Parquet stored row-group metadata in its footer, or that you can fetch a single row group's byte-range without downloading the whole shard. I knew Parquet was columnar. That was the extent of my architectural understanding of the format I now run production on.
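For anyone else meeting Bloom filters for the first time, here's the minimal version of the idea. This is a concept demo only: Parquet's real implementation uses split-block Bloom filters with xxHash, and everything below (names, two-field struct, seeded DefaultHasher) is my sketch:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A Bloom filter answers "might this set contain X?" with no false
// negatives: `false` means definitely absent, `true` means possibly present.
struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Bloom { bits: vec![false; num_bits], hashes }
    }

    // Derive one bit position per (seed, item) pair.
    fn index(&self, item: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, item: &str) {
        for seed in 0..self.hashes {
            let i = self.index(item, seed);
            self.bits[i] = true;
        }
    }

    fn may_contain(&self, item: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(item, seed)])
    }
}
```

The pruning payoff: if a shard's filter says a symbol is definitely absent, you skip that shard without fetching a byte of it.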
Opus 4.7 knew it all. It picked the three pruning layers. It designed the N-way concurrent chunk fetcher with one shared ArrowReaderMetadata. It wrote the condvar-backed AppendableColumnarReader that I could barely explain on a whiteboard. It chose the four retry layers and what each one catches. It got bit-for-bit fingerprint parity on the first full commit and held it across every subsequent one. I checked in every thirty minutes. The diffs kept getting better.
With the mmap architecture, I drove. I did the research on what the header files needed to look like. I figured out the file locks. I understood why certain operations would block the main Tokio runtime and how to route around them. Claude Opus 4.1 and Gemini 2.5 Pro helped me build it, but every architectural decision ran through me first.
With this new architecture, Opus drove. I reviewed. That's the difference between "an AI helped me code this" and "a model shipped a load-bearing migration I don't fully understand yet." My architectural knowledge of the new system is maybe 65%. The system works. The fingerprint holds. The bench times check out.
I've been writing Rust since 2021. I wouldn't have designed the new hot path this way. The pipelined feeder, the three pruning layers, the way the retry stack nests. Pick any of them. The shape Opus landed on was better than the shape I'd have drawn on a whiteboard. That's the part I haven't stopped thinking about.
I spent five years building the architecture I was proud of. This weekend I reviewed a better one. I didn't write any of the code.
65% isn't defensive. Engineering managers don't understand 100% of the code their teams ship. Senior engineers don't understand 100% of the libraries they import. Shipping code you don't fully understand is how software gets built above a certain scale. What changes with AI-driven architecture is that the unit of "code you didn't write" grows from a library call to a whole subsystem. And the unit can keep growing.
What I actually focused on during the migration, in rough order of attention:
- Architectural patterns. I read every diff. I couldn't have designed the AppendableColumnarReader from scratch, but I could tell whether a given commit was solving the problem in a reasonable shape. The "what" is comprehensible even when the "why" isn't always.
- Tests. Unit tests for every new module. The regression fingerprint as the correctness gate across the whole migration. Bench numbers as the performance gate. All three had to hold to ship.
- Break-glass. Mmap and streaming ran side by side behind a BACKTEST_DATA_BACKEND feature flag for the duration of the weekend. If streaming broke, one env var flipped every worker back to mmap in the next deploy. Only after the fingerprint held across enough commits did I actually delete the mmap path. That was the last commit of the weekend.
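The break-glass switch is about as small as a safety mechanism gets. A sketch, where the env var name comes from the setup above but the enum and the default-to-streaming choice are my assumptions:

```rust
// Sketch of the break-glass backend switch. One env var, read at
// worker startup, decides which data path serves backtests.
#[derive(Debug, PartialEq)]
enum DataBackend {
    Mmap,
    Streaming,
}

fn backend_from_flag(flag: Option<&str>) -> DataBackend {
    match flag {
        Some("mmap") => DataBackend::Mmap, // the escape hatch
        _ => DataBackend::Streaming,       // assumed default
    }
}

// At worker startup (illustrative wiring):
// let backend = backend_from_flag(
//     std::env::var("BACKTEST_DATA_BACKEND").ok().as_deref(),
// );
```

Rolling back is then a redeploy with one env var set, with no code change in the critical path.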
If you're considering something like this, the workflow that worked for me is simple: write the fingerprint test first. Use Plan mode to draft the architecture, then auto-accept mode to let Claude run, then a manual review pass at the end.
Shift+Tab in Claude Code cycles through the permission modes. Keep the classifier on. Never use --dangerously-skip-permissions. The difference between useful autonomy and rm -rf'ing your home directory is one flag.
One more thing worth emphasizing. The fingerprint made the weekend possible, but it's not what keeps the system running afterward. That's a different stack. Pagers at the 120-second AppendableReader watchdog. Alerts when the four-layer retry stack exhausts on a shard path, when disk-LRU ETag mismatches spike (reclusterer misbehaving), and when disk-LRU IO errors show up (volume degraded). Dashboards watch RAM-LRU eviction rate against the 1 GB per-type budget so a traffic spike surfaces as a budget-tuning signal instead of as thrashing. None of that is code Opus wrote during the migration. It's what I rely on to catch the problems the fingerprint can't.
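A heartbeat watchdog like the 120-second one is a few dozen lines. This is my illustrative version, not the engine's code, and the timeout in the comment is the only number taken from the post:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Sketch of a heartbeat watchdog. The reader beats on progress; a
// monitor thread polls `expired` and pages if the gap exceeds the
// timeout (120 seconds in the real system).
struct Watchdog {
    last_beat: Mutex<Instant>,
    timeout: Duration,
}

impl Watchdog {
    fn new(timeout: Duration) -> Self {
        Watchdog { last_beat: Mutex::new(Instant::now()), timeout }
    }

    /// Called by the reader each time it makes progress.
    fn beat(&self) {
        *self.last_beat.lock().unwrap() = Instant::now();
    }

    /// Polled by a monitor thread; `true` means "page somebody".
    fn expired(&self) -> bool {
        self.last_beat.lock().unwrap().elapsed() > self.timeout
    }
}
```

The value of the pattern is that it catches the failure mode the fingerprint can't: a reader that is wedged rather than wrong.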
If you want to see what this migration actually made faster, try Aurora at nexustrade.io/agent. If you want to see how the trading engine's AI layer works end-to-end, the AI Agents from Scratch series walks through it.
The engine is still live. I'm still answering support emails. I still have to read every change.
A week ago I didn't know what a Bloom filter was. This weekend I shipped one into production. One small subcomponent in a 600,000 LOC passion project.
I wonder when I'll ship a feature that's a complete black box.
For now.
If you want the primary sources behind this post:
- Anthropic: Introducing Claude Opus 4.7
- Claude Code permission modes (Plan, Auto-accept, Normal)
- Apache Parquet file format spec (row groups, statistics, projection)
- Tigris object storage (the S3-compatible backend behind the cold path)
If you want to use or learn what this engine powers:
- Aurora, the AI trading agent that runs on top of this backtest engine
- Algorithmic Trading Fundamentals, my course on building and backtesting strategies from scratch
- AI Agents from Scratch, my series on how the agent layer works end-to-end
- NexusTrade, the platform that puts it all together