NexusTrade · Infrastructure · April 2026

I used to be a better software engineer than AI. Claude Opus 4.7 changed that.

I'd made the backtest engine as good as I could make it. One weekend with Opus 4.7 cut the disk bill 7×, flattened its scaling curve, and made warm backtests 4× faster.

Austin Starks ✦ Founder, NexusTrade ✦ April 19, 2026 ✦ 20 min read

I was quite proud of my backtesting system.

Lightning fast. Deterministic to the bit. And expensive.

Very expensive. At peak, over $1,000 a month across compute and disk, with $230 of that going to duplicated market history on rent even when the machines were idle. Every machine that served a backtest carried its own private 200 GB copy: prices, options chains, fundamentals. Two machines meant 200 GB of identical bytes billed twice. It was the best design I could build with Claude Opus 4.1 and Gemini 2.5 Pro, over about a week of focused work last September.

Then, last week, Anthropic released Opus 4.7 at the same $5/$25 per million token price as Opus 4.6, with 90% off warm reads via prompt caching, available on the API, Bedrock, Vertex, and Copilot on day one. The numbers:

SWE-Bench Pro · April 2026
  Opus 4.7         64.3%
  GPT-5.4          57.7%
  Gemini 3.1 Pro   54.2%
  Opus 4.6         53.4%
Scores reported in Anthropic's Opus 4.7 announcement, April 16, 2026. SWE-bench Verified tells a similar story (87.6% on Opus 4.7, up from 80.8% on 4.6).

Ten points over Opus 4.6 in six months. Seven-point lead over GPT-5.4. Anthropic also reports 3× the production-engineering task resolution rate of Opus 4.6 on Rakuten's internal SWE-bench. Benchmarks only go so far.

What I actually cared about was whether it could implement a large, complicated feature end-to-end. I wrote out the plan: the target architecture, the fingerprint test as the correctness gate, the data types to migrate, the staged cutover. Then I handed it to Opus 4.7 and stepped back.

Every thirty minutes or so I'd check in: diffs looked right, the fingerprint held, commits read coherently. Otherwise I trusted the automated code reviews and my own review pass at the end to catch anything I missed.

Unlike the week I spent driving the mmap build as the lead architect, this time I mostly let it cook.

Every load-bearing architectural decision came from Opus: the three pruning layers, the concurrent chunk fetcher sharing a single prefetched footer's metadata, the condvar-backed appendable reader, the four nested retry layers. I read the diffs and ran the fingerprint. That's the difference this time.

And honestly, I was shocked.

One caveat worth knowing
Opus 4.7 ships with a new tokenizer. Identical prose can use up to 35% more tokens than on Opus 4.6. Per-token price is unchanged, but per-session bills can drift up. Worth budgeting for if you're migrating.

The NexusTrade backtest engine, briefly

NexusTrade is the AI trading platform I've been building since 2020. Describe a trading idea in plain English. Aurora (our AI agent) creates a full algorithmic trading strategy: entry rules, exit rules, position sizing, options legs when relevant. The backtest engine runs that portfolio over real market history and reports what would have happened. Determinism is non-negotiable: identical inputs must produce identical outputs, bit-for-bit. Sub-second backtests feel magical; ten-second ones feel broken.

The magic is the goal.

What Opus 4.7 had to preserve
  • Data types: 13 (stocks, options, fundamentals, more)
  • Data resolution: 1-min bars + full options chains
  • Engine language: Rust (k-way merge over storage)
  • Determinism: bit-for-bit (identical inputs → identical outputs)

That engine is what I handed Opus 4.7 this weekend. By Sunday night the architecture I'd been running since last September was gone.

200 GB of data on every worker, billed every time

For about seven months, the engine ran an mmap architecture: a Rust hydrator downloaded Parquet shards from GCS on startup, decoded them into consolidated .bin files on a local persistent disk, and the backtest process memory-mapped those files at request time. The hot path was fast because, after the first read of each file, the OS kept those bytes in RAM, so every subsequent read hit memory, not disk. Every worker carried the same 200 GB of market history.

The deployment fought back at every stage. Live trading had to be split off on day one (a long-running backtest can't share a CPU with a process that needs to place real orders in milliseconds), so two machines with their own 200 GB copies was the starting point. Then optimizations started hogging the CPU on the backtest box, so optimization moved onto its own Render machine. As user load grew, I had to stand up additional backtest Render instances by hand, each with its own 200 GB disk, because Render persistent disks can't be shared or auto-scaled.

By the time I'd moved compute to Fly, the fleet was three Fly backtest workers (one always-on primary + two reserves), three Fly optimizer reserves (no primary: optimization was never on the hot path for user traffic), and one Render live-trading instance. Seven machines, each with its own 200 GB disk: 1.4 TB of duplicated market history on rent. At Fly's $0.15/GB/month for volumes and Render's $0.25/GB/month for SSDs, that came out to $230/month purely for duplicated disk. Scale-to-zero saved nothing on that line: volumes bill whether the machine is running or not.

The bill got untenable. I eventually dropped the always-on backtest primary and one optimizer reserve just to survive, trading peak throughput for a cost I could stomach. That's where the architecture sat when Opus 4.7 dropped.

How mmap served a backtest

Once per worker boot (~2 hours): GCS Parquet, the shared source of truth → Rust hydrator downloads and consolidates → .bin cache files on 200 GB of local disk. Once hydrated, the worker enters hot-loop mode. Per backtest request (zero network): backtest request + window → mmap() → the process reads mapped pages → KWayMergeIterator merges by timestamp → backtest result. The OS page cache keeps pages warm after first access, and they stay warm across backtests.
mmap's contract: pay hours up front (amber), then serve every backtest at RAM-speed from local disk (green). No network on the hot path. This is what mmap did well.

Why that stopped working

Four infra stages tried to scale the design without dropping the "fast hot path" property. Every stage ran into the same wall: the 200 GB footprint was per-machine, and machines were multiplying.

  • Stage 1 · Render · day one
    400 GB
    1 Render box for backtest + optimization, 1 for live trading. Optimizations hogged CPU and starved user backtests.
  • Stage 2 · Render · split + scaled
    800 GB
    Optimization gets its own box. To handle more users I manually stood up additional backtest instances, each with its own 200 GB disk (Render persistent disks can't be shared or auto-scaled). 4 Render instances, same data × 4.
  • Stage 3 · Fly + mmap · peak
    1.4 TB
    3 Fly backtest (1 primary + 2 reserves), 3 Fly optimizer reserves, 1 Render live-trade. Seven 200 GB disks, each one its own copy of the same data.
  • Stage 4 · Fly + mmap · cost cuts
    1.0 TB
    Dropped the always-on backtest primary and one optimizer reserve to survive Fly's volume bill. Worse throughput at peak load, survivable cost.
  • Stage 5 · Today
    0 GB
    Stateless workers. Scale-to-zero on compute AND disk. One shared source on Tigris, streamed on demand.
Stage 3 · Fly.io · three workers · one shared source. Parquet on GCS is the single shared copy (prices, options, fundamentals); each worker spends ~2 hours hydrating from it. Worker 1 (always-on) and workers 2 and 3 (reserves) each carry a 200 GB disk of .bin files, an mmap-backed Rust backtest engine, and a KWayMergeIterator. The same 200 GB sits on every machine (600 GB billed), and adding a worker means another full disk plus another ~2-hour hydration. Fly volumes pin each machine to a physical host: no scale-to-zero, no fast horizontal expansion.
Three machines, three disks, the same 200 GB on each. The scaling math got worse every time I added a worker.

So how did Opus 4.7 replace it? Start with the shape.

One shared data layer, no required local disk

Every Fly worker now boots with no required persistent state. Tigris (S3-compatible, Fly-hosted) is the source of truth for Parquet shards. No mandatory binary cache on disk, no hydration at startup. All thirteen data types (stock intraday, daily OHLC, options intraday, options daily, fundamentals, dividends, splits, economic, index signals, reports, financials, earnings, crypto) go through the streaming path.

Warm data loads in pointer-bump time. A new worker serves its first backtest in seconds, not hours of hydration. One always-on primary carries an optional disk-LRU tier for second-touch wins on options and intraday workloads (covered in the postscript to the trade-offs section below); every other worker runs with zero local disk. But how is any of that possible?

The new architecture, end to end

Here's the full path of a cold backtest on the streaming stack: a fresh worker that has never served a request, no local disk, nothing cached. The walkthrough below follows the request from "strategy + time window" all the way to "backtest result."

Step 1 · Backtest request arrives

The engine receives a request: strategy, symbols, and a time window (start_ts, end_ts). This window determines which data needs to be fetched. Nothing is on disk yet.

Backtest request + window → Tigris ListObjects → Parquet footer fetch → row-group filter on ts_min/ts_max → LRU check on (path, rg_idx). On a hit: clone the Arc. On a miss: check the disk L2 (Arrow IPC, mmap'd; an ETag-validated disk hit skips the decode), else HTTP range fetch → decode Arrow into columns → publish Arc<Cols> into the LRU. Either way, the columns feed the ColumnarStreamingReader → KWayMergeIterator → backtest result. The disk L2 is the one tier I designed myself; the postscript covers it.

That's the operational walkthrough. To understand why each of those steps is fast enough to add up to a sub-second warm backtest, four levers do the work: cut bytes, hide the network, keep the warm path instant, survive failures.

The first lever: don't fetch bytes you don't need

Four sections of deep mechanics follow. If that's not what you're here for, skip ahead to the takeaway.

A single minute of the SPY options chain in 2025 is a few hundred thousand contracts. A month is tens of millions. The reader that tries to download all of it for a four-day backtest has already lost. Three mechanisms attack the same problem from three directions: change the storage layout so filters read 80× less data, prune whole row groups before they leave Tigris, and narrow the columns on data types where the backtest only needs one.

Why the decoder stores columns, not rows

The biggest single performance win in the new decoder: every field gets its own tight buffer instead of 80-byte interleaved row structs. The options resolver's filter chains (find all calls in a DTE window with strike near $450) touch one column at a time. In the old row-form layout a single 1-byte option_type check pulled 80 bytes per row into cache. In columnar, it pulls one. Same data, 80× less memory bandwidth.

Row form (Vec<OnDiskOptionPoint>, 80 bytes per row): ts (8 B), id (4), strike (4), exp (2), type (1), bid (4), ask (4), delta (4), iv (4), plus 11 more interleaved fields. Filtering by option_type (1 byte) loads 80 bytes × N rows: 10,000 contracts × 80 bytes = 800 KB into cache to check 10 KB of data.

Columnar form (OptionColumns, struct-of-arrays): timestamp_nanos: Arc<[i64]>, strike_cents: Arc<[u32]>, option_type: Arc<[u8]> (1 byte per row, tightly packed), delta: Arc<[f32]>, implied_volatility: Arc<[f32]>, plus 15 more field buffers. Filtering by option_type loads 1 byte × N rows: 10,000 contracts × 1 byte = 10 KB, 80× less memory bandwidth.
The same 10,000 contracts. Row form loads 800 KB to check a 1-byte field. Columnar form loads 10 KB. This is where the 38–50% speed improvement comes from.
What stayed identical
Backtest statistics hash to the same fingerprint on both paths, bit-for-bit, on every one of the five regression benches. Public APIs (MarketData, the iterator surface, the options resolver's call signature) are unchanged, so nothing upstream of the data layer had to know the migration happened.

Three layers of row-group pruning

A Parquet shard on Tigris is split into dozens of row groups, each a few thousand rows. The reader's job is to avoid downloading the ones it doesn't need. Three independent filters run before any bytes leave the bucket, and each catches a different access pattern the others miss.

Row groups touched · 4-day SPY backtest · 1-month options shard. Full shard: ~400 row groups (≈120 MB compressed). After the timestamp min/max window (only row groups overlapping the 4-day range): ~20. After symbol min/max (only row groups whose ticker range contains SPY; needs the writer's ORDER BY): ~2. After the downloaded bloom filter confirms SPY is actually present: 2. One more filter runs post-decode, in memory, to drop rows for tickers outside the universe that leaked through min/max.
Each layer catches a different access pattern. Timestamp handles "narrow window of wide data." Symbol min/max handles "one ticker from a sorted shard." Bloom handles "narrow non-contiguous universe like {AAPL, MSFT} in a shard whose range is A..Z." The post-decode row-level filter is what stays universe-agnostic in the cache.

Timestamp min/max. Every row group's footer records the min and max timestamp it contains. A backtest on March 5th skips every row group whose range doesn't overlap the day. This is the coarse filter; it's what makes "give me one day out of a ten-year shard" free.

Symbol min/max. The writer emits shards with ORDER BY (ticker, date) for daily types and ORDER BY (underlying, timestamp) for intraday options. That ordering turns the per-row-group ticker min/max into a real filter: a {SPY} backtest reads the row groups whose ticker range touches SPY and skips the rest. Without the writer's ORDER BY, every row group's min/max straddles the full alphabet and the filter is a no-op. The reader and the writer have to agree for this to work.

Bloom filter. Min/max is a range check: it admits row groups whose range contains the ticker you asked for even if that specific ticker never appears in the group. A Bloom filter in the row-group metadata answers the precise question "could this ticker be here?" with false positives but no false negatives. For narrow universes on fat shards, this is the layer that drops the last unnecessary download.

How a bloom filter answers "could this ticker be here?"
1 · Write · set bits for each ticker

The writer adds AAPL, MSFT, NVDA. Each ticker hashes to K=3 bit positions. Seven bits end up set (AAPL and MSFT both hash to 9, AAPL and NVDA both hash to 14, so the positions overlap).

AAPL → bits {3, 9, 14} · MSFT → bits {2, 9, 12} · NVDA → bits {1, 5, 14}

bit:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
value: 0 1 1 1 0 1 0 0 0 1  0  0  1  0  1  0

16-bit bloom filter after the writes. "1" cells are set, "0" cells are clear.


2a · Read "SPY" · definitely not here

SPY hashes to {4, 7, 10}. Check those bits. Any one being 0 proves SPY was never added.

bit:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
value: 0 1 1 1 0 1 0 0 0 1  0  0  1  0  1  0

bit[4] = 0, bit[7] = 0, bit[10] = 0 · SKIP · SPY is definitely not in this row group.


2b · Read "MSFT" · possibly here

MSFT hashes to {2, 9, 12}. Check those bits. All three are 1, so MSFT might be in this row group.

bit:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
value: 0 1 1 1 0 1 0 0 0 1  0  0  1  0  1  0

bit[2] = 1, bit[9] = 1, bit[12] = 1 · DOWNLOAD · MSFT might be in this row group.

"All 1s" doesn't prove MSFT is actually stored: hash collisions mean another ticker could have set all three bits. False positives (unnecessary downloads) are acceptable; false negatives (missing a ticker that's actually there) would be a correctness bug. At Parquet's default SBBF sizing (10 bits per distinct ticker, 7 hashes), the false positive rate is FPR ≈ (1 − e^(−k·n/m))^k; with m/n = 10 bits per key and k = 7 (the classical optimum), that's (1 − e^(−0.7))^7 ≈ 0.008, under 1%.
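The sizing arithmetic is easy to verify. A one-liner for the classical bound (my helper, not a NexusTrade or parquet-rs function):

```rust
/// Classical bloom-filter false-positive bound: (1 - e^(-k*n/m))^k.
/// `bits_per_key` is m/n; `k` is the number of hash functions.
fn bloom_fpr(bits_per_key: f64, k: f64) -> f64 {
    (1.0 - (-k / bits_per_key).exp()).powf(k)
}

fn main() {
    // Parquet's default SBBF-style sizing: 10 bits per key, k = 7.
    let fpr = bloom_fpr(10.0, 7.0);
    assert!((fpr - 0.0082).abs() < 0.0005); // roughly 0.8% false positives
    println!("{fpr:.4}");
}
```

So roughly 1 in 120 row groups that pass all three filters is a wasted download, which is cheap insurance against downloading the other 119.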

A row group that survives all three filters still might contain rows for tickers outside the universe, because min/max and bloom are group-granular, not row-granular. So one more filter runs after the decode, in memory: a ColumnFamily::filter_rows_by_asset_ids step that permutes each Arc<[T]> column through a kept-index list. It sits after the LRU publish, not before it, so a {SPY} backtest and a {QQQ} backtest on the same machine share one cached row-group buffer and each applies its own filter to the same Arc<Cols>. The cache-key space stays universe-agnostic.

stock_intraday narrows to a lookup table for options DAY backtests

One last byte-reduction trick, narrower but meaningful. An options backtest at daily granularity uses stock_intraday only to resolve the underlying's close at decision time. It doesn't need open, high, low, or volume. Loading all five columns for the sole purpose of reading one is 40–60% wasted bytes.

When the backtest is options-only at DAY granularity, the reader narrows the projection to ticker, timestamp, and close. The other OHLCV columns are NaN-poisoned rather than decoded. MINUTE and HOUR backtests still read the full schema; the narrowing only kicks in where it's semantically safe. Same fingerprint, less bandwidth.

Reducing work alone isn't enough. Even after pruning to two row groups, those two groups still have to come over the network. The next cluster is about making that network cost invisible.

The second lever: overlap the network with everything else

The bytes we do need still have to travel from Tigris to the worker. Two mechanisms keep that travel from showing up as wall time: fan the fetches out in parallel, and let the backtest start running before the full dataset has arrived.

Concurrent row-group fetches per shard

parquet-rs's ParquetRecordBatchStream reads row groups sequentially. It asks for group 1, waits, asks for group 2, waits. For a shard with 40 surviving row groups on a 50 ms RTT link, that's two seconds of pure round-trip time even if each individual fetch is free.

The reader fans the work out itself. After pruning, the surviving row groups for a shard are split into N chunks. N builders are constructed against the same Arc<ArrowReaderMetadata> (the footer is prefetched once and shared), and each builder streams its chunk on its own task. HTTP/2 multiplexes the concurrent requests onto a single connection to Tigris; the bucket sees the fan-out, the process sees roughly 1/N of the round-trip wall time. Decoded row groups land in the LRU in whatever order they arrive; the singleflight (covered in the next cluster) means a late arrival for a group another task already published is a cheap no-op. Default N is 4, tunable by env var.

Pipelined feeding: data loads while the backtest runs

A twelve-month intraday backtest reads data in timestamp order. The iterator consumes month 1 entirely before it touches a single row in month 2. So month 2 doesn't need to be fully decoded before exec starts; it just needs to be ready by the time the cursor crosses into its range. The architecture exploits that.

AppendableColumnarReader is the primitive. It implements the same IndexedReader trait the merge iterator already knows how to consume, but its internal state is a Mutex<ReaderState> + Condvar. When the iterator asks for a row past the currently-loaded tail, it parks on the condvar. When a background fetch task calls append_chunk, the condvar wakes. A 120-second watchdog fails loud if a producer dies.

appendable_reader.rs · the producer/consumer handshake
// Consumer side: called by KWayMergeIterator, blocks until data lands
fn visit_point_at(&self, idx: usize, visitor: &mut impl Visitor<T>) {
    let mut state = self.state.lock().unwrap();
    loop {
        if let Some(block) = state.block_for(idx) { return block.visit(idx, visitor); }
        if state.is_complete { return; }             // producer said "no more"
        if let Some(err) = &state.error { panic!("feeder failed: {err}"); }
        let (g, t) = self.cond.wait_timeout(state, DEADLOCK_GUARD).unwrap();
        if t.timed_out() { panic!("appendable reader stuck >120s"); }
        state = g;
    }
}

// Producer side: called from the tokio::spawn feeder task
pub fn append_chunk(&self, cols: C)       { /* push + notify_all */ }
pub fn mark_complete(&self)               { /* set flag + notify_all */ }
pub fn mark_error(&self, msg: String)     { /* set err + notify_all */ }

The seed-then-spawn pattern is the orchestrator. The first month's fetch is awaited synchronously because the iterator needs something to emit. For every subsequent month, the reader wraps an empty AppendableColumnarReader whose min_timestamp is parsed directly from the "YYYY-MM" key string, so the MultiMonthReader's chronological sort is correct before any chunk has landed. Then it spawns the feeder and returns.

load_all.rs · the seed-then-spawn pattern
// seed: first month blocks, iterator needs it to start emitting ticks
let (seed_key, seed_cols) = fetch_one_key_cols_owned(keys[0], ...).await?;
let seed_reader = build_columnar_reader(seed_cols, ...);
let mut readers = vec![(seed_key, seed_reader)];

// pending: every other month. declare min_timestamp from the key string,
//          wrap an empty appendable reader, spawn the feeder
for key in &keys[1..] {
    let min_ts = min_timestamp_from_key(key)?;      // parses "2024-03" → ns
    let reader = AppendableColumnarReader::new_empty(min_ts);
    let handle  = reader.clone_handle();
    tokio::spawn(async move {
        match fetch_one_key_cols_owned(key, ...).await {
            Ok((_, cols)) => { handle.append_chunk(cols); handle.mark_complete(); }
            Err(e)        => handle.mark_error(e.to_string()),
        }
    });
    readers.push((key.clone(), AnyReader::AppendableColumnar(reader)));
}
MultiMonthReader::from_sorted(readers)   // sorts on declared min_timestamp
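The helper the snippet leans on, min_timestamp_from_key, has one job: map a "YYYY-MM" key to UTC nanoseconds at the first instant of that month. A hedged sketch using the standard days-from-civil date algorithm (the real helper may be implemented differently):

```rust
/// Map "YYYY-MM" to the UTC nanosecond timestamp of the month's first instant.
/// Sketch only: the real min_timestamp_from_key may differ in detail.
fn min_timestamp_from_key(key: &str) -> Option<i64> {
    let (y, m) = key.split_once('-')?;
    let (y, m): (i64, i64) = (y.parse().ok()?, m.parse().ok()?);
    if !(1..=12).contains(&m) {
        return None;
    }
    // days-from-civil (Gregorian, day = 1), with the year shifted to start in March
    let y2 = if m <= 2 { y - 1 } else { y };
    let era = y2.div_euclid(400);
    let yoe = y2 - era * 400;                 // year of era: 0..=399
    let mp = (m + 9) % 12;                    // Mar=0 .. Feb=11
    let doy = (153 * mp + 2) / 5;             // day-of-year of the 1st (0-based)
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    let days = era * 146_097 + doe - 719_468; // days since 1970-01-01
    Some(days * 86_400 * 1_000_000_000)
}

fn main() {
    assert_eq!(min_timestamp_from_key("1970-01"), Some(0));
    // 2024-03-01 is 19,783 days after the epoch.
    assert_eq!(
        min_timestamp_from_key("2024-03"),
        Some(19_783 * 86_400 * 1_000_000_000)
    );
    assert_eq!(min_timestamp_from_key("2024-13"), None);
}
```

This is what makes the pending readers sortable before a single byte has arrived: the key string alone determines each month's position in the merge.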

The wait function changes shape. Instead of Σ(per-month), the backtest waits max(yearly_types, seed_month) plus whatever the iterator races against as it walks forward in time. On a twelve-month intraday bench that's roughly twenty seconds of serial decode collapsed to one or two seconds of seed wait, with everything else hidden under exec. Options pipelining works the same way, with NBBO enrichment moved inside each per-month task so there's no cross-key post-decode barrier.

Cold is the first backtest on a fresh worker. Everything after that should feel free. The next cluster is about why.

The third lever: make every backtest after the first instant

The cold-path tricks above still pay one network round-trip per row group the first time. The process-global cache is what makes the second backtest to touch the same data read it from RAM. Two pieces work together: an LRU that holds decoded columns across requests, and a prewarm pass that fills it at server startup for the types every backtest needs.

The process-global LRU is the warm-path secret

Decoded row groups live in a process-global LRU keyed by (path, row_group_idx) with a byte budget. A given row group is fetched, decoded, and sort-deduped exactly once per worker lifetime, not per backtest. The second backtest to touch SPY March 5th reads from memory.

Two backtests can ask for the same row group at the same time. Without coordination they'd both fetch it, both decode it, and both race to publish into the cache. The cache uses a singleflight pattern: a get_or_init call whose first caller actually does the work while every concurrent caller parks on the same future. N concurrent backtests on a cold cache pay exactly one fetch per row group, not N.

Prewarm cache: daily types never touch network

Eight data types are small enough to fit comfortably in memory and are touched by almost every backtest: daily_ohlc, dividends, reports, economic, index_signals, financials, earnings, and crypto_daily. The worker loads all eight into the columnar cache at server startup, before it accepts any backtest traffic.

The upshot for most users: a daily backtest (stock or crypto, which together make up the vast majority of what Aurora generates) touches exactly these eight prewarmed types. First backtest on a freshly-booted worker runs at RAM speed, zero network, same as the hundredth. Only options and intraday backtests (stock intraday and crypto intraday) have to fetch anything from Tigris on cold. Even those see their second run at LRU pointer-bump speed, because the process-global LRU keeps decoded row groups around once they're touched. The cold-path trade-off documented later in this post is real, but it affects the minority of workloads, not the majority.

Three levers, one obvious problem: this whole stack now depends on Tigris being reachable. What happens when it isn't?

The fourth lever: survive the hiccups

Tigris is reliable but not perfect. Cold-path correctness now depends on the bucket being reachable every time a row group hasn't been cached yet. The retry stack is what catches transient failures before they turn into a user-visible backtest error.

Four layers of retry, each catching something different

The hiccups to survive: connection resets, stale connection-pool entries, the occasional 5xx, a chunk that silently stops streaming. The retry stack has four layers because one global retry knob can't catch everything without also hiding real failures.

Layer 1 · outermost · object_store internal retries · 5xx, connection reset · exponential backoff · default config
Layer 2 · chunk-level with_retry · wraps one chunk fetch · 2 attempts · 0 ms → 1 s · catches a stale pool entry
Layer 3 · whole-key with_feeder_retry · wraps the entire per-month feed · 2 attempts · 5 s apart · catches a gust of stale entries
Layer 4 · backstop · APPEND_WAIT_TIMEOUT · consumer watchdog · 120 s condvar wait · panics loud · not a retry
Nesting, outside in: network blips → one stale pool entry → a gust of stale entries → code bugs & environments we haven't seen. Each inner layer only fires when the outer layer has already failed.
  1. object_store internal retries. The client itself retries 5xxs and dropped connections with exponential backoff. This is the default HTTP-level resilience; we don't configure it beyond defaults.
  2. with_retry at the chunk level. Wraps a single chunk fetch (one builder streaming its slice of the row groups). Two attempts, 0 ms → 1 s backoff. Catches a stale connection-pool entry that object_store's own retry missed.
  3. with_feeder_retry at the whole-key level. Wraps the entire per-month feed. Two attempts, 5 seconds apart. Catches the case where every chunk for a shard got a stale-pool error at once. Rare, but the failure mode when it happens is "month 2 never loads" and you want one more swing at the whole thing.
  4. APPEND_WAIT_TIMEOUT consumer watchdog. The backstop. If the iterator has been parked on an appendable reader's condvar for 120 seconds with no chunks landing, it panics. Not a retry; a loud fail. A genuinely stuck feeder should crash the task, not quietly return partial data.

The layering matters. Layer 1 catches network blips. Layer 2 catches one stale pool entry. Layer 3 catches a gust of stale pool entries. Layer 4 catches code bugs and environments we haven't seen yet. A failure that makes it past all four is something we genuinely want to see as an error in the logs.

What did those four choices cost me, and what did they buy?

What the old architecture still did better

mmap was not wrong. For seven months it served every backtest at RAM speed off local disk, with zero network dependency and zero runtime variance. That is a real property. Saying streaming Parquet replaced it without saying what mmap gave up would be dishonest.

What mmap did better
  • Options and intraday first-run is slower. Daily stock and crypto backtests are unaffected because their data is prewarmed into the LRU at server boot. But for options and intraday workloads (which must fetch from Tigris on cold), the first backtest on a ready worker is measurably slower now. On my dev laptop (residential WAN), mmap ran the first options backtest in ~50s; streaming takes ~137s because it has to fetch and decompress Parquet pages before anything runs. Every subsequent backtest recovers this with interest (4× faster warm) but the cold loss is real on these specific workloads. Flamegraph data shows ZSTD decompression alone is ~25% of streaming-cold CPU, which means Fly won't rescue this the way I originally speculated: shared-tenant VMs typically have slower per-core CPU than a modern dev laptop, so the decode cost likely grows, not shrinks, in production.
  • No runtime network dependency. Tigris could go down entirely and backtests kept running. Streaming needs Tigris reachable every cold fetch.
  • Predictable latency. Local SSD access doesn't vary with WAN conditions or object-store tail latency.
  • No cache eviction surprises. The OS page cache is one of the best-tuned caching systems ever shipped. The streaming LRU is custom code that can have bugs.
What streaming wins on
  • No bootstrap. New worker serves its first backtest in seconds, not hours. Horizontal scale is immediate.
  • No duplicated storage. One copy on Tigris, not seven across backtest + optimizer + live-trade. $230/month recurring disk cost drops to $34 (the L2 volume on the primary), 7× lower and flat instead of sloped with concurrency.
  • Scale-to-zero on disk. Fly volumes bill even when the machine is stopped. Tigris-as-source-of-truth has no idle cost per-worker.
  • Columnar decoder makes execution 4× faster. Not a streaming-vs-mmap thing per se, but it shipped alongside. The warm-path backtest is faster end-to-end than mmap ever was.
  • No risk of format bit-rot. Every reader reads the same canonical Parquet bytes from Tigris. There's no worker-local binary format that can drift out of sync with the code. Adding a column is a non-event; the mmap path had to version its .bin files and handle migrations carefully every time the schema changed.

Net for my operational model: spiky traffic, a small number of concurrent users per worker, marginal warm-path latency differences are acceptable, and $230/month of duplicated storage isn't. Streaming wins. For a system that needed sub-millisecond cold-start determinism or ran without network access, mmap would still be the right answer.

Postscript · my one human improvement

Writing this section made me spot the fix. The options and intraday cold-path loss read worse on the page than in the benchmark. That was the bullet I couldn't shrug off, so I flipped the roles for one subsystem.

The key shape: completely optional. A single env var (STREAMING_DISK_LRU_PATH) gates the whole tier; unset means pure streaming with zero overhead. I turned it on for exactly one machine: the always-on primary with a 225 GB Fly volume. Every other worker (burst reserves, dev laptops, CI) keeps scale-to-zero intact. A micro-tweak, not a core architectural change.

Inside that machine: a 200 GB LRU between RAM and Tigris. My first instinct for the on-disk format was raw binary or bincode. Opus pushed back and suggested Arrow IPC (Feather v2), which is what shipped. A disk hit mmap's the file and Arc-wraps the columns with no ZSTD and no Arrow decode, RAM-tier latency off local NVMe. Schema validation falls out of the format for free; a format-version byte in the header is the global invalidation hammer. Freshness is the Parquet ETag recorded per file, revalidated with a HEAD on every hit.

If the shape sounds familiar, it's a demand-populated, LRU-bounded version of the old mmap architecture. Same decoded-columns-on-local-disk idea, without the 2-hour bootstrap or the 200 GB per-worker duplication. ~$34/month on Fly: 7× lower than the $230 the old seven-worker design was paying for duplicated disk.

I spotted the trade-off and made the call; Opus drove the implementation, same as everywhere else. The cold-path loss was the one part of the bargain I refused to accept, and reversing it was the closest I came to architecting anything this weekend.

Now: with the trade-offs on the table, here are the receipts.

What the move actually bought

Two wins, two separate mechanisms. The columnar decoder replaced the old 80-byte row struct, so each filter step now pulls one column into cache instead of the full row. That's where the 4× warm-backtest speedup comes from. The streaming architecture replaced the local 200 GB .bin cache with Tigris-as-source-of-truth, which is what cut the $230/month duplicated-disk bill down to $34, deleted the bootstrap, and let workers scale to zero. That $34 is one optional L2 volume on the always-on primary — every other worker stays at $0, and the code falls through cleanly on any machine without the volume mounted. Both shipped together over the weekend, but the speedups and the savings aren't coming from the same piece.
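The row-vs-column mechanism is worth seeing in miniature. These are illustrative shapes, not NexusTrade's actual types: a close-price filter over the row layout drags every field of each ~80-byte struct through cache, while the columnar layout scans one contiguous array of f64s.

```rust
// Old shape: array of ~80-byte row structs; every filter step loads whole rows.
#[allow(dead_code)]
struct Bar { open: f64, high: f64, low: f64, close: f64, volume: f64, ts: i64 }

// New shape: one Vec per column; a filter touches only the column it needs.
struct Columns { close: Vec<f64> }

fn count_above_rows(bars: &[Bar], level: f64) -> usize {
    bars.iter().filter(|b| b.close > level).count()
}

fn count_above_cols(cols: &Columns, level: f64) -> usize {
    cols.close.iter().filter(|&&c| c > level).count()
}

fn main() {
    let bars = vec![
        Bar { open: 1.0, high: 2.0, low: 0.5, close: 1.5, volume: 10.0, ts: 0 },
        Bar { open: 1.5, high: 3.0, low: 1.0, close: 2.5, volume: 12.0, ts: 1 },
    ];
    let cols = Columns { close: bars.iter().map(|b| b.close).collect() };
    // Same answer either way; the columnar scan just moves 8 bytes per bar
    // through cache instead of the full struct.
    assert_eq!(count_above_rows(&bars, 2.0), count_above_cols(&cols, 2.0));
}
```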

There's a third, subtler win on top: the process-global LRU turns every second-and-later backtest's data load into pointer bumps, so Aurora sessions that hit the same universe stay fast across variants.
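A sketch of what "pointer bumps" means here, assuming a shape like the one below. This is not the real LRU (no eviction or per-type budgets), just the hit path: a process-global map of Arc'd decoded columns means the second backtest over the same universe clones an Arc instead of re-fetching or re-decoding anything.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

type Columns = Vec<f64>; // stand-in for decoded column data

// Process-global cache; OnceLock makes initialization lazy and thread-safe.
static CACHE: OnceLock<Mutex<HashMap<String, Arc<Columns>>>> = OnceLock::new();

fn get_or_load(key: &str, load: impl FnOnce() -> Columns) -> Arc<Columns> {
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(key.to_string())
        .or_insert_with(|| Arc::new(load()))
        .clone() // cache hit: a refcount increment, no decode
}

fn main() {
    let first = get_or_load("SPY/minute", || vec![1.0, 2.0, 3.0]); // pays the load
    let second = get_or_load("SPY/minute", || unreachable!());     // pointer bump
    assert!(Arc::ptr_eq(&first, &second)); // same allocation, shared by both
}
```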

Bench methodology

SPY minute-level bull call spread, 11 business days (2025-10-13 → 2025-10-24), measured on a dev laptop against Tigris over residential WAN. Mmap numbers come from the last pre-streaming commit; streaming numbers from current HEAD with BENCH_RUNS=3 in-process. I haven't benched on Fly, so every number here is dev-laptop only. Treat the shape as the signal, not the absolute times.

Warm: the run that happens on ~every backtest
Architecture                    Data load    Backtest exec    Total
────────────────────────────────────────────────────────────────────
mmap (warm OS page cache)         0.33 s         41.1 s        41.5 s
Streaming Parquet (warm LRU)      0.29 s         10.3 s        10.6 s

Same window, same SPY universe, minute-level bull-call-spread on both architectures (streaming run adds an RSI gate for entry, which is additive execution work). Load cost is basically identical once caches are warm. The 4× end-to-end win is the columnar decoder's execution speedup. Bit-for-bit fingerprint parity was gated during the migration via a broader regression suite, not re-verified in this post-hoc bench, since the two benches invoke slightly different strategy stacks.

Cold: the one thing mmap still does better
Architecture                            Cold first-run total
─────────────────────────────────────────────────────────────
mmap (ready worker, OS cache cold)      ~50 s
Streaming Parquet (residential WAN)     ~137 s
Streaming Parquet (Fly)                 unmeasured

"Cold" here means a worker that's already up and ready, running its first backtest. On residential WAN, streaming cold is slower than mmap cold because the first backtest pays full Tigris round-trips per shard and then decompresses every page. Flamegraphs show ZSTD decompression alone accounts for roughly 25% of streaming-cold CPU time, so this isn't a pure network cost: a Fly VM colocated with Tigris cuts the network portion but still pays the same decompression bill, on what's typically slower per-core CPU than a dev laptop. I haven't measured Fly end-to-end, so the honest answer is I don't know the production number. What I do know: residential-mmap beats residential-streaming on first-backtest, and the gap on Fly is probably meaningful too. The ~2 hour GCS→.bin hydration mmap needed before a worker was ever "ready" is a separate thing, handled by the no-bootstrap property, not by cold-path speed.

Aurora: the workload this was actually built for

Aurora (our AI agent) generates several backtests against the same universe in quick succession: "try it with RSI(20), now RSI(30), now swap the bull call for a bull put spread." Under mmap each variant paid the same warm-OS-cache cost, and execution was bound by the 80-byte row struct. Under streaming, the first variant pays the network cost once; every variant after that reads decoded columns from the process-global LRU at pointer-bump speed, and the columnar decoder makes execution 4× faster too. Daily OHLC is prewarmed at server boot, so it's free on every backtest regardless. Net: Aurora's typical session is dominated by backtests that hit cached data, and those backtests are meaningfully faster than mmap ever was.

Metric                        Result           Notes
────────────────────────────────────────────────────────────────
Warm backtest, end-to-end     4× faster        10.6 s vs 41.5 s, minute SPY
New-worker bootstrap          ~2 hrs → 0       hydration deleted
Recurring disk cost           $230 → $34/mo    7× lower, flat cost curve
Fingerprint                   bit-identical    visible stats unchanged*

* Trades, pct-change, win rate, max drawdown, profit factor, and the displayed four-digit Sharpe are bit-identical to the mmap baseline. The hashed sharpe_ratio and sortino_ratio each shift by 1 ULP because SymbolRegistry's intern order changed under the new filter code and floating-point accumulation isn't associative. Documented in manager.rs.
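The 1-ULP shift is exactly what non-associative f64 accumulation predicts: summing the same values in a different order can land on an adjacent float. The values below are illustrative, not the engine's actual accumulations.

```rust
fn main() {
    // Same three values, two grouping orders.
    let left = (0.1_f64 + 0.2) + 0.3;  // rounds up to the next float above 0.6
    let right = 0.1_f64 + (0.2 + 0.3); // rounds to 0.6 exactly
    assert_ne!(left, right);
    // The two results are adjacent f64s: a 1 ULP difference, the same
    // magnitude of drift as the hashed sharpe_ratio shift.
    assert_eq!(left.to_bits() - right.to_bits(), 1);
}
```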

4× faster warm, 2-hour bootstrap gone, disk bill 7× lower and flat with concurrency. That's the engineering ledger. The other one worth talking about is the bill for the AI that shipped it.

The whole migration ran inside a $200/month subscription.

I'm on Claude Code's $200-a-month plan. The whole migration (streaming rewrite, columnar decoder, writer-side clustering, cold-path work, pipelined feeders, retry stack, review pass) ran inside that subscription. Prompt caching does the heavy lifting on a long session in one repo because file contents get reread across tool calls, so a good chunk of the input tokens landed at the 90%-off cached rate. Zero marginal cost. The cost of the model was not a variable I had to think about.

Now flip to the infra side of the ledger. The Render era was $275 a month for one Pro Max + disk, $550 for two, capped at two because Render disks don't horizontally scale. Fly broke that ceiling, but every additional worker brought another 200 GB volume with it. More traffic meant more disks meant more duplicated bytes. The line was sloped wrong.

Tigris replaces that line at $0.02 per GB-month with zero egress fees. The ZSTD-compressed Parquet dataset is a fraction of the 200 GB decoded footprint, so the storage line comes out to single-digit dollars a month. Request fees are $0.0005 per 1,000 GETs and $0.005 per 1,000 PUTs, which round to fractions of a cent at backtest volume. One bucket serves every worker, every role, every region. Adding the tenth backtest machine adds zero to this line. That's a flat cost curve instead of a sloped one.
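A back-of-envelope for that line, using the rates quoted above. The compressed dataset size and request volumes are assumptions for illustration (the post only says the ZSTD footprint is "a fraction of 200 GB"):

```rust
fn main() {
    let storage_gb = 60.0_f64;        // ASSUMED compressed footprint, not measured
    let gets_per_month = 2_000_000.0; // ASSUMED backtest GET volume
    let puts_per_month = 50_000.0;    // ASSUMED reclusterer PUT volume

    let storage = storage_gb * 0.02;                  // $0.02 per GB-month
    let gets = gets_per_month / 1_000.0 * 0.0005;     // $0.0005 per 1,000 GETs
    let puts = puts_per_month / 1_000.0 * 0.005;      // $0.005 per 1,000 PUTs
    let total = storage + gets + puts;

    println!("storage ${storage:.2} + gets ${gets:.2} + puts ${puts:.2} = ${total:.2}/month");
    assert!(total < 10.0); // single-digit dollars, and flat in worker count
}
```

None of the terms depend on how many workers mount the bucket, which is the whole point: the tenth machine adds zero to every line.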

This migration deletes six of the seven duplicated disks (the seventh stays as the primary's L2), deletes the bootstrap, and makes scale-to-zero actually cost zero on every replica. A stopped reserve worker now bills nothing. I could provision a fleet of a thousand reserve backtest machines, keep them all stopped, and pay zero dollars until one wakes up to serve a request. The one disk that stays costs $34 a month and doesn't scale with traffic.

The real math
Claude Code on Opus 4.7: one weekend of work, bundled into a $200 subscription. The Render-then-Fly persistent-disk architecture: recurring monthly cost that grew linearly with concurrency. The migration paid for itself the weekend it shipped and keeps paying every month I don't add another duplicated 200 GB.

The cost story is the boring answer. The more interesting one is who actually built this.

Opus 4.7 didn't help me build this. It built it.

I started the weekend skeptical. I've been building this engine since 2020. I know every line of the hot path. Every condition, every indicator, every corner of how the iterator merges data types. So when I say Opus 4.7 built this migration, I mean it literally.

A week ago I didn't know what a Bloom filter was. I couldn't have told you how Parquet stored row-group metadata in its footer, or that you can fetch a single row group's byte-range without downloading the whole shard. I knew Parquet was columnar. That was the extent of my architectural understanding of the format I now run production on.

Opus 4.7 knew it all. It picked the three pruning layers. It designed the N-way concurrent chunk fetcher with one shared ArrowReaderMetadata. It wrote the condvar-backed AppendableColumnarReader that I could barely explain on a whiteboard. It chose the four retry layers and what each one catches. It got bit-for-bit fingerprint parity on the first full commit and held it across every subsequent one. I checked in every thirty minutes. The diffs kept getting better.

With the mmap architecture, I drove. I did the research on what the header files needed to look like. I figured out the file locks. I understood why certain operations would block the main Tokio runtime and how to route around them. Claude Opus 4.1 and Gemini 2.5 Pro helped me build it, but every architectural decision ran through me first.

With this new architecture, Opus drove. I reviewed. That's the difference between "an AI helped me code this" and "a model shipped a load-bearing migration I don't fully understand yet." My architectural knowledge of the new system is maybe 65%. The system works. The fingerprint holds. The bench times check out.

I've been writing Rust since 2021. I wouldn't have designed the new hot path this way. The pipelined feeder, the three pruning layers, the way the retry stack nests. Pick any of them. The shape Opus landed on was better than the shape I'd have drawn on a whiteboard. That's the part I haven't stopped thinking about.

I spent five years building the architecture I was proud of. This weekend I reviewed a better one. I didn't write any of the code.

65% isn't defensive. Engineering managers don't understand 100% of the code their teams ship. Senior engineers don't understand 100% of the libraries they import. Shipping code you don't fully understand is how software gets built above a certain scale. What changes with AI-driven architecture is that the unit of "code you didn't write" grows from a library call to a whole subsystem. And the unit can keep growing.

What I actually focused on during the migration, in rough order of attention:

  1. Architectural patterns. I read every diff. I couldn't have designed the AppendableColumnarReader from scratch, but I could tell whether a given commit was solving the problem in a reasonable shape. The "what" is comprehensible even when the "why" isn't always.
  2. Tests. Unit tests for every new module. The regression fingerprint as the correctness gate across the whole migration. Bench numbers as the performance gate. All three had to hold to ship.
  3. Break-glass. Mmap and streaming ran side by side behind a BACKTEST_DATA_BACKEND feature flag for the duration of the weekend. If streaming broke, one env var flipped every worker back to mmap in the next deploy. Only after the fingerprint held across enough commits did I actually delete the mmap path. That was the last commit of the weekend.
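The break-glass flag in item 3 is a one-line decision at startup. The env var name is from the post; the enum, the parsing, and the default are my assumptions about its shape:

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum DataBackend { Streaming, Mmap }

// Streaming is the default; setting BACKTEST_DATA_BACKEND=mmap before the
// next deploy flips every worker back to the old path.
fn backend_from(raw: Option<&str>) -> DataBackend {
    match raw {
        Some("mmap") => DataBackend::Mmap,
        _ => DataBackend::Streaming,
    }
}

fn main() {
    let raw = env::var("BACKTEST_DATA_BACKEND").ok();
    println!("selected backend: {:?}", backend_from(raw.as_deref()));
}
```

Keeping both implementations behind a flag for the whole weekend is what made deleting mmap a decision rather than a gamble: the rollback cost stayed at one env var until the fingerprint had held long enough.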

If you're considering something like this, the workflow that worked for me is simple: write the fingerprint test first. Use Plan mode to draft the architecture, then Auto Mode to let Claude run, then a manual review pass at the end. Shift+Tab in Claude Code cycles through the permission modes. Keep the classifier on. Never use --dangerously-skip-permissions. The difference between useful autonomy and rm -rf'ing your home directory is one flag.

One more thing worth emphasizing. The fingerprint made the weekend possible, but it's not what keeps the system running afterward. That's a different stack. Pagers at the 120-second AppendableReader watchdog. Alerts when the four-layer retry stack exhausts on a shard path, when disk-LRU ETag mismatches spike (reclusterer misbehaving), and when disk-LRU IO errors show up (volume degraded). Dashboards watch RAM-LRU eviction rate against the 1 GB per-type budget so a traffic spike surfaces as a budget-tuning signal instead of as thrashing. None of that is code Opus wrote during the migration. It's what I rely on to catch the problems the fingerprint can't.

If you want to see what this migration actually made faster, try Aurora at nexustrade.io/agent. If you want to see how the trading engine's AI layer works end-to-end, the AI Agents from Scratch series walks through it.

The engine is still live. I'm still answering support emails. I still have to read every change.

A week ago I didn't know what a Bloom filter was. This weekend I shipped one into production. One small subcomponent in a 600,000 LOC passion project.

I wonder when I'll ship a feature that's a complete black box.

For now.

Going deeper

If you want the primary sources behind this post:

Anthropic: Introducing Claude Opus 4.7
Claude Code permission modes (Plan, Auto-accept, Normal)
Apache Parquet file format spec (row groups, statistics, projection)
Tigris object storage (the S3-compatible backend behind the cold path)

More from me

If you want to use or learn what this engine powers:

Aurora, the AI trading agent that runs on top of this backtest engine
Algorithmic Trading Fundamentals, my course on building and backtesting strategies from scratch
AI Agents from Scratch, my series on how the agent layer works end-to-end
NexusTrade, the platform that puts it all together
