Building a Prediction Market From Scratch — Performance: From 50K to a Million Orders a Second

“How many orders a second can it do?” is the question everyone asks about a matching engine. It’s the wrong first question. The right one is: where does the time actually go? Because almost everyone guesses wrong — they go optimize the matching algorithm, when the real cost is sitting somewhere much less glamorous.

This post is the performance roadmap for Zhulong, our prediction-market engine. I’ll show you where the time goes, why the engine needs a write-ahead log, a handful of latency numbers worth memorizing, and a ladder that climbs from tens of thousands of orders per second to a million. Every number labeled “measured” comes from one machine (an Apple M4 Pro). Read those for the ratios, not the absolute values; the benchmark that produced them is at the end so you can run it yourself.

Where the time actually goes

Rank the costs in a naive matching engine from biggest to smallest, and the order surprises people:

  1. The fsync that makes orders durable. fsync is the call that forces your data to physically land on the disk and waits for the disk to confirm. If you naively do one after every order, it dwarfs everything else — by a factor of thousands.
  2. Memory allocation. Whenever a program needs to store something whose size isn’t known ahead of time, it asks the system for a chunk of memory and hands it back when done. One request is cheap — but a matching engine does this hundreds of thousands of times a second, once for every order and every trade, and then the borrowing-and-returning of memory itself becomes the hot spot, not the matching.
  3. The match itself — book lookups, walking price levels, the FIFO queues.
  4. Post-trade bookkeeping, event serialization, the network.

Notice the algorithm everyone wants to tune is third. The first two are plumbing. So before any of this makes sense, we have to talk about that fsync — and why it’s there at all.

Why a write-ahead log

The engine keeps everything in memory — order books, balances, positions. Memory is fast, but it has one fatal flaw: pull the plug and it’s gone. A crash, a kill, a power blip, and every order someone just placed vanishes. For anything touching money, that’s not acceptable.

So before applying a command, the engine appends it — the command, the input, not the resulting state — to a local file and fsyncs it to disk. Only after that does it touch memory, and only then does the user hear “accepted.” So “your order went through” means “your order is already on disk.” If the process dies a microsecond later, the order survives. On restart, replay the log from the top and the in-memory state rebuilds itself, bit for bit — because the engine is deterministic: same inputs, same state.

This is why the database gets demoted to an async mirror, and why the write-ahead log, not the database, is the source of truth. (I went deep on this in the architecture chapter.) The point for this post is simpler: durability has a price, and the price is fsync. Which brings us to the one idea about latency that reorganizes how you think about all of it.
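The append, fsync, apply, acknowledge ordering and the replay on restart can be sketched in a few dozen lines. This is a minimal illustration, not Zhulong's code: the `Command::Deposit` type, the one-line-per-command text encoding, and the `Engine` struct are all assumptions made up for the example, and a real log would add framing and checksums so a torn final record can be detected.

```rust
use std::collections::BTreeMap;
use std::fs::{File, OpenOptions};
use std::io::{self, BufRead, BufReader, Write};
use std::path::Path;

// Hypothetical command: the *input*, not the resulting state.
#[derive(Debug, Clone, PartialEq)]
enum Command {
    Deposit { user: u32, cents: u64 },
}

impl Command {
    // One text line per command (an assumption made for readability).
    fn encode(&self) -> String {
        match self {
            Command::Deposit { user, cents } => format!("deposit {user} {cents}\n"),
        }
    }

    fn decode(line: &str) -> Option<Command> {
        let mut parts = line.split_whitespace();
        match parts.next()? {
            "deposit" => Some(Command::Deposit {
                user: parts.next()?.parse().ok()?,
                cents: parts.next()?.parse().ok()?,
            }),
            _ => None,
        }
    }
}

struct Engine {
    wal: File,
    balances: BTreeMap<u32, u64>, // in-memory state, rebuilt from the log
}

impl Engine {
    // Deterministic: the same commands in the same order give the same state.
    fn apply(balances: &mut BTreeMap<u32, u64>, cmd: &Command) {
        match cmd {
            Command::Deposit { user, cents } => {
                *balances.entry(*user).or_insert(0) += *cents;
            }
        }
    }

    // On startup, replay the log from the top, then reopen it for appending.
    fn open(path: &Path) -> io::Result<Engine> {
        let mut balances = BTreeMap::new();
        if path.exists() {
            for line in BufReader::new(File::open(path)?).lines() {
                if let Some(cmd) = Command::decode(&line?) {
                    Engine::apply(&mut balances, &cmd);
                }
            }
        }
        let wal = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Engine { wal, balances })
    }

    fn submit(&mut self, cmd: Command) -> io::Result<()> {
        self.wal.write_all(cmd.encode().as_bytes())?; // 1. append the input
        self.wal.sync_all()?;                         // 2. wait until it is on disk
        Engine::apply(&mut self.balances, &cmd);      // 3. only now touch memory
        Ok(())                                        // 4. caller may report "accepted"
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("wal_sketch.log");
    let _ = std::fs::remove_file(&path);

    let mut engine = Engine::open(&path)?;
    engine.submit(Command::Deposit { user: 1, cents: 500 })?;
    engine.submit(Command::Deposit { user: 1, cents: 250 })?;
    let before = engine.balances.clone();
    drop(engine); // stand-in for a crash: all in-memory state is gone

    let recovered = Engine::open(&path)?;
    assert_eq!(recovered.balances, before);
    println!("recovered: {:?}", recovered.balances);
    std::fs::remove_file(&path)
}
```

Note that `submit` pays one `sync_all` per command. That is exactly the naive cost the rest of this post sets out to remove.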

The latency numbers worth memorizing

The single most useful thing a systems programmer can carry around is a gut feel for how much slower one thing is than another. They aren’t a little different. They’re different by orders of magnitude. Here’s the ladder, with a trick: blow up 1 ns to 1 second, and see how human those gaps become.

Operation                              Time      If 1 ns were 1 second
L1 cache hit                           ~1 ns     1 second
Branch mispredict                      ~3 ns     3 seconds
Main memory (RAM)                      ~100 ns   ~1.7 minutes
measured: array price-level update     0.7 ns    0.7 seconds
measured: BTreeMap price-level update  12 ns     12 seconds
Uncontended mutex lock+unlock          ~17 ns    17 seconds
SSD random read                        ~16 µs    ~4.5 hours
Same-datacenter network round trip     ~0.5 ms   ~6 days
measured: one fsync to disk            ~3.8 ms   ~44 days
Cross-continent round trip             ~150 ms   ~5 years

Just read the last few rows. An fsync (milliseconds) is about a million times slower than a memory operation (nanoseconds). So “fsync after every order” turns an engine that could do hundreds of thousands a second into one that does a few hundred. That’s not a tuning problem. That’s a difference in kind — and it’s the first rung on the ladder.
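The arithmetic behind the rescaled column and the "few hundred orders a second" ceiling is simple enough to check by hand. The sketch below does both; the helper names `humanize` and `serial_ceiling` are made up for this example, and the inputs are the figures from the table above.

```rust
// Rescale a duration so that 1 ns reads as 1 s, then pick a human unit.
fn humanize(ns: f64) -> String {
    let secs = ns; // 1 ns of real time becomes 1 s of "human" time
    if secs < 60.0 {
        format!("{secs:.1} seconds")
    } else if secs < 3_600.0 {
        format!("{:.1} minutes", secs / 60.0)
    } else if secs < 86_400.0 {
        format!("{:.1} hours", secs / 3_600.0)
    } else if secs < 365.0 * 86_400.0 {
        format!("{:.1} days", secs / 86_400.0)
    } else {
        format!("{:.1} years", secs / (365.0 * 86_400.0))
    }
}

// Upper bound on orders/sec when every order waits for its own fsync.
fn serial_ceiling(fsync_ms: f64) -> f64 {
    1_000.0 / fsync_ms
}

fn main() {
    let rows = [
        ("L1 cache hit", 1.0),
        ("Main memory (RAM)", 100.0),
        ("SSD random read", 16_000.0),
        ("Same-datacenter round trip", 500_000.0),
        ("One fsync (measured)", 3_800_000.0),
        ("Cross-continent round trip", 150_000_000.0),
    ];
    for (name, ns) in rows {
        println!("{name:<28} {}", humanize(ns));
    }
    println!(
        "fsync-per-order ceiling: {:.0} orders/sec",
        serial_ceiling(3.8)
    );
}
```

With a 3.8 ms fsync the ceiling works out to about 263 orders a second, which lines up with the 261 orders/sec the benchmark at the end reports for fsync-per-write.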

The ladder, one rung at a time

Each rung removes one bottleneck. The figures introduced with “Measured:” are results from the benchmark at the end.

S0 — Correct and naive. BTreeMap for price levels, HashMap for orders, VecDeque for the FIFO queues, and an fsync after every command. It runs, it’s easy to read, it’s slow. Always start here: correctness first. A fast engine that settles trades wrong is worthless.

S1 — Batch the fsync (group commit). Don’t flush after every order. Write a batch of commands, then fsync once. The cost of an fsync is roughly fixed regardless of how much you’re flushing, so one flush amortized over a thousand commands nearly disappears. Measured: fsync-ing after every order manages a mere 261 orders/sec; batching lifts it to 145,000 orders/sec — a 558× jump. In production you don’t wait for a full batch (that would starve latency) — you flush every ~1 ms or every K commands, whichever comes first. You trade at most a millisecond of latency for hundreds of times the throughput, and you still never lose an order. Best deal in the building.
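The "every ~1 ms or every K commands, whichever comes first" policy can be sketched as a small buffer in front of the log file. This is an illustration under assumptions, not the engine's implementation: the `GroupCommit` type, its field names, and the 64-byte dummy payload are invented here, and the sketch is single-threaded, so the time check only runs when `push` or `maybe_flush` is called. A real event loop would also call `maybe_flush` when idle so a lone order is not left waiting.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::time::{Duration, Instant};

struct GroupCommit {
    file: File,
    buf: Vec<u8>,            // encoded commands not yet on disk
    pending: usize,          // number of commands currently in `buf`
    oldest: Option<Instant>, // arrival time of the oldest pending command
    max_batch: usize,        // K: flush when this many commands are pending
    max_delay: Duration,     // flush when the oldest command has waited this long
    fsyncs: u64,             // counter, only for the demo
}

impl GroupCommit {
    fn new(file: File, max_batch: usize, max_delay: Duration) -> Self {
        GroupCommit {
            file,
            buf: Vec::new(),
            pending: 0,
            oldest: None,
            max_batch,
            max_delay,
            fsyncs: 0,
        }
    }

    // Queue one command. Returns how many commands just became durable
    // (0 if no flush happened); only those may be acknowledged.
    fn push(&mut self, cmd: &[u8]) -> io::Result<usize> {
        self.buf.extend_from_slice(cmd);
        self.pending += 1;
        if self.oldest.is_none() {
            self.oldest = Some(Instant::now());
        }
        self.maybe_flush()
    }

    // Flush if either limit is reached, whichever comes first.
    fn maybe_flush(&mut self) -> io::Result<usize> {
        let overdue = self.oldest.map_or(false, |t| t.elapsed() >= self.max_delay);
        if self.pending >= self.max_batch || overdue {
            self.flush()
        } else {
            Ok(0)
        }
    }

    // One write and one fsync cover every pending command.
    fn flush(&mut self) -> io::Result<usize> {
        if self.pending == 0 {
            return Ok(0);
        }
        self.file.write_all(&self.buf)?;
        self.file.sync_all()?;
        self.fsyncs += 1;
        let done = self.pending;
        self.buf.clear();
        self.pending = 0;
        self.oldest = None;
        Ok(done)
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("group_commit_sketch.bin");
    let file = OpenOptions::new().create(true).write(true).truncate(true).open(&path)?;
    let mut wal = GroupCommit::new(file, 100, Duration::from_millis(1));

    let mut durable = 0;
    for _ in 0..1000 {
        durable += wal.push(&[0u8; 64])?;
    }
    durable += wal.flush()?; // drain whatever is left
    println!("{durable} commands durable after {} fsyncs", wal.fsyncs);
    std::fs::remove_file(&path)
}
```

The important property is that `push` never reports a command as durable before the fsync covering it has returned, so batching changes when the acknowledgement goes out, not what it means.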

S2 — Stop borrowing memory on the hot path (the code that runs for every single order). Instead of asking the system for a fresh chunk on every order and trade and handing it back, keep one chunk around and reuse it — empty it and refill it. Think of a worker who runs to the tool room for a screwdriver and returns it after every single screw: the trips cost more than the work. Keep the screwdriver in hand. Measured: borrowing a fresh chunk on every operation costs 13.2 ns; reusing one costs 0.8 ns — 16×. Once you stop asking for memory, you stop paying for it.
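One common way to keep the screwdriver in hand is to have the caller own the output buffer and pass it into the matching routine, which clears and refills it. The sketch below assumes a hypothetical `Level` (one price level as a FIFO of `(order id, remaining quantity)` pairs) and a hypothetical `Trade` record; neither is taken from the engine.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Trade {
    maker_id: u64,
    qty: u32,
}

// Resting orders at one price level, oldest first: (order id, remaining qty).
struct Level {
    orders: VecDeque<(u64, u32)>,
}

impl Level {
    // Fill `qty` against the queue in FIFO order, writing trades into `out`.
    // `out` is cleared and refilled, so as long as its capacity is large
    // enough this path performs no heap allocation.
    fn fill(&mut self, mut qty: u32, out: &mut Vec<Trade>) {
        out.clear();
        while qty > 0 {
            let Some(front) = self.orders.front_mut() else { break };
            let take = qty.min(front.1);
            out.push(Trade { maker_id: front.0, qty: take });
            front.1 -= take;
            qty -= take;
            if front.1 == 0 {
                self.orders.pop_front(); // maker fully filled
            }
        }
    }
}

fn main() {
    let mut level = Level {
        orders: VecDeque::from(vec![(1, 5), (2, 5), (3, 5)]),
    };
    // Allocate once, up front; reuse for every incoming order.
    let mut trades: Vec<Trade> = Vec::with_capacity(16);
    let buffer_addr = trades.as_ptr();

    level.fill(7, &mut trades);
    println!("{trades:?}");
    level.fill(8, &mut trades);
    println!("{trades:?}");

    // Same backing storage throughout: the buffer was never reallocated.
    assert_eq!(trades.as_ptr(), buffer_addr);
}
```

The design choice is that ownership of the buffer sits with the long-lived caller rather than with each call, which is what turns an allocation per order into an allocation per process.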

S3 — Pick the right data structure (our unfair advantage). A general exchange has a huge price range, so it needs a tree-shaped structure (a BTreeMap — lookups get slower as more price levels pile up, and its scattered memory is unfriendly to the CPU cache). But a prediction market’s prices are just 1 to 99 cents — whole numbers, with a fixed ceiling. So the order book can be a plain fixed array of 99 slots, one queue per slot. Finding a price level is then constant-time (the same speed no matter how many levels there are) and sits in contiguous, cache-friendly memory; finding the best price is a single CPU instruction over a small bitmap — a row of 0/1 bits marking which slots have orders. Measured: the tree is 12.1 ns/op, the array 0.7 ns/op — 17×. The thing other engines sweat over, we get nearly for free, just because our price space is tiny. This is the real reason a prediction market is easier to make fast than a general exchange.
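Here is a minimal sketch of the array-plus-bitmap idea for the bid side. The `BidBook` type and its methods are assumptions for illustration, and each slot holds only a total quantity rather than a full order queue. One caveat: 99 price levels do not fit in 64 bits, so this sketch uses a `u128`, and on a 64-bit CPU `leading_zeros` on a `u128` typically compiles to a couple of instructions rather than exactly one.

```rust
// Bid side of a book whose prices are whole cents from 1 to 99.
struct BidBook {
    qty: [u64; 100], // qty[p] = total resting quantity at price p; slot 0 unused
    occupied: u128,  // bit p is set exactly when qty[p] > 0
}

impl BidBook {
    fn new() -> Self {
        BidBook { qty: [0; 100], occupied: 0 }
    }

    // Constant time: index straight into the array, mark the slot occupied.
    fn add(&mut self, price: u8, q: u64) {
        assert!((1..=99).contains(&price), "price must be 1..=99 cents");
        if q == 0 {
            return;
        }
        self.qty[price as usize] += q;
        self.occupied |= 1u128 << price;
    }

    // Remove quantity; clear the bit when the level empties.
    fn remove(&mut self, price: u8, q: u64) {
        assert!((1..=99).contains(&price), "price must be 1..=99 cents");
        let slot = &mut self.qty[price as usize];
        *slot = slot.saturating_sub(q);
        if *slot == 0 {
            self.occupied &= !(1u128 << price);
        }
    }

    // Best bid = highest set bit in the bitmap, found with a bit scan.
    fn best_bid(&self) -> Option<u8> {
        if self.occupied == 0 {
            None
        } else {
            Some((127 - self.occupied.leading_zeros()) as u8)
        }
    }
}

fn main() {
    let mut book = BidBook::new();
    book.add(40, 10);
    book.add(55, 3);
    book.add(52, 7);
    println!("best bid: {:?}", book.best_bid());
    book.remove(55, 3);
    println!("best bid after the 55c level empties: {:?}", book.best_bid());
}
```

The ask side would mirror this with `trailing_zeros` to find the lowest occupied price.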

S4 — Squeeze out the system overhead. By now the algorithm isn’t the bottleneck — the operating system’s own overhead is. Pin the matching thread to a single CPU core, so the OS never pauses it to run something else. Replace the lock-based queue with a lock-free one — a fixed ring of slots that the producer and consumer hand off without either ever taking a lock. Move every bit of I/O — disk, network, sending events out — onto other threads, so the matching thread does nothing but match. Process commands in small batches to spread the fixed costs. This bundle of techniques is how LMAX, a London trading venue, famously pushed a single thread past a million operations a second more than a decade ago. It’s the most work, so it comes last — only after the benchmarks prove the earlier rungs landed.
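The "fixed ring of slots" handoff can be sketched with atomics alone. This is a simplified single-producer, single-consumer ring written for this post's explanation, not the LMAX Disruptor and not the engine's queue: the `Ring` type is hypothetical, the payload is a bare `u64` standing in for a command, and the demo yields while waiting where a pinned production thread would more likely busy-spin. It is only correct with exactly one producer thread and one consumer thread.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

const CAP: usize = 1024; // power of two, so wrapping an index is a bit mask

struct Ring {
    slots: Vec<AtomicU64>,
    head: AtomicUsize, // next slot to read; written only by the consumer
    tail: AtomicUsize, // next slot to write; written only by the producer
}

impl Ring {
    fn new() -> Self {
        Ring {
            slots: (0..CAP).map(|_| AtomicU64::new(0)).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    // Producer side. Returns false when the ring is full.
    fn try_push(&self, value: u64) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == CAP {
            return false;
        }
        self.slots[tail & (CAP - 1)].store(value, Ordering::Relaxed);
        // Release: the slot write above is visible before the new tail is.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    // Consumer side. Returns None when the ring is empty.
    fn try_pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let value = self.slots[head & (CAP - 1)].load(Ordering::Relaxed);
        // Release: the slot read above completes before the slot is handed back.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}

// Push 1..=n from one thread, pop on another; report the sum and whether
// values arrived in order.
fn run_handoff(n: u64) -> (u64, bool) {
    let ring = Arc::new(Ring::new());
    let producer = {
        let ring = Arc::clone(&ring);
        thread::spawn(move || {
            for i in 1..=n {
                while !ring.try_push(i) {
                    thread::yield_now(); // ring full: wait for the consumer
                }
            }
        })
    };

    let (mut sum, mut in_order, mut expected) = (0u64, true, 1u64);
    while expected <= n {
        match ring.try_pop() {
            Some(v) => {
                in_order &= v == expected;
                sum += v;
                expected += 1;
            }
            None => thread::yield_now(), // ring empty: wait for the producer
        }
    }
    producer.join().unwrap();
    (sum, in_order)
}

fn main() {
    let (sum, in_order) = run_handoff(1_000_000);
    println!("sum = {sum}, in order = {in_order}");
}
```

Neither side ever blocks on the other: each index has exactly one writer, and the acquire/release pairs on `head` and `tail` are what make the slot contents safe to read.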

S5 — Shard by market. One core has a ceiling. But different markets’ books are completely independent — market A’s matching has nothing to do with market B’s. So run N matching cores, each the sole writer of its own books. Throughput scales nearly linearly with cores. That’s what “millions per second” means for a venue: ~1M per book, multiplied across shards.
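Routing by market can be sketched as a fixed mapping from market id to shard, with one worker thread per shard that is the only writer of its books. Everything named here is an assumption for illustration: the `Order` struct, the `shard_of` modulo rule, and the per-market running total that stands in for a real order book. The standard library channel is used for brevity; in the design described in S4 each shard would be fed by a lock-free ring instead.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

struct Order {
    market: u32,
    qty: u64,
}

// A market always maps to the same shard, so its book has a single writer.
fn shard_of(market: u32, shards: usize) -> usize {
    market as usize % shards
}

// Returns, per shard, a map of market id -> total quantity seen.
fn run_sharded(orders: Vec<Order>, shards: usize) -> Vec<HashMap<u32, u64>> {
    let mut senders = Vec::new();
    let mut workers = Vec::new();

    for _ in 0..shards {
        let (tx, rx) = mpsc::channel::<Order>();
        senders.push(tx);
        workers.push(thread::spawn(move || {
            // State owned by this thread alone: no locks, no sharing.
            let mut books: HashMap<u32, u64> = HashMap::new();
            for order in rx {
                *books.entry(order.market).or_insert(0) += order.qty;
            }
            books
        }));
    }

    for order in orders {
        let shard = shard_of(order.market, shards);
        senders[shard].send(order).unwrap();
    }
    drop(senders); // closing the channels lets each worker finish

    workers.into_iter().map(|w| w.join().unwrap()).collect()
}

fn main() {
    let orders: Vec<Order> = (0..1000u64)
        .map(|i| Order { market: (i % 4) as u32, qty: 1 })
        .collect();
    let per_shard = run_sharded(orders, 2);
    for (shard, books) in per_shard.iter().enumerate() {
        let mut markets: Vec<_> = books.keys().copied().collect();
        markets.sort();
        println!("shard {shard} owns markets {markets:?}");
    }
}
```

Because each market's commands still arrive at one thread in one order, sharding keeps the per-market determinism that the write-ahead log replay depends on.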

Is a million realistic? Where the pros land

A reality check, because these targets aren’t fantasy. Here’s roughly where established venues sit, using their own publicly reported figures. Read the numbers as orders of magnitude, not a leaderboard — they mix definitions (orders vs. messages vs. trades, one symbol vs. a whole venue, matching-only vs. end-to-end):

Venue             Throughput (reported)                                                 Latency (reported)
LMAX (Disruptor)  ~6M orders/sec on one thread                                          tens of microseconds
Nasdaq INET       >1M messages/sec                                                      under 100 µs (99.99th pct)
CME Globex        up to ~20M orders on a peak day                                       ~52 µs median, gateway to match
Binance           ~1.4M orders/sec (stated); 6.5M trades in one second at a 2022 peak   (none listed)
Hyperliquid       ~200k orders/sec, fully on-chain                                      ~0.2 s order finality

Two things stand out. First, our S5 target — around a million orders a second per book, scaled out by sharding — sits in the same range as LMAX’s single thread and the big venues’ order of magnitude. We’re not chasing a number nobody hits. Second, look at Hyperliquid: a fully on-chain order book runs near 200k orders/sec. That’s the bar the on-chain settlement layer (M4) would have to clear, and it’s plainly within reach — reassuring for the day this market goes on-chain.

Publicly reported figures, for orientation only. Sources: the LMAX Disruptor paper, Nasdaq INET, CME Globex, Binance, and the Hyperliquid docs.

The twist: determinism and speed are on the same side

It’s easy to assume safety and speed fight each other. Here they don’t. The choices that make the engine deterministic — a single thread, no locks, commands processed in strict order — are the same choices that make it fast: no lock contention, no cache lines bouncing between cores. And the write-ahead log that guarantees durability is made cheap by the very same group-commit trick from S1.

So the MVP — one process, in memory, with a WAL — isn’t a throwaway you’ll rewrite when you need speed. It’s already the right foundation for a million orders a second. You don’t rebuild to get there; you climb. S1 to S5 are all additive.

See it for yourself

Here’s the self-contained benchmark behind the three measured numbers — no dependencies, just rustc -O perfdemo.rs && ./perfdemo:

use std::collections::BTreeMap;
use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

fn main() { bench_fsync(); bench_pricelevels(); bench_alloc(); }

// (1) Why group commit: fsync per write vs one fsync for the batch
fn bench_fsync() {
    let n = 1000u64;
    let path = std::env::temp_dir().join("perfdemo_wal.bin");
    let payload = [0u8; 64];
    let mut f = OpenOptions::new().create(true).write(true).truncate(true).open(&path).unwrap();
    let t = Instant::now();
    for _ in 0..n { f.write_all(&payload).unwrap(); f.sync_all().unwrap(); }
    let per = t.elapsed();
    let mut f = OpenOptions::new().create(true).write(true).truncate(true).open(&path).unwrap();
    let t = Instant::now();
    for _ in 0..n { f.write_all(&payload).unwrap(); }
    f.sync_all().unwrap();
    let batched = t.elapsed();
    let _ = std::fs::remove_file(&path);
    println!("fsync/write: {:.0} ops/s | batched: {:.0} ops/s | {:.0}x",
        n as f64 / per.as_secs_f64(), n as f64 / batched.as_secs_f64(),
        per.as_secs_f64() / batched.as_secs_f64());
}

// (2) Why a fixed array beats BTreeMap when prices are 1..99
fn bench_pricelevels() {
    let n = 5_000_000u64;
    // Tree book: bump one level's count, then read the best (highest) price.
    let mut book: BTreeMap<u64, u64> = BTreeMap::new();
    let mut s = 0u64;
    let t = Instant::now();
    for i in 0..n {
        let p = 1 + (i % 99);
        *book.entry(p).or_insert(0) += 1;
        if let Some((&best, _)) = book.iter().next_back() {
            s ^= best;
        }
    }
    let bt = t.elapsed();
    // Array book: slot 0 is unused, slots 1..=99 are the price levels.
    let mut arr = [0u64; 100];
    let mut s2 = 0u64;
    let t = Instant::now();
    for i in 0..n {
        let p = (1 + (i % 99)) as usize;
        arr[p] += 1;
        // Scan down from 99 for the highest non-empty level.
        let mut best = 0;
        for q in (1..100).rev() {
            if arr[q] != 0 { best = q; break; }
        }
        s2 ^= best as u64;
    }
    let ar = t.elapsed();
    println!("BTreeMap: {:.1} ns/op | array: {:.1} ns/op | {:.1}x",
        bt.as_nanos() as f64 / n as f64, ar.as_nanos() as f64 / n as f64,
        bt.as_secs_f64() / ar.as_secs_f64());
    // Keep the checksums observable so the optimizer cannot discard the loops.
    std::hint::black_box((s, s2));
}

// (3) Why the hot path must not allocate
fn bench_alloc() {
    let n = 5_000_000u64;
    let mut s = 0u64;
    // Fresh Vec every iteration: one heap allocation and one free each time.
    let t = Instant::now();
    for i in 0..n {
        let mut v: Vec<u64> = Vec::new();
        v.push(i);
        v.push(i ^ 7);
        s ^= v.iter().sum::<u64>();
    }
    let a = t.elapsed();
    // One Vec reused: clear() keeps the capacity, so nothing is allocated.
    let mut buf: Vec<u64> = Vec::with_capacity(8);
    let t = Instant::now();
    for i in 0..n {
        buf.clear();
        buf.push(i);
        buf.push(i ^ 7);
        s ^= buf.iter().sum::<u64>();
    }
    let r = t.elapsed();
    println!("Vec::new: {:.1} ns/op | reused: {:.1} ns/op | {:.1}x",
        a.as_nanos() as f64 / n as f64, r.as_nanos() as f64 / n as f64,
        a.as_secs_f64() / r.as_secs_f64());
    // Keep the checksum observable so the optimizer cannot discard the loops.
    std::hint::black_box(s);
}

On an Apple M4 Pro: fsync/write 261 ops/s → batched 145,486 ops/s (558x), BTreeMap 12.1 ns → array 0.7 ns (16.9x), Vec::new 13.2 ns → reused 0.8 ns (16.7x).

One rule holds the whole thing together: don’t optimize without a number. Profile first — the bottleneck is almost always the fsync or the allocator, not the algorithm you assumed. We’ll keep a benchmark on every layer as the engine grows, and a throughput guard test that fails the build if a change quietly makes matching slower. And as promised — every line of it open, free, no strings.
