Building a Prediction Market From Scratch (3): One Deterministic Engine and a Write-Ahead Log

CJ included in Engineering Trading Systems

2026-06-01

The engine has a name now: Zhulong (烛龙), the torch-dragon from the Classic of Mountains and Seas. The old text says it neither eats, sleeps, nor breathes; when it opens its eyes the world is day, when it closes them, night. A being whose output depends on nothing but its own state. I couldn’t have asked for a better mascot for what I’m building — because the heart of this design is exactly that: the same log replayed always produces the same state.

Before any code, I spent real time on the architecture, because this is the decision you don’t get to take back cheaply. I’ll tell you where I landed, walk through the alternatives I rejected and why, and lay out how it grows from a thing you run with cargo run on a laptop into something production-grade — by adding, never by rebuilding.

The one fact everything follows from

Here’s the thing that, once you really accept it, settles half the argument: matching is single-writer.

A single order book has to match in a strict order — price first, then time. Two buyers at the same price must fill in the order they arrived. That’s not an implementation detail you can relax; it’s the fairness guarantee that makes the market a market. And a strict global order means you cannot have two threads matching against the same book at once without locking it into a sequence anyway. Parallelism buys you nothing on the hot path — the code that runs for every single order — it only buys you bugs.
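
A minimal sketch of what price-time priority looks like in code, assuming integer prices and showing only the bid side of one book. The names `Bids`, `Resting`, `rest`, and `match_sell` are illustrative and not taken from the Zhulong repository. The point is that a sorted map of FIFO queues already encodes the whole fairness rule.

```rust
use std::collections::{BTreeMap, VecDeque};

/// A resting buy order. Field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct Resting {
    pub order_id: u64,
    pub qty: u64,
}

/// Bid side of one book: price -> FIFO queue of the orders resting at that price.
#[derive(Default)]
pub struct Bids {
    levels: BTreeMap<u64, VecDeque<Resting>>,
}

impl Bids {
    /// A new bid joins the back of its price level, so arrival order is preserved.
    pub fn rest(&mut self, price: u64, order_id: u64, qty: u64) {
        self.levels
            .entry(price)
            .or_default()
            .push_back(Resting { order_id, qty });
    }

    /// Match an incoming sell of `qty` that accepts any price >= `limit`.
    /// Returns fills as (buyer order_id, price, qty), in execution order.
    pub fn match_sell(&mut self, limit: u64, mut qty: u64) -> Vec<(u64, u64, u64)> {
        let mut fills = Vec::new();
        while qty > 0 {
            // Price priority: the highest bid is served first.
            let Some(mut level) = self.levels.last_entry() else { break };
            let price = *level.key();
            if price < limit {
                break;
            }
            let queue = level.get_mut();
            // Time priority: within a level, the oldest bid is served first.
            while qty > 0 {
                let Some(front) = queue.front_mut() else { break };
                let take = qty.min(front.qty);
                fills.push((front.order_id, price, take));
                front.qty -= take;
                qty -= take;
                if front.qty == 0 {
                    queue.pop_front();
                }
            }
            // Drop the level once nothing rests at this price any more.
            if queue.is_empty() {
                level.remove();
            }
        }
        fills
    }
}
```

Every call mutates the same `levels` map, and each fill depends on the fills before it. That dependency is why a second matching thread on the same book would have nothing useful to do.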

So the instinct a lot of us reach for first, “break it into microservices so it scales,” is solving a problem the matching engine doesn’t have. One CPU core, processing commands in a tight sequential loop against in-memory state, can do tens of thousands of matches per second. That’s already more than almost any prediction market will ever need. The moment you internalize “matching is sequential anyway,” the whole architecture gets simpler, not poorer.

So: one process, in memory, fed through a queue

If matching is sequential, give it the cleanest possible home: a single process that holds all the state in memory — order books, balances, positions — and pulls commands off a queue one at a time. No locks. No two threads arguing over the same balance. No network hop in the middle of a fill.
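
The shape described above can be sketched with nothing but the standard library: many producers, one channel, one thread that owns the state. This is an illustration under simplifying assumptions (a toy `Command` with only deposits, and a `Shutdown` variant added so the demo can end), not the engine's real command set.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Toy command set, invented for this sketch.
pub enum Command {
    Deposit { user: u32, amount: u64 },
    Shutdown,
}

/// All engine state lives here and is owned by exactly one thread.
#[derive(Default)]
pub struct State {
    pub balances: HashMap<u32, u64>,
}

/// The single writer: pulls commands one at a time, no locks needed.
pub fn run_engine(rx: Receiver<Command>) -> State {
    let mut state = State::default();
    for cmd in rx {
        match cmd {
            Command::Deposit { user, amount } => {
                *state.balances.entry(user).or_insert(0) += amount;
            }
            Command::Shutdown => break,
        }
    }
    state
}

/// Four producer threads feed the queue; only the engine thread touches State.
pub fn demo() -> u64 {
    let (tx, rx) = mpsc::channel();
    let engine = thread::spawn(move || run_engine(rx));
    let producers: Vec<_> = (0..4)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..100 {
                    tx.send(Command::Deposit { user: 1, amount: 1 }).unwrap();
                }
            })
        })
        .collect();
    for p in producers {
        p.join().unwrap();
    }
    tx.send(Command::Shutdown).unwrap();
    engine.join().unwrap().balances[&1]
}
```

`State` contains no `Mutex` and no atomics. The channel is the only synchronization point, and the balance comes out exact because a single thread applied every update.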

But in-memory state has an obvious problem: pull the plug and it’s gone. That’s where the write-ahead log comes in, and it’s the spine of the whole design:

   HTTP gateway  ──(stamp seq_id + timestamp)──►  command queue
                                                       │
                                                       ▼
                                              WAL append + fsync   ← durable HERE
                                                       │
                                                       ▼
                                         deterministic in-memory matching
                                                       │
                                                       ▼
                                               events ──┬──► reply to caller
                                                        ├──► async: Postgres mirror
                                                        └──► async: WebSocket push

One rule the engine lives by: it never reads the clock and never does I/O in the matching step. Timestamps and IDs are stamped before the command enters the queue. That’s what makes it deterministic — replay the same commands and you land in the exact same state, every time. No SystemTime::now() hiding in the matching path to make today’s replay differ from yesterday’s.
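
A small sketch of that rule, assuming a toy command set: the command carries its own `seq_id` and timestamp, and `apply` is a plain function of (state, command). The type and field names are hypothetical. `BTreeMap` is used on purpose, because iteration order over Rust's default `HashMap` can differ between runs, which is the kind of hidden input that would quietly break replay if events were ever produced by iterating the map.

```rust
use std::collections::BTreeMap;

/// Everything the engine needs is inside the command, stamped upstream.
#[derive(Debug, Clone, PartialEq)]
pub struct Command {
    pub seq_id: u64,       // assigned by the gateway, strictly increasing
    pub timestamp_ms: u64, // assigned by the gateway, never read from the clock here
    pub user: u32,
    pub kind: Kind,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Kind {
    Deposit(u64),
    Withdraw(u64),
}

#[derive(Debug, Default, PartialEq)]
pub struct State {
    pub last_seq: u64,
    pub last_ts_ms: u64,
    pub balances: BTreeMap<u32, u64>,
}

/// Pure state transition: no clock, no I/O, no randomness.
pub fn apply(state: &mut State, cmd: &Command) {
    state.last_seq = cmd.seq_id;
    state.last_ts_ms = cmd.timestamp_ms;
    let bal = state.balances.entry(cmd.user).or_insert(0);
    match cmd.kind {
        Kind::Deposit(n) => *bal += n,
        // A withdrawal that exceeds the balance is rejected and changes nothing.
        Kind::Withdraw(n) => {
            if *bal >= n {
                *bal -= n
            }
        }
    }
}

/// Replaying a log is just folding `apply` over it from an empty state.
pub fn replay(log: &[Command]) -> State {
    let mut state = State::default();
    for cmd in log {
        apply(&mut state, cmd);
    }
    state
}
```

Calling `replay` twice on the same slice yields equal `State` values, which is the property a test suite can assert directly.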

The write-ahead log is the source of truth — not the database

This is the part that felt wrong to me for years, until it didn’t: the database is not the truth. The log is.

Every command is appended to the WAL and flushed to disk with fsync before the engine applies it. (fsync is the call that forces the bytes to actually land on the disk, not just sit in a buffer, and waits for the disk to confirm.) Only after the flush and the apply does the caller get a success reply. So “your order was accepted” means “your order is already on durable storage,” not “it’s somewhere in flight.” If the process dies one microsecond later, the order survives.
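
In Rust's standard library the append-then-fsync step is small. The sketch below assumes a made-up text record format (`<seq_id>`, a tab, the payload, a newline) and a payload that contains no newline. The `Wal` type and the `accept` function are illustrative names.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

pub struct Wal {
    file: File,
}

impl Wal {
    /// Open (or create) the log in append mode.
    pub fn open(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Wal { file })
    }

    /// Append one record and return only after the OS reports it synced.
    pub fn append(&mut self, seq_id: u64, payload: &str) -> io::Result<()> {
        // Build the whole record first so it goes out in a single write call.
        let record = format!("{seq_id}\t{payload}\n");
        self.file.write_all(record.as_bytes())?;
        // sync_all asks the OS to flush file content and metadata to the device.
        self.file.sync_all()
    }
}

/// The ordering that matters: durable first, then apply, then acknowledge.
pub fn accept(wal: &mut Wal, seq_id: u64, payload: &str) -> io::Result<&'static str> {
    wal.append(seq_id, payload)?;
    // ... apply the command to in-memory state here ...
    Ok("accepted")
}
```

If `append` returns an error, `accept` returns early through `?` and the caller never sees "accepted", so an acknowledgement can only follow a successful sync. A production log would more likely use a binary format with a length prefix and checksum per record. The text format here just keeps the sketch readable.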

The database, then, is demoted to what it’s actually good at: an asynchronous mirror for queries and history. It can lag a few milliseconds. It can hiccup. It can fall over for a minute and come back — and matching never stops, because matching doesn’t read from it. The engine’s in-memory state, rebuilt from the WAL, is authoritative; Postgres is a convenient read replica of that truth.

And crash recovery is simple, which is exactly what you want it to be: on startup, open the WAL, replay every command through the same code path, and you’re back — bit for bit. Same log, same state. No reconciliation, no guessing whether the database is ahead of or behind the engine.
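
Recovery can be sketched the same way. This assumes a toy record of the form `<seq_id> <user> <signed delta>` per line and treats an unparsable line as a torn write from the crash, stopping there. Both choices are simplifications made for this example, and a real log would detect torn or corrupt records with checksums rather than parse failures.

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

pub type Balances = BTreeMap<u32, i64>;

/// Parse one hypothetical record: "<seq_id> <user> <signed delta>".
fn parse(line: &str) -> Option<(u64, u32, i64)> {
    let mut parts = line.split_whitespace();
    let seq = parts.next()?.parse().ok()?;
    let user = parts.next()?.parse().ok()?;
    let delta = parts.next()?.parse().ok()?;
    Some((seq, user, delta))
}

/// The same transition function the live engine would call.
pub fn apply(state: &mut Balances, user: u32, delta: i64) {
    *state.entry(user).or_insert(0) += delta;
}

/// Rebuild state from the log. Returns the state and the last seq_id applied.
pub fn recover(path: &Path) -> io::Result<(Balances, u64)> {
    let mut state = Balances::new();
    let mut last_seq = 0;
    for line in BufReader::new(File::open(path)?).lines() {
        let line = line?;
        // An incomplete final record means the crash hit mid-append: stop there.
        let Some((seq, user, delta)) = parse(&line) else { break };
        if seq <= last_seq {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "seq_id is not strictly increasing",
            ));
        }
        apply(&mut state, user, delta);
        last_seq = seq;
    }
    Ok((state, last_seq))
}
```

Recovery calls the same `apply` the live path would, so there is one state-transition function and no second implementation to drift out of sync with it.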

Why not microservices, why not Kafka

Two roads I seriously considered and turned down.

Microservices over a message bus (an HTTP tier, a separate engine, a DB worker, all talking through Redis streams). I’ve built this shape; it’s the architecture of my own earlier version. It looks scalable. But the engine is still single-writer, so the distribution buys no matching throughput — and it costs you correctness: now you’re reasoning about at-least-once delivery, idempotency, lost replies, network reordering, and a message broker whose durability is weaker than a local fsynced log. You take on every hard distributed-systems problem to solve a scaling problem you didn’t have.

A durable replicated log like Kafka or Redpanda as the source of truth. This is genuinely good engineering — it’s how a lot of serious fintech runs — and it gives you replication and replay for free. But it’s a cluster to operate. For an MVP, and for a project whose whole pitch is “one person plus an AI, clone it and run it,” standing up a Kafka cluster is far more than this needs. The day I need multi-node ingestion, I can put a log in front of the engine without changing the engine. Not before.

The throughline: don’t pay for high availability or horizontal scale with infrastructure before you actually need them. A local, fsynced write-ahead log takes very little effort and buys you the part that matters most today: production-grade durability, meaning no acknowledged order is lost to a process crash or power cut. It also leaves a clean seam to add the rest later, including the replication that protects against losing the disk itself.

The journey of one order

Make it concrete. Alice sends POST /orders — buy YES at 60.

The gateway authenticates her, then does the one thing it must do before anything else: stamps the command with a seq_id (a sequence number that only ever counts up) and a timestamp. (Those live in the command, not in the engine — that’s the determinism rule.) It drops the command on the queue and waits.

The engine pulls it off. First, it serializes the command, appends it to the WAL, and fsyncs. The instant that returns, the order is durable, and only now does anything else happen. The engine reserves Alice’s funds and runs the order through the book (resting it, or matching it and settling both sides), producing a list of events: OrderPlaced, maybe TradeExecuted, BalanceUpdated. It replies to Alice’s waiting request with the outcome and, completely off the hot path, broadcasts those same events to whoever’s listening: the Postgres writer, the WebSocket pusher.

Notice what the matching step never touched: the network, and the database. That’s the whole trick. The expensive, flaky things live on the async side of a clean line; the fast, must-be-correct thing lives alone on the other side.
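
One way to keep that line clean, sketched with standard-library channels: the engine hands the finished event list to unbounded channels and moves on. The event names come from the walkthrough above, while their fields and the `Fanout` type are assumptions made for illustration.

```rust
use std::sync::mpsc::Sender;

/// Event names follow the walkthrough; the fields are invented for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    OrderPlaced { order_id: u64 },
    TradeExecuted { maker: u64, taker: u64, price: u64, qty: u64 },
    BalanceUpdated { user: u32, available: u64, frozen: u64 },
}

/// Downstream consumers, e.g. a Postgres writer thread and a WebSocket pusher.
pub struct Fanout {
    pub subscribers: Vec<Sender<Vec<Event>>>,
}

impl Fanout {
    /// Reply to the waiting caller, then hand the same events to every subscriber.
    pub fn publish(&mut self, reply: &Sender<Vec<Event>>, events: Vec<Event>) {
        // The caller may have timed out and dropped its receiver; that is not
        // the engine's problem, so the send error is ignored.
        let _ = reply.send(events.clone());
        // `mpsc::channel` is unbounded, so `send` never blocks the engine.
        // A subscriber whose receiver is gone is dropped from the list.
        self.subscribers.retain(|s| s.send(events.clone()).is_ok());
    }
}
```

The trade-off is memory: an unbounded channel lets a stalled consumer pile up events instead of slowing the engine. That is safe for correctness, since the WAL is the source of truth and a mirror can be rebuilt from it, but it is something to monitor.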

The MVP is already production-grade — it just isn’t big yet

The thing I want to be honest about: the first version is single-node. But single-node is not the same as a toy. On day one it is deterministic, durable, and crash-recoverable — the properties that actually decide whether you can put real money through it. What it lacks is scale and failover, and those are things you add, not things you rebuild for.

Here’s the road, and every step is “add,” never “rewrite”:

M0 — MVP. One process: engine + WAL + recovery + an in-process HTTP gateway + an async Postgres mirror. cargo run. Correct and durable; single node.
M1. Split the gateway into its own stateless tier so you can run several behind a load balancer; add the Postgres read models and WebSocket push.
M2 — HA. Stream the WAL to a hot standby; on failure, the standby replays and takes over. Tune group commit (batch many commands into a single fsync); add snapshots so recovery doesn’t replay from the beginning of time.
M3 — scale out. When one process finally saturates, shard by market group — many engines, each single-writer over its own books. Put a durable ingress log in front only if multi-node ingestion demands it.
M4 — on-chain. The non-custodial settlement layer (funds stay in the user’s own control, never held by the platform), Polymarket-style: USDC collateral, conditional tokens, an oracle for resolution, tiered withdrawals. That’s a whole series of its own.

The WAL is what makes this a ladder instead of a series of rewrites. Replication, snapshots, failover — they all hang off that one seam.
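
As one example of a rung that hangs off that seam, here is a rough sketch of the group commit mentioned at M2: block for the first queued command, drain whatever else has already arrived, write the batch, and pay for one sync. Commands are plain strings here for brevity, and the function names are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::sync::mpsc::Receiver;

/// Wait for one command, then take whatever else is already queued (up to `max`).
/// Returns an empty batch only when every sender is gone and the queue is empty.
pub fn next_batch(rx: &Receiver<String>, max: usize) -> Vec<String> {
    let mut batch = Vec::new();
    if let Ok(first) = rx.recv() {
        batch.push(first);
        while batch.len() < max {
            match rx.try_recv() {
                Ok(cmd) => batch.push(cmd),
                Err(_) => break, // queue is empty right now: commit what we have
            }
        }
    }
    batch
}

/// Write every record in the batch, then sync once for all of them.
pub fn commit_batch(file: &mut File, batch: &[String]) -> io::Result<()> {
    let mut buf = String::new();
    for record in batch {
        buf.push_str(record);
        buf.push('\n');
    }
    file.write_all(buf.as_bytes())?;
    file.sync_all()
}
```

Under light load a batch is a single command and latency is unchanged. Under heavy load the sync cost is shared across the batch. The durability contract stays the same as long as no caller in the batch is acknowledged before `commit_batch` returns.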

What’s next

That’s the blueprint: one deterministic engine, a write-ahead log as the source of truth, a database demoted to an async mirror, and a ladder from a laptop to production where every rung is additive. It has a name now — Zhulong — and from the next chapter we stop talking and start building it, from an empty repository, one small piece at a time.

We’ll begin where every trading system should: the account model. A Balance type that splits available from frozen funds and can’t, by construction, make money appear or vanish. As promised — every line in the open repo, free, no strings.
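
To give a flavour of what “by construction” can mean, here is a guess at the general shape, not the code the next chapter will build. The fields are private, and the only operations that move money between `available` and `frozen` keep the total unchanged. All names are assumptions.

```rust
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Balance {
    available: u64, // private: callers cannot set these directly
    frozen: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BalanceError {
    Insufficient,
    Overflow,
}

impl Balance {
    pub fn available(&self) -> u64 {
        self.available
    }

    pub fn frozen(&self) -> u64 {
        self.frozen
    }

    /// Cannot overflow: `deposit` only succeeds when the new total fits in u64.
    pub fn total(&self) -> u64 {
        self.available + self.frozen
    }

    /// The only operation in this sketch that changes the total.
    pub fn deposit(&mut self, amount: u64) -> Result<(), BalanceError> {
        self.total().checked_add(amount).ok_or(BalanceError::Overflow)?;
        self.available += amount;
        Ok(())
    }

    /// Move funds from available to frozen; the total is unchanged.
    pub fn freeze(&mut self, amount: u64) -> Result<(), BalanceError> {
        self.available = self
            .available
            .checked_sub(amount)
            .ok_or(BalanceError::Insufficient)?;
        self.frozen += amount;
        Ok(())
    }

    /// Move funds from frozen back to available; the total is unchanged.
    pub fn unfreeze(&mut self, amount: u64) -> Result<(), BalanceError> {
        self.frozen = self
            .frozen
            .checked_sub(amount)
            .ok_or(BalanceError::Insufficient)?;
        self.available += amount;
        Ok(())
    }
}
```

A failed `freeze` or `unfreeze` returns before either field is written, so an error leaves the balance exactly as it was.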
