Building a Crypto Matching Engine: 0 to 18,210 RPS

My system specs, because benchmarks without hardware context are just creative fiction. 13th Gen Intel Core i5-13500HX (20 threads, 4.70 GHz boost), 16 GB RAM, Fedora 43 (kernel 6.19.8).

Final k6 benchmark result showing 18,210 RPS

The Assignment

2CentsCapital sent over a backend engineering assignment. The brief was to build a trade clearing and analytics engine. HTTP endpoint for orders, WebSocket feed for streaming, a matching engine for limit and market orders with price-time priority, persistence to PostgreSQL, and load test evidence. The performance target was 2,000 orders per second with sub-100ms median latency. Estimated effort was six to twelve hours.

The six to twelve hour estimate was optimistic. What started as "build a limit order book" turned into a tour through Kafka delivery semantics, JVM heap forensics, and Go goroutine lifecycle management. The assignment asked for 2,000 req/s. The system ended up doing 18,210. The bugs that stood between those two numbers are what this post is about.

The Tech Stack and Why It Looks Like This

The First Version Was Wrong

Most of these decisions were not made upfront. They were made incrementally, each one forced by a problem with the previous approach.

The first version was a single Go process. HTTP handler receives an order, calls the matching engine, writes the result to PostgreSQL, sends a WebSocket event, returns a response. Clean. Simple. Wrong.

The problem became obvious the moment I thought about the load test. A single synchronous request chain means the HTTP handler is blocked waiting for the matching engine, which is blocked waiting for the database write, which means every client is waiting for all of it. At 2,000 concurrent orders per second, that chain becomes a bottleneck at whichever step is slowest. You do not have a trading system. You have a request queue wearing a trading system's clothes.

Why Not Just Use Goroutines?

The obvious Go answer is to fire a goroutine and return immediately.

func (h *Handler) CreateOrder(c *gin.Context) {
    order := parseOrder(c)
    go matchEngine.Process(order)
    c.JSON(202, gin.H{"status": "accepted"})
}

This fixes the blocking problem. For a single-node demo with no persistence requirement, it works. The assignment actually allows it.

The problem is everything else. If the process crashes mid-match, the order is gone. No log, no replay, no recovery. At 2,000 req/s you are spawning goroutines faster than they drain if anything downstream is slow, and goroutines are cheap but unbounded. The matching goroutine still needs to write to PostgreSQL and notify WebSocket clients, so if you call those directly you have just moved the coupling, not removed it.

Goroutines solve the concurrency problem inside a single process. They do not solve durability, backpressure, or cross-service decoupling. You end up using both anyway: goroutines for parallelism inside the matching service, and Kafka for communication between services.

The Decoupling Problem

The fix is to break the chain. The HTTP handler should accept an order and return immediately. The matching engine should process it whenever it gets there. The database writer should persist trades whenever they land. The WebSocket broadcaster should emit events independently. None of them should know whether the others are alive.

Once you decide that, you need something to carry messages between them. That something is Kafka.

Kafka gives you three things that matter here. First, durability: messages are written to disk, so a service restart does not lose anything. Second, replayability: you can rebuild state by replaying the log from any offset. Third, consumer group semantics: multiple independent services can subscribe to the same topic without stepping on each other. The matching engine and the database service both read from the same ORDER_ACCEPTED topic, each getting their own copy, each tracking their own offset.

The Stack

Go is the language because goroutines are cheap enough to run one per trading instrument with no meaningful memory overhead, and channels give you a communication primitive that maps directly to how an order book works. Orders go in. Trades come out. That is a channel.

Apache Kafka carries every message between every service. The sarama library from IBM handles the Go client side. It is the only part of this stack that requires reading actual documentation before using it, which became clear the hard way.

PostgreSQL is the system of record. Every matched trade eventually lands here. Redis does not sit in front of it as a read cache. What it actually does is more specific: the IdempotencyKey middleware checks an Idempotency-Key header on every POST /orders request, looks it up in Redis, and returns a 409 Conflict if the key already exists. A successful order publish sets the key with a one-hour TTL. This prevents duplicate submissions under concurrent retries. The GET endpoints hit PostgreSQL directly. No read cache exists yet.

The four services that came out of this design:

API Service: accepts HTTP requests, validates orders, publishes to the ORDER_ACCEPTED Kafka topic, returns 202 immediately.
Matching Service: consumes ORDER_ACCEPTED (group: matching-engine-group), runs the in-memory order book logic, publishes matched trades to TRADE_EXECUTED.
Database Service: consumes both ORDER_ACCEPTED (to insert each order into PostgreSQL) and TRADE_EXECUTED (to atomically update order quantities and insert the matched trade). Both consumers run in the same process under consumer group persistence-trades.
Stream Service: consumes TRADE_EXECUTED (group: stream-trades), broadcasts real-time events to connected WebSocket clients at /ws/trades.

All four compile from the same Go codebase into separate binaries. Docker Compose orchestrates all of them along with Kafka, Zookeeper, PostgreSQL, and Redis.

Sequence diagram showing order lifecycle from HTTP to PostgreSQL

Phase 1: Building the Foundation and Discovering the Wrong Abstraction

How a Matching Engine Actually Works

A limit order book is two sorted lists: bids sorted highest-to-lowest, asks sorted lowest-to-highest. When a new order arrives, if the best bid price is greater than or equal to the best ask price, a trade executes. Otherwise the order waits.

The videos that helped me understand this from first principles:

The natural data structure for "give me the highest bid quickly" is a max-heap. The natural structure for "give me the lowest ask quickly" is a min-heap. Go's container/heap package gives you both with a single interface. That is what OrderBook uses.

The order book lives entirely in memory. Matched trades are safe because Kafka offsets are not committed until the database write succeeds, so they replay on restart. Unmatched limit orders sitting in the heaps are gone.

The Mutex Mistake

The first version of the matching engine was something any Go developer would write naturally, which is how you know it was wrong. I had an OrderBook struct holding two heaps, protected by a sync.Mutex. Every operation locked it. It seemed correct.

type OrderBook struct {
    mu         sync.Mutex
    Instrument string
    BuyOrders  *OrderHeap
    SellOrders *OrderHeap
    TradeOut   chan *trades.Trade
}

func (ob *OrderBook) AddOrder(order *orders.MatchOrder) {
    ob.mu.Lock()
    defer ob.mu.Unlock()
    if order.Side == "BUY" {
        heap.Push(ob.BuyOrders, order)
    } else {
        heap.Push(ob.SellOrders, order)
    }
}

The problem, which became obvious once I stopped congratulating myself on how clean this looked, was that TryMatch did not lock the mutex. I was protecting writes but not reads. Two goroutines could be racing on the same heap and producing garbage fills. Even setting that aside, the mutex sends the wrong message to every future reader. It implies this struct is accessed concurrently and needs careful handling. The struct should never be accessed concurrently in the first place.

Before/After diagram: Mutex locking vs channel-based goroutine isolation per instrument

The Right Model: One Goroutine Per Instrument

The insight is that BTC-USD and ETH-USD have zero relationship to each other. Their order books are completely independent. If each instrument gets its own goroutine, that goroutine owns its order book exclusively. No goroutine ever touches another instrument's book. There is nothing to lock.

To understand goroutines and channels deeply, this video is worth your time:

Golang Concurrency - All the Basics you have to know!

The implementation uses a map of channels. When an order arrives for BTC-USD, a dispatcher looks up the BTC-USD channel, creates it if it does not exist, and sends the order down it. A dedicated goroutine on the other end reads from that channel sequentially and runs the matching logic. No two goroutines ever share an order book.

func (engine *Engine) getOrderChannel(instrument string) chan *orders.MatchOrder {
    engine.mutex.RLock()
    ch, exists := engine.OrderChannels[instrument]
    engine.mutex.RUnlock()

    if exists {
        return ch
    }

    engine.mutex.Lock()
    defer engine.mutex.Unlock()

    if ch, exists := engine.OrderChannels[instrument]; exists {
        return ch
    }

    newCh := make(chan *orders.MatchOrder, 100)
    engine.OrderChannels[instrument] = newCh

    go engine.runInstrumentProcessor(instrument, newCh)
    return newCh
}

func (engine *Engine) runInstrumentProcessor(instrument string, orderChan <-chan *orders.MatchOrder) {
    book := orderbook.NewOrderBook(instrument)
    book.TradeOut = engine.TradeChannels

    for order := range orderChan {
        book.TryMatch(order)
        if order.Quantity > 0 && order.Type == "LIMIT" {
            book.AddOrder(order)
        }
    }
}

matching/engine/engine.go

The engine.mutex protecting OrderChannels is the one legitimate lock: it serializes channel creation using a double-check pattern while keeping the hot path read-only. The OrderBook itself has no mutex, by design.

Phase 2: The Matching Logic Had Two Silent Bugs

The Market Order Corruption

The assignment requires market orders. A market order has no price. It matches against whatever is available on the book until it fills or the book runs out of liquidity. The bug was simple and quietly catastrophic. If a market order arrived and could not fully fill, the leftover quantity was being pushed into the limit order book.

for order := range orderChan {
    book.TryMatch(order)
    if order.Quantity > 0 && order.Type == "LIMIT" {
        book.AddOrder(order)
    }
}

matching/engine/engine.go

Without order.Type == "LIMIT", a market order with remaining quantity and a price of 0.0 would sit in a price-sorted heap alongside limit orders at $70,000. The container/heap invariant breaks. Future matches execute at incorrect prices or against incorrect counterparties. In a real exchange, this is the kind of bug that lives quietly in production until someone's trade history looks like abstract expressionism.

The Price-Time Priority Bug

Exchanges make one fairness promise above everything else. If two orders arrive at the same price, the earlier one gets matched first. This is price-time priority. The original OrderHeap.Less only compared prices.

func (h *OrderHeap) Less(i, j int) bool {
    if h.IsBuy {
        return h.data[i].Price > h.data[j].Price
    }
    return h.data[i].Price < h.data[j].Price
}

matching/orderbook/heap.go

When two orders share the same price, Less returns false for both comparisons, making them equal. Go's container/heap is not stable. Equal elements can appear in any order depending on heap state after previous push and pop operations. The trader who submitted first might get matched second. Randomly. With no error, no log, and no way to detect it.

The fix was adding a Timestamp field to MatchOrder, populated with time.Now().UnixNano() at the moment the order hits the API service.

type MatchOrder struct {
    ID         uuid.UUID `json:"id"`
    ClientID   uuid.UUID `json:"client_id"`
    Instrument string    `json:"instrument"`
    Side       string    `json:"side"`
    Type       string    `json:"type"`
    Price      float64   `json:"price"`
    Quantity   float64   `json:"quantity"`
    Timestamp  int64     `json:"timestamp"`
}

internals/orders/models.go

Then update the comparator to use the timestamp as a tiebreaker.

func (h *OrderHeap) Less(i, j int) bool {
    if h.IsBuy {
        if h.data[i].Price != h.data[j].Price {
            return h.data[i].Price > h.data[j].Price
        }
        return h.data[i].Timestamp < h.data[j].Timestamp
    }
    if h.data[i].Price != h.data[j].Price {
        return h.data[i].Price < h.data[j].Price
    }
    return h.data[i].Timestamp < h.data[j].Timestamp
}

matching/orderbook/heap.go

The timestamp is serialized into the Kafka message so arrival order survives the network boundary. For buy orders, highest price wins; ties go to whoever arrived first. For sell orders, lowest price wins with the same tiebreaker. This works correctly on a single node; multi-instance deployments need a logical clock instead.

Phase 3: The Kafka Layer Had a Data Loss Hole and a Duplicate Risk

For Kafka fundamentals, this is the best single video I found:

Master Apache Kafka in One Shot: LLD + HLD Setup + Partitioning + Pub/Sub

The Silent Trade Vanisher

The original consumer code silently dropped trades whenever the database was unavailable, committed the offset, and moved on. No error surfaced. No alert fired. The trades ceased to exist.

Flowchart showing before vs after offset commit behavior

To understand why this matters, you need to know how Kafka consumer offsets work. Every partition in a Kafka topic is an ordered log of messages. Consumers track their position in that log using an offset. When you call MarkMessage, you are telling Kafka that you have successfully processed up to this point. On the next restart, Kafka will start delivering from after that offset. If you mark a message as done before you actually finish processing it, it is gone. Kafka will never redeliver it.

type ConsumerHandler struct {
    Process func(message *sarama.ConsumerMessage)
}

func (h ConsumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
    for msg := range claim.Messages() {
        if h.Process != nil {
            h.Process(msg)
        }
        session.MarkMessage(msg, "")
    }
    return nil
}

kafka/consumer.go

MarkMessage runs unconditionally. If Process fails halfway through a database write, the trade is committed as processed and disappears from the Kafka log. Matched in the engine. Published to Kafka. Consumed by the database service. Failed to write. Gone. The fix was changing Process to return an error and making MarkMessage conditional on success.

type ConsumerHandler struct {
    Process func(message *sarama.ConsumerMessage) error
}

func (h ConsumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
    for msg := range claim.Messages() {
        if h.Process != nil {
            if err := h.Process(msg); err != nil {
                log.WithError(err).Errorf(
                    "failed to process message topic=%s partition=%d offset=%d",
                    msg.Topic, msg.Partition, msg.Offset,
                )
                continue
            }
        }
        session.MarkMessage(msg, "")
    }
    return nil
}

kafka/consumer.go

This is at-least-once delivery: a trade might be processed more than once during recovery, but it will never be dropped. A unique constraint on trade_id in PostgreSQL makes duplicate processing a no-op.

func (s *Service) UpdateDatabase(ctx context.Context) {
    handler := kafka.ConsumerHandler{
        Process: func(msg *sarama.ConsumerMessage) error {
            var trade Trade
            if err := json.Unmarshal(msg.Value, &trade); err != nil {
                return err
            }
            if err := s.repo.ApplyTrade(ctx, &trade); err != nil {
                log.WithError(err).Error("failed to apply trade to database")
                return err
            }
            log.Info("Trade applied to database")
            return nil
        },
    }
    consumer := kafka.NewConsumer("TRADE_EXECUTED", "persistence-trades", handler)
    defer consumer.Close()
    consumer.Start()
}

internals/trades/service.go

Idempotent Producer

Without Idempotent = true, a network timeout after Kafka has received but not acknowledged a message causes the producer to retry. Kafka stores the duplicate. The matching engine sends the same order twice. Someone gets matched twice and has a very interesting conversation with their broker.

cfg.Producer.RequiredAcks = sarama.WaitForAll
cfg.Producer.Partitioner = sarama.NewHashPartitioner
cfg.Producer.Return.Successes = true
cfg.Producer.Return.Errors = true
cfg.Producer.Idempotent = true
cfg.Net.MaxOpenRequests = 1

kafka/producer.go

Idempotent = true assigns sequence numbers so Kafka discards duplicates silently. MaxOpenRequests = 1 is required: without it, Kafka cannot correctly order in-flight retries.

One thing the config does not make obvious: sarama.NewAsyncProducer is non-blocking. SendMessage enqueues the message to an internal channel and returns immediately. WaitForAll tells the Kafka broker to acknowledge only after all in-sync replicas have written the message, but that acknowledgment arrives asynchronously in a background goroutine that drains the Successes() channel. The producer never blocks the caller. This is why the graceful shutdown sequence calls producer.Close() last: it blocks until all in-flight messages are flushed before the process exits.

Graceful Shutdown

One more gap. SIGTERM hit the matching service while trades were still in the buffer. The process exited. The trades in the channel were not published. The trades did not exist anywhere. This is the kind of thing that only fails in production, at 2am, after a deployment.

func (te *TradeExecutor) Start() {
    te.wg.Add(1)
    go func() {
        defer te.wg.Done()
        for trade := range te.TradeCh {
            tradeJSON, err := json.Marshal(trade)
            if err != nil {
                log.WithError(err).Error("Failed to marshal trade for publishing")
                continue
            }
            te.Producer.SendMessage("TRADE_EXECUTED", trade.Instrument, tradeJSON)
        }
    }()
}

func (te *TradeExecutor) Stop() {
    te.wg.Wait()
    te.Producer.Close()
}

matching/engine/executor.go

The shutdown sequence goes like this. Stop the consumer first so no new orders arrive. Then drain the per-instrument channels so every already-received order gets processed. Then close the trade channel. Then call Stop() which blocks on wg.Wait() until the executor goroutine has published every trade to Kafka. Finally close the producer. In that exact order. Skipping any step produces data loss that looks like a race condition and smells like a deployment problem.

Phase 4: The Load Test and 18,000 RPS

With the correctness fixes in place, the first load test came in at 12,286 req/s with zero failures. Good, but not done.

Then I noticed the KAFKA_NUM_PARTITIONS setting, which I had not set, which meant it was using the default value of 1. One partition for all ten trading instruments. A single log. One consumer thread. All the per-instrument goroutine parallelism I had carefully built in the matching engine was being funneled through a single Kafka partition before it got anywhere near that parallelism.

Kafka 1 partition vs 10 partitions diagram showing bottleneck vs parallel lanes

To understand why partitions matter: a Kafka partition is the unit of parallelism. Only one consumer in a consumer group can read from a given partition at a time. With one partition, no matter how many consumers you have, only one is active. With ten partitions, you can have ten active consumers reading in parallel.

The producer was already using sarama.NewHashPartitioner with the instrument symbol as the key, so BTC-USD always routes to the same partition and ordering per instrument is preserved. The goroutine-per-instrument design and RoundRobin rebalance strategy were already ready. The only missing piece was telling Kafka to create more than one partition.

environment:
  KAFKA_NUM_PARTITIONS: 10

docker-compose.yml

Also gave Kafka a reasonable amount of memory to work with instead of the default 256MB, which is a perfectly reasonable heap size if you are not planning to use Kafka under any actual load.

environment:
  KAFKA_HEAP_OPTS: "-Xmx1G -Xms1G"
  KAFKA_LOG_RETENTION_HOURS: 1
  KAFKA_LOG_RETENTION_BYTES: 1073741824

docker-compose.yml

A note on the retention config. One hour and 1GB is aggressive. This is tuned for the benchmark environment to prevent the dedicated /data/ partition from filling up during repeated test runs. In a real deployment you would set retention based on how long you need to be able to replay messages for recovery. A reasonable production default is 7 days or the size of your disk, whichever comes first.

The final numbers across two runs:

Metric	Benchmark 1	Benchmark 2
Throughput	12,286 req/s	18,210 req/s
Failure Rate	0.00%	0.00%
p95 Latency	242ms	231ms
Max Latency	28s	486ms
Kafka	Stable	Stable

The assignment target was 2,000 req/s with sub-100ms median latency. The final median is 20ms at 18,210 req/s, which satisfies the requirement in the same way that a firehose satisfies a request for a glass of water.

Where This Goes Next

The application is not the bottleneck anymore. At 18k req/s, the ceiling is the HTTP layer itself. JSON parsing, struct allocation, TCP write overhead per request. Switching to gRPC or a binary protocol over persistent connections would push this considerably higher without touching the matching logic.

Horizontal scaling is already supported by the architecture. Adding a second matching engine instance triggers a Kafka partition rebalance, automatically distributing 5 partitions to each instance. The matching logic stays correct because each partition is always owned by exactly one consumer. No code changes needed. Docker Compose does not scale horizontally, but that is a Docker Compose problem, not an application problem.

The matching logic had correctness bugs. The Kafka consumer had a silent data loss path. The producer could create duplicates under network stress. The default broker configuration was bottlenecking a 20-thread machine to a single I/O thread. Each fix was a separate commit, a separate afternoon of reading documentation, a separate moment of staring at logs until something clicked.

Known Limitations

Timestamp ordering under multiple API instances. time.Now().UnixNano() is stamped at the API service, so in a multi-instance deployment, clock skew can corrupt price-time priority for orders arriving nanoseconds apart. The correct fix is a monotonic sequence number issued by the matching engine per instrument, not a wall clock.

Float64 price representation. All prices and quantities use float64. IEEE 754 floating-point arithmetic does not represent most decimal values exactly. The OrderHeap.Less comparator uses != to compare prices, which can produce surprising results for values that differ only in floating-point rounding error. Production matching engines use fixed-point integers internally, representing prices as integer ticks scaled to the minimum price increment of the instrument, which eliminates rounding errors entirely and makes comparison semantics exact.

In-memory order book state is not durable. A matching service restart empties both heaps. Unmatched limit orders exist in PostgreSQL as PENDING but nothing rebuilds the book from them. A production system snapshots the book periodically and replays from the snapshot plus subsequent Kafka offsets on startup.

Self-trade prevention is absent. A self-trade occurs when a participant's own buy order matches against their own sell order. Exchanges reject these because they are economically meaningless and can be used to manipulate reported volume. This implementation has no check on ClientID in the matching logic. A single client can match against themselves indefinitely.

Log retention values are benchmark-only. One hour and 1GB prevents the test disk from filling during repeated runs. In production these values define your replay window; a service down longer than one hour must rebuild from the database instead. Seven days is a reasonable starting point.

The code is on GitHub. If you find a better approach or a bug I missed, open an issue. The correctness bugs in a matching engine are the kind that only reveal themselves when you look carefully, and there is no particular reason to believe all of them have been found.

The gap between "it works on my machine" and "it works at scale" is where all the interesting engineering lives.

I Had No Idea What an Order Book Was. Now I've Built One That Handles 18k RPS.

The Assignment

The Tech Stack and Why It Looks Like This

The First Version Was Wrong

Why Not Just Use Goroutines?

The Decoupling Problem

The Stack

Phase 1: Building the Foundation and Discovering the Wrong Abstraction

How a Matching Engine Actually Works

The Mutex Mistake

The Right Model: One Goroutine Per Instrument

Phase 2: The Matching Logic Had Two Silent Bugs

The Market Order Corruption

The Price-Time Priority Bug

Phase 3: The Kafka Layer Had a Data Loss Hole and a Duplicate Risk

The Silent Trade Vanisher

Idempotent Producer

Graceful Shutdown

Phase 4: The Load Test and 18,000 RPS

Where This Goes Next

Known Limitations

Comments

More from this blog

How I Cracked Paasa's Secret Hiring Challenge

Beyond the Green Dot: Engineering Online Presence using GoLang

How I Got gRPC Working Through Cloudflare Tunnel (The Hard Way)

Firebase Draining Your Wallet? Here's How to Stop Paying for Every Single Read.

Command Palette

The Assignment

The Tech Stack and Why It Looks Like This

The First Version Was Wrong

Why Not Just Use Goroutines?

The Decoupling Problem

The Stack

Phase 1: Building the Foundation and Discovering the Wrong Abstraction

How a Matching Engine Actually Works

The Mutex Mistake

The Right Model: One Goroutine Per Instrument

Phase 2: The Matching Logic Had Two Silent Bugs

The Market Order Corruption

The Price-Time Priority Bug

Phase 3: The Kafka Layer Had a Data Loss Hole and a Duplicate Risk

The Silent Trade Vanisher

Idempotent Producer

Graceful Shutdown

Phase 4: The Load Test and 18,000 RPS

Where This Goes Next

Known Limitations

Comments

More from this blog