Beyond the Green Dot: Engineering Online Presence using GoLang
Most system design blogs explain presence like it's a whiteboard sketch. Draw a box, label it "Presence Service," add arrows, done. Ship it. Nobody tells you about the part where your Redis connection pool catches fire at 3 AM because you skipped the gateway server.

I'm building Bluppi, a music listening app, as a solo project. At some point I wanted to add a friends activity feed: the part of the home screen that shows which friends are online and what they're listening to.
The centerpiece of that feature is a green dot. A tiny circle next to a profile picture that means: this person is here, right now.
I hit offline detection bugs that kept users "online" 40 seconds after they'd closed the app. I watched my server silently stop accepting new connections at 300 users because I'd wired the whole system into one binary. I found an article on systemdesign.one, spent a weekend with ChatGPT breaking it down, rebuilt the whole thing, named a goroutine the Reaper, and eventually stress-tested it to 51,322 concurrent connections on my laptop.
This is the full story. Including the parts that hurt.
Where I Started: gRPC Streaming and the First Problem
The transport choice was gRPC streaming from the start. Clients open a long-lived stream to the server and subscribe to presence updates for a list of user IDs. When anyone in that list changes status, the server pushes an event down the stream. One persistent connection per client.
That part was correct.
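To make the contract concrete, here's a minimal sketch of the client side, assuming the generated pb types used later in this post. watchFriends is a hypothetical helper, and imports are elided like everywhere else here:

```go
// Hypothetical client loop: open one stream, subscribe to a list of user
// IDs, and log every update the server pushes down.
func watchFriends(ctx context.Context, client pb.PresenceGatewayClient, friendIDs []string) error {
	stream, err := client.SubscribePresence(ctx, &pb.SubscribeRequest{TargetUserIds: friendIDs})
	if err != nil {
		return err
	}
	for {
		update, err := stream.Recv() // blocks until the server pushes an event
		if err != nil {
			return err // stream closed or transport failure
		}
		log.Printf("%s is %s (last seen %s)",
			update.UserId, update.Status, update.LastSeen.AsTime())
	}
}
```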
The first design question: how does the server know a user is online? Heartbeats. When a client connects, the gateway starts a goroutine that calls RecordHeartbeat on the presence service every 15 seconds with the user's ID and a timestamp. The presence service writes that timestamp to Redis. As long as heartbeats arrive, the user is online.
Online detection worked immediately. Offline detection is where things started breaking.
Problem 1: Nobody Was Going Offline
Attempt 1: Redis Keyspace Notifications for Offline Detection
I googled "redis presence detection" and found keyspace notifications. Set a TTL key per user. When it expires, Redis fires an event. Subscribe to those events. Offline detection, free, no extra goroutine, elegant.
I genuinely felt smart for about four hours.
Then I read that keyspace notifications fire for every key event in your entire Redis instance. Not just yours. Every SET, every DEL, every expiry across all your keys. And Redis doesn't expire keys in real time -- it uses lazy expiration or a periodic sweep. Users were showing offline 30 to 40 seconds after reconnecting. The expired key sat un-swept. The new heartbeat key sat ignored. Two conflicting states, zero resolution, one confused frontend.
Why it failed:
- Key expiration in Redis is not deterministic [1]
- Keyspace notifications fire on every key event cluster-wide -- CPU cost scales with total Redis write volume, not your user count
- You're trusting Redis's garbage collector to be your presence engine. Bold.
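For the record, this is the shape of the approach that got cut. A minimal sketch with go-redis, assuming expiry notifications are enabled with the Ex flag and the keys live in DB 0:

```go
// The abandoned pattern: one TTL key per user, plus a subscription to
// expiry notifications. Requires: CONFIG SET notify-keyspace-events Ex
// The channel name assumes DB 0. Don't ship this; see above for why.
func watchExpirations(ctx context.Context, rdb *redis.Client) {
	sub := rdb.PSubscribe(ctx, "__keyevent@0__:expired")
	defer sub.Close()
	for msg := range sub.Channel() {
		// msg.Payload is the expired key, e.g. "presence:user:123".
		// By the time this fires, the user may already be back online.
		log.Printf("expired: %s", msg.Payload)
	}
}
```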
Problem 2: One Binary, All the Problems
Attempt 2: One Binary to Rule Them All
Keyspace notifications were out. I moved to a heartbeat-based model -- the client sends a signal every 15 seconds, the server stores it in Redis with a timestamp, and I'd figure out offline detection separately. That was the right call.
What came next was not.
I had gRPC streaming. I had heartbeats. I put everything into one binary.
Every stream connection held its own Redis Pub/Sub subscription. Every heartbeat triggered a direct database call from inside the stream handler. At around 300 concurrent streams, the default Redis client pool (10 connections) was fully saturated. New connections blocked waiting for a free slot. PostgreSQL followed. The server didn't crash dramatically. It just... stopped accepting new connections. Quietly. Like it gave up.
Why it failed:
- One Redis Pub/Sub subscription per client connection
- One database call per heartbeat, per stream goroutine
- Default connection pools hit the ceiling before you hit 400 users
The problem was architectural. Stateful connection management and stateless business logic do not belong in the same binary. Their connection pools compete. The long-lived streams always win.
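Reconstructed from memory, the anti-pattern looked roughly like this. Not the original code, just its shape:

```go
// Attempt 2's handler, sketched: every client stream owned its own Redis
// Pub/Sub subscription and did its own DB writes. N clients = N
// subscriptions = N pool slots. The pools saturate long before the CPU does.
func (s *Server) SubscribePresence(req *pb.SubscribeRequest, stream pb.PresenceGateway_SubscribePresenceServer) error {
	// One dedicated subscription per connected client. This is the mistake.
	sub := s.redisClient.Subscribe(stream.Context(), "presence_events")
	defer sub.Close()
	for msg := range sub.Channel() {
		_ = msg // unmarshal, filter, stream.Send(...) -- all in this goroutine,
		// alongside heartbeat DB calls holding Postgres connections.
	}
	return nil
}
```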
The Architecture That Survived
At some point during Attempt 2, I stopped writing code and started reading.
I found the Real-Time Presence Platform System Design article on systemdesign.one [4]. Then I spent a weekend with ChatGPT going through it section by section: the delayed-trigger pattern for offline detection, the gateway split, the single shared Pub/Sub subscription. Every piece had a reason. I just hadn't understood the reasons yet.
The pattern: separate the stateful connection layer (the gateway, which holds open streams) from the stateless logic layer (the presence service, which processes heartbeats and publishes events). One Redis Pub/Sub subscription in the gateway fans out to every connected client. The presence service never touches a client connection directly.
I rebuilt the whole thing. Here's what actually works.
The flow:
1. Client opens a gRPC stream to the Gateway with a list of user IDs they want to watch
2. Gateway starts a heartbeat loop, calling the API server's internal RecordHeartbeat RPC every 15 seconds
3. Presence Service writes the heartbeat to a Redis Sorted Set (score = timestamp, member = userID)
4. If it's a new connection, the Presence Service publishes an online event to Redis Pub/Sub
5. The Reaper goroutine sweeps for expired users every 5 seconds and publishes offline events
6. The Gateway's EventsListener picks up events from Redis and fans them out to the right connections
Why One Redis Command Replaced Three Data Structures
You could use a Redis Set for who's online. A Hash for last-seen timestamps. A separate TTL key per user for expiration. Three data structures, three commands, three places to get out of sync.
A Sorted Set does all of it in one structure.
```
ZADD presence:active_users <unix_timestamp> <user_id>
```
ZADD is an upsert: one command handles both insert and update. No SET+EXPIRE dance, no branching.
To find and clean expired users (offline for more than 30 seconds):
```
ZRANGEBYSCORE presence:active_users -inf <now - 30>
ZREM presence:active_users <expired_user_ids...>
```
Each operation is O(log N), plus the number of members touched by the range and removal. At 100k users, log2(100,000) is about 17 comparisons. Redis barely notices.
The alternative was using key-per-user with TTL expiry and keyspace notifications. I tried that. It's Attempt 1. It's in the graveyard.
How the Heartbeat Works
Every 15 seconds, the gateway lies to the presence service. It tells it the user is still online. The presence service believes it, updates the timestamp, and if it's a new connection, shouts "online" to Redis Pub/Sub. If the lies stop coming, the Reaper sweeps up.
```go
// internals/presence/repository.go
const (
	presenceZSetKey = "presence:active_users"
	ttlSeconds      = 30
)

func (r *Repository) RecordHeartbeat(ctx context.Context, userID string) (bool, error) {
	now := float64(time.Now().Unix())
	cutoff := float64(time.Now().Unix() - ttlSeconds)

	score, err := r.redisClient.ZScore(ctx, presenceZSetKey, userID).Result()

	var isNewConnection bool
	switch {
	case err == redis.Nil:
		// User not in the set at all. Fresh connection.
		isNewConnection = true
	case err != nil:
		return false, err
	case score < cutoff:
		// User exists but their last heartbeat expired. They're back.
		isNewConnection = true
	}

	// Update (or add) the user's timestamp.
	err = r.redisClient.ZAdd(ctx, presenceZSetKey, redis.Z{
		Score:  now,
		Member: userID,
	}).Err()
	return isNewConnection, err
}
```
Source: internals/presence/repository.go
The isNewConnection flag matters. If a user is already online and sends another heartbeat, we don't publish a redundant online event. Only genuine state transitions get broadcast:
```go
// internals/presence/service.go
func (s *Service) RecordHeartbeat(ctx context.Context, req *pb.HeartBeatRequest) (*emptypb.Empty, error) {
	isNewConnection, err := s.repo.RecordHeartbeat(ctx, req.UserId)
	if err != nil {
		log.Printf("Failed to record heartbeat for user %s: %v", req.UserId, err)
		return &emptypb.Empty{}, err
	}
	if isNewConnection {
		event := gateway.PresenceEvent{
			UserID:   req.UserId,
			Status:   "online",
			LastSeen: time.Now().Unix(),
		}
		payload, err := json.Marshal(event)
		if err == nil {
			s.redisClient.Publish(ctx, "system:presence_events", payload)
		}
	}
	return &emptypb.Empty{}, nil
}
```
Source: internals/presence/service.go
The gateway side. keepAliveLoop runs as a goroutine for the lifetime of the stream:
```go
// internals/gateway/handler.go
func (s *Server) keepAliveLoop(ctx context.Context, userID string) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	s.sendHeartbeat(userID) // fire immediately on connect

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.sendHeartbeat(userID)
		}
	}
}
```
Source: internals/gateway/handler.go
Heartbeat interval is 15 seconds. TTL is 30 seconds. That gives two missed heartbeats before the user is considered offline. One missed heartbeat from a flaky mobile connection doesn't trigger a false offline event.
How Offline Detection Works: The Reaper
After heartbeats were working, I had a new problem. Users who closed the app stayed "online" on everyone else's screen. 30 seconds. 40 seconds. Sometimes longer.
There was no disconnect event. No callback. When the gRPC stream closed, the gateway's heartbeat goroutine exited. That was it. Nobody told the presence service. Nobody published an offline event. The user's timestamp just sat in the sorted set, aging, while their green dot stayed green.
I needed something to notice the silence.
I wrote a goroutine that runs every 5 seconds, queries Redis for users whose last heartbeat is older than 30 seconds, removes them from the sorted set, and publishes an offline event for each one. I called it the Reaper.
```go
// internals/presence/reaper.go
func (r *Reaper) Start(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			r.sweep(ctx)
		}
	}
}

func (r *Reaper) sweep(ctx context.Context) {
	sweepCtx, cancel := context.WithTimeout(ctx, 4*time.Second)
	defer cancel()

	expiredUsers, err := r.repo.GetAndRemoveExpiredUsers(sweepCtx)
	if err != nil {
		log.Printf("Reaper failed to fetch expired users: %v", err)
		return
	}
	if len(expiredUsers) == 0 {
		return
	}

	now := time.Now().Unix()
	for _, userID := range expiredUsers {
		event := gateway.PresenceEvent{
			UserID:   userID,
			Status:   "offline",
			LastSeen: now,
		}
		payload, err := json.Marshal(event)
		if err == nil {
			r.redisClient.Publish(sweepCtx, "system:presence_events", payload)
		}
	}
}
```
Source: internals/presence/reaper.go
The sorted set query and removal:
```go
// internals/presence/repository.go
func (r *Repository) GetAndRemoveExpiredUsers(ctx context.Context) ([]string, error) {
	cutoff := fmt.Sprintf("%f", float64(time.Now().Unix()-ttlSeconds))

	// Fetch all users whose last heartbeat score is older than 30s.
	expiredUsers, err := r.redisClient.ZRangeArgs(ctx, redis.ZRangeArgs{
		Key:     presenceZSetKey,
		ByScore: true,
		Start:   "-inf",
		Stop:    cutoff,
	}).Result()
	if err != nil {
		return nil, err
	}

	if len(expiredUsers) > 0 {
		// Convert to []interface{} for ZRem's variadic members.
		members := make([]interface{}, len(expiredUsers))
		for i, u := range expiredUsers {
			members[i] = u
		}
		// Propagate errors so the Reaper's error path actually fires.
		if err := r.redisClient.ZRem(ctx, presenceZSetKey, members...).Err(); err != nil {
			return nil, err
		}
	}
	return expiredUsers, nil
	// Full implementation: https://github.com/dis70rt/bluppi-backend/blob/main/internals/presence/repository.go
}
```
Source: internals/presence/repository.go
The 4-second context timeout on sweep is intentional. If Redis is unresponsive, the Reaper doesn't hang forever and block subsequent sweeps. It logs the failure and tries again in 5 seconds. LinkedIn's presence platform uses a similar delayed-trigger approach for offline detection [3].
The Gateway: Multi-Device Connections and Targeted Fanout
The ConnectionManager uses a nested map: userId -> connectionId -> Connection. Each device gets a UUID. Disconnecting one device only cleans up that specific connection.
```go
// internals/gateway/manager.go
type Connection struct {
	ID                 string
	UserID             string
	Chan               chan PresenceEvent
	SubscribedToTarget []string // which users this connection is watching
}

type ConnectionManager struct {
	mu sync.RWMutex
	// userId -> connectionId -> Connection
	userConnections map[string]map[string]*Connection
	// targetUserId -> set of connectionIds watching them
	targetWatchers map[string]map[string]struct{}
	// connectionId -> Connection reference (for quick lookup during push)
	connById map[string]*Connection
}
```
Source: internals/gateway/manager.go
Three maps. Looks like overkill. It's not.
- userConnections: "which connections does this user have?" Needed for cleanup.
- targetWatchers: "who's watching this user?" Needed for fanout. This is the inverted index.
- connById: "give me the connection object by ID." Needed for pushing events.
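Registration is where the three maps get wired together. The real version is in internals/gateway/manager.go; this sketch shows the bookkeeping, with uuid.NewString (github.com/google/uuid) as an assumption. The channel capacity of 20 matches the mailbox size mentioned in the actor section below:

```go
// Sketch: register a device and update all three indexes under one lock.
func (cm *ConnectionManager) AddConnection(userID string, targets []string) (*Connection, string) {
	conn := &Connection{
		ID:                 uuid.NewString(), // one UUID per device
		UserID:             userID,
		Chan:               make(chan PresenceEvent, 20), // the actor's mailbox
		SubscribedToTarget: targets,
	}
	cm.mu.Lock()
	defer cm.mu.Unlock()
	if cm.userConnections[userID] == nil {
		cm.userConnections[userID] = make(map[string]*Connection)
	}
	cm.userConnections[userID][conn.ID] = conn
	cm.connById[conn.ID] = conn
	for _, t := range targets {
		if cm.targetWatchers[t] == nil {
			cm.targetWatchers[t] = make(map[string]struct{})
		}
		cm.targetWatchers[t][conn.ID] = struct{}{} // the inverted index
	}
	return conn, conn.ID
}
```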
When a presence event arrives for User B, the PushEvent method looks up User B in targetWatchers, gets the set of connection IDs watching them, resolves each to a Connection via connById, and pushes the event to its channel:
```go
// internals/gateway/manager.go
func (cm *ConnectionManager) PushEvent(targetUserID string, event PresenceEvent) {
	cm.mu.RLock()
	defer cm.mu.RUnlock()

	watchers, exists := cm.targetWatchers[targetUserID]
	if !exists {
		return
	}
	for connID := range watchers {
		if conn, ok := cm.connById[connID]; ok {
			select {
			case conn.Chan <- event:
			default: // drop event if channel is full to avoid blocking
			}
		}
	}
}
```
Source: internals/gateway/manager.go
That select { default: } is load-bearing. Without it, a slow consumer blocks the entire fanout. With it, a full channel just drops the event. The user misses one status update and gets the next one 15 seconds later. Better than a deadlock that freezes the entire gateway.
The Redis Pub/Sub listener ties it together. One goroutine, one subscription, fans out to all connections:
```go
// internals/gateway/listener.go
func (el *EventsListener) Start(ctx context.Context) {
	pubsub := el.redisClient.Subscribe(ctx, "system:presence_events")
	defer pubsub.Close()

	ch := pubsub.Channel()
	for {
		select {
		case <-ctx.Done():
			return
		case msg, ok := <-ch:
			if !ok {
				return // subscription closed; avoid a nil-message deref
			}
			var event PresenceEvent
			if err := json.Unmarshal([]byte(msg.Payload), &event); err != nil {
				log.Printf("Failed to unmarshal presence event: %v", err)
				continue
			}
			el.connManager.PushEvent(event.UserID, event)
		}
	}
}
```
Source: internals/gateway/listener.go
Goroutines as Actors: How Go Solved the Mutex Problem
Here's the part nobody explains when they show you a sync.RWMutex slapped on a map and call it a day.
The ConnectionManager has a mutex. It protects the three index maps: registration, deregistration, lookups. That's the right scope for a lock.
But event delivery barely touches that lock. PushEvent takes only a read lock to look up the watcher index, then does a non-blocking channel send. It never waits. It never coordinates. Each connection processes its own events in complete isolation.
This is the Actor Model, implemented with zero framework, zero library, zero ceremony. In Go, a goroutine plus a channel is an actor. The channel is the mailbox. The goroutine is the event loop. Messages never share memory across actors.
Look at SubscribePresence:
```go
// internals/gateway/handler.go
func (s *Server) SubscribePresence(req *pb.SubscribeRequest, stream pb.PresenceGateway_SubscribePresenceServer) error {
	ctx := stream.Context()
	userID, err := middlewares.GetUserID(ctx)
	if err != nil {
		return err
	}

	conn, connID := s.connManager.AddConnection(userID, req.TargetUserIds)
	defer s.connManager.RemoveConnection(userID, connID)

	go s.keepAliveLoop(ctx, userID)

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case event, ok := <-conn.Chan: // the actor receiving its mailbox message
			if !ok {
				// Channel closed by RemoveConnection: clean shutdown signal,
				// no extra sync primitive needed.
				return nil
			}
			grpcMessage := &pb.PresenceUpdate{
				UserId:   event.UserID,
				Status:   event.Status,
				LastSeen: timestamppb.New(time.Unix(event.LastSeen, 0)),
			}
			if err := stream.Send(grpcMessage); err != nil {
				return err
			}
		}
	}
}
```
Source: internals/gateway/handler.go
Each call to SubscribePresence is its own actor instance. It has one goroutine, one channel (conn.Chan, capacity 20), one job: drain the mailbox and stream events to the client. It shares no state with any other connection's event loop.
The alternative is a shared event queue with a mutex on every dequeue. At 50k concurrent connections, every event delivery serializes through that one lock, and the scheduler spends its time parking and waking goroutines contending for lock ownership instead of running your code.
With the actor approach: PushEvent does one non-blocking channel send and moves on. No coordination. No waiting. The worst case is a dropped event when a consumer's buffer is full -- which is a better outcome than a deadlocked gateway serving nobody.
The only mutex in the entire system is on the maps. Not on message delivery. That's the point.
The Two-Binary Split: API Server vs. Gateway Server
The API server (cmd/api/main.go, port 50051) handles business logic: users, music, rooms, parties, and the internal RecordHeartbeat RPC. Stateless. Horizontally scalable. Talks to PostgreSQL, Memgraph, Redis, Solr.
The Gateway server (cmd/gateway/main.go, port 50050) handles long-lived gRPC streams. Stateful by nature (it holds open connections). Talks to Redis Pub/Sub and the API server over internal gRPC. (Note: you can use WebSockets instead of gRPC streams here; the core principles of connection management and fanout remain exactly the same.)
The gateway authenticates clients via Firebase Auth through a gRPC stream interceptor, then calls the API server with the user ID attached as gRPC metadata. The API server never sees raw client connections for presence. It processes heartbeats and that's it.
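To make that auth hop concrete, here's a sketch of a Firebase-verifying stream interceptor. VerifyIDToken is the real firebase-admin-go (firebase.google.com/go/v4/auth) API; the metadata key, userIDKey, and the wrapper type are assumptions, not code from the repo:

```go
// wrappedStream overrides Context() so downstream handlers see the user ID.
type wrappedStream struct {
	grpc.ServerStream
	ctx context.Context
}

func (w *wrappedStream) Context() context.Context { return w.ctx }

func (s *Server) authStreamInterceptor(srv interface{}, ss grpc.ServerStream,
	info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
	md, ok := metadata.FromIncomingContext(ss.Context())
	if !ok || len(md.Get("authorization")) == 0 {
		return status.Error(codes.Unauthenticated, "missing auth token")
	}
	token, err := s.firebaseAuth.VerifyIDToken(ss.Context(), md.Get("authorization")[0])
	if err != nil {
		return status.Error(codes.Unauthenticated, "invalid auth token")
	}
	ctx := context.WithValue(ss.Context(), userIDKey, token.UID)
	return handler(srv, &wrappedStream{ServerStream: ss, ctx: ctx})
}
```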
Load Test: 51,322 Users, Zero Errors
I wrote a custom load test tool in Go (cmd/loadtest/presence/main.go) that spins up concurrent gRPC streams. Each simulated user subscribes to their own events, connects, receives their online event, and holds the stream open.
Important caveat: this is a connection-overhead test, not a fanout test. Each user watches only themselves. In production, one popular user might have thousands of watchers -- that fanout path has different latency characteristics. The numbers below measure how many concurrent streams the gateway holds open without errors, not worst-case fanout latency.
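The per-user loop in the generator is small. A compressed sketch; the real tool lives in cmd/loadtest/presence/main.go and the names here are illustrative:

```go
// Each simulated user: connect, subscribe to yourself, time the first pushed
// event (your own "online"), then hold the stream open and drain.
func simulateUser(ctx context.Context, client pb.PresenceGatewayClient, userID string, latencies chan<- time.Duration) error {
	start := time.Now()
	stream, err := client.SubscribePresence(ctx, &pb.SubscribeRequest{TargetUserIds: []string{userID}})
	if err != nil {
		return err
	}
	if _, err := stream.Recv(); err != nil { // first event back: our own online
		return err
	}
	latencies <- time.Since(start) // connect latency = stream open + first push
	for {
		if _, err := stream.Recv(); err != nil {
			return err // test ended or stream dropped
		}
	}
}
```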
Results from a single laptop (i5-13500HX, 15 GiB RAM):
| Users | Duration | Errors | Connect p50 | Connect p90 | Connect p95 | Connect p99 |
|---|---|---|---|---|---|---|
| 50 | 37s | 0 | 1.365 ms | 1.595 ms | 1.769 ms | 3.036 ms |
| 5,000 | 332s | 0 | 0.756 ms | 0.933 ms | 0.992 ms | 1.231 ms |
| 20,000 | 632s | 0 | 0.713 ms | 2.490 ms | 2.636 ms | 2.783 ms |
| 51,322* | 308s | 0 | 2.695 ms | 8.426 ms | 11.516 ms | 15.278 ms |
*Target was 100k. Stopped at 51,322 because the load generator ran out of RAM. The backend was fine. My laptop was not.
Zero errors across all runs. p99 latency at 51k users: 15.28 ms. The backend didn't break. The load generator did.
For context, LinkedIn targets sub-second presence delivery for their 700M+ user base [3]. Our p99 at 51k is 15ms. Different scale, sure, but the architecture patterns are the same.
Known Limitations
I'm still working on some of these. Listing them anyway because honesty is more useful than a blog post that pretends everything is solved.
Single-node gateway. If the gateway goes down, all streams drop. No cross-node fanout yet. Fixing this requires a dispatcher layer similar to LinkedIn's architecture [3] or consistent hashing across gateway instances.
Reaper sweep interval. The 5-second tick means offline detection can lag up to 35 seconds (30s TTL + 5s sweep). For a music app's friends activity, this is fine. For an ICU patient monitoring system, this would be a resume-generating event.
No cross-data center replication. Single region. Adding geo-distribution would need CRDT-based replication on the Redis layer, similar to what the systemdesign.one article describes [4].
ulimit capped the load test. The default 1024 open-file limit on the load generator machine was the actual bottleneck. Raising it and using distributed load generators would push past 100k.
Closing
This took months. For a green dot.
The full source is on GitHub: dis70rt/bluppi-backend. The presence system lives in internals/presence/ and internals/gateway/. Go read it, break it, tell me what's wrong with it.
Two things I wish I knew at the start:
If you're reaching for Redis keyspace notifications for offline detection -- don't. Write a Reaper. It's 66 lines of Go and it does exactly what you need with zero surprises.
If you're about to let every client hold a Redis Pub/Sub subscription from inside your API server -- also don't. Split the gateway out. The connection pool you save might be your own.
And if you're a student building something that "sounds simple" and it starts taking way longer than expected: that's normal. That's just what real systems feel like when you stop using the whiteboard version.
References
[1] Redis documentation, "Redis keyspace notifications." redis.io/docs
[2] gRPC documentation, "Keepalive." grpc.io/docs
[3] Gupta & Lay, "Now You See Me, Now You Don't: LinkedIn's Real-Time Presence Platform" (2018). engineering.linkedin.com
[4] Kim, "Real-Time Presence Platform System Design" (2023). systemdesign.one
[5] Barber, "Building Real-Time Infrastructure at Facebook" (2017). SREcon / USENIX. youtube.com
[6] PubNub, "Is Anyone Home? An Intro to Presence Webhooks" (2020). pubnub.com
[7] Redis documentation, "Sorted Sets." redis.io/docs
