<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[EtherNode]]></title><description><![CDATA[I write software related blogs]]></description><link>https://blog.saikat.in</link><generator>RSS for Node</generator><lastBuildDate>Sun, 17 May 2026 20:52:13 GMT</lastBuildDate><atom:link href="https://blog.saikat.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Beyond the Green Dot: Engineering Online Presence using GoLang]]></title><description><![CDATA[I'm building Bluppi, a music listening app, as a solo project. At some point I wanted to add a friends activity feed, the part of the home screen that shows which friends are online, what they're list]]></description><link>https://blog.saikat.in/beyond-the-green-dot-engineering-online-presence-using-golang</link><guid isPermaLink="true">https://blog.saikat.in/beyond-the-green-dot-engineering-online-presence-using-golang</guid><category><![CDATA[golang]]></category><category><![CDATA[Redis]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[Online presence]]></category><dc:creator><![CDATA[Saikat Das]]></dc:creator><pubDate>Mon, 11 May 2026 04:26:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/8ba8efa5-4353-4431-9463-fcfcd1ac7d3c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I'm building Bluppi, a music listening app, as a solo project. At some point I wanted to add a friends activity feed, the part of the home screen that shows which friends are online, what they're listening to.</p>
<p>The centerpiece of that feature is a green dot. A tiny circle next to a profile picture that means: this person is here, right now.</p>
<p>I hit offline detection bugs that kept users "online" 40 seconds after they'd closed the app. I watched my server silently stop accepting new connections at 300 users because I'd wired the whole system into one binary. I found an article on <a href="https://systemdesign.one/real-time-presence-platform-system-design/">systemdesign.one</a>, spent a weekend with ChatGPT breaking it down, rebuilt the whole thing, named a goroutine the Reaper, and eventually stress-tested it to <strong>51,322</strong> concurrent connections on my laptop.</p>
<p>This is the full story. Including the parts that hurt.</p>
<hr />
<h2>Where I Started: gRPC Streaming and the First Problem</h2>
<p>The transport choice was gRPC streaming from the start. Clients open a long-lived stream to the server and subscribe to presence updates for a list of user IDs. When anyone in that list changes status, the server pushes an event down the stream. One persistent connection per client.</p>
<p>That part was correct.</p>
<p>The first design question: how does the server know a user is online? Heartbeats. When a client connects, the gateway starts a goroutine that calls <code>RecordHeartbeat</code> on the presence service every 15 seconds with the user's ID and a timestamp. The presence service writes that timestamp to Redis. As long as heartbeats arrive, the user is online.</p>
<p>Online detection worked immediately. Offline detection is where things started breaking.</p>
<hr />
<h2>Problem 1: Nobody Was Going Offline</h2>
<h3><strong>Attempt 1: Redis Keyspace Notifications for Offline Detection</strong></h3>
<p>I googled "redis presence detection" and found keyspace notifications. Set a TTL key per user. When it expires, Redis fires an event. Subscribe to those events. Offline detection, free, no extra goroutine, elegant.</p>
<p>I genuinely felt smart for about four hours.</p>
<p>Then I read that keyspace notifications fire for every key event in your entire Redis instance. Not just yours. Every <code>SET</code>, every <code>DEL</code>, every expiry across all your keys. And Redis doesn't expire keys in real time -- it uses lazy expiration or a periodic sweep. Users were showing offline 30 to 40 seconds after reconnecting. The expired key sat un-swept. The new heartbeat key sat ignored. Two conflicting states, zero resolution, one confused frontend.</p>
<p><strong>Why it failed:</strong></p>
<ul>
<li><p>Key expiration in Redis is not deterministic <strong>[1]</strong></p>
</li>
<li><p>Keyspace notifications fire on every key event across the whole Redis instance -- CPU cost scales with total Redis write volume, not your user count</p>
</li>
<li><p>You're trusting Redis's garbage collector to be your presence engine. Bold.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/af0a3649-c81f-43fb-9fe0-5fb018cbce06.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Problem 2: One Binary, All the Problems</h2>
<h3><strong>Attempt 2: One Binary to Rule Them All</strong></h3>
<p>Keyspace notifications were out. I moved to a heartbeat-based model -- the client sends a signal every 15 seconds, the server stores it in Redis with a timestamp, and I'd figure out offline detection separately. That was the right call.</p>
<p>What came next was not.</p>
<p>I had gRPC streaming. I had heartbeats. I put everything into one binary.</p>
<p>Every stream connection held its own Redis Pub/Sub subscription. Every heartbeat triggered a direct database call from inside the stream handler. At around 300 concurrent streams, the default Redis client pool (10 connections) was fully saturated. New connections blocked waiting for a free slot. PostgreSQL followed. The server didn't crash dramatically. It just... stopped accepting new connections. Quietly. Like it gave up.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/07a7af01-901c-4f0e-a613-4ce781c4dc6f.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Why it failed:</strong></p>
<ul>
<li><p>One Redis Pub/Sub subscription per client connection</p>
</li>
<li><p>One database call per heartbeat, per stream goroutine</p>
</li>
<li><p>Default connection pools hit the ceiling before you hit 400 users</p>
</li>
</ul>
<p>The problem was architectural. Stateful connection management and stateless business logic do not belong in the same binary. Their connection pools compete. The long-lived streams always win.</p>
<hr />
<h2><strong>The Architecture That Survived</strong></h2>
<p>At some point during Attempt 2, I stopped writing code and started reading.</p>
<p>I found the <a href="https://systemdesign.one/real-time-presence-platform-system-design/">Real-Time Presence Platform System Design</a> article on systemdesign.one <strong>[4]</strong>. Then I spent a weekend with ChatGPT going through it section by section: the delayed-trigger pattern for offline detection, the gateway split, the single shared Pub/Sub subscription. Every piece had a reason. I just hadn't understood the reasons yet.</p>
<p>The pattern: separate the stateful connection layer (the gateway, which holds open streams) from the stateless logic layer (the presence service, which processes heartbeats and publishes events). One Redis Pub/Sub subscription in the gateway fans out to every connected client. The presence service never touches a client connection directly.</p>
<p>I rebuilt the whole thing. Here's what actually works.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/a3aa138e-27c8-4284-94cb-cba8d3bcd24d.png" alt="" style="display:block;margin:0 auto" />

<p><strong>The flow:</strong></p>
<ol>
<li><p>Client opens a gRPC stream to the Gateway with a list of user IDs they want to watch</p>
</li>
<li><p>Gateway starts a heartbeat loop, calling the API server's internal <code>RecordHeartbeat</code> RPC every 15 seconds</p>
</li>
<li><p>Presence Service writes to a Redis Sorted Set (score = timestamp, member = userID)</p>
</li>
<li><p>If it's a new connection, it publishes an <code>online</code> event to Redis Pub/Sub</p>
</li>
<li><p>The Reaper goroutine sweeps for expired users every 5 seconds, publishes <code>offline</code> events</p>
</li>
<li><p>Gateway's EventsListener picks up events from Redis and fans them out to the right connections</p>
</li>
</ol>
<hr />
<h2>Why One Redis Command Replaced Three Data Structures</h2>
<p>You could use a Redis Set for who's online. A Hash for last-seen timestamps. A separate TTL key per user for expiration. Three data structures, three commands, three places to get out of sync.</p>
<p>A Sorted Set does all of it in one structure.</p>
<pre><code class="language-javascript">ZADD presence:active_users &lt;unix_timestamp&gt; &lt;user_id&gt;
</code></pre>
<p><code>ZADD</code> is idempotent: one command handles both insert and update. No <code>SET</code>+<code>EXPIRE</code> dance, no branching.</p>
<p>To find and clean expired users (offline for more than 30 seconds):</p>
<pre><code class="language-javascript">ZRANGEBYSCORE presence:active_users -inf &lt;now - 30&gt;
ZREM presence:active_users &lt;expired_user_ids...&gt;
</code></pre>
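<p><code>ZADD</code> is <code>O(log N)</code> per heartbeat: at 100k users, that's about 17 comparisons. The sweep costs <code>O(log N + M)</code> for <code>ZRANGEBYSCORE</code> and <code>O(M log N)</code> for <code>ZREM</code>, where M is the number of expired users in that sweep. Redis barely notices.</p>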
<p>The alternative was using key-per-user with TTL expiry and keyspace notifications. I tried that. It's <a href="#attempt-1-redis-keyspace-notifications-for-offline-detection">Attempt 1</a>. It's in the graveyard.</p>
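<p>To make the semantics of those commands concrete, here is a minimal in-memory sketch. It is not the Redis-backed code: <code>presenceSet</code>, <code>heartbeat</code>, and <code>reap</code> are hypothetical names, and a plain Go map stands in for the sorted set. The only thing it tries to show is that one structure covers both "who is online" and "when did I last hear from them".</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"sort"
)

// presenceSet is an in-memory stand-in for the Redis sorted set:
// member = user ID, score = last heartbeat as a Unix timestamp.
type presenceSet map[string]int64

// heartbeat mirrors ZADD: it inserts the user or overwrites the old score.
func (p presenceSet) heartbeat(userID string, now int64) {
	p[userID] = now
}

// reap mirrors ZRANGEBYSCORE -inf (now - ttl) followed by ZREM: it returns
// every user whose score is at or below the cutoff and removes them.
func (p presenceSet) reap(now, ttl int64) []string {
	cutoff := now - ttl
	var expired []string
	for userID, lastSeen := range p {
		if lastSeen &lt;= cutoff {
			expired = append(expired, userID)
		}
	}
	for _, userID := range expired {
		delete(p, userID)
	}
	sort.Strings(expired) // stable output for the demo only
	return expired
}

func main() {
	p := presenceSet{}
	p.heartbeat("alice", 100)
	p.heartbeat("bob", 100)
	p.heartbeat("alice", 115) // the same call updates an existing member

	fmt.Println(p.reap(131, 30)) // cutoff is 101, so only bob (score 100) expires
	fmt.Println(len(p))          // alice is still in the set
}
</code></pre>
<p>The real sorted set keeps members ordered by score, which is why Redis can answer the range query without scanning every user the way this map-based sketch does.</p>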
<h2>How the Heartbeat Works</h2>
<p>Every 15 seconds, the gateway lies to the presence service. It tells it the user is still online. The presence service believes it, updates the timestamp, and if it's a new connection, shouts "online" to Redis Pub/Sub. If the lies stop coming, the Reaper sweeps up.</p>
<pre><code class="language-go">// internals/presence/repository.go

const (
    presenceZSetKey = "presence:active_users"
    ttlSeconds      = 30
)

func (r *Repository) RecordHeartbeat(ctx context.Context, userID string) (bool, error) {
    now := float64(time.Now().Unix())
    cutoff := float64(time.Now().Unix() - ttlSeconds)

    score, err := r.redisClient.ZScore(ctx, presenceZSetKey, userID).Result()
    var isNewConnection bool

    if err == redis.Nil {
        // User not in the set at all. Fresh connection.
        isNewConnection = true
    } else if err == nil &amp;&amp; score &lt; cutoff {
        // User exists but their last heartbeat expired. They're back.
        isNewConnection = true
    } else if err != nil {
        return false, err
    }

    // Update (or add) the user's timestamp
    err = r.redisClient.ZAdd(ctx, presenceZSetKey, redis.Z{
        Score:  now,
        Member: userID,
    }).Err()

    return isNewConnection, err
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/presence/repository.go">Source: <code>internals/presence/repository.go</code></a></p>
<p>The <code>isNewConnection</code> flag matters. If a user is already online and sends another heartbeat, we don't publish a redundant <code>online</code> event. Only genuine state transitions get broadcast:</p>
<pre><code class="language-go">// internals/presence/service.go

func (s *Service) RecordHeartbeat(ctx context.Context, req *pb.HeartBeatRequest) (*emptypb.Empty, error) {
    isNewConnection, err := s.repo.RecordHeartbeat(ctx, req.UserId)
    if err != nil {
        log.Printf("Failed to record heartbeat for user %s: %v", req.UserId, err)
        return &amp;emptypb.Empty{}, err
    }

    if isNewConnection {
        event := gateway.PresenceEvent{
            UserID:   req.UserId,
            Status:   "online",
            LastSeen: time.Now().Unix(),
        }
        payload, err := json.Marshal(event)
        if err == nil {
            s.redisClient.Publish(ctx, "system:presence_events", payload)
        }
    }

    return &amp;emptypb.Empty{}, nil
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/presence/service.go">Source: <code>internals/presence/service.go</code></a></p>
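<p>The event that travels over Pub/Sub is just JSON. A rough sketch of the round trip, assuming a struct with the three fields used above; the JSON tag names (<code>user_id</code>, <code>status</code>, <code>last_seen</code>) and the <code>encode</code>/<code>decode</code> helpers are my illustration here, not necessarily what the repository uses:</p>
<pre><code class="language-go">package main

import (
	"encoding/json"
	"fmt"
)

// PresenceEvent mirrors the fields used in the snippets above.
// The JSON tag names are assumptions for illustration.
type PresenceEvent struct {
	UserID   string `json:"user_id"`
	Status   string `json:"status"`
	LastSeen int64  `json:"last_seen"`
}

// encode builds the payload a publisher would send on the channel.
func encode(e PresenceEvent) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

// decode is what a subscriber would do with a received payload.
func decode(payload string) (PresenceEvent, error) {
	var e PresenceEvent
	err := json.Unmarshal([]byte(payload), &amp;e)
	return e, err
}

func main() {
	payload, err := encode(PresenceEvent{UserID: "u42", Status: "online", LastSeen: 1700000000})
	if err != nil {
		panic(err)
	}
	fmt.Println(payload)

	event, err := decode(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(event.UserID, event.Status)
}
</code></pre>
<p>Keeping the payload this small matters because every gateway instance decodes every event, whether or not anyone connected to it is watching that user.</p>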
<p>The gateway side. <code>keepAliveLoop</code> runs as a goroutine for the lifetime of the stream:</p>
<pre><code class="language-go">// internals/gateway/handler.go

func (s *Server) keepAliveLoop(ctx context.Context, userID string) {
    ticker := time.NewTicker(15 * time.Second)
    defer ticker.Stop()

    s.sendHeartbeat(userID) // fire immediately on connect

    for {
        select {
        case &lt;-ctx.Done():
            return
        case &lt;-ticker.C:
            s.sendHeartbeat(userID)
        }
    }
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/gateway/handler.go">Source: <code>internals/gateway/handler.go</code></a></p>
<p>Heartbeat interval is 15 seconds. TTL is 30 seconds. That gives two missed heartbeats before the user is considered offline. One missed heartbeat from a flaky mobile connection doesn't trigger a false offline event.</p>
<h2>How Offline Detection Works: The Reaper</h2>
<p>After heartbeats were working, I had a new problem. Users who closed the app stayed "online" on everyone else's screen. 30 seconds. 40 seconds. Sometimes longer.</p>
<p>There was no disconnect event. No callback. When the gRPC stream closed, the gateway's heartbeat goroutine exited. That was it. Nobody told the presence service. Nobody published an offline event. The user's timestamp just sat in the sorted set, aging, while their green dot stayed green.</p>
<p>I needed something to notice the silence.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/4e533ee6-8af1-4b2f-8f78-fabc40738009.png" alt="" style="display:block;margin:0 auto" />

<p>I wrote a goroutine that runs every 5 seconds, queries Redis for users whose last heartbeat is older than 30 seconds, removes them from the sorted set, and publishes an offline event for each one. I called it the Reaper.</p>
<pre><code class="language-go">// internals/presence/reaper.go

func (r *Reaper) Start(ctx context.Context) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case &lt;-ctx.Done():
            return
        case &lt;-ticker.C:
            r.sweep(ctx)
        }
    }
}

func (r *Reaper) sweep(ctx context.Context) {
    sweepCtx, cancel := context.WithTimeout(ctx, 4*time.Second)
    defer cancel()

    expiredUsers, err := r.repo.GetAndRemoveExpiredUsers(sweepCtx)
    if err != nil {
        log.Printf("Reaper failed to fetch expired users: %v", err)
        return
    }

    if len(expiredUsers) &gt; 0 {
        now := time.Now().Unix()
        for _, userID := range expiredUsers {
            event := gateway.PresenceEvent{
                UserID:   userID,
                Status:   "offline",
                LastSeen: now,
            }
            payload, err := json.Marshal(event)
            if err == nil {
                r.redisClient.Publish(sweepCtx, "system:presence_events", payload)
            }
        }
    }
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/presence/reaper.go">Source: <code>internals/presence/reaper.go</code></a></p>
<p>The sorted set query and removal:</p>
<pre><code class="language-go">// internals/presence/repository.go

func (r *Repository) GetAndRemoveExpiredUsers(ctx context.Context) ([]string, error) {
    cutoff := fmt.Sprintf("%f", float64(time.Now().Unix()-ttlSeconds))

    // fetch all users whose last heartbeat score is older than 30s
    expiredUsers, err := r.redisClient.ZRangeArgs(ctx, redis.ZRangeArgs{
        Key:     presenceZSetKey,
        ByScore: true,
        Start:   "-inf",
        Stop:    cutoff,
    }).Result()
    if err != nil {
        // surface the failure so the Reaper can log it and retry next tick
        return nil, err
    }

    if len(expiredUsers) &gt; 0 {
        // convert to []interface{} for ZRem
        members := make([]interface{}, len(expiredUsers))
        for i, u := range expiredUsers {
            members[i] = u
        }
        if err := r.redisClient.ZRem(ctx, presenceZSetKey, members...).Err(); err != nil {
            return nil, err
        }
    }

    return expiredUsers, nil
    // full implementation: https://github.com/dis70rt/bluppi-backend/blob/main/internals/presence/repository.go
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/presence/repository.go">Source: <code>internals/presence/repository.go</code></a></p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/17ae7f97-89dd-450e-b343-6114b9b21bff.png" alt="" style="display:block;margin:0 auto" />

<p>The 4-second context timeout on <code>sweep</code> is intentional. If Redis is unresponsive, the Reaper doesn't hang forever and block subsequent sweeps. It logs the failure and tries again in 5 seconds. LinkedIn's presence platform uses a similar delayed-trigger approach for offline detection <a href="#references">[3]</a>.</p>
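<p>How long a dead connection stays green depends on where the last heartbeat lands relative to the Reaper's tick. A small sketch with plain integers, assuming whole-second timestamps and the inclusive cutoff that <code>ZRANGEBYSCORE</code> uses; <code>detectedAt</code> is a hypothetical helper, not something from the repository:</p>
<pre><code class="language-go">package main

import "fmt"

// detectedAt returns the time of the first reaper tick at which a user whose
// last heartbeat landed at lastBeat is treated as expired. Ticks happen at
// phase, phase+sweep, phase+2*sweep, and so on. All values are in seconds.
func detectedAt(lastBeat, phase, sweep, ttl int64) int64 {
	t := phase
	// A user is expired once lastBeat is at or below the cutoff (t - ttl).
	for t-ttl &lt; lastBeat {
		t += sweep
	}
	return t
}

func main() {
	const (
		lastBeat = 100 // final heartbeat before the app was closed
		sweep    = 5   // reaper tick interval
		ttl      = 30  // heartbeat TTL
	)
	// Shift the reaper's tick phase to see how the lag moves.
	for phase := int64(0); phase &lt; sweep; phase++ {
		lag := detectedAt(lastBeat, phase, sweep, ttl) - lastBeat
		fmt.Printf("phase=%d lag=%ds\n", phase, lag)
	}
}
</code></pre>
<p>With these numbers the lag runs from 30 to 34 seconds. It never drops below the TTL, and it stays under TTL plus one sweep interval.</p>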
<hr />
<h2>The Gateway: Multi-Device Connections and Targeted Fanout</h2>
<p>The <code>ConnectionManager</code> uses a nested map: <code>userId -&gt; connectionId -&gt; Connection</code>. Each device gets a UUID. Disconnecting one device only cleans up that specific connection.</p>
<pre><code class="language-go">// internals/gateway/manager.go

type Connection struct {
    ID                 string
    UserID             string
    Chan               chan PresenceEvent
    SubscribedToTarget []string // which users this connection is watching
}

type ConnectionManager struct {
    mu sync.RWMutex
    // userId -&gt; connectionId -&gt; Connection
    userConnections map[string]map[string]*Connection
    // targetUserId -&gt; set of connectionIds watching them
    targetWatchers map[string]map[string]struct{}
    // connectionId -&gt; Connection reference (for quick lookup during push)
    connById map[string]*Connection
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/gateway/manager.go">Source: <code>internals/gateway/manager.go</code></a></p>
<p>Three maps. Looks like overkill. It's not.</p>
<ul>
<li><p><code>userConnections</code>: "which connections does this user have?" Needed for cleanup.</p>
</li>
<li><p><code>targetWatchers</code>: "who's watching this user?" Needed for fanout. This is the inverted index.</p>
</li>
<li><p><code>connById</code>: "give me the connection object by ID." Needed for pushing events.</p>
</li>
</ul>
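<p>The snippets in this post show the lookup side of those maps but not how they are kept in sync. Here is a rough sketch of registration and cleanup under the write lock. The bodies of <code>AddConnection</code> and <code>RemoveConnection</code> are my guess at the shape, the event type is trimmed down, and a counter stands in for the UUID so the example needs nothing outside the standard library:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"sync"
)

type PresenceEvent struct {
	UserID string
	Status string
}

type Connection struct {
	ID      string
	UserID  string
	Chan    chan PresenceEvent
	Targets []string // users this connection is watching
}

type ConnectionManager struct {
	mu              sync.RWMutex
	nextID          int // stand-in for a UUID generator
	userConnections map[string]map[string]*Connection
	targetWatchers  map[string]map[string]struct{}
	connById        map[string]*Connection
}

func NewConnectionManager() *ConnectionManager {
	return &amp;ConnectionManager{
		userConnections: make(map[string]map[string]*Connection),
		targetWatchers:  make(map[string]map[string]struct{}),
		connById:        make(map[string]*Connection),
	}
}

// AddConnection registers one device stream in all three indexes.
func (cm *ConnectionManager) AddConnection(userID string, targets []string) (*Connection, string) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	cm.nextID++
	conn := &amp;Connection{
		ID:      fmt.Sprintf("conn-%d", cm.nextID),
		UserID:  userID,
		Chan:    make(chan PresenceEvent, 20),
		Targets: targets,
	}
	if cm.userConnections[userID] == nil {
		cm.userConnections[userID] = make(map[string]*Connection)
	}
	cm.userConnections[userID][conn.ID] = conn
	cm.connById[conn.ID] = conn
	for _, t := range targets {
		if cm.targetWatchers[t] == nil {
			cm.targetWatchers[t] = make(map[string]struct{})
		}
		cm.targetWatchers[t][conn.ID] = struct{}{}
	}
	return conn, conn.ID
}

// RemoveConnection undoes AddConnection for one device and prunes empty
// inner maps so the indexes do not grow without bound.
func (cm *ConnectionManager) RemoveConnection(userID, connID string) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	conn, ok := cm.userConnections[userID][connID]
	if !ok {
		return
	}
	for _, t := range conn.Targets {
		delete(cm.targetWatchers[t], connID)
		if len(cm.targetWatchers[t]) == 0 {
			delete(cm.targetWatchers, t)
		}
	}
	delete(cm.userConnections[userID], connID)
	if len(cm.userConnections[userID]) == 0 {
		delete(cm.userConnections, userID)
	}
	delete(cm.connById, connID)
	close(conn.Chan) // signals the stream handler to exit
}

func main() {
	cm := NewConnectionManager()
	_, phoneID := cm.AddConnection("alice", []string{"bob"})
	_, laptopID := cm.AddConnection("alice", []string{"bob", "carol"})

	cm.RemoveConnection("alice", phoneID)
	fmt.Println(len(cm.targetWatchers["bob"]), len(cm.userConnections["alice"]))

	cm.RemoveConnection("alice", laptopID)
	fmt.Println(len(cm.targetWatchers), len(cm.connById))
}
</code></pre>
<p>Closing the channel while holding the write lock means a concurrent fanout, which only runs under the read lock, can never send to a channel that has already been closed.</p>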
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/c474c755-987a-41dd-b074-42a966b1f969.png" alt="" style="display:block;margin:0 auto" />

<p>When a presence event arrives for User B, the <code>PushEvent</code> method looks up User B in <code>targetWatchers</code>, gets the set of connection IDs watching them, resolves each to a <code>Connection</code> via <code>connById</code>, and pushes the event to its channel:</p>
<pre><code class="language-go">// internals/gateway/manager.go

func (cm *ConnectionManager) PushEvent(targetUserID string, event PresenceEvent) {
    cm.mu.RLock()
    defer cm.mu.RUnlock()

    watchers, exists := cm.targetWatchers[targetUserID]
    if !exists {
        return
    }

    for connID := range watchers {
        if conn, ok := cm.connById[connID]; ok {
            select {
            case conn.Chan &lt;- event:
            default: // Drop event if channel is full to avoid blocking
            }
        }
    }
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/gateway/manager.go">Source: <code>internals/gateway/manager.go</code></a></p>
<p>That <code>select { default: }</code> is load-bearing. Without it, a slow consumer blocks the entire fanout. With it, a full channel just drops the event. The cost: events are only published on state transitions, so a watcher that misses one can show a stale status for that user until their next transition. Better than a deadlock that freezes the entire gateway.</p>
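<p>If the non-blocking send is unfamiliar, here it is in isolation. This is a standalone sketch with a hypothetical <code>trySend</code> helper and string events; the buffer is deliberately tiny so it fills up quickly:</p>
<pre><code class="language-go">package main

import "fmt"

// trySend attempts a non-blocking send and reports whether the event was
// queued. A full buffer means the event is dropped instead of blocking.
func trySend(ch chan string, event string) bool {
	select {
	case ch &lt;- event:
		return true
	default:
		return false
	}
}

func main() {
	mailbox := make(chan string, 2) // small capacity keeps the demo short

	fmt.Println(trySend(mailbox, "bob:online"))   // true
	fmt.Println(trySend(mailbox, "carol:online")) // true
	fmt.Println(trySend(mailbox, "bob:offline"))  // false: buffer full, event dropped

	&lt;-mailbox // the consumer drains one event

	fmt.Println(trySend(mailbox, "dave:online")) // true: there is room again
}
</code></pre>
<p>The sender never parks on the channel, so one stuck client cannot hold the read lock hostage for everyone else.</p>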
<p>The Redis Pub/Sub listener ties it together. One goroutine, one subscription, fans out to all connections:</p>
<pre><code class="language-go">// internals/gateway/listener.go

func (el *EventsListener) Start(ctx context.Context) {
    pubsub := el.redisClient.Subscribe(ctx, "system:presence_events")
    defer pubsub.Close()

    for {
        select {
        case &lt;-ctx.Done():
            return
        case msg := &lt;-pubsub.Channel():
            var event PresenceEvent
            if err := json.Unmarshal([]byte(msg.Payload), &amp;event); err != nil {
                log.Printf("Failed to unmarshal presence event: %v", err)
                continue
            }
            el.connManager.PushEvent(event.UserID, event)
        }
    }
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/gateway/listener.go">Source: <code>internals/gateway/listener.go</code></a></p>
<hr />
<h2>Goroutines as Actors: How Go Solved the Mutex Problem</h2>
<p>Here's the part nobody explains when they show you a <code>sync.RWMutex</code> slapped on a map and call it a day.</p>
<p>The <code>ConnectionManager</code> has a mutex. It protects the three index maps: registration, deregistration, lookups. That's the right scope for a lock.</p>
<p>But event <strong>delivery</strong> never takes an exclusive lock. <code>PushEvent</code> holds only an <code>RLock</code> to read the watcher index, then does a non-blocking send to a channel. It never waits on a consumer. It never coordinates between connections. Each connection processes its own events in complete isolation.</p>
<p>This is the <a href="https://en.wikipedia.org/wiki/Actor_model">Actor Model</a> <a href="#references">[8]</a>, implemented with zero framework, zero library, zero ceremony. In Go, a goroutine plus a channel is an actor. The channel is the mailbox. The goroutine is the event loop. Messages never share memory across actors.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/987ccefa-0b2e-4aa1-91b5-ce3d14bb68fa.png" alt="" style="display:block;margin:0 auto" />

<p>Look at <code>SubscribePresence</code>:</p>
<pre><code class="language-go">// internals/gateway/handler.go

func (s *Server) SubscribePresence(req *pb.SubscribeRequest, stream pb.PresenceGateway_SubscribePresenceServer) error {
    ctx := stream.Context()

    userID, err := middlewares.GetUserID(ctx)
    if err != nil {
        return err
    }

    conn, connID := s.connManager.AddConnection(userID, req.TargetUserIds)
    defer s.connManager.RemoveConnection(userID, connID)

    go s.keepAliveLoop(stream.Context(), userID)

    for {
        select {
        case &lt;-stream.Context().Done():
            return stream.Context().Err()

        case event, ok := &lt;-conn.Chan: // the actor receiving its mailbox message
            if !ok {
                // channel closed by RemoveConnection: clean shutdown signal,
                // no extra sync primitive needed
                return nil
            }
            grpcMessage := &amp;pb.PresenceUpdate{
                UserId:   event.UserID,
                Status:   event.Status,
                LastSeen: timestamppb.New(time.Unix(event.LastSeen, 0)),
            }
            if err := stream.Send(grpcMessage); err != nil {
                return err
            }
        }
    }
}
</code></pre>
<p><a href="https://github.com/dis70rt/bluppi-backend/blob/main/internals/gateway/handler.go">Source: <code>internals/gateway/handler.go</code></a></p>
<p>Each call to <code>SubscribePresence</code> is its own actor instance. It has one goroutine, one channel (<code>conn.Chan</code>, capacity 20), one job: drain the mailbox and stream events to the client. It shares no state with any other connection's event loop.</p>
<p>The alternative is a shared event queue with a mutex on every dequeue. At 50k concurrent connections, every event delivery serializes through one lock. Under that contention, goroutines pile up waiting on the same <code>sync.Mutex</code>, and the runtime spends more time parking and waking them than your code spends doing actual work.</p>
<p>With the actor approach: <code>PushEvent</code> does one non-blocking channel send and moves on. No coordination. No waiting. The worst case is a dropped event when a consumer's buffer is full -- which is a better outcome than a deadlocked gateway serving nobody.</p>
<p>The only mutex in the entire system is on the maps. Not on message delivery. That's the point.</p>
<hr />
<h2>The Two-Binary Split: API Server vs. Gateway Server</h2>
<p>The API server (<code>cmd/api/main.go</code>, port 50051) handles business logic: users, music, rooms, parties, and the internal <code>RecordHeartbeat</code> RPC. Stateless. Horizontally scalable. Talks to PostgreSQL, Memgraph, Redis, Solr.</p>
<p>The Gateway server (<code>cmd/gateway/main.go</code>, port 50050) handles long-lived gRPC streams. Stateful by nature (it holds open connections). Talks to Redis Pub/Sub and the API server over internal gRPC. <em>(Note: you can use WebSockets instead of gRPC streams here; the core principles of connection management and fanout remain exactly the same.)</em></p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/a522798c-947e-4d4b-afbc-acdf47a9368d.png" alt="" style="display:block;margin:0 auto" />

<p>The gateway authenticates clients via Firebase Auth through a gRPC stream interceptor, then calls the API server with the user ID attached as gRPC metadata. The API server never sees raw client connections for presence. It processes heartbeats and that's it.</p>
<hr />
<h2>Load Test: 51,322 Users, Zero Errors</h2>
<p>I wrote a custom load test tool in Go (<a href="https://github.com/dis70rt/bluppi-backend/blob/main/cmd/loadtest/presence/main.go"><code>cmd/loadtest/presence/main.go</code></a>) that spins up concurrent gRPC streams. Each simulated user subscribes to their own events, connects, receives their online event, and holds the stream open.</p>
<p>Important caveat: this is a <strong>connection-overhead test</strong>, not a fanout test. Each user watches only themselves. In production, one popular user might have thousands of watchers -- that fanout path has different latency characteristics. The numbers below measure how many concurrent streams the gateway holds open without errors, not worst-case fanout latency.</p>
<p>Results from a single laptop (i5-13500HX, 15 GiB RAM):</p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/229d9b65-f8ce-4198-8530-691ecf1c3496.png" alt="" style="display:block;margin:0 auto" />

<table>
<thead>
<tr>
<th>Users</th>
<th>Duration</th>
<th>Errors</th>
<th>Connect p50</th>
<th>Connect p90</th>
<th>Connect p95</th>
<th>Connect p99</th>
</tr>
</thead>
<tbody><tr>
<td>50</td>
<td>37s</td>
<td>0</td>
<td>1.365 ms</td>
<td>1.595 ms</td>
<td>1.769 ms</td>
<td>3.036 ms</td>
</tr>
<tr>
<td>5,000</td>
<td>332s</td>
<td>0</td>
<td>0.756 ms</td>
<td>0.933 ms</td>
<td>0.992 ms</td>
<td>1.231 ms</td>
</tr>
<tr>
<td>20,000</td>
<td>632s</td>
<td>0</td>
<td>0.713 ms</td>
<td>2.490 ms</td>
<td>2.636 ms</td>
<td>2.783 ms</td>
</tr>
<tr>
<td>51,322*</td>
<td>308s</td>
<td>0</td>
<td>2.695 ms</td>
<td>8.426 ms</td>
<td>11.516 ms</td>
<td>15.278 ms</td>
</tr>
</tbody></table>
<p><em>*Target was 100k. Stopped at 51,322 because the load generator ran out of RAM. The backend was fine. My laptop was not.</em></p>
<p>Zero errors across all runs. p99 connect latency at 51k users: 15.28 ms. The backend didn't break. The load generator did.</p>
<p>For context, LinkedIn targets sub-second presence delivery for their 700M+ user base <a href="#references">[3]</a>. Our p99 connect latency at 51k is 15 ms. Different scale and a different metric, sure, but the architecture patterns are the same.</p>
<hr />
<h2>Known Limitations</h2>
<p>I'm still working on some of these. Listing them anyway because honesty is more useful than a blog post that pretends everything is solved.</p>
<ul>
<li><p><strong>Single-node gateway.</strong> If the gateway goes down, all streams drop. No cross-node fanout yet. Fixing this requires a dispatcher layer similar to LinkedIn's architecture <a href="#references">[3]</a> or consistent hashing across gateway instances.</p>
</li>
<li><p><strong>Reaper sweep interval.</strong> The 5-second tick means offline detection can lag up to 35 seconds (30s TTL + 5s sweep). For a music app's friends activity, this is fine. For an ICU patient monitoring system, this would be a resume-generating event.</p>
</li>
<li><p><strong>No cross-data center replication.</strong> Single region. Adding geo-distribution would need CRDT-based replication on the Redis layer, similar to what the systemdesign.one article describes <a href="#references">[4]</a>.</p>
</li>
<li><p><code>ulimit</code> <strong>capped the load test.</strong> The default 1024 open file limit on the load generator machine was the actual bottleneck. Raising it and using distributed load generators would push past 100k.</p>
</li>
</ul>
<hr />
<h2>Closing</h2>
<p>This took months. For a green dot.</p>
<p>The full source is on GitHub: <a href="https://github.com/dis70rt/bluppi-backend">dis70rt/bluppi-backend</a>. The presence system lives in <code>internals/presence/</code> and <code>internals/gateway/</code>. Go read it, break it, tell me what's wrong with it.</p>
<p>Two things I wish I knew at the start:</p>
<p>If you're reaching for Redis keyspace notifications for offline detection -- don't. Write a Reaper. It's 66 lines of Go and it does exactly what you need with zero surprises.</p>
<p>If you're about to let every client hold a Redis Pub/Sub subscription from inside your API server -- also don't. Split the gateway out. The connection pool you save might be your own.</p>
<p>And if you're a student building something that "sounds simple" and it starts taking way longer than expected: that's normal. That's just what real systems feel like when you stop using the whiteboard version.</p>
<hr />
<h2>References</h2>
<ol>
<li><p>Redis documentation, "Redis keyspace notifications." <a href="https://redis.io/docs/manual/keyspace-notifications/">redis.io/docs</a></p>
</li>
<li><p>gRPC documentation, "Keepalive." <a href="https://grpc.io/docs/guides/keepalive/">grpc.io/docs</a></p>
</li>
<li><p>Gupta &amp; Lay, "Now You See Me, Now You Don't: LinkedIn's Real-Time Presence Platform" (2018). <a href="https://engineering.linkedin.com/blog/2018/01/now-you-see-me--now-you-dont--linkedins-real-time-presence-platfo">engineering.linkedin.com</a></p>
</li>
<li><p>Kim, "Real-Time Presence Platform System Design" (2023). <a href="https://systemdesign.one/presence-platform-system-design/">systemdesign.one</a></p>
</li>
<li><p>Barber, "Building Real-Time Infrastructure at Facebook" (2017). SREcon / USENIX. <a href="https://www.youtube.com/watch?v=oQI3JYbkJhI">youtube.com</a></p>
</li>
<li><p>PubNub, "Is Anyone Home? An Intro to Presence Webhooks" (2020). <a href="https://www.pubnub.com/blog/intro-to-presence-webhooks/">pubnub.com</a></p>
</li>
<li><p>Redis documentation, "Sorted Sets." <a href="https://redis.io/docs/data-types/sorted-sets/">redis.io/docs</a></p>
</li>
<li><p>Wikipedia, "Actor model." <a href="https://en.wikipedia.org/wiki/Actor_model">en.wikipedia.org</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[How I Got gRPC Working Through Cloudflare Tunnel (The Hard Way)]]></title><description><![CDATA[A complete guide to exposing a Go gRPC backend behind college NAT using Cloudflare Tunnel, Docker Compose, and TLS, including every mistake I made along the way.


Background
I'm a student building Bl]]></description><link>https://blog.saikat.in/how-i-got-grpc-working-through-cloudflare-tunnel-the-hard-way</link><guid isPermaLink="true">https://blog.saikat.in/how-i-got-grpc-working-through-cloudflare-tunnel-the-hard-way</guid><category><![CDATA[gRPC]]></category><category><![CDATA[cloudflare]]></category><category><![CDATA[self-hosted]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Flutter]]></category><dc:creator><![CDATA[Saikat Das]]></dc:creator><pubDate>Sat, 28 Mar 2026 01:25:47 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p>A complete guide to exposing a Go gRPC backend behind college NAT using Cloudflare Tunnel, Docker Compose, and TLS, including every mistake I made along the way.</p>
</blockquote>
<hr />
<h2>Background</h2>
<p>I'm a student building <strong>Bluppi</strong>, a Flutter app with a Go gRPC backend. My server runs on my laptop on college WiFi, behind a NAT I don't control, no port forwarding, no static IP.</p>
<p>My stack:</p>
<ul>
<li><p><strong>Backend</strong>: Go gRPC server (<code>bluppi-api:50051</code>) + Go gRPC gateway (<code>bluppi-gateway:50050</code>) + Python FastAPI (<code>bluppi-audio-api:8000</code>)</p>
</li>
<li><p><strong>Infrastructure</strong>: Docker Compose</p>
</li>
<li><p><strong>Tunnel</strong>: Cloudflare Tunnel (<code>cloudflared</code>)</p>
</li>
<li><p><strong>Client</strong>: Flutter (Android/iOS)</p>
</li>
<li><p><strong>Domain</strong>: <code>bluppi.saikat.in</code></p>
</li>
</ul>
<p>What should have been a simple "expose my API to the internet" turned into a 6-hour debugging session. Here's everything I learned.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67093bfb638dddf76481cbb9/fa59eda8-a56c-41a8-a787-e9781ac23d6a.png" alt="" style="display:block;margin:0 auto" />

<h2>Part 1: Setting Up Cloudflare Tunnel</h2>
<h3>Create the tunnel</h3>
<pre><code class="language-bash"># Login to Cloudflare
cloudflared tunnel login

# Create tunnel
cloudflared tunnel create bluppi

# Route your domain to the tunnel
cloudflared tunnel route dns bluppi bluppi.saikat.in

# Copy credentials to project
mkdir -p cloudflared
cp ~/.cloudflared/&lt;tunnel-id&gt;.json cloudflared/
cp ~/.cloudflared/cert.pem cloudflared/
</code></pre>
<h3>Initial <code>config.yml</code> (what I started with)</h3>
<pre><code class="language-yaml">tunnel: &lt;tunnel-id&gt;
credentials-file: /etc/cloudflared/&lt;tunnel-id&gt;.json

ingress:
  - hostname: bluppi.saikat.in
    path: /api/v2/.*
    service: grpc://bluppi-gateway:50050

  - hostname: bluppi.saikat.in
    path: /api/rest/.*
    service: http://bluppi-audio-api:8000

  - hostname: bluppi.saikat.in
    service: grpc://bluppi-api:50051

  - service: http_status:404
</code></pre>
<h3>Initial <code>docker-compose.yml</code> tunnel service</h3>
<pre><code class="language-yaml">tunnel:
  container_name: bluppi-tunnel
  image: cloudflare/cloudflared:latest
  command: tunnel --config /etc/cloudflared/config.yml run
  restart: unless-stopped
  networks:
    - cloudflared
  depends_on:
    - bluppi-api
    - bluppi-gateway
    - audio-api
  volumes:
    - ./cloudflared:/etc/cloudflared:ro
</code></pre>
<hr />
<h2>Part 2: The Mistakes (And How to Fix Them)</h2>
<h3>Mistake 1: Port 7844 blocked (college network)</h3>
<p><strong>Symptom:</strong></p>
<pre><code class="language-plaintext">dial tcp 198.41.192.107:7844: i/o timeout
</code></pre>
<p>By default, cloudflared connects to Cloudflare's edge on TCP port <strong>7844</strong>. College networks often block non-standard ports.</p>
<p><strong>Fix:</strong> Force HTTP/2 protocol which falls back to port 443:</p>
<pre><code class="language-yaml"># docker-compose.yml
command: tunnel --protocol http2 --config /etc/cloudflared/config.yml run
</code></pre>
<p>Or in <code>config.yml</code>:</p>
<pre><code class="language-yaml">protocol: http2
</code></pre>
<hr />
<h3>Mistake 2: Too many HA connections</h3>
<p><strong>Symptom:</strong></p>
<pre><code class="language-plaintext">already connected to this server, trying another address
</code></pre>
<p>cloudflared opens <strong>4 parallel connections</strong> by default, but the Delhi edge (<code>del03</code>) only had 1-2 distinct IPs reachable from my network.</p>
<p><strong>Fix:</strong> Reduce HA connections:</p>
<pre><code class="language-yaml">ha-connections: 1
</code></pre>
<hr />
<h3>Mistake 3: <code>grpc://</code> is not a valid cloudflared service protocol</h3>
<p><strong>Symptom:</strong></p>
<pre><code class="language-plaintext">malformed HTTP response "\x00\x00\x06\x04..."
</code></pre>
<p><code>grpc://</code> is not a recognised protocol in <code>cloudflared</code>. Those bytes are raw HTTP/2 frames being received by cloudflared which expected HTTP/1.x.</p>
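<p>To see why those bytes point to HTTP/2, here is a minimal Go sketch that decodes the four bytes visible in the error. It assumes the standard HTTP/2 frame layout from RFC 9113 (a 24-bit payload length followed by a one-byte frame type). <code>decodePrefix</code> and the <code>frameTypes</code> table are illustrative names, not part of cloudflared.</p>
<pre><code class="language-go">package main

import (
	"errors"
	"fmt"
)

// A few HTTP/2 frame type codes from RFC 9113, section 6.
var frameTypes = map[byte]string{
	0x0: "DATA",
	0x1: "HEADERS",
	0x4: "SETTINGS",
	0x6: "PING",
}

// decodePrefix reads the first four bytes of an HTTP/2 frame header:
// a 24-bit big-endian payload length followed by a one-byte frame type.
func decodePrefix(b []byte) (int, string, error) {
	if len(b) != 4 {
		return 0, "", errors.New("need exactly 4 bytes")
	}
	length := int(b[0])*65536 + int(b[1])*256 + int(b[2])
	name, ok := frameTypes[b[3]]
	if !ok {
		name = fmt.Sprintf("UNKNOWN(0x%x)", b[3])
	}
	return length, name, nil
}

func main() {
	// The four bytes visible in the cloudflared error message.
	length, name, err := decodePrefix([]byte{0x00, 0x00, 0x06, 0x04})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("payload length=%d type=%s\n", length, name)
}
</code></pre>
<p>The prefix decodes to a 6-byte <code>SETTINGS</code> frame, which is the first frame an HTTP/2 server sends on a new connection. In other words, the origin was already speaking HTTP/2 while cloudflared was trying to parse an HTTP/1.x response.</p>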
<p><strong>Fix:</strong> Initially, I changed <code>grpc://</code> to <code>h2c://</code> (HTTP/2 cleartext) for internal gRPC services. <strong>However, this is a trap.</strong> While <code>h2c://</code> stops this immediate error, it leads you right into Mistake 4 because of Cloudflare's proxy limitations.</p>
<hr />
<h3>Mistake 4: Cloudflare Free plan blocks gRPC on public hostnames</h3>
<p><strong>Symptom:</strong></p>
<pre><code class="language-plaintext">malformed header: missing HTTP content-type
</code></pre>
<p>Even with the gRPC toggle enabled in the Cloudflare Dashboard, the <strong>Free plan</strong> does not fully proxy gRPC through orange-cloud (proxied) hostnames. Cloudflare strips the <code>content-type: application/grpc</code> header.</p>
<p><strong>References:</strong></p>
<ul>
<li><p><a href="https://github.com/cloudflare/cloudflared/issues/491">cloudflared issue #491 — gRPC support</a></p>
</li>
<li><p><a href="https://github.com/cloudflare/cloudflared/issues/1304">cloudflared issue #1304 — Support h2c origin servers</a></p>
</li>
</ul>
<p><strong>Fix:</strong> This is a fundamental limitation; no simple config change fixes it, and <code>h2c://</code> is not an option. The real solution is to add TLS to your origin server and use <code>https://</code> (detailed in Part 3).</p>
<hr />
<h2>Part 3: The Real Fix (TLS on the Origin)</h2>
<h3>Step 1: Generate self-signed certificates</h3>
<pre><code class="language-bash">mkdir -p certs
openssl req -x509 -newkey rsa:4096 \
  -keyout certs/server.key \
  -out certs/server.crt \
  -days 365 -nodes \
  -subj "/CN=localhost"
</code></pre>
<blockquote>
<p><strong>Note on CN:</strong> Since we use <code>noTLSVerify: true</code> in cloudflared config, the CN value doesn't matter. <code>localhost</code> works fine.</p>
</blockquote>
<h3>Step 2: Configure your gRPC Server</h3>
<p>Regardless of whether you are using Go, Python, Node, or Rust, you must configure your gRPC server to use the generated <code>server.crt</code> and <code>server.key</code>. Your specific language's gRPC library will handle the TLS implementation.</p>
<h3>Step 3: Verify TLS + ALPN on your server</h3>
<pre><code class="language-bash"># Check TLS works
openssl s_client -connect localhost:50051

# Check ALPN h2 is advertised (critical for gRPC)
openssl s_client -connect localhost:50051 -alpn h2
</code></pre>
<p>You must see:</p>
<pre><code class="language-plaintext">ALPN protocol: h2
</code></pre>
<p>If you see <code>No ALPN negotiated</code>, gRPC will not work through the tunnel. Check your server's TLS configuration.</p>
<blockquote>
<p><strong>Key point:</strong> <code>grpc-go</code> automatically advertises <code>ALPN: h2</code> when you use <code>credentials.NewServerTLSFromFile</code>. This is what enables cloudflared to negotiate HTTP/2.</p>
</blockquote>
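<p>The same ALPN check can be sketched in Go using only the standard library. This is an illustration, not the Bluppi server: it creates a throwaway in-memory certificate in place of the files from Step 1, starts a TLS listener that advertises the given protocols, and connects to it the way <code>openssl s_client -alpn h2</code> does. The helper names <code>selfSigned</code> and <code>negotiatedALPN</code> are made up for this example.</p>
<pre><code class="language-go">package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned builds a throwaway in-memory certificate, standing in for
// the certs/server.crt and certs/server.key files generated in Step 1.
func selfSigned() (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := new(x509.Certificate)
	tmpl.SerialNumber = big.NewInt(1)
	tmpl.Subject = pkix.Name{CommonName: "localhost"}
	tmpl.NotBefore = time.Now().Add(-time.Hour)
	tmpl.NotAfter = time.Now().Add(time.Hour)
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, key.Public(), key)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

// negotiatedALPN starts a local TLS listener that advertises serverProtos,
// dials it while offering "h2", and returns the protocol both sides agreed on.
// An empty string means no ALPN protocol was negotiated.
func negotiatedALPN(serverProtos []string) (string, error) {
	cert, err := selfSigned()
	if err != nil {
		return "", err
	}
	serverCfg := new(tls.Config)
	serverCfg.Certificates = []tls.Certificate{cert}
	serverCfg.NextProtos = serverProtos

	ln, err := tls.Listen("tcp", "127.0.0.1:0", serverCfg)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// Accept a single connection and finish the handshake on the server side.
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		_ = conn.(*tls.Conn).Handshake()
	}()

	clientCfg := new(tls.Config)
	clientCfg.InsecureSkipVerify = true // same idea as noTLSVerify: true
	clientCfg.NextProtos = []string{"h2"}

	conn, err := tls.Dial("tcp", ln.Addr().String(), clientCfg)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return conn.ConnectionState().NegotiatedProtocol, nil
}

func main() {
	proto, err := negotiatedALPN([]string{"h2"})
	if err != nil {
		fmt.Println("handshake failed:", err)
		return
	}
	fmt.Printf("ALPN protocol: %q\n", proto)
}
</code></pre>
<p>If the listener is started without <code>h2</code> in <code>NextProtos</code>, the negotiated protocol comes back empty, which corresponds to the <code>No ALPN negotiated</code> case above.</p>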
<hr />
<h2>Part 4: Final Working Configuration</h2>
<h3><code>docker-compose.yml</code></h3>
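<p>Since <code>cloudflared</code> runs as a container, it reaches the backend services by their service names (for example <code>bluppi-api:50051</code>). The backend services must be attached to the same <code>cloudflared</code> Docker network; the tunnel does not need any ports published to the host.</p>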
<pre><code class="language-yaml">  tunnel:
    container_name: bluppi-tunnel
    image: cloudflare/cloudflared:latest
    command: tunnel --no-autoupdate --protocol http2 --ha-connections 1 --config /etc/cloudflared/config.yml run
    restart: unless-stopped
    networks:
      - cloudflared
    depends_on:
      - bluppi-api
      - bluppi-gateway
      - audio-api
    volumes:
      - ./cloudflared:/etc/cloudflared:ro

networks:
  cloudflared:
    name: cloudflared
</code></pre>
<h3><code>cloudflared/config.yml</code></h3>
<pre><code class="language-yaml">tunnel: &lt;tunnel-id&gt;
credentials-file: /etc/cloudflared/&lt;tunnel-id&gt;.json
ha-connections: 1
protocol: http2

ingress:
  # 1. Dedicated REST API
  - hostname: bluppi.saikat.in
    path: /api/rest/.*
    service: http://bluppi-audio-api:8000

  # 2. gRPC Gateway
  - hostname: bluppi.saikat.in
    path: /api/v2/.*
    service: https://bluppi-gateway:50050
    originRequest:
      noTLSVerify: true
      http2Origin: true

  # 3. Pure gRPC API (Catch-All)
  - hostname: bluppi.saikat.in
    service: https://bluppi-api:50051
    originRequest:
      noTLSVerify: true
      http2Origin: true

  - service: http_status:404
</code></pre>
<p><strong>Why</strong> <code>https://</code> <strong>not</strong> <code>h2c://</code><strong>?</strong><br /><code>cloudflared</code> does NOT support <code>h2c://</code> (HTTP/2 cleartext) as an origin protocol — see <a href="https://github.com/cloudflare/cloudflared/issues/1304">issue #1304</a>. You must use <code>https://</code> with a TLS-enabled origin, then add <code>noTLSVerify: true</code> to accept the self-signed cert, and <code>http2Origin: true</code> to force HTTP/2 negotiation.</p>
<hr />
<h2>Part 5: Verification</h2>
<h3>Test gRPC end-to-end with grpcurl</h3>
<pre><code class="language-bash"># Install grpcurl first (or use Postman instead)

# Test locally (bypassing the tunnel) as a baseline.
# The origin now serves TLS with a self-signed cert, so use -insecure instead of -plaintext.
grpcurl -insecure localhost:50051 list
# Expected: Unauthenticated (the server is up and my JWT auth is rejecting the unauthenticated request)

# Test through the Cloudflare tunnel
grpcurl bluppi.saikat.in:443 list
# Expected: the same Unauthenticated error = tunnel is fully transparent ✅
</code></pre>
<hr />
<h2>Part 6: Complete Mistake Summary</h2>
<table>
<thead>
<tr>
<th>#</th>
<th>Symptom</th>
<th>Root Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><code>dial tcp ...:7844: i/o timeout</code></td>
<td>College network blocks port 7844</td>
<td><code>--protocol http2</code></td>
</tr>
<tr>
<td>2</td>
<td><code>already connected to this server</code></td>
<td><code>del03</code> has limited edge IPs, 4 connections clash</td>
<td><code>ha-connections: 1</code></td>
</tr>
<tr>
<td>3</td>
<td><code>malformed HTTP response \x00\x00\x06\x04</code></td>
<td><code>grpc://</code> is not valid in cloudflared</td>
<td>Use <code>https://</code> with TLS on the origin (<code>h2c://</code> fails too, see Mistake 4)</td>
</tr>
<tr>
<td>4</td>
<td><code>use of closed network connection</code> / <code>EOF</code></td>
<td>cloudflared uses HTTP/1.1 to origin by default</td>
<td><code>http2Origin: true</code></td>
</tr>
<tr>
<td>5</td>
<td><code>No ALPN negotiated</code></td>
<td>gRPC server not advertising h2 during TLS handshake</td>
<td>Use <code>credentials.NewServerTLSFromFile</code> in Go</td>
</tr>
<tr>
<td>6</td>
<td><code>QUIC timeout</code> on <code>--token</code> mode</td>
<td>cloudflared token mode defaults to QUIC (UDP), also blocked</td>
<td><code>--protocol http2</code> flag</td>
</tr>
</tbody></table>
<hr />
<h2>Key Takeaways</h2>
<ol>
<li><p><code>grpc://</code> <strong>is not a valid cloudflared service protocol</strong>, and <code>h2c://</code> does not work either: use <code>https://</code> with a TLS-enabled origin</p>
</li>
<li><p><strong>Cloudflare Free plan cannot proxy gRPC on public hostnames</strong>: Cloudflare's gRPC toggle requires a Pro plan to work properly</p>
</li>
<li><p><strong>cloudflared connects to origins using h2 (TLS), not h2c (cleartext)</strong> — your gRPC origin MUST serve TLS, even behind a private Docker network</p>
</li>
<li><p><code>http2Origin: true</code> <strong>is mandatory</strong> — without it, cloudflared uses HTTP/1.1 to the origin, which gRPC servers reject immediately</p>
</li>
<li><p><code>noTLSVerify: true</code> <strong>is safe inside Docker</strong> — the tunnel itself provides encryption; internal self-signed certs are fine</p>
</li>
<li><p><strong>Always test with</strong> <code>grpcurl</code> <strong>before testing Flutter</strong> — if <code>grpcurl bluppi.saikat.in:443 list</code> returns <code>Unauthenticated</code> (same as localhost), your Flutter app will work</p>
</li>
<li><p><strong>College/restricted networks block UDP and non-standard ports</strong> — always force <code>--protocol http2</code> to use TCP 443</p>
</li>
</ol>
<hr />
<h2>References</h2>
<ul>
<li><p><a href="https://github.com/cloudflare/cloudflared/issues/491">cloudflared issue #491 — gRPC support request</a></p>
</li>
<li><p><a href="https://github.com/cloudflare/cloudflared/issues/1304">cloudflared issue #1304 — Support h2c origin servers</a></p>
</li>
<li><p><a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/">Cloudflare Docs — Cloudflare Tunnel</a></p>
</li>
</ul>
<hr />
<p><em>Written after 6 hours of debugging. Hopefully this saves you the same pain.</em></p>
]]></content:encoded></item><item><title><![CDATA[Firebase Draining Your Wallet? Here's How to Stop Paying for Every Single Read.]]></title><description><![CDATA[❗
In this blog, we’ll focus on using Dart/Flutter for our code examples, but keep in mind that you can apply these methods in any programming language. The key here is to focus on the techniques, not ]]></description><link>https://blog.saikat.in/firebase-draining-your-wallet-here-s-how-to-stop-paying-for-every-single-read</link><guid isPermaLink="true">https://blog.saikat.in/firebase-draining-your-wallet-here-s-how-to-stop-paying-for-every-single-read</guid><category><![CDATA[Firebase]]></category><category><![CDATA[Flutter]]></category><dc:creator><![CDATA[Saikat Das]]></dc:creator><pubDate>Sat, 14 Mar 2026 22:46:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734680734208/641c02b8-81c2-47aa-84ca-c7100eb23a41.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div>
<div>❗</div>
<div>In this blog, we’ll focus on using Dart/Flutter for our code examples, but keep in mind that you can apply these methods in any programming language. The key here is to focus on the techniques, not just the language!</div>
</div>

<p>If you're here, you probably already know that Firebase is great for building apps with real-time data. However, many developers run into a big issue—Firebase costs can quickly get out of hand due to too many read operations.</p>
<p>In 2018, a startup in Colombia faced this exact problem when they scaled up to 2 million daily active users (DAUs). A small but costly mistake in their code left them with a <strong>$30,356.56</strong> bill from Google Cloud in just 72 hours. Why? Because that tiny error caused 2 million users to each trigger 16,000 document reads, leading to over 40 billion requests to Firestore in less than 48 hours. <a href="https://hackernoon.com/how-we-spent-30k-usd-in-firebase-in-less-than-72-hours-307490bd24d"><em>Read the full article</em></a></p>
<p>You can avoid such excessive costs by optimizing your reads. In this blog, we'll dive into specific strategies to stop paying for every single read without sacrificing app performance.</p>
<h3>Understanding Firebase Read Costs</h3>
<p>Firestore's read pricing is simple: you are charged for every document you read. Each time a document is read from the database, whether through a query, a real-time listener, or a simple fetch, it counts as a read.</p>
<p>Check out the <a href="https://cloud.google.com/firestore/pricing#firestore-pricing">Firestore Pricing</a> here.</p>
<p>Before jumping into solutions, it’s important to understand where things can go wrong.</p>
<ul>
<li><p><strong>Fetching unnecessary data:</strong> For instance, if you’re retrieving large collections when you only need a few documents, you’re paying for extra reads.</p>
</li>
<li><p><strong>Inefficient queries</strong>: Poor query structuring can result in pulling unnecessary data or performing repeated reads.</p>
</li>
<li><p><strong>Overuse of real-time syncing</strong>: Real-time listeners are great, but they can result in multiple reads as they sync data even when it’s not needed immediately.</p>
</li>
</ul>
<hr />
<h2>Strategies to Reduce Firebase Read Costs</h2>
<h3>Optimize Your Data Structure</h3>
<p>A well-organized database can greatly cut down the number of reads. If your Firestore database isn't structured efficiently, it can cause you to fetch more data than needed, leading to extra read costs.</p>
<p>Choosing when to use nested data vs flattened data is crucial for reducing unnecessary reads in Firestore.</p>
<table>
<thead>
<tr>
<th><strong>Nested Data</strong></th>
<th><strong>Flattened Data</strong></th>
</tr>
</thead>
<tbody><tr>
<td>If you often need to access both parent and child data at the same time, using a nested structure lets you fetch them with just one read.</td>
<td>Use this structure when you expect many related items, such as comments, tasks, or messages, that can quickly exceed document size limits.</td>
</tr>
</tbody></table>
<div>
<div>💡</div>
<div>Split data into multiple collections to avoid reading large, unnecessary datasets.</div>
</div>

<p><strong>Example</strong>: Instead of storing all user posts in one collection, divide them into smaller subcollections by category or user ID.</p>
<h3>Query Optimization</h3>
<p>Firestore is a NoSQL database, and it's important to use its indexing features. Writing efficient queries can help lower the number of documents read.</p>
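<p>Firestore automatically creates a single-field index for every field in a document, so simple queries work out of the box. For more complex queries, such as combining an equality filter on one field with a range filter on another, you need to create a composite index.</p>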
<p><strong>Example</strong>: If you want to query users by both age and city, you'll need a composite index for <code>age</code> and <code>city</code>.</p>
<blockquote>
<p>Filtering documents with <code>where()</code> clauses retrieves only the necessary data.</p>
</blockquote>
<pre><code class="language-dart">final QuerySnapshot querySnapshot = await FirebaseFirestore.instance
      .collection('users')
      .where('age', isGreaterThanOrEqualTo: 18)
      .where('city', isEqualTo: 'New York')
      .get();
</code></pre>
<p><em>Make sure to combine filters that make sense together to reduce the number of reads.</em></p>
<blockquote>
<p><strong>Implement Pagination Using</strong> <code>limit()</code> for Efficient Data Retrieval</p>
</blockquote>
<pre><code class="language-dart">const int pageSize = 10;
DocumentSnapshot? lastVisible;

Future&lt;void&gt; getPosts() async {
  Query&lt;Map&lt;String, dynamic&gt;&gt; query = FirebaseFirestore.instance
      .collection('posts')
      .orderBy('createdAt')
      .limit(pageSize);

  // The first page has no cursor yet; only add it for later pages.
  if (lastVisible != null) {
    query = query.startAfterDocument(lastVisible!);
  }

  final querySnapshot = await query.get();
  for (final doc in querySnapshot.docs) {
    print(doc.data());
  }

  // Keep the previous cursor when a page comes back empty.
  if (querySnapshot.docs.isNotEmpty) {
    lastVisible = querySnapshot.docs.last;
  }
}
</code></pre>
<p>Limiting documents can significantly reduce costs and improve performance, especially when dealing with large datasets.</p>
<div>
<div>💡</div>
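<div>Firestore serves every query from an index and rejects queries that have no matching index, so it never silently scans a collection. The expensive case is calling <code>get()</code> on a whole collection without a filter or <code>limit()</code>, which bills one read per document. Always narrow your queries with filters or limits.</div>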
</div>

<h3>Avoid Unnecessary Real-time Listeners</h3>
<p>Real-time listeners are powerful but expensive if used without careful consideration. In many cases, you don't need real-time updates for every piece of data.</p>
<p><strong>Manual Refreshing:</strong> Instead of relying on real-time updates, users can manually trigger a refresh to load the latest data.</p>
<div>
<div>💡</div>
<div>Firestore offers offline persistence, enabling data to be cached on the client and synchronized later when the app reconnects. This feature can significantly reduce the number of reads, minimizing network calls during offline usage.</div>
</div>

<p>Next, we will look at more advanced caching methods that can greatly reduce your Firestore costs. These strategies go beyond basic offline caching and can save you a lot of money while improving your app's performance.</p>
<hr />
<h1>Advanced Strategies to Reduce Firebase Costs</h1>
<p>You can use <code>SharedPreferences</code> to store small pieces of data that are frequently accessed, like user info or settings. This works well for data that doesn’t change much. However, <code>SharedPreferences</code> is only suitable for small amounts of data. If you store too much, it can slow down your app, use more memory, and hit storage limits.</p>
<pre><code class="language-dart">import 'package:shared_preferences/shared_preferences.dart';

Future&lt;void&gt; saveUserInfo(String name, int age, String email) async {
  final SharedPreferences prefs = await SharedPreferences.getInstance();
  await prefs.setString('userName', name);
  await prefs.setInt('userAge', age);
  await prefs.setString('userEmail', email);
}
</code></pre>
<p>For example, you can store user information in SharedPreferences as small chunks of data that are repeatedly used throughout your application.</p>
<h3>Use Provider for efficient state management to reduce Firestore reads</h3>
<p>Without proper state management, your app might request the same data from Firestore multiple times, increasing the number of reads unnecessarily. By using Provider, once data is fetched from Firestore, it can be stored and accessed efficiently throughout the app without additional Firestore calls.</p>
<pre><code class="language-dart">import 'package:flutter/material.dart';
import 'package:cloud_firestore/cloud_firestore.dart';

class UserProvider extends ChangeNotifier {
  Map&lt;String, dynamic&gt;? _userData;
  Map&lt;String, dynamic&gt;? get userData =&gt; _userData;

  Future&lt;void&gt; fetchUserData(String userId) async {
    if (_userData != null) return; // Skip fetching if data is already loaded

    try {
      final doc = await FirebaseFirestore.instance.collection('users').doc(userId).get();
      _userData = doc.data();
    } catch (e) {
      print('Error fetching user data: $e');
    } 
      
    notifyListeners();
  }
}
</code></pre>
<p>Don’t forget to add the Provider class to the <code>main.dart</code> file and then use it in the UI. This ensures the data is fetched only once when the app starts and is shared efficiently across all UI components, screens, and pages.</p>
<p>If you know that the data is not going to change frequently, you can save it on the client side, essentially caching the data. This can be achieved through various methods, such as using <code>Hive</code>, <code>SQLite</code>, or <code>path_provider</code>, each serving a different purpose.</p>
<table>
<thead>
<tr>
<th><strong>Hive</strong></th>
<th><strong>SQLite</strong></th>
<th><strong>path_provider</strong></th>
</tr>
</thead>
<tbody><tr>
<td>NoSQL, key-value pairs</td>
<td>Relational (SQL queries)</td>
<td>File system access (no database)</td>
</tr>
<tr>
<td>Simple setup, fast for small data</td>
<td>Complex setup, powerful for large datasets</td>
<td>Minimal setup, no querying or relations</td>
</tr>
</tbody></table>
<div>
<div>💡</div>
<div>Although <strong>path_provider</strong> is easy to set up, it is not efficient for storing JSON data, as it requires serializing and de-serializing the data into your model, which can be a heavy task. Therefore, <strong>Hive</strong> is often used instead.</div>
</div>
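<p>Whichever storage you pick, the caching logic itself is the same in any language. Here is a minimal, language-neutral sketch written in Go rather than Dart: a read-through cache that only calls the backend again once an entry is older than a time-to-live. The <code>ReadThroughCache</code> type, its fields, and the <code>fetch</code> function (standing in for a billed Firestore read) are all hypothetical names for illustration, and the cache lives in memory only, unlike Hive or SQLite.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is one cached value plus the moment it stops being fresh.
type entry struct {
	value   string
	expires time.Time
}

// ReadThroughCache serves cached values until they expire and only then
// calls fetch again. fetch stands in for a billed Firestore read.
type ReadThroughCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time
	fetch func(id string) (string, error)
	items map[string]entry
}

// Get returns the cached value for id, fetching it when missing or stale.
func (c *ReadThroughCache) Get(id string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.items[id]; ok {
		if c.now().Before(e.expires) {
			return e.value, nil // fresh: no billed read
		}
	}
	v, err := c.fetch(id)
	if err != nil {
		return "", err
	}
	c.items[id] = entry{value: v, expires: c.now().Add(c.ttl)}
	return v, nil
}

func main() {
	reads := 0
	cache := ReadThroughCache{
		ttl:   time.Minute,
		now:   time.Now,
		items: map[string]entry{},
		fetch: func(id string) (string, error) {
			reads++ // count how often the backend is actually hit
			return "profile-of-" + id, nil
		},
	}

	cache.Get("alice")
	cache.Get("alice")
	cache.Get("alice")
	fmt.Println("billed reads:", reads)
}
</code></pre>
<p>Three lookups result in a single billed read. The TTL is the knob to tune: the less often the data changes, the longer it can be.</p>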

<hr />
<h2>Server-Side Optimization for Cost Reduction</h2>
<p>If all the users are requesting the same data from your Firebase database, it can unnecessarily increase your reads. To avoid this, you can use the Firebase Admin SDK to create bundles, request the data only once, store it in those bundles, and send the bundles to the clients. This will reduce Firebase reads and improve your app's performance, as it only needs to request the data once.</p>
<pre><code class="language-python">from google.cloud import firestore
from google.cloud.firestore_bundle import FirestoreBundle

db = firestore.Client()
bundle = FirestoreBundle("bundle-name")

# stream() already yields DocumentSnapshot objects, which is what
# add_document() expects, so no extra .get() call is needed.
for user in db.collection("users").stream():
    bundle.add_document(user)

    for post in db.collection("posts").where("user_id", "==", user.id).stream():
        bundle.add_document(post)

# build() returns the serialized bundle as a string.
bundle_buffer = bundle.build()

with open("cacheBundle.bin", "wb") as file:
    file.write(bundle_buffer.encode("utf-8"))
</code></pre>
<p>The above code snippet might not be entirely correct as I forgot the exact syntax, but it was something similar to this. <a href="https://firebase.google.com/docs/firestore/bundles#python">Check Docs</a></p>
<p>Once you've created the Firestore bundle and serialized it, you can save the bundle as a file (e.g., <code>cacheBundle.bin</code>) and serve it to all of your clients over HTTP or upload it to <strong>Cloud Storage for Firebase</strong>. This way, clients can load the cached data from the bundle instead of repeatedly querying Firestore.</p>
<h2>Conclusion</h2>
<p>Optimizing Firebase reads can help you save costs and improve app performance. By structuring your data efficiently, optimizing queries, minimizing real-time listeners, and using caching strategies, you can reduce unnecessary reads without sacrificing performance.</p>
<p>Feel free to share your own methods or any corrections in the comments below. Happy coding, and may your Firebase costs stay low while your app thrives!</p>
]]></content:encoded></item></channel></rss>