BACK TO BLOG

ARCHITECTURES

OCT 24, 2023

12 min read

Scaling Microservices with Rust and gRPC

Deep dive into why we transitioned our core message broker to Rust and the 40% latency reduction we observed in production.

Rust

gRPC

Microservices

Architecture

Scaling Microservices with Rust and gRPC

When our engineering team first floated the idea of migrating our core message broker from Go to Rust, the room went quiet. We had a functioning system—one that had scaled us to 50,000 concurrent connections with sub-100ms p99 latency. Why fix what wasn't broken?

The Turning Point

The answer arrived during a post-mortem following a cascade failure triggered by a memory leak in one of our Go goroutines. The leak was subtle—triggered only under a very specific sequence of gRPC stream cancellations. By the time our alerting fired, three downstream services had already timed out.

Rust's ownership model isn't just a memory safety guarantee—it's a forcing function that makes entire classes of concurrency bugs structurally impossible to express.

Architecture Overview

Our new broker is built on Tokio, Rust's async runtime, with Tonic for gRPC transport. The service handles bidirectional streaming with backpressure built into the protocol layer itself. Here's the core stream handler:

rust

pub async fn handle_stream(
    &self,
    mut inbound: Streaming<Payload>,
) -> Result<Response<Streaming<Ack>>, Status> {
    let (tx, rx) = mpsc::channel(128);
    tokio::spawn(async move {
        while let Some(payload) = inbound.next().await {
            let ack = process(payload?).await?;
            tx.send(Ok(ack)).await?;
        }
        Ok::<_, Box<dyn Error>>(())
    });
    Ok(Response::new(ReceiverStream::new(rx)))
}

Benchmark Results

After three months in production, the numbers are in. Compared to our previous Go implementation running on identical hardware:

  • p50 latency: 4.2ms → 2.1ms (50% improvement)
  • p99 latency: 89ms → 53ms (40% improvement)
  • Memory footprint: 1.4GB → 420MB under peak load
  • CPU utilization: 68% → 41% at 50K connections
  • Zero memory-related incidents in 90 days of production

Lessons Learned

The migration took 14 weeks—longer than projected—primarily due to the learning curve around Rust's lifetime annotations in async contexts. If you're considering a similar path, budget 30% extra time for the team ramp-up phase. The long-term gains are absolutely worth it, but the short-term cost is real.

Stay Synchronized

Get notified when new technical logs are published. No spam, only signal.