Cutting incident MTTR from 4h to 25min for a Series-B SaaS

Took ownership of platform engineering for a fast-growing B2B SaaS. Rebuilt the deploy pipeline, instrumented SLOs, and ran the on-call rotation alongside the in-house team.

Incident MTTR reduction: ~90%
Deploy frequency: 3x
Core API SLO sustained: 99.95%
Weekend pages, first quarter: 0

The situation

A Series-B SaaS with ~50 engineers had outgrown the platform tooling that took them from seed to Series B. Releases were manual, observability was scattered across three tools, and on-call rotations were dreaded: incidents typically took 3–5 hours to resolve because finding the root cause was harder than fixing it.

Leadership wanted to invest in product, not in re-platforming. They needed a senior team that could own the unsexy infrastructure work without distracting the product engineers.

What we did

  • Replaced the deploy pipeline. Moved from a custom shell-based deploy to GitHub Actions with IaC, signed artifacts, branch previews, and automatic blue/green rollouts in production.
  • Instrumented SLOs. Defined customer-facing SLOs for the top five user journeys, wired up burn-rate alerts, and retired the alert noise that was training engineers to ignore pages.
  • Unified observability. Consolidated metrics, logs, and traces onto a single tool. Built per-service dashboards, each owned by the team that runs the service.
  • Joined the on-call rotation. Took primary on-call for the platform tier. Wrote runbooks for every recurring incident class and ran blameless postmortems.
  • Knowledge transfer. By month six, the in-house platform engineers had taken back primary on-call with everything documented.
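
To make the burn-rate alerting idea concrete, here is a minimal sketch in Python. It is not the engagement's actual alert configuration. It assumes a request-based SLO and a two-window check, and the 14.4 threshold is borrowed from common SRE guidance for a 30-day window (2% of the budget spent in one hour), not from this case study. The names Window, burn_rate, and should_page are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    good: int   # requests that met the SLO
    total: int  # all requests


def burn_rate(window: Window, slo_target: float) -> float:
    """Rate at which the error budget is being spent; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_rate = 1 - window.good / window.total
    error_budget = 1 - slo_target
    return error_rate / error_budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float = 0.9995,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn faster than the threshold.

    The short window confirms the problem is still happening, which
    suppresses pages for incidents that have already recovered.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


# Assumed example: 1% of requests failing in both windows against a 99.95% SLO.
long_window = Window(good=99_000, total=100_000)
short_window = Window(good=9_900, total=10_000)
print(should_page(long_window, short_window))  # burn rate is about 20 in both windows
```

Requiring both windows to exceed the threshold is one common way to cut alert noise: a brief spike that has already cleared fails the short-window check and never pages anyone.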

Outcomes

  • Incident MTTR fell from ~4 hours to ~25 minutes.
  • Deploy frequency tripled, with rollback time under 60 seconds.
  • The platform team's on-call burden dropped enough that engineers stopped requesting transfers off the team.
  • The core API held a 99.95% availability SLO for two consecutive quarters.
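
The arithmetic behind these figures can be sketched as follows. This is a back-of-the-envelope check, assuming a 90-day quarter and treating the error budget as minutes of full outage; both are simplifications of ours, and the function names are hypothetical.

```python
def allowed_downtime_minutes(slo_target: float, period_days: int) -> float:
    """Error budget expressed as minutes of full outage over the period."""
    return (1 - slo_target) * period_days * 24 * 60


def reduction_pct(before_minutes: float, after_minutes: float) -> float:
    """Percentage reduction from the earlier value to the later one."""
    return (before_minutes - after_minutes) / before_minutes * 100


# 99.95% over an assumed 90-day quarter leaves about 64.8 minutes of budget.
budget = allowed_downtime_minutes(0.9995, 90)
print(f"Quarterly budget: {budget:.1f} min")

# ~4 hours (240 min) down to ~25 min is roughly a 90% reduction.
print(f"MTTR reduction: {reduction_pct(240, 25):.1f}%")
```

Under those assumptions, a single full outage resolved at the old ~4 hour MTTR would overrun the quarterly budget on its own, while one resolved in ~25 minutes fits within it.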

Stack

AWS (ECS Fargate, RDS, ElastiCache), Terraform, GitHub Actions, Datadog, PagerDuty, Postgres, Redis, TypeScript / Node.js, Python.

Want a similar turnaround?

Tell us about your platform reliability problems. We'll respond within one business day to set up a 30-minute call and scope the right engagement.

Book a call