Cutting incident MTTR from 4h to 25min for a Series-B SaaS
Took ownership of platform engineering for a fast-growing B2B SaaS. Rebuilt the deploy pipeline, instrumented SLOs, and ran the on-call rotation alongside the in-house team.
The situation
A Series-B SaaS company with ~50 engineers had outgrown the platform tooling that had carried it from seed to Series B. Releases were manual, observability was scattered across three tools, and engineers dreaded on-call. Incidents typically took 3–5 hours to resolve because finding the root cause was harder than fixing it.
Leadership wanted to invest in product, not in re-platforming. They needed a senior team that could own the unsexy infrastructure work without distracting the product engineers.
What we did
- Replaced the deploy pipeline. Moved from a custom shell-based deploy to GitHub Actions with IaC, signed artifacts, branch previews, and automatic blue/green rollouts in production.
- Instrumented SLOs. Defined customer-facing SLOs for the top five user journeys, wired up burn-rate alerts, and retired the alert noise that was training engineers to ignore pages.
- Unified observability. Consolidated metrics, logs, and traces onto a single tool. Built a dashboard for each service, owned by the team that runs that service.
- Joined the on-call rotation. Took primary on-call for the platform tier. Wrote runbooks for every recurring incident class and ran blameless postmortems.
- Transferred knowledge. By month six, the in-house platform engineers had taken back primary on-call, with everything documented.
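For readers unfamiliar with burn-rate alerts, the idea can be sketched in a few lines of Python. This is an illustration only, not the alerting configuration used on this engagement: the `Window` type, the `should_page` helper, and the 14.4 threshold (a commonly cited fast-burn value for a 30-day SLO window, equal to spending 2% of the budget in one hour) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 means the budget would be used up exactly at the end
    of the SLO period; higher values exhaust it sooner.
    """
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # e.g. 0.0005 for a 99.95% SLO
    return (window.errors / window.total) / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which avoids paging for incidents that have
    already recovered.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)
```

Requiring both windows to breach is one common way to cut the kind of alert noise described above: a brief spike that has already cleared raises the long-window rate but not the short-window one, so nobody is paged.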
Outcomes
- Incident MTTR fell from ~4 hours to ~25 minutes.
- Deploy frequency tripled, with rollback time under 60 seconds.
- The platform team's on-call burden dropped enough that engineers stopped requesting transfers off the team.
- The core API held a 99.95% availability SLO for two consecutive quarters.
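To put the 99.95% figure in context, the arithmetic below shows how much downtime such a target allows. It is a rough sketch under stated assumptions, not how the SLO was measured on this engagement: it treats availability as time-based, takes a quarter to be 90 days, and counts every incident as a full outage for its entire duration. Both function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, period_days: int) -> float:
    """Minutes of full downtime permitted by a time-based availability SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def full_outages_within_budget(budget_minutes: float, mttr_minutes: float) -> int:
    """Number of complete outages of the given length that fit inside the budget."""
    return int(budget_minutes // mttr_minutes)


# Assumed inputs: 99.95% target over a 90-day quarter.
budget = error_budget_minutes(0.9995, 90)           # about 64.8 minutes
at_old_mttr = full_outages_within_budget(budget, 240)  # ~4 hour incidents
at_new_mttr = full_outages_within_budget(budget, 25)   # ~25 minute incidents
```

Under these assumptions the quarterly budget is roughly 65 minutes, so a single 4-hour outage would exceed it several times over, while two 25-minute outages would still fit.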
Stack
AWS (ECS Fargate, RDS, ElastiCache), Terraform, GitHub Actions, Datadog, PagerDuty, Postgres, Redis, TypeScript / Node.js, Python.
Want a similar turnaround?
Tell us about your platform reliability problems. We'll respond within one business day with a 30-minute call to scope the right engagement.
Book a call