Case Study

Hunting a three-year intermittent lock problem

Live diagnostic sessions and a written report on database lock contention in a high-traffic rental-management SaaS — including the honest part, where the intermittent bug stayed hard to catch.

This was a database performance diagnostic — working sessions, artifact review, a written report, and an instrumentation plan. It also includes the part most case studies leave out: a genuinely hard intermittent problem that the tooling never fully cornered. I'd rather show that than pretend otherwise.

The company

A rental-business management SaaS: software for companies that rent out physical inventory (event and party rental, going by the data I saw). It is a mature, high-traffic Rails-on-MySQL system with 340 tables, the largest holding around 1.5 million rows, and an intermittent database performance problem that the client said had dogged them for roughly three years. They were explicit about what they wanted: consulting, not Rails programming.

What I did

The first round was two half-day working sessions plus a written report. I ran paired diagnostic sessions with their engineers and reviewed the artifacts they produced: the production schema, SHOW ENGINE INNODB STATUS output, deadlock records, and slow-query logs. The lock contention was concentrated in two patterns: the legacy background-job worker-claim UPDATE (with lock times pushing past 90 seconds) and deadlocks in the inventory-availability recalculation, a core product operation. I wrote it up in a report of about 12 pages, with causes and recommendations.

A few months later they re-engaged. They'd already acted on the first round, retiring their old background-job system and moving to Sidekiq. For the second round I prescribed a concrete instrumentation plan to catch the intermittent lock event in the act: more than a week of slow-query logging, Percona's pt-deadlock-logger running for days, and pt-stalk set to dump system state when locked queries crossed a threshold.

Here's the honest ending. pt-deadlock-logger caught events, but pt-stalk never managed to catch the intermittent lock condition, despite repeated tuning of the trigger parameters — the incidents were short and infrequent, and the investigation was still open when the engagement wound down. The client put it well: they felt they might be "chasing a red herring." That frustration was about a genuinely slippery problem, not the work, but I'm not going to claim a resolution that didn't happen.
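One way to see why short incidents can slip past this kind of trigger: the sketch below is a simplified Python model of a threshold-and-consecutive-samples rule, the general shape of pt-stalk's --threshold and --cycles options. It is not pt-stalk's code, and the names (Trigger, first_trigger) and all the numbers are hypothetical, chosen only to illustrate the idea, not taken from the client's system.

```python
from dataclasses import dataclass


@dataclass
class Trigger:
    """Hypothetical model of a threshold-plus-cycles trigger."""
    threshold: int  # a sample counts as "bad" when it exceeds this value
    cycles: int     # consecutive bad samples required before firing


def first_trigger(samples, trigger):
    """Return the index of the sample at which the trigger fires, or None.

    Assumption: the streak resets whenever a sample falls back to or
    below the threshold, so only consecutive bad samples count.
    """
    streak = 0
    for i, value in enumerate(samples):
        if value > trigger.threshold:
            streak += 1
            if streak >= trigger.cycles:
                return i
        else:
            streak = 0
    return None


# Illustrative data: one sample per polling interval.
quiet, spike = 2, 40
short_incident = [quiet] * 10 + [spike] * 3 + [quiet] * 10
long_incident = [quiet] * 10 + [spike] * 8 + [quiet] * 10

strict = Trigger(threshold=25, cycles=5)
print(first_trigger(short_incident, strict))  # None: 3 bad samples < 5 cycles
print(first_trigger(long_incident, strict))   # 14: fifth consecutive bad sample

loose = Trigger(threshold=25, cycles=1)
print(first_trigger(short_incident, loose))   # 10: fires on the first bad sample
```

Loosening the trigger catches shorter incidents in this model, but an incident that starts and ends between two polls produces no bad sample at all, and a looser trigger also fires more often on harmless spikes. That trade-off is the kind of tuning described above.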

Facing a similar problem? Let's talk about it.

Contact Me