3 min read · The SwyDex team

Postmortem: 23-minute degradation on Dec 4

Summary: Between 14:11 and 14:34 UTC on Dec 4, 2025, the SwyDex API returned HTTP 503 for roughly 35% of requests. No data was lost. Webhooks were delayed but all were eventually delivered.

Root cause: Our cron service holds a session-level Postgres advisory lock so that only one replica processes the sweep queue at a time. A long-running query starved the lock-holding process of CPU, so it stopped responding to its watchdog's health checks, and the watchdog killed it. Because the watchdog used SIGKILL, the process never closed its database connection cleanly; the orphaned server session kept holding the advisory lock, and the next sweep cycle blocked indefinitely waiting for it.
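The single-leader pattern above can be sketched like this. This is a minimal illustration, not our production code: the queue name, the key-derivation scheme, and the psycopg-style queries in the comments are all assumptions.

```python
# Single-leader sweep: one replica at a time holds a Postgres
# session-level advisory lock, keyed by a 64-bit integer.
import zlib


def advisory_lock_key(name: str) -> int:
    """Derive a stable advisory-lock key from a queue name.

    Postgres advisory locks are keyed by bigint; folding a CRC32 of the
    name into the low 32 bits is one simple (hypothetical) scheme.
    """
    return zlib.crc32(name.encode("utf-8"))


SWEEP_LOCK_KEY = advisory_lock_key("sweep-queue")

# Inside the cron loop (psycopg cursor `cur` assumed):
#   cur.execute("SELECT pg_try_advisory_lock(%s)", (SWEEP_LOCK_KEY,))
#   if cur.fetchone()[0]:
#       process_sweep_queue()  # hypothetical sweep worker
#       cur.execute("SELECT pg_advisory_unlock(%s)", (SWEEP_LOCK_KEY,))
```

Using `pg_try_advisory_lock` rather than the blocking `pg_advisory_lock` lets replicas that lose the race skip the cycle instead of queueing behind the leader.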

Detection: Pingdom alerted at 14:14. We were paged at 14:15.

Mitigation: Identified the orphaned backend session still holding the advisory lock and terminated it with pg_terminate_backend(); Postgres releases session-level advisory locks when the backend exits. Confirmed with our Postgres team that this is the intended recovery path.
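For anyone hitting the same situation, the recovery boils down to two statements: find the backend pid holding the advisory lock, then terminate that backend. A sketch, with the SQL held as Python constants; `pg_locks`, `pg_stat_activity`, and `pg_terminate_backend` are standard Postgres, while the helper name is ours.

```python
# Locate sessions holding advisory locks, then terminate the orphan.
# pg_terminate_backend() ends the backend process, and Postgres drops
# its session-level advisory locks as the session goes away.

FIND_ADVISORY_HOLDERS_SQL = """
SELECT l.pid, a.state, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory';
"""


def terminate_backend_sql(pid: int) -> str:
    """Build the termination statement for a given backend pid."""
    if not isinstance(pid, int) or pid <= 0:
        raise ValueError("pid must be a positive integer")
    return f"SELECT pg_terminate_backend({pid});"
```

Run the first query, pick out the pid whose `state` shows a dead or idle session that should no longer exist, and run the second against it.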

Why it took 23 minutes: 14 minutes to identify which session was holding the lock; 4 minutes to confirm the unlock approach; 5 minutes to execute and verify recovery.

Permanent fix: Changed our watchdog to send SIGTERM with a 30-second grace period before falling back to SIGKILL. On SIGTERM the cron shuts down cleanly and closes its Postgres connection, and Postgres releases session-level advisory locks on connection close, so as long as the process can still run its shutdown path, the lock frees within the grace period instead of lingering with an orphaned session.

If you observed elevated 5xx responses during this window: sorry. Webhook backlogs cleared by 14:51 UTC.

