Service Incident Report - April 21, 2026

Apr 21, 2026 at 12:59pm UTC
Affected services
app.langwatch.ai
Collector
Processor
Evaluations and Guardrails
Workflows
Automations

Resolved
Apr 21, 2026 at 12:59pm UTC

Incident Summary

  • Date: April 21, 2026
  • Duration: ~20-minute outage during recovery (~10:15–10:35 UTC, Apr 21) plus ~12 hours of degraded replication (20:52 UTC Apr 20 – 10:36 UTC Apr 21)
  • Severity: Moderate
  • Status: Resolved

What happened

  • Routine infrastructure deployment restarted the database coordination nodes too quickly.
  • Nodes did not rejoin the 3-node consensus cluster in time, causing loss of quorum and coordination between replicas.
  • Platform largely continued to operate (reads/writes accepted), but write coordination/replication between replicas paused.
  • During recovery, the application was restarted to reconnect to the coordination layer, causing a ~20-minute full outage.

Customer impact

  • Full outage (~10:15–10:35 UTC, Apr 21):
    • Dashboard, API, and data ingestion unavailable.
    • Traces/spans sent during this window were not ingested and cannot be recovered.
  • Degraded replication (20:52 UTC Apr 20 – 10:36 UTC Apr 21):
    • Reads/writes served, but internal replication paused.
    • Likely minimal/no user-visible impact.
  • Experiments & scenarios:
    • In-flight runs and executions during the outage may have delayed or failed status updates.
  • Data integrity:
    • Historical data ingested before and after the outage was verified intact and available.

Timeline (UTC)

Time (UTC) | Event
Apr 20, 20:52 | Routine infrastructure deployment begins; coordination layer disrupted; internal replication paused; platform continues serving requests.
Apr 20, 21:06 | Coordination nodes attempt to recover but remain degraded.
Apr 21, ~09:50 | Issue identified during morning operations review.
Apr 21, 10:00 | Recovery efforts begin.
Apr 21, ~10:15 | Application becomes unavailable during recovery; ingestion stops.
Apr 21, ~10:35 | Application restored; ingestion resumes.
Apr 21, 10:36 | Coordination layer fully restored.
Apr 21, 11:35 | All database tables verified; full service confirmed.

Root cause

  • Deployment restarted coordination nodes in a way that violated quorum requirements.
  • The consensus protocol requires 2 of 3 nodes to be available; all 3 were restarted within ~1 minute, preventing a timely rejoin.
  • Loss of quorum prevented replication of writes across nodes; the application restart during recovery triggered the outage window.
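The quorum arithmetic behind the failure can be sketched as follows. This is an illustration only: the node counts come from the report, but the helper names and the rolling-restart comparison are hypothetical.

```python
# Minimal sketch of 3-node majority-quorum arithmetic. The 3-node
# cluster size is from the report; everything else is illustrative.

CLUSTER_SIZE = 3
QUORUM = CLUSTER_SIZE // 2 + 1  # majority: 2 of 3 nodes must be up

def has_quorum(nodes_up: int) -> bool:
    """Consensus can make progress only while a majority is reachable."""
    return nodes_up >= QUORUM

# Rolling restart: at most one node down at any moment, so quorum holds
# at every step.
rolling_ok = all(has_quorum(CLUSTER_SIZE - 1) for _ in range(CLUSTER_SIZE))

# What happened here: all 3 nodes restarted within ~1 minute, so at some
# point zero nodes were available to form a majority.
simultaneous_ok = has_quorum(0)

print(rolling_ok)       # True: quorum never lost
print(simultaneous_ok)  # False: coordination lost until nodes rejoin
```

The same arithmetic explains why the platform kept serving reads and writes: individual replicas stayed up, but without a majority the coordination layer could not replicate writes between them.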

Preventative actions

  • Quorum-aware deployment gates: health checks ensure each coordination node fully rejoins before restarting the next. (Implemented)
  • Staggered deployments: coordination layer and database server updates won’t be deployed simultaneously. (Implemented)
  • Improved monitoring: alerting to detect coordination health issues within minutes. (Implemented)
  • Application resilience: retry logic so temporary coordination unavailability doesn’t block serving.
  • Decouple migrations from deploys: run migrations independently so coordination issues don’t block app startup.
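The "quorum-aware deployment gates" action above can be sketched as a rolling restart that waits for each coordination node to rejoin before touching the next. The node names, `restart`/`is_healthy` stand-ins, and timeouts below are hypothetical assumptions, not the actual deployment tooling.

```python
# Hypothetical sketch of a quorum-aware rolling restart gate: restart one
# coordination node at a time and block until it reports healthy, so at
# most one node is ever down and the cluster never loses quorum.
import time

NODES = ["coord-1", "coord-2", "coord-3"]  # hypothetical node names

def restart(node: str) -> None:
    print(f"restarting {node}")  # stand-in for the real restart command

def is_healthy(node: str) -> bool:
    return True  # stand-in for a real health/membership check

def rolling_restart(nodes, timeout_s: float = 300, poll_s: float = 5) -> bool:
    """Restart nodes one by one, gating each step on a health check."""
    for node in nodes:
        restart(node)
        deadline = time.monotonic() + timeout_s
        while not is_healthy(node):
            if time.monotonic() > deadline:
                return False  # abort the rollout: node failed to rejoin
            time.sleep(poll_s)
    return True

rolling_restart(NODES)
```

A gate like this directly prevents the trigger described in the root cause: with the health check in the loop, all three nodes can never be mid-restart within the same minute.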

Questions

  • Contact support with any questions about impact to your data.