Service Incident Report - April 21, 2026
Apr 21, 2026 at 12:59pm UTC
Affected services
app.langwatch.ai
Collector
Processor
Evaluations and Guardrails
Workflows
Automations
Incident Summary
- Date: April 21, 2026
- Duration: ~20-minute full outage during recovery (~10:15–10:35 UTC, Apr 21), plus ~14 hours of degraded replication (20:52 UTC Apr 20 → 10:36 UTC Apr 21)
- Severity: Moderate
- Status: Resolved
What happened
- A routine infrastructure deployment restarted the database coordination nodes too quickly.
- The nodes did not rejoin the 3-node consensus cluster in time, breaking quorum and halting coordination between replicas.
- The platform largely continued to operate (reads and writes were accepted), but write coordination and replication between replicas paused.
- During recovery, the application was restarted to reconnect to the coordination layer, causing a ~20-minute full outage.
Customer impact
- Full outage (~10:15–10:35 UTC, Apr 21):
  - Dashboard, API, and data ingestion unavailable.
  - Traces/spans sent during this window were not ingested and cannot be recovered.
- Degraded replication (20:52 UTC Apr 20 – 10:36 UTC Apr 21):
  - Reads/writes served, but internal replication paused.
  - Likely minimal or no user-visible impact.
- Experiments & scenarios:
  - In-flight runs/executions during the outage may have seen delayed or failed status updates.
- Data integrity:
  - Historical data ingested before/after the outage verified intact and available.
Timeline (UTC)
| Time (UTC) | Event |
|---|---|
| Apr 20, 20:52 | Routine infrastructure deployment begins; coordination layer disrupted; internal replication paused; platform continues serving requests. |
| Apr 20, 21:06 | Coordination nodes attempt to recover but remain degraded. |
| Apr 21, ~09:50 | Issue identified during morning operations review. |
| Apr 21, 10:00 | Recovery efforts begin. |
| Apr 21, ~10:15 | Application becomes unavailable during recovery; ingestion stops. |
| Apr 21, ~10:35 | Application restored; ingestion resumes. |
| Apr 21, 10:36 | Coordination layer fully restored. |
| Apr 21, 11:35 | All database tables verified; full service confirmed. |
Root cause
- The deployment restarted the coordination nodes in a way that violated quorum requirements.
- The consensus protocol needs 2 of 3 nodes available (majority quorum); all 3 were restarted within ~1 minute, preventing a timely rejoin.
- Loss of quorum prevented replication of writes across nodes; an application restart during recovery triggered the full-outage window.
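The quorum requirement above can be made concrete: with 3 nodes, a majority of 2 must stay up at all times, so a safe rollout restarts one node and waits for it to rejoin before touching the next. A minimal sketch, where the node names, health check, and restart are simulated stand-ins (not LangWatch's actual tooling):

```python
import time

QUORUM = 2  # majority of a 3-node coordination cluster

# Simulated cluster state (node names are illustrative).
cluster = {"coord-1": True, "coord-2": True, "coord-3": True}

def is_healthy(node: str) -> bool:
    # Stand-in for querying the node's consensus-membership endpoint.
    return cluster[node]

def restart(node: str) -> None:
    # Simulated restart: the node drops out, then rejoins.
    cluster[node] = False
    cluster[node] = True  # a real rejoin takes seconds to minutes

def rolling_restart(nodes):
    """Restart one node at a time, refusing any restart that would
    drop the number of healthy nodes below quorum."""
    for node in nodes:
        healthy = sum(is_healthy(n) for n in nodes)
        if healthy - 1 < QUORUM:
            raise RuntimeError(f"restarting {node} would break quorum")
        restart(node)
        while not is_healthy(node):  # gate before touching the next node
            time.sleep(1)
```

Restarting all three nodes within the same minute, as happened here, skips that gate entirely: at some instant fewer than 2 members are available, and the cluster cannot make progress until nodes re-form a majority.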
Preventative actions
- Quorum-aware deployment gates: health checks ensure each coordination node fully rejoins before restarting the next. (Implemented)
- Staggered deployments: coordination layer and database server updates won’t be deployed simultaneously. (Implemented)
- Improved monitoring: alerting to detect coordination health issues within minutes. (Implemented)
- Application resilience: retry logic so temporary coordination unavailability doesn’t block serving.
- Decouple migrations from deploys: run migrations independently so coordination issues don’t block app startup.
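The application-resilience item above amounts to retrying coordination-layer calls with backoff instead of failing the request. A hedged sketch (the wrapper, error type, and delays are illustrative assumptions, not the platform's actual code):

```python
import time

def with_retries(op, attempts=5, base_delay=0.5):
    """Run `op`, retrying with exponential backoff on connection errors,
    so a brief coordination-layer blip doesn't fail the request outright."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

A caller would wrap only transient-failure-prone operations, e.g. `with_retries(lambda: coordinator.fetch_config())`, keeping hard failures (bad credentials, invalid requests) fast-failing.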
Questions
- Contact support with any questions about the impact to your data.