Service Incident Report - April 21, 2026

Apr 21, 2026 at 12:59pm UTC
Affected services
app.langwatch.ai
Collector
Processor
Evaluations and Guardrails
Workflows
Automations

Resolved
Apr 21, 2026 at 12:59pm UTC

Incident Summary

  • Date: April 21, 2026
  • Duration: ~20-minute outage during recovery (~10:15–10:35 UTC, Apr 21) plus ~12 hours of degraded replication (20:52 UTC Apr 20 – 10:36 UTC Apr 21)
  • Severity: Moderate
  • Status: Resolved

What happened

  • Routine infrastructure deployment restarted the database coordination nodes too quickly.
  • Nodes did not rejoin the 3-node consensus cluster in time, causing loss of quorum and coordination between replicas.
  • Platform largely continued to operate (reads/writes accepted), but write coordination/replication between replicas paused.
  • During recovery, the application was restarted to reconnect to the coordination layer, causing a ~20-minute full outage.

Customer impact

  • Full outage (~10:15–10:35 UTC, Apr 21):
    • Dashboard, API, and data ingestion unavailable.
    • Traces/spans sent during this window were not ingested and cannot be recovered.
  • Degraded replication (20:52 UTC Apr 20 – 10:36 UTC Apr 21):
    • Reads/writes served, but internal replication paused.
    • Likely minimal/no user-visible impact.
  • Experiments & scenarios:
    • In-flight runs and executions during the outage may have delayed or failed status updates.
  • Data integrity:
    • Historical data ingested before and after the outage was verified intact and available.

Timeline (UTC)

Time (UTC) | Event
Apr 20, 20:52 | Routine infrastructure deployment begins; coordination layer disrupted; internal replication paused; platform continues serving requests.
Apr 20, 21:06 | Coordination nodes attempt to recover but remain degraded.
Apr 21, ~09:50 | Issue identified during morning operations review.
Apr 21, 10:00 | Recovery efforts begin.
Apr 21, ~10:15 | Application becomes unavailable during recovery; ingestion stops.
Apr 21, ~10:35 | Application restored; ingestion resumes.
Apr 21, 10:36 | Coordination layer fully restored.
Apr 21, 11:35 | All database tables verified; full service confirmed.

Root cause

  • Deployment restarted coordination nodes in a way that violated quorum requirements.
  • The consensus protocol requires 2 of 3 nodes to be available; all 3 were restarted within ~1 minute, preventing a timely rejoin.
  • Loss of quorum prevented replication of writes across nodes; the application restart during recovery triggered the outage window.
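The quorum arithmetic behind the failure can be sketched as follows. This is an illustration only: the node counts come from the report, but the helper names and the rolling-restart comparison are hypothetical.

```python
# Minimal sketch of 3-node majority-quorum arithmetic. The 3-node
# cluster size is from the report; everything else is illustrative.

CLUSTER_SIZE = 3
QUORUM = CLUSTER_SIZE // 2 + 1  # majority: 2 of 3 nodes must be up

def has_quorum(nodes_up: int) -> bool:
    """Consensus can make progress only while a majority is reachable."""
    return nodes_up >= QUORUM

# Rolling restart: at most one node down at any moment, so quorum holds
# at every step.
rolling_ok = all(has_quorum(CLUSTER_SIZE - 1) for _ in range(CLUSTER_SIZE))

# What happened here: all 3 nodes restarted within ~1 minute, so at some
# point zero nodes were available to form a majority.
simultaneous_ok = has_quorum(0)

print(rolling_ok)       # True: quorum never lost
print(simultaneous_ok)  # False: coordination lost until nodes rejoin
```

The same arithmetic explains why the platform kept serving reads and writes: individual replicas stayed up, but without a majority the coordination layer could not replicate writes between them.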

Preventative actions

  • Quorum-aware deployment gates: health checks ensure each coordination node fully rejoins before restarting the next. (Implemented)
  • Staggered deployments: coordination layer and database server updates won’t be deployed simultaneously. (Implemented)
  • Improved monitoring: alerting to detect coordination health issues within minutes. (Implemented)
  • Application resilience: retry logic so temporary coordination unavailability doesn’t block serving.
  • Decouple migrations from deploys: run migrations independently so coordination issues don’t block app startup.
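The "quorum-aware deployment gates" action above can be sketched as a rolling restart that waits for each coordination node to rejoin before touching the next. The node names, `restart`/`is_healthy` stand-ins, and timeouts below are hypothetical assumptions, not the actual deployment tooling.

```python
# Hypothetical sketch of a quorum-aware rolling restart gate: restart one
# coordination node at a time and block until it reports healthy, so at
# most one node is ever down and the cluster never loses quorum.
import time

NODES = ["coord-1", "coord-2", "coord-3"]  # hypothetical node names

def restart(node: str) -> None:
    print(f"restarting {node}")  # stand-in for the real restart command

def is_healthy(node: str) -> bool:
    return True  # stand-in for a real health/membership check

def rolling_restart(nodes, timeout_s: float = 300, poll_s: float = 5) -> bool:
    """Restart nodes one by one, gating each step on a health check."""
    for node in nodes:
        restart(node)
        deadline = time.monotonic() + timeout_s
        while not is_healthy(node):
            if time.monotonic() > deadline:
                return False  # abort the rollout: node failed to rejoin
            time.sleep(poll_s)
    return True

rolling_restart(NODES)
```

A gate like this directly prevents the trigger described in the root cause: with the health check in the loop, all three nodes can never be mid-restart within the same minute.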

Questions

  • Contact support with any questions about impact to your data.