OTP DLT Outage Response
Document Info
- Purpose: Step-by-step response when OTP DLT delivery fails or Fast2SMS rejects DLT sends at scale.
- Intended Audience: On-call engineers, operations, platform maintainers.
- Last Updated: 2026-06-05 (Phase 8D)
- Related Documents: OTP DLT Rollback, Log Triage, OTP DLT Observability
Symptoms
- Spike in otp_notification_failed or otp_delivery_completed with status=failed
- Spike in provider_response_failed with route=dlt
- Spike in otp_dlt_hard_failure for DLT-only apps (Phase 8D: no automatic fallback)
- User reports: OTP not received; API returns 502 sms_failed
- Fast2SMS return: false in provider logs
Decision tree
Immediate actions
- Confirm scope: Query logs for event:otp_delivery_completed, event:otp_dlt_hard_failure, and event:provider_response_failed in the last 15 minutes.
- Check activation: otp_dlt_activation_status / otp_cutover_status / otp_config_health at last startup.
- Identify delivery policy: Per app, deliveryPolicy in otp_cutover_status or /platform/otp → Delivery policy table.
- Sample failure: Inspect one requestId end-to-end: otp_generated → otp_dlt_dispatch → dlt_payload_ready → provider_response_failed or otp_dlt_hard_failure.
- Assess user impact: Failed sends revoke the OTP in Redis; users see 502 and must retry. DLT-only apps have no route=q fallback.
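The scope check and the single-request trace above can be sketched in a few lines of Python. This is a minimal sketch, not the platform's tooling: it assumes logs are exported as JSON lines, and the `timestamp` field (ISO 8601 with a UTC offset) is a hypothetical name. The event names and `requestId` come from this runbook.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

# Event names taken from the runbook; the JSON-lines layout is an assumption.
SCOPE_EVENTS = {
    "otp_delivery_completed",
    "otp_dlt_hard_failure",
    "provider_response_failed",
}


def parse_lines(lines):
    """Yield one dict per JSON log line, skipping blank or non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def count_recent_events(records, now=None, window_minutes=15):
    """Count scope events whose timestamp falls inside the last window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=window_minutes)
    counts = Counter()
    for record in records:
        raw = record.get("timestamp")  # assumed field name
        if raw is None or record.get("event") not in SCOPE_EVENTS:
            continue
        if cutoff <= datetime.fromisoformat(raw) <= now:
            counts[record["event"]] += 1
    return counts


def trace_request(records, request_id):
    """Return the event names for one requestId in timestamp order."""
    matching = [
        r for r in records
        if r.get("requestId") == request_id and "timestamp" in r
    ]
    matching.sort(key=lambda r: datetime.fromisoformat(r["timestamp"]))
    return [r["event"] for r in matching]
```

A trace that ends at provider_response_failed or otp_dlt_hard_failure without a later success event matches the failure path described in the "Sample failure" step.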
DLT-only failure handling (Phase 8D)
When legacyRouteEnabled=false and DLT fails:
| Symptom | Log event | Mitigation |
|---|---|---|
| Provider 5xx / timeout | otp_dlt_hard_failure | Re-enable fallback (see Rollback) or set OTP_DLT_ENABLED=false |
| Template rejection | otp_dlt_hard_failure | Fix template metadata; do not expect fallback |
| Global DLT off | otp_dlt_fallback with reason=dlt_inactive | N/A for DLT-only apps — they use route=q only when global flag is false |
Operational signal: otp_dlt_hard_failure count should be zero in steady state. Any sustained spike on an app whose legacy route has been retired (legacyRouteEnabled=false) requires immediate action.
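One way to turn "sustained spike" into something checkable is to look for otp_dlt_hard_failure events in several consecutive one-minute buckets per app. The sketch below is illustrative only: the `appId` and `timestamp` field names and the three-minute threshold are assumptions, not values defined by this runbook.

```python
from collections import defaultdict
from datetime import datetime


def sustained_hard_failures(records, min_consecutive_minutes=3):
    """Return app IDs with otp_dlt_hard_failure in N consecutive minutes.

    `records` is an iterable of dicts with assumed keys "event",
    "timestamp" (ISO 8601 with offset), and "appId".
    """
    minutes_by_app = defaultdict(set)
    for record in records:
        if record.get("event") != "otp_dlt_hard_failure":
            continue
        ts = datetime.fromisoformat(record["timestamp"])
        # Bucket by whole minutes since the epoch.
        minute = int(ts.timestamp() // 60)
        minutes_by_app[record.get("appId", "unknown")].add(minute)

    flagged = []
    for app_id in sorted(minutes_by_app):
        run = 0
        previous = None
        for minute in sorted(minutes_by_app[app_id]):
            # Extend the run only when this bucket directly follows the last.
            run = run + 1 if previous is not None and minute == previous + 1 else 1
            previous = minute
            if run >= min_consecutive_minutes:
                flagged.append(app_id)
                break
    return flagged
```

Isolated failures a few minutes apart do not trip the check; only back-to-back minutes do, which keeps a single provider hiccup from reading as a sustained spike.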
Escalation
| Severity | Condition | Action |
|---|---|---|
| P1 | >50% OTP sends failing >5 min | Rollback immediately; notify stakeholders |
| P2 | 10–50% failure rate | Investigate 15 min; rollback if not resolved |
| P3 | Isolated failures | Monitor; check template/provider for single app |
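The escalation table can be read as a small classification function. The sketch below encodes one interpretation and makes two assumptions the table does not state: a failure rate above 50% that has not yet lasted more than 5 minutes is treated as P2, and exactly 50% falls in the P2 band. The function name and parameters are hypothetical.

```python
def classify_severity(failed, total, sustained_minutes):
    """Map OTP send failures to P1/P2/P3 per the escalation table.

    failed: failed OTP sends in the observation window
    total: all OTP sends in the same window
    sustained_minutes: how long the current failure rate has persisted
    Returns "P1", "P2", "P3", or None when nothing is failing.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    rate = failed / total
    # P1: more than 50% failing for more than 5 minutes.
    if rate > 0.5 and sustained_minutes > 5:
        return "P1"
    # P2: 10-50% failing (assumption: also a >50% rate not yet sustained).
    if rate >= 0.10:
        return "P2"
    # P3: isolated failures below the 10% band.
    if failed > 0:
        return "P3"
    return None
```

For example, 60 failures out of 100 sends over 6 minutes classifies as P1, while the same rate at 3 minutes classifies as P2 under the assumption above.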
Verification checklist
- Failure rate returned to baseline
- Test OTP send succeeds (staging number)
- Logs show expected deliveryMode after mitigation
- No OTP values appear in logs (security check)
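For the security check, one low-assumption approach is to take the OTP actually received on the staging number during the test send and search the exported logs for that exact value. The sketch below assumes a numeric OTP and plain-text log lines; the function name is hypothetical.

```python
import re


def find_otp_leaks(lines, otp_value):
    """Return 1-based line numbers where otp_value appears as a whole number.

    Digit lookarounds prevent matches inside longer digit runs such as
    IDs or phone numbers.
    """
    if not otp_value.isdigit():
        raise ValueError("this sketch assumes a numeric OTP")
    pattern = re.compile(r"(?<!\d)" + re.escape(otp_value) + r"(?!\d)")
    return [
        number
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
```

An empty result supports the checklist item for that one test send only. A short OTP can also collide with unrelated numbers (a year, a status code), so any hit should be inspected by hand before it is treated as a leak.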
Post-incident
Document the root cause: template ID, sender ID, entity ID, variable ordering, or provider outage. Update the Rollout runbook if a config change is required.