22. Alerts and incidents
The operational inbox — where every alert opens an AI triage chat and proposes remediations you approve before anything runs.
At a glance
| What it is | The inbox for alerts and incidents. Each alert opens a per-alert AI chat that triages the problem and proposes concrete remediation actions you approve to execute. |
| Where to find it | https://ops.daalu.io/alerts, and a detail page per alert at /alerts/<id>. |
| Who can use it | Everyone can read, acknowledge, and chat; executing a remediation action requires an eligible role. |
The model behind this page (events, alerts, incidents, notifications) is covered in Chapter 12. This chapter walks through the surface itself.
The list view
The page opens to filter tabs across the top: Open, Acknowledged, Resolved. Open is the default and shows currently-firing alerts, most urgent first.
Each row carries:
- Severity badge — info / warning / critical.
- Title — the alert’s headline, from the rule that produced it.
- Source — the integration that emitted it.
- Labels — pills for cluster, device, service, team.
- First seen — when the alert was first noticed.
- Assignee — who’s on it; empty means nobody.
- Status — firing / acknowledged / resolved / snoozed.
A filter bar adds severity (multi-select), source, assignee, label search, and full-text search. Top-right are two bulk shortcuts: Mute all warnings (snooze every warning for an hour, handy during maintenance) and Acknowledge all (assign yourself to every unassigned critical).
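As a rough mental model of the Open tab and the Acknowledge all shortcut, the behaviour described above could be sketched like this. The `Alert` class, the `SEVERITY_RANK` table, and both function names are hypothetical and not part of Daalu's API. The tie-break by oldest first seen, and the assumption that acknowledging also moves the status to acknowledged, are guesses made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical ranking: a higher number means more urgent.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Alert:
    title: str
    severity: str                    # "info" | "warning" | "critical"
    source: str
    first_seen: datetime
    status: str = "firing"           # "firing" | "acknowledged" | "resolved" | "snoozed"
    assignee: Optional[str] = None   # None means nobody is on it
    labels: dict = field(default_factory=dict)


def open_view(alerts, severities=None):
    """Return firing alerts, most urgent first, optionally filtered by severity."""
    rows = [
        a for a in alerts
        if a.status == "firing" and (severities is None or a.severity in severities)
    ]
    # Assumption: within one severity, the oldest alert is listed first.
    return sorted(rows, key=lambda a: (-SEVERITY_RANK[a.severity], a.first_seen))


def acknowledge_all(alerts, me):
    """Mimic 'Acknowledge all': assign `me` to every unassigned firing critical."""
    for a in alerts:
        if a.status == "firing" and a.severity == "critical" and a.assignee is None:
            a.assignee = me
            a.status = "acknowledged"  # assumption: acknowledging changes the status
    return alerts
```

Calling `open_view(alerts, {"critical"})` corresponds to selecting only critical in the severity multi-select.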
The alert detail page
Click any alert to open its detail page. This is where Daalu differs from a passive alert list: the page is a working session, not a record.
Header
Title, severity, and status, plus the action buttons: Acknowledge (assign to me), Snooze, Resolve manually, and Promote to incident.
AI triage and chat
The heart of the page is a chat panel bound to this specific alert. The first time you open an alert, an autonomous triage pass kicks off on its own — you don’t have to ask. The triage agent:
- Pulls the alert’s context (labels, source, recent fires).
- Runs read-only investigation tools — Prometheus and Loki queries, kubectl-style reads, arbitrary HTTP probes — to gather evidence.
- Writes a diagnosis: what it thinks is wrong and why, citing the evidence it pulled.
- Proposes a remediation plan — one or more concrete actions that, if approved, would fix the problem.
From there it’s a conversation. Ask a follow-up in plain language — “is this the same as last Tuesday’s outage?”, “show me the error rate for the last hour” — and the agent answers using the same tools. Read-only tool calls run automatically and inline. Anything that would change state never runs on its own; it lands as a pending remediation action (below).
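The split between read-only calls that run inline and state-changing calls that wait for a human can be pictured as a simple gate. This is a minimal sketch under stated assumptions, not Daalu's implementation: the tool names in `READ_ONLY_TOOLS`, the `ToolCall` and `TriageSession` classes, and the `run_tool` callback are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; the real tool catalogue is not specified in this chapter.
READ_ONLY_TOOLS = {"prometheus_query", "loki_query", "kubectl_get", "http_probe"}


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class TriageSession:
    results: list = field(default_factory=list)  # evidence gathered inline
    pending: list = field(default_factory=list)  # proposals awaiting a human

    def handle(self, call, run_tool):
        """Run read-only calls immediately; queue everything else as pending."""
        if call.tool in READ_ONLY_TOOLS:
            self.results.append(run_tool(call))
            return "ran"
        # Default-deny: anything not known to be read-only is treated as
        # state-changing and is never executed from here.
        self.pending.append(call)
        return "pending"
```

The allow-list is deliberately the read-only side: an unrecognised tool falls through to pending rather than running by accident.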
If the triage looks stale because the situation moved on while you were reading, hit Re-triage to run a fresh pass against current data. Triage does not rerun on every page load, so reloading the tab never spends another model turn.
Why it matters — The investigation most engineers do by hand in the first five minutes of an alert — pull the metric, tail the logs, form a hypothesis — has already happened by the time you open the page. You start from a diagnosis, not a blank terminal.
Remediation actions
When the agent proposes a change — restart a deployment, patch a resource, drain a node — it surfaces as a remediation action card in the chat with a clear status:
| Status | Meaning |
|---|---|
| pending | Proposed by the agent, waiting for a human. Nothing has run. |
| approved | You clicked Approve; execution is starting. |
| executed | The action ran successfully. |
| failed | The action ran but the underlying tool returned an error. |
| rejected | You clicked Reject; the action will never run. |
Each card shows exactly which tool will be called and with what arguments, so there are no surprises. You Approve to execute or Reject to drop it. Approving runs the action immediately and reports back — success, or failure with the error attached so the agent can re-propose. This is the same human-in-the-loop guardrail as everywhere else in Daalu: the AI can investigate all day, but only your approval turns a proposal into a real change.
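The status table and the Approve/Reject behaviour amount to a small state machine. The sketch below is illustrative only: `RemediationAction` and the `execute` callback are hypothetical names, and it assumes that any exception raised by the underlying tool maps to the failed status.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RemediationAction:
    tool: str
    args: dict
    status: str = "pending"       # pending -> approved -> executed | failed; pending -> rejected
    error: Optional[str] = None   # populated only when the tool call fails

    def reject(self):
        """Drop the proposal; a rejected action can never run."""
        if self.status != "pending":
            raise ValueError(f"cannot reject an action that is {self.status}")
        self.status = "rejected"

    def approve(self, execute: Callable[[str, dict], None]):
        """Approve and run immediately, recording success or failure."""
        if self.status != "pending":
            raise ValueError(f"cannot approve an action that is {self.status}")
        self.status = "approved"
        try:
            execute(self.tool, self.args)
        except Exception as exc:
            # Keep the error so a follow-up proposal can take it into account.
            self.status = "failed"
            self.error = str(exc)
        else:
            self.status = "executed"
        return self.status
```

Only `pending` has outgoing transitions that a human triggers, which is the guardrail in code form: nothing reaches `execute` without passing through `approve`.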
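The end-to-end flow: an alert fires, triage runs automatically, the agent posts a diagnosis and proposes actions as pending cards, you approve or reject each card, and every approved action runs and reports back as executed or failed.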
Timeline
The chronological history of the alert: fired, acknowledged, comments, remediation actions approved or executed, resolved. Every triage interaction is recorded here too, so the page doubles as the audit trail.
Comments and related alerts
A threaded discussion (each comment notifies subscribers), plus a Related panel showing other recent alerts with overlapping labels — your first clue that you’re looking at one symptom of a larger event.
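One plausible way to compute a Related panel is to count label key/value pairs that two alerts share. This is a sketch of that idea only; how Daalu actually scores overlap is not documented here, the function names are hypothetical, and the recency filter is omitted for brevity.

```python
def shared_labels(a: dict, b: dict) -> dict:
    """Return the label pairs present with the same value in both alerts."""
    return {k: v for k, v in a.items() if b.get(k) == v}


def related(target_labels: dict, candidates: list, min_shared: int = 1) -> list:
    """Return titles of candidate alerts, most label overlap first."""
    scored = [(len(shared_labels(target_labels, c["labels"])), c) for c in candidates]
    scored = [(n, c) for n, c in scored if n >= min_shared]
    # sorted() is stable, so alerts with equal overlap keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return [c["title"] for _, c in scored]
```

For example, an alert labelled `cluster=prod-1, service=checkout` would rank another checkout alert on prod-1 above one that shares only the cluster.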
Labels
The alert’s labels, editable inline. Editing labels on a single alert is occasionally useful for routing a one-off; to change routing for good, edit the rule that produces the alert (admin-only, in Settings — Chapter 28).
Promote to incident
Some alerts deserve a full incident — multiple alerts firing together, customer impact, on-call coordination. Click Promote to incident to open one. The dialog asks for a title, severity (defaults to the alert’s), an assignee (defaults to you), and any additional alerts to attach.
The incident gets its own page at /incidents/<id> with a timeline that every attached alert contributes to, plus a postmortem draft that’s auto-generated when the incident resolves — you edit it and Save as final to keep it. Incidents are far less common than alerts; most teams open one to three a month, and they’re the unit you bring to a retrospective.
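An incident timeline that every attached alert contributes to can be thought of as a merge of per-alert timelines. The sketch below assumes each alert's events are already in chronological order and are represented as `(timestamp, alert_id, text)` tuples; that representation and the `incident_timeline` name are illustrative assumptions.

```python
import heapq
from datetime import datetime


def incident_timeline(alert_timelines: dict) -> list:
    """Merge per-alert event lists (each sorted by time) into one ordered list."""
    # heapq.merge lazily interleaves already-sorted inputs; the key compares
    # only the timestamp, so the event text never takes part in the ordering.
    return list(heapq.merge(*alert_timelines.values(), key=lambda event: event[0]))
```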
Snoozing
From any alert, Snooze for → 30 min / 2 hours / 8 hours / Custom.
Snoozing suppresses your notifications for the duration and logs the snooze in the timeline. It affects only your notifications, not your teammates’.
Snoozing does not resolve the alert: the condition is still live and the alert keeps firing. It is also independent of any downstream PagerDuty escalation, by design, because you almost never want to silence both at once. A snoozed alert wears a “Snoozed by X until Y” pill.
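The per-user nature of a snooze is easiest to see as a lookup keyed by both user and alert. The `SnoozeBook` class below is a hypothetical sketch, not Daalu's data model. It tracks only notification suppression and deliberately leaves the alert's own state untouched, since a snooze is not a resolve.

```python
from datetime import datetime, timedelta


class SnoozeBook:
    """Tracks who has snoozed which alert, and until when."""

    def __init__(self):
        self._until = {}  # (user, alert_id) -> datetime when the snooze ends

    def snooze(self, user: str, alert_id: str, now: datetime, duration: timedelta):
        self._until[(user, alert_id)] = now + duration

    def should_notify(self, user: str, alert_id: str, now: datetime) -> bool:
        """True unless this particular user has an unexpired snooze on the alert."""
        until = self._until.get((user, alert_id))
        return until is None or now >= until
```

Because the key includes the user, one person snoozing for 30 minutes leaves every teammate's notifications as they were.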
Testing the pipeline
To verify alerting end-to-end, use Settings → Alerts → Send test alert. Pick a severity, label set, and target channels; Daalu fires the alert through the same pipeline a real one uses and auto-resolves it after 60 seconds. It’s the fastest way to confirm a freshly-configured Slack or PagerDuty channel actually routes, or to show a new on-call what a real page looks like.
Alert rules
Hover an alert title to see the rule behind it; click for the rule’s detail page — conditions, severity, emitted labels, runbook URL, and recent fires. Editing rules lives in Settings → Alert rules and is admin-only.
Bulk operations
When you’re in a storm, multi-select with the row checkboxes and apply: Acknowledge selected, Snooze selected, Resolve selected (use sparingly — alerts are meant to auto-resolve when the condition clears), or Promote selected to incident to bundle them into one.