20. Agents
Autonomous helpers that watch, investigate, and propose — on a schedule or a trigger, all under your approval rules.
At a glance
| What it is | The roster of autonomous helpers running in your tenant. Each agent has a schedule or trigger and a focused job: write the daily briefing, sweep for expiring certs, detect drift, investigate a recurring alert. |
| Where to find it | https://ops.daalu.io/agents |
| Who can use it | Everyone can view agents and their run history; enabling, disabling, and editing is admin-only. |
The Agents page is where you see the autonomous helpers running in your tenant. It’s one of the most powerful pages once you’ve been using Daalu for a few weeks — new customers often don’t notice it exists, and by month two it’s where they live.
What an agent is
An agent is:
- A named, configured behaviour.
- Bound to a schedule (cron) or a trigger (event match).
- Implemented as a sequence of Assistant-like tool calls, ending in a write to one of: the timeline, the proposal queue, or a notification channel.
- Subject to the same approval rules as the Assistant. If the agent’s plan ends in a write, it routes through a change proposal that a human approves.
Agents are tenant-scoped. Each tenant has its own set, can enable or disable each, and can author custom ones (Chapter 21 covers authoring).
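To make the definition concrete, here is a minimal sketch of an agent as a data structure. It is illustrative only, not Daalu's internal model: the class name, field names, and tool-call names are hypothetical, and the rule that an agent has exactly one of a schedule or a trigger is an assumption drawn from the wording above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Sink(Enum):
    """The three write targets listed above for an agent's final step."""
    TIMELINE = "timeline"
    PROPOSAL_QUEUE = "proposal_queue"
    NOTIFICATION = "notification"


@dataclass
class AgentSpec:
    """Hypothetical shape of an agent definition (all names are illustrative)."""
    name: str
    tool_calls: list[str]               # ordered, Assistant-like tool calls
    sink: Sink                          # where the final write lands
    cron: Optional[str] = None          # schedule, e.g. "* * * * *"
    event_match: Optional[str] = None   # trigger, e.g. "cluster.tunnel.down"
    enabled: bool = True

    def __post_init__(self) -> None:
        # Assumption: an agent is bound to a schedule or a trigger, not both.
        if (self.cron is None) == (self.event_match is None):
            raise ValueError("set exactly one of cron or event_match")


# A schedule-bound agent that ends in a notification.
cert_sweep = AgentSpec(
    name="cert-expiry",
    tool_calls=["list_certificates", "filter_expiring"],  # hypothetical tool names
    sink=Sink.NOTIFICATION,
    cron="0 6 * * *",
)
```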
The shipped agents
Out of the box, every tenant gets these:
Briefing agent
Runs once a day at your tenant’s briefing time. Generates the Home daily briefing (Chapter 18).
Reconciler agent (source of truth)
Runs every 5 minutes per device (configurable). Detects drift and writes proposals. The concept is Chapter 14; the surface it drives is Operations (Chapter 19).
Alert-explainer agent
Runs ~30 seconds after each new alert. Queries the relevant metrics and logs, writes a 2–3 paragraph explanation that appears on the alert detail page. The first time you see this in action is usually a memorable moment.
Cluster-tunnel-health agent
Runs every minute. Verifies that each federated cluster’s WireGuard tunnel is healthy. Writes a cluster.tunnel.down event if a handshake is older than 5 minutes; promotes it to a critical alert at 15 minutes.
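The two thresholds can be read as a small classification rule. The sketch below assumes the agent knows the timestamp of the last handshake for each tunnel; the function name and return labels are hypothetical, and only the 5-minute and 15-minute figures come from this guide.

```python
from datetime import datetime, timedelta, timezone

EVENT_AFTER = timedelta(minutes=5)      # handshake older than this: write the event
CRITICAL_AFTER = timedelta(minutes=15)  # at this age: promote to a critical alert


def classify_tunnel(last_handshake: datetime, now: datetime) -> str:
    """Map the age of the last handshake to one of three illustrative states."""
    age = now - last_handshake
    if age >= CRITICAL_AFTER:
        return "critical_alert"
    if age > EVENT_AFTER:
        return "cluster.tunnel.down"
    return "healthy"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
state = classify_tunnel(now - timedelta(minutes=7), now)  # between the two thresholds
```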
Cost-anomaly agent
Runs every hour. Compares today’s cloud spend (so far) against your trailing 7-day average; flags deviations larger than the threshold in Settings → Briefings → Cost anomaly.
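As a rough illustration of the comparison, the sketch below treats the threshold as a relative deviation (0.25 meaning 25%) from the mean of the last seven daily totals. Both of those are assumptions, as are the function names; the guide does not say how the threshold is expressed or whether the partial day is prorated.

```python
from statistics import mean


def cost_deviation(today_so_far: float, last_7_days: list[float]) -> float:
    """Relative deviation of today's spend from the trailing 7-day mean."""
    if len(last_7_days) != 7:
        raise ValueError("expected exactly seven daily totals")
    baseline = mean(last_7_days)
    if baseline == 0:
        # No baseline spend: any spend today is an unbounded deviation.
        return float("inf") if today_so_far > 0 else 0.0
    return (today_so_far - baseline) / baseline


def is_anomaly(today_so_far: float, last_7_days: list[float], threshold: float) -> bool:
    """Flag spend that deviates from the baseline by more than the threshold."""
    return abs(cost_deviation(today_so_far, last_7_days)) > threshold


week = [100.0] * 7
flagged = is_anomaly(150.0, week, threshold=0.25)  # 50% above the average
```

Because today's figure is a running total, a sketch like this under-reports overspend early in the day; a production check would likely account for the time of day.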
Cert-expiry agent
Runs daily. Sweeps every connected source’s TLS certificates and writes a notification if any is within 30 days of expiry.
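The 30-day window amounts to a simple date comparison. In the sketch below, the input is assumed to be a mapping from a certificate name to its expiry time; the function name and the example hostnames are hypothetical, and how the agent actually collects certificates from each source is not shown.

```python
from datetime import datetime, timedelta, timezone

WARN_WINDOW = timedelta(days=30)  # from the guide: notify within 30 days of expiry


def expiring_soon(certs: dict[str, datetime], now: datetime) -> list[str]:
    """Return names of certificates expiring within the window, soonest first.

    Certificates that have already expired are included as well.
    """
    cutoff = now + WARN_WINDOW
    due = [(not_after, name) for name, not_after in certs.items() if not_after <= cutoff]
    return [name for _, name in sorted(due)]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
certs = {
    "grafana.example.internal": now + timedelta(days=12),
    "vault.example.internal": now + timedelta(days=90),
    "loki.example.internal": now + timedelta(days=3),
}
to_notify = expiring_soon(certs, now)
```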
The Agents page layout
The page is a table:
| Agent | Status | Last run | Next run | Success rate (7d) | p50 latency | ⋯ |
A row per shipped or custom agent. Click any row for the agent detail page:
- Configuration — schedule, trigger, parameters (e.g., briefing time, anomaly threshold).
- Run history — last 100 runs with timestamps, status, duration, output artifact (e.g., the briefing it generated).
- Logs — structured logs from the agent’s last run.
- Test now — trigger an immediate run.
- Disable — pause this agent (admin only).
Enabling and disabling
By default, all shipped agents are enabled in every tenant except cert-expiry (off by default; some tenants find it noisy).
Toggling an agent off:
- Click the row.
- Toggle Enabled off in the configuration panel.
- Scheduled runs are skipped until the agent is re-enabled; on-demand triggering from this page still works.
Disabling vs deleting: shipped agents can’t be deleted, only disabled. Custom agents (workflows) can be deleted.
Per-agent settings
Each agent has its own configuration UI:
Briefing agent
- Time of day (your tenant’s time zone).
- Days of the week.
- Inputs (which sources to include).
- Output channels (in-app, Slack, email digest).
Reconciler agent
- Per-device interval (default 5 min).
- Concurrency limit (how many devices may be checked in parallel — defaults are conservative for low-end Nautobot instances).
- Diff sensitivity (ignore comments, ignore whitespace, etc.).
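To show what diff sensitivity means in practice, here is a minimal sketch that normalizes two configurations before comparing them. It is not the Reconciler's actual algorithm: the function names, the comment prefixes, and the choice to strip only trailing whitespace and blank lines are all assumptions.

```python
import difflib


def normalize(config: str, ignore_comments: bool = True,
              ignore_whitespace: bool = True,
              comment_prefixes: tuple[str, ...] = ("!", "#")) -> list[str]:
    """Drop the noise that the sensitivity settings say to ignore."""
    lines = []
    for line in config.splitlines():
        if ignore_whitespace:
            line = line.rstrip()  # keep indentation, which can be significant
            if not line:
                continue          # skip blank lines
        if ignore_comments and line.lstrip().startswith(comment_prefixes):
            continue
        lines.append(line)
    return lines


def drift(intended: str, actual: str, **sensitivity: bool) -> list[str]:
    """Unified diff of the normalized configs; an empty list means no drift."""
    a = normalize(intended, **sensitivity)
    b = normalize(actual, **sensitivity)
    return list(difflib.unified_diff(a, b, "intended", "actual", lineterm=""))


intended = "hostname r1\n! managed by source of truth\ninterface eth0\n  mtu 1500\n"
actual = "hostname r1   \n\ninterface eth0\n  mtu 9000\n"
changes = drift(intended, actual)
```

With both options on, the comment line and the stray whitespace produce no diff, so only the MTU change would surface as drift.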
Alert-explainer agent
- Maximum tokens spent per explanation.
- Sources to query (Prometheus, Loki, all, none).
- Model preference (cheap vs. high-quality).
Cost-anomaly agent
- Per-cloud thresholds.
- Tenant-wide threshold.
- Notification channel.
Each agent’s configuration is also editable via the API for infra-as-code workflows.
Authoring custom agents
To author your own:
- Go to Automations → New automation.
- Pick a trigger (cron schedule or event match).
- Build the plan: a sequence of tool calls and conditions.
- Save. Daalu adds it as a new row on the Agents page.
Chapter 21 covers the workflow authoring UI in detail. The agents you create here appear with type “Custom” on the Agents page and have the same run-history features.
When an agent fails
Agents fail like any other program. The most common modes:
- Tool error. A downstream system returned an error (cloud API throttled, Loki unreachable). The run is marked failed with the error in the logs. The agent retries on its next scheduled run.
- Approval missing. An agent’s plan ended in a proposal that has been waiting for approval for more than 24 hours. The agent pings your notification channels as a reminder.
- Timeout. Long-running queries can hit the per-agent timeout (default: 5 minutes). Either raise the timeout or break the agent’s work into smaller chunks.
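The Agents page’s “Success rate (7d)” column is the best place to spot trending issues. A green 100% means every run in the last seven days succeeded. A rate around 70% means roughly one run in three is failing, so open the agent’s run history and logs to find out why.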
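If you export run history and want to reproduce the figure yourself, the calculation is presumably the share of successful runs in the last seven days. The sketch below assumes each run is a (timestamp, succeeded) pair; that shape and the function name are hypothetical, and the product may count retried or skipped runs differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def success_rate_7d(runs: list[tuple[datetime, bool]], now: datetime) -> Optional[float]:
    """Percentage of successful runs in the trailing 7 days, or None if there were none."""
    window_start = now - timedelta(days=7)
    recent = [ok for ts, ok in runs if window_start <= ts <= now]
    if not recent:
        return None
    return 100.0 * sum(recent) / len(recent)


now = datetime(2024, 1, 8, tzinfo=timezone.utc)
# Ten daily runs; the two oldest fall outside the 7-day window.
runs = [(now - timedelta(days=d), d not in (1, 4)) for d in range(10)]
rate = success_rate_7d(runs, now)
```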
Next: Chapter 21 — Automations