50. Troubleshooting

Organized by symptom — how things present to you — with the likely cause, the quick check, and the fix.

When something doesn’t work, start here. Entries are ordered roughly the way you’d encounter them.

Note. For deep operational runbooks, the Engineering & Operations manual (the sibling docs/book-engineer/ companion) goes a level below this chapter. This one is the operator’s first stop.

“I can’t log in”

The login form rejects your credentials, or you log in but get bounced back to the login screen.

Likely causes, most common first:

Wrong password. Use Forgot password; the reset email arrives within a minute.
Wrong tenant. If you have accounts in multiple tenants, you may be picking the wrong one on the post-login picker. Log out, log back in, choose carefully.
Cookies disabled for ops.daalu.io. Daalu uses an HttpOnly session cookie; no cookies, no auth.
Session expired during a long action. The bounce is normal — just log in again.

If you’ve been removed from the tenant you’ll see an explicit “Account no longer has access” message. Ask an admin to re-invite you.

“An integration row is red”

A Cloud Account / Source of Truth / Notification row shows status error.

Quick check: click the row → Health tab. The most recent probe response is shown verbatim.

Common errors and fixes:

AWS: AccessDenied — the IAM trust policy or external ID changed. Re-verify the role.
GCP: 401 invalid_grant — the workload-identity binding was removed. Re-bind in IAM.
Slack: token revoked — a workspace admin removed Daalu’s app. Re-authorize.
Nautobot: 401 Unauthorized — the API token expired or was rotated. Mint a new one.
Anything with timeouts — the target system is slow or unreachable. Check it from your side first.
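If you triage these often, the list above can be captured as a small lookup. This is only a sketch: the function name, the matching substrings, and the idea of pasting the Health tab's probe text into a script are all assumptions for illustration, not part of Daalu's API.

```python
# Hypothetical helper: map a pasted probe response to the fix listed above.
# The substrings are illustrative guesses at what the probe text contains.
PROBE_FIXES = [
    (("accessdenied",), "AWS: the IAM trust policy or external ID changed. Re-verify the role."),
    (("invalid_grant",), "GCP: the workload-identity binding was removed. Re-bind in IAM."),
    (("token revoked",), "Slack: a workspace admin removed the app. Re-authorize."),
    (("401 unauthorized",), "Nautobot: the API token expired or was rotated. Mint a new one."),
    (("timeout", "timed out"), "The target system is slow or unreachable. Check it from your side first."),
]


def suggest_fix(probe_response: str) -> str:
    """Return the first matching fix, or a fallback when nothing matches."""
    text = probe_response.lower()
    for needles, fix in PROBE_FIXES:
        if any(needle in text for needle in needles):
            return fix
    return "No known match. Read the full probe response on the Health tab."
```

For example, `suggest_fix("401 invalid_grant")` returns the GCP entry, and an unrecognised message falls through to the fallback.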

“I’m not getting Slack notifications”

Quick checks, in order:

Did the alert fire? Alerts page — confirm a matching firing alert exists.
Are notifications enabled for you? Settings → Preferences → Per-channel toggles.
Is Slack the destination for this alert? Click the alert → Routing decision.
Is the channel allowed? Integrations → Notifications → Slack — verify the channel is allow-listed and Daalu has been invited (/invite @daalu).
Has the alert been snoozed? Check the alert’s pill bar.

If all pass and you still get nothing, fire a test from Settings → Alerts → Send test alert to isolate which step is broken.

“The Assistant is hanging / timing out”

Quick checks:

Is inference falling back to a slow tier? Usage & Pricing → Inference — if your own GPU’s pill is offline, calls spill to the Daalu-hosted or commercial tiers, which can be slower under load.
Is your prompt too long? If you’re pasting megabytes of logs, truncate.
Is a commercial provider throttling? Check status.anthropic.com / status.openai.com. The router doesn’t currently swap providers mid-conversation when one throttles.
Is there a regional outage on Daalu’s side? Check status.daalu.io.

If it’s hung for more than ~2 minutes, refresh and ask again; the underlying call should have timed out and retried by then.
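For the oversized-prompt case, one way to truncate is to keep only the tail of the log, since the most recent lines are usually the relevant ones. A minimal sketch follows; the function name and the 20,000-character budget are assumptions, not a documented Assistant limit.

```python
def tail_for_prompt(log_text: str, max_chars: int = 20_000) -> str:
    """Keep only the most recent part of a log, at most max_chars characters."""
    if len(log_text) <= max_chars:
        return log_text
    tail = log_text[-max_chars:]
    # Drop the first line of the cut, which is probably partial,
    # so the excerpt starts at a line boundary.
    newline = tail.find("\n")
    if newline != -1 and newline + 1 < len(tail):
        tail = tail[newline + 1:]
    return tail
```

Paste the result instead of the full log. If the interesting part is at the start of the log rather than the end, slice from the front instead.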

“A change proposal won’t approve”

The approve button is grayed out, or clicking it errors.

Reasons:

You proposed it. Four-eyes rule — someone else must approve.
It’s already approved / applied / denied. Check the proposal’s current status.
The target integration is in error state. Daalu won’t execute against an unhealthy integration. Fix the integration first.
A workflow approval gate. If the proposal is a step in a multi-step automation with a named approver, only that approver can advance it.
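The four reasons above amount to a short decision procedure. The sketch below models it with a hypothetical `Proposal` record; the field names, status strings, and the order of the checks are assumptions for illustration and say nothing about how Daalu implements approvals internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    # Hypothetical fields, named for this example only.
    proposer: str
    status: str  # e.g. "pending", "approved", "applied", "denied"
    integration_healthy: bool
    named_approver: Optional[str] = None  # set when a workflow gate applies


def approval_blocker(p: Proposal, user: str) -> Optional[str]:
    """Return why `user` cannot approve `p`, or None if nothing blocks it."""
    if p.status != "pending":
        return f"already {p.status}"
    if user == p.proposer:
        return "four-eyes rule: someone else must approve"
    if not p.integration_healthy:
        return "target integration is in error state"
    if p.named_approver is not None and user != p.named_approver:
        return f"only {p.named_approver} can advance this step"
    return None
```

Walking the checks in this order mirrors how you would debug a grayed-out button by hand: status first, then who you are, then the integration, then any workflow gate.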

“Cluster federation is stuck on ‘awaiting handshake’”

Quick checks from the cluster side:

Is the edge pod running? If it’s crashlooping, inspect its logs.
Look for connection errors in the edge logs. The common one is outbound UDP/51820 blocked by your egress firewall — the tunnel can’t form.
Did the bootstrap Job complete? If it failed, the token may have expired (1-hour limit). Regenerate the invite.

From the Daalu side:

Did you generate the invite recently? Tokens expire after 1 hour.
Is the cluster row’s tunnel key set? Managed Infra → Clusters → [cluster] → Tunnel diagnostics.
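Because the 1-hour token lifetime is the most common trip-wire here, it can help to check the invite's age before digging into logs. A small sketch, assuming you noted when the invite was generated; the function name is hypothetical, and treating exactly one hour as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

INVITE_TTL = timedelta(hours=1)  # from the text: tokens expire after 1 hour


def invite_expired(generated_at: datetime, now: Optional[datetime] = None) -> bool:
    """True if an invite generated at `generated_at` is past its 1-hour lifetime."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - generated_at >= INVITE_TTL
```

Pass timezone-aware timestamps for both arguments. If this returns True, regenerate the invite rather than retrying the bootstrap Job.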

Full walkthrough: Chapter 41 — Deploying Daalu Edge.

“Devices aren’t showing on the Operations page”

Quick checks:

Is the Nautobot integration green? Managed Infra → Source of Truth.
Has the initial sync completed? The detail page shows “Last sync.”
Did the devices match Daalu’s filters? Settings → Reconciliation → Device filters — Daalu may be intentionally excluding certain sites or roles.

“Drift won’t clear”

You approved a drift proposal, the executor reported success, but the next reconciler cycle still shows drift.

Likely causes:

Normalization mismatch. The intended and live configs differ in a way Daalu treats as significant but shouldn’t. Check the diff detail and tune Settings → Reconciliation → Normalization rules.
The change didn’t actually apply. Check the proposal’s audit log — did the executor’s API call really return success?
The device reverted. Some devices load config from an external source (TFTP, RANCID) that overrides running config. Daalu pushed; the next reload pulled it back.
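To see what a normalization mismatch looks like in practice, here is a minimal sketch of comparing two configs after normalizing them. The two rules shown (ignore trailing whitespace, ignore blank lines and `!` comment lines) are illustrative assumptions, not Daalu's actual normalization rules.

```python
def normalize(config: str) -> list[str]:
    """Reduce a config to the lines that should count when comparing."""
    kept = []
    for raw in config.splitlines():
        line = raw.rstrip()  # trailing whitespace is treated as insignificant
        if not line or line.lstrip().startswith("!"):
            continue  # skip blank lines and "!" comment/separator lines
        kept.append(line)
    return kept


def has_drift(intended: str, live: str) -> bool:
    """True if the configs still differ once both are normalized."""
    return normalize(intended) != normalize(live)
```

If the diff detail shows only differences of this kind, the drift is cosmetic and a normalization rule is the right fix. If the configs still differ after normalizing, look at the other two causes.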

“I see a ‘rate limit exceeded’ error”

Most common when a script or automation hits the API hard.

Default limits:

Per-token limit: 60 requests/minute by default (per personal access token).
Per-IP limit: 600 requests/minute by default.

Raise the limits via Settings → API → Rate limits (admin only), or batch your requests so the script stays under them.
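If raising the limit isn't an option, a script can pace itself on the client side. The sketch below is a generic rolling-window throttle sized to the default per-token limit of 60 requests/minute. The class name is made up for this example, and it assumes the server counts requests over a rolling minute, which this chapter doesn't specify.

```python
import time
from collections import deque


class MinuteThrottle:
    """Allow at most `limit` calls to start in any rolling `window` seconds."""

    def __init__(self, limit=60, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self._limit = limit
        self._window = window
        self._clock = clock  # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._starts = deque()  # start times of calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget calls that have aged out of the window.
            while self._starts and now - self._starts[0] >= self._window:
                self._starts.popleft()
            if len(self._starts) < self._limit:
                self._starts.append(now)
                return
            # Window is full: sleep until the oldest call ages out, then re-check.
            self._sleep(self._window - (now - self._starts[0]))
```

Call `throttle.wait()` immediately before each API request. Setting `limit` a little below the real ceiling leaves headroom for other tools sharing the same token.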

“The home page is empty”

You’ve connected sources but the home page still shows empty states:

Hard-refresh the browser (Cmd-Shift-R / Ctrl-Shift-R).
Check the briefing-time setting — if you’re looking before today’s briefing has generated, you’ll see yesterday’s.
Click Regenerate on the briefing card.

“AI Factory shows high VRAM but 0% GPU utilisation”

This is normal and expected.

The vLLM serving engine pre-allocates most of the card’s VRAM at startup for its KV cache — that’s how it serves fast. So a healthy, idle GPU legitimately reads ~90% memory used and ~0% compute. You’ll see compute jump only while a request is actually being served.

Don’t read steady high VRAM as a leak or a stuck job. Read the utilisation line for activity and the health signals (XID / ECC / thermals) for problems. See Chapter 32 — GPU observability.
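The reading rule above can be written down as a tiny classifier. This sketch takes one line of output in the form produced by `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits` (for example `73400, 81559, 0`). The function name, the labels, and both thresholds are assumptions chosen for illustration.

```python
def classify_gpu(line: str, mem_high: float = 0.8, util_idle: float = 5.0) -> str:
    """Classify one 'used_mib, total_mib, util_pct' reading.

    Thresholds are illustrative: memory at or above 80% with utilisation
    at or below 5% is read as a healthy idle server, not a leak.
    """
    used_mib, total_mib, util_pct = (float(x) for x in line.split(","))
    mem_frac = used_mib / total_mib
    if util_pct > util_idle:
        return "serving"
    if mem_frac >= mem_high:
        return "idle, VRAM pre-allocated (expected)"
    return "idle, low memory use"
```

The third label is the one worth a second look: an idle GPU with little memory in use may mean the serving pod hasn't loaded the model yet.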

“AI Factory pill is offline / inference keeps spilling to cloud”

Your own GPU was healthy and is now offline, so calls fall back to the Daalu-hosted and commercial tiers (and your inference bill climbs).

Quick checks:

Is the cluster still connected? Managed Infra → Clusters — the serving stack is unreachable if the tunnel is down.
Did the serving pod restart? First start after a restart takes a few minutes to reload weights; the pill stays offline until the model is up.
Hardware health. A wedged GPU (a recurring XID, an ECC storm) can take the endpoint down. Run the diagnostics in Chapter 33 — Diagnostics and reliability.

“My Coding Workspace says ‘paused’”

This is by design. A workspace that’s been idle for more than 24 hours is paused automatically to free GPU and compute — your files and state are preserved.

Just reopen it; it resumes from where you left off. The first prompt after a resume may be slightly slower while the backing inference warms up. See Chapter 35 — Coding workspace.

When to write to support

Open a ticket from Help & Feedback if:

An integration is green per its health check but visibly not working.
The Assistant is confidently wrong on something that matters.
Anything billing-related doesn’t match your expectation.
You’re seeing a production-impacting outage of any kind.

Include:

The version from the Help page.
The URL where the symptom appeared.
Screenshots if the UI is involved.
Roughly when the symptom started.

Tip. Tickets with those details typically resolve in hours. Without them, the back-and-forth takes days. Front-load the context.

Next: Chapter 51 — Frequently asked questions
