39. Observability integrations (Prometheus, Grafana, Loki, Thanos, OpenSearch)

Give the Assistant real query access to your metrics, logs, and traces — and mirror your existing alert rules in.

These integrations are how Daalu reaches your telemetry. The Assistant runs real queries against them when it investigates, and Daalu can mirror your existing alert rules so you don’t rebuild alerting from scratch.

All five share the same setup shape, so this chapter covers them together.

At a glance


What it connects	Prometheus, Thanos, Loki, OpenSearch/Elasticsearch, Grafana
Auth model	Per target: basic auth, bearer token, or none; reachable directly or via a federation tunnel
Where to set it up	Managed Infra → Observability Stacks → Add

What each one gives you

Prometheus — Daalu runs PromQL on demand and pulls alert rules.
Thanos — the same surface as Prometheus, plus long-retention queries.
Loki — Daalu runs LogQL; the alert-explainer agent uses this for log context.
OpenSearch / Elasticsearch — log search via OpenSearch DSL.
Grafana — Daalu reads your dashboards, embeds panels in Reports dashboards (Chapter 23), and mirrors in alert rules.

Setup pattern

For each, go to Managed Infra → Observability Stacks → Add → [type]. The wizard asks for:

Base URL — https://prom.acme.io, http://loki:3100, etc.
Auth method — basic auth, bearer token, or none.
Reachable via federation tunnel? — tick if it’s inside your network.
Friendly name.

Daalu runs a benign test query (up{} == 1 for Prometheus, a trivial LogQL for Loki, and so on). If it succeeds, the row goes green.
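
To check a target by hand before adding it, you can reproduce a similar probe yourself. The sketch below is not Daalu's implementation; it only shows how the test query and the three auth methods map onto a request to Prometheus's public instant-query endpoint (/api/v1/query). The function name and its parameters are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_test_request(base_url, auth_method="none",
                       username=None, password=None, token=None):
    """Build (without sending) a Prometheus instant-query request.

    Hypothetical helper for illustration; Daalu's own probe may differ.
    """
    # The same benign query the wizard uses for Prometheus targets.
    query = urllib.parse.urlencode({"query": "up{} == 1"})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    request = urllib.request.Request(url)

    if auth_method == "basic":
        raw = f"{username}:{password}".encode("utf-8")
        encoded = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", "Basic " + encoded)
    elif auth_method == "bearer":
        request.add_header("Authorization", "Bearer " + token)
    elif auth_method != "none":
        raise ValueError(f"unsupported auth method: {auth_method}")
    return request


if __name__ == "__main__":
    req = build_test_request("https://prom.acme.io", auth_method="bearer",
                             token="example-token")
    print(req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

A response with "status": "success" from that endpoint is a reasonable sign that the URL and auth method you plan to enter in the wizard are correct.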

Telemetry behind your firewall

If your Prometheus lives inside a federated cluster — the common case for on-prem operators:

Connect the cluster first (Chapter 41).
When adding the observability integration, tick Reachable via federation tunnel and use the cluster-internal URL (e.g. http://prometheus.monitoring:9090).
Daalu routes every query through the tunnel — nothing is exposed to the public internet.

This is the standard pattern, and it’s why you never have to put a metrics endpoint on the open internet to use Daalu.

Alert rules

For Prometheus and Grafana, Daalu auto-imports alert rules:

It pulls them on a schedule (every 15 minutes by default).
When an upstream rule fires, the Daalu adapter sees the alert and writes a corresponding Daalu alert.
Acknowledging or resolving in Daalu mirrors back to the upstream system where the API supports it.
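
As a rough illustration of what an import pass has to do, the sketch below filters the alerting rules out of a response shaped like Prometheus's /api/v1/rules output. The sample payload, rule names, and the alerting_rules helper are made up for this example, and Daalu's adapter may work differently.

```python
# Sample shaped like a Prometheus /api/v1/rules response (values are invented).
SAMPLE_RULES_RESPONSE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "node",
                "rules": [
                    {
                        "type": "alerting",
                        "name": "HighCPU",
                        "query": "avg(rate(node_cpu_seconds_total[5m])) > 0.9",
                        "state": "firing",
                    },
                    {
                        "type": "recording",
                        "name": "job:up:sum",
                        "query": "sum(up) by (job)",
                    },
                ],
            }
        ]
    },
}


def alerting_rules(rules_response):
    """Return the alerting rules from a rules response, skipping recording rules."""
    if rules_response.get("status") != "success":
        raise ValueError("rules request did not succeed")
    found = []
    for group in rules_response["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "alerting":
                found.append({
                    "group": group["name"],
                    "name": rule["name"],
                    "query": rule["query"],
                    "state": rule.get("state", "inactive"),
                })
    return found


if __name__ == "__main__":
    for rule in alerting_rules(SAMPLE_RULES_RESPONSE):
        print(rule["group"], rule["name"], rule["state"])
```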

For Loki, configure rules in your Loki ruler; Daalu mirrors the resulting alerts the same way. For OpenSearch, alerting plugins vary, so Daalu imports the plugin’s alarms as alerts wherever its API allows.

What the Assistant uses each one for

When the Assistant investigates an alert:

Prometheus — pulls the metric value at fire time and N minutes around it.
Loki — pulls logs from the relevant labels in that window.
OpenSearch — searches the indices the alert’s source maps to.
Grafana — reads dashboards to find the panel that represents the metric.
Thanos — long-context queries (“is this trending up over the last 30 days?”).
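
The "N minutes around fire time" window for Prometheus can be pictured as a range query. This is a minimal sketch assuming Prometheus's public /api/v1/query_range endpoint; the helper name, the 15-minute default, and the 30-second step are illustrative choices, not values Daalu documents.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def range_query_url(base_url, promql, fired_at,
                    minutes_around=15, step_seconds=30):
    """Build a query_range URL covering a window centred on the fire time."""
    start = fired_at - timedelta(minutes=minutes_around)
    end = fired_at + timedelta(minutes=minutes_around)
    params = urlencode({
        "query": promql,
        "start": start.timestamp(),  # Unix seconds; RFC 3339 also works
        "end": end.timestamp(),
        "step": step_seconds,
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(range_query_url("http://prometheus.monitoring:9090", "up", fired))
```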

Tip: Every answer the Assistant gives carries citations that link back to the live query results, so you can verify exactly what it looked at.

Choosing one when you have many

If your tenant has several Prometheus instances (one per cluster, say), Daalu needs to route each query to the right one. Configure this per integration:

Default destination — the one used for unscoped queries.
Scoped routes — labels that force routing to a specific target.

For example: any query labelled cluster: eu-prod goes to the eu-prod Prometheus, cluster: us-east goes to us-east, and everything else goes to the default.
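
The routing rule in that example can be sketched as a first-match lookup. The data shapes and the pick_target function are assumptions made for illustration; the actual configuration lives in the integration settings, and Daalu's matching logic may differ.

```python
def pick_target(query_labels, scoped_routes, default_target):
    """Return the first target whose route labels all match the query labels.

    scoped_routes is an ordered list of (labels, target) pairs. A route with
    an empty label set would match every query, so avoid defining one.
    """
    for route_labels, target in scoped_routes:
        if all(query_labels.get(key) == value
               for key, value in route_labels.items()):
            return target
    return default_target


# Hypothetical route table mirroring the example in the text.
ROUTES = [
    ({"cluster": "eu-prod"}, "eu-prod Prometheus"),
    ({"cluster": "us-east"}, "us-east Prometheus"),
]

if __name__ == "__main__":
    print(pick_target({"cluster": "eu-prod"}, ROUTES, "default Prometheus"))
    print(pick_target({"job": "node"}, ROUTES, "default Prometheus"))
```

First-match ordering keeps the behaviour predictable when a query carries labels that satisfy more than one route.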

Capacity and rate limits

Daalu’s adapters apply per-target rate limits so an investigation can’t overload your stack:

Prometheus: ~30 queries/minute by default.
Loki: ~10 queries/minute (log queries are expensive).
OpenSearch: ~30 queries/minute.
Grafana: ~30 API calls/minute.

Raise these in Settings → Observability → Rate limits if your stack handles more.

Warning: Be conservative. DoS-ing your own metrics backend with an over-eager Assistant is an unhappy way to learn where your limits are.
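
To get a feel for what a per-target limit means in practice, here is a small sliding-window limiter. The text does not say how Daalu enforces its limits, so the algorithm, the class name, and the clock parameter are all assumptions; only the default numbers come from the list above.

```python
import time
from collections import deque

# Default limits from this chapter, in queries (or API calls) per minute.
DEFAULT_LIMITS = {"prometheus": 30, "loki": 10, "opensearch": 30, "grafana": 30}


class PerMinuteLimiter:
    """Allow at most max_per_minute calls in any rolling 60-second window."""

    def __init__(self, max_per_minute, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock      # injectable so the limiter can be tested
        self._sent = deque()    # timestamps of recently allowed calls

    def try_acquire(self):
        now = self.clock()
        # Forget calls that have left the 60-second window.
        while self._sent and now - self._sent[0] >= 60.0:
            self._sent.popleft()
        if len(self._sent) < self.max_per_minute:
            self._sent.append(now)
            return True
        return False


if __name__ == "__main__":
    loki = PerMinuteLimiter(DEFAULT_LIMITS["loki"])
    allowed = sum(loki.try_acquire() for _ in range(15))
    print(f"{allowed} of 15 back-to-back Loki queries allowed")
```

A caller that gets False back would wait or drop the query, which is one way a rate limit can surface as missing data during an investigation.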

Troubleshooting

Connection refused — wrong URL, or the service isn’t running.
401 / 403 — wrong auth method.
No data for a query that should work — a label got stripped, or the rate limit fired. Check the detail page’s “Last query” trace.
Stale alert mirror — the import schedule is 15 minutes by default. For faster updates, point your Prometheus Alertmanager at Daalu’s webhook endpoint (Chapter 40) so alerts arrive the moment they fire.

Next: Chapter 40 — Webhooks and custom sources
