39. Observability integrations (Prometheus, Grafana, Loki, Thanos, OpenSearch)
Give the Assistant real query access to your metrics, logs, and traces — and mirror your existing alert rules in.
These integrations are how Daalu reaches your telemetry. The Assistant runs real queries against them when it investigates, and your existing alert rules can mirror in so you don’t rebuild alerting from scratch.
All five share the same setup shape, so this chapter covers them together.
At a glance
| What it connects | Prometheus, Thanos, Loki, OpenSearch/Elasticsearch, Grafana |
| Auth model | Per target: basic auth, bearer token, or none; reachable directly or via a federation tunnel |
| Where to set it up | Managed Infra → Observability Stacks → Add |
What each one gives you
- Prometheus — Daalu runs PromQL on demand and pulls alert rules.
- Thanos — the same surface as Prometheus, plus long-retention queries.
- Loki — Daalu runs LogQL; the alert-explainer agent uses this for log context.
- OpenSearch / Elasticsearch — log search via OpenSearch DSL.
- Grafana — Daalu reads your dashboards, embeds panels in Reports dashboards (Chapter 23), and mirrors in alert rules.
Setup pattern
For each, go to Managed Infra → Observability Stacks → Add → [type]. The wizard asks for:
- Base URL: https://prom.acme.io, http://loki:3100, etc.
- Auth method: basic auth, bearer token, or none.
- Reachable via federation tunnel? Tick this if the target is inside your network.
- Friendly name.
Daalu runs a benign test query (up{} == 1 for Prometheus, a trivial LogQL query for Loki, and so on). If it succeeds, the integration’s row turns green.
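To make the test query concrete, here is a minimal sketch of what such a connectivity check could look like from the outside. It is not Daalu's implementation. The endpoint paths are the public Prometheus and Loki instant-query APIs; the `TEST_QUERIES` table, the function names, and the choice of `1 + 1` as the trivial LogQL query are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical table: stack type -> (instant-query path, cheap test query).
# The paths are the public Prometheus/Loki HTTP API endpoints; the Loki
# query "1 + 1" is an assumed stand-in for "a trivial LogQL query".
TEST_QUERIES = {
    "prometheus": ("/api/v1/query", "up{} == 1"),
    "thanos": ("/api/v1/query", "up{} == 1"),
    "loki": ("/loki/api/v1/query", "1 + 1"),
}


def build_test_url(stack_type: str, base_url: str) -> str:
    """Return the full URL of the benign test query for one target."""
    path, query = TEST_QUERIES[stack_type]
    return f"{base_url.rstrip('/')}{path}?{urlencode({'query': query})}"


def is_healthy(response_body: str) -> bool:
    """Both APIs wrap results as {"status": "success", "data": {...}}."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return False  # e.g. an HTML error page from a proxy
    return isinstance(payload, dict) and payload.get("status") == "success"
```

Running the same URL by hand (for example with curl, adding whatever auth the target needs) is a quick way to tell a Daalu-side problem from a problem with the target itself.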
Telemetry behind your firewall
If your Prometheus lives inside a federated cluster — the common case for on-prem operators:
- Connect the cluster first (Chapter 41).
- When adding the observability integration, tick Reachable via federation tunnel and use the cluster-internal URL (e.g. http://prometheus.monitoring:9090).
- Daalu routes every query through the tunnel, so nothing is exposed to the public internet.
This is the standard pattern, and it’s why you never have to put a metrics endpoint on the open internet to use Daalu.
Alert rules
For Prometheus and Grafana, Daalu auto-imports alert rules:
- It pulls them on a schedule (every 15 minutes).
- When an upstream rule fires, the Daalu adapter sees the alert and writes a corresponding Daalu alert.
- Acknowledging or resolving in Daalu mirrors back to the upstream system where the API supports it.
For Loki, configure rules in your Loki ruler; Daalu mirrors them the same way. For OpenSearch, alerting plugins vary, so Daalu pulls their alarms in as alerts wherever the plugin’s API allows.
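As an illustration of what a rule import involves, the sketch below filters the response of Prometheus's `/api/v1/rules` endpoint down to alerting rules. The response shape (groups containing rules with `type`, `name`, `query`, `labels`, and `state`) follows the Prometheus HTTP API; the `MirroredRule` record and its fields are hypothetical and say nothing about how Daalu stores rules internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirroredRule:
    """Hypothetical record for one imported alert rule."""
    group: str
    name: str
    expr: str
    severity: str
    firing: bool


def extract_alert_rules(rules_response: dict) -> list[MirroredRule]:
    """Pick the alerting rules out of a Prometheus /api/v1/rules response."""
    mirrored = []
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") != "alerting":
                continue  # recording rules never fire, so skip them
            mirrored.append(MirroredRule(
                group=group.get("name", ""),
                name=rule.get("name", ""),
                expr=rule.get("query", ""),
                # "severity" is a common label convention, not a guarantee
                severity=rule.get("labels", {}).get("severity", "unknown"),
                firing=rule.get("state") == "firing",
            ))
    return mirrored
```

A scheduled import would call something like this on every pull and reconcile the result against the rules it already knows about.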
What the Assistant uses each one for
When the Assistant investigates an alert:
- Prometheus — pulls the metric value at fire time and N minutes around it.
- Loki — pulls logs from the relevant labels in that window.
- OpenSearch — searches the indices the alert’s source maps to.
- Grafana — reads dashboards to find the panel that represents the metric.
- Thanos — long-context queries (“is this trending up over the last 30 days?”).
Tip: Every answer the Assistant gives carries citations that link back to the live query results, so you can verify exactly what it looked at.
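The "fire time and N minutes around it" window can be pictured as a Prometheus range query. The sketch below builds the parameters for `/api/v1/query_range` (`query`, `start`, `end`, `step` are the documented parameters of that endpoint). The 15-minute default, the cap on data points, and the function name are assumptions; the source does not say what value of N the Assistant uses.

```python
from datetime import datetime, timedelta, timezone


def range_query_params(expr: str, fired_at: datetime,
                       minutes_around: int = 15,
                       max_points: int = 120) -> dict:
    """Build query_range parameters for a window centred on the fire time.

    minutes_around and max_points are illustrative defaults, not Daalu's.
    """
    start = fired_at - timedelta(minutes=minutes_around)
    end = fired_at + timedelta(minutes=minutes_around)
    # Pick a step that keeps the series under max_points, at least 1 second.
    step_seconds = max(1, int((end - start).total_seconds() // max_points))
    return {
        "query": expr,
        "start": start.timestamp(),  # Unix seconds are accepted by the API
        "end": end.timestamp(),
        "step": f"{step_seconds}s",
    }


params = range_query_params(
    "rate(http_requests_total[5m])",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

The same start and end bounds would apply to the matching Loki log query, which is what lets metrics and logs be read side by side for the same incident.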
Choosing one when you have many
If your tenant has several Prometheus instances (one per cluster, say), Daalu needs to route each query to the right one. Configure this per integration:
- Default destination — the one used for unscoped queries.
- Scoped routes — labels that force routing to a specific target.
For example: any query labelled cluster: eu-prod goes to the eu-prod Prometheus, cluster: us-east goes to us-east, and everything else goes to the default.
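The routing rule described above amounts to "first matching scoped route wins, otherwise use the default". A minimal sketch of that logic follows; the `Target` and `Router` classes and the URLs are hypothetical, and the real configuration lives in the Daalu UI rather than in code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Target:
    """Hypothetical handle for one configured Prometheus."""
    name: str
    base_url: str


@dataclass
class Router:
    default: Target
    # Each scoped route pairs the labels it requires with a target.
    scoped: list[tuple[dict[str, str], Target]] = field(default_factory=list)

    def route(self, query_labels: dict[str, str]) -> Target:
        """Return the first target whose required labels all match."""
        for required, target in self.scoped:
            if all(query_labels.get(k) == v for k, v in required.items()):
                return target
        return self.default  # unscoped queries land here


eu = Target("eu-prod", "http://prometheus.eu-prod.example:9090")
us = Target("us-east", "http://prometheus.us-east.example:9090")
central = Target("default", "http://prometheus.central.example:9090")

router = Router(default=central, scoped=[
    ({"cluster": "eu-prod"}, eu),
    ({"cluster": "us-east"}, us),
])
```

With this setup, `router.route({"cluster": "eu-prod"})` returns the eu-prod target, and a query with no `cluster` label falls through to the default.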
Capacity and rate limits
Daalu’s adapters apply per-target rate limits so an investigation can’t overload your stack:
- Prometheus: ~30 queries/minute by default.
- Loki: ~10 queries/minute (log queries are expensive).
- OpenSearch: ~30 queries/minute.
- Grafana: ~30 API calls/minute.
Raise these in Settings → Observability → Rate limits if your stack handles more.
Warning: Be conservative. DoS-ing your own metrics backend with an over-eager Assistant is an unhappy way to learn where your limits are.
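For intuition about what a per-target limit of "N queries/minute" does, here is a small sliding-window limiter. It is one common way to enforce such a limit, not a description of how Daalu's adapters work; the class name and the injectable `clock` parameter are assumptions made so the example can be exercised without waiting.

```python
import time
from collections import deque


class PerMinuteLimiter:
    """Allow at most max_per_minute calls in any rolling 60-second window."""

    def __init__(self, max_per_minute: int, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock      # injectable so tests can control time
        self._sent = deque()    # timestamps of recently admitted queries

    def try_acquire(self) -> bool:
        now = self.clock()
        # Forget queries that have aged out of the 60-second window.
        while self._sent and now - self._sent[0] >= 60.0:
            self._sent.popleft()
        if len(self._sent) >= self.max_per_minute:
            return False        # over budget: caller should wait or skip
        self._sent.append(now)
        return True


# One limiter per target, using the defaults listed above.
limiters = {
    "prometheus": PerMinuteLimiter(30),
    "loki": PerMinuteLimiter(10),
}
```

A rejected `try_acquire()` is the kind of event that would surface as "no data" during an investigation, which is why a query that was rate-limited can look like a query that returned nothing.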
Troubleshooting
- Connection refused — wrong URL, or the service isn’t running.
- 401 / 403 — wrong auth method.
- No data for a query that should work — a label got stripped, or the rate limit fired. Check the detail page’s “Last query” trace.
- Stale alert mirror — the import schedule is 15 minutes by default. For faster updates, point your Prometheus Alertmanager at Daalu’s webhook endpoint (Chapter 40) so alerts arrive the moment they fire.