Observability is not an afterthought
On most platforms, observability lives at the end of a long list of “things to set up later”. Flui inverts that ordering. Metrics and logs are there from minute zero: the same operation that brings up a cluster also brings up the place where its metrics and logs go, and they cover both the apps you deploy and the platform’s own components. The surfaces you use to read them for your first app are the same ones you will use for app number two hundred.
This chapter is about the intent and the shape of that choice. The detailed inventory of what runs inside the control cluster is covered in the next chapter.
A pillar, not an add-on
Observability in Flui is not a separate tool you bolt on after the fact. It is one of the pillars the rest of the platform stands on, present from the first minute the installation is alive.
Three things make this concrete:
- It is part of the bootstrap. The same operation that brings up the control cluster also brings up the backend that collects metrics and logs. The installation is not considered ready until that backend is. Workload clusters added later feed into the same backend. You do not stand up a second observability stack; you point the new cluster’s agents at the one that already exists.
- It is modelled around the platform’s own concepts. The CLI and the dashboard expose metrics and logs in terms of clusters, servers and applications — the things you already reason about — instead of asking you to translate to namespaces, labels and query languages first.
- The platform itself relies on it. Alerts evaluate against the metrics backend, and the suggestions the platform makes for scaling — when to grow a node, when an application is running close to the edge — read from the same data. Take observability away and those features have nothing to read.
The consequence: you cannot run Flui in a useful state without observability. That is a deliberate constraint — it removes a class of “I’ll add monitoring later” decisions that, in practice, never get made.
What is in scope
Flui’s built-in observability covers:
- Metrics — host-level metrics on every node, cluster-level metrics on every object, and the baseline per-application metrics (CPU, memory, restart count, replica health) that Flui surfaces for every app it manages.
- Logs — application logs from every cluster, indexed centrally and queryable per app, per cluster, per time range.
- Dashboards — the everyday path is Flui’s own dashboard and CLI, which surface what is worth looking at — for a cluster and for an application — curated by Flui rather than left as a blank canvas. Grafana is also deployed on the control cluster, pre-wired to the metrics and log backends, but as an advanced tool for deep debugging and custom dashboards — not as the primary way to read metrics and logs.
- Alerts — rule evaluation against the metrics backend, so conditions of interest can trigger without external tooling.
Some classes of telemetry are not part of the built-in stack today: end-to-end request tracing across services, application performance profiling, and uptime checks from outside the platform. Applications that need them add their own tooling on top.
No database storage for logs or metrics
Flui’s database holds state about the installation — clusters, nodes, applications, operations, releases, builds — and nothing else. Logs and metrics live in their own purpose-built backends, with their own retention and capacity policies.
When you query logs or metrics through Flui, the API acts as a thin proxy: it resolves which application you are asking about, scopes the query to that application’s data, forwards the query to the appropriate backend, and streams the answer back. No log line and no metric sample is ever written to Flui’s own database.
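The request path described above can be sketched as follows. This is a minimal illustration, not Flui's actual implementation: the `App` record, the `APPS` registry, the `app` and `cluster` label names, and the label-selector query syntax are all assumptions made for the example, and `backend` stands in for whatever client talks to the real metrics or log store.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class App:
    """An application as the proxy might see it (hypothetical shape)."""
    name: str
    cluster: str


# Hypothetical registry standing in for Flui's database of managed resources.
APPS = {"shop-api": App(name="shop-api", cluster="prod-eu")}


def escape_label_value(value: str) -> str:
    """Escape backslashes and double quotes so a value cannot break out of the selector."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def scope_query(app_name: str, metric: str) -> str:
    """Resolve the app, then build a query restricted to that app's data."""
    app = APPS.get(app_name)
    if app is None:
        raise KeyError(f"unknown application: {app_name}")
    # Assumed label names; a real deployment may use different ones.
    selector = ", ".join(
        f'{label}="{escape_label_value(value)}"'
        for label, value in (("app", app.name), ("cluster", app.cluster))
    )
    return f"{metric}{{{selector}}}"


def proxy_query(
    app_name: str,
    metric: str,
    backend: Callable[[str], Iterable[str]],
) -> Iterator[str]:
    """Forward the scoped query and stream results back without storing them."""
    yield from backend(scope_query(app_name, metric))
```

Nothing in this path writes to a database: the registry is only read to resolve the application, and results pass straight through the generator to the caller.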
Three things follow from this:
- The control plane stays small. Flui’s database does not grow with traffic; it grows with the number of resources you manage.
- The backends are independently replaceable. If a future release prefers a different metrics or log store, the change is contained — the user-facing surfaces stay the same.
- Standard tools work. Anyone with a background in the underlying query languages can drop into Grafana and run the same kind of queries Flui itself is running.
How workload clusters feed the control cluster
The control cluster is the consumer of metrics and logs from every workload cluster. The flow is push-based, not pull-based: every node on every cluster runs a metrics agent and a log shipper that send their data to the central backend over the cluster’s private network.
If the control cluster is unreachable, workloads keep running — each cluster is autonomous in steady state — and the agents buffer locally. When the link returns, recent data back-fills, and the dashboard surfaces the brief gap rather than failing silently.
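One way a dashboard could surface such a gap is to look for consecutive samples spaced further apart than the expected reporting interval. The sketch below assumes a fixed interval and uses a hypothetical `find_gaps` helper with an arbitrary tolerance factor; it is not taken from Flui's code.

```python
from typing import Iterable, List, Tuple


def find_gaps(
    timestamps: Iterable[float],
    interval: float,
    tolerance: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where samples are missing.

    Two consecutive samples count as a gap when they are more than
    `tolerance * interval` seconds apart.
    """
    ordered = sorted(timestamps)
    return [
        (prev, curr)
        for prev, curr in zip(ordered, ordered[1:])
        if curr - prev > interval * tolerance
    ]


# Samples every 15 s, with nothing received between t=30 and t=90.
print(find_gaps([0, 15, 30, 90, 105], interval=15))  # → [(30, 90)]
```

A chart can then shade each returned range as "no data" instead of drawing a straight line across it.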
Why push, not pull
The conventional model is pull, where a central scraper reaches into each target. Flui pushes instead, for three concrete reasons: the trust boundary points outward (workload clusters initiate, the control cluster never reaches in), short outages back-fill from the agent’s local buffer rather than leaving holes in the time series, and the control cluster does not need a registry of every endpoint to scrape. Tracing, when it lands, is expected to follow the same shape.
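The buffering behaviour behind the second reason can be sketched as a small push agent. This is an illustration under stated assumptions, not Flui's agent: the `BufferingAgent` class, the `Sample` tuple, the bounded capacity, and the use of `ConnectionError` to signal an unreachable backend are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Tuple

Sample = Tuple[float, str, float]  # (timestamp, metric name, value)


class BufferingAgent:
    """Pushes samples to a central backend, buffering locally while it is unreachable."""

    def __init__(self, send: Callable[[Sample], None], capacity: int = 1000) -> None:
        self._send = send
        # Bounded buffer: once full, the oldest sample is dropped first.
        self._buffer: Deque[Sample] = deque(maxlen=capacity)

    def record(self, sample: Sample) -> None:
        """Queue a sample, then try to push everything that is pending."""
        self._buffer.append(sample)
        self.flush()

    def flush(self) -> int:
        """Send buffered samples oldest-first; stop at the first failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # Backend unreachable: keep the sample for the next attempt.
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of samples waiting to be delivered."""
        return len(self._buffer)
```

Because each sample carries its own timestamp, data delivered after an outage lands at the time it was measured, which is what lets a short outage back-fill. The bounded capacity is why only short outages are covered: a long one overflows the buffer and the oldest samples are lost.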
What the user sees
Three surfaces, all backed by the same data:
| Surface | What it shows | When to use it |
|---|---|---|
| CLI | Single-app logs (filtered by level and full-text search), runtime status, recent crashes, and metrics | Debugging a specific app, scripting, CI |
| Dashboard | Per-app live charts and log streams, plus cluster-level views | Day-to-day operation; no Grafana needed |
| Grafana | Full query power against the metrics and log backends, custom dashboards, alerts | Cross-cutting analysis, custom monitoring |
The CLI and the dashboard share the same backend semantics; Grafana is the always-there escape hatch. None of the three is authoritative over the others — they are different views of the same data.
Where this chapter goes from here
- The control cluster — the detailed inventory: which components, in which role, at what cost.
- The app concept — how the app surface is wired to the observability backend.