Kubernetes Clusters
Watchgrid can monitor and manage external Kubernetes clusters — K3s, Talos, RKE2, EKS, GKE, AKS, and others — using the lightweight cluster-agent.
How It Works
The cluster-agent runs as a Pod inside your Kubernetes cluster. It:
- Connects to your Watchgrid server via the WireGuard VPN
- Collects node health, pod counts, and CPU/memory metrics
- Sends heartbeats with cluster status every few seconds
- Appears in the Watchgrid Dashboard alongside regular devices
Adding a Cluster
Clusters are onboarded into a Site the same way devices are. The cluster-agent manifest is generated from the site's onboarding modal, so the new cluster lands directly in the correct tenant and site with no follow-up assignment step.
- Go to Sites → [Site Name]
- Click Onboard To Site
- Scroll to the Cluster Manifest section
- Enter a cluster name (alphanumeric, max 63 characters) and click Generate Manifest
- Click Copy Manifest — the YAML embeds the site ID, the tenant onboarding token, and the server URL
- Save the manifest to a file and apply it to your cluster with kubectl apply -f <manifest-file>
The cluster-agent Pod starts and connects to your Watchgrid server over WireGuard, and the cluster appears under that site in Inventory within a few seconds.
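The cluster-name rule from the steps above (alphanumeric, max 63 characters) can be checked before a manifest is generated. The following is a minimal sketch, not Watchgrid's own validator: the function name is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits only. The real check may be more permissive (for example, it may allow hyphens).

```python
import re

# Assumption: "alphanumeric" is read strictly as ASCII letters and digits.
# The length bound (1 to 63 characters) comes from the onboarding steps.
_CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9]{1,63}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name is 1 to 63 ASCII alphanumeric characters."""
    return _CLUSTER_NAME_RE.fullmatch(name) is not None
```

Under this reading, a name such as edge01 passes, while edge-01 or a 64-character name is rejected.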
Need to onboard a host that already runs K3s but isn't yet a Watchgrid cluster? Use the Installation With Kubernetes curl command from the same modal — it provisions the host, installs K3s, and registers the cluster in one step.
Cluster List
Clusters appear in Inventory, grouped under the site they were onboarded into. Expand a cluster row to see its details:
Row Header
- Status indicator — green (online) or red (offline)
- Cluster hostname and device ID
- Kubernetes metadata — distribution, version, node count, pod count
Expanded Details
Click a cluster row to expand it and see:
Nodes Table
| Column | Description |
|---|---|
| Name | Node hostname |
| Role | control-plane, worker, etc. |
| Status | Ready or NotReady |
| Version | Kubernetes version |
| CPU% | Current CPU utilization |
| Mem% | Current memory utilization |
| Pods | Number of pods on this node |
Pods by Namespace
A breakdown of pods per namespace showing:
- Running count
- Pending count
- Failed count
- Total count
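To illustrate how such a breakdown could be computed, here is a small sketch that groups pods by namespace and counts them by phase. The input shape (a list of namespace and phase pairs) and the function name are assumptions made for the example. Kubernetes also has Succeeded and Unknown pod phases, so in this sketch Total can exceed the sum of the other three columns.

```python
from collections import Counter, defaultdict

# Columns shown in the breakdown; "Total" counts every pod regardless of phase.
COLUMNS = ("Running", "Pending", "Failed", "Total")


def pods_by_namespace(pods):
    """Summarize (namespace, phase) pairs into per-namespace counts."""
    summary = defaultdict(Counter)
    for namespace, phase in pods:
        summary[namespace][phase] += 1
        summary[namespace]["Total"] += 1
    # Counter returns 0 for phases that never occurred in a namespace.
    return {ns: {col: counts[col] for col in COLUMNS}
            for ns, counts in summary.items()}
```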
Additional Info
- List of all namespaces as tag chips
- VPN IP — the cluster-agent's WireGuard address
- Last seen timestamp
Supported Distributions
The cluster-agent works with any Kubernetes distribution:
- K3s — lightweight, common on edge devices
- Talos — immutable Kubernetes OS
- RKE2 — Rancher's hardened distribution
- EKS — Amazon Elastic Kubernetes Service
- GKE — Google Kubernetes Engine
- AKS — Azure Kubernetes Service
- kubeadm clusters
- Any conformant Kubernetes cluster
The agent talks to the standard Kubernetes API through client-go and reads CPU and memory usage from metrics-server.
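As a rough illustration of how utilization percentages such as CPU% and Mem% can be derived from metrics-server data, the sketch below parses Kubernetes quantity strings (for example 137451029n for CPU usage or 8388608Ki for memory) and divides usage by node capacity. This is not the agent's actual code, which uses client-go. The helper names are hypothetical, only the common suffixes are handled, and whether Watchgrid divides by capacity or by allocatable resources is an assumption.

```python
# Common Kubernetes quantity suffixes; this sketch does not cover all of them.
_CPU_SUFFIX = {"n": 1e-9, "u": 1e-6, "m": 1e-3}
_MEM_SUFFIX = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' into cores."""
    if quantity and quantity[-1] in _CPU_SUFFIX:
        return float(quantity[:-1]) * _CPU_SUFFIX[quantity[-1]]
    return float(quantity)


def parse_memory(quantity: str) -> float:
    """Convert a memory quantity such as '512Mi' into bytes."""
    if len(quantity) > 2 and quantity[-2:] in _MEM_SUFFIX:
        return float(quantity[:-2]) * _MEM_SUFFIX[quantity[-2:]]
    return float(quantity)


def percent(used: float, capacity: float) -> float:
    """Utilization as a percentage, rounded to one decimal place."""
    if capacity <= 0:
        return 0.0
    return round(100.0 * used / capacity, 1)
```

For example, percent(parse_cpu("500000000n"), parse_cpu("2")) describes a node using half a core out of two.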
Security
K8s API TLS Verification
When the server routes Kubernetes API calls through the WireGuard tunnel, it uses the cluster CA certificate embedded in the kubeconfig for TLS verification. The connection targets the device's WireGuard tunnel IP but validates the certificate against the original kubeconfig hostname (typically 127.0.0.1 for K3s). No TLS bypass is performed.
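The general technique described here, connecting to one address while verifying the certificate against a different name, can be sketched with Python's standard ssl module. This is an illustration only and not the Watchgrid server's implementation: the function names are hypothetical, and the CA PEM, tunnel IP, and port are supplied by the caller.

```python
import socket
import ssl
from typing import Optional


def build_tls_context(cluster_ca_pem: Optional[str] = None) -> ssl.SSLContext:
    """Create a client context that always verifies the peer certificate."""
    # PROTOCOL_TLS_CLIENT enables check_hostname and CERT_REQUIRED by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cluster_ca_pem:
        # Trust only the cluster CA taken from the kubeconfig.
        ctx.load_verify_locations(cadata=cluster_ca_pem)
    return ctx


def open_verified_connection(ctx, tunnel_ip, port, kubeconfig_host):
    """Dial the tunnel IP, but verify the certificate for kubeconfig_host."""
    raw = socket.create_connection((tunnel_ip, port), timeout=10)
    try:
        # server_hostname is the name (or IP) the certificate must match,
        # e.g. "127.0.0.1" for a K3s kubeconfig, not the address we dialed.
        return ctx.wrap_socket(raw, server_hostname=kubeconfig_host)
    except Exception:
        raw.close()
        raise
```

Because verification stays enabled, a certificate that is not signed by the cluster CA or does not cover the kubeconfig hostname makes the handshake fail rather than being silently accepted.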
Service Proxy Access Control
The cluster-agent's internal K8s service proxy (port 8081) only accepts connections from the Watchgrid server's WireGuard gateway IP. The gateway is derived automatically from the cluster-agent's own tunnel IP (e.g. tunnel IP 100.64.2.5 → gateway 100.64.2.254), so multi-tenant deployments work without any extra configuration. Set WATCHGRID_GATEWAY_IP to override for non-standard subnet layouts.
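The derivation and the allow-list check can be sketched as follows. Only the single example above (100.64.2.5 → 100.64.2.254) and the WATCHGRID_GATEWAY_IP variable come from this page. The /24 subnet size, the rule "gateway is host .254 of that subnet", and the function names are assumptions made for illustration.

```python
import ipaddress
import os


def derive_gateway_ip(tunnel_ip: str, env=None) -> str:
    """Return the gateway IP for a tunnel IP, honoring the env override."""
    env = os.environ if env is None else env
    override = env.get("WATCHGRID_GATEWAY_IP")
    if override:
        # Validate the override so a typo fails loudly with ValueError.
        return str(ipaddress.ip_address(override))
    # Assumption: each tunnel subnet is a /24 and the gateway sits at .254.
    network = ipaddress.ip_network(f"{tunnel_ip}/24", strict=False)
    return str(network.network_address + 254)


def is_allowed_peer(peer_ip: str, tunnel_ip: str, env=None) -> bool:
    """Accept a proxy connection only if it comes from the gateway IP."""
    gateway = ipaddress.ip_address(derive_gateway_ip(tunnel_ip, env))
    return ipaddress.ip_address(peer_ip) == gateway
```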
Protected Namespaces
Watchgrid will refuse to deploy, delete, restart, or scale resources in kube-system, kube-public, kube-node-lease, or watchgrid-system. These namespaces are protected to prevent accidental or malicious disruption of cluster control-plane components.
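A guard of this kind is straightforward to express in code. The sketch below uses the four namespaces listed above; the exception type and function name are hypothetical, and where Watchgrid actually enforces the check is not specified here.

```python
# The namespaces named in this section as protected.
PROTECTED_NAMESPACES = frozenset({
    "kube-system",
    "kube-public",
    "kube-node-lease",
    "watchgrid-system",
})


class ProtectedNamespaceError(Exception):
    """Raised when an operation targets a protected namespace."""


def check_namespace(namespace: str, operation: str) -> None:
    """Reject deploy, delete, restart, or scale in protected namespaces."""
    if namespace in PROTECTED_NAMESPACES:
        raise ProtectedNamespaceError(
            f"refusing to {operation} in protected namespace {namespace!r}"
        )
```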
Command Queue
Cluster deploy, undeploy, restart, and scale operations are written to a persistent queue (PostgreSQL table cluster_commands) before being dispatched to the cluster-agent. This means:
- Restart safety: if the Watchgrid server restarts mid-deploy, the command is still in the queue and is re-delivered to the cluster-agent on its next poll. No silent loss of in-flight operations.
- Idempotency: duplicate requests (e.g. double-click "Deploy") for the same cluster + app + namespace are de-duplicated at enqueue time while a previous request for the same target is still pending or claimed. Each distinct operation still gets its own row once the previous one completes.
- Crash recovery: commands stuck in the claimed state for more than 10 minutes are automatically reset to pending on server startup, allowing re-delivery. Cluster-agent deploys are idempotent at the Kubernetes layer (kubectl apply), so re-delivery is safe.
- History: GET /api/clusters/commands?cluster_id=<id> returns the queue history for a cluster (newest first), including status (pending/claimed/done/failed), timestamps, payload, and error messages.
Every successful enqueue returns a command_id in the response body, which can be matched against the entries returned by the history endpoint.
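The queue semantics described in this section can be modeled in a few lines. The sketch below is an in-memory stand-in for the PostgreSQL-backed queue, not the real schema or server code. The class, field, and method names are hypothetical, and returning the existing command_id for a de-duplicated request is an assumption.

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(minutes=10)  # claimed longer than this is "stuck"


@dataclass
class Command:
    command_id: int
    cluster_id: str
    app: str
    namespace: str
    operation: str
    status: str = "pending"  # pending / claimed / done / failed
    claimed_at: Optional[datetime] = None


class CommandQueue:
    def __init__(self):
        self._rows = []
        self._ids = itertools.count(1)

    def enqueue(self, cluster_id, app, namespace, operation):
        # De-duplicate while an earlier request for the same target is open.
        for row in self._rows:
            same = (row.cluster_id, row.app, row.namespace) == (cluster_id, app, namespace)
            if same and row.status in ("pending", "claimed"):
                return row.command_id  # assumption: reuse the existing id
        row = Command(next(self._ids), cluster_id, app, namespace, operation)
        self._rows.append(row)
        return row.command_id

    def claim_next(self, cluster_id, now):
        # Called when the cluster-agent polls: hand out the oldest pending row.
        for row in self._rows:
            if row.cluster_id == cluster_id and row.status == "pending":
                row.status, row.claimed_at = "claimed", now
                return row
        return None

    def finish(self, command_id, ok):
        for row in self._rows:
            if row.command_id == command_id:
                row.status = "done" if ok else "failed"

    def recover_stale(self, now):
        # Run at server startup: reset stuck claims so they are re-delivered.
        reset = 0
        for row in self._rows:
            if row.status == "claimed" and now - row.claimed_at > STALE_AFTER:
                row.status, row.claimed_at = "pending", None
                reset += 1
        return reset

    def history(self, cluster_id):
        # Newest first, mirroring the history endpoint's ordering.
        rows = [r for r in self._rows if r.cluster_id == cluster_id]
        return sorted(rows, key=lambda r: r.command_id, reverse=True)
```

In this model a double-clicked Deploy yields one row, a claim left untouched for more than 10 minutes returns to pending on recovery, and a new request for the same target gets its own row once the earlier one is done or failed.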