Kubernetes Clusters

Watchgrid can monitor and manage external Kubernetes clusters — K3s, Talos, RKE2, EKS, GKE, AKS, and others — using the lightweight cluster-agent.

How It Works

The cluster-agent runs as a Pod inside your Kubernetes cluster. It:

Connects to your Watchgrid server via the WireGuard VPN
Collects node health, pod counts, and CPU/memory metrics
Sends heartbeats with cluster status every few seconds
Appears in the Watchgrid Dashboard alongside regular devices

Adding a Cluster

Clusters are onboarded into a Site, the same way devices are. The cluster-agent's manifest is generated from the site's onboarding modal, so the new cluster lands directly in the right tenant and site with no follow-up assignment step.

Go to Sites → [Site Name]
Click Onboard To Site
Scroll to the Cluster Manifest section
Enter a cluster name (alphanumeric, max 63 characters) and click Generate Manifest
Optionally tick "Tunnel WireGuard over HTTPS (443)" if the cluster's network blocks outbound UDP — the tunnel is then carried over TLS/443 instead of UDP 51820. See WireGuard over HTTPS.
Click Copy Manifest — the YAML embeds the site ID, the tenant onboarding token, and the server URL
Apply it to your cluster:

kubectl apply -f watchgrid-cluster-agent.yaml

The cluster-agent Pod starts and connects to your Watchgrid server over WireGuard, and the cluster appears under that site in Inventory within a few seconds.
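The cluster-name rule above (alphanumeric, max 63 characters) can be checked before you open the modal. The following is a minimal sketch, not Watchgrid's own validator: it assumes "alphanumeric" means ASCII letters and digits only, and the function name is hypothetical.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
# The real form may accept a different character set.
_CLUSTER_NAME = re.compile(r"[A-Za-z0-9]{1,63}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name is 1 to 63 ASCII letters or digits."""
    return _CLUSTER_NAME.fullmatch(name) is not None
```

For example, is_valid_cluster_name("edge01") passes, while a 64-character name or a name containing a hyphen is rejected under this assumption.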

Need to onboard a host that already runs K3s but isn't yet a Watchgrid cluster? Use the Installation With Kubernetes curl command from the same modal — it provisions the host, installs K3s, and registers the cluster in one step.

Cluster List

Clusters appear in Sites → Devices, alongside agent devices — a cluster is just another device type. Each row carries a type label (Agent / Cluster), and the All / Agents / Clusters filter narrows the list by type. Expand a cluster row to see its details:

Row Header

Status indicator — green (online) or red (offline)
Cluster hostname and device ID
Kubernetes metadata — distribution, version, node count, pod count

Expanded Details

Click a cluster row to expand it and see:

Nodes Table

Column	Description
Name	Node hostname
Role	control-plane, worker, etc.
Status	Ready or NotReady
Version	Kubernetes version
CPU%	Current CPU utilization
Mem%	Current memory utilization
Pods	Number of pods on this node

Pods by Namespace

A breakdown of pods per namespace showing:

Running count
Pending count
Failed count
Total count
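To illustrate how such a breakdown can be derived, here is a minimal Python sketch that aggregates the JSON produced by kubectl get pods -A -o json. It is not the cluster-agent's implementation (which uses client-go); the function name is hypothetical, and counting Succeeded and Unknown pods toward the total is an assumption.

```python
from collections import Counter, defaultdict


def pods_by_namespace(pod_list: dict) -> dict:
    """Summarise a Kubernetes PodList (parsed JSON) per namespace."""
    phases = defaultdict(Counter)
    for pod in pod_list.get("items", []):
        namespace = pod["metadata"]["namespace"]
        # status.phase is one of Pending, Running, Succeeded, Failed, Unknown.
        phase = pod.get("status", {}).get("phase", "Unknown")
        phases[namespace][phase] += 1
    return {
        namespace: {
            "running": counts["Running"],
            "pending": counts["Pending"],
            "failed": counts["Failed"],
            # Assumption: the total covers every phase, not just the three above.
            "total": sum(counts.values()),
        }
        for namespace, counts in phases.items()
    }
```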

Applications

Deploy apps from the catalog to the cluster, and manage what's already running. Each deployed app can be:

Edited — for apps with config fields, the Edit action re-opens the config form pre-filled with the saved values and redeploys with your changes.
Removed — deletes the app's Deployment and Service from the cluster.

Additional Info

List of all namespaces as tag chips
VPN IP — the cluster-agent's WireGuard address
Last seen timestamp

Removing a Cluster

Expand the cluster row and click Delete Cluster in the detail header. After you confirm, Watchgrid removes the cluster device, its WireGuard peer, and its proxy DNS records. Workloads already running inside the Kubernetes cluster are not affected. To fully decommission the cluster, also remove the cluster-agent Deployment from it (kubectl delete -f watchgrid-cluster-agent.yaml).

Supported Distributions

The cluster-agent works with any Kubernetes distribution:

K3s — lightweight, common on edge devices
Talos — immutable Kubernetes OS
RKE2 — Rancher's hardened distribution
EKS — Amazon Elastic Kubernetes Service
GKE — Google Kubernetes Engine
AKS — Azure Kubernetes Service
kubeadm clusters
Any conformant Kubernetes cluster

The agent talks to the standard Kubernetes API via client-go and reads resource usage from metrics-server.

Security

K8s API TLS Verification

When the server routes Kubernetes API calls through the WireGuard tunnel, it uses the cluster CA certificate embedded in the kubeconfig for TLS verification. The connection targets the device's WireGuard tunnel IP but validates the certificate against the original kubeconfig hostname (typically 127.0.0.1 for K3s). No TLS bypass is performed.
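The idea of connecting to one address while validating the certificate against another name can be sketched with Python's standard ssl module. This is an illustration of the technique only, not the Watchgrid server's code; the function names and parameters are hypothetical.

```python
import socket
import ssl
from typing import Optional


def build_verifying_context(ca_pem: Optional[str] = None) -> ssl.SSLContext:
    """Create a client context with certificate and hostname checks enabled.

    If ca_pem (for example the cluster CA from the kubeconfig) is given,
    only that CA is trusted; otherwise the system trust store is used.
    """
    return ssl.create_default_context(cadata=ca_pem)


def connect_via_tunnel(tunnel_ip: str, port: int, expected_host: str,
                       ctx: ssl.SSLContext) -> ssl.SSLSocket:
    """Open TCP to tunnel_ip, but verify the certificate for expected_host."""
    sock = socket.create_connection((tunnel_ip, port), timeout=10)
    try:
        # server_hostname controls which name the certificate is checked
        # against; it does not have to match the address we dialled.
        return ctx.wrap_socket(sock, server_hostname=expected_host)
    except Exception:
        sock.close()
        raise
```

With this pattern, a call such as connect_via_tunnel("100.64.2.5", 6443, "127.0.0.1", ctx) would dial the tunnel address and still fail the handshake if the certificate were not valid for 127.0.0.1 (the address and port here are placeholders).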

Service Proxy Access Control

The cluster-agent's internal K8s service proxy (port 8081) only accepts connections from the Watchgrid server's WireGuard gateway IP. The gateway is derived automatically from the cluster-agent's own tunnel IP (e.g. tunnel IP 100.64.2.5 → gateway 100.64.2.254), so multi-tenant deployments work without any extra configuration. Set WATCHGRID_GATEWAY_IP to override for non-standard subnet layouts.
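The derivation in the example above can be sketched as follows. This is a guess at the rule consistent with the 100.64.2.5 → 100.64.2.254 example, assuming a /24 subnet whose gateway is the last usable host; the actual prefix length and rule used by the cluster-agent are not stated here, and the function names are hypothetical.

```python
import ipaddress
import os


def derive_gateway(tunnel_ip: str, prefix_len: int = 24) -> str:
    """Return the gateway IP for a tunnel IP, honouring the override."""
    override = os.environ.get("WATCHGRID_GATEWAY_IP")
    if override:
        return str(ipaddress.ip_address(override))
    # Assumption: /24 subnet, gateway is the last usable host address.
    network = ipaddress.ip_network(f"{tunnel_ip}/{prefix_len}", strict=False)
    return str(network.broadcast_address - 1)


def is_allowed_peer(remote_ip: str, tunnel_ip: str) -> bool:
    """Accept a connection only if it comes from the derived gateway."""
    return ipaddress.ip_address(remote_ip) == ipaddress.ip_address(
        derive_gateway(tunnel_ip)
    )
```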

Protected Namespaces

Watchgrid will refuse to deploy, delete, restart, or scale resources in kube-system, kube-public, kube-node-lease, or watchgrid-system. These namespaces are protected to prevent accidental or malicious disruption of cluster control-plane components.
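A guard of this kind can be as simple as the sketch below. The namespace list is taken from the paragraph above; the function and exception names are hypothetical and do not reflect Watchgrid's internal code.

```python
PROTECTED_NAMESPACES = frozenset(
    {"kube-system", "kube-public", "kube-node-lease", "watchgrid-system"}
)


class ProtectedNamespaceError(Exception):
    """Raised when an operation targets a protected namespace."""


def ensure_namespace_allowed(namespace: str, operation: str) -> None:
    """Reject deploy, delete, restart, or scale in protected namespaces."""
    if namespace in PROTECTED_NAMESPACES:
        raise ProtectedNamespaceError(
            f"refusing to {operation} in protected namespace {namespace!r}"
        )
```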

Node InternalIP on Talos

The cluster-agent runs with hostNetwork and brings up a WireGuard interface (wg0) on the node it lands on. If kubectl get nodes -o wide shows a node's INTERNAL-IP as a WireGuard address (100.64.x.x) instead of its real LAN IP, kubelet has adopted the tunnel interface as the node IP. This breaks in-cluster routing and makes kubectl slow.

Watchgrid mitigates this by assigning the tunnel IP with scope link, so node-IP auto-detection ignores it (cluster-agent ≥ 1.24.2). On Talos, the guaranteed fix is to pin the node IP in the machine config so that kubelet never considers the WireGuard range:

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 172.31.0.0/16   # your real LAN subnet(s)
        # or, equivalently, exclude the WireGuard range:
        # - "!100.64.0.0/10"

Apply the config and the node re-registers with the correct InternalIP.
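To check whether any node is affected, you could scan the output of kubectl get nodes -o json for InternalIP addresses inside the WireGuard range. The sketch below assumes the tunnel range is 100.64.0.0/10, as in the commented exclusion above; the function name is hypothetical.

```python
import ipaddress

# Assumption: the WireGuard tunnel addresses come from this range.
WIREGUARD_RANGE = ipaddress.ip_network("100.64.0.0/10")


def nodes_with_tunnel_internal_ip(node_list: dict) -> list:
    """Return names of nodes whose InternalIP lies in the WireGuard range."""
    affected = []
    for node in node_list.get("items", []):
        for addr in node.get("status", {}).get("addresses", []):
            if addr.get("type") != "InternalIP":
                continue
            if ipaddress.ip_address(addr["address"]) in WIREGUARD_RANGE:
                affected.append(node["metadata"]["name"])
                break  # one match is enough for this node
    return affected
```

An empty result suggests kubelet picked the LAN address on every node; any name returned is a candidate for the machine-config fix.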

Command Queue

Cluster deploy, undeploy, restart, and scale operations are written to a persistent queue (PostgreSQL table cluster_commands) before being dispatched to the cluster-agent. This means:

Restart safety — if the Watchgrid server restarts mid-deploy, the command is still in the queue and is re-delivered to the cluster-agent on its next poll. No silent loss of in-flight operations.
Idempotency — duplicate requests (e.g. double-click "Deploy") for the same cluster + app + namespace are de-duplicated at enqueue time while a previous request for the same target is still pending or claimed. Each distinct operation still gets its own row once the previous one completes.
Crash recovery — commands stuck in the claimed state for more than 10 minutes are automatically reset to pending on server startup, allowing re-delivery. Cluster-agent deploys are idempotent at the Kubernetes layer (kubectl apply), so re-delivery is safe.
History — GET /api/clusters/commands?cluster_id=<id> returns the queue history for a cluster (newest first), including status (pending/claimed/done/failed), timestamps, payload, and error messages.
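The queue semantics listed above can be modelled in a few lines. This is an in-memory sketch for illustration only; the real queue lives in the cluster_commands PostgreSQL table, and the class, method, and field names here are hypothetical.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Optional

STALE_CLAIM_SECONDS = 10 * 60  # claimed for more than 10 minutes => stale


@dataclass
class Command:
    command_id: int
    cluster_id: str
    app: str
    namespace: str
    operation: str
    status: str = "pending"  # pending / claimed / done / failed
    claimed_at: Optional[float] = None


class CommandQueue:
    def __init__(self) -> None:
        self._rows = []
        self._ids = itertools.count(1)

    def enqueue(self, cluster_id, app, namespace, operation) -> Command:
        # De-duplicate while an earlier request for the same target is open.
        for row in self._rows:
            same_target = (row.cluster_id, row.app, row.namespace) == (
                cluster_id, app, namespace)
            if same_target and row.status in ("pending", "claimed"):
                return row
        row = Command(next(self._ids), cluster_id, app, namespace, operation)
        self._rows.append(row)
        return row

    def claim_next(self, cluster_id, now=None) -> Optional[Command]:
        # Called when the cluster-agent polls for work.
        now = time.time() if now is None else now
        for row in self._rows:
            if row.cluster_id == cluster_id and row.status == "pending":
                row.status, row.claimed_at = "claimed", now
                return row
        return None

    def complete(self, command_id, ok=True) -> None:
        for row in self._rows:
            if row.command_id == command_id:
                row.status = "done" if ok else "failed"
                return

    def recover_stale(self, now=None) -> int:
        # Run at server startup: reset long-claimed rows for re-delivery.
        now = time.time() if now is None else now
        reset = 0
        for row in self._rows:
            if (row.status == "claimed"
                    and now - row.claimed_at > STALE_CLAIM_SECONDS):
                row.status, row.claimed_at = "pending", None
                reset += 1
        return reset
```

In this model a double-clicked "Deploy" returns the same row twice, and a command whose claim is older than ten minutes becomes pending again after recover_stale runs.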

Every successful enqueue returns a command_id in the response body which can be correlated against the history endpoint.
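Correlating a command_id against the history endpoint might look like the sketch below. The endpoint path and the cluster_id parameter come from the text above; the response shape (a JSON array of objects with command_id and status keys), the base URL, and the function names are assumptions, and authentication is not shown.

```python
from typing import Optional
from urllib.parse import urlencode


def history_url(base_url: str, cluster_id: str) -> str:
    """Build the queue-history URL for one cluster."""
    query = urlencode({"cluster_id": cluster_id})
    return f"{base_url.rstrip('/')}/api/clusters/commands?{query}"


def find_command(history: list, command_id) -> Optional[dict]:
    """Return the history entry for command_id, or None if absent.

    Assumption: each entry is a dict carrying a "command_id" key.
    """
    for entry in history:
        if entry.get("command_id") == command_id:
            return entry
    return None
```

After fetching and parsing the response from history_url(...), find_command(history, command_id) returns the matching entry, whose status field indicates whether the operation is pending, claimed, done, or failed.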