Changelog

All notable changes to Watchgrid are documented here. Format follows Keep a Changelog.

[Unreleased]

[1.32.0] - 2026-07-20

Added

Registry-push update detection. Every app deploy now records which internal-registry image digest it shipped to which device (app_deployments ledger). A background watcher re-resolves the digest of every tracked repo:tag (default every 60s, WATCHGRID_REGISTRY_POLL_SECONDS); pushing a new image, even under the same tag such as :latest, marks the affected deployments as "update available". An optional Docker Registry notifications webhook (POST /api/registry/events, enabled by setting WATCHGRID_REGISTRY_EVENTS_TOKEN) makes detection immediate instead of waiting for the next poll.
One-click upgrade and rollback. New Deployments tab in the App Store shows every tracked deployment with its running digest versus the newest push, plus Upgrade and Roll back actions (POST /api/apps/upgrade, POST /api/apps/rollback). Upgrades redeploy with images pinned to the target digest (image:tag@sha256:...), which forces a real Kubernetes rollout on both deploy paths — direct K8s devices and cluster-agent clusters. The previous digest is kept for rollback. App Store cards show an "Update available" badge.
Auto-update per site. Next to the existing auto-deploy toggle, each app in a site's Applications tab now has an Auto-update toggle: when a new image is pushed to the registry, the app is automatically redeployed (digest-pinned) on that site's devices, with an audit-log entry per rollout. Default off.
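
The digest comparison and digest pinning described in the entries above can be sketched as follows. This is a minimal illustration, not the server's code: the Deployment struct, pinnedRef, and updatesAvailable are hypothetical names, and the real app_deployments ledger may hold different fields.

```go
package main

import "fmt"

// Deployment is a hypothetical row of the app_deployments ledger.
type Deployment struct {
	Device        string
	RepoTag       string // e.g. "registry.local/hello:latest"
	RunningDigest string // digest the device was deployed with
}

// pinnedRef builds an image reference pinned to a digest (image:tag@sha256:...).
func pinnedRef(repoTag, digest string) string {
	return repoTag + "@" + digest
}

// updatesAvailable returns the deployments whose running digest differs from
// the newest digest resolved for the same repo:tag.
func updatesAvailable(deps []Deployment, latest map[string]string) []Deployment {
	var out []Deployment
	for _, d := range deps {
		if newest, ok := latest[d.RepoTag]; ok && newest != d.RunningDigest {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	deps := []Deployment{
		{Device: "edge-1", RepoTag: "registry.local/hello:latest", RunningDigest: "sha256:aaa"},
		{Device: "edge-2", RepoTag: "registry.local/hello:latest", RunningDigest: "sha256:bbb"},
	}
	latest := map[string]string{"registry.local/hello:latest": "sha256:bbb"}
	for _, d := range updatesAvailable(deps, latest) {
		fmt.Println(d.Device, "->", pinnedRef(d.RepoTag, latest[d.RepoTag]))
	}
}
```

Because the tag alone can stay the same across pushes, the comparison keys on the digest, and the pinned reference changes whenever the digest does.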

Fixed

Site auto-deploy to regular K8s devices silently failed with 401. The internal invokeAppsInstall call bypassed the auth middleware but the handler requires claims — auto-deploy only worked for clusters. Internal server-initiated deploys now carry system claims.

[1.31.9] - 2026-07-20

Changed

All-caps text retired across the UI. Every uppercase label — eyebrow/section labels, table headers, card titles, status badges, form labels, and the entire login screen — now renders in normal case with a semibold weight instead of letter-spaced capitals. Tiny 10–11px caps labels were bumped to 12px for legibility, and the tracking-label letter-spacing tokens were removed.
One UI typeface. Space Grotesk (display font) is gone; Inter is now the single typeface for all headings, labels, and body text. IBM Plex Mono remains only for genuine data surfaces — terminal output, code/YAML blocks, keys, serials, IPs — and the in-browser terminal now uses the same IBM Plex Mono instead of an unbundled JetBrains Mono stack.
Uniform page headers across all System pages. Every page (Users, SSO, DNS, Firewall, Registry, PKI, License, Pending Approvals, Admin Devices, Audit Log, Profiles, Tenants, App Store) now uses the same header pattern as About: section eyebrow, bold title, one-line subtitle, and actions top-right. Admin Devices previously had no page header at all (plus a double-padding offset), Registry showed a duplicated title and all-caps buttons, and License had a stray icon in a card title — all normalized. Empty states now consistently use the dashed placeholder style.

Fixed

Sign-out never actually ended the session. The logout POST was rejected by the CSRF middleware (403) because the frontend sent it without the X-CSRF-Token header — the UI showed the login page but the httpOnly session cookie survived, so any reload silently signed you back in. The frontend now echoes the CSRF header on logout, and /api/auth/logout is CSRF-exempt as a fallback (it only clears the requester's own cookies; being locked into a session is worse than a forced-logout CSRF).
Confirmation dialogs were invisible when a map was on screen. Leaflet's internal panes/controls use z-index values up to 1000 in the root stacking context, painting over app overlays (confirm dialog and modals at z-50/60) — the sign-out dialog on the Dashboard was fully functional but rendered underneath the map. Leaflet is now contained in its own stacking context (isolation: isolate).

[1.31.8] - 2026-07-14

Changed

Frontend skin generalized into a design-token system. The tactical green-on-near-black look is unchanged, but it's now driven by ~20 semantic CSS-variable tokens (surfaces, borders, text, accent, status) wired into Tailwind — replacing 187 distinct hex colors and 1,260+ hardcoded [#hex] class values scattered across 48 files. The whole theme (or a future light mode / per-client re-skin) is now editable from one block of variables in src/index.css instead of thousands of call sites. The mil-* component classes were refactored onto the tokens; the empty Tailwind theme and the !important gray overrides are gone.
Typography system added. A display/body/data pairing of Space Grotesk (section titles, uppercase labels), Inter (UI/body), and IBM Plex Mono (device IDs, metrics, code) is self-hosted via @fontsource to satisfy the font-src 'self' CSP. The 11 different letter-spacing values used for uppercase labels were collapsed into one token.
Visual smoothness pass. A real elevation scale (cards/modals were flat), a single border-radius scale (was 8 ad-hoc radii), one consistent transition/hover token across interactive elements, unified focus rings for keyboard users, table-row hover, and a subtle accent glow on the active device. prefers-reduced-motion is respected.

[1.31.7] - 2026-07-13

Changed

Heartbeat DB writes are throttled to cut Postgres write volume ~6x — every device heartbeats every 5s, and each one used to rewrite the full device row (stats/location/metadata JSONB) plus the WireGuard peer row, producing a dead tuple per write and scaling linearly with fleet size (500 devices ≈ 200 writes/s → WAL/autovacuum/bloat pressure). Heartbeats now rewrite the full row at most every 30s; in between, only last_seen is bumped (a cheap primary-key update), so offline detection stays accurate at 5s resolution. The in-memory device map remains authoritative for the dashboard, so DB stats lagging by up to 30s is invisible to users.
Repository list is cached (15s TTL) to remove a per-request N+1 query: getConfiguredRepositories() ran one query per tenant and was called from ~10 app-related handlers on every request. It's now cached and invalidated whenever a repository is added, edited, removed, or synced, so last-sync status stays fresh.
GET /api/dns/records uses a read lock instead of an exclusive lock for its read-only device iteration, so it no longer blocks heartbeats.
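
The heartbeat throttling described above comes down to one decision per heartbeat: rewrite the full row, or only bump last_seen. A minimal sketch of that decision, assuming a per-device timestamp of the last full write; the throttle type, decide, and countFull are hypothetical names, and this version is not safe for concurrent use without an external lock.

```go
package main

import (
	"fmt"
	"time"
)

// fullWriteInterval mirrors the 30s figure from the entry above.
const fullWriteInterval = 30 * time.Second

// writeKind says which database write a heartbeat should trigger.
type writeKind string

const (
	fullRow      writeKind = "full-row"
	lastSeenOnly writeKind = "last-seen-only"
)

// throttle tracks when each device last had its full row rewritten.
type throttle struct {
	lastFull map[string]time.Time
}

func newThrottle() *throttle { return &throttle{lastFull: map[string]time.Time{}} }

// decide returns fullRow at most once per fullWriteInterval per device and
// lastSeenOnly for every heartbeat in between.
func (t *throttle) decide(deviceID string, now time.Time) writeKind {
	if last, ok := t.lastFull[deviceID]; ok && now.Sub(last) < fullWriteInterval {
		return lastSeenOnly
	}
	t.lastFull[deviceID] = now
	return fullRow
}

// countFull simulates n heartbeats spaced stepSeconds apart for one device
// and counts how many of them trigger a full-row write.
func countFull(n, stepSeconds int) int {
	t := newThrottle()
	start := time.Unix(0, 0)
	full := 0
	for i := 0; i < n; i++ {
		now := start.Add(time.Duration(i*stepSeconds) * time.Second)
		if t.decide("dev-1", now) == fullRow {
			full++
		}
	}
	return full
}

func main() {
	// 12 heartbeats at 5s spacing cover one minute of traffic from one device.
	fmt.Println("full-row writes:", countFull(12, 5))
}
```

With 5s heartbeats, one in every six triggers the full rewrite, which is where the roughly 6x reduction in full-row writes comes from.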

Note

Additional scale optimizations were identified and deferred to a supervised release (they touch the live heartbeat/dashboard hot paths): serving the dashboard snapshot purely from the in-memory maps (removing its per-second N+1 DB rebuild), index maps for O(1) Magic DNS lookups, and a background CPU sampler for the stats endpoint.

[1.31.6] - 2026-07-13

Fixed

Fixed an AB-BA deadlock that could hang the entire server: heartbeatHandler acquired deviceMux then wgMux, while wgApproveHandler/wgRemoveHandler acquired them in the opposite order (holding wgMux across a WireGuard command and a DB write). A device approval concurrent with a heartbeat from a not-yet-tunneled device could make the two goroutines wait on each other forever, and because both global locks stayed held, every subsequent heartbeat, registration, and dashboard load blocked too, causing a full hang that needed a restart. The approve/remove paths now update the device map in a separate critical section, so wgMux is never held while waiting for deviceMux. The deadlock was triggerable by the normal register→approve→heartbeat sequence, so it became more likely as the device count grew.
Fixed a data race that could crash the whole process — wgConfigHandler's approval-retry loop read the wgAgents map without holding wgMux while other handlers wrote it under lock. Go aborts on concurrent map read/write (fatal error, unrecoverable), so an agent polling /api/wg/config during startup while any device was being approved could kill the server. The retry read now holds the lock.
Fixed unbounded memory growth from device churn — commandNotify and osUpgradeState (keyed by device ID) were never pruned, so every register→delete cycle leaked a map entry (and, for commandNotify, a long-poll goroutine) forever. Both are now evicted when a device is deleted.
Fixed unsynchronized access to terminal-session state — the agent and user WebSockets attach in separate goroutines; agentConn/userConn/cancel/latestResize are now guarded by a mutex and the bridge setup is atomic, preventing a data race and a potential double-reader panic on a session's WebSocket.
WireGuard admin/list and generate-IP no longer hold wgMux across the HTTP response write — a slow client could otherwise stall every WireGuard operation; the response is now built under the lock and encoded after releasing it.
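
The lock-ordering fix for the AB-BA deadlock above can be illustrated with two mutexes. This is a simplified sketch, not the handlers themselves: heartbeat, approve, and the two maps are stand-ins, and only the mutex names are taken from the entry.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	deviceMux sync.Mutex
	wgMux     sync.Mutex
	devices   = map[string]bool{} // device ID -> approved
	peers     = map[string]bool{} // device ID -> has WireGuard peer
)

// heartbeat takes deviceMux and then wgMux, the order the entry above
// attributes to heartbeatHandler.
func heartbeat(id string) {
	deviceMux.Lock()
	defer deviceMux.Unlock()
	wgMux.Lock()
	defer wgMux.Unlock()
	_ = devices[id] && peers[id]
}

// approve never holds both locks at once: the peer update and the device-map
// update run in two separate critical sections.
func approve(id string) {
	wgMux.Lock()
	peers[id] = true
	wgMux.Unlock()

	deviceMux.Lock()
	devices[id] = true
	deviceMux.Unlock()
}

// runConcurrently exercises both paths from separate goroutines and returns
// once all of them have finished.
func runConcurrently(rounds int) int {
	var wg sync.WaitGroup
	for i := 0; i < rounds; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); heartbeat("dev-1") }()
		go func() { defer wg.Done(); approve("dev-1") }()
	}
	wg.Wait()
	return rounds
}

func main() {
	fmt.Println("completed rounds:", runConcurrently(1000))
}
```

If approve instead took wgMux and then deviceMux while still holding wgMux, the two goroutines could each hold one lock and wait for the other, which is the hang the entry describes.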

[1.31.5] - 2026-07-13

Security

Device and app endpoints now enforce tenant isolation — deleteDeviceHandler, debugDeviceHandler, appsInstallHandler, appsUninstallHandler, appsLabelHandler, and both app-config handlers acted on a caller-supplied device_id without checking the caller had access to that device's tenant, so any authenticated user could delete, deploy/uninstall/relabel, or read/write app config on another tenant's device. All now go through a shared authorizeDeviceAccess helper (super-admins excepted). App-install with no device_id no longer auto-selects a device from another tenant.
User create/delete now require an admin role — usersCreateHandler had no minimum-role gate (a viewer could create an operator and log in as it) and usersDeleteHandler had no role or tenant check (any authenticated user could delete any non-admin account in any tenant). Both now require tenant-admin/super-admin, and delete additionally requires access to the target user's tenant.
JWT issuance fails closed for unknown users — generateJWT defaulted an unrecognised username to tenant-admin/tenant-default, so a deleted account whose token was refreshed was silently re-issued as tenant-admin. It now errors for any username that is neither a known user nor the bootstrap admin, closing the escalation while leaving normal login and the bootstrap admin unaffected.
Registry token endpoint is now rate-limited — /v2/token verified credentials with bcrypt and issued a full session JWT with no rate limiting, bypassing the login limiter entirely and serving as an unthrottled password oracle. It is now bounded per IP (generous enough for docker login/pull).
Forwarding headers are only trusted from trusted proxies — X-Forwarded-For/X-Real-IP were trusted unconditionally, letting a client spoof its apparent IP to defeat rate limiting and session IP-binding. They are now honored only when the immediate peer is a trusted proxy (loopback/private ranges by default, overridable via TRUSTED_PROXY_CIDRS); the real client is taken as the first non-proxy hop from the right. Direct connections use the socket address.
Removed the unused Docker socket mount from the server — the server container mounted /var/run/docker.sock (read-write in dev, read-only in prod) though nothing in the server uses the Docker API (only the registry HTTP API), giving any server compromise a path to host root. The mount and the docker-cli package are removed from the server. (Traefik's socket mount, a separate conventional concern, is unchanged.)
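
The forwarding-header rule above (trust X-Forwarded-For only from a trusted peer, then take the first non-proxy hop from the right) can be sketched like this. The function names and the exact default CIDR list are assumptions; the entry only says "loopback/private ranges".

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// mustCIDRs parses CIDR strings and panics on a malformed entry.
func mustCIDRs(cidrs ...string) []*net.IPNet {
	var nets []*net.IPNet
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}
	return nets
}

// defaultTrusted is an assumed reading of "loopback/private ranges".
var defaultTrusted = mustCIDRs("127.0.0.0/8", "::1/128", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")

// isTrusted reports whether ip falls inside any trusted proxy range.
func isTrusted(ip net.IP, trusted []*net.IPNet) bool {
	for _, n := range trusted {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// clientIP returns the address to use for rate limiting and IP binding.
// remoteAddr is the socket peer ("host:port"); xff is the raw X-Forwarded-For value.
func clientIP(remoteAddr, xff string, trusted []*net.IPNet) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	peer := net.ParseIP(host)
	if peer == nil || !isTrusted(peer, trusted) {
		return host // direct connection: ignore forwarding headers
	}
	hops := strings.Split(xff, ",")
	for i := len(hops) - 1; i >= 0; i-- {
		ip := net.ParseIP(strings.TrimSpace(hops[i]))
		if ip == nil {
			break // malformed or empty entry: stop trusting the header
		}
		if !isTrusted(ip, trusted) {
			return ip.String()
		}
	}
	return host
}

func main() {
	fmt.Println(clientIP("10.0.0.5:4321", "203.0.113.7, 10.0.0.9", defaultTrusted))
	fmt.Println(clientIP("198.51.100.2:4321", "203.0.113.7", defaultTrusted))
}
```

Walking from the right matters: anything a client prepends to the header sits to the left of the hop the trusted proxy appended, so a spoofed leading entry is never reached.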

Note

The unauthenticated agent command channel (/api/commands/, /api/device-exists, /api/commandresult) is a known remaining item; authenticating it requires a coordinated agent-side change and is deferred to a dedicated release.

[1.31.4] - 2026-07-13

Security

Git repository sync now validates the repository name, URL, branch and path. The git sync path previously accepted these fields unchecked (the Helm path already validated them). Repository URLs are now restricted to safe transports (https, http, ssh, git, and scp-style git@host:path); git's command-executing ext:: transport, the file:// and fd:: local transports, and option injection via a leading - are rejected. Repository name, branch and path are checked for path traversal (.., absolute/leading-separator paths) and option injection. Validation runs both at create/edit time (clear 400) and inside syncGitRepository as a defense-in-depth choke point covering the scheduler and manual sync. Existing repositories with normal names/URLs are unaffected.
Helm install/uninstall no longer interpolate unvalidated values into the on-device command — the app-deploy path built a shell command string from the target namespace, release name and chart reference. These are now validated as RFC1123 labels / chart references (no shell metacharacters possible) before the command is assembled, closing a command-injection vector on managed devices.
/api/setcommand now enforces tenant isolation — the endpoint queued a command for any device ID without checking the caller's access, so any authenticated user could run commands on a device in another tenant. It now verifies the caller has access to the target device's tenant (super-admins excepted) via a shared authorizeDeviceAccess helper, and rejects malformed request bodies. Legitimate same-tenant use is unchanged.
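
The repository URL rules listed above can be approximated with an allow-list check. This is a sketch under assumptions: validateRepoURL and both regular expressions are illustrative, and the server's actual validation may be stricter or differ in detail.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// helperSyntax matches git's "<transport>::<address>" form (ext::, fd::, ...).
var helperSyntax = regexp.MustCompile(`^[A-Za-z0-9+.-]+::`)

// scpStyle matches scp-style remotes such as git@host:path (assumed pattern).
var scpStyle = regexp.MustCompile(`^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+:[^:]+$`)

// allowedSchemes lists the transports the entry above names as safe.
var allowedSchemes = map[string]bool{"https": true, "http": true, "ssh": true, "git": true}

// validateRepoURL rejects option injection, transport helpers, and any scheme
// outside the allow-list (which excludes file://).
func validateRepoURL(raw string) error {
	if raw == "" {
		return errors.New("empty repository URL")
	}
	if strings.HasPrefix(raw, "-") {
		return errors.New("repository URL must not start with '-'")
	}
	if helperSyntax.MatchString(raw) {
		return errors.New("transport helper syntax is not allowed")
	}
	if scpStyle.MatchString(raw) {
		return nil
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("unparseable repository URL: %w", err)
	}
	if !allowedSchemes[u.Scheme] || u.Host == "" {
		return fmt.Errorf("transport %q is not allowed", u.Scheme)
	}
	return nil
}

func main() {
	for _, u := range []string{
		"https://example.com/apps.git",
		"git@example.com:org/apps.git",
		"ext::sh -c id",
		"file:///srv/repo",
		"--upload-pack=x",
	} {
		fmt.Printf("%q -> %v\n", u, validateRepoURL(u))
	}
}
```

An allow-list fails closed: a transport that is not explicitly named is rejected, so new or obscure git transports do not need to be enumerated.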

[1.31.3] - 2026-07-13

Fixed

Repository auto-sync died with cannot fork(): Resource temporarily unavailable after ~2 weeks of uptime — modern git (2.46+) spawns a detached background git maintenance child after every pull. Inside the server container the Go server runs as PID 1 and never reaps orphans, so each auto-sync leaked one zombie git process (~288/day at the default 5-minute interval) until the container hit its pids cgroup ceiling (systemd's default TasksMax, ~4.5k on a 4 GB host) — at which point every fork in the container failed: repo syncs, SSH certificate signing (ssh-keygen), and firewall updates (iptables). Git syncs now run with gc.autoDetach=false / maintenance.autoDetach=false so maintenance stays a foreground child that git itself reaps, and the server service sets init: true in all compose files so docker-init (as PID 1) reaps any future orphans. A container restart clears accumulated zombies on already-affected installs.
A hung repo sync can no longer stall the sync scheduler — git and helm sync commands now run under a 5-minute timeout; previously a stuck network call blocked the (sequential) scheduler indefinitely.
scripts/changelog-entry.sh failed on macOS and inserted entries in the wrong place — BSD awk rejects multi-line -v values, and the insertion anchor (second ---) pointed into the middle of the history. Entries are now inserted directly under ## [Unreleased] portably. docs/changelog.md had silently gone stale since April because of this; it is re-synced and the copy step is kept.
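
The two git sync changes above (foreground maintenance and a 5-minute timeout) can be combined in one call. A minimal sketch, assuming a plain git pull; gitPullArgs and syncRepo are hypothetical names, and the real sync may pass additional arguments.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// syncTimeout mirrors the 5-minute limit from the entry above.
const syncTimeout = 5 * time.Minute

// gitPullArgs returns the arguments for a pull with background maintenance
// kept in the foreground, so git reaps its own child processes.
func gitPullArgs() []string {
	return []string{
		"-c", "gc.autoDetach=false",
		"-c", "maintenance.autoDetach=false",
		"pull",
	}
}

// syncRepo runs the pull in dir and kills it if it exceeds syncTimeout.
func syncRepo(parent context.Context, dir string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(parent, syncTimeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "git", gitPullArgs()...)
	cmd.Dir = dir
	return cmd.CombinedOutput()
}

func main() {
	fmt.Println("git", gitPullArgs())
	// syncRepo(context.Background(), "/path/to/checkout") would run the pull.
}
```

Passing the settings with -c scopes them to this one invocation, so the checkout's own git config does not need to be modified.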

[1.31.2] - 2026-06-16

Fixed

Selected site still wasn't preserved (follow-up to #74) — the previous fix didn't take because, during the initial load, siteGroups is just the "unassigned" placeholder and the URL-sync effect pinned ?site=__unassigned__ before the real sites arrived — so the remembered site was never restored and the persist effect overwrote it with unassigned. Both effects now wait until sites have finished loading (loading guard), so the last-selected site is correctly restored on navigation and refresh.
Repository Edit form now matches the Add form — Edit was a floating modal with a different field set/order than the inline Add panel. Both forms now render the same inline panel via a shared RepoFormFields component (identical fields, grouping, and buttons); Edit only differs where it must (locked name/type, "stored secret" hints, the Auto-sync-enabled toggle). Opening Add or Edit closes the other.

[1.31.1] - 2026-06-16

Changed

Device types renamed to deployment-target terms (#73) — "Agent" is now Host (🖥️) and "Cluster" is now Kubernetes Cluster (☸️), reflecting what users manage rather than how Watchgrid connects. Type labels, icons, and the Devices filter (All / Hosts / Kubernetes Clusters) were updated, and a host's detail panel now shows agent connectivity as a capability (Management: WatchGrid Agent Connected) instead of the primary type. Underlying device_type values are unchanged.

Fixed

Selected site was lost when navigating away and back (#74) — the active site lives in the URL, so leaving the Sites workspace and returning (or a route change without the ?site= param) snapped back to the first/unassigned site. The last-selected site is now remembered (localStorage) and restored when no site is in the URL, so users stay in their chosen site context across navigation and refresh.

[1.31.0] - 2026-06-16

Added

Automatic + manual sync and editing for Git repositories (#71) — repositories now sync automatically on a configurable interval (migration 024, default 5 minutes, per repository). A background scheduler (startRepoSyncScheduler, checked every minute) syncs each enabled, non-local repo whose interval has elapsed and records the outcome. New Edit action (PUT /api/repositories/{name}) lets you change URL, branch, path, credentials (blank keeps the stored secret), sync interval, and enabled (disable to pause auto-sync). The create form gained a sync-interval field, and each repo row shows its auto-sync interval, last sync time, and the last sync status/error. The existing Sync button is an immediate force-sync. Sync results (status + error) are persisted (last_sync_status, last_sync_error).

Fixed

"Deploy To Site" hung on "Loading application status…" (#72) — siteAppsStatusHandler fetched each Kubernetes host's app status sequentially, and a single slow/unreachable host (a kubeconfig shell round-trip can take up to 30s) stalled the whole screen. The per-host fetch now runs in parallel (bounded to 6 at a time) with a 15s per-host timeout, so the aggregate is as slow as the slowest single host, not their sum. The UI also gained a clear error state with a Retry button (instead of silently showing nothing), and the loading message explains the per-host query.

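The bounded parallel fetch from the #72 fix above can be sketched with a buffered channel used as a semaphore and a per-host context timeout. The limits come from the entry; fetchFunc, fetchAll, and demoFetch are hypothetical names standing in for the real per-host status query.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

const (
	maxParallel    = 6                // concurrent per-host fetches
	perHostTimeout = 15 * time.Second // budget for a single host
)

// fetchFunc is a hypothetical per-host status query.
type fetchFunc func(ctx context.Context, host string) (string, error)

// fetchAll queries every host with at most maxParallel requests in flight and
// gives each request its own perHostTimeout.
func fetchAll(hosts []string, fetch fetchFunc) map[string]string {
	results := make(map[string]string, len(hosts))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, maxParallel)

	for _, h := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it

			ctx, cancel := context.WithTimeout(context.Background(), perHostTimeout)
			defer cancel()

			status, err := fetch(ctx, host)
			if err != nil {
				status = "error: " + err.Error()
			}
			mu.Lock()
			results[host] = status
			mu.Unlock()
		}(h)
	}
	wg.Wait()
	return results
}

// demoFetch stands in for the real query; it fails for one host name.
func demoFetch(ctx context.Context, host string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	if host == "unreachable" {
		return "", fmt.Errorf("no route to %s", host)
	}
	return "ok", nil
}

func main() {
	for host, status := range fetchAll([]string{"k8s-a", "unreachable", "k8s-b"}, demoFetch) {
		fmt.Println(host, status)
	}
}
```

A failing or slow host only affects its own entry in the result, which lets the UI show a per-host error while the other hosts still report their status.
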
[1.30.0] - 2026-06-16

Changed

Clusters and agent devices are unified under Sites → Devices (#69) — a cluster is also a device type, so the separate Clusters tab is gone. The Devices tab now lists both agent devices and clusters together, each with a type label (Agent / Cluster), and a new All / Agents / Clusters filter lets you narrow by type. All cluster functionality (node inventory, applications, K8s management, delete) is preserved inline; cluster onboarding stays in the site Onboard To Site modal. Legacy ?tab=clusters URLs redirect to the Devices tab.
Session timeout is now sliding (inactivity-based) (#70) — instead of a hard ~60-minute cap that logged users out mid-work, the session now extends automatically while you're active. requireAuth re-issues the token (refreshed cookie + X-Refreshed-Token header) once it passes half its lifetime, and the frontend adopts it and slides its logout timer forward. A session only expires after a full inactivity window (the existing session_ttl_minutes setting, default 60) with no requests. The setting is relabelled "Inactivity timeout" under System → Users. IP binding and the existing TTL knob are unchanged; the refreshed token preserves the original IP binding.
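
The refresh rule in the sliding-session entry above (re-issue once the token passes half its lifetime) reduces to a small comparison. A sketch with Unix-second timestamps; shouldRefresh is a hypothetical helper, not the requireAuth code.

```go
package main

import "fmt"

// shouldRefresh reports whether a token is past half its lifetime and still
// valid. All arguments are Unix timestamps in seconds.
func shouldRefresh(issuedAt, expiresAt, now int64) bool {
	if now >= expiresAt {
		return false // already expired: the caller must log in again
	}
	half := issuedAt + (expiresAt-issuedAt)/2
	return now >= half
}

func main() {
	const ttl = 60 * 60 // default session_ttl_minutes of 60, in seconds
	fmt.Println(shouldRefresh(0, ttl, 10*60)) // early in the session
	fmt.Println(shouldRefresh(0, ttl, 40*60)) // past the halfway point
}
```

Refreshing only in the second half of the lifetime avoids minting a new token on every request while still extending any session that stays active.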

[1.29.4] - 2026-06-16

Fixed

Server spammed clusters with unsupported shell commands — several server helpers (executeK8sCommand: kubeconfig reads, lsusb inventory, Helm) queue a shell command on the legacy agent command channel, which only regular agents understand. When invoked against a cluster device, the cluster-agent logged Command shell failed: unknown command type: shell on a loop and the caller stalled for the 30s timeout. executeK8sCommand now refuses cluster devices up front (they use the structured cluster command queue), eliminating the log noise and the stalls.

[1.29.3] - 2026-06-16

Fixed

Deleted cluster apps stayed visible in the UI (cluster-agent → 1.24.3) — the cluster-agent only included exposed_apps in its heartbeat when it had at least one proxy route (if len(routes) > 0). Deleting the last app meant exposed_apps was omitted entirely, and the server (if requestData.ExposedApps != nil) kept the stale list — so /api/clusters/apps still returned the removed app and it lingered in the cluster's Applications tab. The cluster-agent now always sends exposed_apps (an empty array when there are no routes), so deletes clear the server's view and the UI. This is safe because rediscoverProxyRoutes rebuilds routes at startup before the heartbeat loop, so a restart never reports a spurious empty set while apps are still deployed.
Deleting an already-removed cluster app reported a failure (cluster-agent → 1.24.3) — follow-up to the delete fix in 1.28.2. When a delete command matched no resources (e.g. a duplicate delete after the app was already gone), handleDelete returned an error, which surfaced in the UI as "delete failed" even though the app was correctly absent. Delete is now idempotent: a successful List that matches nothing returns success ("already absent"). Genuine Kubernetes API errors (failed List/Delete) are still reported as errors.
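
The exposed_apps fix above depends on the difference between an omitted field and an empty array in the heartbeat JSON. A minimal sketch of the sending side; the heartbeat struct here is a cut-down, hypothetical payload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeat is a cut-down, hypothetical heartbeat payload.
type heartbeat struct {
	ExposedApps []string `json:"exposed_apps"` // no omitempty: always present
}

// heartbeatJSON encodes the payload, normalising a nil slice to an empty one
// so the field is sent as [] rather than null.
func heartbeatJSON(routes []string) string {
	if routes == nil {
		routes = []string{}
	}
	b, err := json.Marshal(heartbeat{ExposedApps: routes})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(heartbeatJSON(nil))
	fmt.Println(heartbeatJSON([]string{"hello-web"}))
}
```

On the receiving side, encoding/json decodes [] into a non-nil empty slice, so a nil check like the one quoted in the entry passes and the stale list is replaced.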

[1.29.2] - 2026-06-16

Fixed

Cluster node InternalIP could become the WireGuard IP (kubectl slow) (#68, cluster-agent → 1.24.2) — the cluster-agent runs with hostNetwork, so its wg0 interface lives in the host network namespace. configureTunInterface assigned the tunnel IP (100.64.x) with the default global scope, which Talos/kubelet node-IP auto-detection could adopt as the node's InternalIP instead of the real LAN address — breaking in-cluster routing and slowing kubectl on that node. The tunnel IP is now assigned with scope link (excluded from node-IP selection but still fully usable for the mesh — the WireGuard handshake runs over the host's real interface, and local delivery is scope-independent), and the VPN subnet route is added with a high metric so it never competes with primary routes. Note: the authoritative fix on Talos is to pin the kubelet node IP in the machine config (machine.kubelet.nodeIP.validSubnets, excluding 100.64.0.0/10) — see docs. Validate on the cluster after upgrading the cluster-agent.

[1.29.1] - 2026-06-16

Changed

Site management consolidated into the Sites workspace (#65) — the standalone Site Management page (/sites) and its nav entry are gone. Creating, editing, and deleting sites now happens directly in the Sites workspace (/inventory): a + button next to the site list creates a site, and hovering a site reveals ✎ (edit) and ✕ (delete) actions. The create/edit form (name, slug, description, status, location, labels) moved into the workspace's left site column, so all site-related actions — management, device assignment, app deploy, profiles — live in one place. Sites.jsx was removed; the site API (/api/sites) is unchanged.

[1.29.0] - 2026-06-16

Added

Devices inherit their Site's location on enrollment (#63) — when a device (or cluster) is first enrolled into a Site that has a location set, it now adopts that location as its initial value (source: "site"), so it lands on the map automatically instead of needing manual entry. Applies in both registration paths (wgRegisterHandler and registerHandler). Manual overrides (source: "manual") are never touched, an existing device's location is preserved on re-registration (this also fixes a latent bug where re-registration wiped the in-memory location), and a real agent-reported location still refines the inherited value. Site location changes do not propagate to devices, so manual device locations are safe.
Last sync time per App Store repository (#66) — repositories now record when they were last synced (last_sync, migration 023) and the App Store shows "Last sync: …" per repository. The built-in local catalog is synthetic and not tracked.
Edit a deployed cluster app (#67) — deployed apps on a cluster (Inventory → cluster → Applications) now have an Edit action (shown when the app has config fields) that pre-fills the saved config and redeploys with the edited values. Deploy config is persisted per cluster (reusing the app_configs table keyed by cluster_id), exposed via the new GET /api/clusters/config, and edits redeploy through the existing /api/clusters/deploy.

[1.28.3] - 2026-06-16

Fixed

Could not deploy apps to an onboarded Kubernetes cluster from the site "deploy to site" action — siteAppsInstallHandler (and the auto-deploy-on-enrollment path) skipped any device where K8sEnabled was false and routed eligible ones through /api/apps/install, which targets a device's own kubectl. Clusters report their state via k8s_info (not the K8sEnabled flag, which stays false for device_type=cluster) and deploy through the cluster command queue (/api/clusters/deploy), so site deploys silently skipped clusters entirely. Site app deploys now treat clusters as eligible and route them to the cluster deploy path: a new deploySiteAppToDevice helper sends device_type=cluster deploys via invokeClusterDeploy (cluster command queue) and everything else via invokeAppsInstall, used by both the manual "deploy to site" and the auto-deploy-on-enrollment flows. (Site-level status aggregation still reads only regular devices; a cluster's deployed apps remain visible in its own cluster row — broader site status for clusters is a separate follow-up.)

[1.28.2] - 2026-06-16

Fixed

Deleting a cluster app reported success but left the workload running (cluster-agent → 1.24.1) — handleDelete selected resources with LabelSelector: app=<AppName>, but AppName is the display name (e.g. Hello Web). That's not a valid Kubernetes label value, so the List call errored; the code did if err == nil { delete }, silently swallowed the error, deleted nothing, and still returned Deleted app …. The UI then sat on "deleting" because the Deployment/pod never went away. Fixed by matching resources the way the server labels them: the app name is sanitized (Hello Web → hello-web) and resources are matched client-side on app.watchgrid.io/name (or the app label, or the resource name) instead of via an invalid selector. List/delete errors are now surfaced, and a delete that matches nothing returns an explicit error instead of a false success.
Built-in "local" app repository disappeared after adding a git repository — getConfiguredRepositories returned only database-backed repos as soon as the DB was non-empty; the built-in local catalog exists only in code (the fallback path), so adding the first git repo made it vanish from the App Store. The synthetic local repo is now always included (prepended) whenever it isn't already present, regardless of how many git repos are configured.
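
The name sanitizing and client-side matching from the cluster-app delete fix above might look like the sketch below. The exact sanitizing rules are an assumption (the entry only gives Hello Web → hello-web), and sanitizeAppName and matchesApp are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonLabelChars matches every run of characters outside a-z and 0-9.
var nonLabelChars = regexp.MustCompile(`[^a-z0-9]+`)

// sanitizeAppName lower-cases a display name and collapses every run of other
// characters into a single hyphen, trimming hyphens from both ends.
func sanitizeAppName(display string) string {
	s := nonLabelChars.ReplaceAllString(strings.ToLower(display), "-")
	return strings.Trim(s, "-")
}

// matchesApp reports whether a resource's labels or name identify the app.
func matchesApp(resourceName string, labels map[string]string, display string) bool {
	want := sanitizeAppName(display)
	return labels["app.watchgrid.io/name"] == want ||
		labels["app"] == want ||
		resourceName == want
}

func main() {
	fmt.Println(sanitizeAppName("Hello Web"))
	labels := map[string]string{"app.watchgrid.io/name": "hello-web"}
	fmt.Println(matchesApp("hello-web-7d9f", labels, "Hello Web"))
}
```

Matching on the client side avoids passing a display name with spaces into a label selector, which is what made the original List call fail.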

[1.28.1] - 2026-06-16

Fixed

Clusters could not be deleted from the UI (#64) — the cluster detail view (ClusterRow) had no working delete control: it imported useConfirm but never rendered a delete button, so there was no way to remove a cluster. Added a Delete Cluster button in the cluster detail header that opens a confirmation dialog, calls DELETE /api/devices/delete, shows a success/error toast, collapses the row, and refreshes the list — the same pattern used elsewhere (e.g. admin devices). On the server, deleteDeviceHandler now also drops the cluster's proxy DNS records: cluster_commands/cluster_apps and the WireGuard peer already cascade on device delete, but custom_dns_records has no FK to devices, so the cluster's exposed-app hostnames are now explicitly cleaned (via syncClusterDNSRecords with an empty app set) to avoid orphaned .wg records pointing at a deleted cluster's tunnel IP. Workloads already running inside the cluster are unaffected.

[1.28.0] - 2026-06-16

Added

WireGuard over HTTPS/443 (opt-in, cluster-agent) — clusters on networks that block outbound UDP can now join the mesh by tunneling WireGuard inside a TLS/WebSocket connection on port 443, the same way registration already works. The WireGuard cryptography is unchanged; only the packet transport moves from raw UDP 51820 to UDP-over-WSS, which traverses corporate firewalls as ordinary HTTPS. (Pattern: Tailscale DERP / Talos "WG over HTTP2" / wstunnel.)
Server: new relay endpoint GET /api/wg/tunnel (WebSocket) in server/wg_tunnel.go — each binary frame is one WireGuard datagram, bridged to the server's local wg0 socket (127.0.0.1:51820). No bearer auth: WireGuard's static-key crypto authenticates end-to-end and the relay only ever forwards to localhost. Rides the existing Caddy/nginx /api/* WebSocket path (the same one the terminal uses), so no reverse-proxy changes are needed.
cluster-agent (bumped to 1.24.0): new wsBind (cluster-agent/ws_bind.go) implementing wireguard-go's conn.Bind over a WebSocket to wss://<server>/api/wg/tunnel, with automatic reconnect-with-backoff. Enabled per cluster via WATCHGRID_WG_OVER_HTTPS=true; the TUN MTU is lowered to 1280 in this mode to leave headroom for WS/TLS/TCP framing.
Onboarding UI: a "Tunnel WireGuard over HTTPS (443)" checkbox in the cluster-manifest section (Inventory → site onboarding) sets wg_over_https on POST /api/clusters/provision, which emits WATCHGRID_WG_OVER_HTTPS: "true" into the generated manifest's ConfigMap.
Default OFF — existing clusters and the regular/service agents are completely unaffected; raw UDP remains the default. Regular agent/service-agent (kernel wg-quick) over 443 is a planned Phase 2. Note: tunneling WireGuard's UDP inside TCP/TLS incurs the usual TCP-over-TCP overhead — fine for management traffic, which is the intended use.

[1.27.2] - 2026-06-16

Fixed

Cluster (and any WireGuard-first) device could get permanently locked out of its own tenant with 403 Device is locked to original tenant — wgRegisterHandler rebuilt both the database and in-memory Device records without carrying OriginalTenantID. dbSaveDevice backfills the empty original on the DB copy (→ tenant-default), but the in-memory map kept the empty value, desyncing the two. On the subsequent /api/register call, the tenant-switch guard compares tenant.ID != existingDevice.OriginalTenantID; with an empty original that is always true, so a routine re-registration to the device's own tenant was flagged as a hostile tenant switch and — because the device was tenant-locked — rejected with a 403. The cluster-agent then loops on "Failed to register: register returned status 403" and can never onboard. Manifested specifically on freshly deployed servers where a cluster is onboarded for the first time after the WireGuard handshake. Two-sided fix: (1) wgRegisterHandler now computes originalTenantID (preserved from the existing device, or the token-resolved tenant for a brand-new one) and writes it into both the DB and in-memory Device structs, so the records never diverge; (2) registerHandler's tenant-switch guard now treats an empty in-memory OriginalTenantID as suspect — it recovers the authoritative value from the database, and only if that is also empty does it treat the token-bearing registration as the binding event rather than a switch. This both prevents the lockout and self-heals devices already stuck in the desynced state on their next registration. The security property is unchanged: a device whose recorded original tenant differs from the token's tenant is still blocked.

[1.27.1] - 2026-06-16

Fixed

Cluster onboarding manifest was missing the cluster-agent Secret for one boot after a fresh deploy: clusterProvisionHandler read the tenant's onboarding token from the in-memory tenants cache, but on a brand-new deployment ensureDefaultTenantToken() mints that token after loadTenants() has already cached an empty value, and the token-minting paths only persisted to the database without refreshing the cache. The stale empty token made the manifest template's {{- if .OnboardingToken}} guard drop the entire Secret resource and the WATCHGRID_ONBOARDING_TOKEN env mapping, so a freshly deployed server produced an incomplete manifest until it was restarted (an already-running server with an established tenant was unaffected because its token had long since been loaded from the database). Two-sided fix: (1) clusterProvisionHandler now resolves the token from the database via dbGetTenant and backfills + persists one if the tenant somehow still has none, so the manifest always carries the Secret, which self-heals existing fresh deploys without a restart; (2) a new cacheTenant() helper refreshes the in-memory map after every token mutation (ensureDefaultTenantToken, the getOnboardingInfoHandler backfill, regenerateOnboardingTokenHandler, and createTenantHandler), so all readers stay consistent.

[1.27.0] - 2026-05-18

Added

System → About page (/system/about) consolidates versions, control-plane endpoints, container CPU/memory, Postgres health, and live tenant counts into a single screen. CPU and memory are sourced from cgroup v2 (/sys/fs/cgroup/cpu.stat + memory.current / memory.max) with cgroup v1 and host /proc fallbacks, so the percentages reflect the actual server container — not the host VM. Database card runs a 2 s PingContext, surfaces ping latency, Postgres version, the live pg_stat_activity connection count for current_database(), and the sql/pgx pool's open/in-use/idle counters. The Resources card auto-refreshes every 5 s; the rest of the page reloads on demand via the Refresh button. New endpoint GET /api/system/stats powers the card and is documented in Swagger.
Configurable session timeout + IP-bound sessions — new system_settings table (migration 022) stores session_ttl_minutes (default 60 minutes, was hardcoded at 24 hours) and session_bind_ip (default on). Admins manage both from a new "Session" card at the top of System → Users. JWTs now carry an ip claim derived from X-Forwarded-For / X-Real-IP / RemoteAddr; requireAuth rejects requests whose client IP differs from the claim when IP binding is on. Tokens minted before this release have an empty IP claim and are grandfathered through the IP check so existing sessions aren't yanked at deploy time. Auth cookie Max-Age and the registry-token expires_in now both follow the configured TTL. Backed by GET /api/system/settings (any authed user) and PUT /api/system/settings (admin only) with a 1-minute in-memory cache. Registry tokens are deliberately not IP-bound — docker daemon traffic comes over a separate connection.
Per-repository sync feedback in App Store — Sync All now fires every repository in parallel via Promise.all, and each repository row shows its own inline state (Syncing… with a pulsing dot, Synced at HH:MM:SS in green, Sync failed: <reason> in red) instead of a single blocking browser alert at the end. The aggregate toast tells you "synced N repositories" or "X of N failed — see row for details."
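
The IP-bound session check from the session-timeout entry above can be sketched as two small helpers. This is an illustration under assumptions, not the requireAuth implementation: the helper names and the header precedence (X-Forwarded-For, then X-Real-IP, then RemoteAddr) are guesses. Trusting these headers is only reasonable when the server sits behind a proxy that sets them.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
)

// clientIPFrom picks the client address from proxy headers, falling back to
// the socket address. The precedence order is an assumption of this sketch.
func clientIPFrom(xForwardedFor, xRealIP, remoteAddr string) string {
	if xForwardedFor != "" {
		// The left-most entry is the address the first proxy saw.
		first, _, _ := strings.Cut(xForwardedFor, ",")
		return strings.TrimSpace(first)
	}
	if xRealIP != "" {
		return strings.TrimSpace(xRealIP)
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // no port present; use the value as-is
	}
	return host
}

// clientIP applies clientIPFrom to an incoming request.
func clientIP(r *http.Request) string {
	return clientIPFrom(r.Header.Get("X-Forwarded-For"), r.Header.Get("X-Real-IP"), r.RemoteAddr)
}

// ipAllowed reports whether a token may be used from requestIP. Tokens with
// an empty ip claim (minted before IP binding existed) are grandfathered.
func ipAllowed(bindIP bool, claimIP, requestIP string) bool {
	if !bindIP || claimIP == "" {
		return true
	}
	return claimIP == requestIP
}

func main() {
	req, err := http.NewRequest(http.MethodGet, "http://watchgrid.example/", nil)
	if err != nil {
		panic(err)
	}
	req.RemoteAddr = "10.0.0.5:41234"
	req.Header.Set("X-Forwarded-For", "203.0.113.7, 10.0.0.5")

	ip := clientIP(req)
	fmt.Println(ip, ipAllowed(true, "203.0.113.7", ip), ipAllowed(true, "198.51.100.9", ip))
}
```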

Changed

SSH-key repository auth is now stored in the database, not as a filesystem path — the Add Repository form's "SSH Key Path" single-line input is replaced with a "SSH Private Key" textarea that accepts the PEM key contents directly. The server writes the value to a per-sync os.CreateTemp file with 0600 permissions, hands the path to git via GIT_SSH_COMMAND=ssh -i <tmp> -o IdentitiesOnly=yes -o StrictHostKeyChecking=no, and defers the file removal. The repository list/create API responses now run through a redactedRepo helper that blanks password and ssh_key and emits has_password / has_ssh_key booleans — secrets that the API previously echoed back to every caller are no longer disclosed.
Notifications are uniformly toasts — the inline mil-banner-success / mil-banner-error blocks and native alert() / confirm() dialogs were swept out of DNS, SSO, PKI, App Store, Audit Log, Inventory, License Management, Firewall, Tenants, User Management, Registry Manager, K8s Device Panel, App Config Modal, Host Row, and Cluster Row in favour of the existing ToastProvider API and ConfirmProvider modal. Persistent page-state banners (initial-load errors, license-summary status card, permission-restriction notices, delete-confirm modal warnings) were intentionally left in place — those are page state, not transient notifications. Pre-auth flows (Login, 2FA setup, onboarding wizard) keep their inline banners since there's no ToastProvider mounted before login.
Sidebar redesign — the Hosts Overview page header lost its Sites eyebrow so it matches the flat title pattern used everywhere else (DNS Management, Single Sign-On, User Management, etc.). The main left sidebar is narrower (w-56 → w-52), with tighter row padding (px-6 py-3 → px-5 py-2.5) and a smaller logo cell; system submenu items use px-8 py-2. Every nav label has whitespace-nowrap so long items like "Site Management" and "Pending Approvals" never wrap. The inner Sites column (visible on /inventory) is also narrower (w-60 → w-48), drops the "List" subheading, replaces the rounded "card" treatment with the flat left-accent-bar style used in the main nav, and truncates long site names. The Sites column no longer appears on /sites (Site Management) where it was redundant with the page content.
Sites workspace collapse arrow looks like a macOS sidebar toggle — the chevron-in-a-bordered-pill (|<|) was reading as "something hidden between two vertical lines" (h/t Joël for the feedback). Now a borderless SF-Symbols-style sidebar icon (rounded rect + inner divider) with a subtle green pane fill when expanded, and a hover-only background. The toggle column lost its border-r so it no longer visually frames the icon.
Version stickers moved off the chrome — the v1.26.x / Agent v1.26.x strip under the WatchGrid logo was visual noise. Both lines are gone from the sidebar; versions now live exclusively in System → About, where Server, Agent, uptime, and platform sit on the Versions card.

Security

Default session lifetime cut from 24 hours to 1 hour. Combined with the new IP binding (also on by default), a stolen token expires within an hour and is rejected immediately when presented from a different client IP.
Repository credentials redacted in API responses — GET /api/repositories previously returned the raw password and ssh_key fields to any authenticated caller. The new redactedRepo helper blanks both and emits boolean has_password / has_ssh_key flags instead. POST responses use the same redaction.
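
A redaction helper of the kind described above could look like this. The Repository struct and its JSON tags are assumptions made for the example; only the helper name and the has_password / has_ssh_key flags come from the entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Repository is an assumed shape; the real struct likely has more fields.
type Repository struct {
	Name        string `json:"name"`
	URL         string `json:"url"`
	Password    string `json:"password,omitempty"`
	SSHKey      string `json:"ssh_key,omitempty"`
	HasPassword bool   `json:"has_password"`
	HasSSHKey   bool   `json:"has_ssh_key"`
}

// redactedRepo returns a copy that is safe to serialize: the secrets are
// blanked and replaced by booleans saying whether they were set.
func redactedRepo(r Repository) Repository {
	r.HasPassword = r.Password != ""
	r.HasSSHKey = r.SSHKey != ""
	r.Password = ""
	r.SSHKey = ""
	return r
}

func main() {
	repo := Repository{
		Name:     "apps",
		URL:      "git@git.example:apps.git",
		Password: "hunter2",
	}
	out, err := json.Marshal(redactedRepo(repo))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the helper takes the struct by value, the caller's copy keeps its secrets for the actual sync while the response copy carries none.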

Fixed

docker-compose.ui-test.yml failed to start after the JWT-length validator landed — the committed JWT_SECRET was 25 characters, but the server now rejects anything under 32. The committed value is now a 60-char dev-only secret. The same file now also seeds WATCHGRID_ALLOWED_ORIGINS with all four localhost:5173 / 5174 and 127.0.0.1:5173 / 5174 variants, so a browser opened on the Vite default origin gets through CORS instead of seeing "Network error" after the in-browser fetch is 403'd. Additionally, WATCHGRID_DEV_MODE=true now auto-appends those four origins inside the server (allowedBrowserOrigins in ws_security.go), so a fresh checkout works without any env-var dance.
docs/local-ui-testing.md pointed at the wrong Vite port — was 5174, is actually 5173.

Database migrations

022_system_settings.sql — new system_settings (key, value, updated_at, updated_by) table seeded with session_ttl_minutes=60 and session_bind_ip=true. Mirrored into scripts/init-db.sql for clean installs.

[1.26.7] - 2026-05-14

Fixed

Cluster-agent pod restart wiped Watchgrid's view of installed apps — the cluster-agent's proxy route map is in-memory, and after kubectl rollout restart deploy/watchgrid-cluster-agent -n watchgrid-system the next heartbeat carried exposed_apps: []. ClusterRow.jsx reads /api/clusters/apps which just returns the device's reported exposed_apps, so the UI offered the apps as available-to-install even though their Deployments and Services were still running in the cluster. With the new heartbeat-driven DNS sync from 1.26.6 it got worse — the empty array also caused syncClusterDNSRecords to delete the cluster's DNS rows. Fix is two-sided: (1) cluster-agent/commands.go handleDeploy now stamps annotations on the Service after kubectl apply (app.watchgrid.io/hostname, /expose-port, /protocol, /name) — the server already set app.watchgrid.io/managed-by=watchgrid and /name as labels via addAppLabels, but those alone weren't enough to rebuild a route because the port and protocol were lost; (2) new cluster-agent/rediscover.go runs once at startup after WireGuard comes up, lists every Service cluster-wide carrying the managed-by label, and rebuilds the in-memory proxy map from the annotations. Includes a fallback path for apps deployed before this fix shipped: hostname is derived from BuildDNSHostname(appName) and port falls back to svc.Spec.Ports[0].Port — good enough for the bundled demo apps, and gets superseded by exact annotations on the next redeploy. After this fix lands, the next heartbeat repopulates exposed_apps, the 1.26.6 DNS sync re-upserts records, and the UI shows the apps as installed again — automatically, without any manual redeploy.
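
The annotation-first, fallback-second rebuild can be sketched without a Kubernetes client. The Service and Route types are trimmed stand-ins. buildDNSHostname is a placeholder for the real BuildDNSHostname, whose rules are not given here, and the "http" default protocol is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const annotationPrefix = "app.watchgrid.io/"

// Service is a trimmed stand-in for the Kubernetes Service fields used here.
type Service struct {
	Name        string
	Annotations map[string]string
	Ports       []int // the first entry plays the role of svc.Spec.Ports[0].Port
}

// Route is one entry of the in-memory proxy map (assumed shape).
type Route struct {
	Hostname string
	Port     int
	Protocol string
}

// buildDNSHostname is a placeholder; the real derivation may differ.
func buildDNSHostname(appName string) string {
	return strings.ToLower(appName) + ".wg"
}

// routeFromService prefers exact annotations and falls back to derived
// values for Services deployed before the annotations existed.
func routeFromService(svc Service) (Route, bool) {
	name := svc.Annotations[annotationPrefix+"name"]
	if name == "" {
		name = svc.Name
	}
	route := Route{
		Hostname: svc.Annotations[annotationPrefix+"hostname"],
		Protocol: svc.Annotations[annotationPrefix+"protocol"],
	}
	if route.Hostname == "" {
		route.Hostname = buildDNSHostname(name)
	}
	if route.Protocol == "" {
		route.Protocol = "http" // assumed default for legacy Services
	}
	if p, err := strconv.Atoi(svc.Annotations[annotationPrefix+"expose-port"]); err == nil && p > 0 {
		route.Port = p
	} else if len(svc.Ports) > 0 {
		route.Port = svc.Ports[0]
	} else {
		return Route{}, false // nothing to route to
	}
	return route, true
}

func main() {
	annotated := Service{Name: "hello-web", Annotations: map[string]string{
		annotationPrefix + "hostname":    "hello-web.wg",
		annotationPrefix + "expose-port": "8080",
		annotationPrefix + "protocol":    "http",
	}}
	legacy := Service{Name: "nginx-demo", Ports: []int{80}}

	for _, svc := range []Service{annotated, legacy} {
		route, ok := routeFromService(svc)
		fmt.Println(svc.Name, route, ok)
	}
}
```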

[1.26.6] - 2026-05-14

Fixed

Cluster-agent DNS registration calls were 401-ing — cluster-agent/dns.go POSTed proxy hostnames to /api/dns/records, but that endpoint is gated by requireAuth (admin JWT) and the cluster-agent has no token, so every deploy logged DNS register failed for {hostname} after 3 attempts: status 401 (operators saw this for nginx-demo---local, hello-web, etc.). The Kubernetes/Proxy UI's "Open WebUI" link reverse-proxies via /api/k8s/service-proxy/... and dials dev.WireGuard.TunnelIP:8081 directly, so this didn't block that flow — but the *.wg hostnames advertised on the cluster's proxy never had matching DNS records, so anyone trying to resolve {app}.wg directly (over the VPN) got NXDOMAIN. The cluster-agent already publishes its current proxy routes in the exposed_apps array of every heartbeat, so the server now owns DNS state: a new syncClusterDNSRecords function in server/main.go upserts a custom_dns_records row for each advertised hostname (pointing at the cluster's tunnel IP) and deletes rows for hostnames the cluster previously claimed but no longer reports. Per-cluster ownership is tracked in an in-memory map, with a safety guard so we only delete records whose IP still matches the cluster's tunnel — operators who overwrote a record manually won't have it clobbered. cluster-agent/dns.go was deleted; commands.go and main.go no longer wire a DNSManager. Stale rows from before this fix linger until the same cluster reasserts a smaller exposed_apps set; clean them up by hand via DELETE FROM custom_dns_records WHERE ip_address = '<cluster tunnel ip>' AND hostname NOT IN (...) if needed.
Registry Test Web couldn't register a proxy route because its metadata had no expose_port — apps/registry-test-web/metadata.yaml was missing expose_port and expose_protocol, so when the cluster-agent's deploy handler read the metadata it had nothing to add to the proxy and never called proxy.AddRoute(...). Symptom: the agent log showed Command deploy completed: Deployed Registry Test Web: ... but no matching Proxy route added line, and clicking the app in the UI did nothing because the cluster's exposed_apps heartbeat array never contained it. Set expose_port: 80 and expose_protocol: http on the metadata to match the Service definition in deployment.yaml. (Note: the deployment image is pulled from registry.wg:5000/..., which still requires containerd registry-mirror configuration on each cluster node — that's a separate problem if image pulls fail.)
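
The heartbeat-driven reconciliation from the DNS entry above (upsert what the cluster advertises, delete what it dropped, never delete a record that no longer points at the cluster's tunnel IP) can be sketched with plain maps. The function name and signature are illustrative and not the real syncClusterDNSRecords, which works against custom_dns_records.

```go
package main

import (
	"fmt"
	"sort"
)

// syncClusterDNS reconciles records (hostname -> IP) with the hostnames a
// cluster currently advertises. owned is the set the cluster claimed on its
// previous heartbeat; the returned set replaces it.
func syncClusterDNS(records map[string]string, owned map[string]bool, tunnelIP string, exposed []string) map[string]bool {
	current := make(map[string]bool, len(exposed))
	for _, host := range exposed {
		records[host] = tunnelIP // upsert every advertised hostname
		current[host] = true
	}
	for host := range owned {
		if current[host] {
			continue // still advertised
		}
		// Safety guard: only delete records that still point at this cluster,
		// so a manually overwritten record is left alone.
		if records[host] == tunnelIP {
			delete(records, host)
		}
	}
	return current
}

func main() {
	records := map[string]string{
		"hello-web.wg":  "10.0.0.2",
		"nginx-demo.wg": "10.0.0.2",
		"grafana.wg":    "192.0.2.50", // operator repointed this one by hand
	}
	owned := map[string]bool{"hello-web.wg": true, "nginx-demo.wg": true, "grafana.wg": true}

	owned = syncClusterDNS(records, owned, "10.0.0.2", []string{"hello-web.wg"})

	hosts := make([]string, 0, len(records))
	for host := range records {
		hosts = append(hosts, host)
	}
	sort.Strings(hosts)
	fmt.Println(hosts, len(owned))
}
```

Note that an empty exposed list deletes every record the cluster still owns, which is the interaction the 1.26.7 entry describes.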

[1.26.5] - 2026-05-14

Fixed

K8s reverse proxy showed "cluster device not reachable" because the cluster-agent never brought up WireGuard — clusterProvisionHandler generated the cluster-agent ConfigMap with WATCHGRID_DISABLE_WIREGUARD: "true" baked in. That dates back to when the cluster-agent shipped (commit 22d6167) and the only traffic it needed was HTTPS to the public server URL for registration/heartbeats/commands. The K8s service-proxy feature added later (k8sServiceProxyHandler in server/main.go) reverse-proxies into cluster workloads by dialing the cluster-agent's :8081 endpoint at dev.WireGuard.TunnelIP, so a cluster-agent without a tunnel always fails with "cluster device not reachable" (server/main.go:3624-3626). The provisioning manifest now sets WATCHGRID_DISABLE_WIREGUARD: "false" (with an inline comment explaining when to disable it). Prerequisites — /dev/net/tun mount and NET_ADMIN capability — were already in the manifest, so userspace WireGuard via wireguard-go works without any other changes. If WG setup fails on a given node, the cluster-agent logs WireGuard setup failed: ... (continuing without VPN) and proceeds, so the failure mode is graceful. Operators with already-onboarded clusters need to regenerate the cluster manifest in the UI and re-apply it on the cluster so the cluster-agent pod restarts with the new env var.

[1.26.4] - 2026-05-14

Fixed

App deploys to newly-onboarded clusters failed silently because the cluster row was never written to devices — registerHandler upserted the device into the in-memory devices map and returned 200, but never called dbSaveDevice. Heartbeats from a registered device take the UPDATE-only path (dbUpdateHeartbeat), so a missing row stays missing forever; zero rows affected, no error. The cluster therefore showed up in /api/clusters (which reads the in-memory map) and in the UI, but cluster_commands.cluster_id has a foreign key on devices(id) ON DELETE CASCADE, so the first INSERT INTO cluster_commands (...) from clusterDeployHandler → dbEnqueueClusterCommand was rejected by Postgres and the deploy never reached the cluster-agent. Symptom on the rob.trial server: an Omni cluster heartbeating with K8s info visible and 18 pods reported, but the user could not install any app onto it. registerHandler now persists the device with dbSaveDevice immediately after upserting the in-memory map (same pattern the heartbeat auto-register path already uses), so the row exists by the time any FK-dependent insert runs. Operators with already-onboarded clusters that hit this bug will need to delete/re-apply the cluster manifest (or restart the cluster-agent pod) so it re-issues /api/register against the patched server.

[1.26.3] - 2026-05-07

Fixed

Onboarding & cluster manifest URLs honor the real client scheme behind Caddy — the frontend nginx forwarded X-Forwarded-Proto: $scheme to the Go server, but $scheme is the scheme nginx itself listens on (port 80, so always http). That overwrote the https value Caddy was already setting on the inbound request, so forwardedProto(r) returned http and the onboarding "Reprovision Existing Device" / "Basic Installation" / "Installation With Kubernetes" curl snippets — plus the server_url baked into the cluster manifest — all came out as http://... even when the operator was on HTTPS. New map $http_x_forwarded_proto $forwarded_proto block at the top of frontend/nginx.conf falls back to $scheme only when the upstream proxy didn't set the header; all three proxy_set_header X-Forwarded-Proto ... lines (/api/, /downloads/, \.sh$) now use that variable.
Duplicated --siteid flag in onboarding commands — the backend getOnboardingInfoHandler appends --siteid <id> whenever the request carries site_id=…, and Inventory.jsx was also wrapping the returned strings with siteScopedCommand() which appended the same flag again. The deduplication regex inside siteScopedCommand was broken ("+? required at least one literal quote, which the backend output never has), so the second flag always slipped through and operators saw --siteid 1 --siteid 1. Removed the redundant frontend wrapper entirely; the backend is the single source of truth.
Device-detail tabs wrap instead of clipping the rightmost ones — HostRow.jsx and ClusterRow.jsx rendered the tab strip with flex … overflow-x-auto whitespace-nowrap … no-scrollbar. On a typical desktop width the ninth tab (Kubernetes, only present when device.k8s_enabled) was pushed past the right edge with the scrollbar deliberately hidden, so operators couldn't see or reach the Kubernetes tab even though the tab itself was rendered. Replaced with flex flex-wrap gap-x-6 gap-y-2; tabs now flow to a second line when they don't fit, which mirrors how page-level tab strips already behave.
Sites/Inventory side panel starts expanded each time you enter that workspace — sitesExpanded was initialised once at Layout mount from location.pathname, so if the app loaded on Dashboard (or anywhere outside /sites / /inventory) the panel was stuck collapsed even after you navigated into Sites. Default state is now true and a small useEffect re-expands the panel whenever the user enters the Sites/Inventory workspace from outside it. In-session manual collapse via the ‹ toggle still works — leaving the workspace and coming back simply reopens it.
Onboard-To-Site modal couldn't scroll on shorter viewports — .mil-modal-card had no max-height and no overflow rule, so the Cluster Manifest section at the bottom was clipped off-screen with no way to reach it. Added max-h-[90vh] overflow-y-auto to the shared modal card class.
K8sDevicePanel is not defined when opening Kubernetes on a host device — the Inventory.jsx split into per-row component files (HostRow.jsx, ClusterRow.jsx, etc.) extracted the JSX that renders <K8sDevicePanel ... /> into HostRow.jsx but did not carry the import K8sDevicePanel from '../K8sDevicePanel' along with it. ClusterRow.jsx got the import; HostRow.jsx didn't. The build still succeeded because JSX references are compiled to React.createElement(K8sDevicePanel, …) calls that only blow up at render time, so the regression only surfaced when an operator with a K3s-enabled host clicked the Kubernetes tab — they then saw a red error overlay instead of the cluster panel. Added the missing import; verified the eight PascalCase JSX tags in HostRow.jsx now all resolve.
createPortal is not defined in cluster + app-config modals — same class of regression as the K8sDevicePanel miss: the inventory split moved createPortal(...) calls into ClusterRow.jsx and AppConfigModal.jsx without bringing import { createPortal } from 'react-dom'. TerminalOverlay.jsx and K8sDevicePanel.jsx had the import; the other two did not. Same Vite-can't-see-it-at-build-time, blows-up-at-render mechanism — the failure surfaced when a user opened the cluster row's config modal or the app-config modal. Added the missing imports; full audit of every PascalCase JSX tag across src/components/inventory/*.jsx now passes (no other unbound references).

[1.26.2] - 2026-05-07

Fixed

Creating users in the UI no longer fails with users_role_check — the System → Users dropdown offered user, admin, operator, but the database constraint users_role_check only accepts super-admin, tenant-admin, operator, viewer. Saving any role other than operator was rejected by Postgres with pq: new row for relation "users" violates check constraint "users_role_check". The dropdown is now viewer / operator (plus tenant-admin and super-admin when the caller is a super-admin), the backend default in usersCreateHandler is viewer, the privilege gate is updated to block super-admin/tenant-admin for non-super-admins, and unknown roles are now rejected with a 400 before they reach the DB.
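
The role handling described above can be condensed into one validation step that runs before the INSERT. This is a sketch, not the usersCreateHandler code: resolveNewUserRole is a hypothetical name, and returning 403 for a blocked privileged role is an assumption (the entry only states that unknown roles get a 400).

```go
package main

import (
	"fmt"
	"net/http"
)

// knownRoles mirrors the values accepted by the users_role_check constraint.
var knownRoles = map[string]bool{
	"super-admin":  true,
	"tenant-admin": true,
	"operator":     true,
	"viewer":       true,
}

// privilegedRoles may only be granted by a super-admin.
var privilegedRoles = map[string]bool{"super-admin": true, "tenant-admin": true}

// resolveNewUserRole returns the role to store plus an HTTP status.
// Any status other than 200 means the request should be rejected.
func resolveNewUserRole(requested, callerRole string) (string, int) {
	if requested == "" {
		return "viewer", http.StatusOK // backend default
	}
	if !knownRoles[requested] {
		return "", http.StatusBadRequest // never reaches the database
	}
	if privilegedRoles[requested] && callerRole != "super-admin" {
		return "", http.StatusForbidden // assumed status for the privilege gate
	}
	return requested, http.StatusOK
}

func main() {
	for _, requested := range []string{"", "admin", "tenant-admin", "operator"} {
		role, status := resolveNewUserRole(requested, "tenant-admin")
		fmt.Printf("requested=%q role=%q status=%d\n", requested, role, status)
	}
}
```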

Removed

Licensing docs page — docs/licensing.md removed and dropped from the mkdocs.yml nav and the docs/index.md table; the Licensing / editions row in CLAUDE.md's docs-sync table is gone too.
"Getting Help" footer on docs landing page — removed the trailing Getting Help section (sales email + "Watchgrid B.V. — The Netherlands" line) from docs/index.md.

Changed

Dashboard map fits all devices + control plane on first load — MapContainer only honors center/zoom on mount, and serverLocation is fetched async, so the very first paint had only the device list (one Pi → zoom 10 → map locked on London). After the server location landed the props were ignored and the user was stuck looking at London with the Falkenstein control plane off-screen. New FitToContent child uses useMap().fitBounds once data arrives, then sets a ref so subsequent polling refreshes don't yank the user's manual pan/zoom back.

Fixed

Device flapping between assigned site and "unassigned" after WireGuard approval — wgApproveHandler was overwriting the in-memory devices[id] entry with a stripped Device{} that carried only the ID, tunnel IP, and LastSeen. That clobbered the TenantID, Hostname, DeviceType, and crucially the SiteID that wgRegisterHandler had just populated. The DB row stayed correct, so REST /api/devices?tenant_id=X (which filters in-memory by tenant and falls back to the DB record) still returned the right site, but the dashboard WebSocket snapshot uses tenantID="" and the in-memory entry survives the merge — every WS push wiped site_id and the device jumped to "Unassigned" until the next REST poll restored it. The handler now updates LastSeen and the tunnel IP on the existing in-memory record instead of replacing it. Existing servers with corrupt in-memory state recover on the next restart (the map is rehydrated from the DB at boot).
Onboarding commands respect TLS-terminating reverse proxy — getOnboardingInfoHandler (and the server_url it returns to the cluster-manifest generator) chose the URL scheme with r.TLS != nil, which is always false when TLS is terminated upstream because r.TLS is nil on the proxied connection. The Onboard-To-Site modal therefore showed curl -fsSL http://... even when the operator was logged in over HTTPS, leaving them to hand-edit the command. The handler now uses the existing forwardedProto(r) helper (which honors X-Forwarded-Proto) and prefers X-Forwarded-Host over the request host when that header is set, matching the convention already used by getExternalBaseURL for OIDC redirects.
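
Deriving the external URL behind a TLS-terminating proxy might be sketched as follows. schemeFor and externalBaseURL are illustrative names and not the real forwardedProto or getExternalBaseURL helpers. The sketch assumes the proxy headers come from a trusted hop.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// schemeFor prefers the scheme reported by the upstream proxy and only falls
// back to the local connection's TLS state when the header is absent.
func schemeFor(xForwardedProto string, directTLS bool) string {
	if xForwardedProto != "" {
		// Some proxies append values; the first entry is the client-facing one.
		first, _, _ := strings.Cut(xForwardedProto, ",")
		if p := strings.ToLower(strings.TrimSpace(first)); p == "http" || p == "https" {
			return p
		}
	}
	if directTLS {
		return "https"
	}
	return "http"
}

// externalBaseURL builds the base URL an operator would see in the browser.
func externalBaseURL(r *http.Request) string {
	host := r.Header.Get("X-Forwarded-Host")
	if host == "" {
		host = r.Host
	}
	return schemeFor(r.Header.Get("X-Forwarded-Proto"), r.TLS != nil) + "://" + host
}

func main() {
	// The proxy talks plain HTTP to the server, so r.TLS is nil here.
	req, err := http.NewRequest(http.MethodGet, "http://server:8080/", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Forwarded-Proto", "https")
	req.Header.Set("X-Forwarded-Host", "grid.example.com")
	fmt.Println(externalBaseURL(req))
}
```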

Changed

SSO config moved to its own System menu item — the OIDC settings form is no longer a section under System → Users; it now lives at System → SSO (/system/sso). New frontend/src/SSO.jsx owns the form, fetch, and save handlers. Users.jsx drops the OIDC state, fetchOIDCSettings, handleSaveOIDCSettings, and the embedded form (~200 lines). Same backend endpoints (GET/PUT /api/auth/oidc/settings); no API changes. Page is super-admin only — non-super-admins see a permission-denied notice instead of the form.
Inventory.jsx split into per-component files (#42) — the 4308-line monolith is now 2218 lines. New frontend/src/components/inventory/ hosts six extracted files: shared.jsx (formatters + InfoPanel/InfoRow/DeviceTabPanel/RuntimeTrend), HostRow.jsx (1237 lines), ClusterRow.jsx (538 lines), ServiceRow.jsx, AppConfigModal.jsx, TerminalOverlay.jsx. All 17 unit tests still pass; frontend build clean; no behaviour change (each extracted function already took its dependencies via props so extraction was purely structural).

Added

Web Vitals reporting (#44) — web-vitals initialised in main.jsx; LCP / CLS / INP / FCP / TTFB ship via navigator.sendBeacon to a new POST /api/metrics/vitals endpoint on the server, which folds the values into two Prometheus histograms (watchgrid_web_vitals_ms and watchgrid_web_vitals_cls). No third-party analytics — stays on the customer's own infrastructure. Endpoint is exempt from CSRF + CORS (fires before auth, via sendBeacon which may drop cookies).
Vitest + React Testing Library harness (#45) — vite.config.js gains a test: section (jsdom, globals, coverage via v8). New src/test/setup.js pulls in @testing-library/jest-dom matchers and cleans the DOM between cases. 17 tests shipped against useApi, usePolling, ConfirmProvider, ToastProvider; CI gates the frontend image build on vitest run via a new test-frontend job in build.yml. npm test, npm run test:watch, npm run test:coverage added to scripts.
Virtualization primitive for long device lists (#43) — react-window dependency + new frontend/src/lib/virtualList.js exposing VirtualizedList and a VIRTUALIZE_THRESHOLD constant (300). Not yet wired into Inventory's expandable-row path — current tenants stay well below the threshold and expandable rows need per-row height tracking — but the primitive is ready for the first customer who trips it.

Changed

Shared skeleton-loader primitives (#37) — new components/Skeleton.jsx (SkeletonBlock, SkeletonLine, SkeletonCard, SkeletonRows). Inventory and Dashboard render skeletons matching the real layout on first load instead of spinner→content, eliminating CLS. Announced to screen readers via role="status" + aria-busy="true".
Typed API client helper (#41) — frontend/src/lib/api.js wraps AuthContext.apiRequest and exposes useApi() returning { get, post, put, patch, delete, raw }. Errors throw a typed ApiError with .status, .body, human-readable .message; uncaught errors auto-surface via the toast system (401 is skipped — already handled by the session-expired overlay). Firewall delete flow seeded as a migration example; remaining call-sites can move incrementally.

Changed

Exponential backoff with jitter on terminal reconnect (#36) — DeviceTerminal.jsx moves from deterministic 2 s × 2^n retries (no jitter) to full-jitter exponential backoff (1 s base, 30 s cap, 6 attempts). Counter resets on every successful ready status so a network blip doesn't count against a fresh streak. After max attempts the UI shows a Reconnect button that resets the counter.
CSP tightened (#40) — connect-src dropped from 'self' ws: wss: (any-origin) to 'self' (modern browsers cover same-origin WSS). Added frame-src 'none'. Preserved style-src 'self' 'unsafe-inline' as a documented Tailwind exception; nginx.conf now carries the full rationale inline.
Mobile: horizontal scroll on device-panel tabs (#39) — Inventory device- and cluster-panel tab rows switch from flex-wrap to horizontal scroll (overflow-x-auto + whitespace-nowrap + new .no-scrollbar utility in index.css). Min 44 px tap target per WCAG 2.5.5. Added role="tablist" / role="tab" / aria-selected.
Accessibility spot pass (#38) — Login error banner is now role="alert" + aria-live="assertive"; username/password inputs gain aria-invalid/aria-describedby pointing at the error banner, plus correct autoComplete hints.
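
Full-jitter backoff as used for the terminal reconnect (#36) picks a uniformly random delay between zero and an exponentially growing, capped ceiling. The sketch below is written in Go for consistency with the server code, although the real change lives in DeviceTerminal.jsx; only the 1 s base, 30 s cap, and 6 attempts come from the entry, and the function names are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay   = 1 * time.Second
	maxDelay    = 30 * time.Second
	maxAttempts = 6
)

// ceilingFor returns min(maxDelay, baseDelay * 2^attempt) for a zero-based
// attempt counter.
func ceilingFor(attempt int) time.Duration {
	c := baseDelay
	for i := 0; i < attempt && c < maxDelay; i++ {
		c *= 2
	}
	if c > maxDelay {
		c = maxDelay
	}
	return c
}

// nextDelay draws a delay uniformly from [0, ceiling]. pick must behave like
// rand.Int63n: it returns a value in [0, n).
func nextDelay(attempt int, pick func(n int64) int64) time.Duration {
	return time.Duration(pick(int64(ceilingFor(attempt)) + 1))
}

func main() {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		delay := nextDelay(attempt, rand.Int63n)
		fmt.Printf("attempt %d: ceiling %v, delay %v\n", attempt+1, ceilingFor(attempt), delay)
	}
	// A caller would reset the attempt counter to 0 after a successful reconnect.
}
```

Passing the random source as a parameter keeps the ceiling logic deterministic and easy to test.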

Added

Structured JSON logging with log/slog (#13) — new server/logging.go and agent/logging.go install a slog.JSONHandler at the level picked from WATCHGRID_LOG_LEVEL (debug/info/warn/error; default info). Server logs carry component=server + version; agent logs add device_id when available. The standard log package is bridged through slog via log.SetOutput so existing log.Printf call sites emit JSON immediately (marked legacy_log=true) — migration to first-class slog attributes can land incrementally without touching every file at once.
GDPR user export + cascade purge (#24) — two new super-admin-only endpoints. GET /api/users/{username}/export bundles every row across users, admin_audit_log, device_security_log, ssh_certificates, license_audit_log, and device_profile_runs that references the user into a single JSON download (password hash + 2FA secret redacted). DELETE /api/users/{username}?purge=true runs a transactional cascade-delete across the same tables, scrubs username matches in device_security_log.details, and writes a user_gdpr_purge audit entry before deleting so the record survives its own cascade. The regular DELETE /api/users/{id} without ?purge=true keeps its existing off-boarding semantics. Documented in docs/users.md#gdpr--data-subject-access-requests.
Prometheus /metrics endpoint (#14) — github.com/prometheus/client_golang with a package-private registry instruments HTTP routes (low-cardinality route bucketing, status class, latency histogram), agent heartbeats, login failures by reason, per-limiter rate-limit rejections, WireGuard peer count, and DB pool stats (open + in-use). Mounted at /metrics on :8080; production deployments must block external access via the reverse proxy / NetworkPolicy. Documented in docs/production.md#observability-prometheus.
k8s hardening: PodDisruptionBudget + default-deny NetworkPolicies (#22) — new k8s/07-policies.yaml ships PDBs (minAvailable: 2 for server, minAvailable: 1 for Postgres) plus a default-deny network policy with explicit allow edges: frontend→server, server→postgres, server→registry, cluster-agent→server, kube-dns for every workload, and controlled Internet egress from the server (RFC1918 excluded).
Route-level code splitting with React.lazy + Suspense (#31) — every dashboard route except / (Dashboard) is now React.lazy-loaded with an accessible spinner fallback. The entry chunk shrank from ~375 KB raw / 87 KB gz to 60 KB / 19 KB gz (roughly 6× smaller raw, 4.6× smaller gzipped); heavy screens like Inventory (40 KB gz) and Sites/Users/Tenants (~4.5 KB gz each) ship only when visited. Initial-load JS+CSS drops from ~163 KB gz to ~136 KB gz.
Production runbook (#16) — new docs/runbook.md covers first-install checklist, backup & verification drills (Postgres, WireGuard key, SSH-CA), upgrade + rollback procedure with post-upgrade smoke tests, an alert → action playbook, leader-election verification drill, and the observability cheat-sheet. Linked from mkdocs.yml and deployed at docs.watchgrid.dev/runbook/.
CSRF protection on state-changing endpoints (#19) — new csrfMiddleware enforces the double-submit-cookie pattern: on login the server sets a non-httpOnly watchgrid_csrf cookie with 32 random bytes; AuthContext.apiRequest reads it back and echoes the value in an X-CSRF-Token header on every POST/PUT/PATCH/DELETE. The middleware compares header against cookie in constant time and returns 403 on mismatch. Exempt paths: agent endpoints (token-authenticated), registry proxy, /downloads/, WebSocket upgrades, /api/auth/login, /api/auth/oidc/*. GET/HEAD/OPTIONS and Authorization-header-only clients are unaffected.
Password policy + bcrypt cost 12 (#20) — new validatePasswordPolicy enforces min length 8, mixed case + digit + special, and a common-passwords blocklist on every user-facing password endpoint (createUserHandler, changePasswordHandler, onboardingHandler). Bcrypt cost raised from the library default (10) to 12 for new hashes. GET /api/auth/password-policy publishes the rules so the frontend can render live feedback. Existing stored hashes at cost 10 remain verifiable — no forced migration. Env-var admin bootstrap does NOT validate (so operators with weak existing ADMIN_PASSWORD don't lose access on upgrade).
Toast notification system (#33) — new ToastProvider exposes toast.success/error/info with accessible role="status" announcement, auto-dismiss (5 s for info/success, 8 s for errors), pause-on-hover, and focus behaviour. Top-level alert() call sites in Tenants, ProvisioningProfiles, AdminDevices, Sites, and more migrated off the browser's native dialogs.
Branded ConfirmModal replaces window.confirm (#32) — new ConfirmProvider exposes confirm({ title, message, variant }) returning Promise<boolean>, with focus trap, ESC-to-close, backdrop-click-to-cancel, and a variant: 'danger' style for destructive actions. Sign-out in Layout, plus delete flows in Users, Sites, DNS, Firewall, ProvisioningProfiles, Tenants, and AdminDevices all migrated.
Cluster command queue persisted to PostgreSQL — new cluster_commands table (migration 021) stores every deploy/delete/restart operation destined for a cluster-agent with its kind, JSONB payload, status (pending/claimed/done/failed), idempotency key, result, and lifecycle timestamps. clusterDeployHandler and clusterUndeployHandler now enqueue to the DB via dbEnqueueClusterCommand; commandHandler claims the next command atomically with SELECT ... FOR UPDATE SKIP LOCKED when a cluster-agent polls; commandResultHandler marks the claimed row done/failed on result POST. An idempotency key (<kind>:<app>:<namespace>) deduplicates double-click enqueues while a previous command for the same target is still in flight. On server startup, commands stuck claimed for more than 10 minutes are reset to pending for re-delivery (Kubernetes deploys are idempotent). New GET /api/clusters/commands?cluster_id=<id> endpoint returns queue history. Documented in docs/clusters.md#command-queue.
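
The double-submit-cookie check from the CSRF entry above (#19) reduces to a random token, a method filter, and a constant-time comparison. The sketch below is a simplified illustration: csrfCheck is a hypothetical stand-in for csrfMiddleware, hex encoding of the 32 random bytes is an assumption, and the exempt paths listed in the entry are omitted.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newCSRFToken returns 32 random bytes, hex-encoded (the encoding is assumed).
func newCSRFToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// isStateChanging reports whether a method needs the CSRF check.
func isStateChanging(method string) bool {
	switch method {
	case http.MethodPost, http.MethodPut, http.MethodPatch, http.MethodDelete:
		return true
	}
	return false
}

// tokensMatch compares cookie and header in constant time.
func tokensMatch(cookieValue, headerValue string) bool {
	if cookieValue == "" || headerValue == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(cookieValue), []byte(headerValue)) == 1
}

// csrfCheck rejects state-changing requests whose X-CSRF-Token header does
// not echo the watchgrid_csrf cookie. Exempt paths are left out of this sketch.
func csrfCheck(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isStateChanging(r.Method) {
			c, err := r.Cookie("watchgrid_csrf")
			if err != nil || !tokensMatch(c.Value, r.Header.Get("X-CSRF-Token")) {
				http.Error(w, "CSRF token mismatch", http.StatusForbidden)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	token, err := newCSRFToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(token), tokensMatch(token, token), tokensMatch(token, "forged"))
}
```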

Fixed

Hijacker/Flusher interface forwarding through metrics middleware (#57) — Tier 3a's statusRecorder embedded http.ResponseWriter for Write/WriteHeader, but embedding an interface promotes only that interface's own methods, so the underlying writer's http.Hijacker and http.Flusher implementations were hidden from callers. gorilla/websocket refuses to upgrade without Hijacker and 500'd on /api/ws/dashboard. Surfaced by post-deploy E2E smoke. statusRecorder.Hijack() + Flush() now forward explicitly.
CSRF bypass for Authorization: Bearer / X-Agent-Token (#55) — follow-up to #48. Requests authenticating with header tokens can't be forged cross-origin (CORS preflight blocks setting those headers), so CSRF adds no defensive value and just breaks API-token callers that happen to share a browser with a live session cookie. Surfaced by the E2E firewall suite, which uses Bearer auth for fixtures.
E2E fixtures aligned with Tier 2b changes (#54, #56) — sign-out test now drives the ConfirmModal instead of window.confirm; multi-tenancy fixture password bumped from 9 chars to 19 chars to meet the new policy; firewall-delete test drives the modal.
CI pipeline unblocked (multiple hotfixes early in the session) — aquasecurity/trivy-action tag pinned to @v0.36.0 (earlier @0.28.0 didn't exist as a tag and its transitively-referenced setup-trivy@v0.2.1 was also removed). Image refs lowercased via a REPO_LC env var (Trivy can't parse uppercase). dorny/paths-filter@v3 received explicit pull-requests: read permission on the detect-changes job (PRs otherwise failed with "Resource not accessible by integration" and skipped every downstream build). The static-analysis workflow (security.yml with CodeQL + Trivy SARIF upload) was removed — every job in it targets GitHub code-scanning alerts, which requires GitHub Advanced Security on private repos; the image-scan Trivy gate in the build workflow remains the primary CVE enforcement path.
Migration 020 insert statement — 020_onboarding_token_expiry.sql was missing the name column in its INSERT INTO schema_migrations statement, causing fresh-install migrations to fail at that step. The insert now matches the format used by migrations 018 and 019.

Changed

CI: image scanning + SBOM — every build workflow run now scans the pushed image with Trivy (CRITICAL-severity gate, ignore-unfixed) and generates a CycloneDX SBOM per component (server, frontend, cluster-agent, service-agent) as a 90-day retained artefact. CVE response SLA documented in docs/production.md (Critical: triage 24h, patch 3d, release 7d). The originally-planned static-analysis workflow (CodeQL Go + CodeQL JavaScript + Trivy fs SARIF upload) was dropped — every job in it relies on GitHub code-scanning alerts, which requires GitHub Advanced Security on private repos. The Trivy image gate in the build workflow remains the primary CVE enforcement path.
K8s manifests: container images pinned to semver + sha256 digest — imagePullPolicy changed from Always to IfNotPresent in all base manifests. k8s/base/kustomization.yaml and k8s/kustomization.yaml default to the current release tag instead of :latest. The production overlay (k8s/overlays/production/kustomization.yaml) carries both newTag and digest fields. scripts/pin-images.sh <version> fetches digests from the registry (using crane, skopeo, or docker) and updates the overlay in one step. The release CI job (pin-manifests) runs automatically on semver tag pushes, captures the digest from the build step, and commits the updated overlay to main. Dev overlay retains :latest intentionally.

Added

Real /healthz and /readyz endpoints (#15) — liveness probe runs a DB PingContext, checks that the wg0 WireGuard device is present, and stats the default-tenant SSH-CA host key. Readiness additionally gates on a migrationsApplied flag that flips once schema_migrations has at least one row. k8s/04-server.yaml swapped tcpSocket probes for httpGet ones and picked up a startupProbe so slow first migrations don't trigger liveness failures.
Proprietary LICENSE, EULA.md, NOTICES.txt (#26) — LICENSE states the source-code terms, EULA.md is the customer-facing End User Licence Agreement, NOTICES.txt is regenerated by scripts/gen-notices.sh (Go modules via go-licenses when installed, else go list -m all; npm via license-checker-rseidelsohn). README now links all three. README's prior License: MIT footer was inconsistent with the product's paid licence-key enforcement — replaced with a Licensing section that points to the three new files.

Changed

Frontend polling: 30s default + pause on hidden tabs (#28) — new frontend/src/lib/usePolling.js hook encapsulates setInterval + visibilitychange so dashboards stop hammering the API when the operator switches tabs and resume immediately on focus. Dashboard, Sites, Inventory, AppManager, AdminDevices, and K8sDevicePanel moved to the hook; intervals raised from 10–15 s to 30 s. On a 100-device tenant that cuts idle background load roughly 2–3× (and 100 % while tabs are hidden).
Audit-log retention sweeper (#25) — new audit_retention.go runs a daily goroutine that DELETEs rows older than WATCHGRID_AUDIT_RETENTION_DAYS (default 90) from both admin_audit_log and device_security_log. Sweep cadence is tunable via WATCHGRID_AUDIT_RETENTION_SWEEP_HOURS (default 24). Guidance for regulated customers on archive-to-object-storage workflows added to docs/audit.md.
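A rough sketch of how such a sweeper might be structured. It is not the contents of audit_retention.go: the function names are hypothetical, and the sweep callback stands in for the DELETE statements against admin_audit_log and device_security_log.

```go
package main

import (
	"context"
	"strconv"
	"time"
)

// retentionDays parses the raw WATCHGRID_AUDIT_RETENTION_DAYS value and
// falls back to the documented default of 90 when it is empty or invalid.
func retentionDays(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return 90
	}
	return n
}

// cutoffFor returns the timestamp before which rows are eligible for deletion.
func cutoffFor(now time.Time, days int) time.Time {
	return now.AddDate(0, 0, -days)
}

// runSweeper calls sweep with a fresh cutoff on every tick until ctx is done.
// The sweep callback is a placeholder for the actual DELETE statements.
func runSweeper(ctx context.Context, every time.Duration, days int, sweep func(cutoff time.Time)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			sweep(cutoffFor(now, days))
		}
	}
}
```

Started as go runSweeper(ctx, 24*time.Hour, retentionDays(os.Getenv("WATCHGRID_AUDIT_RETENTION_DAYS")), sweep), this would match the daily default cadence described above.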
Frontend build hardening (#29, #30) — Vite now drops debugger statements and marks console.{log,info,warn,debug,trace} as pure in production builds (so they're tree-shaken from the shipped bundle while console.error is preserved for real errors). Sourcemaps no longer ship to prod, and React/router/xterm/leaflet/icons are split into their own long-cacheable chunks via manualChunks. frontend/nginx.conf serves /assets/ (Vite's hashed output) with Cache-Control: public, max-age=31536000, immutable while index.html stays no-cache. Initial-load gzipped JS+CSS is ~163 KB (target was <500 KB).
Soft 401 handling on background polls (#34) — AuthContext.apiRequest(url, { background: true }) no longer hard-logs-out on 401. Instead it sets a sessionExpired flag, returns the response, and a new SessionExpiredOverlay mounts the Login form on top of the running app so unsaved form state is preserved. User-initiated requests still log out hard. Polling sites in Dashboard, Sites, Inventory, AppManager, Layout, AdminDevices, and K8sDevicePanel are opted in.

Security

CORS lockdown on browser-facing API (#23) — new corsMiddleware rejects requests carrying an Origin header that is neither same-origin (Origin host == request Host) nor on the comma-separated WATCHGRID_ALLOWED_ORIGINS allowlist. Agent endpoints (/api/register, /api/heartbeat, /api/commands/, /api/commandresult, /api/wg/..., /api/logs/...), the registry proxy (/api/registry/, /v2/, /registry/), public /downloads/, and WebSocket upgrades are exempt. OPTIONS preflights short-circuit with Access-Control-Max-Age: 600. Reuses the existing WATCHGRID_ALLOWED_ORIGINS env var (renamed helper allowedWebSocketOrigins → allowedBrowserOrigins). Documented in docs/production.md.
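The origin decision can be illustrated with a small pure function. This is a sketch under assumptions: originAllowed is a hypothetical helper, and the allowlist is assumed to hold full origins such as https://admin.example.com, which the entry above does not specify.

```go
package main

import (
	"net/url"
	"strings"
)

// originAllowed applies the rule described above: a request passes when it
// carries no Origin header, when the Origin host equals the request Host,
// or when the Origin appears on the comma-separated allowlist.
func originAllowed(origin, host, allowlist string) bool {
	if origin == "" {
		return true // no Origin header: nothing to reject
	}
	u, err := url.Parse(origin)
	if err != nil || u.Host == "" {
		return false // malformed or opaque origins such as "null"
	}
	if strings.EqualFold(u.Host, host) {
		return true // same-origin
	}
	for _, entry := range strings.Split(allowlist, ",") {
		if e := strings.TrimSpace(entry); e != "" && strings.EqualFold(e, origin) {
			return true
		}
	}
	return false
}
```

The exempt paths (agent endpoints, registry proxy, downloads, WebSocket upgrades) would be checked before this function is reached.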
Trial + prod Postgres TLS bootstrap fixed for fresh volumes — the previous design had two bugs that combined to make a fresh-volume bootstrap impossible. (1) command: postgres -c ssl=on -c ssl_cert_file=... crashed on first boot because the docker-entrypoint forwards command-line args to the temp server it spins up to run init scripts — and that temp server can't start without a cert that hasn't been generated yet, leaving the container in a restart loop where the init script never runs. (2) postgres:16-alpine doesn't ship openssl, and init scripts run as the unprivileged postgres user so apk add from the script isn't possible — the cert generation would have failed silently with exit 127 anyway. Fix: drop the command: override in both trials/docker-compose.trial.yml and docker-compose.prod.yml, switch the image to postgres:16 (Debian, ships openssl), and have scripts/postgres-ssl-init.sh generate the cert AND append ssl = on plus cert paths to postgresql.auto.conf. The real exec postgres after init reads auto-conf, persisting SSL across restarts. Re-run admin-panel/scripts/seed-kv.sh <kv-namespace-id> so Cloudflare KV picks up the new compose template; the seed script itself was updated to current wrangler syntax (kv key put + --remote).
Go toolchain bumped to 1.24.13 / 1.25.9 — addresses CVE-2025-68121 (crypto/tls: incorrect certificate validation in stdlib, CRITICAL) which was blocking the Trivy image gate in CI. server, agent, and service-agent Dockerfiles move from golang:1.24.9-alpine to golang:1.24.13-alpine; their go.mod toolchain directives bump from go1.24.9 to go1.24.13 so cross-compiled agent binaries embed the patched stdlib. cluster-agent pins golang:1.25-alpine → golang:1.25.9-alpine for explicit patch-level tracking.
PostgreSQL TLS enforced in production — initDatabase now returns a fatal error if WATCHGRID_DB_SSLMODE=disable outside WATCHGRID_DEV_MODE=true. docker-compose.prod.yml mounts scripts/postgres-ssl-init.sh into /docker-entrypoint-initdb.d/, which generates a one-time self-signed cert and writes ssl = on (plus cert paths) into postgresql.auto.conf so the cluster starts encrypted from first boot, with WATCHGRID_DB_SSLMODE=require on the client side. Registry proxy logs a security warning when REGISTRY_URL uses HTTP for a non-localhost host. Trust bundle handling documented in docs/production.md.
SSH CA key backup & restore runbook — scripts/backup-ssh-ca.sh creates an AES-256-CBC encrypted tarball of all four CA key files, supports local paths and rsync remote destinations, and retains a configurable number of backups (default 14). Systemd service + timer units in scripts/systemd/ for daily automated backups. Full restore procedure with RTO < 15 min documented in docs/ssh-ca.md#backup--restore.
Rate limiter memory bounded + Traefik cross-replica safety net — in-process rate limiter now runs a background goroutine that evicts stale buckets every 5 minutes (10-minute TTL), replacing the ad-hoc GC. docker-compose.prod.yml gains a Traefik auth-ratelimit middleware (10 req/min/IP) on the /api router as a cross-replica enforcement layer. Architecture decision documented in docs/production.md.
WebSocket endpoints require JWT before upgrade — POST /api/ws-ticket issues a 2-minute purpose-bound ticket ("ws") so browsers can open WebSockets without putting a long-lived JWT in the URL. Both dashboardWSHandler and terminalUserWebsocketHandler now use shared extractWSToken + verifyWSToken helpers that accept regular JWTs, ws-tickets, httpOnly cookies, and the Sec-WebSocket-Protocol: watchgrid-jwt.<token> sub-protocol trick. terminalUserWebsocketHandler also gains a tenant check: the connecting user must have access to the session device's tenant.
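One part of this, pulling a token out of the Sec-WebSocket-Protocol header, might look like the sketch below. The helper name is hypothetical and this is not the body of extractWSToken; only the watchgrid-jwt. prefix comes from the entry above.

```go
package main

import "strings"

// wsTokenPrefix is the sub-protocol prefix that carries the token.
const wsTokenPrefix = "watchgrid-jwt."

// tokenFromSubprotocols scans a Sec-WebSocket-Protocol header value, which
// is a comma-separated list of offered sub-protocols, and returns the token
// from the first entry of the form watchgrid-jwt.<token>.
func tokenFromSubprotocols(header string) (string, bool) {
	for _, proto := range strings.Split(header, ",") {
		proto = strings.TrimSpace(proto)
		if tok, ok := strings.CutPrefix(proto, wsTokenPrefix); ok && tok != "" {
			return tok, true
		}
	}
	return "", false
}
```

Browsers cannot set arbitrary headers on a WebSocket handshake, but they can choose the sub-protocol list, which is why this channel is useful for passing a short-lived ticket.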
Admin password required in production — server now refuses to start if neither ADMIN_PASSWORD nor ADMIN_PASSWORD_HASH is set and WATCHGRID_DEV_MODE is not true. The hardcoded changeme fallback is restricted to dev mode only. docker-compose.prod.yml updated to document and pass ADMIN_PASSWORD.
Onboarding tokens now expire — token_expires_at column added to tenants; default TTL is 1 year on generation/rotation (raised from the original 30 days — short enough to bound the blast radius of a leaked token, long enough that fleets on a yearly re-image cadence don't have to rotate mid-cycle). Expired tokens are rejected at /api/register and /api/wg/register and the event is logged to device_security_log. Existing tokens are backfilled by migration 020. The Tenants UI shows expiry date (yellow <7 days, red = expired) and a "Rotate Token" button for admins.
Agent binary self-update now supports Ed25519 signature verification — when WATCHGRID_UPDATE_PUBKEY (hex-encoded Ed25519 public key) is set or the key is embedded at build time via -ldflags, the agent downloads a .sig file alongside the binary and verifies the signature before installation. Updates are rejected if verification fails. Falls back to checksum-only with a warning when no key is configured.
Provisioning script requires HTTPS — provision.sh now refuses http:// server URLs to prevent supply-chain attacks during agent binary download. Set WATCHGRID_ALLOW_HTTP=1 to override in local development.
SPKI certificate pinning in agent — set WATCHGRID_SERVER_SPKI to a comma-separated list of hex-encoded SHA-256 SPKI hashes to pin the server's TLS certificate. Both the HTTP client and WebSocket dialer enforce the pin.
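A minimal sketch of SPKI pinning in Go, assuming the pin is compared against the leaf certificate only. The spkiPinned and verifyPins helpers are hypothetical names, not the agent's actual functions.

```go
package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/hex"
	"errors"
	"strings"
)

// spkiPinned reports whether the SHA-256 hash of a DER-encoded
// SubjectPublicKeyInfo matches one of the comma-separated hex pins.
func spkiPinned(spki []byte, pins string) bool {
	sum := sha256.Sum256(spki)
	got := hex.EncodeToString(sum[:])
	for _, pin := range strings.Split(pins, ",") {
		if strings.EqualFold(strings.TrimSpace(pin), got) {
			return true
		}
	}
	return false
}

// verifyPins returns a callback with the signature expected by
// tls.Config.VerifyPeerCertificate. It checks the leaf certificate only.
func verifyPins(pins string) func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		if len(rawCerts) == 0 {
			return errors.New("server presented no certificate")
		}
		leaf, err := x509.ParseCertificate(rawCerts[0])
		if err != nil {
			return err
		}
		if !spkiPinned(leaf.RawSubjectPublicKeyInfo, pins) {
			return errors.New("server certificate does not match any pinned SPKI hash")
		}
		return nil
	}
}
```

Setting the same callback on the tls.Config used by both the HTTP transport and the WebSocket dialer gives one enforcement point for both connection types.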
Terminal WebSocket agent connection requires a per-session token — the server issues a one-time agent_token with each terminal session command. The agent sends it as X-Agent-Token header when connecting; the server rejects connections with a missing or incorrect token.
tcpdump interface name validated against system allowlist — runCapture now validates the interface name with a strict regex and confirms it exists on the host before invoking tcpdump, preventing command-injection via crafted interface names.
Packet captures written to private directory — captures are now stored in /var/lib/watchgrid/captures/capture.pcap (directory mode 0700, file mode 0600) instead of world-readable /tmp/capture.pcap.
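The interface-name validation described in the tcpdump entry above might be sketched as follows. The regex is an assumption (the changelog does not give the real pattern); it allows up to 15 characters, matching the Linux interface-name limit, and requires an alphanumeric first character so a name can never be read as a command-line flag.

```go
package main

import (
	"fmt"
	"net"
	"regexp"
)

// ifaceNameRE is a hypothetical strict pattern for interface names.
var ifaceNameRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]{0,14}$`)

// validInterfaceName reports whether name passes the syntactic check.
func validInterfaceName(name string) bool {
	return ifaceNameRE.MatchString(name)
}

// checkInterface applies the syntactic check and then confirms that the
// interface exists on the host.
func checkInterface(name string) error {
	if !validInterfaceName(name) {
		return fmt.Errorf("invalid interface name %q", name)
	}
	if _, err := net.InterfaceByName(name); err != nil {
		return fmt.Errorf("interface %q not found: %w", name, err)
	}
	return nil
}
```

Passing the validated name as a separate argument to exec.Command, rather than through a shell string, keeps the two defences independent.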

[1.26.1] - 2026-04-21

Security

K8s API TLS verification restored — the server no longer disables certificate verification when routing K8s API calls through the WireGuard tunnel. The cluster CA embedded in the kubeconfig is now used with ServerName set to the original kubeconfig hostname, preserving cert validation while routing via the tunnel IP.
JWT secret minimum length enforced — the server now rejects startup if JWT_SECRET is shorter than 32 characters, preventing weak secrets that could be brute-forced to forge tokens.
JWT removed from service proxy URL — the K8s service proxy no longer appends the auth token as a URL query parameter (exposed in browser history and server logs). The httpOnly session cookie is used instead.
K8s service proxy restricted to Watchgrid server — the cluster-agent's port 8081 proxy now only accepts connections from the WireGuard gateway IP (100.64.1.254), blocking other WireGuard peers from reaching internal cluster services.
K8s proxy and registry DNS gateway IP derived per-tenant — the cluster-agent previously hardcoded 100.64.1.254 as the allowed gateway IP and DNS server, which broke multi-tenant deployments where the gateway is a different IP (e.g. 100.64.2.254). The gateway is now derived from the cluster-agent's own tunnel IP (replacing the last octet with .254), matching the per-tenant subnet convention. The WATCHGRID_GATEWAY_IP env var can still override this for non-standard deployments.
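The derivation rule can be sketched with net/netip. The gatewayFor helper is hypothetical; the override parameter stands in for the WATCHGRID_GATEWAY_IP value.

```go
package main

import (
	"fmt"
	"net/netip"
)

// gatewayFor derives the gateway address from the agent's own tunnel IP by
// replacing the last octet with 254. A non-empty override (the value of
// WATCHGRID_GATEWAY_IP) takes precedence.
func gatewayFor(tunnelIP, override string) (string, error) {
	if override != "" {
		addr, err := netip.ParseAddr(override)
		if err != nil {
			return "", fmt.Errorf("invalid gateway override: %w", err)
		}
		return addr.String(), nil
	}
	addr, err := netip.ParseAddr(tunnelIP)
	if err != nil {
		return "", fmt.Errorf("invalid tunnel IP: %w", err)
	}
	if !addr.Is4() {
		return "", fmt.Errorf("expected an IPv4 tunnel address, got %s", tunnelIP)
	}
	octets := addr.As4()
	octets[3] = 254
	return netip.AddrFrom4(octets).String(), nil
}
```

For example, an agent at 100.64.2.17 derives 100.64.2.254, which is the multi-tenant case that the old hardcoded 100.64.1.254 got wrong.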
System namespaces protected from destructive K8s commands — handleDeploy, handleDelete, handleRestart, handleScale, and handleK8sDeploy now reject any operation targeting kube-system, kube-public, kube-node-lease, or watchgrid-system.
OIDC issuer URL validated against SSRF — the server now resolves the OIDC issuer hostname before fetching the discovery document and rejects URLs that resolve to loopback, private, or link-local addresses, and requires HTTPS.
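A sketch of the address check, using net/netip classification methods. Both function names are hypothetical, and the resolved slice is assumed to come from a prior lookup such as net.LookupHost.

```go
package main

import (
	"errors"
	"net/netip"
	"net/url"
)

// isBlockedIP reports whether an address is loopback, private, or
// link-local. Unparseable input is treated as blocked (fail closed).
func isBlockedIP(s string) bool {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return true
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d like a.b.c.d
	return addr.IsLoopback() || addr.IsPrivate() ||
		addr.IsLinkLocalUnicast() || addr.IsLinkLocalMulticast()
}

// validateIssuer requires HTTPS and rejects the issuer if any resolved
// address is blocked.
func validateIssuer(rawURL string, resolved []string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	if u.Scheme != "https" {
		return errors.New("issuer URL must use https")
	}
	if len(resolved) == 0 {
		return errors.New("issuer hostname did not resolve")
	}
	for _, ip := range resolved {
		if isBlockedIP(ip) {
			return errors.New("issuer resolves to a blocked address: " + ip)
		}
	}
	return nil
}
```

Rejecting the issuer when any one of its addresses is blocked, rather than only when all are, avoids a hostname that mixes a public record with an internal one.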
Firewall rule scopeID validated against tenant — when creating a firewall rule scoped to a site or device, the server now verifies that the resource belongs to the authenticated user's tenant, preventing cross-tenant rule injection.

[1.26.0] - 2026-04-20

Fixed

Dashboard map now visible on fresh tenants — the map is shown when the control plane server has a location set, even if no devices have registered yet
App Store repository sync crash — clicking Sync in the Repository tab threw "is not a function"; the onRepoChange callback was missing from the RepoManager component call

Security

Agent self-update now verifies SHA-256 checksum before installing the downloaded binary — a compromised or tampered binary is rejected before it can replace the running agent. The build script generates .sha256 files alongside each architecture binary; the server exposes them at /downloads/watchgrid-agent-{arch}.sha256.
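The checksum comparison might look like the sketch below. The checksumMatches helper is hypothetical, and the .sha256 file is assumed to follow the common sha256sum layout (hex digest, then whitespace and a filename), which the entry above does not specify.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"strings"
)

// checksumMatches hashes the downloaded binary and compares the result with
// the digest in the first whitespace-separated field of the .sha256 file.
func checksumMatches(binary []byte, sha256File string) bool {
	fields := strings.Fields(sha256File)
	if len(fields) == 0 {
		return false
	}
	want, err := hex.DecodeString(fields[0])
	if err != nil || len(want) != sha256.Size {
		return false
	}
	got := sha256.Sum256(binary)
	return subtle.ConstantTimeCompare(got[:], want) == 1
}
```

A checksum fetched from the same server as the binary guards against corruption and partial downloads; protection against a compromised download host needs a signature check on top.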
Removed insecure_skip_verify: true from K3s registry config — the containerd registry configuration no longer disables TLS certificate verification. Plain HTTP endpoints (sufficient for the WireGuard-encrypted tunnel) are used directly, eliminating the unnecessary TLS bypass.
Shell command parameters no longer logged — the agent debug log now redacts the full command Params field to prevent credentials or secrets embedded in shell commands from appearing in system logs.

[1.24.0] - 2026-04-12

Added

OIDC single sign-on — configurable Login with SSO button on the login page; supports Microsoft Entra ID and any OpenID Connect provider
Super-admin SSO settings panel in System → Users for configuring issuer, client ID/secret, button text, claim mapping, default tenant/role, and auto-provisioning behavior
Automatic OIDC user linking and provisioning with persisted auth_source metadata
System → Admin Devices — dedicated page for managing WireGuard-enabled admin workstations (moved from dashboard)
System → Pending Approvals — dedicated page with full approve/deny/profiles workflow (moved from dashboard)
Multi-level firewall rule management — create allow/deny rules at tenant, site, or device scope, enforced as iptables entries in the WireGuard mesh
Firewall rules support protocol (tcp/udp/icmp/any), source/destination IP or CIDR, port or port range, direction (inbound/outbound/both), priority, and enable/disable toggle
System → Firewall page with scope tabs, rules table, and create/edit modal
REST API for firewall rules: GET/POST /api/firewall/rules, PUT/DELETE /api/firewall/rules/{id}, POST /api/firewall/rules/{id}/toggle
Location tab on device and cluster detail panels — set name, latitude, longitude, and location lock directly from the Sites workspace
Raspberry Pi telemetry (CPU temperature, core voltage, SDRAM voltage) shown in the Sites device info panel under a dedicated Pi Telemetry section
Devices without a location now appear on the dashboard map as a gray ? marker at a deterministic placeholder position

Fixed

Persistence manager was never initialized, causing OIDC settings save to fail with "Persistence is not initialized" — now initialized at startup using /etc/watchgrid as config dir
OIDC redirect_uri was built with http:// instead of https:// when running behind a reverse proxy without X-Forwarded-Proto — added WATCHGRID_EXTERNAL_URL env var to explicitly set the base URL
Firewall rule direction: both now correctly creates iptables entries for both src→dst and dst→src; previously two separate rules were required for bidirectional traffic

Changed

Dashboard decluttered: device list and services section removed; devices are managed from the Sites workspace. Dashboard now shows map, license warnings, and pending approvals badge only
Dashboard map expanded to fill available viewport height
Pending approvals section on dashboard replaced with a compact orange badge linking to the dedicated approvals page
User management now displays whether an account is local or oidc
Auth configuration can be stored in persisted server state; environment variables remain as fallbacks
Tenant peer allowlist removed — replaced by tenant-scope firewall rules
Tenant firewall modal now shows only the peer-to-peer toggle (master open/isolated switch)
API documentation switched from Swagger UI to Redoc for improved readability

[1.23.0] - 2025-04-01

Added

Sites — logical groupings of devices representing physical locations or teams
  Aggregate metrics (avg CPU/memory/disk, total bandwidth) across all site devices
  One-click provisioning profile runs across entire site
  Auto-deploy: apps automatically deployed to new devices joining the site
  Site-scoped firewall rules
  REST API: GET/POST /api/sites, GET/PUT/DELETE /api/sites/{id}, assign/unassign endpoints
K3s Cluster Management — register and manage K3s clusters via cluster-agent
  Cluster provisioning generates a ready-to-run install command
  Deploy/undeploy apps from the Watchgrid catalog to any cluster
  K8s service proxy: forward HTTP requests to services running inside the cluster
  Kubernetes resource queries (pods, deployments, namespaces, logs, scale)
  Multi-architecture cluster-agent builds (amd64, arm64)
Provisioning Profiles — tag-based bash scripts that run on devices automatically
  Profiles match devices by tag overlap
  Execution tracking with per-device run history and output
  Site-level bulk profile execution
  Quick-add bundles for common setups
App Routines — schedule recurring actions on deployed apps
  Actions: start, stop, restart
  Cron-based scheduling with per-routine timezone support
  Manual trigger (run now) outside of schedule
App Repositories — add external Git or Helm repositories as app sources
  Git: public or private repos, configurable branch
  Helm: chart repositories
  Manual sync trigger
Onboarding tokens — provision devices to specific tenants using --token flag
Device re-registration preserves site assignment and WireGuard stats
Real client IP tracked in audit log (bypasses Docker internal proxy)
In-cluster registry access via localhost:5000 hostPort and registry proxy sidecar
K8s hostNetwork on cluster-agent to prevent localhost registry port conflicts
Auto-site-lock: site assignment is locked once set, preventing accidental reassignment

Changed

Unified frontend layout and typography across all pages
Cluster app management moved from AppManager to dedicated Clusters UI with tabbed interface

[1.22.0] - 2025-04-01

Added

Per-device app configuration system — configure app settings (strings, secrets, booleans) through the web UI
Automatic config substitution on deployment — values injected into K8s manifests at deploy time
Configuration persistence across redeployments

[1.21.0] - 2025-03-01

Added

Provisioning profiles — tag-based scripts that run automatically on device registration
App metadata system — define configurable fields in app manifests

Fixed

WireGuard peer cleanup on device deletion

[1.20.0] - 2025-02-01

Added

Audit log — tracks all administrative actions with user, timestamp, and detail
Multi-tenancy firewall policies — per-tenant WireGuard ACLs

Changed

Server module split begun — main.go decomposed into auth.go, database.go, middleware.go, and domain modules

[1.19.0] - 2025-01-01

Added

SSH Certificate Authority — server-signed short-lived user certs (24h) and host certs (365d)
./test-ssh-ca.sh validation script

Fixed

Magic DNS resolution timing on fresh device registration

[1.18.0] - 2024-12-01

Added

Two-factor authentication (TOTP) — HMAC-SHA1, ±30s window, custom base32 implementation
K3s cluster-agent for external Kubernetes cluster registration
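For the TOTP entry above, the HMAC-SHA1 computation follows RFC 6238 and can be sketched as below. This is not the Watchgrid implementation: the function names are hypothetical, base32 decoding of the shared secret is left out, and the ±30s window is interpreted here as accepting one 30-second step on either side of the current one.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"crypto/subtle"
	"encoding/binary"
	"fmt"
)

// stepSeconds is the TOTP time step.
const stepSeconds = 30

// hotp computes a 6-digit HOTP code (RFC 4226) for a counter value.
func hotp(secret []byte, counter uint64) string {
	var msg [8]byte
	binary.BigEndian.PutUint64(msg[:], counter)
	mac := hmac.New(sha1.New, secret)
	mac.Write(msg[:])
	sum := mac.Sum(nil)
	// Dynamic truncation: the low nibble of the last byte selects the offset.
	off := sum[len(sum)-1] & 0x0f
	code := binary.BigEndian.Uint32(sum[off:off+4]) & 0x7fffffff
	return fmt.Sprintf("%06d", code%1000000)
}

// totpAt computes the TOTP code for a Unix timestamp.
func totpAt(secret []byte, unix int64) string {
	return hotp(secret, uint64(unix/stepSeconds))
}

// validate accepts a code from the current step or either neighbour,
// comparing in constant time.
func validate(secret []byte, code string, unix int64) bool {
	counter := unix / stepSeconds
	match := 0
	for _, c := range []int64{counter - 1, counter, counter + 1} {
		if c < 0 {
			continue
		}
		match |= subtle.ConstantTimeCompare([]byte(hotp(secret, uint64(c))), []byte(code))
	}
	return match == 1
}
```

With the RFC 6238 test secret 12345678901234567890 and timestamp 59, totpAt returns 287082, the last six digits of the RFC's eight-digit vector 94287082.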

Changed

WireGuard subnet expanded to 100.64.0.0/10 (RFC 6598) for multi-tenant scalability

[1.17.0] - 2024-11-01

Added

Private Docker registry built into the stack — accessible at registry.wg:5000 over VPN
Registry authentication proxy through server API

Fixed

Agent reconnection after server restart

[1.16.0] - 2024-10-01

Added

Web terminal — WebSocket-based shell access to devices and K8s pods via @xterm/xterm
Real-time dashboard WebSocket feed for device status

Changed

Frontend migrated to React 18 + Vite + TailwindCSS