Production Runbook
This is the operator reference for running Watchgrid in production — from first install through upgrades, backups, and incident response.
See docs/production.md for the one-time setup guide and docs/ssh-ca.md for the SSH-CA-specific runbook.
1. First-install checklist
Before docker compose up -d against a new customer environment:
- [ ] DNS `A` record for `FRONTEND_HOST` points at the server's public IP.
- [ ] Firewall allows :80/tcp (ACME), :443/tcp (dashboard + API), :51820/udp (WireGuard), :53/udp + :53/tcp (Magic DNS; only if you expose the DNS publicly, most deployments keep it internal).
- [ ] `.env` complete: every variable from the Required table in `docs/production.md` has a non-placeholder value.
- [ ] `JWT_SECRET` is at least 32 chars of `openssl rand -hex 32` output.
- [ ] `ADMIN_PASSWORD` meets the policy (8 chars, upper/lower/digit/special, not on the blocklist). The startup check accepts weak env-bootstrapped values to avoid locking legacy deployments out, but new passwords set via the UI must comply.
- [ ] Postgres TLS cert either already exists in the `postgres_data` volume (for an upgrade) or will be generated by `scripts/postgres-ssl-init.sh` on first boot (for a fresh install).
- [ ] SSH-CA key backups are configured (`scripts/backup-ssh-ca.sh` + systemd timer). See `docs/ssh-ca.md#backup--restore`.
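The `.env` items in the checklist above can be checked mechanically before first boot. The following is a minimal sketch, assuming `.env` uses plain `KEY=value` lines. The `env_value` and `check_env` helpers and the placeholder patterns (`changeme`, `CHANGE_ME`, `<...>`) are illustrative assumptions, not part of Watchgrid.

```shell
# Hypothetical pre-flight helpers; not shipped with Watchgrid.

# Print the value of KEY from a KEY=value file (last match wins, quotes stripped).
env_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1 | tr -d "\"'"
}

# Fail if JWT_SECRET is too short or any variable is empty or a placeholder.
check_env() {
    local env_file="$1" jwt rc=0
    jwt=$(env_value "$env_file" JWT_SECRET)
    if [ "${#jwt}" -lt 32 ]; then
        echo "JWT_SECRET is shorter than 32 chars" >&2
        rc=1
    fi
    # Placeholder patterns are assumptions; adjust them to your template.
    if grep -En '^[A-Za-z_][A-Za-z0-9_]*=(changeme|CHANGE_ME|<.*>)?$' "$env_file" >&2; then
        echo "empty or placeholder values found (lines above)" >&2
        rc=1
    fi
    return "$rc"
}
```

Run it as `check_env /opt/watchgrid/.env`; a non-zero exit means at least one item still needs attention. It does not validate `ADMIN_PASSWORD` against the password policy.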
Post-start smoke checks:
# HTTP health (expect 200 with {"status":"ok", ...})
curl -fsS https://FRONTEND_HOST/healthz
# Readiness (expect same shape with migrations: ok)
curl -fsS https://FRONTEND_HOST/readyz
# Version
curl -fsS https://FRONTEND_HOST/api/version
If /healthz is green but /readyz is 503 with migrations: pending, migrations are still running — tail docker compose logs server and wait.
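Waiting for migrations can be scripted rather than watched by hand. A minimal sketch, assuming only the `/readyz` behaviour described above (non-2xx while migrations are pending); the `wait_ready` name, the attempt count, and the delay are illustrative choices.

```shell
# Hypothetical helper: poll /readyz until it returns 2xx or attempts run out.
# Usage: wait_ready FRONTEND_HOST [attempts] [delay_seconds]
wait_ready() {
    local host="$1" attempts="${2:-30}" delay="${3:-10}" i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors such as 503.
        if curl -fsS -o /dev/null "https://$host/readyz" 2>/dev/null; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still not ready after $attempts attempts" >&2
    return 1
}
```

If the helper gives up, the server logs remain the place to look for the migration that is stuck.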
2. Backups
What to back up
| Source | Contents | Frequency | Retention |
|---|---|---|---|
| Postgres | Devices, users, audit, cluster-command queue, tenants, DNS records | Every 6 h | 30 days online + offsite |
| `watchgrid-ssh-ca` volume | Per-tenant SSH CA keys (user + host) | On every change + weekly | Forever (compliance) |
| `watchgrid-wireguard` volume | Server WG private key + config | Weekly | 30 days |
| `.env` | Secrets; keep in a password manager or sealed-secrets, not alongside the Postgres dump | On every change | Forever |
Scripted Postgres dump
# Full dump with custom format (supports parallel restore)
DUMP=/backup/watchgrid-$(date -u +%Y%m%dT%H%M%SZ).dump
docker exec watchgrid-postgres pg_dump -U watchgrid -Fc -Z9 watchgrid > "$DUMP"
# Verify the dump is readable (pg_restore takes a single archive, so avoid a glob)
pg_restore --list "$DUMP" > /dev/null
Ship the dump off-host within 1 hour of creation (rsync / S3 / Backblaze / customer's object storage of choice). Encrypt in transit and at rest.
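The 30-day online retention from the table above implies pruning old local dumps. A minimal sketch, assuming dumps sit directly in one directory with the `watchgrid-*.dump` naming used above and that `find` supports `-delete` (GNU and BSD do); the `prune_dumps` name is hypothetical.

```shell
# Hypothetical helper: delete local dumps older than N days (default 30).
# Only run this once the dumps have been shipped off-host.
prune_dumps() {
    local dir="$1" days="${2:-30}"
    # -mtime +N matches files last modified more than N whole days ago.
    find "$dir" -maxdepth 1 -type f -name 'watchgrid-*.dump' -mtime "+$days" -print -delete
}
```

`prune_dumps /backup` prints each file it removes, which makes the output suitable for the operations log.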
SSH-CA
scripts/backup-ssh-ca.sh creates AES-256-CBC encrypted tarballs of /etc/watchgrid/ca_*. See docs/ssh-ca.md#backup--restore for the systemd timer config. RTO target: 15 minutes. Full restore procedure is scripted.
WireGuard key
The WireGuard private key lives in /etc/wireguard/wg0.conf inside the server volume. Losing it forces every agent to re-register (they'll reject the new server pubkey via pin). Back up monthly:
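The backup command itself is not shown here. A minimal sketch, assuming the key can be read through the `watchgrid-server` container (the name used in the `wg show` command in §4) at the path given above; the `backup_wg_key` helper and the destination layout are hypothetical.

```shell
# Hypothetical helper: copy wg0.conf out of the server container.
# Container name and path come from this runbook; the destination is an assumption.
backup_wg_key() {
    local dest_dir="$1" out
    out="$dest_dir/wg0-$(date -u +%Y%m%dT%H%M%SZ).conf"
    # The subshell confines umask 077 so the copy is created owner-only (0600).
    (
        umask 077
        docker exec watchgrid-server cat /etc/wireguard/wg0.conf > "$out"
    ) || { rm -f "$out"; return 1; }
    echo "$out"
}
```

The copy contains the private key in clear text, so encrypt it before it leaves the host, as with the other backups.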
Verification drill (quarterly)
- Spin up a scratch compose stack on a separate host.
- Restore the latest Postgres dump: `pg_restore -d watchgrid /backup/watchgrid-<timestamp>.dump` (name one file; `pg_restore` does not accept several archives from a glob).
- Copy the ssh-ca + wireguard volumes over.
- Start the stack. Confirm:
  - `docker compose logs server` shows `All migrations completed` and no errors.
  - `curl /api/version` returns the expected version.
  - `curl /api/devices` (authenticated) returns the expected device count.
  - One agent from the real environment can reconnect (DNS point or `/etc/hosts` override).
- Record the drill date + result in the operations log.
3. Upgrades
Standard procedure — same as docker-compose.prod.yml updates:
cd /opt/watchgrid
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d
docker compose -f docker-compose.prod.yml logs -f server
Pre-upgrade checklist
- [ ] Take a fresh Postgres dump and ship it off-host (see §2).
- [ ] Read the CHANGELOG entries between the current deployed version and target version. Flag any migration notes, breaking env-var changes, or "operator action required" lines.
- [ ] Pin the target tag rather than `:latest` in `.env` so a second pull doesn't pick up a newer, untested build: `VERSION=1.26.2`.
- [ ] Verify healthy state first: `curl /readyz` should return 200.
Post-upgrade verification
- [ ] `curl /api/version` returns the target version string.
- [ ] `curl /healthz` returns 200 with all checks `ok`.
- [ ] `curl /readyz` returns 200 with `migrations: ok`.
- [ ] Pick one real agent and confirm its next heartbeat lands: `docker compose logs server | grep Heartbeat.from.device`.
- [ ] Log in to the dashboard; the License page shows the expected device count.
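The version check in this list is easy to automate. A minimal sketch, assuming only that the `/api/version` response body contains the version tag somewhere; the response shape is not specified in this runbook, so the hypothetical `check_version` helper does a plain substring match.

```shell
# Hypothetical helper: succeed only if /api/version mentions the target tag.
# The response shape is an unknown here, hence the substring match.
check_version() {
    local host="$1" target="$2" body
    body=$(curl -fsS "https://$host/api/version") || return 1
    case "$body" in
        *"$target"*) return 0 ;;
        *) echo "expected $target, got: $body" >&2; return 1 ;;
    esac
}
```

A substring match is deliberately loose: `1.26.2` would also match `1.26.20`, so read the printed body when the tags are close.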
Rollback
If post-upgrade verification fails and the issue is not a trivial config fix:
- Stop the upgraded stack: `docker compose -f docker-compose.prod.yml down`.
- Restore the Postgres dump taken during the pre-upgrade checklist.
- Set `VERSION=<previous-tag>` in `.env`.
- `docker compose -f docker-compose.prod.yml pull server frontend`.
- `docker compose -f docker-compose.prod.yml up -d server frontend`.
- Rerun post-upgrade verification against the previous version.
Important constraint: Postgres migrations are forward-only. Rolling back the container without restoring the DB dump may succeed if the newer migrations were strictly additive, but don't rely on it — always restore the dump.
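The restore step of the rollback is not spelled out above. A minimal sketch of one way to do it, assuming the compose service is named `postgres` (as in `docker compose logs postgres` in §4) and the container is `watchgrid-postgres` (as in the dump command in §2); the `restore_dump` helper, retry count, and delay are hypothetical.

```shell
# Hypothetical helper for the restore step; run it after the stack is down.
restore_dump() {
    local dump="$1" tries=0
    [ -r "$dump" ] || { echo "cannot read $dump" >&2; return 1; }
    # Start only the database so nothing else writes during the restore.
    docker compose -f docker-compose.prod.yml up -d postgres || return 1
    # Wait until Postgres accepts connections.
    until docker exec watchgrid-postgres pg_isready -q -U watchgrid; do
        tries=$((tries + 1))
        [ "$tries" -lt 30 ] || { echo "postgres did not become ready" >&2; return 1; }
        sleep 2
    done
    # --clean --if-exists drops the objects in the dump before recreating them.
    docker exec -i watchgrid-postgres \
        pg_restore -U watchgrid -d watchgrid --clean --if-exists < "$dump"
}
```

`--clean` only drops objects that exist in the dump, so tables created solely by the newer migrations would survive. Dropping and recreating the database before the restore is the stricter option.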
4. Incident response
Alert → action playbook
| Alert | Likely cause | First action |
|---|---|---|
| `/healthz` returns 503, DB check failed | Postgres container crash-looping, disk full | `docker compose logs postgres` → check for `FATAL: the database system is shutting down` or disk errors. `df -h` on the host. |
| `/healthz` returns 503, WireGuard check failed | `wg0` interface gone after a host reboot without the `NET_ADMIN` capability | `docker compose restart server`. Verify the host has `net.ipv4.ip_forward=1` and the container has `NET_ADMIN`. |
| `/healthz` returns 503, SSH-CA check failed | `watchgrid-ssh-ca` volume not mounted or wiped | Restore from `scripts/backup-ssh-ca.sh` output; see `docs/ssh-ca.md#backup--restore`. |
| Agents stopped heartbeating | Server crash, network partition, WG key mismatch | Check server /healthz. If healthy, check docker exec watchgrid-server wg show wg0 for peer handshakes. |
| Login failures spike on the Grafana dashboard | Credential-stuffing attack | Check `docker compose logs server \| grep -i "invalid credentials"` for IP patterns. Traefik ratelimit middleware should already be dropping the worst offenders. If sustained, add an IP block at the host firewall level. |
| Rate-limit rejections spike for registration | A new batch of devices onboarding, or an attacker probing for onboarding tokens | Confirm with the customer before tuning. If legitimate, raise `WATCHGRID_RATELIMIT_REGISTRATION_CAPACITY` env temporarily. |
| Postgres CPU at 100% | Long-running migration, or N+1 query regression | docker exec watchgrid-postgres psql -U watchgrid -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state <> 'idle';" — kill runaway queries with SELECT pg_terminate_backend(<pid>). |
| Disk usage climbing on the Postgres volume | Runaway audit log: retention sweeper not keeping up, or pathological agent | Check `WATCHGRID_AUDIT_RETENTION_DAYS`; if set high, lower it temporarily. Manually run `DELETE FROM admin_audit_log WHERE timestamp < now() - interval '30 days';` followed by `VACUUM FULL admin_audit_log;`. Note that `VACUUM FULL` takes an exclusive lock on the table and needs free disk space for the rewritten copy. |
Collecting evidence for support
# Server + frontend logs since last hour
docker compose -f docker-compose.prod.yml logs --since 1h server frontend \
> /tmp/support-bundle-logs.txt
# Container state + versions
docker compose -f docker-compose.prod.yml ps > /tmp/support-bundle-state.txt
curl -fsS https://FRONTEND_HOST/api/version >> /tmp/support-bundle-state.txt
# Recent audit log (last 500 rows, redacted)
docker exec watchgrid-postgres psql -U watchgrid -c \
"SELECT timestamp, admin_user, action, resource_type, success FROM admin_audit_log ORDER BY timestamp DESC LIMIT 500" \
> /tmp/support-bundle-audit.txt
tar czf /tmp/support-bundle-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/support-bundle-*.txt
Ship the bundle to Watchgrid support — do not include .env, Postgres dumps, or the SSH-CA keys.
5. Leader election (multi-replica)
K8s deployments run 3 server replicas behind leader election. Only one replica drives WireGuard, DNS, and the command-queue worker at any time — the others are hot standbys.
Verification
Once per cluster after deploy (and on any server upgrade):
kubectl -n watchgrid get lease watchgrid-server-leader -o json \
| jq '{holder: .spec.holderIdentity, acquired: .spec.acquireTime, renewed: .spec.renewTime}'
Expected: holder is one specific pod name; renewed ticks forward every ~15 s.
Drain the current leader to confirm failover:
LEADER=$(kubectl -n watchgrid get lease watchgrid-server-leader -o jsonpath='{.spec.holderIdentity}')
kubectl -n watchgrid delete pod "$LEADER"
# Within ~30 s, the lease should flip to a different replica.
kubectl -n watchgrid get lease watchgrid-server-leader -w
Document the failover time in the operations log. Target: new leader within 30 s, WireGuard handshakes resuming within 60 s.
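To get a number for the operations log, the lease can be polled instead of watched. A minimal sketch built on the same `kubectl get lease` call as above; the `wait_for_failover` helper, its default timeout, and its polling interval are hypothetical, and the reported time is only as precise as the interval.

```shell
# Hypothetical helper: wait until the lease holder differs from the old leader.
# Prints the new holder and the elapsed seconds; fails once the timeout passes.
wait_for_failover() {
    local old="$1" timeout="${2:-60}" interval="${3:-2}" start now holder
    start=$(date +%s)
    while :; do
        holder=$(kubectl -n watchgrid get lease watchgrid-server-leader \
            -o jsonpath='{.spec.holderIdentity}' 2>/dev/null)
        now=$(date +%s)
        if [ -n "$holder" ] && [ "$holder" != "$old" ]; then
            echo "new leader: $holder after $((now - start)) s"
            return 0
        fi
        [ $((now - start)) -lt "$timeout" ] || return 1
        sleep "$interval"
    done
}
```

Call `wait_for_failover "$LEADER"` straight after deleting the pod so the clock starts close to the moment of the drain.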
6. Observability
Metrics
/metrics on port 8080 exposes Prometheus-format metrics. Expect a Prometheus scrape every 30 s from inside the cluster. Key series:
- `watchgrid_http_requests_total{route, method, status}`: for error rate and RED-method dashboards.
- `watchgrid_http_request_duration_seconds`: latency histograms.
- `watchgrid_agent_heartbeats_total`: compared against the expected device count for fleet health.
- `watchgrid_login_failures_total{reason}`: a spike on this drives the "credential stuffing" alert.
- `watchgrid_rate_limit_rejections_total{limiter}`: distinguishes organic bursts from attacks.
- `watchgrid_wireguard_peers`: matches `len(devices)` when the WG reconcile loop is healthy.
- `watchgrid_db_open_connections` / `watchgrid_db_in_use_connections`: pool saturation signal.
See docs/production.md#security-scanning--cve-response for how these tie into the SLA.
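For a quick spot check without Prometheus, a single series can be read straight from the text output. A minimal sketch, assuming the series is exposed without labels (plausible for `watchgrid_wireguard_peers`, but not confirmed here); the `metric_value` helper is hypothetical.

```shell
# Hypothetical helper: print the value of an unlabelled series from
# Prometheus text-format input on stdin; exit non-zero if it is absent.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; found = 1; exit } END { exit !found }'
}
```

For example, `curl -fsS http://localhost:8080/metrics | metric_value watchgrid_wireguard_peers`, run from a host that can reach port 8080, gives a peer count to compare with the device count on the dashboard.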
Never expose /metrics publicly
Traefik in docker-compose.prod.yml does not route external traffic to /metrics. In k8s, NetworkPolicy (k8s/07-policies.yaml) blocks ingress from outside the cluster. If you add a new ingress host, remember to add a path rule that drops /metrics at the proxy.
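After adding an ingress host, it is worth confirming from outside the network that the path is still closed. A minimal sketch, assuming any status other than 200 counts as blocked; the `metrics_is_blocked` helper is hypothetical.

```shell
# Hypothetical check: succeed only if /metrics is NOT served on the public host.
metrics_is_blocked() {
    local host="$1" code
    # -w prints the HTTP status; "000" means no HTTP response was received.
    code=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/metrics")
    if [ "$code" = "200" ]; then
        echo "/metrics is publicly reachable on $host" >&2
        return 1
    fi
    return 0
}
```

Run it from a machine outside the cluster or host network. If the frontend answers every unknown path with a 200 page, this check reports a false alarm, so inspect the response body before acting on it.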