Production Runbook

This is the operator reference for running Watchgrid in production — from first install through upgrades, backups, and incident response.

See docs/production.md for the one-time setup guide and docs/ssh-ca.md for the SSH-CA-specific runbook.

1. First-install checklist

Before running docker compose up -d against a new customer environment:

[ ] DNS A record for FRONTEND_HOST points at the server's public IP.
[ ] Firewall allows :80/tcp (ACME), :443/tcp (dashboard + API), :51820/udp (WireGuard), :53/udp + :53/tcp (Magic DNS — only if you expose the DNS publicly; most deployments keep it internal).
[ ] .env complete — every variable from the Required table in docs/production.md has a non-placeholder value.
[ ] JWT_SECRET is at least 32 characters long. The output of openssl rand -hex 32 (64 hex characters) satisfies this.
[ ] ADMIN_PASSWORD meets the policy (at least 8 chars, with upper, lower, digit, and special characters, and not on the blocklist). The startup check accepts weak env-bootstrapped values so that legacy deployments are not locked out, but new passwords set via the UI must comply.
[ ] Postgres TLS cert either already exists in the postgres_data volume (for an upgrade) or will be generated by scripts/postgres-ssl-init.sh on first boot (for a fresh install).
[ ] SSH-CA key backups are configured (scripts/backup-ssh-ca.sh + systemd timer). See docs/ssh-ca.md#backup--restore.
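
The JWT_SECRET and ADMIN_PASSWORD items above can be pre-checked before the first start. The following is a minimal sketch, not a Watchgrid tool: the function names env_value and check_env are hypothetical, the script assumes plain unquoted KEY=value lines in .env, it reads the "8 chars" rule as a minimum length, and it knows nothing about the password blocklist.

```shell
# Hypothetical pre-flight check; assumes plain KEY=value lines in the env file.
env_value() {
  # Print the value of key $1 from file $2 (last assignment wins).
  sed -n "s/^$1=//p" "$2" | tail -n 1
}

check_env() {
  file=$1
  rc=0
  jwt=$(env_value JWT_SECRET "$file")
  pw=$(env_value ADMIN_PASSWORD "$file")

  # JWT_SECRET: at least 32 characters.
  if [ "${#jwt}" -lt 32 ]; then
    echo "JWT_SECRET is shorter than 32 characters"
    rc=1
  fi

  # ADMIN_PASSWORD: minimum length plus one character from each class.
  if [ "${#pw}" -lt 8 ]; then
    echo "ADMIN_PASSWORD is shorter than 8 characters"
    rc=1
  fi
  case $pw in *[[:upper:]]*) ;; *) echo "ADMIN_PASSWORD lacks an uppercase letter"; rc=1 ;; esac
  case $pw in *[[:lower:]]*) ;; *) echo "ADMIN_PASSWORD lacks a lowercase letter"; rc=1 ;; esac
  case $pw in *[[:digit:]]*) ;; *) echo "ADMIN_PASSWORD lacks a digit"; rc=1 ;; esac
  case $pw in *[![:alnum:]]*) ;; *) echo "ADMIN_PASSWORD lacks a special character"; rc=1 ;; esac

  return "$rc"
}

# Example (not executed here):
# check_env /opt/watchgrid/.env && echo "env looks sane"
```

A passing result only means the two values have the right shape; the remaining checklist items still need a manual look.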

Post-start smoke checks:

# HTTP health (expect 200 with {"status":"ok", ...})
curl -fsS https://FRONTEND_HOST/healthz

# Readiness (expect same shape with migrations: ok)
curl -fsS https://FRONTEND_HOST/readyz

# Version
curl -fsS https://FRONTEND_HOST/api/version

If /healthz is green but /readyz is 503 with migrations: pending, migrations are still running — tail docker compose logs server and wait.
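
The "wait" step can be scripted rather than watched by hand. A minimal sketch, assuming a generic retry helper (retry_until is a hypothetical name, not part of Watchgrid) that reruns a probe command until it succeeds or the attempts run out:

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (not executed here): poll /readyz every 5 s for up to 5 minutes.
# curl -f makes HTTP 503 a failure, so the loop keeps going until 200.
# retry_until 60 5 curl -fsS -o /dev/null "https://FRONTEND_HOST/readyz"
```

The attempt count and delay are illustrative; size them to how long migrations usually take in the environment.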

2. Backups

What to back up

Source	Contents	Frequency	Retention
Postgres	Devices, users, audit, cluster-command queue, tenants, DNS records	Every 6 h	30 days online + offsite
`watchgrid-ssh-ca` volume	Per-tenant SSH CA keys (user + host)	On every change + weekly	Forever (compliance)
`watchgrid-wireguard` volume	Server WG private key + config	Weekly	30 days
`.env`	Secrets — keep in a password manager or sealed-secrets, not alongside the Postgres dump	On every change	Forever

Scripted Postgres dump

# Full dump in custom format (supports parallel restore)
DUMP=/backup/watchgrid-$(date -u +%Y%m%dT%H%M%SZ).dump
docker exec watchgrid-postgres pg_dump -U watchgrid -Fc -Z9 watchgrid > "$DUMP"

# Verify the dump is readable (pg_restore reads one archive, so pass the file, not a glob)
pg_restore --list "$DUMP" > /dev/null

Ship the dump off-host within 1 hour of creation (rsync / S3 / Backblaze / customer's object storage of choice). Encrypt in transit and at rest.
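
The 30-day online retention from the table above implies pruning old local dumps. A minimal sketch, assuming the /backup layout and watchgrid-*.dump naming used in this section; prune_dumps is a hypothetical helper, and it should run only after the off-host copy has been confirmed:

```shell
# Hypothetical helper: delete local dumps older than N days (default 30).
# Usage: prune_dumps <directory> [days]
prune_dumps() {
  dir=$1
  days=${2:-30}
  # -mtime +N matches files modified more than N full days ago.
  # -maxdepth and -delete are GNU/BSD find extensions.
  find "$dir" -maxdepth 1 -type f -name 'watchgrid-*.dump' \
    -mtime +"$days" -print -delete
}

# Example (not executed here):
# prune_dumps /backup 30
```

The function prints each file it removes, which gives the cron or timer log a record of what was pruned.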

SSH-CA

scripts/backup-ssh-ca.sh creates AES-256-CBC encrypted tarballs of /etc/watchgrid/ca_*. See docs/ssh-ca.md#backup--restore for the systemd timer config. RTO target: 15 minutes. Full restore procedure is scripted.

WireGuard key

The WireGuard private key lives in /etc/wireguard/wg0.conf inside the server volume. Losing it forces every agent to re-register, because agents pin the server public key and will reject a new one. Back up at least weekly, matching the schedule in the table above:

docker exec watchgrid-server cat /etc/wireguard/wg0.conf | gpg -c > wg0-$(date +%Y%m%d).conf.gpg

Verification drill (quarterly)

Spin up a scratch compose stack on a separate host.
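Restore the latest Postgres dump: pg_restore -d watchgrid "$(ls -t /backup/watchgrid-*.dump | head -n 1)" (pg_restore accepts a single archive, so the glob must be resolved to one file).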
Copy the ssh-ca + wireguard volumes over.
Start the stack. Confirm:
  docker compose logs server shows All migrations completed and no errors.
  curl /api/version returns the expected version.
  curl /api/devices (authenticated) returns the expected device count.
  One agent from the real environment can reconnect (DNS point or /etc/hosts override).
Record the drill date + result in the operations log.

3. Upgrades

Standard procedure (the same commands apply after any change to docker-compose.prod.yml):

cd /opt/watchgrid
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d
docker compose -f docker-compose.prod.yml logs -f server

Pre-upgrade checklist

[ ] Take a fresh Postgres dump and ship it off-host (see §2).
[ ] Read the CHANGELOG entries between the current deployed version and target version. Flag any migration notes, breaking env-var changes, or "operator action required" lines.
[ ] Pin the target tag rather than :latest in .env so a second pull doesn't pick up a newer, untested build: VERSION=1.26.2.
[ ] Verify healthy state first: curl /readyz should return 200.

Post-upgrade verification

[ ] curl /api/version returns the target version string.
[ ] curl /healthz returns 200 with all checks ok.
[ ] curl /readyz returns 200 with migrations: ok.
[ ] Pick one real agent and confirm its next heartbeat lands — docker compose logs server | grep Heartbeat.from.device.
[ ] Log in to the dashboard; the License page shows the expected device count.
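
The version item in the checklist above can be asserted in a script instead of read by eye. A minimal sketch: version_matches is a hypothetical helper, and it assumes only that the target version string appears verbatim somewhere in the /api/version response, since the exact response shape is not specified here.

```shell
# Hypothetical check: succeed only if the response body contains the target
# version as a whole word, so that 1.26.2 does not match 1.26.20.
# Usage: version_matches <target-version> <response-body>
version_matches() {
  target=$1
  body=$2
  printf '%s\n' "$body" | grep -qFw -- "$target"
}

# Example (not executed here):
# version_matches "1.26.2" "$(curl -fsS https://FRONTEND_HOST/api/version)" \
#   || echo "version mismatch after upgrade"
```

Because of the whole-word match, a response that prefixes the version with a letter (for example v1.26.2) would need the same prefix in the target argument.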

Rollback

If post-upgrade verification fails and the issue is not a trivial config fix:

Stop the upgraded stack: docker compose -f docker-compose.prod.yml down.

Restore the pre-upgrade Postgres dump:

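docker compose -f docker-compose.prod.yml up -d postgres
sleep 10
# Connect to the postgres maintenance DB: the open database cannot be dropped,
# and DROP DATABASE cannot run inside a multi-statement -c string.
docker exec -i watchgrid-postgres psql -U watchgrid -d postgres \
  -c "DROP DATABASE watchgrid;" -c "CREATE DATABASE watchgrid;"
docker exec -i watchgrid-postgres pg_restore -U watchgrid -d watchgrid < /backup/watchgrid-pre-upgrade.dump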

Set VERSION=<previous-tag> in .env.
docker compose -f docker-compose.prod.yml pull server frontend.
docker compose -f docker-compose.prod.yml up -d server frontend.
Rerun post-upgrade verification against the previous version.

Important constraint: Postgres migrations are forward-only. Rolling back the container without restoring the DB dump may succeed if the newer migrations were strictly additive, but don't rely on it — always restore the dump.

4. Incident response

Alert → action playbook

Alert	Likely cause	First action
`/healthz` returns 503, DB check failed	Postgres container crash-looping, disk full	`docker compose logs postgres` → check for `FATAL: the database system is shutting down` or disk errors. `df -h` on the host.
`/healthz` returns 503, WireGuard check failed	`wg0` interface gone after a host reboot, or the container lacks the `NET_ADMIN` capability	`docker compose restart server`. Verify the host has `net.ipv4.ip_forward=1` and the container has `NET_ADMIN`.
`/healthz` returns 503, SSH-CA check failed	`watchgrid-ssh-ca` volume not mounted or wiped	Restore from `scripts/backup-ssh-ca.sh` output — see `docs/ssh-ca.md#backup--restore`.
Agents stopped heartbeating	Server crash, network partition, WG key mismatch	Check server `/healthz`. If healthy, check `docker exec watchgrid-server wg show wg0` for peer handshakes.
Login failures spike on the Grafana dashboard	Credential-stuffing attack	Check `docker compose logs server \| grep -i "invalid credentials"` for IP patterns. Traefik `ratelimit` middleware should already be dropping the worst offenders. If sustained, add an IP block at the host firewall level.
Rate-limit rejections spike for `registration`	A new batch of devices onboarding, or an attacker probing for onboarding tokens	Confirm with the customer before tuning. If legitimate, raise `WATCHGRID_RATELIMIT_REGISTRATION_CAPACITY` env temporarily.
Postgres CPU at 100%	Long-running migration, or N+1 query regression	`docker exec watchgrid-postgres psql -U watchgrid -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state <> 'idle';"` — kill runaway queries with `SELECT pg_terminate_backend(<pid>)`.
Disk usage climbing on the Postgres volume	Runaway audit log: retention sweeper not keeping up, or a pathological agent	Check `WATCHGRID_AUDIT_RETENTION_DAYS`; if set high, lower it temporarily. Manually `DELETE FROM admin_audit_log WHERE timestamp < now() - interval '30 days';` + `VACUUM FULL admin_audit_log`. Note that `VACUUM FULL` takes an exclusive lock on the table and needs enough free disk for a full copy of it while it runs.
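
For the disk-related rows in the table above, the df -h check can be turned into a threshold test. A minimal sketch: disk_over_threshold is a hypothetical helper that reads POSIX-format df -P output on stdin, and the 85% threshold in the example is an illustrative value, not a Watchgrid default. It assumes mount points contain no spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every mount
# point whose usage is at or above the given percentage.
# Usage: df -P | disk_over_threshold <percent>
disk_over_threshold() {
  limit=$1
  awk -v limit="$limit" '
    NR > 1 {                 # skip the header line
      use = $5               # Capacity column, e.g. "91%"
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6, use "%"
    }
  '
}

# Example (not executed here):
# df -P | disk_over_threshold 85
```

An empty result means no filesystem is at or above the threshold.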

Collecting evidence for support

# Server + frontend logs from the last hour
docker compose -f docker-compose.prod.yml logs --since 1h server frontend \
  > /tmp/support-bundle-logs.txt

# Container state + versions
docker compose -f docker-compose.prod.yml ps > /tmp/support-bundle-state.txt
curl -fsS https://FRONTEND_HOST/api/version >> /tmp/support-bundle-state.txt

# Recent audit log (last 500 rows, redacted)
docker exec watchgrid-postgres psql -U watchgrid -c \
  "SELECT timestamp, admin_user, action, resource_type, success FROM admin_audit_log ORDER BY timestamp DESC LIMIT 500" \
  > /tmp/support-bundle-audit.txt

tar czf /tmp/support-bundle-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/support-bundle-*.txt

Ship the bundle to Watchgrid support — do not include .env, Postgres dumps, or the SSH-CA keys.
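
Before shipping, the bundle files can be scanned for material that should not leave the host. A minimal sketch: bundle_is_clean is a hypothetical helper, and the patterns are illustrative rather than exhaustive, so a clean result is not proof that no secret is present.

```shell
# Hypothetical pre-ship check: fail if any given file looks like it
# contains secrets. grep -l prints the names of the matching files.
# Usage: bundle_is_clean <file> [file...]
bundle_is_clean() {
  if grep -lE -- 'JWT_SECRET=|ADMIN_PASSWORD=|PRIVATE KEY-----' "$@"; then
    echo "possible secrets found in the files listed above" >&2
    return 1
  fi
  return 0
}

# Example (not executed here): scan the text files before creating the tarball.
# bundle_is_clean /tmp/support-bundle-*.txt || echo "review before shipping"
```

Running the scan on the .txt files before the tar step keeps a flagged file out of the archive altogether.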

5. Leader election (multi-replica)

K8s deployments run 3 server replicas behind leader election. Only one replica drives WireGuard, DNS, and the command-queue worker at any time — the others are hot standbys.

Verification

Once per cluster after deploy (and on any server upgrade):

kubectl -n watchgrid get lease watchgrid-server-leader -o json \
  | jq '{holder: .spec.holderIdentity, acquired: .spec.acquireTime, renewed: .spec.renewTime}'

Expected: holder is one specific pod name; renewed ticks forward every ~15 s.

Drain the current leader to confirm failover:

LEADER=$(kubectl -n watchgrid get lease watchgrid-server-leader -o jsonpath='{.spec.holderIdentity}')
kubectl -n watchgrid delete pod "$LEADER"
# Within ~30 s, the lease should flip to a different replica.
kubectl -n watchgrid get lease watchgrid-server-leader -w

Document the failover time in the operations log. Target: new leader in ≤ 30 s, WireGuard handshakes resume within 60 s.
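
The failover time for the operations log can be measured rather than estimated from the watch output. A minimal sketch: time_failover is a hypothetical helper that polls any command printing the current lease holder, so the kubectl call stays outside the function. Timing has one-second granularity and starts when polling starts.

```shell
# Hypothetical helper: poll a command that prints the current lease holder
# and report how long it takes for the holder to differ from the old one.
# Usage: time_failover <old-holder> <timeout-seconds> <command> [args...]
time_failover() {
  old=$1
  timeout=$2
  shift 2
  start=$(date +%s)
  while :; do
    now=$(date +%s)
    elapsed=$((now - start))
    holder=$("$@" 2>/dev/null)
    if [ -n "$holder" ] && [ "$holder" != "$old" ]; then
      echo "new leader $holder after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "no failover within ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# Example (not executed here), using the lease from this section:
# get_holder() {
#   kubectl -n watchgrid get lease watchgrid-server-leader \
#     -o jsonpath='{.spec.holderIdentity}'
# }
# LEADER=$(get_holder)
# kubectl -n watchgrid delete pod "$LEADER" --wait=false
# time_failover "$LEADER" 60 get_holder
```

The example passes --wait=false so that kubectl returns immediately and the timer covers the whole handover.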

6. Observability

Metrics

/metrics on port 8080 exposes Prometheus-format metrics. Expect a Prometheus scrape every 30 s from inside the cluster. Key series:

watchgrid_http_requests_total{route, method, status} — for error rate and RED-method dashboards.
watchgrid_http_request_duration_seconds — latency histograms.
watchgrid_agent_heartbeats_total — compared against the expected device count for fleet health.
watchgrid_login_failures_total{reason} — spike on this drives the "credential stuffing" alert.
watchgrid_rate_limit_rejections_total{limiter} — distinguishes organic bursts from attacks.
watchgrid_wireguard_peers — matches len(devices) when the WG reconcile loop is healthy.
watchgrid_db_open_connections / watchgrid_db_in_use_connections — pool saturation signal.
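
For a quick spot check without a dashboard, a series from the list above can be read straight from the /metrics output. A minimal sketch: metric_sum is a hypothetical helper that sums all samples of one metric in Prometheus text format, and it assumes label values contain no spaces. The localhost address in the example is also an assumption.

```shell
# Hypothetical helper: sum every sample of one metric from Prometheus
# text-format input on stdin. Exits non-zero if the metric is absent.
# Usage: metric_sum <metric-name>
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                  # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)     # drop the label set, if any
      if (metric == name) { sum += $2; found = 1 }
    }
    END { if (found) print sum; else exit 1 }
  '
}

# Example (not executed here): compare the peer gauge with the device count.
# curl -fsS http://localhost:8080/metrics | metric_sum watchgrid_wireguard_peers
```

For a labelled counter such as watchgrid_login_failures_total, the result is the total across all label values.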

See docs/production.md#security-scanning--cve-response for how these tie into the SLA.

Never expose `/metrics` publicly

Traefik in docker-compose.prod.yml does not route external traffic to /metrics. In k8s, NetworkPolicy (k8s/07-policies.yaml) blocks ingress from outside the cluster. If you add a new ingress host, remember to add a path rule that drops /metrics at the proxy.