Production Runbook

This is the operator reference for running Watchgrid in production — from first install through upgrades, backups, and incident response.

See docs/production.md for the one-time setup guide and docs/ssh-ca.md for the SSH-CA-specific runbook.


1. First-install checklist

Before docker compose up -d against a new customer environment:

  • [ ] DNS A record for FRONTEND_HOST points at the server's public IP.
  • [ ] Firewall allows :80/tcp (ACME), :443/tcp (dashboard + API), :51820/udp (WireGuard), :53/udp + :53/tcp (Magic DNS — only if you expose the DNS publicly; most deployments keep it internal).
  • [ ] .env complete — every variable from the Required table in docs/production.md has a non-placeholder value.
  • [ ] JWT_SECRET is at least 32 characters long. openssl rand -hex 32 generates a suitable value (64 hex characters).
  • [ ] ADMIN_PASSWORD meets the policy (at least 8 chars, upper/lower/digit/special, not on the blocklist). The startup check accepts weak env-bootstrapped values to avoid locking legacy deployments out, but new passwords set via the UI must comply.
  • [ ] Postgres TLS cert either already exists in the postgres_data volume (for an upgrade) or will be generated by scripts/postgres-ssl-init.sh on first boot (for a fresh install).
  • [ ] SSH-CA key backups are configured (scripts/backup-ssh-ca.sh + systemd timer). See docs/ssh-ca.md#backup--restore.
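
The .env items above can be checked mechanically before the first docker compose up -d. The following is a minimal sketch, assuming .env holds plain unquoted KEY=value lines. The function names and the placeholder patterns (changeme, change_me, <...>) are illustrative assumptions and are not shipped with Watchgrid.

```shell
# Hypothetical preflight helpers; not part of Watchgrid.

# Print the value of KEY from an env file (last assignment wins).
env_value() {
  local key=$1 file=$2
  grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-
}

# Succeed if JWT_SECRET in the given env file is at least 32 characters long.
jwt_secret_ok() {
  local secret
  secret=$(env_value JWT_SECRET "$1")
  [ "${#secret}" -ge 32 ]
}

# List keys whose values are empty or look like placeholders.
# The placeholder patterns are assumptions; adjust them to your template.
placeholder_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" \
    | grep -Ei '^[^=]+=("")?$|=.*(changeme|change_me|<[^>]*>)' \
    | cut -d= -f1
}
```

Example: jwt_secret_ok .env || echo "JWT_SECRET is too short", then placeholder_keys .env to list anything still unset.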

Post-start smoke checks:

# HTTP health (expect 200 with {"status":"ok", ...})
curl -fsS https://FRONTEND_HOST/healthz

# Readiness (expect same shape with migrations: ok)
curl -fsS https://FRONTEND_HOST/readyz

# Version
curl -fsS https://FRONTEND_HOST/api/version

If /healthz is green but /readyz is 503 with migrations: pending, migrations are still running — tail docker compose logs server and wait.
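
Instead of re-running the readiness check by hand while migrations finish, the wait can be scripted. This is a sketch of a generic retry helper; retry_until is a hypothetical name, and the 60 attempts at 5 s intervals in the example are arbitrary values, not a documented migration time.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Don't sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll /readyz every 5 s for up to 5 minutes.
# FRONTEND_HOST is the same placeholder used in the smoke checks above.
# retry_until 60 5 curl -fsS -o /dev/null "https://FRONTEND_HOST/readyz"
```

Because curl -f exits non-zero on a 503, the loop ends as soon as /readyz returns 200.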


2. Backups

What to back up

| Source | Contents | Frequency | Retention |
|---|---|---|---|
| Postgres | Devices, users, audit, cluster-command queue, tenants, DNS records | Every 6 h | 30 days online + offsite |
| watchgrid-ssh-ca volume | Per-tenant SSH CA keys (user + host) | On every change + weekly | Forever (compliance) |
| watchgrid-wireguard volume | Server WG private key + config | Weekly | 30 days |
| .env | Secrets; keep in a password manager or sealed-secrets, not alongside the Postgres dump | On every change | Forever |

Scripted Postgres dump

# Full dump with custom format (supports parallel restore)
docker exec watchgrid-postgres pg_dump -U watchgrid -Fc -Z9 watchgrid \
  > /backup/watchgrid-$(date -u +%Y%m%dT%H%M%SZ).dump

# Verify the newest dump is readable (pg_restore takes one archive, so don't pass a glob)
LATEST=$(ls -1t /backup/watchgrid-*.dump | head -n 1)
pg_restore --list "$LATEST" > /dev/null

Ship the dump off-host within 1 hour of creation (rsync / S3 / Backblaze / customer's object storage of choice). Encrypt in transit and at rest.
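
Keeping 30 days of dumps online means older local copies have to be pruned. A minimal sketch, assuming GNU find and the watchgrid-*.dump naming used above. prune_dumps and latest_dump are hypothetical helper names. Run the prune only after the off-host copy is confirmed.

```shell
# Delete local dumps older than KEEP_DAYS (default 30) and print what was removed.
# Assumes the watchgrid-*.dump naming from the dump command above.
prune_dumps() {
  local dir=$1 keep_days=${2:-30}
  find "$dir" -maxdepth 1 -type f -name 'watchgrid-*.dump' \
    -mtime +"$keep_days" -print -delete
}

# Print the newest dump in DIR (by modification time), or fail if none exist.
latest_dump() {
  local dir=$1 newest
  newest=$(find "$dir" -maxdepth 1 -type f -name 'watchgrid-*.dump' \
    -printf '%T@ %p\n' | sort -rn | head -n 1 | cut -d' ' -f2-)
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

Example: pg_restore --list "$(latest_dump /backup)" > /dev/null && prune_dumps /backup 30.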

SSH-CA

scripts/backup-ssh-ca.sh creates AES-256-CBC encrypted tarballs of /etc/watchgrid/ca_*. See docs/ssh-ca.md#backup--restore for the systemd timer config. RTO target: 15 minutes. Full restore procedure is scripted.

WireGuard key

The WireGuard private key lives in /etc/wireguard/wg0.conf inside the server volume. Losing it forces every agent to re-register (they'll reject the new server pubkey via pin). Back up weekly, matching the schedule in the table above:

docker exec watchgrid-server cat /etc/wireguard/wg0.conf | gpg -c > wg0-$(date +%Y%m%d).conf.gpg

Verification drill (quarterly)

  1. Spin up a scratch compose stack on a separate host.
  2. Restore the latest Postgres dump: pg_restore -d watchgrid /backup/watchgrid-<timestamp>.dump (name exactly one file; pg_restore does not accept a glob that expands to several).
  3. Copy the ssh-ca + wireguard volumes over.
  4. Start the stack. Confirm:
    • docker compose logs server shows All migrations completed and no errors.
    • curl /api/version returns the expected version.
    • curl /api/devices (authenticated) returns the expected device count.
    • One agent from the real environment can reconnect (DNS point or /etc/hosts override).
  5. Record the drill date + result in the operations log.

3. Upgrades

Standard procedure, the same as for any docker-compose.prod.yml update:

cd /opt/watchgrid
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d
docker compose -f docker-compose.prod.yml logs -f server

Pre-upgrade checklist

  • [ ] Take a fresh Postgres dump and ship it off-host (see §2).
  • [ ] Read the CHANGELOG entries between the current deployed version and target version. Flag any migration notes, breaking env-var changes, or "operator action required" lines.
  • [ ] Pin the target tag rather than :latest in .env so a second pull doesn't pick up a newer, untested build: VERSION=1.26.2.
  • [ ] Verify healthy state first: curl /readyz should return 200.
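
The tag-pinning item can be enforced before the pull. A small sketch, assuming VERSION is set as a plain VERSION=<tag> line in .env; version_is_pinned is a hypothetical helper name.

```shell
# Succeed only if VERSION in the env file is set and is not "latest".
version_is_pinned() {
  local file=$1 version
  version=$(grep -E '^VERSION=' "$file" | tail -n 1 | cut -d= -f2-)
  case "$version" in
    '' | latest) return 1 ;;
    *) return 0 ;;
  esac
}
```

Example: version_is_pinned .env || { echo "Pin VERSION in .env before upgrading"; exit 1; }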

Post-upgrade verification

  • [ ] curl /api/version returns the target version string.
  • [ ] curl /healthz returns 200 with all checks ok.
  • [ ] curl /readyz returns 200 with migrations: ok.
  • [ ] Pick one real agent and confirm its next heartbeat lands — docker compose logs server | grep Heartbeat.from.device.
  • [ ] Log in to the dashboard; the License page shows the expected device count.

Rollback

If post-upgrade verification fails and the issue is not a trivial config fix:

  1. Stop the upgraded stack: docker compose -f docker-compose.prod.yml down.
  2. Restore the pre-upgrade Postgres dump:
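    docker compose -f docker-compose.prod.yml up -d postgres
    # Wait until Postgres accepts connections
    until docker exec watchgrid-postgres pg_isready -U watchgrid; do sleep 2; done
    # Connect to the default "postgres" maintenance database (assumed to exist): a database
    # cannot be dropped from a session connected to it, and DROP DATABASE cannot run inside
    # a multi-statement -c string, so each statement gets its own -c.
    docker exec -i watchgrid-postgres psql -U watchgrid -d postgres -c "DROP DATABASE watchgrid;" -c "CREATE DATABASE watchgrid;"
    docker exec -i watchgrid-postgres pg_restore -U watchgrid -d watchgrid < /backup/watchgrid-pre-upgrade.dump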
    
  3. Set VERSION=<previous-tag> in .env.
  4. docker compose -f docker-compose.prod.yml pull server frontend.
  5. docker compose -f docker-compose.prod.yml up -d server frontend.
  6. Rerun post-upgrade verification against the previous version.

Important constraint: Postgres migrations are forward-only. Rolling back the container without restoring the DB dump may succeed if the newer migrations were strictly additive, but don't rely on it — always restore the dump.
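
Since the rollback drops the database before restoring, it is worth confirming the dump file is usable before anything destructive runs. A cheap sketch, relying on the fact that pg_dump custom-format (-Fc) archives begin with the bytes PGDMP; dump_looks_valid is a hypothetical helper name. It only catches missing, empty, or wrong-format files and does not replace pg_restore --list.

```shell
# Succeed if FILE exists, is non-empty, and starts with the "PGDMP" magic
# that pg_dump writes at the start of custom-format (-Fc) archives.
dump_looks_valid() {
  local file=$1
  [ -s "$file" ] && [ "$(head -c 5 "$file")" = "PGDMP" ]
}
```

Example: dump_looks_valid /backup/watchgrid-pre-upgrade.dump || { echo "Dump missing or not a custom-format archive; aborting rollback"; exit 1; }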


4. Incident response

Alert → action playbook

| Alert | Likely cause | First action |
|---|---|---|
| /healthz returns 503, DB check failed | Postgres container crash-looping, disk full | Run docker compose logs postgres and check for FATAL: the database system is shutting down or disk errors. Run df -h on the host. |
| /healthz returns 503, WireGuard check failed | wg0 interface gone after a host reboot without the NET_ADMIN capability | docker compose restart server. Verify the host has net.ipv4.ip_forward=1 and the container has NET_ADMIN. |
| /healthz returns 503, SSH-CA check failed | watchgrid-ssh-ca volume not mounted or wiped | Restore from scripts/backup-ssh-ca.sh output; see docs/ssh-ca.md#backup--restore. |
| Agents stopped heartbeating | Server crash, network partition, WG key mismatch | Check server /healthz. If healthy, check docker exec watchgrid-server wg show wg0 for peer handshakes. |
| Login failures spike on the Grafana dashboard | Credential-stuffing attack | Run docker compose logs server and grep -i "invalid credentials" in the output for IP patterns. Traefik ratelimit middleware should already be dropping the worst offenders. If sustained, add an IP block at the host firewall level. |
| Rate-limit rejections spike for registration | A new batch of devices onboarding, or an attacker probing for onboarding tokens | Confirm with the customer before tuning. If legitimate, raise WATCHGRID_RATELIMIT_REGISTRATION_CAPACITY env temporarily. |
| Postgres CPU at 100% | Long-running migration, or N+1 query regression | docker exec watchgrid-postgres psql -U watchgrid -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state <> 'idle';" Kill runaway queries with SELECT pg_terminate_backend(<pid>). |
| Disk usage climbing on the Postgres volume | Runaway audit log: retention sweeper not keeping up, or a pathological agent | Check WATCHGRID_AUDIT_RETENTION_DAYS; if set high, lower it temporarily. Manually run DELETE FROM admin_audit_log WHERE timestamp < now() - interval '30 days'; then VACUUM FULL admin_audit_log; (VACUUM FULL locks the table exclusively and needs free disk space for the rewritten copy). |
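
For the login-failure alert, grouping failed logins by source address makes IP patterns easier to spot. A sketch under stated assumptions: the server's log format is not documented here, so this assumes each "invalid credentials" line contains the client IPv4 address somewhere in it. failed_logins_by_ip is a hypothetical helper name.

```shell
# Read server log lines on stdin, keep failed-login lines, and print
# "COUNT IP" pairs sorted by count (highest first).
# Assumes the client IPv4 address appears somewhere in each matching line;
# if a line contains several addresses, each one is counted.
failed_logins_by_ip() {
  grep -i 'invalid credentials' \
    | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}
```

Example: docker compose logs --since 1h server | failed_logins_by_ip | head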

Collecting evidence for support

# Server + frontend logs since last hour
docker compose -f docker-compose.prod.yml logs --since 1h server frontend \
  > /tmp/support-bundle-logs.txt

# Container state + versions
docker compose -f docker-compose.prod.yml ps > /tmp/support-bundle-state.txt
curl -fsS https://FRONTEND_HOST/api/version >> /tmp/support-bundle-state.txt

# Recent audit log (last 500 rows, redacted)
docker exec watchgrid-postgres psql -U watchgrid -c \
  "SELECT timestamp, admin_user, action, resource_type, success FROM admin_audit_log ORDER BY timestamp DESC LIMIT 500" \
  > /tmp/support-bundle-audit.txt

tar czf /tmp/support-bundle-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/support-bundle-*.txt

Ship the bundle to Watchgrid support — do not include .env, Postgres dumps, or the SSH-CA keys.
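
Logs can contain secrets by accident, so a quick scan of the bundle files before shipping is a reasonable extra step. A minimal sketch; the patterns below are illustrative assumptions (env variable names from this runbook, PEM/OpenSSH private-key headers, WireGuard's PrivateKey setting) and are not an exhaustive list. bundle_secret_hits is a hypothetical helper name.

```shell
# Print file:line locations in the given files that look like secrets.
# Returns 0 if anything suspicious was found, 1 if the files look clean.
# The patterns are illustrative assumptions, not an exhaustive list.
bundle_secret_hits() {
  local hits
  hits=$(grep -nEH 'JWT_SECRET=|ADMIN_PASSWORD=|PRIVATE KEY-----|PrivateKey *=' "$@" \
    | cut -d: -f1,2)
  [ -n "$hits" ] && printf '%s\n' "$hits"
}
```

Example: bundle_secret_hits /tmp/support-bundle-*.txt && echo "Review the lines above before shipping"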


5. Leader election (multi-replica)

K8s deployments run 3 server replicas behind leader election. Only one replica drives WireGuard, DNS, and the command-queue worker at any time — the others are hot standbys.

Verification

Once per cluster after deploy (and on any server upgrade):

kubectl -n watchgrid get lease watchgrid-server-leader -o json \
  | jq '{holder: .spec.holderIdentity, acquired: .spec.acquireTime, renewed: .spec.renewTime}'

Expected: holder is one specific pod name; renewed ticks forward every ~15 s.

Drain the current leader to confirm failover:

LEADER=$(kubectl -n watchgrid get lease watchgrid-server-leader -o jsonpath='{.spec.holderIdentity}')
kubectl -n watchgrid delete pod "$LEADER"
# Within ~30 s, the lease should flip to a different replica.
kubectl -n watchgrid get lease watchgrid-server-leader -w

Document the failover time in the operations log. Target: new leader within 30 s, WireGuard handshakes resuming within 60 s.
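
To get a number for the operations log rather than eyeballing the watch output, the failover can be timed. A sketch with hypothetical helper names (time_failover, current_holder); it polls once per second, so the result is accurate to about a second, and it assumes the lease holder identity changes when leadership moves.

```shell
# Poll HOLDER_CMD once per second until it prints a non-empty holder that
# differs from OLD, then print the elapsed seconds. Fails after TIMEOUT seconds.
# Usage: time_failover OLD TIMEOUT HOLDER_CMD [ARGS...]
time_failover() {
  local old=$1 timeout=$2 start now holder
  shift 2
  start=$(date +%s)
  while :; do
    holder=$("$@" 2>/dev/null)
    now=$(date +%s)
    if [ -n "$holder" ] && [ "$holder" != "$old" ]; then
      echo $((now - start))
      return 0
    fi
    if [ $((now - start)) -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}

# Wrapper around the lease query used in the commands above.
current_holder() {
  kubectl -n watchgrid get lease watchgrid-server-leader \
    -o jsonpath='{.spec.holderIdentity}'
}

# Example drill:
# LEADER=$(current_holder)
# kubectl -n watchgrid delete pod "$LEADER" --wait=false
# time_failover "$LEADER" 30 current_holder || echo "no new leader within 30 s"
```

Passing --wait=false lets the timer start immediately instead of after the old pod has finished terminating.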


6. Observability

Metrics

/metrics on port 8080 exposes Prometheus-format metrics. Expect a Prometheus scrape every 30 s from inside the cluster. Key series:

  • watchgrid_http_requests_total{route, method, status} — for error rate and RED-method dashboards.
  • watchgrid_http_request_duration_seconds — latency histograms.
  • watchgrid_agent_heartbeats_total — compared against the expected device count for fleet health.
  • watchgrid_login_failures_total{reason} — spike on this drives the "credential stuffing" alert.
  • watchgrid_rate_limit_rejections_total{limiter} — distinguishes organic bursts from attacks.
  • watchgrid_wireguard_peers — matches len(devices) when the WG reconcile loop is healthy.
  • watchgrid_db_open_connections / watchgrid_db_in_use_connections — pool saturation signal.

See docs/production.md#security-scanning--cve-response for how these tie into the SLA.
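
The watchgrid_wireguard_peers check can be scripted against the raw /metrics output. A sketch, assuming the gauge is exposed without labels and prints as a plain integer; metric_value and wg_peers_match are hypothetical helper names, and the expected device count has to come from wherever you track it.

```shell
# Read Prometheus text-format metrics on stdin and print the value of an
# unlabelled sample with the given name. Fails if the sample is absent.
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }
    END { exit !found }
  '
}

# Succeed if the WireGuard peer gauge equals the expected device count.
# Metrics text is read from stdin; the comparison is a plain string match.
wg_peers_match() {
  local expected=$1 peers
  peers=$(metric_value watchgrid_wireguard_peers) || return 1
  [ "$peers" = "$expected" ]
}
```

Example, run from a host that can reach the metrics port (the hostname is an assumption): curl -fsS http://localhost:8080/metrics | wg_peers_match 120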

Never expose /metrics publicly

Traefik in docker-compose.prod.yml does not route external traffic to /metrics. In k8s, NetworkPolicy (k8s/07-policies.yaml) blocks ingress from outside the cluster. If you add a new ingress host, remember to add a path rule that drops /metrics at the proxy.
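
After adding an ingress host, an external probe can confirm /metrics is still not served. A rough sketch; metrics_blocked is a hypothetical helper that treats any 2xx status as exposed. A 2xx can also come from a frontend that answers every path, so inspect the response body before concluding the metrics are leaking.

```shell
# Succeed if an HTTP status code indicates /metrics is NOT being served
# (anything other than 2xx). curl reports "000" when no HTTP response was received.
metrics_blocked() {
  case "$1" in
    2[0-9][0-9]) return 1 ;;
    *) return 0 ;;
  esac
}

# Example, run from outside the cluster (FRONTEND_HOST is a placeholder):
# code=$(curl -s -o /dev/null -w '%{http_code}' "https://FRONTEND_HOST/metrics")
# metrics_blocked "$code" || echo "WARNING: /metrics answered with HTTP $code"
```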