Production Deployment
This guide covers deploying Watchgrid for production use with SSL, a reverse proxy, and hardened security settings.
Architecture
The production stack uses Traefik as a reverse proxy with automatic Let's Encrypt SSL certificates:
Internet
│
├─ HTTPS (:443) ──► Traefik ──► Frontend (:80)
│                          └──► Server API (:8080)
│
└─ UDP (:51820) ──────────────► WireGuard
- Traefik handles SSL termination and HTTP→HTTPS redirects
- PostgreSQL is internal only (not exposed to the host)
- Docker Registry is internal only (accessible via VPN at registry.wg:5000)
Setup
1. Clone and Configure
From a checkout of the Watchgrid repository:
2. Edit Environment Variables
Edit .env with your production settings:
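# REQUIRED
# Note: Docker Compose reads .env literally and does not run $(...).
# Run each openssl command in a shell and paste its output as the value.
WG_SERVER_ENDPOINT=your-public-ip:51820
JWT_SECRET=<output of: openssl rand -hex 32>
POSTGRES_PASSWORD=<output of: openssl rand -hex 24>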
ADMIN_PASSWORD=your-secure-admin-password
FRONTEND_HOST=watchgrid.yourdomain.com
# SSL (Cloudflare DNS challenge)
TRAEFIK_ACME_EMAIL=ssl@yourdomain.com
CF_DNS_API_TOKEN=your-cloudflare-api-token
# Optional
SERVER_LATITUDE=52.0705
SERVER_LONGITUDE=4.3007
VERSION=latest
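If openssl is not available, the two random values can be produced another way. The following is a minimal Python sketch, not part of Watchgrid; `generate_secrets` is a hypothetical helper name. It emits values of the same shape as `openssl rand -hex 32` and `openssl rand -hex 24`, ready to paste into .env.

```python
import secrets


def generate_secrets() -> dict:
    """Return hex secrets shaped like `openssl rand -hex 32` and `-hex 24`."""
    return {
        # 32 random bytes rendered as 64 hex characters
        "JWT_SECRET": secrets.token_hex(32),
        # 24 random bytes rendered as 48 hex characters
        "POSTGRES_PASSWORD": secrets.token_hex(24),
    }


if __name__ == "__main__":
    # Print KEY=value lines that can be copied into .env
    for name, value in generate_secrets().items():
        print(f"{name}={value}")
```

The `secrets` module draws from the operating system's cryptographic random source, which is why it is used here instead of `random`.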
3. Start the Production Stack
docker compose -f docker-compose.prod.yml up -d
4. Verify
Open https://watchgrid.yourdomain.com in your browser.
Required Environment Variables
| Variable | Purpose | Example |
|---|---|---|
| WG_SERVER_ENDPOINT | Public IP:port devices connect to | 203.0.113.50:51820 |
| JWT_SECRET | JWT signing key (min 32 chars) | openssl rand -hex 32 |
| POSTGRES_PASSWORD | Database password | openssl rand -hex 24 |
| ADMIN_PASSWORD | Admin account password | Your secure password |
| FRONTEND_HOST | Domain for SSL certificate | watchgrid.example.com |
| TRAEFIK_ACME_EMAIL | Email for Let's Encrypt | admin@example.com |
| CF_DNS_API_TOKEN | Cloudflare API token for DNS challenge | Your Cloudflare token |
| WATCHGRID_ALLOWED_ORIGINS | Comma-separated extra browser origins for CORS + WebSocket. Optional; same-origin is always allowed. | https://watchgrid.example.com,https://staging.example.com |
Observability (Prometheus)
The server exposes /metrics on port 8080 in Prometheus text format. Scrape from inside the cluster only — never expose it publicly (it leaks route coverage, login-failure distributions, and pool state that help an attacker tune further probes). Traefik in docker-compose.prod.yml does not route external traffic to /metrics; in Kubernetes, k8s/07-policies.yaml applies a default-deny NetworkPolicy so only allow-listed neighbours can reach the server.
Key series:
- watchgrid_http_requests_total{route, method, status}: RED-method error rate / saturation.
- watchgrid_http_request_duration_seconds: latency histograms.
- watchgrid_agent_heartbeats_total: fleet-health signal.
- watchgrid_login_failures_total{reason}: credential-stuffing alerting.
- watchgrid_rate_limit_rejections_total{limiter}: per-limiter drop counts.
- watchgrid_wireguard_peers: WG reconcile-loop health.
- watchgrid_db_open_connections / watchgrid_db_in_use_connections: pool saturation.
A starter Prometheus scrape config:
scrape_configs:
  - job_name: watchgrid
    metrics_path: /metrics
    static_configs:
      - targets: ['server.watchgrid.svc.cluster.local:8080']
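To illustrate how the request counter feeds an error-rate signal, here is a small Python sketch that reads Prometheus text-format lines and computes the share of 5xx responses. The sample payload, its route names, and its values are invented for illustration. The parser is deliberately simplified and assumes label values contain no `}` or escaped quotes.

```python
import re

# Hypothetical scrape output; routes and counts are made up for this example.
SAMPLE = """\
# TYPE watchgrid_http_requests_total counter
watchgrid_http_requests_total{route="/api/login",method="POST",status="200"} 90
watchgrid_http_requests_total{route="/api/login",method="POST",status="401"} 6
watchgrid_http_requests_total{route="/api/devices",method="GET",status="500"} 4
"""

SERIES = re.compile(r'^watchgrid_http_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def server_error_ratio(text: str) -> float:
    """Return the fraction of counted requests whose status label is 5xx."""
    total = 0.0
    errors = 0.0
    for line in text.splitlines():
        match = SERIES.match(line.strip())
        if not match:
            continue  # skip comments and unrelated series
        labels = dict(LABEL.findall(match.group("labels")))
        value = float(match.group("value"))
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


if __name__ == "__main__":
    print(f"5xx ratio since process start: {server_error_ratio(SAMPLE):.2%}")
```

Because counters are cumulative, this yields a ratio since process start. For alerting, a windowed PromQL expression such as `sum(rate(watchgrid_http_requests_total{status=~"5.."}[5m])) / sum(rate(watchgrid_http_requests_total[5m]))` is the more usual choice.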
CORS Policy
The browser-facing /api/... surface is locked to same-origin by default. Origins not matching the request Host header are rejected with 403 Origin not allowed. Add comma-separated extras to WATCHGRID_ALLOWED_ORIGINS if the frontend is served from a different host than the API.
Agent endpoints (/api/register, /api/heartbeat, /api/commands/, /api/commandresult, /api/wg/..., /api/logs/...), the Docker registry proxy (/api/registry/, /v2/, /registry/), public /downloads/, and WebSocket upgrades are exempt — agents and CLI clients don't send a browser Origin, and WebSocket upgrades have their own Origin check (server/ws_security.go). OPTIONS preflights are answered with a 10-minute Access-Control-Max-Age cache.
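The origin rules above can be summarised as a small decision function. This Python sketch is an interpretation of the described policy, not the server's code. It assumes that requests without an Origin header are let through, it ignores the separate WebSocket upgrade check, and `/api/devices` in the usage lines is a hypothetical browser-facing route.

```python
from typing import Optional
from urllib.parse import urlsplit

# Prefixes taken from the exemption list; "/api/wg/" and "/api/logs/" stand in
# for the "/api/wg/..." and "/api/logs/..." groups.
EXEMPT_PREFIXES = (
    "/api/register", "/api/heartbeat", "/api/commands/", "/api/commandresult",
    "/api/wg/", "/api/logs/", "/api/registry/", "/v2/", "/registry/", "/downloads/",
)


def parse_allowed_origins(raw: str) -> set:
    """Split a WATCHGRID_ALLOWED_ORIGINS value into a normalised set."""
    return {item.strip().rstrip("/").lower() for item in raw.split(",") if item.strip()}


def origin_allowed(path: str, origin: Optional[str], host: str, extras: set) -> bool:
    """Return True if the request passes the origin check, False for a 403."""
    if path.startswith(EXEMPT_PREFIXES):
        return True  # agent, registry proxy, and download paths are exempt
    if not origin:
        return True  # ASSUMPTION: no Origin header means a non-browser client
    normalised = origin.rstrip("/").lower()
    if urlsplit(normalised).netloc == host.lower():
        return True  # same-origin: the Origin's host[:port] equals the Host header
    return normalised in extras  # otherwise only explicitly listed extras pass


if __name__ == "__main__":
    extras = parse_allowed_origins("https://staging.example.com")
    print(origin_allowed("/api/devices", "https://staging.example.com", "watchgrid.example.com", extras))
    print(origin_allowed("/api/devices", "https://evil.example", "watchgrid.example.com", extras))
```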
SSL Certificates
The production stack uses Traefik with Cloudflare DNS challenge for Let's Encrypt:
- Traefik requests a certificate from Let's Encrypt
- Uses the Cloudflare API to create a DNS TXT record for validation
- Certificate is automatically renewed before expiry
- All HTTP traffic is redirected to HTTPS
Firewall Rules
Ensure these ports are open:
| Port | Protocol | Purpose |
|---|---|---|
| 80 | TCP | HTTP (redirects to HTTPS) |
| 443 | TCP | HTTPS (web dashboard + API) |
| 51820 | UDP | WireGuard VPN |
Database Backups
PostgreSQL data is stored in a Docker volume. Back it up regularly:
# Dump the database
docker exec watchgrid-postgres pg_dump -U watchgrid watchgrid > backup-$(date +%Y%m%d).sql
# Restore from backup
docker exec -i watchgrid-postgres psql -U watchgrid watchgrid < backup-20240101.sql
Database Migrations
Migrations run automatically on server startup. When upgrading:
docker compose -f docker-compose.prod.yml pull # Get latest images
docker compose -f docker-compose.prod.yml down # Stop services
docker compose -f docker-compose.prod.yml up -d # Start — migrations run automatically
The server waits for PostgreSQL to be ready, then applies any pending migrations from scripts/migrations/. Failed migrations prevent startup (fail-fast).
SSH CA Key Backup
The SSH CA keys are stored in the watchgrid-ssh-ca Docker volume. Losing them invalidates all issued host certificates. An automated daily backup script is provided.
See SSH CA — Backup & Restore for the full runbook including:
- Automated daily encrypted backup via scripts/backup-ssh-ca.sh
- systemd timer for unattended backups
- Step-by-step restore procedure (RTO < 15 minutes)
- Volume restore for full stack rebuilds
- Monthly backup verification checklist
Updates
To update Watchgrid:
cd watchgrid
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d
Check the logs to confirm a clean startup:
docker compose -f docker-compose.prod.yml logs -f
Rate Limiting
Architecture decision
Watchgrid's server process includes an in-process, per-IP token-bucket rate limiter that protects login, 2FA, and registration endpoints. This limiter works correctly for single-replica deployments. With multiple server replicas the per-replica bucket count multiplies the effective limit (e.g., 3 replicas → 30 login attempts/minute per IP instead of 10).
Decision for v1 (single-replica): The in-process limiter is sufficient and is the primary brute-force protection. docker-compose.prod.yml additionally configures a Traefik ratelimit middleware (10 req/min/IP across the router) as a cross-replica safety net. For multi-replica deployments backed by a load balancer, configure rate limiting at the load balancer or reverse proxy layer.
A Redis-backed limiter is the correct long-term solution for multi-replica deployments but adds operational complexity. It is tracked as a future enhancement.
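The replica multiplication effect can be reproduced with a toy model. The Python sketch below is not the server's limiter: it assumes a bucket capacity of 10 tokens refilled at 10 per minute, round-robin load balancing, and a burst of requests arriving at the same instant. All class and function names are hypothetical.

```python
class TokenBucket:
    """Classic token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full
        self.updated = now

    def allow(self, now: float) -> bool:
        # Credit tokens for the elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerIPLimiter:
    """One in-process limiter, as a single replica would hold it."""

    def __init__(self, per_minute: int = 10):
        self.per_minute = per_minute
        self.buckets = {}  # ip -> TokenBucket

    def allow(self, ip: str, now: float) -> bool:
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = TokenBucket(self.per_minute, self.per_minute / 60.0, now)
            self.buckets[ip] = bucket
        return bucket.allow(now)


def accepted_in_burst(replicas: int, attempts: int, ip: str = "203.0.113.7") -> int:
    """Spread `attempts` simultaneous requests round-robin over independent replicas."""
    limiters = [PerIPLimiter() for _ in range(replicas)]
    return sum(limiters[i % replicas].allow(ip, now=0.0) for i in range(attempts))


if __name__ == "__main__":
    print(accepted_in_burst(replicas=1, attempts=100))
    print(accepted_in_burst(replicas=3, attempts=100))
```

Each replica keeps its own bucket for the same IP, so the accepted count scales with the number of replicas. A shared store or an ingress-level limit avoids this.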
Traefik middleware (production)
The docker-compose.prod.yml Traefik labels configure an auth-ratelimit middleware on the /api router:
- Average: 10 requests/minute
- Burst: 10 requests
- Key: client IP (X-Forwarded-For depth 1)
This is enforced at the ingress level regardless of how many server replicas are running.
Memory boundedness
The in-process limiter runs a background goroutine every 5 minutes that evicts buckets not seen for 10 minutes. Because a bucket can wait up to one sweep interval after it becomes stale, the map is bounded by the IPs active within roughly the last 10 to 15 minutes.
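As an illustration of the eviction behaviour, the following Python sketch models the sweep with an explicit clock. It is a simplification under stated assumptions: the real limiter is a Go goroutine and presumably guards the map with a lock, which this sketch omits. `sweep` and the constants are hypothetical names.

```python
IDLE_TTL = 10 * 60       # seconds a bucket may stay unseen before eviction
SWEEP_INTERVAL = 5 * 60  # seconds between sweeper runs


def sweep(last_seen: dict, now: float, idle_ttl: float = IDLE_TTL) -> int:
    """Delete entries idle for at least `idle_ttl`; return how many were removed."""
    stale = [ip for ip, seen in last_seen.items() if now - seen >= idle_ttl]
    for ip in stale:
        del last_seen[ip]
    return len(stale)


if __name__ == "__main__":
    # Two clients: one last seen at t=0, one at t=400 s.
    seen = {"198.51.100.1": 0.0, "198.51.100.2": 400.0}
    for tick in range(1, 5):
        now = tick * SWEEP_INTERVAL
        removed = sweep(seen, now)
        print(f"t={now:>4}s removed={removed} remaining={len(seen)}")
```

In this run the client last seen at 400 s is only removed by the sweep at 1200 s, about 13 minutes after its last request, which shows why the bound sits between the idle threshold and the threshold plus one sweep interval.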
PostgreSQL TLS
In production, the Watchgrid server requires an encrypted connection to PostgreSQL (WATCHGRID_DB_SSLMODE=require or higher). Setting sslmode=disable causes a fatal startup error unless WATCHGRID_DEV_MODE=true is set.
Self-signed certificate (single-host)
docker-compose.prod.yml includes a one-time init script (scripts/postgres-ssl-init.sh) that generates a 10-year self-signed certificate into the PostgreSQL data directory on first startup. The Postgres container is started with -c ssl=on pointing to this certificate.
WATCHGRID_DB_SSLMODE=require encrypts the connection but does not validate the server certificate chain (appropriate for same-host Docker networks). No additional trust bundle is needed.
Custom CA / verify-full (k8s or external Postgres)
For Kubernetes or an external managed Postgres, use WATCHGRID_DB_SSLMODE=verify-full and provide:
WATCHGRID_DB_SSLMODE=verify-full
WATCHGRID_DB_SSLROOTCERT=/etc/watchgrid/pg-ca.crt # PEM bundle of the CA cert
Mount the CA certificate into the server container. The Go pq driver reads sslrootcert from the DSN.
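The startup rule and the resulting connection string can be sketched as follows. This Python example is illustrative only and is not the server's Go code. It assumes that any mode other than require, verify-ca, or verify-full is rejected outside dev mode, and it uses a hypothetical host name `postgres` and password. The user and database name `watchgrid` follow the backup commands in this guide.

```python
from typing import Optional
from urllib.parse import quote, urlencode

SECURE_MODES = {"require", "verify-ca", "verify-full"}


def check_sslmode(sslmode: str, dev_mode: bool) -> None:
    """Raise ValueError where the server would refuse to start."""
    if sslmode in SECURE_MODES:
        return
    if sslmode == "disable" and dev_mode:
        return  # only tolerated when WATCHGRID_DEV_MODE=true
    raise ValueError(f"sslmode={sslmode!r} is not allowed in production")


def build_dsn(user: str, password: str, host: str, port: int, dbname: str,
              sslmode: str, sslrootcert: Optional[str] = None) -> str:
    """Assemble a postgres:// URL carrying sslmode and, optionally, sslrootcert."""
    params = {"sslmode": sslmode}
    if sslrootcert:
        params["sslrootcert"] = sslrootcert
    # Keep "/" readable in the certificate path; other characters are escaped.
    query = urlencode(params, safe="/")
    return (f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{quote(dbname, safe='')}?{query}")


if __name__ == "__main__":
    check_sslmode("verify-full", dev_mode=False)
    print(build_dsn("watchgrid", "example-password", "postgres", 5432,
                    "watchgrid", "verify-full", "/etc/watchgrid/pg-ca.crt"))
```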
Docker development stack
docker-compose.yml uses WATCHGRID_DEV_MODE=true which allows sslmode=disable. This is intentional for local development where Postgres does not have SSL configured.
Security Scanning & CVE Response
What runs in CI
Every push to main and every pull request triggers two layers of scanning:
1. Image scanning (.github/workflows/build.yml) — runs after each image is built and pushed:
- Trivy CRITICAL CVE gate: the build fails if any CRITICAL-severity, fixed vulnerability is found in OS packages or language dependencies of the image. Unfixed CVEs are reported but do not fail the build.
- CycloneDX SBOM: a CycloneDX JSON SBOM is generated per image (server, frontend, cluster-agent, service-agent) and uploaded as a workflow artefact (sbom-<component>-<sha>.cyclonedx.json, retained 90 days).
- Digest pinning: on semver tag pushes (e.g. 1.26.1), the pin-manifests job updates k8s/overlays/production/kustomization.yaml with the exact sha256: digest of each image and commits back to main.
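The gate's rule (fail only on CRITICAL findings that have a fix) can be expressed against a Trivy JSON report. The Python sketch below only illustrates that rule; the workflow itself presumably relies on Trivy's own severity and ignore-unfixed settings rather than custom code. The sample report and its CVE identifiers are placeholders, and `blocking_vulns` is a hypothetical name.

```python
def blocking_vulns(report: dict) -> list:
    """Return IDs of CRITICAL vulnerabilities that have a fixed version."""
    ids = []
    for result in report.get("Results", []):
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL" and vuln.get("FixedVersion"):
                ids.append(vuln["VulnerabilityID"])
    return ids


# Placeholder report in the shape of `trivy image --format json` output.
SAMPLE_REPORT = {
    "Results": [
        {
            "Target": "example-image (os packages)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL", "FixedVersion": "1.2.4"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL", "FixedVersion": ""},
                {"VulnerabilityID": "CVE-0000-0003", "Severity": "HIGH", "FixedVersion": "2.0.1"},
            ],
        },
        {"Target": "example-binary"},
    ]
}

if __name__ == "__main__":
    blockers = blocking_vulns(SAMPLE_REPORT)
    print("build fails" if blockers else "build passes", blockers)
```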
CVE response SLA
Vulnerabilities in Watchgrid images or dependencies are triaged and patched on the following timeline:
| Severity | Triage | Patch merged | Released |
|---|---|---|---|
| Critical (CVSS ≥ 9.0) | ≤ 24 hours | ≤ 3 business days | ≤ 7 days |
| High (CVSS 7.0–8.9) | ≤ 3 business days | ≤ 14 days | next scheduled release |
| Medium (CVSS 4.0–6.9) | ≤ 7 days | next scheduled release | next scheduled release |
| Low (CVSS < 4.0) | best effort | best effort | rolled into periodic dependency bumps |
Exception: if a CRITICAL CVE has no upstream fix, the image scan gate allows the build to proceed (ignore-unfixed: true), but the incident is tracked and mitigation (e.g. disabling the vulnerable component, upgrading to a different base image) is attempted within the Critical SLA window.
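The severity bands in the SLA table map directly from a CVSS base score. A minimal Python sketch of that lookup, with hypothetical names and the triage targets copied from the table:

```python
# Triage targets copied from the SLA table.
TRIAGE_SLA = {
    "Critical": "≤ 24 hours",
    "High": "≤ 3 business days",
    "Medium": "≤ 7 days",
    "Low": "best effort",
}


def severity_band(cvss: float) -> str:
    """Map a CVSS base score (0.0 to 10.0) to the band used in the SLA table."""
    if not 0.0 <= cvss <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss}")
    if cvss >= 9.0:
        return "Critical"
    if cvss >= 7.0:
        return "High"
    if cvss >= 4.0:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    for score in (9.8, 7.5, 5.3, 2.1):
        band = severity_band(score)
        print(f"CVSS {score}: {band}, triage {TRIAGE_SLA[band]}")
```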
Reporting a vulnerability
Security issues must not be filed as public GitHub issues. Report privately via one of:
- GitHub Security Advisory: Security → Report a vulnerability on the repo.
- Email: security@watchgrid.dev (PGP key fingerprint published on the website).
Include reproduction steps, affected version(s), and impact. You will receive an acknowledgement within the Critical triage window regardless of actual severity.
Accessing SBOMs
SBOMs for every CI build are attached as workflow artefacts. To download:
# List recent runs
gh run list --workflow build.yml --limit 5
# Download all artefacts from a specific run
gh run download <run-id>
Release SBOMs (tag builds) are additionally attached to the GitHub Release page.