Production Deployment

This guide covers deploying Watchgrid for production use with SSL, reverse proxy, and proper security.


Architecture

The production stack uses Traefik as a reverse proxy with automatic Let's Encrypt SSL certificates:

Internet
   ├─ HTTPS (:443) ──► Traefik ──► Frontend (:80)
   │                          └──► Server API (:8080)
   └─ UDP (:51820) ──────────────► WireGuard
  • Traefik handles SSL termination and HTTP→HTTPS redirects
  • PostgreSQL is internal only (not exposed to the host)
  • Docker Registry is internal only (accessible via VPN at registry.wg:5000)

Setup

1. Clone and Configure

From a checkout of the Watchgrid repository:

cd watchgrid
cp .env.example .env

2. Edit Environment Variables

Edit .env with your production settings:

# REQUIRED
WG_SERVER_ENDPOINT=your-public-ip:51820
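# Compose reads .env literally and does not run $(...); paste the command output instead
JWT_SECRET=output-of-openssl-rand-hex-32
POSTGRES_PASSWORD=output-of-openssl-rand-hex-24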
ADMIN_PASSWORD=your-secure-admin-password
FRONTEND_HOST=watchgrid.yourdomain.com

# SSL (Cloudflare DNS challenge)
TRAEFIK_ACME_EMAIL=ssl@yourdomain.com
CF_DNS_API_TOKEN=your-cloudflare-api-token

# Optional
SERVER_LATITUDE=52.0705
SERVER_LONGITUDE=4.3007
VERSION=latest
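
Docker Compose treats .env values as literal strings and does not evaluate shell command substitution, so the random secrets are best generated in a shell first and then pasted into the file. A minimal sketch, assuming openssl is installed; the gen_hex helper name is illustrative and not part of Watchgrid:

```shell
# gen_hex N: print N random bytes as 2N hex characters (hypothetical helper)
gen_hex() { openssl rand -hex "$1"; }

JWT_SECRET="$(gen_hex 32)"          # 64 hex characters
POSTGRES_PASSWORD="$(gen_hex 24)"   # 48 hex characters

# Print lines that are ready to paste into .env
printf 'JWT_SECRET=%s\nPOSTGRES_PASSWORD=%s\n' "$JWT_SECRET" "$POSTGRES_PASSWORD"
```

Keep the printed values out of shell history and version control; .env should not be committed.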

3. Start the Production Stack

docker compose -f docker-compose.prod.yml up -d

4. Verify

Open https://watchgrid.yourdomain.com in your browser.
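
The same check can be scripted from any machine that resolves the domain. The sketch below assumes curl is available and that FRONTEND_HOST holds your domain; the function names are illustrative:

```shell
# Hostname from .env; the fallback is the placeholder used in this guide
HOST="${FRONTEND_HOST:-watchgrid.yourdomain.com}"

# is_redirect CODE: succeed when CODE is an HTTP redirect status
is_redirect() {
  case "$1" in
    301|302|307|308) return 0 ;;
    *) return 1 ;;
  esac
}

# verify_stack: check the HTTP-to-HTTPS redirect, then the HTTPS endpoint itself
verify_stack() {
  code="$(curl -s -o /dev/null -w '%{http_code}' "http://$HOST/")"
  is_redirect "$code" || { echo "expected a redirect on port 80, got $code" >&2; return 1; }
  curl -fsS -o /dev/null "https://$HOST/" || { echo "HTTPS request failed" >&2; return 1; }
  echo "ok: redirect and HTTPS both respond"
}
```

Run verify_stack once DNS points at the server. A certificate error from the second curl call usually means Traefik has not finished the ACME challenge yet.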


Required Environment Variables

| Variable | Purpose | Example |
| --- | --- | --- |
| WG_SERVER_ENDPOINT | Public IP:port devices connect to | 203.0.113.50:51820 |
| JWT_SECRET | JWT signing key (min 32 chars) | openssl rand -hex 32 |
| POSTGRES_PASSWORD | Database password | openssl rand -hex 24 |
| ADMIN_PASSWORD | Admin account password | Your secure password |
| FRONTEND_HOST | Domain for SSL certificate | watchgrid.example.com |
| TRAEFIK_ACME_EMAIL | Email for Let's Encrypt | admin@example.com |
| CF_DNS_API_TOKEN | Cloudflare API token for DNS challenge | Your Cloudflare token |
| WATCHGRID_ALLOWED_ORIGINS | Optional. Comma-separated extra browser origins for CORS and WebSocket checks; same-origin is always allowed. | https://watchgrid.example.com,https://staging.example.com |

Observability (Prometheus)

The server exposes /metrics on port 8080 in Prometheus text format. Scrape from inside the cluster only — never expose it publicly (it leaks route coverage, login-failure distributions, and pool state that help an attacker tune further probes). Traefik in docker-compose.prod.yml does not route external traffic to /metrics; in Kubernetes, k8s/07-policies.yaml applies a default-deny NetworkPolicy so only allow-listed neighbours can reach the server.

Key series:

  • watchgrid_http_requests_total{route, method, status}: RED-method request rate and error rate.
  • watchgrid_http_request_duration_seconds: latency histograms (the RED-method duration signal).
  • watchgrid_agent_heartbeats_total: fleet-health signal.
  • watchgrid_login_failures_total{reason}: credential-stuffing alerting.
  • watchgrid_rate_limit_rejections_total{limiter}: per-limiter drop counts.
  • watchgrid_wireguard_peers: WG reconcile-loop health.
  • watchgrid_db_open_connections / watchgrid_db_in_use_connections: pool saturation.

A starter Prometheus scrape config:

scrape_configs:
  - job_name: watchgrid
    metrics_path: /metrics
    static_configs:
      - targets: ['server.watchgrid.svc.cluster.local:8080']
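
To confirm that /metrics is not reachable from outside, probe the public hostname from a machine that is not on the internal network. This is a rough sketch assuming curl is available. It looks for the watchgrid_ series prefix in the body rather than relying on a particular status code, because the response for an unrouted path depends on your proxy configuration. The function names are illustrative:

```shell
# looks_like_metrics: read a response body on stdin and succeed if it
# contains lines starting with the Watchgrid series prefix
looks_like_metrics() { grep -q '^watchgrid_'; }

# metrics_exposed HOST: succeed (which is bad) when HOST serves /metrics publicly
metrics_exposed() {
  curl -s --max-time 10 "https://$1/metrics" | looks_like_metrics
}
```

For example, if metrics_exposed watchgrid.example.com succeeds, the endpoint is leaking and the proxy rules need tightening.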

CORS Policy

The browser-facing /api/... surface is locked to same-origin by default. Origins not matching the request Host header are rejected with 403 Origin not allowed. Add comma-separated extras to WATCHGRID_ALLOWED_ORIGINS if the frontend is served from a different host than the API.

Agent endpoints (/api/register, /api/heartbeat, /api/commands/, /api/commandresult, /api/wg/..., /api/logs/...), the Docker registry proxy (/api/registry/, /v2/, /registry/), public /downloads/, and WebSocket upgrades are exempt — agents and CLI clients don't send a browser Origin, and WebSocket upgrades have their own Origin check (server/ws_security.go). OPTIONS preflights are answered with a 10-minute Access-Control-Max-Age cache.
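
One way to see the policy in action is to send requests with different Origin headers and compare status codes. The sketch below assumes curl is available; you supply the URL of a browser-facing API endpoint yourself (it must not be one of the exempt paths), and the helper names are illustrative. An origin outside the allow-list should come back as 403; what an allowed origin returns depends on the endpoint and on authentication.

```shell
# split_origins LIST: print each comma-separated origin on its own line
split_origins() { printf '%s\n' "$1" | tr ',' '\n'; }

# origin_status URL ORIGIN: print the HTTP status returned for a request
# that carries ORIGIN in its Origin header
origin_status() {
  curl -s -o /dev/null -w '%{http_code}' -H "Origin: $2" "$1"
}

# probe_origins URL LIST: report the status for every origin in LIST
probe_origins() {
  split_origins "$2" | while IFS= read -r origin; do
    printf '%s -> %s\n' "$origin" "$(origin_status "$1" "$origin")"
  done
}
```

Calling probe_origins with your endpoint URL and "$WATCHGRID_ALLOWED_ORIGINS,https://not-allowed.example" prints one line per origin, which makes a misconfigured allow-list easy to spot.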


SSL Certificates

The production stack uses Traefik with Cloudflare DNS challenge for Let's Encrypt:

  1. Traefik requests a certificate from Let's Encrypt for FRONTEND_HOST
  2. Traefik uses the Cloudflare API (CF_DNS_API_TOKEN) to create the DNS TXT record that Let's Encrypt checks to validate domain ownership
  3. Traefik renews the certificate automatically before it expires
  4. All HTTP traffic is redirected to HTTPS
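
Automatic renewal can fail silently, for example when the Cloudflare token is revoked, so an external expiry check is a useful safety net. A minimal sketch using openssl; the function names and the idea of a 14-day threshold are assumptions, not Watchgrid settings:

```shell
# days_to_seconds N: convert days to seconds for openssl's -checkend option
days_to_seconds() { echo $(( $1 * 86400 )); }

# cert_valid_for HOST DAYS: succeed if HOST's certificate is still valid
# DAYS days from now, fail if it expires sooner
cert_valid_for() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -checkend "$(days_to_seconds "$2")"
}
```

Running cert_valid_for watchgrid.example.com 14 from cron and alerting on a non-zero exit gives roughly two weeks of warning before an outage.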

Firewall Rules

Ensure these ports are open:

| Port | Protocol | Purpose |
| --- | --- | --- |
| 80 | TCP | HTTP (redirects to HTTPS) |
| 443 | TCP | HTTPS (web dashboard + API) |
| 51820 | UDP | WireGuard VPN |
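
On hosts that use ufw, the rules above could be applied as follows. This is a sketch that assumes ufw is your firewall front end; other firewalls need equivalent rules. The run helper is illustrative and only prints the commands when DRY_RUN is set, so the rules can be reviewed first:

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN is set
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# open_watchgrid_ports: allow the three ports Watchgrid needs
open_watchgrid_ports() {
  run ufw allow 80/tcp      # HTTP, redirected to HTTPS by Traefik
  run ufw allow 443/tcp     # HTTPS dashboard and API
  run ufw allow 51820/udp   # WireGuard VPN
}
```

Preview with DRY_RUN=1 open_watchgrid_ports, then run open_watchgrid_ports as root to apply.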

Database Backups

PostgreSQL data is stored in a Docker volume. Back it up regularly:

# Dump the database
docker exec watchgrid-postgres pg_dump -U watchgrid watchgrid > backup-$(date +%Y%m%d).sql

# Restore from backup
docker exec -i watchgrid-postgres psql -U watchgrid watchgrid < backup-20240101.sql
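
For unattended backups, the dump command can be wrapped so that a failed pg_dump does not leave behind an empty file that looks like a valid backup. A sketch under the same container and database names as above; the function names, the backup directory, and the retention period are assumptions to adapt:

```shell
# backup_name: file name for today's dump, for example backup-20240101.sql
backup_name() { echo "backup-$(date +%Y%m%d).sql"; }

# backup_db DIR: dump into DIR, compress on success, clean up on failure
backup_db() {
  out="$1/$(backup_name)"
  if docker exec watchgrid-postgres pg_dump -U watchgrid watchgrid > "$out"; then
    gzip -f "$out" && echo "wrote $out.gz"
  else
    rm -f "$out"   # a failed pg_dump would otherwise leave an empty or partial file
    echo "backup failed" >&2
    return 1
  fi
}

# prune_backups DIR DAYS: delete compressed dumps older than DAYS days
prune_backups() { find "$1" -name 'backup-*.sql.gz' -mtime +"$2" -delete; }
```

A compressed dump is restored by piping gunzip -c output into the psql restore command shown above. Store copies off the host; a backup on the same disk as the Docker volume does not survive a disk failure.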

Database Migrations

Migrations run automatically on server startup. When upgrading:

docker compose -f docker-compose.prod.yml pull    # Get latest images
docker compose -f docker-compose.prod.yml down     # Stop services
docker compose -f docker-compose.prod.yml up -d    # Start — migrations run automatically

The server waits for PostgreSQL to be ready, then applies any pending migrations from scripts/migrations/. Failed migrations prevent startup (fail-fast).


SSH CA Key Backup

The SSH CA keys are stored in the watchgrid-ssh-ca Docker volume. Losing them invalidates all issued host certificates. An automated daily backup script is provided.

See SSH CA — Backup & Restore for the full runbook including:

  • Automated daily encrypted backup via scripts/backup-ssh-ca.sh
  • systemd timer for unattended backups
  • Step-by-step restore procedure (RTO < 15 minutes)
  • Volume restore for full stack rebuilds
  • Monthly backup verification checklist

Updates

To update Watchgrid:

cd watchgrid
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d

Check the logs to confirm a clean startup:

docker compose -f docker-compose.prod.yml logs -f server

Rate Limiting

Architecture decision

Watchgrid's server process includes an in-process, per-IP token-bucket rate limiter that protects login, 2FA, and registration endpoints. This limiter works correctly for single-replica deployments. With multiple server replicas the per-replica bucket count multiplies the effective limit (e.g., 3 replicas → 30 login attempts/minute per IP instead of 10).
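
The multiplication is simple but worth making explicit when sizing a deployment. The helper below is purely illustrative and assumes the worst case, where the load balancer spreads one client's requests evenly across all replicas:

```shell
# effective_limit REPLICAS PER_REPLICA: worst-case attempts per minute per IP
# when each replica keeps its own in-process bucket
effective_limit() { echo $(( $1 * $2 )); }

effective_limit 1 10   # single replica: prints 10
effective_limit 3 10   # three replicas: prints 30
```

With sticky sessions or IP-hash balancing, the real figure stays closer to the per-replica limit, since a given client mostly hits the same replica.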

Decision for v1 (single-replica): The in-process limiter is sufficient and is the primary brute-force protection. docker-compose.prod.yml additionally configures a Traefik ratelimit middleware (10 req/min/IP across the router) as a cross-replica safety net. For multi-replica deployments backed by a load balancer, configure rate limiting at the load balancer or reverse proxy layer.

A Redis-backed limiter is the correct long-term solution for multi-replica deployments, but it adds operational complexity. It is tracked as a future enhancement.

Traefik middleware (production)

The docker-compose.prod.yml Traefik labels configure an auth-ratelimit middleware on the /api router:

  • Average: 10 requests/minute
  • Burst: 10 requests
  • Key: client IP (X-Forwarded-For depth 1)

This is enforced at the ingress level regardless of how many server replicas are running.
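
To observe the middleware, send a short burst of requests and count how many are rejected. Traefik's rate limiter answers over-limit requests with HTTP 429. The sketch assumes curl is available and that you pass the URL of an endpoint behind the /api router; the function names are illustrative:

```shell
# count_status CODE: count lines on stdin that are exactly CODE
count_status() { grep -c -x "$1" || true; }

# burst_probe URL N: send N requests back to back, printing one status per line
burst_probe() {
  i=0
  while [ "$i" -lt "$2" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' "$1"
    i=$(( i + 1 ))
  done
}
```

Piping burst_probe with a count above the burst size (for example 15) into count_status 429 should report a non-zero number once the burst allowance is used up. The probe throttles your own IP for a while, so avoid running it from a shared office address during working hours.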

Memory boundedness

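The in-process limiter starts a background goroutine that runs every 5 minutes and evicts buckets that have not been seen for 10 minutes. Because a stale bucket waits for the next sweep, an entry can linger for up to roughly 15 minutes after its last request, so the map is bounded by the number of distinct IPs seen within that window.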


PostgreSQL TLS

In production, the Watchgrid server requires an encrypted connection to PostgreSQL (WATCHGRID_DB_SSLMODE=require or higher). Setting sslmode=disable causes a fatal startup error unless WATCHGRID_DEV_MODE=true is set.

Self-signed certificate (single-host)

docker-compose.prod.yml includes a one-time init script (scripts/postgres-ssl-init.sh) that generates a 10-year self-signed certificate into the PostgreSQL data directory on first startup. The Postgres container is started with -c ssl=on pointing to this certificate.

WATCHGRID_DB_SSLMODE=require encrypts the connection but does not validate the server certificate chain (appropriate for same-host Docker networks). No additional trust bundle is needed.
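
The contents of scripts/postgres-ssl-init.sh are not reproduced in this guide. As a rough sketch of what an init script of this kind typically does, the following generates a 10-year self-signed key pair with openssl. The function name, the CN value, and the key size are assumptions, not the values the shipped script uses:

```shell
# make_pg_cert DIR: create a 10-year self-signed certificate and key in DIR
make_pg_cert() {
  openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
    -subj "/CN=postgres" \
    -keyout "$1/server.key" -out "$1/server.crt" 2>/dev/null
  chmod 600 "$1/server.key"   # PostgreSQL rejects keys readable by group or others
}
```

Inspecting the result with openssl x509 -in server.crt -noout -dates is a quick way to confirm the validity period before pointing PostgreSQL at the files.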

Custom CA / verify-full (k8s or external Postgres)

For Kubernetes or an external managed Postgres, use WATCHGRID_DB_SSLMODE=verify-full and provide:

WATCHGRID_DB_SSLMODE=verify-full
WATCHGRID_DB_SSLROOTCERT=/etc/watchgrid/pg-ca.crt  # PEM bundle of the CA cert

Mount the CA certificate into the server container. The Go pq driver reads sslrootcert from the DSN.

Docker development stack

docker-compose.yml uses WATCHGRID_DEV_MODE=true which allows sslmode=disable. This is intentional for local development where Postgres does not have SSL configured.


Security Scanning & CVE Response

What runs in CI

Every push to main and every pull request triggers two layers of scanning:

1. Image scanning (.github/workflows/build.yml) — runs after each image is built and pushed:

  • Trivy CRITICAL CVE gate: the build fails if any CRITICAL-severity, fixed vulnerability is found in OS packages or language dependencies of the image. Unfixed CVEs are reported but do not fail the build.
  • CycloneDX SBOM: a CycloneDX JSON SBOM is generated per image (server, frontend, cluster-agent, service-agent) and uploaded as a workflow artefact (sbom-<component>-<sha>.cyclonedx.json, retained 90 days).
  • Digest pinning: on semver tag pushes (e.g. 1.26.1), the pin-manifests job updates k8s/overlays/production/kustomization.yaml with the exact sha256: digest of each image and commits the change back to main.
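
The CRITICAL gate can be approximated locally before pushing. The sketch below assumes the Trivy CLI is installed and mirrors the behaviour described above; the exact options used in build.yml may differ, and whether CI produces its SBOMs with Trivy is an assumption. The run helper is illustrative and prints commands instead of executing them when DRY_RUN is set:

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN is set
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# critical_gate IMAGE: fail when IMAGE has a CRITICAL vulnerability
# for which a fix is available
critical_gate() {
  run trivy image --severity CRITICAL --ignore-unfixed --exit-code 1 "$1"
}

# make_sbom IMAGE FILE: write a CycloneDX JSON SBOM for IMAGE to FILE
make_sbom() {
  run trivy image --format cyclonedx --output "$2" "$1"
}
```

Running critical_gate against a locally built image gives early warning of a failure that would otherwise only show up in CI.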

CVE response SLA

Vulnerabilities in Watchgrid images or dependencies are triaged and patched on the following timeline:

| Severity | Triage | Patch merged | Released |
| --- | --- | --- | --- |
| Critical (CVSS ≥ 9.0) | ≤ 24 hours | ≤ 3 business days | ≤ 7 days |
| High (CVSS 7.0–8.9) | ≤ 3 business days | ≤ 14 days | next scheduled release |
| Medium (CVSS 4.0–6.9) | ≤ 7 days | next scheduled release | next scheduled release |
| Low (CVSS < 4.0) | best effort | best effort | rolled into periodic dependency bumps |

Exception: if a CRITICAL CVE has no upstream fix, the image scan gate allows the build to proceed (ignore-unfixed: true), but the incident is tracked and mitigation (e.g. disabling the vulnerable component, upgrading to a different base image) is attempted within the Critical SLA window.

Reporting a vulnerability

Security issues must not be filed as public GitHub issues. Report privately via one of:

  • GitHub Security Advisory: Security → Report a vulnerability on the repo.
  • Email: security@watchgrid.dev (PGP key fingerprint published on the website).

Include reproduction steps, affected version(s), and impact. Every report is acknowledged within the Critical triage window (24 hours), regardless of its actual severity.

Accessing SBOMs

SBOMs for every CI build are attached as workflow artefacts. To download:

# List recent runs
gh run list --workflow build.yml --limit 5

# Download all artefacts from a specific run
gh run download <run-id>

Release SBOMs (tag builds) are additionally attached to the GitHub Release page.