Deploying stratoflights-predictor

The predictor is a single static Go binary with no database and no required external services. It downloads NOAA GFS/GEFS wind data to node-local disk and serves the REST API (see /docs or api/rest/predictor.swagger.yml).

It is an internal backend: the public entrypoint is the stratoflights API gateway, which calls the predictor over an internal overlay network. The predictor enforces no auth of its own.

Environments

| Environment | File | Notes |
| --- | --- | --- |
| Local dev | docker-compose.yml | one instance, metrics off, named volume |
| Staging (single host) | docker-compose.staging.yml | all features + bundled Prometheus |
| Production (Swarm) | docker-compose.swarm.yml | node-pinned, replicated, metrics |

# Local
docker compose up --build
curl localhost:8080/ready

# Staging (single host, exercises the metrics pipeline)
docker compose -f docker-compose.staging.yml up --build
# Prometheus at :9090, predictor target should be UP

# Production — see below

Production (Docker Swarm)

Storage and node placement — the important part

The wind dataset is about 8.9 GiB at 0.5° resolution and must live on local disk, never on NFS. To bound the number of copies, the service is pinned to nodes carrying the predictor.data=true label; label at most two nodes. Each labelled node keeps exactly one copy under a node-local bind mount.

On each labelled node, provision the local directories and a writable owner for the non-root container (uid:gid 65532:65532):

sudo mkdir -p /srv/predictor/data /srv/predictor/elevation
sudo chown -R 65532:65532 /srv/predictor
# (optional) seed the elevation dataset so descent terminates at ground level:
#   python3 scripts/build_elevation.py /srv/predictor/elevation/ruaumoko-dataset
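
Before deploying, it can help to confirm that the directories really are owned by the container's uid:gid. The helper below is a minimal sketch, not part of the repository: check_owner is a hypothetical name, and it assumes a stat that supports -c (GNU coreutils or BusyBox).

```shell
#!/bin/sh
# check_owner DIR UID:GID
# Succeeds only if DIR is owned by exactly UID:GID.
# Hypothetical helper; assumes `stat -c` is available.
check_owner() {
    dir=$1
    want=$2
    got=$(stat -c '%u:%g' "$dir") || return 1
    if [ "$got" != "$want" ]; then
        echo "$dir is owned by $got, expected $want" >&2
        return 1
    fi
}

# Usage on a labelled node:
#   check_owner /srv/predictor/data 65532:65532
#   check_owner /srv/predictor/elevation 65532:65532
```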

Label the two storage nodes:

docker node update --label-add predictor.data=true <node-a>
docker node update --label-add predictor.data=true <node-b>
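
Because each labelled node holds a full copy of the dataset, a quick guard against labelling too many nodes may be useful. This is a sketch only: within_label_budget is a hypothetical helper that reads node IDs on stdin, one per line, and compares the count with a limit (default 2).

```shell
#!/bin/sh
# within_label_budget [MAX]
# Reads node IDs on stdin (one per line) and succeeds if there are
# at most MAX of them. Hypothetical helper, not part of the repo.
within_label_budget() {
    max=${1:-2}
    # grep -c exits non-zero when nothing matches; "|| true" keeps the "0".
    count=$(grep -c . || true)
    echo "$count labelled node(s), limit $max"
    [ "$count" -le "$max" ]
}

# Usage on a Swarm manager:
#   docker node ls -q --filter node.label=predictor.data=true | within_label_budget 2
```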

Replicas are spread one-per-node by default (redundancy across both copies). Scaling to multiple replicas per node is safe: they share the node-local volume and coordinate the download with an exclusive flock, so only one process per node fetches the dataset — the others wait and load the committed file. To scale: docker service scale predictor_predictor=4 (≤2 per node).
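
The coordination pattern described above can be illustrated with the flock(1) utility from util-linux. This is a sketch of the general technique, not the predictor's actual code: the file names (.download.lock, dataset.tmp, dataset.bin) and the download_dataset placeholder are assumptions made for the example.

```shell
#!/bin/sh
# Placeholder for the real fetch; writes a marker file so the flow
# can be exercised. Hypothetical, stands in for the GRIB download.
download_dataset() {
    echo "wind data placeholder" > "$1"
}

# ensure_dataset DIR
# Fetches the dataset at most once per directory, even when several
# processes call it at the same time. Prints "fetched" or "loaded".
ensure_dataset() {
    dir=$1
    (
        # Block until this process holds the exclusive lock on fd 9.
        flock -x 9 || exit 1
        if [ -f "$dir/dataset.bin" ]; then
            # Another process already committed the file.
            echo "loaded"
        else
            download_dataset "$dir/dataset.tmp" || exit 1
            # rename within one filesystem is atomic, so readers never
            # observe a half-written dataset.
            mv "$dir/dataset.tmp" "$dir/dataset.bin"
            echo "fetched"
        fi
    ) 9>"$dir/.download.lock"
}
```

The lock is released when the subshell exits and fd 9 is closed, so a crashed downloader does not leave a stale lock behind. Advisory locks like this are one reason the dataset directory should be node-local rather than a network file system.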

Network

The gateway and Prometheus reach the predictor over a shared overlay. Create it once and have the gateway stack join the same external network:

docker network create -d overlay --attachable stratoflights-net

The service is reachable only on that network, under the alias predictor (http://predictor:8080). It has no public Traefik router; the gateway is the edge.

Deploy

Via the CI pipeline (recommended): push a v* tag → the image is built and the stack is deployed through the Swarmpit API. Manually:

TAG=v1.0.0 docker stack deploy -c docker-compose.swarm.yml --with-registry-auth predictor

or import docker-compose.swarm.yml into Swarmpit and set TAG.

Configuration

Every setting can be set through an environment variable; the README describes the precedence between config file, environment, and flags. Production defaults are in docker-compose.swarm.yml:

| Variable (production default) | Purpose |
| --- | --- |
| PREDICTOR_DATA_DIR=/data | node-local dataset dir (bind mount) |
| PREDICTOR_ELEVATION_DATASET=/srv/ruaumoko-dataset | optional terrain data |
| PREDICTOR_SOURCE=gfs-0p50-3h | forecast source; one of gfs-0p50-3h, gfs-0p25-3h, gfs-0p25-1h, gefs-0p50-3h |
| PREDICTOR_DOWNLOAD_PARALLEL=16 | concurrent GRIB downloads |
| PREDICTOR_UPDATE_INTERVAL=6h | forecast refresh cadence |
| PREDICTOR_METRICS_ENABLED=true | expose /metrics |

No Docker secrets are needed — the predictor has no database or credentials.

Health

  • GET /health — liveness (always 200 while the process runs). The container HEALTHCHECK calls the binary's -healthcheck mode (no curl in the image).
  • GET /ready — readiness (200 only once a dataset is loaded). The gateway should gate traffic on this; Swarm does not kill a container that is still performing its first download thanks to the 120s start_period.
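
A script that needs to wait for a fresh replica (for example a smoke test after a deploy) can poll /ready in the same spirit. The loop below is a minimal sketch: wait_ready and probe are hypothetical names, and the attempt count and delay are arbitrary defaults rather than values taken from the stack file.

```shell
#!/bin/sh
# probe URL
# Succeeds only on an HTTP success status: curl -f turns HTTP errors
# (such as a non-200 from /ready) into a non-zero exit code.
probe() {
    curl -fsS -o /dev/null --max-time 5 "$1"
}

# wait_ready URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until probe succeeds or the attempts are used up.
wait_ready() {
    url=$1
    attempts=${2:-60}
    delay=${3:-5}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if probe "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not ready after $attempts attempts: $url" >&2
    return 1
}

# Usage from a container attached to stratoflights-net:
#   wait_ready http://predictor:8080/ready 60 5
```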

Metrics

/metrics exposes Prometheus counters (predictor_predictions_total, predictor_downloads_total, predictor_download_bytes_total) and the predictor_active_dataset_epoch_seconds gauge. The service carries prometheus.scrape/port/path deploy labels for Swarm service discovery; attach your central Prometheus to the stratoflights-net network so it can reach the predictor's scrape target.
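
For a quick manual check without Prometheus, the gauge can be turned into a dataset age. The sketch below rests on two assumptions that the text above does not confirm: that predictor_active_dataset_epoch_seconds holds a Unix timestamp for the active dataset, and that the sample is exposed without labels. dataset_age_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# dataset_age_seconds NOW
# Reads Prometheus text exposition on stdin and prints NOW minus the
# value of predictor_active_dataset_epoch_seconds, in whole seconds.
# Fails if the gauge is absent. Assumes an unlabelled sample line.
dataset_age_seconds() {
    awk -v now="$1" '
        $1 == "predictor_active_dataset_epoch_seconds" {
            printf "%d\n", now - $2
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage, assuming metrics are served on the API port:
#   curl -fsS http://predictor:8080/metrics | dataset_age_seconds "$(date +%s)"
```

An age well beyond the configured PREDICTOR_UPDATE_INTERVAL would suggest that refreshes are failing, although upstream publication delays also add to it.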

CI/CD (Forgejo → Swarmpit)

.forgejo/workflows/ci-cd.yml:

  1. test (every push/PR): gofmt check, go vet, go build, go test -race.
  2. build (develop branch and v* tags): buildx linux/amd64 image pushed to git.intra.yksa.space/web/predictor (:develop, or :<version> + :latest).
  3. deploy-staging (develop) / deploy-production (v* tags): deploy docker-compose.swarm.yml to the environment's Swarmpit stack via deploy/swarmpit-deploy.sh.

Configure runner secrets (scope staging/production via Forgejo environments):

  • REGISTRY_USERNAME, REGISTRY_PASSWORD — container registry
  • SWARMPIT_URL, SWARMPIT_TOKEN, STACK_NAME — Swarmpit deploy target
  • CA_CERTIFICATES — optional PEM bundle if Swarmpit uses a private CA

Cut a release:

git tag v1.0.0 && git push origin v1.0.0

Operations

docker service ls --filter label=com.docker.stack.namespace=predictor
docker service logs -f predictor_predictor
docker service scale predictor_predictor=2          # ≤2 per labelled node
docker service rollback predictor_predictor

Trigger a dataset refresh or inspect jobs through the admin API:

curl -X POST -H 'Content-Type: application/json' http://predictor:8080/api/v1/admin/datasets -d '{"latest":true}'
curl http://predictor:8080/api/v1/admin/jobs
curl http://predictor:8080/api/v1/admin/status