Deploying stratoflights-predictor

The predictor is a single static Go binary with no database and no required external services. It downloads NOAA GFS/GEFS wind data to node-local disk and serves the REST API (see /docs or api/rest/predictor.swagger.yml).

It is an internal backend: the public entrypoint is the stratoflights API gateway, which calls the predictor over an internal overlay network. The predictor enforces no auth of its own.

Environments

Environment	File	Notes
Local dev	`docker-compose.yml`	one instance, metrics off, named volume
Staging (single host)	`docker-compose.staging.yml`	all features + bundled Prometheus
Production (Swarm)	`docker-compose.swarm.yml`	node-pinned, replicated, metrics

# Local
docker compose up --build
curl localhost:8080/ready

# Staging (single host, exercises the metrics pipeline)
docker compose -f docker-compose.staging.yml up --build
# Prometheus at :9090, predictor target should be UP

# Production — see below

Production (Docker Swarm)

Storage and node placement — the important part

The wind dataset is ~8.9 GiB (0.5°) and must live on local disk, never NFS. To bound the number of copies, the service is pinned to nodes carrying the predictor.data=true label; label at most two nodes. Each labelled node keeps exactly one copy under a node-local bind mount.

On each labelled node, create the local directories and make them writable by the non-root container user (uid:gid 65532:65532):

sudo mkdir -p /srv/predictor/data /srv/predictor/elevation
sudo chown -R 65532:65532 /srv/predictor
# (optional) seed the elevation dataset so descent terminates at ground level:
#   python3 scripts/build_elevation.py /srv/predictor/elevation/ruaumoko-dataset

Label the two storage nodes:

docker node update --label-add predictor.data=true <node-a>
docker node update --label-add predictor.data=true <node-b>
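
Because the label bounds how many dataset copies exist, it can be worth checking the count after labelling. The following is a minimal sketch and not part of the repository: `assert_max_nodes` is a hypothetical helper that counts node IDs on stdin and fails when more nodes than allowed carry the label.

```shell
# Hypothetical helper: fail if more than MAX non-empty lines (node IDs) arrive on stdin.
assert_max_nodes() {
  local max="$1"
  local count
  # Count non-empty lines; "n + 0" prints 0 when stdin is empty.
  count=$(awk 'NF { n++ } END { print n + 0 }')
  if [ "$count" -gt "$max" ]; then
    echo "too many labelled nodes: $count (max $max)" >&2
    return 1
  fi
  echo "$count labelled node(s)"
}
```

One possible use is `docker node ls -q --filter node.label=predictor.data=true | assert_max_nodes 2`, which lists only the nodes carrying the label and checks the limit of two.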

By default, replicas are spread one per node, so each dataset copy backs one replica and losing a node leaves the other copy serving. Running multiple replicas per node is also safe: they share the node-local bind mount and coordinate the download with an exclusive flock, so only one process per node fetches the dataset while the others wait and then load the committed file. To scale, run docker service scale predictor_predictor=4 and keep to at most two replicas per node.
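
The download coordination happens inside the predictor binary, but the pattern can be illustrated in shell. This is a rough analogue under stated assumptions, not the predictor's actual code: it relies on `flock(1)` from util-linux, and the file names `dataset.bin` and `.download.lock` are invented for the example.

```shell
# Sketch only: dataset.bin and .download.lock are hypothetical names.
# Usage: fetch_once <dir> <command that writes the dataset to stdout>
fetch_once() {
  local dir="$1"; shift
  local target="$dir/dataset.bin"
  (
    flock -x 9 || exit 1               # block until this process holds the lock
    if [ -s "$target" ]; then          # a peer already committed the file
      echo "reuse"
      exit 0
    fi
    # Write to a temp file, then rename so readers never see a partial file.
    if "$@" > "$target.tmp" && mv "$target.tmp" "$target"; then
      echo "downloaded"
    else
      rm -f "$target.tmp"
      exit 1
    fi
  ) 9> "$dir/.download.lock"
}
```

The first caller to take the lock downloads and commits the file. Every later caller blocks on the lock, finds the committed file, and reuses it, which is why extra replicas on a node do not multiply downloads.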

Network

The gateway and Prometheus reach the predictor over a shared overlay. Create it once and have the gateway stack join the same external network:

docker network create -d overlay --attachable stratoflights-net

The service is published only on that network under the alias predictor (http://predictor:8080). No public Traefik router — the gateway is the edge.

Deploy

The recommended path is the CI pipeline: pushing a v* tag builds the image and deploys the stack through the Swarmpit API. To deploy manually:

TAG=v1.0.0 docker stack deploy -c docker-compose.swarm.yml --with-registry-auth predictor

or import docker-compose.swarm.yml into Swarmpit and set TAG.
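
For manual production deploys it can help to check TAG before calling docker stack deploy. The guard below is a minimal sketch: `require_release_tag` is a hypothetical helper, and the assumption that production tags always start with "v" is taken from the v* tag convention used by the CI pipeline.

```shell
# Hypothetical guard: accept only non-empty tags of the form v<something>.
require_release_tag() {
  case "${1:-}" in
    v?*) return 0 ;;
    *)
      echo "TAG must look like v1.0.0, got '${1:-}'" >&2
      return 1
      ;;
  esac
}
```

A manual deploy could then read `require_release_tag "$TAG" && docker stack deploy -c docker-compose.swarm.yml --with-registry-auth predictor`, with TAG exported beforehand.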

Configuration

Every setting can be supplied as an environment variable; the README documents the precedence among config file, environment, and flags. Production defaults are in docker-compose.swarm.yml:

Variable (production default)	Purpose
`PREDICTOR_DATA_DIR=/data`	node-local dataset dir (bind mount)
`PREDICTOR_ELEVATION_DATASET=/srv/ruaumoko-dataset`	optional terrain data
`PREDICTOR_SOURCE=gfs-0p50-3h`	`gfs-0p50-3h`, `gfs-0p25-3h`, `gfs-0p25-1h`, `gefs-0p50-3h`
`PREDICTOR_DOWNLOAD_PARALLEL=16`	concurrent GRIB downloads
`PREDICTOR_UPDATE_INTERVAL=6h`	forecast refresh cadence
`PREDICTOR_METRICS_ENABLED=true`	expose `/metrics`

No Docker secrets are needed — the predictor has no database or credentials.

Health

GET /health — liveness (always 200 while the process runs). The container HEALTHCHECK calls the binary's -healthcheck mode (no curl in the image).
GET /ready — readiness (200 only once a dataset is loaded). The gateway should gate traffic on this; Swarm does not kill a container that is still performing its first download thanks to the 120s start_period.
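
A deploy script or smoke test can gate on readiness in the same way the gateway does. The helper below is a minimal sketch: `wait_ready` is a hypothetical name, and the probe command is passed in as arguments so the function makes no assumption about how readiness is checked.

```shell
# Hypothetical helper: retry a probe command until it succeeds or the timeout expires.
# Usage: wait_ready <timeout_seconds> <interval_seconds> <probe command...>
wait_ready() {
  local timeout="$1"
  local interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    # Give up once the deadline has passed; otherwise pause and probe again.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

For example, `wait_ready 600 5 curl -fsS -o /dev/null http://predictor:8080/ready` probes every five seconds for up to ten minutes; `curl -f` turns a non-2xx response into a failing exit code, so the loop continues until /ready returns 200. The timeout values are illustrative and should reflect how long the first download takes in your environment.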

Metrics

/metrics exposes Prometheus counters (predictor_predictions_total, predictor_downloads_total, predictor_download_bytes_total) and the predictor_active_dataset_epoch_seconds gauge. The service carries prometheus.scrape/port/path deploy labels for Swarm service discovery; attach your central Prometheus to the stratoflights-net network so it can reach the predictor and scrape it.
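
The gauge lends itself to a quick staleness check. The sketch below assumes, based only on the metric name, that predictor_active_dataset_epoch_seconds holds a Unix timestamp in seconds and is exposed without labels; `dataset_age_seconds` is a hypothetical helper that reads the exposition text on stdin.

```shell
# Hypothetical helper: print the dataset age in seconds, given "now" as a Unix timestamp.
# Assumes the gauge is unlabelled and holds Unix seconds (inferred from its name).
dataset_age_seconds() {
  local now="$1"
  awk -v now="$now" '
    # "# HELP" and "# TYPE" lines have "#" as the first field, so they are skipped.
    $1 == "predictor_active_dataset_epoch_seconds" {
      printf "%d\n", now - $2
      found = 1
    }
    END { exit !found }   # non-zero exit when the gauge is missing
  '
}
```

A possible invocation is `curl -fsS http://predictor:8080/metrics | dataset_age_seconds "$(date +%s)"`. Comparing the result against the refresh cadence (PREDICTOR_UPDATE_INTERVAL) gives a rough indication of whether refreshes are keeping up.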

CI/CD (Forgejo → Swarmpit)

.forgejo/workflows/ci-cd.yml:

test (every push/PR): gofmt check, go vet, go build, go test -race.
build (develop branch and v* tags): buildx linux/amd64 image pushed to git.intra.yksa.space/web/predictor (:develop, or :<version> + :latest).
deploy-staging (develop) / deploy-production (v* tags): deploy docker-compose.swarm.yml to the environment's Swarmpit stack via deploy/swarmpit-deploy.sh.

Configure runner secrets (scope staging/production via Forgejo environments):

REGISTRY_USERNAME, REGISTRY_PASSWORD — container registry
SWARMPIT_URL, SWARMPIT_TOKEN, STACK_NAME — Swarmpit deploy target
CA_CERTIFICATES — optional PEM bundle if Swarmpit uses a private CA

Cut a release:

git tag v1.0.0 && git push origin v1.0.0

Operations

docker service ls --filter label=com.docker.stack.namespace=predictor
docker service logs -f predictor_predictor
docker service scale predictor_predictor=2          # ≤2 per labelled node
docker service rollback predictor_predictor

Trigger a dataset refresh or inspect jobs through the admin API:

curl -X POST -H 'Content-Type: application/json' -d '{"latest":true}' http://predictor:8080/api/v1/admin/datasets
curl http://predictor:8080/api/v1/admin/jobs
curl http://predictor:8080/api/v1/admin/status
