feat: polish & windviz & deploy
parent 81b8e763bd
commit 465ad00f7b
78 changed files with 20622 additions and 2154 deletions
156
DEPLOYMENT.md
Normal file
@@ -0,0 +1,156 @@

# Deploying stratoflights-predictor

The predictor is a single static Go binary with no database and no required
external services. It downloads NOAA GFS/GEFS wind data to **node-local disk**
and serves the REST API (see `/docs` or `api/rest/predictor.swagger.yml`).

It is an **internal backend**: the public entrypoint is the stratoflights API
gateway, which calls the predictor over an internal overlay network. The
predictor enforces no auth of its own.

## Environments

| Environment | File | Notes |
|---|---|---|
| Local dev | `docker-compose.yml` | one instance, metrics off, named volume |
| Staging (single host) | `docker-compose.staging.yml` | all features + bundled Prometheus |
| Production (Swarm) | `docker-compose.swarm.yml` | node-pinned, replicated, metrics |

```bash
# Local
docker compose up --build
curl localhost:8080/ready

# Staging (single host, exercises the metrics pipeline)
docker compose -f docker-compose.staging.yml up --build
# Prometheus at :9090, predictor target should be UP

# Production: see below
```

## Production (Docker Swarm)
### Storage and node placement — the important part

The wind dataset is ~8.9 GiB (0.5°) and must live on **local disk, never NFS**.
To bound the number of copies, the service is pinned to nodes carrying the
`predictor.data=true` label; **label at most two nodes**. Each labelled node
keeps exactly one copy under a node-local bind mount.

On **each** labelled node, create the local directories and make them writable
by the non-root container user (uid:gid `65532:65532`):

```bash
sudo mkdir -p /srv/predictor/data /srv/predictor/elevation
sudo chown -R 65532:65532 /srv/predictor
# (optional) seed the elevation dataset so descent terminates at ground level:
#   python3 scripts/build_elevation.py /srv/predictor/elevation/ruaumoko-dataset
```
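
The provisioning requirements above (correct owner, local filesystem rather than NFS) can be checked before deploying. The following is a minimal Python sketch and not part of the repository: the helper names `mount_fstype` and `check_data_dir` are hypothetical, and it assumes a Linux host where `/proc/mounts` lists the mounted filesystems. It returns a list of problems, which is empty when the directory looks usable.

```python
import os

# Filesystem types treated as "not local" (assumption: only NFS matters here).
NFS_TYPES = {"nfs", "nfs4"}


def mount_fstype(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    path = os.path.normpath(path) + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later entry for the same mount point win (overmounts).
        if path.startswith(prefix) and len(prefix) >= len(best_mount):
            best_mount, best_type = prefix, fstype
    return best_type


def check_data_dir(path, uid=65532, gid=65532, mounts_text=None):
    """Return a list of human-readable problems; an empty list means OK."""
    if mounts_text is None:
        with open("/proc/mounts") as f:
            mounts_text = f.read()
    problems = []
    st = os.stat(path)
    if (st.st_uid, st.st_gid) != (uid, gid):
        problems.append(f"{path}: owner is {st.st_uid}:{st.st_gid}, want {uid}:{gid}")
    fstype = mount_fstype(os.path.realpath(path), mounts_text)
    if fstype in NFS_TYPES:
        problems.append(f"{path}: lives on {fstype}, want a local filesystem")
    return problems


if __name__ == "__main__":
    for d in ("/srv/predictor/data", "/srv/predictor/elevation"):
        for problem in check_data_dir(d):
            print(problem)
```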

Label the two storage nodes:

```bash
docker node update --label-add predictor.data=true <node-a>
docker node update --label-add predictor.data=true <node-b>
```

Replicas are spread one-per-node by default (redundancy across both copies).
Scaling to multiple replicas **per** node is safe: they share the node-local
volume and coordinate the download with an exclusive `flock`, so only one
process per node fetches the dataset while the others wait and then load the
committed file. To scale: `docker service scale predictor_predictor=4` (keep it
to at most 2 replicas per node).

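
The download coordination described above follows a common pattern: take an exclusive lock, re-check whether the file already exists, and otherwise download to a temporary name and rename it into place. The sketch below illustrates that pattern in Python; it is not the predictor's Go implementation, and the lock file name `.download.lock`, the `.partial` suffix, and the `fetch` callback are assumptions made for illustration.

```python
import fcntl
import os


def ensure_dataset(data_dir, name, fetch):
    """Return the path of the dataset, downloading it at most once per node.

    fetch(out) is a caller-supplied function that writes the dataset bytes
    to the open binary file object `out`.
    """
    final_path = os.path.join(data_dir, name)
    lock_path = os.path.join(data_dir, ".download.lock")
    with open(lock_path, "w") as lock:
        # Blocks until no other process on this node holds the lock.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            # Re-check under the lock: another process may have finished
            # the download while this one was waiting.
            if not os.path.exists(final_path):
                tmp_path = final_path + ".partial"
                with open(tmp_path, "wb") as out:
                    fetch(out)
                # Atomic rename: readers only ever see a complete file.
                os.replace(tmp_path, final_path)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return final_path
```

Because `flock` is advisory and only meaningful on a filesystem that implements it reliably, this pattern fits the node-local bind mount and is one more reason to keep the data directory off NFS.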
### Network

The gateway and Prometheus reach the predictor over a shared overlay. Create it
once and have the gateway stack join the same external network:

```bash
docker network create -d overlay --attachable stratoflights-net
```

The service is published only on that network under the alias `predictor`
(`http://predictor:8080`). There is no public Traefik router; the gateway is
the edge.

### Deploy

Via the CI pipeline (recommended): push a `v*` tag, and the image is built and
the stack is deployed through the Swarmpit API. To deploy manually:

```bash
TAG=v1.0.0 docker stack deploy -c docker-compose.swarm.yml --with-registry-auth predictor
```

Alternatively, import `docker-compose.swarm.yml` into Swarmpit and set `TAG`.

### Configuration

All settings can be supplied as environment variables (config file, environment
and flags are all supported; see the README for precedence). Production
defaults are in `docker-compose.swarm.yml`:

| Variable (production default) | Purpose |
|---|---|
| `PREDICTOR_DATA_DIR=/data` | node-local dataset dir (bind mount) |
| `PREDICTOR_ELEVATION_DATASET=/srv/ruaumoko-dataset` | optional terrain data |
| `PREDICTOR_SOURCE=gfs-0p50-3h` | one of `gfs-0p50-3h`, `gfs-0p25-3h`, `gfs-0p25-1h`, `gefs-0p50-3h` |
| `PREDICTOR_DOWNLOAD_PARALLEL=16` | concurrent GRIB downloads |
| `PREDICTOR_UPDATE_INTERVAL=6h` | forecast refresh cadence |
| `PREDICTOR_METRICS_ENABLED=true` | expose `/metrics` |

No Docker secrets are needed because the predictor has no database or
credentials.

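
Layered configuration of this kind is usually resolved by looking a key up in each source in priority order. The sketch below shows the idea in Python. It assumes that flags override environment variables, which override the config file, which overrides built-in defaults; the README is the authority on the actual order. The function name `effective_config` is hypothetical, and all sources are keyed by the environment variable name purely for simplicity.

```python
from collections import ChainMap

# Defaults copied from the production table; the layering order is an assumption.
DEFAULTS = {
    "PREDICTOR_DATA_DIR": "/data",
    "PREDICTOR_SOURCE": "gfs-0p50-3h",
    "PREDICTOR_UPDATE_INTERVAL": "6h",
}


def effective_config(flags, env, file_cfg, defaults=DEFAULTS):
    """Merge config sources; earlier sources win (flags > env > file > defaults)."""
    # Ignore unrelated environment variables such as HOME or PATH.
    env_cfg = {k: v for k, v in env.items() if k.startswith("PREDICTOR_")}
    # ChainMap searches its mappings left to right and returns the first hit.
    return dict(ChainMap(flags, env_cfg, file_cfg, defaults))


cfg = effective_config(
    flags={"PREDICTOR_SOURCE": "gfs-0p25-1h"},
    env={"PREDICTOR_SOURCE": "gefs-0p50-3h", "PREDICTOR_UPDATE_INTERVAL": "3h"},
    file_cfg={"PREDICTOR_UPDATE_INTERVAL": "12h"},
)
print(cfg["PREDICTOR_SOURCE"], cfg["PREDICTOR_UPDATE_INTERVAL"], cfg["PREDICTOR_DATA_DIR"])
```

Under the stated assumption, the flag value wins for `PREDICTOR_SOURCE`, the environment wins over the file for `PREDICTOR_UPDATE_INTERVAL`, and `PREDICTOR_DATA_DIR` falls back to its default.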
### Health

- `GET /health`: liveness (always 200 while the process runs). The container
  `HEALTHCHECK` calls the binary's `-healthcheck` mode (no curl in the image).
- `GET /ready`: readiness (200 only once a dataset is loaded). The gateway
  should gate traffic on this. Thanks to the 120s `start_period`, Swarm does
  **not** kill a container that is still performing its first download.

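
A caller that wants to gate traffic on readiness can poll `GET /ready` until it returns 200. The following Python sketch shows one way to do that using only the standard library. The function name `wait_until_ready` and the timeout and interval values are assumptions; the only behaviour taken from this document is that `/ready` answers 200 once a dataset is loaded.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0):
    """Poll base_url + "/ready" until it returns 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(base_url + "/ready", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Non-2xx answers raise HTTPError (a URLError subclass); refused
            # connections and timeouts also land here. Treat all as "not yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    ok = wait_until_ready("http://predictor:8080")
    print("ready" if ok else "gave up waiting")
```

A generous timeout is sensible here, since the first start has to download the full dataset before `/ready` can succeed.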
### Metrics

`/metrics` exposes Prometheus counters (`predictor_predictions_total`,
`predictor_downloads_total`, `predictor_download_bytes_total`) and the
`predictor_active_dataset_epoch_seconds` gauge. The service carries
`prometheus.scrape/port/path` deploy labels for Swarm service discovery; point
your central Prometheus at the `stratoflights-net` network.

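
The `predictor_active_dataset_epoch_seconds` gauge lends itself to a staleness check. The Python sketch below parses a `/metrics` payload in the Prometheus text format and computes how old the active dataset is. It assumes the gauge is exposed without labels and holds a Unix timestamp in seconds, as its name suggests; the helper names `read_gauge` and `dataset_age_seconds` are hypothetical.

```python
import time


def read_gauge(metrics_text, name):
    """Return the value of an unlabelled sample called `name`, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def dataset_age_seconds(metrics_text, now=None):
    """Seconds since the active dataset's epoch, or None if not exposed."""
    epoch = read_gauge(metrics_text, "predictor_active_dataset_epoch_seconds")
    if epoch is None:
        return None
    if now is None:
        now = time.time()
    return now - epoch


sample = """\
# TYPE predictor_active_dataset_epoch_seconds gauge
predictor_active_dataset_epoch_seconds 1.7e+09
predictor_predictions_total 42
"""
print(dataset_age_seconds(sample, now=1.7e9 + 3600))  # 3600.0
```

With `PREDICTOR_UPDATE_INTERVAL=6h`, an age well beyond six hours would be a reasonable signal that refreshes are failing, though the right threshold depends on how the gauge is defined.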
## CI/CD (Forgejo → Swarmpit)

`.forgejo/workflows/ci-cd.yml`:

1. **test** (every push/PR): `gofmt` check, `go vet`, `go build`, `go test -race`.
2. **build** (develop branch and `v*` tags): buildx `linux/amd64` image pushed to
   `git.intra.yksa.space/web/predictor` (`:develop`, or `:<version>` + `:latest`).
3. **deploy-staging** (develop) / **deploy-production** (`v*` tags): deploy
   `docker-compose.swarm.yml` to the environment's Swarmpit stack via
   `deploy/swarmpit-deploy.sh`.
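
The tagging rule in the build step can be summarised as a small mapping from git ref to image references. The Python sketch below is illustrative only and is not taken from the workflow file: the function name `image_refs` is hypothetical, and it assumes that `:<version>` is the git tag name used verbatim (for example `v1.0.0`, matching the `TAG=v1.0.0` used for manual deploys). Any other ref produces no image.

```python
IMAGE = "git.intra.yksa.space/web/predictor"


def image_refs(git_ref):
    """Map a full git ref to the image references the build step would push."""
    if git_ref == "refs/heads/develop":
        tags = ["develop"]
    elif git_ref.startswith("refs/tags/v"):
        # Assumption: the version tag is the git tag name, unchanged.
        tags = [git_ref[len("refs/tags/"):], "latest"]
    else:
        tags = []  # other branches are tested but not built
    return [f"{IMAGE}:{tag}" for tag in tags]


for ref in ("refs/heads/develop", "refs/tags/v1.0.0", "refs/heads/feature-x"):
    print(ref, "->", image_refs(ref))
```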

Configure runner secrets (scope staging/production via Forgejo environments):

- `REGISTRY_USERNAME`, `REGISTRY_PASSWORD`: container registry
- `SWARMPIT_URL`, `SWARMPIT_TOKEN`, `STACK_NAME`: Swarmpit deploy target
- `CA_CERTIFICATES`: optional PEM bundle if Swarmpit uses a private CA

Cut a release:

```bash
git tag v1.0.0 && git push origin v1.0.0
```

## Operations

```bash
docker service ls --filter label=com.docker.stack.namespace=predictor
docker service logs -f predictor_predictor
docker service scale predictor_predictor=2   # ≤2 per labelled node
docker service rollback predictor_predictor
```

Trigger a dataset refresh or inspect jobs through the admin API:

```bash
# Send the body as JSON; curl -d alone would label it as form data.
curl -X POST http://predictor:8080/api/v1/admin/datasets \
  -H 'Content-Type: application/json' -d '{"latest":true}'
curl http://predictor:8080/api/v1/admin/jobs
curl http://predictor:8080/api/v1/admin/status
```
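
The same admin calls can be scripted from Python with the standard library. This is a minimal sketch: the endpoint paths and the `{"latest": true}` body come from the commands above, while the helper names, the JSON `Content-Type` header, and the expectation that the endpoints answer with JSON are assumptions to verify against `api/rest/predictor.swagger.yml`.

```python
import json
import urllib.request

BASE_URL = "http://predictor:8080"  # overlay-network alias from this document


def refresh_request(base_url=BASE_URL):
    """Build (but do not send) the POST that asks for the latest dataset."""
    body = json.dumps({"latest": True}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v1/admin/datasets",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_json(path, base_url=BASE_URL, opener=urllib.request.urlopen):
    """GET base_url + path and decode the response, assuming it is JSON."""
    with opener(base_url + path, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    with urllib.request.urlopen(refresh_request(), timeout=10) as resp:
        print("refresh requested:", resp.status)
    print(get_json("/api/v1/admin/jobs"))
    print(get_json("/api/v1/admin/status"))
```

Since the predictor enforces no auth of its own, a script like this only works from inside `stratoflights-net`, and that network boundary is the only thing protecting the admin endpoints.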