Automated end-to-end health monitoring for the FLAME federated learning and analysis platform (staging cluster).
Every 30 minutes, a GitHub Action runs a complete federated analysis round against the FLAME Hub — from login to result retrieval — and publishes the outcome to a static status page via GitHub Pages.
Maintainer: Jules Kreuer
Each run executes flame_health_check.py. After a shared login, it
runs a separate minimal analysis per compute node — each one pairing the aggregator with a
single node (aggregator-1 + default-1, aggregator-1 + default-2, …) — in parallel, each
gated by an online pre-check. This yields an independent up/down verdict and latency for every
node, instead of one all-or-nothing round that hangs if any node is down.
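
The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in flame_health_check.py: `is_online`, `run_pair`, `check_node`, and `check_all` are hypothetical names, the node list is an illustrative subset, and the pre-check is stubbed so that one node appears offline.

```python
from concurrent.futures import ThreadPoolExecutor
import time

AGGREGATOR = "aggregator-1"          # name taken from the text above
NODES = ["default-1", "default-2"]   # illustrative subset of compute nodes


def is_online(node):
    """Hypothetical pre-check stub; a real one would ask the Hub."""
    return node != "default-2"  # pretend default-2 is offline


def run_pair(aggregator, node):
    """Hypothetical stand-in for one minimal analysis round.

    Returns the elapsed time in seconds; a real run would upload,
    distribute, execute, and fetch results here.
    """
    start = time.monotonic()
    return time.monotonic() - start


def check_node(node):
    """Return (node, verdict, latency) for a single aggregator + node pair."""
    if not is_online(node):
        # Offline nodes are recorded as down without spending a run on them.
        return node, "down", None
    try:
        return node, "up", run_pair(AGGREGATOR, node)
    except Exception:
        return node, "down", None


def check_all(nodes):
    """Run all pair checks in parallel and collect per-node verdicts."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        return {n: (verdict, latency) for n, verdict, latency in pool.map(check_node, nodes)}


results = check_all(NODES)
```

Because each pair runs in its own worker, a node that hangs or fails only affects its own entry in `results`.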
The results are merged into six step cards and a per-node section:
| Check | What it verifies | Timeout |
|---|---|---|
| login | Authentication against the FLAME Hub and basic API access (node listing). Shared, once. | 10 s |
| upload | Per pair: analysis creation, code bucket provisioning, and upload of the test script (`flame_checks/00_test_connection.py`) as entrypoint. | 60 s |
| distribute | Per pair: analysis image build and distribution to the paired nodes. | 60 s |
| execute | Per pair: execution of the analysis on the paired nodes. | 120 s |
| results | Per pair: download of the result tarball and confirmation that both paired nodes reported ok. | none |
| latency | Per pair: E2E duration stays below 300 s. | 300 s |
The five pipeline step cards (upload…latency) are aggregated across the parallel pair runs
(status merged, duration averaged). The per-node cards at the bottom show each node's own up/down
and latency; aggregator-1 counts as up if any of its pairs succeeds. The overall badge is the
aggregation of the node verdicts: all up → operational, some down → partial, all down or login
failing → major outage.
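
As a rough sketch of the badge rules above (the function names and the shape of `pair_results` are assumptions for illustration, not taken from flame_health_check.py):

```python
def node_verdicts(pair_results, aggregator="aggregator-1"):
    """Derive per-node verdicts from pair outcomes.

    pair_results maps a compute node name to True/False, meaning the
    aggregator + node pair run succeeded or not (assumed input shape).
    """
    verdicts = {node: ("up" if ok else "down") for node, ok in pair_results.items()}
    # The aggregator counts as up if any of its pairs succeeded.
    verdicts[aggregator] = "up" if any(pair_results.values()) else "down"
    return verdicts


def overall_badge(login_ok, verdicts):
    """Map the login result and node verdicts to the overall status."""
    states = list(verdicts.values())
    if not login_ok or all(s == "down" for s in states):
        return "major outage"
    if all(s == "up" for s in states):
        return "operational"
    return "partial"


verdicts = node_verdicts({"default-1": True, "default-2": False})
badge = overall_badge(True, verdicts)
```

With one of two nodes failing, the aggregator is still up through its other pair, so the badge comes out as partial.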
If a step fails within a pair run, that run's subsequent steps are recorded as unknown (shown as
"no data" on the page, excluded from uptime statistics); an offline node is recorded down without
spending a run on it.
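
The failure cascade within one pair run might look like the sketch below. The status strings and the `run_steps` helper are assumptions made for illustration; only the step names and the "unknown" marker come from the text.

```python
STEPS = ["upload", "distribute", "execute", "results", "latency"]


def run_steps(step_funcs):
    """Run the pipeline steps in order for one pair.

    step_funcs maps a step name to a zero-argument callable that returns
    True on success (hypothetical interface). After the first failure,
    every later step is recorded as "unknown" instead of being run.
    """
    statuses = {}
    failed = False
    for step in STEPS:
        if failed:
            statuses[step] = "unknown"
            continue
        try:
            ok = bool(step_funcs[step]())
        except Exception:
            ok = False
        statuses[step] = "up" if ok else "down"
        failed = not ok
    return statuses


# Example: distribution fails, so everything after it has no data.
funcs = {step: (lambda: True) for step in STEPS}
funcs["distribute"] = lambda: False
statuses = run_steps(funcs)
```

Recording "unknown" rather than "down" keeps a single upstream failure from counting against the uptime of the steps that never ran.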
Results are appended as date, status, duration lines to docs/logs/<check>_report.log (capped at 2000 lines, ≈ 40 days at 30-minute intervals) and committed back to the repository. The frontend (docs/index.html / docs/index.js) renders the last 30 days per check, including run durations.
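
A capped append of this kind could be implemented along these lines. The exact line format (timestamp layout, separator, duration precision) and the `append_report` helper are assumptions; only the 2000-line cap and the date, status, duration fields come from the description above.

```python
from datetime import datetime, timezone
from pathlib import Path

MAX_LINES = 2000  # cap stated above


def append_report(log_path, status, duration_s, now=None):
    """Append one 'date, status, duration' line and trim to MAX_LINES."""
    now = now or datetime.now(timezone.utc)
    path = Path(log_path)
    lines = path.read_text().splitlines() if path.exists() else []
    # Assumed line layout; the real log format may differ.
    lines.append(f"{now:%Y-%m-%d %H:%M}, {status}, {duration_s:.1f}")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep only the newest MAX_LINES entries.
    path.write_text("\n".join(lines[-MAX_LINES:]) + "\n")
```

At 48 runs per day, 2000 lines hold a little under 42 days of history, which comfortably covers the 30 days the frontend renders.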
- Repository secrets (Settings → Secrets and variables → Actions): `FLAME_USERNAME` / `FLAME_PASSWORD`, the Hub credentials (required). The Hub endpoints (`*.staging.privateaim.net`) are hardcoded at the top of `flame_health_check.py`.
- GitHub Pages: Settings → Pages → deploy from the `main` branch, `/docs` folder.
- Optionally adjust `TARGET_NODE_NAMES` and `PROJECT_NAME` in `flame_health_check.py`, and the report cards / console link in `CONFIG.reports` / `CONFIG.consoleUrl` in `docs/constants.js`.
- Trigger a first run manually via the Scheduled Health Check workflow (`workflow_dispatch`).
Maintenance or incident notices can be posted by editing docs/messages.json (e.g. directly in the GitHub web editor). Each entry is rendered as a banner above the status cards, newest first:
```json
[
  {
    "date": "2026-06-12",
    "type": "maintenance",
    "title": "Hub upgrade",
    "text": "Staging cluster will be unavailable June 12, 09:00-11:00 CEST."
  }
]
```

`type` controls the banner accent: `info` (blue), `maintenance` (yellow), `incident` (red). Remove entries (or set the file to `[]`) to clear the page.
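
Since the file is edited by hand, a quick sanity check before committing can catch typos. The sketch below is a hypothetical helper, not part of the repository; it assumes all four keys shown in the example are required, which the frontend may not actually enforce.

```python
import json

ALLOWED_TYPES = {"info", "maintenance", "incident"}
REQUIRED_KEYS = {"date", "type", "title", "text"}  # assumed from the example entry


def validate_messages(raw):
    """Return a list of human-readable problems found in a messages.json string."""
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(entries, list):
        return ["top level must be a list"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be an object")
            continue
        missing = sorted(REQUIRED_KEYS - entry.keys())
        if missing:
            problems.append(f"entry {i}: missing {', '.join(missing)}")
        if "type" in entry and entry["type"] not in ALLOWED_TYPES:
            problems.append(f"entry {i}: unknown type {entry['type']!r}")
    return problems
```

An empty return value means the file passed these checks; an empty list `[]` as file content is valid and clears the banners.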
```bash
pip install -r requirements.txt
export FLAME_USERNAME=... FLAME_PASSWORD=...
python flame_health_check.py
```

Frontend and status-page concept forked from statsig-io/statuspage.