PrivateAIM/status

FLAME E2E Status Page

Automated end-to-end health monitoring for the FLAME federated learning and analysis platform (staging cluster).

Every 30 minutes, a GitHub Action runs a complete federated analysis round against the FLAME Hub — from login to result retrieval — and publishes the outcome to a static status page via GitHub Pages.

Maintainer: Jules Kreuer

What is checked

Each run executes flame_health_check.py. After a shared login, it runs a separate minimal analysis per compute node — each one pairing the aggregator with a single node (aggregator-1 + default-1, aggregator-1 + default-2, …) — in parallel, each gated by an online pre-check. This yields an independent up/down verdict and latency for every node, instead of one all-or-nothing round that hangs if any node is down.
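The per-node strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code in flame_health_check.py: `check_pair`, `check_all`, and the `is_online` / `run_analysis` callables are hypothetical names standing in for the real pre-check and analysis round, and recording no duration for an offline node is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor

AGGREGATOR = "aggregator-1"


def check_pair(node, is_online, run_analysis):
    """Return (node, status, duration_s) for one aggregator + node pair."""
    if not is_online(node):
        # Pre-check failed: record the node as down without starting a run.
        return node, "down", None
    start = time.monotonic()
    try:
        ok = run_analysis(AGGREGATOR, node)
    except Exception:
        # Any error inside the round counts as a failed pair.
        ok = False
    return node, ("up" if ok else "down"), time.monotonic() - start


def check_all(nodes, is_online, run_analysis):
    """Run every pair concurrently and collect one verdict per node."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        results = pool.map(
            lambda n: check_pair(n, is_online, run_analysis), nodes
        )
        return {node: (status, duration) for node, status, duration in results}
```

Because each pair is evaluated on its own, a failure in one pair only affects that node's verdict.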

The results are merged into six step cards and a per-node section:

| Check | What it verifies | Timeout |
|---|---|---|
| login | Authentication against the FLAME Hub and basic API access (node listing). Shared, once. | 10 s |
| upload | Per pair: analysis creation, code bucket provisioning, and upload of the test script (flame_checks/00_test_connection.py) as entrypoint. | 60 s |
| distribute | Per pair: analysis image build and distribution to the paired nodes. | 60 s |
| execute | Per pair: execution of the analysis on the paired nodes. | 120 s |
| results | Per pair: download of the result tarball and confirmation that both paired nodes reported ok. | |
| latency | Per pair: E2E duration stays below 300 s. | 300 s |

The five pipeline step cards (upload through latency) are aggregated across the parallel pair runs (status merged, duration averaged). The per-node cards at the bottom show each node's own up/down verdict and latency; aggregator-1 counts as up if any of its pairs succeeds. The overall badge is the aggregation of the node verdicts: all up → operational, some down → partial, all down or login failing → major outage.
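The badge rule above can be written as a small function. This is a sketch under assumptions: the function names and the dictionary inputs are hypothetical, and treating an empty node list as a major outage is a choice made here, not something the README specifies.

```python
def aggregator_up(pair_ok):
    """aggregator-1 counts as up if any of its pair runs succeeded."""
    return any(pair_ok.values())


def overall_status(login_ok, node_up):
    """Map the login result and per-node verdicts to the overall badge.

    node_up maps a node name to True (up) or False (down).
    """
    if not login_ok or not any(node_up.values()):
        return "major outage"
    if all(node_up.values()):
        return "operational"
    return "partial"
```

For example, `overall_status(True, {"default-1": True, "default-2": False})` yields `"partial"`.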

If a step fails within a pair run, that run's subsequent steps are recorded as unknown (shown as "no data" on the page, excluded from uptime statistics); an offline node is recorded down without spending a run on it.
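A minimal sketch of this failure handling, assuming each step is a callable that returns True on success. The names `run_steps` and `uptime` and the "up"/"down" status labels are illustrative; only the "unknown" status is taken from the description above.

```python
STEPS = ["upload", "distribute", "execute", "results", "latency"]


def run_steps(step_funcs):
    """Run the per-pair steps in order; after a failure, mark the rest unknown."""
    statuses = {}
    failed = False
    for step in STEPS:
        if failed:
            # Not executed: shown as "no data" on the page.
            statuses[step] = "unknown"
            continue
        try:
            ok = step_funcs[step]()
        except Exception:
            ok = False
        statuses[step] = "up" if ok else "down"
        failed = not ok
    return statuses


def uptime(statuses):
    """Fraction of 'up' entries, ignoring 'unknown'; None if nothing is known."""
    known = [s for s in statuses if s != "unknown"]
    if not known:
        return None
    return sum(s == "up" for s in known) / len(known)
```

Excluding unknown entries keeps a single early failure from being counted several times in the statistics of the later steps.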

Results are appended as date, status, duration lines to docs/logs/<check>_report.log (capped at 2000 lines, ≈ 40 days at 30-minute intervals) and committed back to the repository. The frontend (docs/index.html / docs/index.js) renders the last 30 days per check, including run durations.
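Appending to a capped log can be sketched as below. The timestamp format, the `append_result` name, and the rewrite-the-whole-file approach are assumptions for illustration; the actual script may format and trim its logs differently.

```python
from datetime import datetime, timezone
from pathlib import Path

MAX_LINES = 2000  # 2000 runs / 48 runs per day is roughly 41 days


def append_result(log_path, status, duration_s, now=None, max_lines=MAX_LINES):
    """Append one 'date, status, duration' line and keep only the newest lines."""
    now = now or datetime.now(timezone.utc)
    line = f"{now.strftime('%Y-%m-%d %H:%M')}, {status}, {duration_s:.1f}"
    path = Path(log_path)
    lines = path.read_text().splitlines() if path.exists() else []
    lines.append(line)
    # Drop the oldest entries once the cap is exceeded.
    path.write_text("\n".join(lines[-max_lines:]) + "\n")
```

A call such as `append_result("docs/logs/login_report.log", "up", 1.8)` would add one line to the login check's log.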

Setup

  1. Repository secrets (Settings → Secrets and variables → Actions):
    • FLAME_USERNAME / FLAME_PASSWORD — Hub credentials (required). The Hub endpoints (*.staging.privateaim.net) are hardcoded at the top of flame_health_check.py.
  2. GitHub Pages: Settings → Pages → deploy from the main branch, /docs folder.
  3. Optionally adjust TARGET_NODE_NAMES and PROJECT_NAME in flame_health_check.py, and the report cards / console link in CONFIG.reports / CONFIG.consoleUrl in docs/constants.js.
  4. Trigger a first run manually via the Scheduled Health Check workflow (workflow_dispatch).

Manual messages

Maintenance or incident notices can be posted by editing docs/messages.json (e.g. directly in the GitHub web editor). Each entry is rendered as a banner above the status cards, newest first:

[
  {
    "date": "2026-06-12",
    "type": "maintenance",
    "title": "Hub upgrade",
    "text": "Staging cluster will be unavailable June 12, 09:00-11:00 CEST."
  }
]

type controls the banner accent: info (blue), maintenance (yellow), incident (red). Remove entries (or set the file to []) to clear the page.
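Since the file is edited by hand, a quick local check before committing can catch typos. The following is a hypothetical helper, not part of the repository: it assumes all four keys are required and that dates are ISO formatted (YYYY-MM-DD), as in the example above.

```python
import json
from datetime import date

ALLOWED_TYPES = {"info", "maintenance", "incident"}
REQUIRED_KEYS = {"date", "type", "title", "text"}


def validate_messages(raw):
    """Parse messages.json content and return the entries newest first."""
    messages = json.loads(raw)
    if not isinstance(messages, list):
        raise ValueError("messages.json must contain a JSON array")
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise ValueError(f"entry {i} is not an object")
        missing = REQUIRED_KEYS - msg.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        if msg["type"] not in ALLOWED_TYPES:
            raise ValueError(f"entry {i} has unknown type {msg['type']!r}")
        # Raises ValueError if the date is not in YYYY-MM-DD form.
        date.fromisoformat(msg["date"])
    return sorted(messages, key=lambda m: m["date"], reverse=True)
```

It could be run as `validate_messages(open("docs/messages.json").read())`; an empty array passes and returns an empty list.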

Running locally

pip install -r requirements.txt
export FLAME_USERNAME=... FLAME_PASSWORD=...
python flame_health_check.py

Credits

Frontend and status-page concept forked from statsig-io/statuspage.
