FLAME E2E Status Page

Automated end-to-end health monitoring for the FLAME federated learning and analysis platform (staging cluster).

Every 30 minutes, a GitHub Action runs a complete federated analysis round against the FLAME Hub — from login to result retrieval — and publishes the outcome to a static status page via GitHub Pages.

Maintainer: Jules Kreuer

What is checked

Each run executes flame_health_check.py. After a shared login, it runs a separate minimal analysis per compute node, each pairing the aggregator with a single node (aggregator-1 + default-1, aggregator-1 + default-2, …). The pair runs execute in parallel, and each is gated by an online pre-check. This yields an independent up/down verdict and latency for every node, instead of one all-or-nothing round that hangs if any node is down.
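
The per-node fan-out described above could look roughly like the following sketch. It is not taken from flame_health_check.py: `check_node`, `check_nodes`, `is_online`, and `run_pair` are hypothetical names, and the two callables are stubbed so the example runs without Hub access.

```python
from concurrent.futures import ThreadPoolExecutor
import time

AGGREGATOR = "aggregator-1"


def check_node(node, is_online, run_pair):
    """Return (node, status, seconds) for one aggregator + node pair."""
    if not is_online(node):
        # Offline nodes are recorded as down without starting an analysis.
        return node, "down", None
    start = time.monotonic()
    try:
        ok = run_pair(AGGREGATOR, node)
    except Exception:
        ok = False
    return node, "up" if ok else "down", time.monotonic() - start


def check_nodes(nodes, is_online, run_pair):
    """Run every pair in parallel so one slow node cannot block the others."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        results = pool.map(lambda n: check_node(n, is_online, run_pair), nodes)
        return {node: (status, secs) for node, status, secs in results}


if __name__ == "__main__":
    # Stubs standing in for real Hub calls (assumptions for illustration only).
    online = {"default-1": True, "default-2": False}
    verdicts = check_nodes(
        ["default-1", "default-2"],
        is_online=lambda n: online[n],
        run_pair=lambda agg, n: True,
    )
    print(verdicts)
```

Because each pair has its own worker, a node that is offline or slow only affects its own verdict.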

The results are merged into six step cards and a per-node section:

| Check | What it verifies | Timeout |
| --- | --- | --- |
| `login` | Authentication against the FLAME Hub and basic API access (node listing). Shared, once. | 10 s |
| `upload` | Per pair: analysis creation, code bucket provisioning, and upload of the test script (`flame_checks/00_test_connection.py`) as entrypoint. | 60 s |
| `distribute` | Per pair: analysis image build and distribution to the paired nodes. | 60 s |
| `execute` | Per pair: execution of the analysis on the paired nodes. | 120 s |
| `results` | Per pair: download of the result tarball and confirmation that both paired nodes reported `ok`. | none |
| `latency` | Per pair: E2E duration stays below 300 s. | 300 s |

The five pipeline step cards (upload…latency) are aggregated across the parallel pair runs (status merged, duration averaged). The per-node cards at the bottom show each node's own up/down and latency; aggregator-1 counts as up if any of its pairs succeeds. The overall badge is the aggregation of the node verdicts: all up → operational, some down → partial, all down or login failing → major outage.
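
The aggregation rules above can be sketched as three small functions. This is an illustration under assumptions, not the contents of aggregate.py: the function names are hypothetical, verdicts are modelled as booleans, and skipping missing durations when averaging is a choice made here.

```python
from statistics import mean


def overall_badge(login_ok, node_up):
    """Map login state and per-node verdicts to the overall badge."""
    if not login_ok or not any(node_up.values()):
        return "major outage"
    if all(node_up.values()):
        return "operational"
    return "partial"


def aggregator_up(pair_ok):
    """aggregator-1 counts as up if any of its pairs succeeded."""
    return any(pair_ok.values())


def average_duration(durations):
    """Average a step's duration across pair runs, skipping missing values."""
    known = [d for d in durations if d is not None]
    return mean(known) if known else None


if __name__ == "__main__":
    print(overall_badge(True, {"default-1": True, "default-2": False}))  # partial
    print(aggregator_up({"default-1": False, "default-2": True}))        # True
    print(average_duration([10.0, 20.0, None]))                          # 15.0
```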

If a step fails within a pair run, that run's subsequent steps are recorded as unknown (shown as "no data" on the page and excluded from uptime statistics). An offline node is recorded as down without spending a run on it.
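
A minimal sketch of this failure handling follows. The status strings "up" and "down" for steps, the `run_steps` and `uptime` helpers, and the callable-per-step layout are assumptions made for illustration; only "unknown" and the step names come from the text above.

```python
STEPS = ["upload", "distribute", "execute", "results", "latency"]


def run_steps(step_funcs):
    """Run steps in order; after the first failure, mark the rest unknown."""
    statuses = {}
    failed = False
    for name in STEPS:
        if failed:
            statuses[name] = "unknown"
            continue
        try:
            ok = step_funcs[name]()
        except Exception:
            ok = False
        statuses[name] = "up" if ok else "down"
        failed = not ok
    return statuses


def uptime(statuses):
    """Share of 'up' among entries that carry data; 'unknown' is excluded."""
    known = [s for s in statuses if s != "unknown"]
    return sum(s == "up" for s in known) / len(known) if known else None
```

Excluding unknown entries keeps a single upstream failure from being counted again against every later step.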

Results are appended as "date, status, duration" lines to docs/logs/<check>_report.log (capped at 2000 lines, roughly 41 days at 30-minute intervals) and committed back to the repository. The frontend (docs/index.html / docs/index.js) renders the last 30 days per check, including run durations.
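
The capped append can be sketched as below. The exact line format, the `append_result` name, and the timestamp layout are assumptions; only the 2000-line cap and the date, status, duration ordering come from the text above.

```python
from datetime import datetime, timezone
from pathlib import Path

MAX_LINES = 2000  # cap stated above


def append_result(log_path, status, duration_s, now=None):
    """Append one 'date, status, duration' line and keep the newest MAX_LINES."""
    now = now or datetime.now(timezone.utc)
    path = Path(log_path)
    lines = path.read_text().splitlines() if path.exists() else []
    lines.append(f"{now:%Y-%m-%d %H:%M}, {status}, {duration_s:.1f}")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Dropping from the front keeps the most recent runs.
    path.write_text("\n".join(lines[-MAX_LINES:]) + "\n")
```

At 48 runs per day, 2000 lines cover 2000 / 48 ≈ 41.7 days, which leaves headroom over the 30 days the frontend renders.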

Setup

- Repository secrets (Settings → Secrets and variables → Actions):
  - FLAME_USERNAME / FLAME_PASSWORD: Hub credentials (required). The Hub endpoints (*.staging.privateaim.net) are hardcoded at the top of flame_health_check.py.
- GitHub Pages: Settings → Pages → deploy from the main branch, /docs folder.
- Optionally adjust TARGET_NODE_NAMES and PROJECT_NAME in flame_health_check.py, and the report cards and console link (CONFIG.reports and CONFIG.consoleUrl) in docs/constants.js.
- Trigger a first run manually via the Scheduled Health Check workflow (workflow_dispatch).

Manual messages

Maintenance or incident notices can be posted by editing docs/messages.json (e.g. directly in the GitHub web editor). Each entry is rendered as a banner above the status cards, newest first:

[
  {
    "date": "2026-06-12",
    "type": "maintenance",
    "title": "Hub upgrade",
    "text": "Staging cluster will be unavailable June 12, 09:00-11:00 CEST."
  }
]

type controls the banner accent: info (blue), maintenance (yellow), incident (red). Remove entries (or set the file to []) to clear the banners.
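
Since the file is edited by hand, a quick check before committing can catch malformed entries. The sketch below is not part of the repository: `validate_messages` is a hypothetical helper, and treating all four keys as required is an assumption based on the example entry.

```python
import json

ALLOWED_TYPES = {"info", "maintenance", "incident"}
REQUIRED_KEYS = {"date", "type", "title", "text"}  # assumed from the example


def validate_messages(raw):
    """Return a list of problems found in the JSON text of messages.json."""
    problems = []
    entries = json.loads(raw)
    if not isinstance(entries, list):
        return ["top level must be a JSON array"]
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        if entry.get("type") not in ALLOWED_TYPES:
            problems.append(f"entry {i}: unknown type {entry.get('type')!r}")
    return problems
```

An empty list means the file passed these checks; invalid JSON raises `json.JSONDecodeError` from `json.loads`.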

Running locally

pip install -r requirements.txt
export FLAME_USERNAME=... FLAME_PASSWORD=...
python flame_health_check.py

Credits

Frontend and status-page concept forked from statsig-io/statuspage.
