
draft dev: action flow matching#235

Draft
koritsky wants to merge 52 commits into main from feat/action_expert

Conversation

@koritsky

Contributor

No description provided.

@koritsky force-pushed the feat/action_expert branch from cead078 to 9b56b2b (June 1, 2026 09:43)
koritsky and others added 16 commits June 9, 2026 18:32
add sin time emb and adal

better oracles

Add higher-order flow sampling

Strengthen flow time conditioning

Normalize flow policy actions

Configurable training-time flow sampling

Add deterministic flow policy validation

add comments

rm normalization, simplify

rm some validations

batch with custom temporal dimension

fix turnsignal

move prediction one step in future

drop waypoints logging
…/val

Overfit datamodule references its own single-drive /dataset/yaak/overfit for both splits (defensive against scope-creep into the full yaak/train corpus). train/val use drop_last to avoid partial trailing batches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RotarySelfAttention threaded through the decoder blocks via a rope flag (default off so existing checkpoints rebuild byte-identical). Injects slot position at every attention layer, complementing the additive position embedding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ampling, maneuver-L1

- within-chunk delta loss (chunk_delta_weight) to break the constant-chunk optimum; NaN-row guard with culprit attribution; pe_drift metric; flow_mse always logged.
- configurable flow-time sampling: logit-normal mean/std exposed, plus pi0's beta p(t)=Beta((s-t)/s; alpha,1) skewed toward the noisy end (generator-safe inverse-CDF).
- per-t flow-MSE buckets and per-channel maneuver-L1 (gas/brake/steering) validation metrics.
- policy_finetune.yaml wires the new hparams with backward-compatible defaults.
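
For readers following the flow-time sampling bullet, here is a minimal sketch of the two samplers it names, not the code in this branch. It assumes the pi0 convention (t = 0 is pure noise), and the defaults `alpha=1.5`, `s=0.999` are the values believed to be used in the pi0 paper; the function names are hypothetical. Beta(alpha, 1) has the closed-form CDF F(x) = x^alpha, which is what makes an inverse-CDF draw from a seeded generator possible.

```python
import math
import random


def sample_logit_normal(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Flow time in [0, 1]: a Gaussian draw pushed through a sigmoid."""
    z = rng.gauss(mean, std)
    # sigmoid(z) written with tanh so large |z| cannot overflow.
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def sample_beta_time(rng: random.Random, alpha: float = 1.5, s: float = 0.999) -> float:
    """Flow time with (s - t) / s ~ Beta(alpha, 1), drawn by inverse CDF.

    Beta(alpha, 1) has CDF F(x) = x**alpha, so x = u**(1 / alpha) for u ~ U(0, 1).
    alpha > 1 pushes x toward 1, i.e. t toward 0 (the noisy end under the
    assumed convention).
    """
    u = rng.random()
    x = u ** (1.0 / alpha)
    return s * (1.0 - x)


# Both samplers take an explicit generator, so draws are reproducible.
rng = random.Random(0)
beta_times = [sample_beta_time(rng) for _ in range(5)]
logit_normal_times = [sample_logit_normal(rng) for _ in range(5)]
```

With `alpha=1.5` the expected flow time is `s * (1 - alpha / (alpha + 1))`, roughly 0.4, so a bit more than half of the training signal lands on the noisier half of the path.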

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gging

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… recipes

- fan: N-draw sample-concentration + horizon views.
- field: velocity-field / trajectory visualization.
- thresholds: per-channel maneuver thresholds from data (CPU-only, prints a config-ready tuple).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…; unfreeze encoder

- finetune_flow callbacks: freezer + PE-cosine image logger, no EMA (val reads raw weights; EMA lags a still-descending overfit).
- finetune_overfit_flow: chunk-delta, time-sampling, maneuver thresholds, heun/32 val, 200 epochs.
- finetune: unfreeze encoder for multi-drive finetune.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eport

- improvements.md: reverse the mislabeled-overfit correction — the runs were genuine single-drive overfit (finetuned policy head), not full-corpus.
- add rescue plan and diagnostic report (typst source, figures, rendered pdf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Train the flow in a well-conditioned model space matched to its N(0,1) prior (suspect 5):
- gas/brake -> signed longitudinal merge (gas - brake), folding brake's point mass into one continuous channel; model space is 2-d (longitudinal, steering), inverted to gas/brake at the I/O boundary.
- per-channel Gaussianize (empirical CDF -> N(0,1)) via data-fit quantile knots; monotonic, invertible, non-clipping.
Objective decouples raw I/O dim from decoder/model dim and inverts samples so all metrics stay raw-space. flow_action_thresholds.py writes the knots (+thresholds.merge=true); decoder dim via flow_action_dim. Off by default (identity).
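
A rough sketch of the two transform stages described above, under stated assumptions: the knot count, the piecewise-linear interpolation, and all function names are illustrative and are not taken from `action_transform.py`. It shows the signed merge with its inverse, and a quantile-knot Gaussianize that is monotonic, invertible, and extrapolates linearly instead of clipping.

```python
import bisect
import random
from statistics import NormalDist


def merge_longitudinal(gas: float, brake: float) -> float:
    """Fold gas and brake into one signed channel."""
    return gas - brake


def split_longitudinal(longitudinal: float) -> tuple:
    """Invert the merge at the I/O boundary.

    Exact only when gas and brake are never pressed at the same time.
    """
    return max(longitudinal, 0.0), max(-longitudinal, 0.0)


def fit_knots(samples: list, n_knots: int = 33) -> tuple:
    """Pair empirical quantiles of the data with the matching N(0, 1) quantiles."""
    xs = sorted(samples)
    normal = NormalDist()
    knots_x, knots_z = [], []
    for k in range(n_knots):
        p = (k + 0.5) / n_knots
        x = xs[min(int(p * len(xs)), len(xs) - 1)]
        if knots_x and x <= knots_x[-1]:
            continue  # skip ties so the map stays strictly monotonic
        knots_x.append(x)
        knots_z.append(normal.inv_cdf(p))
    return knots_x, knots_z


def interpolate(v: float, src: list, dst: list) -> float:
    """Piecewise-linear map src -> dst; the end segments extend linearly (no clipping)."""
    i = bisect.bisect_right(src, v) - 1
    i = min(max(i, 0), len(src) - 2)
    w = (v - src[i]) / (src[i + 1] - src[i])
    return dst[i] + w * (dst[i + 1] - dst[i])


# Fit on synthetic, heavy-centred data (a stand-in for real action logs).
rng = random.Random(0)
data = [rng.uniform(-1.0, 1.0) ** 3 for _ in range(5000)]
knots_x, knots_z = fit_knots(data)
z = interpolate(0.4, knots_x, knots_z)        # raw -> model space
raw = interpolate(z, knots_z, knots_x)        # model space -> raw
```

Swapping the `src` and `dst` arguments gives the inverse, which is why one function serves both directions.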

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Routes slot identity through the flow-time AdaLN modulation — the channel proven to train — to break the constant/mid-anchored chunk (the additive position embedding stayed at init). A zero-init per-slot offset (nn.Embedding, so SelectiveAdamW classifies it and excludes it from weight decay) is added to the time embedding feeding every block's adaLN_modulation, giving each slot its own scale/shift. Zero-init => byte-identical to the time-only decoder at start; old checkpoints rebuild unchanged. Off by default; enabled in the finetune_overfit_flow_slot_adaln experiment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@koritsky koritsky force-pushed the feat/action_expert branch from 816231a to 15f7a80 Compare June 9, 2026 16:33

koritsky and others added 12 commits June 9, 2026 21:48
…ace)

decoder.sample returns model-space samples (2-ch after the gas/brake merge), but the fan indexed steering at raw channel idx 2 -> IndexError. Invert samples through the objective's action transform before scoring; identity no-op when no transform is configured. Fan metrics stay raw-space, matching the raw GT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e horizon axis

Fix the mid-anchored constant chunk (every slot predicts ~t+3; fan lag = h-3)
at its cause: slots are ~90%+ correlated, so in slot space within-chunk
structure gets only its variance share of the flow MSE (measured on the
overfit drive: 4.7% longitudinal / 12.4% steering, sigma_0 ~ sqrt(6)) — the
constant chunk is the gradient-rational optimum, which is why information-only
levers (RoPE, additive PE, slot-AdaLN) all failed to break it.

- action_transform: optional chunk-DCT stage after Gaussianize — orthonormal
  DCT-II over the horizon axis (slot index -> frequency: k=0 mean, k=1 slope,
  ...) + per-coefficient standardization with a sigma-floor (frac of the
  channel's sigma_0). Within-chunk shape gets 5/6 of the loss weight instead
  of ~5-12%; the inverse scales errors back DOWN by sigma_k, so raw samples
  are robust to bad high-frequency predictions (opposite of mu-law's
  amplifying inverse). Precedent: pi0-FAST's normalize+DCT action tokenizer
  (arXiv:2501.09747), trajectory-DCT in motion prediction (arXiv:1908.05436).
- flow_policy: validate DCT horizon == decoder horizon; force-disable the
  chunk-delta loss under a horizon-mixing transform (differencing adjacent
  coefficients is meaningless; per-coefficient standardization supersedes it).
- flow_action_thresholds: always print the DCT sigma spectrum diagnostic;
  +thresholds.dct=true persists mu/sigma/floor into the stats json
  (artifacts/action_norm_dct.json fitted from the overfit drive).
- finetune_overfit_flow_dct experiment: pg5lzmvk base + DCT stats,
  flow_action_dim=2, delta weight 0 — single-change A/B vs the transform-only
  baseline. Readouts: fan lag profile flattens, per-horizon spike L1 loses the
  U-shape, slope corr >> 0, concentration holds near 91%/61%.
- tests: basis orthonormality, constant-chunk -> k=0, round-trips (incl.
  stacked sample dims), sigma-floor, delta guard, horizon-mismatch.
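
To make the horizon-axis reparameterization concrete, the following is a small sketch of an orthonormal DCT-II over a 6-slot chunk, together with an idealized variance-share calculation. It is illustrative only: the helper names are hypothetical, and `within_chunk_share` assumes unit-variance slots with one common pairwise correlation `rho`, which real drives only approximate.

```python
import math


def dct_basis(horizon: int) -> list:
    """Orthonormal DCT-II matrix; row k is the basis vector for frequency k."""
    basis = []
    for k in range(horizon):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / horizon)
        basis.append(
            [scale * math.cos(math.pi * (n + 0.5) * k / horizon) for n in range(horizon)]
        )
    return basis


def forward(chunk: list, basis: list) -> list:
    """Slot space -> frequency space."""
    return [sum(b * x for b, x in zip(row, chunk)) for row in basis]


def inverse(coeffs: list, basis: list) -> list:
    """Frequency space -> slot space; for an orthonormal basis this is the transpose."""
    horizon = len(basis)
    return [sum(basis[k][n] * coeffs[k] for k in range(horizon)) for n in range(horizon)]


def within_chunk_share(rho: float, horizon: int) -> float:
    """Share of total variance outside k=0 for equicorrelated unit-variance slots.

    The covariance (1 - rho) * I + rho * ones has eigenvalue 1 + (H - 1) * rho on
    the constant direction and 1 - rho on the remaining H - 1 directions.
    """
    return (horizon - 1) * (1.0 - rho) / horizon


basis = dct_basis(6)
constant_chunk = [0.5] * 6
coeffs = forward(constant_chunk, basis)   # all energy lands in k=0
share = within_chunk_share(0.9, 6)        # about 8% of the variance
```

Under this idealization, rho = 0.9 leaves roughly 8% of the variance to within-chunk shape, the same order as the measured 4.7% / 12.4%, and as rho approaches 1 the k=0 variance approaches H = 6, consistent with sigma_0 ~ sqrt(6).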

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…in transform

The chunk-DCT reparameterization (orthonormal DCT-II over the horizon axis +
per-coefficient standardization, commit a10838a) is a dead end: head-to-head on
the overfit eval drive it loses to the plain Gaussianize+merge baseline
(pg5lzmvk) — steering chunk-mean corr 0.36 vs 0.81, steering L1 ~2x worse,
single-draw sample_l1 0.048 vs 0.023. The sigma-floor=0.3 fix lifted corr
0.10->0.36 over the floor-0.05 collapse but couldn't close the gap. Two causes:
the level's slot-redundancy is load-bearing (6 tokens carry it in slot space ->
6x conditioning pressure; the DCT funnels it through c0 alone), and the
sigma-floor down-weights high-k coefficients in training but the inverse
inflates their prior-sampled values back up (2.5x over-wiggle). Full negative
result + numbers logged in Notion (Experiments DB + Action Expert dev log).

Restores action_transform.py / flow_policy.py / flow_action_thresholds.py to
their pre-DCT state and removes the finetune_overfit_flow_dct experiment.

Keeps test_action_transform.py (pruned to the non-DCT cases): the
Gaussianize+merge transform — now the established baseline — had ZERO test
coverage before, so the roundtrip / stacked-sample-dim / dims tests are net-new
and worth retaining.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… diversity

First multi-drive test of the established flow baseline (Gaussianize+merge
transform, delta+RoPE, logit-normal, heun/32). Everything since the rescue
plan has been single-drive overfit; this checks whether the transform win
transfers and gives real generalization readouts (train/val now disjoint).

- dataset/train_pilot template: first 30 drives of the train list (Niro096
  kept first for continuity with the overfit runs).
- datamodule/train_pilot: pilot train split + the standard 5-drive val set
  (no overlap).
- experiment/finetune_flow_pilot: pg5lzmvk recipe at 50ep / 25k-step cosine;
  stats are pilot-fit (artifacts/action_norm_pilot.json, gitignored) on the
  train split — see header for the fit command.
- flow_action_thresholds: support split=train so stats can be fit on the
  pilot's train drives (was predict|val only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New baseline (run tqozevsv): reweight the flow loss from frequency toward
importance to fix the maneuver undercommit. The flat-dominated MSE (worsened by
the Gaussianize tail-Jacobian) spent raw precision by frequency — finest in
cruise, coarsest on the rare maneuvers that matter. LDS realigns it to
importance: steering maneuver-L1 dropped ~26% (spike 0.289->0.215) AND steering
chunk-mean corr improved 0.81->0.91, with overall single-draw sample_l1 flat.

- maneuver_weights.py: ManeuverLossWeights (Yang et al., ICML 2021). Per-chunk
  label = peak |physical action| over the horizon, per model channel
  (longitudinal, steering) — peak-over-slots upweights a chunk's lead-in with
  its maneuver. Weight = (1/smoothed-density)^alpha, capped (no runaway tail
  weight — cf. mu-law), mean-1 normalized over the empirical distribution so the
  loss scale (hence LR/schedule) is unchanged. alpha/cap applied at load => sweep
  without refit.
- flow_policy.py: load + validate the weighter (channels must match model space);
  weight the per-element flow loss (broadcast over slots) and, consistently, the
  chunk-delta term; depends only on the clean target so it is valid at every
  flow-time. flow_mse stays logged UNWEIGHTED (cross-run comparable);
  maneuver_weight_mean logged as a ~1 sanity check. Off when flow_lds_stats empty.
- action_transform.py: physical_model() — merge without Gaussianize, for the
  (physical-units) importance label.
- flow_action_thresholds.py: +thresholds.lds=true fits and writes per-channel
  {edges, emp, smooth} into the stats JSON; _collect_targets now returns per-chunk
  shape (frames, H, A) for the peak-over-horizon label.
- Validation: single honest draw (deployment-realistic). Dropped the best-of-N
  curve and _sample_curve_ks (oracle selection flattered the model); kept
  sample_l1 + maneuver_l1/*. flow_prediction_samples default 32 -> 1.
- Experiments: finetune_overfit_flow_lds, finetune_flow_pilot_lds (pg5lzmvk
  recipe + LDS as the single change).
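
The weighting rule in the `maneuver_weights.py` bullet can be sketched as below. This is an illustration under assumptions, not the branch's `ManeuverLossWeights`: the Gaussian smoothing kernel, the cap expressed as a multiple of the smallest weight, and the function name are all hypothetical. Only `alpha=0.5` echoes a value mentioned elsewhere in this PR.

```python
import math


def lds_weights(counts: list, alpha: float = 0.5, cap: float = 10.0,
                kernel_sigma: float = 1.0) -> list:
    """Per-bin loss weights: (1 / smoothed density) ** alpha, capped, mean-1.

    counts: histogram of the per-chunk label (e.g. peak |action| over the horizon).
    """
    total = sum(counts)
    emp = [c / total for c in counts]
    n = len(counts)

    # Smooth the empirical density over neighbouring bins with a Gaussian kernel.
    smooth = []
    for i in range(n):
        kernel = [math.exp(-0.5 * ((i - j) / kernel_sigma) ** 2) for j in range(n)]
        smooth.append(sum(e * k for e, k in zip(emp, kernel)) / sum(kernel))

    raw = [(1.0 / max(s, 1e-12)) ** alpha for s in smooth]

    # Cap relative to the smallest weight so rare bins cannot run away.
    floor = min(raw)
    capped = [min(w, cap * floor) for w in raw]

    # Normalize so the expected weight under the empirical distribution is 1,
    # which leaves the overall loss scale unchanged.
    mean_weight = sum(e * w for e, w in zip(emp, capped))
    return [w / mean_weight for w in capped]


# Cruise-dominated histogram: most chunks fall in the first bin.
weights = lds_weights([9000, 700, 200, 80, 20])
```

Because `alpha` and `cap` act only on the stored densities, they can be swept without refitting the histogram, which matches the "applied at load" note above.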

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make ModelCheckpoint cadence/retention overridable (oc.select, defaults 1/1 =
prior behavior) and set the overfit base to every_n_epochs=10, save_top_k=-1.
Keeps a sparse epoch ladder for offline fan/eval and cuts the per-epoch wandb
artifact uploads ~10x (log_model=all logs on each save). Pilot/other runs
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The encoder is factorized (RoPE on the temporal axis only; PR #209), so its
within-timestep spatial attention has no positional encoding and the 10
waypoints reach the flow decoder as an unordered bag — distinguished only by a
shared per-modality role embedding. The route SEQUENCE is therefore invisible
to the decoder's cross-attention (order recoverable only from coordinate
values). This restores a within-frame route-sequence signal.

- flow_policy.py: optional learnable per-waypoint-index embedding
  (nn.Embedding, zero-init) added to the waypoint condition-token block in
  _condition_tokens, so the decoder can use waypoint order. Added in the
  TRAINABLE objective (not the encoder) because the episode builder + encoder
  are frozen during flow finetune; cond_dim auto-derived from the decoder's
  condition projection; count-mismatch guard. Zero-init => byte-identical at
  start; weight-decay-excluded (nn.Embedding); per-slot constant across
  timesteps => KV-cache safe. Off when waypoint_pe is false.
- policy_finetune.yaml: flow_waypoint_pe (default false) / flow_waypoint_count
  (default 10) threaded through hparams_jq.
- experiments: finetune_overfit_flow_lds_wpe, finetune_flow_pilot_lds_wpe
  (LDS baseline + waypoint PE as the single change). Inherit the base directly
  (re-setting LDS flags inline) to avoid a two-level experiment-chaining bug.

Overfit A/B (3e5numej vs tqozevsv) was a clean null — the PE learns sensible
ordinal structure (adjacent-index cos 0.94) but at tiny magnitude on a fixed
single route; the meaningful test is the pilot (route diversity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hparams_jq is a jq program; an oc.select default of `false` renders as Python
`False`, which jq rejects ("False/0 is not defined"). This broke every
experiment that didn't explicitly set flow_waypoint_pe (the plain *_lds
configs). Use 0, matching the existing 1/0 convention (flow_decoder_rope,
flow_decoder_slot_adaln); pydantic coerces 0 -> False.
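
The failure mode is easy to reproduce in plain Python. The snippet below is only an illustration of the rendering problem; how `hparams_jq` is actually templated is not shown in this PR, and the `.flow_waypoint_pe` filter text is a stand-in.

```python
import json

flag = False

# String interpolation uses Python's repr, which jq does not recognise.
naive = f".flow_waypoint_pe = {flag}"

# Two renderings that jq does accept as literals.
json_safe = f".flow_waypoint_pe = {json.dumps(flag)}"
as_int = f".flow_waypoint_pe = {int(flag)}"

print(naive)      # ends in "False"
print(json_safe)  # ends in "false"
print(as_int)     # ends in "0"
```

The commit takes the integer route, which keeps the config consistent with the other 1/0 flags.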

(Surfaced while reverting the minibatch OT-coupling experiment, which was a
structural dead end for a per-sample-conditioned flow — see Notion.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…llout diagnostics

Three evaluation gaps closed:
- datamodule/yaak/predict_val (+ val_eval dataset template): predict over a
  HELD-OUT val drive (Niro115). The standard predict datamodule points at the
  overfit drive, which is in-sample for every pilot model — all previous
  predict parquets silently measured memorization for those.
- flow_meank_eval.py: mean-of-K bias/variance decomposition. ~30% of overall
  flow L1 is per-draw sampling variance (mean-of-32 is a free inference win);
  held-out SPIKE steering error is ~pure bias (0.55-0.63 even for the 32-draw
  conditional mean) — out-of-sample maneuver INITIATION, not sampling, is the
  error mass. Also defines the deterministic-head-reachable floor for the
  flow-vs-heads comparison.
- flow_rollout_eval.py: recursive action-feedback rollout (predictions fed
  back as action history; vision/waypoints stay GT). Motivated by the
  action-history sensitivity probe: predictions shift 1.7-3.4x baseline error
  under history perturbation (copycat signature), so open-loop metrics
  understate deployment error through this channel.
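
As a reading aid for the mean-of-K bullet, here is one way such a split could be computed for a single frame. It is a sketch, not `flow_meank_eval.py`: the function name and the definition of the variance share as `(single - meanK) / single` are assumptions. For L1 the split is not an exact identity, but convexity guarantees the mean-of-K error never exceeds the average single-draw error.

```python
import random


def l1(a: list, b: list) -> float:
    """Mean absolute error between two chunks."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def meank_decomposition(draws: list, target: list) -> dict:
    """Compare the average single-draw L1 with the L1 of the K-draw mean.

    draws: K sampled chunks for one frame; target: the ground-truth chunk.
    """
    k = len(draws)
    mean_chunk = [sum(d[i] for d in draws) / k for i in range(len(target))]
    single = sum(l1(d, target) for d in draws) / k   # expected one-draw error
    meank = l1(mean_chunk, target)                   # error of the averaged chunk
    share = (single - meank) / single if single > 0 else 0.0
    return {"single_draw_l1": single, "meank_l1": meank, "variance_share": share}


# Synthetic frame: draws share a constant bias and add independent noise.
rng = random.Random(0)
target = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1]
draws = [[t + 0.05 + rng.gauss(0.0, 0.1) for t in target] for _ in range(32)]
report = meank_decomposition(draws, target)
```

What remains in `meank_l1` after averaging is what the commit calls bias: averaging more draws cannot remove it.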

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "is flow earning its complexity?" control that never existed: the old
baseline (PolicyObjective) regresses only the last observed step, so flow vs
simple-heads was never comparable. RegressionPolicyObjective predicts the same
6-slot future chunk from the same condition tokens, in the same Gaussianize+
merge model space, with the same raw-space metrics and wandb keys
(sample_l1, maneuver_l1/*) — a single deterministic forward + MSE instead of
an integrated flow. Capacity-matched decoder (4-layer cross-attention, learned
slot queries via nn.Embedding for SelectiveAdamW classification). Optional LDS
(off by default). finetune_regression callbacks = finetune_flow minus the
flow-only PE image logger (encoder stays FROZEN — required for the matched
comparison). Experiments: finetune_overfit_regression (vs pg5lzmvk/tqozevsv),
finetune_pilot_regression (vs 7d3y7ort/ge4vfboq held-out).

Pre-registered readout: regression ~= flow mean-of-K => flow's gap is sampling
variance (deterministic readout/distillation is the cheap equivalent); flow
single-draw beats regression on maneuvers => distributional modeling earns it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The action-history sensitivity probe showed the policy leans heavily on past
actions (perturbing them shifts predictions 1.7-3.4x the baseline error): the
classic copycat shortcut (de Haan et al. 2019 causal confusion; ChauffeurNet
past-motion dropout). Cruise is "copy history forward"; maneuver ONSET needs
vision/route, which the shortcut out-competes — the best causal account of
the held-out spike bias (~0.6 even for the 32-draw conditional mean).

ActionHistoryDropout zeroes the action-history batch fields for a random
subset of train samples at on_train_batch_start (eval untouched; the frozen
encoder is a function, so zeroed inputs change the summaries and the head
learns both regimes). finetune_flow_histdrop callbacks + the
finetune_flow_pilot_lds_histdrop experiment (single change vs ge4vfboq;
p via flow_action_hist_dropout, default 0.5) — QUEUED, not launched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ise mode delta

The counters are load-bearing: the first rollout read "ratio 1.00" everywhere,
which without them would have been reported as "no compounding". They proved
substitutions fired (2385, mean |GT-sub| 0.053) while predictions stayed
bit-identical — i.e. genuinely zero local sensitivity to self-error-scale
history perturbations (drive-start segment), vs the probe's large-perturbation
sensitivity. Distinguishes "feedback not wired" from "feedback has no effect".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
koritsky and others added 24 commits June 11, 2026 13:22
Plain mean-of-K is a mode-averaging estimator: on a genuinely bimodal
conditional (turn left XOR right) it splits the difference — the exact
regression-to-the-mean failure flow exists to avoid. The mode-aware readout
clusters the K draws on chunk-mean steering (largest-gap split, with an
outlier guard: minority cluster must hold >= mode_min_frac of draws), commits
to the DOMINANT cluster (draw count ~ probability mass) and averages within
it — identical to mean-of-K on unimodal frames, mode-committing on bimodal
ones. Precedents: propose-then-select in motion forecasting (MultiPath
anchors, MTR/DenseTNT NMS, Trajectron++ "most likely" deployment; MDN
take-dominant-component) and Minimum-Bayes-Risk consensus decoding (Kumar &
Byrne 2004); sample-then-select policies (Implicit BC, SfBC, Diffusion-ES).

The census reports how often the conditional is ACTUALLY multimodal (% bimodal
frames, flat vs spike, mode separation, and meanK-vs-mode-aware steering L1 on
the bimodal subset). With route waypoints in the conditioning, discrete modes
should be rare at current scale — this measures it instead of assuming.

Unit-verified: bimodal draws (24 left / 8 right) -> anchor at the dominant
mode (-0.60), where mean-of-K averages to -0.30; unimodal / outlier-minority /
sub-gap frames fall back exactly to mean-of-K.
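
The unit check above can be reproduced with a scalar sketch of the readout. This is a simplification: it works on per-draw chunk-mean steering values only, whereas the described readout averages whole chunks within the dominant cluster. The threshold values for `mode_min_frac` and `min_gap` are placeholders, and `min_gap` is a hypothetical name for the sub-gap guard.

```python
def mode_aware_anchor(values: list, mode_min_frac: float = 0.2,
                      min_gap: float = 0.3) -> float:
    """Largest-gap split of K scalar draws; commit to the dominant cluster.

    Falls back to the plain mean when the split is too small (sub-gap) or the
    minority cluster holds too few draws (outlier guard).
    """
    k = len(values)
    mean_all = sum(values) / k
    if k < 2:
        return mean_all

    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(k - 1)]
    split = max(range(k - 1), key=gaps.__getitem__)
    low, high = ordered[: split + 1], ordered[split + 1:]

    minority = min(len(low), len(high))
    if gaps[split] < min_gap or minority < mode_min_frac * k:
        return mean_all

    dominant = low if len(low) >= len(high) else high
    return sum(dominant) / len(dominant)


# 24 draws turn left, 8 turn right.
bimodal = [-0.60] * 24 + [0.60] * 8
anchor = mode_aware_anchor(bimodal)
plain_mean = sum(bimodal) / len(bimodal)
```

On the bimodal example the anchor lands on the dominant mode near -0.60, while the plain mean sits near -0.30, between the two modes.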

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the mode-aware consensus into the predict pipeline so `just predict`
can use it on EVERY existing flow checkpoint (pure inference, no retraining):

    just predict ... '+flow_predict_readout=mode' '+flow_predict_samples=16'

- objectives/consensus.py: shared torch winner-take-all consensus (cluster K
  draws on chunk-mean steering via largest-gap split + outlier guard; commit
  to the dominant cluster, average within it; == mean-of-K on unimodal
  frames). Single source of truth: predict() and flow_meank_eval both use it
  (the eval's numpy version replaced by a wrapper).
- FlowPolicyObjective: predict_readout (single = legacy one honest draw,
  default; meank = K-draw mean, mode-averaging; mode = WTA) + predict_samples.
  Measured held-out spike steering L1: single 0.62 / meank 0.53 / mode 0.34
  (77% of held-out spike frames are bimodal: follow-history vs follow-route).
- finetuned.yaml: inject the readout into saved hparams at load, guarded on
  _target_ == FlowPolicyObjective so regression/legacy ckpts are untouched.

Verified: synthetic bimodal unit test via the shared module; jq injection on
flow/regression/legacy hparams shapes (flow gets it, others untouched, default
single); e2e mini-predict mode-vs-single on a real ckpt (predictions differ,
finite, parquet written).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The K draws for meank/mode readouts ran as K sequential decoder.sample calls;
batch them via repeat_interleave into a single (B*K) integration, like the
fan/meank scripts. The encoder already ran once upstream (condition_tokens is
sliced from the precomputed embedding); only the small flow decoder sees the
K-fold batch, so readout wall-clock is ~flat instead of ~K-sequential.

Re-verified e2e (K=16, 5 held-out batches): finite, differs from single-draw.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ut eval

Three additions on the WTA readout:
- consensus.py: refactor into split_modes + mode_aware_anchor(anchor=
  "mean"|"medoid"). Medoid = the actual draw closest to the dominant-cluster
  mean — guaranteed model sample (dynamically coherent by construction) vs the
  synthetic averaged chunk; trades a little variance for realism. Exposed as
  predict_readout=mode_medoid. Unit-verified: always a real draw, commits to
  the dominant mode.
- flow_meank_eval.py: RESIDUAL DECOMPOSITION of the held-out spike error
  (steering, per spike frame): single / mean-of-K / mode (WTA) / mode-medoid /
  oracle-mode / best-draw. Splits the remaining error into selection regret
  (dominant minus oracle mode choice), within-mode residual (oracle), and
  coverage floor (best single draw) — each component has a different fix
  (selector vs training weight/data vs data/copycat).
- per-drive breakdown + datamodule/yaak/predict_val_all (all 5 held-out val
  drives): generalization numbers no longer rest on a single val drive.
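
The medoid anchor from the first bullet can be sketched in a few lines. This assumes L1 distance to the cluster mean, which the commit message does not specify, and the helper names are illustrative rather than those in `consensus.py`.

```python
def cluster_mean(chunks: list) -> list:
    """Element-wise mean of the chunks in one cluster (a synthetic chunk)."""
    k = len(chunks)
    return [sum(c[i] for c in chunks) / k for i in range(len(chunks[0]))]


def medoid(chunks: list) -> list:
    """The actual draw closest to the cluster mean, so always a real model sample."""
    center = cluster_mean(chunks)
    return min(chunks, key=lambda c: sum(abs(x - m) for x, m in zip(c, center)))


# Three draws from one (dominant) cluster, each a short action chunk.
cluster = [
    [0.0, 0.1, 0.2],
    [0.1, 0.2, 0.3],
    [0.4, 0.4, 0.4],
]
mean_anchor = cluster_mean(cluster)   # averaged chunk, not one of the draws
medoid_anchor = medoid(cluster)       # one of the three draws
```

The mean anchor generally matches none of the draws, which is the realism trade-off the bullet describes.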

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s diagnostic

Route-turn signal (signed bearing change of the waypoint polyline, 1-dof
calibration to steering) as the mode selector instead of draw-count mass.
5-val-drive result: wpt-mode 0.457 vs mass-WTA 0.341 spike steering chunk-L1 —
WORSE. Root cause measured directly: corr(route-turn, GT chunk steering) is
only -0.29 on spike frames / -0.13 overall, and per-waypoint lateral offsets
are ~uncorrelated — far too weak to arbitrate modes ~0.5 apart. Notably the
MODEL uses waypoints heavily (zeroing shifts predictions ~3x baseline error),
so route info is real but a hand-crafted scalar extraction is too crude.
Selection headroom (oracle-mode 0.26, best-draw 0.13 vs current 0.34) needs a
learned ranker or probability-mass recalibration (history-dropout training).

Also: route-corr diagnostics printed; wpt-mode/wpt-draw rows in the residual
decomposition; per-drive wptM column.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Canonical tag taxonomy (applied retroactively to all 27 existing runs via the
API, deduped against the older shorthand tags):
- scale: overfit | pilot-30 | corpus
- head: flow | regression-head
- levers: gauss-merge, lds, delta-loss, rope, waypoint-pe, hist-dropout,
  chunk-dct, ot-coupling, slot-adaln, beta-time, image-cond, mu-law, ema,
  unfrozen-encoder, linear-std, horizon-1
- outcome (post-hoc, set when logging to Notion): baseline | control |
  negative | null | superseded | legacy | crashed-early

Each experiment config now sets wandb.tags with its lever set, so future runs
are tagged automatically at launch (wandb.init(**cfg.wandb) passes them
through; child experiments override the parent's list wholesale).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t family

Scoped to the flow/regression experiment configs (set alongside their tags
block), so launches no longer need wandb.project=action-flow on the CLI.
pretrain/finetune/finetune_overfit_baseline keep the shared default (rmind).

Also fix a pre-existing duplicate drop_last key in datamodule/yaak/train.yaml
that made every config using it (pretrain, finetune) fail to compose.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…0-30x faster

The encoder is frozen in every flow finetune, so the 12 condition tokens per
(drive, frame) are deterministic constants — yet the training loop recomputed
them (JPEG decode -> DINOv3 ViT over 6 frames -> 8-layer encoder) every step of
every epoch of every experiment, to feed a 3M-param decoder. Cache them once;
train decoder-only.

Measured: full-path pilot epoch 548s (solo GPU) / 1627s (shared); cached run =
~50s WALL for startup + cache load + train epoch + val epoch (with flow
sampling) + ckpt, on a GPU shared with two live trainings. A 50-epoch pilot
screen: ~7.6h -> ~15-20 min. Pilot cache = 1.6GB fp16 (88k frames).
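
The headline numbers follow directly from the measurements quoted above; a quick check, treating the ~50 s cached figure as an upper bound on per-epoch cost since it also includes startup, cache load, validation and checkpointing:

```python
EPOCHS = 50
FULL_EPOCH_S_SOLO = 548      # full-path pilot epoch, solo GPU
FULL_EPOCH_S_SHARED = 1627   # full-path pilot epoch, shared GPU
CACHED_EPOCH_S = 50          # cached run wall time, used as a per-epoch bound

full_screen_hours = EPOCHS * FULL_EPOCH_S_SOLO / 3600
speedup_solo = FULL_EPOCH_S_SOLO / CACHED_EPOCH_S
speedup_shared = FULL_EPOCH_S_SHARED / CACHED_EPOCH_S

print(f"full-path screen: {full_screen_hours:.1f} h")
print(f"speedup: {speedup_solo:.0f}x (solo) to {speedup_shared:.0f}x (shared)")
```

That gives about 7.6 hours for the full-path screen and a speedup of roughly 11x to 33x, in line with the quoted 10-30x.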

- flow_policy: compute_metrics_from(condition_tokens, target_actions) — the
  metrics body downstream of condition/target extraction, with its own
  non-finite guard. compute_metrics now delegates to it.
- flow_cache_features.py: one encoder pass per split; saves BOTH condition
  variants (normal + action-history-zeroed) so ActionHistoryDropout-style
  training stays EXACT under caching (the frozen encoder is a function — the
  variant must be precomputed, not patched). Metadata records the source
  artifact + condition spec for cache validity.
- CachedFeaturesDataset + FlowFeatureTrainer: plain DataLoader over the cache;
  slim LightningModule training FlowPolicyObjective via compute_metrics_from.
  hist_dropout_p selects the cached variant per sample. State-dict keys are
  objectives.policy.* (ControlTransformer-compatible for later stitching).
- finetune_flow_pilot_lds_cached: the pilot-LDS recipe on the cached path
  (bs 256, schedule scaled; tag "cached"). For lever SCREENING — cached ckpts
  carry no encoder, so winners get confirmed on the full pipeline.

Limitations: valid only while the encoder stays frozen and the conditioning
spec is fixed; no predict from cached ckpts (stitch or retrain winners).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…led lr

Move the active pilot A/Bs onto the feature-cached track (the full-path runs
lafegos0/lybmf6k6 were killed and superseded — cached epochs are ~10-30x
faster, and cached-vs-cached is the only valid comparison anyway since the
schedule/batch differ from the full path):
- RegressionPolicyObjective.compute_metrics_from (same refactor as flow);
  FlowFeatureTrainer accepts either objective.
- finetune_pilot_regression_cached (standalone — the objective node must be
  replaced wholesale, not merged over the flow one) and
  finetune_flow_pilot_lds_histdrop_cached (p=0.5; exact under caching via the
  precomputed hist0 variant).
- bs 256 -> 1024 to saturate the GPU with two concurrent runs, with the
  LINEAR-SCALING lr correction (1e-5 @ bs64 -> 1.6e-4 @ bs1024) and the
  3500-step cosine. All three cached arms share lr/bs/schedule.
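
The learning-rate correction in the last bullet is the usual linear-scaling rule; a one-function sketch (the function name is illustrative) reproduces the quoted values:

```python
def linear_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Scale the learning rate in proportion to the batch size."""
    return base_lr * batch / base_batch


scaled = linear_scaled_lr(base_lr=1e-5, base_batch=64, batch=1024)
print(f"{scaled:.1e}")
```

A 16x larger batch gives a 16x larger learning rate, i.e. 1e-5 at bs64 becomes 1.6e-4 at bs1024.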

Fleet: 7c3dh7bs (cached LDS control) + vfa4hrig (cached histdrop) concurrent
on GPU0 (~89% util), cached regression chained after the control.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dropped with the ControlTransformer-specific callbacks when the cached configs
got a minimal inline set, but this one works: FlowFeatureTrainer deliberately
mirrors the objectives.policy.* key layout. Regression configs excluded (no
position_embedding; slot_queries would need a separate select).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+meank.cache=<val.pt> +meank.cached_artifact=<model-run:vN> evaluates
FlowFeatureTrainer checkpoints (which carry no encoder) by reading the
condition tokens from the feature cache — the frozen encoder makes them
checkpoint-independent — and loading only the objective weights. Same census/
decomposition/per-drive readouts; waypoint diagnostics degrade to their meanK
fallback (waypoints aren't cached). Also makes repeated evals ~decoder-only
fast and guards the wpt polyfit when no finite route signal exists.

First use (cached fleet verdicts, 5 val drives, spike steering chunk-L1):
- histdrop NULL with mechanism: more bimodality (20.6% vs 17.0%), better
  coverage (0.144 vs 0.150) and oracle (0.259 vs 0.280), but selection regret
  doubled (0.084 vs 0.050) -> improves proposals, not the choice.
- regression held-out: wins aggregate/flats; spike 0.375 ~= flow single-draw,
  loses to flow meanK (0.326) and far above flow's coverage floor (0.150).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- reports/LOOP_INSTRUCTIONS.md: the agreed contract for the >=30-experiment
  autonomous loop (protocol, pre-registered decision rules, experiment tree,
  stack-on-win policy, mandatory learned-ranker phase, edge handling, launch
  pattern gotchas).
- flow_report.py: self-contained HTML experiment report (approved format) —
  run cards with verdict chips, key-metric table with best-per-column
  highlighting, interactive pred-vs-GT overlays per drive (legend-toggle run
  comparison), maneuver-zoom small multiples. Manifest-driven, cumulative.
- flow_cached_predict.py: reye/DataFramePredictionWriter-schema parquets from
  cached (encoder-less) checkpoints; readouts single/meank/mode/mode_medoid +
  deterministic for regression heads.
- flow_cache_features.py: cache schema v2 — full frame_idx/time_stamp arrays
  (report/reye need timestamps); caches rebuilt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Controls (4 seeds) spike-meanK: .3289/.3555/.3237/.3387 -> cross-seed
sigma=0.014. lr 0.5x/2x both null (paired deltas -0.001/+0.003) -> keep
1.6e-4. Key finding: same-seed paired deltas (~0.003) are 5x tighter than
cross-seed sigma -> loop switches to paired design (levers at seed 1001 vs
exp01), keep bar = paired delta > 0.010, second-seed confirm before stacking.
Evals now seeded (same ckpt -> same numbers).
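
The cross-seed figure can be checked from the four control values: the sample standard deviation (n - 1 denominator) gives 0.014. The keep rule below is a sketch of the stated bar; the sign convention (improvement = control minus lever, lower spike-meanK is better) and the function name are assumptions.

```python
from statistics import mean, stdev

control_spike_meank = [0.3289, 0.3555, 0.3237, 0.3387]

control_mean = mean(control_spike_meank)
cross_seed_sigma = stdev(control_spike_meank)  # sample std, n - 1 denominator


def keep_lever(control: float, lever: float, bar: float = 0.010) -> bool:
    """Same-seed paired comparison: keep only if the improvement exceeds the bar."""
    return (control - lever) > bar


print(f"cross-seed sigma = {cross_seed_sigma:.3f}")
```

With a same-seed paired noise of about 0.003, a bar of 0.010 sits at roughly three times the paired noise while staying below the cross-seed sigma.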

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per-channel alpha in ManeuverLossWeights (tuple per model channel; enables
steering-only / gas-weighted LDS; unit-verified). Loop verdicts: LDS kept at
0.5 (its held-out value = gas-maneuver protection); DELTA-OFF is the biggest
win so far (spike-meanK 0.327 vs 0.356 paired, dose-response monotonic across
0/10/50) - second-seed confirm + gas-recovery composition in flight.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase D of the experiment loop: a 1.4M-param scorer s(cond, draw) trained
listwise (softmax-CE vs soft targets softmax(-L1/tau), spread-weighted) on
K=32 frozen-flow draw banks, evaluated as a top-M selection readout against
meanK / oracle-draw / oracle-mode references.
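
A bare-bones version of the listwise objective described above, for one frame's draw bank. This is a sketch under assumptions: it omits the spread weighting, the function names are hypothetical, and `tau=0.02` is simply the value the later ranker commits settle on.

```python
import math


def log_softmax(xs: list) -> list:
    """Numerically stable log-softmax."""
    m = max(xs)
    log_total = math.log(sum(math.exp(x - m) for x in xs))
    return [x - m - log_total for x in xs]


def soft_targets(l1_errors: list, tau: float = 0.02) -> list:
    """Target distribution over draws: softmax(-L1 / tau); lower error, more mass."""
    return [math.exp(lp) for lp in log_softmax([-e / tau for e in l1_errors])]


def listwise_loss(scores: list, l1_errors: list, tau: float = 0.02) -> float:
    """Cross-entropy between the soft targets and the scorer's softmax over draws."""
    targets = soft_targets(l1_errors, tau)
    log_probs = log_softmax(scores)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Four draws for one frame: per-draw L1 against GT and some scorer outputs.
errors = [0.30, 0.12, 0.45, 0.20]
scores = [0.1, 2.0, -1.0, 0.8]
loss = listwise_loss(scores, errors)
```

A smaller `tau` concentrates the target on the best draw, while a larger one spreads it across near-ties, which is the knob behind the U-shaped dose-response reported in the next commit.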

Also: LOOP_INSTRUCTIONS decision rules re-amended after exp12 — delta-off's
-0.028 did not replicate (lever x seed interaction >> paired noise); stack
bar is now a two-seed lever mean beating the control panel mean by >0.015.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Ranker iteration results (val banks seed 0/1, control ckpt bj8r0jyp):
- listwise w512/64ep/tau=0.02 + softmax t=4 readout: spike steering chunk-L1
  0.2963 two-bank mean vs meanK 0.3247 (-0.028), gas/flat/overall neutral
  or better — first confirmed win of the loop, pure inference-time.
- tau dose-response U-shaped (0.1: 0.306, 0.05: 0.300, 0.02: 0.296,
  0.01: 0.312, 0.005: 0.304); ep256 overfits the train bank.
- set-context features HURT spikes (0.3225) — shortcut toward meanK-mimicry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…cached predict

Robustness arc (exp24-30): K=64 and 3-decoder-ensemble (K=96) clouds do NOT
help — selection difficulty grows faster than coverage (median oracle-rank of
pick 12/32 -> 26/64 -> 39/96). Gas-aware labels dilute both axes. Hard mode
commitment loses to soft weighting despite 65-69% oracle-mode agreement.
Per-checkpoint rankers: aggregate/flat L1 win replicates on all 3 control
checkpoints (~5-8%); the spike win is checkpoint-dependent (-0.028 on the
strongest decoder, +0.017 on the weakest).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ort manifest

Phase E: fixes the original pilot's 20-drives-Niro101 bias; disjoint from the
5-drive val set. Action-transform/LDS stats intentionally kept from the
original pilot so val metrics stay comparable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
exp34: sampling-noise temperature sweep on bj8r0jyp. Coverage ceilings improve
monotonically (oracle draw 0.1437 -> 0.1280 -> 0.1215; oracle mode 0.2718 ->
0.2638 -> 0.2616) but the ranker readout degrades in lockstep (0.2985 ->
0.3239 -> 0.3702 at t=4): widened clouds add distractors faster than the
scorer can exploit headroom. Selection discrimination is the binding
constraint, not coverage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Zero-parameter selection via receding-horizon coherence (cf. backward
coherence, Liu et al. 2024). Result: clean negative with a sharp diagnosis —
committed-anchored voting is worse than meanK at every temperature (error
propagation; the anchor is wrong exactly at maneuver onset), confidence-
gating collapses bitwise to meanK (the gate only opens where voting is
useless), and the near-oracle GT-anchored bound (0.156) is quasi-tautological
(previous GT chunk shares 5/6 slots with current GT). Selection-side levers
are now exhausted: the mode-disambiguation information must come from
conditioning, not the readout.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… only

37 experiments closed Phases B-E; everything the loop rejected comes out:
- beta flow-time sampling (exp11 drop) — logit-normal/uniform stay
- waypoint PE (null on overfit, never stacked)
- per-channel LDS alpha (exp09/13 negative) — scalar 0.5 is the recipe
- action-history dropout: callback, trainer knob, cond_hist0 cache variant
  (metric-null at p=0.5; halves cache build cost and size)
- histdrop/wpe/slot-adaln experiment configs (slot-AdaLN is in the decoder
  proper since 15f7a80)
- pre-loop wip scripts: flow_oracle, flow_oracle_generate, flow_rollout_eval
  (findings preserved in Notion)
- flow_ranker: canonical recipe only (w512/64ep/tau0.02, steering label);
  set-context, label-weights, temporal-voting variants removed (documented
  negatives)

Kept: chunk_delta_weight=10 (gas-spike protection), mode/medoid predict
readouts (used via just predict), consensus.py + eval diagnostics, the
ranker's draws/train/eval stages with the noise_scale coverage knob.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- embeddings_unpacked -> embeddings_flattened (tensordict/Episode drift)
- policy_finetune lr interpolations get oc.select defaults so the model
  config composes standalone (experiments still override)
- drop tests of removed code (flow_oracle trace, beta time sampling)
134 passed, 1 skipped.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Full-path anchor (exp33/jdt7mexr) confirms the cached screening track;
final synthesis in the Notion dev log. Recipe: control unchanged + meanK
readout. Next direction: conditioning (route-vs-history disambiguation).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@koritsky changed the title from "draft: action flow matching" to "draft dev: action flow matching" (Jun 15, 2026)

1 participant