draft dev: action flow matching#235
Draft
koritsky wants to merge 52 commits into
cead078 to
9b56b2b
Compare
add sin time emb and adal
better oracles
Add higher-order flow sampling
Strengthen flow time conditioning
Normalize flow policy actions
Configurable training-time flow sampling
Add deterministic flow policy validation
add comments
rm normalization, simplify
rm some validations
batch with custom temporal dimension
fix turnsignal
move prediction one step in future
drop waypoints logging
…/val Overfit datamodule references its own single-drive /dataset/yaak/overfit for both splits (defensive against scope-creep into the full yaak/train corpus). train/val use drop_last to avoid partial trailing batches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RotarySelfAttention threaded through the decoder blocks via a rope flag (default off so existing checkpoints rebuild byte-identical). Injects slot position at every attention layer, complementing the additive position embedding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ampling, maneuver-L1
- within-chunk delta loss (chunk_delta_weight) to break the constant-chunk optimum; NaN-row guard with culprit attribution; pe_drift metric; flow_mse always logged.
- configurable flow-time sampling: logit-normal mean/std exposed, plus pi0's beta p(t)=Beta((s-t)/s; alpha,1) skewed toward the noisy end (generator-safe inverse-CDF).
- per-t flow-MSE buckets and per-channel maneuver-L1 (gas/brake/steering) validation metrics.
- policy_finetune.yaml wires the new hparams with backward-compatible defaults.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
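A minimal sketch of the two flow-time samplers named in the commit above, using only the standard library. The function names are hypothetical, the defaults `alpha=1.5` and `s=0.999` are the values I recall from the pi0 paper rather than anything confirmed by this PR, and the real code presumably draws through a torch generator instead of `random.Random`.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Flow time t = sigmoid(z) with z ~ N(mean, std^2), so t lies in (0, 1)."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def sample_shifted_beta(rng, alpha=1.5, s=0.999):
    """Flow time t with x = (s - t) / s ~ Beta(alpha, 1).

    Beta(alpha, 1) has CDF x**alpha, so the inverse CDF is u**(1/alpha):
    one uniform draw per sample, no rejection step.
    """
    u = rng.random()
    x = u ** (1.0 / alpha)
    return s * (1.0 - x)


rng = random.Random(0)
beta_times = [sample_shifted_beta(rng) for _ in range(20000)]
logit_times = [sample_logit_normal(rng) for _ in range(20000)]
print(sum(beta_times) / len(beta_times))    # close to s / (alpha + 1)
print(sum(logit_times) / len(logit_times))  # close to 0.5 when mean=0
```

Because x has density proportional to x^(alpha-1), mass concentrates near x = 1, which is t near 0. Under pi0's convention that t = 0 is pure noise, this is the "skewed toward the noisy end" behavior the commit describes.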
…gging Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… recipes
- fan: N-draw sample-concentration + horizon views.
- field: velocity-field / trajectory visualization.
- thresholds: per-channel maneuver thresholds from data (CPU-only, prints a config-ready tuple).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…; unfreeze encoder
- finetune_flow callbacks: freezer + PE-cosine image logger, no EMA (val reads raw weights; EMA lags a still-descending overfit).
- finetune_overfit_flow: chunk-delta, time-sampling, maneuver thresholds, heun/32 val, 200 epochs.
- finetune: unfreeze encoder for multi-drive finetune.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eport - improvements.md: reverse the mislabeled-overfit correction — the runs were genuine single-drive overfit (finetuned policy head), not full-corpus. - add rescue plan and diagnostic report (typst source, figures, rendered pdf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Train the flow in a well-conditioned model space matched to its N(0,1) prior (suspect 5): - gas/brake -> signed longitudinal merge (gas - brake), folding brake's point mass into one continuous channel; model space is 2-d (longitudinal, steering), inverted to gas/brake at the I/O boundary. - per-channel Gaussianize (empirical CDF -> N(0,1)) via data-fit quantile knots; monotonic, invertible, non-clipping. Objective decouples raw I/O dim from decoder/model dim and inverts samples so all metrics stay raw-space. flow_action_thresholds.py writes the knots (+thresholds.merge=true); decoder dim via flow_action_dim. Off by default (identity). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
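The transform described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the code in action_transform.py: all function names, the knot count, and the mid-quantile knot placement are hypothetical, and the merge is only exactly invertible when gas and brake are never both positive.

```python
from bisect import bisect_right
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def merge_longitudinal(gas, brake):
    """Fold gas and brake into one signed channel."""
    return gas - brake


def split_longitudinal(longitudinal):
    """Invert the merge at the I/O boundary (assumes gas * brake == 0)."""
    return max(longitudinal, 0.0), max(-longitudinal, 0.0)


def fit_knots(values, n_knots=33):
    """Pair empirical quantiles of the data with N(0, 1) quantiles."""
    xs = sorted(values)
    x_knots, z_knots = [], []
    for i in range(n_knots):
        p = (i + 0.5) / n_knots
        x = xs[min(int(p * len(xs)), len(xs) - 1)]
        if x_knots and x <= x_knots[-1]:
            continue  # skip ties so the map stays strictly monotonic
        x_knots.append(x)
        z_knots.append(_STD_NORMAL.inv_cdf(p))
    return x_knots, z_knots


def _interp(v, src, dst):
    """Piecewise-linear map; the end segments extrapolate, so nothing clips."""
    i = min(max(bisect_right(src, v) - 1, 0), len(src) - 2)
    slope = (dst[i + 1] - dst[i]) / (src[i + 1] - src[i])
    return dst[i] + slope * (v - src[i])


def gaussianize(v, knots):
    return _interp(v, knots[0], knots[1])


def degaussianize(z, knots):
    return _interp(z, knots[1], knots[0])


# Deterministic toy data in [-1, 1] standing in for the longitudinal channel.
data = [((i * 37) % 101) / 50.0 - 1.0 for i in range(1000)]
knots = fit_knots(data)
z = gaussianize(0.42, knots)
print(z, degaussianize(z, knots))
```

Both knot sequences are strictly increasing, so the forward and inverse maps are monotonic and mutually inverse, including outside the fitted range where the end segments are extended linearly.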
Routes slot identity through the flow-time AdaLN modulation — the channel proven to train — to break the constant/mid-anchored chunk (the additive position embedding stayed at init). A zero-init per-slot offset (nn.Embedding, so SelectiveAdamW classifies it and excludes it from weight decay) is added to the time embedding feeding every block's adaLN_modulation, giving each slot its own scale/shift. Zero-init => byte-identical to the time-only decoder at start; old checkpoints rebuild unchanged. Off by default; enabled in the finetune_overfit_flow_slot_adaln experiment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
816231a to
15f7a80
Compare
…ace) decoder.sample returns model-space samples (2-ch after the gas/brake merge), but the fan indexed steering at raw channel idx 2 -> IndexError. Invert samples through the objective's action transform before scoring; identity no-op when no transform is configured. Fan metrics stay raw-space, matching the raw GT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e horizon axis Fix the mid-anchored constant chunk (every slot predicts ~t+3; fan lag = h-3) at its cause: slots are ~90%+ correlated, so in slot space within-chunk structure gets only its variance share of the flow MSE (measured on the overfit drive: 4.7% longitudinal / 12.4% steering, sigma_0 ~ sqrt(6)) — the constant chunk is the gradient-rational optimum, which is why information-only levers (RoPE, additive PE, slot-AdaLN) all failed to break it. - action_transform: optional chunk-DCT stage after Gaussianize — orthonormal DCT-II over the horizon axis (slot index -> frequency: k=0 mean, k=1 slope, ...) + per-coefficient standardization with a sigma-floor (frac of the channel's sigma_0). Within-chunk shape gets 5/6 of the loss weight instead of ~5-12%; the inverse scales errors back DOWN by sigma_k, so raw samples are robust to bad high-frequency predictions (opposite of mu-law's amplifying inverse). Precedent: pi0-FAST's normalize+DCT action tokenizer (arXiv:2501.09747), trajectory-DCT in motion prediction (arXiv:1908.05436). - flow_policy: validate DCT horizon == decoder horizon; force-disable the chunk-delta loss under a horizon-mixing transform (differencing adjacent coefficients is meaningless; per-coefficient standardization supersedes it). - flow_action_thresholds: always print the DCT sigma spectrum diagnostic; +thresholds.dct=true persists mu/sigma/floor into the stats json (artifacts/action_norm_dct.json fitted from the overfit drive). - finetune_overfit_flow_dct experiment: pg5lzmvk base + DCT stats, flow_action_dim=2, delta weight 0 — single-change A/B vs the transform-only baseline. Readouts: fan lag profile flattens, per-horizon spike L1 loses the U-shape, slope corr >> 0, concentration holds near 91%/61%. - tests: basis orthonormality, constant-chunk -> k=0, round-trips (incl. stacked sample dims), sigma-floor, delta guard, horizon-mismatch. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
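To make the chunk-DCT stage above concrete, here is a small standard-library sketch of an orthonormal DCT-II over a 6-slot horizon. Function names are hypothetical and the per-coefficient standardization and sigma-floor are left out; it only shows that a constant chunk lands entirely in k=0 and that the transform round-trips.

```python
import math


def dct2_basis(horizon):
    """Rows are the orthonormal DCT-II basis vectors over the horizon axis."""
    basis = []
    for k in range(horizon):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / horizon)
        basis.append([scale * math.cos(math.pi * (n + 0.5) * k / horizon)
                      for n in range(horizon)])
    return basis


def to_frequency(chunk, basis):
    """Slot index -> frequency index (k=0 level, k=1 slope, ...)."""
    return [sum(b * x for b, x in zip(row, chunk)) for row in basis]


def to_slots(coeffs, basis):
    """Inverse transform; the basis is orthonormal, so it is the transpose."""
    horizon = len(coeffs)
    return [sum(basis[k][n] * coeffs[k] for k in range(horizon))
            for n in range(horizon)]


BASIS = dct2_basis(6)
constant = to_frequency([0.3] * 6, BASIS)
ramp = to_frequency([0.0, 0.1, 0.2, 0.3, 0.4, 0.5], BASIS)
print(constant)
print(ramp)
```

A constant chunk of value v maps to c0 = v * sqrt(6) with every other coefficient zero, which matches the commit's sigma_0 ~ sqrt(6) for unit-variance, almost fully correlated slots. After standardizing each of the 6 coefficients to unit scale, the 5 non-constant coefficients carry 5/6 of an unweighted MSE, which is presumably where the "5/6 of the loss weight" figure comes from.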
…in transform The chunk-DCT reparameterization (orthonormal DCT-II over the horizon axis + per-coefficient standardization, commit a10838a) is a dead end: head-to-head on the overfit eval drive it loses to the plain Gaussianize+merge baseline (pg5lzmvk) — steering chunk-mean corr 0.36 vs 0.81, steering L1 ~2x worse, single-draw sample_l1 0.048 vs 0.023. The sigma-floor=0.3 fix lifted corr 0.10->0.36 over the floor-0.05 collapse but couldn't close the gap. Two causes: the level's slot-redundancy is load-bearing (6 tokens carry it in slot space -> 6x conditioning pressure; the DCT funnels it through c0 alone), and the sigma-floor down-weights high-k coefficients in training but the inverse inflates their prior-sampled values back up (2.5x over-wiggle). Full negative result + numbers logged in Notion (Experiments DB + Action Expert dev log). Restores action_transform.py / flow_policy.py / flow_action_thresholds.py to their pre-DCT state and removes the finetune_overfit_flow_dct experiment. Keeps test_action_transform.py (pruned to the non-DCT cases): the Gaussianize+merge transform — now the established baseline — had ZERO test coverage before, so the roundtrip / stacked-sample-dim / dims tests are net-new and worth retaining. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… diversity First multi-drive test of the established flow baseline (Gaussianize+merge transform, delta+RoPE, logit-normal, heun/32). Everything since the rescue plan has been single-drive overfit; this checks whether the transform win transfers and gives real generalization readouts (train/val now disjoint). - dataset/train_pilot template: first 30 drives of the train list (Niro096 kept first for continuity with the overfit runs). - datamodule/train_pilot: pilot train split + the standard 5-drive val set (no overlap). - experiment/finetune_flow_pilot: pg5lzmvk recipe at 50ep / 25k-step cosine; stats are pilot-fit (artifacts/action_norm_pilot.json, gitignored) on the train split — see header for the fit command. - flow_action_thresholds: support split=train so stats can be fit on the pilot's train drives (was predict|val only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New baseline (run tqozevsv): reweight the flow loss from frequency toward
importance to fix the maneuver undercommit. The flat-dominated MSE (worsened by
the Gaussianize tail-Jacobian) spent raw precision by frequency — finest in
cruise, coarsest on the rare maneuvers that matter. LDS realigns it to
importance: steering maneuver-L1 dropped ~26% (spike 0.289->0.215) AND steering
chunk-mean corr improved 0.81->0.91, with overall single-draw sample_l1 flat.
- maneuver_weights.py: ManeuverLossWeights (Yang et al., ICML 2021). Per-chunk
label = peak |physical action| over the horizon, per model channel
(longitudinal, steering) — peak-over-slots upweights a chunk's lead-in with
its maneuver. Weight = (1/smoothed-density)^alpha, capped (no runaway tail
weight — cf. mu-law), mean-1 normalized over the empirical distribution so the
loss scale (hence LR/schedule) is unchanged. alpha/cap applied at load => sweep
without refit.
- flow_policy.py: load + validate the weighter (channels must match model space);
weight the per-element flow loss (broadcast over slots) and, consistently, the
chunk-delta term; depends only on the clean target so it is valid at every
flow-time. flow_mse stays logged UNWEIGHTED (cross-run comparable);
maneuver_weight_mean logged as a ~1 sanity check. Off when flow_lds_stats empty.
- action_transform.py: physical_model() — merge without Gaussianize, for the
(physical-units) importance label.
- flow_action_thresholds.py: +thresholds.lds=true fits and writes per-channel
{edges, emp, smooth} into the stats JSON; _collect_targets now returns per-chunk
shape (frames, H, A) for the peak-over-horizon label.
- Validation: single honest draw (deployment-realistic). Dropped the best-of-N
curve and _sample_curve_ks (oracle selection flattered the model); kept
sample_l1 + maneuver_l1/*. flow_prediction_samples default 32 -> 1.
- Experiments: finetune_overfit_flow_lds, finetune_flow_pilot_lds (pg5lzmvk
recipe + LDS as the single change).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
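The weighting rule in the commit above, (1/smoothed-density)^alpha with a cap and mean-1 normalization, can be sketched like this. Everything below is illustrative: the 3-tap smoothing kernel stands in for whatever kernel the real fit uses, `cap_ratio` (cap relative to the smallest weight) is one possible reading of "capped", and the function names are hypothetical.

```python
from bisect import bisect_right


def smooth_density(emp, kernel=(0.25, 0.5, 0.25)):
    """Kernel-smooth a normalized histogram; edge bins reuse the border bin."""
    n, half = len(emp), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            src = min(max(i + j - half, 0), n - 1)
            acc += k * emp[src]
        out.append(acc)
    total = sum(out)
    return [v / total for v in out]


def build_weights(emp, smooth, alpha=0.5, cap_ratio=5.0):
    """Inverse smoothed density to the power alpha, capped, mean-1 under emp."""
    raw = [(1.0 / max(s, 1e-12)) ** alpha for s in smooth]
    ceiling = cap_ratio * min(raw)
    raw = [min(w, ceiling) for w in raw]
    mean = sum(e * w for e, w in zip(emp, raw))  # expectation over the data
    return [w / mean for w in raw]


def chunk_label(chunk):
    """Peak |physical action| over the horizon for one channel."""
    return max(abs(a) for a in chunk)


def weight_for(chunk, edges, weights):
    """Look up the weight of the histogram bin containing the chunk label."""
    i = bisect_right(edges, chunk_label(chunk)) - 1
    return weights[min(max(i, 0), len(weights) - 1)]


edges = [0.0, 0.1, 0.3, 0.6, 1.0]   # 4 bins over the peak-|action| label
emp = [0.90, 0.07, 0.02, 0.01]      # cruise dominates, maneuvers are rare
weights = build_weights(emp, smooth_density(emp))
print(weights)
print(weight_for([0.0, 0.05, 0.4, 0.7, 0.2, 0.1], edges, weights))
```

Normalizing by the expectation under the empirical distribution keeps the average loss weight at 1, which is why the loss scale (and therefore the learning-rate schedule) does not need retuning. Since alpha and the cap enter only in `build_weights`, they can be swept without refitting the histogram.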
Make ModelCheckpoint cadence/retention overridable (oc.select, defaults 1/1 = prior behavior) and set the overfit base to every_n_epochs=10, save_top_k=-1. Keeps a sparse epoch ladder for offline fan/eval and cuts the per-epoch wandb artifact uploads ~10x (log_model=all logs on each save). Pilot/other runs unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The encoder is factorized (RoPE on the temporal axis only; PR #209), so its within-timestep spatial attention has no positional encoding and the 10 waypoints reach the flow decoder as an unordered bag — distinguished only by a shared per-modality role embedding. The route SEQUENCE is therefore invisible to the decoder's cross-attention (order recoverable only from coordinate values). This restores a within-frame route-sequence signal. - flow_policy.py: optional learnable per-waypoint-index embedding (nn.Embedding, zero-init) added to the waypoint condition-token block in _condition_tokens, so the decoder can use waypoint order. Added in the TRAINABLE objective (not the encoder) because the episode builder + encoder are frozen during flow finetune; cond_dim auto-derived from the decoder's condition projection; count-mismatch guard. Zero-init => byte-identical at start; weight-decay-excluded (nn.Embedding); per-slot constant across timesteps => KV-cache safe. Off when waypoint_pe is false. - policy_finetune.yaml: flow_waypoint_pe (default false) / flow_waypoint_count (default 10) threaded through hparams_jq. - experiments: finetune_overfit_flow_lds_wpe, finetune_flow_pilot_lds_wpe (LDS baseline + waypoint PE as the single change). Inherit the base directly (re-setting LDS flags inline) to avoid a two-level experiment-chaining bug. Overfit A/B (3e5numej vs tqozevsv) was a clean null — the PE learns sensible ordinal structure (adjacent-index cos 0.94) but at tiny magnitude on a fixed single route; the meaningful test is the pilot (route diversity). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hparams_jq is a jq program; an oc.select default of `false` renders as Python
`False`, which jq rejects ("False/0 is not defined"). This broke every
experiment that didn't explicitly set flow_waypoint_pe (the plain *_lds
configs). Use 0, matching the existing 1/0 convention (flow_decoder_rope,
flow_decoder_slot_adaln); pydantic coerces 0 -> False.
(Surfaced while reverting the minibatch OT-coupling experiment, which was a
structural dead end for a per-sample-conditioned flow — see Notion.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
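A minimal illustration of the rendering pitfall described above, assuming the default is interpolated into the jq program through ordinary Python string formatting. The helper `render_jq_flag` and the program shape are hypothetical; only the stringification behavior is the point.

```python
import json


def render_jq_flag(name, value):
    """Hypothetical template step: splice a Python value into a jq program."""
    return f".{name} = {value}"


# A Python bool renders capitalized; jq parses `False` as a call to an
# undefined zero-argument function, hence "False/0 is not defined".
print(render_jq_flag("flow_waypoint_pe", False))
# An integer default renders as a valid jq literal.
print(render_jq_flag("flow_waypoint_pe", 0))
# JSON-encoding the value first is another way to get a valid jq literal.
print(render_jq_flag("flow_waypoint_pe", json.dumps(False)))
```

The commit takes the second route (0 instead of false), which also matches the existing 1/0 convention for the other flags.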
…llout diagnostics Three evaluation gaps closed: - datamodule/yaak/predict_val (+ val_eval dataset template): predict over a HELD-OUT val drive (Niro115). The standard predict datamodule points at the overfit drive, which is in-sample for every pilot model — all previous predict parquets silently measured memorization for those. - flow_meank_eval.py: mean-of-K bias/variance decomposition. ~30% of overall flow L1 is per-draw sampling variance (mean-of-32 is a free inference win); held-out SPIKE steering error is ~pure bias (0.55-0.63 even for the 32-draw conditional mean) — out-of-sample maneuver INITIATION, not sampling, is the error mass. Also defines the deterministic-head-reachable floor for the flow-vs-heads comparison. - flow_rollout_eval.py: recursive action-feedback rollout (predictions fed back as action history; vision/waypoints stay GT). Motivated by the action-history sensitivity probe: predictions shift 1.7-3.4x baseline error under history perturbation (copycat signature), so open-loop metrics understate deployment error through this channel. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
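One way to read the mean-of-K decomposition above: compare the average L1 of individual draws with the L1 of the averaged chunk. The sketch below assumes the "variance share" is computed as the relative gap between the two, which is my interpretation rather than something flow_meank_eval.py is confirmed to do; names are hypothetical.

```python
def l1(pred, target):
    """Mean absolute error over the slots of one chunk."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def meank_report(draws, target):
    """Compare single-draw error with the error of the K-draw mean chunk."""
    k = len(draws)
    single = sum(l1(d, target) for d in draws) / k
    mean_chunk = [sum(d[i] for d in draws) / k for i in range(len(target))]
    meank = l1(mean_chunk, target)
    share = (single - meank) / single if single > 0 else 0.0
    return {"single_l1": single, "meank_l1": meank, "variance_share": share}


# Draws scattered around the target: averaging removes all of the error.
print(meank_report([[0.1, 0.3], [0.3, 0.1]], [0.2, 0.2]))
# Draws that agree but are offset: averaging removes none of it (pure bias).
print(meank_report([[0.5, 0.5], [0.5, 0.5]], [0.2, 0.2]))
```

By the triangle inequality the mean-of-K error can never exceed the average single-draw error, so the share lies in [0, 1]. A share near 0 on spike frames is the "pure bias" signature the commit reports for held-out steering maneuvers.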
The "is flow earning its complexity?" control that never existed: the old baseline (PolicyObjective) regresses only the last observed step, so flow vs simple-heads was never comparable. RegressionPolicyObjective predicts the same 6-slot future chunk from the same condition tokens, in the same Gaussianize+ merge model space, with the same raw-space metrics and wandb keys (sample_l1, maneuver_l1/*) — a single deterministic forward + MSE instead of an integrated flow. Capacity-matched decoder (4-layer cross-attention, learned slot queries via nn.Embedding for SelectiveAdamW classification). Optional LDS (off by default). finetune_regression callbacks = finetune_flow minus the flow-only PE image logger (encoder stays FROZEN — required for the matched comparison). Experiments: finetune_overfit_regression (vs pg5lzmvk/tqozevsv), finetune_pilot_regression (vs 7d3y7ort/ge4vfboq held-out). Pre-registered readout: regression ~= flow mean-of-K => flow's gap is sampling variance (deterministic readout/distillation is the cheap equivalent); flow single-draw beats regression on maneuvers => distributional modeling earns it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The action-history sensitivity probe showed the policy leans heavily on past actions (perturbing them shifts predictions 1.7-3.4x the baseline error): the classic copycat shortcut (de Haan et al. 2019 causal confusion; ChauffeurNet past-motion dropout). Cruise is "copy history forward"; maneuver ONSET needs vision/route, which the shortcut out-competes — the best causal account of the held-out spike bias (~0.6 even for the 32-draw conditional mean). ActionHistoryDropout zeroes the action-history batch fields for a random subset of train samples at on_train_batch_start (eval untouched; the frozen encoder is a function, so zeroed inputs change the summaries and the head learns both regimes). finetune_flow_histdrop callbacks + the finetune_flow_pilot_lds_histdrop experiment (single change vs ge4vfboq; p via flow_action_hist_dropout, default 0.5) — QUEUED, not launched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ise mode delta The counters are load-bearing: the first rollout read "ratio 1.00" everywhere, which without them would have been reported as "no compounding". They proved substitutions fired (2385, mean |GT-sub| 0.053) while predictions stayed bit-identical — i.e. genuinely zero local sensitivity to self-error-scale history perturbations (drive-start segment), vs the probe's large-perturbation sensitivity. Distinguishes "feedback not wired" from "feedback has no effect". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Plain mean-of-K is a mode-averaging estimator: on a genuinely bimodal conditional (turn left XOR right) it splits the difference — the exact regression-to-the-mean failure flow exists to avoid. The mode-aware readout clusters the K draws on chunk-mean steering (largest-gap split, with an outlier guard: minority cluster must hold >= mode_min_frac of draws), commits to the DOMINANT cluster (draw count ~ probability mass) and averages within it — identical to mean-of-K on unimodal frames, mode-committing on bimodal ones. Precedents: propose-then-select in motion forecasting (MultiPath anchors, MTR/DenseTNT NMS, Trajectron++ "most likely" deployment; MDN take-dominant-component) and Minimum-Bayes-Risk consensus decoding (Kumar & Byrne 2004); sample-then-select policies (Implicit BC, SfBC, Diffusion-ES). The census reports how often the conditional is ACTUALLY multimodal (% bimodal frames, flat vs spike, mode separation, and meanK-vs-mode-aware steering L1 on the bimodal subset). With route waypoints in the conditioning, discrete modes should be rare at current scale — this measures it instead of assuming. Unit-verified: bimodal draws (24 left / 8 right) -> anchor at the dominant mode (-0.60), where mean-of-K averages to -0.30; unimodal / outlier-minority / sub-gap frames fall back exactly to mean-of-K. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
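The readout above can be sketched in plain Python. This is a simplified stand-in for the shared consensus module, assuming each draw is a single-channel steering chunk; the `min_gap` threshold and both default values are hypothetical, while `mode_min_frac` follows the name used in the commit.

```python
def split_modes(keys, min_gap=0.3, mode_min_frac=0.2):
    """Split draw indices at the largest gap of the sorted keys.

    Returns (dominant, minority). The minority list is empty when the frame
    is treated as unimodal: the gap is too small, or the smaller cluster
    holds fewer than mode_min_frac of the draws (outlier guard).
    """
    order = sorted(range(len(keys)), key=keys.__getitem__)
    if len(order) < 2:
        return order, []
    gaps = [keys[order[i + 1]] - keys[order[i]] for i in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    low, high = order[:cut + 1], order[cut + 1:]
    minority, dominant = sorted((low, high), key=len)
    if gaps[cut] < min_gap or len(minority) < mode_min_frac * len(keys):
        return order, []
    return dominant, minority


def mode_aware_anchor(draws, min_gap=0.3, mode_min_frac=0.2):
    """Average the draws of the dominant cluster (all draws if unimodal)."""
    keys = [sum(chunk) / len(chunk) for chunk in draws]  # chunk-mean steering
    dominant, _ = split_modes(keys, min_gap, mode_min_frac)
    horizon = len(draws[0])
    return [sum(draws[i][h] for i in dominant) / len(dominant)
            for h in range(horizon)]


# 24 draws turning left, 8 turning right, as in the unit check above.
bimodal = [[-0.6] * 6] * 24 + [[0.6] * 6] * 8
print(mode_aware_anchor(bimodal)[0])                  # dominant mode, near -0.60
print(sum(c[0] for c in bimodal) / len(bimodal))      # plain mean-of-K, near -0.30
```

With a single stray draw instead of 8, the minority cluster falls below the guard and the function returns exactly the plain mean over all draws.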
Wire the mode-aware consensus into the predict pipeline so `just predict`
can use it on EVERY existing flow checkpoint (pure inference, no retraining):
just predict ... '+flow_predict_readout=mode' '+flow_predict_samples=16'
- objectives/consensus.py: shared torch winner-take-all consensus (cluster K
draws on chunk-mean steering via largest-gap split + outlier guard; commit
to the dominant cluster, average within it; == mean-of-K on unimodal
frames). Single source of truth: predict() and flow_meank_eval both use it
(the eval's numpy version replaced by a wrapper).
- FlowPolicyObjective: predict_readout (single = legacy one honest draw,
default; meank = K-draw mean, mode-averaging; mode = WTA) + predict_samples.
Measured held-out spike steering L1: single 0.62 / meank 0.53 / mode 0.34
(77% of held-out spike frames are bimodal: follow-history vs follow-route).
- finetuned.yaml: inject the readout into saved hparams at load, guarded on
_target_ == FlowPolicyObjective so regression/legacy ckpts are untouched.
Verified: synthetic bimodal unit test via the shared module; jq injection on
flow/regression/legacy hparams shapes (flow gets it, others untouched, default
single); e2e mini-predict mode-vs-single on a real ckpt (predictions differ,
finite, parquet written).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The K draws for meank/mode readouts ran as K sequential decoder.sample calls; batch them via repeat_interleave into a single (B*K) integration, like the fan/meank scripts. The encoder already ran once upstream (condition_tokens is sliced from the precomputed embedding); only the small flow decoder sees the K-fold batch, so readout wall-clock is ~flat instead of ~K-sequential. Re-verified e2e (K=16, 5 held-out batches): finite, differs from single-draw. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
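The batching trick above relies on the ordering that repeat_interleave produces: each condition row is repeated K times consecutively, so the flat output can be regrouped into K draws per frame by simple slicing. A plain-Python stand-in (no torch; helper names and the fake sampler are hypothetical):

```python
def repeat_interleave(rows, k):
    """Repeat each row k times in place: [a, b] -> [a, a, a, b, b, b]."""
    return [row for row in rows for _ in range(k)]


def group_draws(flat, k):
    """Undo the expansion: consecutive blocks of k belong to one frame."""
    return [flat[i:i + k] for i in range(0, len(flat), k)]


def fake_sample(batch):
    """Stand-in for one batched decoder.sample call over B*K rows."""
    return [f"draw({cond})" for cond in batch]


conditions = ["frame0", "frame1"]            # B = 2
expanded = repeat_interleave(conditions, 3)  # B * K = 6 rows, one integration
draws = group_draws(fake_sample(expanded), 3)
print(draws)
```

The key property is that `group_draws` and `repeat_interleave` agree on the layout; a tile-style repeat ([a, b, a, b, ...]) would need a different regrouping.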
…ut eval Three additions on the WTA readout: - consensus.py: refactor into split_modes + mode_aware_anchor(anchor= "mean"|"medoid"). Medoid = the actual draw closest to the dominant-cluster mean — guaranteed model sample (dynamically coherent by construction) vs the synthetic averaged chunk; trades a little variance for realism. Exposed as predict_readout=mode_medoid. Unit-verified: always a real draw, commits to the dominant mode. - flow_meank_eval.py: RESIDUAL DECOMPOSITION of the held-out spike error (steering, per spike frame): single / mean-of-K / mode (WTA) / mode-medoid / oracle-mode / best-draw. Splits the remaining error into selection regret (dominant minus oracle mode choice), within-mode residual (oracle), and coverage floor (best single draw) — each component has a different fix (selector vs training weight/data vs data/copycat). - per-drive breakdown + datamodule/yaak/predict_val_all (all 5 held-out val drives): generalization numbers no longer rest on a single val drive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
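A sketch of the medoid anchor and the three-way residual split described above, for single-channel steering chunks. It takes the dominant/minority index lists as given (for example from a largest-gap split), and all function names are hypothetical.

```python
def l1(a, b):
    """Mean absolute error between two chunks."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cluster_mean(draws, indices):
    horizon = len(draws[0])
    return [sum(draws[i][h] for i in indices) / len(indices)
            for h in range(horizon)]


def medoid_index(draws, indices):
    """Index of the real draw closest to the cluster mean."""
    center = cluster_mean(draws, indices)
    return min(indices, key=lambda i: l1(draws[i], center))


def residual_decomposition(draws, dominant, minority, target):
    """Split the mode-readout error into regret, residual, and floor."""
    mode_error = l1(cluster_mean(draws, dominant), target)
    clusters = [c for c in (dominant, minority) if c]
    oracle_mode_error = min(l1(cluster_mean(draws, c), target) for c in clusters)
    best_draw_error = min(l1(d, target) for d in draws)
    return {
        "selection_regret": mode_error - oracle_mode_error,  # wrong mode chosen
        "within_mode_residual": oracle_mode_error,           # best mode's error
        "coverage_floor": best_draw_error,                   # best single draw
    }


draws = [[-0.6, -0.6], [-0.5, -0.5], [-0.7, -0.7], [0.6, 0.6]]
dominant, minority = [0, 1, 2], [3]
print(medoid_index(draws, dominant))
print(residual_decomposition(draws, dominant, minority, target=[0.5, 0.5]))
```

In this toy frame the dominant cluster is the wrong mode, so almost all of the error is selection regret: the minority mode and the best draw are both close to the target. The medoid always returns an index into the original draws, so the anchor is a genuine model sample rather than an averaged chunk.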
…s diagnostic Route-turn signal (signed bearing change of the waypoint polyline, 1-dof calibration to steering) as the mode selector instead of draw-count mass. 5-val-drive result: wpt-mode 0.457 vs mass-WTA 0.341 spike steering chunk-L1 — WORSE. Root cause measured directly: corr(route-turn, GT chunk steering) is only -0.29 on spike frames / -0.13 overall, and per-waypoint lateral offsets are ~uncorrelated — far too weak to arbitrate modes ~0.5 apart. Notably the MODEL uses waypoints heavily (zeroing shifts predictions ~3x baseline error), so route info is real but a hand-crafted scalar extraction is too crude. Selection headroom (oracle-mode 0.26, best-draw 0.13 vs current 0.34) needs a learned ranker or probability-mass recalibration (history-dropout training). Also: route-corr diagnostics printed; wpt-mode/wpt-draw rows in the residual decomposition; per-drive wptM column. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Canonical tag taxonomy (applied retroactively to all 27 existing runs via the API, deduped against the older shorthand tags):
- scale: overfit | pilot-30 | corpus
- head: flow | regression-head
- levers: gauss-merge, lds, delta-loss, rope, waypoint-pe, hist-dropout, chunk-dct, ot-coupling, slot-adaln, beta-time, image-cond, mu-law, ema, unfrozen-encoder, linear-std, horizon-1
- outcome (post-hoc, set when logging to Notion): baseline | control | negative | null | superseded | legacy | crashed-early
Each experiment config now sets wandb.tags with its lever set, so future runs are tagged automatically at launch (wandb.init(**cfg.wandb) passes them through; child experiments override the parent's list wholesale).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t family Scoped to the flow/regression experiment configs (set alongside their tags block), so launches no longer need wandb.project=action-flow on the CLI. pretrain/finetune/finetune_overfit_baseline keep the shared default (rmind). Also fix a pre-existing duplicate drop_last key in datamodule/yaak/train.yaml that made every config using it (pretrain, finetune) fail to compose. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…0-30x faster The encoder is frozen in every flow finetune, so the 12 condition tokens per (drive, frame) are deterministic constants — yet the training loop recomputed them (JPEG decode -> DINOv3 ViT over 6 frames -> 8-layer encoder) every step of every epoch of every experiment, to feed a 3M-param decoder. Cache them once; train decoder-only. Measured: full-path pilot epoch 548s (solo GPU) / 1627s (shared); cached run = ~50s WALL for startup + cache load + train epoch + val epoch (with flow sampling) + ckpt, on a GPU shared with two live trainings. A 50-epoch pilot screen: ~7.6h -> ~15-20 min. Pilot cache = 1.6GB fp16 (88k frames). - flow_policy: compute_metrics_from(condition_tokens, target_actions) — the metrics body downstream of condition/target extraction, with its own non-finite guard. compute_metrics now delegates to it. - flow_cache_features.py: one encoder pass per split; saves BOTH condition variants (normal + action-history-zeroed) so ActionHistoryDropout-style training stays EXACT under caching (the frozen encoder is a function — the variant must be precomputed, not patched). Metadata records the source artifact + condition spec for cache validity. - CachedFeaturesDataset + FlowFeatureTrainer: plain DataLoader over the cache; slim LightningModule training FlowPolicyObjective via compute_metrics_from. hist_dropout_p selects the cached variant per sample. State-dict keys are objectives.policy.* (ControlTransformer-compatible for later stitching). - finetune_flow_pilot_lds_cached: the pilot-LDS recipe on the cached path (bs 256, schedule scaled; tag "cached"). For lever SCREENING — cached ckpts carry no encoder, so winners get confirmed on the full pipeline. Limitations: valid only while the encoder stays frozen and the conditioning spec is fixed; no predict from cached ckpts (stitch or retrain winners). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…led lr Move the active pilot A/Bs onto the feature-cached track (the full-path runs lafegos0/lybmf6k6 were killed and superseded — cached epochs are ~10-30x faster, and cached-vs-cached is the only valid comparison anyway since the schedule/batch differ from the full path): - RegressionPolicyObjective.compute_metrics_from (same refactor as flow); FlowFeatureTrainer accepts either objective. - finetune_pilot_regression_cached (standalone — the objective node must be replaced wholesale, not merged over the flow one) and finetune_flow_pilot_lds_histdrop_cached (p=0.5; exact under caching via the precomputed hist0 variant). - bs 256 -> 1024 to saturate the GPU with two concurrent runs, with the LINEAR-SCALING lr correction (1e-5 @ bs64 -> 1.6e-4 @ bs1024) and the 3500-step cosine. All three cached arms share lr/bs/schedule. Fleet: 7c3dh7bs (cached LDS control) + vfa4hrig (cached histdrop) concurrent on GPU0 (~89% util), cached regression chained after the control. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
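The linear-scaling correction quoted above is a one-line calculation; the sketch below reproduces it and pairs it with a plain cosine decay over the 3500 steps. The cosine shape (no warmup, decay to zero) is an assumption, since the commit does not spell out the schedule details, and the function names are hypothetical.

```python
import math


def linear_scaled_lr(base_lr, base_batch, batch):
    """Scale the learning rate in proportion to the batch size."""
    return base_lr * batch / base_batch


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr (assumed: no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


peak = linear_scaled_lr(1e-5, 64, 1024)  # 1e-5 * 16, i.e. about 1.6e-4
print(peak)
print(cosine_lr(1750, 3500, peak))       # halfway through: about half the peak
```

The batch grows 16x (64 to 1024), so the rate grows 16x as well, which gives the 1.6e-4 used by all three cached arms.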
Dropped with the ControlTransformer-specific callbacks when the cached configs got a minimal inline set, but this one works: FlowFeatureTrainer deliberately mirrors the objectives.policy.* key layout. Regression configs excluded (no position_embedding; slot_queries would need a separate select). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+meank.cache=<val.pt> +meank.cached_artifact=<model-run:vN> evaluates FlowFeatureTrainer checkpoints (which carry no encoder) by reading the condition tokens from the feature cache — the frozen encoder makes them checkpoint-independent — and loading only the objective weights. Same census/ decomposition/per-drive readouts; waypoint diagnostics degrade to their meanK fallback (waypoints aren't cached). Also makes repeated evals ~decoder-only fast and guards the wpt polyfit when no finite route signal exists. First use (cached fleet verdicts, 5 val drives, spike steering chunk-L1): - histdrop NULL with mechanism: more bimodality (20.6% vs 17.0%), better coverage (0.144 vs 0.150) and oracle (0.259 vs 0.280), but selection regret doubled (0.084 vs 0.050) -> improves proposals, not the choice. - regression held-out: wins aggregate/flats; spike 0.375 ~= flow single-draw, loses to flow meanK (0.326) and far above flow's coverage floor (0.150). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- reports/LOOP_INSTRUCTIONS.md: the agreed contract for the >=30-experiment autonomous loop (protocol, pre-registered decision rules, experiment tree, stack-on-win policy, mandatory learned-ranker phase, edge handling, launch pattern gotchas). - flow_report.py: self-contained HTML experiment report (approved format) — run cards with verdict chips, key-metric table with best-per-column highlighting, interactive pred-vs-GT overlays per drive (legend-toggle run comparison), maneuver-zoom small multiples. Manifest-driven, cumulative. - flow_cached_predict.py: reye/DataFramePredictionWriter-schema parquets from cached (encoder-less) checkpoints; readouts single/meank/mode/mode_medoid + deterministic for regression heads. - flow_cache_features.py: cache schema v2 — full frame_idx/time_stamp arrays (report/reye need timestamps); caches rebuilt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Controls (4 seeds) spike-meanK: .3289/.3555/.3237/.3387 -> cross-seed sigma=0.014. lr 0.5x/2x both null (paired deltas -0.001/+0.003) -> keep 1.6e-4. Key finding: same-seed paired deltas (~0.003) are 5x tighter than cross-seed sigma -> loop switches to paired design (levers at seed 1001 vs exp01), keep bar = paired delta > 0.010, second-seed confirm before stacking. Evals now seeded (same ckpt -> same numbers). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
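The cross-seed sigma above can be reproduced from the four control values with the sample standard deviation (n - 1 denominator). The keep-bar helper is a hypothetical illustration of the paired rule, assuming lower spike-meanK is better and the delta is measured as control minus lever at the same seed.

```python
from statistics import mean, stdev

controls = [0.3289, 0.3555, 0.3237, 0.3387]  # spike-meanK, 4 control seeds

cross_seed_sigma = stdev(controls)           # sample std, n - 1 denominator
print(round(mean(controls), 4), round(cross_seed_sigma, 3))


def keep_lever(control_same_seed, lever, bar=0.010):
    """Paired rule: keep only if the same-seed improvement exceeds the bar."""
    return (control_same_seed - lever) > bar


print(keep_lever(0.3555, 0.3545))  # 0.001 improvement: within paired noise
print(keep_lever(0.3555, 0.3400))  # 0.0155 improvement: clears the bar
```

The four controls average 0.3367 with a sample standard deviation of about 0.014, matching the figure in the commit. A paired delta of roughly 0.003 is then about one fifth of that spread, which is the motivation for comparing levers against a same-seed control instead of the cross-seed panel.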
Per-channel alpha in ManeuverLossWeights (tuple per model channel; enables steering-only / gas-weighted LDS; unit-verified). Loop verdicts: LDS kept at 0.5 (its held-out value = gas-maneuver protection); DELTA-OFF is the biggest win so far (spike-meanK 0.327 vs 0.356 paired, dose-response monotonic across 0/10/50) - second-seed confirm + gas-recovery composition in flight. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase D of the experiment loop: a 1.4M-param scorer s(cond, draw) trained listwise (softmax-CE vs soft targets softmax(-L1/tau), spread-weighted) on K=32 frozen-flow draw banks, evaluated as a top-M selection readout against meanK / oracle-draw / oracle-mode references. Also: LOOP_INSTRUCTIONS decision rules re-amended after exp12 — delta-off's -0.028 did not replicate (lever x seed interaction >> paired noise); stack bar is now a two-seed lever mean beating the control panel mean by >0.015. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
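The listwise objective above can be written out for a single frame as cross-entropy between the scorer's softmax over the K draws and soft targets softmax(-L1/tau). The sketch uses plain Python lists instead of tensors; reading "spread-weighted" as multiplying the loss by the max-minus-min L1 of the draw bank is an assumption, and the function names are hypothetical.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over one list of logits."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def soft_targets(draw_l1, tau):
    """Target distribution over draws: lower L1 gets more mass."""
    return [math.exp(v) for v in log_softmax([-e / tau for e in draw_l1])]


def listwise_loss(scores, draw_l1, tau=0.02):
    """Softmax cross-entropy against soft targets, weighted by L1 spread."""
    targets = soft_targets(draw_l1, tau)
    log_probs = log_softmax(scores)
    ce = -sum(t * lp for t, lp in zip(targets, log_probs))
    spread = max(draw_l1) - min(draw_l1)  # assumed form of spread weighting
    return spread * ce


draw_l1 = [0.10, 0.30, 0.50]                       # per-draw L1 against GT
print(listwise_loss([5.0, 0.0, -5.0], draw_l1))    # ranks the best draw first
print(listwise_loss([-5.0, 0.0, 5.0], draw_l1))    # ranks it last
```

A small tau sharpens the targets toward the single best draw, while a large tau spreads credit across near-ties; that trade-off is one plausible reading of the U-shaped tau dose-response reported in the following commit. Weighting by spread means frames where all draws are equally good contribute little gradient.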
Ranker iteration results (val banks seed 0/1, control ckpt bj8r0jyp): - listwise w512/64ep/tau=0.02 + softmax t=4 readout: spike steering chunk-L1 0.2963 two-bank mean vs meanK 0.3247 (-0.028), gas/flat/overall neutral or better — first confirmed win of the loop, pure inference-time. - tau dose-response U-shaped (0.1: 0.306, 0.05: 0.300, 0.02: 0.296, 0.01: 0.312, 0.005: 0.304); ep256 overfits the train bank. - set-context features HURT spikes (0.3225) — shortcut toward meanK-mimicry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…cached predict Robustness arc (exp24-30): K=64 and 3-decoder-ensemble (K=96) clouds do NOT help — selection difficulty grows faster than coverage (median oracle-rank of pick 12/32 -> 26/64 -> 39/96). Gas-aware labels dilute both axes. Hard mode commitment loses to soft weighting despite 65-69% oracle-mode agreement. Per-checkpoint rankers: aggregate/flat L1 win replicates on all 3 control checkpoints (~5-8%); the spike win is checkpoint-dependent (-0.028 on the strongest decoder, +0.017 on the weakest). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ort manifest Phase E: fixes the original pilot's 20-drives-Niro101 bias; disjoint from the 5-drive val set. Action-transform/LDS stats intentionally kept from the original pilot so val metrics stay comparable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
exp34: sampling-noise temperature sweep on bj8r0jyp. Coverage ceilings improve monotonically (oracle draw 0.1437 -> 0.1280 -> 0.1215; oracle mode 0.2718 -> 0.2638 -> 0.2616) but the ranker readout degrades in lockstep (0.2985 -> 0.3239 -> 0.3702 at t=4): widened clouds add distractors faster than the scorer can exploit headroom. Selection discrimination is the binding constraint, not coverage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Zero-parameter selection via receding-horizon coherence (cf. backward coherence, Liu et al. 2024). Result: clean negative with a sharp diagnosis — committed-anchored voting is worse than meanK at every temperature (error propagation; the anchor is wrong exactly at maneuver onset), confidence- gating collapses bitwise to meanK (the gate only opens where voting is useless), and the near-oracle GT-anchored bound (0.156) is quasi-tautological (previous GT chunk shares 5/6 slots with current GT). Selection-side levers are now exhausted: the mode-disambiguation information must come from conditioning, not the readout. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… only 37 experiments closed Phases B-E; everything the loop rejected comes out:
- beta flow-time sampling (exp11 drop); logit-normal/uniform stay
- waypoint PE (null on overfit, never stacked)
- per-channel LDS alpha (exp09/13 negative); scalar 0.5 is the recipe
- action-history dropout: callback, trainer knob, cond_hist0 cache variant (metric-null at p=0.5; halves cache build cost and size)
- histdrop/wpe/slot-adaln experiment configs (slot-AdaLN is in the decoder proper since 15f7a80)
- pre-loop wip scripts: flow_oracle, flow_oracle_generate, flow_rollout_eval (findings preserved in Notion)
- flow_ranker: canonical recipe only (w512/64ep/tau0.02, steering label); set-context, label-weights, temporal-voting variants removed (documented negatives)
Kept: chunk_delta_weight=10 (gas-spike protection), mode/medoid predict readouts (used via just predict), consensus.py + eval diagnostics, the ranker's draws/train/eval stages with the noise_scale coverage knob.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- embeddings_unpacked -> embeddings_flattened (tensordict/Episode drift) - policy_finetune lr interpolations get oc.select defaults so the model config composes standalone (experiments still override) - drop tests of removed code (flow_oracle trace, beta time sampling) 134 passed, 1 skipped. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Full-path anchor (exp33/jdt7mexr) confirms the cached screening track; final synthesis in the Notion dev log. Recipe: control unchanged + meanK readout. Next direction: conditioning (route-vs-history disambiguation). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
No description provided.