docs(eval): add NEL v0.3.0 migration guide + example configs#1845
docs(eval): add NEL v0.3.0 migration guide + example configs#1845hychiang-git wants to merge 7 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds two NEL 0.3.x recipe examples and a migration guide covering config structure, CLI flow, backend routing, sharding, and MLflow export behavior. ChangesNEL 0.3.x Recipe Examples and Migration Guide
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
67af6b0 to
e8ac728
Compare
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.agents/skills/evaluation/recipes/examples/r030_container.yaml:
- Around line 46-48: The inline note for the container example is too broad and
misleading because mmlu_pro_aa_v3 does have a native 0.3.x equivalent. Update
the comment on the mmlu_pro_aa_v3 entry to either use a truly legacy-only task
example or rephrase it to say there is no native benchmark of the same name,
keeping the guidance aligned with the container solver example.
In @.agents/skills/evaluation/recipes/examples/r030_gym.yaml:
- Around line 8-11: The prerequisite instructions for starting the resource
server are inconsistent because the `gym env start` command in the
`r030_gym.yaml` example is shown in the foreground even though the next `curl`
step assumes it is running in the background. Update the example around the
resource-server startup command so it is actually backgrounded (or otherwise
clearly detached) before the `curl` lookup, keeping the `RAY_TMPDIR` and
`server_instances` steps in the same flow.
- Around line 34-45: The gym benchmark URI currently uses 127.0.0.1, which only
reaches a local loopback server and will point to the worker itself under SLURM.
Update the recipe around the gym:// URI to use a host/IP that is reachable from
the worker nodes, and adjust the surrounding comment near the cluster
configuration to avoid implying that only swapping the cluster block is
sufficient. Refer to the gym:// source entry and the cluster block in this
example when making the fix.
In @.agents/skills/evaluation/references/nel-v0.3.0-migration.md:
- Around line 370-383: The MLflow export example in the migration guide does not
match the documented `nel export` CLI shape because it passes multiple benchmark
bundles in one call. Update the example near the `nel eval merge`/`nel export`
recipe to either show one bundle per `nel export` invocation or explicitly state
that `nel export` accepts multiple positional bundles, matching the earlier CLI
usage and the `nel export` command documentation.
- Around line 73-81: The markdown table/example blocks in
nel-v0.3.0-migration.md use bare fenced code blocks that trip markdownlint
MD040. Update the fenced blocks in the referenced sections to include a language
tag such as text, keeping the content unchanged; this applies to the example
blocks around the dataset/backend table and the other bare-fence occurrences
noted in the review.
- Around line 55-61: Fix the remaining markdownlint issues in this migration
note by removing the extra double space after the “2.” list item in the numbered
options section and ensuring the document ends with a trailing newline. Update
the affected markdown list content in the referenced eval-factory migration
guidance so it stays lint-clean.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 90162b89-d43f-4514-ba35-ac5d0223821f
📒 Files selected for processing (4)
.agents/skills/evaluation/recipes/examples/r030_container.yaml.agents/skills/evaluation/recipes/examples/r030_example_eval.yaml.agents/skills/evaluation/recipes/examples/r030_gym.yaml.agents/skills/evaluation/references/nel-v0.3.0-migration.md
| # The `container` solver runs the v0.2.x eval-factory task by name. Use the v1 `*_aa_v3` task id. | ||
| - name: mmlu_pro_aa_v3 # v0.2.x eval-factory container task (no native port) | ||
| solver: {type: container, service: model} |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
The inline note overstates the legacy-only requirement.
mmlu_pro_aa_v3 does have a native 0.3.x equivalent (mmlu_pro) per the migration guide, so the “no native port” note is misleading here. Either use a truly legacy-only example task or narrow the comment to “no native benchmark of the same name.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.agents/skills/evaluation/recipes/examples/r030_container.yaml around lines
46 - 48, The inline note for the container example is too broad and misleading
because mmlu_pro_aa_v3 does have a native 0.3.x equivalent. Update the comment
on the mmlu_pro_aa_v3 entry to either use a truly legacy-only task example or
rephrase it to say there is no native benchmark of the same name, keeping the
guidance aligned with the container solver example.
| - name: "gym://127.0.0.1:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl" | ||
| repeats: 1 | ||
| max_concurrent: 5 | ||
| # simple solver = single-shot policy answer scored by the server (mcqa/ifbench-style). For agentic | ||
| # gym envs that need tool use, switch to `type: tool_calling` + a `sandbox:`. | ||
| solver: {type: simple, service: model} | ||
|
|
||
| cluster: | ||
| # `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a | ||
| # GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block | ||
| # (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does. | ||
| type: local |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Call out that 127.0.0.1 is local-only.
The note on Lines 42-45 suggests that users can move this recipe to SLURM by swapping the cluster block, but the benchmark URI on Line 34 still points at loopback. On a worker node that resolves to the worker itself, not the separately started Gym server.
Suggested fix
- - name: "gym://127.0.0.1:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl"
+ - name: "gym://<gym-server-host>:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl"- # (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does.
+ # (see r030_example_eval.yaml). If you move off `local`, replace `127.0.0.1` above with a host that
+ # is reachable from the worker node(s). NOTE: gym:// supports cluster.shards the same way native does.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: "gym://127.0.0.1:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl" | |
| repeats: 1 | |
| max_concurrent: 5 | |
| # simple solver = single-shot policy answer scored by the server (mcqa/ifbench-style). For agentic | |
| # gym envs that need tool use, switch to `type: tool_calling` + a `sandbox:`. | |
| solver: {type: simple, service: model} | |
| cluster: | |
| # `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a | |
| # GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block | |
| # (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does. | |
| type: local | |
| - name: "gym://<gym-server-host>:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl" | |
| repeats: 1 | |
| max_concurrent: 5 | |
| # simple solver = single-shot policy answer scored by the server (mcqa/ifbench-style). For agentic | |
| # gym envs that need tool use, switch to `type: tool_calling` + a `sandbox:`. | |
| solver: {type: simple, service: model} | |
| cluster: | |
| # `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a | |
| # GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block | |
| # (see r030_example_eval.yaml). If you move off `local`, replace `127.0.0.1` above with a host that | |
| # is reachable from the worker node(s). NOTE: gym:// supports cluster.shards the same way native does. | |
| type: local |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.agents/skills/evaluation/recipes/examples/r030_gym.yaml around lines 34 -
45, The gym benchmark URI currently uses 127.0.0.1, which only reaches a local
loopback server and will point to the worker itself under SLURM. Update the
recipe around the gym:// URI to use a host/IP that is reachable from the worker
nodes, and adjust the surrounding comment near the cluster configuration to
avoid implying that only swapping the cluster block is sufficient. Refer to the
gym:// source entry and the cluster block in this example when making the fix.
8fd7a9a to
adaabab
Compare
|
/claude review |
| # WITH-TOOLS variant of a native benchmark: SAME backend/name, just a tool_calling solver + a local | ||
| # sandbox — NO gym server involved. Use in place of the simple `gpqa` entry above to give it tools. | ||
| # - name: gpqa # still BUILT-IN | ||
| # repeats: 8 | ||
| # max_concurrent: 64 | ||
| # solver: {type: tool_calling, service: model, sandbox_tools: true, max_turns: 100} | ||
| # sandbox: {type: local, concurrency: 64} |
There was a problem hiding this comment.
[SUGGESTION] This commented-out tool_calling variant would silently fail to use tools as written. The migration guide itself (nel-v0.3.0-migration.md, §"The solver: block") states tool calling needs "a model served with tool calling enabled (vLLM: --enable-auto-tool-choice + --tool-call-parser)", but the services.model.extra_args block above (lines 30-38) doesn't include those flags. A user who simply uncomments this variant will get a model that never emits tool calls (turns exhaust → failed episodes).
Consider adding a companion note here pointing back to the required extra_args, e.g.:
# WITH-TOOLS variant ... — also add to services.model.extra_args:
# - "--enable-auto-tool-choice"
# - "--tool-call-parser=<your-tool-parser>"
# - name: gpqa # still BUILT-IN
# ...Non-blocking — the doc covers it, but co-locating the requirement with the commented snippet prevents a foot-gun.
| @@ -0,0 +1,109 @@ | |||
| # NEL 0.3.x (engine) — SIMPLE / no-tools, SLURM-sharded. | |||
| # Built-in + skills math/multichoice benchmarks for a reasoning checkpoint. | |||
| # Fill every <...> placeholder. See SKILL.md "NEL r0.3.0 track" + references/nel-v0.3.0-migration.md. | |||
There was a problem hiding this comment.
[SUGGESTION] This points users to "SKILL.md 'NEL r0.3.0 track'", but .agents/skills/evaluation/SKILL.md has no such section — it is still entirely the 0.2.x flow (nel run, nel skills build-config, nel ls tasks). A reader who follows this pointer finds nothing. Either drop the SKILL.md "NEL r0.3.0 track" reference and keep just references/nel-v0.3.0-migration.md, or add the track to SKILL.md in a follow-up so the reference resolves. Non-blocking, but a dangling cross-reference is worse than none.
| - name: gym://127.0.0.1:<port>?protocol=native&data=<tasks.jsonl> | ||
| solver: {type: tool_calling, service: model, sandbox_tools: true, max_turns: 100} | ||
| sandbox: {type: local, concurrency: 64} | ||
|
|
||
| cluster: | ||
| shards: 8 # supported, same as native (the gym scorer is CPU; only the policy needs a GPU) | ||
| # ... + type/account/walltime/eval_image/node_pools | ||
| ``` | ||
|
|
||
| You must start the Gym resource server first and point `data=` at its task file — full setup (server | ||
| launch, protocol, integration) is in the [NeMo-Gym repo](https://github.com/NVIDIA-NeMo/Gym). `cluster.shards:` works |
There was a problem hiding this comment.
[IMPORTANT Compatibility] This snippet pairs a loopback gym://127.0.0.1:<port> URI with cluster.shards: 8 and asserts "cluster.shards works the same way as native." That combination silently breaks: under SLURM sharding the N workers each run on a different compute node, so 127.0.0.1 resolves to the worker itself, not the node where the single Gym resource server was started. The shards can't reach /verify and the run fails (or every reward comes back empty).
Native (built-in/skills://) sharding is genuinely shards-transparent because each worker serves its own model copy locally. gym:// is not symmetric: there's one shared resource server, and a sharded run needs all workers to reach it at a routable host/IP (the server node's hostname/IP, not loopback), plus that server able to handle the aggregate concurrency.
Suggest either:
- changing the sharded example to a routable host (
gym://<gym-server-host>:<port>?...) and adding a one-line caveat that the resource server must be reachable from all worker nodes, or - noting that sharded
gym://requires the server on a node reachable from the workers (loopback only works for single-node/type: local).
The single-node r030_gym.yaml (type: local) is fine with 127.0.0.1 — this only concerns the sharded path the guide documents here.
| # `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a | ||
| # GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block | ||
| # (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does. |
There was a problem hiding this comment.
[IMPORTANT Compatibility] This comment tells the user to "swap to a slurm block" and that "gym:// supports cluster.shards the same way native does", while the benchmark URI above (line 34) is gym://127.0.0.1:<PORT>. Following both literally produces a broken sharded run: under SLURM each shard worker is a separate node, so 127.0.0.1 points at the worker itself rather than the node running the single Gym resource server, and the shards can't reach /verify.
Add a caveat that when switching to SLURM + shards, the gym:// host must be a routable hostname/IP of the resource-server node (not loopback), and that the one server must handle the aggregate concurrency. (Same point as the migration-doc gym section.)
There was a problem hiding this comment.
Claude Review Summary
Docs-only PR: a NEL 0.2.x to 0.3.x migration guide plus two example recipe YAMLs (3 files, +583). No Python/code changes, so no algorithm/mode-state/export-binary risk — review focused on internal consistency, runnability of the example configs, and cross-references. All 3 changed files reviewed.
Findings by severity — CRITICAL: 0, IMPORTANT: 2, SUGGESTION: 2
IMPORTANT (should fix)
gym://loopback +cluster.shardsis a broken combination (raised in bothnel-v0.3.0-migration.mdgym section andr030_gym.yaml). The guide pairsgym://127.0.0.1:<port>withshards: 8and says sharding works the same way as native. It does not: native shards each serve their own local model copy, butgym://has ONE shared resource server. Under SLURM each shard runs on a different node, so127.0.0.1resolves to the worker itself and the shards cannot reach the server's/verify. The sharded gym path needs a routable host/IP plus a server sized for the aggregate concurrency. The single-nodetype: localexample is fine as-is.
SUGGESTION (non-blocking)
- Commented
tool_callingvariant inr030_example_eval.yamldoes not mention the--enable-auto-tool-choice/--tool-call-parserextra_argsit requires (the guide notes them elsewhere; co-locating prevents a foot-gun). - Dangling cross-reference:
r030_example_eval.yamlheader points to SKILL.md 'NEL r0.3.0 track', but SKILL.md has no such section (still entirely 0.2.x). Drop the reference or add the track in a follow-up.
Note on prior CodeRabbit findings
CodeRabbit referenced a r030_container.yaml file that is NOT present in the current PR (only r030_example_eval.yaml and r030_gym.yaml exist), so its container-example finding is stale. Its markdownlint nits (MD040 bare fences, trailing-newline, double-space) and the nel export multi-bundle CLI-shape inconsistency (single bundle at L257 vs multiple at L415) are valid doc nits I am intentionally not duplicating.
Overall
Low risk — documentation only, backward compatible, no runtime surface. The two IMPORTANT items are the same gym-sharding correctness gap surfaced in two files; worth fixing before merge so users do not author a sharded gym run that silently fails to score. Nothing here is a hard blocker on the code itself.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1845 +/- ##
==========================================
- Coverage 68.70% 68.69% -0.02%
==========================================
Files 515 515
Lines 57118 57118
==========================================
- Hits 39245 39239 -6
- Misses 17873 17879 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
e5cd3d0 to
5be14f0
Compare
|
Thanks @claude and @coderabbitai — addressed in Fixed
Not changed (with reason)
|
|
Claude finished @hychiang-git's task in 1m 24s —— View job Re-review of
|
|
Tip For best results, initiate chat on the files or code changes.
🐇👍 ✏️ Learnings added
|
8fddfc2 to
7ad76df
Compare
|
/claude review |
| # nel eval run r030_example_eval.yaml | ||
| # # after all shards finish: | ||
| # nel eval merge <output.dir>/<run-id> | ||
| # # then export from the LOGIN node (SLURM does NOT auto-export — see migration doc §5): |
There was a problem hiding this comment.
[SUGGESTION] This comment points at "migration doc §5", and lines 38/52 point at "§6" (plus a "3-knob concurrency note"), but nel-v0.3.0-migration.md uses unnumbered ## headings — there is no §5/§6 and no section literally titled "3-knob concurrency note". A reader following these references hits a dead end.
Why it matters: these example configs are the on-ramp to the guide; dangling section pointers undercut their usefulness.
Suggestion: reference the actual heading names instead (e.g. "see Export results to MLFlow" for the export note, and "see Single-node vs sharded / the concurrency note" for the --max-num-seqs / request_timeout knobs), or add explicit §-style numbering to the migration doc headings so these anchors resolve.
|
|
||
| output: | ||
| dir: /abs/path/to/rundir # MUST be absolute | ||
| # export is run manually after the job (see r030_example_eval.yaml / migration doc "Export to MLFlow"): |
There was a problem hiding this comment.
[SUGGESTION] Minor: this points at migration doc "Export to MLFlow", but the actual heading is "Export results to MLFlow". Worth aligning the title exactly so the cross-reference is unambiguous.
There was a problem hiding this comment.
Claude review passed — no blocking issues found. LGTM
Scope: This is a docs-only PR (660 additions, 3 new files, all under .agents/skills/evaluation/). No modelopt/ or examples/ code is touched, so the ModelOpt-specific categories (algorithm correctness, mode/state composability, export compatibility, backward compatibility) don't apply. I reviewed the migration guide and the two example YAMLs for internal consistency and correctness of the documented CLI/config flow.
Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 2
- The example YAMLs reference numbered sections (
§5,§6) and a "3-knob concurrency note" that don't exist as such in the migration doc (it uses unnumbered headings) — dangling cross-references. r030_gym.yamlrefers to "Export to MLFlow" while the doc heading is "Export results to MLFlow".
Both are non-blocking. The guide itself is thorough, internally consistent on the substance (config-block mapping, sharding semantics, gym loopback-vs-routable-host caveat, manual SLURM export flow), and clearly labels the unverified skills:///gym:// matrix columns as a routing guide rather than a guarantee.
Risk: low — documentation and placeholder-only example configs; no runtime behavior changes.
| # nel eval run r030_example_eval.yaml | ||
| # # after all shards finish: | ||
| # nel eval merge <output.dir>/<run-id> | ||
| # # then export from the LOGIN node (SLURM does NOT auto-export — see migration doc §5): |
There was a problem hiding this comment.
[SUGGESTION] This comment points at "migration doc §5", and lines 38/52 point at "§6" (plus a "3-knob concurrency note"), but nel-v0.3.0-migration.md uses unnumbered ## headings — there is no §5/§6 and no section literally titled "3-knob concurrency note". A reader following these references will hit a dead end.
Why it matters: these example configs are the on-ramp to the guide; dangling section pointers undercut their usefulness.
Suggestion: reference the actual heading names instead (e.g. "see Export results to MLFlow" for the export note, and "see Single-node vs sharded / the concurrency note" for the --max-num-seqs / request_timeout knobs), or add explicit §-style numbering to the migration doc headings so these anchors resolve.
|
|
||
| output: | ||
| dir: /abs/path/to/rundir # MUST be absolute | ||
| # export is run manually after the job (see r030_example_eval.yaml / migration doc "Export to MLFlow"): |
There was a problem hiding this comment.
[SUGGESTION] Minor: this points at migration doc "Export to MLFlow", but the actual heading is "Export results to MLFlow". Worth aligning the title exactly so the cross-reference is unambiguous.
7ad76df to
abfc9be
Compare
Add a guide for migrating checkpoint evaluation from the NEL 0.2.x launcher (nemo_evaluator_launcher) to the 0.3.x engine (nemo_evaluator / `nel eval run`): - references/nel-v0.3.0-migration.md: 0.2.x->0.3.x config mapping, backend support matrix (built-in / skills:// / gym://), per-backend launch patterns, single-node vs sharded runs, and the MLflow export flow. - recipes/examples/r030_example_eval.yaml: built-in + skills:// native example. - recipes/examples/r030_gym.yaml: gym:// (NeMo-Gym resource server + reward). Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
abfc9be to
7d3bdbb
Compare
|
/claude review |
There was a problem hiding this comment.
Claude review passed — no blocking issues found. LGTM
Scope: Docs-only PR (3 new files, +759, all under .agents/skills/evaluation/): the NEL v0.3.0 migration guide and two example recipe YAMLs. No modelopt/ or examples/ code is touched, so the ModelOpt-specific categories (algorithm correctness, mode/state composability, export compatibility, backward compatibility) don't apply. All 3 changed files reviewed in full.
Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 0
I traced the substance end-to-end — the 0.2.x→0.3.x config-block mapping, backend-selection-by-prefix, sharding semantics (repeats not divided, the problem set is), the gym two-job SLURM pattern, and the manual merge→export flow — and it is internally consistent. Unverified material is clearly labeled (the skills:///gym:// matrix columns as a routing guide, multimodal/audio as "not validated here").
Prior findings — all resolved/confirmed:
- The
gym://loopback +cluster.shardscorrectness gap (the only real concern from earlier rounds) is fixed across all three places:r030_gym.yamluses validated single-nodecluster: {type: local}with the routable-host caveat, and the migration doc (Single-node vs sharded, plus the gym Caveats) states the limitation consistently. gym env startis backgrounded with dynamic port discovery via/server_instancesdocumented.- The
§4/§5/§7cross-references now resolve against the numbered#### 1–#### 7subsections. nel export <bundle> [<bundle> ...]is harmonized between Basic Usage and the MLflow example.- markdownlint nits (MD040 bare fences, trailing newline) addressed.
- CodeRabbit's
r030_container.yamlfinding remains correctly out of scope (that file isn't in this PR).
Risk: low — documentation and placeholder-only example configs; no runtime behavior changes, backward compatible.
…al resume
- §7: how to enable reasoning (`chat_template_kwargs: {enable_thinking: true}` under
services.model.proxy.extra_body) — a model setting, applies to native runs too.
- §7: clarify that cluster:{type: local} has no auto_resume (slurm-only); recover a
killed gym run by re-submitting with `nel eval run --resume` (progress is checkpointed).
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l --resume
Clarify in the native section that built-in/skills runs use cluster:{type: slurm}
(auto_resume: true) so NEL chains a successor job on a wall-clock kill — no manual
step — in contrast to gym://'s cluster:{type: local} (manual `nel eval run --resume`).
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Explain that a NeMo-Gym bundled config (resources_servers/<name>/configs/<name>.yaml) packs multiple server instances: the resources_servers grader, plus a sample responses_api_agents agent that references an undefined responses_api_models. Add a component table (linked to upstream ifbench.yaml) and a keep/drop rule: NEL native needs only the grader; the agent is the gym-driven Responses-API path. Note that --resources-server / --config <name>.yaml start the whole file and fail on the undefined model, so point --config at a grader-only config. Align §6 sbatch to match. Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Align the §6 sbatch grader command with §2 (same --config resources_servers/<name>/configs/<name>.yaml line and the "&" = background / keep resources_servers, drop the agent comment) so both code blocks are consistent. Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite the gym "Resume the jobs" section into a clean two-step procedure
(relaunch the serve, then resume the eval with
--export=ALL,RESUME=--resume) and thread ${RESUME} into the §6 conductor
line so the command works. Add that completed rollouts survive the resume
even though the new serve node changes the model URL: NEL keys the skip on
its verified log, which is retained on a config-hash change (only the
un-scored inference cache is dropped). Tighten the native auto-resume
paragraph and fix its cross-reference, and clarify the single-node/local
sharding note in the §4 config snippet.
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xample Add a reminder in "Export results to MLflow" that downstream MLflow dashboards key/group runs by the `model` (and `checkpoint_path`) tag: a run exported without them can't be attributed (shows as an orphan/"no checkpoint" row), and a drifting `model` value splits one logical row. Include a concrete `output.export_config.mlflow.tags` YAML example. Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What does this PR do?
Type of change: documentation
Adds a guide for migrating checkpoint evaluation from the NEL 0.2.x launcher
(
nemo_evaluator_launcher) to the 0.3.x engine (nemo_evaluator/nel eval run),plus two runnable example configs covering the native and gym backends.
references/nel-v0.3.0-migration.md— 0.2.x→0.3.x config mapping, the backendsupport matrix (built-in /
skills:///gym://), per-backend launch patterns,single-node vs sharded runs, and the MLflow export flow.
recipes/examples/r030_example_eval.yaml— built-in +skills://(native, no-tools).recipes/examples/r030_gym.yaml—gym://(NeMo-Gym resource server + reward).Usage
Testing
Docs + example configs only (no code change). The built-in benchmark column of the
matrix was verified against the engine (
nel list -s builtin); theskills:///gym://columns are compiled from the upstream repos and labelled as a routing guide in the doc.
The example YAMLs are placeholder-only (
<...>) and parse as valid YAML.Before your PR is "Ready for review"
CONTRIBUTING.md: N/A/claude review)Additional Information
N/A
Summary by CodeRabbit
skills://vsgym://), sharding and merge behavior, and MLflow export steps.