Skip to content

docs(eval): add NEL v0.3.0 migration guide + example configs#1845

Open
hychiang-git wants to merge 7 commits into
mainfrom
hungyuehc/nel-030-migration
Open

docs(eval): add NEL v0.3.0 migration guide + example configs#1845
hychiang-git wants to merge 7 commits into
mainfrom
hungyuehc/nel-030-migration

Conversation

@hychiang-git

@hychiang-git hychiang-git commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: documentation

Adds a guide for migrating checkpoint evaluation from the NEL 0.2.x launcher
(nemo_evaluator_launcher) to the 0.3.x engine (nemo_evaluator / nel eval run),
plus two runnable example configs covering the native and gym backends.

  • references/nel-v0.3.0-migration.md — 0.2.x→0.3.x config mapping, the backend
    support matrix (built-in / skills:// / gym://), per-backend launch patterns,
    single-node vs sharded runs, and the MLflow export flow.
  • recipes/examples/r030_example_eval.yaml — built-in + skills:// (native, no-tools).
  • recipes/examples/r030_gym.yamlgym:// (NeMo-Gym resource server + reward).

Usage

# Run a native built-in + skills:// eval set, then merge shards and export to MLflow
nel eval run recipes/examples/r030_example_eval.yaml
nel eval merge <output.dir>/<run-id>
nel export <output.dir>/<run-id>/<bench> --dest mlflow -o tracking_uri=<uri> -o experiment_name=<exp>

Testing

Docs + example configs only (no code change). The built-in benchmark column of the
matrix was verified against the engine (nel list -s builtin); the skills:// / gym://
columns are compiled from the upstream repos and labelled as a routing guide in the doc.
The example YAMLs are placeholder-only (<...>) and parse as valid YAML.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (documentation + new example files only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (documentation)
  • Did you update Changelog?: N/A (documentation)
  • Did you get Claude approval on this PR?: ❌ (will run /claude review)

Additional Information

N/A

Summary by CodeRabbit

  • New Features
    • Added new NEL 0.3.x evaluation recipe examples, including a SIMPLE SLURM-sharded workflow (vLLM long-context/reasoning settings, benchmark tasks, and MLflow export configuration) and a gym-based local evaluation recipe.
  • Documentation
    • Added a “NEL v0.3.0 Migration” guide covering updated config/CLI usage, benchmark routing guidance (skills:// vs gym://), sharding and merge behavior, and MLflow export steps.

@hychiang-git hychiang-git requested a review from a team as a code owner June 28, 2026 02:40
@copy-pr-bot

copy-pr-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds two NEL 0.3.x recipe examples and a migration guide covering config structure, CLI flow, backend routing, sharding, and MLflow export behavior.

Changes

NEL 0.3.x Recipe Examples and Migration Guide

Layer / File(s) Summary
Migration schema and backend support
.agents/skills/evaluation/references/nel-v0.3.0-migration.md
Introduces the migration guide, config schema changes, backward-support notes, and the backend support matrix with benchmark routing and fidelity notes.
Migration CLI and execution flow
.agents/skills/evaluation/references/nel-v0.3.0-migration.md
Documents the new CLI commands, benchmark selection guidance, single-node and sharded execution semantics, and the unsupported legacy container sharding path.
Migration how-tos and export flow
.agents/skills/evaluation/references/nel-v0.3.0-migration.md
Adds backend-specific configuration examples and the post-run merge and MLflow export flow, including skills path naming.
SLURM-sharded evaluation recipe
.agents/skills/evaluation/recipes/examples/r030_example_eval.yaml
Adds a vLLM-backed SLURM recipe with benchmark entries, fixed parallelism, HF cache wiring, and output/export settings.
Gym-based local recipe
.agents/skills/evaluation/recipes/examples/r030_gym.yaml
Adds a gym evaluation recipe with an API-hosted model, gym URI benchmark configuration, local cluster mode, and output directory settings.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • chadvoegele
  • kevalmorabia97
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main docs change: a NEL v0.3.0 migration guide plus example configs.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PR only adds docs and YAML examples; no Python or dependency files changed, so none of the listed security anti-patterns are present.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hungyuehc/nel-030-migration

Comment @coderabbitai help to get the list of available commands.

@hychiang-git hychiang-git force-pushed the hungyuehc/nel-030-migration branch from 67af6b0 to e8ac728 Compare June 28, 2026 02:45

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/skills/evaluation/recipes/examples/r030_container.yaml:
- Around line 46-48: The inline note for the container example is too broad and
misleading because mmlu_pro_aa_v3 does have a native 0.3.x equivalent. Update
the comment on the mmlu_pro_aa_v3 entry to either use a truly legacy-only task
example or rephrase it to say there is no native benchmark of the same name,
keeping the guidance aligned with the container solver example.

In @.agents/skills/evaluation/recipes/examples/r030_gym.yaml:
- Around line 8-11: The prerequisite instructions for starting the resource
server are inconsistent because the `gym env start` command in the
`r030_gym.yaml` example is shown in the foreground even though the next `curl`
step assumes it is running in the background. Update the example around the
resource-server startup command so it is actually backgrounded (or otherwise
clearly detached) before the `curl` lookup, keeping the `RAY_TMPDIR` and
`server_instances` steps in the same flow.
- Around line 34-45: The gym benchmark URI currently uses 127.0.0.1, which only
reaches a local loopback server and will point to the worker itself under SLURM.
Update the recipe around the gym:// URI to use a host/IP that is reachable from
the worker nodes, and adjust the surrounding comment near the cluster
configuration to avoid implying that only swapping the cluster block is
sufficient. Refer to the gym:// source entry and the cluster block in this
example when making the fix.

In @.agents/skills/evaluation/references/nel-v0.3.0-migration.md:
- Around line 370-383: The MLflow export example in the migration guide does not
match the documented `nel export` CLI shape because it passes multiple benchmark
bundles in one call. Update the example near the `nel eval merge`/`nel export`
recipe to either show one bundle per `nel export` invocation or explicitly state
that `nel export` accepts multiple positional bundles, matching the earlier CLI
usage and the `nel export` command documentation.
- Around line 73-81: The markdown table/example blocks in
nel-v0.3.0-migration.md use bare fenced code blocks that trip markdownlint
MD040. Update the fenced blocks in the referenced sections to include a language
tag such as text, keeping the content unchanged; this applies to the example
blocks around the dataset/backend table and the other bare-fence occurrences
noted in the review.
- Around line 55-61: Fix the remaining markdownlint issues in this migration
note by removing the extra double space after the “2.” list item in the numbered
options section and ensuring the document ends with a trailing newline. Update
the affected markdown list content in the referenced eval-factory migration
guidance so it stays lint-clean.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90162b89-d43f-4514-ba35-ac5d0223821f

📥 Commits

Reviewing files that changed from the base of the PR and between f335459 and 67af6b0.

📒 Files selected for processing (4)
  • .agents/skills/evaluation/recipes/examples/r030_container.yaml
  • .agents/skills/evaluation/recipes/examples/r030_example_eval.yaml
  • .agents/skills/evaluation/recipes/examples/r030_gym.yaml
  • .agents/skills/evaluation/references/nel-v0.3.0-migration.md

Comment on lines +46 to +48
# The `container` solver runs the v0.2.x eval-factory task by name. Use the v1 `*_aa_v3` task id.
- name: mmlu_pro_aa_v3 # v0.2.x eval-factory container task (no native port)
solver: {type: container, service: model}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

The inline note overstates the legacy-only requirement.

mmlu_pro_aa_v3 does have a native 0.3.x equivalent (mmlu_pro) per the migration guide, so the “no native port” note is misleading here. Either use a truly legacy-only example task or narrow the comment to “no native benchmark of the same name.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agents/skills/evaluation/recipes/examples/r030_container.yaml around lines
46 - 48, The inline note for the container example is too broad and misleading
because mmlu_pro_aa_v3 does have a native 0.3.x equivalent. Update the comment
on the mmlu_pro_aa_v3 entry to either use a truly legacy-only task example or
rephrase it to say there is no native benchmark of the same name, keeping the
guidance aligned with the container solver example.

Comment thread .agents/skills/evaluation/recipes/examples/r030_gym.yaml Outdated
Comment on lines +34 to +45
- name: "gym://127.0.0.1:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl"
repeats: 1
max_concurrent: 5
# simple solver = single-shot policy answer scored by the server (mcqa/ifbench-style). For agentic
# gym envs that need tool use, switch to `type: tool_calling` + a `sandbox:`.
solver: {type: simple, service: model}

cluster:
# `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a
# GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block
# (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does.
type: local

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Call out that 127.0.0.1 is local-only.

The note on Lines 42-45 suggests that users can move this recipe to SLURM by swapping the cluster block, but the benchmark URI on Line 34 still points at loopback. On a worker node that resolves to the worker itself, not the separately started Gym server.

Suggested fix
-  - name: "gym://127.0.0.1:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl"
+  - name: "gym://<gym-server-host>:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl"
-  # (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does.
+  # (see r030_example_eval.yaml). If you move off `local`, replace `127.0.0.1` above with a host that
+  # is reachable from the worker node(s). NOTE: gym:// supports cluster.shards the same way native does.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: "gym://127.0.0.1:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl"
repeats: 1
max_concurrent: 5
# simple solver = single-shot policy answer scored by the server (mcqa/ifbench-style). For agentic
# gym envs that need tool use, switch to `type: tool_calling` + a `sandbox:`.
solver: {type: simple, service: model}
cluster:
# `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a
# GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block
# (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does.
type: local
- name: "gym://<gym-server-host>:<PORT>?protocol=native&data=/abs/path/to/tasks.jsonl"
repeats: 1
max_concurrent: 5
# simple solver = single-shot policy answer scored by the server (mcqa/ifbench-style). For agentic
# gym envs that need tool use, switch to `type: tool_calling` + a `sandbox:`.
solver: {type: simple, service: model}
cluster:
# `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a
# GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block
# (see r030_example_eval.yaml). If you move off `local`, replace `127.0.0.1` above with a host that
# is reachable from the worker node(s). NOTE: gym:// supports cluster.shards the same way native does.
type: local
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agents/skills/evaluation/recipes/examples/r030_gym.yaml around lines 34 -
45, The gym benchmark URI currently uses 127.0.0.1, which only reaches a local
loopback server and will point to the worker itself under SLURM. Update the
recipe around the gym:// URI to use a host/IP that is reachable from the worker
nodes, and adjust the surrounding comment near the cluster configuration to
avoid implying that only swapping the cluster block is sufficient. Refer to the
gym:// source entry and the cluster block in this example when making the fix.

Comment thread .agents/skills/evaluation/references/nel-v0.3.0-migration.md Outdated
Comment thread .agents/skills/evaluation/references/nel-v0.3.0-migration.md Outdated
Comment thread .agents/skills/evaluation/references/nel-v0.3.0-migration.md
@hychiang-git hychiang-git force-pushed the hungyuehc/nel-030-migration branch 5 times, most recently from 8fd7a9a to adaabab Compare June 28, 2026 02:59
@hychiang-git

Copy link
Copy Markdown
Contributor Author

/claude review

Comment on lines +68 to +74
# WITH-TOOLS variant of a native benchmark: SAME backend/name, just a tool_calling solver + a local
# sandbox — NO gym server involved. Use in place of the simple `gpqa` entry above to give it tools.
# - name: gpqa # still BUILT-IN
# repeats: 8
# max_concurrent: 64
# solver: {type: tool_calling, service: model, sandbox_tools: true, max_turns: 100}
# sandbox: {type: local, concurrency: 64}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] This commented-out tool_calling variant would silently fail to use tools as written. The migration guide itself (nel-v0.3.0-migration.md, §"The solver: block") states tool calling needs "a model served with tool calling enabled (vLLM: --enable-auto-tool-choice + --tool-call-parser)", but the services.model.extra_args block above (lines 30-38) doesn't include those flags. A user who simply uncomments this variant will get a model that never emits tool calls (turns exhaust → failed episodes).

Consider adding a companion note here pointing back to the required extra_args, e.g.:

  # WITH-TOOLS variant ... — also add to services.model.extra_args:
  #   - "--enable-auto-tool-choice"
  #   - "--tool-call-parser=<your-tool-parser>"
  # - name: gpqa                             # still BUILT-IN
  #   ...

Non-blocking — the doc covers it, but co-locating the requirement with the commented snippet prevents a foot-gun.

@@ -0,0 +1,109 @@
# NEL 0.3.x (engine) — SIMPLE / no-tools, SLURM-sharded.
# Built-in + skills math/multichoice benchmarks for a reasoning checkpoint.
# Fill every <...> placeholder. See SKILL.md "NEL r0.3.0 track" + references/nel-v0.3.0-migration.md.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] This points users to "SKILL.md 'NEL r0.3.0 track'", but .agents/skills/evaluation/SKILL.md has no such section — it is still entirely the 0.2.x flow (nel run, nel skills build-config, nel ls tasks). A reader who follows this pointer finds nothing. Either drop the SKILL.md "NEL r0.3.0 track" reference and keep just references/nel-v0.3.0-migration.md, or add the track to SKILL.md in a follow-up so the reference resolves. Non-blocking, but a dangling cross-reference is worse than none.

Comment on lines +368 to +378
- name: gym://127.0.0.1:<port>?protocol=native&data=<tasks.jsonl>
solver: {type: tool_calling, service: model, sandbox_tools: true, max_turns: 100}
sandbox: {type: local, concurrency: 64}

cluster:
shards: 8 # supported, same as native (the gym scorer is CPU; only the policy needs a GPU)
# ... + type/account/walltime/eval_image/node_pools
```

You must start the Gym resource server first and point `data=` at its task file — full setup (server
launch, protocol, integration) is in the [NeMo-Gym repo](https://github.com/NVIDIA-NeMo/Gym). `cluster.shards:` works

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] This snippet pairs a loopback gym://127.0.0.1:<port> URI with cluster.shards: 8 and asserts "cluster.shards works the same way as native." That combination silently breaks: under SLURM sharding the N workers each run on a different compute node, so 127.0.0.1 resolves to the worker itself, not the node where the single Gym resource server was started. The shards can't reach /verify and the run fails (or every reward comes back empty).

Native (built-in/skills://) sharding is genuinely shards-transparent because each worker serves its own model copy locally. gym:// is not symmetric: there's one shared resource server, and a sharded run needs all workers to reach it at a routable host/IP (the server node's hostname/IP, not loopback), plus that server able to handle the aggregate concurrency.

Suggest either:

  • changing the sharded example to a routable host (gym://<gym-server-host>:<port>?...) and adding a one-line caveat that the resource server must be reachable from all worker nodes, or
  • noting that sharded gym:// requires the server on a node reachable from the workers (loopback only works for single-node/type: local).

The single-node r030_gym.yaml (type: local) is fine with 127.0.0.1 — this only concerns the sharded path the guide documents here.

Comment on lines +42 to +44
# `local` = in-process on the login/cpu node — the gym scorer is always CPU, only the policy needs a
# GPU (and here the policy is a hosted API, so no GPU at all). For a real run, swap to a slurm block
# (see r030_example_eval.yaml). NOTE: gym:// supports cluster.shards the same way native does.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] This comment tells the user to "swap to a slurm block" and that "gym:// supports cluster.shards the same way native does", while the benchmark URI above (line 34) is gym://127.0.0.1:<PORT>. Following both literally produces a broken sharded run: under SLURM each shard worker is a separate node, so 127.0.0.1 points at the worker itself rather than the node running the single Gym resource server, and the shards can't reach /verify.

Add a caveat that when switching to SLURM + shards, the gym:// host must be a routable hostname/IP of the resource-server node (not loopback), and that the one server must handle the aggregate concurrency. (Same point as the migration-doc gym section.)

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Review Summary

Docs-only PR: a NEL 0.2.x to 0.3.x migration guide plus two example recipe YAMLs (3 files, +583). No Python/code changes, so no algorithm/mode-state/export-binary risk — review focused on internal consistency, runnability of the example configs, and cross-references. All 3 changed files reviewed.

Findings by severity — CRITICAL: 0, IMPORTANT: 2, SUGGESTION: 2

IMPORTANT (should fix)

  • gym:// loopback + cluster.shards is a broken combination (raised in both nel-v0.3.0-migration.md gym section and r030_gym.yaml). The guide pairs gym://127.0.0.1:<port> with shards: 8 and says sharding works the same way as native. It does not: native shards each serve their own local model copy, but gym:// has ONE shared resource server. Under SLURM each shard runs on a different node, so 127.0.0.1 resolves to the worker itself and the shards cannot reach the server's /verify. The sharded gym path needs a routable host/IP plus a server sized for the aggregate concurrency. The single-node type: local example is fine as-is.

SUGGESTION (non-blocking)

  • Commented tool_calling variant in r030_example_eval.yaml does not mention the --enable-auto-tool-choice / --tool-call-parser extra_args it requires (the guide notes them elsewhere; co-locating prevents a foot-gun).
  • Dangling cross-reference: r030_example_eval.yaml header points to SKILL.md 'NEL r0.3.0 track', but SKILL.md has no such section (still entirely 0.2.x). Drop the reference or add the track in a follow-up.

Note on prior CodeRabbit findings
CodeRabbit referenced a r030_container.yaml file that is NOT present in the current PR (only r030_example_eval.yaml and r030_gym.yaml exist), so its container-example finding is stale. Its markdownlint nits (MD040 bare fences, trailing-newline, double-space) and the nel export multi-bundle CLI-shape inconsistency (single bundle at L257 vs multiple at L415) are valid doc nits I am intentionally not duplicating.

Overall
Low risk — documentation only, backward compatible, no runtime surface. The two IMPORTANT items are the same gym-sharding correctness gap surfaced in two files; worth fixing before merge so users do not author a sharded gym run that silently fails to score. Nothing here is a hard blocker on the code itself.

@codecov

codecov Bot commented Jun 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.69%. Comparing base (f335459) to head (5314022).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1845      +/-   ##
==========================================
- Coverage   68.70%   68.69%   -0.02%     
==========================================
  Files         515      515              
  Lines       57118    57118              
==========================================
- Hits        39245    39239       -6     
- Misses      17873    17879       +6     
Flag Coverage Δ
unit 54.90% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hychiang-git hychiang-git force-pushed the hungyuehc/nel-030-migration branch 2 times, most recently from e5cd3d0 to 5be14f0 Compare June 28, 2026 03:31
@hychiang-git

Copy link
Copy Markdown
Contributor Author

Thanks @claude and @coderabbitai — addressed in 5be14f0ec:

Fixed

  • gym:// loopback + cluster.shards (IMPORTANT, both bots): the gym example now uses the validated single-node cluster: {type: local} instead of shards: 8, plus a caveat that sharding a gym run is not a free swap — its one shared resource server must be at a routable host (not 127.0.0.1) reachable from every shard. Corrected the same claim in the "Single-node vs sharded" bullet and in r030_gym.yaml.
  • r030_gym.yaml: backgrounded the gym env start command (&) and documented dynamic port discovery via …:11000/server_instances.
  • r030_example_eval.yaml: dropped the dangling SKILL.md "NEL r0.3.0 track" reference; noted the commented tool_calling variant needs the model served with --enable-auto-tool-choice + --tool-call-parser.
  • markdownlint: added a text language tag to the ASCII-table fences (MD040), removed the double space after 2., and added the trailing newline.

Not changed (with reason)

  • r030_container.yaml "no native port" note: stale — that file isn't part of this PR (only r030_example_eval.yaml and r030_gym.yaml).
  • nel export multi-bundle "inconsistency": not a bug — nel export [OPTIONS] PATHS... accepts one or more bundles (verified against the CLI). Harmonized the Basic-Usage line to show [<bundle> ...] so the two examples read consistently.

@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Claude finished @hychiang-git's task in 1m 24s —— View job


Re-review of 5be14f0ec — all items verified ✅

I re-read all three changed files at the current branch head and confirmed each fix landed correctly.

Verified fixed

  • gym:// loopback + sharding (IMPORTANT, both bots). Confirmed in all three places:
    • r030_gym.yaml:43-48 now uses cluster: {type: local} with the inline caveat that sharding is "NOT a free swap" and loopback + shards "silently fails to score." ✓
    • nel-v0.3.0-migration.md:289-292 ("Single-node vs sharded") now says a gym:// run can shard but needs the server at a routable host reachable from every shard. ✓
    • nel-v0.3.0-migration.md:395-399 blockquote spells out the SLURM/loopback failure mode and that only single-node type: local was validated. ✓
  • gym env start backgrounded + port discovery. r030_gym.yaml:10 now appends &; :11-13 document reading the dynamic port back from …:11000/server_instances. Matches migration.md:388-389. ✓
  • Dangling SKILL.md "NEL r0.3.0 track" reference dropped; tool_calling note added. r030_example_eval.yaml:3 points only to references/nel-v0.3.0-migration.md; :70-71 notes the served model needs --enable-auto-tool-choice + --tool-call-parser for the commented tool_calling variant. ✓
  • markdownlint. All ASCII-table fences now carry a text tag (migration.md:73, 109, 212, 266); the remaining bare ``` lines are all closing fences (MD040 only flags openers). No ^\d\. double-space marker remains, and the file ends with a single trailing newline (od -c\n). ✓

Agreed not-changed (reasons sound)

  • r030_container.yaml — correct, that file isn't in this PR; CodeRabbit's finding referenced a path not present in the diff. The legacy container:// guidance instead lives inline in migration.md:411-428, which I checked reads consistently.
  • nel export multi-bundle — confirmed harmonized: migration.md:257 now shows nel export <bundle> [<bundle> ...] with a # takes 1+ bundles note, so the Basic-Usage line and the MLflow example (:444) read consistently. Reasonable resolution.

Everything I raised (and the valid CodeRabbit nits) is addressed. The two IMPORTANT gym-sharding items — the only correctness concerns — are resolved across both the doc and the recipe. No further blockers from me on this docs-only PR. 👍
· hungyuehc/nel-030-migration

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

@hychiang-git Thanks for the thorough follow-up — the fixes and rationale make sense. The gym:// single-node correction, the routable-host caveat for sharded runs, the r030_gym.yaml launch/docs cleanup, and the r030_example_eval.yaml note updates all address the concern from my side. The clarification on nel export [OPTIONS] PATHS... accepting multiple bundles is also helpful.

🐇👍


✏️ Learnings added
Learnt from: hychiang-git
URL: https://github.com/NVIDIA/Model-Optimizer/pull/1845

Timestamp: 2026-06-28T03:32:04.717Z
Learning: In `.agents/skills/evaluation/recipes/examples/r030_gym.yaml` and related NEL 0.3.x migration docs, `gym://` evaluation should default to single-node `cluster: {type: local}` unless the shared NeMo-Gym resource server is exposed on a routable host reachable by every shard; using `127.0.0.1` is not valid for multi-shard execution.

Learnt from: hychiang-git
URL: https://github.com/NVIDIA/Model-Optimizer/pull/1845

Timestamp: 2026-06-28T03:32:04.717Z
Learning: For NEL 0.3.x CLI usage documented in `.agents/skills/evaluation/references/nel-v0.3.0-migration.md`, `nel export [OPTIONS] PATHS...` accepts one or more bundle paths, so multi-bundle export examples are valid.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@hychiang-git hychiang-git force-pushed the hungyuehc/nel-030-migration branch 4 times, most recently from 8fddfc2 to 7ad76df Compare June 28, 2026 03:42
@hychiang-git

Copy link
Copy Markdown
Contributor Author

/claude review

# nel eval run r030_example_eval.yaml
# # after all shards finish:
# nel eval merge <output.dir>/<run-id>
# # then export from the LOGIN node (SLURM does NOT auto-export — see migration doc §5):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] This comment points at "migration doc §5", and lines 38/52 point at "§6" (plus a "3-knob concurrency note"), but nel-v0.3.0-migration.md uses unnumbered ## headings — there is no §5/§6 and no section literally titled "3-knob concurrency note". A reader following these references hits a dead end.

Why it matters: these example configs are the on-ramp to the guide; dangling section pointers undercut their usefulness.

Suggestion: reference the actual heading names instead (e.g. "see Export results to MLFlow" for the export note, and "see Single-node vs sharded / the concurrency note" for the --max-num-seqs / request_timeout knobs), or add explicit §-style numbering to the migration doc headings so these anchors resolve.


output:
dir: /abs/path/to/rundir # MUST be absolute
# export is run manually after the job (see r030_example_eval.yaml / migration doc "Export to MLFlow"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Minor: this points at migration doc "Export to MLFlow", but the actual heading is "Export results to MLFlow". Worth aligning the title exactly so the cross-reference is unambiguous.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Scope: This is a docs-only PR (660 additions, 3 new files, all under .agents/skills/evaluation/). No modelopt/ or examples/ code is touched, so the ModelOpt-specific categories (algorithm correctness, mode/state composability, export compatibility, backward compatibility) don't apply. I reviewed the migration guide and the two example YAMLs for internal consistency and correctness of the documented CLI/config flow.

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 2

  • The example YAMLs reference numbered sections (§5, §6) and a "3-knob concurrency note" that don't exist as such in the migration doc (it uses unnumbered headings) — dangling cross-references.
  • r030_gym.yaml refers to "Export to MLFlow" while the doc heading is "Export results to MLFlow".

Both are non-blocking. The guide itself is thorough, internally consistent on the substance (config-block mapping, sharding semantics, gym loopback-vs-routable-host caveat, manual SLURM export flow), and clearly labels the unverified skills:///gym:// matrix columns as a routing guide rather than a guarantee.

Risk: low — documentation and placeholder-only example configs; no runtime behavior changes.

# nel eval run r030_example_eval.yaml
# # after all shards finish:
# nel eval merge <output.dir>/<run-id>
# # then export from the LOGIN node (SLURM does NOT auto-export — see migration doc §5):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] This comment points at "migration doc §5", and lines 38/52 point at "§6" (plus a "3-knob concurrency note"), but nel-v0.3.0-migration.md uses unnumbered ## headings — there is no §5/§6 and no section literally titled "3-knob concurrency note". A reader following these references will hit a dead end.

Why it matters: these example configs are the on-ramp to the guide; dangling section pointers undercut their usefulness.

Suggestion: reference the actual heading names instead (e.g. "see Export results to MLFlow" for the export note, and "see Single-node vs sharded / the concurrency note" for the --max-num-seqs / request_timeout knobs), or add explicit §-style numbering to the migration doc headings so these anchors resolve.


output:
dir: /abs/path/to/rundir # MUST be absolute
# export is run manually after the job (see r030_example_eval.yaml / migration doc "Export to MLFlow"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Minor: this points at migration doc "Export to MLFlow", but the actual heading is "Export results to MLFlow". Worth aligning the title exactly so the cross-reference is unambiguous.

@hychiang-git hychiang-git force-pushed the hungyuehc/nel-030-migration branch from 7ad76df to abfc9be Compare June 28, 2026 05:25
Add a guide for migrating checkpoint evaluation from the NEL 0.2.x launcher
(nemo_evaluator_launcher) to the 0.3.x engine (nemo_evaluator / `nel eval run`):

- references/nel-v0.3.0-migration.md: 0.2.x->0.3.x config mapping, backend
  support matrix (built-in / skills:// / gym://), per-backend launch patterns,
  single-node vs sharded runs, and the MLflow export flow.
- recipes/examples/r030_example_eval.yaml: built-in + skills:// native example.
- recipes/examples/r030_gym.yaml: gym:// (NeMo-Gym resource server + reward).

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hychiang-git hychiang-git force-pushed the hungyuehc/nel-030-migration branch from abfc9be to 7d3bdbb Compare June 28, 2026 05:40
@hychiang-git

Copy link
Copy Markdown
Contributor Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Scope: Docs-only PR (3 new files, +759, all under .agents/skills/evaluation/): the NEL v0.3.0 migration guide and two example recipe YAMLs. No modelopt/ or examples/ code is touched, so the ModelOpt-specific categories (algorithm correctness, mode/state composability, export compatibility, backward compatibility) don't apply. All 3 changed files reviewed in full.

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 0

I traced the substance end-to-end — the 0.2.x→0.3.x config-block mapping, backend-selection-by-prefix, sharding semantics (repeats not divided, the problem set is), the gym two-job SLURM pattern, and the manual merge→export flow — and it is internally consistent. Unverified material is clearly labeled (the skills:///gym:// matrix columns as a routing guide, multimodal/audio as "not validated here").

Prior findings — all resolved/confirmed:

  • The gym:// loopback + cluster.shards correctness gap (the only real concern from earlier rounds) is fixed across all three places: r030_gym.yaml uses validated single-node cluster: {type: local} with the routable-host caveat, and the migration doc (Single-node vs sharded, plus the gym Caveats) states the limitation consistently.
  • gym env start is backgrounded with dynamic port discovery via /server_instances documented.
  • The §4/§5/§7 cross-references now resolve against the numbered #### 1#### 7 subsections.
  • nel export <bundle> [<bundle> ...] is harmonized between Basic Usage and the MLflow example.
  • markdownlint nits (MD040 bare fences, trailing newline) addressed.
  • CodeRabbit's r030_container.yaml finding remains correctly out of scope (that file isn't in this PR).

Risk: low — documentation and placeholder-only example configs; no runtime behavior changes, backward compatible.

hychiang-git and others added 6 commits June 27, 2026 23:00
…al resume

- §7: how to enable reasoning (`chat_template_kwargs: {enable_thinking: true}` under
  services.model.proxy.extra_body) — a model setting, applies to native runs too.
- §7: clarify that cluster:{type: local} has no auto_resume (slurm-only); recover a
  killed gym run by re-submitting with `nel eval run --resume` (progress is checkpointed).

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l --resume

Clarify in the native section that built-in/skills runs use cluster:{type: slurm}
(auto_resume: true) so NEL chains a successor job on a wall-clock kill — no manual
step — in contrast to gym://'s cluster:{type: local} (manual `nel eval run --resume`).

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Explain that a NeMo-Gym bundled config (resources_servers/<name>/configs/<name>.yaml)
packs multiple server instances: the resources_servers grader, plus a sample
responses_api_agents agent that references an undefined responses_api_models. Add a
component table (linked to upstream ifbench.yaml) and a keep/drop rule: NEL native
needs only the grader; the agent is the gym-driven Responses-API path. Note that
--resources-server / --config <name>.yaml start the whole file and fail on the
undefined model, so point --config at a grader-only config. Align §6 sbatch to match.

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Align the §6 sbatch grader command with §2 (same
--config resources_servers/<name>/configs/<name>.yaml line and the
"&" = background / keep resources_servers, drop the agent comment) so both
code blocks are consistent.

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite the gym "Resume the jobs" section into a clean two-step procedure
(relaunch the serve, then resume the eval with
--export=ALL,RESUME=--resume) and thread ${RESUME} into the §6 conductor
line so the command works. Add that completed rollouts survive the resume
even though the new serve node changes the model URL: NEL keys the skip on
its verified log, which is retained on a config-hash change (only the
un-scored inference cache is dropped). Tighten the native auto-resume
paragraph and fix its cross-reference, and clarify the single-node/local
sharding note in the §4 config snippet.

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xample

Add a reminder in "Export results to MLflow" that downstream MLflow dashboards
key/group runs by the `model` (and `checkpoint_path`) tag: a run exported without
them can't be attributed (shows as an orphan/"no checkpoint" row), and a drifting
`model` value splits one logical row. Include a concrete `output.export_config.mlflow.tags`
YAML example.

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant