[db-executable-env] Phase 2: DatabaseExecutableEnvironment (prototype)#2520

Draft
aniruddh-alt wants to merge 12 commits into aniruddh-alt/db-executable-env-01-executable-skeleton from aniruddh-alt/db-executable-env-02-database-env

Summary

Prototype of an executable database environment for RL/eval over a SQLite database. Each rollout gets an isolated session whose writes are visible within an episode but never persist or leak across rollouts.

  • DatabaseExecutableEnvironment (registered "database"): per-rollout RollbackSession, requires_isolation()=True, executors dispatched with a live db connection.
  • db_isolation: RollbackSession (opens isolation_level=None + explicit BEGIN so both DML and a leading DDL statement roll back; executors must not commit) and materialize_sqlite_snapshot.
  • Fleshed out the ExecutableEnvironment base executor dispatch (_step_one) the skeleton left abstract.
  • EHR example tools/executors (list/lookup/update patient) + a YAML config — the "bring your DB" entry point through build_environment.
  • sql_execution_match reward that grades candidate vs gold SQL on a fresh isolated session (reusing the env's isolation on the grading side).
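As a rough illustration of the execution-match idea in the last bullet, the sketch below runs the gold and candidate queries on separate throwaway transactions and compares their result sets. It is not the PR's `sql_execution_match`: the function names, the order-insensitive comparison, the 0.0/1.0 scoring, and the demo schema are all assumptions made for this example.

```python
import os
import sqlite3
from collections import Counter


def build_demo_db(directory: str) -> str:
    """Create a tiny demo database (hypothetical schema, not the PR's EHR data)."""
    db_path = os.path.join(directory, "ehr_demo.db")
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO patients (name, age) VALUES (?, ?)",
        [("Ada", 36), ("Grace", 45)],
    )
    conn.commit()
    conn.close()
    return db_path


def _run_isolated(db_path: str, sql: str):
    """Run one statement inside a transaction that is always rolled back.

    Returns the fetched rows, or None if the statement fails.
    """
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("BEGIN")
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    finally:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        conn.close()


def execution_match(db_path: str, candidate_sql: str, gold_sql: str) -> float:
    """Score 1.0 when both queries run and return the same multiset of rows."""
    # Separate sessions: a mutating gold query cannot affect the candidate.
    gold_rows = _run_isolated(db_path, gold_sql)
    candidate_rows = _run_isolated(db_path, candidate_sql)
    if gold_rows is None or candidate_rows is None:
        return 0.0
    # Assumption: row order is ignored when comparing results.
    return 1.0 if Counter(gold_rows) == Counter(candidate_rows) else 0.0
```

Comparing multisets rather than lists is a design choice of this sketch; a grader that cares about `ORDER BY` would compare the row lists directly.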

85 unit tests cover the isolation contract: uncommitted write visible within an episode, rolled back on close, no cross-rollout leak, shared snapshot never mutated, leading-DDL rollback.
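The isolation contract listed above can be reproduced with the standard-library `sqlite3` module alone. The following is a minimal sketch, assuming a class shaped roughly like the PR's `RollbackSession`; the class name `RollbackSessionSketch`, the `demo` helper, and the table names are hypothetical and are not taken from the PR's code.

```python
import os
import sqlite3
import tempfile


class RollbackSessionSketch:
    """Hypothetical stand-in for RollbackSession: never commits, rolls back on close."""

    def __init__(self, db_path: str) -> None:
        # isolation_level=None turns off sqlite3's implicit transaction handling,
        # so the explicit BEGIN below covers DDL as well as DML.
        self.conn = sqlite3.connect(db_path, isolation_level=None)
        self.conn.execute("BEGIN")

    def close(self) -> None:
        if self.conn.in_transaction:
            self.conn.execute("ROLLBACK")
        self.conn.close()


def demo() -> dict:
    """Exercise write-then-read, rollback on close, and leading-DDL rollback."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "snapshot.db")
        seed = sqlite3.connect(db_path)
        seed.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
        seed.execute("INSERT INTO patients (name) VALUES ('Ada')")
        seed.commit()
        seed.close()

        # First rollout: DDL as the first statement, then a DML write.
        first = RollbackSessionSketch(db_path)
        first.conn.execute("CREATE TABLE scratch (x INTEGER)")
        first.conn.execute("UPDATE patients SET name = 'Grace' WHERE id = 1")
        seen_in_episode = first.conn.execute(
            "SELECT name FROM patients WHERE id = 1"
        ).fetchone()[0]
        first.close()

        # Second rollout: neither the write nor the scratch table should exist.
        second = RollbackSessionSketch(db_path)
        seen_after = second.conn.execute(
            "SELECT name FROM patients WHERE id = 1"
        ).fetchone()[0]
        scratch_count = second.conn.execute(
            "SELECT COUNT(*) FROM sqlite_master WHERE name = 'scratch'"
        ).fetchone()[0]
        second.close()

    return {
        "seen_in_episode": seen_in_episode,
        "seen_after": seen_after,
        "scratch_exists": scratch_count > 0,
    }
```

With the default `isolation_level`, Python's `sqlite3` opens an implicit transaction only before DML statements, so a `CREATE TABLE` issued first would run in autocommit mode and persist. That is the case the explicit `BEGIN` guards against.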

Status: prototype

Opened to run experiments. Deliberately deferred: wiring into verl GRPO rollouts (single-turn NL2SQL first), copy/copy-on-write isolation for concurrent committed writes, and the schema/table/row database-mutation stack for init diversity.

Series context (db-executable-env chain)

Test plan

  • pytest tests/unit/environments tests/unit/datasets/grpo/rewards/test_sql_execution_match.py (85 pass)
  • Build the example env from configs/examples/database_env/ehr_database_env.yaml via build_environment

Move `import jsonschema` from between __future__ and stdlib imports to
its correct position (stdlib → third-party → first-party), fixing the
ruff I001 lint failure that was blocking CI.

…nused import

Add Google-style docstring to RollbackSession.__init__ to satisfy ruff D107,
and remove unused `from pathlib import Path` in the test file (ruff F401).
Both violations would hard-fail the pre-commit ruff hook.

…solation

Per-rollout RollbackSession (never commits, rolls back on close) so writes are
visible within an episode and never persist or leak across rollouts. Includes
the isolation proof tests (write-then-read within an episode, rollback on close,
no cross-rollout leak, shared snapshot never mutated).

Execution-match reward grades candidate vs gold SQL on a fresh rollback session
(reusing the env's isolation on the grading side). EHR YAML config + builder test
exercise the 'bring your DB' entry point through build_environment.

- RollbackSession opens isolation_level=None + explicit BEGIN so leading DDL
  (CREATE/DROP as the first statement) is also rolled back, not just DML.
- sql_execution_match grades gold and candidate on separate sessions so a
  mutating gold query can't contaminate the candidate.
- export sql_execution_match from rewards/__init__ so @register fires on
  package import (otherwise it's missing from the registry).
- config test resolves its path relative to __file__, not CWD.
- document that db_path isolation is read-concurrent only (writers contend).

ToolResult.output is str | dict; subscripting it tripped pyright's pre-push
check. Compare the whole output dict instead.