[db-executable-env] Phase 2: DatabaseExecutableEnvironment (prototype)#2520
Draft
aniruddh-alt wants to merge 12 commits into
Move `import jsonschema` from between __future__ and stdlib imports to its correct position (stdlib → third-party → first-party), fixing the ruff I001 lint failure that was blocking CI.
…nused import Add Google-style docstring to RollbackSession.__init__ to satisfy ruff D107, and remove unused `from pathlib import Path` in the test file (ruff F401). Both violations would hard-fail the pre-commit ruff hook.
…solation Per-rollout RollbackSession (never commits, rolls back on close) so writes are visible within an episode and never persist or leak across rollouts. Includes the isolation proof tests (write-then-read within an episode, rollback on close, no cross-rollout leak, shared snapshot never mutated).
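A minimal sketch of the per-rollout isolation contract described in this commit, assuming a file-backed SQLite snapshot. The `RollbackSession` body, the `patients` table, and the temp-file setup are illustrative stand-ins, not the PR's actual code.

```python
import os
import sqlite3
import tempfile


class RollbackSession:
    """Hypothetical per-rollout session: writes are visible inside, never persisted."""

    def __init__(self, db_path: str) -> None:
        # isolation_level=None turns off sqlite3's implicit transaction handling,
        # so the explicit BEGIN below is the only transaction in play.
        self.conn = sqlite3.connect(db_path, isolation_level=None)
        self.conn.execute("BEGIN")

    def close(self) -> None:
        # Discard everything the rollout wrote, then release the connection.
        if self.conn.in_transaction:
            self.conn.execute("ROLLBACK")
        self.conn.close()


# Build a small shared snapshot (committed once, up front). Illustrative schema.
fd, snapshot = tempfile.mkstemp(suffix=".db")
os.close(fd)
seed = sqlite3.connect(snapshot)
seed.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
seed.execute("INSERT INTO patients (name) VALUES ('Ada')")
seed.commit()
seed.close()

# Rollout 1 writes and can read its own write back within the episode.
first = RollbackSession(snapshot)
first.conn.execute("INSERT INTO patients (name) VALUES ('Grace')")
seen_in_episode = first.conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
first.close()

# Rollout 2 starts from the untouched snapshot: nothing leaked across rollouts.
second = RollbackSession(snapshot)
seen_after_close = second.conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
second.close()
os.remove(snapshot)
```

The sketch only holds as long as nothing running on `conn` issues a `COMMIT`; once an executor commits, the rollback on close has nothing left to undo.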
Execution-match reward grades candidate vs gold SQL on a fresh rollback session (reusing the env's isolation on the grading side). EHR YAML config + builder test exercise the 'bring your DB' entry point through build_environment.
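As a rough illustration of execution-match grading, the sketch below runs candidate and gold SQL each in its own rolled-back connection and compares the result rows. The helper names (`run_isolated`, `execution_match`), the order-insensitive comparison, and the 0.0 score for a failing candidate are assumptions made for the example; the PR's `sql_execution_match` may differ on all three.

```python
import os
import sqlite3
import tempfile
from collections import Counter


def run_isolated(db_path: str, sql: str) -> list:
    """Run one statement inside an explicit transaction that is always rolled back."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("BEGIN")
        return conn.execute(sql).fetchall()
    finally:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        conn.close()


def execution_match(db_path: str, candidate_sql: str, gold_sql: str) -> float:
    """Return 1.0 when both queries yield the same multiset of rows, else 0.0."""
    try:
        candidate_rows = run_isolated(db_path, candidate_sql)
    except sqlite3.Error:
        return 0.0  # assumption: a candidate that fails to execute scores zero
    gold_rows = run_isolated(db_path, gold_sql)
    # Assumption: row order is ignored; duplicates still count.
    return 1.0 if Counter(candidate_rows) == Counter(gold_rows) else 0.0


# Illustrative database, not the PR's EHR fixture.
fd, db_path = tempfile.mkstemp(suffix=".db")
os.close(fd)
seed = sqlite3.connect(db_path)
seed.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
seed.executemany("INSERT INTO patients (name) VALUES (?)", [("Ada",), ("Grace",)])
seed.commit()
seed.close()

gold = "SELECT name FROM patients ORDER BY id"
score_match = execution_match(db_path, "SELECT name FROM patients", gold)
score_wrong = execution_match(db_path, "SELECT name FROM patients WHERE id = 1", gold)
score_invalid = execution_match(db_path, "SELEC name FROM patients", gold)
score_mutating = execution_match(db_path, "DELETE FROM patients", gold)
rows_left = len(run_isolated(db_path, "SELECT * FROM patients"))
os.remove(db_path)
```

Because each query gets its own connection and rollback, the mutating candidate neither affects the gold run nor changes the database for later grading calls.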
- RollbackSession opens isolation_level=None + explicit BEGIN so leading DDL (CREATE/DROP as the first statement) is also rolled back, not just DML.
- sql_execution_match grades gold and candidate on separate sessions so a mutating gold query can't contaminate the candidate.
- export sql_execution_match from rewards/__init__ so @register fires on package import (otherwise it's missing from the registry).
- config test resolves its path relative to __file__, not CWD.
- document that db_path isolation is read-concurrent only (writers contend).
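The leading-DDL point can be reproduced with the standard-library `sqlite3` module alone. Under its default (legacy) transaction control, the module opens an implicit transaction before `INSERT`/`UPDATE`/`DELETE`/`REPLACE` but not before DDL, so a `CREATE TABLE` issued as the first statement is not covered by a later rollback. The sketch below contrasts that with `isolation_level=None` plus an explicit `BEGIN`; the `scratch` table and the `table_exists` helper are illustrative.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Check sqlite_master for a table with the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


# Default transaction control: no implicit BEGIN is issued before DDL,
# so the CREATE TABLE takes effect immediately and rollback() has nothing to undo.
implicit = sqlite3.connect(":memory:")
implicit.execute("CREATE TABLE scratch (x)")
implicit.rollback()
ddl_survived_implicit = table_exists(implicit, "scratch")

# Explicit BEGIN on an isolation_level=None connection: the DDL runs inside
# the transaction and is undone by ROLLBACK.
explicit = sqlite3.connect(":memory:", isolation_level=None)
explicit.execute("BEGIN")
explicit.execute("CREATE TABLE scratch (x)")
explicit.execute("ROLLBACK")
ddl_survived_explicit = table_exists(explicit, "scratch")

implicit.close()
explicit.close()
```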
ToolResult.output is str | dict; subscripting it tripped pyright's pre-push check. Compare the whole output dict instead.
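For context, a small sketch of why subscripting a `str | dict` value trips a type checker while whole-value comparison does not. The `ToolResult` dataclass here is a hypothetical stand-in; only the `output: str | dict` shape comes from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolResult:
    # Hypothetical stand-in for the real class; only the field type is from the PR.
    output: str | dict


result = ToolResult(output={"rows": [[1, "Ada"]]})

# `result.output["rows"]` is rejected by a type checker because the `str`
# member of the union cannot be indexed with a string key.
# Comparing the whole value needs no narrowing and is valid for both members.
whole_match = result.output == {"rows": [[1, "Ada"]]}

# Alternative when a single field is needed: narrow with isinstance first.
rows = result.output["rows"] if isinstance(result.output, dict) else None
```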
Summary
Prototype of an executable database environment for RL/eval over a SQLite database. Each rollout gets an isolated session whose writes are visible within an episode but never persist or leak across rollouts.
- `DatabaseExecutableEnvironment` (registered `"database"`): per-rollout `RollbackSession`, `requires_isolation()=True`, executors dispatched with a live `db` connection.
- `db_isolation`: `RollbackSession` (opens `isolation_level=None` + explicit `BEGIN` so both DML and a leading DDL statement roll back; executors must not commit) and `materialize_sqlite_snapshot`.
- `ExecutableEnvironment` base executor dispatch (`_step_one`) the skeleton left abstract.
- EHR example executors (list/lookup/update patient) + a YAML config: the "bring your DB" entry point through `build_environment`.
- `sql_execution_match` reward that grades candidate vs gold SQL on a fresh isolated session (reusing the env's isolation on the grading side).

85 unit tests cover the isolation contract: uncommitted write visible within an episode, rolled back on close, no cross-rollout leak, shared snapshot never mutated, leading-DDL rollback.
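To make "executors dispatched with a live `db` connection" concrete, here is a hedged sketch of name-based dispatch over a rolled-back connection. The registry, the executor signatures, the `step_one` function, and the `patients` schema are all hypothetical; the PR's `_step_one` and EHR executors are not reproduced here.

```python
import sqlite3
from typing import Any, Callable

# Hypothetical executor shape: receives the rollout's live connection plus arguments.
Executor = Callable[[sqlite3.Connection, dict], dict]


def lookup_patient(db: sqlite3.Connection, args: dict) -> dict:
    row = db.execute(
        "SELECT id, name FROM patients WHERE id = ?", (args["id"],)
    ).fetchone()
    return {"patient": None if row is None else {"id": row[0], "name": row[1]}}


def update_patient(db: sqlite3.Connection, args: dict) -> dict:
    # Executors write through the session's connection and never commit.
    cur = db.execute(
        "UPDATE patients SET name = ? WHERE id = ?", (args["name"], args["id"])
    )
    return {"updated": cur.rowcount}


EXECUTORS: dict = {"lookup_patient": lookup_patient, "update_patient": update_patient}


def step_one(db: sqlite3.Connection, tool_name: str, args: dict) -> Any:
    """Route one tool call to its executor, passing the live connection."""
    executor = EXECUTORS.get(tool_name)
    if executor is None:
        return {"error": f"unknown tool: {tool_name}"}
    return executor(db, args)


# Seed outside any transaction (autocommit), then open the rollout transaction.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO patients (id, name) VALUES (1, 'Ada')")
db.execute("BEGIN")

step_one(db, "update_patient", {"id": 1, "name": "Ada L."})
during = step_one(db, "lookup_patient", {"id": 1})
unknown = step_one(db, "drop_everything", {})

db.execute("ROLLBACK")
after = step_one(db, "lookup_patient", {"id": 1})
db.close()
```

Passing the connection in, instead of letting executors open their own, is what keeps every tool call inside the same rolled-back transaction.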
Status: prototype
Opened to run experiments. Deliberately deferred: wiring into verl GRPO rollouts (single-turn NL2SQL first), copy/copy-on-write isolation for concurrent committed writes, and the schema/table/row database-mutation stack for init diversity.
Series context (db-executable-env chain)
Test plan
- `pytest tests/unit/environments tests/unit/datasets/grpo/rewards/test_sql_execution_match.py` (85 pass)
- `configs/examples/database_env/ehr_database_env.yaml` via `build_environment`