Verifiers v0.1.13.dev4 Release Notes
Date: 04/22/2026
Highlights since v0.1.13.dev3
- RLM harness: new
rlm_toolskwarg sets bothHarness.tool_names(forToolMonitorRubric) and the sandboxRLM_TOOLSenv var from a single source, plus newHarness.environment_varsfield merged harness-wins-on-conflict byComposableEnv. - Refactored experimental RLM checkout caching;
DEFAULT_RLM_BRANCHrenamed toDEFAULT_RLM_REFandrlm_harness(..., rlm_branch=...)renamed torlm_ref=to reflect that any git ref (branch, tag, sha) is accepted. - Added
SandboxTimeoutsdataclass centralizing per-operation sandbox HTTP timeouts. - Expanded task coverage with SWE-rebench-V2 and a multilingual SWESmith taskset, plus a
filter_fnkwarg on all tasksets for ad-hoc row filtering. vf-eval: renamed-d/--debugto--disable-tuiand--tuito--fullscreenfor clearer intent.- RLM rollout metrics (context tokens, programmatic tool calls) exposed to verifiers and auto-merged by the composable env.
Changes included in v0.1.13.dev4 (since v0.1.13.dev3)
Features and enhancements
- vf-eval: replace -d/--debug with --disable-tui, rename --tui to --fullscreen (#1183)
- Expose RLM metrics to verifiers (#1195)
- Add streaming observability + resume to TaskSet.validate() (#1169)
- Refactor experimental RLM checkout caching (#1202)
- feat: add filter_fn kwarg to all tasksets for ad-hoc row filtering (#1199)
- feat: add multilingual SWESmithTaskSet (#1186)
- feat: add SWE-rebench-V2 TaskSet (#1187)
- HarborMCPMixin (#1146)
- feat: SandboxTimeouts dataclass — centralize per-operation sandbox HTTP timeouts (#1207)
- Run SWE-Lego eval via dataset's canonical test_cmd (#1205)
- Authenticate interception server via INTERCEPTION_SECRET (#1180)
- feat: revert agent test edits at grading (swe_lego, swe_rebench_v2) (#1212)
- AgentError: rollout_id, sandbox_id, ... (#1218)
- Remove RLM_DEFAULT_TOOL_NAMES, accept rlm_tools (#1223)
- r2e_gym: add hide_tests_from_agent flag + expose instance_id/repo aliases (#1208)
- feat(rlm): upload a /usr/local/bin/git shim, gated by
allow_git(#1225)
Fixes and maintenance
- Keep harness metrics merge inside experimental composable env (#1201)
- Propagate typed exceptions from SWE/Harbor validate_instance (#1204)
- fix: pass explicit 60s timeout to get_background_job in poll_job_completion (#1206)
- fix: bump opencode harness default release to v1.1.63-rl2 (#1184)
- validate(): extract resume-file parsing into a named helper (#1209)
- fix: SandboxTimeouts fields must be int (sidecar deserializes as u64) (#1210)
- fix: respect framework-injected OPENAI_API_KEY in RLM and opencode harnesses (#1213)
- fix: offload composable _upload_dir tar build to thread (#1224)
Full Changelog: v0.1.13.dev3...v0.1.13.dev4