Verifiers v0.1.12 Release Notes
Date: 04/17/2026
Full Changelog: v0.1.11...v0.1.12
Highlights since v0.1.11
- Landed a new composable Task/Agent/Environment architecture and upstreamed opencode/RLM harnesses and swe/lean/math/cp/harbor tasksets into `verifiers.envs.experimental.composable`, so downstream environments can depend on them directly instead of via research-environments.
- Major `RLMEnv` overhaul: new `RLMPromptBuilder`, context dropping with `summarize_turns`, `max_turns_in_context`, a sub-LLM toggle, removed RLM-internal branding from model-visible prompts, richer metrics, hardened root-tool transport (no unsafe pickle), and a reworked harness install flow that runs from a uv workspace checkout.
- Runtime performance and reliability improvements, including executor autoscaling, incremental metrics, threaded file I/O, event loop lag monitoring, multi-worker env server support, GC tuning before accepting requests, `setproctitle` labels, dead-tunnel auto-recovery in `CliAgentEnv`, and safer task cancellation paths.
- Richer `vf-tui` with a log viewer, run comparison mode, toggleable markdown/reasoning rendering, rollout and unique-prompt counts with responsive layout, and saved-state columns in the info view.
- Expanded evaluation ergonomics with configurable `output_dir`, `[[ablation]]` model/endpoint overrides, `max_total_tokens` for `MultiTurnEnv`, `extra_headers_from_state` and `headers`/`extra_headers` support in `endpoints.toml`, `X-Session-ID` for DP-aware routing, preserved multimodal media in saved results, and exported eval parser/normalization helpers for Prime CLI reuse.
- New Hosted Evaluations docs plus an environment performance guide, a refreshed BrowserEnv README, and updated Secrets/Hub guidance across docs and agent skills.
Changes included in v0.1.12 (since v0.1.11)
Clients, RLM, and rollout execution
- feat: composable Task/Agent/Environment architecture (#1067)
- upstream `RlmComposableEnv` into `ComposableEnv`/`TaskSet`/`Harness` (#1158)
- move harnesses (rlm, opencode) and tasksets (swe, lean, math, cp, harbor) into `verifiers.envs.experimental.composable` (#1131)
- RLM: `RLMPromptBuilder` (#1070)
- RLM: context dropping & summarization (#1072)
- replace `remove_conversation_turns` with `summarize_turns` standard tool (#1095)
- add `max_turns_in_context`, fix answer extraction, document metrics (#1099)
- RLM: inform model about `max_turns_in_context` limit in scaffolding (#1111)
- remove RLM branding from model-visible prompts and messages (#1089)
- change `tools` arg to pass standard tools to root LLM (#1087)
- add `enable_sub_llms` toggle to `RLMEnv` (#1085)
- simplify RLM message transcript handling (#1116)
- RLM: improve prompts and metrics (#1102)
- refactor: rename RLM metrics for consistency (#1086)
- remove token/timing info from `llm_batch` output and add `max_turns` metric (#1098)
- replace timing info in RLM REPL output with root tool time metrics (#1097)
- harden RLM root-tool transport to remove unsafe pickle deserialization (#1104)
- RLM: remove dead code, harden tunnels (#1107)
- run RLM harness from a uv workspace checkout (#1139)
- revert inline install, use rlm's `install.sh` (#1144)
- clone via git protocol instead of fetching `install.sh` (#1159)
- update RLM harness test to match git-clone install script (#1160)
- RLM harness: install from arbitrary branch (#1153)
- fix RLM harness to use per-example `AGENT_WORKDIR` (#1143)
- port rlm harness dedup install script fix (#1133)
- set `RLM_KERNEL_PYTHON` to sandbox `.venv` for inline imports (#1145)
- guard `RLM_KERNEL_PYTHON` on successful ipykernel install (#1150)
- pin `ipykernel<7` for older sandbox Pythons (#1151)
- fix RLM bash timeout (#1079)
- fix: handle `CommandTimeoutError` in `RLMEnv` (#1069)
- deprecate `RolloutGatewayMixin` (#1017)
- add `NeMoRLChatCompletionsClient` for NeMo Gym model servers (#1141)
- feat: send `X-Session-ID` header during eval for DP-aware routing (#1137)
- feat: add `extra_headers_from_state` to `ClientConfig` (#1048)
- fix: handle `None` prompt/completion token ids in `parse_tokens` (#1066)
- fix tool args passing (#1106)
- fix TITO in opencode envs: bridge extraction, truncation gate, tool-call handling (#1005)
- fix: remove content `rstrip` in `normalize_response` to preserve TITO prefix match (#1081)
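For context on the `X-Session-ID` change, DP-aware routing depends on mapping every request in a session to the same data-parallel replica. A minimal sketch of the general technique, assuming hash-based pinning; this is illustrative only, not the verifiers implementation:

```python
import hashlib


def replica_for_session(session_id: str, num_replicas: int) -> int:
    """Map a session ID to a stable data-parallel replica index.

    Hashing (rather than round-robin) keeps every request carrying the
    same X-Session-ID header on the same replica, so per-session server
    state and caches stay warm. Illustrative sketch only.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas
```

The key property is determinism: a router can recompute the mapping statelessly on every request instead of keeping a session table.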
Env server, sandbox, and runtime reliability
- feat: multi env worker (#1055)
- fix: propagate `json_logging` to env workers (#1138)
- tune GC on env server before accepting requests (#1022)
- feat: set process titles on env server and workers (#1082)
- perf: executor autoscaling (#1039)
- perf: incremental metrics (#1036)
- perf: offload file I/O to thread pool (#1037)
- feat: improve event loop lag monitor (#1038)
- fix `get_free_port_pair()` TOCTOU race condition (#1013)
- fix: task cancellation race + RLM sandbox workers (#1035)
- fix: call `uncancel()` after catching `CancelledError` in `process_request` (#1047)
- fix cancelled + serialize error (#1044)
- detect server-side tunnel death and auto-recreate in `CliAgentEnv` (#1127)
- fix `AgentError` double-wrapping in `poll_job_completion` (#1130)
- fix: clear root logger handlers hijacked by swebench import (#1163)
- use SDK read-file endpoint and bg job handling in `SandboxMixin` (#1084)
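On the event-loop lag monitor: the standard technique is to request a short `asyncio.sleep` and measure how much later the loop actually wakes up. A self-contained sketch of that idea, not the verifiers implementation:

```python
import asyncio
import time


async def sample_loop_lag(interval: float = 0.05, samples: int = 5) -> list[float]:
    """Measure event-loop scheduling lag.

    Each iteration sleeps for `interval` seconds and records how much
    later than requested the loop woke us up; a loop blocked by CPU-bound
    work or synchronous I/O shows up as growing lag. Illustrative only.
    """
    lags: list[float] = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags
```

Offloading file I/O to a thread pool (as #1037 does) is exactly the kind of change such a monitor validates: blocking calls removed from the loop show up directly as reduced lag samples.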
Evaluation UX, metrics, and configuration
- `vf-tui`: log viewer (#1075)
- `vf-tui`: fixes & features (including comparison mode and markdown/reasoning toggles) (#1007)
- `vf-tui`: show rollouts and unique prompts, better dynamic width (#1060)
- show saved state columns in TUI info view (#1091)
- make `output_dir` configurable in evals (#1029)
- handle ablation `model` and `endpoint_id` overrides (#1135)
- export eval parser and normalization helpers for Prime CLI reuse (#1135)
- feat: add `max_total_tokens` parameter to `MultiTurnEnv` (#1101)
- support `headers`/`extra_headers` in `endpoints.toml` (#1051)
- preserve multimodal media in saved eval results (#1015)
- fix display of custom sampling args (#1025)
- fix output dir logging (#1041)
- fix host-side eval in composable CP wrapper parsing (#1165)
- fix composable `mkdir` path quoting (#1110)
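On `max_total_tokens`: the semantics of a rollout-wide token cap can be sketched as a running budget checked after each turn. The class and method names below are hypothetical illustrations of the idea, not the verifiers API:

```python
class TokenBudget:
    """Track cumulative tokens across the turns of one rollout.

    Hypothetical helper illustrating max_total_tokens semantics: the cap
    applies to the whole multi-turn rollout, not to any single turn.
    """

    def __init__(self, max_total_tokens: int):
        self.max_total_tokens = max_total_tokens
        self.used = 0

    def add_turn(self, tokens: int) -> bool:
        """Record a turn's token usage; return True if the rollout may continue."""
        self.used += tokens
        return self.used < self.max_total_tokens
```

A per-turn limit alone cannot bound long agentic rollouts, which is why a total budget is the natural knob for multi-turn environments.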
Environments, multimodality, and integrations
- add `BrowserEnv` integration README (#1020)
- `opencode_rlm_env` (#1023)
- misc improvements to opencode envs (#999)
- perf improvements for opencode envs + math rubric (#1034)
- opencode envs (including `CliAgentEnv` hardening, hybrid math rubric overhaul, and log capture) (#1005)
- fix: revert `opencode_env` config regression and move RLM logic out of `cli_agent_env` (#1042)
- fix opencode config for model names without slash (#1114)
- feat: dataset builder pattern for lazy loading in all environments (#1064)
- add cleanup and teardown lifecycle hooks to `Rubric` (#1026)
- remove redundant msg normalization + align `env_response` API (#1027)
- chore: reuse math rubric in hybrid math rubric (#1043)
- perf: math rubric skip overlong answers (#1046)
- fix math rubric timeout (#1096)
- lazily import packages (#1019)
- fix: env tests (#1061)
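The dataset builder pattern mentioned above replaces eager dataset construction with a factory invoked on first use. A minimal sketch of the pattern with hypothetical names, not the verifiers API:

```python
from typing import Callable, Optional


class LazyDataset:
    """Defer expensive dataset construction until first access.

    The factory runs at most once, so merely importing or registering an
    environment stays cheap; the cost is paid only by the environment that
    is actually evaluated. Hypothetical illustration of the pattern.
    """

    def __init__(self, factory: Callable[[], list]):
        self._factory = factory
        self._data: Optional[list] = None

    @property
    def data(self) -> list:
        if self._data is None:
            self._data = self._factory()
        return self._data
```

Making lazy loading uniform across environments also keeps test collection and CLI listing fast, since no dataset downloads happen until a rollout needs them.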
Docs, CLI, and tooling
- docs: add performance guide for environments (#1045)
- add hosted evaluations section to eval docs (#1040)
- update Secrets guidance (BrowserBase README) (#1056)
- docs: prefix `prime eval` models (#1125)
- `tomllib`/`tomli` guard for Python 3.10 (#1136)
- pin `regex<2026.4.4` (missing cp312/cp313 wheels) (#1109)
- pin uv `<0.11.0` to fix flash-attn resolution (#1057)
- bump uv requirement to `>=0.11.1` (#1112)