Dark Factory Mode (closes #37)
Spec-in, software-out autonomous pipeline. Feed it an NLSpec, get working software back — with blind holdout testing to verify quality.
New Command
/octo:factory --spec spec.md
Pipeline
- Parse Spec — Extract behaviors, constraints, satisfaction target from NLSpec
- Generate Scenarios — Multi-provider test scenario generation (Codex + Gemini)
- Split Holdout — 80/20 split with behavior-diverse holdout selection
- Embrace Workflow — Full 4-phase implementation (discover → define → develop → deliver) in autonomous mode
- Holdout Tests — Blind cross-model evaluation against withheld scenarios
- Satisfaction Scoring — Weighted 4-dimension assessment
- Report — Markdown report + JSON session at
.octo/factory/<run-id>/
Scoring
| Dimension | Weight |
|---|---|
| Behavior Coverage | 40% |
| Constraint Adherence | 20% |
| Holdout Pass Rate | 25% |
| Quality | 15% |
Verdicts: PASS (≥ target), WARN (≥ target − 0.05), FAIL (< target − 0.05)
On FAIL, automatically retries phases 3–4 with remediation context from failing scenarios.
What's New
- 7 new functions in
orchestrate.sh:parse_factory_spec,generate_factory_scenarios,split_holdout_scenarios,run_holdout_tests,score_satisfaction,generate_factory_report,factory_run /octo:factorycommand file +skill-factory.mdenforced skill with 8-step execution contract- CLI flags:
--spec,--holdout-ratio,--max-retries,--ci - 49 new tests across 7 suites
- OpenClaw registry updated (96 entries)
Architecture
Factory wraps embrace_full_workflow() — zero modifications to embrace. Sets AUTONOMY_MODE=autonomous + OCTOPUS_SKIP_PHASE_COST_PROMPT=true + OCTOPUS_FACTORY_MODE=true and injects visible scenarios into the prompt.
Cost
~$0.50–2.00 per run (~20–30 agent calls). Displayed upfront with approval gate unless --ci.
Full Changelog: v8.24.0...v8.25.0