Highlights
This release introduces three multi-agent subsystems, a hardened Docker sandbox, and four rounds of paper quality auditing, all aimed at improving the end-to-end quality of generated research papers.
New Multi-Agent Subsystems
CodeAgent (4-phase architecture)
- LLM generates multi-file experiment code (main.py + setup.py + requirements.txt)
- Static analysis & deep validation (AST-based class/method checks)
- LLM-guided code review with structured JSON feedback
- Iterative repair loop (up to 3 rounds) with automatic UnboundLocalError fix
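As an illustration of the AST-based class/method checks mentioned above, here is a minimal sketch using Python's standard `ast` module. The function name `missing_methods` and the shape of the `required` mapping are assumptions made for this example, not the CodeAgent's actual interface.

```python
import ast


def missing_methods(source: str, required: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, for each required class, the required methods not defined in `source`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    tree = ast.parse(source)  # raises SyntaxError if the generated code is invalid
    found: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Collect the names of methods defined directly in the class body.
            found[node.name] = {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
    report: dict[str, set[str]] = {}
    for cls, methods in required.items():
        missing = methods - found.get(cls, set())
        if missing:
            report[cls] = missing
    return report


generated = '''
class Trainer:
    def fit(self):
        pass
'''
report = missing_methods(generated, {"Trainer": {"fit", "evaluate"}, "Model": {"forward"}})
```

A check like this runs without executing the generated code, so it could plausibly feed a repair loop with a concrete list of what is missing.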
BenchmarkAgent (4 sub-agents: Surveyor → Selector → Acquirer → Validator)
- Domain-aware dataset and baseline selection from 13-domain knowledge base
- Automatic benchmark acquisition with Docker compatibility validation
- Integrated at Stage 9 (experiment_design), output injected into Stage 10
FigureAgent (5 sub-agents: Planner → CodeGen → Renderer → Critic → Integrator)
- Academic-quality chart generation with SciencePlots, 300 DPI, colorblind-safe palette
- 6 built-in chart templates + LLM fallback for custom visualizations
- Tri-modal critic review (data accuracy, aesthetics, academic convention)
Docker Sandbox Enhancements
- Network-policy-aware code generation: none|setup_only|pip_only|full
- Dynamic dependency installation via requirements.txt
- Pre-cached datasets: CIFAR-10/100, MNIST, FashionMNIST, STL-10, SVHN
- Extended ML stack: torch, torchvision, timm, einops, transformers, etc.
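The following sketch shows one way a network policy could be turned into constraints for code generation. The four policy names come from the notes above. Everything else is an assumption for illustration: the capability table, the meaning given to each policy, and the `codegen_hint` helper.

```python
from enum import Enum


class NetworkPolicy(str, Enum):
    NONE = "none"
    SETUP_ONLY = "setup_only"
    PIP_ONLY = "pip_only"
    FULL = "full"


# ASSUMED semantics, not taken from the release:
# (pip installs allowed, downloads during setup allowed, network at run time allowed)
ASSUMED_CAPABILITIES = {
    NetworkPolicy.NONE: (False, False, False),
    NetworkPolicy.PIP_ONLY: (True, False, False),
    NetworkPolicy.SETUP_ONLY: (True, True, False),
    NetworkPolicy.FULL: (True, True, True),
}


def codegen_hint(policy_name: str) -> str:
    """Build a constraint sentence for a code-generation prompt (hypothetical helper)."""
    policy = NetworkPolicy(policy_name)  # raises ValueError for an unknown policy name
    pip_ok, setup_ok, runtime_ok = ASSUMED_CAPABILITIES[policy]
    parts = [
        "Extra packages may be listed in requirements.txt." if pip_ok
        else "Use only preinstalled packages.",
        "setup.py may download data." if setup_ok
        else "Use only pre-cached datasets.",
        "main.py may access the network." if runtime_ok
        else "main.py must run offline.",
    ]
    return " ".join(parts)


hint = codegen_hint("pip_only")
```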
Paper Quality Hardening (4-round audit)
- Post-compilation quality checks, weasel/duplicate word lint
- 7-dimension AI-Scientist-style review scoring
- AI-slop detection (50+ phrases), statistical rigor validator
- Cross-discipline support for 7 research domains (ML/physics/chem/econ/math/eng/bio)
- NeurIPS checklist integration
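A weasel/duplicate word lint of the kind listed above can be sketched in a few lines. The word list and the `lint_line` function are illustrative assumptions; the release does not specify which words or rules the actual linter uses.

```python
import re

# Illustrative list only; the real lint's vocabulary is not given in these notes.
WEASEL_WORDS = {"very", "fairly", "quite", "significantly", "clearly"}

_WORD = re.compile(r"[A-Za-z']+")


def lint_line(line: str) -> list[str]:
    """Report immediately repeated words and weasel words in one line of text."""
    words = [m.group(0).lower() for m in _WORD.finditer(line)]
    findings = []
    # A duplicate is the same word appearing twice in a row, ignoring case.
    for prev, cur in zip(words, words[1:]):
        if prev == cur:
            findings.append(f"duplicate word: {cur}")
    for word in words:
        if word in WEASEL_WORDS:
            findings.append(f"weasel word: {word}")
    return findings


findings = lint_line("The the model is very accurate.")
```

This simple version ignores punctuation between words, so a repeat across a sentence boundary would also be flagged; a production lint would likely need to handle that case.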
Bug Fixes (15+)
- Fix baselines dict-to-list crash in BenchmarkAgent
- Fix Gymnasium environment versions (v4 → v5)
- Fix experiment condition drift in iterative refinement (anchor to exp_plan.yaml)
- Fix compute budget constraint for experiment design
- Fix metric direction mismatch, citation verification batching
- Fix LaTeX output sanitization, figure plan format handling
- Add RL stability guidance (gradient clipping, NaN guard)
- And more; see the full commit message for details
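The two RL stability techniques named above, gradient clipping and a NaN guard, can be illustrated without any ML framework. This is a pure-Python sketch over flat lists of floats; the function names and the plain gradient-descent update are assumptions for the example and do not reflect how the guidance is worded or applied in the release.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def safe_step(params: list[float], grads: list[float],
              lr: float = 0.1, max_norm: float = 1.0) -> tuple[list[float], bool]:
    """Apply one gradient-descent step, skipping it if any gradient is NaN or infinite.

    Returns the (possibly unchanged) parameters and whether the update was applied.
    """
    if not all(math.isfinite(g) for g in grads):
        return list(params), False  # NaN guard: leave parameters untouched
    clipped = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)], True


clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
params, applied = safe_step([1.0, 2.0], [float("nan"), 0.5])
```

In a PyTorch training loop the same idea is usually expressed with `torch.nn.utils.clip_grad_norm_` plus a finiteness check on the loss before calling the optimizer.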
Compatibility
All changes are backward-compatible with v0.1.0 configuration files.
Full Changelog: v0.1.0...v0.2.0