Summary
This release includes significant performance improvements, bug fixes, and architectural refactoring.
Key Improvements:
- Enabled TMA autotuning and MMA matmul tuning for better performance
- Refactored ONNX-IR to an op/node-centric architecture, defining node outputs as a function of the operation
Bug Fixes:
- Fixed autodiff graph cleanup issues (multiple fixes for deferred/consumed nodes)
- Fixed Linear layer panic when output size is one
- Fixed PyTorch pickle reader regression with integer dict keys
- Fixed RoPE sum_dim calculation
- Fixed tensor *_like dtype preservation
- Fixed squeeze check for D2 > 0
- Fixed QLinear implementation
- Fixed async barrier & TMA checks
New Features:
- Added matvec operation
- Added support for custom learning strategies
- Added Candle device seeding
- Added Shape::ravel_index for row-major raveling
- Generalized linalg::outer semantics with new linalg::outer_dim
- Implemented error handling for DataError
- Added square() optimization where appropriate
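Among the new features, `Shape::ravel_index` converts a multi-dimensional coordinate into a flat row-major offset. The sketch below illustrates the underlying row-major raveling computation in plain Rust; the function name and signature here are illustrative and do not claim to match Burn's exact API.

```rust
// Illustrative sketch of row-major index raveling, the concept behind
// the new Shape::ravel_index helper. Not Burn's exact signature.
fn ravel_index(shape: &[usize], index: &[usize]) -> usize {
    assert_eq!(shape.len(), index.len());
    let mut flat = 0;
    // Row-major: the last dimension varies fastest, so the running
    // offset is scaled by each dimension size before adding the
    // coordinate along that dimension.
    for (dim_size, idx) in shape.iter().zip(index) {
        debug_assert!(idx < dim_size, "index out of bounds");
        flat = flat * dim_size + idx;
    }
    flat
}

fn main() {
    // In a [2, 3, 4] shape, coordinate [1, 2, 3] maps to
    // 1 * (3 * 4) + 2 * 4 + 3 = 23.
    assert_eq!(ravel_index(&[2, 3, 4], &[1, 2, 3]), 23);
}
```

Row-major order matches the default memory layout of dense tensors, so the flat offset produced this way indexes directly into contiguous storage.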