halide/Halide v21.0.0 on GitHub

Release highlights

We have deliberately skipped version 20.0.0 to align with the LLVM version we are now using. Note that LLVM 21.1.1 or higher is required as LLVM 21.1.0 has a major bug in the NVPTX backend.

Major changes

The rfactor scheduling directive was rewritten and enhanced. It is now compatible with autoschedulers.
The Mullapudi2016 autoscheduler now supports experimental GPU scheduling.
The Python bindings have been substantially improved, with many missing bindings filled in.
HL_DEBUG_CODEGEN gained a new filtering mode. Debug levels can now be set on a per-file/per-function basis.
Support was added for AMD Zen5 and the iOS Simulator.
The strict_float feature has been reimplemented and should be much more reliable.
Many bug fixes, performance improvements, and build system improvements. We spent a lot of time fixing issues with our testing infrastructure and look forward to offering a more stable contribution experience going forward.

Deprecations

LLVM 19 and below are no longer supported, in keeping with our support policy.
Halide_BUNDLE_STATIC will be removed in the next release. If you are using it, please migrate to the shared library instead.
Support for Python 3.8 has been dropped.

Changelog

Scheduling

The rfactor scheduling directive was rewritten and enhanced.
- Rewrite the rfactor scheduling directive by @alexreinking in #8490
- Dequalify names when constructing RVars in rfactor by @alexreinking in #8560
- Add promise_clamped in rfactor by @alexreinking in #8608
- Add rfactor patterns for NaN-propagating min/max by @alexreinking in #8587
The Mullapudi2016 autoscheduler now supports experimental GPU scheduling.
- GPU autoscheduling with Mullapudi2016: the reference implementation by @antonysigma in #7787
- Mullapudi2016-GPU: Reorder to avoid for-loops to be sandwiched between gpu_blocks. by @antonysigma in #8647
- Enable experimental Mullapudi2016 GPU scheduler for test-bench by @antonysigma in #8650
- Highlight Metal GPU code in stmt_html by @antonysigma in #8659
- Always ensure gpu_threads count >= warp size of 32 by @antonysigma in #8656
Fix incorrect natural vector size on Zen4 by @abadams in #8570
Make it an error to use a device extern stage without target support by @abadams in #8794
Add support for adding tuple outputs in the configure() method by @abadams in #8649
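
For readers unfamiliar with rfactor: it factors an associative reduction into an intermediate stage that computes partial results over part of the reduction domain and a final stage that merges them. The pure-Python sketch below illustrates only that idea, including a NaN-propagating min of the kind the new patterns target. It does not use the Halide API, and every name in it (`nan_min`, `reduce_serial`, `reduce_factored`) is hypothetical.

```python
import math
from functools import reduce


def nan_min(a, b):
    """min that returns NaN if either input is NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)


def reduce_serial(rows, op, identity):
    """Reduce a 2D list in a single serial pass over every element."""
    acc = identity
    for row in rows:
        for v in row:
            acc = op(acc, v)
    return acc


def reduce_factored(rows, op, identity):
    """Same reduction, factored: one partial result per row, then a merge.

    The per-row partials are independent of each other, which is what
    makes the factored form a candidate for parallel or vector execution.
    """
    partials = [reduce(op, row, identity) for row in rows]
    return reduce(op, partials, identity)


rows = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]
serial_sum = reduce_serial(rows, lambda a, b: a + b, 0.0)
factored_sum = reduce_factored(rows, lambda a, b: a + b, 0.0)

rows_with_nan = [[3.0, math.nan], [1.0, 5.0]]
serial_min = reduce_serial(rows_with_nan, nan_min, math.inf)
factored_min = reduce_factored(rows_with_nan, nan_min, math.inf)
```

Both forms agree here because the operators are associative. Floating-point addition is only approximately associative in general, so the sample values were chosen to sum exactly.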

Python

Fix argument order in rpow by @alexreinking in #8677
Drop support for Python 3.8 by @alexreinking in #8678
Fix segfault in RDom's operator<< by @alexreinking in #8679
Use ruff to format and lint Python code by @alexreinking in #8684
Get raw Runtime::Buffer from Buffer in Python rather than use PyBuffer by @alexreinking in #8682
Bind in-place update operators (e.g. +=) in Python by @alexreinking in #8683
Clean up Python dependencies; document uv usage by @alexreinking in #8694
Fix several printing segfaults. by @alexreinking in #8700
Add Python bindings for serialization by @alexreinking in #8718
Add all remaining IROperator ops to Python bindings by @alexreinking in #8771
Fix up memoize; bind to Python by @alexreinking in #8778
Fix invalid Python type annotation and return types (#8772) by @rtzam in #8773
Expose Runtime::Buffer::cropped to C++ and Python Buffer by @rtzam in #8787
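
The rpow fix above concerns Python's reflected power operator. As a reminder of the semantics involved, `2 ** x` calls `x.__rpow__(2)` when the left operand does not know how to handle `x`, so inside `__rpow__` the argument is the base and `self` is the exponent. The sketch below uses a hypothetical `Sym` class to show this; it is not the Halide binding code.

```python
class Sym:
    """Hypothetical expression-like wrapper; not part of Halide."""

    def __init__(self, text):
        self.text = text

    def __pow__(self, other):
        # self ** other: self is the base, other is the exponent.
        return Sym(f"pow({_text(self)}, {_text(other)})")

    def __rpow__(self, other):
        # other ** self: other is the base, self is the exponent.
        return Sym(f"pow({_text(other)}, {_text(self)})")


def _text(value):
    """Render a Sym by its text and anything else by its repr."""
    return value.text if isinstance(value, Sym) else repr(value)


x = Sym("x")
forward = (x ** 2).text
reflected = (2 ** x).text
```

Swapping the two arguments in `__rpow__` would silently turn `2 ** x` into `x ** 2`, which is the class of mistake an argument-order fix addresses.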

Debugging

New feature flag to allow for stack backtrace/unwind by @mcourteaux in #8703
Add filtering capabilities to HL_DEBUG_CODEGEN by @alexreinking in #8627
Adding worker_thread_idle() for more informative profiling by @slomp in #8719
Color IR output in cout and cerr. by @mcourteaux in #8635
Improve output format for lowering passes timing. by @mcourteaux in #8749
fix(stmt-html): Fix embedded Buffer processing performance issue. by @mcourteaux in #8748
Use AArch64 assembly syntax on macOS with LLVM<22 by @alexreinking in #8710

CodeGen

Mark our PTX kernels as kernels, to stop them from being stripped by @abadams in #8571
Math functions renaming table for GPU backends to support vectorized evaluation of math functions. by @mcourteaux in #8595
Apply version constraints to iOS objects by @alexreinking in #8546
Redirect bitwise ops to logical ops in case the arguments are bool. by @mcourteaux in #8597
scalarize select condition for LLVM where possible by @abadams in #8575
Add missing addition simplifier rules by @abadams in #8630
Bounds and alignment analysis through bitwise ops by @abadams in #8574
Make the vld2 pattern more obviously profitable by @abadams in #8765
Fix vector shuffle for Vulkan CodeGen by @derek-gerstmann in #8621
Suppress warning on Windows for duplicate constant symbols. by @mcourteaux in #8555
Use lossless_cast for saturating casts from unsigned to signed on x86 by @abadams in #8527
AMD Zen5 support by @changhoon-sung in #8612

Compiler

Rework strict_float to use individual op intrinsics instead by @abadams in #8641
Don't cache mutations of Exprs that have only one reference to them by @abadams in #8518
Only use the nodes-visited set for nodes with multiple refs by @abadams in #8547
In graph_equal(), call the correct implementation for comparing equalities between statements and expressions by @BachiLi in #8611

Runtime

Support copying the overlapping region from one buffer to another. by @mcourteaux in #8463
Add (iOS) simulator target feature. by @alexreinking in #8623
Opt out of JIT exceptions by @abadams in #8615
Experimental: support removing unused runtime functions via the HL_RUNTIME_DROP_FUNCS environment variable.
- PoC feature: drop functions from the runtime by @mcourteaux in #8653
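
As a conceptual aid for the overlapping-region copy listed in this section: the region shared by two buffers can be described as the per-dimension intersection of their coordinate ranges. The sketch below assumes each buffer's shape is given as a list of hypothetical `(min, extent)` pairs. It is plain Python, not the Halide runtime, and says nothing about how the actual implementation works.

```python
def overlap(a, b):
    """Intersect two boxes given as lists of (min, extent) per dimension.

    Returns the shared box in the same form, or None if the boxes
    do not overlap in at least one dimension.
    """
    shared = []
    for (a_min, a_extent), (b_min, b_extent) in zip(a, b):
        lo = max(a_min, b_min)
        hi = min(a_min + a_extent, b_min + b_extent)  # exclusive upper bound
        if hi <= lo:
            return None
        shared.append((lo, hi - lo))
    return shared


src_box = [(0, 8), (0, 8)]
dst_box = [(4, 8), (-2, 6)]
shared_box = overlap(src_box, dst_box)
disjoint = overlap([(0, 4)], [(4, 4)])
```

Only coordinates inside `shared_box` exist in both boxes, so a copy restricted to that region never reads or writes out of bounds.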

Apps

The onnx app now builds with CMake:
- Add CMake for onnx app by @vawale in #8707
- Fix halide_as_onnx_backend_test by @alexreinking in #8784

Documentation

Add note about relative paths to readme by @abadams in #8613

Bugfixes

Fix #8534 [Buffer serialization does not match deserialization] by @abadams in #8535
Fix CUDA HTML code printing bug. by @mcourteaux in #8558
Fix halide_get_cpu_features() linkage to avoid name mangling issues by @derek-gerstmann in #8573
Fix for #8578 by @mcourteaux in #8579
Fix shuffle bug in CodeGen C. by @mcourteaux in #8567
Check if expression is defined before trying to compute its constant_integer_bounds by @vksnk in #8599
Drop invalid "in-bounds" GEP for constant offsets by @alexreinking in #8768
Record trace_loads directly on ImageParam. by @alexreinking in #8803
RewriteLoadsAs32Bit should use the mutated index by @rootjalex in #8581
Set any_strict_float for wrapper module if target has strict_flag feature by @vksnk in #8681
Fix wrong type of the bound by @vksnk in #8781
Fix UB-introducing rewrite in FindIntrinsics by @abadams in #8539
Fix rewrite that doesn't preserve type by @abadams in #8674
Fix nested select handling in remove_undef by @abadams in #8669
Add an underlying type to the halide_buffer_flags to prevent UB in C++ by @mcourteaux in #8690

Testing / CI

Limit depth more strictly in CSE fuzz test by @abadams in #8512
Skip fast exp/log/pow/sin/cosine tests without sse 4.1 by @abadams in #8541
Hopefully fix flaky mullapudi reorder test by @abadams in #8542
Skip test when code could be using x87 by @abadams in #8537
Fix stale GPU lifetime management tests for Vulkan. by @derek-gerstmann in #8601
Upgrade runner for cmake_cmake_file_lists job by @alexreinking in #8609
Buildbot fixes by @alexreinking in #8706
Fix the pip packaging workflow by @alexreinking in #8708
Fix complexity of bounds of nested pure intrinsics by @abadams in #8689
Skip two sub-tests on llvm 21.1 by @abadams in #8782
Speed up simd_op_check_wasm by @abadams in #8780
Reduce the beam size in the adams2019 apps test to avoid timeouts by @abadams in #8786
Workaround llvm slow compile time bug in Mullapudi overlap test by @abadams in #8793
Restore concurrent behavior to gpu_allocation_cache test by @abadams in #8792
Revert "Skip two sub-tests on llvm 21.1" by @abadams in #8806
Fix WASM splat op check test. by @mcourteaux in #8705

Build

Fix workflow for next release by @alexreinking in #8514
Fix Debian packaging by @alexreinking in #8524
Remove llvm version check from Makefile by @abadams in #8533
Drop deprecated / unsupported setups for Halide 20 by @alexreinking in #8508
Fix check for Windows never having aligned_alloc available. by @mcourteaux in #8551
Don't include CMAKE_INSTALL_PREFIX when LIBDIR is absolute by @alexreinking in #8552
Add target-nvptx to target-all in vcpkg.json by @alexreinking in #8562
Fix top of LLVM, and remove upper limit of LLVM version from CMakeLists. by @mcourteaux in #8568
build_halide_h asserts that every header it slurps in is one of the args by @abadams in #8559
Upgrade pybind11 to 2.11.1 by @alexreinking in #8616
Drop check for LLVM_LIBCXX in FindHalide_LLVM.cmake by @alexreinking in #8617
Fix finding LLD on Homebrew when multiple versions are installed. by @alexreinking in #8619
Fix build on GCC 15 (Comes with Fedora 42). by @mcourteaux in #8626
Constrain Clang and LLD searches to LLVM version by @alexreinking in #8634
Disallow empty CMAKE_BUILD_TYPE on single-config generators by @alexreinking in #8651
Add missing outputs to add_halide_library; fix advice in Lesson 21. by @alexreinking in #8660
Bump the LLVM version in the pip package to 20.1.8 by @alexreinking in #8698
Prefer to build against libjpeg-turbo and document this. by @alexreinking in #8775
Add C++17 requirement to RunGenMain CMake target by @alexreinking in #8795
Allow llvm-ar in BundleStatic.cmake by @alexreinking in #8799
Fix dubious find_package logic in test/generator by @alexreinking in #8804
Warning when extra-output is requested w/o filename by @FabianSchuetze in #8671
Makefile linker flag fixes and cleanups by @abadams in #8764

Ongoing maintenance

Fix clang-tidy-19 errors by @steven-johnson in #8509
Remove unused function in HexagonOptimize by @steven-johnson in #8511
Fix two non-idiomatic uses of node_type by @abadams in #8520
Handle some misc TODOs by @abadams in #8528
Use a consistent idiom for visit_let by @abadams in #8540
Upgrade to clang-format 19 by @alexreinking in #8543
Suppress clang-tidy warning for make_with_shape_of() by @steven-johnson in #8545
Remove debugging print left in by @abadams in #8572
Fixes for llvm trunk by @abadams in #8590
Another fix for llvm trunk by @abadams in #8591
Our internal error macros were redesigned:
- Accurately annotate Error system with [[noreturn]] by @alexreinking in #8564
- Teach compilers that internal_error does not return. by @alexreinking in #8807
- Use a new macro trick to avoid throwing in destructors. by @alexreinking in #8774
Move to opaque llvm pointers by @abadams in #8614
Avoid throwing from a destructor in PartitionLoops.cpp by @alexreinking in #8767
Bump version to 21.0.0 by @alexreinking in #8810
remove superfluous overload that causes compile errors by @ongjunjie in #8654
Remove obsolete WasmExecutor specific debug macro. by @zvookin in #8670
Add missing header by @vksnk in #8680
Attempted fix for LLVM change by @abadams in #8642
Fix top LLVM: renamed NVPTX barrier intrinsics. by @mcourteaux in #8631

New Contributors

@changhoon-sung made their first contribution in #8612
@vawale made their first contribution in #8707
@rtzam made their first contribution in #8773

Full Changelog: v19.0.0...v21.0.0