Release highlights
We have deliberately skipped version 20.0.0 to align with the LLVM version we are now using. Note that LLVM 21.1.1 or higher is required as LLVM 21.1.0 has a major bug in the NVPTX backend.
Major changes
- The rfactor scheduling directive was rewritten and enhanced. It is now compatible with autoschedulers.
- The Mullapudi2016 autoscheduler now supports experimental GPU scheduling.
- The Python bindings have been substantially improved, with many missing bindings filled in.
- HL_DEBUG_CODEGEN gained a new filtering mode. Debug levels can now be set on a per-file/per-function basis.
- Support was added for AMD Zen5 and the iOS Simulator.
- The strict_float feature has been reimplemented and should be much more reliable.
- Lots of bugfixes, performance improvements, and build system improvements. We spent a lot of time fixing issues with our testing infrastructure and are looking forward to implementing a more stable contribution experience going forward.
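For readers unfamiliar with the rfactor directive mentioned above, the idea can be sketched outside Halide: an associative reduction is factored into an intermediate stage that holds partial results and a final stage that merges them. The plain-Python sketch below is not Halide code and does not use the Halide API. The function names (serial_sum, rfactored_sum) and the lanes parameter are hypothetical and chosen only for illustration.

```python
def serial_sum(values):
    """Reference reduction: one accumulator, strictly sequential."""
    total = 0
    for v in values:
        total += v
    return total


def rfactored_sum(values, lanes):
    """Conceptual sketch of a factored reduction (not Halide's implementation).

    Stage 1 keeps one partial accumulator per lane; each lane is independent,
    so in a real schedule the lanes could be vectorized or run in parallel.
    Stage 2 merges the partial results into the final value.
    """
    # Stage 1: intermediate partial sums, indexed by the inner position.
    partial = [0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v

    # Stage 2: merge the partial sums.
    total = 0
    for p in partial:
        total += p
    return total


data = list(range(1, 101))
assert serial_sum(data) == rfactored_sum(data, 8)
```

Both versions agree exactly here because integer addition is associative. With floating-point inputs the regrouping can change rounding, which is why this kind of transformation requires the reduction operator to be associative.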
Deprecations
- LLVM 19 and below are no longer supported, in keeping with our support policy.
- Halide_BUNDLE_STATIC will be removed in the next release. If you are using it, please migrate to the shared library instead.
- Support for Python 3.8 has been dropped.
Changelog
Scheduling
- The rfactor scheduling directive was rewritten and enhanced.
  - Rewrite the rfactor scheduling directive by @alexreinking in #8490
  - Dequalify names when constructing RVars in rfactor by @alexreinking in #8560
  - Add promise_clamped in rfactor by @alexreinking in #8608
  - Add rfactor patterns for NaN-propagating min/max by @alexreinking in #8587
- The Mullapudi2016 autoscheduler now supports experimental GPU scheduling.
  - GPU autoscheduling with Mullapudi2016: the reference implementation by @antonysigma in #7787
  - Mullapudi2016-GPU: Reorder to avoid for-loops to be sandwiched between gpu_blocks. by @antonysigma in #8647
  - Enable experimental Mullapudi2016 GPU scheduler for test-bench by @antonysigma in #8650
  - Highlight Metal GPU code in stmt_html by @antonysigma in #8659
  - Always ensure gpu_threads count >= warp size of 32 by @antonysigma in #8656
- Fix incorrect natural vector size on Zen4 by @abadams in #8570
- Make it an error to use a device extern stage without target support by @abadams in #8794
- Add support for adding tuple outputs in the configure() method by @abadams in #8649
Python
- Fix argument order in rpow by @alexreinking in #8677
- Drop support for Python 3.8 by @alexreinking in #8678
- Fix segfault in RDom's operator<< by @alexreinking in #8679
- Use ruff to format and lint Python code by @alexreinking in #8684
- Get raw Runtime::Buffer from Buffer in Python rather than use PyBuffer by @alexreinking in #8682
- Bind in-place update operators (e.g. +=) in Python by @alexreinking in #8683
- Clean up Python dependencies; document uv usage by @alexreinking in #8694
- Fix several printing segfaults. by @alexreinking in #8700
- Add Python bindings for serialization by @alexreinking in #8718
- Add all remaining IROperator ops to Python bindings by @alexreinking in #8771
- Fix up memoize; bind to Python by @alexreinking in #8778
- Fix invalid Python type annotation and return types (#8772) by @rtzam in #8773
- Expose Runtime::Buffer::cropped to C++ and Python Buffer by @rtzam in #8787
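As background for the rpow fix listed above: Python calls the reflected method `__rpow__` on the right-hand operand when the left-hand operand cannot handle `**`, and passes the left operand as the argument. It is easy to swap base and exponent by mistake. The toy class below illustrates the correct order. Scalar and _unwrap are hypothetical names and are not part of the Halide bindings.

```python
def _unwrap(x):
    """Return the plain number inside a Scalar, or x itself."""
    return x.value if isinstance(x, Scalar) else x


class Scalar:
    """Hypothetical stand-in for an expression type; not Halide's Expr."""

    def __init__(self, value):
        self.value = value

    def __pow__(self, exponent):
        # Handles `self ** exponent`: self is the base.
        return Scalar(self.value ** _unwrap(exponent))

    def __rpow__(self, base):
        # Handles `base ** self`: Python passes the LEFT operand as `base`,
        # so self is the exponent here, not the base.
        return Scalar(_unwrap(base) ** self.value)


assert (Scalar(3) ** 2).value == 9  # 3 squared
assert (2 ** Scalar(3)).value == 8  # 2 cubed, not 3 squared
```

Choosing a non-commutative test case such as 2 and 3 matters: a check with equal operands would pass even with the arguments swapped.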
Debugging
- New feature flag to allow for stack backtrace/unwind by @mcourteaux in #8703
- Add filtering capabilities to HL_DEBUG_CODEGEN by @alexreinking in #8627
- Adding worker_thread_idle() for more informative profiling by @slomp in #8719
- Color IR output in cout and cerr. by @mcourteaux in #8635
- Improve output format for lowering passes timing. by @mcourteaux in #8749
- fix(stmt-html): Fix embedded Buffer processing performance issue. by @mcourteaux in #8748
- Use AArch64 assembly syntax on macOS with LLVM<22 by @alexreinking in #8710
CodeGen
- Mark our PTX kernels as kernels, to stop them from being stripped by @abadams in #8571
- Math functions renaming table for GPU backends to support vectorized evaluation of math functions. by @mcourteaux in #8595
- Apply version constraints to iOS objects by @alexreinking in #8546
- Redirect bitwise ops to logical ops in case the arguments are bool. by @mcourteaux in #8597
- scalarize select condition for LLVM where possible by @abadams in #8575
- Add missing addition simplifier rules by @abadams in #8630
- Bounds and alignment analysis through bitwise ops by @abadams in #8574
- Make the vld2 pattern more obviously profitable by @abadams in #8765
- Fix vector shuffle for Vulkan CodeGen by @derek-gerstmann in #8621
- Suppress warning on Windows for duplicate constant symbols. by @mcourteaux in #8555
- Use lossless_cast for saturating casts from unsigned to signed on x86 by @abadams in #8527
- AMD Zen5 support by @changhoon-sung in #8612
Compiler
- Rework strict_float to use individual op intrinsics instead by @abadams in #8641
- Don't cache mutations of Exprs that have only one reference to them by @abadams in #8518
- Only use the nodes-visited set for nodes with multiple refs by @abadams in #8547
- In graph_equal(), call the correct implementation for comparing equalities between statements and expressions by @BachiLi in #8611
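As background for the strict_float rework listed above: floating-point addition is not associative, so a compiler that regroups operations can change results. The plain-Python sketch below is not Halide code and says nothing about how strict_float is implemented. It only illustrates the kind of discrepancy that strict floating-point semantics are generally meant to rule out.

```python
def sum_left_to_right(a, b, c):
    """Evaluate (a + b) + c exactly as written."""
    return (a + b) + c


def sum_regrouped(a, b, c):
    """Evaluate a + (b + c), as a reassociating optimizer might."""
    return a + (b + c)


# Near 1e16, adjacent doubles are 2 apart, so adding 1.0 to -1e16 rounds
# back to -1e16 and the small term is lost.
a, b, c = 1e16, -1e16, 1.0

left = sum_left_to_right(a, b, c)   # (1e16 - 1e16) + 1.0
right = sum_regrouped(a, b, c)      # 1e16 + (-1e16 + 1.0)

assert left == 1.0
assert right == 0.0
```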
Runtime
- Support copying the overlapping region from one buffer to another. by @mcourteaux in #8463
- Add (iOS) simulator target feature. by @alexreinking in #8623
- Opt out of JIT exceptions by @abadams in #8615
- Experimental: support removing unused runtime functions via the HL_RUNTIME_DROP_FUNCS environment variable.
  - PoC feature: drop functions from the runtime by @mcourteaux in #8653
Apps
- The onnx app now builds with CMake:
  - Add CMake for onnx app by @vawale in #8707
  - Fix halide_as_onnx_backend_test by @alexreinking in #8784
Documentation
Bugfixes
- Fix #8534 [Buffer serialization does not match deserialization] by @abadams in #8535
- Fix CUDA HTML code printing bug. by @mcourteaux in #8558
- Fix halide_get_cpu_features() linkage to avoid name mangling issues by @derek-gerstmann in #8573
- Fix for #8578 by @mcourteaux in #8579
- Fix shuffle bug in CodeGen C. by @mcourteaux in #8567
- Check if expression is defined before trying to compute its constant_integer_bounds by @vksnk in #8599
- Drop invalid "in-bounds" GEP for constant offsets by @alexreinking in #8768
- Record trace_loads directly on ImageParam. by @alexreinking in #8803
- RewriteLoadsAs32Bit should use the mutated index by @rootjalex in #8581
- Set any_strict_float for wrapper module if target has strict_flag feature by @vksnk in #8681
- Fix wrong type of the bound by @vksnk in #8781
- Fix UB-introducing rewrite in FindIntrinsics by @abadams in #8539
- Fix rewrite that doesn't preserve type by @abadams in #8674
- Fix nested select handling in remove_undef by @abadams in #8669
- Add an underlying type to the halide_buffer_flags to prevent UB in C++ by @mcourteaux in #8690
Testing / CI
- Limit depth more strictly in CSE fuzz test by @abadams in #8512
- Skip fast exp/log/pow/sin/cosine tests without sse 4.1 by @abadams in #8541
- Hopefully fix flaky mullapudi reorder test by @abadams in #8542
- Skip test when code could be using x87 by @abadams in #8537
- Fix stale GPU lifetime management tests for Vulkan. by @derek-gerstmann in #8601
- Upgrade runner for cmake_cmake_file_lists job by @alexreinking in #8609
- Buildbot fixes by @alexreinking in #8706
- Fix the pip packaging workflow by @alexreinking in #8708
- Fix complexity of bounds of nested pure intrinsics by @abadams in #8689
- Skip two sub-tests on llvm 21.1 by @abadams in #8782
- Speed up simd_op_check_wasm by @abadams in #8780
- Reduce the beam size in the adams2019 apps test to avoid timeouts by @abadams in #8786
- Workaround llvm slow compile time bug in Mullapudi overlap test by @abadams in #8793
- Restore concurrent behavior to gpu_allocation_cache test by @abadams in #8792
- Revert "Skip two sub-tests on llvm 21.1" by @abadams in #8806
- Fix WASM splat op check test. by @mcourteaux in #8705
Build
- Fix workflow for next release by @alexreinking in #8514
- Fix Debian packaging by @alexreinking in #8524
- Remove llvm version check from Makefile by @abadams in #8533
- Drop deprecated / unsupported setups for Halide 20 by @alexreinking in #8508
- Fix check for Windows never having aligned_alloc available. by @mcourteaux in #8551
- Don't include CMAKE_INSTALL_PREFIX when LIBDIR is absolute by @alexreinking in #8552
- Add target-nvptx to target-all in vcpkg.json by @alexreinking in #8562
- Fix top of LLVM, and remove upper limit of LLVM version from CMakeLists. by @mcourteaux in #8568
- build_halide_h asserts that every header it slurps in is one of the args by @abadams in #8559
- Upgrade pybind11 to 2.11.1 by @alexreinking in #8616
- Drop check for LLVM_LIBCXX in FindHalide_LLVM.cmake by @alexreinking in #8617
- Fix finding LLD on Homebrew when multiple versions are installed. by @alexreinking in #8619
- Fix build on GCC 15 (Comes with Fedora 42). by @mcourteaux in #8626
- Constrain Clang and LLD searches to LLVM version by @alexreinking in #8634
- Disallow empty CMAKE_BUILD_TYPE on single-config generators by @alexreinking in #8651
- Add missing outputs to add_halide_library; fix advice in Lesson 21. by @alexreinking in #8660
- Bump the LLVM version in the pip package to 20.1.8 by @alexreinking in #8698
- Prefer to build against libjpeg-turbo and document this. by @alexreinking in #8775
- Add C++17 requirement to RunGenMain CMake target by @alexreinking in #8795
- Allow llvm-ar in BundleStatic.cmake by @alexreinking in #8799
- Fix dubious find_package logic in test/generator by @alexreinking in #8804
- Warning when extra-output is requested w/o filename by @FabianSchuetze in #8671
- Makefile linker flag fixes and cleanups by @abadams in #8764
Ongoing maintenance
- Fix clang-tidy-19 errors by @steven-johnson in #8509
- Remove unused function in HexagonOptimize by @steven-johnson in #8511
- Fix two non-idiomatic uses of node_type by @abadams in #8520
- Handle some misc TODOs by @abadams in #8528
- Use a consistent idiom for visit_let by @abadams in #8540
- Upgrade to clang-format 19 by @alexreinking in #8543
- Suppress clang-tidy warning for make_with_shape_of() by @steven-johnson in #8545
- Remove debugging print left in by @abadams in #8572
- Fixes for llvm trunk by @abadams in #8590
- Another fix for llvm trunk by @abadams in #8591
- Our internal error macros were redesigned:
  - Accurately annotate Error system with [[noreturn]] by @alexreinking in #8564
  - Teach compilers that internal_error does not return. by @alexreinking in #8807
  - Use a new macro trick to avoid throwing in destructors. by @alexreinking in #8774
- Move to opaque llvm pointers by @abadams in #8614
- Avoid throwing from a destructor in PartitionLoops.cpp by @alexreinking in #8767
- Bump version to 21.0.0 by @alexreinking in #8810
- remove superfluous overload that causes compile errors by @ongjunjie in #8654
- Remove obsolete WasmExecutor specific debug macro. by @zvookin in #8670
- Add missing header by @vksnk in #8680
- Attempted fix for LLVM change by @abadams in #8642
- Fix top LLVM: renamed NVPTX barrier intrinsics. by @mcourteaux in #8631
New Contributors
- @changhoon-sung made their first contribution in #8612
- @vawale made their first contribution in #8707
- @rtzam made their first contribution in #8773
Full Changelog: v19.0.0...v21.0.0