Changes Of Note
- `ParamMap` has been removed entirely from the public API. All users of `ParamMap` should migrate to `Callable` instead.
- `Halide::Parameter` has been moved to the public Halide API (it was formerly "internal" and not intended for public use).
- New scheduling primitives:
  - `Func::partition()` and friends: Set the loop partition policy, which controls how/whether a loop is split into three loops (prologue/steady-state/epilogue). Loop partitioning can be useful to optimize boundary conditions (e.g. clamp_edge).
  - `Func::hoist_storage()` and friends: Allows a function's storage to be moved to a given loop level. Unlike `Func::store_at()`, no optimizations are triggered (e.g. sliding window).
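To make the prologue/steady-state/epilogue split concrete, here is a minimal sketch in plain C++ (no Halide required). It is a hand-written model of the idea, not the code Halide generates: the function names, the 3-tap sum, and the `clamp_index` helper are all hypothetical, chosen only to show why splitting the loop removes boundary handling from the bulk of the iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: clamp an index into [0, w - 1], in the spirit of a
// clamp-to-edge boundary condition.
static int clamp_index(int x, int w) {
    return std::max(0, std::min(x, w - 1));
}

// Unpartitioned loop: every iteration pays for both clamps.
void blur3_unpartitioned(const std::vector<int> &in, std::vector<int> &out) {
    assert(in.size() == out.size());
    const int w = static_cast<int>(in.size());
    for (int x = 0; x < w; x++) {
        out[x] = in[clamp_index(x - 1, w)] + in[x] + in[clamp_index(x + 1, w)];
    }
}

// The same loop partitioned by hand into three loops. Only the prologue and
// epilogue keep the clamps; the steady state reads its neighbours directly.
void blur3_partitioned(const std::vector<int> &in, std::vector<int> &out) {
    assert(in.size() == out.size());
    const int w = static_cast<int>(in.size());
    const int steady_begin = std::min(1, w);
    const int steady_end = std::max(steady_begin, w - 1);

    for (int x = 0; x < steady_begin; x++) {  // prologue
        out[x] = in[clamp_index(x - 1, w)] + in[x] + in[clamp_index(x + 1, w)];
    }
    for (int x = steady_begin; x < steady_end; x++) {  // steady state
        out[x] = in[x - 1] + in[x] + in[x + 1];
    }
    for (int x = steady_end; x < w; x++) {  // epilogue
        out[x] = in[clamp_index(x - 1, w)] + in[x] + in[clamp_index(x + 1, w)];
    }
}
```

Both functions produce identical output for any input length. The partitioned version trades extra code for a clamp-free inner loop, which is the trade-off a loop partition policy lets a schedule control.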
- New `TailStrategy` options for existing scheduling directives:
  - `ShiftInwardsAndBlend`: Equivalent to ShiftInwards, but protects values that would be re-evaluated by loading the memory location that would be stored to, modifying only the elements not contained within the overlap, and then storing the blended result. Unlike ShiftInwards, this is valid to use in update definitions.
  - `RoundUpAndBlend`: Equivalent to RoundUp, but protects values that would be written beyond the end by loading the memory location that would be stored to, modifying only the elements within the region being computed, and then storing the blended result. Unlike RoundUp, this is valid to use on non-outermost splits in update definitions.
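The load/modify/store blending described above can be modelled in a few lines of plain C++. This is only an illustrative sketch under stated assumptions, not Halide's implementation: the update is taken to be `buf[x] += 1` over `[0, n)`, `kLanes` stands in for the split factor, and every function name is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

constexpr int kLanes = 4;  // hypothetical vector width (the split factor)

// Apply the update buf[x] += 1 to one kLanes-wide chunk starting at `base`,
// load-modify-store style. Only lanes with live_begin <= x < live_end change;
// the other lanes are written back with the values that were loaded.
static void update_chunk(std::vector<int> &buf, int base, int live_begin, int live_end) {
    std::array<int, kLanes> lanes{};
    for (int i = 0; i < kLanes; i++) lanes[i] = buf[base + i];
    for (int i = 0; i < kLanes; i++) {
        const int x = base + i;
        if (x >= live_begin && x < live_end) lanes[i] += 1;
    }
    for (int i = 0; i < kLanes; i++) buf[base + i] = lanes[i];
}

// ShiftInwards model: the last chunk is shifted back to end at n, so the
// overlapped elements are updated twice. Harmless for a pure definition,
// wrong for an update such as += 1.
void shift_inwards(std::vector<int> &buf, int n) {
    assert(n >= kLanes && static_cast<int>(buf.size()) >= n);
    for (int base = 0; base < n; base += kLanes) {
        const int b = std::min(base, n - kLanes);
        update_chunk(buf, b, b, b + kLanes);
    }
}

// ShiftInwardsAndBlend model: same shifted chunk, but only the lanes that
// were not covered by an earlier chunk are modified.
void shift_inwards_and_blend(std::vector<int> &buf, int n) {
    assert(n >= kLanes && static_cast<int>(buf.size()) >= n);
    for (int base = 0; base < n; base += kLanes) {
        const int b = std::min(base, n - kLanes);
        update_chunk(buf, b, base, b + kLanes);
    }
}

// RoundUpAndBlend model: the last chunk runs past n, but lanes at or beyond n
// are stored back unchanged. The memory past n must exist for this to be safe.
void round_up_and_blend(std::vector<int> &buf, int n) {
    const int rounded = (n + kLanes - 1) / kLanes * kLanes;
    assert(static_cast<int>(buf.size()) >= rounded);
    for (int base = 0; base < rounded; base += kLanes) {
        update_chunk(buf, base, base, std::min(base + kLanes, n));
    }
}
```

With `n = 10` and a zero-initialized buffer, the `shift_inwards` model leaves elements 6 and 7 at 2 because the shifted chunk revisits them, while both blending variants leave every element of `[0, 10)` at exactly 1 and do not disturb anything stored past the end.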
- Substantially improved performance and display in the VizIR output.
- Profiler improvements:
  - Substantially nicer text output
  - Injects timing into calls for `copy_to_host` and `copy_to_device` so you can measure host<->device copy overhead
  - Allows optional sorting via the `HL_PROFILER_SORT` env var
- Substantially faster codegen for several GPU backends.
- Experimental serialization/deserialization feature allows for saving of Halide IR code.
- Various bug fixes and improvements in the `Anderson2021` autoscheduler.
- Improved ARM codegen, including: better patterns for sdot/udot; improved shift/mul codegen.
- Support for Zen4 architecture in the x86 backend.
- Updates to the ONNX app.
- Various fixes and improvements to sliding-window and storage-folding.
- Improvements to slow gather operations for some x86 variants.
- Improvements to correctness for the `.async()` scheduling directive.
- Improved codegen for float16 conversion, especially on x86.
- Several compile-time warnings of dubious usefulness disabled.
- WebAssembly codegen now defaults to assuming that the saturating-float-to-int and sign-extension instruction sets are always available.
- `Target` now does some reality-checking that it doesn't contain obviously nonsensical `Feature` combinations.
What's Changed
- Misc changes and fixes to RISCV codegen
- Revise LLVM fix to work when no V8 or WABT available by @steven-johnson in #7635
- Be more careful about overflow in trim_bounds_using_alignment by @abadams in #7645
- Add a compositing example app by @abadams in #7646
- Get the ASAN toolchain working again by @steven-johnson in #7604
- Upgrade clang-format and clang-tidy to use v16 by @steven-johnson in #7660
- Enable the misc-use-anonymous-namespace clang-tidy check by @steven-johnson in #7661
- Enable clang-tidy's modernize-use-default-member-init check by @steven-johnson in #7662
- Update onnx app to Adams2019 autoscheduler and new autoscheduler API by @abadams in #7673
- Remove ParamMap by @steven-johnson in #7675
- Fix correctness_float16_t for ASAN builds by @steven-johnson in #7687
- Add a select overload for tuples by @abadams in #7672
- Add Sanitizer details to README_cmake.md by @steven-johnson in #7688
- Fix quadratic algorithm in simplify_correlated_differences by @abadams in #7686
- Fix float16 under asan, attempt #2 by @steven-johnson in #7691
- Add a warning if a Generator declares any Outputs before the final Input (Fixes #7669) by @steven-johnson in #7697
- Fixed the regularization for BGU. by @mcourteaux in #7684
- Fix clang and llvm versions in scripts by @TH3CHARLie in #7702
- Fix leaks caused by self-referential parameter constraints by @abadams in #7700
- Fix float16 warning for older clangs by @abadams in #7701
- Upgrade Halide main branch for LLVM18 by @steven-johnson in #7710
- Improved profiler result printing. by @mcourteaux in #7709
- Default WITH_TEST_FUZZ to OFF by @steven-johnson in #7695
- Throw an error if split is called with the same outer and inner var name by @TH3CHARLie in #7715
- Making HLSL code-gen a couple orders of magnitude faster... by @slomp in #7719
- Making Metal code-gen a bit faster by @slomp in #7720
- Fix handling of thread features for scalars in Anderson2021 by @aekul in #7726
- Change default generator timeout to infinite by @abadams in #7718
- Remove unused using decl by @abadams in #7730
- [Hexagon] - Fix problems in sim_host.cpp by @pranavb-ca in #7725
- Fix RDom usage in anderson2021_test_apps_autoscheduler (Fixes #7729) by @steven-johnson in #7734
- Fix leak on cloning functions with update defs by @abadams in #7735
- Ignore code in src/runtime/hexagon_remote/bin/src for clang-format by @steven-johnson in #7736
- Clean up really long line lengths in Anderson2021 by @steven-johnson in #7728
- Revise labels on autoscheduler tests by @steven-johnson in #7732
- Speedup the VizIR HTML. by @mcourteaux in #7713
- Run clang-tidy on macOS runners instead of Linux by @steven-johnson in #7746
- Fix infinite recursion in loop partitioning by @abadams in #7743
- Fix leaks in test/correctness/memoize.cpp by @abadams in #7705
- Allow optional sorting of profiler output via HL_PROFILER_SORT env var (Fixes #7638) by @steven-johnson in #7639
- Permit llvm 15 on windows by @abadams in #7744
- Revert accidental typo change in #7746 by @steven-johnson in #7747
- [vulkan] Fix heap buffer overflow in Vulkan extension handling discovered by ASAN by @derek-gerstmann in #7740
- [vulkan] Fix SPIR-V IR references causing leaks by @derek-gerstmann in #7739
- Improve error-handling in Anderson2021, and ensure build deps are cor… by @steven-johnson in #7748
- StmtViz: Search for tooltip only in the child node by @antonysigma in #7754
- Experimental serializer by @TH3CHARLie in #7594
- Define `cast<i32>(u32)` overflow behavior by @rootjalex in #7769
- Fix vector reduce HTML by @mcourteaux in #7773
- Remove fragile simd_op_check test for mlal/mlsl on ARM by @rootjalex in #7775
- Speedup page loading of VizStmt. by @mcourteaux in #7755
- Try to fix remaining ASAN-reported leaks by @steven-johnson in #7767
- Fix out of bounds access in anderson2021_test_apps_autoscheduler by @aekul in #7771
- Don't introduce reinterprets in find/lower intrinsics by @rootjalex in #7776
- [Hexagon] - Build Hexagon runtime components using the Hexagon SDK (Clone of #7671) by @pranavb-ca in #7741
- slice IRMatcher should only match on slices by @abadams in #7772
- Don't inject undef() in the simplifier by @abadams in #7791
- Fix for top-of-tree LLVM by @steven-johnson in #7798
- [ARM] Distribute shifts as muls by @rootjalex in #7790
- [ARM] support new udot/sdot patterns by @rootjalex in #7800
- Remove some unused includes by @abadams in #7799
- Add support to the makefile for serialization by @abadams in #7762
- [wasm] Enable PIC for WebAssembly on LLVM v18.x by @derek-gerstmann in #7803
- Update WebGPU to latest Emscripten/Dawn API by @steven-johnson in #7804
- Add jump-buttons to get from Stmt directly to Assembly by @mcourteaux in #7793
- Update clang-tidy action to stop breaking by @steven-johnson in #7808
- [serialization] Add serialization support to generator interface by @derek-gerstmann in #7792
- Ensure that multitarget AOT builds have consistent random sequence by @steven-johnson in #7717
- Move clang-tidy checks back to Linux by @steven-johnson in #7817
- Update 'Check CMake file lists' action by @steven-johnson in #7809
- Remove dead `auto-schedule` label in CMake by @steven-johnson in #7818
- Don't return an undefined Stmt() from IfThenElse visitor by @abadams in #7816
- Avoid generating name collisions in CSE by @abadams in #7821
- Add a check that PredicateLoads must be used in the outermost split of a dimension by @TH3CHARLie in #7788
- Enable emission of float16/32 casts on x86 by @abadams in #7837
- Iterate over lets in the correct order in VectorizeLoops by @vksnk in #7830
- Zen4 support by @abadams in #7840
- Update arguments in driver.cpp to match what correctness/simd_op_check has by @vksnk in #7842
- [tutorials] Add tutorial on JIT compile/execute performance by @derek-gerstmann in #7838
- [api] Promote Internal::Parameter to Halide::Parameter by @derek-gerstmann in #7829
- [Hexagon] - Fix 8-bit unsigned saturating downcasts for HVX (Fixes #7806) by @pranavb-ca in #7825
- Handle nested vectorization in store predicates by @abadams in #7864
- Respect input buffer constraints in root-level bounds inference exprs by @abadams in #7865
- Prevent use of uninitialized scalar Parameters in JIT code (#7847, partial) by @steven-johnson in #7853
- Handle unreachable code in bounds inference by @abadams in #7866
- [serialization] Add support to serialize to memory, and a basic serialization tutorial by @derek-gerstmann in #7760
- Don't deduce unreachability from predicated out of bounds stores by @abadams in #7874
- Validate for types when fusing Vars with RVars by @abadams in #7877
- Consider all dimensions before deciding to slide over a new dimension by @abadams in #7875
- Update onnx app to work with newer versions of protobuf by @abadams in #7879
- HTML Stmt IR with conceptual code and device code. by @mcourteaux in #7843
- Update README.md to include RISCV in llvm build instructions by @abadams in #7878
- Implement elementwise complex value division by @antonysigma in #7848
- Explicitly name the allocgroups on GPU schedules "allocgroup__..." by @mcourteaux in #7883
- Generate simpler LLVM IR for shuffles that recursively become broadcasts by @abadams in #7902
- Check for overflow in Type constructor by @abadams in #7889
- Mutating if branches in isolation can break reachability analysis by @abadams in #7895
- Disable warning for mismatched new/delete by @abadams in #7897
- Assignment is not associative by @abadams in #7894
- Don't lift loop vars outside of their loops in sliding window by @abadams in #7896
- Stop interleaver from expanding the scope of letstmts by @abadams in #7908
- Highlight groups for the HTML Stmt file and tooltips to reveal types. by @mcourteaux in #7887
- Static analysis (MSVC) fixes for device_buffer_utils.h by @slomp in #7904
- Check returned result in the test by @vksnk in #7911
- Fix read-after-write hazard analysis in storage folding by @abadams in #7910
- Turn off SLP vectorization for avx512 only by @abadams in #7918
- Scheduling directive to hoist the storage of the function by @vksnk in #7915
- Improve the error message if you store_at without a compute_at by @vksnk in #7923
- Loop Partitioning Policy through Stage::partition(VarOrRVar, LoopPartitionPolicy) by @mcourteaux in #7914
- Remove use of dynamic_cast. by @zvookin in #7931
- Add special build for testing serialization via a serialization roundtrip in JIT compilation and fix serialization leaks by @TH3CHARLie in #7763
- Add missing serialization of Dim::partition_policy by @TH3CHARLie in #7935
- Make sure all Halide arithmetic scalar types can be named from the Generator interface. by @zvookin in #7934
- Remove the deprecated API `llvm::Type::getInt8PtrTy` usage. by @hokein in #7937
- More targeted fix for gather instructions being slow on intel processors by @abadams in #7945
- Track likely values through lets in loop partitioning by @abadams in #7930
- Add missing condition to if renesting rule by @abadams in #7952
- Always call lower_round_to_nearest_ties_to_even on arm32 by @vksnk in #7957
- Improve code size and compile time for local laplacian app by @abadams in #7927
- [serialization] Serialize stub definitions of external parameters. by @derek-gerstmann in #7926
- [WebGPU] Update to latest native headers by @jrprice in #7932
- Return values from stub functions in Deserialization by @steven-johnson in #7963
- Make the fast inverse test throughput-limited rather than latency-limited by @abadams in #7958
- Attempt to fix nested vectorization gemm performance on new build bot by @abadams in #7959
- Update instructions to include generated schedules by @antonysigma in #7928
- [serialization] Add Halide version and serialization version in serialization format by @TH3CHARLie in #7905
- Handle many more intrinsics in Bounds.cpp by @steven-johnson in #7823
- Disallow async nestings that violate read after write dependencies by @abadams in #7868
- complete_x86_target() should enable F16C and FMA when AVX2 is present by @steven-johnson in #7971
- Add two new tail strategies for update definitions by @abadams in #7949
- Add appropriate mattrs for arm-32 extensions by @abadams in #7978
- Move canonical version numbers into source, not build system (#7980) by @steven-johnson in #7981
- Silence useless "Insufficient parallelism" autoscheduler warning by @steven-johnson in #7990
- Add a notebook with a visualization of the approx_* functions and their errors by @vksnk in #7974
- Make narrowing float->int casts on wasm go via wider ints by @abadams in #7973
- Fix handling of assert statements whose conditions get vectorized by @abadams in #7989
- Fix all "unscheduled update()" warnings in our code by @steven-johnson in #7991
- Silence useless 'Outer dim vectorization of var' warning in Mullapudi… by @steven-johnson in #7992
- Make wasm +sign-ext and +nontrapping-fptoint the default by @steven-johnson in #7995
- Teach unrolling to exploit conditions in enclosing ifs by @abadams in #7969
- Do some basic validation of Target Features (#7986) by @steven-johnson in #7987
- Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. by @mcourteaux in #7913
- bounds_of_nested_lanes assumed that one layer of nested vectorization could be removed at a time, but failed in situations with unusual nesting structures. by @abadams in #8039 and #8055
- We now track whether or not let expressions failed to solve in the solver; failure to do this meant we did unhelpful transformations in some cases, which led to exploding compile times. by @abadams in #7982
Full Changelog: v16.0.0...v17.0.0