Apache TVM v0.17.0

Introduction

The TVM community has worked since the v0.16.0 release to deliver the following new exciting improvements!

The main tags are below (bold text indicates areas with significant progress):

  • Community, RFCs
  • AOT, Hexagon, OpenCL & CLML, Web, Metal
  • Relax, Dlight, Disco
  • TIR, TVMScript
  • Docs, CI, Misc, BugFix

Please visit the full listing of commits for a complete view: v0.17.dev0...v0.17.0.rc0.

Community

  • #17018 - New committer: Balint Cristian

RFCs

This new RFC adds a frontend for NNEF (https://www.khronos.org/nnef), an open, standardized neural-network exchange format that the Khronos Group has developed since 2018. It is aimed at deploying trained neural networks from deep learning frameworks to the proprietary inference engines of neural network hardware vendors. A hedged usage sketch follows the RFC link below.

  • #108 - [RFC] Add NNEF frontend
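
As a rough illustration, importing an NNEF model should eventually look like TVM's other frontends. This is a hedged sketch only: the entry point relay.frontend.from_nnef, its signature, and the model path are assumptions, not confirmed API.

```python
import tvm
from tvm import relay

# Hypothetical NNEF graph directory; from_nnef and its return values
# are assumed to mirror existing frontends such as from_onnx.
mod, params = relay.frontend.from_nnef("model.nnef")

# Compile as usual once the module is imported.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```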

AOT

  • #17077 - Correctly calculate workspace for vector types

Adreno

  • #16927 - [Script] Fix build config for Adreno

BYOC

  • #16895 - Add layout check and update shape check for cublas FP8 BYOC

BugFix

  • #17138 - [Fix][TIR] Fix outdated call to create extern buffer in make_extern
  • #17132 - Restrict CopyOnWrite to _type_final
  • #17096 - Update FAttrsGetter to return Map<String, ObjectRef>
  • #17078 - [NCCL] Release NCCL thread_local resources in destructor
  • #17044 - [Support] Fix copy constructor for support::OrderedSet
  • #17000 - [MSC] Split name_string with index by colon from the right
  • #16923 - [Fix][Dlight] Fix GeneralReduction for log-sum-exp
  • #16924 - [Fix] Fix SSA conversion for SizeVar retention
  • #16903 - CudaDeviceAPI::GetAttr may check kExist when GPUs absent
  • #16901 - Fix ROCm shared memory issue on MI250

CI

  • #17055 - [SME][Test] Add additional conv2d tests for asymmetric parameters
  • #17007 - [TOPI][Testing] Enable conv2d NHWC fp16 topi testing for arm_cpu
  • #16930 - [UnitTest] Use pytest's scope='session' for tvm.testing.parameter
  • #16948 - Update image tag to 20240428-060115-0b09ed018
  • #16931 - Use LLVM17 for tests on ci_cpu
  • #16942 - Enable Conda setup v3
  • #16939 - Upgrade CUDA to 12.4

CRT

  • #17097 - [Bugfix] Return error code on error from ModuleGetFunction

Disco

  • #17035 - [QoL] Implement broadcast/scatter methods for Session
  • #16992 - [Bugfix] Handle NDArray larger than OS buffer for pipe
  • #16978 - Implement num_workers property for disco.Session (see the sketch after this list)
  • #16989 - Treat hangup of disco worker process as kShutdown
  • #16993 - Allow allocation that only exists on worker0
  • #16979 - Expose disco.Session.shutdown through the python API
  • #16919 - Improve error message for CallPacked
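
A minimal sketch of the new Session conveniences, assuming the ThreadedSession constructor and the broadcast, num_workers, and shutdown members match the PR titles above; treat the exact names and signatures as assumptions.

```python
import numpy as np
from tvm.runtime import disco as di

# Spin up two in-process workers (assumed constructor name).
sess = di.ThreadedSession(num_workers=2)
print(sess.num_workers)  # -> 2, the property added in #16978

# Broadcast host data to all workers (#17035); dst is a DRef on the workers.
x = np.arange(8, dtype="float32")
dst = sess.empty((8,), "float32")
sess.broadcast(x, dst)

# Explicit teardown, exposed to Python in #16979.
sess.shutdown()
```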

Dlight

  • #17082 - Use 16x32 spatial x reduction thread extents in GEMV scheduling
  • #17052 - Skip GEMV rules when more than one vector
  • #17026 - Perf improvement for low_batch_gemv on Metal
  • #17016 - Update Adreno GEMV Rules
  • #16972 - [GPU] Enhance opencl thread limit for schedules
  • #16973 - [GPU] Improved gemv outer fallback schedule
  • #16958 - Check for target in function attributes
  • #16894 - Enhance vectorization for gpu matmul
  • #16884 - Add check for matmul dtype and fix reduction rule

Docs

  • #17146 - [DOC] Fix the "Reley" → "Relay" typo in the sentence "We utilize the intermediate representation of nn.Graph to convert the OneFlow model to Relay."
  • #17015 - [DOC] Update Model Links to Include Commit

Frontend

  • #17014 - [ArgParse] Pass default values to target compiler (#13264)
  • #16961 - [Bugfix][ONNX] Improve broadcast and batch_matmul conversion
  • #16936 - [TFLite] Add support for GELU conversion

Hexagon

  • #17123 - Add support for v75

LLVM

  • #17046 - [Arith][SVE] Add rewrite rules for indices split by scalable expressions
  • #16966 - [SVE] Add support for representing and creating buffer-level predicates
  • #17001 - [SVE] Use only powers of two as possible vscale values
  • #16962 - [SVE] Add codegen support for vscale_range() function attribute
  • #16968 - Stringref API deprecation fixes
  • #16965 - [SVE] Add get_active_lane_mask builtin
  • #16899 - [SVE][TOPI] Add conv2d NHWC hybrid SVE schedule for arm_cpu
  • #16893 - [SVE] Check for SVE target in VectorizeLoop
  • #16862 - [SVE] Support splitting by vscale in tir::split and te::split (sketch below)
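
To make the last item concrete, here is a hedged sketch of a scalable split: the inner extent is a multiple of vscale, so the generated loop adapts to the SVE vector length. It assumes the Python tir.vscale() helper and an SVE-capable AArch64 target at build time.

```python
from tvm import te, tir

# Elementwise compute whose axis we split by a scalable factor.
A = te.placeholder((1024,), name="A", dtype="float32")
B = te.compute((1024,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
# Inner extent is 4 * vscale, i.e. 4 fp32 lanes per 128 bits of SVE width.
xo, xi = s[B].split(B.op.axis[0], factor=4 * tir.vscale())
s[B].vectorize(xi)
```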

MetaSchedule

  • #17012 - [BugFix] Fix MultiLevelTilingTensorCore generating an inconsistent thread-binding sketch for batched matmul
  • #17066 - [BugFix] Fix "TensorIntrin 'dot_4x4_i8i8s32_sdot' is not registered"

Metal

  • #17059 - Enable Debug Label
  • #17025 - Support metal device profiling

OpenCL & CLML

  • #16933 - [CLML] Fix in clml pattern check condition
  • #16929 - [VM][OPENCL] Take advantage of OpenCL host ptr for improved copy

ROCm

  • #17141 - [Backend] Fix error when building TVM with LLVM 19

Relax

  • #17139 - Fix cublas dispatch for corner cases
  • #17127 - [KVCache] Support fork in sliding window sink part
  • #17115 - Support input_axis_separator to allow 2D to 1D conversion
  • #17119 - [Bugfix] Set purity=false for LazySetOutput
  • #17118 - [VM] Improved error messages for mismatched parameter count
  • #17110 - Alloc BYOC workspace with R.builtin.alloc_tensor
  • #17089 - [ONNX] Add support for HardSigmoid
  • #17100 - [KVCache] Unlimited depth blocks
  • #17075 - [Transform] Modify FuseTIR pass to propagate buffer attributes
  • #17088 - [ONNX] Add support for HardSwish
  • #17085 - [PyTorch] Add support for torch.nn.Hardsigmoid
  • #17083 - [TVMScript] Preserve tir.SizeVar through TVMScript round-trip
  • #17086 - Ignore dynamic parameters in RewriteDataflowReshape
  • #17084 - [PyTorch] Add support for torch.nn.Hardswish
  • #17074 - [KVCache][Test] Fix TIR attn kernels for uncommon group size
  • #17067 - Add missing white spaces in error messages
  • #17061 - [Frontend][Onnx] Cast Op special handling for ShapeExpr input
  • #17033 - [Bugfix] Apply FuseOps to nested DataflowBlock
  • #17032 - [Bugfix] Annotate ComputePrimValue output as host function
  • #17034 - [Bugfix] Bind symbolic variables in R.match_cast (example after this list)
  • #16960 - [UnitTest] Validate IRModule with multiple targets
  • #16995 - [KVCache] Support KVCache decode from forked sequence and pop more tokens
  • #16959 - [Transform] Handle identical PrimFunc with distinct VDevice
  • #16589 - [Unity] Check for transpose and dynamic shape in AdjustMatmulOrder
  • #16988 - [KVCache] Fix the aux data syncing order of paged KV cache
  • #16922 - [BugFix] Change FuseOpsByPattern strategy to pattern-match the maximal subgraph
  • #16982 - [Unity][BYOC] Use arith.Analyzer to check batch equality of matmul in cublas
  • #16955 - Implement relax.op.view
  • #16971 - Support nested ModuleList in nn.Module
  • #16826 - Express dynamic arguments of strided_slice as arguments
  • #16476 - [Unity][Cutlass] Fix C source generation of dense operation
  • #16940 - Allow PrimValue as index in relax.op.take
  • #16934 - [TIR] Introduce new cumsum op for gpu
  • #16859 - [QoL] Use SeqExpr in IR types when SeqExpr is required
  • #16904 - Prevent generating duplicate funcs in dispatch_sort_scan
  • #16905 - [Bugfix] Raise exception for OOM allocation
  • #16827 - Handle binary operations between Tensor and PrimValue
  • #16902 - Allow specifying entry_funcs for BYOC
  • #16860 - [QoL] Infer StructInfo for relax::Tuple on construction
  • #16861 - [QoL] Return well-formed IR from relax::Function::CreateEmpty
  • #16886 - [Frontend] Fix sort, argsort and topk in nn module
  • #16883 - Stabilize relax pass mutation order
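
As one example from the list above, #17034 makes R.match_cast bind the symbolic variables that appear in its target shape. A minimal TVMScript sketch:

```python
from tvm.script import ir as I, relax as R, tir as T

@I.ir_module
class Module:
    @R.function
    def main(x: R.Tensor(ndim=1, dtype="float32")):
        # match_cast checks the shape at runtime and binds n to x's
        # length, so n is usable in later bindings.
        n = T.int64()
        y = R.match_cast(x, R.Tensor((n,), "float32"))
        return y
```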

Relay

  • #16983 - [BugFix] Skip leaf args when matching the 'path' part of a dominator pattern
  • #16996 - Make TupleGetItem inherit the previous span

Runtime

  • #17057 - Stateless interface of PagedKVCache leaf node commit
  • #17049 - Support PagedKVCache with tree attention
  • #17045 - Fix PagedKVCache for PopN and enhance tests
  • #16998 - Compatibility with dmlc::Stream API changes
  • #17037 - [ROCm] Enable ROCm host memory support
  • #17036 - Use preferred host memory (pinned memory) in KV cache
  • #16994 - Allow query of available device memory through DeviceAPI (sketch after this list)
  • #16997 - [Disco] Restore checks for hangup of disco pipe
  • #16938 - Allow offset to be specified in NDArray::CreateView
  • #16890 - [VULKAN] Support total_global_memory
  • #16880 - Implemented Datatype.itemsize()
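
A small sketch of the memory queries mentioned above: total_global_memory already exists on tvm.runtime.Device, while available_global_memory is the property name assumed here to surface #16994.

```python
import tvm

dev = tvm.cuda(0)
if dev.exist:
    # Total device memory in bytes.
    print("total:", dev.total_global_memory)
    # Currently available bytes; assumed property backing #16994.
    print("available:", dev.available_global_memory)
```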

TIR

  • #17134 - [Schedule] Remove @type_check for set_axis_separator
  • #17112 - [DLight] Enable SimdGroup op for Metal
  • #17098 - [RPC] Allow RPC calls to compiled PrimFuncs with no arguments
  • #17039 - Fix Bug in VectorizeLoop
  • #17030 - Fix Shuffle rewrite
  • #16947 - Support narrow dtype for let binding
  • #16952 - Enhance CLZ intrinsic support
  • #16945 - [Compute-at] Simplify the compute-at'ed block when the predicate can be merged
  • #16879 - Make T.reinterpret a no-op when the dtype is unchanged (sketch below)
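
Illustrating the last item: T.reinterpret bit-casts a value to a same-width dtype and, after #16879, folds to a no-op when the source and destination dtypes already match. A minimal PrimFunc sketch:

```python
from tvm.script import tir as T

@T.prim_func
def bitcast(A: T.Buffer((16,), "int32"), B: T.Buffer((16,), "float32")):
    for i in range(16):
        with T.block("bitcast"):
            vi = T.axis.remap("S", [i])
            # Reinterpret the int32 bit pattern as float32 (no conversion).
            B[vi] = T.reinterpret("float32", A[vi])
```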

TOPI

  • #17091 - Add dense schedule for fp16 and fp32 using gemm
  • #17048 - [SME] Add conv2d NHWC SME fp16->fp32 schedule
  • #17040 - Fix SME conv2d schedule import and intrin argument
  • #17003 - [SME] Add conv2d NHWC SME fp32 schedule
  • #16977 - Remove blockIdx.z in topi sort
  • #16951 - Revert unification of conv2d NHWC hybrid scheduling for arm_cpu targets

TVMScript

  • #17107 - Better Type Annotation for TIR OP
  • #16967 - Fix error reporting inside Macro func
  • #16916 - Support T.launch_thread with i64 dtype
  • #16876 - Optionally use ruff format instead of black
  • #16877 - [Bug] Add test case for missing symbolic bounds

CUDA & CUTLASS & TensorRT

  • #16980 - [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state

Web

  • #17031 - Fix string to uint8 array for special characters
  • #17028 - Add dtype and offset for CreateView in runtime
  • #16910 - Support string[] in setPackedFunc() and exceptionally long arrays

Misc

  • #17135 - [QoL][IR] Provide default constructor for NameSupply/GlobalVarSupply
  • #17125 - [Utils] Define line-length for "ruff format"
  • #17152 - GraphExecutor: Fix wild-pointer assignment when input and output are reshapes
  • #17150 - [WebGPU] Fall back to 256MB for maxBufferSize if needed
  • #17128 - [Compute-inline] Prefer T.where for reverse compute-inlined block with predicate
  • #16976 - [WebGPU] Implement tir.dp4a with WGSL built-in function dot4I8Packed
  • #17124 - [WebGPU] Add tir.dp4a
  • #17113 - [CudaGraph] Handle exceptions thrown while capturing cuda graph
  • #17094 - [Utility][Container] Support non-nullable types in Array::Map
  • #17101 - [RPC] Raise error if server process terminated
  • #17092 - [UnitTests] Use tvm.ir.assert_structural_equal whenever possible
  • #17054 - [SME] Utilize predication in fp32 matmul and conv2d schedules
  • #17079 - [CMake] Show NVCC include directories in compile_commands.json
  • #17076 - [SME] Extract gemm block correctly when fused with bias
  • #17071 - [WebGPU] Translate int8x4 into u32
  • #17065 - [FP8][Codegen] Add make_fp8 vector constructors
  • #17064 - Add docs of v0.15.0 and v0.16.0
  • #16985 - [CODEGEN] Vector-Codegen support for llvm-pure-intrin
  • #17058 - Introduce outer reduction for metal
  • #17051 - Use adapter.info when available instead of requestAdapterInfo
  • #16981 - [SME] Add scalable fp16->fp32 dense schedule
  • #17029 - [Contrib] Implement NDArray cache update
  • #17027 - [picojson] Let objects be ordered when serializing
  • #17021 - [WebGPU] Update error messages to be more user-friendly
  • #17010 - Support multinomial_from_uniform dispatch
  • #16999 - [USMP] add missing const specifier for global_const_workspace
  • #17005 - [WebGPU] Handle device OOM in createBuffer
  • #16921 - [SME] Introduce scalable fp32 dense schedule
  • #16957 - chore: remove repetitive words
  • #16909 - [QoL][IR] Provide std::hash and std::equal_to for IR Variable types
  • #16987 - [JVM] Automatic Compatibility of JVM AttachCurrentThread
  • #16974 - [CUBLAS][FP8] Enable R.matmul + R.multiply offloading
  • #16896 - [CUBLAS] Enable offloading of R.matmul + R.dequantize
  • #16956 - Add script for testing release package
  • #16908 - Override StructuralEqual() for easier usage
  • #16932 - Enable gemv schedule for adreno
  • #16935 - [3rdparty] Bump FlashInfer for sampling functions
  • #16937 - [Thrust] Increase static workspace size
  • #16915 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 2
  • #16741 - Restore "pytest.mark.gpu" for RELAX tests
  • #16914 - [CMAKE] Make LOG_BEFORE_THROW explicit
  • #16913 - Enhance Release Note Script and Remove Useless File
  • #16907 - [Upd] Fix lld search in ROCm
  • #16900 - [CMAKE] Misc improvement of Util
  • #16897 - [Target] Don't register AArch64 target tags without LLVM compiler support
  • #16892 - [CUBLAS] Set fp32 compute and scale dtypes in fp16 matmul
  • #16888 - [CUBLAS][FP8] Support e4m3 gemm in cuBLAS BYOC
  • #16887 - [Contrib] Enable fp16 for thrust sort
  • #16881 - [release][Don't Squash] Update version to 0.16.0 and 0.17.0.dev on main branch
