Introduction
The TVM community has worked since the v0.16.0 release to deliver the following new exciting improvements in this v0.17.0 release!
The main tags are below (bold text indicates areas with significant progress):
- Community, RFCs
- AOT, Hexagon, OpenCL & CLML, Web, Metal
- Relax, Dlight, Disco
- TIR, TVMScript
- Docs, CI, Misc, BugFix
Please visit the full listing of commits for a complete view: v0.17.dev0...v0.17.0.rc0.
Community
- #17018 - New committer: Balint Cristian
RFCs
- #108 - [RFC] Add NNEF frontend
This RFC adds a frontend for NNEF, an open, standardized format for neural network exchange developed by the Khronos Group since 2018 (https://www.khronos.org/nnef). NNEF is aimed at deploying trained neural networks from deep learning frameworks to the proprietary inference engines of neural network hardware vendors.
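To make the new frontend concrete, the sketch below builds a minimal model in NNEF's textual flat syntax, the kind of description the frontend ingests. The TVM entry point named in the comment is an assumption based on how existing frontends are named (e.g. `from_onnx`); the RFC does not pin down the exact API here.

```python
# An illustrative NNEF graph: one convolution over an external input.
# (Shapes and labels are arbitrary example values.)
nnef_graph = """\
version 1.0;

graph main( input ) -> ( output )
{
    input = external(shape = [1, 3, 224, 224]);
    filter = variable(shape = [16, 3, 3, 3], label = 'conv1/filter');
    output = conv(input, filter);
}
"""

# NNEF models are distributed as text files, so write the graph out.
with open("model.nnef", "w") as f:
    f.write(nnef_graph)

# Hypothetical import into TVM (function name assumed; requires a TVM
# build that includes the NNEF frontend from this RFC):
#   mod, params = tvm.relay.frontend.from_nnef("model.nnef")
print("wrote", len(nnef_graph), "bytes of NNEF")
```

The flat syntax shown here is the human-readable half of NNEF; the format also defines a binary tensor container for the trained weights referenced by `variable` labels.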
AOT
- #17077 - Correctly calculate workspace for vector types
Adreno
- #16927 - [SCRIPT] Fix in build config for Adreno
BYOC
- #16895 - Add layout check and update shape check for cublas FP8 BYOC
BugFix
- #17138 - [Fix][TIR] Fix outdated call to create extern buffer in make_extern
- #17132 - Restrict CopyOnWrite to _type_final
- #17096 - Update FAttrsGetter to return Map<String, ObjectRef>
- #17078 - [NCCL] Release NCCL thread_local resources in destructor
- #17044 - [Support] Fix copy constructor for support::OrderedSet
- #17000 - [MSC] split name_string with index by colon from the right
- #16923 - [Fix][Dlight] Fix GeneralReduction for log-sum-exp
- #16924 - [Fix] Fix SSA conversion for SizeVar retention
- #16903 - CudaDeviceAPI::GetAttr may check kExist when GPUs absent
- #16901 - Fix ROCm shared memory issue on MI250
CI
- #17055 - [SME][Test] Add additional conv2d tests for asymmetric parameters
- #17007 - [TOPI][Testing] Enable conv2d NHWC fp16 TOPI testing for `arm_cpu`
- #16930 - [UnitTest] Use pytest's scope='session' for tvm.testing.parameter
- #16948 - Update image tag to 20240428-060115-0b09ed018
- #16931 - Use LLVM17 for tests on `ci_cpu`
- #16942 - Enable Conda setup v3
- #16939 - Upgrade CUDA to 12.4
CRT
- #17097 - [Bugfix] Return error code on error from ModuleGetFunction
Disco
- #17035 - [QoL] Implement broadcast/scatter methods for Session
- #16992 - [Bugfix] Handle NDArray larger than OS buffer for pipe
- #16978 - Implement `num_workers` property for `disco.Session`
- #16989 - Treat hangup of disco worker process as kShutdown
- #16993 - Allow allocation that only exists on worker0
- #16979 - Expose disco.Session.shutdown through the python API
- #16919 - Improve error message for CallPacked
Dlight
- #17082 - Use 16x32 spatial x reduction thread extents in GEMV scheduling
- #17052 - Skip GEMV rules when more than one vector
- #17026 - Perf improvement for low_batch_gemv on Metal
- #17016 - Update Adreno GEMV Rules
- #16972 - [GPU] Enhance opencl thread limit for schedules
- #16973 - [GPU] Improved gemv outer fallback schedule
- #16958 - Check for target in function attributes
- #16894 - Enhance vectorization for gpu matmul
- #16884 - Add check for matmul dtype and fix reduction rule
Docs
- #17146 - [DOC] Fix typo for the "We utilize the intermediate representation of nn.Graph to convert the OneFlow model to Reley."
- #17015 - [DOC] Update Model Links to Include Commit
Frontend
- #17014 - [ArgParse] Pass default values to target compiler(#13264)
- #16961 - [Bugfix][ONNX] Improve broadcast and batch_matmul conversion
- #16936 - [TFLite] Add support for GELU conversion
Hexagon
- #17123 - Add support for v75
LLVM
- #17046 - [Arith][SVE] Add rewrite rules for indices split by scalable expressions
- #16966 - [SVE] Add support for representing and creating buffer-level predicates
- #17001 - [SVE] Use only powers of two as possible vscale values
- #16962 - [SVE] Add codegen support for `vscale_range()` function attribute
- #16968 - StringRef API deprecation fixes
- #16965 - [SVE] Add get_active_lane_mask builtin
- #16899 - [SVE][TOPI] Add conv2d NHWC hybrid SVE schedule for `arm_cpu`
- #16893 - [SVE] Check for SVE target in VectorizeLoop
- #16862 - [SVE] Support splitting by vscale in `tir::split` and `te::split`
MetaSchedule
- #17012 - [BugFix] MultiLevelTilingTensorCore generates inconsistent thread-binding sketch for batched matmul
- #17066 - [BugFix] Fix TensorIntrin `dot_4x4_i8i8s32_sdot` is not registered
Metal
OpenCL & CLML
- #16933 - [CLML] Fix in clml pattern check condition
- #16929 - [VM][OPENCL] Take advantage of OpenCL host ptr for improved copy
ROCm
- #17141 - [Backend] Fix error when building TVM with LLVM 19
Relax
- #17139 - Fix cublas dispatch for corner cases
- #17127 - [KVCache] Support fork in sliding window sink part
- #17115 - Support `input_axis_separator` to allow 2D to 1D conversion
- #17119 - [Bugfix] Set purity=false for LazySetOutput
- #17118 - [VM] Improved error messages for mismatched parameter count
- #17110 - Alloc BYOC workspace with R.builtin.alloc_tensor
- #17089 - [ONNX] Add support for HardSigmoid
- #17100 - [KVCache] Unlimited depth blocks
- #17075 - [Transform] Modify FuseTIR pass to propagate buffer attributes
- #17088 - [ONNX] Add support for HardSwish
- #17085 - [PyTorch] Add support for torch.nn.Hardsigmoid
- #17083 - [TVMScript] Preserve tir.SizeVar through TVMScript round-trip
- #17086 - Ignore dynamic parameters in RewriteDataflowReshape
- #17084 - [PyTorch] Add support for torch.nn.Hardswish
- #17074 - [KVCache][Test] Fix TIR attn kernels for uncommon group size
- #17067 - Add missing white spaces in error messages
- #17061 - [Frontend][Onnx] Cast Op special handling for ShapeExpr input
- #17033 - [Bugfix] Apply FuseOps to nested DataflowBlock
- #17032 - [Bugfix] Annotate ComputePrimValue output as host function
- #17034 - [Bugfix] Bind symbolic variables in R.match_cast
- #16960 - [UnitTest] Validate IRModule with multiple targets
- #16995 - [KVCache] Support KVCache decode from forked sequence and pop more tokens
- #16959 - [Transform] Handle identical PrimFunc with distinct VDevice
- #16589 - [Unity] Check for transpose and dynamic shape in AdjustMatmulOrder
- #16988 - [KVCache] Fix the aux data syncing order of paged KV cache
- #16922 - [BugFix] Change FuseOpsByPattern strategy to pattern-match maximal subgraph
- #16982 - [Unity][BYOC] Use arith.Analyzer to check batch equality of matmul in cublas
- #16955 - Implement relax.op.view
- #16971 - Support nested ModuleList in nn.Module
- #16826 - Express dynamic arguments of strided_slice as arguments
- #16476 - [Unity][Cutlass] Fix C source generation of dense operation
- #16940 - Allow PrimValue as index in relax.op.take
- #16934 - [TIR] Introduce new `cumsum` op for GPU
- #16859 - [QoL] Use SeqExpr in IR types when SeqExpr is required
- #16904 - Prevent to generate duplicate func in dispatch_sort_scan
- #16905 - [Bugfix] Raise exception for OOM allocation
- #16827 - Handle binary operations between Tensor and PrimValue
- #16902 - Allow specifying entry_funcs for BYOC
- #16860 - [QoL] Infer StructInfo for relax::Tuple on construction
- #16861 - [QoL] Return well-formed IR from relax::Function::CreateEmpty
- #16886 - [Frontend] Fix sort, argsort and topk in nn module
- #16883 - Stabilize relax pass mutation order
Relay
- #16983 - [BugFix] Skip leaf args when matching 'path' part for dominator pattern
- #16996 - Fix TupleGetItem to inherit the previous span
Runtime
- #17057 - Stateless interface of PagedKVCache leaf node commit
- #17049 - Support PagedKVCache with tree attention
- #17045 - Fix PagedKVCache for PopN and enhance tests
- #16998 - Compatibility with dmlc::Stream API changes
- #17037 - [ROCm] Enable ROCm host memory support
- #17036 - Use preferred host memory (pinned memory) in KV cache
- #16994 - Allow query of available device memory through DeviceAPI
- #16997 - [Disco] Restore checks for hangup of disco pipe
- #16938 - Allow offset to be specified in NDArray::CreateView
- #16890 - [VULKAN] Support total_global_memory
- #16880 - Implemented Datatype.itemsize()
TIR
- #17134 - [Schedule] Remove `@type_check` for `set_axis_separator`
- #17112 - [DLight] Enable SimdGroup op for Metal
- #17098 - [RPC] Allow RPC calls to compiled PrimFuncs with no arguments
- #17039 - Fix Bug in VectorizeLoop
- #17030 - Fix Shuffle rewrite
- #16947 - Support narrow dtype for let binding
- #16952 - Enhance CLZ intrinsic support
- #16945 - [Compute-at] Make compute-ated block simple when the predicate could be merged
- #16879 - Make T.reinterpret nop when dtype is the same
TOPI
- #17091 - Add dense schedule for fp16 and fp32 using gemm
- #17048 - [SME] Add conv2d NHWC SME fp16->fp32 schedule
- #17040 - Fix SME conv2d schedule import and intrin argument
- #17003 - [SME] Add conv2d NHWC SME fp32 schedule
- #16977 - Remove `blockIdx.z` in TOPI sort
- #16951 - Revert unification of conv2d NHWC hybrid scheduling for `arm_cpu` targets
TVMScript
- #17107 - Better Type Annotation for TIR OP
- #16967 - Fix error reporting inside Macro func
- #16916 - Support `T.launch_thread` with i64 dtype
- #16876 - Optionally use `ruff format` instead of `black`
CUDA & CUTLASS & TensorRT
- #16980 - [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state
Web
- #17031 - Fix string to uint8 array for special characters
- #17028 - Add dtype and offset for CreateView in runtime
- #16910 - Support string[] in setPackedFunc() and exceptionally long arrays
Misc
- #17135 - [QoL][IR] Provide default constructor for NameSupply/GlobalVarSupply
- #17125 - [Utils] Define line-length for "ruff format"
- #17152 - GraphExecutor: Fix wild pointer assign when input and output are reshape
- #17150 - [WebGPU] Fall back to 256MB for maxBufferSize if needed
- #17128 - [Compute-inline] Prefer T.where for reverse compute-inlined block with predicate
- #16976 - [WebGPU] Implement `tir.dp4a` with WGSL built-in function `dot4I8Packed`
- #17124 - [WebGPU] Add `tir.dp4a`
- #17113 - [CudaGraph] Handle exceptions thrown while capturing cuda graph
- #17094 - [Utility][Container] Support non-nullable types in Array::Map
- #17101 - [RPC] Raise error if server process terminated
- #17092 - [UnitTests] Use tvm.ir.assert_structural_equal whenever possible
- #17054 - [SME] Utilize predication in fp32 matmul and conv2d schedules
- #17079 - [CMake] Show NVCC include directories in compile_commands.json
- #17076 - [SME] Extract gemm block correctly when fused with bias
- #17071 - [WebGPU] Translate `int8x4` into `u32`
- #17065 - [FP8][Codegen] Add make_fp8 vector constructors
- #17064 - Add docs of v0.15.0 and v0.16.0
- #16985 - [CODEGEN] Vector-Codegen support for llvm-pure-intrin
- #17058 - Introduce outer reduction for metal
- #17051 - Use adapter.info when available instead of requestAdapterInfo
- #16981 - [SME] Add scalable fp16->fp32 dense schedule
- #17029 - [Contrib] Implement NDArray cache update
- #17027 - [picojson] Let objects be ordered when serializing
- #17021 - [WebGPU] Update error messages to be more user-friendly
- #17010 - Support multinomial_from_uniform dispatch
- #16999 - [USMP] add missing const specifier for global_const_workspace
- #17005 - [WebGPU] Handle device OOM in createBuffer
- #16921 - [SME] Introduce scalable fp32 dense schedule
- #16957 - chore: remove repetitive words
- #16909 - [QoL][IR] Provide std::hash and std::equal_to for IR Variable types
- #16987 - [JVM] Automatic Compatibility of JVM AttachCurrentThread
- #16974 - [CUBLAS][FP8] Enable R.matmul + R.multiply offloading
- #16896 - [CUBLAS] Enable offloading of R.matmul + R.dequantize
- #16956 - Add script for testing release package
- #16908 - Overriding the StructuralEqual() for easy usage
- #16932 - Enable gemv schedule for adreno
- #16935 - [3rdparty] Bump FlashInfer for sampling functions
- #16937 - [Thrust] Increase static workspace size
- #16915 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 2
- #16741 - Restore "pytest.mark.gpu" for RELAX tests
- #16914 - [CMAKE] Make LOG_BEFORE_THROW explicit
- #16913 - Enhance Release Note Script and Remove Useless File
- #16907 - [Upd] Fixed lld search in rocm
- #16900 - [CMAKE] Misc improvement of Util
- #16897 - [Target] Don't register AArch64 target tags without LLVM compiler support
- #16892 - [CUBLAS] Set fp32 compute and scale dtypes in fp16 matmul
- #16888 - [CUBLAS][FP8] Support e4m3 gemm in cuBLAS BYOC
- #16887 - [Contrib] Enable fp16 for thrust sort
- #16881 - [release][Dont Squash] Update version to 0.16.0 and 0.17.0.dev on main branch