Introduction
The TVM community has worked since the v0.15.0 release to deliver the following new exciting improvements! The highlights of this release are:
- Initial support of Relax, with dynamic shape and pipeline (see the sketch below)
- The Dlight module for optimizing LLM TIR workloads on GPU
- The Disco module for initial SPMD multi-GPU support
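To give a flavor of the Relax workflow, here is a minimal sketch, assuming a TVM build with Relax enabled; the module, shapes, and pipeline choices are illustrative rather than prescriptive:

```python
import numpy as np
import tvm
from tvm import relax
from tvm.script import ir as I, relax as R

@I.ir_module
class Module:
    @R.function
    def main(x: R.Tensor(("n", 16), "float32")) -> R.Tensor(("n", 16), "float32"):
        # "n" is a symbolic dimension, resolved when the function is called
        with R.dataflow():
            y = R.add(x, x)
            R.output(y)
        return y

# Lower through the default "zero" pipeline, build, and run on the Relax VM
mod = relax.get_pipeline("zero")(Module)
ex = relax.build(mod, target="llvm")
vm = relax.VirtualMachine(ex, tvm.cpu())
out = vm["main"](tvm.nd.array(np.ones((8, 16), "float32")))
```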
The main tags are below (bold text marks areas with lots of progress):
- Community, RFCs
- Adreno, ArmComputeLibrary, Metal, cuda & cutlass & tensorrt, microNPU, Runtime
- Relax, Dlight, Disco
- Arith, TIR, TVMScript
- Docs, CI, Misc, BugFix
Please visit the full listing of commits for a complete view: v0.16.dev0...v0.16.0.rc0.
Community
RFCs
This new RFC explores how TVM can be utilized to generate code for the SME ISA to achieve improved inference performance on supported Arm®-based hardware implementing the SME extension. A sketch of what such a target configuration might look like follows the link below.
- #107 - [RFC] Scalable Matrix Extension enablement
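For illustration only, the snippet below sketches targeting SME through TVM's existing LLVM target mechanism; the triple and feature flag are assumptions that depend on your LLVM toolchain, not a finalized interface:

```python
import tvm

# Hypothetical AArch64 target string: the SME feature is requested via
# LLVM's -mattr mechanism; exact flags depend on the LLVM version in use.
target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+sme")
```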
Arith
- #16735 - [Fixup] Require feature flag for tighter inequality bounds
- #16588 - Provide tighter ConstIntBounds for special cases
- #16704 - [Fix] Fix canonical simplification of LE
BYOC
- #16567 - Skip processed functions in FuseOpsByPattern and RunCodegen
BugFix
- #16766 - [Target] Added null check to fix segfault at ->defined() in cpu.cc DetectSystemTriple()
- #16739 - [Ansor] Fixing Ansor Gradient Bug
- #16820 - [Fix] PAPI docs
- #16793 - [Fix] fix for numpy 2.0 compatibility
- #16790 - [Fix] Fix build errors with VS2022
- #16780 - [Fix] Fix numpy dtype map
- #16773 - [Fix] Fix the purity flag of "vm.call_tir_dyn" and "kill" ops
- #16770 - [Hotfix] Revert driver API pass ordering that breaks MLC, mark failing test
- #16771 - [Fix] Remove redundant "remove_all_unused" in IPC memory lowering
- #16746 - [Fix][Builtin] Fix "GetQueryPosition" of PagedKVCache
- #16728 - [Fix] Introduce TVM_DEBUG_WITH_ABI_CHANGE to warn ABI changes in debug mode
- #16714 - [Fix] PagedKVCache fetching compute stream when copy stream is needed
- #16684 - [SLM] Produce well-formed Relax for nn.modules.KVCache
- #16659 - add the default value for DFT in ONNX frontend
- #16637 - [Transform] Preserve symbolic variables in FuseOps
- #16649 - [FFI] Add a missing default for datatype lanes
- #16492 - [Executor] fix debug_executor function debug_get_output
- #16598 - [Transform] Handle non-composite lambda functions in FuseOps
- #16565 - [Transform] Keep private non-primitive functions in FuseTIR
- #16518 - Use `x*x*x` instead of `pow(x, 3)`
- #16436 - Ensure that bf16 arrays are created as expected
- #16361 - Disable SingleEnvThreadVerifier
- #16289 - [AUTOTVM][FIX] Typo fixes and add a warning in the Droplet Search
CI
- #16837 - Disable flaky unit test
- #16765 - [AOT][Testing] Improve output mismatch information on test failure
- #16661 - add merge_with_main in unity
- #16611 - [AOT][Testing] Print output values on test failure
- #16546 - Disable testing that downloads from mxnet
- #16521 - Fix CI Script and Broken Tests
- #16502 - Support tvm-bot rerun for tvm-unity task
- #16435 - Update image tag to 20240126-070121-8ade9c30e
- #16420 - [WASM] Update emsdk and nodejs version
- #16384 - Remove NVIDIA_DISABLE_REQUIRE
- #16382 - In jenkins.cmd_utils.Sh.tee, check for failing subprocess
- #16366 - Upgrade sccache version to 0.7.*
- #16369 - Upgrade Unity ci images
- #16344 - Update docker images tag to 20240105-165030-51bdaec6
- #16340 - [Unity][UnitTest] Increase atol to resolve flaky CI failure
- #16337 - [Hexagon][UnitTest] Disable flaky quantization test
- #16336 - Upgrade cmake version to 3.24.0
Docker
- #16755 - [SME] Add Fixed Virtual Platform (FVP) and toolchain install
- #16348 - Upgrade pip in i386 container
Disco
- #16618 - [Disco] Propagate structlog configuration to disco workers
- #16639 - [Disco] Expose functions to query the per-worker device/rank
- #16617 - [Disco] Implement `Session.import_python_module` method
- #16715 - [Disco] Propagate structlog/logging config to workers
- #16845 - [Debug][Disco] Check if a PackedFunc exists before calling it
- #16817 - [Disco] Reduce Process/ThreadSession message queue reads and writes
- #16807 - [Disco] Support setting workers' CPU affinity
- #16375 - [Unity] Fix creation of disco ProcessSession
- #16821 - [Fix] Add TVM_DLL to Disco session
- #16752 - [Fix] Lazy import of "psutil" in disco process pool
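Taken together, these changes round out the Disco programming model. A minimal sketch, assuming a TVM build with Disco enabled (the worker count and imported module are illustrative):

```python
from tvm.runtime import disco as di

# Spawn two worker processes for SPMD execution
sess = di.ProcessSession(num_workers=2)

# Make a Python module importable on every worker (#16617)
sess.import_python_module("numpy")
```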
Dlight
- #16775 - [Fix][Dlight] (Low-batched-)GeMV on small spatial loops
- #16429 - [Unity][Dlight][Fix] Reduction rule support dyn-shape epilogue
- #16351 - [Unity] Add dlight.gpu.Fallback in DispatchSortScan, add argsort, topk, and cumprod
- #16338 - [Unity][DLight] Introduce Specific Rule for RMSNorm
- #16251 - [Unity][Dlight] Support dlight gemv rule on nested inner block
- #16878 - [Dlight] Enhance vectorization loading weight for gemv
- #16848 - [DLight] Fix a corner case for reduction rule
- #16701 - [Dlight] Add fallback for low batch gemv with outer reduction
- #16678 - [Dlight] LowBatchGemv rule only apply to function with spatial symbolic var
- #16665 - [Dlight] Skip GeMV when normalization fails
- #16579 - [Dlight] Scheduling Low batch GEMM using GEMV-like rule
- #16321 - [DLight] Skip rule if target is not suitable
- #16731 - [Dlight] Fix GeMV shared memory estimation (an application sketch follows this list)
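For context, Dlight rules are applied to an IRModule as a pass. A minimal sketch, assuming `mod` holds LLM-style TIR workloads and a CUDA target is available:

```python
import tvm
from tvm import dlight as dl

with tvm.target.Target("cuda"):
    mod = dl.ApplyDefaultSchedule(
        dl.gpu.GEMV(),      # specialized rule for GeMV-shaped workloads
        dl.gpu.Fallback(),  # generic fallback for everything else
    )(mod)
```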
Docs
- #16792 - [Doc] Fix set_axis_separator example
- #16610 - [Doc] Fixed Docstring usage example in `tvm.ir.make_node`
- #16572 - [Doc] Remove MxNet related tutorials
- #16514 - [Unity][Doc] Document passes that depend on `DataflowBlock`s and encourage using `ConvertToDataflow`
- #16482 - [Doc] Fix Docstring in `extern.py` for Sphinx
- #16346 - [Doc] Fix minor error in "Expressions in Relay"
Frontend
- #16001 - [ONNX] Fix interpreting auto_pad parameters in ConvTranspose operator
- #16651 - [PaddlePaddle] PaddlePaddle model with NCHW data format that supports quantization
- #16616 - [PaddlePaddle] Support conv2d when data_format is NHWC
- #16526 - [Keras] Enable Dense operator for any input dims
- #16478 - [PaddlePaddle] Fixed the bug that prevented the model from being successfully converted to microTVM on MacOS
Hexagon
- #16762 - [VM] Cache operations when bypass mode is enabled
- #16706 - [VM] Add buffers to `dma_wait` builtin
- #16448 - [VM] Implement dma_copy and dma_wait builtin for hexagon
LLVM
- #16782 - [SVE] Support scalable vectors in LoopVectorizer
- #16812 - Fix compilation failure due to minor change
- #16808 - [Runtime] Fix errors during loading of target tags
- #16748 - Lack of DWARF type is not an error
- #16696 - [SVE] Add codegen support for scalable buffer accesses
- #15964 - [RUNTIME] Add optional LLVM ORCJIT runtime executor
- #16612 - [SVE] Add support for scalable data type strings
- #16523 - [SVE] Change the dtype of Ramp and Broadcast lanes to PrimExpr
- #16484 - [SVE] Add vscale builtin
- #16373 - Update Host.h path
MetaSchedule
- #16725 - Make the `opt_level` of `tune_relay()` adjustable
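As a hedged sketch of the new knob (the exact `tune_relay` signature and the `opt_level` keyword spelling are assumptions based on the PR title):

```python
from tvm import meta_schedule as ms

# Sketch only: #16725 makes the opt_level used while tuning adjustable.
database = ms.relay_integration.tune_relay(
    mod=mod,            # a tvm.IRModule imported from Relay (assumed given)
    params=params,      # model parameters (assumed given)
    target="llvm -num-cores 8",
    work_dir="./tune_logs",
    max_trials_global=64,
    opt_level=2,        # assumed keyword added by #16725
)
```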
Metal
- #16713 - [RUNTIME] Provide richer runtime when error happens
- #16605 - [RUNTIME] Fix multithreading access of metal runtime
- #16438 - Dispatch numerically stable tanh for metal
OpenCL & CLML
- #16854 - [OpenCL] Add OpenCL device for automatic target detection
- #16846 - [Meta-Schedule][OpenCL] Enable MS tuning for Android OpenCL
- #16768 - [RUNTIME][OPENCL] Bugfix for clImage create with host ptr
- #16672 - [CLML] Fix build TVM with CLML on MacOS
- #16328 - [RUNTIME][CLML] Fix for Softmax op for 4D tensors
- #16394 - [OpenCL][CMake] Fix OpenCL tests compilation
Relax
- #16872 - Enhance symbolic expr estimation in memory planning
- #16867 - Dispatch sort/scan for non-cuda gpu backends
- #16852 - Fix EliminateCommonSubexpr removing alloc tensor
- #16851 - [Relax,Topi] Allow passing workspace to thrust to avoid allocations
- #16841 - Provide well-formed output in `transform.LazyGetInput`
- #16798 - [Transform] Provide callback versions of LazyTransformParams
- #16801 - Allow DeadCodeElimination within ApplyPassToFunction
- #16834 - Capture symbolic vars in struct info of weights
- #16830 - Share storage allocs among functions after cuda graph rewriting
- #16823 - [VM] Refactor CUDA graph builtins as VM extension
- #16828 - [Bugfix] Provide the full Expr to pattern-match rewriter
- #16805 - [Bugfix] BlockBuilder may not assume unique input functions
- #16815 - Enable capturing symbolic shapes in cuda graph
- #16642 - Allow R.Prim('bool') in relax::If and assert_op
- #16796 - Unit-test for structural equal of recursive function
- #16732 - Allow composition of DFPattern replacements
- #16783 - Improve CanonicalizeBindings in DataflowVar edge case
- #16721 - Implement operators to inspect DLTensor::strides and offset
- #16730 - Refactor PatternRewriter into separate Block/Expr mutators
- #16756 - [IR] Improve highlighting in assert_structural_equal
- #16779 - Improve error message for malformed programs
- #16569 - [Unity][Parser] Check well-formedness in the parser
- #16759 - [Pass] Lowering passes for GPU IPC memory and allreduce
- #16697 - Implement relax.transform.TopologicalSort
- #16658 - Normalize use of void-type variable to inline R.tuple()
- #16711 - [Frontend] Add op `tanh`, `exp`, `negative`, and `permute`
- #16703 - [Fix] Fix top-p/top-k sampling kernel
- #16669 - [Frontend][Onnx] add sum and globalavgpool 1d/3d op
- #16691 - CUDA graph rewrite treating StringImm as static
- #16685 - Implement StructInfoPattern for dataflow pattern matching
- #16681 - [Frontend][Onnx] support MaxPool1/2/3D and AveragePool1/2/3D
- #16584 - [Unity][TIR] Clear struct info when specializing PrimFunc
- #16676 - Remove the legalization of cumsum/cumprod
- #16654 - [Frontend][NN] Add support for Conv3D
- #16674 - Eager free original weights in transform_params
- #16675 - add sample_indices in sampling
- #16648 - [Runtime] Support Unpack API for NDArrayCache
- #16591 - [Unity][Transform] Handle dynamic shapes in CombineParallelMatmul
- #16594 - [Transform] Preserve param names in LiftTransformParams
- #16575 - [Unity] GPU sampling
- #16574 - Additional unit tests for RemoveUnusedParameters
- #16585 - [Unity][Analysis] Include impure call in VerifyWellFormed errors
- #16421 - [Unity][Transform] Raise error in FuseOpsByPattern for SSA violation
- #16629 - Fix error message in BlockBuilder
- #16592 - Handle dynamic arguments in legalization of nn.attention
- #16590 - [Unity][Transform] Check for permute_dims in ExpandMatmulOfSum
- #16604 - [Frontend][Onnx] fix clip unsqueeze opset implement
- #16568 - [Runtime] RNNState for State Space Models
- #16563 - Implement operators to read runtime DLTensor* information
- #16581 - [Unity][MSC][M4.2][Step2] Enable plugin with manager, test plugins in compile pipeline
- #16600 - Expose name_hint field for BlockBuilder.match_cast
- #16601 - [Transform] Canonicalize `let var = R.const` bindings
- #16583 - [Unity][VM] Recursively visit match bindings in VMShapeLowerMutator
- #16586 - Ignore non-relax functions in relax.transform.RunCodegen
- #16573 - [VM] Re-implementation of callback functions
- #16561 - [Bugfix] Remove call to tvm.build for empty TIR module
- #16564 - [Unity] Check for symbolic vars in PrimValue in when lowering to TIR
- #16558 - Minor updates for NN frontend
- #16542 - Support callback as argument
- #16487 - [Unity][Transform] Handle `call_tir_inplace` in `FuseTIR` and `FuseOps`
- #16355 - [Unity] Infer struct info for relax.op.split on dynamic-sized index
- #16465 - [Redo][Unity] Split DecomposeOpsForTraining into two steps
- #16495 - [Unity][MSC][M4.2][Step1] Enable plugin with manager, test plugins in compile pipeline
- #16498 - [Frontend] "tensor_ir_inplace" op
- #16500 - [Unity] Support storage reuse for dynamic shapes
- #16493 - [Pass] Skip data type node for CSE pass
- #16467 - [Unity][MSC][Refactor] Reconstruct BYOC and runner
- #16422 - [Unity][CodeGen] RunCodegen based on externally-exposed functions
- #16483 - [Unity][Frontend] Add Sigmoid and Square Op
- #16472 - [Unity] Improved error message in tvm::relax::UpdateStructInfo
- #16473 - [Unity] Improve error message in tensor_to_shape struct inference
- #16466 - Memory planning for "partially dynamic" shapes
- #16464 - NDArray Cache Update with DLTensor Support
- #16315 - [Unity][Transform] Implement relax.transform.ReorderTakeAfterMatmul
- #16313 - [Unity][Transform] Implement relax.transform.ExpandMatmulOfSum
- #16411 - [Unity][Transform] Handle symbolic variables in LambdaLift
- #16443 - [Unity][FIX] fix thread dtype mismatch
- #16442 - Revert "[Unity] Split DecomposeOpsForTraining into two steps"
- #16437 - [Unity] Improve buffer allocation for handling duplicated buffer names.
- #16439 - [Unity] Support cumsum with pure int32
- #16432 - [Unity] downgrade cmake version requirement
- #16427 - [Unity][Frontend][NN] Better support for dynamic convolutions
- #16418 - [Unity][Fix] Fix mismatched intrinsic name
- #16129 - [Unity][Transform] Replace eligible operators with in-place versions in dataflow blocks
- #16414 - [Bugfix][Unity] Recover MSVC/NVCC/ROCm/Vulkan
- #15954 - [Unity] Split DecomposeOpsForTraining into two steps
- #16111 - [Unity][Transform] Memory planning for dynamic-shape func return
- #16396 - [Unity] PagedKVCache supporting on-the-fly RoPE calculation
- #16395 - [Frontend][ONNX] fix onnx frontend parse
- #16385 - [Unity][Op] Add Conv3D Operator
- #16284 - [Unity][nnModule] Dynamic shape support in nn Module
- #16378 - [Unity][BlockBuilder] Restore bb.get()
- #16374 - [Unity] Support TIR kernel for PagedKVCache
- #16314 - [Unity][Transform] Implement relax.transform.AdjustMatmulOrder
- #16349 - [Unity][MSC] Avoid depending on trivial bindings in Relax intermediate
- #16376 - [Unity][Contrib] Fix a bug due to typo in vllm `reconstruct_from_cache` kernel and add test
- #16388 - [Unity] Update dispatch test cases following the merge from main
- #16335 - [Unity] Set CMAKE_CUDA_ARCHITECTURES default to native
- #16306 - [Unity][Transform] Update LambdaLift to use name of lifted lambda
- #16310 - [Unity][Analysis] Show objects instead of names in WellFormedChecker
- #16362 - [Unity][Fix] Memory planning check value type of 'tir_var_upper_bound'
- #16367 - [Unity][Transform] Handle replacement at both var binding and usage
- #16309 - [Unity][Transform] Use parameter name in BundleModelParams
- #16307 - [Unity] Improved error message in ExprMutator::ReEmitBinding
- #16308 - [Unity] Improved error message for matmul shape mismatch
- #16360 - [Unity] Enhance Torch-consistency in reshape
- #16350 - [Unity][Contrib] Add vLLM paged attention kernel
- #16303 - [Unity][NN] Use Linear name for nn.op.permute_dims
- #16325 - [Unity][MSC][Legalize] legalize codes and mute logging
- #16312 - [Unity][Analysis] Add utility for collecting compile-time bindings
- #16330 - [Unity][WEBGPU] Enable wasm exception propagation
- #16304 - [Unity][Analysis] Handle PrimStructInfo in EraseToWellDefined
- #16305 - [Unity][Transform] Implement UpdateParamStructInfo
- #16331 - [Unity] Alter op impl handling empty transform for output
- #16254 - [Unity] Dispatch cumsum and sort
- #16120 - [Unity][Transform] Extract partial-tuple-usage from FuseTIR
- #16311 - [Unity] Validate struct info in relax::Call constructor
- #16333 - [Unity] Fix nn.op.tensor_ir_op signature
- #16302 - [Unity] Cutlass kernel compatibility with cmake 3.18+
Relay
- #16622 - [ONNX] Fix the attribute mode parse of operator Upsample
- #16626 - [ONNX] Fix the Resize operator in ONNX frontend
- #16624 - [ONNX] fix the wrong default value about dtype in Multinomial converter
- #16417 - [Frontend][Torch] fix pytorch frontend linspace op
- #16400 - [Frontend][Torch] fix pytorch frontend not support logical or
- #16390 - [Frontend][Torch] fix a typo mistake in nonzero_numpy
- #16324 - make "ToScalar" support directly obtaining "int64_t"
Runtime
- #16804 - Introduce MSCCLPP with NCCL equivalent interface
- #16809 - Add "TVM_DLL" to NVTX header
- #16750 - CUDA IPC Memory support and custom allreduce kernels
- #16738 - [Refactor] Always specify device in allocator interface
- #16716 - Ensure NDArray.CopyTo(Device) always sync
- #16705 - Add TVM_DLL to memory manager functions
- #16692 - PagedKVCache execute data copy on a separate stream
- #16647 - [RPC] Fix FreeObject in minrpc server
- #16667 - [Builtin] Using float32 accumulation in attention kernel
- #16635 - [RPC] Enable RPCObjectRef over multi-hop RPC
- #16630 - Add TVM_DLL to threading backend funcs
- #16541 - Add "TVM_DLL" to NDArray cache load func
- #16550 - [ROCM] Properly align rocm parameter buffer
- #16545 - Fix dtype conversion for bf16 and fp8
- #16508 - ParallelFor skipping thread backend for unit extent
- #16486 - KV cache providing workspace for attn kernel
- #16456 - [KVCache] AttentionWithFusedQKV and RoPE mode
- #16415 - [Memory] Implement support for non-zero offset within a storage object in AllocNDArr…
- #16387 - [RPC] Enable RPCObjectRef return in RPC
- #16377 - Use cudaGetDeviceCount to check if device exists
TIR
- #16832 - Use constructor for new PrimFunc in TransformLayout
- #16543 - Fix segfaults from ordering of Let/Assert in MakePackedAPI
- #16795 - Ramp and Broadcast lanes fixed to int32 dtype
- #16767 - [Driver] Use `BindTarget` to specify target for FP8 legalization
- #16742 - [Bugfix] Fix cache_read update buffer region
- #16726 - [Bugfix] Avoid overwrite of unmanaged buffer allocations
- #16548 - [CUDA] Add native FP8 support to codegen
- #16723 - Implement max/min_value for fp8 data types
- #16655 - Improve well-formed check's handling of match buffer
- #16673 - Support Vector Reinterpret Calls
- #16682 - [Bugfix] Handle AttrStmt of upcoming tir.Var in ConvertSSA
- #16560 - Enhance and fix tensorize schedule for some cases
- #16660 - [Bugfix] Fix duplicate AllocateConst in CacheReadWrite schedule primitive
- #16544 - Expand debug symbol output for CodeGenLLVM
- #16553 - Fix get_block_access_region for let bindings
- #16515 - Require exactly same-dtype matching for Vulkan smem reuse
- #16406 - Fix of inter thread reduction with shared memory prefetch
- #16293 - Extend DP4A tensor intrin
- #16345 - Allow sync threads inside condition
- #16250 - In SplitHostDevice, check for variables in thread extents
- #16184 - [Transform] Implement InlinePrivateFunctions
TOPI
- #16652 - improve inclusive_scan for thrust
- #16383 - [Target] Add fp16 SIMD support for conv2d on `arm_cpu` targets
TVMC
- #16261 - Add tvmc flag to print ir before and print ir after named pass
TVMScript
- #16864 - Add parser and printer support for e4m3/e5m2 fp8
- #16844 - Produce empty DictAttrs when R.func_attrs is absent
- #16811 - Do not throw error for duplicate definitions
- #16641 - Allow use of relax.Expr with void type as a statement
- #16663 - Infer T.reads() for DeclBuffer nodes
- #16640 - Represent tir::builtin::ret() using python "return"
- #16562 - [Bugfix] Handle R.match_cast as last binding in if/else
- #16593 - [Unity] Parse R.Object return type from call_pure_packed
- #16356 - [Unity] Optionally hide StructInfo that can be inferred
- #16379 - [Unity] Update `call_packed` semantics to support empty sinfo_args
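With #16864, fp8 dtypes now round-trip through the TVMScript parser and printer. A minimal sketch, assuming the dtype spellings follow TVM's "e4m3_float8"/"e5m2_float8" strings (whether it compiles further depends on the target's fp8 support):

```python
from tvm.script import tir as T

@T.prim_func
def cast_fp8(A: T.Buffer((128,), "e4m3_float8"), B: T.Buffer((128,), "float16")):
    # Upcast each fp8 element to float16
    for i in range(128):
        with T.block("cast"):
            vi = T.axis.spatial(128, i)
            B[vi] = T.Cast("float16", A[vi])
```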
Vulkan
- #16858 - Fix CLZ support for Vulkan
cuda & cutlass & tensorrt
- #16865 - [Codegen, CUDA] Add handling of fp8 broadcast / const
- #16818 - [Cutlass] Fix usage of cuda stream for group gemm
- #16788 - [Cutlass] Add check for group gemm param shapes
- #16789 - [Bugfix][Cutlass] Remove a typo in cutlass build
- #16787 - [Codegen, Cuda] Add overload for fp8x4 e5m2 <-> half4 conversion
- #16751 - [Cutlass] Add group gemm kernels
- #16736 - [Target][CUDA] Allow non-numeric arch as needed for latest gpu
- #16619 - [Bugfix][Cutlass] Check if function attributes is None
- #16342 - [CUDA] Simple extend to optimize reuse for static shared memory.
microNPU
- #16266 - [microNPU][ETHOSU] Add fixed point for tanh
- #16680 - [microNPU][ETHOSU] Fix LUT size for int16 activations
- #16401 - [microNPU][ETHOSU] Add fixed point for matmul
web
- #16733 - Support web IndexedDB cache for larger model storage
- #16810 - Support building tvm/web on Windows
- #16825 - Allow custom bc files in emcc making
- #16791 - Add `kv_state` and `rnn_state` to wasm_runtime
- #16722 - Implement linear congruential generator, make runtime seedable
- #16650 - Separate parallel shard download and iterative shard loading
- #16694 - Initial support for asyncify
- #16631 - Fix NDArrayCache loading report callback
- #16525 - Move ArtifactCache to Interface, Support Cache delete and Batch Delete, Remove typo
- #16554 - Compatibility with PagedKVCache in WebGPU
- #16527 - Revert "[Unity] Temp disable wasm exception (#16444)"
- #16504 - [Relax] Add ApplyPresenceAndFrequencyPenalty
- #16485 - [wasm] Enlarge initial memory for emcc
- #16444 - [Unity] Temp disable wasm exception
Misc
- #16873 - [Thrust] Fix thrust workspace allocation
- #16868 - [3rdparty] Bump flashinfer
- #16871 - [PageKV] allow PopN to pop all the tokens in last block
- #16866 - [3rdparty] Bump FlashInfer
- #16863 - [Picojson] Let the key of objects in json be ordered by default
- #16856 - [Thrust] Use pointer to tls pool to prevent creating new pool
- #16850 - Fixing probability comment
- #16849 - [KVCache] Initialize one extra page than specified
- #16843 - [IR] Provide well-formed intermediate in ApplyPassToFunction
- #16772 - [MSC][M5.3] Support torch.dynamo for dynamic models
- #16839 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/cmsisnn
- #16838 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/ethosu
- #16831 - [KVCache] Reducing CacheAuxDataManager copy size
- #16794 - [SME] Target parser support for SME
- #16824 - [KVCache] Introducing auxiliary data manager
- #16800 - [BugTIR] fix error merging shared memory for ptx_cp_async
- #16822 - [VM] Recycle VMFrame
- #16813 - [KVCache] Support forking sequence at specific position
- #16786 - [Codegen] Add check to disable invalid reinterpret
- #16816 - [Cmake] Allow using custom CCCL path for thrust
- #16784 - [SLM] Add unit tests for SLM to Relax exporter
- #16814 - Fix includes of custom allreduce kernel
- #16806 - [Debug] Improve error message in VMShapeLower
- #16802 - [Debug] Improve error messages in LiftTransformParams
- #16425 - [Target] Use LLVM target parser for determining Arm(R) A-Profile Architecture features
- #16797 - [3rdparty] AUTO mode for custom all-reduce strategy
- #16761 - [SME] Add support for inserting processor state annotations
- #16778 - [Analysis] Allow calls to GlobalVar in @R.function
- #16745 - [IR] Default to empty attributes, instead of NULL
- #16777 - Revert "[SLM] Allow modules to define pre-processing of weights"
- #16776 - [Contrib] Remove thrust "built but not used" warning
- #16757 - [SLM] Allow modules to define pre-processing of weights
- #16763 - [CONTRIB] Add nm symbol dump
- #16717 - Enable Shared Function in LiftTransformParam Pass
- #16729 - [Builtin] Sliding window and sink support for PagedKVCache
- #16724 - Fix cpp_rtvm cmake build on Windows
- #16513 - [Target] Automatically detect system triple when not specified by the user
- #16710 - [CMake] Add "USE_FLASHINFER" to libinfo
- #16702 - [MSC][M5.2] Enable quantize && prune with gym by wrapper
- #16699 - [Transform] Remove R.Object parameters after LazyTransformParams
- #16668 - [MSC][M5.1] Build wrapper to support compression
- #16693 - [Contrib] Support NDArray cache taking generator
- #16412 - [Lint] Add check to prevent usage of #include
- #16689 - [DeviceAPI] Support "GetCurrentStream"
- #16690 - Use target name instead of node name as function name
- #16683 - [skip ci] Fix wasm exception flag
- #16609 - Minor update docs instructions
- #16656 - Simplify Windows CMake Command
- #16666 - [KVCache] Fix the reference counter in sequence fork
- #16662 - Fixing workload comment
- #16595 - [Transform] Check for zero-param operators in LiftTransformParams
- #16599 - [Transform] De-duplicate MatchCast nodes in EliminateCommonSubexpr
- #16596 - [Transform] Implement relax.transform.ReorderPermuteDimsAfterConcat
- #16597 - [Transform] Allow explicit name of bundled model parameters
- #16602 - [Transform] Improvements to LazyTransformParams
- #16606 - [KVCache] Support passing in attn_score_scaling_factor into KV cache
- #16608 - Extend gpu memory bandwidth test to work through RPC
- #16587 - [Debug] Improve error message for codegen pattern mismatches
- #16570 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 1
- #16576 - Update the 3rdparty/libflash_attn submodule
- #16580 - [KVCache] Support mode "None" for Rotary Embedding
- #16578 - [KVCache] Support returning query positions
- #16571 - Fix compile warnings
- #16540 - [Upd] Enable lld search to include /opt/rocm/llvm/bin for rocm
- #16539 - Improve error message in NDArray::CopyFromTo
- #16524 - [Build] Improving debug and build-dir options
- #16551 - [KVCache] Fix attention kernel for ROCm
- #16512 - Cut pytest-lazy-fixture
- #16506 - Bump 3rdparty/cutlass_fpA_intB_gemm version
- #16511 - [Minor] Fix Clang compilation warning in fuse_tir.cc and codegen_c_host.cc
- #16516 - Add Relax, Unity Tags in make_notes.py
- #16497 - [Instrument] Add default instrument to print all passes
- #16494 - [DPL] Support tir_vars field in is_call_tir pattern
- #16453 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm
- #16454 - [BugTIR] fix thread_sync occurs in letstmt
- #16468 - [LINT] Fix pylint issues in test_dma_builtin.py
- #16413 - [Contrib] Workspace for cuBLAS backend
- #16460 - [Cherry-pick][MSC][M4.1] Add plugin && plugin_builder, enable build and test in different frameworks (#16397)
- #16461 - [Minor] Fix Docstring for sphinx-build
- #16431 - [Schedule] Loop-Partition Scheduling Primitive
- #16451 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/ethosu
- #16452 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/cmsisnn
- #16445 - [skip ci] update branch rule to prepare for unity transition
- #16426 - [CMake] Enable cuda lang if USE_CUDA is on
- #16407 - Add NVIDIA Hopper H100 target tag
- #16398 - [DeviceAPI] Support querying total global memory
- #16357 - [RPC] Fix tuning on macOS and Windows (#15771)
- #16386 - [Thrust] Use no sync exec policy and caching allocator
- #16343 - [CMake][MSVC] Disable permissive mode for MSVC builds
- #16242 - [Codegen] Fix if_then_else codegen
- #16341 - [CMake] Use ccache as CMAKE_CUDA_COMPILER_LAUNCHER
- #16332 - Change metal dtype of ceil_log2 to fp32