Apache TVM v0.17.0

Introduction

The TVM community has worked since the v0.16.0 release to deliver the following new exciting improvements!

The main tags are below (bold text indicates areas with significant progress):

  • Community, RFCs
  • AOT, Hexagon, OpenCL & CLML, Web, Metal
  • Relax, Dlight, Disco
  • TIR, TVMScript
  • Docs, CI, Misc, BugFix

Please visit the full listing of commits for a complete view: v0.17.dev0...v0.17.0.rc0.

Community

  • #17018 - New committer: Balint Cristian

RFCs

This new RFC adds a frontend for NNEF (https://www.khronos.org/nnef), an open, standardized neural-network exchange format that the Khronos Group has developed since 2018. It is aimed at deploying trained neural networks from deep learning frameworks to the proprietary inference engines of neural network hardware vendors. A hedged usage sketch follows the RFC link below.

  • #108 - [RFC] Add NNEF frontend
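
As a rough illustration, importing an NNEF model should eventually look like TVM's other frontends. This is a hedged sketch only: the entry point relay.frontend.from_nnef, its signature, and the model path are assumptions, not confirmed API.

```python
import tvm
from tvm import relay

# Hypothetical NNEF graph directory; from_nnef and its return values
# are assumed to mirror existing frontends such as from_onnx.
mod, params = relay.frontend.from_nnef("model.nnef")

# Compile as usual once the module is imported.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```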

AOT

  • #17077 - Correctly calculate workspace for vector types

Adreno

  • #16927 - [Script] Fix build config for Adreno

BYOC

  • #16895 - Add layout check and update shape check for cublas FP8 BYOC

BugFix

  • #17138 - [Fix][TIR] Fix outdated call to create extern buffer in make_extern
  • #17132 - Restrict CopyOnWrite to _type_final
  • #17096 - Update FAttrsGetter to return Map<String, ObjectRef>
  • #17078 - [NCCL] Release NCCL thread_local resources in destructor
  • #17044 - [Support] Fix copy constructor for support::OrderedSet
  • #17000 - [MSC] Split name_string with index by colon from the right
  • #16923 - [Fix][Dlight] Fix GeneralReduction for log-sum-exp
  • #16924 - [Fix] Fix SSA conversion for SizeVar retention
  • #16903 - CudaDeviceAPI::GetAttr may check kExist when GPUs absent
  • #16901 - Fix ROCm shared memory issue on MI250

CI

  • #17055 - [SME][Test] Add additional conv2d tests for asymmetric parameters
  • #17007 - [TOPI][Testing] Enable conv2d NHWC fp16 topi testing for arm_cpu
  • #16930 - [UnitTest] Use pytest's scope='session' for tvm.testing.parameter
  • #16948 - Update image tag to 20240428-060115-0b09ed018
  • #16931 - Use LLVM17 for tests on ci_cpu
  • #16942 - Enable Conda setup v3
  • #16939 - Upgrade CUDA to 12.4

CRT

  • #17097 - [Bugfix] Return error code on error from ModuleGetFunction

Disco

  • #17035 - [QoL] Implement broadcast/scatter methods for Session
  • #16992 - [Bugfix] Handle NDArray larger than OS buffer for pipe
  • #16978 - Implement num_workers property for disco.Session (see the sketch after this list)
  • #16989 - Treat hangup of disco worker process as kShutdown
  • #16993 - Allow allocation that only exists on worker0
  • #16979 - Expose disco.Session.shutdown through the python API
  • #16919 - Improve error message for CallPacked
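
A minimal sketch of the new Session conveniences, assuming the ThreadedSession constructor and the broadcast, num_workers, and shutdown members match the PR titles above; treat the exact names and signatures as assumptions.

```python
import numpy as np
from tvm.runtime import disco as di

# Spin up two in-process workers (assumed constructor name).
sess = di.ThreadedSession(num_workers=2)
print(sess.num_workers)  # -> 2, the property added in #16978

# Broadcast host data to all workers (#17035); dst is a DRef on the workers.
x = np.arange(8, dtype="float32")
dst = sess.empty((8,), "float32")
sess.broadcast(x, dst)

# Explicit teardown, exposed to Python in #16979.
sess.shutdown()
```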

Dlight

  • #17082 - Use 16x32 spatial x reduction thread extents in GEMV scheduling
  • #17052 - Skip GEMV rules when more than one vector
  • #17026 - Perf improvement for low_batch_gemv on Metal
  • #17016 - Update Adreno GEMV Rules
  • #16972 - [GPU] Enhance opencl thread limit for schedules
  • #16973 - [GPU] Improved gemv outer fallback schedule
  • #16958 - Check for target in function attributes
  • #16894 - Enhance vectorization for gpu matmul
  • #16884 - Add check for matmul dtype and fix reduction rule

Docs

  • #17146 - [DOC] Fix the "Reley" → "Relay" typo in the sentence "We utilize the intermediate representation of nn.Graph to convert the OneFlow model to Relay."
  • #17015 - [DOC] Update Model Links to Include Commit

Frontend

  • #17014 - [ArgParse] Pass default values to target compiler (#13264)
  • #16961 - [Bugfix][ONNX] Improve broadcast and batch_matmul conversion
  • #16936 - [TFLite] Add support for GELU conversion

Hexagon

  • #17123 - Add support for v75

LLVM

  • #17046 - [Arith][SVE] Add rewrite rules for indices split by scalable expressions
  • #16966 - [SVE] Add support for representing and creating buffer-level predicates
  • #17001 - [SVE] Use only powers of two as possible vscale values
  • #16962 - [SVE] Add codegen support for vscale_range() function attribute
  • #16968 - Stringref API deprecation fixes
  • #16965 - [SVE] Add get_active_lane_mask builtin
  • #16899 - [SVE][TOPI] Add conv2d NHWC hybrid SVE schedule for arm_cpu
  • #16893 - [SVE] Check for SVE target in VectorizeLoop
  • #16862 - [SVE] Support splitting by vscale in tir::split and te::split (sketch below)
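
To make the last item concrete, here is a hedged sketch of a scalable split: the inner extent is a multiple of vscale, so the generated loop adapts to the SVE vector length. It assumes the Python tir.vscale() helper and an SVE-capable AArch64 target at build time.

```python
from tvm import te, tir

# Elementwise compute whose axis we split by a scalable factor.
A = te.placeholder((1024,), name="A", dtype="float32")
B = te.compute((1024,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
# Inner extent is 4 * vscale, i.e. 4 fp32 lanes per 128 bits of SVE width.
xo, xi = s[B].split(B.op.axis[0], factor=4 * tir.vscale())
s[B].vectorize(xi)
```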

MetaSchedule

  • #17012 - [BugFix] Fix MultiLevelTilingTensorCore generating an inconsistent thread-binding sketch for batched matmul
  • #17066 - [BugFix] Fix "TensorIntrin 'dot_4x4_i8i8s32_sdot' is not registered"

Metal

  • #17059 - Enable Debug Label
  • #17025 - Support metal device profiling

OpenCL & CLML

  • #16933 - [CLML] Fix in clml pattern check condition
  • #16929 - [VM][OPENCL] Take advantage of OpenCL host ptr for improved copy

ROCm

  • #17141 - [Backend] Fix error when building TVM with LLVM 19

Relax

  • #17139 - Fix cublas dispatch for corner cases
  • #17127 - [KVCache] Support fork in sliding window sink part
  • #17115 - Support input_axis_separator to allow 2D to 1D conversion
  • #17119 - [Bugfix] Set purity=false for LazySetOutput
  • #17118 - [VM] Improved error messages for mismatched parameter count
  • #17110 - Alloc BYOC workspace with R.builtin.alloc_tensor
  • #17089 - [ONNX] Add support for HardSigmoid
  • #17100 - [KVCache] Unlimited depth blocks
  • #17075 - [Transform] Modify FuseTIR pass to propagate buffer attributes
  • #17088 - [ONNX] Add support for HardSwish
  • #17085 - [PyTorch] Add support for torch.nn.Hardsigmoid
  • #17083 - [TVMScript] Preserve tir.SizeVar through TVMScript round-trip
  • #17086 - Ignore dynamic parameters in RewriteDataflowReshape
  • #17084 - [PyTorch] Add support for torch.nn.Hardswish
  • #17074 - [KVCache][Test] Fix TIR attn kernels for uncommon group size
  • #17067 - Add missing white spaces in error messages
  • #17061 - [Frontend][Onnx] Cast Op special handling for ShapeExpr input
  • #17033 - [Bugfix] Apply FuseOps to nested DataflowBlock
  • #17032 - [Bugfix] Annotate ComputePrimValue output as host function
  • #17034 - [Bugfix] Bind symbolic variables in R.match_cast (example after this list)
  • #16960 - [UnitTest] Validate IRModule with multiple targets
  • #16995 - [KVCache] Support KVCache decode from forked sequence and pop more tokens
  • #16959 - [Transform] Handle identical PrimFunc with distinct VDevice
  • #16589 - [Unity] Check for transpose and dynamic shape in AdjustMatmulOrder
  • #16988 - [KVCache] Fix the aux data syncing order of paged KV cache
  • #16922 - [BugFix] Change FuseOpsByPattern strategy to pattern-match the maximal subgraph
  • #16982 - [Unity][BYOC] Use arith.Analyzer to check batch equality of matmul in cublas
  • #16955 - Implement relax.op.view
  • #16971 - Support nested ModuleList in nn.Module
  • #16826 - Express dynamic arguments of strided_slice as arguments
  • #16476 - [Unity][Cutlass] Fix C source generation of dense operation
  • #16940 - Allow PrimValue as index in relax.op.take
  • #16934 - [TIR] Introduce new cumsum op for gpu
  • #16859 - [QoL] Use SeqExpr in IR types when SeqExpr is required
  • #16904 - Prevent generating duplicate funcs in dispatch_sort_scan
  • #16905 - [Bugfix] Raise exception for OOM allocation
  • #16827 - Handle binary operations between Tensor and PrimValue
  • #16902 - Allow specifying entry_funcs for BYOC
  • #16860 - [QoL] Infer StructInfo for relax::Tuple on construction
  • #16861 - [QoL] Return well-formed IR from relax::Function::CreateEmpty
  • #16886 - [Frontend] Fix sort, argsort and topk in nn module
  • #16883 - Stabilize relax pass mutation order
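
As one example from the list above, #17034 makes R.match_cast bind the symbolic variables that appear in its target shape. A minimal TVMScript sketch:

```python
from tvm.script import ir as I, relax as R, tir as T

@I.ir_module
class Module:
    @R.function
    def main(x: R.Tensor(ndim=1, dtype="float32")):
        # match_cast checks the shape at runtime and binds n to x's
        # length, so n is usable in later bindings.
        n = T.int64()
        y = R.match_cast(x, R.Tensor((n,), "float32"))
        return y
```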

Relay

  • #16983 - [BugFix] Skip leaf args when matching the 'path' part of a dominator pattern
  • #16996 - Make TupleGetItem inherit the previous span

Runtime

  • #17057 - Stateless interface of PagedKVCache leaf node commit
  • #17049 - Support PagedKVCache with tree attention
  • #17045 - Fix PagedKVCache for PopN and enhance tests
  • #16998 - Compatibility with dmlc::Stream API changes
  • #17037 - [ROCm] Enable ROCm host memory support
  • #17036 - Use preferred host memory (pinned memory) in KV cache
  • #16994 - Allow query of available device memory through DeviceAPI (sketch after this list)
  • #16997 - [Disco] Restore checks for hangup of disco pipe
  • #16938 - Allow offset to be specified in NDArray::CreateView
  • #16890 - [VULKAN] Support total_global_memory
  • #16880 - Implemented Datatype.itemsize()
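
A small sketch of the memory queries mentioned above: total_global_memory already exists on tvm.runtime.Device, while available_global_memory is the property name assumed here to surface #16994.

```python
import tvm

dev = tvm.cuda(0)
if dev.exist:
    # Total device memory in bytes.
    print("total:", dev.total_global_memory)
    # Currently available bytes; assumed property backing #16994.
    print("available:", dev.available_global_memory)
```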

TIR

  • #17134 - [Schedule] Remove @type_check for set_axis_separator
  • #17112 - [DLight] Enable SimdGroup op for Metal
  • #17098 - [RPC] Allow RPC calls to compiled PrimFuncs with no arguments
  • #17039 - Fix Bug in VectorizeLoop
  • #17030 - Fix Shuffle rewrite
  • #16947 - Support narrow dtype for let binding
  • #16952 - Enhance CLZ intrinsic support
  • #16945 - [Compute-at] Simplify the compute-at'ed block when the predicate can be merged
  • #16879 - Make T.reinterpret a no-op when the dtype is unchanged (sketch below)
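
Illustrating the last item: T.reinterpret bit-casts a value to a same-width dtype and, after #16879, folds to a no-op when the source and destination dtypes already match. A minimal PrimFunc sketch:

```python
from tvm.script import tir as T

@T.prim_func
def bitcast(A: T.Buffer((16,), "int32"), B: T.Buffer((16,), "float32")):
    for i in range(16):
        with T.block("bitcast"):
            vi = T.axis.remap("S", [i])
            # Reinterpret the int32 bit pattern as float32 (no conversion).
            B[vi] = T.reinterpret("float32", A[vi])
```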

TOPI

  • #17091 - Add dense schedule for fp16 and fp32 using gemm
  • #17048 - [SME] Add conv2d NHWC SME fp16->fp32 schedule
  • #17040 - Fix SME conv2d schedule import and intrin argument
  • #17003 - [SME] Add conv2d NHWC SME fp32 schedule
  • #16977 - Remove blockIdx.z in topi sort
  • #16951 - Revert unification of conv2d NHWC hybrid scheduling for arm_cpu targets

TVMScript

  • #17107 - Better Type Annotation for TIR OP
  • #16967 - Fix error reporting inside Macro func
  • #16916 - Support T.launch_thread with i64 dtype
  • #16876 - Optionally use ruff format instead of black
  • #16877 - [Bug] Add test case for missing symbolic bounds

CUDA & CUTLASS & TensorRT

  • #16980 - [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state

Web

  • #17031 - Fix string to uint8 array for special characters
  • #17028 - Add dtype and offset for CreateView in runtime
  • #16910 - Support string[] in setPackedFunc() and exceptionally long arrays

Misc

  • #17135 - [QoL][IR] Provide default constructor for NameSupply/GlobalVarSupply
  • #17125 - [Utils] Define line-length for "ruff format"
  • #17152 - GraphExecutor: Fix wild-pointer assignment when input and output are reshapes
  • #17150 - [WebGPU] Fall back to 256MB for maxBufferSize if needed
  • #17128 - [Compute-inline] Prefer T.where for reverse compute-inlined block with predicate
  • #16976 - [WebGPU] Implement tir.dp4a with WGSL built-in function dot4I8Packed
  • #17124 - [WebGPU] Add tir.dp4a
  • #17113 - [CudaGraph] Handle exceptions thrown while capturing cuda graph
  • #17094 - [Utility][Container] Support non-nullable types in Array::Map
  • #17101 - [RPC] Raise error if server process terminated
  • #17092 - [UnitTests] Use tvm.ir.assert_structural_equal whenever possible
  • #17054 - [SME] Utilize predication in fp32 matmul and conv2d schedules
  • #17079 - [CMake] Show NVCC include directories in compile_commands.json
  • #17076 - [SME] Extract gemm block correctly when fused with bias
  • #17071 - [WebGPU] Translate int8x4 into u32
  • #17065 - [FP8][Codegen] Add make_fp8 vector constructors
  • #17064 - Add docs of v0.15.0 and v0.16.0
  • #16985 - [CODEGEN] Vector-Codegen support for llvm-pure-intrin
  • #17058 - Introduce outer reduction for metal
  • #17051 - Use adapter.info when available instead of requestAdapterInfo
  • #16981 - [SME] Add scalable fp16->fp32 dense schedule
  • #17029 - [Contrib] Implement NDArray cache update
  • #17027 - [picojson] Let objects be ordered when serializing
  • #17021 - [WebGPU] Update error messages to be more user-friendly
  • #17010 - Support multinomial_from_uniform dispatch
  • #16999 - [USMP] add missing const specifier for global_const_workspace
  • #17005 - [WebGPU] Handle device OOM in createBuffer
  • #16921 - [SME] Introduce scalable fp32 dense schedule
  • #16957 - chore: remove repetitive words
  • #16909 - [QoL][IR] Provide std::hash and std::equal_to for IR Variable types
  • #16987 - [JVM] Automatic Compatibility of JVM AttachCurrentThread
  • #16974 - [CUBLAS][FP8] Enable R.matmul + R.multiply offloading
  • #16896 - [CUBLAS] Enable offloading of R.matmul + R.dequantize
  • #16956 - Add script for testing release package
  • #16908 - Override StructuralEqual() for easier usage
  • #16932 - Enable gemv schedule for adreno
  • #16935 - [3rdparty] Bump FlashInfer for sampling functions
  • #16937 - [Thrust] Increase static workspace size
  • #16915 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 2
  • #16741 - Restore "pytest.mark.gpu" for RELAX tests
  • #16914 - [CMAKE] Make LOG_BEFORE_THROW explicit
  • #16913 - Enhance Release Note Script and Remove Useless File
  • #16907 - [Upd] Fix lld search in ROCm
  • #16900 - [CMAKE] Misc improvement of Util
  • #16897 - [Target] Don't register AArch64 target tags without LLVM compiler support
  • #16892 - [CUBLAS] Set fp32 compute and scale dtypes in fp16 matmul
  • #16888 - [CUBLAS][FP8] Support e4m3 gemm in cuBLAS BYOC
  • #16887 - [Contrib] Enable fp16 for thrust sort
  • #16881 - [release][Don't Squash] Update version to 0.16.0 and 0.17.0.dev on main branch
