github taichi-dev/taichi v1.7.0


1. New features

1.1 Real Function

We are excited to announce the stabilization of the Real Function feature in Taichi Lang v1.7.0. Initially introduced as an experimental feature in v1.0.0, it has now matured with enhanced capabilities and usability.

Key Updates

  • Decorator Change: The Real Function now uses @ti.real_func. The previous decorator, @ti.experimental.real_func, is deprecated.
  • Performance Improvements: Real Functions, unlike Taichi inline functions (@ti.func), are compiled as separate entities, akin to CUDA's device functions. This separation allows for recursive runtime calls and significantly faster compilation. For instance, the Cornell box example's compilation time is reduced from 2.34s to 1.01s on an i9-11900K when switching from inline to real functions.
  • Enhanced Functionality: Real Functions support multiple return statements, offering greater flexibility in coding.

Limitations

  • Backend Support: Real Functions are currently only compatible with LLVM-based backends, including CPU and CUDA.
  • Parallel Loops: Writing parallel loops within Real Functions is not supported. However, if called within a parallel loop in a kernel, the Real Function will be parallelized accordingly.

Important Note on Usage: Ensure all arguments and return values in Real Functions are explicitly type-hinted.

Usage Example

The following example demonstrates the recursive capability of Real Functions. The sum_func Real Function is used to calculate the sum of numbers from 1 to n, showcasing its ability to handle multiple return statements and variable recursion depths.

import taichi as ti
ti.init(arch=ti.cpu)  # real functions require an LLVM-based backend (CPU or CUDA)

@ti.real_func
def sum_func(n: ti.i32) -> ti.i32:
    if n == 0:
        return 0
    return sum_func(n - 1) + n

@ti.kernel
def sum(n: ti.i32) -> ti.i32:
    return sum_func(n)

print(sum(100))  # 5050

You can find more examples of real functions in the repository.

1.2 Enhancements in Kernel Arguments and Return Values

Support for Multiple Return Values in Taichi Kernels:

In this update, we've introduced the capability to return multiple values from a Taichi kernel by specifying a tuple as the return type. You can use (ti.f32, s0) directly as the type hint, write it the Python way as typing.Tuple[ti.f32, s0], or, on Python 3.9 and above, use tuple[ti.f32, s0]. The following example illustrates this new feature:

import taichi as ti
ti.init(arch=ti.cpu)  # real functions require an LLVM-based backend

s0 = ti.types.struct(a=ti.math.vec3, b=ti.i16)

@ti.real_func
def foo() -> (ti.f32, s0):
    return 1, s0(a=ti.math.vec3([100, 0.5, 3]), b=1)

@ti.kernel
def bar() -> (ti.f32, s0):
    return foo()
    
ret1, ret2 = bar()
print(ret1)  # 1.0
print(ret2)  # {'a': [100.0, 0.5, 3.0], 'b': 1}

Removal of Size Limit on Kernel Arguments and Return Values:

We have eliminated the size restrictions on kernel arguments and return values. Keep in mind, however, that small arguments and return values are still advisable: large ones can lead to substantially longer compile times. While larger sizes are supported, we have not thoroughly tested arguments and return values exceeding 4 KB and cannot guarantee they work flawlessly.
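
For illustration, here is a minimal sketch of a kernel returning a comparatively large value, a 32-element vector (128 bytes). The exact prior limits varied by backend, so treat the size as illustrative:

import taichi as ti
ti.init()

vec32 = ti.types.vector(32, ti.f32)

@ti.kernel
def big_return() -> vec32:
    # Broadcast-construct a 32-element vector; such a return type is now
    # accepted, though large return types increase compile time.
    return vec32(1.0)

print(big_return())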

1.3 Argument Pack

Taichi now introduces Argument Packs, a powerful feature for developers. Argument Packs cache unchanged parameters between multiple kernel calls, which is not only convenient when launching a kernel but also boosts performance.

Key Advantages

  • Argument Packs: user-defined data types that encapsulate multiple parameters into a single, manageable unit.
  • Buffering Capability: Store and reuse parameters that remain constant across kernel calls, reducing the overhead of repeated parameter passing.
  • Device-level Caching: Taichi optimizes performance by caching argpacks directly on the device.

Usage Example

import taichi as ti
ti.init()

# Defining a custom argument type using "ti.types.argpack"
view_params_tmpl = ti.types.argpack(view_mtx=ti.math.mat4, proj_mtx=ti.math.mat4, far=ti.f32)

# Declaration of a Taichi kernel leveraging Argument Packs
@ti.kernel
def p(view_params: view_params_tmpl) -> ti.f32:
    return view_params.far

# Instantiation of the argument pack
view_params = view_params_tmpl(
    view_mtx=ti.math.mat4(
        [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]),
    proj_mtx=ti.math.mat4(
        [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]),
    far=1)

# Executing the kernel with the Argument Pack
print(p(view_params))  # Outputs: 1.0

Supported Data Types

Argument Packs are currently compatible with a variety of data types, including scalar, matrix, vector, Ndarray, and Struct, as the sketch below illustrates.
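
A minimal sketch mixing scalar, vector, and struct members in one pack (the names material, pack_tmpl, and shade are made up for this example):

import taichi as ti
ti.init()

# A struct type used as an argpack member
material = ti.types.struct(albedo=ti.math.vec3, roughness=ti.f32)
pack_tmpl = ti.types.argpack(light_dir=ti.math.vec3, mat=material, gamma=ti.f32)

@ti.kernel
def shade(p: pack_tmpl) -> ti.f32:
    return p.mat.roughness * p.gamma

pack = pack_tmpl(light_dir=ti.math.vec3(0, 1, 0),
                 mat=material(albedo=ti.math.vec3(1, 0, 0), roughness=0.5),
                 gamma=2.2)
print(shade(pack))  # 1.1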

Limitations

Please note that Argument Packs currently do not support the following features and data types:

  • Ahead-of-Time (AOT) Compilation and Compute Graph
  • ti.template
  • ti.data_oriented

2. Improvements

2.1 CUDA Memory Allocation Improvements

Dynamic VRAM Allocation:

  • In our latest update, the CUDA backend has been optimized to dynamically allocate video RAM (VRAM), significantly reducing the initial preallocation requirement: less than 50 MB is now preallocated upon ti.init().

Changes in device_memory_GB and device_memory_fraction Usage:

  • These settings are now specifically tailored to preallocating memory for SPARSE data structures, such as ti.pointer. The preallocation occurs only once a sparse data structure is detected in your code; see the sketch below.
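
A minimal sketch of the new behavior (the 2 GB figure and grid sizes are illustrative):

import taichi as ti

# Reserve 2 GB up front for sparse data structures. Since v1.7.0 this
# preallocation is triggered only when a sparse SNode is present;
# dense-only programs preallocate less than 50 MB.
ti.init(arch=ti.cuda, device_memory_GB=2.0)

x = ti.field(ti.f32)
block = ti.root.pointer(ti.ij, (16, 16))  # sparse pointer SNode
block.dense(ti.ij, (8, 8)).place(x)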

Impact on VRAM Consumption:

  • Users can expect a noticeable decrease in VRAM usage with these enhancements. For instance:
    diffmpm3d:         3866 MB --> 3190 MB
    nerf_train_deploy: 5618 MB --> 4664 MB

2.2 CUDA SIMT APIs

Added the following ti.simt.block APIs, illustrated in the sketch after the list:

  • ti.simt.block.sync_any_nonzero
  • ti.simt.block.sync_all_nonzero
  • ti.simt.block.sync_count_nonzero
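
A minimal sketch of how these might be used, assuming each call takes a per-thread integer predicate, synchronizes the block, and returns the block-wide reduction, mirroring CUDA's __syncthreads_and/__syncthreads_or/__syncthreads_count (the block mapping below is an assumption, not confirmed by this note):

import taichi as ti
ti.init(arch=ti.cuda)

result = ti.field(ti.i32, shape=3)

@ti.kernel
def block_votes():
    # Assumed: with block_dim=128 and exactly 128 iterations, the loop
    # maps onto a single 128-thread block.
    ti.loop_config(block_dim=128)
    for i in range(128):
        pred = i % 2  # every other thread has a nonzero predicate
        result[0] = ti.simt.block.sync_any_nonzero(pred)    # expected 1
        result[1] = ti.simt.block.sync_all_nonzero(pred)    # expected 0
        result[2] = ti.simt.block.sync_count_nonzero(pred)  # expected 64

block_votes()
print(result.to_numpy())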

2.3 Sparse grid APIs

Added a helper function to create a 2D/3D sparse grid. For example:

    import taichi as ti
    ti.init(arch=ti.cpu)  # sparse grids need a backend with sparse support

    # create a 2D sparse grid
    grid = ti.sparse.grid(
        {
            "pos": ti.math.vec2,
            "mass": ti.f32,
            "grid2particles": ti.types.vector(20, ti.i32),
        },
        shape=(10, 10),
    )

    # access
    grid[0, 0].pos = ti.math.vec2(1, 2)
    grid[0, 0].mass = 1.0
    grid[0, 0].grid2particles[2] = 123

2.4 GGUI

  • Added Metal backend support for GGUI (see the sketch below)
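
A minimal sketch of GGUI running on Metal (assumes macOS with a Metal-capable GPU; the window contents are illustrative):

import taichi as ti
ti.init(arch=ti.metal)

window = ti.ui.Window("GGUI on Metal", (640, 480))
canvas = window.get_canvas()
while window.running:
    canvas.set_background_color((0.2, 0.3, 0.4))
    window.show()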

2.5 AOT

  • Added the C-API functions ti_import_cpu_memory() and ti_import_cuda_memory()
  • Added support for multiple AOT runtime devices
  • Added support for matrix/vector in compute graph in C-API
  • Added support for matrix/vector in compute graph in Python

2.6 Error reporting

  • Improved the quality and coverage of error messages

2.7 Autodiff

  • Supports passing vector/matrix arguments to autodiff kernels (see the sketch after this list)
  • Supports autodiff for torch Tensors and Taichi ndarrays on CPU and CUDA
  • Supports passing grad tensors to the primal kernel
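
A minimal sketch of the first item, passing a vector argument to a kernel differentiated with ti.ad.Tape (the field shapes and loss are made up for illustration):

import taichi as ti
ti.init(arch=ti.cpu)

x = ti.field(ti.f32, shape=4, needs_grad=True)
loss = ti.field(ti.f32, shape=(), needs_grad=True)

@ti.kernel
def compute(w: ti.math.vec3):
    for i in x:
        loss[None] += w[0] * x[i] ** 2

x.fill(2.0)
with ti.ad.Tape(loss=loss):
    compute(ti.math.vec3(1.0, 2.0, 3.0))
print(x.grad)  # d(loss)/dx_i = 2 * w[0] * x_i = 4.0 for each i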

3. Bug Fixes

3.1 Autodiff Bugfixes

  • Fixed a few bugs with the use of ti.ad.Tape
  • Fixed a bug with the random seed for loss

3.2 AOT Bugfixes

  • Fixed a few bugs with compute graph
  • Fixed a few bugs with C-API

3.3 API Bugfixes

  • Fixed a bunch of bugs related to Matrix/Vector
  • Fixed an error with Ndarray type check
  • Fixed a few errors with taichi.math APIs
  • Fixed an error with SNode destruction
  • Fixed an error with dataclass support for struct with matrix
  • Fixed an error with ti.func
  • Fixed a few errors with ti.struct and struct field
  • Fixed a few errors with Sparse Matrix

3.4 Build & Environment Bugfixes

  • Fixed a few compilation issues on Windows platform
  • Fixed an issue with cusolver dependency

3.5 GGUI Bugfixes

  • Fixed vec_to_euler breaking GGUI cameras and improved camera handling logic
  • Fixed ImGui widget size on HiDPI displays

4. Deprecation Notice

  • We have removed the CC backend because it is rarely used and lacks maintenance.
  • We are deprecating ti.experimental.real_func because it is no longer experimental. Please use ti.real_func instead.

5. Full changelog

Highlights:
   - **Bug fixes**
      - Fix macro error with ti_import_cpu_memory (#8401) (by **Zhanlue Yang**)
      - Fix argpack nesting issues (by **listerily**)
      - Convert matrices to structs in argpack type members, Fixing layout error (by **listerily**)
      - Fix error when returning a struct field member when the return … (#8271) (by **秋云未云**)
      - Fix Erroneous handling of ndarray in real function in CFG (#8245) (by **Lin Jiang**)
      - Fix issue with passing python-scope Matrix as ti.func argument (#8197) (by **Zhanlue Yang**)
      - Fix incorrect CFG Graph structure due to missing Block with OffloadedStmts on LLVM backend (#8113) (by **Zhanlue Yang**)
      - Fix type inference error with LowerMatrixPtr pass (#8105) (by **Zhanlue Yang**)
      - Set initial value for Cuda device allocation (#8063) (by **Zhanlue Yang**)
      - Fix the insertion position of the access chain (#7957) (by **Lin Jiang**)
      - Fix wrong datatype size when writing to ndarray from Python scope (by **Ailing Zhang**)
   - **CUDA backend**
      - Warn driver version if it doesn't support memory pool. (#7912) (by **Haidong Lan**)
   - **Documentation**
      - Fixing typo in impl.py on ti.grouped function documentation (#8407) (by **Quentin Warnant**)
      - Update doc about kernels and functions (#8400) (by **Lin Jiang**)
      - Update documentation (#8089) (by **Zhao Liang**)
      - Update docstring for inverse func (#8170) (by **Zhao Liang**)
      - Update type.md, add descriptions of the vector (#8048) (by **Chenzhan Shang**)
      - Fix a bug in faq.md (#7992) (by **Zhao Liang**)
      - Fix problems in type_system.md (#7949) (by **秋云未云**)
      - Add doc about struct arguments (#7959) (by **Lin Jiang**)
      - Fix docstring of mix function (#7922) (by **Zhao Liang**)
      - Update faq and ggui, and add them to CI (#7861) (by **Zhao Liang**)
      - Add kernel sync doc (#7831) (by **Zhao Liang**)
   - **Error messages**
      - Warn before calling the external function (#8177) (by **Lin Jiang**)
      - Add option to print full traceback in Python (#8160) (by **Lin Jiang**)
      - Let to_primitive_type throw an error if the type is a pointer (by **lin-hitonami**)
      - Update deprecation warning of the graph arguments (#7965) (by **Lin Jiang**)
   - **Language and syntax**
      - Add clz instruction (#8276) (by **Jett Chen**)
      - Move real function out of the experimental module (#8399) (by **Lin Jiang**)
      - Fix error with loop unique analysis for MatrixPtrStmt (#8307) (by **Zhanlue Yang**)
      - Pass DebugInfo from Python to C++ for ndarray and field (#8286) (by **魔法少女赵志辉**)
      - Support TensorType for SharedArray (#8258) (by **Zhanlue Yang**)
      - Use ErrorEmitter in type check passes (#8285) (by **魔法少女赵志辉**)
      - Implement struct DebugInfo and ErrorEmitter (#8284) (by **魔法少女赵志辉**)
      - Add TensorType support for Constant Folding (#8250) (by **Zhanlue Yang**)
      - Support TensorType for irpass::alg_simp() (#8225) (by **Zhanlue Yang**)
      - Support vector/matrix ndarray arguments in real function (by **Lin Jiang**)
      - Fix error on ndarray type check (by **Lin Jiang**)
      - Support real function in data-oriented classes (by **lin-hitonami**)
      - Let kernel support return type annotated with 'typing.Tuple' (by **lin-hitonami**)
      - Support tuple return value for kernel and real function (by **lin-hitonami**)
      - Let static assert be in static scope (#8217) (by **Lin Jiang**)
      - Avoid scalarization for AOS GlobalPtrStmt (#8187) (by **Zhanlue Yang**)
      - Support matrix return value for real function (by **lin-hitonami**)
      - Support ndarray argument for real function (by **lin-hitonami**)
      - Cast the scalar arguments and return values of ti.func if the type hints exist (#8193) (by **Lin Jiang**)
      - Handle MatrixPtrStmt for uniquely_accessed_pointers() (#8165) (by **Zhanlue Yang**)
      - Support struct arguments for real function (by **lin-hitonami**)
      - Merge irpass::half2_vectorize() with irpass::scalarize() (#8102) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after optimize_bit_struct_stores & determine_ad_stack_size (#8097) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::demote_operations() (#8096) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::lower_access() (#8091) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::make_block_local() (#8090) (by **Zhanlue Yang**)
      - Support TensorType for Dead-Store-Elimination (#8065) (by **Zhanlue Yang**)
      - Optimize alias checking conditions for store-to-load forwarding (#8079) (by **Zhanlue Yang**)
      - Support TensorType for Load-Store-Forwarding (#8058) (by **Zhanlue Yang**)
      - Fix TensorTyped error with irpass::make_thread_local() (#8051) (by **Zhanlue Yang**)
      - Fix numerical issue with auto_diff() (#8025) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::make_mesh_block_local() (#8030) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::make_thread_local() (#8028) (by **Zhanlue Yang**)
      - Support allocate with cuda memory pool and reduce preallocation size accordingly (#7929) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::demote_no_access_mesh_fors() (#7956) (by **Zhanlue Yang**)
      - Fix error with irpass::check_out_of_bound() for TensorTyped ExternalPtrStmt (#7997) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::demote_atomics() (#7943) (by **Zhanlue Yang**)
      - Separate out preallocation logics for runtime objects (#7938) (by **Zhanlue Yang**)
      - Remove deprecated funcs in __init__.py (#7941) (by **Lin Jiang**)
      - Remove deprecated sparse_matrix_builder function (#7942) (by **Lin Jiang**)
      - Remove deprecated compile option ndarray_use_cached_allocator (#7937) (by **Zhanlue Yang**)
      - Migrate irpass::scalarize() after irpass::detect_read_only() (#7939) (by **Zhanlue Yang**)
      - Remove deprecated funcs in ti.ui (#7940) (by **Lin Jiang**)
      - Remove the support for 'is' (#7930) (by **Lin Jiang**)
      - Migrate irpass::scalarize() after irpass::offload() (#7919) (by **Zhanlue Yang**)
      - Raise error when the dimension of the ndrange does not equal to the number of the loop variable (#7933) (by **Lin Jiang**)
      - Remove a.atomic(b) (#7925) (by **Lin Jiang**)
      - Cancel deprecating native min/max (#7928) (by **Lin Jiang**)
      - Fix the api doc search problem (#7918) (by **Zhao Liang**)
      - Move irpass::scalarize() after irpass::auto_diff() (#7902) (by **Zhanlue Yang**)
      - Fix Ndarray fill with Matrix/Vector typed values (#7901) (by **Zhanlue Yang**)
      - Add cast to field.fill() interface (#7899) (by **Zhanlue Yang**)
      - Let nested data classes have methods (#7909) (by **Lin Jiang**)
      - Let kernel argument support matrix nested in a struct (by **lin-hitonami**)
      - Support the functions of dataclass as kernel argument and return value (#7865) (by **Lin Jiang**)
      - Fix a bug on PosixPath (#7860) (by **Zhao Liang**)
      - Postpone MatrixType scalarization to irpass::differentiation_validation_check() (#7839) (by **Zhanlue Yang**)
      - Postpone MatrixType scalarization to irpass::gather_meshfor_relation_types() (#7838) (by **Zhanlue Yang**)
   - **Miscellaneous**
      - Make clang-tidy happy on 'explicit' (#7999) (by **秋云未云**)
   - **OpenGL backend**
      - Fix: runtime caught error cannot be displayed in opengl (#7998) (by **秋云未云**)
   - **IR optimization passes**
      - Make merging casts int(int(x)) less aggressive (#7944) (by **Ailing**)
      - Fix redundant clone of stmts across offloaded tasks (#7927) (by **Ailing**)
   - **Refactor**
      - Refactor the argument passing logic of rwtexture and remove extra_args (#7914) (by **Lin Jiang**)
