Added
- reduce_by_index with f32-addition is now approximately 2x faster
  in the CUDA backend.
Fixed
- Fixed kernel extractor bug in if-interchange (#921).
- Fixed some cases of malformed kernel code generation (#922).
- Fixed rare memory corruption bug involving branches returning
  arrays (#923).
- Fixed spurious warning about entry points involving opaque return
  types, where the type annotations are put on a higher-order return
  type.
- Fixed incorrect size type checking for sum types in negative
  position with unknown constructors (#927).
- Fixed loop interchange for permuted sequential loops with more
  than one outer parallel loop (#928).
- Fixed a type checking bug for branches returning incomplete sum
  types (#931).