rapidsai/cudf v25.04.00 on GitHub

🚨 Breaking Changes

Remove unused group_range_rolling_window API (#18313) @wence-
[BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
Remove cudf.Scalar from binops (#18240) @mroeschke
Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
Remove deprecated single component datetime extract APIs (#18010) @Matt711
Remove deprecated rolling window functionality (#17993) @wence-
Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
Remove dataframe protocol (#17909) @vyasr
Use new rapids-logger library (#17899) @vyasr
Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
Fixed incorrect PTX parsing of ret instruction after branch label (#17859) @lamarrr
Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu

🐛 Bug Fixes

Fix alpha versions of cudf package. (#18429) @bdice
Backport: Deterministic hashing for DataFrameScan nodes in cudf-polars multi-partition executor (#18351) (#18420) @bdice
Skip failing Narwhals rolling groupby tests (#18398) @Matt711
Pin cmake in test_java to be less than 4.0.0 (#18392) @abellina
Skip polars tests that fail with pydantic deprecation warnings (#18388) @Matt711
Backport: Fix index of right table in unary operators in AST, in Joins (#18342) @bdice
xfail narwhals sqlframe tests (#18297) @Matt711
[BUG] Disabled JIT for CUDA Runtime < 11.5 (#18296) @lamarrr
Make a pylibcudf Column from a device array object with strides=None (#18295) @Matt711
Fix cudf.pandas objects to not be Callable (#18288) @galipremsagar
Skip failing polars test test_general_prefiltering (#18264) @Matt711
Filter all cudf.pandas profiler tests from running in parallel (#18262) @Matt711
Allow cudf.Series([pd.NA], dtype=, nan_as_null=False) (#18259) @mroeschke
Fix cross join with extra columns (#18256) @galipremsagar
Fix Dataframe.loc to not modify the actual dataframe (#18254) @galipremsagar
Remove RMM macro usage from to_arrow_device.cu (#18252) @davidwendt
Skip Narwhals cross join tests for cudf.pandas CI run (#18249) @Matt711
Fix cudf-polars tests for polars < 1.24 (#18246) @wence-
Fix experimental cudf-polars tests (#18244) @rjzamora
Fix datetime64 vs datetime binops max resolution (#18241) @galipremsagar
Use CCCL::libcudacxx include directories in Jitify preprocessing. (#18233) @bdice
Disable conda prefix patching to avoid mangling binaries (#18225) @vyasr
Workaround for ARM compiler issue with single space literal string (#18220) @davidwendt
Bump nightly check limit (#18213) @Matt711
Support comparative binops between categorical and non-categorical (#18200) @mroeschke
Make the version file inside cudf.pandas not a symlink (#18198) @vyasr
Ensure RAPIDS_ARTIFACTS_DIR is set for build metrics reports. (#18192) @bdice
Ignore run exports of libcufile. (#18190) @bdice
Skip flaky multi GPU test (#18187) @Matt711
Fix BPE merges table static-map capacity size (#18184) @davidwendt
Drop CUB_QUOTIENT_CEILING (#18179) @miscco
Disable ARM CI in C++ and Python test CI jobs (#18175) @Matt711
Add fmt to the test/benchmarks env (#18173) @vyasr
Fix merge(how=left, left_on=, right_index=True, sort=True) (#18166) @mroeschke
Allow nonnative cupy dtype in cudf.Series (#18164) @mroeschke
Fix Series construction from numpy array with non-native byte order (#18151) @mroeschke
Use protocol for dlpack instead of deprecated function in cupy notebook (#18147) @Matt711
Skip failing test (#18146) @vyasr
Update calls to KvikIO's config setter (#18144) @kingcrimsontianyu
Reduce memory use when writing tables with very short columns to ORC (#18136) @vuule
Handle empty dictionary in to_arrow_device interop (#18121) @davidwendt
Allow pivot_table to accept single label index and column arguments (#18115) @mroeschke
Preserve DataFrame.column subclass and type during binop (#18113) @mroeschke
Fix rmm macro call (#18108) @pmattione-nvidia
Add include for <functional> (#18102) @miscco
Remove static column vectors from window function tests. (#18099) @mythrocks
Fix scatter_by_map with spilling enabled (#18095) @mroeschke
Use the right version macro CCCL_MAJOR_VERSION (#18073) @miscco
Fix test_scan_csv_multi cudf-polars test (#18064) @rjzamora
Fix memcopy direction for concatenate (#18058) @tgujar
Fix upstream dask loc test (#18045) @rjzamora
Fix hang on invalid UTF-8 data in string_view iterator (#18039) @davidwendt
Fix dask_cudf.to_orc deprecation (#18038) @rjzamora
Compatibility with dask.dataframe's is_scalar (#18030) @TomAugspurger
Fix the build error due to KvikIO update (#18025) @kingcrimsontianyu
Fix failing ibis test (#18022) @Matt711
Skip failing polars tests (#18015) @Matt711
Fix to_arrow to return consistent pandas-metadata (#18009) @galipremsagar
Prevent setting custom attributes to ColumnMethods (#18005) @galipremsagar
Compatibility with Dask main (#17992) @TomAugspurger
[Bug] Fix Parquet-metadata sampling in cudf-polars (#17991) @rjzamora
Add missing include for calling std::iota() (#17983) @davidwendt
Fix pickle and unpickling for all objects (#17980) @galipremsagar
Install duckdb, the default backend for ibis, in the cudf.pandas integration tests (#17972) @Matt711
Check null count too in sum aggregation (#17964) @Matt711
Raise NotImplementedError for groupby.agg if duplicate columns would be created (#17956) @mroeschke
Ensure disabling the module accelerator is thread-safe (#17955) @vyasr
Fix DataFrame/Series.rank for int and null data in mode.pandas_compatible (#17954) @mroeschke
Limit buffer size in reallocation policy in JSON reader (#17940) @shrshi
Make cudf.pandas proxy array picklable (#17929) @Matt711
Add missing standard includes (#17928) @miscco
Fix torch integration test (#17923) @Matt711
Fix to_pandas writable bug for datetime and timedelta types (#17913) @galipremsagar
Raise NotImplementedError if .merge(suffixes=) introduces duplicate labels (#17905) @mroeschke
Fix groupby scans with int and NA data in mode.pandas_compatible (#17895) @mroeschke
Patch __init__ of cudf constructors to parse through cudf.pandas proxy objects (#17878) @galipremsagar
Fixed incorrect PTX parsing of ret instruction after branch label (#17859) @lamarrr
Relax inconsistent schema handling in dask_cudf.read_parquet (#17554) @rjzamora

📖 Documentation

Clarify that cudf.pandas should be enabled before importing pandas. (#18339) @bdice
[DOC] Add wordpiece tokenizer to cudf documentation (#18247) @davidwendt
Added pylibcudf.contiguous_split to API docs (#18194) @TomAugspurger
Fix build.sh docs for default behavior (#18180) @bdice
Update Dask-cuDF documentation to fix all warnings and errors (#18157) @TomAugspurger
[DOC] Document character normalizer (#18125) @Matt711

🚀 New Features

Add and revise experimental cudf-polars config options (#18284) @rjzamora
Support top-k and bottom_k expressions (#18222) @Matt711
Support cudf-polars is_leap_year (#18212) @brandon-b-miller
Support cudf-polars month_start/month_end (#18211) @brandon-b-miller
Support cudf-polars ordinal_day (#18152) @brandon-b-miller
Add pylibcudf.gpumemoryview support for len()/nbytes (#18133) @pentschev
Link to libzstd for ZSTD compression and decompression APIs (#18129) @shrshi
Added NDSH Q09 Benchmark for Transforms (#18127) @lamarrr
Make pylibcudf traits raise exceptions gracefully rather than terminating in C++ (#18117) @Matt711
Host decompression (#18114) @vuule
Add owning types to hold Arrow data (#18084) @vyasr
Bump polars version to <1.24 (#18076) @Matt711
Support sorted merges in cudf.polars (#18075) @Matt711
Add a slice expression to polars IR (#18050) @Matt711
Expose num_rows_per_source (IO metadata) to pylibcudf (#18049) @Matt711
Added Imbalanced Tree Benchmarks for Transforms (#18032) @lamarrr
Run the narwhals test suite with cudf.pandas (#18031) @Matt711
Add host_read_async interfaces to datasource (#18018) @vuule
Make most cudf-polars Node objects pickleable (#17998) @rjzamora
Add Column.serialize to cudf-polars (#17990) @rjzamora
Bump polars version to <1.23 (#17986) @Matt711
Implemented Decimal Transforms (#17968) @lamarrr
Introduce ZSTD host-side compression and decompression APIs (#17935) @shrshi
Add catboost integration tests (#17931) @Matt711
[FEA] Expose stripe_size_rows setting for ORCWriterOptions (#17927) @ustcfy
Test narwhals in CI (#17884) @bdice
Added Multi-input & Scalar Support for Transform UDFs (#17881) @lamarrr
Host Snappy compression (#17824) @vuule
Run spark-rapids-jni CI (#17781) @KyleFromNVIDIA
Add multi-partition Shuffle operation to cuDF Polars (#17744) @rjzamora
Added polynomials benchmark (#17695) @lamarrr
Add stream parameters in pylibcudf IO APIs (#17620) @Matt711
New nvtext::wordpiece_tokenizer APIs (#17600) @davidwendt
Add support for unary negation operator (#17560) @Matt711
Add multi-partition Join support to cuDF-Polars (#17518) @rjzamora
Add basic multi-partition GroupBy support to cuDF-Polars (#17503) @rjzamora
Support Distributed in cudf-polars tests and IR evaluation (#17364) @pentschev

🛠️ Improvements

Use pyarrow 15 in oldest dependency CI jobs (#18409) @bdice
Bump librdkafka to 2.8.0 (#18370) @raydouglass
fix(rattler): ignore libzlib run dependency to avoid pandoc collision (#18368) @gforsyth
Fix zstd build interface include definition (#18366) @trxcllnt
test: Install pytest-env and hypothesis in test_narwhals.sh (#18337) @MarcoGorelli
Remove unused group_range_rolling_window API (#18313) @wence-
Cache column view creation from arrow types (#18302) @vyasr
Split Narwhals cudf.pandas test failures into "to fix" and "to skip" (#18267) @mroeschke
Support BinOp, min, and max Aggregations in cudf-polars parallel groupby (#18266) @TomAugspurger
Minor clean up and optimizations in the Parquet writer (#18258) @vuule
Fix cudf_kafka run export for cudatoolkit (#18245) @gforsyth
dask-polars: use splat everywhere. (#18243) @madsbk
Remove cudf.Scalar from binops (#18240) @mroeschke
Remove warning in the stream pool when asking for more streams than available (#18236) @vuule
Explain why we disable parallelism for profiler tests to avoid pytest-cov issue (#18234) @Matt711
Ignore cudatoolkit run exports by name, not package (#18230) @gforsyth
Revert "Bump nightly check limit" (#18227) @Matt711
Fix cudf.pandas to be able to work on a cpu-only machine (#18224) @galipremsagar
Add missing cudatoolkit run_export ignore to pylibcudf (#18223) @gforsyth
Remove cudf.Scalar from Column.setitem (#18221) @mroeschke
Remove unused round_up_pow2 utility (#18218) @PointKernel
Add flake8-print/debugger Ruff rules (#18217) @mroeschke
Bump polars version to <1.25 (#18209) @Matt711
Export RAPIDS_ARTIFACTS_DIR. (#18208) @bdice
Replace more thrust functions with libcu++ ones (#18207) @miscco
Update Numpy <2.1 unpinning xfail condition (#18203) @mroeschke
Run conda import tests on Python packages (#18197) @bdice
fix(rattler): add cudatoolkit ignore run export to cudf (#18195) @gforsyth
Revert "Disable ARM CI in C++ and Python test CI jobs" (#18188) @Matt711
Define Column.where to be used across DataFrame/Series (#18186) @mroeschke
Remove cudf.Scalar in where (#18178) @mroeschke
Drop unnecessary fmt dep (#18177) @vyasr
Refactor join internals: separate hash_join declaration and cleanup (#18170) @PointKernel
Add Ruff rule to enforce cudf dtype utils over numpy/pandas dtype utils (#18169) @mroeschke
Combine multiple str.minhash() APIs into one call (#18168) @davidwendt
Move nanoarrow_utils.hpp from cpp/tests/interop to cpp/include/cudf_test (#18163) @davidwendt
Test cudf against the latest stable branch of Narwhals (#18162) @Matt711
fix libcudf pins cu11 (#18161) @gforsyth
Combine separate ConfigureNVBench calls to fix cpp conda builds (#18155) @gforsyth
Add telemetry to build workflows (#18154) @gforsyth
Prune more seldom used dtype utils (#18150) @mroeschke
Remove some unnecessary module imports (#18143) @mroeschke
Branch 25.04 merge branch 25.02 (#18142) @vyasr
Prune some seldom used dtype utils (#18141) @mroeschke
Use more, cheaper dtype checking utilities in cudf Python (#18139) @mroeschke
Support deserializing cudf-polars objects composed of RMM frames (#18138) @pentschev
Add ConfigOptions convenience class to cudf-polars (#18137) @rjzamora
Support new callback API for lazyframe.profile (#18132) @wence-
Optimized compilation of CUDFTESTUTIL's interface sources (#18131) @lamarrr
Unpin numpy<2.1 (#18128) @mroeschke
Use cpu16 for build CI jobs (#18124) @bdice
Remove now non-existent job (#18123) @vyasr
Minor typo fix in filling.pxd (#18120) @davidwendt
Replace more deprecated CUB functors (#18119) @miscco
Simplify DecimalDtype and DecimalColumn operations (#18111) @mroeschke
Add interop support from arrow StringView to libcudf strings column (#18107) @davidwendt
Expose the Number of Filtered Parquet Rowgroups (IO Metadata) to pylibcudf (#18106) @JigaoLuo
Add a list of expected failures to narwhals tests (#18097) @Matt711
Remove unused var (#18096) @vyasr
Run narwhals tests nightly. (#18093) @bdice
Use conda-build instead of conda-mambabuild (#18092) @bdice
Remove static configure step (#18091) @vyasr
Remove FindCUDAToolkit.cmake from .pre-commit-config.yaml (#18087) @KyleFromNVIDIA
Align StringColumn constructor with ColumnBase base class (#18086) @mroeschke
Remove FindCUDAToolkit backport (#18081) @KyleFromNVIDIA
Support melt(ignore_index=False) (#18080) @mroeschke
Update numba dep and upper-bound numpy (#18078) @vyasr
Add as_proxy_object API to cudf.pandas (#18072) @galipremsagar
Enforce deprecation of dtype parameter in sum/product (#18070) @mroeschke
send sccache logs to telemetry (#18069) @msarahan
Short circuit Index.equal if compared Index isn't same type (#18067) @mroeschke
Make Column.view/can_cast_safely accept a dtype object (#18066) @mroeschke
Optimization improvement for substr in cudf::string_view (#18062) @davidwendt
Forward-merge branch-25.02 to branch-25.04 (#18061) @bdice
Port all conda recipes to rattler-build (#18054) @gforsyth
Minor improvements in arrow interop (#18053) @wence-
Pass more dtype objects to astype calls (#18044) @mroeschke
Forward merge branch-25.02 to branch-25.04 (#18041) @Matt711
Replace deprecated CCCL features (#18036) @miscco
Separate stats filtering helpers to reuse in page pruning (#18034) @mhaseeb123
Update spark-rapids-jni CI image version to cuda12.8.0 (#18024) @pxLi
Add pylibcudf.Scalar.from_numpy for bool/int/float/str types (#18020) @mroeschke
Support IntervalDtype(subtype=None) (#18017) @mroeschke
Enable pytest-xdist runs for py-polars tests (#18016) @galipremsagar
consolidate more conda solves in CI (#18014) @jameslamb
Replace cub::Int2Type with cuda::std::integral_constant (#18013) @miscco
Remove deprecated single component datetime extract APIs (#18010) @Matt711
Pass dtype objects to Column.astype (#18008) @mroeschke
Require CMake 3.30.4 (#18007) @robertmaynard
Refactor math_ops.cu dispatcher logic (#18006) @davidwendt
Move cudf::lists::detail::make_empty_lists_column to public API (#17996) @davidwendt
Create Conda CI test env in one step (#17995) @KyleFromNVIDIA
Add seed parameter to cudf hash_character_ngrams (#17994) @davidwendt
Remove deprecated rolling window functionality (#17993) @wence-
Continue on failures in cudf.pandas integration tests CI job (#17987) @Matt711
Avoid cudf.dtype calls in build_column/column_empty/.where (#17979) @mroeschke
Ensure dtype objects are passed within Column.astype (#17978) @mroeschke
Use Conda XGBoost (#17959) @jakirkham
Read the footers in parallel when reading multiple Parquet files (#17957) @vuule
Refactor predicate pushdown to reuse row group pruning in experimental PQ reader (#17946) @mhaseeb123
Add new nvtext tokenized minhash API (#17944) @davidwendt
Use shared-workflows branch-25.04 (#17943) @bdice
Get rid of the deprecated thrust::identity (#17942) @PointKernel
Remove deprecated nvtext::minhash_permuted APIs (#17939) @davidwendt
Enable third party library integration tests in CI with cudf.pandas (#17936) @galipremsagar
Add build_type input field for test.yaml (#17925) @gforsyth
Remove cudf.Scalar from shift/fillna (#17922) @mroeschke
Enabling cross join in cudf python (#17921) @galipremsagar
Use rapids-pip-retry in CI jobs that might need retries (#17920) @gforsyth
More avoid cudf.dtype internally in favor of pre-defined, supported types (#17918) @mroeschke
Initialize inout parameter (#17911) @miscco
Remove dataframe protocol (#17909) @vyasr
Rename PascalCase functions and types to snake_case to improve consistency (#17908) @vuule
Use new rapids-logger library (#17899) @vyasr
Add pylibcudf.Scalar.from_py for construction from Python strings, bool, int, float (#17898) @mroeschke
Remove cudf.Scalar from factorize (#17897) @mroeschke
disallow fallback to Make in Python builds (#17894) @jameslamb
Remove orc::gpu namespace (#17891) @vuule
Only run Auto Assign PR workflow if PR is not merged (#17888) @mroeschke
Update pre-commit-hooks to version 0.6.0 (#17887) @KyleFromNVIDIA
Forward-merge branch-25.02 to branch-25.04 (#17885) @bdice
Add script to run pylibcudf tests (#17882) @bdice
Migrate to NVKS for amd64 CI runners (#17877) @bdice
Fix merge conflict for branch-25.02 into branch-25.04 (#17874) @davidwendt
Remove decimal32/64 to decimal128 conversion in Parquet writer (#17869) @mhaseeb123
Expose JSON reader options to builder in pylibcudf (#17866) @shrshi
Remove cudf.Scalar from .dt timedelta properties (#17863) @mroeschke
Added support for custom types in PTX parser (#17861) @lamarrr
Remove cudf.Scalar from date_range/to_datetime (#17860) @mroeschke
Avoid cudf.dtype internally in favor of pre-defined, supported types (#17839) @mroeschke
Allow cudf::type_to_id<T const>() (#17831) @esoha-nvidia
Fixing auto-merge branch-25.02 into branch-25.04 (#17828) @davidwendt
Add new nvtext::normalize_characters API (#17818) @davidwendt
Include more information in error messages in the nvcomp adapter (#17814) @vuule
Extend and simplify API for calculation of range-based rolling window offsets (#17807) @wence-
More minor fixes for CCCL (#17793) @miscco
Use KvikIO to enable file's fast host read and host write (#17764) @kingcrimsontianyu
Remove cudf._lib.column in favor of pylibcudf. (#17760) @mroeschke
Replaced std::string with std::string_view and removed excessive copies in cudf::io (#17734) @lamarrr
Use xdist worksteal on the cudf.pandas test suite (#16930) @Matt711