🚨 Breaking Changes
- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
- Remove sizeof and standardize on memory_usage (#9544) @vyasr
- Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
- Refactor sorting APIs (#9464) @vyasr
- Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
- Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
- JNI: Support nested types in ORC writer (#9334) @firestarman
- Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
- Refactor cuIO timestamp processing with `cuda::std::chrono` (#9278) @PointKernel
- Various internal MultiIndex improvements (#9243) @vyasr
🐛 Bug Fixes
- Fix read_parquet bug for bytes input (#9669) @rjzamora
- Use `_gather` internal for `sort_*` (#9668) @isVoid
- Fix behavior of equals for non-DataFrame Frames and add tests. (#9653) @vyasr
- Don't recompute output size if it is already available (#9649) @abellina
- Fix read_parquet bug for extended dtypes from remote storage (#9638) @rjzamora
- add const when getting data from a JNI data wrapper (#9637) @wjxiz1992
- Fix debrotli issue on CUDA 11.5 (#9632) @vuule
- Use std::size_t when computing join output size (#9626) @jlowe
- Fix `usecols` parameter handling in `dask_cudf.read_csv` (#9618) @galipremsagar
- Add support for string `'nan', 'inf' & '-inf'` values while type-casting to `float` (#9613) @galipremsagar
- Avoid passing NativeFileDatasource to pyarrow in read_parquet (#9608) @rjzamora
- Fix test failure with cuda 11.5 in row_bit_count tests. (#9581) @nvdbaranec
- Correct _LIBCUDACXX_CUDACC_VER value computation (#9579) @robertmaynard
- Increase max RLE stream size estimate to avoid potential overflows (#9568) @vuule
- Fix edge case in tdigest scalar generation for groups containing all nulls. (#9551) @nvdbaranec
- Fix pytests failing in `cuda-11.5` environment (#9547) @galipremsagar
- compile libnvcomp with PTDS if requested (#9540) @jbrennan333
- Fix `segmented_gather()` for null LIST rows (#9537) @mythrocks
- Deprecate DataFrame.label_encoding, use private _label_encoding method internally. (#9535) @bdice
- Fix several test and benchmark issues related to bitmask allocations. (#9521) @nvdbaranec
- Fix for inserting duplicates in groupby result cache (#9508) @karthikeyann
- Fix mismatched types error in clip() when using non int64 numeric types (#9498) @davidwendt
- Match conda pinnings for style checks (revert part of #9412, #9433). (#9490) @bdice
- Make sure all dask-cudf supported aggs are handled in `_tree_node_agg` (#9487) @charlesbluca
- Resolve `hash_columns` `FutureWarning` in `dask_cudf` (#9481) @pentschev
- Add fixed point to AllTypes in libcudf unit tests (#9472) @karthikeyann
- Fix regex handling of embedded null characters (#9470) @davidwendt
- Fix memcheck error in copy-if-else (#9467) @davidwendt
- Fix bug in dask_cudf.read_parquet for index=False (#9453) @rjzamora
- Preserve the decimal scale when creating a default scalar (#9449) @revans2
- Push down parent nulls when flattening nested columns. (#9443) @mythrocks
- Fix memcheck error in gtest SegmentedGatherTest/GatherSliced (#9442) @davidwendt
- Revert "Fix quantile division / partition handling for dask-cudf sort… (#9438) @charlesbluca
- Allow int-like objects for the `decimals` argument in `round` (#9428) @shwina
- Fix stream compaction's `drop_duplicates` API to use stable sort (#9417) @ttnghia
- Skip Comparing Uniform Window Results in Var/std Tests (#9416) @isVoid
- Fix `StructColumn.to_pandas` type handling issues (#9388) @galipremsagar
- Correct issues in the build dir cudf-config.cmake (#9386) @robertmaynard
- Fix Java table partition test to account for non-deterministic ordering (#9385) @jlowe
- Fix timestamp truncation/overflow bugs in orc/parquet (#9382) @PointKernel
- Fix the crash in stats code (#9368) @devavret
- Make Series.hash_encode results reproducible. (#9366) @bdice
- Fix libcudf compile warnings on debug 11.4 build (#9360) @davidwendt
- Fail gracefully when compiling python UDFs that attempt to access columns with unsupported dtypes (#9359) @brandon-b-miller
- Set pass_filenames: false in mypy pre-commit configuration. (#9349) @bdice
- Fix cudf_assert in cudf::io::orc::gpu::gpuDecodeOrcColumnData (#9348) @davidwendt
- Fix memcheck error in groupby-tdigest get_scalar_minmax (#9339) @davidwendt
- Optimizations for `cudf.concat` when `axis=1` (#9333) @galipremsagar
- Use f-string in join helper warning message. (#9325) @bdice
- Avoid casting to list or struct dtypes in dask_cudf.read_parquet (#9314) @rjzamora
- Fix null count in statistics for parquet (#9303) @devavret
- Potential overflow of `decimal32` when casting to `int64_t` (#9287) @codereport
- Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259) @charlesbluca
- Updating cudf version also updates rapids cmake branch (#9249) @robertmaynard
- Implement `one_hot_encoding` in libcudf and bind to python (#9229) @isVoid
- BUG FIX: CSV Writer ignores the header parameter when no metadata is provided (#8740) @skirui-source
📖 Documentation
- Update Documentation to use `TYPED_TEST_SUITE` (#9654) @codereport
- Add dedicated page for `StringHandling` in python docs (#9624) @galipremsagar
- Update docstring of `DataFrame.merge` (#9572) @galipremsagar
- Use raw strings to avoid SyntaxErrors in parsed docstrings. (#9526) @bdice
- Add example to docstrings in `rolling.apply` (#9522) @isVoid
- Update help message to escape quotes in ./build.sh --cmake-args. (#9494) @bdice
- Improve Python docstring formatting. (#9493) @bdice
- Update table of I/O supported types (#9476) @vuule
- Document invalid regex patterns as undefined behavior (#9473) @davidwendt
- Miscellaneous documentation fixes to `cudf` (#9471) @galipremsagar
- Fix many documentation errors in libcudf. (#9355) @karthikeyann
- Fixing SubwordTokenizer docs issue (#9354) @mayankanand007
- Improved deprecation warnings. (#9347) @bdice
- doc reorder mr, stream to stream, mr (#9308) @karthikeyann
- Deprecate method parameters to DataFrame.join, DataFrame.merge. (#9291) @bdice
- Added deprecation warning for `.label_encoding()` (#9289) @mayankanand007
🚀 New Features
- Enable Series.divide and DataFrame.divide (#9630) @vyasr
- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
- Add handling of mixed numeric types in `to_dlpack` (#9585) @galipremsagar
- Support re.Pattern object for pat arg in str.replace (#9573) @davidwendt
- Add JNI for `lists::drop_list_duplicates` with keys-values input column (#9553) @ttnghia
- Support structs column in `min`, `max`, `argmin` and `argmax` groupby aggregate() and scan() (#9545) @ttnghia
- Move libcudacxx to use `rapids_cpm` and use newer versions (#9539) @robertmaynard
- Add scan min/max support for chrono types to libcudf reduction-scan (not groupby scan) (#9518) @davidwendt
- Support `args=` in `apply` (#9514) @brandon-b-miller
- Add groupby scan min/max support for strings values (#9502) @davidwendt
- Add list output option to character_ngrams() function (#9499) @davidwendt
- More granular column selection in ORC reader (#9496) @vuule
- add min_periods, ddof to groupby covariance, & correlation aggregation (#9492) @karthikeyann
- Implement Series.datetime.floor (#9488) @skirui-source
- Enable linting of CMake files using pre-commit (#9484) @vyasr
- Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
- Augment `order_by` to Accept a List of `null_precedence` (#9455) @isVoid
- Add format API for list column of strings (#9454) @davidwendt
- Enable Datetime/Timedelta dtypes in Masked UDFs (#9451) @brandon-b-miller
- Add cudf python groupby.diff (#9446) @karthikeyann
- Implement `lists::stable_sort_lists` for stable sorting of elements within each row of lists column (#9425) @ttnghia
- add ctest memcheck using cuda-sanitizer (#9414) @karthikeyann
- Support Unary Operations in Masked UDF (#9409) @isVoid
- Move Several Series Function to Frame (#9394) @isVoid
- MD5 Python hash API (#9390) @bdice
- Add cudf strings is_title API (#9380) @davidwendt
- Enable casting to int64, uint64, and double in AST code. (#9379) @vyasr
- Add support for writing ORC with map columns (#9369) @vuule
- extract_list_elements() with column_view indices (#9367) @mythrocks
- Reimplement `lists::drop_list_duplicates` for keys-values lists columns (#9345) @ttnghia
- Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
- JNI: Support nested types in ORC writer (#9334) @firestarman
- Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
- Add shallow hash function and shallow equality comparison for column_view (#9312) @karthikeyann
- Add CudaMemoryBuffer for cudaMalloc memory using RMM cuda_memory_resource (#9311) @rongou
- Add parameters to control row index stride and stripe size in ORC writer (#9310) @vuule
- Add `na_position` param to dask-cudf `sort_values` (#9264) @charlesbluca
- Add `ascending` parameter for dask-cudf `sort_values` (#9250) @charlesbluca
- New array conversion methods (#9236) @vyasr
- Series `apply` method backed by masked UDFs (#9217) @brandon-b-miller
- Grouping by frequency and resampling (#9178) @shwina
- Pure-python masked UDFs (#9174) @brandon-b-miller
- Add Covariance, Pearson correlation for sort groupby (libcudf) (#9154) @karthikeyann
- Add `calendrical_month_sequence` in c++ and `date_range` in python (#8886) @shwina
🛠️ Improvements
- Followup to PR 9088 comments (#9659) @cwharris
- Update cuCollections to version that supports installed libcudacxx (#9633) @robertmaynard
- Add `11.5` dev.yml to `cudf` (#9617) @galipremsagar
- Add `xfail` for parquet reader `11.5` issue (#9612) @galipremsagar
- remove deprecated Rmm.initialize method (#9607) @rongou
- Use HostColumnVectorCore for child columns in JCudfSerialization.unpackHostColumnVectors (#9596) @sperlingxx
- Set RMM pool to a fixed size in JNI (#9583) @rongou
- Use nvCOMP for Snappy compression/decompression (#9582) @vuule
- Build CUDA version agnostic packages for dask-cudf (#9578) @Ethyling
- Fixed tests warning: "TYPED_TEST_CASE is deprecated, please use TYPED_TEST_SUITE" (#9574) @ttnghia
- Enable CMake format in CI and fix style (#9570) @vyasr
- Add NVTX Start/End Ranges to JNI (#9563) @abellina
- Add librdkafka and python-confluent-kafka to dev conda environments s… (#9562) @jdye64
- Add offsets_begin/end() to strings_column_view (#9559) @davidwendt
- remove alignment options for RMM jni (#9550) @rongou
- Add axis parameter passthrough to `DataFrame` and `Series` take for pandas API compatibility (#9549) @dantegd
- Remove sizeof and standardize on memory_usage (#9544) @vyasr
- Adds cudaProfilerStart/cudaProfilerStop in JNI api (#9543) @abellina
- Generalize comparison binary operations (#9542) @vyasr
- Expose APIs to wrap CUDA or RMM allocations with a Java device buffer instance (#9538) @jlowe
- Add scan sum support for duration types to libcudf (#9536) @davidwendt
- Force inlining to improve AST performance (#9530) @vyasr
- Generalize some more indexed frame methods (#9529) @vyasr
- Add Java bindings for rolling window stddev aggregation (#9527) @razajafri
- catch rmm::out_of_memory exceptions in jni (#9525) @rongou
- Add an overload of `make_empty_column` with `type_id` parameter (#9524) @ttnghia
- Accelerate conditional inner joins with larger right tables (#9523) @vyasr
- Initial pass of generalizing `decimal` support in `cudf` python layer (#9517) @galipremsagar
- Cleanup for flattening nested columns (#9509) @rwlee
- Enable running tests using RMM arena and async memory resources (#9506) @rongou
- Remove dependency on six. (#9495) @bdice
- Cleanup some libcudf strings gtests (#9489) @davidwendt
- Rename strings/array_tests.cu to strings/array_tests.cpp (#9480) @davidwendt
- Refactor sorting APIs (#9464) @vyasr
- Implement DataFrame.hash_values, deprecate DataFrame.hash_columns. (#9458) @bdice
- Deprecate Series.hash_encode. (#9457) @bdice
- Update `conda` recipes for Enhanced Compatibility effort (#9456) @ajschmidt8
- Small clean up to simplify column selection code in ORC reader (#9444) @vuule
- add missing stream to scalar.is_valid() wherever stream is available (#9436) @karthikeyann
- Adds Deprecation Warnings to `one_hot_encoding` and Implement `get_dummies` with Cython API (#9435) @isVoid
- Update pre-commit hook URLs. (#9433) @bdice
- Remove pyarrow import in `dask_cudf.io.parquet` (#9429) @charlesbluca
- Miscellaneous improvements for UDFs (#9422) @isVoid
- Use pre-commit for CI (#9412) @vyasr
- Update to UCX-Py 0.23 (#9407) @pentschev
- Expose OutOfBoundsPolicy in JNI for Table.gather (#9406) @abellina
- Improvements to tdigest aggregation code. (#9403) @nvdbaranec
- Add Java API to deserialize a table to host columns (#9402) @jlowe
- Frame copy to use class instead of type() (#9397) @madsbk
- Change all DeprecationWarnings to FutureWarning. (#9392) @bdice
- Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
- Add IndexedFrame class and move SingleColumnFrame to a separate module (#9378) @vyasr
- Support Arrow NativeFile and PythonFile for remote ORC storage (#9377) @rjzamora
- Use Arrow PythonFile for remote CSV storage (#9376) @rjzamora
- Add multi-threaded writing to GDS writes (#9372) @devavret
- Miscellaneous column cleanup (#9370) @vyasr
- Use single kernel to extract all groups in cudf::strings::extract (#9358) @davidwendt
- Consolidate binary ops into `Frame` (#9357) @isVoid
- Move rank scan implementations from scan_inclusive.cu to rank_scan.cu (#9351) @davidwendt
- Remove usage of deprecated thrust::host_space_tag. (#9350) @bdice
- Use Default Memory Resource for Temporaries in `reduction.cpp` (#9344) @isVoid
- Fix Cython compilation warnings. (#9327) @bdice
- Fix some unused variable warnings in libcudf (#9326) @davidwendt
- Use optional-iterator for copy-if-else kernel (#9324) @davidwendt
- Remove Table class (#9315) @vyasr
- Unpin `dask` and `distributed` in CI (#9307) @galipremsagar
- Add optional-iterator support to indexalator (#9306) @davidwendt
- Consolidate more methods in Frame (#9305) @vyasr
- Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf (#9304) @rjzamora
- Pin mypy in .pre-commit-config.yaml to match conda environment pinning. (#9300) @bdice
- Use gather.hpp when gather-map exists in device memory (#9299) @davidwendt
- Fix Automerger for `Branch-21.12` from `branch-21.10` (#9285) @galipremsagar
- Refactor cuIO timestamp processing with `cuda::std::chrono` (#9278) @PointKernel
- Change strings copy_if_else to use optional-iterator instead of pair-iterator (#9266) @davidwendt
- Update cudf java bindings to 21.12.0-SNAPSHOT (#9248) @pxLi
- Various internal MultiIndex improvements (#9243) @vyasr
- Add detail interface for `split` and `slice(table_view)`, refactors both function with `host_span` (#9226) @isVoid
- Refactor MD5 implementation. (#9212) @bdice
- Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. (#9195) @karthikeyann
- Use nvcomp's snappy decompressor in avro reader (#9181) @devavret
- Add `isocalendar` API support (#9169) @marlenezw
- Simplify read_json by removing unnecessary reader/impl classes (#9088) @cwharris
- Simplify read_csv by removing unnecessary reader/impl classes (#9041) @cwharris
- Refactor hash join with cuCollections multimap (#8934) @PointKernel