rapidsai/cudf v21.12.00 on GitHub

🚨 Breaking Changes

Update bitmask_and and bitmask_or to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
Remove sizeof and standardize on memory_usage (#9544) @vyasr
Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
Refactor sorting APIs (#9464) @vyasr
Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
JNI: Support nested types in ORC writer (#9334) @firestarman
Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
Refactor cuIO timestamp processing with cuda::std::chrono (#9278) @PointKernel
Various internal MultiIndex improvements (#9243) @vyasr
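
The first entry above changes what `bitmask_and` and `bitmask_or` return: a pair holding the resulting mask and the count of unset bits. The sketch below shows only that return contract, in plain host-side Python. It is not the libcudf C++ API; the name `bitmask_and_sketch` and the list-of-bools mask representation are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def bitmask_and_sketch(masks: List[List[bool]]) -> Tuple[List[bool], int]:
    """Hypothetical illustration: AND several validity masks element-wise
    and return (resulting mask, count of unset bits)."""
    if not masks:
        return [], 0
    n = len(masks[0])
    if any(len(m) != n for m in masks):
        raise ValueError("all masks must have the same length")
    # A row stays valid only if it is valid in every input mask.
    result = [all(bits) for bits in zip(*masks)]
    # The unset-bit count comes back with the mask, so callers need no second pass.
    unset_count = result.count(False)
    return result, unset_count


mask, null_count = bitmask_and_sketch([
    [True, True, False, True],
    [True, False, True, True],
])
print(mask, null_count)
```

Returning the count with the mask presumably spares callers a separate counting step. Code that previously consumed only a mask would need to unpack the pair.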

🐛 Bug Fixes

Fix read_parquet bug for bytes input (#9669) @rjzamora
Use _gather internal for sort_* (#9668) @isVoid
Fix behavior of equals for non-DataFrame Frames and add tests. (#9653) @vyasr
Don't recompute output size if it is already available (#9649) @abellina
Fix read_parquet bug for extended dtypes from remote storage (#9638) @rjzamora
add const when getting data from a JNI data wrapper (#9637) @wjxiz1992
Fix debrotli issue on CUDA 11.5 (#9632) @vuule
Use std::size_t when computing join output size (#9626) @jlowe
Fix usecols parameter handling in dask_cudf.read_csv (#9618) @galipremsagar
Add support for string 'nan', 'inf' & '-inf' values while type-casting to float (#9613) @galipremsagar
Avoid passing NativeFileDatasource to pyarrow in read_parquet (#9608) @rjzamora
Fix test failure with cuda 11.5 in row_bit_count tests. (#9581) @nvdbaranec
Correct _LIBCUDACXX_CUDACC_VER value computation (#9579) @robertmaynard
Increase max RLE stream size estimate to avoid potential overflows (#9568) @vuule
Fix edge case in tdigest scalar generation for groups containing all nulls. (#9551) @nvdbaranec
Fix pytests failing in cuda-11.5 environment (#9547) @galipremsagar
compile libnvcomp with PTDS if requested (#9540) @jbrennan333
Fix segmented_gather() for null LIST rows (#9537) @mythrocks
Deprecate DataFrame.label_encoding, use private _label_encoding method internally. (#9535) @bdice
Fix several test and benchmark issues related to bitmask allocations. (#9521) @nvdbaranec
Fix for inserting duplicates in groupby result cache (#9508) @karthikeyann
Fix mismatched types error in clip() when using non int64 numeric types (#9498) @davidwendt
Match conda pinnings for style checks (revert part of #9412, #9433). (#9490) @bdice
Make sure all dask-cudf supported aggs are handled in _tree_node_agg (#9487) @charlesbluca
Resolve hash_columns FutureWarning in dask_cudf (#9481) @pentschev
Add fixed point to AllTypes in libcudf unit tests (#9472) @karthikeyann
Fix regex handling of embedded null characters (#9470) @davidwendt
Fix memcheck error in copy-if-else (#9467) @davidwendt
Fix bug in dask_cudf.read_parquet for index=False (#9453) @rjzamora
Preserve the decimal scale when creating a default scalar (#9449) @revans2
Push down parent nulls when flattening nested columns. (#9443) @mythrocks
Fix memcheck error in gtest SegmentedGatherTest/GatherSliced (#9442) @davidwendt
Revert "Fix quantile division / partition handling for dask-cudf sort… (#9438) @charlesbluca
Allow int-like objects for the decimals argument in round (#9428) @shwina
Fix stream compaction's drop_duplicates API to use stable sort (#9417) @ttnghia
Skip Comparing Uniform Window Results in Var/std Tests (#9416) @isVoid
Fix StructColumn.to_pandas type handling issues (#9388) @galipremsagar
Correct issues in the build dir cudf-config.cmake (#9386) @robertmaynard
Fix Java table partition test to account for non-deterministic ordering (#9385) @jlowe
Fix timestamp truncation/overflow bugs in orc/parquet (#9382) @PointKernel
Fix the crash in stats code (#9368) @devavret
Make Series.hash_encode results reproducible. (#9366) @bdice
Fix libcudf compile warnings on debug 11.4 build (#9360) @davidwendt
Fail gracefully when compiling python UDFs that attempt to access columns with unsupported dtypes (#9359) @brandon-b-miller
Set pass_filenames: false in mypy pre-commit configuration. (#9349) @bdice
Fix cudf_assert in cudf::io::orc::gpu::gpuDecodeOrcColumnData (#9348) @davidwendt
Fix memcheck error in groupby-tdigest get_scalar_minmax (#9339) @davidwendt
Optimizations for cudf.concat when axis=1 (#9333) @galipremsagar
Use f-string in join helper warning message. (#9325) @bdice
Avoid casting to list or struct dtypes in dask_cudf.read_parquet (#9314) @rjzamora
Fix null count in statistics for parquet (#9303) @devavret
Potential overflow of decimal32 when casting to int64_t (#9287) @codereport
Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259) @charlesbluca
Updating cudf version also updates rapids cmake branch (#9249) @robertmaynard
Implement one_hot_encoding in libcudf and bind to python (#9229) @isVoid
BUG FIX: CSV Writer ignores the header parameter when no metadata is provided (#8740) @skirui-source

📖 Documentation

Update Documentation to use TYPED_TEST_SUITE (#9654) @codereport
Add dedicated page for StringHandling in python docs (#9624) @galipremsagar
Update docstring of DataFrame.merge (#9572) @galipremsagar
Use raw strings to avoid SyntaxErrors in parsed docstrings. (#9526) @bdice
Add example to docstrings in rolling.apply (#9522) @isVoid
Update help message to escape quotes in ./build.sh --cmake-args. (#9494) @bdice
Improve Python docstring formatting. (#9493) @bdice
Update table of I/O supported types (#9476) @vuule
Document invalid regex patterns as undefined behavior (#9473) @davidwendt
Miscellaneous documentation fixes to cudf (#9471) @galipremsagar
Fix many documentation errors in libcudf. (#9355) @karthikeyann
Fixing SubwordTokenizer docs issue (#9354) @mayankanand007
Improved deprecation warnings. (#9347) @bdice
doc reorder mr, stream to stream, mr (#9308) @karthikeyann
Deprecate method parameters to DataFrame.join, DataFrame.merge. (#9291) @bdice
Added deprecation warning for .label_encoding() (#9289) @mayankanand007

🚀 New Features

Enable Series.divide and DataFrame.divide (#9630) @vyasr
Update bitmask_and and bitmask_or to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
Add handling of mixed numeric types in to_dlpack (#9585) @galipremsagar
Support re.Pattern object for pat arg in str.replace (#9573) @davidwendt
Add JNI for lists::drop_list_duplicates with keys-values input column (#9553) @ttnghia
Support structs column in min, max, argmin and argmax groupby aggregate() and scan() (#9545) @ttnghia
Move libcudacxx to use rapids_cpm and use newer versions (#9539) @robertmaynard
Add scan min/max support for chrono types to libcudf reduction-scan (not groupby scan) (#9518) @davidwendt
Support args= in apply (#9514) @brandon-b-miller
Add groupby scan min/max support for strings values (#9502) @davidwendt
Add list output option to character_ngrams() function (#9499) @davidwendt
More granular column selection in ORC reader (#9496) @vuule
add min_periods, ddof to groupby covariance, & correlation aggregation (#9492) @karthikeyann
Implement Series.datetime.floor (#9488) @skirui-source
Enable linting of CMake files using pre-commit (#9484) @vyasr
Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
Augment order_by to Accept a List of null_precedence (#9455) @isVoid
Add format API for list column of strings (#9454) @davidwendt
Enable Datetime/Timedelta dtypes in Masked UDFs (#9451) @brandon-b-miller
Add cudf python groupby.diff (#9446) @karthikeyann
Implement lists::stable_sort_lists for stable sorting of elements within each row of lists column (#9425) @ttnghia
add ctest memcheck using cuda-sanitizer (#9414) @karthikeyann
Support Unary Operations in Masked UDF (#9409) @isVoid
Move Several Series Function to Frame (#9394) @isVoid
MD5 Python hash API (#9390) @bdice
Add cudf strings is_title API (#9380) @davidwendt
Enable casting to int64, uint64, and double in AST code. (#9379) @vyasr
Add support for writing ORC with map columns (#9369) @vuule
extract_list_elements() with column_view indices (#9367) @mythrocks
Reimplement lists::drop_list_duplicates for keys-values lists columns (#9345) @ttnghia
Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
JNI: Support nested types in ORC writer (#9334) @firestarman
Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
Add shallow hash function and shallow equality comparison for column_view (#9312) @karthikeyann
Add CudaMemoryBuffer for cudaMalloc memory using RMM cuda_memory_resource (#9311) @rongou
Add parameters to control row index stride and stripe size in ORC writer (#9310) @vuule
Add na_position param to dask-cudf sort_values (#9264) @charlesbluca
Add ascending parameter for dask-cudf sort_values (#9250) @charlesbluca
New array conversion methods (#9236) @vyasr
Series apply method backed by masked UDFs (#9217) @brandon-b-miller
Grouping by frequency and resampling (#9178) @shwina
Pure-python masked UDFs (#9174) @brandon-b-miller
Add Covariance, Pearson correlation for sort groupby (libcudf) (#9154) @karthikeyann
Add calendrical_month_sequence in c++ and date_range in python (#8886) @shwina
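
For readers unfamiliar with the anchor distinction in the `contains_re` entry (#9482), the difference between single-line and multi-line handling of `^` and `$` can be shown with Python's standard `re` module. This illustrates the general regex concept only. These notes do not say how `contains_re` exposes the choice, and nothing below calls cudf.

```python
import re

text = "abc\ndef"

# Default (single-line) anchors: ^ matches only at the start of the whole
# string, so "def" on the second line is not found.
single_line = bool(re.search(r"^def$", text))

# Multi-line anchors: ^ and $ also match at each line boundary.
multi_line = bool(re.search(r"^def$", text, flags=re.MULTILINE))

print(single_line, multi_line)
```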

🛠️ Improvements

Followup to PR 9088 comments (#9659) @cwharris
Update cuCollections to version that supports installed libcudacxx (#9633) @robertmaynard
Add 11.5 dev.yml to cudf (#9617) @galipremsagar
Add xfail for parquet reader 11.5 issue (#9612) @galipremsagar
remove deprecated Rmm.initialize method (#9607) @rongou
Use HostColumnVectorCore for child columns in JCudfSerialization.unpackHostColumnVectors (#9596) @sperlingxx
Set RMM pool to a fixed size in JNI (#9583) @rongou
Use nvCOMP for Snappy compression/decompression (#9582) @vuule
Build CUDA version agnostic packages for dask-cudf (#9578) @Ethyling
Fixed tests warning: "TYPED_TEST_CASE is deprecated, please use TYPED_TEST_SUITE" (#9574) @ttnghia
Enable CMake format in CI and fix style (#9570) @vyasr
Add NVTX Start/End Ranges to JNI (#9563) @abellina
Add librdkafka and python-confluent-kafka to dev conda environments s… (#9562) @jdye64
Add offsets_begin/end() to strings_column_view (#9559) @davidwendt
remove alignment options for RMM jni (#9550) @rongou
Add axis parameter passthrough to DataFrame and Series take for pandas API compatibility (#9549) @dantegd
Remove sizeof and standardize on memory_usage (#9544) @vyasr
Adds cudaProfilerStart/cudaProfilerStop in JNI api (#9543) @abellina
Generalize comparison binary operations (#9542) @vyasr
Expose APIs to wrap CUDA or RMM allocations with a Java device buffer instance (#9538) @jlowe
Add scan sum support for duration types to libcudf (#9536) @davidwendt
Force inlining to improve AST performance (#9530) @vyasr
Generalize some more indexed frame methods (#9529) @vyasr
Add Java bindings for rolling window stddev aggregation (#9527) @razajafri
catch rmm::out_of_memory exceptions in jni (#9525) @rongou
Add an overload of make_empty_column with type_id parameter (#9524) @ttnghia
Accelerate conditional inner joins with larger right tables (#9523) @vyasr
Initial pass of generalizing decimal support in cudf python layer (#9517) @galipremsagar
Cleanup for flattening nested columns (#9509) @rwlee
Enable running tests using RMM arena and async memory resources (#9506) @rongou
Remove dependency on six. (#9495) @bdice
Cleanup some libcudf strings gtests (#9489) @davidwendt
Rename strings/array_tests.cu to strings/array_tests.cpp (#9480) @davidwendt
Refactor sorting APIs (#9464) @vyasr
Implement DataFrame.hash_values, deprecate DataFrame.hash_columns. (#9458) @bdice
Deprecate Series.hash_encode. (#9457) @bdice
Update conda recipes for Enhanced Compatibility effort (#9456) @ajschmidt8
Small clean up to simplify column selection code in ORC reader (#9444) @vuule
add missing stream to scalar.is_valid() wherever stream is available (#9436) @karthikeyann
Adds Deprecation Warnings to one_hot_encoding and Implement get_dummies with Cython API (#9435) @isVoid
Update pre-commit hook URLs. (#9433) @bdice
Remove pyarrow import in dask_cudf.io.parquet (#9429) @charlesbluca
Miscellaneous improvements for UDFs (#9422) @isVoid
Use pre-commit for CI (#9412) @vyasr
Update to UCX-Py 0.23 (#9407) @pentschev
Expose OutOfBoundsPolicy in JNI for Table.gather (#9406) @abellina
Improvements to tdigest aggregation code. (#9403) @nvdbaranec
Add Java API to deserialize a table to host columns (#9402) @jlowe
Frame copy to use class instead of type() (#9397) @madsbk
Change all DeprecationWarnings to FutureWarning. (#9392) @bdice
Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
Add IndexedFrame class and move SingleColumnFrame to a separate module (#9378) @vyasr
Support Arrow NativeFile and PythonFile for remote ORC storage (#9377) @rjzamora
Use Arrow PythonFile for remote CSV storage (#9376) @rjzamora
Add multi-threaded writing to GDS writes (#9372) @devavret
Miscellaneous column cleanup (#9370) @vyasr
Use single kernel to extract all groups in cudf::strings::extract (#9358) @davidwendt
Consolidate binary ops into Frame (#9357) @isVoid
Move rank scan implementations from scan_inclusive.cu to rank_scan.cu (#9351) @davidwendt
Remove usage of deprecated thrust::host_space_tag. (#9350) @bdice
Use Default Memory Resource for Temporaries in reduction.cpp (#9344) @isVoid
Fix Cython compilation warnings. (#9327) @bdice
Fix some unused variable warnings in libcudf (#9326) @davidwendt
Use optional-iterator for copy-if-else kernel (#9324) @davidwendt
Remove Table class (#9315) @vyasr
Unpin dask and distributed in CI (#9307) @galipremsagar
Add optional-iterator support to indexalator (#9306) @davidwendt
Consolidate more methods in Frame (#9305) @vyasr
Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf (#9304) @rjzamora
Pin mypy in .pre-commit-config.yaml to match conda environment pinning. (#9300) @bdice
Use gather.hpp when gather-map exists in device memory (#9299) @davidwendt
Fix Automerger for Branch-21.12 from branch-21.10 (#9285) @galipremsagar
Refactor cuIO timestamp processing with cuda::std::chrono (#9278) @PointKernel
Change strings copy_if_else to use optional-iterator instead of pair-iterator (#9266) @davidwendt
Update cudf java bindings to 21.12.0-SNAPSHOT (#9248) @pxLi
Various internal MultiIndex improvements (#9243) @vyasr
Add detail interface for split and slice(table_view), refactor both functions with host_span (#9226) @isVoid
Refactor MD5 implementation. (#9212) @bdice
Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. (#9195) @karthikeyann
Use nvcomp's snappy decompressor in avro reader (#9181) @devavret
Add isocalendar API support (#9169) @marlenezw
Simplify read_json by removing unnecessary reader/impl classes (#9088) @cwharris
Simplify read_csv by removing unnecessary reader/impl classes (#9041) @cwharris
Refactor hash join with cuCollections multimap (#8934) @PointKernel