rapidsai/cudf v21.08.00 on GitHub

🚨 Breaking Changes

Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
Remove unused cudf::strings::create_offsets (#8663) @davidwendt
Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
Change default datetime index resolution to ns to match pandas (#8611) @vyasr
Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
Add strings::repeat_strings API that can repeat each string a different number of times (#8561) @ttnghia
String-to-boolean conversion is different from Pandas (#8549) @skirui-source
Add accurate hash join size functions (#8453) @PointKernel
Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
Adapt cudf::scalar classes to changes in rmm::device_scalar (#8411) @harrism
Remove special Index class from the general index class hierarchy (#8309) @vyasr
Add first-class dtype utilities (#8308) @vyasr
ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
Upgrade arrow to 4.0.1 (#7495) @galipremsagar

🐛 Bug Fixes

Fix contains check in string column (#8834) @galipremsagar
Remove unused variable from row_bit_count_test. (#8829) @mythrocks
Fixes issue with null struct columns in ORC reader (#8819) @rgsl888prabhu
Set CMake vars for python/parquet support in libarrow builds (#8808) @vyasr
Handle empty child columns in row_bit_count() (#8791) @mythrocks
Revert "Remove cudf unneeded build time requirement of the cuda driver" (#8784) @robertmaynard
Fix isort error in utils.pyx (#8771) @charlesbluca
Handle sliced struct/list columns properly in concatenate() bounds checking. (#8760) @nvdbaranec
Fix issues with _CPackedColumns.serialize() handling of host and device data (#8759) @charlesbluca
Fix issues with MultiIndex in dropna, stack & reset_index (#8753) @galipremsagar
Write pandas extension types to parquet file metadata (#8749) @devavret
Fix where to handle DataFrame & Series input combination (#8747) @galipremsagar
Fix replace to handle null values correctly (#8744) @galipremsagar
Handle sliced structs properly in pack/contiguous_split. (#8739) @nvdbaranec
Fix issue in slice() where columns with a positive offset were computing null counts incorrectly. (#8738) @nvdbaranec
Fix cudf.Series constructor to handle list of sequences (#8735) @galipremsagar
Fix min/max sorted groupby aggregation on string column with nulls (argmin, argmax sentinel value missing on nulls) (#8731) @karthikeyann
Fix orc reader assert on create data_type in debug (#8706) @davidwendt
Fix min/max inclusive cudf::scan for strings column (#8705) @davidwendt
JNI: Fix driver version assertion logic in testGetCudaRuntimeInfo (#8701) @sperlingxx
Adding fix for skip_rows and crash in orc reader (#8700) @rgsl888prabhu
Bug fix: replace_nulls_policy functor not returning correct indices for gathermap (#8699) @isVoid
Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
Add post-processing steps to dask_cudf.groupby.CudfSeriesGroupby.aggregate (#8694) @charlesbluca
JNI build no longer looks for Arrow in conda environment (#8686) @jlowe
Handle arbitrarily different data in null list column rows when checking for equivalency. (#8666) @nvdbaranec
Add ConfigureNVBench to avoid concurrent main() entry points (#8662) @PointKernel
Pin *arrow to use *cuda in run (#8651) @jakirkham
Add proper support for tolerances in testing methods. (#8649) @vyasr
Support multi-char case conversion in capitalize function (#8647) @davidwendt
Fix repeated mangled names in read_csv with duplicate column names (#8645) @karthikeyann
Temporarily disable libcudf example build tests (#8642) @isVoid
Use conda-sourced cudf artifacts for libcudf example in CI (#8638) @isVoid
Ensure dev environment uses Arrow GPU packages (#8637) @charlesbluca
Fix bug where columns were only initialized once when both columns and index are specified in the DataFrame constructor (#8628) @isVoid
Propagate **kwargs through to as_*_column methods (#8618) @shwina
Fix orc_reader_benchmark.cpp compile error (#8609) @davidwendt
Fix missed renumbering of Aggregation values (#8600) @revans2
Update cmake to 3.20.5 in the Java Docker image (#8593) @NvTimLiu
Fix bug in replace_with_backrefs when group has greedy quantifier (#8575) @davidwendt
Apply metadata to keys before returning in Frame._encode (#8560) @charlesbluca
Fix for strings containing special JSON characters in get_json_object(). (#8556) @nvdbaranec
Fix debug compile error in gather_struct_tests.cpp (#8554) @davidwendt
String-to-boolean conversion is different from Pandas (#8549) @skirui-source
Fix __repr__ output when display.max_rows is None (#8547) @galipremsagar
Fix size passed to column constructors in _with_type_metadata (#8539) @shwina
Properly retrieve last column when -1 is specified for column index (#8529) @isVoid
Fix importing apply from dask (#8517) @galipremsagar
Fix offset of the string dictionary length stream (#8515) @vuule
Fix double counting of selected columns in CSV reader (#8508) @ochan1
Incorrect map size in scatter_to_gather corrupts struct columns (#8507) @gerashegalov
replace_nulls properly propagates memory resource to gather calls (#8500) @robertmaynard
Disallow groupby aggs for StructColumns (#8499) @charlesbluca
Fixes out-of-bounds access for small files in unzip (#8498) @elstehle
Adding support for writing empty dataframe (#8490) @shaneding
Fix exclusive scan when including nulls and improve testing (#8478) @harrism
Add workaround for crash in libcudf debug build using output_indexalator in thrust::lower_bound (#8432) @davidwendt
Install only the same Thrust files that Thrust itself installs (#8420) @robertmaynard
Add nightly version for ucx-py in ci script (#8419) @galipremsagar
Fix null_equality config of rolling_collect_set (#8415) @sperlingxx
CollectSetAggregation: implement RollingAggregation interface (#8406) @sperlingxx
Handle pre-sliced nested columns in contiguous_split. (#8391) @nvdbaranec
Fix bitmask_tests.cpp host accessing device memory (#8370) @davidwendt
Fix concurrent_unordered_map to prevent accessing padding bits in pair_type (#8348) @davidwendt
BUG FIX: Raise appropriate strings error when concatenating strings column (#8290) @skirui-source
Make gpuCI and pre-commit style configurations consistent (#8215) @charlesbluca
Add collect list to dask-cudf groupby aggregations (#8045) @charlesbluca

📖 Documentation

Update Python UDFs notebook (#8810) @brandon-b-miller
Fix dask.dataframe API docs links after reorg (#8772) @jsignell
Fix instructions for running cuDF/dask-cuDF tests in CONTRIBUTING.md (#8724) @shwina
Translate Markdown documentation to rST and remove recommonmark (#8698) @vyasr
Fixed spelling mistakes in libcudf documentation (#8664) @karthikeyann
Custom Sphinx Extension: PandasCompat (#8643) @isVoid
Fix README.md (#8535) @ajschmidt8
Change namespace contains_nulls to struct (#8523) @davidwendt
Add info about NVTX ranges to dev guide (#8461) @jrhemstad
Fixed documentation bug in groupby agg method (#8325) @ahmet-uyar

🚀 New Features

Fix concatenating structs (#8811) @shaneding
Implement JNI for groupby aggregations M2 and MERGE_M2 (#8763) @ttnghia
Bump isort to 5.6.4 and remove isort overrides made for 5.0.7 (#8755) @charlesbluca
Implement __setitem__ for StructColumn (#8737) @shaneding
Add is_leap_year to DateTimeProperties and DatetimeIndex (#8736) @isVoid
Add struct.explode() method (#8729) @shwina
Add DataFrame.to_struct() method to convert a DataFrame to a struct Series (#8728) @shwina
Add support for list type in ORC writer (#8723) @vuule
Fix slicing from struct columns and accessing struct columns (#8719) @shaneding
Add datetime::is_leap_year (#8711) @isVoid
Accessing struct columns from dask_cudf (#8675) @shaneding
Added pct_change to Series (#8650) @TravisHester
Add strings support to cudf::shift function (#8648) @davidwendt
Support Scatter struct_scalar (#8630) @isVoid
Struct scalar from host dictionary (#8629) @shaneding
Add dayofyear and day_of_year to Series, DatetimeColumn, and DatetimeIndex (#8626) @beckernick
JNI support for capitalize (#8624) @firestarman
Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
Add NVBench in CMake (#8619) @PointKernel
Change default datetime index resolution to ns to match pandas (#8611) @vyasr
ListColumn __setitem__ (#8606) @brandon-b-miller
Implement groupby aggregations M2 and MERGE_M2 (#8605) @ttnghia
Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
Adding support for list and struct type in ORC Reader (#8599) @rgsl888prabhu
Benchmark for strings::repeat_strings APIs (#8589) @ttnghia
Nested scalar support for copy if else (#8588) @gerashegalov
User specified decimal columns to float64 (#8587) @jdye64
Add get_element for struct column (#8578) @isVoid
Python changes for adding __getitem__ for struct (#8577) @shaneding
Add strings::repeat_strings API that can repeat each string a different number of times (#8561) @ttnghia
Refactor tests/iterator_utilities.hpp functions (#8540) @ttnghia
Support MERGE_LISTS and MERGE_SETS in Java package (#8516) @sperlingxx
Decimal support in CSV reader (#8511) @elstehle
Add column type tests (#8505) @isVoid
Warn when downscaling decimal columns (#8492) @ChrisJar
Add JNI for strings::repeat_strings (#8491) @ttnghia
Add Index.get_loc for Numerical, String Index support (#8489) @isVoid
Expose half_up rounding in cuDF (#8477) @shwina
Java APIs to fetch CUDA runtime info (#8465) @sperlingxx
Add str.edit_distance_matrix (#8463) @isVoid
Support constructing cudf.Scalar objects from host side lists (#8459) @brandon-b-miller
Add accurate hash join size functions (#8453) @PointKernel
Add cudf::strings::integer_to_hex convert API (#8450) @davidwendt
Create objects from iterables that contain cudf.NA (#8442) @brandon-b-miller
JNI bindings for sort_lists (#8439) @sperlingxx
Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
Replace all_null() and all_valid() by iterator_all_nulls() and iterator_no_null() in tests (#8437) @ttnghia
Implement groupby MERGE_LISTS and MERGE_SETS aggregates (#8436) @ttnghia
Add public libcudf match_dictionaries API (#8429) @davidwendt
Add move constructors for string_scalar and struct_scalar (#8428) @ttnghia
Implement strings::repeat_strings (#8423) @ttnghia
STRUCT column support for cudf::merge. (#8422) @nvdbaranec
Implement reverse in libcudf (#8410) @shaneding
Support multiple input files/buffers for read_json (#8403) @jdye64
Improve test coverage for struct search (#8396) @ttnghia
Add groupby.fillna (#8362) @isVoid
Enable AST-based joining (#8214) @vyasr
Generalized null support in user defined functions (#8213) @brandon-b-miller
Add compiled binary operation (#8192) @karthikeyann
Implement .describe() for DataFrameGroupBy (#8179) @skirui-source
ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
Add Python bindings for lists::concatenate_list_elements and expose them as .list.concat() (#8006) @shwina
Use Arrow URI FileSystem backed instance to retrieve remote files (#7709) @jdye64
Example to build custom application and link to libcudf (#7671) @isVoid
Upgrade arrow to 4.0.1 (#7495) @galipremsagar

🛠️ Improvements

Provide a better error message when CUDA::cuda_driver not found (#8794) @robertmaynard
Remove anonymous namespace from null_mask.cuh (#8786) @nvdbaranec
Allow cudf to be built without libcuda.so existing (#8751) @robertmaynard
Pin mimesis to <4.1 (#8745) @galipremsagar
Update conda environment name for CI (#8692) @ajschmidt8
Remove flatbuffers dependency (#8671) @Ethyling
Add options to build Arrow with Python and Parquet support (#8670) @trxcllnt
Remove unused cudf::strings::create_offsets (#8663) @davidwendt
Update GDS lib version to 1.0.0 (#8654) @pxLi
Support for groupby/scan rank and dense_rank aggregations (#8652) @rwlee
Fix usage of deprecated arrow ipc API (#8632) @revans2
Use absolute imports in cudf (#8631) @galipremsagar
ENH Add Java CI build script (#8627) @dillon-cullinan
Add DeprecationWarning to ser.str.subword_tokenize (#8603) @VibhuJawa
Rewrite binary operations for improved performance and additional type support (#8598) @vyasr
Fix mypy errors surfacing because of numpy-1.21.0 (#8595) @galipremsagar
Remove unneeded includes from cudf::string_view headers (#8594) @davidwendt
Use cmake 3.20.1 as it is now required by rmm (#8586) @robertmaynard
Remove device debug symbols from cmake CUDF_CUDA_FLAGS (#8584) @davidwendt
Dask-CuDF: use default Dask Dataframe optimizer (#8581) @madsbk
Remove checking if an unsigned value is less than zero (#8579) @robertmaynard
Remove strings_count parameter from cudf::strings::detail::create_chars_child_column (#8576) @davidwendt
Make cudf.api.types imports consistent (#8571) @galipremsagar
Modernize libcudf basic example CMakeFile; updates CI build tests (#8568) @isVoid
Rename concatenate_tests.cu to .cpp (#8555) @davidwendt
Enable window lead/lag test on struct (#8548) @wbo4958
Add Java methods to split and write column views (#8546) @razajafri
Small cleanup (#8534) @codereport
Unpin dask version in CI (#8533) @galipremsagar
Added optional flag for building Arrow with S3 filesystem support (#8531) @jdye64
Minor clean up of various internal column and frame utilities (#8528) @vyasr
Rename some copying_test source files .cu to .cpp (#8527) @davidwendt
Correct the last warnings and issues when using newer cuda versions (#8525) @robertmaynard
Correct unused parameter warnings in transform and unary ops (#8521) @robertmaynard
Correct unused parameter warnings in string algorithms (#8509) @robertmaynard
Add in JNI APIs for scan, replace_nulls, group_by.scan, and group_by.replace_nulls (#8503) @revans2
Fix 21.08 forward-merge conflicts (#8502) @ajschmidt8
Fix Cython formatting command in Contributing.md. (#8496) @marlenezw
Bug/correct unused parameters in reshape and text (#8495) @robertmaynard
Correct unused parameter warnings in partitioning and stream compact (#8494) @robertmaynard
Correct unused parameter warnings in labelling and list algorithms (#8493) @robertmaynard
Refactor index construction (#8485) @vyasr
Correct unused parameter warnings in replace algorithms (#8483) @robertmaynard
Correct unused parameter warnings in reduction algorithms (#8481) @robertmaynard
Correct unused parameter warnings in io algorithms (#8480) @robertmaynard
Correct unused parameter warnings in interop algorithms (#8479) @robertmaynard
Correct unused parameter warnings in filling algorithms (#8468) @robertmaynard
Correct unused parameter warnings in groupby (#8467) @robertmaynard
Use libcu++ time_point as timestamp (#8466) @karthikeyann
Modify reprog_device::extract to return groups in a single pass (#8460) @davidwendt
Update minimum Dask requirement to 2021.6.0 (#8458) @pentschev
Fix failures when performing binary operations on DataFrames with empty columns (#8452) @ChrisJar
Fix conflicts in 8447 (#8448) @ajschmidt8
Add serialization methods for List and StructDtype (#8441) @charlesbluca
Replace make_empty_strings_column with make_empty_column (#8435) @davidwendt
JNI bindings for get_element (#8433) @revans2
Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
Unpin dask version on CI (#8425) @galipremsagar
Add benchmark for strings/fixed_point convert APIs (#8417) @davidwendt
Adapt cudf::scalar classes to changes in rmm::device_scalar (#8411) @harrism
Add benchmark for strings/integers convert APIs (#8402) @davidwendt
Enable multi-file partitioning in dask_cudf.read_parquet (#8393) @rjzamora
Correct unused parameter warnings in rolling algorithms (#8390) @robertmaynard
Correct unused parameters in column round and search (#8389) @robertmaynard
Add functionality to apply Dtype metadata to ColumnBase (#8373) @charlesbluca
Refactor setting stack size in regex code (#8358) @davidwendt
Update Java bindings to 21.08-SNAPSHOT (#8344) @pxLi
Replace remaining uses of device_vector (#8343) @harrism
Statically link libnvcomp into libcudfjni (#8334) @jlowe
Resolve auto merge conflicts for Branch 21.08 from branch 21.06 (#8329) @galipremsagar
Minor code refactor for sorted_order (#8326) @wbo4958
Remove special Index class from the general index class hierarchy (#8309) @vyasr
Add first-class dtype utilities (#8308) @vyasr
Add option to link Java bindings with Arrow dynamically (#8307) @jlowe
Refactor ColumnMethods and its subclasses to remove column argument and require parent argument (#8306) @shwina
Refactor scatter for list columns (#8255) @isVoid
Expose pack/unpack API to Python (#8153) @charlesbluca
Adding cudf.cut method (#8002) @marlenezw
Optimize string gather performance for large strings (#7980) @gaohao95
Add peak memory usage tracking to cuIO benchmarks (#7770) @devavret
Updating Clang Version to 11.0.0 (#6695) @codereport