🚨 Breaking Changes
- Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
- Remove unused cudf::strings::create_offsets (#8663) @davidwendt
- Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
- Change default datetime index resolution to ns to match pandas (#8611) @vyasr
- Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
- Add `strings::repeat_strings` API that can repeat each string a different number of times (#8561) @ttnghia
- String-to-boolean conversion is different from Pandas (#8549) @skirui-source
- Add accurate hash join size functions (#8453) @PointKernel
- Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
- Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
- Adapt `cudf::scalar` classes to changes in `rmm::device_scalar` (#8411) @harrism
- Remove special Index class from the general index class hierarchy (#8309) @vyasr
- Add first-class dtype utilities (#8308) @vyasr
- ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
- Upgrade arrow to 4.0.1 (#7495) @galipremsagar
🐛 Bug Fixes
- Fix `contains` check in string column (#8834) @galipremsagar
- Remove unused variable from `row_bit_count_test`. (#8829) @mythrocks
- Fixes issue with null struct columns in ORC reader (#8819) @rgsl888prabhu
- Set CMake vars for python/parquet support in libarrow builds (#8808) @vyasr
- Handle empty child columns in row_bit_count() (#8791) @mythrocks
- Revert "Remove cudf unneeded build time requirement of the cuda driver" (#8784) @robertmaynard
- Fix isort error in utils.pyx (#8771) @charlesbluca
- Handle sliced struct/list columns properly in concatenate() bounds checking. (#8760) @nvdbaranec
- Fix issues with `_CPackedColumns.serialize()` handling of host and device data (#8759) @charlesbluca
- Fix issues with `MultiIndex` in `dropna`, `stack` & `reset_index` (#8753) @galipremsagar
- Write pandas extension types to parquet file metadata (#8749) @devavret
- Fix `where` to handle `DataFrame` & `Series` input combination (#8747) @galipremsagar
- Fix `replace` to handle null values correctly (#8744) @galipremsagar
- Handle sliced structs properly in pack/contiguous_split. (#8739) @nvdbaranec
- Fix issue in slice() where columns with a positive offset were computing null counts incorrectly. (#8738) @nvdbaranec
- Fix `cudf.Series` constructor to handle list of sequences (#8735) @galipremsagar
- Fix min/max sorted groupby aggregation on string column with nulls (argmin, argmax sentinel value missing on nulls) (#8731) @karthikeyann
- Fix orc reader assert on create data_type in debug (#8706) @davidwendt
- Fix min/max inclusive cudf::scan for strings column (#8705) @davidwendt
- JNI: Fix driver version assertion logic in testGetCudaRuntimeInfo (#8701) @sperlingxx
- Adding fix for skip_rows and crash in orc reader (#8700) @rgsl888prabhu
- Bug fix: `replace_nulls_policy` functor not returning correct indices for gathermap (#8699) @isVoid
- Fix a crash in pack() when being handed tables with no columns. (#8697) @nvdbaranec
- Add post-processing steps to `dask_cudf.groupby.CudfSeriesGroupby.aggregate` (#8694) @charlesbluca
- JNI build no longer looks for Arrow in conda environment (#8686) @jlowe
- Handle arbitrarily different data in null list column rows when checking for equivalency. (#8666) @nvdbaranec
- Add ConfigureNVBench to avoid concurrent main() entry points (#8662) @PointKernel
- Pin `*arrow` to use `*cuda` in `run` (#8651) @jakirkham
- Add proper support for tolerances in testing methods. (#8649) @vyasr
- Support multi-char case conversion in capitalize function (#8647) @davidwendt
- Fix repeated mangled names in read_csv with duplicate column names (#8645) @karthikeyann
- Temporarily disable libcudf example build tests (#8642) @isVoid
- Use conda-sourced cudf artifacts for libcudf example in CI (#8638) @isVoid
- Ensure dev environment uses Arrow GPU packages (#8637) @charlesbluca
- Fix bug that columns only initialized once when specified `columns` and `index` in dataframe ctor (#8628) @isVoid
- Propagate **kwargs through to as_*_column methods (#8618) @shwina
- Fix orc_reader_benchmark.cpp compile error (#8609) @davidwendt
- Fix missed renumbering of Aggregation values (#8600) @revans2
- Update cmake to 3.20.5 in the Java Docker image (#8593) @NvTimLiu
- Fix bug in replace_with_backrefs when group has greedy quantifier (#8575) @davidwendt
- Apply metadata to keys before returning in `Frame._encode` (#8560) @charlesbluca
- Fix for strings containing special JSON characters in get_json_object(). (#8556) @nvdbaranec
- Fix debug compile error in gather_struct_tests.cpp (#8554) @davidwendt
- String-to-boolean conversion is different from Pandas (#8549) @skirui-source
- Fix `__repr__` output with `display.max_rows` is `None` (#8547) @galipremsagar
- Fix size passed to column constructors in _with_type_metadata (#8539) @shwina
- Properly retrieve last column when `-1` is specified for column index (#8529) @isVoid
- Fix importing `apply` from `dask` (#8517) @galipremsagar
- Fix offset of the string dictionary length stream (#8515) @vuule
- Fix double counting of selected columns in CSV reader (#8508) @ochan1
- Incorrect map size in scatter_to_gather corrupts struct columns (#8507) @gerashegalov
- replace_nulls properly propagates memory resource to gather calls (#8500) @robertmaynard
- Disallow groupby aggs for `StructColumns` (#8499) @charlesbluca
- Fixes out-of-bounds access for small files in unzip (#8498) @elstehle
- Adding support for writing empty dataframe (#8490) @shaneding
- Fix exclusive scan when including nulls and improve testing (#8478) @harrism
- Add workaround for crash in libcudf debug build using output_indexalator in thrust::lower_bound (#8432) @davidwendt
- Install only the same Thrust files that Thrust itself installs (#8420) @robertmaynard
- Add nightly version for ucx-py in ci script (#8419) @galipremsagar
- Fix null_equality config of rolling_collect_set (#8415) @sperlingxx
- CollectSetAggregation: implement RollingAggregation interface (#8406) @sperlingxx
- Handle pre-sliced nested columns in contiguous_split. (#8391) @nvdbaranec
- Fix bitmask_tests.cpp host accessing device memory (#8370) @davidwendt
- Fix concurrent_unordered_map to prevent accessing padding bits in pair_type (#8348) @davidwendt
- BUG FIX: Raise appropriate strings error when concatenating strings column (#8290) @skirui-source
- Make gpuCI and pre-commit style configurations consistent (#8215) @charlesbluca
- Add collect list to dask-cudf groupby aggregations (#8045) @charlesbluca
📖 Documentation
- Update Python UDFs notebook (#8810) @brandon-b-miller
- Fix dask.dataframe API docs links after reorg (#8772) @jsignell
- Fix instructions for running cuDF/dask-cuDF tests in CONTRIBUTING.md (#8724) @shwina
- Translate Markdown documentation to rST and remove recommonmark (#8698) @vyasr
- Fixed spelling mistakes in libcudf documentation (#8664) @karthikeyann
- Custom Sphinx Extension: `PandasCompat` (#8643) @isVoid
- Fix README.md (#8535) @ajschmidt8
- Change namespace contains_nulls to struct (#8523) @davidwendt
- Add info about NVTX ranges to dev guide (#8461) @jrhemstad
- Fixed documentation bug in groupby agg method (#8325) @ahmet-uyar
🚀 New Features
- Fix concatenating structs (#8811) @shaneding
- Implement JNI for groupby aggregations `M2` and `MERGE_M2` (#8763) @ttnghia
- Bump `isort` to `5.6.4` and remove `isort` overrides made for 5.0.7 (#8755) @charlesbluca
- Implement `__setitem__` for `StructColumn` (#8737) @shaneding
- Add `is_leap_year` to `DateTimeProperties` and `DatetimeIndex` (#8736) @isVoid
- Add `struct.explode()` method (#8729) @shwina
- Add `DataFrame.to_struct()` method to convert a DataFrame to a struct Series (#8728) @shwina
- Add support for list type in ORC writer (#8723) @vuule
- Fix slicing from struct columns and accessing struct columns (#8719) @shaneding
- Add `datetime::is_leap_year` (#8711) @isVoid
- Accessing struct columns from `dask_cudf` (#8675) @shaneding
- Added pct_change to Series (#8650) @TravisHester
- Add strings support to cudf::shift function (#8648) @davidwendt
- Support Scatter `struct_scalar` (#8630) @isVoid
- Struct scalar from host dictionary (#8629) @shaneding
- Add dayofyear and day_of_year to Series, DatetimeColumn, and DatetimeIndex (#8626) @beckernick
- JNI support for capitalize (#8624) @firestarman
- Add delimiter parameter to cudf::strings::capitalize() (#8620) @davidwendt
- Add NVBench in CMake (#8619) @PointKernel
- Change default datetime index resolution to ns to match pandas (#8611) @vyasr
- ListColumn `__setitem__` (#8606) @brandon-b-miller
- Implement groupby aggregations `M2` and `MERGE_M2` (#8605) @ttnghia
- Add sequence_type parameter to cudf::strings::title function (#8602) @davidwendt
- Adding support for list and struct type in ORC Reader (#8599) @rgsl888prabhu
- Benchmark for `strings::repeat_strings` APIs (#8589) @ttnghia
- Nested scalar support for copy if else (#8588) @gerashegalov
- User specified decimal columns to float64 (#8587) @jdye64
- Add `get_element` for struct column (#8578) @isVoid
- Python changes for adding `__getitem__` for `struct` (#8577) @shaneding
- Add `strings::repeat_strings` API that can repeat each string a different number of times (#8561) @ttnghia
- Refactor `tests/iterator_utilities.hpp` functions (#8540) @ttnghia
- Support MERGE_LISTS and MERGE_SETS in Java package (#8516) @sperlingxx
- Decimal support csv reader (#8511) @elstehle
- Add column type tests (#8505) @isVoid
- Warn when downscaling decimal columns (#8492) @ChrisJar
- Add JNI for `strings::repeat_strings` (#8491) @ttnghia
- Add `Index.get_loc` for Numerical, String Index support (#8489) @isVoid
- Expose half_up rounding in cuDF (#8477) @shwina
- Java APIs to fetch CUDA runtime info (#8465) @sperlingxx
- Add `str.edit_distance_matrix` (#8463) @isVoid
- Support constructing `cudf.Scalar` objects from host side lists (#8459) @brandon-b-miller
- Add accurate hash join size functions (#8453) @PointKernel
- Add cudf::strings::integer_to_hex convert API (#8450) @davidwendt
- Create objects from iterables that contain cudf.NA (#8442) @brandon-b-miller
- JNI bindings for sort_lists (#8439) @sperlingxx
- Expose a Decimal32Dtype in cuDF Python (#8438) @skirui-source
- Replace `all_null()` and `all_valid()` by `iterator_all_nulls()` and `iterator_no_null()` in tests (#8437) @ttnghia
- Implement groupby `MERGE_LISTS` and `MERGE_SETS` aggregates (#8436) @ttnghia
- Add public libcudf match_dictionaries API (#8429) @davidwendt
- Add move constructors for `string_scalar` and `struct_scalar` (#8428) @ttnghia
- Implement `strings::repeat_strings` (#8423) @ttnghia
- STRUCT column support for cudf::merge. (#8422) @nvdbaranec
- Implement reverse in libcudf (#8410) @shaneding
- Support multiple input files/buffers for read_json (#8403) @jdye64
- Improve test coverage for struct search (#8396) @ttnghia
- Add `groupby.fillna` (#8362) @isVoid
- Enable AST-based joining (#8214) @vyasr
- Generalized null support in user defined functions (#8213) @brandon-b-miller
- Add compiled binary operation (#8192) @karthikeyann
- Implement `.describe()` for `DataFrameGroupBy` (#8179) @skirui-source
- ORC - Support reading multiple orc files/buffers in a single operation (#8142) @jdye64
- Add Python bindings for `lists::concatenate_list_elements` and expose them as `.list.concat()` (#8006) @shwina
- Use Arrow URI FileSystem backed instance to retrieve remote files (#7709) @jdye64
- Example to build custom application and link to libcudf (#7671) @isVoid
- Upgrade arrow to 4.0.1 (#7495) @galipremsagar
🛠️ Improvements
- Provide a better error message when `CUDA::cuda_driver` not found (#8794) @robertmaynard
- Remove anonymous namespace from null_mask.cuh (#8786) @nvdbaranec
- Allow cudf to be built without libcuda.so existing (#8751) @robertmaynard
- Pin `mimesis` to `<4.1` (#8745) @galipremsagar
- Update `conda` environment name for CI (#8692) @ajschmidt8
- Remove flatbuffers dependency (#8671) @Ethyling
- Add options to build Arrow with Python and Parquet support (#8670) @trxcllnt
- Remove unused cudf::strings::create_offsets (#8663) @davidwendt
- Update GDS lib version to 1.0.0 (#8654) @pxLi
- Support for groupby/scan rank and dense_rank aggregations (#8652) @rwlee
- Fix usage of deprecated arrow ipc API (#8632) @revans2
- Use absolute imports in `cudf` (#8631) @galipremsagar
- ENH Add Java CI build script (#8627) @dillon-cullinan
- Add DeprecationWarning to `ser.str.subword_tokenize` (#8603) @VibhuJawa
- Rewrite binary operations for improved performance and additional type support (#8598) @vyasr
- Fix `mypy` errors surfacing because of `numpy-1.21.0` (#8595) @galipremsagar
- Remove unneeded includes from cudf::string_view headers (#8594) @davidwendt
- Use cmake 3.20.1 as it is now required by rmm (#8586) @robertmaynard
- Remove device debug symbols from cmake CUDF_CUDA_FLAGS (#8584) @davidwendt
- Dask-CuDF: use default Dask Dataframe optimizer (#8581) @madsbk
- Remove checking if an unsigned value is less than zero (#8579) @robertmaynard
- Remove strings_count parameter from cudf::strings::detail::create_chars_child_column (#8576) @davidwendt
- Make `cudf.api.types` imports consistent (#8571) @galipremsagar
- Modernize libcudf basic example CMakeFile; updates CI build tests (#8568) @isVoid
- Rename concatenate_tests.cu to .cpp (#8555) @davidwendt
- enable window lead/lag test on struct (#8548) @wbo4958
- Add Java methods to split and write column views (#8546) @razajafri
- Small cleanup (#8534) @codereport
- Unpin `dask` version in CI (#8533) @galipremsagar
- Added optional flag for building Arrow with S3 filesystem support (#8531) @jdye64
- Minor clean up of various internal column and frame utilities (#8528) @vyasr
- Rename some copying_test source files .cu to .cpp (#8527) @davidwendt
- Correct the last warnings and issues when using newer cuda versions (#8525) @robertmaynard
- Correct unused parameter warnings in transform and unary ops (#8521) @robertmaynard
- Correct unused parameter warnings in string algorithms (#8509) @robertmaynard
- Add in JNI APIs for scan, replace_nulls, group_by.scan, and group_by.replace_nulls (#8503) @revans2
- Fix `21.08` forward-merge conflicts (#8502) @ajschmidt8
- Fix Cython formatting command in Contributing.md. (#8496) @marlenezw
- Bug/correct unused parameters in reshape and text (#8495) @robertmaynard
- Correct unused parameter warnings in partitioning and stream compact (#8494) @robertmaynard
- Correct unused parameter warnings in labelling and list algorithms (#8493) @robertmaynard
- Refactor index construction (#8485) @vyasr
- Correct unused parameter warnings in replace algorithms (#8483) @robertmaynard
- Correct unused parameter warnings in reduction algorithms (#8481) @robertmaynard
- Correct unused parameter warnings in io algorithms (#8480) @robertmaynard
- Correct unused parameter warnings in interop algorithms (#8479) @robertmaynard
- Correct unused parameter warnings in filling algorithms (#8468) @robertmaynard
- Correct unused parameter warnings in groupby (#8467) @robertmaynard
- use libcu++ time_point as timestamp (#8466) @karthikeyann
- Modify reprog_device::extract to return groups in a single pass (#8460) @davidwendt
- Update minimum Dask requirement to 2021.6.0 (#8458) @pentschev
- Fix failures when performing binary operations on DataFrames with empty columns (#8452) @ChrisJar
- Fix conflicts in `8447` (#8448) @ajschmidt8
- Add serialization methods for `List` and `StructDtype` (#8441) @charlesbluca
- Replace make_empty_strings_column with make_empty_column (#8435) @davidwendt
- JNI bindings for get_element (#8433) @revans2
- Update dask make_meta changes to be compatible with dask upstream (#8426) @galipremsagar
- Unpin dask version on CI (#8425) @galipremsagar
- Add benchmark for strings/fixed_point convert APIs (#8417) @davidwendt
- Adapt `cudf::scalar` classes to changes in `rmm::device_scalar` (#8411) @harrism
- Add benchmark for strings/integers convert APIs (#8402) @davidwendt
- Enable multi-file partitioning in dask_cudf.read_parquet (#8393) @rjzamora
- Correct unused parameter warnings in rolling algorithms (#8390) @robertmaynard
- Correct unused parameters in column round and search (#8389) @robertmaynard
- Add functionality to apply `Dtype` metadata to `ColumnBase` (#8373) @charlesbluca
- Refactor setting stack size in regex code (#8358) @davidwendt
- Update Java bindings to 21.08-SNAPSHOT (#8344) @pxLi
- Replace remaining uses of device_vector (#8343) @harrism
- Statically link libnvcomp into libcudfjni (#8334) @jlowe
- Resolve auto merge conflicts for Branch 21.08 from branch 21.06 (#8329) @galipremsagar
- Minor code refactor for sorted_order (#8326) @wbo4958
- Remove special Index class from the general index class hierarchy (#8309) @vyasr
- Add first-class dtype utilities (#8308) @vyasr
- Add option to link Java bindings with Arrow dynamically (#8307) @jlowe
- Refactor ColumnMethods and its subclasses to remove `column` argument and require `parent` argument (#8306) @shwina
- Refactor `scatter` for list columns (#8255) @isVoid
- Expose pack/unpack API to Python (#8153) @charlesbluca
- Adding cudf.cut method (#8002) @marlenezw
- Optimize string gather performance for large strings (#7980) @gaohao95
- Add peak memory usage tracking to cuIO benchmarks (#7770) @devavret
- Updating Clang Version to 11.0.0 (#6695) @codereport