Arrow2 0.10.0 is out! 🚀🚀🚀🚀🚀
Continuing to break new ground, this is one of the most feature-rich releases of this crate so far!
Thank you to everyone for the impressive work over the past 2.5 months that makes arrow2 so feature-rich, safe, fast, and easy to use! 🙇
Here are the main headlines:
Copy on Write
So far, whenever we applied a transformation to an array, we had to create a new array. When multiple operations were chained (e.g. c1 x 2 + 1), it led to the following compute pattern:
1. allocate new region
2. compute
3. allocate new region
4. compute
This was identified by @sundy-li on #741 and addressed by @ritchie46 on #794.
Users can now re-use Arc-ed arrays mutably, just like std::sync::Arc::get_mut. As expected, if the array is being used in multiple places, it will return None and users need to allocate a new region (exclusive mutability).
This is being used in Polars to further re-use allocated regions and therefore reduce both memory pressure and wasted compute cycles allocating new regions.
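The idea can be sketched with the standard library alone. The snippet below is a minimal illustration of the pattern, not arrow2's API: the function name map_in_place_or_copy and the use of Arc<Vec<i32>> as a stand-in for an array are assumptions made for this example.

```rust
use std::sync::Arc;

/// Applies `f` to every value.
/// Hypothetical helper: `Arc<Vec<i32>>` stands in for an Arc-ed array.
fn map_in_place_or_copy(mut values: Arc<Vec<i32>>, f: impl Fn(i32) -> i32) -> Arc<Vec<i32>> {
    if let Some(v) = Arc::get_mut(&mut values) {
        // We are the only owner: mutate the existing region, no allocation.
        for x in v.iter_mut() {
            *x = f(*x);
        }
        return values;
    }
    // The region is shared: fall back to allocating a new one.
    Arc::new(values.iter().map(|&x| f(x)).collect())
}
```

When the Arc is uniquely owned, the returned Arc points to the same allocation that was passed in. When another clone is alive, that clone keeps observing the original values.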
Support for ODBC
This release now supports reading from, and writing to, any ODBC driver.
This builds on top of the superb odbc-api created by @pacman82, which allows this crate to use the columnar format provided by the ODBC specification.
Given a performant ODBC driver, this is expected to be the fastest way to load data into the Arrow format, as many operations are simple memcopies.
Check out the example and guide for details on how to use it!
async support for writing to Arrow's IPC
Until now, we had limited support for writing to Arrow IPC asynchronously. @dexterduck closed this gap in #878, offering complete async support for both Arrow files and Arrow streams, including implementations of futures::Stream and futures::Sink for them!
Migrated to std::simd
After some back and forth with the working group of the portable SIMD project, this release replaces packed_simd2 with std::simd. This resulted in no performance difference but allows us to leverage the great work that is happening on std::simd.
Support to Serde metadata
A common pain point in using arrow2's logical types is that they are quite rich, which sometimes makes them difficult to visualize or represent in e.g. JSON. @houqp closed this with #858, which adds compatibility with Serde for schema-related structs in this crate (PhysicalType, DataType, Field, Schema).
Support for Arrow C stream interface
Arrow has an experimental specification for an FFI to iterators of arrow arrays. This release now fully supports this interface.
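For context, the C stream interface is a single struct of callbacks. Below is a rough Rust mirror of the ArrowArrayStream layout from the Arrow specification. It is a sketch for illustration, not arrow2's own definitions: ArrowSchema and ArrowArray are reduced to opaque placeholders, and the helpers empty and is_released are hypothetical names introduced here.

```rust
use std::os::raw::{c_char, c_int, c_void};

/// Opaque placeholder for the C data interface schema struct (fields omitted).
#[repr(C)]
pub struct ArrowSchema {
    _private: [u8; 0],
}

/// Opaque placeholder for the C data interface array struct (fields omitted).
#[repr(C)]
pub struct ArrowArray {
    _private: [u8; 0],
}

/// Mirror of the `ArrowArrayStream` struct of the C stream interface.
#[repr(C)]
pub struct ArrowArrayStream {
    /// Writes the schema shared by all arrays of the stream; returns 0 on success.
    pub get_schema: Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowSchema) -> c_int>,
    /// Writes the next array; a released array signals the end of the stream.
    pub get_next: Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowArray) -> c_int>,
    /// Returns a description of the last error, or null.
    pub get_last_error: Option<unsafe extern "C" fn(*mut ArrowArrayStream) -> *const c_char>,
    /// Releases the stream's resources and sets `release` to null.
    pub release: Option<unsafe extern "C" fn(*mut ArrowArrayStream)>,
    /// Producer-specific data, opaque to the consumer.
    pub private_data: *mut c_void,
}

impl ArrowArrayStream {
    /// Hypothetical helper: a stream in the released state, ready to be filled by a producer.
    pub fn empty() -> Self {
        Self {
            get_schema: None,
            get_next: None,
            get_last_error: None,
            release: None,
            private_data: std::ptr::null_mut(),
        }
    }

    /// Hypothetical helper: per the specification, a null `release` marks a released stream.
    pub fn is_released(&self) -> bool {
        self.release.is_none()
    }
}
```

A consumer typically calls get_schema once and then get_next repeatedly until it receives a released array, after which it calls release on the stream.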
Made crate deny(missing_docs)
This makes us developers more conscious about documenting APIs, thereby giving users more context about them. We have also started documenting whether IO-related APIs are CPU-bounded or IO-bounded, so that users know which ones block async contexts.
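As a small illustration of what the lint enforces, consider the sketch below. The module io_example and the function sum are hypothetical and not part of arrow2. The lint is applied to a module here to keep the snippet self-contained, whereas the crate applies it crate-wide.

```rust
/// Example module: with `missing_docs` denied, every public item in it must be documented.
#[deny(missing_docs)]
pub mod io_example {
    /// Sums the values of a slice.
    ///
    /// This function is CPU-bounded: it performs no IO, so in `async` contexts
    /// large inputs are best offloaded to a blocking thread pool.
    pub fn sum(values: &[i64]) -> i64 {
        values.iter().sum()
    }
}
```

Removing the doc comment from a public item that is part of a crate's public API turns the lint warning into a compilation error.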
Changelog
Breaking changes:
- Renamed Ffi_ArrowArray and Ffi_ArrowSchema #859
- Improved performance and stability of writing to CSV #866 (ritchie46)
- Simplified API for writing to JSON #864 (jorgecarleitao)
- Simplified API to import from FFI #854 (jorgecarleitao)
- Simplified compute (lower/upper) #847 (jorgecarleitao)
- Simplified inferring arrow schema from a parquet schema #819 (jorgecarleitao)
- Bumped parquet and aligned API to fit into it #795 (jorgecarleitao)
New features:
- Added GrowableUnion #902 (jorgecarleitao)
- Added cast to months_days_ns #900 (jorgecarleitao)
- Added support for hash of month_day_ns arrays #899 (jorgecarleitao)
- IPC sink types and IPC file stream #878 (dexterduck)
- implemented futures::Sink for parquet async writer #877 (dexterduck)
- Added try_new and new to all arrays #873 (jorgecarleitao)
- Added support for datatypes serde #858 (houqp)
- Added support to the Arrow C stream interface (read and write) #857 (jorgecarleitao)
- Support to read/write from/to ODBC #849 (jorgecarleitao)
- Added operators that include validities in comparisons #846 (ritchie46)
- Added support to read and write Decimal128 to Avro #837 (potter420)
- Added support to read Arrow streams asynchronously #832 (jorgecarleitao)
- Added support to write LargeUtf8 and LargeBinary to Avro #828 (illumination-k)
- Added support for pushdown projection in reading Avro #827 (jorgecarleitao)
- Added support to read Avro's structs #826 (jorgecarleitao)
- Added support to write largeUtf8/Binary to Avro #825 (jorgecarleitao)
- Added json serialization of timestamp/date32/date64 #814 (ritchie46)
- Added BooleanArray::from_trusted_len_values_iter_unchecked #799 (ritchie46)
- Added MutableUtf8Array::extend_values #798 (ritchie46)
- Added COW semantics to Buffer, Bitmap and some arrays #794 (ritchie46)
- Added support to read parquet row groups in chunks #789 (jorgecarleitao)
- Added scalar bitwise ops #788 (jorgecarleitao)
- Migrated to portable simd #747 (jorgecarleitao)
Fixed bugs:
- Fixed edge case in reading multiple parquet pages #904 (jorgecarleitao)
- Bug fix in offset for sliced unions #891 (ncpenke)
- Fix edge case in reading nested parquet #884 (jorgecarleitao)
- Fixed unsoundness of #[derive(Clone)] for FFI structs #882 (jorgecarleitao)
- Fixed json writing of dates and datetimes #867 (jorgecarleitao)
- Fixed reading parquet with timezone #862 (jorgecarleitao)
- Fixed error in writing compressed IPC arrow #855 (jorgecarleitao)
- Fixed wrong null_count when slicing a sliced Bitmap #848 (satlank)
- Fixed error in writing compressed IPC files #840 (jorgecarleitao)
- Fixed float to i128 cast #817 (houqp)
- fix unescaped '"' in json writing #812 (ritchie46)
- Fixed reading parquet binary dict page #791 (danburkert)
Enhancements:
- Add FixedSizeBinaryScalar #782
- Use more idiomatic versions #898 (jorgecarleitao)
- Added support for min/max for decimal #897 (jorgecarleitao)
- Made FixedSizeList::try_push_valid public and added new_with_field #887 (ncpenke)
- Added MutableFixedList::mut_values #886 (jorgecarleitao)
- Made IPC IO use try_new #879 (jorgecarleitao)
- expose ListValuesIter #874 (ritchie46)
- Bumped crc #856 (jorgecarleitao)
- DRY parquet reading #845 (jorgecarleitao)
- Refactored (internal) fmt #842 (jorgecarleitao)
- Bumped zstd #841 (jorgecarleitao)
- inline push #835 (ritchie46)
- Increased API consistency for COW and respective docs #833 (jorgecarleitao)
- Improved flexibility of reading parquet #820 (jorgecarleitao)
- Small improvement to deserializing fixed-len parquet statistics. #818 (jorgecarleitao)
- Added support for other timestamp units from parquet #803 (jorgecarleitao)
- More to into_mut implementations #801 (ritchie46)
- Added FixedSizeListScalar and FixedSizeBinaryScalar #786 (illumination-k)
- DRY parquet module #785 (jorgecarleitao)
Documentation updates:
- Improved documentation #860 (jorgecarleitao)
- Made crate deny(missing_docs) #808 (jorgecarleitao)
- Fixed doc for Bitmap::set_bit #802 (yjshen)
- Fixed dyn Array::slice docstring #792 (ritchie46)
Testing updates:
- Simpler code (DRY) #901 (jorgecarleitao)
- Fixed integration test #885 (jorgecarleitao)
- Simplified code to generate parquet files for tests #883 (jorgecarleitao)
- Removed un-needed unsafe #843 (jorgecarleitao)
- Added more tests #810 (jorgecarleitao)
- Reduced code duplication #805 (jorgecarleitao)
- upgrade to clap 3.0 #797 (Jimexist)
- Simplified avro reading and added more tests #737 (jorgecarleitao)