Dataset Features
- Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
- Up to +100% streaming speed
- Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
    from datasets import load_dataset
    filters = [('date', '>=', '2023')]
    ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
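To illustrate what skipping row groups based on a predicate means, here is a minimal pure-Python sketch. It is not the library's implementation: `RowGroup`, `may_match`, and `scan` are hypothetical names, and the sample dates are made up. It assumes each row group stores min/max statistics for the filtered column, as Parquet metadata typically does, so a reader can rule out a whole group without fetching its rows.

```python
from dataclasses import dataclass
import operator

# Comparison operators supported by this sketch.
OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group and its column statistics."""
    min_value: str  # smallest value of the filtered column in this group
    max_value: str  # largest value of the filtered column in this group
    rows: list      # the actual rows; "downloaded" only if the group may match


def may_match(group, op, value):
    """Return False only when the statistics prove that no row can match."""
    if op in (">=", ">"):
        return OPS[op](group.max_value, value)
    if op in ("<=", "<"):
        return OPS[op](group.min_value, value)
    if op == "==":
        return group.min_value <= value <= group.max_value
    raise ValueError(f"unsupported operator: {op}")


def scan(groups, column, op, value):
    """Apply the filter, reading only the row groups that may contain matches."""
    matched, groups_read = [], 0
    for group in groups:
        if not may_match(group, op, value):
            continue  # skipped: its rows are never touched
        groups_read += 1
        matched.extend(row for row in group.rows if OPS[op](row[column], value))
    return matched, groups_read


# Made-up sample data: three row groups with statistics on the "date" column.
groups = [
    RowGroup("2021-01-01", "2021-12-31", [{"date": "2021-03-05"}, {"date": "2021-12-31"}]),
    RowGroup("2022-06-01", "2023-02-15", [{"date": "2022-06-01"}, {"date": "2023-02-15"}]),
    RowGroup("2023-03-01", "2023-11-30", [{"date": "2023-03-01"}, {"date": "2023-11-30"}]),
]

rows, groups_read = scan(groups, "date", ">=", "2023")
print(groups_read, [row["date"] for row in rows])
```

In this sketch the first row group is skipped because its maximum date sorts before "2023", so only two of the three groups are read. The second group is still read in full and filtered row by row, since its statistics cannot rule out a match.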
Other improvements and bug fixes
- fix conda release workflow by @lhoestq in #7272
- Add link to video dataset by @NielsRogge in #7277
- Raise error for incorrect JSON serialization by @varadhbhatnagar in #7273
- support for custom feature encoding/decoding by @alex-hh in #7284
- update load_dataset docstring by @lhoestq in #7301
- Let server decide default repo visibility by @Wauplin in #7302
- fix: update elasticsearch version by @ruidazeng in #7300
- Fix typing in iterable_dataset.py by @lhoestq in #7304
- Updated inconsistent output in documentation examples for ClassLabel by @sergiopaniego in #7293
- More docs to from_dict to mention that the result lives in RAM by @lhoestq in #7316
- Release: 3.2.0 by @lhoestq in #7317
New Contributors
- @ruidazeng made their first contribution in #7300
- @sergiopaniego made their first contribution in #7293
Full Changelog: 3.1.0...3.2.0