Datasets Features
- Add Json() type by @lhoestq in #8027
  - JSON Lines files that contain arbitrary JSON objects, such as tool calling datasets, are now supported. When a field or subfield contains mixed types (e.g. a mix of str/int/float/dict/list, or dictionaries with arbitrary keys), the Json() type is used to store data that would normally not be supported in Arrow/Parquet
  - Use the Json() type in Features() for any dataset. It is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
  - Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()
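To make "mixed types" concrete, the sketch below scans JSON Lines text with the standard library only and reports which top-level fields hold values of more than one JSON type. It is an illustration under stated assumptions, not the library's own detection logic: the helper names json_type and find_mixed_fields are hypothetical, only top-level fields are inspected, and null values are ignored.

```python
import json
from collections import defaultdict


def json_type(value):
    """Return a coarse JSON type name for a decoded value (hypothetical helper)."""
    if value is None:
        return "null"
    # bool is checked before numbers because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "str"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "dict"
    raise TypeError(f"not a JSON value: {value!r}")


def find_mixed_fields(jsonl_text):
    """Return {field: set of type names} for top-level fields with more than one non-null type."""
    seen = defaultdict(set)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            type_name = json_type(value)
            if type_name != "null":
                seen[key].add(type_name)
    return {key: types for key, types in seen.items() if len(types) > 1}


# Three records: "id" is always a number, "a" is a number, a string, then a dict
sample = "\n".join([
    '{"id": 1, "a": 0}',
    '{"id": 2, "a": "foo"}',
    '{"id": 3, "a": {"subfield": "bar"}}',
])
mixed = find_mixed_fields(sample)
print({key: sorted(types) for key, types in mixed.items()})
```

A field reported by a check like this is a candidate for the Json() type in features=, or for letting on_mixed_types="use_json" pick it automatically.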
Examples:
You can use on_mixed_types="use_json" or specify features= with a Json() type:
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
...
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64
>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]
This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]] # missing fields are filled with None
>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]] # OK
Another example with tool calling data and the on_mixed_types="use_json" argument (useful to avoid specifying features= manually):
>>> messages = [
... {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
... {"role": "assistant", "tool_calls": [
... {"type": "function", "function": {
... "name": "control_light",
... "arguments": {"room": "living room", "state": "on"}
... }},
... {"type": "function", "function": {
... "name": "play_music",
... "arguments": {"playlist": "electronic"} # mixed-type here since keys ["playlist"] and ["room", "state"] are different
... }}]
... },
... {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
... {"role": "tool", "name": "play_music", "content": "The music is now playing."},
... {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
What's Changed
- Fix typos in iterable_dataset.py by @omkar-334 in #8049
- Fix non-deterministic by sorting metadata extensions (#8034) by @Nexround in #8039
- Use num_examples instead of len(self) for iterable_dataset's SplitInfo by @HaukurPall in #8041
- Fix silent data loss in push_to_hub when num_proc > num_shards by @HaukurPall in #8044
- Don't extract bad files by @lhoestq in #8056
- fix(iterable_dataset): preserve features when chaining filter() on typed IterableDataset by @s-zx in #8053
- fix: handle nested null types in feature alignment for multi-proc map by @ain-soph in #8047
- Fix unstable tokenizer fingerprinting (enables map cache reuse) by @KOKOSde in #7982
- Limit dataset listing to first 20 entries in readme by @lhoestq in #8057
New Contributors
- @omkar-334 made their first contribution in #8049
- @Nexround made their first contribution in #8039
- @HaukurPall made their first contribution in #8041
- @s-zx made their first contribution in #8053
- @ain-soph made their first contribution in #8047
- @KOKOSde made their first contribution in #7982
Full Changelog: 4.6.1...4.7.0