HfFileSystem: interact with the Hub through the Filesystem API
We introduce `HfFileSystem`, a pythonic filesystem interface compatible with `fsspec`. Built on top of `HfApi`, it offers typical filesystem operations like `cp`, `mv`, `ls`, `du`, `glob`, `get_file` and `put_file`.
```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

# List all files in a directory
>>> fs.ls("datasets/myself/my-dataset/data", detail=False)
['datasets/myself/my-dataset/data/train.csv', 'datasets/myself/my-dataset/data/test.csv']

>>> train_data = fs.read_text("datasets/myself/my-dataset/data/train.csv")
```
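The other operations listed above follow the standard `fsspec` interface. A minimal sketch, reusing the placeholder repository from the example above:

```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

# Find all CSV files in the repository with a glob pattern
>>> fs.glob("datasets/myself/my-dataset/**/*.csv")

# Total size of the data directory, in bytes
>>> fs.du("datasets/myself/my-dataset/data")

# Upload a local file to the repository
>>> fs.put_file("local/train.csv", "datasets/myself/my-dataset/data/train.csv")
```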
Its biggest advantage is that it provides ready-to-use integrations with popular libraries like Pandas, DuckDB and Zarr.
```python
import pandas as pd

# Read a remote CSV file into a dataframe
df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

# Write a dataframe to a remote CSV file
df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")
```
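The same `hf://` paths can be queried from DuckDB once the filesystem is registered. A minimal sketch, assuming the `fsspec` registration API of the `duckdb` Python client and reusing the placeholder repository above:

```python
import duckdb
from huggingface_hub import HfFileSystem

# Register HfFileSystem so DuckDB can resolve hf:// paths
duckdb.register_filesystem(HfFileSystem())

# Query the remote CSV file directly with SQL
duckdb.sql("SELECT * FROM 'hf://datasets/my-username/my-dataset-repo/train.csv' LIMIT 10").show()
```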
For a more detailed overview, please have a look at this guide.
- Transfer the `hffs` code to `hfh` by @mariosasko in #1420
- Hffs misc improvements by @mariosasko in #1433
Webhook Server
`WebhooksServer` allows you to implement, debug and deploy webhook endpoints on the Hub without any overhead. Creating a new endpoint is as easy as decorating a Python function.
```python
# app.py
from huggingface_hub import webhook_endpoint, WebhookPayload

@webhook_endpoint
async def trigger_training(payload: WebhookPayload) -> None:
    if payload.repo.type == "dataset" and payload.event.action == "update":
        # Trigger a training job if a dataset is updated
        ...
```
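For more control, endpoints can also be registered on a `WebhooksServer` instance directly. A minimal sketch, assuming the `add_webhook`/`launch` API from the documentation guide (the route name and secret are placeholders):

```python
from huggingface_hub import WebhooksServer, WebhookPayload

app = WebhooksServer(webhook_secret="my_secret_key")  # placeholder secret

# Register an endpoint served under /webhooks/say_hello
@app.add_webhook("/say_hello")
async def say_hello(payload: WebhookPayload) -> None:
    print(f"Received '{payload.event.action}' event on {payload.repo.name}")

# Start the server
app.launch()
```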
For more details, check out this Twitter thread or the documentation guide.
Note that this feature is experimental, which means the API/behavior might change without prior notice. A warning is displayed to the user when using it. As it is experimental, we would love to get feedback!
Some upload QOL improvements
Faster upload with hf_transfer
`huggingface_hub` now integrates with `hf_transfer`, a Rust-based library that uploads large files in chunks and concurrently. Expect a 3x speed-up if your bandwidth allows it!
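The feature is opt-in. A minimal sketch, assuming `hf_transfer` is installed separately (`pip install hf_transfer`):

```python
import os

# Opt in before using huggingface_hub's upload helpers
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="local/big-file.bin",
    path_in_repo="big-file.bin",
    repo_id="my-username/my-model",  # placeholder repo
)
```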
Upload in multiple commits
Uploading large folders at once might be annoying if an error happens while committing (e.g. a connection error occurs). It is now possible to upload a folder in multiple (smaller) commits. If a commit fails, you can re-run the script and resume the upload. Commits are pushed to a dedicated PR. Once completed, the PR is merged into the `main` branch, resulting in a single commit in your git history.
```python
from huggingface_hub import upload_folder

upload_folder(
    folder_path="local/checkpoints",
    repo_id="username/my-dataset",
    repo_type="dataset",
    multi_commits=True,  # resumable multi-upload
    multi_commits_verbose=True,
)
```
Note that this feature is also experimental, meaning its behavior might be updated in the future.
Upload validation
More pre-validation is now done before committing files to the Hub: the `.git` folder (if any) is ignored in `upload_folder`, and invalid paths fail early.
- Fix `path_in_repo` validation when committing files by @Wauplin in #1382
- Raise issue if trying to upload `.git/` folder + ignore `.git/` folder in `upload_folder` by @Wauplin in #1408
Keep-alive connections between requests
Internal update to reuse the same HTTP session across `huggingface_hub`. The goal is to keep the connection open when making multiple calls to the Hub, which ultimately saves a lot of time. For instance, updating metadata in a README became 40% faster, while listing all models from the Hub is 60% faster. This has no impact on atomic calls (e.g. a single standalone GET call). You can also provide your own session factory, as shown in the sketch after the list below.
- Keep-alive connection between requests by @Wauplin in #1394
- Accept backend_factory to configure Sessions by @Wauplin in #1442
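A minimal sketch of configuring a custom session factory, assuming the `configure_http_backend` helper introduced by #1442 (the proxy value is a placeholder):

```python
import requests
from huggingface_hub import configure_http_backend

def backend_factory() -> requests.Session:
    # Every HTTP call made by huggingface_hub will go through a
    # Session created by this factory
    session = requests.Session()
    session.proxies = {"https": "http://localhost:3128"}  # placeholder proxy
    return session

configure_http_backend(backend_factory=backend_factory)
```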
Custom sleep time for Spaces
It is now possible to programmatically set a custom sleep time on your upgraded Space. After X seconds of inactivity, your Space will go to sleep to save you some $$$.
```python
from huggingface_hub import set_space_sleep_time

# Put your Space to sleep after 1h of inactivity
# ("my-username/my-space" is a placeholder)
set_space_sleep_time(repo_id="my-username/my-space", sleep_time=3600)
```
Breaking change
`fsspec` has been added as a main dependency. It's a lightweight Python library required for `HfFileSystem`.
No other breaking change expected in this release.
Bugfixes & small improvements
File-related
A lot of effort has been invested in making `huggingface_hub`'s cache system more robust, especially when working with symlinks on Windows. Hope everything's fixed by now.
- Fix relative symlinks in cache by @Wauplin in #1390
- Hotfix - use relative symlinks whenever possible by @Wauplin in #1399
- [hot-fix] Malicious repo can overwrite any file on disk by @Wauplin in #1429
- Fix symlinks on different volumes on Windows by @Wauplin in #1437
- [FIX] bug "Invalid cross-device link" error when using snapshot_download to local_dir with no symlink by @thaiminhpv in #1439
- Raise after download if file size is not consistent by @Wauplin in #1403
ETag-related
After a server-side configuration issue, we made `huggingface_hub` more robust when fetching the Hub's ETags, to be more future-proof.
- Update file_download.py by @Wauplin in #1406
- 🧹 Use `HUGGINGFACE_HEADER_X_LINKED_ETAG` const by @julien-c in #1405
- Normalize both possible variants of the ETag to remove potentially invalid path elements by @dwforbes in #1428
Documentation-related
- Docs about how to hide progress bars by @Wauplin in #1416
- [docs] Update docstring for repo_id in push_to_hub by @tomaarsen in #1436
Misc
- Prepare for 0.14 by @Wauplin in #1381
- Add force_download to snapshot_download by @Wauplin in #1391
- Model card template: Move model usage instructions out of Bias section by @NimaBoscarino in #1400
- typo by @Wauplin (direct commit on main)
- Log as warning when waiting for ongoing commands by @Wauplin in #1415
- Fix: notebook_login() does not update UI on Databricks by @fwetdb in #1414
- Passing the headers to hf_transfer download. by @Narsil in #1444