A new model is added to transformers: V-JEPA 2
It is added on top of the v4.52.4 release and can be installed from the following tag: v4.52.4-VJEPA-2-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-VJEPA-2-preview
If fixes are needed, they will be applied to this release, so this installation may be considered stable and improving.
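After installing, a quick sanity check is to confirm that the expected version and model class resolve. This is a minimal sketch; the exact version string reported by a preview tag, and the VJEPA2Model class name, are assumptions here and may differ:
# Sanity check after installation. The version string printed by the preview
# tag may differ from the final v4.53.0 release, and the class name
# VJEPA2Model is assumed rather than confirmed by this release note.
import transformers
from transformers import VJEPA2Model
print(transformers.__version__)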
As its name implies, this tag is a preview of the V-JEPA 2 model. It is a tagged version of the main branch and does not follow semantic versioning. The model will be included in the next minor release: v4.53.0.
V-JEPA 2
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
Usage example
V-JEPA 2 checkpoints can be found on the Hugging Face Hub. The model is intended to represent any video (and image), and can be used for video classification, retrieval, or as a video encoder for VLMs.
The snippet below shows how to load the V-JEPA 2 model using the AutoModel class.
import torch
import numpy as np
from torchcodec.decoders import VideoDecoder
from transformers import AutoModel, AutoVideoProcessor
processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model = AutoModel.from_pretrained(
"facebook/vjepa2-vitl-fpc64-256",
torch_dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64) # sample the first 64 frames; a more complex sampling strategy can be defined here
video = vr.get_frames_at(indices=frame_idx).data # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
outputs = model(**video)
# V-JEPA 2 encoder outputs, same as calling `model.get_vision_features()`
encoder_outputs = outputs.last_hidden_state
# V-JEPA 2 predictor outputs
predictor_outputs = outputs.predictor_output.last_hidden_state
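Since the encoder output is a sequence of patch tokens, retrieval or similarity search usually needs a single embedding per video. The following continuation of the snippet above is a minimal sketch; mean pooling is an assumption here, not an official recipe, and other pooling strategies are possible:
# Pool the token dimension into one embedding per video.
# Mean pooling is a simple choice; other strategies (e.g. attention pooling) are possible.
video_embedding = encoder_outputs.mean(dim=1)  # (batch_size, hidden_dim)
video_embedding = torch.nn.functional.normalize(video_embedding, dim=-1)
# Embeddings produced this way can be compared with cosine similarity, e.g.:
# similarity = video_embedding_a @ video_embedding_b.T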