TRIBE2
Predict where your video loses the viewer. Powered by Meta's brain-encoding model, mapping predicted cortical response across ~20,000 vertices in real time.
The model
TRIBE v2 is Meta's brain-encoding foundation model. It predicts fMRI-level cortical activation from video, audio, and text using three extractors:
V-JEPA 2 (vision) + Wav2Vec2-BERT 2.0 (audio) + Llama 3.2 (language)
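A minimal structural sketch of how the three streams of extracted features could be fused into per-vertex predictions. The feature dimensions, the concatenate-then-project fusion, the GRU temporal model, and the name TrimodalEncoderSketch are illustrative assumptions, not the actual TRIBE v2 architecture; only the trimodal input and the ~20,000-vertex fsaverage5 output come from the description above.

```python
# Structural sketch only (not the actual TRIBE v2 code). Assumes the three
# pretrained extractors have already produced one feature vector per fMRI
# frame; all dimensions and the fusion scheme are illustrative.
import torch
import torch.nn as nn

N_VERTICES = 20484  # fsaverage5: 10,242 vertices per hemisphere


class TrimodalEncoderSketch(nn.Module):
    def __init__(self, d_video=1024, d_audio=1024, d_text=3072, d_model=512):
        super().__init__()
        d_in = d_video + d_audio + d_text
        self.fuse = nn.Sequential(
            nn.LayerNorm(d_in), nn.Linear(d_in, d_model), nn.GELU()
        )
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)  # placeholder temporal model
        self.head = nn.Linear(d_model, N_VERTICES)  # per-vertex readout

    def forward(self, video_feats, audio_feats, text_feats):
        # Each input: (batch, time, feature_dim), aligned to the same time grid.
        x = torch.cat([video_feats, audio_feats, text_feats], dim=-1)
        x = self.fuse(x)
        x, _ = self.temporal(x)
        return self.head(x)  # (batch, time, N_VERTICES) predicted activation


if __name__ == "__main__":
    model = TrimodalEncoderSketch()
    pred = model(torch.randn(2, 40, 1024), torch.randn(2, 40, 1024), torch.randn(2, 40, 3072))
    print(pred.shape)  # torch.Size([2, 40, 20484])
```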
The output is a prediction across ~20,000 cortical vertices on the fsaverage5 surface. We aggregate those into five zones and derive a composite engagement signal.
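A small sketch, under stated assumptions, of turning the per-vertex prediction into zone signals and a composite. The zone names, the vertex-to-zone label array, the per-zone mean, and the equal-weight composite are hypothetical choices for illustration; only "five zones over ~20,000 fsaverage5 vertices" is taken from the text above.

```python
# Illustrative aggregation: zone names, the label array, and the unweighted
# composite are assumptions; only "five zones over ~20k fsaverage5 vertices"
# comes from the description above.
import numpy as np

ZONES = ["visual", "auditory", "language", "attention", "default"]  # hypothetical labels


def aggregate_zones(vertex_pred, zone_labels):
    """vertex_pred: (time, n_vertices); zone_labels: (n_vertices,) ints in [0, 5)."""
    zone_means = np.stack(
        [vertex_pred[:, zone_labels == z].mean(axis=1) for z in range(len(ZONES))],
        axis=1,
    )  # (time, 5) mean predicted activation per zone
    engagement = zone_means.mean(axis=1)  # (time,) simple unweighted composite
    return zone_means, engagement


rng = np.random.default_rng(0)
pred = rng.standard_normal((40, 20484))   # stand-in for the model output
labels = rng.integers(0, 5, size=20484)   # stand-in for a vertex-to-zone map
zone_means, engagement = aggregate_zones(pred, labels)
print(zone_means.shape, engagement.shape)  # (40, 5) (40,)
```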
Predictions carry a ~5 s hemodynamic offset inherent to fMRI: the BOLD response the model emulates peaks several seconds after the stimulus, so each predicted timestamp reflects content shown slightly earlier in the video. Treat each timestamp as an approximate editing window, not an exact cut point.
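A tiny helper sketch for reading that offset back onto the video timeline. The 5 s delay restates the figure above, but the function name and the ±2 s slack around the estimate are assumptions, not calibrated outputs of the model.

```python
# Reading a dip in the predicted signal back onto the video timeline.
# The 5 s delay restates the figure above; the +/- 2 s slack is an assumption.
HEMODYNAMIC_DELAY_S = 5.0
WINDOW_HALF_WIDTH_S = 2.0


def editing_window(prediction_time_s: float) -> tuple[float, float]:
    """Approximate stimulus-time window that likely drove the response at prediction_time_s."""
    center = max(prediction_time_s - HEMODYNAMIC_DELAY_S, 0.0)
    return max(center - WINDOW_HALF_WIDTH_S, 0.0), center + WINDOW_HALF_WIDTH_S


print(editing_window(92.0))  # a drop predicted at 92 s points to roughly (85.0, 89.0) in the video
```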