14 JEPA Milestones as a Map of AI Progress

Thanks, Yann LeCun.

• JEPA / H-JEPA: avoids predicting every single pixel (too expensive) and instead predicts in latent space. H-JEPA adds hierarchy: short-term details vs. long-term planning, i.e. how humans actually learn.
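
The core idea above — matching predicted embeddings instead of reconstructing pixels — can be sketched in a few lines of numpy. This is a toy illustration, not Meta's implementation; all weights, shapes, and the linear "encoders" are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # toy encoder: a linear map into a small latent space
    return x @ W

# hypothetical sizes: 64-dim "pixel" inputs, 8-dim latents
W_ctx = rng.normal(size=(64, 8))    # context encoder weights
W_tgt = rng.normal(size=(64, 8))    # target encoder weights
W_pred = rng.normal(size=(8, 8))    # predictor weights

context = rng.normal(size=(16, 64))  # batch of context views
target = rng.normal(size=(16, 64))   # corresponding target views

z_ctx = encode(context, W_ctx)
z_tgt = encode(target, W_tgt)

# JEPA-style loss: predict the target's *embedding*, not its pixels,
# so the model never has to account for every pixel-level detail
pred = z_ctx @ W_pred
latent_loss = np.mean((pred - z_tgt) ** 2)
```

A pixel-space loss would compare 64 values per sample; the latent loss compares only 8, and the encoder is free to drop unpredictable detail.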

• I-JEPA: built for very efficient vision models. Masks image patches and predicts their semantics, bypassing the heavy compute of traditional autoencoders.
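
The masking step can be made concrete with a toy sketch: hide a block of patches, then predict the embeddings of the hidden patches from the visible ones. The mean-pooled predictor and all shapes here are hypothetical simplifications, not I-JEPA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "image": a 4x4 grid of patches, each flattened to 12 values
patches = rng.normal(size=(16, 12))

# mask a contiguous block of patches; the rest serve as context
masked_idx = np.array([5, 6, 9, 10])            # hypothetical masked block
visible_idx = np.setdiff1d(np.arange(16), masked_idx)

W_enc = rng.normal(size=(12, 4))                # toy patch encoder
z_visible = patches[visible_idx] @ W_enc        # context embeddings
z_masked_target = patches[masked_idx] @ W_enc   # targets live in latent space

# toy predictor: pool the visible embeddings, map one guess per masked slot
W_pred = rng.normal(size=(4, 4))
pred = np.tile(z_visible.mean(axis=0) @ W_pred, (len(masked_idx), 1))

# the loss compares latent vectors; no pixel reconstruction is ever computed
loss = np.mean((pred - z_masked_target) ** 2)
```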

• MC-JEPA & V-JEPA: both of these are built for videos. MC-JEPA separates content (what an object is) from motion (how it moves). V-JEPA masks video features with no text labels, making it perfect for action tracking at scale.

• Audio-JEPA: filters out background noise by treating sounds like visuals.

• Point-JEPA & 3D-JEPA: used primarily in autonomous vehicles, operating on LiDAR point clouds & volumetric grids.

• ACT-JEPA: filters out real world noise to learn manipulation tasks efficiently via imitation learning.

• V-JEPA 2: predicts the future physical states an action will cause, before the action is taken.
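
Action-conditioned prediction of this kind can be sketched as a latent dynamics step: concatenate the current state embedding with an action and map to the predicted next embedding. A minimal numpy sketch; the dimensions, action encoding, and linear dynamics are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy action-conditioned world model operating entirely in latent space
W_dyn = rng.normal(size=(8 + 2, 8)) * 0.1   # hypothetical dynamics weights

def predict_next(z, action):
    # concatenate the state latent with the action, map to the next latent
    return np.concatenate([z, action]) @ W_dyn

z_now = rng.normal(size=8)        # current world-state embedding
action = np.array([1.0, 0.0])     # hypothetical 2-D action (e.g. "push right")

# the imagined outcome is available *before* the action is executed
z_next_pred = predict_next(z_now, action)
```

A planner can roll this step forward several times to compare candidate action sequences without touching the real world.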

• LeJEPA: replaces techniques like masking with an Energy-Based Model (EBM) which mathematically prevents “feature collapse” & ensures the model scales reliably as data increases.
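
"Feature collapse" means every input maps to the same embedding, which trivially minimizes a latent prediction loss. A generic variance-floor penalty illustrates the kind of regularizer that rules this out; note this is a stand-in for the idea, not LeJEPA's actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapse_penalty(z, eps=1e-4):
    # penalize embedding dimensions whose batch std falls below 1.0,
    # a generic anti-collapse term (hypothetical, not LeJEPA's exact math)
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, 1.0 - std))

healthy = rng.normal(size=(32, 8))     # spread-out embeddings
collapsed = np.ones((32, 8))           # every sample maps to one point
```

Here `collapse_penalty(collapsed)` is near its maximum while `collapse_penalty(healthy)` stays small, so adding the penalty to the training loss makes the collapsed solution unattractive.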

• Causal-JEPA: for learning true cause-and-effect physics by applying object level masking.

• V-JEPA 2.1: great for spatial grounding since it combines a dense predictive loss across image & video.

• LeWorldModel: built directly on LeJEPA’s math but super compact — 15M params.

• ThinkJEPA: uses dense physical prediction with VLM reasoning. Best used when long-term strategy is needed.


Over the past couple of weeks, several new papers on JEPA (Joint Embedding Predictive Architecture), like V-JEPA 2.1, LeWorldModel and ThinkJEPA, have been released – and they turned out to be not just incremental, but foundational.
