JEPA: predicting the gist, not the pixels

A large language model predicts the next token. A video generator predicts the next pixels. Yann LeCun's bet is that for understanding the physical world, predicting the raw output is the wrong target, and that a Joint Embedding Predictive Architecture (JEPA), which predicts a compact representation of what comes next, is what robots and planners actually need. Here's the mechanism, with a live demo of the core idea.

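[Interactive demo, panel 1: “What does the model predict?” The observed frame (“Now”) is shown next to the predicted frame; the generative prediction is averaged over both futures and comes out blurry.]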

A generative model is asked to reconstruct the actual future, down to every pixel or token. When several futures are plausible, the loss pulls it toward the average of them all. For text that average is often fine; for video and the physical world it produces a smeared, mushy prediction. You can't plan against a blur.

The catch — representation collapse

Predicting embeddings has a failure mode: if you only reward the model for matching its own predictions, it can cheat by mapping everything to the same point. The prediction is then perfect and useless. Toggle the regularizer.

A regularizer (spread the embeddings out / keep them Gaussian / make features non-redundant) forces distinct, useful representations.

Real fixes for this are the heart of the research: contrastive learning (push negatives apart), Barlow Twins (de-correlate features), DINO-style self-distillation, and LeJEPA's Gaussian regularizer (SIGReg). They all exist to stop the collapse on the left.

The problem with predicting everything

Today's dominant models are generative: they predict the output itself, in full detail. An LLM emits the next token; an image or video model emits the next pixels. This works astonishingly well for language, where the “average” of plausible continuations is usually a coherent sentence. It works far less well for the physical world.

The reason is uncertainty. Drop a ball off a table and it might bounce left or right. A model forced to draw the actual future has to hedge across every possibility at once, and under a pixel-wise loss such as mean squared error the error-minimizing answer is to average them, which renders as a blurry, smeared frame. This is why generative video so often looks mushy when the scene gets unpredictable: the model is literally drawing the mean of all the futures it can't distinguish. You cannot plan a robot's next move against a blur.
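To make the averaging argument concrete, here is a minimal sketch with a toy one-dimensional "frame". The frame width, the ball positions, and the helper names (`frame_with_ball`, `expected_mse`) are all assumptions made up for illustration; no real video model is involved.

```python
import numpy as np

WIDTH = 9  # number of pixels in a toy one-dimensional "frame" (assumed)


def frame_with_ball(position: int) -> np.ndarray:
    """Return a dark 1-D frame with a single bright pixel for the ball."""
    frame = np.zeros(WIDTH)
    frame[position] = 1.0
    return frame


# Two equally likely futures: the ball ends up on the left or on the right.
future_left = frame_with_ball(1)
future_right = frame_with_ball(7)
futures = np.stack([future_left, future_right])


def expected_mse(prediction: np.ndarray) -> float:
    """Squared pixel error, averaged over pixels and over both futures."""
    return float(np.mean((futures - prediction) ** 2))


# The single frame that minimises expected squared error is the pixel-wise mean.
blurry = futures.mean(axis=0)

print("mean frame:                ", blurry)
print("loss if we predict the mean:", expected_mse(blurry))
print("loss if we commit to left:  ", expected_mse(future_left))
```

The mean frame contains two half-brightness balls, a future that never actually happens, yet it scores a lower expected loss than committing to either real outcome. That is the blur.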

Joint embeddings: predict a representation, not the bytes

JEPA changes the prediction target. Instead of reconstructing the next frame pixel by pixel, it encodes both the present and the future into embeddings (compact vectors that capture the meaningful structure of a scene) and predicts the future embedding from the present one. The “joint” part is that inputs and outputs are mapped into the same representation space, so the model can compare them directly.

The payoff is that an embedding doesn't have to commit to unpredictable detail. “A ball, falling, about to leave the right edge” is a perfectly sharp prediction even if the exact landing pixel is unknowable. The model keeps what it can predict and is free to ignore what it can't, so there is nothing to average and nothing to blur. The demo above is exactly this contrast: the generative panel smears two futures together; the JEPA panel predicts the gist and stays crisp.
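A toy sketch of why predicting in embedding space sidesteps the averaging problem. The `encode` function below is a hand-written stand-in, purely an assumption for illustration: a real JEPA learns its encoder from data, and nothing here reflects Meta's actual architecture. The hypothetical encoder keeps "there is a ball, this far from the centre" and drops "which side", the part that can't be predicted.

```python
import numpy as np

WIDTH = 9            # pixels in a toy one-dimensional frame (assumed)
CENTER = WIDTH // 2  # index of the middle pixel


def frame_with_ball(position: int) -> np.ndarray:
    """Return a dark 1-D frame with a single bright pixel for the ball."""
    frame = np.zeros(WIDTH)
    frame[position] = 1.0
    return frame


def encode(frame: np.ndarray) -> np.ndarray:
    """Hypothetical hand-written encoder (a real JEPA learns this mapping).

    Keeps: total brightness and the ball's distance from the centre.
    Drops: whether the ball is left or right of the centre.
    """
    brightness = frame.sum()
    offsets = np.abs(np.arange(WIDTH) - CENTER)
    distance = (frame * offsets).sum()
    return np.array([brightness, distance])


# The same two futures that blurred the pixel prediction: ball left or right.
targets = np.stack([encode(frame_with_ball(1)), encode(frame_with_ball(7))])

# Best single prediction under squared error is still the mean of the targets,
# but both futures now share one embedding, so the mean is not a compromise.
prediction = targets.mean(axis=0)
embedding_loss = float(np.mean((targets - prediction) ** 2))

print("target embeddings:\n", targets)
print("predicted embedding:", prediction)
print("embedding loss:", embedding_loss)
```

Both futures land on the same embedding, so the loss is zero and the prediction matches a real outcome. The trick, of course, is that this encoder was designed by hand to throw away exactly the right detail; learning to do that is the hard part.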

World models: prediction you can plan with

That abstract prediction is the point. A world model is a system that predicts the consequences of an action from the current state: “if I push here, the cup slides there.” Planning means searching over candidate actions and picking the one whose predicted outcome you want. You can search efficiently in a compact representation space; you cannot search over a space of blurry pixel frames. JEPA is LeCun's proposed substrate for world models because its predictions live at the right level of abstraction for planning.
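Planning as "search over candidate actions" can be sketched in a few lines. Everything below is a simplifying assumption: the latent state is just a 2-D position, `predict_next` is a stand-in for a learned predictor, and the four candidate pushes are invented for the example.

```python
import numpy as np


def predict_next(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Stand-in for a learned predictor: assume a push shifts the latent state."""
    return state + action


def plan(state: np.ndarray, goal: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return the candidate action whose predicted outcome is closest to the goal."""
    predicted = np.array([predict_next(state, action) for action in candidates])
    distances = np.linalg.norm(predicted - goal, axis=1)
    return candidates[int(np.argmin(distances))]


cup = np.array([0.0, 0.0])    # current latent state of the cup (assumed)
goal = np.array([1.0, 0.0])   # latent state we want to reach (assumed)
pushes = np.array([           # candidate actions: right, left, up, down
    [1.0, 0.0],
    [-1.0, 0.0],
    [0.0, 1.0],
    [0.0, -1.0],
])

best = plan(cup, goal, pushes)
print("chosen push:", best)
```

The planner picks the rightward push because its predicted outcome lands on the goal. The comparison only works because predicted states and the goal live in the same small vector space, which is the property the paragraph above is pointing at.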

This is also why it's framed as complementary to LLMs, not a replacement. Language models remain the strongest tool for language; JEPA targets the domains where the thing you need to anticipate is a physical outcome, not a sentence. Different prediction target, different job.

The catch: representation collapse

Predicting embeddings introduces a failure mode that pixel-prediction doesn't have. If the model is only rewarded for matching its own predicted embedding, it can take a shortcut: map every input to the same constant vector. Now the prediction is always perfect — and completely worthless, because the representation carries no information. This is representation collapse, and it's the central engineering problem of the whole approach (the second panel in the demo).
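Collapse is easy to reproduce in miniature. In this sketch the data is random noise and both `collapsed_encoder` and `identity_predictor` are hypothetical names invented for illustration; the point is only that a constant encoder gets a perfect score on a pure prediction loss.

```python
import numpy as np

rng = np.random.default_rng(0)
present = rng.normal(size=(100, 8))  # 100 toy "present" observations
future = rng.normal(size=(100, 8))   # their toy "future" observations


def collapsed_encoder(x: np.ndarray) -> np.ndarray:
    """Degenerate encoder: every input maps to the same constant vector."""
    return np.ones((x.shape[0], 4))


def identity_predictor(z: np.ndarray) -> np.ndarray:
    """Trivial predictor: the future embedding equals the present embedding."""
    return z


predicted = identity_predictor(collapsed_encoder(present))
target = collapsed_encoder(future)

# Prediction loss in embedding space: zero, because everything is one point.
prediction_loss = float(np.mean((predicted - target) ** 2))

# Total variance of the embeddings across the batch: also zero.
embedding_variance = float(target.var(axis=0).sum())

print("prediction loss:   ", prediction_loss)
print("embedding variance:", embedding_variance)
```

Zero loss and zero variance together are the signature of collapse: the objective is satisfied, and the embeddings say nothing about the inputs.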

The research is largely a catalog of ways to prevent it. Contrastive learning explicitly pushes unrelated examples apart so they can't share a point. Barlow Twins de-correlates the features so each embedding dimension carries unique information rather than echoing the others. DINO-style self-distillation trains a student network to match a slowly updated teacher, centering and sharpening the teacher's outputs to avoid collapse, and reaches strong representations with no human labels at all. And LeCun's more recent LeJEPA adds a regularizer (SIGReg) that forces the embeddings to stay spread out in a well-behaved (Gaussian) distribution. Different mechanisms, one shared goal: keep the embeddings distinct and informative.
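As one example of how such a regularizer penalizes collapse, here is a NumPy sketch of a Barlow Twins-style objective: standardize each feature over the batch, compute the cross-correlation matrix between two sets of embeddings, and push it toward the identity. The `off_diag_weight` value, the `eps` constant, and the toy data are assumptions for illustration, not settings taken from any paper or library.

```python
import numpy as np


def barlow_twins_style_loss(z_a, z_b, off_diag_weight=0.005, eps=1e-8):
    """Penalise a cross-correlation matrix that differs from the identity.

    z_a, z_b: (batch, features) embeddings of two views of the same inputs.
    off_diag_weight and eps are illustrative values (assumptions).
    """
    n = z_a.shape[0]
    # Standardise every feature across the batch (eps guards zero variance).
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n  # (features, features) cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)          # each feature should agree across views
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # different features should not correlate
    return float(on_diag + off_diag_weight * off_diag)


rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 4))  # spread-out, roughly independent features
collapsed = np.ones((512, 4))        # every input mapped to the same vector

loss_healthy = barlow_twins_style_loss(healthy, healthy)
loss_collapsed = barlow_twins_style_loss(collapsed, collapsed)

print("healthy embeddings:  ", loss_healthy)
print("collapsed embeddings:", loss_collapsed)
```

The collapsed batch has zero variance, so its correlation matrix is all zeros and it pays a penalty of one per feature dimension, while the spread-out batch scores close to zero. Added to a prediction loss, a term like this makes the constant-vector shortcut expensive instead of free.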

Where it stands

JEPA is not just a whiteboard idea. Meta has shipped a line of image and video variants — I-JEPA for images, then V-JEPA and V-JEPA 2 for video and physical-world prediction, trained on large amounts of video plus a little robot interaction data — with vision-language and robotics-focused versions in the same family. The frontier of the work is now theoretical as much as practical: recent papers pin down the precise conditions under which these models actually recover the hidden variables driving what they observe.

The takeaway

The whole bet compresses to a single design choice: what you ask the model to predict. Predict the raw output and uncertainty forces you to average into a blur. Predict a representation and you can keep the predictable structure while dropping the noise — sharp, abstract, plannable. That trade is what makes JEPA a candidate substrate for world models, and the price of admission is solving representation collapse so the representations are worth predicting in the first place.
