LeJEPA: Isotropic Gaussian Latents for JEPA

The paper LeWorldModel is popular these days. It is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms, and it avoids feature collapse without complex techniques.

This blog discusses the paper LeJEPA, which introduces the math framework behind LeWorldModel. LeJEPA answers the question of what distribution a JEPA’s latent space should follow (an isotropic Gaussian) and introduces the tool to achieve it (SIGReg).

I (as a newbie) will also include my own discussion of this work.

Feature Collapse

Suppose the pretext task is to align the representations of two views $x_1$ and $x_2$ of the same observation $x$. With encoder $f_\theta$, the loss is

$$\mathcal{L}(\theta) = \mathbb{E}\,\big\| f_\theta(x_1) - f_\theta(x_2) \big\|_2^2 .$$

For this loss, a collapsed feature map is a global optimum but is undesirable. If the network projects every observation to the same point — that is, $f_\theta(x) = c$ for all $x$ — the loss is zero. However, this collapsed representation is useless for any downstream task.
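A tiny NumPy sketch (toy data and encoders, not the paper's setup) makes the failure mode concrete: a constant encoder achieves exactly zero alignment loss, while a non-collapsed encoder does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observations": two augmented views of the same 8 underlying samples.
x = rng.normal(size=(8, 4))
view1 = x + 0.01 * rng.normal(size=x.shape)
view2 = x + 0.01 * rng.normal(size=x.shape)

def alignment_loss(f, v1, v2):
    """Mean squared distance between the representations of the two views."""
    return np.mean(np.sum((f(v1) - f(v2)) ** 2, axis=1))

# A collapsed encoder: every observation maps to the same constant point.
collapsed = lambda v: np.zeros((v.shape[0], 2))

# A non-collapsed (random linear) encoder.
W = rng.normal(size=(4, 2))
linear = lambda v: v @ W

print(alignment_loss(collapsed, view1, view2))  # exactly 0.0: a global optimum
print(alignment_loss(linear, view1, view2))     # small but positive
```

The collapsed encoder "wins" on the pretext loss while carrying no information about its input, which is why alignment alone is not enough.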

Why Isotropic Gaussian

If the latent space is anisotropic — meaning it is stretched out in some directions and squashed flat in others — it creates two major problems:

  1. Amplified bias. In the squashed, narrow dimensions, data points are crammed too closely together. A downstream classifier will struggle to draw a clean boundary between different classes, leading to high bias.

  2. Amplified variance. In the highly stretched dimensions, the representation becomes overly sensitive to tiny perturbations. Any noise in the input observation is magnified, causing the downstream model’s predictions to fluctuate wildly.
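Both effects can be seen in a toy NumPy experiment (a hypothetical 2-D latent, not from the paper): apply an anisotropic linear map that stretches one axis by 100× and squashes the other to 1/100, then perturb the inputs slightly:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 well-behaved 2-D latent codes.
z = rng.normal(size=(1000, 2))

# An anisotropic linear map: stretch axis 0 by 100x, squash axis 1 to 1/100.
A = np.diag([100.0, 0.01])

# Perturb the inputs slightly and measure how far each output coordinate moves.
eps = 1e-3 * rng.normal(size=z.shape)
shift = (z + eps) @ A - z @ A

print(np.abs(shift[:, 0]).mean())  # stretched axis: perturbation amplified ~100x
print(np.abs(shift[:, 1]).mean())  # squashed axis: points crammed together
```

Along the stretched axis, input noise is magnified (amplified variance); along the squashed axis, all points become nearly indistinguishable, which is what makes a clean downstream decision boundary hard (amplified bias).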

LeJEPA proves the optimality of the isotropic Gaussian for downstream tasks based on linear probing, KNN, and kernel methods.

How To Enforce an Isotropic Gaussian Latent

LeJEPA enforces the target distribution via hypothesis testing rather than by constructing a distance or divergence measure. The motivation for this choice is not entirely clear to me; one evident benefit is interpretability.

Hyperspherical Cramér–Wold Theorem

Hypothesis testing on high-dimensional data is hard. The hyperspherical Cramér–Wold theorem says that a high-dimensional distribution is completely determined by its 1-D projections onto directions on the unit sphere. Therefore, we only need to conduct multiple 1-D tests and combine the results. Concretely, to test whether a distribution is a 1000-D Gaussian, we sample some directions on the unit sphere and test whether the projection onto each one is a 1-D Gaussian.
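In code, the slicing step is just a matrix product. The sketch below (NumPy, with made-up non-Gaussian embeddings) reduces a 1000-D test to `num_slices` univariate samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 1000

# Made-up embeddings that are NOT Gaussian: a mixture of two shifted Gaussians.
z = rng.normal(size=(n, d)) + rng.choice([-3.0, 3.0], size=(n, 1))

# Sample random unit directions on the hypersphere.
num_slices = 16
dirs = rng.normal(size=(num_slices, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Each column of proj is a 1-D sample; testing all columns for Gaussianity
# (and combining the results) stands in for one intractable 1000-D test.
proj = z @ dirs.T
print(proj.shape)  # (4096, 16)
```

Each of the 16 columns can now be fed to any univariate goodness-of-fit test.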

We have converted the problem into a univariate distribution-matching hypothesis test.

Which Hypothesis Testing Paradigm

Moment-based

Unsuitable.

A moment-based method compares the moments of two distributions. Minimizing the sum of divergences over a finite number of moments does not guarantee a distribution match (Theorem 3 of the paper). Moreover, Monte Carlo estimates of many moments, especially higher-order ones, are expensive and high-variance, making them impractical.

CDF-based

Unsuitable.

A CDF-based method compares the empirical CDF with the target CDF, e.g., via a Kolmogorov–Smirnov-style statistic

$$T_n = \sup_t \big| \hat{F}_n(t) - F(t) \big|, \qquad \hat{F}_n(t) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{x_j \le t\}.$$

The problem is that the empirical CDF is built from indicator functions, which have zero gradient almost everywhere, so the statistic cannot be optimized by backpropagation.

Characteristic-function-based

Suitable.

It uses the Epps–Pulley test, which compares the empirical characteristic function $\hat{\varphi}_n(t) = \frac{1}{n}\sum_{j} e^{\,i t x_j}$ against the characteristic function of the standard Gaussian, $e^{-t^2/2}$. Unlike the empirical CDF, the empirical characteristic function is smooth in the data, so the resulting statistic is differentiable and can be minimized with SGD.
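As a sketch of the idea (not the paper's exact closed-form statistic), one can approximate an Epps–Pulley-style statistic by numerically integrating the squared gap between the empirical characteristic function and the Gaussian one under a Gaussian weight:

```python
import numpy as np

def epps_pulley(x, num_t=201, t_max=5.0):
    """Weighted L2 gap between the empirical characteristic function of a 1-D
    sample x and the characteristic function of the standard Gaussian.
    The integral is approximated on a grid; the real test has a closed form."""
    x = (x - x.mean()) / x.std()                      # standardize the sample
    t = np.linspace(-t_max, t_max, num_t)
    ecf = np.exp(1j * np.outer(t, x)).mean(axis=1)    # empirical char. function
    gauss_cf = np.exp(-t**2 / 2)                      # Gaussian char. function
    weight = gauss_cf / np.sqrt(2 * np.pi)            # Gaussian weight w(t)
    dt = t[1] - t[0]
    return len(x) * np.sum(np.abs(ecf - gauss_cf) ** 2 * weight) * dt

rng = np.random.default_rng(0)
g_stat = epps_pulley(rng.normal(size=4096))       # small for Gaussian data
e_stat = epps_pulley(rng.exponential(size=4096))  # much larger for skewed data
print(g_stat, e_stat)
```

Every operation here (complex exponentials, means, sums) is smooth in `x`, which is exactly the differentiability property the CDF-based statistic lacks.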

Accumulation Trick

One question is how many 1-D test directions to select. The SGD accumulation trick: because neural networks learn iteratively via SGD, we can pick a small number of new random directions (e.g., 16 slices) at every step. Over thousands of training steps, the optimizer effectively scans the entire space, smoothing out any non-Gaussian anomalies.
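The coverage intuition behind the trick can be checked numerically. The sketch below (NumPy, with hypothetical sizes) accumulates 16 fresh random slices per "step" and measures how closely an arbitrary probe direction is approached by some sampled slice:

```python
import numpy as np

rng = np.random.default_rng(0)
d, slices_per_step = 64, 16

# Fixed probe directions standing in for "all" directions on the sphere.
probes = rng.normal(size=(256, d))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)

def coverage(dirs):
    """Mean over probes of the best |cosine similarity| to any sampled slice;
    closer to 1 means the slices cover the sphere more densely."""
    return np.abs(probes @ dirs.T).max(axis=1).mean()

covs = []
for steps in (1, 10, 100, 1000):
    # All slices accumulated after `steps` SGD steps, 16 fresh ones per step.
    dirs = rng.normal(size=(steps * slices_per_step, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    covs.append(float(coverage(dirs)))
    print(steps, round(covs[-1], 3))
```

In the actual training loop the slices would be resampled at every SGD step and the Epps–Pulley penalty applied to each projection; this sketch only checks the claim that many cheap steps accumulate into dense directional coverage.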

My Discussion

  1. The optimality of the isotropic Gaussian is proven for linear probing and classical non-linear methods like KNN and kernel methods. However, these don’t represent all “downstream tasks.” As long as the system empirically works well, this is fine — but the theoretical grounding of LeJEPA is weak.

  2. Let’s take a step back and ask why collapsed representations don’t appear in biological systems. Because they are useless. For example, we can do future-state prediction in the representation space: we roll out the future and select the best action sequence as our planning module. With a collapsed representation, the reward model can’t assign appropriate rewards and will fail. In this case, we build a system (plan + act) where the representation and the other parts of the system co-evolve, and we use this representation for the same downstream task during both training and inference. In LeJEPA, however, the representation is trained solely on the pretext task. I don’t think it’s philosophically valid to train a standalone representation for a system and to avoid feature collapse for its own sake. Regularization works and is important, but the signal from co-evolving modules in the system is the way.