Tech Blog
I believe in two things:
1. Mathematically grounded work, if verified on small models, is more likely to scale to large models.
2. As a rigorous AI researcher, I should understand the nitty-gritty details of deep learning systems so I can run large-scale experiments with high iteration speed.
Math
- VAE and Diffusion
- LeJEPA: Isotropic Gaussian Latents for JEPA
- Linear Attention
- The Cumulant Generating Function: From Moments to Max
DLSys
- Understand FSDP2
- FSDP2 Small Tricks
- Ring All-Reduce
- Ring Flash Attention (As An Example of Context Parallelism)
- Tensor Parallel, Sequence Parallel, and Loss Parallel