Best Practice for the FAST Tokenizer


1. Brief Introduction to FAST

On January 16, 2025, Physical Intelligence (PI) published the FAST action tokenizer. Their core motivation for developing a new tokenizer (rather than using a bin-based one) is that binning cannot capture the rapidly changing actions (high-frequency information) required for dexterous manipulation. Imagine reaching for a cup: the action commands change slowly over time (low frequency). But when folding clothes, the garment's state keeps changing, so the actions must keep changing too (high frequency). The new tokenizer prevents this high-frequency information from being lost during action tokenization.

FAST's core claim is that it enables auto-regressive (AR) VLAs to train about five times more efficiently than diffusion-based VLAs while achieving comparable performance. AR VLAs already train faster than flow-matching-based ones, but their performance on tasks requiring dexterous manipulation has historically lagged; FAST aims to close this gap. (And in my limited experience, the AR loss is a much better signal of training progress than the flow-matching loss: a flow-matching model typically still needs to train for a long time after its loss converges…)

Moreover… LLMs today are all based on AR loss. If VLAs also use AR loss, this means VLAs can borrow nearly all of the inference and training infrastructure optimizations from the LLM ecosystem.

2. Personal Tips for Using FAST

2.1 Conclusion

TL;DR: Add three special tokens: BOS token, EOS token, and action chunk split token. And read through this issue before using it.

2.2 Analysis (Using the LIBERO Dataset as an Example)

2.2.1 Prerequisites
2.2.2 EOS Token

Although FAST takes a fixed-horizon action chunk as input, its output token length is variable, so sequence lengths within a batch are inconsistent. An EOS token is therefore needed to serve as the padding token. Moreover, without an EOS token the model does not know when to stop generating, and decoding keeps producing zero actions… Additionally, if you define the action head from a HuggingFace model template (e.g., Gemma2), defining an EOS token lets you conveniently call the generate method for inference.
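As a minimal sketch of EOS-as-padding (helper name is mine, not FAST API), assuming eos_token_id = 2048, i.e. the first id after FAST's 2048-token vocabulary: every sequence gets one EOS so the model learns when to stop, then shorter sequences are right-padded with EOS up to the batch maximum.

```python
EOS_TOKEN_ID = 2048  # assumption: first id after FAST's vocab (ids 0..2047)

def pad_batch_with_eos(token_lists):
    # Append EOS to every sequence so the model learns when to stop.
    with_eos = [seq + [EOS_TOKEN_ID] for seq in token_lists]
    max_len = max(len(seq) for seq in with_eos)
    # Right-pad with EOS; these padding positions are masked out of the loss.
    return [seq + [EOS_TOKEN_ID] * (max_len - len(seq)) for seq in with_eos]

batch = pad_batch_with_eos([[256, 279, 693], [890, 2033]])
# batch -> [[256, 279, 693, 2048], [890, 2033, 2048, 2048]]
```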

2.2.3 BOS Token

BOS serves two purposes in a FAST-based VLA: first, it separates the observation tokens (prefill tokens) from the action tokens, allowing the model to correctly predict the first action token; second, during inference it acts as the placeholder that pads the observation positions when calling generate.

Current action head implementations first extract the VLM’s KV cache, then the action head attends to the VLM’s tokens. Thus, training and inference look like this:

"""
Pseudocode below
"""

action_head = Gemma2ForCausalLM(action_head_cfg)

# If using a Gemma-based action head like Pi0, use HybridCache for KV cache.
kv_cache_from_vlm: Cache = vlm(..., use_cache=True)["past_key_values"]
hybrid_kv_cache_for_gemma: HybridCache = HybridCache(...) # transform any Cache impl to HybridCache

# Training
loss = action_head(
input_ids=action_tokens,
labels=action_labels,
past_key_values=kv_cache_from_vlm,
use_cache=True,
)

# Inference
generated_tokens = action_head.generate(
# If not writing your own generate method, must pad observation input IDs
# shape: [bsz, vlm_input_seq_len + 1], value: bos_token_id
# +1 because the BOS before the actions must be included
input_ids=bos_tokens_with_shape_of_vlm_input_ids,
past_key_values=kv_cache_from_vlm,
use_cache=True, # don't use max_new_tokens, let the model stop at EOS
)

Correctly predicting the first (few) tokens is especially important, because as the FAST first author stated:

Intuitively, the first predicted tokens in FAST, the lowest frequency tokens, dominate the overall shape of the output, and those are often good, even if the model messes up later tokens (errors in an autoregressive model accumulate, so earlier tokens tend to be more accurate than later tokens).

2.2.4 Action Chunk Split Token for Longer Horizons

In LIBERO, one second of actions corresponds to an action horizon of 10. This means that with only BOS and EOS, the model can only correctly decode a horizon of 10, limiting our ability to increase the predicted action horizon. If you directly feed in an action chunk with horizon 50, decoding produces large errors:

"""
Pseudocode below
"""

tokenizer = AutoProcessor.from_pretrained(
"physical-intelligence/fast",
trust_remote_code=True,
action_dim=7,
)

# Correct: action horizon = 10 => action.shape = (bsz, 10, 7)
action_token = tokenizer(action)
decoded_action = tokenizer.decode(action_token, time_horizon=10)
diff = action - decoded_action
print(torch.sum(torch.square(diff)).item()) # out: ~ 0.05

# Wrong: action horizon = 50 => action.shape = (bsz, 50, 7)
action_token = tokenizer(action)
decoded_action = tokenizer.decode(action_token, time_horizon=50) # can decode successfully
diff = action - decoded_action
print(torch.sum(torch.square(diff)).item()) # out: ~ 397

Therefore, to support longer predicted action horizons, you need to insert an action chunk split token every 10 actions. This way, decoding is performed per 10-action segment. In practice, with action horizon 50, torch.sum(torch.square(diff)).item() is roughly 0.38, essentially back to normal.
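To make this concrete, here is a minimal sketch (the helper names and the split-token id are my own choices, not FAST API) of encoding a long action chunk 10 steps at a time with a split token between segments, then decoding segment by segment. It assumes the horizon is a multiple of 10 and mirrors the tokenizer/decode calls shown above.

```python
import numpy as np

SPLIT_TOKEN_ID = 2050  # assumption: an id past FAST's vocab plus BOS/EOS
CHUNK = 10             # LIBERO: 10 actions per second

def encode_long_horizon(tokenizer, action):
    """action: (T, action_dim) with T a multiple of CHUNK -> flat token list."""
    tokens = []
    for start in range(0, action.shape[0], CHUNK):
        if tokens:
            tokens.append(SPLIT_TOKEN_ID)  # separator between chunks
        # FAST tokenizes batches; wrap one chunk as a batch of one.
        tokens.extend(tokenizer(action[None, start:start + CHUNK])[0])
    return tokens

def decode_long_horizon(tokenizer, tokens):
    """Inverse of the above: split on SPLIT_TOKEN_ID, decode each 10-step segment."""
    segments, current = [], []
    for tok in tokens + [SPLIT_TOKEN_ID]:  # sentinel flushes the last segment
        if tok == SPLIT_TOKEN_ID:
            segments.append(tokenizer.decode([current], time_horizon=CHUNK)[0])
            current = []
        else:
            current.append(tok)
    return np.concatenate(segments, axis=0)  # back to (T, action_dim)
```

With this, a horizon-50 chunk is five independent horizon-10 decodes, which is why the reconstruction error returns to the normal range.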

3. Summary

Suppose a dataset has an FPS of 3 (so a one-second chunk contains 3 actions), and after training we want to decode an action horizon of 9 (three chunks). Batch size is 2.

FAST's vocab size is 2048 (token ids 0–2047). Let eos_token_id = 2048, bos_token_id = 2049, action_chunk_split_token_id = 2050. The before/after processing comparison:

# Before processing
[[256, 279, 693, 1045, 937],
 [890, 2033, 267, 574, 28, 92, 5]]

# After processing
[[2049, 256, 2050, 279, 693, 1045, 2050, 937, 2048, 2048, 2048],  # begin with 2049, split chunks with 2050, EOS then EOS padding
 [2049, 890, 2033, 2050, 267, 2050, 574, 28, 92, 5, 2048]]        # begin with 2049, split chunks with 2050, end with EOS
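The processing above can be sketched as a small function (the function name is mine; the per-chunk token lists stand in for FAST's output for each one-second chunk). Note that every sequence gets its own EOS before padding, and EOS doubles as the pad token.

```python
# Assumed special ids, matching the text: EOS = 2048 (also pad), BOS = 2049, split = 2050.
BOS_TOKEN_ID, EOS_TOKEN_ID, SPLIT_TOKEN_ID = 2049, 2048, 2050

def build_batch(chunked_token_lists):
    """chunked_token_lists: for each sample, a list of per-chunk token lists."""
    sequences = []
    for chunks in chunked_token_lists:
        seq = [BOS_TOKEN_ID]                # begin with BOS
        for i, chunk in enumerate(chunks):
            if i > 0:
                seq.append(SPLIT_TOKEN_ID)  # split token between chunks
            seq.extend(chunk)
        seq.append(EOS_TOKEN_ID)            # every sequence ends with EOS
        sequences.append(seq)
    # Right-pad shorter sequences with EOS up to the batch maximum.
    max_len = max(len(s) for s in sequences)
    return [s + [EOS_TOKEN_ID] * (max_len - len(s)) for s in sequences]
```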