mimic-video: Video-Action Models for Generalizable
Robot Control Beyond VLAs
Jonas Pai∗1,2,3, Liam Achenbach ∗1,2,3, Victoriano Montesinos 1, Benedek Forrai 1, Oier Mees †2,5, Elvis Nava †1,3,4
1mimic robotics 2Microsoft Zurich 3ETH Zurich 4ETH AI Center 5UC Berkeley
∗Core Contributors † Co-advising
https://mimic-video.github.io
Fig. 1: We introduce mimic-video, a new class of Video-Action Model (VAM) that grounds robotic policies in pretrained video models.
Unlike standard VLAs that must learn physical dynamics from scratch (top), mimic-video leverages the inherent visual dynamics of video
backbones to isolate the control problem (bottom). This enables state-of-the-art performance on dexterous manipulation tasks, while achieving
10x greater sample efficiency compared to VLAs (right).
Abstract—Prevailing Vision-Language-Action Models (VLAs)
for robotic manipulation are built upon vision-language backbones
pretrained on large-scale, but disconnected static web data. As
a result, despite improved semantic generalization, the policy
must implicitly infer complex physical dynamics and temporal
dependencies solely from robot trajectories. This reliance creates
an unsustainable data burden, necessitating continuous, large-scale
expert data collection to compensate for the lack of innate physical
understanding. We contend that while vision-language pretraining
effectively captures semantic priors, it remains blind to physical
causality. A more effective paradigm leverages video to jointly
capture semantics and visual dynamics during pretraining, thereby
isolating the remaining task of low-level control. To this end, we
introduce mimic-video, a novel Video-Action Model (VAM) that
pairs a pretrained Internet-scale video model with a flow matching-
based action decoder conditioned on its latent representations. The
decoder serves as an Inverse Dynamics Model (IDM), generating
low-level robot actions from the latent representation of video-
space action plans. Our extensive evaluation shows that our
approach achieves state-of-the-art performance on simulated and
real-world robotic manipulation tasks, improving sample efficiency
by 10x and convergence speed by 2x compared to traditional
VLA architectures.
I. INTRODUCTION
Building on the capabilities of pretrained Vision-Language
Models (VLMs), Vision-Language-Action Models (VLAs)
transfer semantic knowledge acquired through Internet-scale
vision-language pretraining to the physical domain. By fine-
tuning a VLM on diverse robot action data, VLAs learn
generalist, natural language-conditioned robot manipulation
policies that combine a wide range of skills and exhibit
impressive generalization to unseen instructions, objects, and
environments [60, 26, 3, 25].
However, this paradigm faces a fundamental limitation: the
pretraining data, while massive in scale, is inherently static.
Images and text lack explicit, temporally-grounded information
about dynamics and physical procedures that are crucial for
complex manipulation. Consequently, the burden of learning
physical dynamics (how objects move, deform and interact)
falls entirely on the post-training stage, where the model
must infer these from scarce and expensive expert-teleoperated
demonstrations. This heavy reliance on robot data creates a
data-efficiency bottleneck that limits scalability. While prior
works have explored augmenting VLA training with auxiliary
video-derived signals such as language plans, affordances, or
keypoints [ 55, 19, 24, 25], reducing dense video into such
sparse representations creates an information bottleneck, failing
to capture fine-grained dynamics.
In this work, we posit that the key to more sample-efficient
and capable robot policies lies in leveraging a pretraining
modality that inherently encodes dynamic, procedural infor-
mation: video. Unlike static image-text pairs, internet-scale
video data provides rich knowledge on “how things are done”,
capturing the nuanced physics of interaction, how objects move,
deform and react to forces. However, effectively harnessing
this data remains a significant challenge. Prior approaches
leveraging pretrained video models typically learn the joint
distribution over video and actions, factorized such that the
predicted actions are conditioned on synthesized future frames
[30, 29, 15]. However, recovering the policy typically requires
fully generating these future frames, necessitating prohibitive
video synthesis at every control step.
To address these limitations, we propose a more direct
paradigm: grounding robot policies directly in the latent repre-
sentations of a generative, pretrained video model. We introduce
mimic-video, a novel Video-Action Model (VAM) that unifies
video modeling with robot control. Built upon a state-of-the-
art video diffusion backbone, mimic-video functions by first
synthesizing a visual plan: given an initial observation and
language instruction, the video backbone predicts a future
trajectory within a compact latent space. Rather than requiring
full or even partial video generation, we extract intermediate
video model representations to condition a downstream action
decoder. This decoder operates as an Inverse Dynamics Model
(IDM) to recover low-level motor commands. This formulation
allows the video backbone to remain frozen, eliminating the
need to train it on scarce robot action data. Fundamentally,
this architecture decouples the inherent multi-modality of long-
horizon planning, now offloaded to the video backbone, from
the downstream control task. This effectively frees the action
decoder from modeling complex future distributions, allowing
it to dedicate its entire capacity to the far simpler, unimodal
and non-causal problem of inverse dynamics [34, 35].
Our primary contribution is mimic-video, a novel generalist
robot policy that integrates generative video pretraining with
flow matching-based control, establishing a new class of
methods we term Video-Action Models (VAMs). We evaluate
our approach across a diverse suite of robotic embodiments
ranging from standard single-arm manipulation to bimanual
dexterous tasks, demonstrating state-of-the-art results in both
simulated benchmarks and challenging real-world environments.
Our mimic-video model achieves this performance while
improving sample efficiency by 10x and convergence speed by
2x compared to traditional VLA architectures.
II. RELATED WORK
a) Imitation Learning for Robot Control: End-to-end imi-
tation learning has become the dominant paradigm for training
general-purpose robot manipulation policies, enabling robots
to acquire complex skills directly from expert demonstrations.
This approach, which maps raw sensory observations to actions,
has benefited from advances in generative modeling in so far
as models act as “data sponges”, able to absorb large and
diverse pretraining datasets [ 9, 26, 60, 11, 45, 4] to achieve
downstream generalization in action generation.
While early approaches like ACT [57] used a VAE to
model action chunks, the field has shifted toward iterative
generative frameworks, popularized by Denoising Diffusion
Probabilistic Models [22]. This class of methods, encompassing
Diffusion Policy [ 8, 10] and the Flow Matching [ 31] decoders
of the π0/π0.5 series [3, 25], has become the state-of-the-art.
These generative approaches excel at modeling multi-modal
distributions of expert actions and form the technical foundation
for modern robot imitation learning policies, including our own
action decoder.
b) Vision-Language-Action (VLA) Models: A major break-
through in robot learning has been the paradigm of Vision-
Language-Action (VLA) models, which are obtained by
finetuning large, pretrained Vision-Language Models (VLMs)
on robotics data. Models like RT-2 [ 60], OpenVLA [ 26],
and the π0/π0.5 series [3, 42, 25] leverage the vast semantic
knowledge embedded in their backbones from pretraining on
internet-scale image-text data. This allows them to follow open-
ended language instructions, understand abstract concepts, and
generalize to novel objects, environments, and tasks in a zero-
shot fashion. However, a fundamental limitation of VLAs is
that the VLM backbones they make use of are only pretrained
with static vision and language data. They lack an inherent
model of video dynamics, physics, or temporal progression,
limiting their ability to reason about the physical consequences
of actions. This critical knowledge must be learned from scratch
from comparatively small and expensive robotics datasets.
Several works [ 55, 56] make use of techniques like Chain-
of-Thought reasoning [ 52] in order to extract more useful
grounded conditioning signals and representations [ 7, 24, 56]
for VLAs. However, those approaches are still ultimately
limited by relying on the static knowledge embedded in the pre-
existing image-text pretraining of VLMs. They also typically
result in significantly slower inference due to the computation
of autoregressive plans before action decoding.
c) Video Models for Policy Learning: The utilization of
video prediction for robotic control has a long-standing history,
primarily motivated by the potential to enable planning through
visual foresight. Early works, such as the seminal approaches by
Oh et al. [39], Watter et al. [51], Fragkiadaki et al. [18], Finn
et al. [17], Finn and Levine [16], demonstrated how video
prediction could enhance physical interaction. As generative
models have matured to produce high-definition, coherent long-
form content [ 40, 53], recent works have explored diverse
integrations of video generation with policy learning. The use
of action-conditioned video models (world models) for policy
learning has recently seen significant adoption. World models
can help select more optimal action sequences at runtime
by "imagining" their outcome [ 1, 43], or be used as learned
simulators for evaluation and DAGGER-like [ 47] approaches
[20, 48]. In this work, we consider non-action-conditioned video
models. One line of work fully generates pixel-space future
video and obtains actions either via non-parametric methods
such as tracking a custom end effector-mounted tool [ 29], or
learned pixel-based Inverse Dynamics Models [ 14, 15]. CoT-
VLA [56] uses a pretrained VLM capable of generating images
to generate a subgoal image and actions in one autoregressive
sequence. Another line of work learns to model video during
training without predicting video in action sampling: Unified
World Models [59] learns a model from scratch that can flexibly
function as a policy, a video prediction model, or a forward or
inverse dynamics model, while LAPA [54] finetunes a VLM to
Fig. 2: We compare success rates when conditioning our action
decoder on different visual inputs: video latents extracted from
either predicted or ground-truth (expert) video, for both a
standard pretrained video model (gray) and a video model
finetuned on video data from the robot dataset (orange). The
near-perfect performance with ground-truth inputs confirms
that control effectively reduces to visual prediction, implying
policy performance scales directly with video model quality.
predict “latent actions” (an encoding of the difference between
the current and a future image), re-training only the output layer
to predict actions in a subsequent training stage. FLARE [ 58]
aligns intermediate VLA representations with future vision-
language embeddings, implicitly modeling video and actions
jointly. Similar to our approach, Video Policy [30] explicitly
models the joint video-action distribution and conditions a
policy model on intermediate video model representations, but
crucially does not allow for efficient sampling of the marginal
action distribution.
Our proposed approach departs from most prior work by
directly grounding control in the rich latent priors of internet-
scale video models, rather than training from scratch or relying
on pixel-level reconstruction. Additionally, by conditioning a
lightweight inverse dynamics model on intermediate, noisy
latent states, we bypass the computational cost of full video
generation and the brittleness of heuristic tracking. This enables
a scalable, end-to-end framework that effectively transfers
broad physical understanding to downstream manipulation
tasks, significantly reducing the reliance on expensive, large-
scale robotic demonstrations.
III. CASE STUDY: HOW DOES VIDEO GENERATION
QUALITY AFFECT ROBOT POLICY PERFORMANCE?
In this work, we argue that the internal representations
arising in pretrained video model backbones are better suited
for downstream robot learning compared to those in the
VLM backbones commonly used in current state-of-the-art
VLAs. Intuitively, video models jointly model images and
physical dynamics, alongside visual action plans. A policy trained on
such representations effectively reduces the role of the action
decoder to a simple translator, mapping visual action plans into
a low-dimensional robot action trajectory. If this hypothesis
holds, the bulk of learning in Video-Action Models falls on
the large-scale video pretraining and finetuning phases, while
training the action decoder (the step requiring expensive, high-
quality robot teleoperation data) becomes lightweight and data
efficient.
We investigate this claim by conducting an “oracle” case
study (see Fig. 2), where we disentangle the difficulty of
predicting the future in robotic control tasks from executing
it. Concretely, we train an action decoder on top of video
representations and evaluate its performance under different
conditioning regimes. We compare success rates when the
decoder is conditioned on predicted video latents, from either a
standard off-the-shelf video model or one finetuned on robotics
data, versus “oracle” latents extracted from ground-truth future
video frames. We observe a pronounced scaling behavior: while
minimizing the domain gap via finetuning leads to improved
performance when using predicted video, conditioning on
oracle latents yields near-perfect success rates regardless of
whether the underlying backbone is finetuned on the target
distribution or not. Notably, this finding suggests that a high-
quality pretrained video model backbone provides extremely
rich representations for action decoding, sufficient on their own
to perfectly decode low-level action plans with a decoder trained
on minimal low-level action finetuning data. Consequently, the
burden for policy learning in VAMs effectively shifts away from
low-level action decoding towards video model pretraining and
finetuning.
IV. VIDEO-ACTION MODELS
We introduce mimic-video, a generative Video-Action Model
(VAM) capable of modeling the joint distribution of video and
robot actions. Our architecture couples two Conditional Flow
Matching (CFM) models: a pretrained, language-conditioned
video backbone and a lightweight action decoder that functions
as an Inverse Dynamics Model (IDM) by conditioning on the
video model’s latent representations.
A. Preliminaries: Flow Matching
Both the video and action prediction components are trained
using the Flow Matching framework [ 31] to model a data
distribution p0(x0) by constructing a Continuous Normalizing
Flow [6]. We use the conditional optimal transport path
$$x_\tau = (1-\tau)\,x_0 + \tau\,\varepsilon, \qquad \tau \in [0,1] \qquad (1)$$

which interpolates between clean data $x_0$ (at $\tau = 0$) and
Gaussian noise $\varepsilon \sim \mathcal{N}(0, I)$ (at $\tau = 1$) to define the conditional
probability path $p_\tau(x_\tau \mid x_0)$. The model parameterizes an
estimator $v_\theta$ of the intractable marginal generating vector field

$$u_\tau(x_\tau) = \mathbb{E}_{p(x_0 \mid x_\tau)}\left[u_\tau(x_\tau \mid x_0)\right],$$

where $u_\tau(x_\tau \mid x_0) := \frac{d}{d\tau} x_\tau = \varepsilon - x_0$ is termed the conditional
generating vector field and can be computed trivially for
samples $x_0, \varepsilon$. The power of flow matching lies in learning $v_\theta$
by regressing to $u_\tau(x_\tau \mid x_0)$:

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{\mathcal{T}(\tau),\, p_0(x_0),\, p_\tau(x_\tau \mid x_0)} \left\| v_\theta(x_\tau, \tau) - u_\tau(x_\tau \mid x_0) \right\|^2, \qquad (2)$$
Fig. 3: mimic-video architecture: we instantiate our framework with a pretrained video generation backbone (Cosmos-Predict2 [38,
37]), which provides rich physical dynamics priors learned from large-scale video data. We adapt this model for control via a
partial denoising strategy, where the video backbone follows the flow to an intermediate flow time $\tau_v$ to extract latent visual
plans. These representations condition a smaller action decoder, which processes proprioceptive states and predicts action
trajectories. The video and action components operate on independent flow schedules ($\tau_v$ and $\tau_a$), allowing us to design the
learning problem separately for each modality.
where the expectation is taken over a distribution $\mathcal{T}$ of flow
times $\tau$, which is $U([0,1])$ in [31] and will take different values
in this work.

Inference is performed by integrating the learned field $v_\theta$
from $\tau = 1$ to $\tau = 0$ to recover $\hat{x}_0 \sim p_0$:

$$\hat{x}_0 = \varepsilon + \int_1^0 v_\theta(\hat{x}_\tau, \tau)\, d\tau \qquad (3)$$

Critically, this continuous time parameter $\tau$ allows us to define
partial denoising (stopping at intermediate $\tau > 0$), which is
central to our method.
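To make the training target and the sampler concrete, below is a minimal PyTorch sketch of the CFM objective (Eq. 2) and the Euler integration of Eq. 3. The callable `v_theta`, the step counts, and the tensor layouts are illustrative assumptions rather than the paper's implementation; setting `tau_stop > 0` corresponds to the partial denoising used later in the method.

```python
import torch

def cfm_loss(v_theta, x0):
    """One Conditional Flow Matching step (Eq. 2): regress the predicted
    velocity v_theta(x_tau, tau) onto the conditional target eps - x0."""
    b = x0.shape[0]
    tau = torch.rand(b, *([1] * (x0.dim() - 1)))   # tau ~ U([0,1]), broadcastable
    eps = torch.randn_like(x0)                     # Gaussian noise endpoint
    x_tau = (1 - tau) * x0 + tau * eps             # optimal transport path, Eq. 1
    return ((v_theta(x_tau, tau) - (eps - x0)) ** 2).mean()

@torch.no_grad()
def flow_sample(v_theta, shape, steps=10, tau_stop=0.0):
    """Euler integration of Eq. 3 from tau=1 (noise) down to tau_stop;
    tau_stop > 0 yields a partially denoised sample."""
    x = torch.randn(shape)
    taus = torch.linspace(1.0, tau_stop, steps + 1)
    for t0, t1 in zip(taus[:-1], taus[1:]):
        tau = t0.expand(shape[0], *([1] * (len(shape) - 1)))
        x = x + (t1 - t0) * v_theta(x, tau)        # dx = v_theta(x, tau) * d(tau)
    return x
```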
B. Architecture Formulation
Formally, we aim to learn a generalist robot policy
$\pi(A_t \mid o_t, l)$ that predicts a sequence of actions $A_t =
[a_t, \ldots, a_{t+H_a-1}]$ given observations consisting of multiple
RGB images $I_{t'}$, a language instruction $l$ and the robot's
proprioceptive state $q_t$, such that $o_t = [I_{t-H_o+1}, \ldots, I_t, l, q_t]$.
Our model consists of two flow matching-based models
trained using the objective defined in Eq. 2. Let $z^0_t$ be the
sequence of video encodings and $A^0_t$ be the clean action chunk:

Video Model: $v_\phi(z^0_{\mathrm{past}}, z^{\tau_v}_{\mathrm{future}}, l, \tau_v)$ induces $p_\phi(z^0_{\mathrm{future}} \mid z^0_{\mathrm{past}}, l)$.

Action Policy: $\pi_\theta(A^{\tau_a}_t, q_t, h^{\tau_v}, \tau_a, \tau_v)$ induces the action
distribution $p_\theta(A^0_t \mid q_t, h^{\tau_v}_t, \tau_v)$.

Here, $h^{\tau_v} = v^{(k)}_\phi(z^0_{\mathrm{past}}, z^{\tau_v}_{\mathrm{future}}, l, \tau_v)$ is the vector of hidden
states extracted after the $k$th layer of the video model when
invoking it on "noisy" video input $z^{\tau_v}_{\mathrm{future}}$ (computed via Eq. 1)
at flow time $\tau_v$. We illustrate our architecture in Fig. 3.
C. Video Model
While our Video-Action Model formulation can be instanti-
ated with any flow matching-based video model, in practice
we use Cosmos-Predict2 [ 38, 37] as our base model. Cosmos-
Predict2 is an open-source 2B latent Diffusion Transformer
(DiT) [41] model that operates on a sequence of video frames
encoded by a pretrained 3D-tokenizer. The input to the model
is a concatenation of clean latent patch embeddings from a
context prefix (for which we choose to use 5 frames) and “noisy”
latent patches representing the future frames to be generated.
Each transformer layer alternates between (1) self-attention
over the full video sequence, (2) cross-attention to language
instructions encoded by T5 [44], and (3) a two-layer MLP.
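For reference, the following is a hypothetical PyTorch sketch of one such transformer layer; the hidden size, normalization placement, and attention details are assumptions, not the released Cosmos-Predict2 implementation.

```python
import torch.nn as nn

class VideoDiTBlock(nn.Module):
    """Hypothetical sketch of one Cosmos-style DiT layer: full self-attention
    over video tokens, cross-attention to T5 text embeddings, then an MLP."""
    def __init__(self, d=1024, heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, video_tokens, text_tokens):
        # (1) self-attention over the concatenated clean-context + noisy-future tokens
        h = self.n1(video_tokens)
        video_tokens = video_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # (2) cross-attention to the language instruction (T5 encodings)
        h = self.n2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(h, text_tokens, text_tokens,
                                                      need_weights=False)[0]
        # (3) two-layer MLP
        return video_tokens + self.mlp(self.n3(video_tokens))
```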
D. Action Decoder
The action decoder is instantiated as a DiT that encodes
the robot's proprioceptive state $q_t$ and the chunk $A_t$ of
future robot actions through two separate MLP networks
and concatenates them to form the action decoder's sequence
dimension. We use learned absolute positional encodings to
add temporal information to each token. During training, we
randomly replace the soft token encoding the proprioceptive
state with a learned mask token to prevent overfitting on the low
dimensional observation. Each action decoder layer consists of
(1) cross-attention to intermediate video model representations
$h^{\tau_v}$, (2) self-attention over the action sequence, and (3) a two-
layer MLP. Each component is bypassed by a residual path and
each component's output is modulated via AdaLN [41], where
the input to the AdaLN projections is a low-rank bilinear-affine
encoding of both video and action flow times $\tau_v$ and $\tau_a$.
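The following is a minimal sketch of one such layer in PyTorch, assuming batch-first tensors; for brevity, the low-rank bilinear-affine flow-time encoding is replaced by a plain MLP producing per-sublayer scale, shift, and gate vectors.

```python
import torch
import torch.nn as nn

class ActionDecoderBlock(nn.Module):
    """Sketch of one action decoder layer: (1) cross-attention to video hidden
    states, (2) self-attention over action tokens, (3) MLP; each sublayer is
    residual and AdaLN-modulated by the pair of flow times (tau_v, tau_a)."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norms = nn.ModuleList(
            [nn.LayerNorm(d, elementwise_affine=False) for _ in range(3)])
        # 3 sublayers x (scale, shift, gate); the paper uses a low-rank
        # bilinear-affine encoding of (tau_v, tau_a) -- simplified here.
        self.ada = nn.Sequential(nn.Linear(2, d), nn.SiLU(), nn.Linear(d, 9 * d))

    def forward(self, actions, video_h, tau_v, tau_a):
        # actions: (B, T, d); video_h: (B, S, d); tau_v, tau_a: (B,)
        mods = self.ada(torch.stack([tau_v, tau_a], dim=-1)).chunk(9, dim=-1)
        sublayers = [
            lambda x: self.cross(x, video_h, video_h, need_weights=False)[0],
            lambda x: self.self_attn(x, x, x, need_weights=False)[0],
            self.mlp,
        ]
        for i, sublayer in enumerate(sublayers):
            scale, shift, gate = (m.unsqueeze(1) for m in mods[3 * i:3 * i + 3])
            h = self.norms[i](actions) * (1 + scale) + shift  # AdaLN modulation
            actions = actions + gate * sublayer(h)            # gated residual
        return actions
```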
E. Action Sampling
To enable real-time control, we formulate inference as
efficient sampling from the marginal action policy. Although
mimic-video is in principle capable of sampling from the
joint video-action distribution (see Fig. 4 for an example),
we can sample from the marginal action distribution more
efficiently by bypassing the computational cost of full video
reconstruction. We therefore propose a partial denoising
strategy that extracts semantic features from intermediate flow
states without resolving fine-grained pixel details. Our inference-
time action sampling procedure is described in Algorithm 1. Given
image observations $o_t$, we integrate the video flow field from
Gaussian noise to an intermediate flow time $\tau_v$ (see Eq. 3).
This yields a partially denoised latent state $z^{\tau_v}_{\mathrm{future}}$ that retains
sufficient structural information to guide the policy. We process
this state with the first $k$ layers of the video model and pass the
resulting activations as conditioning information to the action
decoder. The action decoder then performs a full denoising
procedure to produce a chunk of robot actions $A^0_t$.
Algorithm 1 Action Sampling($k, \tau_v$)
1: Input: $z^0_{\mathrm{past}}, q_t, l$
2: $z^1_{\mathrm{future}}, A^1_t \sim \mathcal{N}(0, I)$
3: $z^{\tau_v}_{\mathrm{future}} \leftarrow z^1_{\mathrm{future}} + \int_1^{\tau_v} v_\phi(z^0_{\mathrm{past}}, z^{\tau'_v}_{\mathrm{future}}, l, \tau'_v)\, d\tau'_v$
4: $h^{\tau_v} \leftarrow v^{(k)}_\phi(z^0_{\mathrm{past}}, z^{\tau_v}_{\mathrm{future}}, l, \tau_v)$
5: $A^0_t \leftarrow A^1_t + \int_1^0 \pi_\theta(A^{\tau_a}_t, q_t, h^{\tau_v}_t, \tau_a, \tau_v)\, d\tau_a$
6: return $A^0_t$
At inference time, $\tau_v$ is a free hyperparameter. Its optimal
value is task-dependent, but we show empirically in Sec. V-C
that it is generally close to 1 (high noise). In the special case of
$\tau_v = 1$, a single forward pass of the computationally intensive
video backbone is sufficient to generate a chunk of actions (line
3 in Algorithm 1 becomes redundant), facilitating real-time
inference in our experiments. We find that $\tau_v = 1$ is a good
default value that balances policy performance and inference
speed. See Sec. E for a discussion on the motivation behind
"noisy" video conditioning.
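A sketch of this sampling procedure might look as follows, under assumed module signatures (in particular, that the video model returns both its velocity estimate and the hidden states after layer $k$); with $\tau_v = 1$ the video integration loop is skipped entirely, matching the single-forward-pass fast path described above.

```python
import torch

@torch.no_grad()
def sample_actions(video_model, action_decoder, z_past, q_t, text_emb,
                   future_shape, action_shape,
                   tau_v=1.0, video_steps=4, action_steps=10):
    """Sketch of Algorithm 1; interfaces and step counts are assumptions."""
    z_fut = torch.randn(future_shape)                   # line 2: video noise
    if tau_v < 1.0:                                     # line 3: partial denoising
        taus = torch.linspace(1.0, tau_v, video_steps + 1)
        for t0, t1 in zip(taus[:-1], taus[1:]):
            v, _ = video_model(z_past, z_fut, text_emb, t0)
            z_fut = z_fut + (t1 - t0) * v               # Euler step on video flow
    # line 4: one pass through the first k layers for conditioning features;
    # with tau_v = 1 this is the only video backbone call.
    _, h = video_model(z_past, z_fut, text_emb, torch.tensor(tau_v))
    a = torch.randn(action_shape)                       # action noise, A_t^1
    taus_a = torch.linspace(1.0, 0.0, action_steps + 1)
    for t0, t1 in zip(taus_a[:-1], taus_a[1:]):         # line 5: full action denoising
        a = a + (t1 - t0) * action_decoder(a, q_t, h, t0, torch.tensor(tau_v))
    return a                                            # line 6: action chunk A_t^0
```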
F. Training
Video-Action Model training proceeds in two distinct phases
operating on disjoint sets of parameters. The first stage focuses
on the video backbone. To align the generalist backbone
with the specific visual domain and dynamics of our robotic
tasks, we finetune it using Low-Rank Adapters (LoRA) [ 23]
on robotics video datasets. This adaptation step ensures the
model captures domain-specific semantics while preserving its
pretrained temporal reasoning capabilities.
The second stage focuses on learning the action decoder πθ
while keeping the video backbone frozen. We train the decoder
from scratch to regress the action flow field, conditioned on
video representations hτv extracted from the frozen backbone.
Crucially, to ensure robustness to varying noise levels during
inference, we sample independent flow times $\tau_v$ (for video)
and $\tau_a$ (for action) during each training iteration, as detailed
in Algorithm 2. We employ a logit-normal distribution for $\mathcal{T}_v$,
matching the video pretraining, and $\mathcal{T}_a(\tau_a) \propto \sqrt{\tau_a - 0.001}$ for
actions, following [3]. This decoupled training scheme renders
our approach significantly more sample-efficient and faster to
converge than comparable VLA baselines (see Sec. V-B).
Algorithm 2 Action Decoder Training($k, \mathcal{T}_v, \mathcal{T}_a$)
1: repeat
2:   $z^0_{\mathrm{past}}, z^0_{\mathrm{future}}, a_0, s_0, l \sim p_0(z^0_{\mathrm{past}}, z^0_{\mathrm{future}}, a_0, s_0, l)$
3:   $\tau_v \sim \mathcal{T}_v(\tau_v)$; $\tau_a \sim \mathcal{T}_a(\tau_a)$
4:   $\varepsilon_v, \varepsilon_a \sim \mathcal{N}(0, I)$
5:   $z^{\tau_v}_{\mathrm{future}} \leftarrow (1-\tau_v)\, z^0_{\mathrm{future}} + \tau_v\, \varepsilon_v$
6:   $a_{\tau_a} \leftarrow (1-\tau_a)\, a_0 + \tau_a\, \varepsilon_a$
7:   $h^{\tau_v} \leftarrow v^{(k)}_\phi(z^0_{\mathrm{past}}, z^{\tau_v}_{\mathrm{future}}, l, \tau_v)$
8:   Take gradient descent step on $\nabla_\theta \|\pi_\theta(a_{\tau_a}, s_0, h^{\tau_v}, \tau_a, \tau_v) - u_{\tau_a}(a_{\tau_a} \mid a_0)\|^2$
9: until converged
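Below is a sketch of one training iteration under the same assumed interfaces as before. The flow-time samplers implement a logit-normal $\mathcal{T}_v$ and one reading of the stated $\mathcal{T}_a$ density via its inverse CDF; the backbone call is wrapped in `torch.no_grad()` since only the decoder is trained.

```python
import torch

def sample_tau_v(b):
    """Logit-normal flow times for video, matching the video pretraining."""
    return torch.sigmoid(torch.randn(b))

def sample_tau_a(b):
    """Density proportional to sqrt(tau - 0.001) on [0.001, 1], drawn via
    the inverse CDF: tau = 0.001 + 0.999 * u^(2/3)."""
    return 0.001 + 0.999 * torch.rand(b) ** (2.0 / 3.0)

def training_step(video_model, action_decoder, optimizer, batch):
    """One Algorithm 2 iteration; tensor layouts (B, T, D) and module
    signatures are assumptions. Only the action decoder gets gradients."""
    z_past, z_fut0, a0, s0, text_emb = batch            # line 2: clean sample
    b = a0.shape[0]
    tau_v, tau_a = sample_tau_v(b), sample_tau_a(b)     # line 3: independent times
    eps_v, eps_a = torch.randn_like(z_fut0), torch.randn_like(a0)  # line 4
    tv, ta = tau_v.view(b, 1, 1), tau_a.view(b, 1, 1)
    z_fut = (1 - tv) * z_fut0 + tv * eps_v              # line 5: noisy video latents
    a_tau = (1 - ta) * a0 + ta * eps_a                  # line 6: noisy actions
    with torch.no_grad():                               # line 7: frozen backbone
        _, h = video_model(z_past, z_fut, text_emb, tau_v)
    target = eps_a - a0                                 # u_{tau_a}(a_tau | a_0)
    loss = ((action_decoder(a_tau, s0, h, tau_a, tau_v) - target) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()   # line 8
    return loss.detach()
```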
V. EXPERIMENTS
Our experiments provide an empirical analysis of mimic-
video, evaluating the efficacy of leveraging a video backbone
for robotic control across several axes:
1) Can mimic-video effectively control multiple embodi-
ments?
2) Does conditioning on a generative video backbone yield
superior sample efficiency and faster convergence for
action decoder training, compared to conditioning on a
VLM backbone?
3) Is fine-grained video reconstruction necessary for effec-
tive policy learning?
a) Evaluation setups: We evaluate mimic-video's ca-
pabilities across the simulated benchmarks SIMPLER [ 28]
and LIBERO [ 32], as well as through real-world dexterous
manipulation experiments using humanoid hands.
SIMPLER serves as a high-fidelity proxy for real-world
performance, evaluating policies trained on the BridgeDataV2
[50] dataset, collected on a WidowX robot embodiment. By employing
system identification and visual matching, it specifically tests
the policy’s ability to generalize to unseen tasks under realistic
visual domain shifts.
The LIBERO benchmark evaluates precision and multi-task
capacity with a simulated tabletop Panda robot. We focus on
the LIBERO-Goal, -Object, and -Spatial suites, each providing
50 expert demonstrations per task across 10 distinct tasks. This
widely used benchmark assesses the model’s ability to learn
precise multi-task manipulation behaviors.
Real-World Dexterous Bimanual Manipulation. To vali-
date mimic-video on high-dimensional, contact-rich tasks, we
utilize a bimanual setup equipped with two 16-DoF “mimic”
hands [36] mounted on Panda arms. The observation space
includes a global workspace view, four wrist cameras, and
full proprioception. We evaluate on two long-horizon tasks:
Package Sorting (pick, handover, place) and Tape Stowing
(pick, stow, move box). Critically, while the video backbone is
finetuned on a broader 200-hour corpus, the respective action
decoders are trained on extremely scarce task-specific data: just
1h 33m (512 episodes) for sorting and 2h 14m (480 episodes)
for stowing.

Fig. 4: We train and evaluate mimic-video on a real-world bimanual robot setup with Franka Emika Panda robot arms and
mimic 16-DoF dexterous humanoid hands. We execute a real-world evaluation on the bimanual setup with two tasks: package
sorting and pick-and-place of a measuring tape into a box. For each action chunk, mimic-video generates a latent video plan
($\tau_v = 1$) and then executes the actions on the real robot. We further fully denoise the predicted video for this visualization.
b) Comparisons: We compare mimic-video's ability to
control multiple robots against several state-of-the-art baselines.
π0.5-style VLA (Knowledge-Insulating). To isolate the
effect of video pretraining versus standard vision-language
pretraining, we construct a VLM-based baseline following
a similar architecture to π0.5 [25, 13]. We employ the 3B-
parameter PaliGemma [2] as the backbone, coupled with an
action decoder identical to that of mimic-video. Mirroring
mimic-video, the action decoder cross-attends to a particular
layer of the backbone for which we empirically find the
optimal choice. This equivalent action decoder design, together
with training on perfectly equivalent datasets, ensures that
performance differences in our comparisons stem strictly from
the quality of the conditioning representations (video vs. image-
text). Adhering to the “Knowledge-Insulation” protocol [ 13],
we employ a two-stage training process: the autoregressive
backbone is trained via Next Token Prediction (discretizing
and compressing actions with FAST [ 42]), while the action
decoder is separately trained via flow matching. Notably,
while the original π0.5 leverages massive web and cross-
embodiment pretraining, our baseline (denoted "π0.5-style")
trains the backbone from the original VLM checkpoint and the
decoder from scratch. This standardizes the data regime across
methods, allowing for a fair evaluation of the backbone prior.
DiT-Block Policy. For real-world bimanual evaluations,
we compare against a strong single-task baseline: a DiT-
Block Policy [ 10] following the action representation recipe
from Nava et al. [36]. This model features a ViT-S DINO
backbone [ 12, 5] (with separate encoders for each camera
view) feeding into an 8-block, 8-head transformer diffusion
policy. With approximately 155M parameters (in the multi-view
setting), this architecture represents a competitive standard
for imitation learning in low-data regimes, making it an
ideal reference point for the utilized bimanual dexterous
teleoperation datasets.
State-of-the-Art Published Baselines. We additionally in-
clude results reported for state-of-the-art competing approaches,
namely Octo [ 49], ThinkAct [ 24], FLOWER [ 46], OpenVLA
[26] and OpenVLA-OFT [27].
A. Direct Evaluation across Diverse Robot Platforms
a) SIMPLER-Bridge: We first evaluate mimic-video's
cross-task generalization capabilities on the SIMPLER-Bridge
benchmark, with full results detailed in Table I. Our model
TABLE I: Benchmark scores on SIMPLER-Bridge. The training regimes denote the usage of robot action data: "pretrained"
(large-scale external), "finetuned" (external → target), and "scratch" (target only). Note that all models leverage image or video
pretraining. Bold: best overall; underline: best "scratch" score. We also report mimic-video with task-optimized $\tau_v$.
Inputs: third-person image, language instruction, robot proprioceptive state (optional)

Model                                      | Put Carrot on Plate | Put Spoon on Towel | Stack Blocks | Eggplant | Average SR (%)
OpenVLA (finetuned) [26]                   | 4.2                 | 8.3                | 0.0          | 45.8     | 14.6
Octo (finetuned) [49]                      | 8.3                 | 12.5               | 0.0          | 43.1     | 16.0
ThinkAct (pretrained) [24]                 | 37.5                | 58.3               | 8.7          | 70.8     | 43.8
FLOWER (finetuned) [46]                    | 13.0                | 71.0               | 8.0          | 88.0     | 45.0
π0.5-style VLA (scratch)                   | 25.0                | 29.2               | 20.8         | 66.7     | 35.4
mimic-video (scratch)                      | 37.5                | 37.5               | 12.5         | 100.0    | 46.9
mimic-video (scratch, per-task τv-tuning)  | 54.2                | 41.7               | 29.2         | 100.0    | 56.3
achieves the strongest average success rate across all four
tasks, matching or surpassing the performance of state-of-the-
art baselines, including our π0.5-style VLA comparison. This
strong performance validates that conditioning on the generative
video prior yields more robust policy representations than
those derived from vision-language-action (VLA) pretraining
alone. Additionally, leveraging the partial denoising strategy, we
demonstrate a novel form of inference-time policy optimization:
by adjusting the flow parameter τv, the fixed trained model
can be specialized to individual task dynamics, achieving
further performance gains at the cost of modest increases
in computation.
TABLE II: Benchmark scores on LIBERO. "finetuned",
"scratch", bold, and underline are defined as in Tab. I.
Inputs: third-person image, language instruction, proprioception (optional)

Model                          | Spatial (%) | Object (%) | Goal (%) | Avg (%)
Diffusion Policy (scratch) [8] | 78.3        | 92.5       | 68.3     | 79.7
Octo (finetuned) [49]          | 78.9        | 85.7       | 84.6     | 83.1
DiT Policy (finetuned) [10]    | 84.2        | 96.3       | 85.4     | 88.6
OpenVLA (finetuned) [26]       | 84.7        | 88.4       | 79.2     | 84.1
OpenVLA-OFT (finetuned) [27]   | 96.2        | 98.3       | 96.2     | 96.9
π0.5-style VLA (scratch)       | 79.2        | 94.0       | 84.4     | 85.9
mimic-video (scratch)          | 94.2        | 96.8       | 90.6     | 93.9
b) LIBERO: We evaluate mimic-video's multi-task manip-
ulation capabilities on the LIBERO benchmark. Despite being
trained from scratch on task-specific action data, mimic-video
outperforms the majority of state-of-the-art methods finetuned
from generalist models (see Table II). Notably, mimic-video
achieves significantly higher success rates than the comparable
π0.5-style VLA baseline, indicating that the generative video
prior facilitates more robust and efficient policy learning than
the corresponding vision-language pretrained representations.
c) Real-World Dexterous Bimanual System: To validate
mimic-video on high-dimensional, contact-rich tasks under
real-world data scarcity, we benchmark it against single-
task DiT-Block Policies on a bimanual setup comprising two
Franka arms equipped with dexterous humanoid hands. This
setup presents a significant challenge due to heavy occlusions,
particularly during grasping, where wrist camera observations
play a critical role in guiding robot policies. This necessity
is reflected by the performance gap between the two DiT-
Block Policy variants (workspace-only vs. multi-view) shown in
Table III. Remarkably, mimic-video significantly surpasses the
performance of both baselines, despite only being conditioned
on the single workspace camera view. This result confirms that
the predictive capacity of the generative video prior allows
mimic-video to effectively bridge the visual uncertainty caused
by occlusion, leading to robust policies learned from minimal
task-specific data. Fig. 4 illustrates the real world experiment.
TABLE III: Benchmark scores on real-world bimanual dexter-
ous manipulation on the mimic system.

Model                                 | Packing | Package handover
DiT-Block Policy [10]                 | 11.0    | 30.0
DiT-Block Policy [10] (+ wrist cams)  | 42.6    | 74.1
mimic-video                           | 72.0    | 93.0
B. Data Efficiency and Convergence Speed
We investigate the data efficiency of decoding actions
from video model representations compared to the VLM
representations by training mimic-video and π0.5-style VLA
action decoders on differently-sized subsets of the LIBERO-
Goal, LIBERO-Spatial, and LIBERO-Object task suites. The
result, shown in Fig. 5, demonstrates a remarkable order-of-
magnitude increase in sample efficiency when conditioning
on the video prior. Specifically, mimic-video's action decoder
reaches the maximum success rate achieved by the VLM-
conditioned decoder while requiring only 10% of the training
data. Decreasing the dataset size to only one episode per task
(a 98% reduction in action data) still yields a 77% average
success rate, making mimic-video trained on 2% of the action
data competitive with our Diffusion Policy baseline.

Beyond sample efficiency, Fig. 6 shows that the mimic-
video action decoder converges significantly faster and to a
higher asymptotic success rate than the π0.5-style VLA decoder.
Notably, this advantage persists despite the VLA baseline
having been exposed to task-specific action data during FAST-
pretraining.

Fig. 5: Sample efficiency for action decoder training on
LIBERO: mimic-video against the π0.5-style VLA baseline.

Fig. 6: Convergence speed for action decoder training. Both
decoders are trained with a batch size of 128 (optimal for
π0.5-style VLA) and their respective optimal learning rate.
C. Trade-offs between Video Fidelity and Action Performance
mimic-video couples two separate flow matching models
for video and actions, respectively. A key design element of
our approach is the ability to control the video generation
process via an inference-time hyperparameter: the video flow
time $\tau_v \in [0,1]$. This parameter dictates the extent to which
future video latents are denoised during action sampling. To
investigate the necessity of fine-grained video reconstruction for
effective policy learning, we first note the intuitive hypothesis:
a more resolved, higher-fidelity video signal should correlate
with better policy performance. In order to study this question,
we sweep τv across the SIMPLER-Bridge environments and
visualize the resulting success rates in Fig. 7.
Counterintuitively, we find that the best autonomous policy
performance in our SIMPLER experiments is achieved at the
highest flow time τv = 1. Theoretically, as τv progresses
from 1 (pure noise) to 0 (full reconstruction), the underlying
video signal grows, and the mutual information $I(z^{\tau_v}_{\mathrm{future}}; A^0)$
between the future video latent and future actions increases.
However, consistent with the case study in Sec. III, we
hypothesize that imperfect video generation introduces artifacts.
Consequently, fully denoised video latents may diverge from the
training distribution, presenting out-of-distribution conditioning
to the action decoder. To isolate the effect of these generation
errors, we perform an additional sweep of $\tau_v$ where we
condition the action decoder on "noisy" ground-truth video
latents $z^{\tau_v}_{\mathrm{future}}$, computed via Eq. 1. We report the resulting
action reconstruction MSE on a held-out validation set of
BridgeDataV2 in Fig. 8. We observe that the lowest action
reconstruction error is achieved at an intermediate flow time
of $\tau_v \approx 0.4$, corresponding to the perfect rollout performance
observed in our Case Study (Sec. III).

Fig. 7: Policy success rate across the SIMPLER-Bridge
environments (carrot, cubes, spoon, eggplant) vs. video flow
time ($\tau_v$, logit-scaled). Performance peaks at an intermediate
noise level, confirming that high-fidelity video reconstruction
is not required for performant robot policies.
Interestingly, action prediction error increases sharply as we
move from this optimum towards $\tau_v = 0$ (full reconstruction).
We attribute this to the nature of the conditioning signal: while
the video latents themselves contain more information at lower
noise, the intermediate video model representations (from
which the action decoder reconstructs actions) may exhibit
distinct, non-trivial behavior. We provide a detailed discussion
of these mechanisms in Appendix E. This observation yields
a significant practical advantage: operating at $\tau_v = 1$ requires
only a single forward pass of the video backbone to generate
conditioning features, resulting in both the highest average
performance and the fastest inference speed.

Fig. 8: Action reconstruction MSE of a decoder conditioned on
"noisy" ground-truth video latents at varying flow times (logit-
scaled) on BridgeDataV2. Reconstruction is best at intermediate
flow times and increases towards clean and pure noise latents.
VI. DISCUSSION AND FUTURE WORK
In this work, we introduce mimic-video, a new class of
Video-Action Model (VAM) that grounds robotic policies in
a pretrained video model. By leveraging the physical priors
embedded in internet-scale video, mimic-video achieves an
order-of-magnitude improvement in sample efficiency and
significantly faster convergence compared to standard VLA
baselines. These results strongly suggest that representations
learned from large-scale generative video pretraining provide a
significantly more robust signal for policy learning than those
induced by vision-language-action pretraining. To achieve
this, our approach operates by first partially generating a
plausible video of a task’s successful execution. We find
that conditioning on these partially-denoised plans is critical,
yielding a dual benefit: it mitigates the distribution shift between
model predictions and the ground-truth data used for training,
while simultaneously accelerating inference by significantly
reducing the computational cost of video generation.
While mimic-video achieves strong performance across both
simulated and real-world evaluations, we find that the current
model still has several shortcomings. First, we rely on a
single-view video backbone, which restricts our policies to
a fixed, single workspace view. Exploring a wider range of
video architectures, particularly natively multi-view models,
would likely enhance spatial reasoning and occlusion robustness.
Second, we have not yet applied the VAM recipe to train
a unified, large-scale, cross-embodiment model, a step we
believe is necessary to unlock the full generalization capabilities
of video foundation models. Finally, our current real-world
experiments are limited to a focused set of tasks; scaling this
approach to a broader diversity of manipulation behaviors
remains a key objective for future work.
ACKNOWLEDGMENTS
This work was supported under project ID #36 as part of the
Swiss AI Initiative, through a grant from the ETH Domain and
computational resources provided by the Swiss National Super-
computing Centre (CSCS) under the Alps infrastructure. We
thank mimic robotics for providing experimental infrastructure,
real-world robot platforms and additional compute resources.
Primary work by the lead authors was performed during their
internships at mimic robotics, with continued development
supported during their internships at Microsoft.
We thank Benjamin Estermann, Stefanos Charalambous, Erik
Bauer, German Rodriguez, Sigmund Hennum Høeg, Irvin Totic
and Benedict Wüest for their help with the project.
REFERENCES
[1] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido,
Russell Howes, Mojtaba Komeili, Matthew Muckley,
Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem
Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Fran-
cois Robert Hogan, Daniel Dugas, Piotr Bojanowski,
Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc
Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma,
Sarath Chandar, Franziska Meier, Yann LeCun, Michael
Rabbat, and Nicolas Ballas. V-JEPA 2: Self-Supervised
Video Models Enable Understanding, Prediction and
Planning, June 2025. URL http://arxiv.org/abs/2506.09985.
arXiv:2506.09985 [cs].
[2] Lucas Beyer, Andreas Steiner, André Susano Pinto,
Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim
Neumann, Ibrahim Alabdulmohsin, Michael Tschannen,
Emanuele Bugliarello, Thomas Unterthiner, Daniel Key-
sers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey
Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong,
Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko
Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender,
Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi
Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut,
Jeremiah Harmsen, and Xiaohua Zhai. Paligemma: A
versatile 3b vlm for transfer, 2024. URL https://arxiv.org/
abs/2407.07726.
[3] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail,
Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom,
Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim
Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mo-
hith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang
Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan
Wang, and Ury Zhilinsky. π0: A Vision-Language-Action
Flow Model for General Robot Control, November 2024.
URL http://arxiv.org/abs/2410.24164. arXiv:2410.24164
[cs].
[4] Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao,
Coline Devin, Alex X Lee, Maria Bauzá, Todor Davchev,
Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat:
A self-improving generalist agent for robotic manipulation.
arXiv preprint arXiv:2306.11706, 2023.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé
Jégou, Julien Mairal, Piotr Bojanowski, and Armand
Joulin. Emerging Properties in Self-Supervised Vision
Transformers. arXiv:2104.14294 [cs], May 2021. URL
http://arxiv.org/abs/2104.14294. arXiv: 2104.14294.
[6] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and
David Duvenaud. Neural ordinary differential equations,
2019. URL https://arxiv.org/abs/1806.07366.
[7] William Chen, Suneel Belkhale, Suvir Mirchandani, Oier
Mees, Danny Driess, Karl Pertsch, and Sergey Levine.
Training strategies for efficient embodied reasoning. In
Conference on Robot Learning, 2025.
[8] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau,
Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran
Song. Diffusion Policy: Visuomotor Policy Learning via
Action Diffusion, March 2024. URL http://arxiv.org/abs/
2303.04137. arXiv:2303.04137 [cs].
[9] Open X-Embodiment Collaboration, Abby O'Neill, Ab-
dul Rehman, Abhinav Gupta, Abhiram Maddukuri, Ab-
hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn
Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain,
Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan,
Alexander Khazatsky, Anant Rai, Anchit Gupta, An-
drew Wang, Andrey Kolobov, Anikait Singh, Animesh
Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan,
Antonin Raffin, Archit Sharma, Arefeh Yavary, Arhan
Jain, Ashwin Balakrishna, Ayzaan Wahid, Ben Burgess-
Limerick, Beomjoon Kim, Bernhard Schölkopf, Blake
Wulfe, Brian Ichter, Cewu Lu, Charles Xu, Charlotte
Le, Chelsea Finn, Chen Wang, Chenfeng Xu, Cheng
Chi, Chenguang Huang, Christine Chan, Christopher
Agia, Chuer Pan, Chuyuan Fu, Coline Devin, Dan-
fei Xu, Daniel Morton, Danny Driess, Daphne Chen,
Deepak Pathak, Dhruv Shah, Dieter Büchler, Dinesh
Jayaraman, Dmitry Kalashnikov, Dorsa Sadigh, Edward
Johns, Ethan Foster, Fangchen Liu, Federico Ceola, Fei
Xia, Feiyu Zhao, Felipe Vieira Frujeri, Freek Stulp,
Gaoyue Zhou, Gaurav S. Sukhatme, Gautam Salhotra,
Ge Yan, Gilbert Feng, Giulio Schiavi, Glen Berseth,
Gregory Kahn, Guangwen Yang, Guanzhi Wang, Hao
Su, Hao-Shu Fang, Haochen Shi, Henghui Bao, Heni Ben
Amor, Henrik I. Christensen, Hiroki Furuta, Homanga
Bharadhwaj, Homer Walke, Hongjie Fang, Huy Ha, Igor
Mordatch, Ilija Radosavovic, Isabel Leal, Jacky Liang,
Jad Abou-Chakra, Jaehyung Kim, Jaimyn Drake, Jan
Peters, Jan Schneider, Jasmine Hsu, Jay Vakil, Jeannette
Bohg, Jeffrey Bingham, Jeffrey Wu, Jensen Gao, Jiaheng
Hu, Jiajun Wu, Jialin Wu, Jiankai Sun, Jianlan Luo,
Jiayuan Gu, Jie Tan, Jihoon Oh, Jimmy Wu, Jingpei Lu,
Jingyun Yang, Jitendra Malik, João Silvério, Joey Hejna,
Jonathan Booher, Jonathan Tompson, Jonathan Yang, Jordi
Salvador, Joseph J. Lim, Junhyek Han, Kaiyuan Wang,
Kanishka Rao, Karl Pertsch, Karol Hausman, Keegan
Go, Keerthana Gopalakrishnan, Ken Goldberg, Kendra
Byrne, Kenneth Oslund, Kento Kawaharazuka, Kevin
Black, Kevin Lin, Kevin Zhang, Kiana Ehsani, Kiran
Lekkala, Kirsty Ellis, Krishan Rana, Krishnan Srinivasan,
Kuan Fang, Kunal Pratap Singh, Kuo-Hao Zeng, Kyle
Hatch, Kyle Hsu, Laurent Itti, Lawrence Yunliang Chen,
Lerrel Pinto, Li Fei-Fei, Liam Tan, Linxi "Jim" Fan,
Lionel Ott, Lisa Lee, Luca Weihs, Magnum Chen, Marion
Lepert, Marius Memmel, Masayoshi Tomizuka, Masha
Itkina, Mateo Guaman Castro, Max Spero, Maximilian
Du, Michael Ahn, Michael C. Yip, Mingtong Zhang,
Mingyu Ding, Minho Heo, Mohan Kumar Srirama, Mohit
Sharma, Moo Jin Kim, Naoaki Kanazawa, Nicklas Hansen,
Nicolas Heess, Nikhil J. Joshi, Niko Suenderhauf, Ning
Liu, Norman Di Palo, Nur Muhammad Mahi Shafiullah,
Oier Mees, Oliver Kroemer, Osbert Bastani, Pannag R.
Sanketi, Patrick "Tree" Miller, Patrick Yin, Paul Wohlhart,
Peng Xu, Peter David Fagan, Peter Mitrano, Pierre
Sermanet, Pieter Abbeel, Priya Sundaresan, Qiuyu Chen,
Quan Vuong, Rafael Rafailov, Ran Tian, Ria Doshi,
Roberto Martín-Martín, Rohan Baijal, Rosario Scalise,
Rose Hendrix, Roy Lin, Runjia Qian, Ruohan Zhang,
Russell Mendonca, Rutav Shah, Ryan Hoque, Ryan Julian,
Samuel Bustamante, Sean Kirmani, Sergey Levine, Shan
Lin, Sherry Moore, Shikhar Bahl, Shivin Dass, Shubham
Sonawani, Shubham Tulsiani, Shuran Song, Sichun Xu,
Siddhant Haldar, Siddharth Karamcheti, Simeon Ade-
bola, Simon Guist, Soroush Nasiriany, Stefan Schaal,
Stefan Welker, Stephen Tian, Subramanian Ramamoorthy,
Sudeep Dasari, Suneel Belkhale, Sungjae Park, Suraj
Nair, Suvir Mirchandani, Takayuki Osa, Tanmay Gupta,
Tatsuya Harada, Tatsuya Matsushima, Ted Xiao, Thomas
Kollar, Tianhe Yu, Tianli Ding, Todor Davchev, Tony Z.
Zhao, Travis Armstrong, Trevor Darrell, Trinity Chung,
Vidhi Jain, Vikash Kumar, Vincent Vanhoucke, Wei Zhan,
Wenxuan Zhou, Wolfram Burgard, Xi Chen, Xiangyu
Chen, Xiaolong Wang, Xinghao Zhu, Xinyang Geng,
Xiyuan Liu, Xu Liangwei, Xuanlin Li, Yansong Pang,
Yao Lu, Yecheng Jason Ma, Yejin Kim, Yevgen Chebotar,
Yifan Zhou, Yifeng Zhu, Yilin Wu, Ying Xu, Yixuan
Wang, Yonatan Bisk, Yongqiang Dou, Yoonyoung Cho,
Youngwoon Lee, Yuchen Cui, Yue Cao, Yueh-Hua Wu,
Yujin Tang, Yuke Zhu, Yunchu Zhang, Yunfan Jiang,
Yunshuang Li, Yunzhu Li, Yusuke Iwasawa, Yutaka
Matsuo, Zehan Ma, Zhuo Xu, Zichen Jeff Cui, Zichen
Zhang, Zipeng Fu, and Zipeng Lin. Open X-Embodiment:
Robotic Learning Datasets and RT-X Models, June 2024.
URL http://arxiv.org/abs/2310.08864. arXiv:2310.08864
[cs].
[10] Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar
Srirama, and Sergey Levine. The ingredients for robotic
diffusion transformers. In Proceedings of the IEEE
International Conference on Robotics and Automation
(ICRA), Atlanta, USA, 2025.
[11] Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and
Sergey Levine. Scaling Cross-Embodied Learning: One
Policy for Manipulation, Navigation, Locomotion and
Aviation, August 2024. URL http://arxiv.org/abs/2408.
11812. arXiv:2408.11812 [cs].
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold,
Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An
Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale, June 2021. URL http://arxiv.org/
abs/2010.11929. arXiv:2010.11929 [cs].
[13] Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili
Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer
Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge
insulating vision-language-action models: Train fast, run
fast, generalize better.arXiv preprint arXiv:2505.23705,
2025.
[14] Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir
Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and
Pieter Abbeel. Learning universal policies via text-guided
video generation, 2023. URL https://arxiv.org/abs/2302.
00111.
[15] Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan
Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter
Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy
Zeng, and Jonathan Tompson. Video Language Planning,
October 2023. URL http://arxiv.org/abs/2310.10625.
arXiv:2310.10625 [cs].
[16] Chelsea Finn and Sergey Levine. Deep Visual Foresight
for Planning Robot Motion, March 2017. URL http:
//arxiv.org/abs/1610.00696. arXiv:1610.00696 [cs].
[17] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsu-
pervised learning for physical interaction through video
prediction.Advances in neural information processing
systems, 29, 2016.
[18] Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine,
and Jitendra Malik. Learning Visual Predictive Models
of Physics for Playing Billiards, January 2016. URL
http://arxiv.org/abs/1511.07404. arXiv:1511.07404 [cs].
[19] Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang
Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu
Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang,
and Xinggang Wang. Rad: Training an end-to-end driving
policy via large-scale 3dgs-based reinforcement learning,
2025. URL https://arxiv.org/abs/2502.13144.
[20] Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and
Chelsea Finn. Ctrl-world: A controllable generative world
model for robot manipulation, 2025. URL https://arxiv.
org/abs/2510.10125.
[21] Kyle Beltran Hatch, Ashwin Balakrishna, Oier Mees,
Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina,
Benjamin Eysenbach, Sergey Levine, Thomas Kollar, and
Benjamin Burchfiel. Ghil-glue: Hierarchical control with
filtered subgoal images. InProceedings of the IEEE
International Conference on Robotics and Automation
(ICRA), Atlanta, USA, 2025.
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising
Diffusion Probabilistic Models. In Advances in
Neural Information Processing Systems, volume 33,
pages 6840–6851. Curran Associates, Inc., 2020.
URL https://proceedings.neurips.cc/paper/2020/hash/
4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.
[23] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu
Chen. Lora: Low-rank adaptation of large language
models, 2021. URL https://arxiv.org/abs/2106.09685.
[24] Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-
Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-
language-action reasoning via reinforced visual latent
planning, 2025. URL https://arxiv.org/abs/2507.16815.
[25] Physical Intelligence, Kevin Black, Noah Brown, James
Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail,
Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y.
Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman,
Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming
Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell,
Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z.
Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Sprin-
genberg, Kyle Stachowicz, James Tanner, Quan Vuong,
Homer Walke, Anna Walling, Haohuan Wang, Lili Yu,
and Ury Zhilinsky. π0.5: A Vision-Language-Action
Model with Open-World Generalization, April 2025. URL
http://arxiv.org/abs/2504.16054. arXiv:2504.16054 [cs].
[26] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted
Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov,
Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong,
Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa
Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn.
OpenVLA: An Open-Source Vision-Language-Action
Model, September 2024. URL http://arxiv.org/abs/2406.
09246. arXiv:2406.09246 [cs].
[27] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-
tuning vision-language-action models: Optimizing speed
and success, 2025. URL https://arxiv.org/abs/2502.19645.
[28] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier
Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat,
Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu,
Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao.
Evaluating real-world robot manipulation policies in
simulation, 2024. URL https://arxiv.org/abs/2405.05941.
[29] Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi
Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song,
and Carl Vondrick. Dreamitate: Real-World Visuomotor
Policy Learning via Video Generation, June 2024. URL
http://arxiv.org/abs/2406.16862. arXiv:2406.16862 [cs].
[30] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi
Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick.
Video Generators are Robot Policies, August 2025. URL
http://arxiv.org/abs/2508.00795. arXiv:2508.00795 [cs].
[31] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu,
Maximilian Nickel, and Matt Le. Flow Matching for
Generative Modeling, February 2023. URL http://arxiv.
org/abs/2210.02747. arXiv:2210.02747 [cs].
[32] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang
Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking
knowledge transfer for lifelong robot learning, 2023. URL
https://arxiv.org/abs/2306.03310.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight
decay regularization, 2019. URL https://arxiv.org/abs/
1711.05101.
[34] Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar,
Jonathan Tompson, Sergey Levine, and Pierre Sermanet.
Learning latent plans from play. In Conference on Robot
Learning, pages 1113–1132. PMLR, 2020.
[35] Oier Mees, Lukas Hermann, and Wolfram Burgard. What
matters in language conditioned robotic imitation learning
over unstructured data. IEEE Robotics and Automation
Letters (RA-L), 7(4):11205–11212, 2022.
[36] Elvis Nava, Victoriano Montesinos, Erik Bauer, Benedek
Forrai, Jonas Pai, Stefan Weirich, Stephan-Daniel Gravert,
Philipp Wand, Stephan Polinski, Benjamin F. Grewe, and
Robert K. Katzschmann. mimic-one: a Scalable Model
Recipe for General Purpose Robot Dexterity, June 2025.
URL http://arxiv.org/abs/2506.11916. arXiv:2506.11916
[cs].
NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala,
Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat-
topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel
Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco
Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge,
Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang,
Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook
Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-
Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-
Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice
Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan
Mousavian, Seungjun Nah, Sriharsha Niverty, David Page,
Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao,
Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth
Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Ste-
faniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-
Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang,
Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei,
Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen,
Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang,
Yuxuan Zhang, Qingqing Zhao, and Artur Zolkowski.
Cosmos world foundation model platform for physical ai,
2025. URL https://arxiv.org/abs/2501.03575.
[38] NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh
Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi
Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopad-
hyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng,
Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi
Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao
Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta,
Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun
Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz,
Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-
Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling,
Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma,
Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang,
Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza
Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren,
Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz
Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen,
Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, An-
drew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang,
Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang,
Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui
Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng,
Andrew Zhu, and Yuke Zhu. World Simulation with
Video Foundation Models for Physical AI, October 2025.
URL http://arxiv.org/abs/2511.00062. arXiv:2511.00062
[cs].
[39] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis,
and Satinder Singh. Action-Conditional Video Prediction
using Deep Networks in Atari Games, December 2015.
URL http://arxiv.org/abs/1507.08750. arXiv:1507.08750
[cs].
[40] OpenAI. Video generation models as world simu-
lators, March 2024. URL https://openai.com/index/
video-generation-models-as-world-simulators/.
[41] William Peebles and Saining Xie. Scalable diffusion
models with transformers, 2023. URL https://arxiv.org/
abs/2212.09748.
[42] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny
Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn,
and Sergey Levine. Fast: Efficient action tokenization
for vision-language-action models. In Proceedings of
Robotics: Science and Systems, Los Angeles, USA, 2025.
[43] Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng
Yang. Strengthening Generative Robot Policies through
Predictive World Modeling, May 2025. URL http://arxiv.
org/abs/2502.00622. arXiv:2502.00622 [cs].
[44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J. Liu. Exploring the limits of transfer
learning with a unified text-to-text transformer, 2023.
URL https://arxiv.org/abs/1910.10683.
[45] Scott Reed, Konrad Zolna, Emilio Parisotto, Ser-
gio Gómez Colmenarejo, Alexander Novikov, Gabriel
Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay,
Jost Tobias Springenberg, Tom Eccles, Jake Bruce,
Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian
Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and
Nando de Freitas. A Generalist Agent. Transactions on
Machine Learning Research, August 2022. ISSN 2835-
8856. URL https://openreview.net/forum?id=1ikK0kHjvj.
[46] Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer
Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov.
Flower: Democratizing generalist robot policies with
efficient vision-language-action flow policies, 2025. URL
https://arxiv.org/abs/2509.04996.
[47] Stephane Ross, Geoffrey J. Gordon, and J. Andrew
Bagnell. A reduction of imitation learning and structured
prediction to no-regret online learning, 2011. URL
https://arxiv.org/abs/1011.0686.
[48] Gemini Robotics Team, Coline Devin, Yilun Du, De-
bidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas
Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar,
Andrew Marmon, Carolina Parada, Yulia Rubanova,
Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao,
Sherry Yang, Wenhao Yu, and Allan Zhou. Evaluating
gemini robotics policies in a veo world simulator, 2025.
URL https://arxiv.org/abs/2512.10675.
[49] Octo Model Team, Dibya Ghosh, Homer Walke, Karl
Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey
Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo,
You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi,
Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and
Sergey Levine. Octo: An Open-Source Generalist Robot
Policy, May 2024. URL http://arxiv.org/abs/2405.12213.
arXiv:2405.12213 [cs].
[50] Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim,
Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-
Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan
Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2:
A dataset for robot learning at scale, 2024. URL https:
//arxiv.org/abs/2308.12952.
[51] Manuel Watter, Jost Tobias Springenberg, Joschka
Boedecker, and Martin Riedmiller. Embed to Control: A
Locally Linear Latent Dynamics Model for Control from
Raw Images, November 2015. URL http://arxiv.org/abs/
1506.07365. arXiv:1506.07365 [cs].
[52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. Chain-of-thought prompting elicits reasoning in
large language models. Advances in neural information
processing systems, 35:24824–24837, 2022.
[53] Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixi-
ang Shane Gu, Nick Matarese, Kevin Swersky, Been
Kim, Priyank Jaini, and Robert Geirhos. Video models
are zero-shot learners and reasoners, September 2025.
URL http://arxiv.org/abs/2509.20328. arXiv:2509.20328
[cs].
[54] Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo,
Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan,
Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee,
Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon
Seo. Latent Action Pretraining from Videos, May 2025.
URL http://arxiv.org/abs/2410.11758. arXiv:2410.11758
[cs].
[55] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees,
Chelsea Finn, and Sergey Levine. Robotic Control via
Embodied Chain-of-Thought Reasoning, March 2025.
URL http://arxiv.org/abs/2407.08693. arXiv:2407.08693
[cs].
[56] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu,
Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli
Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu
Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi
Lin. CoT-VLA: Visual Chain-of-Thought Reasoning
for Vision-Language-Action Models, March 2025. URL
http://arxiv.org/abs/2503.22020. arXiv:2503.22020 [cs].
[57] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea
Finn. Learning Fine-Grained Bimanual Manipulation with
Low-Cost Hardware, April 2023. URL http://arxiv.org/
abs/2304.13705. arXiv:2304.13705 [cs].
[58] Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck,
Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia,
Zongyu Lin, Loic Magne, Avnish Narayan, You Liang
Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen
Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu,
and Linxi Fan. Flare: Robot learning with implicit world
modeling, 2025. URL https://arxiv.org/abs/2505.15659.
[59] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin
Burchfiel, Paarth Shah, and Abhishek Gupta. Unified
World Models: Coupling Video and Action Diffusion for
Pretraining on Large Robotic Datasets, May 2025. URL
http://arxiv.org/abs/2504.02792. arXiv:2504.02792 [cs].
[60] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted
Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker,
Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong
Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre
Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S.
Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor
Mordatch, Henryk Michalewski, Yao Lu, Sergey Levine,
Lisa Lee, Tsang-Wei Edward Lee, Isabel Leal, Yuheng
Kuang, Dmitry Kalashnikov, Ryan Julian, Nikhil J. Joshi,
Alex Irpan, Brian Ichter, Jasmine Hsu, Alexander Herzog,
Karol Hausman, Keerthana Gopalakrishnan, Chuyuan Fu,
Pete Florence, Chelsea Finn, Kumar Avinava Dubey,
Danny Driess, Tianli Ding, Krzysztof Marcin Choro-
manski, Xi Chen, Yevgen Chebotar, Justice Carbajal,
Noah Brown, Anthony Brohan, Montserrat Gonzalez
Arenas, and Kehang Han. RT-2: Vision-Language-Action
Models Transfer Web Knowledge to Robotic Control. In
Proceedings of The 7th Conference on Robot Learning,
pages 2165–2183. PMLR, December 2023. URL https:
//proceedings.mlr.press/v229/zitkovich23a.html. ISSN:
2640-3498.
CONTRIBUTIONS
Jonas Pai: Led project ideation, implementation, and evaluation. Contributed to tech report writing.
Liam Achenbach: Led baseline model development, training, and evaluation. Helped with dataset integration and tech report writing.
Victoriano Montesinos: Contributed the diffusion policy baseline implementation and training, as well as data collection for the bimanual mimic robot experiments.
Benedek Forrai: Oversaw development of the mimic bimanual system and contributed data collection for the bimanual mimic robot experiments.
Oier Mees: Supervised the project from its inception, mentored the lead authors during their research internships, guided the technical direction and experimental strategy, and led the writing and visualization of this tech report.
Elvis Nava: Supervised the project from early conception to implementation and evaluation, and contributed to project ideation. Supervised the lead authors throughout their research internships and oversaw the technical development of supporting infrastructure, robot systems, and compute resources. Contributed to manuscript writing, website and video editing, and data collection for the bimanual mimic robot experiments.
APPENDIX
A. Training Hyperparameters
We summarize mimic-video training hyperparameters for
each dataset in Tab. IV.
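As a rough sketch, the optimization settings in Tab. IV map onto standard PyTorch components as follows (shown for the LIBERO action-decoder column; the decoder module is a placeholder, and the assumption that the linear schedule decays fully to zero is ours):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, SequentialLR

# Placeholder decoder; hyperparameters from Tab. IV (LIBERO, action decoder training).
decoder = torch.nn.Linear(1024, 32)
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4, weight_decay=0.1)

# 1000 warmup steps, then a linear schedule over the remaining 49k of 50k steps.
warmup = LinearLR(optimizer, start_factor=1e-3, end_factor=1.0, total_iters=1000)
decay = LinearLR(optimizer, start_factor=1.0, end_factor=0.0, total_iters=49_000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[1000])

# Each training step: clip gradients at the Tab. IV threshold, then update.
# torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=10.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```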
B. Data Preprocessing
All orientations are expressed as 6-dimensional vectors corresponding to the top two rows of the rotation matrix representation. Images are extracted or rendered at a resolution of 480 × 640 px.
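For concreteness, a minimal sketch of this orientation encoding and a standard Gram-Schmidt decoding step (the helper names are ours):

```python
import numpy as np

def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
    """Encode a 3x3 rotation matrix as its top two rows, flattened to 6D."""
    return R[:2, :].reshape(6)

def sixd_to_rotmat(v: np.ndarray) -> np.ndarray:
    """Recover a valid rotation matrix via Gram-Schmidt orthonormalization."""
    a, b = v[:3], v[3:]
    r1 = a / np.linalg.norm(a)
    b = b - np.dot(r1, b) * r1      # remove the component of b along r1
    r2 = b / np.linalg.norm(b)
    r3 = np.cross(r1, r2)           # third row completes a right-handed basis
    return np.stack([r1, r2, r3])
```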
a) BridgeDataV2:
• Observation Space: Absolute end-effector pose and abso-
lute continuous gripper joint state.
• Action Space: Future end-effector pose (relative to the proprioceptive pose for the entire action chunk) and the continuous (though in practice mostly binary) gripper action.
We remove 3046 non-informative language labels, as well as the first state and null action of each episode.
b) LIBERO:
• Observation Space: Absolute end-effector pose and abso-
lute continuous gripper joint state.
• Action Space: End-effector pose action (relative to the proprioceptive pose for the entire action chunk) and binary gripper action.
We follow the preprocessing procedure of Kim et al. [27] and remove episodes whose replayed actions do not lead to a successful rollout.
c) mimic:
• Observation Space: Absolute end-effector poses, absolute continuous hand joint states, relative end-effector poses with respect to each other, and previous end-effector and hand actions.
• Action Space: End-effector pose actions (relative to the proprioceptive pose for the entire action chunk; a sketch of this computation follows below) and absolute hand joint actions.
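The chunk-relative pose actions above can be computed roughly as follows (a sketch with hypothetical helper names; poses are 4x4 homogeneous transforms and orientations use the 6D encoding from Appendix B):

```python
import numpy as np

def chunk_relative_actions(T_prop: np.ndarray, T_future: np.ndarray) -> np.ndarray:
    """Express future end-effector poses relative to the current
    proprioceptive pose, shared across the whole action chunk.

    T_prop:   (4, 4) current end-effector pose.
    T_future: (H, 4, 4) absolute target poses for a chunk of length H.
    Returns:  (H, 9) actions: 3D relative translation + 6D relative rotation.
    """
    T_inv = np.linalg.inv(T_prop)
    actions = []
    for T in T_future:
        T_rel = T_inv @ T                    # pose in the proprioceptive frame
        trans = T_rel[:3, 3]
        rot6d = T_rel[:2, :3].reshape(6)     # top two rows of the rotation
        actions.append(np.concatenate([trans, rot6d]))
    return np.stack(actions)
```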
C. Video-Action Model Learnings
• Video Model Source Layer k: We observe that intermediate layer k = 19 yields the strongest policy performance, with success rates dropping sharply toward the initial or final layers; a sketch of the feature extraction follows this list. We posit that, ideally, this choice should be learned.
• Video Observation Horizon Ho: We find that the longer horizon of 5 frames works better than conditioning on only the current observation (1 frame).
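A minimal sketch of tapping such intermediate features with a PyTorch forward hook (the backbone interface and its `blocks` attribute are hypothetical):

```python
import torch

def extract_layer_features(video_model: torch.nn.Module, layer_k: int,
                           inputs: dict) -> torch.Tensor:
    """Run the video backbone once and capture the hidden states
    emitted by transformer block `layer_k`."""
    captured = {}

    def hook(module, args, output):
        # Keep the main tensor; some blocks return (hidden, aux) tuples.
        captured["h"] = output[0] if isinstance(output, tuple) else output

    handle = video_model.blocks[layer_k].register_forward_hook(hook)
    try:
        with torch.no_grad():
            video_model(**inputs)
    finally:
        handle.remove()
    return captured["h"]
```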
D. π0.5-style VLA Learnings
We ablate various choices in the VLA training procedure
and architecture to enable a fair comparison to mimic-video.
• Architectural details: We find the highest SIMPLER-
Bridge success rates when cross-attending to layer 11 of
the FAST-pretrained VLM.
• Training results: For SIMPLER-Bridge, we find that training longer does not significantly improve success rates after 2-3 epochs of decoder training on a frozen FAST backbone trained to convergence. For LIBERO, we observe that continuing FAST pretraining slightly beyond the convergence point yields modest downstream gains during the subsequent decoder training stage.
E. Video Denoising Analysis
A key finding of our work is that the choice of the
video model’s cutoff flow time, τv, is a critical inference
hyperparameter. We empirically observe that stopping the video
generation process early and conditioning the action decoder
on a “noisy” visual plan yields substantially better performance
than allowing the video model to fully denoise its prediction.
We find two main reasons for this phenomenon:
a) Distribution Mismatch and Noise as Augmentation: The action decoder is trained by conditioning on the video model's representations of ground-truth future video. A fully denoised video generated by the video model at inference time may represent an incorrect action plan due to the video model's imperfections. Even if accurate, it is still likely to be subtly out-of-distribution compared to the ground-truth data seen during training. By intentionally leaving noise in the visual plan, we perform a kind of train- and test-time augmentation. This prevents the action decoder from relying on spurious ground-truth visual cues that may not be present in the video model's own generations. This is analogous to findings in goal-conditioned policies [21], where augmenting predicted future target images with simple transformations improved robustness and performance.
TABLE IV: Hyperparameters used during training of mimic-video.

                          Video finetuning                   Action decoder training
Hyperparameter            BridgeDataV2  LIBERO    mimic      BridgeDataV2  LIBERO  mimic
Learning Rate             1.778e-4      1.778e-4  1.778e-4   1e-4          1e-4    1e-4
Warmup Steps              1000          1000      1000       1000          1000    1000
Training Steps            70043         7k-8k     27300      14112         50k     26k
LR Scheduler              Constant      Constant  Constant   Linear        Linear  Linear
Weight Decay Factor       0.1           0.1       0.1        0.1           0.1     0.1
Gradient Clip Threshold   10.0          10.0      10.0       10.0          10.0    10.0
Batch Size                256           128       32         256           128     128
Optimizer                 AdamW [33] (all configurations)
The results shown in Fig. 2 support this hypothesis: conditioning the action decoder on ground-truth data at inference time yields perfect performance, so the less-than-perfect performance during regular inference must be attributed to the fully denoised video plans being imperfect or out of distribution. The results shown in Fig. 7 likewise illustrate that optimal benchmark performance occurs at video flow times close to τv = 1, where the "noise as augmentation" effect is strongest.
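Under this view, conditioning the decoder on ground-truth video during training amounts to noising the clean latents to the same flow time used at inference. A minimal sketch, assuming the linear interpolation path common in flow matching and the convention used here that τv = 1 is pure noise:

```python
import torch

def noise_to_flow_time(x_clean: torch.Tensor, tau_v: float) -> torch.Tensor:
    """Interpolate clean video latents toward Gaussian noise.

    Convention as in the text: tau_v = 1 is pure noise, tau_v = 0 is the
    clean sample; a linear flow-matching path is assumed.
    """
    eps = torch.randn_like(x_clean)
    return (1.0 - tau_v) * x_clean + tau_v * eps
```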
b) Information Content of Intermediate Representations: A second reason lies in the nature of the flow matching model's internal representations throughout the denoising process. At intermediate steps, the hidden states of the video model must encode rich information about scene dynamics and the transformations needed to reach the final, clean video. As the denoising process approaches τv = 0, however, the input is already very close to the target. To minimize the training loss, the video model layers at these final flow times are incentivized to learn a close-to-identity mapping, making minimal changes to the nearly-perfect input. Consequently, these final-step hidden states become less informative for downstream tasks. Cross-attending to the richer representations from earlier flow times τv provides the action decoder with a more useful conditioning signal for generating actions. Indeed, the results in Fig. 8 show that reconstruction error increases sharply as τv approaches 0, consistent with this hypothesis.
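Putting both effects together, inference stops the flow integration at τv and hands the intermediate activations to the action decoder. A sketch under an assumed interface where the video model returns both its velocity prediction and per-layer hidden states:

```python
import torch

def partial_denoise_features(video_model, x_noise: torch.Tensor,
                             layer_k: int, tau_v: float,
                             n_steps: int = 10) -> torch.Tensor:
    """Integrate the video flow from tau = 1 down to the cutoff tau_v with
    Euler steps and return the layer-k hidden states at the cutoff.

    Assumed interface: video_model(x, tau) -> (velocity, hiddens), where
    hiddens[layer_k] holds that layer's activations.
    """
    taus = torch.linspace(1.0, tau_v, n_steps + 1)
    x, hiddens = x_noise, None
    for t, t_next in zip(taus[:-1], taus[1:]):
        velocity, hiddens = video_model(x, t)
        x = x + (t_next - t) * velocity      # Euler step toward tau_v
    return hiddens[layer_k]                  # conditioning signal for the IDM
```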
