Egocentric Video Generation

E3C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

Qiao Gu · Lingni Ma · Adam W Harley · Richard Newcombe ·
Florian Shkurti · Julian Straub

Meta Reality Labs · University of Toronto

Abstract

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E3C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E3C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attendable tokens. Experiments on Nymeria show that E3C improves visual fidelity, camera-motion accuracy, object consistency, and ego & exo human control over strong baselines, while also enabling intuitive scene editing.

01

3D Environmental Memory

Rendered point-cloud memory keeps the world anchored across rapid head motion.

02

Ego Motion Control

3D body and wrist trajectories steer first-person hands and body motion.

03

Exo Human Control

Rendered skeletons guide people observed in the egocentric scene.

04

Editable Conditions

Objects, people, camera paths, and action-scene composition can be changed explicitly.

Method

A structured conditioning interface for egocentric video.

E3C turns text, 3D environmental memory, and ego-exo motion controls into aligned conditioning streams for a latent video diffusion transformer.

Overview of E3C. A text prompt, 3D ego skeleton, and feature-augmented 3D memory are encoded as complementary controls: text tokens, ego pose tokens, rendered condition videos, and rendered condition feature videos. These streams condition DiT denoising through text cross-attention, pose cross-attention, and a context adapter.

Structured Inputs

The model receives a text prompt, target ego motion, and a semi-dense 3D environmental memory with ego-exo pose controls, separating static scene structure from human dynamics.

View-Aligned Controls

The 3D memory and pose controls are rendered into the target camera viewpoints as a condition video and a condition feature video, aligning control signals with each frame.
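As a rough illustration of what rendering the memory into a target viewpoint involves, the sketch below projects feature-carrying 3D points through a pinhole camera and keeps the nearest point per pixel. It is a minimal sketch under stated assumptions, not the E3C renderer: the function names, the single-pixel splat, the pinhole model with a shared focal length, and the pure-Python z-buffer are all illustrative choices that the page does not specify.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Feature = Sequence[float]


def world_to_camera(point: Vec3, rotation: Sequence[Sequence[float]], translation: Vec3) -> Vec3:
    """Apply the rigid transform x_cam = R @ x_world + t (hypothetical helper)."""
    x, y, z = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def render_memory(
    points_world: Sequence[Vec3],
    features: Sequence[Feature],
    rotation: Sequence[Sequence[float]],
    translation: Vec3,
    focal: float,
    width: int,
    height: int,
) -> List[List[Optional[Feature]]]:
    """Render per-point features into one target view with a z-buffer.

    Pixels that no point lands on stay None, so the output is sparse,
    as expected from a semi-dense point cloud.
    """
    image: List[List[Optional[Feature]]] = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0  # assumed principal point: image center

    for point, feature in zip(points_world, features):
        x, y, z = world_to_camera(point, rotation, translation)
        if z <= 0.0:
            continue  # point is behind the camera
        u = math.floor(focal * x / z + cx)
        v = math.floor(focal * y / z + cy)
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z  # nearest point wins the pixel
            image[v][u] = feature
    return image
```

Calling `render_memory` once per target camera pose yields one sparse feature image per frame; stacking those images gives a frame-aligned condition stream in the spirit of the condition feature video described above.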

Pose-Aware Denoising

Text, rendered controls, and ego pose tokens enter the diffusion backbone through cross-attention and adapter pathways, guiding generation while preserving the pretrained prior.
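To make the pose cross-attention pathway more concrete, the sketch below lets latent video tokens attend over a set of ego pose tokens using plain scaled dot-product attention, then adds the result back as a residual. This is an assumption-laden simplification: the learned query, key, and value projections, multiple heads, and normalization layers of a real DiT block are omitted, and `cross_attend` and `add_pose_context` are hypothetical names rather than parts of E3C.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def add_pose_context(latents: List[Vector], pose_tokens: List[Vector]) -> List[Vector]:
    """Residual update: each latent token gains a mixture of the pose tokens."""
    attended = cross_attend(latents, pose_tokens, pose_tokens)
    return [[x + a for x, a in zip(lat, att)] for lat, att in zip(latents, attended)]
```

Because the pose tokens are available to every latent token regardless of what the camera sees, a pathway of this kind can carry the wearer's motion even in frames where the wearer's body is out of view.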

Qualitative Comparisons

E3C controls both the static environment and human dynamics.

E3C

Uses appearance-augmented memory and ego-exo pose control together.

GT Video

Ground-truth target video corresponding to the same controls.

Zero-Shot VACE

Struggles to use semi-dense point-cloud conditioning directly.

Fine-Tuned VACE

Recovers motion but misses color and texture details.

Splatfacto

Accurate where reconstructed, with artifacts outside the context views.

Gen3C

Sharp egocentric view changes can produce ghosting artifacts.

VMem

Chunked generation can introduce jumps between camera views.

Ablations

Each condition contributes a different kind of control.

Appearance Features

Per-point appearance features help preserve color information and texture details.

Panels: GT Video · Generation w/ Feat. (Ours) · Generation w/o Feat.

Exo Human Control

Exo pose control helps the generated person follow the intended motion.

Panels: GT Video · Generation w/ Exo (Ours) · Generation w/o Exo

Ego Human Control

Ego human pose control enables precise body motion aligned with the target action.

Panels: GT Video · Generation w/ Ego Control (Ours) · Generation w/o Ego Control

Scene Editing

Edited conditions produce controlled generation changes.

In this example, the chairs and the hanging lamp are consistently removed.

In this example, the couch is removed by editing the 3D memory.

In this example, the exo person is removed from the generation.
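One simple way to picture an edit to the 3D memory is deleting every point inside a user-chosen region before rendering. The sketch below does this with an axis-aligned bounding box. The page does not say how edited regions are selected, so the box selection and the name `remove_points_in_box` are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def remove_points_in_box(
    points: Sequence[Vec3],
    features: Sequence[Sequence[float]],
    box_min: Vec3,
    box_max: Vec3,
) -> Tuple[List[Vec3], List[Sequence[float]]]:
    """Drop memory points (and their features) inside an axis-aligned box."""
    kept_points: List[Vec3] = []
    kept_features: List[Sequence[float]] = []
    for point, feature in zip(points, features):
        inside = all(lo <= c <= hi for c, lo, hi in zip(point, box_min, box_max))
        if not inside:
            kept_points.append(point)
            kept_features.append(feature)
    return kept_points, kept_features
```

Rendering the filtered memory leaves the removed region empty in the condition stream, so the generator is no longer told an object is there. Removing an observed person would analogously mean omitting that person's skeleton rendering from the exo control.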

By combining the ego human poses from one video with the 3D memory from another, E3C can generate a video that pairs the actions of the first with the scene of the second.

Video A
Ego human pose from A
Video B
3D memory from B
Condition video
Generation by E3C
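Since the conditions are kept as separate streams, composing them amounts to picking each stream from a different source. The sketch below shows the idea with a hypothetical `Conditions` container; its field names and the `compose` helper are illustrative assumptions, and the ego control is reduced to per-frame 3D joints (the 6DoF wrist motion is left out for brevity).

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Conditions:
    """Hypothetical bundle of the control streams for one clip."""
    prompt: str
    ego_joints: List[List[Vec3]]        # per-frame 3D body joints of the wearer
    memory_points: List[Vec3]           # semi-dense point cloud of the scene
    memory_features: List[List[float]]  # per-point appearance descriptors


def compose(action_source: Conditions, scene_source: Conditions, prompt: str) -> Conditions:
    """Take ego motion from one clip and the 3D memory from another."""
    return Conditions(
        prompt=prompt,
        ego_joints=action_source.ego_joints,
        memory_points=scene_source.memory_points,
        memory_features=scene_source.memory_features,
    )
```

In practice the ego trajectory would also need to be expressed in the coordinate frame of the borrowed scene; how that alignment is done is not described on this page and is not modeled here.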

Citation

BibTeX

@misc{gu2026e3c,
  title = {E3C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control},
  author = {Gu, Qiao and Ma, Lingni and Harley, Adam W and Newcombe, Richard and Shkurti, Florian and Straub, Julian},
  year = {2026}
}