Chengfeng Zhao<sup>1</sup>, Jiazhi Shu<sup>2</sup>, Yubo Zhao<sup>1</sup>, Tianyu Huang<sup>3</sup>, Jiahao Lu<sup>1</sup>, Zekai Gu<sup>1</sup>, Chengwei Ren<sup>1</sup>, Zhiyang Dou<sup>4</sup>, Qing Shuai<sup>5</sup>, Yuan Liu<sup>1,†</sup>

<sup>1</sup>HKUST&nbsp;&nbsp;<sup>2</sup>SCUT&nbsp;&nbsp;<sup>3</sup>CUHK&nbsp;&nbsp;<sup>4</sup>MIT&nbsp;&nbsp;<sup>5</sup>ZJU

<sup>†</sup>Corresponding author
```shell
conda create python=3.10 --name comovi
conda activate comovi

# basic installation
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

# install flash attention
pip install ninja
pip install flash_attn --no-build-isolation  # ==2.7.3 for CUDA < 12

# install pytorch3d
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"

# install camerahmr
bash scripts/install_camerahmr.sh
```

Download the pretrained model weights:

```shell
bash scripts/download_model_weights.sh --source modelscope
# or
bash scripts/download_model_weights.sh --source huggingface
```

Run inference:

```shell
python inference.py \
    --arch Wan2.2-TI2V-5B \
    --fps 16 \
    --frames 81 \
    --height 704 \
    --width 1280 \
    --interaction "single_m2v" \
    --interleave 1
```

Explanation of arguments:

- `arch`: model architecture of the VDM backbone
- `fps`: frame rate of the generated video, default is `16`
- `frames`: number of frames in the generated video, default is `81`
- `height`: height `H` of the generated video, default is `704`
- `width`: width `W` of the generated video, default is `1280`
- `interaction`: direction of the ControlNet module, default is `single_m2v` (only the motion branch conditions the RGB branch)
- `interleave`: copy one DiT block per `x` pretrained RGB blocks, default is `1` (full copy)
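As a quick check of what these defaults produce, the clip duration follows directly from `frames` and `fps`:

```python
# Defaults taken from the inference command above:
# 81 frames at 16 fps give a roughly 5-second 1280x704 clip.
fps = 16
frames = 81
height, width = 704, 1280

duration_s = frames / fps
print(f"{duration_s:.2f} s at {width}x{height}")  # 5.06 s at 1280x704
```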
Check inference inputs and outputs in `./example/inference/`:

| motion keyword: high plank | motion keyword: dog pose |
| --- | --- |
| example_output_high_plank.mp4 | example_output_dog_pose.mp4 |
- Option 1: Download the CoMoVi dataset (coming soon)
- Option 2: Prepare customized data step by step
Install Blender:

```shell
mkdir <dir_for_blender>
cd <dir_for_blender>
wget https://download.blender.org/release/Blender3.6/blender-3.6.0-linux-x64.tar.xz
xz -d blender-3.6.0-linux-x64.tar.xz
tar -xvf blender-3.6.0-linux-x64.tar
export PATH=<dir_for_blender>/blender-3.6.0-linux-x64:$PATH
```

Run the data preparation steps:

```shell
python -m prepare.step1_run_hmr
python -m prepare.step2_smooth
python -m prepare.step3_render_2d_morep
python -m prepare.step4_normalize
```

After the steps above, the `./examples/training/` folder should have the following structure:
```
examples/training/
├── CameraHMR_smpl_results/             # raw HMR results
├── CameraHMR_smpl_results_overlay/     # raw HMR re-projection results for sanity check
├── CameraHMR_smpl_results_smoothed/    # smoothed HMR results
├── motion_2d_videos/                   # rendered 2D motion representation videos
├── processed_trainable_data/           # training-ready data
└── rgb_videos/                         # RGB videos
```

We caption videos with the enterprise-level API of Gemini-2.5-Pro. To get similar results with open-source models, we tested Qwen-VL and DAM; please follow their example code.
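Step 2 smooths the per-frame HMR estimates over time. A minimal sketch of one common approach, a centered moving average, is shown below (a hypothetical illustration only; the actual `prepare.step2_smooth` may use a different filter):

```python
# Hypothetical sketch of temporal smoothing for a per-frame parameter track.
# The real prepare.step2_smooth implementation may differ (e.g. a
# Savitzky-Golay or one-euro filter over SMPL pose parameters).
def moving_average(seq, window=5):
    """Smooth a sequence of floats with a centered moving average.

    Near the boundaries the window shrinks so no frames are dropped.
    """
    half = window // 2
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

# Jittery track -> smoothed track of the same length.
noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(moving_average(noisy, window=3))
```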
Make a JSON file to list the training corpus, for instance:
```json
[
    {
        "rgb_path": "example/rgb.mp4",
        "motion_path": "example/motion.mp4",
        "first_frame": "example/first_frame.jpg",
        "motion_first_frame": "example/motion_first_frame.jpg",
        "text": "A woman in a red long-sleeved crop top and matching leggings holds a high plank position on a yoga mat.",
        "type": "video"
    },
    {},
    {},
    ......,
    {}
]
```
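Before launching training, the manifest can be sanity-checked for missing fields. The helper below is hypothetical (not part of the repo); the required keys follow the example entry above:

```python
import json

# Keys each "video"-type entry is expected to carry, per the example manifest.
REQUIRED_KEYS = {"rgb_path", "motion_path", "first_frame",
                 "motion_first_frame", "text", "type"}

def check_manifest(entries):
    """Return the indices of entries missing any required key."""
    return [i for i, entry in enumerate(entries)
            if not REQUIRED_KEYS.issubset(entry)]

manifest = json.loads("""
[
  {"rgb_path": "example/rgb.mp4", "motion_path": "example/motion.mp4",
   "first_frame": "example/first_frame.jpg",
   "motion_first_frame": "example/motion_first_frame.jpg",
   "text": "A woman holds a high plank position.", "type": "video"},
  {}
]
""")
print(check_manifest(manifest))  # [1] -- the empty entry is flagged
```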
```shell
# stage 1
bash scripts/train_comovi_stage1.sh <GPU_NUM> <MACHINE_NUM> <LOCAL_RANK> <GPU_IDS> <MAIN_MACHINE_IP>
# stage 2
bash scripts/train_comovi_stage2.sh <GPU_NUM> <MACHINE_NUM> <LOCAL_RANK> <GPU_IDS> <MAIN_MACHINE_IP>

# example commands for two 8-GPU training machines
# stage 1
bash scripts/train_comovi_stage1.sh 16 2 0 0,1,2,3,4,5,6,7 x.x.x.x
bash scripts/train_comovi_stage1.sh 16 2 1 0,1,2,3,4,5,6,7 x.x.x.x
# stage 2
bash scripts/train_comovi_stage2.sh 16 2 0 0,1,2,3,4,5,6,7 x.x.x.x
bash scripts/train_comovi_stage2.sh 16 2 1 0,1,2,3,4,5,6,7 x.x.x.x
```

The ZeRO-3 sharded checkpoints will be saved in `./output_dir/`. To convert them to a full bf16 model, run:

```shell
python scripts/zero_to_bf16.py {zero3_dir} {target_dir} --max_shard_size 80GB --safe_serialization
```

Thanks to the following works that we refer to and benefit from:
- VideoX-Fun: the video generation model training framework;
- CameraHMR: the excellent SMPL estimation for pseudo labels;
- Champ: the data processing pipeline.
```bibtex
@article{zhao2026comovi,
  title={CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos},
  author={Zhao, Chengfeng and Shu, Jiazhi and Zhao, Yubo and Huang, Tianyu and Lu, Jiahao and Gu, Zekai and Ren, Chengwei and Dou, Zhiyang and Shuai, Qing and Liu, Yuan},
  journal={arXiv preprint arXiv:2601.10632},
  year={2026}
}
```