Chengfeng Zhao<sup>1</sup>, Jiazhi Shu<sup>2</sup>, Yubo Zhao<sup>1</sup>, Tianyu Huang<sup>3</sup>, Jiahao Lu<sup>1</sup>, Zekai Gu<sup>1</sup>, Chengwei Ren<sup>1</sup>, Zhiyang Dou<sup>4</sup>, Qing Shuai<sup>5</sup>, Yuan Liu<sup>1,†</sup>

<sup>1</sup>HKUST&nbsp;&nbsp;<sup>2</sup>SCUT&nbsp;&nbsp;<sup>3</sup>CUHK&nbsp;&nbsp;<sup>4</sup>MIT&nbsp;&nbsp;<sup>5</sup>ZJU

<sup>†</sup>Corresponding author
```shell
conda create python=3.10 --name comovi
conda activate comovi

# basic installation
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

# install flash attention
pip install ninja
pip install flash_attn --no-build-isolation  # ==2.7.3 for CUDA < 12

# install pytorch3d
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"

# install camerahmr
bash scripts/install_camerahmr.sh
```

Download the pretrained model weights:

```shell
bash scripts/download_model_weights.sh --source modelscope
# or
bash scripts/download_model_weights.sh --source huggingface
```

Run inference:

```shell
python inference.py \
    --arch Wan2.2-TI2V-5B \
    --fps 16 \
    --frames 81 \
    --height 704 \
    --width 1280 \
    --interaction "single_m2v" \
    --interleave 1
```

Explanation of arguments:

- `arch`: model architecture of the VDM backbone
- `fps`: frame rate of the generated video, default is `16`
- `frames`: number of frames in the generated video, default is `81`
- `height`: height `H` of the generated video, default is `704`
- `width`: width `W` of the generated video, default is `1280`
- `interaction`: direction of the ControlNet module, default is `single_m2v` (only the motion branch conditions the RGB branch)
- `interleave`: copy one DiT block per `x` pretrained RGB blocks, default is `1` (full copy)
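As a quick check of what these defaults produce, the clip duration follows directly from `frames` and `fps`:

```python
# Defaults taken from the inference command above:
# 81 frames at 16 fps give a roughly 5-second 1280x704 clip.
fps = 16
frames = 81
height, width = 704, 1280

duration_s = frames / fps
print(f"{duration_s:.2f} s at {width}x{height}")  # 5.06 s at 1280x704
```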
Check inference inputs and outputs in `./example/inference/`:

| motion keyword: high plank | motion keyword: dog pose |
| --- | --- |
| example_output_high_plank.mp4 | example_output_dog_pose.mp4 |
- Option 1: Download the CoMoVi dataset (coming soon)
- Option 2: Prepare customized data step by step
Install Blender:

```shell
mkdir <dir_for_blender>
cd <dir_for_blender>
wget https://download.blender.org/release/Blender3.6/blender-3.6.0-linux-x64.tar.xz
xz -d blender-3.6.0-linux-x64.tar.xz
tar -xvf blender-3.6.0-linux-x64.tar
export PATH=<dir_for_blender>/blender-3.6.0-linux-x64:$PATH
```

Run the data preparation steps:

```shell
python -m prepare.step1_run_hmr
python -m prepare.step2_smooth
python -m prepare.step3_render_2d_morep
python -m prepare.step4_normalize
```

After the steps above, the `./examples/training/` folder should have the following structure:
```
examples/training/
├── CameraHMR_smpl_results/             # raw HMR results
├── CameraHMR_smpl_results_overlay/     # raw HMR re-projection results for sanity check
├── CameraHMR_smpl_results_smoothed/    # smoothed HMR results
├── motion_2d_videos/                   # rendered 2D motion representation videos
├── processed_trainable_data/           # training-ready data
└── rgb_videos/                         # RGB videos
```

We caption videos with the enterprise-level API of Gemini-2.5-Pro. To get similar results with open-source models, we tested Qwen-VL and DAM; please follow their example code.
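Step 2 smooths the per-frame HMR estimates over time. A minimal sketch of one common approach, a centered moving average, is shown below (a hypothetical illustration only; the actual `prepare.step2_smooth` may use a different filter):

```python
# Hypothetical sketch of temporal smoothing for a per-frame parameter track.
# The real prepare.step2_smooth implementation may differ (e.g. a
# Savitzky-Golay or one-euro filter over SMPL pose parameters).
def moving_average(seq, window=5):
    """Smooth a sequence of floats with a centered moving average.

    Near the boundaries the window shrinks so no frames are dropped.
    """
    half = window // 2
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

# Jittery track -> smoothed track of the same length.
noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(moving_average(noisy, window=3))
```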
Make a JSON file to list the training corpus, for instance:
```json
[
    {
        "rgb_path": "example/rgb.mp4",
        "motion_path": "example/motion.mp4",
        "first_frame": "example/first_frame.jpg",
        "motion_first_frame": "example/motion_first_frame.jpg",
        "text": "A woman in a red long-sleeved crop top and matching leggings holds a high plank position on a yoga mat.",
        "type": "video"
    },
    {},
    {},
    ......,
    {}
]
```
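Before launching training, the manifest can be sanity-checked for missing fields. The helper below is hypothetical (not part of the repo); the required keys follow the example entry above:

```python
import json

# Keys each "video"-type entry is expected to carry, per the example manifest.
REQUIRED_KEYS = {"rgb_path", "motion_path", "first_frame",
                 "motion_first_frame", "text", "type"}

def check_manifest(entries):
    """Return the indices of entries missing any required key."""
    return [i for i, entry in enumerate(entries)
            if not REQUIRED_KEYS.issubset(entry)]

manifest = json.loads("""
[
  {"rgb_path": "example/rgb.mp4", "motion_path": "example/motion.mp4",
   "first_frame": "example/first_frame.jpg",
   "motion_first_frame": "example/motion_first_frame.jpg",
   "text": "A woman holds a high plank position.", "type": "video"},
  {}
]
""")
print(check_manifest(manifest))  # [1] -- the empty entry is flagged
```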
```shell
# stage 1
bash scripts/train_comovi_stage1.sh <GPU_NUM> <MACHINE_NUM> <LOCAL_RANK> <GPU_IDS> <MAIN_MACHINE_IP>
# stage 2
bash scripts/train_comovi_stage2.sh <GPU_NUM> <MACHINE_NUM> <LOCAL_RANK> <GPU_IDS> <MAIN_MACHINE_IP>

# example commands for two 8-GPU training machines
# stage 1
bash scripts/train_comovi_stage1.sh 16 2 0 0,1,2,3,4,5,6,7 x.x.x.x
bash scripts/train_comovi_stage1.sh 16 2 1 0,1,2,3,4,5,6,7 x.x.x.x
# stage 2
bash scripts/train_comovi_stage2.sh 16 2 0 0,1,2,3,4,5,6,7 x.x.x.x
bash scripts/train_comovi_stage2.sh 16 2 1 0,1,2,3,4,5,6,7 x.x.x.x
```

The ZeRO-3 sharded checkpoints will be saved in `./output_dir/`. To convert them to a full bf16 model, run:

```shell
python scripts/zero_to_bf16.py {zero3_dir} {target_dir} --max_shard_size 80GB --safe_serialization
```

Thanks to the following works that we refer to and benefit from:
- VideoX-Fun: the video generation model training framework;
- CameraHMR: the excellent SMPL estimation for pseudo labels;
- Champ: the data processing pipeline.
```bibtex
@article{zhao2026comovi,
  title={CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos},
  author={Zhao, Chengfeng and Shu, Jiazhi and Zhao, Yubo and Huang, Tianyu and Lu, Jiahao and Gu, Zekai and Ren, Chengwei and Dou, Zhiyang and Shuai, Qing and Liu, Yuan},
  journal={arXiv preprint arXiv:2601.10632},
  year={2026}
}
```