CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

Chengfeng Zhao1, Jiazhi Shu2, Yubo Zhao1, Tianyu Huang3, Jiahao Lu1,
Zekai Gu1, Chengwei Ren1, Zhiyang Dou4, Qing Shuai5, Yuan Liu1,†

1HKUST    2SCUT    3CUHK    4MIT    5ZJU   
† Corresponding author

arXiv Project Page Dataset

🚀 Getting Started

1. Environment Setup

conda create python=3.10 --name comovi
conda activate comovi

# basic installation
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

# install flash attention
pip install ninja
pip install flash_attn --no-build-isolation # ==2.7.3 for CUDA < 12

# install pytorch3d
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"

# install camerahmr
bash scripts/install_camerahmr.sh

2. Download Model Weights

bash scripts/download_model_weights.sh --source modelscope
# or
bash scripts/download_model_weights.sh --source huggingface

3. Inference

python inference.py \
  --arch Wan2.2-TI2V-5B \
  --fps 16 \
  --frames 81 \
  --height 704 \
  --width 1280 \
  --interaction "single_m2v" \
  --interleave 1

Explanation of arguments:

  • arch: model architecture of the VDM backbone
  • fps: frame rate of the generated video (default: 16)
  • frames: number of frames in the generated video (default: 81)
  • height: height of the generated video (default: 704)
  • width: width of the generated video (default: 1280)
  • interaction: direction of the ControlNet module; default is single_m2v (only the motion branch conditions the RGB branch)
  • interleave: copy one control DiT block for every x pretrained RGB blocks; default is 1 (full copy)
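To make the interleave argument concrete, here is a minimal sketch (not the repository's implementation) of how it would map the number of pretrained RGB DiT blocks to the number of copied control blocks; the block count of 30 is illustrative, not the actual Wan2.2 depth:

```python
import math

# Illustration only: one control block is copied for every `interleave`
# pretrained RGB DiT blocks, so interleave=1 is a full copy.
def num_control_blocks(num_rgb_blocks: int, interleave: int) -> int:
    """Number of copied control blocks for a given interleave setting."""
    return math.ceil(num_rgb_blocks / interleave)

print(num_control_blocks(30, 1))  # full copy -> 30
print(num_control_blocks(30, 2))  # every other block -> 15
```

Larger interleave values trade conditioning capacity for memory and speed, since fewer control blocks are trained.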

Check inference inputs and outputs in ./example/inference/:

  • motion keyword: high plank → example_output_high_plank.mp4
  • motion keyword: dog pose → example_output_dog_pose.mp4

🔬 Training

1. Data Preparation

Option-1: Download CoMoVi dataset (coming soon)
Option-2: Prepare customized data step by step

Install Blender

mkdir <dir_for_blender>
cd <dir_for_blender>

wget https://download.blender.org/release/Blender3.6/blender-3.6.0-linux-x64.tar.xz
xz -d blender-3.6.0-linux-x64.tar.xz
tar -xvf blender-3.6.0-linux-x64.tar

export PATH=<dir_for_blender>/blender-3.6.0-linux-x64:$PATH

Step-1: Estimate human motion from image frames

python -m prepare.step1_run_hmr

Step-2: Smooth framewise motion estimation

python -m prepare.step2_smooth

Step-3: Render 3D human motion to 2D motion representation

python -m prepare.step3_render_2d_morep

Step-4: Normalize data to the native setting of Wan2.2 (e.g. resolution, fps, etc.)

python -m prepare.step4_normalize

After the steps above, the ./examples/training/ folder should have the following structure:

examples/training/
├── CameraHMR_smpl_results/           # raw HMR results
├── CameraHMR_smpl_results_overlay/   # raw HMR re-projection overlays for sanity checks
├── CameraHMR_smpl_results_smoothed/  # smoothed HMR results
├── motion_2d_videos/                 # rendered 2D motion representation videos
├── processed_trainable_data/         # training-ready data
└── rgb_videos/                       # RGB videos
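As a quick sanity check before moving on, a small hypothetical helper (not part of the repo) can verify that all folders produced by steps 1–4 are present:

```python
from pathlib import Path

# Hypothetical helper: the folder names below mirror the expected
# examples/training/ layout described above.
EXPECTED_DIRS = [
    "CameraHMR_smpl_results",
    "CameraHMR_smpl_results_overlay",
    "CameraHMR_smpl_results_smoothed",
    "motion_2d_videos",
    "processed_trainable_data",
    "rgb_videos",
]

def missing_dirs(root: str) -> list[str]:
    """Return the expected subfolders that are absent under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

An empty return value means the data preparation steps completed as expected.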

Step-5: Caption the human motion in videos

We caption videos with the enterprise-level API of Gemini-2.5-Pro. To obtain similar results with open-source models, we also tested Qwen-VL and DAM; please follow their example code.
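The exact prompt we used is not part of this release; the sketch below is only an illustration of the kind of caption request one might assemble for any of the VLMs above (the wording and the `motion_keyword` hint are assumptions, not the paper's template):

```python
# Illustrative prompt builder, not the released captioning pipeline.
def build_caption_prompt(motion_keyword: str) -> str:
    """Assemble a one-sentence caption request for a video captioning model."""
    return (
        "Describe the human motion in this video in one sentence, "
        "covering the subject's appearance, clothing, and the action "
        f"(hint: the motion may be '{motion_keyword}')."
    )
```

The resulting caption string goes into the `text` field of the data config described in Step-Final.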

Step-Final: Organize everything in a data config file

Create a JSON file that lists the training corpus, for instance:

[
  {
    "rgb_path": "example/rgb.mp4",
    "motion_path": "example/motion.mp4",
    "first_frame": "example/first_frame.jpg",
    "motion_first_frame": "example/motion_first_frame.jpg",
    "text": "A woman in a red long-sleeved crop top and matching leggings holds a high plank position on a yoga mat.",
    "type": "video"
  },
  {},
  {},
  ......,
  {}
]
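A small hypothetical validator (not shipped with the repo) can catch malformed entries before launching training; the required keys mirror the example entry above:

```python
import json

# Keys taken from the example data config entry above.
REQUIRED_KEYS = {"rgb_path", "motion_path", "first_frame",
                 "motion_first_frame", "text", "type"}

def invalid_entries(path: str) -> list[int]:
    """Return indices of config entries missing any required key."""
    with open(path) as f:
        entries = json.load(f)
    return [i for i, entry in enumerate(entries)
            if not REQUIRED_KEYS.issubset(entry)]
```

Running this on the config and checking for an empty list is cheaper than discovering a missing `motion_path` mid-epoch.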

2. Train CoMoVi

# stage 1
bash scripts/train_comovi_stage1.sh <GPU_NUM> <MACHINE_NUM> <LOCAL_RANK> <GPU_IDS> <MAIN_MACHINE_IP>

# stage 2
bash scripts/train_comovi_stage2.sh <GPU_NUM> <MACHINE_NUM> <LOCAL_RANK> <GPU_IDS> <MAIN_MACHINE_IP>

# example commands for two 8-GPU training machines
bash scripts/train_comovi_stage1.sh 16 2 0 0,1,2,3,4,5,6,7 x.x.x.x
bash scripts/train_comovi_stage1.sh 16 2 1 0,1,2,3,4,5,6,7 x.x.x.x
#
bash scripts/train_comovi_stage2.sh 16 2 0 0,1,2,3,4,5,6,7 x.x.x.x
bash scripts/train_comovi_stage2.sh 16 2 1 0,1,2,3,4,5,6,7 x.x.x.x

The ZeRO-3 sharded checkpoints will be saved in ./output_dir/. To merge them into a full bf16 model, run:

python scripts/zero_to_bf16.py {zero3_dir} {target_dir} --max_shard_size 80GB --safe_serialization

Acknowledgments

Thanks to the following works that we refer to and benefit from:

  • VideoX-Fun: the video generation model training framework;
  • CameraHMR: the excellent SMPL estimator used for pseudo labels;
  • Champ: the data processing pipeline.

Citation

@article{zhao2026comovi,
  title={CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos},
  author={Zhao, Chengfeng and Shu, Jiazhi and Zhao, Yubo and Huang, Tianyu and Lu, Jiahao and Gu, Zekai and Ren, Chengwei and Dou, Zhiyang and Shuai, Qing and Liu, Yuan},
  journal={arXiv preprint arXiv:2601.10632},
  year={2026}
}

About

Official repository of paper "CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos"
