Running TurboDiffusion, a framework that accelerates diffusion-based video generation by 100–200x, on Windows

Introduction

Video generation models capable of very fast inference have started to appear, so let's try one out.

Key technologies:

  • SageAttention: 8-bit quantized attention
  • SLA (Sparse-Linear Attention): top-k sparse attention
  • rCM (Rectified Consistency Models): timestep distillation
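The sparse half of SLA keeps, for each query, only its highest-scoring keys and masks out the rest before the softmax. The following is a toy NumPy sketch of that top-k idea for intuition only, not TurboDiffusion's fused kernel; the function names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, top_k):
    """Attention where each query attends only to its top_k highest-scoring keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (seq, seq) attention logits
    # Threshold = each row's top_k-th largest score; mask everything below it.
    thresh = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    return softmax(masked) @ v                         # masked entries get weight 0
```

With top_k equal to the sequence length this reduces to ordinary dense attention; the speedup in real kernels comes from never computing the masked entries at all.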

Development environment

  • Windows 11
  • CUDA 13.0 / RTX 4070 Ti Super
  • uv (0.9.x)

Setup

First, clone the repository:

git clone https://github.com/thu-ml/TurboDiffusion.git
cd TurboDiffusion
git submodule update --init --recursive

Initialize the project:

uv init --python 3.12

Next, configure pyproject.toml as follows:

[project]
name = "turbodiffusion"
description = "TurboDiffusion: video generation acceleration framework that could accelerate end-to-end video generation by 100-205x with negligible video quality loss."
version = "1.0.0"
authors = [
  { name = "Jintao Zhang" },
  { name = "Kaiwen Zheng" },
  { name = "Kai Jiang" },
  { name = "Haoxu Wang" },
]
readme = "README.md"
requires-python = ">=3.9"
license = { file = "LICENSE" }

classifiers = [
  "Programming Language :: Python :: 3",
  "License :: OSI Approved :: Apache Software License",
]

dependencies = [
  "torch==2.8.0",
  "torchvision",
  # triton and flash-attn are installed manually on Windows
  # "triton>=3.3.0",
  # "flash-attn",
  "einops",
  "numpy",
  "pillow",
  "loguru",
  "imageio[ffmpeg]",
  "pandas",
  "PyYAML",
  "omegaconf",
  "attrs",
  "fvcore",
  "ftfy",
  "regex",
  "transformers",
  "nvidia-ml-py",
  "triton-windows<3.5",
  "flash-attn",
]

[build-system]
requires = [
  "setuptools>=62",
  "wheel>=0.38",
  "packaging>=21",
  "torch>=2.7.0",
  "ninja",
]
build-backend = "setuptools.build_meta"

[project.urls]
Homepage = "https://github.com/thu-ml/TurboDiffusion"
Repository = "https://github.com/thu-ml/TurboDiffusion"

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cu128" }
torchvision = { index = "pytorch-cu128" }
flash-attn = { path = "flash_attn-2.8.3+cu128torch2.8.0cxx11abiTRUE-cp312-cp312-win_amd64.whl" }

Since we want to use Flash-Attention on Windows, download a prebuilt Windows wheel:

curl -L -o flash_attn-2.8.3+cu128torch2.8.0cxx11abiTRUE-cp312-cp312-win_amd64.whl "https://github.com/bdashore3/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.8.0cxx11abiTRUE-cp312-cp312-win_amd64.whl"

Now install TurboDiffusion itself along with its dependencies:

uv add -e . --no-build-isolation
uv sync
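Before downloading multi-gigabyte checkpoints, it is worth confirming that the tricky packages actually import on Windows. A small check; the module list is my assumption of what the inference scripts need:

```python
from importlib.util import find_spec

def check_modules(names):
    """Map each module name to whether it is importable in this environment."""
    return {n: find_spec(n) is not None for n in names}

# Packages the setup above installs via uv or by hand (assumed list).
for name, ok in check_modules(["torch", "triton", "flash_attn", "einops"]).items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Run it with `uv run python check_env.py` (any filename works); if flash_attn shows MISSING, re-check that the wheel path in [tool.uv.sources] matches the downloaded file.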

Downloading the checkpoints

Download the models used in this walkthrough:

# VAE
curl -L -o Wan2.1_VAE.pth "https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/Wan2.1_VAE.pth"

# Text encoder (approx. 10 GB)
curl -L -o models_t5_umt5-xxl-enc-bf16.pth "https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/models_t5_umt5-xxl-enc-bf16.pth"

# DiT model (1.3B, non-quantized, approx. 2.7 GB)
curl -L -o TurboWan2.1-T2V-1.3B-480P.pth "https://huggingface.co/TurboDiffusion/TurboWan2.1-T2V-1.3B-480P/resolve/main/TurboWan2.1-T2V-1.3B-480P.pth"

Place the downloaded models in the checkpoints directory.

Running inference

You can run inference with the following command:

uv run turbodiffusion\inference\wan2.1_t2v_infer.py ^
    --model Wan2.1-1.3B ^
    --dit_path checkpoints/TurboWan2.1-T2V-1.3B-480P.pth ^
    --resolution 480p ^
    --prompt "A cat walking on the street" ^
    --num_steps 4 ^
    --attention_type sla ^
    --save_path output/test.mp4
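For batch runs over several prompts, it is easier to assemble the command in Python than to edit the batch script each time. A sketch with flags copied from the command above; the actual subprocess call is commented out, and the prompt list is illustrative:

```python
def build_infer_cmd(prompt, save_path, num_steps=4, attention_type="sla"):
    """Assemble the wan2.1_t2v_infer.py argument list used above."""
    return [
        "uv", "run", "turbodiffusion/inference/wan2.1_t2v_infer.py",
        "--model", "Wan2.1-1.3B",
        "--dit_path", "checkpoints/TurboWan2.1-T2V-1.3B-480P.pth",
        "--resolution", "480p",
        "--prompt", prompt,
        "--num_steps", str(num_steps),
        "--attention_type", attention_type,
        "--save_path", save_path,
    ]

prompts = {
    "output/cat.mp4": "A cat walking on the street",
    "output/dog.mp4": "A dog running on the beach",
}
for save_path, prompt in prompts.items():
    print(" ".join(build_infer_cmd(prompt, save_path)))
    # subprocess.run(build_infer_cmd(prompt, save_path), check=True)
```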

Demo videos

  • Anime style (demo_anime.mp4 / demo_anime.gif)
Anime style, a cute girl with pink hair and big eyes walking through a cherry blossom garden, sakura petals falling, soft lighting, Studio Ghibli style

  • Dance (demo_dance.mp4 / demo_dance.gif)
A professional dancer performing hip hop dance moves in a dance studio with mirrors, dynamic movements, energetic, studio lighting

  • Landscape (demo_landscape.mp4 / demo_landscape.gif)
A beautiful mountain landscape with a flowing river, autumn trees with orange and red leaves, mist rising from the valley, cinematic drone shot, golden hour lighting