Multi-node training with Accelerate

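This guide uses DiffSynth-Studio as an example to show how to run multi-node training with Accelerate on dev machines.

DiffSynth-Studio is a unified training and inference platform for mainstream diffusion models. It supports multiple model integrations, visualization, and efficient distributed acceleration. The multi-node run described here uses the Wan-AI models in DiffSynth-Studio, specifically Wan2.2-I2V-A14B.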

Install from source

git clone -b wan2.2 https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio

Install the Python package

pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple

Install additional dependencies

pip install deepspeed -i https://pypi.tuna.tsinghua.edu.cn/simple
apt-get update
apt-get install -y netcat net-tools git-lfs

Install the Python package and the additional dependencies on all 4 dev machines.

Download the example video dataset

DiffSynth-Studio provides an example video dataset for testing the training pipeline. Download it with:

modelscope download --dataset DiffSynth-Studio/example_video_dataset --local_dir ./example_video_dataset

Check the download result

ls -lh ./example_video_dataset

If the files are only a few KB in size, the actual data was not downloaded. In that case, pull the data with:

git lfs install
git lfs pull
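The size check above can also be automated. The following is a minimal sketch, assuming the small files are Git LFS pointer stubs, which are short text files beginning with the line `version https://git-lfs.github.com/spec/v1`. The function name `find_lfs_pointers` and the 1 KB size cutoff are illustrative choices, not part of DiffSynth-Studio.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root):
    """Return files under `root` that look like Git LFS pointer stubs."""
    pointers = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Pointer stubs are tiny; skip anything above the (assumed) 1 KB cutoff.
        if path.stat().st_size > 1024:
            continue
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers
```

Calling `find_lfs_pointers("./example_video_dataset")` returns an empty list when every file holds real data; any path it returns still needs `git lfs pull`.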

Multi-node, multi-GPU training configuration

This example uses 4 machines with 32 GPUs in total for distributed training. An example YAML configuration file:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_mode: default
  dynamo_use_dynamic: true
  dynamo_use_fullgraph: true
  dynamo_use_regional_compilation: true
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: '10.233.xx.xx'
main_process_port: 12345
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 32
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

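Pay attention to the following parameters in multi-node, multi-GPU training:

machine_rank: the index of the current node among all training nodes, starting from 0. Set it to 0 on the main node and to 1, 2, 3 on the other nodes.
main_process_ip: the IP address of the main node. All other nodes communicate with the main node through this IP.
Tip: pick one machine as the main node and run ifconfig on it. The address after inet in the eth0 entry is that machine's IP.
main_process_port: the port the main node uses for distributed communication. Make sure it is not already in use. This example uses port 12345.
num_machines: the total number of machines taking part in training. 4 in this example.
num_processes: the total number of processes, equal to the number of machines × GPUs per machine. 4 × 8 = 32 in this example.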

Note

Create a separate YAML configuration file for each machine, 4 in total.

Set machine_rank to 0, 1, 2, and 3 respectively; keep all other parameters identical.
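Since the four files differ only in machine_rank, they can be generated from one template rather than edited by hand. The sketch below assumes the template contains exactly one top-level `machine_rank: N` line, as in the example configuration. The function name `make_rank_configs` and the output file names `accelerate_config_rank{N}.yaml` are hypothetical.

```python
import re
from pathlib import Path

# Matches the top-level "machine_rank: N" line without consuming the newline.
MACHINE_RANK_LINE = re.compile(r"^machine_rank:[ \t]*\d+[ \t]*$", flags=re.MULTILINE)


def make_rank_configs(template_path, out_dir, num_machines=4):
    """Write one copy of the template per node, changing only machine_rank."""
    template = Path(template_path).read_text(encoding="utf-8")
    if len(MACHINE_RANK_LINE.findall(template)) != 1:
        raise ValueError("template must contain exactly one machine_rank line")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for rank in range(num_machines):
        text = MACHINE_RANK_LINE.sub(f"machine_rank: {rank}", template)
        # Hypothetical naming scheme; any per-rank name works.
        target = out_dir / f"accelerate_config_rank{rank}.yaml"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Every other key, including main_process_ip, num_machines, and num_processes, is copied unchanged, which matches the requirement that only machine_rank differs between nodes.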

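Multi-node, multi-GPU training

Use the script examples/wanvideo/model_training/full/Wan2.2-T2V-A14B.sh in the source tree as a starting point.

Before training with this script, add NCCL configuration to it.

Add the following NCCL environment variables at the top of Wan2.2-T2V-A14B.sh to keep multi-node communication efficient and stable: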

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
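NCCL_SOCKET_IFNAME only helps if the named interface actually exists on the node. The sketch below is one way to check this before launching. It assumes the variable holds a single exact interface name, as in the eth0 setting above, and does not handle the prefix or list forms that NCCL also accepts. The function name `check_nccl_ifname` is hypothetical.

```python
import os
import socket


def check_nccl_ifname(env=None, available=None):
    """Return True/False if NCCL_SOCKET_IFNAME names an existing interface,
    or None if the variable is not set."""
    env = os.environ if env is None else env
    ifname = env.get("NCCL_SOCKET_IFNAME")
    if not ifname:
        return None  # unset: nothing to validate
    if available is None:
        # socket.if_nameindex() returns (index, name) pairs for local interfaces.
        available = [name for _, name in socket.if_nameindex()]
    return ifname in available
```

Running `check_nccl_ifname()` on each node after exporting the variables reports False when, for example, the node's network interface has a name other than eth0.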

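Update the following arguments in the script:

--config_file: set this to the path of the YAML configuration file created above. The 4 configuration files correspond to 4 launch scripts.

--model_paths: set this to the local path of the model. 英博云 provides a range of built-in models and datasets on shared storage, so using a local model avoids download time. This example uses the Wan2.2-I2V-A14B model, with the paths set as follows: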

'[
    [
      "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B/high_noise_model/diffusion_pytorch_model-00001-of-00006.safetensors",
      "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B/high_noise_model/diffusion_pytorch_model-00002-of-00006.safetensors",
      "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B/high_noise_model/diffusion_pytorch_model-00003-of-00006.safetensors",
      "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B/high_noise_model/diffusion_pytorch_model-00004-of-00006.safetensors",
      "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B/high_noise_model/diffusion_pytorch_model-00005-of-00006.safetensors",
      "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B/high_noise_model/diffusion_pytorch_model-00006-of-00006.safetensors"
    ],
    "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B/models_t5_umt5-xxl-enc-bf16.pth",
    "/public/huggingface-models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth"
  ]' \
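The JSON value above follows a regular pattern, so it can be generated instead of typed out. The following sketch rebuilds the same structure from the directory layout shown in this example. The function name `build_model_paths` is hypothetical, and the shard count of 6 is taken from the file names listed above.

```python
import json


def build_model_paths(model_dir, vae_path, num_shards=6):
    """Build the --model_paths JSON string: [DiT shard list, text encoder, VAE]."""
    shards = [
        f"{model_dir}/high_noise_model/"
        f"diffusion_pytorch_model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]
    text_encoder = f"{model_dir}/models_t5_umt5-xxl-enc-bf16.pth"
    return json.dumps([shards, text_encoder, vae_path], indent=2)


if __name__ == "__main__":
    print(build_model_paths(
        "/public/huggingface-models/Wan-AI/Wan2.2-I2V-A14B",
        "/public/huggingface-models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
    ))
```

The printed string can be pasted between single quotes after --model_paths in the launch script.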

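--model_id_with_origin_paths: selects models to download by model_id. Because this example sets local paths through --model_paths, remove this argument.

For the full list of parameters, see the upstream documentation: https://github.com/modelscope/DiffSynth-Studio/blob/main/examples/wanvideo/README_zh.md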

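Place the following on the 4 machines:

The 4 YAML configuration files (machine_rank 0, 1, 2, and 3)
The matching Wan2.2-T2V-A14B.sh launch script for each

To tell them apart, the launch scripts can be named: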

Wan2.2-T2V-A14B_0.sh
Wan2.2-T2V-A14B_1.sh
Wan2.2-T2V-A14B_2.sh
Wan2.2-T2V-A14B_3.sh
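The four launch scripts differ only in the --config_file value, so they can be derived from one edited base script. This is a sketch under the assumption that the script passes the configuration as `--config_file <path>` (or `--config_file=<path>`) with an unquoted path. The function name `make_rank_scripts` is hypothetical.

```python
import re
from pathlib import Path

# Captures the flag (with its separator) and the path that follows it.
CONFIG_ARG = re.compile(r"(--config_file[ =])(\S+)")


def make_rank_scripts(base_script, config_paths, out_dir):
    """Write <name>_<rank>.sh for each config, swapping only --config_file."""
    source = Path(base_script)
    text = source.read_text(encoding="utf-8")
    if not CONFIG_ARG.search(text):
        raise ValueError("no --config_file argument found in the script")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for rank, config in enumerate(config_paths):
        # A function replacement avoids backslash handling in the new path.
        new_text = CONFIG_ARG.sub(lambda m: m.group(1) + str(config), text)
        target = out_dir / f"{source.stem}_{rank}{source.suffix}"
        target.write_text(new_text, encoding="utf-8")
        written.append(target)
    return written
```

Given Wan2.2-T2V-A14B.sh and four configuration paths, this writes Wan2.2-T2V-A14B_0.sh through Wan2.2-T2V-A14B_3.sh, matching the naming above. Every occurrence of --config_file in the script is replaced.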

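On 英博云, a rented shared storage volume is visible to all dev machines. You can therefore keep all of these files on the shared volume and have each dev machine run only its own script, with no local copies of the configuration files, which simplifies managing and deploying multi-node training.

On the main node (machine_rank=0, the node whose eth0 IP is used as main_process_ip), run: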

bash Wan2.2-T2V-A14B_0.sh

After it has run for a few seconds, check from the other 3 nodes that the main node is reachable:

nc -zv <main node IP> <main_process_port>

If the port is listening, the output is:

Connection to <main node IP> 12345 port [tcp/*] succeeded!
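Where netcat is not available, the same reachability check can be done with the Python standard library. This is a minimal sketch of a TCP connect test equivalent in spirit to `nc -zv`. The function name `port_is_open` and the 3-second timeout are illustrative choices.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On a worker node, `port_is_open("<main node IP>", 12345)` should return True once the main node's launch script is listening on main_process_port.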

Then start the training process on each of the remaining 3 machines:

bash Wan2.2-T2V-A14B_1.sh
bash Wan2.2-T2V-A14B_2.sh
bash Wan2.2-T2V-A14B_3.sh

Wait for training to finish, which takes about 10-20 minutes.
