# AnimateDiff Fine-tuning and Inference

SWIFT supports full-parameter and LoRA fine-tuning of AnimateDiff, as well as inference with the resulting models.

First, you need to clone and install SWIFT:

```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install ".[aigc]"
```
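
To confirm the installation, you can optionally check that the package imports correctly. This is a minimal sketch; it assumes the library imports as `swift` and that the `[aigc]` extra pulls in `diffusers`:

```shell
# Optional sanity check: both imports should succeed after `pip install ".[aigc]"`.
python -c "import swift, diffusers"
```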

## Full Parameter Training

### Training Effect

Full parameter fine-tuning can reproduce the effect of the [officially provided model animatediff-motion-adapter-v1-5-2](https://www.modelscope.cn/models/Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2/summary), but it requires a large number of short videos. The reproduction here used a subset of the official dataset: [WebVid 2.5M](https://maxbain.com/webvid-dataset/). The training effect is as follows:

```text
Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, girl, walking, on the street, flowers
```

![image.png](../../resources/1.gif)

```text
Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, beautiful house, mountain, snow top
```

![image.png](../../resources/2.gif)

Generation results from training on the 2.5M subset are still somewhat unstable; developers who train on the 10M dataset can expect more stable results.

### Running Command

```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100 * 4
# 200GB of GPU memory in total
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 animatediff_sft.py \
  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
  --csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
  --video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
  --sft_type full \
  --lr_scheduler_type constant \
  --trainable_modules .*motion_modules.* \
  --batch_size 4 \
  --eval_steps 100 \
  --gradient_accumulation_steps 16
```

We used A100 * 4 for training, requiring a total of 200GB of GPU memory; the training time is about 40 hours. The data format is as follows:


```text
--csv_path # Pass in a csv file, which should contain the following format:
name,contentUrl
Travel blogger shoot a story on top of mountains. young man holds camera in forest.,stock-footage-travel-blogger-shoot-a-story-on-top-of-mountains-young-man-holds-camera-in-forest.mp4
```

The name field is the prompt of the short video, and contentUrl is the file name of the corresponding video.

```text
--video_folder # Pass in a video directory containing all the video files referenced by contentUrl in the csv file.
```
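
Before launching training, it can help to verify that every video referenced in the csv file actually exists under --video_folder. The snippet below is a minimal sketch with placeholder paths; it assumes contentUrl is the last column of the csv and that video file names contain no commas:

```shell
# Sketch: report csv rows whose video file is missing from the video folder.
CSV=/path/to/results_2M_train.csv   # placeholder path
VIDEO_FOLDER=/path/to/videos        # placeholder path
tail -n +2 "$CSV" | awk -F',' '{print $NF}' | while read -r name; do
  [ -f "$VIDEO_FOLDER/$name" ] || echo "missing video: $name"
done
```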

To perform inference using full parameters:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
  --sft_type full \
  --ckpt_dir /output/path/like/checkpoints/iter-xxx \
  --eval_human true
```

The --ckpt_dir should be the output folder from training.

## LoRA Training

### Running Command

Full parameter training trains the entire Motion-Adapter structure from scratch. Alternatively, users can fine-tune an existing motion adapter with a small number of videos by running the following command:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 20GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_sft.py \
  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
  --csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
  --video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
  --motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
  --sft_type lora \
  --lr_scheduler_type constant \
  --trainable_modules .*motion_modules.* \
  --batch_size 1 \
  --eval_steps 200 \
  --dataset_sample_size 10000 \
  --gradient_accumulation_steps 16
```

Video data parameters are the same as above.

The inference command is as follows:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
  --motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
  --sft_type lora \
  --ckpt_dir /output/path/like/checkpoints/iter-xxx \
  --eval_human true
```

The --ckpt_dir should be the output folder from training.

## Parameter List

Below are the supported parameter lists and their meanings for training and inference respectively:

### Training Parameters
```text
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows continued training on top of an existing official model.
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only used when motion_adapter_id_or_path is a model ID.

model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only used when model_id_or_path is a model ID.

dataset_sample_size: int = None # Number of samples drawn from the dataset for training. The default (None) trains on the full dataset.

sft_type: str = field(
    default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.

output_dir: str = 'output' # Output folder.
ddp_backend: str = field(
    default='nccl', metadata={'choices': ['nccl', 'gloo', 'mpi', 'ccl']}) # The DDP backend to use when training with DDP.

seed: int = 42 # Random seed.

lora_rank: int = 8 # lora parameter.
lora_alpha: int = 32 # lora parameter.
lora_dropout_p: float = 0.05 # lora parameter.
lora_dtype: str = 'fp32' # The dtype of the LoRA modules. If `AUTO`, it follows the dtype of the original module.

gradient_checkpointing: bool = False # Whether to enable gradient checkpointing, disabled by default. Note: the current version of diffusers has an issue and does not support setting this to True.
batch_size: int = 1 # Batch size.
num_train_epochs: int = 1 # Number of epochs.
# If max_steps >= 0, it overrides num_train_epochs.
learning_rate: Optional[float] = None # Learning rate.
weight_decay: float = 0.01 # AdamW weight decay.
gradient_accumulation_steps: int = 16 # Gradient accumulation steps.
max_grad_norm: float = 1. # Gradient clipping norm.
lr_scheduler_type: str = 'cosine' # Type of lr_scheduler.
warmup_ratio: float = 0.05 # Proportion of training steps used for learning rate warmup.

eval_steps: int = 50 # eval step interval.
save_steps: Optional[int] = None # save step interval.
dataloader_num_workers: int = 1 # Number of dataloader workers.

push_to_hub: bool = False # Whether to push to modelhub.
# 'user_name/repo_name' or 'repo_name'
hub_model_id: Optional[str] = None # modelhub id.
hub_private_repo: bool = False
push_hub_strategy: str = field( # Push strategy: push only the last checkpoint, or push all checkpoints.
    default='push_best',
    metadata={'choices': ['push_last', 'all_checkpoints']})
# None: use env var `MODELSCOPE_API_TOKEN`
hub_token: Optional[str] = field( # modelhub token.
    default=None,
    metadata={
        'help':
        'SDK token can be found in https://modelscope.cn/my/myaccesstoken'
    })

ignore_args_error: bool = False  # True: notebook compatibility.

text_dropout_rate: float = 0.1 # Randomly drop this proportion of text prompts during training to improve model robustness.

validation_prompts_path: str = field( # The prompt file used during evaluation. By default, swift/aigc/configs/validation.txt is used.
    default=None,
    metadata={
        'help':
        'The validation prompts file path; if None, aigc/configs/validation.txt is used'
    })

trainable_modules: str = field( # Trainable modules, recommended to use the default value.
    default='.*motion_modules.*',
    metadata={
        'help':
        'The trainable modules, by default, the .*motion_modules.* will be trained'
    })

mixed_precision: bool = True # Mixed precision training.

enable_xformers_memory_efficient_attention: bool = True # Use xformers.

num_inference_steps: int = 25 # Number of denoising steps when generating validation videos.
guidance_scale: float = 8. # Classifier-free guidance scale for validation generation.
sample_size: int = 256 # Resolution (height and width) of the sampled frames.
sample_stride: int = 4 # Frame sampling stride when reading training videos (every Nth frame is taken).
sample_n_frames: int = 16 # Number of frames sampled per training clip.

csv_path: str = None # Path to the training csv file (see the data format above).
video_folder: str = None # Path to the directory containing the training videos.

motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Inference pipeline parameter.
beta_start: float = 0.00085 # Inference pipeline parameter.
beta_end: float = 0.012 # Inference pipeline parameter.
beta_schedule: str = 'linear' # Inference pipeline parameter.
steps_offset: int = 1 # Inference pipeline parameter.
clip_sample: bool = False # Inference pipeline parameter.

use_wandb: bool = False # Whether to use wandb.
```
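
Each field above corresponds to a `--<field_name>` flag of `animatediff_sft.py`, matching the commands shown earlier. As an illustrative sketch (the dataset paths and output directory below are placeholders, and the extra flags are assumed to follow this field-to-flag convention), a shorter LoRA run with a custom learning rate and wandb logging could look like this:

```shell
# Sketch: flags mirror the training fields listed above.
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_sft.py \
  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
  --motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
  --csv_path /path/to/train.csv \
  --video_folder /path/to/videos \
  --sft_type lora \
  --lr_scheduler_type constant \
  --trainable_modules .*motion_modules.* \
  --dataset_sample_size 1000 \
  --num_train_epochs 2 \
  --learning_rate 1e-4 \
  --batch_size 1 \
  --gradient_accumulation_steps 16 \
  --output_dir output/animatediff-lora \
  --use_wandb true
```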

### Inference Parameters
```text
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter to load for inference.
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only used when motion_adapter_id_or_path is a model ID.

model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only used when model_id_or_path is a model ID.

sft_type: str = field(
    default='lora', metadata={'choices': ['lora', 'full']}) # The tuning method used in training; must match the training setting.

ckpt_dir: Optional[str] = field(
    default=None, metadata={'help': '/path/to/your/vx-xxx/checkpoint-xxx'}) # The output folder of training.
eval_human: bool = False  # Whether to enter prompts manually for evaluation. If False, prompts are read from validation_prompts_path.

seed: int = 42 # Random seed.

merge_lora: bool = False # Merge lora into the MotionAdapter and save the model.
replace_if_exists: bool = False # Replace the files if the output merged dir exists when `merge_lora` is True.

# other
ignore_args_error: bool = False  # True: notebook compatibility.

validation_prompts_path: str = None # The file used for validation. When eval_human=False, each line is a prompt.

output_path: str = './generated' # The output directory for gifs.

enable_xformers_memory_efficient_attention: bool = True # Use xformers.

num_inference_steps: int = 25 # Number of denoising steps during generation.
guidance_scale: float = 8. # Classifier-free guidance scale.
sample_size: int = 256 # Resolution (height and width) of the generated frames.
sample_stride: int = 4 # Frame sampling stride, same meaning as in training.
sample_n_frames: int = 16 # Number of frames in the generated clip.

motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Inference pipeline parameter.
beta_start: float = 0.00085 # Inference pipeline parameter.
beta_end: float = 0.012 # Inference pipeline parameter.
beta_schedule: str = 'linear' # Inference pipeline parameter.
steps_offset: int = 1 # Inference pipeline parameter.
clip_sample: bool = False # Inference pipeline parameter.
```
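
As a usage sketch tying these fields together: with eval_human set to false, prompts are read line by line from validation_prompts_path, and merge_lora merges the LoRA weights into the MotionAdapter and saves the merged model. The prompts file below is a placeholder; the other flags mirror the fields documented above:

```shell
# Sketch: non-interactive LoRA inference driven by a prompts file (one prompt per line).
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
  --motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
  --sft_type lora \
  --ckpt_dir /output/path/like/checkpoints/iter-xxx \
  --eval_human false \
  --validation_prompts_path ./my_prompts.txt \
  --output_path ./generated \
  --merge_lora true
```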