Huggingface - 5. Transformers Experiments

1. Full-Parameter Fine-Tuning

First, let's lay out the experiments to run:

  • use the accelerate library to launch the training script shipped with the transformers library on a single GPU;
  • use the accelerate library to launch the training script shipped with the transformers library on two GPUs (data parallel);
  • use the deepspeed library to launch the training script shipped with the transformers library on a single GPU (ZeRO-3);
  • use the deepspeed library to launch the training script shipped with the transformers library on two GPUs (ZeRO-3);
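
All four runs fine-tune gpt2-xl with the run_clm.py example script on the same ./data/data.json training file. run_clm.py accepts csv/json/txt inputs and, for JSON, trains on the text column of a JSON-lines file, so a minimal data.json could look like this (illustrative content, not the actual dataset):

{"text": "first training document ..."}
{"text": "second training document ..."}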

1.1 Accelerate, Single GPU

The accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
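
This YAML can be generated interactively with `accelerate config` or written by hand and passed via --config_file. Note that num_processes: 1 launches a single process, so only one GPU is used even though distributed_type is MULTI_GPU. To inspect the environment and default config that accelerate sees:

accelerate env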

Launch command:

accelerate launch \
--config_file ./myconfig/default_accelerate_config.yaml \
examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1

Training log:

[INFO|trainer.py:1779] 2023-05-22 00:42:36,114 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:42:36,114 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:42:36,114 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:42:36,114 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1783] 2023-05-22 00:42:36,114 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1784] 2023-05-22 00:42:36,114 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:42:36,114 >> Total optimization steps = 114,024
[INFO|trainer.py:1786] 2023-05-22 00:42:36,115 >> Number of trainable parameters = 1,557,611,200
1%|▉ | 650/114024 [10:06<26:31:56, 1.19it/s]
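
The step count is consistent: 38,008 examples × 3 epochs / total batch size 1 = 114,024 optimization steps, and at ~1.19 it/s that matches the ~26.5 h ETA on the progress bar.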

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   66C    P0   262W / 300W |  44403MiB / 46068MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   32C    P8    15W / 300W |      7MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
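
The ~44 GB footprint is roughly what full fine-tuning of a 1.56 B-parameter model under mixed precision predicts: about 2 bytes/param for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master weights plus AdamW momentum and variance, i.e. ~16 bytes/param ≈ 25 GB of fixed state, with activations, the CUDA context, and caching-allocator overhead accounting for the rest.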

1.2 Accelerate, Two GPUs

The accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch command:

accelerate launch \
--config_file ./myconfig/default_accelerate_config.yaml \
examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1

Training log:

[INFO|trainer.py:1779] 2023-05-22 00:56:37,438 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:56:37,438 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:56:37,438 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:56:37,438 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1783] 2023-05-22 00:56:37,438 >> Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:1784] 2023-05-22 00:56:37,438 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:56:37,438 >> Total optimization steps = 57,012
[INFO|trainer.py:1786] 2023-05-22 00:56:37,440 >> Number of trainable parameters = 1,557,611,200
1%|█▉ | 673/57012 [10:57<14:01:08, 1.12it/s]
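
Per-step speed barely drops relative to the single-GPU run (1.12 it/s vs. 1.19 it/s) while each step now processes 2 examples, so data parallelism roughly halves the wall time: 38,008 × 3 / 2 = 57,012 steps, or about 14 h instead of 26.5 h.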

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   66C    P0   257W / 300W |  44389MiB / 46068MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   61C    P0   246W / 300W |  44385MiB / 46068MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

1.3 DeepSpeed, Single GPU

The deepspeed config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
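
Each "auto" field is filled in at launch by the transformers Trainer from the matching TrainingArguments, which keeps the DeepSpeed config and the command line from drifting apart. For this run the batch-related fields would resolve to (a sketch; values follow from the flags above and one GPU):

"fp16": { "enabled": true, ... },
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"train_batch_size": 1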

Launch command:

deepspeed --num_gpus=1 \
examples/pytorch/language-modeling/run_clm.py \
--deepspeed ./myconfig/ds_config.json \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1


Training log:

[INFO|trainer.py:1779] 2023-05-22 00:26:45,120 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:26:45,120 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:26:45,120 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:26:45,120 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1783] 2023-05-22 00:26:45,120 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1784] 2023-05-22 00:26:45,120 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:26:45,120 >> Total optimization steps = 114,024
[INFO|trainer.py:1786] 2023-05-22 00:26:45,121 >> Number of trainable parameters = 1,557,611,200
1%|▊ | 577/114024 [11:33<38:41:41, 1.23s/it]

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   56C    P0   164W / 300W |  40397MiB / 46068MiB |     52%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   35C    P0    75W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
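
Compared with the Accelerate single-GPU run, ZeRO-3 on one GPU uses about 4 GB less memory (40,397 MiB vs. 44,403 MiB) but runs slower (1.23 s/it vs. 1.19 it/s): with a single rank there are no peers to shard state across, so the ZeRO-3 gather/release machinery mostly adds overhead.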

1.4 DeepSpeed, Two GPUs

Optimizer-state sharding + gradient sharding + parameter sharding (ZeRO-3);

The deepspeed config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

Launch command:

deepspeed --num_gpus=2 \
examples/pytorch/language-modeling/run_clm.py \
--deepspeed ./myconfig/ds_config.json \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2

Training log:

[INFO|trainer.py:1779] 2023-05-22 00:13:55,728 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:13:55,728 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:13:55,728 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:13:55,728 >> Instantaneous batch size per device = 2
[INFO|trainer.py:1783] 2023-05-22 00:13:55,728 >> Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:1784] 2023-05-22 00:13:55,728 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:13:55,728 >> Total optimization steps = 28,506
[INFO|trainer.py:1786] 2023-05-22 00:13:55,730 >> Number of trainable parameters = 1,557,611,200
2%|██▊ | 500/28506 [10:14<9:40:02, 1.24s/it]
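
The step count again checks out: 38,008 × 3 / (2 per device × 2 GPUs) = 28,506. Sharding parameters, gradients, and optimizer states across the two cards is what frees enough memory to double the per-device batch size to 2, which the 44 GB-per-card Accelerate runs likely could not fit.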

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   62C    P0   216W / 300W |  41777MiB / 46068MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   58C    P0   219W / 300W |  41773MiB / 46068MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
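
Putting the four runs side by side (numbers from the logs above):

Accelerate, 1 GPU:        batch 1     |  44,403 MiB/GPU | 1.19 it/s | 114,024 steps
Accelerate, 2 GPUs:       batch 1 x 2 | ~44,390 MiB/GPU | 1.12 it/s |  57,012 steps
DeepSpeed ZeRO-3, 1 GPU:  batch 1     |  40,397 MiB/GPU | 1.23 s/it | 114,024 steps
DeepSpeed ZeRO-3, 2 GPUs: batch 2 x 2 | ~41,775 MiB/GPU | 1.24 s/it |  28,506 steps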

