Huggingface - 5. Transformers Experiments

1. Full-Parameter Fine-Tuning

First, let's lay out the experiments to run:

  • use the accelerate library to launch the training script shipped with the transformers library on a single GPU;
  • use the accelerate library to launch the training script shipped with the transformers library on two GPUs (data parallel);
  • use the deepspeed library to launch the training script shipped with the transformers library on a single GPU (ZeRO-3);
  • use the deepspeed library to launch the training script shipped with the transformers library on two GPUs (ZeRO-3);
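
All four runs fine-tune gpt2-xl with the run_clm.py example script on the same ./data/data.json training file. run_clm.py accepts csv/json/txt inputs and, for JSON, trains on the text column of a JSON-lines file, so a minimal data.json could look like this (illustrative content, not the actual dataset):

{"text": "first training document ..."}
{"text": "second training document ..."}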

1.1 Accelerate, Single GPU

The accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
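
This YAML can be generated interactively with `accelerate config` or written by hand and passed via --config_file. Note that num_processes: 1 launches a single process, so only one GPU is used even though distributed_type is MULTI_GPU. To inspect the environment and default config that accelerate sees:

accelerate env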

Launch command:

accelerate launch \
--config_file ./myconfig/default_accelerate_config.yaml \
examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1

Training log:

[INFO|trainer.py:1779] 2023-05-22 00:42:36,114 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:42:36,114 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:42:36,114 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:42:36,114 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1783] 2023-05-22 00:42:36,114 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1784] 2023-05-22 00:42:36,114 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:42:36,114 >> Total optimization steps = 114,024
[INFO|trainer.py:1786] 2023-05-22 00:42:36,115 >> Number of trainable parameters = 1,557,611,200
1%|▉ | 650/114024 [10:06<26:31:56, 1.19it/s]
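
The step count is consistent: 38,008 examples × 3 epochs / total batch size 1 = 114,024 optimization steps, and at ~1.19 it/s that matches the ~26.5 h ETA on the progress bar.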

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   66C    P0   262W / 300W |  44403MiB / 46068MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   32C    P8    15W / 300W |      7MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
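
The ~44 GB footprint is roughly what full fine-tuning of a 1.56 B-parameter model under mixed precision predicts: about 2 bytes/param for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master weights plus AdamW momentum and variance, i.e. ~16 bytes/param ≈ 25 GB of fixed state, with activations, the CUDA context, and caching-allocator overhead accounting for the rest.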

1.2 Accelerate, Two GPUs

The accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch command:

accelerate launch \
--config_file ./myconfig/default_accelerate_config.yaml \
examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1

Training log:

[INFO|trainer.py:1779] 2023-05-22 00:56:37,438 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:56:37,438 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:56:37,438 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:56:37,438 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1783] 2023-05-22 00:56:37,438 >> Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:1784] 2023-05-22 00:56:37,438 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:56:37,438 >> Total optimization steps = 57,012
[INFO|trainer.py:1786] 2023-05-22 00:56:37,440 >> Number of trainable parameters = 1,557,611,200
1%|█▉ | 673/57012 [10:57<14:01:08, 1.12it/s]
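
Per-step speed barely drops relative to the single-GPU run (1.12 it/s vs. 1.19 it/s) while each step now processes 2 examples, so data parallelism roughly halves the wall time: 38,008 × 3 / 2 = 57,012 steps, or about 14 h instead of 26.5 h.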

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   66C    P0   257W / 300W |  44389MiB / 46068MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   61C    P0   246W / 300W |  44385MiB / 46068MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

1.3 DeepSpeed, Single GPU

The deepspeed config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
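
Each "auto" field is filled in at launch by the transformers Trainer from the matching TrainingArguments, which keeps the DeepSpeed config and the command line from drifting apart. For this run the batch-related fields would resolve to (a sketch; values follow from the flags above and one GPU):

"fp16": { "enabled": true, ... },
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"train_batch_size": 1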

Launch command:

deepspeed --num_gpus=1 \
examples/pytorch/language-modeling/run_clm.py \
--deepspeed ./myconfig/ds_config.json \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1


Training log:

[INFO|trainer.py:1779] 2023-05-22 00:26:45,120 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:26:45,120 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:26:45,120 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:26:45,120 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1783] 2023-05-22 00:26:45,120 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1784] 2023-05-22 00:26:45,120 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:26:45,120 >> Total optimization steps = 114,024
[INFO|trainer.py:1786] 2023-05-22 00:26:45,121 >> Number of trainable parameters = 1,557,611,200
1%|▊ | 577/114024 [11:33<38:41:41, 1.23s/it]

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   56C    P0   164W / 300W |  40397MiB / 46068MiB |     52%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   35C    P0    75W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
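
Compared with the Accelerate single-GPU run, ZeRO-3 on one GPU uses about 4 GB less memory (40,397 MiB vs. 44,403 MiB) but runs slower (1.23 s/it vs. 1.19 it/s): with a single rank there are no peers to shard state across, so the ZeRO-3 gather/release machinery mostly adds overhead.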

1.4 DeepSpeed, Two GPUs

Optimizer-state sharding + gradient sharding + parameter sharding (ZeRO-3);

The deepspeed config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

Launch command:

deepspeed --num_gpus=2 \
examples/pytorch/language-modeling/run_clm.py \
--deepspeed ./myconfig/ds_config.json \
--model_name_or_path gpt2-xl \
--train_file ./data/data.json \
--output_dir ./outputs \
--overwrite_output_dir \
--do_train \
--do_eval \
--fp16 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2

Training log:

[INFO|trainer.py:1779] 2023-05-22 00:13:55,728 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-05-22 00:13:55,728 >> Num examples = 38,008
[INFO|trainer.py:1781] 2023-05-22 00:13:55,728 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-05-22 00:13:55,728 >> Instantaneous batch size per device = 2
[INFO|trainer.py:1783] 2023-05-22 00:13:55,728 >> Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:1784] 2023-05-22 00:13:55,728 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1785] 2023-05-22 00:13:55,728 >> Total optimization steps = 28,506
[INFO|trainer.py:1786] 2023-05-22 00:13:55,730 >> Number of trainable parameters = 1,557,611,200
2%|██▊ | 500/28506 [10:14<9:40:02, 1.24s/it]
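
The step count again checks out: 38,008 × 3 / (2 per device × 2 GPUs) = 28,506. Sharding parameters, gradients, and optimizer states across the two cards is what frees enough memory to double the per-device batch size to 2, which the 44 GB-per-card Accelerate runs likely could not fit.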

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           Off | 00000000:31:00.0 Off |                    0 |
|  0%   62C    P0   216W / 300W |  41777MiB / 46068MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           Off | 00000000:4B:00.0 Off |                    0 |
|  0%   58C    P0   219W / 300W |  41773MiB / 46068MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
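
Putting the four runs side by side (numbers from the logs above):

Accelerate, 1 GPU:        batch 1     |  44,403 MiB/GPU | 1.19 it/s | 114,024 steps
Accelerate, 2 GPUs:       batch 1 x 2 | ~44,390 MiB/GPU | 1.12 it/s |  57,012 steps
DeepSpeed ZeRO-3, 1 GPU:  batch 1     |  40,397 MiB/GPU | 1.23 s/it | 114,024 steps
DeepSpeed ZeRO-3, 2 GPUs: batch 2 x 2 | ~41,775 MiB/GPU | 1.24 s/it |  28,506 steps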

