diff --git a/README.md b/README.md
index 05a75949..eb260003 100644
--- a/README.md
+++ b/README.md
@@ -329,7 +329,7 @@ To enable FlashAttention-2 on the Windows platform, you need to install the prec
 
 
 
-### LLaMA Board GUI
+### Train with LLaMA Board GUI
 
 > [!IMPORTANT]
 > LLaMA Board GUI only supports training on a single GPU, please use [CLI](#command-line-interface) for distributed training.
@@ -381,7 +381,7 @@ docker compose -f ./docker-compose.yml up -d
 
 
 
-### Command Line Interface
+### Train with Command Line Interface
 
 See [examples/README.md](examples/README.md) for usage.
 
@@ -397,7 +397,7 @@ CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
     --vllm_enforce_eager
 ```
 
-### Use ModelScope Hub
+### Download from ModelScope Hub
 
 If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.
 
@@ -405,7 +405,7 @@ If you have trouble with downloading models and datasets from Hugging Face, you
 export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows
 ```
 
-Train the model by specifying a model ID of the ModelScope Hub as the `--model_name_or_path`. You can find a full list of model IDs at [ModelScope Hub](https://modelscope.cn/models), e.g., `modelscope/Llama-2-7b-ms`.
+Train the model by specifying a model ID of the ModelScope Hub as the `--model_name_or_path`. You can find a full list of model IDs at [ModelScope Hub](https://modelscope.cn/models), e.g., `LLM-Research/Meta-Llama-3-8B-Instruct`.
 
 ## Projects using LLaMA Factory
 
diff --git a/README_zh.md b/README_zh.md
index 0e01e2c2..ab43fa26 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -329,10 +329,10 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl
 
 
 
-### LLaMA Board 可视化界面
+### 利用 LLaMA Board 可视化界面训练
 
 > [!IMPORTANT]
-> LLaMA Board 可视化界面目前仅支持单 GPU 训练,请使用[命令行接口](#命令行接口)来进行分布式训练。
+> LLaMA Board 可视化界面目前仅支持单 GPU 训练,请使用[命令行接口](#命令行接口)来进行多 GPU 分布式训练。
 
 #### 使用本地环境
 
@@ -381,13 +381,13 @@ docker compose -f ./docker-compose.yml up -d
 
 
 
-### 命令行接口
+### 利用命令行接口训练
 
 使用方法请参考 [examples/README_zh.md](examples/README_zh.md)。
 
-使用 `python src/train_bash.py -h` 查看参数文档。
+您可以执行 `python src/train_bash.py -h` 来查看参数文档。
 
-### 使用 OpenAI 风格 API 和 vLLM 部署
+### 利用 vLLM 部署 OpenAI API
 
 ```bash
 CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
@@ -397,7 +397,7 @@ CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
     --vllm_enforce_eager
 ```
 
-### 使用魔搭社区
+### 从魔搭社区下载
 
 如果您在 Hugging Face 模型和数据集的下载中遇到了问题,可以通过下述方法使用魔搭社区。
 
@@ -405,7 +405,7 @@ CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
 export USE_MODELSCOPE_HUB=1 # Windows 使用 `set USE_MODELSCOPE_HUB=1`
 ```
 
-将 `--model_name_or_path` 设置为模型 ID 来加载对应的模型。在[魔搭社区](https://modelscope.cn/models)查看所有可用的模型,例如 `modelscope/Llama-2-7b-ms`。
+将 `--model_name_or_path` 设置为模型 ID 来加载对应的模型。在[魔搭社区](https://modelscope.cn/models)查看所有可用的模型,例如 `LLM-Research/Meta-Llama-3-8B-Instruct`。
 
 ## 使用了 LLaMA Factory 的项目
 
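The two README hunks above swap the ModelScope example model ID and retitle the training, deployment, and download sections. For readers trying the updated ModelScope instructions end to end, a minimal sketch follows; only `USE_MODELSCOPE_HUB`, `src/train_bash.py`, and the `LLM-Research/Meta-Llama-3-8B-Instruct` model ID come from this patch, while every other flag is illustrative and borrowed from the example scripts added later in the diff (pick a `--template` that matches the downloaded model).

```bash
# Sketch only: run from the repository root and adjust dataset, template,
# and hyperparameters to your setup.
export USE_MODELSCOPE_HUB=1   # `set USE_MODELSCOPE_HUB=1` on Windows

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path LLM-Research/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_gpt4_en \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/llama3-8b/lora/sft \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16
```

The renamed deployment section still starts an OpenAI-style server via `src/api_demo.py` on `API_PORT`. Assuming the server exposes the standard OpenAI chat-completions route (an assumption, not something this patch states), a quick smoke test could be:

```bash
# Assumption: api_demo.py is running locally with API_PORT=8000; the "model"
# field may be ignored or may need to match the served model, depending on the server.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'
```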
diff --git a/examples/README.md b/examples/README.md
index 8218d113..871bf0de 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -18,7 +18,8 @@ examples/
 │   └── aqlm.sh: Fine-tune 2-bit AQLM models using QLoRA
 ├── lora_multi_gpu/
 │   ├── single_node.sh: Fine-tune model with Accelerate on single node using LoRA
-│   └── multi_node.sh: Fine-tune model with Accelerate on multiple nodes using LoRA
+│   ├── multi_node.sh: Fine-tune model with Accelerate on multiple nodes using LoRA
+│   └── ds_zero3.sh: Fine-tune model with DeepSpeed ZeRO-3 using LoRA
 ├── full_multi_gpu/
 │   ├── single_node.sh: Full fine-tune model with DeepSpeed on single node
 │   ├── multi_node.sh: Full fine-tune model with DeepSpeed on multiple nodes
diff --git a/examples/README_zh.md b/examples/README_zh.md
index ed0d244d..c4f2062e 100644
--- a/examples/README_zh.md
+++ b/examples/README_zh.md
@@ -18,7 +18,8 @@ examples/
 │   └── aqlm.sh: 基于 QLoRA 微调 2 比特 AQLM 模型
 ├── lora_multi_gpu/
 │   ├── single_node.sh: 使用 Accelerate 进行单节点 LoRA 训练
-│   └── multi_node.sh: 使用 Accelerate 进行多节点 LoRA 训练
+│   ├── multi_node.sh: 使用 Accelerate 进行多节点 LoRA 训练
+│   └── ds_zero3.sh: 使用 DeepSpeed ZeRO-3 进行 LoRA 训练
 ├── full_multi_gpu/
 │   ├── single_node.sh: 使用 DeepSpeed 进行单节点全量训练
 │   ├── multi_node.sh: 使用 DeepSpeed 进行多节点全量训练
diff --git a/examples/extras/badam/sft.sh b/examples/extras/badam/sft.sh
new file mode 100644
index 00000000..c2319caa
--- /dev/null
+++ b/examples/extras/badam/sft.sh
@@ -0,0 +1,35 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type full \
+    --use_badam \
+    --badam_switch_mode descending \
+    --badam_switch_block_every 50 \
+    --badam_verbose 2 \
+    --output_dir ../../../saves/LLaMA2-7B/badam/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --pure_bf16
diff --git a/examples/extras/fsdp_qlora/sft.sh b/examples/extras/fsdp_qlora/sft.sh
new file mode 100644
index 00000000..e8b9ece7
--- /dev/null
+++ b/examples/extras/fsdp_qlora/sft.sh
@@ -0,0 +1,41 @@
+#!/bin/bash
+# DO NOT use GPTQ/AWQ model in FSDP+QLoRA
+
+pip install "transformers>=4.39.1"
+pip install "accelerate>=0.28.0"
+pip install "bitsandbytes>=0.43.0"
+
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
+    --config_file ../../accelerate/fsdp_config.yaml \
+    ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-70b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type lora \
+    --lora_target q_proj,v_proj \
+    --output_dir ../../../saves/LLaMA2-70B/lora/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 4 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --ddp_timeout 180000000 \
+    --quantization_bit 4 \
+    --plot_loss \
+    --fp16
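Note that the new example scripts in this patch resolve `train_bash.py`, the data directory, and the output directory relative to their own location (for instance `../../../src/train_bash.py` in the BAdam and FSDP+QLoRA scripts above), so they are intended to be launched from inside their own folders rather than from the repository root:

```bash
# Example invocation; the same pattern applies to the other new scripts.
cd examples/extras/badam
bash sft.sh
```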
diff --git a/examples/extras/galore/sft.sh b/examples/extras/galore/sft.sh
new file mode 100644
index 00000000..da1779ed
--- /dev/null
+++ b/examples/extras/galore/sft.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type full \
+    --use_galore \
+    --galore_layerwise \
+    --galore_target mlp,self_attn \
+    --galore_rank 128 \
+    --galore_scale 2.0 \
+    --output_dir ../../../saves/LLaMA2-7B/galore/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --pure_bf16
diff --git a/examples/extras/llama_pro/expand.sh b/examples/extras/llama_pro/expand.sh
new file mode 100644
index 00000000..b260902c
--- /dev/null
+++ b/examples/extras/llama_pro/expand.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+python ../../../scripts/llama_pro.py \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --output_dir ../../../models/llama2-7b-pro \
+    --num_expand 8
diff --git a/examples/extras/llama_pro/sft.sh b/examples/extras/llama_pro/sft.sh
new file mode 100644
index 00000000..573078ff
--- /dev/null
+++ b/examples/extras/llama_pro/sft.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path ../../../models/llama2-7b-pro \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type freeze \
+    --name_module_trainable all \
+    --num_layer_trainable 8 \
+    --use_llama_pro \
+    --output_dir ../../../saves/LLaMA2-7B-Pro/lora/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --fp16
diff --git a/examples/extras/loraplus/sft.sh b/examples/extras/loraplus/sft.sh
new file mode 100644
index 00000000..cb334e7d
--- /dev/null
+++ b/examples/extras/loraplus/sft.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../data \
+    --template default \
+    --finetuning_type lora \
+    --lora_target q_proj,v_proj \
+    --loraplus_lr_ratio 16.0 \
+    --output_dir ../../saves/LLaMA2-7B/loraplus/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --fp16
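The two LLaMA Pro scripts above form a sequence: `expand.sh` writes a block-expanded copy of the base model to `../../../models/llama2-7b-pro`, and `sft.sh` then fine-tunes that expanded checkpoint with `--finetuning_type freeze` and `--use_llama_pro`. A typical run would therefore be:

```bash
# Run from the script directory; expand first, then fine-tune the expanded model.
cd examples/extras/llama_pro
bash expand.sh   # writes the expanded model to ../../../models/llama2-7b-pro
bash sft.sh      # freeze-tunes the expanded checkpoint with --use_llama_pro
```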
diff --git a/examples/extras/mod/sft.sh b/examples/extras/mod/sft.sh
new file mode 100644
index 00000000..2c8f04a3
--- /dev/null
+++ b/examples/extras/mod/sft.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type full \
+    --mixture_of_depths convert \
+    --output_dir ../../../saves/LLaMA2-7B/mod/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --optim paged_adamw_8bit \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --pure_bf16
diff --git a/examples/lora_multi_gpu/ds_zero3.sh b/examples/lora_multi_gpu/ds_zero3.sh
new file mode 100644
index 00000000..f429d15b
--- /dev/null
+++ b/examples/lora_multi_gpu/ds_zero3.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+deepspeed --num_gpus 4 ../../src/train_bash.py \
+    --deepspeed ../deepspeed/ds_z3_config.json \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../data \
+    --template default \
+    --finetuning_type lora \
+    --lora_target q_proj,v_proj \
+    --output_dir ../../saves/LLaMA2-7B/lora/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 2 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --ddp_timeout 180000000 \
+    --plot_loss \
+    --fp16
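`ds_zero3.sh` above points at `../deepspeed/ds_z3_config.json`, which is not included in this patch. For orientation only, a minimal ZeRO-3 configuration built from standard DeepSpeed/Transformers `"auto"` fields is sketched below; the file actually shipped under `examples/deepspeed/` should be treated as the source of truth, and the output path here is a placeholder.

```bash
# Hypothetical minimal ZeRO-3 config; every key is a standard DeepSpeed option,
# but the repository's own ds_z3_config.json may set different values.
# Adjust the output path to wherever the launch script expects the file.
cat > ds_z3_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto"
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF
```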