diff --git a/README.md b/README.md index d10ef982..14a2084d 100644 --- a/README.md +++ b/README.md @@ -337,7 +337,7 @@ Please refer to [data/README.md](data/README.md) for checking the details about ### Quickstart -The following 3 commands conduct LoRA fine-tuning, inference and merging for Llama3-8B-Instruct model, respectively. +Use the following 3 commands to run LoRA **fine-tuning**, **inference** and **merging** of the Llama3-8B-Instruct model, respectively. ```bash CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml @@ -345,7 +345,7 @@ CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft. CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml ``` -See [examples/README.md](examples/README.md) for advanced usage. +See [examples/README.md](examples/README.md) for advanced usage (including distributed training). > [!TIP] > Use `llamafactory-cli help` to show help information. diff --git a/README_zh.md b/README_zh.md index 9c639f2c..daf5f2e8 100644 --- a/README_zh.md +++ b/README_zh.md @@ -337,7 +337,7 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl ### 快速开始 -下面三行命令分别对 Llama3-8B-Instruct 模型进行 LoRA 微调、推理和合并。 +下面三行命令分别对 Llama3-8B-Instruct 模型进行 LoRA **微调**、**推理**和**合并**。 ```bash CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml @@ -345,10 +345,10 @@ CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml ``` -高级用法请参考 [examples/README_zh.md](examples/README_zh.md)。 +高级用法请参考 [examples/README_zh.md](examples/README_zh.md)(包括多 GPU 微调)。 > [!TIP] -> 使用 `llamafactory-cli help` 显示使用帮助。 +> 使用 `llamafactory-cli help` 显示帮助信息。 ### 使用 LLaMA Board 可视化界面(由 [Gradio](https://github.com/gradio-app/gradio) 驱动) diff --git a/examples/README.md b/examples/README.md index 0a14c5bd..922f9c7b 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,57 +1,204 @@ We provide diverse examples about fine-tuning LLMs. +### LoRA Fine-Tuning on a Single GPU + +#### (Continuous) Pre-Training + ```bash -export CUDA_VISIBLE_DEVICES=0 -cd examples/lora_single_gpu -llamafactory-cli train llama3_lora_pretrain.yaml # Do continuous pre-training using LoRA - +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_pretrain.yaml ``` +#### Supervised Fine-Tuning + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml ``` -examples/ -├── lora_single_gpu/ -│ ├── ` -│ ├── sft.sh: Do supervised fine-tuning using LoRA -│ ├── reward.sh: Do reward modeling using LoRA -│ ├── ppo.sh: Do PPO training using LoRA -│ ├── dpo.sh: Do DPO training using LoRA -│ ├── orpo.sh: Do ORPO training using LoRA -│ ├── sft_mllm.sh: Do supervised fine-tuning on multimodal data using LoRA -│ ├── prepare.sh: Save tokenized dataset -│ └── predict.sh: Do batch predict and compute BLEU and ROUGE scores after LoRA tuning -├── qlora_single_gpu/ -│ ├── bitsandbytes.sh: Fine-tune 4/8-bit BNB models using QLoRA -│ ├── gptq.sh: Fine-tune 4/8-bit GPTQ models using QLoRA -│ ├── awq.sh: Fine-tune 4-bit AWQ models using QLoRA -│ └── aqlm.sh: Fine-tune 2-bit AQLM models using QLoRA -├── lora_multi_gpu/ -│ ├── single_node.sh: Fine-tune model with Accelerate on single node using LoRA -│ ├── multi_node.sh: Fine-tune model with Accelerate on multiple nodes using LoRA -│ └── ds_zero3.sh: 
Fine-tune model with DeepSpeed ZeRO-3 using LoRA (weight sharding) -├── full_multi_gpu/ -│ ├── single_node.sh: Full fine-tune model with DeepSpeed on single node -│ ├── multi_node.sh: Full fine-tune model with DeepSpeed on multiple nodes -│ └── predict.sh: Do parallel batch predict and compute BLEU and ROUGE scores after full tuning -├── merge_lora/ -│ ├── merge.sh: Merge LoRA weights into the pre-trained models -│ └── quantize.sh: Quantize the fine-tuned model with AutoGPTQ -├── inference/ -│ ├── cli_demo.sh: Chat with fine-tuned model in the CLI with LoRA adapters -│ ├── api_demo.sh: Chat with fine-tuned model in an OpenAI-style API with LoRA adapters -│ ├── web_demo.sh: Chat with fine-tuned model in the Web browser with LoRA adapters -│ └── evaluate.sh: Evaluate model on the MMLU/CMMLU/C-Eval benchmarks with LoRA adapters -└── extras/ - ├── galore/ - │ └── sft.sh: Fine-tune model with GaLore - ├── badam/ - │ └── sft.sh: Fine-tune model with BAdam - ├── loraplus/ - │ └── sft.sh: Fine-tune model using LoRA+ - ├── mod/ - │ └── sft.sh: Fine-tune model using Mixture-of-Depths - ├── llama_pro/ - │ ├── expand.sh: Expand layers in the model - │ └── sft.sh: Fine-tune the expanded model - └── fsdp_qlora/ - └── sft.sh: Fine-tune quantized model with FSDP+QLoRA + +#### Reward Modeling + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_reward.yaml +``` + +#### PPO Training + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_ppo.yaml +``` + +#### DPO Training + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml +``` + +#### ORPO Training + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_orpo.yaml +``` + +#### Multimodal Supervised Fine-Tuning + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llava1_5_lora_sft.yaml +``` + +#### Preprocess Dataset + +Preprocessing is useful 
for large datasets; set `tokenized_path` in the config to load the preprocessed dataset. + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_preprocess.yaml +``` + +#### Evaluating on MMLU/CMMLU/C-Eval Benchmarks + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval examples/lora_single_gpu/llama3_lora_eval.yaml +``` + +#### Batch Predicting and Computing BLEU and ROUGE Scores + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_predict.yaml +``` + +### QLoRA Fine-Tuning on a Single GPU + +#### Supervised Fine-Tuning with 4/8-bit Bitsandbytes Quantization (Recommended) + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml +``` + +#### Supervised Fine-Tuning with 4/8-bit GPTQ Quantization + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml +``` + +#### Supervised Fine-Tuning with 4-bit AWQ Quantization + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_awq.yaml +``` + +#### Supervised Fine-Tuning with 2-bit AQLM Quantization + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml +``` + +### LoRA Fine-Tuning on Multiple GPUs + +#### Supervised Fine-Tuning with Accelerate on Single Node + +```bash +bash examples/lora_multi_gpu/single_node.sh +``` + +#### Supervised Fine-Tuning with Accelerate on Multiple Nodes + +```bash +bash examples/lora_multi_gpu/multi_node.sh +``` + +#### Supervised Fine-Tuning with DeepSpeed ZeRO-3 (Weight Sharding) + +```bash +bash examples/lora_multi_gpu/ds_zero3.sh +``` + +### Full-Parameter Fine-Tuning on Multiple GPUs + +#### Supervised Fine-Tuning with DeepSpeed on Single Node + +```bash +bash examples/full_multi_gpu/single_node.sh +``` + +#### Supervised Fine-Tuning with DeepSpeed on Multiple Nodes + +```bash +bash 
examples/full_multi_gpu/multi_node.sh +``` + +#### Batch Predicting and Computing BLEU and ROUGE Scores + +```bash +bash examples/full_multi_gpu/predict.sh +``` + +### Merging LoRA Adapters and Quantization + +#### Merge LoRA Adapters + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml +``` + +#### Quantizing Model using AutoGPTQ + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_gptq.yaml +``` + +### Inference with LoRA Fine-Tuned Models + +#### Use CLI + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml +``` + +#### Use Web UI + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat examples/inference/llama3_lora_sft.yaml +``` + +#### Launch OpenAI-style API + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli api examples/inference/llama3_lora_sft.yaml +``` + +### Extras + +#### Full-Parameter Fine-Tuning using GaLore + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/galore/llama3_full_sft.yaml +``` + +#### Full-Parameter Fine-Tuning using BAdam + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/badam/llama3_full_sft.yaml +``` + +#### LoRA+ Fine-Tuning + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/loraplus/llama3_lora_sft.yaml +``` + +#### Mixture-of-Depths Fine-Tuning + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/mod/llama3_full_sft.yaml +``` + +#### LLaMA-Pro Fine-Tuning + +```bash +bash examples/extras/llama_pro/expand.sh +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/llama_pro/llama3_freeze_sft.yaml +``` + +#### FSDP+QLoRA Fine-Tuning + +```bash +bash examples/extras/fsdp_qlora/single_node.sh ``` diff --git a/examples/README_zh.md b/examples/README_zh.md index 091a877f..14d72c10 100644 --- a/examples/README_zh.md +++ b/examples/README_zh.md @@ -1,50 +1,204 @@ 我们提供了多样化的大模型微调示例脚本。 +### 单 GPU LoRA 微调 + +#### (增量)预训练 + 
+```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_pretrain.yaml ``` -examples/ -├── lora_single_gpu/ -│ ├── pretrain.sh: 基于 LoRA 进行增量预训练 -│ ├── sft.sh: 基于 LoRA 进行指令监督微调 -│ ├── reward.sh: 基于 LoRA 进行奖励模型训练 -│ ├── ppo.sh: 基于 LoRA 进行 PPO 训练 -│ ├── dpo.sh: 基于 LoRA 进行 DPO 训练 -│ ├── orpo.sh: 基于 LoRA 进行 ORPO 训练 -│ ├── sft_mllm.sh: 基于 LoRA 进行多模态指令监督微调 -│ ├── prepare.sh: 保存预处理后的数据集 -│ └── predict.sh: 基于 LoRA 进行批量预测并计算 BLEU 和 ROUGE 分数 -├── qlora_single_gpu/ -│ ├── bitsandbytes.sh: 基于 QLoRA 微调 4/8 比特 BNB 模型 -│ ├── gptq.sh: 基于 QLoRA 微调 4/8 比特 GPTQ 模型 -│ ├── awq.sh: 基于 QLoRA 微调 4 比特 AWQ 模型 -│ └── aqlm.sh: 基于 QLoRA 微调 2 比特 AQLM 模型 -├── lora_multi_gpu/ -│ ├── single_node.sh: 使用 Accelerate 进行单节点 LoRA 训练 -│ ├── multi_node.sh: 使用 Accelerate 进行多节点 LoRA 训练 -│ └── ds_zero3.sh: 使用 DeepSpeed ZeRO-3 进行 LoRA 训练(拆分权重) -├── full_multi_gpu/ -│ ├── single_node.sh: 使用 DeepSpeed 进行单节点全量训练 -│ ├── multi_node.sh: 使用 DeepSpeed 进行多节点全量训练 -│ └── predict.sh: 基于全量训练进行多卡批量预测并计算 BLEU 和 ROUGE 分数 -├── merge_lora/ -│ ├── merge.sh: 将 LoRA 权重合并到预训练模型中 -│ └── quantize.sh: 使用 AutoGPTQ 量化微调后的模型 -├── inference/ -│ ├── cli_demo.sh: 启动 LoRA 模型的命令行推理接口 -│ ├── api_demo.sh: 启动 LoRA 模型的 OpenAI 风格 API -│ ├── web_demo.sh: 启动 LoRA 模型的浏览器推理接口 -│ └── evaluate.sh: 在 MMLU/CMMLU/C-Eval 数据集上评测 LoRA 模型 -└── extras/ - ├── galore/ - │ └── sft.sh: 使用 GaLore 训练模型 - ├── badam/ - │ └── sft.sh: 使用 BAdam 训练模型 - ├── loraplus/ - │ └── sft.sh: 使用 LoRA+ 训练模型 - ├── mod/ - │ └── sft.sh: 使用深度混合训练模型 - ├── llama_pro/ - │ ├── expand.sh: 扩展模型中的层 - │ └── sft.sh: 训练扩展后的模型 - └── fsdp_qlora/ - └── sft.sh: 使用 FSDP+QLoRA 微调量化模型 + +#### 指令监督微调 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml +``` + +#### 奖励模型训练 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_reward.yaml +``` + +#### PPO 训练 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_ppo.yaml +``` + +#### DPO 训练 + 
+```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml +``` + +#### ORPO 训练 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_orpo.yaml +``` + +#### 多模态指令监督微调 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llava1_5_lora_sft.yaml +``` + +#### 预处理数据集 + +对于大数据集有帮助,在配置中使用 `tokenized_path` 以加载预处理后的数据集。 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_preprocess.yaml +``` + +#### 在 MMLU/CMMLU/C-Eval 上评估 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval examples/lora_single_gpu/llama3_lora_eval.yaml +``` + +#### 批量预测并计算 BLEU 和 ROUGE 分数 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_predict.yaml +``` + +### 单 GPU QLoRA 微调 + +#### 基于 4/8 比特 Bitsandbytes 量化进行指令监督微调(推荐) + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml +``` + +#### 基于 4/8 比特 GPTQ 量化进行指令监督微调 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml +``` + +#### 基于 4 比特 AWQ 量化进行指令监督微调 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_awq.yaml +``` + +#### 基于 2 比特 AQLM 量化进行指令监督微调 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml +``` + +### 多 GPU LoRA 微调 + +#### 使用 Accelerate 进行单节点训练 + +```bash +bash examples/lora_multi_gpu/single_node.sh +``` + +#### 使用 Accelerate 进行多节点训练 + +```bash +bash examples/lora_multi_gpu/multi_node.sh +``` + +#### 使用 DeepSpeed ZeRO-3 平均分配显存 + +```bash +bash examples/lora_multi_gpu/ds_zero3.sh +``` + +### 多 GPU 全参数微调 + +#### 使用 DeepSpeed 进行单节点训练 + +```bash +bash examples/full_multi_gpu/single_node.sh +``` + +#### 使用 DeepSpeed 进行多节点训练 + +```bash +bash examples/full_multi_gpu/multi_node.sh +``` + +#### 批量预测并计算 BLEU 和 ROUGE 分数 + +```bash +bash 
examples/full_multi_gpu/predict.sh +``` + +### 合并 LoRA 适配器与模型量化 + +#### 合并 LoRA 适配器 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml +``` + +#### 使用 AutoGPTQ 量化模型 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_gptq.yaml +``` + +### 推理 LoRA 模型 + +#### 使用命令行接口 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml +``` + +#### 使用浏览器界面 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat examples/inference/llama3_lora_sft.yaml +``` + +#### 启动 OpenAI 风格 API + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli api examples/inference/llama3_lora_sft.yaml +``` + +### 杂项 + +#### 使用 GaLore 进行全参数训练 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/galore/llama3_full_sft.yaml +``` + +#### 使用 BAdam 进行全参数训练 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/badam/llama3_full_sft.yaml +``` + +#### LoRA+ 微调 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/loraplus/llama3_lora_sft.yaml +``` + +#### 深度混合微调 + +```bash +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/mod/llama3_full_sft.yaml +``` + +#### LLaMA-Pro 微调 + +```bash +bash examples/extras/llama_pro/expand.sh +CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/llama_pro/llama3_freeze_sft.yaml +``` + +#### FSDP+QLoRA 微调 + +```bash +bash examples/extras/fsdp_qlora/single_node.sh ``` diff --git a/examples/extras/badam/llama3_full_sft.yaml b/examples/extras/badam/llama3_full_sft.yaml new file mode 100644 index 00000000..9f1f1976 --- /dev/null +++ b/examples/extras/badam/llama3_full_sft.yaml @@ -0,0 +1,41 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct + +# method +stage: sft +do_train: true +finetuning_type: full +use_badam: true +badam_switch_mode: descending +badam_switch_interval: 50 +badam_verbose: 2 + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 
+cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/full/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +pure_bf16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/extras/badam/sft.sh b/examples/extras/badam/sft.sh deleted file mode 100644 index 61167dad..00000000 --- a/examples/extras/badam/sft.sh +++ /dev/null @@ -1,35 +0,0 @@ -#!/bin/bash - -CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../../data \ - --template default \ - --finetuning_type full \ - --use_badam \ - --badam_switch_mode descending \ - --badam_switch_block_every 50 \ - --badam_verbose 2 \ - --output_dir ../../../saves/LLaMA2-7B/badam/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 8 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --plot_loss \ - --pure_bf16 diff --git a/examples/extras/fsdp_qlora/llama3_lora_sft.yaml b/examples/extras/fsdp_qlora/llama3_lora_sft.yaml new file mode 100644 index 00000000..64bf1356 --- /dev/null +++ b/examples/extras/fsdp_qlora/llama3_lora_sft.yaml @@ -0,0 +1,39 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct +quantization_bit: 4 + +# method +stage: sft 
+do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/extras/fsdp_qlora/sft.sh b/examples/extras/fsdp_qlora/sft.sh deleted file mode 100644 index 9eb70a53..00000000 --- a/examples/extras/fsdp_qlora/sft.sh +++ /dev/null @@ -1,41 +0,0 @@ -#!/bin/bash -# DO NOT use GPTQ/AWQ model in FSDP+QLoRA - -pip install "transformers>=4.39.1" -pip install "accelerate>=0.28.0" -pip install "bitsandbytes>=0.43.0" - -CUDA_VISIBLE_DEVICES=0,1 accelerate launch \ - --config_file ../../accelerate/fsdp_config.yaml \ - ../../../src/train.py \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-70b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../../data \ - --template default \ - --finetuning_type lora \ - --lora_target q_proj,v_proj \ - --output_dir ../../../saves/LLaMA2-70B/lora/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 4 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --ddp_timeout 180000000 \ - --quantization_bit 4 \ - --plot_loss \ - --fp16 diff --git 
a/examples/extras/fsdp_qlora/single_node.sh b/examples/extras/fsdp_qlora/single_node.sh new file mode 100644 index 00000000..54ec2bd2 --- /dev/null +++ b/examples/extras/fsdp_qlora/single_node.sh @@ -0,0 +1,10 @@ +#!/bin/bash +# DO NOT use GPTQ/AWQ model in FSDP+QLoRA + +pip install "transformers>=4.39.1" +pip install "accelerate>=0.28.0" +pip install "bitsandbytes>=0.43.0" + +CUDA_VISIBLE_DEVICES=0,1 accelerate launch \ + --config_file examples/accelerate/fsdp_config.yaml \ + src/train.py examples/extras/fsdp_qlora/llama3_lora_sft.yaml diff --git a/examples/extras/galore/llama3_full_sft.yaml b/examples/extras/galore/llama3_full_sft.yaml new file mode 100644 index 00000000..5aec8af9 --- /dev/null +++ b/examples/extras/galore/llama3_full_sft.yaml @@ -0,0 +1,42 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct + +# method +stage: sft +do_train: true +finetuning_type: full +use_galore: true +galore_layerwise: true +galore_target: mlp,self_attn +galore_rank: 128 +galore_scale: 2.0 + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/full/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 1 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +pure_bf16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/extras/galore/sft.sh b/examples/extras/galore/sft.sh deleted file mode 100644 index 283673e7..00000000 --- a/examples/extras/galore/sft.sh +++ /dev/null @@ -1,36 +0,0 @@ -#!/bin/bash - -CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../../data \ - 
--template default \ - --finetuning_type full \ - --use_galore \ - --galore_layerwise \ - --galore_target mlp,self_attn \ - --galore_rank 128 \ - --galore_scale 2.0 \ - --output_dir ../../../saves/LLaMA2-7B/galore/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 1 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --plot_loss \ - --pure_bf16 diff --git a/examples/extras/llama_pro/expand.sh b/examples/extras/llama_pro/expand.sh index b260902c..e0d41c7b 100644 --- a/examples/extras/llama_pro/expand.sh +++ b/examples/extras/llama_pro/expand.sh @@ -1,6 +1,6 @@ #!/bin/bash -python ../../../scripts/llama_pro.py \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --output_dir ../../../models/llama2-7b-pro \ +python scripts/llama_pro.py \ + --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \ + --output_dir models/llama3-8b-instruct-pro \ --num_expand 8 diff --git a/examples/extras/llama_pro/llama3_freeze_sft.yaml b/examples/extras/llama_pro/llama3_freeze_sft.yaml new file mode 100644 index 00000000..a54be8b8 --- /dev/null +++ b/examples/extras/llama_pro/llama3_freeze_sft.yaml @@ -0,0 +1,40 @@ +# model +model_name_or_path: models/llama3-8b-instruct-pro + +# method +stage: sft +do_train: true +finetuning_type: freeze +name_module_trainable: all +num_layer_trainable: 8 +use_llama_pro: true + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b-instruct-pro/freeze/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true 
+overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +pure_bf16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/extras/llama_pro/sft.sh b/examples/extras/llama_pro/sft.sh deleted file mode 100644 index 3e26e0a6..00000000 --- a/examples/extras/llama_pro/sft.sh +++ /dev/null @@ -1,34 +0,0 @@ -#!/bin/bash - -CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \ - --stage sft \ - --do_train \ - --model_name_or_path ../../../models/llama2-7b-pro \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../../data \ - --template default \ - --finetuning_type freeze \ - --name_module_trainable all \ - --num_layer_trainable 8 \ - --use_llama_pro \ - --output_dir ../../../saves/LLaMA2-7B-Pro/lora/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 8 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --plot_loss \ - --fp16 diff --git a/examples/extras/loraplus/llama3_lora_sft.yaml b/examples/extras/loraplus/llama3_lora_sft.yaml new file mode 100644 index 00000000..dfb7058b --- /dev/null +++ b/examples/extras/loraplus/llama3_lora_sft.yaml @@ -0,0 +1,39 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct + +# method +stage: sft +do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj +loraplus_lr_ratio: 16.0 + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true 
+preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +pure_bf16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/extras/loraplus/sft.sh b/examples/extras/loraplus/sft.sh deleted file mode 100644 index 8d152d9e..00000000 --- a/examples/extras/loraplus/sft.sh +++ /dev/null @@ -1,33 +0,0 @@ -#!/bin/bash - -CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../data \ - --template default \ - --finetuning_type lora \ - --lora_target q_proj,v_proj \ - --loraplus_lr_ratio 16.0 \ - --output_dir ../../saves/LLaMA2-7B/loraplus/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 8 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --plot_loss \ - --fp16 diff --git a/examples/extras/mod/llama3_full_sft.yaml b/examples/extras/mod/llama3_full_sft.yaml new file mode 100644 index 00000000..5f80521d --- /dev/null +++ b/examples/extras/mod/llama3_full_sft.yaml @@ -0,0 +1,39 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct + +# method +stage: sft +do_train: true +finetuning_type: full +mixture_of_depths: convert + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 
1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b-mod/full/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +optim: paged_adamw_8bit +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +pure_bf16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/extras/mod/sft.sh b/examples/extras/mod/sft.sh deleted file mode 100644 index 5219751f..00000000 --- a/examples/extras/mod/sft.sh +++ /dev/null @@ -1,33 +0,0 @@ -#!/bin/bash - -CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../../data \ - --template default \ - --finetuning_type full \ - --mixture_of_depths convert \ - --output_dir ../../../saves/LLaMA2-7B/mod/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 8 \ - --optim paged_adamw_8bit \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --plot_loss \ - --pure_bf16 diff --git a/examples/full_multi_gpu/llama3_full_predict.yaml b/examples/full_multi_gpu/llama3_full_predict.yaml new file mode 100644 index 00000000..5b9b680b --- /dev/null +++ b/examples/full_multi_gpu/llama3_full_predict.yaml @@ -0,0 +1,23 @@ +# model +model_name_or_path: saves/llama3-8b/full/sft + +# method +stage: sft +do_predict: true +finetuning_type: full + +# dataset +dataset: identity,alpaca_gpt4_en 
+template: llama3 +cutoff_len: 1024 +max_samples: 50 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/full/predict +overwrite_output_dir: true + +# eval +per_device_eval_batch_size: 1 +predict_with_generate: true diff --git a/examples/full_multi_gpu/llama3_full_sft.yaml b/examples/full_multi_gpu/llama3_full_sft.yaml new file mode 100644 index 00000000..ef35e441 --- /dev/null +++ b/examples/full_multi_gpu/llama3_full_sft.yaml @@ -0,0 +1,41 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct + +# method +stage: sft +do_train: true +finetuning_type: full + +# ddp +ddp_timeout: 180000000 +deepspeed: examples/deepspeed/ds_z3_config.json + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/full/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 2 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/full_multi_gpu/multi_node.sh b/examples/full_multi_gpu/multi_node.sh index a1ffc0ee..9c2508b6 100644 --- a/examples/full_multi_gpu/multi_node.sh +++ b/examples/full_multi_gpu/multi_node.sh @@ -6,33 +6,4 @@ python -m torch.distributed.run \ --node_rank $RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT \ - ../../src/train.py \ - --deepspeed ../deepspeed/ds_z3_config.json \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../data \ - --template default \ - --finetuning_type full \ - --output_dir ../../saves/LLaMA2-7B/full/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 
1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 2 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --ddp_timeout 180000000 \ - --plot_loss \ - --fp16 + src/train.py examples/full_multi_gpu/llama3_full_sft.yaml diff --git a/examples/full_multi_gpu/predict.sh b/examples/full_multi_gpu/predict.sh index 7c2e458f..2445f444 100644 --- a/examples/full_multi_gpu/predict.sh +++ b/examples/full_multi_gpu/predict.sh @@ -1,20 +1,5 @@ #!/bin/bash CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \ - --config_file ../accelerate/single_config.yaml \ - ../../src/train.py \ - --stage sft \ - --do_predict \ - --model_name_or_path ../../saves/LLaMA2-7B/full/sft \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../data \ - --template default \ - --finetuning_type full \ - --output_dir ../../saves/LLaMA2-7B/full/predict \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_eval_batch_size 1 \ - --max_samples 20 \ - --predict_with_generate + --config_file examples/accelerate/single_config.yaml \ + src/train.py examples/full_multi_gpu/llama3_full_predict.yaml diff --git a/examples/full_multi_gpu/single_node.sh b/examples/full_multi_gpu/single_node.sh index 73c7662d..f391166a 100644 --- a/examples/full_multi_gpu/single_node.sh +++ b/examples/full_multi_gpu/single_node.sh @@ -1,32 +1,4 @@ #!/bin/bash -deepspeed --num_gpus 4 ../../src/train.py \ - --deepspeed ../deepspeed/ds_z3_config.json \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../data \ - --template default \ - --finetuning_type full \ - --output_dir 
../../saves/LLaMA2-7B/full/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 2 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --ddp_timeout 180000000 \ - --plot_loss \ - --fp16 +deepspeed --include "localhost:0,1,2,3" \ + src/train.py examples/full_multi_gpu/llama3_full_sft.yaml diff --git a/examples/lora_multi_gpu/ds_zero3.sh b/examples/lora_multi_gpu/ds_zero3.sh index bc74a6de..304f3780 100644 --- a/examples/lora_multi_gpu/ds_zero3.sh +++ b/examples/lora_multi_gpu/ds_zero3.sh @@ -1,34 +1,5 @@ #!/bin/bash # ZeRO-3 enables weight sharding on multiple GPUs -deepspeed --num_gpus 4 ../../src/train.py \ - --deepspeed ../deepspeed/ds_z3_config.json \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../data \ - --template default \ - --finetuning_type lora \ - --lora_target q_proj,v_proj \ - --output_dir ../../saves/LLaMA2-7B/lora/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 2 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --ddp_timeout 180000000 \ - --plot_loss \ - --fp16 +deepspeed --include "localhost:0,1,2,3" \ + src/train.py examples/lora_multi_gpu/llama3_lora_sft_ds.yaml diff --git a/examples/lora_multi_gpu/llama3_lora_sft.yaml 
b/examples/lora_multi_gpu/llama3_lora_sft.yaml new file mode 100644 index 00000000..d9690679 --- /dev/null +++ b/examples/lora_multi_gpu/llama3_lora_sft.yaml @@ -0,0 +1,41 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct + +# method +stage: sft +do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj + +# ddp +ddp_timeout: 180000000 + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 2 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_steps: 0.1 +fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/lora_multi_gpu/llama3_lora_sft_ds.yaml b/examples/lora_multi_gpu/llama3_lora_sft_ds.yaml new file mode 100644 index 00000000..26955167 --- /dev/null +++ b/examples/lora_multi_gpu/llama3_lora_sft_ds.yaml @@ -0,0 +1,42 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct + +# method +stage: sft +do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj + +# ddp +ddp_timeout: 180000000 +deepspeed: examples/deepspeed/ds_z3_config.json + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 2 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_steps: 0.1 +fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git 
a/examples/lora_multi_gpu/multi_node.sh b/examples/lora_multi_gpu/multi_node.sh index a58cac20..401fac5f 100644 --- a/examples/lora_multi_gpu/multi_node.sh +++ b/examples/lora_multi_gpu/multi_node.sh @@ -2,35 +2,5 @@ # also launch it on slave machine using slave_config.yaml CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \ - --config_file ../accelerate/master_config.yaml \ - ../../src/train.py \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../data \ - --template default \ - --finetuning_type lora \ - --lora_target q_proj,v_proj \ - --output_dir ../../saves/LLaMA2-7B/lora/sft \ - --overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 2 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --ddp_timeout 180000000 \ - --plot_loss \ - --fp16 + --config_file examples/accelerate/master_config.yaml \ + src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml diff --git a/examples/lora_multi_gpu/single_node.sh b/examples/lora_multi_gpu/single_node.sh index c0719c04..885a0e8c 100644 --- a/examples/lora_multi_gpu/single_node.sh +++ b/examples/lora_multi_gpu/single_node.sh @@ -1,35 +1,5 @@ #!/bin/bash CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \ - --config_file ../accelerate/single_config.yaml \ - ../../src/train.py \ - --stage sft \ - --do_train \ - --model_name_or_path meta-llama/Llama-2-7b-hf \ - --dataset alpaca_gpt4_en,glaive_toolcall \ - --dataset_dir ../../data \ - --template default \ - --finetuning_type lora \ - --lora_target q_proj,v_proj \ - --output_dir ../../saves/LLaMA2-7B/lora/sft \ - 
--overwrite_cache \ - --overwrite_output_dir \ - --cutoff_len 1024 \ - --preprocessing_num_workers 16 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 2 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --warmup_steps 20 \ - --save_steps 100 \ - --eval_steps 100 \ - --evaluation_strategy steps \ - --load_best_model_at_end \ - --learning_rate 5e-5 \ - --num_train_epochs 3.0 \ - --max_samples 3000 \ - --val_size 0.1 \ - --ddp_timeout 180000000 \ - --plot_loss \ - --fp16 + --config_file examples/accelerate/single_config.yaml \ + src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml diff --git a/examples/lora_single_gpu/llama3_preprocess.yaml b/examples/lora_single_gpu/llama3_preprocess.yaml index 04df9631..0b3dc599 100644 --- a/examples/lora_single_gpu/llama3_preprocess.yaml +++ b/examples/lora_single_gpu/llama3_preprocess.yaml @@ -15,7 +15,7 @@ max_samples: 1000 val_size: 0.1 overwrite_cache: true preprocessing_num_workers: 16 -tokenized_path: saves/llama3-8b/dataset/sft # use `tokenized_path` in config to load data +tokenized_path: saves/llama3-8b/dataset/sft # output output_dir: saves/llama3-8b/lora/sft diff --git a/examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml b/examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml index 2bd99740..11f1d277 100644 --- a/examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml +++ b/examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml @@ -1,27 +1,38 @@ +# model +model_name_or_path: ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16 + +# method stage: sft do_train: true -model_name_or_path: BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf -dataset: alpaca_gpt4_en,glaive_toolcall -dataset_dir: data -template: default finetuning_type: lora lora_target: q_proj,v_proj -output_dir: ../../saves/LLaMA2-7B/lora/sft -overwrite_cache: true -overwrite_output_dir: true + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 cutoff_len: 1024 -per_device_train_batch_size: 1 
-per_device_eval_batch_size: 1 -gradient_accumulation_steps: 8 -lr_scheduler_type: cosine -logging_steps: 10 -save_steps: 100 -eval_steps: 100 -evaluation_strategy: steps -load_best_model_at_end: true -learning_rate: 5e-5 -num_train_epochs: 3.0 -max_samples: 3000 +max_samples: 1000 val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_steps: 0.1 fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/qlora_single_gpu/llama3_lora_sft_awq.yaml b/examples/qlora_single_gpu/llama3_lora_sft_awq.yaml index e69de29b..4b070d45 100644 --- a/examples/qlora_single_gpu/llama3_lora_sft_awq.yaml +++ b/examples/qlora_single_gpu/llama3_lora_sft_awq.yaml @@ -0,0 +1,38 @@ +# model +model_name_or_path: TechxGenus/Meta-Llama-3-8B-Instruct-AWQ + +# method +stage: sft +do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_steps: 0.1 +fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml b/examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml index e69de29b..7bc31bde 100644 --- a/examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml +++ 
b/examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml @@ -0,0 +1,42 @@ +# model +model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct +quantization_bit: 4 + +# method +stage: sft +do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj + +# ddp +ddp_timeout: 180000000 + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_steps: 0.1 +fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500 diff --git a/examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml b/examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml index e69de29b..2f8cfe45 100644 --- a/examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml +++ b/examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml @@ -0,0 +1,38 @@ +# model +model_name_or_path: TechxGenus/Meta-Llama-3-8B-Instruct-GPTQ + +# method +stage: sft +do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj + +# dataset +dataset: identity,alpaca_gpt4_en +template: llama3 +cutoff_len: 1024 +max_samples: 1000 +val_size: 0.1 +overwrite_cache: true +preprocessing_num_workers: 16 + +# output +output_dir: saves/llama3-8b/lora/sft +logging_steps: 10 +save_steps: 500 +plot_loss: true +overwrite_output_dir: true + +# train +per_device_train_batch_size: 1 +gradient_accumulation_steps: 8 +learning_rate: 0.0001 +num_train_epochs: 3.0 +lr_scheduler_type: cosine +warmup_steps: 0.1 +fp16: true + +# eval +per_device_eval_batch_size: 1 +evaluation_strategy: steps +eval_steps: 500
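
Reviewer note: the multi-GPU configs in this diff set `gradient_accumulation_steps: 2`, while the single-GPU QLoRA configs set it to 8. A rough sketch of how these values relate is below. It assumes the usual data-parallel relationship (samples per optimizer step = per-device batch size × accumulation steps × number of GPUs), four GPUs for the `0,1,2,3` launch scripts, and one GPU for the `qlora_single_gpu` configs. The helper name `effective_batch_size` is hypothetical and not part of the repository.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step under plain data parallelism.

    Hypothetical helper for illustration only; it is not a LLaMA-Factory API.
    """
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


# Values copied from the YAML files in this diff; the GPU counts are assumptions
# based on the launch scripts (0,1,2,3) and the directory name (single GPU).
multi_gpu = effective_batch_size(1, 2, num_gpus=4)
qlora_single_gpu = effective_batch_size(1, 8, num_gpus=1)

print(multi_gpu, qlora_single_gpu)
```

Under these assumptions both setups process the same number of samples per optimizer step, which may be why the accumulation value differs between the two directories. A multi-node launch would scale the product further by the total number of processes.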