mirror of https://github.com/hiyouga/LLaMA-Factory.git (synced 2025-10-14 15:52:49 +08:00)

update example docs

Former-commit-id: 102cd42768d9eb2cf1219309a25b41e26149067e

parent 5c9da798b5
commit 50c71dd29f
@ -337,7 +337,7 @@ Please refer to [data/README.md](data/README.md) for checking the details about

### Quickstart

The following 3 commands conduct LoRA fine-tuning, inference and merging for the Llama3-8B-Instruct model, respectively.
Use the following 3 commands to conduct LoRA **fine-tuning**, **inference** and **merging** for the Llama3-8B-Instruct model, respectively.

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
@ -345,7 +345,7 @@ CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```

See [examples/README.md](examples/README.md) for advanced usage.
See [examples/README.md](examples/README.md) for advanced usage (including distributed training).

> [!TIP]
> Use `llamafactory-cli help` to show help information.
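The fine-tuning and merging commands from the Quickstart can be chained in one script so that merging only runs if training succeeds. The following is a minimal sketch, not part of the repository: `run_step`, `run_quickstart` and the `DRY_RUN` variable are helpers invented for illustration, and the interactive `chat` step is left out because it would block a non-interactive script. By default it only prints the commands.

```bash
#!/bin/bash
set -euo pipefail

# Print each command; execute it only when DRY_RUN=0.
# DRY_RUN is a helper variable for this sketch, not a LLaMA-Factory setting.
run_step() {
  echo "+ CUDA_VISIBLE_DEVICES=0 llamafactory-cli $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    CUDA_VISIBLE_DEVICES=0 llamafactory-cli "$@"
  fi
}

# Train the LoRA adapter, then merge it; `set -e` stops at the first failure.
run_quickstart() {
  run_step train examples/lora_single_gpu/llama3_lora_sft.yaml
  run_step export examples/merge_lora/llama3_lora_sft.yaml
}

run_quickstart
```

Run it with `DRY_RUN=0` from the repository root to actually launch both steps.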
@ -337,7 +337,7 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl

### Quickstart

The following three commands perform LoRA fine-tuning, inference and merging for the Llama3-8B-Instruct model, respectively.
The following three commands perform LoRA **fine-tuning**, **inference** and **merging** for the Llama3-8B-Instruct model, respectively.

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
@ -345,10 +345,10 @@ CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```

See [examples/README_zh.md](examples/README_zh.md) for advanced usage.
See [examples/README_zh.md](examples/README_zh.md) for advanced usage (including multi-GPU fine-tuning).

> [!TIP]
> Use `llamafactory-cli help` to show usage help.
> Use `llamafactory-cli help` to show help information.

### Using the LLaMA Board GUI (powered by [Gradio](https://github.com/gradio-app/gradio))

@ -1,57 +1,204 @@
We provide diverse examples about fine-tuning LLMs.

### LoRA Fine-Tuning on a Single GPU

#### (Continuous) Pre-Training

```bash
export CUDA_VISIBLE_DEVICES=0
cd examples/lora_single_gpu
llamafactory-cli train llama3_lora_pretrain.yaml # Do continuous pre-training using LoRA

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_pretrain.yaml
```

#### Supervised Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
```
examples/
├── lora_single_gpu/
│   ├── `
│   ├── sft.sh: Do supervised fine-tuning using LoRA
│   ├── reward.sh: Do reward modeling using LoRA
│   ├── ppo.sh: Do PPO training using LoRA
│   ├── dpo.sh: Do DPO training using LoRA
│   ├── orpo.sh: Do ORPO training using LoRA
│   ├── sft_mllm.sh: Do supervised fine-tuning on multimodal data using LoRA
│   ├── prepare.sh: Save tokenized dataset
│   └── predict.sh: Do batch predict and compute BLEU and ROUGE scores after LoRA tuning
├── qlora_single_gpu/
│   ├── bitsandbytes.sh: Fine-tune 4/8-bit BNB models using QLoRA
│   ├── gptq.sh: Fine-tune 4/8-bit GPTQ models using QLoRA
│   ├── awq.sh: Fine-tune 4-bit AWQ models using QLoRA
│   └── aqlm.sh: Fine-tune 2-bit AQLM models using QLoRA
├── lora_multi_gpu/
│   ├── single_node.sh: Fine-tune model with Accelerate on single node using LoRA
│   ├── multi_node.sh: Fine-tune model with Accelerate on multiple nodes using LoRA
│   └── ds_zero3.sh: Fine-tune model with DeepSpeed ZeRO-3 using LoRA (weight sharding)
├── full_multi_gpu/
│   ├── single_node.sh: Full fine-tune model with DeepSpeed on single node
│   ├── multi_node.sh: Full fine-tune model with DeepSpeed on multiple nodes
│   └── predict.sh: Do parallel batch predict and compute BLEU and ROUGE scores after full tuning
├── merge_lora/
│   ├── merge.sh: Merge LoRA weights into the pre-trained models
│   └── quantize.sh: Quantize the fine-tuned model with AutoGPTQ
├── inference/
│   ├── cli_demo.sh: Chat with fine-tuned model in the CLI with LoRA adapters
│   ├── api_demo.sh: Chat with fine-tuned model in an OpenAI-style API with LoRA adapters
│   ├── web_demo.sh: Chat with fine-tuned model in the Web browser with LoRA adapters
│   └── evaluate.sh: Evaluate model on the MMLU/CMMLU/C-Eval benchmarks with LoRA adapters
└── extras/
    ├── galore/
    │   └── sft.sh: Fine-tune model with GaLore
    ├── badam/
    │   └── sft.sh: Fine-tune model with BAdam
    ├── loraplus/
    │   └── sft.sh: Fine-tune model using LoRA+
    ├── mod/
    │   └── sft.sh: Fine-tune model using Mixture-of-Depths
    ├── llama_pro/
    │   ├── expand.sh: Expand layers in the model
    │   └── sft.sh: Fine-tune the expanded model
    └── fsdp_qlora/
        └── sft.sh: Fine-tune quantized model with FSDP+QLoRA

#### Reward Modeling

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_reward.yaml
```

#### PPO Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_ppo.yaml
```

#### DPO Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml
```

#### ORPO Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_orpo.yaml
```

#### Multimodal Supervised Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llava1_5_lora_sft.yaml
```

#### Preprocess Dataset

This is useful for large datasets; set `tokenized_path` in the config to load the preprocessed dataset.

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_preprocess.yaml
```

#### Evaluating on MMLU/CMMLU/C-Eval Benchmarks

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval examples/lora_single_gpu/llama3_lora_eval.yaml
```

#### Batch Predicting and Computing BLEU and ROUGE Scores

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_predict.yaml
```

### QLoRA Fine-Tuning on a Single GPU

#### Supervised Fine-Tuning with 4/8-bit Bitsandbytes Quantization (Recommended)

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml
```

#### Supervised Fine-Tuning with 4/8-bit GPTQ Quantization

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml
```

#### Supervised Fine-Tuning with 4-bit AWQ Quantization

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_awq.yaml
```

#### Supervised Fine-Tuning with 2-bit AQLM Quantization

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml
```

### LoRA Fine-Tuning on Multiple GPUs

#### Supervised Fine-Tuning with Accelerate on Single Node

```bash
bash examples/lora_multi_gpu/single_node.sh
```

#### Supervised Fine-Tuning with Accelerate on Multiple Nodes

```bash
bash examples/lora_multi_gpu/multi_node.sh
```

#### Supervised Fine-Tuning with DeepSpeed ZeRO-3 (Weight Sharding)

```bash
bash examples/lora_multi_gpu/ds_zero3.sh
```

### Full-Parameter Fine-Tuning on Multiple GPUs

#### Supervised Fine-Tuning with DeepSpeed on Single Node

```bash
bash examples/full_multi_gpu/single_node.sh
```

#### Supervised Fine-Tuning with DeepSpeed on Multiple Nodes

```bash
bash examples/full_multi_gpu/multi_node.sh
```

#### Batch Predicting and Computing BLEU and ROUGE Scores

```bash
bash examples/full_multi_gpu/predict.sh
```

### Merging LoRA Adapters and Quantization

#### Merge LoRA Adapters

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```

#### Quantizing Model using AutoGPTQ

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_gptq.yaml
```

### Inferring LoRA Fine-Tuned Models

#### Use CLI

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
```

#### Use Web UI

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat examples/inference/llama3_lora_sft.yaml
```

#### Launch OpenAI-style API

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli api examples/inference/llama3_lora_sft.yaml
```

### Extras

#### Full-Parameter Fine-Tuning using GaLore

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/galore/llama3_full_sft.yaml
```

#### Full-Parameter Fine-Tuning using BAdam

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/badam/llama3_full_sft.yaml
```

#### LoRA+ Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/loraplus/llama3_lora_sft.yaml
```

#### Mixture-of-Depths Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/mod/llama3_full_sft.yaml
```

#### LLaMA-Pro Fine-Tuning

```bash
bash examples/extras/llama_pro/expand.sh
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/llama_pro/llama3_freeze_sft.yaml
```

#### FSDP+QLoRA Fine-Tuning

```bash
bash examples/extras/fsdp_qlora/single_node.sh
```
@ -1,50 +1,204 @@
We provide a diverse set of example scripts for fine-tuning large language models.

### LoRA Fine-Tuning on a Single GPU

#### (Continuous) Pre-Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_pretrain.yaml
```
examples/
├── lora_single_gpu/
│   ├── pretrain.sh: Do continuous pre-training using LoRA
│   ├── sft.sh: Do supervised fine-tuning using LoRA
│   ├── reward.sh: Do reward model training using LoRA
│   ├── ppo.sh: Do PPO training using LoRA
│   ├── dpo.sh: Do DPO training using LoRA
│   ├── orpo.sh: Do ORPO training using LoRA
│   ├── sft_mllm.sh: Do multimodal supervised fine-tuning using LoRA
│   ├── prepare.sh: Save the preprocessed dataset
│   └── predict.sh: Do batch prediction and compute BLEU and ROUGE scores using LoRA
├── qlora_single_gpu/
│   ├── bitsandbytes.sh: Fine-tune 4/8-bit BNB models using QLoRA
│   ├── gptq.sh: Fine-tune 4/8-bit GPTQ models using QLoRA
│   ├── awq.sh: Fine-tune 4-bit AWQ models using QLoRA
│   └── aqlm.sh: Fine-tune 2-bit AQLM models using QLoRA
├── lora_multi_gpu/
│   ├── single_node.sh: Do single-node LoRA training with Accelerate
│   ├── multi_node.sh: Do multi-node LoRA training with Accelerate
│   └── ds_zero3.sh: Do LoRA training with DeepSpeed ZeRO-3 (weight sharding)
├── full_multi_gpu/
│   ├── single_node.sh: Do single-node full-parameter training with DeepSpeed
│   ├── multi_node.sh: Do multi-node full-parameter training with DeepSpeed
│   └── predict.sh: Do multi-GPU batch prediction and compute BLEU and ROUGE scores after full-parameter training
├── merge_lora/
│   ├── merge.sh: Merge LoRA weights into the pre-trained model
│   └── quantize.sh: Quantize the fine-tuned model with AutoGPTQ
├── inference/
│   ├── cli_demo.sh: Launch the command-line inference interface for the LoRA model
│   ├── api_demo.sh: Launch an OpenAI-style API for the LoRA model
│   ├── web_demo.sh: Launch the browser inference interface for the LoRA model
│   └── evaluate.sh: Evaluate the LoRA model on the MMLU/CMMLU/C-Eval datasets
└── extras/
    ├── galore/
    │   └── sft.sh: Train the model with GaLore
    ├── badam/
    │   └── sft.sh: Train the model with BAdam
    ├── loraplus/
    │   └── sft.sh: Train the model with LoRA+
    ├── mod/
    │   └── sft.sh: Train the model with Mixture-of-Depths
    ├── llama_pro/
    │   ├── expand.sh: Expand the layers in the model
    │   └── sft.sh: Train the expanded model
    └── fsdp_qlora/
        └── sft.sh: Fine-tune the quantized model with FSDP+QLoRA

#### Supervised Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
```

#### Reward Model Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_reward.yaml
```

#### PPO Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_ppo.yaml
```

#### DPO Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml
```

#### ORPO Training

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_orpo.yaml
```

#### Multimodal Supervised Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llava1_5_lora_sft.yaml
```

#### Preprocess Dataset

This is useful for large datasets; set `tokenized_path` in the config to load the preprocessed dataset.

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_preprocess.yaml
```

#### Evaluating on MMLU/CMMLU/C-Eval

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval examples/lora_single_gpu/llama3_lora_eval.yaml
```

#### Batch Prediction and Computing BLEU and ROUGE Scores

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_predict.yaml
```

### QLoRA Fine-Tuning on a Single GPU

#### Supervised Fine-Tuning with 4/8-bit Bitsandbytes Quantization (Recommended)

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml
```

#### Supervised Fine-Tuning with 4/8-bit GPTQ Quantization

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml
```

#### Supervised Fine-Tuning with 4-bit AWQ Quantization

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_awq.yaml
```

#### Supervised Fine-Tuning with 2-bit AQLM Quantization

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml
```

### LoRA Fine-Tuning on Multiple GPUs

#### Single-Node Training with Accelerate

```bash
bash examples/lora_multi_gpu/single_node.sh
```

#### Multi-Node Training with Accelerate

```bash
bash examples/lora_multi_gpu/multi_node.sh
```

#### Distributing GPU Memory Evenly with DeepSpeed ZeRO-3

```bash
bash examples/lora_multi_gpu/ds_zero3.sh
```

### Full-Parameter Fine-Tuning on Multiple GPUs

#### Single-Node Training with DeepSpeed

```bash
bash examples/full_multi_gpu/single_node.sh
```

#### Multi-Node Training with DeepSpeed

```bash
bash examples/full_multi_gpu/multi_node.sh
```

#### Batch Prediction and Computing BLEU and ROUGE Scores

```bash
bash examples/full_multi_gpu/predict.sh
```

### Merging LoRA Adapters and Model Quantization

#### Merge LoRA Adapters

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```

#### Quantize the Model with AutoGPTQ

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_gptq.yaml
```

### Inference with LoRA Models

#### Use the Command-Line Interface

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
```

#### Use the Browser Interface

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat examples/inference/llama3_lora_sft.yaml
```

#### Launch an OpenAI-Style API

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli api examples/inference/llama3_lora_sft.yaml
```

### Miscellaneous

#### Full-Parameter Training with GaLore

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/galore/llama3_full_sft.yaml
```

#### Full-Parameter Training with BAdam

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/badam/llama3_full_sft.yaml
```

#### LoRA+ Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/loraplus/llama3_lora_sft.yaml
```

#### Mixture-of-Depths Fine-Tuning

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/mod/llama3_full_sft.yaml
```

#### LLaMA-Pro Fine-Tuning

```bash
bash examples/extras/llama_pro/expand.sh
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/llama_pro/llama3_freeze_sft.yaml
```

#### FSDP+QLoRA Fine-Tuning

```bash
bash examples/extras/fsdp_qlora/single_node.sh
```
41
examples/extras/badam/llama3_lora_sft.yaml
Normal file
@ -0,0 +1,41 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: full
use_badam: true
badam_switch_mode: descending
badam_switch_interval: 50
badam_verbose: 2

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
pure_bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
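The training settings in this config determine the effective batch size and, together with the dataset size, the approximate schedule length. The arithmetic can be sketched as follows. The GPU count and the number of training samples are assumptions made for illustration (neither is stated in the config), the warmup share is taken as 10%, and the trainer's exact step count may differ slightly from this estimate.

```bash
#!/bin/bash
# Values copied from the config above.
per_device_train_batch_size=1
gradient_accumulation_steps=8
num_train_epochs=3

# Assumptions for illustration only (not in the config).
num_gpus=1
num_train_samples=900

# Samples consumed per optimizer step.
effective_batch=$(( per_device_train_batch_size * gradient_accumulation_steps * num_gpus ))

# Ceiling division: optimizer steps needed to see every sample once.
steps_per_epoch=$(( (num_train_samples + effective_batch - 1) / effective_batch ))
total_steps=$(( steps_per_epoch * num_train_epochs ))

# 10% of the schedule spent on warmup, rounded up.
warmup_steps=$(( (total_steps + 9) / 10 ))

echo "effective_batch=$effective_batch total_steps=$total_steps warmup_steps=$warmup_steps"
```

With these assumed inputs, `save_steps: 500` and `eval_steps: 500` would never trigger before training ends, so smaller intervals may be worth considering for short runs.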
@ -1,35 +0,0 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type full \
    --use_badam \
    --badam_switch_mode descending \
    --badam_switch_block_every 50 \
    --badam_verbose 2 \
    --output_dir ../../../saves/LLaMA2-7B/badam/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
39
examples/extras/fsdp_qlora/llama3_lora_sft.yaml
Normal file
@ -0,0 +1,39 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
quantization_bit: 4

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
@ -1,41 +0,0 @@
#!/bin/bash
# DO NOT use GPTQ/AWQ model in FSDP+QLoRA

pip install "transformers>=4.39.1"
pip install "accelerate>=0.28.0"
pip install "bitsandbytes>=0.43.0"

CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file ../../accelerate/fsdp_config.yaml \
    ../../../src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-70b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../../saves/LLaMA2-70B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --quantization_bit 4 \
    --plot_loss \
    --fp16
10
examples/extras/fsdp_qlora/single_node.sh
Normal file
@ -0,0 +1,10 @@
#!/bin/bash
# DO NOT use GPTQ/AWQ model in FSDP+QLoRA

pip install "transformers>=4.39.1"
pip install "accelerate>=0.28.0"
pip install "bitsandbytes>=0.43.0"

CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft.yaml
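The script above reinstalls three packages on every run to satisfy its minimum versions. An alternative is to check what is already installed first. The sketch below is one possible way to do that and is not part of the repository: `version_ge` and `check_pkg` are hypothetical helpers, and the comparison relies on the `-V` (version sort) option of GNU `sort`.

```bash
#!/bin/bash
# version_ge A B: succeeds when version A >= version B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_pkg NAME MINIMUM: report whether the installed version meets the minimum.
check_pkg() {
  local installed
  installed="$(pip show "$1" 2>/dev/null | awk '/^Version:/ {print $2}')"
  if [ -z "$installed" ]; then
    echo "$1: not installed (need >= $2)"
  elif version_ge "$installed" "$2"; then
    echo "$1: $installed OK"
  else
    echo "$1: $installed is older than $2"
  fi
}

# Minimum versions taken from the pip install lines in the script above.
check_pkg transformers 4.39.1
check_pkg accelerate 0.28.0
check_pkg bitsandbytes 0.43.0
```

Only packages reported as missing or too old then need a `pip install`.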
42
examples/extras/galore/llama3_full_sft.yaml
Normal file
@ -0,0 +1,42 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: full
use_galore: true
galore_layerwise: true
galore_target: mlp,self_attn
galore_rank: 128
galore_scale: 2.0

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
pure_bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
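To get a feel for what `galore_rank: 128` means for optimizer memory, the following rough sketch counts entries for a single 4096 x 4096 attention projection (4096 is the hidden size of Llama-3-8B). It assumes, as a simplification of the GaLore method, that each Adam moment is stored for the rank-128 projected gradient and that one projection matrix is kept per weight; the actual implementation's bookkeeping is not shown in this commit.

```bash
#!/bin/bash
hidden_size=4096   # Llama-3-8B hidden size; q_proj is hidden_size x hidden_size
galore_rank=128    # from the config above

# Entries per Adam moment when the full gradient is tracked.
full_state=$(( hidden_size * hidden_size ))

# Entries per Adam moment for the projected (rank x hidden_size) gradient.
lowrank_state=$(( galore_rank * hidden_size ))

# Entries in the projection matrix (hidden_size x rank), stored once per weight.
projector=$(( hidden_size * galore_rank ))

reduction=$(( full_state / lowrank_state ))
echo "full=$full_state low_rank=$lowrank_state projector=$projector reduction=${reduction}x"
```

Under these assumptions each moment shrinks by a factor of 32 for this matrix, at the cost of storing one extra projector of the same low-rank size.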
@ -1,36 +0,0 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type full \
    --use_galore \
    --galore_layerwise \
    --galore_target mlp,self_attn \
    --galore_rank 128 \
    --galore_scale 2.0 \
    --output_dir ../../../saves/LLaMA2-7B/galore/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
@ -1,6 +1,6 @@
#!/bin/bash

python ../../../scripts/llama_pro.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --output_dir ../../../models/llama2-7b-pro \
python scripts/llama_pro.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --output_dir models/llama3-8b-instruct-pro \
    --num_expand 8
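The effect of `--num_expand 8` on a 32-layer model can be worked out by hand. The sketch below assumes the block-expansion scheme described for LLaMA Pro, where one new block is inserted after every `num_layers / num_expand` original layers; the internals of `scripts/llama_pro.py` are not shown in this commit, so treat the layout as an assumption.

```bash
#!/bin/bash
num_layers=32    # decoder layers in Meta-Llama-3-8B-Instruct
num_expand=8     # from --num_expand 8

split=$(( num_layers / num_expand ))   # original layers per group
total=0
new_layers=""
for (( i = 0; i < num_layers; i++ )); do
  total=$(( total + 1 ))               # keep original layer i
  if (( (i + 1) % split == 0 )); then
    new_layers="$new_layers $total"    # 0-based index of the inserted block
    total=$(( total + 1 ))
  fi
done
new_layers="${new_layers# }"

echo "expanded model: $total layers; new blocks at indices: $new_layers"
```

Under this layout the expanded model has 40 layers, and the 8 inserted blocks are the natural candidates for the trainable layers in a follow-up freeze-tuning run with `num_layer_trainable: 8`.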
40
examples/extras/llama_pro/llama3_freeze_sft.yaml
Normal file
@ -0,0 +1,40 @@
# model
model_name_or_path: models/llama3-8b-instruct-pro

# method
stage: sft
do_train: true
finetuning_type: freeze
name_module_trainable: all
num_layer_trainable: 8
use_llama_pro: true

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b-instruct-pro/freeze/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
pure_bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
@ -1,34 +0,0 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path ../../../models/llama2-7b-pro \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type freeze \
    --name_module_trainable all \
    --num_layer_trainable 8 \
    --use_llama_pro \
    --output_dir ../../../saves/LLaMA2-7B-Pro/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
39
examples/extras/loraplus/llama3_lora_sft.yaml
Normal file
@ -0,0 +1,39 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj
loraplus_lr_ratio: 16.0

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
pure_bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
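LoRA+ trains the LoRA `B` matrices with a larger learning rate than the `A` matrices, and `loraplus_lr_ratio` is the multiplier between the two. A small worked example with the values from this config, assuming `learning_rate` is the base rate applied to `A` (the variable names below are for illustration only):

```bash
#!/bin/bash
learning_rate=0.0001      # base learning rate from the config
loraplus_lr_ratio=16.0    # LoRA+ ratio from the config

# Learning rate for the B matrices = ratio x base rate.
lr_b="$(LC_ALL=C awk -v lr="$learning_rate" -v r="$loraplus_lr_ratio" \
  'BEGIN { printf "%.4f", lr * r }')"

echo "lr(A)=$learning_rate lr(B)=$lr_b"
```

If training becomes unstable, lowering the ratio or the base rate reduces the effective step size on `B`.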
@ -1,33 +0,0 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --loraplus_lr_ratio 16.0 \
    --output_dir ../../saves/LLaMA2-7B/loraplus/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
39
examples/extras/mod/llama3_full_sft.yaml
Normal file
@ -0,0 +1,39 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: full
mixture_of_depths: convert

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b-mod/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
optim: paged_adamw_8bit
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
pure_bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
@ -1,33 +0,0 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type full \
    --mixture_of_depths convert \
    --output_dir ../../../saves/LLaMA2-7B/mod/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --optim paged_adamw_8bit \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
23
examples/full_multi_gpu/llama3_full_predict.yaml
Normal file
@ -0,0 +1,23 @@
# model
model_name_or_path: saves/llama3-8b/full/sft

# method
stage: sft
do_predict: true
finetuning_type: full

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/full/predict
overwrite_output_dir: true

# eval
per_device_eval_batch_size: 1
predict_with_generate: true
41
examples/full_multi_gpu/llama3_full_sft.yaml
Normal file
@ -0,0 +1,41 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: full

# ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
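As a rough sketch of what the training settings in `llama3_full_sft.yaml` imply, the effective global batch size is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs. The GPU count of 4 is an assumption taken from the four-device launch commands in this commit (`localhost:0,1,2,3`), and the variable names are illustrative only.

```bash
#!/bin/bash
# Values copied from examples/full_multi_gpu/llama3_full_sft.yaml
per_device_train_batch_size=1
gradient_accumulation_steps=2
# Assumption: 4 GPUs, matching the "localhost:0,1,2,3" launch commands
num_gpus=4

# Samples consumed per optimizer step across all devices
effective_batch_size=$((per_device_train_batch_size * gradient_accumulation_steps * num_gpus))
echo "effective batch size: ${effective_batch_size}"
```

With a different number of visible devices, only `num_gpus` changes; the YAML values stay the same.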
@ -6,33 +6,4 @@ python -m torch.distributed.run \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
../../src/train.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset alpaca_gpt4_en,glaive_toolcall \
--dataset_dir ../../data \
--template default \
--finetuning_type full \
--output_dir ../../saves/LLaMA2-7B/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
src/train.py examples/full_multi_gpu/llama3_full_sft.yaml
@ -1,20 +1,5 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train.py \
--stage sft \
--do_predict \
--model_name_or_path ../../saves/LLaMA2-7B/full/sft \
--dataset alpaca_gpt4_en,glaive_toolcall \
--dataset_dir ../../data \
--template default \
--finetuning_type full \
--output_dir ../../saves/LLaMA2-7B/full/predict \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_eval_batch_size 1 \
--max_samples 20 \
--predict_with_generate
--config_file examples/accelerate/single_config.yaml \
src/train.py examples/full_multi_gpu/llama3_full_predict.yaml
@ -1,32 +1,4 @@
#!/bin/bash

deepspeed --num_gpus 4 ../../src/train.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset alpaca_gpt4_en,glaive_toolcall \
--dataset_dir ../../data \
--template default \
--finetuning_type full \
--output_dir ../../saves/LLaMA2-7B/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
deepspeed --include "localhost:0,1,2,3" \
src/train.py examples/full_multi_gpu/llama3_full_sft.yaml
@ -1,34 +1,5 @@
#!/bin/bash
# ZeRO-3 enables weight sharding on multiple GPUs

deepspeed --num_gpus 4 ../../src/train.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset alpaca_gpt4_en,glaive_toolcall \
--dataset_dir ../../data \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir ../../saves/LLaMA2-7B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
deepspeed --include "localhost:0,1,2,3" \
src/train.py examples/lora_multi_gpu/llama3_lora_sft_ds.yaml
41
examples/lora_multi_gpu/llama3_lora_sft.yaml
Normal file
@ -0,0 +1,41 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
42
examples/lora_multi_gpu/llama3_lora_sft_ds.yaml
Normal file
@ -0,0 +1,42 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
@ -2,35 +2,5 @@
# also launch it on slave machine using slave_config.yaml

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--config_file ../accelerate/master_config.yaml \
../../src/train.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset alpaca_gpt4_en,glaive_toolcall \
--dataset_dir ../../data \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir ../../saves/LLaMA2-7B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
--config_file examples/accelerate/master_config.yaml \
src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml
@ -1,35 +1,5 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset alpaca_gpt4_en,glaive_toolcall \
--dataset_dir ../../data \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir ../../saves/LLaMA2-7B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
--config_file examples/accelerate/single_config.yaml \
src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml
@ -15,7 +15,7 @@ max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16
tokenized_path: saves/llama3-8b/dataset/sft # use `tokenized_path` in config to load data
tokenized_path: saves/llama3-8b/dataset/sft

# output
output_dir: saves/llama3-8b/lora/sft
@ -1,27 +1,38 @@
# model
model_name_or_path: ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16

# method
stage: sft
do_train: true
model_name_or_path: BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf
dataset: alpaca_gpt4_en,glaive_toolcall
dataset_dir: data
template: default
finetuning_type: lora
lora_target: q_proj,v_proj
output_dir: ../../saves/LLaMA2-7B/lora/sft
overwrite_cache: true
overwrite_output_dir: true

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
logging_steps: 10
save_steps: 100
eval_steps: 100
evaluation_strategy: steps
load_best_model_at_end: true
learning_rate: 5e-5
num_train_epochs: 3.0
max_samples: 3000
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
@ -0,0 +1,38 @@
# model
model_name_or_path: TechxGenus/Meta-Llama-3-8B-Instruct-AWQ

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
@ -0,0 +1,42 @@
# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
quantization_bit: 4

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
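For a rough sense of what `quantization_bit: 4` changes in the config above, the back-of-the-envelope estimate below compares weight storage at 4 bits and at 16 bits per parameter. It assumes roughly 8 billion parameters and counts weights only (no quantization constants, layers kept in higher precision, activations, LoRA adapters, or optimizer state), so the numbers are approximate and the variable names are illustrative.

```bash
#!/bin/bash
# Assumption: Llama-3-8B rounded to 8 billion parameters
params_billion=8
# quantization_bit from the config above
bits_quantized=4
# Half-precision baseline for comparison
bits_fp16=16

# bits per parameter / 8 = bytes per parameter; billions of bytes ~ GB
quantized_gb=$((params_billion * bits_quantized / 8))
fp16_gb=$((params_billion * bits_fp16 / 8))

echo "approx. weight memory at ${bits_quantized}-bit: ${quantized_gb} GB"
echo "approx. weight memory at ${bits_fp16}-bit: ${fp16_gb} GB"
```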
@ -0,0 +1,38 @@
# model
model_name_or_path: TechxGenus/Meta-Llama-3-8B-Instruct-GPTQ

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# dataset
dataset: identity,alpaca_gpt4_en
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500