diff --git a/README.md b/README.md
index 3bf284b2..ed56baba 100644
--- a/README.md
+++ b/README.md
@@ -46,7 +46,7 @@ Choose your path:
 - **Various models**: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
 - **Integrated methods**: (Continuous) pre-training, supervised fine-tuning, reward modeling, PPO, DPO and ORPO.
 - **Scalable resources**: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA and 2/4/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8.
-- **Advanced algorithms**: GaLore, Mixture of Depths, BAdam, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning.
+- **Advanced algorithms**: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and Agent tuning.
 - **Practical tricks**: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
 - **Experiment monitors**: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
 - **Faster inference**: OpenAI-style API, Gradio UI and CLI with vLLM worker.
@@ -68,16 +68,16 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 ## Changelog
 
-[24/04/19] We integrated **[Mixture of Depths](https://github.com/astramind-ai/Mixture-of-depths)**. see `examples/extras/MoD` for usage.
+[24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** according to [AstraMindAI's implementation](https://github.com/astramind-ai/Mixture-of-depths). See `examples/extras/mod` for usage.
 
 [24/04/19] We supported **Meta Llama 3** model series.
 
 [24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See `examples/extras/badam` for usage.
 
-<details><summary>Full Changelog</summary>
-
 [24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k within 24GB). It achieves **117%** speed and **50%** memory compared with FlashAttention-2, more benchmarks can be found in [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).
 
+<details><summary>Full Changelog</summary>
+
 [24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.
 
 [24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available at arXiv!
 
@@ -251,6 +251,7 @@ You also can add a custom chat template to [template.py](src/llmtuner/data/template.py)
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
 - [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
+- [DPO mix (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)
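For context on what the new `--mixture_of_depths convert` option announced above does, here is a minimal sketch of the convert path, assuming the `mixture-of-depth` package (>=1.1.6) that this patch pins in `parser.py` is installed; the model id and save path are illustrative, and the call pattern follows the `loader.py` hunk later in this diff:

```python
import torch
from transformers import AutoModelForCausalLM

from MoD import apply_mod_to_hf  # provided by the mixture-of-depth package

# Load a vanilla HF checkpoint, then wrap it with Mixture-of-Depths routing.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = apply_mod_to_hf(model)    # convert supported decoder layers to MoD
model = model.to(torch.bfloat16)  # loader.py casts to the compute dtype after converting
model.save_pretrained("saves/LLaMA2-7B/mod/sft")
```

A checkpoint converted this way is reloaded later with `--mixture_of_depths load`, which goes through `MoD.AutoMoDModelForCausalLM` instead of the plain `AutoModelForCausalLM`.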
diff --git a/README_zh.md b/README_zh.md
index 7565664e..586ee38a 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -46,7 +46,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 - **多种模型**:LLaMA、Mistral、Mixtral-MoE、Qwen、Yi、Gemma、Baichuan、ChatGLM、Phi 等等。
 - **集成方法**:(增量)预训练、指令监督微调、奖励模型训练、PPO 训练、DPO 训练和 ORPO 训练。
 - **多种精度**:32 比特全参数微调、16 比特冻结微调、16 比特 LoRA 微调和基于 AQLM/AWQ/GPTQ/LLM.int8 的 2/4/8 比特 QLoRA 微调。
-- **先进算法**:GaLore、Mixture of Depths、BAdam、DoRA、LongLoRA、LLaMA Pro、LoRA+、LoftQ 和 Agent 微调。
+- **先进算法**:GaLore、BAdam、DoRA、LongLoRA、LLaMA Pro、Mixture-of-Depths、LoRA+、LoftQ 和 Agent 微调。
 - **实用技巧**:FlashAttention-2、Unsloth、RoPE scaling、NEFTune 和 rsLoRA。
 - **实验监控**:LlamaBoard、TensorBoard、Wandb、MLflow 等等。
 - **极速推理**:基于 vLLM 的 OpenAI 风格 API、浏览器界面和命令行接口。
@@ -68,16 +68,16 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 ## 更新日志
 
-[24/04/19] 我们整合了 **[深度混合](https://github.com/astramind-ai/Mixture-of-depths)**。用法请参见 `examples/extras/MoD`。
+[24/04/21] 我们基于 [AstraMindAI 的仓库](https://github.com/astramind-ai/Mixture-of-depths)支持了 **[混合深度训练](https://arxiv.org/abs/2404.02258)**。详细用法请参照 `examples/extras/mod`。
 
 [24/04/19] 我们支持了 **Meta Llama 3** 系列模型。
 
 [24/04/16] 我们支持了 **[BAdam](https://arxiv.org/abs/2404.02827)**。详细用法请参照 `examples/extras/badam`。
 
-<details><summary>展开日志</summary>
-
 [24/04/16] 我们支持了 **[unsloth](https://github.com/unslothai/unsloth)** 的长序列训练(24GB 可训练 Llama-2-7B-56k)。该方法相比 FlashAttention-2 提供了 **117%** 的训练速度和 **50%** 的显存节约。更多数据请见[此页面](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison)。
 
+<details><summary>展开日志</summary>
+
 [24/03/31] 我们支持了 **[ORPO](https://arxiv.org/abs/2403.07691)**。详细用法请参照 `examples/lora_single_gpu`。
 
 [24/03/21] 我们的论文 "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" 可在 arXiv 上查看!
 
@@ -251,6 +251,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
 - [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
+- [DPO mix (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)
diff --git a/examples/README.md b/examples/README.md
index dd526ba8..8218d113 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -38,12 +38,11 @@ examples/
 │   └── sft.sh: Fine-tune model with BAdam
 ├── loraplus/
 │   └── sft.sh: Fine-tune model using LoRA+
+├── mod/
+│   └── sft.sh: Fine-tune model using Mixture-of-Depths
 ├── llama_pro/
 │   ├── expand.sh: Expand layers in the model
 │   └── sft.sh: Fine-tune the expanded model
-├── MoD/
-│   ├── freeze_sft.sh: Freeze finetune a model, updating only the MoD router
-│   └── sft.sh: Fine-tune the MoD model
 └── fsdp_qlora/
     └── sft.sh: Fine-tune quantized model with FSDP+QLoRA
 ```
diff --git a/examples/README_zh.md b/examples/README_zh.md
index cdef207b..ed0d244d 100644
--- a/examples/README_zh.md
+++ b/examples/README_zh.md
@@ -38,12 +38,11 @@ examples/
 │   └── sft.sh: 使用 BAdam 训练模型
 ├── loraplus/
 │   └── sft.sh: 使用 LoRA+ 训练模型
+├── mod/
+│   └── sft.sh: 使用深度混合训练模型
 ├── llama_pro/
 │   ├── expand.sh: 扩展模型中的层
 │   └── sft.sh: 训练扩展后的模型
-├── MoD/
-│   ├── freeze_sft.sh: 冻结微调模型,仅更新 MoD 路由器
-│   └── sft.sh: 微调国防部模型
 └── fsdp_qlora/
     └── sft.sh: 使用 FSDP+QLoRA 微调量化模型
 ```
diff --git a/examples/extras/MoD/freeze_sft.sh b/examples/extras/MoD/freeze_sft.sh
deleted file mode 100644
index 867fad47..00000000
--- a/examples/extras/MoD/freeze_sft.sh
+++ /dev/null
@@ -1,33 +0,0 @@
-#!/bin/bash
-
-CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
-    --stage sft \
-    --do_train \
-    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
-    --dataset alpaca_gpt4_en,glaive_toolcall \
-    --dataset_dir ../../../data \
-    --template default \
-    --finetuning_type freeze \
-    --name_module_trainable router \
-    --output_dir ../../../saves/TinyLlama/TinyLlama-1.1B-Chat-v1.0/sft \
-    --mixture_of_depths convert \
-    --overwrite_cache \
-    --overwrite_output_dir \
-    --cutoff_len 1024 \
-    --preprocessing_num_workers 16 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 1 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --warmup_steps 20 \
-    --save_steps 100 \
-    --eval_steps 100 \
-    --evaluation_strategy steps \
-    --load_best_model_at_end \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --max_samples 3000 \
-    --val_size 0.1 \
-    --plot_loss \
-    --pure_bf16
diff --git a/examples/extras/MoD/sft.sh b/examples/extras/MoD/sft.sh
index b0257f9f..2c8f04a3 100644
--- a/examples/extras/MoD/sft.sh
+++ b/examples/extras/MoD/sft.sh
@@ -3,20 +3,21 @@
 CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
     --stage sft \
     --do_train \
-    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
     --dataset alpaca_gpt4_en,glaive_toolcall \
     --dataset_dir ../../../data \
     --template default \
     --finetuning_type full \
-    --output_dir ../../../saves/TinyLlama/TinyLlama-1.1B-Chat-v1.0/sft \
     --mixture_of_depths convert \
+    --output_dir ../../../saves/LLaMA2-7B/mod/sft \
     --overwrite_cache \
     --overwrite_output_dir \
     --cutoff_len 1024 \
     --preprocessing_num_workers 16 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 1 \
+    --gradient_accumulation_steps 8 \
+    --optim paged_adamw_8bit \
     --lr_scheduler_type cosine \
     --logging_steps 10 \
     --warmup_steps 20 \
diff --git a/examples/extras/galore/sft.sh b/examples/extras/galore/sft.sh
index 1ffeb5ca..1e46ac1f 100644
--- a/examples/extras/galore/sft.sh
+++ b/examples/extras/galore/sft.sh
@@ -11,6 +11,7 @@ CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
     --use_galore \
     --galore_layerwise \
     --galore_target mlp,self_attn \
+    --galore_scale 2.0 \
     --galore_rank 128 \
     --output_dir ../../../saves/LLaMA2-7B/galore/sft \
     --overwrite_cache \
@@ -28,8 +29,8 @@ CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
     --evaluation_strategy steps \
     --load_best_model_at_end \
     --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --max_samples 3000 \
+    --num_train_epochs 30.0 \
+    --max_samples 300 \
     --val_size 0.1 \
     --plot_loss \
     --pure_bf16
diff --git a/examples/inference/evaluate.sh b/examples/inference/evaluate.sh
index b54c2a60..1fc6ccf8 100644
--- a/examples/inference/evaluate.sh
+++ b/examples/inference/evaluate.sh
@@ -3,7 +3,7 @@
 CUDA_VISIBLE_DEVICES=0 python ../../src/evaluate.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
     --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
-    --template vanilla \
+    --template fewshot \
     --finetuning_type lora \
     --task mmlu \
     --split test \
diff --git a/src/llmtuner/data/template.py b/src/llmtuner/data/template.py
index c74becc4..04538510 100644
--- a/src/llmtuner/data/template.py
+++ b/src/llmtuner/data/template.py
@@ -343,7 +343,7 @@ def get_template_and_fix_tokenizer(
     name: Optional[str] = None,
 ) -> Template:
     if name is None:
-        template = templates["vanilla"]  # placeholder
+        template = templates["empty"]  # placeholder
     else:
         template = templates.get(name, None)
         if template is None:
@@ -385,7 +385,8 @@
     format_user=StringFormatter(slots=["### Instruction:\n{{content}}\n\n### Response:\n"]),
     format_separator=EmptyFormatter(slots=["\n\n"]),
     default_system=(
-        "Below is an instruction that describes a task. " "Write a response that appropriately completes the request."
+        "Below is an instruction that describes a task. "
+        "Write a response that appropriately completes the request.\n\n"
     ),
 )
 
@@ -596,6 +597,13 @@ _register_template(
 )
 
 
+_register_template(
+    name="fewshot",
+    format_separator=EmptyFormatter(slots=["\n\n"]),
+    efficient_eos=True,
+)
+
+
 _register_template(
     name="gemma",
     format_user=StringFormatter(slots=["<start_of_turn>user\n{{content}}<end_of_turn>\n<start_of_turn>model\n"]),
@@ -740,13 +748,6 @@ _register_template(
 )
 
 
-_register_template(
-    name="vanilla",
-    format_separator=EmptyFormatter(slots=["\n"]),
-    efficient_eos=True,
-)
-
-
 _register_template(
     name="vicuna",
     format_user=StringFormatter(slots=["USER: {{content}} ASSISTANT:"]),
diff --git a/src/llmtuner/extras/constants.py b/src/llmtuner/extras/constants.py
index 78352a01..a0e51d17 100644
--- a/src/llmtuner/extras/constants.py
+++ b/src/llmtuner/extras/constants.py
@@ -28,6 +28,8 @@ LOG_FILE_NAME = "trainer_log.jsonl"
 
 METHODS = ["full", "freeze", "lora"]
 
+MOD_SUPPORTED_MODELS = ["bloom", "falcon", "gemma", "llama", "mistral", "mixtral", "phi", "starcoder2"]
+
 PEFT_METHODS = ["lora"]
 
 SUBJECTS = ["Average", "STEM", "Social Sciences", "Humanities", "Other"]
diff --git a/src/llmtuner/extras/misc.py b/src/llmtuner/extras/misc.py
index ecb6797c..8ce25d18 100644
--- a/src/llmtuner/extras/misc.py
+++ b/src/llmtuner/extras/misc.py
@@ -83,6 +83,8 @@ def count_parameters(model: torch.nn.Module) -> Tuple[int, int]:
         if param.__class__.__name__ == "Params4bit":
             if hasattr(param, "quant_storage") and hasattr(param.quant_storage, "itemsize"):
                 num_bytes = param.quant_storage.itemsize
+            elif hasattr(param, "element_size"):  # for older pytorch version
+                num_bytes = param.element_size()
             else:
                 num_bytes = 1
diff --git a/src/llmtuner/hparams/model_args.py b/src/llmtuner/hparams/model_args.py
index bc80d304..0e42033f 100644
--- a/src/llmtuner/hparams/model_args.py
+++ b/src/llmtuner/hparams/model_args.py
@@ -63,15 +63,15 @@ class ModelArguments:
     )
     flash_attn: bool = field(
         default=False,
-        metadata={"help": "Enable FlashAttention-2 for faster training."},
+        metadata={"help": "Enable FlashAttention for faster training."},
     )
     shift_attn: bool = field(
         default=False,
         metadata={"help": "Enable shift short attention (S^2-Attn) proposed by LongLoRA."},
     )
-    mixture_of_depths: Optional[Literal["convert", "continue"]] = field(
+    mixture_of_depths: Optional[Literal["convert", "load"]] = field(
         default=None,
-        metadata={"help": "Whether or not to use MoD in the model."},
+        metadata={"help": "Convert the model to mixture-of-depths (MoD) or load the MoD model."},
     )
     use_unsloth: bool = field(
         default=False,
diff --git a/src/llmtuner/hparams/parser.py b/src/llmtuner/hparams/parser.py
index 246d97cf..b22db652 100644
--- a/src/llmtuner/hparams/parser.py
+++ b/src/llmtuner/hparams/parser.py
@@ -82,8 +82,8 @@ def _check_extra_dependencies(
     if model_args.use_unsloth:
         require_version("unsloth", "Please install unsloth: https://github.com/unslothai/unsloth")
 
-    if model_args.mixture_of_depths == 'convert' or model_args.mixture_of_depths == 'continue':
-        require_version("mixture-of-depth", "To fix: pip install mixture-of-depth")
+    if model_args.mixture_of_depths is not None:
+        require_version("mixture-of-depth>=1.1.6", "To fix: pip install mixture-of-depth>=1.1.6")
 
     if model_args.infer_backend == "vllm":
         require_version("vllm>=0.3.3", "To fix: pip install vllm>=0.3.3")
diff --git a/src/llmtuner/model/adapter.py b/src/llmtuner/model/adapter.py
index 2aafd663..f73666d5 100644
--- a/src/llmtuner/model/adapter.py
+++ b/src/llmtuner/model/adapter.py
@@ -69,7 +69,7 @@ def init_adapter(
         for name, _ in model.named_modules():
             if ".0." in name:
                 freeze_modules.add(name.split(".0.")[-1].split(".")[0])
-            elif ".1." in name:  # here since MoD starts from layer 1
+            elif ".1." in name:  # MoD starts from layer 1
                 freeze_modules.add(name.split(".1.")[-1].split(".")[0])
 
         trainable_layers = []
diff --git a/src/llmtuner/model/loader.py b/src/llmtuner/model/loader.py
index e4624d65..4935dd52 100644
--- a/src/llmtuner/model/loader.py
+++ b/src/llmtuner/model/loader.py
@@ -3,6 +3,7 @@ from typing import TYPE_CHECKING, Any, Dict
 from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
 from trl import AutoModelForCausalLMWithValueHead
 
+from ..extras.constants import MOD_SUPPORTED_MODELS
 from ..extras.logging import get_logger
 from ..extras.misc import count_parameters, get_current_device, try_download_model_from_ms
 from .adapter import init_adapter
@@ -44,7 +45,7 @@ def load_tokenizer(model_args: "ModelArguments") -> "PreTrainedTokenizer":
             padding_side="right",
             **init_kwargs,
         )
-    except Exception:  # try the fast one
+    except ValueError:  # try the fast one
         tokenizer = AutoTokenizer.from_pretrained(
             model_args.model_name_or_path,
             use_fast=True,
@@ -71,12 +72,6 @@ def load_model(
     patch_config(config, tokenizer, model_args, init_kwargs, is_trainable)
 
     model = None
-    if model_args.mixture_of_depths == 'continue':
-        from MoD import AutoMoDModelForCausalLM
-        model = AutoMoDModelForCausalLM.from_pretrained(model_args.model_name_or_path, config=config)
-        if model.config.model_type == 'qwen2':
-            RuntimeError("Qwen models are not supported for MoD training.")
-
     if is_trainable and model_args.use_unsloth:
         from unsloth import FastLanguageModel  # type: ignore
 
@@ -104,14 +99,22 @@
     if model is None:
         init_kwargs["config"] = config
         init_kwargs["pretrained_model_name_or_path"] = model_args.model_name_or_path
-        model: "PreTrainedModel" = AutoModelForCausalLM.from_pretrained(**init_kwargs)
-
-        if model_args.mixture_of_depths == 'convert':
-            from MoD import convert_hf_model
-            if model.config.model_type == 'qwen2':
-                RuntimeError("Qwen models are not supported for MoD training.")
-            model = convert_hf_model(model)
-
+
+        if model_args.mixture_of_depths == "load":
+            from MoD import AutoMoDModelForCausalLM
+
+            model = AutoMoDModelForCausalLM.from_pretrained(**init_kwargs)
+        else:
+            model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
+
+        if model_args.mixture_of_depths == "convert":
+            from MoD import apply_mod_to_hf
+
+            if getattr(config, "model_type", None) not in MOD_SUPPORTED_MODELS:
+                raise ValueError("Current model is not supported by mixture-of-depth.")
+
+            model = apply_mod_to_hf(model)
+            model = model.to(model_args.compute_dtype)
 
     patch_model(model, tokenizer, model_args, is_trainable)
     register_autoclass(config, model, tokenizer)
@@ -119,7 +122,7 @@
     model = init_adapter(model, model_args, finetuning_args, is_trainable)
 
     if add_valuehead:
-        model: "AutoModelForCausalLMWithValueHead" = AutoModelForCausalLMWithValueHead.from_pretrained(model)
+        model = AutoModelForCausalLMWithValueHead.from_pretrained(model)
         patch_valuehead_model(model)
 
         if model_args.adapter_name_or_path is not None:
diff --git a/src/llmtuner/model/patcher.py b/src/llmtuner/model/patcher.py
index fb2835e8..a1b19fb1 100644
--- a/src/llmtuner/model/patcher.py
+++ b/src/llmtuner/model/patcher.py
@@ -61,9 +61,7 @@ def _get_quantization_dataset(tokenizer: "PreTrainedTokenizer", model_args: "Mod
     return samples
 
 
-def _configure_attn_implementation(
-    config: "PretrainedConfig", model_args: "ModelArguments", init_kwargs: Dict[str, Any]
-) -> None:
+def _configure_attn_implementation(config: "PretrainedConfig", model_args: "ModelArguments") -> None:
     if model_args.flash_attn:
         if not is_flash_attn2_available():
             logger.warning("FlashAttention2 is not installed.")
@@ -73,9 +71,9 @@ def _configure_attn_implementation(
         if getattr(config, "model_type", None) == "internlm2":  # special case for custom models
             setattr(config, "attn_implementation", "flash_attention_2")
         else:
-            init_kwargs["attn_implementation"] = "flash_attention_2"
+            setattr(config, "_attn_implementation", "flash_attention_2")
     else:
-        init_kwargs["attn_implementation"] = "eager"
+        setattr(config, "_attn_implementation", "eager")
 
 
 def _configure_rope(config: "PretrainedConfig", model_args: "ModelArguments", is_trainable: bool) -> None:
@@ -295,7 +293,7 @@ def patch_config(
     if model_args.compute_dtype is None:  # priority: bf16 > fp16 > fp32
         model_args.compute_dtype = infer_optim_dtype(model_dtype=getattr(config, "torch_dtype", None))
 
-    _configure_attn_implementation(config, model_args, init_kwargs)
+    _configure_attn_implementation(config, model_args)
     _configure_rope(config, model_args, is_trainable)
     _configure_longlora(config, model_args, is_trainable)
     _configure_quantization(config, tokenizer, model_args, init_kwargs)
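The `patcher.py` hunk above moves the attention-backend choice from a `from_pretrained` keyword onto the config object. A standalone sketch of the resulting behavior, assuming a transformers version that honors `_attn_implementation` at model construction; the model id is illustrative:

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
setattr(config, "_attn_implementation", "flash_attention_2")  # or "eager"

# The config now carries the backend choice, so no extra kwarg is needed.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)
```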
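Elsewhere in the patch, the `vanilla` template is replaced by `fewshot`, whose separator is a blank line (`"\n\n"`) rather than a single newline, matching how few-shot MMLU prompts are conventionally laid out; that is why `examples/inference/evaluate.sh` now passes `--template fewshot`. A hypothetical smoke test follows; the registration values come from the diff, while passing the tokenizer as the first positional argument to `get_template_and_fix_tokenizer` is an assumption based on the truncated signature shown in the hunk:

```python
from transformers import AutoTokenizer

from llmtuner.data.template import get_template_and_fix_tokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Resolve the newly registered template; with name=None the loader now
# falls back to the "empty" template instead of the removed "vanilla".
template = get_template_and_fix_tokenizer(tokenizer, name="fewshot")

# As registered in the diff, the separator slot is ["\n\n"] and
# efficient_eos=True, so exemplars are joined by a blank line.
print(template.format_separator.apply())  # expected: ["\n\n"]
```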
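Finally, the `count_parameters` fix in `extras/misc.py` adds a fallback for bitsandbytes `Params4bit` weights when `quant_storage.itemsize` is unavailable. A self-contained restatement of that branch logic; the helper name is ours, not the repo's:

```python
import torch

def bytes_per_element(param: "torch.nn.Parameter") -> int:
    """Storage width (in bytes) used to scale 4-bit parameter counts."""
    if param.__class__.__name__ == "Params4bit":  # bitsandbytes 4-bit weight
        if hasattr(param, "quant_storage") and hasattr(param.quant_storage, "itemsize"):
            return param.quant_storage.itemsize  # width of the packing dtype
        if hasattr(param, "element_size"):  # older PyTorch: ask the tensor itself
            return param.element_size()
        return 1  # conservative default
    return param.element_size()
```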