diff --git a/README.md b/README.md index 4e87e369..d74a3fb5 100644 --- a/README.md +++ b/README.md @@ -68,6 +68,8 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/ ## Changelog +[24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLMs. See `examples/lora_single_gpu/sft_mllm.sh` for usage. + [24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details. [24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** according to [AstraMindAI's implementation](https://github.com/astramind-ai/Mixture-of-depths). See `examples/extras/mod` for usage. @@ -148,6 +150,7 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/ | [LLaMA](https://github.com/facebookresearch/llama) | 7B/13B/33B/65B | q_proj,v_proj | - | | [LLaMA-2](https://huggingface.co/meta-llama) | 7B/13B/70B | q_proj,v_proj | llama2 | | [LLaMA-3](https://huggingface.co/meta-llama) | 8B/70B | q_proj,v_proj | llama3 | +| [LLaVA-1.5](https://huggingface.co/llava-hf) | 7B/13B | q_proj,v_proj | vicuna | | [Mistral/Mixtral](https://huggingface.co/mistralai) | 7B/8x7B/8x22B | q_proj,v_proj | mistral | | [OLMo](https://huggingface.co/allenai) | 1B/7B | q_proj,v_proj | - | | [Phi-1.5/2](https://huggingface.co/microsoft) | 1.3B/2.7B | q_proj,v_proj | - | @@ -457,7 +460,7 @@ If you have a project that should be incorporated, please contact via email or c This repository is licensed under the [Apache-2.0 License](LICENSE). 
-Please follow the model licenses to use the corresponding model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan) +Please follow the model licenses to use the corresponding model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2/LLaVA-1.5](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan) ## Citation diff --git a/README_zh.md b/README_zh.md index 599af301..ed19ff94 100644 --- a/README_zh.md +++ b/README_zh.md @@ -68,6 +68,8 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd ## 更新日志 +[24/04/26] 我们支持了多模态模型 **LLaVA-1.5** 的微调。详细用法请参照 `examples/lora_single_gpu/sft_mllm.sh`。 + [24/04/22] 我们提供了在免费 T4 GPU 上微调 Llama-3 模型的 **[Colab 笔记本](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)**。Hugging Face 社区公开了两个利用 LLaMA Factory 微调的 Llama-3 模型,详情请见 [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) 和 
[Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese)。 [24/04/21] 我们基于 [AstraMindAI 的仓库](https://github.com/astramind-ai/Mixture-of-depths)支持了 **[混合深度训练](https://arxiv.org/abs/2404.02258)**。详细用法请参照 `examples/extras/mod`。 @@ -148,6 +150,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd | [LLaMA](https://github.com/facebookresearch/llama) | 7B/13B/33B/65B | q_proj,v_proj | - | | [LLaMA-2](https://huggingface.co/meta-llama) | 7B/13B/70B | q_proj,v_proj | llama2 | | [LLaMA-3](https://huggingface.co/meta-llama) | 8B/70B | q_proj,v_proj | llama3 | +| [LLaVA-1.5](https://huggingface.co/llava-hf) | 7B/13B | q_proj,v_proj | vicuna | | [Mistral/Mixtral](https://huggingface.co/mistralai) | 7B/8x7B/8x22B | q_proj,v_proj | mistral | | [OLMo](https://huggingface.co/allenai) | 1B/7B | q_proj,v_proj | - | | [Phi-1.5/2](https://huggingface.co/microsoft) | 1.3B/2.7B | q_proj,v_proj | - | @@ -457,7 +460,7 @@ export USE_MODELSCOPE_HUB=1 # Windows 使用 `set USE_MODELSCOPE_HUB=1` 本仓库的代码依照 [Apache-2.0](LICENSE) 协议开源。 -使用模型权重时,请遵循对应的模型协议:[Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan) +使用模型权重时,请遵循对应的模型协议:[Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2/LLaVA-1.5](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / 
[StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan) ## 引用 diff --git a/data/README.md b/data/README.md index 2ea0c117..6de0430f 100644 --- a/data/README.md +++ b/data/README.md @@ -18,7 +18,8 @@ If you are using a custom dataset, please provide your dataset definition in the "history": "the column name in the dataset containing the histories. (default: None)", "messages": "the column name in the dataset containing the messages. (default: conversations)", "system": "the column name in the dataset containing the system prompts. (default: None)", - "tools": "the column name in the dataset containing the tool description. (default: None)" + "tools": "the column name in the dataset containing the tool description. (default: None)", + "images": "the column name in the dataset containing the image inputs. (default: None)" }, "tags (optional, used for the sharegpt format)": { "role_tag": "the key in the message represents the identity. (default: from)", diff --git a/data/README_zh.md b/data/README_zh.md index b00f81d9..fb6cb1d9 100644 --- a/data/README_zh.md +++ b/data/README_zh.md @@ -18,7 +18,8 @@ "history": "数据集代表历史对话的表头名称(默认:None)", "messages": "数据集代表消息列表的表头名称(默认:conversations)", "system": "数据集代表系统提示的表头名称(默认:None)", - "tools": "数据集代表工具描述的表头名称(默认:None)" + "tools": "数据集代表工具描述的表头名称(默认:None)", + "images": "数据集代表图像输入的表头名称(默认:None)" }, "tags(可选,用于 sharegpt 格式)": { "role_tag": "消息中代表发送者身份的键名(默认:from)", diff --git a/data/dataset_info.json b/data/dataset_info.json index e396ed50..d053be1d 100644 --- a/data/dataset_info.json +++ b/data/dataset_info.json @@ -58,6 +58,21 @@ "tools": "tools" } }, + "mllm_demo": { + "file_name": "mllm_demo.json", + "file_sha1": "b6709b23657d5c42a701f1c5574f3a6edaa40a20", + "formatting": "sharegpt", + "columns": { + "messages": "messages", + "images": "images" + }, + "tags": { + "role_tag": "role", + "content_tag": "content", + "user_tag": "user", + "assistant_tag": "assistant" + } + }, "example": { "script_url": "example_dataset", "columns": { @@ -185,6 +200,7 @@ "ultrachat_200k": { "hf_hub_url": "HuggingFaceH4/ultrachat_200k", "ms_hub_url": "AI-ModelScope/ultrachat_200k", + "formatting": "sharegpt", "columns": { "messages": "messages" }, @@ -193,8 +209,7 @@ "content_tag": "content", "user_tag": "user", "assistant_tag": "assistant" - }, - "formatting": "sharegpt" + } }, "agent_instruct": { "hf_hub_url": "THUDM/AgentInstruct", @@ -204,6 +219,7 @@ "lmsys_chat": { "hf_hub_url": "lmsys/lmsys-chat-1m", "ms_hub_url": "AI-ModelScope/lmsys-chat-1m", + "formatting": "sharegpt", "columns": { "messages": "conversation" }, @@ -212,8 +228,7 @@ "content_tag": "content", "user_tag": "human", "assistant_tag": "assistant" - }, - "formatting": "sharegpt" + } }, "evol_instruct": { "hf_hub_url": "WizardLM/WizardLM_evol_instruct_V2_196k", @@ -340,7 +355,7 @@ "history": "history" } }, - "orca_dpo_de" : { + "orca_dpo_de": { "hf_hub_url": "mayflowergmbh/intel_orca_dpo_pairs_de", "ranking": true }, diff --git a/data/images/1.jpg b/data/images/1.jpg new file mode 100644 index 00000000..a29762ed Binary files /dev/null and b/data/images/1.jpg differ diff --git a/data/images/2.jpg b/data/images/2.jpg new file mode 100644 index 00000000..1df98231 Binary files /dev/null and b/data/images/2.jpg differ diff --git a/data/images/3.jpg b/data/images/3.jpg new file mode 
100644 index 00000000..72bc7315 Binary files /dev/null and b/data/images/3.jpg differ diff --git a/data/mllm_demo.json b/data/mllm_demo.json new file mode 100644 index 00000000..32d6d221 --- /dev/null +++ b/data/mllm_demo.json @@ -0,0 +1,71 @@ +[ + { + "messages": [ + { + "content": "Who are they?", + "role": "user" + }, + { + "content": "They're Kane and Gretzka from Bayern Munich.", + "role": "assistant" + }, + { + "content": "What are they doing?", + "role": "user" + }, + { + "content": "They are celebrating on the soccer field", + "role": "assistant" + } + ], + "images": [ + "images/1.jpg" + ] + }, + { + "messages": [ + { + "content": "Who is he?", + "role": "user" + }, + { + "content": "He's Thomas Muller from Bayern Munich.", + "role": "assistant" + }, + { + "content": "Why is he on the ground?", + "role": "user" + }, + { + "content": "Because he's sliding on his knees to celebrate.", + "role": "assistant" + } + ], + "images": [ + "images/2.jpg" + ] + }, + { + "messages": [ + { + "content": "Please describe this image", + "role": "user" + }, + { + "content": "Chinese astronaut Gui Haichao is giving a speech.", + "role": "assistant" + }, + { + "content": "What has he accomplished?", + "role": "user" + }, + { + "content": "He was appointed to be a payload specialist on Shenzhou 16 mission in June 2022, thus becoming the first Chinese civilian of Group 3 in space on 30 May 2023. He is responsible for the on-orbit operation of space science experimental payloads.", + "role": "assistant" + } + ], + "images": [ + "images/3.jpg" + ] + } +] \ No newline at end of file diff --git a/examples/README.md b/examples/README.md index cc01cf9f..895e9c72 100644 --- a/examples/README.md +++ b/examples/README.md @@ -9,6 +9,7 @@ examples/ │ ├── ppo.sh: Do PPO training using LoRA │ ├── dpo.sh: Do DPO training using LoRA │ ├── orpo.sh: Do ORPO training using LoRA +│ ├── sft_mllm.sh: Do supervised fine-tuning on multimodal data using LoRA │ ├── prepare.sh: Save tokenized dataset │ └── predict.sh: Do batch predict and compute BLEU and ROUGE scores after LoRA tuning ├── qlora_single_gpu/ diff --git a/examples/README_zh.md b/examples/README_zh.md index fecbdb2f..091a877f 100644 --- a/examples/README_zh.md +++ b/examples/README_zh.md @@ -9,6 +9,7 @@ examples/ │ ├── ppo.sh: 基于 LoRA 进行 PPO 训练 │ ├── dpo.sh: 基于 LoRA 进行 DPO 训练 │ ├── orpo.sh: 基于 LoRA 进行 ORPO 训练 +│ ├── sft_mllm.sh: 基于 LoRA 进行多模态指令监督微调 │ ├── prepare.sh: 保存预处理后的数据集 │ └── predict.sh: 基于 LoRA 进行批量预测并计算 BLEU 和 ROUGE 分数 ├── qlora_single_gpu/ diff --git a/examples/lora_single_gpu/sft_mllm.sh b/examples/lora_single_gpu/sft_mllm.sh new file mode 100644 index 00000000..7e900918 --- /dev/null +++ b/examples/lora_single_gpu/sft_mllm.sh @@ -0,0 +1,33 @@ +#!/bin/bash + +CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \ + --stage sft \ + --do_train \ + --model_name_or_path llava-hf/llava-1.5-7b-hf \ + --visual_inputs \ + --dataset mllm_demo \ + --dataset_dir ../../data \ + --template vicuna \ + --finetuning_type lora \ + --lora_target q_proj,v_proj \ + --output_dir ../../saves/LLaMA2-7B/lora/sft_mllm \ + --overwrite_cache \ + --overwrite_output_dir \ + --cutoff_len 1024 \ + --preprocessing_num_workers 16 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --gradient_accumulation_steps 8 \ + --lr_scheduler_type cosine \ + --logging_steps 10 \ + --warmup_steps 20 \ + --save_steps 100 \ + --eval_steps 100 \ + --evaluation_strategy steps \ + --load_best_model_at_end \ + --learning_rate 5e-5 \ + --num_train_epochs 100.0 \ + --max_samples 3000 
\ + --val_size 0.1 \ + --plot_loss \ + --fp16 diff --git a/scripts/cal_lr.py b/scripts/cal_lr.py index ffe47f28..c1c1f7a2 100644 --- a/scripts/cal_lr.py +++ b/scripts/cal_lr.py @@ -44,8 +44,9 @@ def calculate_lr( overwrite_cache=True, ) ) - tokenizer = load_tokenizer(model_args) - trainset = get_dataset(tokenizer, model_args, data_args, training_args, stage) + tokenizer_module = load_tokenizer(model_args) + tokenizer = tokenizer_module["tokenizer"] + trainset = get_dataset(model_args, data_args, training_args, stage, **tokenizer_module) if stage == "pt": data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) elif stage == "sft": diff --git a/scripts/length_cdf.py b/scripts/length_cdf.py index cf0698de..1446f77a 100644 --- a/scripts/length_cdf.py +++ b/scripts/length_cdf.py @@ -32,8 +32,8 @@ def length_cdf( overwrite_cache=True, ) ) - tokenizer = load_tokenizer(model_args) - trainset = get_dataset(tokenizer, model_args, data_args, training_args, stage="sft") + tokenizer_module = load_tokenizer(model_args) + trainset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module) total_num = len(trainset) length_dict = defaultdict(int) for sample in tqdm(trainset["input_ids"]): diff --git a/src/llmtuner/chat/base_engine.py b/src/llmtuner/chat/base_engine.py index e19db676..65b6c59c 100644 --- a/src/llmtuner/chat/base_engine.py +++ b/src/llmtuner/chat/base_engine.py @@ -4,6 +4,7 @@ from typing import TYPE_CHECKING, Any, AsyncGenerator, Dict, List, Literal, Opti if TYPE_CHECKING: + from numpy.typing import NDArray from transformers import PreTrainedModel, PreTrainedTokenizer from vllm import AsyncLLMEngine @@ -46,6 +47,7 @@ class BaseEngine(ABC): messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> List["Response"]: ... @@ -55,6 +57,7 @@ class BaseEngine(ABC): messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> AsyncGenerator[str, None]: ... 
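
The `BaseEngine` hunk above adds an optional `image` argument to `chat` and `stream_chat`; the following hunks thread it through `ChatModel` and the HuggingFace engine. A minimal usage sketch of the extended API follows; the checkpoint, template, and image path are illustrative assumptions rather than values taken from this patch.

```python
# Minimal sketch of calling the image-aware chat API added in this patch.
# The checkpoint, template, and image path below are assumptions for illustration.
import numpy as np
from PIL import Image

from llmtuner.chat import ChatModel

chat_model = ChatModel(
    {
        "model_name_or_path": "llava-hf/llava-1.5-7b-hf",  # assumed multimodal checkpoint
        "template": "vicuna",
        "visual_inputs": True,  # new ModelArguments flag introduced by this patch
    }
)

# The engines expect the image as a NumPy array (NDArray), matching the new signatures.
image = np.asarray(Image.open("data/images/1.jpg"))
messages = [{"role": "user", "content": "Who are they?"}]

for response in chat_model.chat(messages, image=image):
    print(response.response_text)
```

The Web UI hunks later in the patch pass the image the same way, as a NumPy array coming from the new `gr.Image(type="numpy")` component.
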
diff --git a/src/llmtuner/chat/chat_model.py b/src/llmtuner/chat/chat_model.py index c49d4d78..ba58dd2e 100644 --- a/src/llmtuner/chat/chat_model.py +++ b/src/llmtuner/chat/chat_model.py @@ -8,6 +8,8 @@ from .vllm_engine import VllmEngine if TYPE_CHECKING: + from numpy.typing import NDArray + from .base_engine import BaseEngine, Response @@ -36,9 +38,10 @@ class ChatModel: messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> List["Response"]: - task = asyncio.run_coroutine_threadsafe(self.achat(messages, system, tools, **input_kwargs), self._loop) + task = asyncio.run_coroutine_threadsafe(self.achat(messages, system, tools, image, **input_kwargs), self._loop) return task.result() async def achat( @@ -46,18 +49,20 @@ class ChatModel: messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> List["Response"]: - return await self.engine.chat(messages, system, tools, **input_kwargs) + return await self.engine.chat(messages, system, tools, image, **input_kwargs) def stream_chat( self, messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> Generator[str, None, None]: - generator = self.astream_chat(messages, system, tools, **input_kwargs) + generator = self.astream_chat(messages, system, tools, image, **input_kwargs) while True: try: task = asyncio.run_coroutine_threadsafe(generator.__anext__(), self._loop) @@ -70,9 +75,10 @@ class ChatModel: messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> AsyncGenerator[str, None]: - async for new_token in self.engine.stream_chat(messages, system, tools, **input_kwargs): + async for new_token in self.engine.stream_chat(messages, system, tools, image, **input_kwargs): yield new_token def get_scores( diff --git a/src/llmtuner/chat/hf_engine.py b/src/llmtuner/chat/hf_engine.py index ddb48e47..f6f51898 100644 --- a/src/llmtuner/chat/hf_engine.py +++ b/src/llmtuner/chat/hf_engine.py @@ -14,7 +14,9 @@ from .base_engine import BaseEngine, Response if TYPE_CHECKING: - from transformers import PreTrainedModel, PreTrainedTokenizer + from numpy.typing import NDArray + from transformers import PreTrainedModel, PreTrainedTokenizer, ProcessorMixin + from transformers.image_processing_utils import BaseImageProcessor from trl import PreTrainedModelWrapper from ..data import Template @@ -30,7 +32,9 @@ class HuggingfaceEngine(BaseEngine): generating_args: "GeneratingArguments", ) -> None: self.can_generate = finetuning_args.stage == "sft" - self.tokenizer = load_tokenizer(model_args) + tokenizer_module = load_tokenizer(model_args) + self.tokenizer = tokenizer_module["tokenizer"] + self.processor = tokenizer_module["processor"] self.tokenizer.padding_side = "left" if self.can_generate else "right" self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args.template) self.model = load_model( @@ -42,13 +46,18 @@ class HuggingfaceEngine(BaseEngine): def _process_args( model: "PreTrainedModel", tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"], template: "Template", generating_args: Dict[str, Any], messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, input_kwargs: 
Optional[Dict[str, Any]] = {}, ) -> Tuple[Dict[str, Any], int]: + if processor is not None and image is not None and "<image>" not in messages[0]["content"]: + messages[0]["content"] = messages[0]["content"] + "<image>" + paired_messages = messages + [{"role": "assistant", "content": ""}] prompt_ids, _ = template.encode_oneturn( tokenizer=tokenizer, messages=paired_messages, system=system, tools=tools @@ -95,6 +104,11 @@ class HuggingfaceEngine(BaseEngine): logits_processor=get_logits_processor(), ) + if processor is not None and image is not None: + image_processor: "BaseImageProcessor" = getattr(processor, "image_processor") + pixel_values: "torch.Tensor" = image_processor(image, return_tensors="pt")["pixel_values"] + gen_kwargs["pixel_values"] = pixel_values.to(model.device) + return gen_kwargs, prompt_length @staticmethod @@ -102,15 +116,17 @@ def _chat( model: "PreTrainedModel", tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"], template: "Template", generating_args: Dict[str, Any], messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, input_kwargs: Optional[Dict[str, Any]] = {}, ) -> List["Response"]: gen_kwargs, prompt_length = HuggingfaceEngine._process_args( - model, tokenizer, template, generating_args, messages, system, tools, input_kwargs + model, tokenizer, processor, template, generating_args, messages, system, tools, image, input_kwargs ) generate_output = model.generate(**gen_kwargs) response_ids = generate_output[:, prompt_length:] @@ -135,15 +151,17 @@ def _stream_chat( model: "PreTrainedModel", tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"], template: "Template", generating_args: Dict[str, Any], messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, input_kwargs: Optional[Dict[str, Any]] = {}, ) -> Callable[[], str]: gen_kwargs, _ = HuggingfaceEngine._process_args( - model, tokenizer, template, generating_args, messages, system, tools, input_kwargs + model, tokenizer, processor, template, generating_args, messages, system, tools, image, input_kwargs ) streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True) gen_kwargs["streamer"] = streamer @@ -199,6 +217,7 @@ class HuggingfaceEngine(BaseEngine): messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> List["Response"]: if not self.can_generate: @@ -208,11 +227,13 @@ input_args = ( self.model, self.tokenizer, + self.processor, self.template, self.generating_args, messages, system, tools, + image, input_kwargs, ) async with self._semaphore: @@ -224,6 +245,7 @@ messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> AsyncGenerator[str, None]: if not self.can_generate: @@ -233,11 +255,13 @@ input_args = ( self.model, self.tokenizer, + self.processor, self.template, self.generating_args, messages, system, tools, + image, input_kwargs, ) async with self._semaphore: diff --git a/src/llmtuner/chat/vllm_engine.py b/src/llmtuner/chat/vllm_engine.py index 786e743d..a4caa53b 100644 --- a/src/llmtuner/chat/vllm_engine.py +++
b/src/llmtuner/chat/vllm_engine.py @@ -12,7 +12,10 @@ if is_vllm_available(): from vllm import AsyncEngineArgs, AsyncLLMEngine, RequestOutput, SamplingParams from vllm.lora.request import LoRARequest + if TYPE_CHECKING: + from numpy.typing import NDArray + from ..hparams import DataArguments, FinetuningArguments, GeneratingArguments, ModelArguments @@ -29,7 +32,9 @@ class VllmEngine(BaseEngine): infer_dtype = str(infer_dtype).split(".")[-1] self.can_generate = finetuning_args.stage == "sft" - self.tokenizer = load_tokenizer(model_args) + tokenizer_module = load_tokenizer(model_args) + self.tokenizer = tokenizer_module["tokenizer"] + self.processor = tokenizer_module["processor"] self.tokenizer.padding_side = "left" self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args.template) self.generating_args = generating_args.to_dict() @@ -58,6 +63,7 @@ class VllmEngine(BaseEngine): messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> AsyncIterator["RequestOutput"]: request_id = "chatcmpl-{}".format(uuid.uuid4().hex) @@ -121,10 +127,11 @@ class VllmEngine(BaseEngine): messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> List["Response"]: final_output = None - generator = await self._generate(messages, system, tools, **input_kwargs) + generator = await self._generate(messages, system, tools, image, **input_kwargs) async for request_output in generator: final_output = request_output @@ -146,10 +153,11 @@ class VllmEngine(BaseEngine): messages: Sequence[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, + image: Optional["NDArray"] = None, **input_kwargs, ) -> AsyncGenerator[str, None]: generated_text = "" - generator = await self._generate(messages, system, tools, **input_kwargs) + generator = await self._generate(messages, system, tools, image, **input_kwargs) async for result in generator: delta_text = result.outputs[0].text[len(generated_text) :] generated_text = result.outputs[0].text diff --git a/src/llmtuner/data/aligner.py b/src/llmtuner/data/aligner.py index 4de37e6d..dc1de865 100644 --- a/src/llmtuner/data/aligner.py +++ b/src/llmtuner/data/aligner.py @@ -1,3 +1,4 @@ +import os from functools import partial from typing import TYPE_CHECKING, Any, Dict, List, Union @@ -13,8 +14,10 @@ if TYPE_CHECKING: from .parser import DatasetAttr -def convert_alpaca(examples: Dict[str, List[Any]], dataset_attr: "DatasetAttr") -> Dict[str, List[Any]]: - outputs = {"prompt": [], "response": [], "system": [], "tools": []} +def convert_alpaca( + examples: Dict[str, List[Any]], dataset_attr: "DatasetAttr", data_args: "DataArguments" +) -> Dict[str, List[Any]]: + outputs = {"prompt": [], "response": [], "system": [], "tools": [], "images": []} for i in range(len(examples[dataset_attr.prompt])): prompt = [] if dataset_attr.history and isinstance(examples[dataset_attr.history][i], list): @@ -44,12 +47,19 @@ def convert_alpaca(examples: Dict[str, List[Any]], dataset_attr: "DatasetAttr") outputs["response"].append(response) outputs["system"].append(examples[dataset_attr.system][i] if dataset_attr.system else "") outputs["tools"].append("") + outputs["images"].append( + [os.path.join(data_args.dataset_dir, path) for path in examples[dataset_attr.images][i]] + if dataset_attr.images + else [] + ) return outputs -def convert_sharegpt(examples: Dict[str, List[Any]], dataset_attr: 
"DatasetAttr") -> Dict[str, List[Any]]: - outputs = {"prompt": [], "response": [], "system": [], "tools": []} +def convert_sharegpt( + examples: Dict[str, List[Any]], dataset_attr: "DatasetAttr", data_args: "DataArguments" +) -> Dict[str, List[Any]]: + outputs = {"prompt": [], "response": [], "system": [], "tools": [], "images": []} tag_mapping = { dataset_attr.user_tag: Role.USER.value, dataset_attr.assistant_tag: Role.ASSISTANT.value, @@ -84,6 +94,11 @@ def convert_sharegpt(examples: Dict[str, List[Any]], dataset_attr: "DatasetAttr" outputs["response"].append(aligned_messages[-1:]) outputs["system"].append(system) outputs["tools"].append(examples[dataset_attr.tools][i] if dataset_attr.tools else "") + outputs["images"].append( + [os.path.join(data_args.dataset_dir, path) for path in examples[dataset_attr.images][i]] + if dataset_attr.images + else [] + ) return outputs @@ -96,12 +111,13 @@ def align_dataset( prompt: [{"role": "user", "content": "..."}] * (2T - 1) response: [{"role": "assistant", "content": "..."}] * N (N > 1 for ranking dataset) system: "..." - tools: "..." + tools: "...", + images: [], """ if dataset_attr.formatting == "alpaca": - convert_func = partial(convert_alpaca, dataset_attr=dataset_attr) + convert_func = partial(convert_alpaca, dataset_attr=dataset_attr, data_args=data_args) else: - convert_func = partial(convert_sharegpt, dataset_attr=dataset_attr) + convert_func = partial(convert_sharegpt, dataset_attr=dataset_attr, data_args=data_args) column_names = list(next(iter(dataset)).keys()) features = Features.from_dict( @@ -114,6 +130,7 @@ def align_dataset( ], "system": {"dtype": "string", "_type": "Value"}, "tools": {"dtype": "string", "_type": "Value"}, + "images": [{"_type": "Image"}], } ) kwargs = {} diff --git a/src/llmtuner/data/loader.py b/src/llmtuner/data/loader.py index 5414150e..ca0d5407 100644 --- a/src/llmtuner/data/loader.py +++ b/src/llmtuner/data/loader.py @@ -1,6 +1,6 @@ import inspect import os -from typing import TYPE_CHECKING, Literal, Union +from typing import TYPE_CHECKING, Literal, Optional, Union from datasets import load_dataset, load_from_disk @@ -16,7 +16,7 @@ from .utils import checksum, merge_dataset if TYPE_CHECKING: from datasets import Dataset, IterableDataset - from transformers import Seq2SeqTrainingArguments + from transformers import ProcessorMixin, Seq2SeqTrainingArguments from transformers.tokenization_utils import PreTrainedTokenizer from ..hparams import DataArguments, ModelArguments @@ -115,11 +115,12 @@ def load_single_dataset( def get_dataset( - tokenizer: "PreTrainedTokenizer", model_args: "ModelArguments", data_args: "DataArguments", training_args: "Seq2SeqTrainingArguments", stage: Literal["pt", "sft", "rm", "ppo"], + tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"] = None, ) -> Union["Dataset", "IterableDataset"]: template = get_template_and_fix_tokenizer(tokenizer, data_args.template) if data_args.train_on_prompt and template.efficient_eos: @@ -149,7 +150,7 @@ def get_dataset( with training_args.main_process_first(desc="pre-process dataset"): preprocess_func, print_function = get_preprocess_and_print_func( - tokenizer, template, data_args, training_args, stage + data_args, training_args, stage, template, tokenizer, processor ) column_names = list(next(iter(dataset)).keys()) kwargs = {} diff --git a/src/llmtuner/data/parser.py b/src/llmtuner/data/parser.py index b9c8782a..01a417a9 100644 --- a/src/llmtuner/data/parser.py +++ b/src/llmtuner/data/parser.py @@ -28,6 +28,7 @@ class DatasetAttr: 
formatting: Literal["alpaca", "sharegpt"] = "alpaca" """ columns """ system: Optional[str] = None + images: Optional[str] = None """ columns for the alpaca format """ prompt: Optional[str] = "instruction" query: Optional[str] = "input" @@ -105,7 +106,7 @@ def get_dataset_list(data_args: "DataArguments") -> List["DatasetAttr"]: dataset_attr.set_attr("formatting", dataset_info[name], default="alpaca") if "columns" in dataset_info[name]: - column_names = ["system"] + column_names = ["system", "images"] if dataset_attr.formatting == "alpaca": column_names.extend(["prompt", "query", "response", "history"]) else: diff --git a/src/llmtuner/data/preprocess.py b/src/llmtuner/data/preprocess.py index b8edfa10..18681872 100644 --- a/src/llmtuner/data/preprocess.py +++ b/src/llmtuner/data/preprocess.py @@ -1,6 +1,6 @@ from functools import partial from itertools import chain -from typing import TYPE_CHECKING, Any, Callable, Dict, List, Literal, Tuple +from typing import TYPE_CHECKING, Any, Callable, Dict, List, Literal, Optional, Tuple from ..extras.constants import IGNORE_INDEX from ..extras.logging import get_logger @@ -8,7 +8,9 @@ from .utils import Role if TYPE_CHECKING: - from transformers import Seq2SeqTrainingArguments + from PIL.Image import Image + from transformers import ProcessorMixin, Seq2SeqTrainingArguments + from transformers.image_processing_utils import BaseImageProcessor from transformers.tokenization_utils import PreTrainedTokenizer from ..hparams import DataArguments @@ -18,6 +20,14 @@ if TYPE_CHECKING: logger = get_logger(__name__) +def _preprocess_visual_inputs(model_inputs: Dict[str, Any], processor: "ProcessorMixin", image: "Image") -> None: + image_processor: "BaseImageProcessor" = getattr(processor, "image_processor") + pixel_values = image_processor(image, return_tensors="pt")["pixel_values"][0] + if "pixel_values" not in model_inputs: + model_inputs["pixel_values"] = [] + model_inputs["pixel_values"].append(pixel_values) + + def preprocess_pretrain_dataset( examples: Dict[str, List[Any]], tokenizer: "PreTrainedTokenizer", data_args: "DataArguments" ) -> Dict[str, List[List[int]]]: @@ -48,8 +58,9 @@ def preprocess_supervised_dataset( examples: Dict[str, List[Any]], - tokenizer: "PreTrainedTokenizer", template: "Template", + tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"], data_args: "DataArguments", ) -> Dict[str, List[List[int]]]: # build inputs with format `<bos> X Y <eos>` and labels with format `<ignore> ... <ignore> Y <eos>`
@@ -89,14 +100,16 @@ model_inputs["input_ids"].append(input_ids) model_inputs["attention_mask"].append([1] * len(input_ids)) model_inputs["labels"].append(labels) + if processor is not None and "images" in examples: + _preprocess_visual_inputs(model_inputs, processor, examples["images"][i][0]) return model_inputs def preprocess_packed_supervised_dataset( examples: Dict[str, List[Any]], - tokenizer: "PreTrainedTokenizer", template: "Template", + tokenizer: "PreTrainedTokenizer", data_args: "DataArguments", ) -> Dict[str, List[List[int]]]: # build inputs with format `<bos> X1 Y1 <eos> <bos> X2 Y2 <eos>` @@ -141,8 +154,9 @@ def preprocess_unsupervised_dataset( examples: Dict[str, List[Any]], - tokenizer: "PreTrainedTokenizer", template: "Template", + tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"], data_args: "DataArguments", ) -> Dict[str, List[List[int]]]: # build inputs with format `<bos> X` and labels with format `Y <eos>` @@ -172,14 +186,17 @@ model_inputs["input_ids"].append(input_ids) model_inputs["attention_mask"].append([1] * len(input_ids)) model_inputs["labels"].append(labels) + if processor is not None and "images" in examples: + _preprocess_visual_inputs(model_inputs, processor, examples["images"][i][0]) return model_inputs def preprocess_pairwise_dataset( examples: Dict[str, List[Any]], - tokenizer: "PreTrainedTokenizer", template: "Template", + tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"], data_args: "DataArguments", ) -> Dict[str, List[List[int]]]: # build input pairs with format `<bos> X`, `Y1 <eos>` and `Y2 <eos>` @@ -214,6 +231,8 @@ model_inputs["prompt_ids"].append(prompt_ids) model_inputs["chosen_ids"].append(chosen_ids) model_inputs["rejected_ids"].append(rejected_ids) + if processor is not None and "images" in examples: + _preprocess_visual_inputs(model_inputs, processor, examples["images"][i][0]) return model_inputs @@ -244,34 +263,54 @@ def print_unsupervised_dataset_example(example: Dict[str, List[int]], tokenizer: def get_preprocess_and_print_func( - tokenizer: "PreTrainedTokenizer", - template: "Template", data_args: "DataArguments", training_args: "Seq2SeqTrainingArguments", stage: Literal["pt", "sft", "rm", "ppo"], + template: "Template", + tokenizer: "PreTrainedTokenizer", + processor: Optional["ProcessorMixin"], ) -> Tuple[Callable, Callable]: if stage == "pt": - preprocess_func = partial(preprocess_pretrain_dataset, tokenizer=tokenizer, data_args=data_args) + preprocess_func = partial( + preprocess_pretrain_dataset, + tokenizer=tokenizer, + data_args=data_args, + ) print_function = partial(print_unsupervised_dataset_example, tokenizer=tokenizer) elif stage == "sft" and not training_args.predict_with_generate: if data_args.packing: preprocess_func = partial( - preprocess_packed_supervised_dataset, tokenizer=tokenizer, template=template, data_args=data_args + preprocess_packed_supervised_dataset, + template=template, + tokenizer=tokenizer, + data_args=data_args, ) else: preprocess_func = partial( - preprocess_supervised_dataset, tokenizer=tokenizer, template=template, data_args=data_args + preprocess_supervised_dataset, + template=template, + tokenizer=tokenizer, + processor=processor, + data_args=data_args, ) print_function = partial(print_supervised_dataset_example, tokenizer=tokenizer) elif stage == "rm": preprocess_func = partial( - preprocess_pairwise_dataset, tokenizer=tokenizer, template=template,
data_args=data_args + preprocess_pairwise_dataset, + template=template, + tokenizer=tokenizer, + processor=processor, + data_args=data_args, ) print_function = partial(print_pairwise_dataset_example, tokenizer=tokenizer) else: preprocess_func = partial( - preprocess_unsupervised_dataset, tokenizer=tokenizer, template=template, data_args=data_args + preprocess_unsupervised_dataset, + template=template, + tokenizer=tokenizer, + processor=processor, + data_args=data_args, ) print_function = partial(print_unsupervised_dataset_example, tokenizer=tokenizer) diff --git a/src/llmtuner/eval/evaluator.py b/src/llmtuner/eval/evaluator.py index 2c039928..7446c6f5 100644 --- a/src/llmtuner/eval/evaluator.py +++ b/src/llmtuner/eval/evaluator.py @@ -21,7 +21,7 @@ from .template import get_eval_template class Evaluator: def __init__(self, args: Optional[Dict[str, Any]] = None) -> None: self.model_args, self.data_args, self.eval_args, finetuning_args = get_eval_args(args) - self.tokenizer = load_tokenizer(self.model_args) + self.tokenizer = load_tokenizer(self.model_args)["tokenizer"] self.tokenizer.padding_side = "right" # avoid overflow issue in batched inference for llama2 self.template = get_template_and_fix_tokenizer(self.tokenizer, self.data_args.template) self.model = load_model(self.tokenizer, self.model_args, finetuning_args) diff --git a/src/llmtuner/hparams/model_args.py b/src/llmtuner/hparams/model_args.py index bb8a8193..be65cd27 100644 --- a/src/llmtuner/hparams/model_args.py +++ b/src/llmtuner/hparams/model_args.py @@ -81,6 +81,10 @@ class ModelArguments: default=False, metadata={"help": "Whether or not to use unsloth's optimization for the LoRA training."}, ) + visual_inputs: bool = field( + default=False, + metadata={"help": "Whether or not to use a multimodal LLM that accepts visual inputs."}, + ) moe_aux_loss_coef: Optional[float] = field( default=None, metadata={"help": "Coefficient of the auxiliary router loss in mixture-of-experts model."}, ) diff --git a/src/llmtuner/hparams/parser.py b/src/llmtuner/hparams/parser.py index c922dc47..715b8f95 100644 --- a/src/llmtuner/hparams/parser.py +++ b/src/llmtuner/hparams/parser.py @@ -196,6 +196,9 @@ def get_train_args(args: Optional[Dict[str, Any]] = None) -> _TRAIN_CLS: if model_args.infer_backend == "vllm": raise ValueError("vLLM backend is only available for API, CLI and Web.") + if model_args.visual_inputs and data_args.packing: + raise ValueError("Cannot use packing in MLLM fine-tuning.") + _verify_model_args(model_args, finetuning_args) _check_extra_dependencies(model_args, finetuning_args, training_args) @@ -317,6 +320,9 @@ def get_infer_args(args: Optional[Dict[str, Any]] = None) -> _INFER_CLS: if model_args.adapter_name_or_path is not None and len(model_args.adapter_name_or_path) != 1: raise ValueError("vLLM only accepts a single adapter. Merge them first.") + if model_args.visual_inputs: + raise ValueError("vLLM engine does not support MLLM yet.
Stay tuned.") + _verify_model_args(model_args, finetuning_args) _check_extra_dependencies(model_args, finetuning_args) diff --git a/src/llmtuner/model/loader.py b/src/llmtuner/model/loader.py index 54048cc5..0ff7a350 100644 --- a/src/llmtuner/model/loader.py +++ b/src/llmtuner/model/loader.py @@ -1,6 +1,6 @@ -from typing import TYPE_CHECKING, Any, Dict +from typing import TYPE_CHECKING, Any, Dict, Optional, TypedDict -from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer +from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForVision2Seq, AutoProcessor, AutoTokenizer from trl import AutoModelForCausalLMWithValueHead from ..extras.logging import get_logger @@ -13,7 +13,7 @@ from .utils.unsloth import load_unsloth_pretrained_model if TYPE_CHECKING: - from transformers import PretrainedConfig, PreTrainedModel, PreTrainedTokenizer + from transformers import PretrainedConfig, PreTrainedModel, PreTrainedTokenizer, ProcessorMixin from ..hparams import FinetuningArguments, ModelArguments @@ -21,6 +21,11 @@ if TYPE_CHECKING: logger = get_logger(__name__) +class TokenizerModule(TypedDict): + tokenizer: "PreTrainedTokenizer" + processor: Optional["ProcessorMixin"] + + def _get_init_kwargs(model_args: "ModelArguments") -> Dict[str, Any]: r""" Gets arguments to load config/tokenizer/model. @@ -36,7 +41,7 @@ def _get_init_kwargs(model_args: "ModelArguments") -> Dict[str, Any]: } -def load_tokenizer(model_args: "ModelArguments") -> "PreTrainedTokenizer": +def load_tokenizer(model_args: "ModelArguments") -> "TokenizerModule": r""" Loads pretrained tokenizer. @@ -70,7 +75,14 @@ def load_tokenizer(model_args: "ModelArguments") -> "PreTrainedTokenizer": logger.warning("New tokens have been added, changed `resize_vocab` to True.") patch_tokenizer(tokenizer) - return tokenizer + + if model_args.visual_inputs: + processor = AutoProcessor.from_pretrained(model_args.model_name_or_path, **init_kwargs) + setattr(processor, "tokenizer", tokenizer) + else: + processor = None + + return {"tokenizer": tokenizer, "processor": processor} def load_config(model_args: "ModelArguments") -> "PretrainedConfig": @@ -109,6 +121,8 @@ def load_model( if model_args.mixture_of_depths == "load": model = load_mod_pretrained_model(**init_kwargs) + elif model_args.visual_inputs: + model = AutoModelForVision2Seq.from_pretrained(**init_kwargs) else: model = AutoModelForCausalLM.from_pretrained(**init_kwargs) diff --git a/src/llmtuner/train/dpo/workflow.py b/src/llmtuner/train/dpo/workflow.py index 929dd029..b19a643e 100644 --- a/src/llmtuner/train/dpo/workflow.py +++ b/src/llmtuner/train/dpo/workflow.py @@ -24,8 +24,9 @@ def run_dpo( finetuning_args: "FinetuningArguments", callbacks: Optional[List["TrainerCallback"]] = None, ): - tokenizer = load_tokenizer(model_args) - dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="rm") + tokenizer_module = load_tokenizer(model_args) + tokenizer = tokenizer_module["tokenizer"] + dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module) model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train) data_collator = PairwiseDataCollatorWithPadding( diff --git a/src/llmtuner/train/orpo/workflow.py b/src/llmtuner/train/orpo/workflow.py index 5a2fd36c..9c870096 100644 --- a/src/llmtuner/train/orpo/workflow.py +++ b/src/llmtuner/train/orpo/workflow.py @@ -24,8 +24,9 @@ def run_orpo( finetuning_args: "FinetuningArguments", callbacks: Optional[List["TrainerCallback"]] = None, ): - tokenizer = 
load_tokenizer(model_args) - dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="rm") + tokenizer_module = load_tokenizer(model_args) + tokenizer = tokenizer_module["tokenizer"] + dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module) model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train) data_collator = PairwiseDataCollatorWithPadding( diff --git a/src/llmtuner/train/ppo/workflow.py b/src/llmtuner/train/ppo/workflow.py index d5854073..8cd15932 100644 --- a/src/llmtuner/train/ppo/workflow.py +++ b/src/llmtuner/train/ppo/workflow.py @@ -27,8 +27,9 @@ def run_ppo( generating_args: "GeneratingArguments", callbacks: Optional[List["TrainerCallback"]] = None, ): - tokenizer = load_tokenizer(model_args) - dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="ppo") + tokenizer_module = load_tokenizer(model_args) + tokenizer = tokenizer_module["tokenizer"] + dataset = get_dataset(model_args, data_args, training_args, stage="ppo", **tokenizer_module) model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train, add_valuehead=True) tokenizer.padding_side = "left" # use left-padding in generation while using right-padding in training diff --git a/src/llmtuner/train/pt/workflow.py b/src/llmtuner/train/pt/workflow.py index f683f37a..3b127da4 100644 --- a/src/llmtuner/train/pt/workflow.py +++ b/src/llmtuner/train/pt/workflow.py @@ -25,8 +25,9 @@ def run_pt( finetuning_args: "FinetuningArguments", callbacks: Optional[List["TrainerCallback"]] = None, ): - tokenizer = load_tokenizer(model_args) - dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="pt") + tokenizer_module = load_tokenizer(model_args) + tokenizer = tokenizer_module["tokenizer"] + dataset = get_dataset(model_args, data_args, training_args, stage="pt", **tokenizer_module) model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train) data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) diff --git a/src/llmtuner/train/rm/workflow.py b/src/llmtuner/train/rm/workflow.py index 42bf1ce6..bd0a756c 100644 --- a/src/llmtuner/train/rm/workflow.py +++ b/src/llmtuner/train/rm/workflow.py @@ -25,8 +25,9 @@ def run_rm( finetuning_args: "FinetuningArguments", callbacks: Optional[List["TrainerCallback"]] = None, ): - tokenizer = load_tokenizer(model_args) - dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="rm") + tokenizer_module = load_tokenizer(model_args) + tokenizer = tokenizer_module["tokenizer"] + dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module) model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train, add_valuehead=True) data_collator = PairwiseDataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) diff --git a/src/llmtuner/train/sft/workflow.py b/src/llmtuner/train/sft/workflow.py index 9ab78850..4a9775b4 100644 --- a/src/llmtuner/train/sft/workflow.py +++ b/src/llmtuner/train/sft/workflow.py @@ -28,8 +28,9 @@ def run_sft( generating_args: "GeneratingArguments", callbacks: Optional[List["TrainerCallback"]] = None, ): - tokenizer = load_tokenizer(model_args) - dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="sft") + tokenizer_module = load_tokenizer(model_args) + tokenizer = tokenizer_module["tokenizer"] + dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module) model = 
load_model(tokenizer, model_args, finetuning_args, training_args.do_train) if training_args.predict_with_generate: @@ -47,6 +48,7 @@ def run_sft( # Override the decoding parameters of Seq2SeqTrainer training_args.generation_max_length = training_args.generation_max_length or data_args.cutoff_len training_args.generation_num_beams = data_args.eval_num_beams or training_args.generation_num_beams + training_args.remove_unused_columns = False if model_args.visual_inputs else training_args.remove_unused_columns # Initialize our Trainer trainer = CustomSeq2SeqTrainer( diff --git a/src/llmtuner/train/tuner.py b/src/llmtuner/train/tuner.py index a8a2b8e9..a2eb121f 100644 --- a/src/llmtuner/train/tuner.py +++ b/src/llmtuner/train/tuner.py @@ -52,7 +52,7 @@ def export_model(args: Optional[Dict[str, Any]] = None): if model_args.adapter_name_or_path is not None and model_args.export_quantization_bit is not None: raise ValueError("Please merge adapters before quantizing the model.") - tokenizer = load_tokenizer(model_args) + tokenizer = load_tokenizer(model_args)["tokenizer"] get_template_and_fix_tokenizer(tokenizer, data_args.template) model = load_model(tokenizer, model_args, finetuning_args) # must after fixing tokenizer to resize vocab diff --git a/src/llmtuner/train/utils.py b/src/llmtuner/train/utils.py index 27dc8eb3..d9fc363d 100644 --- a/src/llmtuner/train/utils.py +++ b/src/llmtuner/train/utils.py @@ -91,7 +91,7 @@ def create_ref_model( ) ref_model_args = ModelArguments(**ref_model_args_dict) ref_finetuning_args = FinetuningArguments(finetuning_type="lora") - tokenizer = load_tokenizer(ref_model_args) + tokenizer = load_tokenizer(ref_model_args)["tokenizer"] ref_model = load_model( tokenizer, ref_model_args, ref_finetuning_args, is_trainable=False, add_valuehead=add_valuehead ) @@ -100,7 +100,7 @@ def create_ref_model( if finetuning_args.finetuning_type == "lora": ref_model = None else: - tokenizer = load_tokenizer(model_args) + tokenizer = load_tokenizer(model_args)["tokenizer"] ref_model = load_model( tokenizer, model_args, finetuning_args, is_trainable=False, add_valuehead=add_valuehead ) @@ -147,7 +147,7 @@ def create_reward_model( ) reward_model_args = ModelArguments(**reward_model_args_dict) reward_finetuning_args = FinetuningArguments(finetuning_type="lora") - tokenizer = load_tokenizer(reward_model_args) + tokenizer = load_tokenizer(reward_model_args)["tokenizer"] reward_model = load_model( tokenizer, reward_model_args, reward_finetuning_args, is_trainable=False, add_valuehead=True ) diff --git a/src/llmtuner/webui/chatter.py b/src/llmtuner/webui/chatter.py index 82e7b7f1..5aa8f563 100644 --- a/src/llmtuner/webui/chatter.py +++ b/src/llmtuner/webui/chatter.py @@ -2,6 +2,8 @@ import json import os from typing import TYPE_CHECKING, Dict, Generator, List, Optional, Sequence, Tuple +from numpy.typing import NDArray + from ..chat import ChatModel from ..data import Role from ..extras.misc import torch_gc @@ -112,6 +114,7 @@ class WebChatModel(ChatModel): messages: Sequence[Dict[str, str]], system: str, tools: str, + image: Optional[NDArray], max_new_tokens: int, top_p: float, temperature: float, @@ -119,7 +122,7 @@ class WebChatModel(ChatModel): chatbot[-1][1] = "" response = "" for new_text in self.stream_chat( - messages, system, tools, max_new_tokens=max_new_tokens, top_p=top_p, temperature=temperature + messages, system, tools, image, max_new_tokens=max_new_tokens, top_p=top_p, temperature=temperature ): response += new_text if tools: diff --git 
a/src/llmtuner/webui/components/chatbot.py b/src/llmtuner/webui/components/chatbot.py index 82bc4f29..e1be1f7b 100644 --- a/src/llmtuner/webui/components/chatbot.py +++ b/src/llmtuner/webui/components/chatbot.py @@ -23,9 +23,15 @@ def create_chat_box( messages = gr.State([]) with gr.Row(): with gr.Column(scale=4): - role = gr.Dropdown(choices=[Role.USER.value, Role.OBSERVATION.value], value=Role.USER.value) - system = gr.Textbox(show_label=False) - tools = gr.Textbox(show_label=False, lines=2) + with gr.Row(): + with gr.Column(): + role = gr.Dropdown(choices=[Role.USER.value, Role.OBSERVATION.value], value=Role.USER.value) + system = gr.Textbox(show_label=False) + tools = gr.Textbox(show_label=False, lines=4) + + with gr.Column(): + image = gr.Image(type="numpy") + query = gr.Textbox(show_label=False, lines=8) submit_btn = gr.Button(variant="primary") @@ -43,7 +49,7 @@ def create_chat_box( [chatbot, messages, query], ).then( engine.chatter.stream, - [chatbot, messages, system, tools, max_new_tokens, top_p, temperature], + [chatbot, messages, system, tools, image, max_new_tokens, top_p, temperature], [chatbot, messages], ) clear_btn.click(lambda: ([], []), outputs=[chatbot, messages]) @@ -56,6 +62,7 @@ def create_chat_box( role=role, system=system, tools=tools, + image=image, query=query, submit_btn=submit_btn, max_new_tokens=max_new_tokens, diff --git a/src/llmtuner/webui/locales.py b/src/llmtuner/webui/locales.py index 3af9128f..8e93efd6 100644 --- a/src/llmtuner/webui/locales.py +++ b/src/llmtuner/webui/locales.py @@ -1073,6 +1073,17 @@ LOCALES = { "placeholder": "工具列表(非必填)", }, }, + "image": { + "en": { + "label": "Image (optional)", + }, + "ru": { + "label": "Изображение (по желанию)", + }, + "zh": { + "label": "图像(非必填)", + }, + }, "query": { "en": { "placeholder": "Input...",