Mirror of https://github.com/hiyouga/LLaMA-Factory.git (synced 2025-10-14 23:58:11 +08:00)

Commit 6d881f161b: add datasets
Parent: a02b3e6192
Former-commit-id: 02e4b47dea1b25905c61f2ace88bab112610f021
README.md: 56 changed lines
@@ -22,11 +22,11 @@

[23/07/05] Now we support training the **Falcon-7B/40B** models in this repo. Try the `--model_name_or_path tiiuae/falcon-7b` and `--lora_target query_key_value` arguments to use the Falcon model.
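For reference, a LoRA fine-tuning command with these Falcon flags might look like the sketch below. It is an illustration only: the entry script name, dataset, and output path are assumptions, not values taken from this commit. The Baichuan entry further down follows the same pattern with `--lora_target W_pack`.

```bash
# Hypothetical invocation; script name, dataset, and output path are placeholders.
python src/train_sft.py \
    --model_name_or_path tiiuae/falcon-7b \
    --do_train \
    --dataset alpaca_en \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output/falcon-7b-sft
```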
- [23/06/29] We provide a **reproducible example** of training a chat model using instruction-following datasets, see this [HuggingFace Repo](https://huggingface.co/hiyouga/baichuan-7b-sft) for details.
+ [23/06/29] We provide a **reproducible example** of training a chat model using instruction-following datasets; see this [Hugging Face Repo](https://huggingface.co/hiyouga/baichuan-7b-sft) for details.
[23/06/22] Now we align the [demo API](src/api_demo.py) with [OpenAI's](https://platform.openai.com/docs/api-reference/chat) format, so you can plug the fine-tuned model into **arbitrary ChatGPT-based applications**.
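As an illustration, a request to an OpenAI-compatible endpoint typically looks like this; the host, port, and model name here are assumptions for the sketch, not values taken from this commit.

```bash
# Hypothetical endpoint; adjust host/port to wherever api_demo.py is served.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```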
- [23/06/15] Now we support training the **Baichuan-7B** model in this repo. Try the `--model_name_or_path baichuan-inc/Baichuan-7B` and `--lora_target W_pack` arguments to use the Baichuan-7B model. If you want to train with an RTX 3090, use `git checkout baichuan-7b-rtx3090` to switch to the `baichuan-7b-rtx3090` branch and try the `--baichuan_rtx_gpu true` argument. (Other RTX-series GPUs may also work.)
+ [23/06/15] Now we support training the **Baichuan-7B** model in this repo. Try the `--model_name_or_path baichuan-inc/Baichuan-7B` and `--lora_target W_pack` arguments to use the Baichuan-7B model.
[23/06/03] Now we support quantized training and inference (aka **[QLoRA](https://github.com/artidoro/qlora)**). Try the `--quantization_bit 4/8` argument to work with quantized models. (experimental feature)
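Concretely, QLoRA is enabled by adding the quantization flag to the same kind of command sketched above; the script name and dataset remain placeholders.

```bash
# 4-bit QLoRA: hypothetical command, same placeholders as the sketch above.
python src/train_sft.py \
    --model_name_or_path baichuan-inc/Baichuan-7B \
    --do_train \
    --dataset alpaca_en \
    --finetuning_type lora \
    --lora_target W_pack \
    --quantization_bit 4 \
    --output_dir output/baichuan-7b-qlora
```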
@@ -60,36 +60,36 @@

## Provided Datasets
- For pre-training:
-   - [Wiki Demo](data/wiki_demo.txt)
+   - [Wiki Demo (en)](data/wiki_demo.txt)
- For supervised fine-tuning:
-   - [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
-   - [Stanford Alpaca (Chinese)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
-   - [GPT-4 Generated Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-   - [BELLE 2M](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
-   - [BELLE 1M](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
-   - [BELLE 0.5M](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
-   - [BELLE Dialogue 0.4M](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M)
-   - [BELLE School Math 0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M)
-   - [BELLE Multiturn Chat 0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)
-   - [Guanaco Dataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset)
-   - [Firefly 1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)
-   - [CodeAlpaca 20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k)
-   - [Alpaca CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
-   - [Web QA (Chinese)](https://huggingface.co/datasets/suolyer/webqa)
-   - [UltraChat](https://github.com/thunlp/UltraChat)
-   - [Open Assistant](https://huggingface.co/datasets/OpenAssistant/oasst1)
-   - [Open Assistant (Chinese)](https://huggingface.co/datasets/OpenAssistant/oasst1)
-   - [WebNovel (Chinese)](https://huggingface.co/datasets/zxbsmk/webnovel_cn)
- - For reward model training:
-   - [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
-   - [Open Assistant](https://huggingface.co/datasets/OpenAssistant/oasst1)
-   - [Open Assistant (Chinese)](https://huggingface.co/datasets/OpenAssistant/oasst1)
-   - [GPT-4 Generated Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-   - [GPT-4 Generated Data (Chinese)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
+   - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
+   - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
+   - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
+   - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
+   - [Self-cognition (zh)](data/self_cognition.json)
+   - [ShareGPT (zh)](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chinese-instruction-collection)
+   - [RefGPT (zh)](https://github.com/sufengniu/RefGPT)
+   - [Guanaco Dataset (multilingual)](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset)
+   - [BELLE 2M (zh)](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
+   - [BELLE 1M (zh)](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
+   - [BELLE 0.5M (zh)](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
+   - [BELLE Dialogue 0.4M (zh)](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M)
+   - [BELLE School Math 0.25M (zh)](https://huggingface.co/datasets/BelleGroup/school_math_0.25M)
+   - [BELLE Multiturn Chat 0.8M (zh)](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)
+   - [Firefly 1.1M (zh)](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)
+   - [CodeAlpaca 20k (en)](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k)
+   - [Alpaca CoT (multilingual)](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
+   - [Web QA (zh)](https://huggingface.co/datasets/suolyer/webqa)
+   - [UltraChat (en)](https://github.com/thunlp/UltraChat)
+   - [WebNovel (zh)](https://huggingface.co/datasets/zxbsmk/webnovel_cn)
+ - For reward modelling:
+   - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
+   - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
+   - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
Please refer to [data/README.md](data/README.md) for details.

- Some datasets require confirmation before using them, so we recommend logging in with your HuggingFace account using these commands.
+ Some datasets require confirmation before use, so we recommend logging in to your Hugging Face account with the following commands.
```bash
pip install --upgrade huggingface_hub
```
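The hunk is truncated at this point; the fenced block presumably continues with the actual login step, which for the `huggingface_hub` tooling is:

```bash
huggingface-cli login
```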
data/README.md

@@ -1,4 +1,5 @@

- Data format in `dataset_info.json`:
+ If you are using a custom dataset, please provide your dataset definition in `dataset_info.json` in the following format.
```json
"dataset_name": {
    "hf_hub_url": "the name of the dataset repository on the Hugging Face Hub (if specified, ignore the 3 arguments below)",
@@ -14,40 +15,4 @@ Data format in `dataset_info.json`:

}
```
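As an illustration, a filled-in entry for a local data file under the default column names might look like this; the dataset name, file name, and column mapping are hypothetical (the optional `file_sha1` is omitted):

```json
"my_sft_data": {
    "file_name": "my_sft_data.json",
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output",
        "history": "history"
    }
}
```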
- Dataset definition format in `dataset_info.json`:
- ```json
- "dataset_name": {
-     "hf_hub_url": "the dataset repository on the Hugging Face Hub (if specified, the three arguments below are ignored)",
-     "script_url": "the name of the local folder containing the data loading script (if specified, the two arguments below are ignored)",
-     "file_name": "the name of the dataset file in this directory (required if the above arguments are not specified)",
-     "file_sha1": "the SHA-1 hash of the dataset file (optional)",
-     "columns": {
-         "prompt": "the column name for the prompt (default: instruction)",
-         "query": "the column name for the query (default: input)",
-         "response": "the column name for the response (default: output)",
-         "history": "the column name for the dialogue history (default: None)"
-     }
- }
- ```
-
- A brief description of some of the provided datasets:
-
- | Dataset | Size | Description |
- | --- | --- | --- |
- | [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 52k | The Alpaca dataset released by Stanford, used to train early LLaMA-based models such as Alpaca |
- | [Stanford Alpaca (Chinese)](https://github.com/ymcui/Chinese-LLaMA-Alpaca) | 51k | The Alpaca dataset translated into Chinese with ChatGPT |
- | [GPT-4 Generated Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 100k+ | A self-instruct dataset generated with GPT-4 |
- | [BELLE 2M](https://huggingface.co/datasets/BelleGroup/train_2M_CN) | 2m | About 2 million Chinese instruction samples generated by the [BELLE](https://github.com/LianjiaTech/BELLE) project |
- | [BELLE 1M](https://huggingface.co/datasets/BelleGroup/train_1M_CN) | 1m | About 1 million Chinese instruction samples generated by the [BELLE](https://github.com/LianjiaTech/BELLE) project |
- | [BELLE 0.5M](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN) | 500k | About 500,000 Chinese instruction samples generated by the [BELLE](https://github.com/LianjiaTech/BELLE) project |
- | [BELLE Dialogue 0.4M](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M) | 400k | About 400,000 personalized role-play dialogues generated by the [BELLE](https://github.com/LianjiaTech/BELLE) project, with character descriptions |
- | [BELLE School Math 0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M) | 250k | About 250,000 Chinese math problems generated by the [BELLE](https://github.com/LianjiaTech/BELLE) project, with worked solutions |
- | [BELLE Multiturn Chat 0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M) | 800k | About 800,000 multi-turn user-assistant dialogues generated by the [BELLE](https://github.com/LianjiaTech/BELLE) project |
- | [Guanaco Dataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | 100k+ | Data in Japanese, Simplified and Traditional Chinese, English, and more; originally used to train the Guanaco model |
- | [Firefly 1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | 1.1M | The Chinese dataset of the Chinese conversational model Firefly, covering multiple NLP tasks |
- | [CodeAlpaca 20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k) | 20k | An English code-generation dataset |
- | [Alpaca CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) | 6M | A collection of instruction datasets for fine-tuning |
- | [Web QA](https://huggingface.co/datasets/suolyer/webqa) | 36k | A Chinese question-answering dataset collected from Baidu Zhidao |
- | [UltraChat](https://github.com/thunlp/UltraChat) | 1.57M | A large-scale multi-turn dialogue dataset released by THU NLP |
-
- Note: the BELLE datasets are generated by ChatGPT and their accuracy is not guaranteed; the same caveat applies to every self-instruct dataset produced by GPT-like models.
+ where the `prompt` and `response` columns should contain non-empty values. The `query` column will be concatenated with the `prompt` column and used as input for the model. The `history` column should contain a list where each element is a string tuple representing a query-response pair.
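To make the column semantics concrete, a single record under the default column names could look like the following; the content is invented for illustration. `prompt` and `response` map to `instruction` and `output`, `query` maps to `input`, and each `history` element is a [query, response] pair:

```json
{
    "instruction": "Translate the sentence into French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
    "history": [
        ["Hello!", "Hi! How can I help you?"]
    ]
}
```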
@@ -1,2 +0,0 @@
- {"id": 0,"title": "大卫·亨利","content": "大卫·亨利\n\n大卫·克莱顿·亨利(David Clayton Henrie,),美国演员。近来在迪士尼频道原创电视影集《少年魔法师》(Wizards of Waverly Place)当中演出贾斯汀·鲁索(Justin Russo)一角。\n\n大卫·亨利出生在加州Mission Viejo,在凤凰城长大。他的胞弟劳伦斯·亨利(Lorenzo Henrie)也是演员。大卫·亨利就读夏安传统学校。家中是信奉罗马天主教。 \n\n大卫在2007年拍摄少年魔法师期间认识女演员露西·海尔(Lucy Hale),之后与其交往,于2009年分手。\n\n10岁时,大卫·亨利和SAG在凤凰城签订了合约,并开始走出去试镜。 9岁的时候,在沙加缅度进行商业拍摄,SAG董事建议大卫·亨利搬到洛杉矶。在10岁那年夏天,他和他的家人搬到了好莱坞。他预定他的前2支商业试镜,扮演主要角色为汉堡王和桂格燕麦。他初演电视节目为Providence。 \n\n到了13岁,大卫有了他的第一次重大突破,在福克斯公司的喜剧The Pitts饰演 Petey Pitt一角。大卫下出作品为的Hallmark movie为Monster Maker,和琳达布莱儿、乔治甘迺迪共同演出,并要求回来Hallmark movie公司。 \n\n在18岁时,大卫得到了迪士尼频道原创系列演出机会,该节目2007年10月12日首播。大卫2008年参加了迪士尼频道的游戏节目。他是绿色团队的队长,隔年,为旋风队队长。他在迪士尼原创电影《少年魔法师》之后在《酷爸的疯狂假期》中有饰演一角。\n"}
- {"id": 1,"title": "大卫·亨利","content": "大卫·亨利\n\n大卫·克莱顿·亨利(David Clayton Henrie,),美国演员。近来在迪士尼频道原创电视影集《少年魔法师》(Wizards of Waverly Place)当中演出贾斯汀·鲁索(Justin Russo)一角。\n\n大卫·亨利出生在加州Mission Viejo,在凤凰城长大。他的胞弟劳伦斯·亨利(Lorenzo Henrie)也是演员。大卫·亨利就读夏安传统学校。家中是信奉罗马天主教。 \n\n大卫在2007年拍摄少年魔法师期间认识女演员露西·海尔(Lucy Hale),之后与其交往,于2009年分手。\n\n10岁时,大卫·亨利和SAG在凤凰城签订了合约,并开始走出去试镜。 9岁的时候,在沙加缅度进行商业拍摄,SAG董事建议大卫·亨利搬到洛杉矶。在10岁那年夏天,他和他的家人搬到了好莱坞。他预定他的前2支商业试镜,扮演主要角色为汉堡王和桂格燕麦。他初演电视节目为Providence。 \n\n到了13岁,大卫有了他的第一次重大突破,在福克斯公司的喜剧The Pitts饰演 Petey Pitt一角。大卫下出作品为的Hallmark movie为Monster Maker,和琳达布莱儿、乔治甘迺迪共同演出,并要求回来Hallmark movie公司。 \n\n在18岁时,大卫得到了迪士尼频道原创系列演出机会,该节目2007年10月12日首播。大卫2008年参加了迪士尼频道的游戏节目。他是绿色团队的队长,隔年,为旋风队队长。他在迪士尼原创电影《少年魔法师》之后在《酷爸的疯狂假期》中有饰演一角。\n"}
data/refgpt_zh_50k_p1.json.REMOVED.git-id (new file, 1 line)

@@ -0,0 +1 @@
+ 56405bb8f52727e52e99693739494b9b7b0d7ba6
data/refgpt_zh_50k_p2.json.REMOVED.git-id (new file, 1 line)

@@ -0,0 +1 @@
+ fa935248a5d40d2bdd5649af99a72a754d40ae7a
data/sharegpt_zh_27k.json.REMOVED.git-id (new file, 1 line)

@@ -0,0 +1 @@
+ 38c89869c6aeca2a3af9ea1e09afe460f9b46810
@@ -26,7 +26,7 @@ class ChatModel:
    def process_args(
        self, query: str, history: Optional[List[Tuple[str, str]]] = None, prefix: Optional[str] = None, **input_kwargs
    ) -> Tuple[Dict[str, Any], int]:
-        prefix = prefix if prefix else self.source_prefix
+        prefix = prefix or self.source_prefix

        inputs = self.tokenizer([self.template.get_prompt(query, history, prefix)], return_tensors="pt")
        inputs = inputs.to(self.model.device)
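This change swaps the conditional expression for Python's `or` idiom; the two forms behave identically here because `or` returns its first truthy operand. The same rewrite appears in `Template._format_example` below. A standalone sketch of the semantics, with illustrative names:

```python
source_prefix = "You are a helpful assistant."

def resolve_prefix(prefix=None):
    # `x or default` falls back whenever x is falsy (None, "", 0, ...),
    # which matches `x if x else default` exactly.
    return prefix or source_prefix

assert resolve_prefix(None) == source_prefix
assert resolve_prefix("") == source_prefix      # the empty string also falls through
assert resolve_prefix("Be terse.") == "Be terse."
```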
@@ -81,5 +81,4 @@ class ChatModel:
        thread = Thread(target=self.model.generate, kwargs=gen_kwargs)
        thread.start()

-        for new_text in streamer:
-            yield new_text
+        yield from streamer
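`yield from` delegates to an iterable and is the idiomatic replacement for a loop that only re-yields items. A minimal demonstration:

```python
def tokens():
    yield "Hel"
    yield "lo"

def relay_loop():
    for piece in tokens():  # verbose form
        yield piece

def relay_delegate():
    yield from tokens()     # equivalent, and also forwards send()/throw()

assert list(relay_loop()) == list(relay_delegate()) == ["Hel", "lo"]
```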
@@ -46,7 +46,7 @@ class Template:
    def _format_example(
        self, query: str, history: Optional[List[Tuple[str, str]]] = None, prefix: Optional[str] = ""
    ) -> List[str]:
-        prefix = prefix if prefix else self.prefix  # use prefix if provided
+        prefix = prefix or self.prefix  # use prefix if provided
        prefix = prefix + self.sep if prefix else ""  # add separator for non-empty prefix
        history = history if (history and self.use_history) else []
        history = history + [(query, "<dummy>")]
@@ -24,17 +24,9 @@ def main():

    manager = Manager([{"lang": lang}, chat_elems])

-    demo.load(
-        manager.gen_label,
-        [lang],
-        [lang] + [elem for elem in chat_elems.values()],
-    )
+    demo.load(manager.gen_label, [lang], [lang] + list(chat_elems.values()))

-    lang.change(
-        manager.gen_label,
-        [lang],
-        [lang] + [elem for elem in chat_elems.values()],
-    )
+    lang.change(manager.gen_label, [lang], [lang] + list(chat_elems.values()))

    demo.queue()
    demo.launch(server_name="0.0.0.0", share=False, inbrowser=True)
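This refactor replaces an identity comprehension with `list(...)`, which is shorter and avoids an unnecessary loop; the two expressions build the same list, as this standalone check with stand-in values shows:

```python
chat_elems = {"chatbot": "a", "query": "b", "submit": "c"}  # stand-in values

verbose = [elem for elem in chat_elems.values()]
concise = list(chat_elems.values())

assert verbose == concise  # same contents and order (dicts preserve insertion order)
```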