mirror of https://github.com/hiyouga/LLaMA-Factory.git
synced 2025-11-04 18:02:19 +08:00

update data readme

Former-commit-id: beb864a9367943d3274cb6057423d1eb9aaf85c4

This commit is contained in:
	parent 9c1c59e481
	commit 6b9003f781

191	data/README.md
@@ -1,16 +1,17 @@
If you are using a custom dataset, please add your **dataset description** to `dataset_info.json` according to the following format. We also provide several examples in the next section.
The `dataset_info.json` contains all available datasets. If you are using a custom dataset, please make sure to add a *dataset description* in `dataset_info.json` and specify `dataset: dataset_name` before training to use it.

Currently we support datasets in **alpaca** and **sharegpt** format.

```json
"dataset_name": {
  "hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url and file_name)",
  "ms_hub_url": "the name of the dataset repository on the ModelScope hub. (if specified, ignore script_url and file_name)",
  "ms_hub_url": "the name of the dataset repository on the Model Scope hub. (if specified, ignore script_url and file_name)",
  "script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name)",
  "file_name": "the name of the dataset file in this directory. (required if above are not specified)",
  "file_sha1": "the SHA-1 hash value of the dataset file. (optional, does not affect training)",
  "file_name": "the name of the dataset folder or dataset file in this directory. (required if above are not specified)",
  "formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
  "ranking": "whether the dataset is a preference dataset or not. (default: False)",
  "subset": "the name of the subset. (optional, default: None)",
  "folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
  "ranking": "whether the dataset is a preference dataset or not. (default: false)",
  "formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
  "columns (optional)": {
    "prompt": "the column name in the dataset containing the prompts. (default: instruction)",
    "query": "the column name in the dataset containing the queries. (default: input)",
@@ -36,11 +37,15 @@ If you are using a custom dataset, please add your **dataset description** to `d
}
```

After that, you can load the custom dataset by specifying `--dataset dataset_name`.
## Alpaca Format

----
### Supervised Fine-Tuning Dataset

Currently we support dataset in **alpaca** or **sharegpt** format, the dataset in alpaca format should follow the below format:
In supervised fine-tuning, the `instruction` column will be concatenated with the `input` column and used as the human prompt, then the human prompt would be `instruction\ninput`. The `output` column represents the model response.

The `system` column will be used as the system prompt if specified.

The `history` column is a list consisting of string tuples representing prompt-response pairs in the history messages. Note that the responses in the history **will also be learned by the model** in supervised fine-tuning.

```json
[
@@ -57,7 +62,7 @@ Currently we support dataset in **alpaca** or **sharegpt** format, the dataset i
]
```

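The alpaca example above is cut off by the diff view. Based on the column descriptions in this section, a single supervised fine-tuning record would look roughly like the following sketch; the values are placeholders, and `system` and `history` are the optional columns described above:

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)",
    "history": [
      ["human instruction in the first round (optional)", "model response in the first round (optional)"],
      ["human instruction in the second round (optional)", "model response in the second round (optional)"]
    ]
  }
]
```
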
Regarding the above dataset, the description in `dataset_info.json` should be:
Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -72,11 +77,9 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
}
```

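The dataset description above is likewise truncated. Assuming the record layout sketched earlier, a plausible complete entry maps each column to its name in the file; the `response`, `system` and `history` keys are assumptions based on the column list at the top of this file:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history"
  }
}
```
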
The `query` column will be concatenated with the `prompt` column and used as the human prompt, then the human prompt would be `prompt\nquery`. The `response` column represents the model response.
### Pre-training Dataset

The `system` column will be used as the system prompt. The `history` column is a list consisting string tuples representing prompt-response pairs in the history. Note that the responses in the history **will also be used for training** in supervised fine-tuning.

For the **pre-training datasets**, only the `prompt` column will be used for training, for example:
In pre-training, only the `prompt` column will be used for model learning.

```json
[
@@ -85,7 +88,7 @@ For the **pre-training datasets**, only the `prompt` column will be used for tra
]
```

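The truncated example above is just a list of plain documents. A minimal sketch of a pre-training file, assuming the raw text lives in a column named `text` (the column name is an assumption, not shown in this diff):

```json
[
  {"text": "document"},
  {"text": "document"}
]
```
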
Regarding the above dataset, the description in `dataset_info.json` should be:
Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -96,20 +99,24 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
}
```

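Continuing the assumption of a `text` column, the corresponding entry would plausibly map only the prompt column:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "text"
  }
}
```
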
For the **preference datasets**, the `response` column should be a string list whose length is 2, with the preferred answers appearing first, for example:
### Preference Dataset

Preference datasets are used for reward modeling, DPO training and ORPO training.

It requires a better response in the `chosen` column and a worse response in the `rejected` column.

```json
[
  {
    "instruction": "human instruction",
    "input": "human input",
    "chosen": "chosen answer",
    "rejected": "rejected answer"
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "chosen": "chosen answer (required)",
    "rejected": "rejected answer (required)"
  }
]
```

Regarding the above dataset, the description in `dataset_info.json` should be:
Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -124,14 +131,86 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
}
```
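
The description above is cut off by the diff. For a preference dataset in alpaca format, a plausible entry sets `ranking` and maps the `chosen` and `rejected` columns; the exact mapping shown here is an assumption based on the column list at the top of this file:

```json
"dataset_name": {
  "file_name": "data.json",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```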

----
### KTO Dataset

The dataset in **sharegpt** format should follow the below format:
KTO datasets require an extra `kto_tag` column containing the boolean human feedback.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "kto_tag": "human feedback [true/false] (required)"
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "kto_tag": "kto_tag"
  }
}
```

### Multimodal Dataset

Multimodal datasets require an `images` column containing the paths to the input images. Currently we only support one image.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "images": [
      "image path (required)"
    ]
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "images": "images"
  }
}
```

## Sharegpt Format

### Supervised Fine-Tuning Dataset

Compared to the alpaca format, the sharegpt format allows the dataset to have more **roles**, such as human, gpt, observation and function. They are presented in a list of objects in the `conversations` column.

Note that the human and observation should appear in odd positions, while gpt and function should appear in even positions.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      {
        "from": "human",
        "value": "human instruction"
@@ -147,7 +226,7 @@ The dataset in **sharegpt** format should follow the below format:
]
```

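The sharegpt example above is also truncated. Piecing this section together, a complete record would plausibly carry the optional `system` and `tools` fields that the column mapping below refers to; this sketch and its placeholder values are assumptions:

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)",
    "tools": "tool description (optional)"
  }
]
```
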
Regarding the above dataset, the description in `dataset_info.json` should be:
Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -157,19 +236,61 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
  },
  "tags": {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt"
  }
}
```

where the `messages` column should be a list following the `u/a/u/a/u/a` order.
### Preference Dataset

We also supports the dataset in the **openai** format:
Preference datasets in sharegpt format also require a better message in the `chosen` column and a worse message in the `rejected` column.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      {
        "from": "human",
        "value": "human instruction"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "chosen answer (required)"
    },
    "rejected": {
      "from": "gpt",
      "value": "rejected answer (required)"
    }
  }
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "ranking": true,
  "columns": {
    "messages": "conversations",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

### OpenAI Format

The openai format is simply a special case of the sharegpt format, where the first message may be a system prompt.

```json
[
@@ -192,7 +313,7 @@ We also supports the dataset in the **openai** format:
]
```

 | 
			
		||||
Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
 | 
			
		||||
 | 
			
		||||
```json
 | 
			
		||||
"dataset_name": {
 | 
			
		||||
@ -211,4 +332,6 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
 | 
			
		||||
}
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
Pre-training datasets and preference datasets are **incompatible** with the sharegpt format yet.
 | 
			
		||||
The KTO datasets and multimodal datasets in sharegpt format are similar to the alpaca format.
 | 
			
		||||
 | 
			
		||||
Pre-training datasets are **incompatible** with the sharegpt format.
 | 
			
		||||
 | 
			
		||||
@@ -1,4 +1,6 @@
If you are using a custom dataset, please make sure to add a **dataset description** to the `dataset_info.json` file in the following format. We also provide some examples below.
`dataset_info.json` contains all available datasets. If you want to use a custom dataset, please make sure to add a *dataset description* to the `dataset_info.json` file and use the dataset by modifying the `dataset: dataset_name` configuration.

Currently we support datasets in the **alpaca** and **sharegpt** formats.

```json
"dataset_name": {
@@ -6,11 +8,10 @@
  "ms_hub_url": "the address of the dataset repository on ModelScope (if specified, script_url and file_name are ignored)",
  "script_url": "the name of the local folder containing the dataset loading script (if specified, file_name is ignored)",
  "file_name": "the name of the dataset file in this directory (required if none of the above are specified)",
  "file_sha1": "the SHA-1 hash of the dataset file (optional, leaving it empty does not affect training)",
  "formatting": "the dataset format (optional, default: alpaca, can be alpaca or sharegpt)",
  "ranking": "whether the dataset is a preference dataset (optional, default: False)",
  "subset": "the name of the dataset subset (optional, default: None)",
  "folder": "the folder name in the Hugging Face repository (optional, default: None)",
  "ranking": "whether the dataset is a preference dataset (optional, default: False)",
  "formatting": "the dataset format (optional, default: alpaca, can be alpaca or sharegpt)",
  "columns (optional)": {
    "prompt": "the column name for prompts in the dataset (default: instruction)",
    "query": "the column name for queries in the dataset (default: input)",
@@ -20,8 +21,8 @@
    "system": "the column name for system prompts in the dataset (default: None)",
    "tools": "the column name for tool descriptions in the dataset (default: None)",
    "images": "the column name for image inputs in the dataset (default: None)",
    "chosen": "the column name for the better reply in the dataset (default: None)",
    "rejected": "the column name for the worse reply in the dataset (default: None)",
    "chosen": "the column name for the better answer in the dataset (default: None)",
    "rejected": "the column name for the worse answer in the dataset (default: None)",
    "kto_tag": "the column name for KTO tags in the dataset (default: None)"
  },
  "tags (optional, used for the sharegpt format)": {
@@ -31,16 +32,20 @@
    "assistant_tag": "the role_tag that represents the assistant in messages (default: gpt)",
    "observation_tag": "the role_tag that represents tool results in messages (default: observation)",
    "function_tag": "the role_tag that represents tool calls in messages (default: function_call)",
    "system_tag": "the role_tag that represents the system prompt in messages (default: system, overrides the system column)"
  }
}
```

Then you can load the custom dataset by specifying the `--dataset dataset_name` argument.
## Alpaca Format

----
### Supervised Fine-Tuning Dataset

This project currently supports datasets in two formats, **alpaca** and **sharegpt**, where datasets in the alpaca format are organized as follows:
In supervised fine-tuning, the content of the `instruction` column is concatenated with the content of the `input` column as the human instruction, i.e. the human instruction is `instruction\ninput`. The content of the `output` column is the model response.

If specified, the content of the `system` column is used as the system prompt.

The `history` column is a list of string pairs, each representing the instruction and the response of one turn in the message history. Note that in supervised fine-tuning, the responses in the history **are also used for model learning**.

```json
[
@@ -57,7 +62,7 @@
]
```

For data in the above format, the description in `dataset_info.json` should be:
For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -72,11 +77,9 @@
}
```

The content of the `query` column is concatenated with the content of the `prompt` column as the human instruction, i.e. the human instruction is `prompt\nquery`. The content of the `response` column is the model response.
### Pre-training Dataset

The content of the `system` column is used as the system prompt. The `history` column is a list of string pairs, each representing the instruction and the response of one turn in the message history. Note that in supervised fine-tuning, the responses in the history **are also used for training**.

For **pre-training datasets**, only the content of the `prompt` column is used for model training, for example:
For **pre-training datasets**, only the content of the `prompt` column is used for model learning, for example:

```json
[
@@ -85,7 +88,7 @@
]
```

For data in the above format, the description in `dataset_info.json` should be:
For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -96,20 +99,24 @@
}
```

For **preference datasets**, the `response` column should be a string list of length 2, where the first element is the better answer, for example:
### Preference Dataset

Preference datasets are used for reward model training, DPO training and ORPO training.

They require a better answer in the `chosen` column and a worse answer in the `rejected` column.

```json
[
  {
    "instruction": "human instruction",
    "input": "human input",
    "chosen": "better answer",
    "rejected": "worse answer"
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "chosen": "better answer (required)",
    "rejected": "worse answer (required)"
  }
]
```

For data in the above format, the description in `dataset_info.json` should be:
For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -124,14 +131,86 @@
}
```

----
### KTO Dataset

Datasets in the **sharegpt** format are organized as follows:
KTO datasets require an additional `kto_tag` column containing boolean human feedback.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "kto_tag": "human feedback [true/false] (required)"
  }
]
```

For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "kto_tag": "kto_tag"
  }
}
```

### Multimodal Dataset

Multimodal datasets require an additional `images` column containing the paths to the input images. Currently we only support a single image input.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "images": [
      "image path (required)"
    ]
  }
]
```

For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "images": "images"
  }
}
```

## Sharegpt Format

### Supervised Fine-Tuning Dataset

Compared to the alpaca format, the sharegpt format supports more **role types**, such as human, gpt, observation and function. They are presented as a list of objects in the `conversations` column.

Note that human and observation must appear at odd positions, while gpt and function must appear at even positions.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      {
        "from": "human",
        "value": "human instruction"
@@ -147,7 +226,7 @@
]
```

For data in the above format, the description in `dataset_info.json` should be:
For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -167,9 +246,57 @@
}
```

The `messages` column should be a list that follows the `human/model/human/model/human/model` order.
### Preference Dataset

We also support datasets in the **openai** format:
Preference datasets in the sharegpt format likewise require a better message in the `chosen` column and a worse message in the `rejected` column.

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      {
        "from": "human",
        "value": "human instruction"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "better answer"
    },
    "rejected": {
      "from": "gpt",
      "value": "worse answer"
    }
  }
]
```

For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "ranking": true,
  "columns": {
    "messages": "conversations",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

### OpenAI Format

The OpenAI format is simply a special case of the sharegpt format, where the first message may be a system prompt.

```json
[
@@ -192,7 +319,7 @@
]
```

For data in the above format, the description in `dataset_info.json` should be:
For data in the above format, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
@@ -211,4 +338,6 @@
}
```

Pre-training datasets and preference datasets are **not yet supported** in the sharegpt format.
KTO datasets and multimodal datasets in the sharegpt format are similar to those in the alpaca format.

Pre-training datasets are **not supported** in the sharegpt format.