update data readme

Former-commit-id: 0af5f054b7b8da8b39eb44b1dfa76050f0c45667
2026-03-04 10:46:00 +08:00 · 2024-09-05 04:44:49 +08:00
parent 4d35ace75e
commit abd26f5f67
2 changed files with 254 additions and 186 deletions
--- a/data/README.md
+++ b/data/README.md
@@ -108,7 +108,7 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh

 ### Preference Dataset

-Preference datasets are used for reward modeling, DPO training and ORPO training.
+Preference datasets are used for reward modeling, DPO training, ORPO training, and SimPO training.

 It requires a better response in the `chosen` column and a worse response in the `rejected` column.

@@ -140,100 +140,15 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh

 ### KTO Dataset

-- [Example dataset](kto_en_demo.json)
-
-KTO datasets require an extra `kto_tag` column containing the boolean human feedback.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "kto_tag": "human feedback [true/false] (required)"
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "kto_tag": "kto_tag"
-  }
-}
-```
+An additional column `kto_tag` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

 ### Multimodal Image Dataset

-- [Example dataset](mllm_demo.json)
-
-Multimodal image datasets require an `images` column containing the paths to the input images.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "images": [
-      "image path (required)"
-    ]
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "images": "images"
-  }
-}
-```
+An additional column `images` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

 ### Multimodal Video Dataset

-- [Example dataset](mllm_demo_video.json)
-
-Multimodal video datasets require a `videos` column containing the paths to the input videos.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "videos": [
-      "video path (required)"
-    ]
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "videos": "videos"
-  }
-}
-```
+An additional column `videos` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

 ## Sharegpt Format

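The alpaca-style KTO example removed in this hunk and the sharegpt-style example added later in the diff carry the same information in different layouts. As a rough illustration, an existing alpaca-style KTO record could be reshaped as sketched below. The function name `alpaca_kto_to_sharegpt` is hypothetical, and joining `instruction` and `input` with a newline is an assumption of this sketch, not something the README states.

```python
def alpaca_kto_to_sharegpt(record: dict) -> dict:
    """Reshape one alpaca-style KTO record into the sharegpt layout."""
    prompt = record["instruction"]
    # Assumption: a non-empty optional `input` is appended after a newline.
    if record.get("input"):
        prompt = prompt + "\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ],
        # The boolean feedback column is carried over unchanged.
        "kto_tag": record["kto_tag"],
    }


old_record = {
    "instruction": "Say hi",
    "input": "in French",
    "output": "Bonjour",
    "kto_tag": True,
}
new_record = alpaca_kto_to_sharegpt(old_record)
print(new_record["conversations"][0]["value"])
```

The matching `dataset_info.json` entry would then need `"formatting": "sharegpt"` and the `messages` column mapping shown in the added sharegpt sections.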
@@ -286,6 +201,10 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```

+### Pre-training Dataset
+
+Not yet supported; please use the [alpaca](#alpaca-format) format instead.
+
 ### Preference Dataset

 - [Example dataset](dpo_en_demo.json)
@@ -336,6 +255,125 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```

+### KTO Dataset
+
+- [Example dataset](kto_en_demo.json)
+
+KTO datasets require an extra `kto_tag` column containing the boolean human feedback.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "kto_tag": "human feedback [true/false] (required)"
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "kto_tag": "kto_tag"
+  }
+}
+```
+
+### Multimodal Image Dataset
+
+- [Example dataset](mllm_demo.json)
+
+Multimodal image datasets require an `images` column containing the paths to the input images.
+
+The number of images should be identical to the number of `<image>` tokens in the conversations.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "<image>human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "images": [
+      "image path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "images": "images"
+  }
+}
+```
+
+### Multimodal Video Dataset
+
+- [Example dataset](mllm_video_demo.json)
+
+Multimodal video datasets require a `videos` column containing the paths to the input videos.
+
+The number of videos should be identical to the number of `<video>` tokens in the conversations.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "<video>human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "videos": [
+      "video path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "videos": "videos"
+  }
+}
+```
+
 ### OpenAI Format

 The OpenAI format is simply a special case of the sharegpt format, where the first message may be a system prompt.
@@ -379,7 +417,3 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
  }
 }
 ```
-
-The KTO datasets and multimodal datasets in sharegpt format are similar to those in the alpaca format.
-
-Pre-training datasets are **incompatible** with the sharegpt format.
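The added sharegpt sections state two constraints that are easy to violate when assembling data by hand: `kto_tag` holds boolean feedback, and the number of image or video paths should equal the number of `<image>` or `<video>` tokens in the conversations. A minimal sketch of a pre-flight check follows. The helper name `check_record` and the message format are hypothetical, and this is not the project's own validation code.

```python
# Column name -> placeholder token, as described in the README sections above.
MEDIA_TOKENS = {"images": "<image>", "videos": "<video>"}


def check_record(record: dict) -> list:
    """Return a list of problems found in one sharegpt-style record."""
    problems = []
    # Join all message texts so tokens are counted across every turn.
    text = "".join(msg.get("value", "") for msg in record.get("conversations", []))
    for column, token in MEDIA_TOKENS.items():
        n_tokens = text.count(token)
        n_files = len(record.get(column, []))
        if n_tokens != n_files:
            problems.append(f"{column}: {n_files} paths but {n_tokens} {token} tokens")
    # The README's "[true/false]" placeholder stands for a JSON boolean.
    if "kto_tag" in record and not isinstance(record["kto_tag"], bool):
        problems.append("kto_tag: expected a JSON boolean (true/false)")
    return problems


sample = {
    "conversations": [
        {"from": "human", "value": "<image>What is shown here?"},
        {"from": "gpt", "value": "A cat."},
    ],
    "images": ["cat.jpg"],
}
print(check_record(sample))  # []
```

Running such a check over every record after `json.load` would surface mismatches before training starts.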
--- a/data/README_zh.md
+++ b/data/README_zh.md
@@ -108,7 +108,7 @@

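 ### Preference Dataset

-Preference datasets are used for reward model training, DPO training, and ORPO training.
+Preference datasets are used for reward model training, DPO training, ORPO training, and SimPO training.

 It requires a better response in the `chosen` column and a worse response in the `rejected` column.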

@@ -140,100 +140,15 @@

 ### KTO Dataset

-- [Example dataset](kto_en_demo.json)
-
-KTO datasets require an extra `kto_tag` column containing the boolean human feedback.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "kto_tag": "human feedback [true/false] (required)"
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "kto_tag": "kto_tag"
-  }
-}
-```
+KTO datasets require an additional `kto_tag` column. Please refer to [sharegpt](#sharegpt-格式) for details.

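 ### Multimodal Image Dataset

-- [Example dataset](mllm_demo.json)
-
-Multimodal image datasets require an extra `images` column containing the paths to the input images.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "images": [
-      "image path (required)"
-    ]
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "images": "images"
-  }
-}
-```
+Multimodal image datasets require an additional `images` column. Please refer to [sharegpt](#sharegpt-格式) for details.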

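 ### Multimodal Video Dataset

-- [Example dataset](mllm_demo_video.json)
-
-Multimodal video datasets require an extra `videos` column containing the paths to the input videos.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "videos": [
-      "video path (required)"
-    ]
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "videos": "videos"
-  }
-}
-```
+Multimodal video datasets require an additional `videos` column. Please refer to [sharegpt](#sharegpt-格式) for details.

 ## Sharegpt Format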

@@ -286,6 +201,10 @@ KTO 数据集需要额外添加一个 `kto_tag` 列，包含 bool 类型的人
 }
 ```

+### Pre-training Dataset
+
+Not yet supported; please use the [alpaca](#alpaca-格式) format instead.
+
 ### Preference Dataset

 - [Example dataset](dpo_zh_demo.json)
@@ -336,6 +255,125 @@ Sharegpt 格式的偏好数据集同样需要在 `chosen` 列中提供更优的
 }
 ```

+### KTO Dataset
+
+- [Example dataset](kto_en_demo.json)
+
+KTO datasets require an extra `kto_tag` column containing the boolean human feedback.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "kto_tag": "human feedback [true/false] (required)"
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "kto_tag": "kto_tag"
+  }
+}
+```
+
+### Multimodal Image Dataset
+
+- [Example dataset](mllm_demo.json)
+
+Multimodal image datasets require an extra `images` column containing the paths to the input images.
+
+Note that the number of images must exactly match the number of `<image>` tokens in the conversations.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "<image>human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "images": [
+      "image path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "images": "images"
+  }
+}
+```
+
+### Multimodal Video Dataset
+
+- [Example dataset](mllm_video_demo.json)
+
+Multimodal video datasets require an extra `videos` column containing the paths to the input videos.
+
+Note that the number of videos must exactly match the number of `<video>` tokens in the conversations.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "<video>human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "videos": [
+      "video path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "videos": "videos"
+  }
+}
+```
+
 ### OpenAI Format

 The OpenAI format is simply a special case of the sharegpt format, where the first message may be a system prompt.
@@ -379,7 +417,3 @@ OpenAI 格式仅仅是 sharegpt 格式的一种特殊情况，其中第一条消
  }
 }
 ```
-
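-The KTO datasets and multimodal datasets in sharegpt format are similar to those in the alpaca format.
-
-Pre-training datasets are **not supported** in the sharegpt format.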