diff --git a/data/README.md b/data/README.md
index 7e030ac1..e6d4ef25 100644
--- a/data/README.md
+++ b/data/README.md
@@ -23,6 +23,7 @@ Currently we support datasets in **alpaca** and **sharegpt** format.
     "system": "the column name in the dataset containing the system prompts. (default: None)",
     "tools": "the column name in the dataset containing the tool description. (default: None)",
     "images": "the column name in the dataset containing the image inputs. (default: None)",
+    "videos": "the column name in the dataset containing the video inputs. (default: None)",
     "chosen": "the column name in the dataset containing the chosen answers. (default: None)",
     "rejected": "the column name in the dataset containing the rejected answers. (default: None)",
     "kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
@@ -168,11 +169,11 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```
 
-### Multimodal Dataset
+### Multimodal Image Dataset
 
 - [Example dataset](mllm_demo.json)
 
-Multimodal datasets require a `images` column containing the paths to the input images.
+Multimodal image datasets require an `images` column containing the paths to the input images.
 
 ```json
 [
@@ -201,6 +202,39 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```
 
+### Multimodal Video Dataset
+
+- [Example dataset](mllm_demo_video.json)
+
+Multimodal video datasets require a `videos` column containing the paths to the input videos.
+
+```json
+[
+  {
+    "instruction": "human instruction (required)",
+    "input": "human input (optional)",
+    "output": "model response (required)",
+    "videos": [
+      "video path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "columns": {
+    "prompt": "instruction",
+    "query": "input",
+    "response": "output",
+    "videos": "videos"
+  }
+}
+```
+
 ## Sharegpt Format
 
 ### Supervised Fine-Tuning Dataset
diff --git a/data/README_zh.md b/data/README_zh.md
index cd0a4b0e..5842a099 100644
--- a/data/README_zh.md
+++ b/data/README_zh.md
@@ -23,6 +23,7 @@
     "system": "数据集代表系统提示的表头名称(默认:None)",
     "tools": "数据集代表工具描述的表头名称(默认:None)",
     "images": "数据集代表图像输入的表头名称(默认:None)",
+    "videos": "数据集代表视频输入的表头名称(默认:None)",
     "chosen": "数据集代表更优回答的表头名称(默认:None)",
     "rejected": "数据集代表更差回答的表头名称(默认:None)",
     "kto_tag": "数据集代表 KTO 标签的表头名称(默认:None)"
@@ -168,11 +169,11 @@ KTO 数据集需要额外添加一个 `kto_tag` 列,包含 bool 类型的人
 }
 ```
 
-### 多模态数据集
+### 多模态图像数据集
 
 - [样例数据集](mllm_demo.json)
 
-多模态数据集需要额外添加一个 `images` 列,包含输入图像的路径。
+多模态图像数据集需要额外添加一个 `images` 列,包含输入图像的路径。
 
 ```json
 [
@@ -201,6 +202,39 @@ KTO 数据集需要额外添加一个 `kto_tag` 列,包含 bool 类型的人
 }
 ```
 
+### 多模态视频数据集
+
+- [样例数据集](mllm_demo_video.json)
+
+多模态视频数据集需要额外添加一个 `videos` 列,包含输入视频的路径。
+
+```json
+[
+  {
+    "instruction": "人类指令(必填)",
+    "input": "人类输入(选填)",
+    "output": "模型回答(必填)",
+    "videos": [
+      "视频路径(必填)"
+    ]
+  }
+]
+```
+
+对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为:
+
+```json
+"数据集名称": {
+  "file_name": "data.json",
+  "columns": {
+    "prompt": "instruction",
+    "query": "input",
+    "response": "output",
+    "videos": "videos"
+  }
+}
+```
+
 ## Sharegpt 格式
 
 ### 指令监督微调数据集