diff --git a/data/README.md b/data/README.md
index e6d4ef25..1786804f 100644
--- a/data/README.md
+++ b/data/README.md
@@ -108,7 +108,7 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 
 ### Preference Dataset
 
-Preference datasets are used for reward modeling, DPO training and ORPO training.
+Preference datasets are used for reward modeling, DPO training, ORPO and SimPO training.
 
 It requires a better response in `chosen` column and a worse response in `rejected` column.
 
@@ -140,100 +140,15 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 
 ### KTO Dataset
 
-- [Example dataset](kto_en_demo.json)
-
-KTO datasets require a extra `kto_tag` column containing the boolean human feedback.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "kto_tag": "human feedback [true/false] (required)"
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "kto_tag": "kto_tag"
-  }
-}
-```
+An additional column `kto_tag` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.
 
 ### Multimodal Image Dataset
 
-- [Example dataset](mllm_demo.json)
-
-Multimodal image datasets require a `images` column containing the paths to the input images.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "images": [
-      "image path (required)"
-    ]
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "images": "images"
-  }
-}
-```
+An additional column `images` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.
 
 ### Multimodal Video Dataset
 
-- [Example dataset](mllm_demo_video.json)
-
-Multimodal video datasets require a `videos` column containing the paths to the input videos.
-
-```json
-[
-  {
-    "instruction": "human instruction (required)",
-    "input": "human input (optional)",
-    "output": "model response (required)",
-    "videos": [
-      "video path (required)"
-    ]
-  }
-]
-```
-
-Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
-
-```json
-"dataset_name": {
-  "file_name": "data.json",
-  "columns": {
-    "prompt": "instruction",
-    "query": "input",
-    "response": "output",
-    "videos": "videos"
-  }
-}
-```
+An additional column `videos` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.
 
 ## Sharegpt Format
 
@@ -286,6 +201,10 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```
 
+### Pre-training Dataset
+
+Not yet supported, please use the [alpaca](#alpaca-format) format.
+
 ### Preference Dataset
 
 - [Example dataset](dpo_en_demo.json)
 
@@ -336,6 +255,125 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```
 
+### KTO Dataset
+
+- [Example dataset](kto_en_demo.json)
+
+KTO datasets require a extra `kto_tag` column containing the boolean human feedback.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "kto_tag": "human feedback [true/false] (required)"
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "kto_tag": "kto_tag"
+  }
+}
+```
+
+### Multimodal Image Dataset
+
+- [Example dataset](mllm_demo.json)
+
+Multimodal image datasets require a `images` column containing the paths to the input images.
+
+The number of images should be identical to the `<image>` tokens in the conversations.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "images": [
+      "image path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "images": "images"
+  }
+}
+```
+
+### Multimodal Video Dataset
+
+- [Example dataset](mllm_video_demo.json)
+
+Multimodal video datasets require a `videos` column containing the paths to the input videos.
+
+The number of videos should be identical to the `<video>` tokens in the conversations.