mirror of
https://github.com/hiyouga/LLaMA-Factory.git
synced 2025-08-04 20:52:59 +08:00
34 lines
1.7 KiB
Markdown
34 lines
1.7 KiB
Markdown
If you are using a custom dataset, please provide your dataset definition in the following format in `dataset_info.json`.
|
|
|
|
```json
|
|
"dataset_name": {
|
|
"hf_hub_url": "the name of the dataset repository on the HuggingFace hub. (if specified, ignore below 3 arguments)",
|
|
"script_url": "the name of the directory containing a dataset loading script. (if specified, ignore below 2 arguments)",
|
|
"file_name": "the name of the dataset file in the this directory. (required if above are not specified)",
|
|
"file_sha1": "the SHA-1 hash value of the dataset file. (optional)",
|
|
"columns": {
|
|
"prompt": "the name of the column in the datasets containing the prompts. (default: instruction)",
|
|
"query": "the name of the column in the datasets containing the queries. (default: input)",
|
|
"response": "the name of the column in the datasets containing the responses. (default: output)",
|
|
"history": "the name of the column in the datasets containing the history of chat. (default: None)"
|
|
},
|
|
"stage": "The stage at which the data is being used: pt, sft, and rm, which correspond to pre-training, supervised fine-tuning(PPO), and reward model (DPO) training, respectively.(default: None)"
|
|
}
|
|
```
|
|
|
|
where the `prompt` and `response` columns should contain non-empty values. The `query` column will be concatenated with the `prompt` column and used as input for the model. The `history` column should contain a list where each element is a string tuple representing a query-response pair.
|
|
|
|
For datasets used in reward modeling or DPO training, the `response` column should be a string list, with the preferred answers appearing first, for example:
|
|
|
|
```json
|
|
{
|
|
"instruction": "Question",
|
|
"input": "",
|
|
"output": [
|
|
"Chosen answer",
|
|
"Rejected answer"
|
|
],
|
|
"stage": "rm"
|
|
}
|
|
```
|