mirror of https://github.com/hiyouga/LLaMA-Factory.git, synced 2025-11-04 18:02:19 +08:00

[misc] update data readme (#8128)

parent 9ae17cd173
commit d2a3036a23

@@ -2,7 +2,7 @@ The [dataset_info.json](dataset_info.json) contains all available datasets. If y

 The `dataset_info.json` file should be put in the `dataset_dir` directory. You can change `dataset_dir` to use another directory. The default value is `./data`.

-Currently we support datasets in **alpaca** and **sharegpt** format.
+Currently we support datasets in **alpaca** and **sharegpt** format. Allowed file types include json, jsonl, csv, parquet, arrow.

 ```json
 "dataset_name": {
@@ -89,7 +89,7 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 ```

 > [!TIP]
-> If the model has reasoning capabilities but the dataset does not contain chain-of-thought (CoT), LLaMA-Factory will automatically add empty CoT to the data. When `enable_thinking` is `True` (slow thinking), the empty CoT will be added to the model responses and loss computation will be considered; otherwise (fast thinking), it will be added to the user prompts and loss computation will be ignored. Please keep the `enable_thinking` parameter consistent during training and inference.
+> If the model has reasoning capabilities (e.g. Qwen3) but the dataset does not contain chain-of-thought (CoT), LLaMA-Factory will automatically add empty CoT to the data. When `enable_thinking` is `True` (slow thinking, by default), the empty CoT will be added to the model responses and loss computation will be considered; otherwise (fast thinking), it will be added to the user prompts and loss computation will be ignored. Please keep the `enable_thinking` parameter consistent during training and inference.
 >
 > If you want to train data containing CoT with slow thinking and data without CoT with fast thinking, you can set `enable_thinking` to `None`. However, this feature is relatively complicated and should be used with caution.
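
The placement rule described in the tip can be sketched in Python. This is an illustration only, not LLaMA-Factory's implementation: the function name `place_empty_cot`, the `EMPTY_COT` string (modeled on Qwen3-style `<think>` tags), and the returned dictionary layout are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical sketch of the empty-CoT placement rule; not LLaMA-Factory code.
EMPTY_COT = "<think>\n\n</think>\n\n"  # assumed marker, modeled on Qwen3-style tags


def place_empty_cot(prompt: str, response: str, enable_thinking: Optional[bool] = True) -> dict:
    """Attach an empty CoT to a sample whose response contains no CoT."""
    if enable_thinking is None:
        # Mixed mode: data without CoT is trained with fast thinking.
        enable_thinking = False
    if enable_thinking:
        # Slow thinking: the empty CoT belongs to the response,
        # so it is covered by the loss.
        return {"prompt": prompt, "response": EMPTY_COT + response, "cot_in_loss": True}
    # Fast thinking: the empty CoT is appended to the prompt,
    # which is excluded from the loss.
    return {"prompt": prompt + EMPTY_COT, "response": response, "cot_in_loss": False}


slow = place_empty_cot("What is 2 + 2?", "4", enable_thinking=True)
fast = place_empty_cot("What is 2 + 2?", "4", enable_thinking=False)
print(repr(slow["response"]))
print(repr(fast["prompt"]))
```

The sketch also shows why the tip asks for a consistent `enable_thinking` value: the two modes put the empty CoT on different sides of the prompt/response boundary, so a model trained in one mode sees a different layout at inference time in the other.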
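
The first README hunk in this commit lists the allowed dataset file types. A minimal sketch of checking a file name against that list follows. The helper `detect_file_type` and the mapping from extension to loader name are hypothetical, introduced only for illustration; treating `jsonl` as a variant of `json` is an assumption.

```python
from pathlib import Path

# File types named in the README change. The mapping to loader names is an
# assumption for illustration; jsonl is treated as line-delimited json.
ALLOWED_TYPES = {
    "json": "json",
    "jsonl": "json",
    "csv": "csv",
    "parquet": "parquet",
    "arrow": "arrow",
}


def detect_file_type(file_name: str) -> str:
    """Return the assumed loader name for a dataset file, or raise ValueError."""
    ext = Path(file_name).suffix.lstrip(".").lower()
    if ext not in ALLOWED_TYPES:
        raise ValueError(f"Unsupported file type: {file_name!r}")
    return ALLOWED_TYPES[ext]


print(detect_file_type("alpaca_data.jsonl"))
print(detect_file_type("sharegpt_data.parquet"))
```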
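@@ -2,7 +2,7 @@

 The `dataset_info.json` file should be placed in the `dataset_dir` directory. You can use another directory by changing the `dataset_dir` parameter. The default value is `./data`.

-We currently support datasets in the **alpaca** and **sharegpt** formats.
+We currently support datasets in the **alpaca** and **sharegpt** formats. Allowed file types include json, jsonl, csv, parquet, and arrow.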

 ```json
 "dataset_name": {
@@ -88,7 +88,7 @@
 ```

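 > [!TIP]
-> If the model itself has reasoning capabilities but the dataset does not contain a chain of thought, LLaMA-Factory automatically adds an empty chain of thought to the data. When `enable_thinking` is `True` (slow thinking), the empty chain of thought is added to the model response and included in the loss; otherwise it is added to the user instruction and excluded from the loss (fast thinking). Keep the `enable_thinking` parameter consistent between training and inference.
+> If the model itself has reasoning capabilities (e.g. Qwen3) but the dataset does not contain a chain of thought, LLaMA-Factory automatically adds an empty chain of thought to the data. When `enable_thinking` is `True` (slow thinking, the default), the empty chain of thought is added to the model response and included in the loss; otherwise it is added to the user instruction and excluded from the loss (fast thinking). Keep the `enable_thinking` parameter consistent between training and inference.
 >
 > If you want to use slow thinking when training on data that contains a chain of thought and fast thinking when training on data that does not, you can set `enable_thinking` to `None`. However, this feature is fairly complex, so use it with caution.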
@@ -4,7 +4,7 @@ accelerate>=0.34.0,<=1.7.0
 peft>=0.14.0,<=0.15.2
 trl>=0.8.6,<=0.9.6
 tokenizers>=0.19.0,<=0.21.1
-gradio>=4.38.0,<=5.29.1
+gradio>=4.38.0,<=5.30.0
 scipy
 einops
 sentencepiece
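
The effect of the `gradio` bound change can be checked with a short Python sketch. It uses only the standard library and assumes plain dotted numeric versions with `>=` and `<=` clauses, as in the lines above; the helpers `parse_version` and `satisfies` are hypothetical names. For full PEP 440 handling, a dedicated library such as `packaging` is the safer choice.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn a plain dotted version such as '5.30.0' into (5, 30, 0)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, requirement: str) -> bool:
    """Check a version against a 'name>=a,<=b' requirement line (sketch only)."""
    match = re.search(r"[<>=]", requirement)
    if match is None:
        return True  # unpinned requirement such as "scipy"
    installed = parse_version(version)
    for clause in requirement[match.start():].split(","):
        parsed = re.fullmatch(r"\s*(>=|<=)\s*([\d.]+)\s*", clause)
        if parsed is None:
            raise ValueError(f"Unsupported clause: {clause!r}")
        op, bound = parsed.group(1), parse_version(parsed.group(2))
        if op == ">=" and not installed >= bound:
            return False
        if op == "<=" and not installed <= bound:
            return False
    return True


old_line = "gradio>=4.38.0,<=5.29.1"
new_line = "gradio>=4.38.0,<=5.30.0"
print(satisfies("5.30.0", old_line))
print(satisfies("5.30.0", new_line))
```

With the old upper bound, version 5.30.0 falls outside the range; with the new bound it is accepted, which is the only change this hunk makes.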