update dataset

Former-commit-id: 4fc2c3293d91d8464527ebd1ddabe572c8355616
2026-06-24 16:18:55 +08:00 · 2023-07-23 20:01:43 +08:00
parent 745c46ee04
commit c145bbef3c
3 changed files with 9 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -63,6 +63,10 @@
 - For pre-training:
  - [Wiki Demo (en)](data/wiki_demo.txt)
  - [RefinedWeb (en)](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
  - [StarCoder (en)](https://huggingface.co/datasets/bigcode/starcoderdata)
  - [Wikipedia (en)](https://huggingface.co/datasets/olm/olm-wikipedia-20221220)
  - [Wikipedia (zh)](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)
 - For supervised fine-tuning:
  - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
  - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
--- a/README_zh.md
+++ b/README_zh.md
@@ -63,6 +63,10 @@
 - For continued pre-training:
  - [Wiki Demo (en)](data/wiki_demo.txt)
  - [RefinedWeb (en)](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
  - [StarCoder (en)](https://huggingface.co/datasets/bigcode/starcoderdata)
  - [Wikipedia (en)](https://huggingface.co/datasets/olm/olm-wikipedia-20221220)
  - [Wikipedia (zh)](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)
 - For supervised instruction fine-tuning:
  - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
  - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
--- a/data/oaast_sft.json.REMOVED.git-id
+++ b/data/oaast_sft.json.REMOVED.git-id
@@ -1 +1 @@
-0a57fbc1d8cb08a8cd71c5eb8425cf59206ffed6
+57fd080be5bffe4153fe3ee26a175e3d56da30f3