From c145bbef3cfb3f28f14bcbe9fa51755a8f9adfc9 Mon Sep 17 00:00:00 2001
From: hiyouga
Date: Sun, 23 Jul 2023 20:01:43 +0800
Subject: [PATCH] update dataset

Former-commit-id: 4fc2c3293d91d8464527ebd1ddabe572c8355616
---
 README.md                          | 4 ++++
 README_zh.md                       | 4 ++++
 data/oaast_sft.json.REMOVED.git-id | 2 +-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index efc7b7ff..b19b3143 100644
--- a/README.md
+++ b/README.md
@@ -63,6 +63,10 @@
 
 - For pre-training:
   - [Wiki Demo (en)](data/wiki_demo.txt)
+  - [RefinedWeb (en)](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
+  - [StarCoder (en)](https://huggingface.co/datasets/bigcode/starcoderdata)
+  - [Wikipedia (en)](https://huggingface.co/datasets/olm/olm-wikipedia-20221220)
+  - [Wikipedia (zh)](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)
 - For supervised fine-tuning:
   - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
   - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
diff --git a/README_zh.md b/README_zh.md
index 1699dd86..73b50e95 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -63,6 +63,10 @@
 
 - 用于二次预训练:
   - [Wiki Demo (en)](data/wiki_demo.txt)
+  - [RefinedWeb (en)](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
+  - [StarCoder (en)](https://huggingface.co/datasets/bigcode/starcoderdata)
+  - [Wikipedia (en)](https://huggingface.co/datasets/olm/olm-wikipedia-20221220)
+  - [Wikipedia (zh)](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)
 - 用于指令监督微调:
   - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
   - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
diff --git a/data/oaast_sft.json.REMOVED.git-id b/data/oaast_sft.json.REMOVED.git-id
index 5bac2e5b..fd29e313 100644
--- a/data/oaast_sft.json.REMOVED.git-id
+++ b/data/oaast_sft.json.REMOVED.git-id
@@ -1 +1 @@
-0a57fbc1d8cb08a8cd71c5eb8425cf59206ffed6
\ No newline at end of file
+57fd080be5bffe4153fe3ee26a175e3d56da30f3
\ No newline at end of file