[data] support for specifying a dataset in cloud storage (#7567)

* add support for loading datasets from s3/gcs

* add comments to readme

* run linter and address comments

* add option to pass in kwargs to ray init (e.g. runtime env)

* address comment

* revert mixed up changes
Eric Tang
2025-04-09 20:31:35 -07:00
committed by GitHub
parent bb8d79bae2
commit a8caf09c7f
5 changed files with 63 additions and 6 deletions

@@ -554,7 +554,7 @@ pip install .
### Data Preparation
-Please refer to [data/README.md](data/README.md) for checking the details about the format of dataset files. You can either use datasets on HuggingFace / ModelScope / Modelers hub or load the dataset in local disk.
+Please refer to [data/README.md](data/README.md) for checking the details about the format of dataset files. You can use datasets on HuggingFace / ModelScope / Modelers hub, load the dataset in local disk, or specify a path to s3/gcs cloud storage.
> [!NOTE]
> Please update `data/dataset_info.json` to use your custom dataset.
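
The README change above says a dataset path may point at s3/gcs cloud storage as well as local disk. As a minimal sketch of that distinction (not the code from this commit), the snippet below classifies a path by its URI scheme using only the standard library. The names `CLOUD_SCHEMES`, `detect_storage_backend`, and `split_bucket_and_key` are hypothetical, and treating both `gs://` and `gcs://` as GCS prefixes is an assumption.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a backend label.
# Accepting both "gs" and "gcs" for Google Cloud Storage is an assumption.
CLOUD_SCHEMES = {"s3": "s3", "gs": "gcs", "gcs": "gcs"}


def detect_storage_backend(path: str) -> str:
    """Return "s3", "gcs", or "local" depending on the path's URI scheme."""
    scheme = urlparse(path).scheme.lower()
    # Anything without a recognized cloud scheme is treated as a local path.
    return CLOUD_SCHEMES.get(scheme, "local")


def split_bucket_and_key(path: str) -> tuple[str, str]:
    """Split a cloud URI into (bucket, object key); reject non-cloud paths."""
    parsed = urlparse(path)
    if parsed.scheme.lower() not in CLOUD_SCHEMES:
        raise ValueError(f"not a supported cloud storage path: {path!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

A loader could branch on the returned label and hand cloud paths to a filesystem library while reading local paths directly. Which library the commit actually uses is not shown in this hunk.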