Billy Cao
51e741ec85
[data] shard the dataset to allow multiprocessing when streaming is enabled ( #7530 )
...
* Shard the dataset when streaming to allow multiprocessing
* Allow user to not set dataset_shards to ensure backward compatibility
2025-04-01 15:36:23 +08:00
hoshi-hiyouga
1b1964714e
[misc] update format ( #7277 )
2025-03-13 02:53:08 +08:00
hoshi-hiyouga
efa86e730c
[misc] upgrade format to py39 ( #7256 )
2025-03-12 00:08:41 +08:00
hiyouga
ec251b4614
remove exit in preprocess
...
Former-commit-id: f369b6ef41ffd9586ba568b88c5ff32a1af4bace
2025-03-11 15:08:25 +08:00
hoshi-hiyouga
8c7917d1a2
[data] fix loader ( #7207 )
...
* fix dataloader
* add test case
* fix type
* fix ci
* fix ci
* fix ci
* disable overwrite cache in ci
Former-commit-id: e84af0e140b1aafd1a6d6fe185a8e41c8fc5f831
2025-03-07 17:20:46 +08:00
hoshi-hiyouga
4e661b63e3
[data] fix predict dataset ( #6972 )
...
Former-commit-id: f9a82e527877b1ed47cabb3d34f4d155705f4048
2025-02-17 20:29:40 +08:00
SrWYG
d9ea4baf00
[data] evaluate on each dataset ( #5522 )
...
* [Update] loader.py , evaluate will run separate evaluations on each dataset.
`If you pass a dictionary with names of datasets as keys and datasets as values, evaluate will run separate evaluations on each dataset. This can be useful to monitor how training affects other datasets or simply to get a more fine-grained evaluation`
seq2seqtrainner support eval_dataset as Dict.
* fix format
* fix
* fix
---------
Co-authored-by: hiyouga <hiyouga@buaa.edu.cn>
Former-commit-id: cf00f78650a442c85678ce805e030d2b96cbecd7
2025-02-13 02:19:03 +08:00
hoshi-hiyouga
efef4eaefc
[breaking change] refactor data pipeline ( #6901 )
...
* refactor data
* rename file
Former-commit-id: 7a1a4ce6451cb782573d0bd9dd27a5e443e3a18b
2025-02-13 00:39:20 +08:00
hoshi-hiyouga
40b6e9045d
[misc] update license year & fix llama pro ( #6814 )
...
* fix llamapro script
* change year
Former-commit-id: d9ae594178796994d400a5f207d6499712816f89
2025-02-05 01:53:33 +08:00
hiyouga
760dea0787
imporve log
...
Former-commit-id: a6abf375975ffea3d51e1b944c9855b5f62ffac8
2025-01-08 09:56:10 +00:00
Yaser Afshar
45596f5ae0
Add trust_remote_code parameter and remove True
...
- Introduced a new model parameter `trust_remote_code`
- Set the default value of `trust_remote_code` to `False`
to enhance security
Former-commit-id: 4bf23f406cf5235c16f9f8139850c53354901814
2024-12-17 12:25:12 +00:00
hoshi-hiyouga
50782abce6
lint
...
Former-commit-id: 191ccc585399ad4c6c2c4f280b144b2c0a4869f3
2024-12-04 22:08:27 +08:00
wangdepeng
5c496f2db8
fix:tokenized_path not None and load_from_disk return Dataset Trigger stuck
...
Former-commit-id: cbf9da35728daaf98d92e699e891e334c74af1e5
2024-11-27 16:44:42 +08:00
hiyouga
41c1d014a9
fix #6149
...
Former-commit-id: b581b272793314a9602f4dc2fb646a988a6249df
2024-11-26 16:03:02 +00:00
hiyouga
a117731ecb
support rank0 logger
...
Former-commit-id: 84528eabe560091bfd866b6a0ca864085af7529b
2024-11-02 18:31:04 +08:00
hiyouga
4329e5d4d8
tiny fix
...
Former-commit-id: b8f4b145506851cf5488cd8551a04d1c7603019b
2024-10-30 08:56:29 +00:00
hiyouga
dbbfb5f5dc
use pre-commit
...
Former-commit-id: 7cfede95df22a9ff236788f04159b6b16b8d04bb
2024-10-29 09:07:46 +00:00
hiyouga
916804d11a
tiny fix
...
Former-commit-id: 1fe424323b212094856f423351dc2a15774d39c3
2024-10-11 23:51:54 +08:00
huniu20
28c602357e
add om_hub_token argument
...
Former-commit-id: b3214e69d32067a1c22dbd60c2cde1545ba75b19
2024-10-10 17:16:46 +08:00
huniu20
4affb39ca2
1. add model and dataset info to support webui
...
Former-commit-id: 92f6226f3fecbd9af744a7232dda2c68b2bb0d86
2024-10-10 16:46:34 +08:00
huniu20
c3a040b4a5
1. add modelers hub support
...
Former-commit-id: 14678eb444d8181176745d18d4a6865fd6860f58
2024-10-09 17:21:37 +08:00
hiyouga
fb90faf19a
add docstrings, refactor logger
...
Former-commit-id: c34e489d71f8f539028543ccf8ee92cecedd6276
2024-09-08 00:56:56 +08:00
hiyouga
3756553c25
update get template
...
Former-commit-id: 21ea0d0786f91c0bce79630963e66b815a6792a0
2024-09-04 22:36:20 +08:00
hoshi-hiyouga
1a76d3e09c
Merge pull request #5323 from naem1023/feat/add-dataset-map-batch-size-argument
...
Add batch size of map function in the preprocessed dataset
Former-commit-id: c3428c5807500d087cdee4386798e10e39c9cf30
2024-09-04 22:09:36 +08:00
hoshi-hiyouga
7f94451034
fix #5228
...
Former-commit-id: 0d332ca8d0987c0331361934ab110fafa6402a7e
2024-09-04 19:10:30 +08:00
hiyouga
54d4c3fca7
lazy image load
...
Former-commit-id: cdd733b575411e003bc5ffd6560dd8eff8aa09cf
2024-09-04 02:27:08 +08:00
naem1023
961a1d5ae1
feat: add batch size of map function in the preprocessed dataset
...
Former-commit-id: 94b6cf06c2f84d0619b1a2dccaf8abb51de9951c
2024-09-02 13:52:47 +09:00
hiyouga
727193cdc9
follow #5115
...
Former-commit-id: 7d917e03e2df570139bae18227d9c7303a12de2a
2024-08-09 18:03:00 +08:00
“Wzw”
a53c99ecda
mask_history args verify valid
...
Former-commit-id: 2f8388b4f4195d934400ad9267d72e10ca4105a3
2024-08-08 10:12:01 +08:00
hoshi-hiyouga
f78cd9f9da
Update loader.py
...
Former-commit-id: 860e3eb374947b72dcae88cab0a93ef561e3bfb3
2024-07-15 00:50:06 +08:00
codingma
82e941ff61
1. add custom eval dataset support
...
2. merge load dataset and split dataset function
Former-commit-id: 963d97ba07e7efa3a4544c4d077283d9e112b3ad
2024-07-05 15:52:10 +08:00
hoshi-hiyouga
70410aedc1
Update loader.py
...
Former-commit-id: afa59d61844595e6b615227e6bfdc0b16c8015dd
2024-06-24 23:06:18 +08:00
hiyouga
acfae2e677
add license
...
Former-commit-id: 69cfc98d7c81756a5ab6bf962240e393e449fef0
2024-06-15 17:54:33 +08:00
hiyouga
e8885443a9
fix #4221
...
Former-commit-id: 05a3be4853b941909e7d193c31e8d62c8c5f879b
2024-06-13 02:48:21 +08:00
hiyouga
0b1f4a34f8
rename files
...
Former-commit-id: e1a8431770fc36c0c9ee7fed4abbc3d7fdcc5efd
2024-06-07 00:09:06 +08:00
hiyouga
56a6db6d84
fix ppo dataset bug #4012
...
Former-commit-id: 7fc51b2e93698ae5e012566af8481f4d861c873d
2024-06-06 19:03:20 +08:00
hiyouga
1cc9508fb3
tiny fix
...
Former-commit-id: f9d50501aac1f60a3b445ca3fee9aa60995461ee
2024-06-04 00:31:10 +08:00
hiyouga
920b091581
fix #3992
...
Former-commit-id: a48321fbf5196b88a11106cf74a74fbcea2ea50b
2024-06-04 00:17:36 +08:00
hiyouga
2e843a4cf6
fix data loader hint
...
Former-commit-id: 25b56126a11591b0155e2f72b673dd8f45a6c8c9
2024-06-03 18:28:27 +08:00
hoshi-hiyouga
ae773f9355
Update loader.py
...
Former-commit-id: 0aa59322906d91c5e385c9c02ebb5dd64ba060f3
2024-05-30 00:20:20 +08:00
hoshi-hiyouga
88f4c583d3
Update loader.py
...
Former-commit-id: aa7f335e3ad5a78e4ed5f99c120be28e9733ea2e
2024-05-30 00:17:21 +08:00
hoshi-hiyouga
d5ee485440
Update loader.py
...
Former-commit-id: 19d8fd62c18ee3ba0e431fc241f7d315cb716fef
2024-05-30 00:12:12 +08:00
seanzhang-zhichen
fc6c31127a
Merge branch 'main' into add_dataset_sample_num
...
Former-commit-id: 26300127c45f24e63b91f1b0cc73e46c3a936a91
2024-05-24 15:57:47 +08:00
hiyouga
664cba05e3
refactor data preprocessing, fix mllm rlhf
...
Former-commit-id: 53ff2dd24f9121ea30c95063bb72e49a9b31e980
2024-05-24 04:08:25 +08:00
zhangzc
e84b72f806
fix conflict
...
Former-commit-id: 6922b23a748c2459147bf44b96d86daa89f2c96c
2024-05-20 17:10:01 +08:00
hiyouga
d24969bb7e
improve KTO impl., replace datasets
...
Former-commit-id: e56a57ddcf061de6e4acc8679f7dbf0b68364986
2024-05-18 03:44:56 +08:00
enji.zhou
d16a1d9ed0
add kto
...
Former-commit-id: ec51986cf70b0bdd79b8141e45916670fb97a08e
2024-05-17 13:09:17 +08:00
hiyouga
ee759aa0d8
rename package
...
Former-commit-id: a07ff0c083558cfe6f474d13027642d3052fee08
2024-05-16 18:39:08 +08:00