# FP8 training example with DeepSpeed ZeRO-3
# This config demonstrates FP8 mixed precision training using HuggingFace Accelerate
# with DeepSpeed providing memory optimization (not FP8 handling)

### Model configuration
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
trust_remote_code: true

### Method configuration
stage: sft
do_train: true
finetuning_type: full

### Dataset configuration
dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### Output configuration
output_dir: saves/llama3-8b/fp8-deepspeed/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### Training configuration
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### FP8 configuration
fp8: true
fp8_backend: torchao  # Use TorchAO backend for FP8
fp8_enable_fsdp_float8_all_gather: false  # Not used with DeepSpeed

### DeepSpeed configuration
deepspeed: examples/deepspeed/ds_z3_fp8_config.json

### Logging configuration
report_to: wandb
run_name: llama3_fp8_deepspeed_sft
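
### Reference: DeepSpeed ZeRO-3 config sketch (illustrative only)
# The referenced examples/deepspeed/ds_z3_fp8_config.json is not reproduced here.
# The sketch below is an assumption based on a standard ZeRO-3 layout ("auto"
# values are resolved by the trainer at runtime); since FP8 casting is handled by
# Accelerate with the TorchAO backend, the DeepSpeed file only needs the usual
# ZeRO-3 memory settings:
# {
#   "train_batch_size": "auto",
#   "train_micro_batch_size_per_gpu": "auto",
#   "gradient_accumulation_steps": "auto",
#   "gradient_clipping": "auto",
#   "zero_allow_untested_optimizer": true,
#   "bf16": { "enabled": "auto" },
#   "zero_optimization": {
#     "stage": 3,
#     "overlap_comm": true,
#     "contiguous_gradients": true,
#     "stage3_gather_16bit_weights_on_model_save": true
#   }
# }
# Usage note (assumed file location, not specified in this example): save this YAML,
# e.g. as examples/extras/fp8/llama3_fp8_deepspeed.yaml, and launch with
# `llamafactory-cli train <path-to-this-yaml>`.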