# FP8 training example with FSDP
# This config demonstrates FP8 mixed precision training using HuggingFace Accelerate
# with FSDP for distributed training and float8 all-gather optimization

### Model configuration
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
trust_remote_code: true

### Method configuration
stage: sft
do_train: true
finetuning_type: full

### Dataset configuration
dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### Output configuration
output_dir: saves/llama3-8b/fp8-fsdp/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### Training configuration
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### FP8 configuration
fp8: true
fp8_backend: torchao  # use the TorchAO backend for FP8
fp8_enable_fsdp_float8_all_gather: true  # enable FSDP2 float8 all-gather optimization

### FSDP configuration (using training arguments - no separate FSDP config file)
fsdp:
  - full_shard
  - auto_wrap
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

### Logging configuration
report_to: wandb
run_name: llama3_fp8_fsdp_sft
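
### Usage (illustrative)
# A minimal launch sketch, assuming the LLaMA-Factory CLI; the file path below is
# an assumption, adjust it to wherever this config is saved:
#   llamafactory-cli train examples/fp8_fsdp_sft.yaml
# FP8 training additionally assumes a recent torchao install and GPUs with native
# FP8 support (e.g. Hopper-class hardware).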