[version] fix gradio (#9685 )

[deps] goodbye python 3.9 (#9677 )
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: hiyouga <16256802+hiyouga@users.noreply.github.com> Co-authored-by: hiyouga <hiyouga@buaa.edu.cn>
2025-12-28 09:40:34 +08:00 · 2025-12-28 05:00:51 +08:00 · 2025-12-27 02:50:44 +08:00 · 2025-12-27 02:43:46 +08:00 · 2025-12-27 01:39:13 +08:00 · 2025-12-26 22:47:23 +08:00
209 changed files with 8679 additions and 1150 deletions
--- a/.env.local
+++ b/.env.local
@@ -15,6 +15,7 @@ LLAMAFACTORY_VERBOSITY=
 USE_MODELSCOPE_HUB=
 USE_OPENMIND_HUB=
 USE_RAY=
+USE_KT=
 RECORD_VRAM=
 OPTIM_TORCH=
 NPU_JIT_COMPILE=
@@ -35,6 +36,8 @@ GRADIO_SERVER_NAME=
 GRADIO_SERVER_PORT=
 GRADIO_ROOT_PATH=
 GRADIO_IPV6=
+# backend
+USE_MCA=
 # setup
 ENABLE_SHORT_CONSOLE=
 # reserved (do not use)
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -0,0 +1,180 @@
+# GitHub Copilot Instructions for LLaMA Factory
+
+## Project Overview
+
+LLaMA Factory is an efficient fine-tuning framework for 100+ large language models (LLMs). It provides:
+- Support for various models: LLaMA, LLaVA, Mistral, Qwen, DeepSeek, Yi, Gemma, ChatGLM, Phi, etc.
+- Multiple training methods: pre-training, supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO
+- Scalable resources: 16-bit full-tuning, freeze-tuning, LoRA and QLoRA variants
+- Advanced algorithms: GaLore, BAdam, APOLLO, Adam-mini, Muon, OFT, DoRA, etc.
+- Web UI (LLaMA Board) and CLI interfaces
+
+### Architecture Versions
+
+LLaMA Factory has two parallel architectures that can be switched via the `USE_V1` environment variable:
+
+**v0 (default)** - File hierarchy:
+- `api`, `webui` → `chat`, `eval`, `train` → `data`, `model` → `hparams` → `extras`
+
+**v1** - File hierarchy:
+- `trainers` → `core` → `accelerator`, `plugins`, `config` → `utils`
+
+Set `USE_V1=1` to enable v1 architecture.
+
+## Code Structure
+
+### v0 Architecture (Default)
+
+- `src/llamafactory/` - Main package directory
+  - `api/` - OpenAI-style API implementation
+  - `chat/` - Chat interface implementation
+  - `cli.py` - Command-line interface
+  - `data/` - Data processing and dataset handling
+  - `eval/` - Model evaluation utilities
+  - `extras/` - Additional utilities and helpers
+  - `hparams/` - Hyperparameter definitions
+  - `model/` - Model loading, patching, and utilities
+  - `train/` - Training pipeline implementation
+  - `webui/` - Gradio-based web interface
+- `src/train.py` - Training entry script (delegates to `llamafactory.train.tuner`)
+- `src/webui.py` - Web UI entry script (delegates to `llamafactory.webui.interface`)
+- `src/api.py` - API server entry script (delegates to `llamafactory.api.app`)
+- `tests/` - Test suite
+- `examples/` - Example configurations for various training scenarios
+- `data/` - Dataset definitions and examples
+
+### v1 Architecture (USE_V1=1)
+
+- `src/llamafactory/v1/` - Version 1 package directory
+  - `trainers/` - Training implementations
+  - `core/` - Core training utilities
+  - `accelerator/` - Acceleration and distributed training
+  - `plugins/` - Pluggable components (model, data, sampler, trainer)
+  - `config/` - Configuration management
+  - `utils/` - Utility functions
+
+## Development Practices
+
+### Code Style
+
+- Follow the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)
+- Use ruff for linting and formatting
+- Line length: 119 characters
+- Indentation: 4 spaces
+- Quote style: double quotes
+- Use Google-style docstrings for documentation
+
+### Import Organization
+
+- Known first-party: `llamafactory`
+- Known third-party: `accelerate`, `datasets`, `gradio`, `numpy`, `peft`, `torch`, `transformers`, `trl`
+- Use 2 blank lines after imports
+
+### Quality Checks
+
+Before committing code, run:
+```bash
+make style      # Auto-fix style issues
+make quality    # Check code quality
+make test       # Run test suite
+```
+
+Or use the combined command:
+```bash
+make commit     # Run pre-commit hooks
+```
+
+### Testing
+
+- Use pytest for testing
+- Tests are located in `tests/` and `tests_v1/` directories
+- Run tests with: `make test` (which runs `WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ tests_v1/`)
+- Disable wandb during testing to avoid external dependencies
+- **Note**: Training configurations require GPU machines, so training is typically not tested end-to-end. Use `make test` to validate file-level functionality.
+
+### Building
+
+Build the package with:
+```bash
+pip3 install build && python3 -m build
+```
+
+### License
+
+- All source files must include the Apache 2.0 license header
+- Check license headers with: `make license`
+
+## Common Patterns
+
+### Configuration Files
+
+- Training configurations are typically YAML or JSON files in `examples/` directory
+- Hyperparameters are defined using dataclasses in `src/llamafactory/hparams/`
+
+### Model Support
+
+- New model support is added through model patches in `src/llamafactory/model/`
+- Visual models use the visual utilities in `src/llamafactory/model/model_utils/visual.py`
+- Quantization support is in `src/llamafactory/model/model_utils/quantization.py`
+
+### Data Processing
+
+- Dataset definitions are in `data/dataset_info.json`
+- Data templates and processors are in `src/llamafactory/data/`
+
+### Training
+
+- Training pipelines are in `src/llamafactory/train/`
+- Support for different training methods: SFT, DPO, PPO, RM, PT, KTO, ORPO
+
+## Key Dependencies
+
+- Python >= 3.9.0
+- PyTorch and transformers for model handling
+- datasets for data processing
+- peft for parameter-efficient fine-tuning
+- accelerate for distributed training
+- gradio for web UI
+- trl for reinforcement learning
+- Optional: vllm/sglang for inference, flash-attention-2, unsloth, liger-kernel
+
+## Entry Points
+
+- **CLI Training**: `llamafactory-cli train --config examples/train_lora/llama3_lora_sft.yaml`
+- **Web UI**: `llamafactory-cli webui` or `python src/webui.py`
+- **API Server**: `llamafactory-cli api` or `python src/api.py`
+- **Chat Interface**: `llamafactory-cli chat --model_name_or_path MODEL_PATH`
+
+## Environment Setup
+
+For development:
+```bash
+pip install -e ".[dev]"
+```
+
+## Important Notes
+
+- The project supports multiple backends: default PyTorch, vLLM, SGLang
+- Megatron-core training is supported via mcore_adapter
+- SwanLab and W&B are supported for experiment tracking
+- Docker support is available with pre-built images
+- Day-0/Day-1 support for latest cutting-edge models
+- Multi-modal support for vision and audio understanding tasks
+
+## Contribution Guidelines
+
+1. Fork the repository
+2. Create a development branch
+3. Set up development environment with `pip install -e ".[dev]"`
+4. Make changes following the style guide
+5. Run quality checks: `make style && make quality`
+6. Run tests: `make test`
+7. Submit a pull request
+
+## Common Commands
+
+- `make style` - Format code
+- `make quality` - Run linters
+- `make test` - Run tests
+- `make commit` - Install and run pre-commit hooks
+- `make license` - Check license headers
--- a/.github/workflows/docker.yml
+++ b/.github/workflows/docker.yml
@@ -7,7 +7,7 @@ on:
      - "main"
    paths:
      - "**/*.py"
-      - "requirements.txt"
+      - "pyproject.toml"
      - "docker/**"
      - ".github/workflows/*.yml"
  pull_request:
@@ -15,7 +15,7 @@ on:
      - "main"
    paths:
      - "**/*.py"
-      - "requirements.txt"
+      - "pyproject.toml"
      - "docker/**"
      - ".github/workflows/*.yml"
  release:
@@ -27,9 +27,10 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
-        device:
-          - "cuda"
-          - "npu"
+        include:
+          - device: "cuda"
+          - device: "npu-a2"
+          - device: "npu-a3"

    runs-on: ubuntu-latest

@@ -51,16 +52,11 @@ jobs:
      - name: Checkout
        uses: actions/checkout@v4

-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: "3.10"
-
      - name: Get llamafactory version
        id: version
        run: |
          if [ "${{ github.event_name }}" = "release" ]; then
-            echo "tag=$(python setup.py --version)" >> "$GITHUB_OUTPUT"
+            echo "tag=$(grep -oP 'VERSION = "\K[^"]+' src/llamafactory/extras/env.py)" >> "$GITHUB_OUTPUT"
          else
            echo "tag=latest" >> "$GITHUB_OUTPUT"
          fi
@@ -76,7 +72,7 @@ jobs:
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Login to Quay
-        if: ${{ github.event_name != 'pull_request' && matrix.device == 'npu' }}
+        if: ${{ github.event_name != 'pull_request' && matrix.device == 'npu'}}
        uses: docker/login-action@v3
        with:
          registry: quay.io
@@ -89,16 +85,12 @@ jobs:
        with:
          context: .
          file: ./docker/docker-cuda/Dockerfile
-          build-args: |
-            EXTRAS=metrics,deepspeed,liger-kernel
          push: ${{ github.event_name != 'pull_request' }}
          tags: |
            docker.io/hiyouga/llamafactory:${{ steps.version.outputs.tag }}
-          cache-from: type=gha
-          cache-to: type=gha,mode=max

-      - name: Build and push Docker image (NPU)
-        if: ${{ matrix.device == 'npu' }}
+      - name: Build and push Docker image (NPU-A2)
+        if: ${{ matrix.device == 'npu-a2' }}
        uses: docker/build-push-action@v6
        with:
          context: .
@@ -108,5 +100,17 @@ jobs:
          tags: |
            docker.io/hiyouga/llamafactory:${{ steps.version.outputs.tag }}-npu-a2
            quay.io/ascend/llamafactory:${{ steps.version.outputs.tag }}-npu-a2
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
+
+      - name: Build and push Docker image (NPU-A3)
+        if: ${{ matrix.device == 'npu-a3' }}
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          platforms: linux/amd64,linux/arm64
+          file: ./docker/docker-npu/Dockerfile
+          build-args: |
+            BASE_IMAGE=quay.io/ascend/cann:8.3.rc2-a3-ubuntu22.04-py3.11
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: |
+            docker.io/hiyouga/llamafactory:${{ steps.version.outputs.tag }}-npu-a3
+            quay.io/ascend/llamafactory:${{ steps.version.outputs.tag }}-npu-a3
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -23,10 +23,11 @@ jobs:
      - name: Checkout
        uses: actions/checkout@v4

-      - name: Set up Python
-        uses: actions/setup-python@v5
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
        with:
-          python-version: "3.9"
+          python-version: "3.11"
+          github-token: ${{ github.token }}

      - name: Build package
        run: |
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -7,14 +7,16 @@ on:
      - "main"
    paths:
      - "**/*.py"
-      - "requirements.txt"
+      - "pyproject.toml"
+      - "Makefile"
      - ".github/workflows/*.yml"
  pull_request:
    branches:
      - "main"
    paths:
      - "**/*.py"
-      - "requirements.txt"
+      - "pyproject.toml"
+      - "Makefile"
      - ".github/workflows/*.yml"

 jobs:
@@ -23,10 +25,9 @@ jobs:
      fail-fast: false
      matrix:
        python:
-          - "3.9"
-          - "3.10"
          - "3.11"
          - "3.12"
+          # - "3.13" # enable after trl is upgraded
        os:
          - "ubuntu-latest"
          - "windows-latest"
@@ -34,18 +35,15 @@ jobs:
        transformers:
          - null
        include:  # test backward compatibility
-          - python: "3.9"
+          - python: "3.11"
            os: "ubuntu-latest"
            transformers: "4.49.0"
-          - python: "3.9"
+          - python: "3.11"
            os: "ubuntu-latest"
            transformers: "4.51.0"
-          - python: "3.9"
+          - python: "3.11"
            os: "ubuntu-latest"
            transformers: "4.53.0"
-        exclude:  # exclude python 3.9 on macos
-          - python: "3.9"
-            os: "macos-latest"

    runs-on: ${{ matrix.os }}

@@ -61,28 +59,23 @@ jobs:
      - name: Checkout
        uses: actions/checkout@v4

-      - name: Set up Python
-        uses: actions/setup-python@v5
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
        with:
          python-version: ${{ matrix.python }}
-          cache: "pip"
-          cache-dependency-path: "**/requirements*.txt"
+          github-token: ${{ github.token }}
+          enable-cache: false

      - name: Install dependencies
        run: |
-          python -m pip install --upgrade pip
-          python -m pip install ".[torch,dev]"
+          uv venv
+          uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
+          uv pip install -e ".[dev]"

      - name: Install transformers
        if: ${{ matrix.transformers }}
        run: |
-          python -m pip install "transformers==${{ matrix.transformers }}"
-
-      - name: Update accelerate to avoid mac os ci errors (before accelerate 1.11.0)
-        if: ${{ matrix.os == 'macos-latest' }}
-        run: |
-          python -m pip uninstall -y accelerate
-          python -m pip install "git+https://github.com/huggingface/accelerate.git"
+          uv pip install "transformers==${{ matrix.transformers }}"

      - name: Cache files
        id: hf-hub-cache
@@ -94,18 +87,25 @@ jobs:
      - name: Check quality
        run: |
          make style && make quality
+        env:
+          UV_NO_SYNC: 1

      - name: Check license
        run: |
          make license
+        env:
+          UV_NO_SYNC: 1

      - name: Check build
        run: |
          make build
+        env:
+          UV_NO_SYNC: 1

      - name: Test with pytest
        run: |
          make test
        env:
+          UV_NO_SYNC: 1
          HF_HOME: ${{ runner.temp }}/huggingface
          HF_HUB_OFFLINE: "${{ steps.hf-hub-cache.outputs.cache-hit == 'true' && '1' || '0' }}"
--- a/.github/workflows/tests_npu.yml
+++ b/.github/workflows/tests_npu.yml
@@ -0,0 +1,99 @@
+name: tests_npu
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - "main"
+    paths:
+      - "**/*.py"
+      - "pyproject.toml"
+      - "Makefile"
+      - ".github/workflows/*.yml"
+  pull_request:
+    branches:
+      - "main"
+    paths:
+      - "**/*.py"
+      - "pyproject.toml"
+      - "Makefile"
+      - ".github/workflows/*.yml"
+
+jobs:
+  tests:
+    strategy:
+      fail-fast: false
+      matrix:
+        python:
+          - "3.11"
+        os:
+          - "linux-aarch64-a2-4"
+        pytorch_npu:
+          - "2.7.1"
+
+    runs-on: ${{ matrix.os }}
+
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os }}-${{ matrix.python }}
+      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+    container:
+      image: ascendai/cann:8.3.rc2-910b-ubuntu22.04-py3.11
+      env:
+        HF_ENDPOINT: https://hf-mirror.com
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        OS_NAME: ${{ matrix.os }}
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        run: |
+          curl -LsSf https://astral.sh/uv/install.sh | sh
+
+      - name: Install dependencies
+        run: |
+          uv venv
+          uv pip install torch-npu==${{matrix.pytorch_npu}}
+          uv pip install -e ".[dev]"
+
+      - name: Install node
+        run: |
+          apt-get update || true
+          apt-get install -y curl
+          curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
+          apt-get install -y nodejs
+
+      - name: Cache files
+        id: hf-hub-cache
+        uses: actions/cache@v4
+        with:
+          path: ${{ runner.temp }}/huggingface
+          key: huggingface-${{ matrix.os }}-${{ matrix.python }}-${{ hashFiles('tests/version.txt') }}
+
+      - name: Check quality
+        run: |
+          make style && make quality
+        env:
+          UV_NO_SYNC: 1
+
+      - name: Check license
+        run: |
+          make license
+        env:
+          UV_NO_SYNC: 1
+
+      - name: Check build
+        run: |
+          make build
+        env:
+          UV_NO_SYNC: 1
+
+      - name: Test with pytest
+        run: |
+          make test
+        env:
+          UV_NO_SYNC: 1
+          HF_HOME: /root/.cache/huggingface
+          HF_HUB_OFFLINE: "${{ steps.hf-hub-cache.outputs.cache-hit == 'true' && '1' || '0' }}"
--- a/.gitignore
+++ b/.gitignore
@@ -85,7 +85,7 @@ ipython_config.py
 # pyenv
 #   For a library or package, you might want to ignore these files since the code is
 #   intended to run in multiple environments; otherwise, check them in:
-# .python-version
+.python-version

 # pipenv
 #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
@@ -165,6 +165,9 @@ cython_debug/
 # uv
 uv.lock

+# macOS
+.DS_Store
+
 # custom .gitignore
 hf_cache/
 ms_cache/
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1 +1 @@
-include LICENSE requirements.txt
+include LICENSE
--- a/24
+++ b/24
@@ -1,24 +1,28 @@
 .PHONY: build commit license quality style test

-check_dirs := scripts src tests tests_v1 setup.py
+check_dirs := scripts src tests tests_v1
+
+RUN := $(shell command -v uv >/dev/null 2>&1 && echo "uv run" || echo "")
+BUILD := $(shell command -v uv >/dev/null 2>&1 && echo "uv build" || echo "python -m build")
+TOOL := $(shell command -v uv >/dev/null 2>&1 && echo "uvx" || echo "")

 build:
-	pip3 install build && python3 -m build
+	$(BUILD)

 commit:
-	pre-commit install
-	pre-commit run --all-files
+	$(TOOL) pre-commit install
+	$(TOOL) pre-commit run --all-files

 license:
-	python3 tests/check_license.py $(check_dirs)
+	$(RUN) python3 tests/check_license.py $(check_dirs)

 quality:
-	ruff check $(check_dirs)
-	ruff format --check $(check_dirs)
+	$(TOOL) ruff check $(check_dirs)
+	$(TOOL) ruff format --check $(check_dirs)

 style:
-	ruff check $(check_dirs) --fix
-	ruff format $(check_dirs)
+	$(TOOL) ruff check $(check_dirs) --fix
+	$(TOOL) ruff format $(check_dirs)

 test:
-	CUDA_VISIBLE_DEVICES= WANDB_DISABLED=true pytest -vv tests/
+	WANDB_DISABLED=true $(RUN) pytest -vv --import-mode=importlib tests/ tests_v1/
--- a/README.md
+++ b/README.md
@@ -5,11 +5,13 @@
 [![GitHub contributors](https://img.shields.io/github/contributors/hiyouga/LLaMA-Factory?color=orange)](https://github.com/hiyouga/LLaMA-Factory/graphs/contributors)
 [![GitHub workflow](https://github.com/hiyouga/LLaMA-Factory/actions/workflows/tests.yml/badge.svg)](https://github.com/hiyouga/LLaMA-Factory/actions/workflows/tests.yml)
 [![PyPI](https://img.shields.io/pypi/v/llamafactory)](https://pypi.org/project/llamafactory/)
-[![Citation](https://img.shields.io/badge/citation-840-green)](https://scholar.google.com/scholar?cites=12620864006390196564)
+[![Citation](https://img.shields.io/badge/citation-1000+-green)](https://scholar.google.com/scholar?cites=12620864006390196564)
 [![Docker Pulls](https://img.shields.io/docker/pulls/hiyouga/llamafactory)](https://hub.docker.com/r/hiyouga/llamafactory/tags)

 [![Twitter](https://img.shields.io/twitter/follow/llamafactory_ai)](https://twitter.com/llamafactory_ai)
 [![Discord](assets/thirdparty/discord.svg)](https://discord.gg/rKfvV9r9FK)
+[![WeChat](https://img.shields.io/badge/WeChat-User%20Group-blue?logo=wechat)](https://github.com/hiyouga/llamafactory-community)
+[![Blog](https://img.shields.io/badge/Hugo-Official%20Blog-blue?logo=hugo)](https://blog.llamafactory.net/en/)

 [![Open in Colab](assets/thirdparty/colab.svg)](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)
 [![Open in DSW](assets/thirdparty/dsw.svg)](https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory)
@@ -44,16 +46,20 @@

 https://github.com/user-attachments/assets/3991a3a8-4276-4d30-9cab-4cb0c4b9b99e

-Choose your path:
+Start local training:
+- Please refer to [usage](#getting-started)

+Start cloud training:
+- **Colab (free)**: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
+- **PAI-DSW (free trial)**: https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory
+- **LLaMA Factory Online**: https://www.llamafactory.com.cn/?utm_source=LLaMA-Factory
+- **Alaya NeW (cloud GPU deal)**: https://docs.alayanew.com/docs/documents/useGuide/LLaMAFactory/mutiple/?utm_source=LLaMA-Factory
+
+Read technical notes:
 - **Documentation (WIP)**: https://llamafactory.readthedocs.io/en/latest/
 - **Documentation (AMD GPU)**: https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/fine_tune/llama_factory_llama3.html
- **Colab (free)**: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
- **Local machine**: Please refer to [usage](#getting-started)
- **PAI-DSW (free trial)**: https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory
- **Alaya NeW (cloud GPU deal)**: https://docs.alayanew.com/docs/documents/useGuide/LLaMAFactory/mutiple/?utm_source=LLaMA-Factory
+- **Official Blog**: https://blog.llamafactory.net/en/
 - **Official Course**: https://www.lab4ai.cn/course/detail?id=7c13e60f6137474eb40f6fd3983c0f46&utm_source=LLaMA-Factory
- **LLaMA Factory Online**: https://www.llamafactory.com.cn/?utm_source=LLaMA-Factory

 > [!NOTE]
 > Except for the above links, all other websites are unauthorized third-party websites. Please carefully use them.
@@ -90,7 +96,7 @@ Choose your path:
 - **Integrated methods**: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO, etc.
 - **Scalable resources**: 16-bit full-tuning, freeze-tuning, LoRA and 2/3/4/5/6/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
 - **Advanced algorithms**: [GaLore](https://github.com/jiaweizzhao/GaLore), [BAdam](https://github.com/Ledzy/BAdam), [APOLLO](https://github.com/zhuhanqing/APOLLO), [Adam-mini](https://github.com/zyushun/Adam-mini), [Muon](https://github.com/KellerJordan/Muon), [OFT](https://github.com/huggingface/peft/tree/main/src/peft/tuners/oft), DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and PiSSA.
- **Practical tricks**: [FlashAttention-2](https://github.com/Dao-AILab/flash-attention), [Unsloth](https://github.com/unslothai/unsloth), [Liger Kernel](https://github.com/linkedin/Liger-Kernel), RoPE scaling, NEFTune and rsLoRA.
+- **Practical tricks**: [FlashAttention-2](https://github.com/Dao-AILab/flash-attention), [Unsloth](https://github.com/unslothai/unsloth), [Liger Kernel](https://github.com/linkedin/Liger-Kernel), [KTransformers](https://github.com/kvcache-ai/ktransformers/), RoPE scaling, NEFTune and rsLoRA.
 - **Wide tasks**: Multi-turn dialogue, tool using, image understanding, visual grounding, video recognition, audio understanding, etc.
 - **Experiment monitors**: LlamaBoard, TensorBoard, Wandb, MLflow, [SwanLab](https://github.com/SwanHubX/SwanLab), etc.
 - **Faster inference**: OpenAI-style API, Gradio UI and CLI with [vLLM worker](https://github.com/vllm-project/vllm) or [SGLang worker](https://github.com/sgl-project/sglang).
@@ -104,6 +110,12 @@ Choose your path:

 ## Blogs

+> [!TIP]
+> Now we have a dedicated blog for LLaMA Factory!
+>
+> Website: https://blog.llamafactory.net/en/
+
+- 💡 [KTransformers Fine-Tuning × LLaMA Factory: Fine-tuning 1000 Billion models with 2 4090-GPU + CPU](https://blog.llamafactory.net/en/posts/ktransformers/) (English)
 - 💡 [Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge](https://buaa-act.feishu.cn/wiki/GVzlwYcRFiR8OLkHbL6cQpYin7g) (English)
 - [Fine-tune a mental health LLM using LLaMA-Factory](https://www.lab4ai.cn/project/detail?id=25cce32ec131497b9e06a93336a0817f&type=project&utm_source=LLaMA-Factory) (Chinese)
 - [Fine-tune GPT-OSS for Role-Playing using LLaMA-Factory](https://docs.llamafactory.com.cn/docs/documents/best-practice/gptroleplay/?utm_source=LLaMA-Factory) (Chinese)
@@ -123,6 +135,8 @@ Choose your path:

 ## Changelog

+[25/10/26] We support Megatron-core training backend with [**mcore_adapter**](https://github.com/alibaba/ROLL/tree/main/mcore_adapter). See [PR #9237](https://github.com/hiyouga/LLaMA-Factory/pull/9237) to get started.
+
 [25/08/22] We supported **[OFT](https://arxiv.org/abs/2306.07280)** and **[OFTv2](https://arxiv.org/abs/2506.19847)**. See [examples](examples/README.md) for usage.

 [25/08/20] We supported fine-tuning the **[Intern-S1-mini](https://huggingface.co/internlm/Intern-S1-mini)** models. See [PR #8976](https://github.com/hiyouga/LLaMA-Factory/pull/8976) to get started.
@@ -264,27 +278,21 @@ Choose your path:

 | Model                                                             | Model size                       | Template             |
 | ----------------------------------------------------------------- | -------------------------------- | -------------------- |
-| [Baichuan 2](https://huggingface.co/baichuan-inc)                 | 7B/13B                           | baichuan2            |
 | [BLOOM/BLOOMZ](https://huggingface.co/bigscience)                 | 560M/1.1B/1.7B/3B/7.1B/176B      | -                    |
-| [ChatGLM3](https://huggingface.co/THUDM)                          | 6B                               | chatglm3             |
 | [Command R](https://huggingface.co/CohereForAI)                   | 35B/104B                         | cohere               |
-| [DeepSeek (Code/MoE)](https://huggingface.co/deepseek-ai)         | 7B/16B/67B/236B                  | deepseek             |
-| [DeepSeek 2.5/3](https://huggingface.co/deepseek-ai)              | 236B/671B                        | deepseek3            |
+| [DeepSeek (LLM/Code/MoE)](https://huggingface.co/deepseek-ai)     | 7B/16B/67B/236B                  | deepseek             |
+| [DeepSeek 3-3.2](https://huggingface.co/deepseek-ai)              | 236B/671B                        | deepseek3            |
 | [DeepSeek R1 (Distill)](https://huggingface.co/deepseek-ai)       | 1.5B/7B/8B/14B/32B/70B/671B      | deepseekr1           |
 | [ERNIE-4.5](https://huggingface.co/baidu)                         | 0.3B/21B/300B                    | ernie/ernie_nothink  |
-| [Falcon](https://huggingface.co/tiiuae)                           | 7B/11B/40B/180B                  | falcon               |
-| [Falcon-H1](https://huggingface.co/tiiuae)                        | 0.5B/1.5B/3B/7B/34B              | falcon_h1            |
+| [Falcon/Falcon H1](https://huggingface.co/tiiuae)                 | 0.5B/1.5B/3B/7B/11B/34B/40B/180B | falcon/falcon_h1     |
 | [Gemma/Gemma 2/CodeGemma](https://huggingface.co/google)          | 2B/7B/9B/27B                     | gemma/gemma2         |
 | [Gemma 3/Gemma 3n](https://huggingface.co/google)                 | 270M/1B/4B/6B/8B/12B/27B         | gemma3/gemma3n       |
 | [GLM-4/GLM-4-0414/GLM-Z1](https://huggingface.co/zai-org)         | 9B/32B                           | glm4/glmz1           |
-| [GLM-4.1V](https://huggingface.co/zai-org)                        | 9B                               | glm4v                |
-| [GLM-4.5/GLM-4.5V](https://huggingface.co/zai-org)                | 106B/355B                        | glm4_moe/glm4v_moe   |
+| [GLM-4.5/GLM-4.5(6)V](https://huggingface.co/zai-org)             | 9B/106B/355B                     | glm4_moe/glm4_5v     |
 | [GPT-2](https://huggingface.co/openai-community)                  | 0.1B/0.4B/0.8B/1.5B              | -                    |
-| [GPT-OSS](https://huggingface.co/openai)                          | 20B/120B                         | gpt                  |
-| [Granite 3.0-3.3](https://huggingface.co/ibm-granite)             | 1B/2B/3B/8B                      | granite3             |
-| [Granite 4](https://huggingface.co/ibm-granite)                   | 7B                               | granite4             |
-| [Hunyuan](https://huggingface.co/tencent/)                        | 7B                               | hunyuan              |
-| [Index](https://huggingface.co/IndexTeam)                         | 1.9B                             | index                |
+| [GPT-OSS](https://huggingface.co/openai)                          | 20B/120B                         | gpt_oss              |
+| [Granite 3-4](https://huggingface.co/ibm-granite)                 | 1B/2B/3B/7B/8B                   | granite3/granite4    |
+| [Hunyuan (MT)](https://huggingface.co/tencent/)                   | 7B                               | hunyuan              |
 | [InternLM 2-3](https://huggingface.co/internlm)                   | 7B/8B/20B                        | intern2              |
 | [InternVL 2.5-3.5](https://huggingface.co/OpenGVLab)              | 1B/2B/4B/8B/14B/30B/38B/78B/241B | intern_vl            |
 | [InternLM/Intern-S1-mini](https://huggingface.co/internlm/)       | 8B                               | intern_s1            |
@@ -298,15 +306,13 @@ Choose your path:
 | [LLaVA-1.5](https://huggingface.co/llava-hf)                      | 7B/13B                           | llava                |
 | [LLaVA-NeXT](https://huggingface.co/llava-hf)                     | 7B/8B/13B/34B/72B/110B           | llava_next           |
 | [LLaVA-NeXT-Video](https://huggingface.co/llava-hf)               | 7B/34B                           | llava_next_video     |
-| [MiMo](https://huggingface.co/XiaomiMiMo)                         | 7B                               | mimo                 |
+| [MiMo](https://huggingface.co/XiaomiMiMo)                         | 7B/309B                          | mimo/mimo_v2         |
 | [MiniCPM 1-4.1](https://huggingface.co/openbmb)                   | 0.5B/1B/2B/4B/8B                 | cpm/cpm3/cpm4        |
 | [MiniCPM-o-2.6/MiniCPM-V-2.6](https://huggingface.co/openbmb)     | 8B                               | minicpm_o/minicpm_v  |
-| [Ministral/Mistral-Nemo](https://huggingface.co/mistralai)        | 8B/12B                           | ministral            |
+| [Ministral 3](https://huggingface.co/mistralai)                   | 3B/8B/14B                        | ministral3           |
 | [Mistral/Mixtral](https://huggingface.co/mistralai)               | 7B/8x7B/8x22B                    | mistral              |
-| [Mistral Small](https://huggingface.co/mistralai)                 | 24B                              | mistral_small        |
 | [OLMo](https://huggingface.co/allenai)                            | 1B/7B                            | -                    |
 | [PaliGemma/PaliGemma2](https://huggingface.co/google)             | 3B/10B/28B                       | paligemma            |
-| [Phi-1.5/Phi-2](https://huggingface.co/microsoft)                 | 1.3B/2.7B                        | -                    |
 | [Phi-3/Phi-3.5](https://huggingface.co/microsoft)                 | 4B/14B                           | phi                  |
 | [Phi-3-small](https://huggingface.co/microsoft)                   | 7B                               | phi_small            |
 | [Phi-4](https://huggingface.co/microsoft)                         | 14B                              | phi4                 |
@@ -317,19 +323,18 @@ Choose your path:
 | [Qwen2.5-Omni](https://huggingface.co/Qwen)                       | 3B/7B                            | qwen2_omni           |
 | [Qwen3-Omni](https://huggingface.co/Qwen)                         | 30B                              | qwen3_omni           |
 | [Qwen2-VL/Qwen2.5-VL/QVQ](https://huggingface.co/Qwen)            | 2B/3B/7B/32B/72B                 | qwen2_vl             |
-| [Qwen3-VL](https://huggingface.co/Qwen)                           | 235B                             | qwen3_vl             |
+| [Qwen3-VL](https://huggingface.co/Qwen)                           | 2B/4B/8B/30B/32B/235B            | qwen3_vl             |
 | [Seed (OSS/Coder)](https://huggingface.co/ByteDance-Seed)         | 8B/36B                           | seed_oss/seed_coder  |
-| [Skywork o1](https://huggingface.co/Skywork)                      | 8B                               | skywork_o1           |
 | [StarCoder 2](https://huggingface.co/bigcode)                     | 3B/7B/15B                        | -                    |
-| [TeleChat2](https://huggingface.co/Tele-AI)                       | 3B/7B/35B/115B                   | telechat2            |
-| [XVERSE](https://huggingface.co/xverse)                           | 7B/13B/65B                       | xverse               |
+| [VibeThinker-1.5B](https://huggingface.co/WeiboAI)                | 1.5B                             | qwen3                |
 | [Yi/Yi-1.5 (Code)](https://huggingface.co/01-ai)                  | 1.5B/6B/9B/34B                   | yi                   |
-| [Yi-VL](https://huggingface.co/01-ai)                             | 6B/34B                           | yi_vl                |
 | [Yuan 2](https://huggingface.co/IEITYuan)                         | 2B/51B/102B                      | yuan                 |

 > [!NOTE]
 > For the "base" models, the `template` argument can be chosen from `default`, `alpaca`, `vicuna` etc. But make sure to use the **corresponding template** for the "instruct/chat" models.
 >
+> If the model has both reasoning and non-reasoning versions, please use the `_nothink` suffix to distinguish between them. For example, `qwen3` and `qwen3_nothink`.
+>
 > Remember to use the **SAME** template in training and inference.
 >
 > \*: You should install the `transformers` from main branch and use `DISABLE_VERSION_CHECK=1` to skip version check.
@@ -459,7 +464,7 @@ You also can add a custom chat template to [template.py](src/llamafactory/data/t
 Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

 ```bash
-pip install --upgrade huggingface_hub
+pip install "huggingface_hub<1.0.0"
 huggingface-cli login
 ```

@@ -509,10 +514,12 @@ huggingface-cli login
 ```bash
 git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
 cd LLaMA-Factory
-pip install -e ".[torch,metrics]" --no-build-isolation
+pip install -e ".[metrics]" --no-build-isolation
 ```

-Extra dependencies available: torch, torch-npu, metrics, deepspeed, liger-kernel, bitsandbytes, hqq, eetq, gptq, aqlm, vllm, sglang, galore, apollo, badam, adam-mini, qwen, minicpm_v, openmind, swanlab, dev
+Optional dependencies available: `metrics`, `deepspeed`. Install with: `pip install -e ".[metrics,deepspeed]"`
+
+Additional dependencies for specific features are available in `examples/requirements/`.

 #### Install from Docker Image

@@ -531,13 +538,7 @@ Please refer to [build docker](#build-docker) to build the image yourself.
 Create an isolated Python environment with [uv](https://github.com/astral-sh/uv):

 ```bash
-uv sync --extra torch --extra metrics --prerelease=allow
-```
-
-Run LLaMA-Factory in the isolated environment:
-
-```bash
-uv run --prerelease=allow llamafactory-cli train examples/train_lora/llama3_lora_pretrain.yaml
+uv run llamafactory-cli webui
 ```

 </details>
@@ -574,7 +575,7 @@ To enable FlashAttention-2 on the Windows platform, please use the script from [

 <details><summary>For Ascend NPU users</summary>

-To install LLaMA Factory on Ascend NPU devices, please upgrade Python to version 3.10 or higher and specify extra dependencies: `pip install -e ".[torch-npu,metrics]"`. Additionally, you need to install the **[Ascend CANN Toolkit and Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**. Please follow the [installation tutorial](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/600alphaX/softwareinstall/instg/atlasdeploy_03_0031.html) or use the following commands:
+To install LLaMA Factory on Ascend NPU devices, please upgrade Python to version 3.10 or higher: `pip install -e . torch-npu==2.7.1`. Additionally, you need to install the **[Ascend CANN Toolkit and Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**. Please follow the [installation tutorial](https://www.hiascend.com/document/detail/en/CANNCommunityEdition/600alphaX/softwareinstall/instg/atlasdeploy_03_0031.html) or use the following commands:

 ```bash
 # replace the url according to your CANN version and devices
@@ -593,8 +594,8 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
 | Requirement  | Minimum | Recommend      |
 | ------------ | ------- | -------------- |
 | CANN         | 8.0.RC1 | 8.0.0.alpha002 |
-| torch        | 2.1.0   | 2.4.0          |
-| torch-npu    | 2.1.0   | 2.4.0.post2    |
+| torch        | 2.1.0   | 2.7.1          |
+| torch-npu    | 2.1.0   | 2.7.1          |
 | deepspeed    | 0.13.2  | 0.13.2         |
 | vllm-ascend  | -       | 0.7.3          |

@@ -709,7 +710,6 @@ For CUDA users:
 ```bash
 docker build -f ./docker/docker-cuda/Dockerfile \
    --build-arg PIP_INDEX=https://pypi.org/simple \
-    --build-arg EXTRAS=metrics \
    -t llamafactory:latest .

 docker run -dit --ipc=host --gpus=all \
@@ -726,7 +726,6 @@ For Ascend NPU users:
 ```bash
 docker build -f ./docker/docker-npu/Dockerfile \
    --build-arg PIP_INDEX=https://pypi.org/simple \
-    --build-arg EXTRAS=torch-npu,metrics \
    -t llamafactory:latest .

 docker run -dit --ipc=host \
@@ -751,7 +750,6 @@ For AMD ROCm users:
 ```bash
 docker build -f ./docker/docker-rocm/Dockerfile \
    --build-arg PIP_INDEX=https://pypi.org/simple \
-    --build-arg EXTRAS=metrics \
    -t llamafactory:latest .

 docker run -dit --ipc=host \
--- a/README_zh.md
+++ b/README_zh.md
@@ -5,11 +5,13 @@
 [![GitHub contributors](https://img.shields.io/github/contributors/hiyouga/LLaMA-Factory?color=orange)](https://github.com/hiyouga/LLaMA-Factory/graphs/contributors)
 [![GitHub workflow](https://github.com/hiyouga/LLaMA-Factory/actions/workflows/tests.yml/badge.svg)](https://github.com/hiyouga/LLaMA-Factory/actions/workflows/tests.yml)
 [![PyPI](https://img.shields.io/pypi/v/llamafactory)](https://pypi.org/project/llamafactory/)
-[![Citation](https://img.shields.io/badge/citation-840-green)](https://scholar.google.com/scholar?cites=12620864006390196564)
+[![Citation](https://img.shields.io/badge/citation-1000+-green)](https://scholar.google.com/scholar?cites=12620864006390196564)
 [![Docker Pulls](https://img.shields.io/docker/pulls/hiyouga/llamafactory)](https://hub.docker.com/r/hiyouga/llamafactory/tags)

 [![Twitter](https://img.shields.io/twitter/follow/llamafactory_ai)](https://twitter.com/llamafactory_ai)
 [![Discord](assets/thirdparty/discord.svg)](https://discord.gg/rKfvV9r9FK)
+[![WeChat](https://img.shields.io/badge/WeChat-User%20Group-blue?logo=wechat)](https://github.com/hiyouga/llamafactory-community)
+[![Blog](https://img.shields.io/badge/Hugo-Official%20Blog-blue?logo=hugo)](https://blog.llamafactory.net/)

 [![Open in Colab](assets/thirdparty/colab.svg)](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)
 [![Open in DSW](assets/thirdparty/dsw.svg)](https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory)
@@ -44,18 +46,22 @@

 https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc

-选择你的打开方式：
+开始本地训练：
+- 请见[如何使用](#如何使用)

+开始云端训练：
+- **Colab（免费）**：https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing
+- **PAI-DSW（免费试用）**：https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory
+- **LLaMA Factory Online（在线微调）**：https://www.llamafactory.com.cn/?utm_source=LLaMA-Factory
+- **九章智算云（算力优惠活动）**：https://docs.alayanew.com/docs/documents/useGuide/LLaMAFactory/mutiple/?utm_source=LLaMA-Factory
+
+阅读技术文档：
 - **入门教程**：https://zhuanlan.zhihu.com/p/695287607
 - **微调视频教程**：https://www.bilibili.com/video/BV1djgRzxEts/
 - **框架文档**：https://llamafactory.readthedocs.io/zh-cn/latest/
 - **框架文档（昇腾 NPU）**：https://ascend.github.io/docs/sources/llamafactory/
- **Colab（免费）**：https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing
- **本地机器**：请见[如何使用](#如何使用)
- **PAI-DSW（免费试用）**：https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory
- **九章智算云（算力优惠活动）**：https://docs.alayanew.com/docs/documents/useGuide/LLaMAFactory/mutiple/?utm_source=LLaMA-Factory
+- **官方博客**：https://blog.llamafactory.net/
 - **官方课程**：https://www.lab4ai.cn/course/detail?id=7c13e60f6137474eb40f6fd3983c0f46&utm_source=LLaMA-Factory
- **LLaMA Factory Online（在线微调）**：https://www.llamafactory.com.cn/?utm_source=LLaMA-Factory

 > [!NOTE]
 > 除上述链接以外的其他网站均为未经许可的第三方网站，请小心甄别。
@@ -92,7 +98,7 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc
 - **集成方法**：（增量）预训练、（多模态）指令监督微调、奖励模型训练、PPO 训练、DPO 训练、KTO 训练、ORPO 训练等等。
 - **多种精度**：16 比特全参数微调、冻结微调、LoRA 微调和基于 AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ 的 2/3/4/5/6/8 比特 QLoRA 微调。
 - **先进算法**：[GaLore](https://github.com/jiaweizzhao/GaLore)、[BAdam](https://github.com/Ledzy/BAdam)、[APOLLO](https://github.com/zhuhanqing/APOLLO)、[Adam-mini](https://github.com/zyushun/Adam-mini)、[Muon](https://github.com/KellerJordan/Muon)、[OFT](https://github.com/huggingface/peft/tree/main/src/peft/tuners/oft)、DoRA、LongLoRA、LLaMA Pro、Mixture-of-Depths、LoRA+、LoftQ 和 PiSSA。
- **实用技巧**：[FlashAttention-2](https://github.com/Dao-AILab/flash-attention)、[Unsloth](https://github.com/unslothai/unsloth)、[Liger Kernel](https://github.com/linkedin/Liger-Kernel)、RoPE scaling、NEFTune 和 rsLoRA。
+- **实用技巧**：[FlashAttention-2](https://github.com/Dao-AILab/flash-attention)、[Unsloth](https://github.com/unslothai/unsloth)、[Liger Kernel](https://github.com/linkedin/Liger-Kernel)、[KTransformers](https://github.com/kvcache-ai/ktransformers/)、RoPE scaling、NEFTune 和 rsLoRA。
 - **广泛任务**：多轮对话、工具调用、图像理解、视觉定位、视频识别和语音理解等等。
 - **实验监控**：LlamaBoard、TensorBoard、Wandb、MLflow、[SwanLab](https://github.com/SwanHubX/SwanLab) 等等。
 - **极速推理**：基于 [vLLM](https://github.com/vllm-project/vllm) 或 [SGLang](https://github.com/sgl-project/sglang) 的 OpenAI 风格 API、浏览器界面和命令行接口。
@@ -106,6 +112,12 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc

 ## 官方博客

+> [!TIP]
+> 我们现在拥有了 LLaMA Factory 的专属博客！
+>
+> 网站地址：https://blog.llamafactory.net/
+
+- 💡 [KTransformers Fine-Tuning × LLaMA Factory: 用2张4090级的GPU+CPU 微调 1000B规模的超大模型](https://swcil84qspu.feishu.cn/wiki/Z1sSwb2poijybxkyPEkcDG6enVc) (中文)
 - 💡 [Easy Dataset × LLaMA Factory: 让大模型高效学习领域知识](https://buaa-act.feishu.cn/wiki/KY9xwTGs1iqHrRkjXBwcZP9WnL9)（中文）
 - [使用 LLaMA-Factory 微调心理健康大模型](https://www.lab4ai.cn/project/detail?id=25cce32ec131497b9e06a93336a0817f&type=project&utm_source=LLaMA-Factory)（中文）
 - [使用 LLaMA-Factory 构建 GPT-OSS 角色扮演模型](https://docs.llamafactory.com.cn/docs/documents/best-practice/gptroleplay/?utm_source=LLaMA-Factory)（中文）
@@ -125,6 +137,8 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc

 ## 更新日志

+[25/10/26] 我们支持了Megatron-core作为训练后端和适配了[**mcore_adapter**](https://github.com/alibaba/ROLL/tree/main/mcore_adapter)。查看[PR #9237](https://github.com/hiyouga/LLaMA-Factory/pull/9237)以使用。
+
 [25/08/22] 我们支持了 **[OFT](https://arxiv.org/abs/2306.07280)** 和 **[OFTv2](https://arxiv.org/abs/2506.19847)** 模型的微调。查看 [examples](examples/README.md) 以使用。

 [25/08/20] 我们支持了 **[Intern-S1-mini](https://huggingface.co/internlm/Intern-S1-mini)** 模型的微调。查看 [PR #8976](https://github.com/hiyouga/LLaMA-Factory/pull/8976) 以使用。
@@ -266,27 +280,21 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc

 | 模型名                                                             | 参数量                            | Template             |
 | ----------------------------------------------------------------- | -------------------------------- | -------------------- |
-| [Baichuan 2](https://huggingface.co/baichuan-inc)                 | 7B/13B                           | baichuan2            |
 | [BLOOM/BLOOMZ](https://huggingface.co/bigscience)                 | 560M/1.1B/1.7B/3B/7.1B/176B      | -                    |
-| [ChatGLM3](https://huggingface.co/THUDM)                          | 6B                               | chatglm3             |
 | [Command R](https://huggingface.co/CohereForAI)                   | 35B/104B                         | cohere               |
-| [DeepSeek (Code/MoE)](https://huggingface.co/deepseek-ai)         | 7B/16B/67B/236B                  | deepseek             |
-| [DeepSeek 2.5/3](https://huggingface.co/deepseek-ai)              | 236B/671B                        | deepseek3            |
+| [DeepSeek (LLM/Code/MoE)](https://huggingface.co/deepseek-ai)     | 7B/16B/67B/236B                  | deepseek             |
+| [DeepSeek 3-3.2](https://huggingface.co/deepseek-ai)              | 236B/671B                        | deepseek3            |
 | [DeepSeek R1 (Distill)](https://huggingface.co/deepseek-ai)       | 1.5B/7B/8B/14B/32B/70B/671B      | deepseekr1           |
 | [ERNIE-4.5](https://huggingface.co/baidu)                         | 0.3B/21B/300B                    | ernie/ernie_nothink  |
-| [Falcon](https://huggingface.co/tiiuae)                           | 7B/11B/40B/180B                  | falcon               |
-| [Falcon-H1](https://huggingface.co/tiiuae)                        | 0.5B/1.5B/3B/7B/34B              | falcon_h1            |
+| [Falcon/Falcon H1](https://huggingface.co/tiiuae)                 | 0.5B/1.5B/3B/7B/11B/34B/40B/180B | falcon/falcon_h1     |
 | [Gemma/Gemma 2/CodeGemma](https://huggingface.co/google)          | 2B/7B/9B/27B                     | gemma/gemma2         |
 | [Gemma 3/Gemma 3n](https://huggingface.co/google)                 | 270M/1B/4B/6B/8B/12B/27B         | gemma3/gemma3n       |
 | [GLM-4/GLM-4-0414/GLM-Z1](https://huggingface.co/zai-org)         | 9B/32B                           | glm4/glmz1           |
-| [GLM-4.1V](https://huggingface.co/zai-org)                        | 9B                               | glm4v                |
-| [GLM-4.5/GLM-4.5V](https://huggingface.co/zai-org)                | 106B/355B                        | glm4_moe/glm4v_moe   |
+| [GLM-4.5/GLM-4.5(6)V](https://huggingface.co/zai-org)             | 9B/106B/355B                     | glm4_moe/glm4_5v     |
 | [GPT-2](https://huggingface.co/openai-community)                  | 0.1B/0.4B/0.8B/1.5B              | -                    |
-| [GPT-OSS](https://huggingface.co/openai)                          | 20B/120B                         | gpt                  |
-| [Granite 3.0-3.3](https://huggingface.co/ibm-granite)             | 1B/2B/3B/8B                      | granite3             |
-| [Granite 4](https://huggingface.co/ibm-granite)                   | 7B                               | granite4             |
-| [Hunyuan](https://huggingface.co/tencent/)                        | 7B                               | hunyuan              |
-| [Index](https://huggingface.co/IndexTeam)                         | 1.9B                             | index                |
+| [GPT-OSS](https://huggingface.co/openai)                          | 20B/120B                         | gpt_oss              |
+| [Granite 3-4](https://huggingface.co/ibm-granite)                 | 1B/2B/3B/7B/8B                   | granite3/granite4    |
+| [Hunyuan (MT)](https://huggingface.co/tencent/)                   | 7B                               | hunyuan              |
 | [InternLM 2-3](https://huggingface.co/internlm)                   | 7B/8B/20B                        | intern2              |
 | [InternVL 2.5-3.5](https://huggingface.co/OpenGVLab)              | 1B/2B/4B/8B/14B/30B/38B/78B/241B | intern_vl            |
 | [InternLM/Intern-S1-mini](https://huggingface.co/internlm/)       | 8B                               | intern_s1            |
@@ -300,15 +308,13 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc
 | [LLaVA-1.5](https://huggingface.co/llava-hf)                      | 7B/13B                           | llava                |
 | [LLaVA-NeXT](https://huggingface.co/llava-hf)                     | 7B/8B/13B/34B/72B/110B           | llava_next           |
 | [LLaVA-NeXT-Video](https://huggingface.co/llava-hf)               | 7B/34B                           | llava_next_video     |
-| [MiMo](https://huggingface.co/XiaomiMiMo)                         | 7B                               | mimo                 |
+| [MiMo](https://huggingface.co/XiaomiMiMo)                         | 7B/309B                          | mimo/mimo_v2         |
 | [MiniCPM 1-4.1](https://huggingface.co/openbmb)                   | 0.5B/1B/2B/4B/8B                 | cpm/cpm3/cpm4        |
 | [MiniCPM-o-2.6/MiniCPM-V-2.6](https://huggingface.co/openbmb)     | 8B                               | minicpm_o/minicpm_v  |
-| [Ministral/Mistral-Nemo](https://huggingface.co/mistralai)        | 8B/12B                           | ministral            |
+| [Ministral 3](https://huggingface.co/mistralai)                   | 3B/8B/14B                        | ministral3           |
 | [Mistral/Mixtral](https://huggingface.co/mistralai)               | 7B/8x7B/8x22B                    | mistral              |
-| [Mistral Small](https://huggingface.co/mistralai)                 | 24B                              | mistral_small        |
 | [OLMo](https://huggingface.co/allenai)                            | 1B/7B                            | -                    |
 | [PaliGemma/PaliGemma2](https://huggingface.co/google)             | 3B/10B/28B                       | paligemma            |
-| [Phi-1.5/Phi-2](https://huggingface.co/microsoft)                 | 1.3B/2.7B                        | -                    |
 | [Phi-3/Phi-3.5](https://huggingface.co/microsoft)                 | 4B/14B                           | phi                  |
 | [Phi-3-small](https://huggingface.co/microsoft)                   | 7B                               | phi_small            |
 | [Phi-4](https://huggingface.co/microsoft)                         | 14B                              | phi4                 |
@@ -319,19 +325,18 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc
 | [Qwen2.5-Omni](https://huggingface.co/Qwen)                       | 3B/7B                            | qwen2_omni           |
 | [Qwen3-Omni](https://huggingface.co/Qwen)                         | 30B                              | qwen3_omni           |
 | [Qwen2-VL/Qwen2.5-VL/QVQ](https://huggingface.co/Qwen)            | 2B/3B/7B/32B/72B                 | qwen2_vl             |
-| [Qwen3-VL](https://huggingface.co/Qwen)                           | 235B                             | qwen3_vl             |
+| [Qwen3-VL](https://huggingface.co/Qwen)                           | 2B/4B/8B/30B/32B/235B            | qwen3_vl             |
 | [Seed (OSS/Coder)](https://huggingface.co/ByteDance-Seed)         | 8B/36B                           | seed_oss/seed_coder  |
-| [Skywork o1](https://huggingface.co/Skywork)                      | 8B                               | skywork_o1           |
 | [StarCoder 2](https://huggingface.co/bigcode)                     | 3B/7B/15B                        | -                    |
-| [TeleChat2](https://huggingface.co/Tele-AI)                       | 3B/7B/35B/115B                   | telechat2            |
-| [XVERSE](https://huggingface.co/xverse)                           | 7B/13B/65B                       | xverse               |
+| [VibeThinker-1.5B](https://huggingface.co/WeiboAI)                | 1.5B                             | qwen3                |
 | [Yi/Yi-1.5 (Code)](https://huggingface.co/01-ai)                  | 1.5B/6B/9B/34B                   | yi                   |
-| [Yi-VL](https://huggingface.co/01-ai)                             | 6B/34B                           | yi_vl                |
 | [Yuan 2](https://huggingface.co/IEITYuan)                         | 2B/51B/102B                      | yuan                 |

 > [!NOTE]
 > 对于所有“基座”（Base）模型，`template` 参数可以是 `default`, `alpaca`, `vicuna` 等任意值。但“对话”（Instruct/Chat）模型请务必使用**对应的模板**。
 >
+> 如果模型有推理 / 非推理两个版本，请使用 `_nothink` 后缀来区分不同的模板。例如 `qwen3` 和 `qwen3_nothink`。
+>
 > 请务必在训练和推理时采用**完全一致**的模板。
 >
 > \*：您需要从 main 分支安装 `transformers` 并使用 `DISABLE_VERSION_CHECK=1` 来跳过版本检查。
@@ -511,10 +516,12 @@ huggingface-cli login
 ```bash
 git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
 cd LLaMA-Factory
-pip install -e ".[torch,metrics]" --no-build-isolation
+pip install -e ".[metrics]" --no-build-isolation
 ```

-可选的额外依赖项：torch、torch-npu、metrics、deepspeed、liger-kernel、bitsandbytes、hqq、eetq、gptq、aqlm、vllm、sglang、galore、apollo、badam、adam-mini、qwen、minicpm_v、openmind、swanlab、dev
+可选的额外依赖项：`metrics`、`deepspeed`。使用 `pip install -e ".[metrics,deepspeed]"` 安装。
+
+其他可选依赖项请参考 `examples/requirements/` 目录下的文件。

 #### 从镜像安装

@@ -533,13 +540,7 @@ docker run -it --rm --gpus=all --ipc=host hiyouga/llamafactory:latest
 使用 [uv](https://github.com/astral-sh/uv) 创建隔离的 Python 环境：

 ```bash
-uv sync --extra torch --extra metrics --prerelease=allow
-```
-
-在环境中运行 LLaMA-Factory：
-
-```bash
-uv run --prerelease=allow llamafactory-cli train examples/train_lora/llama3_lora_pretrain.yaml
+uv run llamafactory-cli webui
 ```

 </details>
@@ -576,7 +577,7 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl

 <details><summary>昇腾 NPU 用户指南</summary>

-在昇腾 NPU 设备上安装 LLaMA Factory 时，请升级 Python 到 3.10 及以上，并需要指定额外依赖项，使用 `pip install -e ".[torch-npu,metrics]"` 命令安装。此外，还需要安装 **[Ascend CANN Toolkit 与 Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**，安装方法请参考[安装教程](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/quickstart/quickstart/quickstart_18_0004.html)或使用以下命令：
+在昇腾 NPU 设备上安装 LLaMA Factory 时，请升级 Python 到 3.10 及以上，并需要指定额外依赖项，使用 `pip install -e . torch-npu==2.7.1` 命令安装。此外，还需要安装 **[Ascend CANN Toolkit 与 Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**，安装方法请参考[安装教程](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/quickstart/quickstart/quickstart_18_0004.html)或使用以下命令：

 ```bash
 # 请替换 URL 为 CANN 版本和设备型号对应的 URL
@@ -595,8 +596,8 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
 | 依赖项        | 至少     | 推荐           |
 | ------------ | ------- | -------------- |
 | CANN         | 8.0.RC1 | 8.0.0.alpha002 |
-| torch        | 2.1.0   | 2.4.0          |
-| torch-npu    | 2.1.0   | 2.4.0.post2    |
+| torch        | 2.1.0   | 2.7.1          |
+| torch-npu    | 2.1.0   | 2.7.1          |
 | deepspeed    | 0.13.2  | 0.13.2         |
 | vllm-ascend  | -       | 0.7.3          |

--- a/data/v1_dpo_demo.jsonl
+++ b/data/v1_dpo_demo.jsonl
--- a/data/v1_dpo_demo.yaml
+++ b/data/v1_dpo_demo.yaml
@@ -0,0 +1,4 @@
+dpo_zh_demo:
+  path: HuggingFaceH4/orca_dpo_pairs
+  split: train_prefs
+  converter: pair
--- a/data/v1_sft_demo.yaml
+++ b/data/v1_sft_demo.yaml
@@ -1,8 +1,9 @@
 identity:
-  file_name: identity.json
+  path: data/identity.json
+  source: local
  converter: alpaca
 alpaca_en_demo:
-  file_name: alpaca_en_demo.json
-  dataset_dir: ~/data
+  path: data/alpaca_en_demo.json
+  source: local
  converter: alpaca
-  num_samples: 500
+  size: 500
--- a/docker/docker-cuda/Dockerfile
+++ b/docker/docker-cuda/Dockerfile
@@ -4,7 +4,6 @@ FROM ${BASE_IMAGE}

 # Installation arguments
 ARG PIP_INDEX=https://pypi.org/simple
-ARG EXTRAS=metrics
 ARG INSTALL_FLASHATTN=false
 ARG HTTP_PROXY=""

@@ -27,17 +26,13 @@ WORKDIR /app
 # Change pip source
 RUN pip config set global.index-url "${PIP_INDEX}" && \
    pip config set global.extra-index-url "${PIP_INDEX}" && \
-    pip install --no-cache-dir --upgrade pip packaging wheel setuptools
+    pip install --no-cache-dir --upgrade pip packaging wheel setuptools editables "hatchling>=1.18.0"

-# Install the requirements
-COPY requirements.txt /app
-RUN pip install --no-cache-dir -r requirements.txt
-
-# Copy the rest of the application into the image
+# Copy the application into the image
 COPY . /app

 # Install LLaMA Factory
-RUN pip install --no-cache-dir -e ".[${EXTRAS}]" --no-build-isolation
+RUN pip install --no-cache-dir --no-build-isolation -e ".[metrics,deepspeed]"

 # Rebuild flash attention
 RUN if [ "${INSTALL_FLASHATTN}" == "true" ]; then \
--- a/docker/docker-cuda/Dockerfile.megatron
+++ b/docker/docker-cuda/Dockerfile.megatron
@@ -0,0 +1,77 @@
+# NVIDIA official image (ubuntu-22.04 + cuda-12.4 + python-3.10)
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
+FROM nvcr.io/nvidia/pytorch:24.05-py3
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV PIP_ROOT_USER_ACTION=ignore
+ENV PYPI_MIRROR=https://mirrors.aliyun.com/pypi/simple/
+ENV PYPI_TRUSTED_HOST=mirrors.aliyun.com
+ENV APT_MIRROR=https://mirrors.tuna.tsinghua.edu.cn/ubuntu/
+
+RUN pip install --upgrade pip setuptools wheel "hatchling>=1.18.0" editables --trusted-host ${PYPI_TRUSTED_HOST} --index-url ${PYPI_MIRROR}
+
+RUN pip uninstall -y torch torchvision torch-tensorrt \
+    flash_attn transformer-engine \
+    cudf dask-cuda cugraph cugraph-service-server cuml raft-dask cugraph-dgl cugraph-pyg dask-cudf
+
+RUN pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
+
+RUN pip uninstall -y opencv opencv-python opencv-python-headless && \
+    rm -rf /usr/local/lib/python3.10/dist-packages/cv2/ && \
+    pip install opencv-python-headless==4.11.0.86 --trusted-host ${PYPI_TRUSTED_HOST} --index-url ${PYPI_MIRROR}
+
+RUN pip install "numpy==1.26.4" "optree>=0.13.0" "spacy==3.7.5" "weasel==0.4.1" \
+    transformer-engine[pytorch]==2.2.0 megatron-core==0.13.0 deepspeed==0.16.4 \
+    --trusted-host ${PYPI_TRUSTED_HOST} --index-url ${PYPI_MIRROR}
+
+RUN pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+
+# RUN pip install vllm==0.8.4 \
+#     --trusted-host ${PYPI_TRUSTED_HOST} --index-url ${PYPI_MIRROR}
+
+WORKDIR /build
+
+ARG apex_url=git+https://github.com/NVIDIA/apex.git@25.04
+RUN pip uninstall -y apex && \
+    MAX_JOBS=32 NINJA_FLAGS="-j32" NVCC_APPEND_FLAGS="--threads 32" \
+    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
+    --config-settings "--build-option=--cpp_ext --cuda_ext --parallel 32" ${apex_url}
+
+RUN rm -rf /build
+WORKDIR /workspace
+
+RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
+    { \
+    echo "deb ${APT_MIRROR} jammy main restricted universe multiverse"; \
+    echo "deb ${APT_MIRROR} jammy-security main restricted universe multiverse"; \
+    echo "deb ${APT_MIRROR} jammy-updates main restricted universe multiverse"; \
+    echo "deb ${APT_MIRROR} jammy-backports main restricted universe multiverse"; \
+    } > /etc/apt/sources.list
+
+RUN apt-get update && apt-get install -y zip
+
+RUN apt-get install -y openjdk-21-jdk
+ENV JAVA_HOME /usr/lib/jvm/java-21-openjdk-amd64
+
+# pip install LLaMA-Factory
+WORKDIR /app
+
+# Copy the application into the image
+COPY . /app
+
+# Install LLaMA Factory
+RUN pip install --no-cache-dir -e ".[metrics]" --no-build-isolation
+
+RUN pip install "git+https://github.com/alibaba/roll.git#subdirectory=mcore_adapter"
+
+# Expose port 7860 for LLaMA Board
+ENV GRADIO_SERVER_PORT=7860
+EXPOSE 7860
+
+# Expose port 8000 for API service
+ENV API_PORT=8000
+EXPOSE 8000
+
+# unset proxy
+ENV http_proxy=
+ENV https_proxy=
--- a/docker/docker-cuda/docker-compose.yml
+++ b/docker/docker-cuda/docker-compose.yml
@@ -5,7 +5,6 @@ services:
      context: ../..
      args:
        PIP_INDEX: https://pypi.org/simple
-        EXTRAS: metrics
    container_name: llamafactory
    ports:
      - "7860:7860"
--- a/docker/docker-npu/Dockerfile
+++ b/docker/docker-npu/Dockerfile
@@ -1,10 +1,10 @@
 # https://hub.docker.com/r/ascendai/cann/tags
-ARG BASE_IMAGE=ascendai/cann:8.1.rc1-910b-ubuntu22.04-py3.11
+
+ARG BASE_IMAGE=quay.io/ascend/cann:8.3.rc2-910b-ubuntu22.04-py3.11
 FROM ${BASE_IMAGE}

 # Installation arguments
 ARG PIP_INDEX=https://pypi.org/simple
-ARG EXTRAS=torch-npu,metrics
 ARG HTTP_PROXY=""
 ARG PYTORCH_INDEX=https://download.pytorch.org/whl/cpu

@@ -27,21 +27,15 @@ WORKDIR /app
 # Change pip source
 RUN pip config set global.index-url "${PIP_INDEX}" && \
    pip config set global.extra-index-url "${PIP_INDEX}" && \
-    pip install --no-cache-dir --upgrade pip packaging wheel setuptools
+    pip install --no-cache-dir --upgrade pip packaging wheel setuptools editables "hatchling>=1.18.0"
+
+# Copy the application into the image
+COPY . /app

 # Install torch-npu
 RUN pip uninstall -y torch torchvision torchaudio && \
-    pip install --no-cache-dir "torch-npu==2.5.1" "torchvision==0.20.1" --index-url "${PYTORCH_INDEX}"
-
-# Install the requirements
-COPY requirements.txt /app
-RUN pip install --no-cache-dir -r requirements.txt
-
-# Copy the rest of the application into the image
-COPY . /app
-
-# Install LLaMA Factory
-RUN pip install --no-cache-dir -e ".[${EXTRAS}]" --no-build-isolation
+    pip install --no-cache-dir "torch==2.7.1" "torch-npu==2.7.1" "torchvision==0.22.1" "torchaudio==2.7.1" --index-url "${PYTORCH_INDEX}" && \
+    pip install --no-cache-dir -e ".[metrics]" --no-build-isolation

 # Set up volumes
 # VOLUME [ "/root/.cache/huggingface", "/app/shared_data", "/app/output" ]
--- a/docker/docker-npu/docker-compose.yml
+++ b/docker/docker-npu/docker-compose.yml
@@ -1,12 +1,12 @@
 services:
-  llamafactory:
+  llamafactory-a2:
    build:
      dockerfile: ./docker/docker-npu/Dockerfile
      context: ../..
      args:
        PIP_INDEX: https://pypi.org/simple
-        EXTRAS: torch-npu,metrics
-    container_name: llamafactory
+    container_name: llamafactory-a2
+    image: llamafactory:npu-a2
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
@@ -26,3 +26,33 @@ services:
      - /dev/devmm_svm
      - /dev/hisi_hdc
    restart: unless-stopped
+
+  llamafactory-a3:
+    profiles: ["a3"]
+    build:
+      dockerfile: ./docker/docker-npu/Dockerfile
+      context: ../..
+      args:
+        BASE_IMAGE: quay.io/ascend/cann:8.3.rc2-a3-ubuntu22.04-py3.11
+        PIP_INDEX: https://pypi.org/simple
+    container_name: llamafactory-a3
+    image: llamafactory:npu-a3
+    volumes:
+      - /usr/local/dcmi:/usr/local/dcmi
+      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
+      - /usr/local/Ascend/driver:/usr/local/Ascend/driver
+      - /etc/ascend_install.info:/etc/ascend_install.info
+    ports:
+      - "7861:7860"
+      - "8001:8000"
+    ipc: host
+    tty: true
+    # shm_size: "16gb"  # ipc: host is set
+    stdin_open: true
+    command: bash
+    devices:
+      - /dev/davinci0
+      - /dev/davinci_manager
+      - /dev/devmm_svm
+      - /dev/hisi_hdc
+    restart: unless-stopped
--- a/docker/docker-rocm/Dockerfile
+++ b/docker/docker-rocm/Dockerfile
@@ -4,7 +4,6 @@ FROM ${BASE_IMAGE}

 # Installation arguments
 ARG PIP_INDEX=https://pypi.org/simple
-ARG EXTRAS=metrics
 ARG INSTALL_FLASHATTN=false
 ARG HTTP_PROXY=""
 ARG PYTORCH_INDEX=https://download.pytorch.org/whl/rocm6.3
@@ -28,21 +27,14 @@ WORKDIR /app
 # Change pip source
 RUN pip config set global.index-url "${PIP_INDEX}" && \
    pip config set global.extra-index-url "${PIP_INDEX}" && \
-    pip install --no-cache-dir --upgrade pip packaging wheel setuptools
+    pip install --no-cache-dir --upgrade pip packaging wheel setuptools editables "hatchling>=1.18.0"

-# Reinstall pytorch rocm
-RUN pip uninstall -y torch torchvision torchaudio && \
-    pip install --no-cache-dir --pre torch torchvision torchaudio --index-url "${PYTORCH_INDEX}"
-
-# Install the requirements
-COPY requirements.txt /app
-RUN pip install --no-cache-dir -r requirements.txt
-
-# Copy the rest of the application into the image
+# Copy the application into the image
 COPY . /app

-# Install LLaMA Factory
-RUN pip install --no-cache-dir -e ".[${EXTRAS}]" --no-build-isolation
+# Reinstall pytorch rocm and install LLaMA Factory
+RUN pip uninstall -y torch torchvision torchaudio && \
+    pip install --no-cache-dir --no-build-isolation -e --pre ".[metrics,deepspeed]" --index-url "${PYTORCH_INDEX}"

 # Rebuild flash attention
 RUN if [ "${INSTALL_FLASHATTN}" == "true" ]; then \
--- a/docker/docker-rocm/docker-compose.yml
+++ b/docker/docker-rocm/docker-compose.yml
@@ -5,7 +5,6 @@ services:
      context: ../..
      args:
        PIP_INDEX: https://pypi.org/simple
-        EXTRAS: metrics
    container_name: llamafactory
    ports:
      - "7860:7860"
--- a/examples/accelerate/fsdp2_config.yaml
+++ b/examples/accelerate/fsdp2_config.yaml
@@ -0,0 +1,22 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_offload_params: false
+  fsdp_reshard_after_forward: true
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_version: 2
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16  # or fp16
+num_machines: 1  # the number of nodes
+num_processes: 2  # the number of GPUs in all nodes
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
--- a/examples/accelerate/fsdp_config_multiple_nodes.yaml
+++ b/examples/accelerate/fsdp_config_multiple_nodes.yaml
@@ -0,0 +1,34 @@
+# If you want to run this example on multiple nodes, you need to set the following parameters:
+# - num_machines: the number of nodes
+# - num_processes: the number of GPUs in all nodes, num_machines * num_processes_per_machine
+# - main_process_ip: the IP address of the main process, please keep it the same across all nodes
+# - main_process_port: the port of all nodes, please keep it the same across all nodes
+# - machine_rank: the rank of the current machine, starting from 0, and it should be 0 for main_process_ip
+
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch: BACKWARD_PRE
+  fsdp_forward_prefetch: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_offload_params: false
+  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_sync_module_states: true
+  fsdp_use_orig_params: true
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16  # or fp16
+main_process_ip: 192.168.0.1
+main_process_port: 29500
+num_machines: 2  # the number of nodes
+num_processes: 16  # the number of GPUs in all nodes, num_machines * num_processes_per_machine
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
--- a/examples/ascend/qwen3_full_sft_fsdp2.yaml
+++ b/examples/ascend/qwen3_full_sft_fsdp2.yaml
@@ -0,0 +1,45 @@
+# Start FSDP2 fine-tuning
+# accelerate launch \
+#     --config_file examples/accelerate/fsdp2_config.yaml \
+#     src/train.py examples/ascend/qwen3_full_sft_fsdp2.yaml
+# Change `num_processes` in fsdp2_config.yaml to 16 in A3
+
+### model
+model_name_or_path: Qwen/Qwen3-8B
+trust_remote_code: true
+use_v1_kernels: true
+flash_attn: fa2
+
+### method
+stage: sft
+do_train: true
+finetuning_type: full
+
+### dataset
+dataset: alpaca_en_demo
+template: qwen3
+cutoff_len: 2048
+max_samples: 1000
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/Qwen3-8B/full/sft
+logging_steps: 1
+save_steps: 500
+max_steps: 500
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: false
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 8
+gradient_accumulation_steps: 1
+learning_rate: 1.0e-5
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 1800
+resume_from_checkpoint: null
--- a/examples/ascend/qwen3moe_full_sft_fsdp.yaml
+++ b/examples/ascend/qwen3moe_full_sft_fsdp.yaml
@@ -0,0 +1,46 @@
+# Start FSDP fine-tuning
+# accelerate launch \
+#     --config_file examples/accelerate/fsdp_config.yaml \
+#     src/train.py examples/ascend/qwen3moe_full_sft_fsdp.yaml
+# Change `num_processes` in fsdp_config.yaml to 16 in A3
+
+### model
+model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507
+trust_remote_code: true
+use_v1_kernels: true
+flash_attn: fa2
+
+### method
+stage: sft
+do_train: true
+finetuning_type: full
+disable_gradient_checkpointing: false
+
+### dataset
+dataset: alpaca_zh
+template: qwen3
+cutoff_len: 1024
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/Qwen3-30B-A3B-Instruct-2507/full/sft
+logging_steps: 1
+save_steps: 500
+max_steps: 500
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: true
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 4
+gradient_accumulation_steps: 1
+learning_rate: 1.0e-4
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+resume_from_checkpoint: null
+seed: 1234
--- a/examples/ascend/qwen3vlmoe_full_sft_fsdp2.yaml
+++ b/examples/ascend/qwen3vlmoe_full_sft_fsdp2.yaml
@@ -0,0 +1,48 @@
+# Start FSDP2 fine-tuning
+# accelerate launch \
+#     --config_file examples/accelerate/fsdp2_config.yaml \
+#     src/train.py examples/ascend/qwen3vlmoe_full_sft_fsdp2.yaml
+# Change `num_processes` in fsdp2_config.yaml to 16 in A3
+
+### model
+model_name_or_path: Qwen/Qwen3-VL-30B-A3B-Instruct
+image_max_pixels: 262144
+video_max_pixels: 16384
+trust_remote_code: true
+use_v1_kernels: true
+flash_attn: fa2
+
+### method
+stage: sft
+do_train: true
+finetuning_type: full
+disable_gradient_checkpointing: false
+
+### dataset
+dataset: llava_1k_en, llava_1k_zh
+template: qwen3_vl
+cutoff_len: 1024
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/Qwen3-VL-30B-A3B-Instruct/full/sft
+logging_steps: 1
+save_steps: 500
+max_steps: 500
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: true
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 2
+gradient_accumulation_steps: 1
+learning_rate: 1.0e-4
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+resume_from_checkpoint: null
+seed: 1234
--- a/examples/ascend/qwen3vlmoe_lora_sft_fsdp.yaml
+++ b/examples/ascend/qwen3vlmoe_lora_sft_fsdp.yaml
@@ -0,0 +1,42 @@
+### model
+model_name_or_path: Qwen/Qwen3-VL-30B-A3B-Instruct
+image_max_pixels: 262144
+video_max_pixels: 16384
+trust_remote_code: true
+use_v1_kernels: true  # replaced kernels: [NpuRMSNormKernel, NpuRoPEKernel, NpuQwen3VLMoEFusedMoEKernel]
+
+### method
+stage: sft
+do_train: true
+finetuning_type: lora
+lora_rank: 8
+lora_target: all
+disable_gradient_checkpointing: false
+flash_attn: disabled
+
+### dataset
+dataset: alpaca_zh_demo, alpaca_en_demo
+template: qwen3_vl
+cutoff_len: 1024
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/qwen3vlmoe/lora/sft
+logging_steps: 1
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: true
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 8
+gradient_accumulation_steps: 1
+learning_rate: 1.0e-4
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+resume_from_checkpoint: null
+seed: 1234
--- a/examples/deepspeed/ds_z2_autotp_config.json
+++ b/examples/deepspeed/ds_z2_autotp_config.json
@@ -0,0 +1,32 @@
+{
+  "_comment": "suooprted model list: https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/#supported-models",
+  "train_batch_size": "auto",
+  "train_micro_batch_size_per_gpu": "auto",
+  "gradient_accumulation_steps": "auto",
+  "gradient_clipping": "auto",
+  "zero_allow_untested_optimizer": true,
+  "fp16": {
+    "enabled": "auto",
+    "loss_scale": 0,
+    "loss_scale_window": 1000,
+    "initial_scale_power": 16,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "bf16": {
+    "enabled": "auto"
+  },
+  "zero_optimization": {
+    "stage": 2,
+    "allgather_partitions": true,
+    "allgather_bucket_size": 5e8,
+    "overlap_comm": false,
+    "reduce_scatter": true,
+    "reduce_bucket_size": 5e8,
+    "contiguous_gradients": true,
+    "round_robin_gradients": true
+  },
+  "tensor_parallel": {
+    "autotp_size": 2
+  }
+}
--- a/examples/inference/deepseek2_lora_sft_kt.yaml
+++ b/examples/inference/deepseek2_lora_sft_kt.yaml
@@ -0,0 +1,10 @@
+model_name_or_path: deepseek-ai/DeepSeek-V2-Lite
+adapter_name_or_path: saves/Kllama_deepseekV2
+template: deepseek
+infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
+trust_remote_code: true
+
+use_kt: true # use KTransformers as LoRA sft backend to inference
+kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
+cpu_infer: 32
+chunk_size: 8192
--- a/examples/inference/deepseek3_kt.yaml
+++ b/examples/inference/deepseek3_kt.yaml
@@ -0,0 +1,9 @@
+model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
+template: deepseek
+infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
+trust_remote_code: true
+
+use_kt: true # use KTransformers as LoRA sft backend to inference
+kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
+cpu_infer: 32
+chunk_size: 8192
--- a/examples/inference/deepseek3_lora_sft_kt.yaml
+++ b/examples/inference/deepseek3_lora_sft_kt.yaml
@@ -0,0 +1,10 @@
+model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
+adapter_name_or_path: saves/Kllama_deepseekV3
+template: deepseek
+infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
+trust_remote_code: true
+
+use_kt: true # use KTransformers as LoRA sft backend to inference
+kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
+cpu_infer: 32
+chunk_size: 8192
--- a/examples/inference/llama3.yaml
+++ b/examples/inference/llama3.yaml
@@ -1,4 +1,4 @@
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 template: llama3
-infer_backend: huggingface  # choices: [huggingface, vllm, sglang]
+infer_backend: huggingface  # choices: [huggingface, vllm, sglang, ktransformers]
 trust_remote_code: true
--- a/examples/inference/llama3_full_sft.yaml
+++ b/examples/inference/llama3_full_sft.yaml
@@ -1,4 +1,4 @@
 model_name_or_path: saves/llama3-8b/full/sft
 template: llama3
-infer_backend: huggingface  # choices: [huggingface, vllm, sglang]
+infer_backend: huggingface  # choices: [huggingface, vllm, sglang, ktransformers]
 trust_remote_code: true
--- a/examples/inference/llama3_lora_sft.yaml
+++ b/examples/inference/llama3_lora_sft.yaml
@@ -1,5 +1,5 @@
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 adapter_name_or_path: saves/llama3-8b/lora/sft
 template: llama3
-infer_backend: huggingface  # choices: [huggingface, vllm, sglang]
+infer_backend: huggingface  # choices: [huggingface, vllm, sglang, ktransformers]
 trust_remote_code: true
--- a/examples/inference/qwen2_5vl.yaml
+++ b/examples/inference/qwen2_5vl.yaml
@@ -1,4 +1,4 @@
 model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
 template: qwen2_vl
-infer_backend: huggingface  # choices: [huggingface, vllm, sglang]
+infer_backend: huggingface  # choices: [huggingface, vllm, sglang, ktransformers]
 trust_remote_code: true
--- a/examples/inference/qwen3moe_lora_sft_kt.yaml
+++ b/examples/inference/qwen3moe_lora_sft_kt.yaml
@@ -0,0 +1,10 @@
+model_name_or_path: Qwen/Qwen3-235B-A22B-Instruct-2507
+adapter_name_or_path: saves/Kllama_Qwen3MoE_235bA22b
+template: qwen3_nothink
+infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
+trust_remote_code: true
+
+use_kt: true # use KTransformers as LoRA sft backend to inference
+kt_optimize_rule: examples/kt_optimize_rules/Qwen3Moe-sft-amx.yaml
+cpu_infer: 32
+chunk_size: 8192
--- a/examples/kt_optimize_rules/DeepSeek-V2-Chat-sft-amx.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V2-Chat-sft-amx.yaml
@@ -0,0 +1,69 @@
+- match:
+    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV2MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
--- a/examples/kt_optimize_rules/DeepSeek-V2-Chat.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V2-Chat.yaml
@@ -0,0 +1,68 @@
+- match:
+    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV2MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KExpertsCPU"
+      out_device: "cuda"
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
--- a/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx-multi-gpu.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx-multi-gpu.yaml
@@ -0,0 +1,139 @@
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+        generate_device: "cpu"
+        prefill_device: "cpu"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9])\\."
+    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([12][0-9])\\."
+    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9])\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.([12][0-9])\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV2MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([12][0-9])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV2MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda:0"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op:  "KSFTExpertsCPU"
+      out_device: "cuda:0"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+
+- match:
+    name: "^model\\.layers\\.([12][0-9])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda:1"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op:  "KSFTExpertsCPU"
+      out_device: "cuda:1"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([12][0-9])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+      transfer_map:
+        10: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9])\\."
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "(^model\\.layers\\.([12][0-9])\\.)|(model.norm)"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
--- a/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
@@ -0,0 +1,69 @@
+- match:
+    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV2MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cpu"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
--- a/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft.yaml
@@ -0,0 +1,68 @@
+- match:
+    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV2MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cpu"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda"
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
--- a/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat.yaml
@@ -0,0 +1,68 @@
+- match:
+    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV2MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KExpertsCPU"
+      out_device: "cuda"
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
--- a/examples/kt_optimize_rules/DeepSeek-V3-Chat-amx.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V3-Chat-amx.yaml
@@ -0,0 +1,77 @@
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+
+- match:
+    name: "^lm_head$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KExpertsCPU"
+      out_device: "cuda"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      absorb_for_prefill: False # change this to True to enable long context(prefill may slower).
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
--- a/examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu-4.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu-4.yaml
@@ -0,0 +1,392 @@
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
+
+# === Rotary Embedding Replacement ===
+
+# GPU 0: layers 0–14
+- match:
+    name: "^model\\.layers\\.([0-9]|1[0-4])\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+
+# GPU 1: layers 15–29
+- match:
+    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+# GPU 2: layers 30–44
+- match:
+    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:2"
+      prefill_device: "cuda:2"
+
+# GPU 3: layers 45–60
+- match:
+    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+
+# === Linear Layers Replacement (excluding self_attn.kv_b_proj) ===
+
+# GPU 0: layers 0–14
+- match:
+    name: "^model\\.layers\\.([0-9]|1[0-4])\\.(?!self_attn\\.kv_b_proj).*$"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+# GPU 1: layers 15–29
+- match:
+    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.(?!self_attn\\.kv_b_proj).*$"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+# GPU 2: layers 30–44
+- match:
+    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.(?!self_attn\\.kv_b_proj).*$"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:2"
+      prefill_device: "cuda:2"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+# GPU 3: layers 45–60
+- match:
+    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.(?!self_attn\\.kv_b_proj).*$"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+# === MLP (MoE) Replacement ===
+
+# GPU 0: layers 0–14
+- match:
+    name: "^model\\.layers\\.([0-9]|1[0-4])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+
+# GPU 1: layers 15–29
+- match:
+    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+# GPU 2: layers 30–44
+- match:
+    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE
+    kwargs:
+      generate_device: "cuda:2"
+      prefill_device: "cuda:2"
+
+# GPU 3: layers 45–60
+- match:
+    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+
+# === MLP Gate Replacement ===
+
+# GPU 0: layers 0–14
+- match:
+    name: "^model\\.layers\\.([0-9]|1[0-4])\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+
+# GPU 1: layers 15–29
+- match:
+    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+# GPU 2: layers 30–44
+- match:
+    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:2"
+      prefill_device: "cuda:2"
+
+# GPU 3: layers 45–60
+- match:
+    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+
+# === MLP Experts Replacement ===
+# replace with marlin expert. Open and modify layer-num as needed.
+# Each layer of malin experts takes about 6GB of GPU memory.
+# !!!Do remember 'close' cuda graph if you are using marlin expert.!!!
+# !!!KExpertsTorch is untested, we don't have enough VRAM.!!!
+
+# GPU 0: layers 3–4
+# - match:
+#     name: "^model\\.layers\\.([3-4])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:0"
+#       generate_op:  "KExpertsMarlin"
+#   recursive: False
+
+# # GPU 1: layers 15–17
+# - match:
+#     name: "^model\\.layers\\.(1[5-7])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:1"
+#       generate_op:  "KExpertsMarlin"
+#   recursive: False
+
+# # GPU 2: layers 30–32
+# - match:
+#     name: "^model\\.layers\\.(3[0-2])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:2"
+#       generate_op:  "KExpertsMarlin"
+#   recursive: False
+
+# # GPU 3: layers 45–46
+# - match:
+#     name: "^model\\.layers\\.(4[5-6])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:3"
+#       generate_op:  "KExpertsMarlin"
+#   recursive: False
+
+
+# === MLP Experts Replacement ===
+
+# GPU 0: layers 0–14
+- match:
+    name: "^model\\.layers\\.([0-9]|1[0-4])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts
+    kwargs:
+      prefill_device: "cuda:0"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda:0"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False
+
+# GPU 1: layers 15–29
+- match:
+    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts
+    kwargs:
+      prefill_device: "cuda:1"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda:1"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False
+
+# GPU 2: layers 30–44
+- match:
+    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts
+    kwargs:
+      prefill_device: "cuda:2"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda:2"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False
+
+# GPU 3: layers 45–60
+- match:
+    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts
+    kwargs:
+      prefill_device: "cuda:3"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda:3"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False
+
+# === Self-Attention Replacement ===
+
+# GPU 0: layers 0–14
+- match:
+    name: "^model\\.layers\\.([0-9]|1[0-4])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+      absorb_for_prefill: False
+
+# GPU 1: layers 15–29
+- match:
+    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      absorb_for_prefill: False
+
+# GPU 2: layers 30–44
+- match:
+    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention
+    kwargs:
+      generate_device: "cuda:2"
+      prefill_device: "cuda:2"
+      absorb_for_prefill: False
+
+# GPU 3: layers 45–60
+- match:
+    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+      absorb_for_prefill: False
+
+# === Overall Model Replacement with Transfer Map ===
+
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 means close layer‐wise prefill
+      transfer_map:
+        15: "cuda:1" # Layers 15+ on GPU 1
+        30: "cuda:2" # Layers 30+ on GPU 2
+        45: "cuda:3" # Layers 45+ on GPU 3
+
+# === Default Catch-All for Other Modules ===
+
+# GPU 0: layers 0–14
+- match:
+    name: "^model\\.layers\\.([0-9]|1[0-4])\\."
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+
+# GPU 1: layers 15–29
+- match:
+    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\."
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+# GPU 2: layers 30–44
+- match:
+    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\."
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:2"
+      prefill_device: "cuda:2"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+# For final modules (model.norm), ensure they are on GPU 3 (as in your original config)
+- match:
+    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
--- a/examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
@@ -0,0 +1,156 @@
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+        generate_device: "cpu"
+        prefill_device: "cpu"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.(?!self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.(?!self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda:0"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op:  "KSFTExpertsCPU"
+      out_device: "cuda:0"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda:1"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op:  "KSFTExpertsCPU"
+      out_device: "cuda:1"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+      transfer_map:
+        30: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\."
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "(^model\\.layers\\.([3456][0-9])\\.)|(model.norm)"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
--- a/examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx.yaml
+++ b/examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx.yaml
@@ -0,0 +1,77 @@
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+
+- match:
+    name: "^lm_head$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda"
+      backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      absorb_for_prefill: False # change this to True to enable long context(prefill may slower).
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0 # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
--- a/examples/kt_optimize_rules/Qwen3Moe-sft-amx.yaml
+++ b/examples/kt_optimize_rules/Qwen3Moe-sft-amx.yaml
@@ -0,0 +1,80 @@
+- match:
+    class: ktransformers.models.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.RotaryEmbedding
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+
+- match:
+    name: "^lm_head$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+
+# - match:
+#     name: "^model\\.layers\\..*$"  # regular expression
+#     class: torch.nn.Linear  # only match modules matching name and class simultaneously
+#   replace:
+#     class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+#     kwargs:
+#       generate_device: "cuda"
+#       prefill_device: "cuda"
+#       generate_op: "KLinearTorch"
+#       prefill_op: "KLinearTorch"
+- match:
+    name: "^model\\.layers\\.(?!.*mlp\\.shared_expert_gate).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearTorch"
+      prefill_op: "KLinearTorch"
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+  replace:
+    class: ktransformers.operators.experts.KQwen3MoeSparseMoeBlock     # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KSFTExpertsCPU"
+      out_device: "cuda"
+      backend: "AMXInt8" # or "AMXBF16" or "AMXInt8"
+  recursive: False # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KQwen3MoeAttention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
+
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KQwen3MoeModel"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0
--- a/examples/megatron/qwen2_vl_full.yaml
+++ b/examples/megatron/qwen2_vl_full.yaml
@@ -0,0 +1,29 @@
+model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
+image_max_pixels: 262144
+video_max_pixels: 16384
+
+do_train: true
+stage: sft
+finetuning_type: full # only support full for now
+dataset: llava_1k_en
+preprocessing_num_workers: 8
+cutoff_len: 4096
+template: qwen2_vl
+
+output_dir: saves/mca/qwen2_vl_full
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 2
+num_train_epochs: 2
+learning_rate: 2e-5
+logging_steps: 1
+save_steps: 100
+lr_scheduler_type: cosine
+bf16: true
+
+# mcore speed up
+tensor_model_parallel_size: 4
+sequence_parallel: true
+pipeline_model_parallel_size: 2
+bias_activation_fusion: true
+apply_rope_fusion: true
+use_distributed_optimizer: true
--- a/examples/megatron/qwen3_moe_full.yaml
+++ b/examples/megatron/qwen3_moe_full.yaml
@@ -0,0 +1,35 @@
+model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507
+
+# GPU memory: 8 * 78GB
+do_train: true
+stage: sft
+finetuning_type: full # only support full for now
+dataset: alpaca_en_demo
+preprocessing_num_workers: 8
+cutoff_len: 4096
+template: qwen3_nothink
+
+# global batchsize = (8 // 2 // 4) * 8 = 8
+output_dir: saves/mca/qwen3_moe_full
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 8
+num_train_epochs: 2
+learning_rate: 3e-6
+logging_steps: 1
+save_steps: 100
+lr_scheduler_type: constant
+bf16: true
+
+# mcore speed up
+tensor_model_parallel_size: 1
+sequence_parallel: false
+pipeline_model_parallel_size: 4
+bias_activation_fusion: true
+apply_rope_fusion: true
+use_distributed_optimizer: true
+overlap_param_gather: true
+overlap_grad_reduce: true
+moe_grouped_gemm: true
+moe_token_dispatcher_type: alltoall
+expert_model_parallel_size: 2
+recompute_granularity: full
--- a/examples/requirements/adam-mini.txt
+++ b/examples/requirements/adam-mini.txt
@@ -0,0 +1 @@
+adam-mini
--- a/examples/requirements/apollo.txt
+++ b/examples/requirements/apollo.txt
@@ -0,0 +1 @@
+apollo-torch
--- a/examples/requirements/aqlm.txt
+++ b/examples/requirements/aqlm.txt
@@ -0,0 +1 @@
+aqlm[gpu]>=1.1.0
--- a/examples/requirements/badam.txt
+++ b/examples/requirements/badam.txt
@@ -0,0 +1 @@
+badam>=1.2.1
--- a/examples/requirements/bitsandbytes.txt
+++ b/examples/requirements/bitsandbytes.txt
@@ -0,0 +1 @@
+bitsandbytes>=0.39.0
--- a/examples/requirements/eetq.txt
+++ b/examples/requirements/eetq.txt
@@ -0,0 +1 @@
+eetq
--- a/examples/requirements/fp8-te.txt
+++ b/examples/requirements/fp8-te.txt
@@ -0,0 +1,2 @@
+transformer_engine[pytorch]>=2.0.0
+accelerate>=1.10.0
--- a/examples/requirements/fp8.txt
+++ b/examples/requirements/fp8.txt
@@ -0,0 +1,2 @@
+torchao>=0.8.0
+accelerate>=1.10.0
--- a/examples/requirements/galore.txt
+++ b/examples/requirements/galore.txt
@@ -0,0 +1 @@
+galore-torch
--- a/examples/requirements/gptq.txt
+++ b/examples/requirements/gptq.txt
@@ -0,0 +1,2 @@
+optimum>=1.24.0
+gptqmodel>=2.0.0
--- a/examples/requirements/hqq.txt
+++ b/examples/requirements/hqq.txt
@@ -0,0 +1 @@
+hqq
--- a/examples/requirements/liger-kernel.txt
+++ b/examples/requirements/liger-kernel.txt
@@ -0,0 +1 @@
+liger-kernel>=0.5.5
--- a/examples/requirements/minicpm-v.txt
+++ b/examples/requirements/minicpm-v.txt
@@ -0,0 +1,8 @@
+soundfile
+torchvision
+torchaudio
+vector_quantize_pytorch
+vocos
+msgpack
+referencing
+jsonschema_specifications
--- a/examples/requirements/openmind.txt
+++ b/examples/requirements/openmind.txt
@@ -0,0 +1 @@
+openmind
--- a/examples/requirements/sglang.txt
+++ b/examples/requirements/sglang.txt
@@ -0,0 +1,2 @@
+sglang[srt]>=0.4.5
+transformers==4.51.1
--- a/examples/requirements/swanlab.txt
+++ b/examples/requirements/swanlab.txt
@@ -0,0 +1 @@
+swanlab
--- a/examples/requirements/vllm.txt
+++ b/examples/requirements/vllm.txt
@@ -0,0 +1 @@
+vllm>=0.4.3,<=0.11.0
--- a/examples/train_full/qwen3_full_sft_autotp.yaml
+++ b/examples/train_full/qwen3_full_sft_autotp.yaml
@@ -0,0 +1,46 @@
+### model
+model_name_or_path: Qwen/Qwen3-32B
+trust_remote_code: true
+use_v1_kernels: true
+
+### method
+stage: sft
+do_train: true
+finetuning_type: full
+deepspeed: examples/deepspeed/ds_z2_autotp_config.json
+
+### dataset
+dataset: identity,alpaca_en_demo
+template: qwen3
+cutoff_len: 2048
+max_samples: 1000
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/qwen3-32b/full/sft_autotp
+logging_steps: 1
+save_steps: 500
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: false
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 4
+gradient_accumulation_steps: 1
+learning_rate: 1.0e-4
+num_train_epochs: 3.0
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+resume_from_checkpoint: null
+
+### eval
+# eval_dataset: alpaca_en_demo
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
--- a/examples/train_lora/deepseek2_lora_sft_kt.yaml
+++ b/examples/train_lora/deepseek2_lora_sft_kt.yaml
@@ -0,0 +1,52 @@
+### model
+model_name_or_path: deepseek-ai/DeepSeek-V2-Lite
+trust_remote_code: true
+
+### method
+stage: sft
+do_train: true
+finetuning_type: lora
+lora_rank: 8
+lora_target: all
+
+### dataset
+dataset: identity
+template: deepseek
+cutoff_len: 2048
+max_samples: 100000
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/Kllama_deepseekV2
+logging_steps: 10
+save_steps: 500
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: false
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 8
+learning_rate: 1.0e-4
+num_train_epochs: 3.0
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+resume_from_checkpoint: null
+
+### ktransformers
+use_kt: true # use KTransformers as LoRA sft backend
+kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
+cpu_infer: 32
+chunk_size: 8192
+
+### eval
+# eval_dataset: alpaca_en_demo
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
--- a/examples/train_lora/deepseek3_lora_sft_kt.yaml
+++ b/examples/train_lora/deepseek3_lora_sft_kt.yaml
@@ -0,0 +1,52 @@
+### model
+model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
+trust_remote_code: true
+
+### method
+stage: sft
+do_train: true
+finetuning_type: lora
+lora_rank: 8
+lora_target: all
+
+### dataset
+dataset: identity
+template: deepseek
+cutoff_len: 2048
+max_samples: 100000
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/Kllama_deepseekV3
+logging_steps: 10
+save_steps: 500
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: false
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 8
+learning_rate: 1.0e-4
+num_train_epochs: 3.0
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+resume_from_checkpoint: null
+
+### ktransformers
+use_kt: true # use KTransformers as LoRA sft backend
+kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
+cpu_infer: 32
+chunk_size: 8192
+
+### eval
+# eval_dataset: alpaca_en_demo
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
--- a/examples/train_lora/qwen3moe_lora_sft_kt.yaml
+++ b/examples/train_lora/qwen3moe_lora_sft_kt.yaml
@@ -0,0 +1,52 @@
+### model
+model_name_or_path: Qwen/Qwen3-235B-A22B-Instruct-2507
+trust_remote_code: true
+
+### method
+stage: sft
+do_train: true
+finetuning_type: lora
+lora_rank: 8
+lora_target: all
+
+### dataset
+dataset: identity, alpaca_en_demo
+template: qwen3_nothink
+cutoff_len: 2048
+max_samples: 100000
+overwrite_cache: true
+preprocessing_num_workers: 16
+dataloader_num_workers: 4
+
+### output
+output_dir: saves/Kllama_Qwen3MoE_235bA22b
+logging_steps: 10
+save_steps: 200
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: false
+report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
+
+### train
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 8
+learning_rate: 1.0e-4
+num_train_epochs: 3
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+resume_from_checkpoint: null
+
+### ktransformers
+use_kt: true # use KTransformers as LoRA sft backend
+kt_optimize_rule: examples/kt_optimize_rules/Qwen3Moe-sft-amx.yaml
+cpu_infer: 32
+chunk_size: 8192
+
+### eval
+# eval_dataset: alpaca_en_demo
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,42 +1,123 @@
 [build-system]
-requires = ["setuptools>=61.0"]
-build-backend = "setuptools.build_meta"
+requires = ["hatchling"]
+build-backend = "hatchling.build"

 [project]
 name = "llamafactory"
-requires-python = ">=3.9.0"
-dynamic = [
-    "version",
-    "dependencies",
-    "optional-dependencies",
-    "scripts",
-    "authors",
-    "description",
-    "readme",
-    "license",
-    "keywords",
-    "classifiers"
+dynamic = ["version"]
+description = "Unified Efficient Fine-Tuning of 100+ LLMs"
+readme = "README.md"
+license = "Apache-2.0"
+requires-python = ">=3.11.0"
+authors = [
+    { name = "hiyouga", email = "hiyouga@buaa.edu.cn" }
+]
+keywords = [
+    "AI",
+    "LLM",
+    "GPT",
+    "ChatGPT",
+    "Llama",
+    "Transformer",
+    "DeepSeek",
+    "Pytorch"
+]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Education",
+    "Intended Audience :: Science/Research",
+    "License :: OSI Approved :: Apache Software License",
+    "Operating System :: OS Independent",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence"
+]
+dependencies = [
+    # core deps
+    "torch>=2.4.0",
+    "torchvision>=0.19.0",
+    "torchaudio>=2.4.0",
+    "transformers>=4.49.0,<=4.56.2,!=4.52.0; python_version < '3.10'",
+    "transformers>=4.49.0,<=4.57.1,!=4.52.0,!=4.57.0; python_version >= '3.10'",
+    "datasets>=2.16.0,<=4.0.0",
+    "accelerate>=1.3.0,<=1.11.0",
+    "peft>=0.14.0,<=0.17.1",
+    "trl>=0.8.6,<=0.9.6",
+    "torchdata>=0.10.0,<=0.11.0",
+    # gui
+    "gradio>=4.38.0,<=5.50.0",
+    "matplotlib>=3.7.0",
+    "tyro<0.9.0",
+    # ops
+    "einops",
+    "numpy",
+    "pandas",
+    "scipy",
+    # model and tokenizer
+    "sentencepiece",
+    "tiktoken",
+    "modelscope",
+    "hf-transfer",
+    "safetensors",
+    # python
+    "av",
+    "fire",
+    "omegaconf",
+    "packaging",
+    "protobuf",
+    "pyyaml",
+    "pydantic",
+    # api
+    "uvicorn",
+    "fastapi",
+    "sse-starlette"
 ]

+[project.optional-dependencies]
+dev = ["pre-commit", "ruff", "pytest", "build"]
+metrics = ["nltk", "jieba", "rouge-chinese"]
+deepspeed = ["deepspeed>=0.10.0,<=0.16.9"]
+
+[project.scripts]
+llamafactory-cli = "llamafactory.cli:main"
+lmf = "llamafactory.cli:main"
+
+[project.urls]
+Homepage = "https://github.com/hiyouga/LLaMA-Factory"
+Repository = "https://github.com/hiyouga/LLaMA-Factory"
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/llamafactory"]
+
+[tool.hatch.version]
+path = "src/llamafactory/extras/env.py"
+pattern = "VERSION = \"(?P<version>[^\"]+)\""
+
 [tool.ruff]
-target-version = "py39"
+target-version = "py311"
 line-length = 119
 indent-width = 4

 [tool.ruff.lint]
 ignore = [
-    "C408", # collection
-    "C901", # complex
-    "E501", # line too long
-    "E731", # lambda function
-    "E741", # ambiguous var name
-    "D100", # no doc public module
-    "D101", # no doc public class
-    "D102", # no doc public method
-    "D103", # no doc public function
-    "D104", # no doc public package
-    "D105", # no doc magic method
-    "D107", # no doc __init__
+    "C408",  # collection
+    "C901",  # complex
+    "E501",  # line too long
+    "E731",  # lambda function
+    "E741",  # ambiguous var name
+    "UP007", # no upgrade union
+    "UP045", # no upgrade optional
+    "D100",  # no doc public module
+    "D101",  # no doc public class
+    "D102",  # no doc public method
+    "D103",  # no doc public function
+    "D104",  # no doc public package
+    "D105",  # no doc magic method
+    "D107",  # no doc __init__
 ]
 extend-select = [
    "C",      # complexity
@@ -73,23 +154,3 @@ indent-style = "space"
 docstring-code-format = true
 skip-magic-trailing-comma = false
 line-ending = "auto"
-
-[tool.uv]
-conflicts = [
-    [
-        { extra = "torch-npu" },
-        { extra = "aqlm" },
-    ],
-    [
-        { extra = "torch-npu" },
-        { extra = "vllm" },
-    ],
-    [
-        { extra = "torch-npu" },
-        { extra = "sglang" },
-    ],
-    [
-        { extra = "vllm" },
-        { extra = "sglang" },
-    ],
-]
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,36 +0,0 @@
-# core deps
-transformers>=4.49.0,<=4.56.2,!=4.52.0; python_version < '3.10'
-transformers>=4.49.0,<=4.57.1,!=4.52.0; python_version >= '3.10'
-datasets>=2.16.0,<=4.0.0
-accelerate>=1.3.0,<=1.11.0
-peft>=0.14.0,<=0.17.1
-trl>=0.8.6,<=0.9.6
-# gui
-gradio>=4.38.0,<=5.45.0
-matplotlib>=3.7.0
-tyro<0.9.0
-# ops
-einops
-numpy<2.0.0
-pandas>=2.0.0
-scipy
-# model and tokenizer
-sentencepiece
-tiktoken
-modelscope>=1.14.0
-hf-transfer
-safetensors<=0.5.3
-# python
-fire
-omegaconf
-packaging
-protobuf
-pyyaml
-pydantic<=2.10.6
-# api
-uvicorn
-fastapi
-sse-starlette
-# media
-av
-librosa
--- a/scripts/megatron_merge.py
+++ b/scripts/megatron_merge.py
@@ -0,0 +1,124 @@
+# Copyright 2025 the ROLL team and the LlamaFactory team.
+#
+# This code is modified from the ROLL library.
+# https://github.com/alibaba/ROLL/blob/main/mcore_adapter/tools/convert.py
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import fire
+import torch
+from mcore_adapter.models.converter.post_converter import convert_checkpoint_to_hf, convert_checkpoint_to_mca
+from mcore_adapter.training_args import DistributingParallelArguments
+from mcore_adapter.utils import get_logger
+from transformers import AutoConfig
+
+
+logger = get_logger(__name__)
+
+
+def convert_mca_to_hf(
+    checkpoint_path: str,
+    output_path: str = "./output",
+    bf16: bool = False,
+    fp16: bool = False,
+    convert_model_max_length: int | None = None,
+):
+    """Convert megatron checkpoint to HuggingFace format.
+
+    Args:
+        checkpoint_path: Path to the checkpoint to convert
+        output_path: Path to save the converted checkpoint
+        bf16: Use bfloat16 precision
+        fp16: Use float16 precision
+        convert_model_max_length: Change the model_max_length in hf config.json
+    """
+    if bf16 and fp16:
+        raise ValueError("bf16 and fp16 cannot be both True.")
+
+    torch_dtype = None
+    if bf16:
+        torch_dtype = torch.bfloat16
+    elif fp16:
+        torch_dtype = torch.float16
+
+    convert_checkpoint_to_hf(checkpoint_path, output_path, torch_dtype=torch_dtype)
+
+    if convert_model_max_length is not None:
+        config = AutoConfig.from_pretrained(output_path, trust_remote_code=True)
+        config.model_max_length = convert_model_max_length
+        config.save_pretrained(output_path)
+
+
+def convert(
+    checkpoint_path: str,
+    output_path: str = "./output",
+    bf16: bool = False,
+    fp16: bool = False,
+    convert_model_max_length: int | None = None,
+    tensor_model_parallel_size: int = 1,
+    pipeline_model_parallel_size: int = 1,
+    expert_model_parallel_size: int = 1,
+    virtual_pipeline_model_parallel_size: int | None = None,
+):
+    """Convert checkpoint between MCA and HuggingFace formats.
+
+    Args:
+        checkpoint_path: Path to the checkpoint to convert
+        output_path: Path to save the converted checkpoint
+        bf16: Use bfloat16 precision
+        fp16: Use float16 precision
+        convert_model_max_length: Change the model_max_length in hf config.json
+        tensor_model_parallel_size: Tensor model parallel size
+        pipeline_model_parallel_size: Pipeline model parallel size
+        expert_model_parallel_size: Expert model parallel size
+        virtual_pipeline_model_parallel_size: Virtual pipeline model parallel size
+    """
+    if bf16 and fp16:
+        raise ValueError("bf16 and fp16 cannot be both True.")
+
+    mca_config_path = os.path.join(checkpoint_path, "mca_config.json")
+    from_mca = os.path.exists(mca_config_path)
+
+    if not from_mca:
+        dist_args = DistributingParallelArguments(
+            tensor_model_parallel_size=tensor_model_parallel_size,
+            pipeline_model_parallel_size=pipeline_model_parallel_size,
+            expert_model_parallel_size=expert_model_parallel_size,
+            virtual_pipeline_model_parallel_size=virtual_pipeline_model_parallel_size,
+        )
+
+        convert_checkpoint_to_mca(
+            checkpoint_path,
+            output_path,
+            dist_args,
+            bf16=bf16,
+            fp16=fp16,
+        )
+    else:
+        convert_mca_to_hf(
+            checkpoint_path=checkpoint_path,
+            output_path=output_path,
+            bf16=bf16,
+            fp16=fp16,
+            convert_model_max_length=convert_model_max_length,
+        )
+
+
+def main():
+    fire.Fire(convert)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/stat_utils/cal_ppl.py
+++ b/scripts/stat_utils/cal_ppl.py
@@ -14,7 +14,7 @@

 import json
 from dataclasses import dataclass
-from typing import Any, Literal, Optional
+from typing import Any, Literal

 import fire
 import torch
@@ -61,7 +61,7 @@ def calculate_ppl(
    dataset_dir: str = "data",
    template: str = "default",
    cutoff_len: int = 2048,
-    max_samples: Optional[int] = None,
+    max_samples: int | None = None,
    train_on_prompt: bool = False,
 ):
    r"""Calculate the ppl on the dataset of the pre-trained models.
--- a/scripts/vllm_infer.py
+++ b/scripts/vllm_infer.py
@@ -14,8 +14,8 @@

 import gc
 import json
-from typing import Optional

+import av
 import fire
 from tqdm import tqdm
 from transformers import Seq2SeqTrainingArguments
@@ -33,6 +33,14 @@ if is_vllm_available():
    from vllm.lora.request import LoRARequest


+def _need_video_kwargs(template):
+    NEEDED_TEMPLATE = ["qwen3_vl", "glm4v"]
+    if any(t in template for t in NEEDED_TEMPLATE):
+        return True
+
+    return False
+
+
 def vllm_infer(
    model_name_or_path: str,
    adapter_name_or_path: str = None,
@@ -40,7 +48,7 @@ def vllm_infer(
    dataset_dir: str = "data",
    template: str = "default",
    cutoff_len: int = 2048,
-    max_samples: Optional[int] = None,
+    max_samples: int | None = None,
    vllm_config: str = "{}",
    save_name: str = "generated_predictions.jsonl",
    temperature: float = 0.95,
@@ -49,9 +57,9 @@ def vllm_infer(
    max_new_tokens: int = 1024,
    repetition_penalty: float = 1.0,
    skip_special_tokens: bool = True,
-    default_system: Optional[str] = None,
+    default_system: str | None = None,
    enable_thinking: bool = True,
-    seed: Optional[int] = None,
+    seed: int | None = None,
    pipeline_parallel_size: int = 1,
    image_max_pixels: int = 768 * 768,
    image_min_pixels: int = 32 * 32,
@@ -132,6 +140,7 @@ def vllm_infer(

    # Store all results in these lists
    all_prompts, all_preds, all_labels = [], [], []
+    need_video_kwargs = _need_video_kwargs(template)

    # Add batch process to avoid the issue of too many files opened
    for i in tqdm(range(0, len(train_dataset), batch_size), desc="Processing batched inference"):
@@ -147,6 +156,7 @@ def vllm_infer(
                    )["images"]
                }
            elif batch["videos"][j] is not None:
+                video_metadata, video_metadata_kwargs = None, None
                video = batch["videos"][j]
                multi_modal_data = {
                    "video": template_obj.mm_plugin._regularize_videos(
@@ -157,6 +167,25 @@ def vllm_infer(
                        video_maxlen=video_maxlen,
                    )["videos"]
                }
+                if need_video_kwargs:
+                    container = av.open(video[0], "r")
+                    video_stream = next(stream for stream in container.streams if stream.type == "video")
+                    sampling_indices = template_obj.mm_plugin._get_video_sample_indices(
+                        video_stream, video_fps, video_maxlen
+                    )
+                    total_frames = video_stream.frames
+                    video_metadata_kwargs = {
+                        "fps": getattr(tokenizer_module["processor"], "video_fps", 24.0),
+                        "do_sample_frames": False,
+                        "total_num_frames": total_frames,
+                    }
+                    video_metadata = dict(
+                        fps=video_fps,
+                        frames_indices=sampling_indices,
+                        total_num_frames=total_frames,
+                        video_backend="opencv",
+                    )
+                    multi_modal_data["video"] = (multi_modal_data["video"], video_metadata)
            elif batch["audios"][j] is not None:
                audio = batch["audios"][j]
                audio_data = template_obj.mm_plugin._regularize_audios(
@@ -167,7 +196,11 @@ def vllm_infer(
            else:
                multi_modal_data = None

-            vllm_inputs.append({"prompt_token_ids": batch["input_ids"][j], "multi_modal_data": multi_modal_data})
+            vllm_input_data = {"prompt_token_ids": batch["input_ids"][j], "multi_modal_data": multi_modal_data}
+            if "video_metadata_kwargs" in locals() and video_metadata_kwargs is not None:
+                vllm_input_data["mm_processor_kwargs"] = video_metadata_kwargs
+
+            vllm_inputs.append(vllm_input_data)
            prompts.append(tokenizer.decode(batch["input_ids"][j], skip_special_tokens=skip_special_tokens))
            labels.append(
                tokenizer.decode(
--- a/setup.py
+++ b/setup.py
@@ -1,116 +0,0 @@
-# Copyright 2025 the LlamaFactory team.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import re
-
-from setuptools import find_packages, setup
-
-
-def get_version() -> str:
-    with open(os.path.join("src", "llamafactory", "extras", "env.py"), encoding="utf-8") as f:
-        file_content = f.read()
-        pattern = r"{}\W*=\W*\"([^\"]+)\"".format("VERSION")
-        (version,) = re.findall(pattern, file_content)
-        return version
-
-
-def get_requires() -> list[str]:
-    with open("requirements.txt", encoding="utf-8") as f:
-        file_content = f.read()
-        lines = [line.strip() for line in file_content.strip().split("\n") if not line.startswith("#")]
-        return lines
-
-
-def get_console_scripts() -> list[str]:
-    console_scripts = ["llamafactory-cli = llamafactory.cli:main"]
-    if os.getenv("ENABLE_SHORT_CONSOLE", "1").lower() in ["true", "y", "1"]:
-        console_scripts.append("lmf = llamafactory.cli:main")
-
-    return console_scripts
-
-
-extra_require = {
-    "torch": ["torch>=2.0.0", "torchvision>=0.15.0"],
-    "torch-npu": ["torch-npu==2.5.1", "torchvision==0.20.1", "decorator"],
-    "metrics": ["nltk", "jieba", "rouge-chinese"],
-    "deepspeed": ["deepspeed>=0.10.0,<=0.16.9"],
-    "liger-kernel": ["liger-kernel>=0.5.5"],
-    "bitsandbytes": ["bitsandbytes>=0.39.0"],
-    "hqq": ["hqq"],
-    "eetq": ["eetq"],
-    "gptq": ["optimum>=1.24.0", "gptqmodel>=2.0.0"],
-    "aqlm": ["aqlm[gpu]>=1.1.0"],
-    "vllm": ["vllm>=0.4.3,<=0.11.0"],
-    "sglang": ["sglang[srt]>=0.4.5", "transformers==4.51.1"],
-    "galore": ["galore-torch"],
-    "apollo": ["apollo-torch"],
-    "badam": ["badam>=1.2.1"],
-    "adam-mini": ["adam-mini"],
-    "minicpm_v": [
-        "soundfile",
-        "torchvision",
-        "torchaudio",
-        "vector_quantize_pytorch",
-        "vocos",
-        "msgpack",
-        "referencing",
-        "jsonschema_specifications",
-    ],
-    "openmind": ["openmind"],
-    "swanlab": ["swanlab"],
-    "fp8": ["torchao>=0.8.0", "accelerate>=1.10.0"],
-    "fp8-te": ["transformer_engine[pytorch]>=2.0.0", "accelerate>=1.10.0"],
-    "fp8-all": ["torchao>=0.8.0", "transformer_engine[pytorch]>=2.0.0", "accelerate>=1.10.0"],
-    "dev": ["pre-commit", "ruff", "pytest", "build"],
-}
-
-
-def main():
-    setup(
-        name="llamafactory",
-        version=get_version(),
-        author="hiyouga",
-        author_email="hiyouga@buaa.edu.cn",
-        description="Unified Efficient Fine-Tuning of 100+ LLMs",
-        long_description=open("README.md", encoding="utf-8").read(),
-        long_description_content_type="text/markdown",
-        keywords=["AI", "LLM", "GPT", "ChatGPT", "Llama", "Transformer", "DeepSeek", "Pytorch"],
-        license="Apache 2.0 License",
-        url="https://github.com/hiyouga/LLaMA-Factory",
-        package_dir={"": "src"},
-        packages=find_packages("src"),
-        python_requires=">=3.9.0",
-        install_requires=get_requires(),
-        extras_require=extra_require,
-        entry_points={"console_scripts": get_console_scripts()},
-        classifiers=[
-            "Development Status :: 4 - Beta",
-            "Intended Audience :: Developers",
-            "Intended Audience :: Education",
-            "Intended Audience :: Science/Research",
-            "License :: OSI Approved :: Apache Software License",
-            "Operating System :: OS Independent",
-            "Programming Language :: Python :: 3",
-            "Programming Language :: Python :: 3.9",
-            "Programming Language :: Python :: 3.10",
-            "Programming Language :: Python :: 3.11",
-            "Programming Language :: Python :: 3.12",
-            "Topic :: Scientific/Engineering :: Artificial Intelligence",
-        ],
-    )
-
-
-if __name__ == "__main__":
-    main()
--- a/src/llamafactory/api/app.py
+++ b/src/llamafactory/api/app.py
@@ -16,7 +16,7 @@ import asyncio
 import os
 from contextlib import asynccontextmanager
 from functools import partial
-from typing import Annotated, Optional
+from typing import Annotated

 from ..chat import ChatModel
 from ..extras.constants import EngineName
@@ -79,7 +79,7 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":
    api_key = os.getenv("API_KEY")
    security = HTTPBearer(auto_error=False)

-    async def verify_api_key(auth: Annotated[Optional[HTTPAuthorizationCredentials], Depends(security)]):
+    async def verify_api_key(auth: Annotated[HTTPAuthorizationCredentials | None, Depends(security)]):
        if api_key and (auth is None or auth.credentials != api_key):
            raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid API key.")

--- a/src/llamafactory/api/protocol.py
+++ b/src/llamafactory/api/protocol.py
@@ -14,10 +14,9 @@

 import time
 from enum import Enum, unique
-from typing import Any, Optional, Union
+from typing import Any, Literal

 from pydantic import BaseModel, Field
-from typing_extensions import Literal


@unique
@@ -61,7 +60,7 @@ class FunctionDefinition(BaseModel):

 class FunctionAvailable(BaseModel):
    type: Literal["function", "code_interpreter"] = "function"
-    function: Optional[FunctionDefinition] = None
+    function: FunctionDefinition | None = None


 class FunctionCall(BaseModel):
@@ -77,35 +76,35 @@ class URL(BaseModel):

 class MultimodalInputItem(BaseModel):
    type: Literal["text", "image_url", "video_url", "audio_url"]
-    text: Optional[str] = None
-    image_url: Optional[URL] = None
-    video_url: Optional[URL] = None
-    audio_url: Optional[URL] = None
+    text: str | None = None
+    image_url: URL | None = None
+    video_url: URL | None = None
+    audio_url: URL | None = None


 class ChatMessage(BaseModel):
    role: Role
-    content: Optional[Union[str, list[MultimodalInputItem]]] = None
-    tool_calls: Optional[list[FunctionCall]] = None
+    content: str | list[MultimodalInputItem] | None = None
+    tool_calls: list[FunctionCall] | None = None


 class ChatCompletionMessage(BaseModel):
-    role: Optional[Role] = None
-    content: Optional[str] = None
-    tool_calls: Optional[list[FunctionCall]] = None
+    role: Role | None = None
+    content: str | None = None
+    tool_calls: list[FunctionCall] | None = None


 class ChatCompletionRequest(BaseModel):
    model: str
    messages: list[ChatMessage]
-    tools: Optional[list[FunctionAvailable]] = None
-    do_sample: Optional[bool] = None
-    temperature: Optional[float] = None
-    top_p: Optional[float] = None
+    tools: list[FunctionAvailable] | None = None
+    do_sample: bool | None = None
+    temperature: float | None = None
+    top_p: float | None = None
    n: int = 1
-    presence_penalty: Optional[float] = None
-    max_tokens: Optional[int] = None
-    stop: Optional[Union[str, list[str]]] = None
+    presence_penalty: float | None = None
+    max_tokens: int | None = None
+    stop: str | list[str] | None = None
    stream: bool = False


@@ -118,7 +117,7 @@ class ChatCompletionResponseChoice(BaseModel):
 class ChatCompletionStreamResponseChoice(BaseModel):
    index: int
    delta: ChatCompletionMessage
-    finish_reason: Optional[Finish] = None
+    finish_reason: Finish | None = None


 class ChatCompletionResponseUsage(BaseModel):
@@ -147,7 +146,7 @@ class ChatCompletionStreamResponse(BaseModel):
 class ScoreEvaluationRequest(BaseModel):
    model: str
    messages: list[str]
-    max_length: Optional[int] = None
+    max_length: int | None = None


 class ScoreEvaluationResponse(BaseModel):
--- a/src/llamafactory/chat/chat_model.py
+++ b/src/llamafactory/chat/chat_model.py
@@ -71,6 +71,16 @@ class ChatModel:
                    "SGLang not install, you may need to run `pip install sglang[all]`\n"
                    "or try to use HuggingFace backend: --infer_backend huggingface"
                ) from e
+        elif model_args.infer_backend == EngineName.KT:
+            try:
+                from .kt_engine import KTransformersEngine
+
+                self.engine: BaseEngine = KTransformersEngine(model_args, data_args, finetuning_args, generating_args)
+            except ImportError as e:
+                raise ImportError(
+                    "KTransformers not install, you may need to run `pip install ktransformers`\n"
+                    "or try to use HuggingFace backend: --infer_backend huggingface"
+                ) from e
        else:
            raise NotImplementedError(f"Unknown backend: {model_args.infer_backend}")

--- a/src/llamafactory/chat/hf_engine.py
+++ b/src/llamafactory/chat/hf_engine.py
@@ -14,9 +14,9 @@

 import asyncio
 import os
-from collections.abc import AsyncGenerator
+from collections.abc import AsyncGenerator, Callable
 from threading import Thread
-from typing import TYPE_CHECKING, Any, Callable, Optional, Union
+from typing import TYPE_CHECKING, Any, Optional, Union

 import torch
 from transformers import GenerationConfig, TextIteratorStreamer
--- a/src/llamafactory/chat/kt_engine.py
+++ b/src/llamafactory/chat/kt_engine.py
@@ -0,0 +1,284 @@
+# Copyright 2025 the KVCache.AI team, Approaching AI, and the LlamaFactory team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import asyncio
+import os
+import platform
+from collections.abc import AsyncGenerator
+from threading import Thread
+from typing import TYPE_CHECKING, Any, Optional
+
+import torch
+from typing_extensions import override
+
+from ..data import get_template_and_fix_tokenizer
+from ..extras import logging
+from ..extras.constants import EngineName
+from ..model import load_model, load_tokenizer
+from .base_engine import BaseEngine, Response
+
+
+if TYPE_CHECKING:
+    from transformers import PreTrainedTokenizer
+    from trl import PreTrainedModelWrapper
+
+    from ..data.mm_plugin import AudioInput, ImageInput, VideoInput
+    from ..hparams import DataArguments, FinetuningArguments, GeneratingArguments, ModelArguments
+
+from ktransformers.operators.flashinfer_wrapper import flashinfer_enabled
+from ktransformers.server.config.config import Config
+from ktransformers.util.utils import (
+    get_compute_capability,
+    prefill_and_generate_capture,
+)
+from ktransformers.util.vendors import GPUVendor, device_manager
+
+
+logger = logging.get_logger(__name__)
+
+
+class KTransformersEngine(BaseEngine):
+    def __init__(
+        self,
+        model_args: "ModelArguments",
+        data_args: "DataArguments",
+        finetuning_args: "FinetuningArguments",
+        generating_args: "GeneratingArguments",
+    ) -> None:
+        self.name = EngineName.KT
+        self.can_generate = finetuning_args.stage == "sft"
+
+        tok_mod = load_tokenizer(model_args)
+        self.tokenizer = tok_mod["tokenizer"]
+        self.tokenizer.padding_side = "left" if self.can_generate else "right"
+        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)
+
+        self.model = load_model(
+            self.tokenizer, model_args, finetuning_args, is_trainable=False, add_valuehead=(not self.can_generate)
+        )
+
+        self.generating_args = generating_args.to_dict()
+        self.max_new_tokens = model_args.kt_maxlen
+        self.use_cuda_graph = model_args.kt_use_cuda_graph
+        self.mode = model_args.kt_mode
+        self.force_think = model_args.kt_force_think
+        self.chunk_size = model_args.chunk_size
+
+        try:
+            asyncio.get_event_loop()
+        except RuntimeError:
+            loop = asyncio.new_event_loop()
+            asyncio.set_event_loop(loop)
+
+        self.semaphore = asyncio.Semaphore(int(os.getenv("MAX_CONCURRENT", "1")))
+
+    @staticmethod
+    @torch.inference_mode()
+    def _get_scores(
+        model: "PreTrainedModelWrapper",
+        tokenizer: "PreTrainedTokenizer",
+        batch_input: list[str],
+        input_kwargs: Optional[dict[str, Any]] = {},
+    ) -> list[float]:
+        max_length: Optional[int] = input_kwargs.pop("max_length", None)
+        device = getattr(model.pretrained_model, "device", "cuda")
+        inputs = tokenizer(
+            batch_input,
+            padding=True,
+            truncation=True,
+            max_length=max_length or getattr(model.config, "max_position_embeddings", 1024),
+            return_tensors="pt",
+            add_special_tokens=False,
+        ).to(device)
+        values: torch.Tensor = model(**inputs, return_dict=True, use_cache=False)[-1]
+        scores = values.gather(dim=-1, index=(inputs["attention_mask"].sum(dim=-1, keepdim=True) - 1))
+        return scores
+
+    async def _generate(
+        self,
+        messages: list[dict[str, str]],
+        system: Optional[str] = None,
+        tools: Optional[str] = None,
+        **input_kwargs,
+    ) -> AsyncGenerator[str, None]:
+        paired = messages + [{"role": "assistant", "content": ""}]
+        prompt_ids, _ = self.template.encode_oneturn(self.tokenizer, paired, system, tools)
+        prompt_len = len(prompt_ids)
+
+        max_length: Optional[int] = input_kwargs.pop("max_length", None)
+        max_new_tokens: Optional[int] = input_kwargs.pop("max_new_tokens", None)
+
+        if "max_new_tokens" in self.generating_args:
+            max_tokens = int(self.generating_args["max_new_tokens"])
+        elif "max_length" in self.generating_args:
+            gl = int(self.generating_args["max_length"])
+            max_tokens = gl - prompt_len if gl > prompt_len else 1
+        else:
+            max_tokens = self.max_new_tokens or 256
+
+        if max_length is not None:
+            max_tokens = max(max_length - prompt_len, 1)
+        if max_new_tokens is not None:
+            max_tokens = int(max_new_tokens)
+        max_tokens = max(1, int(max_tokens))
+
+        if self.mode == "long_context":
+            max_len_cfg = Config().long_context_config["max_seq_len"]
+            need = prompt_len + max_tokens
+            assert max_len_cfg > need, f"please set max_seq_len > {need} in ~/.ktransformers/config.yaml"
+
+        device = next(self.model.parameters()).device
+        input_tensor = torch.tensor([prompt_ids], dtype=torch.long, device=device)
+        if self.force_think:
+            think = torch.tensor(
+                [self.tokenizer.encode("<think>\n", add_special_tokens=False)], dtype=torch.long, device=device
+            )
+            input_tensor = torch.cat([input_tensor, think], dim=1)
+
+        use_flashinfer = (
+            platform.system() != "Windows"
+            and getattr(self.model.config, "architectures", [""])[0]
+            in {"DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM"}
+            and flashinfer_enabled
+            and get_compute_capability() >= 8
+            and device_manager.gpu_vendor == GPUVendor.NVIDIA
+        )
+
+        def make_gen():
+            if use_flashinfer:
+                return prefill_and_generate_capture(
+                    self.model,
+                    self.tokenizer,
+                    input_tensor,
+                    max_tokens,
+                    self.use_cuda_graph,
+                    mode=self.mode,
+                    force_think=self.force_think,
+                    chunk_size=self.chunk_size,
+                    use_flashinfer_mla=True,
+                    num_heads=self.model.config.num_attention_heads,
+                    head_dim_ckv=getattr(self.model.config, "kv_lora_rank", 0),
+                    head_dim_kpe=getattr(self.model.config, "qk_rope_head_dim", 0),
+                    q_head_dim=getattr(self.model.config, "qk_rope_head_dim", 0)
+                    + getattr(self.model.config, "qk_nope_head_dim", 0),
+                    echo_stream=False,
+                )
+            else:
+                return prefill_and_generate_capture(
+                    self.model,
+                    self.tokenizer,
+                    input_tensor,
+                    max_tokens,
+                    self.use_cuda_graph,
+                    mode=self.mode,
+                    force_think=self.force_think,
+                    chunk_size=self.chunk_size,
+                    echo_stream=False,
+                )
+
+        loop = asyncio.get_running_loop()
+        q: asyncio.Queue[Optional[str]] = asyncio.Queue()
+
+        def producer():
+            try:
+                gen = make_gen()
+                if hasattr(gen, "__aiter__"):
+
+                    async def drain_async():
+                        async for t in gen:
+                            loop.call_soon_threadsafe(q.put_nowait, t if isinstance(t, str) else str(t))
+
+                    asyncio.run(drain_async())
+                elif hasattr(gen, "__iter__"):
+                    for t in gen:
+                        loop.call_soon_threadsafe(q.put_nowait, t if isinstance(t, str) else str(t))
+                else:
+                    loop.call_soon_threadsafe(q.put_nowait, gen if isinstance(gen, str) else str(gen))
+            finally:
+                loop.call_soon_threadsafe(q.put_nowait, None)
+
+        Thread(target=producer, daemon=True).start()
+
+        while True:
+            item = await q.get()
+            if item is None:
+                break
+            yield item
+
+    @override
+    async def chat(
+        self,
+        messages: list[dict[str, str]],
+        system: Optional[str] = None,
+        tools: Optional[str] = None,
+        images: Optional[list["ImageInput"]] = None,
+        videos: Optional[list["VideoInput"]] = None,
+        audios: Optional[list["AudioInput"]] = None,
+        **input_kwargs,
+    ) -> list["Response"]:
+        if not self.can_generate:
+            raise ValueError("The current model does not support `chat`.")
+        async with self.semaphore:
+            produced = ""
+            final_text = ""
+            async for t in self._generate(messages, system, tools, **input_kwargs):
+                delta = t
+                produced = produced + delta
+                if delta:
+                    final_text += delta
+
+            prompt_ids, _ = self.template.encode_oneturn(
+                self.tokenizer, messages + [{"role": "assistant", "content": ""}], system, tools
+            )
+            return [
+                Response(
+                    response_text=final_text,
+                    response_length=len(self.tokenizer.encode(final_text, add_special_tokens=False)),
+                    prompt_length=len(prompt_ids),
+                    finish_reason="stop",
+                )
+            ]
+
+    @override
+    async def stream_chat(
+        self,
+        messages: list[dict[str, str]],
+        system: Optional[str] = None,
+        tools: Optional[str] = None,
+        images: Optional[list["ImageInput"]] = None,
+        videos: Optional[list["VideoInput"]] = None,
+        audios: Optional[list["AudioInput"]] = None,
+        **input_kwargs,
+    ) -> AsyncGenerator[str, None]:
+        if not self.can_generate:
+            raise ValueError("The current model does not support `stream_chat`.")
+        async with self.semaphore:
+            produced = ""
+            async for t in self._generate(messages, system, tools, **input_kwargs):
+                delta = t[len(produced) :] if t.startswith(produced) else t
+                produced = t
+                if delta:
+                    yield delta
+
+    @override
+    async def get_scores(
+        self,
+        batch_input: list[str],
+        **input_kwargs,
+    ) -> list[float]:
+        if self.can_generate:
+            raise ValueError("Cannot get scores using an auto-regressive model.")
+        args = (self.model, self.tokenizer, batch_input, input_kwargs)
+        async with self.semaphore:
+            return await asyncio.to_thread(self._get_scores, *args)
--- a/src/llamafactory/chat/vllm_engine.py
+++ b/src/llamafactory/chat/vllm_engine.py
@@ -16,6 +16,7 @@ import uuid
 from collections.abc import AsyncGenerator, AsyncIterator
 from typing import TYPE_CHECKING, Any, Optional, Union

+from packaging import version
 from typing_extensions import override

 from ..data import get_template_and_fix_tokenizer
@@ -77,11 +78,18 @@ class VllmEngine(BaseEngine):
            "tensor_parallel_size": get_device_count() or 1,
            "gpu_memory_utilization": model_args.vllm_gpu_util,
            "disable_log_stats": True,
-            "disable_log_requests": True,
            "enforce_eager": model_args.vllm_enforce_eager,
            "enable_lora": model_args.adapter_name_or_path is not None,
            "max_lora_rank": model_args.vllm_max_lora_rank,
        }
+
+        import vllm
+
+        if version.parse(vllm.__version__) <= version.parse("0.10.0"):
+            engine_args["disable_log_requests"] = True
+        else:
+            engine_args["enable_log_requests"] = False
+
        if self.template.mm_plugin.__class__.__name__ != "BasePlugin":
            engine_args["limit_mm_per_prompt"] = {"image": 4, "video": 2, "audio": 2}

--- a/src/llamafactory/data/converter.py
+++ b/src/llamafactory/data/converter.py
@@ -15,7 +15,7 @@ import json
 import os
 from abc import abstractmethod
 from dataclasses import dataclass
-from typing import TYPE_CHECKING, Any, Optional, Union
+from typing import TYPE_CHECKING, Any, Union

 from ..extras import logging
 from .data_utils import Role
@@ -40,7 +40,7 @@ class DatasetConverter:
    dataset_attr: "DatasetAttr"
    data_args: "DataArguments"

-    def _find_medias(self, medias: Union["MediaType", list["MediaType"], None]) -> Optional[list["MediaType"]]:
+    def _find_medias(self, medias: Union["MediaType", list["MediaType"], None]) -> list["MediaType"] | None:
        r"""Optionally concatenate media path to media dir when loading from local disk."""
        if medias is None:
            return None
--- a/src/llamafactory/data/data_utils.py
+++ b/src/llamafactory/data/data_utils.py
@@ -81,41 +81,48 @@ def split_dataset(
    eval_dataset: Optional[Union["Dataset", "IterableDataset", dict[str, "Dataset"]]],
    data_args: "DataArguments",
    seed: int,
-) -> "DatasetDict":
-    r"""Split the dataset and returns a dataset dict containing train set and validation set.
+) -> tuple[dict, dict]:
+    r"""Split the dataset and returns two dicts containing train set and validation set.

    Support both map dataset and iterable dataset.
+
+    Returns:
+        train_dict: Dictionary containing training data with key "train"
+        eval_dict: Dictionary containing evaluation data with keys "validation" or "validation_{name}"
    """
    if eval_dataset is not None and data_args.val_size > 1e-6:
        raise ValueError("Cannot specify `val_size` if `eval_dataset` is not None.")

-    dataset_dict = {}
+    # the train and eval better to in dict dtype and separately return for cpode clearly and good handle outside
+    train_dict, eval_dict = {}, {}
+
    if dataset is not None:
        if data_args.streaming:
            dataset = dataset.shuffle(buffer_size=data_args.buffer_size, seed=seed)

        if data_args.val_size > 1e-6:
            if data_args.streaming:
-                dataset_dict["validation"] = dataset.take(int(data_args.val_size))
-                dataset_dict["train"] = dataset.skip(int(data_args.val_size))
+                eval_dict["validation"] = dataset.take(int(data_args.val_size))
+                train_dict["train"] = dataset.skip(int(data_args.val_size))
            else:
                val_size = int(data_args.val_size) if data_args.val_size > 1 else data_args.val_size
-                dataset_dict = dataset.train_test_split(test_size=val_size, seed=seed)
-                dataset = dataset.train_test_split(test_size=val_size, seed=seed)
-                dataset_dict = {"train": dataset["train"], "validation": dataset["test"]}
+                split_result = dataset.train_test_split(test_size=val_size, seed=seed)
+                train_dict["train"] = split_result["train"]
+                eval_dict["validation"] = split_result["test"]
        else:
-            dataset_dict["train"] = dataset
+            train_dict["train"] = dataset

    if eval_dataset is not None:
        if isinstance(eval_dataset, dict):
-            dataset_dict.update({f"validation_{name}": data for name, data in eval_dataset.items()})
+            for name, data in eval_dataset.items():
+                eval_dict[f"validation_{name}"] = data
        else:
            if data_args.streaming:
                eval_dataset = eval_dataset.shuffle(buffer_size=data_args.buffer_size, seed=seed)

-            dataset_dict["validation"] = eval_dataset
+            eval_dict["validation"] = eval_dataset

-    return DatasetDict(dataset_dict)
+    return train_dict, eval_dict


 def get_dataset_module(dataset: Union["Dataset", "DatasetDict"]) -> "DatasetModule":
--- a/src/llamafactory/data/formatter.py
+++ b/src/llamafactory/data/formatter.py
@@ -16,7 +16,6 @@ import json
 import re
 from abc import ABC, abstractmethod
 from dataclasses import dataclass, field
-from typing import Optional, Union

 from typing_extensions import override

@@ -27,14 +26,14 @@ from .tool_utils import FunctionCall, get_tool_utils
@dataclass
 class Formatter(ABC):
    slots: SLOTS = field(default_factory=list)
-    tool_format: Optional[str] = None
+    tool_format: str | None = None

    @abstractmethod
    def apply(self, **kwargs) -> SLOTS:
        r"""Forms a list of slots according to the inputs to encode."""
        ...

-    def extract(self, content: str) -> Union[str, list["FunctionCall"]]:
+    def extract(self, content: str) -> str | list["FunctionCall"]:
        r"""Extract a list of tuples from the response message if using tools.

        Each tuple consists of function name and function arguments.
@@ -97,31 +96,46 @@ class FunctionFormatter(StringFormatter):
    @override
    def apply(self, **kwargs) -> SLOTS:
        content: str = kwargs.pop("content")
-        thought_words, thought = kwargs.pop("thought_words", None), None
-        if thought_words and len(thought_words) == 2:
-            regex = re.compile(rf"{re.escape(thought_words[0])}(.*?){re.escape(thought_words[1])}", re.DOTALL)
-            thought = re.search(regex, content)
+        thought_words = kwargs.pop("thought_words", None)
+        tool_call_words = kwargs.pop("tool_call_words", None)

-        if thought:
-            content = content.replace(thought.group(0), "")
+        def _parse_functions(json_content: str) -> list["FunctionCall"]:
+            try:
+                tool_calls = json.loads(json_content)
+                if not isinstance(tool_calls, list):  # parallel function call
+                    tool_calls = [tool_calls]

-        functions: list[FunctionCall] = []
-        try:
-            tool_calls = json.loads(content)
-            if not isinstance(tool_calls, list):  # parallel function call
-                tool_calls = [tool_calls]
+                return [FunctionCall(tc["name"], json.dumps(tc["arguments"], ensure_ascii=False)) for tc in tool_calls]
+            except json.JSONDecodeError:
+                raise RuntimeError(f"Invalid JSON format in function message: {str([content])}.")

-            for tool_call in tool_calls:
-                functions.append(
-                    FunctionCall(tool_call["name"], json.dumps(tool_call["arguments"], ensure_ascii=False))
-                )
+        tool_call_match = None
+        if tool_call_words and len(tool_call_words) == 2:
+            tool_call_regex = re.compile(
+                rf"{re.escape(tool_call_words[0])}(.*?){re.escape(tool_call_words[1])}", re.DOTALL
+            )
+            tool_call_match = re.search(tool_call_regex, content)

-        except json.JSONDecodeError:
-            raise RuntimeError(f"Invalid JSON format in function message: {str([content])}.")  # flat string
+        if tool_call_match is None:
+            thought_match = None
+            if thought_words and len(thought_words) == 2:
+                regex = re.compile(rf"{re.escape(thought_words[0])}(.*?){re.escape(thought_words[1])}", re.DOTALL)
+                thought_match = re.search(regex, content)

-        function_str = self.tool_utils.function_formatter(functions)
-        if thought:
-            function_str = thought.group(0) + function_str
+            if thought_match:
+                json_part = content.replace(thought_match.group(0), "")
+            else:
+                json_part = content
+
+            functions = _parse_functions(json_part)
+            function_str = self.tool_utils.function_formatter(functions)
+            if thought_match:
+                function_str = thought_match.group(0) + function_str
+        else:
+            thought_content = content.replace(tool_call_match.group(0), "")
+            functions = _parse_functions(tool_call_match.group(1))
+            function_str = self.tool_utils.function_formatter(functions)
+            function_str = thought_content + function_str

        return super().apply(content=function_str)

@@ -141,5 +155,5 @@ class ToolFormatter(Formatter):
            raise RuntimeError(f"Invalid JSON format in tool description: {str([content])}.")  # flat string

    @override
-    def extract(self, content: str) -> Union[str, list["FunctionCall"]]:
+    def extract(self, content: str) -> str | list["FunctionCall"]:
        return self.tool_utils.tool_extractor(content)
--- a/src/llamafactory/data/loader.py
+++ b/src/llamafactory/data/loader.py
@@ -16,7 +16,7 @@ import os
 from typing import TYPE_CHECKING, Literal, Optional, Union

 import numpy as np
-from datasets import Dataset, load_dataset, load_from_disk
+from datasets import Dataset, DatasetDict, load_dataset, load_from_disk

 from ..extras import logging
 from ..extras.constants import FILEEXT2TYPE
@@ -137,7 +137,6 @@ def _load_single_dataset(
            cache_dir=model_args.cache_dir,
            token=model_args.hf_hub_token,
            num_proc=data_args.preprocessing_num_workers,
-            trust_remote_code=model_args.trust_remote_code,
            streaming=data_args.streaming and dataset_attr.load_from != "file",
        )
        if data_args.streaming and dataset_attr.load_from == "file":
@@ -163,13 +162,13 @@ def _load_single_dataset(


 def _get_merged_dataset(
-    dataset_names: Optional[list[str]],
+    dataset_names: list[str] | None,
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
    return_dict: bool = False,
-) -> Optional[Union["Dataset", "IterableDataset", dict[str, "Dataset"]]]:
+) -> Union["Dataset", "IterableDataset", dict[str, "Dataset"]] | None:
    r"""Return the merged datasets in the standard format."""
    if dataset_names is None:
        return None
@@ -228,7 +227,7 @@ def _get_dataset_processor(


 def _get_preprocessed_dataset(
-    dataset: Optional[Union["Dataset", "IterableDataset"]],
+    dataset: Union["Dataset", "IterableDataset"] | None,
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
@@ -236,7 +235,7 @@ def _get_preprocessed_dataset(
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"] = None,
    is_eval: bool = False,
-) -> Optional[Union["Dataset", "IterableDataset"]]:
+) -> Union["Dataset", "IterableDataset"] | None:
    r"""Preprocesses the dataset, including format checking and tokenization."""
    if dataset is None:
        return None
@@ -312,20 +311,22 @@ def get_dataset(
        )

    with training_args.main_process_first(desc="pre-process dataset", local=(not data_args.data_shared_file_system)):
-        dataset = _get_preprocessed_dataset(
-            dataset, data_args, training_args, stage, template, tokenizer, processor, is_eval=False
-        )
-        if isinstance(eval_dataset, dict):
-            for eval_name, eval_data in eval_dataset.items():
-                eval_dataset[eval_name] = _get_preprocessed_dataset(
-                    eval_data, data_args, training_args, stage, template, tokenizer, processor, is_eval=True
-                )
-        else:
-            eval_dataset = _get_preprocessed_dataset(
-                eval_dataset, data_args, training_args, stage, template, tokenizer, processor, is_eval=True
+        # move front to make sure eval_dataset(if contain or split) can preprocessed appropriately
+        train_dict, eval_dict = split_dataset(dataset, eval_dataset, data_args, seed=training_args.seed)
+
+        if "train" in train_dict:
+            train_dict["train"] = _get_preprocessed_dataset(
+                train_dict["train"], data_args, training_args, stage, template, tokenizer, processor, is_eval=False
            )

-        dataset_dict = split_dataset(dataset, eval_dataset, data_args, seed=training_args.seed)
+        for key in eval_dict:
+            eval_dict[key] = _get_preprocessed_dataset(
+                eval_dict[key], data_args, training_args, stage, template, tokenizer, processor, is_eval=True
+            )
+
+        # Combine train and eval dictionaries
+        dataset_dict = DatasetDict({**train_dict, **eval_dict})
+
        if data_args.tokenized_path is not None:  # save tokenized dataset to disk
            if training_args.should_save:
                dataset_dict.save_to_disk(data_args.tokenized_path)
--- a/src/llamafactory/data/mm_plugin.py
+++ b/src/llamafactory/data/mm_plugin.py
@@ -22,10 +22,11 @@ import re
 from copy import deepcopy
 from dataclasses import dataclass
 from io import BytesIO
-from typing import TYPE_CHECKING, BinaryIO, Literal, Optional, TypedDict, Union
+from typing import TYPE_CHECKING, BinaryIO, Literal, NotRequired, Optional, TypedDict, Union

 import numpy as np
 import torch
+import torchaudio
 from transformers.image_utils import get_image_size, is_valid_image, to_numpy_array
 from transformers.models.mllama.processing_mllama import (
    convert_sparse_cross_attention_mask_to_dense,
@@ -34,16 +35,7 @@ from transformers.models.mllama.processing_mllama import (
 from typing_extensions import override

 from ..extras.constants import AUDIO_PLACEHOLDER, IGNORE_INDEX, IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER
-from ..extras.packages import (
-    is_librosa_available,
-    is_pillow_available,
-    is_pyav_available,
-    is_transformers_version_greater_than,
-)
-
-
-if is_librosa_available():
-    import librosa
+from ..extras.packages import is_pillow_available, is_pyav_available, is_transformers_version_greater_than


 if is_pillow_available():
@@ -68,15 +60,28 @@ if TYPE_CHECKING:
    from transformers import PreTrainedTokenizer, ProcessorMixin
    from transformers.feature_extraction_sequence_utils import SequenceFeatureExtractor
    from transformers.image_processing_utils import BaseImageProcessor
+    from transformers.video_processing_utils import BaseVideoProcessor

    class EncodedImage(TypedDict):
-        path: Optional[str]
-        bytes: Optional[bytes]
+        path: str | None
+        bytes: bytes | None

    ImageInput = Union[str, bytes, EncodedImage, BinaryIO, ImageObject]
    VideoInput = Union[str, BinaryIO, list[list[ImageInput]]]
    AudioInput = Union[str, BinaryIO, NDArray]

+    class RegularizedImageOutput(TypedDict):
+        images: list[ImageObject]
+
+    class RegularizedVideoOutput(TypedDict):
+        videos: list[list[ImageObject]]
+        durations: list[float]
+        fps_per_video: NotRequired[list[float]]
+
+    class RegularizedAudioOutput(TypedDict):
+        audios: list[NDArray]
+        sampling_rates: list[float]
+
    class MMProcessor(ProcessorMixin):
        patch_size: int
        image_seq_length: int
@@ -139,9 +144,9 @@ def _check_video_is_nested_images(video: "VideoInput") -> bool:

@dataclass
 class MMPluginMixin:
-    image_token: Optional[str]
-    video_token: Optional[str]
-    audio_token: Optional[str]
+    image_token: str | None
+    video_token: str | None
+    audio_token: str | None
    expand_mm_tokens: bool = True

    def _validate_input(
@@ -244,7 +249,7 @@ class MMPluginMixin:
        sample_frames = min(total_frames, video_maxlen, sample_frames)
        return np.linspace(0, total_frames - 1, sample_frames).astype(np.int32)

-    def _regularize_images(self, images: list["ImageInput"], **kwargs) -> dict[str, list["ImageObject"]]:
+    def _regularize_images(self, images: list["ImageInput"], **kwargs) -> "RegularizedImageOutput":
        r"""Regularize images to avoid error. Including reading and pre-processing."""
        results = []
        for image in images:
@@ -265,9 +270,10 @@ class MMPluginMixin:

        return {"images": results}

-    def _regularize_videos(self, videos: list["VideoInput"], **kwargs) -> dict[str, list[list["ImageObject"]]]:
+    def _regularize_videos(self, videos: list["VideoInput"], **kwargs) -> "RegularizedVideoOutput":
        r"""Regularizes videos to avoid error. Including reading, resizing and converting."""
        results = []
+        durations = []
        for video in videos:
            frames: list[ImageObject] = []
            if _check_video_is_nested_images(video):
@@ -275,6 +281,7 @@ class MMPluginMixin:
                    if not is_valid_image(frame) and not isinstance(frame, dict) and not os.path.exists(frame):
                        raise ValueError("Invalid image found in video frames.")
                frames = video
+                durations.append(len(frames) / kwargs.get("video_fps", 2.0))
            else:
                container = av.open(video, "r")
                video_stream = next(stream for stream in container.streams if stream.type == "video")
@@ -284,19 +291,31 @@ class MMPluginMixin:
                    if frame_idx in sample_indices:
                        frames.append(frame.to_image())

+                if video_stream.duration is None:
+                    durations.append(len(frames) / kwargs.get("video_fps", 2.0))
+                else:
+                    durations.append(float(video_stream.duration * video_stream.time_base))
+
            frames = self._regularize_images(frames, **kwargs)["images"]
            results.append(frames)

-        return {"videos": results}
+        return {"videos": results, "durations": durations}

    def _regularize_audios(
        self, audios: list["AudioInput"], sampling_rate: float, **kwargs
-    ) -> dict[str, Union[list["NDArray"], list[float]]]:
+    ) -> "RegularizedAudioOutput":
        r"""Regularizes audios to avoid error. Including reading and resampling."""
        results, sampling_rates = [], []
        for audio in audios:
            if not isinstance(audio, np.ndarray):
-                audio, sampling_rate = librosa.load(audio, sr=sampling_rate)
+                audio, sr = torchaudio.load(audio)
+                if audio.shape[0] > 1:
+                    audio = audio.mean(dim=0, keepdim=True)
+
+                if sr != sampling_rate:
+                    audio = torchaudio.functional.resample(audio, sr, sampling_rate)
+
+                audio = audio.squeeze(0).numpy()

            results.append(audio)
            sampling_rates.append(sampling_rate)
@@ -309,7 +328,7 @@ class MMPluginMixin:
        videos: list["VideoInput"],
        audios: list["AudioInput"],
        processor: "MMProcessor",
-        imglens: Optional[list[int]] = None,
+        imglens: list[int] | None = None,
    ) -> dict[str, "torch.Tensor"]:
        r"""Process visual inputs.

@@ -407,13 +426,13 @@ class BasePlugin(MMPluginMixin):
    def process_token_ids(
        self,
        input_ids: list[int],
-        labels: Optional[list[int]],
+        labels: list[int] | None,
        images: list["ImageInput"],
        videos: list["VideoInput"],
        audios: list["AudioInput"],
        tokenizer: "PreTrainedTokenizer",
        processor: Optional["MMProcessor"],
-    ) -> tuple[list[int], Optional[list[int]]]:
+    ) -> tuple[list[int], list[int] | None]:
        r"""Pre-process token ids after tokenization for VLMs."""
        self._validate_input(processor, images, videos, audios)
        return input_ids, labels
@@ -446,6 +465,57 @@ class BasePlugin(MMPluginMixin):
        return self._get_mm_inputs(images, videos, audios, processor)


+@dataclass
+class ErnieVLPlugin(BasePlugin):
+    @override
+    def process_messages(
+        self,
+        messages: list[dict[str, str]],
+        images: list["ImageInput"],
+        videos: list["VideoInput"],
+        audios: list["AudioInput"],
+        processor: Optional["MMProcessor"],
+    ) -> list[dict[str, str]]:
+        self._validate_input(processor, images, videos, audios)
+        self._validate_messages(messages, images, videos, audios)
+        messages = deepcopy(messages)
+
+        image_processor: BaseImageProcessor = getattr(processor, "image_processor")
+
+        merge_length: int = getattr(image_processor, "merge_size") ** 2
+        if self.expand_mm_tokens:
+            mm_inputs = self._get_mm_inputs(images, videos, audios, processor)
+            image_grid_thw = mm_inputs.get("image_grid_thw", [])
+            video_grid_thw = mm_inputs.get("video_grid_thw", [])
+        else:
+            image_grid_thw = [None] * len(images)
+            video_grid_thw = [None] * len(videos)
+
+        image_idx, video_idx = 0, 0
+        for message in messages:
+            content = message["content"]
+            image_token = self.image_token or "<|IMAGE_PLACEHOLDER|>"
+            video_token = self.video_token or "<|VIDEO_PLACEHOLDER|>"
+            while IMAGE_PLACEHOLDER in content:
+                image_seqlen = image_grid_thw[image_idx].prod() // merge_length if self.expand_mm_tokens else 1
+                content = content.replace(
+                    IMAGE_PLACEHOLDER,
+                    f"Picture {image_idx + 1}:<|IMAGE_START|>{image_token * image_seqlen}<|IMAGE_END|>",
+                    1,
+                )
+                image_idx += 1
+            while VIDEO_PLACEHOLDER in content:
+                video_seqlen = video_grid_thw[video_idx].prod() // merge_length if self.expand_mm_tokens else 1
+                content = content.replace(
+                    VIDEO_PLACEHOLDER,
+                    f"Video {video_idx + 1}:<|VIDEO_START|>{video_token * video_seqlen}<|VIDEO_END|>",
+                    1,
+                )
+                video_idx += 1
+            message["content"] = content
+        return messages
+
+
@dataclass
 class Gemma3Plugin(BasePlugin):
    @override
@@ -1235,13 +1305,13 @@ class PaliGemmaPlugin(BasePlugin):
    def process_token_ids(
        self,
        input_ids: list[int],
-        labels: Optional[list[int]],
+        labels: list[int] | None,
        images: list["ImageInput"],
        videos: list["VideoInput"],
        audios: list["AudioInput"],
        tokenizer: "PreTrainedTokenizer",
        processor: Optional["MMProcessor"],
-    ) -> tuple[list[int], Optional[list[int]]]:
+    ) -> tuple[list[int], list[int] | None]:
        self._validate_input(processor, images, videos, audios)
        num_images = len(images)
        image_seqlen = processor.image_seq_length if self.expand_mm_tokens else 0  # skip mm token
@@ -1418,10 +1488,8 @@ class Qwen2VLPlugin(BasePlugin):
        return image

    @override
-    def _regularize_videos(
-        self, videos: list["VideoInput"], **kwargs
-    ) -> dict[str, Union[list[list["ImageObject"]], list[float]]]:
-        results, fps_per_video = [], []
+    def _regularize_videos(self, videos: list["VideoInput"], **kwargs) -> "RegularizedVideoOutput":
+        results, fps_per_video, durations = [], [], []
        for video in videos:
            frames: list[ImageObject] = []
            if _check_video_is_nested_images(video):
@@ -1431,6 +1499,7 @@ class Qwen2VLPlugin(BasePlugin):

                frames = video
                fps_per_video.append(kwargs.get("video_fps", 2.0))
+                durations.append(len(frames) / kwargs.get("video_fps", 2.0))
            else:
                container = av.open(video, "r")
                video_stream = next(stream for stream in container.streams if stream.type == "video")
@@ -1442,8 +1511,10 @@ class Qwen2VLPlugin(BasePlugin):

                if video_stream.duration is None:
                    fps_per_video.append(kwargs.get("video_fps", 2.0))
+                    durations.append(len(frames) / kwargs.get("video_fps", 2.0))
                else:
                    fps_per_video.append(len(sample_indices) / float(video_stream.duration * video_stream.time_base))
+                    durations.append(float(video_stream.duration * video_stream.time_base))

            if len(frames) % 2 != 0:
                frames.append(frames[-1])
@@ -1451,7 +1522,7 @@ class Qwen2VLPlugin(BasePlugin):
            frames = self._regularize_images(frames, **kwargs)["images"]
            results.append(frames)

-        return {"videos": results, "fps_per_video": fps_per_video}
+        return {"videos": results, "fps_per_video": fps_per_video, "durations": durations}

    @override
    def _get_mm_inputs(
@@ -1462,6 +1533,7 @@ class Qwen2VLPlugin(BasePlugin):
        processor: "MMProcessor",
    ) -> dict[str, "torch.Tensor"]:
        image_processor: BaseImageProcessor = getattr(processor, "image_processor", None)
+        video_processor: BaseVideoProcessor = getattr(processor, "video_processor", None)
        mm_inputs = {}
        if len(images) != 0:
            images = self._regularize_images(
@@ -1479,7 +1551,7 @@ class Qwen2VLPlugin(BasePlugin):
                video_fps=getattr(processor, "video_fps", 2.0),
                video_maxlen=getattr(processor, "video_maxlen", 128),
            )
-            mm_inputs.update(image_processor(images=None, videos=video_data["videos"], return_tensors="pt"))
+            mm_inputs.update(video_processor(videos=video_data["videos"], return_tensors="pt"))
            temporal_patch_size: int = getattr(image_processor, "temporal_patch_size", 2)
            if "second_per_grid_ts" in processor.model_input_names:
                mm_inputs["second_per_grid_ts"] = [temporal_patch_size / fps for fps in video_data["fps_per_video"]]
@@ -1565,11 +1637,16 @@ class Qwen3VLPlugin(Qwen2VLPlugin):
                video_maxlen=getattr(processor, "video_maxlen", 128),
            )
            video_metadata = [
-                {"fps": getattr(processor, "video_fps", 24.0), "duration": len(video), "total_num_frames": len(video)}
-                for video in videos["videos"]
+                {"fps": getattr(processor, "video_fps", 24.0), "duration": duration, "total_num_frames": len(video)}
+                for video, duration in zip(videos["videos"], videos["durations"])
            ]
            mm_inputs.update(
-                video_processor(videos=videos["videos"], video_metadata=video_metadata, return_metadata=True)
+                video_processor(
+                    videos=videos["videos"],
+                    video_metadata=video_metadata,
+                    fps=getattr(processor, "video_fps", 2.0),
+                    return_metadata=True,
+                )
            )
            temporal_patch_size: int = getattr(image_processor, "temporal_patch_size", 2)
            if "second_per_grid_ts" in processor.model_input_names:
@@ -1622,27 +1699,27 @@ class Qwen3VLPlugin(Qwen2VLPlugin):
                num_image_tokens += 1

            while VIDEO_PLACEHOLDER in content:
-                metadata = video_metadata[idx]
-                timestamps = processor._calculate_timestamps(
-                    metadata.frames_indices,
-                    metadata.fps,
-                    video_processor.merge_size,
-                )
-                video_structure = ""
-                for frame_index in range(num_frames):
-                    video_seqlen = (
-                        video_grid_thw[num_video_tokens][1:].prod() // video_merge_length
-                        if self.expand_mm_tokens
-                        else 1
+                if self.expand_mm_tokens:
+                    metadata = video_metadata[idx]
+                    timestamps = processor._calculate_timestamps(
+                        metadata.frames_indices,
+                        metadata.fps,
+                        video_processor.merge_size,
                    )
-                    timestamp_sec = timestamps[frame_index]
-                    frame_structure = (
-                        f"<{timestamp_sec:.1f} seconds>"
-                        f"{self.vision_bos_token}{self.video_token * video_seqlen}{self.vision_eos_token}"
-                    )
-                    video_structure += frame_structure
-
-                if not self.expand_mm_tokens:
+                    video_structure = ""
+                    for frame_index in range(num_frames):
+                        video_seqlen = (
+                            video_grid_thw[num_video_tokens][1:].prod() // video_merge_length
+                            if self.expand_mm_tokens
+                            else 1
+                        )
+                        timestamp_sec = timestamps[frame_index]
+                        frame_structure = (
+                            f"<{timestamp_sec:.1f} seconds>"
+                            f"{self.vision_bos_token}{self.video_token * video_seqlen}{self.vision_eos_token}"
+                        )
+                        video_structure += frame_structure
+                else:
                    video_structure = f"{self.vision_bos_token}{self.video_token}{self.vision_eos_token}"

                content = content.replace(VIDEO_PLACEHOLDER, video_structure, 1)
@@ -1684,7 +1761,8 @@ class GLM4VPlugin(Qwen2VLPlugin):
            )
            # prepare video metadata
            video_metadata = [
-                {"fps": 2, "duration": len(video), "total_frames": len(video)} for video in video_data["videos"]
+                {"fps": 2, "duration": duration, "total_frames": len(video)}
+                for video, duration in zip(video_data["videos"], video_data["durations"])
            ]
            mm_inputs.update(video_processor(images=None, videos=video_data["videos"], video_metadata=video_metadata))

@@ -1797,6 +1875,7 @@ class Qwen2OmniPlugin(Qwen2VLPlugin):
        processor: "MMProcessor",
    ) -> dict[str, "torch.Tensor"]:
        image_processor: BaseImageProcessor = getattr(processor, "image_processor", None)
+        video_processor: BaseVideoProcessor = getattr(processor, "video_processor", None)
        feature_extractor: SequenceFeatureExtractor = getattr(processor, "feature_extractor", None)
        mm_inputs = {}
        if len(images) != 0:
@@ -1815,7 +1894,7 @@ class Qwen2OmniPlugin(Qwen2VLPlugin):
                video_fps=getattr(processor, "video_fps", 2.0),
                video_maxlen=getattr(processor, "video_maxlen", 128),
            )
-            mm_inputs.update(image_processor(images=None, videos=video_dict["videos"], return_tensors="pt"))
+            mm_inputs.update(video_processor(videos=video_dict["videos"], return_tensors="pt"))
            temporal_patch_size: int = getattr(image_processor, "temporal_patch_size", 2)
            mm_inputs["video_second_per_grid"] = torch.tensor(
                [temporal_patch_size / fps for fps in video_dict["fps_per_video"]]
@@ -1861,8 +1940,14 @@ class Qwen2OmniPlugin(Qwen2VLPlugin):
            image_grid_thw = mm_inputs.get("image_grid_thw", [])
            video_grid_thw = mm_inputs.get("video_grid_thw", [])
            if "feature_attention_mask" in mm_inputs:
-                input_lengths = (mm_inputs["feature_attention_mask"].sum(-1).numpy() - 1) // 2 + 1
-                audio_lengths = (input_lengths - 2) // 2 + 1
+                if processor.__class__.__name__ == "Qwen3OmniMoeProcessor":  # for qwen3omni
+                    input_lengths = mm_inputs["feature_attention_mask"].sum(-1)
+                    input_lengths_leave = input_lengths % 100
+                    feature_lengths = (input_lengths_leave - 1) // 2 + 1
+                    audio_lengths = ((feature_lengths - 1) // 2 + 1 - 1) // 2 + 1 + (input_lengths // 100) * 13
+                else:
+                    input_lengths = (mm_inputs["feature_attention_mask"].sum(-1).numpy() - 1) // 2 + 1
+                    audio_lengths = (input_lengths - 2) // 2 + 1
        else:
            mm_inputs = {}
            image_grid_thw = [None] * len(images)
@@ -2009,6 +2094,7 @@ class VideoLlavaPlugin(BasePlugin):

 PLUGINS = {
    "base": BasePlugin,
+    "ernie_vl": ErnieVLPlugin,
    "gemma3": Gemma3Plugin,
    "glm4v": GLM4VPlugin,
    "gemma3n": Gemma3nPlugin,
@@ -2040,9 +2126,9 @@ def register_mm_plugin(name: str, plugin_class: type["BasePlugin"]) -> None:

 def get_mm_plugin(
    name: str,
-    image_token: Optional[str] = None,
-    video_token: Optional[str] = None,
-    audio_token: Optional[str] = None,
+    image_token: str | None = None,
+    video_token: str | None = None,
+    audio_token: str | None = None,
    **kwargs,
 ) -> "BasePlugin":
    r"""Get plugin for multimodal inputs."""
--- a/src/llamafactory/data/parser.py
+++ b/src/llamafactory/data/parser.py
@@ -15,7 +15,7 @@
 import json
 import os
 from dataclasses import dataclass
-from typing import Any, Literal, Optional, Union
+from typing import Any, Literal

 from huggingface_hub import hf_hub_download

@@ -30,43 +30,43 @@ class DatasetAttr:
    # basic configs
    load_from: Literal["hf_hub", "ms_hub", "om_hub", "script", "file"]
    dataset_name: str
-    formatting: Literal["alpaca", "sharegpt"] = "alpaca"
+    formatting: Literal["alpaca", "sharegpt", "openai"] = "alpaca"
    ranking: bool = False
    # extra configs
-    subset: Optional[str] = None
+    subset: str | None = None
    split: str = "train"
-    folder: Optional[str] = None
-    num_samples: Optional[int] = None
+    folder: str | None = None
+    num_samples: int | None = None
    # common columns
-    system: Optional[str] = None
-    tools: Optional[str] = None
-    images: Optional[str] = None
-    videos: Optional[str] = None
-    audios: Optional[str] = None
+    system: str | None = None
+    tools: str | None = None
+    images: str | None = None
+    videos: str | None = None
+    audios: str | None = None
    # dpo columns
-    chosen: Optional[str] = None
-    rejected: Optional[str] = None
-    kto_tag: Optional[str] = None
+    chosen: str | None = None
+    rejected: str | None = None
+    kto_tag: str | None = None
    # alpaca columns
-    prompt: Optional[str] = "instruction"
-    query: Optional[str] = "input"
-    response: Optional[str] = "output"
-    history: Optional[str] = None
+    prompt: str | None = "instruction"
+    query: str | None = "input"
+    response: str | None = "output"
+    history: str | None = None
    # sharegpt columns
-    messages: Optional[str] = "conversations"
+    messages: str | None = "conversations"
    # sharegpt tags
-    role_tag: Optional[str] = "from"
-    content_tag: Optional[str] = "value"
-    user_tag: Optional[str] = "human"
-    assistant_tag: Optional[str] = "gpt"
-    observation_tag: Optional[str] = "observation"
-    function_tag: Optional[str] = "function_call"
-    system_tag: Optional[str] = "system"
+    role_tag: str | None = "from"
+    content_tag: str | None = "value"
+    user_tag: str | None = "human"
+    assistant_tag: str | None = "gpt"
+    observation_tag: str | None = "observation"
+    function_tag: str | None = "function_call"
+    system_tag: str | None = "system"

    def __repr__(self) -> str:
        return self.dataset_name

-    def set_attr(self, key: str, obj: dict[str, Any], default: Optional[Any] = None) -> None:
+    def set_attr(self, key: str, obj: dict[str, Any], default: Any | None = None) -> None:
        setattr(self, key, obj.get(key, default))

    def join(self, attr: dict[str, Any]) -> None:
@@ -90,7 +90,7 @@ class DatasetAttr:
                self.set_attr(tag, attr["tags"])


-def get_dataset_list(dataset_names: Optional[list[str]], dataset_dir: Union[str, dict]) -> list["DatasetAttr"]:
+def get_dataset_list(dataset_names: list[str] | None, dataset_dir: str | dict) -> list["DatasetAttr"]:
    r"""Get the attributes of the datasets."""
    if dataset_names is None:
        dataset_names = []
--- a/src/llamafactory/data/template.py
+++ b/src/llamafactory/data/template.py
@@ -49,6 +49,7 @@ class Template:
    default_system: str
    stop_words: list[str]
    thought_words: tuple[str, str]
+    tool_call_words: tuple[str, str]
    efficient_eos: bool
    replace_eos: bool
    replace_jinja_template: bool
@@ -156,7 +157,9 @@ class Template:
            elif message["role"] == Role.OBSERVATION:
                elements += self.format_observation.apply(content=message["content"])
            elif message["role"] == Role.FUNCTION:
-                elements += self.format_function.apply(content=message["content"], thought_words=self.thought_words)
+                elements += self.format_function.apply(
+                    content=message["content"], thought_words=self.thought_words, tool_call_words=self.tool_call_words
+                )
            else:
                raise NotImplementedError("Unexpected role: {}".format(message["role"]))

@@ -199,9 +202,12 @@ class Template:
            logger.info_rank0(f"Add pad token: {tokenizer.pad_token}")

        if stop_words:
-            num_added_tokens = tokenizer.add_special_tokens(
-                dict(additional_special_tokens=stop_words), replace_additional_special_tokens=False
-            )
+            try:
+                num_added_tokens = tokenizer.add_special_tokens(
+                    dict(additional_special_tokens=stop_words), replace_additional_special_tokens=False
+                )
+            except TypeError:
+                num_added_tokens = tokenizer.add_special_tokens(dict(additional_special_tokens=stop_words))
            logger.info_rank0("Add {} to stop words.".format(",".join(stop_words)))
            if num_added_tokens > 0:
                logger.warning_rank0("New tokens have been added, make sure `resize_vocab` is True.")
@@ -468,6 +474,7 @@ def register_template(
    default_system: str = "",
    stop_words: Optional[list[str]] = None,
    thought_words: Optional[tuple[str, str]] = None,
+    tool_call_words: Optional[tuple[str, str]] = None,
    efficient_eos: bool = False,
    replace_eos: bool = False,
    replace_jinja_template: bool = False,
@@ -519,6 +526,7 @@ def register_template(
        default_system=default_system,
        stop_words=stop_words or [],
        thought_words=thought_words or ("<think>\n", "\n</think>\n\n"),
+        tool_call_words=tool_call_words or ("<tool_call>", "</tool_call>"),
        efficient_eos=efficient_eos,
        replace_eos=replace_eos,
        replace_jinja_template=replace_jinja_template,
@@ -580,6 +588,7 @@ def parse_template(tokenizer: "PreTrainedTokenizer") -> "Template":
        default_system=default_system,
        stop_words=[],
        thought_words=("<think>\n", "\n</think>\n\n"),
+        tool_call_words=("<tool_call>", "</tool_call>"),
        efficient_eos=False,
        replace_eos=False,
        replace_jinja_template=False,
@@ -616,7 +625,14 @@ def get_template_and_fix_tokenizer(tokenizer: "PreTrainedTokenizer", data_args:
        logger.info_rank0(f"Using default system message: {data_args.default_system}.")
        template.default_system = data_args.default_system

-    template.enable_thinking = data_args.enable_thinking
+    if isinstance(template, ReasoningTemplate):
+        logger.warning_rank0(
+            "You are using reasoning template, "
+            "please add `_nothink` suffix if the model is not a reasoning model. "
+            "e.g., qwen3_vl_nothink"
+        )
+        template.enable_thinking = data_args.enable_thinking
+
    template.fix_special_tokens(tokenizer)
    template.fix_jinja_template(tokenizer)
    return template
@@ -956,6 +972,19 @@ register_template(
 )


+register_template(
+    name="ernie_vl",
+    format_user=StringFormatter(slots=["User: {{content}}"]),
+    format_assistant=StringFormatter(slots=["\nAssistant: {{content}}<|end_of_sentence|>"]),
+    format_system=StringFormatter(slots=["{{content}}\n"]),
+    stop_words=["<|end_of_sentence|>"],
+    replace_eos=True,
+    replace_jinja_template=True,
+    template_class=ReasoningTemplate,
+    mm_plugin=get_mm_plugin(name="ernie_vl", image_token="<|IMAGE_PLACEHOLDER|>", video_token="<|VIDEO_PLACEHOLDER|>"),
+)
+
+
 register_template(
    name="exaone",
    format_user=StringFormatter(slots=["[|user|]{{content}}\n[|assistant|]"]),
@@ -1105,7 +1134,7 @@ register_template(

 # copied from glm4 template
 register_template(
-    name="glm4v_moe",
+    name="glm4_5v",
    format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]),
    format_assistant=StringFormatter(slots=["\n{{content}}"]),
    format_system=StringFormatter(slots=["<|system|>\n{{content}}"]),
@@ -1137,7 +1166,7 @@ register_template(


 register_template(
-    name="gpt",
+    name="gpt_oss",
    format_user=StringFormatter(slots=["<|start|>user<|message|>{{content}}<|end|><|start|>assistant"]),
    format_assistant=StringFormatter(slots=["{{content}}<|end|>"]),
    format_system=StringFormatter(slots=["<|start|>system<|message|>{{content}}<|end|>"]),
@@ -1201,10 +1230,10 @@ register_template(

 register_template(
    name="hunyuan",
-    format_user=StringFormatter(slots=["<|bos|>user\n{{content}}<|eos|>\n<|bos|>assistant\n"]),
-    format_assistant=StringFormatter(slots=["{{content}}<|eos|>\n"]),
-    format_system=StringFormatter(slots=["<|bos|>system\n{{content}}<|eos|>\n"]),
-    format_prefix=EmptyFormatter(slots=["<|bos|>"]),
+    format_user=StringFormatter(slots=["{{content}}<|extra_0|>"]),
+    format_assistant=StringFormatter(slots=["{{content}}<|eos|>"]),
+    format_system=StringFormatter(slots=["{{content}}<|extra_4|>"]),
+    format_prefix=EmptyFormatter(slots=["<|startoftext|>"]),
    stop_words=["<|eos|>"],
 )

@@ -1581,6 +1610,26 @@ register_template(
    template_class=ReasoningTemplate,
 )

+
+# copied from qwen template
+register_template(
+    name="mimo_v2",
+    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
+    format_assistant=StringFormatter(slots=["{{content}}<|im_end|>\n"]),
+    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
+    format_function=FunctionFormatter(slots=["{{content}}<|im_end|>\n"], tool_format="qwen"),
+    format_observation=StringFormatter(
+        slots=["<|im_start|>user\n<tool_response>\n{{content}}\n</tool_response><|im_end|>\n<|im_start|>assistant\n"]
+    ),
+    format_tools=ToolFormatter(tool_format="qwen"),
+    default_system="You are MiMo, a helpful AI assistant engineered by Xiaomi.",
+    stop_words=["<|im_end|>"],
+    replace_eos=True,
+    thought_words=("<think>", "</think>"),
+    template_class=ReasoningTemplate,
+)
+
+
 # copied from qwen2vl
 register_template(
    name="mimo_vl",
@@ -1664,6 +1713,19 @@ register_template(
 )


+register_template(
+    name="ministral3",
+    format_user=StringFormatter(slots=["[INST]{{content}}[/INST]"]),
+    format_system=StringFormatter(slots=["{{content}}\n\n"]),
+    format_function=FunctionFormatter(slots=["[TOOL_CALLS]{{content}}", {"eos_token"}], tool_format="mistral"),
+    format_observation=StringFormatter(slots=["""[TOOL_RESULTS]{"content": {{content}}}[/TOOL_RESULTS]"""]),
+    format_tools=ToolFormatter(tool_format="mistral"),
+    format_prefix=EmptyFormatter(slots=[{"bos_token"}]),
+    template_class=Llama2Template,
+    mm_plugin=get_mm_plugin(name="pixtral", image_token="[IMG]"),
+)
+
+
 register_template(
    name="olmo",
    format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>\n"]),
--- a/src/llamafactory/extras/constants.py
+++ b/src/llamafactory/extras/constants.py
@@ -15,7 +15,6 @@
 import os
 from collections import OrderedDict, defaultdict
 from enum import Enum, unique
-from typing import Optional

 from peft.utils import SAFETENSORS_WEIGHTS_NAME as SAFE_ADAPTER_WEIGHTS_NAME
 from peft.utils import WEIGHTS_NAME as ADAPTER_WEIGHTS_NAME
@@ -56,6 +55,19 @@ LAYERNORM_NAMES = {"norm", "ln"}

 LLAMABOARD_CONFIG = "llamaboard_config.yaml"

+MCA_SUPPORTED_MODELS = {
+    "deepseek_v3",
+    "llama",
+    "mistral",
+    "mixtral",
+    "qwen2",
+    "qwen2_vl",
+    "qwen2_5_vl",
+    "qwen3",
+    "qwen3_moe",
+    "qwen3_next",
+}
+
 METHODS = ["full", "freeze", "lora", "oft"]

 MOD_SUPPORTED_MODELS = {"bloom", "falcon", "gemma", "llama", "mistral", "mixtral", "phi", "starcoder2"}
@@ -101,12 +113,14 @@ class AttentionFunction(str, Enum):
    DISABLED = "disabled"
    SDPA = "sdpa"
    FA2 = "fa2"
+    FA3 = "fa3"


 class EngineName(str, Enum):
    HF = "huggingface"
    VLLM = "vllm"
    SGLANG = "sglang"
+    KT = "ktransformers"


 class DownloadSource(str, Enum):
@@ -127,6 +141,7 @@ class QuantizationMethod(str, Enum):
    EETQ = "eetq"
    HQQ = "hqq"
    MXFP4 = "mxfp4"
+    FP8 = "fp8"


 class RopeScaling(str, Enum):
@@ -138,7 +153,7 @@ class RopeScaling(str, Enum):

 def register_model_group(
    models: dict[str, dict[DownloadSource, str]],
-    template: Optional[str] = None,
+    template: str | None = None,
    multimodal: bool = False,
 ) -> None:
    for name, path in models.items():
@@ -643,6 +658,26 @@ register_model_group(
 )


+register_model_group(
+    models={
+        "ERNIE-4.5-VL-28B-A3B-PT": {
+            DownloadSource.DEFAULT: "baidu/ERNIE-4.5-VL-28B-A3B-PT",
+            DownloadSource.MODELSCOPE: "PaddlePaddle/ERNIE-4.5-VL-28B-A3B-PT",
+        },
+        "ERNIE-4.5-VL-28B-A3B-Thinking": {
+            DownloadSource.DEFAULT: "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
+            DownloadSource.MODELSCOPE: "PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking",
+        },
+        "ERNIE-4.5-VL-424B-A47B-Base-PT": {
+            DownloadSource.DEFAULT: "baidu/ERNIE-4.5-VL-424B-A47B-PT",
+            DownloadSource.MODELSCOPE: "PaddlePaddle/ERNIE-4.5-VL-424B-A47B-PT",
+        },
+    },
+    template="ernie_vl",
+    multimodal=True,
+)
+
+
 register_model_group(
    models={
        "EXAONE-3.0-7.8B-Instruct": {
@@ -969,9 +1004,17 @@ register_model_group(
        "GLM-4.5V-Air-Thinking": {
            DownloadSource.DEFAULT: "zai-org/GLM-4.5V",
            DownloadSource.MODELSCOPE: "ZhipuAI/GLM-4.5V",
-        }
+        },
+        "GLM-4.6V": {
+            DownloadSource.DEFAULT: "zai-org/GLM-4.6V",
+            DownloadSource.MODELSCOPE: "ZhipuAI/GLM-4.6V",
+        },
+        "GLM-4.6V-Flash": {
+            DownloadSource.DEFAULT: "zai-org/GLM-4.6V-Flash",
+            DownloadSource.MODELSCOPE: "ZhipuAI/GLM-4.6V-Flash",
+        },
    },
-    template="glm4v_moe",
+    template="glm4_5v",
    multimodal=True,
 )

@@ -1024,7 +1067,7 @@ register_model_group(
            DownloadSource.MODELSCOPE: "openai/gpt-oss-120b",
        },
    },
-    template="gpt",
+    template="gpt_oss",
 )


@@ -1152,6 +1195,10 @@ register_model_group(
            DownloadSource.DEFAULT: "tencent/Hunyuan-7B-Instruct",
            DownloadSource.MODELSCOPE: "AI-ModelScope/Hunyuan-7B-Instruct",
        },
+        "Hunyuan-MT-7B-Instruct": {
+            DownloadSource.DEFAULT: "tencent/Hunyuan-MT-7B",
+            DownloadSource.MODELSCOPE: "Tencent-Hunyuan/Hunyuan-MT-7B",
+        },
    },
    template="hunyuan",
 )
@@ -1756,6 +1803,21 @@ register_model_group(
 )


+register_model_group(
+    models={
+        "MiMo-V2-Flash-Base": {
+            DownloadSource.DEFAULT: "XiaomiMiMo/MiMo-V2-Flash-Base",
+            DownloadSource.MODELSCOPE: "XiaomiMiMo/MiMo-V2-Flash-Base",
+        },
+        "MiMo-V2-Flash": {
+            DownloadSource.DEFAULT: "XiaomiMiMo/MiMo-V2-Flash",
+            DownloadSource.MODELSCOPE: "XiaomiMiMo/MiMo-V2-Flash",
+        },
+    },
+    template="mimo_v2",
+)
+
+
 register_model_group(
    models={
        "MiMo-7B-VL-RL": {
@@ -1780,7 +1842,7 @@ register_model_group(
        },
        "MiMo-VL-7B-SFT-2508": {
            DownloadSource.DEFAULT: "XiaomiMiMo/MiMo-VL-7B-SFT-2508",
-            DownloadSource.DEFAULT: "XiaomiMiMo/MiMo-VL-7B-SFT-2508",
+            DownloadSource.MODELSCOPE: "XiaomiMiMo/MiMo-VL-7B-SFT-2508",
        },
    },
    template="qwen2_vl",
@@ -1931,6 +1993,37 @@ register_model_group(
    template="mistral",
 )

+register_model_group(
+    models={
+        "Ministral-3-3B-Base-2512": {
+            DownloadSource.DEFAULT: "mistralai/Ministral-3-3B-Base-2512",
+            DownloadSource.MODELSCOPE: "mistralai/Ministral-3-3B-Base-2512",
+        },
+        "Ministral-3-8B-Base-2512": {
+            DownloadSource.DEFAULT: "mistralai/Ministral-3-8B-Base-2512",
+            DownloadSource.MODELSCOPE: "mistralai/Ministral-3-8B-Base-2512",
+        },
+        "Ministral-3-14B-Base-2512": {
+            DownloadSource.DEFAULT: "mistralai/Ministral-3-14B-Base-2512",
+            DownloadSource.MODELSCOPE: "mistralai/Ministral-3-14B-Base-2512",
+        },
+        "Ministral-3-3B-Instruct-2512": {
+            DownloadSource.DEFAULT: "mistralai/Ministral-3-3B-Instruct-2512",
+            DownloadSource.MODELSCOPE: "mistralai/Ministral-3-3B-Instruct-2512",
+        },
+        "Ministral-3-8B-Instruct-2512": {
+            DownloadSource.DEFAULT: "mistralai/Ministral-3-8B-Instruct-2512",
+            DownloadSource.MODELSCOPE: "mistralai/Ministral-3-8B-Instruct-2512",
+        },
+        "Ministral-3-14B-Instruct-2512": {
+            DownloadSource.DEFAULT: "mistralai/Ministral-3-14B-Instruct-2512",
+            DownloadSource.MODELSCOPE: "mistralai/Ministral-3-14B-Instruct-2512",
+        },
+    },
+    template="ministral3",
+    multimodal=True,
+)
+

 register_model_group(
    models={
@@ -3193,6 +3286,10 @@ register_model_group(

 register_model_group(
    models={
+        "Qwen3-VL-2B-Instruct": {
+            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-2B-Instruct",
+            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-2B-Instruct",
+        },
        "Qwen3-VL-4B-Instruct": {
            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-4B-Instruct",
            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-4B-Instruct",
@@ -3201,6 +3298,10 @@ register_model_group(
            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-8B-Instruct",
            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-8B-Instruct",
        },
+        "Qwen3-VL-32B-Instruct": {
+            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-32B-Instruct",
+            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-32B-Instruct",
+        },
        "Qwen3-VL-30B-A3B-Instruct": {
            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-30B-A3B-Instruct",
            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-30B-A3B-Instruct",
@@ -3217,6 +3318,10 @@ register_model_group(

 register_model_group(
    models={
+        "Qwen3-VL-2B-Thinking": {
+            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-2B-Thinking",
+            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-2B-Thinking",
+        },
        "Qwen3-VL-4B-Thinking": {
            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-4B-Thinking",
            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-4B-Thinking",
@@ -3225,6 +3330,10 @@ register_model_group(
            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-8B-Thinking",
            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-8B-Thinking",
        },
+        "Qwen3-VL-32B-Thinking": {
+            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-32B-Thinking",
+            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-32B-Thinking",
+        },
        "Qwen3-VL-30B-A3B-Thinking": {
            DownloadSource.DEFAULT: "Qwen/Qwen3-VL-30B-A3B-Thinking",
            DownloadSource.MODELSCOPE: "Qwen/Qwen3-VL-30B-A3B-Thinking",
@@ -3438,6 +3547,17 @@ register_model_group(
 )


+register_model_group(
+    models={
+        "VibeThinker-1.5B": {
+            DownloadSource.DEFAULT: "WeiboAI/VibeThinker-1.5B",
+            DownloadSource.MODELSCOPE: "WeiboAI/VibeThinker-1.5B",
+        },
+    },
+    template="qwen3",
+)
+
+
 register_model_group(
    models={
        "Vicuna-v1.5-7B-Chat": {
--- a/src/llamafactory/extras/logging.py
+++ b/src/llamafactory/extras/logging.py
@@ -117,7 +117,7 @@ def _configure_library_root_logger() -> None:
        library_root_logger.propagate = False


-def get_logger(name: Optional[str] = None) -> "_Logger":
+def get_logger(name: str | None = None) -> "_Logger":
    r"""Return a logger with the specified name. It it not supposed to be accessed externally."""
    if name is None:
        name = _get_library_name()
--- a/src/llamafactory/extras/misc.py
+++ b/src/llamafactory/extras/misc.py
@@ -313,6 +313,10 @@ def use_ray() -> bool:
    return is_env_enabled("USE_RAY")


+def use_kt() -> bool:
+    return is_env_enabled("USE_KT")
+
+
 def find_available_port() -> int:
    r"""Find an available port on the local machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
@@ -328,3 +332,7 @@ def fix_proxy(ipv6_enabled: bool = False) -> None:
    if ipv6_enabled:
        os.environ.pop("http_proxy", None)
        os.environ.pop("HTTP_PROXY", None)
+        os.environ.pop("https_proxy", None)
+        os.environ.pop("HTTPS_PROXY", None)
+        os.environ.pop("all_proxy", None)
+        os.environ.pop("ALL_PROXY", None)
--- a/src/llamafactory/extras/packages.py
+++ b/src/llamafactory/extras/packages.py
@@ -70,6 +70,10 @@ def is_matplotlib_available():
    return _is_package_available("matplotlib")


+def is_mcore_adapter_available():
+    return _is_package_available("mcore_adapter")
+
+
 def is_pillow_available():
    return _is_package_available("PIL")

@@ -78,6 +82,10 @@ def is_ray_available():
    return _is_package_available("ray")


+def is_kt_available():
+    return _is_package_available("ktransformers")
+
+
 def is_requests_available():
    return _is_package_available("requests")

@@ -86,6 +94,14 @@ def is_rouge_available():
    return _is_package_available("rouge_chinese")


+def is_safetensors_available():
+    return _is_package_available("safetensors")
+
+
+def is_sglang_available():
+    return _is_package_available("sglang")
+
+
 def is_starlette_available():
    return _is_package_available("sse_starlette")

@@ -95,13 +111,14 @@ def is_transformers_version_greater_than(content: str):
    return _get_package_version("transformers") >= version.parse(content)


+@lru_cache
+def is_torch_version_greater_than(content: str):
+    return _get_package_version("torch") >= version.parse(content)
+
+
 def is_uvicorn_available():
    return _is_package_available("uvicorn")


 def is_vllm_available():
    return _is_package_available("vllm")
-
-
-def is_sglang_available():
-    return _is_package_available("sglang")
--- a/src/llamafactory/hparams/data_args.py
+++ b/src/llamafactory/hparams/data_args.py
@@ -16,22 +16,22 @@
 # limitations under the License.

 from dataclasses import asdict, dataclass, field
-from typing import Any, Literal, Optional
+from typing import Any, Literal


@dataclass
 class DataArguments:
    r"""Arguments pertaining to what data we are going to input our model for training and evaluation."""

-    template: Optional[str] = field(
+    template: str | None = field(
        default=None,
        metadata={"help": "Which template to use for constructing prompts in training and inference."},
    )
-    dataset: Optional[str] = field(
+    dataset: str | None = field(
        default=None,
        metadata={"help": "The name of dataset(s) to use for training. Use commas to separate multiple datasets."},
    )
-    eval_dataset: Optional[str] = field(
+    eval_dataset: str | None = field(
        default=None,
        metadata={"help": "The name of dataset(s) to use for evaluation. Use commas to separate multiple datasets."},
    )
@@ -39,7 +39,7 @@ class DataArguments:
        default="data",
        metadata={"help": "Path to the folder containing the datasets."},
    )
-    media_dir: Optional[str] = field(
+    media_dir: str | None = field(
        default=None,
        metadata={"help": "Path to the folder containing the images, videos or audios. Defaults to `dataset_dir`."},
    )
@@ -67,7 +67,7 @@ class DataArguments:
        default="concat",
        metadata={"help": "Strategy to use in dataset mixing (concat/interleave) (undersampling/oversampling)."},
    )
-    interleave_probs: Optional[str] = field(
+    interleave_probs: str | None = field(
        default=None,
        metadata={"help": "Probabilities to sample data from datasets. Use commas to separate multiple datasets."},
    )
@@ -79,15 +79,15 @@ class DataArguments:
        default=1000,
        metadata={"help": "The number of examples in one group in pre-processing."},
    )
-    preprocessing_num_workers: Optional[int] = field(
+    preprocessing_num_workers: int | None = field(
        default=None,
        metadata={"help": "The number of processes to use for the pre-processing."},
    )
-    max_samples: Optional[int] = field(
+    max_samples: int | None = field(
        default=None,
        metadata={"help": "For debugging purposes, truncate the number of examples for each dataset."},
    )
-    eval_num_beams: Optional[int] = field(
+    eval_num_beams: int | None = field(
        default=None,
        metadata={"help": "Number of beams to use for evaluation. This argument will be passed to `model.generate`"},
    )
@@ -103,7 +103,7 @@ class DataArguments:
        default=False,
        metadata={"help": "Whether or not to evaluate on each dataset separately."},
    )
-    packing: Optional[bool] = field(
+    packing: bool | None = field(
        default=None,
        metadata={"help": "Enable sequences packing in training. Will automatically enable in pre-training."},
    )
@@ -111,19 +111,19 @@ class DataArguments:
        default=False,
        metadata={"help": "Enable sequence packing without cross-attention."},
    )
-    tool_format: Optional[str] = field(
+    tool_format: str | None = field(
        default=None,
        metadata={"help": "Tool format to use for constructing function calling examples."},
    )
-    default_system: Optional[str] = field(
+    default_system: str | None = field(
        default=None,
        metadata={"help": "Override the default system message in the template."},
    )
-    enable_thinking: Optional[bool] = field(
+    enable_thinking: bool | None = field(
        default=True,
        metadata={"help": "Whether or not to enable thinking mode for reasoning models."},
    )
-    tokenized_path: Optional[str] = field(
+    tokenized_path: str | None = field(
        default=None,
        metadata={
            "help": (
--- a/src/llamafactory/hparams/evaluation_args.py
+++ b/src/llamafactory/hparams/evaluation_args.py
@@ -14,7 +14,7 @@

 import os
 from dataclasses import dataclass, field
-from typing import Literal, Optional
+from typing import Literal

 from datasets import DownloadMode

@@ -46,7 +46,7 @@ class EvaluationArguments:
        default=5,
        metadata={"help": "Number of examplars for few-shot learning."},
    )
-    save_dir: Optional[str] = field(
+    save_dir: str | None = field(
        default=None,
        metadata={"help": "Path to save the evaluation results."},
    )
--- a/src/llamafactory/hparams/finetuning_args.py
+++ b/src/llamafactory/hparams/finetuning_args.py
@@ -13,7 +13,7 @@
 # limitations under the License.

 from dataclasses import asdict, dataclass, field
-from typing import Any, Literal, Optional
+from typing import Any, Literal


@dataclass
@@ -40,7 +40,7 @@ class FreezeArguments:
            )
        },
    )
-    freeze_extra_modules: Optional[str] = field(
+    freeze_extra_modules: str | None = field(
        default=None,
        metadata={
            "help": (
@@ -56,7 +56,7 @@ class FreezeArguments:
 class LoraArguments:
    r"""Arguments pertaining to the LoRA training."""

-    additional_target: Optional[str] = field(
+    additional_target: str | None = field(
        default=None,
        metadata={
            "help": (
@@ -66,7 +66,7 @@ class LoraArguments:
            )
        },
    )
-    lora_alpha: Optional[int] = field(
+    lora_alpha: int | None = field(
        default=None,
        metadata={"help": "The scale factor for LoRA fine-tuning (default: lora_rank * 2)."},
    )
@@ -88,7 +88,7 @@ class LoraArguments:
            )
        },
    )
-    loraplus_lr_ratio: Optional[float] = field(
+    loraplus_lr_ratio: float | None = field(
        default=None,
        metadata={"help": "LoRA plus learning rate ratio (lr_B / lr_A)."},
    )
@@ -126,7 +126,7 @@ class LoraArguments:
 class OFTArguments:
    r"""Arguments pertaining to the OFT training."""

-    additional_target: Optional[str] = field(
+    additional_target: str | None = field(
        default=None,
        metadata={
            "help": (
@@ -220,27 +220,27 @@ class RLHFArguments:
        default=False,
        metadata={"help": "Whiten the rewards before compute advantages in PPO training."},
    )
-    ref_model: Optional[str] = field(
+    ref_model: str | None = field(
        default=None,
        metadata={"help": "Path to the reference model used for the PPO or DPO training."},
    )
-    ref_model_adapters: Optional[str] = field(
+    ref_model_adapters: str | None = field(
        default=None,
        metadata={"help": "Path to the adapters of the reference model."},
    )
-    ref_model_quantization_bit: Optional[int] = field(
+    ref_model_quantization_bit: int | None = field(
        default=None,
        metadata={"help": "The number of bits to quantize the reference model."},
    )
-    reward_model: Optional[str] = field(
+    reward_model: str | None = field(
        default=None,
        metadata={"help": "Path to the reward model used for the PPO training."},
    )
-    reward_model_adapters: Optional[str] = field(
+    reward_model_adapters: str | None = field(
        default=None,
        metadata={"help": "Path to the adapters of the reward model."},
    )
-    reward_model_quantization_bit: Optional[int] = field(
+    reward_model_quantization_bit: int | None = field(
        default=None,
        metadata={"help": "The number of bits to quantize the reward model."},
    )
@@ -248,7 +248,7 @@ class RLHFArguments:
        default="lora",
        metadata={"help": "The type of the reward model in PPO training. Lora model only supports lora training."},
    )
-    ld_alpha: Optional[float] = field(
+    ld_alpha: float | None = field(
        default=None,
        metadata={
            "help": (
@@ -361,15 +361,15 @@ class BAdamArgument:
        default="layer",
        metadata={"help": "Whether to use layer-wise or ratio-wise BAdam optimizer."},
    )
-    badam_start_block: Optional[int] = field(
+    badam_start_block: int | None = field(
        default=None,
        metadata={"help": "The starting block index for layer-wise BAdam."},
    )
-    badam_switch_mode: Optional[Literal["ascending", "descending", "random", "fixed"]] = field(
+    badam_switch_mode: Literal["ascending", "descending", "random", "fixed"] | None = field(
        default="ascending",
        metadata={"help": "the strategy of picking block to update for layer-wise BAdam."},
    )
-    badam_switch_interval: Optional[int] = field(
+    badam_switch_interval: int | None = field(
        default=50,
        metadata={
            "help": "Number of steps to update the block for layer-wise BAdam. Use -1 to disable the block update."
@@ -406,15 +406,15 @@ class SwanLabArguments:
        default=False,
        metadata={"help": "Whether or not to use the SwanLab (an experiment tracking and visualization tool)."},
    )
-    swanlab_project: Optional[str] = field(
+    swanlab_project: str | None = field(
        default="llamafactory",
        metadata={"help": "The project name in SwanLab."},
    )
-    swanlab_workspace: Optional[str] = field(
+    swanlab_workspace: str | None = field(
        default=None,
        metadata={"help": "The workspace name in SwanLab."},
    )
-    swanlab_run_name: Optional[str] = field(
+    swanlab_run_name: str | None = field(
        default=None,
        metadata={"help": "The experiment name in SwanLab."},
    )
@@ -422,19 +422,19 @@ class SwanLabArguments:
        default="cloud",
        metadata={"help": "The mode of SwanLab."},
    )
-    swanlab_api_key: Optional[str] = field(
+    swanlab_api_key: str | None = field(
        default=None,
        metadata={"help": "The API key for SwanLab."},
    )
-    swanlab_logdir: Optional[str] = field(
+    swanlab_logdir: str | None = field(
        default=None,
        metadata={"help": "The log directory for SwanLab."},
    )
-    swanlab_lark_webhook_url: Optional[str] = field(
+    swanlab_lark_webhook_url: str | None = field(
        default=None,
        metadata={"help": "The Lark(飞书) webhook URL for SwanLab."},
    )
-    swanlab_lark_secret: Optional[str] = field(
+    swanlab_lark_secret: str | None = field(
        default=None,
        metadata={"help": "The Lark(飞书) secret for SwanLab."},
    )
@@ -461,7 +461,7 @@ class FinetuningArguments(
        default="sft",
        metadata={"help": "Which stage will be performed in training."},
    )
-    finetuning_type: Literal["lora", "freeze", "full"] = field(
+    finetuning_type: Literal["lora", "oft", "freeze", "full"] = field(
        default="lora",
        metadata={"help": "Which fine-tuning method to use."},
    )
@@ -473,6 +473,15 @@ class FinetuningArguments(
        default=False,
        metadata={"help": "Whether or not to use the Adam-mini optimizer."},
    )
+    use_mca: bool = field(
+        default=False,
+        metadata={
+            "help": (
+                "Whether or not to use MCA (Megatron Core Adapter) training. "
+                "Controlled by USE_MCA environment variable."
+            )
+        },
+    )
    use_muon: bool = field(
        default=False,
        metadata={"help": "Whether or not to use the Muon optimizer."},
@@ -501,7 +510,7 @@ class FinetuningArguments(
        default=False,
        metadata={"help": "Whether or not to disable the shuffling of the training set."},
    )
-    early_stopping_steps: Optional[int] = field(
+    early_stopping_steps: int | None = field(
        default=None,
        metadata={"help": "Number of steps to stop training if the `metric_for_best_model` does not improve."},
    )
@@ -521,11 +530,11 @@ class FinetuningArguments(
            return arg

        self.freeze_trainable_modules: list[str] = split_arg(self.freeze_trainable_modules)
-        self.freeze_extra_modules: Optional[list[str]] = split_arg(self.freeze_extra_modules)
+        self.freeze_extra_modules: list[str] | None = split_arg(self.freeze_extra_modules)
        self.lora_alpha: int = self.lora_alpha or self.lora_rank * 2
        self.lora_target: list[str] = split_arg(self.lora_target)
        self.oft_target: list[str] = split_arg(self.oft_target)
-        self.additional_target: Optional[list[str]] = split_arg(self.additional_target)
+        self.additional_target: list[str] | None = split_arg(self.additional_target)
        self.galore_target: list[str] = split_arg(self.galore_target)
        self.apollo_target: list[str] = split_arg(self.apollo_target)
        self.use_ref_model = self.stage == "dpo" and self.pref_loss not in ["orpo", "simpo"]
--- a/src/llamafactory/hparams/model_args.py
+++ b/src/llamafactory/hparams/model_args.py
@@ -1,4 +1,4 @@
-# Copyright 2025 HuggingFace Inc. and the LlamaFactory team.
+# Copyright 2025 HuggingFace Inc., the KVCache.AI team, Approaching AI, and the LlamaFactory team.
 #
 # This code is inspired by the HuggingFace's transformers library.
 # https://github.com/huggingface/transformers/blob/v4.40.0/examples/pytorch/language-modeling/run_clm.py
@@ -17,29 +17,30 @@

 import json
 from dataclasses import asdict, dataclass, field, fields
-from typing import Any, Literal, Optional, Union
+from typing import Any, Literal, Self

 import torch
-from transformers.training_args import _convert_str_dict
-from typing_extensions import Self
 from omegaconf import OmegaConf
+from transformers.training_args import _convert_str_dict

 from ..extras.constants import AttentionFunction, EngineName, QuantizationMethod, RopeScaling
 from ..extras.logging import get_logger

+
 logger = get_logger(__name__)

+
@dataclass
 class BaseModelArguments:
    r"""Arguments pertaining to the model."""

-    model_name_or_path: Optional[str] = field(
+    model_name_or_path: str | None = field(
        default=None,
        metadata={
            "help": "Path to the model weight or identifier from huggingface.co/models or modelscope.cn/models."
        },
    )
-    adapter_name_or_path: Optional[str] = field(
+    adapter_name_or_path: str | None = field(
        default=None,
        metadata={
            "help": (
@@ -48,11 +49,11 @@ class BaseModelArguments:
            )
        },
    )
-    adapter_folder: Optional[str] = field(
+    adapter_folder: str | None = field(
        default=None,
        metadata={"help": "The folder containing the adapter weights to load."},
    )
-    cache_dir: Optional[str] = field(
+    cache_dir: str | None = field(
        default=None,
        metadata={"help": "Where to store the pre-trained models downloaded from huggingface.co or modelscope.cn."},
    )
@@ -68,17 +69,17 @@ class BaseModelArguments:
        default=False,
        metadata={"help": "Whether or not the special tokens should be split during the tokenization process."},
    )
-    add_tokens: Optional[str] = field(
+    add_tokens: str | None = field(
        default=None,
        metadata={
            "help": "Non-special tokens to be added into the tokenizer. Use commas to separate multiple tokens."
        },
    )
-    add_special_tokens: Optional[str] = field(
+    add_special_tokens: str | None = field(
        default=None,
        metadata={"help": "Special tokens to be added into the tokenizer. Use commas to separate multiple tokens."},
    )
-    new_special_tokens_config: Optional[str] = field(
+    new_special_tokens_config: str | None = field(
        default=None,
        metadata={
            "help": (
@@ -108,7 +109,7 @@ class BaseModelArguments:
        default=True,
        metadata={"help": "Whether or not to use memory-efficient model loading."},
    )
-    rope_scaling: Optional[RopeScaling] = field(
+    rope_scaling: RopeScaling | None = field(
        default=None,
        metadata={"help": "Which scaling strategy should be adopted for the RoPE embeddings."},
    )
@@ -120,7 +121,7 @@ class BaseModelArguments:
        default=False,
        metadata={"help": "Enable shift short attention (S^2-Attn) proposed by LongLoRA."},
    )
-    mixture_of_depths: Optional[Literal["convert", "load"]] = field(
+    mixture_of_depths: Literal["convert", "load"] | None = field(
        default=None,
        metadata={"help": "Convert the model to mixture-of-depths (MoD) or load the MoD model."},
    )
@@ -136,7 +137,7 @@ class BaseModelArguments:
        default=False,
        metadata={"help": "Whether or not to enable liger kernel for faster training."},
    )
-    moe_aux_loss_coef: Optional[float] = field(
+    moe_aux_loss_coef: float | None = field(
        default=None,
        metadata={"help": "Coefficient of the auxiliary router loss in mixture-of-experts model."},
    )
@@ -168,23 +169,27 @@ class BaseModelArguments:
        default="offload",
        metadata={"help": "Path to offload model weights."},
    )
-    use_cache: bool = field(
+    use_kv_cache: bool = field(
        default=True,
        metadata={"help": "Whether or not to use KV cache in generation."},
    )
+    use_v1_kernels: bool = field(
+        default=False,
+        metadata={"help": "Whether or not to use high-performance kernels in training."},
+    )
    infer_dtype: Literal["auto", "float16", "bfloat16", "float32"] = field(
        default="auto",
        metadata={"help": "Data type for model weights and activations at inference."},
    )
-    hf_hub_token: Optional[str] = field(
+    hf_hub_token: str | None = field(
        default=None,
        metadata={"help": "Auth token to log in with Hugging Face Hub."},
    )
-    ms_hub_token: Optional[str] = field(
+    ms_hub_token: str | None = field(
        default=None,
        metadata={"help": "Auth token to log in with ModelScope Hub."},
    )
-    om_hub_token: Optional[str] = field(
+    om_hub_token: str | None = field(
        default=None,
        metadata={"help": "Auth token to log in with Modelers Hub."},
    )
@@ -277,7 +282,7 @@ class QuantizationArguments:
        default=QuantizationMethod.BNB,
        metadata={"help": "Quantization method to use for on-the-fly quantization."},
    )
-    quantization_bit: Optional[int] = field(
+    quantization_bit: int | None = field(
        default=None,
        metadata={"help": "The number of bits to quantize the model using on-the-fly quantization."},
    )
@@ -289,7 +294,7 @@ class QuantizationArguments:
        default=True,
        metadata={"help": "Whether or not to use double quantization in bitsandbytes int4 training."},
    )
-    quantization_device_map: Optional[Literal["auto"]] = field(
+    quantization_device_map: Literal["auto"] | None = field(
        default=None,
        metadata={"help": "Device map used to infer the 4-bit quantized model, needs bitsandbytes>=0.43.0."},
    )
@@ -369,7 +374,7 @@ class ProcessorArguments:
 class ExportArguments:
    r"""Arguments pertaining to the model export."""

-    export_dir: Optional[str] = field(
+    export_dir: str | None = field(
        default=None,
        metadata={"help": "Path to the directory to save the exported model."},
    )
@@ -381,11 +386,11 @@ class ExportArguments:
        default="cpu",
        metadata={"help": "The device used in model export, use `auto` to accelerate exporting."},
    )
-    export_quantization_bit: Optional[int] = field(
+    export_quantization_bit: int | None = field(
        default=None,
        metadata={"help": "The number of bits to quantize the exported model."},
    )
-    export_quantization_dataset: Optional[str] = field(
+    export_quantization_dataset: str | None = field(
        default=None,
        metadata={"help": "Path to the dataset or dataset name to use in quantizing the exported model."},
    )
@@ -401,7 +406,7 @@ class ExportArguments:
        default=False,
        metadata={"help": "Whether or not to save the `.bin` files instead of `.safetensors`."},
    )
-    export_hub_model_id: Optional[str] = field(
+    export_hub_model_id: str | None = field(
        default=None,
        metadata={"help": "The name of the repository if push the model to the Hugging Face hub."},
    )
@@ -431,7 +436,7 @@ class VllmArguments:
        default=32,
        metadata={"help": "Maximum rank of all LoRAs in the vLLM engine."},
    )
-    vllm_config: Optional[Union[dict, str]] = field(
+    vllm_config: dict | str | None = field(
        default=None,
        metadata={"help": "Config to initialize the vllm engine. Please use JSON strings."},
    )
@@ -457,7 +462,7 @@ class SGLangArguments:
        default=-1,
        metadata={"help": "Tensor parallel size for the SGLang engine."},
    )
-    sglang_config: Optional[Union[dict, str]] = field(
+    sglang_config: dict | str | None = field(
        default=None,
        metadata={"help": "Config to initialize the SGLang engine. Please use JSON strings."},
    )
@@ -473,26 +478,77 @@ class SGLangArguments:
            self.sglang_config = _convert_str_dict(json.loads(self.sglang_config))


+@dataclass
+class KTransformersArguments:
+    r"""Arguments pertaining to the KT training."""
+
+    use_kt: bool = field(
+        default=False,
+        metadata={"help": "Whether To Use KTransformers Optimizations For LoRA Training."},
+    )
+    kt_optimize_rule: str | None = field(
+        default=None,
+        metadata={
+            "help": "Path To The KTransformers Optimize Rule; See https://github.com/kvcache-ai/ktransformers/."
+        },
+    )
+    cpu_infer: int | None = field(
+        default=32,
+        metadata={"help": "Number Of CPU Cores Used For Computation."},
+    )
+    chunk_size: int | None = field(
+        default=8192,
+        metadata={"help": "Chunk Size Used For CPU Compute In KTransformers."},
+    )
+    mode: str | None = field(
+        default="normal",
+        metadata={"help": "Normal Or Long_Context For Llama Models."},
+    )
+
+    kt_maxlen: int = field(
+        default=4096,
+        metadata={"help": "Maximum Sequence (Prompt + Response) Length Of The KT Engine."},
+    )
+    kt_use_cuda_graph: bool = field(
+        default=True,
+        metadata={"help": "Whether To Use CUDA Graphs For The KT Engine."},
+    )
+    kt_mode: str = field(
+        default="normal",
+        metadata={"help": "Normal Or Long_Context Mode For The KT Engine."},
+    )
+    kt_force_think: bool = field(
+        default=False,
+        metadata={"help": "Force-Think Toggle For The KT Engine."},
+    )
+
+
@dataclass
 class ModelArguments(
-    SGLangArguments, VllmArguments, ExportArguments, ProcessorArguments, QuantizationArguments, BaseModelArguments
+    SGLangArguments,
+    VllmArguments,
+    KTransformersArguments,
+    ExportArguments,
+    ProcessorArguments,
+    QuantizationArguments,
+    BaseModelArguments,
 ):
    r"""Arguments pertaining to which model/config/tokenizer we are going to fine-tune or infer.

    The class on the most right will be displayed first.
    """

-    compute_dtype: Optional[torch.dtype] = field(
+    compute_dtype: torch.dtype | None = field(
        default=None,
        init=False,
        metadata={"help": "Torch data type for computing model outputs, derived from `fp/bf16`. Do not specify it."},
    )
-    device_map: Optional[Union[str, dict[str, Any]]] = field(
+    device_map: str | dict[str, Any] | None = field(
        default=None,
        init=False,
        metadata={"help": "Device map for model placement, derived from training stage. Do not specify it."},
    )
-    model_max_length: Optional[int] = field(
+    model_max_length: int | None = field(
        default=None,
        init=False,
        metadata={"help": "The maximum input length for model, derived from `cutoff_len`. Do not specify it."},
--- a/src/llamafactory/hparams/parser.py
+++ b/src/llamafactory/hparams/parser.py
@@ -18,7 +18,7 @@
 import os
 import sys
 from pathlib import Path
-from typing import Any, Optional, Union
+from typing import Any, Optional

 import torch
 import transformers
@@ -32,7 +32,7 @@ from transformers.utils import is_torch_bf16_gpu_available, is_torch_npu_availab
 from ..extras import logging
 from ..extras.constants import CHECKPOINT_NAMES, EngineName
 from ..extras.misc import check_dependencies, check_version, get_current_device, is_env_enabled
-from ..extras.packages import is_transformers_version_greater_than
+from ..extras.packages import is_mcore_adapter_available, is_transformers_version_greater_than
 from .data_args import DataArguments
 from .evaluation_args import EvaluationArguments
 from .finetuning_args import FinetuningArguments
@@ -53,8 +53,19 @@ _INFER_CLS = tuple[ModelArguments, DataArguments, FinetuningArguments, Generatin
 _EVAL_ARGS = [ModelArguments, DataArguments, EvaluationArguments, FinetuningArguments]
 _EVAL_CLS = tuple[ModelArguments, DataArguments, EvaluationArguments, FinetuningArguments]

+if is_mcore_adapter_available() and is_env_enabled("USE_MCA"):
+    from mcore_adapter import TrainingArguments as McaTrainingArguments

-def read_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> Union[dict[str, Any], list[str]]:
+    _TRAIN_MCA_ARGS = [ModelArguments, DataArguments, McaTrainingArguments, FinetuningArguments, GeneratingArguments]
+    _TRAIN_MCA_CLS = tuple[
+        ModelArguments, DataArguments, McaTrainingArguments, FinetuningArguments, GeneratingArguments
+    ]
+else:
+    _TRAIN_MCA_ARGS = []
+    _TRAIN_MCA_CLS = tuple()
+
+
+def read_args(args: dict[str, Any] | list[str] | None = None) -> dict[str, Any] | list[str]:
    r"""Get arguments from the command line or a config file."""
    if args is not None:
        return args
@@ -72,7 +83,7 @@ def read_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> Union[


 def _parse_args(
-    parser: "HfArgumentParser", args: Optional[Union[dict[str, Any], list[str]]] = None, allow_extra_keys: bool = False
+    parser: "HfArgumentParser", args: dict[str, Any] | list[str] | None = None, allow_extra_keys: bool = False
 ) -> tuple[Any]:
    args = read_args(args)
    if isinstance(args, dict):
@@ -145,6 +156,9 @@ def _check_extra_dependencies(
    finetuning_args: "FinetuningArguments",
    training_args: Optional["TrainingArguments"] = None,
 ) -> None:
+    if model_args.use_kt:
+        check_version("ktransformers", mandatory=True)
+
    if model_args.use_unsloth:
        check_version("unsloth", mandatory=True)

@@ -191,32 +205,57 @@ def _check_extra_dependencies(
            check_version("rouge_chinese", mandatory=True)


-def _parse_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _TRAIN_CLS:
+def _parse_train_args(args: dict[str, Any] | list[str] | None = None) -> _TRAIN_CLS:
    parser = HfArgumentParser(_TRAIN_ARGS)
    allow_extra_keys = is_env_enabled("ALLOW_EXTRA_ARGS")
    return _parse_args(parser, args, allow_extra_keys=allow_extra_keys)


-def _parse_infer_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _INFER_CLS:
+def _parse_train_mca_args(args: dict[str, Any] | list[str] | None = None) -> _TRAIN_MCA_CLS:
+    parser = HfArgumentParser(_TRAIN_MCA_ARGS)
+    allow_extra_keys = is_env_enabled("ALLOW_EXTRA_ARGS")
+    model_args, data_args, training_args, finetuning_args, generating_args = _parse_args(
+        parser, args, allow_extra_keys=allow_extra_keys
+    )
+
+    _configure_mca_training_args(training_args, data_args, finetuning_args)
+
+    return model_args, data_args, training_args, finetuning_args, generating_args
+
+
+def _configure_mca_training_args(training_args, data_args, finetuning_args) -> None:
+    """Patch training args to avoid args checking errors and sync MCA settings."""
+    training_args.predict_with_generate = False
+    training_args.generation_max_length = data_args.cutoff_len
+    training_args.generation_num_beams = 1
+    training_args.use_mca = True
+    finetuning_args.use_mca = True
+
+
+def _parse_infer_args(args: dict[str, Any] | list[str] | None = None) -> _INFER_CLS:
    parser = HfArgumentParser(_INFER_ARGS)
    allow_extra_keys = is_env_enabled("ALLOW_EXTRA_ARGS")
    return _parse_args(parser, args, allow_extra_keys=allow_extra_keys)


-def _parse_eval_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _EVAL_CLS:
+def _parse_eval_args(args: dict[str, Any] | list[str] | None = None) -> _EVAL_CLS:
    parser = HfArgumentParser(_EVAL_ARGS)
    allow_extra_keys = is_env_enabled("ALLOW_EXTRA_ARGS")
    return _parse_args(parser, args, allow_extra_keys=allow_extra_keys)


-def get_ray_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> RayArguments:
+def get_ray_args(args: dict[str, Any] | list[str] | None = None) -> RayArguments:
    parser = HfArgumentParser(RayArguments)
    (ray_args,) = _parse_args(parser, args, allow_extra_keys=True)
    return ray_args


-def get_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _TRAIN_CLS:
-    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
+def get_train_args(args: dict[str, Any] | list[str] | None = None) -> _TRAIN_CLS:
+    if is_env_enabled("USE_MCA"):
+        model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_mca_args(args)
+    else:
+        model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
+        finetuning_args.use_mca = False

    # Setup logging
    if training_args.should_log:
@@ -246,13 +285,16 @@ def get_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _
        if model_args.shift_attn:
            raise ValueError("PPO training is incompatible with S^2-Attn.")

+        if finetuning_args.reward_model_type == "lora" and model_args.use_kt:
+            raise ValueError("KTransformers does not support lora reward model.")
+
        if finetuning_args.reward_model_type == "lora" and model_args.use_unsloth:
            raise ValueError("Unsloth does not support lora reward model.")

        if training_args.report_to and training_args.report_to[0] not in ["wandb", "tensorboard"]:
            raise ValueError("PPO only accepts wandb or tensorboard logger.")

-    if training_args.parallel_mode == ParallelMode.NOT_DISTRIBUTED:
+    if not model_args.use_kt and training_args.parallel_mode == ParallelMode.NOT_DISTRIBUTED:
        raise ValueError("Please launch distributed training with `llamafactory-cli` or `torchrun`.")

    if training_args.deepspeed and training_args.parallel_mode != ParallelMode.DISTRIBUTED:
@@ -264,18 +306,15 @@ def get_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _
    if training_args.do_train and data_args.dataset is None:
        raise ValueError("Please specify dataset for training.")

-    if (training_args.do_eval or training_args.do_predict) and (
+    if (training_args.do_eval or training_args.do_predict or training_args.predict_with_generate) and (
        data_args.eval_dataset is None and data_args.val_size < 1e-6
    ):
-        raise ValueError("Please specify dataset for evaluation.")
+        raise ValueError("Please make sure eval_dataset be provided or val_size >1e-6")

    if training_args.predict_with_generate:
        if is_deepspeed_zero3_enabled():
            raise ValueError("`predict_with_generate` is incompatible with DeepSpeed ZeRO-3.")

-        if data_args.eval_dataset is None:
-            raise ValueError("Cannot use `predict_with_generate` if `eval_dataset` is None.")
-
        if finetuning_args.compute_accuracy:
            raise ValueError("Cannot use `predict_with_generate` and `compute_accuracy` together.")

@@ -314,6 +353,9 @@ def get_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _
    if model_args.use_unsloth and is_deepspeed_zero3_enabled():
        raise ValueError("Unsloth is incompatible with DeepSpeed ZeRO-3.")

+    if model_args.use_kt and is_deepspeed_zero3_enabled():
+        raise ValueError("KTransformers is incompatible with DeepSpeed ZeRO-3.")
+
    if data_args.neat_packing and is_transformers_version_greater_than("4.53.0"):
        raise ValueError("Neat packing is incompatible with transformers>=4.53.0.")

@@ -431,7 +473,7 @@ def get_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _
    return model_args, data_args, training_args, finetuning_args, generating_args


-def get_infer_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _INFER_CLS:
+def get_infer_args(args: dict[str, Any] | list[str] | None = None) -> _INFER_CLS:
    model_args, data_args, finetuning_args, generating_args = _parse_infer_args(args)

    # Setup logging
@@ -466,7 +508,7 @@ def get_infer_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _
    return model_args, data_args, finetuning_args, generating_args


-def get_eval_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _EVAL_CLS:
+def get_eval_args(args: dict[str, Any] | list[str] | None = None) -> _EVAL_CLS:
    model_args, data_args, eval_args, finetuning_args = _parse_eval_args(args)

    # Setup logging
--- a/src/llamafactory/hparams/training_args.py
+++ b/src/llamafactory/hparams/training_args.py
@@ -14,19 +14,33 @@

 import json
 from dataclasses import dataclass, field
-from typing import Literal, Optional, Union
+from typing import Literal

 from transformers import Seq2SeqTrainingArguments
 from transformers.training_args import _convert_str_dict

-from ..extras.misc import use_ray
+from ..extras.misc import is_env_enabled, use_ray
+from ..extras.packages import is_mcore_adapter_available
+
+
+if is_env_enabled("USE_MCA"):
+    if not is_mcore_adapter_available():
+        raise ImportError(
+            "mcore_adapter is required when USE_MCA=1. Please install `mcore_adapter` and its dependencies."
+        )
+
+    from mcore_adapter import Seq2SeqTrainingArguments as McaSeq2SeqTrainingArguments
+
+    BaseTrainingArguments = McaSeq2SeqTrainingArguments
+else:
+    BaseTrainingArguments = Seq2SeqTrainingArguments


@dataclass
 class RayArguments:
    r"""Arguments pertaining to the Ray training."""

-    ray_run_name: Optional[str] = field(
+    ray_run_name: str | None = field(
        default=None,
        metadata={"help": "The training results will be saved at `<ray_storage_path>/ray_run_name`."},
    )
@@ -34,7 +48,7 @@ class RayArguments:
        default="./saves",
        metadata={"help": "The storage path to save training results to"},
    )
-    ray_storage_filesystem: Optional[Literal["s3", "gs", "gcs"]] = field(
+    ray_storage_filesystem: Literal["s3", "gs", "gcs"] | None = field(
        default=None,
        metadata={"help": "The storage filesystem to use. If None specified, local filesystem will be used."},
    )
@@ -42,7 +56,7 @@ class RayArguments:
        default=1,
        metadata={"help": "The number of workers for Ray training. Default is 1 worker."},
    )
-    resources_per_worker: Union[dict, str] = field(
+    resources_per_worker: dict | str = field(
        default_factory=lambda: {"GPU": 1},
        metadata={"help": "The resources per worker for Ray training. Default is to use 1 GPU per worker."},
    )
@@ -50,7 +64,7 @@ class RayArguments:
        default="PACK",
        metadata={"help": "The placement strategy for Ray training. Default is PACK."},
    )
-    ray_init_kwargs: Optional[Union[dict, str]] = field(
+    ray_init_kwargs: dict | str | None = field(
        default=None,
        metadata={"help": "The arguments to pass to ray.init for Ray training. Default is None."},
    )
@@ -78,9 +92,14 @@ class RayArguments:


@dataclass
-class TrainingArguments(RayArguments, Seq2SeqTrainingArguments):
+class TrainingArguments(RayArguments, BaseTrainingArguments):
    r"""Arguments pertaining to the trainer."""

+    overwrite_output_dir: bool = field(
+        default=False,
+        metadata={"help": "deprecated"},
+    )
+
    def __post_init__(self):
-        Seq2SeqTrainingArguments.__post_init__(self)
        RayArguments.__post_init__(self)
+        BaseTrainingArguments.__post_init__(self)
--- a/src/llamafactory/launcher.py
+++ b/src/llamafactory/launcher.py
@@ -38,7 +38,7 @@ USAGE = (
 def launch():
    from .extras import logging
    from .extras.env import VERSION, print_env
-    from .extras.misc import find_available_port, get_device_count, is_env_enabled, use_ray
+    from .extras.misc import find_available_port, get_device_count, is_env_enabled, use_kt, use_ray

    logger = logging.get_logger(__name__)
    WELCOME = (
@@ -54,7 +54,12 @@ def launch():
    )

    command = sys.argv.pop(1) if len(sys.argv) > 1 else "help"
-    if command == "train" and (is_env_enabled("FORCE_TORCHRUN") or (get_device_count() > 1 and not use_ray())):
+    if is_env_enabled("USE_MCA"):  # force use torchrun
+        os.environ["FORCE_TORCHRUN"] = "1"
+
+    if command == "train" and (
+        is_env_enabled("FORCE_TORCHRUN") or (get_device_count() > 1 and not use_ray() and not use_kt())
+    ):
        # launch distributed training
        nnodes = os.getenv("NNODES", "1")
        node_rank = os.getenv("NODE_RANK", "0")
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Yaowei Zheng	7ef1fba34a	[version] fix gradio (#9685 )	2025-12-28 05:00:51 +08:00
Copilot	eceec8ab69	[deps] goodbye python 3.9 (#9677 ) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: hiyouga <16256802+hiyouga@users.noreply.github.com> Co-authored-by: hiyouga <hiyouga@buaa.edu.cn>	2025-12-27 02:50:44 +08:00
Yaowei Zheng	b44f651e09	[ci] fix docker (#9678 )	2025-12-27 02:43:46 +08:00
Yaowei Zheng	55590f5ece	[misc] fix ci with uv (#9676 )	2025-12-27 01:39:13 +08:00
Copilot	a1b1931b4a	[breaking] migrate from setuptools to uv (#9673 ) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: hiyouga <16256802+hiyouga@users.noreply.github.com>	2025-12-26 22:47:23 +08:00
Xunpeng Xiao	3c17f2722c	[model] Update ernie_vl to adapt new version (#9665 )	2025-12-26 19:57:49 +08:00
Copilot	a882e2d5fc	[assets] Add GitHub Copilot instructions for repository (#9675 ) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: hiyouga <16256802+hiyouga@users.noreply.github.com>	2025-12-26 17:32:48 +08:00
Yaowei Zheng	a754604c11	[misc] fix accelerator (#9661 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-25 02:11:04 +08:00
Xunpeng Xiao	6a2eafbae3	[feat] Models trained and inferred with Mxfp4 are dequantized by default (#9652 ) Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-12-24 00:26:40 +08:00
Yaowei Zheng	84485406b7	[ci] disable pip cache for ci (#9654 )	2025-12-23 18:37:40 +08:00
Kingsley	1c8a42d2f8	[v1&WIP] dataloader init (#9645 )	2025-12-23 16:29:47 +08:00
thulyubh22	7901b2f32e	[model] efficient tuning for gpt-oss (#9354 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-23 16:28:38 +08:00
Yaowei Zheng	1f1f5a7d1b	[ci] remove docker cache (#9640 )	2025-12-22 01:03:10 +08:00
Yaowei Zheng	6ef9854713	[misc] fix cache & pin transformers to 4.57.1 (#9638 )	2025-12-22 00:20:55 +08:00
Hertz	4923f52a28	[model] support MiMo-V2-Flash model (#9637 )	2025-12-21 14:38:18 +08:00
Yaowei Zheng	0894b4f37e	[misc] lint (#9636 )	2025-12-20 16:19:39 +08:00
ZIYI ZENG	b0d49e137f	[misc] Support split eval_dataset when explict set "predict_with_generate" (#9604 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-20 01:46:00 +08:00
Xunpeng Xiao	ddd7dcc722	[data] Fix the video frame sampling issue #9620 (#9634 )	2025-12-19 18:36:31 +08:00
浮梦	5204cd2bca	[misc] add version check for moe (#9633 )	2025-12-19 14:57:37 +08:00
Xunpeng Xiao	8c74dca76a	[feat] Models trained and inferred with FP8 are dequantized by default (#9627 )	2025-12-18 22:54:35 +08:00
xvxuopop	e8deda53a1	[example] add Qwen3 series examples (#9624 ) Co-authored-by: UsernameFull <tohowtodoit@gmail.com>	2025-12-18 21:27:00 +08:00
mrhaoxx	a769fb94b9	[feat] support ktransformers for dpo (#9621 ) Co-authored-by: poryfly <porykid@gmail.com>	2025-12-18 21:26:25 +08:00
mrhaoxx	964569751f	[kt] refactor ktransformers integration (#9632 )	2025-12-18 21:26:04 +08:00
Hertz	9fd4b094d4	[model] support VibeThinker models (#9616 )	2025-12-16 21:50:46 +08:00
浮梦	18c21bce5a	[test] add allreduce test on npu (#9619 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-12-16 21:33:30 +08:00
sunyi0505	a0179772ab	[example] add deepspeed autotp config and example (#9602 )	2025-12-15 15:15:26 +08:00
Yaowei Zheng	aeda079014	[v1] model loader (#9613 )	2025-12-14 11:50:52 +08:00
Xunpeng Xiao	fdd24276ed	[feat] support new function call value (#9610 ) Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-12-14 00:20:33 +08:00
Yaowei Zheng	110d21713e	[v1] add dp & mp mesh (#9611 )	2025-12-13 01:44:28 +08:00
Yaowei Zheng	203069e11c	[v1] add accelerator (#9607 )	2025-12-12 19:22:06 +08:00
tangefly	4fd94141a4	[model] Add Ministral3 (#9582 ) Co-authored-by: kingsley <kingsleydodonow@gmail.com>	2025-12-10 15:57:24 +08:00
Kingsley	22d6ac29d5	[model] Rename GLMV template (#9595 )	2025-12-10 13:27:47 +08:00
DoubleWheat	cff4483392	[config] Fix RoPE scaling patch for resuming from a scaled model (#9588 )	2025-12-09 20:37:37 +08:00
Yaowei Zheng	5d56817e2b	[misc] lint (#9593 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-09 18:00:35 +08:00
Yaowei Zheng	1bbb461f76	[assets] update readme (#9587 )	2025-12-09 12:22:54 +08:00
Hertz	c1f5f8fff6	[model] support GLM4.6v (#9586 )	2025-12-09 11:06:42 +08:00
Yaowei Zheng	5744f1ea94	[v1] add models & accelerator (#9579 )	2025-12-08 02:30:25 +08:00
tangefly	739954910a	[deps] Update for Transformers v5 (#9569 )	2025-12-08 01:13:32 +08:00
xvxuopop	109162dc56	[fix] fix the issue when using fsdp2 with gradient checkpointing. (#9541 ) Co-authored-by: jin-yongxu <jinyongxu@h-partners.com>	2025-12-06 16:04:51 +08:00
jiaqiw09	165f3f073a	[examples] add fsdp config for mutiple nodes (#9575 ) Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-12-05 23:22:48 +08:00
jiaqiw09	efb13b7483	[V1] Refactor ascend MoE kernel patch logic & Support Qwen3-MoE (#9557 )	2025-12-02 00:22:03 +08:00
Username_Full	e43a972b25	[test] add npu test yaml and add ascend a3 docker file (#9547 ) Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>	2025-11-30 09:37:08 +08:00
Kingsley	22be45c78c	[misc] fix omni thinker load (#9552 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-30 09:36:36 +08:00
浮梦	d1f585f80a	[test] update test cmd (#9544 ) Co-authored-by: frozenleaves <frozen@Mac.local> Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-11-27 17:59:42 +08:00
xvxuopop	955396e8a5	[example] correct the parameter errors in the examples file. (#9543 )	2025-11-27 17:38:38 +08:00
xvxuopop	231756a5bf	[chat] fix the error when the vLLM version is greater than 0.10.0 (#9539 ) Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-11-27 02:14:53 +08:00
xvxuopop	2c4fb3c97e	[v1] Support fused moe kernel for qwen3vlmoe model. (#9532 )	2025-11-27 02:13:33 +08:00
浮梦	2b6f16f261	[model] temporarily support npu fused options on v0, powered by v1 kernels (#9520 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-11-27 02:08:36 +08:00
浮梦	f17efde693	[v1] support automatic discovery of registered kernels. (#9509 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-11-27 01:47:22 +08:00
Hertz	591fc9ed02	[model] support ERNIE-4.5-VL Models (#9521 )	2025-11-24 16:48:06 +08:00
Peilin Li	3140c242f0	[assets] add README with KT+llamafactory (#9514 )	2025-11-19 16:50:45 +08:00
Peilin Li	887c562d60	[example] Add KTransformers Qwen3MoE example (#9511 ) Co-authored-by: unknown <xiongchenhui@hisense.ad> Co-authored-by: Kingsley <kingsleydodonow@gmail.com>	2025-11-19 00:53:28 +08:00
Edge-Seven	9779b1f361	[misc] fix typos in some files (#9505 ) Co-authored-by: khanhkhanhlele <namkhanh20xx@gmail.com>	2025-11-18 20:36:01 +08:00
Yinlei Sun	45f0437a14	[v1] Add support for ShareGPT format. (#9486 )	2025-11-18 13:44:08 +08:00
浮梦	d4e120423d	[data] fix qwen3omni moe model (#9501 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-11-18 13:43:22 +08:00
Pory	10a446e373	[model] ktransformers qwen3 support (#9485 ) Co-authored-by: unknown <xiongchenhui@hisense.ad>	2025-11-13 20:09:44 +08:00
jiaqiw09	0aa4a051af	[test] support slow skip and device skip in Uts (#9484 )	2025-11-13 20:08:22 +08:00
Yaowei Zheng	8173a88a26	[assets] update readme (#9477 )	2025-11-12 16:15:41 +08:00
Kingsley	fef86fa7fe	[data] fix qwen3omni audio length calculation (#9467 )	2025-11-12 10:37:15 +08:00
taohongsheng	5afa851f71	[misc] Modify pip install command for huggingface_hub (#9463 )	2025-11-10 23:04:00 +08:00
MyungHa Kwon	a711bce664	[data] add openai format (#9449 )	2025-11-06 20:10:20 +08:00
魅影	bd24350cbf	[v1] add pair data converter (#9360 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-11-06 14:05:58 +08:00
Peilin Li	bd30c0003b	[train] fix denominator of ga in ksft loss (#9409 )	2025-11-05 20:53:23 +08:00
魅影	8edd2622ce	[docker] update npu dockerfile (#9407 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-11-05 18:28:32 +08:00
Yaowei Zheng	eaf963f67f	[model] update kt code (#9406 )	2025-11-05 15:27:22 +08:00
Kingsley	56f45e826f	[train] fix MPO re-weight (#9405 )	2025-11-04 21:10:41 +08:00
魅影	14abb75126	[model] enable using FA in npu (#9397 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-11-04 19:32:30 +08:00
한송민	5a9939050e	[model] add deepstack_merger_list to Qwen3-VL vision_model_keys (#9399 )	2025-11-04 19:27:34 +08:00
Peilin Li	934b3084ee	[train] KTransformers SFT as backend engine for LLaMA-Factory (#9400 ) Co-authored-by: jimmy128 <jimmy128@noreply.gitcode.com> Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-11-04 15:54:12 +08:00
Yaowei Zheng	3ae15da9c0	[misc] lint code (#9395 )	2025-11-03 22:08:59 +08:00
魅影	215580c77d	[data] fix mm pluigin for qwen omni video training (#9388 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-11-03 11:44:27 +08:00
魅影	767b344fb4	[model] remove npu sdpa patch (#9368 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-10-30 16:26:35 +08:00
Kingsley	3057db15c3	[readme] upd mcore readme (#9352 )	2025-10-27 21:23:31 +08:00
Kingsley	13170577b2	[feat] support megatron-LM training by mcore_adapter (#9237 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-10-26 16:21:30 +08:00
Xiaosu Zhu	129e918106	[data] Fix Qwen3VL plugin (#9297 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn> Co-authored-by: kingsley <kingsleydodonow@gmail.com>	2025-10-26 16:07:04 +08:00
Yaowei Zheng	9c0d033a15	[model] add qwen3vl 2b & 32b (#9343 )	2025-10-24 13:22:36 +08:00
Yaowei Zheng	2a822178de	[deps] fix yanked packages (#9333 )	2025-10-22 20:54:51 +08:00
Kingsley	b842457ef4	[ci] revert mac os ci setup (#9316 )	2025-10-21 18:26:12 +08:00
魅影	2c6aded5d4	[v1] kernel plugin (#9274 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2025-10-18 18:02:14 +08:00
Yaowei Zheng	d9d67ba62d	[misc] fix import error (#9299 )	2025-10-17 17:46:27 +08:00
Yaowei Zheng	a442fa90ad	[misc] fix import error (#9296 )	2025-10-17 10:54:30 +08:00
wyfdgg	8c341cbaae	[model] support hunyuan-mt model (#9284 ) Co-authored-by: wyfdgg <liwenkun0812@163.com> Co-authored-by: Yaowei Zheng <hiyouga@buaa.edu.cn>	2025-10-17 10:33:09 +08:00