fix: projector lookup for gemma4 modules (#10382)

Co-authored-by: yiluoAK_47 <yiluoAK_47@163.com>
[model] set mm_projectors for omni models (#10378)
2026-04-17 02:06:03 +08:00 · 2026-04-12 08:32:14 +08:00 · 2026-04-10 18:12:57 +08:00 · 2026-04-06 13:14:45 +08:00 · 2026-04-05 12:10:28 +08:00 · 2026-04-01 22:40:12 +08:00
17 changed files with 883 additions and 42 deletions
--- a/.ai/CLAUDE.md
+++ b/.ai/CLAUDE.md
@@ -0,0 +1,105 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Commands
+
+```bash
+# Code style (auto-fix)
+make style
+
+# Code quality check (no modifications)
+make quality
+
+# Run all tests
+make test
+
+# Run a single test file
+WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/path/to/test_file.py
+
+# Run tests matching a pattern
+WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ -k "test_name"
+
+# License header check
+make license
+
+# Build package
+make build
+```
+
+The project uses `uv` as the preferred package manager. Commands automatically use `uv run` / `uvx` if `uv` is available.
+
+## Architecture
+
+LlamaFactory has two parallel architectures controlled by the `USE_V1` environment variable:
+
+- **v0 (default):** `api, webui > chat, eval, train > data, model > hparams > extras`
+- **v1 (experimental, `USE_V1=1`):** `trainers > core > accelerator, plugins, config > utils`
+
+Most active development happens in v0. The v1 architecture lives in `src/llamafactory/v1/`.
+
+### Entry Points
+
+CLI entry point is `llamafactory-cli` / `lmf` → `src/llamafactory/cli.py:main()`, which dispatches to `launcher.py` based on `USE_V1`.
+
+Available subcommands: `train`, `chat`, `api`, `export`, `webchat`, `webui`, `env`, `version`, `help`.
+
+### Training Flow (v0)
+
+```
+run_exp() [tuner.py]
+  → read_args() → parse YAML/JSON config
+  → get_train_args() → produces typed argument dataclasses
+  → routes to: run_sft / run_dpo / run_ppo / run_rm / run_pt / run_kto
+  → optional: export_model()
+```
+
+Training is invoked with a YAML config: `llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml`
+
+### Configuration System
+
+All training parameters are specified in YAML/JSON config files. Argument parsing in `src/llamafactory/hparams/parser.py` produces four typed dataclasses:
+- `ModelArguments` — model/tokenizer selection, quantization
+- `DataArguments` — datasets, templates, preprocessing
+- `FinetuningArguments` — LoRA rank/target, training method (sft/dpo/ppo/rm/pt/kto)
+- `TrainingArguments` — extends HuggingFace's `TrainingArguments`
+
+### Key Modules
+
+| Module | Purpose |
+|--------|---------|
+| `src/llamafactory/model/loader.py` | Loads model + tokenizer; applies quantization, LoRA, patches |
+| `src/llamafactory/model/patcher.py` | Model-specific compatibility patches |
+| `src/llamafactory/data/template.py` | Prompt templates; `TEMPLATES` dict maps model family → format |
+| `src/llamafactory/data/mm_plugin.py` | Multi-modal (image/video/audio) data handling |
+| `src/llamafactory/data/processor/` | Per-stage data processors (supervised, pairwise, pretrain, etc.) |
+| `src/llamafactory/train/sft/` | SFT trainer; other stages follow same structure |
+| `src/llamafactory/chat/` | Inference engines: `hf_engine`, `vllm_engine`, `sglang_engine`, `kt_engine` |
+| `src/llamafactory/extras/constants.py` | Enums and constants used across the project |
+
+### Adding Support for a New Model
+
+1. Add a prompt template to `src/llamafactory/data/template.py` in the `TEMPLATES` dict
+2. Add any necessary model patches in `src/llamafactory/model/patcher.py`
+3. Add multi-modal support in `src/llamafactory/data/mm_plugin.py` if needed
+
+### Distributed Training
+
+Multi-GPU training is launched automatically through `torchrun`. Additional backends:
+- **Ray:** Optional Ray cluster support
+- **HyperParallel FSDP2:** `src/llamafactory/train/hyper_parallel/`
+- **Megatron-core:** `src/llamafactory/train/mca/`
+
+### Testing
+
+- `tests/` — v0 tests; `tests_v1/` — v1 tests
+- Most training tests require GPU hardware
+- pytest markers: `@pytest.mark.slow`, `@pytest.mark.runs_on(['cuda'])`
+- Always set `WANDB_DISABLED=true` when running tests
+
+### Code Style
+
+- Ruff for linting and formatting (line length 119, Google-style docstrings)
+- Python 3.11+ syntax
+- Double quotes for strings
+- All new files must include Apache 2.0 license header (checked by `make license`)
--- a/src/llamafactory/data/collator.py
+++ b/src/llamafactory/data/collator.py
@@ -380,6 +380,19 @@ class MultiModalDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
            for i, feature in enumerate(features):
                feature["token_type_ids"] = token_type_ids[i]

+        if "mm_token_type_ids" in mm_inputs:  # gemma4 needs a padded tensor, not ragged lists
+            mm_token_type_ids = mm_inputs.pop("mm_token_type_ids")
+            max_len = max(len(ids) for ids in mm_token_type_ids)
+            padded = []
+            for ids in mm_token_type_ids:
+                pad_len = max_len - len(ids)
+                if self.tokenizer.padding_side == "right":
+                    padded.append(ids + [0] * pad_len)
+                else:
+                    padded.append([0] * pad_len + ids)
+
+            mm_inputs["mm_token_type_ids"] = torch.tensor(padded, dtype=torch.long)
+
        features: dict[str, torch.Tensor] = super().__call__(features)

        bsz, seq_len = features["input_ids"].shape[:2]
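Reviewer note: the padding added to the collator above can be read in isolation as the following minimal sketch. It uses plain Python lists instead of `torch.tensor` so that it runs without extra dependencies, and the helper name `pad_token_type_ids` is hypothetical; it does not exist in the repository.

```python
def pad_token_type_ids(
    batch: list[list[int]], padding_side: str = "right", pad_value: int = 0
) -> list[list[int]]:
    """Pad ragged per-sample id lists to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch)
    padded = []
    for ids in batch:
        pad = [pad_value] * (max_len - len(ids))
        # Mirror the tokenizer's padding side so the ids stay aligned with input_ids.
        padded.append(ids + pad if padding_side == "right" else pad + ids)
    return padded


right = pad_token_type_ids([[1, 1, 0], [2]], padding_side="right")
left = pad_token_type_ids([[1, 1, 0], [2]], padding_side="left")
print(right)  # [[1, 1, 0], [2, 0, 0]]
print(left)   # [[1, 1, 0], [0, 0, 2]]
```

Once every row has the same length, the nested list can be converted to a rectangular tensor, which is what the collator does with `torch.tensor(padded, dtype=torch.long)`.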
--- a/src/llamafactory/data/mm_plugin.py
+++ b/src/llamafactory/data/mm_plugin.py
@@ -607,6 +607,194 @@ class Gemma3nPlugin(Gemma3Plugin):
        return messages


+@dataclass
+class Gemma4Plugin(BasePlugin):
+    r"""Plugin for the Gemma4 multimodal model."""
+
+    @override
+    def _regularize_videos(self, videos: list["VideoInput"], **kwargs) -> "RegularizedVideoOutput":
+        r"""Regularize videos, also tracking per-video FPS and frame indices for timestamp generation."""
+        results, fps_per_video, durations, frames_indices = [], [], [], []
+        for video in videos:
+            frames: list[ImageObject] = []
+            if _check_video_is_nested_images(video):
+                frames = video
+                fps_per_video.append(kwargs.get("video_fps", 2.0))
+                durations.append(len(frames) / kwargs.get("video_fps", 2.0))
+                frames_indices.append(list(range(len(frames))))
+            else:
+                container = av.open(video, "r")
+                video_stream = next(stream for stream in container.streams if stream.type == "video")
+                sample_indices = self._get_video_sample_indices(video_stream, **kwargs)
+                original_fps = float(video_stream.average_rate)
+                # rescale the sampled indices so that timestamps are computed correctly
+                frames_indices.append([idx / original_fps * kwargs.get("video_fps", 2.0) for idx in sample_indices])
+                container.seek(0)
+                for frame_idx, frame in enumerate(container.decode(video_stream)):
+                    if frame_idx in sample_indices:
+                        frames.append(frame.to_image())
+
+                if video_stream.duration is None:
+                    durations.append(len(frames) / kwargs.get("video_fps", 2.0))
+                else:
+                    durations.append(float(video_stream.duration * video_stream.time_base))
+
+            frames = self._regularize_images(frames, **kwargs)["images"]
+            results.append(frames)
+
+        return {"videos": results, "fps_per_video": fps_per_video, "durations": durations, "frames_indices": frames_indices}
+
+    @override
+    def _get_mm_inputs(
+        self,
+        images: list["ImageInput"],
+        videos: list["VideoInput"],
+        audios: list["AudioInput"],
+        processor: "MMProcessor",
+    ) -> dict[str, Union[list[int], "torch.Tensor"]]:
+        image_processor = getattr(processor, "image_processor", None)
+        video_processor = getattr(processor, "video_processor", None)
+        feature_extractor = getattr(processor, "feature_extractor", None)
+        mm_inputs = {}
+
+        if len(images) != 0:
+            regularized = self._regularize_images(
+                images,
+                image_max_pixels=getattr(processor, "image_max_pixels", 768 * 768),
+                image_min_pixels=getattr(processor, "image_min_pixels", 32 * 32),
+            )["images"]
+            mm_inputs.update(image_processor(regularized, return_tensors="pt"))
+
+        if len(videos) != 0:
+            video_data = self._regularize_videos(
+                videos,
+                image_max_pixels=getattr(processor, "video_max_pixels", 256 * 256),
+                image_min_pixels=getattr(processor, "video_min_pixels", 16 * 16),
+                video_fps=getattr(processor, "video_fps", 2.0),
+                video_maxlen=getattr(processor, "video_maxlen", 128),
+            )
+            video_metadata = [
+                {"fps": getattr(processor, "video_fps", 2.0), "duration": duration, "total_num_frames": len(video), "frames_indices": sample_indices}
+                for video, duration, sample_indices in zip(video_data["videos"], video_data["durations"], video_data["frames_indices"])
+            ]
+            mm_inputs.update(
+                video_processor(
+                    videos=video_data["videos"],
+                    video_metadata=video_metadata,
+                    return_tensors="pt",
+                    return_metadata=True,
+                    do_sample_frames=False,
+                )
+            )
+
+        if len(audios) != 0:  # only for gemma4n
+            audios = self._regularize_audios(
+                audios,
+                sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
+            )["audios"]
+
+            mm_inputs.update(
+                feature_extractor(
+                    audios,
+                    padding="max_length",
+                    return_tensors="pt",
+                )
+            )
+
+        return mm_inputs
+
+    @override
+    def process_messages(
+        self,
+        messages: list[dict[str, str]],
+        images: list["ImageInput"],
+        videos: list["VideoInput"],
+        audios: list["AudioInput"],
+        processor: Optional["MMProcessor"],
+    ) -> list[dict[str, str]]:
+        self._validate_input(processor, images, videos, audios)
+        self._validate_messages(messages, images, videos, audios)
+        messages = deepcopy(messages)
+
+        boi_token: str = getattr(processor, "boi_token")
+        eoi_token: str = getattr(processor, "eoi_token")
+        boa_token: str = getattr(processor, "boa_token")
+        eoa_token: str = getattr(processor, "eoa_token")
+        image_token: str = getattr(processor, "image_token")
+        video_token: str = getattr(processor, "video_token")
+        audio_token: str = getattr(processor, "audio_token")
+
+        if self.expand_mm_tokens:
+            mm_inputs = self._get_mm_inputs(images, videos, audios, processor)
+            num_image_soft_tokens: list[int] = list(
+                mm_inputs.get("num_soft_tokens_per_image", [getattr(processor, "image_seq_length", 256)] * len(images))
+            )
+            num_video_soft_tokens: list[int] = list(mm_inputs.get("num_soft_tokens_per_video", [1] * len(videos)))
+            video_metadata = mm_inputs.get("video_metadata", [])
+        else:
+            num_image_soft_tokens = [1] * len(images)
+            num_video_soft_tokens = [1] * len(videos)
+            video_metadata = [None] * len(videos)
+
+        audio_iter = iter(audios)
+        image_iter = iter(num_image_soft_tokens)
+        video_iter = iter(zip(num_video_soft_tokens, video_metadata))
+
+        for message in messages:
+            content = message["content"]
+
+            while IMAGE_PLACEHOLDER in content:
+                n = next(image_iter)
+                content = content.replace(IMAGE_PLACEHOLDER, f"{boi_token}{image_token * n}{eoi_token}", 1)
+
+            while VIDEO_PLACEHOLDER in content:
+                num_soft_tokens_per_frame, metadata = next(video_iter)
+                if self.expand_mm_tokens:
+                    timestamp_strs = [f"{int(t // 60):02d}:{int(t % 60):02d}" for t in metadata.timestamps]
+                    frame_strs = [f"{ts} {boi_token}{video_token * num_soft_tokens_per_frame}{eoi_token}" for ts in timestamp_strs]
+                    video_str = " ".join(frame_strs)
+                else:
+                    video_str = f"{boi_token}{video_token * num_soft_tokens_per_frame}{eoi_token}"
+                content = content.replace(VIDEO_PLACEHOLDER, video_str, 1)
+
+            while AUDIO_PLACEHOLDER in content:
+                current_audio = next(audio_iter)
+                if self.expand_mm_tokens:
+                    num_audio_tokens = processor._compute_audio_num_tokens(current_audio, processor.feature_extractor.sampling_rate)
+                    audio_str = f"{boa_token}{audio_token * num_audio_tokens}{eoa_token}"
+                else:
+                    audio_str = f"{boa_token}{audio_token}{eoa_token}"
+
+                content = content.replace(AUDIO_PLACEHOLDER, audio_str, 1)
+
+            message["content"] = content
+
+        return messages
+
+    @override
+    def get_mm_inputs(
+        self,
+        images: list["ImageInput"],
+        videos: list["VideoInput"],
+        audios: list["AudioInput"],
+        imglens: list[int],
+        vidlens: list[int],
+        audlens: list[int],
+        batch_ids: list[list[int]],
+        processor: Optional["MMProcessor"],
+    ) -> dict[str, Union[list[int], "torch.Tensor"]]:
+        self._validate_input(processor, images, videos, audios)
+        mm_inputs = self._get_mm_inputs(images, videos, audios, processor)
+        # Pop metadata keys that must not be passed to the model.
+        for key in ("num_soft_tokens_per_image", "num_soft_tokens_per_video", "video_metadata",
+                    "_gemma4_fps_per_video", "_gemma4_frames_indices", "_gemma4_num_audio_soft_tokens"):
+            mm_inputs.pop(key, None)
+
+        mm_inputs["mm_token_type_ids"] = processor.create_mm_token_type_ids(batch_ids)
+
+        return mm_inputs
+
+
@dataclass
 class InternVLPlugin(BasePlugin):
    @override
@@ -1489,10 +1677,11 @@ class Qwen2VLPlugin(BasePlugin):

    @override
    def _regularize_videos(self, videos: list["VideoInput"], **kwargs) -> "RegularizedVideoOutput":
-        results, fps_per_video, durations = [], [], []
+        results, fps_per_video, durations, frames_indices = [], [], [], []
        for video in videos:
            frames: list[ImageObject] = []
            if _check_video_is_nested_images(video):
+                # assume the frames were already sampled from the video
                for frame in video:
                    if not is_valid_image(frame) and not isinstance(frame, dict) and not os.path.exists(frame):
                        raise ValueError("Invalid image found in video frames.")
@@ -1500,10 +1689,14 @@ class Qwen2VLPlugin(BasePlugin):
                frames = video
                fps_per_video.append(kwargs.get("video_fps", 2.0))
                durations.append(len(frames) / kwargs.get("video_fps", 2.0))
+                frames_indices.append(list(range(len(frames))))
            else:
                container = av.open(video, "r")
                video_stream = next(stream for stream in container.streams if stream.type == "video")
                sample_indices = self._get_video_sample_indices(video_stream, **kwargs)
+                original_fps = float(video_stream.average_rate)
+                # for qwen3vl video timestamp calculation
+                frames_indices.append([idx / original_fps * kwargs.get("video_fps", 2.0) for idx in sample_indices]) # hack usage when do_sample_frames=False
                container.seek(0)
                for frame_idx, frame in enumerate(container.decode(video_stream)):
                    if frame_idx in sample_indices:
@@ -1522,7 +1715,7 @@ class Qwen2VLPlugin(BasePlugin):
            frames = self._regularize_images(frames, **kwargs)["images"]
            results.append(frames)

-        return {"videos": results, "fps_per_video": fps_per_video, "durations": durations}
+        return {"videos": results, "fps_per_video": fps_per_video, "durations": durations, "frames_indices": frames_indices}

    @override
    def _get_mm_inputs(
@@ -1637,8 +1830,8 @@ class Qwen3VLPlugin(Qwen2VLPlugin):
                video_maxlen=getattr(processor, "video_maxlen", 128),
            )
            video_metadata = [
-                {"fps": getattr(processor, "video_fps", 24.0), "duration": duration, "total_num_frames": len(video)}
-                for video, duration in zip(videos["videos"], videos["durations"])
+                {"fps": getattr(processor, "video_fps", 2.0), "duration": duration, "total_num_frames": len(video), "frames_indices": sample_indices}
+                for video, duration, sample_indices in zip(videos["videos"], videos["durations"], videos["frames_indices"])
            ]
            mm_inputs.update(
                video_processor(
@@ -1646,6 +1839,7 @@ class Qwen3VLPlugin(Qwen2VLPlugin):
                    video_metadata=video_metadata,
                    fps=getattr(processor, "video_fps", 2.0),
                    return_metadata=True,
+                    do_sample_frames=False, # avoid changing frames_indices
                )
            )
            temporal_patch_size: int = getattr(image_processor, "temporal_patch_size", 2)
@@ -1677,7 +1871,7 @@ class Qwen3VLPlugin(Qwen2VLPlugin):
            image_grid_thw = mm_inputs.get("image_grid_thw", [])
            video_grid_thw = mm_inputs.get("video_grid_thw", [])
            num_frames = video_grid_thw[0][0] if len(video_grid_thw) > 0 else 0  # hard code for now
-            video_metadata = mm_inputs.get("video_metadata", {})
+            video_metadata = mm_inputs.get("video_metadata", [])

        else:
            image_grid_thw = [None] * len(images)
@@ -2200,8 +2394,9 @@ PLUGINS = {
    "base": BasePlugin,
    "ernie_vl": ErnieVLPlugin,
    "gemma3": Gemma3Plugin,
-    "glm4v": GLM4VPlugin,
    "gemma3n": Gemma3nPlugin,
+    "gemma4": Gemma4Plugin,
+    "glm4v": GLM4VPlugin,
    "intern_vl": InternVLPlugin,
    "kimi_vl": KimiVLPlugin,
    "llama4": Llama4Plugin,
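Reviewer note: the `frames_indices` rescaling and the `MM:SS` timestamp strings in `Gemma4Plugin` can be illustrated with the sketch below. It assumes, without confirmation from this diff, that the video processor derives each timestamp as `index / fps` from the metadata; the helper names `rescale_indices` and `format_timestamp` are hypothetical.

```python
def rescale_indices(sample_indices: list[int], original_fps: float, video_fps: float = 2.0) -> list[float]:
    """Express sampled frame indices on the target-fps timeline, as the plugin does."""
    return [idx / original_fps * video_fps for idx in sample_indices]


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, matching the f-string used in process_messages."""
    return f"{int(seconds // 60):02d}:{int(seconds % 60):02d}"


# Frames 0, 15, 30 and 1800 sampled from a 30 fps video.
rescaled = rescale_indices([0, 15, 30, 1800], original_fps=30.0, video_fps=2.0)

# Assumption: timestamp = rescaled index / declared fps (2.0 here).
seconds = [index / 2.0 for index in rescaled]
labels = [format_timestamp(s) for s in seconds]

print(rescaled)  # [0.0, 1.0, 2.0, 120.0]
print(labels)    # ['00:00', '00:00', '00:01', '01:00']
```

Under that assumption, dividing the rescaled index by the declared `fps` recovers the frame's position in seconds in the original video, which is why the metadata `fps` and the rescaling factor both use `video_fps`.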
--- a/src/llamafactory/data/template.py
+++ b/src/llamafactory/data/template.py
@@ -997,6 +997,55 @@ register_template(
 )


+register_template(
+    name="gemma4",
+    format_user=StringFormatter(slots=["<|turn>user\n{{content}}<turn|>\n<|turn>model\n"]),
+    format_assistant=StringFormatter(slots=["{{content}}<turn|>\n"]),
+    format_system=StringFormatter(slots=["<|turn>system\n<|think|>{{content}}<turn|>\n"]),  # thinking signal included by default
+    format_observation=StringFormatter(
+        slots=["<|turn>tool\n{{content}}<turn|>\n<|turn>model\n"]
+    ),  # seems inconsistent with the chat template
+    format_tools=ToolFormatter(tool_format="gemma4"),
+    format_function=FunctionFormatter(slots=["<|tool>{{content}}<tool|>"], tool_format="gemma4"),
+    format_prefix=EmptyFormatter(slots=[{"bos_token"}]),
+    stop_words=["<turn|>"],
+    default_system="You are a helpful assistant.", # important for thinking
+    thought_words=("<|channel>thought\n", "<channel|>"),
+    replace_eos=True,
+    mm_plugin=get_mm_plugin(
+        "gemma4",
+        image_token="<|image|>",
+        video_token="<|video|>",
+    ),
+    template_class=ReasoningTemplate,
+)
+
+
+register_template(
+    name="gemma4n",
+    format_user=StringFormatter(slots=["<|turn>user\n{{content}}<turn|>\n<|turn>model\n"]),
+    format_assistant=StringFormatter(slots=["{{content}}<turn|>\n"]),
+    format_system=StringFormatter(slots=["<|turn>system\n<|think|>{{content}}<turn|>\n"]),  # thinking signal included by default
+    format_observation=StringFormatter(
+        slots=["<|turn>tool\n{{content}}<turn|>\n<|turn>model\n"]
+    ),
+    format_tools=ToolFormatter(tool_format="gemma4"),
+    format_function=FunctionFormatter(slots=["<|tool>{{content}}<tool|>"], tool_format="gemma4"),
+    format_prefix=EmptyFormatter(slots=[{"bos_token"}]),
+    stop_words=["<turn|>"],
+    default_system="You are a helpful assistant.", # important for thinking
+    thought_words=("<|channel>thought\n", "<channel|>"),
+    replace_eos=True,
+    mm_plugin=get_mm_plugin(
+        "gemma4",
+        image_token="<|image|>",
+        video_token="<|video|>",
+        audio_token="<|audio|>",
+    ),
+    template_class=ReasoningTemplate,
+)
+
+
 register_template(
    name="glm4",
    format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]),
--- a/src/llamafactory/data/tool_utils.py
+++ b/src/llamafactory/data/tool_utils.py
@@ -209,6 +209,164 @@ class DefaultToolUtils(ToolUtils):

        return results

+class Gemma4ToolUtils(ToolUtils):
+    r"""Gemma-4 tool using template."""
+
+    @override
+    @staticmethod
+    def tool_formatter(tools: list[dict[str, Any]]) -> str:
+        def _format_parameters(properties: dict[str, Any]) -> str:
+            parts: list[str] = []
+            for name, schema in properties.items():
+                item_parts: list[str] = []
+                if schema.get("description"):
+                    item_parts.append(f'description:<|"|>{schema["description"]}<|"|>')
+                if schema.get("type"):
+                    item_parts.append(f'type:<|"|>{str(schema["type"]).upper()}<|"|>')
+                parts.append(f"{name}:{{{','.join(item_parts)}}}")
+
+            return ",".join(parts)
+
+        declarations: list[str] = []
+        for tool in tools:
+            function_data = tool.get("function", tool) if tool.get("type") == "function" else tool
+            declaration = (
+                f"declaration:{function_data['name']}"
+                + "{"
+                + f'description:<|"|>{function_data.get("description", "")}<|"|>'
+            )
+
+            params = function_data.get("parameters")
+            if params:
+                param_parts: list[str] = []
+                if params.get("properties"):
+                    param_parts.append(f"properties:{{{_format_parameters(params['properties'])}}}")
+
+                if params.get("required"):
+                    required_text = ",".join(f'<|"|>{item}<|"|>' for item in params["required"])
+                    param_parts.append(f"required:[{required_text}]")
+
+                if params.get("type"):
+                    param_parts.append(f'type:<|"|>{str(params["type"]).upper()}<|"|>')
+
+                declaration += f",parameters:{{{','.join(param_parts)}}}"
+
+            response_declaration = function_data.get("response")
+            if response_declaration:
+                response_parts: list[str] = []
+                if response_declaration.get("description"):
+                    response_parts.append(f'description:<|"|>{response_declaration["description"]}<|"|>')
+
+                response_type = str(response_declaration.get("type", "")).upper()
+
+                if response_type == "OBJECT":
+                    response_parts.append(f'type:<|"|>{response_type}<|"|>')
+
+                declaration += f",response:{{{','.join(response_parts)}}}"
+
+            declarations.append(declaration + "}")
+
+        return "\n".join(declarations)
+
+    @override
+    @staticmethod
+    def tool_extractor(content: str) -> Union[str, list["FunctionCall"]]:
+        regex = re.compile(r"<\|tool_call\>call:([^{\s]+)\{(.*?)\}<tool_call\|>", re.DOTALL)
+        matches = re.findall(regex, content)
+        if not matches:
+            return content
+
+        def _parse_arguments(arg_text: str) -> Any:
+            text = arg_text.strip()
+            if not text:
+                return {}
+
+            # `function_formatter` writes dict arguments as `k:v,...` inside `{...}`.
+            # The extractor captures only the inner text, so re-wrap it to parse as JSON object.
+            object_like_text = "{" + text + "}"
+            # Convert Gemma string markers (<|"|>value<|"|>) to valid JSON strings.
+            normalized = re.sub(
+                r"<\|\"\|\>(.*?)<\|\"\|\>",
+                lambda m: json.dumps(m.group(1), ensure_ascii=False),
+                object_like_text,
+                flags=re.DOTALL,
+            )
+            # Quote unquoted object keys so the payload can be parsed by json.loads.
+            normalized = re.sub(r'(^|[{\s,])([A-Za-z_][A-Za-z0-9_]*)(\s*:)', r'\1"\2"\3', normalized)
+            try:
+                return json.loads(normalized)
+            except json.JSONDecodeError:
+                pass
+
+            try:
+                return json.loads(text)
+            except json.JSONDecodeError:
+                return text
+
+        results: list[FunctionCall] = []
+        for name, arg_block in matches:
+            parsed_arguments = _parse_arguments(arg_block)
+            if isinstance(parsed_arguments, str):
+                arguments = parsed_arguments
+            else:
+                arguments = json.dumps(parsed_arguments, ensure_ascii=False)
+            results.append(FunctionCall(name.strip(), arguments))
+
+        return results
+
+    @override
+    @staticmethod
+    def function_formatter(functions: list["FunctionCall"]) -> str:
+        def _format_argument(argument: Any, escape_keys: bool = True) -> str:
+            if isinstance(argument, str):
+                return f'<|"|>{argument}<|"|>'
+
+            if isinstance(argument, bool):
+                return "true" if argument else "false"
+
+            if isinstance(argument, dict):
+                items: list[str] = []
+                for key in sorted(argument.keys()):
+                    formatted_key = f'<|"|>{key}<|"|>' if escape_keys else str(key)
+                    formatted_value = _format_argument(argument[key], escape_keys=escape_keys)
+                    items.append(f"{formatted_key}:{formatted_value}")
+                return "{" + ",".join(items) + "}"
+
+            if isinstance(argument, (list, tuple)):
+                return "[" + ",".join(_format_argument(item, escape_keys=escape_keys) for item in argument) + "]"
+
+            if argument is None:
+                return "null"
+
+            return str(argument)
+
+        function_texts: list[str] = []
+        for function in functions:
+            name = function.name
+            raw_arguments = function.arguments
+
+            try:
+                parsed_arguments = json.loads(raw_arguments)
+            except (TypeError, json.JSONDecodeError):
+                parsed_arguments = raw_arguments
+
+            call_text = f"<|tool_call>call:{name}" + "{"
+            if isinstance(parsed_arguments, dict):
+                args_text = []
+                for key in sorted(parsed_arguments.keys()):
+                    value_text = _format_argument(parsed_arguments[key], escape_keys=False)
+                    args_text.append(f"{key}:{value_text}")
+
+                call_text += ",".join(args_text)
+            elif isinstance(parsed_arguments, str):
+                call_text += parsed_arguments
+            else:
+                call_text += _format_argument(parsed_arguments, escape_keys=False)
+
+            call_text += "}<tool_call|>"
+            function_texts.append(call_text)
+
+        return "".join(function_texts)

 class GLM4ToolUtils(ToolUtils):
    r"""GLM-4 tool using template."""
@@ -723,6 +881,7 @@ class LFM2ToolUtils(ToolUtils):

 TOOLS = {
    "default": DefaultToolUtils(),
+    "gemma4": Gemma4ToolUtils(),
    "glm4": GLM4ToolUtils(),
    "llama3": Llama3ToolUtils(),
    "lfm2": LFM2ToolUtils(),
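Reviewer note: the Gemma-4 tool-call text format handled by `Gemma4ToolUtils` can be exercised with the simplified round trip below. It is a sketch that covers flat arguments only (strings, booleans, `None`, numbers); the function names `format_call` and `parse_calls` are hypothetical and are not the methods added in this diff.

```python
import json
import re

QUOTE = '<|"|>'  # Gemma-4 string delimiter used in place of a plain double quote
CALL_RE = re.compile(r"<\|tool_call>call:([^{\s]+)\{(.*?)\}<tool_call\|>", re.DOTALL)
STRING_RE = re.compile(re.escape(QUOTE) + "(.*?)" + re.escape(QUOTE), re.DOTALL)
KEY_RE = re.compile(r"(^|[{\s,])([A-Za-z_][A-Za-z0-9_]*)(\s*:)")


def format_value(value):
    if isinstance(value, str):
        return f"{QUOTE}{value}{QUOTE}"
    if isinstance(value, bool):  # check bool before falling through to str()
        return "true" if value else "false"
    if value is None:
        return "null"
    return str(value)


def format_call(name, arguments):
    """Serialize one call as <|tool_call>call:NAME{k:v,...}<tool_call|> with sorted keys."""
    body = ",".join(f"{key}:{format_value(arguments[key])}" for key in sorted(arguments))
    return f"<|tool_call>call:{name}{{{body}}}<tool_call|>"


def parse_calls(content):
    """Recover (name, arguments) pairs by rewriting the body into JSON."""
    calls = []
    for name, body in CALL_RE.findall(content):
        # Re-wrap the captured inner text, convert string markers, then quote bare keys.
        text = STRING_RE.sub(lambda m: json.dumps(m.group(1), ensure_ascii=False), "{" + body + "}")
        text = KEY_RE.sub(r'\1"\2"\3', text)
        calls.append((name, json.loads(text)))
    return calls


text = format_call("get_weather", {"city": "Paris", "days": 3, "metric": True})
print(text)
print(parse_calls(text))
```

One thing to watch in review: the key-quoting regex runs over the whole payload, so a string value that itself contains something like `, a:b` would also be rewritten and the JSON parse would fail. The extractor in the diff falls back to returning the raw text in that case.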
--- a/src/llamafactory/extras/constants.py
+++ b/src/llamafactory/extras/constants.py
@@ -865,6 +865,34 @@ register_model_group(
 )


+register_model_group(
+    models={
+        "Gemma-4-26B-A4B-Thinking": {
+            DownloadSource.DEFAULT: "google/gemma-4-26B-A4B-it",
+        },
+        "Gemma-4-31B-Thinking": {
+            DownloadSource.DEFAULT: "google/gemma-4-31B-it",
+        },
+    },
+    template="gemma4",
+    multimodal=True,
+)
+
+
+register_model_group(
+    models={
+        "Gemma-4-E2B-Thinking": {
+            DownloadSource.DEFAULT: "google/gemma-4-E2B-it",
+        },
+        "Gemma-4-E4B-Thinking": {
+            DownloadSource.DEFAULT: "google/gemma-4-E4B-it",
+        },
+    },
+    template="gemma4n",
+    multimodal=True,
+)
+
+
 register_model_group(
    models={
        "GLM-4-9B": {
--- a/src/llamafactory/extras/packages.py
+++ b/src/llamafactory/extras/packages.py
@@ -70,6 +70,10 @@ def is_matplotlib_available():
    return _is_package_available("matplotlib")


+def is_hyper_parallel_available():
+    return _is_package_available("hyper_parallel")
+
+
 def is_mcore_adapter_available():
    return _is_package_available("mcore_adapter")

--- a/src/llamafactory/hparams/finetuning_args.py
+++ b/src/llamafactory/hparams/finetuning_args.py
@@ -482,6 +482,24 @@ class FinetuningArguments(
            )
        },
    )
+    use_hyper_parallel: bool = field(
+        default=False,
+        metadata={
+            "help": (
+                "Whether or not to use HyperParallel distributed training backend (FSDP/TP). "
+                "Only supported for the 'sft' stage with full fine-tuning."
+            )
+        },
+    )
+    hyper_parallel_args: str | None = field(
+        default=None,
+        metadata={
+            "help": (
+                "Path to a JSON file containing HyperParallel strategy arguments "
+                "(e.g., tp_size, param_dtype). Used when use_hyper_parallel=True."
+            )
+        },
+    )
    use_muon: bool = field(
        default=False,
        metadata={"help": "Whether or not to use the Muon optimizer."},
--- a/src/llamafactory/model/adapter.py
+++ b/src/llamafactory/model/adapter.py
@@ -125,7 +125,7 @@ def _setup_freeze_tuning(

    model_type = getattr(model.config, "model_type", None)
    if not finetuning_args.freeze_multi_modal_projector and model_type in COMPOSITE_MODELS:
-        trainable_layers.append(COMPOSITE_MODELS[model_type].projector_key)
+        trainable_layers.extend(COMPOSITE_MODELS[model_type].projector_keys)

    forbidden_modules = get_forbidden_modules(model.config, finetuning_args)
    for name, param in model.named_parameters():
--- a/src/llamafactory/model/model_utils/liger_kernel.py
+++ b/src/llamafactory/model/model_utils/liger_kernel.py
@@ -45,7 +45,7 @@ def apply_liger_kernel(
        from liger_kernel.transformers import apply_liger_kernel_to_gemma3 as apply_liger_kernel
    elif model_type == "gemma3_text":
        from liger_kernel.transformers import apply_liger_kernel_to_gemma3_text as apply_liger_kernel
-    elif model_type == "glm4":
+    elif model_type in ["glm", "glm4"]: # for glm4-9b, glm4-32B respectively
        from liger_kernel.transformers import apply_liger_kernel_to_glm4 as apply_liger_kernel
    elif model_type == "glm4v":
        from liger_kernel.transformers import apply_liger_kernel_to_glm4v as apply_liger_kernel
--- a/src/llamafactory/model/model_utils/misc.py
+++ b/src/llamafactory/model/model_utils/misc.py
@@ -35,7 +35,7 @@ def find_all_linear_modules(model: "PreTrainedModel", freeze_vision_tower: bool)
        forbidden_modules.add("output")

    if model_type in COMPOSITE_MODELS:
-        forbidden_modules.add(COMPOSITE_MODELS[model_type].projector_key)
+        forbidden_modules.update(COMPOSITE_MODELS[model_type].projector_keys)

    if freeze_vision_tower and model_type in COMPOSITE_MODELS:
        forbidden_modules.update(COMPOSITE_MODELS[model_type].vision_model_keys)
--- a/src/llamafactory/model/model_utils/moe.py
+++ b/src/llamafactory/model/model_utils/moe.py
@@ -147,6 +147,11 @@ def add_z3_leaf_module(model: "PreTrainedModel") -> None:

        _set_z3_leaf_modules(model, [Qwen3NextSparseMoeBlock])

+    if model_type == "qwen3_5_moe":
+        from transformers.models.qwen3_5_moe.modeling_qwen3_5_moe import Qwen3_5MoeSparseMoeBlock
+
+        _set_z3_leaf_modules(model, [Qwen3_5MoeSparseMoeBlock])
+

 def configure_moe(config: "PretrainedConfig", model_args: "ModelArguments", is_trainable: bool) -> None:
    if not is_trainable or not model_args.moe_aux_loss_coef:
--- a/src/llamafactory/model/model_utils/visual.py
+++ b/src/llamafactory/model/model_utils/visual.py
@@ -39,16 +39,26 @@ transformers_logger = transformers.utils.logging.get_logger(__name__)
@dataclass
 class CompositeModel:
    model_type: str
-    projector_key: str
+    projector_keys: list[str]
    vision_model_keys: list[str]
    language_model_keys: list[str]
    lora_conflict_keys: list[str]

-    def get_projector(self, module: "torch.nn.Module") -> "torch.nn.Module":
-        for key in self.projector_key.split("."):
-            module = getattr(module, key)

-        return module
+    def get_projectors(self, module: "torch.nn.Module") -> list["torch.nn.Module"]:
+        mm_projectors: list[torch.nn.Module] = []
+        for projector_key in self.projector_keys:
+            project_module = module
+            for key in projector_key.split("."):
+                project_module = getattr(project_module, key, None)
+                if project_module is None:  # e.g. the larger gemma4 models have no embed_audio
+                    logger.warning_rank0(f"Projector key {projector_key} not found in module {module.__class__.__name__}.")
+                    break
+
+            if project_module is not None:
+                mm_projectors.append(project_module)
+
+        return mm_projectors


 COMPOSITE_MODELS: dict[str, "CompositeModel"] = {}
@@ -56,7 +66,7 @@ COMPOSITE_MODELS: dict[str, "CompositeModel"] = {}

 def _register_composite_model(
    model_type: str,
-    projector_key: Optional[str] = None,
+    projector_keys: list[str] | None = None,
    vision_model_keys: Optional[list[str]] = None,
    language_model_keys: Optional[list[str]] = None,
    lora_conflict_keys: Optional[list[str]] = None,
@@ -65,7 +75,7 @@ def _register_composite_model(

    Args:
        model_type: model type
-        projector_key: multi_modal_projector
+        projector_keys: multi_modal_projector
        vision_model_keys: vision_tower
        language_model_keys: language_model
        lora_conflict_keys: None
@@ -73,7 +83,7 @@ def _register_composite_model(
    """
    COMPOSITE_MODELS[model_type] = CompositeModel(
        model_type=model_type,
-        projector_key=projector_key or "multi_modal_projector",
+        projector_keys=projector_keys or ["multi_modal_projector"],
        vision_model_keys=vision_model_keys or ["vision_tower"],
        language_model_keys=language_model_keys or ["language_model", "lm_head"],
        lora_conflict_keys=lora_conflict_keys or [],
@@ -136,12 +146,16 @@ def autocast_projector_dtype(model: "PreTrainedModel", model_args: "ModelArgumen
    if getattr(model, "quantization_method", None):
        model_type = getattr(model.config, "model_type", None)
        if model_type in COMPOSITE_MODELS:
-            mm_projector = COMPOSITE_MODELS[model_type].get_projector(model)
+            mm_projectors = COMPOSITE_MODELS[model_type].get_projectors(model)
        else:
            return

-        logger.info_rank0(f"Casting multimodal projector outputs in {model_args.compute_dtype}.")
-        mm_projector.register_forward_hook(_mm_projector_forward_post_hook)
+        logger.info_rank0(
+            f"Casting multimodal projector outputs in {model_args.compute_dtype}: "
+            f"{COMPOSITE_MODELS[model_type].projector_keys}."
+        )
+        for mm_projector in mm_projectors:
+            mm_projector.register_forward_hook(_mm_projector_forward_post_hook)


 def configure_visual_model(config: "PretrainedConfig") -> None:
@@ -166,9 +180,9 @@ def get_forbidden_modules(config: "PretrainedConfig", finetuning_args: "Finetuni
            forbidden_modules.update(vision_model_keys)

        if finetuning_args.freeze_multi_modal_projector:
-            projector_key = COMPOSITE_MODELS[model_type].projector_key
-            logger.info_rank0(f"Set multi model projector not trainable: {projector_key}.")
-            forbidden_modules.add(projector_key)
+            projector_keys = COMPOSITE_MODELS[model_type].projector_keys
+            logger.info_rank0(f"Set multimodal projector not trainable: {projector_keys}.")
+            forbidden_modules.update(projector_keys)

        if finetuning_args.freeze_language_model:
            language_model_keys = COMPOSITE_MODELS[model_type].language_model_keys
@@ -200,7 +214,7 @@ def patch_target_modules(

 _register_composite_model(
    model_type="dots_ocr",
-    projector_key="vision_tower.merger",
+    projector_keys=["vision_tower.merger"],
    vision_model_keys=["vision_tower"],
    language_model_keys=["model", "lm_head"],
    lora_conflict_keys=["merger"],
@@ -219,10 +233,18 @@ _register_composite_model(
 )


+_register_composite_model(
+    model_type="gemma4",
+    projector_keys=["model.embed_vision", "model.embed_audio"],
+    vision_model_keys=["vision_tower", "audio_tower"],
+    lora_conflict_keys=["per_layer_projection_norm"],
+)
+
+
 # copied from qwen2vl
 _register_composite_model(
    model_type="glm4v",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger"],
    vision_model_keys=["visual.patch_embed", "visual.blocks"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -231,7 +253,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="glm4v_moe",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger"],
    vision_model_keys=["visual.patch_embed", "visual.blocks"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -240,7 +262,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="glm_ocr",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger"],
    vision_model_keys=["visual.patch_embed", "visual.blocks"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -257,7 +279,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="Keye",
-    projector_key="mlp_AR",
+    projector_keys=["mlp_AR"],
    vision_model_keys=["visual.vision_model.patch_embedding", "visual.vision_model.encoder"],
    language_model_keys=["model", "lm_head"],
    lora_conflict_keys=["patch_embedding"],
@@ -292,7 +314,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="minicpmv",
-    projector_key="resampler",
+    projector_keys=["resampler"],
    vision_model_keys=["vpm"],
    language_model_keys=["llm"],
 )
@@ -300,7 +322,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="minicpmo",
-    projector_key="resampler",
+    projector_keys=["resampler"],
    vision_model_keys=["vpm", "apm", "audio_avg_pooler", "audio_projection_layer", "tts"],
    language_model_keys=["llm"],
    lora_conflict_keys=["audio_projection_layer"],
@@ -309,7 +331,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="mistral3",
-    projector_key="model.multi_modal_projector",
+    projector_keys=["model.multi_modal_projector"],
 )


@@ -332,7 +354,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen2_5_omni_thinker",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger", "audio_tower.proj"],
    vision_model_keys=["visual.patch_embed", "visual.blocks", "audio_tower"],
    language_model_keys=["model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -341,7 +363,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen2_vl",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger"],
    vision_model_keys=["visual.patch_embed", "visual.blocks"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -350,7 +372,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen2_5_vl",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger"],
    vision_model_keys=["visual.patch_embed", "visual.blocks"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -359,7 +381,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen3_vl",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger"],
    vision_model_keys=["visual.pos_embed", "visual.patch_embed", "visual.blocks", "visual.deepstack_merger_list"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -368,7 +390,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen3_vl_moe",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger"],
    vision_model_keys=["visual.pos_embed", "visual.patch_embed", "visual.blocks", "visual.deepstack_merger_list"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -377,7 +399,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen3_omni_moe_thinker",
-    projector_key="visual.merger",
+    projector_keys=["visual.merger", "audio_tower.proj"],
    vision_model_keys=[
        "visual.pos_embed",
        "visual.patch_embed",
@@ -392,7 +414,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen3_5",
-    projector_key="model.visual.merger",
+    projector_keys=["model.visual.merger"],
    vision_model_keys=["visual.pos_embed", "visual.patch_embed", "visual.blocks"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
@@ -401,7 +423,7 @@ _register_composite_model(

 _register_composite_model(
    model_type="qwen3_5_moe",
-    projector_key="model.visual.merger",
+    projector_keys=["model.visual.merger"],
    vision_model_keys=["visual.pos_embed", "visual.patch_embed", "visual.blocks"],
    language_model_keys=["language_model", "lm_head"],
    lora_conflict_keys=["patch_embed"],
--- a/src/llamafactory/train/hyper_parallel/__init__.py
+++ b/src/llamafactory/train/hyper_parallel/__init__.py
@@ -0,0 +1,18 @@
+# Copyright 2025 the LlamaFactory team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .workflow import run_sft
+
+
+__all__ = ["run_sft"]
--- a/src/llamafactory/train/hyper_parallel/workflow.py
+++ b/src/llamafactory/train/hyper_parallel/workflow.py
@@ -0,0 +1,183 @@
+# Copyright 2025 the LlamaFactory team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import TYPE_CHECKING, Optional
+
+from ...data import SFTDataCollatorWith4DAttentionMask, get_dataset, get_template_and_fix_tokenizer
+from ...extras.constants import IGNORE_INDEX
+from ...extras.logging import get_logger
+from ...extras.misc import calculate_tps
+from ...extras.packages import is_hyper_parallel_available, is_transformers_version_greater_than
+from ...extras.ploting import plot_loss
+from ...model import load_model, load_tokenizer
+from ..callbacks import SaveProcessorCallback
+from ..sft.metric import ComputeAccuracy, ComputeSimilarity, eval_logit_processor
+from ..trainer_utils import asft_loss_func, create_modelcard_and_push, create_ref_model, dft_loss_func, eaft_loss_func
+
+
+if TYPE_CHECKING:
+    from transformers import Seq2SeqTrainingArguments, TrainerCallback
+
+    from ...hparams import DataArguments, FinetuningArguments, GeneratingArguments, ModelArguments
+
+
+logger = get_logger(__name__)
+
+
+def run_sft(
+    model_args: "ModelArguments",
+    data_args: "DataArguments",
+    training_args: "Seq2SeqTrainingArguments",
+    finetuning_args: "FinetuningArguments",
+    generating_args: "GeneratingArguments",
+    callbacks: Optional[list["TrainerCallback"]] = None,
+):
+    if not is_hyper_parallel_available():
+        raise ImportError(
+            "hyper_parallel is not installed. Please install it with `pip install hyper_parallel`."
+        )
+
+    from hyper_parallel.integration.llamafactory import (  # pylint: disable=C0415
+        HyperParallelArguments,
+        HyperParallelTrainer,
+    )
+
+    tokenizer_module = load_tokenizer(model_args)
+    tokenizer = tokenizer_module["tokenizer"]
+    template = get_template_and_fix_tokenizer(tokenizer, data_args)
+    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
+    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
+
+    ref_model = None
+    if finetuning_args.use_asft_loss:
+        ref_model = create_ref_model(model_args, finetuning_args)
+
+    data_collator = SFTDataCollatorWith4DAttentionMask(
+        template=template,
+        model=model if not training_args.predict_with_generate else None,
+        pad_to_multiple_of=8 if training_args.do_train else None,
+        label_pad_token_id=IGNORE_INDEX if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id,
+        block_diag_attn=model_args.block_diag_attn,
+        attn_implementation=getattr(model.config, "_attn_implementation", None),
+        compute_dtype=model_args.compute_dtype,
+        **tokenizer_module,
+    )
+
+    # Metric utils
+    metric_module = {}
+    if training_args.predict_with_generate:
+        metric_module["compute_metrics"] = ComputeSimilarity(tokenizer=tokenizer)
+    elif finetuning_args.compute_accuracy:
+        metric_module["compute_metrics"] = ComputeAccuracy()
+        metric_module["preprocess_logits_for_metrics"] = eval_logit_processor
+
+    # Keyword arguments for `model.generate`
+    gen_kwargs = generating_args.to_dict(obey_generation_config=True)
+    if is_transformers_version_greater_than("4.58.0"):
+        extra_ids = getattr(tokenizer, "additional_special_tokens_ids", None)
+        if not isinstance(extra_ids, list):
+            extra_special_tokens = getattr(tokenizer, "_extra_special_tokens", [])
+            string_tokens = [str(t) for t in extra_special_tokens]
+            extra_ids = tokenizer.convert_tokens_to_ids(string_tokens)
+        all_eos_ids = [tokenizer.eos_token_id] + [i for i in extra_ids if i != -1]
+        gen_kwargs["eos_token_id"] = list(dict.fromkeys(all_eos_ids))
+    else:
+        gen_kwargs["eos_token_id"] = [tokenizer.eos_token_id] + tokenizer.additional_special_tokens_ids
+    gen_kwargs["pad_token_id"] = tokenizer.pad_token_id
+
+    hp_args = HyperParallelArguments.from_finetuning_args(finetuning_args)
+
+    callbacks = list(callbacks or [])
+    processor = tokenizer_module.get("processor")
+    if processor is not None:
+        callbacks.append(SaveProcessorCallback(processor))
+
+    compute_loss_func = None
+    if finetuning_args.use_dft_loss:
+        compute_loss_func = dft_loss_func
+    elif finetuning_args.use_eaft_loss:
+        compute_loss_func = lambda outputs, labels, num_items_in_batch=None: eaft_loss_func(  # noqa: E731
+            outputs, labels, num_items_in_batch, finetuning_args.eaft_alpha
+        )
+    elif finetuning_args.use_asft_loss:
+        from functools import partial
+
+        compute_loss_func = partial(asft_loss_func, asft_alpha=finetuning_args.asft_alpha)
+
+    trainer = HyperParallelTrainer(
+        hp_args=hp_args,
+        model=model,
+        args=training_args,
+        finetuning_args=finetuning_args,
+        data_collator=data_collator,
+        callbacks=callbacks,
+        gen_kwargs=gen_kwargs,
+        ref_model=ref_model,
+        compute_loss_func=compute_loss_func,
+        **dataset_module,
+        **tokenizer_module,
+        **metric_module,
+    )
+
+    if finetuning_args.use_badam:
+        from types import MethodType
+
+        from badam import BAdamCallback, clip_grad_norm_old_version  # type: ignore[import]
+
+        trainer.accelerator.clip_grad_norm_ = MethodType(clip_grad_norm_old_version, trainer.accelerator)
+        trainer.add_callback(BAdamCallback)
+
+    # Training
+    if training_args.do_train:
+        train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
+        trainer.save_model()
+        if finetuning_args.include_effective_tokens_per_second:
+            train_result.metrics["effective_tokens_per_sec"] = calculate_tps(
+                dataset_module["train_dataset"], train_result.metrics, stage="sft"
+            )
+
+        trainer.log_metrics("train", train_result.metrics)
+        trainer.save_metrics("train", train_result.metrics)
+        trainer.save_state()
+        if trainer.is_world_process_zero() and finetuning_args.plot_loss:
+            keys = ["loss"]
+            if isinstance(dataset_module.get("eval_dataset"), dict):
+                keys += sum(
+                    [[f"eval_{key}_loss", f"eval_{key}_accuracy"] for key in dataset_module["eval_dataset"].keys()],
+                    [],
+                )
+            else:
+                keys += ["eval_loss", "eval_accuracy"]
+
+            plot_loss(training_args.output_dir, keys=keys)
+
+    if training_args.predict_with_generate:
+        tokenizer.padding_side = "left"
+
+    # Evaluation
+    if training_args.do_eval:
+        metrics = trainer.evaluate(metric_key_prefix="eval", **gen_kwargs)
+        trainer.log_metrics("eval", metrics)
+        trainer.save_metrics("eval", metrics)
+
+    # Predict
+    if training_args.do_predict:
+        logger.warning_rank0_once("Batch generation can be very slow. Consider using `scripts/vllm_infer.py` instead.")
+        predict_results = trainer.predict(dataset_module["eval_dataset"], metric_key_prefix="predict", **gen_kwargs)
+        trainer.log_metrics("predict", predict_results.metrics)
+        trainer.save_metrics("predict", predict_results.metrics)
+        trainer.save_predictions(dataset_module["eval_dataset"], predict_results, generating_args.skip_special_tokens)
+
+    # Create model card
+    create_modelcard_and_push(trainer, model_args, data_args, training_args, finetuning_args)
--- a/src/llamafactory/train/tuner.py
+++ b/src/llamafactory/train/tuner.py
@@ -24,7 +24,12 @@ from ..data import get_template_and_fix_tokenizer
 from ..extras import logging
 from ..extras.constants import V_HEAD_SAFE_WEIGHTS_NAME, V_HEAD_WEIGHTS_NAME
 from ..extras.misc import find_available_port, get_device_name, get_torch_device, infer_optim_dtype
-from ..extras.packages import is_mcore_adapter_available, is_ray_available, is_transformers_version_greater_than
+from ..extras.packages import (
+    is_hyper_parallel_available,
+    is_mcore_adapter_available,
+    is_ray_available,
+    is_transformers_version_greater_than,
+)
 from ..hparams import RayArguments, get_infer_args, get_ray_args, get_train_args, read_args
 from ..model import load_model, load_tokenizer
 from .callbacks import LogCallback, PissaConvertCallback, ReporterCallback
@@ -71,7 +76,16 @@ def _training_function(config: dict[str, Any]) -> None:

    callbacks.append(ReporterCallback(model_args, data_args, finetuning_args, generating_args))  # add to last

-    if finetuning_args.stage in ["pt", "sft", "dpo"] and finetuning_args.use_mca:
+    if finetuning_args.stage == "sft" and finetuning_args.use_hyper_parallel:
+        if not is_hyper_parallel_available():
+            raise ImportError(
+                "hyper_parallel is not installed. Please install it with `pip install hyper_parallel`."
+            )
+        from .hyper_parallel import run_sft as run_sft_hp
+
+        run_sft_hp(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
+
+    elif finetuning_args.stage in ["pt", "sft", "dpo"] and finetuning_args.use_mca:
        if not is_mcore_adapter_available():
            raise ImportError("mcore_adapter is not installed. Please install it with `pip install mcore-adapter`.")
        if finetuning_args.stage == "pt":
--- a/tests/data/test_mm_plugin.py
+++ b/tests/data/test_mm_plugin.py
@@ -57,7 +57,7 @@ TEXT_MESSAGES = [
 ]

 VIDEO_MESSAGES = [
-    {"role": "user", "content": "<video>What is in this viode?"},
+    {"role": "user", "content": "<video>What is in this video?"},
    {"role": "assistant", "content": "A cat."},
 ]

@@ -210,6 +210,34 @@ def test_gemma3_plugin():
    _check_plugin(**check_inputs)


+@pytest.mark.runs_on(["cpu", "mps"])
+@pytest.mark.skipif(not is_transformers_version_greater_than("5.6.0"), reason="Requires transformers>=5.6.0")
+def test_gemma4_plugin():
+    tokenizer_module = _load_tokenizer_module(model_name_or_path="google/gemma-4-31B-it")
+    processor = tokenizer_module["processor"]
+    gemma4_plugin = get_mm_plugin(name="gemma4", image_token="<|image|>", video_token="<|video|>")
+    check_inputs = {"plugin": gemma4_plugin, **tokenizer_module}
+    # validate
+    mm_inputs = gemma4_plugin._get_mm_inputs(IMAGES, NO_VIDEOS, NO_AUDIOS, processor)
+    num_image_soft_tokens = 256  # when we use default max_soft_tokens=280
+    image_token = getattr(processor, "image_token")
+    boi_token = getattr(processor, "boi_token")
+    eoi_token = getattr(processor, "eoi_token")
+
+    expected_mm_type_ids = [[int(token_id == getattr(processor, "image_token_id")) for token_id in token_ids] for token_ids in BATCH_IDS]
+    check_inputs["expected_mm_messages"] = [
+        {"role": "user", "content": f"{boi_token}{image_token * num_image_soft_tokens}{eoi_token}What is in this image?"},
+        {"role": "assistant", "content": "A cat."},
+    ]
+    for key in ("num_soft_tokens_per_image",):
+        mm_inputs.pop(key, None)
+
+    mm_inputs["mm_token_type_ids"] = expected_mm_type_ids
+    check_inputs["expected_mm_inputs"] = mm_inputs
+    check_inputs["expected_no_mm_inputs"] = {"mm_token_type_ids": expected_mm_type_ids}
+    _check_plugin(**check_inputs)
+
+
@pytest.mark.runs_on(["cpu", "mps"])
@pytest.mark.skipif(not is_transformers_version_greater_than("4.52.0"), reason="Requires transformers>=4.52.0")
 def test_internvl_plugin():
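
The dotted-key lookup that `get_projectors` performs in the `visual.py` hunk can be sketched in isolation. This is a minimal illustration under stated assumptions, not the repository's code: `resolve_dotted_keys` and the `fake_model` object are hypothetical names, and plain `SimpleNamespace` objects stand in for `torch.nn.Module` so the snippet runs without PyTorch.

```python
from types import SimpleNamespace


def resolve_dotted_keys(root, dotted_keys):
    """Return the objects reachable from `root` for each dotted key, skipping keys that do not resolve."""
    found = []
    for dotted_key in dotted_keys:
        current = root
        for part in dotted_key.split("."):
            # A missing attribute yields None instead of raising AttributeError.
            current = getattr(current, part, None)
            if current is None:
                break
        if current is not None:
            found.append(current)
    return found


# Hypothetical stand-in for a checkpoint that has a vision embedder but no audio embedder.
fake_model = SimpleNamespace(model=SimpleNamespace(embed_vision="vision-projector"))
projectors = resolve_dotted_keys(fake_model, ["model.embed_vision", "model.embed_audio"])
print(projectors)  # ['vision-projector']
```

One consequence of this pattern is that an attribute which exists but is set to `None` is treated the same as a missing one, so it is skipped as well.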
Author	SHA1	Message	Date
Kingsley	436d26bc28	fix: projector lookup for gemma4 modules (#10382 ) Co-authored-by: yiluoAK_47 <yiluoAK_47@163.com>	2026-04-12 08:32:14 +08:00
Kingsley	c109c061e5	[model] set mm_projectors for omni models (#10378 )	2026-04-10 18:12:57 +08:00
Kingsley	fa09c01c36	fix: gemma4 mm_token_type_ids padding (#10359 )	2026-04-06 13:14:45 +08:00
Kingsley	eae6f0b541	[model] gemma4 (#10346 )	2026-04-05 12:10:28 +08:00
Kingsley	acac63ef35	[data] fix qwen3vl timestamp (#10338 )	2026-04-01 22:40:12 +08:00
浮梦	e5e8546493	[misc] fix moe (#10334 ) Co-authored-by: frozenleaves <frozen@Mac.local>	2026-03-31 23:04:45 +08:00
Cui-yshoho	97433c53b6	[feat] support LlamaFactory SFT training by HyperParallel FSDP2 backend (#10289 )	2026-03-30 10:47:20 +08:00
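
Once `projector_keys` becomes a list, the call sites in the diff switch from `append`/`add` to `extend`/`update`. The sketch below shows why the switch is needed, using plain Python containers. The variable names mirror the diff, but the starting values (`"all"`, `"lm_head"`) are assumptions chosen only for illustration.

```python
projector_keys = ["visual.merger", "audio_tower.proj"]

# extend() adds each key as its own element; append() would nest the whole list.
trainable_layers = ["all"]
trainable_layers.extend(projector_keys)

# update() adds each key as its own set member.
forbidden_modules = {"lm_head"}
forbidden_modules.update(projector_keys)

# add() cannot take the list at all, because lists are unhashable.
try:
    {"lm_head"}.add(projector_keys)
    add_error = None
except TypeError as exc:
    add_error = type(exc).__name__

print(trainable_layers)
print(sorted(forbidden_modules))
print(add_error)
```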