feat(gemma4): add Gemma 4 27B (MoE) LoRA recipe#2492

Draft
aniruddh-alt wants to merge 4 commits into
mainfrom
gemma4-27b-lora

Conversation

@aniruddh-alt aniruddh-alt commented Jun 4, 2026

Contributor

Add Gemma 4 27B (MoE) LoRA recipe

Adds a LoRA SFT recipe for Gemma 4 27B (google/gemma-4-26B-A4B-it: Mixture-of-Experts, 26.5B total / ~4B active parameters, image+text) under configs/recipes/gemma4/sft/27b_lora/. Builds on #2479 (now merged), which added the lora_exclude_modules support these recipes rely on.

What's in this PR

  • configs/recipes/gemma4/sft/27b_lora/train.yaml — FSDP (FULL_SHARD) LoRA SFT on alpaca-cleaned; same text-transformer scoping (lora_exclude_modules: .*vision_tower.*, .*multi_modal_projector.*) and transformer_layer_cls: Gemma4TextDecoderLayer as the 31B recipe.
  • configs/recipes/gemma4/sft/27b_lora/gcp_job.yaml — SkyPilot GCP job (A100:8, FSDP via oumi distributed torchrun).
  • configs/recipes/gemma4/README.md — mark 27B as "LoRA config available" + launch example.
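The text-transformer scoping in train.yaml can be illustrated with a minimal sketch. It assumes each lora_exclude_modules pattern is applied as a full-match regex against a module's dotted name; the module names below are hypothetical placeholders (not read from the checkpoint), and `is_excluded` is an illustrative helper, not an oumi or peft function.

```python
import re

# Patterns quoted from the recipe description.
EXCLUDE_PATTERNS = [r".*vision_tower.*", r".*multi_modal_projector.*"]

# Hypothetical module names, for illustration only.
MODULE_NAMES = [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.multi_modal_projector.linear",
    "model.language_model.layers.0.self_attn.q_proj",
]


def is_excluded(name, patterns=EXCLUDE_PATTERNS):
    """Return True if any pattern full-matches the dotted module name."""
    return any(re.fullmatch(p, name) is not None for p in patterns)


# Only modules outside the vision tower and projector remain LoRA candidates.
kept = [n for n in MODULE_NAMES if not is_excluded(n)]
print(kept)
```

Under that assumption, only the text-transformer module survives the filter, which is the intent of the two exclude patterns.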

Validation

Validated end-to-end in oumi's OSS environment (torch 2.10.0+cu128, transformers 5.7.0, peft 0.19.1, trl 1.4.0) on H100s with FSDP FULL_SHARD. google/gemma-4-26B-A4B-it loads and LoRA training runs to completion with 9,292,800 trainable params (~0.035% of the total; attention projections only, see the MoE note). Loss descends to <0.3 and the adapter is saved (TRAIN_DONE rc=0).
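As a quick sanity check on the ~0.035% figure, the arithmetic can be reproduced from the two numbers quoted in this description. This is a sketch that takes 26.5B as the total parameter count; `trainable_percent` is an illustrative helper name.

```python
TRAINABLE_PARAMS = 9_292_800  # trainable LoRA params reported for the run
TOTAL_PARAMS = 26.5e9         # total model params quoted in the description


def trainable_percent(trainable, total):
    """Share of parameters that receive gradients, as a percentage."""
    return 100.0 * trainable / total


pct = trainable_percent(TRAINABLE_PARAMS, TOTAL_PARAMS)
print(f"{pct:.3f}%")
```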

Pointing the recipe's LoRA setup at a task's training split (in place of the shipped alpaca-cleaned default) improves downstream accuracy on the MoE, evaluated with oumi's NATIVE engine:

| Task | Base | + LoRA |
|---|---|---|
| pubmedqa (n=100) | 57.0% | 76.0% |

MoE note: the standard gate_proj/up_proj/down_proj targets do not match this model's fused expert MLPs, so LoRA currently adapts the attention projections only. Adapting the experts would need their specific module names — follow-up; the recipe comment documents this.
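Why the standard MLP targets miss the experts can be sketched as follows, assuming targets are matched against the end of each module's dotted name (the convention peft uses for a list of target modules). Both module names are hypothetical; the real fused expert names in this checkpoint are not given here and would need to be read from the model.

```python
# Standard dense-MLP LoRA targets named in the MoE note.
MLP_TARGETS = ["gate_proj", "up_proj", "down_proj"]


def matches_target(name, targets):
    """Suffix match on dotted names: exact name or a trailing '.<target>'."""
    return any(name == t or name.endswith("." + t) for t in targets)


# Hypothetical names, for illustration only.
DENSE_MLP = "layers.0.mlp.gate_proj"            # classic per-projection layer
FUSED_EXPERT = "layers.0.mlp.experts.gate_up_proj"  # fused expert weights

print(matches_target(DENSE_MLP, MLP_TARGETS))
print(matches_target(FUSED_EXPERT, MLP_TARGETS))
```

A fused name such as the hypothetical `gate_up_proj` ends in neither `.gate_proj` nor `.up_proj`, so a suffix match silently skips it; that is consistent with LoRA attaching to the attention projections only.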

Eval note: native HF evaluation of this MoE OOMs on long prompts — the default batched_mm expert kernel copies expert weights per token-expert pair (~25.6 GiB for a ~900-token prompt at batch 1, independent of device_map sharding). Short-prompt tasks (pubmedqa) evaluate fine; long-prompt ones (e.g. banking77's 77-label prompt) don't. The grouped_mm kernel avoids this; wiring it through for this nested MoE is a follow-up. Training is unaffected.
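For a rough feel of how this limits prompt length, the reported figure can be extrapolated. The sketch below assumes the copy cost grows linearly with prompt tokens and batch size, which follows from the per token-expert pair description but was not measured at other lengths; the 2000-token case is a hypothetical input.

```python
REPORTED_GIB = 25.6    # expert-weight copies reported in the eval note
REPORTED_TOKENS = 900  # approximate prompt length for that measurement


def estimated_expert_copy_gib(prompt_tokens, batch_size=1):
    """Linear extrapolation from the single reported data point (assumption)."""
    gib_per_token = REPORTED_GIB / REPORTED_TOKENS
    return gib_per_token * prompt_tokens * batch_size


per_token_mib = REPORTED_GIB * 1024 / REPORTED_TOKENS
print(f"{per_token_mib:.1f} MiB per prompt token")
print(f"{estimated_expert_copy_gib(2000):.1f} GiB at 2000 tokens (extrapolated)")
```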

Related issues

N/A — config-only addition. Builds on #2479 (merged).

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Base automatically changed from gemma4-oumi-onboarding to main June 4, 2026 17:39
gitar-bot Bot commented Jun 5, 2026

README LoRA prose claimed the recipes exclude .*audio_tower.*, but the larger image+text models (31B/27B) have no audio tower and exclude .*multi_modal_projector.*; generalize the prose to cover both families. Remove ddp_find_unused_parameters from 27b_lora/train.yaml: it is a no-op under FSDP (which this recipe always enables; distributed.py routes the flag only to the DDP wrapper), and its comment was misleading. Reword the header exclusion rationale to match the e4b sibling (Gemma4ClippableLinear).
Gemma 4 is under the Gemma Terms of Use and gated on HF, not apache-2.0/ungated. Match the rest of the repo's wording. Same liberate-bot fix as the sibling 31B PR.