feat(gemma4): add Gemma 4 27B (MoE) LoRA recipe#2492
Draft
aniruddh-alt wants to merge 4 commits into
README LoRA prose claimed the recipes exclude .*audio_tower.*, but the larger image+text models (31B/27B) have no audio tower and exclude .*multi_modal_projector.* instead; generalize the prose to cover both families. Remove ddp_find_unused_parameters from 27b_lora/train.yaml: it is a no-op under FSDP (which this recipe always enables; distributed.py routes the flag only to the DDP wrapper) and its comment was misleading. Reword the header exclusion rationale to match the e4b sibling (Gemma4ClippableLinear).
Gemma 4 is under the Gemma Terms of Use and gated on HF, not apache-2.0/ungated. Match the rest of the repo's wording. Same liberate-bot fix as the sibling 31B PR.
Add Gemma 4 27B (MoE) LoRA recipe
Adds a LoRA SFT recipe for Gemma 4 27B (`google/gemma-4-26B-A4B-it`: Mixture-of-Experts, 26.5B total / ~4B active, image+text) under `configs/recipes/gemma4/sft/27b_lora/`. Builds on #2479 (now merged), which added the `lora_exclude_modules` support these recipes rely on.

What's in this PR
- `configs/recipes/gemma4/sft/27b_lora/train.yaml`: FSDP (`FULL_SHARD`) LoRA SFT on `alpaca-cleaned`; same text-transformer scoping (`lora_exclude_modules`: `.*vision_tower.*`, `.*multi_modal_projector.*`) and `transformer_layer_cls: Gemma4TextDecoderLayer` as the 31B recipe.
- `configs/recipes/gemma4/sft/27b_lora/gcp_job.yaml`: SkyPilot GCP job (A100:8, FSDP via `oumi distributed torchrun`).
- `configs/recipes/gemma4/README.md`: mark 27B as "LoRA config available" + launch example.

Validation
Validated end-to-end in oumi's OSS environment (`torch 2.10.0+cu128`, `transformers 5.7.0`, `peft 0.19.1`, `trl 1.4.0`) on H100s with FSDP `FULL_SHARD`. `google/gemma-4-26B-A4B-it` loads and LoRA trains to completion: 9,292,800 trainable params (~0.035%, attention projections only, see the MoE note), loss descending to <0.3, adapter saved (TRAIN_DONE rc=0).

Pointing the recipe's LoRA setup at a task's training split (in place of the shipped `alpaca-cleaned` default) gives a real downstream gain on the MoE, via oumi's `NATIVE` engine:

MoE note: the standard `gate_proj`/`up_proj`/`down_proj` targets do not match this model's fused expert MLPs, so LoRA currently adapts the attention projections only. Adapting the experts would need their specific module names (follow-up); the recipe comment documents this.

Eval note: native HF evaluation of this MoE OOMs on long prompts. The default `batched_mm` expert kernel copies expert weights per token-expert pair (~25.6 GiB for a ~900-token prompt at batch 1, independent of `device_map` sharding). Short-prompt tasks (pubmedqa) evaluate fine; long-prompt ones (e.g. banking77's 77-label prompt) don't. The `grouped_mm` kernel avoids this; wiring it through for this nested MoE is a follow-up. Training is unaffected.

Related issues
N/A — config-only addition. Builds on #2479 (merged).
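
For reviewers, a rough sketch of how the text-transformer scoping and the MoE note interact. It is not oumi's or PEFT's actual matching code. It assumes the exclusion patterns are applied as full-match regular expressions against dotted module names, and that LoRA targets are matched by name suffix. The module names and the target list are hypothetical, chosen only to show the effect.

```python
import re

# Exclusion patterns quoted from the recipe description.
EXCLUDE_PATTERNS = [r".*vision_tower.*", r".*multi_modal_projector.*"]

# Assumed target list: attention projections plus the "standard" MLP names.
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj"]

# Hypothetical module names; the real model's names may differ.
MODULES = [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.multi_modal_projector.linear",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.o_proj",
    "model.language_model.layers.0.moe.experts",  # fused expert MLP (hypothetical)
]


def is_excluded(name, patterns=EXCLUDE_PATTERNS):
    """True if any exclusion regex matches the whole module name."""
    return any(re.fullmatch(p, name) for p in patterns)


def is_targeted(name, targets=TARGETS):
    """True if the module name equals a target or ends with '.<target>'."""
    return any(name == t or name.endswith("." + t) for t in targets)


def select_lora_modules(names):
    """Modules that would receive adapters: targeted and not excluded."""
    return [n for n in names if is_targeted(n) and not is_excluded(n)]


selected = select_lora_modules(MODULES)
for name in selected:
    print(name)
```

Under these assumptions only the text-side attention projections are selected: the vision tower's `q_proj` is targeted but excluded, and the fused expert module matches none of the `gate_proj`/`up_proj`/`down_proj` suffixes, which mirrors the behaviour described in the MoE note.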
Before submitting