Add Qwen3-VL-2B config#4293

Open: subawocit wants to merge 1 commit into main from qwen3-vl-2b

@subawocit

Collaborator

Description

This PR adds Qwen/Qwen3-VL-2B-Instruct (Qwen3-Omni vision encoder + Qwen3-2B dense decoder) to MaxText:

  • Add Qwen3-VL-2B model config and register Qwen/Qwen3-VL-2B-Instruct.
  • Wire Qwen3-VL-2B into multimodal config validation, prompt/image preprocessing, embedding fusion, and vision bidirectional masking.
  • Add Qwen3-VL-2B vision encoder/projector subclasses that reuse the existing Qwen3-Omni vision tower implementation.
  • Extend vision RoPE handling to Qwen3 vision models.
  • Add HF config, HF shape metadata, parameter mappings, and conversion hooks for Qwen3-VL-2B checkpoint conversion.
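The "vision bidirectional masking" item above can be illustrated with a minimal, framework-free sketch. It shows the general technique only: attention stays causal for text, while tokens inside the same contiguous run of vision placeholders may attend to each other in both directions. The function name `bidirectional_vision_mask`, the placeholder id `9`, and the list-of-lists mask layout are all hypothetical and are not taken from the MaxText code in this PR.

```python
def bidirectional_vision_mask(token_ids, vision_token_id):
    """Return mask[q][k] == True when query position q may attend to key k.

    Hypothetical helper for illustration; not the MaxText implementation.
    """
    n = len(token_ids)

    # Label each position with the id of its contiguous vision span (-1 for text).
    span = []
    next_span_id = 0
    prev_is_vision = False
    for tok in token_ids:
        is_vision = tok == vision_token_id
        if is_vision and not prev_is_vision:
            next_span_id += 1  # a new run of vision tokens starts here
        span.append(next_span_id if is_vision else -1)
        prev_is_vision = is_vision

    # Start from a standard causal mask: keys at or before the query.
    mask = [[k <= q for k in range(n)] for q in range(n)]

    # Open up full attention inside each vision span.
    for q in range(n):
        for k in range(n):
            if span[q] != -1 and span[q] == span[k]:
                mask[q][k] = True
    return mask


# Toy sequence: text, two vision placeholders, text (9 is an assumed placeholder id).
example_mask = bidirectional_vision_mask([1, 9, 9, 2], vision_token_id=9)
```

In `example_mask`, position 1 can see position 2 (same vision span) even though it comes later, while text positions keep their causal view.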

Tests

Checkpoint conversion

python3 -m maxtext.checkpoint_conversion.to_maxtext \
    src/maxtext/configs/base.yml \
    model_name=qwen3-vl-2b \
    base_output_directory=<your_output_dir> \
    scan_layers=false \
    hf_access_token=<your_token> \
    weight_dtype=bfloat16 \
    hardware=cpu \
    skip_jax_distributed_system=True \
    checkpoint_storage_use_ocdbt=False \
    checkpoint_storage_use_zarr3=False \
    --eager_load_method=safetensors \
    --lazy_load_tensors=False

Image decode check

python3 -m maxtext.inference.decode \
    src/maxtext/configs/base.yml \
    model_name=qwen3-vl-2b \
    tokenizer_path=Qwen/Qwen3-VL-2B-Instruct \
    tokenizer_type=huggingface \
    load_parameters_path=gs://yuchenhou-maxtext-logs/checkpoints/qwen3-vl-2b/unscanned/2026-06-26-20-40/0/items \
    per_device_batch_size=1 \
    run_name=runner_image_2026-06-26-20-40 \
    scan_layers=false \
    use_multimodal=true \
    prompt='Describe this image' \
    image_path='tests/assets/test_image.jpg' \
    max_prefill_predict_length=512 \
    max_target_length=768 \
    ici_tensor_parallelism=4 \
    override_model_config=true \
    attention='dot_product' \
    hf_access_token=<your_token>

Output:

Input `<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe this image<|im_end|>
<|im_start|>assistant
` -> `This is a panoramic view of the Seattle skyline on a bright, sunny day. The image is taken from a high vantage point, looking down on the city.

The most prominent feature is the **Space Needle**, a distinctive observation tower located on the left side of the frame. It stands tall among the city's modern skyscrapers, which are a mix of glass and steel structures. The buildings are densely packed, creating a dense urban landscape.

In the background, a range of mountains is visible, with their peaks covered in snow, suggesting a high-altitude location. The sky is a clear, vibrant blue with a few scattered white clouds.

In the foreground, there are lush green trees and foliage, which add a natural element to the urban scene. The overall impression is one of a vibrant, modern city with a beautiful natural backdrop.
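The prompt strings echoed in the decode logs of this PR follow a chat template with a single vision placeholder. As a rough sketch, the helper below rebuilds those logged strings from the raw `prompt` argument. `build_prompt` and `PLACEHOLDER` are hypothetical names; the special tokens are copied from the logs, and the sketch assumes (without confirmation from the PR) that expanding the placeholder into the real number of vision tokens happens later in preprocessing.

```python
# Placeholder token per modality, as they appear in the logged inputs.
PLACEHOLDER = {"image": "<|image_pad|>", "video": "<|video_pad|>"}


def build_prompt(text, modality):
    """Hypothetical helper: wrap a user prompt in the logged chat template."""
    pad = PLACEHOLDER[modality]
    return (
        "<|im_start|>user\n"
        f"<|vision_start|>{pad}<|vision_end|>{text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


image_prompt = build_prompt("Describe this image", "image")
video_prompt = build_prompt(
    "What is the classification of the single exhibit in this video?", "video"
)
```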

Video decode check

python3 -m maxtext.inference.decode \
    src/maxtext/configs/base.yml \
    model_name=qwen3-vl-2b \
    tokenizer_path=Qwen/Qwen3-VL-2B-Instruct \
    tokenizer_type=huggingface \
    load_parameters_path=gs://yuchenhou-maxtext-logs/checkpoints/qwen3-vl-2b/unscanned/2026-06-26-20-40/0/items \
    per_device_batch_size=1 \
    run_name=runner_video_2026-06-26-20-40 \
    scan_layers=false \
    use_multimodal=true \
    prompt='What is the classification of the single exhibit in this video?' \
    video_path='tests/assets/test_video.mp4' \
    max_prefill_predict_length=1240 \
    max_target_length=1280 \
    ici_tensor_parallelism=4 \
    override_model_config=true \
    attention='dot_product' \
    hf_access_token=<your_token>

Output:

Input `<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>What is the classification of the single exhibit in this video?<|im_end|>
<|im_start|>assistant
` -> `The exhibit in the video is a dinosaur model, specifically a multi-headed dinosaur. It is a display piece in a museum or exhibition, likely representing a prehistoric creature. The model is likely a replica or
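The video answer above stops mid-sentence, which is consistent with the small gap between the two length flags in the video command. The sketch below works through that arithmetic under the assumption that decode can generate at most `max_target_length - max_prefill_predict_length` new tokens; `generation_budget` is a hypothetical helper, and only the four numbers come from the commands in this PR.

```python
def generation_budget(max_prefill_predict_length, max_target_length):
    """Hypothetical helper: tokens left for generation after the prefill window."""
    if max_target_length <= max_prefill_predict_length:
        raise ValueError("max_target_length must exceed max_prefill_predict_length")
    return max_target_length - max_prefill_predict_length


# Values taken from the image and video decode commands.
image_budget = generation_budget(512, 768)
video_budget = generation_budget(1240, 1280)
print(image_budget, video_budget)
```

If the assumption holds, the image run has 256 tokens available for its answer while the video run has only 40, so raising `max_target_length` in the video command would be one way to check whether the answer completes.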

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Jun 29, 2026

Codecov Report

❌ Patch coverage is 18.18182% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/maxtext/multimodal/processor.py | 0.00% | 7 Missing ⚠️ |
| src/maxtext/layers/decoders.py | 0.00% | 1 Missing ⚠️ |
| src/maxtext/layers/encoders.py | 0.00% | 1 Missing ⚠️ |
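The report gives only a percentage and a missing-line count, so the patch size has to be inferred. As a small worked check, the sketch below sums the per-file missing lines and searches for the total line count that reproduces the reported percentage. `infer_total_lines` is a hypothetical helper, and it assumes patch coverage is computed as covered lines divided by total coverable lines in the patch.

```python
# Missing-line counts per file, copied from the report.
missing_by_file = {
    "src/maxtext/multimodal/processor.py": 7,
    "src/maxtext/layers/decoders.py": 1,
    "src/maxtext/layers/encoders.py": 1,
}
REPORTED_PERCENT = 18.18182


def infer_total_lines(missing, percent, max_total=10_000, tol=5e-6):
    """Hypothetical helper: find the patch size matching a coverage percentage."""
    for total in range(max(missing, 1), max_total + 1):
        covered = total - missing
        if abs(100 * covered / total - percent) < tol:
            return total
    return None


missing = sum(missing_by_file.values())
total = infer_total_lines(missing, REPORTED_PERCENT)
covered = None if total is None else total - missing
print(missing, total, covered)
```

Under that assumption, the numbers correspond to 2 covered lines out of 11 coverable lines in the patch.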
