Add Qwen3-VL-2B config by subawocit · Pull Request #4293 · AI-Hypercomputer/maxtext

subawocit · 2026-06-29T16:55:53Z

Description

This PR adds Qwen/Qwen3-VL-2B-Instruct (Qwen3-Omni vision encoder + Qwen3-2B dense decoder) to MaxText:

Add Qwen3-VL-2B model config and register Qwen/Qwen3-VL-2B-Instruct.
Wire Qwen3-VL-2B into multimodal config validation, prompt/image preprocessing, embedding fusion, and vision bidirectional masking.
Add Qwen3-VL-2B vision encoder/projector subclasses that reuse the existing Qwen3-Omni vision tower implementation.
Extend vision RoPE handling to Qwen3 vision models.
Add HF config, HF shape metadata, parameter mappings, and conversion hooks for Qwen3-VL-2B checkpoint conversion.
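To make "embedding fusion" and "vision bidirectional masking" concrete, here is a minimal pure-Python sketch of one common reading of those terms. It is not the code from this PR: the function names, the `IMAGE_PAD` placeholder id, and the list-based tensors are all hypothetical and chosen only for illustration. The assumption is that vision embeddings overwrite the text embeddings at placeholder positions, and that tokens inside the same contiguous vision block may attend to each other in both directions while all other tokens stay causal.

```python
from typing import List, Sequence

IMAGE_PAD = -1  # hypothetical placeholder token id, for illustration only


def fuse_embeddings(
    token_ids: Sequence[int],
    text_embeds: Sequence[List[float]],
    vision_embeds: Sequence[List[float]],
    placeholder_id: int = IMAGE_PAD,
) -> List[List[float]]:
    """Swap in one vision embedding per placeholder position, in order."""
    n_slots = sum(1 for t in token_ids if t == placeholder_id)
    if n_slots != len(vision_embeds):
        raise ValueError(f"{n_slots} placeholders but {len(vision_embeds)} vision embeddings")
    vision_iter = iter(vision_embeds)
    return [
        next(vision_iter) if t == placeholder_id else e
        for t, e in zip(token_ids, text_embeds)
    ]


def bidirectional_vision_mask(
    token_ids: Sequence[int], placeholder_id: int = IMAGE_PAD
) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k."""
    n = len(token_ids)
    # Give each contiguous run of placeholders its own block id (1, 2, ...); text gets 0.
    block = [0] * n
    current = 0
    for i, t in enumerate(token_ids):
        if t == placeholder_id:
            if i == 0 or token_ids[i - 1] != placeholder_id:
                current += 1
            block[i] = current
    # Causal attention everywhere, plus full attention inside each vision block.
    return [
        [k <= q or (block[q] != 0 and block[q] == block[k]) for k in range(n)]
        for q in range(n)
    ]


ids = [5, IMAGE_PAD, IMAGE_PAD, 7]
fused = fuse_embeddings(ids, [[0.0]] * 4, [[1.0], [2.0]])
mask = bidirectional_vision_mask(ids)
```

In this toy sequence the first vision token can see the second one (`mask[1][2]` is true), while the leading text token still cannot look ahead into the image (`mask[0][1]` is false).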

Tests

Checkpoint conversion

python3 -m maxtext.checkpoint_conversion.to_maxtext \
    src/maxtext/configs/base.yml \
    model_name=qwen3-vl-2b \
    base_output_directory=<your_output_dir> \
    scan_layers=false \
    hf_access_token=<your_token> \
    weight_dtype=bfloat16 \
    hardware=cpu \
    skip_jax_distributed_system=True \
    checkpoint_storage_use_ocdbt=False \
    checkpoint_storage_use_zarr3=False \
    --eager_load_method=safetensors \
    --lazy_load_tensors=False

Image decode check

python3 -m maxtext.inference.decode \
    src/maxtext/configs/base.yml \
    model_name=qwen3-vl-2b \
    tokenizer_path=Qwen/Qwen3-VL-2B-Instruct \
    tokenizer_type=huggingface \
    load_parameters_path=gs://yuchenhou-maxtext-logs/checkpoints/qwen3-vl-2b/unscanned/2026-06-26-20-40/0/items \
    per_device_batch_size=1 \
    run_name=runner_image_2026-06-26-20-40 \
    scan_layers=false \
    use_multimodal=true \
    prompt='Describe this image' \
    image_path='tests/assets/test_image.jpg' \
    max_prefill_predict_length=512 \
    max_target_length=768 \
    ici_tensor_parallelism=4 \
    override_model_config=true \
    attention='dot_product' \
    hf_access_token=<your_token>

Output:

Input `<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe this image<|im_end|>
<|im_start|>assistant
` -> `This is a panoramic view of the Seattle skyline on a bright, sunny day. The image is taken from a high vantage point, looking down on the city.

The most prominent feature is the **Space Needle**, a distinctive observation tower located on the left side of the frame. It stands tall among the city's modern skyscrapers, which are a mix of glass and steel structures. The buildings are densely packed, creating a dense urban landscape.

In the background, a range of mountains is visible, with their peaks covered in snow, suggesting a high-altitude location. The sky is a clear, vibrant blue with a few scattered white clouds.

In the foreground, there are lush green trees and foliage, which add a natural element to the urban scene. The overall impression is one of a vibrant, modern city with a beautiful natural backdrop.

Video decode check

python3 -m maxtext.inference.decode \
    src/maxtext/configs/base.yml \
    model_name=qwen3-vl-2b \
    tokenizer_path=Qwen/Qwen3-VL-2B-Instruct \
    tokenizer_type=huggingface \
    load_parameters_path=gs://yuchenhou-maxtext-logs/checkpoints/qwen3-vl-2b/unscanned/2026-06-26-20-40/0/items \
    per_device_batch_size=1 \
    run_name=runner_video_2026-06-26-20-40 \
    scan_layers=false \
    use_multimodal=true \
    prompt='What is the classification of the single exhibit in this video?' \
    video_path='tests/assets/test_video.mp4' \
    max_prefill_predict_length=1240 \
    max_target_length=1280 \
    ici_tensor_parallelism=4 \
    override_model_config=true \
    attention='dot_product' \
    hf_access_token=<your_token>

Output:

Input `<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>What is the classification of the single exhibit in this video?<|im_end|>
<|im_start|>assistant
` -> `The exhibit in the video is a dinosaur model, specifically a multi-headed dinosaur. It is a display piece in a museum or exhibition, likely representing a prehistoric creature. The model is likely a replica or
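The video output above stops mid-sentence. One plausible explanation, assuming the decode loop generates at most `max_target_length` minus `max_prefill_predict_length` tokens (my reading of how these two settings interact, not something this PR states), is that the video run simply has a much smaller generation budget than the image run. The helper name below is hypothetical; the numbers come from the two commands above.

```python
def generation_budget(max_prefill_predict_length: int, max_target_length: int) -> int:
    """Tokens left for generation after reserving the prefill window.

    Assumes the decode loop runs from the prefill length up to the target length.
    """
    if max_target_length <= max_prefill_predict_length:
        raise ValueError("max_target_length must exceed max_prefill_predict_length")
    return max_target_length - max_prefill_predict_length


# Values taken from the image and video decode commands.
image_budget = generation_budget(512, 768)
video_budget = generation_budget(1240, 1280)
print(f"image: {image_budget} tokens, video: {video_budget} tokens")
```

Under that assumption, raising `max_target_length` for the video check would be the first thing to try if a complete answer is wanted.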

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-29T17:00:46Z

Codecov Report

❌ Patch coverage is 18.18182% with 9 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/multimodal/processor.py	0.00%	7 Missing ⚠️
src/maxtext/layers/decoders.py	0.00%	1 Missing ⚠️
src/maxtext/layers/encoders.py	0.00%	1 Missing ⚠️
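The headline percentage can be cross-checked against the table. As a rough sketch, assuming patch coverage is simply covered changed lines divided by total changed lines (Codecov's exact treatment of partial lines may differ), the smallest patch size consistent with "18.18182% with 9 lines missing" can be found by search. The function name is hypothetical.

```python
from typing import Optional, Tuple


def patch_totals(
    missing: int, coverage_percent: float, max_lines: int = 1000
) -> Optional[Tuple[int, int]]:
    """Return (covered, total) for the smallest patch matching the reported figures."""
    for total in range(max(missing, 1), max_lines + 1):
        covered = total - missing
        # Accept a match within the rounding error of a 5-decimal percentage.
        if abs(100 * covered / total - coverage_percent) < 5e-6:
            return covered, total
    return None


# 7 + 1 + 1 missing lines from the table, and the reported patch percentage.
result = patch_totals(missing=7 + 1 + 1, coverage_percent=18.18182)
print(result)
```

Under that assumption the report corresponds to 2 covered lines out of 11 changed lines, with all 9 uncovered lines in the three files listed.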


feat: Add Qwen3-VL-2B config and E2E tests

d36160c

subawocit marked this pull request as ready for review June 29, 2026 18:02

subawocit wants to merge 1 commit into main from qwen3-vl-2b
