Whisper: pass processing_class instead of removed tokenizer kwarg by danielhanchen · Pull Request #290 · unslothai/notebooks

danielhanchen · 2026-06-12T13:58:11Z

transformers 5.x removed the deprecated tokenizer argument from Seq2SeqTrainer, so the Whisper notebook fails at trainer construction on current installs with:

TypeError: Seq2SeqTrainer.__init__() got an unexpected keyword argument 'tokenizer'

4.x has warned about this for a while (the notebook's own saved output shows the FutureWarning pointing at processing_class), and processing_class accepts the feature extractor on both 4.57.6 and 5.x, so the rename is forward and backward compatible.
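
The compatibility claim above can also be checked at runtime instead of being assumed. The following is a minimal sketch, not part of this PR: `processor_kwarg_name` is a hypothetical helper that inspects a trainer class's `__init__` signature and reports which keyword it accepts. The two stand-in classes are illustrative only; they mimic a signature that knows only `tokenizer` and one that knows only `processing_class`, and are not transformers classes.

```python
import inspect


def processor_kwarg_name(trainer_cls):
    """Return the keyword a trainer class accepts for its processor object.

    Hypothetical helper for illustration: prefers 'processing_class' and
    falls back to the older 'tokenizer' keyword.
    """
    params = inspect.signature(trainer_cls.__init__).parameters
    if "processing_class" in params:
        return "processing_class"
    if "tokenizer" in params:
        return "tokenizer"
    raise TypeError(f"{trainer_cls.__name__} accepts neither keyword")


# Stand-ins that only mimic the two signatures (assumptions, not real classes).
class OldStyleTrainer:
    def __init__(self, model=None, tokenizer=None):
        self.processor = tokenizer


class NewStyleTrainer:
    def __init__(self, model=None, processing_class=None):
        self.processor = processing_class


if __name__ == "__main__":
    for cls in (OldStyleTrainer, NewStyleTrainer):
        # Build the keyword dynamically so one call site serves both signatures.
        kwargs = {processor_kwarg_name(cls): "feature-extractor-placeholder"}
        trainer = cls(**kwargs)
        print(cls.__name__, "->", list(kwargs))
```

Since `processing_class` is reported here to work on both 4.57.6 and 5.x, the notebook does not need this indirection; the sketch only shows how the choice could be made explicit if older installs had to be supported.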

Changes: one line in original_template/Whisper.ipynb, synced to nb/Whisper.ipynb, nb/Kaggle-Whisper.ipynb, python_scripts/Whisper.py and python_scripts/Kaggle-Whisper.py. I deliberately avoided a full update_all_notebooks.py regeneration: it currently rewrites unrelated notebooks (install blocks, embedded outputs), which would bury this one-line change.
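
A targeted sync like the one described above could be scripted roughly as follows. This is a sketch under assumptions, not the method used in this PR: `rename_kwarg` and `patch_files` are hypothetical names, and the regular expression assumes the old keyword appears as `tokenizer = tokenizer.feature_extractor` (with any spacing around `=`) in both the notebook JSON and the Python scripts.

```python
import re
from pathlib import Path

# Matches the old keyword assignment; group 1 keeps the original spacing around "=".
OLD_KWARG = re.compile(r"\btokenizer(\s*=\s*)tokenizer\.feature_extractor")


def rename_kwarg(text):
    """Return (new_text, number_of_replacements) for one file's contents."""
    return OLD_KWARG.subn(r"processing_class\g<1>tokenizer.feature_extractor", text)


def patch_files(paths):
    """Rewrite only files that contain the old keyword; report counts per file."""
    counts = {}
    for path in map(Path, paths):
        new_text, count = rename_kwarg(path.read_text(encoding="utf-8"))
        if count:
            path.write_text(new_text, encoding="utf-8")
        counts[str(path)] = count
    return counts


if __name__ == "__main__":
    sample = '    "    tokenizer = tokenizer.feature_extractor,\\n",'
    print(rename_kwarg(sample)[0])
```

The replacement is idempotent: once a line reads `processing_class = tokenizer.feature_extractor`, the pattern no longer matches it, so running the script twice changes nothing further.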

Validated by running the patched notebook end to end (training, transcription inference, LoRA save) on transformers 5.11 inside the unsloth Docker image on a B200: passes in 2 minutes. JSON validity was checked for all three ipynb files, and both Python scripts were checked by parsing them to an AST.
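
The static checks mentioned above (JSON validity for notebooks, AST parsing for scripts) need only the standard library. A minimal sketch, assuming the checks were equivalent to `json.loads` and `ast.parse`; `check_file` and `find_invalid` are hypothetical names, not scripts from the repository.

```python
import ast
import json
from pathlib import Path


def check_file(path):
    """Raise if a notebook is not valid JSON or a script is not valid Python."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".ipynb":
        json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    elif path.suffix == ".py":
        ast.parse(text, filename=str(path))  # raises SyntaxError on bad syntax
    else:
        raise ValueError(f"unsupported file type: {path.suffix}")


def find_invalid(paths):
    """Return a list of (path, error message) pairs for files that fail."""
    failures = []
    for path in paths:
        try:
            check_file(path)
        except (ValueError, SyntaxError) as exc:  # JSONDecodeError is a ValueError
            failures.append((str(path), str(exc)))
    return failures
```

Calling `find_invalid` with the three notebook paths and two script paths listed in the description would return an empty list when all five files parse. This checks syntax only; it does not execute any notebook cell.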

transformers 5.x removed the deprecated tokenizer argument from Seq2SeqTrainer (4.x already warned: use processing_class instead), so the notebook dies at trainer construction on current installs. processing_class accepts the feature extractor on 4.57.6 and 5.x alike. Applied in original_template and synced to the generated nb, kaggle and python_scripts copies; a full regeneration was avoided on purpose since it rewrites unrelated notebooks.

gemini-code-assist

Code Review

This pull request updates the trainer configuration across several Jupyter notebooks and Python scripts by replacing the removed tokenizer parameter with processing_class = tokenizer.feature_extractor. The reviewer recommends passing the full processor (the object held in the tokenizer variable) directly to processing_class instead of just the feature extractor. This ensures that both the feature extractor and tokenizer configurations are saved when trainer.save_model() is called, making the saved model fully self-contained.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T13:59:30Z

    "    data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),\n",
    "    eval_dataset = test_dataset,\n",
-    "    tokenizer = tokenizer.feature_extractor,\n",
+    "    processing_class = tokenizer.feature_extractor,\n",


Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change

-    "    processing_class = tokenizer.feature_extractor,\n",
+    "    processing_class = tokenizer,\n",

gemini-code-assist · 2026-06-12T13:59:30Z

        "    data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=tokenizer),\n",
        "    eval_dataset = test_dataset,\n",
-        "    tokenizer = tokenizer.feature_extractor,\n",
+        "    processing_class = tokenizer.feature_extractor,\n",


Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change

-        "    processing_class = tokenizer.feature_extractor,\n",
+        "    processing_class = tokenizer,\n",

gemini-code-assist · 2026-06-12T13:59:31Z

    data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),
    eval_dataset = test_dataset,
-    tokenizer = tokenizer.feature_extractor,
+    processing_class = tokenizer.feature_extractor,


Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change

-    processing_class = tokenizer.feature_extractor,
+    processing_class = tokenizer,

danielhanchen mentioned this pull request Jun 12, 2026

Add Docker build for Blackwell that runs on any NVIDIA GPU host unslothai/unsloth#5748

danielhanchen wants to merge 1 commit into main from whisper-transformers5-processing-class
