Whisper: pass processing_class instead of removed tokenizer kwarg #290

Open
danielhanchen wants to merge 1 commit into main from whisper-transformers5-processing-class

@danielhanchen

Member

transformers 5.x removed the deprecated tokenizer argument from Seq2SeqTrainer, so the Whisper notebook fails at trainer construction on current installs with:

TypeError: Seq2SeqTrainer.__init__() got an unexpected keyword argument 'tokenizer'

4.x has warned about this for a while (the notebook's own saved output shows the FutureWarning pointing at processing_class), and processing_class accepts the feature extractor on both 4.57.6 and 5.x, so the rename is forward and backward compatible.
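
To make the failure mode concrete, here is a minimal sketch that uses stand-in classes rather than the real Seq2SeqTrainer. `NewStyleTrainer`, `OldStyleTrainer` and `processor_kwargs` are hypothetical names invented for this illustration. The rename in this PR needs no such helper on 4.57.6 or 5.x; the fallback branch would only matter on an older install that lacks processing_class, which is an assumption and not something this PR validates.

```python
import inspect


class NewStyleTrainer:
    """Stand-in for a trainer whose __init__ no longer accepts `tokenizer`."""

    def __init__(self, model=None, processing_class=None):
        self.processing_class = processing_class


class OldStyleTrainer:
    """Stand-in for a legacy trainer that only knows `tokenizer`."""

    def __init__(self, model=None, tokenizer=None):
        self.processing_class = tokenizer


def processor_kwargs(trainer_cls, processor):
    """Pick the keyword name that trainer_cls.__init__ actually accepts."""
    params = inspect.signature(trainer_cls.__init__).parameters
    name = "processing_class" if "processing_class" in params else "tokenizer"
    return {name: processor}


feature_extractor = object()  # placeholder for tokenizer.feature_extractor

# Passing the removed keyword raises TypeError, as in the notebook failure.
try:
    NewStyleTrainer(tokenizer=feature_extractor)
    removed_kwarg_rejected = False
except TypeError:
    removed_kwarg_rejected = True

new_trainer = NewStyleTrainer(**processor_kwargs(NewStyleTrainer, feature_extractor))
old_trainer = OldStyleTrainer(**processor_kwargs(OldStyleTrainer, feature_extractor))
```

Both stand-ins end up holding the same object, which mirrors the point above: only the keyword name changes, not what is passed.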

Changes: one line in original_template/Whisper.ipynb, synced to nb/Whisper.ipynb, nb/Kaggle-Whisper.ipynb, python_scripts/Whisper.py and python_scripts/Kaggle-Whisper.py. I avoided a full update_all_notebooks.py regeneration on purpose: it currently rewrites unrelated notebooks (install blocks, embedded outputs), which would bury this one-line change.
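
Because the sync was done by hand rather than by regeneration, a quick text scan can confirm that no copy was missed. The following is a rough sketch, not part of this PR: `unsynced_files` and the two regular expressions are hypothetical, and they assume the keyword appears on one line in the form shown in the diff.

```python
import re
from pathlib import Path

# Matches the removed keyword, e.g. `tokenizer = tokenizer.feature_extractor`,
# but not `processing_class = tokenizer.feature_extractor`.
STALE_KWARG = re.compile(r"\btokenizer\s*=\s*tokenizer\.feature_extractor")
NEW_KWARG = re.compile(r"\bprocessing_class\s*=")

# The five files named in the description above.
FILES = [
    "original_template/Whisper.ipynb",
    "nb/Whisper.ipynb",
    "nb/Kaggle-Whisper.ipynb",
    "python_scripts/Whisper.py",
    "python_scripts/Kaggle-Whisper.py",
]


def unsynced_files(paths):
    """Return the paths that still use the old kwarg or lack the new one."""
    bad = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        if STALE_KWARG.search(text) or not NEW_KWARG.search(text):
            bad.append(str(path))
    return bad
```

Run from the repository root, `unsynced_files(FILES)` should return an empty list if every copy carries the rename.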

Validated by running the patched notebook end to end (training, transcription inference, LoRA save) on transformers 5.11 inside the unsloth Docker image on a B200: passes in 2 minutes. I also checked JSON validity for all three ipynb files and ran an AST parse on both Python scripts.
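
The JSON and AST checks can be reproduced with the standard library alone. This is a sketch of one way to do it; the exact commands used for this PR are not shown here, and `check_notebook`, `check_script` and `validate` are hypothetical helper names.

```python
import ast
import json
from pathlib import Path


def check_notebook(path):
    """Raise json.JSONDecodeError if the .ipynb file is not valid JSON."""
    json.loads(Path(path).read_text(encoding="utf-8"))


def check_script(path):
    """Raise SyntaxError if the .py file does not parse."""
    ast.parse(Path(path).read_text(encoding="utf-8"), filename=str(path))


def validate(paths):
    """Return {path: error message} for every file that fails its check."""
    failures = {}
    for path in map(Path, paths):
        try:
            if path.suffix == ".ipynb":
                check_notebook(path)
            elif path.suffix == ".py":
                check_script(path)
        except (json.JSONDecodeError, SyntaxError) as err:
            failures[str(path)] = str(err)
    return failures
```

An empty dictionary from `validate([...])` means every listed notebook parsed as JSON and every script parsed as Python. It says nothing about whether the code runs, which is what the end-to-end run covers.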

transformers 5.x removed the deprecated tokenizer argument from
Seq2SeqTrainer (4.x already warned: use processing_class instead), so
the notebook dies at trainer construction on current installs.
processing_class accepts the feature extractor on 4.57.6 and 5.x alike.

Applied in original_template and synced to the generated nb, kaggle and
python_scripts copies; a full regeneration was avoided on purpose since
it rewrites unrelated notebooks.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the trainer configuration across several Jupyter notebooks and Python scripts by replacing the deprecated tokenizer parameter with processing_class = tokenizer.feature_extractor. The reviewer recommends passing the full processor (held in the tokenizer variable) to processing_class instead of only the feature extractor, so that trainer.save_model() saves both the feature extractor and tokenizer configurations and the saved model is fully self-contained.

Comment thread nb/Kaggle-Whisper.ipynb
  " data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),\n",
  " eval_dataset = test_dataset,\n",
- " tokenizer = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer.feature_extractor,\n",

Contributor

medium

Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change
- " processing_class = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer,\n",

Comment thread nb/Whisper.ipynb
  " data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),\n",
  " eval_dataset = test_dataset,\n",
- " tokenizer = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer.feature_extractor,\n",

Contributor

medium

Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change
- " processing_class = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer,\n",

  " data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=tokenizer),\n",
  " eval_dataset = test_dataset,\n",
- " tokenizer = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer.feature_extractor,\n",

Contributor

medium

Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change
- " processing_class = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer,\n",

  data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),
  eval_dataset = test_dataset,
- tokenizer = tokenizer.feature_extractor,
+ processing_class = tokenizer.feature_extractor,

Contributor

medium

Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change
- processing_class = tokenizer.feature_extractor,
+ processing_class = tokenizer,

Comment thread python_scripts/Whisper.py
  data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),
  eval_dataset = test_dataset,
- tokenizer = tokenizer.feature_extractor,
+ processing_class = tokenizer.feature_extractor,

Contributor

medium

Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.

Suggested change
- processing_class = tokenizer.feature_extractor,
+ processing_class = tokenizer,
