Whisper: pass processing_class instead of removed tokenizer kwarg #290
danielhanchen wants to merge 1 commit
transformers 5.x removed the deprecated tokenizer argument from Seq2SeqTrainer (4.x already warned: use processing_class instead), so the notebook dies at trainer construction on current installs. processing_class accepts the feature extractor on 4.57.6 and 5.x alike. Applied in original_template and synced to the generated nb, kaggle and python_scripts copies; a full regeneration was avoided on purpose since it rewrites unrelated notebooks.
Code Review
This pull request updates the trainer configuration across several Jupyter notebooks and Python scripts by replacing the deprecated tokenizer parameter with processing_class = tokenizer.feature_extractor. The reviewer recommends passing the full processor (the object held in the tokenizer variable) to processing_class instead of only its feature extractor. That way trainer.save_model() saves both the feature extractor and the tokenizer configuration, and the saved model is self-contained.
  " data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),\n",
  " eval_dataset = test_dataset,\n",
- " tokenizer = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer.feature_extractor,\n",
Since the tokenizer variable actually holds the full processor (as indicated by processor = tokenizer in the data collator), passing tokenizer directly to processing_class is highly recommended. If you only pass tokenizer.feature_extractor, the trainer will only save the feature extractor configuration (preprocessor_config.json) and will omit the tokenizer configuration files (like tokenizer_config.json, vocab.json, etc.) when trainer.save_model() is called. Passing the full processor ensures that both the feature extractor and tokenizer are properly saved, making the saved model fully self-contained and ready for inference or resuming training.
- " processing_class = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer,\n",
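One way to see the difference the reviewer describes is to inspect the output directory after trainer.save_model() has run. The following is a minimal sketch using only the standard library. The helper name missing_processor_files is hypothetical, and the expected file names are only the ones mentioned in the comment above; a real Whisper processor may write additional files.

```python
from pathlib import Path

# File names taken from the review comment; treat this list as illustrative,
# not as the complete set of files a Whisper processor writes.
EXPECTED_FILES = ("preprocessor_config.json", "tokenizer_config.json", "vocab.json")


def missing_processor_files(save_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from save_dir."""
    root = Path(save_dir)
    return [name for name in expected if not (root / name).is_file()]
```

If the reviewer's description holds, calling this on the directory passed to trainer.save_model() should return an empty list when the full processor was given to processing_class, and should list the tokenizer files when only the feature extractor was given.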
  " data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=tokenizer),\n",
  " eval_dataset = test_dataset,\n",
- " tokenizer = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer.feature_extractor,\n",
- " processing_class = tokenizer.feature_extractor,\n",
+ " processing_class = tokenizer,\n",
  data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor = tokenizer),
  eval_dataset = test_dataset,
- tokenizer = tokenizer.feature_extractor,
+ processing_class = tokenizer.feature_extractor,
- processing_class = tokenizer.feature_extractor,
+ processing_class = tokenizer,
transformers 5.x removed the deprecated tokenizer argument from Seq2SeqTrainer, so the Whisper notebook fails at trainer construction on current installs with:
4.x has warned about this for a while (the notebook's own saved output shows the FutureWarning pointing at processing_class), and processing_class accepts the feature extractor on both 4.57.6 and 5.x, so the rename is forward and backward compatible.
Changes: one line in original_template/Whisper.ipynb, synced to nb/Whisper.ipynb, nb/Kaggle-Whisper.ipynb, python_scripts/Whisper.py and python_scripts/Kaggle-Whisper.py. I avoided a full update_all_notebooks.py regeneration on purpose: it currently rewrites unrelated notebooks (install blocks, embedded outputs), which would bury this one-line change.
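A targeted sync like this can be sketched as a plain text replacement, which leaves every other byte of each file alone. This is only an illustration and not necessarily how the change was actually applied; rename_kwarg is a hypothetical helper, and the OLD and NEW strings are taken from the diff hunks in this PR.

```python
OLD = "tokenizer = tokenizer.feature_extractor"
NEW = "processing_class = tokenizer.feature_extractor"


def rename_kwarg(path):
    """Replace OLD with NEW in one file and return the number of replacements."""
    # newline="" keeps the file's existing line endings untouched.
    with open(path, encoding="utf-8", newline="") as f:
        text = f.read()
    count = text.count(OLD)
    if count:
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text.replace(OLD, NEW))
    return count
```

Because NEW does not itself contain OLD, running the helper a second time on the same file makes no further changes. It works on the .ipynb files as well as the .py files, since the kwarg appears as literal text inside the notebook's JSON source strings.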
Validated by running the patched notebook end to end (training, transcription inference, LoRA save) on transformers 5.11 inside the unsloth Docker image on a B200: passes in 2 minutes. JSON validity checked for all three ipynb files and AST checked for both python scripts.
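The static part of that validation can be reproduced with the standard library alone. The PR does not say which tooling was used, so the following is a sketch under that assumption; check_file is a hypothetical name.

```python
import ast
import json
from pathlib import Path


def check_file(path):
    """Parse a .ipynb file as JSON or a .py file as Python; raise on failure."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".ipynb":
        json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    elif path.suffix == ".py":
        ast.parse(text, filename=str(path))  # raises SyntaxError on invalid code
    else:
        raise ValueError(f"unsupported file type: {path}")
```

This only confirms that each file still parses. It says nothing about whether the notebook runs, which is what the end-to-end run covers.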