Fix the error raised when using ffmpeg_microphone_live

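For audio stream processing with Whisper-family models, my understanding is that the simplest way to write it with a pipeline is to use `from transformers.pipelines.audio_utils import ffmpeg_microphone_live` (e.g. https://huggingface.co/learn/audio-course/en/chapter7/voice-assistant ).
However, in the latest transformers implementation the audio is stored under the key `raw` rather than `array`, so this pipeline raises an error.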
https://github.com/huggingface/transformers/blob/v4.46.2/src/transformers/pipelines/audio_utils.py#L258
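
For reference, the key mismatch can also be worked around on the caller side. The following is only a minimal sketch, not part of this PR: `with_array_key` is a hypothetical helper name, and it assumes each chunk is a plain dict that stores the audio under `raw`, as in the linked `audio_utils.py`.

```python
from typing import Any, Dict, Iterable, Iterator


def with_array_key(chunks: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield copies of each chunk with the audio under "array" instead of "raw".

    Hypothetical helper: assumes every chunk is a dict shaped like the ones
    yielded by ffmpeg_microphone_live (audio stored under "raw").
    """
    for chunk in chunks:
        chunk = dict(chunk)  # shallow copy so the caller's dict is left untouched
        if "raw" in chunk and "array" not in chunk:
            chunk["array"] = chunk.pop("raw")
        yield chunk
```

Under those assumptions, the microphone generator could be wrapped as `with_array_key(ffmpeg_microphone_live(...))` before being passed to the pipeline. Accepting `raw` inside the pipeline, as this PR does, avoids the need for such a wrapper.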

With this PR, real-time stream processing worked with kotoba-whisper.

One thing I am curious about: v2.1 does take the `raw` key name into account.
https://huggingface.co/kotoba-tech/kotoba-whisper-v2.1/blob/main/kotoba_whisper.py#L157

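This may just be my lack of knowledge, but is there a stream-processing ecosystem better than ffmpeg_microphone_live that uses the `array` key, and is that why `raw` was dropped? Even if so, as long as this implementation is unlikely to cause problems in other cases, I think there is still demand for keeping it.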


kotoba_whisper.py +2 -2


@@ -155,14 +155,14 @@ class KotobaWhisperPipeline(AutomaticSpeechRecognitionPipeline):
             inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
         if isinstance(inputs, dict):
             # Accepting `"array"` which is the key defined in `datasets` for better integration
-            if not ("sampling_rate" in inputs and "array" in inputs):
+            if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)):
                 raise ValueError(
                     "When passing a dictionary to AutomaticSpeechRecognitionPipeline, the dict needs to contain a "
                     '"array" key containing the numpy array representing the audio and a "sampling_rate" key, '
                     "containing the sampling_rate associated with that array"
                 )
             in_sampling_rate = inputs.pop("sampling_rate")
-            inputs = inputs.pop("array", None)
+            inputs = inputs.pop("array", inputs.pop("raw", None))
             if in_sampling_rate != self.feature_extractor.sampling_rate:
                 if is_torchaudio_available():
                     from torchaudio import functional as F
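
The behavior of the two changed lines can be illustrated in isolation. This is a minimal sketch rather than the pipeline code itself: `pop_audio` is a hypothetical standalone function that mirrors only the key handling above, with the nested `pop` written out in two steps.

```python
from typing import Any, Dict, Tuple


def pop_audio(inputs: Dict[str, Any]) -> Tuple[Any, int]:
    """Remove and return (audio, sampling_rate) from an input dict.

    Hypothetical helper mirroring the patched check: the audio may be stored
    under "array" (the `datasets` key) or under "raw".
    """
    if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)):
        raise ValueError(
            'expected a "sampling_rate" key and either an "array" or a "raw" key'
        )
    sampling_rate = inputs.pop("sampling_rate")
    # As in the nested pop of the patch, "raw" is always removed from the dict,
    # and "array" takes precedence when both keys are present.
    raw = inputs.pop("raw", None)
    audio = inputs.pop("array", raw)
    return audio, sampling_rate
```

Any keys left in the dict afterwards (for example `stride`) are untouched, so later steps can still read them.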