ASR (one language)
#54 · by Mylunaire · opened
I have used the model to transcribe audio in English. It makes a lot of mistakes, so I'm wondering whether my usage is correct. I suspect that I am doing an English-to-English translation, which leads to rephrasing rather than a simple transcription.
Below is the relevant part of my code.
from scipy.io.wavfile import read  # read() returns (sample_rate, samples)
import numpy

a = read(audio_file_path)
arr = numpy.array(a[1], dtype=float)  # a[1] is the raw sample array
audio_inputs = processor(audios=arr, return_tensors="pt").to(device)
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
transcription_chunk = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
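
For reference, here is a fuller, self-contained sketch of what I am running. The facebook/hf-seamless-m4t-medium checkpoint, reading the WAV with scipy, and the explicit sampling_rate argument are my own assumptions, not something taken from the docs verbatim.

import torch
import numpy
from scipy.io.wavfile import read
from transformers import AutoProcessor, SeamlessM4TModel

audio_file_path = "sample.wav"  # placeholder path, assumed mono WAV
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium").to(device)

# read() returns (sample_rate, samples)
sample_rate, samples = read(audio_file_path)
arr = numpy.array(samples, dtype=float)

# Passing sampling_rate lets the processor catch a mismatch with the rate the
# model expects instead of silently assuming it.
audio_inputs = processor(audios=arr, sampling_rate=sample_rate, return_tensors="pt").to(device)

# generate_speech=False restricts generation to text, so with tgt_lang="eng" on
# English audio this should be speech-to-text rather than translation.
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
transcription = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(transcription)

The only intentional differences from the snippet above are the explicit sampling_rate and the tuple unpacking of read().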
Thanks a lot!
Mylunaire changed the discussion title from "ASR in one language" to "ASR (one language)".