Inquiry on Upcoming Language Support and Fine-Tuning Feasibility
Hi LiveKit team! 👋
First, thank you for your incredible work on the livekit/turn-detector model and the open-source ecosystem around it. The end-of-turn detection capabilities have been a game-changer for our conversational AI projects, especially with the improved accuracy over traditional VAD methods.
I wanted to ask about your plans for expanding language support. I recall seeing a post on X.com suggesting that multilingual support is in the pipeline for the near future. Could you share any updates on this? Many communities would greatly benefit from non-English implementations, and we'd love to know the expected timeline or which languages are being prioritized.
Additionally, if broader language support isn’t imminent, is it feasible to fine-tune the current model on a custom language corpus?
For instance:
- **Training Requirements:** What dataset format and size do you recommend (e.g., conversational transcripts with turn boundaries labeled)? See the sketch after this list for the kind of corpus we have in mind.
- **Annotation Guidelines:** Are specific metadata or annotations (e.g., silence duration, speaker roles) needed for training?
- **Architecture Constraints:** Does the ONNX-based inference setup allow for fine-tuning, or would adjustments to the model architecture be necessary? (See the re-export sketch further below.)
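To make the first two questions concrete, here is the kind of labeled corpus we could prepare on our side. This is purely a hypothetical JSONL layout we sketched ourselves; the field names (`context`, `label`, `trailing_silence_ms`) are our own assumptions, not a format documented by the project.

```python
# Hypothetical JSONL schema for labeled turn-boundary data.
# Field names are our own assumptions, not a format documented by the project.
import json

examples = [
    {
        # Running chat context, most recent utterance last.
        "context": [
            {"role": "user", "text": "can you book me a table for"},
        ],
        "label": 0,                  # 0 = turn not finished (speaker will continue)
        "trailing_silence_ms": 300,  # optional: silence observed after the utterance
    },
    {
        "context": [
            {"role": "user", "text": "can you book me a table for two at seven"},
        ],
        "label": 1,                  # 1 = end of turn, agent may respond
        "trailing_silence_ms": 900,
    },
]

# Write a small sample file in the proposed format.
with open("turn_boundaries.sample.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

If the model actually expects a different schema (e.g., raw chat transcripts with an end-of-utterance marker), we're happy to adapt our annotation to match.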
We’re prepared to collaborate on preparing a training set for our target language and would appreciate guidance on best practices.
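On the third question, our working assumption (please correct us if it's off) is that the ONNX artifact is inference-only, so fine-tuning would happen on the original PyTorch/Transformers checkpoint and the result would then be re-exported to ONNX. Below is a minimal sketch of the re-export step we'd expect to run, assuming a standard Hugging Face classification checkpoint and the `optimum` library; the checkpoint name and output paths are placeholders, and the actual turn-detector architecture may differ.

```python
# Hypothetical re-export step after fine-tuning -- assumes the fine-tuned
# checkpoint is a standard Hugging Face sequence-classification model,
# which may not match the actual turn-detector architecture.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

checkpoint = "our-org/turn-detector-xx-finetuned"  # placeholder fine-tuned checkpoint

# export=True converts the PyTorch weights to an ONNX graph at load time.
ort_model = ORTModelForSequenceClassification.from_pretrained(checkpoint, export=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Save the ONNX graph and tokenizer so they could be dropped into an
# ONNX-based inference setup (the directory layout here is our guess,
# not the plugin's actual contract).
ort_model.save_pretrained("turn-detector-xx-onnx")
tokenizer.save_pretrained("turn-detector-xx-onnx")
```

If fine-tuning is instead expected to happen against the ONNX graph itself, or if the upstream training code isn't public, any pointers on the intended workflow would help us a lot.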
Thanks again for your transparency and dedication to advancing real-time communication tools! Looking forward to your insights.