# WARN: WIP (code is ugly, documentation is bad, there may be bugs, test files aren't included, CPU inference was barely tested, etc.)
# Intro
:peach:`speaches` is a web server that supports real-time transcription using WebSockets.
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper) is used as the backend. Both GPU and CPU inference are supported.
- The LocalAgreement2 algorithm ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) is used for real-time transcription (a minimal sketch of the idea follows this list).
- Can be deployed using Docker (Compose configuration can be found in [compose.yaml](./compose.yaml)).
- All configuration is done through environment variables. See [config.py](./speaches/config.py).
- NOTE: only transcription of single-channel, 16000 Hz, raw, 16-bit little-endian (s16le) audio is supported.
- NOTE: this isn't really meant to be used as a standalone tool but rather to add transcription features to other applications.
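A minimal sketch of the LocalAgreement2 idea in Python (a simplification for illustration, not the implementation used by `speaches`): a word is committed only once two consecutive transcription hypotheses agree on it.
```python
# Simplified sketch of the LocalAgreement2 idea, NOT the `speaches` implementation:
# commit only the words that two consecutive hypotheses agree on.
def agreed_prefix(previous: list[str], current: list[str]) -> list[str]:
    """Return the longest common word prefix of two hypotheses."""
    agreed = []
    for prev_word, cur_word in zip(previous, current):
        if prev_word != cur_word:
            break
        agreed.append(prev_word)
    return agreed


# Only "one two" is stable across both hypotheses, so only it gets committed.
print(agreed_prefix("one two three".split(), "one two free four".split()))
# ['one', 'two']
```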
Please create an issue if you find a bug, have a question, or want to suggest a feature.
# Quick Start
Spinning up a `speaches` web server.
```bash
docker run --gpus=all --publish 8000:8000 --mount type=bind,source=$HOME/.cache/huggingface,target=/root/.cache/huggingface fedirz/speaches:cuda
# or
docker run --publish 8000:8000 --mount type=bind,source=$HOME/.cache/huggingface,target=/root/.cache/huggingface fedirz/speaches:cpu
```
Streaming audio data from a microphone. [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
```bash
ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://0.0.0.0:8000/v1/audio/transcriptions
# or
arecord -f S16_LE -c1 -r 16000 -t raw -D default 2>/dev/null | websocat --binary ws://0.0.0.0:8000/v1/audio/transcriptions
```
Streaming audio data from a file.
```bash
# first, record some raw audio from the microphone into a file
ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - > output.raw
# send all data at once
cat output.raw | websocat --no-close --binary ws://0.0.0.0:8000/v1/audio/transcriptions
# Output: {"text":"One,"}{"text":"One,  two,  three,  four,  five."}{"text":"One,  two,  three,  four,  five."}%
# streaming 16000 samples per second. each sample is 2 bytes
cat output.raw | pv -qL 32000 | websocat --no-close --binary ws://0.0.0.0:8000/v1/audio/transcriptions
# Output: {"text":"One,"}{"text":"One,  two,"}{"text":"One,  two,  three,"}{"text":"One,  two,  three,  four,  five."}{"text":"One,  two,  three,  four,  five.  one."}%
```
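The same streaming can be done programmatically. Below is a hypothetical Python client, analogous to the `pv`-throttled pipeline above; it assumes the third-party `websockets` package (`pip install websockets`), and the chunk size and pacing are illustrative.
```python
# Hypothetical Python equivalent of the pv-throttled pipeline above.
# Assumes `pip install websockets`; chunk size and pacing are illustrative.
import asyncio

import websockets

URL = "ws://0.0.0.0:8000/v1/audio/transcriptions"
BYTES_PER_SECOND = 32000  # 16000 samples/s * 2 bytes/sample
CHUNK_SIZE = 3200  # 0.1 seconds of audio


async def main() -> None:
    async with websockets.connect(URL) as ws:

        async def receive() -> None:
            # print transcription messages as they arrive, e.g. {"text":"One,"}
            async for message in ws:
                print(message)

        receiver = asyncio.create_task(receive())
        with open("output.raw", "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                await ws.send(chunk)
                # pace the stream at real-time speed
                await asyncio.sleep(CHUNK_SIZE / BYTES_PER_SECOND)
        # give the server a moment to flush the final transcription
        # (the websocat examples above pass --no-close for the same reason)
        await asyncio.sleep(1)
        await ws.close()
        await receiver


asyncio.run(main())
```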
Transcribing a file.
```bash
# convert the file if it has a different format
ffmpeg -i output.wav -ac 1 -ar 16000 -f s16le output.raw
curl -X POST -F "file=@output.raw" http://0.0.0.0:8000/v1/audio/transcriptions
# Output: "{\"text\":\"One,  two,  three,  four,  five.\"}"
```
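The same request from Python, as a sketch assuming the third-party `requests` package; the multipart field name mirrors the `-F` flag above.
```python
# Hypothetical Python equivalent of the curl command above.
# Assumes `pip install requests`; the "file" field name mirrors the -F flag.
import requests

with open("output.raw", "rb") as f:
    response = requests.post(
        "http://0.0.0.0:8000/v1/audio/transcriptions",
        files={"file": f},
    )
print(response.text)  # e.g. {"text":"One,  two,  three,  four,  five."}
```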
# Roadmap
- [ ] Support file transcription (non-streaming) of multiple formats.
- [ ] CLI client.
- [ ] Separate the web server related code from the "core", and publish "core" as a package.
- [ ] Additional documentation and code comments.
- [ ] Write benchmarks for measuring streaming transcription performance. Possible metrics:
    - Latency (time between when audio is received and when its transcription is sent)
    - Accuracy (already being measured when testing but the process can be improved)
    - Total seconds of audio transcribed / audio duration (since each audio chunk is being processed at least twice)
- [ ] Get the API response closer to the format used by OpenAI.
- [ ] Integrations...