Fedir Zadniprovskyi committed
Commit f043430 · 1 Parent(s): ce5dbe5

docs: update README.md

Files changed (3)
  1. README.md +64 -36
  2. audio.wav +0 -0
  3. faster_whisper_server/main.py +0 -2
README.md CHANGED
@@ -1,13 +1,23 @@
- ## Faster Whisper Server
- `faster-whisper-server` is a web server that supports real-time transcription using WebSockets.
- - [faster-whisper](https://github.com/SYSTRAN/faster-whisper) is used as the backend. Both GPU and CPU inference are supported.
- - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for real-time transcription.
- - Can be deployed using Docker (Compose configuration can be found in [compose.yaml](./compose.yaml)).
- - All configuration is done through environment variables. See [config.py](./faster_whisper_server/config.py).
- - NOTE: only transcription of single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
- - NOTE: this isn't really meant to be used as a standalone tool but rather to add transcription features to other applications.
  Please create an issue if you find a bug, have a question, or a feature suggestion.
- # Quick Start
  Using Docker
  ```bash
  docker run --gpus=all --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:cuda
@@ -17,42 +27,60 @@ docker run --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:cpu
  Using Docker Compose
  ```bash
  curl -sO https://raw.githubusercontent.com/fedirz/faster-whisper-server/master/compose.yaml
- docker compose up --detach up faster-whisper-server-cuda
  # or
- docker compose up --detach up faster-whisper-server-cpu
  ```
  ## Usage
- Streaming audio data from a microphone. [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
  ```bash
- ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://0.0.0.0:8000/v1/audio/transcriptions
- # or
- arecord -f S16_LE -c1 -r 16000 -t raw -D default 2>/dev/null | websocat --binary ws://0.0.0.0:8000/v1/audio/transcriptions
  ```
  Streaming audio data from a file.
  ```bash
- ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - > output.raw
  # send all data at once
- cat output.raw | websocat --no-close --binary ws://0.0.0.0:8000/v1/audio/transcriptions
  # Output: {"text":"One,"}{"text":"One, two, three, four, five."}{"text":"One, two, three, four, five."}%
  # streaming 16000 samples per second. each sample is 2 bytes
- cat output.raw | pv -qL 32000 | websocat --no-close --binary ws://0.0.0.0:8000/v1/audio/transcriptions
  # Output: {"text":"One,"}{"text":"One, two,"}{"text":"One, two, three,"}{"text":"One, two, three, four, five."}{"text":"One, two, three, four, five. one."}%
  ```
- Transcribing a file
- ```bash
- # convert the file if it has a different format
- ffmpeg -i output.wav -ac 1 -ar 16000 -f s16le output.raw
- curl -X POST -F "file=@output.raw" http://0.0.0.0:8000/v1/audio/transcriptions
- # Output: "{\"text\":\"One, two, three, four, five.\"}"%
- ```
- ## Roadmap
- - [ ] Support file transcription (non-streaming) of multiple formats.
- - [ ] CLI client.
- - [ ] Separate the web server related code from the "core", and publish "core" as a package.
- - [ ] Additional documentation and code comments.
- - [ ] Write benchmarks for measuring streaming transcription performance. Possible metrics:
-   - Latency (time when transcription is sent - time when audio has been received)
-   - Accuracy (already being measured when testing but the process can be improved)
-   - Total seconds of audio transcribed / audio duration (since each audio chunk is being processed at least twice)
- - [ ] Get the API response closer to the format used by OpenAI.
- - [ ] Integrations...

+ # Faster Whisper Server
+ `faster-whisper-server` is an OpenAI API compatible transcription server which uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as its backend.
+ Features:
+ - GPU and CPU support.
+ - Easily deployable using Docker.
+ - Configurable through environment variables (see [config.py](./faster_whisper_server/config.py) and the sketch below).
+ - OpenAI API compatible.
+
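A minimal sketch of configuring the server through environment variables when running the Docker image. The `WHISPER__MODEL` variable name is an assumption made for this example; check [config.py](./faster_whisper_server/config.py) for the actual setting names.
```bash
# Hypothetical example: override the default Whisper model at container start.
# NOTE: WHISPER__MODEL is an assumed name; the real variable names live in config.py.
docker run --gpus=all --publish 8000:8000 \
  --volume ~/.cache/huggingface:/root/.cache/huggingface \
  --env WHISPER__MODEL=distil-large-v3 \
  fedirz/faster-whisper-server:cuda
```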
  Please create an issue if you find a bug, have a question, or a feature suggestion.
+
+ ## OpenAI API Compatibility ++
+ See [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio) for more information.
+ - Audio file transcription via the `POST /v1/audio/transcriptions` endpoint.
+   - Unlike OpenAI's API, `faster-whisper-server` also supports streaming transcriptions (and translations). This is useful when you want to process large audio files and would rather receive the transcription in chunks as they are processed, rather than wait for the whole file to be transcribed. It works similarly to how chat messages are streamed when chatting with LLMs (see the sketch after this list).
+ - Audio file translation via the `POST /v1/audio/translations` endpoint.
+ - (WIP) Live audio transcription via the `WS /v1/audio/transcriptions` endpoint.
+   - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
+   - Only transcription of single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
+
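A rough illustration of the streaming behaviour described above, using the `streaming=true` form field that also appears in the CURL section below (the exact shape of the streamed chunks may differ from what this sketch assumes):
```bash
# Sketch: request a streamed transcription and let curl print chunks as they
# arrive; --no-buffer disables curl's output buffering.
curl --no-buffer http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" -F "streaming=true"
```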
+ ## Quick Start
  Using Docker
  ```bash
  docker run --gpus=all --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:cuda

  Using Docker Compose
  ```bash
  curl -sO https://raw.githubusercontent.com/fedirz/faster-whisper-server/master/compose.yaml
+ docker compose up --detach faster-whisper-server-cuda
  # or
+ docker compose up --detach faster-whisper-server-cpu
  ```
  ## Usage
+ ### OpenAI API CLI
  ```bash
+ export OPENAI_API_KEY="cant-be-empty"
+ export OPENAI_BASE_URL=http://localhost:8000/v1/
+ ```
+ ```bash
+ openai api audio.transcriptions.create -m distil-medium.en -f audio.wav --response-format text
+
+ openai api audio.translations.create -m distil-medium.en -f audio.wav --response-format verbose_json
+ ```
+ ### OpenAI API Python SDK
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(api_key="cant-be-empty", base_url="http://localhost:8000/v1/")
+
+ audio_file = open("audio.wav", "rb")
+ transcript = client.audio.transcriptions.create(
+     model="distil-medium.en", file=audio_file
+ )
+ print(transcript.text)
+ ```
+
+ ### CURL
+ ```bash
+ # If `model` isn't specified, the default model is used
+ curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
+ curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
+ curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "streaming=true"
+ curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "streaming=true" -F "model=distil-large-v3"
+ # It's recommended that you always specify the language as that will reduce the transcription time
+ curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "streaming=true" -F "model=distil-large-v3" -F "language=en"
+
+ curl http://localhost:8000/v1/audio/translations -F "file=@audio.wav"
+ ```
+
+ ### Live Transcription
+ [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
+ Live transcribing audio data from a microphone.
+ ```bash
+ ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
  ```
  Streaming audio data from a file.
  ```bash
+ ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - > audio.raw
  # send all data at once
+ cat audio.raw | websocat --no-close --binary ws://localhost:8000/v1/audio/transcriptions
  # Output: {"text":"One,"}{"text":"One, two, three, four, five."}{"text":"One, two, three, four, five."}%
  # streaming 16000 samples per second. each sample is 2 bytes
+ cat audio.raw | pv -qL 32000 | websocat --no-close --binary ws://localhost:8000/v1/audio/transcriptions
  # Output: {"text":"One,"}{"text":"One, two,"}{"text":"One, two, three,"}{"text":"One, two, three, four, five."}{"text":"One, two, three, four, five. one."}%
  ```
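If the audio you want to stream is in another format, a conversion along the lines of the example from the previous version of this README should produce the raw, single channel, 16 kHz, 16-bit little-endian stream the endpoint expects (file names here are just placeholders):
```bash
# Convert an existing file (e.g. audio.wav) to raw, single-channel, 16 kHz,
# 16-bit little-endian audio before piping it to the WebSocket endpoint.
ffmpeg -i audio.wav -ac 1 -ar 16000 -f s16le audio.raw
```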
audio.wav ADDED
Binary file (209 kB).
 
faster_whisper_server/main.py CHANGED
@@ -237,7 +237,6 @@ async def transcribe_stream(
      ws: WebSocket,
      model: Annotated[Model, Query()] = config.whisper.model,
      language: Annotated[Language | None, Query()] = config.default_language,
-     prompt: Annotated[str | None, Query()] = None,
      response_format: Annotated[
          ResponseFormat, Query()
      ] = config.default_response_format,
@@ -246,7 +245,6 @@ async def transcribe_stream(
      await ws.accept()
      transcribe_opts = {
          "language": language,
-         "initial_prompt": prompt,
          "temperature": temperature,
          "vad_filter": True,
          "condition_on_previous_text": False,
 
      ws: WebSocket,
      model: Annotated[Model, Query()] = config.whisper.model,
      language: Annotated[Language | None, Query()] = config.default_language,
      response_format: Annotated[
          ResponseFormat, Query()
      ] = config.default_response_format,

      await ws.accept()
      transcribe_opts = {
          "language": language,
          "temperature": temperature,
          "vad_filter": True,
          "condition_on_previous_text": False,