# Whisper-WebUI
A Gradio-based browser interface for [Whisper](https://github.com/openai/whisper). You can use it as an Easy Subtitle Generator!
![Whisper WebUI](https://github.com/jhj0517/Whsiper-WebUI/blob/master/screenshot.png)
## Notebook
To try this on Colab, open [this notebook](https://colab.research.google.com/github/jhj0517/Whisper-WebUI/blob/master/notebook/whisper-webui.ipynb).
# Features
- Choose which Whisper implementation to use:
  - [openai/whisper](https://github.com/openai/whisper)
  - [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper) (used by default)
  - [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper)
- Generate subtitles from various sources, including:
  - Files
  - YouTube
  - Microphone
- Currently supported subtitle formats:
  - SRT
  - WebVTT
  - txt (plain text without timestamps)
- Speech-to-text translation
  - From other languages to English. (This is Whisper's end-to-end speech-to-text translation feature.)
- Text-to-text translation
  - Translate subtitle files using Facebook NLLB models
  - Translate subtitle files using the DeepL API
- Speaker diarization with the [pyannote](https://huggingface.co/pyannote/speaker-diarization-3.1) model as a post-processing step.
  - To download the model, you need a Hugging Face token and must manually visit the pages listed below to accept their conditions.
    1. https://huggingface.co/pyannote/speaker-diarization-3.1
    2. https://huggingface.co/pyannote/segmentation-3.0
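To illustrate how the two timed subtitle formats listed above differ, here is a minimal Python sketch. It is not this project's actual writer: the `segments` structure and the helper names are hypothetical. Only the timestamp conventions come from the formats themselves (numbered cues and a comma before the milliseconds in SRT, a `WEBVTT` header and a period in WebVTT).

```python
def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Render seconds as HH:MM:SS<marker>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments) -> str:
    """SRT: numbered cues, comma before the milliseconds."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        span = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        cues.append(f"{index}\n{span}\n{text}\n")
    return "\n".join(cues)


def to_vtt(segments) -> str:
    """WebVTT: 'WEBVTT' header line, period before the milliseconds."""
    cues = ["WEBVTT\n"]
    for start, end, text in segments:
        span = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        cues.append(f"{span}\n{text}\n")
    return "\n".join(cues)


# Hypothetical transcription result: (start seconds, end seconds, text).
segments = [(0.0, 2.5, "Hello there."), (2.5, 5.04, "Welcome back.")]
print(to_srt(segments))
print(to_vtt(segments))
```

The plain `txt` output would simply join the `text` fields and drop the timestamps.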
# Installation and Running
### Prerequisites
To run this WebUI, you need `git`, `python` version 3.8 to 3.10, `FFmpeg`, and, if you use an NVIDIA GPU, `CUDA` version above 12.0.
Please follow the links below to install the necessary software:
- git : [https://git-scm.com/downloads](https://git-scm.com/downloads)
- python : [https://www.python.org/downloads/](https://www.python.org/downloads/) **(If your Python version is too new, torch will not install properly.)**
- FFmpeg : [https://ffmpeg.org/download.html](https://ffmpeg.org/download.html)
- CUDA : [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)
After installing FFmpeg, **make sure to add the `FFmpeg/bin` folder to your system PATH!**
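Before installing, it can help to confirm the prerequisites above. The following is a rough Python sketch rather than part of this project: the function names are hypothetical, the version bounds are taken from the prerequisite list, and CUDA is not checked at all.

```python
import shutil
import sys

# Supported range taken from the prerequisites above: Python 3.8 to 3.10.
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 10)


def python_version_supported(version=sys.version_info) -> bool:
    """True when the major.minor version is within the supported range."""
    return MIN_PYTHON <= tuple(version[:2]) <= MAX_PYTHON


def missing_tools(tools=("git", "ffmpeg")) -> list:
    """Return the tools that cannot be found on the system PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_version_supported():
        print(f"Unsupported Python version: {sys.version.split()[0]}")
    for tool in missing_tools():
        print(f"Not found on PATH: {tool}")
```

If `ffmpeg` is reported as missing even though it is installed, the `FFmpeg/bin` folder is probably not on your PATH yet.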
### Automatic Installation
1. Download the `Whisper-WebUI.zip` file for your OS from [v1.0.0](https://github.com/jhj0517/Whisper-WebUI/releases/tag/v1.0.0) and extract its contents.
2. Run `install.bat` or `install.sh` to install dependencies. (This will create a `venv` directory and install dependencies there.)
3. Start WebUI with `start-webui.bat` or `start-webui.sh`
4. To update the WebUI, run `update.bat` or `update.sh`
You can also pass command-line arguments when running `start-webui.bat`; see the [wiki](https://github.com/jhj0517/Whisper-WebUI/wiki/Command-Line-Arguments) for a guide to the arguments.
## Running with Docker
1. Build the image
```sh
docker build -t whisper-webui:latest .
```
2. Run the container with one of the following commands
- For bash:
```sh
docker run --gpus all -d \
-v /path/to/models:/Whisper-WebUI/models \
-v /path/to/outputs:/Whisper-WebUI/outputs \
-p 7860:7860 \
-it \
whisper-webui:latest --server_name 0.0.0.0 --server_port 7860
```
- For PowerShell:
```shell
docker run --gpus all -d `
-v /path/to/models:/Whisper-WebUI/models `
-v /path/to/outputs:/Whisper-WebUI/outputs `
-p 7860:7860 `
-it `
whisper-webui:latest --server_name 0.0.0.0 --server_port 7860
```
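If you prefer to launch the container from a script, the command above can be assembled as an argument list. This is only a sketch: `docker_run_args` is a hypothetical helper, and the flags are copied from the commands shown above. Host directories are resolved to absolute paths so that the bind mounts do not depend on the current working directory.

```python
import shlex
from pathlib import Path


def docker_run_args(models_dir, outputs_dir, port=7860, image="whisper-webui:latest"):
    """Build the argument list for the `docker run` command shown above."""
    # Use absolute host paths so the mounts are independent of the working directory.
    models = Path(models_dir).expanduser().resolve()
    outputs = Path(outputs_dir).expanduser().resolve()
    return [
        "docker", "run", "--gpus", "all", "-d",
        "-v", f"{models}:/Whisper-WebUI/models",
        "-v", f"{outputs}:/Whisper-WebUI/outputs",
        "-p", f"{port}:{port}",
        "-it",
        image, "--server_name", "0.0.0.0", "--server_port", str(port),
    ]


args = docker_run_args("./models", "./outputs")
print(shlex.join(args))  # inspect the command before running it
# To launch the container: subprocess.run(args, check=True)
```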
# VRAM Usage
This project uses [faster-whisper](https://github.com/guillaumekln/faster-whisper) by default for lower VRAM usage and faster transcription.
According to faster-whisper, the optimized implementation compares with the original as follows:
| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
|-------------------|-----------|-----------|-------|-----------------|-----------------|
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
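To put the table's figures in relative terms, the short calculation below derives the speedup and memory savings. It uses only the numbers in the table; the variable names are illustrative.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds


# Figures copied from the table above.
baseline = {"time_s": to_seconds(4, 30), "gpu_mb": 11325, "cpu_mb": 9439}   # openai/whisper
optimized = {"time_s": to_seconds(0, 54), "gpu_mb": 4755, "cpu_mb": 3244}   # faster-whisper

speedup = baseline["time_s"] / optimized["time_s"]
gpu_saving = 1 - optimized["gpu_mb"] / baseline["gpu_mb"]
cpu_saving = 1 - optimized["cpu_mb"] / baseline["cpu_mb"]

print(f"Speedup: {speedup:.1f}x")             # 270 s / 54 s = 5.0x
print(f"GPU memory saved: {gpu_saving:.0%}")  # about 58%
print(f"CPU memory saved: {cpu_saving:.0%}")  # about 66%
```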
To use an implementation other than faster-whisper, pass the `--whisper_type` argument with the repository name.<br>
See the [wiki](https://github.com/jhj0517/Whisper-WebUI/wiki/Command-Line-Arguments) for more information about CLI arguments.
## Available models
The following table from the original Whisper repository lists the VRAM required by each model.
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~32x |
| base | 74 M | `base.en` | `base` | ~1 GB | ~16x |
| small | 244 M | `small.en` | `small` | ~2 GB | ~6x |
| medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |
| large | 1550 M | N/A | `large` | ~10 GB | 1x |
`.en` models support English only. With the `large` models, you can also use the `Translate to English` option.
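As a rough guide to choosing a model from the table above, the sketch below picks the largest model whose listed VRAM fits a given budget. `pick_model` is a hypothetical helper, and the figures are the approximate values for the original Whisper implementation, so actual usage with another implementation may differ.

```python
# (size, approximate required VRAM in GB, has an English-only variant)
MODELS = [
    ("tiny", 1, True),
    ("base", 1, True),
    ("small", 2, True),
    ("medium", 5, True),
    ("large", 10, False),
]


def pick_model(vram_gb: float, english_only: bool = False):
    """Return the largest model whose listed VRAM fits, or None if none fits."""
    best = None
    for size, required_gb, has_en in MODELS:  # ordered from smallest to largest
        if required_gb > vram_gb:
            continue
        if english_only and not has_en:
            continue
        best = f"{size}.en" if english_only else size
    return best


print(pick_model(8))                      # medium
print(pick_model(12))                     # large
print(pick_model(12, english_only=True))  # medium.en
```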
## TODO🗓
- [x] Add DeepL API translation
- [x] Add NLLB Model translation
- [x] Integrate with faster-whisper
- [x] Integrate with insanely-fast-whisper
- [x] Integrate with whisperX (speaker diarization part only)
- [ ] Add FastAPI script