"...git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "8a861048dd5da11f2a82632333601b7bd42a71b8"
Commit c2eff804 authored by Sugon_ldc's avatar Sugon_ldc
Browse files

add new project

parents
Pipeline #1648 failed with stages
in 0 seconds
# Contributing to faster-whisper
Contributions are welcome! Here are some pointers to help you install the library for development and validate your changes before submitting a pull request.
## Install the library for development
We recommend installing the module in editable mode with the `dev` extra requirements:
```bash
git clone https://github.com/SYSTRAN/faster-whisper.git
cd faster-whisper/
pip install -e .[dev]
```
## Validate the changes before creating a pull request
1. Make sure the existing tests are still passing (and consider adding new tests as well!):
```bash
pytest tests/
```
2. Reformat and validate the code with the following tools:
```bash
black .
isort .
flake8 .
```
These steps are also run automatically in the CI when you open the pull request.
MIT License
Copyright (c) 2023 SYSTRAN
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
include faster_whisper/assets/silero_vad.onnx
include requirements.txt
include requirements.conversion.txt
include faster_whisper/assets/pyannote_vad_model.bin
[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
# Faster Whisper transcription with CTranslate2
**faster-whisper** is a reimplementation of OpenAI's Whisper model using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.
This implementation is up to 4 times faster than [openai/whisper](https://github.com/openai/whisper) for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
## Benchmark
### Whisper
For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:
* [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
* [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[cce6b53e](https://github.com/SYSTRAN/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
### Large-v2 model on GPU
| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
| --- | --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |
*Executed with CUDA 11.7.1 on a NVIDIA Tesla V100S.*
### Small model on CPU
| Implementation | Precision | Beam size | Time | Max. memory |
| --- | --- | --- | --- | --- |
| openai/whisper | fp32 | 5 | 10m31s | 3101MB |
| whisper.cpp | fp32 | 5 | 17m42s | 1581MB |
| whisper.cpp | fp16 | 5 | 12m39s | 873MB |
| faster-whisper | fp32 | 5 | 2m44s | 1675MB |
| faster-whisper | int8 | 5 | 2m04s | 995MB |
*Executed with 8 threads on a Intel(R) Xeon(R) Gold 6226R.*
### Distil-whisper
| Implementation | Precision | Beam size | Time | Gigaspeech WER |
| --- | --- | --- | --- | --- |
| distil-whisper/distil-large-v2 | fp16 | 4 |- | 10.36 |
| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
*Executed with CUDA 11.4 on a NVIDIA 3090.*
<details>
<summary>testing details (click to expand)</summary>
For `distil-whisper/distil-large-v2`, the WER is tested with code sample from [link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). for `faster-distil-whisper`, the WER is tested with setting:
```python
from faster_whisper import WhisperModel
model_size = "distil-large-v2"
# model_size = "distil-medium.en"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
```
</details>
## Requirements
* Python 3.8 or greater
### GPU
GPU execution requires the following NVIDIA libraries to be installed:
* [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
* [cuDNN 8 for CUDA 12](https://developer.nvidia.com/cudnn)
**Note**: Latest versions of `ctranslate2` support CUDA 12 only. For CUDA 11, the current workaround is downgrading to the `3.24.0` version of `ctranslate2` (This can be done with `pip install --force-reinstall ctranslate2==3.24.0` or specifying the version in a `requirements.txt`).
There are multiple ways to install the NVIDIA libraries mentioned above. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.
<details>
<summary>Other installation methods (click to expand)</summary>
**Note:** For all these methods below, keep in mind the above note regarding CUDA versions. Depending on your setup, you may need to install the _CUDA 11_ versions of libraries that correspond to the CUDA 12 libraries listed in the instructions below.
#### Use Docker
The libraries (cuBLAS, cuDNN) are installed in these official NVIDIA CUDA Docker images: `nvidia/cuda:12.0.0-runtime-ubuntu20.04` or `nvidia/cuda:12.0.0-runtime-ubuntu22.04`.
#### Install with `pip` (Linux only)
On Linux these libraries can be installed with `pip`. Note that `LD_LIBRARY_PATH` must be set before launching Python.
```bash
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```
**Note**: Version 9+ of `nvidia-cudnn-cu12` appears to cause issues due its reliance on cuDNN 9 (Faster-Whisper does not currently support cuDNN 9). Ensure your version of the Python package is for cuDNN 8.
#### Download the libraries from Purfview's repository (Windows & Linux)
Purfview's [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides the required NVIDIA libraries for Windows & Linux in a [single archive](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Decompress the archive and place the libraries in a directory included in the `PATH`.
</details>
## Installation
The module can be installed from [PyPI](https://pypi.org/project/faster-whisper/):
```bash
pip install faster-whisper
```
<details>
<summary>Other installation methods (click to expand)</summary>
### Install the master branch
```bash
pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
```
### Install a specific commit
```bash
pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
```
</details>
## Usage
### Faster-whisper
```python
from faster_whisper import WhisperModel
model_size = "large-v3"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
**Warning:** `segments` is a *generator* so the transcription only starts when you iterate over it. The transcription can be run to completion by gathering the segments in a list or a `for` loop:
```python
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
```
### multi-segment language detection
To directly use the model for improved language detection, the following code snippet can be used:
```python
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
```
### Batched faster-whisper
The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX) licensed under the BSD-2 Clause license and integrates its VAD model to this library. We modify this implementation and also replaced the feature extraction with a faster torch-based implementation. Batched version improves the speed upto 10-12x compared to openAI implementation and 3-4x compared to the sequential faster_whisper version. It works by transcribing semantically meaningful audio chunks as batches leading to faster inference.
The following code snippet illustrates how to run inference with batched version on an example audio file. Please also refer to the test scripts of batched faster whisper.
```python
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
### Faster Distil-Whisper
The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet
demonstrates how to run inference with distil-large-v3 on a specified audio file:
```python
from faster_whisper import WhisperModel
model_size = "distil-large-v3"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
For more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).
### Word-level timestamps
```python
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
for word in segment.words:
print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```
### VAD filter
The library integrates the [Silero VAD](https://github.com/snakers4/silero-vad) model to filter out parts of the audio without speech:
```python
segments, _ = model.transcribe("audio.mp3", vad_filter=True)
```
The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
```python
segments, _ = model.transcribe(
"audio.mp3",
vad_filter=True,
vad_parameters=dict(min_silence_duration_ms=500),
)
```
### Logging
The library logging level can be configured like this:
```python
import logging
logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
```
### Going further
See more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
## Community integrations
Here is a non exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
* [faster-whisper-server](https://github.com/fedirz/faster-whisper-server) is an OpenAI compatible server using `faster-whisper`. It's easily deployable with Docker, works with OpenAI SDKs/CLI, supports streaming, and live transcription.
* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
* [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) Standalone CLI executables of faster-whisper for Windows, Linux & macOS.
* [asr-sd-pipeline](https://github.com/hedrergudene/asr-sd-pipeline) provides a scalable, modular, end to end multi-speaker speech to text solution implemented using AzureML pipelines.
* [Open-Lyrics](https://github.com/zh-plus/Open-Lyrics) is a Python library that transcribes voice files using faster-whisper, and translates/polishes the resulting text into `.lrc` files in the desired language using OpenAI-GPT.
* [wscribe](https://github.com/geekodour/wscribe) is a flexible transcript generation tool supporting faster-whisper, it can export word level transcript and the exported transcript then can be edited with [wscribe-editor](https://github.com/geekodour/wscribe-editor)
* [aTrain](https://github.com/BANDAS-Center/aTrain) is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz for transcription and diarization in Windows ([Windows Store App](https://apps.microsoft.com/detail/atrain/9N15Q44SZNS2)) and Linux.
* [Whisper-Streaming](https://github.com/ufal/whisper_streaming) implements real-time mode for offline Whisper-like speech-to-text models with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
## Model conversion
When loading a model from its size such as `WhisperModel("large-v3")`, the corresponding CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
We also provide a script to convert any Whisper models compatible with the Transformers library. They could be the original OpenAI models or user fine-tuned models.
For example the command below converts the [original "large-v3" Whisper model](https://huggingface.co/openai/whisper-large-v3) and saves the weights in FP16:
```bash
pip install transformers[torch]>=4.23
ct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2
--copy_files tokenizer.json preprocessor_config.json --quantization float16
```
* The option `--model` accepts a model name on the Hub or a path to a model directory.
* If the option `--copy_files tokenizer.json` is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.
Models can also be converted from the code. See the [conversion API](https://opennmt.net/CTranslate2/python/ctranslate2.converters.TransformersConverter.html).
### Load a converted model
1. Directly load the model from a local directory:
```python
model = faster_whisper.WhisperModel("whisper-large-v3-ct2")
```
2. [Upload your model to the Hugging Face Hub](https://huggingface.co/docs/transformers/model_sharing#upload-with-the-web-interface) and load it from its name:
```python
model = faster_whisper.WhisperModel("username/whisper-large-v3-ct2")
```
## Comparing performance against other implementations
If you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:
* Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, `model.transcribe` uses a default beam size of 1 but here we use a default beam size of 5.
* When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable `OMP_NUM_THREADS`, which can be set when running your script:
```bash
OMP_NUM_THREADS=4 python3 my_script.py
```
import argparse
import time
from typing import Callable
import py3nvml.py3nvml as nvml
from memory_profiler import memory_usage
from utils import MyThread, get_logger, inference
logger = get_logger("faster-whisper")
parser = argparse.ArgumentParser(description="Memory benchmark")
parser.add_argument(
"--gpu_memory", action="store_true", help="Measure GPU memory usage"
)
parser.add_argument("--device-index", type=int, default=0, help="GPU device index")
parser.add_argument(
"--interval",
type=float,
default=0.5,
help="Interval at which measurements are collected",
)
args = parser.parse_args()
device_idx = args.device_index
interval = args.interval
def measure_memory(func: Callable[[], None]):
if args.gpu_memory:
logger.info(
"Measuring maximum GPU memory usage on GPU device."
" Make sure to not have additional processes running on the same GPU."
)
# init nvml
nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(device_idx)
gpu_name = nvml.nvmlDeviceGetName(handle)
gpu_memory_limit = nvml.nvmlDeviceGetMemoryInfo(handle).total >> 20
gpu_power_limit = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
info = {"gpu_memory_usage": [], "gpu_power_usage": []}
def _get_gpu_info():
while True:
info["gpu_memory_usage"].append(
nvml.nvmlDeviceGetMemoryInfo(handle).used >> 20
)
info["gpu_power_usage"].append(
nvml.nvmlDeviceGetPowerUsage(handle) / 1000
)
time.sleep(interval)
if stop:
break
return info
stop = False
thread = MyThread(_get_gpu_info, params=())
thread.start()
func()
stop = True
thread.join()
result = thread.get_result()
# shutdown nvml
nvml.nvmlShutdown()
max_memory_usage = max(result["gpu_memory_usage"])
max_power_usage = max(result["gpu_power_usage"])
print("GPU name: %s" % gpu_name)
print("GPU device index: %s" % device_idx)
print(
"Maximum GPU memory usage: %dMiB / %dMiB (%.2f%%)"
% (
max_memory_usage,
gpu_memory_limit,
(max_memory_usage / gpu_memory_limit) * 100,
)
)
print(
"Maximum GPU power usage: %dW / %dW (%.2f%%)"
% (
max_power_usage,
gpu_power_limit,
(max_power_usage / gpu_power_limit) * 100,
)
)
else:
logger.info("Measuring maximum increase of memory usage.")
max_usage = memory_usage(func, max_usage=True, interval=interval)
print("Maximum increase of RAM memory usage: %d MiB" % max_usage)
if __name__ == "__main__":
measure_memory(inference)
This diff is collapsed.
transformers
jiwer
evaluate
datasets
memory_profiler
py3nvml
import argparse
import timeit
from typing import Callable
from utils import inference
parser = argparse.ArgumentParser(description="Speed benchmark")
parser.add_argument(
"--repeat",
type=int,
default=3,
help="Times an experiment will be run.",
)
args = parser.parse_args()
def measure_speed(func: Callable[[], None]):
# as written in https://docs.python.org/3/library/timeit.html#timeit.Timer.repeat,
# min should be taken rather than the average
runtimes = timeit.repeat(
func,
repeat=args.repeat,
number=10,
)
print(runtimes)
print("Min execution time: %.3fs" % (min(runtimes) / 10.0))
if __name__ == "__main__":
measure_speed(inference)
import logging
from threading import Thread
from typing import Optional
from faster_whisper import WhisperModel
model_path = "large-v3"
model = WhisperModel(model_path, device="cuda")
def inference():
segments, info = model.transcribe("benchmark.m4a", language="fr")
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
def get_logger(name: Optional[str] = None) -> logging.Logger:
formatter = logging.Formatter("%(levelname)s: %(message)s")
logger = logging.getLogger(name)
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger.addHandler(handler)
return logger
class MyThread(Thread):
def __init__(self, func, params):
super(MyThread, self).__init__()
self.func = func
self.params = params
self.result = None
def run(self):
self.result = self.func(*self.params)
def get_result(self):
return self.result
import argparse
import json
import os
from datasets import load_dataset
from evaluate import load
from tqdm import tqdm
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer
from faster_whisper import WhisperModel
parser = argparse.ArgumentParser(description="WER benchmark")
parser.add_argument(
"--audio_numb",
type=int,
default=None,
help="Specify the number of validation audio files in the dataset."
" Set to None to retrieve all audio files.",
)
args = parser.parse_args()
model_path = "large-v3"
model = WhisperModel(model_path, device="cuda")
# load the dataset with streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
# define the evaluation metric
wer_metric = load("wer")
with open(os.path.join(os.path.dirname(__file__), "normalizer.json"), "r") as f:
normalizer = EnglishTextNormalizer(json.load(f))
def inference(batch):
batch["transcription"] = []
for sample in batch["audio"]:
segments, info = model.transcribe(sample["array"], language="en")
batch["transcription"].append("".join([segment.text for segment in segments]))
batch["reference"] = batch["text"]
return batch
dataset = dataset.map(function=inference, batched=True, batch_size=16)
all_transcriptions = []
all_references = []
# iterate over the dataset and run inference
for i, result in tqdm(enumerate(dataset), desc="Evaluating..."):
all_transcriptions.append(result["transcription"])
all_references.append(result["reference"])
if args.audio_numb and i == (args.audio_numb - 1):
break
# normalize predictions and references
all_transcriptions = [normalizer(transcription) for transcription in all_transcriptions]
all_references = [normalizer(reference) for reference in all_references]
# compute the WER metric
wer = 100 * wer_metric.compute(
predictions=all_transcriptions, references=all_references
)
print("WER: %.3f" % wer)
from faster_whisper import WhisperModel
import time
model_size = "small"
path = "/modelpath/models--Systran--faster-whisper-small/"
# Run on GPU with FP16
model = WhisperModel(model_size_or_path=path, device="cuda", local_files_only=True)
# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")
start_time = time.time()
segments, info = model.transcribe("./test1.mp4", beam_size=5, language="zh", vad_filter=True, vad_parameters=dict(min_silence_duration_ms=1000))
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
end_time = time.time()
print('time:',end_time - start_time)
FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
WORKDIR /root
RUN apt-get update -y && apt-get install -y python3-pip
COPY infer.py jfk.flac ./
RUN pip3 install faster-whisper
CMD ["python3", "infer.py"]
from faster_whisper import WhisperModel
jfk_path = "jfk.flac"
model = WhisperModel("tiny", device="cuda")
segments, info = model.transcribe(jfk_path, word_timestamps=True)
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
File added
from faster_whisper.audio import decode_audio
from faster_whisper.transcribe import BatchedInferencePipeline, WhisperModel
from faster_whisper.utils import available_models, download_model, format_timestamp
from faster_whisper.version import __version__
__all__ = [
"available_models",
"decode_audio",
"WhisperModel",
"BatchedInferencePipeline",
"download_model",
"format_timestamp",
"__version__",
]
from typing import BinaryIO, Union
import torch
import torchaudio
def decode_audio(
input_file: Union[str, BinaryIO],
sampling_rate: int = 16000,
split_stereo: bool = False,
):
"""Decodes the audio.
Args:
input_file: Path to the input file or a file-like object.
sampling_rate: Resample the audio to this sample rate.
split_stereo: Return separate left and right channels.
Returns:
A float32 Torch Tensor.
If `split_stereo` is enabled, the function returns a 2-tuple with the
separated left and right channels.
"""
waveform, audio_sf = torchaudio.load(input_file) # waveform: channels X T
if audio_sf != sampling_rate:
waveform = torchaudio.functional.resample(
waveform, orig_freq=audio_sf, new_freq=sampling_rate
)
if split_stereo:
return waveform[0], waveform[1]
return waveform.mean(0)
def pad_or_trim(array, length: int, *, axis: int = -1):
"""
Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
"""
axis = axis % array.ndim
if array.shape[axis] > length:
idx = [Ellipsis] * axis + [slice(length)] + [Ellipsis] * (array.ndim - axis - 1)
return array[idx]
if array.shape[axis] < length:
pad_widths = (
[
0,
]
* array.ndim
* 2
)
pad_widths[2 * axis] = length - array.shape[axis]
array = torch.nn.functional.pad(array, tuple(pad_widths[::-1]))
return array
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment