Unverified Commit 48d0123f authored by Robert Dargavel Smith, committed by GitHub

add AudioDiffusionPipeline and LatentAudioDiffusionPipeline #1334 (#1426)



* add AudioDiffusionPipeline and LatentAudioDiffusionPipeline

* add docs to toc

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* Update pr_tests.yml

Fix tests

add colab notebook

[Flax] Fix loading scheduler from subfolder (#1319)

[FLAX] Fix loading scheduler from subfolder

Fix/Enable all schedulers for in-painting (#1331)

* inpaint fix k lms

* onnx as well

* up

Correct path to scheduler (#1322)

* [Examples] Correct path

* uP

Avoid nested fix-copies (#1332)

* Avoid nested `# Copied from` statements during `make fix-copies`

* style

Fix img2img speed with LMS-Discrete Scheduler (#896)

Casting `self.sigmas` into a different dtype (the one of original_samples) is not advisable. In my img2img pipeline this leads to a long running time in the `integrate.quad` call later on; by long I mean more than 10x slower.
Co-authored-by: Anton Lozhkov <anton@huggingface.co>

Fix the order of casts for onnx inpainting (#1338)

Legacy Inpainting Pipeline for Onnx Models (#1237)

* Add legacy inpainting pipeline compatibility for onnx

* remove commented out line

* Add onnx legacy inpainting test

* Fix slow decorators

* pep8 styling

* isort styling

* dummy object

* ordering consistency

* style

* docstring styles

* Refactor common prompt encoding pattern

* Update tests to permanent repository home

* support all available schedulers until ONNX IO binding is available
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* updated styling from PR suggested feedback
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

Jax infer support negative prompt (#1337)

* support negative prompts in sd jax pipeline

* pass batched neg_prompt

* only encode when negative prompt is None
Co-authored-by: Juan Acevedo <jfacevedo@google.com>

Update README.md: Minor change to Imagic code snippet, missing dir error (#1347)

Minor change to Imagic Readme

Missing dir causes an error when running the example code.

make style

change the sample model (#1352)

* Update alt_diffusion.mdx

* Update alt_diffusion.mdx

Add bit diffusion [WIP] (#971)

* Create bit_diffusion.py

Bit diffusion based on the paper, arXiv:2208.04202, Chen2022AnalogBG

* adding bit diffusion to new branch

ran tests

* tests

* tests

* tests

* tests

* removed test folders + added to README

* Update README.md
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* move Mel to module in pipeline construction, make librosa optional

* fix imports

* fix copy & paste error in comment

* fix style

* add missing register_to_config

* fix class docstrings

* fix class docstrings

* tweak docstrings

* tweak docstrings

* update slow test

* put trailing commas back

* respect alphabetical order

* remove LatentAudioDiffusion, make vqvae optional

* move Mel from models back to pipelines :-)

* allow loading of pretrained audiodiffusion models

* fix tests

* fix dummies

* remove reference to latent_audio_diffusion in docs

* unused import

* inherit from SchedulerMixin to make loadable

* Apply suggestions from code review

* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 459b8ca8
@@ -57,6 +57,7 @@ jobs:
       - name: Install dependencies
         run: |
+          apt-get update && apt-get install libsndfile1-dev -y
           python -m pip install -e .[quality,test]
           python -m pip install git+https://github.com/huggingface/accelerate
           python -m pip install -U git+https://github.com/huggingface/transformers
...
@@ -165,4 +165,4 @@ tags
 # DS_Store (MacOS)
 .DS_Store
 # RL pipelines may produce mp4 outputs
-*.mp4
\ No newline at end of file
+*.mp4
@@ -11,6 +11,7 @@ RUN apt update && \
     git-lfs \
     curl \
     ca-certificates \
+    libsndfile1-dev \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -33,6 +34,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     datasets \
     hf-doc-builder \
     huggingface-hub \
+    librosa \
     modelcards \
     numpy \
     scipy \
...
@@ -11,6 +11,7 @@ RUN apt update && \
     git-lfs \
     curl \
     ca-certificates \
+    libsndfile1-dev \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -35,6 +36,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     datasets \
     hf-doc-builder \
     huggingface-hub \
+    librosa \
     modelcards \
     numpy \
     scipy \
...
@@ -11,6 +11,7 @@ RUN apt update && \
     git-lfs \
     curl \
     ca-certificates \
+    libsndfile1-dev \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -33,6 +34,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     datasets \
     hf-doc-builder \
     huggingface-hub \
+    librosa \
     modelcards \
     numpy \
     scipy \
...
@@ -11,6 +11,7 @@ RUN apt update && \
     git-lfs \
     curl \
     ca-certificates \
+    libsndfile1-dev \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -33,6 +34,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     datasets \
     hf-doc-builder \
     huggingface-hub \
+    librosa \
     modelcards \
     numpy \
     scipy \
...
@@ -11,6 +11,7 @@ RUN apt update && \
     git-lfs \
     curl \
     ca-certificates \
+    libsndfile1-dev \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -32,6 +33,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     datasets \
     hf-doc-builder \
     huggingface-hub \
+    librosa \
     modelcards \
     numpy \
     scipy \
...
@@ -11,6 +11,7 @@ RUN apt update && \
     git-lfs \
     curl \
     ca-certificates \
+    libsndfile1-dev \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -32,6 +33,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     datasets \
     hf-doc-builder \
     huggingface-hub \
+    librosa \
     modelcards \
     numpy \
     scipy \
...
@@ -122,6 +122,8 @@
     title: "VQ Diffusion"
   - local: api/pipelines/repaint
     title: "RePaint"
+  - local: api/pipelines/audio_diffusion
+    title: "Audio Diffusion"
   title: "Pipelines"
 - sections:
   - local: api/experimental/rl
...
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Audio Diffusion
## Overview
[Audio Diffusion](https://github.com/teticio/audio-diffusion) by Robert Dargavel Smith.
Audio Diffusion leverages the recent advances in image generation using diffusion models by converting audio samples to
and from mel spectrogram images.
The original codebase of this implementation can be found [here](https://github.com/teticio/audio-diffusion), including
training scripts and example notebooks.
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_audio_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py) | *Unconditional Audio Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/audio_diffusion_pipeline.ipynb) |
## Examples:
### Audio Diffusion
```python
import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
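Outside a notebook you can write the generated audio to disk instead of playing it inline. A minimal sketch using `scipy` (the output filename is just a placeholder):
```python
import numpy as np
from scipy.io import wavfile

# Save the first generated sample as a 32-bit float WAV file.
wavfile.write("generated.wav", pipe.mel.get_sample_rate(), output.audios[0, 0].astype(np.float32))
```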
### Latent Audio Diffusion
```python
import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
### Audio Diffusion with DDIM (faster)
```python
import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
### Variations, in-painting, out-painting etc.
```python
output = pipe(
raw_audio=output.audios[0, 0],
start_step=int(pipe.get_default_steps() / 2),
mask_start_secs=1,
mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
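The pipeline also exposes `encode` and `slerp`, which can be combined to interpolate between two generated samples. A sketch, assuming `pipe` is the DDIM pipeline loaded above (`encode` is deterministic and only works with a `DDIMScheduler`):
```python
output1 = pipe()
output2 = pipe()

# Recover the latent noise behind each generated spectrogram image.
noise1 = pipe.encode(output1.images)
noise2 = pipe.encode(output2.images)

# Spherically interpolate halfway between the two noises and re-generate.
output = pipe(noise=pipe.slerp(noise1, noise2, 0.5))
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```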
## AudioDiffusionPipeline
[[autodoc]] AudioDiffusionPipeline
- __call__
- encode
- slerp
## Mel
[[autodoc]] Mel
- audio_slice_to_image
- image_to_audio
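`Mel` can also be used on its own to move between raw audio and the grayscale spectrogram images the pipeline works on. A minimal sketch (the audio path is a placeholder):
```python
from diffusers import Mel

mel = Mel(x_res=256, y_res=256)
mel.load_audio(audio_file="my_track.wav")    # placeholder path; raw_audio=<numpy array> also works
image = mel.audio_slice_to_image(0)          # 256x256 grayscale PIL image of the first slice
reconstruction = mel.image_to_audio(image)   # lossy reconstruction via Griffin-Lim
```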
@@ -45,6 +45,7 @@ available a colab notebook to directly try them out.
 | Pipeline | Paper | Tasks | Colab
 |---|---|:---:|:---:|
 | [alt_diffusion](./api/pipelines/alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | -
+| [audio_diffusion](./api/pipelines/audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation |
 | [cycle_diffusion](./api/pipelines/cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
 | [dance_diffusion](./api/pipelines/dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
 | [ddpm](./api/pipelines/ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
...
@@ -35,6 +35,7 @@ available a colab notebook to directly try them out.
 | Pipeline | Paper | Tasks | Colab
 |---|---|:---:|:---:|
 | [alt_diffusion](./api/pipelines/alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation |
+| [audio_diffusion](./api/pipelines/audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/audio_diffusion_pipeline.ipynb)
 | [cycle_diffusion](./api/pipelines/cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
 | [dance_diffusion](./api/pipelines/dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
 | [ddpm](./api/pipelines/ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
...
@@ -12,5 +12,5 @@ specific language governing permissions and limitations under the License.
 # Using Diffusers for audio
-The [`DanceDiffusionPipeline`] can be used to generate audio rapidly!
-More coming soon!
+[`DanceDiffusionPipeline`] and [`AudioDiffusionPipeline`] can be used to generate
+audio rapidly! More coming soon!
\ No newline at end of file
@@ -91,6 +91,7 @@ _deps = [
     "isort>=5.5.4",
     "jax>=0.2.8,!=0.3.2",
     "jaxlib>=0.1.65",
+    "librosa",
     "modelcards>=0.1.4",
     "numpy",
     "parameterized",
@@ -181,6 +182,7 @@ extras["docs"] = deps_list("hf-doc-builder")
 extras["training"] = deps_list("accelerate", "datasets", "tensorboard", "modelcards")
 extras["test"] = deps_list(
     "datasets",
+    "librosa",
     "parameterized",
     "pytest",
     "pytest-timeout",
...
@@ -30,12 +30,14 @@ if is_torch_available():
     )
     from .pipeline_utils import DiffusionPipeline
     from .pipelines import (
+        AudioDiffusionPipeline,
         DanceDiffusionPipeline,
         DDIMPipeline,
         DDPMPipeline,
         KarrasVePipeline,
         LDMPipeline,
         LDMSuperResolutionPipeline,
+        Mel,
         PNDMPipeline,
         RePaintPipeline,
         ScoreSdeVePipeline,
...
@@ -15,6 +15,7 @@ deps = {
     "isort": "isort>=5.5.4",
     "jax": "jax>=0.2.8,!=0.3.2",
     "jaxlib": "jaxlib>=0.1.65",
+    "librosa": "librosa",
     "modelcards": "modelcards>=0.1.4",
     "numpy": "numpy",
     "parameterized": "parameterized",
...
-from ..utils import is_flax_available, is_onnx_available, is_torch_available, is_transformers_available
+from ..utils import (
+    is_flax_available,
+    is_librosa_available,
+    is_onnx_available,
+    is_torch_available,
+    is_transformers_available,
+)

 if is_torch_available():
@@ -14,6 +20,11 @@ if is_torch_available():
 else:
     from ..utils.dummy_pt_objects import *  # noqa F403

+if is_torch_available() and is_librosa_available():
+    from .audio_diffusion import AudioDiffusionPipeline, Mel
+else:
+    from ..utils.dummy_torch_and_librosa_objects import AudioDiffusionPipeline, Mel  # noqa F403
+
 if is_torch_available() and is_transformers_available():
     from .alt_diffusion import AltDiffusionImg2ImgPipeline, AltDiffusionPipeline
     from .latent_diffusion import LDMTextToImagePipeline
...
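Because the new pipeline is gated behind the optional `librosa` dependency (falling back to dummy objects when it is missing), downstream code can check for it explicitly with the `is_librosa_available` helper referenced above. A minimal sketch:
```python
from diffusers.utils import is_librosa_available

if is_librosa_available():
    from diffusers import AudioDiffusionPipeline, Mel
else:
    raise ImportError("Run `pip install librosa` to use AudioDiffusionPipeline and Mel.")
```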
# flake8: noqa
from .mel import Mel
from .pipeline_audio_diffusion import AudioDiffusionPipeline
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import warnings
from ...configuration_utils import ConfigMixin, register_to_config
from ...schedulers.scheduling_utils import SchedulerMixin
warnings.filterwarnings("ignore")
import numpy as np # noqa: E402
import librosa # noqa: E402
from PIL import Image # noqa: E402
class Mel(ConfigMixin, SchedulerMixin):
"""
Parameters:
x_res (`int`): x resolution of spectrogram (time)
y_res (`int`): y resolution of spectrogram (frequency bins)
sample_rate (`int`): sample rate of audio
n_fft (`int`): number of Fast Fourier Transforms
hop_length (`int`): hop length (a higher number is recommended for lower than 256 y_res)
top_db (`int`): loudest in decibels
n_iter (`int`): number of iterations for Griffin-Lim mel inversion
"""
config_name = "mel_config.json"
@register_to_config
def __init__(
self,
x_res: int = 256,
y_res: int = 256,
sample_rate: int = 22050,
n_fft: int = 2048,
hop_length: int = 512,
top_db: int = 80,
n_iter: int = 32,
):
self.hop_length = hop_length
self.sr = sample_rate
self.n_fft = n_fft
self.top_db = top_db
self.n_iter = n_iter
self.set_resolution(x_res, y_res)
self.audio = None
def set_resolution(self, x_res: int, y_res: int):
"""Set resolution.
Args:
x_res (`int`): x resolution of spectrogram (time)
y_res (`int`): y resolution of spectrogram (frequency bins)
"""
self.x_res = x_res
self.y_res = y_res
self.n_mels = self.y_res
self.slice_size = self.x_res * self.hop_length - 1
def load_audio(self, audio_file: str = None, raw_audio: np.ndarray = None):
"""Load audio.
Args:
audio_file (`str`): must be a file on disk due to Librosa limitation or
raw_audio (`np.ndarray`): audio as numpy array
"""
if audio_file is not None:
self.audio, _ = librosa.load(audio_file, mono=True, sr=self.sr)
else:
self.audio = raw_audio
# Pad with silence if necessary.
if len(self.audio) < self.x_res * self.hop_length:
self.audio = np.concatenate([self.audio, np.zeros((self.x_res * self.hop_length - len(self.audio),))])
def get_number_of_slices(self) -> int:
"""Get number of slices in audio.
Returns:
`int`: number of spectrograms audio can be sliced into
"""
return len(self.audio) // self.slice_size
def get_audio_slice(self, slice: int = 0) -> np.ndarray:
"""Get slice of audio.
Args:
slice (`int`): slice number of audio (out of get_number_of_slices())
Returns:
`np.ndarray`: audio as numpy array
"""
return self.audio[self.slice_size * slice : self.slice_size * (slice + 1)]
def get_sample_rate(self) -> int:
"""Get sample rate:
Returns:
`int`: sample rate of audio
"""
return self.sr
def audio_slice_to_image(self, slice: int) -> Image.Image:
"""Convert slice of audio to spectrogram.
Args:
slice (`int`): slice number of audio to convert (out of get_number_of_slices())
Returns:
`PIL Image`: grayscale image of x_res x y_res
"""
S = librosa.feature.melspectrogram(
y=self.get_audio_slice(slice), sr=self.sr, n_fft=self.n_fft, hop_length=self.hop_length, n_mels=self.n_mels
)
log_S = librosa.power_to_db(S, ref=np.max, top_db=self.top_db)
bytedata = (((log_S + self.top_db) * 255 / self.top_db).clip(0, 255) + 0.5).astype(np.uint8)
image = Image.fromarray(bytedata)
return image
def image_to_audio(self, image: Image.Image) -> np.ndarray:
"""Converts spectrogram to audio.
Args:
image (`PIL Image`): x_res x y_res grayscale image
Returns:
audio (`np.ndarray`): raw audio
"""
bytedata = np.frombuffer(image.tobytes(), dtype="uint8").reshape((image.height, image.width))
log_S = bytedata.astype("float") * self.top_db / 255 - self.top_db
S = librosa.db_to_power(log_S)
audio = librosa.feature.inverse.mel_to_audio(
S, sr=self.sr, n_fft=self.n_fft, hop_length=self.hop_length, n_iter=self.n_iter
)
return audio
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from math import acos, sin
from typing import List, Tuple, Union
import numpy as np
import torch
from PIL import Image
from ...models import AutoencoderKL, UNet2DConditionModel
from ...pipeline_utils import AudioPipelineOutput, BaseOutput, DiffusionPipeline, ImagePipelineOutput
from ...schedulers import DDIMScheduler, DDPMScheduler
from .mel import Mel
class AudioDiffusionPipeline(DiffusionPipeline):
"""
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
Parameters:
vqvae ([`AutoencoderKL`]): Variational AutoEncoder for Latent Audio Diffusion or None
unet ([`UNet2DConditionModel`]): UNET model
mel ([`Mel`]): transform audio <-> spectrogram
scheduler ([`DDIMScheduler` or `DDPMScheduler`]): de-noising scheduler
"""
_optional_components = ["vqvae"]
def __init__(
self,
vqvae: AutoencoderKL,
unet: UNet2DConditionModel,
mel: Mel,
scheduler: Union[DDIMScheduler, DDPMScheduler],
):
super().__init__()
self.register_modules(unet=unet, scheduler=scheduler, mel=mel, vqvae=vqvae)
def get_input_dims(self) -> Tuple:
"""Returns dimension of input image
Returns:
`Tuple`: (height, width)
"""
input_module = self.vqvae if self.vqvae is not None else self.unet
# For backwards compatibility
sample_size = (
(input_module.sample_size, input_module.sample_size)
if type(input_module.sample_size) == int
else input_module.sample_size
)
return sample_size
def get_default_steps(self) -> int:
"""Returns default number of steps recommended for inference
Returns:
`int`: number of steps
"""
return 50 if isinstance(self.scheduler, DDIMScheduler) else 1000
@torch.no_grad()
def __call__(
self,
batch_size: int = 1,
audio_file: str = None,
raw_audio: np.ndarray = None,
slice: int = 0,
start_step: int = 0,
steps: int = None,
generator: torch.Generator = None,
mask_start_secs: float = 0,
mask_end_secs: float = 0,
step_generator: torch.Generator = None,
eta: float = 0,
noise: torch.Tensor = None,
return_dict=True,
) -> Union[
Union[AudioPipelineOutput, ImagePipelineOutput], Tuple[List[Image.Image], Tuple[int, List[np.ndarray]]]
]:
"""Generate random mel spectrogram from audio input and convert to audio.
Args:
batch_size (`int`): number of samples to generate
audio_file (`str`): must be a file on disk due to Librosa limitation or
raw_audio (`np.ndarray`): audio as numpy array
slice (`int`): slice number of audio to convert
start_step (int): step to start from
steps (`int`): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
generator (`torch.Generator`): random number generator or None
mask_start_secs (`float`): number of seconds of audio to mask (not generate) at start
mask_end_secs (`float`): number of seconds of audio to mask (not generate) at end
step_generator (`torch.Generator`): random number generator used to de-noise or None
eta (`float`): parameter between 0 and 1 used with DDIM scheduler
noise (`torch.Tensor`): noise tensor of shape (batch_size, 1, height, width) or None
return_dict (`bool`): if True return AudioPipelineOutput, ImagePipelineOutput else Tuple
Returns:
`List[PIL Image]`: mel spectrograms (`float`, `List[np.ndarray]`): sample rate and raw audios
"""
steps = steps or self.get_default_steps()
self.scheduler.set_timesteps(steps)
step_generator = step_generator or generator
# For backwards compatibility
if type(self.unet.sample_size) == int:
self.unet.sample_size = (self.unet.sample_size, self.unet.sample_size)
input_dims = self.get_input_dims()
self.mel.set_resolution(x_res=input_dims[1], y_res=input_dims[0])
if noise is None:
noise = torch.randn(
(batch_size, self.unet.in_channels, self.unet.sample_size[0], self.unet.sample_size[1]),
generator=generator,
device=self.device,
)
images = noise
mask = None
if audio_file is not None or raw_audio is not None:
self.mel.load_audio(audio_file, raw_audio)
input_image = self.mel.audio_slice_to_image(slice)
input_image = np.frombuffer(input_image.tobytes(), dtype="uint8").reshape(
(input_image.height, input_image.width)
)
input_image = (input_image / 255) * 2 - 1
input_images = torch.tensor(input_image[np.newaxis, :, :], dtype=torch.float).to(self.device)
if self.vqvae is not None:
input_images = self.vqvae.encode(torch.unsqueeze(input_images, 0)).latent_dist.sample(
generator=generator
)[0]
input_images = 0.18215 * input_images
if start_step > 0:
images[0, 0] = self.scheduler.add_noise(input_images, noise, self.scheduler.timesteps[start_step - 1])
pixels_per_second = (
self.unet.sample_size[1] * self.mel.get_sample_rate() / self.mel.x_res / self.mel.hop_length
)
mask_start = int(mask_start_secs * pixels_per_second)
mask_end = int(mask_end_secs * pixels_per_second)
mask = self.scheduler.add_noise(input_images, noise, torch.tensor(self.scheduler.timesteps[start_step:]))
for step, t in enumerate(self.progress_bar(self.scheduler.timesteps[start_step:])):
model_output = self.unet(images, t)["sample"]
if isinstance(self.scheduler, DDIMScheduler):
images = self.scheduler.step(
model_output=model_output, timestep=t, sample=images, eta=eta, generator=step_generator
)["prev_sample"]
else:
images = self.scheduler.step(
model_output=model_output, timestep=t, sample=images, generator=step_generator
)["prev_sample"]
if mask is not None:
if mask_start > 0:
images[:, :, :, :mask_start] = mask[:, step, :, :mask_start]
if mask_end > 0:
images[:, :, :, -mask_end:] = mask[:, step, :, -mask_end:]
if self.vqvae is not None:
# 0.18215 was scaling factor used in training to ensure unit variance
images = 1 / 0.18215 * images
images = self.vqvae.decode(images)["sample"]
images = (images / 2 + 0.5).clamp(0, 1)
images = images.cpu().permute(0, 2, 3, 1).numpy()
images = (images * 255).round().astype("uint8")
images = list(
map(lambda _: Image.fromarray(_[:, :, 0]), images)
if images.shape[3] == 1
else map(lambda _: Image.fromarray(_, mode="RGB").convert("L"), images)
)
audios = list(map(lambda _: self.mel.image_to_audio(_), images))
if not return_dict:
return images, (self.mel.get_sample_rate(), audios)
return BaseOutput(**AudioPipelineOutput(np.array(audios)[:, np.newaxis, :]), **ImagePipelineOutput(images))
@torch.no_grad()
def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray:
"""Reverse step process: recover noisy image from generated image.
Args:
images (`List[PIL Image]`): list of images to encode
steps (`int`): number of encoding steps to perform (defaults to 50)
Returns:
`np.ndarray`: noise tensor of shape (batch_size, 1, height, width)
"""
# Only works with DDIM as this method is deterministic
assert isinstance(self.scheduler, DDIMScheduler)
self.scheduler.set_timesteps(steps)
sample = np.array(
[np.frombuffer(image.tobytes(), dtype="uint8").reshape((1, image.height, image.width)) for image in images]
)
sample = (sample / 255) * 2 - 1
sample = torch.Tensor(sample).to(self.device)
for t in self.progress_bar(torch.flip(self.scheduler.timesteps, (0,))):
prev_timestep = t - self.scheduler.num_train_timesteps // self.scheduler.num_inference_steps
alpha_prod_t = self.scheduler.alphas_cumprod[t]
alpha_prod_t_prev = (
self.scheduler.alphas_cumprod[prev_timestep]
if prev_timestep >= 0
else self.scheduler.final_alpha_cumprod
)
beta_prod_t = 1 - alpha_prod_t
model_output = self.unet(sample, t)["sample"]
pred_sample_direction = (1 - alpha_prod_t_prev) ** (0.5) * model_output
sample = (sample - pred_sample_direction) * alpha_prod_t_prev ** (-0.5)
sample = sample * alpha_prod_t ** (0.5) + beta_prod_t ** (0.5) * model_output
return sample
@staticmethod
def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor:
"""Spherical Linear intERPolation
Args:
x0 (`torch.Tensor`): first tensor to interpolate between
x1 (`torch.Tensor`): second tensor to interpolate between
alpha (`float`): interpolation between 0 and 1
Returns:
`torch.Tensor`: interpolated tensor
"""
theta = acos(torch.dot(torch.flatten(x0), torch.flatten(x1)) / torch.norm(x0) / torch.norm(x1))
return sin((1 - alpha) * theta) * x0 / sin(theta) + sin(alpha * theta) * x1 / sin(theta)