audioldm#
README(From Huggingface)#
AudioLDM#
AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.
Model Details#
AudioLDM was proposed in the paper AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al.
Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
Checkpoint Details#
This is the original, small version of the AudioLDM model, also referred to as audioldm-s-full. The four AudioLDM checkpoints are summarised in the table below:
Table 1: Summary of the AudioLDM checkpoints.
| Checkpoint | Training Steps | Audio conditioning | CLAP audio dim | UNet dim | Params |
|---|---|---|---|---|---|
| audioldm-s-full | 1.5M | No | 768 | 128 | 421M |
| audioldm-s-full-v2 | > 1.5M | No | 768 | 128 | 421M |
| audioldm-m-full | 1.5M | Yes | 1024 | 192 | 652M |
| audioldm-l-full | 1.5M | No | 768 | 256 | 975M |
Model Sources#
Usage#
First, install the required packages:
pip install --upgrade diffusers transformers accelerate
Text-to-Audio#
For text-to-audio generation, the AudioLDMPipeline can be used to load pre-trained weights and generate text-conditional audio outputs:
from diffusers import AudioLDMPipeline
import paddle
repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, dtype=paddle.float16)
pipe = pipe
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
The resulting audio output can be saved as a .wav file:
import scipy
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
Or displayed in a Jupyter Notebook / Google Colab:
from IPython.display import Audio
Audio(audio, rate=16000)
Tips#
Prompts:
Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g., "water stream in a forest" instead of "stream").
It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
Inference:
The quality of the predicted audio sample can be controlled by the
num_inference_stepsargument: higher steps give higher quality audio at the expense of slower inference.The length of the predicted audio sample can be controlled by varying the
audio_length_in_sargument.
Citation#
BibTeX:
@article{liu2023audioldm,
title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={arXiv preprint arXiv:2301.12503},
year={2023}
}
Model Files#
README.md (4.3 KB)
model_index.json (494.0 B)
scheduler/scheduler_config.json (436.0 B)
text_encoder/config.json (971.0 B)
text_encoder/model_state.pdparams (478.0 MB)
tokenizer/merges.txt (445.7 KB)
tokenizer/special_tokens_map.json (239.0 B)
tokenizer/tokenizer_config.json (434.0 B)
tokenizer/vocab.json (779.6 KB)
unet/config.json (1.6 KB)
unet/model_state.pdparams (705.9 MB)
vae/config.json (634.0 B)
vae/model_state.pdparams (211.3 MB)
vocoder/config.json (807.0 B)
vocoder/model_state.pdparams (210.8 MB)