PicoAudio2: Temporal Controllable Text-to-Audio Generation With Natural Language Description

1 MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University; 2 Shanghai AI Lab
†This work was done while Z. Zheng was an intern at Shanghai AI Lab.
*Corresponding author: Mengyue Wu
PicoAudio2 improves temporally controllable text-to-audio (TTA) generation through a new data processing pipeline and model architecture. Specifically, we use a grounding model to annotate event timestamps in real audio-text datasets, curating temporally strong real data to complement the simulated data from existing work. The model is trained on the combination of real and simulated data. Moreover, following PicoAudio, we encode timestamp information into a timestamp matrix, which gives the model fine-grained, time-aligned information on top of the coarse-grained textual description.
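To make the idea of a timestamp matrix concrete, here is a minimal sketch that rasterizes per-event intervals onto a frame grid. It assumes a binary event-by-frame layout; the function name `build_timestamp_matrix`, the 10-second duration, and the frame rate are illustrative assumptions, and the encoding actually used by PicoAudio2 may differ.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (onset, offset) in seconds


def build_timestamp_matrix(
    events: Dict[str, List[Interval]],
    duration: float = 10.0,  # assumed clip length in seconds
    frame_rate: int = 25,    # assumed frames per second (hypothetical value)
) -> Tuple[List[str], List[List[int]]]:
    """Return event names and a binary matrix of shape (num_events, num_frames).

    matrix[i][t] is 1 when event i is active in frame t, else 0.
    """
    num_frames = round(duration * frame_rate)
    names = list(events)
    matrix = [[0] * num_frames for _ in names]
    for row, name in enumerate(names):
        for onset, offset in events[name]:
            # Round to the nearest frame boundary and clamp to the clip.
            start = max(0, round(onset * frame_rate))
            end = min(num_frames, round(offset * frame_rate))
            for frame in range(start, end):
                matrix[row][frame] = 1
    return names, matrix


if __name__ == "__main__":
    names, matrix = build_timestamp_matrix(
        {
            "an engine rumbles loudly": [(0.2, 3.9)],
            "an air horn honks": [(4.7, 5.5), (6.5, 6.8)],
        },
        frame_rate=10,
    )
    for name, row in zip(names, matrix):
        print(f"{name}: {sum(row)} active frames out of {len(row)}")
```

Each row is a frame-level activity track for one event, which is the kind of fine-grained, time-aligned signal that a free-text description alone does not provide.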

Data Pipeline & Model

Figures: data examples; model architecture.

Text-to-Audio Generation Comparison

For each caption, the page presents an audio sample and an accompanying image from five systems: PicoAudio2 (Ours), AudioComposer, AudioLDM2, Tango2, and Make-An-Audio 2.

Caption 1: An engine rumbles loudly at 0.2-3.9s and an air horn honks at 4.7-5.5s, 6.5-6.8s.
Caption 2: A bell is ringing loudly and quickly at 0.0-3.2s.
Caption 3: Loud wind noise at 0.2-2.0s and a car accelerating fast at 2.0-9.9s.
Caption 4: A quick loud explosion at 0.1-0.8s and music plays with pulsating sounds at 1.1-5.8s and a man talking at 6.8-9.6s.
Caption 5: Tap water is running at 1.0-7.6s and a tapping noise at 0.5-0.6s, 9.0-9.4s.
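The captions above follow a regular "event at start-end s" pattern, so the event descriptions and their intervals can be recovered with a small parser. The following is a rough sketch under that assumption; `parse_caption` and the regular expressions are illustrative and are not taken from the PicoAudio2 code.

```python
import re
from typing import List, Tuple

# One interval such as "0.2-3.9s" (also tolerates "0.0s-0.2s").
_INTERVAL = r"\d+(?:\.\d+)?s?-\d+(?:\.\d+)?s"
# An event description, the word "at", then one or more intervals
# separated by commas or "and".
_EVENT = re.compile(
    rf"(?P<event>.+?)\s+at\s+"
    rf"(?P<times>{_INTERVAL}(?:(?:,\s*|\s+and\s+){_INTERVAL})*)"
)
_BOUNDS = re.compile(r"(\d+(?:\.\d+)?)s?-(\d+(?:\.\d+)?)s")


def parse_caption(caption: str) -> List[Tuple[str, List[Tuple[float, float]]]]:
    """Split a timestamped caption into (event, [(onset, offset), ...]) pairs."""
    parsed = []
    for match in _EVENT.finditer(caption):
        # Drop the connective ("and" or a comma) left over from the previous event.
        event = re.sub(
            r"^(?:and\s+|,\s*)", "", match.group("event").strip(), flags=re.IGNORECASE
        )
        spans = [(float(a), float(b)) for a, b in _BOUNDS.findall(match.group("times"))]
        parsed.append((event, spans))
    return parsed


if __name__ == "__main__":
    caption = (
        "An engine rumbles loudly at 0.2-3.9s and an air horn honks "
        "at 4.7-5.5s, 6.5-6.8s."
    )
    for event, spans in parse_caption(caption):
        print(event, spans)
```

Matching on "at" followed by an interval, rather than splitting on "and", keeps descriptions such as "ringing loudly and quickly" in one piece.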

Transferred examples by LLM

Type | Caption | LLM output | Audio
Frequency | A pet cat meows for two times. | A pet cat meows at 1.3-1.9s and 3.5-4.1s. | Audio 1
Frequency | A train horn for three times. | A train horn at 0.0s-0.2s, 2.9-3.6s and 5.3-7.8s | Audio 4
Order | Clinking followed by a toilet flushing. | Clinking at 1.0-1.2s and a toilet flushing at 7.2-10.0s. | Audio 2
Order | A man speaks then digital beeps. | A man speaks at 0.8-9.4s and digital beeps at 9.4-10.0s. | Audio 3
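One way to sanity-check a transferred caption is to confirm that its timestamps still respect the original frequency or order constraint. The sketch below is purely illustrative: the helper names are hypothetical, and the page does not describe any such validation step in the PicoAudio2 pipeline.

```python
import re
from typing import List, Tuple

_BOUNDS = re.compile(r"(\d+(?:\.\d+)?)s?-(\d+(?:\.\d+)?)s")


def intervals_in(text: str) -> List[Tuple[float, float]]:
    """Return all (onset, offset) pairs in the order they appear in the text."""
    return [(float(a), float(b)) for a, b in _BOUNDS.findall(text)]


def matches_frequency(timed_caption: str, expected_count: int) -> bool:
    """Frequency captions: the number of intervals should equal the stated count."""
    return len(intervals_in(timed_caption)) == expected_count


def preserves_order(timed_caption: str) -> bool:
    """Order captions: each interval should end before the next one starts."""
    spans = intervals_in(timed_caption)
    return all(prev[1] <= nxt[0] for prev, nxt in zip(spans, spans[1:]))


if __name__ == "__main__":
    print(matches_frequency("A pet cat meows at 1.3-1.9s and 3.5-4.1s.", 2))
    print(preserves_order("A man speaks at 0.8-9.4s and digital beeps at 9.4-10.0s."))
```

The order check assumes that events are written in the caption in their intended temporal order, which holds for the examples in the table.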