PicoAudio2: Temporal Controllable Text-to-Audio Generation With Natural Language Description

Zihao Zheng†¹,², Zeyu Xie¹, Xuenan Xu¹,², Wen Wu², Chao Zhang², Mengyue Wu*¹

¹MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University ²Shanghai AI Lab

†This work was done while Z. Zheng was an intern at Shanghai AI Lab.
*Corresponding author: Mengyue Wu

PicoAudio2 improves temporally controllable text-to-audio (TTA) generation through a new data processing pipeline and model architecture. Specifically, we use a grounding model to annotate event timestamps in real audio-text datasets, curating real data with strong temporal annotations to complement the simulated data from existing works. The model is trained on the combination of real and simulated data. Moreover, following PicoAudio, we encode timestamp information into a timestamp matrix, which gives the model fine-grained, time-aligned information on top of the coarse-grained textual description.
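To make the idea of a timestamp matrix concrete, here is a minimal sketch of one possible encoding. It assumes a binary event-by-frame layout, a 10-second clip, and 10 frames per second; the function name `build_timestamp_matrix` and these parameters are illustrative assumptions, not the representation actually used in PicoAudio2.

```python
def build_timestamp_matrix(events, duration=10.0, frame_rate=10):
    """Turn per-event (onset, offset) pairs into a binary event-by-frame matrix.

    events: dict mapping an event description to a list of (onset, offset)
            tuples in seconds.
    duration, frame_rate: assumed clip length (s) and frames per second;
            both are illustrative values, not taken from the paper.
    """
    n_frames = round(duration * frame_rate)
    names = list(events)
    matrix = [[0] * n_frames for _ in names]
    for row, name in enumerate(names):
        for onset, offset in events[name]:
            # Clamp to the clip boundaries and mark the active frames.
            start = max(0, round(onset * frame_rate))
            end = min(n_frames, round(offset * frame_rate))
            for t in range(start, end):
                matrix[row][t] = 1
    return names, matrix


# Timestamps taken from Caption 1 on this page.
names, matrix = build_timestamp_matrix({
    "an engine rumbles loudly": [(0.2, 3.9)],
    "an air horn honks": [(4.7, 5.5), (6.5, 6.8)],
})
for name, row in zip(names, matrix):
    print(f"{name:26s}", "".join(str(v) for v in row))
```

Each row marks the frames in which one event is active, so several occurrences of the same event (the air horn above) share a row. A finer frame rate gives tighter temporal resolution at the cost of a longer matrix.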

Data Pipeline & Model

Text-to-Audio Generation Comparison

Caption 1: An engine rumbles loudly at 0.2-3.9s and an air horn honks at 4.7-5.5s, 6.5-6.8s.
PicoAudio2 (Ours)	AudioComposer	AudioLDM2	Tango2	Make-An-Audio 2
Caption 2: A bell is ringing loudly and quickly at 0.0-3.2s.
PicoAudio2 (Ours)	AudioComposer	AudioLDM2	Tango2	Make-An-Audio 2
Caption 3: Loud wind noise at 0.2-2.0s and a car accelerating fast at 2.0-9.9s.
PicoAudio2 (Ours)	AudioComposer	AudioLDM2	Tango2	Make-An-Audio 2
Caption 4: A quick loud explosion at 0.1-0.8s and music plays with pulsating sounds at 1.1-5.8s and a man talking at 6.8-9.6s.
PicoAudio2 (Ours)	AudioComposer	AudioLDM2	Tango2	Make-An-Audio 2
Caption 5: Tap water is running at 1.0-7.6s and a tapping noise at 0.5-0.6s, 9.0-9.4s.
PicoAudio2 (Ours)	AudioComposer	AudioLDM2	Tango2	Make-An-Audio 2

Caption Examples Converted by LLM

Type	Original caption	LLM output
Frequency	A pet cat meows for two times.	A pet cat meows at 1.3-1.9s and 3.5-4.1s.
Frequency	A train horn for three times.	A train horn at 0.0s-0.2s, 2.9-3.6s and 5.3-7.8s
Order	Clinking followed by a toilet flushing.	Clinking at 1.0-1.2s and a toilet flushing at 7.2-10.0s.
Order	A man speaks then digital beeps.	A man speaks at 0.8-9.4s and digital beeps at 9.4-10.0s.
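The LLM outputs in the table carry timestamps as free text, with small format variations such as "0.0s-0.2s" versus "2.9-3.6s". As a hedged illustration of how such output could be checked before use, the sketch below extracts the time spans with a regular expression and validates them. The helper `extract_spans` and the 10-second limit are assumptions made for this example, not part of the PicoAudio2 pipeline.

```python
import re

# Matches spans like "2.9-3.6s" and also "0.0s-0.2s" (optional "s" after the onset).
SPAN = re.compile(r"(\d+(?:\.\d+)?)s?-(\d+(?:\.\d+)?)s")


def extract_spans(caption, max_duration=10.0):
    """Return all (onset, offset) pairs, in seconds, found in a caption.

    max_duration is an assumed clip length used only for sanity checking.
    Raises ValueError if a span is empty, reversed, or out of range.
    """
    spans = []
    for match in SPAN.finditer(caption):
        onset, offset = float(match.group(1)), float(match.group(2))
        if not (0.0 <= onset < offset <= max_duration):
            raise ValueError(f"invalid span {match.group(0)!r} in caption")
        spans.append((onset, offset))
    return spans


print(extract_spans("A train horn at 0.0s-0.2s, 2.9-3.6s and 5.3-7.8s"))
# prints [(0.0, 0.2), (2.9, 3.6), (5.3, 7.8)]
```

For the "Frequency" rows, the number of extracted spans can be compared with the count stated in the original caption. For the "Order" rows, the spans can be checked for the expected ordering.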