OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

On-screen environmental sound: Emergency vehicle sirens blare as people converse in the bustling city street.

Off-screen environmental sound: A baby is crying loudly.

Off-screen speech: “Whoa, what's all that noise?”

On-screen environmental sound: Birds chirp and tweet as a Baltimore oriole calls out, creating a lively atmosphere in a garden with a feeder.

Off-screen speech: Look! What can we see! A pair of birds enjoying the orange!

On-screen speech: This ocean, it's a force, a wild, untamed might. And she commands your awe with every breaking light.

Off-screen environmental sound: The sea breeze blows on the sea, raising waves.

On-screen speech: Hi everyone, thanks for joining me today, let's dive into your latest project and discuss next concrete steps.

Off-screen environmental sound: The sound of birds chirping.

On-screen environmental sound: A dog growls persistently, indicating its presence in a domestic setting, possibly as a pet.

Off-screen speech: Hey, easy now. It's okay...relax buddy.

BibTeX

@inproceedings{pian2023omnisonic,
  title={OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text},
  author={Pian, Weiguo and Kushwaha, Saksham Singh and Chen, Zhimin and Deng, Shijian and Wang, Kai and Guo, Yunhui and Tian, Yapeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

CVPR 2026
