OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
On-screen environmental sound: Emergency vehicle sirens blare as people converse in the bustling city street.
Off-screen environmental sound: A baby is crying loudly.
Off-screen speech: “Whoa, what's all that noise?”
On-screen environmental sound: Birds chirp and tweet as a Baltimore oriole calls out, creating a lively atmosphere in a garden with a feeder.
Off-screen speech: Look! What can we see! A pair of birds enjoying the orange!
On-screen speech: This ocean, it's a force, a wild, untamed might. And she commands your awe with every breaking light.
Off-screen environmental sound: The sea breeze blows on the sea, raising waves.
On-screen speech: Hi everyone, thanks for joining me today, let's dive into your latest project and discuss next concrete steps.
Off-screen environmental sound: The sound of birds chirp.
On-screen environmental sound: A dog growls persistently, indicating its presence in a domestic setting, possibly as a pet.
Off-screen speech: Hey, easy now. It's okay...relax buddy.
@inproceedings{pian2023omnisonic,
title={OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text},
author={Pian, Weiguo and Kushwaha, Saksham Singh and Chen, Zhimin and Deng, Shijian and Wang, Kai and Guo, Yunhui and Tian, Yapeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}