DeepMind has unveiled V2A, an AI-powered video-to-audio technology that generates audio directly from a video input rather than from a text description. As this technology matures, it’s going to have an outsized impact on video creation workflows.
The company says the system can generate high-quality audio (including music, sound effects, and voiceovers) that is contextually appropriate and synchronized with the visual content. There are some examples on DeepMind’s Generating Audio for Video blog post.
Like it or not, this is the future. Inevitably, this technology will evolve to do exactly what DeepMind says it will do. Imagine describing your video to a text-to-video model like OpenAI’s Sora or Google’s Veo, then having DeepMind’s V2A create the voiceover, soundtrack, and sound effects. Now, imagine combining these technologies into one interface where you simply describe what you want and have the AI create a complete production for you.
Depending on what you do for a living, this is either heaven or hell – perhaps it’s a bit of both. Purists, you don’t need to make the argument that AI will never be able to do what a human can do. If this tech stack is suitable for just 20 percent of video projects (corporate communications, memes, simple how-to videos, etc.), the industry is going to feel it. As that percentage creeps up – and it will – we’re going to be in a new world.
Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it. This work was created with the assistance of various generative AI models.