Google's New AI Video Generator Can Talk

Google just launched its latest AI video generator, Veo 3. It doesn't just render videos; it also automatically adds speech, such as dialogue and voice-overs.

May 21, 2025

If you were impressed by Veo 2, you’re going to be completely blown away by Veo 3.

Google I/O 2025 just wrapped up, and it was an absolute overload of AI announcements. Many people, myself included, are still picking their jaws up off the floor. But out of all the launches, Veo 3 is one of the most exciting for me.

I’ll talk about the rest of the announcements in a separate post, but for now, let’s focus on Google’s latest generative video model.

What’s New in Veo 3

Here’s a quick breakdown of the major upgrades:

Improved quality and better physics rendering when generating videos from text and image prompts
Higher resolution, with 4K output
Improved prompt adherence, meaning more accurate responses to your instructions
Automatic speech, such as dialogue and voice-overs
Native audio generation, such as music and sound effects

That’s right, Veo 3 can now add dialogue automatically. For me, that’s the most jaw-dropping feature of all. It likely builds on DeepMind’s earlier “video-to-audio” (V2A) research, announced in June 2024.

If you want to see how good it really is, Google DeepMind shared a few sample videos with character dialogue in this X post:

Looking at the sample videos, I think we are already seeing the next generation of AI filmmaking.

We’re Entering AI Filmmaking 2.0

Gone are the days when you’d have to generate a video on one platform, say Kling, write a script with ChatGPT, feed that script to another tool for audio like ElevenLabs, and then run a separate AI model to sync the lips with the dialogue.

It was a complicated workflow that could easily take hours, if not days. And that’s not even counting the cost of juggling several different tools and subscriptions.

With Veo 3, all of that gets compressed into a single pipeline. One prompt. One tool. And somehow, it pulls everything together, both visually and audibly.

Let’s take this scene, for example:

Prompt: A medium shot frames an old sailor, his knitted blue sailor hat casting a shadow over his eyes, a thick grey beard obscuring his chin. He holds his pipe in one hand, gesturing with it towards the churning, grey sea beyond the ship’s railing. “This ocean, it’s a force, a wild, untamed might. And she commands your awe, with every breaking light”
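
The sample prompt follows a simple pattern: a visual description of the shot, followed by the spoken line in quotation marks. If you want to reuse that pattern across many scenes, here is a minimal Python sketch. Note that `build_scene_prompt` is a hypothetical helper of my own, not part of any Google SDK, and the idea that quoting the line is what cues the dialogue is an assumption based on this one sample.

```python
from typing import Optional


def build_scene_prompt(description: str, dialogue: Optional[str] = None) -> str:
    """Combine a visual description with an optional spoken line.

    Hypothetical helper: it only assembles text and calls no video API.
    """
    text = description.strip()
    # Close the description with a period if it has no end punctuation.
    if not text.endswith((".", "!", "?")):
        text += "."
    if dialogue:
        # Wrap the spoken line in double quotes, mirroring the sample prompt.
        # Whether the model requires quotes is an assumption.
        text += f' "{dialogue.strip()}"'
    return text


prompt = build_scene_prompt(
    "A medium shot frames an old sailor gesturing with his pipe "
    "towards the churning, grey sea",
    "This ocean, it's a force, a wild, untamed might.",
)
print(prompt)
```

Leaving out the `dialogue` argument gives you a description-only prompt, which is handy for comparing a scene with and without a spoken line.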
