We spent the last week generating talking head videos with Google's Veo 3. Here's what actually works — and what the tutorials don't tell you.
The Promise vs. The Reality
Veo 3 is impressive. You give it a reference image and a script, and it generates video with synchronized lip movements and natural audio. No green screen. No filming. No editing dialogue to match mouth movements.
But "impressive" doesn't mean "ready to post."
Our first attempts looked like AI. The pacing was off. The delivery felt robotic. The energy changed randomly between clips. It took dozens of generations to figure out what was going wrong — and how to fix it.
Here's what we learned.
The Pacing Problem Nobody Talks About
This one cost us hours.
Veo 3 adjusts speaking pace to fit your entire script into the clip duration. Write 10 words? Slow, measured delivery. Write 25 words? Your AI presenter sounds like an auctioneer.
The model doesn't fail gracefully. It just speeds up.
The fix: Write scripts to a word count, not a thought.
| Feel You Want | Words for 8-sec clip |
|---|---|
| Slow, emphatic | 10-12 words |
| Conversational | 12-15 words |
| Energetic | 15-18 words |
Natural speech runs about 2-2.5 words per second, so 8 seconds of nonstop talking would hold 16-20 words. Leave breathing room for pauses and 12-15 words is your sweet spot.
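If you want to catch an overlong script before spending a generation on it, the table above is easy to turn into a check. This is a minimal sketch, not part of any Veo tooling: the function name `check_script` and the short feel labels are made up for illustration, and the ranges are copied straight from the table.

```python
# Word-count targets for one 8-second clip, copied from the table above.
WORD_BUDGETS = {
    "slow": (10, 12),
    "conversational": (12, 15),
    "energetic": (15, 18),
}

CLIP_SECONDS = 8  # clip length used throughout this post


def check_script(script: str, feel: str = "conversational") -> dict:
    """Compare a script's word count with the target range for a feel."""
    low, high = WORD_BUDGETS[feel]
    words = len(script.split())  # whitespace-separated words
    if words < low:
        verdict = "short"
    elif words > high:
        verdict = "long"
    else:
        verdict = "ok"
    return {
        "words": words,
        "words_per_second": words / CLIP_SECONDS,
        "target": (low, high),
        "verdict": verdict,
    }


if __name__ == "__main__":
    line = "Most founders write their pitch backwards. Start with the problem, not the product."
    print(check_script(line))
```

Counting whitespace-separated words is crude (numbers and abbreviations take longer to say than they look), so treat the verdict as a first filter rather than a guarantee.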
You can also adjust clip speed in post-production. Slightly slow down a rushed segment, speed up a draggy one. More reliable than fighting the model.
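For the post-production route, the speed multiplier is simple arithmetic. The sketch below assumes your editor exposes speed as a multiplier (1.0 = unchanged) and that you pick your own target pace; `speed_factor` is a hypothetical helper, not a CapCut or Veo feature.

```python
def speed_factor(word_count: int, clip_seconds: float, target_wps: float) -> float:
    """Playback-speed multiplier that brings a clip to the target pace.

    Values below 1.0 slow the clip down, values above 1.0 speed it up.
    """
    actual_wps = word_count / clip_seconds
    return target_wps / actual_wps


def adjusted_duration(clip_seconds: float, factor: float) -> float:
    """Clip length after applying the speed multiplier."""
    return clip_seconds / factor


if __name__ == "__main__":
    # A rushed segment: 20 words in 8 seconds (2.5 words per second),
    # slowed toward an assumed target of 2.0 words per second.
    factor = speed_factor(word_count=20, clip_seconds=8, target_wps=2.0)
    print(f"speed x{factor:.2f}, new length {adjusted_duration(8, factor):.1f}s")
```

Keep in mind that slowing a clip also lengthens it, so a rushed 8-second segment slowed this way no longer fits an 8-second slot.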
What NOT to Put in Your Prompts
We wasted generations on these mistakes:
Don't describe physical appearance. Your reference image handles this. Adding "27-year-old woman with brown hair" confuses things — or worse, the model tries to reconcile your description with the image and creates something uncanny.
Don't describe the scene. If you're using a keyframe image, the model sees it. Describing "coffee shop with morning light" when your keyframe already shows that just adds noise.
Don't mention ethnicity. We wrote "Indian-American woman" and got an Indian accent. The model made an assumption we didn't want. "Neutral American accent" is explicit and works.
Don't skip the audio instruction. Veo 3 adds background music by default. Every prompt needs "No background music" or you'll get lo-fi beats under your serious business advice.
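These four rules are mechanical enough to check before you hit generate. The following is a rough sketch: the term lists are short, hypothetical examples you would extend yourself, and `lint_prompt` is an invented name. It only inspects the direction text, not the dialogue after `Script:`, since the dialogue is free to mention anything.

```python
import re

# Hypothetical starter lists; extend them for your own prompts.
CHECKS = {
    "appearance": ["year-old", "hair", "eyes", "wearing"],
    "scene": ["coffee shop", "morning light"],
    "ethnicity": ["indian-american"],
}


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for the prompt mistakes listed above."""
    # Only the direction part is checked; the spoken script is left alone.
    direction = prompt.split("Script:", 1)[0].lower()
    warnings = []
    for label, terms in CHECKS.items():
        hits = [t for t in terms if re.search(rf"\b{re.escape(t)}\b", direction)]
        if hits:
            warnings.append(f"{label}: {', '.join(hits)}")
    if "no background music" not in direction:
        warnings.append("missing 'No background music'")
    return warnings


if __name__ == "__main__":
    bad = 'A 27-year-old woman with brown hair speaks to camera.\nScript: "Hello."'
    for warning in lint_prompt(bad):
        print(warning)
```

A keyword list will never catch every description, so this is a safety net for the obvious slips, not a replacement for rereading the prompt.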
What TO Put in Your Prompts
After testing dozens of variations, this structure works:
[Name] speaks directly to camera. Neutral American accent,
[pace description]. [Specific direction]. Clearly enunciates.
No background music.
Script: "[your dialogue]"
The phrase "clearly enunciates" improves lip-sync quality noticeably. "Measured pace" or "conversational energy" helps with consistency.
For specific moments, add direction:
- "leans in slightly on the final point"
- "pauses for one second, then continues"
- "slight smile forming"
You can't control exact timing, but these nudge the model in the right direction.
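If you generate many segments, it helps to assemble the prompt from parts so the fixed phrases never get dropped. This is a minimal sketch of the structure above as plain string building; `build_prompt` and the persona name "Maya" are placeholders, and nothing here calls Veo itself.

```python
def build_prompt(
    name: str,
    script: str,
    pace: str = "measured pace",
    direction: str = "",
    accent: str = "Neutral American accent",
) -> str:
    """Fill in the prompt structure described above."""
    parts = [f"{name} speaks directly to camera.", f"{accent}, {pace}."]
    if direction:
        # Turn a fragment like "leans in slightly" into its own sentence.
        sentence = direction.strip().rstrip(".")
        parts.append(sentence[0].upper() + sentence[1:] + ".")
    parts.append("Clearly enunciates.")
    parts.append("No background music.")
    return " ".join(parts) + f'\nScript: "{script}"'


if __name__ == "__main__":
    print(
        build_prompt(
            name="Maya",
            script="Start with the problem, not the product.",
            direction="leans in slightly on the final point",
        )
    )
```

Because "Clearly enunciates" and "No background music" are hard-coded, they appear in every prompt even when you forget to think about them.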
The Consistency Problem
Our bigger challenge: making three 8-second segments feel like one video.
Segment 1 would be calm and measured. Segment 2 would suddenly be intense. Segment 3 would shift to something else entirely. Same script format, same reference image, completely different energy.
The fix: Define a baseline delivery once, then note only variations.
Instead of describing the full performance every segment, create a "baseline" for each persona:
She speaks in a conversational, measured pace. Warm but direct energy — confident without being intense. Clear and grounded, like explaining something to a smart colleague.
Then each segment just says: "Baseline delivery, slight emphasis on the last line."
This cut our variation problems significantly.
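One way to keep the baseline honest is to store it once per persona and derive every segment's delivery notes from it. The sketch below assumes the baseline text is pasted verbatim into each segment prompt, on the guess that each generation has no memory of the previous one. `Persona` and `segment_delivery` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    baseline: str  # written once, reused unchanged in every segment


def segment_delivery(persona: Persona, variation: str = "") -> str:
    """Baseline delivery text plus an optional per-segment variation."""
    if not variation:
        return persona.baseline
    return f"{persona.baseline} Variation for this segment: {variation.rstrip('.')}."


if __name__ == "__main__":
    maya = Persona(
        name="Maya",  # hypothetical persona
        baseline=(
            "She speaks in a conversational, measured pace. "
            "Warm but direct energy, confident without being intense."
        ),
    )
    variations = ["", "", "slight emphasis on the last line"]
    for number, variation in enumerate(variations, start=1):
        print(f"Segment {number}: {segment_delivery(maya, variation)}")
```

Freezing the dataclass is a small guard against editing the baseline halfway through a video, which is exactly the drift this section is trying to avoid.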
The Keyframe Matters More Than You Think
Your Midjourney prompt for the keyframe image shapes everything.
What to include:
- Camera angle — "camera at arm's length, slight angle from below"
- Mouth position — "mouth slightly open mid-sentence" (helps lip-sync start naturally)
- Hand position — "right hand raised at chest level, fingers spread mid-gesture"
- Expression — "engaged, about to share something important"
- Lighting direction — "soft window light from the left"
What to avoid:
- Multiple people in focus
- Hands in awkward positions (they'll animate awkwardly)
- Complex backgrounds that compete for attention
- Harsh shadows on the face
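The "include" checklist above maps naturally onto a small prompt builder, so no element gets forgotten. This is a sketch of string assembly only: the `subject` field and the function name are assumptions added for illustration, while `--ar` is Midjourney's aspect-ratio parameter.

```python
def keyframe_prompt(
    subject: str,
    camera: str,
    mouth: str,
    hands: str,
    expression: str,
    lighting: str,
    aspect_ratio: str = "9:16",
) -> str:
    """Join the checklist items into one Midjourney prompt string."""
    details = [subject, camera, mouth, hands, expression, lighting]
    body = ", ".join(d.strip() for d in details if d.strip())
    return f"{body} --ar {aspect_ratio}"


if __name__ == "__main__":
    print(
        keyframe_prompt(
            subject="presenter recording a video in a home office",  # hypothetical
            camera="camera at arm's length, slight angle from below",
            mouth="mouth slightly open mid-sentence",
            hands="right hand raised at chest level, fingers spread mid-gesture",
            expression="engaged, about to share something important",
            lighting="soft window light from the left",
        )
    )
```

Making every checklist item a required argument means a missing hand position or lighting direction fails loudly instead of silently producing a weaker keyframe.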
We built a library of 8-9 scenes per persona. Professional settings, casual settings, intimate settings. Each tagged by mood so we can match scene to content.
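A mood-tagged library can be as simple as a list of records and a filter. The file names, settings, and mood tags below are invented examples, and `scenes_for_mood` is a hypothetical helper; a spreadsheet would work just as well.

```python
# Invented example entries; a real library would hold 8-9 scenes per persona.
KEYFRAME_LIBRARY = [
    {"file": "office_desk.png", "setting": "professional", "moods": {"confident", "serious"}},
    {"file": "kitchen_morning.png", "setting": "casual", "moods": {"warm", "relaxed"}},
    {"file": "sofa_evening.png", "setting": "intimate", "moods": {"warm", "reflective"}},
]


def scenes_for_mood(mood: str, library: list[dict] = KEYFRAME_LIBRARY) -> list[str]:
    """Return the keyframe files tagged with the given mood."""
    return [entry["file"] for entry in library if mood in entry["moods"]]


if __name__ == "__main__":
    print(scenes_for_mood("warm"))
```

Allowing several mood tags per scene keeps the library small, since one keyframe can serve more than one kind of script.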
The Production Workflow That Works
Here's our actual process:
- Write the script — 3 segments, 12-15 words each
- Generate keyframes — Midjourney with detailed prompts, --ar 9:16
- Generate clips — Veo 3 with keyframe + script + delivery notes
- Review and regenerate — Usually takes 2-3 tries per segment
- Stitch in CapCut — Adjust pacing, add captions
- Export and post — With caption and hashtags ready
Total time for a 24-second video: about 45 minutes once you have your keyframes. The keyframe library is the investment that pays off.
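The first step, cutting a script into segments, can be roughed out in code. The sketch below greedily packs whole sentences into segments of at most 15 words. It is one possible approach rather than the process described above, it only enforces the upper limit, and a single sentence longer than the limit is left as its own over-budget segment for you to rewrite.

```python
import re


def split_script(script: str, max_words: int = 15) -> list[str]:
    """Greedily pack whole sentences into segments of at most max_words."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    segments: list[str] = []
    current: list[str] = []
    current_words = 0
    for sentence in sentences:
        count = len(sentence.split())
        # Start a new segment when this sentence would overflow the budget.
        if current and current_words + count > max_words:
            segments.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += count
    if current:
        segments.append(" ".join(current))
    return segments


if __name__ == "__main__":
    script = (
        "Most founders pitch the product first. That is backwards. "
        "Start with the problem your customer already feels. "
        "Then show the cost of ignoring it. "
        "Only then introduce what you built. Keep it short."
    )
    for segment in split_script(script):
        print(len(segment.split()), segment)
```

Splitting on sentence boundaries keeps each clip ending on a natural pause, which also makes the cuts easier to hide when stitching.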
Is Veo 3 Worth It?
For $20/month through Google One, you get 3 videos per day. That's 90 videos a month — more than enough for consistent posting.
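It is worth translating that quota into finished videos. The arithmetic below assumes each of the 3 daily videos is a single 8-second generation, that a finished video has 3 segments, and that each segment takes 2-3 tries, as in the workflow above. If your plan counts usage differently, change the defaults.

```python
def finished_videos_per_month(
    generations_per_day: int = 3,
    days: int = 30,
    segments_per_video: int = 3,
    tries_per_segment: tuple[int, int] = (2, 3),
) -> tuple[int, int]:
    """Return (worst case, best case) finished videos per month."""
    total_generations = generations_per_day * days
    fewest_tries, most_tries = tries_per_segment
    worst = total_generations // (segments_per_video * most_tries)
    best = total_generations // (segments_per_video * fewest_tries)
    return worst, best


if __name__ == "__main__":
    worst, best = finished_videos_per_month()
    print(f"{worst}-{best} finished 24-second videos per month")
```

Under those assumptions, 90 generations work out to roughly 10-15 finished 24-second videos a month, which is a more useful planning number than the raw quota.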
The native lip-sync is why we chose it over alternatives. Kling has better motion quality, but lip-sync is a separate step. Runway needs add-ons. Pika is cheaper but less realistic.
For talking head content — tips, advice, thought leadership — Veo 3 hits the sweet spot of quality and convenience.
What We're Still Figuring Out
- Timing anchors — "pauses for 1 second" sometimes works, sometimes doesn't
- Extension feature — Veo can extend clips up to 148 seconds; haven't tested for dialogue
- Motion-heavy content — Walking, demonstrating products; might need Kling for this
- Multiple speakers — Haven't attempted; research says it's unreliable
The Bottom Line
AI video generation is real, but it's not magic. The gap between "technically possible" and "actually good" is filled with specific knowledge that only comes from doing the work.
Write to word counts. Define baseline deliveries. Build a keyframe library. Always say "no background music."
The tools will keep improving. But the fundamentals of clear communication — knowing what to say and how to say it — that's still on you.
