Veo 3.1 Lite vs Fast: what you actually lose at a third of the price
1. Lite costs 67% less than Fast at 720p ($0.40 vs $1.20 per 8-second clip)
2. The quality gap is small on controlled scenes (product, dialogue, abstract) but large on complex motion
3. Both models include audio on every generation; audio is not a Lite tradeoff
4. Lite generated 35% faster than Fast on average (42s vs 65s)
5. Fast's price drops on April 7, narrowing Lite's advantage from 67% to 50% at 720p
6. All 16 test videos are playable in-article for direct comparison

Google launched Veo 3.1 Lite today, its most cost-effective video generation model. At $0.05 per second for 720p output, an 8-second clip costs $0.40. The same clip on Veo 3.1 Fast costs $1.20. That is a 67% price difference, with what Google claims is "the same speed."
I ran 7 text-to-video prompts and one image-to-video test through both models via the Gemini API to find out what the price difference actually buys. Every text-to-video generation was 720p, 8 seconds, with identical prompt text. The results were more uneven than I expected.
The short version
Lite is not a uniform downgrade. On controlled, low-motion scenes (a rotating product, a woman at a window, ink in water) the quality gap is small. On complex motion, spatial depth, and physics (a barista pouring latte art, a falcon diving) the gap is dramatic.
Note: Every generation is non-deterministic; run the same prompt twice and you may get different results. What follows is what I observed across 23 generations.
Where Lite holds up
Dialogue scenes. A close-up of a woman turning at a rainy window, speaking a line of dialogue. Both models produce coherent lip movement, appropriate rain texture, and matching ambient audio. Lite's skin detail is slightly softer, but you would not pick it as the cheaper model in a blind test.
Product shots. A matte-black headphone rotating on a white surface. Lite handles the clean studio setup well because the scene is physically constrained: one object, one motion, controlled lighting. The reflection detail is slightly less precise on Lite, but the output looks commercially usable to my eye.
Abstract and stylized. Ink drops swirling in water with fractal patterns. Both models generate compelling macro footage. The colour palette and temporal pacing are close. This makes sense: abstract content has no "correct" physics to violate, so the model's reduced capacity does not have an obvious failure mode.
Portrait and social (where Lite actually won). I prompted a street musician playing guitar in a European alley at golden hour. Lite produced a beautiful wide shot with warm golden tones, consistent framing, and natural subtle movement. Fast hallucinated an audience of 15+ people that the prompt never mentioned, zoomed in aggressively to an extreme close-up (losing the alley scene entirely), then pulled back out. The face changed between the wide and close-up frames, breaking continuity. In this case, Lite was the better output.
Where Lite breaks down
Human motion and physics. I prompted a barista pouring steamed milk into a latte. Lite started the pour correctly: two hands, one steadying the cup and one tilting the pitcher, with steamed milk flowing. In the final frames, though, the cup began floating above the counter and the barista grew a third hand pouring from a second pitcher. Fast produced a completely different interpretation: a cinematic beauty shot of the finished latte with bokeh, skipping the pour mechanics entirely. Note: neither model rendered the rosetta pattern forming during the pour; in both outputs the pattern was already present before the pour began.
Fast motion. This was the largest gap. I prompted a peregrine falcon diving through cloudy sky in slow motion with a camera track. Fast generated an actual diving trajectory with the camera following the descent, feathers rippling convincingly. Lite produced a bird hovering in place with what looked like a radial motion blur applied as a visual shorthand for speed. The dive never happened. The model understood "fast bird + sky + dramatic" but struggled with the temporal arc of a dive sequence.
The image-to-video test
Both Lite and Fast support image-to-video generation. I tested this with a photo of my Bengal cat Alfie, standing on a bed looking directly at the camera. The prompt asked the cat to walk forward, paws stepping carefully, tail swaying.
Generation times for the seven text-to-video tests (one run per model per test):
| Category | Prompt | Lite time | Fast time |
|---|---|---|---|
| Human Motion / Physics | A barista pours steamed milk into a latte, creating a rosetta pattern, steam rising | 42s | 63s |
| Dialogue Scene | A woman at a rainy window turns to camera and says 'I think the storm is finally passing' | 42s | 52s |
| Product / Commercial | A sleek matte-black headphone rotating slowly on a white surface, soft studio lighting | 32s | 48s |
| Abstract / Stylized | Time-lapse of ink drops falling into water in slow motion, swirling into fractal patterns | 32s | 62s |
| Fast Motion | A peregrine falcon diving through cloudy sky in slow motion, feathers rippling | 52s | 104s |
| Simple / Low Complexity | A single red balloon floating upward against a clear blue sky, gentle breeze | 39s | 63s |
| Portrait / Social (9:16) | A street musician playing acoustic guitar on a cobblestone European alley at golden hour | 42s | 63s |
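The single-run times in the table can be summarized in a few lines of Python. Note these are one run per test; the article-level averages quoted later also include repeat runs, so the numbers differ slightly:

```python
from statistics import mean

# Generation times in seconds from the table above: (Lite, Fast) per test.
times = {
    "human motion": (42, 63),
    "dialogue":     (42, 52),
    "product":      (32, 48),
    "abstract":     (32, 62),
    "fast motion":  (52, 104),
    "simple":       (39, 63),
    "portrait":     (42, 63),
}

lite_mean = mean(t[0] for t in times.values())  # ~40.1 s on these runs
fast_mean = mean(t[1] for t in times.values())  # 65.0 s
speedup = 1 - lite_mean / fast_mean             # ~0.38 on these single runs
print(f"Lite {lite_mean:.1f}s vs Fast {fast_mean:.1f}s ({speedup:.0%} faster)")
```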
Audio: not a tradeoff
Every Lite output included an audio stream: all 13 Lite generations and all 10 Fast generations carried AAC audio tracks. Audio quality seemed slightly better on Fast (richer ambient sound) in my tests, but Lite's audio is fully functional, not degraded placeholder audio.
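The audio check from the methodology is easy to reproduce. A small sketch that shells out to `ffprobe` and inspects its JSON output for an AAC stream (the helper names are mine, not from the article):

```python
import json
import subprocess

def audio_codecs(ffprobe_json: str) -> list[str]:
    """Extract codec names of audio streams from ffprobe JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [s["codec_name"] for s in streams if s.get("codec_type") == "audio"]

def has_aac_audio(path: str) -> bool:
    """Run ffprobe on a video file and check for an AAC audio track."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return "aac" in audio_codecs(out)

# Usage: has_aac_audio("balloon_lite.mp4")
```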
Speed: Lite is actually faster
Google says Lite offers "the same speed" as Fast. In my testing, Lite was consistently faster to generate. Average generation time across all prompts:
Lite: 42 seconds average (range: 32 to 52 seconds)
Fast: 65 seconds average (range: 48 to 104 seconds)
Lite was 35% faster on average. The outlier was Fast's falcon prompt, which took 104 seconds, nearly double any other generation. This speed advantage is notable because Google's pricing page already positions Lite as the high-volume option. Faster generation means higher throughput for batch workflows.
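For sequential batch workflows, those averages translate directly into throughput. A back-of-envelope sketch using the observed means:

```python
LITE_SEC, FAST_SEC = 42, 65  # average wall-clock seconds per 8-second clip

def clips_per_hour(seconds_per_clip: int) -> float:
    """Sequential throughput: clips generated per hour of wall-clock time."""
    return 3600 / seconds_per_clip

print(f"Lite: {clips_per_hour(LITE_SEC):.0f} clips/hour")  # ~86
print(f"Fast: {clips_per_hour(FAST_SEC):.0f} clips/hour")  # ~55
```

Parallel API calls would scale both numbers, but the roughly 1.5x ratio between them holds either way.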
The pricing picture
The confirmed per-second rates, directly from Google's pricing page:
| Tier | Lite | Fast (today) | Fast (Apr 7) | Standard |
|---|---|---|---|---|
| 720p | $0.05/s | $0.15/s | $0.10/s | $0.40/s |
| 1080p | $0.08/s | $0.15/s | $0.12/s | $0.40/s |
| 4K | N/A | $0.35/s | $0.30/s | $0.60/s |
The timing matters. On April 7, Fast's 720p price drops from $0.15 to $0.10 per second. That cuts Lite's current 67% cost advantage to 50%. At 1080p, the gap narrows from 47% to 33%. If you are evaluating Lite for a production pipeline, factor in the April 7 price drop.
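The before/after percentages follow directly from the rate table; a small script to reproduce them:

```python
RATES = {  # $ per second at each tier, from Google's pricing page
    "720p":  {"lite": 0.05, "fast_now": 0.15, "fast_apr7": 0.10},
    "1080p": {"lite": 0.08, "fast_now": 0.15, "fast_apr7": 0.12},
}

def advantage(lite: float, fast: float) -> float:
    """Lite's cost advantage as a fraction of the Fast price."""
    return 1 - lite / fast

for tier, r in RATES.items():
    now = advantage(r["lite"], r["fast_now"])
    later = advantage(r["lite"], r["fast_apr7"])
    print(f"{tier}: {now:.0%} today -> {later:.0%} after April 7")
```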
Confirmed Lite limitations beyond pricing: no 4K output, no video Extension (clip chaining), and a single video per API call.
When to use Lite
Use Lite for product shots, abstract backgrounds, dialogue scenes, and any application where the camera is relatively static and the scene is physically simple. At $0.40 per 8-second clip, it is viable for high-volume content generation where Fast's $1.20 per clip adds up.
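At volume, the per-clip difference adds up quickly. A quick cost projection at 720p, using the published per-second rates (1,000 clips is an illustrative batch size, not from the article):

```python
CLIP_SECONDS = 8
RATE = {"lite": 0.05, "fast_today": 0.15, "fast_apr7": 0.10}  # $/s at 720p

def batch_cost(model: str, clips: int) -> float:
    """Total cost in dollars for a batch of 8-second clips."""
    return RATE[model] * CLIP_SECONDS * clips

for model in RATE:
    print(f"{model}: ${batch_cost(model, 1000):,.2f} per 1,000 clips")
```

Even after the April 7 cut, a thousand Lite clips cost $400 against Fast's $800.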
Methodology
23 total generations via the Gemini API using google-genai Python SDK (13 Lite, 10 Fast). Some prompts were run more than once to check for consistency; the 16 videos shown in the comparison above are one per model per test. All text-to-video tests: 720p, 8 seconds, single output per call. Models: veo-3.1-lite-generate-preview and veo-3.1-fast-generate-preview. Image-to-video tested with a single source photograph at 9:16 aspect ratio. Audio presence confirmed via ffprobe stream analysis. Generation times measured programmatically (wall clock, including API polling). Costs calculated from published per-second rates. Total benchmark cost: $17.38.

Written by
Joseph Nordqvist
Joseph founded AI News Home in 2026. He studied marketing and later completed a postgraduate program in AI and machine learning (business applications) at UT Austin’s McCombs School of Business. He is now pursuing an MSc in Computer Science at the University of York.
Editorial Transparency
This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors. Read our Editorial Guidelines
References
1. Build with Veo 3.1 Lite, our most cost-effective video generation model — Alisa Fortin, Guillaume Vernade, Google DeepMind, March 31, 2026 (primary source)