Veo 3.1 Lite vs Fast: what you actually lose at a third of the price
1. Lite costs 67% less than Fast at 720p ($0.40 vs $1.20 per 8-second clip)
2. The quality gap is small on controlled scenes (product, dialogue, abstract) but large on complex motion
3. Both models include audio on every generation; audio is not a Lite tradeoff
4. Lite generated 35% faster than Fast on average (42s vs 65s)
5. Fast's price drops on April 7, narrowing Lite's advantage from 67% to 50% at 720p
6. All 16 test videos are playable in-article for direct comparison

Google launched Veo 3.1 Lite today, its most cost-effective video generation model. At $0.05 per second for 720p output, an 8-second clip costs $0.40. The same clip on Veo 3.1 Fast costs $1.20. That is a 67% price difference, with what Google claims is "the same speed."
I ran 7 text-to-video prompts and one image-to-video test through both models via the Gemini API to find out what the price difference actually buys. Every text-to-video generation was 720p, 8 seconds, with identical prompt text. The results were more uneven than I expected.
The short version
Lite is not a uniform downgrade. On controlled, low-motion scenes (a rotating product, a woman at a window, ink in water) the quality gap is small. On complex motion, spatial depth, and physics (a barista pouring latte art, a falcon diving) the gap is dramatic.
Note: Every generation is non-deterministic; run the same prompt twice and you may get different results. What follows is what I observed across 23 generations.
Where Lite holds up
Dialogue scenes. A close-up of a woman turning at a rainy window, speaking a line of dialogue. Both models produce coherent lip movement, appropriate rain texture, and matching ambient audio. Lite's skin detail is slightly softer, but you would not pick it as the cheaper model in a blind test.
Product shots. A matte-black headphone rotating on a white surface. Lite handles the clean studio setup well because the scene is physically constrained: one object, one motion, controlled lighting. The reflection detail is slightly less precise on Lite, but the output looks commercially usable to my eye.
Abstract and stylized. Ink drops swirling in water with fractal patterns. Both models generate compelling macro footage. The colour palette and temporal pacing are close. This makes sense: abstract content has no "correct" physics to violate, so the model's reduced capacity does not have an obvious failure mode.
Portrait and social (where Lite actually won). I prompted a street musician playing guitar in a European alley at golden hour. Lite produced a beautiful wide shot with warm golden tones, consistent framing, and natural subtle movement. Fast hallucinated an audience of 15+ people that the prompt never mentioned, zoomed in aggressively to an extreme close-up (losing the alley scene entirely), then pulled back out. The face changed between the wide and close-up frames, breaking continuity. In this case, Lite was the better output.
Where Lite breaks down
Human motion and physics. I prompted a barista pouring steamed milk into a latte. Lite started the pour correctly: two hands, one steadying the cup and one tilting the pitcher, with steamed milk flowing. In the final frames, though, the cup began floating above the counter and the barista grew a third hand pouring from a second pitcher. Fast produced a completely different interpretation: a cinematic beauty shot of the finished latte with bokeh, skipping the pour mechanics entirely. Note: neither model rendered the rosetta pattern forming during the pour; in both outputs the pattern was already present before the pour began.
Fast motion. This was the largest gap. I prompted a peregrine falcon diving through cloudy sky in slow motion with a camera track. Fast generated an actual diving trajectory with the camera following the descent, feathers rippling convincingly. Lite produced a bird hovering in place with what looked like a radial motion blur applied as a visual shorthand for speed. The dive never happened. The model understood "fast bird + sky + dramatic" but struggled with the temporal arc of a dive sequence.
The image-to-video test
Both Lite and Fast support image-to-video generation. I tested this with a photo of my Bengal cat Alfie, standing on a bed looking directly at the camera. The prompt asked the cat to walk forward, paws stepping carefully, tail swaying.
Generation times for the seven text-to-video tests (one run per model per test):
| Category | Prompt | Lite time | Fast time |
|---|---|---|---|
| Human Motion / Physics | A barista pours steamed milk into a latte, creating a rosetta pattern, steam rising | 42s | 63s |
| Dialogue Scene | A woman at a rainy window turns to camera and says 'I think the storm is finally passing' | 42s | 52s |
| Product / Commercial | A sleek matte-black headphone rotating slowly on a white surface, soft studio lighting | 32s | 48s |
| Abstract / Stylized | Time-lapse of ink drops falling into water in slow motion, swirling into fractal patterns | 32s | 62s |
| Fast Motion | A peregrine falcon diving through cloudy sky in slow motion, feathers rippling | 52s | 104s |
| Simple / Low Complexity | A single red balloon floating upward against a clear blue sky, gentle breeze | 39s | 63s |
| Portrait / Social (9:16) | A street musician playing acoustic guitar on a cobblestone European alley at golden hour | 42s | 63s |
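The single-run times in the table can be summarized in a few lines of Python. Note these are one run per test; the article-level averages quoted later also include repeat runs, so the numbers differ slightly:

```python
from statistics import mean

# Generation times in seconds from the table above: (Lite, Fast) per test.
times = {
    "human motion": (42, 63),
    "dialogue":     (42, 52),
    "product":      (32, 48),
    "abstract":     (32, 62),
    "fast motion":  (52, 104),
    "simple":       (39, 63),
    "portrait":     (42, 63),
}

lite_mean = mean(t[0] for t in times.values())  # ~40.1 s on these runs
fast_mean = mean(t[1] for t in times.values())  # 65.0 s
speedup = 1 - lite_mean / fast_mean             # ~0.38 on these single runs
print(f"Lite {lite_mean:.1f}s vs Fast {fast_mean:.1f}s ({speedup:.0%} faster)")
```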
Audio: not a tradeoff
Every Lite output included an audio stream: all 13 Lite generations and all 10 Fast generations carried AAC audio tracks. Audio quality seemed slightly better on Fast (richer ambient sound) in my tests, but Lite's audio is fully functional, not degraded placeholder audio.
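The audio check from the methodology is easy to reproduce. A small sketch that shells out to `ffprobe` and inspects its JSON output for an AAC stream (the helper names are mine, not from the article):

```python
import json
import subprocess

def audio_codecs(ffprobe_json: str) -> list[str]:
    """Extract codec names of audio streams from ffprobe JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [s["codec_name"] for s in streams if s.get("codec_type") == "audio"]

def has_aac_audio(path: str) -> bool:
    """Run ffprobe on a video file and check for an AAC audio track."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return "aac" in audio_codecs(out)

# Usage: has_aac_audio("balloon_lite.mp4")
```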
Speed: Lite is actually faster
Google says Lite offers "the same speed" as Fast. In my testing, Lite was consistently faster to generate. Average generation time across all prompts:
Lite: 42 seconds average (range: 32 to 52 seconds)
Fast: 65 seconds average (range: 48 to 104 seconds)
Lite was 35% faster on average. The outlier was Fast's falcon prompt, which took 104 seconds, nearly double any other generation. This speed advantage is notable because Google's pricing page already positions Lite as the high-volume option. Faster generation means higher throughput for batch workflows.
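For sequential batch workflows, those averages translate directly into throughput. A back-of-envelope sketch using the observed means:

```python
LITE_SEC, FAST_SEC = 42, 65  # average wall-clock seconds per 8-second clip

def clips_per_hour(seconds_per_clip: int) -> float:
    """Sequential throughput: clips generated per hour of wall-clock time."""
    return 3600 / seconds_per_clip

print(f"Lite: {clips_per_hour(LITE_SEC):.0f} clips/hour")  # ~86
print(f"Fast: {clips_per_hour(FAST_SEC):.0f} clips/hour")  # ~55
```

Parallel API calls would scale both numbers, but the roughly 1.5x ratio between them holds either way.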
The pricing picture
The confirmed per-second rates, directly from Google's pricing page:
| Tier | Lite | Fast (today) | Fast (Apr 7) | Standard |
|---|---|---|---|---|
| 720p | $0.05/s | $0.15/s | $0.10/s | $0.40/s |
| 1080p | $0.08/s | $0.15/s | $0.12/s | $0.40/s |
| 4K | N/A | $0.35/s | $0.30/s | $0.60/s |
The timing matters. On April 7, Fast's 720p price drops from $0.15 to $0.10 per second. That cuts Lite's current 67% cost advantage to 50%. At 1080p, the gap narrows from 47% to 33%. If you are evaluating Lite for a production pipeline, factor in the April 7 price drop.
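The before/after percentages follow directly from the rate table; a small script to reproduce them:

```python
RATES = {  # $ per second at each tier, from Google's pricing page
    "720p":  {"lite": 0.05, "fast_now": 0.15, "fast_apr7": 0.10},
    "1080p": {"lite": 0.08, "fast_now": 0.15, "fast_apr7": 0.12},
}

def advantage(lite: float, fast: float) -> float:
    """Lite's cost advantage as a fraction of the Fast price."""
    return 1 - lite / fast

for tier, r in RATES.items():
    now = advantage(r["lite"], r["fast_now"])
    later = advantage(r["lite"], r["fast_apr7"])
    print(f"{tier}: {now:.0%} today -> {later:.0%} after April 7")
```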
Confirmed Lite limitations beyond pricing: no 4K output, no video Extension (clip chaining), and a single video per API call.
When to use Lite
Use Lite for product shots, abstract backgrounds, dialogue scenes, and any application where the camera is relatively static and the scene is physically simple. At $0.40 per 8-second clip, it is viable for high-volume content generation where Fast's $1.20 per clip adds up.
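At volume, the per-clip difference adds up quickly. A quick cost projection at 720p, using the published per-second rates (1,000 clips is an illustrative batch size, not from the article):

```python
CLIP_SECONDS = 8
RATE = {"lite": 0.05, "fast_today": 0.15, "fast_apr7": 0.10}  # $/s at 720p

def batch_cost(model: str, clips: int) -> float:
    """Total cost in dollars for a batch of 8-second clips."""
    return RATE[model] * CLIP_SECONDS * clips

for model in RATE:
    print(f"{model}: ${batch_cost(model, 1000):,.2f} per 1,000 clips")
```

Even after the April 7 cut, a thousand Lite clips cost $400 against Fast's $800.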
Methodology
23 total generations via the Gemini API using google-genai Python SDK (13 Lite, 10 Fast). Some prompts were run more than once to check for consistency; the 16 videos shown in the comparison above are one per model per test. All text-to-video tests: 720p, 8 seconds, single output per call. Models: veo-3.1-lite-generate-preview and veo-3.1-fast-generate-preview. Image-to-video tested with a single source photograph at 9:16 aspect ratio. Audio presence confirmed via ffprobe stream analysis. Generation times measured programmatically (wall clock, including API polling). Costs calculated from published per-second rates. Total benchmark cost: $17.38.

Written by
Joseph Nordqvist
Joseph founded AI News Home in 2026. He studied marketing and later completed a postgraduate program in AI and machine learning (business applications) at UT Austin’s McCombs School of Business. He is now pursuing an MSc in Computer Science at the University of York.
Editorial Transparency
This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors. Read our Editorial Guidelines
References
1. Build with Veo 3.1 Lite, our most cost-effective video generation model — Alisa Fortin, Guillaume Vernade, Google DeepMind, March 31, 2026 (primary source)