Netflix VOID: we tested their physics-aware video eraser with our own clips

Netflix just open-sourced a model that does something none of the big AI labs currently offer: removing objects from video while generating physically correct counterfactual outcomes, showing what would have happened if the object had never been there.
It's called VOID: Video Object and Interaction Deletion. It's built on Alibaba's CogVideoX-Fun-V1.5-5b-InP foundation model, fine-tuned on synthetic counterfactual training data, and released under Apache 2.0. The paper (arXiv 2604.02296) was posted on April 2, 2026 by Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, and Ta-Ying Cheng.
The headline claim from the paper: in a user study where participants compared outputs from seven different models, VOID was selected as the best 64.8% of the time. Runway came second at 18.4%.
We wanted to see whether the quality holds up on clips the team has never seen, so we rented an H100 and ran our own tests.
What VOID actually does
Most video object removal tools work like aggressive paint-over: mask the object, fill in the background, hope nobody notices. VOID takes a different approach. It asks: what would physically happen to the rest of the scene if this object had never been there?
The project page shows examples like removing a hand holding a glass (the glass falls) or removing a person pressing a blender (the blender stays off). The paper's abstract describes producing "physically consistent counterfactual outcomes."
The quadmask system
VOID's mask format is a four-value grayscale encoding called a quadmask. Instead of just marking what to remove, it tells the model what else will be physically affected:
Black (0): The primary object to remove
Dark grey (63): Where the primary object and affected regions overlap
Light grey (127): Regions that will change after removal (falling objects, altered trajectories)
White (255): Background to preserve
Generating a quadmask is semi-automatic. Based on the repo's pipeline code, it works in stages: a user clicks points on the object to remove (Stage 0), SAM2 segments it into a binary mask (Stage 1), a VLM analyzes which other objects will be physically affected by the removal (Stage 2), those affected objects are segmented and their post-removal positions estimated (Stage 3), and everything is combined into the final four-value quadmask (Stage 4). The repo's default VLM configuration uses gemini-3-pro-preview.
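To make the format concrete, here is a minimal sketch of the final fusion step (Stage 4), assuming you already have a binary mask for the primary object and one for the physically affected regions. The function name and fusion logic are ours for illustration, not the repo's code; only the four grayscale values follow VOID's convention.

```python
import numpy as np

def fuse_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two binary masks (True = inside the region) into a four-value
    grayscale quadmask: 0 = primary object, 63 = overlap of primary and
    affected regions, 127 = affected-only regions, 255 = background.

    Illustrative sketch only, not the VOID repo's implementation.
    """
    quadmask = np.full(primary.shape, 255, dtype=np.uint8)  # background to preserve
    quadmask[affected] = 127                                 # regions that change after removal
    quadmask[primary] = 0                                    # object to remove
    quadmask[primary & affected] = 63                        # overlap of the two
    return quadmask

# Toy example: a 4x4 frame where the object and the affected region overlap.
primary = np.zeros((4, 4), dtype=bool); primary[1:3, 1:3] = True
affected = np.zeros((4, 4), dtype=bool); affected[2:4, 2:4] = True
print(fuse_quadmask(primary, affected))
```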
Two-pass inference
Pass 1 is the primary inpainting pass. It takes the source video, quadmask, and a text prompt describing the desired background. The default configuration uses 50 denoising steps with a guidance scale of 1.0. In our tests, each step took about 1.4 seconds on an H100 (we ran with 30 steps rather than the default 50 to reduce processing time).
Pass 2 is optional. According to the README, it "uses optical flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency." This uses the technique from "Go-with-the-Flow" (Burgert et al., CVPR 2025 Oral), which replaces random temporal noise with correlated noise derived from optical flow fields. Pass 2 uses a separate checkpoint (void_pass2.safetensors) and defaults to 50 steps with a guidance scale of 6.0. Since it's a full additional inference pass, it roughly doubles the total processing time.
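To give a rough sense of what "flow-warped noise" means in practice: instead of sampling independent Gaussian noise for every frame, the noise for frame t is obtained by warping frame t-1's noise along the optical flow, so the noise pattern moves with the scene. The sketch below illustrates the idea on plain pixel grids with OpenCV; the actual Go-with-the-Flow implementation operates on latents and takes care to keep the warped noise properly distributed.

```python
import numpy as np
import cv2

def flow_warped_noise(flows: list[np.ndarray], shape: tuple[int, int]) -> list[np.ndarray]:
    """Generate per-frame noise where each frame's noise follows the optical
    flow from the previous frame rather than being sampled independently.

    flows[t] is an (H, W, 2) forward flow field from frame t to frame t+1.
    Conceptual sketch only.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    noise = [np.random.randn(h, w).astype(np.float32)]
    for flow in flows:
        # Backward warp: sample the previous frame's noise at the location each
        # pixel came from, so the noise pattern travels with the scene motion.
        map_x = xs - flow[..., 0]
        map_y = ys - flow[..., 1]
        warped = cv2.remap(noise[-1], map_x, map_y, cv2.INTER_LINEAR,
                           borderMode=cv2.BORDER_REFLECT)
        noise.append(warped)
    return noise

# Example: three frames of correlated noise for a 672x384 clip with zero flow.
frames = flow_warped_noise([np.zeros((384, 672, 2), np.float32)] * 2, (384, 672))
```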
Go-with-the-Flow was a collaboration between Netflix Eyeline Studios, Netflix, Stony Brook University, University of Maryland, and Stanford University.
Training data
VOID is fine-tuned on synthetic "counterfactual" video pairs — matched before/after videos where removing an object triggers physically correct consequences. The repo describes two data generation pipelines:
HUMOTO pipeline: Generates counterfactual videos from the HUMOTO motion capture dataset using Blender. Human characters interact with objects, then the human is removed and the physics engine simulates what happens to the objects.
Kubric pipeline: Generates counterfactual videos using the Kubric simulation engine with Google Scanned Objects. Objects interact via rigid-body dynamics; removing one object changes the other's trajectory.
The README states: "Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data."
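Conceptually, both pipelines do the same thing: render the scene twice, once with the interacting object and once without it, and let the physics engine resimulate the second case. The toy sketch below is only meant to illustrate that paired-outcome idea; the real pipelines drive Blender and Kubric, not a five-line simulator.

```python
import numpy as np

def glass_height(hand_present: bool, steps: int = 48, dt: float = 1 / 12,
                 start: float = 1.0, g: float = 9.81) -> np.ndarray:
    """Toy stand-in for the physics step: the height of a glass over time,
    with and without the hand holding it. The real pipelines run Blender
    (HUMOTO) or Kubric rigid-body dynamics instead of this."""
    heights = np.full(steps, start)
    if not hand_present:
        velocity = 0.0
        for t in range(1, steps):
            velocity += g * dt
            heights[t] = max(heights[t - 1] - velocity * dt, 0.0)  # fall until it reaches the table
    return heights

# A counterfactual training pair: identical scenes except for the removed object.
factual = glass_height(hand_present=True)          # "before": the hand keeps the glass in place
counterfactual = glass_height(hand_present=False)  # "after": with the hand removed, the glass drops
```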
How it was built
What makes VOID unusual is how it composes open tools from multiple organizations:
| Component | Source | Role |
|---|---|---|
| CogVideoX-Fun-V1.5-5b-InP | Alibaba PAI | 5B-parameter video diffusion transformer backbone |
| SAM2 | Meta | Primary object segmentation from user click points |
| Go-with-the-Flow | Netflix Eyeline Studios et al. | Flow-warped noise for Pass 2 temporal consistency |
| DeepSpeed ZeRO Stage 2 | Microsoft | Training infrastructure (8x A100 80GB, per the README) |
The quadmask pipeline also uses a VLM for reasoning about physical interactions, and the Kubric and HUMOTO pipelines for generating training data.
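For readers unfamiliar with the last table row, ZeRO Stage 2 shards optimizer states and gradients across the training GPUs, which is what lets a 5B-parameter backbone be fine-tuned on 8x 80GB cards. Below is a generic sketch of what that looks like in a fine-tuning script; the model and batch-size numbers are placeholders, not the VOID team's actual configuration.

```python
import torch
import deepspeed

# Placeholder standing in for the CogVideoX-Fun transformer being fine-tuned.
model = torch.nn.Linear(1024, 1024)

# Minimal ZeRO Stage 2 config: optimizer states and gradients are partitioned
# across GPUs. The batch-size values here are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

# Launch with the deepspeed launcher (e.g. across 8 GPUs) so the process
# group is initialized before this call.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```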
Our test results
We sourced 5 stock video clips from Pexels covering different removal scenarios: a static object on a textured surface, a walking person, a transparent glass, a small colored object, and a ball on grass. Each clip was preprocessed to VOID's required 672x384 resolution at 12fps.
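The preprocessing step is mechanical; a minimal OpenCV sketch of the resize-and-resample is below (filenames are placeholders, and our actual harness may have handled frame sampling slightly differently).

```python
import cv2

def preprocess_clip(src_path: str, dst_path: str,
                    width: int = 672, height: int = 384, target_fps: float = 12.0) -> None:
    """Resize a clip to VOID's required 672x384 resolution and resample it to
    12 fps by dropping frames. Sketch only; a production pipeline would use
    ffmpeg or proper frame blending for the fps conversion."""
    cap = cv2.VideoCapture(src_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             target_fps, (width, height))
    next_keep, t = 0.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Keep a frame whenever the source timeline passes the next 12 fps tick.
        if t / src_fps >= next_keep:
            writer.write(cv2.resize(frame, (width, height)))
            next_keep += 1.0 / target_fps
        t += 1
    cap.release()
    writer.release()

preprocess_clip("book_on_table.mp4", "book_on_table_672x384_12fps.mp4")
```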
Masks were generated using SAM2 with manually clicked point prompts (8–39 points per object, placed using a custom GUI to ensure precise placement on the target). We used simple binary masks rather than the full quadmask pipeline, since none of our test clips involved physical interactions that would require the grey mask regions.
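For reference, the point-prompted segmentation looks roughly like this with Meta's SAM2 image predictor, run per frame. The clicked coordinates are placeholders, and the exact API surface may differ slightly from what our harness used.

```python
import numpy as np
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load the SAM2 image predictor from Hugging Face.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def mask_from_clicks(frame_rgb: np.ndarray, clicks: np.ndarray) -> np.ndarray:
    """Segment one object in one frame from user-clicked points.
    clicks is an (N, 2) array of (x, y) pixel coordinates on the object."""
    with torch.inference_mode():
        predictor.set_image(frame_rgb)
        masks, scores, _ = predictor.predict(
            point_coords=clicks,
            point_labels=np.ones(len(clicks), dtype=np.int32),  # 1 = foreground click
            multimask_output=False,
        )
    return masks[0].astype(bool)  # binary mask, True where the object is
```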
We ran both Pass 1 (guidance scale 1.0, 30 steps) and Pass 2 (guidance scale 6.0, 50 steps with flow-warped noise) on an H100 80GB GPU via RunPod. The outputs shown below are from Pass 2.

| Test | Category | Object Removed | Notes |
|---|---|---|---|
| Book on Table | Static Object | White book on a round wooden table | Static object removal on textured wooden surface. Tests basic inpainting quality. |
| Person Walking | Person + Motion | Person walking on an urban sidewalk | Full person removal from a moving scene across 120 frames. |
| Water Glass | Transparent Object | Glass of water with reflections and refractions | Transparent objects are among the hardest for inpainting models. |
| Lemon on Table | Small Object | Fresh lemon (both halves) on a white table | Small colorful object against clean white surface. |
| Soccer Ball | Object on Grass | Soccer ball on grass in evening light | Object removal on natural grass texture with warm evening lighting. |
What we observed
Book on table: Clean removal. The wood grain pattern was reconstructed convincingly across the full surface where the book sat. The table's gold legs and surrounding scene were unaffected.
Person walking: The person and their shadow were both removed from the sidewalk. However, the reconstructed concrete surface shows visible distortion — the tile pattern doesn't match the surrounding area convincingly across 120 frames. This was the weakest result of the five.
Water glass: The transparent glass was removed from the stool, leaving the bottle intact. No visible ghosting or edge artifacts where the glass sat. Transparent objects are typically difficult for inpainting models, and VOID handled this well.
Lemon on table: The lemon was removed but left a visible lighter patch on the white table surface. The shadow region where the lemon sat doesn't perfectly match the surrounding surface.
Soccer ball on grass: The ball was removed and grass texture filled in. The low-light evening conditions made the dark fill area less noticeable, but the reconstructed grass texture is somewhat less detailed than the surrounding area.
Limitations we hit
Resolution is capped at 672x384 (width x height). Fine for evaluation, but any production use would need super-resolution on top.
Max 197 frames at 12fps — about 16 seconds. Enough for individual shots, not full scenes.
The mask pipeline is involved. You need SAM2 for segmentation, precise multi-point annotations (a single center click isn't enough), and ideally a VLM for physics-affected regions. It's not a one-click tool.
GPU requirements are steep: 40GB+ VRAM minimum. We used an H100 80GB. There is no consumer GPU path today.
cuDNN compatibility issue: We hit a cuDNN frontend error with PyTorch 2.11.0 + CUDA 13.0 on RunPod. The fix was disabling the cuDNN SDP attention backend (torch.backends.cuda.enable_cudnn_sdp(False)). Anyone running VOID on recent PyTorch builds may encounter this.
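The workaround is a one-liner placed before any model calls; PyTorch then falls back to its other scaled-dot-product attention backends.

```python
import torch

# Work around the cuDNN frontend error by disabling the cuDNN SDPA backend;
# PyTorch falls back to the flash / memory-efficient / math implementations.
torch.backends.cuda.enable_cudnn_sdp(False)

# ... then run VOID inference as usual.
```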
Physics in action
Our initial tests above used simple binary masks — remove the object, keep everything else. But VOID's real differentiator is counterfactual physics reasoning using the full quadmask pipeline. To test this, we ran four of the team's own demo scenes through the VOID inference pipeline on an A100 80GB GPU via Modal.
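For the Modal runs, the wrapper is a plain GPU function. The sketch below shows the structure only; the container image contents and the actual VOID inference call are elided, and the exact options we used may differ.

```python
import modal

app = modal.App("void-physics-demos")

# Container image with the VOID dependencies; pinned versions elided here.
image = modal.Image.debian_slim().pip_install(
    "torch", "diffusers", "transformers", "opencv-python-headless"
)

@app.function(gpu="A100-80GB", image=image, timeout=3600)
def run_demo(scene_name: str) -> str:
    import torch
    # Actual VOID single-pass inference call elided; this just confirms the GPU.
    return f"{scene_name}: running on {torch.cuda.get_device_name(0)}"

@app.local_entrypoint()
def main():
    for scene in ["bowling", "can_crush", "pillow", "marshmallow_toast"]:
        print(run_demo.remote(scene))
```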
Each scene tests counterfactual reasoning: remove the cause, and the effect should disappear too. The quadmasks for these scenes were generated by the VOID team using their full pipeline (SAM2 + VLM physics analysis + quadmask fusion). We ran single-pass inference at the model's native 672x384 resolution on an A100 via Modal. The team's project page shows the same scenes processed with 4K input and their full two-pass pipeline; the difference in quality is significant.
| Demo | Object Removed | Physics Effect | Description |
|---|---|---|---|
| Bowling | Bowlers | Pins remain undisturbed | The bowlers are removed — and since no one is there to throw the ball, the pins stay standing. |
| Can Crush | Hand | Can mostly stays intact | The hand crushing the can is removed. The can stays intact for most of the clip. |
| Pillow | Weight | Partial decompression | The weight on the pillow is removed. The model attempts to show the pillow without compression. |
| Marshmallow Toast | Blow torch | Mostly untoasted | The blow torch is removed. Without a heat source, the marshmallows should stay white. |
At our single-pass, native-resolution settings, the results are mixed. The bowling demo is the strongest: the bowlers are removed and the pins stay standing, correctly reasoning that without anyone to throw the ball, there's no strike. The can crush demo gets the counterfactual right for most of the clip — the can stays intact without the hand — but the object dissolves toward the end. The pillow and marshmallow demos show the model attempting the right reasoning (uncompressed pillow, untoasted marshmallows) but the execution is imperfect: the pillow doesn't fully decompress, and some browning remains on the marshmallows.
However, the team's own results on the project page — using 4K input with the full two-pass pipeline — nail all of these correctly. We verified all 12 working demos: bowling pins stay standing, dominos stop falling, the blender stays off, the pillow stays flat, the spinning top stays still. The model clearly works; the quality gap comes down to resolution and the second inference pass.
The paper's claims
The project page lists six baselines VOID was compared against: ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte. According to the paper, VOID was preferred 64.8% of the time in the user study, with Runway second at 18.4%.
Important caveat: this is an arXiv preprint, not a peer-reviewed publication. The benchmark numbers have not been independently replicated. The specific user study methodology and detailed results are in the paper body, which we have not reproduced here.
Why this matters
The paper's abstract closes: "We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning." VOID's open-source release under Apache 2.0 means anyone can build on the pipeline, reproduce the results, or adapt it for their own use cases.
Transparency
The benchmark was designed and executed collaboratively: Claude Opus 4.6 helped build the test harness, preprocessed the Pexels stock clips, ran SAM2 segmentation and VOID inference on RunPod (simple removal tests) and Modal (physics demos), and built the interactive comparison components. All source clips for the simple tests are from Pexels, used under the Pexels license. Physics demo videos are from the VOID HuggingFace Space. All test clips and outputs are shown unmodified.

Written by
Joseph Nordqvist
Joseph founded AI News Home in 2026. He studied marketing and later completed a postgraduate program in AI and machine learning (business applications) at UT Austin’s McCombs School of Business. He is now pursuing an MSc in Computer Science at the University of York.
Editorial Transparency
This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors. Read our Editorial Guidelines
References
1. VOID: Video Object and Interaction Deletion. Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng. arXiv, April 2, 2026. (Primary source)
2. Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise. Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, Ning Yu. arXiv (CVPR 2025 Oral), January 14, 2025.
3. Kubric: A scalable dataset generator. Klaus Greff et al. GitHub (Google Research), January 1, 2022.
4. Pexels License. Pexels. All content on Pexels is free to use for personal and commercial purposes with no attribution required.