Netflix VOID: we tested their physics-aware video eraser with our own clips

Netflix just open-sourced a model that does something none of the big AI labs currently offer: removing objects from video while generating physically correct counterfactual outcomes, showing what would have happened if the object had never been there.
It's called VOID: Video Object and Interaction Deletion. It's built on Alibaba's CogVideoX-Fun-V1.5-5b-InP foundation model, fine-tuned on synthetic counterfactual training data, and released under Apache 2.0. The paper (arXiv 2604.02296) was posted on April 2, 2026 by Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, and Ta-Ying Cheng.
The headline claim from the paper: in a user study where participants compared outputs from seven different models, VOID was selected as the best 64.8% of the time. Runway came second at 18.4%.
We wanted to see whether the quality holds up on clips the team has never seen, so we rented an H100 and ran our own tests.
What VOID actually does
Most video object removal tools work like aggressive paint-over: mask the object, fill in the background, hope nobody notices. VOID takes a different approach. It asks: what would physically happen to the rest of the scene if this object had never been there?
The project page shows examples like removing a hand holding a glass (the glass falls) or removing a person pressing a blender (the blender stays off). The paper's abstract describes producing "physically consistent counterfactual outcomes."
The quadmask system
VOID's mask format is a four-value grayscale encoding called a quadmask. Instead of just marking what to remove, it tells the model what else will be physically affected:
Black (0): The primary object to remove
Dark grey (63): Where the primary object and affected regions overlap
Light grey (127): Regions that will change after removal (falling objects, altered trajectories)
White (255): Background to preserve
Generating a quadmask is semi-automatic. Based on the repo's pipeline code, it works in stages: a user clicks points on the object to remove (Stage 0), SAM2 segments it into a binary mask (Stage 1), a VLM analyzes which other objects will be physically affected by the removal (Stage 2), those affected objects are segmented and their post-removal positions estimated (Stage 3), and everything is combined into the final four-value quadmask (Stage 4). The repo's default VLM configuration uses gemini-3-pro-preview.
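To make the format concrete, here is a minimal sketch of the final fusion step (Stage 4), assuming you already have a binary mask for the primary object and one for the physically affected regions. The function name and fusion logic are ours for illustration, not the repo's code; only the four grayscale values follow VOID's convention.

```python
import numpy as np

def fuse_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two binary masks (True = inside the region) into a four-value
    grayscale quadmask: 0 = primary object, 63 = overlap of primary and
    affected regions, 127 = affected-only regions, 255 = background.

    Illustrative sketch only, not the VOID repo's implementation.
    """
    quadmask = np.full(primary.shape, 255, dtype=np.uint8)  # background to preserve
    quadmask[affected] = 127                                 # regions that change after removal
    quadmask[primary] = 0                                    # object to remove
    quadmask[primary & affected] = 63                        # overlap of the two
    return quadmask

# Toy example: a 4x4 frame where the object and the affected region overlap.
primary = np.zeros((4, 4), dtype=bool); primary[1:3, 1:3] = True
affected = np.zeros((4, 4), dtype=bool); affected[2:4, 2:4] = True
print(fuse_quadmask(primary, affected))
```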
Two-pass inference
Pass 1 is the primary inpainting pass. It takes the source video, quadmask, and a text prompt describing the desired background. The default configuration uses 50 denoising steps with a guidance scale of 1.0. In our tests, each step took about 1.4 seconds on an H100 (we ran with 30 steps rather than the default 50 to reduce processing time).
Pass 2 is optional. According to the README, it "uses optical flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency." This uses the technique from "Go-with-the-Flow" (Burgert et al., CVPR 2025 Oral), which replaces random temporal noise with correlated noise derived from optical flow fields. Pass 2 uses a separate checkpoint (void_pass2.safetensors) and defaults to 50 steps with a guidance scale of 6.0. Since it's a full additional inference pass, it roughly doubles the total processing time.
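To give a rough sense of what "flow-warped noise" means in practice: instead of sampling independent Gaussian noise for every frame, the noise for frame t is obtained by warping frame t-1's noise along the optical flow, so the noise pattern moves with the scene. The sketch below illustrates the idea on plain pixel grids with OpenCV; the actual Go-with-the-Flow implementation operates on latents and takes care to keep the warped noise properly distributed.

```python
import numpy as np
import cv2

def flow_warped_noise(flows: list[np.ndarray], shape: tuple[int, int]) -> list[np.ndarray]:
    """Generate per-frame noise where each frame's noise follows the optical
    flow from the previous frame rather than being sampled independently.

    flows[t] is an (H, W, 2) forward flow field from frame t to frame t+1.
    Conceptual sketch only.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    noise = [np.random.randn(h, w).astype(np.float32)]
    for flow in flows:
        # Backward warp: sample the previous frame's noise at the location each
        # pixel came from, so the noise pattern travels with the scene motion.
        map_x = xs - flow[..., 0]
        map_y = ys - flow[..., 1]
        warped = cv2.remap(noise[-1], map_x, map_y, cv2.INTER_LINEAR,
                           borderMode=cv2.BORDER_REFLECT)
        noise.append(warped)
    return noise

# Example: three frames of correlated noise for a 672x384 clip with zero flow.
frames = flow_warped_noise([np.zeros((384, 672, 2), np.float32)] * 2, (384, 672))
```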
Go-with-the-Flow was a collaboration between Netflix Eyeline Studios, Netflix, Stony Brook University, University of Maryland, and Stanford University.
Training data
VOID is fine-tuned on synthetic "counterfactual" video pairs — matched before/after videos where removing an object triggers physically correct consequences. The repo describes two data generation pipelines:
HUMOTO pipeline: Generates counterfactual videos from the HUMOTO motion capture dataset using Blender. Human characters interact with objects, then the human is removed and the physics engine simulates what happens to the objects.
Kubric pipeline: Generates counterfactual videos using the Kubric simulation engine with Google Scanned Objects. Objects interact via rigid-body dynamics; removing one object changes the other's trajectory.
The README states: "Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data."
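Conceptually, both pipelines do the same thing: render the scene twice, once with the interacting object and once without it, and let the physics engine resimulate the second case. The toy sketch below is only meant to illustrate that paired-outcome idea; the real pipelines drive Blender and Kubric, not a five-line simulator.

```python
import numpy as np

def glass_height(hand_present: bool, steps: int = 48, dt: float = 1 / 12,
                 start: float = 1.0, g: float = 9.81) -> np.ndarray:
    """Toy stand-in for the physics step: the height of a glass over time,
    with and without the hand holding it. The real pipelines run Blender
    (HUMOTO) or Kubric rigid-body dynamics instead of this."""
    heights = np.full(steps, start)
    if not hand_present:
        velocity = 0.0
        for t in range(1, steps):
            velocity += g * dt
            heights[t] = max(heights[t - 1] - velocity * dt, 0.0)  # fall until it reaches the table
    return heights

# A counterfactual training pair: identical scenes except for the removed object.
factual = glass_height(hand_present=True)          # "before": the hand keeps the glass in place
counterfactual = glass_height(hand_present=False)  # "after": with the hand removed, the glass drops
```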
How it was built
What makes VOID unusual is how it composes open tools from multiple organizations:
| Component | Source | Role |
|---|---|---|
| CogVideoX-Fun-V1.5-5b-InP | Alibaba PAI | 5B-parameter video diffusion transformer backbone |
| SAM2 | Meta | Primary object segmentation from user click points |
| Go-with-the-Flow | Netflix Eyeline Studios et al. | Flow-warped noise for Pass 2 temporal consistency |
| DeepSpeed ZeRO Stage 2 | Microsoft | Training infrastructure (8x A100 80GB, per the README) |
The quadmask pipeline also uses a VLM for reasoning about physical interactions, and the Kubric and HUMOTO pipelines for generating training data.
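For readers unfamiliar with the last table row, ZeRO Stage 2 shards optimizer states and gradients across the training GPUs, which is what lets a 5B-parameter backbone be fine-tuned on 8x 80GB cards. Below is a generic sketch of what that looks like in a fine-tuning script; the model and batch-size numbers are placeholders, not the VOID team's actual configuration.

```python
import torch
import deepspeed

# Placeholder standing in for the CogVideoX-Fun transformer being fine-tuned.
model = torch.nn.Linear(1024, 1024)

# Minimal ZeRO Stage 2 config: optimizer states and gradients are partitioned
# across GPUs. The batch-size values here are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

# Launch with the deepspeed launcher (e.g. across 8 GPUs) so the process
# group is initialized before this call.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```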
Our test results
We sourced 5 stock video clips from Pexels covering different removal scenarios: a static object on a textured surface, a walking person, a transparent glass, a small colored object, and a ball on grass. Each clip was preprocessed to VOID's required 672x384 resolution at 12fps.
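The preprocessing step is mechanical; a minimal OpenCV sketch of the resize-and-resample is below (filenames are placeholders, and our actual harness may have handled frame sampling slightly differently).

```python
import cv2

def preprocess_clip(src_path: str, dst_path: str,
                    width: int = 672, height: int = 384, target_fps: float = 12.0) -> None:
    """Resize a clip to VOID's required 672x384 resolution and resample it to
    12 fps by dropping frames. Sketch only; a production pipeline would use
    ffmpeg or proper frame blending for the fps conversion."""
    cap = cv2.VideoCapture(src_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             target_fps, (width, height))
    next_keep, t = 0.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Keep a frame whenever the source timeline passes the next 12 fps tick.
        if t / src_fps >= next_keep:
            writer.write(cv2.resize(frame, (width, height)))
            next_keep += 1.0 / target_fps
        t += 1
    cap.release()
    writer.release()

preprocess_clip("book_on_table.mp4", "book_on_table_672x384_12fps.mp4")
```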
Masks were generated using SAM2 with manually clicked point prompts (8–39 points per object, placed using a custom GUI to ensure precise placement on the target). We used simple binary masks rather than the full quadmask pipeline, since none of our test clips involved physical interactions that would require the grey mask regions.
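For reference, the point-prompted segmentation looks roughly like this with Meta's SAM2 image predictor, run per frame. The clicked coordinates are placeholders, and the exact API surface may differ slightly from what our harness used.

```python
import numpy as np
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load the SAM2 image predictor from Hugging Face.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def mask_from_clicks(frame_rgb: np.ndarray, clicks: np.ndarray) -> np.ndarray:
    """Segment one object in one frame from user-clicked points.
    clicks is an (N, 2) array of (x, y) pixel coordinates on the object."""
    with torch.inference_mode():
        predictor.set_image(frame_rgb)
        masks, scores, _ = predictor.predict(
            point_coords=clicks,
            point_labels=np.ones(len(clicks), dtype=np.int32),  # 1 = foreground click
            multimask_output=False,
        )
    return masks[0].astype(bool)  # binary mask, True where the object is
```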
We ran both Pass 1 (guidance scale 1.0, 30 steps) and Pass 2 (guidance scale 6.0, 50 steps with flow-warped noise) on an H100 80GB GPU via RunPod. The outputs shown below are from Pass 2.

| Test | Category | Object Removed | Notes |
|---|---|---|---|
| Book on Table | Static Object | White book on a round wooden table | Static object removal on textured wooden surface. Tests basic inpainting quality. |
| Person Walking | Person + Motion | Person walking on an urban sidewalk | Full person removal from a moving scene across 120 frames. |
| Water Glass | Transparent Object | Glass of water with reflections and refractions | Transparent objects are among the hardest for inpainting models. |
| Lemon on Table | Small Object | Fresh lemon (both halves) on a white table | Small colorful object against clean white surface. |
| Soccer Ball | Object on Grass | Soccer ball on grass in evening light | Object removal on natural grass texture with warm evening lighting. |
What we observed
Book on table: Clean removal. The wood grain pattern was reconstructed convincingly across the full surface where the book sat. The table's gold legs and surrounding scene were unaffected.
Person walking: The person and their shadow were both removed from the sidewalk. However, the reconstructed concrete surface shows visible distortion — the tile pattern doesn't match the surrounding area convincingly across 120 frames. This was the weakest result of the five.
Water glass: The transparent glass was removed from the stool, leaving the bottle intact. No visible ghosting or edge artifacts where the glass sat. Transparent objects are typically difficult for inpainting models, and VOID handled this well.
Lemon on table: The lemon was removed but left a visible lighter patch on the white table surface. The shadow region where the lemon sat doesn't perfectly match the surrounding surface.
Soccer ball on grass: The ball was removed and grass texture filled in. The low-light evening conditions made the dark fill area less noticeable, but the reconstructed grass texture is somewhat less detailed than the surrounding area.
Limitations we hit
Resolution is capped at 672x384 (width x height). Fine for evaluation, but any production use would need super-resolution on top.
Max 197 frames at 12fps — about 16 seconds. Enough for individual shots, not full scenes.
The mask pipeline is involved. You need SAM2 for segmentation, precise multi-point annotations (a single center click isn't enough), and ideally a VLM for physics-affected regions. It's not a one-click tool.
GPU requirements are steep: 40GB+ VRAM minimum. We used an H100 80GB. There is no consumer GPU path today.
cuDNN compatibility issue: We hit a cuDNN frontend error with PyTorch 2.11.0 + CUDA 13.0 on RunPod. The fix was disabling the cuDNN SDP attention backend (torch.backends.cuda.enable_cudnn_sdp(False)). Anyone running VOID on recent PyTorch builds may encounter this.
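The workaround is a one-liner placed before any model calls; PyTorch then falls back to its other scaled-dot-product attention backends.

```python
import torch

# Work around the cuDNN frontend error by disabling the cuDNN SDPA backend;
# PyTorch falls back to the flash / memory-efficient / math implementations.
torch.backends.cuda.enable_cudnn_sdp(False)

# ... then run VOID inference as usual.
```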
Physics in action
Our initial tests above used simple binary masks — remove the object, keep everything else. But VOID's real differentiator is counterfactual physics reasoning using the full quadmask pipeline. To test this, we ran four of the team's own demo scenes through the VOID inference pipeline on an A100 80GB GPU via Modal.
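For the Modal runs, the wrapper is a plain GPU function. The sketch below shows the structure only; the container image contents and the actual VOID inference call are elided, and the exact options we used may differ.

```python
import modal

app = modal.App("void-physics-demos")

# Container image with the VOID dependencies; pinned versions elided here.
image = modal.Image.debian_slim().pip_install(
    "torch", "diffusers", "transformers", "opencv-python-headless"
)

@app.function(gpu="A100-80GB", image=image, timeout=3600)
def run_demo(scene_name: str) -> str:
    import torch
    # Actual VOID single-pass inference call elided; this just confirms the GPU.
    return f"{scene_name}: running on {torch.cuda.get_device_name(0)}"

@app.local_entrypoint()
def main():
    for scene in ["bowling", "can_crush", "pillow", "marshmallow_toast"]:
        print(run_demo.remote(scene))
```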
Each scene tests counterfactual reasoning: remove the cause, and the effect should disappear too. The quadmasks for these scenes were generated by the VOID team using their full pipeline (SAM2 + VLM physics analysis + quadmask fusion). We ran single-pass inference at the model's native 672x384 resolution on an A100 via Modal. The team's project page shows the same scenes processed with 4K input and their full two-pass pipeline; the difference in quality is significant.
| Demo | Object Removed | Physics Effect | Description |
|---|---|---|---|
| Bowling | Bowlers | Pins remain undisturbed | The bowlers are removed — and since no one is there to throw the ball, the pins stay standing. |
| Can Crush | Hand | Can mostly stays intact | The hand crushing the can is removed. The can stays intact for most of the clip. |
| Pillow | Weight | Partial decompression | The weight on the pillow is removed. The model attempts to show the pillow without compression. |
| Marshmallow Toast | Blow torch | Mostly untoasted | The blow torch is removed. Without a heat source, the marshmallows should stay white. |
At our single-pass, native-resolution settings, the results are mixed. The bowling demo is the strongest: the bowlers are removed and the pins stay standing, correctly reasoning that without anyone to throw the ball, there's no strike. The can crush demo gets the counterfactual right for most of the clip — the can stays intact without the hand — but the object dissolves toward the end. The pillow and marshmallow demos show the model attempting the right reasoning (uncompressed pillow, untoasted marshmallows) but the execution is imperfect: the pillow doesn't fully decompress, and some browning remains on the marshmallows.
However, the team's own results on the project page — using 4K input with the full two-pass pipeline — nail all of these correctly. We verified all 12 working demos: bowling pins stay standing, dominos stop falling, the blender stays off, the pillow stays flat, the spinning top stays still. The model clearly works; the quality gap comes down to resolution and the second inference pass.
The paper's claims
The project page lists six baselines VOID was compared against: ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte. According to the paper, VOID was preferred 64.8% of the time in the user study, with Runway second at 18.4%.
Important caveat: this is an arXiv preprint, not a peer-reviewed publication. The benchmark numbers have not been independently replicated. The specific user study methodology and detailed results are in the paper body, which we have not reproduced here.
Why this matters
The paper's abstract closes: "We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning." VOID's open-source release under Apache 2.0 means anyone can build on the pipeline, reproduce the results, or adapt it for their own use cases.
Transparency
The benchmark was designed and executed collaboratively: Claude Opus 4.6 helped build the test harness, preprocessed the Pexels stock clips, ran SAM2 segmentation and VOID inference on RunPod (simple removal tests) and Modal (physics demos), and built the interactive comparison components. All source clips for the simple tests are from Pexels, used under the Pexels license. Physics demo videos are from the VOID HuggingFace Space. All test clips and outputs are shown unmodified.

Written by
Joseph Nordqvist
Joseph founded AI News Home in 2026. He studied marketing and later completed a postgraduate program in AI and machine learning (business applications) at UT Austin’s McCombs School of Business. He is now pursuing an MSc in Computer Science at the University of York.
Editorial Transparency
This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors. Read our Editorial Guidelines
References
1. VOID: Video Object and Interaction Deletion. Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng. arXiv, April 2, 2026. (Primary source)
2. Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise. Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, Ning Yu. arXiv (CVPR 2025 Oral), January 14, 2025.
3. Kubric: A scalable dataset generator. Klaus Greff et al. GitHub (Google Research), January 1, 2022.
4. Pexels License. Pexels. All content on Pexels is free to use for personal and commercial purposes with no attribution required.