
stage 05


Published. Last reviewed: 2026-05-08

Frame Extraction & Selection

Frame Extraction Pipeline Overview

The entire process is a pipeline: raw video → coarse extraction → blur filter → duplicate filter → exposure filter → selected frames → (optional) mask generation.

Typical count reduction:

| Stage | Frame count (10-min video example) | Notes |
| --- | --- | --- |
| Raw video | 36,000 (60 fps × 600 s) | All frames |
| Coarse extraction (2 fps) | ~1,200 | FFmpeg time-interval extraction |
| After blur filter | ~900 | Motion-blurred frames removed (~25%) |
| After duplicate filter | ~350 | High-similarity redundant frames removed (~60%) |
| After exposure filter | ~300 | Over/underexposed frames removed (~15%) |
| Final selection | 200–300 | Confirmed by human spot-check |

Step 1: FFmpeg Coarse Extraction

FFmpeg is the Swiss Army knife of frame extraction. Core idea: extract at a fixed frame rate while preserving maximum quality.


Basic Command Template

```bash
ffmpeg -i input.mp4 \
  -vf "fps=2" \
  -qscale:v 2 \
  -start_number 001 \
  output/frame_%04d.jpg
```

Parameter explanation:

| Parameter | Meaning | Recommended value |
| --- | --- | --- |
| `-vf "fps=2"` | Extract 2 frames per second | Walking speed: 2 fps; slow orbit: 1 fps |
| `-qscale:v 2` | JPEG quality (1 = highest, 31 = lowest) | 2 (near-lossless) |
| `-start_number 001` | Starting number for output files | 001 |
| `%04d` | Four-digit zero-padded counter | Ensures correct sorting |

Extraction Rate by Scenario

| Capture method | Movement speed | Recommended fps | Reason |
| --- | --- | --- | --- |
| Handheld walking (indoor) | ~0.5 m/s | 2 fps | ~25 cm between frames, ~75% overlap |
| Handheld walking (outdoor) | ~1 m/s | 3 fps | ~33 cm between frames, ~70% overlap |
| Tripod slow pan | ~0.2 m/s | 1 fps | ~20 cm between frames, >80% overlap |
| Drone flight | ~3 m/s | 2 fps | ~1.5 m between frames, with 70% overlap |
| Action camera fast sweep | ~2 m/s | 4 fps | Fast movement needs denser sampling |

Rule of thumb: Extraction rate = movement speed (m/s) × 2 ÷ frame width coverage (m). Target: ≥70% overlap between adjacent frames.
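The ≥70% overlap target can also be turned into a quick calculator. The sketch below derives the fps directly from the overlap requirement rather than the shorthand formula; `frame_coverage_m` (how many metres of scene one frame spans) is an assumed, user-estimated input:

```python
def recommended_fps(speed_mps, frame_coverage_m, target_overlap=0.7):
    """Frames per second needed so adjacent frames overlap by target_overlap.

    speed_mps: camera movement speed in m/s
    frame_coverage_m: scene width covered by one frame, in metres (estimate)
    """
    # Maximum distance the camera may travel between frames
    # while still keeping the requested overlap:
    max_step = frame_coverage_m * (1.0 - target_overlap)
    return speed_mps / max_step

# Handheld indoor walk: 0.5 m/s with ~1 m frame coverage -> ~1.7 fps,
# which rounds up to the 2 fps recommended in the table above.
print(round(recommended_fps(0.5, 1.0), 2))
```

Round the result up, since slightly more overlap is always safer than slightly less.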

Advanced: mpdecimate Auto-Deduplication

FFmpeg's built-in mpdecimate filter can automatically drop near-duplicate frames during extraction (redundant frames from pauses or very slow movement):

```bash
# Extract + auto-deduplicate
ffmpeg -i input.mp4 \
  -vf "fps=2,mpdecimate=hi=64*200:lo=64*50:frac=0.33" \
  -vsync vfr \
  -qscale:v 2 \
  output/frame_%04d.jpg
```

Preserving EXIF Metadata

FFmpeg strips metadata by default. Add -map_metadata 0 to preserve source file global metadata:

```bash
ffmpeg -i input.mp4 \
  -vf "fps=2" \
  -qscale:v 2 \
  -map_metadata 0 \
  output/frame_%04d.jpg
```

Note: Video files have limited EXIF (usually only camera model and date). For precise focal length, record it in meta.yaml and batch-write with ExifTool:

```bash
# Batch-write focal length (assuming a 24mm lens)
exiftool -FocalLength=24 -FocalLengthIn35mmFilm=24 ./output/*.jpg
```

Step 2: Blur Detection & Removal

Motion blur is the #1 enemy of video frame extraction. Even 5–10 severely blurred frames among 300 good ones will cause visible blur artifacts at corresponding angles.

Laplacian Variance Method

Principle: Apply the Laplacian operator (second derivative) to the image and compute the variance of the result. Sharp images have many edges → high variance. Blurry images have few edges → low variance.

```python
import cv2

def laplacian_variance(image_path):
    """Calculate sharpness score for a single image"""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()
```

Threshold Selection

| Scene type | Recommended threshold | Notes |
| --- | --- | --- |
| Texture-rich (bookshelves, gardens) | 100–150 | Sharp frames in such scenes typically score >200 |
| Medium texture (indoor, corridors) | 50–100 | Scenes with many white walls score lower overall |
| Sparse texture (white walls, sky) | 30–50 | A high threshold here would delete everything |

Practical tip: Run once without a threshold, output all scores to CSV. Plot a histogram — find the clear "bimodal" distribution. Low peak = blurry frames, high peak = sharp frames. Set threshold at the valley between peaks.

File Size Quick-Screen Method

A faster rough filter: motion-blurred JPEG frames are typically 20–40% smaller than sharp frames (blur reduces detail → higher compression ratio).

```bash
# Sort by file size; the smallest ~20% are likely blurry
ls -lS ./raw_frames/
```
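The same quick screen can be scripted so the suspects are listed rather than eyeballed. A sketch, assuming frames live in `raw_frames/`; the 20% cutoff is the rough figure from above, and flagged files should be reviewed visually before deletion:

```python
import glob
import os

# Sort frames by file size, smallest first, and flag the bottom ~20%
# as blur suspects (blur -> less detail -> better JPEG compression).
paths = sorted(glob.glob("raw_frames/*.jpg"), key=os.path.getsize)
suspects = paths[: len(paths) // 5]
for p in suspects:
    print(p, os.path.getsize(p), "bytes")
```

Treat this only as a pre-filter; confirm with the Laplacian-variance score before actually removing frames.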

Step 3: Duplicate Frame Removal

When you pause during capture, hesitate while turning, or move very slowly, consecutive frames are nearly identical. These redundant frames provide no new information — they only slow COLMAP and waste training time.


SSIM (Structural Similarity) Deduplication

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def is_duplicate(frame_a_path, frame_b_path, threshold=0.95):
    """Check if two frames are duplicates via SSIM"""
    a = cv2.cvtColor(cv2.imread(frame_a_path), cv2.COLOR_BGR2GRAY)
    b = cv2.cvtColor(cv2.imread(frame_b_path), cv2.COLOR_BGR2GRAY)
    a = cv2.resize(a, (640, 480))  # downscale for speed
    b = cv2.resize(b, (640, 480))
    return ssim(a, b) > threshold
```

SSIM Threshold Selection

| Threshold | Effect | Use case |
| --- | --- | --- |
| 0.98 | Removes only near-static duplicates | Conservative; when over-deletion is a concern |
| 0.95 | Balanced (recommended) | Most scenarios |
| 0.90 | Aggressive deduplication | Too many frames; significant reduction needed |
| 0.85 | Very aggressive | Use with caution; may remove useful frames |

Step 4: Exposure Anomaly Detection

Overexposed frames (highlights blown to white) and underexposed frames (shadows crushed to black) must also be removed: they contain few feature points and have lost color information, which causes color discontinuities at the corresponding viewing angles.

```python
import cv2
import numpy as np

def check_exposure(image_path, low_thresh=30, high_thresh=225, max_ratio=0.25):
    """Detect over/underexposure. Returns: 'ok', 'overexposed', 'underexposed'"""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    total = gray.size
    over = np.sum(gray > high_thresh) / total
    under = np.sum(gray < low_thresh) / total
    if over > max_ratio:
        return 'overexposed'
    elif under > max_ratio:
        return 'underexposed'
    return 'ok'
```

How Many Frames to Keep?

More isn't always better, and fewer isn't always worse. There's a sweet spot:

| Subject type | Recommended frames | Notes |
| --- | --- | --- |
| Small object (<50 cm) | 50–80 | Three-ring orbit sufficient |
| Medium object (50 cm–2 m) | 80–150 | Plus bottom fill shots |
| Single room (<30 ㎡) | 150–250 | Three layers × loop path |
| Multi-room (50–100 ㎡) | 250–400 | 80–120 frames per room |
| Building exterior | 300–500 | Ground + drone |
| Large complex | 500–1000+ | Grid segmentation, 150–200 per cell |

Problems with Too Many vs Too Few

| Issue | Too few frames (<50) | Too many frames (>1000) |
| --- | --- | --- |
| COLMAP | Feature matching breaks, SfM fails | Runtime grows quadratically (pairwise matching is O(N²)) |
| Memory | Not a problem | GPU VRAM insufficient, OOM crash |
| Training quality | Holes, blurry regions | Diminishing returns; redundant frames introduce noise |
| Time cost | Not a problem | COLMAP: 100 frames ~1 h, 500 ~8 h, 1000 ~30 h+ |

COLMAP processing speed reference (RTX 4090):

• 100 frames: ~30 minutes

• 300 frames: ~2 hours

• 500 frames: ~6 hours

• 1000 frames: ~24 hours+

Frame count vs COLMAP time is approximately O(N²) because feature matching is pairwise.

Mask Generation: Tell the Algorithm "Ignore This"

A mask is a black-and-white image the same size as the original: white area = train normally, black area = ignore during training. It solves the problem of scene elements you don't want the model to learn (pedestrians, vehicles, glass reflections, construction barriers).


When Masks Are Needed

| Scenario | Need mask? | Reason |
| --- | --- | --- |
| Pedestrians passing | ✅ Strongly recommended | Different position in each frame → floaters |
| Glass/mirrors | ✅ Recommended | Reflections change with viewing angle → artifacts |
| Screens/TVs | ✅ Recommended | Screen content differs in every frame |
| Vehicles passing | ✅ Strongly recommended | Same as pedestrians |
| Wind-blown leaves | ⚠️ Optional | Slight motion may be tolerable; severe motion requires a mask |
| Pure static scene | ❌ Not needed | No moving objects |
| Photographer's shadow | ✅ Recommended | Shadow position changes as you move |

Method 1: SAM2 / SAM3 Automatic Segmentation

Meta's SAM2 and SAM3 (released March 2026) can automatically detect and segment any object in images. Combined with Grounding DINO for text-guided detection, you can batch-generate masks for "person," "car," "animal."

```python
from transformers import pipeline

# Initialize SAM3 pipeline
generator = pipeline("mask-generation", model="facebook/sam3", device=0)
outputs = generator(image_path, points_per_batch=64)
```

Method 2: Manual Painting (Photoshop / GIMP)

For cases automatic segmentation can't handle (glass reflections, specific light spots):

  1. Open frame in Photoshop

  2. New layer, paint black over areas to ignore

  3. Export as matching _mask.png (pure black and white, no gradients)

  4. Ensure mask dimensions exactly match the original frame

Batch tip: If glass/mirror positions are relatively fixed across frames (e.g., display cases), paint one mask template, then use SAM2's video propagation to automatically apply across all frames.

Mask File Naming Convention

```text
01_selected/
├── 001.jpg
├── 002.jpg
└── ...
masks/
├── 001_mask.png
├── 002_mask.png
└── ...
```

Format requirements:

• PNG format (lossless, no compression artifacts)

• Pure black and white (0 or 255 only, no gray gradients)

• Dimensions exactly match corresponding frame

• Filenames correspond one-to-one with frames

Common Mistakes & Troubleshooting

| Mistake | Consequence | Solution |
| --- | --- | --- |
| Extraction rate too high (10 fps+) | Frame explosion; COLMAP can't handle it | Reduce to 2–3 fps |
| Extraction rate too low (0.5 fps) | Insufficient overlap; SfM chain breaks | Increase to 2 fps or higher |
| Blurry frames not filtered | Model shows blur artifacts at the corresponding angles | Laplacian variance + threshold filter |
| Duplicates not removed | COLMAP 10× slower with no benefit | SSIM deduplication |
| EXIF focal length not written | COLMAP can't initialize intrinsics | Batch-write with ExifTool |
| Mask size doesn't match frame | Training error or mask misalignment | Ensure pixel-level alignment |
| Mask contains gray gradients | Edge areas half-trained, producing artifacts | Binarize: 0 and 255 only |
| Frames extracted from HDR video | Color-space mismatch; training produces color shift | Use SDR video, or convert to SDR first |


Next Steps

• Frame selection complete, colors inconsistent → Enter 06-Color Grading & Consistency

• Want to train directly → Enter 08-Training

• Need to review asset organization → Back to 04-Asset Organization & Archival
