印刻万物 TOP3DGS

Extended notes · Theory

The Secret to Extreme Efficiency: Differentiable Rasterization

The technical internals of 3DGS real-time rendering: how tile-based rendering, CUDA optimization, and differentiable backpropagation make it possible to render millions of Gaussians in milliseconds.

Cross-checked against public sources

The Tile-based Rendering Pipeline

The 3DGS rendering pipeline has five stages: frustum culling (removing Gaussians outside the view frustum) → projection to screen space (converting 3D means and covariance matrices to 2D) → tile assignment (dividing the image into 16×16 pixel blocks, determining which tiles each Gaussian covers) → depth sorting (sorting Gaussians by depth within each tile) → rasterization and blending (computing color contributions for each pixel).
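The tile-assignment stage can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name `tiles_touched` is hypothetical, and the splat's screen-space `radius` is assumed to be given already (for example, a few standard deviations of the projected 2D Gaussian).

```python
TILE = 16  # tile edge length in pixels, as described in the text


def tiles_touched(cx, cy, radius, width, height, tile=TILE):
    """Return the (tx, ty) indices of every tile overlapped by the
    axis-aligned bounding square of a splat centred at (cx, cy).

    Hypothetical helper: `radius` is an assumed, precomputed pixel extent.
    """
    tiles_x = (width + tile - 1) // tile   # tiles per row (rounded up)
    tiles_y = (height + tile - 1) // tile  # tiles per column (rounded up)

    def clamp(v, hi):
        return max(0, min(hi, v))

    # Lower bounds are inclusive, upper bounds exclusive.
    x0 = clamp(int((cx - radius) // tile), tiles_x)
    x1 = clamp(int((cx + radius) // tile) + 1, tiles_x)
    y0 = clamp(int((cy - radius) // tile), tiles_y)
    y1 = clamp(int((cy + radius) // tile) + 1, tiles_y)
    return [(tx, ty) for ty in range(y0, y1) for tx in range(x0, x1)]
```

A splat whose bounding square lies entirely off-screen yields an empty list, so it drops out of all later stages.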

Tile-based rendering is the key to 3DGS efficiency. A naive per-pixel renderer would test every potentially overlapping Gaussian for each pixel, which scales poorly once a scene holds millions of Gaussians. Tiling helps in three ways: pixels within a tile share one Gaussian list, which removes redundant work; each tile can be processed independently by one GPU thread block, which suits the GPU's parallelism; and a tile's working data fits in GPU shared memory, which cuts global memory traffic. The 16×16 tile size is an empirical choice; it also gives 256 pixels per tile, so a thread block can assign one thread to each pixel.
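One way to build the shared per-tile lists is a single sort over keys that combine the tile index with the depth. The sketch below assumes that approach; the names `make_key` and `build_tile_lists` are hypothetical, and Python's built-in `sorted` stands in for a GPU radix sort.

```python
import struct


def depth_bits(depth):
    """Reinterpret a non-negative float32 depth as an unsigned 32-bit int.

    For non-negative IEEE-754 floats the bit pattern sorts in the same
    order as the value, so integer sorting gives depth ordering.
    """
    return struct.unpack("<I", struct.pack("<f", depth))[0]


def make_key(tile_id, depth):
    """Pack the tile index into the high bits and the depth into the low 32."""
    return (tile_id << 32) | depth_bits(depth)


def build_tile_lists(instances):
    """instances: iterable of (tile_id, depth, gaussian_id), one entry per
    tile that a Gaussian touches.

    Returns {tile_id: [gaussian_id, ...]} with each list ordered front to back.
    """
    keyed = sorted((make_key(t, d), g) for t, d, g in instances)
    lists = {}
    for key, g in keyed:
        lists.setdefault(key >> 32, []).append(g)
    return lists
```

Because the tile index occupies the high bits, one sort groups entries by tile and orders them by depth within each tile at the same time.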

Projection Transform and Depth Sorting

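Covariance matrix projection uses the Jacobian: Σ₂D = J W Σ₃D Wᵀ Jᵀ, where W is the world-to-camera (viewing) transformation and J is the Jacobian of the perspective projection, evaluated at the Gaussian's mean. Perspective projection is nonlinear, so the exact image of a 3D Gaussian is not itself a Gaussian; J is a local affine approximation, and the formula yields a symmetric positive semi-definite 2×2 matrix that defines a valid 2D Gaussian approximating how the 3D shape projects onto the screen. Depth sorting uses a GPU radix sort, a parallel algorithm whose running time is linear in the number of keys for a fixed key width, to ensure the correct Alpha blending order.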
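The projection formula can be checked numerically with a small sketch. It assumes a pinhole camera with focal lengths `fx` and `fy` in pixels, takes W as the 3×3 rotation part of the world-to-camera transform, and evaluates J at the Gaussian's mean in camera space (`t_cam`); all function and parameter names here are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def project_covariance(cov3d, W, t_cam, fx, fy):
    """Return the 2x2 screen-space covariance J W cov3d W^T J^T.

    cov3d : 3x3 world-space covariance
    W     : 3x3 rotation part of the world-to-camera transform
    t_cam : Gaussian mean in camera space (tx, ty, tz), with tz > 0
    fx, fy: focal lengths in pixels (assumed pinhole camera)
    """
    tx, ty, tz = t_cam
    # Jacobian of (x, y, z) -> (fx * x / z, fy * y / z) at the mean.
    J = [[fx / tz, 0.0, -fx * tx / (tz * tz)],
         [0.0, fy / tz, -fy * ty / (tz * tz)]]
    JW = matmul(J, W)
    return matmul(matmul(JW, cov3d), transpose(JW))
```

For an isotropic Gaussian on the optical axis, the result is again isotropic, scaled by (f / z)²: the splat shrinks on screen as it moves away from the camera.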

Alpha Blending and Early Termination

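The Alpha blending formula: C = Σᵢ cᵢ αᵢ Gᵢ Tᵢ, where Tᵢ = ∏ (1 − αⱼ Gⱼ), taken over all j < i, is the accumulated transmittance: the product of the transparencies of every Gaussian in front of the i-th one. Here cᵢ is the Gaussian's color, αᵢ its opacity, and Gᵢ the value of its projected 2D Gaussian at the pixel. Once the accumulated transmittance T drops below a small threshold (0.0001 in the reference implementation), contributions from subsequent Gaussians are negligible and computation can be terminated early (Early Termination). This significantly reduces computation for pixels covered by opaque surfaces. 3DGS uses one thread block per tile, with each thread handling one pixel in the tile, which keeps the GPU's threads busy and lets the block load Gaussian data cooperatively into shared memory.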
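The blending loop for a single pixel can be sketched as follows. This is a simplified illustration, not the CUDA kernel: `blend_pixel` is a hypothetical name, the input is assumed to be already sorted front to back, and the cutoff `t_min` is exposed as a parameter rather than fixed.

```python
def blend_pixel(splats, t_min):
    """Front-to-back alpha blending for one pixel with early termination.

    splats: list of (color, alpha, g), sorted front to back, where color is
            an (r, g, b) tuple, alpha the opacity, and g the value of the
            projected 2D Gaussian at this pixel.
    t_min : transmittance cutoff below which blending stops.

    Returns (rgb, remaining_transmittance, number_of_splats_processed).
    """
    rgb = [0.0, 0.0, 0.0]
    T = 1.0      # accumulated transmittance in front of the current splat
    used = 0
    for color, alpha, g in splats:
        a = alpha * g            # effective opacity at this pixel
        w = a * T                # blending weight of this splat
        for k in range(3):
            rgb[k] += w * color[k]
        T *= 1.0 - a
        used += 1
        if T < t_min:            # the pixel is effectively opaque
            break
    return rgb, T, used
```

When the first splat is fully opaque, the loop exits after one iteration, however many Gaussians sit behind it.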

Differentiable Backpropagation: The Training Backbone

3DGS needs not only fast rendering but also fast gradient computation to support training. Every arithmetic step of the rendering process (projection, Gaussian evaluation, blending) is a differentiable operation, so complete gradients follow from the chain rule. For Alpha blending, the opacity αᵢ affects not only the i-th term but also every later term through the accumulated transmittance, so its gradient has two parts: the direct contribution of the i-th Gaussian, and a negative term for the light it blocks from the Gaussians behind it. Sorting itself is non-differentiable, but it only determines the blending order, and blending is differentiable for a fixed order. The backward pass reuses the per-tile sorted lists from the forward pass, traversing them back to front, and never differentiates through the sort. This design makes the entire pipeline end-to-end differentiable and keeps training efficient.
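The gradient of the blending formula can be sketched for a single pixel and a single color channel. The code below is an illustrative derivation under simplifying assumptions, not the reference kernel: `a[i]` stands for the effective opacity αᵢGᵢ, every `a[i]` is assumed to be strictly less than 1, and the function names are hypothetical. The backward pass walks the sorted list back to front, keeping a running sum of the color contributed by the splats behind the current one.

```python
def forward(colors, a):
    """Blend scalar colors front to back.

    Returns the pixel value C and the list of transmittances T_i
    (the transmittance in front of each splat).
    """
    T, C, Ts = 1.0, 0.0, []
    for c, ai in zip(colors, a):
        Ts.append(T)
        C += c * ai * T
        T *= 1.0 - ai
    return C, Ts


def backward(colors, a, Ts):
    """Gradients of C with respect to each color and each opacity.

    dC/dc_i = a_i * T_i
    dC/da_i = c_i * T_i - (sum over k > i of c_k * a_k * T_k) / (1 - a_i)
    Assumes every a_i < 1 so the division is well defined.
    """
    n = len(a)
    grad_c = [0.0] * n
    grad_a = [0.0] * n
    behind = 0.0  # color contributed by splats behind the current one
    for i in reversed(range(n)):
        grad_c[i] = a[i] * Ts[i]
        grad_a[i] = colors[i] * Ts[i] - behind / (1.0 - a[i])
        behind += colors[i] * a[i] * Ts[i]
    return grad_c, grad_a
```

The second term of `grad_a[i]` is the "blocking" effect: raising the opacity of splat i dims everything behind it. No step of the computation involves the sort, only the order it produced.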

Sources