perf: split viewDepth into separate depthBuffer for sort cache locality#8587

Merged: mvaligursky merged 1 commit into main from mv-depth-buffer-split on Apr 10, 2026

@mvaligursky
Contributor

Splits viewDepth out of projCache into a dedicated parallel depthBuffer, improving sort-pass cache behavior at zero additional memory cost.

Changes:

  • Reduce CACHE_STRIDE from 8 to 7 by removing viewDepth from projCache slot 7
  • Add a parallel depthBuffer (1 u32 per splat) written during tile count, read by all sort and rasterize passes
  • Sort passes (bitonic, bucket sort, chunk sort) now read depth via stride-1 depthBuffer[entryIdx] instead of stride-8 projCache[entryIdx * 8 + 7], eliminating cache thrashing on random depth lookups
  • Merge globalPairCounter and largeSplatCount into a single countersBuffer[2] to stay within WebGPU's limit of 10 storage buffers per shader stage on Metal
  • Restructure the large tile count shader to use an isActive flag instead of an early return, as required by WGSL uniform control flow when bounds are derived from atomicLoad
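
The layout change in the list above can be modeled on the CPU. This is a minimal sketch, not the engine's code: the PR writes depthBuffer directly during the tile count pass, whereas this example copies from an old-style buffer purely to show where each slot ends up. The function name `splitViewDepth`, the `SplitLayout` type, and the two stride constants are hypothetical; only projCache, depthBuffer, viewDepth, and the stride values 8 and 7 come from the PR.

```typescript
// Hypothetical CPU-side model of the layout split (illustrative names).
const OLD_CACHE_STRIDE = 8; // slots 0..7, viewDepth stored in slot 7
const NEW_CACHE_STRIDE = 7; // slots 0..6 only

interface SplitLayout {
    projCache: Uint32Array;   // NEW_CACHE_STRIDE u32 values per splat
    depthBuffer: Uint32Array; // 1 u32 value per splat
}

function splitViewDepth(oldProjCache: Uint32Array, splatCount: number): SplitLayout {
    const projCache = new Uint32Array(splatCount * NEW_CACHE_STRIDE);
    const depthBuffer = new Uint32Array(splatCount);
    for (let i = 0; i < splatCount; i++) {
        const src = i * OLD_CACHE_STRIDE;
        const dst = i * NEW_CACHE_STRIDE;
        // Slots 0..6 keep their relative order in the narrower cache.
        projCache.set(oldProjCache.subarray(src, src + NEW_CACHE_STRIDE), dst);
        // Slot 7 (viewDepth) moves to the parallel, stride-1 buffer.
        depthBuffer[i] = oldProjCache[src + 7];
    }
    return { projCache, depthBuffer };
}
```

After the split, a depth lookup for entry `i` is `depthBuffer[i]` rather than `projCache[i * 8 + 7]`, so consecutive entries' depths sit next to each other in memory.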

Performance:

  • Bucket sort: 0.6ms → 0.2ms (3x improvement)
  • Tile sort: 1.3ms → 1.2ms
  • Total frame: 8.5ms → 8.2ms (~3.5% improvement)
  • Measured on a 17M-splat scene; zero memory overhead (projCache shrinks by exactly the amount depthBuffer adds)
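
The zero-overhead claim is simple arithmetic, sketched below. The function names are hypothetical, and the figures assume one cache entry per splat for all 17M splats, which the PR does not state explicitly. The 128-byte cache line in `depthsPerCacheLine` is also an assumption for illustration; the actual line size depends on the GPU.

```typescript
const BYTES_PER_U32 = 4;

// Old layout: 8 u32 per splat, all in projCache.
function bytesBefore(splatCount: number): number {
    return splatCount * 8 * BYTES_PER_U32;
}

// New layout: 7 u32 per splat in projCache plus 1 u32 per splat in depthBuffer.
function bytesAfter(splatCount: number): number {
    return splatCount * 7 * BYTES_PER_U32 + splatCount * BYTES_PER_U32;
}

// How many distinct depth values one cache line holds when depths are
// spaced strideInU32 apart. cacheLineBytes is an assumed hardware value.
function depthsPerCacheLine(strideInU32: number, cacheLineBytes: number): number {
    return Math.max(1, Math.floor(cacheLineBytes / (strideInU32 * BYTES_PER_U32)));
}

const splats = 17_000_000;
console.log(bytesBefore(splats), bytesAfter(splats)); // 544000000 544000000
console.log(depthsPerCacheLine(8, 128), depthsPerCacheLine(1, 128)); // 4 32
```

Under these assumptions the stride-1 buffer packs 8x more depth values into each cache line, which is consistent with the larger gain on the depth-heavy bucket sort than on the frame as a whole.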

Move viewDepth out of projCache (slot 7) into a parallel depthBuffer,
reducing CACHE_STRIDE from 8 to 7. Total memory is unchanged — projCache
shrinks by 1 u32/splat while depthBuffer adds 1 u32/splat.

Sort passes (bitonic, bucket sort, chunk sort) now read depth via
stride-1 depthBuffer[entryIdx] instead of stride-8
projCache[entryIdx * 8 + 7], eliminating cache thrashing on random
depth lookups. Measured 3x improvement on bucket sort (0.6ms → 0.2ms)
and ~0.3ms total frame improvement on 17M-splat scenes.
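
The read-side difference in the sort passes can be illustrated with a CPU analogue. This is a sketch only: the real passes are GPU bitonic, bucket, and chunk sorts, and this example substitutes a plain comparison sort. It also assumes depth is stored as a u32 key that sorts correctly in ascending order, which the PR does not specify. Both function names are hypothetical.

```typescript
// Old access pattern: each depth lookup jumps 8 u32 (32 bytes) per entry.
function sortByDepthStrided(entries: Uint32Array, projCache: Uint32Array): Uint32Array {
    const sorted = Uint32Array.from(entries);
    sorted.sort((a, b) => projCache[a * 8 + 7] - projCache[b * 8 + 7]);
    return sorted;
}

// New access pattern: depth keys are contiguous, one u32 per entry.
function sortByDepthParallel(entries: Uint32Array, depthBuffer: Uint32Array): Uint32Array {
    const sorted = Uint32Array.from(entries);
    sorted.sort((a, b) => depthBuffer[a] - depthBuffer[b]);
    return sorted;
}
```

Both functions produce the same ordering for the same depth values; only the memory touched per lookup differs.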

Also merges globalPairCounter and largeSplatCount into a single
countersBuffer[2] to stay within the 10 storage-buffer-per-stage
WebGPU limit on Metal.
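
The counter merge can be pictured as two named slots in one two-element buffer, so the shader stage binds one storage buffer instead of two. The sketch below is a CPU model under stated assumptions: the slot order (pair counter at index 0, large splat count at index 1) and all function and constant names are hypothetical, and the plain increments stand in for WGSL atomicAdd, which likewise returns the value held before the add. In the WebGPU API the relevant limit is named `maxStorageBuffersPerShaderStage`.

```typescript
// Assumed slot assignment inside countersBuffer[2]; the PR does not state the order.
const COUNTER_GLOBAL_PAIR = 0;
const COUNTER_LARGE_SPLAT = 1;
const COUNTER_COUNT = 2;

function createCounters(): Uint32Array {
    return new Uint32Array(COUNTER_COUNT); // 8 bytes, one binding
}

// Reserve n pair slots; returns the first reserved index (the old counter value).
function reservePairs(counters: Uint32Array, n: number): number {
    const base = counters[COUNTER_GLOBAL_PAIR];
    counters[COUNTER_GLOBAL_PAIR] = base + n;
    return base;
}

// Register one large splat; returns its index in the large-splat list.
function registerLargeSplat(counters: Uint32Array): number {
    const index = counters[COUNTER_LARGE_SPLAT];
    counters[COUNTER_LARGE_SPLAT] = index + 1;
    return index;
}
```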

mvaligursky self-assigned this Apr 10, 2026
mvaligursky added the performance (Relating to load times or frame rate) and area: graphics (Graphics related issue) labels Apr 10, 2026
mvaligursky merged commit 75ad28a into main Apr 10, 2026
8 checks passed
mvaligursky deleted the mv-depth-buffer-split branch April 10, 2026 15:38