← Blog

Marchless Cubes: Bitfield SIMD Voxel Meshing

· 12 min read
rust graphics simd performance

Marching Cubes is the go-to algorithm for extracting triangle meshes from volumetric data. Published in 1987, reimplemented thousands of times. The idea is simple: for each cell in a 3D grid, check the 8 corners to see which are "inside" the surface and which are "outside." Encode that as an 8-bit index, look it up in a 256-entry triangle table, and emit the corresponding geometry.

It works. It's well-documented. But the core loop is inherently scalar: you iterate cell by cell, gather 8 corner values, build one lookup index, emit some triangles, and move on. What if you could compute the lookup index for every voxel in the grid simultaneously?

That's what this post is about. In Plenum, I use an approach I call "marchless cubes." It still uses the standard marching cubes triangle table for triangle emission. But the expensive part, extracting the 8-bit neighbor mask for each voxel, is done via pure bitwise SIMD operations on the entire grid at once. No per-cell iteration for neighbor extraction. No gathering of corner values. Just shifts and ORs on a giant bitfield.

Traditional Marching Cubes: The Per-Cell Loop

In standard marching cubes, the inner loop looks something like this (pseudocode):

for x in 0..grid_size - 1 {
  for y in 0..grid_size - 1 {
    for z in 0..grid_size - 1 {
      // Gather 8 corner values (cells stop one short of the grid
      // edge, since each cell reads its +1 neighbors)
      let c0 = grid[x][y][z];
      let c1 = grid[x+1][y][z];
      let c2 = grid[x+1][y+1][z];
      // ... 5 more corners
      let index = (c0 < 0) as u8 | ((c1 < 0) as u8) << 1 | ...;
      let triangles = TRI_TABLE[index as usize];
      emit(triangles);
    }
  }
}

Each cell requires 8 memory lookups to gather corners. Adjacent cells share 4 of their 8 corners, but the standard approach re-reads them anyway. For a 64x64x64 grid, that's 262,144 cells, each doing 8 gathers, building one index, and doing one table lookup. It's a lot of serial work.

The Key Insight: Voxels as a Bitfield

In Plenum, the voxel grid is a VoxelGrid, a structure backed by u8x64 SIMD vectors. Each voxel is a single bit: 1 for solid, 0 for empty. An 8x8x8 grid fits in 512 bits, which is exactly one u8x64 (64 lanes of 8-bit integers) when treated as a flat bitfield.

This is the foundation. When your entire grid is a bitfield, "checking a neighbor" becomes "shifting the bitfield by one position." You don't index into an array. You shift the whole grid.
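The shift-as-neighbor-lookup idea is easiest to see in miniature. Here's a toy 2D analogue (my own sketch, not code from Plenum): an 8x8 slice of voxels packed into one u64, with bit index y * 8 + x. Reading every cell's +X neighbor at once is one shift plus one mask:

```rust
/// Toy 2D neighbor shift: every cell reads its +X neighbor.
/// Bit layout: bit (y * 8 + x) is the voxel at (x, y).
fn shxl_2d(grid: u64) -> u64 {
    // Shifting right by 1 moves bit (x + 1) into position x; the mask
    // clears column x = 7 so row y + 1 doesn't bleed into row y.
    (grid >> 1) & 0x7F7F_7F7F_7F7F_7F7F
}
```

A solid voxel at (3, 0) (bit 3) shows up at (2, 0) after the shift, and a voxel at x = 0 simply shifts out of the grid.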

Shifting the Whole Grid

The VoxelGrid has methods like shxl(), shyl(), and shzl() that shift the entire bitfield in each axis direction. Shifting left in X means every voxel now holds the value of its +X neighbor. Shifting left in Y gives you the +Y neighbor. Combinations give you diagonal neighbors.

The visualization below shows a 2D slice of this concept. The blue grid is the original bitfield. Click a shift direction to see the shifted version overlaid in orange. Where they overlap (green), both the original cell and its neighbor are solid.

In the actual code, the shifts operate on u64x64 SIMD vectors. Shifting across lane boundaries is handled by shift_left and shift_right functions that treat the entire 4096-bit vector as a single giant integer:

/// Treats u64x64 (4096 bits) as one large integer, shifting left.
/// `amount` must be in 1..=63, or `data >> (64 - amount)` overflows.
fn shift_left(data: u64x64, amount: u64) -> u64x64 {
    data << amount | (data >> (64 - amount)).rotate_elements_right::<1>()
}

The rotate_elements_right call is the critical piece. When you shift a 64-bit lane left, the high bits that overflow need to become the low bits of the next lane. That's what the rotate does: it moves each lane's overflow bits into the adjacent lane, making the shift seamless across the full 4096-bit width.

From Shifts to Lookup Byte

Here's where the pieces come together. In standard marching cubes, you gather 8 corner values for one cell at a time. In the marchless approach, each shift gives you one corner for every cell at once. The code creates 8 shifted copies of the grid:

let values = [
    grid,                      // v0: the cell itself
    grid.shxl(),               // v1: +X neighbor
    grid.shxl().shyl(),        // v2: +X, +Y neighbor
    grid.shyl(),               // v3: +Y neighbor
    grid.shzl(),               // v4: +Z neighbor
    grid.shxl().shzl(),        // v5: +X, +Z neighbor
    grid.shxl().shyl().shzl(), // v6: +X, +Y, +Z neighbor
    grid.shyl().shzl(),        // v7: +Y, +Z neighbor
];

Each shifted grid tells you, for every voxel position, whether that corner is solid. Now we need to take those 8 single-bit answers and pack them into one 8-bit byte per voxel. The visualization below shows this process for a 2D slice (4 corners instead of 8). Click any cell to see how its corner values assemble into a lookup index:

But there's a problem. Each shifted grid has one bit per voxel, packed tightly. We need to spread those bits apart so each voxel gets a full byte, with corner i's bit placed at bit position i. That's where pad_zeros comes in.

The pad_zeros Magic

The pad_zeros function takes 64 packed bits (stored as a u8x64) and "unpacks" them so that each original bit occupies bit 0 of its own byte. It uses a clever sequence of shifts and masks:

const M1: u64 = 15 | (15 << 32);
const M2: u64 = 3 | (3 << 16) | (3 << 32) | (3 << 48);
const M3: u64 = u64::from_ne_bytes([1; 8]); // 0x0101010101010101

fn pad_zeros(case: u8x64) -> u64x64 {
    let x: u64x64 = case.cast();
    let x = (x | (x << 28)) & u64x64::splat(M1);
    let x = (x | (x << 14)) & u64x64::splat(M2);
    let x = (x | (x << 7))  & u64x64::splat(M3);
    x
}

This is a classic bit-spreading technique. Each step doubles the spacing between bits. After three rounds of shift-and-mask, 8 consecutive bits have been spread into 8 bytes, each containing just bit 0. The visualization below shows the transformation:

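The same transformation is easy to check in scalar form. Here's a one-lane version of pad_zeros using the same masks (my own scalar restatement; the real function does this across all 64 lanes at once):

```rust
const M1: u64 = 15 | (15 << 32);
const M2: u64 = 3 | (3 << 16) | (3 << 32) | (3 << 48);
const M3: u64 = u64::from_ne_bytes([1; 8]); // 0x0101010101010101

/// Scalar pad_zeros: spread 8 packed bits into bit 0 of 8 bytes.
fn pad_zeros_scalar(case: u8) -> u64 {
    let x = case as u64;
    let x = (x | (x << 28)) & M1; // nibbles: 4 bits -> 32 bits apart
    let x = (x | (x << 14)) & M2; // pairs:   2 bits -> 16 bits apart
    (x | (x << 7)) & M3           // singles: 1 bit  ->  8 bits apart
}
```

For example, 0b1011_0101 (bits 0, 2, 4, 5, 7 set) becomes 0x0100_0101_0001_0001: exactly those bytes hold a 1 in bit 0.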

Once each shifted grid has been padded, the assembly is simple. Shift the padded result for corner i left by i bit positions, then OR all 8 together:

let padded = values.iter().map(|v| pad_zeros(v.0));
let shifted = padded.enumerate().map(|(i, p)| p << i as u64);

let mut grid = u64x64::splat(0);
for s in shifted {
    grid |= s;
}
grid = !grid;

After this, you can transmute the u64x64 into a [u8x64; 8] (the NeighborGrid), where each byte is the 8-bit marching cubes lookup index for one voxel. The final !grid inverts the bits because the convention in the TRI_TABLE treats 1 as "inside" while the bitfield uses 1 as "solid" with inverted winding.
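The assembly step can also be checked in miniature. This scalar sketch (again my restatement, with 8 cells instead of 512) takes one mask per corner, where bit c of masks[i] says whether corner i of cell c is solid, and builds the lookup byte for all 8 cells at once:

```rust
const M1: u64 = 15 | (15 << 32);
const M2: u64 = 3 | (3 << 16) | (3 << 32) | (3 << 48);
const M3: u64 = u64::from_ne_bytes([1; 8]);

fn pad_zeros_scalar(case: u8) -> u64 {
    let x = case as u64;
    let x = (x | (x << 28)) & M1;
    let x = (x | (x << 14)) & M2;
    (x | (x << 7)) & M3
}

/// Build the 8-bit lookup index for 8 cells simultaneously.
/// Byte c of the result is cell c's marching cubes index.
fn assemble(masks: [u8; 8]) -> u64 {
    let mut out = 0u64;
    for (i, &m) in masks.iter().enumerate() {
        // Corner i's bit lands at bit 0 of each byte, then moves to bit i.
        out |= pad_zeros_scalar(m) << i;
    }
    out // (the real code finishes with a bitwise NOT here)
}
```

If cell 0 has all 8 corners solid and cell 1 has only corners 0 and 7, byte 0 of the result is 0xFF and byte 1 is 0x81.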

From Lookup to Triangles

At this point, the "marchless" part is done. We have an 8-bit index for every voxel in the grid, computed without iterating cell-by-cell. The rest is standard marching cubes: loop through the voxels, skip any with index 0 (all empty) or 255 (all solid), and use the TRI_TABLE to emit triangles for everything else.

for x in 0..7 {
  for y in 0..7 {
    for z in 0..7 {
      let byte = neighbors.0[z].as_array()[y * 8 + x];
      if byte == 0 || byte == 255 { continue; }

      let tri_config = &TRI_TABLE[byte as usize];
      // Emit triangles at edge midpoints...
    }
  }
}

The triangle emission is simple: for each edge index in the table entry, compute the midpoint of that edge on the unit cube and emit a vertex. The current implementation places vertices at edge midpoints (no SDF interpolation), which produces the characteristic blocky-but-smooth marching cubes look.
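As a sketch of that emission step, here's midpoint placement using the corner ordering from the values array above and the conventional 12-edge numbering (the edge table is an assumption; Plenum's actual table layout may differ):

```rust
/// Corner positions matching the `values` array: v0 at the origin,
/// v1 = +X, v2 = +X+Y, v3 = +Y, and v4..v7 the same square at z = 1.
const CORNERS: [[f32; 3]; 8] = [
    [0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.],
    [0., 0., 1.], [1., 0., 1.], [1., 1., 1.], [0., 1., 1.],
];

/// Conventional marching cubes edge numbering: which two corners each
/// of the 12 cube edges connects (assumed, not Plenum's table).
const EDGES: [(usize, usize); 12] = [
    (0, 1), (1, 2), (2, 3), (3, 0), // bottom face (z = 0)
    (4, 5), (5, 6), (6, 7), (7, 4), // top face    (z = 1)
    (0, 4), (1, 5), (2, 6), (3, 7), // vertical edges
];

/// Vertex for edge `e` of the cell whose origin is (x, y, z).
fn edge_midpoint(e: usize, x: f32, y: f32, z: f32) -> [f32; 3] {
    let (a, b) = EDGES[e];
    [
        x + (CORNERS[a][0] + CORNERS[b][0]) * 0.5,
        y + (CORNERS[a][1] + CORNERS[b][1]) * 0.5,
        z + (CORNERS[a][2] + CORNERS[b][2]) * 0.5,
    ]
}
```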

Why This is Fast

The traditional approach does N^3 * 8 memory accesses to gather corner values. The marchless approach replaces all of that with a fixed number of SIMD operations on the entire bitfield:

  • 8 grid shifts (each two shifts, a rotate, and an OR on a u64x64 vector)
  • 8 pad_zeros calls (each 9 SIMD ops: 3 shifts, 3 ORs, 3 ANDs)
  • 8 shift-and-ORs to assemble the final lookup bytes

That's on the order of a hundred SIMD operations total, regardless of grid size (within the bitfield's capacity). Compare that to 262,144 cells each doing 8 random memory accesses. The bitfield approach turns a memory-bound scatter-gather problem into a compute-bound register-width operation.

There's also a data density advantage. The entire 8x8x8 voxel grid fits in 64 bytes (512 bits). That's a single cache line on most architectures. The neighbor extraction happens entirely within registers, with zero cache misses.

The Shift Functions Under the Hood

The shift_left and shift_right functions are worth understanding in detail. They treat a u64x64 as a single 4096-bit integer, shifting across all 64 lanes seamlessly:

fn shift_left(data: u64x64, amount: u64) -> u64x64 {
    // amount must be in 1..=63, or `64 - amount` overflows the shift
    data << amount
    | (data >> (64 - amount)).rotate_elements_right::<1>()
}

fn shift_right(data: u64x64, amount: u64) -> u64x64 {
    data >> amount
    | (data << (64 - amount)).rotate_elements_left::<1>()
}

The first term shifts each lane individually. The second term captures the bits that would overflow from one lane and feeds them into the adjacent lane via rotate_elements. This is the glue that makes the 64 separate 64-bit lanes behave as a single contiguous bitfield. Without it, you'd get gaps at every lane boundary.
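The same carry logic can be modeled on stable Rust with a plain array (my sketch, with lane 0 least significant). One difference from the rotate version: the topmost lane's overflow is dropped here rather than wrapped back into lane 0:

```rust
/// Treat [u64; 4] as one 256-bit integer and shift left across lanes.
/// Like the SIMD version, `amount` must be in 1..=63.
fn shift_left_256(data: [u64; 4], amount: u32) -> [u64; 4] {
    let mut out = [0u64; 4];
    for i in 0..4 {
        out[i] = data[i] << amount; // per-lane shift
        if i > 0 {
            // The bits that overflowed lane i-1 become lane i's low bits.
            out[i] |= data[i - 1] >> (64 - amount);
        }
    }
    out
}
```

Without the carry term, a bit shifted past position 63 of one lane would vanish instead of appearing at position 0 of the next.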

Limitations and Tradeoffs

This approach has constraints. The voxel grid must be binary (solid or not), so there's no SDF interpolation along edges. Vertices sit at edge midpoints, not at the exact surface crossing. For smooth terrain, you'd want a hybrid approach: use bitfield shifts for the neighbor mask extraction, then sample a continuous SDF for vertex placement.

The grid size is also limited by the SIMD register width. The padded lookup output needs one byte per voxel, so a u64x64 (512 bytes) covers exactly one 8x8x8 chunk. For larger chunks, you'd tile multiple bitfield operations, though each tile still benefits from the bulk extraction.

Takeaway

The lesson here is about a shift in thinking. The standard marching cubes loop asks: "For this cell, what are its 8 neighbors?" That's a per-element question. The marchless approach asks: "For the entire grid, what does the +X shifted version look like?" That's a bulk operation.

When you reformulate per-element gather operations as bulk bitfield shifts, you unlock a completely different performance regime. Instead of N^3 random memory accesses, you get a fixed number of register-width operations. Instead of fighting the cache hierarchy, you work entirely within SIMD registers.

This pattern applies far beyond voxel meshing. Cellular automata, collision detection, flood fill, erosion and dilation in image processing: anywhere you check neighbors in a grid, the same shift-and-combine trick applies. Stop thinking about individual elements. Start thinking about what happens when you shift everything at once.
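As a tiny example of that generality, morphological dilation on a 64-cell 1D grid is the same trick: a cell becomes solid if it or either neighbor is solid, computed with two shifts and two ORs instead of a per-cell loop (my sketch, ignoring boundaries, so bits simply shift out at the edges):

```rust
/// 1D dilation on a 64-cell bitfield: solid if self or a neighbor is.
fn dilate_1d(g: u64) -> u64 {
    g | (g << 1) | (g >> 1)
}
```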

The full implementation is in Plenum, which uses Rust's nightly portable_simd feature for all of this.