1 Introduction

Bounding volume hierarchies (BVHs) and spatial data structures in general are indispensable tools in computer graphics. They are used to accelerate a multitude of algorithms, including collision detection, frustum culling, and ray tracing. Due to the low cost and high performance of massively parallel manycore graphics processing units (GPUs), BVHs that can be efficiently constructed and traversed on the GPU are of particular interest in GPU-accelerated physically based animation and ray tracing.

However, current research focuses on GPU BVHs for triangular surface meshes or point clouds (see Sect. 2). BVHs for volumetric meshes, e.g., tetrahedral meshes used in finite element simulations, must have different characteristics to achieve high performance. For example, volumetric meshes generally fill a space more densely than surface meshes and therefore lead to deeper hierarchies. Volumetric meshes are widely used in physically based animation, computational physics, and scientific visualization of simulation results (see Fig. 1). Furthermore, volumetric meshes are advantageous for the description of 3D printed models, as they enable the description of material gradients (see, e.g., Altenhofen et al. [2]).

In this paper, we examine how construction and traversal performance as well as memory use can be improved by adapting previous GPU-optimized BVHs for volumetric meshes. We introduce a novel octree-based linear BVH (OLBVH) based on the linear BVH (LBVH) by Lauterbach et al. [12]. We demonstrate the performance and memory benefits of using an octree-based BVH instead of a binary tree-based BVH for direct volume rendering (DVR), plane intersection for cross-section computation or slicing, and inside-outside intersection testing. Furthermore, OLBVH creates a hierarchy of tightly fitting bounding boxes, unlike the original LBVH and most other LBVH variants (see Sect. 2). This allows for the introduction of boundary flags (see Sect. 3.1) that enable efficient space skipping in DVR, conservative slicing, and inside-outside intersection testing between volumetric meshes (see Sect. 4).

In the following, Sect. 2 discusses related work in the field of GPU-optimized BVHs. Section 3 describes the concepts behind our novel BVH and relevant implementation details. Section 4 discusses the performance and memory characteristics of our data structure and compares them with current state-of-the-art GPU BVHs. Finally, Sect. 5 summarizes the paper, discusses limitations, and provides avenues for further research.

Fig. 1: The octree linear bounding volume hierarchy OLBVH enables efficient direct volume rendering of scientific data on unstructured volumetric meshes. The mesh shown has 12 million tetrahedra.

2 Related work

In this section, we discuss related work in the field of bounding volume hierarchies with GPU-accelerated construction and traversal.

The linear bounding volume hierarchy (LBVH) introduced by Lauterbach et al. [12] forms the basis of many of the newer approaches. They use Morton codes [15] (see Fig. 2) to approximately spatially sort discretized element centroids in parallel. Split positions for the hierarchy are determined in parallel according to differing bits of neighboring codes. While construction is very fast, it results in many singleton nodes, i.e., nodes with only one child, and nodes overlap significantly due to the approximate nature of Morton order sorting. They also introduce a surface area heuristic (SAH) hierarchy that ameliorates these issues, but significantly increases construction cost.

Fig. 2: Morton codes for a \(4 \times 4\) grid. The bits of the individual integer coordinates are interleaved, leading to a space-filling curve with a fractal Z-like shape.

A number of authors have since improved SAH-LBVH construction performance. Pantaleoni and Luebke [20] introduced a two-level hierarchical LBVH (HLBVH) Morton sorting method with better performance for dynamic meshes. This method was further improved by Garanzha et al. [8] who introduced a task-based approach to HLBVH construction that enables construction in a single kernel call. Karras [11] developed an approach that computes all hierarchy levels in parallel leading to better scaling, while also generalizing to k-D trees and octrees. This approach was further improved by Apetrei [3] using atomic operations to avoid binary searches to improve construction speed. However, surface area heuristics do not map well to volumetric meshes.

A potential alternative to the SAH are the extended Morton codes (EMCs) introduced by Vinkler et al. [25]. EMCs encode the discretized bounding box size into an arbitrary number of bits of the code, while the number of bits per coordinate can be varied as well. However, the position of these bits must be chosen carefully for good performance. Furthermore, high-quality volumetric meshes for simulation tend to vary smoothly in element size (see, e.g., Alliez et al. [1]), reducing the potential benefit of EMCs.

For time-varying data, refitting, as used in an early CPU-based work by Wald et al. [26], is an interesting alternative. However, the resulting BVHs are of lower quality and would lose the beneficial properties of tightly fitting nodes.

Besides efficient construction, traversal requires special attention on the GPU: it is typically performed recursively, the traversal stack resides in slow local memory (a thread-private area of GPU RAM), and the compute-to-bandwidth ratio of GPUs is very large. Murguia et al. [18], García et al. [9], as well as Binder and Keller [5], use bit trails for stackless traversal. Bit trails store the path through a tree as individual bits in an integer. However, these approaches are designed for binary trees and incur additional memory overheads such as the allocation of empty nodes or hash maps. Vaidyanathan et al. [22] recently introduced a short stack approach which avoids these overheads and supports higher tree arities, albeit at the cost of requiring some stack space.

While Zellmann et al. [28] use the LBVH data structure for volumetric data, they do so for data on a sparse regular grid, avoiding looseness a priori. General unstructured volumetric meshes require a different approach. In particular, octrees are beneficial, as volumetric data fill space more densely and the higher branching factor avoids excessive hierarchy depth. For the use case of surface reconstruction from point clouds, Zhou et al. [29] construct an octree using a bottom-up approach based on Morton order sorting. While large point clouds can be stored efficiently, the memory overhead of the hierarchy itself is large and limits tree depth. Furthermore, hierarchy levels are allocated separately, leading to increased allocation and synchronization overhead. Gu et al. [10] introduce the octree-based LOBVH and a binary tree-based variant LLBVH based on the work of Karras [11]. However, their approach requires full allocation of the finest octree level followed by a compaction step. This severely limits the maximum octree depth and results in a large memory overhead.

With the advent of specialized ray tracing cores in consumer graphics cards (see, e.g., Stich [21]), another option is to use the built-in hardware-accelerated BVH. Wald et al. [27] trace very short rays to perform point location in tetrahedral meshes. They achieve significant speedups compared to a pure general purpose computing on graphics processing units (GPGPU) approach, despite performing ray casting instead of direct point location. However, availability of GPUs with ray tracing hardware is currently limited to only the newest generation from a single vendor. Furthermore, efficient hardware BVH construction is itself an open research topic (see, e.g., Doyle et al. [7] or Viitanen et al. [23, 24]).

Fig. 3: This figure shows the data layout of the \(\texttt {CO}\) and \(\texttt {PO}\) arrays for a sample tree. The gray circles represent the tree nodes. The blue numbers above the tree nodes represent entries of the \(\texttt {CO}\) array, and the red numbers represent entries of the \(\texttt {PO}\) array. The green arrow in the background indicates the in-memory order.

3 Concept and implementation

In this section, we describe the basic concept of the OLBVH data structure and detail the implementation of the data structure as well as construction and traversal algorithms.

Like previous LBVH variants (see Sect. 2), our data structure relies on approximate spatial sorting of elements ordered by Morton code [15]. We quantize the input coordinates x, y, and z as l-bit integers \({\hat{x}} = (x_{l-1}, x_{l-2},\dots ,x_0)\) and interleave their bits as shown in Fig. 2:

$$\begin{aligned} m({\hat{x}},{\hat{y}},{\hat{z}}) = (x_{l-1},y_{l-1},z_{l-1},\dots ,x_{0},y_{0},z_{0}) \end{aligned}$$
(1)
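To make the interleaving in Eq. 1 concrete, the following C++/CUDA sketch computes a 30-bit Morton code from three 10-bit quantized coordinates using the common magic-number bit expansion; the helper names expandBits and morton3D are illustrative and not taken from the authors' implementation.

#include <cstdint>

// Spread the lower 10 bits of v so that two zero bits separate consecutive input bits.
__host__ __device__ inline uint32_t expandBits(uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Morton code m for quantized coordinates in [0, 2^10) as in Eq. 1;
// x contributes the most significant bit of each 3-bit group.
__host__ __device__ inline uint32_t morton3D(uint32_t x, uint32_t y, uint32_t z)
{
    return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}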

For quantization purposes, we use the enclosing axis-aligned bounding box (AABB) \(H({\mathbb {M}})\) of the volumetric mesh \({\mathbb {M}}\):

$$\begin{aligned} H({\mathbb {M}}) = \left( \mathbf {x}_\text {min}^{\mathbb {M}}, \mathbf {x}_\text {max}^{\mathbb {M}}\right) , \end{aligned}$$
(2)

where \(\mathbf {x}_\text {min}^{\mathbb {M}}\) and \(\mathbf {x}_\text {max}^{\mathbb {M}}\) are the componentwise minimum and maximum positions within the mesh.

A key difference to previous LBVH variants is that we do not encode the centroids of AABBs. Instead, we use the fact that Morton codes inherently span an equidistant grid. The resolution of this grid depends on the number of bits l used for quantization. The cell size of the Morton grid can be calculated as:

$$\begin{aligned} \mathbf {s}^l = (s^l_x, s^l_y, s^l_z)^T {:}{=}\frac{\mathbf {x}_\text {max}^{\mathbb {M}} - \mathbf {x}_\text {min}^{\mathbb {M}}}{2^l - 1} \end{aligned}$$
(3)

Using this size, quantization and reconstruction follow the equations below, respectively:

$$\begin{aligned} \begin{aligned}&q (x) = \left\lfloor \frac{x - x^{\mathbb {M}}_\text {min}}{s^l_x} + \frac{1}{2} \right\rfloor = {\hat{x}}\\&q ^{-1}({\hat{x}}) = x^{\mathbb {M}}_\text {min} + {\hat{x}} \cdot s^l_x \end{aligned} \end{aligned}$$
(4)
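As a small illustration of Eqs. 3 and 4, the following C++ sketch quantizes and reconstructs a single coordinate; the struct and member names are illustrative, and the per-axis values would be precomputed from the mesh AABB.

#include <cstdint>
#include <cmath>

struct MortonAxis {
    float xMin;   // x_min of the mesh AABB for this axis
    float s;      // Morton cell size s^l for this axis: s = (xMax - xMin) / (2^l - 1), cf. Eq. 3
    int   l;      // number of quantization bits

    // q(x): quantize to an l-bit integer (Eq. 4); the +0.5 rounds to the nearest grid position.
    uint32_t quantize(float x) const {
        float q = std::floor((x - xMin) / s + 0.5f);
        float maxVal = float((1u << l) - 1u);
        return (uint32_t)std::fmin(std::fmax(q, 0.0f), maxVal);
    }

    // q^{-1}(x_hat): reconstruct the grid position of a quantized coordinate (Eq. 4).
    float dequantize(uint32_t xq) const { return xMin + float(xq) * s; }
};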

The quantized AABB \({\hat{H}}({\mathbb {P}})\) of a primitive \({\mathbb {P}} \in {\mathbb {M}}\) is defined by two quantized points:

$$\begin{aligned} {\hat{H}}({\mathbb {P}}) {:}{=}\left( \hat{\mathbf {x}}_\text {min}^{\mathbb {P}}, \hat{\mathbf {x}}_\text {max}^{\mathbb {P}}\right) \end{aligned}$$
(5)

For each primitive \({\mathbb {P}} \in {\mathbb {M}}\), we generate the Morton codes enclosed by the primitive’s quantized AABB \({\hat{H}}({\mathbb {P}})\). The set of generated Morton codes for a primitive \({\mathbb {P}} \in {\mathbb {M}}\) is given by:

$$\begin{aligned} M({\hat{H}}({\mathbb {P}})) {:}{=}\left\{ m({\hat{x}}',{\hat{y}}',{\hat{z}}') \mid \bigwedge _c^{\{x, y, z\}}{\hat{c}}^{\mathbb {P}}_\text {min} \le {\hat{c}}' \le {\hat{c}}^{\mathbb {P}}_\text {max}\right\} \end{aligned}$$
(6)
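The set \(M({\hat{H}}({\mathbb {P}}))\) of Eq. 6 can be enumerated with three nested loops over the quantized AABB, as in the following sketch (morton3D as sketched above; names are illustrative):

// Emit one Morton code per grid cell covered by the quantized AABB of a primitive (Eq. 6).
// Large primitives produce many codes, which is why the builder first counts and then fills.
template <typename OutIt>
void mortonCodesOfAABB(uint32_t xMin, uint32_t yMin, uint32_t zMin,
                       uint32_t xMax, uint32_t yMax, uint32_t zMax, OutIt out)
{
    for (uint32_t z = zMin; z <= zMax; ++z)
        for (uint32_t y = yMin; y <= yMax; ++y)
            for (uint32_t x = xMin; x <= xMax; ++x)
                *out++ = morton3D(x, y, z);
}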

Combined with a heuristic to determine the tree depth \(L = l + 1\) (the root level is present, even for 0-bit Morton codes), we use \(M({\hat{H}}({\mathbb {P}}))\) to split primitives a priori and eliminate looseness, i.e., BVH sibling cells never overlap. Additionally, we store boundary flags at every node of the tree, allowing for early termination at nodes containing only interior cells when only determining if a point or bounding box is inside or outside the mesh \({\mathbb {M}}\). As interior primitives near the boundary must share points with boundary primitives and the sets \(M({\hat{H}}({\mathbb {P}}))\) are inclusive, cells that contain empty space must always contain boundary primitives provided that no primitive has a negative signed volume.

3.1 Data structure

The OLBVH data structure consists of six arrays containing:

1. primitive indices \(\texttt {P}[n_m]\) in \([0, n_p)\) sorted by Morton code,
2. tree node bounding volumes \(\texttt {BV}[n_n]\),
3. child node offsets \(\texttt {CO}[n_{n_i} + 1]\) in \([0, n_n)\),
4. primitive index offsets \(\texttt {PO}[n_n + 1]\) in \([0, n_m]\),
5. boolean boundary flags \(\texttt {BF}[n_n]\),
6. per-level node offsets \(\texttt {NO}[L]\) in \([1, n_n]\),

where \(n_p\) is the number of primitives in \({\mathbb {M}}\), \(n_m\) is the sum of the numbers of Morton codes per primitive, \(n_n\) is the number of tree nodes, and \(n_{n_i}\) is the number of internal tree nodes excluding leaves.

As the bounding volume of a primitive may intersect with several spatial cells of the Morton grid, \(\texttt {P}\) may contain duplicate indices. The bounding volumes of tree nodes \(\texttt {BV}\) are laid out linearly in memory following a breadth-first traversal order. In order to allow for top-down traversal, \(\texttt {CO}\) stores child offsets for the tree nodes. Every tree node is associated with a maximum of eight children. Due to the contiguous levelwise order in memory, the \(\texttt {CO}\) array enables retrieval of the child node indices \(C_i\) of a given tree node i:

$$\begin{aligned} C_i {:}{=}\{ j \mid \texttt {CO}{}[i] < j \le \texttt {CO}{}[i + 1]\}. \end{aligned}$$
(7)

In order to manage memory efficiently, the \(\texttt {CO}\) array only contains entries for internal, i.e., non-leaf, nodes. As a result of the chosen data structure, the node index of an internal node’s rightmost child is equal to the child offset of its succeeding node in memory, i.e.,

$$\begin{aligned} \text {Node { i} is { j}'s rightmost child} \implies \texttt {CO}{}[j + 1] = i \end{aligned}$$
(8)

This property is useful while constructing the internal hierarchy levels starting from the leaves (see Sect. 3.2).

The OLBVH also incorporates a primitive offsets array \(\texttt {PO}\) to determine which primitive indices in \(\texttt {P}\) are associated with node i. As nodes reside in memory in breadth-first order, adjacent nodes of different hierarchical levels must be handled:

$$\begin{aligned} P_i {:}{=}{\left\{ \begin{array}{ll} \{\texttt {P}{}[j] \mid 0 \le j< \texttt {PO}{}[i + 1]\}, &{} \text {if } \texttt {PO}{}[i] > \texttt {PO}{}[i + 1] \\ \{\texttt {P}{}[j] \mid \texttt {PO}{}[i] \le j < \texttt {PO}{}[i + 1]\}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

As any rightmost child has the same upper offset as its parent node, it holds that:

$$\begin{aligned} \text {Node { i} is { j}'s rightmost child} \implies \texttt {PO}{}[i + 1] = \texttt {PO}{}[j + 1] \end{aligned}$$
(10)

The memory layout of the \(\texttt {CO}\) and \(\texttt {PO}\) arrays is illustrated for a sample tree in Fig. 3.

The boundary flag array \(\texttt {BF}\) stores a boolean boundary flag for each tree node i. The final array \(\texttt {NO}\) stores the node index offsets for each level. It is used to determine the hierarchical level of a node given its index.
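The following plain C++ sketch summarizes the array layout and the index arithmetic of Eqs. 7 and 9; the struct, the AABB type, and the accessor names are illustrative and do not reflect the authors' actual interface.

#include <cstdint>

struct AABB { float lo[3], hi[3]; };

struct OLBVH {
    uint32_t* P;    // [n_m]       primitive indices sorted by Morton code
    AABB*     BV;   // [n_n]       node bounding volumes, breadth-first order
    uint32_t* CO;   // [n_ni + 1]  child offsets, internal nodes only
    uint32_t* PO;   // [n_n + 1]   primitive index offsets
    bool*     BF;   // [n_n]       boundary flags
    uint32_t* NO;   // [L]         per-level node index offsets (used to determine a node's level)

    // Children C_i of internal node i (Eq. 7): indices CO[i] + 1 ... CO[i + 1] (inclusive).
    void children(uint32_t i, uint32_t& first, uint32_t& last) const {
        first = CO[i] + 1;
        last  = CO[i + 1];
    }

    // Primitive index range of node i in P (Eq. 9); end is exclusive.
    void primitives(uint32_t i, uint32_t& begin, uint32_t& end) const {
        begin = (PO[i] > PO[i + 1]) ? 0u : PO[i];
        end   = PO[i + 1];
    }
};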

3.2 Construction

The OLBVH construction procedure receives any volumetric mesh \({\mathbb {M}}\) with marked boundary primitives as input. We focus on tetrahedral meshes in our evaluation. However, OLBVH is equally applicable to general polyhedral meshes. We use the GPU-optimized tetrahedral mesh data structure by Mueller-Roemer et al. [16, 17] to efficiently store and process tetrahedral meshes on the GPU, and use it to determine which primitives lie on the boundary, i.e., have one or more faces with only one neighbor. As this is part of pre-processing, not BVH construction, and covered in previous publications, we do not evaluate the computation of initial boundary primitive marking. If \(H({\mathbb {M}})\) is not known beforehand, a parallel reduction on the mesh’s vertices calculates \(\mathbf {x}_\text {min}^{\mathbb {M}}\) and \(\mathbf {x}_\text {max}^{\mathbb {M}}\). Parallel primitives, such as reductions, parallel prefix sums, and sorting, are efficiently implemented in libraries such as CUB [13] or Thrust [4]. We use Thrust in our implementation. Construction is performed in four stages:

1. Determine the tree depth heuristically
2. Calculate and sort Morton codes of primitive AABBs
3. Record at which levels the sorted Morton codes split
4. Bottom-up construction based on split positions.

In the initial stage, we determine the number \(l \le l_\text {max}\) of bits to use for quantization, and therefore the depth of the tree, according to element size. The maximum possible number of quantization bits is \(l_\text {max} = 10\) in our implementation, as we use 32-bit integers to store Morton codes. In parallel over primitives, we estimate the tree level by first computing the binary logarithm of the largest tetrahedron AABB axis in relation to the maximum grid resolution:

$$\begin{aligned} \alpha _{\mathbb {P}} = \text {max}\left( \left\lfloor \text {ld}\left\lfloor \frac{\mathbf {x}_\text {max}^{\mathbb {P}} - \mathbf {x}_\text {min}^{\mathbb {P}}}{\mathbf {s}^{l_\text {max}}}\right\rfloor _\text {max}\right\rfloor , 0\right) \end{aligned}$$
(11)

where the division is performed per component. The binary logarithm can be efficiently implemented using a count leading zeros instruction (the __clz intrinsic in CUDA). Subsequently, the construction procedure chooses

$$\begin{aligned} l = \mathrm {clamp}\left( \left\lfloor 10.5 - \left( \mathrm {avg}\left( \alpha _{\mathbb {P}}\right) + \alpha \right) \right\rfloor , 0, 10\right) \end{aligned}$$
(12)

by performing a parallel reduction on \(\alpha _{\mathbb {P}}\), where \(\alpha \) is a tuning coefficient. A negative \(\alpha \) would make the Morton grid finer than most primitive AABBs and lead to oversampling, so we restrict \(\alpha \) to \([0, 10]\). Increasing \(\alpha \) results in a coarser Morton grid and thereby faster construction and reduced memory use.
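A hedged sketch of the heuristic in Eqs. 11 and 12 follows; __clz is the CUDA intrinsic mentioned above, the reduction that averages \(\alpha _{\mathbb {P}}\) over all primitives is omitted, and all names are illustrative.

// Per-primitive level estimate alpha_P (Eq. 11) for l_max = 10 quantization bits.
// ext[c] is the primitive AABB extent and s[c] the Morton cell size s^{l_max} per axis.
// Taking the maximum of the floored logs per component equals flooring the log of the
// maximum component, since both operations are monotone.
__device__ int alphaP(const float ext[3], const float s[3])
{
    int r = 0;                                    // max(..., 0) in Eq. 11: start at 0
    for (int c = 0; c < 3; ++c) {
        int cells = (int)floorf(ext[c] / s[c]);   // floor of the per-component quotient
        if (cells > 0)
            r = max(r, 31 - __clz(cells));        // floor(ld(cells)) via count-leading-zeros
    }
    return r;
}

// Eq. 12: number of quantization bits l from the averaged alpha_P and the tuning coefficient alpha.
int chooseBits(float avgAlphaP, float alpha)
{
    int l = (int)floorf(10.5f - (avgAlphaP + alpha));
    if (l < 0)  l = 0;
    if (l > 10) l = 10;
    return l;
}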

Having determined the maximum level heuristically, we calculate the Morton codes in the second stage. The size of the Morton grid cells \(\mathbf {s}^l\) can be computed using Eq. 3. On the basis of \(\mathbf {s}^l\), we generate a temporary array of Morton codes \(\texttt {MC}\). As the overall number of Morton codes \(n_m\) is not known a priori, we perform two parallel passes over primitives to determine \(n_m\) and allocate and fill \(\texttt {MC}\). Each thread calculates \(\left| M({\hat{H}}({\mathbb {P}}))\right| \) for one primitive. A parallel exclusive prefix sum determines the array offsets the Morton codes of each primitive are written to. By performing the prefix sum for \(n_p + 1\) elements, the final offset corresponds to \(n_m\) and additional branching is avoided. A second pass over all primitives generates the sets \(M({\hat{H}}({\mathbb {P}}))\) and writes the Morton codes into \(\texttt {MC}\) at the computed offsets, while also writing each primitive’s index to \(\texttt {P}\) at the corresponding offset.

With a parallel sort, the primitive indices \(\texttt {P}\) are sorted using the Morton codes in \(\texttt {MC}\) as keys. As a result, the primitive indices are ordered along the Z-curve. Primitive indices are potentially duplicated, as \(M({\hat{H}}({\mathbb {P}}))\) may include several Morton codes. Additionally, the root node of the OLBVH is constructed at this point. Because the root node encloses all primitives \({\mathbb {P}} \in {\mathbb {M}}\), its AABB is set to \(H({\mathbb {M}})\). Moreover, the two initial entries of \(\texttt {PO}\) are set to 0 and the overall number of Morton codes \(n_m\), respectively. Straightforwardly, the first entry of \(\texttt {CO}\) is 0, the first entry of \(\texttt {NO}\) is 1, and the last is \(n_n\). Additionally, the root boundary flag \(\texttt {BF}[0]\) is set to true.
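A condensed sketch of the second stage using Thrust follows, assuming the per-primitive counts \(\left| M({\hat{H}}({\mathbb {P}}))\right| \) have already been computed into a counts array with one extra trailing entry, so that the scan directly yields \(n_m\); the kernel that fills \(\texttt {MC}\) and \(\texttt {P}\) is only indicated by a comment, and all names are illustrative.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <cstdint>

void buildSortedCodes(thrust::device_vector<uint32_t>& counts,   // n_p + 1 entries, last is a dummy
                      thrust::device_vector<uint32_t>& offsets,
                      thrust::device_vector<uint32_t>& MC,
                      thrust::device_vector<uint32_t>& P)
{
    offsets.resize(counts.size());
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    uint32_t n_m = offsets.back();                 // total number of Morton codes
    MC.resize(n_m);
    P.resize(n_m);
    // Second pass (kernel not shown): each thread writes M(H(P_i)) to MC and the
    // primitive index i to P at positions offsets[i] ... offsets[i] + counts[i] - 1.
    thrust::sort_by_key(MC.begin(), MC.end(), P.begin());   // order primitives along the Z-curve
}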

In the third stage, we determine the indices at which the Morton codes in \(\texttt {MC}\) indicate splits between BVH nodes. As the OLBVH is an octree, we compare Morton codes advancing in 3-bit steps beginning from the most significant bit. Thus, each split is associated with a level \(l_c < L\) and an index referring to a Morton code in the \(\texttt {MC}\) array. As the split positions refer to the Morton codes of primitives, the split generation procedure writes these positions to the \(\texttt {PO}\) array. As in the case of Morton code generation, we first determine the number of splits between neighboring codes. A parallel scan is then used to determine the size of and offsets into the corresponding arrays. As no split is recorded for the last Morton code, a split is appended sequentially by the CPU for each level. This can be done concurrently while filling the remainder of the \(\texttt {PO}\) array. A parallel stable sort-by-key procedure sorts \(\texttt {PO}\) by split level, resulting in the intended primitive offset order. As the splits are sorted by level, the remaining entries of the \(\texttt {NO}\) array can be calculated by exploiting the per-level contiguous order. In parallel over splits, we search for adjacent splits of different levels. If two such splits are found at positions i and \(i+1\), a node level offset has been found and thus \(\texttt {NO}{}[ splits [i].l_c] = i + 1\).

The final stage relies on \(\texttt {NO}\) and \(\texttt {PO}\) to construct the OLBVH in a bottom-up manner, i.e., starting at the leaf level. As the data layout does not allow for immediate inference of the parent node for a given node, we cannot construct the hierarchy in a single kernel launch using atomic flags as in Apetrei’s [3] agglomerative LBVH builder.

Algorithm 1

We first calculate the leaf node AABBs and boundary flags \(\texttt {BF}\) using Algorithm 1. As all Morton codes associated with a leaf node are equal, the minimum AABB coordinates are obtained by calculating \(q^{-1}(m^{-1}(m_{x,y,z}))\) for an arbitrary \(m_{x,y,z}\) associated with the leaf node. Furthermore, the maximum AABB coordinates are calculated by adding the grid size \(\mathbf {s}^l\) to the minimum coordinates. As a result, leaf node AABBs can be calculated in constant time. The boundary flag \(\texttt {BF}{}[i]\) of a leaf node is set if any primitive in \(P_i\) is tagged as a boundary element. Therefore, boundary flag calculation for a leaf node is \(\mathcal {O}(|P_i|)\).
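The following CUDA sketch illustrates the per-leaf work just described (cf. Algorithm 1), reusing the illustrative OLBVH struct from Sect. 3.1; leafBase denotes the node index of the first leaf (derivable from the per-level offsets \(\texttt {NO}\)), and the remaining parameter names are likewise illustrative rather than the authors' interface.

// Inverse of the bit expansion used for Eq. 1: gather every third bit (Morton decode helper).
__device__ inline uint32_t compactBits(uint32_t v)
{
    v &= 0x09249249u;
    v = (v ^ (v >> 2))  & 0x030C30C3u;
    v = (v ^ (v >> 4))  & 0x0300F00Fu;
    v = (v ^ (v >> 8))  & 0xFF0000FFu;
    v = (v ^ (v >> 16)) & 0x000003FFu;
    return v;
}

// One thread per leaf node; leafBase is the node index of the first leaf.
__global__ void generateLeafNodes(OLBVH bvh, const uint32_t* MC, const bool* primIsBoundary,
                                  const float* meshMin /*[3]*/, const float* cell /*[3]*/,
                                  uint32_t leafBase, uint32_t numLeaves)
{
    uint32_t leaf = blockIdx.x * blockDim.x + threadIdx.x;
    if (leaf >= numLeaves) return;
    uint32_t node = leafBase + leaf;

    uint32_t begin = (bvh.PO[node] > bvh.PO[node + 1]) ? 0u : bvh.PO[node];

    // All Morton codes of a leaf are identical; decode any one into grid coordinates.
    uint32_t code = MC[begin];
    uint32_t grid[3] = { compactBits(code >> 2), compactBits(code >> 1), compactBits(code) };

    // Leaf AABB in constant time: dequantized cell corner plus one Morton cell size s^l.
    for (int c = 0; c < 3; ++c) {
        bvh.BV[node].lo[c] = meshMin[c] + grid[c] * cell[c];
        bvh.BV[node].hi[c] = bvh.BV[node].lo[c] + cell[c];
    }

    // Boundary flag: true if any contained primitive is marked as a boundary element (O(|P_i|)).
    bool boundary = false;
    for (uint32_t j = begin; j < bvh.PO[node + 1]; ++j)
        boundary = boundary || primIsBoundary[bvh.P[j]];
    bvh.BF[node] = boundary;
}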

Algorithm 2 outlines internal node generation. We allocate an auxiliary array \(\texttt {CI}\) of the same size as the overall number of generated Morton codes \(n_m\). The hierarchy construction procedure writes node indices to the primitive offset positions of \(\texttt {CI}\) for each level. On construction of the parent hierarchical level, we exploit Eq. 10 to look up the rightmost child indices for internal nodes at the upper primitive offset positions of the \(\texttt {CI}\) array. To reduce the number of kernel launches, the computation of \(\texttt {CI}\) can be merged with generateLeafNodes for the first iteration. Between internal levels, a separate kernel must be used to prevent conflicting reads and writes. \(\texttt {CI}\) and \(\texttt {PO}\) are used to determine the child offsets \(\texttt {CO}\) according to Eq. 8. Finally, internal node AABBs and boundary flags are determined from their children.

Algorithm 2

3.3 Traversal

Fig. 4: Traversal proceeds on the compacted tree until a leaf node satisfies \(F_N\). The green arrows indicate the traversal path. Variable assignments and short stack states appear below the tree. Initially, traversal concludes that \(F_N\) holds for nodes 1, 2, 5 and 6. Thus, the next node is 1, while nodes 2, 5 and 6 remain on the short stack. In the second step, none of 1’s children satisfy \(F_N\). Thus, traversal pops 2 from the short stack, determines the corresponding parent level, increments the trail bits of that level, and finds the current level using Eq. 13. Finally, \(F_N\) holds for 16 and traversal terminates.

Algorithm 3 outlines our OLBVH traversal approach. Figure 4 presents a sample traversal execution. Traversing the OLBVH for input geometric predicates \(F_N\) for nodes and \(F_P\) for primitives proceeds by pushing child nodes to a traversal stack. If a full traversal stack is used, its size is bounded by \(7 \cdot l + 8\) node indices. This typically exceeds the amount of available fast on-chip shared memory, forcing the use of slower, cache-backed local memory. Consequently, OLBVH traversal uses a short stack [22] to reduce the memory requirement for the stack. As a result of the shallow hierarchy, a 32-bit integer is sufficient to encode the trail. Leaf nodes can be determined efficiently by comparison with the lower level offset of the last level. Since siblings are on the same level, it suffices to check the first child node. The short stack approach uses a queue Q to record the nodes in \(C_i\) (cf. Eq. 7) for which the predicate \(F_N\) evaluates as true. On the evaluated GPUs, the use of a short stack results in a minor speedup of approximately 5%, which we do not evaluate in detail; on older GPUs, we observed larger speedups.

Algorithm 3

Unlike Vaidyanathan et al.’s [22] short stack implementation, we do not tag parent nodes, avoiding unnecessary branching. Additionally, our implementation does not process leaf nodes satisfying \(F_N\) during traversal itself, because, depending on \(\alpha \), leaf nodes contain several primitives, and a high primitive count may lead to significant thread divergence. After pushing intersecting child nodes to Q, the procedure pushes tree nodes from Q to the short stack and identifies the tree node at which traversal continues. This procedure corresponds to lines 10–27 of Vaidyanathan et al.’s [22] BVH\(-N\) traversal method with \(N=8\). The only two adaptations are the bit-trail handling and the comparison with \(\texttt {NO}\)\([ parentLevel ]\) to determine the level of a node on short stack pop, avoiding unnecessary branching (lines 16–20 in Stack Pop):

$$\begin{aligned} l_{cur} = parentLevel + (\texttt {NO}{}[ parentLevel ] \le n) \end{aligned}$$
(13)

In many applications, it is sufficient to traverse the tree evaluating \(F_N\) until a leaf node is reached, as sibling cells never overlap. Primitives associated with the resulting leaf node are then checked until \(F_P\) evaluates to true. However, for some applications such as intersection detection it is required to keep track of intersecting leaf nodes and to test primitives for intersection. In such scenarios, a node set N is used to maintain leaf nodes satisfying \(F_N\). This approach is inherently prone to overflow. Thus, additional treatment is required in order to prevent exceeding the size of N on node push. One solution is to evaluate \(F_P\) for the primitives of all nodes in N if a node push would result in an overflow. After traversal, we evaluate \(F_P\) for the primitives associated with the resulting leaf node or the set of leaf nodes N.
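For illustration, the following is a simplified traversal with a plain explicit stack rather than the bit-trail short stack of Algorithm 3; nodeTest and primTest stand for \(F_N\) and \(F_P\), leafBase denotes the node index of the first leaf (derivable from \(\texttt {NO}\)), and the OLBVH struct is the illustrative one sketched in Sect. 3.1.

// Point-location style traversal: descend through nodes satisfying F_N until a leaf is
// reached, then test its primitives with F_P. Returns the first matching primitive index
// or -1. The actual implementation replaces the explicit stack with a bit-trail short stack.
template <typename FN, typename FP>
__device__ int traverse(const OLBVH& bvh, uint32_t leafBase, FN nodeTest, FP primTest)
{
    uint32_t stack[80];                                  // full-stack bound 7*l + 8 (cf. above)
    int top = 0;
    stack[top++] = 0;                                    // start at the root
    while (top > 0) {
        uint32_t node = stack[--top];
        if (node >= leafBase) {                          // leaf: test the contained primitives
            uint32_t begin = (bvh.PO[node] > bvh.PO[node + 1]) ? 0u : bvh.PO[node];
            for (uint32_t j = begin; j < bvh.PO[node + 1]; ++j)
                if (primTest(bvh.P[j])) return (int)bvh.P[j];
        } else {                                         // internal node: push children satisfying F_N
            for (uint32_t c = bvh.CO[node] + 1; c <= bvh.CO[node + 1]; ++c)
                if (nodeTest(bvh.BV[c])) stack[top++] = c;
        }
    }
    return -1;                                           // no matching primitive found
}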

4 Results

In this section, we evaluate OLBVH construction time and memory consumption. In addition, we investigate our acceleration structure’s benefits in three application areas: direct volume rendering of unstructured meshes for scientific visualization, conservative slicing for 3D-printing, and intersection detection. In order to avoid inaccuracies in runtime measurements, we calculate the median of 50 repetitions. All algorithms were implemented in C++ and CUDA using Visual Studio 2015 and CUDA 10.1. Our implementation of Wald et al.’s [27] method additionally uses OptiX 7.0 [19]. All measurements were performed on one machine with an RTX 2080 Ti and another with a Quadro GP100. The latter GPU does not support RTX in hardware. We consistently use a short stack size of 8. The meshes used throughout the evaluation are shown in Fig. 5. While good simulation meshes have smoothly varying element sizes and tetrahedra with high aspect ratios, the meshes were chosen to include both good meshes and meshes with very large variances in element size, as shown in Table 1. Additionally, while most of the meshes fill the majority of their bounding box, we included two shell-like meshes, “Part” and “Pot.”

Table 1 Mesh element size and quality information for the meshes shown in Fig. 5
Fig. 5: Meshes used in the evaluation. The surface is overlaid over a cross section to visualize both exterior and interior resolution. From top left to bottom right: Bar, Bunny, Cube, Cylinder, Die, Dragon, Fusion, Gargoyle, Jets, Part, Pot, Tardis, and Wrench. The models include both meshes with highly regular primitive size (e.g., Part) and others with large variance in primitive size (e.g., Tardis).

4.1 Construction

We compare OLBVH construction with Apetrei’s [3] improved LBVH construction and OptiX’ acceleration structure construction with subsequent compaction. The details of OptiX’ acceleration structure are not documented and are considered a black box. The resulting construction times are shown in Fig. 6. As tree depth is affected by the choice of \(\alpha \), we present construction times for \(\alpha \in \{0,1,2\}\). For \(\alpha = 0\) our construction approach is slower than Apetrei’s LBVH builder. For \(\alpha \in \{1,2\}\) our approach outperforms or matches the construction times of the LBVH builder in many cases. If no hardware acceleration is used, OLBVH construction is faster than OptiX on all but the smallest meshes (cf. Table 2), even for \(\alpha = 0\). When hardware acceleration is present, our construction approach is slightly slower than OptiX’ BVH construction for \(\alpha = 0\), but achieves significantly lower runtimes when \(\alpha \in \{1,2\}\).

Besides construction time, another beneficial aspect of our data structure is its memory consumption. Table 2 compares the memory consumption of our OLBVH to LBVH and OptiX’ acceleration structure. OLBVH consumes significantly less memory than LBVH or OptiX without specialized RTX hardware. For the jets and dragon meshes, OptiX consumes slightly less memory than OLBVH with \(\alpha = 0\) when RTX hardware is used. Increasing \(\alpha \) results in a significant reduction of memory consumption.

Fig. 6: Comparison of construction times between LBVH using Apetrei’s [3] fast agglomerative approach, OptiX, and OLBVH with \(\alpha \in \{0,1,2\}\) on a Quadro GP100 (upper) and an RTX 2080 Ti (lower).

Table 2 Mesh sizes (in number of tetrahedra) and comparison of memory consumption between LBVH, OptiX’ compacted BVH, and OLBVH for \(\alpha \in \{0,1,2\}\)

4.2 Direct volume rendering

Direct volume rendering of unstructured meshes requires repeated point location along each view ray. A potential performance issue is that a large number of samples may not hit the geometry. We use the boundary marking to efficiently compute the relevant ray interval in which the ray intersects the geometry. Prior to ray marching, we traverse only the boundary nodes of the OLBVH for each ray and determine the intervals for which each ray intersects leaf node AABBs. Subsequently, we perform efficient ray marching by traversing the OLBVH for each sample until a leaf node containing the sample is found.
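A sketch of the per-ray interval computation described above follows, again using the illustrative OLBVH struct from Sect. 3.1; raySlab stands for a standard ray/AABB slab test and is not shown, leafBase denotes the first leaf node index, and the sketch computes a single enclosing interval rather than per-segment intervals.

// Restrict ray marching to [tEnter, tExit] by intersecting the ray only with boundary
// nodes (BF set); interior-only subtrees are skipped entirely.
__device__ void rayInterval(const OLBVH& bvh, uint32_t leafBase,
                            float3 org, float3 dir, float& tEnter, float& tExit)
{
    tEnter =  1e30f;
    tExit  = -1e30f;
    uint32_t stack[80];
    int top = 0;
    stack[top++] = 0;
    while (top > 0) {
        uint32_t node = stack[--top];
        float t0, t1;
        if (!bvh.BF[node] || !raySlab(bvh.BV[node], org, dir, t0, t1))
            continue;                                    // skip interior subtrees and missed boxes
        if (node >= leafBase) {                          // boundary leaf: extend the ray interval
            tEnter = fminf(tEnter, t0);
            tExit  = fmaxf(tExit, t1);
        } else {
            for (uint32_t c = bvh.CO[node] + 1; c <= bvh.CO[node + 1]; ++c)
                stack[top++] = c;
        }
    }
}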

Table 3 Mesh sizes (in number of tetrahedra) for the DVR evaluation meshes. We also give the ray interval computation times in milliseconds on the Quadro GP100

We compare our approach with the LBVH as well as Wald et al.’s [27] approach with and without RTX hardware acceleration. To provide a fair basis for comparison, we evaluate all approaches using the same camera settings. The sizes of the evaluated meshes are given in Table 3. In every measurement, a \(1024^2\) image was rendered. The runtimes for the ray interval computation appear in Table 3. Because the traversal for determining the relevant ray interval safely skips the majority of nodes, its runtime imposes a negligible overhead. Unlike Morrical et al.’s [14] space skipping extension to Wald et al.’s approach, we do not require a secondary acceleration structure.

Figure 7 shows the sampling rates per second for the different ray marching variants. If no specialized RTX hardware is used and \(\alpha = 0\), our approach outperforms the ray marching technique of Wald et al. [27] by up to 8.4\(\times \) and the LBVH variant by 2\(\times \)–13\(\times \). The Jets mesh benefits the least, as it is geometrically a cube (compare Figs. 1 and 5), so our ray marching approach cannot benefit from the prior space skipping. Nonetheless, we observe a significant speedup. For the Fusion and Part meshes, it is even possible to choose \(\alpha = 1\) and match the performance of Wald et al.’s method [27]. When the Pot mesh is used, we are able to match the sampling rate of Wald et al.’s method [27] with \(\alpha =2\). Therefore, an application could construct the OLBVH more quickly and use less memory while achieving the same performance. The LBVH results in lower sampling rates than the other methods. Traversing the binary radix tree consumes significantly more time than using our method with \(\alpha \in \{0,1\}\). However, RTX hardware accelerated ray marching is faster than our method. As only a minority of GPUs provides RTX hardware, we expect our method to be superior on the majority of GPUs.

Fig. 7: Comparison of sampling rates using LBVH, Wald et al.’s method, and our approach for \(\alpha \in \{0,1,2\}\) on a Quadro GP100 (upper, no RTX) and an RTX 2080 Ti (lower, with RTX). We consider only samples inside the geometry.

4.3 Conservative slicing

Like direct volume rendering, slicing of volumetric meshes with volumetrically varying materials (see, e.g., Altenhofen et al. [2]) is a point location problem. However, as post-processing of 3D printed models using conventional CAM or by polishing can only remove material, slicing typically has to be performed conservatively, i.e., material must be deposited in a voxel of the printing volume even if only the AABB of the voxel intersects the geometry and not the centroid itself. Therefore, if a tetrahedron intersects the AABB, the sampling point is valid. We interpolate material/scalar parameters for a sampling point inside the input mesh using the containing tetrahedron. For a sampling point outside the mesh, we extrapolate using the closest tetrahedron intersecting the conservative AABB. We evaluate on a \(999^2\) grid using the slicing planes and slice thicknesses (measured normal to the slicing plane) given in Table 4.

In the following, we describe an efficient conservative slicing algorithm using the OLBVH. We initially traverse the OLBVH for a sampling point, attempting to find a containing tetrahedron. Due to tightly fitting AABBs, it is sufficient to find the single leaf node that contains the sampling point. If no such tetrahedron can be found, the sampling point is potentially near the boundary. In that case, we reinitialize the short stack and traverse the OLBVH for the corresponding AABB while only considering boundary nodes. This decreases the number of relevant tree nodes during traversal. In the second traversal phase, we maintain a set N of intersecting leaf nodes. The competing LBVH variant traverses the tree for the voxel AABB, finding the point in the mesh closest to the sampling point. As Wald et al.’s [27] approach only applies to point sampling, we do not consider it in this use case.

Table 4 Mesh sizes (in number of tetrahedra) and slicing setups for conservative slicing where the origin is the mesh AABB’s midpoint. The slice thickness is the voxel size normal to the slicing plane
Fig. 8: Comparison of slicing times for a single slice using LBVH and our approach for \(\alpha \in \{0,1,2\}\) on a Quadro GP100 (upper) and an RTX 2080 Ti (lower).

Figure 8 presents the runtimes for slicing using our approach and LBVH. Our approach outperforms LBVH for \(\alpha \in \{0,1\}\) with speedups between 3\(\times \) and 25\(\times \). Therefore, the OLBVH allows for acceleration of conservative slicing, while applications can benefit from fast, memory-efficient construction on GPUs.

4.4 Mesh intersection

The third use case is intersection/overlap detection between two volumetric meshes. This has use cases in interactive mesh editing and Lagrangian (moving mesh) simulations. Initially, we construct an OLBVH for both input meshes. Subsequently, traversing the OLBVH of one mesh for each tetrahedron of the other yields a set of intersecting leaf nodes N for each tetrahedron. By testing the tetrahedra contained by the nodes in N for intersection with the traversal tetrahedron, we identify the tetrahedra which intersect the other mesh. In order to avoid the vast majority of intersection tests, we exploit the boundary marking to infer intersection for interior nodes. When a tetrahedron intersects the AABB of a non-boundary node, that tetrahedron must intersect with the other mesh. In addition, we perform both traversals (\(A \subset B\) and \(B \subset A\)) simultaneously.
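A sketch of the broad-phase test for a single tetrahedron of mesh A against the OLBVH of mesh B follows, illustrating the boundary-flag early out; the Tet type, tetOverlapsBox (exact tetrahedron/AABB overlap), and leafNarrowPhase (exact tetrahedron/tetrahedron tests for a leaf) are illustrative placeholders and not shown.

struct Tet { float v[4][3]; };   // illustrative query tetrahedron of mesh A

// Returns true if the query tetrahedron intersects mesh B. Overlapping the AABB of a
// non-boundary node already implies intersection, since such nodes lie entirely inside B;
// exact tests are only needed at boundary leaves.
__device__ bool tetIntersectsMesh(const OLBVH& bvhB, uint32_t leafBase, const Tet& tet)
{
    uint32_t stack[80];
    int top = 0;
    stack[top++] = 0;
    while (top > 0) {
        uint32_t node = stack[--top];
        if (!tetOverlapsBox(tet, bvhB.BV[node])) continue;
        if (!bvhB.BF[node]) return true;                 // interior node: intersection is implied
        if (node >= leafBase) {
            if (leafNarrowPhase(bvhB, node, tet)) return true;   // boundary leaf: exact tests
        } else {
            for (uint32_t c = bvhB.CO[node] + 1; c <= bvhB.CO[node + 1]; ++c)
                stack[top++] = c;
        }
    }
    return false;
}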

Table 5 Mesh sizes (in number of tetrahedra) for the intersection test cases

We compare our OLBVH-accelerated mesh intersection method to LBVH traversal using the test cases in Table 5; the benchmark results appear in Fig. 10. Figure 9 illustrates the first intersection case. OLBVH-accelerated intersection detection results in a speedup of 5\(\times \)–52\(\times \) for \(\alpha = 0\). This is due to the fact that the overwhelming majority of intersections can be inferred from the boundary flags, while the LBVH-accelerated intersection detection method has to perform narrow phase intersection checks for each tetrahedron due to overlapping AABBs. Furthermore, a significant benefit can be found for all \(\alpha \in \{0,1,2\}\). Mesh intersection using the OLBVH is fast, and applications can benefit from quicker OLBVH construction and significantly reduced memory consumption by choosing a higher \(\alpha \).

Fig. 9: Intersection case 1 between Tardis (red) and Die (blue) meshes. Intersecting tetrahedra have been removed. As the Tardis mesh extends far into the interior of the Die mesh, the vast majority of intersecting tetrahedra can be detected early throughout traversal. Narrow phase intersection detection is only required where mesh boundaries are close to each other.

Fig. 10: Comparison of calculation times for computing the intersections using LBVH and our method for \(\alpha \in \{0,1,2\}\) on a Quadro GP100 (upper) and an RTX 2080 Ti (lower). The numbers underneath the bars correspond to the test cases in Table 5.

5 Conclusion

In summary, we have introduced the OLBVH, a novel Morton-curve-based linear BVH for volumetric meshes. Our construction approach produces memory-efficient trees in a time-efficient manner. The heuristic used during construction is controlled by a parameter, allowing for faster construction and a reduced memory footprint by generating shallower trees. In addition, the resulting BVH is of high quality, as sibling cells are non-overlapping. We have demonstrated that the OLBVH is a versatile acceleration structure for volumetric meshes by examining three use cases: direct volume rendering for scientific visualization, conservative slicing for 3D printing, and inside-outside intersection detection.

The introduction of boundary marking, which requires a tightly fitting BVH, has enabled the development of efficient algorithms for all three use cases. We have achieved a significant increase in sampling rates in DVR, 8.4\(\times \)–13\(\times \) compared to state-of-the-art methods without specialized hardware for BVH traversal. OLBVH has allowed us to implement an efficient conservative slicing algorithm that results in a 3\(\times \)–25\(\times \) speedup. Additionally, we have accelerated intersection detection by 5\(\times \)–52\(\times \). As a result, our data structure has opened the door for efficient GPU-accelerated volumetric mesh lookups.

5.1 Limitations

While memory consumption and construction time can be significantly reduced by increasing \(\alpha \), a shallower tree may lead to lower performance, as leaf nodes contain more primitives. Furthermore, tetrahedra are assigned to nodes based on their quantized AABBs. Therefore, some nodes contain tetrahedra that do not overlap the node’s AABB. For very large meshes, such as the 62M-tetrahedron jpn-qk dataset used by Wald et al. [27], GPU memory on the systems used in our evaluation is insufficient. Wald et al. partially work around this issue by generating separate BVHs for groups of 1M tetrahedra in the mesh.

5.2 Future work

A promising approach to improve the OLBVH is to generate Morton codes for each tetrahedron based on the tetrahedron itself instead of its AABB. Since the relevant ray intervals can be computed efficiently, it may be worthwhile to sort rays by their workload in order to benefit from load balancing and efficient memory access using the short stack. To overcome the memory limitation, the workaround used by Wald et al. could be implemented using the OLBVH. Alternatively, out-of-core variants could be examined. While current RTX hardware allows for custom intersection shaders, it does not allow for custom BVHs. If it becomes possible to combine the built-in triangle intersection with custom BVHs, it may be interesting to revisit tetrahedral DVR using the same approach as Wald et al. but with the OLBVH. Another interesting avenue for future research is exploring the performance benefits of a hardware or FPGA implementation of the OLBVH. Such an implementation would benefit from the short stack due to the small number of registers required during traversal.