1 Introduction

Direct volume rendering generates informative two-dimensional (2D) images from three-dimensional (3D) volumes by directly mapping voxel values to optical properties with a transfer function. To highlight a target, users need to design a transfer function so that voxels within the target area carry distinctly different optical properties from those outside. This process requires considerable design effort, especially when the region of interest is hard to differentiate from neighboring objects. Image segmentation is a closely related field concerned with extracting a target, yet using image segmentation techniques to facilitate the volume rendering process has not been well studied.

For exploring 3D volumes, a traditional 2D input device such as a mouse or a keyboard has intrinsic limitations for labeling and interacting with 3D data. The dimensional discrepancy imposes a heavy burden on users, who must subjectively translate 2D actions into 3D effects. Additionally, perceiving the entire 3D dataset on a 2D surface is infeasible, so we need a more effective way to observe the data.

To address the challenges above, we design and implement NUI-VR\(^2\), a natural user interface-enabled volume rendering system in the virtual reality space. In NUI-VR\(^2\), users inspect the 3D volume in a VR environment and specify a few seeds within the target with intuitive gestures and voice commands. From those seeds, image segmentation converts the original volume into a probability volume in which voxels inside the target yield higher values, so a simple linear transfer function highlights the target well. Users can then explore the rendered volume with the NUI inside an immersive VR environment. In summary, our main contributions are threefold:

  • Propose a generic strategy for integrating image segmentation into volume rendering: image segmentation and feature selection techniques, rather than high-dimensional transfer functions, are applied to highlight the target.

  • Design and implement a novel end-to-end volume interaction, image segmentation, and volume rendering system in VR.

  • Develop an attention-based NUI for the VR environment that supports an extensible set of gestures and voice commands.

2 Related work

Transfer function design has been the focus of many volume rendering researchers (Pfister et al. 2001; Arens and Domik 2010; Ljung et al. 2016; Mady and Abou El-Seoud 2020). Initially, researchers assigned optical properties to voxels based solely on their intensity values (He et al. 1996; Bajaj et al. 1997; Sabella 1988; König and Gröller 2001). Levoy (1988) first added the local grayscale gradient to the mapping process to isolate objects with similar intensities. Other data features, e.g., curvature (Kindlmann et al. 2003), texture (Caban and Rheingans 2008), and distance (Tappenbeck et al. 2006) information, were included later on. The difficulty of identifying a proper transfer function increases as the dimensionality of the features grows. Tzeng et al. (2005) proposed a smart volume rendering system: they trained a model with the user's inputs and used it to classify the volume, eventually simplifying the transfer functions. Topology-based (Takeshima et al. 2005; Weber et al. 2007) and domain-specific (Tiede et al. 1998) segmentations were also applied to divide the original volume into sub-volumes with a similar goal. ImageVis3D (Fogal and Krüger 2010) is a powerful transfer function design software package with an intuitive user interface. All of these approaches help design a transfer function, but none of them frees users from the daunting process.

On the image segmentation side, researchers proposed numerous innovative algorithms (Zhu and Yuille 1996; Gao et al. 2012; Bali and Singh 2015; Kuruvilla et al. 2016) over the years. They usually require some user interactions to guarantee accurate segmentation results. Such user interactions include specifying sample seeds inside the target (Boykov and Jolly 2001; Vezhnevets and Konouchine 2005; Karasev et al. 2013) or boundary masks outside (Mortensen and Barrett 1998). If the sample seeds or boundary masks are not well-defined, which is often the case with a traditional 2D input device, segmentation leakages may occur. Researchers have been trying to improve the user interactions with Microsoft Kinect. Kinect-based interfaces effectively alleviated the dimensional discrepancy between the user space and the 3D data (Wang and Jung 2017; Ju et al. 2018).

Researchers have also explored VR as the medium for visualizing volumetric data. Early on, Hänel et al. (2016) used a theatre-like system consisting of a room-sized cube and projectors; however, its demanding setup prevents wide adoption. Recently, portable VR technologies, from simple cardboard inserts for smartphones to sophisticated VR headsets with accurate tracking sensors, became widely available (El Beheiry et al. 2019). This advancement has led to flourishing research on exploiting portable VR to visualize volumetric data (Cohen et al. 2013; Chan et al. 2013; Faludi et al. 2019), and the results have been promising because of the immersive environment. As VR improves the user experience in volume exploration and visualization, we set up our proposed system entirely in VR.

3 Methods

NUI-VR\(^2\) is set up in the VR space. With the six-degree-of-freedom positional tracking capability of current VR headsets, e.g., Oculus Rift, users can virtually interact with the volumes. As reported by Hänel et al. (2016), users are more motivated to explore data in VR because of the immersive experience. We adopt a portable VR headset, Oculus Rift, instead of the enormous theatre-like environment of Hänel et al. (2016) to make NUI-VR\(^2\) more accessible. The typical rendering process for an \(m \times n \times k\) volume is numbered in Fig. 1 and can be summarized as follows:

  1. Users browse through the volume with voice commands and gestures. Once users locate the target, they can record a few seeds within the target and generate various 3D masks outside the target. Those seeds and masks serve as the inputs for the segmentation algorithms.

  2. Assuming users record s seeds and we compute f predefined features for each seed, the feature computation step yields an \(s \times f\) seed feature matrix.

  3. The feature selection process selects r of the f features and reduces the seed feature matrix to \(s \times r\).

  4. We compute the r features for the entire volume and obtain an \(m \times n \times k \times r\) feature volume.

  5. With the seed feature matrix, the feature volume, and optionally the boundary masks, we can apply a wide range of image segmentation techniques to generate an \(m \times n \times k\) probability volume, in which the target region carries larger values than the rest.

  6. Finally, we render the probability volume with a simple linear transfer function to highlight the target; a minimal end-to-end sketch of this data flow is shown right after the list. Figure 1 shows a rendering of a human head with a tumor highlighted as an example.
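
To make the data flow above concrete, the Python sketch below walks through the six steps on a toy configuration. The feature set (normalized voxel position plus intensity) and the nearest-seed similarity rule are illustrative stand-ins only, not the actual features, feature selection, or segmentation engines of NUI-VR\(^2\), which are described in the following subsections.

```python
import numpy as np

def nui_vr2_pipeline_sketch(volume, seeds):
    """Minimal end-to-end sketch of the six-step data flow above.

    volume : (m, n, k) float array of voxel intensities.
    seeds  : (s, 3) integer array of seed coordinates recorded in step 1.
    """
    m, n, k = volume.shape

    # Step 2: a toy feature set -- normalized (i, j, k) position and intensity.
    ii, jj, kk = np.meshgrid(np.arange(m), np.arange(n), np.arange(k),
                             indexing="ij")
    feats = np.stack([ii / m, jj / n, kk / k,
                      volume / (volume.max() + 1e-12)], axis=-1)  # (m, n, k, 4)

    # Steps 2-3: the (s, f) seed feature matrix; feature selection is a
    # no-op in this sketch because all f = 4 features are kept.
    seed_feats = feats[seeds[:, 0], seeds[:, 1], seeds[:, 2]]     # (s, 4)

    # Steps 4-5: probability = similarity to the closest seed in feature space
    # (evaluated densely here for brevity; large volumes need chunking).
    flat = feats.reshape(-1, 4)                                   # (m*n*k, 4)
    d2 = ((flat[:, None, :] - seed_feats[None, :, :]) ** 2).sum(-1)
    prob = np.exp(-d2.min(axis=1) / 0.01).reshape(m, n, k)

    # Step 6: a linear transfer function maps probability directly to opacity.
    return (prob - prob.min()) / (np.ptp(prob) + 1e-12)
```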

Fig. 1 The NUI-VR\(^2\) workflow overview. Blue boxes represent the data, whereas yellow boxes represent the components of NUI-VR\(^2\). The numbers indicate the workflow sequence

3.1 Natural user interface

One challenge for users to explore 3D volumes, especially in VR, is the lack of an efficient input device. Kinect recognizes voice commands and tracks 3D locations of multiple human joints, including fingertips. We designed a NUI system with Kinect to reduce the dimensional discrepancy. It enables users to generate seeds and boundary masks directly in the 3D space effortlessly.

3.1.1 NUI system overview

Figure 2 shows an overview of NUI-VR\(^2\) with an emphasis on the NUI system. The NUI system runs on an individual thread separate from the render thread and detects voice commands and gestures. Once an event is detected, the NUI thread sends the event and some optional metadata to the render thread. The render thread then updates the volume rendering as instructed by the event. For example, when users move their left-hand tips, the NUI thread sends the LEFT-HAND-TIP-MOVED event with the position info to the render thread. The render thread then fetches the new position data and repositions the rendering.
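
The sketch below illustrates this producer/consumer pattern with a Python queue shared between the two loops. The event names mirror the example above, but the tracker and renderer objects and their methods are hypothetical placeholders, not the actual NUI-VR\(^2\) or Kinect SDK API.

```python
import queue

# Events flow from the NUI thread to the render thread through a shared queue.
event_queue = queue.Queue()

def nui_thread_loop(tracker):
    """Producer: poll the (hypothetical) gesture/voice tracker and post events."""
    while True:
        joints, command = tracker.poll()                      # hypothetical API
        if joints is not None:
            event_queue.put(("LEFT-HAND-TIP-MOVED", joints["left_hand_tip"]))
        if command is not None:
            event_queue.put((command, None))                  # e.g. ("insert", None)

def render_thread_loop(renderer):
    """Consumer: drain pending events each frame and update the rendering."""
    while True:
        while not event_queue.empty():
            event, payload = event_queue.get()
            if event == "LEFT-HAND-TIP-MOVED":
                renderer.set_hand_position(payload)           # hypothetical API
            elif event == "insert":
                renderer.record_seed()                        # hypothetical API
        renderer.draw_frame()                                 # hypothetical API
```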

Fig. 2 NUI-VR\(^2\) system overview

3.1.2 3D image browsing

We use the 3D position of the left-hand tip to browse volume slices. To aid further discussion, we define the Kinect origin as \({\mathfrak {o}}\in {\mathbb {R}}^3\) and the direction Kinect is facing as a unit vector \({\mathfrak {p}}\). \(\tilde{d}(t)\), the distance of the left-hand tip from \({\mathfrak {o}}\) along \({\mathfrak {p}}\) at time t, is used to extract a slice from the volume. We assume the volume resides in the center of the Kinect space with the viewing plane perpendicular to \({\mathfrak {p}}\), and we map the volume slices evenly onto a valid distance range: the first (last, resp.) slice maps to the minimum (maximum, resp.) value of the range. Thus, a slice can be picked with \(\tilde{d}(t)\). However, \(\tilde{d}(t)\) is noisy and may cause jitter. To address this issue, we use recursive filtering:

$$\begin{aligned} d(t_0)&= \tilde{d}(t_0) \end{aligned}$$
(1)
$$\begin{aligned} d(t_n)&= \alpha \tilde{d}(t_{n}) + (1-\alpha )d(t_{n-1}) \end{aligned}$$
(2)

where \(\alpha\) (\(0< \alpha < 1\)) is the smoothing factor. \({d}(t_{n})\) will be more stable than \(\tilde{d}(t_{n})\). Figure 3 shows an example of browsing a 3D human brain volume in NUI-VR\(^2\). The distance of the user’s left-hand tip to \({\mathfrak {o}}\) in the viewing plane direction controls the positions of the green dots. Users can issue an “insert” voice command to add them to the seed set.
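
A minimal implementation of Eqs. (1) and (2) is a two-line exponential smoother; the class below is a sketch with an arbitrary default \(\alpha\).

```python
class RecursiveSmoother:
    """Exponential smoothing of the noisy hand-tip distance, Eqs. (1)-(2)."""

    def __init__(self, alpha=0.3):
        assert 0.0 < alpha < 1.0, "the smoothing factor must lie in (0, 1)"
        self.alpha = alpha
        self.d = None                     # holds d(t_{n-1})

    def update(self, d_raw):
        """Feed the raw distance d~(t_n) and return the smoothed d(t_n)."""
        if self.d is None:
            self.d = d_raw                                            # Eq. (1)
        else:
            self.d = self.alpha * d_raw + (1 - self.alpha) * self.d   # Eq. (2)
        return self.d
```

Mapping the smoothed distance onto a slice index is then a linear rescaling of \(d(t_n)\) from the valid distance range to \([0, k-1]\).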

Fig. 3 Viewing a 2D slice in a 3D MR human head volume in NUI-VR\(^2\)

3.1.3 3D surface generation

We track and smooth the joints on the arm and hands with Kinect in our NUI. Those joints can form a closed spatial curve. By sweeping and recording the curves over time, we can generate various 3D surfaces:

3.1.3.1 Surface from polygon

We track a total of m joints \(\{P_i\}\) from the left-hand tip (\(P_0\)) to the right-hand tip (\(P_{m-1}\)), where m ranges from 3 to 11. \(P_0\) still controls the browsing of the volume, as detailed in the previous section. We use the orthogonal projections of the \(P_i\) to create polygons on the current viewing plane. Given \(P_0\) and the viewing plane’s normal direction \({\mathfrak {p}}\), the vector from the Kinect origin \({\mathfrak {o}}\) to the plane can be denoted as \(\varvec{v}:=\langle P_0, {\mathfrak {p}}\rangle {\mathfrak {p}}\), where \(\langle \cdot , \cdot \rangle\) indicates the inner product. \(P_i\) is then projected onto the plane through \(Q_i:= P_i - \langle P_i - \varvec{v}, {\mathfrak {p}}\rangle {\mathfrak {p}}\). The projected points control the shape of the polygon in the current slice, and their loci over time, \(Q_i(t)\) for \(i=0, \dots , m-1\), form a continuous 3D surface.
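
The projection \(Q_i = P_i - \langle P_i - \varvec{v}, {\mathfrak {p}}\rangle {\mathfrak {p}}\) is a one-liner in NumPy; the sketch below assumes the joint positions are already expressed in Kinect coordinates.

```python
import numpy as np

def project_to_viewing_plane(P, P0, p_hat):
    """Project tracked joints onto the viewing plane through P0 with normal p_hat.

    P     : (m, 3) joint positions in Kinect space.
    P0    : (3,)  left-hand tip, which lies in the viewing plane by design.
    p_hat : (3,)  unit vector along the Kinect viewing direction.
    """
    p_hat = p_hat / np.linalg.norm(p_hat)
    v = np.dot(P0, p_hat) * p_hat          # vector from the origin to the plane
    offsets = (P - v) @ p_hat              # <P_i - v, p> for every joint
    return P - offsets[:, None] * p_hat    # Q_i, all lying in the viewing plane
```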

3.1.3.2 Surface from circles

Instead of polygons, users can also use circles with varying centers and radii to form 3D surfaces. Only two joints, the left-hand tip and the right-hand tip, are tracked. Since the left-hand tip always lies in the viewing plane by design, we only need to project the right-hand tip onto the plane. The line segment between the two points defines the circle’s diameter. Sweeping the circles over time constructs a 3D surface.
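
The corresponding computation for a single circle is sketched below; it only returns the circle’s center and radius on the viewing plane.

```python
import numpy as np

def circle_from_hands(left_tip, right_tip, p_hat):
    """Circle on the viewing plane defined by the two hand tips.

    The left-hand tip already lies in the viewing plane; the right-hand tip is
    projected onto it, and the segment between them is the circle's diameter.
    """
    p_hat = p_hat / np.linalg.norm(p_hat)
    right_proj = right_tip - np.dot(right_tip - left_tip, p_hat) * p_hat
    center = 0.5 * (left_tip + right_proj)
    radius = 0.5 * np.linalg.norm(right_proj - left_tip)
    return center, radius
```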

3.1.3.3 Spatial curve from points

We only track the left-hand tip. Its 3D positions over time construct a continuous spatial curve. As the left hand moves, the viewing slice follows. The recorded curve helps mark a target for image segmentation algorithms.

3.1.4 Voice control

Kinect features a multi-array microphone and recognizes predefined voice commands from the input audio stream through the Microsoft speech application programming interface. We integrated voice commands into the volume rendering pipeline so that users can interact with the rendering process hands-free while exploring 3D volumes, without worrying about physical limitations. Table 1 lists some voice commands in NUI-VR\(^2\). More voice commands can easily be added to our NUI system.

Table 1 A few voice command examples in NUI-VR\(^2\)

3.2 Interactive image segmentation

Interactive image segmentation enables users to specify seeds and masks that serve as strong hints for extracting the target, e.g., the target shall include all the seeds and shall not leak through the masks. We provide a generic strategy to leverage interactive image segmentation in volume rendering. The image segmentation algorithm converts the original \(m \times n \times k\) volume into an \(m \times n \times k\) probability volume, where voxels in the target area carry larger values because they share similar feature values with the seeds. By allowing users to plug in numerous advanced image segmentation algorithms (Gao et al. 2010, 2012; Zhu et al. 2014; Chang et al. 2018), we leave ample room for optimizing NUI-VR\(^2\).

For simplicity, we use kernel density estimation (KDE) with a Gaussian kernel (Terrell and Scott 1992) as an example; KDE is the default image segmentation engine in NUI-VR\(^2\). The kernel density estimator \(f(\varvec{v})\) is the sum of s multivariate Gaussians centered at the seeds \(\varvec{c}_i\):

$$\begin{aligned} f(\varvec{v}):=\sum _{i=1}^{s} e^{-||\varvec{v}-\varvec{c}_i||^2/\alpha ^2} \end{aligned}$$
(3)

where \(\varvec{v} \in {\mathbb {R}}^r\) is a volume voxel to be estimated, r is the number of features, \(\varvec{c}_i \in {\mathbb {R}}^r\) (for \(i=1, \dots , s\)) are the seeds, and \(\alpha\) is the bandwidth of the Gaussian kernel. \(f(\varvec{v})\) represents the likelihood that voxel \(\varvec{v}\) falls within the same group as the seeds \(\varvec{c}_i\). A linear transfer function then emphasizes the target well in the resulting probability volume.
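
A dense NumPy evaluation of Eq. (3) takes only a few lines. The sketch below rescales the result to \([0, 1]\); for large volumes, it would in practice be evaluated in chunks or on the GPU.

```python
import numpy as np

def kde_probability(feature_volume, seed_features, alpha=1.0):
    """Evaluate the Gaussian kernel density estimator of Eq. (3) per voxel.

    feature_volume : (m, n, k, r) array of per-voxel feature vectors.
    seed_features  : (s, r) array of seed feature vectors c_i.
    alpha          : bandwidth of the Gaussian kernel.
    Returns an (m, n, k) probability volume rescaled to [0, 1].
    """
    m, n, k, r = feature_volume.shape
    flat = feature_volume.reshape(-1, r)                         # (m*n*k, r)
    # Squared distances from every voxel to every seed, summed per Eq. (3).
    d2 = ((flat[:, None, :] - seed_features[None, :, :]) ** 2).sum(axis=-1)
    prob = np.exp(-d2 / alpha ** 2).sum(axis=1).reshape(m, n, k)
    return prob / (prob.max() + 1e-12)
```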

3.3 Feature selection

NUI-VR\(^2\) provides a feature set that consists of the 3D spatial location, intensity, and texture features (Vimort and McCormick 2017), including energy, entropy, correlation, difference moment (DM), inertia, cluster shade (CS), cluster prominence (CP), Haralick’s correlation (HC), short run emphasis (SRE), long run emphasis (LRE), gray level non-uniformity (GLNU), run length non-uniformity (RLNU), low gray level run emphasis (LGLRE), and high gray level run emphasis (HGLRE). One significant advantage of using NUI-VR\(^2\) is that users could supply additional features to expand the default feature set.

However, different targets may be best described by different sets of features. For example, points on a vertical line in a 2D Cartesian coordinate system share the same x coordinate, so characterizing them by their x values alone is more accurate than using both coordinates. Using all of the features to characterize the target is time-consuming and may even decrease accuracy (Caban and Rheingans 2008). Thus, we add a feature selection process to NUI-VR\(^2\). The default feature selection algorithm is based on the \(\ell\)1-norm support vector machine (SVM). SVM is a supervised machine learning algorithm that finds the optimal separating hyperplane for a classification problem (Zhu et al. 2004). Given a set of n labeled training samples \(\{(\varvec{x}_i, y_i)\}_1^n\) with \(\varvec{x}_i \in {\mathbb {R}}^r\) being the training data and \(y_i \in \{-1, 1\}\) being the label, the \(\ell\)1-norm SVM solves the following optimization problem:

$$\begin{aligned} \min _{b, \varvec{w}}\displaystyle \sum _{i=1}^{n}\left[ 1-y_i\left( b+\varvec{w}\cdot \varphi (\varvec{x}_i)\right) \right] _{+} + \lambda \Vert \varvec{w}\Vert _1 \end{aligned}$$
(4)

where \([z]_{+}:=\max (z, 0)\) denotes the hinge loss. \(\varphi :{\mathbb {R}}^r \rightarrow {\mathbb {R}}^q\) is the kernel function that maps \(\varvec{x}\) from the original r-dimensional space to a q-dimensional space where the data are easier to separate with a hyperplane. \(\varvec{w}\) represents the hyperplane, and b is the bias. \(\lambda\) is the penalty parameter that controls the sparsity of \(\varvec{w}\). Each training sample carries a label of 1 or −1 depending on whether it lies within the target. We retain \(\varvec{x}\) in the original space, i.e., \(q=r\), so that each component of \(\varvec{w}\) corresponds to one feature. The solution to the above optimization problem is a sparse \(\varvec{w}\), and we select only the features with nonzero weights, as they play a major role in discriminating the two groups of training data. By varying \(\lambda\), we can control the number of selected features.
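
With scikit-learn, an \(\ell\)1-penalized linear SVM gives a close stand-in for Eq. (4). Note that LinearSVC minimizes a squared-hinge variant of the loss and its parameter C plays the role of \(1/\lambda\), so this is a sketch rather than the exact formulation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_features_l1(X, y, lam=1.0):
    """l1-norm SVM feature selection with a linear kernel (q = r).

    X   : (n, r) training feature matrix (samples inside and outside the target).
    y   : (n,) labels in {-1, +1}.
    lam : sparsity penalty; scikit-learn's C acts as 1/lambda in Eq. (4).
    Returns the indices of features with nonzero weights in w.
    """
    clf = LinearSVC(C=1.0 / lam, penalty="l1", dual=False, max_iter=10000)
    clf.fit(X, y)
    w = clf.coef_.ravel()                      # the (sparse) hyperplane weights
    return np.flatnonzero(np.abs(w) > 1e-8)    # indices of selected features
```

Increasing lam (i.e., decreasing C) drives more components of \(\varvec{w}\) to zero and therefore selects fewer features, which is the knob used later in Sect. 4.2.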

3.4 Virtual reality

Using VR to explore 3D volumes has great potential (El Beheiry et al. 2019; Cohen et al. 2013; Chan et al. 2013; Faludi et al. 2019). Coupled with our touch-less NUI, VR makes NUI-VR\(^2\) more innovative, efficient, and enjoyable because of its realistic and immersive nature. There are various kinds of consumer VR devices, including cardboard viewers (e.g., Google Cardboard), mobile device mounts (e.g., Samsung Gear VR), standalone headsets (e.g., Oculus Quest), and tethered headsets (e.g., Oculus Rift). We build our system with Oculus Rift because a tethered headset connected to a powerful computer offers the best performance. We use the Visualization Toolkit (VTK; Schroeder et al. 2004) as the rendering engine. VTK supports Oculus Rift natively and provides a rich feature set for volume rendering. Figure 4 shows the rendering result of a \(512 \times 512 \times 288\) abdominal CT volume in NUI-VR\(^2\) with a predefined transfer function. Both views mirror the entire Oculus Rift display after chromatic aberration and lens distortion corrections. In Fig. 4a, users get an overall view of the data, whereas in Fig. 4b, users get a closer look at the internal structure of the volume. The rendering results illustrate that NUI-VR\(^2\) delivers high-quality images in the VR space just as in a traditional desktop setup.
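
The sketch below shows the core of such a rendering path with VTK’s Python bindings: a probability volume wrapped in vtkImageData, a linear grayscale color ramp, and a linear opacity ramp. It renders to a plain desktop window; the VR-specific wiring through VTK’s OpenVR/Oculus render window classes is omitted here.

```python
import numpy as np
import vtk
from vtk.util import numpy_support

def render_probability_volume(prob):
    """Render an (m, n, k) probability volume with a linear transfer function."""
    m, n, k = prob.shape
    image = vtk.vtkImageData()
    image.SetDimensions(k, n, m)                 # VTK expects x, y, z ordering
    scalars = numpy_support.numpy_to_vtk(
        prob.ravel(order="C").astype(np.float32), deep=True)
    image.GetPointData().SetScalars(scalars)

    opacity = vtk.vtkPiecewiseFunction()         # linear opacity ramp
    opacity.AddPoint(0.0, 0.0)
    opacity.AddPoint(1.0, 1.0)
    color = vtk.vtkColorTransferFunction()       # linear grayscale ramp
    color.AddRGBPoint(0.0, 0.0, 0.0, 0.0)
    color.AddRGBPoint(1.0, 1.0, 1.0, 1.0)

    prop = vtk.vtkVolumeProperty()
    prop.SetScalarOpacity(opacity)
    prop.SetColor(color)
    prop.ShadeOn()
    prop.SetInterpolationTypeToLinear()

    mapper = vtk.vtkGPUVolumeRayCastMapper()
    mapper.SetInputData(image)
    volume = vtk.vtkVolume()
    volume.SetMapper(mapper)
    volume.SetProperty(prop)

    renderer = vtk.vtkRenderer()
    renderer.AddVolume(volume)
    window = vtk.vtkRenderWindow()
    window.AddRenderer(renderer)
    interactor = vtk.vtkRenderWindowInteractor()
    interactor.SetRenderWindow(window)
    window.Render()
    interactor.Start()
```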

Fig. 4 Rendering an abdominal CT volume in NUI-VR\(^2\) with a predefined transfer function. a Rendered from outside the volume; b rendered from inside the volume

3.5 System evaluation

As far as we know, there is no image segmentation and volume rendering system directly comparable to NUI-VR\(^2\), so it is hard to perform direct system-to-system comparisons. In Sect. 3.4, we qualitatively showed the superb rendering quality of NUI-VR\(^2\) in the VR space. In the next section, we comprehensively evaluate NUI-VR\(^2\) from other perspectives:

  • First, we compare the NUI system with a traditional mouse to show that the NUI yields higher image segmentation accuracy than the mouse.

  • Next, we illustrate the effectiveness of image segmentation and feature selection in volume rendering. Users can easily adjust the rendering results by changing their parameters.

  • Finally, we compare NUI-VR\(^2\) with ImageVis3D, the leading software in transfer function design. The results show that NUI-VR\(^2\) highlights targets better and requires less effort from users than ImageVis3D.

4 Results

4.1 Evaluation of NUI

In this section, we qualitatively and quantitatively evaluate the NUI system. In NUI-VR\(^2\), the interactive image segmentation result plays a vital role in the rendering quality. We select Shortcut (Zhu et al. 2014) as the image segmentation algorithm; it requires a bounding surface outside the target as initialization. Within the same amount of time, we use a traditional mouse and our NUI to define the bounding surfaces. Figure 5 shows that we can only define some sparse strips outside the targets with a mouse, and segmentation leakages occur. In comparison, with our NUI we can sweep closed 3D surfaces outside the targets with gestures, and no segmentation leakages occur because the surfaces are well-defined. Table 2 shows the Dice coefficients and the Hausdorff distances of the segmentation results. NUI ensures consistently more accurate results than the mouse. Both the qualitative and quantitative results show the superb efficiency and effectiveness of the NUI system.
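
For reference, the two metrics in Table 2 can be computed from binary segmentation and ground-truth volumes as sketched below. The Hausdorff distance is evaluated here on all foreground voxels for simplicity; restricting it to surface voxels is the cheaper, more common choice.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(seg, gt):
    """Dice overlap between two binary volumes (1 = perfect agreement)."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    inter = np.logical_and(seg, gt).sum()
    return 2.0 * inter / (seg.sum() + gt.sum() + 1e-12)

def hausdorff_distance(seg, gt):
    """Symmetric Hausdorff distance between the foreground voxel sets."""
    a = np.argwhere(seg.astype(bool))
    b = np.argwhere(gt.astype(bool))
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```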

Fig. 5 Image segmentation result comparison between a mouse and our NUI. (top) Vessel segmentation; (bottom) brain tumor segmentation. a Ground truth of targets; b initializations with the mouse; c segmentation results with the mouse; d initializations with our NUI; e segmentation results with our NUI

Table 2 Comparison of Dice coefficients (larger is better) and Hausdorff distances (smaller is better)

4.2 Effectiveness of image segmentation and feature selection

We do not rely on transfer functions to adjust the volume rendering effects. Instead, we modify the parameters for the image segmentation and feature selection algorithms. This section will show the effectiveness of our proposed approach and how to change the rendering results in NUI-VR\(^2\).

We first use the 3D spatial location, the intensity, and eight texture features to highlight the cerebral cortex. The texture features are computed over the entire voxel intensity range of the volume, with one intensity bin per intensity level, and averaged over all 13 directions. The only remaining parameter is the neighborhood radius N. We use KDE with the Gaussian kernel as the image segmentation algorithm, and two hundred seeds are selected within the target. The bandwidth \(\alpha\) of the Gaussian kernel dominates the KDE calculation. Figure 6 shows the rendering results for different N and \(\alpha\) values. The cerebral cortex is well highlighted in every case, and the rendering effects can be adjusted by simply varying those two parameters. In contrast, it would be very hard to highlight the cerebral cortex with a transfer function based on these 12 features, i.e., a 12-dimensional transfer function.
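
Our texture features come from ITK’s co-occurrence and run-length filters (Vimort and McCormick 2017). Purely to illustrate how the neighborhood radius N enters the computation, the sketch below evaluates a few analogous GLCM properties slice-wise with scikit-image; it averages only four in-plane directions rather than the 13 3D directions and is far from optimized.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def slice_texture_features(slice_2d, radius=2, levels=64):
    """Per-pixel GLCM texture features (energy, correlation, contrast) for one slice.

    radius controls the (2*radius + 1)^2 neighborhood, i.e., the role played
    by N in the text; levels is the number of intensity bins.
    """
    # Quantize intensities into `levels` bins.
    img = np.floor((slice_2d - slice_2d.min()) /
                   (np.ptp(slice_2d) + 1e-12) * (levels - 1)).astype(np.uint8)
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]  # in-plane directions only
    h, w = img.shape
    feats = np.zeros((h, w, 3))
    for i in range(radius, h - radius):
        for j in range(radius, w - radius):
            patch = img[i - radius:i + radius + 1, j - radius:j + radius + 1]
            glcm = graycomatrix(patch, distances=[1], angles=angles,
                                levels=levels, symmetric=True, normed=True)
            feats[i, j, 0] = graycoprops(glcm, "energy").mean()
            feats[i, j, 1] = graycoprops(glcm, "correlation").mean()
            feats[i, j, 2] = graycoprops(glcm, "contrast").mean()  # a.k.a. inertia
    return feats
```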

Fig. 6 The cerebral cortex is well highlighted in all cases. The rendering quality can be adjusted by tuning N and \(\alpha\)

Next, we illustrate the effectiveness of feature selection. We use all 18 features in the default feature set and perform feature selection as detailed in Sect. 3.3. Figure 7 shows the rendering results of an \(86 \times 142 \times 240\) CT volume with different numbers of selected features; the time to compute the probability volume on a commodity laptop is labeled. As we increase \(\lambda\) in Eq. 4, we increase the penalty on non-sparse \(\varvec{w}\), so the resulting \(\varvec{w}\) becomes sparser, with more zero elements, and fewer features get selected. Feature selection effectively filters out redundant features and speeds up the volume rendering process. How many features should be selected depends on the feature set and the data; in NUI-VR\(^2\), users can adjust \(\lambda\) to find the sweet spot between rendering quality and speed.

Fig. 7 Rendering of the spine with different numbers of selected features. a With all the 18 features; b with correlation, SRE, LRE, and LGLRE removed; c with correlation, SRE, LRE, LGLRE, energy, entropy, IDM, and inertia removed; d with only intensity, CP, HC, GLNU, RLNU, and HGLRE

4.3 Comparison with transfer function design

In this section, we compare NUI-VR\(^2\) with the traditional transfer function design method. With NUI-VR\(^2\), users only need to specify some seeds/masks to label the target. Figure 8 shows some example renderings with NUI-VR\(^2\). The default feature set and image segmentation engine in NUI-VR\(^2\) highlight the targets very well. For example, when we select seeds only from the left hippocampus, NUI-VR\(^2\) renders only the left hippocampus, even though the right hippocampus shares almost the same texture and surrounding context as the left one. It is very hard to achieve a similar rendering with the traditional transfer function design method.

Among these targets, the brain tumor is the easiest to render with a transfer function, so we used ImageVis3D to highlight it. The 2D transfer function editor in ImageVis3D is a histogram of the intensity (x-axis) and gradient magnitude (y-axis) of the voxels. To highlight a target, we need to place various geometric widgets at different locations in the editor by trial and error. In comparison, users only need to designate a few sample points with the NUI in the VR environment of NUI-VR\(^2\). Figure 9 shows the rendering result with ImageVis3D. Even with a carefully crafted 2D transfer function, we cannot render the brain tumor differently from the other objects. A higher-dimensional transfer function would be more effective in highlighting the buried brain tumor, but it would come with a steeper learning curve and more design effort.

Fig. 8 Target highlighting in NUI-VR\(^2\). a Kidneys; b the left hippocampus; c a brain tumor

Fig. 9 Highlighting the brain tumor with ImageVis3D. a The carefully handcrafted 2D transfer function; b the rendering result, in which the tumor buried inside is barely visible from outside the human brain; c the zoomed-in view around the tumor area

5 Conclusion

In this paper, we detail the design of NUI-VR\(^2\), a NUI-enabled volume rendering system in the VR space. NUI-VR\(^2\) marries image segmentation and volume rendering, and numerous general-purpose image segmentation algorithms can fit into our system. Users only need to define some seeds/masks to label the target instead of designing a complicated transfer function. We also design a Kinect-based NUI built on 3D gestures and voice commands, with which users can explore the volume, select seeds, and generate boundary masks directly in 3D space. All the operations happen in an immersive VR environment. VR naturally fits 3D volume rendering and improves the user experience in perceiving the volume. NUI-VR\(^2\) dramatically simplifies the target-centric volume rendering process and delivers high-quality rendering results.