Introduction

Current clinical practice in temporal bone surgery for cochlear implantation (CI) and vestibular schwannoma removal is still centered on a conventional, open operation setting. One fundamental challenge in moving to less traumatic, minimally invasive procedures is satisfying the required navigation accuracy. To ensure delicate risk structures such as the facial nerve and chorda tympani are not damaged by surgical tools, the clinical navigation accuracy must be better than 0.5 mm [15, 19]. Recent efforts have used force-feedback [28], optical tracking systems (OTSs) [5] and neuro-monitoring for CI [1], but each of these strategies has drawbacks in the minimally invasive setting. For example, OTSs require line-of-sight and a registration between patient and tracking system. Electromagnetic tracking [13], force-feedback and neuro-monitoring, on the other hand, feature limited accuracy. None of these methods can be used to navigate next-generation flexible instruments that follow nonlinear paths [7]. X-ray imaging, in contrast, is precise and not constrained by line-of-sight. However, similar to OTSs, fiducials used for patient registration significantly impact the tracking accuracy of surgical instruments. The small size of fiducials, their low contrast to anatomy alongside high anatomy-to-anatomy contrast, and their rotational symmetry are challenges specific to pose estimation of surgical tools in temporal bone surgery.

Unlike previous methods, deep learning allows instrument pose estimation to reach submillimeter accuracy at acceptable execution times [6, 16]. Previous non-deep-learning pipelines based on 2D/3D registration [12] and template matching [26] achieve submillimeter accuracy for simple geometries. However, such techniques do not scale, do not generalize to more complex instruments and/or require full-head preoperative CT scans. In addition, these solutions are usually customized to a specific application, e.g., screws in pedicle screw placement [6], fiducials [11] or guide wires and instruments [26]. The recent shift to deep neural networks offers better accuracy at near real-time execution speed for implants and instruments [6, 16]. However, no such solution has been proposed for temporal bone surgery.

We propose i3PosNet, a deep learning-powered, iterative, image-guided method that provides high-precision instrument pose estimates. We focus on pose estimation from X-ray images showing a fiducial (screw) placed on the skull close to the temporal bone, because screws are commonly used as fiducials in robotic CI [5]. For optimal performance, i3PosNet implements a modular pipeline consisting of (1) region-of-interest normalization, (2) 2D pose estimation and (3) 3D pose reconstruction. To this end, we extract and normalize a small patch around the expected fiducial position. We design a convolutional neural network (CNN) to predict six pseudo-landmarks on two axes with subpixel accuracy. A geometric model reconstructs the 3D pose from the landmark coordinates. The division of the pipeline into three modular steps reduces complexity, increases performance and significantly boosts the angle estimation performance of i3PosNet (Fig. 1).

Table 1 Dataset summary

As part of this work, we publish three datasets in addition to the source code. Based on these three datasets, we show that i3PosNet (i) generalizes to real X-rays while training only on synthetic images, (ii) generalizes to two surgical instruments in pilot evaluations, and (iii) outperforms state-of-the-art registration methods as well as an end-to-end variation of i3PosNet. As no public datasets with ground-truth poses are available for training and evaluation of novel pose estimation methods, we believe these new datasets will foster further developments in this area. Dataset A consists of synthetic radiographs with a medical screw for training and evaluation. In Dataset B, the screw is replaced by either a drill or a robot. Dataset C features real images of micro-screws placed on a head phantom; images are acquired with a c-arm and manually annotated in 2D.

Fig. 1 Instrument pose estimation from single X-ray: three instruments

Related work

Pose estimation from radiographs of instruments in the temporal bone has received little scientific attention. Most published research uses other tracking paradigms, most notably optical tracking [5]. While some deep learning-based approaches directly [4, 16] or indirectly [6] extract instrument poses from X-ray images, none address temporal bone surgery. In this section, we give a brief overview of instrument tracking for temporal bone surgery.

Robotic solutions for minimally invasive cochlear implantation demand accurate tracking. The small size of critical risk structures limits the acceptable navigation accuracy, to which the tracking accuracy is a significant contributor [19]. In response, robotic cochlear implantation solutions rely on high-precision optical [2, 5, 15] and electromagnetic tracking [17]. However, electromagnetic tracking is not suitable to guide a robot because of accuracy and metal-distortion constraints [13]. Optical tracking cannot directly measure the instrument pose due to the occluded line-of-sight; instead, the tool base is tracked, adding the base-to-tip registration as a significant error source [5]. Adding redundant approaches based on pose reconstruction from bone density and drill forces [28] as well as a neuro-monitoring-based fail-safe [1], Rathgeb et al. [18] report a navigation accuracy of \(0.22 \pm 0.1\) mm at the critical facial nerve and \(0.11\pm 0.08\) mm at the target.

Earlier work on X-ray-imaging-based instrument pose estimation centers on traditional registration [8, 10,11,12, 24, 29], segmentation [11] and template matching [26]. These methods are employed for various applications from pedicle screw placement to implant and guide wire localization. Temporal bone surgery has only been addressed by Kügler et al. [12]. Recent work introduces deep learning methods for instrument pose estimation using segmentation as an intermediate representation [6] or directly estimating the 3D pose [4, 16]. While Miao et al. [16] employ 974 specialized neural networks, Bui et al. [4] extend the PoseNet architecture, but do not consider anatomy. Bier et al. [3] proposed an anatomical landmark localization method. However, instrument sizes are significantly smaller in temporal bone surgery, which impacts X-ray attenuation and therefore image contrast. For the pose estimation of surgical instruments in endoscopic images [9, 14], deep learning is the prevalent technique, but subpixel accuracy is not achieved, in part because the manual ground-truth annotation does not allow it.

No deep learning-based pose estimation method addresses temporal bone surgery or its challenges such as very small instruments and low contrast.

Datasets

This paper introduces three new datasets: two synthetic digitally rendered radiograph (DRR) datasets (Dataset A with a screw and Dataset B with two surgical instruments), and a real X-ray dataset (Dataset C with manually labeled screws). All datasets include pose annotations in a unified file format (Table 1).

Dataset A: synthetic

This dataset consists of X-ray projections (DRRs) rendered from a head CT scan with a virtually placed screw. We balance it w.r.t. anatomical variation and projection geometry by a statistical sampling method, which we describe below. In consequence, the dataset is ideal for method development and the exploration of design choices (Fig. 2a).

Anatomy and fiducial

To account for the variation of patient-specific anatomy, we consider three different conserved human cadaver heads captured by a SIEMENS SOMATOM Definition AS+. The slices of the transverse plane are centered around the otobasis and include the full cross section of the skull. A small medical screw is virtually implanted near the temporal bone, similar to its use for tool registration of the preoperative CT in robotic cochlear surgery [5]. The screw geometry is defined by a CAD mesh provided by the screw manufacturer (c.f. Dataset C); its bounding box diagonal is 6.5 mm.

Method for radiograph generation

Our DRR generation pipeline is fully parameterizable and tailored to the surgical pose estimation use case. We use the Insight Segmentation and Registration Toolkit (ITK) and the Reconstruction Toolkit (RTK) to modify and project the CT anatomy and the fiducial into 2D images. The pipeline generates projections and corresponding ground-truth poses from a CT anatomy, a mesh of the fiducial and a parameter definition, where most parameters can be defined statistically. Since the CT data only covers a limited sagittal height, we require all projection rays that pass through an approx. 5 mm sphere around the fiducial to be uncorrupted by regions of missing CT data; missing data are delineated by polygons. We export the pose \(\varvec{\theta }\) (Eq. 1) of the instrument for later use in training and evaluation.

Fig. 2 Sample images from Dataset A (left), Dataset B (center) and Dataset C (right; the normalized detail illustrates low contrast)

Parameters of generation

The generation process depends on the projection geometry and the relative position and orientation of the fiducial w.r.t. the anatomy. Specifically, we manually choose ten realistic fiducial poses w.r.t. the anatomy per subject and side. To increase variety, we add noise to these poses (position \(\mathbf {x}_\mathrm{instr}\) and orientation \(\mathbf {n}_\mathrm{instr}\)); the noise magnitude for testing is lower than for training to preserve realism and increase generalization. The projection geometry, on the other hand, describes the configuration of the mesh w.r.t. X-ray source and detector. These parameters are encoded in the Source-Object-Distance, the displacement orthogonal to the projection direction and the rotations around the object. We derive them from the specification of the c-arm, which we also use for the acquisition of real X-ray images.
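
For illustration only, one sample of such a statistically defined generation configuration could look like the following Python sketch. All numerical ranges and names are hypothetical and not taken from our parameter files; the actual values follow the c-arm specification.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_generation_parameters(base_pose):
    """Draw one DRR configuration: a noisy fiducial pose (position and
    orientation perturbation) plus a projection geometry.
    All numbers below are illustrative placeholders."""
    position_noise = rng.normal(0.0, 1.0, size=3)             # mm, around the chosen pose
    orientation_noise = rng.normal(0.0, 5.0, size=3)          # degrees, around n_instr
    geometry = {
        "source_object_distance": rng.uniform(550.0, 650.0),  # mm
        "in_plane_shift": rng.uniform(-20.0, 20.0, size=2),   # mm, orthogonal to projection
        "rotation_around_object": rng.uniform(0.0, 360.0),    # degrees
    }
    return base_pose, position_noise, orientation_noise, geometry
```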

In total, the dataset contains 18,000 images across three anatomies. Three thousand of them are generated with the lower-noise setting for testing.

Dataset B: surgical tools

To show generalization to other instruments, we introduce a second synthetic dataset. This dataset differs from the first in two ways: instead of a screw, we use a medical drill or a prototype robot (see Fig. 1), and these are placed at realistic positions inside the temporal bone instead of on the bone surface. This dataset includes the same number of images per instrument as Dataset A (Fig. 2b).

Drill

Although the drill is long (diameter 3 mm), only the tip is considered for the estimation of the tip's pose, in order to limit the influence of drill bending.

Prototype robot

The non-rigid drilling robot consists of a spherical drilling head and two cylinders connected by a flexible joint. By flexing and expanding the joint in coordination with cushions on the cylinders, the drill-head creates nonlinear access paths. With a bounding box diagonal of 13.15 mm up to the joint, the robot's dimensions are in line with typical MIS and temporal bone surgery applications.

Dataset C: real X-rays

For Dataset C, we acquire real X-ray images and annotations. The dataset includes 540 images with diverse acquisition angles, distances and positions (Fig. 2c).

Preparation

The experimental setup is based on a realistic X-ray head phantom featuring a human skull embedded in tissue-equivalent material. To avoid damaging the head phantom when placing the fiducial (screw), we attach the medical titanium micro-screw with modeling clay as the best non-destructive alternative. The setup is then placed on a carbon fiber table and supported by X-ray-translucent foam blocks.

Fig. 3 Definition of pose; length not to scale

Image acquisition

We capture multiple X-ray images with a Ziehm c-arm certified for clinical use before repositioning the screw on the skull. These images are collected from multiple directions per placement.

Manual annotation

In a custom annotation tool, we recreate the projection geometry from the c-arm specifications, replacing X-ray source and detector with a virtual camera and the X-ray image, respectively. The screw is rendered as an outline and interactively translated and rotated to match the X-ray. We remove images where the screw is not visible to the annotator (approx. 1% of images). Finally, the projected screw position is calculated according to the projection geometry.

Methods

Our approach breaks the task of surgical pose estimation down into three steps: reduction of image variety (region-of-interest appearance normalization), a convolutional neural network for information extraction and, finally, pose reconstruction from pseudo-landmarks. In this section, we give a formalized problem definition in addition to these three individual steps.

Problem definition

Surgical pose estimation is the task of extracting the pose from an X-ray image. We define the pose \(\varvec{\theta }\) to be the set of coordinates, forward angle, projection angle and distance to the X-ray source (depth). It is defined w.r.t. the detector and the projection geometry (see Fig. 3).

$$\begin{aligned} \varvec{\theta }= \left( x , y , \alpha , \tau , d\right) ^\mathrm{T} \end{aligned}$$
(1)

The forward angle \(\alpha \) indicates the angle between the instrument's main rotational axis projected onto the image plane and the horizontal axis of the image. The projection angle \(\tau \) quantifies the tilt of the instrument w.r.t. the detector plane. The depth \(d\) represents the distance on the projection normal from the source (focal point) to the instrument (c.f. Source-Object-Distance).
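
To make this convention concrete, the following minimal Python sketch (our own illustration, not part of the published code) represents a pose with the five components of Eq. 1; names and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    """Pose theta = (x, y, alpha, tau, d)^T, defined w.r.t. the detector."""
    x: float      # in-plane position on the detector (pixels or mm)
    y: float      # in-plane position on the detector
    alpha: float  # forward angle: projected main axis vs. horizontal image axis (degrees)
    tau: float    # projection angle: tilt w.r.t. the detector plane (degrees)
    d: float      # depth: distance from X-ray source along the projection normal (mm)

# example: a screw roughly centered in the image, slightly tilted
initial_guess = Pose(x=512.0, y=512.0, alpha=20.0, tau=10.0, d=600.0)
```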

We assume an initial pose with accuracy \(\Delta x_\text {initial} \le 2.5\) mm and \(\Delta \alpha _\text {initial} \le 30^{\circ }\) is available. The initial pose can be manually identified, adopted from an independent low-precision tracking system, previous time points or determined from the image by a localization U-Net [14].

Given a new predicted pose, the initial guess is updated and the steps described in the “Appearance normalization” to “Pose reconstruction” sections are repeated iteratively.
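
The iteration logic can be summarized by the following Python sketch; the three step functions are passed in as callables because their names and signatures are ours, not those of the published implementation.

```python
def iterate_pose_estimate(image, pose_guess, normalize, predict_landmarks,
                          reconstruct, n_iterations=3):
    """Iterative refinement: re-crop around the current estimate, predict
    pseudo-landmarks with the CNN, reconstruct the pose, and repeat."""
    pose = pose_guess
    for _ in range(n_iterations):
        patch, to_image_coords = normalize(image, pose)   # appearance normalization
        landmarks_patch = predict_landmarks(patch)        # CNN regression on the patch
        landmarks = to_image_coords(landmarks_patch)      # back to image coordinates
        pose = reconstruct(landmarks)                     # geometric pose reconstruction
    return pose
```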

Appearance normalization

Since megapixel X-ray images (\(1024\times 1024\) pixels) are dominated by regions irrelevant to the pose estimation, directly extracting subpixel-accurate poses from them oversaturates the deep learning strategy with data. It is therefore crucial to reduce the size of the image and the image variety based on prior knowledge (Fig. 2).

To reduce image variety, the appearance normalization creates an image patch that is rotated, translated and cropped to the estimated object position. Additionally, we normalize the intensity in the patch of \(92\times 48\) pixels. Based on this normalization step and given perfect knowledge of the pose, the object will always appear similar w.r.t. the in-plane pose components (position and forward angle). We define this as the standard pose: the object positioned at a central location and oriented in the direction of the x-axis of the patch.
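
A minimal Python sketch of this cropping and normalization step is given below. It is our own illustration: the exact object placement inside the patch, the interpolation order and the sign conventions of the image coordinate system are assumptions.

```python
import numpy as np
from scipy import ndimage

PATCH_H, PATCH_W = 48, 92  # patch size in pixels (height x width)

def normalize_appearance(image, x, y, alpha_deg):
    """Cut a rotated, translated patch around the estimated position (x, y)
    so that the instrument appears near its standard pose (centered, pointing
    along the patch x-axis), then normalize intensities."""
    a = np.deg2rad(alpha_deg)
    # maps patch (row, col) offsets to image (row, col) offsets
    m = np.array([[np.cos(a),  np.sin(a)],
                  [-np.sin(a), np.cos(a)]])
    patch_center = np.array([PATCH_H / 2.0, PATCH_W / 2.0])
    offset = np.array([y, x]) - m @ patch_center   # image indexed as (row=y, col=x)
    patch = ndimage.affine_transform(image, m, offset=offset,
                                     output_shape=(PATCH_H, PATCH_W), order=1)
    return (patch - patch.mean()) / (patch.std() + 1e-8)
```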

Pseudo-landmarks are generated from the pose annotation. Their geometric placement w.r.t. the fiducial is described by pairs of x- and y-coordinates \((x_i, y_i)\). The independence from the fiducial's appearance motivates the term “pseudo-landmark”. They are placed 15 mm apart in a cross shape centered on the instrument (see Fig. 4). Two normalized support vectors define the legs of the cross: the instrument's rotational axis (x-direction, \(3+1\) landmarks) and its cross-product with the projection direction (y-direction, \(2+1\) landmarks). Equation 2 formalizes the transformation of the pose to point coordinates in the image plane dependent on the projection geometry (\(c_{\text {d2p}} = {\Delta _\text {ds} / d_\text {SDD}}\) with Source-Detector-Distance \(d_\text {SDD}\) and Detector-Pixel-Spacing \(\Delta _\text {ds}\)). \(\left\{ (x^\text {LP}_i , y^\text {LP}_i)\right\} _{i=1}^6\) describe the local placement (LP) of pseudo-landmarks w.r.t. the support vectors. Finally, landmark positions are normalized w.r.t. their maximum values.

$$\begin{aligned} (x_{i},y_{i})^\mathrm{T} = (x,y)^\mathrm{T} + {1 \over c_{\text {d2p}} d} \cdot R(\alpha ) (x^\text {LP}_i \cdot \cos (\tau ), y^\text {LP}_i)^\mathrm{T} \end{aligned}$$
(2)
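
A direct transcription of Eq. 2 into numpy might look as follows. This is a sketch under the paper's notation; the listed local placement coordinates illustrate the \(3+1\)/\(2+1\) cross but are not guaranteed to match the published configuration.

```python
import numpy as np

def pseudo_landmarks(x, y, alpha_deg, tau_deg, d, local_placement,
                     detector_pixel_spacing, source_detector_distance):
    """Project cross-shaped pseudo-landmarks into the image plane (Eq. 2).
    local_placement: list of (x_i^LP, y_i^LP) in mm, placed on the instrument
    axis (y^LP = 0) and on the perpendicular leg (x^LP = 0)."""
    c_d2p = detector_pixel_spacing / source_detector_distance  # c_d2p = Delta_ds / d_SDD
    a = np.deg2rad(alpha_deg)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])                  # in-plane rotation R(alpha)
    cos_tau = np.cos(np.deg2rad(tau_deg))
    points = []
    for x_lp, y_lp in local_placement:
        local = np.array([x_lp * cos_tau, y_lp])               # out-of-plane foreshortening
        points.append(np.array([x, y]) + rot @ local / (c_d2p * d))
    return np.array(points)

# illustrative cross of six landmarks, 15 mm apart along the two legs
cross = [(-15, 0), (0, 0), (15, 0), (30, 0), (0, -15), (0, 15)]
```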

To construct prior knowledge for training, we generate random variations of the in-plane components of the pose, affecting both image and annotation:

$$\begin{aligned} (R, \beta )&\sim \left( \mathcal {U}(0,\Delta x_\text {initial}),\, \mathcal {U}(0{^{\circ }},360{^{\circ }})\right) \\ \Delta \alpha&\sim \mathcal {N}\left( 0,\left( \tfrac{1}{3} \Delta \alpha _\text {initial}\right) ^2\right) \end{aligned}$$
(3)

By drawing the variation of the position \(\mathbf {x}_\mathrm{instr}\) in polar coordinates \((R, \beta )\) (\(\mathcal {U}\): uniform distribution) and the forward angle variation \(\Delta \alpha \) from a normal distribution (\(\mathcal {N}\)), we skew the training samples toward configurations similar to the standard pose. This skew favors accuracy for good estimates over more distant cases, similar to class imbalance. In effect, the appearance normalization increases the effectiveness (performance per network complexity) of the deep neural network by exploiting data similarity.
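
A numpy sketch of this sampling (Eq. 3) is shown below, with \(\Delta x_\text {initial} = 2.5\) mm and \(\Delta \alpha _\text {initial} = 30^{\circ }\) as stated above; the conversion of the polar offset from millimeters to pixels is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_variation(dx_initial=2.5, dalpha_initial=30.0):
    """Draw a random in-plane perturbation of the true pose (Eq. 3):
    a position offset in polar coordinates and a normally distributed
    forward-angle offset (standard deviation = dalpha_initial / 3)."""
    r = rng.uniform(0.0, dx_initial)                 # radial offset in mm
    beta = rng.uniform(0.0, 360.0)                   # direction of the offset in degrees
    dx = r * np.cos(np.deg2rad(beta))
    dy = r * np.sin(np.deg2rad(beta))
    dalpha = rng.normal(0.0, dalpha_initial / 3.0)   # forward-angle offset in degrees
    return dx, dy, dalpha
```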

The patch appearance is dominated by the difference between the actual and the standard pose, i.e. the error of the prior knowledge.

Pseudo-landmark prediction

Based on the normalized \(92\times 48\) greyscale patch, a convolutional neural network (CNN) extracts the pose information. Our analysis shows that pseudo-landmarks outperform direct pose prediction. While we explain the design of the pseudo-landmark estimation here, the direct pose estimation follows a similar strategy.

Fig. 4 Pseudo-landmark placement: initial (blue), estimation (yellow) and central landmark (red), ground truth (green)

The CNN is designed in VGG fashion [20] with 13 weight layers. We benchmark the CNN on multiple design dimensions, including the number of convolutional layers and blocks, the pooling layer type, the number of fully connected layers and the regularization strategy. In this context, a block consists of multiple convolutional layers and ends in a pooling layer shrinking the feature map by a factor of \(2\times 2\).

All layers use ReLU activation. We double the number of channels after every block, starting with 32 for the first block. We use the mean squared error as loss function. For optimization, we evaluated both Stochastic Gradient Descent with Nesterov momentum and Adam, including different parameter combinations.
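
As an illustration of this design space only (not the exact published configuration), a VGG-style regressor with 13 weight layers and channels doubling per block from 32 could be sketched in Keras as follows; the per-block layer counts and dense widths are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_vgg_style_regressor(input_shape=(48, 92, 1), n_outputs=12):
    """VGG-style regressor: convolution blocks ending in 2x2 pooling, channels
    doubling per block (32, 64, 128, 256), followed by fully connected layers
    regressing the 6 pseudo-landmarks (12 normalized coordinates)."""
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for filters, n_convs in ((32, 2), (64, 2), (128, 3), (256, 3)):
        for _ in range(n_convs):
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(1024, activation="relu")(x)
    outputs = layers.Dense(n_outputs)(x)            # linear output for regression
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")     # mean squared error loss
    return model
```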

Pose reconstruction

We reconstruct the pose from the pseudo-landmarks based on their placement in a cross shape (\(x^\text {LP}_i = 0\) or \(y^\text {LP}_i = 0\)). This enables us to invert Eq. 2 geometrically by fitting lines through two subsets of landmarks: the intersection yields the position \(\varvec{x}= (x,y)^\mathrm{T}\) and the slope yields the forward angle \(\alpha \). The depth \(d\) and projection angle \(\tau \) are determined by applying Eqs. 4 and 5 to the same landmark subsets.

$$\begin{aligned} d= {c_{d2p}}^{-1} \cdot {|y^\text {LP}_i - y^\text {LP}_j| \over |(x_i,y_i)^\mathrm{T} - (x_j,y_j)^\mathrm{T}|_2}, i \ne j , x^\text {LP}_{i/j} = 0 \end{aligned}$$
(4)
$$\begin{aligned} \cos (\tau ) = c_{d2p} d\cdot {|(x_i,y_i)^\mathrm{T} - (x_j,y_j)^\mathrm{T}|_2 \over |x^\text {LP}_i - x^\text {LP}_j|}, i \ne j, y^\text {LP}_{i/j} = 0 \end{aligned}$$
(5)
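
The following numpy sketch transcribes this geometric inversion under simplifying assumptions of ours: it expects the landmark ordering of the illustrative cross above, uses the central landmark directly as the position instead of intersecting the two fitted lines, and recovers the forward angle only up to 180°.

```python
import numpy as np

def reconstruct_pose(pts, local_placement, detector_pixel_spacing,
                     source_detector_distance):
    """Invert Eq. 2 geometrically: fit the instrument-axis leg of the landmark
    cross for alpha, take the central landmark for (x, y), and recover depth d
    (Eq. 4) and projection angle tau (Eq. 5) from landmark distances."""
    c_d2p = detector_pixel_spacing / source_detector_distance
    lp = np.asarray(local_placement, dtype=float)
    pts = np.asarray(pts, dtype=float)
    on_axis = lp[:, 1] == 0          # y^LP = 0: landmarks along the instrument axis
    off_axis = lp[:, 0] == 0         # x^LP = 0: landmarks on the perpendicular leg

    # forward angle from a least-squares line fit through the on-axis landmarks
    axis_pts = pts[on_axis]
    slope, _ = np.polyfit(axis_pts[:, 0], axis_pts[:, 1], 1)
    alpha = np.degrees(np.arctan(slope))

    # position: central landmark (x^LP = y^LP = 0); the paper intersects both fitted lines
    center_idx = np.flatnonzero(on_axis & off_axis)[0]
    x, y = pts[center_idx]

    # depth from the perpendicular leg (Eq. 4): unaffected by cos(tau)
    i, j = np.flatnonzero(off_axis)[:2]
    d = (1.0 / c_d2p) * abs(lp[i, 1] - lp[j, 1]) / np.linalg.norm(pts[i] - pts[j])

    # projection angle from the on-axis leg (Eq. 5)
    i, j = np.flatnonzero(on_axis)[:2]
    cos_tau = c_d2p * d * np.linalg.norm(pts[i] - pts[j]) / abs(lp[i, 0] - lp[j, 0])
    tau = np.degrees(np.arccos(np.clip(cos_tau, -1.0, 1.0)))
    return x, y, alpha, tau, d
```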

Experiments and results

We performed a large number of experiments with independent analyses of the position \(\varvec{x}\), forward angle \(\alpha \), projection angle \(\tau \) and depth \(d\). In this section, we present the common experimental setup, the evaluation metrics and multiple evaluations of i3PosNet. We group in-depth evaluations into three blocks: a general “Evaluation” section, and the analyses of the modular design and of limitations in “Evaluation of design decisions” and “Limitations to projection parameters”. To streamline the description of the training and evaluation pipeline, we only report differences to the common experimental setup described first.

Common experimental setup

Generalization to unseen anatomies is a basic requirement for any CAI methodology; therefore, we always evaluate on an unseen anatomy. Following this leave-one-anatomy-out evaluation strategy, individual training runs only include 10k images, since 5k images are available for training per anatomy.

Training

Based on the training dataset (from Dataset A), we create 20 variations of the prior pose knowledge (see “Appearance normalization”). In consequence, training uses 20 different image patches per \(1024\times 1024\) image for all experiments (200k patches for training). We train the convolutional neural network to minimize the mean squared error of the pseudo-landmark regression with the Adam optimizer and standard parameters. Instead of continuous decay, we use a learning rate schedule with fixed learning rates for different training stages: \(5 \times 10^{-3}\) initially, decreasing the exponent by one every 35 epochs. We stop the training after 80 epochs and choose the best-performing checkpoint by monitoring the loss on the validation set.
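
Expressed for illustration as a Keras callback (the framework choice and callback wiring are ours), the step schedule described above could look like this:

```python
from tensorflow import keras

def step_learning_rate(epoch, _lr=None):
    """Fixed learning-rate stages: 5e-3 for the first 35 epochs,
    5e-4 for the next 35, and so on (exponent decreased every 35 epochs)."""
    return 5e-3 * (0.1 ** (epoch // 35))

lr_callback = keras.callbacks.LearningRateScheduler(step_learning_rate)
# model.fit(..., epochs=80, callbacks=[lr_callback], validation_data=...)
```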

Testing

The testing dataset (from Dataset A) features the unseen third anatomy, which we combine with ten randomly drawn initial poses per image. We use the strategy presented in “Appearance normalization” to draw initial in-plane poses; for testing, no out-of-plane components are required a priori. The prediction is iterated three times, with the prior knowledge updated by the pose prediction in each iteration (Fig. 5). Images with projection angles \(|\tau | > 80^{\circ }\) are filtered out because the performance degrades significantly, biasing results; this leaves on average 7864 tests per trained model. The degradation is expected given the ambiguity that arises if the projection direction and the fiducial's main axis (almost) align. We provide an in-depth analysis of this limitation in “Limitations to projection parameters”.

Fig. 5 Iterative refinement scheme (“recon&crop”: reconstruct pose and crop according to estimation)

Metrics

We evaluated the in-plane and out-of-plane components of the predicted pose independently using five error measures. During annotation, we found that out-of-plane components are much harder to recover from single images, so we expect them to be harder to recover for neural networks as well.

In-plane components

The Position Error (also reprojection distance [25]) is the Euclidean distance in a plane through the fiducial and orthogonal to the projection normal. It is measured in pixels (in the image) or millimeters (in the world coordinate system); the relationship between pixel and millimeter position errors is image-dependent because of varying distances between source and object. The Forward Angle Error is the angle in degrees between the estimated and ground-truth orientations projected into the image plane, i.e., the in-plane angle error.

Out-of-plane components

For the Projection Angle Error, we consider the tilt of the fiducial out of the image plane in degrees. Since the sign of the projection angle is not recoverable for small fiducials (\(\cos (\tau ) = \cos (-\tau )\)), we only compare absolute values for this out-of-plane angle error. The rotation angle is not recoverable at all for rotationally symmetric fiducials, especially at this size. Finally, the Depth Error (millimeters) considers the distance between X-ray source and fiducial (also known as the target registration error in the projection direction [25]).
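
For concreteness, these error measures could be computed as in the following sketch (our own illustration; the angle-wrapping convention is an assumption).

```python
import numpy as np

def pose_errors(pred, gt):
    """Error measures for predicted vs. ground-truth poses (x, y, alpha, tau, d).
    Positions in mm in the plane through the fiducial, angles in degrees."""
    position_error = np.hypot(pred[0] - gt[0], pred[1] - gt[1])
    forward_angle_error = abs((pred[2] - gt[2] + 180.0) % 360.0 - 180.0)  # wrap to [-180, 180)
    projection_angle_error = abs(abs(pred[3]) - abs(gt[3]))  # sign of tau is not recoverable
    depth_error = abs(pred[4] - gt[4])
    return position_error, forward_angle_error, projection_angle_error, depth_error
```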

Fig. 6 Quantitative comparison of i3PosNet, i2PosNet (no iterations) and registration with Covariance Matrix Adaptation Evolution Strategy using Gradient Correlation or Mutual Information

Table 2 Results for synthetic (Dataset A) and real (Dataset C) screw experiments and for additional instruments (Dataset B)

Evaluation

Comparison to registration

Due to the long execution times of the registration (>30 min per image), the evaluation was performed on a 25-image subset of one (unseen) anatomy with two independent estimates per image. We limited the number of DRRs generated online to 400; at that point, the registration had always converged. The i3PosNet and i2PosNet metrics represent distributions from four independent trainings to cover statistical training differences. Compared to two previously validated registration methods [12], i3PosNet outperforms these by a factor of 5 (see Fig. 6). The errors for i3PosNet and i2PosNet are below 0.5 pixels (0.1 mm) for all images. At the same time, i3PosNet reduces the single-image prediction time to 57.6 ms on one GTX 1080 at 6% utilization.

Real X-ray image evaluation

Because of the significant computation overhead (projection), we randomly chose 25 images from anatomy 1 in Dataset A and performed two pose estimations from randomly sampled deviations from the initial estimate. Four i3PosNet models were independently trained for 80 epochs and evaluated for three iterations (see Table 2).

Generalization

i3PosNet also generalizes well to other instruments. Training and evaluating i3PosNet with corresponding images from Dataset B (drill and robot) shows consistent results across all metrics.

Evaluation of design decisions

To emphasize our reliance on geometric considerations and our central design decision, we evaluated the direct prediction of forward angles (end-to-end) in comparison with the use of pseudo-landmarks and 2D pose reconstruction (modular). Comparing our modular solution to the end-to-end setup, we found the latter to display significantly larger errors for the forward angle, especially for the relevant cases of bad initialization (see Fig. 7).

Limitations to projection parameters

We evaluate the dependence on the projection angle \(\tau \), which is especially relevant for drill-like instruments (strong rotational symmetry) (see Fig. 8). We observe decreasing quality starting at 60\(^{\circ }\) with instabilities around 90\(^{\circ }\), motivating the exclusion of images with \(|\tau | > 80^{\circ }\) from the general experiments.

Fig. 7 The addition of virtual landmarks (modular, a) improves forward angle errors for inaccurate initial angles in comparison to regressing the angle directly (end-to-end, b)

Fig. 8 Evaluation of the forward angle dependent on the projection angle; examples showing different instruments for different projection angles

Discussion and conclusion

We estimate the pose of three surgical instruments using a deep learning-based approach. By including geometric considerations into our method, we are able to approximate the nonlinear properties of rotation and projection.

The accuracy provided by i3PosNet improves the ability of surgeons to determine the pose of instruments, even when the line of sight is obstructed. However, transferring a model trained solely on synthetic data significantly reduces the accuracy, a problem widely observed in learning for CAI [21, 23, 27, 30]. As a consequence, while promising on synthetic data, i3PosNet narrowly misses the required tracking accuracy for temporal bone surgery on real X-ray data. Solving this issue is a significant CAI challenge and requires large annotated datasets mixed into the training [4] or novel methods for data generation [22].

In the future, we also want to embed i3PosNet in a multi-tool localization scheme, where fiducials, instruments, etc., are localized and their poses estimated without knowledge of the projection matrix. To increase the 3D accuracy, multiple orthogonal X-rays and a proposal scheme for the projection direction may be used. Through this novel navigation method, surgeries previously barred from minimally invasive approaches may become feasible, with the prospect of higher precision and reduced surgical trauma for patients.