Introduction

During endovascular neurosurgery, the operator often performs three-dimensional (3D) rotational angiography with an angiography system at the beginning of the treatment to adjust the fluoroscopy projection to a direction suitable for vessel visualization and to confirm the treatment strategy with the assistants. In addition, a two-dimensional (2D) image, known as a working angle, is usually created, and the procedure is performed while checking this angle. The important vessels and angles may change depending on the treatment situation, and a 3D image navigation system located outside the angiography suite/operating room is used to confirm the obtained vessel images; however, to keep his/her hands sterile, the operator usually gives oral instructions to an assistant who operates the system. What can be conveyed by oral instructions when acquiring 3D images is limited, and if the operator adjusts the working angle manually at the workstation, sterile gloves have to be repeatedly removed and put on to maintain a sterile environment inside the operating room. Consequently, the procedure time may be prolonged, increasing the risk of infection [1]. Therefore, enabling operators to maneuver the image navigation system themselves while maintaining sterile conditions is pivotal for a safe and successful outcome. Contactless control of medical software in the immediate environment of the operating room and radiology suite has already been reviewed [2]. Moreover, contactless gesture navigation systems using Kinect have already been developed and are used in various medical fields. Their clinical application has been reported for abdominal surgery [3,4,5], percutaneous biopsy [6], examination assistance [7, 8], rehabilitation [9, 10], and medical education [11]. In the field of neurosurgery, only one study has applied this kind of navigation system during a craniotomy [12].
Kinect is a low-cost device developed by Microsoft that enables controller-free gaming through gesture recognition, a depth camera, accelerometers, and other features. In this study, the authors aimed to clarify whether this device could be connected to an existing, commonly used workstation and applied to endovascular neurosurgery, instead of creating a new independent system.

Methods

In this study, a contactless image manipulation interface was developed and applied to endovascular neurosurgery. 3D image navigation is imperative for confirming complicated cerebral vessel anatomy and performing treatment safely. For contactless operation, a system was devised that controls the workstation with both hand gestures and voice, using Kinect and voice recognition decoder software. A 3D image reconstructed using a volume rendering technique (VRT) was used. The right hand was assigned to analog (continuous) operations, such as image enlargement and rotation, and the left hand was used to set the operation mode. Voice was assigned to digital (discrete) operations, such as saving and loading the current coordinates and switching the viewpoint direction.

Hardware setup and development of the operation interface

Xbox One KINECT V2 (Microsoft Corporation) can recognize the shape of a person and can also identify detailed movements and hand shapes. Speech recognition was achieved using Julius, an open-source large vocabulary continuous speech recognition (LVCSR) decoder (version 4.1.3; https://github.com/julius-speech/julius). The operator wore a headset microphone, and Kinect was placed above the large surgical display, beside a sub-display, as shown in Fig. 1. The image navigation system is linked to a fluoroscope and a patient image database by default. Therefore, the authors developed an interface that replaces the mouse and keyboard operations of the existing system with gesture and voice operations. Figure 2 shows the network configuration developed in this study. The surgeon’s operation is converted by the software developed on a Windows PC and then sent to the workstation of the navigation system (Syngo X workplace workstation of Artis Q BA Twin, Siemens Health Care, Forchheim, Germany) via Bluetooth. A summary of the system development and implementation environment is given in Table 1. The developed system and the speech recognition software are separate programs running on the same PC; speech recognition results are passed to the developed system through communication within the PC. With the adjustments and improvements described below, gesture and voice command inputs were successfully converted to ordinary keyboard and mouse operations. This means that the system can be applied to various types of workstations. In fact, during its development, the system was confirmed to work with two different types of workstation interfaces of Siemens angiography devices. Prior to actual clinical use, the surgeons were trained once or twice to operate the system.

Fig. 1

A layout with Kinect and sub-display attached to the existing angiography system

Fig. 2

Scheme showing the developed network of operation interfaces in this study. The solid line indicates wired connection, whereas the broken line indicates wireless connection

Table 1 A summary of the development and implementation environment for this system

The gesture design interface

Gesture operation was based on a relative displacement sensing technique so that the surgeon does not have to look away from the surgical display. The shape of the hands used for the operation was set to imitate the “rock–paper–scissors” position, a well-known hand game among Japanese people. Figure 3 shows hand gestures and the corresponding assigned functions.

(1) While changing the shape of the left hand from “rock” to “paper,” like swiping on a smartphone, the right-hand operation mode is set according to the direction of movement: swiping up for pan and zoom, swiping down for adjusting the window level, swiping left for rotation, and swiping right for mouse mode. When the operation mode is switched, a sound effect is played and the corresponding icon is displayed on the sub-display.

(2) When the shape of the left hand is changed to “scissors,” a green frame appears on the sub-display, indicating that voice operation is available.

(3) When the left hand is raised above the shoulder, the system stops updating the position information of the right hand. This function enables smoother, more effective adjustment of the movement and improves accuracy. Specifically, at the end of the swipe motion, the right hand may shift slightly when changing the command shape; this shift cannot be ignored when accurate position specification is required, such as for measuring the size of an aneurysm. The function was added for this reason.

(4) During right-hand operation, the “rock” and “paper” shapes are assigned to dragging an image, as with a mouse. The “scissors” shape of the right hand enables zooming in and out by vertical movement in the pan and zoom mode. In the mouse mode, forward/backward movement is set as a click operation. The click operation is used to specify the measurement points for the blood vessel diameter.

Fig. 3

A table showing the shape/direction/movement of both hands and the corresponding accompanying navigation functions. In the rightmost column, the reply sound and display indicating the confirmed operation are described. N/A: not applicable
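
The gesture logic described in items (1) and (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the class name, the mode identifiers, the default mode, and the choice to re-establish the reference position after the freeze is released are all assumptions made for illustration. Only the swipe-to-mode assignments and the idea of relative displacement with a left-hand "freeze" come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Left-hand swipe direction -> right-hand operation mode, as listed in item (1).
# The mode identifiers themselves are hypothetical names.
SWIPE_TO_MODE = {
    "up": "pan_zoom",
    "down": "window_level",
    "left": "rotation",
    "right": "mouse",
}


@dataclass
class GestureState:
    """Hypothetical state holder for the two-handed gesture interface."""

    mode: str = "mouse"  # assumed default; the paper does not state one
    last_right_pos: Optional[Tuple[float, float]] = None

    def on_left_swipe(self, direction: str) -> str:
        """Select the right-hand operation mode from a left-hand swipe."""
        self.mode = SWIPE_TO_MODE[direction]
        return self.mode

    def on_right_hand(
        self, pos: Tuple[float, float], left_hand_raised: bool
    ) -> Tuple[float, float]:
        """Return the relative displacement (dx, dy) of the right hand.

        While the left hand is raised above the shoulder, position updates
        are suspended. Dropping the reference point here (an assumption)
        means the first frame after release yields no movement, so a hand
        shift made during the freeze is not forwarded to the workstation.
        """
        if left_hand_raised:
            self.last_right_pos = None
            return (0.0, 0.0)
        if self.last_right_pos is None:
            self.last_right_pos = pos
            return (0.0, 0.0)
        dx = pos[0] - self.last_right_pos[0]
        dy = pos[1] - self.last_right_pos[1]
        self.last_right_pos = pos
        return (dx, dy)
```

Because only displacements are forwarded, the absolute position of the hand in front of the sensor does not matter, which is consistent with the stated goal that the surgeon should not have to look away from the surgical display.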

Voice operations and recognition

Gestures were assigned to operations that require hand movements, such as “measure a specific distance” using the mouse, whereas voice commands were assigned to operations that can be completed in a single step without mouse operation, such as “turn the VRT image to the front view.” Specifically, in addition to the “start” command for starting the gesture operation and the “lock” command for stopping it, there are commands for switching and rotating the viewpoint from the reference direction, saving and loading images, and moving to the measurement mode. These operations are frequently used in endovascular neurosurgery. Voice is picked up by the operator’s headset microphone and sent to a nearby Windows PC via Bluetooth, where it is converted into text using the Julius software. Table 2 shows the voice commands and the corresponding functions. Because words starting with an “S” sound were poorly recognized by the software, the words actually adopted as system commands were chosen to give as high a recognition rate as possible. The terms used for voice commands, such as “Hidari” (Japanese for “left”), can be replaced as appropriate. To inform the surgeon whether the voice command was recognized correctly, the corresponding English words were played through the earphones. Although Japanese was used for the voice commands in this study for fluency of pronunciation, other languages can be substituted. However, it is important to avoid confusable pronunciations: after some trial and error, we decided to avoid similar-sounding commands to prevent a reduction in the recognition success rate. Before clinical application, a check test was conducted in a silent environment, and correct speech recognition was observed in > 90% of instances.

Table 2 The correspondence table between voice commands and assigned functions

Prevention of false recognition

In the phase of recognizing the hand shapes, “scissors” is apt to be mistakenly recognized as “rock”; therefore, “scissors” was accepted only after this shape had continued for several frames. Kinect can recognize the gestures of up to six persons but only up to three hand shapes. Therefore, to ensure accurate recognition of the main surgeon’s hands, the recognition space was intentionally narrowed to the area around the main surgeon’s standing position. Accordingly, the main operator was advised to stand within a width of approximately 2 m in front of the Kinect mounted above the surgical display. Voice command execution was available only when the left hand showed the “scissors” shape. Because misrecognition becomes more likely as the number of voice command types increases, the number of commands was kept to the required minimum.
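
The three safeguards above (frame persistence for “scissors,” a narrowed recognition zone, and gating of voice commands by the left-hand shape) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names are hypothetical, the frame count of 5 stands in for the unspecified “several frames,” and the 2 m width is interpreted here as a lateral band of ±1 m around the sensor axis.

```python
from typing import FrozenSet, Optional


class ShapeDebouncer:
    """Accept "scissors" only after it persists for several consecutive frames.

    Other shapes are accepted immediately. While a "scissors" run is still
    too short, the previously accepted shape is kept.
    """

    def __init__(self, hold_frames: int = 5) -> None:
        # hold_frames is a hypothetical value; the paper says only "several".
        self.hold_frames = hold_frames
        self._run = 0
        self.stable = "unknown"

    def update(self, raw_shape: str) -> str:
        """Feed one per-frame raw shape and return the accepted shape."""
        if raw_shape == "scissors":
            self._run += 1
            if self._run >= self.hold_frames:
                self.stable = "scissors"
        else:
            self._run = 0
            self.stable = raw_shape
        return self.stable


def in_operator_zone(lateral_offset_m: float, half_width_m: float = 1.0) -> bool:
    """True if a tracked person stands inside the narrowed recognition zone.

    Assumes the zone is centered on the sensor axis (about 2 m wide in total).
    """
    return abs(lateral_offset_m) <= half_width_m


def accept_voice(
    command: str, left_shape: str, allowed: FrozenSet[str]
) -> Optional[str]:
    """Pass a recognized command through only while the left hand shows
    "scissors" and the command belongs to the small allowed vocabulary."""
    if left_shape == "scissors" and command in allowed:
        return command
    return None
```

Requiring persistence trades a short delay before voice operation becomes available for fewer spurious mode changes, which matches the rationale given in the text.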

Results

This system was applied and evaluated in two clinical cases, for aneurysm coil embolization and for treatment of brain arteriovenous malformations (AVMs). With some exceptions, the main flow of endovascular neurosurgery is as follows: (1) puncture the femoral artery and place a guiding catheter; (2) obtain 2D or 3D images; (3) operate a workstation to create a working angle suitable for embolization and identify the vessel anatomy; (4) obtain a 2D angiography image at the created angle; (5) start the treatment; and (6) if required, reconfirm the VRT image at the workstation or recapture the stereoscopic structure. The present system was used in steps 3 and 6 of this series of procedures. When abnormal vessels are identified, 3D images are obtained in almost all cases. Step 3 is always performed when treatment is being considered and/or performed. The cases that require step 6 are those with somewhat complex vascular architectures. For example, in approximately half of the cases with branches arising from the neck of the aneurysm, we reconfirmed the precise positions of the branches by changing the angle of the image during the surgery in order to preserve their patency. The duration required for each procedure was measured by analyzing the video recording of the actual treatment. The main operator was a single person, and the video analysis was checked by multiple people.

Figure 4a shows the gesture operation for the treatment of brain AVMs. In actual endovascular treatment, no major problems were encountered in the operation of this system. In Fig. 4b, when the viewpoint direction is switched to the left by the voice command “left,” an “L” indicating the left viewpoint is displayed on the lower right part of the screen. A series of operations, such as determining the working angle, measuring the blood vessel and aneurysm diameters, and saving snapshots, could be performed at the table side without any stress.

Fig. 4

An operator performing the gesture and voice operations, and a 3D volume rendering technique (VRT) image moving on the workstation in synchronization with the operator’s actions. a Making the “rock” shape with his right hand allows the operator to rotate the 3D VRT image, just like using a mouse. b When the operator makes the “scissors” shape with his hand and says “Hidari,” the VRT image instantly switches to the view seen from the left side. “Hidari” means left in Japanese

Owing to the addition of a restraint command that acts as a brake to ensure smooth operation, the same operations need not be repeated; therefore, while maintaining a sterile condition, the treatment, including precise operations such as measuring the blood vessel diameter, was successfully completed.

When the obtained image needed to be returned to the reference angle, the voice command was useful because the system responded to it quickly. Table 3 shows the time required to operate the workstation using this system during the actual treatment. Each operation using gestures, such as measurement and image storage, took approximately 10–30 s. For voice recognition, the system took 4 s for a new voice command. For successive voice commands, the mode could be switched in 3 s, and the entire series of commands for one case could be performed in a total of 2–3 min. As for interaction with the conventional system, the clinical radiologist or an assistant had to be asked to perform some minor complementary procedures. Specifically, naming the bookmark and snapshot images required typing on a conventional keyboard. However, the other procedures necessary for treatment were completed using this system alone.

Table 3 Time taken for each operation at the workstation through the contactless navigation system during endovascular surgery

In clinical practice, noises, such as conversations and fan noise, around the monitor were captured. Consequently, the voice recognition rate was slightly lower than that in a silent condition, and therefore, incorrect recognition sometimes occurred. When the volume of the monitor showing vital signs was turned down, the voice recognition rate improved; therefore, incorrect voice recognition was considered to be primarily caused by the surrounding sound.

Discussion

In this study, the authors constructed an operation interface using hand gestures and voice for an existing 3D image navigation system. The operation commands were designed for endovascular neurosurgery. Because operating by gestures takes time and effort, voice commands were used for digital operations. Evaluation in clinical practice confirmed that the 3D image could be manipulated smoothly by gesture operation. In contrast, misrecognition may occur during voice operation because of environmental sounds (noise); however, after countermeasures were implemented, the interface became sufficiently applicable for clinical use. In the field of neurosurgery, only image-assisted surgery using Kinect gesture recognition in craniotomy has been reported [12]. However, to the best of our knowledge, no similar reports have been published for endovascular treatment, including endovascular neurosurgery. One study has reported the application of Kinect for touchless navigation of a radiological DICOM image viewer [13]. The present report is the first to describe the application of such a system in the clinical practice of endovascular intervention.

In a high-volume center with many members in the endovascular neurosurgery team, this system might not be required if the tasks at the workstation and the sterile tasks in the operating room are divided among staff. However, many facilities have only one specialist who plays various roles. This system enables rapid and smooth image-assisted surgery and seems to be particularly effective for endovascular neurosurgery with a small team. There was concern that this system might result in time loss; however, compared with conventional methods, such as manual manipulation at the workstation or using a mouse under the drape, the Kinect-based system took about the same amount of time to perform the same operation. Moreover, the surgeon did not have to go back and forth between the rooms, which reduced time loss. Recently, with the emergence of the coronavirus disease 2019 (COVID-19) pandemic, contactless elevators and various touch panels have been developed and put into public use. There may be situations in which stroke treatment and mechanical thrombectomy must be performed even in patients with emerging infections, including COVID-19. As per the proposed Protected Code Stroke [14, 15], the number of people entering the patient’s environment should be minimized and the door should be closed for treatment. The treatment should be completed with as little contact as possible with the patient and the surrounding environment, and we believe that this system will be extremely useful in such situations.

Regarding hand gestures, favorable operability can be achieved via continuous improvements. Voice recognition is simple, and the system shows a quick response; the recognition rate may be further improved by reducing environmental sounds, avoiding interference from environmental sounds with a directional microphone, and using speech recognition software with word learning. In this study, the newly developed interface was adapted to and incorporated into a workstation attached to our commonly used Siemens angiography system. Figure 5 shows a comparison of configurations between the conventional and newly developed interfaces. The newly developed interface can be adapted to different navigation systems by exchanging the software of the conversion unit. Therefore, it will probably be possible to adapt this system to workstations other than those of Siemens angiography devices in the future.

Fig. 5

Comparison of configurations between the conventional and newly developed interfaces

However, some points remain to be improved in the future. The presence of more assistants during endovascular procedures increased the number of people in the Kinect recognition range, thereby increasing the response time. To address this problem, the number of people who can be identified by Kinect can be increased. Moreover, as the voice recognition rate deteriorates in the presence of environmental noises, such as monitor sound, we are considering using a directional microphone.

This study has some limitations. The development and clinical application of this system were performed in the angiography suite that we usually use, and the operability may be different in an operating room, such as a hybrid operating room. However, the ambient sound and distances in the operating room are similar to those in our angiography suite, so the development environment can be considered comparable. Another limitation is that the Kinect sensor, which was available at the beginning of the development, is no longer commercially available. Hence, the concept of this system must be modified so that it can be adapted to other depth sensors (Leap Controller, VicoVR, Orbbec camera, etc.). In the future, we plan to use the Intel RealSense gesture sensor, which is capable of recognizing five fingers. Because the Kinect can recognize only three types of hand shapes, we chose the hand shapes of “rock–paper–scissors,” which are easy for anyone to understand and reproduce. Additionally, the gestures required large movements. Because RealSense can recognize five fingers, zooming, for example, could be done by spreading the index finger and thumb, as on a smartphone, which is expected to reduce the movement required for gesture operation. Furthermore, with Kinect, the user has to raise the left hand to suspend the command and prevent hand misalignment when changing the shape of the fingers of the right hand; with a sensor that can recognize five fingers, the accuracy of hand-position recognition is likely to improve, and this operation is expected to become unnecessary. The last limitation is that this system does not substitute for all the functions of a workstation. When treating a lesion with a simple vascular architecture, as shown in the results section of this paper, the procedures performed on the workstation are relatively simple. Therefore, we could perform the treatment without any problems.
A complicated shunt disease requires time-consuming interpretation before treatment. Consequently, multiple DSA images must be interpreted and compared, and the use of this system is not suitable for such prolonged interpretations. As mentioned earlier, the images can be saved and loaded with this system, but the process of naming them requires keyboard input by an assistant.

With the development of artificial intelligence technology, the recognition rates of gesture sensors and voice recognition are improving day by day. In the future, we would like to build the next system using a new gesture sensor to replace the Kinect and new speech recognition software, which should further improve both gesture and speech recognition accuracy and further highlight the difference between the existing system and our developed system. In terms of clinical significance, we believe that this will reduce the risk of unnecessary infections, reduce stress on the surgeon, and improve the quality of the surgery.

Conclusions

The authors developed a contactless operating interface that combines the existing workstation system with Kinect and voice recognition software. This interface allows the neuroradiologist to operate the workstation while maintaining sterile conditions, without having to walk back and forth between the operating room and the console room. Because the Kinect is no longer commercially available, it will be necessary to adapt this concept to currently available depth sensors in the future; nevertheless, this system and interface are suitable for use in common clinical endovascular neurosurgery while maintaining excellent precision. Because the interface does not replace the full functionality of the workstation, further clinical experience is warranted.