Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
International Journal of Computer Assisted Radiology and Surgery (IF 2.3), Pub Date: 2021-03-24, DOI: 10.1007/s11548-021-02343-y
Jie Ying Wu, Aniruddha Tamhane, Peter Kazanzides, Mathias Unberath

Purpose

Multi- and cross-modal learning consolidates information from multiple data sources, which may offer a holistic representation of complex scenarios. Cross-modal learning is particularly interesting because synchronized data streams are immediately useful as self-supervisory signals. The prospect of achieving self-supervised continual learning in surgical robotics is exciting, as it may enable lifelong learning that adapts to different surgeons and cases, ultimately leading to a more general machine understanding of surgical processes.

Methods

We present a learning paradigm using synchronous video and kinematics from robot-mediated surgery. Our approach relies on an encoder–decoder network that maps optical flow to the corresponding kinematics sequence. Clustering on the latent representations reveals meaningful groupings for surgeon gesture and skill level. We demonstrate the generalizability of the representations on the JIGSAWS dataset by classifying skill and gestures on tasks not used for training.
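The abstract contains no code, but the described pipeline can be summarized in a short sketch. The following is a minimal, illustrative PyTorch implementation of the idea and is not the authors' code: a recurrent encoder compresses an optical-flow sequence into a latent vector, a decoder regresses the synchronous kinematics sequence from that vector, and the latent vector is what would later be clustered. All names, layer types, and dimensions (FlowToKinematics, latent_dim=128, kin_dim=76, the 64x64 flow resolution) are assumptions made for illustration.

```python
# Minimal sketch (not the paper's implementation) of a cross-modal
# encoder-decoder: optical-flow sequence in, synchronous kinematics out.
# Layer choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class FlowToKinematics(nn.Module):
    def __init__(self, flow_dim=2 * 64 * 64, kin_dim=76, latent_dim=128):
        super().__init__()
        # Encoder: flattened per-frame optical flow -> latent summary.
        self.encoder = nn.GRU(input_size=flow_dim, hidden_size=latent_dim,
                              batch_first=True)
        # Decoder: latent summary -> kinematics at each time step.
        self.decoder = nn.GRU(input_size=latent_dim, hidden_size=latent_dim,
                              batch_first=True)
        self.readout = nn.Linear(latent_dim, kin_dim)

    def forward(self, flow_seq):
        # flow_seq: (batch, time, flow_dim) flattened optical-flow frames.
        _, h = self.encoder(flow_seq)
        latent = h[-1]                                # (batch, latent_dim), later clustered
        # Feed the latent vector to the decoder at every time step.
        dec_in = latent.unsqueeze(1).expand(-1, flow_seq.size(1), -1)
        dec_out, _ = self.decoder(dec_in)
        kin_pred = self.readout(dec_out)              # (batch, time, kin_dim)
        return kin_pred, latent


if __name__ == "__main__":
    model = FlowToKinematics()
    flow = torch.randn(4, 30, 2 * 64 * 64)            # 4 clips, 30 frames each
    target_kin = torch.randn(4, 30, 76)                # synchronous kinematics (stand-in)
    pred, latent = model(flow)
    loss = nn.functional.mse_loss(pred, target_kin)    # self-supervisory signal
    loss.backward()
    print(loss.item(), latent.shape)
```

Training such a model requires only synchronized video and kinematics, so the regression loss on the kinematics serves as the self-supervisory signal described above; no manual gesture or skill labels are needed.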

Results

For tasks seen in training, we report 59 to 70% accuracy in surgical gesture classification. On tasks beyond the training setup, we note 45 to 65% accuracy. Qualitatively, we find that unseen gestures form clusters in the latent space of novice actions, which may enable the automatic identification of novel interactions in a lifelong learning scenario.
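As an illustration of how latent representations like these could be probed, and not the paper's evaluation protocol, one might cluster the embeddings and fit a simple classifier on embeddings from some trials while testing on held-out ones. The random arrays below stand in for real embeddings and gesture labels; the clustering method, classifier, and split are assumptions.

```python
# Illustrative probe of a learned latent space; data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 128))        # stand-in for learned embeddings
gestures = rng.integers(0, 10, size=200)     # stand-in for gesture labels

# Unsupervised view: do the embeddings form gesture-like clusters?
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(latents)

# Supervised probe: train a simple classifier on one split, test on another.
clf = LogisticRegression(max_iter=1000).fit(latents[:150], gestures[:150])
print("probe accuracy:", accuracy_score(gestures[150:], clf.predict(latents[150:])))
```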

Conclusion

By predicting the synchronous kinematics sequence, the network learns optical flow representations of surgical scenes that separate well even for new tasks the model had not seen before. While the representations are immediately useful for a variety of tasks, the self-supervised learning paradigm may also enable research in lifelong and user-specific learning.


