Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
International Journal of Computer Assisted Radiology and Surgery (IF 2.3), Pub Date: 2021-03-24, DOI: 10.1007/s11548-021-02343-y
Jie Ying Wu, Aniruddha Tamhane, Peter Kazanzides, Mathias Unberath

Purpose

Multi- and cross-modal learning consolidates information from multiple data sources, which may offer a holistic representation of complex scenarios. Cross-modal learning is particularly interesting because synchronized data streams are immediately useful as self-supervisory signals. The prospect of achieving self-supervised continual learning in surgical robotics is exciting, as it may enable lifelong learning that adapts to different surgeons and cases, ultimately leading to a more general machine understanding of surgical processes.

Methods

We present a learning paradigm using synchronous video and kinematics from robot-mediated surgery. Our approach relies on an encoder–decoder network that maps optical flow to the corresponding kinematics sequence. Clustering on the latent representations reveals meaningful groupings for surgeon gesture and skill level. We demonstrate the generalizability of the representations on the JIGSAWS dataset by classifying skill and gestures on tasks not used for training.
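The abstract contains no code, but the described pipeline can be summarized in a short sketch. The following is a minimal, illustrative PyTorch implementation of the idea and is not the authors' code: a recurrent encoder compresses an optical-flow sequence into a latent vector, a decoder regresses the synchronous kinematics sequence from that vector, and the latent vector is what would later be clustered. All names, layer types, and dimensions (FlowToKinematics, latent_dim=128, kin_dim=76, the 64x64 flow resolution) are assumptions made for illustration.

```python
# Minimal sketch (not the paper's implementation) of a cross-modal
# encoder-decoder: optical-flow sequence in, synchronous kinematics out.
# Layer choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class FlowToKinematics(nn.Module):
    def __init__(self, flow_dim=2 * 64 * 64, kin_dim=76, latent_dim=128):
        super().__init__()
        # Encoder: flattened per-frame optical flow -> latent summary.
        self.encoder = nn.GRU(input_size=flow_dim, hidden_size=latent_dim,
                              batch_first=True)
        # Decoder: latent summary -> kinematics at each time step.
        self.decoder = nn.GRU(input_size=latent_dim, hidden_size=latent_dim,
                              batch_first=True)
        self.readout = nn.Linear(latent_dim, kin_dim)

    def forward(self, flow_seq):
        # flow_seq: (batch, time, flow_dim) flattened optical-flow frames.
        _, h = self.encoder(flow_seq)
        latent = h[-1]                                # (batch, latent_dim), later clustered
        # Feed the latent vector to the decoder at every time step.
        dec_in = latent.unsqueeze(1).expand(-1, flow_seq.size(1), -1)
        dec_out, _ = self.decoder(dec_in)
        kin_pred = self.readout(dec_out)              # (batch, time, kin_dim)
        return kin_pred, latent


if __name__ == "__main__":
    model = FlowToKinematics()
    flow = torch.randn(4, 30, 2 * 64 * 64)            # 4 clips, 30 frames each
    target_kin = torch.randn(4, 30, 76)                # synchronous kinematics (stand-in)
    pred, latent = model(flow)
    loss = nn.functional.mse_loss(pred, target_kin)    # self-supervisory signal
    loss.backward()
    print(loss.item(), latent.shape)
```

Training such a model requires only synchronized video and kinematics, so the regression loss on the kinematics serves as the self-supervisory signal described above; no manual gesture or skill labels are needed.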

Results

For tasks seen in training, we report 59 to 70% accuracy in surgical gesture classification. On tasks beyond the training setup, we note 45 to 65% accuracy. Qualitatively, we find that unseen gestures form clusters in the latent space of novice actions, which may enable the automatic identification of novel interactions in a lifelong learning scenario.
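As an illustration of how latent representations like these could be probed, and not the paper's evaluation protocol, one might cluster the embeddings and fit a simple classifier on embeddings from some trials while testing on held-out ones. The random arrays below stand in for real embeddings and gesture labels; the clustering method, classifier, and split are assumptions.

```python
# Illustrative probe of a learned latent space; data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 128))        # stand-in for learned embeddings
gestures = rng.integers(0, 10, size=200)     # stand-in for gesture labels

# Unsupervised view: do the embeddings form gesture-like clusters?
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(latents)

# Supervised probe: train a simple classifier on one split, test on another.
clf = LogisticRegression(max_iter=1000).fit(latents[:150], gestures[:150])
print("probe accuracy:", accuracy_score(gestures[150:], clf.predict(latents[150:])))
```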

Conclusion

By predicting the synchronous kinematics sequence, the network learns optical flow representations of surgical scenes that separate well even for new tasks the model had not seen before. While the representations are immediately useful for a variety of tasks, the self-supervised learning paradigm may also enable research in lifelong and user-specific learning.


