Bidirectional long short-term memory for surgical skill classification of temporally segmented tasks
International Journal of Computer Assisted Radiology and Surgery (IF 3), Pub Date: 2020-09-30, DOI: 10.1007/s11548-020-02269-x
Jason D Kelly, Ashley Petersen, Thomas S Lendvay, Timothy M Kowalewski

Purpose

Most prior surgical skill research analyzes holistic, task-level summary metrics to classify the skill shown in a performance. Recent advances in machine learning enable time-series classification at the sub-task level, so that predictions can be made on segments of a task, which could improve task-level technical skill assessment.

Methods

A bidirectional long short-term memory (LSTM) network was used with 8-s windows of multidimensional time-series data from the Basic Laparoscopic Urologic Skills dataset. The network was trained on experts and novices from four common surgical tasks. Stratified cross-validation with regularization was used to avoid overfitting. The misclassified cases were re-submitted for surgical technical skill assessment to crowds using Amazon Mechanical Turk to re-evaluate and to analyze the level of agreement with previous scores.
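The windowing and stratified splitting described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the sampling rate, the non-overlapping stride, and the helper names `sliding_windows` and `stratified_folds` are all assumptions, since the abstract specifies only the 8-s window length and the use of stratified cross-validation.

```python
from collections import defaultdict


def sliding_windows(samples, sample_rate_hz, window_s=8.0, stride_s=8.0):
    """Split one recording (a list of per-timestep feature vectors)
    into fixed-length windows. stride_s == window_s means no overlap
    (an assumption; the abstract does not state the stride)."""
    win = int(round(window_s * sample_rate_hz))
    stride = int(round(stride_s * sample_rate_hz))
    if win <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one sample")
    # Trailing samples that do not fill a whole window are dropped.
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, stride)]


def stratified_folds(labels, k):
    """Assign each performance index to one of k folds so that every
    fold keeps roughly the same expert/novice proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Hypothetical recording: 20 s at an assumed 30 Hz, 2 features per timestep.
recording = [[0.0, 0.0] for _ in range(600)]
windows = sliding_windows(recording, sample_rate_hz=30)

# Hypothetical labels, one per performance (not per window).
labels = ["expert", "novice", "expert", "novice", "novice", "expert"]
folds = stratified_folds(labels, k=2)
```

Splitting by performance before windowing, as in this sketch, keeps windows from the same recording out of both the training and validation folds. Whether the study split its data this way is not stated in the abstract.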

Results

Performance was best for the suturing task, with 96.88% accuracy at predicting whether a performance was by an expert or a novice (1 misclassification) when compared to previously obtained crowd evaluations. When compared with expert surgeon ratings, the LSTM predictions yielded a Spearman coefficient of 0.89 for suturing tasks. When crowds re-evaluated the misclassified performances, they agreed more with our LSTM model than with the previously obtained crowd scores for all 5 misclassified cases from the peg transfer and suturing tasks.
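For readers who want to see how the two reported metrics are computed, the sketch below gives an accuracy calculation and a Spearman rank correlation (Pearson correlation on average ranks). The count of 32 suturing performances is an inference from the reported figures (31 of 32 correct is 96.875%), not a number stated in the abstract, and the score lists are toy values for illustration only, not data from the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def accuracy_percent(n_correct, n_total):
    return 100.0 * n_correct / n_total


# Inferred, not stated: 1 misclassification out of 32 performances.
suturing_accuracy = accuracy_percent(31, 32)

# Toy values only: hypothetical model scores vs. hypothetical expert ratings.
model_scores = [0.1, 0.4, 0.35, 0.8, 0.9]
expert_ratings = [8, 14, 15, 20, 24]
rho = spearman(model_scores, expert_ratings)
```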

Conclusion

The technique presented yields results comparable to the labels that would be obtained by crowd-sourced evaluation of surgical tasks. However, these results raise questions about the reliability of crowd-sourced labels for videos of surgical tasks. We, as a research community, should scrutinize crowd labeling more closely, examine its biases systematically, and quantify label noise.



Updated: 2020-11-18