Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation
IEEE Transactions on Medical Imaging (IF 10.6). Pub Date: 2022-05-23. DOI: 10.1109/tmi.2022.3177077
Yueming Jin, Yang Yu, Cheng Chen, Zixu Zhao, Pheng-Ann Heng, Danail Stoyanov

Automatic surgical scene segmentation is fundamental to enabling cognitive intelligence in the modern operating theatre. Previous works rely on conventional aggregation modules (e.g., dilated convolution, convolutional LSTM), which exploit only local context. In this paper, we propose a novel framework, STswinCL, that explores complementary intra- and inter-video relations to boost segmentation performance by progressively capturing global context. We first develop a hierarchical Transformer to capture intra-video relation, drawing richer spatial and temporal cues from neighboring pixels and previous frames. A joint space-time window shift scheme is proposed to efficiently aggregate these two cues into each pixel embedding. We then explore inter-video relation via pixel-to-pixel contrastive learning, which imposes a well-organized structure on the global embedding space. A multi-source contrast training objective is developed to group pixel embeddings across videos under ground-truth guidance, which is crucial for learning the global properties of the whole dataset. We extensively validate our approach on two public surgical video benchmarks, the EndoVis18 Challenge and the CaDIS dataset. Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches. Code is available at https://github.com/YuemingJin/STswinCL.
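To make the intra-video idea concrete, below is a minimal PyTorch sketch (ours, not the authors' released code; the function name shifted_st_windows and the window/shift sizes are illustrative assumptions) of a joint space-time shifted-window partition: the feature volume is cyclically rolled along time, height, and width before being cut into 3-D windows, so that successive attention blocks connect each pixel embedding to its spatial neighbors and to previous frames.

```python
import torch

def shifted_st_windows(x, window=(2, 7, 7), shift=(1, 3, 3)):
    """x: (B, T, H, W, C) feature volume -> (num_windows*B, Wt*Wh*Ww, C)."""
    wt, wh, ww = window
    # Cyclic shift along time, height and width; a zero shift recovers
    # the plain (non-shifted) window partition of the preceding block.
    x = torch.roll(x, shifts=(-shift[0], -shift[1], -shift[2]), dims=(1, 2, 3))
    B, T, H, W, C = x.shape
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    windows = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
    return windows  # self-attention is then computed inside each 3-D window

# Toy usage: 2 frames of a 14x14 feature map with 96 channels.
feat = torch.randn(1, 2, 14, 14, 96)
print(shifted_st_windows(feat).shape)  # torch.Size([4, 98, 96])
```

Alternating shifted and non-shifted partitions, as in Swin, lets information propagate across window boundaries while keeping attention cost linear in the number of windows.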

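The inter-video part can likewise be sketched as a supervised pixel-to-pixel contrastive objective (a generic form we assume for illustration; the paper's multi-source objective is richer): pixel embeddings sampled across videos in a batch are pulled together when they share a ground-truth class and pushed apart otherwise, which is what structures the global embedding space.

```python
import torch
import torch.nn.functional as F

def pixel_contrast_loss(emb, labels, tau=0.07):
    """emb: (N, D) pixel embeddings sampled across a multi-video batch;
    labels: (N,) ground-truth class of each pixel."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / tau                                  # (N, N) similarities
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()   # numerical stability
    self_mask = torch.eye(emb.size(0), dtype=torch.bool, device=emb.device)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask    # same-class pairs
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)       # drop self term
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    # Average log-likelihood over each anchor's positives.
    denom = pos.sum(dim=1).clamp(min=1)
    return -((pos * log_prob).sum(dim=1) / denom).mean()

# Toy usage: 256 sampled pixel embeddings from several videos, 7 classes.
emb = torch.randn(256, 128)
labels = torch.randint(0, 7, (256,))
print(pixel_contrast_loss(emb, labels))
```

Because positives may come from different videos, minimizing this loss groups same-class pixels across the whole dataset rather than within a single clip, which is the global property the abstract refers to.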
Updated: 2022-05-23