Unambiguous Text Localization, Retrieval, and Recognition for Cluttered Scenes
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6), Pub Date: 2020-08-21, DOI: 10.1109/tpami.2020.3018491
Xuejian Rong, Chucai Yi, Yingli Tian

Text instances, as a category of self-describing objects, provide valuable information for understanding and describing cluttered scenes. The rich and precise high-level semantics embodied in text can drastically benefit our understanding of the world around us. While most recent visual phrase grounding approaches focus on general objects, this paper explores extracting designated text and predicting unambiguous scene text information, i.e., accurately localizing and recognizing a specific targeted text instance in a cluttered image from a natural language description (referring expression). To address this problem, we first propose a novel recurrent dense text localization network (DTLN) that sequentially decodes the intermediate convolutional representations of a cluttered scene image into a set of distinct text instance detections. By recurrently memorizing previous detections, our approach avoids repeated detections of the same text at multiple scales and effectively handles crowded text instances in close proximity. Second, we propose a context reasoning text retrieval (CRTR) model, which jointly encodes text instances and their context through a recurrent network and ranks the localized text bounding boxes with a context-compatibility scoring function. Third, a recurrent text recognition module extends the applicability of the DTLN and CRTR models via text verification or transcription. Quantitative evaluations on standard scene text extraction benchmarks and a newly collected scene text retrieval dataset demonstrate the effectiveness and advantages of our models on the joint scene text localization, retrieval, and recognition task.
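To make the two core mechanisms concrete, below is a minimal PyTorch sketch (not the authors' released code) of (a) a recurrent decoder that turns convolutional features into a sequence of box-plus-stop predictions, in the spirit of DTLN, and (b) a scorer that ranks candidate boxes by compatibility with a referring expression, in the spirit of CRTR. All module names, dimensions, and the dot-product scoring function are illustrative assumptions, not the published architecture.

    # Illustrative sketch only; module names, sizes, and the dot-product
    # score are assumptions, not the paper's actual architecture.
    import torch
    import torch.nn as nn

    class RecurrentBoxDecoder(nn.Module):
        """Decode pooled conv features into (box, stop) predictions step by step.

        Conditioning each step on the LSTM state lets the model "remember"
        earlier detections, which is how the abstract says repeated detections
        of the same instance are avoided.
        """
        def __init__(self, feat_dim=256, hidden=256, max_steps=10):
            super().__init__()
            self.encoder = nn.Sequential(              # stand-in CNN backbone
                nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.lstm = nn.LSTMCell(feat_dim, hidden)
            self.box_head = nn.Linear(hidden, 4)       # (cx, cy, w, h) in [0, 1]
            self.stop_head = nn.Linear(hidden, 1)      # P(no more text instances)
            self.max_steps = max_steps

        def forward(self, images):
            feat = self.encoder(images).flatten(1)     # (B, feat_dim)
            h = feat.new_zeros(feat.size(0), self.lstm.hidden_size)
            c = torch.zeros_like(h)
            boxes, stops = [], []
            for _ in range(self.max_steps):            # one detection per step
                h, c = self.lstm(feat, (h, c))
                boxes.append(torch.sigmoid(self.box_head(h)))
                stops.append(torch.sigmoid(self.stop_head(h)))
            return torch.stack(boxes, 1), torch.stack(stops, 1)

    class ContextScorer(nn.Module):
        """Rank candidate boxes by compatibility with a referring expression."""
        def __init__(self, vocab=1000, embed=128, hidden=256, box_feat=256):
            super().__init__()
            self.embed = nn.Embedding(vocab, embed)
            self.query_rnn = nn.GRU(embed, hidden, batch_first=True)
            self.box_proj = nn.Linear(box_feat + 4, hidden)  # appearance + geometry

        def forward(self, tokens, box_feats, boxes):
            _, q = self.query_rnn(self.embed(tokens))        # (1, B, hidden)
            q = q.squeeze(0)                                 # (B, hidden)
            ctx = self.box_proj(torch.cat([box_feats, boxes], -1))  # (B, N, hidden)
            return (ctx * q.unsqueeze(1)).sum(-1)            # dot-product scores

    if __name__ == "__main__":
        imgs = torch.randn(2, 3, 64, 64)
        boxes, stops = RecurrentBoxDecoder()(imgs)
        scores = ContextScorer()(torch.randint(0, 1000, (2, 6)),
                                 torch.randn(2, boxes.size(1), 256), boxes)
        print(boxes.shape, stops.shape, scores.shape)  # (2,10,4) (2,10,1) (2,10)

At retrieval time, boxes would be kept while the stop probability stays low and then ranked by the scorer, mirroring the localize-then-rank pipeline the abstract describes; the paper's actual scoring function reasons over richer context than this dot product.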
