Applying recent advances in Visual Question Answering to Record Linkage
arXiv - CS - Artificial Intelligence. Pub Date: 2020-07-12, DOI: arxiv-2007.05881
Marko Smilevski

Multi-modal Record Linkage is the process of matching multi-modal records from multiple sources that represent the same entity. This field has received little attention in research, and we propose two solutions based on deep learning architectures inspired by recent work in Visual Question Answering. The neural networks we propose use two different fusion modules, a Recurrent Neural Network + Convolutional Neural Network fusion module and a Stacked Attention Network fusion module, to jointly combine the visual and the textual data of the records. The output of each fusion module is the input to a Siamese Neural Network that computes the similarity of the records. Using data from the Avito Duplicate Advertisements Detection dataset, we train these solutions and conclude from the experiments that the Recurrent Neural Network + Convolutional Neural Network fusion module outperforms a simple model that uses hand-crafted features. We also find that the Recurrent Neural Network + Convolutional Neural Network fusion module misclassifies dissimilar advertisements as similar more frequently when their descriptions average more than 40 words. We attribute this to the longer advertisements having a different distribution than the shorter advertisements, which are more prevalent in the dataset. Finally, we conclude that further research is needed on the Stacked Attention Network to further explore the effect of the visual data on the performance of the fusion modules.
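The abstract outlines a pipeline in which each record's text is encoded by an RNN and its image by a CNN, the two representations are fused, and a shared-weight Siamese network compares the fused representations of a record pair. Below is a minimal PyTorch sketch of that pipeline; the layer sizes, the concatenation-based fusion, and the absolute-difference similarity head are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of an RNN + CNN fusion module feeding a Siamese similarity head.
# All dimensions and the fusion/similarity choices are assumptions for
# illustration; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class RnnCnnFusion(nn.Module):
    """Encodes one record: text with a GRU, image with a small CNN, then fuses."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, fused_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Small CNN image encoder (a pretrained backbone could be used instead).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse the two modalities by concatenation followed by a projection.
        self.fuse = nn.Linear(hidden_dim + 64, fused_dim)

    def forward(self, tokens, image):
        _, h = self.gru(self.embedding(tokens))   # h: (1, batch, hidden_dim)
        text_feat = h.squeeze(0)                  # (batch, hidden_dim)
        img_feat = self.cnn(image)                # (batch, 64)
        return torch.tanh(self.fuse(torch.cat([text_feat, img_feat], dim=1)))

class SiameseMatcher(nn.Module):
    """Applies the shared fusion encoder to both records and scores similarity
    from the element-wise absolute difference of the fused representations."""
    def __init__(self):
        super().__init__()
        self.encoder = RnnCnnFusion()
        self.classifier = nn.Linear(256, 1)

    def forward(self, tokens_a, image_a, tokens_b, image_b):
        za = self.encoder(tokens_a, image_a)
        zb = self.encoder(tokens_b, image_b)
        # Probability that the two records describe the same entity.
        return torch.sigmoid(self.classifier(torch.abs(za - zb))).squeeze(1)

if __name__ == "__main__":
    model = SiameseMatcher()
    tokens = torch.randint(1, 10000, (4, 40))     # batch of 4 records, 40 tokens each
    images = torch.randn(4, 3, 64, 64)
    print(model(tokens, images, tokens, images).shape)  # torch.Size([4])
```

The Stacked Attention Network variant would replace the concatenation step with attention over CNN feature-map regions guided by the text representation; the comparison logic in the Siamese head stays the same.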

Updated: 2020-07-14