Captioning model based on meta-learning using prior-convergence knowledge for explainable images
Personal and Ubiquitous Computing (IF 3.006) | Pub Date: 2021-04-06 | DOI: 10.1007/s00779-021-01558-9
Ji-Won Baek , Kyungyong Chung

Big data encompasses a variety of data types, including images and text. In particular, image-based research on face recognition and object detection has been conducted in diverse areas. Deep learning needs a massive amount of data to learn a model accurately. Because the amount of data collected differs across areas, the data available for deep-learning-based analysis is often insufficient. Accordingly, a method is needed for learning a model effectively and predicting results accurately with a small amount of data. In addition, captions and tags are generated to obtain image information. In tagging, an image is described with individual words, whereas in captioning a sentence is created that connects those words. For this reason, captioning can obtain more detailed image information than tagging. However, when a caption is created from words, end-to-end learning suffers from lower performance if labeled data are insufficient. As a solution to this problem, meta-learning, which can work with a small amount of data, is applied. This study proposes a captioning model based on meta-learning that uses prior-convergence knowledge for explainable images. The proposed method collects multimodal image data for predicting image information. From the collected data, attributes representing object information and context information are used. Then, with a small amount of data, meta-learning is applied in a bottom-up approach to create a sentence for captioning, which solves the problem caused by data shortage. Lastly, for the extraction of image features, an LSTM for the convolution network and captioning is established, and the basis for explanation is generated through a reverse operation. The generated basis is an image object, and an appropriate explanation sentence is displayed for that particular object. Performance is evaluated for accuracy in two ways. First, the BLEU score is evaluated with and without meta-learning. Second, the proposed prior-knowledge-based captioning model, an RNN-based captioning model, and a bidirectional RNN-based captioning model are compared in terms of BLEU score. Through the proposed method, the bottom-up LSTM approach reduces the cost of improving image resolution and solves the data-shortage problem through meta-learning. In addition, it is possible to find the basis of image information using the captions and to describe photo information more accurately based on XAI.
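The abstract does not include code, so the following is only a minimal sketch of the kind of CNN-encoder / LSTM-decoder captioning pipeline and BLEU evaluation it describes, assuming PyTorch, torchvision, and NLTK. The class names, dimensions, and toy caption data are illustrative assumptions; the meta-learning and reverse-operation explanation components of the proposed model are not reproduced here.

```python
# Illustrative sketch only: a CNN feature extractor feeding an LSTM caption
# decoder, plus a BLEU comparison of a generated caption against a reference.
import torch
import torch.nn as nn
import torchvision.models as models
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


class EncoderCNN(nn.Module):
    """Extracts a fixed-length image feature vector with a pretrained CNN."""
    def __init__(self, embed_size: int):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                      # keep the backbone frozen (small-data regime)
            features = self.cnn(images).flatten(1)
        return self.fc(features)


class DecoderLSTM(nn.Module):
    """Generates caption word logits step by step from the image feature."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        embeddings = self.embed(captions)                       # (B, T, E)
        # Prepend the image feature as the first "token" of the sequence.
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                  # (B, T+1, vocab)


# BLEU score of a hypothetical generated caption against one reference caption.
reference = [["a", "dog", "plays", "in", "the", "park"]]
candidate = ["a", "dog", "is", "playing", "in", "the", "park"]
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```

In the paper's setting, such a BLEU comparison would be run with and without meta-learning, and against the RNN-based and bidirectional RNN-based baselines mentioned in the abstract.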




Updated: 2021-04-08