From subtitles to substantial metadata: examining characteristics of named entities and their role in indexing
International Journal on Digital Libraries (IF 1.6), Pub Date: 2018-10-16, DOI: 10.1007/s00799-018-0252-z
Anne-Stine Ruud Husevåg

Abstract: This paper explores the possible role of named entities extracted from subtitle text in the automatic indexing of TV programs. It does so by analyzing entity types, name density and name frequencies in subtitles and metadata records from different genres of TV programs. The name density in metadata records is much higher than in subtitles, and named entities with high frequencies in the subtitles are more likely to be mentioned in the metadata records. Further analysis indicates that the use of a named entity in the metadata increases with the frequency of that entity in the subtitles. The most substantial difference was between frequencies of one and two: named entities occurring twice in the subtitles were twice as likely to be present in the metadata records as those occurring once. Personal names, geographical names and names of organizations were the most prominent entity types in both the news subtitles and the news metadata, while persons, creative works and locations were the most prominent in culture programs. It is not possible to extract all the named entities in the manually created metadata records by applying named entity recognition to the subtitles of the same programs, but it is possible to find a large subset of the named entities for some categories in certain genres. The results reported in this paper show that subtitles are a good source of personal names for all the genres covered in the study, and of creative works in literature programs. In total, 38% of the named entities in the metadata records for news programs, 32% for literature programs and 21% for talk shows were also present in the subtitles of the same programs.
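
The three measures named in the abstract (name density, frequency-dependent presence in metadata, and the share of metadata entities recoverable from subtitles) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy entity lists, and in particular the definition of name density as entity mentions per token, which the abstract does not specify. It is not the paper's implementation, and it takes already-extracted entity lists as input rather than running named entity recognition.

```python
from collections import Counter


def name_density(entities, token_count):
    """Entity mentions per token (an assumed definition of name density)."""
    return len(entities) / token_count if token_count else 0.0


def metadata_coverage(subtitle_entities, metadata_entities):
    """Share of distinct metadata entities that also occur in the subtitles."""
    meta = set(metadata_entities)
    if not meta:
        return 0.0
    return len(meta & set(subtitle_entities)) / len(meta)


def presence_by_frequency(subtitle_entities, metadata_entities):
    """For each subtitle frequency, the share of entities also in the metadata."""
    freq = Counter(subtitle_entities)
    meta = set(metadata_entities)
    totals, hits = Counter(), Counter()
    for entity, n in freq.items():
        totals[n] += 1
        if entity in meta:
            hits[n] += 1
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Hypothetical toy data, not taken from the study.
subtitle_entities = ["Oslo", "Ibsen", "Ibsen", "NRK", "Bergen"]
metadata_entities = ["Ibsen", "NRK", "Hamsun"]

density = name_density(subtitle_entities, token_count=50)
coverage = metadata_coverage(subtitle_entities, metadata_entities)
by_freq = presence_by_frequency(subtitle_entities, metadata_entities)
```

Under these assumptions, `coverage` corresponds to the kind of figure the abstract reports per genre (for example 38% for news), and `by_freq` to the comparison between entities occurring once and twice in the subtitles.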

Updated: 2018-10-16