Comprehensive Information Integration Modeling Framework for Video Titling
arXiv - CS - Multimedia, Pub Date: 2020-06-24, DOI: arxiv-2006.13608
Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, Fei Wu

In e-commerce, consumer-generated videos, which generally convey consumers' individual preferences for different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos are seldom accompanied by appropriate titles. To bridge this gap, we integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework. Although automatic video titling is very useful and in high demand, it has received much less attention than video captioning. The latter focuses on generating sentences that describe videos as a whole, while our task requires product-aware, multi-grained video analysis. To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization. Specifically, granular-level interaction modeling first utilizes temporal-spatial landmark cues, descriptive words, and abstractive attributes to build three individual graphs and recognizes the intra-actions within each graph through Graph Neural Networks (GNNs). A global-local aggregation module is then proposed to model inter-actions across graphs and aggregate the heterogeneous graphs into a holistic graph representation. Abstraction-level story-line summarization further considers both frame-level video features and the holistic graph to exploit the interactions between products and backgrounds and to generate the story-line topic of the video. We accordingly collect a large-scale dataset from real-world data on Taobao, a world-leading e-commerce platform, and will make a desensitized version publicly available to nourish further development of the research community.
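The abstract describes a two-stage architecture: per-graph GNN modeling of intra-actions, global-local aggregation of inter-actions into a holistic representation, and story-line summarization over frame-level features. The PyTorch sketch below illustrates how such a pipeline could be wired together; the module names, dimensions, attention-based aggregation, and GRU summarizer are all illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage framework described in the abstract (assumed design).
import torch
import torch.nn as nn


class GraphLayer(nn.Module):
    """One round of message passing over a dense adjacency matrix (intra-actions)."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (batch, num_nodes, dim), adj: (batch, num_nodes, num_nodes)
        messages = torch.bmm(adj, nodes)           # aggregate neighbor features
        return torch.relu(self.linear(messages))   # transform and activate


class GlobalLocalAggregation(nn.Module):
    """Fuse heterogeneous graphs into one holistic representation (inter-actions)."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.pool = nn.Linear(dim, dim)

    def forward(self, graphs):
        # graphs: list of (batch, num_nodes_i, dim) tensors
        all_nodes = torch.cat(graphs, dim=1)                    # local: pooled node sets
        fused, _ = self.attn(all_nodes, all_nodes, all_nodes)   # cross-graph interactions
        holistic = fused.mean(dim=1)                            # global: graph-level summary
        return torch.relu(self.pool(holistic))


class VideoTitler(nn.Module):
    """Granular-level graph modeling followed by abstraction-level summarization."""

    def __init__(self, dim=256, vocab_size=10000):
        super().__init__()
        self.landmark_gnn = GraphLayer(dim)   # temporal-spatial landmark cues
        self.word_gnn = GraphLayer(dim)       # descriptive words from comments
        self.attr_gnn = GraphLayer(dim)       # abstractive product attributes
        self.aggregate = GlobalLocalAggregation(dim)
        self.summarizer = nn.GRU(dim, dim, batch_first=True)  # story-line summarization
        self.vocab_proj = nn.Linear(dim, vocab_size)

    def forward(self, landmarks, words, attrs, adjs, frame_feats):
        # Intra-actions within each individual graph.
        g_landmark = self.landmark_gnn(landmarks, adjs[0])
        g_word = self.word_gnn(words, adjs[1])
        g_attr = self.attr_gnn(attrs, adjs[2])
        # Inter-actions across graphs -> holistic graph representation.
        holistic = self.aggregate([g_landmark, g_word, g_attr])   # (batch, dim)
        # Condition frame-level features on the holistic graph and summarize.
        conditioned = frame_feats + holistic.unsqueeze(1)          # (batch, num_frames, dim)
        summary, _ = self.summarizer(conditioned)
        return self.vocab_proj(summary)                            # per-step title token logits
```

As a design note, the attention over the concatenated node sets stands in for whatever cross-graph mechanism the paper actually uses; the key point the sketch captures is that intra-graph and inter-graph interactions are modeled separately before the summarization stage consumes the fused representation together with frame-level features.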

Updated: 2020-06-25