A multi-view similarity measure framework for trouble ticket mining
Data & Knowledge Engineering (IF 2.7), Pub Date: 2020-02-12, DOI: 10.1016/j.datak.2020.101800
Jian Xu , Jiapeng Mu , Gaorong Chen

Text similarity measures play an important role in many text mining applications. Although there is an extensive literature on measuring the similarity between long texts, far less work addresses the similarity between short texts, and most of that work adapts long-text similarity methods. Unfortunately, the description of a trouble ticket is exactly this kind of short text. As a result, ticket mining applications such as ticket classification, ticket clustering, and ticket resolution recommendation often perform poorly, because tickets are unstructured, short free text with a large vocabulary, large volume, many words not found in an English dictionary, and so on. The ability to accurately measure the similarity between two tickets is therefore critical to the performance of ticket mining.

To address this performance issue, this paper proposes a multi-view similarity measure framework that easily integrates several kinds of existing similarity measures, including surface-matching-based measures, semantic similarity measures, and syntax-based measures. To make full use of the strengths of the different measures, the framework adopts four different policies to combine them. In particular, we consider a machine-learning-based policy that can integrate various similarity measures in a more general way, which makes the framework flexible and extensible. To demonstrate the effectiveness of the measures generated by our framework, we empirically validate them on a publicly available short-text data set and apply them to a real-world ticket data set from a large enterprise IT infrastructure. The result analysis also yields findings that can help further improve ticket mining performance.
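
The idea of combining several similarity views can be sketched in Python. This is a minimal illustration, not the paper's implementation: the abstract does not name its individual measures or its four policies, so the two surface-level views (`jaccard`, `char_ngram_cosine`) and the three policies (`max`, `mean`, `weighted`) below are assumptions chosen for brevity, and semantic and syntax-based views are omitted.

```python
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercase whitespace tokenization (an illustrative choice)."""
    return text.lower().split()


def jaccard(a, b):
    """Surface-matching view: overlap of the two token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def char_ngram_cosine(a, b, n=3):
    """Second surface view: cosine over character n-gram counts,
    which tolerates misspellings and non-dictionary words."""
    def grams(s):
        s = s.lower()
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    ca, cb = grams(a), grams(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def combine(scores, policy="mean", weights=None):
    """Hypothetical combination policies over per-view scores
    (not the four policies of the paper, which the abstract does not list)."""
    if policy == "max":
        return max(scores)
    if policy == "mean":
        return sum(scores) / len(scores)
    if policy == "weighted":
        total = sum(weights)
        return sum(w * s for w, s in zip(weights, scores)) / total
    raise ValueError(f"unknown policy: {policy}")


def multi_view_similarity(a, b, policy="mean", weights=None):
    """Score two short texts under each view, then combine the scores."""
    views = [jaccard(a, b), char_ngram_cosine(a, b)]
    return combine(views, policy, weights)
```

Under these assumptions, the `weighted` policy is where a learned model could plug in, for example by fitting the weights on labeled ticket pairs; how the paper's machine-learning-based policy actually works is not stated in the abstract.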




Updated: 2020-02-12