Identification of Chinese dark jargons in Telegram underground markets using context-oriented and linguistic features
Information Processing & Management (IF 7.4), Pub Date: 2022-07-29, DOI: 10.1016/j.ipm.2022.103033
Yiwei Hou, Hailin Wang, Haizhou Wang

When cybercriminals communicate with their customers in underground markets, they tend to use secure and customizable instant messaging (IM) software such as Telegram, a popular IM application with over 700 million monthly active users (MAU) as of June 2022. In recent years, dark jargon (innocent-looking replacements for sensitive terms) has appeared on Telegram with increasing frequency. Jargon identification is therefore one of the most significant research directions for tracking online underground markets and cybercrime. This paper proposes a novel Chinese Jargons Identification Framework (CJI-Framework) to identify dark jargon. First, we collect chat history from Telegram groups related to the underground market and construct the corpus TUMCC (Telegram Underground Market Chinese Corpus), the first Chinese corpus in the jargon identification research field. Second, we extract seven brand-new features, grouped into three categories, to distinguish Chinese dark jargon from commonly used words: Vectors-based Features (VF), Lexical analysis-based Features (LF), and Dictionary analysis-based Features (DF). Based on these features, we then run statistical outlier detection to decide whether a word is jargon. Furthermore, we employ a word vector projection method and a transfer learning method to improve the framework's effectiveness. Experimental results show that CJI-Framework achieves an F1-score of 89.66%. After adaptation for English, it also outperforms the state-of-the-art English jargon identification method. Our corpus and code have been publicly released to facilitate the reproduction and extension of our work.
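
The abstract does not say which outlier test the framework uses, so the following is only a minimal sketch of the general idea of flagging words by statistical outlier detection over a feature score. It assumes a single hypothetical score per word (for example, how differently a word is used in the underground corpus than in a general corpus) and uses the modified z-score based on the median absolute deviation, a technique swapped in here for illustration. The function name `mad_outliers`, the threshold of 3.5, and the toy scores are all assumptions, not taken from the paper.

```python
from statistics import median


def mad_outliers(scores, threshold=3.5):
    """Return words whose modified z-score exceeds `threshold`.

    `scores` maps each word to one numeric feature value. The modified
    z-score is 0.6745 * |x - median| / MAD, where MAD is the median
    absolute deviation. This is an illustrative stand-in, not the
    detection rule used by CJI-Framework.
    """
    values = list(scores.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations collapse to zero; nothing can be ranked as an outlier.
        return []
    return [
        word
        for word, value in scores.items()
        if 0.6745 * abs(value - med) / mad > threshold
    ]


# Toy, made-up scores: five ordinary terms and one that stands apart.
toy_scores = {
    "term_a": 0.10,
    "term_b": 0.12,
    "term_c": 0.11,
    "term_d": 0.09,
    "term_e": 0.13,
    "term_f": 0.85,
}
candidates = mad_outliers(toy_scores)
print(candidates)
```

With these toy values only `term_f` is flagged. A median-based test is used in the sketch because a few extreme scores would inflate a mean and standard deviation; combining the seven features described in the abstract into one decision would need details the abstract does not give.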




Updated: 2022-07-29