Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network
International Journal of Machine Learning and Cybernetics (IF 3.1), Pub Date: 2020-05-02, DOI: 10.1007/s13042-020-01122-6
Gyeongmin Kim, Chanhee Lee, Jaechoon Jo, Heuiseok Lim

Companies around the world rely on countless cyber threat intelligence (CTI) reports every day for security purposes. To secure critical cybersecurity information, analysts and individuals must analyze the threat and vulnerability information these reports contain. However, analyzing such an overwhelming volume of reports requires considerable time and effort. In this study, we propose a novel approach that automatically extracts core information from CTI reports using a named entity recognition (NER) system. In constructing the NER system, we defined meaningful keywords in the security domain as entities, including malware, domain/URL, IP address, hash, and Common Vulnerabilities and Exposures (CVE). Furthermore, we linked these keywords with the words extracted from the text of the reports. To achieve higher performance, we used character-level feature vectors as input to a bidirectional long short-term memory network with a conditional random field (Bi-LSTM-CRF). We achieved an average F1-score of 75.05%. We release 498,000 tag datasets created during our research.
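
To make the output format concrete, the sketch below tags tokens for three of the entity types named in the abstract (IP address, hash, CVE). It is a rule-based illustration only, not the authors' Bi-LSTM-CRF model. The label names, the `B-`/`O` tag scheme, the regular expressions, and the function name `tag_tokens` are all assumptions made for this example.

```python
import re

# Hypothetical label names and patterns; the paper's actual tag set may differ.
PATTERNS = [
    ("CVE", re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)),
    # Loose IPv4 shape check; octet ranges (0-255) are not validated.
    ("IP", re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")),
    # Hex strings with MD5, SHA-1, or SHA-256 lengths.
    ("HASH", re.compile(r"[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64}")),
]


def tag_tokens(text):
    """Return (token, tag) pairs using whitespace tokens and B-/O tags."""
    tagged = []
    for token in text.split():
        # Ignore surrounding punctuation when matching, but keep the raw token.
        core = token.strip(".,;:()[]\"'")
        tag = "O"
        for label, pattern in PATTERNS:
            if pattern.fullmatch(core):
                tag = "B-" + label
                break
        tagged.append((token, tag))
    return tagged


if __name__ == "__main__":
    sentence = "The dropper contacts 192.168.10.5 and exploits CVE-2017-0144."
    for token, tag in tag_tokens(sentence):
        print(f"{token}\t{tag}")
```

Entities such as malware names have no fixed surface pattern, so rules like these cannot capture them, which is the kind of case a learned sequence model is meant to handle.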

Updated: 2020-05-02