Zero-shot fine-grained entity typing in information security based on ontology

doi:10.1016/j.knosys.2021.107472

Knowledge-Based Systems

Volume 232, 28 November 2021, 107472

https://doi.org/10.1016/j.knosys.2021.107472

Abstract

The field of information security suffers from a lack of labelled entities. To address this issue, this study proposes a zero-shot hybrid approach to fine-grained entity typing based on the unified cybersecurity ontology (UCO), combining a clustering algorithm with a method for representing category labels. However, certain category labels in UCO do not have distinct domain features, and certain abbreviations cannot be represented directly by Word2vec word embeddings. Thus, we propose a new method, referred to as mixed entities and hierarchy of UCO (MEHC), to represent the category labels. Moreover, to further improve the performance of fine-grained entity typing, we propose the triClustering algorithm, which re-clusters coarse-grained classification results or determines the corresponding types for new entities. The algorithm is based on the triangle inequality (the sum of the lengths of any two sides of a triangle is greater than the length of the third). The experimental results show that the triClustering algorithm effectively shortens the computation time and that the proposed hybrid method outperforms the other baselines for information security applications.

Introduction

Fine-grained entity typing is beneficial to many natural language processing tasks, such as entity linking [1], [2], relation extraction [3], [4], and knowledge base completion [5], [6]. In this study, we aim to utilise the existing unified cybersecurity ontology (UCO) [7] to construct a formal knowledge graph, and thereby fill a gap in the information security domain. UCO was employed because it provides a common understanding of the cybersecurity domain and has been extended with numerous relevant cybersecurity standards, vocabularies, and ontologies. It contains 106 classes and 633 axioms. Moreover, compared with other ontologies in the information security domain, it features a more detailed classification system and axioms [7]. In our previous study [8], we proposed a model that identified named entities from crowdsourced annotations (coarse-grained typing). Herein, to obtain a simple knowledge graph of information security, we populated the ontology with the extracted entities using fine-grained typing.

Generally, research on fine-grained entity typing focuses on the general domain [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], and to the best of our knowledge, no such studies exist in the information security domain. The primary reasons are as follows. (i) The general domain possesses certain mature knowledge bases that can be employed for fine-grained entity typing, such as Wikipedia, WordNet, and DBpedia, but they are not specialised enough for the field of information security. Thus, studies such as [9], [10], [11], [12], [13], [14], [15] are not suitable for application in this domain. (ii) There is an absence of large-scale labelled data in the information security domain for use as a training corpus for classification models, whereas the schemes proposed in [16], [17], [18], [19], [20] require a considerable amount of labelled data as a training corpus. Moreover, although certain studies have been conducted on zero- or few-shot entity classification for general applications [21], [22], [23], the features of the information security domain have been neglected.

For example, in the information security field, the feature vectors of various categories obtained by Word2vec lack the domain features of the field. For instance, the ‘Consequence’ and ‘Means’ classes are common words in the general domain; if Word2vec is used to obtain the representations of these two categories directly, the lack of domain features in the vectors results in bias in subsequent work. Further, certain abbreviations exist in the UCO categories; for example, TTP denotes ‘tactics, techniques, and procedures’ and cannot be mapped to a single word. This implies that a representation of TTP cannot be obtained directly from Word2vec. The scheme in [23] proposed that category labels be denoted via representative entities and category hierarchies. However, when these entities and categories are phrases, simply averaging the vectors of the words constituting a phrase cannot accurately capture the domain features of the phrase. For example, in the general domain, the three words in the entity ‘steal login credentials’ may be assumed to contribute equally to the representation of the phrase; however, in the information security domain, the word ‘login’ carries more domain-specific information than the other two words and should be weighted accordingly. Furthermore, the representations of the components of a category (i.e., its representative entities, its parent, and the category itself) should be weighted according to their importance in the final feature representation of the category. In addition, from the related literature on fine-grained entity typing [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], we determined that few researchers have considered the problem of misclassification in coarse-grained entity typing tasks. However, inaccuracy in the results of coarse-grained typing inevitably degrades performance in fine-grained entity typing tasks.
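The difference between uniform and weighted phrase composition can be illustrated with a small sketch. The two-dimensional vectors and the weights below are made up for illustration; they are not Word2vec output, and the hand-set weights merely stand in for whatever weighting a learned model would produce. The function names are hypothetical.

```python
def mean_pool(vectors):
    """Unweighted average of equal-length word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def weighted_pool(vectors, weights):
    """Weighted average; the weights are normalised by their sum."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Toy 2-d embeddings (assumed values, for illustration only).
toy_vectors = {
    "steal": [1.0, 0.0],
    "login": [0.0, 1.0],
    "credentials": [1.0, 1.0],
}
phrase = ["steal", "login", "credentials"]
vecs = [toy_vectors[w] for w in phrase]

# General-domain assumption: every word contributes equally.
uniform = mean_pool(vecs)

# Hypothetical domain weighting that emphasises 'login'.
domain = weighted_pool(vecs, [1.0, 2.0, 1.0])

print(uniform, domain)
```

With the heavier weight on ‘login’, the phrase vector shifts towards the direction of that word, which is the effect the uniform average cannot express.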

Therefore, we proposed a hybrid method that combines a clustering algorithm with a method for representing category labels. The contributions of this study are as follows.

(a) We proposed a new method for fine-grained entity typing in the information security field based on UCO, eliminating the use of a training corpus.
(b) We adopted a novel method referred to as mixed entity and hierarchy (MEHC) for representing category labels. This method utilises category hierarchy and representative entities to solve the problem of UCO category representation and introduces a feed-forward neural network, as well as a pooling mechanism, to learn the domain features that represent the categories and entities. The feed-forward neural network primarily aids in modelling the relationship between the words in phrases, whereas the pooling mechanism is used to model the contributions of the phrase and compositional word embeddings to final phrase representations. Our experimental results show that the MEHC method is superior to the other baselines considered in this study.
(c) To further improve the performance of fine-grained entity typing, we suggested that the coarse-grained classification results should be clustered again. To reduce the computation time, we proposed the triClustering algorithm to process the results of coarse-grained typing. Through experiments, we demonstrated that this algorithm can effectively shorten the computation time and that re-clustering the results from coarse-grained classification can effectively improve the subsequent fine-grained classification.
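The details of triClustering are not given in this excerpt, so the following is only a sketch of how the triangle inequality can save distance computations in a nearest-centroid assignment step. It uses a well-known pruning rule from accelerated k-means: if d(c_a, c_b) >= 2 d(x, c_a), then d(x, c_b) >= d(x, c_a), so centroid c_b need not be examined. The function name, the toy points, and the centroids are assumptions for illustration and are not the authors' implementation.

```python
import math


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping centroids
    that the triangle inequality proves cannot be closer.

    Returns (labels, number_of_point_to_centroid_distances_computed).
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[math.dist(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    labels = []
    computed = 0
    for x in points:
        best = 0
        best_d = math.dist(x, centroids[0])
        computed += 1
        for j in range(1, k):
            # d(c_best, c_j) >= 2 * d(x, c_best) implies
            # d(x, c_j) >= d(x, c_best), so c_j cannot win.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = math.dist(x, centroids[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


# Toy data (assumed): two tight groups and three centroids.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (9.9, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

labels, computed = assign_with_pruning(points, centroids)
print(labels, computed, "of", len(points) * len(centroids))
```

On this toy input the pruned pass evaluates fewer point-to-centroid distances than the exhaustive 12, while producing the same assignment an exhaustive search would give. The saving depends on how well separated the centroids are.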

The remainder of this paper is organised as follows. Section 2 discusses related works on fine-grained entity typing, and briefly introduces entity typing. Section 3 examines (i) the representation of category labels, (ii) reprocessing of coarse-grained typing, and (iii) fine-grained typing without a training corpus. Section 4 presents our experimental results; furthermore, based on the results of our previous work on coarse-grained typing, we discuss these results and compare the approach proposed in this study with other zero-shot and few-shot fine-grained entity typing methods. Finally, in Section 5, we present our conclusions and directions for future work.

Section snippets

Related works

Entity typing is a task that infers the types of entities mentioned in any given text. Although it is similar to named entity recognition (NER), the two tasks differ in certain respects. NER extracts entities from unstructured text and classifies them, typically using coarse-grained typing, whereas entity typing primarily refers to fine-grained typing. For example, consider the sentence ‘Sleazy Android app developers continue to sneak their fake apps by the Google Play gatekeepers’. In this sentence,

Methods

Based on a previous study, we formulated fine-grained entity typing as a multiclass classification problem. We considered a set of classes $\{c_{1}, c_{2}, \dots, c_{n}\}$, where $c_{i}$ represents a category in the coarse-grained classification and a top-level category in UCO. Each $c_{i}$ is associated with two sets. The first is $E_{i}$, which contains all the entities classified under category $c_{i}$ in the coarse-grained classification. The second is the hierarchy of $c_{i}$ in UCO, which is represented by $\{l_{1} c_{1}, l_{1} c_{2}, \dots, l_{1} c_{n}, l_{2} c_{1}, \dots, l_{2} c_{n}, \dots, l_{m} c_{1}, \dots, l_{m} c_{n}\}$,

Experiments and results

Our experiments are designed to demonstrate the following four points. First, the proposed hybrid method performs better than other baseline models for zero-/few-shot fine-grained entity typing in the information security domain. Second, the MEHC representation method is better than the general category representation method. Third, the triClustering algorithm is more efficient than the other k-means clustering methods. Finally, preprocessing coarse-grained classification results can improve

Conclusions

This study proposed a hybrid computational approach that combined a clustering algorithm with a category representation for zero-shot fine-grained entity typing in information security. Based on the particularities of UCO and the information security field, we proposed a new category representation method called MEHC, which represented the category by selecting the representative entities under it. Moreover, the parent categories were used to represent each subcategory by employing the

CRediT authorship contribution statement

Han Zhang: Conceptualization, Methodology, Writing – original draft. Jiaxian Zhu: Investigation, Software. Jicheng Chen: Data curation, Validation. Junxiu Liu: Writing – review & editing. Lixia Ji: Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the Major Science and Technology Project in Henan Province, China (grant No. 201300210500).

References (45)

Rama-Maneiro Efrén et al., Collective disambiguation in entity linking based on topic coherence in semantic graphs, Knowl.-Based Syst. (2020)
Zhang Jiangying et al., A multi-feature fusion model for Chinese relation extraction with entity sense, Knowl.-Based Syst. (2020)
Danielsson P.E., Euclidean distance mapping, Comput. Graph. Image Process. (1980)
Y. Onoe, G. Durrett, Fine-grained entity typing for domain independent entity linking, in: Proceedings of the AAAI...
T. Nayak, H.T. Ng, Effective modeling of encoder–decoder architecture for joint entity and relation extraction, in:...
C. Malaviya, C. Bhagavatula, A. Bosselut, et al. Commonsense knowledge base completion with structural and semantic...
Zhou J. et al., KACC: A multi-task benchmark for knowledge abstraction, concretization and completion (2020)
Z. Syed, A. Padia, T. Finin, et al. UCO: A unified cybersecurity ontology, in: Workshop at the 30th AAAI Conference on...
Zhang H. et al., Multifeature named entity recognition in information security based on adversarial learning, Secur. Commun. Netw. (2019)
Ali M.A. et al., Fine-grained named entity typing over distantly supervised data via refinement in hyperbolic space (2021)
M.A. Ali, Y. Sun, B. Li, et al. Fine-grained named entity typing over distantly supervised data based on refined...
Nayak N.V. et al., Zero-shot learning with common sense knowledge graphs (2020)
Zhang T. et al., MZET: Memory augmented zero-shot fine-grained named entity typing (2020)
Ren X. et al., ClusType: Effective entity recognition and typing by relation phrase-based clustering
Gangemi A. et al., Automatic typing of DBpedia entities
X. Ling, D.S. Weld, Fine-grained entity recognition, in: 26th AAAI Conference on Artificial Intelligence, 2012,...
Yuan Z. et al., OTyper: A neural architecture for open named entity typing
X. Mengge, B. Yu, Z. Zhang, et al. Coarse-to-fine pre-training for named entity recognition, in: Proceedings of the...
Yaghoobzadeh Y. et al., Corpus-level fine-grained entity typing using contextual information (2016)
Xu P. et al., Neural fine-grained entity type classification with hierarchy-aware loss (2018)
B. Zhou, D. Khashabi, C.T. Tsai, et al. Zero-shot open entity typing as type-compatible grounding, in: Proceedings of...
Huang L. et al., Building a fine-grained entity typing system overnight for a new x (x= language, domain, genre)