Recognizing Named Entities in Specific Domain

Lobachevskii Journal of Mathematics

Abstract

The paper presents the results of applying the BERT representation model to the named entity recognition (NER) task for the cybersecurity domain in Russian. We compare several approaches to domain-specific NER that combine BERT fine-tuning on a domain-specific text collection, general labeled data, domain-specific data augmentation, and a domain-specific annotated dataset. We show that the best named entity recognition results are achieved by a BERT model that is fine-tuned on a domain text collection and pre-trained on the combination of a general dataset and augmented data. We also study the computational performance of the BERT model in the so-called mixed-precision regime.
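
The abstract does not say how the domain-specific data augmentation is performed. One common approach for NER, sketched here purely as an illustration and not necessarily the authors' method, is to take sentences from a general labeled dataset and replace entity mentions of some general type with names drawn from a domain gazetteer, re-tagging the inserted span with a domain label. The function name, the labels `ORG` and `HACKER_GROUP`, the gazetteer entries, and the English example sentence are all assumptions made for this sketch.

```python
import random


def augment_with_domain_entities(tokens, tags, gazetteer,
                                 source_type, target_type, rng):
    """Replace entity spans of `source_type` with domain entities.

    tokens, tags: parallel lists; tags use the BIO scheme ("B-ORG", "I-ORG", "O").
    gazetteer: list of domain entity names, each given as a list of tokens
               (hypothetical data, not taken from the paper).
    source_type / target_type: hypothetical labels chosen for illustration.
    rng: a random.Random instance, passed in so results are reproducible.
    """
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i] == "B-" + source_type:
            # Find the end of the span: the B- tag plus its I- continuations.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + source_type:
                j += 1
            # Insert a randomly chosen domain entity and tag it in BIO format.
            replacement = rng.choice(gazetteer)
            new_tokens.extend(replacement)
            new_tags.append("B-" + target_type)
            new_tags.extend(["I-" + target_type] * (len(replacement) - 1))
            i = j
        else:
            # All other tokens and tags are copied unchanged.
            new_tokens.append(tokens[i])
            new_tags.append(tags[i])
            i += 1
    return new_tokens, new_tags


tokens = ["Ivanov", "works", "at", "Acme", "Corp"]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG"]
gazetteer = [["Lazarus", "Group"], ["Fancy", "Bear"]]

aug_tokens, aug_tags = augment_with_domain_entities(
    tokens, tags, gazetteer, "ORG", "HACKER_GROUP", random.Random(0)
)
print(list(zip(aug_tokens, aug_tags)))
```

Augmented sentences produced this way could then be mixed with the general dataset before training, which is the kind of combination the abstract refers to. A real pipeline for Russian would also need to handle case and number agreement of the inserted names, which this sketch ignores.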

Notes

  1. https://github.com/LAIR-RCC/InfSecurityRussianNLP

  2. https://github.com/NVIDIA/apex

  3. https://github.com/LAIR-RCC/InfSecurityRussianNLP

Funding

The research was carried out using the equipment of the shared research facilities of HPC computing resources at Lomonosov Moscow State University. The participation of M. Tikhomirov in the reported study was funded by RFBR, project no. 19-37-90119.

Author information

Corresponding authors

Correspondence to M. M. Tikhomirov, N. V. Loukachevitch or B. V. Dobrov.

Additional information

(Submitted by E. E. Tyrtyshnikov)

About this article

Cite this article

Tikhomirov, M.M., Loukachevitch, N.V. & Dobrov, B.V. Recognizing Named Entities in Specific Domain. Lobachevskii J Math 41, 1591–1602 (2020). https://doi.org/10.1134/S199508022008020X

  • DOI: https://doi.org/10.1134/S199508022008020X
