Russian-Language Thesauri: Automatic Construction and Application for Natural Language Processing Tasks

Automatic Control and Computer Sciences

Abstract—

This paper surveys existing digital Russian-language thesauri and the methods for constructing them automatically and applying them. The authors analyze the main characteristics of thesauri published in open access for scientific research and evaluate both their development trends and their effectiveness in natural language processing tasks. Statistical and linguistic methods of thesaurus construction that automate development and reduce the labor costs of expert linguists are examined. In particular, algorithms for extracting keywords and semantic thesaurus relations of all types are considered, and the quality of the thesauri generated with these tools is assessed. To illustrate the features of various methods of constructing thesaurus relations, the authors developed a combined method that fully automatically generates a specialized thesaurus from a text corpus of a selected domain and several existing linguistic resources. The proposed method was applied in experiments on two Russian-language text corpora representing different domains: articles on migration and tweets. The resulting thesauri were analyzed using an integrated assessment, developed by the authors in a previous study, that characterizes various aspects of a thesaurus and appraises the quality of the methods used to generate it. The analysis revealed the main advantages and disadvantages of the various approaches to thesaurus construction and to the extraction of semantic relations of different types, and it identified potential focus areas for future research.

Author information

Correspondence to N. S. Lagutina, K. V. Lagutina, A. S. Adrianov or I. V. Paramonov.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by A. Ovchinnikova

About this article

Cite this article

Lagutina, N.S., Lagutina, K.V., Adrianov, A.S. et al. Russian-Language Thesauri: Automatic Construction and Application for Natural Language Processing Tasks. Aut. Control Comp. Sci. 53, 705–718 (2019). https://doi.org/10.3103/S0146411619070149

  • DOI: https://doi.org/10.3103/S0146411619070149
