Analysis of the Mutual Relevance of Topical Corpus Documents in the Problem of Assessing the Proximity of Text to the Semantic Standard

  • PATTERN RECOGNITION AND IMAGE ANALYSIS AUTOMATED SYSTEMS, HARDWARE, AND SOFTWARE
Pattern Recognition and Image Analysis

Abstract

This article addresses the unity and integrity of the image of a semantic standard that is identified phrase by phrase in a topical text. The proximity of the text to the standard is assessed without searching for paraphrases. The assessment is based on dividing the words of each phrase into classes according to their TF-IDF values relative to the texts of a corpus previously compiled by an expert. The analyzed texts are abstracts of scientific articles together with their titles. The core of the problem is that each phrase reaches its maximum proximity to the standard with respect to its own corpus document, so the mutual relevance of these documents across the different phrases of the analyzed text must be assessed. The study solves this problem by introducing distances between the vectors of TF-IDF values that the words of a single phrase take with respect to different corpus documents. The distance between the documents for which the phrases of the analyzed text came closest to the standard should then be minimal. Using the Euclidean and Manhattan distances as examples, the study applies the proposed approach to choosing a higher-level text for a given one in a hierarchy built on the complementarity of semantic standards.
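
As a rough illustration of the distance idea described in the abstract, the Python sketch below builds, for one phrase, a vector of TF-IDF values relative to each document of a toy corpus and compares such vectors with the Euclidean and Manhattan distances. The TF-IDF variant (relative term frequency multiplied by log(N/df)), the toy corpus, and all function names are assumptions made for illustration. The abstract does not give the authors' exact formula, and their division of words into classes is omitted here.

```python
import math
from collections import Counter


def tf_idf(word, doc_tokens, corpus):
    """Textbook TF-IDF of `word` in one document (an assumed variant)."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


def phrase_vector(phrase_words, doc_tokens, corpus):
    """TF-IDF values of the phrase's words relative to one corpus document."""
    return [tf_idf(w, doc_tokens, corpus) for w in phrase_words]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical toy corpus: each document is a list of already normalized tokens.
corpus = [
    ["semantic", "standard", "text", "proximity"],
    ["semantic", "standard", "corpus", "phrase"],
    ["image", "recognition", "pattern", "analysis"],
]
phrase = ["semantic", "standard", "phrase"]

# One vector per corpus document, all describing the same phrase.
vectors = [phrase_vector(phrase, doc, corpus) for doc in corpus]

d_euclid = euclidean(vectors[0], vectors[1])
d_manhattan = manhattan(vectors[0], vectors[1])
```

Under these assumptions, a small distance between the vectors for two documents suggests that the documents treat the words of the phrase similarly, which is the sense in which the abstract speaks of their mutual relevance.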

Notes

  1. Mathematical Methods of Pattern Recognition (Математические методы распознавания образов) (in Russian)

Funding

This work was supported by the Russian Foundation for Basic Research (project no. 19-01-00006-a).

Author information

Corresponding author

Correspondence to D. V. Mikhaylov.

Ethics declarations

COMPLIANCE WITH ETHICAL STANDARDS

This article is a completely original work by the authors. It has not been published before, and it will not be submitted to other publications unless the Editorial Board of the journal Pattern Recognition and Image Analysis declines to publish it.

CONFLICT OF INTEREST

The authors declare that they have no conflict of interest.

Additional information

Dmitry Mikhaylov. Born in 1974. Graduated from Yaroslav-the-Wise Novgorod State University in 1997. In 2003 he defended his Candidate’s dissertation, and in 2013 he defended his Doctoral dissertation in physical and mathematical sciences. From 2000 to 2007 he worked at the Department of Computer Engineering and Automated Systems Software at Yaroslav-the-Wise Novgorod State University. Professor at the Department of Information Technologies and Systems. Member of the Russian public organization “Association for Pattern Recognition and Image Analysis” since 2002. Research interests: computational linguistics and artificial intelligence. Author of more than 46 papers in the field of pattern recognition and image analysis.

Gennady Martinovich Emelyanov. Born in 1943. Graduated from the Leningrad Electrotechnical Institute in 1966. Received his Candidate’s degree in Engineering in 1971 and his Doctoral degree in Engineering in 1990. Dean of the Faculty of Mathematics and Informatics, Yaroslav-the-Wise Novgorod State University, from 1993 to 2003. Professor at the Department of Information Technologies and Systems. Research interests: construction of problem-oriented computing systems for image processing and analysis. Author of more than 101 papers in the field of pattern recognition and image analysis.

About this article

Cite this article

Mikhaylov, D.V., Emelyanov, G.M. Analysis of the Mutual Relevance of Topical Corpus Documents in the Problem of Assessing the Proximity of Text to the Semantic Standard. Pattern Recognit. Image Anal. 31, 588–594 (2021). https://doi.org/10.1134/S1054661821030172

  • DOI: https://doi.org/10.1134/S1054661821030172
