Entropy-Randomized Method for the Reconstruction of Missing Data

  • INTELLECTUAL CONTROL SYSTEMS, DATA ANALYSIS
Automation and Remote Control

Abstract

This article addresses the reconstruction of missing data in data sets used for machine learning. We propose a new randomized method for missing data reconstruction based on entropy-robust estimation and the generation of ensembles of random variables. The method resembles the use of an auxiliary regression to reconstruct missing values. Unlike auxiliary regression, however, entropy estimation imposes no additional constraints on the likelihood function of the sample errors and tolerates small amounts of data. This is especially relevant in problems where training data are limited and the omissions are not systematic. The proposed method is applied to reconstruct missing data on the areas of thermokarst lakes in the Arctic zone of the Russian Federation, as measured from satellite images.
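
For orientation, the auxiliary-regression baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of ordinary least-squares imputation, not the entropy-randomized method proposed in the article; the function name `impute_by_auxiliary_regression` and the toy data are assumptions made for the example.

```python
import numpy as np

def impute_by_auxiliary_regression(x, y):
    """Fill NaN entries of y with predictions from a least-squares line
    fitted on the pairs (x, y) where y is observed.

    Hypothetical helper for illustration only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)

    # Design matrix with an intercept column, built from observed rows only.
    design = np.column_stack([np.ones(observed.sum()), x[observed]])
    coef, *_ = np.linalg.lstsq(design, y[observed], rcond=None)

    # Replace each missing value with the fitted line's prediction.
    filled = y.copy()
    filled[~observed] = coef[0] + coef[1] * x[~observed]
    return filled

# Toy data (assumed): y = 1 + 2x with one missing observation.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, np.nan, 7.0, 9.0])
filled = impute_by_auxiliary_regression(x, y)
print(filled)
```

A point estimate of this kind depends on assumptions about the error distribution and on having enough complete rows to fit the regression, which are the limitations the abstract says the entropy-based approach relaxes.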

Notes

  1. It may be necessary to iterate and test various models.

  2. The boundaries of the intervals indirectly affect the quality of reconstructed data measured by the adopted functional. Numerical optimization of this functional can be used to select the boundaries.

Funding

This work was supported by the Russian Foundation for Basic Research, projects nos. 19-07-00282 and 20-07-00223, within the framework of a state funded program.

Author information

Correspondence to Yu. A. Dubnov, V. Yu. Polishchuk, Yu. S. Popkov, Yu. M. Polishchuk, A. V. Mel’nikov or E. S. Sokol.

Additional information

Translated by V. Potapchouck

Cite this article

Dubnov, Y.A., Polishchuk, V.Y., Popkov, Y.S. et al. Entropy-Randomized Method for the Reconstruction of Missing Data. Autom Remote Control 82, 670–686 (2021). https://doi.org/10.1134/S0005117921040056
