Abstract
This article addresses the reconstruction of missing data in data sets used for machine learning. We propose a new randomized method for missing data reconstruction based on entropy-robust estimation and the generation of ensembles of random variables. The method resembles the use of an auxiliary regression to reconstruct missing values. Unlike auxiliary regression, however, entropy estimation imposes no additional constraints on the likelihood function of the sample errors and remains applicable to small data sets. This is especially relevant in problems where the amount of training data is limited and the omissions are not systematic. The proposed method is applied to reconstruct missing data on the areas of thermokarst lakes in the Arctic zone of the Russian Federation, as measured from satellite images.
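For orientation, the auxiliary-regression baseline that the abstract compares against can be sketched as follows. This is not the entropy-randomized method itself: it is a plain least-squares imputation with resampled residuals, shown only to illustrate what "reconstructing missing values with an ensemble" looks like in the simplest case. The function name `regression_impute_ensemble`, its parameters, and the toy data are all hypothetical and do not come from the article.

```python
import numpy as np


def regression_impute_ensemble(x, y, n_draws=200, seed=0):
    """Fill NaNs in y from x using a least-squares line plus resampled residuals.

    Illustrative baseline only (assumption): ordinary least squares, not the
    entropy-robust estimator described in the article.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = ~missing

    # Fit y ~ a + b * x on the fully observed rows only.
    design = np.column_stack([np.ones(observed.sum()), x[observed]])
    coef, *_ = np.linalg.lstsq(design, y[observed], rcond=None)
    residuals = y[observed] - design @ coef

    # Each ensemble member is the regression prediction plus a residual
    # drawn with replacement from the observed residuals.
    pred = coef[0] + coef[1] * x[missing]
    noise = rng.choice(residuals, size=(n_draws, missing.sum()), replace=True)
    ensemble = pred + noise  # shape: (n_draws, n_missing)

    # Use the ensemble mean as the point reconstruction.
    filled = y.copy()
    filled[missing] = ensemble.mean(axis=0)
    return filled, ensemble


# Synthetic toy data (assumption): an exact line with two values removed.
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[[3, 7]] = np.nan
filled, ensemble = regression_impute_ensemble(x, y)
```

The ensemble rows give a spread of plausible reconstructions for each missing entry, while `filled` holds their mean. A least-squares fit of this kind implicitly assumes a particular error model, which is the kind of constraint the abstract says the entropy-based approach avoids.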
Notes
It may be necessary to iterate and test various models.
The boundaries of the intervals indirectly affect the quality of reconstructed data measured by the adopted functional. Numerical optimization of this functional can be used to select the boundaries.
Funding
This work was supported by the Russian Foundation for Basic Research, projects nos. 19-07-00282 and 20-07-00223, within the framework of a state funded program.
Additional information
Translated by V. Potapchouck
Cite this article
Dubnov, Y.A., Polishchuk, V.Y., Popkov, Y.S. et al. Entropy-Randomized Method for the Reconstruction of Missing Data. Autom Remote Control 82, 670–686 (2021). https://doi.org/10.1134/S0005117921040056
DOI: https://doi.org/10.1134/S0005117921040056