Outlier Detection Algorithms Over Fuzzy Data with Weighted Least Squares

International Journal of Fuzzy Systems

Abstract

In the classical leave-one-out procedure for outlier detection in regression analysis, we exclude an observation and then construct a model on the remaining data. If the difference between the predicted and the observed value is large, we declare the observation an outlier. As a rule, such procedures rely on single comparison testing. The problem becomes much harder when each observation is associated with a degree of membership to an underlying population, so that outlier detection must be generalized to operate over fuzzy data. We present a new approach for outlier detection that operates over fuzzy data using two inter-related algorithms. Because of the way outliers enter the observation sample, they may be of various orders of magnitude. To account for this, we divide the outlier detection procedure into cycles, each consisting of two phases. In Phase 1, we apply a leave-one-out procedure to each non-outlier in the dataset. In Phase 2, all previously declared outliers are subjected to the Benjamini–Hochberg step-up multiple testing procedure, which controls the false discovery rate, and the non-confirmed outliers can return to the dataset. Finally, we construct a regression model over the resulting set of non-outliers. In this way, Phase 1 yields a reliable, high-quality regression model, because the leave-one-out procedure purges dubious observations comparatively easily owing to the single comparison testing. At the same time, confirming outlier status against the newly obtained high-quality regression model is much harder because of the multiple testing procedure applied; hence, only the true outliers remain outside the data sample. The two phases in each cycle offer a good trade-off between the desire to construct a high-quality model (i.e., over informative data points) and the desire to use as many data points as possible (thus leaving as many observations as possible in the data sample). The number of cycles is user-defined, but the procedure can terminate early if a cycle detects no new outliers. We offer one illustrative example and two practical case studies (from real-life thrombosis studies) that demonstrate the application and strengths of our algorithms. In the concluding section, we discuss several limitations of our approach and offer directions for future research.
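
The cycle structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the test statistics, so the sketch assumes a single predictor, uses the degrees of membership directly as weights in a weighted least-squares line, and replaces the actual tests with a two-sided normal approximation on the prediction error. The function names (wls_line, p_value, benjamini_hochberg, fit_subset, detect_outliers), the simplified residual scale, and the choice to keep the inlier set fixed during Phase 1 are all hypothetical. Only the Benjamini–Hochberg step-up rule itself follows the standard definition.

```python
import math

def wls_line(xs, ys, ws):
    """Fit y = a + b*x by weighted least squares; ws are membership degrees."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum(w * (y - a - b * x) ** 2 for w, x, y in zip(ws, xs, ys))
    sigma = math.sqrt(rss / sw)  # simplified weighted residual scale (assumption)
    return a, b, sigma

def p_value(x, y, model):
    """Two-sided normal-approximation p-value of the prediction error (assumption)."""
    a, b, sigma = model
    z = abs(y - (a + b * x)) / max(sigma, 1e-12)  # guard against a zero scale
    return math.erfc(z / math.sqrt(2.0))

def benjamini_hochberg(pvals, q):
    """Step-up rule: reject the k smallest p-values, k = largest rank with p <= rank*q/m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return {order[r] for r in range(k)}  # positions in pvals that are rejected

def fit_subset(xs, ys, ws, idx):
    """Fit the weighted line on the observations whose indices are in idx."""
    return wls_line([xs[i] for i in idx], [ys[i] for i in idx], [ws[i] for i in idx])

def detect_outliers(xs, ys, ws, alpha=0.01, q=0.05, max_cycles=10):
    """Run up to max_cycles two-phase cycles; return (outlier indices, final model)."""
    n = len(xs)
    outliers = set()
    for _ in range(max_cycles):
        # Phase 1: leave-one-out single test for every current non-outlier.
        inliers = [i for i in range(n) if i not in outliers]
        new = set()
        for i in inliers:
            rest = [j for j in inliers if j != i]
            if p_value(xs[i], ys[i], fit_subset(xs, ys, ws, rest)) < alpha:
                new.add(i)
        if not new:
            break  # a cycle without new outliers ends the analysis
        outliers |= new
        # Phase 2: re-test all declared outliers against the purged model,
        # with Benjamini-Hochberg control of the false discovery rate.
        inliers = [i for i in range(n) if i not in outliers]
        model = fit_subset(xs, ys, ws, inliers)
        declared = sorted(outliers)
        pvals = [p_value(xs[i], ys[i], model) for i in declared]
        confirmed = benjamini_hochberg(pvals, q)
        outliers = {declared[k] for k in confirmed}  # the rest return to the data
    final_inliers = [i for i in range(n) if i not in outliers]
    return sorted(outliers), fit_subset(xs, ys, ws, final_inliers)
```

Calling detect_outliers(xs, ys, ws) with lists of predictor values, responses, and membership degrees returns the indices of the confirmed outliers together with the intercept, slope, and residual scale of the final line fitted over the non-outliers. The sketch assumes enough distinct inliers remain for each fit; a production version would need to check that.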

Acknowledgements

This work was supported in part by the Hungarian National Research, Development and Innovation Office (NKFIH) (129528, KK) and the Higher Education Institutional Excellence Programme of the Ministry of Human Capacities in Hungary for the Molecular Biology thematic programme of Semmelweis University (KK). The research is also supported by the University of Tasmania’s internal research and development fund RT.112222.

Author information

Corresponding author

Correspondence to Natalia Nikolova.

About this article

Cite this article

Nikolova, N., Rodríguez, R.M., Symes, M. et al. Outlier Detection Algorithms Over Fuzzy Data with Weighted Least Squares. Int. J. Fuzzy Syst. 23, 1234–1256 (2021). https://doi.org/10.1007/s40815-020-01049-8

  • DOI: https://doi.org/10.1007/s40815-020-01049-8
