
Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Existing distance metric learning methods require optimisation to learn a feature space in which to transform the data, which makes them computationally expensive on large datasets. In classification tasks, they use class information to learn an appropriate feature space. In this paper, we present a simple supervised dissimilarity measure that requires neither learning nor optimisation: it uses class information to measure the dissimilarity of two data instances directly in the input space. It is a supervised version of an existing data-dependent dissimilarity measure called \(m_\mathrm{e}\). Our empirical results on k-NN and LVQ classification tasks show that the proposed measure generally yields predictive accuracy better than, or at least as good as, that of existing state-of-the-art supervised and unsupervised dissimilarity measures.
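To illustrate the usage pattern only (this is not the paper's implementation), the minimal Python sketch below shows how a learning-free pairwise dissimilarity plugs into k-NN classification: no feature space is learned, and the dissimilarity is evaluated directly in the input space. The function `dissim` is a hypothetical stand-in for a measure such as \(m_\mathrm{e}\) or the proposed supervised variant; plain Euclidean distance is used purely as a placeholder.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, dissim, k=5):
    """Classify x_query by majority vote among its k least-dissimilar training points."""
    d = np.array([dissim(x, x_query) for x in X_train])
    nn = np.argsort(d)[:k]                        # indices of the k nearest neighbours
    labels, counts = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(counts)]              # majority class

# Toy usage, with Euclidean distance standing in for the dissimilarity:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
euclid = lambda a, b: float(np.linalg.norm(a - b))
print(knn_predict(X, y, np.array([0.5, -0.2]), euclid))   # majority-vote label
```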


Notes

  1. Used to explain the behaviour of Random Forest, RF similarity aims to track the sign of the margin of x, defined as \(P(+1|x)-P(-1|x)\), where \(+1\) and \(-1\) are the two class labels [4]. In contrast, iForest-based similarity aims to measure the similarity of two points such that two points in a sparse region are more similar than two points of the same inter-point distance in a dense region [22].

  2. Path length was used by iForest [14] as the anomaly score for the purpose of anomaly detection; and path length is a proxy for mass in mass estimation (see Section 4 in Ting et al. [19]). Mass-based dissimilarity [20], mentioned earlier, is an extension of mass estimation which is implemented using completely random trees such as iForest. Though based on RF, the path-length-based similarity of Zhu et al. [28] can be viewed as a variant of mass-based dissimilarity implemented using classification trees rather than completely random trees. A sketch of the mass-based computation follows these notes.

  3. The source code for ClustRF is at http://www.eecs.qmul.ac.uk/~xiatian/project_robust_graphs/index.html.
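To make Note 2 concrete, below is a brief hedged sketch (not the authors' code) of the mass-based computation behind \(m_\mathrm{e}\): following Ting et al. [20, 21], the dissimilarity of x and y is the average, over an ensemble of random trees, of the data mass of the deepest node that contains both points. The dictionary-based node format is an illustrative assumption; it matches the construction sketch given in Appendix A below.

```python
def shared_node_mass(node, x, y, n_total):
    """Descend while x and y take the same branch; return the mass
    (node size / dataset size) of the deepest node containing both."""
    while node["split"] is not None:
        attr, val = node["split"]
        if (x[attr] < val) != (y[attr] < val):    # x and y part ways here
            break
        node = node["left"] if x[attr] < val else node["right"]
    return node["mass"] / n_total

def m_e(trees, x, y, n_total):
    """Average shared-node mass over the ensemble: a larger shared
    region means the pair is judged more dissimilar."""
    return sum(shared_node_mass(t, x, y, n_total) for t in trees) / len(trees)
```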

References

  1. Aryal S (2017) A data-dependent dissimilarity measure: an effective alternative to distance measures. PhD thesis, Monash University, Clayton

  2. Aryal S, Ting KM, Haffari G, Washio T (2014) \(m_p\)-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining, IEEE, pp 707–712

  3. Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506

  4. Breiman L (2000) Some infinity theory for predictor ensembles, Technical Report 577. Statistics Dept, UCB

  5. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27

  6. Davies A, Ghahramani Z (2014) The random forest kernel and creating other kernels for big data from random partitions. arXiv:1402.4293

  7. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  8. Deza MM, Deza E (2009) Encyclopedia of distances. Springer, Berlin

  9. Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  10. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor Newsl 11(1):10–18

  11. Kohonen T (1995) Learning vector quantization. Springer, Berlin, pp 175–189

  12. Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463

  13. Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364

  14. Liu FT, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the eighth IEEE international conference on data mining, pp 413–422

  15. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297

  16. Nebel D, Hammer B, Frohberg K, Villmann T (2015) Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing 169:295–305

  17. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106

  18. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656

  19. Ting KM, Zhou G-T, Liu FT, Tan SC (2013) Mass estimation. Mach Learn 90(1):127–160

  20. Ting KM, Zhu Y, Carman M, Zhu Y, Washio T, Zhou Z-H (2019) Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach Learn 108(2):331–376

  21. Ting KM, Zhu Y, Carman M, Zhu Y, Zhou Z-H (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 1205–1214

  22. Ting KM, Zhu Y, Zhou Z-H (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 2329–2337

  23. Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352

  24. Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Discov 29(2):534–564

  25. Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244

  26. Yang L (2006) Distance metric learning: a comprehensive survey, Technical report, Michigan State University

  27. Zadeh PH, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the 33rd international conference on international conference on machine learning, vol 48, pp 2464–2471

  28. Zhu X, Loy CC, Gong S (2014) Constructing robust affinity graphs for spectral clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1450–1457

Author information

Correspondence to Kai Ming Ting.


Appendix A: Algorithms to construct isolation forest

The algorithms for constructing an isolation forest and for building an isolation tree, as used in \(m_\mathrm{e}\) and \(d_\mathrm{e}\), are given in Algorithms 1 and 2. Note that after the trees are created, the entire dataset D is passed through each isolation tree to record data mass (for \(m_\mathrm{e}\)) and class entropy (for \(d_\mathrm{e}\)) in each node.

(Algorithms 1 and 2 appear as pseudocode figures in the original article.)
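Since those pseudocode figures are not reproduced here, the following is a hedged Python sketch of the procedure described above: Algorithm 2 grows an isolation tree on a subsample by choosing a random attribute and a random split value until points are isolated or a height limit is reached; Algorithm 1 builds t such trees on subsamples of size \(\psi\) and then passes the entire dataset D through each tree to record mass and class labels (from which class entropy is computed) at every node. Parameter names, defaults, and the node format are assumptions, not the authors' code.

```python
import math
import random

def build_itree(X, height, max_height):
    """Algorithm 2 (sketch): isolate points with random axis-parallel splits."""
    if height >= max_height or len(X) <= 1:
        return {"split": None, "mass": 0, "labels": []}
    q = random.randrange(len(X[0]))               # random attribute
    lo = min(x[q] for x in X)
    hi = max(x[q] for x in X)
    if lo == hi:                                  # attribute is constant; stop
        return {"split": None, "mass": 0, "labels": []}
    p = random.uniform(lo, hi)                    # random split value
    return {"split": (q, p), "mass": 0, "labels": [],
            "left":  build_itree([x for x in X if x[q] <  p], height + 1, max_height),
            "right": build_itree([x for x in X if x[q] >= p], height + 1, max_height)}

def record(node, x, label):
    """Post-construction pass: update mass and class labels along x's path."""
    while True:
        node["mass"] += 1
        node["labels"].append(label)
        if node["split"] is None:
            return
        q, p = node["split"]
        node = node["left"] if x[q] < p else node["right"]

def class_entropy(labels):
    """Shannon entropy [18] of the class labels recorded in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((m / n) * math.log2(m / n)
                for m in (labels.count(c) for c in set(labels)))

def build_iforest(D, y, t=100, psi=256):
    """Algorithm 1 (sketch): t isolation trees, each on a subsample of size psi."""
    psi = min(psi, len(D))
    forest = []
    for _ in range(t):
        sample = random.sample(list(D), psi)
        tree = build_itree(sample, 0, max(1, math.ceil(math.log2(psi))))
        for x, label in zip(D, y):                # pass the entire dataset through
            record(tree, x, label)
        forest.append(tree)
    return forest
```

With the recorded statistics, \(m_\mathrm{e}(x, y)\) can be evaluated as in the sketch after the Notes above; under the same assumed node format, a supervised measure in the spirit of \(d_\mathrm{e}\) would read class_entropy(node["labels"]) at the deepest shared node instead of its mass.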


Cite this article

Wells, J.R., Aryal, S. & Ting, K.M. Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning. Knowl Inf Syst 62, 3203–3216 (2020). https://doi.org/10.1007/s10115-020-01454-3

