A Novel Pruning Strategy for Mining Discriminative Patterns

  • Research Paper
Iranian Journal of Science and Technology, Transactions of Electrical Engineering

Abstract

Discriminative patterns are sets of characteristics that distinguish groups from one another, for example, successful from unsuccessful medical treatments. The objective of discriminative pattern mining is to discover significant patterns that occur with disproportionate frequencies in different class-labeled datasets, generally dataset \( D^{+} \) against dataset \( D^{-} \). The task faces two important problems: (1) the large search space problem, where the search space grows exponentially with the number of items, and (2) the redundancy problem, where the discriminative power of many patterns derives mainly from their sub-patterns. The common way to address the large search space problem is to discover frequent patterns in \( D^{+} \) and use them as candidate discriminative patterns. This paper makes three contributions. (1) We introduce a novel pruning strategy to reduce the search space. The strategy generates a new dataset \( D^{new} = D^{+} - D^{-} \) and uses the frequent patterns in it as candidate discriminative patterns. (2) To implement this idea efficiently, we do not explicitly calculate \( D^{+} - D^{-} \). Instead, to mine the frequent patterns in \( D^{+} - D^{-} \) directly, we propose a prefix tree, dubbed DDP-tree. This tree is built directly from \( D^{+} \) and \( D^{-} \) and contains the essential information about frequent patterns in \( D^{+} - D^{-} \). (3) To show the effectiveness of the strategy, we propose an algorithm based on it, dubbed DiffNRDP-Miner (difference-based non-redundant discriminative pattern miner). The advantages of DiffNRDP-Miner are that it removes redundant patterns and requires only one parameter, unlike other algorithms, which require several. Experimental results on benchmark datasets demonstrate that this strategy (1) generates good patterns, most of which are discriminative, (2) significantly reduces the search space, and (3) does not decrease the discriminative information of the patterns.
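
To make the candidate-generation idea concrete, the following is a minimal brute-force sketch in Python. It is not the DDP-tree or DiffNRDP-Miner, which the abstract does not specify in detail. It rests on one loudly stated assumption: that "frequent in \( D^{+} - D^{-} \)" can be read as "the support count in \( D^{+} \) minus the support count in \( D^{-} \) reaches a minimum count". The function names, the `min_count` and `max_len` parameters, and the toy transactions are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def support_count(pattern, dataset):
    """Number of transactions in `dataset` that contain every item of `pattern`."""
    return sum(1 for transaction in dataset if pattern <= transaction)


def enumerate_patterns(items, max_len):
    """Yield every non-empty itemset over `items` with at most `max_len` items."""
    for size in range(1, max_len + 1):
        for combo in combinations(sorted(items), size):
            yield frozenset(combo)


def candidates_frequent_in_pos(d_pos, min_count, max_len=3):
    """Baseline from the abstract: patterns that are frequent in D+."""
    items = set().union(*d_pos)
    return {
        p for p in enumerate_patterns(items, max_len)
        if support_count(p, d_pos) >= min_count
    }


def candidates_by_difference(d_pos, d_neg, min_count, max_len=3):
    """ASSUMPTION (not the paper's definition): a pattern is 'frequent in
    D+ - D-' when count(D+) - count(D-) >= min_count."""
    items = set().union(*d_pos)
    return {
        p for p in enumerate_patterns(items, max_len)
        if support_count(p, d_pos) - support_count(p, d_neg) >= min_count
    }


# Hypothetical toy data: each transaction is a set of single-letter items.
D_POS = [frozenset("abc"), frozenset("ab"), frozenset("abd"), frozenset("cd")]
D_NEG = [frozenset("cd"), frozenset("c"), frozenset("bd"), frozenset("d")]

baseline = candidates_frequent_in_pos(D_POS, min_count=2)
difference = candidates_by_difference(D_POS, D_NEG, min_count=2)

print(sorted("".join(sorted(p)) for p in baseline))
print(sorted("".join(sorted(p)) for p in difference))
```

Under this assumed reading, the difference-based candidates are always a subset of the baseline candidates, because a support count in \( D^{-} \) is never negative. On the toy data, items that are equally or more common in \( D^{-} \) drop out of the candidate set, which is the kind of search-space reduction the abstract describes. The enumeration here is exponential in the number of items and is meant only to illustrate the criterion, not to be efficient.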

Notes

  1. DiffNRDP-Miner: Difference based non-redundant discriminative pattern miner.

  2. NRDP-tree: Non-redundant discriminative pattern tree.

Author information

Corresponding author

Correspondence to Behrouz Minaei-Bidgoli.

About this article

Cite this article

Aryabarzan, N., Minaei-Bidgoli, B. A Novel Pruning Strategy for Mining Discriminative Patterns. Iran J Sci Technol Trans Electr Eng 45, 505–527 (2021). https://doi.org/10.1007/s40998-020-00397-3

  • DOI: https://doi.org/10.1007/s40998-020-00397-3
