A Novel Pruning Strategy for Mining Discriminative Patterns

Aryabarzan, Nader; Minaei-Bidgoli, Behrouz

doi:10.1007/s40998-020-00397-3

Nader Aryabarzan¹ &
Behrouz Minaei-Bidgoli²

Abstract

Discriminative patterns are sets of characteristics that differentiate groups from one another, for example, successful and unsuccessful medical treatments. The objective of discriminative pattern mining is to discover significant patterns that occur with disproportionate frequencies in different class-labeled datasets, generally dataset \( D^{+} \) against dataset \( D^{-} \). The task faces two important problems: (1) the large search space problem, where the search space grows exponentially with the number of items, and (2) the redundancy problem, where the discriminative power of many patterns derives mainly from their sub-patterns. The common way to overcome the large search space problem is to discover frequent patterns in \( D^{+} \) and use them as candidate discriminative patterns. This paper makes three contributions. (1) We introduce a novel pruning strategy to reduce the search space. The strategy generates a new dataset \( D^{new} = D^{+} - D^{-} \) and uses the frequent patterns in it as candidate discriminative patterns. (2) To implement this idea efficiently, we do not compute \( D^{+} - D^{-} \) explicitly. Instead, we propose a prefix tree, dubbed DDP-tree, that is built directly from \( D^{+} \) and \( D^{-} \) and contains the essential information about the frequent patterns in \( D^{+} - D^{-} \). (3) To show the effectiveness of the strategy, we propose an algorithm based on it, dubbed DiffNRDP-Miner (difference-based non-redundant discriminative pattern miner). DiffNRDP-Miner removes redundant patterns and requires only one parameter to be set, unlike other algorithms, which require several.
Experimental results on benchmark datasets demonstrate that this strategy (1) generates good patterns, most of which are discriminative, (2) significantly reduces the search space, and (3) does not decrease the discriminative information of the patterns.
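
The common baseline mentioned in the abstract (mine frequent patterns in \( D^{+} \), then treat them as candidate discriminative patterns) can be illustrated with a minimal Python sketch. This is not the authors' DDP-tree or DiffNRDP-Miner. The function names, the toy transactions, the thresholds `min_sup` and `min_diff`, and the use of support difference as the discriminative measure are all assumptions made for illustration, and the enumeration is deliberately brute force.

```python
from itertools import combinations


def support(pattern, dataset):
    """Fraction of transactions in `dataset` that contain every item of `pattern`."""
    return sum(1 for t in dataset if pattern <= t) / len(dataset)


def frequent_patterns(dataset, min_sup):
    """Brute-force enumeration of all itemsets with support >= min_sup.

    The loop visits up to 2^n - 1 itemsets for n items, which is the
    exponential search space the abstract refers to.
    """
    items = sorted(set().union(*dataset))
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            if support(pattern, dataset) >= min_sup:
                found.append(pattern)
    return found


def candidate_discriminative(d_pos, d_neg, min_sup, min_diff):
    """Baseline: frequent patterns in D+ kept if their support difference is large.

    Support difference is an assumed measure here; the abstract does not
    say which discriminative measure the paper uses.
    """
    kept = {}
    for pattern in frequent_patterns(d_pos, min_sup):
        diff = support(pattern, d_pos) - support(pattern, d_neg)
        if diff >= min_diff:
            kept[pattern] = diff
    return kept


# Hypothetical toy data: each transaction is a set of items.
d_pos = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
d_neg = [{"c"}, {"a", "c"}, {"b", "c"}, {"c"}]

result = candidate_discriminative(d_pos, d_neg, min_sup=0.5, min_diff=0.5)
for pattern, diff in result.items():
    print(sorted(pattern), diff)
```

On this toy data the pattern {a, b} passes the filter with the same support difference as {a} and {b} alone, which gives a rough sense of the redundancy problem: the longer pattern adds no discriminative power beyond its sub-patterns.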

Notes

DiffNRDP-Miner: Difference based non-redundant discriminative pattern miner.
NRDP-tree: Non-redundant discriminative pattern tree.

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Nader Aryabarzan
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Behrouz Minaei-Bidgoli

Corresponding author

Correspondence to Behrouz Minaei-Bidgoli.

About this article

Cite this article

Aryabarzan, N., Minaei-Bidgoli, B. A Novel Pruning Strategy for Mining Discriminative Patterns. Iran J Sci Technol Trans Electr Eng 45, 505–527 (2021). https://doi.org/10.1007/s40998-020-00397-3

Received: 31 January 2020
Accepted: 26 November 2020
Published: 05 January 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s40998-020-00397-3
