Abstract
Genomic Selection (GS) is a breeding technique that uses whole-genome markers to make trait predictions. The goal of GS is to identify the top candidates, that is, the individuals with the most desirable trait values. GS has usually been formulated as a regression problem in which the marker data are used to predict phenotypic values. However, since the end goal of GS is the identification of top candidates, ranking the individuals is the more natural formulation. Creating accurate ranking models poses three fundamental challenges: the presence of noise in the phenotypic data, the extremely high dimensionality of the genotypic data, and the small sample size of genomic datasets. To address these challenges, we present a novel two-phase approach that increases the noise tolerance of ranking-based approaches. The proposed algorithm uses pruning to perform noise filtering and leverages biclustering to improve model generalization. The approach is evaluated on both pointwise and pairwise ranking algorithms. Previous work on the Arabidopsis and CIMMYT wheat datasets yielded mean Normalized Discounted Cumulative Gain (NDCG) @10 scores of 0.883 and 0.748, respectively. The proposed approach outperforms these results on both datasets, yielding mean NDCG@10 scores of 0.965 and 0.865, respectively.
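As an illustration of the evaluation metric named above, the following is a minimal sketch of how NDCG@k can be computed for a set of candidates. It is not the authors' implementation: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, the linear-gain form of DCG is an assumption (an exponential-gain variant is also common), and the trait values are assumed to be non-negative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain, assumed)."""
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(true_values, predicted_scores, k=10):
    """NDCG@k: DCG of the predicted ordering divided by DCG of the ideal ordering."""
    # Sort candidate indices by descending predicted score.
    order = sorted(range(len(true_values)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_values[i] for i in order]
    # The ideal ordering sorts candidates by their true trait values.
    ideal = sorted(true_values, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    # Guard against division by zero when all true values are zero.
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A score of 1 means the model places the top k candidates in exactly the ideal order; lower scores indicate that high-value individuals were ranked too low. Because only the ordering of `predicted_scores` matters, the same function can score the output of either a pointwise or a pairwise ranker.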
Notes
Dataset available at http://publiclines.versailles.inra.fr/page/33.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
About this article
Cite this article
Banerjee, R., Singh, M. Using noise reduction to enhance ranking based genomic selection. Genet Resour Crop Evol 68, 3319–3331 (2021). https://doi.org/10.1007/s10722-021-01190-9
DOI: https://doi.org/10.1007/s10722-021-01190-9