Using noise reduction to enhance ranking based genomic selection

  • Research Article
  • Journal: Genetic Resources and Crop Evolution

Abstract

Genomic Selection (GS) is a breeding technique that uses whole-genome markers to predict traits. The goal of GS is to identify the top candidates, that is, the individuals with the most desirable trait values. GS has usually been formulated as a regression problem in which marker data is used to predict phenotypic values. However, since the end goal of GS is the identification of top candidates, ranking the individuals is a more natural formulation. Creating accurate ranking models poses three fundamental challenges: noise in the phenotypic data, the extremely high dimensionality of the genotypic data, and the small sample size of genomic datasets. To address these challenges, we present a novel two-phase approach that increases the noise tolerance of ranking-based approaches. The proposed algorithm uses pruning to perform noise filtering and leverages biclustering to improve model generalization. The approach is evaluated on both pointwise and pairwise ranking algorithms. Previous work on the Arabidopsis and CIMMYT wheat datasets yielded mean Normalized Discounted Cumulative Gain (NDCG) @10 scores of 0.883 and 0.748 respectively. The proposed approach outperforms these results on both datasets, yielding ranking accuracies of 0.965 and 0.865 respectively.
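
For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@k can be computed for a ranked list of candidates. It is an illustration only, not the evaluation code used in the article: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, the sketch assumes a linear gain with non-negative phenotypic values used directly as relevance, and the article may use a different gain or discount variant.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(true_values, predicted_scores, k=10):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    # Rank candidates by descending predicted score, then read off their true values.
    order = sorted(range(len(true_values)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_values[i] for i in order]
    # The ideal ordering sorts candidates by their true values.
    ideal_dcg = dcg_at_k(sorted(true_values, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy data (assumed, not from the article): four candidates with true trait values.
true_values = [3.0, 1.0, 2.0, 0.0]
perfect = ndcg_at_k(true_values, [0.9, 0.2, 0.5, 0.1], k=3)   # same order as truth
inverted = ndcg_at_k(true_values, [0.1, 0.5, 0.2, 0.9], k=3)  # fully reversed order
print(perfect, inverted)
```

A model that places the truly best candidates at the top scores close to 1, while errors near the top of the list are penalized more heavily than errors further down. This is why the metric suits the "identify the top candidates" goal described above.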

Notes

  1. Dataset available at http://publiclines.versailles.inra.fr/page/33.

Author information

Corresponding author

Correspondence to Rohan Banerjee.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Banerjee, R., Singh, M. Using noise reduction to enhance ranking based genomic selection. Genet Resour Crop Evol 68, 3319–3331 (2021). https://doi.org/10.1007/s10722-021-01190-9

  • DOI: https://doi.org/10.1007/s10722-021-01190-9
