Detecting N6-methyladenosine sites from RNA transcriptomes using random forest

https://doi.org/10.1016/j.jocs.2020.101238

Highlights

  • N6-methyladenosine (m6A) modifications are among the most frequently occurring RNA post-transcriptional modifications. They play vital roles in biological processes including protein localization and translation, X chromosome inactivation, cell stability, microRNA regulation and reprogramming.

  • An abnormal change in m6A may lead to disorders including cancer and brain-related diseases. Precise detection of m6A modifications is crucial for the diagnosis and treatment of these diseases.

  • Existing methods detect m6A sites inefficiently, especially in the varied structures of yeast transcriptomes.

  • The proposed m6A-pred predictor fuses statistical and chemical properties of the nucleotides to precisely predict the presence of m6A sites in RNA sequences.

  • The m6A-pred predictor outperforms all previously reported predictors in terms of accuracy, specificity and Matthews correlation coefficient.

Abstract

N6-methyladenosine (m6A) modifications are among the most frequently occurring RNA post-transcriptional modifications. They play vital roles in biological processes including protein localization and translation, X chromosome inactivation, cell stability, microRNA regulation and reprogramming. Abnormal changes in m6A sites may lead to several disorders, including cancer, brain-related disorders and other life-threatening diseases. Precise detection of m6A modifications is crucial for the diagnosis and treatment of these diseases. Existing methods detect m6A sites inefficiently, especially in yeast transcriptomes (owing to their varied structure), and fail to capture the encoded information surrounding the m6A sites. In this work, we propose a novel method, the m6A-pred predictor, that fuses statistical and chemical properties of the nucleotides to precisely predict the presence of m6A sites in RNA sequences. The fusion of multiple feature types yields a high-dimensional vector, which is further optimized using an evolutionary algorithm. Finally, a random forest classifier detects m6A sites using the most discriminative features. The results, benchmarked on yeast transcriptomes, indicate that the m6A-pred predictor outperforms all previously reported predictors, with an accuracy of 78.58%, a specificity of 79.65% and a Matthews correlation coefficient of 0.5717.

Introduction

Among the 150 types of post-transcriptional modifications of cellular RNA, N6-methyladenosine (m6A) is the most frequently occurring [1]. These modifications are catalyzed by N6-adenosyl methyltransferase complexes that include METTL3, METTL14 and WTAP, among others [2], [3]. m6A sites are present in both prokaryotes and eukaryotes [2]. A recent study found that m6A modifications control biological processes such as the localization and translation of proteins and other essential cellular tasks [4]. Any abnormal change in m6A may lead to disorders including cancer, brain-related disorders and many other diseases [5], [6]. Furthermore, m6A has been found to be non-randomly distributed across genomes [7]. Identifying m6A sites is therefore important for understanding many key biological functions.

Traditionally, wet-lab experiments have been used to produce single-nucleotide-resolution maps of m6A sites across human transcriptomes. For other species, however, these maps are unsatisfactory because of false positive and false negative detections [8]. In addition, genome-wide m6A site detection through wet-lab experiments is laborious and costly. Computational tools are therefore required to detect m6A sites precisely while saving experimental time and cost. Such tools can guide future research in bioinformatics and drug discovery for severe diseases such as cancer and brain disorders. These benefits motivated us to build a predictor for the classification of m6A sites.

Owing to efficient high-throughput technologies, genome-wide m6A distributions are available for species such as Homo sapiens, Arabidopsis thaliana and Saccharomyces cerevisiae [7], [9], [10]. The availability of large volumes of experimental data gives researchers unprecedented opportunities and makes it feasible to develop computational tools that correctly predict m6A sites. Different computational methods have been proposed according to the nature of the data. For example, the authors of [11], [12] proposed two computational tools for yeast-specific data. The first method uses nucleotide chemical properties to encode features, while the second uses pseudo nucleotide composition (PseNC) to encode the RNA sequence. The first has low sensitivity on an independent dataset, while the second has low specificity.

Another group of researchers developed a predictor named pRNAm-PC using chemical-property-based auto-covariance and auto cross-variance features in PseDNC [13]. This technique has two problems: the feature vector dimension is very high, and the reported accuracy is considerably low. Another predictor, SRAMP, was proposed for classifying mammalian m6A sites [14]. That work was inspired by [11], [12] and primarily uses a binary encoding scheme for feature extraction. Its specificity is good, but its sensitivity is very low.

Subsequently, the authors of [15] developed another predictor, MethyRNA, for predicting m6A sites in Homo sapiens and Mus musculus datasets. Although these techniques perform well at finding m6A sites in mammalian transcriptomes, they do not identify m6A sites accurately in the yeast transcriptome [14], [15]. Their features detect m6A sites inefficiently in yeast transcriptomes because the yeast structure differs from that of other species and because the computational techniques fail to capture the encoded information surrounding the m6A sites. Moreover, the information around yeast m6A sites is not well studied [14].

Some authors introduced a heuristic-based nucleotide physical-chemical property selection algorithm for m6A site identification. This technique improves on previous techniques in terms of sensitivity, but its specificity is still low [16]. Further improvements were recently carried out by Chen et al. [17] using ensemble support vector machines. They used PseKNC (pseudo K-tuple nucleotide composition) along with motif and gapped k-mer based techniques for feature extraction, and generated the final output with a majority voting strategy. The accuracy is good, but the time complexity is high because of the high-dimensional feature vectors and the ensemble of classifiers.

In [18], the authors introduced a new predictor named imethyl-STTNC. They used split trinucleotide composition and split tetranucleotide composition features to encode RNA sequences. Multiple classifiers were used to detect m6A sites, and the support vector machine achieved high performance on these features. The reported accuracy is good for the Homo sapiens dataset but not satisfactory for the yeast dataset. Recently, in [19], a new predictor called BERMP was introduced for classifying m6A sites in multiple species. Its accuracy is good for mammalian and plant species. The authors used ENAC (enhanced nucleic acid composition) as the feature encoding scheme and an ensemble of deep learning and random forest classifiers for prediction; however, prediction accuracy still did not improve for yeast RNA sequences.

Most recently, M6AMRFS was introduced for predicting m6A sites in multiple species, namely Saccharomyces cerevisiae, Homo sapiens, Mus musculus and Arabidopsis thaliana [20]. The authors used local position-specific dinucleotide frequency composition and binary encoding to encode features, together with an extreme gradient boosting (XGBoost) classifier. However, its performance in identifying m6A sites in the yeast transcriptome remained average, and further enhancements are still needed.

A predictor named DeepM6ASeq was proposed in [21] for classifying m6A sites. It uses a CNN to detect m6A sites, but it has only been evaluated on mammalian data. More recently, another predictor, WHISTLE [22], was developed. The authors extracted sequence-derived and genome-derived features from human mature messenger RNA sequences and full transcript sequences, but did not test their predictor on yeast data. Most recently, a predictor named SICM6A [23] was developed for cross-species m6A data classification. Its accuracy is good overall, but on yeast transcriptomes its results are nearly equal to those of [19].

From the literature review above, we conclude that the feature representation techniques used so far are not satisfactory for identifying m6A sites in yeast. Moreover, although the statistical and chemical properties of nucleotides in transcriptomes have been studied, the fusion of these characteristics around m6A sites has not been widely explored. In this work, we propose a novel method, the m6A-pred predictor, that fuses statistical and chemical properties of the nucleotides to precisely predict the presence of m6A sites in RNA sequences. The proposed method addresses the inefficient detection of m6A sites in yeast transcriptomes (owing to their varied structure) and the inability of existing computational techniques to capture the encoded information surrounding the m6A sites. The extracted hybrid features are usually high dimensional. To reduce feature dimensionality, we also explore feature importance through feature selection methods.

The main contributions of this study are as follows:

  • 1. A combination of chemical-property-based features and statistical features is extracted from RNA to capture the information surrounding m6A sites more efficiently.

  • 2. A wrapper-based feature selection method is used to select important features and remove redundant and unnecessary ones.
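The wrapper-based selection named in the second contribution can be sketched as a small genetic algorithm over boolean feature masks. This is a minimal illustration, not the authors' implementation: the function name `ga_select`, its parameters (population size, rates, tournament selection, elitism) and the toy fitness function are all assumptions. In a real wrapper, the fitness of a mask would be the classifier's validated score on the selected feature columns.

```python
import random

def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Search for a boolean feature mask that maximizes `fitness`.

    Returns the best mask and the best fitness seen after each generation.
    All hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]

        def tournament():
            # Pick two random individuals and keep the fitter one.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        next_population = [best[:]]  # elitism: the best mask always survives
        while len(next_population) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Flip each bit with a small probability.
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history

# Hypothetical stand-in fitness: features 0, 3 and 7 are "informative",
# and every selected feature costs a small penalty.
def toy_fitness(mask):
    return sum(mask[i] for i in (0, 3, 7)) - 0.1 * sum(mask)

best_mask, best_scores = ga_select(10, toy_fitness)
```

Because the best mask is copied unchanged into each new generation, the best score never decreases from one generation to the next.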

The remaining part of the paper is organized as follows: in Section 2, we detail the materials and method used to develop the proposed m6A-pred predictor. We benchmark our results in Section 3, on the intriguing RNA transcripts of Saccharomyces cerevisiae species. Lastly, we conclude the study in Section 4.

Section snippets

Benchmark dataset

The benchmark dataset for this study is taken from [11]. It contains 2,614 sequences: 1,307 positive sequences (containing m6A sites) and 1,307 negative sequences (containing non-m6A sites). The positive sequences are experimentally identified m6A sites. To prevent class-imbalance bias in the training set, the 1,307 non-m6A samples were randomly selected from 33,280 non-m6A sites. All sequences are 51 bp long and share less than 85% sequence similarity.
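The balancing step described above amounts to random undersampling of the negative class. The following is a minimal sketch under stated assumptions: the function name, the fixed seed and the placeholder sequence identifiers are hypothetical, and the original random draw used to build the benchmark is not reproduced here.

```python
import random

def balance_by_undersampling(positives, negatives, seed=0):
    """Randomly draw as many negatives as there are positives."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negative samples to balance the set")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled_negatives = rng.sample(negatives, len(positives))
    # Label 1 = m6A site, label 0 = non-m6A site.
    return ([(s, 1) for s in positives] +
            [(s, 0) for s in sampled_negatives])

# Placeholder identifiers standing in for the real 51-nt sequences.
positives = [f"pos_{i}" for i in range(1307)]
negatives = [f"neg_{i}" for i in range(33280)]
dataset = balance_by_undersampling(positives, negatives)
```

With these counts, the result holds 2,614 labeled samples, half of them positive.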

The m6A-pred model

To overcome

Dataset

The proposed m6A-pred predictor is evaluated on the benchmark dataset taken from [11]. The dataset contains 1,307 sequences with m6A sites and 1,307 sequences without m6A sites. The predictor takes a sample as input, extracts three types of features from it, reduces the features using a genetic algorithm (GA), and passes the final 39-dimensional feature vector to a random forest to predict whether the sample is an m6A site or a non-m6A site.
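The feature-extraction stage of this pipeline can be sketched as follows. The three feature groups (chemical property, dinucleotide and trinucleotide features) are named in the paper, but the exact encoding is not given in this excerpt, so the sketch assumes a commonly used scheme: a per-position (ring structure, functional group, hydrogen bonding) triple, plus normalized 2-mer and 3-mer frequencies. All function names are hypothetical.

```python
from itertools import product

# Assumed per-nucleotide encoding (ring, functional group, hydrogen bond):
# purine=1 / pyrimidine=0, amino=1 / keto=0, weak bonding=1 / strong=0.
CHEMICAL = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}
ALPHABET = "ACGU"

def chemical_features(seq):
    """Concatenate the three chemical-property bits of every position."""
    feats = []
    for base in seq:
        feats.extend(CHEMICAL[base])  # raises KeyError on bases outside ACGU
    return feats

def kmer_frequencies(seq, k):
    """Normalized frequency of each of the 4**k possible k-mers."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]

def fused_features(seq):
    """Fuse chemical, dinucleotide and trinucleotide features into one vector."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return (chemical_features(seq)
            + kmer_frequencies(seq, 2)
            + kmer_frequencies(seq, 3))

# Toy 51-nt sequence (not taken from the benchmark dataset).
example = "ACGU" * 12 + "ACG"
vector = fused_features(example)
```

Under these assumptions a 51-nt sequence yields 51 × 3 + 16 + 64 = 233 features, which illustrates why a selection step is needed before the reduced vector is passed to the classifier.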

Jackknife validation

The predictor’s performance is measured on the basis of its
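As an illustration of jackknife (leave-one-out) evaluation and of the metrics reported in the abstract, the sketch below holds out each sample in turn and then computes accuracy, sensitivity, specificity and the Matthews correlation coefficient from the confusion counts. The `train` and `predict` callables are hypothetical placeholders for any classifier; in the paper that role is played by a random forest.

```python
from math import sqrt

def jackknife_predictions(samples, labels, train, predict):
    """Train on all samples but one, then predict the held-out sample."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Toy labels and predictions (not results from the paper).
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Leave-one-out retrains the model once per sample, so for the 2,614-sequence benchmark it means 2,614 training runs.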

Discussion

N6-methyladenosine (m6A) is a frequently occurring post-transcriptional modification on both eukaryotic and prokaryotic messenger RNA transcripts. The regulation of messenger RNA is facilitated by RNA-binding proteins that recognize these m6A modification sites. The occurrence of m6A on a small portion of a transcript plays vital roles in cellular functions such as cellular stability, pre-mRNA splicing, microRNA regulation and other important biological functions, thus indicating a link with

Conclusion

In this work, we proposed a new predictor called m6A-pred for the detection of m6A sites in the Saccharomyces cerevisiae transcriptome. We extracted three types of sequence features, namely chemical-property-based features (ring, function and hydrogen), dinucleotide features and trinucleotide features. These features result in a high-dimensional feature vector. We then select the most discriminative features using a genetic algorithm optimization technique. The random forest classifier is used

CRediT authorship contribution statement

Asad Khan: Conception, Design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Hafeez Ur Rehman: Conception, Design of study, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Usman Habib: Conception, Design of study, Analysis and/or interpretation of data, Writing - review & editing. Umer Ijaz: Conception, Design of study, Analysis and/or interpretation of data, Writing -

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

All authors approved the version of the manuscript to be published.

References (47)

  • Cantara W.A. et al. The RNA modification database, RNAMDB: 2011 update. Nucleic Acids Res. (2010)
  • Meyer K.D. et al. The dynamic epitranscriptome: N6-methyladenosine and gene expression control. Nat. Rev. Mol. Cell Biol. (2014)
  • Nilsen T.W. Internal mRNA methylation finally finds functions. Science (2014)
  • Linder B. et al. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat. Methods (2015)
  • Dominissini D. et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature (2012)
  • Luo G.-Z. et al. Unique features of the m6A methylome in Arabidopsis thaliana. Nat. Commun. (2014)
  • Chen W. et al. Identification and analysis of the N6-methyladenosine in the Saccharomyces cerevisiae transcriptome. Sci. Rep. (2015)
  • Zhou Y. et al. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. (2016)
  • Chen W. et al. MethyRNA: a web server for identification of N6-methyladenosine sites. J. Biomol. Struct. Dyn. (2017)
  • Chen W. et al. Detecting N6-methyladenosine sites from RNA transcriptomes using ensemble Support Vector Machines. Sci. Rep. (2017)
  • Huang Y. et al. BERMP: a cross-species classifier for predicting m6A sites by integrating a deep learning algorithm and a random forest approach. Int. J. Biol. Sci. (2018)
  • Qiang X. et al. M6AMRFS: robust prediction of N6-methyladenosine sites with sequence-based features in multiple species. Front. Genet. (2018)
  • Chen K. et al. WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. (2019)
Asad Khan received his MS degree in Computer Science from COMSATS University Islamabad in 2017. He obtained his first-class honors degree in Computer Science (with distinction) from Abdul Wali Khan University, Mardan, Pakistan in 2014. He is currently a Ph.D. research fellow in the Department of Computer Science at the National University of Computer & Emerging Sciences, Peshawar, Pakistan. His research interests are in machine learning applied to bioinformatics.

Hafeez Ur Rehman is Associate Professor and Head of Department (HoD) in the Department of Computer Science at the National University of Computer & Emerging Sciences, Peshawar. He obtained his BS (Computer Science) degree from COMSATS Institute of Information Technology (CIIT), Abbottabad in 2006. He completed his MS and Ph.D. degrees at Politecnico di Torino, Italy, in 2010 and 2014 respectively. Dr. Hafeez works actively in the fields of bioinformatics and medical image analysis and has published a number of articles in international research venues. He is an HEC-approved Ph.D. supervisor and a recipient of HEC-SRGP and IGNITE research grants for projects conducted under his mentorship.

Usman Habib is Assistant Professor and Coordinator of the Graduate Program Committee at the Department of Computer Science, FAST National University of Computers & Emerging Sciences (NUCES). He has more than ten years of teaching and research experience, from 2006 to date, and has also completed several industrial projects. He completed his Ph.D. at the ICT department of the Technical University of Vienna, Austria, and obtained his Master's degree from the Norwegian University of Science and Technology (NTNU), Norway, in 2008. Dr. Usman has authored several conference and journal publications. His current interests are machine learning, data analytics, and fault detection and diagnosis systems.

Umer Ijaz is an Assistant Professor at the Electrical Engineering & Technology Department, Government College University Faisalabad, Pakistan. After completing his B.Sc. in Electrical Engineering in 2007, he worked with various national and multinational companies. He completed his MS in Communication Engineering and Ph.D. in Electronics and Communication Engineering at Politecnico di Torino, Italy. During his MS and Ph.D. studies he collaborated with ST Microelectronics on various projects and conducted research on the design and development of X-direction 3D avatar animations using wearable motion sensors. Dr. Umer has published several research papers in national and international forums. He is an HEC-approved Ph.D. supervisor and a recipient of HEC-SRGP and IGNITE research grants for projects conducted under his mentorship.
