Current journal: BMC Bioinformatics
  • Correction to: Effective machine-learning assembly for next-generation amplicon sequencing with very low coverage
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-22
    Louis Ranjard; Thomas K. F. Wong; Allen G. Rodrigo

    Following publication of the original article [1], the author reported that there are several errors in the original article;

    Updated: 2020-01-23
  • Reverse engineering directed gene regulatory networks from transcriptomics and proteomics data of biomining bacterial communities with approximate Bayesian computation and steady-state signalling simulations
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-21
    Antoine Buetti-Dinh; Malte Herold; Stephan Christel; Mohamed El Hajjami; Francesco Delogu; Olga Ilie; Sören Bellenberg; Paul Wilmes; Ansgar Poetsch; Wolfgang Sand; Mario Vera; Igor V. Pivkin; Ran Friedman; Mark Dopson

    Network inference is an important aim of systems biology. It enables the transformation of OMICs datasets into biological knowledge. It consists of reverse engineering gene regulatory networks from OMICs data, such as RNAseq or mass spectrometry-based proteomics data, through computational methods. This approach makes it possible to identify signalling pathways involved in specific biological functions. The ability to infer causality in gene regulatory networks, in addition to correlation, is crucial for several modelling approaches and allows targeted control in biotechnology applications. We performed simulations according to the approximate Bayesian computation method, where the core model consisted of a steady-state simulation algorithm used to study gene regulatory networks in systems for which only a limited level of detail is available. The simulation outcomes were compared to experimentally measured transcriptomics and proteomics data through approximate Bayesian computation. The structures of small gene regulatory networks responsible for the regulation of biological functions involved in biomining were inferred from multi-omics data of mixed bacterial cultures. Several causal inter- and intraspecies interactions were inferred between genes coding for proteins involved in the biomining process, such as heavy metal transport, DNA damage, replication and repair, and membrane biogenesis. The method also provided indications of the role of several uncharacterized proteins through their inferred connections in the network context. The combination of fast algorithms with high-performance computing allowed the simulation of a multitude of gene regulatory networks and their comparison to experimentally measured OMICs data through approximate Bayesian computation, enabling the probabilistic inference of causality in gene regulatory networks of a multispecies bacterial system involved in biomining without the need for single-cell or multiple-perturbation experiments. This information can be used to influence biological functions and control specific processes in biotechnology applications.

    Updated: 2020-01-22 (an illustrative code sketch follows this entry)
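    A minimal rejection-ABC sketch of the general idea above, in Python: draw candidate network parameters from a prior, run a toy steady-state signalling simulation, and keep the candidates whose simulated profile lies close to a measured omics profile. The surrogate model, parameter values and the observed vector are hypothetical stand-ins, not the authors' pipeline.

```python
# Illustrative rejection-ABC loop (toy surrogate, not the published workflow).
import numpy as np

rng = np.random.default_rng(0)

def steady_state(weights, n_iter=200):
    """Toy steady-state signalling simulation: iterate x <- tanh(W x + input)."""
    x = np.zeros(weights.shape[0])
    for _ in range(n_iter):
        x = np.tanh(weights @ x + 0.1)  # small constant input keeps the system away from 0
    return x

def abc_rejection(observed, n_genes, n_samples=5000, epsilon=1.0):
    accepted = []
    for _ in range(n_samples):
        W = rng.normal(0.0, 1.0, size=(n_genes, n_genes))   # prior over interaction weights
        distance = np.linalg.norm(steady_state(W) - observed)
        if distance < epsilon:                               # keep parameters close to the data
            accepted.append(W)
    return accepted                                          # approximate posterior sample

observed = np.array([0.4, -0.2, 0.7, 0.1])   # stand-in for a measured omics profile
posterior = abc_rejection(observed, n_genes=4)
print(len(posterior), "accepted parameter sets")
```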
  • ShinyOmics: collaborative exploration of omics-data
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-17
    Defne Surujon; Tim van Opijnen

    Omics-profiling is a collection of increasingly prominent approaches that result in large-scale biological datasets, for instance capturing an organism’s behavior and response in an environment. It can be daunting to manually analyze and interpret such large datasets without some programming experience. Additionally, with increasing amounts of data, challenges in management, storage and sharing arise. Here, we present ShinyOmics, a web-based application that allows rapid collaborative exploration of omics-data. By using Tn-Seq, RNA-Seq, microarray and proteomics datasets from two human pathogens, we exemplify several conclusions that can be drawn from a rich dataset. We identify a protease and several chaperone proteins upregulated under aminoglycoside stress, show that antibiotics with the same mechanism of action trigger similar transcriptomic responses, point out the dissimilarity between different omics-profiles, and overlay the transcriptional response on a metabolic network. ShinyOmics is easy to set up and customize, and can utilize user-supplied metadata. It offers several visualization and comparison options that are designed to assist in novel hypothesis generation, as well as data management, online sharing and exploration. Moreover, ShinyOmics can be used as an interactive supplement accompanying research articles or presentations.

    Updated: 2020-01-17
  • Lag penalized weighted correlation for time series clustering
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-16
    Thevaa Chandereng; Anthony Gitter

    The similarity or distance measure used for clustering can generate intuitive and interpretable clusters when it is tailored to the unique characteristics of the data. In time series datasets generated with high-throughput biological assays, measurements such as gene expression levels or protein phosphorylation intensities are collected sequentially over time, and the similarity score should capture this special temporal structure. We propose a clustering similarity measure called Lag Penalized Weighted Correlation (LPWC) to group pairs of time series that exhibit closely-related behaviors over time, even if the timing is not perfectly synchronized. LPWC aligns time series profiles to identify common temporal patterns. It down-weights aligned profiles based on the length of the temporal lags that are introduced. We demonstrate the advantages of LPWC versus existing time series and general clustering algorithms. In a simulated dataset based on the biologically-motivated impulse model, LPWC is the only method to recover the true clusters for almost all simulated genes. LPWC also identifies clusters with distinct temporal patterns in our yeast osmotic stress response and axolotl limb regeneration case studies. LPWC achieves both of its time series clustering goals. It groups time series with correlated changes over time, even if those patterns occur earlier or later in some of the time series. In addition, it refrains from introducing large shifts in time when searching for temporal patterns by applying a lag penalty. The LPWC R package is available at https://github.com/gitter-lab/LPWC and on CRAN under an MIT license.

    Updated: 2020-01-16 (an illustrative code sketch follows this entry)
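    An illustrative Python sketch of the lag-penalized correlation idea behind LPWC: try small integer lags between two time series, compute the correlation on the aligned overlap, and down-weight it by the size of the lag. The penalty form and the parameters are simplified assumptions, not the R package's exact weighting scheme.

```python
# Lag-penalized correlation between two short time series (conceptual sketch only).
import numpy as np

def lag_penalized_corr(x, y, max_lag=2, penalty=0.5):
    """Best Pearson correlation over small lags, down-weighted by exp(-penalty * |lag|)."""
    best = -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            a, b = x[lag:], y[:len(y) - lag]
        elif lag < 0:
            a, b = x[:len(x) + lag], y[-lag:]
        else:
            a, b = x, y
        if len(a) < 3:
            continue                                   # too little overlap to correlate
        r = np.corrcoef(a, b)[0, 1]
        best = max(best, np.exp(-penalty * abs(lag)) * r)
    return best

t = np.arange(8)
x = np.sin(t)
y = np.sin(t - 1)          # same pattern, shifted by one time point
print(round(lag_penalized_corr(x, y), 3))
```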
  • PRAP: Pan Resistome analysis pipeline
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-15
    Yichen He; Xiujuan Zhou; Ziyan Chen; Xiangyu Deng; Andrew Gehring; Hongyu Ou; Lida Zhang; Xianming Shi

    Antibiotic resistance genes (ARGs) can spread among pathogens via horizontal gene transfer, resulting in imparities in their distribution even within the same species. Therefore, a pan-genome approach to analyzing resistomes is necessary for thoroughly characterizing patterns of ARGs distribution within particular pathogen populations. Software tools are readily available for either ARGs identification or pan-genome analysis, but few exist to combine the two functions. We developed Pan Resistome Analysis Pipeline (PRAP) for the rapid identification of antibiotic resistance genes from various formats of whole genome sequences based on the CARD or ResFinder databases. Detailed annotations were used to analyze pan-resistome features and characterize distributions of ARGs. The contribution of different alleles to antibiotic resistance was predicted by a random forest classifier. Results of analysis were presented in browsable files along with a variety of visualization options. We demonstrated the performance of PRAP by analyzing the genomes of 26 Salmonella enterica isolates from Shanghai, China. PRAP was effective for identifying ARGs and visualizing pan-resistome features, therefore facilitating pan-genomic investigation of ARGs. This tool has the ability to further excavate potential relationships between antibiotic resistance genes and their phenotypic traits.

    Updated: 2020-01-15
  • Automatic construction of metabolic models with enzyme constraints
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-14
    Pavlos Stephanos Bekiaris; Steffen Klamt

    In order to improve the accuracy of constraint-based metabolic models, several approaches have been developed which intend to integrate additional biological information. Two of these methods, MOMENT and GECKO, incorporate enzymatic (kcat) parameters and enzyme mass constraints to further constrain the space of feasible metabolic flux distributions. While both methods have been proven to deliver useful extensions of metabolic models, they may considerably increase size and complexity of the models and there is currently no tool available to fully automate generation and calibration of such enzyme-constrained models from given stoichiometric models. In this work we present three major developments. We first conceived short MOMENT (sMOMENT), a simplified version of the MOMENT approach, which yields the same predictions as MOMENT but requires significantly fewer variables and enables direct inclusion of the relevant enzyme constraints in the standard representation of a constraint-based model. When measurements of enzyme concentrations are available, these can be included as well leading in the extreme case, where all enzyme concentrations are known, to a model representation that is analogous to the GECKO approach. Second, we developed the AutoPACMEN toolbox which allows an almost fully automated creation of sMOMENT-enhanced stoichiometric metabolic models. In particular, this includes the automatic read-out and processing of relevant enzymatic data from different databases and the reconfiguration of the stoichiometric model with embedded enzymatic constraints. Additionally, tools have been developed to adjust (kcat and enzyme pool) parameters of sMOMENT models based on given flux data. We finally applied the new sMOMENT approach and the AutoPACMEN toolbox to generate an enzyme-constrained version of the E. coli genome-scale model iJO1366 and analyze its key properties and differences with the standard model. In particular, we show that the enzyme constraints improve flux predictions (e.g., explaining overflow metabolism and other metabolic switches) and demonstrate, for the first time, that these constraints can markedly change the spectrum of metabolic engineering strategies for different target products. The methodological and tool developments presented herein pave the way for a simplified and routine construction and analysis of enzyme-constrained metabolic models.

    Updated: 2020-01-15 (an illustrative code sketch follows this entry)
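    A minimal sketch of the kind of enzyme constraint sMOMENT embeds in a stoichiometric model, namely sum_i (MW_i / kcat_i) * v_i <= P for a shared protein pool, solved here as a tiny linear program with SciPy. The toy network, kcat values, molecular weights and pool size are invented for illustration and are unrelated to iJO1366.

```python
# Flux balance analysis with a single enzyme-pool constraint (toy model).
import numpy as np
from scipy.optimize import linprog

# Toy network: A_ext -> A -> B -> biomass; S is the metabolites x reactions matrix
S = np.array([[1, -1,  0],    # A
              [0,  1, -1]])   # B
c = np.array([0, 0, -1.0])    # linprog minimizes, so maximize v3 (biomass) via -v3

kcat = np.array([100.0, 50.0, 80.0])   # 1/s, hypothetical turnover numbers
mw   = np.array([40.0, 60.0, 30.0])    # kDa, hypothetical enzyme masses
pool = 0.1                             # total enzyme budget (arbitrary units)

A_ub = (mw / kcat).reshape(1, -1)      # enzyme usage per unit flux
b_ub = np.array([pool])

res = linprog(c, A_eq=S, b_eq=np.zeros(2), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 10)] * 3, method="highs")
print("optimal fluxes:", res.x)
```

    With the pool constraint active, the optimal flux in this toy problem is limited by the enzyme budget rather than by the flux bounds, which is the qualitative behavior enzyme-constrained models are meant to capture.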
  • A pipeline to create predictive functional networks: application to the tumor progression of hepatocellular carcinoma
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-14
    Maxime Folschette; Vincent Legagneux; Arnaud Poret; Lokmane Chebouba; Carito Guziolowski; Nathalie Théret

    Integrating genome-wide gene expression patient profiles with regulatory knowledge is a challenging task because of the inherent heterogeneity, noise and incompleteness of biological data. From the computational side, several solvers for logic programs are able to perform extremely well in decision problems for combinatorial search domains. The challenge then is how to process the biological knowledge in order to feed these solvers to gain insights into a biological study. It requires formalizing the biological knowledge to give a precise interpretation of this information; currently, very few pathway databases offer this possibility. The presented work proposes an automatic pipeline to extract regulatory knowledge from pathway databases and generate novel computational predictions related to the state of expression or activity of biological molecules. We applied it in the context of hepatocellular carcinoma (HCC) progression, and evaluated the precision and the stability of these computational predictions. Our working base is a graph of 3383 nodes and 13,771 edges extracted from the KEGG database, in which we integrate 209 differentially expressed genes between low and high aggressive HCC across 294 patients. Our computational model predicts the shifts of expression of 146 initially non-observed biological components. Our predictions were validated at 88% using a larger experimental dataset and cross-validation techniques. In particular, we focus on the protein complex predictions and show for the first time that NFKB1/BCL-3 complexes are activated in aggressive HCC. In spite of the large dimension of the reconstructed models, our analyses over the computational predictions discover a well-constrained region where KEGG regulatory knowledge constrains gene expression of several biomolecules. These regions can offer interesting windows to experimentally perturb such complex systems. This new pipeline allows biologists to develop their own predictive models based on a list of genes. It facilitates the identification of new regulatory biomolecules using knowledge graphs and predictive computational methods. Our workflow is implemented in an automatic Python pipeline which is publicly available at https://github.com/LokmaneChebouba/key-pipe and contains, as testing data, all the data used in this paper.

    Updated: 2020-01-14
  • Deep neural networks for human microRNA precursor detection
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-13
    Xueming Zheng; Xingli Fu; Kaicheng Wang; Meng Wang

    MicroRNAs (miRNAs) play important roles in a variety of biological processes by regulating gene expression at the post-transcriptional level. Thus, the discovery of new miRNAs has become a popular task in biological research. Since the experimental identification of miRNAs is time-consuming, many computational tools have been developed to identify miRNA precursors (pre-miRNAs). Most of these computational methods are based on traditional machine learning, and their performance depends heavily on the selected features, which are usually determined by domain experts. To develop easily implemented methods with better performance, we investigated different deep learning architectures for pre-miRNA identification. In this work, we applied convolutional neural networks (CNN) and recurrent neural networks (RNN) to predict human pre-miRNAs. We combined the sequences with the predicted secondary structures of pre-miRNAs as input features of our models, avoiding the feature extraction and selection process by hand. The models were easily trained on the training dataset with low generalization error, and therefore had satisfactory performance on the test dataset. The prediction results on the same benchmark dataset showed that our models outperformed or were highly comparable to other state-of-the-art methods in this area. Furthermore, our CNN model trained on a human dataset had high prediction accuracy on data from other species. Deep neural networks (DNN) could be utilized for human pre-miRNA detection with high performance. Complex features of RNA sequences could be automatically extracted by CNN and RNN, which were used for pre-miRNA prediction. Through proper regularization, our deep learning models, although trained on a comparatively small dataset, had strong generalization ability.

    Updated: 2020-01-14
  • The impact of various seed, accessibility and interaction constraints on sRNA target prediction - a systematic assessment
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-13
    Martin Raden; Teresa Müller; Stefan Mautner; Rick Gelhausen; Rolf Backofen

    Seed and accessibility constraints are core features to enable highly accurate sRNA target screens based on RNA-RNA interaction prediction. Currently, available tools provide different (sets of) constraints and default parameter sets. Thus, it is hard or even impossible for users to estimate the influence of individual restrictions on the prediction results. Here, we present a systematic assessment of the impact of established and new constraints on sRNA target prediction on both a qualitative and a computational level. This is done exemplarily based on the performance of IntaRNA, one of the most exact sRNA target prediction tools. IntaRNA provides various ways to constrain considered seed interactions, e.g. based on seed length, its accessibility, minimal unpaired probabilities, or energy thresholds, besides analogous constraints for the overall interaction. Thus, our results reveal the impact of individual constraints and their combinations. This provides both a guide for users as to what is important and recommendations for existing and upcoming sRNA target prediction approaches. We show on a large sRNA target screen benchmark data set that, only by altering the parameter set, IntaRNA recovers 30% more verified interactions while becoming five times faster. This exemplifies the potential of seed, accessibility and interaction constraints for sRNA target prediction.

    Updated: 2020-01-13
  • Guidelines for cell-type heterogeneity quantification based on a comparative analysis of reference-free DNA methylation deconvolution software
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-13
    Clémentine Decamps; Florian Privé; Raphael Bacher; Daniel Jost; Arthur Waguet; Eugene Andres Houseman; Eugene Lurie; Pavlo Lutsik; Aleksandar Milosavljevic; Michael Scherer; Michael G. B. Blum; Magali Richard

    Cell-type heterogeneity of tumors is a key factor in tumor progression and response to chemotherapy. Tumor cell-type heterogeneity, defined as the proportion of the various cell-types in a tumor, can be inferred from DNA methylation of surgical specimens. However, confounding factors known to associate with methylation values, such as age and sex, complicate accurate inference of cell-type proportions. While reference-free algorithms have been developed to infer cell-type proportions from DNA methylation, a comparative evaluation of the performance of these methods is still lacking. Here we use simulations to evaluate several computational pipelines based on the software packages MeDeCom, EDec, and RefFreeEWAS. We identify that accounting for confounders, feature selection, and the choice of the number of estimated cell types are critical steps for inferring cell-type proportions. We find that removal of methylation probes which are correlated with confounder variables reduces the error of inference by 30–35%, and that selection of cell-type informative probes has similar effect. We show that Cattell’s rule based on the scree plot is a powerful tool to determine the number of cell-types. Once the pre-processing steps are achieved, the three deconvolution methods provide comparable results. We observe that all the algorithms’ performance improves when inter-sample variation of cell-type proportions is large or when the number of available samples is large. We find that under specific circumstances the methods are sensitive to the initialization method, suggesting that averaging different solutions or optimizing initialization is an avenue for future research. Based on the lessons learned, to facilitate pipeline validation and catalyze further pipeline improvement by the community, we develop a benchmark pipeline for inference of cell-type proportions and implement it in the R package medepir.

    Updated: 2020-01-13 (an illustrative code sketch follows this entry)
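    A sketch, on hypothetical data, of the two pre-processing steps the benchmark identifies as critical: removing methylation probes correlated with a known confounder, and reading the number of cell types off the scree plot (Cattell's rule). This is an illustration of the idea, not the medepir pipeline.

```python
# Confounder filtering and scree-based choice of the number of cell types (toy data).
import numpy as np

rng = np.random.default_rng(1)
n_probes, n_samples, n_types = 500, 40, 3

# Hypothetical methylation matrix: three cell-type profiles mixed in varying proportions
profiles = rng.uniform(0, 1, size=(n_probes, n_types))
proportions = rng.dirichlet(np.ones(n_types), size=n_samples).T
age = rng.normal(60, 10, size=n_samples)                       # confounder
D = profiles @ proportions + 0.02 * rng.normal(size=(n_probes, n_samples))
D[:50] += 0.002 * (age - age.mean())                           # first 50 probes drift with age

# (1) remove probes whose absolute correlation with the confounder exceeds a cutoff
cors = np.array([abs(np.corrcoef(D[i], age)[0, 1]) for i in range(n_probes)])
D_filtered = D[cors < 0.5]

# (2) Cattell's rule: the scree plot flattens out after the true number of components
s = np.linalg.svd(D_filtered, compute_uv=False)
print(f"{D_filtered.shape[0]} probes kept; leading singular values: {np.round(s[:6], 1)}")
```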
  • Bayesian differential analysis of gene regulatory networks exploiting genetic perturbations
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-09
    Yan Li; Dayou Liu; Tengfei Li; Yungang Zhu

    Gene regulatory networks (GRNs) can be inferred from both gene expression data and genetic perturbations. Under different conditions, the gene data of the same gene set may be different from each other, which results in different GRNs. Detecting structural differences between GRNs under different conditions is of great significance for understanding gene functions and biological mechanisms. In this paper, we propose a Bayesian Fused algorithm to jointly infer differential structures of GRNs under two different conditions. The algorithm is developed for GRNs modeled with structural equation models (SEMs), which makes it possible to incorporate genetic perturbations into models to improve the inference accuracy, so we name it BFDSEM. Different from the naive approaches that separately infer pair-wise GRNs and identify the difference from the inferred GRNs, we first re-parameterize the two SEMs to form an integrated model that takes full advantage of the two groups of gene data, and then solve the re-parameterized model by developing a novel Bayesian fused prior following the criterion that the separate GRNs and the differential GRN are both sparse. Computer simulations are run on synthetic data to compare BFDSEM to two state-of-the-art joint inference algorithms: FSSEM and ReDNet. The results demonstrate that the performance of BFDSEM is comparable to FSSEM, and is generally better than ReDNet. The BFDSEM algorithm was also applied to a real data set of lung cancer and adjacent normal tissues; the resulting normal GRN and differential GRN are consistent with results reported in the previous literature. An open-source program implementing BFDSEM is freely available in Additional file 1.

    Updated: 2020-01-11
  • Improving the organization and interactivity of metabolic pathfinding with precomputed pathways
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-10
    Sarah M. Kim; Matthew I. Peña; Mark Moll; George N. Bennett; Lydia E. Kavraki

    The rapid growth of available knowledge on metabolic processes across thousands of species continues to expand the possibilities of producing chemicals by combining pathways found in different species. Several computational search algorithms have been developed for automating the identification of possible heterologous pathways; however, these searches may return thousands of pathway results. Although the large number of results is in part due to the large number of possible compounds and reactions, a subset of core reaction modules is repeatedly observed in pathway results across multiple searches, suggesting that some subpaths between common compounds were more consistently explored than others. To reduce the resources spent on searching the same metabolic space, a new meta-algorithm for metabolic pathfinding, Hub Pathway search with Atom Tracking (HPAT), was developed to take advantage of a precomputed network of subpath modules. To investigate the efficacy of this method, we created a table describing a network of common hub metabolites and how they are biochemically connected and only offloaded searches to and from this hub network onto an interactive webserver capable of visualizing the resulting pathways. A test set of nineteen known pathways taken from literature and metabolic databases was used to evaluate if HPAT was capable of identifying known pathways. HPAT found the exact pathway for eleven of the nineteen test cases using a diverse set of precomputed subpaths, whereas a comparable pathfinding search algorithm that does not use precomputed subpaths found only seven of the nineteen test cases. The capability of HPAT to find novel pathways was demonstrated by its ability to identify novel 3-hydroxypropanoate (3-HP) synthesis pathways. As for pathway visualization, the new interactive pathway filters enable a reduction of the number of displayed pathways from hundreds down to less than ten pathways in several test cases, illustrating their utility in reducing the amount of presented information while retaining pathways of interest. This work presents the first step in incorporating a precomputed subpath network into metabolic pathfinding and demonstrates how this leads to a concise, interactive visualization of pathway results. The modular nature of metabolic pathways is exploited to facilitate efficient discovery of alternate pathways.

    Updated: 2020-01-11 (an illustrative code sketch follows this entry)
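    A conceptual Python sketch (using networkx, not the HPAT implementation) of searching through a precomputed hub network: route the source compound to a hub, reuse a precomputed hub-to-hub subpath, and finish from a hub to the target. The toy graph, the hub set and the compound names are assumptions made purely for illustration.

```python
# Hub-based metabolic pathfinding with precomputed hub-to-hub subpaths (conceptual sketch).
import networkx as nx

G = nx.Graph()  # toy metabolic graph; nodes are compounds, edges are reactions
G.add_edges_from([("glucose", "g6p"), ("g6p", "pyruvate"), ("pyruvate", "acetyl-CoA"),
                  ("acetyl-CoA", "malonyl-CoA"), ("malonyl-CoA", "3-HP"),
                  ("pyruvate", "lactate")])

hubs = {"pyruvate", "acetyl-CoA"}

# "Precomputed" hub-to-hub subpaths (in a real system these would be stored in a table)
hub_paths = {(a, b): nx.shortest_path(G, a, b) for a in hubs for b in hubs if a != b}

def hub_search(source, target):
    best = None
    for h_in in hubs:
        for h_out in hubs:
            head = nx.shortest_path(G, source, h_in)          # source to entry hub
            mid = hub_paths.get((h_in, h_out), [h_in])        # reused precomputed subpath
            tail = nx.shortest_path(G, h_out, target)         # exit hub to target
            path = head + mid[1:] + tail[1:] if h_in != h_out else head + tail[1:]
            if best is None or len(path) < len(best):
                best = path
    return best

print(hub_search("glucose", "3-HP"))
```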
  • LDpop: an interactive online tool to calculate and visualize geographic LD patterns
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-10
    T. A. Alexander; M. J. Machiela

    Linkage disequilibrium (LD)—the non-random association of alleles at different loci—defines population-specific haplotypes which vary by genomic ancestry. Assessment of allelic frequencies and LD patterns from a variety of ancestral populations enables researchers to better understand population histories as well as improve genetic understanding of diseases in which risk varies by ethnicity. We created an interactive web module which allows for quick geographic visualization of linkage disequilibrium (LD) patterns between two user-specified germline variants across geographic populations included in the 1000 Genomes Project. Interactive maps and a downloadable, sortable summary table allow researchers to easily compute and compare allele frequencies and LD statistics of dbSNP catalogued variants. The geographic mapping of each SNP’s allele frequencies by population as well as visualization of LD statistics allows the user to easily trace geographic allelic correlation patterns and examine population-specific differences. LDpop is a free and publicly available cross-platform web tool which can be accessed online at https://ldlink.nci.nih.gov/?tab=ldpop

    Updated: 2020-01-11 (an illustrative code sketch follows this entry)
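    A short sketch of the LD statistics such a tool reports, computed here from phased haplotype counts for two biallelic variants in a single population. The counts are invented; real analyses would use 1000 Genomes haplotypes for each geographic population.

```python
# Allele frequencies, D', and r^2 from haplotype counts for two biallelic variants.
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n          # allele frequency at locus 1
    p_B = (n_AB + n_aB) / n          # allele frequency at locus 2
    D = p_AB - p_A * p_B             # coefficient of linkage disequilibrium
    if D > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = D / d_max if d_max else 0.0
    r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return p_A, p_B, d_prime, r2

print(ld_stats(n_AB=480, n_Ab=20, n_aB=30, n_ab=470))   # toy haplotype counts
```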
  • Microscopy cell nuclei segmentation with enhanced U-Net
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-08
    Feixiao Long

    Cell nuclei segmentation is a fundamental task in microscopy image analysis, based on which multiple biology-related analyses can be performed. Although deep learning (DL) based techniques have achieved state-of-the-art performances in image segmentation tasks, these methods are usually complex and require the support of powerful computing resources. In addition, it is impractical, considering the cost of medical exams, to allocate advanced computing resources to each dark- or bright-field microscope, a type of instrument widely employed in a vast number of clinical institutions. Thus, it is essential to develop accurate DL based segmentation algorithms working with resource-constrained computing. An enhanced, lightweight U-Net (called U-Net+) with a modified encoder branch is proposed to potentially work with low-resource computing. Through strictly controlled experiments, the average IOU and precision of U-Net+ predictions are confirmed to outperform other prevalent competing methods with a 1.0% to 3.0% gain on the first-stage test set of the 2018 Kaggle Data Science Bowl cell nuclei segmentation contest with shorter inference time. Our results preliminarily demonstrate the potential of the proposed U-Net+ in correctly spotting microscopy cell nuclei with resource-constrained computing.

    Updated: 2020-01-09
  • Multiset sparse partial least squares path modeling for high dimensional omics data analysis
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-09
    Attila Csala; Aeilko H. Zwinderman; Michel H. Hof

    Recent technological developments have enabled the measurement of a plethora of biomolecular data from various omics domains, and research is ongoing on statistical methods to leverage these omics data to better model and understand biological pathways and genetic architectures of complex phenotypes. Current reviews report that the simultaneous analysis of multiple (i.e. three or more) high dimensional omics data sources is still challenging and suitable statistical methods are unavailable. Often-mentioned challenges are the lack of accounting for the hierarchical structure between omics domains and the difficulty of interpretation of genomewide results. This study is motivated to address these challenges. We propose multiset sparse Partial Least Squares path modeling (msPLS), a generalized penalized form of Partial Least Squares path modeling, for the simultaneous modeling of biological pathways across multiple omics domains. msPLS simultaneously models the effect of multiple molecular markers, from multiple omics domains, on the variation of multiple phenotypic variables, while accounting for the relationships between data sources, and provides sparse results. The sparsity in the model helps to provide interpretable results from analyses of hundreds of thousands of biomolecular variables. With simulation studies, we quantified the ability of msPLS to discover associated variables among high dimensional data sources. Furthermore, we analysed high dimensional omics datasets to explore biological pathways associated with Marfan syndrome and with Chronic Lymphocytic Leukaemia. Additionally, we compared the results of msPLS to the results of Multi-Omics Factor Analysis (MOFA), which is an alternative method to analyse this type of data. msPLS is a multiset multivariate method for the integrative analysis of multiple high dimensional omics data sources. It accounts for the relationship between multiple high dimensional data sources while it provides interpretable results through its sparse solutions. The biomarkers found by msPLS in the omics datasets can be interpreted in terms of biological pathways associated with the pathophysiology of Marfan syndrome and of Chronic Lymphocytic Leukaemia. Additionally, msPLS outperforms MOFA in terms of variation explained in the chronic lymphocytic leukaemia dataset while it identifies the two most important clinical markers for Chronic Lymphocytic Leukaemia. msPLS is available at http://uva.csala.me/mspls and https://github.com/acsala/2018_msPLS

    Updated: 2020-01-09
  • DeepECA: an end-to-end learning framework for protein contact prediction from a multiple sequence alignment
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-09
    Hiroyuki Fukuda; Kentaro Tomii

    Recently developed methods of protein contact prediction, a crucially important step for protein structure prediction, depend heavily on deep neural networks (DNNs) and multiple sequence alignments (MSAs) of target proteins. Protein sequences are accumulating to an increasing degree such that abundant sequences to construct an MSA of a target protein are readily obtainable. Nevertheless, many cases present different ends of the number of sequences that can be included in an MSA used for contact prediction. The abundant sequences might degrade prediction results, but opportunities remain for a limited number of sequences to construct an MSA. To resolve these persistent issues, we strove to develop a novel framework using DNNs in an end-to-end manner for contact prediction. We developed neural network models to improve precision of both deep and shallow MSAs. Results show that higher prediction accuracy was achieved by assigning weights to sequences in a deep MSA. Moreover, for shallow MSAs, adding a few sequential features was useful to increase the prediction accuracy of long-range contacts in our model. Based on these models, we expanded our model to a multi-task model to achieve higher accuracy by incorporating predictions of secondary structures and solvent-accessible surface areas. Moreover, we demonstrated that ensemble averaging of our models can raise accuracy. Using past CASP target protein domains, we tested our models and demonstrated that our final model is superior to or equivalent to existing meta-predictors. The end-to-end learning framework we built can use information derived from either deep or shallow MSAs for contact prediction. Recently, an increasing number of protein sequences have become accessible, including metagenomic sequences, which might degrade contact prediction results. Under such circumstances, our model can provide a means to reduce noise automatically. According to results of tertiary structure prediction based on contacts and secondary structures predicted by our model, more accurate three-dimensional models of a target protein are obtainable than those from existing ECA methods, starting from its MSA. DeepECA is available from https://github.com/tomiilab/DeepECA.

    Updated: 2020-01-09
  • Integrative analysis of time course metabolic data and biomarker discovery
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-09
    Takoua Jendoubi; Timothy M. D. Ebbels

    Metabolomics time-course experiments provide the opportunity to understand the changes to an organism by observing the evolution of metabolic profiles in response to internal or external stimuli. Along with other omic longitudinal profiling technologies, these techniques have great potential to uncover complex relations between variations across diverse omic variables and provide unique insights into the underlying biology of the system. However, many statistical methods currently used to analyse short time-series omic data i) are prone to overfitting, ii) do not fully take into account the experimental design, iii) do not make full use of the multivariate information intrinsic to the data, or iv) are unable to uncover multiple associations between different omic data. The model we propose is an attempt to i) overcome overfitting by using a weakly informative Bayesian model, ii) capture experimental design conditions through a mixed-effects model, iii) model interdependencies between variables by augmenting the mixed-effects model with a conditional auto-regressive (CAR) component and iv) identify potential associations between heterogeneous omic variables by using a horseshoe prior. We assess the performance of our model on synthetic and real datasets and show that it can outperform comparable models for metabolomic longitudinal data analysis. In addition, our proposed method provides the analyst with new insights into the data as it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We also show through simulation that our model is fairly robust against inaccuracies in metabolite assignments. On real data, we demonstrate that the number of profiled metabolites slightly affects the predictive ability of the model. Our single-model approach to longitudinal analysis of metabolomics data simultaneously provides integrative analysis and biomarker discovery. In addition, it lends itself to better interpretation by allowing analysis at the pathway level. An accompanying R package for the model has been developed using the probabilistic programming language Stan. The package offers user-friendly functions for simulating data, fitting the model, assessing model fit and postprocessing the results. The main aim of the R package is to offer freely accessible resources for integrative longitudinal analysis to metabolomics scientists, along with easy-to-use visualization functions that help applied researchers interpret results.

    Updated: 2020-01-09
  • Optimization and expansion of non-negative matrix factorization
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-06
    Xihui Lin; Paul C. Boutros

    Non-negative matrix factorization (NMF) is a technique widely used in various fields, including artificial intelligence (AI), signal processing and bioinformatics. However, due to their slow convergence, existing algorithms and R packages cannot be applied to large matrices, nor can they handle matrices with missing entries. In addition, most NMF research focuses only on blind decompositions: decomposition without utilizing prior knowledge. Finally, the lack of well-validated methodology for choosing the rank hyperparameters also raises concerns about derived results. We adapt the idea of sequential coordinate-wise descent to NMF to increase the convergence rate. We demonstrate that NMF can handle missing values naturally and this property leads to a novel method to determine the rank hyperparameter. Further, we demonstrate some novel applications of NMF and show how to use masking to inject prior knowledge and desirable properties to achieve a more meaningful decomposition. We show through complexity analysis and experiments that our implementation converges faster than well-known methods. We also show that using NMF for tumour content deconvolution can achieve results similar to existing methods like ISOpure. Our proposed missing value imputation is more accurate than conventional methods like multiple imputation and comparable to missForest while achieving significantly better computational efficiency. Finally, we argue that the suggested rank tuning method based on missing value imputation is theoretically superior to existing methods. All algorithms are implemented in the R package NNLM, which is freely available on CRAN and GitHub.

    Updated: 2020-01-06 (an illustrative code sketch follows this entry)
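    An illustrative numpy sketch of two of the ideas above: multiplicative NMF updates restricted to observed entries (a masked factorization that tolerates missing values), and rank selection by the imputation error on deliberately held-out entries. It is not the NNLM package's sequential coordinate-wise descent; the data and parameters are toy assumptions.

```python
# Masked NMF with missing entries, and rank selection via held-out imputation error.
import numpy as np

rng = np.random.default_rng(2)

def masked_nmf(A, mask, k, n_iter=500, eps=1e-9):
    """Multiplicative updates that only use the observed (mask == True) entries."""
    m, n = A.shape
    W, H = rng.uniform(0.1, 1, (m, k)), rng.uniform(0.1, 1, (k, n))
    A0 = np.where(mask, A, 0.0)
    for _ in range(n_iter):
        H *= (W.T @ A0) / (W.T @ (mask * (W @ H)) + eps)
        W *= (A0 @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H

# Noisy rank-3 matrix with 10% of the entries held out for validation
A = rng.uniform(0, 1, (50, 3)) @ rng.uniform(0, 1, (3, 40)) + 0.05 * rng.uniform(size=(50, 40))
holdout = rng.uniform(size=A.shape) < 0.1
mask = ~holdout

for k in (1, 2, 3, 4, 5):
    W, H = masked_nmf(A, mask, k)
    err = np.sqrt(np.mean((A - W @ H)[holdout] ** 2))   # imputation error on held-out entries
    print(f"rank {k}: held-out RMSE {err:.4f}")          # error drops until the true rank, then levels off
```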
  • Conserved genomic neighborhood is a strong but not perfect indicator for a direct interaction of microbial gene products
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-03
    Robert Esch; Rainer Merkl

    The order of genes in bacterial genomes is not random; for example, the products of genes belonging to an operon work together in the same pathway. The cotranslational assembly of protein complexes is deemed to conserve genomic neighborhoods even more strongly than a common function does. This is why a conserved genomic neighborhood can be utilized to predict whether gene products form protein complexes. We were interested in assessing the performance of a neighborhood-based classifier that analyzes a large number of genomes. Thus, we determined the local genomic context of the genes encoding the subunits of 494 experimentally verified hetero-dimers. In order to generate phylogenetically comprehensive genomic neighborhoods, we utilized the tools offered by the Enzyme Function Initiative. For each subunit, a sequence similarity network was generated and the corresponding genome neighborhood network was analyzed to deduce the most frequent gene product. This was predicted as the interaction partner if its abundance exceeded a threshold, which was the frequency giving rise to the maximal Matthews correlation coefficient. For the threshold of 16%, the true positive rate was 45%, the false positive rate 0.06%, and the precision 55%. For approximately 20% of the subunits, the interaction partner was not found in a neighborhood of ± 10 genes. Our phylogenetically comprehensive analysis confirmed that complex formation is a strong evolutionary factor that conserves genome neighborhoods. On the other hand, for 55% of the cases analyzed here, classification failed. Either the interaction partner was not present in a ± 10 gene window, or it was not the most frequent gene product.

    Updated: 2020-01-04 (an illustrative code sketch follows this entry)
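    A small sketch of the frequency-threshold classifier described above, on hypothetical neighborhoods: the most frequent gene product within the ±10-gene windows across genomes is predicted as the interaction partner if it appears in more than 16% of the genomes. Gene names and windows are placeholders, not data from the study.

```python
# Frequency-threshold prediction of an interaction partner from genomic neighborhoods.
from collections import Counter

def predict_partner(neighborhoods, threshold=0.16):
    """neighborhoods: one list of neighboring gene products per genome analysed."""
    counts = Counter()
    for window in neighborhoods:
        counts.update(set(window))            # count each product at most once per genome
    product, n = counts.most_common(1)[0]
    frequency = n / len(neighborhoods)
    return (product, frequency) if frequency > threshold else (None, frequency)

neighborhoods = [
    ["trpA", "trpB", "hisC"],                 # window around the query gene in genome 1
    ["trpB", "aroA"],                         # genome 2
    ["trpB", "hisC", "pheA"],                 # genome 3
    ["ilvA"],                                 # genome 4
]
print(predict_partner(neighborhoods))         # ('trpB', 0.75) -> predicted partner
```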
  • Evolving knowledge graph similarity for supervised learning in complex biomedical domains
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-03
    Rita T. Sousa; Sara Silva; Catia Pesquita

    In recent years, biomedical ontologies have become important for describing existing biological knowledge in the form of knowledge graphs. Data mining approaches that work with knowledge graphs have been proposed, but they are based on vector representations that do not capture the full underlying semantics. An alternative is to use machine learning approaches that explore semantic similarity. However, since ontologies can model multiple perspectives, semantic similarity computations for a given learning task need to be fine-tuned to account for this. Obtaining the best combination of semantic similarity aspects for each learning task is not trivial and typically depends on expert knowledge. We have developed a novel approach, evoKGsim, that applies Genetic Programming over a set of semantic similarity features, each based on a semantic aspect of the data, to obtain the best combination for a given supervised learning task. The approach was evaluated on several benchmark datasets for protein-protein interaction prediction using the Gene Ontology as the knowledge graph to support semantic similarity, and it outperformed competing strategies, including manually selected combinations of semantic aspects emulating expert knowledge. evoKGsim was also able to learn species-agnostic models with different combinations of species for training and testing, effectively addressing the limitations of predicting protein-protein interactions for species with fewer known interactions. evoKGsim can overcome one of the limitations in knowledge graph-based semantic similarity applications: the need to expertly select which aspects should be taken into account for a given application. Applying this methodology to protein-protein interaction prediction proved successful, paving the way to broader applications.

    Updated: 2020-01-04
  • Reconstruction and analysis of a carbon-core metabolic network for Dunaliella salina
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-02
    Melanie Fachet; Carina Witte; Robert J. Flassig; Liisa K. Rihko-Struckmann; Zaid McKie-Krisberg; Jürgen E. W. Polle; Kai Sundmacher

    The green microalga Dunaliella salina accumulates a high proportion of β-carotene during abiotic stress conditions. To better understand the intracellular flux distribution leading to carotenoid accumulation, this work aimed at reconstructing a carbon core metabolic network for D. salina CCAP 19/18 based on the recently published nuclear genome and its validation with experimental observations and literature data. The reconstruction resulted in a network model with 221 reactions and 212 metabolites within three compartments: cytosol, chloroplast and mitochondrion. The network was implemented in the MATLAB toolbox CellNetAnalyzer and checked for feasibility. Furthermore, a flux balance analysis was carried out for different light and nutrient uptake rates. The comparison of the experimental knowledge with the model prediction revealed that the results of the stoichiometric network analysis are plausible and in good agreement with the observed behavior. Accordingly, our model provides an excellent tool for investigating the carbon core metabolism of D. salina. The reconstructed metabolic network of D. salina presented in this work is able to predict the biological behavior under light and nutrient stress and will lead to an improved process understanding for the optimized production of high-value products in microalgae.

    Updated: 2020-01-02
  • Bayesian mixture regression analysis for regulation of Pluripotency in ES cells
    BMC Bioinform. (IF 2.213) Pub Date : 2020-01-02
    Mehran Aflakparast; Geert Geeven; Mathisca C.M. de Gunst

    Observed levels of gene expression strongly depend on both the activity of DNA-binding transcription factors (TFs) and the chromatin state through different histone modifications (HMs). In order to recover the functional relationship between local chromatin state, TF binding and observed levels of gene expression, regression methods have proven to be useful tools. They have been successfully applied to predict mRNA levels from genome-wide experimental data and they provide insight into context-dependent gene regulatory mechanisms. However, heterogeneity arising from gene-set specific regulatory interactions is often overlooked. We show that regression models that predict gene expression by using experimentally derived ChIP-seq profiles of TFs can be significantly improved by mixture modelling. In order to find biologically relevant gene clusters, we employ a Bayesian allocation procedure which allows us to integrate additional biological information such as three-dimensional nuclear organization of chromosomes and gene function. The data integration procedure involves transforming the additional data into gene similarity values. We propose a generic similarity measure that is especially suitable for situations where the additional data are of both continuous and discrete type, and compare its performance with similar measures in the context of mixture modelling. We applied the proposed method to data from mouse embryonic stem cells (ESCs). We find that including additional data results in mixture components that exhibit biologically meaningful gene clusters, and provides valuable insight into the heterogeneity of the regulatory interactions.

    Updated: 2020-01-02
  • LMAP_S: Lightweight Multigene Alignment and Phylogeny eStimation
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Emanuel Maldonado; Agostinho Antunes

    Recent advances in genome sequencing technologies and the cost drop in high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when considering extensive amounts of data. Thus, it becomes necessary to develop a tool that removes potential sources of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations. We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package, designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing, in which file and directory organization, execution and manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software proved to be efficient throughout the workflow, including the (unlimited) handling of more than 20 datasets. We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate MSAs and PTs for multiple datasets in a high-throughput fashion. LMAP_S integrates more than 25 software packages, providing overall more than 65 algorithm choices distributed across five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides the means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. The LMAP_S package is released under the GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.

    Updated: 2019-12-31
  • ZDOG: zooming in on dominating genes with mutations in cancer pathways
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Rudi Alberts; Jinyu Chen; Louxin Zhang

    Inference of cancer-causing genes and their biological functions are crucial but challenging due to the heterogeneity of somatic mutations. The heterogeneity of somatic mutations reveals that only a handful of oncogenes mutate frequently and a number of cancer-causing genes mutate rarely. We develop a Cytoscape app, named ZDOG, for visualization of the extent to which mutated genes may affect cancer pathways using the dominating tree model. The dominator tree model allows us to examine conveniently the positional importance of a gene in cancer signalling pathways. This tool facilitates the identification of mutated “master” regulators even with low mutation frequency in deregulated signalling pathways. We have presented a model for facilitating the examination of the extent to which mutation in a gene may affect downstream components in a signalling pathway through its positional information. The model is implemented in a user-friendly Cytoscape app which will be freely available upon publication. Together with a user manual, the ZDOG app is freely available at GitHub (https://github.com/rudi2013/ZDOG). It is also available in the Cytoscape app store (http://apps.cytoscape.org/apps/ZDOG) and users can easily install it using the Cytoscape App Manager.

    Updated: 2019-12-31 (an illustrative code sketch follows this entry)
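    A sketch of the dominator-tree idea underlying ZDOG, built with networkx rather than the Cytoscape app: a gene dominates a downstream component if every path from the pathway input passes through it, so mutations in dominators are positionally important even when they are rare. The toy pathway and gene names are illustrative assumptions.

```python
# Dominator tree of a toy directed signalling pathway.
import networkx as nx

P = nx.DiGraph([("receptor", "RAS"), ("RAS", "RAF"), ("RAS", "PI3K"),
                ("RAF", "ERK"), ("PI3K", "AKT"), ("ERK", "MYC"), ("AKT", "MYC")])

# Immediate dominators with the pathway input as the entry node
idom = nx.immediate_dominators(P, "receptor")
dom_tree = nx.DiGraph((d, n) for n, d in idom.items() if n != d)

for gene in ("RAS", "RAF"):
    downstream = nx.descendants(dom_tree, gene)   # everything a mutation in `gene` can cut off
    print(f"{gene} dominates: {sorted(downstream)}")
```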
  • The H3ABioNet helpdesk: an online bioinformatics resource, enhancing Africa’s capacity for genomics research
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Judit Kumuthini; Lyndon Zass; Sumir Panji; Samson P. Salifu; Jonathan K. Kayondo; Victoria Nembaware; Mamana Mbiyavanga; Ajayi Olabode; Ali Kishk; Gordon Wells; Nicola J. Mulder

    Currently, formal mechanisms for bioinformatics support are limited. The H3Africa Bioinformatics Network has implemented a public and freely available Helpdesk (HD), which provides generic bioinformatics support to researchers through an online ticketing platform. The following article reports on the H3ABioNet HD (H3A-HD)’s development, outlining its design, management, usage and evaluation framework, as well as the lessons learned through implementation. The H3A-HD was evaluated using automatically generated usage logs, user feedback and qualitative ticket evaluation. Evaluation revealed that communication methods, ticketing strategies and the technical platforms used are some of the primary factors which may influence the effectiveness of the HD. To continuously improve the H3A-HD services, the resource should be regularly monitored and evaluated. The H3A-HD design, implementation and evaluation framework could be easily adapted for use by interested stakeholders within the Bioinformatics community and beyond.

    Updated: 2019-12-31
  • Alignment-free genomic sequence comparison using FCGR and signal processing
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Daniel Lichtblau

    Alignment-free methods of genomic comparison offer the possibility of scaling to large data sets of nucleotide sequences comprised of several thousand or more base pairs. Such methods can be used for purposes of deducing “nearby” species in a reference data set, or for constructing phylogenetic trees. We describe one such method that gives quite strong results. We use the Frequency Chaos Game Representation (FCGR) to create images from such sequences. We then reduce dimension, first using a trigonometric Fourier transform, followed by a Singular Value Decomposition (SVD). This gives vectors of modest length. These in turn are used for fast sequence lookup, construction of phylogenetic trees, and classification of virus genomic data. We illustrate the accuracy and scalability of this approach on several benchmark test sets. The tandem of FCGR and dimension reductions using Fourier-type transforms and SVD provides a powerful approach for alignment-free genomic comparison. Results compare favorably with, and often surpass, the best results reported in prior literature. Good scalability is also observed.

    Updated: 2019-12-31 (an illustrative code sketch follows this entry)
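    A compact numpy sketch of the pipeline described above: build a frequency chaos game representation (FCGR) of k-mer counts, apply a Fourier-type transform to each FCGR image, and reduce dimension with an SVD across sequences. The k-mer-to-cell mapping, the use of an FFT instead of a trigonometric transform, and the toy sequences are simplifying assumptions, not the paper's exact recipe.

```python
# FCGR image -> Fourier-type transform -> SVD, giving a short vector per sequence.
import numpy as np

K = 3
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq, k=K):
    """Count k-mers on a 2^k x 2^k chaos-game-style grid (one simple cell assignment)."""
    grid = np.zeros((2 ** k, 2 ** k))
    for i in range(len(seq) - k + 1):
        x = y = 0
        for base in seq[i:i + k]:
            cx, cy = CORNERS.get(base, (0, 0))
            x, y = x * 2 + cx, y * 2 + cy        # descend into the quadrant of each base
        grid[x, y] += 1
    return grid / max(grid.sum(), 1)

seqs = ["ACGTACGTGGCA" * 5, "ACGTACGAGGCA" * 5, "TTTTGCGCATAT" * 5]
images = [fcgr(s) for s in seqs]

spectra = np.array([np.abs(np.fft.fft2(img)).ravel() for img in images])
U, S, Vt = np.linalg.svd(spectra - spectra.mean(axis=0), full_matrices=False)
vectors = U[:, :2] * S[:2]                        # 2-D representation of each sequence
print(np.round(vectors, 3))
```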
  • DynaVenn: web-based computation of the most significant overlap between ordered sets
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Jérémy Amand; Tobias Fehlmann; Christina Backes; Andreas Keller

    In many research disciplines, ordered lists are compared. One example is to compare a subset of all significant genes or proteins in a primary study to those in a replication study. Often, the tops of the lists are compared using Venn diagrams, or, more precisely, Euler diagrams (set diagrams showing logical relations between a finite collection of different sets). If different cohort sizes or different techniques or algorithms for evaluation were applied, however, a direct comparison of significant genes at a fixed threshold can be misleading, and approaches comparing lists would be more appropriate. We developed DynaVenn, a web-based tool that incrementally creates all possible subsets from two or three ordered lists and computes a p-value for the overlap of each combination. Corresponding dynamic Venn diagrams are generated as graphical representations. Additionally, an animation is generated showing how the most significant overlap is reached by backtracking. We demonstrate the improved performance of DynaVenn over an arbitrary cut-off approach on an Alzheimer’s Disease biomarker set. DynaVenn combines the calculation of the most significant overlap of different cohorts with an intuitive visualization of the results. It is freely available as a web service at http://www.ccb.uni-saarland.de/dynavenn.

    Updated: 2019-12-31 (an illustrative code sketch follows this entry)
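    A sketch of the core computation behind such a tool: scan all prefix pairs of two ranked lists and report the pair of cut-offs whose overlap is most significant under a hypergeometric test. The gene lists and universe size are invented; DynaVenn itself also handles three lists and produces the animated diagrams.

```python
# Most significant overlap over all prefix pairs of two ranked lists.
from scipy.stats import hypergeom

def best_overlap(list_a, list_b, universe_size):
    best = (1.0, 0, 0)
    for i in range(1, len(list_a) + 1):
        top_a = set(list_a[:i])
        for j in range(1, len(list_b) + 1):
            k = len(top_a & set(list_b[:j]))
            # P(overlap >= k) when drawing j items from a universe with i "successes"
            p = hypergeom.sf(k - 1, universe_size, i, j)
            if p < best[0]:
                best = (p, i, j)
    return best

a = ["g1", "g2", "g3", "g4", "g5"]
b = ["g2", "g1", "g7", "g3", "g9"]
p, i, j = best_overlap(a, b, universe_size=20000)
print(f"most significant overlap: top {i} vs top {j}, p = {p:.2e}")
```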
  • ki67 nuclei detection and ki67-index estimation: a novel automatic approach based on human vision modeling
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Barbara Rita Barricelli; Elena Casiraghi; Jessica Gliozzo; Veronica Huber; Biagio Eugenio Leone; Alessandro Rizzi; Barbara Vergani

    The protein ki67 (pki67) is a marker of tumor aggressiveness, and its expression has been proven to be useful in the prognostic and predictive evaluation of several types of tumors. To numerically quantify the pki67 presence in cancerous tissue areas, pathologists generally analyze histochemical images to count the number of tumor nuclei marked for pki67. This allows estimating the ki67-index, that is the percentage of tumor nuclei positive for pki67 over all the tumor nuclei. Given the high image resolution and dimensions, its estimation by expert clinicians is particularly laborious and time consuming. Though automatic cell counting techniques have been presented so far, the problem is still open. In this paper we present a novel automatic approach for the estimations of the ki67-index. The method starts by exploiting the STRESS algorithm to produce a color enhanced image where all pixels belonging to nuclei are easily identified by thresholding, and then separated into positive (i.e. pixels belonging to nuclei marked for pki67) and negative by a binary classification tree. Next, positive and negative nuclei pixels are processed separately by two multiscale procedures identifying isolated nuclei and separating adjoining nuclei. The multiscale procedures exploit two Bayesian classification trees to recognize positive and negative nuclei-shaped regions. The evaluation of the computed results, both through experts’ visual assessments and through the comparison of the computed indexes with those of experts, proved that the prototype is promising, so that experts believe in its potential as a tool to be exploited in the clinical practice as a valid aid for clinicians estimating the ki67-index. The MATLAB source code is open source for research purposes.

    Updated: 2019-12-30
  • Improvement of the memory function of a mutual repression network in a stochastic environment by negative autoregulation
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    A. B. M. Shamim Ul Hasan; Hiroyuki Kurata; Sebastian Pechmann

    Cellular memory is a ubiquitous function of biological systems. By generating a sustained response to a transient inductive stimulus, often due to bistability, memory is central to the robust control of many important biological processes. However, our understanding of the origins of cellular memory remains incomplete. Stochastic fluctuations that are inherent to most biological systems have been shown to hamper memory function. Yet, how stochasticity changes the behavior of genetic circuits is generally not clear from a deterministic analysis of the network alone. Here, we apply deterministic rate equations, stochastic simulations, and theoretical analyses of Fokker-Planck equations to investigate how intrinsic noise affects the memory function in a mutual repression network. We find that the addition of negative autoregulation improves the persistence of memory in a small gene regulatory network by reducing stochastic fluctuations. Our theoretical analyses reveal that this improved memory function stems from an increased stability of the steady states of the system. Moreover, we show how the tuning of critical network parameters can further enhance memory. Our work illuminates the power of stochastic and theoretical approaches to understanding biological circuits, and the importance of considering stochasticity when designing synthetic circuits with memory function.

    Updated: 2019-12-30 (an illustrative code sketch follows this entry)
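    Deterministic rate equations for the motif studied above: a mutual repression switch in which each gene also weakly represses its own promoter. The parameter values and the exact form of the autoregulation term are illustrative assumptions; the paper's stochastic analysis (simulations and Fokker-Planck treatment) is not reproduced here.

```python
# Deterministic mutual repression switch with weak negative autoregulation.
import numpy as np
from scipy.integrate import solve_ivp

alpha, n, K, delta = 10.0, 2, 10.0, 1.0   # production, Hill coefficient, autoregulation scale, degradation

def switch(t, y):
    x1, x2 = y
    dx1 = alpha / (1 + x2 ** n) / (1 + x1 / K) - delta * x1   # repressed by partner and weakly by itself
    dx2 = alpha / (1 + x1 ** n) / (1 + x2 / K) - delta * x2
    return [dx1, dx2]

# Two different initial conditions settle into the two memory states of the bistable switch
for y0 in ([5.0, 0.1], [0.1, 5.0]):
    sol = solve_ivp(switch, (0, 50), y0, rtol=1e-8)
    print(np.round(sol.y[:, -1], 2))
```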
  • Biomedical named entity recognition using deep neural networks with contextual information
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Hyejin Cho; Hyunju Lee

    In biomedical text mining, named entity recognition (NER) is an important task used to extract information from biomedical articles. Previously proposed methods for NER are dictionary- or rule-based methods and machine learning approaches. However, these traditional approaches are heavily reliant on large-scale dictionaries, target-specific rules, or well-constructed corpora. These approaches to NER have been superseded by deep learning-based approaches that are independent of hand-crafted features. However, although such methods of NER employ additional conditional random fields (CRF) to capture important correlations between neighboring labels, they often do not incorporate all the contextual information from text into the deep learning layers. We propose herein an NER system for biomedical entities by incorporating n-grams with bi-directional long short-term memory (BiLSTM) and CRF; this system is referred to as contextual long short-term memory networks with CRF (CLSTM). We assess the CLSTM model on three corpora: the disease corpus of the National Center for Biotechnology Information (NCBI), the BioCreative II Gene Mention corpus (GM), and the BioCreative V Chemical Disease Relation corpus (CDR). Our framework was compared with several deep learning approaches, such as BiLSTM, BiLSTM with CRF, GRAM-CNN, and BERT. On the NCBI corpus, our model recorded an F-score of 85.68% for the NER of diseases, showing an improvement of 1.50% over previous methods. Moreover, although BERT used transfer learning by incorporating more than 2.5 billion words, our system showed performance similar to BERT, with an F-score of 81.44% for gene NER on the GM corpus and a superior F-score of 86.44% for the NER of chemicals and diseases on the CDR corpus. We conclude that our method significantly improves performance on biomedical NER tasks. The proposed approach is robust in recognizing biological entities in text.

    Updated: 2019-12-30
  • Identification of infectious disease-associated host genes using machine learning techniques
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Ranjan Kumar Barman; Anirban Mukhopadhyay; Ujjwal Maulik; Santasabuj Das

    With the global spread of multidrug resistance in pathogenic microbes, infectious diseases have emerged as a key public health concern of recent times. Identification of host genes associated with infectious diseases will improve our understanding of the mechanisms behind their development and help to identify novel therapeutic targets. We developed a machine learning-based classification approach to identify infectious disease-associated host genes by integrating sequence and protein interaction network features. Among different methods, a Deep Neural Network (DNN) model with 16 selected features based on pseudo-amino acid composition (PAAC) and network properties achieved the highest accuracy of 86.33%, with a sensitivity of 85.61% and specificity of 86.57%. The DNN classifier also attained an accuracy of 83.33% on a blind dataset and a sensitivity of 83.1% on an independent dataset. Furthermore, to predict unknown infectious disease-associated host genes, we applied the proposed DNN model to all reviewed proteins from the database. Seventy-six of the 100 most highly predicted infectious disease-associated genes from our study were also found in experimentally verified human-pathogen protein-protein interactions (PPIs). Finally, we validated the highly predicted infectious disease-associated genes by disease and gene ontology enrichment analysis and found that many of them are shared with one or more other diseases, such as cancer, metabolic and immune-related diseases. To the best of our knowledge, this is the first computational method to identify infectious disease-associated host genes. The proposed method will help large-scale prediction of host genes associated with infectious diseases. However, our results indicate that, for small datasets, the advanced DNN-based method does not offer a significant advantage over simpler supervised machine learning techniques, such as Support Vector Machines (SVM) or Random Forests (RF), for the prediction of infectious disease-associated host genes. The significant overlap of infectious disease with cancer and metabolic disease in the disease and gene ontology enrichment analysis suggests that these diseases perturb the functions of the same cellular signaling pathways and may be treated by drugs that tend to reverse these perturbations. Moreover, identification of novel candidate genes associated with infectious diseases would help us to explain disease pathogenesis further and to develop novel therapeutics.
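
    A minimal sketch of the classification setup described here, assuming a small dense network over 16 PAAC and network-derived features; the data, architecture and hyperparameters below are placeholders, not the authors' configuration.

```python
# Toy DNN classifier on 16 features; everything here is an illustrative stand-in.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 16))          # 16 selected PAAC + network features
y = rng.integers(0, 2, size=500)        # 1 = disease-associated host gene

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("5-fold CV accuracy: %.3f" % scores.mean())
```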

    Updated: 2019-12-30
  • The coming era of artificial intelligence in biological data science
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Henry Han; Wenbin Liu

    Biological data science is characterized by a massive amount of data from heterogeneous sources. How to decipher complex relationships among heterogeneous datasets remains an urgent challenge. Although traditional model-driven methods still play an important role in analyzing all kinds of data, they lack the capability to exploit the huge amount of available data, or even big data, to discover knowledge, predict data behaviors, and decipher complex relationships among data. Therefore, the data-driven approach has become the theme of biological data science for its capabilities in listening to data, interacting with data, and extracting knowledge from data. Modern artificial intelligence will dominate biological data science for its unprecedented learning capabilities in processing complex data. Compared to traditional AI techniques (e.g. automated reasoning), machine learning and deep learning are the core technologies that equip machines with intelligence. A deep learning machine has much more complicated learning topologies, which may change dynamically during learning, in addition to learning mechanisms at least as complex as those of traditional machine learning models such as support vector machines. Deep learning is good at discovering latent complex relationships among data and at handling big data. More importantly, deep learning merges feature extraction and prediction (e.g. classification) into a single learning procedure and makes feature extraction more adaptive and compatible with prediction. scRNA-seq, SNP, interactome, or even clinical data usually need very different but complicated feature extraction procedures before entering downstream learning. Deep learning is therefore a good candidate for processing those data, and it has started to make good progress in handling next-generation sequencing data. Artificial intelligence is expected to dominate biological data science in the near future as AI itself matures. Most state-of-the-art AI techniques originated from computer vision, image recognition, or natural language processing. It is not easy to migrate existing AI techniques to the biological data science field, though some efforts are being made. The special characteristics of the enormous data generated in biological data science call for building its own AI theory, methods, and systems. To some degree, the maturity of AI in biological data science will indicate the realization of precision medicine. This special issue aims to introduce AI techniques for bioinformatics, clinical, and health data. All papers included in this special issue have developed their own novel AI techniques in problem-solving. They range from a computational framework for disease-specific gene regulatory network detection to graph regularized low-rank representation for multi-cancer sample clustering, graph-Laplacian PCA, and more. In particular, one paper in this special issue is devoted to effectively detecting the clinical risk factors of portal vein system thrombosis (PVST) for splenectomy and cardia devascularization patients by building an SVM-based prediction system with novel feature extraction. It presents pioneering research work on this topic, though the results are not yet perfect. Nevertheless, it can inspire more future work on this rarely explored topic, for example by using more advanced deep learning techniques (e.g. novel few-shot learning) to extract high-level representative hidden features for clinical risk analysis.
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 22, 2019: Decipher computational analytics in digital health and precision medicine. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-22

    Updated: 2019-12-30
  • VariFAST: a variant filter by automated scoring based on tagged-signatures
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Hang Zhang; Ke Wang; Juan Zhou; Jianhua Chen; Yizhou Xu; Dong Wang; Xiaoqi Li; Renliang Sun; Mancang Zhang; Zhuo Wang; Yongyong Shi

    Variant calling and refinement from whole genome/exome sequencing data is a fundamental task for genomics studies. Due to the limited accuracy of NGS sequencing and variant callers, IGV-based manual review is required for further filtering of false positive variants, which costs substantial labor and time and results in high inter- and intra-lab variability. To overcome the limitations of manual review, we developed a novel approach for Variant Filter by Automated Scoring based on Tagged-signature (VariFAST), and also provide a pipeline integrating GATK Best Practices with VariFAST, which can be easily used for high-quality variant detection from raw data. Using the bam and vcf files, VariFAST calculates, for each variant, a v-score as the sum of weighted metrics that cause false positive variations, and marks tags in a manner that keeps high consistency with manual review. We validated the performance of VariFAST for germline variant filtering using the benchmark sequencing data from GIAB, and for somatic variant filtering using sequencing data of both malignant carcinomas and benign adenomas. VariFAST also includes a predictive model trained by the XGBOOST algorithm for germline variant refinement, which achieves better MCC and AUC than the state-of-the-art VQSR and particularly outcompetes it in INDEL variant filtering. VariFAST can assist researchers in filtering false positive variants, both germline and somatic, efficiently and conveniently in NGS data analysis. The VariFAST source code and the pipeline integrating with GATK Best Practices are available at https://github.com/bioxsjtu/VariFAST.
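
    The v-score idea can be illustrated with a toy weighted-sum sketch; the metric names, weights and the v_score helper below are hypothetical and stand in for VariFAST's actual tagged signatures.

```python
# Toy v-score: weighted sum of flags associated with false positives.
# Metric names and weights are invented for illustration only.
VSCORE_WEIGHTS = {
    "low_depth": 2.0,          # read depth below threshold
    "strand_bias": 1.5,        # supporting reads come from one strand
    "low_base_quality": 1.0,   # supporting bases have poor quality
    "repeat_region": 1.0,      # variant falls in a repetitive region
}

def v_score(variant_flags):
    """Sum the weights of all flags (tags) raised for a variant."""
    score = sum(VSCORE_WEIGHTS[f] for f in variant_flags)
    return score, sorted(variant_flags)

score, tags = v_score({"low_depth", "strand_bias"})
print(score, tags)   # 3.5 ['low_depth', 'strand_bias']
```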

    Updated: 2019-12-30
  • PEIS: a novel approach of tumor purity estimation by identifying information sites through integrating signal based on DNA methylation data
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Shudong Wang; Lihua Wang; Yuanyuan Zhang; Shanchen Pang; Xinzeng Wang

    Tumor purity plays an important role in understanding the pathogenic mechanism of tumors. The purity of tumor samples is highly sensitive to tumor heterogeneity, and because genetic and epigenetic data reflect intratumoral heterogeneity, they are well suited for studying tumor purity. Many purity estimation methods are based on copy number variation, gene expression and other data, while few use DNA methylation data, and those that do are often based on selected information sites. Consequently, how methylation sites are chosen as information sites has an important influence on the purity estimates. To date, the selection of information sites has often been based on differentially methylated sites that consider only the mean signal, without considering other possible signals or the strong correlation among adjacent sites. By integrating multiple signals and the strong correlation among adjacent sites, we propose an approach, PEIS, to estimate the purity of tumor samples by selecting informative differential methylation sites. Applied to 12 publicly available tumor datasets, PEIS provides accurate estimates of tumor purity that are highly consistent with other existing methods. Comparing the results of different information-site selection methods for evaluating tumor purity also shows that PEIS is superior to other methods. In summary, a new method to estimate the purity of tumor samples is proposed. This approach integrates multiple signals of the CpG sites and the correlation between the sites. Experimental analysis shows that this method is in good agreement with other existing methods for estimating tumor purity.

    Updated: 2019-12-30
  • An efficient gene selection method for microarray data based on LASSO and BPSO
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Ying Xiong; Qing-Hua Ling; Fei Han; Qing-Hua Liu

    The main goal of successful gene selection for microarray data is to find compact and predictive gene subsets that improve classification accuracy. Although a large pool of methods is available, selecting the optimal gene subset for accurate classification is still very challenging for the diagnosis and treatment of cancer. To obtain the most predictive gene subsets without filtering out critical genes, a gene selection method based on the least absolute shrinkage and selection operator (LASSO) and an improved binary particle swarm optimization (BPSO) is proposed in this paper. To avoid overfitting of LASSO, the initial gene pool is divided into clusters based on their structure. LASSO is then employed to select highly predictive genes and to calculate a contribution value that indicates each gene's sensitivity to the sample classes. With the second-level gene pool established by this double-filter strategy, an improved BPSO encoding the contribution information obtained from LASSO is used to perform gene selection. Moreover, from the perspective of the bit-change probability, a new mapping function is defined to guide the updating of the particles so that the improved BPSO selects the more predictive genes. With the compact gene pool obtained by the double-filter strategy, the improved BPSO can select the optimal gene subsets with high probability. Experimental results on several public microarray datasets with an extreme learning machine classifier verify the effectiveness of the proposed method compared to related methods.
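
    A minimal sketch of the LASSO filtering stage, assuming the absolute regression coefficients serve as the contribution values; the clustering and improved-BPSO stages are not reproduced, and the toy data are random.

```python
# LASSO-based gene filtering sketch; data and alpha are placeholders.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))            # samples x genes (toy microarray)
y = rng.integers(0, 2, size=100)            # class labels

lasso = Lasso(alpha=0.05).fit(X, y)
contribution = np.abs(lasso.coef_)          # genes' sensitivity to the classes
selected = np.nonzero(contribution > 0)[0]  # second-level gene pool for BPSO
print(len(selected), "genes kept by LASSO")
```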

    Updated: 2019-12-30
  • PCA via joint graph Laplacian and sparse constraint: Identification of differentially expressed genes and sample clustering on gene expression data
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Chun-Mei Feng; Yong Xu; Mi-Xiao Hou; Ling-Yun Dai; Jun-Liang Shang

    In recent years, identification of differentially expressed genes and sample clustering have become hot topics in bioinformatics. Principal Component Analysis (PCA) is a widely used method for gene expression data. However, it has two limitations: first, the geometric structure hidden in the data, e.g., the pair-wise distances between data points, is not explored, although this information can facilitate sample clustering; second, the Principal Components (PCs) determined by PCA are dense, which makes them hard to interpret, whereas only a few genes are related to cancer. Identifying a handful of differentially expressed genes and finding new cancer biomarkers is of great significance for the early diagnosis and treatment of cancer. In this study, a new method, gLSPCA, is proposed to integrate both graph Laplacian and sparse constraints into PCA. gLSPCA improves clustering accuracy by exploring the internal geometric structure of the data and, at the same time, identifies differentially expressed genes by imposing a sparsity constraint on the PCs. Experiments with gLSPCA and comparisons with existing methods, including Z-SPCA, GPower, PathSPCA, SPCArt and gLPCA, were performed on real datasets of both pancreatic cancer (PAAD) and head & neck squamous carcinoma (HNSC). The results demonstrate that gLSPCA is effective in identifying differentially expressed genes and in sample clustering. In addition, the applications of gLSPCA to these datasets provide several new clues for the exploration of causative factors of PAAD and HNSC.

    Updated: 2019-12-30
  • The exploration of disease-specific gene regulatory networks in esophageal carcinoma and stomach adenocarcinoma
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Guimin Qin; Luqiong Yang; Yuying Ma; Jiayan Liu; Qiuyan Huo

    Feed-forward loops (FFLs), consisting of miRNAs, transcription factors (TFs) and their common target genes, have been validated to be important for the initiation and development of complex diseases, including cancer. Esophageal Carcinoma (ESCA) and Stomach Adenocarcinoma (STAD) are two types of malignant tumors of the digestive tract. Understanding the common and distinct molecular mechanisms of ESCA and STAD is crucial. In this paper, we present a computational framework to explore common and distinct FFLs and molecular biomarkers for ESCA and STAD. We identified FFLs by combining regulation pairs and RNA-seq data. We then constructed disease-specific co-expression networks based on the identified FFLs and used random walk with restart (RWR) on these networks to prioritize candidate molecules. We identified 148 and 242 FFLs for the two types of cancer, respectively, and found that one TF, E2F3, was related to ESCA; two genes, DTNA and KCNMA1, were related to STAD; and one TF, ESR1, and one gene, KIT, were associated with both cancer types. The proposed computational framework predicted disease-related biomolecules effectively and revealed correlations between the two types of cancer, which may help develop diagnostic and therapeutic strategies for Esophageal Carcinoma and Stomach Adenocarcinoma.
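
    The random walk with restart used for prioritization follows the standard iteration p_{t+1} = (1 - r) W p_t + r p_0; the sketch below implements that generic formulation on a toy adjacency matrix, not the authors' networks.

```python
# Generic random-walk-with-restart (RWR) on a network; toy data for illustration.
import numpy as np

def rwr(adjacency, seeds, restart=0.5, tol=1e-8, max_iter=1000):
    """adjacency: (n, n) non-negative matrix; seeds: indices of seed nodes."""
    W = adjacency / adjacency.sum(axis=0, keepdims=True)   # column-normalize
    p0 = np.zeros(adjacency.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p            # stationary probabilities used as ranking scores

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwr(A, seeds=[0]))
```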

    Updated: 2019-12-30
  • Multi-cancer samples clustering via graph regularized low-rank representation method under sparse and symmetric constraints
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Juan Wang; Cong-Hai Lu; Jin-Xing Liu; Ling-Yun Dai; Xiang-Zhen Kong

    Identifying different types of cancer based on gene expression data has become a hotspot in bioinformatics research. Clustering gene expression data from multiple cancers into their respective classes is an important step. However, the high dimensionality, small sample sizes and noise of gene expression data make data mining and research difficult. Although there are many effective and feasible methods to deal with this problem, these methods still have limitations. In this paper, we propose the graph regularized low-rank representation under symmetric and sparse constraints (sgLRR) method, in which we introduce graph regularization based on manifold learning as well as symmetric and sparse constraints into the traditional low-rank representation (LRR). In sgLRR, the symmetric and sparse constraints alleviate the effect of noise in the raw data on the low-rank representation. Further, sgLRR preserves the important intrinsic local geometric structure of the raw data by introducing graph regularization. We apply this method to cluster multi-cancer samples based on gene expression data, which improves the clustering quality. First, the gene expression data are decomposed by the sgLRR method, yielding a lowest-rank representation matrix that is symmetric and sparse. Then, an affinity matrix is constructed from this representation and the multi-cancer samples are clustered with a spectral clustering algorithm, normalized cuts (Ncuts). A series of comparative experiments demonstrates that the sgLRR method, based on low-rank representation, has a clear advantage and remarkable performance in the clustering of multi-cancer samples.
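
    As a sketch of the final clustering step, the code below builds an affinity matrix from a symmetric representation matrix and applies spectral clustering; scikit-learn's SpectralClustering stands in for the Ncuts algorithm used in the paper, and the matrix Z is random.

```python
# Affinity construction + spectral clustering sketch; Z is a random stand-in
# for the symmetric, sparse lowest-rank representation produced by sgLRR.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
Z = rng.random((60, 60))
Z = (Z + Z.T) / 2                            # symmetric representation matrix

affinity = (np.abs(Z) + np.abs(Z.T)) / 2     # common |Z| + |Z^T| construction
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels[:10])
```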

    Updated: 2019-12-30
  • Protein sequence information extraction and subcellular localization prediction with gapped k-Mer method
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Yu-hua Yao; Ya-ping Lv; Ling Li; Hui-min Xu; Bin-bin Ji; Jing Chen; Chun Li; Bo Liao; Xu-ying Nan

    Subcellular localization prediction of proteins is an important component of bioinformatics and has great importance for drug design and other applications. A multitude of computational tools for protein subcellular localization have been developed in recent decades; however, existing methods differ in the protein sequence representation techniques and classification algorithms adopted. In this paper, we first introduce two kinds of protein sequence encoding schemes: dipeptide information with spaces and gapped k-mer information. We then introduce a gapped k-mer calculation method based on a quad-tree. The prediction results show that this method not only reduces the feature dimension but also improves the precision of protein subcellular localization prediction.
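
    A toy illustration of gapped k-mer counting for a protein sequence is given below; it uses a simple pair-with-gap definition and does not reproduce the quad-tree computation described in the paper.

```python
# Toy gapped 2-mer counting: residue pairs separated by a fixed gap.
from collections import Counter

def gapped_kmer_counts(sequence, gap=1):
    """Count residue pairs (a, b) such that b occurs gap+1 positions after a."""
    pairs = (sequence[i] + sequence[i + gap + 1]
             for i in range(len(sequence) - gap - 1))
    return Counter(pairs)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(gapped_kmer_counts(seq, gap=1).most_common(5))
```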

    Updated: 2019-12-30
  • A novel method detecting the key clinic factors of portal vein system thrombosis of splenectomy & cardia devascularization patients for cirrhosis & portal hypertension
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Mingzhao Wang; Linglong Ding; Meng Xu; Juanying Xie; Shengli Wu; Shengquan Xu; Yingmin Yao; Qingguang Liu

    Portal vein system thrombosis (PVST) is potentially fatal for patients if the diagnosis is not timely or the treatment is not proper. No technique has been available to detect the clinical risk factors that predict PVST after splenectomy in cirrhotic patients. The aim of this study is to detect the clinical risk factors of PVST for splenectomy and cardia devascularization patients with liver cirrhosis and portal hypertension, and to build an efficient model that predicts PVST from the detected risk factors by introducing machine learning methods. We collected 92 clinical indexes of splenectomy plus cardia devascularization patients with cirrhosis and portal hypertension, and proposed a novel algorithm named RFA-PVST (Risk Factor Analysis for PVST) to detect clinical risk indexes of PVST; we then built an SVM (support vector machine) predictive model on the detected risk factors. Accuracy, sensitivity, specificity, precision, F-measure, FPR (false positive rate), FNR (false negative rate), FDR (false discovery rate), AUC (area under the ROC curve) and MCC (Matthews correlation coefficient) were adopted to evaluate the predictive power of the detected risk factors. The proposed RFA-PVST algorithm was compared to mRMR, SVM-RFE, Relief, S-weight and LLEScore, and statistical tests were performed to verify the significance of RFA-PVST. Anticoagulant therapy and antiplatelet aggregation therapy are the top-2 clinical risk factors for PVST, followed by D-D (D-dimer), CHOL (cholesterol) and Ca (calcium). The SVM model built on the clinical indexes comprising anticoagulant therapy, antiplatelet aggregation therapy, RBC (red blood cell), D-D, CHOL, Ca, TT (thrombin time) and weight has good predictive capability for PVST. It achieved the highest PVST prediction accuracy of 0.89; the best sensitivity, specificity, precision, F-measure, FNR, FPR, FDR and MCC of 1, 0.75, 0.85, 0.92, 0, 0.25, 0.15 and 0.8, respectively; and a comparably good AUC value of 0.84. The statistical test results demonstrate a strongly significant difference between RFA-PVST and the compared algorithms, mRMR, SVM-RFE, Relief, S-weight and LLEScore; that is, the risk indicators detected by RFA-PVST are statistically significant. The proposed RFA-PVST algorithm can detect the clinical risk factors of PVST effectively and easily. Its main contribution is that it displays all clinical factors in a two-dimensional space, with independence and discernibility as the y-axis and x-axis, respectively; the clinical indexes in the top-right corner of this space are detected automatically as risk indicators. The predictive SVM model built on the detected clinical risk factors of PVST is powerful. Our study can help medical doctors make proper treatment decisions or early diagnoses for PVST patients, and it brings new ideas to the study of clinical treatment for other diseases as well.

    Updated: 2019-12-30
  • Modelling TERT regulation across 19 different cancer types based on the MIPRIP 2.0 gene regulatory network approach
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Alexandra M. Poos; Theresa Kordaß; Amol Kolte; Volker Ast; Marcus Oswald; Karsten Rippe; Rainer König

    Reactivation of the telomerase reverse transcriptase gene TERT is a central feature of the unlimited proliferation of the majority of cancers. However, the underlying regulatory processes are only partly understood. We assembled regulator binding information from several sources to construct a generic human and mouse gene regulatory network. Advancing our “Mixed Integer linear Programming based Regulatory Interaction Predictor” (MIPRIP) approach, we identified the most common and the cancer-type specific regulators of TERT across 19 different human cancers. The results were validated using the well-known regulation of TERT by the ETS1 transcription factor in a subset of melanomas with mutations in the TERT promoter. Our improved MIPRIP2 R package and the associated generic regulatory networks are freely available at https://github.com/KoenigLabNM/MIPRIP. MIPRIP 2.0 identified common as well as tumor-type specific regulators of TERT. The software can easily be applied to transcriptome datasets to predict gene regulation for any gene and disease/condition under investigation.

    Updated: 2019-12-30
  • Benchmarking the PEPOP methods for mimicking discontinuous epitopes
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-30
    Vincent Demolombe; Alexandre G. de Brevern; Franck Molina; Géraldine Lavigne; Claude Granier; Violaine Moreau

    Computational methods provide approaches to identify epitopes in protein Ags, helping to characterize potential biomarkers identified by high-throughput genomic or proteomic experiments. PEPOP version 1.0 was developed as an antigenic or immunogenic peptide prediction tool. We have now improved this tool by implementing 32 new methods (PEPOP version 2.0) to guide the choice of peptides that mimic discontinuous epitopes and are thus potentially able to replace the cognate protein Ag in its interaction with an Ab. In the present work, we describe these new methods and the benchmarking of their performance. Benchmarking was carried out by comparing the peptides predicted by the different methods with the corresponding epitopes determined by X-ray crystallography in a dataset of 75 Ag-Ab complexes. Sensitivity (Se) and Positive Predictive Value (PPV) were used to assess the performance of these methods, and the results were compared to those of peptides obtained either by chance or by using the SUPERFICIAL tool, the only comparable method available. The PEPOP methods were more efficient than, or as efficient as, chance, and 33 of the 34 PEPOP methods performed better than SUPERFICIAL. Overall, the “optimized” methods (tools that use the traveling salesman problem approach to design peptides) can predict peptides that best match true epitopes in most cases.
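
    The two benchmark metrics can be computed directly from residue sets, as in the sketch below; the residue lists are invented for illustration.

```python
# Sensitivity (Se) and Positive Predictive Value (PPV) from residue sets.
def se_ppv(predicted_residues, epitope_residues):
    predicted, epitope = set(predicted_residues), set(epitope_residues)
    tp = len(predicted & epitope)
    se = tp / len(epitope) if epitope else 0.0        # fraction of epitope covered
    ppv = tp / len(predicted) if predicted else 0.0   # fraction of peptide correct
    return se, ppv

print(se_ppv(predicted_residues=[10, 11, 12, 45, 46],
             epitope_residues=[11, 12, 13, 46, 80, 81]))   # (0.5, 0.6)
```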

    Updated: 2019-12-30
  • Old drug repositioning and new drug discovery through similarity learning from drug-target joint feature spaces
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Yi Zheng; Hui Peng; Xiaocai Zhang; Zhixun Zhao; Xiaoying Gao; Jinyan Li

    Detection of new drug-target interactions by computational algorithms is of crucial value to both old drug repositioning and new drug discovery. Existing machine-learning methods rely only on experimentally validated drug-target interactions (i.e., positive samples) for the predictions, and their performance is severely impeded by the lack of reliable negative samples. We propose a method to construct highly reliable negative samples for drug-target prediction by a pairwise drug-target similarity measurement and an OCSVM with a high-recall constraint. On the one hand, we measure the pairwise similarity between every two drug-target interactions by combining the chemical similarity between their drugs and the Gene Ontology-based similarity between their targets, and then calculate the accumulative similarity with all known drug-target interactions for each unobserved drug-target interaction. On the other hand, we obtain the signed distance from an OCSVM learned from the known interactions with high recall (≥0.95) for each unobserved drug-target interaction. After normalizing all accumulative similarities and signed distances to the range [0,1], we compute the score for each unobserved drug-target interaction by averaging its accumulative similarity and signed distance. Unobserved interactions with lower scores are preferentially selected as reliable negative samples for the classification algorithms. The performance of the proposed method is evaluated on the interaction data between 1094 drugs and 1556 target proteins. Extensive comparison experiments using four classical classifiers and one domain predictive method demonstrate the superior performance of the proposed method: a better decision boundary is learned from the constructed reliable negative samples. Proper construction of highly reliable negative samples can help classification models learn a clear decision boundary, which contributes to the performance improvement.
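
    A compact sketch of the scoring scheme, assuming cosine similarity as a stand-in for the chemical/GO-based similarities and a one-class SVM trained on known interactions; feature vectors and the nu setting are placeholders.

```python
# Reliable-negative scoring sketch: normalized accumulative similarity averaged
# with the normalized one-class SVM signed distance; lower score = more reliable
# negative. All inputs below are random placeholders.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
known = rng.normal(size=(200, 10))        # features of known drug-target pairs
unobserved = rng.normal(size=(500, 10))   # candidate (unlabeled) pairs

ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(known)   # nu chosen for high recall
signed = ocsvm.decision_function(unobserved)

def row_normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

# Accumulative similarity to all known pairs (cosine similarity as a stand-in).
acc_sim = (row_normalize(unobserved) @ row_normalize(known).T).sum(axis=1)

def rescale(v):                            # map values to [0, 1]
    return (v - v.min()) / (v.max() - v.min())

score = (rescale(acc_sim) + rescale(signed)) / 2
reliable_negatives = np.argsort(score)[:100]   # lowest-scoring candidates
print(reliable_negatives[:10])
```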

    Updated: 2019-12-27
  • Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Shifu Chen; Yanqing Zhou; Yaru Chen; Tanxiao Huang; Wenting Liao; Yun Xu; Zhicheng Li; Jia Gu

    Removing duplicates might be considered a well-resolved problem in next-generation sequencing (NGS) data processing. However, as NGS technology gains more recognition in clinical applications, researchers are paying more attention to its sequencing errors and prefer to remove these errors while performing deduplication. Recently, a technology called unique molecular identifiers (UMI) has been developed to better identify sequencing reads derived from different DNA fragments. Most existing duplicate-removal tools cannot handle UMI-integrated data; some modern tools can work with UMIs, but are usually slow and use too much memory. Furthermore, existing tools rarely report rich statistical results, which are very important for quality control and downstream analysis. These unmet requirements drove us to develop an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequencing-error suppression, able to handle UMIs and report informative results. This paper presents gencore, an efficient tool for duplicate removal and sequencing-error suppression in NGS data. The tool clusters the mapped sequencing reads and merges the reads in each cluster to generate a single consensus read. While the consensus read is generated, the random errors introduced by library construction and sequencing are removed. This error-suppressing feature makes gencore very suitable for detecting ultra-low-frequency mutations from deep sequencing data. When unique molecular identifier (UMI) technology is applied, gencore uses the UMIs to identify reads derived from the same original DNA fragment. Gencore reports statistical results in both HTML and JSON formats. The HTML report contains many interactive figures plotting coverage and duplication statistics; the JSON report contains all the statistical results and is interpretable by downstream programs. Compared to conventional tools like Picard and SAMtools, gencore greatly reduces the mapping mismatches of the output data, which are mostly caused by errors. Compared to newer tools like UMI-Reducer and UMI-tools, gencore runs much faster, uses less memory, generates better consensus reads and provides simpler interfaces. To the best of our knowledge, gencore is the only duplicate-removal tool that generates both informative HTML and JSON reports. This tool is available at: https://github.com/OpenGene/gencore
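
    The consensus step can be illustrated with a per-position majority vote over a read cluster, as below; real consensus callers such as gencore also use base qualities and UMI grouping, which this toy sketch omits.

```python
# Toy consensus-read generation by per-position majority vote.
from collections import Counter

def consensus(reads):
    """Reads are assumed to be aligned and of equal length."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

cluster = ["ACGTTGCA",
           "ACGTTGCA",
           "ACCTTGCA",    # one read carries a sequencing error at position 3
           "ACGTTGCA"]
print(consensus(cluster))  # ACGTTGCA
```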

    Updated: 2019-12-27
  • Identifying miRNA synergism using multiple-intervention causal inference
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Junpeng Zhang; Vu Viet Hoang Pham; Lin Liu; Taosheng Xu; Buu Truong; Jiuyong Li; Nini Rao; Thuc Duy Le

    Studying the synergism of multiple microRNAs (miRNAs) in gene regulation could help to understand the regulatory mechanisms of complicated human diseases caused by miRNAs. Several existing methods have been presented to infer miRNA synergism. Most of the current methods assume that miRNAs with shared targets at the sequence level are working synergistically. However, it is unclear whether miRNAs with shared targets work in concert to regulate the targets or regulate them individually at different time points or in different biological processes. A standard way to test synergistic activity is to knock down multiple miRNAs at the same time and measure the changes in the target genes. However, this approach may not be practical, as there would be too many sets of miRNAs to test. In this paper, we present a novel framework called miRsyn for inferring miRNA synergism by using a causal inference method that mimics multiple-intervention experiments, e.g. knocking down multiple miRNAs, with observational data. Our results show that several miRNA-miRNA pairs that have shared targets at the sequence level are not working synergistically at the expression level. Moreover, the identified miRNA synergistic network is small-world and biologically meaningful, and a number of miRNA synergistic modules are significantly enriched in breast cancer. Our further analyses also reveal that most synergistic miRNA-miRNA pairs show the same expression patterns. The comparison results indicate that the proposed multiple-intervention causal inference method performs better than single-intervention causal inference in identifying the miRNA synergistic network. Taken together, the results imply that miRsyn is a promising framework for identifying miRNA synergism and could enhance our understanding of miRNA synergism in breast cancer.

    Updated: 2019-12-27
  • Topological structure analysis of chromatin interaction networks
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Juris Viksna; Gatis Melkus; Edgars Celms; Kārlis Čerāns; Karlis Freivalds; Paulis Kikusts; Lelde Lace; Mārtiņš Opmanis; Darta Rituma; Peteris Rucevskis

    Current Hi-C technologies for chromosome conformation capture make it possible to understand a broad spectrum of functional interactions between genome elements. Although significant progress has been made in the analysis of Hi-C data to identify biologically significant features, many questions remain open, in particular regarding the potential biological significance of various topological features that are characteristic of chromatin interaction networks. It has previously been observed that promoter capture Hi-C (PCHi-C) interaction networks tend to separate easily into well-defined connected components that can be related to certain biological functionality; however, such evidence was based on manual analysis and was limited. Here we present a novel method for the analysis of chromatin interaction networks aimed at identifying characteristic topological features of interaction graphs and confirming their potential significance in chromatin architecture. Our method automatically identifies all connected components with an assigned significance score above a given threshold. These components can afterwards be subjected to different methods assessing their biological role and/or significance. The method was applied to the largest PCHi-C dataset available to date, containing interactions for 17 haematopoietic cell types. The results demonstrate strong evidence of a well-pronounced component structure of chromatin interaction networks and provide a characterisation of this component structure. We also performed an indicative assessment of the potential biological significance of the identified network components, which confirmed that the components can be related to specific biological functionality. The obtained results show that the topological structure of chromatin interaction networks can be well described in terms of isolated connected components of the network, and that the formation of these components can often be explained by biological features of functionally related gene modules. The presented method allows automatic identification of all such components and evaluation of their significance in the PCHi-C dataset for 17 haematopoietic cell types. The method can be adapted for the exploration of other chromatin interaction datasets that include information about a sufficiently large number of different cell types and, in principle, also for the analysis of other kinds of cell type-specific networks.
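
    A minimal sketch of the component analysis, assuming the significance score of a component is simply its mean edge weight (an illustrative stand-in for the paper's scoring):

```python
# Connected components of an interaction network filtered by a score threshold;
# the mean-edge-weight score is an assumption for illustration.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("p1", "e1", 5.2), ("p1", "e2", 7.0),
                           ("p2", "e3", 1.1), ("e3", "p3", 0.9)])

def significant_components(graph, threshold=2.0):
    for nodes in nx.connected_components(graph):
        sub = graph.subgraph(nodes)
        score = sum(w for _, _, w in sub.edges(data="weight")) / sub.number_of_edges()
        if score >= threshold:
            yield set(nodes), score

for comp, score in significant_components(G):
    print(comp, round(score, 2))
```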

    Updated: 2019-12-27
  • Transcription factor regulatory modules provide the molecular mechanisms for functional redundancy observed among transcription factors in yeast
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Tzu-Hsien Yang

    Current technologies for understanding transcriptional reprogramming in cells include transcription factor (TF) chromatin immunoprecipitation (ChIP) experiments and TF knockout experiments. ChIP experiments show the binding targets of the TF against which the antibody is directed, while knockout techniques find the regulatory gene targets of the knocked-out TF. However, it has been shown that these two complementary results contain few common targets. Researchers have used the concept of TF functional redundancy to explain the low overlap between these two techniques, but the detailed molecular mechanisms behind TF functional redundancy remain unknown. Without knowing the possible molecular mechanisms, it is hard for biologists to fully unravel the cause of TF functional redundancy. To mine these molecular mechanisms, a novel algorithm was devised in this research to extract TF regulatory modules that help explain the observed TF functional redundancy effect. The method first searches for candidate TF sets in the TF binding data. Based on these candidate sets, it then uses a modified Steiner tree construction algorithm to build possible TF regulatory modules from protein-protein interaction data, and finally filters out noise-induced results using confidence tests. The mined regulatory modules were shown to correlate with the concept of functional redundancy and provide testable hypotheses about the molecular mechanisms behind it. The biological significance of the mined results was demonstrated in three different aspects: ontology enrichment, protein interaction prevalence and expression coherence. About 23.5% of the mined TF regulatory modules were literature-verified. Finally, the biological applicability of the proposed method was shown in a detailed example of a verified TF regulatory module for pheromone response and filamentous growth in yeast. In this research, a novel method is proposed that mines the potential TF regulatory modules elucidating the functional redundancy observed among TFs. The extracted TF regulatory modules not only connect molecular mechanisms to the observed functional redundancy among TFs, but also show biological significance in inferring TF functional binding target genes. The results provide testable hypotheses for biologists to design subsequent research and experiments.

    Updated: 2019-12-27
  • Rearrangement analysis of multiple bacterial genomes
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Mehwish Noureen; Ipputa Tada; Takeshi Kawashima; Masanori Arita

    Genomes are subject to rearrangements that change the orientation and ordering of genes during evolution. The most common rearrangements in uni-chromosomal genomes are inversions (or reversals), which help organisms adapt to changing environments. Since genome rearrangements are rarer than point mutations, gene order combined with sequence data can facilitate more robust phylogenetic reconstruction. Helicobacter pylori is a good model because of its unique evolution in its niche environment. We have developed a method to identify genome rearrangements by comparing almost-conserved genes among closely related strains. Orthologous gene clusters, rather than the gene sequences, are used to align the gene order, so that comparison of a large number of genomes becomes easier. Comparison of 72 Helicobacter pylori strains revealed shared as well as strain-specific reversals, some of which were found in different geographical locations. The degree of genome rearrangement increases with time; therefore, gene orders can be used to study the evolutionary relationships among species and strains. Multiple genome comparison helps to identify strain-specific as well as shared reversals, and identification of the time course of rearrangements can provide insights into evolutionary events.

    Updated: 2019-12-27
  • Model-based cell clustering and population tracking for time-series flow cytometry data
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Kodai Minoura; Ko Abe; Yuka Maeda; Hiroyoshi Nishikawa; Teppei Shimamura

    Modern flow cytometry technology has enabled the simultaneous analysis of multiple cell markers at the single-cell level, and it is widely used in a broad range of research fields. The detection of cell populations in flow cytometry data has long depended on “manual gating” by visual inspection. Recently, numerous software tools have been developed for automatic, computationally guided detection of cell populations; however, they are not designed for time-series flow cytometry data. Time-series flow cytometry data are indispensable for investigating the dynamics of cell populations that cannot be elucidated by static single-time-point analysis. Therefore, there is a great need for tools to systematically analyze time-series flow cytometry data. We propose a simple and efficient statistical framework, named CYBERTRACK (CYtometry-Based Estimation and Reasoning for TRACKing cell populations), to perform clustering and cell population tracking for time-series flow cytometry data. CYBERTRACK assumes that flow cytometry data are generated from a multivariate Gaussian mixture distribution whose mixture proportions at the current time depend on those at a previous timepoint. Using simulation data, we evaluate the performance of CYBERTRACK in estimating the parameters of a multivariate Gaussian mixture distribution, tracking time-dependent transitions of mixture proportions, and detecting change-points in the overall mixture proportion. The performance of CYBERTRACK is validated using two real flow cytometry datasets, which demonstrate that the population dynamics detected by CYBERTRACK are consistent with our prior knowledge of lymphocyte behavior. Our results indicate that CYBERTRACK offers a better understanding of time-dependent cell population dynamics to cytometry users by systematically analyzing time-series flow cytometry data.

    Updated: 2019-12-27
  • iProDNA-CapsNet: identifying protein-DNA binding residues using capsule neural networks
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Binh P. Nguyen; Quang H. Nguyen; Giang-Nam Doan-Ngoc; Thanh-Hoang Nguyen-Vo; Susanto Rahardja

    Since protein-DNA interactions are highly essential to diverse biological events, accurately identifying the location of DNA-binding residues is necessary. This biological issue, however, remains a challenging task in the post-genomic age, in which protein sequence data have expanded very fast. In this study, we propose iProDNA-CapsNet, a new prediction model that identifies protein-DNA binding residues using an ensemble of capsule neural networks (CapsNets) on position-specific scoring matrix (PSSM) profiles. The use of CapsNets promises an innovative approach to determining the location of DNA-binding residues. In this study, the benchmark datasets introduced by Hu et al. (2017), i.e., PDNA-543 and PDNA-TEST, were used to train and evaluate the model, respectively. To fairly assess its performance, a comparative analysis between iProDNA-CapsNet and existing state-of-the-art methods was performed. Under the decision threshold corresponding to a false positive rate (FPR) of about 5%, the accuracy, sensitivity, precision, and Matthews correlation coefficient (MCC) of our model are increased by about 2.0%, 2.0%, 14.0%, and 5.0% with respect to TargetDNA (Hu et al., 2017) and by 1.0%, 75.0%, 45.0%, and 77.0% with respect to BindN+ (Wang et al., 2010), respectively. With regard to other methods that do not report their threshold settings, iProDNA-CapsNet also shows a significant improvement in performance on most of the evaluation metrics. Even with different patterns of change among the models, iProDNA-CapsNet remains the best model, with top performance in most of the metrics, especially MCC, which is improved by about 8.0% to 220.0%. According to all evaluation metrics under various decision thresholds, iProDNA-CapsNet shows better performance than the two current best models (BindN and TargetDNA). Our proposed approach also shows that CapsNets can potentially be used and adopted in other biological applications.

    Updated: 2019-12-27
  • Fast and accurate microRNA search using CNN
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Xubo Tang; Yanni Sun

    There are many different types of microRNAs (miRNAs), and elucidating their functions is still under intensive research. A fundamental step in the functional annotation of a new miRNA is to classify it into characterized miRNA families, such as those in Rfam and miRBase. With the accumulation of annotated miRNAs, it becomes possible to use deep learning-based models to classify different types of miRNAs. In this work, we investigate several key issues associated with the successful application of deep learning models to miRNA classification. First, as secondary structure conservation is a prominent feature of noncoding RNAs including miRNAs, we examine whether secondary structure-based encoding improves classification accuracy. Second, as there are many more non-miRNA sequences than miRNAs, instead of assigning a negative class to all non-miRNA sequences, we test whether the softmax output can distinguish in-distribution and out-of-distribution samples. Finally, we investigate whether deep learning models can correctly classify sequences from small miRNA families. We present our trained convolutional neural network (CNN) models for classifying miRNAs using different types of feature learning and encoding methods. In the first method, we explicitly encode the predicted secondary structure in a matrix. In the second method, we use only the primary sequence information and a one-hot encoding matrix. In addition, in order to reject sequences that should not be classified into the targeted miRNA families, we use a threshold derived from the softmax layer to exclude out-of-distribution sequences, an important feature for making this model useful for real transcriptomic data. The comparison with state-of-the-art ncRNA classification tools such as Infernal shows that our method can achieve comparable sensitivity and accuracy while being significantly faster. Automatic feature learning in CNNs can lead to better classification accuracy and sensitivity for miRNA classification and annotation. The trained models and associated code are freely available at https://github.com/HubertTang/DeepMir.
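
    The out-of-distribution rejection step can be sketched as a threshold on the maximum softmax probability, as below; the logits and threshold value are made up for illustration.

```python
# Reject a sequence when its maximum softmax probability falls below a threshold.
import numpy as np

def classify_with_rejection(logits, families, threshold=0.9):
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "out-of-distribution", probs[best]
    return families[best], probs[best]

families = ["mir-17", "let-7", "mir-30"]
print(classify_with_rejection(np.array([6.1, 1.2, 0.3]), families))  # confident call
print(classify_with_rejection(np.array([1.1, 1.0, 0.8]), families))  # rejected
```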

    Updated: 2019-12-27
  • Efficient computation of stochastic cell-size transient dynamics
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Cesar Augusto Nieto-Acuna; Cesar Augusto Vargas-Garcia; Abhyudai Singh; Juan Manuel Pedraza

    How small, fast-growing bacteria ensure tight cell-size distributions remains elusive. High-throughput measurement techniques have propelled efforts to build modeling tools that help shed light on the relationships between cell size, growth and cycle progression. Most proposed models describe cell division as a discrete map between size at birth and size at division, with stochastic fluctuations assumed. However, such models underestimate the role of cell-size transient dynamics by excluding them. We propose an efficient approach for the estimation of cell-size transient dynamics. Our technique approximates, with arbitrary precision, the transient size distribution and statistical moment dynamics of exponentially growing cells following an adder strategy. We approximate, up to arbitrary precision, the distribution of division times and sizes across time for the adder strategy in rod-shaped bacterial cells. Our approach can efficiently compute statistical moments, such as the mean size and its variance, from these distributions, showing a close match with numerical simulations. Additionally, we observed that these distributions have periodic properties. Our approach may further shed light on the mechanisms behind gene product homeostasis.
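
    For intuition, the sketch below runs a simple stochastic simulation of the adder strategy (grow exponentially, add a noisy fixed size increment, divide in half); the noise model and parameters are illustrative assumptions rather than the paper's analytical framework.

```python
# Toy simulation of the "adder" division strategy; all parameters are assumptions.
import numpy as np

def simulate_adder(n_cycles=10000, growth_rate=1.0, added_mean=1.0, cv=0.1, seed=0):
    rng = np.random.default_rng(seed)
    size = 1.0
    birth_sizes, division_times = [], []
    for _ in range(n_cycles):
        # Noisy size increment with mean added_mean and coefficient of variation cv.
        added = rng.gamma(shape=1 / cv**2, scale=added_mean * cv**2)
        division_size = size + added
        # Time to grow from birth size to division size under exponential growth.
        division_times.append(np.log(division_size / size) / growth_rate)
        size = division_size / 2.0          # symmetric division
        birth_sizes.append(size)
    return np.array(birth_sizes), np.array(division_times)

sizes, times = simulate_adder()
print("mean birth size %.3f, CV %.3f, mean division time %.3f"
      % (sizes.mean(), sizes.std() / sizes.mean(), times.mean()))
```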

    Updated: 2019-12-27
  • DeepMF: deciphering the latent patterns in omics profiles with a deep learning method
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Lingxi Chen; Jiao Xu; Shuai Cheng Li

    With recent advances in high-throughput technologies, matrix factorization techniques are increasingly being utilized for mapping quantitative omics profiling matrices into a low-dimensional embedding space, in the hope of uncovering insights into the underlying biological processes. Nevertheless, current matrix factorization tools fall short in handling noisy data and missing entries, both of which are often found in real-life data. Here, we propose DeepMF, a deep neural network-based factorization model. DeepMF disentangles the association between the molecular feature-associated and sample-associated latent matrices, and is tolerant to noisy and missing values. It exhibited feasible cancer subtype discovery efficacy on mRNA, miRNA, and protein profiles of medulloblastoma, leukemia, breast cancer, and small-blue-round-cell cancer, achieving the highest clustering accuracies of 76%, 100%, 92%, and 100%, respectively. When analyzing datasets with 70% missing entries, DeepMF gave the best recovery capacity, with silhouette values of 0.47, 0.6, 0.28, and 0.44, outperforming other state-of-the-art MF tools on the cancer datasets Medulloblastoma, Leukemia, TCGA BRCA, and SRBCT. Its embedding strength, as measured by clustering accuracy, is 88%, 100%, 84%, and 96% on these datasets, improving on the current best methods' 76%, 100%, 78%, and 87%. DeepMF demonstrated robust denoising, imputation, and embedding ability. It offers insights into the underlying biological processes, such as cancer subtype discovery. Our implementation of DeepMF can be found at https://github.com/paprikachan/DeepMF.

    Updated: 2019-12-27
  • IILLS: predicting virus-receptor interactions based on similarity and semi-supervised learning
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Cheng Yan; Guihua Duan; Fang-Xiang Wu; Jianxin Wang

    Viral infectious diseases are a serious threat to human health, and receptor binding is the first step of viral infection of a host. To treat human viral infectious diseases more effectively, hidden virus-receptor interactions must be discovered, yet current computational methods for predicting virus-receptor interactions are limited. In this study, we propose a new computational method (IILLS) to predict virus-receptor interactions based on Initial Interaction scores obtained via the neighbors and the Laplacian regularized Least Squares algorithm. IILLS integrates the known virus-receptor interactions with the amino acid sequences of receptors. The similarity of viruses is calculated with the Gaussian Interaction Profile (GIP) kernel, and we also compute the receptor GIP similarity and the receptor sequence similarity; the sequence similarity is then used as the final receptor similarity according to the prediction results. 10-fold cross validation (10CV) and leave-one-out cross validation (LOOCV) are used to assess the prediction performance of our method, and we compare it with three other competing methods (BRWH, LapRLS, CMF). The experimental results show that IILLS achieves AUC values of 0.8675 and 0.9061 with 10-fold cross validation and leave-one-out cross validation (LOOCV), respectively, which illustrates that IILLS is superior to the competing methods. In addition, the case studies further indicate that the IILLS method is effective for virus-receptor interaction prediction.
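
    The GIP kernel on an interaction matrix can be written compactly as below; this follows the standard GIP formulation and uses a toy binary matrix, not the study's data.

```python
# Gaussian Interaction Profile (GIP) kernel: similarity between two viruses is a
# Gaussian of the distance between their interaction profiles, with the
# bandwidth normalized by the average profile norm.
import numpy as np

def gip_kernel(interaction, gamma0=1.0):
    """interaction: binary matrix, rows = viruses, columns = receptors."""
    norms_sq = (interaction**2).sum(axis=1)
    gamma = gamma0 / norms_sq.mean()                 # bandwidth normalization
    # Squared Euclidean distance between every pair of interaction profiles.
    d2 = norms_sq[:, None] + norms_sq[None, :] - 2 * interaction @ interaction.T
    return np.exp(-gamma * d2)

Y = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]], dtype=float)
print(np.round(gip_kernel(Y), 3))
```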

    Updated: 2019-12-27
  • SpliceFinder: ab initio prediction of splice sites using convolutional neural network
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Ruohan Wang; Zishuai Wang; Jianping Wang; Shuaicheng Li

    Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, occur highly frequently at splice sites, and many other patterns at splice sites have important biological functions. However, these dinucleotides also occur frequently in sequences without splice sites, which makes the prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing the true splice sites from the pseudo ones. Such an approach reduces false positives but misses non-canonical splice sites. We have designed SpliceFinder, based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains a classification accuracy of 90.25%, which is 10% higher than existing algorithms. The method also outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact positions of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder generates about half as many false positives while keeping recall above 0.8, and it captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.

    Updated: 2019-12-27
  • Deep mixed model for marginal epistasis detection and population stratification correction in genome-wide association studies
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Haohan Wang; Tianwei Yue; Jingkang Yang; Wei Wu; Eric P. Xing

    Genome-wide Association Studies (GWAS) have contributed to unraveling associations between genetic variants in the human genome and complex traits for more than a decade. While many follow-up methods have been invented to detect interactions between SNPs, epistasis has yet to be modeled and discovered more thoroughly. In this paper, following a previous study on detecting marginal epistasis signals, and motivated by the universal approximation power of deep learning, we propose a neural network method that can potentially model arbitrary interactions between SNPs in genetic association studies, as an extension to mixed models for correcting confounding factors. Our method, the Deep Mixed Model, consists of two components: 1) a confounding factor correction component, a large-kernel convolutional neural network that calibrates the residual phenotypes by removing factors such as population stratification, and 2) a fixed-effect estimation component, consisting mainly of a Long Short-Term Memory (LSTM) model that estimates the association effect sizes of SNPs with the residual phenotype. After validating the performance of our method using simulation experiments, we further applied it to Alzheimer's disease datasets. Our results help gain an exploratory understanding of the genetic architecture of Alzheimer's disease.

    Updated: 2019-12-27
  • Venn-diaNet : venn diagram based network propagation analysis framework for comparing multiple biological experiments
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Benjamin Hur; Dongwon Kang; Sangseon Lee; Ji Hwan Moon; Gung Lee; Sun Kim

    The main research topic of this paper is how to compare multiple biological experiments using transcriptome data, where each experiment is measured and designed to compare control and treated samples. Comparison of multiple biological experiments is usually performed in terms of the number of differentially expressed genes (DEGs) in an arbitrary combination of experiments. This process is usually facilitated with a Venn diagram, but several issues arise when Venn diagrams are used to compare and analyze multiple experiments in terms of DEGs. First, current Venn diagram tools do not provide systematic analysis to prioritize genes; because they generally do not focus on prioritization, genes located in the segments of the Venn diagram (especially the intersections) are difficult to rank. Second, elucidating phenotypic differences using only lists of DEGs and expression values is challenging when the experimental design involves combinations of treatments; designs that aim to find synergistic effects of combined treatments are very difficult to interpret without an informative system. We introduce Venn-diaNet, a Venn diagram based analysis framework that uses network propagation over a protein-protein interaction network to prioritize genes from experiments with multiple DEG lists. We suggest that both issues can be effectively handled by ranking or prioritizing genes within segments of a Venn diagram. The user can easily compare multiple DEG lists with gene rankings, which are easy to understand and can be coupled with additional analyses. Our system provides a web-based interface to select seed genes in any area of a Venn diagram and then performs network propagation to measure the influence of the selected seed genes as a ranked list of DEGs. We suggest that our system logically guides seed-gene selection without additional prior knowledge, freeing users from the seed-selection issues of network propagation. We show that Venn-diaNet can reproduce the research findings reported in the original papers comparing two, three, and eight experiments. Venn-diaNet is freely available at: http://biohealth.snu.ac.kr/software/venndianet
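
    Network propagation from a set of seed genes is commonly implemented as a random walk with restart over the protein-protein interaction network. The sketch below shows that generic computation with NetworkX and NumPy; the restart probability, convergence settings, file name, and example seed genes are placeholders rather than Venn-diaNet's actual parameters.

        # Generic random-walk-with-restart propagation; parameters are illustrative only.
        import networkx as nx
        import numpy as np

        def propagate(graph, seeds, restart=0.5, tol=1e-8, max_iter=1000):
            """Propagate influence from seed genes over a PPI graph; returns gene -> score."""
            nodes = list(graph.nodes())
            idx = {n: i for i, n in enumerate(nodes)}
            # Column-normalized adjacency (transition) matrix.
            A = nx.to_numpy_array(graph, nodelist=nodes)
            W = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
            # Restart vector: uniform over the seed genes (e.g. DEGs in one Venn segment).
            p0 = np.zeros(len(nodes))
            for s in seeds:
                if s in idx:
                    p0[idx[s]] = 1.0
            if p0.sum() == 0:
                raise ValueError("none of the seed genes are in the network")
            p0 /= p0.sum()
            p = p0.copy()
            for _ in range(max_iter):
                p_next = (1 - restart) * W @ p + restart * p0
                if np.abs(p_next - p).sum() < tol:
                    break
                p = p_next
            return dict(zip(nodes, p))

        # Usage sketch: rank genes by propagated score from seeds picked in a Venn segment.
        # g = nx.read_edgelist("ppi_edges.txt")          # placeholder PPI edge list
        # scores = propagate(g, {"TP53", "BRCA1"})       # placeholder seed genes
        # ranked = sorted(scores, key=scores.get, reverse=True)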

    Updated: 2019-12-27
  • LEMON: a method to construct the local strains at horizontal gene transfer sites in gut metagenomics
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-27
    Chen Li; Yiqi Jiang; Shuaicheng Li

    Horizontal gene transfer (HGT) refers to the transfer of genetic material between organisms through mechanisms other than parent-offspring inheritance. HGTs may affect human health through the large number of microorganisms the human body harbors, especially the gut microbiome. The transferred segments may lead to complicated local genome structural variations, and details of the local genome structure can elucidate the effects of the HGTs. In this work, we propose a graph-based method to reconstruct the local strains at HGT sites from gut metagenomics data. The method is implemented in a package named LEMON. Simulation results indicate that the method can accurately identify transferred segments on microbial reference sequences and that LEMON can recover local strains with complicated structural variation. Furthermore, gene fusion points detected in real data near HGT breakpoints validate the accuracy of LEMON, and some strains reconstructed by LEMON have replication-time profiles with lower standard error, which demonstrates that the HGT events recovered by LEMON are reliable. With LEMON we can reconstruct the sequence structure of bacteria harboring HGT events, which helps us study gene flow among different microbial species.
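
    The underlying graph idea (reference segments as nodes, read-supported junctions as edges, local strains read off as paths) can be caricatured as follows. This toy sketch only enumerates candidate segment orderings from junction support counts; LEMON's actual model, breakpoint detection, and strain scoring are considerably more involved.

        # Toy illustration of a junction graph and path enumeration; not LEMON's algorithm.
        from collections import defaultdict

        def build_junction_graph(junctions):
            """junctions: iterable of (segment_from, segment_to, read_support)."""
            g = defaultdict(list)
            for a, b, support in junctions:
                g[a].append((b, support))
            return g

        def enumerate_local_strains(graph, start, max_len=6):
            """Depth-first enumeration of candidate segment orderings (local strains)."""
            strains = []
            def walk(node, path, seen):
                path = path + [node]
                if len(path) > 1:
                    strains.append(list(path))
                if len(path) >= max_len:
                    return
                for nxt, support in graph.get(node, []):
                    if nxt not in seen and support > 0:
                        walk(nxt, path, seen | {nxt})
            walk(start, [], {start})
            return strains

        # Example: a transferred segment T inserted between reference segments A and B.
        g = build_junction_graph([("A", "T", 12), ("T", "B", 9), ("A", "B", 3)])
        for strain in enumerate_local_strains(g, "A"):
            print("->".join(strain))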

    Updated: 2019-12-27
  • ICGRM: integrative construction of genomic relationship matrix combining multiple genomic regions for big dataset
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-26
    Dan Jiang; Cong Xin; Jinhua Ye; Yingbo Yuan; Ming Fang

    Genomic prediction is an advanced method for estimating genetic values that has been widely adopted for genetic evaluation in animals and disease-risk prediction in humans. It estimates genetic values from genome-wide distributed SNPs instead of pedigree. Its key step is to construct the genomic relationship matrix (GRM) from genome-wide SNPs; however, the calculation of the GRM usually requires a huge amount of computer memory, especially when the number of SNPs and the sample size are large, so that it can become computationally prohibitive even for supercomputer clusters. We herein developed an integrative algorithm, ICGRM, to compute the GRM. To avoid computing the GRM over the whole genome at once, ICGRM divides the genome-wide SNPs into several segments and computes, for each segment, summary statistics related to the GRM that require very little RAM; it then integrates these summary statistics to produce the GRM for the whole genome. In our simulation dataset, the memory requirement of ICGRM was reduced by a factor of 15 (from 218 GB to 14 GB) when the genome-wide SNPs were split into 5 to 200 parts by number of SNPs, making the computation feasible for almost all kinds of computer servers. ICGRM is implemented in C/C++ and freely available via https://github.com/mingfang618/CLGRM. ICGRM is computationally efficient software for building the GRM and can be used for big datasets.
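
    The segment-wise accumulation is easy to illustrate with a VanRaden-style GRM, where each SNP block contributes a partial cross-product and a partial scaling term. The NumPy sketch below (in Python rather than the C/C++ used by ICGRM) is a simplified illustration of that idea, not the ICGRM code itself; file handling and scaling details are omitted.

        # Segmented GRM accumulation (VanRaden-style); simplified, illustrative only.
        import numpy as np

        def grm_by_segments(segment_iter):
            """segment_iter yields genotype blocks of shape (n_samples, n_snps_in_block),
            coded 0/1/2. Only one block needs to be held in memory at a time."""
            G_sum = None          # running sum of Z_k Z_k^T
            scale = 0.0           # running sum of 2 * p * (1 - p)
            for X in segment_iter:
                p = X.mean(axis=0) / 2.0            # allele frequencies for this block
                Z = X - 2.0 * p                     # center genotypes by 2p
                if G_sum is None:
                    G_sum = Z @ Z.T
                else:
                    G_sum += Z @ Z.T
                scale += float(np.sum(2.0 * p * (1.0 - p)))
            return G_sum / scale

        # Usage with an in-memory toy genotype matrix split into 4 column blocks:
        rng = np.random.default_rng(0)
        X_full = rng.integers(0, 3, size=(50, 2000)).astype(float)
        G = grm_by_segments(np.array_split(X_full, 4, axis=1))
        print(G.shape)   # (50, 50)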

    Updated: 2019-12-27
  • PROMO: an interactive tool for analyzing clinically-labeled multi-omic cancer datasets
    BMC Bioinform. (IF 2.213) Pub Date : 2019-12-26
    Dvir Netanely; Neta Stern; Itay Laufer; Ron Shamir

    Analysis of large genomic datasets along with their accompanying clinical information has shown great promise in cancer research over the last decade. Such datasets typically include thousands of samples, each measured by one or several high-throughput technologies ('omics') and annotated with extensive clinical information. While instrumental for fulfilling the promise of personalized medicine, the analysis and visualization of such large datasets are challenging and require programming skills and familiarity with a large array of software tools for the various steps of the analysis. We developed PROMO (Profiler of Multi-Omic data), a friendly, fully interactive, stand-alone application for analyzing large genomic cancer datasets together with their associated clinical information. The tool provides an array of built-in methods and algorithms for importing, preprocessing, visualizing, clustering, clinical label enrichment testing, and survival analysis that can be performed on a single-omic or multi-omic dataset. The tool can be used for quick exploration and stratification of tumor samples from patients into clinically significant molecular subtypes. Identification of prognostic biomarkers and generation of simple subtype classifiers are additional important features. We review PROMO's main features and demonstrate its analysis capabilities on a breast cancer cohort from TCGA. PROMO provides a single integrated solution for swiftly performing a complete analysis of cancer genomic data for subtype discovery and biomarker identification without writing a single line of code, and can therefore make the analysis of these data much easier for cancer biologists and biomedical researchers. PROMO is freely available for download at http://acgt.cs.tau.ac.il/promo/.
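
    One typical step in such a workflow, clustering samples into molecular subtypes and testing whether a clinical label is enriched in them, can be sketched with generic Python libraries. The snippet below uses toy data and standard scikit-learn/SciPy calls; it is not PROMO code, and the choice of k-means with a chi-square enrichment test is an assumption made for illustration.

        # Subtype clustering plus clinical-label enrichment test; generic, illustrative only.
        import numpy as np
        import pandas as pd
        from sklearn.cluster import KMeans
        from scipy.stats import chi2_contingency

        def subtype_and_test(expr, clinical_label, k=4, seed=0):
            """expr: DataFrame (samples x genes); clinical_label: Series indexed by sample."""
            clusters = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(expr.values)
            clusters = pd.Series(clusters, index=expr.index, name="subtype")
            table = pd.crosstab(clusters, clinical_label.loc[expr.index])
            chi2, pval, dof, _ = chi2_contingency(table)
            return clusters, table, pval

        # Usage (toy data standing in for, e.g., a TCGA expression matrix):
        rng = np.random.default_rng(1)
        expr = pd.DataFrame(rng.normal(size=(100, 500)),
                            index=[f"S{i}" for i in range(100)])
        label = pd.Series(rng.choice(["ER+", "ER-"], size=100), index=expr.index)
        subtypes, table, p = subtype_and_test(expr, label)
        print(table, f"chi-square p = {p:.3g}", sep="\n")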

    Updated: 2019-12-27
Contents have been reproduced by permission of the publishers.