-
Accelerating the continuous community sharing of digital neuromorphology data bioRxiv. Bioinform. Pub Date : 2024-03-18 Carolina Tecuatl, Bengt Ljungquist, Giorgio A. Ascoli
The tree-like morphology of neurons and glia is a key cellular determinant of circuit connectivity and metabolic function in the nervous system of essentially all animals. To elucidate the contribution of specific cell types to both physiological and pathological brain states, it is important to access detailed neuroanatomy data for quantitative analysis and computational modeling. NeuroMorpho.Org
-
Multi-layered Network Analysis of Osteoking in the Treatment of Osteoporosis: Unraveling Mechanisms from Gene Expression to Molecular Docking bioRxiv. Bioinform. Pub Date : 2024-03-18 He Chen, Jun Ying, Xianjie Xie, Boyun Huang, Pengcheng Lin
This study aimed to elucidate the therapeutic mechanisms of Osteoking in the treatment of osteoporosis through a comprehensive analysis of potential targets, active ingredients, and associated pathways. Method: The study employed an integrated approach to understand the molecular mechanisms underlying Osteoking's treatment of osteoporosis. The construction of the protein-protein interaction network
-
Predicting cell-type-specific exon inclusion in the human brain reveals more complex splicing mechanisms in neurons than glia bioRxiv. Bioinform. Pub Date : 2024-03-18 Lieke Michielsen, Justine Hsu, Anoushka Joglekar, Natan Belchikov, Marcel Reinders, Hagen Tilgner, Ahmed Mahfouz
Alternative splicing contributes to molecular diversity across brain cell types. RNA-binding proteins (RBPs) regulate splicing, but the genome-wide mechanisms remain poorly understood. Here, we used RBP binding sites and/or the genomic sequence to predict exon inclusion in neurons and glia as measured by long-read single-cell data in human hippocampus and frontal cortex. We found that alternative splicing
-
Heterogeneity analysis of acute exacerbations of chronic obstructive pulmonary disease and a deep learning framework with weak supervision and privacy protection bioRxiv. Bioinform. Pub Date : 2024-03-18 Yuto Suzuki, Andrew Hill, Elena Engel, Ann Granchelli, Gabe Lockhart, Farnoush Banaei-Kashani, Russell Paul Bowler
Chronic obstructive pulmonary disease (COPD) affects 5-10% of the adult US population and is a major cause of mortality. Acute exacerbations of COPD (AECOPDs) are a major driver of COPD morbidity and mortality, but there are no cost-effective methods to identify early AECOPDs when treatment is most likely to reduce the severity and duration of AECOPDs. We conducted the first long-term (> 12 months)
-
A deep profile of gene expression across 18 human cancers bioRxiv. Bioinform. Pub Date : 2024-03-17 Wei Qiu, Ayse Berceste Dincer, Joseph Janizek, Safiye Celik, Mikael Pittet, Kamila Naxerova, Su-In Lee
Clinically and biologically valuable information may reside untapped in large cancer gene expression data sets. Deep unsupervised learning has the potential to extract this information with unprecedented efficacy but has thus far been hampered by a lack of biological interpretability and robustness. Here, we present DeepProfile, a comprehensive framework that addresses current challenges in applying
-
PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context bioRxiv. Bioinform. Pub Date : 2024-03-17 Guojie Zhong, Yige Zhao, Demi Zhuang, Wendy K Chung, Yufeng Shen
Accurate prediction of the functional impact of missense variants is important for disease gene discovery, clinical genetic diagnostics, therapeutic strategies, and protein engineering. Previous efforts have focused on predicting a binary pathogenicity classification, but the functional impact of missense variants is multi-dimensional. Pathogenic missense variants in the same gene may act through different
-
Information-Content-Informed Kendall-tau Correlation: Utilizing Missing Values bioRxiv. Bioinform. Pub Date : 2024-03-17 Robert M Flight, Praneeth S Bhatt, Hunter N.B. Moseley
Almost all correlation measures currently available are unable to directly handle missing values. Typically, missing values are either ignored completely by removing them or are imputed and used in the calculation of the correlation coefficient. In both cases, the correlation value will be impacted based on a perspective that missing data represents no useful information. However, missing values occur
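As a hedged illustration of the point above (not the authors' method), the following minimal sketch shows how the two conventional treatments of missing values, dropping incomplete pairs versus imputing a constant, can yield different Kendall-tau values on the same data. The toy vectors, the helper names, and the choice of `0.0` as the imputed value are all assumptions made for this example.

```python
from itertools import combinations
import math


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod > 0:
            score += 1
        elif prod < 0:
            score -= 1
    return score / (n * (n - 1) / 2)


def drop_missing(x, y):
    """Pairwise deletion: keep only positions where both values are present."""
    kept = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    return [a for a, _ in kept], [b for _, b in kept]


def impute_constant(values, fill):
    """Replace every missing value with a fixed constant (hypothetical choice)."""
    return [fill if math.isnan(v) else v for v in values]


NAN = float("nan")
# Toy data (assumption): x is missing exactly where y is largest.
x = [1.0, 2.0, 3.0, 4.0, NAN, NAN]
y = [2.0, 1.0, 3.0, 4.0, 5.0, 6.0]

tau_dropped = kendall_tau_a(*drop_missing(x, y))
tau_imputed = kendall_tau_a(impute_constant(x, 0.0), impute_constant(y, 0.0))

print(f"pairwise deletion: {tau_dropped:.3f}")
print(f"constant imputation: {tau_imputed:.3f}")
```

On this toy data the two treatments disagree even in sign (2/3 after deletion, -4/15 after imputation), which shows why the handling of missing values matters for the resulting coefficient.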
-
Identification of novel prognostic targets in acute kidney injury using bioinformatics and next generation sequencing data analysis bioRxiv. Bioinform. Pub Date : 2024-03-17 Basavaraj Mallikarjunayya Vastrad, Chanabasayya Mallikarjunayya Vastrad
Acute kidney injury (AKI) is a type of renal disease that occurs frequently in hospitalized patients and may cause abnormal renal function and structure, with an increase in serum creatinine level with or without reduced urine output. The incidence of AKI is increasing. However, the molecular mechanisms of AKI have not been elucidated. It is significant to further explore the molecular mechanisms of
-
ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations bioRxiv. Bioinform. Pub Date : 2024-03-17 Weijie Yin, Zhaoyu Zhang, Liang He, Rui Jiang, Shuo Zhang, Gan Liu, Xuegong Zhang, Tao Qin, Zhen Xie
With large amounts of unlabeled RNA sequence data produced by high-throughput sequencing technologies, pre-trained RNA language models have been developed to estimate the semantic space of RNA molecules, which facilitates the understanding of the grammar of the RNA language. However, existing RNA language models overlook the impact of structure when modeling the RNA semantic space, resulting in incomplete feature
-
Evaluation of methods for RNA-Seq analysis for uncovering key components of estrogen receptor-alpha signaling pathway in breast cancer bioRxiv. Bioinform. Pub Date : 2024-03-17 Wanru Guo
Breast cancer is the most common female cancer worldwide. Higher estrogen receptor (ER) expression is often associated with poor prognosis in ER-positive breast cancer; however, the exact mechanism is unknown. RNA-Seq data from three different ER knockdown experiments (siE1, siE2, siE3) were used by researchers previously to identify TNFAIP1/BACURD2 as the mediator of ER-induced increase in cell migration
-
A Functional Map of the Human Intrinsically Disordered Proteome bioRxiv. Bioinform. Pub Date : 2024-03-17 Iva Pritišanac, T. Reid Alderson, Đesika Kolarić, Taraneh Zarin, Shuting Xie, Alex X. Lu, Aqsa Alam, Abdullah Maqsood, Ji-Young Youn, Julie D. Forman-Kay, Alan M. Moses
Intrinsically disordered regions (IDRs) represent at least one-third of the human proteome and defy the established structure-function paradigm. Because IDRs often have limited positional sequence conservation, the functional classification of IDRs using standard bioinformatics is generally not possible. Here, we show that evolutionarily conserved molecular features of the intrinsically disordered
-
Characterising tandem repeat complexities across long-read sequencing platforms with TREAT bioRxiv. Bioinform. Pub Date : 2024-03-17 Niccolo Tesi, Alex Salazar, Yaran Zhang, Sven van der Lee, Marc Hulsman, Lydian Knoop, Sanduni Wijesekera, Jana Krizova, Anne-Fleur Schneider, Maartje Pennings, Kristel Sleegers, Erik-Jan Kamsteeg, Marcel Reinders, Henne Holstege
Tandem repeats (TRs) play important roles in genomic variation and disease risk in humans. Long-read sequencing allows for the characterisation of TRs; however, the underlying bioinformatics perspective remains challenging. We evaluated potential biases when genotyping >864k TRs using diverse Oxford Nanopore Technology (ONT) and PacBio long-read sequencing technologies. We showed that, in rare cases
-
PPIscreenML: Structure-based screening for protein-protein interactions using AlphaFold bioRxiv. Bioinform. Pub Date : 2024-03-17 Victoria Mischley, Johannes Maier, Jesse Chen, John Karanicolas
Protein-protein interactions underlie nearly all cellular processes. With the advent of protein structure prediction methods such as AlphaFold2 (AF2), models of specific protein pairs can be built extremely accurately in most cases. However, determining the relevance of a given protein pair remains an open question. It is presently unclear how best to use structure-based tools to infer whether a pair
-
GGTyper: genotyping complex structural variants using short-read sequencing data bioRxiv. Bioinform. Pub Date : 2024-03-17 Tim Mirus, Robert Lohmayer, Bjarni V Halldorsson, Birte Kehr
Motivation: Complex structural variants are genomic rearrangements that involve multiple segments of DNA. They contribute to human diversity and have been shown to cause Mendelian disease. Nevertheless, our abilities to analyse complex structural variants are very limited. As opposed to deletions and other canonical types of structural variants (SVs), there are no established tools that have explicitly
-
Conformal prediction of molecule-induced cancer cell growth inhibition challenged by strong distribution shifts bioRxiv. Bioinform. Pub Date : 2024-03-17 Saiveth Hernandez-Hernandez, Qianrong Guo, Pedro Ballester
The drug discovery process often employs phenotypic and target-based virtual screening to identify potential drug candidates. Despite the longstanding dominance of target-based approaches, phenotypic virtual screening is undergoing a resurgence now that its potential is better understood. In the context of cancer cell lines, a well-established experimental system for phenotypic screens, molecules
-
Gapless assembly of complete human and plant chromosomes using only nanopore sequencing bioRxiv. Bioinform. Pub Date : 2024-03-17 Sergey Koren, Zhigui Bao, Andrea Guarracino, Shujun Ou, Sara Goodwin, Katharine M Jenike, Julian Lucas, Brandy McNulty, Jimin Park, Mikko Rautiainen, Arang Rhie, Dick Roelofs, Harrie Schneiders, Ilse Vrijenhoek, Koen Nijbroek, Doreen Ware, Michael C. Schatz, Erik Garrison, Sanwen Huang, William Richard McCombie, Karen H Miga, Alexander H.J. Wittenberg, Adam M Phillippy
The combination of ultra-long Oxford Nanopore (ONT) sequencing reads with long, accurate PacBio HiFi reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads
-
DrugHIVE: Target-specific spatial drug design and optimization with a hierarchical generative model bioRxiv. Bioinform. Pub Date : 2024-03-17 Jesse A Weller, Remo Rohs
Rapid advancement in the computational methods of structure-based drug design has led to their widespread adoption as key tools in the early drug development process. Recently, the remarkable growth of available crystal structure data and libraries of commercially available or readily synthesizable molecules have unlocked previously inaccessible regions of chemical space for drug development. Paired
-
A Comparison of Antibody-Antigen Complex Sequence-to-Structure Prediction Methods and their Systematic Biases bioRxiv. Bioinform. Pub Date : 2024-03-17 Katherine Maia McCoy, Margaret E Ackerman, Gevorg Grigoryan
The ability to accurately predict antibody-antigen complex structures from their sequences could greatly advance our understanding of the immune system and would aid in the development of novel antibody therapeutics. There have been considerable recent advancements in predicting protein-protein interactions (PPIs) fueled by progress in machine learning (ML). To understand the current state of the field
-
Celldetective: an AI-enhanced image analysis tool for unraveling dynamic cell interactions bioRxiv. Bioinform. Pub Date : 2024-03-17 Remy Torro, Beatriz Diaz Bello, Dalia El Arawi, Lorna Ammer, Patrick Chames, Kheya Sengupta, Laurent Limozin
A current key challenge in bioimaging is the analysis of multimodal and multidimensional data reporting dynamic interactions between diverse cell populations. We developed Celldetective, software that integrates AI-based segmentation and tracking algorithms and automated signal analysis into a user-friendly graphical interface. It offers complete interactive visualization, annotation, and training
-
Biophysics-based protein language models for protein engineering bioRxiv. Bioinform. Pub Date : 2024-03-17 Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D'Costa, Anthony Gitter, Philip A Romero
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical
-
Comprehensive machine learning boosts structure-based virtual screening for PARP1 inhibitors bioRxiv. Bioinform. Pub Date : 2024-03-17 Klaudia Caba, Viet-Khoa Tran-Nguyen, Taufiq Rahman, Pedro Ballester
Poly ADP-ribose polymerase 1 (PARP1) is an attractive therapeutic target for cancer treatment. Machine-learning scoring functions constitute a promising approach to discovering novel PARP1 inhibitors. Cutting-edge PARP1-specific machine-learning scoring functions were investigated using semi-synthetic training data from docking activity-labelled molecules: known PARP1 inhibitors, hard-to-discriminate
-
CLigOpt: Controllable Ligand Design through Target-Specific Optimisation bioRxiv. Bioinform. Pub Date : 2024-03-17 Yutong Li, Pedro Henrique da Costa Avelar, Xinyue Chen, Li Zhang, Min Wu, Sophia Tsoka
Fragment-based drug design (FBDD), where fragments act as starting points for molecular generation, is an effective way to constrain chemical space and improve generation for biologically active molecules. Key challenges in this process are to navigate through the vast molecular space and to produce promising molecules. Here, we propose a controllable FBDD model, CLigOpt, which can generate molecules
-
Random forest machine learning algorithm classifies white- and brown-rot fungi according to the number of Carbohydrate-Active enZyme genes bioRxiv. Bioinform. Pub Date : 2024-03-17 Natsuki Hasegawa, Masashi Sugiyama, Kiyohiko Igarashi
Wood-rotting fungi play an important role in the global carbon cycle because they are the only known organisms that digest wood, the largest carbon stock in nature. In the present study, we used linear discriminant analysis and random forest (RF) machine learning algorithms to predict white- or brown-rot decay modes from the numbers of genes encoding Carbohydrate-Active enZymes (CAZymes) with over 98%
-
Scvi-hub: an actionable repository for model-driven single cell analysis bioRxiv. Bioinform. Pub Date : 2024-03-17 Can Ergen, Valeh Valiollah Pour Amiri, Martin Kim, Aaron Streets, Adam Gayoso, Nir Yosef
The accumulation of single-cell omics datasets in the public domain has opened new opportunities for reusing and leveraging the vast amount of information they contain. Such uses, however, are complicated by the need for complex and resource-consuming procedures for data transfer, normalization and integration that must be addressed prior to any analysis. Here we present scvi-hub: a platform for efficiently
-
Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data bioRxiv. Bioinform. Pub Date : 2024-03-16 Michael B. Hall, Ryan R. Wick, Louise M. Judd, An N. T. Nguyen, Eike J. Steinig, Ouli Xie, Mark R. Davies, Torsten Seemann, Timothy P. Stinear, Lachlan J. M. Coin
Variant calling is fundamental in bacterial genomics, underpinning the identification of disease transmission clusters, the construction of phylogenetic trees, and antimicrobial resistance prediction. This study presents a comprehensive benchmarking of SNP and indel variant calling accuracy across 14 diverse bacterial species using Oxford Nanopore Technologies (ONT) and Illumina sequencing. We generate
-
AutoPeptideML: A study on how to build more trustworthy peptide bioactivity predictors bioRxiv. Bioinform. Pub Date : 2024-03-16 Raul Fernandez-Diaz, Rodrigo Cossio-Perez, Clement Agoni, Lam Thanh Hoang, Vanessa Lopez, Denis C Shields
Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build trustworthy models. We consider the design of such an AutoML tool for developing peptide bioactivity predictors. We analyse different design choices concerning data acquisition and negative class definition, homology partitioning
-
The power and limits of predicting exon-exon interactions using protein 3D structures bioRxiv. Bioinform. Pub Date : 2024-03-16 Jeanine Liebold, Aylin Del Moral-Morales, Karen Manalastas-Cantos, Olga Tsoy, Stefan Kurtz, Jan Baumbach, Khalique Newaz
Alternative splicing (AS) effects on cellular functions can be captured by studying changes in the underlying protein-protein interactions (PPIs). Because AS results in the gain or loss of exons, existing methods for predicting AS-related PPI changes utilize known PPI interfacing exon-exon interactions (EEIs), which only cover ~5% of known human PPIs. Hence, there is a need to extend the existing limited
-
Detection of spatial chromatin accessibility patterns with inter-cellular correlations bioRxiv. Bioinform. Pub Date : 2024-03-16 Xiaoyang Chen, Keyi Li, Xiaoqing Wu, Zhen Li, Qun Jiang, Yanhong Wu, Rui Jiang
Recent advances in spatial sequencing technologies enable simultaneous capture of spatial location and chromatin accessibility of cells within intact tissue slices. Identifying peaks that display spatial variation and cellular heterogeneity is the first and key analytic task for characterizing the spatial chromatin accessibility landscape of complex tissues. Here we propose an efficient and iterative
-
DNA barcoding and species delimitation based on four Chloroplast loci to identify species within the Elaeocarpaceae family bioRxiv. Bioinform. Pub Date : 2024-03-16 Jyotsana Khushwaha, Alpana Joshi, Subrata K Das
DNA barcoding is an indispensable taxonomic tool for plant species identification. The present research successfully amplified four plastid coding regions, namely matK, rpoB, ndhJ, and accD, then sequenced them and submitted the sequences to NCBI GenBank after verification. This study evaluated the significance of single-locus (matK, rpoB, ndhJ, and accD) and two-locus plastid markers (rpoB+matK, ndhJ+matK, accD+matK) and
-
Modeling interpretable correspondence between cell state and perturbation response with CellCap bioRxiv. Bioinform. Pub Date : 2024-03-16 Yang Xu, Stephen Fleming, Matthew Tegtmeyer, Steven A. McCarroll, Mehrtash Babadi
Single-cell transcriptomics, in conjunction with genetic and compound perturbations, offers a robust approach for exploring cellular behaviors in diverse contexts. Such experiments allow uncovering cell-state-specific responses to perturbations, a crucial aspect in unraveling the intricate molecular mechanisms governing cellular behavior and potentially discovering novel regulatory pathways and therapeutic
-
The Hitchhiker's Guide to Sequencing Data Types and Volumes for Population-Scale Pangenome Construction bioRxiv. Bioinform. Pub Date : 2024-03-16 Prasad Sarashetti, Josipa Lipovac, Filip Tomas, Mile Sikic, Jianjun Liu
Long-read (LR) technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types like HiFi, Duplex, and ultra-long ONT (ULONT). Despite recent strides in achieving haplotype-phased gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting
-
Effect of dataset partitioning strategies for evaluating out-of-distribution generalisation for predictive models in biochemistry bioRxiv. Bioinform. Pub Date : 2024-03-16 Raúl Fernández-Díaz, Thanh Lam Hoang, Vanessa Lopez, Denis C. Shields
We have developed Hestia, a computational tool that provides a unified framework for introducing similarity correction techniques across different biochemical data types. We propose a new strategy for dividing a dataset into training and evaluation subsets (CCPart) and have compared it against other methods at different thresholds to explore the impact that these choices have on model generalisation
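The abstract excerpt does not describe how CCPart or Hestia work, so the sketch below is not their implementation. It only illustrates, under stated assumptions, one generic way to make a train/evaluation split similarity-aware: link items whose pairwise similarity reaches a threshold, then assign each connected component wholly to one subset. The similarity function (`difflib.SequenceMatcher`), the toy peptide strings, the threshold, and all function names are hypothetical choices for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Toy string similarity in [0, 1]; a stand-in for a real biochemical metric."""
    return SequenceMatcher(None, a, b).ratio()


def connected_components(items, threshold):
    """Group item indices linked by pairwise similarity >= threshold (union-find)."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def component_split(items, threshold, test_fraction):
    """Assign whole components to the evaluation set, smallest first, up to the target size."""
    components = sorted(connected_components(items, threshold), key=len)
    target = test_fraction * len(items)
    train, test = [], []
    for comp in components:
        if len(test) + len(comp) <= target:
            test.extend(comp)
        else:
            train.extend(comp)
    return sorted(train), sorted(test)


# Toy sequences (assumption): two families of near-duplicates and one singleton.
peptides = ["ACDEFGHIK", "ACDEFGHIR", "ACDEFGHLK", "WWYYPPQQN", "WWYYPPQQS", "MNMNMNMNM"]
train_idx, test_idx = component_split(peptides, threshold=0.7, test_fraction=0.5)

print("train:", [peptides[i] for i in train_idx])
print("test: ", [peptides[i] for i in test_idx])
```

Because components are never divided, no evaluation item has a training neighbour at or above the threshold; the trade-off is that the achieved evaluation fraction can fall short of the requested one when components are large.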
-
Targeting Leishmania infantum Mannosyl-oligosaccharide glucosidase with natural products: pH-dependent inhibition explored through computer-aided drug design. bioRxiv. Bioinform. Pub Date : 2024-03-16 Luis Daniel Goyzueta Mamani, Haruna Luz Barazorda Ccahuana, Mayron Antonio Candia Puma, Alexsandro Sobreira Galdino, Ricardo Andrez Machado de Avila, Rodolfo Cordeiro Giunchetti, Jose Luis Medina Franco, Monica Florin Christensen, Miguel Angel Chavez Fumagalli
Visceral Leishmaniasis (VL) is a serious public health issue, documented in more than ninety countries, where an estimated 500,000 new cases emerge each year. Despite novel methodologies, advancements, and experimental interventions, therapeutic limitations and drug resistance remain challenging. For this reason, based on previous research, we screened natural products (NP) from Nuclei of
-
A Hybrid Diffusion Model for Stable, Affinity-Driven, Receptor-Aware Peptide Generation bioRxiv. Bioinform. Pub Date : 2024-03-16 Vishva Saravanan Ramasubramanian, Soham Choudhuri, Bhaswar Ghosh
The convergence of biotechnology and artificial intelligence has the potential to transform drug development, especially in the field of therapeutic peptide design. Peptides are short chains of amino acids with diverse therapeutic applications that offer several advantages over small molecular drugs, such as targeted therapy and minimal side effects. However, limited oral bioavailability and enzymatic
-
Comprehensive detection and characterization of human druggable pockets through novel binding site descriptors bioRxiv. Bioinform. Pub Date : 2024-03-16 Arnau Comajuncosa-Creus, Guillem Jorba, Xavier Barril, Patrick Aloy
Druggable pockets are protein regions that have the ability to bind organic small molecules, and their characterization is essential in target-based drug discovery. However, strategies to derive pocket descriptors are scarce and usually exhibit limited applicability. Here, we present PocketVec, a novel approach to generate pocket descriptors for any protein binding site of interest through the inverse
-
MetaDIA: A Novel Database Reduction Strategy for DIA Human Gut Metaproteomics bioRxiv. Bioinform. Pub Date : 2024-03-16 Haonan Duan, Zhibin Ning, Zhongzhi Sun, Tiannan Guo, Yingying Sun, Daniel Figeys
Background: Microbiomes, especially within the gut, are complex and may comprise hundreds of species. The identification of peptides in metaproteomics presents a significant challenge, as it involves matching peptides to mass spectra within an enormous search space for complex and unknown samples. This poses difficulties for both the accuracy and the speed of identification. Specifically, analysis
-
BayesianSSA: a Bayesian statistical model based on structural sensitivity analysis for predicting responses to enzyme perturbations in metabolic networks bioRxiv. Bioinform. Pub Date : 2024-03-16 Shion Hosoda, Hisashi Iwata, Takuya Miura, Maiko Tanabe, Takashi Okada, Atsushi Mochizuki, Miwa Sato
Chemical bioproduction has attracted attention as a key technology in a decarbonized society. In computational design for chemical bioproduction, it is necessary to predict changes in metabolic fluxes when up-/down-regulating enzymatic reactions, that is, responses of the system to enzyme perturbations. Structural sensitivity analysis (SSA) was previously developed as a method to predict qualitative
-
Stereochemically-aware bioactivity descriptors for uncharacterized chemical compounds bioRxiv. Bioinform. Pub Date : 2024-03-16 Arnau Comajuncosa-Creus, Aksel Lenes, Miguel Sanchez-Palomino, Patrick Aloy
We recently presented a set of deep neural networks to generate bioactivity descriptors associated with small molecules (i.e. Signaturizers), capturing their effects at increasing levels of biological complexity (i.e. from protein targets to clinical outcomes). However, such models were trained on 2D representations of molecules and are thus unable to capture key differences in the activity of stereoisomers
-
Cancer Stemness Online: A resource for investigating cancer stemness and associations with immune response bioRxiv. Bioinform. Pub Date : 2024-03-16 Weiwei Zhou, Minghai Su, Tiantongfei Jiang, Yunjin Xie, Jingyi Shi, Yingying Ma, Kang Xu, Gang Xu, Yongsheng Li, Juan Xu
Cancer progression involves the gradual loss of a differentiated phenotype and acquisition of progenitor and stem-cell-like features, which are potential culprits in immunotherapy resistance. Although state-of-the-art predictive computational methods have facilitated the prediction of cancer stemness, there is currently no efficient resource that can meet various usage requirements. Here, we presented
-
HUHgle: An Interactive Substrate Design Tool for Covalent Protein-ssDNA Labeling Using HUH-tags bioRxiv. Bioinform. Pub Date : 2024-03-16 Adam T Smiley, Natalia S Babilonia-Díaz, Aspen J Hughes, Andrew CD Lemmex, Michael JM Anderson, Kassidy J Tompkins, Wendy R Gordon
HUH-tags have emerged as versatile fusion partners that mediate sequence-specific protein-ssDNA bioconjugation through a simple and efficient reaction. Here we present HUHgle, a Python-based interactive tool for the visualization, design, and optimization of substrates for HUH-tag mediated covalent labeling of proteins of interest with ssDNA substrates of interest. HUHgle streamlines design processes
-
ArtiDock: fast and accurate machine learning approach to protein-ligand docking based on multimodal data augmentation bioRxiv. Bioinform. Pub Date : 2024-03-16 Taras Voitsitskyi, Semen Yesylevskyy, Volodymyr Bdzhola, Roman Stratiichuk, Ihor Koleiev, Zakhar Ostrovsky, Volodymyr Vozniak, Ivan Khropachov, Pavlo Henitsoi, Leonid Popryho, Roman Zhytar, Alan Nafiiev, Serhii Starosyla
We present ArtiDock, a deep learning technique for predicting ligand poses in protein binding pockets (aka "AI docking"), which is based on augmenting inherently limited training data with algorithmically generated artificial binding pockets and ensembles of representative conformations of ligand-protein complexes obtained from MD simulations. Performance of ArtiDock is compared systematically
-
Integration of molecular coarse-grained model into geometric representation learning framework for protein-protein complex property prediction bioRxiv. Bioinform. Pub Date : 2024-03-16 Yang Yue, Shu Li, Yihua Cheng, Zexuan Zhu, Lie Wang, Tingjun Hou, Shan He
Structure-based machine learning algorithms have been utilized to predict the properties of protein-protein interaction (PPI) complexes, such as binding affinity, which is critical for understanding biological mechanisms and disease treatments. While most existing algorithms represent PPI complex graph structures at the atom-scale or residue-scale, these representations can be computationally expensive
-
Machine learning-based prediction of fish acute mortality: Implementation, interpretation, and regulatory relevance bioRxiv. Bioinform. Pub Date : 2024-03-16 Lilian Gasser, Christoph Schuer, Fernando Perez-Cruz, Kristin Schirmer, Marco Baity Jesi
Regulation of chemicals requires knowledge of their toxicological effects on a large number of target species. Traditionally, this knowledge has been acquired through in vivo testing. The recent effort to find alternatives based on machine learning, however, has not focused on guaranteeing transparency, comparability and reproducibility, which makes it difficult to assess advantages and disadvantages
-
Cigarette smoking drives accelerated aging across human tissues bioRxiv. Bioinform. Pub Date : 2024-03-16 Jose Miguel Ramirez, Rogerio Ribeiro, Oleksandra Soldatkina, Athos Moraes, Raquel Garcia-Perez, Pedro G Ferreira, Marta Mele
Tobacco smoke is the main cause of preventable mortality worldwide. Smoking increases the risk of developing many diseases and has been proposed as an aging accelerator. Yet, the molecular mechanisms driving smoking-related health decline and aging acceleration in most tissues remain unexplored. Here, we characterize gene expression, alternative splicing, DNA methylation and histological alterations
-
MetagenomicKG: a knowledge graph for metagenomic applications bioRxiv. Bioinform. Pub Date : 2024-03-15 Chunyu Ma, Shaopeng Liu, David Koslicki
Motivation: The sheer volume and variety of genomic content within microbial communities makes metagenomics a field rich in biomedical knowledge. To traverse these complex communities and their vast unknowns, metagenomic studies often depend on distinct reference databases, such as the Genome Taxonomy Database (GTDB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Bacterial and Viral Bioinformatics
-
Dissecting AlphaFold's Capabilities with Limited Sequence Information bioRxiv. Bioinform. Pub Date : 2024-03-15 Jannik Adrian Gut, Thomas Lemmin
Protein structure prediction, a fundamental challenge in computational biology, aims to predict a protein's 3D structure from its amino acid sequence. This structure is pivotal for elucidating protein functions, interactions, and driving innovations in drug discovery and enzyme engineering. AlphaFold, a powerful deep learning model, has revolutionized this field by leveraging phylogenetic information
-
Protein Language Models Expose Viral Mimicry and Immune Escape bioRxiv. Bioinform. Pub Date : 2024-03-15 Dan Ofer, Michal Linial
Motivation: Viruses elude the immune system through molecular mimicry, adopting biophysical characteristics of their host. We adapt protein language models (PLMs) to differentiate between human and viral proteins. Understanding where the immune system and our models make mistakes could reveal viral immune escape mechanisms. Results: We applied pretrained deep-learning PLMs to predict viral from human
-
Transcription factor prediction using protein 3D structures bioRxiv. Bioinform. Pub Date : 2024-03-15 Fabian Neuhaus, Jeanine Liebold, Jan Baumbach, Khalique Newaz
Motivation: Transcription factors (TFs) are DNA-binding proteins that regulate the expression of genes in an organism. Hence, it is important to identify novel TFs. Traditionally, novel TFs have been identified by their sequence similarity to the DNA-binding domains (DBDs) of known TFs. However, this approach can fail to identify a novel TF that is not similar in sequence to any of the known DBDs. Hence
-
Optimizing Design of Genomics Studies for Clonal Evolution Analysis bioRxiv. Bioinform. Pub Date : 2024-03-15 Arjun Srivatsa, Russell Schwartz
Genomic biotechnologies have seen rapid development over the past two decades, allowing for both the inference and modification of genetic and epigenetic information at the single cell level. While these tools present enormous potential for basic research, diagnostics, and treatment, they also raise difficult issues of how to design research studies to deploy these tools most effectively. In designing
-
Protein-based cell population discovery and annotation for CITE-seq data identifies cellular phenotypes associated with critical COVID-19 severity bioRxiv. Bioinform. Pub Date : 2024-03-15 Denise Allen, Matthew Weaver, Sam Prokopchuk, Fritz Lekschas, Mike Jiang, Greg Finak, Evan Greene, Andrew McDavid
Technologies such as Cellular Indexing of Transcriptomes and Epitopes sequencing (CITE-seq) and RNA Expression and Protein sequencing (REAP-seq) augment unimodal single-cell RNA sequencing (scRNA-seq) by simultaneously measuring expression of cell-surface proteins using antibody derived oligonucleotide tags (ADT). These protocols have been increasingly used to resolve cellular populations that are
-
Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data bioRxiv. Bioinform. Pub Date : 2024-03-15 Julia Haag, Alexander I. Jordan, Alexandros Stamatakis
Motivation: Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual's origin or membership to a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need
-
Construction of an Immunoinformatics-Based Multi-Epitope Vaccine Candidate targeting Kyasanur Forest Disease Virus bioRxiv. Bioinform. Pub Date : 2024-03-15 Sunitha Manjari, Lekshmi S. Rajan, Anita M. Shete, Vinod Jani, Savita Patil, Yash Joshi, Rima R. Sahay, Deepak Y. Patil, Sreelekshmy Mohandas, Triparna Majumdar, Uddhavesh Sonavane, Rajendra Joshi, Pragya Yadav
Kyasanur Forest Disease (KFD) is one of the neglected tick-borne viral zoonoses. KFD virus was initially considered endemic to the Western Ghats region of Karnataka. Still, over the years, there have been reports of its spread to newer areas within and outside Karnataka. The absence of an effective treatment for KFD expedites the need for further research and development of novel vaccines. The present
-
Revealing cancer driver genes through integrative transcriptomic and epigenomic analyses with Moonlight bioRxiv. Bioinform. Pub Date : 2024-03-15 Mona Nourbakhsh, Yuanning Zheng, Humaira Noor, Matteo Tiberti, Olivier Gevaert, Elena Papaleo
Cancer involves dynamic changes caused by (epi)genetic alterations such as mutations or abnormal DNA methylation patterns which occur in cancer driver genes. These driver genes are divided into oncogenes and tumor suppressors depending on their function and mechanism of action. Discovering driver genes in different cancer (sub)types is important not only for increasing current understanding of carcinogenesis
-
ProAffinity-GNN: A Novel Approach to Structure-based Protein-Protein Binding Affinity Prediction via a Curated Dataset and Graph Neural Networks bioRxiv. Bioinform. Pub Date : 2024-03-15 Zhiyuan Zhou, Yueming Yin, Hao Han, Yiping Jia, Jun Hong Koh, Adams Wai-kin Kong, Yuguang Mu
Protein-protein interactions (PPIs) are crucial for understanding biological processes and disease mechanisms, contributing significantly to advances in protein engineering and drug discovery. The accurate determination of binding affinities, essential for decoding PPIs, faces challenges due to the substantial time and financial costs involved in experimental and theoretical methods. This situation
-
HiTaC: a hierarchical taxonomic classifier for fungal ITS sequences compatible with QIIME2 bioRxiv. Bioinform. Pub Date : 2024-03-15 Fábio M. Miranda, Vasco A. C. Azevedo, Rommel T. J. Ramos, Bernhard Y. Renard, Vitor C. Piro
Background: Fungi play a key role in several important ecological functions, ranging from organic matter decomposition to symbiotic associations with plants. Moreover, fungi naturally inhabit the human body and can be beneficial when administered as probiotics. In mycology, the internal transcribed spacer (ITS) region was adopted as the universal marker for classifying fungi. Hence, an accurate and
-
H3-OPT: Accurate prediction of CDR-H3 loop structures of antibodies with deep learning bioRxiv. Bioinform. Pub Date : 2024-03-14 Boxue Tian, Hedi Chen, Xiaoyu Fan, Shuqian Zhu, Yuchan Pei, Xiaochun Zhang, Xiaonan Zhang, Lihang Liu, Feng Qian
Accurate prediction of the structurally diverse complementarity determining region heavy chain 3 (CDR-H3) loop structure remains a primary and long-standing challenge for antibody modeling. Here, we present the H3-OPT toolkit for predicting the 3D structures of monoclonal antibodies and nanobodies. H3-OPT combines the strengths of AlphaFold2 with a pre-trained protein language model, and provides a
-
Rare Copy Number Variant analysis in case-control studies using SNP Array Data: a scalable and automated data analysis pipeline bioRxiv. Bioinform. Pub Date : 2024-03-14 Haydee Artaza, Ksenia Lavrichenko, Anette S.B. Wolff, Ellen C. Røyrvik, Marc Vaudel, Stefan Johansson
Background: Rare copy number variants (CNVs) significantly influence the human genome and may contribute to disease susceptibility. High-throughput SNP genotyping platforms provide data that can be used for CNV detection, but doing so requires complex pipelining of bioinformatic tools. Here, we propose a flexible bioinformatic pipeline for rare CNV analysis from human SNP array data. Results: The pipeline
-
Identification of novel evolutionarily conserved genes and pathways in human and mouse musculoskeletal progenitors bioRxiv. Bioinform. Pub Date : 2024-03-14 Soundrapandian Saravanan, A. S Devika, Abida Islam Pranty, RV Shaji, Raghu Bhushan, James Adjaye, Smita Sudheer
The axial skeletal system and skeletal muscles of vertebrates arise from somites, the blocks of tissue flanking both sides of the neural tube. The progenitors of somites, called the presomitic mesoderm (PSM), reside at the posterior end of a developing embryo. Most of our understanding of these two early developmental stages comes from studies on chick and mouse, and in the recent past,
-
BinDash 2.0: New MinHash Scheme Allows Ultra-fast and Accurate Genome Search and Comparisons bioRxiv. Bioinform. Pub Date : 2024-03-14 Jianshu Zhao, XiaoFei Zhao, Jean Pierre-Both, Konstantinos T. Konstantinidis
Motivation: Comparing large numbers of genomes in terms of their genomic distance is becoming more and more challenging because there is an increasing number of microbial genomes deposited in public databases. Nowadays, we may need to estimate pairwise distances between millions or even billions of genomes. Few software tools can perform such comparisons efficiently. Results: Here we update the multi-threaded
-
Accurate cryo-EM protein particle picking by integrating the foundational AI image segmentation model and specialized U-Net bioRxiv. Bioinform. Pub Date : 2024-03-14 Rajan Gyawali, Ashwin Dhakal, Liguo Wang, Jianlin Cheng
Picking protein particles in cryo-electron microscopy (cryo-EM) micrographs is a crucial step in cryo-EM-based structure determination. However, existing methods trained on a limited amount of cryo-EM data still cannot accurately pick protein particles from noisy cryo-EM images. The general foundational artificial intelligence (AI)-based image segmentation model such as Meta's Segment Anything