-
Data processing solutions to render metabolomics more quantitative: case studies in food and clinical metabolomics using Metabox 2.0 Gigascience (IF 9.2) Pub Date : 2024-03-15 Kwanjeera Wanichthanarak, Ammarin In-on, Sili Fan, Oliver Fiehn, Arporn Wangwiwatsin, Sakda Khoomrung
In classic semiquantitative metabolomics, metabolite intensities are affected by biological factors and other unwanted variations. A systematic evaluation of the data processing methods is crucial to identify adequate processing procedures for a given experimental setup. Current comparative studies are mostly focused on peak area data but not on absolute concentrations. In this study, we evaluated
-
A reference genome of Commelinales provides insights into the commelinids evolution and global spread of water hyacinth (Pontederia crassipes) Gigascience (IF 9.2) Pub Date : 2024-03-14 Yujie Huang, Longbiao Guo, Lingjuan Xie, Nianmin Shang, Dongya Wu, Chuyu Ye, Eduardo Carlos Rudell, Kazunori Okada, Qian-Hao Zhu, Beng-Kah Song, Daguang Cai, Aldo Merotto Junior, Lianyang Bai, Longjiang Fan
Commelinales belongs to the commelinids clade, which also comprises Poales, the order that includes the most important monocot species, such as rice, wheat, and maize. No reference genome of Commelinales is currently available. Water hyacinth (Pontederia crassipes or Eichhornia crassipes), a member of Commelinales, is a devastating aquatic weed, although it is also grown as an ornamental and medical
-
Multi-omic dataset of patient-derived tumor organoids of neuroendocrine neoplasms Gigascience (IF 9.2) Pub Date : 2024-03-07 Nicolas Alcala, Catherine Voegele, Lise Mangiante, Alexandra Sexton-Oates, Hans Clevers, Lynnette Fernandez-Cuesta, Talya L Dayton, Matthieu Foll
Background Organoids are 3-dimensional experimental models that summarize the anatomical and functional structure of an organ. Although a promising experimental model for precision medicine, patient-derived tumor organoids (PDTOs) have currently been developed only for a fraction of tumor types. Results We have generated the first multi-omic dataset (whole-genome sequencing [WGS] and RNA-sequencing
-
Habitat suitability maps for Australian flora and fauna under CMIP6 climate scenarios Gigascience (IF 9.2) Pub Date : 2024-03-05 Carla L Archibald, David M Summers, Erin M Graham, Brett A Bryan
Background Spatial information about the location and suitability of areas for native plant and animal species under different climate futures is an important input to land use and conservation planning and management. Australia, renowned for its abundant species diversity and endemism, often relies on modeled data to assess species distributions due to the country’s vast size and the challenges associated
-
Leveraging citizen science for monitoring urban forageable plants Gigascience (IF 9.2) Pub Date : 2024-03-05 Filipi Miranda Soares, Luís Ferreira Pires, Maria Carolina Garcia, Yamine Bouzembrak, Lidio Coradin, Natalia Pirani Ghilardi-Lopes, Rubens Rangel Silva, Aline Martins de Carvalho, Benildes Coura Moreira dos Santos Maculan, Sheina Koffler, Uiara Bandineli Montedo, Debora Pignatari Drucker, Raquel Santiago, Anand Gavai, Maria Clara Peres de Carvalho, Ana Carolina da Silva Lima, Hillary Dandara Elias
Urbanization brings forth social challenges in emerging countries such as Brazil, encompassing food scarcity, health deterioration, air pollution, and biodiversity loss. Despite this, urban areas like the city of São Paulo still boast ample green spaces, offering opportunities for nature appreciation and conservation, enhancing city resilience and livability. Citizen science is a collaborative endeavor
-
Chromosome-level genome of the poultry shaft louse Menopon gallinae provides insight into the host-switching and adaptive evolution of parasitic lice Gigascience (IF 9.2) Pub Date : 2024-02-19 Ye Xu, Ling Ma, Shanlin Liu, Yanxin Liang, Qiaoqiao Liu, Zhixin He, Li Tian, Yuange Duan, Wanzhi Cai, Hu Li, Fan Song
Background Lice (Psocodea: Phthiraptera) are an important group of parasites that infect birds and mammals. It is believed that the ancestor of parasitic lice originated on the ancient avian host, and ancient mammals acquired these parasites via host-switching from birds. Here we present the first chromosome-level genome of Menopon gallinae in Amblycera (earliest diverging lineage of parasitic lice)
-
The probability of edge existence due to node degree: a baseline for network-based predictions Gigascience (IF 9.2) Pub Date : 2024-02-07 Michael Zietz, Daniel S Himmelstein, Kyle Kloster, Christopher Williams, Michael W Nagle, Casey S Greene
Important tasks in biomedical discovery such as predicting gene functions, gene–disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction
-
FAIR data retrieval for sensitive clinical research data in Galaxy Gigascience (IF 9.2) Pub Date : 2024-01-27 Jasper Ouwerkerk, Helena Rasche, John D Spalding, Saskia Hiltemann, Andrew P Stubbs
Background In clinical research, data have to be accessible and reproducible, but the generated data are becoming larger and the analyses more complex. Here we propose a platform for Findable, Accessible, Interoperable, and Reusable (FAIR) data access and creating reproducible findings. Standardized access to a major genomic repository, the European Genome-Phenome Archive (EGA), has been achieved with API services
-
Toward genome assemblies for all marine vertebrates: current landscape and challenges Gigascience (IF 9.2) Pub Date : 2024-01-27 Emma de Jong, Lara Parata, Philipp E Bayer, Shannon Corrigan, Richard J Edwards
Marine vertebrate biodiversity is fundamental to ocean ecosystem health but is threatened by climate change, overharvesting, and habitat degradation. High-quality reference genomes are valuable foundational scientific resources that can inform conservation efforts. Consequently, global consortia are striving to produce reference genomes for representatives of all life. Here, we summarize the current
-
MMV_Im2Im: an open-source microscopy machine vision toolbox for image-to-image transformation Gigascience (IF 9.2) Pub Date : 2024-01-27 Justin Sonneck, Yu Zhou, Jianxu Chen
Over the past decade, deep learning (DL) research in computer vision has been growing rapidly, with many advances in DL-based image analysis methods for biomedical problems. In this work, we introduce MMV_Im2Im, a new open-source Python package for image-to-image transformation in bioimaging applications. MMV_Im2Im is designed with a generic image-to-image transformation framework that can be used
-
ToxCodAn-Genome: an automated pipeline for toxin-gene annotation in genome assembly of venomous lineages Gigascience (IF 9.2) Pub Date : 2024-01-19 Pedro G Nachtigall, Alan M Durham, Darin R Rokyta, Inácio L M Junqueira-de-Azevedo
Background The rapid development of sequencing technologies resulted in a wide expansion of genomics studies using venomous lineages. This facilitated research focusing on understanding the evolution of adaptive traits and the search for novel compounds that can be applied in agriculture and medicine. However, the toxin annotation of genomes is a laborious and time-consuming task, and no consensus
-
Machine Learning Made Easy (MLme): a comprehensive toolkit for machine learning–driven data analysis Gigascience (IF 9.2) Pub Date : 2024-01-11 Akshay Akshay, Mitali Katoch, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C Burkhard, Rosalyn M Adam, Katia Monastyrskaya, Ali Hashemi Gheinani
Background Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming
-
A graph clustering algorithm for detection and genotyping of structural variants from long reads Gigascience (IF 9.2) Pub Date : 2024-01-11 Nicolás Gaitán, Jorge Duitama
Background Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping is fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have been
-
The chromosome-scale genome of Magnolia sinica (Magnoliaceae) provides insights into the conservation of plant species with extremely small populations (PSESP) Gigascience (IF 9.2) Pub Date : 2024-01-11 Lei Cai, Detuan Liu, Fengmao Yang, Rengang Zhang, Quanzheng Yun, Zhiling Dao, Yongpeng Ma, Weibang Sun
Magnolia sinica (Magnoliaceae) is a highly threatened tree endemic to southeast Yunnan, China. In this study, we generated for the first time a high-quality chromosome-scale genome sequence from M. sinica, by combining Illumina and ONT data with Hi-C scaffolding methods. The final assembled genome size of M. sinica was 1.84 Gb, with a contig N50 of ca. 45 Mb and scaffold N50 of 92 Mb. Identified repeats
-
Computational reproducibility of Jupyter notebooks from biomedical publications Gigascience (IF 9.2) Pub Date : 2024-01-11 Sheeba Samuel, Daniel Mietchen
Background Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale
-
Vulture: cloud-enabled scalable mining of microbial reads in public scRNA-seq data Gigascience (IF 9.2) Pub Date : 2024-01-10 Junyi Chen, Danqing Yin, Harris Y H Wong, Xin Duan, Ken H O Yu, Joshua W K Ho
The rapidly growing collection of public single-cell sequencing data has become a valuable resource for molecular, cellular, and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs
-
A multi-omics data analysis workflow packaged as a FAIR Digital Object Gigascience (IF 9.2) Pub Date : 2024-01-10 Anna Niehues, Casper de Visser, Fiona A Hagenbeek, Purva Kulkarni, René Pool, Naama Karu, Alida S D Kindt, Gurnoor Singh, Robert R J M Vermeiren, Dorret I Boomsma, Jenny van Dongen, Peter A C ’t Hoen, Alain J van Gool
Background Applying good data management and FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects
-
A high-quality chromosomal genome assembly of the sea cucumber Chiridota heheva and its hydrothermal adaptation Gigascience (IF 9.2) Pub Date : 2024-01-04 Yujin Pu, Yang Zhou, Jun Liu, Haibin Zhang
Background Chiridota heheva is a cosmopolitan holothurian well adapted to diverse deep-sea ecosystems, especially chemosynthetic environments. Besides high hydrostatic pressure and limited light, high concentrations of metal ions also represent harsh conditions in hydrothermal environments. Few holothurian species can live in such extreme conditions. Therefore, it is valuable to elucidate the adaptive
-
Evolutionary genomics of three agricultural pest moths reveals rapid evolution of host adaptation and immune-related genes Gigascience (IF 9.2) Pub Date : 2024-01-02 Yi-Ming Weng, Pathour R Shashank, R Keating Godfrey, David Plotkin, Brandon M Parker, Tyler Wist, Akito Y Kawahara
Background Understanding the genotype of pest species provides an important baseline for designing integrated pest management (IPM) strategies. Recently developed long-read sequence technologies make it possible to compare genomic features of nonmodel pest species to disclose the evolutionary path underlying the pest species profiles. Here we sequenced and assembled genomes for 3 agricultural pest
-
Chromosome-level genome assembly of the Pacific geoduck Panopea generosa reveals major inter- and intrachromosomal rearrangements and substantial expansion of the copine gene family Gigascience (IF 9.2) Pub Date : 2023-12-19 Jing Wang, Qing Xu, Min Chen, Yang Chen, Chunde Wang, Nansheng Chen
The Pacific geoduck Panopea generosa (class Bivalvia, order Adapedonta, family Hiatellidae, genus Panopea) is the largest known burrowing bivalve with considerable commercial value. Pacific geoduck and other geoduck clams play important roles in maintaining ecosystem health through their filter-feeding habit and their coupling of pelagic and benthic processes. Here, we report a high-quality chromosome-level genome
-
DrugSim2DR: systematic prediction of drug functional similarities in the context of specific disease for drug repurposing Gigascience (IF 9.2) Pub Date : 2023-12-19 Jiashuo Wu, Ji Li, Yalan He, Junling Huang, Xilong Zhao, Bingyue Pan, Yahui Wang, Liang Cheng, Junwei Han
Background Traditional approaches to drug development are costly and involve high risks. The drug repurposing approach can be a valuable alternative to traditional approaches and has therefore received considerable attention in recent years. Findings Herein, we develop a previously undescribed computational approach, called DrugSim2DR, which uses a network diffusion algorithm to identify candidate
-
DriverMP enables improved identification of cancer driver genes Gigascience (IF 9.2) Pub Date : 2023-12-13 Yangyang Liu, Jiyun Han, Tongxin Kong, Nannan Xiao, Qinglin Mei, Juntao Liu
Background Cancer is widely regarded as a complex disease primarily driven by genetic mutations. A critical concern and significant obstacle lies in discerning driver genes amid an extensive array of passenger genes. Findings We present a new method termed DriverMP for effectively prioritizing altered genes on a cancer-type level by considering mutated gene pairs. It is designed to first apply nonsilent
-
Imputation method for single-cell RNA-seq data using neural topic model Gigascience (IF 9.2) Pub Date : 2023-11-24 Yueyang Qi, Shuangkai Han, Lin Tang, Lin Liu
Single-cell RNA sequencing (scRNA-seq) technology studies transcriptome and cell-to-cell differences from higher single-cell resolution and different perspectives. Despite the advantage of high capture efficiency, downstream functional analysis of scRNA-seq data is made difficult by the excess of zero values (i.e., the dropout phenomenon). To effectively address this problem, we introduced scNTImpute
-
Evaluating long-read de novo assembly tools for eukaryotic genomes: insights and considerations Gigascience (IF 9.2) Pub Date : 2023-11-24 Bianca-Maria Cosma, Ramin Shirali Hossein Zade, Erin Noel Jordan, Paul van Lent, Chengyao Peng, Stephanie Pillay, Thomas Abeel
Background Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS)
-
MetGENE: gene-centric metabolomics information retrieval tool Gigascience (IF 9.2) Pub Date : 2023-11-20 Sumana Srinivasan, Mano R Maurya, Srinivasan Ramachandran, Eoin Fahy, Shankar Subramaniam
Background Biomedical research often involves contextual integration of multimodal and multiomic data in search of mechanisms for improved diagnosis, treatment, and monitoring. Researchers need to access information from diverse sources, comprising data in various and sometimes incongruent formats. The downstream processing of the data to decipher mechanisms by reconstructing networks and developing
-
Variability analysis of LC-MS experimental factors and their impact on machine learning Gigascience (IF 9.2) Pub Date : 2023-11-20 Tobias Greisager Rehfeldt, Konrad Krawczyk, Simon Gregersen Echers, Paolo Marcatili, Pawel Palczynski, Richard Röttger, Veit Schwämmle
Background Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying
-
A community resource to mass explore the wheat grain proteome and its application to the late-maturity alpha-amylase (LMA) problem Gigascience (IF 9.2) Pub Date : 2023-11-06 Delphine Vincent, AnhDuyen Bui, Vilnis Ezernieks, Saleh Shahinfar, Timothy Luke, Doris Ram, Nicholas Rigas, Joe Panozzo, Simone Rochfort, Hans Daetwyler, Matthew Hayden
Background Late-maturity alpha-amylase (LMA) is a wheat genetic defect causing the synthesis of high isoelectric point alpha-amylase following a temperature shock during mid-grain development or prolonged cold throughout grain development, both leading to starch degradation. While the physiology is well understood, the biochemical mechanisms involved in grain LMA response remain unclear. We have applied
-
Finding haplotypic signatures in proteins Gigascience (IF 9.2) Pub Date : 2023-11-03 Jakub Vašíček, Dafni Skiadopoulou, Ksenia G Kuznetsova, Bo Wen, Stefan Johansson, Pål R Njølstad, Stefan Bruckner, Lukas Käll, Marc Vaudel
Background The nonrandom distribution of alleles of common genomic variants produces haplotypes, which are fundamental in medical and population genetic studies. Consequently, protein-coding genes with different co-occurring sets of alleles can encode different amino acid sequences: protein haplotypes. These protein haplotypes are present in biological samples and detectable by mass spectrometry, but
-
epialleleR: an R/Bioconductor package for sensitive allele-specific methylation analysis in NGS data Gigascience (IF 9.2) Pub Date : 2023-11-03 Oleksii Nikolaienko, Per Eystein Lønning, Stian Knappskog
Low-level mosaic epimutations within the BRCA1 gene promoter occur in 5–8% of healthy individuals and are associated with a significantly elevated risk of breast and ovarian cancer. Similar events may also affect other tumor suppressor genes, potentially being a significant contributor to cancer burden. While this opens a new area for translational research, detection of low-level mosaic epigenetic
-
metaGOflow: a workflow for the analysis of marine Genomic Observatories shotgun metagenomics data Gigascience (IF 9.2) Pub Date : 2023-10-18 Haris Zafeiropoulos, Martin Beracochea, Stelios Ninidakis, Katrina Exter, Antonis Potirakis, Gianluca De Moro, Lorna Richardson, Erwan Corre, João Machado, Evangelos Pafilis, Georgios Kotoulas, Ioulia Santi, Robert D Finn, Cymon J Cox, Christina Pavloudi
Background Genomic Observatories (GOs) are sites of long-term scientific study that undertake regular assessments of the genomic biodiversity. The European Marine Omics Biodiversity Observation Network (EMO BON) is a network of GOs that conduct regular biological community samplings to generate environmental and metagenomic data of microbial communities from designated marine stations around Europe
-
Single-cell transcriptome analysis illuminating the characteristics of species-specific innate immune responses against viral infections Gigascience (IF 9.2) Pub Date : 2023-10-17 Hirofumi Aso, Jumpei Ito, Haruka Ozaki, Yukie Kashima, Yutaka Suzuki, Yoshio Koyanagi, Kei Sato
Background Bats harbor various viruses without severe symptoms and act as their natural reservoirs. The tolerance of bats against viral infections is assumed to originate from the uniqueness of their immune system. However, how immune responses vary between primates and bats remains unclear. Here, we characterized differences in the immune responses by peripheral blood mononuclear cells to various
-
Katdetectr: an R/bioconductor package utilizing unsupervised changepoint analysis for robust kataegis detection Gigascience (IF 9.2) Pub Date : 2023-10-17 Daan M Hazelaar, Job van Riet, Youri Hoogstrate, Harmen J G van de Werken
Background Kataegis refers to the occurrence of regional genomic hypermutation in cancer and is a phenomenon that has been observed in a wide range of malignancies. A kataegis locus constitutes a genomic region with a high mutation rate (i.e., a higher frequency of closely interspersed somatic variants than the overall mutational background). It has been shown that kataegis is of biological significance
-
simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods Gigascience (IF 9.2) Pub Date : 2023-10-17 Chakravarthi Kanduri, Lonneke Scheffer, Milena Pavlović, Knut Dagestad Rand, Maria Chernigovskaya, Oz Pirvandy, Gur Yaari, Victor Greiff, Geir K Sandve
Background Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets result in the generation of naive repertoires missing
-
GERONIMO: A tool for systematic retrieval of structural RNAs in a broad evolutionary context Gigascience (IF 9.2) Pub Date : 2023-10-17 Agata M Kilar, Petr Fajkus, Jiří Fajkus
Background While web-based tools such as BLAST have made identifying conserved gene homologs appear easy, genes with variable sequences pose significant challenges. Functionally important noncoding RNAs (ncRNA) often show low sequence conservation due to genetic variations, including insertions and deletions. Rather than conserved sequences, these RNAs possess highly conserved structural features across
-
Chromosome-level reference genome of tetraploid Isoetes sinensis provides insights into evolution and adaption of lycophytes Gigascience (IF 9.2) Pub Date : 2023-09-30 Jinteng Cui, Yunke Zhu, Hai Du, Zhenhua Liu, Siqian Shen, Tongxin Wang, Wenwen Cui, Rong Zhang, Sanjie Jiang, Yanmin Wu, Xiaofeng Gu, Hao Yu, Zhe Liang
Background The Lycophyta species are the extant taxa most similar to early vascular plants that were once abundant on Earth. However, their distribution has greatly diminished. So far, the absence of chromosome-level assembled lycophyte genomes has hindered our understanding of evolution and environmental adaption of lycophytes. Findings We present the reference genome of the tetraploid aquatic quillwort
-
Allele-specific regulatory effects on the pig transcriptome Gigascience (IF 9.2) Pub Date : 2023-09-30 Yu Lin, Jing Li, Li Chen, Jingyi Bai, Jiaman Zhang, Yujie Wang, Pengliang Liu, Keren Long, Liangpeng Ge, Long Jin, Yiren Gu, Mingzhou Li
Background Allele-specific expression (ASE) refers to the preferential expression of one allele over the other and contributes to adaptive phenotypic plasticity. Here, we used a reciprocal cross-model between phenotypically divergent European Berkshire and Asian Tibetan pigs to characterize 2 ASE classes: imprinting (i.e., the unequal expression between parental alleles) and sequence dependent (i.e
-
Confound-leakage: confound removal in machine learning leads to leakage Gigascience (IF 9.2) Pub Date : 2023-09-30 Sami Hamdan, Bradley C Love, Georg G von Polier, Susanne Weis, Holger Schwender, Simon B Eickhoff, Kaustubh R Patil
Background Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding
-
Container Profiler: Profiling resource utilization of containerized big data pipelines Gigascience (IF 9.2) Pub Date : 2023-08-25 Varik Hoang, Ling-Hong Hung, David Perez, Huazeng Deng, Raymond Schooley, Niharika Arumilli, Ka Yee Yeung, Wes Lloyd
Background This article presents the Container Profiler, a software tool that measures and records the resource usage of any containerized task. Our tool profiles the CPU, memory, disk, and network utilization of containerized tasks collecting over 60 Linux operating system metrics at the virtual machine, container, and process levels. The Container Profiler supports performing time-series profiling
-
GADMA2: more efficient and flexible demographic inference from genetic data Gigascience (IF 9.2) Pub Date : 2023-08-23 Ekaterina Noskova, Nikita Abramov, Stanislav Iliutkin, Anton Sidorin, Pavel Dobrynin, Vladimir I Ulyantsev
Background Inference of complex demographic histories is a source of information about events that happened in the past of studied populations. Existing methods for demographic inference typically require input from the researcher in the form of a parameterized model. With an increased variety of methods and tools, each with its own interface, the model specification becomes tedious and error-prone
-
KGML-xDTD: a knowledge graph–based machine learning framework for drug treatment prediction and mechanism description Gigascience (IF 9.2) Pub Date : 2023-08-21 Chunyu Ma, Zhihan Zhou, Han Liu, David Koslicki
Background Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery approaches. However, the underlying mechanisms of action
-
Lessons learned to boost a bioinformatics knowledge base reusability, the Bgee experience Gigascience (IF 9.2) Pub Date : 2023-08-17 Tarcisio Mendes de Farias, Julien Wollbrett, Marc Robinson-Rechavi, Frederic Bastian
Background Enhancing interoperability of bioinformatics knowledge bases is a high-priority requirement to maximize data reusability and thus increase their utility such as the return on investment for biomedical research. A knowledge base may provide useful information for life scientists and other knowledge bases, but it only acquires exchange value once the knowledge base is (re)used, and without
-
Chromosome-level genome and recombination map of the male buffalo Gigascience (IF 9.2) Pub Date : 2023-08-17 Xiaobo Wang, Zhipeng Li, Tong Feng, Xier Luo, Lintao Xue, Chonghui Mao, Kuiqing Cui, Hui Li, Jieping Huang, Kongwei Huang, Saif-ur Rehman, Deshun Shi, Dongdong Wu, Jue Ruan, Qingyou Liu
Background The swamp buffalo (Bubalus bubalis carabanesis) is an economically important livestock species supplying milk, meat, leather, and draft power. Several female buffalo genomes are available, but the lack of high-quality male genomes hinders studies on chromosome evolution, especially Y, as well as meiotic recombination. Results Here, a chromosome-level genome with a contig N50 of 72.2 Mb and
-
Genome resequencing reveals independent domestication and breeding improvement of naked oat Gigascience (IF 9.2) Pub Date : 2023-08-01 Jinsheng Nan, Yu Ling, Jianghong An, Ting Wang, Mingna Chai, Jun Fu, Gaochao Wang, Cai Yang, Yan Yang, Bing Han
As an important cereal crop, common oat has attracted more and more attention due to its healthy nutritional components and bioactive compounds. Here, high-depth resequencing of 115 oat accessions and closely related hexaploid species worldwide was performed. Based on genetic diversity and linkage disequilibrium analysis, it was found that hulled oat (Avena sativa) experienced a more severe bottleneck
-
BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale Gigascience (IF 9.2) Pub Date : 2023-07-31 César Piñeiro, Juan C Pichel
Background High-throughput sequencing technologies have led to an unprecedented explosion in the amounts of sequencing data available, which are typically stored using FASTA and FASTQ files. We can find in the literature several tools to process and manipulate those types of files with the aim of transforming sequence data into biological knowledge. However, none of them are well fitted for processing
-
Metaphor—A workflow for streamlined assembly and binning of metagenomes Gigascience (IF 9.2) Pub Date : 2023-07-31 Vinícius W Salazar, Babak Shaban, Maria del Mar Quiroga, Robert Turnbull, Edoardo Tescari, Vanessa Rossetto Marcelino, Heroen Verbruggen, Kim-Anh Lê Cao
Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyze genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational
-
Hetnet connectivity search provides rapid insights into how biomedical entities are related Gigascience (IF 9.2) Pub Date : 2023-07-28 Daniel S Himmelstein, Michael Zietz, Vincent Rubinetti, Kyle Kloster, Benjamin J Heil, Faisal Alquaddoomi, Dongbo Hu, David N Nicholson, Yun Hao, Blair D Sullivan, Michael W Nagle, Casey S Greene
Background Hetnets, short for “heterogeneous networks,” contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes—including genes, diseases, drugs, pathways, and anatomical structures—with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such
-
Health record hiccups—5,526 real-world time series with change points labelled by crowdsourced visual inspection Gigascience (IF 9.2) Pub Date : 2023-07-28 T Phuong Quan, Ben Lacey, Tim E A Peto, A Sarah Walker
Background Large routinely collected data such as electronic health records (EHRs) are increasingly used in research, but the statistical methods and processes used to check such data for temporal data quality issues have not moved beyond manual, ad hoc production and visual inspection of graphs. With the prospect of EHR data being used for disease surveillance via automated pipelines and public-facing
-
SODAR: managing multiomics study data and metadata Gigascience (IF 9.2) Pub Date : 2023-07-27 Mikko Nieminen, Oliver Stolpe, Mathias Kuhring, January Weiner, Patrick Pett, Dieter Beule, Manuel Holtgrewe
Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar
-
Genomic analyses provide insights into the evolution and salinity adaptation of halophyte Tamarix chinensis Gigascience (IF 9.2) Pub Date : 2023-07-26 Jian Ning Liu, Hongcheng Fang, Qiang Liang, Yuhui Dong, Changxi Wang, Liping Yan, Xinmei Ma, Rui Zhou, Xinya Lang, Shasha Gai, Lichang Wang, Shengyi Xu, Ke Qiang Yang, Dejun Wu
Background The woody halophyte Tamarix chinensis is a pioneer tree species in the coastal wetland ecosystem of northern China, exhibiting high resistance to salt stress. However, the genetic information underlying salt tolerance in T. chinensis remains to be seen. Here we present a genomic investigation of T. chinensis to elucidate the underlying mechanism of its high resistance to salinity. Results
-
MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction Gigascience (IF 9.2) Pub Date : 2023-07-25 Wenhuan Zeng, Anupam Gautam, Daniel H Huson
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy
-
Identification of sex chromosomes and primary sex ratio in the small hive beetle, a worldwide parasite of honey bees Gigascience (IF 9.2) Pub Date : 2023-07-25 Qiang Huang, Sheina B Sim, Scott M Geib, Anna Childers, Junfeng Liu, Xiuxiu Wei, Wensu Han, Francisco Posada-Florez, Allen Z Xue, Zheng Li, Jay D Evans
Background The small hive beetle (SHB), Aethina tumida, has emerged as a worldwide threat to honey bees in the past two decades. These beetles harvest nest resources, feed on larval bees, and ultimately spoil nest resources with gelatinous slime together with the fungal symbiont Kodamaea ohmeri. Results Here, we present the first chromosome-level genome assembly for the SHB. With a 99.1% representation
-
A new haplotype-resolved turkey genome to enable turkey genetics and genomics research Gigascience (IF 9.2) Pub Date : 2023-07-21 Carolina P Barros, Martijn F L Derks, Jeff Mohr, Benjamin J Wood, Richard P M A Crooijmans, Hendrik-Jan Megens, Marco C A M Bink, Martien A M Groenen
Background The domesticated turkey (Meleagris gallopavo) is a species of significant agricultural importance and is the second largest contributor, behind broiler chickens, to world poultry meat production. The previous genome is of draft quality and partly based on the chicken (Gallus gallus) genome. A high-quality reference genome of M. gallopavo is essential for turkey genomics and genetics research
-
Genome assemblies of Vigna reflexo-pilosa (créole bean) and its progenitors, Vigna hirtella and Vigna trinervia, revealed homoeolog expression bias and expression-level dominance in the allotetraploid Gigascience (IF 9.2) Pub Date : 2023-07-20 Wirulda Pootakham, Chutima Sonthirod, Chaiwat Naktang, Chutintorn Yundaeng, Thippawan Yoocha, Wasitthee Kongkachana, Duangjai Sangsrakru, Prakit Somta, Sithichoke Tangphatsornruang
Vigna reflexo-pilosa (créole bean) is a wild legume belonging to the subgenus Ceratotropis and is widely distributed in Asia. Créole bean is the only tetraploid species in the genus Vigna, and it has been shown to derive from the hybridization of Vigna hirtella and Vigna trinervia. In this study, we combined the long-read PacBio technology with the chromatin contact mapping (Hi-C) technique to obtain
-
Data management strategy for a collaborative research center Gigascience (IF 9.2) Pub Date : 2023-07-04 Deepti Mittal, Rebecca Mease, Thomas Kuner, Herta Flor, Rohini Kuner, Jamila Andoh
The importance of effective research data management (RDM) strategies to support the generation of Findable, Accessible, Interoperable, and Reusable (FAIR) neuroscience data grows with each advance in data acquisition techniques and research methods. To maximize the impact of diverse research strategies, multidisciplinary, large-scale neuroscience research consortia face a number of unsolved challenges
-
Training Infrastructure as a Service Gigascience (IF 9.2) Pub Date : 2023-07-03 Helena Rasche, Cameron Hyde, John Davis, Simon Gladman, Nate Coraor, Anthony Bretaudeau, Gianmauro Cuccuru, Wendi Bacon, Beatriz Serrano-Solano, Jennifer Hillman-Jackson, Saskia Hiltemann, Miaomiao Zhou, Björn Grüning, Andrew Stubbs
Background Hands-on training, whether in bioinformatics or other domains, often requires significant technical resources and knowledge to set up and run. Instructors must have access to powerful compute infrastructure that can support resource-intensive jobs running efficiently. Often this is achieved using a private server where there is no contention for the queue. However, this places a significant
-
Efficient real-time selective genome sequencing on resource-constrained devices Gigascience (IF 9.2) Pub Date : 2023-07-03 Po Jui Shih, Hassaan Saadat, Sri Parameswaran, Hasindu Gamaarachchi
Background Third-generation nanopore sequencers offer selective sequencing or “Read Until” that allows genomic reads to be analyzed in real time and abandoned halfway if not belonging to a genomic region of “interest.” This selective sequencing opens the door to important applications such as rapid and low-cost genetic tests. The latency in analyzing should be as low as possible for selective sequencing
-
Deep neural networks with knockoff features identify nonlinear causal relations and estimate effect sizes in complex biological systems Gigascience (IF 9.2) Pub Date : 2023-07-03 Zhenjiang Fan, Kate F Kernan, Aditya Sriram, Panayiotis V Benos, Scott W Canna, Joseph A Carcillo, Soyeon Kim, Hyun Jung Park
Background Learning the causal structure helps identify risk factors, disease mechanisms, and candidate therapeutics for complex diseases. However, although complex biological systems are characterized by nonlinear associations, existing bioinformatic methods of causal inference cannot identify the nonlinear relationships and estimate their effect size. Results To overcome these limitations, we developed
-
Identification of transcriptional regulatory variants in pig duodenum, liver, and muscle tissues Gigascience (IF 9.2) Pub Date : 2023-06-25 Daniel Crespo-Piazuelo, Hervé Acloque, Olga González-Rodríguez, Mayrone Mongellaz, Marie-José Mercat, Marco C A M Bink, Abe E Huisman, Yuliaxis Ramayo-Caldas, Juan Pablo Sánchez, Maria Ballester
Background In humans and livestock species, genome-wide association studies (GWAS) have been applied to study the association between variants distributed across the genome and a phenotype of interest. To discover genetic polymorphisms affecting the duodenum, liver, and muscle transcriptomes of 300 pigs from 3 different breeds (Duroc, Landrace, and Large White), we performed expression GWAS between
-
EraSOR: a software tool to eliminate inflation caused by sample overlap in polygenic score analyses Gigascience (IF 9.2) Pub Date : 2023-06-16 Shing Wan Choi, Timothy Shin Heng Mak, Clive J Hoggart, Paul F O'Reilly
Background Polygenic risk score (PRS) analyses are now routinely applied across biomedical research. However, as PRS studies grow in size, there is an increased risk of sample overlap between the genome-wide association study (GWAS) from which the PRS is derived and the “target sample,” in which PRSs are computed and hypotheses are tested. Despite the wide recognition of the sample overlap problem
-
Multiview child motor development dataset for AI-driven assessment of child development Gigascience (IF 9.2) Pub Date : 2023-05-27 Hye Hyeon Kim, Jin Yong Kim, Bong Kyung Jang, Joo Hyun Lee, Jong Hyun Kim, Dong Hoon Lee, Hee Min Yang, Young Jo Choi, Myung Jun Sung, Tae Jun Kang, Eunah Kim, Yang Seong Oh, Jaehyun Lim, Soon-Beom Hong, Kiok Ahn, Chan Lim Park, Soon Myeong Kwon, Yu Rang Park
Background Children's motor development is a crucial tool for assessing developmental levels, identifying developmental disorders early, and taking appropriate action. Although the Korean Developmental Screening Test for Infants and Children (K-DST) can accurately assess childhood development, its dependence on parental surveys rather than reliable, professional observation limits it. This study constructed