Current journal: GigaScience
  • Telescope: an interactive tool for managing large-scale analysis from mobile devices
    Gigascience (IF 4.688) Pub Date : 2020-01-23
    Brito J, Mosqueiro T, Rotman J, et al.

    Background: In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone. Results: To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real time. By leveraging latest-generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood. Conclusions: Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.

  • A highly contiguous genome assembly of the bat hawkmoth Hyles vespertilio (Lepidoptera: Sphingidae)
    Gigascience (IF 4.688) Pub Date : 2020-01-23
    Pippel M, Jebb D, Patzold F, et al.

    Background: Adapted to different ecological niches, moth species belonging to the Hyles genus exhibit a spectacular diversity of larval color patterns. These species diverged ∼7.5 million years ago, making this rather young genus an interesting system to study a wide range of questions including the process of speciation, ecological adaptation, and adaptive radiation. Results: Here we present a high-quality genome assembly of the bat hawkmoth Hyles vespertilio, the first reference genome of a member of the Hyles genus. We generated 51× Pacific Biosciences long reads with an average read length of 8.9 kb. Pacific Biosciences reads longer than 4 kb were assembled into contigs, resulting in a 651.4-Mb assembly consisting of 530 contigs with an N50 value of 7.5 Mb. The circular mitochondrial contig has a length of 15,303 bp. The H. vespertilio genome is very repeat-rich and exhibits a higher repeat content (50.3%) than other Bombycoidea species such as Bombyx mori (45.7%) and Manduca sexta (27.5%). We developed a comprehensive gene annotation workflow to obtain consensus gene models from different evidence including gene projections, protein homology, transcriptome data, and ab initio predictions. The resulting gene annotation is highly complete with 94.5% of BUSCO genes being completely present, which is higher than the BUSCO completeness of the B. mori (92.2%) and M. sexta (90%) annotations. Conclusions: Our gene annotation strategy has general applicability to other genomes, and the H. vespertilio genome provides a valuable molecular resource to study a range of questions in this genus, including phylogeny, incomplete lineage sorting, speciation, and hybridization. A genome browser displaying the genome, alignments, and annotations is available at https://genome-public.pks.mpg.de/cgi-bin/hgTracks?db=HLhylVes1.
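
    Several entries in this listing report a contig or scaffold N50. As a reading aid, the statistic can be sketched as follows. This is a minimal illustration using the common definition (the length of the contig at which the cumulative sum of contigs, sorted from longest to shortest, first reaches half of the total assembly length); the function name `n50` and the toy lengths are assumptions for illustration and are not taken from any of the listed papers.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Common definition (assumed here): sort contigs from longest to
    shortest and return the length of the contig at which the running
    total first reaches at least half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy example with made-up contig lengths (arbitrary units).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

    Sorting first keeps the sketch short; for assemblies with only hundreds to thousands of contigs the cost is negligible.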

  • Compartment and hub definitions tune metabolic networks for metabolomic interpretations
    Gigascience (IF 4.688) Pub Date : 2020-01-23
    Waller T, Berg J, Lex A, et al.

    Background: Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments. Results: We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications. Conclusions: Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has potential to identify differential regulation of individual genes, transcripts, and proteins.
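
    The effect of hubs on density and modularity described in this abstract can be illustrated on a toy graph. The sketch below is not the authors' tool: it assumes an undirected graph, picks hubs with a simple degree threshold (the paper may define hubs differently), and uses the number of connected components as a crude stand-in for modularity. All names and the toy network are hypothetical.

```python
from collections import defaultdict


def degrees(edges):
    """Count how many edges touch each node of an undirected graph."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def density(nodes, edges):
    """Undirected graph density: 2E / (N * (N - 1))."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))


def remove_hubs(nodes, edges, min_degree):
    """Drop every node whose degree is >= min_degree (assumed hub rule)."""
    deg = degrees(edges)
    hubs = {n for n in nodes if deg.get(n, 0) >= min_degree}
    kept_nodes = set(nodes) - hubs
    kept_edges = {(u, v) for u, v in edges if u not in hubs and v not in hubs}
    return kept_nodes, kept_edges


def count_components(nodes, edges):
    """Count connected components with an iterative depth-first search."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


# Toy network: one hub ("hub") linked to four metabolites that otherwise
# form two separate pairs. Node names are placeholders.
toy_nodes = {"hub", "a", "b", "c", "d"}
toy_edges = {("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
             ("a", "b"), ("c", "d")}

no_hub_nodes, no_hub_edges = remove_hubs(toy_nodes, toy_edges, min_degree=4)
print(density(toy_nodes, toy_edges), count_components(toy_nodes, toy_edges))
print(density(no_hub_nodes, no_hub_edges),
      count_components(no_hub_nodes, no_hub_edges))
```

    In this toy case the hub raises density and merges the two pairs into one component, which matches the direction of the effect reported in the abstract.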

  • 3D revelation of phenotypic variation, evolutionary allometry, and ancestral states of corolla shape: a case study of clade Corytholoma (subtribe Ligeriinae, family Gesneriaceae)
    Gigascience (IF 4.688) Pub Date : 2020-01-22
    Hsu H, Chou W, Kuo Y.

    Background: Quantification of corolla shape variations helps biologists to investigate plant diversity and evolution. 3D images capture the genuine structure and provide comprehensive spatial information. Results: This study applied X-ray micro-computed tomography (µCT) to acquire 3D structures of the corollas of clade Corytholoma and extracted a set of 415 3D landmarks from each specimen. By applying geometric morphometrics (GM) to the landmarks, the first 4 principal components (PCs) in the 3D shape and 3D form analyses, respectively, accounted for 87.86% and 96.34% of the total variance. The centroid sizes of the corollas only accounted for 5.46% of the corolla shape variation, suggesting that the evolutionary allometry was weak. The 4 morphological traits corresponding to the 4 shape PCs were defined as tube curvature, lobe area, tube dilation, and lobe recurvation. Tube curvature and tube dilation were strongly associated with the pollination type and contained phylogenetic signals in clade Corytholoma. The landmarks were further used to reconstruct corolla shapes at the ancestral states. Conclusions: With the integration of µCT imaging into GM, the proposed approach boosted the precision in quantifying corolla traits and improved the understanding of the morphological traits corresponding to the pollination type, impact of size on shape variation, and evolution of corolla shape in clade Corytholoma.

  • A high-quality chromosomal genome assembly of Diospyros oleifera Cheng
    Gigascience (IF 4.688) Pub Date : 2020-01-16
    Suo Y, Sun P, Cheng H, et al.

    Background: Diospyros oleifera Cheng, of the family Ebenaceae, is an economically important tree. Phylogenetic analyses indicate that D. oleifera is closely related to Diospyros kaki Thunb. and could be used as a model plant for studies of D. kaki. Therefore, development of genomic resources of D. oleifera will facilitate auxiliary assembly of the hexaploid persimmon genome and elucidate the molecular mechanisms of important traits. Findings: The D. oleifera genome was assembled with 443.6 Gb of raw reads using the Pacific Bioscience Sequel and Illumina HiSeq X Ten platforms. The final draft genome was ∼812.3 Mb and had a high level of continuity with N50 of 3.36 Mb. Fifteen scaffolds corresponding to the 15 chromosomes were assembled to a final size of 721.5 Mb using 332 scaffolds, accounting for 88.81% of the genome. Repeat sequences accounted for 54.8% of the genome. By de novo sequencing and analysis of homology with other plant species, 30,530 protein-coding genes with an average transcript size of 7,105.40 bp were annotated; of these, 28,580 protein-coding genes (93.61%) had conserved functional motifs or terms. In addition, 171 candidate genes involved in tannin synthesis and deastringency in persimmon were identified; of these, chalcone synthase (CHS) genes were expanded in the D. oleifera genome compared with Diospyros lotus, Camellia sinensis, and Vitis vinifera. Moreover, 186 positively selected genes were identified, including the chalcone isomerase (CHI) gene, which encodes a key enzyme in the flavonoid-anthocyanin pathway. Phylogenetic tree analysis indicated that the split of D. oleifera and D. lotus likely occurred 9.0 million years ago. In addition to the ancient γ event, a second whole-genome duplication event occurred in D. oleifera and D. lotus. Conclusions: We generated a high-quality chromosome-level draft genome for D. oleifera, which will facilitate assembly of the hexaploid persimmon genome and further studies of major economic traits in the genus Diospyros.

  • Corrigendum to: Bipartite graphs in systems biology and medicine: a survey of methods and applications
    Gigascience (IF 4.688) Pub Date : 2020-01-20

    Georgios A Pavlopoulos, Panagiota I Kontou, Athanasia Pavlopoulou, Costas Bouyioukos, Evripides Markou, Pantelis G Bagos GigaScience, Volume 7, Issue 4, 1 April 2018, giy014, https://doi.org/10.1093/gigascience/giy014.

  • A draft genome sequence of the elusive giant squid, Architeuthis dux
    Gigascience (IF 4.688) Pub Date : 2020-01-16
    da Fonseca R, Couto A, Machado A, et al.

    Background: The giant squid (Architeuthis dux; Steenstrup, 1857) is an enigmatic giant mollusc with a circumglobal distribution in the deep ocean, except in the high Arctic and Antarctic waters. The elusiveness of the species makes it difficult to study. Thus, having a genome assembled for this deep-sea–dwelling species will allow several pending evolutionary questions to be unlocked. Findings: We present a draft genome assembly that includes 200 Gb of Illumina reads, 4 Gb of Moleculo synthetic long reads, and 108 Gb of Chicago libraries, with a final size matching the estimated genome size of 2.7 Gb, and a scaffold N50 of 4.8 Mb. We also present an alternative assembly including 27 Gb raw reads generated using the Pacific Biosciences platform. In addition, we sequenced the proteome of the same individual and RNA from 3 different tissue types from 3 other species of squid (Onychoteuthis banksii, Dosidicus gigas, and Sthenoteuthis oualaniensis) to assist genome annotation. We annotated 33,406 protein-coding genes supported by evidence, and the genome completeness estimated by BUSCO reached 92%. Repetitive regions cover 49.17% of the genome. Conclusions: This annotated draft genome of A. dux provides a critical resource to investigate the unique traits of this species, including its gigantism and key adaptations to deep-sea environments.

  • Multifaceted Hi-C benchmarking: what makes a difference in chromosome-scale genome scaffolding?
    Gigascience (IF 4.688) Pub Date : 2020-01-10
    Kadota M, Nishimura O, Miura H, et al.

    Background: Hi-C is derived from chromosome conformation capture (3C) and targets chromatin contacts on a genomic scale. This method has also been used frequently in scaffolding nucleotide sequences obtained by de novo genome sequencing and assembly, in which the number of resultant sequences rarely converges to the chromosome number. Despite its prevalent use, the sample preparation methods for Hi-C have not been intensively discussed, especially from the standpoint of genome scaffolding. Results: To gain insight into the best practice of Hi-C scaffolding, we performed a multifaceted methodological comparison using vertebrate samples and optimized various factors during sample preparation, sequencing, and computation. As a result, we identified several key factors that helped improve Hi-C scaffolding, including the choice and preparation of tissues, library preparation conditions, the choice of restriction enzyme(s), and the choice of scaffolding program and its usage. Conclusions: This study provides the first comparison of multiple sample preparation kits/protocols and computational programs for Hi-C scaffolding by an academic third party. We introduce a customized protocol designated “inexpensive and controllable Hi-C (iconHi-C) protocol,” which incorporates the optimal conditions identified in this study, and demonstrate this technique on chromosome-scale genome sequences of the Chinese softshell turtle Pelodiscus sinensis.

  • CAMITAX: Taxon labels for microbial genomes
    Gigascience (IF 4.688) Pub Date : 2020-01-07
    Bremges A, Fritz A, McHardy A.

    Background: The number of microbial genome sequences is increasing exponentially, especially thanks to recent advances in recovering complete or near-complete genomes from metagenomes and single cells. Assigning reliable taxon labels to genomes is key and often a prerequisite for downstream analyses. Findings: We introduce CAMITAX, a scalable and reproducible workflow for the taxonomic labelling of microbial genomes recovered from isolates, single cells, and metagenomes. CAMITAX combines genome distance–, 16S ribosomal RNA gene–, and gene homology–based taxonomic assignments with phylogenetic placement. It uses Nextflow to orchestrate reference databases and software containers and thus combines ease of installation and use with computational reproducibility. We evaluated the method on several hundred metagenome-assembled genomes with high-quality taxonomic annotations from the TARA Oceans project, and we show that the ensemble classification method in CAMITAX improved on all individual methods across tested ranks. Conclusions: While we initially developed CAMITAX to aid the Critical Assessment of Metagenome Interpretation (CAMI) initiative, it evolved into a comprehensive software package to reliably assign taxon labels to microbial genomes. CAMITAX is available under Apache License 2.0 at https://github.com/CAMI-challenge/CAMITAX.

  • A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers
    Gigascience (IF 4.688) Pub Date : 2020-01-03
    Hecker N, Hiller M.

    Background: Multiple alignments of mammalian genomes have been the basis of many comparative genomic studies aiming at annotating genes, detecting regions under evolutionary constraint, and studying genome evolution. A key factor that affects the power of comparative analyses is the number of species included in a genome alignment. Results: To utilize the increased number of sequenced genomes and to provide an accessible resource for genomic studies, we generated a mammalian genome alignment comprising 120 species. We used this alignment and the CESAR method to provide protein-coding gene annotations for 119 non-human mammals. Furthermore, we illustrate the utility of this alignment by 2 exemplary analyses. First, we quantified how variable ultraconserved elements (UCEs) are among placental mammals. Leveraging the high taxonomic coverage in our alignment, we estimate that UCEs contain on average 4.7%–15.6% variable alignment columns. Furthermore, we show that the center regions of UCEs are generally most constrained. Second, we identified enhancer sequences that are only conserved in placental mammals. We found that these enhancers are significantly associated with placenta-related genes, suggesting that some of these enhancers may be involved in the evolution of placental mammal-specific aspects of the placenta. Conclusion: The 120-mammal alignment and all other data are available for analysis and visualization in a genome browser at https://genome-public.pks.mpg.de/ and for download at https://bds.mpi-cbg.de/hillerlab/120MammalAlignment/.
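
    The "variable alignment columns" measure mentioned in this abstract can be illustrated with a small sketch. It assumes the simplest possible definition, namely that a column is variable when the aligned sequences do not all carry the same character, and it treats a gap character like any other symbol; the paper's actual criteria may be stricter. The function name and the toy alignment are hypothetical.

```python
def variable_column_fraction(alignment):
    """Fraction of alignment columns in which the rows disagree.

    `alignment` is a list of equal-length strings, one per species.
    A column counts as variable if it holds more than one distinct
    character (assumed definition; gaps are treated as characters).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if length == 0:
        raise ValueError("alignment has no columns")
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    # zip(*alignment) walks the alignment column by column.
    variable = sum(1 for column in zip(*alignment) if len(set(column)) > 1)
    return variable / length


# Toy alignment of three made-up sequences; columns 4 and 10 differ.
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAT",
]
print(variable_column_fraction(toy_alignment))
```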

  • Chromosome-level genome assembly reveals the unique genome evolution of the swimming crab (Portunus trituberculatus)
    Gigascience (IF 4.688) Pub Date : 2020-01-06
    Tang B, Zhang D, Li H, et al.

    Background: The swimming crab, Portunus trituberculatus, is an important commercial species in China and is widely distributed in the coastal waters of Asia-Pacific countries. Despite increasing interest in swimming crab research, a high-quality chromosome-level genome is still lacking. Findings: Here, we assembled the first chromosome-level reference genome of P. trituberculatus by combining the short reads, Nanopore long reads, and Hi-C data. The genome assembly size was 1.00 Gb with a contig N50 length of 4.12 Mb. In addition, BUSCO assessment indicated that 94.7% of core eukaryotic genes were present in the genome assembly. Approximately 54.52% of the genome was identified as repetitive sequences, with a total of 16,796 annotated protein-coding genes. In addition, we anchored contigs into chromosomes and identified 50 chromosomes with an N50 length of 21.80 Mb by Hi-C technology. Conclusions: We anticipate that this chromosome-level assembly of the P. trituberculatus genome will not only promote study of basic development and evolution but also provide important resources for swimming crab reproduction.

  • Data detectives, self-love, and humility: a research parasite's perspective
    Gigascience (IF 4.688) Pub Date : 2020-01-03
    Duvallet C.

    Secondary analysis solidifies and expands upon scientific knowledge through the re-analysis of existing datasets. However, researchers performing secondary analyses must develop specific skills to be successful and can benefit from adopting some computational best practices. Recognizing this work is also key to building and supporting a community of researchers who contribute to the scientific ecosystem through secondary analyses. The Research Parasite Awards are one such avenue, celebrating outstanding contributions to the rigorous secondary analysis of data. As the recipient of a 2019 Junior Research Parasite Award, I was asked to provide some perspectives on life as a research parasite, which I share in this commentary.

  • Genome and population sequencing of a chromosome-level genome assembly of the Chinese tapertail anchovy (Coilia nasus) provides novel insights into migratory adaptation
    Gigascience (IF 4.688) Pub Date : 2020-01-02
    Xu G, Bian C, Nie Z, et al.

    Background: Seasonal migration is one of the most spectacular events in nature; however, the molecular mechanisms related to this phenomenon have not been investigated in detail. The Chinese tapertail, or Japanese grenadier anchovy, Coilia nasus, is a valuable migratory fish of high economic importance and special migratory dimorphism (with certain individuals as non-migratory residents). Results: In this study, an 870.0-Mb high-quality genome was assembled by the combination of Illumina and Pacific Biosciences sequencing. Approximately 812.1 Mb of scaffolds were linked to 24 chromosomes using a high-density genetic map from a family of 104 full siblings and their parents. In addition, population sequencing of 96 representative individuals from diverse areas along the putative migration path identified 150 candidate genes, which are mainly enriched in 3 Ca2+-related pathways. Based on integrative genomic and transcriptomic analyses, we determined that the 3 Ca2+-related pathways are critical for promotion of migratory adaptation. A large number of molecular markers were also identified, which distinguished migratory individuals and non-migratory freshwater residents. Conclusions: We assembled a chromosome-level genome for the Chinese tapertail anchovy. The genome provided a valuable genetic resource for understanding of migratory adaptation and population genetics and will benefit the aquaculture and management of this economically important fish.

  • The draft nuclear genome assembly of Eucalyptus pauciflora: a pipeline for comparing de novo assemblies
    Gigascience (IF 4.688) Pub Date : 2020-01-02
    Wang W, Das A, Kainer D, et al.

    Background: Eucalyptus pauciflora (the snow gum) is a long-lived tree with high economic and ecological importance. Currently, little genomic information for E. pauciflora is available. Here, we sequentially assemble the genome of Eucalyptus pauciflora with different methods, and combine multiple existing and novel approaches to help to select the best genome assembly. Findings: We generated high coverage of long- (Nanopore, 174×) and short- (Illumina, 228×) read data from a single E. pauciflora individual and compared assemblies from 5 assemblers (Canu, SMARTdenovo, Flye, Marvel, and MaSuRCA) with different read lengths (1 and 35 kb minimum read length). A key component of our approach is to keep a randomly selected collection of ∼10% of both long and short reads separated from the assemblies to use as a validation set for assessing assemblies. Using this validation set along with a range of existing tools, we compared the assemblies in 8 ways: contig N50, BUSCO scores, LAI (long terminal repeat assembly index) scores, assembly ploidy, base-level error rate, CGAL (computing genome assembly likelihoods) scores, structural variation, and genome sequence similarity. Our results showed that MaSuRCA generated the best assembly, which is 594.87 Mb in size, with a contig N50 of 3.23 Mb, and an estimated error rate of ∼0.006 errors per base. Conclusions: We report a draft genome of E. pauciflora, which will be a valuable resource for further genomic studies of eucalypts. The approaches for assessing and comparing genomes should help in assessing and choosing among many potential genome assemblies from a single dataset.
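
    The hold-out idea in this abstract (keeping roughly 10% of reads out of the assembly to use as a validation set) can be sketched as a seeded random split. This is an illustrative sketch only, not the authors' pipeline: the function name, the default seed, and the placeholder read identifiers are assumptions, and a real workflow would split FASTQ records rather than strings.

```python
import random


def split_reads(read_ids, holdout_fraction=0.1, seed=0):
    """Randomly split reads into an assembly set and a validation set.

    A fixed seed makes the split reproducible. The validation set holds
    about `holdout_fraction` of the reads and shares none with the
    assembly set.
    """
    if not 0 <= holdout_fraction <= 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(read_ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    validation = shuffled[:n_holdout]
    assembly = shuffled[n_holdout:]
    return assembly, validation


# Placeholder identifiers standing in for real sequencing reads.
reads = [f"read_{i}" for i in range(1000)]
assembly_reads, validation_reads = split_reads(reads)
print(len(assembly_reads), len(validation_reads))
```

    Because the validation reads never enter the assembly, metrics computed from them (for example, how well they map back) are not biased toward the assembler that consumed them.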

  • Correction to: A network-based conditional genetic association analysis of the human metabolome
    Gigascience (IF 4.688) Pub Date : 2019-12-30
    Tsepilov Y, Sharapov S, Zaytseva O, et al.

    In the original version of the article “A network-based conditional genetic association analysis of the human metabolome” by Y.A. Tsepilov et al. [1], there was a typographical error in the surname of the fourth author (Jan Krumsek) and a typo in Acknowledgments in the surname of Athina Spilopoulou. These typos have been corrected, and the authors apologize for the mistakes.

  • Prognostic model for multiple myeloma progression integrating gene expression and clinical features
    Gigascience (IF 4.688) Pub Date : 2019-12-30
    Sun C, Li H, Mills R, et al.

    Background: Multiple myeloma (MM) is a hematological cancer caused by abnormal accumulation of monoclonal plasma cells in bone marrow. With the increase in treatment options, risk-adapted therapy is becoming more and more important. Survival analysis is commonly applied to study progression or other events of interest and stratify the risk of patients. Results: In this study, we present the current state-of-the-art model for MM prognosis and the molecular biomarker set for stratification: the winning algorithm in the 2017 Multiple Myeloma DREAM Challenge, Sub-Challenge 3. Specifically, we built a non-parametric complete hazard ranking model to map the right-censored data into a linear space, where commonplace machine learning techniques, such as Gaussian process regression and random forests, can play their roles. Our model integrated both the gene expression profile and clinical features to predict the progression of MM. Compared with conventional models, such as Cox model and random survival forests, our model achieved higher accuracy in 3 within-cohort predictions. In addition, it showed robust predictive power in cross-cohort validations. Key molecular signatures related to MM progression were identified from our model, which may function as the core determinants of MM progression and provide important guidance for future research and clinical practice. Functional enrichment analysis and mammalian gene-gene interaction network revealed crucial biological processes and pathways involved in MM progression. The model is dockerized and publicly available at https://www.synapse.org/#!Synapse:syn11459638. Both data and reproducible code are included in the docker. Conclusions: We present the current state-of-the-art prognostic model for MM integrating gene expression and clinical features validated in an independent test set.

  • Corrigendum to: Rice Galaxy: an open resource for plant science
    Gigascience (IF 4.688) Pub Date : 2019-12-30
    Juanillas V, Dereeper A, Beaume N, et al.

    In the original version of the article “Rice Galaxy: an open resource for plant science” by Venice Juanillas et al. [1], the main publication that generated the genome assemblies mentioned as important data sources in the paper (IR 8: GenBank: MPPV00000000.1 and N 22: GenBank: LWDA00000000.1) was not included in the references. In the online version of the paper, this occurs in the Discussion/Built-in/interoperable rice data section in the following sentence (highlighted in BOLD font):

  • Genomic evidence of neo-sex chromosomes in the eastern yellow robin
    Gigascience (IF 4.688) Pub Date : 2019-12-30
    Gan H, Falk S, Morales H, et al.

    In the final publication of “Genomic evidence of neo-sex chromosomes in the Eastern Yellow Robin,” by Gan et al., supplementary data files were incorrectly linked from another article. The correct supplementary data files for this article now appear online.

  • Dissection of soybean populations according to selection signatures based on whole-genome sequences
    Gigascience (IF 4.688) Pub Date : 2019-12-23
    Kim J, Jeong S, Kim K, et al.

    Background: Domestication and improvement processes, accompanied by selections and adaptations, have generated genome-wide divergence and stratification in soybean populations. Simultaneously, soybean populations, which comprise diverse subpopulations, have developed their own adaptive characteristics enhancing fitness, resistance, agronomic traits, and morphological features. The genetic traits underlying these characteristics play a fundamental role in improving other soybean populations. Results: This study focused on identifying the selection signatures and adaptive characteristics in soybean populations. A core set of 245 accessions (112 wild-type, 79 landrace, and 54 improvement soybeans) selected from 4,234 soybean accessions was re-sequenced. Their genomic architectures were examined according to the domestication and improvement, and accessions were then classified into 3 wild-type, 2 landrace, and 2 improvement subgroups based on various population analyses. Selection and gene set enrichment analyses revealed that the landrace subgroups have selection signals for soybean-cyst nematode HG type 0 and seed development with germination, and that the improvement subgroups have selection signals for plant development with viability and seed development with embryo development, respectively. The adaptive characteristic for soybean-cyst nematode was partially underpinned by multiple resistance accessions, and the characteristics related to seed development were supported by our phenotypic findings for seed weights. Furthermore, their adaptive characteristics were also confirmed as genome-based evidence, and unique genomic regions that exhibit distinct selection and selective sweep patterns were revealed for 13 candidate genes. Conclusions: Although our findings require further biological validation, they provide valuable information about soybean breeding strategies and present new options for breeders seeking donor lines to improve soybean populations.

  • Arteria: An automation system for a sequencing core facility
    Gigascience (IF 4.688) Pub Date : 2019-12-11
    Dahlberg J, Hermansson J, Sturlaugsson S, et al.

    Background: In recent years, nucleotide sequencing has become increasingly instrumental in both research and clinical settings. This has led to an explosive growth in sequencing data produced worldwide. As the amount of data increases, so does the need for automated solutions for data processing and analysis. The concept of workflows has gained favour in the bioinformatics community, but there is little in the scientific literature describing end-to-end automation systems. Arteria is an automation system that aims at providing a solution to the data-related operational challenges that face sequencing core facilities. Findings: Arteria is built on existing open source technologies, with a modular design allowing for a community-driven effort to create plug-and-play micro-services. In this article we describe the system, elaborate on the underlying conceptual framework, and present an example implementation. Arteria can be reduced to 3 conceptual levels: orchestration (using an event-based model of automation), process (the steps involved in processing sequencing data, modelled as workflows), and execution (using a series of RESTful micro-services). This creates a system that is both flexible and scalable. Arteria-based systems have been successfully deployed at 3 sequencing core facilities. The Arteria Project code, written largely in Python, is available as open source software, and more information can be found at https://arteria-project.github.io/. Conclusions: We describe the Arteria system and the underlying conceptual framework, demonstrating how this model can be used to automate data handling and analysis in the context of a sequencing core facility.

  • A Galaxy-based training resource for single-cell RNA-sequencing quality control and analyses
    Gigascience (IF 4.688) Pub Date : 2019-12-11
    Etherington G, Soranzo N, Mohammed S, et al.

    Background: It is not a trivial step to move from single-cell RNA-sequencing (scRNA-seq) data production to data analysis. There is a lack of intuitive training materials and easy-to-use analysis tools, and researchers can find it difficult to master the basics of scRNA-seq quality control and the later analysis. Results: We have developed a range of practical scripts, together with their corresponding Galaxy wrappers, that make scRNA-seq training and quality control accessible to researchers previously daunted by the prospect of scRNA-seq analysis. We implement a “visualize-filter-visualize” paradigm through simple command line tools that use the Loom format to exchange data between the tools. The point-and-click nature of Galaxy makes it easy to assess, visualize, and filter scRNA-seq data from short-read sequencing data. Conclusion: We have developed a suite of scRNA-seq tools that can be used for both training and more in-depth analyses.

  • Systematic processing of ribosomal RNA gene amplicon sequencing data
    Gigascience (IF 4.688) Pub Date : 2019-12-09
    Tremblay J, Yergeau E.

    Background: With the advent of high-throughput sequencing, microbiology is becoming increasingly data-intensive. Because of its low cost, robust databases, and established bioinformatic workflows, sequencing of 16S/18S/ITS ribosomal RNA (rRNA) gene amplicons, which provides a marker of choice for phylogenetic studies, has become ubiquitous. Many established end-to-end bioinformatic pipelines are available to perform short amplicon sequence data analysis. These pipelines suit a general audience, but few options exist for more specialized users who are experienced in code scripting, Linux-based systems, and high-performance computing (HPC) environments. For such an audience, existing pipelines can be limiting to fully leverage modern HPC capabilities and perform tweaking and optimization operations. Moreover, a wealth of stand-alone software packages that perform specific targeted bioinformatic tasks are increasingly accessible, and finding a way to easily integrate these applications in a pipeline is critical to the evolution of bioinformatic methodologies. Results: Here we describe AmpliconTagger, a short rRNA marker gene amplicon pipeline coded in a Python framework that enables fine tuning and integration of virtually any potential rRNA gene amplicon bioinformatic procedure. It is designed to work within an HPC environment, supporting a complex network of job dependencies with a smart-restart mechanism in case of job failure or parameter modifications. As proof of concept, we present end results obtained with AmpliconTagger using 16S, 18S, ITS rRNA short gene amplicons and Pacific Biosciences long-read amplicon data types as input. Conclusions: Using a selection of published algorithms for generating operational taxonomic units and amplicon sequence variants and for computing downstream taxonomic summaries and diversity metrics, we demonstrate the performance and versatility of our pipeline for systematic analyses of amplicon sequence data.

  • Accessible and reproducible mass spectrometry imaging data analysis in Galaxy
    Gigascience (IF 4.688) Pub Date : 2019-12-09
    Föll M, Moritz L, Wollmann T, et al.

    Background: Mass spectrometry imaging is increasingly used in biological and translational research because it has the ability to determine the spatial distribution of hundreds of analytes in a sample. Being at the interface of proteomics/metabolomics and imaging, the acquired datasets are large and complex and often analyzed with proprietary software or in-house scripts, which hinders reproducibility. Open source software solutions that enable reproducible data analysis often require programming skills and are therefore not accessible to many mass spectrometry imaging (MSI) researchers. Findings: We have integrated 18 dedicated mass spectrometry imaging tools into the Galaxy framework to allow accessible, reproducible, and transparent data analysis. Our tools are based on Cardinal, MALDIquant, and scikit-image and enable all major MSI analysis steps such as quality control, visualization, preprocessing, statistical analysis, and image co-registration. Furthermore, we created hands-on training material for use cases in proteomics and metabolomics. To demonstrate the utility of our tools, we re-analyzed a publicly available N-linked glycan imaging dataset. By providing the entire analysis history online, we highlight how the Galaxy framework fosters transparent and reproducible research. Conclusion: The Galaxy framework has emerged as a powerful analysis platform for the analysis of MSI data with ease of use and access, together with high levels of reproducibility and transparency.

  • Pseudo-chromosome–length genome assembly of a double haploid “Bartlett” pear (Pyrus communis L.)
    Gigascience (IF 4.688) Pub Date : 2019-12-09
    Linsmith G, Rombauts S, Montanari S, et al.

    Background: We report an improved assembly and scaffolding of the European pear (Pyrus communis L.) genome (referred to as BartlettDHv2.0), obtained using a combination of Pacific Biosciences RSII long-read sequencing, Bionano optical mapping, chromatin interaction capture (Hi-C), and genetic mapping. The sample selected for sequencing is a double haploid derived from the same “Bartlett” reference pear that was previously sequenced. Sequencing of di-haploid plants makes assembly more tractable in highly heterozygous species such as P. communis. Findings: A total of 496.9 Mb corresponding to 97% of the estimated genome size were assembled into 494 scaffolds. Hi-C data and a high-density genetic map allowed us to anchor and orient 87% of the sequence on the 17 pear chromosomes. Approximately 50% (247 Mb) of the genome consists of repetitive sequences. Gene annotation confirmed the presence of 37,445 protein-coding genes, which is 13% fewer than previously predicted. Conclusions: We showed that the use of a doubled-haploid plant is an effective solution to the problems presented by high levels of heterozygosity and duplication for the generation of high-quality genome assemblies. We present a high-quality chromosome-scale assembly of the European pear Pyrus communis and demonstrate its high degree of synteny with the genomes of Malus x Domestica and Pyrus x bretschneideri.

  • Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples
    Gigascience (IF 4.688) Pub Date : 2019-12-06
    Zheng H, Brennan K, Hernaez M, et al.

    Background: Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. Results: In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. Conclusions: Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.

  • GraphClust2: Annotation and discovery of structured RNAs with scalable and accessible integrative clustering
    Gigascience (IF 4.688) Pub Date : 2019-12-06
    Miladi M, Sokhoyan E, Houwaart T, et al.

    Background: RNA plays essential roles in all known forms of life. Clustering RNA sequences with common sequence and structure is an essential step towards studying RNA function. With the advent of high-throughput sequencing techniques, experimental and genomic data are expanding to complement the predictive methods. However, the existing methods do not effectively utilize and cope with the immense amount of data becoming available. Results: Hundreds of thousands of non-coding RNAs have been detected; however, their annotation is lagging behind. Here we present GraphClust2, a comprehensive approach for scalable clustering of RNAs based on sequence and structural similarities. GraphClust2 bridges the gap between high-throughput sequencing and structural RNA analysis and provides an integrative solution by incorporating diverse experimental and genomic data in an accessible manner via the Galaxy framework. GraphClust2 can efficiently cluster and annotate large datasets of RNAs and supports structure-probing data. We demonstrate that the annotation performance of clustering functional RNAs can be considerably improved. Furthermore, an off-the-shelf procedure is introduced for identifying locally conserved structure candidates in long RNAs. We suggest the presence and the sparseness of phylogenetically conserved local structures for a collection of long non-coding RNAs. Conclusions: By clustering data from 2 cross-linking immunoprecipitation experiments, we demonstrate the benefits of GraphClust2 for motif discovery under the presence of biological and methodological biases. Finally, we uncover prominent targets of double-stranded RNA binding protein Roquin-1, such as BCOR’s 3′ untranslated region that contains multiple binding stem-loops that are evolutionarily conserved.

  • Wikipedia: Why is the common knowledge resource still neglected by academics?
    Gigascience (IF 4.688) Pub Date : 2019-12-03
    Jemielniak D.

    Wikipedia is by far the largest online encyclopedia, and the number of errors it contains is on par with the professional sources even in specialized topics such as biology or medicine. Yet, the academic world is still treating it with great skepticism because of the types of inaccuracies present there, the widespread plagiarism from Wikipedia, and historic biases, as well as jealousy regarding the loss of the knowledge dissemination monopoly. This article argues that it is high time not only to acknowledge Wikipedia's quality but also to start actively promoting its use and development in academia.

  • Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information
    Gigascience (IF 4.688) Pub Date : 2019-12-03
    Kim H, Jeon S, Kim C, et al.

    Background: Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than $40,000 US for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. Findings: We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and N50s of 17.9 Mb). When we applied Hi-C–derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. Conclusion: The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
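
    The contig and scaffold N50 values quoted in this abstract follow the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that calculation is given below; the function name and the toy lengths are illustrative assumptions and are not taken from the KOREF study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly (arbitrary units), not real KOREF data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

    Because N50 depends only on the length distribution, it measures contiguity rather than correctness, which is why the abstract reports it alongside contig counts and total assembly length.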

  • DAISY: A Data Information System for accountability under the General Data Protection Regulation
    Gigascience (IF 4.688) Pub Date : 2019-12-04
    Becker R, Alper P, Grouès V, et al.

    Background: The new European legislation on data protection, namely, the General Data Protection Regulation (GDPR), has introduced comprehensive requirements for the documentation about the processing of personal data as well as informing the data subjects of its use. GDPR’s accountability principle requires institutions, projects, and data hubs to document their data processing and demonstrate compliance with the GDPR. In response to this requirement, we see the emergence of commercial data-mapping tools, and institutions creating GDPR data registers with such tools. One shortcoming of this approach is the genericity of tools, and their process-based model not capturing the project-based, collaborative nature of data processing in biomedical research. Findings: We have developed a software tool to allow research institutions to comply with the GDPR accountability requirement and map the sometimes very complex data flows in biomedical research. By analysing the transparency and record-keeping obligations of each GDPR principle, we observe that our tool effectively meets the accountability requirement. Conclusions: The GDPR is bringing data protection to center stage in research data management, necessitating dedicated tools, personnel, and processes. Our tool, DAISY, is tailored specifically for biomedical research and can help institutions in tackling the documentation challenge brought about by the GDPR. DAISY is made available as a free and open source tool on GitHub. DAISY is actively being used at the Luxembourg Centre for Systems Biomedicine and the ELIXIR-Luxembourg data hub.

  • Genome-wide analysis of the H3K27me3 epigenome and transcriptome in Brassica rapa
    Gigascience (IF 4.688) Pub Date : 2019-12-04
    Payá-Milans M, Poza-Viejo L, Martín-Uriz P, et al.

    Background: Genome-wide maps of histone modifications have been obtained for several plant species. However, most studies focus on model systems and do not enforce FAIR data management principles. Here we study the H3K27me3 epigenome and associated transcriptome of Brassica rapa, an important vegetable cultivated worldwide. Findings: We performed H3K27me3 chromatin immunoprecipitation followed by high-throughput sequencing and transcriptomic analysis by 3′-end RNA sequencing from B. rapa leaves and inflorescences. To analyze these data we developed a Reproducible Epigenomic Analysis pipeline using Galaxy and Jupyter, packaged into Docker images to facilitate transparency and reuse. We found that H3K27me3 covers roughly one-third of all B. rapa protein-coding genes and its presence correlates with low transcript levels. The comparative analysis between leaves and inflorescences suggested that the expression of various floral regulatory genes during development depends on H3K27me3. To demonstrate the importance of H3K27me3 for B. rapa development, we characterized a mutant line deficient in the H3K27 methyltransferase activity. We found that braA.clf mutant plants presented pleiotropic alterations, e.g., curly leaves due to increased expression and reduced H3K27me3 levels at AGAMOUS-like loci. Conclusions: We characterized the epigenetic mark H3K27me3 at genome-wide levels and provide genetic evidence for its relevance in B. rapa development. Our work reveals the epigenomic landscape of H3K27me3 in B. rapa and provides novel genomics datasets and bioinformatics analytical resources. We anticipate that this work will lead the way to further epigenomic studies in the complex genome of Brassica crops.

  • Assembly of the 373k gene space of the polyploid sugarcane genome reveals reservoirs of functional diversity in the world's leading biomass crop
    Gigascience (IF 4.688) Pub Date : 2019-11-29
    Souza G, Van Sluys M, Lembke C, et al.

    Background: Sugarcane cultivars are polyploid interspecific hybrids of giant genomes, typically with 10–13 sets of chromosomes from 2 Saccharum species. The ploidy, hybridity, and size of the genome, estimated at >10 Gb, pose a challenge for sequencing. Results: Here we present a gene space assembly of SP80-3280, including 373,869 putative genes and their potential regulatory regions. The alignment of single-copy genes in diploid grasses to the putative genes indicates that we could resolve 2–6 (up to 15) putative homo(eo)logs that are 99.1% identical within their coding sequences. Dissimilarities increase in their regulatory regions, and gene promoter analysis shows differences in regulatory elements within gene families that are expressed in a species-specific manner. We exemplify these differences for sucrose synthase (SuSy) and phenylalanine ammonia-lyase (PAL), 2 gene families central to carbon partitioning. SP80-3280 has particular regulatory elements involved in sucrose synthesis not found in the ancestor Saccharum spontaneum. PAL regulatory elements are found in co-expressed genes related to fiber synthesis within gene networks defined during plant growth and maturation. Comparison with sorghum reveals predominantly bi-allelic variations in sugarcane, consistent with the formation of 2 “subgenomes” after their divergence ∼3.8–4.6 million years ago and reveals single-nucleotide variants that may underlie their differences. Conclusions: This assembly represents a large step towards a whole-genome assembly of a commercial sugarcane cultivar. It includes a rich diversity of genes and homo(eo)logous resolution for a representative fraction of the gene space, relevant to improve biomass and food production.

  • Assessment of human diploid genome assembly with 10x Linked-Reads data
    Gigascience (IF 4.688) Pub Date : 2019-11-26
    Zhang L, Zhou X, Weng Z, et al.

    Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries. Results: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole-genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332× and 823× and assembly quality worsened if it increased to >1,000× for a given C. Long DNA fragments could significantly extend phase blocks but decreased contig contiguity. The optimal length-weighted fragment length ($W\mu_{FL}$) was ∼50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ∼80% of the genome was assembled in a diploid state. Conclusions: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.
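
    The three coverage terms in this abstract are interdependent. Assuming the usual Linked-Read relation C = CF × CR (the abstract does not state it explicitly, so it is treated here as an assumption), fixing the total coverage C means that raising the physical coverage CF necessarily lowers the read coverage per fragment CR. The sketch below illustrates that trade-off; the function name and the total coverage of 56× are hypothetical, and only the 332× and 823× bounds come from the abstract.

```python
def read_coverage_per_fragment(total_coverage: float, physical_coverage: float) -> float:
    """Solve the assumed relation C = C_F * C_R for C_R."""
    if physical_coverage <= 0:
        raise ValueError("physical coverage must be positive")
    return total_coverage / physical_coverage


total_c = 56.0  # hypothetical total sequencing coverage, not from the study

# 332x and 823x are the bounds of the optimal physical coverage range
# reported in the abstract.
for c_f in (332.0, 823.0):
    c_r = read_coverage_per_fragment(total_c, c_f)
    print(f"C_F = {c_f:.0f}x -> C_R = {c_r:.3f}x")
```

    Under this assumed relation, pushing CF above 1,000× at fixed C leaves each fragment very sparsely sampled, which is one plausible reading of why the reported assembly quality worsened there.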

  • A multi-day and multi-band dataset for a steady-state visual-evoked potential–based brain-computer interface
    Gigascience (IF 4.688) Pub Date : 2019-11-25
    Choi G, Han C, Jung Y, et al.

    Background: A steady-state visual-evoked potential (SSVEP) is a brain response to visual stimuli modulated at certain frequencies; it has been widely used in electroencephalography (EEG)-based brain–computer interface research. However, there are few published SSVEP datasets for brain–computer interface. In this study, we obtained a new SSVEP dataset based on measurements from 30 participants, performed on 2 days; our dataset complements existing SSVEP datasets: (i) multi-band SSVEP datasets are provided, and all 3 possible frequency bands (low, middle, and high) were used for SSVEP stimulation; (ii) multi-day datasets are included; and (iii) the EEG datasets include simultaneously obtained physiological measurements, such as respiration, electrocardiography, electromyography, and head motion (accelerator). Findings: To validate our dataset, we estimated the spectral powers and classification performance for the EEG (SSVEP) datasets and created an example plot to visualize the physiological time-series data. Strong SSVEP responses were observed at stimulation frequencies, and the mean classification performance of the middle frequency band was significantly higher than the low- and high-frequency bands. Other physiological data also showed reasonable results. Conclusions: Our multi-band, multi-day SSVEP datasets can be used to optimize stimulation frequencies because they enable simultaneous investigation of the characteristics of the SSVEPs evoked in each of the 3 frequency bands, and solve session-to-session (day-to-day) transfer problems by enabling investigation of the non-stationarity of SSVEPs measured on different days. Additionally, auxiliary physiological data can be used to explore the relationship between SSVEP characteristics and physiological conditions, providing useful information for optimizing experimental paradigms to achieve high performance.

  • High-throughput phenotyping with deep learning gives insight into the genetic architecture of flowering time in wheat
    Gigascience (IF 4.688) Pub Date : 2019-11-19
    Wang X, Xuan H, Evers B, et al.

    Background: Measurement of plant traits with precision and speed on large populations has emerged as a critical bottleneck in connecting genotype to phenotype in genetics and breeding. This bottleneck limits advancements in understanding plant genomes and the development of improved, high-yielding crop varieties. Results: Here we demonstrate the application of deep learning on proximal imaging from a mobile field vehicle to directly estimate plant morphology and developmental stages in wheat under field conditions. We developed and trained a convolutional neural network with image datasets labeled from expert visual scores and used this “breeder-trained” network to classify wheat morphology and developmental stages. For both morphological (awned) and phenological (flowering time) traits, we demonstrate high heritability and very high accuracy against the “ground-truth” values from visual scoring. Using the traits predicted by the network, we tested genotype-to-phenotype association using the deep learning phenotypes and uncovered novel epistatic interactions for flowering time. Enabled by the time-series high-throughput phenotyping, we describe a new phenotype as the rate of flowering and show heritable genetic control for this trait. Conclusions: We demonstrated a field-based high-throughput phenotyping approach using deep learning that can directly measure morphological and developmental phenotypes in genetic populations from field-based imaging. The deep learning approach presented here gives a conceptual advancement in high-throughput plant phenotyping because it can potentially estimate any trait in any plant species for which the combination of breeder scores and high-resolution images can be obtained, capturing the expert knowledge from breeders, geneticists, pathologists, and physiologists to train the networks.

  • RepeatFiller newly identifies megabases of aligning repetitive sequences and improves annotations of conserved non-exonic elements
    Gigascience (IF 4.688) Pub Date : 2019-11-19
    Osipova E, Hecker N, Hiller M.

    Background: Transposons and other repetitive sequences make up a large part of complex genomes. Repetitive sequences can be co-opted into a variety of functions and thus provide a source for evolutionary novelty. However, comprehensively detecting ancestral repeats that align between species is difficult because considering all repeat-overlapping seeds in alignment methods that rely on the seed-and-extend heuristic results in prohibitively high runtimes. Results: Here, we show that ignoring repeat-overlapping alignment seeds when aligning entire genomes misses numerous alignments between repetitive elements. We present a tool, RepeatFiller, that improves genome alignments by incorporating previously undetected local alignments between repetitive sequences. By applying RepeatFiller to genome alignments between human and 20 other representative mammals, we uncover between 22 and 84 Mb of previously undetected alignments that mostly overlap transposable elements. We further show that the increased alignment coverage improves the annotation of conserved non-exonic elements, both by discovering numerous novel transposon-derived elements that evolve under constraint and by removing thousands of elements that are not under constraint in placental mammals. Conclusions: RepeatFiller contributes to comprehensively aligning repetitive genomic regions, which facilitates studying transposon co-option and genome evolution. Source code: https://github.com/hillerlab/GenomeAlignmentTools

  • Trochodendron aralioides, the first chromosome-level draft genome in Trochodendrales and a valuable resource for basal eudicot research
    Gigascience (IF 4.688) Pub Date : 2019-11-18
    Strijk J, Hinsinger D, Zhang F, et al.

    Background: The wheel tree (Trochodendron aralioides) is one of only 2 species in the basal eudicot order Trochodendrales. Together with Tetracentron sinense, the family is unique in having secondary xylem without vessel elements, long considered to be a primitive character also found in Amborella and Winteraceae. Recent studies, however, have shown that Trochodendraceae belong to basal eudicots and demonstrate that this represents an evolutionary reversal for the group. Trochodendron aralioides is widespread in cultivation and popular for use in gardens and parks. Findings: We assembled the T. aralioides genome using a total of 679.56 Gb of clean reads that were generated using both Pacific Biosciences and Illumina short-reads in combination with 10X Genomics and Hi-C data. Nineteen scaffolds corresponding to 19 chromosomes were assembled to a final size of 1.614 Gb with a scaffold N50 of 73.37 Mb in addition to 1,534 contigs. Repeat sequences accounted for 64.226% of the genome, and 35,328 protein-coding genes with an average of 5.09 exons per gene were annotated using de novo, RNA-sequencing, and homology-based approaches. According to a phylogenetic analysis of protein-coding genes, T. aralioides diverged in a basal position relative to core eudicots, ∼121.8–125.8 million years ago. Conclusions: Trochodendron aralioides is the first chromosome-scale genome assembled in the order Trochodendrales. It represents the largest genome assembled to date in the basal eudicot grade, as well as the closest order relative to the core-eudicots, as the position of Buxales remains unresolved. This genome will support further studies of wood morphology and floral evolution, and will be an essential resource for understanding rapid changes that took place at the base of the Eudicot tree. Finally, it can further genome-assisted improvement for cultivation and conservation efforts of the wheel tree.

  • Correction to: Transcriptome of the Caribbean stony coral Porites astreoides from three developmental stages
    Gigascience (IF 4.688) Pub Date : 2019-11-15
    Mansour T, Rosenthal J, Brown C, et al.

    This is a correction to: GigaScience, Volume 5, Issue 1, 1 December 2016, s13742-016-0138-1, https://doi.org/10.1186/s13742-016-0138-1

  • Deep learning for clustering of multivariate clinical patient trajectories with missing values
    Gigascience (IF 4.688) Pub Date : 2019-11-15
    de Jong J, Emon M, Wu P, et al.

    Background: Precision medicine requires a stratification of patients by disease presentation that is sufficiently informative to allow for selecting treatments on a per-patient basis. For many diseases, such as neurological disorders, this stratification problem translates into a complex problem of clustering multivariate and relatively short time series because (i) these diseases are multifactorial and not well described by single clinical outcome variables and (ii) disease progression needs to be monitored over time. Additionally, clinical data are often hindered by the presence of many missing values, further complicating any clustering attempts. Findings: The problem of clustering multivariate short time series with many missing values is generally not well addressed in the literature. In this work, we propose a deep learning–based method to address this issue, variational deep embedding with recurrence (VaDER). VaDER relies on a Gaussian mixture variational autoencoder framework, which is further extended to (i) model multivariate time series and (ii) directly deal with missing values. We validated VaDER by accurately recovering clusters from simulated and benchmark data with known ground truth clustering, while varying the degree of missingness. We then used VaDER to successfully stratify patients with Alzheimer disease and patients with Parkinson disease into subgroups characterized by clinically divergent disease progression profiles. Additional analyses demonstrated that these clinical differences reflected known underlying aspects of Alzheimer disease and Parkinson disease. Conclusions: We believe our results show that VaDER can be of great value for future efforts in patient stratification, and multivariate time-series clustering in general.

  • RootNav 2.0: Deep learning for automatic navigation of complex plant root architectures
    Gigascience (IF 4.688) Pub Date : 2019-11-08
    Yasrab R, Atkinson J, Wells D, et al.

    Background: In recent years quantitative analysis of root growth has become increasingly important as a way to explore the influence of abiotic stress such as high temperature and drought on a plant's ability to take up water and nutrients. Segmentation and feature extraction of plant roots from images presents a significant computer vision challenge. Root images contain complicated structures, variations in size, background, occlusion, clutter and variation in lighting conditions. We present a new image analysis approach that provides fully automatic extraction of complex root system architectures from a range of plant species in varied imaging set-ups. Driven by modern deep-learning approaches, RootNav 2.0 replaces previously manual and semi-automatic feature extraction with an extremely deep multi-task convolutional neural network architecture. The network also locates seeds, first order and second order root tips to drive a search algorithm seeking optimal paths throughout the image, extracting accurate architectures without user interaction. Results: We develop and train a novel deep network architecture to explicitly combine local pixel information with global scene information in order to accurately segment small root features across high-resolution images. The proposed method was evaluated on images of wheat (Triticum aestivum L.) from a seedling assay. Compared with semi-automatic analysis via the original RootNav tool, the proposed method demonstrated comparable accuracy, with a 10-fold increase in speed. The network was able to adapt to different plant species via transfer learning, offering similar accuracy when transferred to an Arabidopsis thaliana plate assay. A final instance of transfer learning, to images of Brassica napus from a hydroponic assay, still demonstrated good accuracy despite many fewer training images. Conclusions: We present RootNav 2.0, a new approach to root image analysis driven by a deep neural network. The tool can be adapted to new image domains with a reduced number of images, and offers substantial speed improvements over semi-automatic and manual approaches. The tool outputs root architectures in the widely accepted RSML standard, for which numerous analysis packages exist (http://rootsystemml.github.io/), as well as segmentation masks compatible with other automated measurement tools. The tool will provide researchers with the ability to analyse root systems at larger scales than ever before, at a time when large scale genomic studies have made this more important than ever.

  • Chromosomal-level reference genome of Chinese peacock butterfly (Papilio bianor) based on third-generation DNA sequencing and Hi-C analysis
    Gigascience (IF 4.688) Pub Date : 2019-11-04
    Lu S, Yang J, Dai X, et al.

    Background: Papilio bianor Cramer, 1777 (commonly known as the Chinese peacock butterfly) (Insecta, Lepidoptera, Papilionidae) is a widely distributed swallowtail butterfly with a wide number of geographic populations ranging from the southeast of Russia to China, Japan, India, Vietnam, Myanmar, and Thailand. Its wing color consists of both pigmentary colored scales (black, reddish) and structural colored scales (iridescent blue or green dust). A high-quality reference genome of P. bianor is an important foundation for investigating iridescent color evolution, phylogeography, and the evolution of swallowtail butterflies. Findings: We obtained a chromosome-level de novo genome assembly of the highly heterozygous P. bianor using long Pacific Biosciences sequencing reads and high-throughput chromosome conformation capture technology. The final assembly is 421.52 Mb on 30 chromosomes (29 autosomes and 1 Z sex chromosome) with 13.12 Mb scaffold N50. In total, 15,375 protein-coding genes and 233.09 Mb of repetitive sequences were identified. Phylogenetic analyses indicated that P. bianor separated from a common ancestor of swallowtails ∼23.69–36.04 million years ago. Demographic history suggested that the population expansion of this species from the last interglacial period to the last glacial maximum possibly resulted from its decreased natural enemies and its adaptation to climate change during the glacial period. Conclusions: We present a high-quality chromosome-level reference genome of P. bianor using long-read single-molecule sequencing and Hi-C–based chromatin interaction maps. Our results lay the foundation for exploring the genetic basis of special biological features of P. bianor and also provide a useful data source for comparative genomics and phylogenomics among butterflies and moths.

  • Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv
    Gigascience (IF 4.688) Pub Date : 2019-11-01
    Khan F, Soiland-Reyes S, Sinnott R, et al.

    Background The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. Results Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. Conclusions The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.

  • Transcriptome of the Caribbean stony coral Porites astreoides from three developmental stages.
    Gigascience (IF 4.688) Pub Date : 2016-08-04
    Tamer A Mansour,Joshua J C Rosenthal,C Titus Brown,Loretta M Roberson

    BACKGROUND Porites astreoides is a ubiquitous species of coral on modern Caribbean reefs that is resistant to increasing temperatures, overfishing, and other anthropogenic impacts that have threatened most other coral species. We assembled and annotated a transcriptome from this coral using Illumina sequences from three different developmental stages collected over several years: free-swimming larvae, newly settled larvae, and adults (>10 cm in diameter). This resource will aid understanding of coral calcification, larval settlement, and host-symbiont interactions. FINDINGS A de novo transcriptome for the P. astreoides holobiont (coral plus algal symbiont) was assembled using 594 Mbp of raw Illumina sequencing data generated from five age-specific cDNA libraries. The new transcriptome consists of 867 255 transcript elements with an average length of 685 bases. The isolated P. astreoides assembly consists of 129 718 transcript elements with an average length of 811 bases, and the isolated Symbiodinium sp. assembly had 186 177 transcript elements with an average length of 1105 bases. CONCLUSIONS This contribution to coral transcriptome data provides a valuable resource for researchers studying the ontogeny of gene expression patterns within both the coral and its dinoflagellate symbiont.

  • Genomic analyses reveal FAM84B and the NOTCH pathway are associated with the progression of esophageal squamous cell carcinoma.
    Gigascience (IF 4.688) Pub Date : 2016-01-14
    Caixia Cheng,Heyang Cui,Ling Zhang,Zhiwu Jia,Bin Song,Fang Wang,Yaoping Li,Jing Liu,Pengzhou Kong,Ruyi Shi,Yanghui Bi,Bin Yang,Juan Wang,Zhenxiang Zhao,Yanyan Zhang,Xiaoling Hu,Jie Yang,Chanting He,Zhiping Zhao,Jinfen Wang,Yanfeng Xi,Enwei Xu,Guodong Li,Shiping Guo,Yunqing Chen,Xiaofeng Yang,Xing Chen,Jianfang Liang,Jiansheng Guo,Xiaolong Cheng,Chuangui Wang,Qimin Zhan,Yongping Cui

    BACKGROUND Esophageal squamous cell carcinoma (ESCC) is the sixth most lethal cancer worldwide and the fourth most lethal cancer in China. Genomic characterization of tumors, particularly those of different stages, is likely to reveal additional oncogenic mechanisms. Although copy number alterations and somatic point mutations associated with the development of ESCC have been identified by array-based technologies and genome-wide studies, the genomic characterization of ESCCs from different stages of the disease has not been explored. Here, we have performed either whole-genome sequencing or whole-exome sequencing on 51 stage I and 53 stage III ESCC patients to characterize the genomic alterations that occur during the various clinical stages of ESCC, and further validated these changes in 36 atypical hyperplasia samples. RESULTS Recurrent somatic amplifications at 8q were found to be enriched in stage I tumors and the deletions of 4p-q and 5q were particularly identified in stage III tumors. In particular, the FAM84B gene was amplified and overexpressed in preclinical and ESCC tumors. Knockdown of FAM84B in ESCC cell lines significantly reduced in vitro cell growth, migration and invasion. Although the cancer-associated genes TP53, PIK3CA, CDKN2A and their pathways showed no significant difference between stage I and stage III tumors, we identified and validated a prevalence of mutations in NOTCH1 and in the NOTCH pathway that indicate that they are involved in the preclinical and early stages of ESCC. CONCLUSIONS Our results suggest that FAM84B and the NOTCH pathway are involved in the progression of ESCC and may be potential diagnostic targets for ESCC susceptibility.

  • A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay.
    Gigascience (IF 4.688) Pub Date : 2017-03-23
    Mark-Anthony Bray,Sigrun M Gustafsdottir,Mohammad H Rohban,Shantanu Singh,Vebjorn Ljosa,Katherine L Sokolnicki,Joshua A Bittker,Nicole E Bodycombe,Vlado Dancík,Thomas P Hasaka,Cindy S Hon,Melissa M Kemp,Kejie Li,Deepika Walpita,Mathias J Wawer,Todd R Golub,Stuart L Schreiber,Paul A Clemons,Alykhan F Shamji,Anne E Carpenter

    Background Large-scale image sets acquired by automated microscopy of perturbed samples enable a detailed comparison of cell states induced by each perturbation, such as a small molecule from a diverse library. Highly multiplexed measurements of cellular morphology can be extracted from each image and subsequently mined for a number of applications. Findings This microscopy dataset includes 919 265 five-channel fields of view, representing 30 616 tested compounds, available at "The Cell Image Library" (CIL) repository. It also includes data files containing morphological features derived from each cell in each image, both at the single-cell level and population-averaged (i.e., per-well) level; the image analysis workflows that generated the morphological features are also provided. Quality-control metrics are provided as metadata, indicating fields of view that are out-of-focus or containing highly fluorescent material or debris. Lastly, chemical annotations are supplied for the compound treatments applied. Conclusions Because computational algorithms and methods for handling single-cell morphological measurements are not yet routine, the dataset serves as a useful resource for the wider scientific community applying morphological (image-based) profiling. The dataset can be mined for many purposes, including small-molecule library enrichment and chemical mechanism-of-action studies, such as target identification. Integration with genetically perturbed datasets could enable identification of small-molecule mimetics of particular disease- or gene-related phenotypes that could be useful as probes or potential starting points for development of future therapeutics.

  • A collection of yeast cellular electron cryotomography data.
    Gigascience (IF 4.688) Pub Date : 2019-06-28
    Lu Gan,Cai Tong Ng,Chen Chen,Shujun Cai

    BACKGROUND Cells are powered by a large set of macromolecular complexes, which work together in a crowded environment. The in situ mechanisms of these complexes are unclear because their 3D distribution, organization, and interactions are largely unknown. Electron cryotomography (cryo-ET) can address these knowledge gaps because it produces cryotomograms: 3D images that reveal biological structure at ∼4-nm resolution. Cryo-ET uses no fixation, dehydration, staining, or plastic embedment, so cellular features are visualized in a life-like, frozen-hydrated state. To study chromatin and mitotic machinery in situ, we subjected yeast cells to genetic and chemical perturbations, cryosectioned them, and then imaged the cells by cryo-ET. FINDINGS Here we share >1,000 cryo-ET raw datasets of cryosectioned budding yeast Saccharomyces cerevisiae collected as part of previously published studies. These data will be valuable to cell biologists who are interested in the nanoscale organization of yeasts and of eukaryotic cells in general. All the unpublished tilt series and a subset of corresponding cryotomograms have been deposited in the EMPIAR resource for the community to use freely. To improve tilt series discoverability, we have uploaded metadata and preliminary notes to publicly accessible Google Sheets, EMPIAR, and GigaDB. CONCLUSIONS Cellular cryo-ET data can be mined to obtain new cell-biological, structural, and 3D statistical insights in situ. These data contain structures not visible in traditional electron-microscopy data. Template matching and subtomogram averaging of known macromolecular complexes can reveal their 3D distributions and low-resolution structures. Furthermore, these data can serve as testbeds for high-throughput image-analysis pipelines, as training sets for feature-recognition software, for feasibility analysis when planning new structural-cell-biology projects, and as practice data for students.

  • Erratum to: iMicrobe: Tools and data-driven discovery platform for the microbiome sciences.
    Gigascience (IF 4.688) Pub Date : 2019-08-03
    Ken Youens-Clark,Matt Bomhoff,Alise J Ponsero,Elisha M Wood-Charlson,Joshua Lynch,Illyoung Choi,John H Hartman,Bonnie L Hurwitz

  • Imaging tissues and cells beyond the diffraction limit with structured illumination microscopy and Bayesian image reconstruction.
    Gigascience (IF 4.688) Pub Date : 2018-10-24
    Jakub Pospíšil,Tomáš Lukeš,Justin Bendesky,Karel Fliegel,Kathrin Spendier,Guy M Hagen

    Background Structured illumination microscopy (SIM) is a family of methods in optical fluorescence microscopy that can achieve both optical sectioning and super-resolution effects. SIM is a valuable method for high-resolution imaging of fixed cells or tissues labeled with conventional fluorophores, as well as for imaging the dynamics of live cells expressing fluorescent protein constructs. In SIM, one acquires a set of images with shifting illumination patterns. This set of images is subsequently treated with image analysis algorithms to produce an image with reduced out-of-focus light (optical sectioning) and/or with improved resolution (super-resolution). Findings Five complete, freely available SIM datasets are presented including raw and analyzed data. We report methods for image acquisition and analysis using open-source software along with examples of the resulting images when processed with different methods. We processed the data using established optical sectioning SIM and super-resolution SIM methods and with newer Bayesian restoration approaches that we are developing. Conclusions Various methods for SIM data acquisition and processing are actively being developed, but complete raw data from SIM experiments are not typically published. Publicly available, high-quality raw data with examples of processed results will aid researchers when developing new methods in SIM. Biologists will also find interest in the high-resolution images of animal tissues and cells we acquired. All of the data were processed with SIMToolbox, an open-source and freely available software solution for SIM.

  • New de novo assembly of the Atlantic bottlenose dolphin (Tursiops truncatus) improves genome completeness and provides haplotype phasing.
    Gigascience (IF 4.688) Pub Date : 2019-01-31
    Karine A Martinez-Viaud,Cindy Taylor Lawley,Milmer Martinez Vergara,Gil Ben-Zvi,Tammy Biniashvili,Kobi Baruch,Judy St Leger,Jennie Le,Aparna Natarajan,Marlem Rivera,Marbie Guillergan,Erich Jaeger,Brian Steffy,Aleksey Zimin

    High-quality genomes are essential to resolve challenges in breeding, comparative biology, medicine, and conservation planning. New library preparation techniques along with better assembly algorithms result in continued improvements in assemblies for non-model organisms, moving them toward reference-quality genomes. We report on the latest genome assembly of the Atlantic bottlenose dolphin, leveraging Illumina sequencing data coupled with a combination of several library preparation techniques. These include Linked-Reads (Chromium, 10x Genomics), mate pairs (MP), long insert paired ends, and standard paired end. Data were assembled with the commercial DeNovoMAGIC assembly software, resulting in two assemblies, a traditional "haploid" assembly (Tur_tru_Illumina_hap_v1) that is a mosaic of the two parental haplotypes and a phased assembly (Tur_tru_Illumina_phased_v1) where each scaffold has sequence from a single homologous chromosome. We show that Tur_tru_Illumina_hap_v1 is more complete and more accurate compared to the current best reference based on the amount and composition of sequence, the consistency of the MP alignments to the assembled scaffolds, and on the analysis of conserved single-copy mammalian orthologs. The phased de novo assembly Tur_tru_Illumina_phased_v1 is the first publicly available for this species and provides the community with novel and accurate ways to explore the heterozygous nature of the dolphin genome.

  • Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences.
    Gigascience (IF 4.688) Pub Date : 2018-12-12
    Chris-Andre Leimeister,Jendrik Schellhorn,Svenja Dörrer,Michael Gerth,Christoph Bleidorn,Burkhard Morgenstern

    Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through GitHub: https://github.com/jschellh/ProtSpaM.
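
    To make the contrast in this abstract concrete, the sketch below shows the older style of word-frequency comparison that it calls a rough measure: the Euclidean distance between normalized k-mer frequency vectors. This is not Filtered Spaced Word Matches and not the Prot-SpaM algorithm; the function names, the choice of k, and the toy peptide strings are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def word_counts(sequence, k):
    """Count every overlapping word (k-mer) of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def word_frequency_distance(seq_a, seq_b, k=3):
    """Euclidean distance between the relative k-mer frequencies of two sequences."""
    counts_a = word_counts(seq_a, k)
    counts_b = word_counts(seq_b, k)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both sequences must be at least k characters long")
    # A Counter returns 0 for words it has not seen, so the union is safe to index.
    words = set(counts_a) | set(counts_b)
    return sqrt(sum((counts_a[w] / total_a - counts_b[w] / total_b) ** 2
                    for w in words))


# Hypothetical toy peptides, for illustration only.
print(word_frequency_distance("MKTAYIAKQRQISFVK", "MKTAYIAKQRQISFVK"))
print(word_frequency_distance("MKTAYIAKQRQISFVK", "MKSAYLAKQRNISFVR"))
```

    A distance of this kind reflects composition only and is not an estimate of substitutions per site, which is the limitation the abstract attributes to earlier word-frequency approaches.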

  • A micro X-ray computed tomography dataset of fossil echinoderms in an ancient obrution bed: a robust method for taphonomic and palaeoecologic analyses.
    Gigascience (IF 4.688) Pub Date : 2018-12-12
    Mhairi Reid,Emese M Bordy,Wendy L Taylor,Stephan G le Roux,Anton du Plessis

    BACKGROUND Taphonomic and palaeoecologic studies of obrution beds often employ conventional methods of investigation such as physical removal and extraction of fossils from their host rock (matrix) by mechanical preparation. This often-destructive method is not suitable for studying mold fossils, which are voids left in host rocks due to dissolution of the original organism in post-depositional processes. FINDINGS Microcomputed tomography (µCT) scan data of 24 fossiliferous rock samples revealed thousands of Paleozoic echinoderms. Digitally "stitching" together individually µCT scanned rock samples within three-dimensional (3D) space allows for quantifiable taphonomic data on a fossil echinoderm-rich obrution deposit from the Devonian (Emsian) of South Africa. Here, we provide a brief step-by-step guide on creating, segmenting, and ultimately combining sections of richly fossiliferous beds to create virtual models suited for the quantitative and qualitative taphonomic analyses of fossil invertebrate assemblages. CONCLUSIONS Visualizing the internal features of fossiliferous beds in 3D is an invaluable taphonomic tool for analyzing delicate fossils, accounting for all specimens irrespective of their preservation stages and with minimal damage. This technique is particularly useful for analyzing fossiliferous deposits with mold fossils that prove to be difficult to study with traditional methods, because the method relies on the large density contrast between the mold and host rock.

  • The genome of common long-arm octopus Octopus minor.
    Gigascience (IF 4.688) Pub Date : 2018-09-27
    Bo-Mi Kim,Seunghyun Kang,Do-Hwan Ahn,Seung-Hyun Jung,Hwanseok Rhee,Jong Su Yoo,Jong-Eun Lee,SeungJae Lee,Yong-Hee Han,Kyoung-Bin Ryu,Sung-Jin Cho,Hyun Park,Hye Suck An

    Background The common long-arm octopus (Octopus minor) is found in mudflats of subtidal zones and faces numerous environmental challenges. The ability to adapt its morphology and behavioral repertoire to diverse environmental conditions makes the species a promising model for understanding genomic adaptation and evolution in cephalopods. Findings The final genome assembly of O. minor is 5.09 Gb, with a contig N50 size of 197 kb and longest size of 3.027 Mb, from a total of 419 Gb raw reads generated using the Pacific Biosciences RS II platform. We identified 30,010 genes; 44.43% of the genome is composed of repeat elements. The genome-wide phylogenetic tree indicated the divergence time between O. minor and Octopus bimaculoides was estimated to be 43 million years ago based on single-copy orthologous genes. In total, 178 gene families are expanded in O. minor in the 14 bilaterian species. Conclusions We found that the O. minor genome was larger than that of closely related O. bimaculoides, and this difference could be explained by enlarged introns and recently diversified transposable elements. The high-quality O. minor genome assembly provides a valuable resource for understanding octopus genome evolution and the molecular basis of adaptations to mudflats.

  • Single-cell RNA-seq reveals dynamic transcriptome profiling in human early neural differentiation.
    Gigascience (IF 4.688) Pub Date : 2018-09-22
    Zhouchun Shang,Dongsheng Chen,Quanlei Wang,Shengpeng Wang,Qiuting Deng,Liang Wu,Chuanyu Liu,Xiangning Ding,Shiyou Wang,Jixing Zhong,Doudou Zhang,Xiaodong Cai,Shida Zhu,Huanming Yang,Longqi Liu,J Lynn Fink,Fang Chen,Xiaoqing Liu,Zhengliang Gao,Xun Xu

    Background Investigating cell fate decision and subpopulation specification in the context of the neural lineage is fundamental to understanding neurogenesis and neurodegenerative diseases. The differentiation process of neural-tube-like rosettes in vitro is representative of neural tube structures, which are composed of radially organized, columnar epithelial cells and give rise to functional neural cells. However, the underlying regulatory network of cell fate commitment during early neural differentiation remains elusive. Results In this study, we investigated the genome-wide transcriptome profile of single cells from six consecutive reprogramming and neural differentiation time points and identified cellular subpopulations present at each differentiation stage. Based on the inferred reconstructed trajectory and the characteristics of subpopulations contributing the most toward commitment to the central nervous system lineage at each stage during differentiation, we identified putative novel transcription factors in regulating neural differentiation. In addition, we dissected the dynamics of chromatin accessibility at the neural differentiation stages and revealed active cis-regulatory elements for transcription factors known to have a key role in neural differentiation as well as for those that we suggest are also involved. Further, communication network analysis demonstrated that cellular interactions most frequently occurred in the embryoid body stage and that each cell subpopulation possessed a distinctive spectrum of ligands and receptors associated with neural differentiation that could reflect the identity of each subpopulation. Conclusions Our study provides a comprehensive and integrative study of the transcriptomics and epigenetics of human early neural differentiation, which paves the way for a deeper understanding of the regulatory mechanisms driving the differentiation of the neural lineage.

  • Chromosome-level genome assembly of the spotted sea bass, Lateolabrax maculatus.
    Gigascience (IF 4.688) Pub Date : 2018-09-22
    Changwei Shao,Chang Li,Na Wang,Yating Qin,Wenteng Xu,Qun Liu,Qian Zhou,Yong Zhao,Xihong Li,Shanshan Liu,Xiaowu Chen,Shahid Mahboob,Xin Liu,Songlin Chen

    Background The spotted sea bass (Lateolabrax maculatus) is a valuable commercial fish that is widely cultured in China. While analyses using molecular markers and population genetics have been conducted, genomic resources are lacking. Findings Here, we report a chromosome-scale assembly of the spotted sea bass genome by high-depth genome sequencing, assembly, and annotation. The genome size was 0.67 Gb with contig and scaffold N50 lengths of 31 kb and 1,040 kb, respectively. Hi-C scaffolding of the genome resulted in 24 pseudochromosomes containing 77.68% of the total assembled sequences. A total of 132.38 Mb repeat sequences were detected, accounting for 20.73% of the assembled genome. A total of 22,015 protein-coding genes were predicted, of which 96.52% were homologous to proteins in various databases. In addition, we constructed a phylogenetic tree using 1,586 single-copy gene families and identified 125 unique gene families in the spotted sea bass genome. Conclusions We assembled a spotted sea bass genome that will be a valuable genomic resource for understanding the biology of the spotted sea bass and will also lead to the development of molecular breeding techniques to generate spotted sea bass with better economic traits.

  • Using and understanding cross-validation strategies. Perspectives on Saeb et al.
    Gigascience (IF 4.688) Pub Date : 2017-03-23
    Max A Little,Gael Varoquaux,Sohrab Saeb,Luca Lonini,Arun Jayaraman,David C Mohr,Konrad P Kording

    This three-part review takes a detailed look at the complexities of cross-validation, fostered by the peer review of Saeb et al.'s paper entitled "The need to approximate the use-case in clinical machine learning." It contains perspectives by reviewers and by the original authors that touch upon cross-validation: the suitability of different strategies and their interpretation.

  • The need to approximate the use-case in clinical machine learning.
    Gigascience (IF 4.688) Pub Date : 2017-03-23
    Sohrab Saeb,Luca Lonini,Arun Jayaraman,David C Mohr,Konrad P Kording

    The availability of smartphone and wearable sensor technology is leading to a rapid accumulation of human subject data, and machine learning is emerging as a technique to map those data into clinical predictions. As machine learning algorithms are increasingly used to support clinical decision making, it is vital to reliably quantify their prediction accuracy. Cross-validation (CV) is the standard approach where the accuracy of such algorithms is evaluated on part of the data the algorithm has not seen during training. However, for this procedure to be meaningful, the relationship between the training and the validation set should mimic the relationship between the training set and the dataset expected for the clinical use. Here we compared two popular CV methods: record-wise and subject-wise. While the subject-wise method mirrors the clinically relevant use-case scenario of diagnosis in newly recruited subjects, the record-wise strategy has no such interpretation. Using both a publicly available dataset and a simulation, we found that record-wise CV often massively overestimates the prediction accuracy of the algorithms. We also conducted a systematic review of the relevant literature, and found that this overly optimistic method was used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes. As we move towards an era of machine learning-based diagnosis and treatment, using proper methods to evaluate their accuracy is crucial, as inaccurate results can mislead both clinicians and data scientists.
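
    The distinction this abstract draws between record-wise and subject-wise cross-validation can be sketched with a single train/test split. The example below is a minimal illustration using only the Python standard library; the record layout (subject ID, recording ID), the function names, and the toy data are assumptions and do not reproduce the authors' code or datasets.

```python
import random


def record_wise_split(records, test_fraction, seed=0):
    """Shuffle individual records, ignoring which subject they came from."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def subject_wise_split(records, test_fraction, seed=0):
    """Hold out whole subjects, so no subject appears in both train and test."""
    rng = random.Random(seed)
    subjects = sorted({subject for subject, _ in records})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])
    train = [r for r in records if r[0] not in held_out]
    test = [r for r in records if r[0] in held_out]
    return train, test


def shared_subjects(train, test):
    """Subjects that contribute records to both sides of the split."""
    return {s for s, _ in train} & {s for s, _ in test}


# Hypothetical toy data: 5 subjects with 20 recordings each.
records = [(f"subject{s}", f"recording{r}") for s in range(5) for r in range(20)]

print(sorted(shared_subjects(*record_wise_split(records, 0.3))))
print(sorted(shared_subjects(*subject_wise_split(records, 0.3))))
```

    When a subject contributes records to both sides, a model can score well by recognizing the person rather than the clinical state, which is one way to understand the overestimation reported for record-wise CV. In practice the same grouping idea would be applied to every fold of a k-fold scheme.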

  • BioPortainer Workbench: a versatile and user-friendly system that integrates implementation, management, and use of bioinformatics resources in Docker environments.
    Gigascience (IF 4.688) Pub Date : 2019-06-22
    Fabiano B Menegidio,David Aciole Barbosa,Rafael Dos S Gonçalves,Marcio M Nishime,Daniela L Jabes,Regina Costa de Oliveira,Luiz R Nunes

    BACKGROUND The Docker project is providing a promising strategy for the development of virtualization systems in bioinformatics. However, implementation, management, and launching of Docker containers is not entirely trivial for users not fully familiarized with command line interfaces. This has prompted the development of graphical user interfaces to facilitate the interaction of inexperienced users with Docker environments. RESULTS We describe the BioPortainer Workbench, an integrated Docker system that assists inexperienced users in interacting with a bioinformatics-dedicated Docker environment at 3 main levels: (i) infrastructure, (ii) platform, and (iii) application. CONCLUSIONS The BioPortainer Workbench represents a pioneering effort in developing a comprehensive and easy-to-use Docker platform focused on bioinformatics, which may greatly assist in the dissemination of Docker virtualization technology in this complex field of research.

  • Genome sequence of Malania oleifera, a tree with great value for nervonic acid production.
    Gigascience (IF 4.688) Pub Date : 2019-01-29
    Chao-Qun Xu,Hui Liu,Shan-Shan Zhou,Dong-Xu Zhang,Wei Zhao,Sihai Wang,Fu Chen,Yan-Qiang Sun,Shuai Nie,Kai-Hua Jia,Si-Qian Jiao,Ren-Gang Zhang,Quan-Zheng Yun,Wenbin Guan,Xuewen Wang,Qiong Gao,Jeffrey L Bennetzen,Fatemeh Maghuly,Ilga Porth,Yves Van de Peer,Xiao-Ru Wang,Yongpeng Ma,Jian-Feng Mao

    BACKGROUND Malania oleifera, a member of the Olacaceae family, is an IUCN red listed tree, endemic and restricted to the Karst region of southwest China. This tree's seed is valued for its high content of precious fatty acids (especially nervonic acid). However, studies on its genetic makeup and fatty acid biogenesis are severely hampered by a lack of molecular and genetic tools. FINDINGS We generated 51 Gb and 135 Gb of raw DNA sequences, using Pacific Biosciences (PacBio) single-molecule real-time and 10× Genomics sequencing, respectively. A final genome assembly, with a scaffold N50 size of 4.65 Mb and a total length of 1.51 Gb, was obtained by primary assembly based on PacBio long reads plus scaffolding with 10× Genomics reads. Identified repeats constituted ∼82% of the genome, and 24,064 protein-coding genes were predicted with high support. The genome has low heterozygosity and shows no evidence for recent whole genome duplication. Metabolic pathway genes relating to the accumulation of long-chain fatty acid were identified and studied in detail. CONCLUSIONS Here, we provide the first genome assembly and gene annotation for M. oleifera. The availability of these resources will be of great importance for conservation biology and for the functional genomics of nervonic acid biosynthesis.

  • NanoPipe: a web server for nanopore MinION sequencing data analysis.
    Gigascience (IF 4.688) Pub Date : 2019-01-29
    Victoria Shabardina,Tabea Kischka,Felix Manske,Norbert Grundmann,Martin C Frith,Yutaka Suzuki,Wojciech Makałowski

    BACKGROUND The fast-moving progress of the third-generation long-read sequencing technologies will soon bring the biological and medical sciences to a new era of research. Altogether, the technique and experimental procedures are becoming more straightforward and available to biologists from diverse fields, even without any profound experience in DNA sequencing. Thus, the introduction of the MinION device by Oxford Nanopore Technologies promises to "bring sequencing technology to the masses" and also allows quick and operative analysis in field studies. However, the convenience of this sequencing technology dramatically contrasts with the available analysis tools, which may significantly reduce the enthusiasm of a "regular" user. To really bring the sequencing technology to every biologist, we need a set of user-friendly tools that can perform a powerful analysis in an automatic manner. FINDINGS NanoPipe was developed in consideration of the specifics of the MinION sequencing technologies, providing accordingly adjusted alignment parameters. The range of the target species/sequences for the alignment is not limited, and the descriptive usage page of NanoPipe helps a user to succeed with NanoPipe analysis. The results contain alignment statistics, consensus sequence, polymorphism data, and visualization of the alignment. Several test cases are used to demonstrate the efficiency of the tool. CONCLUSIONS Freely available NanoPipe software allows effortless and reliable analysis of MinION sequencing data for experienced bioinformaticians, as well as for wet-lab biologists with minimal bioinformatics knowledge. Moreover, for the latter group, we describe the basic algorithm necessary for MinION sequencing analysis from the first to last step.

  • Galaxy mothur Toolset (GmT): a user-friendly application for 16S rRNA gene sequencing analysis using mothur.
    Gigascience (IF 4.688) Pub Date : 2019-01-01
    Saskia D Hiltemann,Stefan A Boers,Peter J van der Spek,Ruud Jansen,John P Hays,Andrew P Stubbs

    BACKGROUND The determination of microbial communities using the mothur tool suite (https://www.mothur.org) is well established. However, mothur requires bioinformatics-based proficiency in order to perform calculations via the command-line. Galaxy is a project dedicated to providing a user-friendly web interface for such command-line tools (https://galaxyproject.org/). RESULTS We have integrated the full set of 125+ mothur tools into Galaxy as the Galaxy mothur Toolset (GmT) and provided a set of workflows to perform end-to-end 16S rRNA gene analyses and integrate with third-party visualization and reporting tools. We demonstrate the utility of GmT by analyzing the mothur MiSeq standard operating procedure (SOP) dataset (https://www.mothur.org/wiki/MiSeq_SOP). CONCLUSIONS GmT is available from the Galaxy Tool Shed, and a workflow definition file and full Galaxy training manual for the mothur SOP have been created. A Docker image with a fully configured GmT Galaxy is also available.

  • PhenoMeNal: processing and analysis of metabolomics data in the cloud.
    Gigascience (IF 4.688) Pub Date : 2018-12-12
    Kristian Peters,James Bradbury,Sven Bergmann,Marco Capuccini,Marta Cascante,Pedro de Atauri,Timothy M D Ebbels,Carles Foguet,Robert Glen,Alejandra Gonzalez-Beltran,Ulrich L Günther,Evangelos Handakas,Thomas Hankemeier,Kenneth Haug,Stephanie Herman,Petr Holub,Massimiliano Izzo,Daniel Jacob,David Johnson,Fabien Jourdan,Namrata Kale,Ibrahim Karaman,Bita Khalili,Payam Emami Khonsari,Kim Kultima,Samuel Lampa,Anders Larsson,Christian Ludwig,Pablo Moreno,Steffen Neumann,Jon Ander Novella,Claire O'Donovan,Jake T M Pearce,Alina Peluso,Marco Enrico Piras,Luca Pireddu,Michelle A C Reed,Philippe Rocca-Serra,Pierrick Roger,Antonio Rosato,Rico Rueedi,Christoph Ruttkies,Noureddin Sadawi,Reza M Salek,Susanna-Assunta Sansone,Vitaly Selivanov,Ola Spjuth,Daniel Schober,Etienne A Thévenot,Mattia Tomasoni,Merlijn van Rijswijk,Michael van Vliet,Mark R Viant,Ralf J M Weber,Gianluigi Zanetti,Christoph Steinbeck

    BACKGROUND Metabolomics is the comprehensive study of a multitude of small molecules to gain insight into an organism's metabolism. The research field is dynamic and expanding with applications across biomedical, biotechnological, and many other applied biological domains. Its computationally intensive nature has driven requirements for open data formats, data repositories, and data analysis tools. However, the rapid progress has resulted in a mosaic of independent, and sometimes incompatible, analysis methods that are difficult to connect into a useful and complete data analysis solution. FINDINGS PhenoMeNal (Phenome and Metabolome aNalysis) is an advanced and complete solution to set up Infrastructure-as-a-Service (IaaS) that brings workflow-oriented, interoperable metabolomics data analysis platforms into the cloud. PhenoMeNal seamlessly integrates a wide array of existing open-source tools that are tested and packaged as Docker containers through the project's continuous integration process and deployed based on a Kubernetes orchestration framework. It also provides a number of standardized, automated, and published analysis workflows in the user interfaces Galaxy, Jupyter, Luigi, and Pachyderm. CONCLUSIONS PhenoMeNal constitutes a keystone solution in cloud e-infrastructures available for metabolomics. PhenoMeNal is a unique and complete solution for setting up cloud e-infrastructures through easy-to-use web interfaces that can be scaled to any custom public and private cloud environment. By harmonizing and automating software installation and configuration and through ready-to-use scientific workflow user interfaces, PhenoMeNal has succeeded in providing scientists with workflow-driven, reproducible, and shareable metabolomics data analysis platforms that are versioned, interfaced through standard data formats and representative datasets, and tested for reproducibility and interoperability. The elastic implementation of PhenoMeNal further allows easy adaptation of the infrastructure to other application areas and 'omics research domains.

Contents have been reproduced by permission of the publishers.