Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of "marker" genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
The human genome contains many thousands of long noncoding RNAs (IncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive IncRNA annotation. Here, we present and analyze the most complete human IncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that IncRNAs are generated through pathways similar to that of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon /intron lengths. In contrast to protein-coding genes, however, IncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences-particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that IncRNAs are generally lower expressed than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific IncRNAs expressed in the brain. Expression correlation analysis indicates that IncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of IncRNAs.
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (IncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
RNA-guided engineered nucleases (RGENs) derived from the prokaryotic adaptive immune system known as CRISPR clustered, regularly interspaced, short palindromic repeat)/Cas (CRISPR-associated) enable genome editing in human cell lines, animals, and plants, but are limited by off-target effects and unwanted integration of DNA segments derived from plasmids encoding Cas9 and guide RNA at both on-target and off-target sites in the genome. Here, we deliver purified recombinant Cas9 protein and guide RNA into cultured human cells including hard-to-transfect fibroblasts and pluripotent stem cells. RGEN ribonucleoproteins (RNPs) induce site-specific mutations at frequencies of up to 79%, while reducing off-target mutations associated with plasmid transfection at off-target sites that differ by one or two nucleotides from on-target sites. RGEN RNPs cleave chromosomal DNA almost immediately after delivery and are degraded rapidly in cells, reducing off-target effects. Furthermore, RNP delivery is less stressful to human embryonic stem cells, producing at least twofold more colonies than does plasmid transfection.
RNA-guided endonucleases (RGENs), derived from the prokaryotic adaptive immune system known as CRISPR/Cas, enable targeted genome engineering in cells and organisms. RGENs are ribonucleoproteins that consist of guide RNA and Cas9, a protein component originated from Streptococcus pyogenes. These enzymes cleave chromosomal DNA, whose sequence is complementary, to guide RNA in a targeted manner, producing site-specific DNA double-strand breaks (DSBs), the repair of which gives rise to targeted genome modifications. Despite broad interest in RGEN-mediated genome editing, these nucleases are limited by off-target mutations and unwanted chromosomal translocations associated with off-target DNA cleavages. Here, we show that off-target effects of RGENs can be reduced below the detection limits of deep sequencing by choosing unique target sequences in the genome and modifying both guide RNA and Cas9. We found that both the composition and structure of guide RNA can affect RGEN activities in cells to reduce off-target effects. RGENs efficiently discriminated on-target sites from off-target sites that differ by two bases. Furthermore, exome sequencing analysis showed that no off-target mutations were induced by two RGENs in four clonal populations of mutant cells. In addition, paired Cas9 nickases, composed of D10A Cas9 and guide RNA, which generate two single-strand breaks (SSBs) or nicks on different DNA strands, were highly specific in human cells, avoiding off-target mutations without sacrificing genome-editing efficiency. Interestingly, paired nickases induced chromosomal deletions in a targeted manner without causing unwanted translocations. Our results highlight the importance of choosing unique target sequences and optimizing guide RNA and Cas9 to avoid or reduce RGEN-induced off-target mutations.
Cancer is a disease driven by genetic variation and mutation. Exome sequencing can be utilized for discovering these variants and mutations across hundreds of tumors. Here we present an analysis tool, VarScan 2, for the detection of somatic mutations and copy number alterations (CNAs) in exome data from tumor normal pairs. Unlike most current approaches, our algorithm reads data from both samples simultaneously; a heuristic and statistical algorithm detects sequence variants and classifies them by somatic status (germline, somatic, or LOH); while a comparison of normalized read depth delineates relative copy number changes. We apply these methods to the analysis of exome sequence data from 151 high-grade ovarian tumors characterized as part of the Cancer Genome Atlas (TCGA). We validated some 7790 somatic coding mutations, achieving 93% sensitivity and 85% precision for single nucleotide variant (SNV) detection. Exome-based CNA analysis identified 29 large-scale alterations and 619 focal events per tumor on average. As in our previous analysis of these data, we observed frequent amplification of oncogenes (e.g., CCNE1, MYC) and deletion of tumor suppressors (NF1, PTEN, and CDKN2A). We searched for additional recurrent focal CNAs using the correlation matrix diagonal segmentation (CMDS) algorithm, which identified 424 significant events affecting 582 genes. Taken together, our results demonstrate the robust performance of VarScan 2 for somatic mutation and CNA detection and shed new light on the landscape of genetic alterations in ovarian cancer.
As the sequencing of healthy and disease genomes becomes more commonplace, detailed annotation provides interpretation for individual variation responsible for normal and disease phenotypes. Current approaches focus on direct changes in protein coding genes, particularly nonsynonymous mutations that directly affect the gene product. However, most individual variation occurs outside of genes and, indeed, most markers generated from genome-wide association studies (GWAS) identify variants outside of coding segments. Identification of potential regulatory changes that perturb these sites will lead to a better localization of truly functional variants and interpretation of their effects. We have developed a novel approach and database, RegulomeDB, which guides interpretation of regulatory variants in the human genome. RegulomeDB includes high-throughput, experimental data sets from ENCODE and other sources, as well as computational predictions and manual annotations to identify putative regulatory potential and identify functional variants. These data sources are combined into a powerful tool that scores variants to help separate functional variants from a large pool and provides a small set of putative sites with testable hypotheses as to their function. We demonstrate the applicability of this tool to the annotation of noncoding variants from 69 full sequenced genomes as well as that of a personal genome, where thousands of functionally associated variants were identified. Moreover, we demonstrate a GWAS where the database is able to quickly identify the known associated functional variant and provide a hypothesis as to its function. Overall, we expect this approach and resource to be valuable for the annotation of human genome sequences.
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS the 1000 Genome pilot alone includes nearly five terabases-make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing CHIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of chip-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http:/ / encodeproject.org/ENCODE/) and modENCODE (http:/ /www.modencode. org/) portals.
Bacterial diversity among environmental samples is commonly assessed with PCR-amplified 16S rRNA gene (16S) sequences. Perceived diversity, however, can be influenced by sample preparation, primer selection, and formation of chimeric 16S amplification products. Chimeras are hybrid products between multiple parent sequences that can be falsely interpreted as novel organisms, thus inflating apparent diversity. We developed a new chimera detection tool called Chimera Slayer (CS). CS detects chimeras with greater sensitivity than previous methods, performs well on short sequences such as those produced by the 454 Life Sciences (Roche) Genome Sequencer, and can scale to large data sets. By benchmarking CS performance against sequences derived from a controlled DNA mixture of known organisms and a simulated chimera set, we provide insights into the factors that affect chimera formation such as sequence abundance, the extent of similarity between 16S genes, and PCR conditions. Chimeras were found to reproducibly form among independent amplifications and contributed to false perceptions of sample diversity and the false identification of novel taxa, with less-abundant species exhibiting chimera rates exceeding 70%. Shotgun metagenomic sequences of our mock community appear to be devoid of 16S chimeras, supporting a role for shotgun metagenomics in validating novel organisms discovered in targeted sequence surveys.
While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging, thus stifling biological discoveries. Moreover, recent studies revealed that complex bacterial populations may be composed from dozens of related strains, thus further amplifying the challenge of metagenomic assembly. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. We benchmark metaSPAdes against other state-of-the-art metagenome assemblers and demonstrate that it results in high-quality assemblies across diverse data sets.
Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.
Next-generation sequencing (NGS) technologies are revolutionizing genome research, and in particular, their application to transcriptomics (RNA-seq) is increasingly being used for gene expression profiling as a replacement for microarrays. However, the properties of RNA-seq data have not been yet fully established, and additional research is needed for understanding how these data respond to differential expression analysis. In this work, we set out to gain insights into the characteristics of RNA-seq data analysis by studying an important parameter of this technology: the sequencing depth. We have analyzed how sequencing depth affects the detection of transcripts and their identification as differentially expressed, looking at aspects such as transcript biotype, length, expression level, and fold-change. We have evaluated different algorithms available for the analysis of RNA-seq and proposed a novel approach-NOISeq-that differs from existing methods in that it is data-adaptive and nonparametric. Our results reveal that most existing methodologies suffer from a strong dependency on sequencing depth for their differential expression calls and that this results in a considerable number of false positives that increases as the number of reads grows. In contrast, our proposed method models the noise distribution from the actual data, can therefore better adapt to the size of the data set, and is more effective in controlling the rate of false discoveries. This work discusses the true potential of RNA-seq for studying regulation at low expression ranges, the noise within RNA-seq data, and the issue of replication.
A major challenge in the analysis of environmental sequences is data integration. The question is how to analyze different types of data in a unified approach, addressing both the taxonomic and functional aspects. To facilitate such analyses, we have substantially extended MEGAN, a widely used taxonomic analysis program. The new program, MEGAN4, provides an integrated approach to the taxonomic and functional analysis of metagenomic, metatranscriptomic, metaproteomic, and rRNA data. While taxonomic analysis is performed based on the NCBI taxonomy, functional analysis is performed using the SEED classification of subsystems and functional roles or the KEGG classification of pathways and enzymes. A number of examples illustrate how such analyses can be performed, and show that one can also import and compare classification results obtained using others' tools. MEGAN4 is freely available for academic purposes, and installers for all three major operating systems can be downloaded from www-ab.informatik.uni-tuebingen.de/software/megan.
The tumor microenvironment of colorectal carcinoma is a complex community of genomically altered cancer cells, nonneoplastic cells, and a diverse collection of microorganisms. Each of these components may contribute to carcinogenesis; however, the role of the microbiota is the least well understood. We have characterized the composition of the microbiota in colorectal carcinoma using whole genome sequences from nine tumor/normal pairs. Fusobacterium sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA sequence analysis of 95 carcinoma/normal DNA pairs, while the Bacteroidetes and Firmicutes phyla were depleted in tumors. Fusobacteria were also visualized within colorectal tumors using FISH. These findings reveal alterations in the colorectal cancer microbiota; however, the precise role of Fusobacteria in colorectal carcinoma pathogenesis requires further investigation.
RNA-seq is a powerful tool for the study of alternative splicing and other forms of alternative isoform expression. Understanding the regulation of these processes requires sensitive and specific detection of differential isoform abundance in comparisons between conditions, cell types, or tissues. We present DEXSeq, a statistical method to test for differential exon usage in RNA-seq data. DEXSeq uses generalized linear models and offers reliable control of false discoveries by taking biological variation into account. DEXSeq detects with high sensitivity genes, and in many cases exons, that are subject to differential exon usage. We demonstrate the versatility of DEXSeq by applying it to several data sets. The method facilitates the study of regulation and function of alternative exon usage on a genome-wide scale. An implementation of DEXSeq is available as an R/Bioconductor package.
We created a visualization tool called Circos to facilitate the identification and analysis of similarities and differences arising from comparisons of genomes. Our tool is effective in displaying variation in genome structure and, generally, any other kind of positional relationships between genomic intervals. Such data are routinely produced by sequence alignments, hybridization arrays, genome mapping, and genotyping studies. Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements. Circos is capable of displaying data as scatter, line, and histogram plots, heat maps, tiles, connectors, and text. Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines.
Atopic dermatitis (AD) has long been associated with Staphylococcus aureus skin colonization or infection and is typically managed with regimens that include antimicrobial therapies. However, the role of microbial communities in the pathogenesis of AD is incompletely characterized. To assess the relationship between skin microbiota and disease progression, 16S ribosomal RNA bacterial gene sequencing was performed on DNA obtained directly from serial skin sampling of children with AD. The composition of bacterial communities was analyzed during AD disease states to identify characteristics associated with AD flares and improvement post-treatment. We found that microbial community structures at sites of disease predilection were dramatically different in AD patients compared with controls. Microbial diversity during AD flares was dependent on the presence or absence of recent AD treatments, with even intermittent treatment linked to greater bacterial diversity than no recent treatment. Treatment-associated changes in skin bacterial diversity suggest that AD treatments diversify skin bacteria preceding improvements in disease activity. In AD, the proportion of Staphylococcus sequences, particularly S. aureus, was greater during disease flares than at baseline or post-treatment, and correlated with worsened disease severity. Representation of the skin commensal S. epidermidis also significantly increased during flares. Increases in Streptococcus, Propionibacterium, and Corynebacterium species were observed following therapy. These findings reveal linkages between microbial communities and inflammatory diseases such as AD, and demonstrate that as compared with culture-based studies, higher resolution examination of microbiota associated with human disease provides novel insights into global shifts of bacteria relevant to disease progression and treatment.
An estimated 15% or more of the cancer burden worldwide is attributable to known infectious agents. We screened colorectal carcinoma and matched normal tissue specimens using RNA-seq followed by host sequence subtraction and found marked over-representation of Fusobacterium nucleatum sequences in tumors relative to control specimens. F. nucleatum is an invasive anaerobe that has been linked previously to periodontitis and appendicitis, but not to cancer. Fusobacteria are rare constituents of the fecal microbiota, but have been cultured previously from biopsies of inflamed gut mucosa. We obtained a Fusobacterium isolate from a frozen tumor specimen; this showed highest sequence similarity to a known gut mucosa isolate and was confirmed to be invasive. We verified overabundance of Fusobacterium sequences in tumor versus matched normal control tissue by quantitative PCR analysis from a total of 99 subjects (p = 2.5 x 10(-6)), and we observed a positive association with lymph node metastasis.