SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads

Published on Jun 15, 2014 in Bioinformatics (impact factor: 4.531)
DOI: 10.1093/bioinformatics/btu077
Yinlong Xie (SCUT: South China University of Technology), estimated H-index: 6
Gengxiong Wu, estimated H-index: 5
+ 13 authors
Jun Wang, estimated H-index: 141
Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining a large number of gene sequences from an organism with no reference genome. Owing to the rapid increase in throughput and decrease in costs of next-generation sequencing, RNA-Seq in particular has become the method of choice. However, the very short reads (e.g. 2 × 90 bp paired ends) from next-generation sequencing make de novo assembly to recover complete or full-length transcript sequences an algorithmic challenge.
Results: Here, we present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. We evaluated its performance on transcriptome datasets from rice and mouse. Using as our benchmarks the known transcripts from these well-annotated genomes (sequenced a decade ago), we assessed how SOAPdenovo-Trans and two other popular transcriptome assemblers handled such practical issues as alternative splicing and variable expression levels. Our conclusion is that SOAPdenovo-Trans provides higher contiguity, lower redundancy and faster execution.
Availability and implementation: Source code and user manual are available at
Contact: or
Supplementary information: Supplementary data are available at Bioinformatics online.
References (17)
#1 BingXin Lu (ECNU: East China Normal University), H-Index: 1
#2 Zhenbing Zeng (ECNU: East China Normal University), H-Index: 9
Last. Tieliu Shi (ECNU: East China Normal University), H-Index: 24
(3 authors in total)
Transcriptome reconstruction is an important application of RNA-Seq, providing critical information for further analysis of the transcriptome. Although RNA-Seq offers the potential to capture the whole picture of the transcriptome, it still presents special challenges. To handle these difficulties and reconstruct the transcriptome as completely as possible, current computational approaches mainly employ two strategies: de novo assembly and genome-guided assembly. In order to find the similarities and diffe...
37 Citations
#1 Ruibang Luo (HKU: University of Hong Kong), H-Index: 25
#2 Binghang Liu (HKU: University of Hong Kong), H-Index: 20
Last. Jun Wang, H-Index: 141
(30 authors in total)
Background: There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.
2,249 Citations
#1 Marcel H. Schulz (CMU: Carnegie Mellon University), H-Index: 18
#2 Daniel R. Zerbino (UCSC: University of California, Santa Cruz), H-Index: 21
Last. Ewan Birney (EMBL-EBI: European Bioinformatics Institute), H-Index: 103
(4 authors in total)
Motivation: High-throughput sequencing has made the analysis of new model organisms more affordable. Although assembling a new genome can still be costly and difficult, it is possible to use RNA-seq to sequence mRNA. In the absence of a known genome, it is necessary to assemble these sequences de novo, taking into account possible alternative isoforms and the dynamic range of expression values. Results: We present a software package named Oases designed to heuristically assemble RNA-seq reads in...
1,019 Citations
#1 Manfred Grabherr (MIT: Massachusetts Institute of Technology), H-Index: 27
#2 Brian J. Haas (MIT: Massachusetts Institute of Technology), H-Index: 65
Last. Aviv Regev (MIT: Massachusetts Institute of Technology), H-Index: 110
(21 authors in total)
Reconstructing full-length transcripts from high-throughput RNA sequencing data is difficult without a reference genome sequence. Grabherr et al. describe Trinity, an algorithm for assembling full-length transcripts from short reads without first mapping the reads to a genome sequence.
7,893 Citations
#1 Jeffrey Martin (LBNL: Lawrence Berkeley National Laboratory), H-Index: 10
#2 Vincent M. Bruno (Yale University), H-Index: 4
Last. Zhong Wang (LBNL: Lawrence Berkeley National Laboratory), H-Index: 26
(9 authors in total)
Comprehensive annotation and quantification of transcriptomes are outstanding problems in functional genomics. While high throughput mRNA sequencing (RNA-Seq) has emerged as a powerful tool for addressing these problems, its success is dependent upon the availability and quality of reference genome sequences, thus limiting the organisms to which it can be applied. Here, we describe Rnnotator, an automated software pipeline that generates transcript models by de novo assembly of RNA-Seq data with...
168 Citations
#1 Gordon Robertson, H-Index: 21
Last. Inanc Birol, H-Index: 54
(29 authors in total)
We describe Trans-ABySS, a de novo short-read transcriptome assembly and analysis pipeline that addresses variation in local read densities by assembling read substrings with varying stringencies and then merging the resulting contigs before analysis. Analyzing 7.4 gigabases of 50-base-pair paired-end Illumina reads from an adult mouse liver poly(A) RNA library, we identified known, new and alternative structures in expressed transcripts, and achieved high sensitivity and specificity relative to...
644 Citations
Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogene...
290 Citations
#1 Cole Trapnell (UMD: University of Maryland, College Park), H-Index: 42
#2 Brian A. Williams (California Institute of Technology), H-Index: 19
Last. Lior Pachter (University of California, Berkeley), H-Index: 55
(9 authors in total)
High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series....
7,704 Citations
#1 Mitchell Guttman (MIT: Massachusetts Institute of Technology), H-Index: 37
#2 Manuel Garber (Broad Institute), H-Index: 35
Last. Aviv Regev (MIT: Massachusetts Institute of Technology), H-Index: 110
(13 authors in total)
High-throughput sequencing of total cellular RNA by RNA-Seq promises rapid reconstruction of spliced transcripts in a cell population. Guttman et al. accomplish this using only paired-end RNA-seq data and an unannotated genome sequence, and apply the method to better define many new, conserved long intergenic noncoding RNAs (lincRNAs).
982 Citations
#1 Guojie Zhang, H-Index: 61
#2 Guangwu Guo, H-Index: 22
Last. Jun Wang, H-Index: 141
(19 authors in total)
Understanding the dynamics of eukaryotic transcriptome is essential for studying the complexity of transcriptional regulation and its impact on phenotype. However, comprehensive studies of transcriptomes at single base resolution are rare, even for modern organisms, and lacking for rice. Here, we present the first transcriptome atlas for eight organs of cultivated rice. Using high-throughput paired-end RNA-seq, we unambiguously detected transcripts expressing at an extremely low level, as well a...
333 Citations
Cited By (437)
#1 Atsuo Yoshido, H-Index: 13
#2 Jindra Šíchová, H-Index: 9
Last. František Marec, H-Index: 29
(9 authors in total)
Sex-chromosome systems tend to be highly conserved and knowledge about their evolution typically comes from macroevolutionary inference. Rapidly evolving complex sex-chromosome systems represent a rare opportunity to study the mechanisms of sex-chromosome evolution at unprecedented resolution. Three cryptic species of wood-white butterflies—Leptidea juvernica, L. sinapis and L. reali—have each a unique set of multiple sex-chromosomes with 3–4 W and 3–4 Z chromosomes. Using a transcriptome-based ...
#1 Adam Voshall (NU: University of Nebraska–Lincoln), H-Index: 5
#2 Sairam Behera, H-Index: 1
Last. Etsuko N. Moriyama, H-Index: 36
(9 authors in total)
Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high quality transcriptomes is still not a trivial problem. This is especially the case for non-model organisms where adequate reference genomes are often not available. Different methods produce different transcriptome models an...
#1 Marek Cmero (Peter MacCallum Cancer Centre)
#2 Breon Schmidt (Peter MacCallum Cancer Centre)
Last. Nadia Davidson (Peter MacCallum Cancer Centre), H-Index: 1
(6 authors in total)
Structural DNA variants can modify gene function by altering transcript sequences, and have been shown to be drivers in both cancer and rare diseases. Although there are now many methods to detect structural variants from Whole Genome Sequencing (WGS), RNA-sequencing (RNA-seq) remains under-utilised as a technology for the detection of gene altering structural variants. Calling fusion genes from RNA-seq data is well established, but other transcriptional variants such as fusions with novel seque...
#1 Shunfu Mao (UW: University of Washington), H-Index: 1
#2 Lior Pachter (California Institute of Technology), H-Index: 55
Last. Sreeram Kannan (UW: University of Washington), H-Index: 11
(4 authors in total)
High throughput sequencing of RNA (RNA-Seq) has become a staple in modern molecular biology, with applications not only in quantifying gene expression but also in isoform-level analysis of the RNA transcripts. To enable such an isoform-level analysis, a transcriptome assembly algorithm is utilized to stitch together the observed short reads into the corresponding transcripts. This task is complicated due to the complexity of alternative splicing - a mechanism by which the same gene may generate ...
#1 Gregory W. Stull (CAS: Chinese Academy of Sciences), H-Index: 9
#2 Pamela S. Soltis (UF: University of Florida), H-Index: 102
Last. Stephen A. Smith (UM: University of Michigan), H-Index: 55
(5 authors in total)
PREMISE: Discordance between nuclear and organellar phylogenies (cytonuclear discordance) is a well-documented phenomenon at shallow evolutionary levels but has been poorly investigated at deep levels of plant phylogeny. Determining the extent of cytonuclear discordance across major plant lineages is essential not only for elucidating evolutionary processes, but also for evaluating the currently used framework of plant phylogeny, which is largely based on the plastid genome. METHODS: We present ...
#1 Seungho Kang (MSU: Mississippi State University), H-Index: 4
#2 Alexander K. Tice (MSU: Mississippi State University), H-Index: 6
Last. Matthew W. Brown (MSU: Mississippi State University), H-Index: 20
(6 authors in total)
Integrins are transmembrane receptor proteins that activate signal transduction pathways upon extracellular matrix binding. The Integrin Mediated Adhesion Complex (IMAC) mediates various cell physiological processes. The IMAC was thought to be an animal-specific machinery until, over the last decade, these complexes were discovered in Obazoa, the group containing animals, fungi, and several microbial eukaryote lineages. Amoebozoa is the eukaryotic supergroup sister to Obazoa. Even though Amoebozoa ...
#1 Kevin Magne (Université Paris-Saclay), H-Index: 2
#2 Shengbin Liu (Université Paris-Saclay)
Last. Pascal Ratet (Université Paris-Saclay), H-Index: 40
(8 authors in total)
In cultivated grasses, tillering, spike architecture and seed shattering represent major agronomical traits. In barley, maize and rice, the NOOT-BOP-COCH-LIKE (NBCL) genes play important roles in development, especially in ligule development, tillering and flower identity. However, compared with dicots, the role of grass NBCL genes is underinvestigated. To better understand the role of grass NBCLs and to overcome any effects of domestication that might conceal their original functions, we studie...
#1 Mohammad Sadat-Hosseini (UT: University of Tehran), H-Index: 4
#2 Mohammad Reza Bakhtiarizadeh (UT: University of Tehran), H-Index: 9
Last. Kourosh Vahdati (UT: University of Tehran), H-Index: 13
(5 authors in total)
Transcriptome resources can help increase the yield and quality of walnuts. Finding the best transcriptome assembly has not yet been the subject of walnut research. This research generated 240,179,782 reads from 11 walnut leaves according to cDNA libraries. The reads provided a complete de novo transcriptome assembly. Fifteen different transcriptome assemblies were constructed from five different well-known assemblers used in scientific literature with different k-mer lengths (Bridger, ...
#1 Brogan J. Harris, H-Index: 1
#2 C. Jill Harrison, H-Index: 14
Last. Thomas Williams, H-Index: 85
(4 authors in total)
Summary: The origin of land plants was accompanied by new adaptations to life on land, including the evolution of stomata—pores on the surface of plants that regulate gas exchange. The genes that underpin the development and function of stomata have been extensively studied in model angiosperms, such as Arabidopsis. However, little is known about stomata in bryophytes, and their evolutionary origins and ancestral function remain poorly understood. Here, we resolve the position of bryophytes in th...
4 Citations
#1 John D. Hogan (BU: Boston University), H-Index: 3
#2 Jessica L. Keenan (BU: Boston University), H-Index: 2
Last. Cynthia A. Bradham (BU: Boston University), H-Index: 29
(17 authors in total)
Abstract: Embryonic development is arguably the most complex process an organism undergoes during its lifetime, and understanding this complexity is best approached with a systems-level perspective. The sea urchin has become a highly valuable model organism for understanding developmental specification, morphogenesis, and evolution. As a non-chordate deuterostome, the sea urchin occupies an important evolutionary niche between protostomes and vertebrates. Lytechinus variegatus (Lv) is an Atlantic...