Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches — reference-based, de novo and combined strategies — along with some perspectives on transcriptome assembly in the near future.
Here we use whole-genome de novo assembly of second-generation sequencing reads to map structural variation (SV) in an Asian genome and an African genome. Our approach identifies small- and intermediate-size homozygous variants (1–50 kb) including insertions, deletions, inversions and their precise breakpoints, and in contrast to other methods, can resolve complex rearrangements. In total, we identified 277,243 SVs ranging in length from 1–23 kb. Validation using computational and experimental methods suggests that we achieve overall <6% false-positive rate and <10% false-negative rate in genomic regions that can be assembled, which outperforms other methods. Analysis of the SVs in the genomes of 106 individuals sequenced as part of the 1000 Genomes Project suggests that SVs account for a greater fraction of the diversity between individuals than do single-nucleotide polymorphisms (SNPs). These findings demonstrate that whole-genome de novo assembly is a feasible approach to deriving more comprehensive maps of genetic variation.
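A BGI paper. BGI, that is, Huada (华大), has deep pockets and has spent its way into a good number of high-impact-factor papers; this is one of them.
They identify structural variation (SV) through whole-genome de novo assembly.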
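Validation is a necessary step. Unlike the 1000 Genomes Project (paper), which used experimental (wet) methods, they used both experimental (wet) and computational (dry) methods. For the computational validation, they use BWA or SOAPaligner to align the raw reads to the reference genome. Why can't those alignments be used for SV detection directly? Most other software does exactly that. For the sake of "innovation" they took a different route with de novo assembly, yet they still need to align to the reference. Isn't that something of a detour? Another question: what if the alignment results disagree with the de novo assembled sequence?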
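They also compared their method with two other tools, BreakDancer (paper) and Pindel (paper), and "reportedly" it is more precise than both ("similar sensitivity but improved precision"). But why not compare against other tools as well?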
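To make the quoted phrase "similar sensitivity but improved precision" concrete, here is a minimal Python sketch of the two metrics. The counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the names `tool_a` and `tool_b` do not refer to any real program.

```python
def precision(tp, fp):
    """Fraction of reported calls that are real: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Fraction of real events that were reported: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts (NOT from the paper): both tools find the same
# number of true SVs, but tool_b reports fewer false positives.
tool_a = {"tp": 80, "fp": 20, "fn": 20}
tool_b = {"tp": 80, "fp": 5, "fn": 20}

for name, c in (("tool_a", tool_a), ("tool_b", tool_b)):
    print(name,
          "sensitivity=%.3f" % sensitivity(c["tp"], c["fn"]),
          "precision=%.3f" % precision(c["tp"], c["fp"]))
```

With these made-up numbers the two tools have identical sensitivity, while the one with fewer false positives has the higher precision, which is the pattern the quote describes.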
Yong Lin, Jian Li, Hui Shen, Lei Zhang, Christopher J Papasian and Hong-Wen Deng
Motivation: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to produce the best performance under a specific set of conditions.
Results: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence coverage, and assembly accuracy. Various properties of read data, including single-end/paired-end, sequence GC content, depth of coverage and base calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative performance of individual tools are summarized and tentative guidelines for optimal selection of different assembly tools, under different conditions, are provided.
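A systematic comparison of several popular de novo assembly tools for next-generation sequencing, and a very good paper. Plenty of people have wavered among the various assembly programs without knowing which one to pick, and this paper is a real help.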
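N50 length is one of the criteria named in the abstract above. As a reminder of what it measures, here is a minimal Python sketch using the common definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are my own, for illustration only; the paper may compute it with slightly different conventions.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length covers half the assembly.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))
```

In this toy case the total is 1500, and the two longest contigs (500 + 400) already cover half of it, so the N50 is 400. A few long contigs can therefore raise N50 a lot, which is why it is usually read alongside coverage and accuracy rather than alone.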
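One more thing to record. Someone asked me to review a paper on identification of SNPs without the reference genome. Embarrassing! I haven't even published a first-author paper of my own. My boss backed me up: go ahead and review it, and I'll check it over for you when you're done. (His exact words: "you can review. if you like, i can look at your review before you send it in.")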
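A thread is a single flow of execution. It is the smallest unit of control in a running program, that is, the smallest unit of work the CPU can schedule. Threads and processes are related but distinct. Put simply, a thread always belongs to exactly one process, and a process contains one or more threads. Early computer systems could run only one program at a time, so when several programs needed to run, the only option was to queue them up and execute them serially. Processes changed this: CPU time is divided into time slices that are handed out to different processes. At the micro level the processes still execute serially, but at the macro level the different programs are effectively running in parallel. Applying the same idea to a process itself, it is natural to subdivide the process into still smaller units of execution, namely threads. Because a single process often needs to carry out several similar tasks at once, these threads can share resources such as the same code segment, data segment and file handles. With processes, we can run Firefox and Microsoft Office Word at the same time on a single-CPU machine; with threads, Firefox can load several different pages in different tabs at once, and Word can check grammar while we edit a document. Threads therefore bring higher CPU utilization, faster program response, more economical use of resources and better adaptation to multi-CPU architectures. For a detailed treatment of multithreading, see: an explanation of the Perl threading model (http://it.chinawin.net/softwaredev/article-124a1.html)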
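The point that threads within one process share the same data can be sketched in a few lines. The example below is in Python rather than Perl and is only an illustration of the idea, not of the Perl threading model described at the link; the "tabs" and their contents are made up.

```python
import threading

# Shared state: every thread in this process sees the same dict.
page_lengths = {}
lock = threading.Lock()


def load_page(name, text):
    """Pretend to load a page and record its length in the shared dict."""
    length = len(text)
    with lock:  # serialize writes to the shared dict
        page_lengths[name] = length


# Hypothetical "tabs": names and contents are invented for illustration.
tabs = {"tab1": "hello", "tab2": "threads share memory", "tab3": ""}

workers = [threading.Thread(target=load_page, args=(name, text))
           for name, text in tabs.items()]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()  # wait until every thread has finished

print(sorted(page_lengths.items()))
```

Each worker writes into the same `page_lengths` dict without any copying or message passing, which is what sharing a data segment means in practice. The lock is there because shared data is also the main hazard of threads: two threads updating the same structure at once can interfere with each other.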
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (~4×) 1000 Genomes Project datasets.