[Paper] Efficient De Novo Assembly of Large Genomes Using Compressed Data Structures

Mar 1st, 2012 12:00 am | Comments

Efficient de novo assembly of large genomes using compressed data structures

Jared T. Simpson and Richard Durbin

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

Abstract

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
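
To make the FM-index mentioned in the abstract a little more concrete, here is a minimal sketch of exact-match counting by backward search over a Burrows-Wheeler transform. It is an uncompressed toy version for illustration only, not SGA's implementation; the names `bwt_from_suffix_array`, `FMIndex`, and `count` are my own, and the naive suffix sort and full occurrence table would not scale to real read sets.

```python
from collections import Counter


def bwt_from_suffix_array(text):
    """Return (suffix array, BWT) of text, which must end with the sentinel '$'."""
    # Naive suffix sort: fine for a toy example, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix (wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


class FMIndex:
    """Toy FM-index supporting exact-match counting (hypothetical helper class)."""

    def __init__(self, text):
        if not text.endswith("$"):
            text += "$"
        self.sa, self.bwt = bwt_from_suffix_array(text)
        # C[c]: number of characters in the text strictly smaller than c.
        counts = Counter(text)
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # occ[c][i]: number of occurrences of c in bwt[:i] (uncompressed table).
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if c == ch else 0)

    def count(self, pattern):
        """Count occurrences of pattern by extending the match right to left."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("GATTACAGATTACA")
print(index.count("ATTA"))
```

Each step of the backward search narrows a suffix-array interval using only the `C` and `occ` tables, which is why the text itself never needs to be scanned. Real implementations sample or compress `occ` to save memory.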

fulltext

[Paper] GAGE: A Critical Evaluation of Genome Assemblies and Assembly Algorithms

Mar 1st, 2012 12:00 am | Comments

GAGE: A critical evaluation of genome assemblies and assembly algorithms

Abstract

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
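
The abstract refers to "statistics on contiguity" without naming one. A commonly reported contiguity statistic is N50, and the sketch below assumes that is a reasonable stand-in for illustration; the function name `n50` and the example lengths are made up, not taken from the paper.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L cover at least
    half of the total assembled bases. Returns 0 for an empty assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Note that N50 says nothing about whether the contigs are correct, which is in line with the abstract's third conclusion that correctness is not well correlated with contiguity.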

fulltext

[Paper] BGI Single-Cell Genome Sequencing

Mar 1st, 2012 12:00 am | Comments

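Reposted from BGI (Weibo)

[A pair of Cell papers] BGI has made a breakthrough in cancer research, publishing two papers simultaneously in the leading journal Cell. The team developed a new method for analyzing single-cell genomes and applied it to study the intratumoral genetic characteristics of essential thrombocythemia (ET) and clear cell renal cell carcinoma (ccRCC).

1. The researchers found that the kidney cancer case studied was not caused by mutations in VHL and PBRM1, the two genes most commonly mutated in this cancer, which points to the importance of personalized treatment in cancer analysis and diagnosis. They also identified several important functional genes associated with the development of this kidney tumor. Single-Cell Exome Sequencing Reveals Single-Nucleotide Mutation Characteristics of a Kidney Tumor

2. To investigate tumor evolution and genetic characteristics, the researchers performed whole-exome sequencing on 90 single cells from a typical JAK2-negative ET patient. The results showed that this ET followed a monoclonal evolution model during tumorigenesis, and identified several mutated genes associated with the onset and progression of ET. Single-Cell Exome Sequencing and Monoclonal Evolution of a JAK2-Negative Myeloproliferative Neoplasm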

News from BGI website: Cell highlights BGI studies on single-cell sequencing, leading to a new era of cancer research

A News Article on BGI in the Latest Issue of Science

Feb 2nd, 2012 12:00 am | Comments

A look at how BGI, with government support, got started, grew, and expanded globally.

http://www.sciencemag.org/content/335/6068/516.full

Changing Direction

Jan 29th, 2012 12:00 am | Comments

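The general direction of my PhD research has been tentatively settled. I will keep following sequencing technology, but will focus more on acquiring, processing, and analyzing omics data such as genomics, transcriptomics, epigenomics, proteomics, and metabolomics, and especially on how to integrate multiple omics layers and analyze the regulatory relationships among them.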

Life Technologies and Illumina Race to Announce New Sequencers

Jan 11th, 2012 12:00 am | Comments

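On January 10, Life Technologies (Ion Torrent) announced the Ion Proton™ Sequencer. On the same day, Illumina announced the HiSeq 2500. Competition in sequencing technology has reached fever pitch, with both companies aiming at the $1,000 individual genome. Ion Torrent says it will get there by the end of the year, while Illumina counters that it will do so by summer. Let's wait and see.

(As things stand, Illumina has consistently led, beating Ion Torrent on read length, throughput, and price. But Ion Torrent uses a different technology (semiconductor sequencing) in which the entire sequencing process takes place on a single chip, so upgrading the machine only requires swapping in a new chip. Ion Torrent could well come from behind.)

(I have been busy lately, so I am posting this now and will analyze it later. If you are interested, go straight to the English coverage.)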

http://www.genomeweb.com/sequencing/life-tech-plans-new-ion-torrent-instrument-deliver-whole-genomes-hours-1k

http://www.lifetechnologies.com/us/en/home/about-us/news-gallery/press-releases/2012/life-techologies-itroduces-the-bechtop-io-proto.html

http://www.genomeweb.com/sequencing/illumina-introduces-one-day-genome-sequencer-new-business-groups

http://pathogenomics.bham.ac.uk/blog/2012/01/ion-torrent-proton-the-chip-is-not-the-machine/

http://blogs.nature.com/news/2012/01/the-1000-genome-are-we-there-yet.html

http://core-genomics.blogspot.com/2012/01/agbt-previews.html

http://omicsomics.blogspot.com/2012/01/sequencing-technology-fireworks.html

Cracking WPA-Encrypted Wi-Fi Passwords with Reaver

Jan 9th, 2012 12:00 am | Comments

How to Crack a Wi-Fi Network’s WPA Password with Reaver

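Instead of a dictionary, it brute-forces the password by exploiting a WPS vulnerability, taking anywhere from 2 to 10 hours.

Theoretical research on this vulnerability has been around for a while (reference); Reaver turns it into reality.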

ENJOY!!!

[Paper] De Novo Assembly and Genotyping of Variants Using Colored De Bruijn Graphs

Jan 9th, 2012 12:00 am | Comments

De novo assembly and genotyping of variants using colored de Bruijn graphs

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
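
As a rough illustration of the "colored" idea in the abstract, the sketch below labels every k-mer with the set of samples (colors) it appears in, so k-mers private to one sample stand out as candidate variant sites. This is a toy under my own assumptions, not Cortex's data structure: the function names and example sequences are invented, reverse complements are ignored, and graph edges are left implicit in the (k-1)-base overlaps between k-mers.

```python
from collections import defaultdict


def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names (colors) that contain it.

    samples: dict of sample name -> list of read strings (hypothetical input).
    """
    colors = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                colors[read[i:i + k]].add(name)
    return colors


def private_kmers(colors, name):
    """Return the k-mers seen only in the given sample."""
    return {kmer for kmer, seen_in in colors.items() if seen_in == {name}}


# Two made-up sequences differing by a single base.
samples = {"ref": ["ACGTACGT"], "sample": ["ACGTTCGT"]}
graph = build_colored_graph(samples, k=4)
print(sorted(private_kmers(graph, "sample")))
```

A single-base difference changes up to k consecutive k-mers, which is why a run of sample-only k-mers flanked by shared ones suggests a variant bubble.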

BGI Uses GPUs to Accelerate Genomic Data Analysis

Jan 9th, 2012 12:00 am | Comments

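Genomic data is growing explosively, creating enormous computational challenges that strain even supercomputers. Solutions come from two sides: software, including algorithm optimization and data compression; and hardware, such as GPU computing and cloud computing.

A while ago I noticed that the SOAP family of software on BGI's website had all gained GPU versions, and now Wired has published a news piece reporting that BGI uses GPUs to accelerate genomic data analysis. It is written for a general audience with little jargon; take a look if you are interested.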

Chinese Crunch Human Genome With Videogame Chips

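The gist: BGI, the world's largest genome sequencing center, partnered with NVIDIA to port its genome analysis tools to NVIDIA's GPU architecture, cutting genome analysis time by an order of magnitude. Analyzing one person's genome data used to take 4 days; it now takes 6 hours (96 hours versus 6, a 16-fold speedup).

Also: a new paper in Genome Research says that sequence quality affects assembly results more than the choice of assembler does. This immediately reminded me of all sorts of papers and reports on next-generation sequencing. The development and application of next-generation sequencing look to be in full swing, but the technology is actually very immature and has many problems, and it is still hard to make breakthrough discoveries with it. That said, immaturity, problems, and a lot of attention are exactly what make it possible to do some work here and crank out a few high-impact-factor filler papers (even if they are no real scientific breakthrough).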

Supercomputing Outlook for 2012

Jan 4th, 2012 12:00 am | Comments

By Eng Lim Goh, Translated by Bioops

This is only a brief translation; please see the original if you are interested.

Ten predictions for high performance computing in 2012:

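1. Personalized medicine becomes real. Sequencing a personal genome becomes as cheap as an MRI scan. Drug design and prescriptions will be tailored to each person's genome. (Translator's note: I can't fully agree. The first sentence may come true, but the second still has a long way to go.)

2. More accurate forecasts of severe weather. Meteorology makes great strides in predicting where and when severe weather will strike.

3. Shared-memory systems will bring unexpected discoveries. Shared-memory systems will remain mainstream next year. As computing capacity grows, there are bound to be new, unforeseen discoveries ("expect to find the unexpected").

4. Rapid growth of distributed systems. Distributed systems will become the first choice for enterprises and government agencies that need to process large volumes of unstructured data.

5. Growing interest in coprocessors. A certain group of people will take an interest in coprocessors and in how to balance the performance gains they bring against the added programming difficulty. (Translator's note: GPUs, which are very popular right now, are one example.)

6. Open cloud computing services. Today the main users of cloud computing are government agencies, but next year more and more companies like Amazon, Facebook, and Microsoft will offer public cloud computing.

7. Greener data centers. Lower energy consumption, reuse of cooling water, and so on.

8. Computational fluid dynamics (CFD). Computer simulation makes it possible to study the fluid dynamics of objects as small as a bicycle and as large as a space shuttle. CFD is also used in medicine to simulate blood flow. Next year CFD will make its mark in systems biology.

9. More accurate predictions of global warming. More accurate predictions of global climate change over the next hundred years, identifying key "hot spots" in different regions of the world.

10. Some theories about the universe will be proven. Hawking's key theories on black holes and the formation of the universe will be confirmed by experiment and observation. (Translator's note: this one is a bit of a stretch.)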

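In the future, what matters more to us is using high performance computing to bring new discoveries and progress to other disciplines, rather than just using computers for analysis and report writing.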
