Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

Next-generation Transcriptome Assembly

| Comments

Nature Review Genetics doi:10.1038/nrg3068

一篇入门文章,值得一看,包括引用的一些review

Abstract

Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches — reference-based, de novo and combined strategies — along with some perspectives on transcriptome assembly in the near future.

perl脚本提取BLAST结果中的信息【以tblastn为例】

| Comments

用blast中 -m 8参数可以得到类似的信息,但满足不了所有的需求。用bioperl可以得到任意想要的信息,但我用的cluster上没有bioperl,只好自己写了一个。只是用了一些正则表达式,不适用于其他blast(比如blastn、blastp),但只需要稍微改一下就可以了,暂时将就着用吧。欢迎留言讨论。

parse_blast.pl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
#!/usr/bin/perl
# Parse alignments from tBLASTn output file
use strict;
use warnings;
print "query\tquery_length\thit\tscore\texpect\tidentity\tquery_strand\tsubject_strand\tcoverage_query\tcoverage_subject\tquery_start\tquery_end\tsubject_start\tsubject_end\n";
my $input=$ARGV[0];
my ($eachline,$query_name,$query_length,$hit_name,$hsp_num);
my ($query_start_temp,$query_end_temp,$subject_start_temp,$subject_end_temp);
my ($query_start,$query_end,$subject_start,$subject_end,$query_seq,$subject_seq,$query_seq_temp,$subject_seq_temp);
my ($score,$expect,$coverage_query,$coverage_subject,$identity_query,$identity_subject,$identity,$query_strand,$subject_strand);
$hsp_num=0;
open(IN,"<$input") or die ("no file $input !");
while (defined($eachline=<IN>)){
  $eachline=~s/[\r\n]//g;
  if($eachline=~/^Query= /){
    ($query_name)=($eachline=~/^Query= (\S+)/);
    $eachline=<IN>;
    ($query_length)=($eachline=~/\((\S+) letters\)/);
    $query_length=~s/,//g;
  }
  if($eachline=~/^>/){
    ($hit_name)=($eachline=~/^>(\S+)/);
  }
  if($eachline=~/^ Score/){
    ($score,$expect)=($eachline =~ /Score = (.*) bits.*Expect.* = (\S+),/);
    $hsp_num++;
    if($hsp_num>1){
      $coverage_query=$query_end-$query_start+1;
      $coverage_subject=$subject_end-$subject_start+1;
      print "$coverage_query\t$coverage_subject\t$query_start\t$query_end\t$subject_start\t$subject_end\t$query_seq\t$subject_seq\n";
    }
    $query_start=1000000000;$query_end=0;
    $subject_start=1000000000;$subject_end=0;
    $query_seq='';$subject_seq='';
    print "$query_name\t$query_length\t$hit_name\t";
    print "$score\t$expect\t";
  }
  if($eachline=~/^ Identities/){
    ($identity_query,$identity_subject,$identity)=($eachline =~ /Identities = (\d+)\/(\d+) \((\d+)%\)/);
    print "$identity\t";
  }
  if($eachline=~/^ Frame = /){
    if($eachline =~ / Frame = -/){$subject_strand='Minus';}elsif($eachline =~ / Frame = +/){$subject_strand='Plus';}
    $query_strand='Plus';
    print "$query_strand\t$subject_strand\t";
  }
  if($eachline=~/^Query: (\d+)/){
    if($query_strand eq 'Plus'){($query_start_temp,$query_seq_temp,$query_end_temp)=($eachline =~ /(\d+)(\D+)(\d+)/);}
    if($query_strand eq 'Minus'){($query_end_temp,$query_seq_temp,$query_start_temp)=($eachline =~ /(\d+)(\D+)(\d+)/);}
    $query_seq_temp=~s/[ -]//g;$query_seq.=$query_seq_temp;
    if($query_start>$query_start_temp){$query_start=$query_start_temp;}
    if($query_end<$query_end_temp){$query_end=$query_end_temp;}
  }
  if($eachline=~/^Sbjct: (\d+)/){
    if($subject_strand eq 'Plus'){($subject_start_temp,$subject_seq_temp,$subject_end_temp)=($eachline =~ /(\d+)(\D+)(\d+)/);}
    if($subject_strand eq 'Minus'){($subject_end_temp,$subject_seq_temp,$subject_start_temp)=($eachline =~ /(\d+)(\D+)(\d+)/);}
    $subject_seq_temp=~s/[ -]//g;$subject_seq.=$subject_seq_temp;
    if($subject_start>$subject_start_temp){$subject_start=$subject_start_temp;}
    if($subject_end<$subject_end_temp){$subject_end=$subject_end_temp;}
  }
  if($eachline=~/^Matrix: /){
    $coverage_query=$query_end-$query_start+1;
    $coverage_subject=$subject_end-$subject_start+1;
    print "$coverage_query\t$coverage_subject\t$query_start\t$query_end\t$subject_start\t$subject_end\t$query_seq\t$subject_seq\n";
  }
}
close IN;

做生物信息常用到的linux命令

| Comments

1,统计一个序列文件中的序列个数(grep用好了可以非常快捷方便地处理一些数据)

grep -c '>' seqfile
2, 查看大文件头几行或最后几行
head seqfile
tail seqfile
3,文件行数
wc -l seqfile
4,矩阵格式的文件,提取其中的某几列(例如blast -m 8
 cut -f 1,2,11 seq.cblast > seq.abc
5,awk和sed

先学的perl,后知道awk和sed,认识到很多事情用awk和sed解决比写perl脚本方便多了

比如fastq转换fasta文件:

awk ‘NR % 4 == 1 || NR % 4 == 2′ myfile.fastq | sed -e ‘s/@/>/’ > myfile.fasta
6,screen管理远程任务,可以在远程会话断开后继续在后台运行,详见此文

7,vi/vim就不必细说了,编程必备。(emacs党自动替换成emacs)

8,暂时就这么多了。好长时间没做过东西了。等用到或者想到了再加。

聪明是天生的么?

| Comments

Nature旗下Molecular Psychiatry的一篇文章

用基因组关联研究(GWAS)的方法,分析了3511个人的549692个SNP与智力的关联程度。结论:40%的crystallized-type intelligence差异和51%的fluid-type intelligence差异可以认为是受遗传因素影响。(crystallized-type intelligence:所具备的知识,和记忆力有关;fluid-type intelligence:解决问题、逻辑思维的能力。 具体解释见wikipedia

有兴趣的可读原文或相关评论文章(生物谷genomewebboston.com

 

Structural Variation Detection by Whole Genome De Novo Assembly

| Comments

Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

Here we use whole-genome de novo assembly of second-generation sequencing reads to map structural variation (SV) in an Asian genome and an African genome. Our approach identifies small- and intermediate-size homozygous variants (1–50 kb) including insertions, deletions, inversions and their precise breakpoints, and in contrast to other methods, can resolve complex rearrangements. In total, we identified 277,243 SVs ranging in length from 1–23 kb. Validation using computational and experimental methods suggests that we achieve overall <6% false-positive rate and <10% false-negative rate in genomic regions that can be assembled, which outperforms other methods. Analysis of the SVs in the genomes of 106 individuals sequenced as part of the 1000 Genomes Project suggests that SVs account for a greater fraction of the diversity between individuals than do single-nucleotide polymorphisms (SNPs). These findings demonstrate that whole-genome de novo assembly is a feasible approach to deriving more comprehensive maps of genetic variation.

BGI的文章。BGI,也就是华大,有钱的主,拿钱砸出来好多高IF的文章,此为其一。

通过whole genome de novo assembly来确定Structural variation(SV)。

用拼接好的scaffolds和human reference genome来比对。BGI利用BLATLASTZ,加上一些自己开发的脚本(scripts)建立了一个SV detection的流程(pipeline),叫做SOAPsv。这篇文章等于是这个pipeline的首次实践。

验证(validation)是一个必要的程序,不像千人基因组 (paper)用实验方法(wet),他们同时用了实验(wet)和电脑计算(dry)两种方法。关于computational方法,是用BWA或者SOAPaligner来把原始的reads align到reference genome上。为什么不能用这个align的结果来做SV detection呢?其他的软件基本都是这样做的。为了“创新”,另辟蹊径,用de novo assembly,但同时还需要align to reference,是不是有点走弯路?。另外一个疑问:如果align的结果和de novo assembly的序列不一样呢?

也和另外两个软件做了比较,BreakDancer (paper)和Pindel (paper),“据说”是比这两个软件的精确性要高。(“similar sensitivity but improved precision”)但是,为什么不和其他软件比较呢?

最后要提另一篇重量级的文章:Mapping copy number variation by population-scale genome sequencing,使用了几乎所有的SV detection的软件,并实验验证,还探讨了一些SV发生机制。是一篇集大成的文章,需要仔细研究,有空写篇博客介绍一下。

Comparison of De Novo Assembly Tools for Next-generation Sequencing

| Comments

Comparative Studies of de novo Assembly Tools for Next-generation Sequencing Technologies

Yong Lin, Jian Li, Hui Shen, Lei Zhang, Christopher J Papasian and Hong-Wen Deng

Motivation: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to produce the best performance under a specific set of conditions.

Results: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence cover-age, and assembly accuracy. Various properties of read data, including single-end/paired-end, sequence GC content, depth of coverage and base calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative perform-ance of individual tools are summarized and tentative guidelines for optimal selection of different assembly tools, under different condi-tions, are provided.

拿流行的几个de novo assembly tools for next-generation sequencing做了系统的比较,是一篇很不错的文章。曾经有多少人徘徊于各种assembly软件中,不知道选择哪一种,这篇文章提供了很好的帮助。

作者比较了SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo,(竟然没有ALLPATHS-LG!)。

先用大到人类(其实最多只用了100Mb序列片段)小到E. coli的不同GC含量的基因组,模拟出single-end和paired-end、以及不同测序深度、不同测序片段长度的短序列片段,然后使用各个软件拼接,缺省设置,在各有2.40GB双核+12GB的8个nodes上运行,然后比较拼接结果:N50、基因组覆盖率(sequence coverage)、拼接错误(assembly error)和base  calling error rate (BCER)等,还有软件运行时的硬件消耗。

最终的结论:(结论比较复杂,不同条件和要求下,各个软件表现有些差异。总的来看ABySS和SOAPdenovo比较“万金油”。

再看非常重要的硬件消耗(还是SOAPdenovo和abyss给力啊,看来我眼光不错,我就是用的这两个,还有ALLPATHS-LG!

本人看法:SOAPdenovo能拼接scaffolds,能产生更长的N50,而ABySS只能拼接到contigs但ABySS基于MPI并行,对硬件需求低。ALLPATHS-LG和SOAPdenovo相似。

Hello 点滴记录

| Comments

今天发现有人转载我的文章,(虽然我的大部分也是转载别人的),而且有人把我的网址收藏成“有用网站”。哈哈!

为答谢网友,分享一个关于RNA-Seq的高水平blog,http://rna-seqblog.com/,还有一个关于人类基因组测序的blog,http://www.genomesunzipped.org/

今后要向两位博主看齐,多整点专业的、有深度的文章。(时间啊,就像那个XX,挤一下总会有的!)

今天又被自己鄙视了一下,眼高手低啊。被人问到专业而具体的问题,一问三不知。罪过!还敢号称是做“技术”的!打算拿已发表的Next-generation sequencing data,把这些主要技术流程亲自下手过一遍,de novo assembly、reference assembly、RNA-Seq map、transcriptome assembly、SNP discovery、structural variation/CNV detection等等。

还有一件事要记录一下。有人请我审一篇关于identification of SNPs without the reference genome的文章。惭愧!我自己还没发过一篇一作文章。老板在后面撑腰,说,审吧,你审完了我给把把关。(老板原话:“you can review.  if you like, i can look at your review before you send it in.”)

Perl多线程

| Comments

先转一下多线程的概念:

线程是一个单一的执行流程,它是所有程序执行过程中最小的控制单位,即能被 CPU 所调度的最小任务单元。线程与进程之间既有联系,又完全不同。简单地说,一个线程必然属于某一个进程,而一个进程包含至少一个或者多个线程。早期的计算机系统一次只能运行一个程序,因此,当有多个程序需要执行的时候,唯一的办法就是让它们排成队,按顺序串行执行。进程的出现打破了这种格局,CPU 资源按时间片被分割开来,分配给不同的进程使用。这样一来,从微观上看进程的执行虽然仍是串行的,但是从宏观上看,不同的程序已经是在并行执行了。如果我们把同样的思想运用到进程上,很自然地就会把进程再细分成更小的执行单位,即线程。由于一个进程又往往需要同时执行多个类似的任务,因此这些被细分的线程之间可以共享相同的代码段,数据段和文件句柄等资源。有了进程,我们可以在一台单 CPU 计算机系统上同时运行 Firefox 和 Microsoft Office Word 等多个程序;有了线程,我们可以使 Firefox 在不同的标签里同时加载多个不同的页面,在 Office Word 里编辑文档的同时进行语法错误检查。因此,线程给我们带来了更高的 CPU 利用率、更快速的程序响应、更经济地资源使用方式和对多 CPU 的体系结构更良好的适应性。关于多线程的详细讲解,可参看:perl 线程模型讲解(http://it.chinawin.net/softwaredev/article-124a1.html

————————————————————–

perl中的多线程模块

5.8以后的版本的多线程模块可参看perldoc(http://perldoc.perl.org/threads.html

—————————————————————

perl的多线程实例:

涉及语言:Perl
所用模块:threads
模块中的方法: threads->create(),
创建一个新线程;threads->join(),
收割已经创建的线程;threads->list(threads::all),
返回所有已经创建的线程;threads->is_joinable(),
返回目标线程是否已经完成,等待join;
其他的在perldoc上了。

——————————————————–

multithread.pl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
use threads;           #声明模块
use warnings;use strict;
print localtime(time),"\n";  #输出系统时间;
my $j=0;
my $thread;
while()
{
last if($j>=10)#这里控制一下任务数量,共10个;
#控制创建的线程数,这里是5,scalar函数返回列表threads->list()元素的个数;
while(scalar(threads->list())<5)  {    $j++; #创建一个线程,这个线程其实就是调用(引用)函数“ss”; #函数‘ss’包含两个参数($j和$j);       threads->new(\&ss,$j,$j);
}
foreach $thread(threads->list(threads::all))
{    if($thread->is_joinable())      #判断线程是否运行完成;
{    $thread->join();
#输出中间结果;
print scalar(threads->list()),"\t$j\t",localtime(time),"\n";
}
}
}
#join掉剩下的线程(因为在while中当j=10时,还有4个线程正在运行,但是此时程序将退出while循,所以在这里需要额外程序join掉剩下的4个线程)
foreach $thread(threads->list(threads::all))
{    $thread->join();
     print scalar(threads->list()),"\t$j\t",localtime(time),"\n";
}
#输出程序结束的时间,和程序开始运行时间比较,看程序运行性能;
print localtime(time),"\n";
#下面就是每个线程引用的函数;
sub ss()
{   my ($t,$s)=@_;
sleep($t);       #sleep函数,睡觉;以秒为单位;
print "$s\t";
}

—————————————————

结果:

第一列表示程序已经完成的任务数,第二列表示正在运行的线程数-1(join掉一个了),第三列表示在收割掉一个线程后新添加的任务,最后一列表示完成一个线程时的系统时间。

————————————————————

多线程运行性能

如果单独运行这10个任务,所需要的时间为:1+2+3+4++10=55s;
采用多线程运行(5个)的话,需要的时间为:54-39=16s;

————————————————————-

运行过程

简要描述一下程序运行过程,以便更深入理解多线程的概念。
程序共要运行10个任务,第一个任务的作用是暂停程序1s(sleep(1));第二个任务是暂停程序2s(sleep(2));以此类推,第十个任务是暂停程序10s;
时间(s) 任务
0 1,2,3,4,5(程序初始,5个线程同时运行,需要时间最长的是线程5(5s))
1 2,3,4,5,6(经过1s后,第一个任务已经完成,被join掉,同时添加新任务6)
2 3,4,5,6,7(同上)
3 4,5,6,7,8
4 5,6,7,8,9
5 6,7,8,9,10
7-end join所有剩下的线程(所有任务都已经添加,程序中while循环退出)

方法$thread->is_joinable()的作用

前面已经说了,这个方法是用来判断线程是否已经运行完成,处于等待join的状态。当需要处理多个任务,但这些任务完成需要的时间又不一样时,这个方法就显得特别重要。

还是以上面的程序为例。程序初始运行时创建5个线程。第一个线程所需时间最短,为1s。第五个线程所需时间最长5s。如果不适用$thread->is_joinable()而直接join这五个线程的话,如下:

foreach $thread(threads->list(threads::all))

{   $thread->join();

}

结果是:主程序处于等待状态。在1s后,第一个线程被join,主程序依然处于等待,2s后第二个线程被join,主程序等待。知道5s后第五个线程被join,主程序通畅,重新创建下一组线程(5个)。显然这个过程不能最大话利用CPU的资源。当第一个线程被join后,虽然程序中只有4个线程在运行,但是由于主程序处于等待状态,新的线程不会被创建。

最佳的方法就是判断线程是否可以被join。如上面的程序所写的。这样可以保证程序运行过程中始终是5个线程,最大化的利用CPU资源。

——————————————————-

实例

说了这么多,多线程在生物信息中到底可以怎么来运用,下面给一个简单的实例。从KEGG数据库(http://www.genome.jp/kegg/)上搜索同源序列。

所需文件:seqname.txt(用于存放需要搜索的序列KEGG名称);

源码:

keggmultithread.pl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
use strict;use warnings;use threads;
use SOAP::Lite;
use Cwd;
my $path=getcwd;
my $wsdl = 'http://soap.genome.jp/KEGG.wsdl';
my $serv = SOAP::Lite->service($wsdl);
open(F,"K00006.txt");
my @names=;
chomp @names;
close(F);
my $i=0;
my $thread;
print localtime(time);
while($i<scalar(@names))  {    while(scalar(threads->list())<10)  {    threads->new(\&orgfile,$names[$i]);
$i++;
}
foreach $thread(threads->list(threads::all))
{    if($thread->is_joinable())
{    $thread->join();
}
}
}
foreach $thread(threads->list(threads::all))
{    $thread->join();
}
print localtime(time);
sub orgfile
{    my($seq)=@_;
my $offset = 1;
my $limit = 100;
my $top5 = $serv->get_best_neighbors_by_gene($seq, $offset,$limit);
$seq=~s/://;
open(F,">$seq.txt");
foreach my $hit (@{$top5})
{    print F "$hit->{genes_id1}\t$hit->{genes_id2}\t$hit->{sw_score}\n";
}
close(F);
print "$seq\n";
}

————————————————

[Paper]A Framework for Variation Discovery and Genotyping Using Next-generation DNA Sequencing Data

| Comments

http://www.nature.com/ng/journal/v43/n5/full/ng.806.html

Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (~4×) 1000 Genomes Project datasets.