Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

Get Intron Sizes From Gff3 Files Using Perl


Normally there is no ‘intron’ feature in gff3 files, but intron sizes can be obtained by calculating the gaps between consecutive CDS regions. Here is a simple Perl script that gets intron sizes from gff3 files. It prints the size of each intron, one transcript (mRNA) per line; a transcript with a single CDS is reported as 0. You can customize the script for your own gff3 files or needs.

PS: I sort the exon (CDS) positions before calculating the intron sizes to avoid +/- strand issues.
get_intron_size.pl
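#!/usr/bin/perl
use strict;
use warnings;

# usage: perl get_intron_size.pl gff3_file >output

my $input = $ARGV[0] or die "usage: perl get_intron_size.pl gff3_file >output\n";
my @exons;      # CDS start/end coordinates of the current mRNA
my $first = 0;  # number of mRNA lines seen so far

open(my $in, '<', $input) or die "cannot open $input: $!\n";
while (defined(my $eachline = <$in>)) {
  if ($eachline =~ /\tmRNA\t/) {
    # a new mRNA starts: report the previous one, then reset
    $first++;
    if ($first != 1) {
      print_intron(@exons);
      @exons = ();
    }
  } elsif ($eachline =~ /\tCDS\t/) {
    # columns 4 and 5 of a gff3 line are the start and end positions
    my @fields = split(/\t/, $eachline);
    push(@exons, $fields[3], $fields[4]);
  }
}
close($in);
print_intron(@exons);  # report the last mRNA

# print the intron sizes of one mRNA, tab-separated, on one line
sub print_intron {
  my (@exons) = @_;
  if (scalar(@exons) > 2) {
    # sorting all coordinates gives ascending CDS order on either strand
    my @ordered_exons = sort { $a <=> $b } @exons;
    for (my $i = 1; $i <= scalar(@ordered_exons) - 3; $i = $i + 2) {
      my $each_intron_size = $ordered_exons[$i+1] - $ordered_exons[$i] - 1;
      print "$each_intron_size\t";
    }
  } else {
    print "0";  # a single CDS (or none) means no intron
  }
  print "\n";
}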
An example gff3 file from the soybean genome annotation:
##gff-version 3
Gm01    phytozome8_0    gene    51481   61502   .       -       .       ID=Glyma01g00270;Name=Glyma01g00270
Gm01    phytozome8_0    mRNA    51481   61502   .       -       .       ID=PAC:16242891;Name=Glyma01g00270.1;pacid=16242891;longest=1;Parent=Glyma01g00270
Gm01    phytozome8_0    CDS     61437   61502   .       -       0       ID=PAC:16242891.CDS.1;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     61167   61305   .       -       0       ID=PAC:16242891.CDS.2;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     60722   60780   .       -       2       ID=PAC:16242891.CDS.3;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     60339   60533   .       -       0       ID=PAC:16242891.CDS.4;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     59699   59814   .       -       0       ID=PAC:16242891.CDS.5;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     59420   59547   .       -       1       ID=PAC:16242891.CDS.6;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     59176   59284   .       -       2       ID=PAC:16242891.CDS.7;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     56878   56879   .       -       1       ID=PAC:16242891.CDS.8;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     55013   55069   .       -       2       ID=PAC:16242891.CDS.9;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     54390   54500   .       -       2       ID=PAC:16242891.CDS.10;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     54218   54320   .       -       2       ID=PAC:16242891.CDS.11;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     53902   53966   .       -       1       ID=PAC:16242891.CDS.12;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     52608   52639   .       -       2       ID=PAC:16242891.CDS.13;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     52517   52571   .       -       0       ID=PAC:16242891.CDS.14;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     52362   52465   .       -       2       ID=PAC:16242891.CDS.15;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     51828   51925   .       -       0       ID=PAC:16242891.CDS.16;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     51481   51625   .       -       1       ID=PAC:16242891.CDS.17;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    gene    90693   95580   .       -       .       ID=Glyma01g00300;Name=Glyma01g00300
Gm01    phytozome8_0    mRNA    90693   94401   .       -       .       ID=PAC:16242894;Name=Glyma01g00300.1;pacid=16242894;longest=1;Parent=Glyma01g00300
Gm01    phytozome8_0    CDS     92970   94401   .       -       0       ID=PAC:16242894.CDS.1;Parent=PAC:16242894;pacid=16242894
Gm01    phytozome8_0    CDS     92083   92084   .       -       2       ID=PAC:16242894.CDS.2;Parent=PAC:16242894;pacid=16242894
Gm01    phytozome8_0    CDS     90693   90860   .       -       0       ID=PAC:16242894.CDS.3;Parent=PAC:16242894;pacid=16242894
Gm01    phytozome8_0    mRNA    94388   95580   .       -       .       ID=PAC:16242895;Name=Glyma01g00300.2;pacid=16242895;longest=0;Parent=Glyma01g00300
Gm01    phytozome8_0    CDS     95467   95469   .       -       0       ID=PAC:16242895.CDS.1;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    five_prime_UTR  95470   95580   .       -       .       ID=PAC:16242895.five_prime_UTR.1;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    CDS     95277   95356   .       -       0       ID=PAC:16242895.CDS.2;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    CDS     94388   94475   .       -       1       ID=PAC:16242895.CDS.3;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    gene    116300  127990  .       +       .       ID=Glyma01g00320;Name=Glyma01g00320
Gm01    phytozome8_0    mRNA    116300  127990  .       +       .       ID=PAC:16242897;Name=Glyma01g00320.1;pacid=16242897;longest=1;Parent=Glyma01g00320
Gm01    phytozome8_0    five_prime_UTR  116300  116467  .       +       .       ID=PAC:16242897.five_prime_UTR.1;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     116468  117077  .       +       0       ID=PAC:16242897.CDS.1;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     117156  118627  .       +       2       ID=PAC:16242897.CDS.2;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     125732  125982  .       +       0       ID=PAC:16242897.CDS.3;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     127537  127567  .       +       1       ID=PAC:16242897.CDS.4;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    three_prime_UTR 127568  127990  .       +       .       ID=PAC:16242897.three_prime_UTR.1;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    mRNA    116300  127990  .       +       .       ID=PAC:16242898;Name=Glyma01g00320.2;pacid=16242898;longest=0;Parent=Glyma01g00320
Gm01    phytozome8_0    five_prime_UTR  116300  116467  .       +       .       ID=PAC:16242898.five_prime_UTR.1;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    CDS     116468  117077  .       +       0       ID=PAC:16242898.CDS.1;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    CDS     117156  118627  .       +       2       ID=PAC:16242898.CDS.2;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    CDS     127537  127707  .       +       0       ID=PAC:16242898.CDS.3;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    three_prime_UTR 127708  127990  .       +       .       ID=PAC:16242898.three_prime_UTR.1;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    mRNA    127557  127707  .       +       .       ID=PAC:16242899;Name=Glyma01g00320.4;pacid=16242899;longest=0;Parent=Glyma01g00320
Gm01    phytozome8_0    CDS     127557  127707  .       +       1       ID=PAC:16242899.CDS.1;Parent=PAC:16242899;pacid=16242899
Gm01    phytozome8_0    gene    170877  193446  .       +       .       ID=Glyma01g00380;Name=Glyma01g00380
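
As a worked check of the arithmetic, here is a minimal sketch that computes the intron sizes for one transcript of the example (PAC:16242894). The helper `intron_sizes` is hypothetical and not part of get_intron_size.pl: it keeps each CDS as a start/end pair and sorts by start, rather than sorting a flat list of coordinates, but for non-overlapping CDS it should give the same sizes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of get_intron_size.pl): takes a list of
# [start, end] pairs and returns the gap sizes between consecutive CDS.
sub intron_sizes {
    my @cds = sort { $a->[0] <=> $b->[0] } @_;   # ascending by start, either strand
    my @sizes;
    for my $i (1 .. $#cds) {
        # bases strictly between the end of one CDS and the start of the next
        push @sizes, $cds[$i][0] - $cds[$i - 1][1] - 1;
    }
    return @sizes;
}

# CDS coordinates of PAC:16242894 from the example above (minus strand)
my @sizes = intron_sizes([92970, 94401], [92083, 92084], [90693, 90860]);
print join("\t", @sizes), "\n";   # prints 1222 and 885, tab-separated
```

The two gaps are 92083 - 90860 - 1 = 1222 and 92970 - 92084 - 1 = 885. A transcript with a single CDS yields an empty list here, where the script prints 0.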

Painful Experience on Building Local UCSC Genome Browser


I wanted to install a UCSC genome browser on my Ubuntu workstation to integrate and display genomes, mostly plant genomes, with related annotations and NGS data, e.g. RNA-seq, methylation, SNPs, synteny, and genome alignments. There are other genome browsers, but UCSC is my favorite and meets my needs.

I spent a couple of workdays building from source and adding custom genomes. The browser is built on MySQL and Apache, which I don’t know much about, so it’s not surprising that I ran into tons of problems. Thanks to Google and to the kind people who posted their experiences. I’m sorry that I can’t write things up the way they did, but I would at least like to share some resources and tips.

1) Read the README files in the sources, and follow the instructions. (You’d better understand them.)

2) The wiki site is very helpful.

3) Other people’s experiences. (1 and 2)

4) If the top bar is not showing, add the following code to the apache2 conf file (/etc/apache2/apache2.conf on my machine).

5) MySQL fails to load local files in the hgLoad* scripts. I tried different solutions, but only one worked for me: when the error appears, I log into mysql and run the loading command by hand.

6) Stop mysql before rsyncing mysql files from the UCSC server.

7) Most mysql problems were caused by permission settings.

On So-called Experts and Professors, and Innovation in Research


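(Note: I posted this on Qzone a few days ago as a series of comments, writing each sentence as it came to me. This is only a repost, kept as it was, just without the timestamps.)

As a possible next-generation so-called expert and professor, I cannot avoid thinking about what innovation is and what research is, a question that has long been chewed to a pulp; only new wine in old bottles can con one's way to fame and fortune.

......

See the comments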

Next-generation Sequencing Quality Control Tools


FASTX-Toolkit: The FASTX-Toolkit is a collection of command line tools for preprocessing short-read FASTA/FASTQ files, including:

  • FASTQ-to-FASTA converter
    Convert FASTQ files to FASTA files.
  • FASTQ Information
    Chart Quality Statistics and Nucleotide Distribution
  • FASTQ/A Collapser
    Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)
  • FASTQ/A Trimmer
    Shortening reads in a FASTQ or FASTA file (removing barcodes or noise).
  • FASTQ/A Renamer
    Renames the sequence identifiers in FASTQ/A file.
  • FASTQ/A Clipper
    Removing sequencing adapters / linkers
  • FASTQ/A Reverse-Complement
    Producing the Reverse-complement of each sequence in a FASTQ/FASTA file.
  • FASTQ/A Barcode splitter
    Splitting a FASTQ/FASTA file containing multiple samples
  • FASTA Formatter
    Changes the width of sequence lines in a FASTA file
  • FASTA Nucleotide Changer
    Converts FASTA sequences from/to RNA/DNA
  • FASTQ Quality Filter
    Filters sequences based on quality
  • FASTQ Quality Trimmer
    Trims (cuts) sequences based on quality
  • FASTQ Masker
    Masks nucleotides with ‘N’ (or other character) based on quality
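
To make the first item in the list concrete, here is a minimal sketch of what a FASTQ-to-FASTA conversion does. This is not FASTX-Toolkit code: `fastq_to_fasta` is a hypothetical helper, and it assumes every record is exactly four lines (header, sequence, `+`, quality), which real FASTQ files do not always guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper, not FASTX-Toolkit code: convert FASTQ text to FASTA,
# assuming each record is exactly four lines (header, sequence, '+', quality).
sub fastq_to_fasta {
    my ($fastq) = @_;
    my @lines = split /\n/, $fastq;
    my $fasta = '';
    while (@lines >= 4) {
        # take one record; the '+' and quality lines are dropped
        my ($header, $seq) = splice(@lines, 0, 4);
        $header =~ s/^@/>/;   # FASTQ '@id' becomes FASTA '>id'
        $fasta .= "$header\n$seq\n";
    }
    return $fasta;
}

print fastq_to_fasta("\@read1\nACGT\n+\nIIII\n");   # prints ">read1" then "ACGT"
```

For real data the toolkit itself is the better choice, since it also handles quality encodings and malformed input.
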
Galaxy NGS QC and manipulation tools (citation): Galaxy provides a tool suite that functions on all of the commonly known FASTQ format variants and provides a pipeline for manipulating next generation sequencing data taken from a sequencing machine all the way through the quality filtering steps.

NGSQC (citation): The NGSQC pipeline provides a set of novel quality control measures for quickly detecting a wide variety of quality issues in deep sequencing data derived from two dimensional surfaces, regardless of the assay technology used. It also enables researchers to determine whether sequencing data related to their most interesting biological discoveries are caused by sequencing quality issues. NGSQC can help to ensure that biological conclusions, in particular those based on relatively rare sequences, are not caused by low quality sequencing.

NGS QC Toolkit (citation): A toolkit for the quality control (QC) of next generation sequencing (NGS) data. The toolkit comprises user-friendly stand-alone tools for quality control of sequence data generated on Illumina and Roche 454 platforms, with detailed results in the form of tables and graphs, and for filtering high-quality sequence data. It also includes a few other tools that are helpful in NGS data quality control and analysis.

PRINSEQ (citation): PRINSEQ can be used to filter, reformat, or trim your genomic and metagenomic sequence data. It generates summary statistics of your sequences in graphical and tabular format. It is easily configurable and provides a user-friendly interface.

SolexaQA (citation): SolexaQA is a Perl-based software package that calculates quality statistics and creates visual representations of data quality from FASTQ files generated by Illumina second-generation sequencing technology (“Solexa”).

Transferring to GitHub


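Inspired by "Blogging Like a Hacker" and with technical guidance from "Blogging Like a Hacker: Getting Started with Jekyll", I plan to move bioops to GitHub. I don't know much about GitHub or Jekyll. After fiddling for a long while, I basically followed the documentation on jekyllbootstrap. The theme is mark-reid; code highlighting uses google-code-prettify; I started writing posts with the Markdown syntax guide as a reference. I don't plan to migrate the whole site for now, just to practice first.

Update: GitHub + Jekyll is too nerdy. For now I am keeping both sites, but the GitHub one will only carry original posts. You are welcome to visit (RSS).

Most of the time went into wrestling with the code highlighting, with many detours. Here is the simple, workable method I ended up with:

  1. In /_includes/themes/mark-reid/default.html, add before </head>
  2. Add onload="prettyPrint()" to the <body> tag.
  3. Wrap code in <pre class="prettyprint">...</pre> to mark it for highlighting.
  4. This does not seem to work for JavaScript code and always fails, so I use a gist instead (as shown above).
  5. Also, when migrating the WordPress posts following the Jekyll wiki, installing mysqlplus fails with an error: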

    ERROR:  Error installing mysqlplus:
        ERROR: Failed to build gem native extension.
    
            /usr/bin/ruby1.8 extconf.rb
    checking for mysql_query() in -lmysqlclient... no
    checking for main() in -lm... yes
    checking for mysql_query() in -lmysqlclient... no
    checking for main() in -lz... yes
    checking for mysql_query() in -lmysqlclient... no
    checking for main() in -lsocket... no
    checking for mysql_query() in -lmysqlclient... no
    checking for main() in -lnsl... yes
    checking for mysql_query() in -lmysqlclient... no
    checking for main() in -lmygcc... no
    checking for mysql_query() in -lmysqlclient... no
    *** extconf.rb failed ***
    Could not create Makefile due to some reason, probably lack of
    necessary libraries and/or headers.  Check the mkmf.log file for more
    details.  You may need configuration options.
    
    Provided configuration options:
        --with-opt-dir
        --without-opt-dir
        --with-opt-include
        --without-opt-include=${opt-dir}/include
        --with-opt-lib
        --without-opt-lib=${opt-dir}/lib
        --with-make-prog
        --without-make-prog
        --srcdir=.
        --curdir
        --ruby=/usr/bin/ruby1.8
        --with-mysql-config
        --without-mysql-config
        --with-mysql-dir
        --without-mysql-dir
        --with-mysql-include
        --without-mysql-include=${mysql-dir}/include
        --with-mysql-lib
        --without-mysql-lib=${mysql-dir}/lib
        --with-mysqlclientlib
        --without-mysqlclientlib
        --with-mlib
        --without-mlib
        --with-mysqlclientlib
        --without-mysqlclientlib
        --with-zlib
        --without-zlib
        --with-mysqlclientlib
        --without-mysqlclientlib
        --with-socketlib
        --without-socketlib
        --with-mysqlclientlib
        --without-mysqlclientlib
        --with-nsllib
        --without-nsllib
        --with-mysqlclientlib
        --without-mysqlclientlib
        --with-mygcclib
        --without-mygcclib
        --with-mysqlclientlib
        --without-mysqlclientlib
    
    Gem files will remain installed in /var/lib/gems/1.8/gems/mysqlplus-0.1.2 for inspection.
    Results logged to /var/lib/gems/1.8/gems/mysqlplus-0.1.2/ext/gem_make.out

    The fix, based on https://github.com/lorensr/mysqlplus/tree/ and my own tinkering:

    sudo apt-get install libmysql++-dev
    sudo gem install hpricot
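    After a successful conversion the output is HTML files, and the line endings show up as ^M (the DOS carriage return). To make the converted posts break into paragraphs correctly, first change the .html suffix to .md, then strip the ^M characters in bulk:
    rename -v 's/\.html$/.md/' *.html
    dos2unix *.md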
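
For readers without rename or dos2unix at hand, the two steps can be sketched in Perl. The helpers `md_name` and `strip_cr` are hypothetical names used only for illustration; the post itself relies on the two commands above.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; the post itself uses rename and dos2unix.

# Map a converted post's file name from .html to .md
sub md_name {
    my ($name) = @_;
    (my $new = $name) =~ s/\.html$/.md/;
    return $new;
}

# Turn DOS line endings (CR LF, shown as ^M) into Unix line endings (LF)
sub strip_cr {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}

print md_name("2012-05-06-post.html"), "\n";   # prints 2012-05-06-post.md
```

The example file name is made up. Names that do not end in .html are returned unchanged.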

    [Paper] A Whole-Cell Computational Model Predicts Phenotype From Genotype


    http://www.cell.com/abstract/S0092-8674%2812%2900776-3

    Highlights

    • An entire organism is modeled in terms of its molecular components
    • Complex phenotypes can be modeled by integrating cell processes into a single model
    • Unobserved cellular behaviors are predicted by model of M. genitalium
    • New biological processes and parameters are predicted by model of M. genitalium

    Summary

    Understanding how complex phenotypes arise from individual molecules and their interactions is a primary challenge in biology that computational approaches are poised to tackle. We report a whole-cell computational model of the life cycle of the human pathogen Mycoplasma genitalium that includes all of its molecular components and their interactions. An integrative approach to modeling that combines diverse mathematics enabled the simultaneous inclusion of fundamentally different cellular processes and experimental measurements. Our whole-cell model accounts for all annotated gene functions and was validated against a broad range of data. The model provides insights into many previously unobserved cellular behaviors, including in vivo rates of protein-DNA association and an inverse relationship between the durations of DNA replication initiation and replication. In addition, experimental analysis directed by model predictions identified previously undetected kinetic parameters and biological functions. We conclude that comprehensive whole-cell models can be used to facilitate biological discovery.

    Tools for Cross-species Biological Network Comparison


    GraphAlignment: GraphAlignment is an extension package for the R programming environment which provides functions for finding an alignment between two networks based on link and node similarity scores. This package implements the Bayesian scoring methods and the alignment algorithm introduced in the paper J. Berg and M. Lässig, “Cross-species analysis of biological networks by Bayesian alignment”, PNAS 103 (29), 10967-10972 (2006). Documentation and installation guide are available here.

    PathBLAST: PathBLAST is a network alignment and search tool for comparing protein interaction networks across species to identify protein pathways and complexes that have been conserved by evolution. PathBLAST is now available at http://www.pathblast.org/ as a web-based query. (publications)

    IsoRankN: IsoRankN (IsoRank-Nibble) is a global multiple-network alignment tool based on spectral clustering on the induced graph of pairwise alignment scores. IsoRankN outperforms existing algorithms for global network alignment in coverage and consistency on multiple alignments of the five available eukaryotic networks. Being based on spectral methods, IsoRankN is both error-tolerant and computationally efficient. More information is available here.

    NetworkBLAST & NetworkBLAST-M: NetworkBLAST analyzes protein interaction networks across species to infer protein complexes that are conserved in evolution.
    To compare multiple networks, use NetworkBLAST-M; for a NetworkBLAST executable, please refer to here.

    Revolution R Enterprise


    http://www.revolutionanalytics.com/products/enterprise-performance.php


    NEED MORE SPEED?

    Benchmarks show Revolution R Enterprise is significantly faster than open source R for many real-world computations. Linked with high-performance multi-threaded math libraries, Revolution R Enterprise automatically reduces computation times when using multi-core and multi-processor hardware.


    Take immediate advantage of:

    • Significant speed enhancements over open source R.
    • 64-Bit Scalability: Analyze larger data sets on 64-bit Windows and Linux systems.
    • Parallel Processing Power: Reduce computation time for simulations, optimizations, segmented data analysis and more.

    Revolution R 5.0 Performance Benchmarks

    The benchmarks on this page demonstrate the performance of Revolution R 6.0 compared to the base version of R-2.14.2, available from the R Project. The test system was an Intel® Xeon® Processor X3440 (2.53 GHz) with 8GB System RAM running 32-bit Windows Server 2008 R2 SP1.

    One of the differences between Revolution R products and base R is the ability to leverage multithreading and processor capabilities on all x86 platforms to increase performance. Thus, the more cores available to Revolution R, the higher your performance for many operations.

    Revolution Analytics has created these tests to simulate common real-world computations.

    Computation                     Base R 2.14.2   Revolution R (1-core)   Revolution R (4-core)   Speedup (4 core)
    Matrix Multiply                 175.4 sec       28.9 sec                9.3 sec                 17.9x
    Cholesky Factorization          25.5 sec        4.6 sec                 1.3 sec                 18.7x
    Singular Value Decomposition    58.7 sec        16.8 sec                7.2 sec                 7.1x
    Principal Components Analysis   228.4 sec       50.8 sec                19.4 sec                10.8x
    Linear Discriminant Analysis    189.1 sec       73.7 sec                52.3 sec                2.6x
    Speedup = Slower time / Faster Time - 1
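
As a quick check of how that formula reproduces the last column, here is a minimal sketch; the function name `speedup` is my own and not something from Revolution Analytics.

```perl
use strict;
use warnings;

# Speedup as defined under the table: slower time / faster time - 1
sub speedup {
    my ($slower, $faster) = @_;
    return $slower / $faster - 1;
}

# Matrix Multiply row: base R 175.4 sec, Revolution R (4-core) 9.3 sec
printf "%.1fx\n", speedup(175.4, 9.3);   # prints 17.9x
```

Because of the "- 1", a reported 17.9x means the 4-core run takes about 1/18.9 of the base R time. Recomputing from the rounded timings in the table can differ from the published figure in the last digit (the Cholesky row gives 18.6 rather than 18.7).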


    Weibo Archive 05062012-06212012


    Thu Jun 21 03:25:30 +0800 2012
    Nobel fight over African HIV centre  http://t.cn/zWw5m7q

    Thu Jun 21 03:25:05 +0800 2012
    Genomics: The breast cancer landscape  http://t.cn/zWw5nuU

    Thu Jun 21 03:24:06 +0800 2012
    The landscape of cancer genes and mutational processes in breast cancer  http://t.cn/zWw5n65

    Thu Jun 21 03:22:23 +0800 2012
    The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups  http://t.cn/zWw5ECT

    Thu Jun 21 03:22:12 +0800 2012
    Sequence analysis of mutations and translocations across breast cancer subtypes  http://t.cn/zWw5EKy

    Wed Jun 20 08:17:12 +0800 2012
    An Update on Using Pacific Bio Sequences for Genome Assembly  Homologus  http://t.cn/zWZHLvc

    Wed Jun 20 04:51:59 +0800 2012
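    Is France strolling around on purpose because it wants to play Spain? France versus Spain again... Just mentioning it makes me tremble: the penalty in 2000, the comeback defeat in 2006. Apart from the unjust exit against South Korea in 2002, those two matches are the most painful experiences of my football-watching life, bar none... I don't want to see my favorite team knocked out by the team I hate most for a third time. Will Spain deliver? Honestly, I'm not sure.

    Wed Jun 20 04:48:27 +0800 2012
    All eleven Real Madrid players made it to the knockout stage. But with France meeting Spain, the eleven cannot all appear in the semi-finals together.

    Wed Jun 20 04:40:46 +0800 2012
    All eleven Real Madrid players made it to the knockout stage. But Benzema won't go far.

    Wed Jun 20 01:40:48 +0800 2012
    Repost. Sorry, this post is not suitable for public viewing. If you need help, please contact customer service. http://t.cn/z0D6ZaQ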

    Tue Jun 19 23:02:12 +0800 2012
    @ewanbirney For genomics lovers: Biology of Genomes 2013 is going from Tue May 7th to Sat May 11th CSHL. Save the date + your best genomics abstracts :)

    Tue Jun 19 20:50:09 +0800 2012
    Cancer genomics: Indicators for drug response from sequencing  http://t.cn/zWZoWaU

    Tue Jun 19 08:05:15 +0800 2012
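    complete list http://t.cn/zWzQ2EJ, The international supercomputing organization has released the latest TOP500 list of the world's supercomputers. IBM's newly developed supercomputer "Sequoia" took the title of world's fastest supercomputer for the United States, with a sustained speed of 16,324 trillion operations per second and a peak speed of 20,132 trillion operations per second, leaving other computers far behind. China's supercomputer ranks fifth. http://t.cn/zWzDcuU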

    Tue Jun 19 06:55:41 +0800 2012
    Using Galaxy to Perform Large‐Scale Interactive Data Analyses - Current Protocols  http://t.cn/zWZv4bD

    Tue Jun 19 06:55:38 +0800 2012
    Using Cloud Computing Infrastructure with CloudBioLinux, CloudMan, and Galaxy - Current Protocols  http://t.cn/zWZv4LY

    Fri Jun 15 02:50:02 +0800 2012
    Torres!

    Thu Jun 14 23:02:21 +0800 2012
    An example: Contrail - A de Bruijn Genome Assembler that uses Hadoop http://t.cn/zWhaImo, NOSQL Databases for Bioinformatics  http://t.cn/zWhaZtR

    Thu Jun 14 22:55:39 +0800 2012
    NOSQL Databases for Bioinformatics  http://t.cn/zWhaZtR

    Thu Jun 14 22:33:35 +0800 2012
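    When it comes to algorithms, they still can't compete with Google. According to media reports, after the 3D maps were released, many developers eager to try them found that Apple Maps made some obvious mistakes. For example, a mislabeled location "guided" a driver to jump off a bridge. And in Apple's 3D maps, Asia and Australia are separated by the Arctic Ocean! The vast African continent became the "North Pacific", and Greenland became the Indian Ocean!  http://t.cn/zWPrSGl

    Thu Jun 14 20:08:50 +0800 2012
    //@基因组:[good]@土摩托: Wu Kongming's team spent more than 20 years studying 1.5 million square kilometers of farmland before reaching the above conclusion. The paper was just published in the latest issue of Nature... http://t.cn/zWhwlTv One more thing: I see some people questioning this conclusion, but has it not occurred to you that the objection you came up with in one second is something they surely thought of long ago? You are too confident. The Guardian reported on the latest research by Wu Kongming, director of the Institute of Plant Protection at the Chinese Academy of Agricultural Sciences, showing that transgenic Bt insect-resistant cotton not only kills pests and reduces pesticide use, but also protects the environment by letting the natural enemies of other pests that Bt cannot kill thrive. http://t.cn/zWhwvf2 Many people accuse me of speaking for Monsanto. My answer: Monsanto is an environmentally friendly company, so why can't I speak for it?

    Wed Jun 13 04:39:52 +0800 2012
    Talking about "history" is useless; fists are what count. [Disputed islands between Britain and Argentina to decide sovereignty by referendum] The government of the Falkland Islands (called the Malvinas by Argentina) announced on the 12th that the islanders will hold a referendum next year to decide their sovereignty, ending the dispute between Britain and Argentina over the issue. British Prime Minister Cameron said the British government will support the result of the referendum and will support and defend the islanders' choice. http://t.cn/zWPqoT7

    Tue Jun 12 04:06:55 +0800 2012
    Sheva is awesome!

    Tue Jun 12 03:57:56 +0800 2012
    Sweden's win was not an honorable one

    Tue Jun 12 00:48:14 +0800 2012
    Is England really just going to let France attack freely like this?

    Mon Jun 11 23:23:27 +0800 2012
    Google dreams of its glory days and fights hard | 谷奥: exploring the secrets of Google  http://t.cn/zWv9ePp  Original: http://t.cn/zWvipNo

    Mon Jun 11 21:20:46 +0800 2012
    Repost. "A Reading List for C++ Programmers": Reading more good books helps developers solidify their programming foundations and improve their skills. Most of the time, however, beginners don't know which books to start with, and developers with some experience don't know how to advance. This article puts together a list of good C++ books, hoping it will be of some help... http://t.cn/zWvwYZl

    Sat Jun 09 01:28:10 +0800 2012
    Everything changed suddenly in the second half

    Fri Jun 08 23:41:59 +0800 2012
    ESPN is finally broadcasting football. I can watch it on TV! The complete Euro 2012 schedule, get ready for this summer... bookmark it if you like it~

    Fri Jun 08 23:03:03 +0800 2012
    The impressive Flame virus http://t.cn/zOey5jv http://t.cn/zOFsvIv http://t.cn/zOFIcg1 http://t.cn/zOFjuvl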

    Fri Jun 08 03:17:03 +0800 2012
    Deciphering the plant splicing code  http://t.cn/zOFaRMT

    Fri Jun 08 01:56:26 +0800 2012
    Lafayette is #2! America’s Brainiest Cities http://t.cn/zOFXgUP

    Thu Jun 07 20:37:40 +0800 2012
    If a possible serious disease is found, do you have the baby or not? Sequencing the Unborn http://t.cn/zOFceD6

    Wed Jun 06 21:13:19 +0800 2012
    Nothing worth doing is easy

    Wed Jun 06 10:59:09 +0800 2012
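    Brilliant. [US does not object to China publishing air quality data for US cities, says it is not interfering in China's internal affairs] In response to China's Ministry of Environmental Protection accusing foreign embassies of breaking the rules by publishing air quality data for China, the US State Department said the data it publishes only serves to inform American citizens in China and does not interfere in China's internal affairs, so it will not stop publishing. If China also publishes air quality data for US cities, the US has no objection. http://t.cn/zOkJ3lU

    Wed Jun 06 01:08:49 +0800 2012
    A remarkable story worth sharing. On the 20th of last month, a man in Huanggu District, Shenyang, took a passing woman hostage with a kitchen knife. Armed police confronted him but dared not act for fear of harming the hostage. With the upper hand, the man was moving the hostage when he found a fried-chicken stall in his way and kicked it over. That was his mistake: the furious vendor charged at him with a large ladle and subdued him on the spot. [haha][haha] http://t.cn/zO3GYSh

    Tue Jun 05 08:36:25 +0800 2012
    The Shanghai index fell with real style today!

    Mon Jun 04 23:59:03 +0800 2012
    I really regret watching that video. [Suspect in killing of Chinese student in Canada arrested in Germany] Magnotta, the 29-year-old porn actor suspected in Canada's "body parts mailing case", was arrested tonight at an internet cafe in Berlin, Germany. Magnotta is suspected of killing and violating Lin Jun, a 33-year-old student from Wuhan studying in Montreal; the whole process was recorded and uploaded to the internet, and some body parts were mailed to the headquarters of Canada's two major political parties. http://t.cn/zODTgoi

    Mon Jun 04 04:34:46 +0800 2012
    Is Spain turning into a second France? I'm starting to worry that today's script will be "the underdog's comeback"...

    Sat Jun 02 00:39:41 +0800 2012
    RIP. [Video: Bus driver hit by flying iron fragment endures the pain to stop the bus and save passengers] On May 29, a coach from Hangzhou Changyun Express was returning from Wuxi to Hangzhou when it was struck by a piece of iron that suddenly flew in from the opposite lane. The driver, Wu Bin, was hit head-on, but he did not let the coach go out of control; enduring the pain, he brought it to a safe stop and ensured the safety of the 24 passengers on board. He gave his own life, having suffered multiple ruptures of the liver. http://t.cn/zOeqEcO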

    Fri Jun 01 21:58:21 +0800 2012
    Next Steps in the Sequence: The Implications of Whole Genome Sequencing for Health in the UK (PHG Foundation, Cambridge, 2011)http://t.cn/zOeURjF @BGI华大基因 @王俊华大基因

    Fri Jun 01 21:55:40 +0800 2012
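    Pouring a little cold water on whole-genome sequencing, to sober things up. [Science: Policy Forum] Medicine: Whole-Genome Sequencing: The New Standard of Care?  http://t.cn/zOeUjL7

    Fri Jun 01 20:30:35 +0800 2012
    Repost. Sorry, this post is not suitable for public viewing. If you need help, please contact customer service. http://t.cn/z0D6ZaQ

    Fri Jun 01 12:15:07 +0800 2012
    Repost. [BGI is hiring] Join us and share the future with BGI! BGI is recruiting a "Senior Project Leader, Pharmacogenomics". Please send your resume to: recruiting@genomics.org.cn For more openings, see: http://t.cn/zOCnFgC [hehe]

    Thu May 31 21:32:50 +0800 2012
    RIP. [Redologist Zhou Ruchang dies at 95] In the early hours of May 31, the renowned Redology scholar Zhou Ruchang passed away at home at the age of 95. He left a last wish asking his children not to hold a memorial service or set up a mourning hall, and to let him go quietly. Zhou Ruchang published more than 60 scholarly works in his lifetime; his best-known work, "New Evidence on Dream of the Red Chamber", is a foundational work of modern Redology. http://t.cn/zOg0YcA

    Thu May 31 02:15:33 +0800 2012
    Torres with a leaning-back header!

    Thu May 31 02:02:22 +0800 2012
    How are the figures in papers like this even drawn? The tomato genome sequence provides insights into fleshy fruit evolution  http://t.cn/zOg5b1Y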

    Thu May 31 01:53:56 +0800 2012
    The tomato genome sequence provides insights into fleshy fruit evolution  http://t.cn/zOg5b1Y

    Wed May 30 22:38:51 +0800 2012
    http://t.cn/zOgvmmt

    Wed May 30 09:49:39 +0800 2012
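    Repost. [China Youth Daily: A pitiable blade thrust into a pitiable chest] After a law enforcement officer demolishing an illegal building in Dejiang County, Guizhou, was stabbed to death by a rural woman, reporters found that the woman is 27 years old and had worked away from home with her husband for 10 years. After her husband died last year she returned to her hometown and spent nearly 100,000 yuan of savings building two single-story rooms, which were demolished as illegal construction. Villagers said she had no husband to rely on and a 9-year-old daughter to raise; with the house gone, she was driven into a corner. http://t.cn/zOdm5S5

    Tue May 29 04:31:57 +0800 2012
    Repost. [Hazard announces move to Chelsea this summer] http://t.cn/zOdVNC9 In the early hours of May 29, Beijing time, Ligue 1 player of the year and Belgium international Hazard officially announced on his personal Twitter that he will join newly crowned Champions League winners Chelsea this summer. With Hazard's own announcement, the drawn-out transfer saga that pulled in many top clubs has come to a close, and the young star of Ligue 1 will move to the Premier League this summer.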

    Tue May 29 04:29:16 +0800 2012
    LOL | How Not To Be A Bioinformatician  http://t.cn/zOdVnlW

    Tue May 29 00:55:40 +0800 2012
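    Repost. When Dad has money, I'll get you American residency. Reposted from Zhai Jiguang, Doctor of Law from Peking University and lecturer at the School of Civil, Commercial and Economic Law, China University of Political Science and Law. Dear daughter: Dad is an incompetent man. There are many things I can't manage well; please forgive me. Dad brought you into this (see the long post for details)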

    Sat May 26 01:28:57 +0800 2012
    BBC News - Station grabs SpaceX Dragon ship  http://t.cn/zOBSe7P

    Sat May 26 00:15:26 +0800 2012
    Revolution Analytics - Commercial Software & Support for the R Statistics Language  http://t.cn/hbqgZ1

    Fri May 25 07:27:52 +0800 2012
    //@BGI华大基因: [onlooker] Watching the shifting fortunes of the sequencing instrument market [breeze][floating clouds], #ICG-Europe# NGS & 3GS comparison by Sanger. Ion Torrent performs better, Oxford Nanopore still promising [onlooker] @BGI华大基因

    Fri May 25 02:14:03 +0800 2012
    a data-driven approach [Perspective] Cancer: Systems Biology, Metabolomics, and Cancer Metabolism  http://t.cn/zOB4tJj

    Fri May 25 02:11:11 +0800 2012
    A new trend in sequencing. [News Focus] The Biology of Genomes: Single-Cell Sequencing Tackles Basic and Biomedical Questions  http://t.cn/zOB4q7R

    Fri May 25 02:06:09 +0800 2012
    Teeth go bad because we eat too well. [News Focus] Evolutionary Biology: An Evolutionary Theory of Dentistry  http://t.cn/zOB4Up1

    Fri May 25 00:43:02 +0800 2012
    The Emergence of lncRNA  http://t.cn/zOBUh4S

    Thu May 24 21:35:41 +0800 2012
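    LOL. On April 30, a friend @-mentioned me, saying roughly: "If Chinese IT companies rewarded people who report vulnerabilities the way Google and others do, Chinese products would be much better." Today I was glad to see on Weibo that a company had taken this stance, and white hats followed up. According to @Joey_Yin's post, he submitted an XSS vulnerability, and then the company's reward rules changed so that XSS vulnerabilities were no longer eligible. [bummer] See the large image.

    Thu May 24 10:51:59 +0800 2012
    Accomplishing all of this is already quite "out of line". How many people can do it? Seen from another angle, this is a kind of happiness. #If You Are the One quotes# I believe everyone knows someone like this: good grades at school, a good family background, a good middle school followed by a good university and a good major, and after graduation, through their own effort or family connections, a job at a good employer. They never do anything out of line, and barring surprises they will retire at 59. Do you call that a dependable person? (@孟非)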

    Thu May 24 10:03:39 +0800 2012
    Amino acid coevolution induces an evolutionary Stokes shift http://t.cn/zO1QvRs

    Thu May 24 09:51:27 +0800 2012
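    Skip Google Reader for a few days and it piles up like a mountain.

    Thu May 24 08:32:58 +0800 2012
    A question that puzzles many leading researchers. Who can offer an accurate model? Humans riddled with rare genetic variants http://t.cn/zO1Yce8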

    Thu May 24 08:17:54 +0800 2012
    2012 Next-Gen Sequencing Conferences http://t.cn/zO1jsMz

    Thu May 24 08:05:45 +0800 2012
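    Melanoma genome http://t.cn/zO1jnHc

    Tue May 22 08:36:13 +0800 2012
    Maybe it's because the authors are from EBI; somehow it got into Nature. http://t.cn/zO3iRDg

    Sun May 20 02:05:01 +0800 2012
    Was that NYU fellowship a one-off award, or does it come from a named, established source? It even reimburses the plane ticket.

    Sat May 19 00:07:46 +0800 2012
    Repost. A family dinner, to prove I can grace the living room and work the kitchen, kill trojans and fight hooligans.  @薛蛮子, waiting for your photo measuring @韩寒's height. His mark is carved on my wall, and I can attest he wasn't precise enough: 171.5.

    Fri May 18 21:31:02 +0800 2012
    Cell: The Life History of 21 Breast Cancers http://t.cn/zOmSHZ3 Mutational Processes Molding the Genomes of 21 Breast Cancers http://t.cn/zOmSHZu

    Fri May 18 09:00:12 +0800 2012
    Repost. "X, What Do You Want": reviews of several films, including "Seediq Bale" and "The Avengers"... Blog link http://t.cn/zOm6rkB

    Fri May 18 04:16:22 +0800 2012
    Grabbing a front-row seat. A comment prompted by reading Han Han's blog post "A Film Review". http://t.cn/zOm6rkB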

    Wed May 16 08:16:33 +0800 2012
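    Repost. The restroom signs at Harvard Medical School's Department of Genetics are quite distinctive [giggle].

    Wed May 16 06:39:32 +0800 2012
    Good article!   Chen Pingyuan: Chinese universities blindly following Harvard and Cambridge does harm to all_News Channel_ifeng.com  http://t.cn/zOnXLaH

    Wed May 16 06:25:06 +0800 2012
    Repost. [The one who sits by the roadside and applauds] I got a phone call from my daughter's homeroom teacher. First I learned that her grades were still average. But he said something odd had happened, something he had never seen in 30 years of teaching. The Chinese exam had a bonus question: which classmate do you admire most, and why? Apart from my daughter, the entire class had written my daughter's name... A good piece that all parents should read carefully! by @周伯通招聘

    Tue May 15 22:31:38 +0800 2012
    Repost. In response to Fang Zhouzi's constant undermining of Han Han, I have made a major decision. I have decided to marry Fang Zhouzi, this balding old man, and to use the rest of my life to keep him in check. And every night when I face Fang Zhouzi, I will imagine him as a tall, rich, handsome young Western man.

    Tue May 15 21:20:39 +0800 2012
    My itchy fingers clicked on the Evergrande match.

    Tue May 15 01:08:41 +0800 2012
    Written by Durbin, so it's unlikely to be a dud; I think it's the best bioinformatics book.

    Mon May 14 23:31:02 +0800 2012
    A bit like convergent evolution at the gene level. Parallel domestication of the Shattering1 genes in cereals : Nature Genetics : Nature Publishing Group  http://t.cn/zOEBl01

    Mon May 14 20:42:03 +0800 2012
    BGI and UGA published the foxtail millet genome at the same time. Compare them! http://t.cn/zOEoM9Y http://t.cn/zOEn01Y @BGI华大基因

    Mon May 14 20:28:04 +0800 2012
    The foxtail millet genome; the corresponding author is a member of my committee. Reference genome sequence of the model plant Setaria : Nature Biotechnology : Nature Publishing Group  http://t.cn/zOEnH84

    Mon May 14 03:07:56 +0800 2012
    Great commentary! Half jubilation, half desolation: the Man City and Man United title race shows the charm of football|Premier League Focus|Premier League_Sina Video  http://t.cn/zOEilxo

    Sun May 13 05:37:19 +0800 2012
    Repost. Today is Mother's Day, and I have finally finished the oil painting "Mother" that I worked on for more than two years. I remember that in 1979, when I was admitted to university, I swore to my mother: "From today on, you need not be afraid. I will make sure you live without worrying about food and clothing, and I will never let you wander the streets again for the rest of your life." I think I can say I did it. But do we dare today to say to those wronged mothers: we will see that justice is done for you?!

    Wed May 09 18:46:38 +0800 2012
    Repost. The complete Euro 2012 schedule, get ready for this summer... bookmark it if you like it~

    Wed May 09 16:15:25 +0800 2012
    Nature: The relationship between chromosomal mosaicism, age, and cancer http://t.cn/zO8CACs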

    Wed May 09 16:15:25 +0800 2012
    String-graph based next-generation sequencing assemblers http://t.cn/zO8CANA

    Wed May 09 03:49:52 +0800 2012
    Carroll played way too dirty!

    Wed May 09 00:03:37 +0800 2012
    String-graph based next-generation sequencing assembler | Bioops  http://t.cn/zO8cNFN

    Tue May 08 22:48:41 +0800 2012
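    Nature: The relationship between chromosomal mosaicism, age, and cancer | Bioops  http://t.cn/zO85uRu  Comment: Cell replication in the human body is unreliable. Older people are more likely to get cancer. Age 50 is a threshold. Have children early.

    Mon May 07 12:52:02 +0800 2012
    What's rare is that this post came from the media. I converted it to an image in case it gets censored.

    Mon May 07 12:33:53 +0800 2012
    Congratulations! Yes indeed, you all guessed right!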

    String-graph Based Next-generation Sequencing Assemblers


    Please read this article or the papers linked below for the reasons to use string-graph algorithms.

    Readjoiner (Gonnella and Kurtz 2012) Readjoiner is a sequence assembler based on the assembly string graph framework (Myers 2005). It is faster and uses less memory than previous string graph-based assemblers.

    SGA (Simpson and Durbin 2012) SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.

    Edena (Hernandez et al 2008) The program is under development. Edena is based on the classical overlap layout assembly framework. In addition, it includes two features to improve the assembly of very short sequences: exact matching and detection of spurious reads.

    LEAP (Dinh and Rajasekaran 2011) LEAP incorporates a memory-efficient data structure for the overlap graph. This data structure requires only linear time to construct and linear memory to store.