Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

Notes on Using BLAST

| Comments

BLAST (not BLAST+) provides an option for tabular output that is easily parsed. Use the -m 8 option for tabular output, or the -m 9 option to include headers.

blastall -i input.fa -d /path/to/db.fa -p blastn -m 8
Assumes that db.fa is in a directory that also has a correctly formatted database. This can be achieved by:
formatdb -i db.fa -p F
The fields for tabular BLAST output are:
1 Query The query sequence id
2 Subject The matching subject sequence id
3 % id
4 alignment length
5 mistmatches
6 gap openings
7 q.start
8 q.end
9 s.start
10 s.end
11 e-value
12 bit score
Parse the information:

Python

for line in open(“myfile.blast”):
(queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore) = line.split(“t”)

Perl

while (<>) {
($queryId, $subjectId, $percIdentity, $alnLength, $mismatchCount, $gapOpenCount, $queryStart, $queryEnd, $subjectStart, $subjectEnd, $eVal, $bitScore) = split(/t/)
}

MCL - a Cluster Algorithm for Graphs

| Comments

The MCL can be used to cluster/group similar genes into gene families from BLAST (not BLAST+)  -m 8 output file.

TribeMCL is not available now.

OrthoMCL is a more powerful software based on MCL for grouping orthologous protein sequences The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for networks (also known as graphs) based on simulation of (stochastic) flow in graphs. The algorithm was invented/discovered by Stijn van Dongen at the Centre for Mathematics and Computer Science (also known as CWI) in the Netherlands. The PhD thesis Graph clustering by flow simulation is centered around this algorithm, the main topics being the mathematical theory behind it, its position in cluster analysis and graph clustering, issues concerning scalability, implementation, and benchmarking, and performance criteria for graph clustering in general. The work for this thesis was carried out under supervision of Jan van Eijck and Michiel Hazewinkel.

Protocol: Clustering similarity graphs encoded in BLAST results

First we create an ABC-formatted file using the external columnar BLAST format, which is assumed to be in a file called seq.cblast.
cut -f 1,2,11 seq.cblast > seq.abc
The columnar format in the file seq.cblast has, for a given BLAST hit, the sequence labels in the first two columns and the asssociated E-value in column 11. It is parsed by the standard UNIX cut utility. The format must have been created with the BLAST -m8 option so that no comment lines are present. Alternatively these can be filtered out using grep.
The newly created seq.abc file is loaded by mcxload, which writes both a network file seq.mci and a dictionary file seq.tab.
mcxload -abc seq.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o seq.mci -write-tab seq.tab
The –stream-mirror option ensures that the resulting network will be undirected, as recommended when using mcl. Omitting this option would result in a directed network as BLAST E-values generally differ between two sequences. The default course of action for mcxload is to use the best value found between a pair of labels. The next option, –abc-neg-log10 tranforms the numerical values in the input (the BLAST E-values) by taking the logarithm in base 10 and subsequently negating the sign. Finally, the transformed values are capped so that any E-value below 1e-200 is set to a maximum allowed edge weight of 200.
To obtain clusterings from seq.mci and seq.tab one has two choices. The first is to generate an abstract clustering representation and from that obtain the label output, as follows. Below the -o option is not used, so mcl will create meaningful and unique output names by itself. The default way of doing this is to preprend the prefix out. and to append a suffix encoding the inflation value used, with inflation encoded using two digits of precision and the decimal separator removed.
mcl seq.mci -I 1.4
mcl seq.mci -I 2
mcl seq.mci -I 4
mcl seq.mci -I 6

mcxdump -icl out.seq.mci.I14 -tabr seq.tab -o dump.seq.mci.I14
mcxdump -icl out.seq.mci.I20 -tabr seq.tab -o dump.seq.mci.I20
mcxdump -icl out.seq.mci.I40 -tabr seq.tab -o dump.seq.mci.I40
mcxdump -icl out.seq.mci.I60 -tabr seq.tab -o dump.seq.mci.I60
Now the file out.seq.tab.I14 and its associates can be used for example to compute the distances between the encoded clusterings with clm dist, to compute a set of strictly reconciled nested clusterings with clm order, or to compute an efficiency criterion with clm info.
Alternatively, label output can be obtained directly from mcl as follows.
mcl seq.mci -I 1.4  -use-tab seq.tab
mcl seq.mci -I 2  -use-tab seq.tab
mcl seq.mci -I 4  -use-tab seq.tab
mcl seq.mci -I 6  -use-tab seq.tab

MizBee: A Multiscale Synteny Browser

| Comments

multi-platform (based on java), user-friendly,  no requirements for installation and run, and more remarkably, easier than Circos.

MizBee is a multiscale synteny browser for exploring conservation relationships in comparative genomics data. Using side-by-side linked views, MizBee enables efficient data browsing across a range of scales, from the genome to the gene. The design of MizBee is grounded in perceptual principles, and includes several techniques such as edge bundling and layering to enhance visual cues about conservation relationships related to proximity, size, similarity, and orientation.

i-ADHoRe: Detect Synteny Regions

| Comments

i-ADHoRe is a highly sensitive software tool to detect degenerated homology relations within and between different genomes.

The i-ADHoRe algorithm is based on the initial ADHoRe algorithm. After detecting initial pairs of colinear segments using the basic ADHoRe algorithm, these pairs are aligned to each other to form a profile that combines their gene order and content information. This profile is then used to detect additional homologous segments that show conserved gene order and content when compared to the profile rather than individual segments. If such an additional segment is discovered, it is included in the profile as well and the search is repeated until no additional segments can be found. All results are outputted in tab delimited text files.

Download

Simillion, C., Janssens, K., Sterck, L., Van de Peer, Y. (2008) i-ADHoRe 2.0: An improved tool to detect degenerated genomic homology using genomic profiles. Bioinformatics 24, 127-8. doi

Ubuntu下终端路径只显示当前目录

| Comments

.bashrc文件记录了用户终端配置

$: sudo vim ~/.bashrc

在文件中找到:

if [ “$color_prompt ” = yes ]; then

PS1 =’${debian_chroot:+($debian_chroot)}[33[01;32m]u@h[33[00m]:[33[01;34m]W [33[00m]$ ’

else

PS1 =’${debian_chroot:+($debian_chroot)}u@h:W $ ’

将红色的w由小写改成大写,可以表示只显示当前目录名称

Useful, Essential and Most Often Used Vim Commands

| Comments

The list of Vim commands >

Working with files
Vim command Action
:e filename Open a new file. You can use the Tab key for automatic file name completion, just like at the shell command prompt.
:w filename Save changes to a file. If you don’t specify a file name, Vim saves as the file name you were editing. For saving the file under a different name, specify the file name.
:q Quit Vim. If you have unsaved changes, Vim refuses to exit.
:q! Exit Vim without saving changes.
:wq Write the file and exit.
: x Almost the same as :wq, write the file and exit if you’ve made changes to the file. If you haven’t made any changes to the file, Vim exits without writing the file.
These Vim commands and keys work both in command mode and visual mode.
Vim command Action
j or Up Arrow Move the cursor up one line.
k or Down Arrow Down one line.
h or Left Arrow Left one character.
l or Right Arrow Right one character.
e To the end of a word.
E To the end of a whitespace-delimited word.
b To the beginning of a word.
B To the beginning of a whitespace-delimited word.
0 To the beginning of a line.
^ To the first non-whitespace character of a line.
$ To the end of a line.
H To the first line of the screen.
M To the middle line of the screen.
L To the the last line of the screen.
:n Jump to line number n. For example, to jump to line 42, you’d type :42
Inserting and overwriting text
Vim command Action
i Insert before cursor.
I Insert to the start of the current line.
a Append after cursor.
A Append to the end of the current line.
o Open a new line below and insert.
O Open a new line above and insert.
C Change the rest of the current line.
r Overwrite one character. After overwriting the single character, go back to command mode.
R Enter insert mode but replace characters rather than inserting.
The ESC key Exit insert/overwrite mode and go back to command mode.
Deleting text
Vim command Action
x Delete characters under the cursor.
X Delete characters before the cursor.
dd or :d Delete the current line.
Entering visual mode
Vim command Action
v Start highlighting characters. Use the normal movement keys and commands to select text for highlighting.
V Start highlighting lines.
The ESC key Exit visual mode and return to command mode.
Editing blocks of text
Note: the Vim commands marked with (V) work in visual mode, when you’ve selected some text. The other commands work in the command mode, when you haven’t selected any text.
Vim command Action
~ Change the case of characters. This works both in visual and command mode. In visual mode, change the case of highlighted characters. In command mode, change the case of the character uder cursor.
> (V) Shift right (indent).
< (V) Shift left (de-indent).
c (V) Change the highlighted text.
y (V) Yank the highlighted text. In Windows terms, “copy the selected text to clipboard.”
d (V) Delete the highlighted text. In Windows terms, “cut the selected text to clipboard.”
yy or :y or Y Yank the current line. You don’t need to highlight it first.
dd or :d Delete the current line. Again, you don’t need to highlight it first.
p Put the text you yanked or deleted. In Windows terms, “paste the contents of the clipboard”. Put characters after the cursor. Put lines below the current line.
P Put characters before the cursor. Put lines above the current line.
Undo and redo
Vim command Action
u Undo the last action.
U Undo all the latest changes that were made to the current line.
Ctrl + r Redo.
Vim command Action
/pattern Search the file for pattern.
n Scan for next search match in the same direction.
N Scan for next search match but opposite direction.
Replace
Vim command Action
:rs/foo/bar/a Substitute foo with barr determines the range and a determines the arguments.
The range (r) can be
nothing Work on current line only.
number Work on the line whose number you give.
% The whole file.
Arguments (a) can be
g Replace all occurrences in the line. Without this, Vim replaces only the first occurrences in each line.
i Ignore case for the search pattern.
I Don’t ignore case.
c Confirm each substitution. You can type y to substitute this match, n to skip this match, a to substitute this and all the remaining matches (“Yes to all”), and q to quit substitution.
Examples
:452s/foo/bar/ Replace the first occurrence of the word foo with bar on line number 452.
:s/foo/bar/g Replace every occurrence of the word foo with bar on current line.
:%s/foo/bar/g Replace every occurrence of the word foo with bar in the whole file.
:%s/foo/bar/gi The same as above, but ignore the case of the pattern you want to substitute. This replaces fooFOOFoo, and so on.
:%s/foo/bar/gc Confirm every substitution.
:%s/foo/bar/c For each line on the file, replace the first occurrence of foo with bar and confirm every substitution.

Parse Gff3 File

| Comments

Here is an example script of parsing gff3 file to get the cds sequences.

The annotation of a reference genome usually is stored as a gff3 format file. Most of the time, the genome sequencing group would provide CDS sequences along with the reference genome. However, sometimes they have transcripts instead of CDS, or you want to get 3’-UTR region, or 500 bp upstream sequences from the coding regions. It’s a good idea to parse the gff3 files to get information you need.

If you have any questions about the script or parsing gff3 file, leave a comment. Thank you!

parse_gff3_cds.pl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
#!/usr/bin/perl
# parse_gff3.pl
# get the cds sequences from gff3 file.
# usage: perl parse_gff3_cds.pl gff3_input_file genome_sequence_directory >output.txt

use strict;
use warnings;

use Bio::Tools::GFF;
use Bio::DB::Fasta;

my $gff3_file=$ARGV[0];  #gff3 input file
my $genome=$ARGV[1]; #genome sequence directory
my $gffio = Bio::Tools::GFF -> new(-file =>$gff3_file , -gff_version => 3);
my $db    = Bio::DB::Fasta  -> new($genome);

my $i=0;
my $first=0;
my $strand=0;
my $seq='';
while(my $feature = $gffio->next_feature()) {
# Sometimes, the gff3 file format is a little with the standard format,
# So that the keys of the hashes are different.
# Use the following lines of masked code to see the keys of the hashes. Some Change may be needed.
#  $i++;
#  if ($i==2){
#    my @a=keys %{$feature};
#    print "@a\n";
#    my @b=keys %{$feature->{_location}};
#    print "@b\n";
#    print "$feature->{_primary_tag}\n";
#  }
  if($feature->{_primary_tag}=~/mrna/i){
    $first++;
    if($first==1){
      $seq='';
    }else{
      print_sequence($seq,80,$strand);
      $seq='';
    }
    print ">$feature->{_gsf_tag_hash}->{Name}->[0]|$feature->{_gsf_tag_hash}->{ID}->[0]\n";
  }elsif($feature->{_primary_tag}=~/cds/i){
    my $seq_temp=$db->seq($feature->{_gsf_seq_id}, $feature->{_location}->{_start}=>$feature->{_location}->{_end});
    if($feature->{_location}->{_strand}=~/-/){  # to know the strand
      $seq=$seq_temp.$seq;
      $strand=0;
    }else{
      $seq.=$seq_temp;
      $strand=1;
    }
  }
}
print_sequence($seq,80,$strand);

$gffio->close();

sub print_sequence {
  my($sequence, $length,$strand) = @_;
  if($strand==0){  # if the sequence is on the minus strand, get its reverse-complement counterpart
    $sequence=~tr/ACGTacgt/TGCAtgca/;
    $sequence=reverse $sequence;
  }
  for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
    print substr($sequence, $pos, $length),"\n";
  }
}

Shotgun Reads

| Comments

read is a short sequence of DNA that has been taken directly from an organism’s genome. The procedure of Whole Genome Shotgun Assembly involves collecting large libraries of reads and piecing them together to re-create a complete genome. Reads applied to Whole Genome Shotgun Assembly are called shotgun reads.

Reads are created by laboratory sequencing methods such as Sanger sequencing; the method used affects the read’s length in bases, as well as its quality scores. For more information, see Sequencing.

Scaffold or Supercontig

| Comments

A scaffold is a portion of the genome sequence reconstructed from end-sequenced whole-genome shotgun clones. Scaffolds are composed of contigs and gaps.

A contig is a contiguous length of genomic sequence in which the order of bases is known to a high confidence level. Gaps occur where reads from the two sequenced ends of at least one fragment overlap with other reads in two different contigs (as long as the arrangement is otherwise consistent with the contigs being adjacent). Since the lengths of the fragments are roughly known, the number of bases between contigs can be estimated.

The goal of whole-genome shotgun assembly is to represent each genomic sequence in one scaffold; however, this is not always possible. One chromosome may be represented by many scaffolds (e.g., Chlamydomonas reinhardtii) or just a single scaffold (e.g., Human chromosome 19), depending on how completely the genome can be reconstructed, or assembled, from the available reads.  The relative locations of scaffolds in the genome are unknown.

Scaffolds are normally numbered approximately from largest to smallest. Some scaffolds may ultimately be filtered out of the assembly, resulting in skipped scaffold numbers.

In some cases, scaffolds can overlap. For example, in polymorphic genomes, regions with a high density of allelic differences between haplotypes may be split into separate sets of scaffolds, each representing one allele. Thus, a sequence that exists in only one location in the genome may appear on more than one scaffold.

N50

| Comments

What does N50 mean? What is N50 or N90?

N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly.

N50 is defined as the contig length such that using equal or longer contigs produces half the bases of the genome. The N50 size is computed by sorting all contigs from largest to smallest and by determining the minimum set of contigs whose sizes total 50% of the entire genome.

Alternative definition: Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L’ , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L’ is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L’ consists of six 2’s, six 3’s, four 4’s, and sixteen 8’s; the N50 of L is the median of L’ , which is 6.