Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

MCL - a Cluster Algorithm for Graphs

| Comments

The MCL can be used to cluster/group similar genes into gene families from BLAST (not BLAST+)  -m 8 output file.

TribeMCL is not available now.

OrthoMCL is a more powerful software based on MCL for grouping orthologous protein sequences The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for networks (also known as graphs) based on simulation of (stochastic) flow in graphs. The algorithm was invented/discovered by Stijn van Dongen at the Centre for Mathematics and Computer Science (also known as CWI) in the Netherlands. The PhD thesis Graph clustering by flow simulation is centered around this algorithm, the main topics being the mathematical theory behind it, its position in cluster analysis and graph clustering, issues concerning scalability, implementation, and benchmarking, and performance criteria for graph clustering in general. The work for this thesis was carried out under supervision of Jan van Eijck and Michiel Hazewinkel.

Protocol: Clustering similarity graphs encoded in BLAST results

First we create an ABC-formatted file using the external columnar BLAST format, which is assumed to be in a file called seq.cblast.
cut -f 1,2,11 seq.cblast > seq.abc
The columnar format in the file seq.cblast has, for a given BLAST hit, the sequence labels in the first two columns and the asssociated E-value in column 11. It is parsed by the standard UNIX cut utility. The format must have been created with the BLAST -m8 option so that no comment lines are present. Alternatively these can be filtered out using grep.
The newly created seq.abc file is loaded by mcxload, which writes both a network file seq.mci and a dictionary file seq.tab.
mcxload -abc seq.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o seq.mci -write-tab seq.tab
The –stream-mirror option ensures that the resulting network will be undirected, as recommended when using mcl. Omitting this option would result in a directed network as BLAST E-values generally differ between two sequences. The default course of action for mcxload is to use the best value found between a pair of labels. The next option, –abc-neg-log10 tranforms the numerical values in the input (the BLAST E-values) by taking the logarithm in base 10 and subsequently negating the sign. Finally, the transformed values are capped so that any E-value below 1e-200 is set to a maximum allowed edge weight of 200.
To obtain clusterings from seq.mci and seq.tab one has two choices. The first is to generate an abstract clustering representation and from that obtain the label output, as follows. Below the -o option is not used, so mcl will create meaningful and unique output names by itself. The default way of doing this is to preprend the prefix out. and to append a suffix encoding the inflation value used, with inflation encoded using two digits of precision and the decimal separator removed.
mcl seq.mci -I 1.4
mcl seq.mci -I 2
mcl seq.mci -I 4
mcl seq.mci -I 6

mcxdump -icl out.seq.mci.I14 -tabr seq.tab -o dump.seq.mci.I14
mcxdump -icl out.seq.mci.I20 -tabr seq.tab -o dump.seq.mci.I20
mcxdump -icl out.seq.mci.I40 -tabr seq.tab -o dump.seq.mci.I40
mcxdump -icl out.seq.mci.I60 -tabr seq.tab -o dump.seq.mci.I60
Now the file out.seq.tab.I14 and its associates can be used for example to compute the distances between the encoded clusterings with clm dist, to compute a set of strictly reconciled nested clusterings with clm order, or to compute an efficiency criterion with clm info.
Alternatively, label output can be obtained directly from mcl as follows.
mcl seq.mci -I 1.4  -use-tab seq.tab
mcl seq.mci -I 2  -use-tab seq.tab
mcl seq.mci -I 4  -use-tab seq.tab
mcl seq.mci -I 6  -use-tab seq.tab

Comments