Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

Get Intron Sizes From Gff3 Files Using Perl


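Gff3 files normally have no ‘intron’ feature, but intron sizes can be derived from the gaps between consecutive CDS regions of a transcript. Here is a simple Perl script that gets intron sizes from gff3 files this way. It prints the size of each intron, one mRNA (transcript) per line, so a gene with several transcripts produces several lines; a transcript with a single CDS prints 0. Because only CDS lines are used, introns that lie entirely within UTRs are not counted. You can customize the script for your own gff3 files or needs.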

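PS: I sort the CDS coordinates before calculating the intron sizes, so the result does not depend on strand, that is, on whether the CDS lines are listed in ascending (+) or descending (-) order.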
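To make the calculation concrete, here is a minimal sketch (not the full script) that measures the gaps for one minus-strand transcript. The helper name `intron_sizes` is made up for this illustration, and the three CDS intervals are copied from the second mRNA of the soybean example in this post. Unlike the script, it sorts whole intervals by start instead of flattening all coordinates; for non-overlapping CDS regions both approaches should give the same result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration: takes [start, end] CDS intervals
# (1-based, inclusive, as in gff3) and returns the sizes of the gaps.
sub intron_sizes {
    # Order by start coordinate so the strand (file order) does not matter.
    my @cds = sort { $a->[0] <=> $b->[0] } @_;
    my @sizes;
    for my $i (1 .. $#cds) {
        # Bases strictly between the end of one CDS and the start of the next.
        push @sizes, $cds[$i][0] - $cds[$i - 1][1] - 1;
    }
    return @sizes;
}

# CDS intervals in file order for a minus-strand transcript (descending).
my @cds = ([92970, 94401], [92083, 92084], [90693, 90860]);
print join("\t", intron_sizes(@cds)), "\n";
```

For these intervals the gaps are 92083 - 90860 - 1 = 1222 and 92970 - 92084 - 1 = 885, so the snippet prints those two numbers separated by a tab.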
get_intron_size.pl
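#!/usr/bin/perl
use strict;
use warnings;

# usage: perl get_intron_size.pl gff3_file >output
# Prints the intron sizes of each mRNA on one tab-separated line ("0" if none).

@ARGV == 1 or die "usage: perl get_intron_size.pl gff3_file >output\n";
my $input = $ARGV[0];
open(my $in, '<', $input) or die "Cannot open $input: $!\n";

my @exons;          # CDS start/end coordinates of the current mRNA
my $seen_mrna = 0;  # true once the first mRNA line has been read
while (my $line = <$in>) {
  chomp $line;
  if ($line =~ /\tmRNA\t/) {
    # A new mRNA begins: report the previous one first.
    print_intron(@exons) if $seen_mrna;
    $seen_mrna = 1;
    @exons = ();
  } elsif ($line =~ /\tCDS\t/) {
    my @fields = split(/\t/, $line);
    push(@exons, $fields[3], $fields[4]);  # columns 4 and 5: start and end
  }
}
close($in);
# Report the last mRNA (nothing is printed if the file had no mRNA lines).
print_intron(@exons) if $seen_mrna;

sub print_intron {
  # Sorting all coordinates assumes the CDS regions of one mRNA do not overlap.
  my @ordered = sort { $a <=> $b } @_;
  my @sizes;
  # After sorting, each odd index holds a CDS end and the next index the following CDS start.
  for (my $i = 1; $i <= $#ordered - 2; $i += 2) {
    push(@sizes, $ordered[$i + 1] - $ordered[$i] - 1);
  }
  print join("\t", @sizes ? @sizes : 0), "\n";
}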
An example gff3 file from the soybean genome annotation (the columns are tab-separated in the actual file):
##gff-version 3
Gm01    phytozome8_0    gene    51481   61502   .       -       .       ID=Glyma01g00270;Name=Glyma01g00270
Gm01    phytozome8_0    mRNA    51481   61502   .       -       .       ID=PAC:16242891;Name=Glyma01g00270.1;pacid=16242891;longest=1;Parent=Glyma01g00270
Gm01    phytozome8_0    CDS     61437   61502   .       -       0       ID=PAC:16242891.CDS.1;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     61167   61305   .       -       0       ID=PAC:16242891.CDS.2;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     60722   60780   .       -       2       ID=PAC:16242891.CDS.3;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     60339   60533   .       -       0       ID=PAC:16242891.CDS.4;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     59699   59814   .       -       0       ID=PAC:16242891.CDS.5;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     59420   59547   .       -       1       ID=PAC:16242891.CDS.6;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     59176   59284   .       -       2       ID=PAC:16242891.CDS.7;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     56878   56879   .       -       1       ID=PAC:16242891.CDS.8;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     55013   55069   .       -       2       ID=PAC:16242891.CDS.9;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     54390   54500   .       -       2       ID=PAC:16242891.CDS.10;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     54218   54320   .       -       2       ID=PAC:16242891.CDS.11;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     53902   53966   .       -       1       ID=PAC:16242891.CDS.12;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     52608   52639   .       -       2       ID=PAC:16242891.CDS.13;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     52517   52571   .       -       0       ID=PAC:16242891.CDS.14;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     52362   52465   .       -       2       ID=PAC:16242891.CDS.15;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     51828   51925   .       -       0       ID=PAC:16242891.CDS.16;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    CDS     51481   51625   .       -       1       ID=PAC:16242891.CDS.17;Parent=PAC:16242891;pacid=16242891
Gm01    phytozome8_0    gene    90693   95580   .       -       .       ID=Glyma01g00300;Name=Glyma01g00300
Gm01    phytozome8_0    mRNA    90693   94401   .       -       .       ID=PAC:16242894;Name=Glyma01g00300.1;pacid=16242894;longest=1;Parent=Glyma01g00300
Gm01    phytozome8_0    CDS     92970   94401   .       -       0       ID=PAC:16242894.CDS.1;Parent=PAC:16242894;pacid=16242894
Gm01    phytozome8_0    CDS     92083   92084   .       -       2       ID=PAC:16242894.CDS.2;Parent=PAC:16242894;pacid=16242894
Gm01    phytozome8_0    CDS     90693   90860   .       -       0       ID=PAC:16242894.CDS.3;Parent=PAC:16242894;pacid=16242894
Gm01    phytozome8_0    mRNA    94388   95580   .       -       .       ID=PAC:16242895;Name=Glyma01g00300.2;pacid=16242895;longest=0;Parent=Glyma01g00300
Gm01    phytozome8_0    CDS     95467   95469   .       -       0       ID=PAC:16242895.CDS.1;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    five_prime_UTR  95470   95580   .       -       .       ID=PAC:16242895.five_prime_UTR.1;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    CDS     95277   95356   .       -       0       ID=PAC:16242895.CDS.2;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    CDS     94388   94475   .       -       1       ID=PAC:16242895.CDS.3;Parent=PAC:16242895;pacid=16242895
Gm01    phytozome8_0    gene    116300  127990  .       +       .       ID=Glyma01g00320;Name=Glyma01g00320
Gm01    phytozome8_0    mRNA    116300  127990  .       +       .       ID=PAC:16242897;Name=Glyma01g00320.1;pacid=16242897;longest=1;Parent=Glyma01g00320
Gm01    phytozome8_0    five_prime_UTR  116300  116467  .       +       .       ID=PAC:16242897.five_prime_UTR.1;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     116468  117077  .       +       0       ID=PAC:16242897.CDS.1;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     117156  118627  .       +       2       ID=PAC:16242897.CDS.2;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     125732  125982  .       +       0       ID=PAC:16242897.CDS.3;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    CDS     127537  127567  .       +       1       ID=PAC:16242897.CDS.4;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    three_prime_UTR 127568  127990  .       +       .       ID=PAC:16242897.three_prime_UTR.1;Parent=PAC:16242897;pacid=16242897
Gm01    phytozome8_0    mRNA    116300  127990  .       +       .       ID=PAC:16242898;Name=Glyma01g00320.2;pacid=16242898;longest=0;Parent=Glyma01g00320
Gm01    phytozome8_0    five_prime_UTR  116300  116467  .       +       .       ID=PAC:16242898.five_prime_UTR.1;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    CDS     116468  117077  .       +       0       ID=PAC:16242898.CDS.1;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    CDS     117156  118627  .       +       2       ID=PAC:16242898.CDS.2;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    CDS     127537  127707  .       +       0       ID=PAC:16242898.CDS.3;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    three_prime_UTR 127708  127990  .       +       .       ID=PAC:16242898.three_prime_UTR.1;Parent=PAC:16242898;pacid=16242898
Gm01    phytozome8_0    mRNA    127557  127707  .       +       .       ID=PAC:16242899;Name=Glyma01g00320.4;pacid=16242899;longest=0;Parent=Glyma01g00320
Gm01    phytozome8_0    CDS     127557  127707  .       +       1       ID=PAC:16242899.CDS.1;Parent=PAC:16242899;pacid=16242899
Gm01    phytozome8_0    gene    170877  193446  .       +       .       ID=Glyma01g00380;Name=Glyma01g00380
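
One possible customization is to label each output line with the transcript it belongs to, taken from the `ID=` attribute of the mRNA line. The following is only a sketch under stated assumptions: the sub name `intron_sizes_by_transcript` is hypothetical, the columns are assumed to be tab-separated, and each mRNA line is assumed to come before its CDS lines, as in the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant for illustration: reads gff3 lines from a filehandle
# and returns one "ID<TAB>size<TAB>size..." string per mRNA ("ID<TAB>0" if
# the mRNA has no introns between its CDS regions).
sub intron_sizes_by_transcript {
    my ($fh) = @_;
    my (@order, %coords, $current);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;   # skip comments and blank lines
        my @f = split /\t/, $line;
        next unless @f >= 9;                      # gff3 has nine columns
        if ($f[2] eq 'mRNA') {
            # Take the ID attribute from column 9; fall back to a placeholder.
            ($current) = $f[8] =~ /(?:^|;)ID=([^;]+)/;
            $current = 'unknown_' . (scalar(@order) + 1) unless defined $current;
            push @order, $current;
            $coords{$current} = [];
        } elsif ($f[2] eq 'CDS' && defined $current) {
            push @{ $coords{$current} }, @f[3, 4]; # start and end
        }
    }
    my @rows;
    for my $id (@order) {
        # Same idea as the script: sort all coordinates, then measure each
        # gap between a CDS end (odd index) and the next CDS start.
        my @pos = sort { $a <=> $b } @{ $coords{$id} };
        my @sizes;
        for (my $i = 1; $i <= $#pos - 2; $i += 2) {
            push @sizes, $pos[$i + 1] - $pos[$i] - 1;
        }
        push @rows, join("\t", $id, @sizes ? @sizes : 0);
    }
    return @rows;
}

# Run on a file only when one is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
    print "$_\n" for intron_sizes_by_transcript($in);
    close($in);
}
```

Keeping the parsing in a sub that takes a filehandle makes it easy to try on a few lines held in a string before running it on a whole annotation file.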
