Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

Bioperl Sequence Objects in Detail (Bioperl HOWTO Translation, Part 7)


The Sequence Object

English original

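Earlier parts touched on sequence objects many times and showed some ways to create and use them. This part describes in detail what a sequence object can do.

The table below lists the methods of the sequence object ("method" in the object-oriented sense; see the earlier parts). The "Returns" column shows the value the object returns when the method is called. Some methods, such as seq(), work both as a getter and as a setter. For example, to get the sequence from an existing sequence object: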

    $sequence_as_string = $seq_obj->seq;

You can also set the sequence yourself:

    $seq_obj->seq("MMTYDFFFFVVNNNNPPPPAAAW");

Table 1: Sequence Object Methods

| Name | Returns | Example | Note |
| --- | --- | --- | --- |
| accession_number | identifier | `$acc = $so->accession_number` | get or set an identifier |
| alphabet | alphabet | `$so->alphabet('dna')` | get or set the alphabet ('dna', 'rna', 'protein') |
| authority | authority, if available | `$so->authority("FlyBase")` | get or set the organization |
| desc | description | `$so->desc("Example 1")` | get or set a description |
| display_id | identifier | `$so->display_id("NP_123456")` | get or set an identifier |
| division | division, if available (e.g. PRI) | `$div = $so->division` | get division (e.g. "PRI") |
| get_dates | array of dates, if available | `@dates = $so->get_dates` | get dates |
| get_secondary_accessions | array of secondary accessions, if available | `@accs = $so->get_secondary_accessions` | get other identifiers |
| is_circular | Boolean | `if ($so->is_circular) { ... }` | get or set |
| keywords | keywords, if available | `@array = $so->keywords` | get or set keywords |
| length | length, a number | `$len = $so->length` | get the length |
| molecule | molecule type, if available | `$type = $so->molecule` | get molecule (e.g. "RNA", "DNA") |
| namespace | namespace, if available | `$so->namespace("Private")` | get or set the name space |
| new | Sequence object | `$so = Bio::Seq->new(-seq => "MPQRAS")` | create a new one, see Bio::Seq for more |
| pid | pid, if available | `$pid = $so->pid` | get pid |
| primary_id | identifier | `$so->primary_id(12345)` | get or set an identifier |
| revcom | Sequence object | `$so2 = $so1->revcom` | reverse complement |
| seq | sequence string | `$seq = $so->seq` | get or set the sequence |
| seq_version | version, if available | `$so->seq_version("1")` | get or set a version |
| species | Species object | `$species_obj = $so->species` | see Bio::Species for more |
| subseq | sequence string | `$string = $seq_obj->subseq(10,40)` | arguments are start and end |
| translate | protein Sequence object | `$prot_obj = $dna_obj->translate` | see the Bioperl Tutorial for more |
| trunc | Sequence object | `$so2 = $so1->trunc(10,40)` | arguments are start and end |

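
As a quick illustration of a few methods from Table 1, here is a minimal sketch. The 18-base sequence and the identifier example_1 are made up for this example; the method calls themselves are the ones listed in the table.

```perl
use strict;
use warnings;
use Bio::Seq;

# Hypothetical 18-base DNA sequence, invented for this example
my $dna_obj = Bio::Seq->new(
    -seq        => 'ATGGCCAAATTTGGGCCC',
    -alphabet   => 'dna',
    -display_id => 'example_1',
    -desc       => 'Example 1',
);

my $len      = $dna_obj->length;          # number of bases
my $sub_str  = $dna_obj->subseq(4, 9);    # plain string; positions are 1-based and inclusive
my $sub_obj  = $dna_obj->trunc(4, 9);     # same region, returned as a new sequence object
my $rev_obj  = $dna_obj->revcom;          # reverse complement, a new sequence object
my $prot_obj = $dna_obj->translate;       # protein sequence object

print "length:    $len\n";
print "subseq:    $sub_str\n";
print "trunc:     ", $sub_obj->seq,  "\n";
print "revcom:    ", $rev_obj->seq,  "\n";
print "translate: ", $prot_obj->seq, "\n";
```

Note the difference between subseq and trunc: the first hands back a string, the second a sequence object that you can keep calling methods on.
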
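Note that some of the methods listed above, such as molecule and division, only return something when the sequence object actually holds a corresponding value, and some sequence formats do not carry this information at all. So before relying on a method, make sure you know your input sequence file and what it contains.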
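
One defensive way to handle this is to ask the object whether it supports a method before calling it, and to treat an undefined return value as "no data". The sketch below does that with Perl's built-in can; the sequence and the identifier example_2 are made up. It assumes that a plain Bio::Seq object may not implement molecule or division at all (in Bioperl these appear to belong to the richer objects that parsers such as the GenBank one return), which is worth checking against your own installation.

```perl
use strict;
use warnings;
use Bio::Seq;

# Hypothetical sequence object, invented for this example
my $seq_obj = Bio::Seq->new(
    -seq        => 'ATGGCCAAATTTGGGCCC',
    -display_id => 'example_2',
);

my %report;
for my $method (qw(display_id length division molecule)) {
    if ($seq_obj->can($method)) {
        # The method exists, but it may still return nothing
        my $value = $seq_obj->$method;
        $report{$method} = defined $value ? $value : '(empty)';
    }
    else {
        # The object's class does not provide this method at all
        $report{$method} = '(not supported by ' . ref($seq_obj) . ')';
    }
}

print "$_: $report{$_}\n" for sort keys %report;
```
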

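There are also methods that deal with sequence annotation. They are somewhat off topic here; for details, see the Feature-Annotation HOWTO. The table below lists the relevant methods.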

Table 2: Feature and Annotation Methods

| Name | Returns | Note |
| --- | --- | --- |
| get_SeqFeatures | array of SeqFeature objects | |
| get_all_SeqFeatures | array of SeqFeature objects | array includes sub-features |
| remove_SeqFeatures | array of SeqFeatures removed | |
| feature_count | number of SeqFeature objects | |
| add_SeqFeature | | |
| annotation | array of Annotation objects | get or set |
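
For a first impression of the feature methods in Table 2, here is a minimal sketch that attaches one feature to a sequence and reads it back. The sequence, the identifier example_3, and the feature coordinates and note are all made up; Bio::SeqFeature::Generic is used as the feature class, which is one common choice rather than the only one.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Hypothetical sequence object, invented for this example
my $seq_obj = Bio::Seq->new(
    -seq        => 'ATGGCCAAATTTGGGCCC',
    -display_id => 'example_3',
);

# Hypothetical feature spanning the whole sequence
my $feat = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 18,
    -strand      => 1,
    -primary_tag => 'CDS',
    -tag         => { note => 'made-up feature' },
);

$seq_obj->add_SeqFeature($feat);

my $count = $seq_obj->feature_count;
print "features: $count\n";

# get_SeqFeatures returns the top-level features attached to the sequence
for my $f ($seq_obj->get_SeqFeatures) {
    printf "%s %d..%d\n", $f->primary_tag, $f->start, $f->end;
}
```
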

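Examples

Now let's see how the methods above are used: how to obtain sequence objects from different sources, and what the methods return for each. First, fetch a record from GenBank and create a sequence object from it: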

    use Bio::DB::GenBank;

    # Fetch the record from GenBank over the network by accession number
    my $db_obj  = Bio::DB::GenBank->new;
    my $seq_obj = $db_obj->get_Seq_by_acc("J01673");

Or read it from a GenBank file you already have locally:

    use Bio::SeqIO;

    # Open the local GenBank file and read its first record
    my $seqio_obj = Bio::SeqIO->new(-file => "J01673.gb", -format => "genbank");
    my $seq_obj   = $seqio_obj->next_seq;

The GenBank file looks like this:

LOCUS       ECORHO                  1880 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  E.coli rho gene coding for transcription termination factor.
ACCESSION   J01673 J01674
VERSION     J01673.1  GI:147605
KEYWORDS    attenuator; leader peptide; rho gene; transcription terminator.
SOURCE      Escherichia coli
ORGANISM  Escherichia coli
                  Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
                  Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 1880)
AUTHORS   Brown,S., Albrechtsen,B., Pedersen,S. and Klemm,P.
TITLE     Localization and regulation of the structural gene for
             transcription-termination factor rho of Escherichia coli
JOURNAL   J. Mol. Biol. 162 (2), 283-298 (1982)
MEDLINE   83138788
PUBMED   6219230
REFERENCE   2  (bases 1 to 1880)
AUTHORS   Pinkham,J.L. and Platt,T.
TITLE     The nucleotide sequence of the rho gene of E. coli K-12
JOURNAL   Nucleic Acids Res. 11 (11), 3531-3545 (1983)
MEDLINE   83220759
PUBMED   6304634
COMMENT      Original source text: Escherichia coli (strain K-12) DNA.
                      A clean copy of the sequence for [2] was kindly provided by
                      J.L.Pinkham and T.Platt.
FEATURES       Location/Qualifiers
     source      1..1880
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /db_xref="taxon:562"
     mRNA       212..>1880
                     /product="rho mRNA"
     CDS          282..383
                     /note="rho operon leader peptide"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA24531.1"
                     /db_xref="GI:147606"
                     /translation="MRSEQISGSSLNPSCRFSSAYSPVTRQRKDMSR"
     gene         468..1727
                     /gene="rho"
     CDS          468..1727
                     /gene="rho"
                     /note="transcription termination factor"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA24532.1"
                     /db_xref="GI:147607"
                     /translation="MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAK
                     SGEDIFGDGVLEILQDGFGFLRSADSSYLAGPDDIYVSPSQIRRFNLRTGDTISGKIR
                     PPKEGERYFALLKVNEVNFDKPENARNKILFENLTPLHANSRLRMERGNGSTEDLTAR
                     VLDLASPIGRGQRGLIVAPPKAGKTMLLQNIAQSIAYNHPDCVLMVLLIDERPEEVTE
                     MQRLVKGEVVASTFDEPASRHVQVAEMVIEKAKRLVEHKKDVIILLDSITRLARAYNT
                     VVPASGKVLTGGVDANALHRPKRFFGAARNVEEGGSLTIIATALIDTGSKMDEVIYEE
                     FKGTGNMELHLSRKIAEKRVFPAIDYNRSGTRKEELLTTQEELQKMWILRKIIHPMGE
                     IDAMEFLINKLAMTKTNDDFFEMMKRS"
ORIGIN      15 bp upstream from HhaI site.
        1 aaccctagca ctgcgccgaa atatggcatc cgtggtatcc cgactctgct gctgttcaaa
      61 aacggtgaag tggcggcaac caaagtgggt gcactgtcta aaggtcagtt gaaagagttc
                                  ...deleted...  
  1801 tgggcatgtt aggaaaattc ctggaatttg ctggcatgtt atgcaatttg catatcaaat
  1861 ggttaatttt tgcacaggac
//

Either way, you get the same sequence object. The table below lists the methods available on this object and what they return.

Table 3: Values from the Sequence object (Genbank)

| Method | Returns |
| --- | --- |
| display_id | ECORHO |
| desc | E.coli rho gene coding for transcription termination factor. |
| display_name | ECORHO |
| accession | J01673 |
| primary_id | 147605 |
| seq_version | 1 |
| keywords | attenuator; leader peptide; rho gene; transcription terminator |
| is_circular | |
| namespace | |
| authority | |
| length | 1880 |
| seq | AACCCT…ACAGGAC |
| division | BCT |
| molecule | DNA |
| get_dates | 26-APR-1993 |
| get_secondary_accessions | J01674 |

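A few remarks are in order. First, much of the information in the record is not returned. This "missing" information is all related to sequence annotation; see the Feature and Annotation HOWTO. Second, some methods return empty values, for example namespace and authority. The reason is that the corresponding information does not yet have a generally accepted format or a settled name; once it does, the authors may rewrite the code. (Translator's note: the authors probably built the structure first, without content to fill it. These methods are of no use for now and can be ignored.) Finally, you may wonder how each piece of sequence information gets matched to a method. In general, since there is no common standard, the code authors use their own judgment to give each piece of information a sensible name and then map it to a method. (Translator's note: the rendering of this last sentence may be imprecise.)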

Now take a FASTA file as input (still the same sequence). The FASTA format, shown below, is very simple compared with GenBank:

>gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
AACCCTAGCACTGCGCCGAAATATGGCATCCGTGGTATCCCGACTCTGCTGCTGTTCAAAAACGGTGAAG
TGGCGGCAACCAAAGTGGGTGCACTGTCTAAAGGTCAGTTGAAAGAGTTCCTCGACGCTAACCTGGCGTA
                        ...deleted...
ACGTGTTTACGTGGCGTTTTGCTTTTATATCTGTAATCTTAATGCCGCGCTGGGCATGTTAGGAAAATTC
CTGGAATTTGCTGGCATGTTATGCAATTTGCATATCAAATGGTTAATTTTTGCACAGGAC

The values returned:

Table 4: Values from the Sequence object (Fasta)

| Method | Returns |
| --- | --- |
| display_id | gi\|147605\|gb\|J01673.1\|ECORHO |
| desc | E.coli rho gene coding for transcription termination factor |
| display_name | gi\|147605\|gb\|J01673.1\|ECORHO |
| accession | unknown |
| primary_id | gi\|147605\|gb\|J01673.1\|ECORHO |
| is_circular | |
| namespace | |
| authority | |
| length | 1880 |
| seq | AACCCT…ACAGGAC |

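Compared with what the GenBank file gave us, some information is missing, for example seq_version. Other methods, such as display_id, return different values. The reason is that the rules the GenBank server follows when converting GenBank format to FASTA differ from the rules the SwissProt server follows when converting SwissProt format to FASTA. As long as there is no single standard, the code authors map each piece of sequence information to a method according to their own understanding. Bioperl could follow one particular convention, such as the one GenBank uses, but the Bioperl authors decided by vote not to follow any conversion rule that comes from a single organization.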
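
To see where the FASTA values come from, the following sketch parses the header shown above from an in-memory string instead of a file. Only the first sequence line is kept, so the length differs from the full record. Splitting the identifier on | is a hand-rolled step, not something Bioperl does for you, and the field order is assumed from this one NCBI-style header.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Header copied from the FASTA record above; sequence truncated to its first line
my $fasta = <<'END_FASTA';
>gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
AACCCTAGCACTGCGCCGAAATATGGCATCCGTGGTATCCCGACTCTGCTGCTGTTCAAAAACGGTGAAG
END_FASTA

# Read from the string through an in-memory filehandle
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my $seqio_obj = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
my $seq_obj   = $seqio_obj->next_seq;

my $display_id = $seq_obj->display_id;   # everything up to the first space in the header
my $desc       = $seq_obj->desc;         # the rest of the header line

# Assumed layout: gi|<gi number>|<database>|<accession.version>|<locus>
my ($tag, $gi, $db, $acc_version, $locus) = split /\|/, $display_id;

print "display_id:        $display_id\n";
print "desc:              $desc\n";
print "accession_number:  ", $seq_obj->accession_number, "\n";
print "accession.version: $acc_version (from manual split)\n";
```

The FASTA parser has no way of knowing which part of the header is the accession, which is why accession_number has nothing useful to report here while the manual split recovers J01673.1.
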

Next, let's look at a SwissProt file as input.

ID   A2S3_RAT       STANDARD;      PRT;   913 AA.
AC   Q8R2H7; Q8R2H6; Q8R4G3;
DT   28-FEB-2003 (Rel. 41, Created)
DE   Amyotrophic lateral sclerosis 2 chromosomal region candidate gene
DE   protein 3 homolog (GABA-A receptor interacting factor-1) (GRIF-1) (O-
DE   GlcNAc transferase-interacting protein of 98 kDa).
GN   ALS2CR3 OR GRIF1 OR OIP98.
OS   Rattus norvegicus (Rat).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
OX   NCBI_TaxID=10116;
RN   [1]
RP   SEQUENCE FROM N.A. (ISOFORMS 1 AND 2), SUBCELLULAR LOCATION, AND
RP   INTERACTION WITH GABA-A RECEPTOR.
RC   TISSUE=Brain;
RX   MEDLINE=22162448; PubMed=12034717;
RA   Beck M., Brickley K., Wilkinson H.L., Sharma S., Smith M.,
RA   Chazot P.L., Pollard S., Stephenson F.A.;
RT   "Identification, molecular cloning, and characterization of a novel
RT   GABAA receptor-associated protein, GRIF-1.";
RL   J. Biol. Chem. 277:30079-30090(2002).
RN   [2]
RP   REVISIONS TO 579 AND 595-596, AND VARIANTS VAL-609 AND PRO-820.
RA   Stephenson F.A.;
RL   Submitted (FEB-2003) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   SEQUENCE FROM N.A. (ISOFORM 3), INTERACTION WITH O-GLCNAC TRANSFERASE,
RP   AND O-GLYCOSYLATION.
RC   STRAIN=Sprague-Dawley; TISSUE=Brain;
RX   MEDLINE=22464403; PubMed=12435728;
RA   Iyer S.P.N., Akimoto Y., Hart G.W.;
RT   "Identification and cloning of a novel family of coiled-coil domain
RT   proteins that interact with O-GlcNAc transferase.";
RL   J. Biol. Chem. 278:5399-5409(2003).
CC   -!- SUBUNIT: Interacts with GABA-A receptor and O-GlcNac transferase.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=3;
CC       Name=1; Synonyms=GRIF-1a;
CC         IsoId=Q8R2H7-1; Sequence=Displayed;
CC       Name=2; Synonyms=GRIF-1b;
CC         IsoId=Q8R2H7-2; Sequence=VSP_003786, VSP_003787;
CC       Name=3;
CC         IsoId=Q8R2H7-3; Sequence=VSP_003788;
CC   -!- PTM: O-glycosylated.
CC   -!- SIMILARITY: TO HUMAN OIP106.
DR   EMBL; AJ288898; CAC81785.2; -.
DR   EMBL; AJ288898; CAC81786.2; -.
DR   EMBL; AF474163; AAL84588.1; -.
DR   GO; GO:0005737; C:cytoplasm; IEP.
DR   GO; GO:0005634; C:nucleus; IDA.
DR   GO; GO:0005886; C:plasma membrane; IEP.
DR   GO; GO:0006357; P:regulation of transcription from Pol II pro...; IDA.
DR   InterPro; IPR006933; HAP1_N.
DR   Pfam; PF04849; HAP1_N; 1.
KW   Coiled coil; Alternative splicing; Polymorphism.
FT   DOMAIN      134    355       COILED COIL (POTENTIAL).
FT   VARSPLIC    653    672       VATSNPGKCLSFTNSTFTFT -> ALVSHHCPVEAVRAVHP
FT                                TRL (in isoform 2).
FT                                /FTId=VSP_003786.
FT   VARSPLIC    673    913       Missing (in isoform 2).
FT                                /FTId=VSP_003787.
FT   VARSPLIC    620    687       VQQPLQLEQKPAPPPPVTGIFLPPMTSAGGPVSVATSNPGK
FT                                CLSFTNSTFTFTTCRILHPSDITQVTP -> GSAASSTGAE
FT                                ACTTPASNGYLPAAHDLSRGTSL (in isoform 3).
FT                                /FTId=VSP_003788.
FT   VARIANT     609    609       E -> V.
FT   VARIANT     820    820       S -> P.
SQ   SEQUENCE   913 AA;  101638 MW;  D0E135DBEC30C28C CRC64;
     MSLSQNAIFK SQTGEENLMS SNHRDSESIT DVCSNEDLPE VELVNLLEEQ LPQYKLRVDS
     LFLYENQDWS QSSHQQQDAS ETLSPVLAEE TFRYMILGTD RVEQMTKTYN DIDMVTHLLA
                             ...deleted...
     GIARVVKTPV PRENGKSREA EMGLQKPDSA VYLNSGGSLL GGLRRNQSLP VMMGSFGAPV
     CTTSPKMGIL KED
//

The corresponding return values are shown in the table below:

Table 5: Values from the Sequence object (Swissprot)

| Method | Returns |
| --- | --- |
| display_id | A2S3_RAT |
| desc | Amyotrophic lateral … protein of 98 kDa). |
| display_name | A2S3_RAT |
| accession | Q8R2H7 |
| is_circular | |
| namespace | |
| authority | |
| seq_version | |
| keywords | Coiled coil; Alternative splicing; Polymorphism |
| length | 913 |
| seq | MSLSQ…ILKED |
| division | RAT |
| get_dates | 28-FEB-2003 (Rel. 41, Created) |
| get_secondary_accessions | Q8R2H6 Q8R4G3 |

As with GenBank, see the Feature and Annotation HOWTO for how to get at the sequence annotation.
