Transcript Document
Functional Annotation
Episode 2: Preliminary Results
The Group
27th Feb 2012
Lavanya Rishishwar
Artika Nath
Lu Wang
Haozheng Tian
Shengyun Peng
Ashwath Kumar
Hamidreza Hassanzadeh
Recap
• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
– Breadth
– Depth
Flowchart
PRELIMINARY RESULTS
Subject Organisms
Strain  Species          Disease State  State Isolated  Hemolysis  Hpd  fuculose kinase
M19107  H. haemolyticus  Asymptomatic   Minnesota       Y          -    -
M19501  H. haemolyticus  Asymptomatic   Minnesota       N          +    -
M21127  H. haemolyticus  Pathogenic     Georgia         Y          -    -
M21621  H. haemolyticus  Pathogenic     Texas           Y          -    -
M21639  H. haemolyticus  Pathogenic     Illinois        N          -    -
M21709  H. influenzae    Pathogenic     NY              N          -    +
fucK: encoding fuculose kinase. fucK deletion has been observed in some Hi isolates.
Hpd: encoding the lipoprotein protein D.
BLAST: Output and Parsing
• Once the results are received from the gene
prediction tools, we BLAST them against
different databases
• The selected threshold: 0.005
• This is done automatically by ad hoc
scripts using the BioPerl library, for all 6 strains
• The results are then processed and
certain metrics extracted for further analysis
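The filtering step above can be sketched as follows. This is only an illustration in Python, not the group's BioPerl scripts. It assumes the BLAST results are in the 12-column tabular format and that the 0.005 threshold applies to the E-value column; the example rows and identifiers are invented.

```python
import csv
import io

# Assumption: the 0.005 threshold from the slide is an E-value cutoff.
EVALUE_THRESHOLD = 0.005

# Hypothetical rows in 12-column tabular BLAST format: query, subject,
# % identity, length, mismatches, gap opens, q.start, q.end, s.start,
# s.end, E-value, bit score.
EXAMPLE_TABULAR = (
    "gene_1\tsp|P00001|EXA_HAEIN\t98.2\t220\t4\t0\t1\t220\t1\t220\t1e-120\t430\n"
    "gene_1\tsp|P00002|EXB_ECOLI\t35.0\t60\t39\t0\t5\t64\t10\t69\t0.8\t30.1\n"
    "gene_2\tsp|P00003|EXC_PASMU\t71.5\t310\t88\t1\t2\t311\t3\t312\t3e-90\t340\n"
)


def filter_hits(handle, threshold=EVALUE_THRESHOLD):
    """Return {query_id: [(subject_id, evalue), ...]} for hits at or below threshold."""
    kept = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue <= threshold:
            kept.setdefault(query, []).append((subject, evalue))
    return kept


hits = filter_hits(io.StringIO(EXAMPLE_TABULAR))
for query, subjects in hits.items():
    print(query, len(subjects), subjects[0][0])
```

Metrics such as the number of unique organisms in the hits could then be derived from the retained subject identifiers.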
BLAST v/s UniProt: Coverage
Organism  # of unique organisms in the hits
M19107    2338
M19501    2332
M21127    2360
M21621    2364
M21639    2433
M21709    2154
BLAST v/s UniProt: M19107
[Chart: genera represented in the BLAST hits for M19107. Labels: Pasteurella, Ralstonia, Lactobacillus, Rickettsia, Brucella, Mus, Coxiella, Legionella, Homo, Arabidopsis, Klebsiella, Actinobacillus, Xylella, Erwinia, Rhizobium, Acinetobacter, Bordetella, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Others, Streptococcus, Shigella, Haemophilus, Bacillus, Vibrio, Staphylococcus, Escherichia, Burkholderia, Salmonella, Yersinia, Shewanella, Pseudomonas]
BLAST v/s UniProt: M21709
[Chart: genera represented in the BLAST hits for M21709. Labels: Listeria, Homo, Coxiella, Legionella, Xylella, Arabidopsis, Rickettsia, Erwinia, Klebsiella, Brucella, Rhizobium, Acinetobacter, Bordetella, Actinobacillus, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Others, Shigella, Vibrio, Burkholderia, Staphylococcus, Haemophilus, Bacillus, Escherichia, Yersinia, Salmonella, Shewanella, Pseudomonas]
CONSERVED DOMAIN DATABASE (CDD)
Introduction
• CDD is a protein annotation resource that consists
of a collection of well-annotated multiple
sequence alignment models for ancient domains
and full-length proteins.
• These are available as position-specific score
matrices (PSSMs) for fast identification of
conserved domains in protein sequences via RPS-BLAST.
• The PSSMs are meant to be used for compiling
RPS-BLAST search databases only.
RPS-BLAST
• Reverse Position-Specific BLAST
• It searches a query sequence against a
database of profiles (the opposite of PSI-BLAST).
• It uses a pre-computed lookup table for the
profiles to allow the search to proceed faster
(architecture dependent).
• The CD-Search databases for RPS-BLAST:
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
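To make the idea of scoring a query against a profile concrete, here is a toy sketch in Python. It is not how RPS-BLAST is implemented: the three-column PSSM, its scores and the default score are all invented for illustration, and a real search uses the pre-computed lookup table mentioned above rather than a plain sliding window.

```python
# Toy position-specific score matrix (PSSM) for a 3-residue motif.
# All scores are invented; real CDD PSSMs score all 20 amino acids per column.
TOY_PSSM = [
    {"G": 5, "A": 1},  # column 1
    {"K": 4, "R": 3},  # column 2
    {"S": 6, "T": 4},  # column 3
]
DEFAULT_SCORE = -2  # hypothetical score for residues not listed in a column


def best_window(sequence, pssm):
    """Slide the profile along the sequence; return (best_score, start_index)."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            column.get(residue, DEFAULT_SCORE)
            for column, residue in zip(pssm, window)
        )
        if best is None or score > best[0]:
            best = (score, start)
    return best  # None when the sequence is shorter than the profile


print(best_window("MAGKSLLT", TOY_PSSM))
```

A database of profiles would repeat this scoring for every PSSM and report the profiles whose best window passes a significance cutoff.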
Strategy
FORMATRPSDB
• Formatrpsdb is a utility that converts a
collection of input sequences into a database
suitable for use with RPS-BLAST.
• Formatrpsdb is designed to perform the work
of formatdb, makemat and copymat
simultaneously, without generating the large
number of intermediate files these utilities
would need to create an RPS-BLAST database.
Build Database
[Figure: formatrpsdb command line with annotated options]
• Title for database file
• Input file containing list of ASN.1 Scoremat filenames
• Create index files for database
• For scoremats that contain only residue frequencies, the
scaling factor to apply when creating PSSMs
• Threshold for extending hits for RPS database
• Base name of output database
RUN RPS-BLAST
Results for CDD: COGs
Organism: M19107
Results for CDD: COGs
Organism: M21709
LipoP
LipoP
• LipoP classifies predicted proteins into 4 classes:
– SpI: signal peptide (cleaved by signal peptidase I)
– SpII: lipoprotein signal peptide (cleaved by signal peptidase II)
– TMH: N-terminal transmembrane helix (not very reliable; it is used to avoid
TMHs being falsely predicted as signal peptides)
– CYT: cytoplasmic (all the rest)
• The classification system in LipoP uses an HMM with four branches, one
each for SpI, SpII, TMH and CYT.
• Protein sets for training and testing were extracted from SWISS-PROT.
• They consisted of lipoproteins, SPase I-cleaved proteins and cytoplasmic
proteins from the two Gram-negative phyla Proteobacteria and
Spirochetes.
• Transmembrane proteins were taken from the phyla Proteobacteria
and Gracilicutes.
Output Example
# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32
# Cut-off=-3
M19107_final_1488  LipoP1.0:Best    SpI      1   1   11.1193
M19107_final_1488  LipoP1.0:Margin  SpI      1   1   11.320213
M19107_final_1488  LipoP1.0:Class   CYT      1   1   -0.200913
M19107_final_1488  LipoP1.0:Class   SpII     1   1   -1.80091
M19107_final_1488  LipoP1.0:Signal  CleavI   31  32  11.119     # PISHA|SDLNQ
M19107_final_1488  LipoP1.0:Signal  CleavI   30  31  -2.18348   # SPISH|ASDLN
M19107_final_1488  LipoP1.0:Signal  CleavII  19  20  -1.80091   # TALFS|CGLLI Pos+2=G
Columns:
1. Sequence ID.
2. Type of prediction. Best means the highest scoring class, Margin gives the difference between the best score
and the second best score, Class gives the score of other classes and Signal lines contain predicted cleavage
sites.
3. Feature type.
4. Location in the sequence. For lines with a class prediction it is always 1. For cleavage sites it is the last amino
acid of the signal peptide, i.e. the residue just before the predicted cleavage site.
5. Location, same as above except that for cleavage sites it is the first amino acid after the cleavage site.
6. Score. For the "Margin" type it is the difference between the best and the second best class score.
7. For the cleavage sites the ±5 context is shown after the #, and for lipoprotein cleavage sites the amino acid in
position +2 is shown (which may determine whether the lipoprotein is attached to the inner or outer
membrane). An aspartic acid (D) in position +2 after the cleavage site of a lipoprotein means that it is
attached to the inner membrane, and most other lipoproteins are attached to the outer membrane (“Testing the
'+2 rule' for lipoprotein sorting in the Escherichia coli cell envelope with a new genetic selection”, Seydel et al
(1999) Molecular Microbiology 34: 810-821)
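A minimal Python sketch of how the summary line and the '+2 rule' described above might be handled in a parsing script. The regular expression is an assumption derived from the single example line shown here, not a full specification of LipoP output, and the function names are hypothetical.

```python
import re

# Pattern inferred from the example summary line; treat it as an assumption.
SUMMARY_RE = re.compile(
    r"^#\s+(?P<seq_id>\S+)\s+(?P<cls>SpII|SpI|TMH|CYT)\s+"
    r"score=(?P<score>-?\d+(?:\.\d+)?)\s+margin=(?P<margin>-?\d+(?:\.\d+)?)"
    r"(?:\s+cleavage=(?P<start>\d+)-(?P<end>\d+))?"
)


def parse_summary(line):
    """Parse a LipoP summary line; return None for any other line."""
    match = SUMMARY_RE.match(line)
    if match is None:
        return None
    cleavage = None
    if match.group("start") is not None:
        cleavage = (int(match.group("start")), int(match.group("end")))
    return {
        "seq_id": match.group("seq_id"),
        "class": match.group("cls"),
        "score": float(match.group("score")),
        "margin": float(match.group("margin")),
        "cleavage": cleavage,
    }


def lipoprotein_membrane(context):
    """Apply the '+2 rule' to a cleavage context such as 'TALFS|CGLLI'."""
    after_cut = context.split("|")[1]
    # after_cut[0] is position +1, after_cut[1] is position +2.
    return "inner" if after_cut[1] == "D" else "outer"


example = "# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32"
print(parse_summary(example))
print(lipoprotein_membrane("TALFS|CGLLI"))
```

Note that "outer" here only reflects the statement that most lipoproteins without D at +2 go to the outer membrane; it is a rule of thumb, not a certainty.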
Results
Species  Strain  SpI  SpII  Inner Membrane Lipoproteins  TMH  CYT   Total
Hh       M19107  164  54    2                            241  1470  1929
Hh       M19501  176  60    3                            228  1293  1757
Hh       M21127  174  67    3                            244  1564  2049
Hh       M21621  178  64    2                            244  1413  1899
Hh       M21639  194  82    4                            267  2072  2615
Hi       M21709  144  53    2                            225  1383  1805
SignalP
Biological background
• Many different types of secretory signals are
found. SignalP focuses on predicting
classical signal peptides, those cleaved by
signal peptidase I (SPase I), which are by far
the most common type.
• In bacteria, the signal peptide targets the protein
directly to the cell membrane.
SignalP
• SignalP 3.0 was found to be the best method among
PrediSi, SPEPlip, Signal-CF, Signal-3L and
Signal-BLAST. (Choo, K., Tan, T. & Ranganathan, S. BMC
Bioinformatics 10, S2 (2009).)
• SignalP 4.0 is reported to perform better still, and hence was
included in our method. (SignalP 4.0: discriminating signal
peptides from transmembrane regions. Thomas Nordahl Petersen, et al.
Nature Methods, 8:785-786, 2011)
SignalP
• SignalP 4.0 is a purely neural network–based
method.
• Two types of networks in SignalP 4.0:
– SignalP-TM networks
– SignalP-noTM networks
• Network selection: if SignalP-TM
predicts four or more positions as being
transmembrane positions, SignalP-TM is used for
the final prediction; otherwise SignalP-noTM is
used.
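The selection rule above amounts to a single comparison. The sketch below is only an illustration in Python; the function and constant names are hypothetical and do not come from SignalP itself.

```python
# Cutoff taken from the rule above: four or more predicted transmembrane positions.
TM_POSITION_CUTOFF = 4


def choose_network(tm_positions_predicted):
    """Return which network's output would be used for the final prediction."""
    if tm_positions_predicted >= TM_POSITION_CUTOFF:
        return "SignalP-TM"
    return "SignalP-noTM"


for count in (0, 3, 4, 12):
    print(count, choose_network(count))
```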
Results from SignalP
Organism  No. Signal Pep  Total Genes  Percentage
M19107    144             1929         7.47%
M19501    150             1757         8.54%
M21127    152             2049         7.42%
M21621    151             1899         7.95%
M21639    178             2615         6.81%
M21709    122             1805         6.76%
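The percentage column is simply the number of predicted signal peptides divided by the total gene count. A short Python check, using the counts reported in the table above:

```python
# (signal peptides, total genes) per strain, as reported in the SignalP results.
SIGNALP_COUNTS = {
    "M19107": (144, 1929),
    "M19501": (150, 1757),
    "M21127": (152, 2049),
    "M21621": (151, 1899),
    "M21639": (178, 2615),
    "M21709": (122, 1805),
}


def percentage(part, total):
    """Format part/total as a percentage with two decimals."""
    return f"{100 * part / total:.2f}%"


signalp_percentages = {
    strain: percentage(peptides, genes)
    for strain, (peptides, genes) in SIGNALP_COUNTS.items()
}
for strain, value in signalp_percentages.items():
    print(strain, value)
```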
Comparison between LipoP and SignalP
• The results obtained from LipoP and SignalP were compared
with the help of a script.
• Both SpI and SpII were taken from LipoP and all the positive
outputs were taken from SignalP.
• They were also analyzed for similar cleavage sites.
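The comparison described above can be sketched with set operations. This is a hedged Python illustration, not the script the group used: it assumes each tool's positive predictions have already been reduced to a mapping from gene ID to predicted cleavage position, and the gene IDs below are invented.

```python
def compare_predictions(lipop, signalp):
    """Compare two {gene_id: cleavage_position} mappings of positive predictions."""
    lipop_ids, signalp_ids = set(lipop), set(signalp)
    common = lipop_ids & signalp_ids
    # A site is consistent when both tools place the cleavage at the same position.
    consistent = {gene for gene in common if lipop[gene] == signalp[gene]}
    return {
        "lipop_unique": len(lipop_ids - signalp_ids),
        "signalp_unique": len(signalp_ids - lipop_ids),
        "common": len(common),
        "consistent_sites": len(consistent),
        "conflicting_sites": len(common - consistent),
    }


# Hypothetical predictions: LipoP (SpI and SpII together) and SignalP positives.
lipop_calls = {"gene_1": 31, "gene_2": 19, "gene_3": 24}
signalp_calls = {"gene_1": 31, "gene_3": 22, "gene_4": 18}
summary = compare_predictions(lipop_calls, signalp_calls)
print(summary)
```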
Comparison table
No. of Genes Predicted to have Signaling Peptides: LipoP Unique, SignalP Unique, Common
No. of Cleavage Sites detected: Consistent Sites, Conflicting Sites
Organism  LipoP Unique  SignalP Unique  Common  Negatives  Total # of Genes  Consistent Sites  Conflicting Sites
M19107    75            1               143     1710       1929              112               31
M19501    86            0               150     1521       1757              115               35
M21127    89            0               152     1808       2049              114               38
M21621    91            0               151     1657       1899              117               34
M21639    100           2               176     2337       2615              126               50
M21709    75            0               122     1608       1805              93                29
[Venn diagrams: LipoP versus SignalP predictions for M19107, M19501, M21127, M21621, M21639 and M21709]
Comparison between LipoP and SignalP
• Bottom line: as the Venn diagrams show,
SignalP provided little new information
compared to LipoP.
Prediction of transmembrane helices in proteins
TMHMM
Organism  No. Transmembrane Helices  Total Genes  Percentage
M19107    392                        1929         20.32%
M19501    385                        1757         21.91%
M21127    417                        2049         20.35%
M21621    413                        1899         21.75%
M21639    464                        2615         17.74%
M21709    361                        1805         20.00%
Member signature databases
Similar coverage in size; Different content
Member Database  Focus/Features
PFAM             divergent domains
PROSITE          functional sites
PRINTS           hierarchical definitions from superfamily to subfamily levels
TIGRFAMs         builds HMMs for functionally equivalent proteins
PIRSF            produces HMMs over the full length of a protein, with protein length restrictions, to gather family members
HAMAP profiles   manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies
PANTHER          builds HMMs based on the divergence of function within families
SUPERFAMILY      uses SCOP structure as a basis for building HMMs
GENE3D           uses the CATH structural superfamilies as a basis for building HMMs
Querying with InterProScan
About
• A wrapper of sequence analysis applications
• Database and output file scanning
• Bulk data processing
• Efficient (parallel) internal architecture
[Diagram: Query Sequence, InterProScan]
Querying with InterProScan
• Input
– Nucleotide* or protein sequences
– Recognized sequence formats: raw, FASTA or
EMBL
– Reformat and translate (if necessary)
*Nucleotide sequences will be translated and scanned in all 6 frames
without any further assumptions
Querying with InterProScan
• Running InterProScan
[Screenshot: InterProScan run at <60 s]
Querying with InterProScan
• Output
– InterProScan makes results available in the following
formats: raw, ebixml, xml, txt, html
• Parse InterProScan Output(BioPerl)
– Bio::SeqIO::interpro
• Interpretation of Output Data(example)
Querying with InterProScan
Key                                 Interpretation
10683_1_ORF1                        the ID of the input sequence.
024307F93E501F2C                    the crc64 (checksum) of the protein sequence (supposed to be unique).
404                                 the length of the sequence (in AA).
HMMPfam                             the analysis method launched.
PF03453                             the database member's entry for this match.
MoeA_N                              the database member's description for the entry.
1                                   the start of the domain match.
163                                 the end of the domain match.
1.49999999999999999E-56             the E-value of the match (reported by the member database method).
T                                   the status of the match (T: true, ?: unknown).
26-Feb-12                           the date of the run.
IPR005110                           the corresponding InterPro entry (if iprlookup requested by the user).
MoeA, N-terminal and linker domain  the description of the InterPro entry.
Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324)  the GO (gene ontology) description for the InterPro entry.
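As an illustration of how one such record could be turned into named fields, here is a small Python sketch. The group's pipeline used BioPerl (Bio::SeqIO::interpro); this is not that code. It assumes the raw format is one tab-separated line per match with the fields in the order listed in the key above, and the field names are hypothetical labels chosen for readability.

```python
# Field order follows the key above; the names themselves are our own labels.
FIELDS = [
    "seq_id", "crc64", "length", "method", "member_acc", "member_desc",
    "start", "end", "evalue", "status", "date",
    "interpro_acc", "interpro_desc", "go",
]


def parse_raw_line(line):
    """Split one tab-separated match line into a dict with typed numeric fields."""
    record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for key in ("length", "start", "end"):
        record[key] = int(record[key])
    record["evalue"] = float(record["evalue"])
    return record


example_line = "\t".join([
    "10683_1_ORF1", "024307F93E501F2C", "404", "HMMPfam", "PF03453", "MoeA_N",
    "1", "163", "1.49999999999999999E-56", "T", "26-Feb-12", "IPR005110",
    "MoeA, N-terminal and linker domain",
    "Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324)",
])
record = parse_raw_line(example_line)
print(record["seq_id"], record["member_acc"], record["start"], record["end"])
```

Lines with fewer fields (for example when no InterPro lookup was requested) simply yield a shorter dict, since `zip` stops at the shorter input.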
Preliminary Results
M19107
Total Searched Protein: 1,769
Match: 1,716
Unmatch: 53
Total Hits: 12,393
[Chart: additional values 1,391, 325 and 378, labels not recoverable]
Next Up
• Major Challenge:
Funneling all the annotation information into
a consolidated GenBank/GFF3 entry.
• Level 2!
Level 2
Operons, Virulence Factors and Metabolic Pathways
VIRULENCE
Likelihood of a pathogen causing disease
H. haemolyticus
• As the name of the species implies, it
is generally hemolytic on blood
agar plates
• The beta-hemolytic phenotype is
routinely used in the clinical setting
to distinguish H. haemolyticus from NTHi
• Nonhemolytic H. haemolyticus
strains are being isolated and
misidentified as NTHi
• Gene(s) encoding hemolysin: unknown
(Xin Wang, Meningitis Laboratory, CDC)
Photograph from MicrobeLibrary.org
Virulence factors
• Refers to the traits, encoded by virulence genes, that equip pathogenic
microbes to cause infection. How?
– Attach selectively to host tissues
– Colonize parts of the host body
– Gain access to nutrients by invading or destroying host tissues
– Avoid host defenses
• Virulence factors include:
– Bacterial toxins
– Cell surface proteins that mediate bacterial attachment
– Cell surface carbohydrates and proteins that protect a bacterium
– Hydrolytic enzymes that may contribute to the pathogenicity of the bacterium
VFDB: Virulence factor Database
• Set up in 2004
• Up-to-date information on validated VFs from 24 genera of
medically important bacterial pathogens
• Detailed tabular comparison of virulence composition in terms of
virulence genes
• Multiple alignment and statistical analysis of homologous VFs
• Graphical comparison of virulence genes
• VFs
– Adhesion & invasion
– Bacterial secretion systems & effectors
– Toxins
– Iron-acquisition system
• Pathogenicity island
Operon and Pathway Analysis
• As was pointed out by Alejandro Caro, usually
a missing gene in an otherwise complete
pathway reflects a hole in the annotation
process.
• This path serves to fill such holes in the
annotation process.
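The idea of spotting a hole, a pathway that is complete except for one gene, can be sketched in a few lines of Python. This is a toy illustration only and not the method used by any specific tool; the pathway names, gene names and the `max_missing` parameter are all hypothetical.

```python
def find_pathway_holes(pathways, annotated, max_missing=1):
    """Return {pathway: [missing genes]} for nearly complete pathways."""
    holes = {}
    for name, required in pathways.items():
        missing = set(required) - set(annotated)
        # Ignore complete pathways and pathways missing too many genes.
        if 0 < len(missing) <= max_missing:
            holes[name] = sorted(missing)
    return holes


# Hypothetical pathway definitions and annotation result.
toy_pathways = {
    "pathway_A": {"geneA1", "geneA2", "geneA3"},
    "pathway_B": {"geneB1", "geneB2"},
    "pathway_C": {"geneC1", "geneC2", "geneC3"},
}
toy_annotation = {"geneA1", "geneA3", "geneB1", "geneB2", "geneC1"}
holes = find_pathway_holes(toy_pathways, toy_annotation)
print(holes)
```

Each reported gene is a candidate to search for again among unannotated or weakly annotated predictions.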
DOOR (Database of prOkaryotic OpeRons)
• DOOR (Database of prOkaryotic OpeRons) is an operon
database developed by the Computational Systems Biology Lab
(CSBL) at UGA. The operons in the database are based on
prediction.
• As of 2009, DOOR was the largest operon database available.
• Its prediction algorithm was reported to be consistently the best
in all aspects, including sensitivity and specificity, with an
overall accuracy of ~90%.
• Currently DOOR has operons for 971 prokaryotic genomes.
• Although most operons in DOOR are not verified by
experiments, the developers also provide some limited
literature information, which is extracted from ODB.
FOUR STRAINS IN DOOR
Strategy
THE PATHWAY TOOLS
A Glance at the End of Annotation
Enable
• Browsing of Annotated Genes
• Analysis of pathways
Database
"Do not use a DBMS when the initial investment in
hardware, software, and training is too high."
- Shamkant Navathe,
Georgia Institute of Technology
The Pathway Tools
"Pathway Tools is a production-quality software
environment for creating a type of model-organism
database called a Pathway/Genome Database (PGDB)"
The Pathway Tools
• Prediction
– Metabolic pathways
– Metabolic pathway hole filler
– Operons
• Curating
• PGDB web service
– Publish PGDB
– Query
– Visualization
• Metabolic Network Analysis
WHY “The Pathway Tools”?
• Pros
– BioCyc Tier 1 and Tier 2 databases are highly
curated
– Enables editing (curation) and querying of PGDBs
• Cons
– BioCyc has fewer
genomes than other databases
– Some tools are only available
in the local version (e.g. PathoLogic)
PathoLogic
The Pathway Tools Local Version (GUI)
PathoLogic
Inputs and outputs of the computational inference modules
within PathoLogic