Powerpoint File

Download Report

Transcript Powerpoint File

Genome analysis and annotation
Part II
Modeling a gene
S.mansoniView
PASA assemblies
Evidence
S. japonicum EST alignments
Genewise alignments(predictions)
nr Protein Alignments
Caenorhabditis sp. Protein Alignments
Brugia malayi Protein Alignments
TIGR
THE INSTITUTE FOR GENOMIC RESEARCH
Attributes of individual annotated genes
Sequence Database Hits
Top: Protein matches
Bottom: EST matches
Not shown graphically: gene name, nucleotide and protein sequence, MW, pI, organellar
targeting sequence, membrane spanning regions, other domains.
Gene
Predictions
Annotated Gene
Top: editing panel
Bottom: final curation
Splice site predictions:
red: acceptor sites
blue: donor sites
Screenshot of a component within Neomorphic’s annotation station: www.neomorphic.com
Assigning function to predicted gene products
Assigning function to predicted gene products
The primary tool for assigning function is homology to well characterized
proteins
E.coli
H. influenzae
E.coli
H. influenzae
H. influenzae
H. influenzae
M. genitalium
M. genitalium
…however transitive annotation can lead to errors that propagate.
The modular nature of proteins can provide the
basis for functional annotation
•
Proteins may share features that give clues to their structure and/or
function
•
A domain is a region of a protein that can adopt a particular threedimensional structure. Together a group of proteins that share a
domain is called a family. There are several databases of protein
families such as Pfam (http://www.sanger.ac.uk/Software/Pfam/)
•
Motifs are short, conserved regions of proteins, typically consisting
of a pattern of amino acids that characterizes a prrotein family
(http://www.expasy.org/prosite/)
EF-hand: D-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)- [DE]-[LIVMFYW]
3)
HMM domains can also be defined and used to group proteins into
families
Protein domain frequencies can yield insights into the
biology of an organism
Top 20 PFAM domains in A. fumigatus
Counts in A. nidulans and A. oryzae
Afu
Ana
Aoa
PF00400
WD domain, G-beta repeat
532
598
541
PF00023
Ankyrin repeat
368
633
430
PF00083
major facilitator superfamily protein
166
281
219
PF00172
Fungal Zn(2)-Cys(6) binuclear cluster domain
146
179
211
PF00515
TPR Domain
139
142
152
PF00096
Zinc finger, C2H2 type
124
113
142
PF04082
Fungal specific transcription factor domain
110
163
159
PF00153
Mitochondrial carrier protein
106
114
100
PF00069
Protein kinase domain
105
105
101
PF00005
ABC transporter
93
129
86
PF00076
RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain)
93
93
99
PF00106
oxidoreductase, short chain dehydrogenase/reductase family
92
135
129
PF00271
Helicase conserved C-terminal domain
73
80
69
PF00067
Cytochrome P450
63
134
102
PF00107
oxidoreductase, zinc-binding dehydrogenase family
61
107
80
PF00501
AMP-binding enzyme
61
77
83
PF00560
Leucine Rich Repeat
47
50
54
PF00550
Phosphopantetheine attachment site
46
54
60
PF00036
EF hand
10
52
50
Domain based Paralogous Families can be genrated
Domain Content of Entire Proteome can be computed
All the proteins from a genome
HMM search against Pfam profiles
Alignment search against homology-based domain alignments
The search results are stored in the database in
the form of domain-based alignments
Organize the proteins into domain-based
paralogous families
•Related families share one or more domains with other families
•Many putative novel domains are extensions of existing domains
Hidden Markov Models (HMMs)
Statistical representations of sequence patterns.
Seed:
Model:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
A query sequence is
scored by how likely
is it that the HMM
would produce it.
Procedure for Preparing a HMM Seed

Inspect and edit a pairwise aligned group of gene products:
- Eliminate fragments
- Correct the alignment
- Remove sequence outside domain
- Eliminate redundancy
- BLAST, annotate and possibly expand the seed.
Homology-Based Alignment:
HMM Seed:
Trusted Hits:
TIGR
THE INSTITUTE FOR GENOMIC RESEARCH
What is Gene Ontology (GO)?
The Gene Ontology is a set of dynamic controlled vocabularies used to
describe gene products in terms of their associated biological processes,
cellular components and molecular functions in a species-independent
manner (www.geneontology.org)
The Three Ontologies
Molecular function, biological process and cellular
component are considered attributes of gene products.
Biological Process (a)
A biological objective
has more than one distinct step
 Molecular Function (b)
what the gene product does
Think ‘activity’
 Cellular Component (c)
location in the cell (or smaller unit)
or part of a complex

Assigning GO IDs
Each GO ID is qualified with an evidence code.
Evidence codes are:
IMP – inferred from mutant phenotype
IGI—inferred from genetic interaction
IPI—inferred from physical interaction
IDA—inferred from direct assay
IEP—inferred from expression pattern
ISS—inferred from structural similarity
IEA—inferred from electronic annotation
IC—inferred by curator
TAS—traceable author statement
NAS—non-traceable author statement
ND—no biological data available
NR—no longer used
•
•
•
•
Experimental evidence
Sequence similarity
Calculated by algorithm
Author statement
The “with/to” field
ISS, IPI, IGI require the
accession of the similarity
hit, the interacting entity
Gene ontologies can help interpret large scale datasets
K-means clustering using TIGR Multi-Experiment Viewer (TMEV)
Cluster 4
Cluster 10