GO annotations for ordered lists of genes

GOSt
a Gene Ontology mining tool
Jüri Reimand
Overview
• Introduction, bioinformatics
• Gene Ontology (GO)
• GOSt, a Gene Ontology mining tool
• Statistics and thresholds
• Ordered gene lists
• Extending GO
Introduction
• Bioinformatics
– Analysis of experimental data
• Genes encode proteins
– Proteins: building blocks of living organisms
– Gene expression: protein production from genetic code
• Microarray experiments measure gene expression
– Thousands of genes simultaneously
– Expression levels over time
– Different biological conditions
– Comparison of healthy and diseased cells
[Figure labels: measures over time; cluster similar profiles]
Introduction
• Biological experiments give large amounts of data
• Groups of similar genes:
– top “most active” genes
– similar expression profiles over time
• Many genes have some available annotations
– Previous knowledge from databases
– e.g. “steroid metabolism”, “biosynthesis”, “iron ion binding”
• How to describe the group as a whole?
– What are the common features?
– Which features are significantly overrepresented?
Gene Ontology (GO)
• GO - Directed Acyclic Graph (DAG)
– Vertices: terms
– Edges: relations between general and specific terms
• Hierarchically structured vocabulary
– 3 DAGs: processes, components, functions
• Annotations to vocabulary terms
– Association between a gene g and a property t (GO term t)
– Based on biological discoveries
– Genes of many genomes are annotated to GO
• Annotation sets: for a fixed organism
– All genes associated with GO term t
GO example
• Graph fragment with some terms related to organ development
• Vocabulary is general to living organisms
• Gene annotations are organism-specific
• True Path Rule: hierarchical annotations
[Figure: GO graph fragment; example genes ENSG00000163217, ENSG00000161202]
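The True Path Rule can be illustrated with a minimal sketch: a gene annotated to a specific term is also counted in the annotation set of every more general term. The function name, the toy term graph and the gene names below are hypothetical and are not taken from GOSt.

```python
from collections import defaultdict

def propagate_annotations(parents, direct):
    """Build annotation sets under the True Path Rule.

    parents: dict mapping a term to its more general (parent) terms
    direct:  dict mapping a gene to the terms it is directly annotated to
    Returns a dict mapping each term to the set of genes annotated to it,
    directly or through a more specific term.
    """
    def ancestors(term, seen):
        # Walk the DAG upwards, visiting every ancestor once.
        for parent in parents.get(term, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    annotation_sets = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in {term} | ancestors(term, set()):
                annotation_sets[t].add(gene)
    return dict(annotation_sets)

# Hypothetical toy fragment of a term graph and two made-up genes.
parents = {
    "heart development": ["organ development"],
    "kidney development": ["organ development"],
    "organ development": ["development"],
}
direct = {
    "geneA": {"heart development"},
    "geneB": {"kidney development"},
}
sets = propagate_annotations(parents, direct)
```

In this toy example both genes end up in the annotation sets of "organ development" and "development", while each specific term keeps only its own gene.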
GOSt – Gene Ontology Statistics
• GO annotations to groups of genes
• Statistical significance of results
• Thresholds for distinguishing significant results
• Analysing ordered lists of genes
• Visualisation methods, WWW interface
• Command line toolset for large-scale analysis
GOSt example
[Figure: GOSt output for a query of 45 mouse genes. Labels: 338 GO; GO terms; p-value; genes; evidence codes]
Annotations to gene groups
[Figure: query gene set Gq and GO term gene set Gt, e.g. heart development]
• Result: term t matches query Q
Statistical significance
• Is the intersection Q ∩ T significant?
• Fisher's one-tailed test
– Cumulative hypergeometric probability
– Probability of getting the observed number of genes or more in the intersection Q ∩ T
– P ( pick k white balls out of K white and N-K black balls )
• Multiple testing
– Every query results in a number of p-values
– Matching GO terms are not independent
– Increased rate of false positive matches
• Which p-values are significant?
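The cumulative hypergeometric probability behind the one-tailed test can be sketched as follows. This is a minimal illustration using only the standard library; the function and parameter names are assumptions made for the example, not GOSt's interface.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric variable X.

    N: number of genes in the organism (all balls)
    K: genes annotated to the term (white balls)
    n: size of the query (balls drawn without replacement)
    k: observed size of the intersection (white balls drawn)
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of drawing k, k+1, ..., upper white balls.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Toy numbers: 10 genes, 4 annotated to the term, query of 3 genes, 2 matches.
p = hypergeom_tail(2, 3, 4, 10)
```

With these toy numbers the sum is (6·6 + 4·1) / 120, so p equals 1/3. Exact integer arithmetic is used before the final division, which avoids rounding problems for small examples but is not tuned for genome-scale speed.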
Experimental thresholds
• Simulation experiment
– Fix some gene query size k
– Repeat 1000 times:
• Generate synthetic query Q with k elements: random subset of organism's genes
• Observe best p-value p for query Q
• Store p-value p in P
– Choose p', the 50th smallest p-value from P
– Threshold p': top 5% of p-values for random queries of size k
• Calculate for query lengths k = [1,1000]
• Compare with standard multiple testing corrections
– Bonferroni (1936), Benjamini-Hochberg (1995)
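The simulation procedure above can be sketched in a few lines. This is an illustrative reading of the slide, not the GOSt implementation: the function names, the toy gene identifiers and the random annotation sets are all assumptions made for the example.

```python
import random
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): n draws from N genes, K of which are annotated."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

def best_p_value(query, annotation_sets, n_genes):
    """Smallest p-value over all annotation sets for one query."""
    best = 1.0
    for genes in annotation_sets.values():
        overlap = len(query & genes)
        if overlap:
            best = min(best, hypergeom_tail(overlap, len(query), len(genes), n_genes))
    return best

def simulated_threshold(all_genes, annotation_sets, k,
                        repeats=1000, alpha=0.05, seed=0):
    """Threshold p' for random queries of size k."""
    rng = random.Random(seed)
    best = []
    for _ in range(repeats):
        query = set(rng.sample(all_genes, k))  # synthetic random query
        best.append(best_p_value(query, annotation_sets, len(all_genes)))
    best.sort()
    # With repeats=1000 and alpha=0.05 this picks the 50th smallest value.
    index = max(round(repeats * alpha) - 1, 0)
    return best[index]

# Hypothetical toy data: 200 genes and 20 random annotation sets.
rng = random.Random(1)
all_genes = [f"g{i}" for i in range(200)]
annotation_sets = {
    f"term{j}": set(rng.sample(all_genes, rng.randint(5, 40)))
    for j in range(20)
}
threshold = simulated_threshold(all_genes, annotation_sets, k=10, repeats=200)
```

A query of size k would then be called significant only if its best p-value falls below the threshold computed for that k. The fixed seed is a design choice for reproducibility of the sketch.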
Analytical thresholds
• Analytical approach to simulated thresholds
– Fix gene query size k
– Observe all sizes and frequencies of GO annotation sets T
– Presume events with different T are independent
– Observe possible p-values p with a query of k elements
– Always correct p by constant c=0.97 (set dependencies!)
– Find the threshold p' that gives p ~= 0.95
• Repeat for query lengths k = [1,1000]
Significance thresholds
Ordered lists of genes
• Gene groups may be ordered
– Interesting gene and few most similar genes
– Top “most active” genes
– Increasing distance from cluster centre
• Top of the list, but how many?
– Compare list with GO term
– Which portion gives best p-value?
– Peak significance of ordered query
GOSt algorithms
• Unordered query
– Intersections with all annotation sets T
• Exhaustive algorithm for ordered queries:
– intersections with all Qi and annotation sets T
• Approximate algorithm for ordered queries:
– for every annotation set T, view only list portions that give local p-value extremes
• local best p: list ends with matching gene
• local worst p: list ends just before matching gene
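The two strategies for ordered queries can be compared in a small sketch for a single annotation set T. The exhaustive version scores every prefix Qi of the list; the second version scores only prefixes that end with a matching gene (the local best points). Function names and the toy data are hypothetical, and the sketch does not reproduce the actual GOSt code.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): n draws from N genes, K of which are annotated."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

def peak_exhaustive(ordered, term_genes, n_genes):
    """Score every prefix of the ordered list; return (best p, prefix length)."""
    best_p, best_len, hits = 1.0, 0, 0
    for i, gene in enumerate(ordered, start=1):
        hits += gene in term_genes
        p = hypergeom_tail(hits, i, len(term_genes), n_genes)
        if p < best_p:
            best_p, best_len = p, i
    return best_p, best_len

def peak_local_best(ordered, term_genes, n_genes):
    """Score only prefixes that end with a gene annotated to the term."""
    best_p, best_len, hits = 1.0, 0, 0
    for i, gene in enumerate(ordered, start=1):
        if gene not in term_genes:
            continue  # a non-matching gene at the end cannot improve p
        hits += 1
        p = hypergeom_tail(hits, i, len(term_genes), n_genes)
        if p < best_p:
            best_p, best_len = p, i
    return best_p, best_len

# Hypothetical toy data: 100 genes in total, 4 annotated to the term.
term_genes = {"g1", "g2", "g3", "g7"}
ordered = ["g1", "g2", "g3", "x1", "x2", "x3", "x4", "x5", "x6", "g7"]
exhaustive = peak_exhaustive(ordered, term_genes, 100)
local_best = peak_local_best(ordered, term_genes, 100)
```

In this toy list both functions report the peak at the first three genes, since the late fourth match does not compensate for the six non-matching genes before it. The second version calls the p-value routine once per matching gene rather than once per list position, which is where the time saving in this sketch comes from.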
Example: Ordered list analysis
[Figure: p-value plotted against query length; peak significance at an ordered list of 28 genes]
[Figure: list of genes, and matches for “Biosynthesis of steroids”. Labels: ordered list query; GO categories; p-value; genes; evidence codes]
Algorithm speed comparison
24 sec
2.8 sec
GOSt features
• Command line interface (C/C++ and Perl)
• Graphical user interface on the web: http://bioinf.ebc.ee/GOST
– SWOG (Graphics language, Jaanus Hansen 2005)
• Data for multiple organisms
– yeast, chicken, cow, mouse, rat, human...
• Wrappers for parallel applications (GRID, MPI)
• Pipelines for gene expression data analysis
Extending GO ( i )
• Pathway – a network of interacting genes and proteins
– metabolism pathways, disease pathways, ...
• Include pathway data to GO vocabulary
– KEGG Pathway database
– pathways as vocabulary terms
– related genes as annotations to terms
• KEGG terms independent of GO vocabulary
GO
GO:0003674 molecular_function
GO:0005575 cellular_component
GO:0008150 biological_process
KEGG:00000 KEGG pathways
KEGG:05010 - Alzheimer's disease
Extending GO ( ii )
• Gene expression is started by transcription factors (TF)
• TFs bind to certain patterns in DNA
– Transcription Factor Binding Sites (TFBS)
– Often found in regions close to gene (1k bp)
• Include TFBS data from TRANSFAC
– Patterns (putative TFBS) as vocabulary terms
– annotations to genes near patterns
[Figure labels: transcription factor; TF binding site; gene]
ATATAATAAAGATGAGGCGAATATAAATATACCGGCCCTTAGCGCGAAGCAATTCATCATATAAGCGAGAGAGGCCAATATGCAATCTTCGACAGCAT
TRANSFAC motifs
• Motifs added in a hierarchy
– according to PWM score
– 5 levels:
• near_threshold
• ...
• near_MAX_score
• Work in progress
– Hedi Peterson
[Figure: motif terms by depth in hierarchy]
TF:M00431_4, TF:M00431_3, TF:M00431_2, TF:M00431_1, TF:M00431_0
TF:M00328_4, TF:M00328_3, TF:M00328_2
TF:M00000
TTTSGCGS:4, TTTSGCGS:3, TTTSGCGS:2, TTTSGCGS:1, TTTSGCGS:0
NCNNTNNTGCRTGANNNN:4, NCNNTNNTGCRTGANNNN:3, NCNNTNNTGCRTGANNNN:2
Summary
• We investigated means of finding GO annotations for groups of genes, and statistical methods for determining the significance of results.
• We combined the GO vocabulary with various types of biological data, such as KEGG pathways and TRANSFAC regulatory elements.
• We proposed analytical thresholds for distinguishing significant results from structured and partly dependent GO annotations, and verified the thresholds with simulation experiments.
• We proposed a novel concept of analyzing GO annotations for ordered lists of genes, and implemented fast algorithms for the purpose.
• The practical result of our work is GOSt, a GO mining tool. The command line interface is suitable for large-scale automatic analysis, while the graphical web interface enables highly visualized and interactive analysis.
Sneak preview
• GO analysis of hierarchical clustering tree
– Cluster genes according to expression similarity and ...
– ... “wrap up” nodes that show no significant annotations in GO
• Work in progress
– Meelis Kull
– Darja Krushevskaja
Acknowledgments
Jaak Vilo
BIIT group
Hedi Peterson
Meelis Kull
Jaanus Hansen
Priit Adler
Ilja Livenson
Raivo Kolde
Konstantin Tretjakov
Pavlos Pavlidis
Asko Tiidumaa
Darja Krushevskaja
FunGenES Consortium
