Data Mining - functional statistical genetics/bioinformatics
Laura Kelly Vaughan, Ph.D.
Assistant Professor
Department of Biostatistics
Section on Statistical Genetics
[email protected]
Data Mining: Functional Statistical Genetics & Bioinformatics
NCBI (National Center for Biotechnology Information)
Bioinformatics is the field of science in which biology,
computer science, and information technology merge
into a single discipline. The ultimate goal of the field is
to enable the discovery of new biological insights and
to create a global perspective from which unifying
principles in biology can be discerned.
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
Integrative Data Analysis
Genetic studies tend to focus on one data
source
Genetic variation
RNA levels
Blood biochemistry
This fails to utilize the information contained
in the connections among these variables…
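Central Dogma of Molecular Biology
[Figure: DNA (replication) → transcription (TXN) → RNA → translation (TSN) → protein, with post-translational modification (PTM) → phenotype. Fields labeled along the flow: genetics, structural genomics, functional genomics (transcriptomics), proteomics, metabolomics, phenomics.]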
Different sources of annotation data
Gene Ontology
Pathways/Networks
Protein/protein
interactions
Literature
Functional annotations
Expression
Cross species
Cellular localization
Methylation
ChIP
Sequence similarity
Promoter & Regulatory
Network
Protein domains
Gene Ontology
www.geneontology.org
The GO project has developed three structured
controlled vocabularies (ontologies) that describe
gene products in a species-independent manner.
biological processes- series of events accomplished by
one or more ordered assemblies of molecular functions
cellular components- parts of the cell
molecular functions- activities, such as catalytic or
binding activities, that occur at the molecular level
Example of a GO annotation
http://www.yeastgenome.org/help/images/cytokinesisDAGrels.jpg
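Because GO terms form a directed acyclic graph, a gene annotated to a specific term is implicitly annotated to every ancestor of that term. The sketch below illustrates that idea only; the four-term ontology and the `PARENTS` table are hypothetical simplifications, not real GO content.

```python
# Toy ontology: each term maps to its direct parents (hypothetical, simplified).
PARENTS = {
    "cytokinesis": ["cell division"],
    "cell division": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand gene -> direct terms into gene -> direct terms plus all ancestors."""
    return {
        gene: set(terms).union(*(ancestors(t, parents) for t in terms))
        for gene, terms in annotations.items()
    }
```

Enrichment tools typically work on the propagated annotations, so a broad term such as "cellular process" collects every gene annotated anywhere beneath it.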
What is a Pathway?
Physical and functional interactions between
genes and gene products
Metabolic pathways
Kinase based signaling cascades
Transcriptional signaling pathways
TNF Signaling
[Figure: TNF signaling pathway diagram. Components shown: TNFa and TNF a/b; receptors TNFR1 and TNFR2; adaptors SODD, TRADD, I-TRAF, TRAF2, RAIDD, MADD; ceramides; kinases MEKKs, TAK1, NIK, JNKK1, IKKs, p38, JNK1, ERKs (phosphorylation steps marked P); Caspase1, Caspase8, Caspase9, Caspases 3,6,7; BID/tBID, CytoC, APAF1; IkBs (degradation) and NF-kB; transcription factors ATFs, c-Jun, c-Fos, Elk1. Outcomes: apoptosis, or gene expression and cell survival. © 2007-2009 SABiosciences.com]
What is a Network?
Graphical representation of relationships between
genes, gene products, or other objects
Formed with information such as
Genes in interacting pathways
Gene products that share protein-protein interactions
Gene products with protein-nucleotide relationships
Regulatory relationships
Metabolic interactions
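As a minimal sketch of the idea, a network can be stored as an adjacency map built from pairwise relationships. The protein names are borrowed from the TNF signaling slide, but the specific pairs in `EDGES` are assumptions chosen for demonstration, not a curated interaction set.

```python
from collections import defaultdict

# Illustrative undirected edges; the pairs are assumptions for demonstration only.
EDGES = [
    ("TNFR1", "TRADD"),
    ("TRADD", "TRAF2"),
    ("TRAF2", "NIK"),
    ("NIK", "IKKs"),
    ("TRADD", "RAIDD"),
]


def build_network(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree(network):
    """Number of direct partners for each node."""
    return {node: len(neighbours) for node, neighbours in network.items()}


network = build_network(EDGES)
# Nodes sorted from most to least connected (ties broken alphabetically).
hubs = sorted(degree(network).items(), key=lambda kv: (-kv[1], kv[0]))
```

Sorting by degree is one simple way to spot highly connected nodes; edges from other sources (regulatory, metabolic) could be added to the same list.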
Metabolic Disease Network
©2008 by National Academy of Sciences
Lee D. et al. PNAS 2008;105:9880-9885
Analysis tools
Numerous methods have been developed to
aid in the interpretation of biological
experiments
2 basic categories
Pre-analysis methods where the raw data is
grouped together & the groups are tested
Dimension reduction
Post-analysis methods where significant or
interesting results are grouped together to
identify trends
Before you start…
There are many methods available for
integrative data analysis
Before you choose one, you must properly
define the questions you are trying to
answer…
What is your hypothesis?
Methods
Unsupervised, or data based methods
Utilizes all the data to identify trends
Hypothesis generating
Supervised, or prior information based
Requires the user to provide a ‘training set’ of
genes
Hypothesis testing
Gene Set Analysis
A test statistic is calculated to measure the
deviation of gene-set expression
measurements from the null hypothesis of no
association with the phenotype
The statistical significance (P-value) for each
gene set is calculated by permuting
the samples
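The two steps above can be sketched as follows. This is a minimal illustration, not any particular published method: the gene-set statistic (mean absolute difference in group means) and the function names are assumptions, and the phenotype is assumed to be coded 0/1.

```python
import random
from statistics import mean


def gene_stat(values, labels):
    """Difference in mean expression between phenotype 1 and phenotype 0."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    return mean(g1) - mean(g0)


def set_stat(expr, gene_set, labels):
    """Gene-set statistic (an assumed choice): mean absolute per-gene difference."""
    return mean(abs(gene_stat(expr[g], labels)) for g in gene_set)


def permutation_p(expr, gene_set, labels, n_perm=1000, seed=0):
    """Empirical P-value obtained by shuffling the sample labels."""
    rng = random.Random(seed)
    observed = set_stat(expr, gene_set, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_stat(expr, gene_set, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: expression per gene across six samples (three controls, three cases).
expr = {
    "g1": [1.0, 1.0, 1.0, 3.0, 3.0, 3.0],
    "g2": [2.0, 2.0, 2.0, 5.0, 5.0, 5.0],
}
labels = [0, 0, 0, 1, 1, 1]
observed, p_value = permutation_p(expr, ["g1", "g2"], labels)
```

Permuting samples rather than genes keeps the correlation between genes in the set intact under the null.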
Types of enrichment methods
Class 1- Singular enrichment (SEA)
P-value calculated on each term from pre-selected list &
enrichment terms are listed
Class 2- Gene set enrichment (GSEA)
All genes (without pre-selection) are included
No need to select list
Experimental values integrated into P-value calculations
Pairwise comparisons (e.g., disease vs. control)
Most appropriate for expression data
Class 3- Modular enrichment (MEA)
Predetermined list, with term-term or gene-gene
relationships included in enrichment P-value calculation
Closest to nature of biological data structure
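For Class 1 (SEA), the per-term P-value is commonly a hypergeometric (one-sided Fisher) tail probability. The sketch below assumes that choice; the function and parameter names are illustrative and not taken from any specific tool.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_term, n_list, n_overlap):
    """P(X >= n_overlap) when n_list genes are drawn without replacement
    from n_universe genes, of which n_term carry the annotation term."""
    upper = min(n_term, n_list)
    total = comb(n_universe, n_list)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes measured, 5 annotated to the term,
# 4 genes in the pre-selected list, 3 of which carry the term.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
```

Because one such test is run per term, the resulting P-values still need a multiple-testing adjustment before terms are reported as enriched.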
DAVID
Provides a comprehensive set of functional
annotation tools for investigators to
understand the biological meaning behind large
lists of genes
Extensive annotation database
Includes both pathways and GO
SEA and MEA algorithms
Visualization tools
http://david.abcc.ncifcrf.gov/
DAVID and LVH gene expression
GO clustering of significant
genes between different
mouse treatment groups
Stansfield et al 2009 Cardiopulmonary Support and Physiology
Babelomics Suite
Suite of web tools for the functional profiling
of genome scale experiments
Multiple annotation sources
Pathways, GO, regulation, text mining,
interactions
Allows for functional enrichment
Several gene set methods
Mostly SEA methods
http://babelomics.bioinfo.cipf.es/
Babelomics and thyroid carcinoma
Identified 1031 genes with
differential expression
Enriched pathways
included
MAPkinase
TGF-B
Focal adhesion
Cell motility
Activation of actin
polymerization
Cell cycle
Identified 30 genes that
predict prognosis with 95%
accuracy
Montero-Conde et al 2008 Oncogene
GSEA
Computational method that determines
whether an a priori defined set of genes
shows statistically significant, concordant
differences between two biological states
(e.g. phenotypes).
http://www.broad.mit.edu/gsea/
GSEA: Steps in the Methodology
Define a Gene Set from prior knowledge
Order the genes by correlation with phenotype
Estimate the gene set’s Enrichment Score
Assess Statistical Significance using permutation tests
Adjust for multiple hypothesis testing
Subramanian et al., PNAS, 2005
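The enrichment score step can be sketched as a weighted running sum over the ranked gene list, which rises at gene-set members and falls elsewhere. This is a simplified illustration; the function name, the `weight` parameter, and the toy data are assumptions, and significance would still come from the permutation step.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score over a ranked gene list.

    ranked_genes: genes ordered from most positively to most negatively
                  correlated with the phenotype.
    scores:       dict gene -> correlation with the phenotype.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(gene_set)
    hit_total = sum(abs(scores[g]) ** weight for g in ranked_genes if g in members)
    n_miss = sum(1 for g in ranked_genes if g not in members)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running, best = 0.0, 0.0
    for g in ranked_genes:
        if g in members:
            running += abs(scores[g]) ** weight / hit_total  # step up at a hit
        else:
            running -= 1.0 / n_miss  # step down at a miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy example: four genes ranked by correlation with the phenotype.
scores = {"a": 2.0, "b": 1.0, "c": -1.0, "d": -2.0}
ranked = sorted(scores, key=scores.get, reverse=True)
es_top = enrichment_score(ranked, scores, {"a", "b"})
es_bottom = enrichment_score(ranked, scores, {"c", "d"})
```

A set concentrated at the top of the list gives a positive score and one concentrated at the bottom a negative score.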
Biological pathways involved in
chemotherapy response in breast cancer
GSEA for ER+
breast cancer
tumors
chemotherapy
responders and
non-responders
Of >850 gene
sets, 4 were
significant
Tordai et al 2008 Breast Cancer Research
Significance Analysis of Function and
Expression (SAFE)
Generalization and extension of GSEA method
Two-stage permutation-based approach to assess significant
changes in gene expression across experimental
conditions
First computes gene-specific local statistics to test for
association between gene expression and the phenotype.
Gene-specific statistics are then used to estimate a global
statistic that detects shifts in the local statistics within a
gene category.
The significance of the global statistic is assessed by
repeatedly permuting the response values.
SAFE implements a rank-based global statistic that
makes better use of marginally significant genes than
statistics based on a p-value cutoff.
http://www.bioconductor.org/packages/bioc/1.6/src/contrib/html/safe.html
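The two-stage idea can be sketched in plain Python. This is not the Bioconductor package's code: the Welch t-statistic as the local statistic, the rank sum of absolute local statistics as the global statistic, and all function names are assumptions made for illustration.

```python
import random
from statistics import mean, variance


def welch_t(values, labels):
    """Local statistic: Welch t comparing phenotype 1 with phenotype 0."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    se = (variance(g1) / len(g1) + variance(g0) / len(g0)) ** 0.5
    return (mean(g1) - mean(g0)) / se if se > 0 else 0.0


def rank_sum(local, category):
    """Global statistic: sum of the ranks of |local statistic| for category genes."""
    ordered = sorted(local, key=lambda g: abs(local[g]))
    ranks = {g: i + 1 for i, g in enumerate(ordered)}  # ties keep input order
    return sum(ranks[g] for g in category)


def safe_p(expr, category, labels, n_perm=500, seed=0):
    """Permute the response, recompute both stages, and return (observed, P-value)."""
    rng = random.Random(seed)

    def global_stat(lab):
        local = {g: welch_t(v, lab) for g, v in expr.items()}
        return rank_sum(local, category)

    observed = global_stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += global_stat(shuffled) >= observed
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: six genes, eight samples (four per group); gA and gB differ by group.
expr = {
    "gA": [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0],
    "gB": [2.0, 2.2, 1.8, 2.0, 4.0, 4.2, 3.8, 4.0],
    "gC": [1.0, 2.0, 1.5, 1.2, 1.1, 1.9, 1.4, 1.3],
    "gD": [5.0, 4.0, 4.5, 5.5, 4.8, 4.2, 4.6, 5.2],
    "gE": [0.5, 0.7, 0.6, 0.4, 0.6, 0.5, 0.7, 0.5],
    "gF": [3.0, 3.5, 2.5, 3.2, 3.1, 3.3, 2.7, 3.0],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = safe_p(expr, ["gA", "gB"], labels)
```

Because the global statistic uses ranks of all genes, a category full of moderately shifted genes can still score highly even if none would pass a per-gene cutoff.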
Dietary resveratrol
and aging in mice
SAFE analysis based
on GO annotations
Overlap of classes
with a significant effect:
caloric restriction
response and low-
dose resveratrol
Barger et al 2008 PLoS One
Supervised Analysis
Endeavour
Web based prioritization of candidate genes
Infers models for the training set in each data
source
Application of each model to the candidate
genes to rank them against the profiles of the training set
Merges rankings from each data source to
give global ranking of genes
http://homes.esat.kuleuven.be/~bioiuser/endeavour/endeavour.php
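The final merging step can be illustrated with a deliberately simple rank aggregation. This sketch averages rank ratios across data sources, which is a stand-in chosen for clarity and should not be read as the fusion method the tool itself uses; the source names, gene names, and scores are hypothetical.

```python
def rank_ratios(scores):
    """Map candidate -> rank / number of candidates (smallest ratio = best score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}


def global_ranking(per_source_scores):
    """Merge per-source rankings by averaging each candidate's rank ratios."""
    per_source = [rank_ratios(s) for s in per_source_scores.values()]
    candidates = set().union(*(r.keys() for r in per_source))
    combined = {}
    for gene in candidates:
        # Sources with no data for this candidate are skipped.
        ratios = [r[gene] for r in per_source if gene in r]
        combined[gene] = sum(ratios) / len(ratios)
    return sorted(combined, key=lambda g: (combined[g], g))


# Hypothetical similarity-to-training-set scores from three data sources.
sources = {
    "expression": {"G1": 0.9, "G2": 0.5, "G3": 0.1},
    "pathways": {"G1": 0.8, "G2": 0.2, "G3": 0.6},
    "literature": {"G1": 0.7, "G3": 0.3},
}
ranking = global_ranking(sources)
```

Using rank ratios rather than raw scores keeps sources with different score scales comparable and tolerates candidates that are missing from some sources.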
ENDEAVOUR: the algorithm behind
the wizard
Tranchevent, L.-C. et al. Nucl. Acids Res. 2008 36:W377-W384;
doi:10.1093/nar/gkn325
Copyright restrictions may apply.
Genetic disorder prioritization
using Endeavour
Network Analysis
Dynamic representation of cellular processes
through the incorporation of annotation &
experimental data
Structures are not fixed and change with
context
Many methods available…
Suderman & Hallett 2007 Bioinformatics
Ingenuity IPA
Pathway Analysis of WTCCC Type 2
Hypertension GWAs
No single SNP was
significant at the
genome wide level
High degree of
relationship between
pathways suggests
multiple related
mechanisms
Large number of low
penetrance risk
alleles
Pathway analysis with
MetaCore
Torkamani et al. 2008 Genomics
The next step
Translational Science
Integration of 49 genome wide experiments
for the prediction of previously unknown
obesity related genes
Greatly outperforms individual experiments
English, S. B. et al. Bioinformatics 2007 23:2910-2917;
doi:10.1093/bioinformatics/btm483
References
Song & Black 2008. BMC Bioinformatics 9:502
Huang et al 2009. NAR 37(1):1-13
Chen et al 2008. Nature 452(27):429-435
Dinu et al 2007. Journal of Biomedical Info 40:75-760
Al-Shahrour et al. NAR 36:W341-346
Barry et al 2005. Bioinformatics 21(9):1943-1949
Huang et al. Nature Protocols 4(1):44-57
Tranchevent et al 2008. NAR 36:W377-384
Mehta et al 2006. Physiol Genomics 28:24-32
Suderman & Hallett. Bioinformatics 23(20):2651-2659
Dinu et al 2008. Briefings in Bioinformatics
Curtis et al 2005. Trends in Biotech 23(8)
Price and Shmulevich 2007. Current Op in Biotech 18:365-370
Zhang et al 2008. BMC Systems Bio 2:5
Werner 2008. Current Op in Biotech 19:50-54
Liu et al 2007. BMC Bioinformatics 8:431
Goeman & Bühlmann 2007. Bioinformatics 23(8):980-987
Rivals et al 2007. Bioinformatics 23(4):401-407
Nam & Kim 2008. Briefings in Bioinformatics 9(3):89-97