Data Mining - functional statistical genetics/bioinformatics


Laura Kelly Vaughan, Ph.D.
Assistant Professor
Department of Biostatistics
Section on Statistical Genetics
Data Mining: Functional Statistical Genetics & Bioinformatics
NCBI (National Center for Biotechnology Information)
Bioinformatics is the field of science in which biology,
computer science, and information technology merge
into a single discipline. The ultimate goal of the field is
to enable the discovery of new biological insights and
to create a global perspective from which unifying
principles in biology can be discerned.
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
Integrative Data Analysis
 Genetic studies tend to focus on one data source
     Genetic variation
     RNA levels
     Blood biochemistry
 This fails to utilize the information contained in the connections among these variables…
Central Dogma of Molecular Biology
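[Figure: DNA (replication) → transcription (TXN) → RNA → translation (TSN) → protein → post-translational modification (PTM) → phenotype. Fields labeled alongside the flow: Genetics, Structural Genomics, Functional Genomics (Transcriptomics), Proteomics, Metabolomics, Phenomics.]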
Different sources of annotation data
 Gene Ontology
 Pathways/Networks
 Protein/protein interactions
 Literature
 Functional annotations
 Expression
 Cross species
 Cellular localization
 Methylation
 ChIP
 Sequence similarity
 Promoter & Regulatory Network
 Protein domains
Gene Ontology
 www.geneontology.org
 The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in a species-independent manner.
     Biological processes: series of events accomplished by one or more ordered assemblies of molecular functions
     Cellular components: parts of the cell
     Molecular functions: activities, such as catalytic or binding activities, that occur at the molecular level
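Because each ontology is structured, a gene product annotated to a specific term can also be treated as annotated to that term's ancestors. The sketch below illustrates this on a toy graph; the term names, the parent links, and the gene names are illustrative assumptions, not data copied from GO.

```python
# Toy fragment of an ontology: child term -> list of parent terms.
# Terms and links are illustrative only, not real GO records.
PARENTS = {
    "mitotic cytokinesis": ["cytokinesis", "cell cycle process"],
    "cytokinesis": ["cell division"],
    "cell division": ["cellular process"],
    "cell cycle process": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(annotations, parents=PARENTS):
    """Expand direct gene -> terms annotations to include all ancestor terms."""
    expanded = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded

# Hypothetical direct annotations for two made-up genes.
direct = {"geneA": {"mitotic cytokinesis"}, "geneB": {"cell division"}}
expanded = propagate(direct)
for gene in sorted(expanded):
    print(gene, sorted(expanded[gene]))
```

Propagating annotations upward this way is what lets an enrichment test pick up a broad term even when every gene is annotated only to its more specific children.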
Example of a GO annotation
http://www.yeastgenome.org/help/images/cytokinesisDAGrels.jpg
What is a Pathway?
 Physical and functional interactions between genes and gene products
     Metabolic pathways
     Kinase based signaling cascades
     Transcriptional signaling pathways
TNF Signaling
[Figure: TNF signaling pathway diagram. Labeled components: TNFa, TNF a/b, TNFR1, TNFR2, SODD, TRADD, I-TRAF, TRAF2, RAIDD, MADD, ceramides, MEKKs, TAK1, NIK, JNKK1, IKKs, p38, JNK1, ERKs, IkBs (degradation), NF-kB, caspases 1, 3, 6, 7, 8, 9, BID, tBID, CytoC, APAF1, ATFs, c-Jun, c-Fos, Elk1. Labeled outcomes: Apoptosis; Gene Expression and Cell Survival.]
© 2007-2009 SABiosciences.com
What is a Network?
 Graphical representation of relationships between genes, gene products, or other objects
 Formed with information such as
     Genes in interacting pathways
     Gene products that share protein-protein interactions
     Gene products with protein-nucleotide relationships
     Regulatory relationships
     Metabolic interactions
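A network of this kind can be held as an adjacency structure built from pairwise relationships. The sketch below is a minimal illustration; the edge list reuses a few names from the TNF slide plus placeholder genes, and the specific edges and relationship types are assumptions made for the example, not curated interactions.

```python
from collections import defaultdict, deque

# Illustrative relationships: (node_a, node_b, relationship type).
# These edges are assumptions for the example, not curated data.
EDGES = [
    ("TNF", "TNFR1", "protein-protein"),
    ("TNFR1", "TRADD", "protein-protein"),
    ("TRADD", "TRAF2", "protein-protein"),
    ("NF-kB", "TNF", "regulatory"),
    ("geneX", "geneY", "metabolic"),
]

def build_network(edges):
    """Build an undirected adjacency map: node -> set of neighbouring nodes."""
    adjacency = defaultdict(set)
    for a, b, _kind in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def connected_component(adjacency, start):
    """Breadth-first search for every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

network = build_network(EDGES)
print(sorted(connected_component(network, "TNF")))
print({node: len(neighbours) for node, neighbours in network.items()})
```

Node degree and connected components are two of the simplest summaries that can be read off such a structure.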
Metabolic Disease Network
©2008 by National Academy of Sciences
Lee D. et al. PNAS 2008;105:9880-9885
Analysis tools
 Numerous methods have been developed to aid in the interpretation of biological experiments
 2 basic categories
     Pre-analysis methods where the raw data is grouped together & the groups are tested
         Dimension reduction
     Post-analysis methods where significant or interesting results are grouped together to identify trends
Before you start…
 There are many methods available for integrative data analysis
 Before you choose one, you must properly define the questions you are trying to answer…
     What is your hypothesis?
Methods
 Unsupervised, or data based methods
     Utilizes all the data to identify trends
     Hypothesis generating
 Supervised, or prior information based
     Requires the user to provide a ‘training set’ of genes
     Hypothesis testing
Gene Set Analysis
 A test statistic is calculated to measure the deviation of gene-set expression measurements from the null hypothesis of no association with the phenotype
 The statistical significance (P-value) for each gene set is calculated based on permutation of samples
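The two bullets above can be sketched as follows. This is a minimal illustration, assuming a two-group phenotype and using the mean absolute difference in group means as the gene-set statistic; that statistic, the function names, and the toy expression values are assumptions for the example, not a specific published method.

```python
import random
from statistics import mean

def gene_set_statistic(expr, labels, gene_set):
    """Mean absolute difference in group means over the genes in the set."""
    diffs = []
    for gene in gene_set:
        values = expr[gene]
        cases = [v for v, lab in zip(values, labels) if lab == 1]
        controls = [v for v, lab in zip(values, labels) if lab == 0]
        diffs.append(abs(mean(cases) - mean(controls)))
    return mean(diffs)

def permutation_p_value(expr, labels, gene_set, n_perm=1000, seed=0):
    """Permute sample labels to build the null distribution of the statistic."""
    rng = random.Random(seed)
    observed = gene_set_statistic(expr, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if gene_set_statistic(expr, shuffled, gene_set) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)

# Toy data: 4 cases followed by 4 controls (values are made up).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "g1": [5.1, 5.3, 4.9, 5.2, 3.0, 3.2, 2.9, 3.1],
    "g2": [7.0, 7.2, 6.8, 7.1, 5.0, 5.1, 4.9, 5.2],
    "g3": [4.0, 4.1, 3.9, 4.2, 4.1, 4.0, 4.2, 3.9],
}
p_signal = permutation_p_value(expr, labels, ["g1", "g2"])
p_null = permutation_p_value(expr, labels, ["g3"])
print(p_signal, p_null)
```

Permuting sample labels rather than gene labels keeps the correlation between genes in the set intact under the null.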
Types of enrichment methods
 Class 1- Singular enrichment (SEA)
     P-value calculated on each term from pre-selected list & enrichment terms are listed
 Class 2- Gene set enrichment (GSEA)
     All genes (without pre-selection) are included
     No need to select list
     Experimental values integrated into P-value calculations
     Pairwise comparisons (e.g., disease vs. control)
     Most appropriate for expression data
 Class 3- Modular enrichment (MEA)
     Predetermined list, with term-term or gene-gene relationships included in enrichment P-value calculation
     Closest to nature of biological data structure
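For Class 1 (SEA), the per-term P-value is often computed from a hypergeometric tail. The sketch below assumes that choice; the slides do not say which test a given tool uses, and the function name and the counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_term are annotated to the term."""
    upper = min(n_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20000 background genes, 100 annotated to the term,
# 200 genes on the pre-selected list, 10 of them annotated to the term.
p = hypergeom_enrichment_p(20000, 100, 200, 10)
print(p)
```

With these counts only about one annotated gene would be expected on the list by chance, so an overlap of 10 gives a very small P-value. In practice one such test is run per term, so the resulting P-values need a multiple-testing adjustment.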
DAVID
 Provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes
 Extensive annotation database
     Includes both pathways and GO
 SEA and MEA algorithms
 Visualization tools
 http://david.abcc.ncifcrf.gov/
DAVID and LVH gene expression
 GO clustering of significant genes between different mouse treatment groups
Stansfield et al 2009 Cardiopulmonary Support and Physiology
Babelomics Suite
 Suite of web tools for the functional profiling of genome scale experiments
     Multiple annotation sources
         Pathways, GO, regulation, text mining, interactions
     Allows for functional enrichment
     Several gene set methods
         Mostly SEA methods
 http://babelomics.bioinfo.cipf.es/
Babelomics and thyroid carcinoma
 Identified 1031 genes with differential expression
     Enriched pathways included
         MAP kinase
         TGF-B
         Focal adhesion
         Cell motility
         Activation of actin polymerization
         Cell cycle
 Identified 30 genes that predict prognosis with 95% accuracy
Montero-Conde et al 2008 Oncogene
GSEA
 Computational method that determines
whether an a priori defined set of genes
shows statistically significant, concordant
differences between two biological states
(e.g. phenotypes).
 http://www.broad.mit.edu/gsea/
GSEA: Steps in the Methodology
 Define a Gene Set from prior knowledge
 Order the genes by correlation with phenotype
 Estimate the gene set’s Enrichment Score
 Assess Statistical Significance using permutation tests
 Adjust for Multiple Hypothesis Testing
Subramanian et al., PNAS, 2005
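The enrichment score step can be sketched as a weighted running sum over the ranked gene list: the sum rises when a gene belongs to the set and falls otherwise, and the score is the largest deviation from zero. The code below is a minimal sketch of that idea, not the Broad implementation; the function name, the `weight` parameter, and the toy gene list are assumptions.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score.

    ranked_genes: genes ordered by decreasing correlation with the phenotype.
    scores: the correlation for each gene, in the same order.
    gene_set: the a priori defined set of genes.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    # Normaliser so that the hit increments sum to one.
    hit_norm = sum(abs(s) ** weight for s, hit in zip(scores, in_set) if hit)
    running, best = 0.0, 0.0
    for s, hit in zip(scores, in_set):
        if hit:
            running += abs(s) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranked list (made-up genes and correlations).
ranked = ["a", "b", "c", "d", "e", "f"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(ranked, corr, {"a", "b"})
es_bottom = enrichment_score(ranked, corr, {"e", "f"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one. Significance would then come from recomputing the score after permuting the phenotype labels, followed by the multiple-testing adjustment.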
Biological pathways involved in
chemotherapy response in breast cancer
 GSEA for ER+ breast cancer tumors: chemotherapy responders and non-responders
 Of >850 gene sets, 4 were significant
Tordai et al 2008 Breast Cancer Research
Significance Analysis of Function and
Expression (SAFE)
 Generalization and extension of the GSEA method
 2-stage permutation-based approach to assess significant changes in gene expression across experimental conditions
     First computes gene-specific local statistics to test for association between gene expression and the phenotype.
     Gene-specific statistics are then used to estimate a global statistic that detects shifts in the local statistics within a gene category.
 The significance of the global statistic is assessed by repeatedly permuting the response values.
 SAFE implements rank-based global statistics that enable better use of marginally significant genes than those based on a p-value cutoff.
 http://www.bioconductor.org/packages/bioc/1.6/src/contrib/html/safe.html
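The two-stage idea can be sketched as below. This is not the `safe` package itself: the local statistic (absolute Welch t) and the global statistic (sum of ranks within the category) are assumed choices for illustration, and the function names and expression values are made up.

```python
import random
from math import sqrt
from statistics import mean, variance

def local_statistics(expr, response):
    """Stage 1: per-gene absolute Welch t-statistic between the two groups."""
    stats = {}
    for gene, values in expr.items():
        g1 = [v for v, r in zip(values, response) if r == 1]
        g0 = [v for v, r in zip(values, response) if r == 0]
        se = sqrt(variance(g1) / len(g1) + variance(g0) / len(g0))
        stats[gene] = abs(mean(g1) - mean(g0)) / se
    return stats

def global_rank_sum(local, category):
    """Stage 2: sum of ranks (1 = smallest local statistic) of category genes.
    Ties are broken by insertion order, which is good enough for a sketch."""
    ordered = sorted(local, key=local.get)
    rank = {gene: i + 1 for i, gene in enumerate(ordered)}
    return sum(rank[g] for g in category)

def safe_like_p_value(expr, response, category, n_perm=1000, seed=0):
    """Permute the response values and recompute both stages each time."""
    rng = random.Random(seed)
    observed = global_rank_sum(local_statistics(expr, response), category)
    permuted = list(response)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        stat = global_rank_sum(local_statistics(expr, permuted), category)
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 4 samples with response 1, then 4 with response 0 (made up).
response = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "g1": [5.1, 5.4, 4.8, 5.6, 3.0, 3.3, 2.7, 3.4],
    "g2": [7.2, 6.9, 7.5, 7.0, 5.1, 4.8, 5.3, 4.6],
    "g3": [4.0, 4.6, 3.7, 4.3, 4.4, 3.9, 4.5, 3.8],
    "g4": [2.1, 2.9, 2.4, 2.6, 2.8, 2.2, 2.5, 2.7],
    "g5": [6.3, 5.8, 6.1, 5.6, 5.9, 6.2, 5.7, 6.0],
}
category = ["g1", "g2"]
print(global_rank_sum(local_statistics(expr, response), category))
print(safe_like_p_value(expr, response, category))
```

With only five genes and eight samples the P-value is coarse; the data are there to show the mechanics, not to demonstrate power. Because the global statistic uses ranks, a category full of moderately shifted genes can still score highly even if none of them passes a per-gene cutoff.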
Dietary resveratrol
and aging in mice
 SAFE analysis based on GO annotations
 Overlap of classes with a significant effect: caloric restriction response and low-dose resveratrol
Barger et al 2008 PLoS One
Supervised Analysis
Endeavour
 Web based prioritization of candidate genes
 Infers models for the training set in each data source
 Applies each model to the candidate genes to rank them against the profiles of the training set
 Merges rankings from each data source to give a global ranking of genes
http://homes.esat.kuleuven.be/~bioiuser/endeavour/endeavour.php
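The merge step can be sketched as a rank fusion across data sources. The code below averages rank ratios, which is a deliberately simple stand-in and not Endeavour's own fusion method; the source names, candidate names, and similarity scores are all hypothetical.

```python
def rank_ratios(scores):
    """Map candidate -> rank / n, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}

def fuse_rankings(per_source_scores):
    """Average rank ratios across data sources; a lower average is better.
    A candidate missing from a source is simply skipped for that source."""
    per_source = [rank_ratios(s) for s in per_source_scores.values()]
    candidates = set().union(*(r.keys() for r in per_source))
    fused = {}
    for gene in candidates:
        ratios = [r[gene] for r in per_source if gene in r]
        fused[gene] = sum(ratios) / len(ratios)
    # Sort by fused value, then by name so that ties are deterministic.
    return sorted(fused, key=lambda g: (fused[g], g))

# Hypothetical similarity of each candidate to the training-set model,
# one dictionary per data source (higher = more similar).
sources = {
    "expression": {"c1": 0.9, "c2": 0.4, "c3": 0.1},
    "go": {"c1": 0.8, "c2": 0.7, "c3": 0.2},
    "literature": {"c1": 0.6, "c3": 0.5},
}
ranking = fuse_rankings(sources)
print(ranking)
```

Using rank ratios rather than raw scores lets sources with different score scales and different numbers of ranked candidates be combined on the same footing.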
ENDEAVOUR: the algorithm behind the wizard
Tranchevent, L.-C. et al. Nucl. Acids Res. 2008 36:W377-W384; doi:10.1093/nar/gkn325
Genetic disorder prioritization
using Endeavour
Network Analysis
 Dynamic representation of cellular processes through the incorporation of annotation & experimental data
     Structures are not fixed and change with context
 Many methods available…
     Suderman & Hallett 2007 Bioinformatics
Ingenuity IPA
Pathway Analysis of WTCCC Type 2
Hypertension GWAs
 No single SNP was significant at the genome-wide level
 High degree of relationship between pathways suggests multiple related mechanisms
 Large number of low penetrance risk alleles
 Pathway analysis with MetaCore
Torkamani et al. 2008 Genomics
The next step
Translational Science
 Integration of 49 genome wide experiments for the prediction of previously unknown obesity related genes
     Greatly outperforms individual experiments
English, S. B. et al. Bioinformatics 2007 23:2910-2917;
doi:10.1093/bioinformatics/btm483
References
Song & Black 2008 BMC Bioinformatics 9:502
Huang et al 2009 NAR 37(1):1-13
Chen et al 2008 Nature 452(27):429-435
Dinu et al 2007 Journal of Biomedical Info 40:750-760
Al-Shahrour et al 2008 NAR 36:W341-346
Barry et al 2005 Bioinformatics 21(9):1943-1949
Huang et al 2009 Nature Protocols 4(1):44-57
Tranchevent et al 2008 NAR 36:W377-384
Mehta et al 2006 Physiol Genomics 28:24-32
Suderman & Hallett 2007 Bioinformatics 23(20):2651-2659
Dinu et al 2008 Briefings in Bioinformatics
Curtis et al 2005 Trends in Biotech 23(8)
Price and Shmulevich 2007 Current Op in Biotech 18:365-370
Zhang et al 2008 BMC Systems Bio 2:5
Werner 2008 Current Op in Biotech 19:50-54
Liu et al 2007 BMC Bioinformatics 8:431
Goeman & Bühlmann 2007 Bioinformatics 23(8):980-987
Rivals et al 2007 Bioinformatics 23(4):401-407
Nam & Kim 2008 Briefings in Bioinformatics 9(3):189-197