Problem Solving Environments for High Throughput Informatics

Data Management and Mining in BioArray Informatics
Prof. Yike Guo
Dept. of Computing, Imperial College, London
Goal:
• Understand the basic bioarray technologies, including microarray technology for gene expression, protein chips, NMR spectroscopy and other high-throughput devices
• Learn the basic analytical techniques and their application to bioarray information
• Learn the workflow for processing and analysing bioarray data (e.g. gene expression analysis)
Lecture Overview
• Lecture One: BioArray Informatics Introduction
• Lecture Two: BioArray Technology
• Lecture Three: Analysis Technology (1): Data Normalisation and Transformation
• Lecture Four: Analysis Technology (2): Clustering and Classification
• Lecture Five: Analysis Technology (3): Multivariate Statistics
• Lecture Six: Analysis Applications (1): Gene Expression Analysis
• Lecture Seven: Analysis Applications (2): Integrative Analysis of BioArray Data
BioArray Informatics: Integrative Analysis of BioArray Data within the Biological Context
[Figure: bioarray data placed in its biological context. Labels: sequences and alignments (ATGCAAGTCCCT, AAGATTGCATAA, GCTCGCTCAGTT); secondary structure; tertiary structure; polymorphism; expression patterns; receptors, signals, pathways; physiology; patient records; epidemiology; linkage maps, cytogenetic maps, physical maps.]
Functional -Omics Analysis
[Figure: in the “real world”, “inputs” (noxious agent/stressor) lead to “outputs”, the “biological end-points” (pathology, altered physiology and metabolism). In the “-omics world”, the same response is tracked as a gene profile, a protein profile and a metabolic profile, each plotted against time.]
Dynamics in BioArray Informatics
[Figure: interactions among environment, metabolites, DNA, RNA, protein, growth rate and expression. A second panel, “A mathematical model”, shows forwards-propagated correlations among metabolites, protein and mRNA along a time axis marked with an event.]
BioArray Provides the Means for Revealing the Interaction
[Figure: network of Gene, Protein, Receptor and Ligand nodes, with edges numbered according to the relations below.]
Relations
1- gene homologs
2- gene encodes a protein
3- protein can regulate the expression of a gene
4- protein phosphorylates another protein
5- protein binds to another protein
6- protein lyses another protein
7- proteins can sometimes be receptors
8- receptors bind a ligand
9- receptors (if bound) activate other proteins
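The numbered relations above can be read as edge types in a small labelled graph. The following is a minimal sketch of that idea, not a structure taken from the lecture: the class `InteractionGraph`, its methods, and the node names (`geneA`, `proteinA`, and so on) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Edge types, numbered as in the slide's list of relations.
RELATIONS = {
    1: "is a homolog of",
    2: "encodes",
    3: "regulates the expression of",
    4: "phosphorylates",
    5: "binds to",
    6: "lyses",
    7: "acts as receptor",
    8: "binds ligand",
    9: "activates (if bound)",
}


class InteractionGraph:
    """Directed graph whose edges carry one of the nine relation numbers."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, source, relation, target):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation number: {relation}")
        self._edges[source].append((relation, target))

    def neighbours(self, source, relation=None):
        """Targets reachable from `source`, optionally for one relation only."""
        return [t for r, t in self._edges.get(source, [])
                if relation is None or r == relation]

    def describe(self):
        """One human-readable sentence per edge."""
        return [f"{s} {RELATIONS[r]} {t}"
                for s, edges in self._edges.items() for r, t in edges]


# Hypothetical example: a gene encodes a protein that regulates another gene.
graph = InteractionGraph()
graph.add("geneA", 2, "proteinA")
graph.add("proteinA", 3, "geneB")
graph.add("receptorR", 8, "ligandL")
graph.add("receptorR", 9, "proteinA")
```

Calling `graph.neighbours("receptorR", 9)` then returns the proteins that the receptor activates once bound, which is the kind of question the diagram is meant to answer.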
BioArray: Quantitative Measurement of Biological Concepts
1. Microarrays (experiment vs. control ORF, ~1000 bp hybridization)
• R, G values
• quality indicators
• R/G ratios
2. Affymetrix (ORF, 25-bp hybridization, PM and MM probes)
• Averaged PM-MM
• “presence”
• feature statistics
• 25-mers
Quantitative Analysis
• Reproducibility
• Confidence intervals to find significant deviations
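The two measurement styles on this slide reduce to simple arithmetic: a ratio of red to green intensity for two-colour microarrays, and an average of perfect-match minus mismatch intensities for Affymetrix probe sets. The sketch below illustrates both, plus a confidence interval over replicates. The function names are hypothetical, the log base 2 is a common convention rather than something the slide states, and the interval uses a normal approximation (z = 1.96) as an assumption.

```python
import math
from statistics import mean, stdev


def log2_ratio(red, green):
    """Log2 of the R/G ratio for one spot; both intensities must be positive."""
    return math.log2(red / green)


def average_difference(pm, mm):
    """Average of PM - MM over the probe pairs of one probe set."""
    return mean(p - m for p, m in zip(pm, mm))


def confidence_interval(values, z=1.96):
    """Normal-approximation interval for the mean of replicate measurements."""
    centre = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return centre - half_width, centre + half_width


# Hypothetical replicate log ratios for one gene.
replicates = [1.1, 0.9, 1.0, 1.2]
low, high = confidence_interval(replicates)
# The deviation is called significant here if the interval excludes 0
# (a log ratio of 0 means no change between experiment and control).
is_significant = low > 0 or high < 0
```

With only a handful of replicates a t-based interval would be wider than this normal approximation, so the sketch errs towards calling changes significant.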
BioArray Informatics: BioArray is the data, everything else is Informatics
• Data Engineering
• Data Warehousing
• Data Integration
• Data Analysis
• Knowledge Discovery
• Discovery Integration
• Discovery Validation
• Knowledge Integration
• Knowledge Warehousing
Data Warehousing
Data Sources
• External Data Sources: KEGG, Unigene, Genbank
• Operational Data Sources: Sample & Clinical Data, BioArray Data
Data Warehousing:
• Experimental/Sample Database
• Expression Database
• Function Annotations
• Structure Annotations
Example - ArrayExpress
ArrayExpress entities: Sample, ExpressionValue, Hybridization, Array, Experiment
External links
• Ontology, e.g., organism taxonomy
• Reference, e.g., publication, web resource
• Database, e.g., gene in SWISS-PROT
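To make the entity list concrete, here is a rough sketch of how the five named entities and the three kinds of external link might relate. This is not the actual ArrayExpress schema: every field name, and the way the entities reference each other, is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalLink:
    kind: str    # "ontology", "reference" or "database"
    target: str  # e.g. a taxonomy term, a publication, a protein accession


@dataclass
class Sample:
    name: str
    links: list = field(default_factory=list)  # ExternalLink objects


@dataclass
class Array:
    name: str


@dataclass
class Hybridization:
    sample: Sample  # what was hybridized
    array: Array    # what it was hybridized to


@dataclass
class ExpressionValue:
    hybridization: Hybridization
    feature: str    # hypothetical feature or gene identifier
    value: float


@dataclass
class Experiment:
    title: str
    values: list = field(default_factory=list)  # ExpressionValue objects

    def values_for(self, feature):
        """All measured values for one feature across hybridizations."""
        return [v.value for v in self.values if v.feature == feature]


# Hypothetical usage.
sample = Sample("liver-1", [ExternalLink("ontology", "Homo sapiens")])
hyb = Hybridization(sample, Array("array-1"))
experiment = Experiment("demo", [ExpressionValue(hyb, "geneXYZ", 2.5),
                                 ExpressionValue(hyb, "geneABC", 0.7)])
```

Keeping external links as typed references rather than copied values is what lets a warehouse follow annotations in the source databases as they change.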
Data Schema in Warehousing: A Gene Expression Example
[Figure: schema centred on the Gene Expression Warehouse. Entities: Disease, Protein, Enzyme, Affy Fragment, Known Gene, Sequence, Sequence Cluster, SNP, Metabolite, Pathway. Linked sources: OMIM, ExPASy SwissProt, ExPASy Enzyme, PDB, LocusLink, MGD, SPAD, NCBI dbSNP, Genbank, UniGene, KEGG, NMR.]
A Workflow of Gene Expression Database
[Figure: workflow from the GXDW through data reduction queries to warehousing output.]
Data Reduction Queries
• Comparisons between 2 samples
• Comparisons between multiple samples
• Set fold change (e.g., > 2X)
• Set higher avg difference value (e.g., > 200)
• A->P / P->A stringency (e.g., 80%)
Warehousing Output
• User defined dataset
• Profile Report
• Visualisation
• Data in analysis, passed on to Advanced Gene Expression Analysis
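The data reduction thresholds named in this workflow (fold change, average difference, and A->P / P->A stringency) can be expressed as simple filters. The sketch below is one possible reading, not the lecture's implementation: the function names are hypothetical, intensities are assumed positive, and "stringency" is interpreted as the fraction of replicate comparisons in which the absent/present call makes the given transition.

```python
def fold_change(a, b):
    """Symmetric fold change between two positive intensities (always >= 1)."""
    return max(a, b) / min(a, b)


def reduce_two_samples(sample_a, sample_b, min_fold=2.0, min_avg_diff=200.0):
    """Keep genes that change by more than `min_fold` between the two samples
    and whose larger average difference value exceeds `min_avg_diff`."""
    kept = []
    for gene, a in sample_a.items():
        b = sample_b[gene]
        if max(a, b) > min_avg_diff and fold_change(a, b) > min_fold:
            kept.append(gene)
    return kept


def transition_fraction(calls_before, calls_after, before="A", after="P"):
    """Fraction of paired calls going from `before` to `after` (e.g. A->P)."""
    pairs = list(zip(calls_before, calls_after))
    hits = sum(1 for x, y in pairs if x == before and y == after)
    return hits / len(pairs)


# Hypothetical average difference values for three genes in two samples.
sample_a = {"g1": 100, "g2": 500, "g3": 50}
sample_b = {"g1": 450, "g2": 600, "g3": 150}
selected = reduce_two_samples(sample_a, sample_b)

# Hypothetical absent/present calls across five replicate comparisons.
fraction = transition_fraction(["A", "A", "A", "A", "P"],
                               ["P", "P", "P", "P", "P"])
passes_stringency = fraction >= 0.8
```

Here `g2` is dropped because it changes by only 1.2-fold, and `g3` because neither value clears the intensity floor, even though its fold change is 3.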
Queries, Queries...
Query to the data
• Which genes are linked?
• Which genes are expressed similarly to my gene XYZ?
• Which genes are co-expressed in differing conditions?
• Classification (of tumors, diseased tissues etc.): which patterns are characteristic of a certain class of samples, and which genes are involved?
• Functional classification of genes: are changes clustered in particular classes?
• Metabolic pathway information: is a certain pathway, or a route in a pathway, affected?
• Disease information & clinical follow-up: correlation to expression patterns.
• Phenotype information for mutants: are there correlations between particular phenotypes and expression patterns?
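As an illustration of the second query above ("Which genes are expressed similarly to my gene XYZ?"), the following sketch ranks genes by Pearson correlation with a query profile. Pearson correlation is one common similarity measure, used here as an assumption; the lecture does not prescribe it. The function names and the expression profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def most_similar(profiles, query, top=3):
    """Genes ranked by decreasing correlation with the query gene's profile."""
    scores = [(pearson(profiles[query], profile), gene)
              for gene, profile in profiles.items() if gene != query]
    scores.sort(reverse=True)
    return [(gene, r) for r, gene in scores[:top]]


# Hypothetical expression profiles over four conditions.
profiles = {
    "XYZ":   [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],   # same shape, different scale
    "geneB": [4.0, 3.0, 2.0, 1.0],   # opposite trend
    "geneC": [1.0, 3.0, 2.0, 4.0],   # partly similar
}
ranking = most_similar(profiles, "XYZ")
```

Because correlation ignores scale, `geneA` ranks first despite having twice the intensity of `XYZ`; a distance-based measure would rank the genes differently.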
Gene Expression Data Analysis Work Flow
[Figure: workflow from “Data in analysis” through interactive analysis procedures to knowledge deliverables.]
Interactive Analysis Procedures
• Cluster by genes
• Study outliers
• Correlate clinical measurements
• Literature analysis
• Time course analysis
Defined subsets of genes [examples, not exhaustive]
• Classic drug targets
• Known disease association
• Cross species indices
Knowledge Deliverables
(Un)fortunately, scientists never think linearly
• Why are those genes co-expressed?
• What do their protein products do?
• What are the common regulatory motifs of a co-expressed gene set?
• Can we patent them?
• Do we know which metabolic pathway they are in? If not, can I synthesise one?
• Are there HTS results for any proteins in the pathway?
• Are there any compounds in the HTS library that hit selectively and consistently against those proteins?
• Which ones have good activity, availability and toxicity?
Advanced Analysis
• Discovery Annotation and Validation
  - e.g. annotating a set of co-expressed genes with some conserved regulatory motifs
  - e.g. scoring a co-expression pattern with pathways
  - e.g. literature analysis to annotate biological semantics
• Integrative Analysis
  - e.g. multi-modality analysis
  - e.g. cross annotation of discovered patterns
• Modelling and Simulation
  - e.g. pathway synthesis
  - e.g. virtual cell modelling
Pathway Scoring
[Figure: genes Gen 1 to Gen 20 listed against a numeric score axis; Gen 1 to Gen 8 are marked as members of pathway P1. The plotted quantity is labelled log P(g ∈ P) / P(g ∉ P).]
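One standard way to score a gene set against a pathway is a hypergeometric enrichment test. The sketch below uses that test as a stand-in to illustrate the idea of pathway scoring; it is not the score shown in the figure and not the GPE-Score. The function name is hypothetical. The universe of 20 genes with 8 in pathway P1 follows the figure, while the selected set of 6 genes with 5 pathway members is invented for the example.

```python
from math import comb


def enrichment_p_value(n_total, n_pathway, n_selected, n_overlap):
    """Probability of seeing at least `n_overlap` pathway genes when
    `n_selected` genes are drawn at random from `n_total` genes, of which
    `n_pathway` belong to the pathway (hypergeometric upper tail)."""
    upper = min(n_pathway, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_total - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / comb(n_total, n_selected)


# 20 genes in total, 8 in pathway P1 (as in the figure);
# a hypothetical co-expressed set of 6 genes, 5 of which are in P1.
p_value = enrichment_p_value(n_total=20, n_pathway=8,
                             n_selected=6, n_overlap=5)
```

A small p-value suggests that the co-expressed set overlaps the pathway more than chance would explain, which is the intuition behind scoring a co-expression pattern with pathways.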
Analysis of Gene Expression Data with Pathway Scores: Our Approach
GPE-Score(Pathway)