Outreach_microarray_bioinformatics_GMC_2005

The Bioinformatics of Microarrays
Microarray Outreach Team, Fall 2005
Outline
• Biology, Statistics, Data mining common term definitions
• Transcriptome caveats and limitations
• Experimental Design
• Scan to intensity measures
• Low level analysis
• Data mining – how to interpret > 6000 measures
– Databases
– Software
– Techniques
– Comparing to prior HT studies, across platforms?
Issues
Bioinformatics, Computational
Biology, Data Mining
• Bioinformatics is an interdisciplinary field about the information
processing problems in computational biology and a unified treatment
of the data mining methods for solving these problems.
• Computational Biology is about modeling real data and simulating
unknown data of biological entities, e.g.
– Genomes (viruses, bacteria, fungi, plants, insects,…)
– Proteins and Proteomes
– Biological Sequences
– Molecular Function and Structure
• Data Mining is searching for knowledge in data
– Knowledge mining from databases
– Knowledge extraction
– Data/pattern analysis
– Data dredging
– Knowledge Discovery in Databases (KDD)
Basic Terms in Biology
Example:
• The human body contains ~100 trillion cells
• Inside each cell is a nucleus
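• Inside the nucleus are two complete sets of the human
genome (except in egg and sperm cells, which carry one set, and
mature red blood cells, which have no nucleus)
• Each set includes roughly 20,000-25,000 protein-coding genes
(earlier estimates ranged from 30,000 to 80,000) on the
same 23 chromosomes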
• Gene – A functional hereditary unit that occupies a fixed
location on a chromosome, has a specific influence on
phenotype, and is capable of mutation.
• Chromosome – A DNA-containing linear body in the cell
nucleus, responsible for determination and transmission of
hereditary characteristics
Basic Terms in Data Mining
• Data Mining: A step in the knowledge discovery process consisting of
particular algorithms (methods) that, under some acceptable
objective, produce a particular enumeration of patterns (models)
over the data.
• Knowledge Discovery Process: The process of using data mining
methods (algorithms) to extract (identify) what is deemed knowledge
according to the specifications of measures and thresholds, using a
database along with any necessary preprocessing or
transformations.
• A pattern is a conservative statement about a probability distribution.
– Webster: A pattern is (a) a natural or chance configuration, (b) a
reliable sample of traits, acts, tendencies, or other observable
characteristics of a person, group, or institution
Problems in Bioinformatics
Domain
– Data production at the levels of
molecules, cells, organs, organisms,
populations
– Integration of structure and function data,
gene expression data, pathway data,
phenotypic and clinical data, …
– Prediction of Molecular Function and
Structure
– Computational biology: synthesis
(simulations) and analysis (machine
learning)
Subcellular Localization provides a simple goal for
genome-scale functional prediction
Determine how many of the ~6000 yeast
proteins go into each compartment
Subcellular Localization,
a standardized aspect of function
[Figure: cell compartments: Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria]
"Traditionally" subcellular localization is
"predicted" by sequence patterns
Compartment               Sequence pattern
Nucleus                   NLS
Cytoplasm                 (none shown)
Membrane                  TM-helix
ER                        HDEL
Extracellular [secreted]  Sig. Seq.
Golgi                     (none shown)
Mitochondria              Import Sig.
Subcellular localization is associated with the level of
gene expression
[Figure: expression level in copies/cell by compartment: Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria]
Combine Expression Information & Sequence
Patterns to Predict Localization
[Figure: expression level in copies/cell shown together with sequence patterns (NLS, TM-helix, HDEL, Sig. Seq., Import Sig.) for each compartment: Nucleus, Cytoplasm, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria]
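One simple way to combine a sequence pattern with an expression measurement is a naive Bayes score. The slides do not say which method was used, so the sketch below is only an illustration of the idea. The feature names (`has_NLS`, `high_expression`) and every probability are made-up placeholders, not measured values.

```python
# Illustrative only: all numbers below are hypothetical placeholders.
# Prior probability that a protein sits in each compartment.
PRIOR = {"nucleus": 0.3, "cytoplasm": 0.5, "mitochondria": 0.2}

# P(feature is present | compartment), one entry per feature.
LIKELIHOOD = {
    "has_NLS": {"nucleus": 0.6, "cytoplasm": 0.1, "mitochondria": 0.05},
    "high_expression": {"nucleus": 0.2, "cytoplasm": 0.6, "mitochondria": 0.3},
}


def posterior(features):
    """Return P(compartment | features), treating features as independent.

    `features` maps a feature name to True (present) or False (absent).
    """
    scores = {}
    for compartment, prior in PRIOR.items():
        score = prior
        for name, present in features.items():
            p = LIKELIHOOD[name][compartment]
            score *= p if present else (1.0 - p)
        scores[compartment] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# A protein with an NLS motif and low expression.
example = posterior({"has_NLS": True, "high_expression": False})
best = max(example, key=example.get)
```

With these placeholder numbers the NLS evidence outweighs the cytoplasm prior. Real likelihoods would have to be estimated from proteins of known localization.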
Major Objective: Discover a comprehensive
theory of life’s organization at the
molecular level
– The major actors of molecular biology: the
nucleic acids, DeoxyriboNucleic acid
(DNA) and RiboNucleic Acids (RNA)
– The central dogma of molecular
biology???
Proteins are very complicated molecules built from 20
different amino acids.
Dynamic Nature of Yeast Genome
eORF = essential
kORF = known
hORF = homology identified
shORF = short
tORF = transposon identified
qORF = questionable
dORF = disabled
The first published sequence claimed 6274 genes, a number that has
been revised many times. Why?
The Affy detection oligonucleotide sequences are frozen at the time
of synthesis. How does this impact downstream data analysis?
Microarray Data Process
1. Experimental Design
2. Image Analysis – raw data
3. Normalization – “clean” data
4. Data Filtering – informative data
5. Model building
6. Data Mining (clustering, pattern recognition, etc.)
7. Validation
Experimental Design
A good microarray design has 4 elements
1. A clearly defined biological question or hypothesis
2. Treatment, perturbation and observation of
biological materials should minimize systematic
bias
3. Simple and statistically sound arrangement that
minimizes cost and gains maximal information
4. Compliance with MIAME
• The goal of statistics is to find signals in a sea of noise
• The goal of exp. design is to reduce the noise so signals
can be found with as small a sample size as possible
Observational Study vs. Designed
Experiment
• Observational study– Investigator is a passive observer who
measures variables of interest, but does not
attempt to influence the responses
• Designed Experiment– Investigator intervenes in natural course of
events
What type is our DMSO exp?
Experimental Replicates
• Why?
– In any exp. system there is a certain amount of noise—
so even 2 identical processes yield slightly different
results
– Sources?
– In order to understand how much variation there is it is
necessary to repeat an exp a # of independent times
– Replicates allow us to use statistical tests to ascertain if
the differences we see are real
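As a minimal sketch of how replicates feed a statistical test, the snippet below computes Welch's t statistic for one gene measured in two conditions. The intensity values are hypothetical, and the choice of Welch's test is an assumption made for illustration; the slides do not name a specific test.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treated):
    """Return Welch's t statistic for two groups of replicate measurements."""
    n1, n2 = len(control), len(treated)
    if n1 < 2 or n2 < 2:
        # The variance within a group cannot be estimated from one replicate.
        raise ValueError("each group needs at least 2 replicates")
    # Standard error of the difference between the two group means.
    se = sqrt(variance(control) / n1 + variance(treated) / n2)
    return (mean(treated) - mean(control)) / se


# Hypothetical log2 intensities for one gene, 3 replicates per condition.
control = [8.0, 8.2, 7.9]
treated = [9.1, 9.3, 9.0]
t = welch_t(control, treated)
```

A large absolute t means the difference between conditions is large relative to the replicate-to-replicate noise. With a single replicate per group the function raises an error, because the noise cannot be estimated at all.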
Technical vs. Biological Replicates
As we progress from the starting material to the scanned
image we move from a system dominated by biological
effects to one dominated by chemistry and physics noise
Within the Affy platform the dominant variation is usually
biological, so the best strategy is to produce replicates as
high up the experimental tree as possible
From probe level signals to gene
abundance estimates
The job of the expression summary algorithm is
to take a set of Perfect Match (PM) and MisMatch (MM) probes, and use these to generate
a single value representing the estimated
amount of transcript in solution, as measured
by that probeset.
To do this, .DAT files containing array images are first
processed to produce a .CEL file, which contains
measured intensities for each probe on the array.
It is the .CEL files that are analysed by the expression
calling algorithm.
PM and MM Probes
• The purpose of each MM probe is to provide a direct
measure of background and stray signal (perhaps due to
cross-hybridisation) for its perfect-match partner. In most
situations the signal from each probe pair is simply the
difference PM - MM.
• For some probe pairs, however, the MM signal is greater
than the PM value, giving a negative difference and an
apparently impossible measure of background.
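The probe-pair arithmetic can be sketched in a few lines. The intensities below are hypothetical, and the function names are illustrative rather than part of any Affymetrix software.

```python
def probe_pair_signals(pm, mm):
    """Return the simple difference PM - MM for each probe pair in a probeset."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM lists must have the same length")
    return [p - m for p, m in zip(pm, mm)]


def pairs_where_mm_exceeds_pm(pm, mm):
    """Return the indices of probe pairs whose MM signal is greater than PM."""
    return [i for i, (p, m) in enumerate(zip(pm, mm)) if m > p]


# Hypothetical intensities for a 4-pair probeset.
pm = [1200.0, 950.0, 400.0, 1500.0]
mm = [300.0, 200.0, 650.0, 500.0]

signals = probe_pair_signals(pm, mm)
problem_pairs = pairs_where_mm_exceeds_pm(pm, mm)
```

The third pair has MM greater than PM, so its difference is negative. A naive PM - MM summary therefore needs special handling for such pairs before the probeset can be summarized.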
Signal Intensity
• Following these calculations, the MAS5
algorithm now has a measure of the signal
for each probe in a probeset.
• Other algorithms, e.g. RMA, GCRMA and
dChip, have been developed
by academic teams to improve the
precision and accuracy of this calculation
• In our experiment we will use RMA and GCRMA
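RMA includes a quantile normalization step, which forces every array to share the same intensity distribution. The sketch below shows only that step, in plain Python on hypothetical values. It is not the Bioconductor implementation and omits RMA's background correction and probeset summarization.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (equal-length lists of intensities).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position, a simplification.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Mean of the k-th smallest value across all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three probes each.
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
norm = quantile_normalize(raw)
```

After normalization both arrays contain exactly the same set of values, and each probe keeps its original rank within its own array.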
Low level data analysis / pre-processing
• Varying biological or cellular composition among sample types
• Differences in sample preparation, labeling or hybridization
• Non-specific cross-hybridization of target to probes
These lead to systematic differences between individual arrays
• Raw Data Quality Control
• Normalization
• Scaling
• Filtering
Tools: GeneSpring, R language, and Bioconductor
[Slide labels: GMC scientists; Scott; Anjie; GMC scientists + entire UVM outreach team]
Data processing is completed, now what?
Overview of Microarray Problem
[Diagram labels: Biology Application Domain; Experiment Design and Hypothesis; Microarray Experiment; Image Analysis; Data Analysis; Data Mining; Validation; Data Warehouse; Artificial Intelligence (AI); Statistics; Knowledge discovery in databases (KDD)]
Back to Biology
• Do the changes you see in gene
expression make sense BIOLOGICALLY?
• How do we know?
• If they don’t make sense, can you
hypothesize as to why those genes might
be changing?
• Leads to many, many more experiments
The Gene Ontologies
A Common Language for Annotation of
Genes from
Yeast, Flies and Mice
…and Plants and Worms
…and Humans
…and anything else!
Gene Ontology Objectives
• GO represents concepts used to classify
specific parts of our biological knowledge:
– Biological Process
– Molecular Function
– Cellular Component
• GO develops a common language applicable
to any organism
• GO terms can be used to annotate gene
products from any species, allowing
comparison of information across species
Srinija Srinivasan, Chief Ontologist, Yahoo!
The ontology. Dividing human knowledge
into a clean set of categories is a lot like
trying to figure out where to find that
suspenseful black comedy at your corner
video store. Questions inevitably come up,
like are Movies part of Art or
Entertainment? (Yahoo! lists them under the
latter.) -Wired Magazine, May 1996
The 3 Gene Ontologies
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are
carbohydrate binding and ATPase activity
• Biological Process = biological goal or
objective
– broad biological goals, such as mitosis or purine metabolism, that are
accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples
include nucleus, telomere, and RNA polymerase II holoenzyme
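A minimal sketch of how annotations under the three ontologies might be stored and grouped per gene product. The gene names (`geneA`, `geneB`) are hypothetical and the record layout is an assumption for illustration, not the format of any GO or SGD file. The GO IDs should be checked against the current GO release before reuse.

```python
from collections import defaultdict

# (gene product, GO ID, term name, ontology). Gene names are hypothetical.
GO_ANNOTATIONS = [
    ("geneA", "GO:0005634", "nucleus", "cellular_component"),
    ("geneA", "GO:0016887", "ATPase activity", "molecular_function"),
    ("geneB", "GO:0005634", "nucleus", "cellular_component"),
    ("geneB", "GO:0000750",
     "signal transduction during conjugation with cellular fusion",
     "biological_process"),
]


def terms_by_ontology(annotations, gene):
    """Group the GO term names annotated to one gene product by ontology."""
    grouped = defaultdict(list)
    for product, _go_id, name, ontology in annotations:
        if product == gene:
            grouped[ontology].append(name)
    return dict(grouped)


def genes_with_term(annotations, go_id):
    """Return the sorted gene products annotated with a given GO ID."""
    return sorted({product for product, gid, _n, _o in annotations if gid == go_id})


gene_a_terms = terms_by_ontology(GO_ANNOTATIONS, "geneA")
nuclear_genes = genes_with_term(GO_ANNOTATIONS, "GO:0005634")
```

Because the same GO ID can be attached to gene products from any species, a lookup like `genes_with_term` is what makes cross-species comparison possible.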
Example: Gene Product = hammer
Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Clown’s juggling object    Entertainment
Biological Examples
Biological Process
Molecular Function
Cellular Component
Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in
transduction of mating pheromone signal, as described in
Saccharomyces.
definition_reference: PMID:9561267
comment: This term was made obsolete because it is a gene
product specific term. To update annotations, use the biological
process term 'signal transduction during conjugation with cellular
fusion ; GO:0000750'.
SGD
SGD public microarray data sets available for public query
Homework
1. Go to http://www.yeastgenome.org/ and find 3 candidate genes of known
function and one of undefined function that you might predict to be altered by
DMSO treatment
2. What GO biological processes and molecular mechanisms are
associated with your candidate genes?
3. Where, subcellularly, does the protein reside in the cell?
4. What other proteins are known or inferred to interact with yours? How
was this interaction determined? Is this a genetic or physical interaction?
5. Find the expression of at least one of your known genes in another
publicly deposited microarray data set.
   1. Name of data set and how you found it?
   2. What is the largest fold change observed for this gene in the public study?
6. Now that you are microarray technology experts, can you give me 3
reasons why the observed transcript level difference may not be
confirmed through a second technology like RT-qPCR?
Suggested Reading