No Slide Title - people.vcu.edu

Bioinformatics
Brad Windle
[email protected]
Ph# 628-1956
Web Site: http://www.people.vcu.edu/~bwindle/Courses
Click on Link to MEDC 310 course
Or
http://www.phc.vcu.edu/310/
Profiling
The term "bioinformatics" is about 15 years old. It covers a variety of
data analyses that include:
DNA and protein sequence analysis
Biological analysis of drugs, can overlap with chemoinformatics
Genetics
Taxonomy
Clinical data statistics
Genomic and proteomic research
Bioinformatics is sometimes equated to the term "data mining", which
is commonly used in e-business and internet data handling.
Chemoinformatics
Chemoinformatics has a special challenge in that the structure of a
compound or drug needs to be quantified. Specific structures are
characterized by molecular descriptors useful in Quantitative
Structure-Activity Relationship (QSAR) modeling. QSAR tells
you what it is about the structure of a drug that makes it do what it
does.
Much of this information has implications for what a drug will
do in a cell. However, the complexity of a cell makes the reality
of what a drug does in the cell deviate significantly from what is
anticipated based on chemistry and enzymatic assays. This
stresses the need for characterizing drugs based on more
biological data.
Analogies for looking for patterns
Looking at patterns in images
A mixture of many patterns
We need to identify individual patterns
There are methods for extracting the patterns from the data
There is also noise that obscures the patterns
One method for identifying object patterns of interest
amidst the noise
Another method for identifying different object patterns
of interest amidst the noise
This is what was actually buried in the noise
Questions?
Philosophy of Science
Reductionist Approach (Reductionism)
VS
Systems Approach (Systemism)
Reductionist
Systems Approach
Traditional Scientific Methods
Updated Scientific Methods
Observations are made with or without making changes to the system
(Updated: technology allows a large number of observations to be made)
Data are analyzed and a hypothesis developed
(Updated: bioinformatics allows analysis of a large amount of data)
Experiments are designed and conducted to test the hypothesis, usually by changing something in the system
Observations are made to determine if the hypothesis is true or false
(Updated: technology allows a large number of observations to be made)
Data are analyzed and conclusions made
(Updated: bioinformatics allows analysis of a large amount of data)
The hypothesis is either proved true, and work advances to the next stage, or proved false, and new observations are made or the data are reanalyzed to develop a better hypothesis
How Does a Cell, or Person Respond to
Therapy or a Drug?
Treat 10 people suffering from Disease A with Drug X.
• 2 people suffer adverse reactions
• 3 exhibit good recovery from disease
• 2 exhibit modest recovery from disease
• 3 exhibit no sign of recovery from disease
What Factors Cause Differences
Between People?
Genes and their sequence
Health-wise
• Disease
• Health-related Traits
• Response to Drugs
What Are the Differences in Genes?
Single nucleotide polymorphisms (SNPs)
SerSerIleAsnGlyGlnLeuArgPro
AGTTCTATAAATGGCCAGCTTAGACCT
TCAAGATATTTACCGGTCGAATCTGGA
SerSerIleHisGlyGlnIleArgPro
AGTTCTATACATGGCCAGATTAGACCA
TCAAGATATGTACCGGTCTAATCTGGT
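The two sequences above can be compared base by base to locate the SNPs and see which ones change the protein. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the codon table is deliberately partial, covering only the codons that occur in these two sequences.

```python
# Partial codon table: only the codons present in the two slide sequences.
CODON_TABLE = {
    "AGT": "Ser", "TCT": "Ser", "ATA": "Ile", "AAT": "Asn", "CAT": "His",
    "GGC": "Gly", "CAG": "Gln", "CTT": "Leu", "ATT": "Ile", "AGA": "Arg",
    "CCT": "Pro", "CCA": "Pro",
}

def translate(dna):
    """Translate a coding-strand DNA string, three bases at a time."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]

def snp_report(ref, alt):
    """List each differing base with the amino acids it affects.

    Each entry is (1-based position, ref base, alt base,
    ref amino acid, alt amino acid, amino_acid_changed).
    """
    ref_aa, alt_aa = translate(ref), translate(alt)
    report = []
    for pos, (r, a) in enumerate(zip(ref, alt)):
        if r != a:
            codon = pos // 3  # index of the codon containing this base
            changed = ref_aa[codon] != alt_aa[codon]
            report.append((pos + 1, r, a, ref_aa[codon], alt_aa[codon], changed))
    return report

# Coding strands copied from the slide (top strand of each pair).
seq_a = "AGTTCTATAAATGGCCAGCTTAGACCT"
seq_b = "AGTTCTATACATGGCCAGATTAGACCA"

for entry in snp_report(seq_a, seq_b):
    print(entry)
```

Of the three differences, two change an amino acid (Asn to His, Leu to Ile) and one is silent (CCT and CCA both encode Pro), which is why not every SNP affects the quality of a protein.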
How does a difference in a gene affect drug response?
Transport of the drug
Metabolism of the drug
Interaction with the drug target
5 Million SNPs
Let’s say there are 10 SNPs that contribute to response to Drug X
Combinatorial approach to identifying SNPs that correlate with
drug response
All combinations = 10^60
Narrow SNPs down to those within genes to 100,000
Combinations = 10^43
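The orders of magnitude quoted above can be checked by counting the ways to choose 10 SNPs from the pool. This is a small sketch under the assumption that "combinations" means unordered selections of 10 SNPs; the helper names are illustrative only.

```python
import math

def combos(n_snps, k=10):
    """Number of ways to choose k SNPs out of n_snps (order ignored)."""
    return math.comb(n_snps, k)

def order_of_magnitude(n):
    """Power of ten of a positive integer, i.e. floor(log10(n))."""
    return len(str(n)) - 1

all_snps = combos(5_000_000)   # every known SNP
gene_snps = combos(100_000)    # only SNPs that fall within genes

print(f"All SNPs:      about 10^{order_of_magnitude(all_snps)}")
print(f"Within genes:  about 10^{order_of_magnitude(gene_snps)}")
```

Even after narrowing to SNPs within genes, the number of combinations is far too large to test exhaustively, which is the point of the comparison to the Traveling Salesman Problem.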
Traveling Salesman Problem
SNPs thus far described were inherited,
affecting the quality of proteins
What about differences between people
that are somatic?
What about quantitative differences in
proteins?
Differences in Protein Expression and
Gene Expression
20,000 genes - Genomics
100,000 proteins - Proteomics
In genomics and proteomics research, the data is extensive and the patterns
complex.
The emphasis shifts from asking specific questions or testing hypotheses to trying
to extract the most significant observations the data offer.
Bioinformatics and Data Mining in general use two forms of learning:
Unsupervised learning and Supervised learning
Supervised learning is the process of learning by example:
Use example patterns with known characteristics to learn and predict
characteristics for the unknown
This is essentially the modeling process
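Learning by example can be illustrated with a nearest-neighbour classifier, one of the simplest supervised methods. This is only a sketch: the expression profiles, the "responder" labels, and the function name are made up for illustration and are not from the slides.

```python
import math

def predict(examples, unknown):
    """1-nearest-neighbour: return the label of the closest known example."""
    nearest = min(examples, key=lambda ex: math.dist(ex[0], unknown))
    return nearest[1]

# Hypothetical training examples: (expression profile, known characteristic).
training = [
    ((2.1, 0.3, 1.8), "responder"),
    ((1.9, 0.4, 2.0), "responder"),
    ((0.2, 1.7, 0.5), "non-responder"),
    ((0.4, 1.9, 0.3), "non-responder"),
]

# A new profile with unknown characteristic is assigned the label
# of the most similar known pattern.
print(predict(training, (2.0, 0.5, 1.7)))
print(predict(training, (0.3, 1.8, 0.4)))
```

The known examples play the role of the model; real applications would use many more examples and validate predictions on data held out from training.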
Unsupervised learning is learning by observation; exploratory
data analysis is a general form
Let the data reveal prominent patterns and associations; you don’t look
for specific patterns
Exploratory data analysis is used when there is no hypothesis to test, or
when there is no specific pattern expected.
This type of analysis shows the most significant patterns or trends within
the data; it does not imply biological or statistical significance.
Cluster analysis is a popular form of exploratory data analysis.
Cluster analysis sorts whatever is being analyzed into
clusters with the greatest similarities in trend or
pattern. It is a form of non-descriptive statistics and
exploratory data analysis.
A dendrogram or tree diagram is used to present the
results.
Below is an example of a dendrogram for bacterial
species of Escherichia.
New technology = lots of data
Microarray Technology
DNA Microarray
Cell 1’s mRNA
Cell 2’s mRNA
Pseudo-colored Microarray Spots
The total intensity for each spot is summed and the values
plotted on a scatterplot.
A scatterplot of 2000 points is shown. Each point
represents a gene.
Cluster analysis methods
The most straightforward methods involve calculating
the Euclidean (Euclid) distance between two points, for
all combinations of points.
Pythagorean Theorem
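For two points on the scatterplot, the Pythagorean theorem gives the distance as the square root of the squared difference in x plus the squared difference in y. A minimal sketch of computing this for all combinations of points follows; the three sample points and the function names are illustrative, not data from the slides.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Distance between two 2-D points via the Pythagorean theorem."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def pairwise_distances(points):
    """Distance for every unordered pair of points, keyed by index pair."""
    return {
        (i, j): euclidean(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }

# Each point stands for one gene: (intensity in cell 1, intensity in cell 2).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

for pair, d in pairwise_distances(points).items():
    print(pair, d)
```

For 2000 genes this produces 2000 x 1999 / 2 = 1,999,000 distances, which is why the calculation is left to the computer.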
If we perform cluster analysis on the 2000 points, we
can see that we have one giant cluster with a handful
of outliers.
Adding Dimensions to Cluster Analysis
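The distance calculation would be: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$ for two points $p$ and $q$ in $n$ dimensions.
Thus, while we can't visualize more than three dimensions, the
computer can perform cluster analysis on as many dimensions
as imaginable, or as many as processing time allows.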
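A sketch of distance-based cluster analysis in any number of dimensions is given below. It assumes single-linkage agglomerative clustering (repeatedly merging the two closest clusters), which is one common choice and not necessarily the method used for the slides; the four-dimensional sample points are invented for illustration.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster-to-cluster distance is the smallest Euclidean distance
    between any member of one cluster and any member of the other.
    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])  # works for any dimension
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Five points in four dimensions: two tight groups and one outlier.
points = [
    (1.0, 1.0, 1.0, 1.0),
    (1.1, 0.9, 1.0, 1.2),
    (5.0, 5.0, 5.0, 5.0),
    (5.3, 4.9, 5.1, 5.0),
    (20.0, 0.0, 20.0, 0.0),
]

print(single_linkage(points, 3))
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, yields the information needed to draw a dendrogram.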
Pearson Correlation Coefficient
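The Pearson correlation coefficient compares the shape of two profiles rather than their absolute distance. The sketch below assumes the standard definition (covariance divided by the product of the standard deviations); the two gene profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Two hypothetical genes with the same trend at very different magnitudes.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

print(pearson(gene_a, gene_b))      # same trend: correlation close to 1
print(math.dist(gene_a, gene_b))    # yet the Euclidean distance is large
```

Using 1 minus the correlation as the distance groups genes that rise and fall together even when their expression levels differ in scale.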
Two-fold Cluster Analysis
Gene expression analysis in drug development can involve a large number of genes
and a large number of drugs. It is important to identify not only which genes cluster
together, but also which drugs cluster. This is done by two-fold cluster analysis.
The genes are arranged and clustered, as are the drugs. Drugs that elicit similar
gene expression patterns will cluster. Both clusters can be viewed in a single 2-D
dendrogram.
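Two-fold clustering can be sketched by clustering the rows (genes) of an expression matrix and then, separately, its columns (drugs). This example assumes single-linkage clustering with Euclidean distance and uses a small invented matrix; the function names are hypothetical.

```python
import math

def cluster_order(profiles):
    """Single-linkage clustering; returns indices in merged (leaf) order."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters[0]

def two_fold(matrix):
    """Cluster rows (genes) and columns (drugs) of the same matrix."""
    gene_order = cluster_order(matrix)
    drug_order = cluster_order([list(col) for col in zip(*matrix)])
    return gene_order, drug_order

# Hypothetical expression matrix: rows are genes, columns are drugs.
matrix = [
    [2.0, 0.1, 1.9, 0.2],
    [0.1, 2.1, 0.2, 1.8],
    [2.2, 0.0, 2.1, 0.1],
    [0.0, 2.0, 0.1, 1.9],
]

gene_order, drug_order = two_fold(matrix)
# Reordering rows and columns places similar genes and similar drugs side by side.
reordered = [[matrix[g][d] for d in drug_order] for g in gene_order]
for row in reordered:
    print(row)
```

In the reordered matrix, genes 0 and 2 sit together, as do drugs 0 and 2, so blocks of similar expression become visible, which is what the combined 2-D dendrogram display shows.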
Questions?
Cluster Tree of cell lines
Classifying Cancer
Using supervised learning, models have been developed for:
Classifying different subsets of cancers that the pathologist
can’t
Predicting response to therapy and patient prognosis
Any kind of data can be explored
Cell response profile
Monks et al. Anti-Cancer Drug Design 12:553 (1997)
Drug clusters correspond to drug targets or mechanisms
of action, not necessarily drug structure.
Scherf et al. Nature Genetics 24:236 (2000)
Exploratory tools allow us to focus on
what is most relevant based on the data
and develop relevant hypotheses
For example
Geldanamycin is cytotoxic through
inhibition of microtubules
The End
Any Questions?