Analysis of Protein Geometry, Particularly Related to Packing at the
Mark Gerstein, Yale University
gersteinlab.org/courses/452
(last edit in fall '06, handout version, including in-class changes)
1 (c) M Gerstein, 2006, Yale, gersteinlab.org
BIOINFORMATICS
Summary
Used in class M11
[2006,12.06]
[From S Harris's Science Cartoons,
http://www.sciencecartoonsplus.com]
You'll Forget…
… So I'll distill
What is Bioinformatics?
• (Molecular) Bio - informatics
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of
molecules (in the sense of physical chemistry) and
then applying “informatics” techniques (derived
from disciplines such as applied math, CS, and
statistics) to understand and organize the
information associated with these molecules, on a
large scale.
• Bioinformatics is “MIS” for Molecular Biology
Information. It is a practical discipline with many
applications.
Data Types
Sequences
Structures
Functional Genomics
"Core" Bioinformatics
• Core Stuff
Computing with sequences and structures
protein structure prediction
biological databases and mining them
• New Stuff: Networks and Expression Analysis
Will teach these in CS 545 (Data Mining) next semester
• Fairly Speculative: simulating cells
Hierarchical Structure of Course Information
• Memorize the previous summary
• Good familiarity with main points in lectures (quizzes)
• Rest of overheads and readings for reference on projects and …
Cross-cutting Themes
• Algorithms for Comparison
Dynamic programming
Different measures of similarity
(RMS vs. structural similarity; PAM & BLOSUM vs. %ID)
Generalized similarity matrix in threading
Statistical scoring schemes (with P-values)
For sequences, structures, sequence to structure, and even
expression data
Time complexity of the comparisons
• Predictions
LOD scores (# with features / expectation)
Progressively more complex features
Amount of feature information IN vs. prediction OUT
Testing against benchmarks with cross-validation
(sec. struc. prediction, seq. comparison scoring, datamining)
Other methods, need for heuristics
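As a reminder of what the dynamic programming theme looks like in practice, here is a minimal sketch of a global alignment score in the Needleman-Wunsch style. The function name and the match, mismatch, and gap values are illustrative assumptions, not the scoring used in the lectures; a real comparison would use a substitution matrix such as PAM or BLOSUM.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    Illustrative scoring only (assumed values): +1 match, -1 mismatch,
    -2 per gap position. Time and memory are O(len(a) * len(b)).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: align the two residues, or gap in either string.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # residue aligned to residue
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

The two nested loops are where the quadratic time complexity of the comparison comes from, which is one reason heuristic methods are needed for large database searches.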
Cross-cutting Themes
• Increasing the chemical reality and complexity
of genes
Character strings, fold (just CAs), volumes and surfaces from all-atom
representation, energy and minimization, dynamics (time
and velocity)
• Simulation
Vector configuration boiled down to a scalar E through a potential
Compute-intensive exploration of configurations (MC, MD)
Averages over correctly weighted configurations
Importance of simplification
• The Survey Mode
Collecting information in DB tables
Importance of integration and interoperation
Organizing it around "part" classifications
Surveying it for useful statistics (taking into account biases)
Doing datamining to find more tenuous relationships
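The simulation theme (a configuration reduced to a scalar E, explored by MC, then averaged with the correct weights) can be sketched with a Metropolis Monte Carlo loop. This is a toy under stated assumptions: a one-dimensional harmonic potential stands in for a real force field, and the function names, step size, and kT = 1 are hypothetical choices for illustration.

```python
import math
import random


def metropolis_average(energy, observable, x0=0.0, kT=1.0, step=1.0,
                       n_steps=100_000, seed=0):
    """Average an observable over configurations sampled with Boltzmann weights.

    Toy one-dimensional Metropolis sampler; all parameters are illustrative.
    """
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    total = 0.0
    for _ in range(n_steps):
        # Propose a small random move and compute its scalar energy.
        x_new = x + rng.uniform(-step, step)
        e_new = energy(x_new)
        # Metropolis criterion: always accept downhill moves, accept uphill
        # moves with probability exp(-dE / kT).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
        # Rejected moves count the current configuration again.
        total += observable(x)
    return total / n_steps


def harmonic(x):
    """Assumed toy potential E(x) = x^2 / 2."""
    return 0.5 * x * x


mean_x2 = metropolis_average(harmonic, lambda x: x * x)
```

Because the acceptance rule already samples configurations in proportion to exp(-E/kT), a plain running average is the correctly weighted one. For this potential the average of x squared should come out close to kT.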
Anti-Themes
Depth vs. Breadth
Historical Perspective
(timeline axis: 1980, 1985, 1990, 1995, 2000, 2005)
• Single Structures
Modeling & Geometry
Forces & Simulation
Docking
• Sequences, Sequence-Structure Relationships
Alignment
Structure Prediction
Fold recognition
• Genomics
Dealing with many sequences
Gene finding & Genome Annotation
Databases
• Integrative Analysis
Expression & Proteomics Data
Datamining
Simulation again….
(from CooperToons, http://members.aol.com/ChipCooper/cartoon26.html)