Analysis of Protein Geometry, Particularly Related to Packing at the


Mark Gerstein, Yale University
gersteinlab.org/courses/452
(last edit in fall '06, handout version, including in-class changes)
1 (c) M Gerstein, 2006, Yale, gersteinlab.org
BIOINFORMATICS
Summary
Used in class M11
[2006,12.06]
[From S Harris's Science Cartoons,
http://www.sciencecartoonsplus.com]
You'll Forget…
… So I'll distill
What is Bioinformatics?
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of
molecules (in the sense of physical-chemistry) and
then applying “informatics” techniques (derived
from disciplines such as applied math, CS, and
statistics) to understand and organize the
information associated with these molecules, on a
large scale.
• Bioinformatics is “MIS” for Molecular Biology
Information. It is a practical discipline with many
applications.
• (Molecular) Bio - informatics
Data Types
• Sequences
• Structures
• Functional Genomics
"Core" Bioinformatics
• Core Stuff
  ▪ Computing with sequences and structures
  ▪ protein structure prediction
  ▪ biological databases and mining them
• New Stuff: Networks and Expression Analysis
  ▪ Will teach these in CS 545 (Data Mining) next semester
• Fairly Speculative: simulating cells
Hierarchical Structure of Course Information
• Memorize the previous summary
• Good familiarity with main points in lectures (quizzes)
• Rest of overheads and readings for reference on projects and …
Cross-cutting Themes
• Algorithms for Comparison
  ▪ Dynamic programming
    • Different measures of similarity
      (RMS vs. structural similarity; PAM & BLOSUM vs. %ID)
    • Generalized similarity matrix in threading
  ▪ Statistical scoring schemes (with P-values)
    • For sequences, structures, sequence to structure, and even expression data
  ▪ Time complexity of the comparisons
  ▪ Other methods, need for heuristics
• Predictions
  ▪ LOD scores (# with features / expectation)
  ▪ Progressively more complex features
  ▪ Amount of feature information IN vs. prediction OUT
  ▪ Testing against benchmarks with cross-validation
    (sec. struc. prediction, seq. comparison scoring, datamining)
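As a reminder of what the dynamic-programming theme looks like in practice, here is a minimal sketch of global sequence alignment scoring (Needleman-Wunsch style). The function name and the match/mismatch/gap values are illustrative assumptions, not taken from the lectures; a real comparison would typically look up residue pairs in a substitution matrix such as PAM or BLOSUM instead of the simple match/mismatch rule used here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    Assumed scoring (illustrative only): fixed match/mismatch scores
    and a linear gap penalty.
    """
    m, n = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    # Fill the table: each cell depends only on its three neighbors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[m][n]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The double loop visits every cell of an (m+1) by (n+1) table once, which is where the O(mn) time complexity of the comparison comes from, and why heuristics become attractive when searching large databases.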
Cross-cutting Themes
• Increasing the chemical reality and complexity of genes
  ▪ Character strings, fold (just CAs), volumes and surfaces from all-atom
    representation, energy and minimization, dynamics (time and velocity)
• Simulation
  ▪ Vector configuration boiled down to a scalar E through a potential
  ▪ Compute-intensive exploration of configurations (MC, MD)
  ▪ Averages over correctly weighted configurations
  ▪ Importance of simplification
• The Survey Mode
  ▪ Collecting information in DB tables
  ▪ Importance of integration and interoperation
  ▪ Organizing it around "part" classifications
  ▪ Surveying it for useful statistics (taking into account biases)
  ▪ Doing datamining to find more tenuous relationships
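The Simulation theme (a configuration reduced to a scalar E, Monte Carlo exploration, averages over correctly weighted configurations) can be illustrated with a minimal Metropolis sketch. Everything specific here is an assumption made for illustration: the toy one-dimensional harmonic potential, the function names, the step size, and the number of steps. It is not the simulation setup used in the course.

```python
import math
import random


def energy(x):
    """Toy potential (an assumption): 1-D harmonic well, E(x) = x^2 / 2."""
    return 0.5 * x * x


def metropolis_average(observable, n_steps=50000, kT=1.0, step=1.5, seed=0):
    """Estimate the Boltzmann-weighted average of observable(x).

    Configurations are visited with probability proportional to
    exp(-E/kT), so a plain running mean over the visited states gives
    the correctly weighted average.
    """
    rng = random.Random(seed)
    x = 0.0
    e = energy(x)
    total = 0.0
    for _ in range(n_steps):
        # Propose a small random displacement of the configuration.
        x_new = x + rng.uniform(-step, step)
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-(E_new - E) / kT).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
        # Rejected moves count the current state again.
        total += observable(x)
    return total / n_steps


if __name__ == "__main__":
    print(metropolis_average(lambda x: x * x))
```

For this particular potential the exact average of x^2 equals kT, so the printed estimate should land near 1.0 with the default settings. The simplification to one coordinate is deliberate: the same accept/reject rule applies unchanged when x is a full vector of atomic coordinates.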
Anti-Themes
Depth vs. Breadth
Historical Perspective
(timeline axis: 1980, 1985, 1990, 1995, 2000, 2005)
• Single Structures
  ▪ Modeling & Geometry
  ▪ Forces & Simulation
  ▪ Docking
• Sequences, Sequence-Structure Relationships
  ▪ Alignment
  ▪ Structure Prediction
  ▪ Fold recognition
• Genomics
  ▪ Dealing with many sequences
  ▪ Gene finding & Genome Annotation
  ▪ Databases
• Integrative Analysis
  ▪ Expression & Proteomics Data
  ▪ Datamining
  ▪ Simulation again….
(from CooperToons, http://members.aol.com/ChipCooper/cartoon26.html)