intro - EECS Department

EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Administrative
Register for 3 hours of credit
8/21/2006
Introduction
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide2
Me
Luke Huan, assistant prof. in Electrical Engineering &
Computer Science
Homepage: http://people.eecs.ku.edu/~jhuan/
Office: 2304 Eaton Hall
Email: [email protected]
Office hour:
10:00 – 11:00am Monday and Wednesday
My Lecture Style
I may tend to talk fast, especially when excited
Class materials are highly interdisciplinary
Use your questions to slow me down
Ask for clarification, repetition of an unfamiliar phrase, or an explanation of jargon
“If in doubt, speak it out”
You
Introduction:
Who you are
What department you are in
Why you are taking the course
Outline for Today
What is mining biological data?
What is this course about?
Course home page
Course references
Paper presentation
Final project
Grading
Forward class reviewing
What is Mining Biological Data
Goal: understanding the structure of biological data
Patterns
Descriptive models
Predictive models
Challenges:
What is the nature of the data?
What are the computational tasks?
How to break a task into a group of computational components?
How to evaluate the computational results?
Applications
Experimental design and hypothesis generation
Synthesis of novel proteins
Drug design
…
What is this Course About?
Learning…
Problems in mining biological data
Available techniques, their pros and cons
How to combine techniques together
Enough perception to avoid pitfalls
Practicing…
To present recent papers on a selected topic
To work on a project that may involve
A domain expert,
A driving biological problem, and
The development of new data mining techniques
Class Information
Class Homepage: http://people.eecs.ku.edu/~jhuan/fall06.html
Meeting time: 9:00 – 9:45 Monday, Wednesday, Friday
Meeting place: Eaton Hall 2001
Prerequisite: none
Textbook & References
Textbook: none
References
Data Mining --- Concepts and Techniques, by Han and Kamber,
Morgan Kaufmann, 2001. (ISBN: 1-55860-489-8)
The Elements of Statistical Learning --- Data Mining, Inference,
and Prediction, by Hastie, Tibshirani, and Friedman, Springer,
2001. (ISBN:0-387-95284-5)
Bioinformatics: Genes, Proteins, and Computers, edited by
Christine Orengo, David Jones, Janet Thornton, Bios Scientific
Publishers, 2003. (ISBN: 1-85996-0545)
Paper Presentation
One per student
Research paper(s)
A list of recommendations will be posted on the class webpage a week from now
Your own pick (upon approval)
Three parts
Review the goal of the paper(s)
Discuss the research challenges
Present the techniques and comment on their pros and cons
Questions and comments from audience
Extra credit for active participants in class discussions
Order of presentation: first come, first pick
Please send in your choice of paper by September 1st.
Final Project
Project (due Nov. 27th)
One project
I will post some suggestions on the class website.
I am soliciting projects from researchers on campus
You are welcome to propose your own
Discuss with me before you start
Checkpoints
Proposal: title and goal (due Sep. 8th)
Background and related work (due Sep. 29th)
Outline of approach (due Oct. 20th)
Implementation & Evaluation (due Nov. 10th)
Class demo (due Nov. 27th)
Grading
Grading scheme
Paper presentation and discussion: 45%
Project: 45%
Attendance and participation: 10%
No homework
No exam
Forward Class Reviewing
This is for overview, not content
Don’t worry if you do not understand some of the words; that’s
why you want to take this class.
Gives an idea of what is coming
Order of presentation might be shuffled to accommodate
everyone’s schedule
Topics may be adjusted as the class progresses
Week 1: Pattern Mining
Frequent patterns: finding regularities in data
A frequent pattern (a set of items) is one that occurs frequently in a
data set
Can we automatically profile customers?
What products are often purchased together?
Customer shopping baskets
ID    Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
One hypothesis: {a, c} → {m}
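The hypothesis above can be checked against the five baskets. The sketch below uses the standard support and confidence measures from association rule mining, which the slide itself does not name; the function names `support` and `confidence` are illustrative choices, not part of the course material.

```python
# Toy transactions copied from the shopping-basket table (item names lowercased).
transactions = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"a", "c", "m"}, transactions)
rule_confidence = confidence({"a", "c"}, {"m"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
# prints: support = 0.60, confidence = 1.00
```

In this toy data set every basket that contains both a and c (IDs 100, 200, and 500) also contains m, which is why the rule holds with confidence 1.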
Week 2: Advanced Pattern Mining
Reducing number of patterns
Maximal patterns and closed patterns
Constraint-based mining
Patterns with concept hierarchy
Patterns in quantitative data
Correlation vs. association
Week 3: Mining Microarray
Gene    CH1I   CH1B   CH1D   CH2I   CH2B
CTFC3   4392    284   4108    280    228
VPS8     401    281    120    275    298
EFB1     318    280     37    277    215
SSA1     401    292    109    580    238
FUN14   2857    285   2576    271    226
SP07     228    290     48    285    224
MDM10    538    272    266    277    236
CYS3     322    288     41    278    219
DEP1     312    272     40    273    232
NTG1     329    296     33    274    228
Data from: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R.,
Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998),
“Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast
Saccharomyces cerevisiae by Microarray Hybridization”,
Molecular Biology of the Cell, 9, 3273-3297.
Week 4: Patterns in Sequences,
Trees, and Graphs
[Figure: three labeled graphs (G1, G2, G3) and six subgraph patterns (P1 to P6); each pattern is annotated with its frequency, f = 2/3 or f = 3/3]
Week 5: Pattern Discovery in Biomolecules
Protein
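A sequence over an alphabet of 20 amino acids
Example residues shown: Lys, Lys, Gly, Gly, Leu, Val, Ala, His
Adopts a stable 3D structure that can be measured experimentally
[Figure: 3D views of a protein structure (cartoon, space-filling, surface, ribbon); atom legend: oxygen, nitrogen, carbon, sulfur]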
Week 6: Descriptive Models
Group objects into clusters
Ones in the same cluster are similar
Ones in different clusters are dissimilar
Unsupervised learning: no predefined classes
[Figure: example data with two clusters (Cluster 1, Cluster 2) and outliers]
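As a rough illustration of grouping objects so that members of the same cluster are similar, here is a minimal sketch of k-means. k-means is swapped in as one familiar clustering algorithm; the slide does not name a specific method, and the points and starting centroids below are hypothetical.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points (keep it if empty)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D points forming two tight groups; no class labels are given.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

The starting centroids are fixed here so the run is deterministic; in practice the result of k-means depends on initialization, and outliers can pull centroids away from the bulk of a cluster.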
Week 7: Subspace Clustering
[Table: ratings of Movies 1 to 7 by Viewers 1 to 5]
Week 7: Subspace Clustering
[Figure: ratings (0 to 8) given by viewer 1, viewer 3, and viewer 4 to movie 1, movie 2, movie 4, and movie 6]
Week 8: Mining Microarray (II)
Apply subspace clustering to microarray analysis
Find groups of genes that are co-regulated
May integrate data from protein sequences and functional
description of genes
Apply subgraph mining to microarray analysis
Week 9: Predictive Models
Two-class version:
Using “training data” from Class +1 and Class -1
Develop a “rule” for assigning new data to a Class
Slides from J.S. Marron in Statistics at UNC
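One very simple "rule" of this kind is to assign a new point to the class whose training mean is closer. The sketch below is only an illustration under that assumption (a mean-difference, or nearest-centroid, rule); it is not taken from the slides, and the data and function names are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_mean_difference(positives, negatives):
    """Build a linear rule sign(w . x - b) from the two class means."""
    mu_pos, mu_neg = mean(positives), mean(negatives)
    # Direction pointing from the Class -1 mean to the Class +1 mean.
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    # Place the decision boundary through the midpoint of the two means.
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    b = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def classify(x, w, b):
    """Assign +1 or -1 depending on which side of the boundary x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score >= 0 else -1


# Hypothetical training data for Class +1 and Class -1.
class_pos = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0)]
class_neg = [(-2.0, -1.5), (-3.0, -2.0), (-2.5, -2.5)]
w, b = fit_mean_difference(class_pos, class_neg)
print(classify((1.0, 1.5), w, b), classify((-1.0, -0.5), w, b))  # prints 1 -1
```

The rule is learned only from the training data and is then applied unchanged to new points, which is the pattern the two-class setting describes.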
Week 10: Classification Algorithms and
Applications
Decision tree
Fisher’s linear discriminant method
Kernel methods
Week 11: Text Mining, Gene Ontology,
Data Management
Ontology seeks to describe or posit the basic categories
and relationships of being or existence to define entities
and types of entities within its framework. Ontology can
be said to study conceptions of reality (Wikipedia).
GO is a database of terms for genes
Terms are connected as a directed acyclic graph
Levels represent specificity of the terms (not normalized)
GO contains three different sub-ontologies:
Molecular function
Biological process
Cellular component
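To make the directed acyclic graph structure concrete, here is a small sketch that stores terms with their parent links and walks upward from a term. The term names and edges form a toy fragment made up for illustration; they are not copied from the GO database, and `ancestors` and `depths` are hypothetical helper names. It also shows one reading of "not normalized": a term with several parents can sit at more than one depth.

```python
# Toy fragment shaped like a GO sub-ontology; names and edges are illustrative only.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "protein metabolic process": ["metabolic process"],
    "cellular protein metabolic process": [
        "protein metabolic process",
        "cellular process",
    ],
}


def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent edges upward from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def depths(term, parents=PARENTS):
    """Set of path lengths from `term` up to the root; a DAG can yield several."""
    if not parents[term]:
        return {0}
    return {d + 1 for parent in parents[term] for d in depths(parent, parents)}


print(sorted(ancestors("cellular protein metabolic process")))
print(sorted(depths("cellular protein metabolic process")))  # prints [2, 3]
```

Because a term can have more than one parent, annotating a gene with a specific term implicitly annotates it with every ancestor of that term as well.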
Week 12: Systems Biology & Proteomics
Part of the biological system in a cell at the molecular level
A proteome is the set of all proteins in an organism
[Figure: pathway diagram ending in apoptosis and cell proliferation; labeled nodes: mitogen, FAS-L, FAS, FADD/MORT, FLICE, ICE, CPP32, IGF1, IGF1R, IRS1, IL-3, IL-3R, RAS, PI 3-K, AKT/PKB, BAD, Bcl-XL, P53, P21, P16, P27, P107, pRb, E2F, Bin-1, C-Myc, Max, Mad, Cdc25A, Cyclin D1, Cdk4, Cyclin E, Cdk2]
Source: http://www.ircs.upenn.edu/modeling2001/
Week 13: Analyzing Biological Networks
Biological networks pose serious challenges and opportunities for
data mining research in computer science
Large volume of data
Heterogeneous data types
[Figure: protein-protein interaction in yeast. Source: Gary D. Bader & Christopher W.V. Hogue, Nature Biotechnology 20, 991 - 997 (2002)]
[Figure: growth of known structures in the Protein Data Bank (PDB); x-axis: year; y-axis: # of structures, with an axis tick at 35,000]
Week 14: bio-Data Integration
Data are collected from many different sources
Each piece of data describes part of a complicated (and
not directly observable) biological process
Combine data together to achieve better understanding
and better prediction
Week 15, 16: Project Presentation
Check what you have learned from the class
Celebrate the hard work!
Further References
Data mining
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM,
PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, IEEE-TKDD
Bioinformatics
Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc.
Journals: Bioinformatics, J. of Computational Biology, etc.
Further References
AI & Machine Learning
Conferences: Machine learning (ICML), AAAI, IJCAI, etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Database systems
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT,
ICDT, etc.
Journals: ACM-TODS, IEEE-TKDE, etc.
Visualization
Conference proceedings: IEEE Visualization, ACM-SIGGRAPH, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.