credia - Computer Science - Worcester Polytechnic Institute

WPI Center for Research in Exploratory Data and Information Analysis
Research Bytes 2004
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute
CREDIA
Need for Data Mining
• Data are being gathered and stored extremely fast
– Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman & Varian at UC-Berkeley in 2000).
• Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data
What is Data Mining?
or more generally, Knowledge Discovery in Databases (KDD)
“Non-trivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data”
[Fayyad et al. 1996]
• Raw Data → Data Mining → Patterns
» Analytical and Statistical Patterns (rules, decision trees, …)
» Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, 17(3), pp. 37-54, Fall 1996.
Data Analysis (KDD) Process
• data sources → data management
– databases
– data warehouses
• data “pre”processing → clean data
– noisy/missing data
– dim. reduction
• data analysis / data mining → models
– analytical
– statistical
– visual
• model/pattern evaluation → “good” model
– quantitative
– qualitative
• model/patterns deployment on new data
– prediction
– decision support
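The process stages (preprocessing, mining, evaluation, deployment) can be sketched end to end on a toy example. This is only an illustrative sketch: the records, the function names, and the one-threshold "model" are hypothetical and are not taken from the CREDIA projects. For brevity the model is evaluated on the same records it was built from; a real evaluation would use held-out data.

```python
# Toy KDD pipeline: preprocess -> mine -> evaluate -> deploy.
# All data and names here are hypothetical illustrations.

def preprocess(rows):
    """Cleaning step: drop records that contain a missing value (None)."""
    return [r for r in rows if None not in r]

def mine(rows):
    """Mining step: pick the threshold on x that classifies most records correctly."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in rows}):
        correct = sum((x >= threshold) == label for x, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def evaluate(threshold, rows):
    """Quantitative evaluation: fraction of records classified correctly."""
    return sum((x >= threshold) == label for x, label in rows) / len(rows)

def deploy(threshold, x):
    """Deployment: use the accepted model to predict the label of new data."""
    return x >= threshold

# Each record is (measurement, class label); one record has a missing value.
raw = [(1.0, False), (2.0, False), (None, True), (3.5, True), (4.0, True)]

clean = preprocess(raw)
model = mine(clean)
accuracy = evaluate(model, clean)
prediction = deploy(model, 3.8)
print(model, accuracy, prediction)
```

If the evaluation step judges the model not “good” enough, the process loops back to an earlier stage (different preprocessing or a different mining technique) before anything is deployed.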
KDD is Interdisciplinary
Techniques come from multiple fields:
• Machine Learning (AI)
– Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
– Contributes language, framework, and techniques
• Pattern Recognition
– Contributes pattern extraction and pattern matching techniques
• Databases
– Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
– Contributes visual data displays and data exploration
• High Performance Comp.
– Contributes techniques for efficiently handling complexity
• Application Domain
– Contributes domain knowledge
What do you want to learn from your data?
KDD approaches
• regression
• classification (rules of the form IF A & B THEN …, IF A & D THEN …)
• clustering
• change/deviation detection
• summarization
• dependency/assoc. analysis
– IF a & b & c THEN d & k
– IF k & a THEN e
– A, B -> C 80%
– C, D -> A 22%
Some Current Analytical Data Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
– Systems Performance Data
– Sleep Data
– Financial Data
– Web Data
• Data Mining for Genetic Analysis
– Correlating genetic information with diseases
– Predicting gene expression patterns
• Data Mining for Electronic Commerce
– Collaborative and Content-Based Filtering
• Using Association Rules and using Neural Networks
Analyzing Sleep Data
Purpose:
• Associations between sleep patterns and health/pathology
• Obtain patterns of different sleep stages (4 sleep + REM + Wake)
DATA SET
Clinical (sequential)
– Electro-encephalogram (EEG)
– Electro-oculogram (EOG)
– Electro-myogram (EMG)
– Probe measuring flow of oxygen in blood, etc.
Diagnostic (tabular)
– Questionnaire responses
– Patient’s demographic info.
– Patient’s medical history
(Source: http://www.blsc.com)
Potential Rules:
(A) Association Rules
(Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy
confidence = 92%, support = 13%
(B) Classification Rules
(snoring = HEAVY) & (AHI* > 30/hour): severe OSA**
=> (Race = Caucasian) confidence = 70%, support = 8%
WPI, UMass Medical, BC
*AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea
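The support and confidence figures attached to the rules above follow the usual association-rule definitions: support is the fraction of records containing every item in the rule, and confidence is the fraction of records matching the left-hand side that also match the right-hand side. A minimal sketch, assuming each patient record is a set of items; the records and item names below are invented for illustration and are not data from the sleep study:

```python
# Hypothetical patient records, each a set of observed items.
records = [
    {"latency<3min", "hereditary_disorder", "narcolepsy"},
    {"latency<3min", "hereditary_disorder", "narcolepsy"},
    {"latency<3min", "hereditary_disorder"},
    {"heavy_snoring", "narcolepsy"},
    {"heavy_snoring"},
]

def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    return sum(itemset <= record for record in records) / len(records)

def confidence(antecedent, consequent, records):
    """Of the records matching `antecedent`, the fraction also matching `consequent`."""
    return support(antecedent | consequent, records) / support(antecedent, records)

antecedent = {"latency<3min", "hereditary_disorder"}
consequent = {"narcolepsy"}

rule_support = support(antecedent | consequent, records)
rule_confidence = confidence(antecedent, consequent, records)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here two of the five records contain all three items, and three records match the left-hand side, so the rule has support 2/5 and confidence 2/3.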
Input Data
• Each instance: [Tabular | set | sequential] * attributes
     attr1                        attr2            attr3  attr4               attr5   [class]
     illnesses                    heart rate       age    oxygen              gender  Epworth
P1   {depression, fatigue}        (not shown)      27     (not shown)         M       5
P2   {stroke, dementia, fatigue}  97,72,67,80,…    73     90,92,96,89,86,…    F       23
P3   {arthritis}                  102,99,87,96,…   49     97,100,82,80,70,…   M       14
…    …                            …                …      …                   …       …
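One way to hold such a mixed instance in code is a record type whose fields are a set, sequences, and plain scalars. This is a sketch only: the class name `Instance` and its field names are assumptions made for illustration, and the sequences for P3 contain just the values visible on the slide, which truncates them with "…".

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class Instance:
    """One patient record mixing set, sequential, and tabular attributes."""
    illnesses: FrozenSet[str]     # set-valued attribute
    heart_rate: Tuple[int, ...]   # sequential attribute
    age: int                      # tabular attribute
    oxygen: Tuple[int, ...]       # sequential attribute
    gender: str                   # tabular attribute
    epworth: int                  # class attribute

# P3 from the table; only the visible prefix of each sequence is used.
p3 = Instance(
    illnesses=frozenset({"arthritis"}),
    heart_rate=(102, 99, 87, 96),
    age=49,
    oxygen=(97, 100, 82, 80, 70),
    gender="M",
    epworth=14,
)

# Set and sequence attributes can be queried directly.
has_arthritis = "arthritis" in p3.illnesses
lowest_oxygen = min(p3.oxygen)
print(has_arthritis, lowest_oxygen)
```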
Analyzing Financial Data
• Sequential data – daily stock values
• “Normal” (tabular/relational) data
– sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
– If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases
Events – Financial Data
Basic events: 16 or so financial templates [Little & Rhodes 78]
Difficult pattern matching: alignments and time warping
Panic Reversal
Rounding Top Reversal
Head & Shoulders Reversal
Descending Triangle Reversal
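Matching a price series against a template is hard because the same shape can be stretched or compressed in time. Dynamic time warping (DTW) is one standard way to compare two sequences under such stretching. The sketch below is a generic textbook DTW, not necessarily the matching method used in this project, and the template and series values are invented for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance, O(len(a) * len(b)) time."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] aligned with an already used point of b
                cost[i][j - 1],      # b[j-1] aligned with an already used point of a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]

template = [1, 2, 3, 2, 1]          # hypothetical "top" shape
stretched = [1, 1, 2, 3, 3, 2, 1]   # same shape, stretched in time
different = [3, 2, 1, 2, 3]         # opposite shape

print(dtw_distance(template, stretched))
print(dtw_distance(template, different))
```

The stretched series aligns with the template at zero cost, while the opposite shape does not, which is the behavior a template matcher needs.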
Closer Look: WPI Weka
Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI),
and Alvarez (CS, BC)
• SNP analysis
– discovering correlations between
sequence variations and diseases
• Gene expression
– discovering patterns that cause a gene
to be expressed in a particular cell
Correlating Genetics with Diseases
• Utilize data mining techniques with actual genetic data sampled from research
• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.
Genomic Data Resources
Patient Gender   SMA Type (Severity)   SNP Location   C212 (Father / Mother)   AG1-CA (Father / Mother)
Female           Severe                Y272C          31 / 28 29               102 / 108 112
Male             Mild                  Y272C          28 29 / 25               108 112 / 114
Wirth, B. et al. Journal of Human Molecular Genetics
Our System: CAGE
To predict gene expression based on DNA sequences.
[Figure: CAGE takes Gene 1, Gene 2, and Gene 3 in a Muscle Cell, a Neural Cell, and Seam Cells, and labels each gene On or Off.]
Grad. & Undergrad. Students
• Ali Benamara
• Dharmesh Thakkar
• Senthil K Palanisamy
• Zachary Stoecker-Sylvia
• Keith A. Pray
• Jonathan Freyberger
• Maged El-Sayed
• Parameshvyas Laxminarayan
• Aleksandar Icev
• Wendy Kogel
• Michael Sao Pedro
• Christopher Shoemaker
• Weiyang Lin

• Jonathan Rudolph
• Eduardo Paredes
• Iavor N. Trifonov
• Takeshi Kawato
• Cindy Leung and Sam Holmes
• John Baird, Jay Farmer, Rebecca Gougian, Ken Monterio, Paul Young
• Zachary Stoecker-Sylvia
• Kristin Blitsch, Ben Lucas, Sarah Towey
• Wendy Kogel, Brooke LeClair, Christopher St. Yves
• Brian Murphy, David Phu, Ian Pushee, Frederick Tan
• Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano
• Christopher Cole
• Michael Ciman and John Gulbrandsen
• Tara Halwes
• Christopher Martino
• Matthew Berube
• Anna Novikov
• Amy Kao and Dana Rock