PPT - the Department of Computer Science

Download Report

Transcript PPT - the Department of Computer Science

Part I: Introductory Materials
Introduction to Data Mining
Dr. Nagiza F. Samatova
Department of Computer Science
North Carolina State University
and
Computer Science and Mathematics Division
Oak Ridge National Laboratory
What is common among all of them?
2
Who are the data producers? What data?
Application  Data
• Application Category: Finance
• Producer: Wall Street
• Data: stocks, stock prices, stock purchases,…
• Application Category: Academia
• Producer: NCSU
• Data: students admission data (name, DOB, GRE
scores, transcripts, GPA, university/school
attended, recommendation letters, personal
statement, etc.
3
Application Categories
•
•
•
•
•
•
•
Finance (e.g., banks)
Entertainment (e.g., games)
Science (e.g., weather forecasting)
Medicine (e.g., disease diagnostics)
Cybersecurity (e.g., terrorists, identity theft)
Commerce (e.g., e-Commerce)
…
4
What questions to ask about the data?
DataQuestions
• Academia:NCSU:Admission data
1.
Is there any correlation between the students’ GRE scores and
their successful completion of a PhD program?
2. What are the groups of students that share common academic
performance?
3. Are there any admitted students who would stand out as an
anomaly? What type of anomaly is that?
4. If the student majors in Physics, what other major is he/she
likely double-major?
5
Questions by Types?
•
•
•
•
•
•
•
•
Correlation, similarity, comparison,…
Association, causality, co-occurrence,…
Grouping, clustering,…
Categorization, classification,…
Frequency or rarity of occurrence,…
Anomalous or normal objects, events, behaviors,
Forecasting: future classes, future activity,…
…
6
What information we need to answer?
QuestionsData Objects and Object Features
• Academia:NCSU:Admission data
–
–
Objects: Students
Object’s Features=Variables=Attributes=Dimensions & Types
• Name:String (e.g., Name=Neil Shah)
• GPA:Numeric (e.g., GPA=5.0)
• Recommendation:Text (e.g., … the top 2% in my career…)
• Etc.
7
How to compare two objects?
Data Object  Object Pairs
• Academia:NCSU:Admission data
–
–
Objects: Students
Based on a single feature:
• Similar GPA
• The same first letter in the last name
– Based on a set of features:
• Similar academic records (GPA, GRE, etc.)
• Similar demographic records
– Can you compute a numerical value for your similarity measure
used for comparison? Why or Why not?
8
How to represent data mathematically?
Data Object & its Features Data Model
• What mathematical objects have you studied?
–
–
–
–
–
–
–
–
–
–
–
Scalar
Points
Vectors
Vector spaces
Matrices
Sets
Graphs, networks (maybe)
Tensors (maybe)
Time series (maybe)
Topological manifolds (maybe)
…
9
Data object as vector with components…
Vector components:
• Features, or
City=(Latitude, Longitude)--2-dimensional object
• Attributes, or
Raleigh=(35.46, 78.39) Boston=(42.21, 71.5)
• Dimensions
Proximity(Raleigh, Boston)=?
• Geodesic distance
• Euclidean distance
• Length of the interstate route
10
A set of data objects as vector spaces
3-dimensional vector space
Altitude
Moscow
Raleigh
Latitude
Longitude
Mining such data ~ studying vector spaces
11
Multi-dimensional vectors…
Vector components:
• Features, or
• Attributes, or
• Dimensions
Student=(Name, GPA, Weight, Height, Income in K, …) - mutli-dimensional
S1=(John Smith, 5.0, 180, 6.0, 200)
S2=(Jane Doe, 3.0, 140, 5.4, 70)
Proximity(S1, S2)=?
• How to compare when vector components are
of heterogeneous type, or different scales?
• How to show the results of the comparison?
12
as matrices…
Example: A collection of text documents on the Web
Original Documents
Parsed Documents
D1: Child Safety at Home
D2: Infant & Toddler First Aid
Your Baby's Health and
D3: Safety: From Infant to
Toddler
D1: Child Safety Home
D2: Infant Toddler
Bab Health Safety Infant
D3:
Toddler
t-d term-document matrix
Terms=Features=Dimensions
T1:
T2:
T3:
T4:
T5:
T6:
T7:
D1:
0
1
0
1
0
1
0
D2:
0
0
0
0
1
0
1
D3:
1
0
1
0
1
1
1
T1:
T2:
T3:
T4:
T5:
T6:
T7:
Mining such data ~ studying matrices
Bab
Child
Health
Home
Infant
Safety
Toddler
13
or as trees
t-d term-document matrix
T1:
T2:
T3:
T4:
T5:
T6:
T7:
D1:
0
1
0
1
0
1
0
D2:
0
0
0
0
1
0
1
document
D3:
1
0
1
0
1
1
1
D2
president government party
election political elected
national districts held
district independence vice
minister parties
Is D2 similar to D3?
D3
What if there are 10,000 terms?
population area
climate city miles
province land
topography total
season 1999
square rate
Mining such data ~ studying trees
terms
economy million
products 1996
growth copra
economic 1997
food scale
exports rice 14
fish
0r as networks, or graphs w/ nodes & links
president government party
election political elected
national districts held
district independence vice
minister parties
Nodes=Documents
Links=Document similarity (e.g., if document
references another document )
population area
climate city miles
province land
topography total
season 1999
square rate
economy million
products 1996
growth copra
economic 1997
food scale
exports rice fish
Mining such data ~ studying graphs, or graph mining
15
What apps naturally deal w/ graphs?
Social Networks
Semantic Web
Drug Design,
Computer networks
Chemical compounds
Credit: Images are from Google images via search of keywords
World Wide Web
Sensor networks
16
What questions to ask about graph data?
Graph Data  Graph Mining Questions
• Academia:NCSU:Admission data
1. Nodes=students; links=similar academics/demographics
2. How many distinct academically performing groups of students
admitted to NCSU?
3. Which academic group is the largest?
4. Given a new student applicant, can we predict which academic
group the student will likely belong to?
5. Are groups of student with similar demographics usually share
similar academic performance?
6. Over the last decade, has the diversity in demographics of
accepted student groups increased or decreased?
7. …
17
Recap: Data Mining and Graph Mining
Application
Data
Questions
Data Objects + Features
Mathematical Data Representation (Data Model)
Vectors
Graphs
Matrices
Not one hat fits all
Time series
Sets
Tensors More than one models
are needed
Manifolds
Models are related
18
How much data?
1PB/year
Ecology
20-40TB/simulation
Biology
30TB/day
Climate Astrophysics Cosmology
850TB
Web
My laptop:
60 GB (GigaBytes) – 109 Bytes
1 TB (TeraByte) – 1012 Bytes
1 PB (PetaByte) – 1015 Bytes
19
It is not just the Size
– but the Complexity
Petabytes Data
20
Data Describes Complex Patterns/Phenomena
How to untangle the riddles of the complexity?
Complex regulation Single gene
~30k genes
Analytical tools that find the “dots”
from data significantly reduce data.
50 trans elements control single
gene expression
Challenge:
How to “connect the dots” to answer
important science/business questions?
21
Connecting the Dots
Finding the Dots
Sheer Volume of Data
Climate
Now: 20-40 Terabytes/year
5 years: 5-10 Petabytes/year
Fusion
Now: 100 Megabytes/15 min
5 years: 1000 Megabytes/2 min
Connecting the Dots
Advanced Math+Algorithms
Huge dimensional space
Combinatorial challenge
Complicated by noisy data
Requires high-performance
computers
Understanding the Dots
Providing Predictive
Understanding
 Produce bioenergy
 Stabilize CO2
 Clean toxic waste
22
Why Would Data Mining Matter?
Enables solving many large-scale data problems
Finding the Dots
Connecting the Dots
Understanding the Dots
• How
to effectively
produce bioenergy?
• How to stabilize carbon
dioxide?
• How to convert toxic
into non-toxic waste?
...
23
Science Questions
How to Move and Access the Data?
Technology trends are a rate limiting factor
Most of these data will NEVER be touched!
Data doubles every 9 months; CPU ―18 months.
Naturally distributed
but effectively immovable
Retrieval Rate Mbytes/s
Latency and Speed – Storage Performance
Streaming/Dynamic
but not re-computable
CPU, Disk, Network Trend
Doubling:
105
Memory
Disk
Tape
CPU: every 1.2 years
Disk: every 1.4 years
WAN: 0.7 years
log10(Object Size Bytes)
Src: Richard Mount, SLAC
J. W. Toigo, Avoiding a Data Crunch, Scientific American, May 2000
MIPS/$M
GB/$M
kB/s
24
How to Make Sense of Data?
Know Your Limits & Be Smart
More data
Not humanly possible to browse a petabyte of data.
Analysis must reduce data to quantities of interest.
Ultrascale Computations:
Must be smart about which
probe combinations to see!
Petabytes
Terabytes
Physical Experiments:
Must be smart about probe
placement!
Gigabytes
Megabytes
More analysis
To see 1 percent of a petabyte at 10 megabytes per second takes:
35 8-hour days!
25
What Analysis Algorithms to Use?
Even a simple big O analysis can elucidate simplicity.
Analysis algorithms fail for a few gigabytes.
Algorithmic Complexity:
Calculate means
O(n)
Calculate FFT
O(n log(n))
Calculate SVD
O(r • c)
Clustering algorithms
O(n2)
If n=10GB, then what
is O(n) or O(n2) on a
teraflop computers?
1GB = 109 bytes
1Tflop = 1012 op/sec
Data
size n
Algorithm Complexity
n
nlog(n)
n2
100B
10-10sec. 10-10 sec. 10-8 sec.
10KB
10-8 sec.
10-8 sec.
10-4sec.
1MB
10-6 sec.
10-5 sec.
1 sec.
100MB
10-4 sec.
10-3 sec.
3 hrs
10GB
10-2 sec.
0.1 sec.
3 yrs.
For illustration chart assumes 10-12 sec.
26
(1Tflop/sec) calculation time per data point