03_csit_invited1 - NDSU Computer Science

Transcript 03_csit_invited1 - NDSU Computer Science

Data Mining and Data Warehousing,
many-to-many Relationships,
applications
William Perrizo
DataSURG
(Database Systems Users and Research Group)
North Dakota State University
Fargo, North Dakota USA
[email protected]
DM on a DW vs. TP on a DB
Workload on Repository Question
 C.J. Date, circa 1980
 Transactions on a DBMS vs.
 file processing programs on file systems.
 “Use a DBMS instead of file systems! Unifies data resources,
centralizes control, promotes standards and consistency,
eliminates redundancy, increases data value (wider data usage)”
 Circa 1990
 “Buy a separate DW for DM” (separate from your DBMS for TP)”
 2 separate, quite redundant, non-sharing, inconsistent.. systems!
 What happened?
 Great marketing success! (sold more hardware and software)
 Great Concurrency Control R&D failure!
We failed to integrate transactions and queries (OLTP and OLAP, i.e.,
updates and reads) in one system with acceptable performance!
 The marketing was so successful, nobody noticed the failure!
OUTLINE
I still hold out hope that DW and DB will eventually be unified again. I believe
eventually the industry will demand it. Already, there is work to update DWs!
.
For now let’s just focus on DATA and DM
 I Consider DM to be on the unstructured side of querying.
There you run up against two curses immediately.
Curse of non-scalability (solutions don’t scale with data volume.)
Curse of dimensionality (solutions don’t scale with data dimension
 I will talk about techniques we have used to address the curses.
 Process vertically structured data horizontally (instead of the ubiquitous
vertical processing of horizontal data (the record orientation).
 Parallelize the DM engine.
 Parallelize the software DM engine on clusters of computers.
 Parallelize the greyware DM engine on clusters of people
(i.e., browser-enable all software for visual data mining)
Data mining finds information in data.
Why do we need Data Mining?
 Data volume expands by Parkinson’s Law
Data volume expands to fill
available data storage
 Disk-storage expands by Moore’s law
Capacity  2
t / 9 months
 Available storage doubles every 9 months!
We’re awash with data!

Network data: hi-speed, DWDM, All-opt (mgmt, flow classif’n,QoS,security)
(10 terabytes by 2003 ~ 1013 B).

US EROS Data Center (EDC) archives Earth Observing System (EOS)
Remotely Sensed Imagery (RSI), satellite and aerial photo data
(10 petabytes by 2005 ~ 1016 B).

National Virtual Observatory (aggregated astronomical data)
(10 exabytes by 2010 ~ 1019 B).

Sensor data from sensor networks (Micro & Nano -sensor networks)
(10 zettabytes by 2015 ~ 1022 B).

WWW will continue to grow (and other text collections)
(10 yottabytes by 2020 ~ 1025 B).

Micro-arrays, gene-chips, genome sequence data
(10 gazillabytes by 2030 ~ 1028 B?).
I had to make up this Name? Are projected data
sizes overrunning our ability to names for those
sizes!

Stock Market prediction data (prices + all the above?
especially astronomy data?)
(10 supragazillabytes by 2040 ~ 1031 B?).
Useful info must be teased out of these large volumes of data thru data mining.
More’s Law: More’s Less
The more volume, the less information. (AKA: Shannon’s Canon)
A simple illustration: Which phone book has more info?
(both involve the same 4 data granules)
BOOK-1
Name
Number
Smith
234-9816
Jones
231-7237
Name
Smith
Smith
Jones
Jones
BOOK-2
Number
234-9816
231-7237
234-9816
231-7237
Data mining reduces volume and raises the information level.
Precision Agriculture Data Mining
Dataset consists of an aerial photograph (TIFF image taken during the growing season)
and a synchronized yield map (crop yield taken at harvest). Altogether there are 4 feature
attributes (B,G,R,Y) and ~100,000 pixels.
TIFF image
Yield Map
A producer wants to know the relationship between the color intensities and yield?
One hypothsize, the Association Rule, hi_green and low_red  hi_yield, is
intuitive and could be made and verified without data mining (simple querying).
Data mining has found a stronger rule, hi_NIR and low_red  very_hi_yield.
So many producers use VIR instead of RBG cameras to get the better information.
Another Precision Agriculture Data Mining
Example: Grasshopper Infestation Prediction
(again involving RSI data)
• Grasshopper caused significant economic loss each year.
• Early infestation prediction is key to damage control.
Pixel classification on remotely sensed imagery
holds significant promise to achieve early detection.
Pixel classification (signaturing) has many applications
pest detection, forest fire detection, wet-lands monitoring …
(for signaturing we developed the SMILEY software/greyware system)
http:midas.cs.ndsu.nodak.edu/~smiley
Sensor Network Data Mining
 Micro and Nano scale sensor blocks
are being developed for sensing






Biological agents
Chemical agents
Motion detection
coatings deterioration
RF-tagging of inventory
Structural materials fatigue
 There will be trillions++ of individual
sensors creating mountains of data.
 The data must be mined for it’s information.
CEASR slide

Data Mining?
Querying is asking specific questions and expecting specific answers.
Data Mining goes into the MOUNTAIN of DATA,
and hopefully returns information gems.
But also, some fool’s gold?
Relevance and interestingness analysis, serves to
assay those gems. (help pick out valuable gems).
Data Mining Process
 Data mining: the core
of the knowledge
discovery process.
Pattern Evaluation
and Assay
Data Mining
Task-relevant Data
OLAP
Classification
Clustering
ARM
Data Warehouse: cleaned,
integrated, read-only, periodic,
historical raw database
Selection
Feature extraction,
tuple selection
Data Cleaning/Integration:
missing data, outliers,
noise, errors
Mountain of Raw Data
Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
Fractals, …
Standard querying
SQL
SELECT
FROM
WHERE
Complex
queries
(nested,
EXISTS..)
Searching and Aggregating
FUZZY query,
Search engines,
BLAST searches
OLAP
(rollup,
drilldown,
slice/dice..
Machine Learning
Supervised
Learning –
classification
regression
Data Mining
Data Prospecting
Association Rule Mining
Unsupervised
Learning clustering
On the Query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record’02).
On the Data Mining end, the surface has barely been scratched.
But even those scratches had a great impact – One of the early scatchers became
the biggest corporation in the world last year. A Non-scratcher filed for bankruptcy
Walmart vs. KMart
Our Approach
 Vertical, compressed data structures, Peano-trees (Ptrees)1
processed horizontally (DBMSs process horizontal data vertically)
 Ptrees are data-mining-ready, compressed data
structures, which attempt to address the curses of
scalability and curse of dimensionality.
 And a compressed, OLAP-ready data warehouse structure, the
Peano Data Cube (PDcube)1
 PDcube facilitates OLAP operations and query processing,
using the Ptree data structure.
1
Technology is patent pending
by North Dakota State University
DBMS relation (table) is a horizontal
structure (set of horizontal records)
processed vertically (vertical scans)
R( A1 A2 A3 A4)
Horizontal
structure
Processed
vertically
(scans)
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
-->
7. 2nd half of 1st of 2nd not 0
R[A1] R[A2] R[A3] R[A4]
001
000
001
111
100
101
100
100
0
0
0
0
1
0
1
(1-Dim) Ptrees are built
by recording the truth of
the predicate “pure 1”
recursively on halves,
until there is purity, P11:
1. Whole file is not pure1 0
2. 1st half is not pure1  0
3. 2nd half is not pure1  0
4. 1st half of 2nd half not  0
5. 2nd half of 2nd half is  1
6. 1st half of 1st of 2nd is  1
Ptrees are vertical structures
(compressed vertical bit files)
processed horizontally (using a multioperand logical ANDs).
1
0
0
0
01
1
10
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
101
001
001
000
001
111
100
101
100
100
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
1
0
1
1
1
1
1
1
0
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
1
1
1
1
0
0
1
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
P11 P12 P13
P21 P22 P23 P31 P32 P33 P41 P42 P43
0
0
0
0
0
0
0
0
0
0
0
0
0 0 1 0 1 0
0 01 0 0 0 0 1 0 1 0 0 0 0
0 0 1 0
10 10
01 0001
01 01
00 10 01
0100
01
01 01 10
01
01
10
01 10 01
10
Vertical structure processed horizontally (ANDs)
Ptrees
Run-length-compressed, lossless, vertical, structures representing the data
in a way that facilitates fast horizontal AND-processing
 Jury is still out on which parallelization approach is best, vertical (by relation)
or horizontal (by tree node) or some combination.
 Horizontal is pretty, but network multicast overload eats us alive (maybe active
networking? Clusters of Playstations?...)
 The most useful form of a Ptree is the predicate-Ptree e.g., when we use
the “pure 1” predicate, as in the previous example, they are called
Pure1-trees or P1trees (1-bit at a node iff corresponding half is pure1.)
 There are many other useful predicates, e.g., NonPure0-trees or NP0trees
(1 iff half is not pure0) etc., but we will focus on P1trees.
 The Ptree on previous slide were 1-dimensional (recursively halving
bit files), but they can be 2-D (recursively quartering), 3-D,…
 Ptrees for 2-D spatial data are usually 2-dimensional (recursively
quartering, in Peano order)
A 2-Dimensional P1tree
Node is 1 iff that quadrant is purely 1-bits, e.g.,
A bit-file
(from, e.g., a 2-D image)
1111110011111000111111001111111011110000111100001111000001110000
Run-length compress the corresponding raster ordered matrix using Peano order.
11
11
11
11
11
11
11
01
11
11
11
11
11
11
11
11
11
10
11
11
00
00
00
00
00
00
00
10
00
00
00
00
0
1
0
0
0 0 1 0
1 1 0 1
1 1 1 0 0 0 1 0 1 1 0 1
0
2-D Count Ptrees (original form)
Counts are the ultimate goal, but we use predicate Ptrees (e.g., P1trees) because they are
more compressed, facilitate faster ANDing and produce the needed counts quite quickly
001
111
11
11
11
11
11
11
11
01
11
11
11
11
11
11
11
11
11
10
11
11
11
11
11
11
00
00
00
10
11
11
11
11
55
0
16
1
2
8
15
2
3 0 4 1
4 4 3 4
3
1 1 1 0 0 0 1 0 1 1 0 1
Peano coordinates (QID)
Raster coordinate order sorts by dimension 1st and bit-position 2nd
Peano coordinate order sorts by bit-position 1st and dimension 2nd.
( 7, 1 )
3
( 111, 001 )
10.10.11
16
2.2.3
CEASR and a 3-D diagram
 With animation arrows?
 If so, go back first
 Then left,
 Then down.
 Point out: Add additional bits to increase
intensity resolution (granularity)?
 Add DANA slides.
Logical Operations on P-trees
(are used to get counts of any pattern)
Ptree 1
Ptree 2
AND result
OR result
The Ptree AND operation is faster than the bit-by-bit AND operation
since, there are shortcuts.
e.g., only load quadrant 2 to AND Ptree1, Ptree2, etc.
The more operands there are in the AND, the greater the benefit due
to this shortcut (more pure0 nodes).
Ptree dimension
 The dimension of the Ptree structure is a
user chosen parameter
 It can be chosen to fit the data dimension
 Most datasets  1-D Ptrees (recursive halving)
 2-D Images  2-D Ptrees (recursive quartering)
 3-D Solids
 3-D Ptrees (recursive eighth-ing)
 Or dimension can be chosen based on other
considerations
 optimize compression
 increase processing speed (next slide)
Raster Sorting:
Peano Sorting:
Attributes 1st
Bit position 1st
Bit position 2nd
Attributes 2nd
Generalized Raster and Peano Sorting: generalizes to any table with
numeric attributes (not just images).
Generalize Peano Sorting
KNN speed improvement (UCI MLR data sets)
Time in Seconds
120
100
Unsorted
80
60
Generalized Raster
40
20
0
Generalized Peano
National Virtual Observatory data
 What Ptree dimension and what ordering
should be used for astronomical data?
 Peano Triangle Mesh Tree (PTM-tree)
 Peano Celestial Coordinate tree (PCCtree)
 Uses (RA, dec) coordinates of the celestial sphere
 RA=Recession Angle
 dec=declination
Peano Triangular Mesh Tree (PTM-tree)
 Similar to the Hierarchical Triangular Mesh (HTM)
scheme (Sloan Digital Sky Survey project)
 Sphere is divided into triangles
 Triangle sides are always great circle segments.
 PTM differs from HTM in the way in which they are
ordered?
PTM-Ordering the Triangular Mesh
1,2
1
1,1
1,0
1,3,3
1,3,1
1,3
1,3,0
1,3,2
The Half Sphere up to 3 Levels
Traverse southern
hemisphere in the
revere direction
(just the identical
pattern pushed
down instead of
pulled up, arriving
at the Southern
neighbor of the
start point.
dec
RA
PTM-tree up to 3 Levels
LRLR
LRLR
LRLR
LRLR
PTM-tree up to 4 Levels
LRLR RLRL LRLR
RLRL LRLR RLRL
LRLR RLRL
LRLR RLRL LRLR
RLRL LRLR RLRL
LRLR RLRL
Peano Celestial Coordinate Trees (PCCtrees)
Unlike PTM-tree which partitions the sphere
into the 8 faces of an octahedron, in the
PCCtree scheme transforms the sphere into
a cylinder, then a rectangle, then use
regular Peano Coords.
Celestial coordinates
 RA is from 0 to 360o
 dec is -90o to 90o.
P
R
A
d
e
c
90o
North Plane
0o
South Plane
-90o
0o
360o
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
PRAdec-scheme:
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Sphere  Cylinder  Plane
Graph data
(many-to-many self relations)
“Everything should be
made as simple as
possible, but not simpler”
Albert Einstein
Representating graphs
Examples:
 Genomics

Protein-protein interactions
(ACM KDD-Cup ’02)
 Focuses is on node
structure
 WWW

Focuses on link structure
 Publications citations


ACM KDD_Cup ’03
Focus is on both
Scientific American 05/03
Genomics
Organism
Species
Vert
Genome Size
(million bp)
human
Homo sapiens
1
3000
fly
Drosophila
melanogaster
0
185
yeast
Saccharomyces
cerevisiae
0
12.1
Mus
musculus
1
mouse
SubCell-Location
Myta
Ribo
Nucl
Ribo
Function
apop
meio
mito
apop
StopCodonDensity
.1
.1
.1
.9
PolyA-Tail
1
1
0
0
Organism
Dimension
Table
g0
g2
1
1
0
0
e1
1
1
1
1
e0
0
1
e2
1
1
1
e3
0
1
e2
U
N
V
S
T
R
C
T
Y
S
T
Z
E
D
A
D
S
H
M
N
1
e3
Experiment
1 Dimension
Table
3
2
a
c
h
2
2
b
s
h
2
4
a
c
a
1
2
4
a
s
a
1
0
(MIAME)
chromosome,length
o0
1
P
I
g3
17, 78 12, 60 Mi, 40 1, 48
1 0
17, 40
1
10,
75 1 0
o1
1 e14,
00
1
0
65 0 0
0
o2 16, 760 0
1 9, 45 0 Pl, 43 0
0
o3
3000
g1
e1
L
A
B
Gene-Org
Dim Table
Gene Dimension Table
0
1
0
1
1
1
0
0
0
0
0
1
0
1
0
1
0
1
Gene-Experiment-Organism Cube
(1 iff that gene from that organism
expresses at a threshold level in that
experiment.)
Many-to-many-to-many relationship
Protien-Protien Interactions
(PPI) (2-hop interactions)
Gene Dimension Table
g3
1
0
0
0
g2
0
0
1
0
1
1
1
0
0
g1
g13
1
0
0g2
1
0
0
0
1
1
1
1
0
g0
Myt
a
Rib
o
Nucl
Rib
o
Function
StopCodonDensity
apo
p
.1
mei
o
.1
mit
o
.1
apo
p
.9
PolyA-Tail
1
1
0
0
M
y
t
a
R
i
b
o
N
u
c
l
a
p
o
p
M
e
i
o
M
i
t
o
S
C
D
1
S
C
D
2
S
C
D
3
S
C
D
4
1
0
0
1
0
1
0
0
0
0
1
0
0
1
0
0
0
0
0
1
0
0
1
0
g2
0
1
0
1
0
0
1
g3
Gene Dimension Table (Binary)
g0
g1
SubCellLocation
g0
1
g1
0
G
E
N
E
1
P
ol
yA
1
0
1
1
g2
0
0
1
0
g3
0
0
1
0
g4
1
0
g1
KDD-Cup ’02 results
NDSU Team
Greyware PPI graph mining tool
 Visualize feature information using a glyph for each gene (PPI
graph node)
 PPI Edge iff the 2 genes code for interacting proteins
le
n
g
t
h
e
ss
e
nt
ia
l
Di
sce
nt
er
M
y
t
a
R
i
b
o
N
u
c
l
a
p
o
p
M
e
i
o
M
i
t
o
S
C
D
G
E
N
E
1
In
foqt
y
4
1
0
0
1
0
1
o
1
0
1
0
0
1
0
.1
6
0
5
1
g2
0
0
1
0
0
1
.1
4
0
0
5
g3
0
1
0
1
0
0
.9
9
0
8
2
g4
4
Glyp
h for
g1
g1
Gene Dimension Table (non-binary)
stopcodondensity
This visual data mining tool was effective in KDD-CUP ’02)
Thanks so much!
 Don’t forget to submit your best work to CAINE, Nov 1113, 2003, Las Vegas NV by July 1. Submit to Program Chair,
[email protected] or Conference Chair,
[email protected]
http:/www.cs.ndsu.nodak.edu/~krile/caine03 or http:www.isca-hq.org
For those interested in DM in genomics and bioinformatics,
Virtual Conference in Genomics and Bioinformatics
VGAB-III, Sep 16-18, 2003 http:www.ndsu.edu/~virtual-genomics
Submit papers to Program Chair, [email protected]
or to the Conference Chair,
[email protected]
 VGAB-III will be available over Access Grid and Real Player
to anywhere, for free (no registration fee)

03_csit_invited1 - NDSU Computer Science

Transcript 03_csit_invited1 - NDSU Computer Science

Directory