1 - NDSU Computer Science

Download Report

Transcript 1 - NDSU Computer Science

Data Mining and Data Warehousing,
many-to-many Relationships,
applications
DataSURG
(Database Systems Users and Research Group)
North Dakota State University
Fargo, North Dakota USA
[email protected]
DM on a DW vs. TP on a DB
Workload on Repository Question
 C.J. Date, circa 1980
 Transactions on a DBMS vs.
 file processing programs on file systems.
 “Use a DBMS instead of file systems! Unifies data resources,
centralizes control, promotes standards and consistency,
eliminates redundancy, increases data value (wider data usage)”
 Circa 1990
 “Buy a separate DW for DM” (separate from your DBMS for TP)”
 2 separate, quite redundant, non-sharing, inconsistent.. systems!
 What happened?
 Great marketing success! (sold more hardware and software)
 Great Concurrency Control R&D failure!
We failed to integrate transactions and queries (OLTP and OLAP, i.e.,
updates and reads) in one system with acceptable performance!
 The marketing was so successful, nobody noticed the failure!
OUTLINE
I still hold out hope that DW and DB will eventually be unified again. I believe
eventually the industry will demand it. Already, there is work to update DWs!
.
For now let’s just focus on DATA and DM
 I Consider DM to be on the unstructured side of querying.
There you run up against two curses immediately.
Curse of non-scalability (solutions don’t scale with data volume.)
Curse of dimensionality (solutions don’t scale with data dimension
 I will talk about techniques we have used to address the curses.
 Process vertically structured data horizontally (instead of the ubiquitous
vertical processing of horizontal data (the record orientation).
 Parallelize the DM engine.
 Parallelize the software DM engine on clusters of computers.
 Parallelize the greyware DM engine on clusters of people
(i.e., browser-enable all software for visual data mining)
Data mining finds information in data.
Why do we need Data Mining?
 Data volume expands by Parkinson’s Law
Data volume expands to fill
available data storage
 Disk-storage expands by Moore’s law
Capacity  2
t / 9 months
 Available storage doubles every 9 months!
We’re awash with data!

Network data: hi-speed, DWDM, All-opt (mgmt, flow classif’n,QoS,security)
(10 terabytes by 2003 ~ 1013 B).

US EROS Data Center (EDC) archives Earth Observing System (EOS)
Remotely Sensed Imagery (RSI), satellite and aerial photo data
(10 petabytes by 2005 ~ 1016 B).

National Virtual Observatory (aggregated astronomical data)
(10 exabytes by 2010 ~ 1019 B).

Sensor data from sensor networks (Micro & Nano -sensor networks)
(10 zettabytes by 2015 ~ 1022 B).

WWW will continue to grow (and other text collections)
(10 yottabytes by 2020 ~ 1025 B).

Micro-arrays, gene-chips, genome sequence data
(10 gazillabytes by 2030 ~ 1028 B?).
I had to make up this Name? Projected data sizes
are overrunning our ability to names for them!

Stock Market prediction data (prices + all the above?
especially astronomy data?)
(10 supragazillabytes by 2040 ~ 1031 B?).
Useful info must be teased out of these large volumes of data thru data mining.
More’s Law: More’s Less
The more volume, the less information. (AKA:
Shannon’s Canon)
A simple illustration: Which phone book has
more info?
(both involve the same 4 data granules)
BOOK-1
Name
Number
Smith
234-9816
Jones
231-7237
Name
Smith
Smith
Jones
Jones
BOOK-2
Number
234-9816
231-7237
234-9816
231-7237
Data mining reduces volume and raises the information level.
Precision Agriculture Data Mining
Dataset consists of an aerial photograph (TIFF image taken during the growing season)
and a synchronized yield map (crop yield taken at harvest). Altogether there are 4 feature
attributes (B,G,R,Y) and ~100,000 pixels.
TIFF image
Yield Map
A producer wants to know the relationship between the color intensities and yield?
One hypothsize, the Association Rule, hi_green and low_red  hi_yield, is
intuitive and could be made and verified without data mining (simple querying).
Data mining has found a stronger rule, hi_NIR and low_red  very_hi_yield.
So many producers use VIR instead of RBG cameras to get the better information.
Another Precision Agriculture Data Mining
Example: Grasshopper Infestation Prediction
(again involving RSI data)
• Grasshopper caused significant economic loss each year.
• Early infestation prediction is key to damage control.
Pixel classification on remotely sensed imagery
holds significant promise to achieve early detection.
Pixel classification (signaturing) has many applications
pest detection, forest fire detection, wet-lands monitoring …
(for signaturing we developed the SMILEY software/greyware system)
http:midas.cs.ndsu.nodak.edu/~smiley
Sensor Network Data Mining
 Micro and Nano scale sensor blocks
are being developed for sensing






Biological agents
Chemical agents
Motion detection
coatings deterioration
RF-tagging of inventory
Structural materials fatigue
 There will be trillions++ of individual
sensors creating mountains of data.
 The data must be mined for it’s information.
Sensor Network Application:
CubE for Active Situation Replication (CEASR)
Situation space
(with nano-sensors )
Realtime replica
of sensed pattern
(on a
wristwatch?
Proposed Technical Approach:
Ability to sense chemical, vibrational, biological, thermal in realtime (over hills, etc.)
The problems to be solved include:
1.
Communication between sensor field(s) and CEASR.
2.
Nano-sensors must be position registered.
3.
Data fusion architecture must be developed.
4.
Fluidic Self Assembly (FSA) of Cube for Active Situation
Replication (CEASR).
FSA is an Alien Technology patented process capable of
producing clear flexible substrates with embedded nanoLED display units.
Operational Capability:
Using Alien Technology’s Fluidic Self-assembly (FSA)
technology, clear sheets with embedded nano-LED
elements will be laminated together to produce a clear
visualization cube with a nano-LED at each pixel
corresponding to nano-sensors at each pixel of the actual
space.
The nano-sensors will turn on their corresponding
CEASR display Nano-LED when a threshold is sensed
(chemical, vibrational, biological, thermal…). These
sensed patterns will be transmitted, fused and displayed
in real-time.
Each energized nano-sensor must transmit a ping and its
location (QID). These QIDs are then translated to 3dimensional coordinates at the display device. The
correcponding voxel on the display lights up.
A more sophisicated CEASR device would sense and
transmit the intensity levels. The display device would
light up with the corresponding intensity.
Anthropological Application:
Digital Archive Network for Anthropology (DANA)
(data mine arthropological artifacts (shape, color, discovery location,…)
Astronomy Application: The celestial sphere
dec
RA
Data Mining?
Querying is asking specific questions and expecting specific answers.
Data Mining goes into the MOUNTAIN of DATA,
and hopefully returns information gems.
But also, some fool’s gold?
Relevance and interestingness analysis, serves to
assay those gems. (help pick out valuable gems).
Data Mining Process
 Data mining: the core
of the knowledge
discovery process.
Pattern Evaluation
and Assay
Data Mining
Task-relevant Data
OLAP
Classification
Clustering
ARM
Data Warehouse: cleaned,
integrated, read-only, periodic,
historical raw database
Selection
Feature extraction,
tuple selection
Data Cleaning/Integration:
missing data, outliers,
noise, errors
Mountain of Raw Data
Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
Fractals, …
Standard querying
SQL
SELECT
FROM
WHERE
Complex
queries
(nested,
EXISTS..)
Searching and Aggregating
FUZZY query,
Search engines,
BLAST searches
OLAP
(rollup,
drilldown,
slice/dice..
Machine Learning
Supervised
Learning –
classification
regression
Data Mining
Data Prospecting
Association Rule Mining
Unsupervised
Learning clustering
On the Query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record’02).
On the Data Mining end, the surface has barely been scratched.
But even those scratches had a great impact – One of the early scatchers became
the biggest corporation in the world last year. A Non-scratcher filed for bankruptcy
Walmart vs. KMart
Our Approach
 Vertical, compressed data structures, variously called either
Predicate-trees or Peano-trees (Ptrees in either case)1
processed horizontally (DBMSs process horizontal data vertically)
 Ptrees are data-mining-ready, compressed data
structures, which attempt to address the curses of
scalability and curse of dimensionality.
 And a compressed, OLAP-ready data warehouse structure, the
Pcube1
 Pcubes facilitate OLAP operations and query processing, using
the Ptree data structure.
1
Technology is patent pending
by North Dakota State University
A table, R(A1..An), is a horizontal
structure (set of horizontal records)
processed vertically (vertical scans)
R( A1 A2 A3 A4)
Horizontal
structure
Processed
vertically
(scans)
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
-->
7. 2nd half of 1st of 2nd not 0
R[A1] R[A2] R[A3] R[A4]
001
000
001
111
100
101
100
100
0
0
0
0
1
0
1
1-Dimensional Ptrees
are built by recording
the truth of the
predicate “pure 1”
recursively on halves,
until there is purity, P11:
1. Whole file is not pure1 0
2. 1st half is not pure1  0
3. 2nd half is not pure1  0
4. 1st half of 2nd half not  0
5. 2nd half of 2nd half is  1
6. 1st half of 1st of 2nd is  1
Ptrees vertical partition; compress
each vertical bit slice into a basic Ptree;
horizontally process these Ptrees using
one multi-operand logical AND operation.
1
0
0
0
01
1
10
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
1
0
1
1
1
1
1
1
0
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
1
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
P11 P12 P13
P21 P22 P23 P31 P32 P33 P41 P42 P43
0
0
0
0
0
0
0
0
0
0
0
0
0 0 1 0 1 0
0 01 0 0 0 0 1 0 1 0 0 0 0
0 0 1 0
10 10
10 01
01 0001
01 01
0100
01
01 01 10
01
01
01 10 01
10
Eg, to count, 111 000 001 100s, use “pure111000001100”: 0 23-level
P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43 = 0 0 22-level =2
01 21-level
Horizontal Processing of Vertical Structures
History
In the 1980’s it was proposed for DBMSs
and record-based workloads
 Decomposition Storage Model
(DSM, Copeland et al)
Attribute Transposed File (ATF)
 Band Sequential (BSQ) in RSI)
 Bit Transposed File (BTF, Wang et al)
These initiatives didn’t last. Why?
Horizontal Processing of Vertical Structures
for Record-based Workloads
 For record-based workloads (where the result is a set of records),
changing the horizontal record structure and then having to
reconstruct it, may introduce too much post processing?
R R R
R R R
R R R
R R R
R( A1 A2 A3 A4)
11
0
0
0
0
1
0
1
1
12
1
1
1
1
0
1
1
1
13
0
1
0
0
1
0
1
1
21
1
1
1
1
0
0
0
0
22
1
1
1
1
1
1
0
0
23
1
1
0
1
0
0
0
0
31
1
1
1
1
0
0
0
0
32
1
1
0
0
0
0
0
0
33
0
0
1
1
1
1
1
1
41
0
0
0
1
1
1
1
1
42
0
0
0
1
0
0
0
0
43
1
0
1
1
0
1
0
0
010
011
010
010
101
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
 For data mining workloads, the result is often a bit (Yes/No, True/False) or another
unstructured result, there is no reconstructive post processing?
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
1
0
1
1
1
1
1
1
0
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
1
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
1
Run Lists: Generalized Ptrees using standard run length compression of vertical bit files
(alternatively, using Lempl Zipf?, Golomb?, other?)
Run Lists: record type and start-offset of
pure runs.
E.g., RL11:
R( A1 A2 A3 A4)
010
011
010
010
101
010
111
111
1.
1st run is Pure0
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
2nd run is Pure1
 1:100
3.
3rd run is Pure0
 0:101
4.
4th run is Pure1
 1:110
1
RL11
0:000
1:100
0:101
1:110
R[A1] R[A2] R[A3] R[A4]
010
011
010
010
101
010
111
111
R11
0
0
0
0
1
0
1
 0:000
truth:start
2.
-->
111
111
110
111
010
010
000
000
Eg, to count, 111 000 001 100s, use “pure111000001100”:
RL11^RL12^RL13^RL’21^RL’22^RL’23^RL’31^RL’32^RL33^RL41^RL’42^RL’43
001
000
001
111
100
101
100
100
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
1
0
1
1
1
1
1
1
0
1
1
1
0
1
0
0
1
0
1
1
RL11 RL12 RL13
(to complement, flip purity bits)
110
110
101
101
001
001
001
001
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
RL21 RL22 RL23
1
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
RL31 RL32 RL33 RL41 RL42 RL43
0:000 1:000 0:000 1:000 1:000 1:000 1:000 1:000 0:000 0:000 0:000
1:100 0:100 1:001 0:100 0:110 0:010 0:100 0:010 1:010 1:010 1:010
0:101
1:011
0:101 1:101 0:010
0:100
1:100
1:110
0:101
1:110
1:000
0:001
1:010
0:100
1:101
0:110
RunList-trees? (RLtrees)

To facilitate subsetting (isolating a subset) and processing, a
Ptree stucture can be constructed on top of the RunList using
the “pure1” predicate:
1. Whole file is not pure1 0
2. 1st half is not pure1  0
3. 2nd half is not pure1  0
4. 1st half of 2nd half not  0
5. 2nd half of 2nd half is  1
R11
0
0
0
0
1
0
1
1
6. 1st half of 1st of 2nd is  1
0
0
0
01
1
10
7. 2nd half of 1st of 2nd not 0
RL11
0:000 1:100 0:101 1:110
RunList-trees continued

Alternatively, a separate NotPure0 index trees could be build where the predicate is
NotPure0 (also note, the tree could be terminated at a given level).



First, AND the NP0 index trees.
Only the 1-branches or the resulty need to be ANDed through list scans.
The more operands, the fewer 1-branches.
1. Whole file is true
1
2. 1st half is false
0
nd
3. 2 half is true
1
4. 1st half of 2nd half true  1
5. 2nd half of 2nd half true  1
R11
0
0
0
0
1
0
1
1
6. 1st half of 1st of 2nd true 1
7. 2nd of 1st of 2nd false
RL11
0
1
1
11
1
10
0
0:000 1:100 0:101 1:110
Ptrees
Vertical, compressed, lossless structures that facilitates fast
horizontal AND-processing
 Jury is still out on the best parallelization approach, vertical (by relation) or
horizontal (by tree node) or some combination.
 Horizontal parallelization is pretty, but network multicast overload is huge
 Use active networking? Clusters of Playstations?...
 The most useful form of a Ptree is the Pure1-tree or P1tree
 1-bit at a node iff corresponding half is pure1.
 There are many other useful predicates, e.g., NonPure0-trees (previously shown).
 We will focus on P1trees.
 All Ptrees shown so far were
 1-dimensional (recursively halving bit files), but they can be
 2-D (recursively quartering) (e.g., used for 2-D images)
 3-D (recursively eight-ing), …
A 2-Dimensional P1tree
Node is 1 iff that quadrant is purely 1-bits, e.g.,
A bit-file
(from, e.g., a 2-D image)
1111110011111000111111001111111011110000111100001111000001110000
Run-length compress the corresponding raster ordered matrix using Peano order.
11
11
11
11
11
11
11
01
11
11
11
11
11
11
11
11
11
10
11
11
00
00
00
00
00
00
00
10
00
00
00
00
0
1
0
0
0 0 1 0
1 1 0 1
1 1 1 0 0 0 1 0 1 1 0 1
0
Alternatively, a Count tree?
Counts are the ultimate goal, but P1trees are more compressed and produce the needed counts
quite quickly.
001
11
11
11
11
11
11
11
01
111
11
11
11
11
11
11
11
11
11
10
11
11
11
11
11
11
55
00
00
00
10
11
11
11
11
0
16
1
level-3 (pure=43)
2
8
3
15
2
3 0 4 1
4 4 3 4
3
1 1 1 0 0 0 1 0 1 1 0 1
16
level-2
level-1
level-0
2.2.3
 QID (Quadrant ID): e.g., 2.2.3


Pure-1/Pure-0 quadrants
Root Count
( 7, 1 )
 Tree levels: 3, 2, 1, 0 (purity counts of 43 42 41 40 resp.)
 Fan-out = 2dim = 4
( 111, 001 )
10.10.11
Logical Operations on Ptrees
(are used to get counts of any pattern)
Ptree 1
Ptree 2
AND result
OR result
AND operation is faster than the bit-by-bit AND since, there are shortcuts
(any pure0 operand node means result node is pure0.)
e.g., only load quadrant 2 to AND Ptree1, Ptree2, etc.
The more operands there are in the AND, the greater the benefit due to this
shortcut (more pure0 nodes).
Sensor Network Application:
CubE for Active Situation Replication (CEASR)
Situation space
(with nano-sensors )
Realtime replica
of sensed pattern
(on a
wristwatch?
Proposed Technical Approach:
Ability to sense chemical, vibrational, biological, thermal in realtime (over hills, etc.)
The problems to be solved include:
1.
Communication between sensor field(s) and CEASR.
2.
Nano-sensors must be position registered.
3.
Data fusion architecture must be developed.
4.
Fluidic Self Assembly (FSA) of Cube for Active Situation
Replication (CEASR).
FSA is an Alien Technology patented process capable of
producing clear flexible substrates with embedded nanoLED display units.
Operational Capability:
Using Alien Technology’s Fluidic Self-assembly (FSA)
technology, clear sheets with embedded nano-LED
elements will be laminated together to produce a clear
visualization cube with a nano-LED at each pixel
corresponding to nano-sensors at each pixel of the actual
space.
The nano-sensors will turn on their corresponding
CEASR display Nano-LED when a threshold is sensed
(chemical, vibrational, biological, thermal…). These
sensed patterns will be transmitted, fused and displayed
in real-time.
Each energized nano-sensor must transmit a ping and its
location (QID). These QIDs are then translated to 3dimensional coordinates at the display device. The
correcponding voxel on the display lights up.
A more sophisicated CEASR device would sense and
transmit the intensity levels. The display device would
light up with the corresponding intensity.
3-D Application:
Digital Archive Network for Anthropology (DANA)
(data mine arthropological artifacts (shape, color,…)
3-Dimensional Ptrees
(e.g., for the CEASR sensor
network or
3-Dimensional
Ptrees
the Digital Archive Network for Anthropology)
X
Y
Z
Intensity
0
0
0
15 (1111)
1
0
0
15 (1111)
0
1
0
15 (1111)
1
1
0
15 (1111)
0
0
1
15 (1111)
1
0
1
15 (1111)
0
1
1
15 (1111)
1
1
1
15 (1111)
2
0
0
15 (1111)
3
0
0
4 (0100)
2
1
0
1 (0001)
3
1
0
12 (1100)
2
0
1
12 (1100)
3
0
1
2 (0010)
2
1
1
12 (1100)
3
1
1
12 (1100)
0
2
0
15 (1111)
1
2
0
15 (1111)
0
3
0
2 (0010)
1
3
0
0 (0000)
0
2
1
15 (1111)
1
2
1
15 (1111)
0
3
1
2 (0010)
1
3
1
0 (0000)
2
2
0
12 (1100)
Ptree dimension
 The dimension of the Ptree structure is a
user chosen parameter
 It can be chosen to fit the data dimension
 Most datasets  1-D Ptrees (recursive halving)
 2-D Images  2-D Ptrees (recursive quartering)
 3-D Solids
 3-D Ptrees (recursive eighth-ing)
 Or dimension can be chosen based on other
considerations
 optimize compression
 increase processing speed (next slide)
Raster Sorting:
Peano Sorting:
Attributes 1st
Bit position 1st
Bit position 2nd
Attributes 2nd
Generalized Raster and Peano Sorting: generalizes to any table with
numeric attributes (not just images).
Generalize Peano Sorting
KNN speed improvement (UCI MLR data sets)
Time in Seconds
120
100
Unsorted
80
60
Generalized Raster
40
20
0
Generalized Peano
National Virtual Observatory data
 What Ptree dimension and what ordering should
be used for astronomical data?
 Where all bodies are assumed to be on the surface of
a sphere, the celestial sphere (shares equatorial
plane with earth and has no specified radius)
 Peano Triangle Mesh Tree (PTM-tree)
 Peano Celestial Coordinate tree (PCCtree)
 Uses (RA, dec) coordinates of the celestial sphere


RA=Recession Angle (longitudinal angle)
dec=declination
(latitude angle)
Peano Triangular Mesh Tree (PTM-tree)
 Similar to the Hierarchical Triangular Mesh (HTM)
used in the Sloan Digital Sky Survey project.
In both:
 Sphere is divided into triangles
 Triangle sides are always great circle segments.
 PTM differs from HTM in the way in which they are
ordered?
The difference between HTM and
PTM-trees is in the ordering.
1,3,3
1,1,2
1,3,1
1.1.3
1,1,1
1
1,2
1
1,3,0
1,1,0
1,
21,1
1,3
1,0
1,1
Ordering of HTM
Why use a different ordering?
1,
0
1,
3
Ordering of PTM-tree
1,3,2
PTM Triangulation of the Celestial Sphere
Traverse southern
hemisphere in the
revere direction
(just the identical
pattern pushed
down instead of
pulled up, arriving
at the Southern
neighbor of the
start point.
dec
RA
This “Peano ordering” produces a sphere-surface filling curve with good continuity characteristics.
PTM triangulation – Next Level
LRLR
LRLR
LRLR
LRLR
PTM-triangulation - Next Level
LRLR RLRL LRLR
RLRL LRLR RLRL
LRLR RLRL
LRLR RLRL LRLR
RLRL LRLR RLRL
LRLR RLRL
Peano Celestial Coordinate Trees (PCCtrees)
Unlike PTM-trees which initially partition
the sphere into the 8 faces of an
octahedron, in the PCCtree scheme:
the sphere is tranformed into a cylinder,
then into a rectangle,
then standard Peano ordering is used on the
Celestial Coordinates.
Celestial Coordinates
 RA is from 0 to 360o
 dec is -90o to 90o.
P
R
A
d
e
c
90o
North Plane
0o
South Plane
-90o
0o
360o
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
PRAdec-scheme:
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Z Z
Sphere  Cylinder  Plane
Graph data
(many-to-many self relations)
“Everything should be
made as simple as
possible, but not simpler”
Albert Einstein
Representating graphs
Examples:
 Genomics

Protein-protein interactions
(ACM KDD-Cup ’02)
 Focuses is on node
structure
 WWW

Focuses on link structure
 Publications citations


ACM KDD_Cup ’03
Focus is on both
Scientific American 05/03
Genomics
Organism
Species
Vert
Genome Size
(million bp)
human
Homo sapiens
1
3000
fly
Drosophila
melanogaster
0
185
yeast
Saccharomyces
cerevisiae
0
12.1
Mus
musculus
1
mouse
SubCell-Location
Myta
Ribo
Nucl
Ribo
Function
apop
meio
mito
apop
StopCodonDensity
.1
.1
.1
.9
PolyA-Tail
1
1
0
0
Organism
Dimension
Table
g0
g2
1
1
0
0
e1
1
1
1
1
e0
0
1
e2
1
1
1
e3
0
1
e2
U
N
V
S
T
R
C
T
Y
S
T
Z
E
D
A
D
S
H
M
N
1
e3
Experiment
1 Dimension
Table
3
2
a
c
h
2
2
b
s
h
2
4
a
c
a
1
2
4
a
s
a
1
0
(MIAME)
chromosome,length
o0
1
P
I
g3
17, 78 12, 60 Mi, 40 1, 48
1 0
17, 40
1
10,
75 1 0
o1
1 e14,
00
1
0
65 0 0
0
o2 16, 760 0
1 9, 45 0 Pl, 43 0
0
o3
3000
g1
e1
L
A
B
Gene-Org
Dim Table
Gene Dimension Table
0
1
0
1
1
1
0
0
0
0
0
1
0
1
0
1
0
1
Gene-Experiment-Organism Cube
(1 iff that gene from that organism
expresses at a threshold level in that
experiment.)
Many-to-many-to-many relationship
Protien-Protien Interactions
(PPI) (2-hop interactions)
Gene Dimension Table
g3
1
0
0
0
g2
0
0
1
0
1
1
1
0
0
g1
g13
1
0
0g2
1
0
0
0
1
1
1
1
0
g0
Myt
a
Rib
o
Nucl
Rib
o
Function
StopCodonDensity
apo
p
.1
mei
o
.1
mit
o
.1
apo
p
.9
PolyA-Tail
1
1
0
0
M
y
t
a
R
i
b
o
N
u
c
l
a
p
o
p
M
e
i
o
M
i
t
o
S
C
D
1
S
C
D
2
S
C
D
3
S
C
D
4
1
0
0
1
0
1
0
0
0
0
1
0
0
1
0
0
0
0
0
1
0
0
1
0
g2
0
1
0
1
0
0
1
g3
Gene Dimension Table (Binary)
g0
g1
SubCellLocation
g0
1
g1
0
G
E
N
E
1
P
ol
yA
1
0
1
1
g2
0
0
1
0
g3
0
0
1
0
g4
1
0
g1
KDD-Cup ’02 results
NDSU Team
Greyware PPI graph mining tool
 Visualize feature information using a glyph for each gene (PPI
graph node)
 PPI Edge iff the 2 genes code for interacting proteins
le
n
g
t
h
e
ss
e
nt
ia
l
Di
sce
nt
er
M
y
t
a
R
i
b
o
N
u
c
l
a
p
o
p
M
e
i
o
M
i
t
o
S
C
D
G
E
N
E
1
In
foqt
y
4
1
0
0
1
0
1
o
1
0
1
0
0
1
0
.1
6
0
5
1
g2
0
0
1
0
0
1
.1
4
0
0
5
g3
0
1
0
1
0
0
.9
9
0
8
2
g4
4
Glyp
h for
g1
g1
Gene Dimension Table (non-binary)
stopcodondensity
This visual data mining tool was effective in KDD-CUP ’02)
Thanks so much!
 Don’t forget to submit your best work to CAINE, Nov 1113, 2003, Las Vegas NV by July 1. Submit to Program Chair,
[email protected] or Conference Chair,
[email protected]
http:/www.cs.ndsu.nodak.edu/~krile/caine03 or http:www.isca-hq.org
For those interested in DM in genomics and bioinformatics,
Virtual Conference in Genomics and Bioinformatics
VGAB-III, Sep 16-18, 2003 http:www.ndsu.edu/~virtual-genomics
Submit papers to Program Chair, [email protected]
or to the Conference Chair,
[email protected]
 VGAB-III will be available over Access Grid and Real Player
to anywhere, for free (no registration fee)