03vertical_audio0 - NDSU Computer Science


3. Vertical Data
First, a brief description of
Data Warehouses (DWs) versus
Database Management Systems (DBMSs)

C.J. Date recommended, circa 1980:
Do transaction processing on a DataBase Management System (DBMS),
rather than doing file processing on file systems.
“Using a DBMS, instead of file systems,
• unifies data resources,
• centralizes control,
• standardizes usages,
• minimizes redundancy and inconsistency,
• maximizes data value and usage.”
Inmon et al., circa 1990, recommended:
“Buy a separate Data Warehouse (DW) for long-running queries and data
mining (separate from the DBMS for transaction processing).”
“Double your hardware! Double your software! Double your fun!”
Section 3
#0
Data Warehouses (DWs)
vs.
DataBase Management Systems (DBMSs)
• What happened?
• Inmon's idea was a great marketing success,
• but it revealed a great Concurrency Control Research &
Development (CC R&D) failure!
CC R&D people had failed to integrate transaction processing and query
processing (also known as OnLine Transaction Processing (OLTP) and
OnLine Analytic Processing (OLAP), that is, update and read workloads)
in one system with acceptable performance!
• Marketing of Data Warehouses was so successful that nobody
noticed the failure (or seemed to mind paying double)!
• Most enterprises now have a DW separate from their DBMS.
Section 3
# 0.1
Some still hope DWs and DBs will one day be unified again.
The industry may demand it eventually; e.g., there is already research work on
real-time updating of Data Warehouses (DWs).
For now, let’s just focus on DATA.
You run up against two curses immediately in data processing.
Curse of cardinality: solutions don’t scale well with respect to record volume.
"files are too deep!"
Curse of dimensionality: solutions don’t scale well with respect to attribute dimension.
"files are too wide!"
• The curse of cardinality is a problem in both the horizontal and the vertical world!
• In the horizontal world it was disguised as the “curse of the slow join”.
In the horizontal world we decompose relations to get good design
(e.g., 3rd normal form), but then we pay for that by requiring many
slow joins to get the answers we need.
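The cost of decomposition can be illustrated with a minimal sketch. The two relations below (`students` and `enrollments`) and their rows are hypothetical, invented only for this example, and the simple hash join stands in for whatever join method a real DBMS would choose.

```python
# Hypothetical normalized relations (names and rows are illustrative only).
students = [            # (student_id, name)
    (1, "Ann"),
    (2, "Bob"),
    (3, "Cy"),
]
enrollments = [         # (student_id, course, grade)
    (1, "DB", "A"),
    (2, "DB", "B"),
    (1, "OS", "B"),
    (3, "OS", "A"),
]

def hash_join(left, right, left_key=0, right_key=0):
    """Join two lists of tuples on the given key positions."""
    index = {}
    for row in left:                       # build phase: one pass over left
        index.setdefault(row[left_key], []).append(row)
    joined = []
    for row in right:                      # probe phase: one pass over right
        for match in index.get(row[right_key], []):
            joined.append(match + row)     # (id, name, id, course, grade)
    return joined

# "Which students earned an A?" needs the join, because names and grades
# live in different relations after normalization.
joined_rows = hash_join(students, enrollments)
a_students = sorted({row[1] for row in joined_rows if row[4] == "A"})
print(a_students)
```

Every additional decomposed relation adds another pass of this kind, and the work grows with the number of records, which is where the curse of cardinality shows up.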
Section 3
# 0.2
Techniques to address these curses:
Horizontal Processing of Vertical Data (HPVD), instead of the ubiquitous Vertical
Processing of Horizontal (record-oriented) Data (VPHD).
Parallelizing the processing engine:
• Parallelize the software engine on clusters of computers.
• Parallelize the greyware engine on clusters of people
(i.e., enable visualization and use the web...).
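The contrast between VPHD and HPVD can be sketched as follows. This is only an illustration under assumptions: the table, the attribute names, and the use of Python integers as uncompressed bit vectors are all hypothetical, and the compressed vertical structure actually used (the Ptree) is defined later.

```python
# Hypothetical horizontal table: one record per row, two 1-bit attributes.
records = [
    {"hi_green": 1, "low_red": 1},
    {"hi_green": 1, "low_red": 0},
    {"hi_green": 0, "low_red": 1},
    {"hi_green": 1, "low_red": 1},
]

def count_vphd(rows):
    """Vertical Processing of Horizontal Data: scan down the records one by one."""
    return sum(1 for r in rows if r["hi_green"] and r["low_red"])

def to_vertical(rows, attribute):
    """Slice one attribute out as a bit vector (bit i holds the value for record i)."""
    bits = 0
    for i, r in enumerate(rows):
        if r[attribute]:
            bits |= 1 << i
    return bits

def count_hpvd(columns):
    """Horizontal Processing of Vertical Data: AND whole columns, then count 1-bits."""
    combined = columns[0]
    for col in columns[1:]:
        combined &= col
    return bin(combined).count("1")

green = to_vertical(records, "hi_green")
red = to_vertical(records, "low_red")
print(count_vphd(records), count_hpvd([green, red]))
```

Both functions return the same count; the difference is that the vertical version touches only the columns named in the question and processes many records per machine operation.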
Again, we need better techniques for data analysis, querying and mining
because of:
Parkinson’s Law: Data volume expands to fill available data storage.
Moore’s law (as applied to storage): Available storage doubles roughly every 9 months!
Section 3
#2
A few HPVD successes: 1. Precision Agriculture
Yield prediction using Remotely Sensed Imagery (RSI): the data set consists of an aerial photograph (RGB TIFF image
taken ~July) and a synchronized crop yield map taken at harvest; thus, 4 feature attributes (B,G,R,Y) and
~100,000 pixels.
Producers are able to analyze the color intensity patterns from
aerial and satellite photos taken in mid-season to predict yield
(find associations between electromagnetic reflection and yield).
E.g., “hi_green & low_red → hi_yield”. That is very intuitive.
[Figures: TIFF image; Yield Map]
A stronger association, “hi_NIR & low_red → hi_yield”
(found through HPVD data mining), allows producers to take and query mid-season aerial photographs for
low_NIR & high_red grid cells and, where low yield is anticipated, apply (top dress) additional nitrogen.
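One common way to say that one association is "stronger" than another is to compare support and confidence. The source does not state which measure was used, so the sketch below is an assumption; the 8-pixel bit vectors are made-up data, and plain Python integers stand in for the compressed vertical structures.

```python
def popcount(bits):
    """Number of 1-bits in a non-negative integer used as a bit vector."""
    return bin(bits).count("1")

def rule_strength(antecedent_columns, consequent_column, n_pixels):
    """Support and confidence of the rule (AND of antecedents) -> consequent."""
    antecedent = (1 << n_pixels) - 1          # start with all pixels selected
    for col in antecedent_columns:
        antecedent &= col                     # pixels satisfying every antecedent
    both = antecedent & consequent_column     # ...that also satisfy the consequent
    support = popcount(both) / n_pixels
    confidence = popcount(both) / popcount(antecedent) if antecedent else 0.0
    return support, confidence

# Hypothetical 8-pixel image: bit i is 1 when pixel i has the property.
hi_nir   = 0b11110000
low_red  = 0b11011000
hi_yield = 0b11010001

support, confidence = rule_strength([hi_nir, low_red], hi_yield, 8)
print(support, confidence)
```

With this toy data, 3 of the 8 pixels satisfy both sides of the rule, and every pixel that satisfies the antecedent also has high yield.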
Can producers use Landsat images of China to predict wheat prices before planting?
2. Infestation Detection (e.g., Grasshopper Infestation Prediction - again involving RSI)
Grasshoppers cause significant economic loss each year.
Early infestation prediction is key to damage control.
Pixel classification on remotely sensed imagery holds much promise
for achieving early detection.
Pixel classification (signaturing) has many, many applications: pest
detection, flood monitoring, fire detection, wetlands monitoring …
Section 3
#3
3. Sensor Network Data HPVD
• Micro- and nano-scale sensor blocks are being developed for sensing:
  Biological agents
  Chemical agents
  Motion detection
  Coatings deterioration
  RF-tagging of inventory (RFID tags for Supply Chain Mgmt)
  Structural materials fatigue
• There will be trillions++ of individual sensors creating
mountains of data, which can be data mined using HPVD
(maybe it shouldn't be called a success yet?).
Section 3
#4
4. A Sensor Network Application:
CubE for Active Situation Replication (CEASR)
[Figure: nano-sensors dropped into the situation space; the soldier sees a replica of the sensed situation prior to entering the space.]
Wherever a threshold level is sensed (of chemical, biological,
thermal, etc.), a ping is registered in a compressed vertical data
structure for that location. (The compressed vertical data structure
is a Ptree. A detailed definition of Ptrees is coming up later.)
A clear plexiglass cube, with embedded nano-LEDs at each
voxel (volume pixel), displays the situation to the user.
The single compressed vertical data structure (Ptree) containing
all the information is transmitted to the cube, where the pattern
is reconstructed (uncompressed and displayed).
Each energized nano-sensor transmits a ping (location is triangulated from the
ping). These locations are then translated to 3-dimensional coordinates at the
display. The corresponding voxel on the display lights up. This is the
expendable, one-time, cheap sensor version. A more sophisticated CEASR
device could sense and transmit the intensity levels, lighting up the display voxel
with the same intensity.
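The ping-to-voxel step described above can be sketched as follows. This is a minimal illustration under assumptions: triangulation is taken as already done (each ping arrives as an x, y, z position inside a cubic space), all function names are hypothetical, and an uncompressed bit vector stands in for the Ptree, which is defined later.

```python
def voxel_index(x, y, z, space_size, n):
    """Map a position in a cube of side `space_size` to an index in an n*n*n display."""
    def cell(v):
        # Scale to the display resolution; clamp so v == space_size stays in range.
        return min(int(v / space_size * n), n - 1)
    i, j, k = cell(x), cell(y), cell(z)
    return (i * n + j) * n + k

def register_pings(pings, space_size, n):
    """Set one bit per voxel in which at least one sensor pinged."""
    bits = 0
    for x, y, z in pings:
        bits |= 1 << voxel_index(x, y, z, space_size, n)
    return bits

def lit_voxels(bits, n):
    """Decode the bit vector back into the (i, j, k) voxels the cube should light."""
    coords = []
    for idx in range(n ** 3):
        if (bits >> idx) & 1:
            coords.append((idx // (n * n), (idx // n) % n, idx % n))
    return coords

# Three hypothetical pings in a 10 x 10 x 10 space, shown on a 4 x 4 x 4 cube.
pings = [(0.5, 0.5, 0.5), (9.9, 0.0, 5.0), (0.6, 0.4, 0.9)]
grid = register_pings(pings, space_size=10.0, n=4)
print(lit_voxels(grid, 4))
```

Two of the three pings fall in the same voxel, so only two voxels light up. The intensity-sensing version mentioned above would need more than one bit per voxel.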
Section 3 # 5
5. Anthropology Application
Digital Archive Network for Anthropology (DANA)
(analyze, query and mine anthropological artifacts: shape, color, discovery location, …)
Section 3
#6
What has spawned these successes?
(i.e., What is Data Mining?)
Querying is asking specific questions for specific answers.
Data Mining is finding the patterns that exist in data
(going into MOUNTAINS of raw data for the information gems
hidden in that mountain of data).
[Figure: the data mining process, with loop-backs between stages]
Raw data (must be cleaned of: missing items, outliers, noise, errors)
→ Data Warehouse (cleaned, integrated, read-only, periodic, historical database)
→ Selection (feature extraction, tuple selection)
→ Task-relevant Data
→ Data Mining (Classification, Clustering, Rule Mining)
→ Pattern Evaluation and Assay
→ Visualization
Other figure label: Smart files
Section 3
#7