Data Mining - Lyle School of Engineering

DATA WAREHOUSING & INFORMATION RETRIEVAL
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
PO Box 750122
Dallas, Texas 75275-0122
214-768-3087
[email protected]
The contents of this presentation draw extensively from slides for:
Data Mining, Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
4/17/07, Tecnológico de Monterrey, SMU
CSE 8337
1
DW&IR Outline
Introduction
Data Warehousing
Research
Summary

DW&IR Outline
Introduction
– Data Warehousing Overview
– Information Retrieval
Data Warehousing
Research
Summary
Data Warehousing
“Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon, http://www.inmondatasystems.com/)
Operational Data: data used in the day-to-day needs of a company.
Informational Data: supports other functions such as planning and forecasting.
Data mining tools often access data warehouses rather than operational data.
Data Warehouse Variations
Data Mart – Subset of the complete data warehouse
Virtual Warehouse – Warehouse implemented as a view of operational data
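A virtual warehouse can be illustrated with an ordinary SQL view. The sketch below is only an illustration using Python's built-in sqlite3 module; the `orders` table, the `sales_by_city` view, and all values are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical operational table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT,
                         quantity INTEGER, unit_price REAL);
    INSERT INTO orders VALUES (1, 'Dallas', 2, 10.0),
                              (2, 'Dallas', 1, 5.0),
                              (3, 'Monterrey', 4, 2.5);
    -- The "virtual warehouse": a summarized view, no data is copied.
    CREATE VIEW sales_by_city AS
        SELECT city, SUM(quantity) AS total_quantity,
               SUM(quantity * unit_price) AS revenue
        FROM orders GROUP BY city;
""")

VIEW_QUERY = "SELECT city, total_quantity, revenue FROM sales_by_city ORDER BY city"
before = conn.execute(VIEW_QUERY).fetchall()

# A new operational record is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (4, 'Dallas', 1, 1.0)")
after = conn.execute(VIEW_QUERY).fetchall()
print(before)
print(after)
```

Because the view is computed from the operational table on demand, it always reflects the current operational data, which is what separates it from a warehouse that stores its own historical copy.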

Operational vs. Informational
Attribute      Operational Data      Data Warehouse
Application    OLTP                  OLAP
Use            Precise Queries       Ad Hoc
Temporal       Snapshot              Historical
Modification   Dynamic               Static
Orientation    Application           Business
Data           Operational Values    Integrated
Size           Gigabytes             Terabytes
Level          Detailed              Summarized
Access         Often                 Less Often
Response       Few Seconds           Minutes
Data Schema    Relational            Star/Snowflake
Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query: Find all documents about “data mining”
IR being applied to other unformatted data
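The keyword-based sample query can be sketched in a few lines of Python. This is a minimal illustration, assuming a document matches when it contains every query word. The three documents and the `keyword_search` helper are hypothetical.

```python
# Hypothetical document collection, for illustration only.
documents = {
    "d1": "Data mining extracts patterns from large data sets",
    "d2": "A data warehouse stores integrated historical data",
    "d3": "Mining equipment safety report",
}

def keyword_search(query, docs):
    """Return ids of documents that contain every query keyword."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in terms)
    )

hits = keyword_search("data mining", documents)
print(hits)
```

Only d1 contains both words. Document d3 mentions mining in an unrelated sense, which hints at why pure keyword matching gives fuzzy results.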
DB vs IR
Records (tuples) vs. documents
Well-defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles

DB vs IR (cont’d)
Data retrieval
– which docs contain a set of keywords?
– well-defined semantics
– a single erroneous object implies failure!
Information retrieval
– information about a subject or topic
– semantics is frequently loose
– small errors are tolerated
IR system:
– interpret contents of information items
– generate a ranking which reflects relevance
– notion of relevance is most important
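To make the ranking idea concrete, here is a rough sketch that orders documents by a crude relevance score. The score (total occurrences of the query terms) is an assumption chosen for brevity and is not the measure used by any particular IR system. The documents are made up.

```python
from collections import Counter

def score(query, text):
    """Crude relevance score: total occurrences of the query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())

def rank(query, docs):
    """Return ids of documents with a nonzero score, best first (ties by id)."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]

# Hypothetical documents, for illustration only.
docs = {
    "a": "data mining and data warehousing",
    "b": "mining for gold",
    "c": "library science",
}
ranking = rank("data mining", docs)
print(ranking)
```

Unlike data retrieval, document b is still returned even though it matches only one keyword. It is simply ranked lower.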
Information Retrieval (cont’d)
Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
Metrics:
– Precision = |Relevant and Retrieved| / |Retrieved|
– Recall = |Relevant and Retrieved| / |Relevant|
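The two metrics translate directly into set operations. The following sketch uses hypothetical document ids. Returning 0.0 for an empty set is a convention assumed here, not something stated on the slide.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 3 documents truly relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).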

IR Query Result Measures and Classification
[Figure: two panels, labeled IR and Classification]
DW&IR Outline
Introduction
Data Warehousing
– Dimensional Modeling
– OLAP
– Decision Support Systems
Research
Summary
Data Transformation for Data Warehouse
ETL – Extract, Transform, Load
Unwanted data must be removed
Convert heterogeneous sources into one common schema
As the operational data is probably a snapshot of the data, multiple snapshots may need to be merged to create the historical view
Summarize data
Create new derived data
Handle missing and erroneous data
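The transformation steps above can be sketched as a tiny ETL pipeline. Everything in it is a hypothetical illustration: the two source layouts, the field names, and the choice to drop records with a missing quantity are assumptions, not part of the slides.

```python
import sqlite3

# Extract: two hypothetical operational sources with different schemas.
store_a = [
    {"sku": "P1", "qty": 2, "price_usd": 10.0, "day": "2007-04-16"},
    {"sku": "P2", "qty": None, "price_usd": 4.0, "day": "2007-04-16"},  # missing qty
]
store_b = [("P1", 3, 1000, "2007-04-17")]  # (product, units, price in cents, date)

def transform(a_rows, b_rows):
    """Map both sources onto one common schema and derive a revenue column."""
    common = []
    for row in a_rows:
        if row["qty"] is None:  # handle missing data: this sketch drops the record
            continue
        common.append((row["sku"], row["day"], row["qty"], row["price_usd"]))
    for product, units, cents, date in b_rows:
        common.append((product, date, units, cents / 100))  # cents -> dollars
    # New derived data: revenue = quantity * unit price.
    return [(p, d, q, price, q * price) for p, d, q, price in common]

def load(rows):
    """Load the transformed rows into an in-memory warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (product TEXT, day TEXT, quantity INTEGER, "
        "unit_price REAL, revenue REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)
    return conn

warehouse = load(transform(store_a, store_b))
# Summarize: both daily snapshots are merged into one historical total.
summary = warehouse.execute(
    "SELECT product, SUM(quantity), SUM(revenue) FROM sales GROUP BY product"
).fetchall()
print(summary)
```

A real pipeline would more likely flag or impute missing values than silently drop them. The sketch only shows where that decision sits in the flow.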
Data Warehouse Creation
Fig 1 [1]
Dimensional Modeling
View data in a hierarchical manner, more as business executives might
Useful in decision support systems and mining
Dimension: collection of logically related attributes; axis for modeling data.
Facts: data stored
Ex: Dimensions – products, locations, date
    Facts – quantity, unit price
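As a small illustration of the example, each fact can be modeled as a record whose dimension values locate it and whose measures hold the stored data. The `Fact` class, the `total_quantity` helper, and the sample values below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    product: str       # dimension value
    location: str      # dimension value
    date: str          # dimension value
    quantity: int      # fact (measure)
    unit_price: float  # fact (measure)

# Made-up facts, for illustration only.
facts = [
    Fact("Widget", "Dallas", "2007-04-16", 3, 2.0),
    Fact("Widget", "Monterrey", "2007-04-16", 1, 2.0),
    Fact("Gadget", "Dallas", "2007-04-17", 2, 5.0),
]

def total_quantity(facts, **dims):
    """Sum quantity over the facts whose dimension values match the filters."""
    return sum(
        f.quantity
        for f in facts
        if all(getattr(f, name) == value for name, value in dims.items())
    )

print(total_quantity(facts, location="Dallas"))
print(total_quantity(facts, product="Widget"))
```

Each keyword argument selects along one axis, which is the sense in which a dimension acts as an axis for modeling the data.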
Multidimensional Model Example
Fig 2 [1]
Cube view of Data
Fig 4 [1]
Aggregation Hierarchies
Multidimensional Schemas
Star Schema shows facts and dimensions
– Center of the star has facts shown in fact tables
– Outside of the facts, each dimension is shown separately in dimension tables
– Access to fact table from dimension table via join
    SELECT Quantity, Price
    FROM Facts, Location
    WHERE (Facts.LocationID = Location.LocationID) AND
          (Location.City = 'Dallas')
– Viewed as relations, the problems are volume of data and indexing
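The join from the slide can be tried end to end with Python's built-in sqlite3 module. The table and column names follow the slide's query. The `State` column, the sample rows, and the added `ORDER BY` (used only to make the output order predictable) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per location.
    CREATE TABLE Location (LocationID INTEGER PRIMARY KEY, City TEXT, State TEXT);
    -- Fact table: measures plus a key into the dimension table.
    CREATE TABLE Facts (LocationID INTEGER REFERENCES Location(LocationID),
                        Quantity INTEGER, Price REAL);
    INSERT INTO Location VALUES (1, 'Dallas', 'TX'), (2, 'Houston', 'TX');
    INSERT INTO Facts VALUES (1, 5, 9.99), (2, 3, 4.50), (1, 2, 20.00);
""")

query = """
    SELECT Quantity, Price
    FROM Facts, Location
    WHERE Facts.LocationID = Location.LocationID
      AND Location.City = 'Dallas'
    ORDER BY Quantity
"""
dallas_rows = conn.execute(query).fetchall()
print(dallas_rows)
```

Only the two fact rows keyed to the Dallas location are returned. With a realistic fact table of millions of rows, this join is where the volume and indexing concerns noted above arise.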
Star Schema
Flattened Star
Normalized Star
Snowflake Schema
OLAP
Online Analytical Processing (OLAP): supports more complex queries than OLTP.
Online Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view
Supports ad hoc querying
Requires analysis of data
Can be thought of as an extension of some of the basic aggregation functions available in SQL
OLAP tools may be used in decision support systems (DSS)
Multidimensional view is fundamental
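The link to SQL aggregation can be seen by running the same `SUM` at several levels of grouping. The sketch below uses Python's built-in sqlite3 module with a hypothetical `sales` table. It only illustrates plain `GROUP BY`, not a full OLAP engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, city TEXT, quantity INTEGER);
    INSERT INTO sales VALUES ('Widget', 'Dallas', 3), ('Widget', 'Houston', 1),
                             ('Gadget', 'Dallas', 2), ('Gadget', 'Houston', 4);
""")

# Most detailed level: one total per (product, city) cell.
by_product_city = conn.execute(
    "SELECT product, city, SUM(quantity) FROM sales "
    "GROUP BY product, city ORDER BY product, city"
).fetchall()

# One dimension aggregated away: one total per product.
by_product = conn.execute(
    "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
).fetchall()

# All dimensions aggregated away: the grand total.
grand_total = conn.execute("SELECT SUM(quantity) FROM sales").fetchone()[0]
print(by_product, grand_total)
```

An OLAP tool lets the user move between these levels interactively instead of writing a separate query for each one.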
OLAP Implementations
MOLAP (Multidimensional OLAP)
– Multidimensional Database (MDD)
– Specialized DBMS and software system capable of supporting the multidimensional data directly
– Data stored as an n-dimensional array (cube)
– Indexes used to speed up processing
ROLAP (Relational OLAP)
– Data stored in a relational database
– ROLAP server (middleware) creates the multidimensional view for the user
– Less complex; less efficient
HOLAP (Hybrid OLAP)
– Data not updated frequently is stored in the MDD
– Data updated frequently is stored in the RDB
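The MOLAP idea of storing data as an n-dimensional array can be sketched with nested lists and per-dimension index maps. This is a toy illustration under assumed dimension values and helper names. Real multidimensional databases use far more elaborate storage and indexing.

```python
# Hypothetical dimension values, for illustration only.
products = ["Widget", "Gadget"]
cities = ["Dallas", "Houston"]
quarters = ["Q1", "Q2"]

# Index maps translate a dimension value into an array position.
p_idx = {value: i for i, value in enumerate(products)}
c_idx = {value: i for i, value in enumerate(cities)}
q_idx = {value: i for i, value in enumerate(quarters)}

# cube[product][city][quarter] holds a quantity; every cell starts at zero.
cube = [[[0 for _ in quarters] for _ in cities] for _ in products]

def add_sale(product, city, quarter, quantity):
    """Add a quantity to the cell addressed by the three dimension values."""
    cube[p_idx[product]][c_idx[city]][q_idx[quarter]] += quantity

def cell(product, city, quarter):
    """Read one cell directly by position, with no join or table scan."""
    return cube[p_idx[product]][c_idx[city]][q_idx[quarter]]

add_sale("Widget", "Dallas", "Q1", 3)
add_sale("Widget", "Dallas", "Q1", 2)
add_sale("Gadget", "Houston", "Q2", 4)
print(cell("Widget", "Dallas", "Q1"))
```

A ROLAP system would instead keep the same facts as rows in relational tables and compute the cube view on request.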
OLAP Operations
[Figure: cube diagrams labeled Roll Up, Drill Down, Single Cell, Multiple Cells, Slice, Dice]
OLAP Operations
Simple query – Single cell in the cube
Slice – Look at a subcube to get more specific information
Dice – Look at a smaller subcube by selecting on two or more dimensions (rotating the cube to look at another dimension is usually called pivot)
Roll Up – Dimension reduction; aggregation
Drill Down – Reverse of roll up; move to a more detailed level
Visualization: these operations allow OLAP users to actually “see” the results of an operation.
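The operations can be sketched on a small cube held as a Python dictionary. The cube contents and helper names are hypothetical. Dice is implemented here in its common sense of selecting on two or more dimensions, and slice as selecting one value on one dimension.

```python
from collections import defaultdict

DIMS = ("product", "city", "quarter")

# cube maps (product, city, quarter) -> quantity; the values are made up.
cube = {
    ("Widget", "Dallas", "Q1"): 3,
    ("Widget", "Dallas", "Q2"): 1,
    ("Widget", "Houston", "Q1"): 2,
    ("Gadget", "Dallas", "Q1"): 5,
    ("Gadget", "Houston", "Q2"): 4,
}

def slice_cube(cube, dim, value):
    """Slice: keep the cells with one fixed value on a single dimension."""
    i = DIMS.index(dim)
    return {key: qty for key, qty in cube.items() if key[i] == value}

def dice_cube(cube, **allowed):
    """Dice: keep the cells whose value on each named dimension is allowed."""
    return {
        key: qty
        for key, qty in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }

def roll_up(cube, drop_dim):
    """Roll up: aggregate one dimension away by summing over it."""
    i = DIMS.index(drop_dim)
    totals = defaultdict(int)
    for key, qty in cube.items():
        totals[key[:i] + key[i + 1:]] += qty
    return dict(totals)

single_cell = cube[("Widget", "Dallas", "Q1")]  # simple query
q1_slice = slice_cube(cube, "quarter", "Q1")
widget_dallas = dice_cube(cube, product={"Widget"}, city={"Dallas"})
by_product_city = roll_up(cube, "quarter")
print(single_cell, q1_slice, widget_dallas, by_product_city)
```

Drill down is the reverse direction: going from `by_product_city` back to the per-quarter cells in `cube`, which requires the detailed data to still be available.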
Relationship Between Topics
Decision Support Systems
Tools and computer systems that assist management in decision making
“What if” types of questions
High-level decisions
Data warehouse – data which supports DSS

Data Warehouse Links
OLAP
– http://www.olapreport.com/
General Data Warehousing
– http://www.inmoncif.com/home/
– http://www.datawarehouseconsulting.com/
– http://www.datawarehousing.com/
– http://www.dw-institute.com/
DW Products
– http://www-306.ibm.com/software/data/informix/redbrick/
– http://www.oracle.com/solutions/business_intelligence/dw_home.html
– http://www.sas.com/technologies/dw/index.html
– http://msdn2.microsoft.com/en-us/library/aa545535.aspx
– http://www.sybase.com/detail?id=1027323
Interesting Articles
– “Teaching Effective Methodologies to Design a Data Warehouse,” by Behrooz Seyed-Abbassi
  http://isedj.org/isecon/2001/35c/ISECON.2001.Seyed-Abbassi.pdf
– “An Oracle DBA’s Guide to the OLAP Option,” by Mark Rittman
  http://www.dbazine.com/datawarehouse/dw-articles/rittman1
DW&IR Outline
Introduction
Data Warehousing
Research
– Bibliomining
– Hierarchical Multimedia IR
– Ontology-based OLAP & IR
Summary
Bibliomining [2,3]
Data Warehousing + Data Mining + Libraries
Abstract, cleanse, summarize library data
– Documents
– Users (including demographics)
– Circulation Records (including Web server records)
Privacy of utmost importance
http://www.bibliomining.com/nicholson/biblioprocess.htm [2]
http://bibliomining.com/nicholson/nicholsonbibliointro.html [3]
Hierarchical Multimedia IR [4]
DW Approach to Multimedia IR
– Allows easier integration of multiple data types
– Facilitates indexing
– Facilitates searching
– Allows data to be stored at many different granularities and dimensions
– Data aggregation
“data warehouses are not just large databases; they are large, complex environments that integrate many technologies” [4, p. 729]
Multimedia starflake schema
– Denormalized star dimension table
– Normalized snowflake tables
Starflake
Fig 2 [4]
Hierarchy of Data Cubes
Fig 4 [4]
Ontology-Based OLAP & IR [5]
Combine structured and document data obtained from the Web
Global Ontology
– Includes OLAP dimensions
– Contains resource metadata
– RDF based
IR based on:
– Both queries and resources represented as RDF metadata
– http://www.w3.org/RDF/
Ontology OLAP&IR Architecture
Fig 1 [5]
OLAP Dimensions in RDF
Fig 2 [5]
RDF Query
Fig 6 [5]
DW&IR Outline
Introduction
Data Warehousing
Research
Summary
Summary
Information Retrieval is being extended to many different data types
– Multimedia
– Data warehouse
Data Warehousing is being extended beyond the basic business domain
Little research in combining DW and IR
“Integrating Unstructured Text into the Structured Environment: The Value Proposition,” by Bill Inmon
– http://www.inmondatasystems.com/whitepapers/integratingunstructured.pdf
Bibliography
[1] Anne-Muriel Arigon, Anne Tchounikine, and Maryvonne Miquel, “Handling
Multiple Points of View in a Multimedia Data Warehouse,” ACM Transactions on
Multimedia Computing, Communications and Applications, Vol. 2, No. 3, August
2006, Pages 199–218.
[2] S. Nicholson, “The Bibliomining Process: Data Warehousing and Data Mining
for Library Decision-Making,” Information Technology and Libraries, 22(4),
2003.
[3] S. Nicholson, “The Basis for Bibliomining: Frameworks for Bringing Together
Usage-Based Data Mining and Bibliometrics through Data Warehousing in
Digital Library Services,” Information Processing & Management, 42(3), May
2006, pp 785–804.
[4] Jane You, Tharam Dillon, James Liu, Edwige Pissaloux, “On Hierarchical
Multimedia Information Retrieval,” Proceedings of the 2001
International Conference on Image Processing, 7-10 Oct 2001, pp 729–732.
[5] Torsten Priebe and Gunther Pernul, “Ontology-based Integration of OLAP and
Information Retrieval,” Proceedings of the 14th International Workshop on
Database and Expert Systems Applications, 2003.