session 8-9 Data Resource Management

Download Report

Transcript session 8-9 Data Resource Management

PART Ⅱ Introduction to the
Foundation of Information
Technologies
Session 8-9
Data Resource Management
HUANG Lihua, Fudan University
Content
• Database Concepts & Technology
– Experiment: ACCESS
• Database Trends
• Data Trends of Application
– Data Warehouse
– OLAP
– DATA Mining
• Creating Database Environment
HUANG Lihua, Fudan University
DATA
• Streams of raw facts representing events
such as business transactions, simple
observations of the state of the world
HUANG Lihua, Fudan University
FILE ORGANIZATION
• A computer system organizes data in a hierarchy
that begins with bits, and proceeds to bytes, fields,
records, files, and database.
File
Record
Field
Byte
Record
Field
Byte
Byte
Bit
Field
Bit
HUANG Lihua, Fudan University
Filed
Byte
Bit
Bit
FILE ORGANIZATION
• BIT: Binary Digit (0,1;Y,N;On, Off)
• BYTE: Combination of BITS which
represent a CHARACTER
• FIELD: A logical grouping of characters into a
word, a group of words, or a complete number.
• RECORD: Collection of FIELDS which reflect a
TRANSACTION
• FILE: A Collection of Similar RECORDS
• DATABASE: An Organization’s Electronic
Library of FILES
HUANG Lihua, Fudan University
FILE ORGANIZATION
• Filed: Student’s name;
• Record:
A
record
For Example,
Number
Name
Course
Date
Grade
9525012
Zhang
Yan
MIS
1998.9
A
Name
Course
Date
Grade
Zhang
Yan
9525018 Jeff Yu
MIS
1998.9
A
MIS
1998.9
A
9525027
…
He Hui
MIS
1998.9
B
…
…
…
…
•File:
Number
9525012
HUANG Lihua, Fudan University
FILE ORGANIZATION : Another way of
thinking about database components——
• ENTITY: Person, Place, Thing, Event about
Which Data Must be Kept ( a record
describes an entity )
• ATTRIBUTE: Description of a Particular
ENTITY (corresponds to fields)
• KEY FIELD: Field Used to Retrieve,
Update, Sort RECORD
*
HUANG Lihua, Fudan University
FILE ORGANIZATION
Record
Key Field
Attribute
Key Field
FDU NO.
HKU NO.
NAME
SEX
98HM001
93835
Xie Mingqiang
M
58702331
98HM002
93840
Yu Bing
F
65110968
98hm003
93841
Wang Pei
F
58711001
-23306
63568504
98HM004
93842
Ge Ruijin
M
56938860
56873143
98HM005
93843
Wang Xintao
M
58611828
65352394
98HM006
93844
Fu Qiang
F
58666060
-6007
58836304
File
TEL(O)
HUANG Lihua, Fudan University
TEL(H)
KEY FIELD
• Field in Each Record
• Uniquely Identifies THIS Record
• For RETRIEVAL
UPDATING
SORTING
*
HUANG Lihua, Fudan University
Accessing Records from Computer
Files:Sequential vs. Direct or random
file organization
• SEQUENTIAL: Data records must be
retrieved in the same physical sequence in
which they are stored. (Magnetic tape )
• DIRECT: Data can be accessed without
regard to physical sequence. (Disk)
*
Direct file
organization
Sequential file
organization
HUANG Lihua, Fudan University
Traditional File Processing & File
Organization
registration
Class programs
Class file
accounting
Accounts programs
Class file
athletics
Sports programs
Class file
HUANG Lihua, Fudan University
Traditional File Processing & File
Organization
HUANG Lihua, Fudan University
Problems Arising from the File Organization
• Data Redundancy: The same piece of information could be
duplicated in several files.
• Data Inconsistency:
• Data Isolation: Data files are likely to be organized
differently, stored in different formats, and often physically
inaccessible to other applications.
• data integrity problem: It is difficult to place data integrity
constraints across multiple data files.
• Application and Data Independence: In the file
environment, the applications and their associated data
files are dependent on each other.
• Poor security: is difficult to enforce in the file environment.
• Lack of data sharing & availability
HUANG Lihua, Fudan University
Flat
File
Problems Arising from the File Organization
•
•
•
•
•
Data Redundancy;
Data Inconsistency;
Data Isolation, data integrity problem;
Application and Data Independence;
Security, data sharing problem.
These problems led to the
development of DATABASE
HUANG Lihua, Fudan University
DATABASE
• A Database is an organized logical grouping
of related files.
• In a Database, data are stored & managed in
a convenient form, and integrated and
related so that one set of software programs
provides access to all the data.
HUANG Lihua, Fudan University
DATABASE
• Collection of centralized data
• Controls redundant data
• Data stored so as to appear to users in one
location
• Services multiple application
HUANG Lihua, Fudan University
DATABASE MANAGEMENT SYSTEM
(数据库管理系统DBMS)
•
Software to create & maintain DATA
enables business applications to extract
data independent of specific computer
programs.
HUANG Lihua, Fudan University
registrar
Class programs
Class file
accounting
Accounts programs
Class file
athletics
Sports programs
Class file
Computer based files of this type cause problems such as redundancy, inconsistency, and data isolation.
registrar
accounting
athletics
Class programs
Accounts programs
DBMS
Sports programs
HUANG Lihua, Fudan University
Database
Class file
Accounts file
Sports file
DBMS provides access
to all data in the
database
Database Environment
HUANG Lihua, Fudan University
COMPONENTS OF DBMS:
• DATA DEFINITION LANGUAGE:
– Defines Data Elements in Database
• DATA MANIPULATION LANGUAGE:
– Manipulates Data for Applications
– For Example: For extracting data from database, e.g.
SQL
• DATA DICTIONARY:
– Formal Definitions of all Variables in Database;
Controls Variety of Database Contents
*
HUANG Lihua, Fudan University
Sample data dictionary report
HUANG Lihua, Fudan University
Fundamental Database Structures
HUANG Lihua, Fudan University
HIERARCHICAL DATABASE
ROOT
FIRST
CHILD
2nd
CHILD
Employer
Compensation
Ratings
Salary
Job
Assignments
Pension
HUANG Lihua, Fudan University
Benefits
Insurance
Health
Type of RELATIONS
ONE-TO-ONE:
STUDENT
CLASS
ONE-TO-MANY:
STUDENT
A
MANY-TO-MANY:
ID
STUDENT
B
COURSE
1
STUDENT
C
COURSE
2
STUDENT STUDENT
HUANG Lihua,A
Fudan University
B
STUDENT
C
NETWORK DATA MODEL
• Variation of Hierarchical Model
• Useful for many-to-many relationships
COURSE
1
STUDENT
A
COURSE
2
STUDENT
B
STUDENT
C
HUANG Lihua, Fudan University
Disadvantages of Hierarchical and
Network DBMS
• Outdated
• Less flexible compared to RDBMS
• Lack support for ad-hoc and English
language-like queries
HUANG Lihua, Fudan University
RELATIONAL DATA MODEL
•
•
•
•
DATA IN TABLE FORMAT
RELATION: TABLE
Tuple(元组): ROW (record 记录) IN TABLE
Field: COLUMN (attribute 属性) IN TABLE
HOURS * RATE
TOTAL
ABLE
$ 40.50 $ 10.35 $ 419.18
BAXTER $ 38.00 $ 8.75 $ 332.50
CHEN
$ 42.70 $ 9.25 $ 394.98
DENVER $ 35.90 $ 9.50 $ 341.05
HUANG Lihua, Fudan University
Example DB: Fortune 500
Companies
• company
Comp. name sales
assets
netincome
empls
indcode
yr
allied
9115000
13271000
-279000
143800
37
85
boeing
9035000
7593000
292000
95700
37
82
...
• industry codes
indcode
indname
42
pharmaceuticals
44
computers
HUANG Lihua, Fudan University
...
The Relational Data Model
HUANG Lihua, Fudan University
Current DBMS: Relational Database
• DBMS Vendor
–
–
–
–
–
–
MS: Access, SQL Server
Oracle
Sybase
DB2
Informix
MySQL
HUANG Lihua, Fudan University
The Relational Database Model
 The relational model is based on a simple concept
of tables in order to capitalize on characteristics of
rows and columns of data, which is consistent
with real-world business situations.
 One of the greatest advantages of the relational
model is its conceptual simplicity and the ability
to link records in a way that is not predefined.
HUANG Lihua, Fudan University
The Relational Abstraction
• Information is in tables
– Also called (base) relations
• Columns define attributes(属性、字段、数据项)
– Also called fields or domains
• Rows define records
– Also called tuples(元组)
• Cells contain values
– All cells in column have information of same type
• e.g., integer, floating point, text, date
HUANG Lihua, Fudan University
Operations on Tables
• Add new rows (or sometimes columns)
– Modify existing rows
•
•
•
•
•
•
Choose a subset of columns
Choose a subset of rows
Combine rows (e.g., sum values in a column)
Combine columns
Combine two tables (join)
No operations to combine individual cells
– Unlike spreadsheet
HUANG Lihua, Fudan University
Three Basic Operations in a
Relational Database
• Select:
– Creates subset of rows that meet specific criteria
• Join:
– Combines relational tables to provide users with
information
• Project:
– Enables users to create new tables containing only
relevant information
HUANG Lihua, Fudan University
The three basic operations of a
relational DBMS
HUANG Lihua, Fudan University
Operating on Databases: SQL
• Every abstraction needs an interface through
which users invoke abstract operations
– graphical interface
– language
• Structured Query Language
– Select …(content)…. From… (table)..Where
(condition)
• We'll focus only on queries
– Query = question
– Extract some data from one or more tables to answer a
particular question
HUANG Lihua, Fudan University
Physical vs. Logical Data View
• Minimizes these problems by providing two
“views” of the database data:
– The physical view deals with the actual, physical
arrangement and location of data in the direct access
storage devices (DASD).
– The logical view, or user’s view, represents data in a
format that is meaningful to a user and to the software
programs that process that data.
• Entity-relationship diagram (ER图): Methodology for
documenting databases illustrating relationships between database
entities
• Normalization (范式): Process of creating small stable data
structures from complex groups of data
HUANG Lihua, Fudan University
Entity-relationship diagram
HUANG Lihua, Fudan University
Experiment: Microsoft Access
• Features:
– Create/Modify databases
– Specify/Run queries
– Design/Print reports
– Design graphical user interfaces around databases
• Forms for entering, viewing data
• Assignment: P136 APP. Exer 3
P.169 App. Exer 1
HUANG Lihua, Fudan University
Content
• Database Concepts & Technology
– Experiment: ACCESS
• Database Trends
• Data Trends of Application
– Data Warehouse
– OLAP
– DATA Mining
• Creating Database Environment
HUANG Lihua, Fudan University
2. Database Trends(1)
• The evolution of Database System
• Data
– Simple data => Multimedia data, Knowledge
• Model
– Relational model => OO model
Object relational model
HUANG Lihua, Fudan University
Databases Trends (2)
• Application
– OLTP => OLAP
• Data organization
– Database => Data warehouse, Data Marts
• Query language
– SQL => Deductive
HUANG Lihua, Fudan University
Emerging Database Models
The most common database models are:
 Multimedia database
 Deductive databases
 Object-oriented databases
 Multimedia and hypermedia
databases
Multidimensional Database
HUANG Lihua, Fudan University
Object-Oriented Database Model
• Object-oriented (OO) databases store both data
and procedures acting on the data, as objects.
• Encapsulation Capability
– The OO database can be particularly helpful in
multimedia environments, such as in manufacturing
sites using CAD/CAM.
– OO databases can be particularly useful in supporting
temporal and spatial(时空) dimensions.
• Terminology in the OO model includes:
– objects, attributes, classes, methods, and messages.
HUANG Lihua, Fudan University
Hypermedia Database Model
 The hypermedia database model stores
chunks of information in the form of
nodes connected by links established by
the user.
 The nodes can contain text, graphics,
sound, full-motion video, or executable
computer programs.
 Users can branch to related information in
any kind of relationship.
HUANG Lihua, Fudan University
A hypermedia database
HUANG Lihua, Fudan University
Multidimensional Database
• A variation of the relational model
• Use multidimensional structures to organize data and
express the relationship between data.
• A dimension of the data : a side of a cube.
• ①多维数组
– (北京,1999年,彩电,10000)
– (地理位置,年份,产品类型,销售额)
• ②维的层次
– 例如:年、季度、月份、日期
–
国家、地区、省、城市
• ③维内元素的类
– 例如: 按产品的价格分成高、中、低档。
HUANG
Lihua, Fudan University
–
按原材料的成本价格分类
Multidimensional data model
HUANG Lihua, Fudan University
Specialized Databases
• There are many specialized databases, depending
on the type or format of data stored.
– A geographical information database contains
locational data for overlaying on maps or images.
– A knowledge database stores decision rules used to
evaluate situations and help users make decisions like
an expert.
– A multimedia database stores data on many media—
sounds, video, images, graphic animation, and text.
HUANG Lihua, Fudan University
Content
• Database Concepts & Technology
– Experiment: ACCESS
• Database Trends
• Data Trends of Application
– Data Warehouse
– OLAP
– DATA Mining
• Creating Database Environment
HUANG Lihua, Fudan University
3. Data Trends of Application
• Data Warehouse
• OLAP
• DATA Mining
HUANG Lihua, Fudan University
From Database to Data
Warehousing
• 随着信息技术在企业的广泛应用,企业积累了
大量数据
• 企业所面对的问题不是简单地处理数据,而是
如何使用数据
----从操作处理(Operational Processing) 到分
析处理(Analytical Processing)
HUANG Lihua, Fudan University
Operational vs. Decision Support
Systems
• Operational Processing in Operational Systems
– Support day to day transactions
– Contain current, “up to date” data
– Examples: customer orders, inventory levels, bank
account balances
• Analytical Processing in Decision Support
Systems
– Support strategic decision making
– Contain historical, “summarized” data
– Examples: performance summary, customer
profitability, market segmentation
HUANG Lihua, Fudan University
Why data warehouses?
• Decision Support Data
– Are found in many different databases
• within the company
• outside the company
– In practical terms, locating and integrating all this
information in real time is very difficult
• Solution:
– Create separate repositories of data for decision
support
– => data warehouses
HUANG Lihua, Fudan University
Data Warehousing
• Stores current and historical data
– Consolidates data for management analysis and
decision making
– Supports reporting and query tools
• “数据仓库之父” W. H. Inmon给的定义:
数据仓库就是一个用以更好地支持企业或
组织的决策分析处理的,面向主题的,集
成的,不可更新的,随时间不断变化的数
据集合。
HUANG Lihua, Fudan University
Characteristics of Data Warehouses
1)
2)
Organization. Data are organized by detailed subjects.
Consistency. Data in different operational databases may be
encoded differently. In the warehouse they will be coded in a
consistent manner.
3)
Time variant. The data are kept for 5 to 10 years so they can be
used for trends, forecasting, and comparisons over time.
4)
Non-volatile. Once entered into the warehouse, data are not
updated.
5)
6)
Relational. The data warehouse uses a relational structure.
Client/server. The data warehouse uses the client/server to provide
the end user an easy access to its data.
HUANG Lihua, Fudan University
Data Warehouse Suitability
Data warehousing is most appropriate for
organizations in which some of the following
apply.
Large amounts of data need to be accessed by end-users.
The operational data are stored in different systems.
An information-based approach to management is in use.
There is a large, diverse customer base.
The same data are represented differently in different
systems.
 Data are stored in highly technical formats that are difficult to
decipher.
 Extensive end-user computing is performed.





HUANG Lihua, Fudan University
Comparison with DB Systems
• DB Systems
• DWS Systems
– 数据库:操作型数据
(Operational Data),增、
删、改操作频繁
– 数据库核心:功能强大,
面向OLTP (Online
Transaction Processing) 应用
– 数据库工具:以查询工
具为主
– 数据仓库:分析型数
据(Analytical Data),
极少有更新操作
– 数据仓库管理系统:
因极少有更新操作,
故功能简单
– 数据仓库工具:以分
析工具为主
HUANG Lihua, Fudan University
OLAP
在线分析处理
DSS
Data
Warehouse
决策支持系统
Data Mining
数据挖掘
OLTP
Database
在线业务处理
HUANG Lihua, Fudan University
数据仓库的层次结构
分产线的
月销售数据
高度
综合数据
分产线的
周销售数据
综合数据
汇总的销售数据
原始数据
销售数据
档案数据
HUANG Lihua, Fudan University
数据仓库的实现
Operational
Databases
Data Warehouse
主题1
主题数据
Internal
data
source
External
data
source
取
出
数
据
数
据
转
换
主题2
S
S
主题3
S
S
主题数据
主题数据
SS
Infomart
Infomart
Datamart
Datamart
Infomart(信息超市): 这是一个应用功能(或从应用功能产生的输出)可多次运行
数据仓库,它是对应业务问题的答案。
这是数据仓库的一个数据子集,它对应最终用户的信息需
HUANG Lihua,
Fudan University
Datamart(数据超市):
求。它比之数据仓库中的数据要更加归纳、汇总一些。
Data Marts
Data Marts are an alternative used by many other firms is
creation of a lower cost, scaled-down version of a data
warehouse. They refer to small warehouses that focus on
specific aspects of a company, such as for a strategic
business unit (SBU) or a department.
Two major types of Data Marts:
1) Replicated (dependent) Data Marts. In such cases one
can replicate functional subsets of the data warehouse in
smaller databases.
2) Stand-Alone Data Marts. A company can have one or
more independent data marts without having a data
warehouse.
HUANG Lihua, Fudan University
数据仓库的展现
Operational
Databases
主题1
主题数据
取
出
数
据
数
据
转
换
主题2
主题数据
主题3
主题数据
SS
S
S
S
S
Metadata: 关于数据的数据
HUANG Lihua, Fudan University
Infomart
查询
报表
Datamart
在线
分析
数据仓库的数据、信息流
运行
OLTP
数据
Operational
数据库 定义
Databases
数据仓库
主题1
主
题
表
数据确认
信息
新数据集
信息
数据转换
汇
总
表
数据整合
信息
数据汇总
数据取用
处理工具
在线分析
数据挖掘
再分析
新展示
数
据
市
场
HUANG Lihua, Fudan University
提取、转换、加载(ETL)
(Extract, Transform, Load)
• Extract
– Consolidate data from
several sources
• Transform
– Filter out unwanted data,
correct incorrect data,
convert to new data
elements, aggregate into
new data subsets
• Load
HUANG Lihua, Fudan University
– Load into data warehourse
• 数据清理过程
• 解决冲突
• 使用效率的考
虑
Example of DWS Application
1、贸易中心需求
贸易中心市内卷烟销售采用三级批的方式:
卷烟收购
价
格
销
售
分
析
贸易中心
收购价
1.
2.
3.
4.
5.
25个有限公司
调拨价
销售分析
资源投放分析
价格分析
有限公司效益预测
网络建设信息分析
HUANG Lihua, Fudan University
零售商
批发价
消费者
集市价
网络零售价
2、技术方案:
SAS+DWA+IntrNet的解决方案
2.1 数据仓库逻辑配置方案
“市内销售分析”主题(Subject_insale)中包含两个层次
(1) 详细数据(DETAIL TABLE):
(2) 数据集合(DATA GROUP):
1) 根据需求分析需建立五个数据集合:
2) 销售量分析集合(01_QuantyData),
3) 货源投放分析集合(02_DistributeData),
4) 价格分析集合(03_PriceData),
5) 效益分析集合(04_BenefitData),
6) 网络建设分析集合(05_SalenetData)
HUANG Lihua, Fudan University
数据仓库逻辑结构图
DW_tobacco_SH
Tobacco_DW
Global_subject
plan
prod
sale
store
Subject_insale
Subject_outsale
Detail Logical Table
(Detail Table 1)
…
(Detail Table n)
01_MarketData
02_DistributeData
03_PriceData
04_SubsaleData
ODD_information_pub
ODD_insale
HUANG Lihua, Fudan University
ODD_outsale
Detail Logical Table
(Detail Table 1)
….
(Detail Table n)
01_QuantyData
02_DistributeDat
a
03_PriceData
04_BenefitData
05_SalenetData
SAS/IntrNet配置方案:
SAS/IntrNet应用发布示意图
WEBPGM
(需开发部分)
DWA info
Browser
SAS appsrver
webserver
通用标准B/W平台
broker
SAS/IntrNet
HUANG Lihua, Fudan University
SAS/DWA
3、功能分析
销量分析 货源投放 价格分析 效益分析 网建客户信息分析 系统维护
销量
对比
比较
同比
销量
有限公司1
有限公司2
日期1 1999年04月01日
日期2 2000年02月28日
统计周期 月
销售性质 所有
牌号 301101 对照表
(产地/代码)
有限公司
所有
时间
销售区域
确定
重选
HUANG Lihua, Fudan University
比较复选框

有限公司
销售性质
关注品牌
牌号对照表

品牌产地
总计
沪产烟
代码输入:
外地烟
外烟
品牌号
总计
熊猫
中华
红双喜
。
。
。
HUANG Lihua, Fudan University
具体牌号
总计
翻盖中华
中华软壳
中华礼包
。
。
。
货源投放
控制牌号投放量分析(系数相关性)
1
销量分析
货源投放
投放量分析
起始日期
价格分析
效益分析
网建客户相关信息分析
(分析类别)
销量
1999 年 09 月
截止
截止日期
系统维护
集市价格
: 2000 年 02 月
牌号选择
库存
确定
重选
系数
相关性分析
99/09
99/10
99/11
99/12
00/01
截止
HUANG Lihua, Fudan University
00/02
2
投放进度分析
销量分析
货源投放
价格分析
效益分析
网建客户相关信息分析
系统维护
投放进度分析(分析类别)
起始日期
2000 年 01 月 01
销量
日
截止日期
截止
统计周期
牌号
(代码)
有限公司
2000 年 02 月 01 日
集市价格
周
对照表
投放
所有
有限公司库存
确定
重选
01/1
01/2
01/3
01/4
02/1
02/2
截止
HUANG Lihua, Fudan University
02/3
02/4
3.3 市内销售价格分析
1 价格分析
销量分析 货源投放
价格分析
起始日期
01 日
截止日期
截止
统计周期
价格分析 效益分析 网建客户相关信息分析 系统维护
(分析类别)
集市均价
1999 年 09 月
2000 年 02 月
月
牌号
集市价
(代码)
对照表
网络批发价
确定
重选
99/09
99/10
99/11
99/12
00/01
00/02
截止
HUANG Lihua, Fudan University
2 价量分析
销量分析 货源投放 价格分析 效益分析 网建客户相关信息分析 系统维护
价量分析
(分析类别)
集市价格
起始日期
1999 年 09 月
截止
截止日期
有限公司
牌号
2000 年 02 月
所有
条件选定的价格
对照表
(代码)
销售性质
统计周期
确定
周
重选
市内销量
99/09 99/10 99/11 99/12 00/01 00/02
截止
HUANG Lihua, Fudan University
复旦大学数据管理与数据分析应用
数据展现
主题数据分析层
ROLAP
主题数据层
主题数据库
共享数据层
业务数据层
共享数据库
人事数据库 学工数据库
人事子系统 学工子系统
●●●
教务数据库 科研数据库
●●●
教务子系统 科研子系统
HUANG Lihua, Fudan University
OLAP, DM, KDD
• ON-LINE ANALYTICAL PROCESSING (OLAP
在线分析处理): Tools for multi- dimensional
data analysis
• DATAMINING(DM, 数据挖掘): Tools for
finding hidden patterns, relationships, for
predicting trends
• Knowledge Discovery in Databases (KDD知识发
现):Tools for extracting useful knowledge
from volumes of data.
HUANG Lihua, Fudan University
Multidimensionality
• Modern data and information may have several dimensions.
– e.g. Management may be interested in examining sales figures in a
certain city by product, by time period, by salesperson, and by
store.
• It is important to provide the user with a technology that
allows him or her to add, replace, or change dimensions
quickly and easily in a table and/or graphical presentation.
• The technology of slicing(切片), dicing(切块),
Drilling (钻取)and similar manipulations is called
Multidimensionality.
HUANG Lihua, Fudan University
Multidimensionality and OLAP
Sale
Product
P1
P2
P3
P4
P5
Americas
Europe
Area
Asi
a
time
HUANG Lihua, Fudan University
Multidimensionality= Flexible Analysis
Time series analysis
Area Comparsion
Product comparison
Special Analysis
HUANG Lihua, Fudan University
Data Mining
• Exciting new set of tools for using data
warehouses
• Combination of AI and statistical
analysis to discover information that is
“hidden” in the data
• OLAP versus Data Mining
HUANG Lihua, Fudan University
Data Mining Analysis
Method
• associations
– e.g. linking purchase of pizza with beer
• sequences
– e.g. tying events together: marriage and purchase
of furniture
• classifications
– e.g. recognizing patterns such as the attributes of
customers that are most likely to quit
• Forecasting
– e.g. predicting buying habits of customers based
on past patterns
HUANG Lihua, Fudan University
Applications of Data Mining
Data Mining is currently being used in the following areas;
 Retailing & Sales
 Banking
 Manufacturing & Production
 Brokerage & Securities
trading
 Computer hardware &
software







Insurance
Policework
Government & Defense
Airlines
Health care
Broadcasting
Marketing
HUANG Lihua, Fudan University
Knowledge Discovery in Databases
(KDD)
• KDD is the process of extracting useful knowledge from
volumes of data.
• It is the subject of extensive research.
• KDD’s objective is to identify valid, novel, potentially
useful, and ultimately understandable patterns in data.
• KDD is useful because it is supported by three
technologies that are now sufficiently mature:
– Massive data collection
– Powerful multiprocessor computers
– Data mining algorithms
HUANG Lihua, Fudan University
Evolution of KDD
Stages in the Evolution of Knowledge Discovery
Evolutionary Stage
Business Question
Enabling Technologies
Characteristics
Data Collection
(1960s)
What was my total
revenue in the last five
years?
What were unit sales in
New England last March?
Computer, tapes, disks.
Retrospective, static
data delivery
Relational databases
(RDBMS), structured query
language (SQL)
Online analytic processing
(OLAP), multidimensional
databases, data warehouses
Advanced algorithms,
multiprocessor computers,
massive databases
Retrospective,
dynamic data delivery
at record level
Retrospective,
dynamic data delivery
at multiple levels
Prospective,
proactive information
delivery
Data Access (1980s)
Data Warehousing &
Decision Support
(early 1990s)
Intelligent Data
Mining (late 1990s)
Drill down to Boston?
What’s likely to happen to
Boston unit sales next
month? Why?
Source: Courtesy of Accrue Software.
HUANG Lihua, Fudan University
Content
• Database Concepts & Technology
– Experiment: ACCESS
• Database Trends
• Data Trends of Application
– Data Warehouse
– OLAP
– DATA Mining
• Creating Database Environment
HUANG Lihua, Fudan University
Creating a Database Environment
HUANG Lihua, Fudan University
Key organizational elements in the
database environment
Database
Technology and
management
Data
Administration
Database
Management
Systems
Data Planning
And Modeling
Methodology
Users
HUANG Lihua, Fudan University
Data Administration
• Data Administration
–
–
–
–
Develop information policy
Define information requirements
Plan for data
Oversee logical database design and database
dictionary development
– Monitor use of information
• Database administrator & database analyst
HUANG Lihua, Fudan University
Key organizational elements in the
database environment
Database
Technology and
management
Data
Administration
Database
Management
Systems
Data Planning
And Modeling
Methodology
Users
HUANG Lihua, Fudan University
Management Requirements for
Database Systems
• Data Planning and Modeling Methodology
– Enterprise-wide planning for data
– Identify key entities, attributes, and relationships that constitute the
organization’s data
• Data Planning process
– Data planning
• Develop a model of business processes
– Requirements Specification
• Define information needs of end users in a business process
– Conceptual design
• Expresses all information requirements in the form of high-level model (ERM)
– Logical design
• Translates the conceptual model into the data model of a DBMS
– Physical Design
• Determines the data storage structure and access method
HUANG Lihua, Fudan University
Procurement Process
When purchasing department wrote a purchase order, it sent a
Purchasing copy to accounts payables.
Vendor
The vendor sent an invoice to accounts payables when they
delivered goods.
Receiving
When material control received the goods, it sent a copy of
the receiving document.
Account
Payables
Thus, the account payables received three documents from various
senders:Purchasing order, invoice and receiving document. It was
up to account payables to match the purchasing against the receiving
document and the invoice. If they matched, the department issued
payment. Otherwise, an accounts payable clerk would investigate the
discrepancy, hold up payment, generate documents and all in all gum
up the works. HUANG Lihua, Fudan University
Entity-relationship diagram
HUANG Lihua, Fudan University
Main data table in Procurement
Process
HUANG Lihua, Fudan University
Key organizational elements in the
database environment
Database
Technology and
management
Data
Administration
Database
Management
Systems
Data Planning
And Modeling
Methodology
Users
HUANG Lihua, Fudan University
Management Requirements for
Database Systems
• Database Technology, Management, and
Users
– Databases require DBMS software and staff
– Database design group defines and organizes
structure and content of database
– Database administration: establish physical
database, logical relations, access rules
HUANG Lihua, Fudan University
Wrap-up
• Database Management Systems
– Concept, importance
– Be able to use simplest ACCESS
• Major types of database
– Structure model:
• Hierchical, Network, Rational, Multidimensional, OO
– Data type:
• Operational database ~analytical database ~Deductive Database
• Simple database ~ Multimedia Database ~ Hypermedia Database
– Store type
• Centralized Database ~ Distributed Database
• Database development and access
• Data warehouse
• OLAP, Data mining
HUANG Lihua, Fudan University
Assignment for Session 8-9
•
Individual Review for session 8
–
–
•
Reading Materials: Textbook: chapter 5
P. 168: Review Quiz: write down in your book.
Individual Assignment :
–
–
–
P.136: Application exercises 3
P. 169: Application exercises 1
submit to Vcampus in two weeks
HUANG Lihua, Fudan University
Preparation for session 10
• Individual Preparation for session 10
– Reading Materials: Textbook: chapter 6
– P. 206: Review Quiz: write down in your book.
HUANG Lihua, Fudan University