Data Mining Engineering

Data Mining - Introduction
Peter Brezany
Institut für Scientific Computing
Universität Wien
Tel. 4277 38825
Office hours: Tue, 13.00-14.00
P.Brezany
Institut für Softwarewissenschaften – Universität Wien
Outline
Business Intelligence and its components
Knowledge discovery in databases
Data mining techniques
- description
- classification
- prediction
- clustering
- neural networks
Commercial data mining systems (Demo of the SAS
Enterprise Miner) ?
Data warehousing
Data webhousing
Advanced topics: parallel and distributed data analysis
Literature
Mark and Mary Whitehorn: Business Intelligence:
The IBM Solution. Springer-Verlag, 2000.
R. Kimball: The Data Warehouse Toolkit. John Wiley, 1996.
J. Han, M. Kamber: Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers, 2000.
M. Ester, J. Sander: Knowledge Discovery in Databases.
Springer-Verlag, 2000.
I.H. Witten, E. Frank: Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations.
Morgan Kaufmann Publishers, 2000.
Business Intelligence
Definition:
Business Intelligence is an umbrella term, broadly covering the
processes involved in extracting valuable business information
and knowledge from the mass of data that exists within a typical
enterprise.
Business Intelligence Tools
• Data warehouses
• OLAP (On-Line Analytical Processing) tools
• Data mining tools
• Text mining tools
(slide annotation: the focus of our lectures)
• Web mining tools
• Data joiners (integrators)
• Business Intelligence portals, etc.
Business Intelligence Tools (cont.)
• Data warehouse - a repository of multiple heterogeneous data
sources, organized under a unified schema at a single site in order
to facilitate management decision making.
• OLAP – analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view
information from different angles.
• Data mining – extracting or "mining" knowledge from large data sets.
• Text mining – "mining" large textual (document) databases.
• Web mining – discovering knowledge from hypertext data.
• Data joiner – a tool for working with data from disparate, heterogeneous data
sources.
• Business Intelligence portal – a Web site designed to be the first
point of entry for visitors to information about a company. With the help
of the portal's personalization functions, users can choose the information sources they need for a specific task.
DATA MINING
Introduction
• This lecture covers what has come to be known as data mining
and knowledge discovery in large databases, data warehouses, and
other massive information repositories.
• Data mining emerged during the late 1980s, made great strides
during the late 1990s, and is expected to continue to flourish in
the near future.
• We introduce interesting data mining techniques and
systems, and discuss applications and research
directions.
• Data mining can be viewed as a result of the natural
evolution of information technology, including
database technology, artificial intelligence, machine
learning, neural networks, statistics, pattern
recognition, knowledge-based systems, high-performance
computing, and data visualization.
What Motivated Data Mining? Why
Is It Important?
• Huge amounts of data are widely available, and there is an
imminent need to turn such data into useful information and
knowledge.
• Applications range from business management,
production control, and market analysis to
engineering design and science exploration.
Motivation
[Figure: a "data and data exploration cloud" fed by business, medicine, scientific experiments, simulations, and Earth observations]
CERN's Challenge
• Starting point
– New detector LHC
» Large Hadron Collider, 14 TeV
» Goals: Search for Higgs Boson and
Graviton (and others)
– Start 2006
• Challenges
– Data are accessed worldwide
» CERN and Regional Centers (Europe, Asia, America)
» 2000 users
– Huge data volumes
– Data semantics
– Performance and throughput
The LHC Detectors
CMS
ATLAS
LHCb
Multi-Tier Model
The Evolution of Database Technology
Data Collection and Database Creation (1960s and earlier)
- Primitive file processing
Database Management Systems (1970s-early 1980s)
- Hierarchical, network and relational DB systems
- Query languages (SQL, etc), query optimization
- Transaction management, concurrency control, recovery
- Data modeling tools
Advanced Database Systems (mid-1980s-present)
- Object-oriented, object-relational, spatial, multimedia, ...
Web-based Database Systems (1990s-present)
- XML-based DB systems
- Web mining
Data Warehousing and Data Mining (late 1980s-present)
- Data warehouse and OLAP technology
- Data mining and knowledge discovery
Database Querying and Data Mining
Query languages like SQL are standardized and powerful, but they are too
difficult for unskilled users.
OLAP tools allow flexible multidimensional queries. Their methods are query-centric.
[Figure: a data warehouse accessed through query languages like SQL, OLAP tools, and data mining tools]
We Are Data Rich, But Information Poor
So, What Is Data Mining?
Data mining – searching for knowledge (interesting patterns)
in your data.
Data Mining As a Step in the
Process of Knowledge Discovery
• Many people treat data mining as a synonym for the
term Knowledge Discovery in Databases, or KDD.
• Alternative view: data mining as a step in KDD:
– 1. Data cleaning (to remove noise and inconsistent data)
– 2. Data integration (where multiple data sources may be combined)
– 3. Data selection (where data relevant to the analysis task are
retrieved from the database)
– 4. Data transformation (where data are transformed or consolidated
into forms appropriate for mining by performing summary or
aggregation operations, for instance)
– 5. Data mining (an essential process where intelligent methods are
applied in order to extract patterns)
– 6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
– 7. Knowledge presentation to the user
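The seven steps above can be sketched as a small pipeline. This is only an illustration under assumptions: the record layout, the function names, and the 50% interestingness threshold are hypothetical and do not come from the lecture, and counting item frequencies stands in for a real mining method.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing (noisy) value.
SOURCE_A = [{"cust": 1, "item": "computer"}, {"cust": 2, "item": None}]
SOURCE_B = [{"cust": 3, "item": "computer"}, {"cust": 4, "item": "printer"}]


def clean(records):
    # 1. Data cleaning: drop records that contain missing values.
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    # 2. Data integration: combine several sources into one list.
    return [r for source in sources for r in source]


def select(records, field):
    # 3. Data selection: keep only the attribute relevant to the task.
    return [r[field] for r in records]


def transform(values):
    # 4. Data transformation: aggregate the values into counts.
    return Counter(values)


def mine(counts, total):
    # 5. Data mining: derive candidate patterns (item -> relative frequency).
    return {item: n / total for item, n in counts.items()}


def evaluate(patterns, min_frequency):
    # 6. Pattern evaluation: keep patterns above an interestingness threshold.
    return {p: f for p, f in patterns.items() if f >= min_frequency}


def run_kdd(min_frequency=0.5):
    records = integrate(clean(SOURCE_A), clean(SOURCE_B))
    items = select(records, "item")
    patterns = mine(transform(items), len(items))
    return evaluate(patterns, min_frequency)


if __name__ == "__main__":
    # 7. Knowledge presentation: show the surviving patterns to the user.
    for item, frequency in run_kdd().items():
        print(f"{item}: {frequency:.0%} of purchases")
```

Each function maps to one numbered step, so a step can be swapped out (for example, a different mining method in step 5) without touching the others.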
Data Mining in Knowledge Discovery
Architecture of a Data Mining System
[Figure: layered architecture of a data mining system. Top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration, and filtering; database and data warehouse. A knowledge base is shown next to the pattern evaluation and data mining engine layers.]
Architecture of a Data Mining System (2)
Database, data warehouse, or other information repository:
One or a set of databases, data warehouses, spreadsheets, etc.
Database or data warehouse server: responsible for fetching
the relevant data, based on the user’s data mining request.
Knowledge base: domain knowledge that is used to guide the
search, or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, used to organize attribute values into different levels of abstraction.
Data mining engine: essential to the data mining system; ideally
consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
Architecture of a Data Mining System (3)
Pattern evaluation module: This component typically employs
interestingness measures and interacts with the data mining modules so
as to focus the search towards interesting patterns. It may use
interestingness thresholds to filter out discovered patterns.
Graphical user interface: This module communicates between
users and the data mining system allowing the user
• to specify a data mining query or task
• provide information to help focus the search
• perform exploratory data mining based on the intermediate
data mining results
• browse database and data warehouse schemas or data structures
• evaluate mined patterns
• visualize the patterns in different forms.
Stages of a Data Exploration Project
Stage                              Time to complete     Importance to success
                                   (percent of total)   (percent of total)
1. Exploring the problem                  10                   15
2. Exploring the solution                  9                   14
3. Implementation specification            1                   51
   Subtotal (stages 1-3)                  20                   80
4. Knowledge discovery
   a. Data preparation                    60                   15
   b. Data surveying                      15                    3
   c. Data modeling                        5                    2
   Subtotal (stage 4)                     80                   20
Based on: Data Preparation for Data Mining, by Dorian Pyle, Morgan Kaufmann
Relational Database
• A database system, also called a database management system
(DBMS), consists of a collection of interrelated data, known as a
database, and a set of software programs to manage and access
the data.
• A relational database is a collection of tables, each of which is
assigned a unique name. Each table consists of a set of attributes
(columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple represents an object identified by a
unique key.
• Relational data can be accessed by database queries written in a
relational query language, such as SQL.
• Using data mining, one can search for trends or data patterns in
relational databases.
Relational Databases – Example
The AllElectronics company is described by the following
tables: customer, item, employee, and branch. Fragments of
these tables are shown on the next slide; the attribute that
represents the key or a composite key component is underlined.
• The relation customer consists of a set of attributes, including a unique customer identity number (cust_ID), and so on.
• Tables can also be used to represent the relationships between or among multiple relational tables. E.g., these include
purchases (customer purchases items, creating a sales transaction that is handled by an employee), items_sold (lists the
items sold in the given transaction), and works_at (employee
works at a branch of AllElectronics).
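The tables described above can be tried out with Python's built-in sqlite3 module. This is a simplified sketch: only cust_ID and the table names come from the slide, while the other column names, the sample rows, and the helper items_bought_by are assumptions made for illustration (the employee and branch tables are left out).

```python
import sqlite3


def build_db():
    """Create an in-memory database with a few AllElectronics-style tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (cust_ID TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE item (item_ID TEXT PRIMARY KEY, name TEXT);
        -- Relationship table: which customer made which transaction.
        CREATE TABLE purchases (trans_ID TEXT PRIMARY KEY, cust_ID TEXT);
        -- Relationship table with a composite key: items in a transaction.
        CREATE TABLE items_sold (
            trans_ID TEXT, item_ID TEXT, PRIMARY KEY (trans_ID, item_ID)
        );
        INSERT INTO customer VALUES ('C1', 'Smith'), ('C2', 'Jones');
        INSERT INTO item VALUES ('I1', 'computer'), ('I3', 'printer');
        INSERT INTO purchases VALUES ('T100', 'C1');
        INSERT INTO items_sold VALUES ('T100', 'I1'), ('T100', 'I3');
        """
    )
    return conn


def items_bought_by(conn, cust_id):
    """Return the names of all items a customer purchased, sorted by name."""
    rows = conn.execute(
        """
        SELECT item.name
        FROM purchases
        JOIN items_sold ON items_sold.trans_ID = purchases.trans_ID
        JOIN item ON item.item_ID = items_sold.item_ID
        WHERE purchases.cust_ID = ?
        ORDER BY item.name
        """,
        (cust_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(items_bought_by(build_db(), "C1"))
```

The query answers a retrieval question ("what did this customer buy?") by joining the relationship tables; it does not discover patterns, which is the role of data mining.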
Fragments of Relations from AllElectronics DB
Data Warehouses
A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and which
usually resides at a single site.
Data warehouses are constructed via a process of data cleaning,
data transformation, data integration, data loading and periodic
data refreshing.
The figure on the next slide shows the basic architecture of a data
warehouse for AllElectronics.
In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer,
item, supplier, and activity. The data are stored from a historical perspective and are typically summarized.
Architecture of a Data Warehouse
[Figure: data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools. Load = periodic data refreshing.]
Modeling a Data Warehouse
A data warehouse is usually modeled by a multidimensional
database structure, where each dimension corresponds to an
attribute in the schema and each cell stores the value of some
aggregate measure, such as count or sales_amount.
The actual physical structure of a data warehouse may be a
relational data store or a multidimensional data cube. It
provides a multidimensional view of data and allows the
precomputation and fast accessing of summarized data.
Example: A data cube for summarized sales data of
AllElectronics is presented in the next slide.
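A minimal sketch of the idea, assuming three dimensions (quarter, city, item type) and the measure sales_amount. The sales facts and the function names are hypothetical; a real warehouse would use a relational or multidimensional store rather than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, city, item_type, sales_amount).
FACTS = [
    ("Q1", "Vancouver", "computer", 400),
    ("Q1", "Vancouver", "computer", 200),
    ("Q1", "Toronto", "computer", 300),
    ("Q2", "Vancouver", "phone", 100),
]


def build_cube(facts):
    """Precompute one aggregate cell per (quarter, city, item_type)."""
    cube = defaultdict(int)
    for quarter, city, item_type, amount in facts:
        cube[(quarter, city, item_type)] += amount
    return dict(cube)


def slice_total(cube, dimension, value):
    """Sum sales_amount over all cells whose given dimension equals value."""
    return sum(v for key, v in cube.items() if key[dimension] == value)


if __name__ == "__main__":
    cube = build_cube(FACTS)
    print(cube[("Q1", "Vancouver", "computer")])  # one precomputed cell
    print(slice_total(cube, 1, "Vancouver"))      # view by city
    print(slice_total(cube, 0, "Q1"))             # view by quarter
```

Because the cells are aggregated once, later questions along any dimension only read summarized values instead of rescanning the raw facts.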
A Multidimensional Data Cube
Modeling a Data Warehouse (2)
Data warehouse vs. data mart: A data warehouse collects
information about subjects that span an entire organization,
and thus its scope is enterprise-wide. A data mart is
department-wide in scope.
Data warehouse systems are well suited for On-Line Analytical
Processing, or OLAP.
OLAP operations allow the presentation of data at different
levels of abstraction.
Examples of OLAP operations include drill-down and roll-up,
which allow the user to view the data at different degrees of
summarization as illustrated in the previous slide.
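Roll-up and drill-down can be sketched with a concept hierarchy that maps cities to countries. The sales figures and the function names below are assumptions made for illustration; only the city names appear elsewhere in the lecture.

```python
from collections import defaultdict

# Concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

# Hypothetical summarized sales per (quarter, city).
SALES_BY_CITY = {
    ("Q1", "Chicago"): 440,
    ("Q1", "New York"): 560,
    ("Q1", "Toronto"): 395,
    ("Q1", "Vancouver"): 605,
}


def roll_up(cells, hierarchy):
    """Climb the hierarchy: aggregate city-level cells to country level."""
    totals = defaultdict(int)
    for (quarter, city), amount in cells.items():
        totals[(quarter, hierarchy[city])] += amount
    return dict(totals)


def drill_down(cells, hierarchy, quarter, country):
    """Step back down: show the city-level cells behind one country cell."""
    return {
        key: amount
        for key, amount in cells.items()
        if key[0] == quarter and hierarchy[key[1]] == country
    }


if __name__ == "__main__":
    print(roll_up(SALES_BY_CITY, CITY_TO_COUNTRY))
    print(drill_down(SALES_BY_CITY, CITY_TO_COUNTRY, "Q1", "Canada"))
```

Roll-up produces a more summarized view, and drill-down returns to the finer-grained cells that make up a summarized value.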
Transactional Databases
A transactional database consists of a file where each record
represents a transaction.
A transaction includes a unique transaction identity number
(trans_id), and a list of the items making up the transaction
(such as items purchased in a store).
The transactional database may have additional tables associated
with it, which contain other information regarding the sale, such
as the date of the transaction, the customer ID number, the ID
number of the salesperson, etc.
Example: Transactions can be stored in a table, with one
record per transaction. A fragment of a transactional database
for AllElectronics is shown in the next slide.
Transactional Databases (2)
Trans_id    list of item_IDs
T100        I1, I3, I8, I16
...         ...
The transactional database is usually either stored in a flat file
in a format similar to that of the above table, or unfolded into
a standard relation in a format similar to that of the
items_sold table (see Relational Databases – Example).
A regular data retrieval system is not able to answer queries
like "Which items sold well together?"
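The difference between the flat format and the unfolded relation can be shown in a few lines. The function name unfold and the tuple layout are assumptions for illustration; the sample row is the one from the table above.

```python
# Flat-file format: one record per transaction, items as a comma-separated list.
FLAT_ROWS = [("T100", "I1, I3, I8, I16")]


def unfold(flat_rows):
    """Turn (trans_id, item list) records into (trans_id, item_id) rows."""
    pairs = []
    for trans_id, item_list in flat_rows:
        for item_id in item_list.split(","):
            pairs.append((trans_id, item_id.strip()))
    return pairs


if __name__ == "__main__":
    for row in unfold(FLAT_ROWS):
        print(row)
```

The unfolded form has one row per item sold, which is the shape a relational table such as items_sold expects.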
Advanced Database Systems and
Database Applications
Relational DB systems have been widely used in business applications.
The new database applications include handling
• spatial data (e.g. maps)
• engineering design data (e.g., the design of buildings or
integrated circuits)
• hypertext and multimedia data (text, image, video, audio data)
• time-related data (e.g. stock exchange data)
• World Wide Web (a huge, widely distributed information repository made available by the Internet)
Data Mining Tasks
Data Mining Functionalities - What
Kinds of Patterns Can Be Mined?
• Data mining functionalities are used to specify the kind
of patterns that can be found in data mining tasks.
• Data mining tasks can be classified into two categories:
– Descriptive - they characterize the general properties of the data in
the database.
– Predictive - they perform inference on the current data in order to
make predictions.
• In some cases, users may have no idea which kinds of
patterns may be interesting, so a system should be able to search
for several different kinds of patterns in parallel.
• Data mining systems should be able to discover
patterns at various granularities (abstraction levels).
• Users should be able to specify hints to guide or focus the search.
Association Analysis
• Association analysis is the discovery of association rules
showing attribute-value conditions that occur frequently
in a given set of data.
• The association rule X => Y is interpreted as “database
tuples that satisfy the conditions in X are also likely to
satisfy the conditions in Y.”
• Example: A data mining system may find in AllElectronics:
age(X, "20..29") and income(X, "20K..29K") => buys(X, "CD player")
[support = 2%, confidence = 60%]
• X is a variable representing a customer. The rule indicates
that of the customers under study, 2% are 20 to 29 years
of age with an income of 20K to 29K and have purchased a
CD player. There is a 60% probability that a customer in
this age and income group will purchase a CD player.
Association Analysis (Cont.)
• We would like to determine which items are frequently
purchased together within the same transactions. E.g.,
contains(T, “computer”) => contains(T, “software”)
[support = 1%, confidence = 50%]
• Explanation: if a transaction, T, contains “computer”,
there is a 50% chance that it contains “software” as
well, and 1% of all of the transactions contain both.
• This rule involves a single attribute or predicate (i.e.,
contains), so it is a single-dimensional association rule. It can
be written simply as "computer => software [1%, 50%]".
Remark: The rule on the previous slide is a multi-dimensional association rule.
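Support and confidence, as explained above, are simple ratios over the transactions. The sketch below computes them for one given rule; the four sample transactions and the function name are hypothetical, and finding which rules to test in the first place is a separate problem that this sketch does not address.

```python
# Hypothetical transactions, each a set of purchased items.
TRANSACTIONS = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]


def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = share of all transactions containing both sides
    confidence = share of antecedent transactions that also contain
                 the consequent
    """
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


if __name__ == "__main__":
    s, c = support_confidence(TRANSACTIONS, {"computer"}, {"software"})
    print(f"computer => software [support = {s:.0%}, confidence = {c:.0%}]")
```

For the sample data, one of four transactions contains both items and one of the two computer transactions also contains software.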
Classification and Prediction
• Classification is the process of finding a set of models
(or functions) that describe and distinguish data
classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose
class label is unknown. The derived model is based on
the analysis of a set of training data (i.e., data objects whose
class label is known).
• “How is the derived model presented?”
– Classification (IF-THEN) rules
– Mathematical formulae
– Decision tree - it is a flow-chart-like tree structure, where each
node denotes a test on an attribute value, each branch represents an
outcome of the test, and the tree leaves represent classes or class
distributions.
– Neural networks - a collection of neuron-like processing units with
weighted connections between the units.
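Two of the presentation forms listed above, the decision tree and IF-THEN rules, can be related in a short sketch. The tree below is a hand-written, hypothetical example (attributes age, student, and credit_rating are assumptions); the code shows how a finished model is represented and applied, not how it is learned from training data.

```python
# Internal node: (attribute, {outcome: subtree}); leaf: a class label string.
TREE = (
    "age",
    {
        "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
        "middle_aged": "buys",
        "senior": ("credit_rating", {"fair": "buys", "excellent": "does_not_buy"}),
    },
)


def classify(tree, obj):
    """Follow the tests from the root until a leaf (class label) is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[obj[attribute]]
    return tree


def to_rules(tree, conditions=()):
    """Convert every root-to-leaf path into one IF-THEN classification rule."""
    if isinstance(tree, str):
        tests = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {tests} THEN class = {tree}"]
    attribute, branches = tree
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules


if __name__ == "__main__":
    print(classify(TREE, {"age": "youth", "student": "yes"}))
    for rule in to_rules(TREE):
        print(rule)
```

Each path through the tree corresponds to exactly one rule, which is why the two forms can describe the same model.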
Classification and Prediction (Cont.)
• Prediction - in many applications, users may wish to
predict some missing or unavailable data values rather
than class labels. The predicted values are usually
numerical data.
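One common way to predict a numerical value is a least-squares line fitted to known (x, y) pairs. The slide does not name a method, so linear regression is an assumption here, as are the sample data and the function names.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the numerical value for an unseen x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical training data that lies exactly on y = 2x + 1.
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(predict(model, 5))
```

Unlike classification, the output is a number on a continuous scale rather than one of a fixed set of class labels.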
Cluster Analysis
• Clustering analyzes data objects without consulting a
known class label.
• Clustering can be used to generate such labels.
• The objects are clustered or grouped based on the
principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
• Each cluster can be viewed as a class of objects, from
which rules can be derived.
• Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing. (The figure
on the next slide shows a 2-D plot of customers with respect to
customer locations in a city.)
Cluster Analysis - Example
A 2-D plot of customer data with respect to customer locations
in a city, showing 3 data clusters. Each cluster "center" is marked with a "+".
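Cluster centers like those marked with "+" can be computed, for example, with k-means. The slide does not say which algorithm produced the plot, so k-means is an assumption, and the customer locations and starting centers below are hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        centers = new_centers
    return centers, clusters


# Hypothetical customer locations forming three groups in a city.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]

if __name__ == "__main__":
    centers, clusters = kmeans(POINTS, [(1, 1), (8, 8), (1, 8)])
    for center, members in zip(centers, clusters):
        print(center, members)
```

No class labels are supplied; the groups emerge only from the distances between points, which is the sense in which clustering works without consulting a known class label.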
Outlier Analysis
• A database may contain data objects that do not
comply with the general behaviour or model of the
data. These data objects are outliers.
• Most data mining methods discard outliers as noise or
exceptions.
• In some applications, such as fraud detection, the rare
events can be more interesting than the more regularly
occurring ones.
• Example: Outlier analysis may uncover fraudulent usage
of credit cards by detecting purchases of extremely
large amounts for a given account number in comparison
to regular charges incurred by the same account.
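The credit card example can be sketched by comparing each charge with the typical charges of the same account. The median-based score and the threshold of 5 are assumptions chosen for illustration, not a method given in the lecture, and the charges are made up.

```python
from statistics import median


def find_outliers(charges, threshold=5.0):
    """Flag charges that lie far from the account's median charge.

    Distance is measured in units of the median absolute deviation (MAD),
    which is itself barely affected by a few extreme values.
    """
    center = median(charges)
    mad = median(abs(c - center) for c in charges)
    if mad == 0:
        return []  # all charges (nearly) identical: nothing to compare against
    return [c for c in charges if abs(c - center) / mad > threshold]


if __name__ == "__main__":
    # Hypothetical charges for one account; one is extremely large.
    print(find_outliers([20, 35, 25, 30, 40, 5000]))
```

A median-based score is used instead of the mean and standard deviation because a single huge charge would inflate both and could hide itself.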
Evolution Analysis
• Evolution analysis describes and models regularities or trends for
objects whose behavior changes over time.
• It includes time-series data analysis.
• Example: Suppose that we have the major stock market
(time-series) data of the last several years available
from the New York Stock Exchange and we would like
to invest in shares of high-tech industrial companies. A
data mining study of stock exchange data may identify
stock evolution regularities for overall stocks and for
the stocks of particular companies. Such regularities
may help predict future trends in stock market prices.
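A very simple form of time-series analysis is smoothing prices with a moving average and reading off the direction of the smoothed series. This is only a sketch: the prices, the window size, and the function names are hypothetical, and real trend analysis of stock data is considerably more involved.

```python
def moving_average(prices, window):
    """Average each run of `window` consecutive prices."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


def trend(prices, window=3):
    """Compare the first and last smoothed values: 'up', 'down', or 'flat'.

    Assumes the series contains at least `window` prices.
    """
    smoothed = moving_average(prices, window)
    if smoothed[-1] > smoothed[0]:
        return "up"
    if smoothed[-1] < smoothed[0]:
        return "down"
    return "flat"


if __name__ == "__main__":
    # Hypothetical closing prices of one stock.
    prices = [10, 11, 12, 11, 13, 14, 15]
    print(moving_average(prices, 3))
    print(trend(prices))
```

Smoothing suppresses day-to-day fluctuations so that a longer-term regularity becomes visible.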
Questions
Business Intelligence
What do you understand by the term business
intelligence?
Characterize the main business intelligence tools.
Data mining – introduction
What is data mining?
What motivates data mining?
Architecture of a typical data mining system
Basic data mining functionalities
Interestingness of patterns
Data warehousing
What is a data warehouse?
Architecture of a data warehouse