Data Mining

By Archana Ketkar
What Is Data Mining?
Data mining is the process of sorting through large amounts of
data and picking out relevant information.
In other words…
 Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown
    and potentially useful) patterns or knowledge from huge
    amounts of data
 Other names
  - Knowledge discovery (mining) in databases (KDD), knowledge
    extraction, data/pattern analysis, data archeology, data
    dredging, information harvesting, business intelligence, etc.
Some Definitions
 Data: Data are any facts, numbers, or text that can
be processed by a computer. This includes:
  - operational or transactional data, such as sales, cost,
    inventory, payroll, and accounting
  - nonoperational data, such as industry sales, forecast
    data, and macroeconomic data
  - metadata: data about the data itself, such as logical
    database design or data dictionary definitions
 Information: The patterns, associations, or
relationships among all this data can provide
information.
Definitions continued…
 Knowledge: Information can be converted into
knowledge about historical patterns and future
trends. For example, summary information on retail
supermarket sales can be analyzed in terms of
promotional efforts to provide knowledge of
consumer buying behavior. Thus, a manufacturer or
retailer could determine which items are most
susceptible to promotional efforts.
 Data Warehouses: Data warehousing is defined as a
process of centralized data management and
retrieval.
Data Warehouse example
Data Rich, Information Poor
Data Mining process
Knowledge discovery from data
The KDD process includes
 data cleaning (to remove noise and inconsistent data)
 data integration (where multiple data sources may be
combined)
 data selection (where data relevant to the analysis task are
retrieved from the database)
 data transformation (where data are transformed or consolidated
into forms appropriate for mining by performing summary or
aggregation operations)
KDD continued…
 data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
 pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
 knowledge presentation (where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user)
Data mining is the core of the knowledge discovery process.
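The KDD steps listed above can be sketched end to end on toy data. This is only a minimal illustration under stated assumptions: the two record sources, the field names, the threshold, and every helper function are hypothetical, and the "mining" step is a trivial stand-in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw sources (names and fields are assumptions for illustration).
store_sales = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c2", "item": "software", "amount": None},  # noisy record
    {"customer": "c3", "item": "computer", "amount": 1100.0},
]
online_sales = [
    {"customer": "c1", "item": "software", "amount": 120.0},
    {"customer": "c4", "item": "printer", "amount": 200.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate total spend per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Data mining (trivial stand-in): rank items by total spend."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(ranked, min_total):
    """Pattern evaluation: keep items whose total passes a threshold."""
    return [(item, total) for item, total in ranked if total >= min_total]

integrated = integrate(clean(store_sales), clean(online_sales))
relevant = select(integrated, {"computer", "software"})
patterns = evaluate(mine(transform(relevant)), min_total=100.0)

# Knowledge presentation: show the surviving patterns to the user.
for item, total in patterns:
    print(f"{item}: {total:.2f}")
```

Each function corresponds to one KDD step, so a step can be swapped for a real implementation without touching the others.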
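Knowledge Discovery (KDD) Process
[Figure: the KDD process as a pipeline. Databases feed Data Cleaning and Data Integration into a Data Warehouse; Selection yields Task-relevant Data; Data Mining follows, then Pattern Evaluation. Caption: data mining is the core of the knowledge discovery process.]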
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, and Other Disciplines.]
Functionalities/Techniques:
 Concept/Class Description: Characterization
and Discrimination
 Mining Frequent Patterns, Associations and
Correlations
 Classification and Prediction
 Cluster Analysis
 Outlier Analysis
 Evolution Analysis
Concept/Class Description:
Characterization and Discrimination
 Data Characterization: A summarization of the
general characteristics or features of a target
class of data. For instance, a data mining system
should be able to produce a description
summarizing the characteristics of customers.
 Example: Describe the characteristics of customers
who spend more than $1000 a year at AllElectronics
(an example store). The result can be
a general profile covering attributes such as age,
employment status, or credit rating.
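The AllElectronics characterization example can be sketched as a filter followed by a summary. The customer records, field names, and the `characterize` helper below are hypothetical; a real system would summarize many more attributes.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; fields and values are illustrative only.
customers = [
    {"age": 34, "employment": "employed", "credit": "excellent", "annual_spend": 1500},
    {"age": 45, "employment": "employed", "credit": "good", "annual_spend": 2200},
    {"age": 23, "employment": "student", "credit": "fair", "annual_spend": 300},
    {"age": 39, "employment": "employed", "credit": "excellent", "annual_spend": 1250},
]

def most_common(values):
    """Return the most frequent value in an iterable."""
    return Counter(values).most_common(1)[0][0]

def characterize(records, min_spend):
    """Summarize the target class: customers spending more than min_spend."""
    target = [r for r in records if r["annual_spend"] > min_spend]
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "typical_employment": most_common(r["employment"] for r in target),
        "typical_credit": most_common(r["credit"] for r in target),
    }

profile = characterize(customers, min_spend=1000)
print(profile)
```

The sketch assumes at least one customer passes the threshold; with an empty target class, `mean` raises an error.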
Characterization and Discrimination
continued…
 Data Discrimination: A comparison of the
general features of target class data
objects with the general features of objects
from one or a set of contrasting classes. The user
can specify the target and contrasting classes.
 Example: The user may want to compare the
general features of software products whose
sales increased by 10% in the last year with
those whose sales decreased by about 30%
over the same period.
Mining Frequent Patterns,
Associations and Correlations
Frequent Patterns: as the name suggests, patterns that occur
frequently in data.
Association Analysis: from a marketing perspective, determining
which items are frequently purchased together within the same
transaction.
Example: a rule mined from the AllElectronics
transactional database:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%,
confidence = 50%]
 X represents a customer.
 Confidence = 50% means that if a customer buys a computer, there is a
50% chance that he or she will buy software as well.
 Support = 1% means that 1% of all the transactions under
analysis showed that computer and software were purchased
together.
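Support and confidence for a rule A ⇒ B can be computed directly from a list of transactions: support is the fraction of all transactions containing both A and B, and confidence is that fraction divided by the fraction containing A. A minimal sketch follows; the transactions are made up, so the resulting figures differ from the 1% and 50% quoted above.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"paper"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here 2 of 8 transactions contain both items and 4 of 8 contain a computer, so the rule has 25% support and 50% confidence.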
Mining Frequent Patterns,
Associations and Correlations
 Another example:
 age(X, “20…29”) ^ income(X, “20K-29K”) ⇒
buys(X, “CD player”) [support = 2%,
confidence = 60%]
 For customers aged 20 to 29 with an income of
$20,000 to $29,000, there is a 60% chance that
they will purchase a CD player, and 2% of all the
transactions under analysis showed customers in
this age group and income range buying a CD player.
Classification and Prediction
 Classification is the process of finding a
model that describes and distinguishes data
classes or concepts for the purpose of being
able to use the model to predict the class of
objects whose class label is unknown.
 A classification model can be represented in
various forms, such as
 IF-THEN Rules
 A decision tree
 Neural network
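As an illustration of the IF-THEN form, the sketch below applies a small rule set to objects whose class label is unknown. The attributes, thresholds, and class labels are hypothetical and hand-written; in practice such rules would be learned from labeled training data.

```python
# Hypothetical IF-THEN rules; attributes, thresholds, and labels are illustrative.
def classify(customer):
    """Return a predicted class label for one customer record."""
    # IF age < 30 AND income < 30000 THEN class = "budget"
    if customer["age"] < 30 and customer["income"] < 30000:
        return "budget"
    # IF income >= 60000 THEN class = "premium"
    if customer["income"] >= 60000:
        return "premium"
    # Default rule, used when no other rule fires
    return "standard"

# Objects whose class label is unknown.
unlabeled = [
    {"age": 24, "income": 25000},
    {"age": 41, "income": 82000},
    {"age": 35, "income": 45000},
]
predictions = [classify(c) for c in unlabeled]
print(predictions)
```

The same rules could be drawn as a decision tree, with each test as an internal node and each class label as a leaf.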
Classification Model
Cluster Analysis
 Clustering analyses data objects without
consulting a known class label.
 Example: Cluster analysis can be performed
on AllElectronics customer data in order to
identify homogeneous subpopulations of
customers. These clusters may represent
individual target groups for marketing. The
figure on the next slide shows a 2-D plot of
customers with respect to customer locations
in a city.
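Clustering customer locations can be sketched with k-means, used here as one common clustering method; the slides do not name a specific algorithm. The coordinates and the hand-picked starting centroids are hypothetical, and no class labels are consulted.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customer locations (x, y) on a city grid.
locations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
# Starting centroids are chosen by hand here; real runs usually pick them randomly.
centroids, clusters = kmeans(locations, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With this data the six customers split into two groups of three, each of which could be treated as a separate target group for marketing.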
Cluster Analysis
[Figure: 2-D plot of customer data with respect to customer locations in a city.]
Outlier Analysis
 Outlier Analysis: A database may contain data
objects that do not comply with the general behavior
or model of the data. These data objects are outliers.
 Example: Detecting fraudulent use of credit
cards. Outlier analysis may uncover fraudulent
usage of credit cards by detecting purchases of
extremely large amounts for a given account number
in comparison to regular charges incurred by the
same account. Outlier values may also be detected
with respect to the location and type of purchase or
the purchase frequency.
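The credit card example can be sketched with a simple statistical test: flag any new charge that lies far from the account's regular charges. The charge history, the three-standard-deviation threshold, and the `flag_outliers` helper are assumptions for illustration; production fraud detection uses far richer models.

```python
from statistics import mean, stdev

def flag_outliers(history, new_charges, threshold=3.0):
    """Flag charges more than `threshold` standard deviations from the account's usual charges."""
    mu = mean(history)
    sigma = stdev(history)
    return [c for c in new_charges if abs(c - mu) > threshold * sigma]

# Hypothetical regular charges for one account, in dollars.
history = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 44.0, 58.0]
new_charges = [49.0, 2500.0, 63.0]

suspicious = flag_outliers(history, new_charges)
print(suspicious)
```

Only the $2500 charge is flagged; the other two fall within the account's normal range.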
Evolution Analysis
 Evolution Analysis: Data evolution analysis describes
and models regularities or trends for objects whose
behavior changes over time.
 Example: Time-series data. Suppose stock market data
(time series) for the last several years are available from
the New York Stock Exchange, and one would like to
invest in shares of high-tech industrial companies. A
data mining study of stock exchange data may
identify stock evolution regularities for overall stocks
and for the stocks of particular companies. Such
regularities may help predict future trends in stock
market prices, contributing to one’s decision making
regarding stock investments.
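One very simple way to model a trend in time-series data is a least-squares line through the values. The sketch below uses hypothetical yearly prices, and the one-step extrapolation is a naive illustration, not a reliable way to predict stock prices.

```python
def linear_trend(values):
    """Least-squares slope and intercept of values against time steps 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    covariance = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    return slope, v_mean - slope * t_mean

# Hypothetical yearly closing prices for one stock.
prices = [100.0, 104.0, 109.0, 113.0, 120.0]

slope, intercept = linear_trend(prices)
forecast = intercept + slope * len(prices)  # naive extrapolation one year ahead
print(f"slope = {slope:.2f} per year, next-year forecast = {forecast:.2f}")
```

A positive slope indicates an upward regularity over the period; the function needs at least two values to fit a line.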
Thank you!
