Introduction to Data Analysis and Mining by Laura Jordana


Introduction to Data
Analysis and Mining
By Laura Jordana
Decision-Support Systems



Database applications can be classified as
either transaction-processing or decision-support systems.
Transaction-processing systems are
extensively used today: bank transactions,
online sale transactions, etc.
These systems generate a large amount of
information.
Decision-Support Systems (cont.)


Decision-support systems attempt to
extract useful information from the
generated information in order to make
business decisions.
For example, they can analyze customer
behavior to help managers decide what
products to stock in a store or what
market to advertise their products to.
OLAP




Many decision-support queries can be written in
SQL. However, others cannot, or cannot be
expressed easily.
Extensions are available to make data analysis
easier.
OLAP (Online Analytical Processing) consists of
tools for data analysis.
Examples: statistical analyses such as finding
percentiles and cumulative distributions.
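The two statistics named above can be illustrated with a minimal Python sketch. It is not an OLAP tool; the `sales` figures are made up, and the nearest-rank percentile definition is an assumption (other definitions interpolate between values).

```python
import math
from bisect import bisect_right


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need non-empty data and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def cumulative_distribution(values):
    """Return (value, fraction of data <= value) pairs in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]


# Hypothetical sale amounts, for illustration only.
sales = [120, 80, 200, 150, 80, 300, 95, 180]

median_sale = percentile(sales, 50)
p90_sale = percentile(sales, 90)
cdf = cumulative_distribution(sales)

print("median:", median_sale)
print("90th percentile:", p90_sale)
for value, fraction in cdf:
    print(f"{fraction:.3f} of sales are <= {value}")
```

Queries like these are awkward in basic SQL, which is the motivation for the analysis extensions mentioned above.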
Data Warehousing





A data warehouse is an archive of information
gathered from multiple data sources.
A company may have different databases for
different purposes.
These databases might only contain current
data.
The purpose of the data warehouse is to store
ALL the data for a long time.
With all the data in one warehouse, decision-support queries
are easier to write, and online transaction-processing systems
are not affected by the additional query workload.
Components of a Data
Warehouse
Issues



When and how to gather data – source-driven (data sources send new data to the warehouse)
or destination-driven (the warehouse sends requests
for new data)
Schema to be used – Different data sources
are likely to have different schemas
Data transformation and cleansing –
correcting minor errors such as a street name
being spelled incorrectly
Issues (cont.)


Propagating updates – how to update
the data warehouse when an update
occurs at the data source
Summarizing data – the warehouse may not
need, or have room to store, all of the
raw data
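Three of the issues above (differing schemas, cleansing, and summarizing) can be sketched in a few lines of Python. Everything here is hypothetical: the two source layouts, the field names, the `STREET_FIXES` lookup, and the sample rows are assumptions made for illustration, not part of any real warehouse product.

```python
from collections import defaultdict

# Hypothetical rows from two data sources with different schemas.
store_db = [
    {"cust": "Ann", "street": "Mian St", "amount": 20.0},
    {"cust": "Bob", "street": "Oak Ave", "amount": 5.0},
]
web_db = [
    {"customer_name": "Ann", "addr_street": "Main St", "total": 15.5},
]

# Mapping from the unified warehouse schema to each source's field names.
STORE_MAPPING = {"customer": "cust", "street": "street", "amount": "amount"}
WEB_MAPPING = {"customer": "customer_name", "street": "addr_street", "amount": "total"}

# Assumed lookup table of known misspellings (data cleansing).
STREET_FIXES = {"Mian St": "Main St"}


def to_warehouse_row(row, mapping):
    """Translate a source row into the unified schema and cleanse the street."""
    unified = {target: row[source] for target, source in mapping.items()}
    unified["street"] = STREET_FIXES.get(unified["street"], unified["street"])
    return unified


def summarize(rows):
    """Keep only total spending per customer instead of every raw row."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


rows = [to_warehouse_row(r, STORE_MAPPING) for r in store_db]
rows += [to_warehouse_row(r, WEB_MAPPING) for r in web_db]
summary = summarize(rows)

print(rows[0])
print(summary)
```

The sketch loads each source in one batch; whether that load is triggered by the source or requested by the warehouse is the source-driven versus destination-driven choice described above.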
Data Mining



The process of analyzing large databases
to find useful patterns.
Data mining attempts to discover rules
and patterns from data.
Also called “knowledge discovery”.
Knowledge Discovery



A rule can be the result of knowledge
discovery.
For example: “Young women with annual
incomes above a certain threshold are most likely
to buy small sports cars.”
These rules are not universally true, and
have degrees of “support” and
“confidence”.
Applications of Knowledge
Discovery



Predictions: For example, a credit-card company may
want to predict a person’s credit risk based on known
factors.
Associations: Suggesting books to a customer who has
purchased books at an online bookstore, or suggesting
accessories to go with an item.
Real-World Example: The National Basketball
Association uses a data-mining application in conjunction
with video recordings of basketball games to analyze
plays and discover interesting patterns in game data.
(Source: http://citeseer.ist.psu.edu/cachedpage/421882/1)
Classification




Items belong to one of several classes.
The problem is to predict what class a new item
belongs to (e.g. predicting a person’s credit risk).
Attributes of the item are used to predict its
class (e.g. age, education, annual income,
current debts).
A decision tree is one way to perform
classification.
Decision-Tree



A decision tree has leaf nodes that
represent classes.
Each internal node is associated with a
predicate or function which is used to
determine which child to traverse to.
Basically, a decision tree is a flow chart of
if-then scenarios.
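The structure described above can be sketched in Python as leaf nodes holding a class and internal nodes holding a predicate. The tree itself is invented for illustration: the attributes, thresholds, and class labels in `credit_tree` are assumptions, and in practice a tree would be learned from training data rather than written by hand.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # the class this leaf represents


@dataclass
class Node:
    predicate: Callable[[dict], bool]  # decides which child to traverse to
    if_true: "Tree"
    if_false: "Tree"


Tree = Union[Leaf, Node]


def classify(tree, item):
    """Walk from the root to a leaf, following each node's predicate."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.predicate(item) else tree.if_false
    return tree.label


# Hypothetical hand-built credit-risk tree.
credit_tree = Node(
    lambda p: p["income"] >= 50000,
    Node(lambda p: p["debts"] > 20000, Leaf("average"), Leaf("good")),
    Node(lambda p: p["education"] == "graduate", Leaf("average"), Leaf("bad")),
)

applicant = {"income": 60000, "debts": 5000, "education": "bachelor"}
print(classify(credit_tree, applicant))
```

Each root-to-leaf path reads as one if-then rule, for example: if income is at least 50000 and debts are at most 20000, then the risk class is "good".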
Association



Association is a topic of interest particularly in
the retail industry.
Companies are interested in the associations
among different items that people purchase.
For example: Someone who buys bread will
probably buy milk. Someone who bought a book
on PHP is likely to purchase a book on MySQL.
Association Rules






bread => milk
PHP => MySQL
As mentioned before, rules have degrees of “support”
and “confidence”.
Support measures what percentage of the population
satisfies both sides of the rule (e.g. what percentage of
all purchases include both milk and bread).
Confidence measures how often the population
satisfies the right-hand side of the rule when the left-hand
side is true (e.g. what percentage of the purchases
that include bread also include milk).
Note: The confidence of bread => milk can differ from that of
milk => bread, although both rules have the same support.
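The definitions of support and confidence can be checked with a small Python sketch. The five `purchases` are made-up data, and the helper names `support` and `confidence` are chosen here for illustration.

```python
def support(transactions, lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    both = lhs | rhs
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0
    return sum(rhs <= t for t in with_lhs) / len(with_lhs)


# Hypothetical purchases, each a set of items bought together.
purchases = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs"},
    {"bread"},
]

bread, milk = {"bread"}, {"milk"}

sup = support(purchases, bread, milk)                  # same for both directions
conf_bread_milk = confidence(purchases, bread, milk)   # bread => milk
conf_milk_bread = confidence(purchases, milk, bread)   # milk => bread

print(f"support: {sup:.2f}")
print(f"confidence bread => milk: {conf_bread_milk:.2f}")
print(f"confidence milk => bread: {conf_milk_bread:.2f}")
```

In this data, two of the five purchases contain both items, so both rules share a support of 0.4. Bread appears in three purchases and milk in four, so the confidence is 2/3 for bread => milk but 2/4 for milk => bread, which is the asymmetry the note describes.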
Other Types Of Mining


Text mining – uses data mining techniques
on text documents
Data visualization – helps users observe
patterns visually
References


http://www.purdue.edu/UNS/html4ever/2004/041018.Caruthers.discover.html
A. Silberschatz, H.F. Korth, S. Sudarshan:
Database System Concepts, 5th ed.,
McGraw-Hill, 2006