DATA MINING
Prof. Sin-Min Lee
Surya Bhagvat
CS 157B – Spring 2006
Making sense out of data
As hard drive prices have fallen, the amount of data that
corporations store in their databases has increased
dramatically.
Raw data sitting in a database is of no use unless someone
makes sense of it. For example, one could store a decade of
customer data, but for that data to become useful one needs to
find the patterns in it that identify customer behavior.
Would SQL solve the above problem?
Traditional SQL and Analytics
Traditional SQL is useful for running very large queries, and one
could argue that SQL is all that is necessary to get the
information.
This argument holds for small sets of data, but when a query is
run against a huge database that stores terabytes of data, the
performance of SQL degrades.
Also, identifying patterns in the data is not always feasible with
traditional SQL querying. This is where the field of Analytics
comes into play.
Analytics
Analytics is basically identifying patterns of data in order to make
better decisions.
For example, if you maintain a commercial e-commerce web site,
one thing you want to know is your visitors' behavior patterns,
such as which search engine they came from, how they go about
searching for items on your site, and so on.
Basically, what we are trying to do here is identify patterns of
customer behavior that would be useful later on to target a
particular customer with promotional offers.
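A minimal sketch of this kind of analysis, assuming a hypothetical visit log of (visitor, referrer URL) pairs. The log format and the name referrer_counts are illustrative assumptions, not part of any analytics product.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical visit log: (visitor id, referrer URL) pairs.
visits = [
    ("v1", "http://www.google.com/search?q=ipod"),
    ("v2", "http://search.yahoo.com/search?p=ipod+nano"),
    ("v3", "http://www.google.com/search?q=mp3+player"),
    ("v4", "http://www.google.com/search?q=ipod+dock"),
]


def referrer_counts(visits):
    """Count visits per referring host (for example, per search engine)."""
    return Counter(urlparse(referrer).netloc for _, referrer in visits)


counts = referrer_counts(visits)
print(counts.most_common())  # [('www.google.com', 3), ('search.yahoo.com', 1)]
```

Counting by referring host is only the simplest possible pattern; the same log could be grouped by search term or by landing page in the same way.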
Analytics (Continued….)
Google recently made Google Analytics available for free.
The URL for this site is
http://www.google.com/analytics/feature_fast.html
Right now one needs to sign up for an invitation. Once it is
accepted, all one needs to do is include the Google Analytics
tracking code in one's web site to start monitoring customer
behavior.
Transactional Systems
In transactional systems the information about day-to-day
transactions is stored.
For example, retail stores like Safeway record each transaction
at the time the purchase is made.
Identifying patterns on transactional systems is relatively hard
because the data stored in these systems usually runs to
terabytes, and a SQL query performed across such a huge
database may bring the whole system down.
So what’s the alternative?
Decision support Systems
For decision-making activities, such as determining patterns or
running complex SQL queries, a separate database or system is
usually maintained. Such systems are known as Decision Support
systems.
High-level data is pulled out of the transactional systems and
stored in these databases for analytics or data mining.
The downside is that the data may not be real time. However, a
service could be written that runs in the background and
updates the decision support system in real time.
Decision support systems (contd…)
Decision support systems can be classified into three kinds:
statistical analysis, OLAP (On-line Analytical Processing) and
data warehouses.
If detailed statistical analysis of data needs to be performed,
SQL is very limited and one needs to turn to commercial
packages like SAS. Further information can be found at
http://www.sas.com/technologies/analytics/statistics/index.html?sgc=u
Decision support systems (contd….)
OLAP provides very fast access to data.
The data from the RDBMS is gathered and placed into
multidimensional cubes, which are then made available to the
users.
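The idea of a multidimensional cube can be sketched as a table of precomputed totals, one for every combination of dimension values, with "*" standing for a rolled-up dimension. The sales rows, the dimension names and build_cube below are illustrative assumptions; real OLAP products store and index cubes very differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows pulled from an RDBMS: (region, product, quarter, amount).
rows = [
    ("West", "iPod", "Q1", 200.0),
    ("West", "iPod", "Q2", 150.0),
    ("East", "iPod", "Q1", 100.0),
    ("East", "Dock", "Q1", 40.0),
]
DIMENSIONS = ("region", "product", "quarter")


def build_cube(rows):
    """Precompute the total amount for every subset of the dimensions.

    Each key is a tuple with one slot per dimension; "*" marks a
    dimension that has been aggregated away.
    """
    cube = defaultdict(float)
    n = len(DIMENSIONS)
    for *dims, amount in rows:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                key = tuple(dims[i] if i in kept else "*" for i in range(n))
                cube[key] += amount
    return cube


cube = build_cube(rows)
print(cube[("West", "*", "*")])   # 350.0, all sales in the West region
print(cube[("*", "iPod", "Q1")])  # 300.0, iPod sales in Q1 across regions
print(cube[("*", "*", "*")])      # 490.0, grand total
```

Because every total is computed once up front, each lookup is a single dictionary access, which is the source of the fast access mentioned above.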
Cognos PowerPlay is the best-selling OLAP product. The link to
this product is
http://www.cognos.com/products/business_intelligence/analysis/
Data warehousing
The third kind of decision support system is the data warehouse.
Data mining is usually performed on data warehouses.
The data in an enterprise is usually stored in various
transactional systems or databases. For example, some data
might be stored in an Oracle database, other data might be
stored in DB2 or Teradata, and in some systems it may just be
stored in text files or Excel files.
Combining all this data to look for patterns is very difficult,
so the disparate data from the various sources is pulled
together to form a data warehouse.
Data warehousing (Contd…)
The steps involved in building a data warehouse include:
1) Getting the raw data from different sources and storing it as
is in a temporary staging area. Typically ETL (extract,
transform, load) tools are used for this process.
2) The data from the temporary staging area is then cleansed
and various business rules are applied to load the data into
the actual data warehouse tables.
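The two steps above can be sketched as follows. The CSV extracts, the column names and the business rules (reject rows with a non-numeric amount, keep one row per customer) are assumptions made up for illustration; a real ETL tool would be configured rather than hand-coded.

```python
import csv
import io

# Hypothetical raw extracts from two different sources.
ORACLE_EXTRACT = "cust_id,name,amount\n1, Alice ,10.50\n2,Bob,not_a_number\n"
TEXT_FILE_EXTRACT = "cust_id,name,amount\n3,carol,7\n1, Alice ,10.50\n"


def extract(sources):
    """Step 1: copy the raw rows, as is, into a temporary staging list."""
    staging = []
    for text in sources:
        staging.extend(csv.DictReader(io.StringIO(text)))
    return staging


def transform_and_load(staging):
    """Step 2: cleanse staged rows, apply business rules, load the warehouse."""
    warehouse = {}
    rejected = []
    for row in staging:
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)  # assumed rule: amount must be numeric
            continue
        cust_id = int(row["cust_id"])
        name = row["name"].strip().title()  # cleanse whitespace and casing
        # Assumed rule: one row per customer, later duplicates are skipped.
        warehouse.setdefault(cust_id, {"name": name, "amount": amount})
    return warehouse, rejected


staging = extract([ORACLE_EXTRACT, TEXT_FILE_EXTRACT])
warehouse, rejected = transform_and_load(staging)
print(sorted(warehouse))  # [1, 3]
print(len(rejected))      # 1
```

Keeping the staging data untouched until the second step makes it possible to rerun the cleansing rules without going back to the source systems.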
Predictive analytics and Data Mining
Data Mining is about finding the patterns in data and is essentially
used for predicting customer behavior.
For example, Data Mining could be used to predict, based on
customer complaints, whether a customer is going to move to a
competitor.
Applications of Data Mining are varied, ranging from CRM to
earthquake prediction.
Predictive analytics and Data Mining (contd…)
Predictive analytics is based on a predictor, a single value.
Predictive analytics is used extensively in CRM applications.
A predictor for a customer could be how recently that customer
made a purchase.
For example, if you are calling customers about promotions, then
based on this predictor you would call the most recent customer
first, followed by customers who purchased items about a month ago.
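As a small sketch of using this predictor, the customers below are ordered by their last purchase date, most recent first. The customer names, the dates and the function call_order are hypothetical.

```python
from datetime import date

# Hypothetical customers with the date of their most recent purchase.
customers = [
    ("Ann", date(2006, 3, 1)),
    ("Raj", date(2006, 4, 20)),
    ("Lee", date(2006, 1, 15)),
]


def call_order(customers):
    """Return customer names ordered by the recency predictor, newest first."""
    ranked = sorted(customers, key=lambda c: c[1], reverse=True)
    return [name for name, _ in ranked]


print(call_order(customers))  # ['Raj', 'Ann', 'Lee']
```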
Procedures in Data Mining
The key procedures used in Data Mining include:
1) Association rules
2) Classification
3) Clustering
Association rules
Association rules have an associated population which consists of a
set of instances.
For example, if one buys an iPod from Amazon.com, the products
associated with it are the iPod accessories that Amazon displays
alongside it, such as the Apple iPod Nano Armband Grey, the
Apple iPod Nano Dock and the Apple iPod Nano Lanyard Headphones.
The measures of an association rule are Support and Confidence.
Association rules
Support: a measure of what fraction of the population satisfies
both the antecedent and the consequent of the rule. For example,
if the support for iPod=>DVD player is 0.001 percent, very few
purchases include both items, so the support is very low.
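Confidence: a measure of how often the consequent is true
when the antecedent is true. For example, the confidence of the
rule iPod=>Apple iPod Nano Armband Grey might be, say, 80
percent, meaning that 80 percent of the purchases that include
an iPod also include the armband.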
Support and Confidence examples
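As a worked example, support and confidence can be computed directly from a list of purchases. The five transactions below are made up for illustration, so the resulting numbers differ from the percentages quoted earlier; the function names are assumptions as well.

```python
# Hypothetical purchase transactions, each a set of items bought together.
transactions = [
    {"iPod", "armband"},
    {"iPod", "armband", "dock"},
    {"iPod", "dock"},
    {"iPod", "armband"},
    {"DVD player"},
]


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)


print(support(transactions, {"iPod"}, {"armband"}))      # 0.6
print(confidence(transactions, {"iPod"}, {"armband"}))   # 0.75
print(support(transactions, {"iPod"}, {"DVD player"}))   # 0.0
```

Here three of the five purchases contain both an iPod and an armband (support 0.6), and three of the four iPod purchases contain an armband (confidence 0.75).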
Classification
A popular way to classify items is with decision tree
classifiers. In the example, the person's degree is masters and
the income is 40K. Starting from the root, we follow the edge
labeled masters and then the edge labeled 25K to 75K to reach a
leaf. The class at the leaf is "good", so we predict that the
credit risk of that person is good.
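The traversal described above can be sketched with a small lookup structure. Only the path masters, 25K to 75K, "good" comes from the example; every other branch and leaf in TREE, and the exact income boundaries, are assumptions added so that the code runs.

```python
def income_band(income):
    """Map an income to the label of an edge in the tree (assumed boundaries)."""
    if income < 25_000:
        return "<25K"
    if income <= 75_000:
        return "25K to 75K"
    return ">75K"


# First level: degree. Second level: income band. Leaves: credit risk class.
# Only masters / 25K to 75K -> good is taken from the example; the other
# leaves are hypothetical placeholders.
TREE = {
    "bachelors": {"<25K": "bad", "25K to 75K": "average", ">75K": "good"},
    "masters": {"<25K": "bad", "25K to 75K": "good", ">75K": "excellent"},
}


def predict_credit_risk(degree, income):
    """Follow the degree edge, then the income edge, and return the leaf class."""
    return TREE[degree][income_band(income)]


print(predict_credit_risk("masters", 40_000))  # good
```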
Clustering
Clustering is about grouping similar data into clusters.
The degree of association is strong between members of the same
cluster and weak between members of different clusters.
Clustering is based on distance measures such as Euclidean or
probabilistic distance. K-means is one of the best-known
clustering algorithms.
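A minimal sketch of K-means with Euclidean distance, assuming the initial centroids are supplied by the caller (real implementations usually pick them randomly or with a seeding heuristic). The sample points are made up, and math.dist requires Python 3.8 or later.

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Repeatedly assign points to the nearest centroid and recompute centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

Points in the same cluster end up close to a shared centroid, while the two centroids are far apart, which matches the strong and weak association described above.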
Resources
A. Silberschatz, H.F. Korth, S. Sudarshan, Database System Concepts, 5th Ed., McGraw-Hill, 2006
http://www.google.com/analytics/feature_fast.html
http://www.sas.com/technologies/analytics/statistics/index.html?sgc=u
http://www.cognos.com/products/business_intelligence/analysis/