Transcript Data Mining

Data Mining
157A, Fall Semester 2006
Brent Turner
Presentation Contents:
1. What Is Data Mining
2. Data Mining Ideas
3. The DM Process
4. Advantages and Problems in DM
5. Example 1 – web searches
6. Example 2 – buying habits
7. Example 3 – basketball stats
8. References
What is DM
The DM process
1. Data gathering
2. Data cleansing: eliminate errors and/or bogus data
3. Feature extraction: obtain only the interesting attributes of the data
4. Pattern extraction and discovery
5. Visualization of the data
6. Evaluation of results
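The six steps above can be sketched end to end on a toy data set. This is only an illustrative sketch: the records, field names, and the helper functions cleanse, extract_features, and discover_patterns are all hypothetical, and counting item frequencies stands in for a real pattern-discovery algorithm.

```python
from collections import Counter

# Step 1. Data gathering: hypothetical raw purchase records (invented for illustration).
raw_records = [
    {"customer": "c1", "age": 34, "item": "DVD"},
    {"customer": "c2", "age": -5, "item": "DVD"},   # bogus age
    {"customer": "c3", "age": 51, "item": None},    # missing item
    {"customer": "c4", "age": 29, "item": "Book"},
    {"customer": "c5", "age": 42, "item": "DVD"},
]

def cleanse(records):
    """Step 2. Drop records with a missing item or an impossible age."""
    return [r for r in records if r["item"] is not None and 0 <= r["age"] <= 120]

def extract_features(records):
    """Step 3. Keep only the attribute of interest (the purchased item)."""
    return [r["item"] for r in records]

def discover_patterns(items):
    """Step 4. A very simple 'pattern': how often each item occurs."""
    return Counter(items)

clean = cleanse(raw_records)
patterns = discover_patterns(extract_features(clean))

# Steps 5 and 6. A text 'visualization' plus a basic check of each item's share.
for item, count in patterns.most_common():
    print(f"{item:<5} {'#' * count} ({count / len(clean):.0%})")
```

In practice each step would be far more involved; the point is only that every stage consumes the output of the one before it.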
Data Mining Ideas
Search the dataspace for a new “golden” relationship.
• Brute force:
With only 40 data items there are 2^40 = 1,099,511,627,776 (about 1.1 trillion) possible item combinations (subsets) to look at; pairs alone account for just 780 of them.
• Smarter search:
Infer or guess relationships based on other known data (association rules; causality; frequent item sets).
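The size of the brute-force search space can be checked with a few lines of arithmetic. The sketch below assumes the slide's 40 items and uses a hypothetical helper, candidate_sets_up_to, to show how much smaller the space becomes when only small item sets are considered.

```python
from math import comb

n_items = 40

# Every item is either in or out of a candidate set, so there are 2^n subsets.
all_subsets = 2 ** n_items

# Restricting the search to pairs gives "n choose 2" candidates.
pairs_only = comb(n_items, 2)

def candidate_sets_up_to(n, max_size):
    """Number of non-empty item sets with at most max_size items (hypothetical helper)."""
    return sum(comb(n, k) for k in range(1, max_size + 1))

print(all_subsets)                       # 1099511627776
print(pairs_only)                        # 780
print(candidate_sets_up_to(n_items, 3))  # 10700
```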
Advantages of Data Mining
• Provides new knowledge from existing data
  • Public databases
  • Government sources
  • Company databases
• Old data can be used to develop new knowledge
• New knowledge can be used to improve services or products
• Improvements lead to:
  • Bigger profits
  • More efficient service
Some problems to consider in DM
• Privacy – data dealing with personal information (e.g. medical history) may need to be kept private from employers, insurance companies, etc.
• Legality – can DM be used to screen out high-risk persons or help prosecute a crime?
• Ethics – should we create software that can be used in unethical ways? What should be done with the new knowledge?
Example 1 – Web Search
a. PageRank, for discovering the most “important” pages on the Web, as used in Google.
b. Hubs and authorities, a more detailed evaluation of the importance of Web pages using a variant of the eigenvector calculation used for PageRank.
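The eigenvector calculation behind PageRank can be approximated by repeated redistribution of rank along links (power iteration). The following is a minimal sketch, not Google's implementation: the four-page toy_web graph is invented, the damping factor of 0.85 is a commonly quoted value assumed here, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every
    target is assumed to also be a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: C is linked to by A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph C ends up with the highest score because the most pages link to it, and D ends up lowest because nothing links to it.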
Example 2 – Buying habits
Historic data might identify that customers who purchase the Gladiator DVD and the Patriot DVD also purchase the Braveheart DVD.
The historic data might indicate that the first two DVDs are purchased by only 5% of all customers. But 70% of these then also purchase Braveheart.
Support = 5% of customers bought both Gladiator and Patriot
Confidence = 70% of those customers also bought Braveheart
Conclusion:
Use real-time web advertising to get more sales.
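The support and confidence figures above can be reproduced from raw transactions. This sketch uses an invented data set of 400 baskets sized to match the slide's 5% and 70%; the "Other DVD" filler item and the helper names count_containing, support, and confidence are hypothetical. Support follows the slide's usage, the share of customers who bought the first two DVDs.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def support(transactions, items):
    """Fraction of all transactions that contain every item in items."""
    return count_containing(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / count_containing(transactions, antecedent)

# Invented baskets: 20 of 400 contain both DVDs, and 14 of those 20 add Braveheart.
transactions = (
    [{"Gladiator", "Patriot", "Braveheart"}] * 14
    + [{"Gladiator", "Patriot"}] * 6
    + [{"Other DVD"}] * 380
)

s = support(transactions, {"Gladiator", "Patriot"})
c = confidence(transactions, {"Gladiator", "Patriot"}, {"Braveheart"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

A rule like this is usually considered actionable only when both numbers clear thresholds chosen by the analyst; the thresholds themselves are a judgment call.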
Example 3 – basketball stats
In one application, IBM's Advanced Scout was developed to identify different strategies employed by basketball players in the NBA.
Pippen
Discoveries include the observation that Scottie Pippen's favorite move on the left block is a right-handed hook to the middle.
Harper
And when guard Ron Harper penetrates the lane, he shoots the ball 83% of the time.
Jordan
Also, it was noticed that 17% of Michael Jordan's offence comes on isolation plays, during which he tends to take two or three dribbles before pulling up for a jumper.
References
1) “Data Mining”, Oo, Aung, 2005; at www.cs.sjsu.edu/faculty/lee/cs157, accessed 11-29-2006.
2) “Data Mining Lecture Notes”, Ullman, Jeffrey D., at infolab.stanford.edu/~ullman/mining, accessed 11-29-2006.
3) “DATA MINING Desktop Survival Guide”, Williams, Graham, at www.togaware.com/datamining/survivor, accessed 11-29-2006.
4) Pinker, Steven, at pinker.wjh.harvard.edu, accessed 11-27-2006.
5) Photographs at www.nba.com, accessed 11-29-2006.