Transcript Slides

INFORMATION
TECHNOLOGY IN
BUSINESS AND
SOCIETY
SESSION 17 – ADVANCED SQL + DATA MINING
SEAN J. TAYLOR
ADMINISTRATIVIA
• Assignment 3:
New drop for any updates related to A3
• Assignment 4:
Due Sunday 4/1 (this is an extension)
• Class participation grading.
MIDTERM REVIEW PROCESS
• Consult the solutions (posted to BB).
• Photocopy the page(s) of your exam that
you wish to dispute.
• Write why you think you deserve points.
• Submit to my mailbox on the 8th floor by
Thursday 3/29 (or after class).
LEARNING OBJECTIVES
1. Be able to write more advanced queries.
2. Learn about the data-driven organization and
the data revolution in management.
3. Know the basic problems data mining attempts
to solve.
REVIEW: SQL
SELECT ISBN, BookName, Price, Publisher
FROM Book
WHERE
BookName like '*Information Systems*'
AND PubDate > #1/1/2002#
AND Price < 100
ORDER BY Price
REVIEW: GROUP BY … HAVING
Use “Having” clause to filter aggregation result
SELECT Publisher, COUNT(*) FROM Book
GROUP BY Publisher
Having Count(*) > 2
Use “where” clause to filter records to be aggregated
SELECT Publisher, COUNT(*) as total FROM Book
Where Price < 100
GROUP BY Publisher
Having Count(*) > 10
Order by Count(*)
MULTIPLE GROUP BY FIELDS
SELECT Publisher, Author, AVG(Price) as
AvgPrice
FROM Book
GROUP BY Publisher, Author;
GROUPING WITH A JOIN
SELECT Publisher, Count(*) as NumOrders
FROM Book, Orders
WHERE Book.ISBN = Orders.ISBN
GROUP BY Publisher;
GROUPING WITH A JOIN 2
SELECT Publisher, Orders.CustomerID,
Sum(price) as TotalPaid
FROM Book, Orders, Customer
WHERE Book.ISBN = Orders.ISBN AND
Orders.CustomerID =
Customer.CustomerID
GROUP BY Publisher, Orders.CustomerID;
MULTIPLE JOINS
WITH WHERE AND GROUP BY
SELECT FavoriteMovie, count(*)
FROM Profiles, FavoriteBooks, FavoriteMovies
WHERE
FavoriteMovies.ProfileId = Profiles.ProfileId
and FavoriteBooks.ProfileID = Profiles.ProfileID
and FavoriteBook = "The Great Gatsby"
GROUP BY FavoriteMovie ORDER BY count(*) desc;
PROPORTIONS
USING SUB-SELECTS
SELECT FavoriteMovie, count(*) / (select count(*)
from Profiles)
FROM Profiles, FavoriteMovies
WHERE
FavoriteMovies.ProfileId = Profiles.ProfileId
GROUP BY FavoriteMovie
ORDER BY count(*) desc;
PROPORTIONS
USING SUB-SELECTS II
SELECT FavoriteMovie, Profiles.Sex, count(*) /
avg(Q.total)
from Profiles, FavoriteMovies, (select Sex, count(*)
as total from Profiles group by Sex) as Q
where
FavoriteMovies.ProfileId = Profiles.ProfileId
and Q.Sex = Profiles.Sex
group by Profiles.Sex, FavoriteMovie
order by FavoriteMovie, Profiles.Sex;
THE DATADRIVEN FIRM
GARY LOVEMAN
• Zero executive experience
• Zero background in Casinos
• But, an MIT PhD who knows how to make numbers talk
Results
• Transformed Harrah’s from second tier to number one
gaming company in the world
• Completed a $30.7 Billion LBO
• Introduced a culture of pervasive field experimentation
“There are two ways to get fired from Harrah’s…”
THE DATA-DRIVEN FIRM
Why do we see these changes now?
• Collect: easier to collect, store information
about consumers, technologies, markets
• Respond: Fast internal communication means
that firms are agile enough to respond to
external information
• Process: Firms can process large volumes of
data to make intelligent decisions
DATA-DRIVEN FIRMS ARE
WINNING
Data-driven decision makers:
• 4% higher productivity
• 6% greater profitability
• 50% higher market value from IT
(Brynjolfsson and Kim, 2011)
WHAT WAL-MART KNOWS
http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html
DATA-DRIVEN
CHALLENGES
1. Measurement
What should be measured and how?
2. Incentives
How can we design incentives around these measures
without creating adverse consequences?
3. Infrastructure
Do we have the right infrastructure (servers, software,
etc) in place to measure and analyze the data we have?
4. Skills
Do we have the skills we need to accomplish these
tasks?
A NEW KIND OF R&D
Measure
Replicate
Share
Experiment
Learn
WHAT IS DATA MINING?
1. Automated search for patterns in data
2. Automated (or computer assisted) statistical
modeling
3. A process for using IT to extract useful,
actionable knowledge from large bodies of data
“BIG DATA”
http://online.wsj.com/video/2012-the-year-of-bigdata/D4237159-C9A9-4A09-9701F03EF7FB8040.html
BIG NAMES WITH BIG DATA
CEOS
“We have come out on top in the casino wars by mining our customer
data deeply, running marketing experiments and using the results to
develop and implement finely tuned marketing and services
strategies that keep our customers coming back.”
Gary Loveman, Harrahs CEO
”For every leader in the company, not just for me, there are decisions
that can be made by analysis. These are the best kinds of decisions.
They’re fact-based decisions.”
Jeff Bezos, Amazon CEO
“It’s all about collecting information on 200 million people you’d
never meet, and on the basis of that information, making a series of
very critical long-term decisions about lending them money and
hoping they would pay you back.”
Rich Fairbank, founder and CEO of Capital One
WHY NOW?
Firms are collecting massive amounts of data on
operations, customers, and the competitive
landscape.
But there is far too much data for manual analysis.
• Amazon: > 50M active customers
• Phone companies: 100M+ accounts, thousands
of txns each
• Google: 11B “objects”
• RFID tags
TYPES OF DATA MINING
Machine
Learning
Supervised
Classification
Regression
Visualization
Unsupervised
Clustering
Outlier
detection
OUR ROADMAP
1. Visualization
2. Basic Data Mining Process
3. Classification Example
4. Clustering Example
NEXT CLASS:
DATA MINING II
• Work on A4