PGA Tournament Data Warehouse

Download Report

Transcript PGA Tournament Data Warehouse

PGA Tour Data Warehouse
Team Members
Frank Paladino
Aravind Yeluripiti
Project Goals
• The Professional Golfers Association (PGA) is the
organizer of the men's professional golf tour
primarily in the United States and North
America.
• It organizes most of the events on the flagship
annual series of tournaments also known as the
PGA Tour.
• The goal of this project is to build a data
warehouse using data collected from the PGA
Tour website.
Task Delegation and work flow
Source Databases
• The source databases contain data related to PGA Players,
Tournaments, and Results.
• For each subject area, research was done to determine the
appropriate data model for database.
• After completing the data model, data was loaded into the
source databases.
• In order to obtain relevant data for our project, we
extracted data from the PGA Tour website into comma
delimited files and imported them into our source
databases.
• The “LOAD DATA INFILE ‘[filename]’ FIELDS TERMINATED BY
‘,’ command was used to load the data from files into the
source databases.
Source Databases
• Player Database
– The player database contains data related to the
players profile and demographic information.
• Tournament Database
– The Tournament database contains detailed data
related to tournaments, venues, and results of
tournaments.
• Statistics Database
– The statistics database contains data that measures a
players tour performance.
Raw data - sample
•
Dustin johnson
–
–
–
–
–
–
–
•
Height: 6 ft, 4 in
Weight: 190 lbs
Birthday: 06/22/1984
College:Coastal Carolina University
Turned Pro:2007
Birthplace:Columbia, South Carolina
Residence:Myrtle Beach, South Carolina
Steven stricker
–
–
–
–
–
–
–
Height:6 ft, 0 in
Weight:190 lbs
Birthday:02/23/1967
College:University of Illinois
Turned Pro:1990
Birthplace:Edgerton, Wisconsin
Residence:Madison, Wisconsin
Players Database
Players Database Snapshots
Players Table
Locations Table
Players Database Snapshots
PlayerPersonalInfo Table
Players Database snapshots
Statistics Database
Statistics Database Snapshots
BasicStats Table
AdvancedStats Table
Statistics Database Snapshots
TourStats Table
Tournament Database Schema
Tournament Database Snapshots
CourseDetails Table
PlayerTourney Table
Tournament Database Snapshots
Venues Table
Tourneys Table
Tournament Database Snapshots
Scorecard Table
PGA Tour Warehouse Database
The PGA Tour Warehouse integrates data from
the three source databases to a single schema
that can be used to query player statistical,
tournament results, and player performance.
Our approach separated source data into:
– Qualitative data – dimensions – data used to
qualify or filter the data
– Quantitative data - (facts/measures) – data used
to measure performance
Implementation of Data Warehouse
The SQL implementation of data warehouse dimension and fact tables used a similar approach to that described below. For
each target table to be loaded, a stored procedure was used to extract source data into a database cursor, compare the key
fields to the target table, and if the row does not already exists, inserts the data into the target table.
Example shown below was used to load the LocationDimension table
BEGIN
DECLARE done INT DEFAULT 0;
DECLARE id varchar(10);
DECLARE city1 varchar(45);
DECLARE state1 varchar(2);
DECLARE country1 varchar (45);
DECLARE cur1 CURSOR FOR SELECT DISTINCT idLocation, city, state, country FROM tournament.venue ;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
OPEN cur1; -- open database cursor
cur1_loop: LOOP
FETCH cur1 INTO id, city1, state1, country1; -- MySQL Bug 28227 (variable names cannot be same as column names)
IF (done=1) THEN
LEAVE cur1_loop;
END IF;
IF NOT EXISTS (SELECT * FROM pgatourwarehouse.locationdimension WHERE idLocation = id) THEN
INSERT INTO pgatourwarehouse.locationdimension VALUES (id, city1, state1, country1, CURDATE());
END IF;
END LOOP;
SELECT 1 FROM tournament.venue LIMIT 1; -- MySQL Bug 60840 (need a statement using a table to avoid throwing warning 1329)
CLOSE cur1; -- close the database cursor
END
Implementation of Data Warehouse
Last, a call wrapper.sql script was used to call
stored procedures in specific order so that
referential integrity constraints were honored.
1.
2.
3.
4.
5.
6.
7.
8.
CALL loadLocationDimension();
CALL loadLocationDimension2();
CALL loadVenueDimension();
CALL loadPlayerDimension();
CALL loadTournamentDimension();
CALL loadPlayerTourneyRoundFacts();
CALL loadPlayerFacts();
CALL loadPlayerTourneyFacts();
PGA tour
Warehouse
Fact
constellation
Data Warehouse Snapshots
LocationDimension Table
Data Warehouse Snapshots
PlayerDimension Table
Data Warehouse Snapshots
TournamentDimension Table
VenueDimension Table
Data Warehouse Snapshots
PlayerFacts Table
Data Warehouse Snapshots
PlayerTourneyFacts Table
Data Warehouse Snapshots
PlayerTourneyRoundFacts Table
Is PGA tour, a warehouse?
• “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management’s decision-making process.”—
W. H. Inmon
• Subject-oriented
– Players, tournaments
• Integrated
– raw data -> databases -> warehouse
• Time variant
– implicit: seasons (2012,2011)
– explicit: each table time stamped
• Nonvolatile
– Initial loading and access of data
Is PGA tour, a warehouse?...contd
• Management decision making (querying)
– How many players are from so-so
city/state/country?
– How many tournaments are being held in certain
state in a specific month?
– Etc..