Introduction to Databases for Managing Research Data

Download Report

Transcript Introduction to Databases for Managing Research Data

Introduction to Databases for
Managing Research Data
CustomerID *
Name
Address
Phone
SSN
CustomerID
AccountNumber
AccountNumber*
TransactNumber
Amount
Account Table
Transaction Table
Customers Table
* Denotes Primary Key
Sherry Lake
Data Management Consultant
University of Virginia Library
[email protected]
Bill Corey
Data Management Consultant
University of Virginia Library
[email protected]
© 2013 by the Rector and Visitors of the University of Virginia.
This work is made available under the terms of the Creative Commons Attribution-ShareAlike 4.0
International license http://creativecommons.org/licenses/by-sa/4.0/
Goals for the Workshop
• Understand differences between Spreadsheets
and Databases
• Learn the features of databases
• Learn the 3 steps of database design
• Gain peer and expert feedback
2
Some Vocabulary
• Database – set of (related) tables
• Table – set of rows and columns
• Column – field, attribute
• Row – Tuple, observation, case
• RDBMS – Relational Database Management
System
3
Spreadsheets vs. Database
Spreadsheets are best if:
– Data can be stored in a single
datasheet without lots of
redundancy
– You are doing calculations or
making charts
– Don’t need to link several
spreadsheets together to get
the results you want
Databases are best if:
– Data are readily stored in
multiple related tables
– You need multiple user
access
– You want to be able to do
complex manipulations with
the data
– You want to develop data
entry tools
4
Big Issue with Spreadsheets
• Data integrity
– Internal record consistency is not maintained
– Updating more than one record
– Removing information
– Creating incomplete cells
ID
DeptName
DeptAddress
ContactName
ContactTitle
ContactPhone
1 Finance
110 5th Street
Michael Jones
Manager
555-1111
2 Finance
110 5th Street
Ted Smith
Senior Analyst
555-1112
3 Benefits
118 5th Street
Brian Williams
Manager
555-3333
5
Relational Database Is…
• Collection of data organized into tables
• Each table contains records
• Each record identifies the same set of fields
• Tables may have relationships with another
table
• Tools help you manage the table
relationships
6
Database Features
• Set of tables
• Explicit control over data (column) types in
tables
Date
Site
Height
Count
<dates only>
<text only>
< real numbers only>
< integers only>
7
Database Features
• Relationships are defined between tables
Date
Site Species
A
B
A
C
Height
Diameter
Site
Latitude
Longitude
A
B
C
D
8
Identify Data Needs
• Data Modeling
– Analyze data needs
– Visualize the data objects
• Questions to Ask
– Why do you want to collect the data?
– What do you want to do with it?
– Who else will be using the data, what do they
want from it?
9
3 Steps to Database Design
1. Split Data Into Tables
– Normalization
– Each field must contain only one value
– Each field must have a unique name
2. Determine Data Type for Each Column
3. Identify Relationships Between the Tables
– No two records can be identical
10
Normalization
• Process of efficiently organizing data in a
database
• Goals of the normalization process:
– eliminating redundant data (for example,
storing the same data in more than one table)
– ensuring data dependencies make sense (only
storing related data in a table)
11
Normalization Process
1st Normal Form
– Eliminate duplicate columns from the same
table
– Create separate tables for each group of
related data
– Identify each row with a unique column or
set of columns (the primary key)
12
First Normal: Eliminate Duplicate
Columns and Assign Keys
Books
TITLE
AUTHOR 1
Ecology 101
Ecology for Dummies
Smith, A.B.
AUTHOR 2
PUBLISHER
Gordon, D.A.
Univ. Press
4873895759
4324
Wiley & Sons
0493802020
8998
Doe, J.
ISBN
QTY.
Ecology and Politics
Kim, J.B.
McGraw-Hill
7482929292
900
Ecology and Modern Cinema
Kim, C.B.
Univ. Press
2234849302
1
Books
Authors
Author
TITLE
PUBLISHER
ISBN
QTY.
Ecology 101
Univ. Press
4873895759
4324
Ecology for Dummies
Wiley & Sons
0493802020
8998
Ecology and Politics
McGraw-Hill
7482929292
900
Ecology and Modern Cinema
Univ. Press
2234849302
1
Smith, A.B.
Gordon, D.A.
Doe, J.
Kim, J.B.
Kim, C.B.
13
First Normal: Eliminate Duplicate
Columns and Assign Keys
Books
TITLE
AUTHOR 1
Ecology 101
Ecology for Dummies
Smith, A.B.
AUTHOR 2
PUBLISHER
Gordon, D.A.
Univ. Press
4873895759
4324
Wiley & Sons
0493802020
8998
Doe, J.
ISBN
QTY.
Ecology and Politics
Kim, J.B.
McGraw-Hill
7482929292
900
Ecology and Modern Cinema
Kim, C.B.
Univ. Press
2234849302
1
Primary
Key
Books
ISBN
Authors
TITLE
PUBLISHER
QTY.
Id
Author
Ecology 101
Univ. Press
4873895759
4324
0
Smith, A.B.
Ecology for Dummies
Wiley & Sons
0493802020
8998
1
Gordon, D.A.
Ecology and Politics
McGraw-Hill
7482929292
900
2
Doe, J.
Ecology and Modern Cinema
Univ. Press
2234849302
1
3
Kim, J.B.
4
Kim, C.B.
14
First Normal: Eliminate Duplicate
Columns and Assign Keys
Books
TITLE
AUTHOR 1
Ecology 101
Ecology for Dummies
Smith, A.B.
AUTHOR 2
PUBLISHER
Gordon, D.A.
Univ. Press
4873895759
4324
Wiley & Sons
0493802020
8998
Doe, J.
ISBN
QTY.
Ecology and Politics
Kim, J.B.
McGraw-Hill
7482929292
900
Ecology and Modern Cinema
Kim, C.B.
Univ. Press
2234849302
1
Books
Authors
TITLE
PUBLISHER
ISBN
QTY.
Ecology 101
Univ. Press
4873895759
4324
Ecology for Dummies
Wiley & Sons
0493802020
Ecology and Politics
McGraw-Hill
Ecology and Modern Cinema
Univ. Press
Id
Foreign Key
ISBN
Author
0
4873895759
Smith, A.B.
8998
1
4873895759
Gordon, D.A.
7482929292
900
2
0493802020
Doe, J.
2234849302
1
3
7482929292
Kim, J.B.
4
2234849302
Kim, C.B.
15
First Normal Exercise
Pause for exercise
Personnel
Last
First
M.I.
Institution
Sector
Position 1
Position 2
Smith
Ann
A
SDSU
Academic
Professor
Community Liaison
Smith
Ann
Z
Acme Inc.
Private
Administrator
Field Technician
Kim
John
B
SDSU
Academic
P.I.
Data Manager
16
First Normal
Eliminate duplicate columns
Last
First
M.I.
Institution
Sector
Position 1
Position 2
Smith
Ann
A
SDSU
Academic
Professor
Community Liaison
Smith
Ann
Z
Acme Inc.
Private
Administrator
Field Technician
Kim
John
B
SDSU
Academic
P.I.
Data Manager
positions
personnel
Pers
_id
Last
First
M.I.
Institution
Sector
0
Smith
Ann
A
SDSU
Academi
c
Foreign Key
id
Pers_id
Position
0
0
Professor
1
0
Community Liaison
2
1
Administrator
1
Smith
Ann
Z
Acme Inc.
Private
3
1
Field Technician
2
Kim
John
B
SDSU
Academi
c
4
2
P.I.
5
2
Data Manager
17
Normalization Process
2nd Normal Form
– Meet all the requirements of the first
normal form
– Remove subsets of data that apply to
multiple rows of a table and place them in
separate tables
– Create relationships between these new
tables and their predecessors through the
use of foreign keys
18
Second Normal: Eliminate Duplicate
Rows and Assign Keys
Books
Authors
Id
TITLE
PUBLISHER
ISBN
QTY.
Ecology 101
Univ. Press
4873895759
4324
Ecology for Dummies
Wiley & Sons
0493802020
Ecology and Politics
McGraw-Hill
Ecology and Modern Cinema
Univ. Press
Author
0
4873895759
Smith, A.B.
1
4873895759
Gordon, D.A.
8998
2
0493802020
Doe, J.
7482929292
900
3
7482929292
Kim, J.B.
2234849302
1
4
2234849302
Kim, C.B.
Books
TITLE
ISBN
Publishers
ISBN
QTY.
PUBLISHER
Ecology 101
4873895759
4324
Univ. Press
Ecology for Dummies
0493802020
8998
Wiley & Sons
Ecology and Politics
7482929292
900
McGraw-Hill
Ecology and Modern Cinema
2234849302
1
19
Second Normal: Eliminate Duplicate
Rows and Assign Keys
Books
Authors
Id
TITLE
PUBLISHER
ISBN
QTY.
Ecology 101
Univ. Press
4873895759
4324
Ecology for Dummies
Wiley & Sons
0493802020
Ecology and Politics
McGraw-Hill
Ecology and Modern Cinema
Univ. Press
ISBN
Author
0
4873895759
Smith, A.B.
1
4873895759
Gordon, D.A.
8998
2
0493802020
Doe, J.
7482929292
900
3
7482929292
Kim, J.B.
2234849302
1
4
2234849302
Kim, C.B.
Books
Publishers
TITLE
ISBN
Ecology 101
4873895759
Ecology for Dummies
QTY.
Publisher_id
PUBLISHER
4324
0
Univ. Press
0493802020
8998
1
Wiley & Sons
Ecology and Politics
7482929292
900
2
McGraw-Hill
Ecology and Modern Cinema
2234849302
1
20
Second Normal: Eliminate Duplicate
Rows and Assign Keys
Books
Authors
Id
TITLE
PUBLISHER
ISBN
QTY.
Ecology 101
Univ. Press
4873895759
4324
Ecology for Dummies
Wiley & Sons
0493802020
Ecology and Politics
McGraw-Hill
Ecology and Modern Cinema
Univ. Press
ISBN
Author
0
4873895759
Smith, A.B.
1
4873895759
Gordon, D.A.
8998
2
0493802020
Doe, J.
7482929292
900
3
7482929292
Kim, J.B.
2234849302
1
4
2234849302
Kim, C.B.
Books
Publishers
TITLE
PUBLISHER_id
ISBN
Ecology 101
0
4873895759
Ecology for Dummies
1
Ecology and Politics
Ecology and Modern Cinema
QTY.
Publisher_id
PUBLISHER
4324
0
Univ. Press
0493802020
8998
1
Wiley & Sons
2
7482929292
900
2
McGraw-Hill
0
2234849302
1
21
Final Tables with Primary and
Foreign Keys
Books
Authors
TITLE
PUBLISHER_id
ISBN
QTY.
Ecology 101
0
4873895759
4324
Ecology for Dummies
1
0493802020
8998
Ecology and Politics
2
7482929292
900
Ecology and Modern Cinema
0
2234849302
1
Id
ISBN
Author
0
4873895759
Smith, A.B.
1
4873895759
Gordon, D.A.
2
0493802020
Doe, J.
3
7482929292
Kim, J.B.
4
2234849302
Kim, C.B.
Publishers
Publisher_id
PUBLISHER
0
Univ. Press
1
Wiley & Sons
2
McGraw-Hill
22
Second Normal Exercise
Pause for exercise
positions
personnel
Pers_id
Last
First
M.I.
Institution
Sector
id
pers_id
Position
0
Smith
Ann
A
SDSU
Academic
0
0
Professor
1
Smith
Ann
Z
Acme Inc.
Private
1
0
Community Liaison
2
Kim
John
B
SDSU
Academic
2
1
Administrator
3
1
Field Technician
4
2
P.I.
5
2
Data Manager
23
Second Normal
Eliminate duplicate rows
positions
personnel
Pers_
id
Last
First
M.I.
Institution
Sector
0
Smith
Ann
A
SDSU
Academic
1
Smith
Ann
Z
Acme Inc.
Private
2
Kim
John
B
SDSU
Academic
id
pers_id
Position
0
0
Professor
1
0
Community Liaison
2
1
Administrator
3
1
Field Technician
4
2
P.I.
5
2
Data Manager
personnel
Pers_
id
Last
First
M.I.
Institution_id
0
Smith
Ann
A
0
Institution_id
Institution
Sector
1
Smith
Ann
Z
1
0
SDSU
Academic
2
Kim
John
B
0
1
Acme Inc.
Private
institutions
24
Final Tables with Primary and
Foreign Keys
positions
personnel
First
id
pers_id
Position
0
0
Professor
0
1
0
Community Liaison
Z
1
2
1
Administrator
B
0
3
1
Field Technician
4
2
P.I.
5
2
Data Manager
Pers_
id
Last
M.I.
0
Smith
Ann
A
1
Smith
Ann
2
Kim
John
Institution_id
institutions
Institution_id
Institution
Sector
0
SDSU
Academic
1
Acme Inc.
Private
25
Normalization Process
3rd Normal Form
– Meet all the requirements of the second
normal form
– Remove columns that are not dependent
upon the primary key
Table: Schools
id schoolName
schoolPhone
dean
deanEmail
1 Architecture
555-1111
Michelle Myers
[email protected]
2 Arts & Sciences
555-2222
Ted Smith
[email protected]
3 Education
555-3333
Bryan Williams
[email protected]
26
Determine Data Types
Data Type
Definition
Text
0-255 characters
Memo
0-64000 characters
Number
Integer, long integer, single, double
Date/Time
Dates, times, or both at once
AutoNumber
Automatically incremented as
records are added
OLE object
Image, sound files
Hyperlink
Link to an internet resource
27
Determine Data Types
Books
Authors
TITLE
Text (255)
Id
Number(integer)
PUBLISHER_id
Number
(integer)
ISBN
Text(10)
ISBN
Text(10)
Author
text(255)
QTY.
Number(integer)
Publishers
PUBLISHER_id
integer
PUBLISHER
text(255)
28
Determine Data Types Exercise
Pause for exercise
29
Determine Data Types
Personnel
Institutions
Pers_id
Number (integer)
Last
Text(255)
Institution_id
Number(integer)
First
Text(255)
Institution
text(255)
M.I.
Text(1)
Sector
Text(25)
Institution_id
Number (integer)
Positions
id
Number(integer)
pers_id
Number(integer)
position
Text(200)
30
Identify the Relationships
• How the tables are “related” to each other
– One-to-one
– One-to-many
– Many-to-many
• Foreign Keys define the relationships
31
Identify the Relationships
Books
TITLE
PUBLISHER_id
ISBN
QTY.
Ecology 101
0
4873895759
4324
Ecology for Dummies
1
0493802020
8998
Ecology and Politics
2
7482929292
900
Ecology and Modern Cinema
0
2234849302
1
1 to many
Authors
Id
ISBN
1 to many
Publishers
Author
0
4873895759
Smith, A.B.
1
4873895759
Gordon, D.A.
2
0493802020
Doe, J.
3
7482929292
Kim, J.B.
4
2234849302
Kim, C.B.
PUBLISHER_id
PUBLISHER
0
Harcourt Brace
1
Wiley & Sons
2
McGraw-Hill
32
Identify the Relationships
personnel
positions
pers_id
Last
First
M.I.
Institution_id
id
pers_id
Position
0
Smith
Ann
A
0
0
0
Professor
1
Smith
Ann
Z
1
1
0
Community Liaison
2
Kim
John
B
0
2
1
Administrator
3
1
Field Technician
4
2
P.I.
5
2
Data Manager
Pause for Exercise
contact_address
institutions
Institution_id
Institution
Sector
0
SDSU
Academic
1
Acme Inc.
Private
pers_id
street
city
state
0
523 Main St.
Amherst
MA
1
1010 Sea St.
San Diego
CA
2
99 Ridge Way
Portland
ME
33
Identify the Relationships
personnel
positions
pers_id
Last
First
M.I.
Institution_id
id
pers_id
Position
0
Smith
Ann
A
0
0
0
Professor
1
Smith
Ann
Z
1
1
0
Community Liaison
2
Kim
John
B
0
2
1
Administrator
3
1
Field Technician
4
2
P.I.
5
2
Data Manager
1 to many
1 to many
1 to 1
contact_address
institutions
Institution_id
Institution
Sector
0
SDSU
Academic
1
Acme Inc.
Private
pers_id
street
city
state
0
523 Main St.
Amherst
MA
1
1010 Sea St.
San Diego
CA
2
99 Ridge Way
Portland
ME
34
Identify the Relationships
Many-to-many: need to create a 3rd table to relate the two
35
Relational Database Functions
• Organize data – reduce or eliminate redundancy
• Improve data quality – reject “bad” data
– Wrong type of data
– Only good “codes” allowed
•
•
•
•
Retrieve data – query/search/select
Sort data
Update data
Output – link to other software with statistical and
graphical functionality
36
Database Management Systems
• FileMaker Pro
– Client/Server architecture
• Microsoft Access (SQL Server)
– Good for single users
– Sophisticated user interface
• MySQL
– Client/Server architecture
– Limited User Interface (PHPMyAdmin)
• Oracle
37
Excel
vs.
– Optimized for data
analysis & calculations
– Limited sorting
– Good for performing
complex calculations,
exploring possible
outcomes, and producing
high quality charts
– Record limit is 1,048,576
Access
– Used to collect,
manipulate & sort data
– Sort by different
selections
– Maintain data integrity
– Good for managing data
– Easy to create data
entry forms
38
Database Systems
Access
• Access workstation based, single user application
• Platform dependent
• Cannot be accessed concurrently
• No security other than workstation
• Part of MS Office suite, not free
MySQL
• MySQL is cross platform, multi user access
• Accessible to more users thru the web, client program or
other admin tools to access database (via authentication)
• Can be integrated with Web Server (web programming
languages)
• Data available remotely
• Free, open-source
39
More Information
• Database Normalization Basics
– http://databases.about.com/od/specificproducts/a/norma
lization.htm
• Step-by-step Guides to Using Databases
– http://www.geekgirls.com/menu_databases.htm
• Interactive Online SQL Training
– http://www.sqlcourse.com/
• Comparison between Access & Excel
– http://office.microsoft.com/enus/access/HA102101951033.aspx
40
Mailing List Subscription
• Please check the box on our sign-in sheet to
receive occasional emails to keep up with our
services, training, and news.
• Please encourage others to subscribe:
http://eepurl.com/CJwYT
41