
Census Bureau
DRIS
Date: 01/23/2007
Oracle conversion example: initial approaches for UCM source data
Index
Data Modeling
Current Datafile
Current Dataload
Data Overview
Two Approaches
First Approach
Data Distribution
Advantages
Disadvantages
Second Approach
Basic Modeling
Advantages
Advance Work
Care needed
Our Recommendation
Tasks
Next steps
Questions?
Data modeling

Conversion of data from legacy (Fortran) to RDBMS (Oracle)

Hardware/software:
Sun E6900, Solaris 5.10, 12 CPUs, 96 GB RAM
Database: Oracle 10g
Modeling tools: Oracle Designer / Erwin
Current datafile

[Diagram: geo and base data flow through the legacy process into one big datafile; after data modeling, the data is loaded into the Oracle DB using PL/SQL, shell, C, and an ETL tool, and the database serves data updates, reports, and data feeds.]
Current Dataload

UCM data:
Fortran fixed format (see the external-table sketch below)
One big file with 180 M records
Record length is 1543 bytes
Most of the fields are varchar2
Many fields are blank / contain no data
Performance in Oracle is inadequate without a schema redesign to leverage RDBMS capabilities
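As an illustration only: one way to expose a fixed-width file like this to Oracle 10g before any redesign is an external table over the 1543-byte records. The directory, file name, field names, and byte positions below are assumptions for the sketch; the real layout would come from the field inventory described under "Advance work".

    -- Hypothetical external table over the fixed-width UCM file.
    -- Directory, file name, and all field positions are illustrative.
    CREATE DIRECTORY ucm_dir AS '/data/ucm';

    CREATE TABLE ucm_ext (
      mafid       VARCHAR2(12),
      state_code  VARCHAR2(2),
      street_name VARCHAR2(100)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY ucm_dir
      ACCESS PARAMETERS (
        RECORDS FIXED 1543            -- matches the 1543-byte record length
        FIELDS (
          mafid       POSITION(1:12)   CHAR,
          state_code  POSITION(13:14)  CHAR,
          street_name POSITION(15:114) CHAR
        )
      )
      LOCATION ('ucm_source.dat')
    );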
Data Overview (approx)

Area                    Records   Size
State of NY             20 M      31 G
State of CA             34 M      52 G
State of TX             25 M      38 G
District of Columbia    500 K     750 M
Delaware                1 M       1.5 G
Connecticut             1 M       1.5 G
Two approaches

First approach
Break the datafile on the basis of data, e.g.:
RO level (12 tables)
State level (54-56 tables, including DC, Puerto Rico, etc.)

Second approach
Break the datafile into multiple tables, with changed field definitions, using a relational model
First approach

Break the datafile on the basis of data (see the sketch below)

[Diagram: the current datafile is split into one table per state: Table_CA, Table_NY, ..., Table_54.]
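One way this split could be carried out, assuming the source data is already visible to Oracle as a staging table (here called ucm_stage, with a hypothetical state_code column):

    -- One table per state, carved out of the staging table.
    -- ucm_stage and state_code are assumed names, not the real schema.
    CREATE TABLE table_ca AS
      SELECT * FROM ucm_stage WHERE state_code = '06';   -- CA FIPS code

    CREATE TABLE table_ny AS
      SELECT * FROM ucm_stage WHERE state_code = '36';   -- NY FIPS code

    -- ...repeated for each of the 54-56 state-level tables.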
Data distribution

Uneven data distribution:
Big data tables will be 30+ G
Small data tables will be under 1 G
Advantages of this kind of segmenting/partitioning:

State-level queries will be faster than they are currently
If the data is separated by RO instead, it will be more evenly distributed across fewer tables (about 12 instead of 54-56)
Disadvantages

Too many tables
Many fields remain empty varchar2(100) columns
No normalization technique is used
Existing queries need extensive changes
Queries will run fast on small tables, but big tables carry a lot of overhead
The operational tables will be the same in number
Queries become too complicated; joining main and operational tables may confuse users
Second approach

Break the datafile into a few relational tables with changed field definitions

[Diagram: the current datafile is split into Table1 through Table4, all related through the common MAFID key.]
Basic Modeling

Database design, logical and physical
Relations will be defined based on a primary key; in this case MAFID, which is unique
varchar2(100) fields can be converted to smaller fields based on actual field lengths
Every field will be mapped to at least one field in the new tables
Data will be inserted into multiple efficient tables based on the updated data model, using relational database design principles (a schema sketch follows)
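A minimal sketch of what a MAFID-keyed relational schema could look like; the table names, columns, and lengths are assumptions for illustration, not the actual UCM layout:

    -- Parent table: one row per MAFID, the unique key from the source.
    CREATE TABLE address_core (
      mafid       NUMBER       PRIMARY KEY,
      state_code  VARCHAR2(2)  NOT NULL,
      county_code VARCHAR2(3)
    );

    -- Child tables hold related attribute groups, keyed back to MAFID.
    CREATE TABLE address_geo (
      mafid     NUMBER REFERENCES address_core (mafid),
      latitude  NUMBER(9,6),
      longitude NUMBER(9,6)
    );

    CREATE TABLE address_status (
      mafid       NUMBER REFERENCES address_core (mafid),
      status_code VARCHAR2(2),
      update_date DATE
    );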
Advantages

Faster queries, updates, deletes, and additions
Less maintenance
The same approach can be used for transactional/operational data
Advance work

Identify each and every field of the UCM data
Check/define the length of each field (see the profiling sketch below)
Map every field to the new schema
Determine whether some fields can be merged
Identify and remove duplicate data elements in the model
Define tables and relationships and create the new schema
Break up and load the data into these tables
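Checking actual field lengths could start with a profiling query along these lines; ucm_ext and the column names are stand-ins for the real staging table and fields:

    -- How much of each varchar2(100) column is actually used,
    -- and how many rows leave it blank.
    SELECT MAX(LENGTH(street_name))      AS max_street_name,
           MAX(LENGTH(unit_desig))       AS max_unit_desig,
           COUNT(*) - COUNT(street_name) AS blank_street_names
    FROM   ucm_ext;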
Care needed

The current datafile will be broken into multiple datafiles for data processing
Load the datafiles into tables one by one
Test and demonstrate the completeness of the new model
Craft a comparison to prove that the source and the new schema both include all Census data (see the sketch below)
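One way to frame that comparison, assuming ucm_ext is the source staging table and address_core the new parent table (both hypothetical names):

    -- Row counts should agree between source and new schema.
    SELECT (SELECT COUNT(*) FROM ucm_ext)      AS source_rows,
           (SELECT COUNT(*) FROM address_core) AS new_rows
    FROM   dual;

    -- Any MAFID in the source but not in the new schema points
    -- to an incomplete load (run the reverse MINUS as well).
    SELECT mafid FROM ucm_ext
    MINUS
    SELECT mafid FROM address_core;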
Our Recommendation

** Second Approach **

Why?

Data distribution will be uniform
Unwanted, sparse data is moved off to separate tables
This reduces overhead on queries and updates
Existing queries can be used with minor modifications
Ongoing data maintenance will be more efficient in an RDBMS
Additional data such as RPS can easily be uploaded using the same queries
Tasks

Design the database using a data modeling tool (Oracle Designer, Erwin, etc.)
Create test data from the original datafile
Load the test data into database tables
Create test scripts to check data consistency
Check indexes for the required queries (see the EXPLAIN PLAN sketch below)
Test old data vs. new data
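Index checks could follow the usual EXPLAIN PLAN route, sketched here with assumed table and index names:

    -- Support the common state-level lookups with an index,
    -- then confirm the optimizer actually uses it.
    CREATE INDEX idx_core_state ON address_core (state_code);

    EXPLAIN PLAN FOR
      SELECT COUNT(*) FROM address_core WHERE state_code = '06';

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);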
Continued…

Break the data into small files
Load the full data into the tables
Unit test the data for consistency
Run queries on the database
Fine-tune the database if needed
Use the same approach for transactional data such as RPS data
Next steps…

Continued collaboration with the Census team to improve domain understanding for new team members
Access to Oracle database tools on the team’s workstations
Access to an operational Oracle instance to begin development of the approach
Questions?