Transcript Slide1

Census Bureau
DRIS
Date: 01/16/2007
Index

Data Modeling
Current Datafile
Current Dataload
Data Overview
Two Approaches
First Approach
Data Distribution
Advantages
Disadvantages
2

Second Approach
Basic Modeling
Advantages
Advance Work
Care needed
Our Recommendation
Tasks
3
Data modeling

Conversion of data from Legacy (Fortran) to RDBMS (Oracle)

Hardware/software:
Sun V890/E12K, OS Solaris 5.7, 5.8, 5.9, 5.10
Database: Oracle 10g
Oracle Designer / Erwin
4
Current datafile

[Diagram: data feeds (Geo, Census, Base Data), legacy process, big datafile, data modeling, data updates, reports, Oracle DB, PL/SQL, shell, C, ETL tool]
5
Current Dataload

UCNM data
Fortran format
One big file with 180 M records
Record length is 1543 bytes
Most of the fields are varchar2
Many fields are blank / have no data
Performance in Oracle is too poor
6
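For illustration, a fixed-width legacy file like this can be exposed to Oracle 10g as an external table before any modeling work. The sketch below is only an assumption about the layout: the directory, table, column names, and byte positions (mafid, state_cd, addr_line) are hypothetical; the real field list would come from the UCNM record definition.

  -- Hypothetical external table over the 1543-byte fixed-width file.
  -- Requires a directory object, e.g.:
  --   CREATE DIRECTORY legacy_dir AS '/data/ucnm';
  CREATE TABLE ucnm_ext (
    mafid     VARCHAR2(12),
    state_cd  VARCHAR2(2),
    addr_line VARCHAR2(60)
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY legacy_dir
    ACCESS PARAMETERS (
      RECORDS FIXED 1543
      FIELDS MISSING FIELD VALUES ARE NULL
      (
        mafid     POSITION(1:12)  CHAR(12),
        state_cd  POSITION(13:14) CHAR(2),
        addr_line POSITION(15:74) CHAR(60)
      )
    )
    LOCATION ('ucnm.dat')
  )
  REJECT LIMIT UNLIMITED;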
Data Overview (approx.)

State                    Records    Approx. size
State of NY              20 M       31 GB
State of CA              34 M       52 GB
State of TX              25 M       38 GB
District of Columbia     500 K      750 MB
Delaware                 1 M        1.5 GB
Connecticut              1 M        1.5 GB
7
Two approaches

First Approach
Break the datafile on the basis of data, e.g. at RO level (12) or state level (54-56, including DC, Puerto Rico, etc.)

Second Approach
Break the datafile into multiple tables, with changes to the field definitions, using a relational model
8
First approach
Break datafile on the basis of data
[Diagram: the current datafile is split into one table per state: Table_CA, Table_NY, Table_XX, Table_YY, Table_54]
9
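A rough sketch of what the first approach could look like in SQL, reusing the hypothetical staging/external table and state_cd column from the earlier sketch; one table is created per state code.

  -- Create-table-as-select, one table per state (illustrative names)
  CREATE TABLE table_ny AS
    SELECT * FROM ucnm_ext WHERE state_cd = 'NY';

  CREATE TABLE table_ca AS
    SELECT * FROM ucnm_ext WHERE state_cd = 'CA';

  -- repeated for each of the 54-56 state-level codes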
Data distribution

Uneven data distribution

Big data tables will be 30+ GB

Small data tables will be under 1 GB
10
Advantages


State-level queries will be faster than they are currently
If the data is separated by RO, the data will be more evenly distributed, with fewer tables (about 12 instead of 54-56)
11
Disadvantages

Too many tables
Many fields are empty yet defined as varchar2(100)
No normalization technique is used
Existing queries need to be changed significantly
Queries on small tables will run fast, but there will be a lot of overhead on the big tables
The number of operational tables stays the same
Queries become too complicated; joining the main and operational tables may confuse users
12
Second approach
Break the datafile into a few relational tables, with changes to the field definitions

[Diagram: the current datafile is split into Table1, Table2, Table3, and Table4, all keyed on MAFID]
13
Basic Modeling

Database design, logical and physical
Relations will be defined based on a primary key; in this case MAFID, which is unique
varchar2(100) fields will be converted to smaller fields, say varchar2(60) or smaller, based on actual field lengths
Every field will be mapped to at least one field in the new tables
Data will be inserted into multiple small tables
14
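A minimal sketch of what this modeling could produce, assuming MAFID as the shared primary key as described above. The table names follow the diagram (Table1, Table2), but the columns, lengths, and data types are hypothetical and would come from the field mapping.

  -- Parent table: one row per MAFID, with right-sized columns
  CREATE TABLE table1 (
    mafid     VARCHAR2(12) NOT NULL,
    state_cd  VARCHAR2(2),
    addr_line VARCHAR2(60),   -- resized from the legacy varchar2(100)
    CONSTRAINT table1_pk PRIMARY KEY (mafid)
  );

  -- Child table keyed on the same MAFID
  CREATE TABLE table2 (
    mafid     VARCHAR2(12) NOT NULL,
    geo_code  VARCHAR2(15),
    CONSTRAINT table2_pk PRIMARY KEY (mafid),
    CONSTRAINT table2_t1_fk FOREIGN KEY (mafid) REFERENCES table1 (mafid)
  );

  -- Populate from the staging/external table (hypothetical columns)
  INSERT INTO table1 (mafid, state_cd, addr_line)
    SELECT mafid, state_cd, addr_line FROM ucnm_ext;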
Advantages

Faster queries, updates, deletes, and additions
Less maintenance
Same approach can be used for transactional/operational data
15
Advance work

Identify each and every field of the UCNM data
Check/define the length of each field
Map every field to a field in the new tables
Can some fields be merged together? If yes, identify them
Define the tables and relationships
Break up and load the data into these tables
16
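One way to support the field-length work above is to profile each legacy field directly in SQL before fixing the new column sizes; the table and column names here are the hypothetical ones from the earlier sketches.

  -- How long is the field really, and how often is it blank?
  SELECT MAX(LENGTH(addr_line))                              AS max_len,
         COUNT(*)                                            AS total_rows,
         SUM(CASE WHEN addr_line IS NULL THEN 1 ELSE 0 END)  AS blank_rows
  FROM   ucnm_ext;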
Care needed

The current datafile will be broken into multiple datafiles for data processing
Load the datafiles into the tables one by one
Make sure that all of the datafiles are loaded into the tables
Make sure no data is missing from the base table
17
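An example of the kind of reconciliation check this slide calls for, again with the hypothetical names from the earlier sketches: any MAFID in the base data that never reached the new tables shows up in the first query, and the second compares overall row counts.

  -- MAFIDs present in the base data but missing from the new parent table
  SELECT mafid FROM ucnm_ext
  MINUS
  SELECT mafid FROM table1;

  -- Quick row-count comparison
  SELECT (SELECT COUNT(*) FROM ucnm_ext) AS base_rows,
         (SELECT COUNT(*) FROM table1)   AS table1_rows
  FROM dual;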
Our Recommendation

** Second Approach **

Why?

Data distribution will be uniform
Less unwanted data is moved to separate tables
This reduces overhead on queries and on any updates
Existing queries can be reused with little modification
Less maintenance
Additional data, such as RPS data, can easily be loaded using the same queries
18
Tasks

Design the database using a data modeling tool (Oracle Designer, Erwin, etc.)
Create test data from the original datafile
Load the test data into the database tables
Create test scripts to check data consistency
Check indexes for the required queries
Test old data vs. new data
19
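For the "check indexes for required queries" task, a sketch (with the same hypothetical names) of adding an index and confirming with EXPLAIN PLAN that a typical query uses it.

  CREATE INDEX table1_state_ix ON table1 (state_cd);

  EXPLAIN PLAN FOR
    SELECT t1.mafid, t2.geo_code
    FROM   table1 t1
    JOIN   table2 t2 ON t2.mafid = t1.mafid
    WHERE  t1.state_cd = 'NY';

  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);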
Continued…

Break the data into small files
Load the full data into the tables
Unit test the data for consistency
Run queries on the database
If needed, fine-tune the database
Use the same approach for transactional data, such as RPS data
20
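If fine-tuning is needed after the full load, refreshing optimizer statistics on the new tables is usually the first step; a small PL/SQL sketch, again using the hypothetical table names from the earlier sketches.

  BEGIN
    -- Gather table and index statistics so the optimizer sees the real data volumes
    DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE1', cascade => TRUE);
    DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE2', cascade => TRUE);
  END;
  /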
THE END
21