Data Profiling & Database Archiving
Download
Report
Transcript Data Profiling & Database Archiving
SvalTech
Two Neglected Data
Management Technologies
Data Profiling & Database Archiving
DAMA Phoenix Chapter
9 September, 2010
Jack E. Olson
[email protected]
www.svaltech.com
“Data Quality: The Accuracy Dimension”, Morgan Kaufmann, 2003
“Database Archiving: How to Keep Lots of Data for a Long Time", Morgan Kaufmann, 2008
Copyright SvalTech, Inc., 2010
SvalTech
Why Two Topics
•
Jack knows a lot about each of them
–
–
–
–
Wrote a book on data profiling
Wrote a book on database archiving
Built data profiling product: AXIO at Evoke
Built database archiving product: TITAN at NESI
•
Both are post-operational data management technologies
•
Neither is required for operational purposes
•
Both have huge potential for return on investment
– Cost savings
– Operational performance improvements
– Risk reduction
•
Both are liked by compliance/governance folks
•
Both are under-utilized in IT shops
Copyright SvalTech, Inc., 2010
2
SvalTech
What I’m Going to Talk About
For each Technology,
•
•
•
•
•
•
Define Technology
Identify Where It is Used
Show How it Works
Basic Solution Architectures
Show Business Case Basics
Why it is Neglected
Copyright SvalTech, Inc., 2010
3
SvalTech
Data Profiling
Copyright SvalTech, Inc., 2010
4
SvalTech
Books
• Data Quality: The Accuracy Dimension, Jack E.
Olson, Morgan Kaufmann, 2003
• Executing Data Quality Projects” Ten Steps to Data
Quality and Trusted Information (tm), Danette
McGilray, Morgan Kaufmann, 2008
• Three Dimensional Analysis – Data Profiling
Techniques, Ed Lindsey, Data Profiling LLC, 2008
• Data Quality Assessment, Maydanchik Arkady,
Technics Publications, LLC, 2007
Copyright SvalTech, Inc., 2010
5
SvalTech
Definition
Data Profiling is the application of data analysis techniques
to existing data for the purpose of determining the actual
content, structure, and quality of the data.
metadata
accurate &
inaccurate
Data
accurate &
inaccurate
Copyright SvalTech, Inc., 2010
accurate
metadata
Data
Profiling
facts about
inaccurate
data
data quality
issues
6
SvalTech
Where is Data Profiling Used?
•
Enterprise Data Quality Improvement Program
–
–
–
•
Support Consolidation of Databases after mergers and acquisitions
–
–
•
Dramatically reduce cost and time to complete projects
Improve quality of data in resulting system
Support application renovation projects
–
–
•
Traditional Six Sigma Like Program
Recursive application of data quality assessment
Based on historical success of companies who have used it
Dramatically reduce time and cost to complete
Improve quality of data in resulting system
Support data integration functions for data warehousing/ business
intelligence data stores
–
–
Develop processes to cleanse data in transit
Improve quality of data in information intelligence stores
Copyright SvalTech, Inc., 2010
7
SvalTech
Data Quality Program
Data Quality
Correction
Data Quality
Prevention
Data Profiling
Name & Address
Cleansing
De-duplication
Data Quality
Monitoring
Incident
Investigation
Re-verification
Business Process
Improvements
Data Quality Issue
Formulation &
Tracking
Application Software
Improvements
Data Quality
Discovery
Copyright SvalTech, Inc., 2010
8
SvalTech
Data Quality Discovery
Data Profiling
profile
DATA
review
profiler
Incident
Investigation
Copyright SvalTech, Inc., 2010
Suspicion
Of DQ problem
DQ issues
process
review
DQ issues
DATA
review
9
SvalTech
Data Profiling Functions
•
Column Examination
–
–
–
–
–
–
•
Row Examination
–
–
–
•
Find all primary key candidates (single or multi-column)
Find intra-row column dependencies (find denormalization instances)
Find multi-column value relationships
• Value ordering rules
• NULL value dependencies
Multi-table Examination
–
–
–
–
•
List all values in column with frequency of occurrence
Show high and low values
Determine true data type
Determine degree of uniqueness
Determine encoding patterns used, frequency of each pattern
Compute values: AVG, SUM, MEDIAN, STD DEVIATION
Find matching columns across tables
• Match by column name, data type
• Match by values
Find primary/foreign key pairs (single and multi-column)
Determine 1-1, 1-M, 1-0, M-1, M-M, 0-1 rules
Find primary values not found in secondary tables
Test User provided data rules
Copyright SvalTech, Inc., 2010
10
SvalTech
Potential for Discovery of Value
Problems
missing & invalid
values
values recorded
in a column of a
table
errors that can be
fixed without
re-verification
valid but wrong
values
errors that can be
found through
analytical techniques
correct values
Copyright SvalTech, Inc., 2010
11
SvalTech
What Types of Problems can be identified
for values in a column?
Invalid Values
- missing values when should not be missing
- values out of range or not in domain of expected values
- value in one column not possible when combined with
values in one or more other columns
- obviously wrong when looked at
•Name = Donald Duck
•Address = 1600 Pennsylvania Avenue
Valid Values
- distribution of values unexpected
too many of one or more values
too few of one or more values
- value is incompatible with other values in other columns
- multiple values mean same thing
Copyright SvalTech, Inc., 2010
12
SvalTech
What Types of Problems can be Fixed?
• Synonyms (multiple representations for same value)
• Multiple ways to express same value
•Date formats
•Number formats
•Use of case on character values
• Cases where the value of a column can be determined from the
values in one or more other columns
explicitly: through a rule
correlation against known combinations
Copyright SvalTech, Inc., 2010
13
SvalTech
Role of Metadata and Data Rules
• must define what constitutes correct
• traditional metadata is only part of definition
• traditional metadata is often inaccurate and/or incomplete
• data rules exist whether known or not
• data rules exist whether enforced or not
• profiling can validate metadata and known data rules
• can be used to correct or enhance metadata and data rules
• can be used to discover additional data rules
• can test adherence to data rules
Copyright SvalTech, Inc., 2010
14
SvalTech
Data Profiling Solution
Architectures
metadata
data rules
gather metadata
& data rules
review
update metadata
and data rules
profile
profiler
data
extract
samples
Identify and
quantify problems
determine impact
on business
create DQ
problem tickets
Copyright SvalTech, Inc., 2010
15
SvalTech
Solution Architectures
Home Grown
Inefficient
Directed SQL; not generic
Easy Algorithms
Few Surprises uncovered
Vendor Tools
More sophisticated functions
third normal form discovery
foreign key discovery
multiple pattern recognition
More management of results
Copyright SvalTech, Inc., 2010
16
SvalTech
State of Vendor Support
• Fragmented
–
–
Different products call themselves data profiling but do different things
Some non-profiling products call themselves profiling
•
Buried in multiple data management functions
•
Marketed as compliance product
•
Not a lot of advances in technology over last few years
–
–
–
–
–
–
More complex array of profiling functions
Targeted profiling functions
Menu of functions to pick from
SaaS
Profiling of metadata
Monitoring operational systems for profiling expectations
Copyright SvalTech, Inc., 2010
17
SvalTech
Business Case Basics
•
Dollar savings
–
–
•
Improved Operations
–
–
•
Fewer Operational Mistakes
• Reduced rework
• Reduced customer returns
Improved Analytic Results
• Better decision making from Analytics
Data Quality problems can cost
a company 15-25% of bottom line profit.
Fewer operational glitches due to wrong data values
Reduced time to complete projects
Risk Reduction
–
–
Reduced exposure to catastrophic events
Reduced exposure to legal scrutiny of data
Copyright SvalTech, Inc., 2010
It’s all indirect:
you cannot determine
what value you will get
until after you have
profiled data, studied
results and used it to
some advantage.
18
SvalTech
How to Get Started
•
Need to establish as part of quality improvement/ compliance/
governance program
–
–
•
Start with a pilot data quality assessment of a critical data resource
–
–
–
•
6 Sigma like
Multi-department interest
Small
Targeted: high value consequences of poor data
“see what we find”
Possibly start with a renovation/ consolidation/ integration project
–
Process to establish source level metadata and understand data content
•
Need to Quantify Results
•
Use Success to Expand Use of Technology
Copyright SvalTech, Inc., 2010
19
SvalTech
Why is it Neglected?
•
Problems are not on Top Ten List of IT Concerns
•
Cannot predict value: cannot build business case in advance
–
–
Don’t know value of a data quality improvements until after you discover the problems
Value is hard to quantify: does not appear until quality problems acted upon
•
Perception that you don’t need it
•
Exposes corporate/ IT weaknesses
•
No Clear Owner of Data Quality
•
Lack of education on value, cost to implement
– Management education
– Technical Staff education
Copyright SvalTech, Inc., 2010
20
SvalTech
Database Archiving
Copyright SvalTech, Inc., 2010
21
SvalTech
Books
• Database Archiving: How to Keep Lots of Data for a
Very Long Time, Jack E. Olson, Morgan Kaufmann,
2009
Copyright SvalTech, Inc., 2010
22
SvalTech
Definition
The process of removing selected data items from
operational databases that are not expected to be referenced
again and storing them in an archive database where
they can be retrieved if needed.
Physical Documents
application forms
mortgage papers
prescriptions
File Archiving
structured files
source code
reports
Copyright SvalTech, Inc., 2010
Document Archiving Multi-media files
word
pictures
pdf
sound
excel
telemetry
XML
Email Archiving
outlook
lotus notes
Database Archiving
DB2
IMS
ORACLE
SAP
PEOPLESOFT
23
SvalTech
Business Records: the Archive Unit
You don’t archive databases; you archive data from databases.
A Business Record is the data captured and maintained for a
single business event or to describe a single real world object.
Databases are collections of Business Records.
Database Archiving is Records Retention.
customer
employee
stock trade
purchase order
deposit
loan payment
Copyright SvalTech, Inc., 2010
24
SvalTech
Data Retention
The requirement to keep data for a business object for a
specified period of time. The object cannot be destroyed until
after the time for all such requirements applicable to it has past.
Business Requirements
Regulatory Requirements
The Data Retention requirement is the longest of all requirement lines.
Copyright SvalTech, Inc., 2010
25
SvalTech
Data Retention
•
Retention requirements vary by business object type
•
Retention requirements from regulations are exceeding business requirements
•
Retention requirements will vary by country
•
Retention requirements imply the obligation to maintain the authenticity of the data
throughout the retention period
•
Retention requirements imply the requirement to faithfully render the data on demand in a
common business form understandable to the requestor
•
The most important business objects tend to have the longest retention periods
•
The data with the longest retention periods tends to be accumulate the largest number of
instances
•
Retention requirements often exceed 10 years.
more years for some applications
Copyright SvalTech, Inc., 2010
Requirements exist for 25, 50, 70 and
26
SvalTech
Where Database Archiving
Is Used
• Overloaded Operational Databases
• Retired Applications
• Application Renovation Projects
Copyright SvalTech, Inc., 2010
27
SvalTech
Overloaded Operational Databases
• Transaction data
• Lots of data
– Hundreds of millions of rows
– High daily transaction rate
•
•
•
•
24/7 operational availability requirement
Long retention period (15 years or more)
Short useful active life (less than 2 years)
Low access requirements during the inactive period
– Very low access frequency
– Response time not critical
– Access requirements are simple, easily satisfied with ad hoc tools
Copyright SvalTech, Inc., 2010
28
SvalTech
Retired Applications
• Merger of companies results in an operational
application being duplicated
• Data Structures are not compatible
– One keeps data elements not in other
– One encodes data elements differently
– One designed for different OS/DBMS than other
• Decision is made to use one system and
abandon the other one
• Meets all characteristics of an operational
application
Copyright SvalTech, Inc., 2010
29
SvalTech
Application Renovation Projects
• Application is undergoing major change
–
–
–
–
–
Replaced with packaged application
Legacy modernization
Legacy termination
Rewritten to be web-centric
Need to satisfy new requirements
• Old data structures are out of date
– Legacy DBMS
– Legacy file system
• Data meets all other requirements for archiving
operational application
Copyright SvalTech, Inc., 2010
30
SvalTech
Architecture of Database Archiving
Operational System
Application program
Archive Extractor
Archive Administrator
Archive Designer
Archive Data Manager
Archive Access Manager
OP DB
Archive extractor
Archive Server
archive
catalog
archive
storage
Copyright SvalTech, Inc., 2010
31
SvalTech
Archive Staff
•
Database Archive Specialist
–
–
–
–
•
Database Archive Administrator
–
–
•
Received education on database archive design and implementation
Knows tools available
Experienced
Full time job
Received education on database archiving administration
Full time job
Supporting Roles
–
–
–
–
–
–
–
–
–
Storage Administrators
Database Administrators
Data Stewards
Security Administrators
Compliance staff
IT management
Business Unit Management
Legal
Records Management
Copyright SvalTech, Inc., 2010
32
SvalTech
Archive Designer Component
•
Metadata
–
–
–
–
•
Data
–
–
–
–
–
•
Capture current metadata
Validate it
Enhance it
Design archive storage format
Define business records to be archived
Define source of data
Define data structures within operational system
Define reference data needed to include with it
Define archive format of data
Policies
–
–
–
–
Define extract policy (when a record becomes inactive)
Define operational disposal policy (when to remove from operational database)
Define storage policy (how to protect data in archive)
Define discard policy (when to remove from archive)
Copyright SvalTech, Inc., 2010
33
SvalTech
Archive Extractor Components
•
Extractor process
–
–
–
–
–
–
–
–
–
•
Verify consistency with design metadata
Extract data as defined in designer
Mark or delete from operational database as defined in designer
Pass data to archive data manager
Keep audit records on everything done
Do not impact operational performance
Support interruptions with transaction level recovery
Support restart
Finish scans within acceptable time periods
Scheduling
– Establish periodic executions
– Find non-disruptive periods
– Be consistent
Copyright SvalTech, Inc., 2010
34
Archive Extractors
SvalTech
Physical vs. Application Extractors
Archive Extractor
Operational System
Application program
OP DB
Archive extractor
Physical Extractor
Application Extractor
Gets/deletes data directly from the database
tables, rows, columns
Gets/deletes data from an application API
virtual tables, rows, columns
application program
Copyright SvalTech, Inc., 2010
35
SvalTech
Archive Data Manager Component
•
Put data away
–
–
–
–
–
•
Execute Storage policies
–
–
–
–
–
•
Return to accessing programs
Fetch data on request
–
–
•
Encryption/ signatures
Backup copies created and stored
Geographic dispersion of backups
Register archive files with archive catalog
Enter audit trail information
Fetch metadata on request
–
•
Receive data from extractors
Format into archive segment files
Determine metadata version affinity
Format and store metadata files if new
Build or update segment indexes both internal and external
Scan archive segments
Search through indexes
Execute Archive Discard Process
–
–
–
Periodic scheduling
Delete qualifying business records
Update archive catalog
Copyright SvalTech, Inc., 2010
36
SvalTech
Archive Access Component
•
Query Capability
–
–
–
–
•
Determine applicability based on archive segment versions of metadata
SQL based is best, if possible
Employ external indexes to determine which archive segments to look into
Employ internal indexes to avoid reading all of an archive segment
Support standard access tools
–
–
–
Report generation (such as Crystal Reports)
Generic query tools
JDBC interface
•
Support metadata version browsing
•
Support generation of load files based on query results
•
Support generation of load files based on original data source based on query results
Copyright SvalTech, Inc., 2010
37
SvalTech
Archive Administration Component
•
Manage Archive Catalog
–
–
–
•
Manage Archive Storage Systems
–
–
•
Application archive designs
Audit trails
Results logs
Ensure periodic readability checks
Maintain access audit trails
Manage Archive Access
–
–
–
Authorizations for users
Authorizations for specific events
• Unloads
Ensure audit records are created for all access
•
Manage e-Discovery requests
•
Ensure Extract and Discard processes are run when they are supposed to
•
Manage Metadata Change Process
Copyright SvalTech, Inc., 2010
38
SvalTech
Solution Comparisons
Home-Grown vs. Vendor
Home-Grown Solutions:
Use Parallel DB
Use Database Partitions
Put in UNLOAD files
Save Image Copies of DB
Copyright SvalTech, Inc., 2010
Vendor Solutions:
More Complete Solutions
Support Long Term Administration
Put data in XML files
Put data in reformatted files
Exploit strengths of storage subsystems
Direct access support
39
SvalTech
Home-Grown Solutions
• Solve Operational Problems, BUT:
–
–
–
–
Create downstream problems
Fail to achieve cost savings
Render archive data inaccessible
• Either completely or,
• Expensive in time and cost to query
Lose data authenticity
• Common Omissions
–
–
–
–
–
Copyright SvalTech, Inc., 2010
No handling or improvement of metadata
No change process for structure changes
No long term storage management
Fail to achieve application/system independence
No administration platform
40
SvalTech
Vendor Solutions
• Not a Lot of Vendors
–
–
–
Only 7 I know of
3 large companies
• Through acquisition
Gartner pre-recession characterization
• Is a new technology
• $100M in 2008
• 40% per year growth rate
• Early adopter stage
• Solutions not complete
–
–
Copyright SvalTech, Inc., 2010
Need growth in function and maturity
Common weak spots
• Design modeling
• Extractor technology
• Not pervasive across data sources
• Storage structure
• Storage management
41
SvalTech
Business Case Basics
Drivers
Longer Data Retention requirements
Expanded Business
overloaded
operational
databases
Operational problems
Mergers and Acquisitions
Cost of Keeping Old Systems
Difficulty in Making Application Changes
Data Governance
e-Records Retention
e-Discovery Readiness concerns
Copyright SvalTech, Inc., 2010
42
SvalTech
Reason for Archiving
All data in
operational db
Inactive data in
archive db
most expensive system
most expensive storage
most expensive software
least expensive system
least expensive storage
least expensive software
In a typical op db
60-80% of data
is inactive
Size Today
This percentage
is growing
Operational
Copyright SvalTech, Inc., 2010
operational
archive
43
SvalTech
Cost Saving Elements
Look for and compute difference in storage costs
front-line vs archive storage
byte counts differences between operational and archive
Look for and compute difference in system costs
operational vs archive systems
are operational system upgrades avoided
are software upgrades avoided
can systems be eliminated for application
can software be eliminated for application
Look for savings on people costs
can people be eliminated or redirected for retired applications
Potential savings on changes/ application renovations
simplification of design
elimination of data conversions
Copyright SvalTech, Inc., 2010
44
SvalTech
Operational Efficiency Impacts
Will operational performance be enhanced with less data
Will utility time periods be reduced (backup, reorganization)
fewer occurrences needed
less data to process each time
Will recovery times be reduced and what is that worth
interruption recoveries
disaster recoveries
Will implementation of data structure changes be improved
avoided
reduced amount of data to unload/modify/reload
Copyright SvalTech, Inc., 2010
45
SvalTech
Risk Factors
Will the saved data have better authenticity
not changed in archive
shielded from updates or damage
traceable back to original form
Will e-Discovery benefit from archiving
can locate and process data outside of operational environment
can easily create legal-hold archive units
Will exposure of data reduced
fewer authorized users against the archive
complete audit trails of all access
Copyright SvalTech, Inc., 2010
46
SvalTech
Business Case Summary
• Database Archiving solutions generally provide for lower cost software,
can use lower cost storage more efficiently, and run on smaller machines.
• Each business case is different
Many factors can be used in building business case
Seen an application justified on storage costs alone
Seen an application justified on disaster recovery time alone
Seen an application justified on better data security alone
• Each organization will have many potential applications
• Having a database archiving practice can create synergies across many
applications thus adding more value
Copyright SvalTech, Inc., 2010
47
SvalTech
Why is it Neglected?
•
Belief that work arounds are good enough
•
Technology is too new
•
Technology is incomplete or flawed
•
Inability to produce complete business case to justify
•
Lack of experience in using technology
•
Lack of education on value, cost to implement
– Management education
– Technical Staff education
Copyright SvalTech, Inc., 2010
48
SvalTech
Final Thoughts
Are these two technologies related?
Enterprises who consider data a critical and valuable asset and who have a goal
of collecting, protecting, preserving and deploying data to the best advantage
of the enterprise WILL have robust data management and data governance practices.
These will include professional management of:
- data and metadata architectural control
- active and ongoing data quality improvement program
- active and ongoing data archiving program (records management)
- data security control
- access authorization and auditing
- data encryption
- data protection program
- backups
- disaster recovery
- protection from unauthorized access/ changes
- data distribution control
Copyright SvalTech, Inc., 2010
49