Reference Data Management
Download
Report
Transcript Reference Data Management
Mehmet Orun, MBA, CDMP
Reference Data Management is critical to success of
Master Data Management, Data Warehousing, or
SOA integration.
Traditional RDM efforts by data architects are limited
to Data Domain values in data models and mapping
tables for integration (ETL, ESB…) developers
This presentation provides an enterprise-wide view of
what reference data is, why it matters, and how it can
be managed, to increase the practical knowledge of
the data professional.
Mehmet Orun is the Sr. Manager for
Data Architecture & Analysis as well as
solution owner for Master Data
Management at Genentech, where he
is responsible for data services rollout
and associated processes, technology
roadmaps and methodologies.
Mehmet’s interest and expertise
includes data and meta data driven
solutions to bridge the gap across
applications, processes, and structured
and unstructured data sources. He has
master level CDMP certification in
Data Management and Data &
Information Quality and is a Senior
Member of the ACM.
Mehmet serves on the board of San
Francisco DAMA Chapter
Mehmet Orun is also the primary
contributor to DAMA-DMBOK’s
“Reference & Master Data Mngt”
chapter. Some of the frameworks in
this presentation are based on this
content.
What is Reference Data
A Definition
Some Examples
The complexity
Reference Data in the Enterprise
RDM Function
Conceptual Framework
Reference Data Types
Use/Design Patterns & Implications
RDM Functions (from DMBOK)
Reference data is any kind of data that is used solely to categorize
other data found in a database, or solely for relating data in a
database to information beyond the boundaries of the enterprise.
(Malcolm Chisholm, TDAN, 10/1/2001)
Wikipedia suggests:
Reference data are data describing a physical or virtual object and its
properties, usually described with nouns. A special type of reference
data is master reference data - these are reference data shared over a
number of systems.
The above mixes Reference Data Management, MDM and Data Standards
(noted as many refer to Wikipedia as the authoritative source)
State Codes
[Postal]
State Codes [-USA]:
CA, MO, AL, …
Hair Color:
Brown, Blonde, …
Diseases
Cancer, Ulcer, …
What does a code mean?
CA: Canada or California?
What is an authoritative source?
USPS for State codes?
ISO 3166 for country codes and if so ISO 3166-2
for state codes?
What is an acceptable list of values?
Hair color: Black, Brown, Blond, Red
What about Chestnut?
What about Pink, or Purple?
What about Pink and Purple?
What is the right level of detail “to categorize
other data”?
Cancer
Breast Cancer
174.9, Female breast, unspecified (Ref: ICD9)
What do we mean by a term anyway?
Cancer?
Cancer cell photo from alternative-cancer.net
Crab family taxonomy from NCBI
What are there reference
data elements here?
Trip Type
Airport Code & Names
Traveler Type
Class of Service
Fare Type
Flags?
Other?
If you don’t know what something means, how do you
get meaning out of it?
If data is in different levels of detail from different
sources, how do you maintain richness without
sacrificing accuracy?
How do you relate one set of categories to another,
without standards?
The rest of this presentation will describe different types
of Reference Data and their use of Reference Data in
the enterprise
Enterprise Context and Reference Data Types
Reference Data Management is control over defined domain
values (also known as vocabularies), including control over
standardized terms, code values, and other unique identifiers,
business definitions for each value, business relationships within
and across domain value lists, and the consistent, shared use of
accurate, timely, and relevant reference data values to classify
and categorize data.
(DAMA DMBOK, Chapter 8)
If you do not [somehow] manage your reference data, broader
data integration and master data reconciliation will be
challenging at best
ANSI/NISO Z39.19-2005 is the authoritative standard and
guideline for reference data/vocabulary management
Information
Integration
• Code mapping
rules
• Single source for
ESB, ETL, EII
Data Governance
& Quality
• Manage terms,
names, definitions
• Code lists for
quality rules
Master Data
Management
• Use capability for
Information
Integration, Data
Governance, and
Quality
Business
Intelligence
Knowledge
Management
• Hierarchies for
drill down or data
mining analytics
Reference Data Management Solution
(Vocabulary, Vocabulary View, Term-Relationship Management,
Publication and Translation Services)
• Import or link
attributes and
definitions as part
of vocabulary
meta data
Data Models
• Code
lists/vocabularies
and term
relationships
Code List and
Vocabularies
• Taxonomy or
Ontology for
Search and
Navigation
• Content Tagging
What are there reference data
elements here?
San Francisco the city vs. a
pointer to a metropolitan area
SFO or San Francisco as airport
indicators
Hotel names with many
synonyms
…
Simple
to
Complex
Increasing complexity
List
Flat List
Taxonomy
Hierarchical
•Parent / Child
•Relationship between
parent and a child can be
relatively underspecified or
ill defined.
Thesaurus
Ontology
Equivalence - Synonym
Homographic-Spelled the same
Classes (general things) many
different meaning – “Tank”
Hierarchical-Broader than/Narrower
Than
Associated-Associated
A thesaurus is typically used to
associate the rough meaning of a
term to the rough meaning of another
term.
domains of interest
Instances (particular things)
Relationships among those things
Properties (values) of those things
Functions of and processes
involving those things
Constraints on and rules involving
those things.
Simple Value Lists
Defined Value Lists
Taxonomic Reference Data
Ontological Reference Data
Relationship Types
Computer
Manufacturers
H
International
Business
Machines
E
?
Hardware
E
IBM
H
?
Software
Group
A
Big Blue
?
A
Software
E
Equivalence
H
Hierarchical
A
Associative
Equivalent Term Relationship:
A relationship between or among terms in a controlled vocabulary that
leads to one or more terms that are to be used instead of the term from
which the cross-reference is made. (Z39.19) The relationship indicator
for this type is ET.
Hierarchical Relationship:
A relationship between or among terms in a controlled vocabulary that
depicts broader (generic) to narrower (specific) or whole-part
relationships; begins with the words broader term (BT), or narrower term
(NT). (Z39.19) The relationship indicator for this type is HT.
Related Term Relationship:
A term that is associatively but not hierarchically linked to another term
in a controlled vocabulary. In thesauri, the relationship indicator for this
type of term is RT. (Z39.19)
Information
Integration
• Code mapping
rules
• Single source for
ESB, ETL, EII
Data Governance
& Quality
• Manage terms,
names, definitions
• Code lists for
quality rules
Master Data
Management
• Use capability for
Information
Integration, Data
Governance, and
Quality
Business
Intelligence
Knowledge
Management
• Hierarchies for
drill down or data
mining analytics
Reference Data Management Solution
(Vocabulary, Vocabulary View, Term-Relationship Management,
Publication and Translation Services)
• Import or link
attributes and
definitions as part
of vocabulary
meta data
Data Models
• Code
lists/vocabularies
and term
relationships
Code List and
Vocabularies
• Taxonomy or
Ontology for
Search and
Navigation
• Content Tagging
Implementation Patterns & Data Management Functions
If selecting a single value stores it in a database
row/cell, what happens when you select multiple
values?
How do you report on any one value?
a.
b.
c.
d.
Select … where <field> = <value>
Select … where <field> IN {value}
Select … where <field> like ‘%value%’
Other?
Who is an AV buf in the audience?
Who knows their doctor well?
Reference Data Management is control over defined domain
values (also known as vocabularies)
Vocabulary Management
function of defining, sourcing, importing, and maintaining, … any
given vocabulary
Key questions:
▪ What information concepts will this vocabulary support? (which data
attributes)
▪ Who is the audience?
▪ Why is the vocabulary needed? (app, BI , content management?)
▪ Who are the decision makers?
▪ What are the current vocabularies?
▪ Are there existing standards?
Term: One or more words designating a concept
Term Management
Specifying how Terms are initially defined and classified and
how this information is maintained once it starts
Most significant effort is to identify terms that will
constitute the standard
Preferred Term: One of two or more synonyms selected as a
term for inclusion in a controlled vocabulary
Term Relationship Management
How the terms relate
Hierarchical Relationship depicts broader (generic) to
narrower (specific) or whole-part relationships
In data integration, if terms are inconsistent, map to broader
relationship for accuracy (data detail loss will occur)
In BI, consider allowing mixed granularity
Related Term Relationship depitcs associatively without
hierarchical relationships
Term Change
Due to deprecation or its metadata changing
If definition is improved, no impact
If value/data type… changes, assess impact
Term Relationship Change
If relationships are changing, integration or aggregation rules may be
impacted
Vocabulary Version Change
Important especially when external standards are used
Minor (addition/deprecation) vs. Major revisions
Reference Data/Vocabularies/Business Semantics are
pervasive in the enterprise.
Broad implementations would likely take time, cost
money, and not clearly understood
Like many data management initiatives, pick an area
with business value that can drive focus, develop a
sustainable solution and perhaps obtain tools to help.
Process and accountability is more important than
technology
Track value, share the success, and expand your scope