Reference Data Management

Download Report

Transcript Reference Data Management

Mehmet Orun, MBA, CDMP

Reference Data Management is critical to success of
Master Data Management, Data Warehousing, or
SOA integration.

Traditional RDM efforts by data architects are limited
to Data Domain values in data models and mapping
tables for integration (ETL, ESB…) developers

This presentation provides an enterprise-wide view of
what reference data is, why it matters, and how it can
be managed, to increase the practical knowledge of
the data professional.

Mehmet Orun is the Sr. Manager for
Data Architecture & Analysis as well as
solution owner for Master Data
Management at Genentech, where he
is responsible for data services rollout
and associated processes, technology
roadmaps and methodologies.

Mehmet’s interest and expertise
includes data and meta data driven
solutions to bridge the gap across
applications, processes, and structured
and unstructured data sources. He has
master level CDMP certification in
Data Management and Data &
Information Quality and is a Senior
Member of the ACM.

Mehmet serves on the board of San
Francisco DAMA Chapter

Mehmet Orun is also the primary
contributor to DAMA-DMBOK’s
“Reference & Master Data Mngt”
chapter. Some of the frameworks in
this presentation are based on this
content.

What is Reference Data
 A Definition
 Some Examples
 The complexity

Reference Data in the Enterprise





RDM Function
Conceptual Framework
Reference Data Types
Use/Design Patterns & Implications
RDM Functions (from DMBOK)

Reference data is any kind of data that is used solely to categorize
other data found in a database, or solely for relating data in a
database to information beyond the boundaries of the enterprise.
(Malcolm Chisholm, TDAN, 10/1/2001)

Wikipedia suggests:
Reference data are data describing a physical or virtual object and its
properties, usually described with nouns. A special type of reference
data is master reference data - these are reference data shared over a
number of systems.
The above mixes Reference Data Management, MDM and Data Standards
(noted as many refer to Wikipedia as the authoritative source)

State Codes
[Postal]
State Codes [-USA]:
 CA, MO, AL, …

Hair Color:
 Brown, Blonde, …

Diseases
 Cancer, Ulcer, …

What does a code mean?
 CA: Canada or California?

What is an authoritative source?
 USPS for State codes?
 ISO 3166 for country codes and if so ISO 3166-2
for state codes?

What is an acceptable list of values?
 Hair color: Black, Brown, Blond, Red
 What about Chestnut?
 What about Pink, or Purple?
 What about Pink and Purple?

What is the right level of detail “to categorize
other data”?
 Cancer
 Breast Cancer
 174.9, Female breast, unspecified (Ref: ICD9)

What do we mean by a term anyway?

Cancer?
Cancer cell photo from alternative-cancer.net
Crab family taxonomy from NCBI
What are there reference
data elements here?
Trip Type
Airport Code & Names
Traveler Type
Class of Service
Fare Type
Flags?
Other?

If you don’t know what something means, how do you
get meaning out of it?

If data is in different levels of detail from different
sources, how do you maintain richness without
sacrificing accuracy?

How do you relate one set of categories to another,
without standards?
The rest of this presentation will describe different types
of Reference Data and their use of Reference Data in
the enterprise
Enterprise Context and Reference Data Types

Reference Data Management is control over defined domain
values (also known as vocabularies), including control over
standardized terms, code values, and other unique identifiers,
business definitions for each value, business relationships within
and across domain value lists, and the consistent, shared use of
accurate, timely, and relevant reference data values to classify
and categorize data.
(DAMA DMBOK, Chapter 8)

If you do not [somehow] manage your reference data, broader
data integration and master data reconciliation will be
challenging at best

ANSI/NISO Z39.19-2005 is the authoritative standard and
guideline for reference data/vocabulary management
Information
Integration
• Code mapping
rules
• Single source for
ESB, ETL, EII
Data Governance
& Quality
• Manage terms,
names, definitions
• Code lists for
quality rules
Master Data
Management
• Use capability for
Information
Integration, Data
Governance, and
Quality
Business
Intelligence
Knowledge
Management
• Hierarchies for
drill down or data
mining analytics
Reference Data Management Solution
(Vocabulary, Vocabulary View, Term-Relationship Management,
Publication and Translation Services)
• Import or link
attributes and
definitions as part
of vocabulary
meta data
Data Models
• Code
lists/vocabularies
and term
relationships
Code List and
Vocabularies
• Taxonomy or
Ontology for
Search and
Navigation
• Content Tagging

What are there reference data
elements here?

San Francisco the city vs. a
pointer to a metropolitan area

SFO or San Francisco as airport
indicators

Hotel names with many
synonyms

…
Simple
to
Complex
Increasing complexity
List
Flat List
Taxonomy
Hierarchical
•Parent / Child
•Relationship between
parent and a child can be
relatively underspecified or
ill defined.
Thesaurus
Ontology
Equivalence - Synonym
Homographic-Spelled the same
Classes (general things) many
different meaning – “Tank”
Hierarchical-Broader than/Narrower
Than
Associated-Associated
A thesaurus is typically used to
associate the rough meaning of a
term to the rough meaning of another
term.
domains of interest
Instances (particular things)
Relationships among those things
Properties (values) of those things
Functions of and processes
involving those things
Constraints on and rules involving
those things.

Simple Value Lists

Defined Value Lists

Taxonomic Reference Data

Ontological Reference Data
Relationship Types
Computer
Manufacturers
H
International
Business
Machines
E
?
Hardware
E
IBM
H
?
Software
Group
A
Big Blue
?
A
Software
E
Equivalence
H
Hierarchical
A
Associative

Equivalent Term Relationship:
A relationship between or among terms in a controlled vocabulary that
leads to one or more terms that are to be used instead of the term from
which the cross-reference is made. (Z39.19) The relationship indicator
for this type is ET.

Hierarchical Relationship:
A relationship between or among terms in a controlled vocabulary that
depicts broader (generic) to narrower (specific) or whole-part
relationships; begins with the words broader term (BT), or narrower term
(NT). (Z39.19) The relationship indicator for this type is HT.

Related Term Relationship:
A term that is associatively but not hierarchically linked to another term
in a controlled vocabulary. In thesauri, the relationship indicator for this
type of term is RT. (Z39.19)
Information
Integration
• Code mapping
rules
• Single source for
ESB, ETL, EII
Data Governance
& Quality
• Manage terms,
names, definitions
• Code lists for
quality rules
Master Data
Management
• Use capability for
Information
Integration, Data
Governance, and
Quality
Business
Intelligence
Knowledge
Management
• Hierarchies for
drill down or data
mining analytics
Reference Data Management Solution
(Vocabulary, Vocabulary View, Term-Relationship Management,
Publication and Translation Services)
• Import or link
attributes and
definitions as part
of vocabulary
meta data
Data Models
• Code
lists/vocabularies
and term
relationships
Code List and
Vocabularies
• Taxonomy or
Ontology for
Search and
Navigation
• Content Tagging
Implementation Patterns & Data Management Functions

If selecting a single value stores it in a database
row/cell, what happens when you select multiple
values?

How do you report on any one value?
a.
b.
c.
d.
Select … where <field> = <value>
Select … where <field> IN {value}
Select … where <field> like ‘%value%’
Other?
Who is an AV buf in the audience?
Who knows their doctor well?
Reference Data Management is control over defined domain
values (also known as vocabularies)
Vocabulary Management
 function of defining, sourcing, importing, and maintaining, … any
given vocabulary
 Key questions:
▪ What information concepts will this vocabulary support? (which data
attributes)
▪ Who is the audience?
▪ Why is the vocabulary needed? (app, BI , content management?)
▪ Who are the decision makers?
▪ What are the current vocabularies?
▪ Are there existing standards?
Term: One or more words designating a concept
Term Management
 Specifying how Terms are initially defined and classified and
how this information is maintained once it starts
 Most significant effort is to identify terms that will
constitute the standard

Preferred Term: One of two or more synonyms selected as a
term for inclusion in a controlled vocabulary
Term Relationship Management
 How the terms relate

Hierarchical Relationship depicts broader (generic) to
narrower (specific) or whole-part relationships
 In data integration, if terms are inconsistent, map to broader
relationship for accuracy (data detail loss will occur)
 In BI, consider allowing mixed granularity

Related Term Relationship depitcs associatively without
hierarchical relationships

Term Change
 Due to deprecation or its metadata changing
 If definition is improved, no impact
 If value/data type… changes, assess impact

Term Relationship Change
 If relationships are changing, integration or aggregation rules may be
impacted

Vocabulary Version Change
 Important especially when external standards are used
 Minor (addition/deprecation) vs. Major revisions

Reference Data/Vocabularies/Business Semantics are
pervasive in the enterprise.

Broad implementations would likely take time, cost
money, and not clearly understood

Like many data management initiatives, pick an area
with business value that can drive focus, develop a
sustainable solution and perhaps obtain tools to help.
Process and accountability is more important than
technology

Track value, share the success, and expand your scope