Transcript CDR - NCOR
National Cancer Institute
U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES
National Institutes of Health
CDR, an NCI-developed
Vocabulary and Database for
Biobanking
Helen Moore and Ping Guan
Biorepositories and Biospecimen Research
Branch
CTS Ontology Workshop 2015
Sept. 23, 2015
Improving Biospecimen Processes is Essential to Enable Better
Research, Clinical Trials, and Molecular Medicine
[Figure labels: Patient Care; Clinical Trials; Research; Biospecimen Collection; Processing in Pathology Lab (Blood, Tissues, Urine, etc.); Storage; Analysis; Clinical Data Collection]
Adapted from Peggy Devine
Biorepositories and Biospecimen
Research Branch (BBRB)
• Mission
– BBRB provides leadership, tools, resources, and policies
in biobanking for the global biomedical research
community, to enable translational research and
precision medicine for patients.
http://biospecimens.cancer.gov/
Current Initiatives
• NCI Best Practices for Biospecimen Resources
• Biospecimen Preanalytical Variables Program (BPV)
• Genotype-Tissue Expression Program (GTEx)
• ELSI research in biobanking
• Biospecimen Research Database (BRD) – online
literature and SOPs database
• Biospecimen Evidence-Based Practices (BEBPs)
• Biobank economics research and online tools
• Patient brochures
• CDR collaboration
Comprehensive Data Resource (CDR)
• Supports two ongoing biospecimen programs:
– The Genotype-Tissue Expression Program (GTEx) – an NIH Common
Fund study of genomic variation and tissue-specific expression,
analyzing up to 30 tissues per donor in 900 deceased donors.
– The Biospecimen Preanalytical Variables program (BPV) – a study of
preanalytical variation in tissue processing and storage (FFPE and
frozen tissues) and the effects of such variation on downstream
molecular analysis.
Program Needs Driving CDR
Development
• NIH Common Fund Project: GTEx
– Normal/Non-diseased
– Postmortem
– 30+ tissue types per donor
– 960 donors
Science: A representation of how variation in the human genome affects gene expression among
individuals and tissues. Colors and shapes show variations between people and within individuals.
The Genotype-Tissue Expression (GTEx) Consortium examined postmortem tissue to document how
genetic variants confer differences in gene expression across the human body.
See pages 618, 640, 648, 660, and 666.
Program Needs Driving CDR
Development
• Biospecimen science project: BPV
– Cancer patients
– Surgical tissues
– Studies of predefined preanalytical factors
– Multi-site experimental design and SOPs
[Figure: BPV study design]
Predefined preanalytical factors:
- Post-operative ischemia
- Room temperature
- Type of preservative
- Rate of freezing/fixing
Recorded/annotated:
- Anesthesia
- Intra-operative ischemia
- Many other variables
H&E, IHC, FISH, RNA isolation
Storage
Tissue processing
- Multiple formulaic variables
- Multiple time settings for each
Biospecimen Collection, Processing,
Storage Data
• Blood and tissue collection and processing data
– Blood tube type, time stamps for processing
– For resected tissues: surgical clamp times, time placed in fixative or
frozen, time placed in tissue processor, etc.
– Storage conditions
• Pathology QC
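The time stamps listed above are typically used to derive elapsed intervals such as ischemic time. A minimal sketch of that derivation follows. The field names, the time stamp format, and the example values are assumptions for illustration; the slides do not show the CDR's actual schema.

```python
from datetime import datetime

# Assumed time stamp format; the CDR's real format is not given in the slides.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> float:
    """Return elapsed minutes between two time stamps, rejecting reversed order."""
    t0 = datetime.strptime(start, TIMESTAMP_FORMAT)
    t1 = datetime.strptime(end, TIMESTAMP_FORMAT)
    if t1 < t0:
        raise ValueError(f"end {end!r} precedes start {start!r}")
    return (t1 - t0).total_seconds() / 60.0


def derive_intervals(record: dict) -> dict:
    """Derive processing intervals from a tissue record (hypothetical field names)."""
    return {
        "clamp_to_fixative_min": minutes_between(
            record["surgical_clamp_time"], record["placed_in_fixative_time"]
        ),
        "fixative_to_processor_min": minutes_between(
            record["placed_in_fixative_time"], record["placed_in_processor_time"]
        ),
    }


# Illustrative record, not real study data.
example = {
    "surgical_clamp_time": "2015-09-23 09:10",
    "placed_in_fixative_time": "2015-09-23 09:55",
    "placed_in_processor_time": "2015-09-23 21:55",
}
intervals = derive_intervals(example)
print(intervals)
```

Rejecting a reversed pair of time stamps at entry time is one way to catch data-entry errors before they reach downstream analysis.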
Data to be Collected and Managed
• Donor parameters
– SOP
– Consent verification
– Maintaining privacy
• Biospecimen collection, processing, storage parameters
– SOPs
– Electronic CRFs with time stamps etc.
• Physical transfers of biospecimens
– MTAs
– Chain of custody information
• Clinical data about the donor
• Pathology data
– Diagnosis (donor level data from path report)
– Diagnosis and additional observations (individual biospecimen
level data)
• IDs of donors, cases, and derived biospecimens
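One way to picture the relationship among donor, case, and derived biospecimen IDs is a registry that issues each child ID from its parent, so lineage can always be traced back. The sketch below is an illustration only: the ID format (D0001, C01, S01) and the class name are hypothetical and are not the CDR's actual scheme.

```python
class IdRegistry:
    """Issues hierarchical IDs and remembers each ID's parent (hypothetical scheme)."""

    def __init__(self):
        self.parent_of = {}  # issued ID -> parent ID (None for donors)
        self._issued = {}    # parent ID -> number of children issued so far

    def _issue(self, parent, prefix):
        if parent is not None and parent not in self.parent_of:
            raise KeyError(f"unknown parent ID: {parent}")
        n = self._issued.get(parent, 0) + 1
        self._issued[parent] = n
        new_id = f"{prefix}{n:04d}" if parent is None else f"{parent}-{prefix}{n:02d}"
        self.parent_of[new_id] = parent
        return new_id

    def new_donor(self):
        return self._issue(None, "D")

    def new_case(self, donor_id):
        return self._issue(donor_id, "C")

    def new_specimen(self, parent_id):
        # The parent may be a case or another specimen (a derived aliquot).
        return self._issue(parent_id, "S")

    def lineage(self, any_id):
        """Return the chain of IDs from the donor down to any_id."""
        chain = [any_id]
        while self.parent_of[chain[-1]] is not None:
            chain.append(self.parent_of[chain[-1]])
        return list(reversed(chain))


registry = IdRegistry()
donor = registry.new_donor()
case = registry.new_case(donor)
tissue = registry.new_specimen(case)
aliquot = registry.new_specimen(tissue)
print(registry.lineage(aliquot))
```

Embedding the parent in the child ID is a design choice made here for readability; a production system could equally use opaque IDs and keep lineage only in the database.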
What Types of Information Do We Want
to Capture?
[Figure: biospecimen life cycle]
Pre-acquisition: Patient; Medical/Surgical Procedures; Acquisition
– Specimen is viable and biologically reactive
Time 0
Post-acquisition: Handling/Processing; Storage; Distribution; Scientific Analysis
– Molecular composition subject to further alteration/degradation
Knowledge Base
Program requirements for CDR
Functions
• Development of CDEs to thoroughly annotate the
biospecimen life cycle to support the goals of the project
• Development of workflow-based annotation with live data
entry at the Biospecimen Source Site (BSS) when possible
• Provision of IDs for the project
• Record data at the BSS and transmit to project homepage
(annotation, gross pathology images)
• Monitor shipping between different program sites
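The shipping-monitoring requirement can be illustrated with a small state model that logs each custody hand-off and flags shipments that stay in transit too long. This is a sketch under stated assumptions: the status names, the two-day transit limit, and the site labels are hypothetical and do not come from the CDR.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical shipment states and allowed transitions.
ALLOWED_TRANSITIONS = {
    "packed": {"shipped"},
    "shipped": {"received"},
    "received": set(),
}


@dataclass
class Shipment:
    shipment_id: str
    origin: str
    destination: str
    status: str = "packed"
    custody_log: list = field(default_factory=list)  # (status, time, handler)

    def advance(self, new_status: str, when: datetime, handler: str) -> None:
        """Move to the next status and record who handled the package and when."""
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.status = new_status
        self.custody_log.append((new_status, when, handler))

    def is_overdue(self, now: datetime,
                   max_transit: timedelta = timedelta(days=2)) -> bool:
        """A shipment is overdue if it has been in transit longer than max_transit."""
        if self.status != "shipped":
            return False
        shipped_at = self.custody_log[-1][1]
        return now - shipped_at > max_transit


# Illustrative shipment between two assumed program sites.
shipment = Shipment("SHIP-0001", origin="BSS-A", destination="CBR")
shipment.advance("shipped", datetime(2015, 9, 21, 16, 0), handler="courier")
print(shipment.is_overdue(datetime(2015, 9, 24, 9, 0)))
```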
GTEx Data flow in CDR
CDR Overview
• CDR is a distributed bioinformatics tissue collection and
information management platform
– Built upon open source technologies and frameworks
– Supports the needs of complex projects that require collecting a
large number of high quality, well-annotated human
biospecimens.
• CDR is a custom software solution
– Incorporates specific biospecimen procurement SOPs
– Provides data security for HIPAA-compliant limited dataset
– Provides real-time distributed data services between multiple
centers nationwide.
• CDR is designed to follow the NCI Best Practices for
biospecimen collection and annotation.
CDR Technology
• The CDR is built on the Grails Framework
– Leverages Groovy scripting for rapid development and a gentle
learning curve
– Leverages Spring for security
– Leverages Hibernate for object-relational mapping, keeping the
application database agnostic
– Enables rapid and flexible web service API development, enabling the
CDR to interconnect with multiple institutions and systems
Data Flow in CDR
[Figure: flow of kits, specimens, and data among program components]
Components: Biospecimen Source Sites; Participating Hospitals; Comprehensive Biospecimen Resource; Comprehensive Data Resource (CDR); Pathology Resource Center; Laboratory, Data Analysis & Coordinating Center; Data Management and Quality Management
Flows: Kits; Specimens; Data
Key: one arrow style represents kits and biospecimens, the other represents data
CDR – Can This be Useful to the
Biobanking Community?
• Unmet needs for management software in biobanking
community
• Facilitate use of Best Practices and annotation of
biospecimen collection and processing steps
• CDR is being adopted for other NCI programs,
including the Clinical Proteomic Tumor Analysis
Consortium (CPTAC) program.
• The CDR code was posted last year.
• Collaborative Announcement:
https://ttc.nci.nih.gov/opportunities/opportunity.php?opp_id=748093754466223
CDR – Collaborative Project
• Voluntary collaboration:
– No funding to individual collaborator(s)
– Collaborator(s) have their own IT capacity to further develop
and customize the software
• Informational sessions
– Two Webinars in July: Program introduction and live demo
– 100+ participants each time
• Collaborators who are interested in the collaboration
will provide:
– Intended area of research to support after adopting CDR
– IT experience and expertise in the proposed adoption
– Relevance to biobanking operation
– Contributions in standardizing and streamlining biobanking
practices.
Standardized Terminology & Definition
• Different data types & sources in CDR
– Data entered into CDR comes from a variety of
sources and organizations:
• Different Biospecimen Source Sites for data entry
• Various data types:
– Operational data
– Clinical Data
• caHUB Enterprise Vocabulary Service
– caHUB EVS
• Provide standardized terminology and definitions
that serve as a consistent basis for data integration
and data sharing across the program.
EVS Architecture
[Figure labels: Project requirements; Data call from users; External Sources; Indexed full-text search]
EVS Tools
• Protégé
– Protégé is a free, open source ontology editor and
knowledge-base framework developed at Stanford.
– The caHUBt (caHUB Thesaurus) is documented and
managed in NCI Protégé.
• SOLR
– Solr is an open source enterprise search platform from the
Apache Lucene project.
– SOLR provides distributed search and index replication.
– SOLR supports full-text search, hit highlighting, faceted
search, dynamic clustering, database integration, and rich
document (e.g., Word, PDF) handling.
– caHUBt is fully indexed through SOLR.
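As an illustration of the search features listed above, the sketch below builds a request URL for Solr's standard select handler with highlighting and optional faceting turned on. The host, the core name "cahubt", and the facet field "source" are assumptions; the slides do not describe how the caHUBt index is actually laid out. The query parameters themselves (q, rows, wt, hl, facet, facet.field) are standard Solr parameters.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, core, text, facet_field=None, rows=10):
    """Build a Solr select URL for a full-text query with hit highlighting."""
    params = [
        ("q", text),          # full-text query string
        ("rows", str(rows)),  # maximum number of hits to return
        ("wt", "json"),       # response format
        ("hl", "true"),       # turn on hit highlighting
    ]
    if facet_field:
        # Faceted search: count hits per distinct value of facet_field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and facet field.
url = build_solr_query(
    "http://localhost:8983", "cahubt", "myocardial infarction", facet_field="source"
)
print(url)
```

The function only builds the URL; sending it with an HTTP client and parsing the JSON response is left out so the sketch runs without a Solr server.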
EVS Components
• GTEx, BMS and BPV form elements defined in
Protégé as Common Data Elements (CDEs)
– Form Elements and valid values definitions stored in Protégé
– Exported to SOLR for display on CDR
• Valid Values defined in SOLR for display in CDR:
– Cause of Death: source ICD10-CM, synonyms from UMLS
– Medications: source FDA NDC List
– Primary Cancer Type: source PDQ Disease List
– Medical Procedures: source AMA CPT List, synonyms from UMLS
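The pairing of a valid value list with synonyms can be sketched as a lookup that maps whatever a user types to the preferred term. The terms below are illustrative examples only, not entries taken from the caHUBt, the ICD10-CM, or the UMLS, and the function names are hypothetical.

```python
def build_lookup(valid_values: dict) -> dict:
    """Map each preferred term and each synonym (case-insensitive) to the preferred term."""
    lookup = {}
    for preferred, synonyms in valid_values.items():
        for term in (preferred, *synonyms):
            key = term.strip().lower()
            if key in lookup and lookup[key] != preferred:
                raise ValueError(f"ambiguous term: {term!r}")
            lookup[key] = preferred
    return lookup


def resolve(entry: str, lookup: dict):
    """Return the preferred valid value for a user entry, or None if not recognized."""
    return lookup.get(entry.strip().lower())


# Illustrative valid values with synonyms (not real vocabulary content).
cause_of_death_values = {
    "Acute myocardial infarction": ["heart attack", "MI"],
    "Cerebrovascular accident": ["stroke", "CVA"],
}
lookup = build_lookup(cause_of_death_values)
print(resolve("Heart Attack", lookup))
print(resolve("unknown term", lookup))
```

Raising an error on an ambiguous synonym at build time keeps one entry from silently mapping to two different preferred terms.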
Common Data Elements (CDEs)
• The CDE repository holds the locally defined data
elements used on the forms in the CDR.
• These definitions were provided by the study in
which the data element is used, or were
developed by the Vocabulary team using
standard sources such as the NCIt and the UMLS.
• Examples: BMI, gender.
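A locally defined data element can be thought of as a name, a definition, and a rule for what counts as a valid entry. The sketch below models the two examples named above. The class, the permitted gender values, and the BMI bounds are assumptions made for illustration and are not the CDR's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommonDataElement:
    name: str
    definition: str
    datatype: tuple  # Python types accepted for this element
    permitted_values: Optional[frozenset] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def validate(self, value) -> list:
        """Return a list of problems; an empty list means the value is acceptable."""
        if isinstance(value, bool) or not isinstance(value, self.datatype):
            return [f"{self.name}: unexpected type {type(value).__name__}"]
        problems = []
        if self.permitted_values is not None and value not in self.permitted_values:
            problems.append(f"{self.name}: {value!r} is not a permitted value")
        if self.minimum is not None and value < self.minimum:
            problems.append(f"{self.name}: {value} is below {self.minimum}")
        if self.maximum is not None and value > self.maximum:
            problems.append(f"{self.name}: {value} is above {self.maximum}")
        return problems


# Hypothetical definitions; the bounds and permitted values are assumptions.
BMI = CommonDataElement(
    name="BMI",
    definition="Body mass index: weight in kilograms divided by height in meters squared.",
    datatype=(int, float),
    minimum=5.0,
    maximum=100.0,
)
GENDER = CommonDataElement(
    name="Gender",
    definition="Gender of the donor as recorded on the form.",
    datatype=(str,),
    permitted_values=frozenset({"Female", "Male", "Unknown"}),
)

print(BMI.validate(24.5))
print(GENDER.validate("F"))
```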
Causes of Death (COD)
• “Cause of Death” concepts
– Immediate
– First Underlying
– Last Underlying
– Death Certificate
– CDR form element “Death Circumstances” for GTEx study
• Valid Value Set Sources:
– the caHUBt in NCI Protégé,
– the ICD10-CM,
– the NLM Unified Medical Language System (UMLS)
• The ICD10-CM provides the foundation of the valid value list that
is bound to each of the four “cause of death” concepts.
Medications (RX)
• “Current Medications” concept
– CDR Form element for the GTEx study.
• Valid Value Set Source:
– FDA NDC Directory, which is published by FDA on a
weekly basis.
Primary Cancer Type (PCT)
• “Primary Cancer Type” concept:
– CDR Form element for GTEx study.
• Valid Value Set Source:
– the PDQ Disease List
Medical Procedures (CPT)
• “Underlying Conditions” concept:
– CDR Form element for the GTEx study.
• Valid Value Set Sources:
– UMLS
– The AMA’s Current Procedural Terminology (CPT)
Sharing the caHUB Vocabulary with the
Research Community
• Publish the data set on publicly accessible sites
– NIH CDE Portal
– NCBO BioPortal
– OBO Foundry