From Unstructured Textual Data to Structured Networks

Novel Mechanistic Insights in
Cardiovascular Health and Disease via a
Text Mining Approach
NIH BD2K AHM
November 30th, 2016
Peipei Ping
Heart BD2K
Text Mining:
From Unstructured Textual Data to Structured Networks
• PubMed holds more than 2.2 million cardiovascular-related articles published
from 1809 to 2016, and a new publication is estimated to appear every ~2.7
minutes (Lau et al., Circulation 2016).
• These mounting textual data are unstructured, yet interconnected.
• The key to turning big textual data into knowledge is structuring!

Transforming unstructured textual data into structured and interconnected
relationships.

Mining phrases from massive textual data.

Entity recognition and typing.

Pattern extraction for biological relationship discovery.
Aims of the Study
• To apply natural language processing and pattern learning
to databases of published natural-language text on 6 main
groups of cardiovascular disease (CVD).
• To explore the application of phrase mining and network
embedding to textual CVD data in order to extract relevant
information and identify novel patterns or classifications.
• To enable predictive analytics, gain novel mechanistic
insights, and support clinical decision making.
Text Mine Biomedical Corpora
[Diagram. Corpora: scientific articles, medical case reports, EHR. Text mining for pattern & relationship discovery: proteins, genes, metabolites. Applications: patient phenotyping, disease exploration, therapy development. Outcomes: novel mechanistic insights & clinical decision making; new knowledge.]
Methods
Technologies:
• SegPhrase+, a phrase-mining algorithm
• Large-scale Information Network Embedding (LINE)
Input:
• List of top-250 proteins relevant to CVD
• 551,358 publications (1995-2016) retrieved from PubMed using
the MeSH terms and synonyms for each of the following CVD groups:
  • Cerebrovascular Accidents (CVA), Cardiomyopathies (CM),
  Ischemic Heart Diseases (IHD), Arrhythmias (AR),
  Valve Disease (VD) and Congenital Heart Disease (CHD)
Medical Subject Headings (MeSH)
• National Library of Medicine (NLM)’s controlled vocabulary thesaurus
• Used to index articles from 5,400 of the world's leading biomedical journals
for the MEDLINE®/PubMed® database.
• Maintained and updated by MeSH Section staff
• Hierarchical structure: Broad and specific terms
Cardiovascular Diseases
  Heart Diseases
    Cardiac Arrhythmias
      Sick Sinus Syndrome
        Cardiac Sinus Arrest
https://www.nlm.nih.gov/mesh
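A hierarchical vocabulary makes it possible to search on a broad term and also pick up every narrower term beneath it. The sketch below is only an illustration: it treats the five terms on the slide as a single broad-to-specific chain (an assumption about how the slide is meant to be read), and the `MESH_CHILDREN` table and `expand_term` helper are hypothetical names, not part of any NLM tool.

```python
# Toy fragment of a MeSH-like tree, assuming the five slide terms form one chain.
# The real MeSH hierarchy is far larger and a term can sit in several branches.
MESH_CHILDREN = {
    "Cardiovascular Diseases": ["Heart Diseases"],
    "Heart Diseases": ["Cardiac Arrhythmias"],
    "Cardiac Arrhythmias": ["Sick Sinus Syndrome"],
    "Sick Sinus Syndrome": ["Cardiac Sinus Arrest"],
}


def expand_term(term, children=MESH_CHILDREN):
    """Return the term followed by every narrower term beneath it (depth-first)."""
    expanded = [term]
    for child in children.get(term, []):
        expanded.extend(expand_term(child, children))
    return expanded


if __name__ == "__main__":
    # A query on a broad term also covers its more specific descendants.
    print(expand_term("Cardiac Arrhythmias"))
```

Expanding a disease group to its narrower terms in this way is one plausible reading of "MeSH terms and synonyms" for each CVD group; the slides do not describe the exact query construction.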
Finding Scientific Manuscripts Using MeSH Terms
Text mine: CaseOLAP workflow
Corpus: PubMed research articles, 1995-2016
Six main CVD groups with their MeSH terms
Names of 250 proteins highly relevant in CVD, with synonyms
Extract and screen CVD-related articles by MeSH terms
Input extracted articles and list of proteins (with each protein's synonyms unified into a single string)
Calculate text-mining score of each protein-disease pair using CaseOLAP
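The first two workflow steps (screening articles by MeSH term, then counting protein mentions with synonyms unified) can be sketched as follows. This is a simplified stand-in, not the CaseOLAP code: the article format, the `count_protein_mentions` function, and the regular-expression matching are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_protein_mentions(articles, disease_mesh, protein_synonyms):
    """Count protein mentions per disease group.

    articles: list of dicts with 'mesh' (set of MeSH terms) and 'text' (str).
    disease_mesh: dict mapping disease group -> set of MeSH terms.
    protein_synonyms: dict mapping canonical protein name -> list of synonyms.
    Returns a dict mapping disease group -> Counter of canonical protein names.
    """
    # One pattern per protein so that every synonym counts toward one name.
    patterns = {}
    for name, synonyms in protein_synonyms.items():
        # Longest alternatives first, so a long synonym wins over its prefix.
        variants = sorted([name] + list(synonyms), key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in variants)
        patterns[name] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

    counts = {disease: Counter() for disease in disease_mesh}
    for article in articles:
        for disease, terms in disease_mesh.items():
            if article["mesh"] & terms:  # article is indexed under this group
                for name, pattern in patterns.items():
                    counts[disease][name] += len(pattern.findall(article["text"]))
    return counts


if __name__ == "__main__":
    # Hypothetical two-article corpus.
    articles = [
        {"mesh": {"Cardiomyopathies"}, "text": "Titin (TTN) truncations; titin again."},
        {"mesh": {"Stroke"}, "text": "TNF levels rose."},
    ]
    disease_mesh = {"CM": {"Cardiomyopathies"}, "CVA": {"Stroke"}}
    synonyms = {"titin": ["TTN"], "TNF": ["tumor necrosis factor"]}
    print(count_protein_mentions(articles, disease_mesh, synonyms))
```

Per-disease counts of this kind are the raw material that the popularity and distinctiveness components of the score draw on.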
Rank Calculation
Integrity: The phrase is meaningful, understandable, and high-quality.
int(p,c) is calculated by SegPhrase+ in preprocessing.
Distinctiveness: The phrase has a relatively larger count in the extracted articles of one
disease than in the extracted articles of the other five diseases.
Popularity: The phrase has a larger total count in the extracted articles of that disease than
other phrases.
$$\text{final rank} = \sqrt[3]{\text{integrity} \times \text{popularity} \times \text{distinctiveness}}$$
Rank list of proteins in each CVD
Rank Calculation
$$\text{final rank} = \sqrt[3]{\text{integrity} \times \text{popularity} \times \text{distinctiveness}}$$
• Integrity: The name is a meaningful, understandable and high-quality
phrase.
• Popularity: A large total count in the extracted articles of that disease.
• Distinctiveness: A relatively larger count in the extracted articles
of that disease than in the extracted articles of the other five diseases.
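The final rank is the cube root of the product of the three components, that is, their geometric mean, so a protein scores highly only when all three are high. A minimal sketch follows, assuming the three component values have already been computed elsewhere; the function names and the numbers in the demo are hypothetical.

```python
import math


def final_rank(integrity, popularity, distinctiveness):
    """Cube root of integrity * popularity * distinctiveness (geometric mean)."""
    for value in (integrity, popularity, distinctiveness):
        if value < 0:
            raise ValueError("component scores must be non-negative")
    return math.pow(integrity * popularity * distinctiveness, 1.0 / 3.0)


def rank_proteins(components):
    """Sort proteins by final rank, highest first.

    components: dict mapping protein name -> (integrity, popularity, distinctiveness).
    """
    scored = {name: final_rank(*parts) for name, parts in components.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical component values, for illustration only.
    demo = {
        "protein_a": (0.9, 0.8, 0.7),
        "protein_b": (0.9, 0.2, 0.1),
    }
    for name, score in rank_proteins(demo):
        print(name, round(score, 3))
```

Because the components are multiplied, a value of zero in any one of them drives the final rank to zero, which is the practical difference from averaging them arithmetically.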
Example:
Seed pair: <Breast Cancer, brca1>
Query: <Cardiomyopathy, ?>
Protein         | Score
Interferon-γ    | 3.336
Interleukin-4   | 2.809
Interleukin-17a | 2.729
TNF             | 2.549
Titin           | 2.349
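The slides do not state how the seed pair and the query are evaluated. One common way to answer an analogy query of this form with network embeddings (such as those LINE produces) is the vector-offset method: take the offset from the seed disease to the seed protein, apply it to the query disease, and rank candidate proteins by cosine similarity to the result. The sketch below shows that general technique with made-up two-dimensional vectors; it is not the authors' procedure, and it does not reproduce the scores in the table.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_analogy(emb, seed_disease, seed_protein, query_disease, candidates):
    """Rank candidates by similarity to emb[seed_protein] - emb[seed_disease] + emb[query_disease]."""
    target = [
        p - d + q
        for p, d, q in zip(emb[seed_protein], emb[seed_disease], emb[query_disease])
    ]
    scored = [(name, cosine(emb[name], target)) for name in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up 2-D embeddings; real embeddings have many more dimensions.
    emb = {
        "breast cancer": [1.0, 0.0],
        "brca1": [1.0, 1.0],
        "cardiomyopathy": [3.0, 0.0],
        "titin": [3.0, 1.0],
        "tnf": [0.0, 1.0],
    }
    print(answer_analogy(emb, "breast cancer", "brca1", "cardiomyopathy", ["titin", "tnf"]))
```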
Total List of Proteins in CVD According to Their Scores
250 molecules with their scores in CVD: https://dx.doi.org/10.6084/m9.figshare.4055886.v1
[Figure; axis label: Count]
Text-Mining Reveals Novel Biomedical Insights & New Patterns Among Key Proteins and 6 CVDs
[Figure: Top 25 Scoring Proteins in 6 CVDs; axis label: Score]
Biological Functions and Pathways of the Top 50 CaseOLAP Scoring Proteins over 6 CVD Groups
Rank list of 250 proteins in each cardiovascular disease by text-mining score
Input GeneID/UniProt ID of top 50 scoring proteins into the Reactome Analysis Pipeline
Assess results for 21 main biological processes
Obtain P-value and FDR from the overrepresentation test for the 21 largest biological processes
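An overrepresentation test asks whether a pathway contains more of the submitted proteins than chance would predict, and the FDR corrects the resulting P-values for testing many pathways at once. The sketch below assumes a one-sided hypergeometric test with Benjamini-Hochberg correction, a common choice for this kind of analysis; the slides do not specify the exact procedure, and the function names and all numbers in the demo (universe size, pathway size, hit count) are hypothetical.

```python
from math import comb


def hypergeom_pvalue(hits, sample_size, pathway_size, universe_size):
    """P(X >= hits) when sample_size proteins are drawn from universe_size,
    of which pathway_size belong to the pathway."""
    upper = min(sample_size, pathway_size)
    total = comb(universe_size, sample_size)
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, sample_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical: 50 submitted proteins, 20 of them in a 600-protein pathway,
    # out of a universe of 10,000 annotated proteins.
    p = hypergeom_pvalue(20, 50, 600, 10000)
    print(p)
    print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

Running one such test per biological process and then adjusting the P-values would yield a P-value and an FDR for each process.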
Biological Functions and Pathways of the Top 50 CaseOLAP Scoring Proteins over 6 CVD Groups
Each entry is a P-value with a count in parentheses.

Biological Process                         | AR         | CHD        | CM         | CVA        | IHD        | VD
Signal Transduction                        | 0.503 (21) | 0.037 (26) | 0.084 (23) | 0.021 (30) | 0.012 (27) | 0.142 (24)
Immune System                              | 0.592 (18) | 0.992 (9)  | 0.007 (25) | 0.047 (26) | 0.038 (23) | 0.476 (18)
Disease                                    | 0.400 (10) | 0.056 (13) | 0.128 (11) | 0.819 (7)  | 0.243 (10) | 0.213 (11)
DNA Repair                                 | na (0)     | 0.884 (1)  | na (0)     | na (0)     | na (0)     | 0.894 (1)
Chromatin organization                     | na (0)     | 0.826 (1)  | na (0)     | na (0)     | na (0)     | na (0)
Metabolism                                 | 0.743 (15) | 0.182 (19) | 0.874 (11) | 0.475 (18) | 0.002 (26) | 0.897 (12)
DNA Replication                            | 0.582 (1)  | 0.185 (2)  | 0.521 (1)  | 0.223 (2)  | 0.531 (1)  | 0.559 (1)
Transmembrane transport of small molecules | 0.304 (7)  | 0.373 (6)  | 0.177 (7)  | 0.924 (3)  | 0.101 (8)  | 0.408 (6)
Gene Expression                            | 0.118 (19) | 0.314 (15) | 0.878 (9)  | 0.053 (21) | 0.991 (6)  | 0.119 (18)
Cell Cycle                                 | 0.869 (3)  | 0.273 (6)  | 0.921 (2)  | 0.879 (3)  | 0.799 (3)  | 0.946 (2)
Extracellular matrix organization          | 0.031 (6)  | 0.020 (6)  | 0.000 (9)  | 0.003 (8)  | 0.004 (7)  | 0.000 (13)
Cellular responses to stress               | 0.660 (3)  | 0.011 (8)  | 0.024 (7)  | 0.132 (6)  | 0.073 (6)  | 0.961 (1)
Programmed Cell Death                      | 0.755 (1)  | 0.720 (1)  | 0.331 (2)  | 0.420 (2)  | 0.343 (2)  | 0.734 (1)

Column positions for the remaining eight processes are not recoverable; their values are listed in source order.
After "Circadian Clock": 0.422 (1), 0.391 (1)
After "Developmental Biology": 0.043 (14), 0.081 (12)
After "Hemostasis": 0.000 (20)
After "Neuronal System" and the CM, CVA, IHD, VD column headers: 0.108 (2), 0.379 (1), 0.402 (1), 0.180 (10), 0.253 (11), 0.063 (12), 0.054 (13), 0.000 (13), 0.007 (10), 0.000 (15), 0.000 (21), 0.001 (13), 0.302 (4), 0.462 (3), 0.208 (4), 0.944 (1), 0.700 (2), 0.487 (3)
After the Disease row: na (0)
Before "Organelle biogenesis and maintenance": na (0), 0.925 (1), na (0), na (0), 0.917 (1), 0.932 (1), 0.169 (3), 0.000 (8), 0.049 (10), 0.726 (5)
After "Muscle contraction": 0.000 (10), 0.004 (6), 0.000 (7), na (0)
After "Vesicle-mediated transport": 0.306 (8), 0.694 (5), 0.797 (4), 0.900 (4), na (0), 0.638 (1), na (0), na (0), na (0)
After "Cell-Cell communication": 0.652 (1)
Summary
• We have devised a natural language processing approach that annotates vast amounts
of textual data from published manuscripts on CVD, enabling statistical pattern learning,
extraction of relevant information, and application of predictive analytics.
• A combination of phrase-mining algorithms and a large-scale network embedding
technique is effective at recognizing patterns and extracting relevant information,
providing novel biomedical insights into relationships among 25 proteins and 6
major CVDs.
• This novel data acquisition strategy may also be suitable for the vast amount of
accumulated patient information in Clinical Case Reports and Electronic Health
Records.
Acknowledgements
University of Illinois at Urbana-Champaign:
• Professor Jia Wei Han
• Doris Xin
• Meng Qu
• Xuan Wang
• Fangbo Tao
• Po-Wei Chan
University of California, Los Angeles:
• David Liem
• Vincent Kyi
• Leah Briscoe
• Travis Cao
• Brian Bleakley