Automated Hypothesis Generation
Based on Mining Scientific Literature
Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter
Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda
Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower,
Ying Chen, and Olivier Lichtarge. 2014. Automated hypothesis generation based on mining
scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (KDD '14). ACM, New York, NY, USA, 1877-1886.
Kathleen Padova, October 21, 2014
Authorship & Publication
• Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman,
Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam
Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N.
Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J.
Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence
Donehower, Ying Chen, and Olivier Lichtarge
• Joint project between Baylor College of Medicine, The
University of Texas MD Anderson Cancer Center, and IBM
Research
• Presented at the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining,
August 2014
• ~700 downloads from ACM Digital Library
Challenge
• The volume of published information is growing
faster than humans can process it. Is there a
systematic way to automate some of this
analysis, leaving more time for investigation?
Past approaches
• Highly structured content where the
connections are inferred from the structure
(MeSH)
• Established empirical laws (chemistry)
• Attempts at unstructured content – “hit-or-miss”
New approach
• KnIT - Knowledge Integration Toolkit
o Exploration
o Interpretation
o Analysis
Case Study: p53 Kinases
• Protein p53, when chemically modified
(phosphorylated) by another protein, is
essential to a cell’s own defense against a
broken, cancerous state
• Kinases - proteins that phosphorylate other
proteins
• Increase in research to find drugs that
influence kinases as potential cancer
treatments
Case Study: Challenge
• 500+ kinases × tens of thousands of possible
proteins = over 10,000,000 possible kinase-protein combinations
• Months to experiment, years to elucidate a
single kinase-protein relationship
• 33 of 500+ kinases found to modify p53… so
far
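The scale of the search space can be checked with a few lines of arithmetic. This is only a sketch: the slide gives "500+" kinases and "tens of thousands" of proteins, so the value 20,000 below is an assumed stand-in chosen to match the slide's order of magnitude, not a figure from the paper.

```python
# Assumed figures for illustration only: the slide says "500+" kinases and
# "tens of thousands" of proteins; 20,000 is a hypothetical stand-in.
n_kinases = 500
n_proteins = 20_000

# Every kinase could in principle be paired with every protein.
n_pairs = n_kinases * n_proteins

# The slide reports 33 kinases known (so far) to modify p53.
known_p53_kinases = 33
known_fraction = known_p53_kinases / n_kinases

print(f"{n_pairs:,} candidate kinase-protein pairs")
print(f"{known_fraction:.1%} of kinases currently known to modify p53")
```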
Case Study: Challenge
• Mine the literature to identify other kinases
likely to modify p53
• Create a focused pool of highly likely targets
for future experimentation
Phase 1: Exploration
• Collect relevant information
• Design text queries
• Extract relevant documents (259 kinases)
• Identify known p53 kinases (23)
• Model each entity (kinase)
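One way to picture "model each entity" is to turn the documents retrieved for a kinase into a single weighted term vector. The sketch below assumes a simple averaged term-frequency model; the tokenizer, the function names `tokenize` and `entity_model`, and the toy abstracts are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def entity_model(documents):
    """Average the normalised term-frequency vectors of an entity's documents."""
    centroid = Counter()
    usable = [Counter(tokenize(doc)) for doc in documents]
    usable = [counts for counts in usable if counts]  # skip empty documents
    for counts in usable:
        total = sum(counts.values())
        for term, n in counts.items():
            # Each document contributes equally, whatever its length.
            centroid[term] += n / total / len(usable)
    return dict(centroid)

# Hypothetical toy abstracts; "kinase_A" is a placeholder, not a real kinase.
corpus = {
    "kinase_A": ["p53 phosphorylation", "p53 damage"],
}
models = {name: entity_model(docs) for name, docs in corpus.items()}
print(models["kinase_A"])
```

Averaging per-document frequencies keeps one long paper from dominating the model of a kinase, which is a design choice of this sketch rather than a claim about KnIT.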
Phase 2: Interpretation
• Graph the similarity relationships between
entities (kinases)
• Visualize hidden connections based on entity
models
• Discover “deviant” entities based on proximity
to other entities sharing similar properties
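The similarity relationships can be sketched as a weighted graph whose edges link entities with similar term vectors. The example assumes cosine similarity and a fixed cutoff of 0.5; both choices, the function names, and the toy vectors are illustrative assumptions rather than details from the slides.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_graph(models, threshold=0.5):
    """Return {(a, b): similarity} for every pair at or above the threshold."""
    edges = {}
    for a, b in combinations(sorted(models), 2):
        s = cosine(models[a], models[b])
        if s >= threshold:
            edges[(a, b)] = s
    return edges

# Hypothetical entity models; the names and weights are placeholders.
models = {
    "kinase_A": {"p53": 0.5, "phosphorylation": 0.5},
    "kinase_B": {"p53": 0.4, "phosphorylation": 0.4, "damage": 0.2},
    "kinase_C": {"membrane": 0.7, "transport": 0.3},
}
graph = similarity_graph(models)
for (a, b), s in graph.items():
    print(f"{a} -- {b}: {s:.3f}")
```

In this toy data only kinase_A and kinase_B share vocabulary, so they are the only pair connected; kinase_C stays isolated.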
Phase 3: Analysis
• Use graph diffusion to spread annotated
similarity relationships throughout the graph
• Add a “likeliness” factor to each entity
• Domain expert verifies the candidates for
further study
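The diffusion step can be illustrated with a simple iterative label propagation over a weighted graph. This is a swapped-in, restart-style propagation chosen for brevity, not necessarily the diffusion method used in the paper; the function name `diffuse`, the `alpha` parameter, and the toy graph are all assumptions.

```python
def diffuse(neighbors, seeds, alpha=0.5, iterations=50):
    """Spread seed labels over a weighted graph.

    neighbors: {node: {neighbor: weight}}, with every node present as a key.
    seeds: set of nodes annotated as known positives.
    alpha: share of each score taken from neighbours (assumed value).
    """
    initial = {n: (1.0 if n in seeds else 0.0) for n in neighbors}
    score = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            total = sum(nbrs.values())
            # Weighted average of the neighbours' current scores.
            spread = sum(w * score[m] for m, w in nbrs.items()) / total if total else 0.0
            updated[node] = (1 - alpha) * initial[node] + alpha * spread
        score = updated
    return score

# Hypothetical similarity graph; names and weights are placeholders.
neighbors = {
    "known_kinase": {"candidate_1": 0.9},
    "candidate_1": {"known_kinase": 0.9, "candidate_2": 0.2},
    "candidate_2": {"candidate_1": 0.2},
    "candidate_3": {},
}
scores = diffuse(neighbors, seeds={"known_kinase"})
ranked = sorted(
    (n for n in neighbors if n != "known_kinase"),
    key=lambda n: scores[n],
    reverse=True,
)
print(ranked)
```

Candidates closer to the annotated kinase receive higher scores, which gives the domain expert a ranked list to verify.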
Computational Validation
• Leave-one-out cross-validation
o Mark one known p53 kinase at a time as unknown
o Can the algorithm correctly predict the held-out kinase?
• Retrospective study
o Run the algorithm on pre-2003 literature to try to predict
the p53 kinases that would be discovered later
• Large scale study
o Apply the algorithm to a different data set to predict
kinases that target other (not p53) proteins
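The leave-one-out procedure can be sketched as a loop that hides one known kinase at a time and records where it ranks among the remaining candidates. The scorer below (mean similarity to the remaining known kinases) is a hypothetical placeholder for the real ranking algorithm, and the similarity values are invented toy data.

```python
def leave_one_out_ranks(nodes, known, score_fn):
    """For each known entity, hide it and report its rank among the candidates."""
    ranks = {}
    for held_out in sorted(known):
        seeds = known - {held_out}
        scores = score_fn(seeds)
        candidates = [n for n in nodes if n not in seeds]
        ordered = sorted(candidates, key=lambda n: scores[n], reverse=True)
        ranks[held_out] = ordered.index(held_out) + 1  # rank 1 is the best
    return ranks

# Hypothetical pairwise similarities; missing pairs count as 0.
similarity = {
    frozenset(("k1", "k2")): 0.9,
    frozenset(("k1", "k3")): 0.8,
    frozenset(("k2", "k3")): 0.85,
    frozenset(("k1", "u1")): 0.1,
    frozenset(("k2", "u2")): 0.2,
}
nodes = ["k1", "k2", "k3", "u1", "u2"]
known = {"k1", "k2", "k3"}

def mean_similarity(seeds):
    """Placeholder scorer: average similarity of each node to the seed set."""
    return {
        n: sum(similarity.get(frozenset((n, s)), 0.0) for s in seeds) / len(seeds)
        for n in nodes
    }

ranks = leave_one_out_ranks(nodes, known, mean_similarity)
print(ranks)
```

A held-out kinase that consistently ranks near the top suggests the method can recover known positives; the retrospective study applies the same idea with a date cutoff on the literature instead of a hidden label.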
Experimental Validation
Future work
• Ramp up capability for larger-scale analysis
• Apply to a wider range of proteins, creating a
more comprehensive map of proteins and
functions involved in cancer research
• Apply the general literature mining approach
to other scientific domains
Discussion
• To what extent is success dependent on
partnering outside of the known domain?
• How reliable can this technique be when evaluating
literature over a long timespan in which the
vocabulary may evolve?