BioQA: Question answering with respect to Biomedical texts and
BioQA - A question
answering system for the
biomedical domain
Luis Tari
Question Answering (QA)
What is QA?
“QA is an interactive human computer process that
encompasses understanding a user information need,
typically expressed in a natural language query; retrieving
relevant documents, data, or knowledge from selected
sources; extracting, qualifying and prioritizing available
answers from these sources; and presenting and
explaining responses in an effective manner.”
Cited from “New Directions in Question Answering”
Why QA?
One of the ultimate goals in AI (human-level AI, Turing’s
test, …)
A move beyond keyword query, finding what we really want
to know
QA
How is QA different from a search engine?
QA (check out www.brainboost.com) | Search Engine
Queries in natural language (questions) | Queries based on keywords
Presents answers to users | Users find the answers from retrieved results
Uses some natural language processing to determine answers | Uses mostly keywords and ranking to retrieve results
Text Retrieval Conference
(TREC)
An annual activity of information retrieval (IR)
research sponsored by the National Institute of
Standards and Technology (NIST).
TREC is organized into “tracks” of common
interest.
Research groups work on a common source of
data and a common set of queries or tasks.
The goal is to allow comparisons across
systems and approaches in a research-oriented,
collegial manner.
TREC Genomics Track
TREC Genomics Track focuses on the
retrieval of information from biomedical
literature.
Ad-hoc retrieval over a set of 4.5 million
articles, 25% of which have no
abstracts.
50 topics (queries) organized in 5 templates
TREC Genomics Templates
1. Find articles describing standard methods or
protocols for doing some sort of experiment or
procedure.
2. Find articles describing the role of a gene involved
in a given disease.
3. Find articles describing the role of a gene in a
specific biological process.
4. Find articles describing interactions (e.g., promote,
suppress, inhibit, etc.) between two or more genes
in the function of an organ or in a disease.
5. Find articles describing one or more mutations of a
given gene and its biological impact.
BioQA
A QA system for the biomedical domain
A great deal of genomics information
resources are available
Entrez Gene, PubMed, UniProt, Gene Ontology,
UMLS, many many more…
BioQA utilizes some of these genomics
resources, whereas a generic QA system does not
Keyword search is not enough
Consider the following examples
Example 1
Suppose as a biologist, I want to know the role of the
gene interferon beta in the disease multiple sclerosis.
Query to PubMed:
“interferon beta” AND “multiple sclerosis”
Oops… interferon beta IS also
the name of a treatment. I’m not
a medical doctor so I don’t
really care….
Example 2
Query: “interferon beta” AND “multiple
sclerosis”
Hmm… this is more like what
I am looking for….
Objectives of BioQA
Phase 1
Retrieve relevant articles with respect to the
specific needs of user’s questions
Phase 2
Extract and present answers to the users
Phase 3
Answer questions that require simple reasoning
BioQA Prototype
Offline Subsystem
[Flow diagram. Recoverable labels:]
Load BioMedAbstract
Collect Genes, Diseases, Bio-processes, ...
UMLS
MeSH
Entity Recognition
Anaphora Resolution
Extract facts (interactions)
Indexing
BioQA DB
Index
Location of entities
Facts
BioQA Prototype
Online Subsystem
[Flow diagram. Recoverable labels, in reading order:]
Accept User’s Query
Process User’s Query (Tag, Lucene Syntax, Stem...)
Search Database (using Lucene indexes)
Filter result
Rank/Categorize result
Allow user to modify/choose query patterns
Present results to User
Data stores: Database, Index
Main Components of BioQA
Phase 1
Question Processing and Query Formation
Entity Recognition
Indexing
Pronoun Resolution
Extraction
Ranking
Question Processing and
Query Formation
Process questions so that keywords are extracted to
form queries for retrieval
Incorporate synonyms for the keywords
Consider the question:
“What is the role of PRNP in mad cow disease?”
First idea
Get all the nouns from the question
But we do not want a query that includes “role”
Second idea
Identify all the entities from the question and treat them as
keywords
But what if we are unable to identify some of the entities?
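The query-formation step above (extract keywords, then incorporate synonyms) might be sketched as follows. This is only an illustration, not the BioQA implementation: the function name `build_query` and the hand-written synonym table are hypothetical stand-ins for a real gene/disease dictionary, and the output imitates the quoted AND-style PubMed query shown earlier.

```python
def build_query(keywords, synonyms):
    """Join keywords with AND; expand each keyword with OR over its synonyms."""
    clauses = []
    for keyword in keywords:
        # Keep the keyword first, then any synonyms that differ from it.
        terms = [keyword] + [s for s in synonyms.get(keyword, []) if s != keyword]
        quoted = ['"%s"' % term for term in terms]
        if len(quoted) == 1:
            clauses.append(quoted[0])
        else:
            clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical synonym table; a real system would load this from a dictionary.
SYNONYMS = {
    "PRNP": ["prion protein"],
    "mad cow disease": ["bovine spongiform encephalopathy"],
}

query = build_query(["PRNP", "mad cow disease"], SYNONYMS)
print(query)
```

Note that the keywords are passed in already extracted; deciding which words of the question are keywords is the harder problem the slide raises.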
Question Processing and
Query Formation
Third idea – making use of dependency grammar
(Link Grammar)
keyword(N2) :- noun(N1), noun(N2), Mp(N1,X), J(X,N2).
In the following example, N1= “role” and N2= “X” in
the question
[Link Grammar linkage diagram for the sentence below. Link labels shown: Xp, MVp, Ost, Ws, Ss*w, Ds, Mp, J, J. The links used by the rule are Mp(role.n, of) and J(of, X).]
LEFT-WALL what is.v the role.n of X in Y ?
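The rule keyword(N2) :- noun(N1), noun(N2), Mp(N1,X), J(X,N2) can be read as a small join over the links of a parse. The sketch below assumes the links are already available as (label, left word, right word) tuples; here they are written by hand for the PRNP question rather than produced by a Link Grammar parser, and the multi-word disease name is treated as a single token for simplicity. The function name `keywords_from_links` is hypothetical.

```python
def keywords_from_links(links, nouns):
    """Apply: keyword(N2) :- noun(N1), noun(N2), Mp(N1, X), J(X, N2)."""
    found = []
    for label1, n1, x in links:
        # Need an Mp link whose left word is a noun (N1).
        if label1 != "Mp" or n1 not in nouns:
            continue
        for label2, left, n2 in links:
            # Need a J link starting at the same X and ending at a noun (N2).
            if label2 == "J" and left == x and n2 in nouns and n2 not in found:
                found.append(n2)
    return found


# Hand-written links (an assumption, not parser output).
LINKS = [
    ("Ds", "the", "role"),
    ("Mp", "role", "of"),
    ("J", "of", "PRNP"),
    ("MVp", "is", "in"),
    ("J", "in", "mad cow disease"),
]
NOUNS = {"role", "PRNP", "mad cow disease"}

print(keywords_from_links(LINKS, NOUNS))
```

With these links the rule selects only the X slot (PRNP): the word "in" is reached through MVp rather than Mp, so picking up Y would need an additional rule.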
Entity Recognition
To recognize gene symbols, disease names
Lots of resources on
gene symbols: Entrez Gene, HUGO, …
disease names: MeSH, UMLS, …
Why is Entity Recognition still an issue?
“CDC28” can be written as “Cdc28”, “Cdc28p”, “cdc-28”
“hairy” is a gene name
“GSS” is a synonym of “PRNP”, but “GSS” itself is also a
gene which is unrelated to “PRNP”!
Two tasks
Recognize gene names given a biomedical article
Generate gene symbol synonyms and variants given a
gene symbol in a query
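The second task, generating variants of a gene symbol, might look like the sketch below. The particular rules (case changes, an optional hyphen or space before the digits, a trailing "p" for the protein form) are assumptions chosen to reproduce the CDC28 examples above; they are not a documented BioQA rule set, and `gene_symbol_variants` is a hypothetical name.

```python
import re


def gene_symbol_variants(symbol):
    """Generate simple spelling variants of a gene symbol (illustrative rules only)."""
    forms = [symbol]
    match = re.fullmatch(r"([A-Za-z]+)[- ]?(\d+)", symbol)
    if match:
        letters, digits = match.groups()
        # Vary the separator between the letter part and the number part.
        for separator in ("", "-", " "):
            forms.append(letters + separator + digits)
    variants = set()
    for form in forms:
        # Vary the capitalization of every form.
        variants.update({form, form.upper(), form.lower(), form.capitalize()})
    if match:
        # Protein-style spelling with a trailing "p", as in "Cdc28p".
        variants.add(letters.capitalize() + digits + "p")
    return variants


print(sorted(gene_symbol_variants("CDC28")))
```

Rules this loose will also over-generate for some symbols, which is one reason the GSS/PRNP ambiguity above cannot be solved by variant generation alone.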
Entity Recognition
Various approaches:
Machine learning techniques to recognize names
on the basis of their characteristic features
Dictionary-based methods with generation of
variants
Dictionary-based + Part-of-Speech methods
Rule-based methods
Some of the best Entity Taggers:
ABNER
GAPSCORE
Anaphora Resolution
Pronominal Anaphors
Resolving third-person pronouns and reflexive pronouns
Example: “BRCA1 interacts with Smad2. It also interacts
with Smad3.”
Sortal Anaphors
“In this report, we show that virus infection of cells results
in a dramatic hyperacetylation of histones H3 and H4 that
is localized to the IFN-beta promoter. … Thus, coactivator-mediated
localized hyperacetylation of histones may play a
crucial role in inducible gene expression.” [PMID:
10024886]
Which histones?
Anaphora Resolution
“Ethanol was found to inhibit the function of this
chimeric receptor in a manner similar to that of
nACh alpha 7 receptors. Because the inhibition
transfers with the amino-terminal domain of the
receptor, the observations suggest that the
amino-terminal domain of the receptor is involved
in the inhibition.” [PMID: 8863848]
Extraction
To extract knowledge from text
Knowledge such as protein-protein interactions, gene-disease relations, …
Can be used in presenting answers
Extracting protein-protein interactions
“Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly
phosphorylated Swe1 and this modification served as a
priming step to promote subsequent Cdc5-dependent
Swe1 hyperphosphorylation and degradation.” [PMID:
15920482]
Should extract the following interactions from the above
text:
Cdc28 binds Clb2
Swe1 is phosphorylated by the Clb2-Cdc28 complex
Cdc5 is involved in Swe1 phosphorylation.
Extraction
Extraction of other relations
“… Furthermore, PACT colocalized with viral
replication complex in the infected cells. Thus the
observed effect of PACT is novel and PACT is
involved in the regulation of viral replication …”
[PMID: 11401490]
Should extract the following relations from the
above text:
PACT colocalized with viral replication complex in the
infected cells
PACT is involved in the regulation of viral replication
Extraction
Two main directions towards extraction:
Cooccurrence
Identify entities that co-occur within abstracts
Frequency-based scoring scheme to rank the extracted
relationships
NLP
Combine the analysis of syntax and semantics
Using extraction rules that are implemented manually or
learned automatically from annotated corpus
However,
Cooccurrences sometimes do not actually mean correct
relations
Cannot infer directional relationships from cooccurrences
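The co-occurrence direction can be illustrated with a very small frequency count. The sketch assumes entity recognition has already produced a list of entity names per abstract; the three toy lists below are made up for illustration and are not real PubMed data, and `cooccurrence_counts` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_entities):
    """Count how many abstracts mention each unordered pair of entities."""
    counts = Counter()
    for entities in abstract_entities:
        # Each pair is counted at most once per abstract.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Toy input (hypothetical): entities recognized in three abstracts.
ABSTRACTS = [
    ["Cdc28", "Clb2", "Swe1"],
    ["Cdc28", "Swe1", "Cdc5"],
    ["Cdc5", "Clb2"],
]

counts = cooccurrence_counts(ABSTRACTS)
for pair, frequency in counts.most_common():
    print(pair, frequency)
```

The pairs are unordered, which makes the limitation above concrete: the count says that Cdc28 and Swe1 appear together, not which one acts on the other or whether they interact at all.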
Hard Lessons learned from
TREC
Synonyms from a gene dictionary are NOT
enough
Generating gene symbol variants is essential
One query is not enough to do the job
Generate query variants, which are slight
variations of the original query.
For instance, the query “inhibitory synaptic
transmission” can have the variants “synaptic
transmission” and “inhibitory transmission”.
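One simple way to produce such variants is to drop one modifier at a time while keeping the last word of the phrase. Treating the last word as the head noun is an assumption made for this sketch, not a rule stated in the slides, and `query_variants` is a hypothetical name.

```python
def query_variants(query):
    """Drop one non-final word at a time, always keeping the final (head) word."""
    words = query.split()
    if len(words) < 3:
        # Dropping a modifier from a two-word phrase would leave only the head.
        return []
    head = words[-1]
    modifiers = words[:-1]
    variants = []
    for i in range(len(modifiers)):
        kept = modifiers[:i] + modifiers[i + 1:]
        variants.append(" ".join(kept + [head]))
    return variants


print(query_variants("inhibitory synaptic transmission"))
```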
Hard Lessons learned from TREC (cont.)
Abstracts related to a gene family can be
relevant as well
If we want to know about the gene COPII,
we may also want to know about COP and COPI
Abstracts can merely mention an entity as an
example
e.g. [PMID 10232877]: GSTM1 is mentioned as being
related to breast cancer as an example, but the article
is about GSTM1 and alcoholism.
Future Components
Structural Feedback
Answer Presentation
Semantics of Words
Simple Reasoning using Domain Knowledge
Structural Feedback
Problem:
Can we use the underlying “structures” among the
relevant articles to improve the retrieval process?
[IBM]
Goal: To learn the “structures” of abstracts
that are identified as relevant.
Idea: Learn the structure of articles (such as
common words, MeSH terms)
identified to be relevant by domain experts
identified to be relevant by users
Answer Presentation
To present answers to users in a precise and
concise manner
Current Status: relevant “answers” are presented to
the users in the form of abstracts
Problem: Not concise enough for users
Ideas:
Retrieve small passage of text, based on proximity of
keywords [LCC02] and simple cosine similarity between
sentences [Singapore05].
Extraction using NLP
Use text summarization techniques to present answers
[PSB06].
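The sentence-similarity idea can be sketched with a plain bag-of-words cosine. This is a minimal illustration under simple assumptions (lowercased word counts, no stemming, no stop-word removal, no term weighting); it is not the method of the cited systems, and the helper names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_sentence(question, sentences):
    """Return the sentence most similar to the question."""
    return max(sentences, key=lambda s: cosine_similarity(question, s))


question = "What is the role of IDE in Alzheimer's Disease?"
sentences = [
    "Ethanol inhibits the receptor.",
    "IDE plays a role in the degradation of amyloid beta.",
]
print(best_sentence(question, sentences))
```

Without stop-word removal, shared function words such as "the" and "of" contribute to the score, so a real system would likely weight terms before comparing.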
Semantics of Words
WordNet – a resource that provides synonyms of
words in different senses; relations between words
Question:
“What is the role of IDE in Alzheimer’s Disease?”
Abstract (PMID:12161276):
“… IDE plays in the degradation and clearance of
human amyloid beta from microglial cells and neurons …”
Semantic relation between “role” and “play” [from
WordNet]:
role: function, purpose, role, use
play: is_a(play_use)
So we can say “role”, “play”, “use” are related.
Answer: The role of IDE is in the degradation and
clearance of human amyloid beta from microglial
cells and neurons.
Simple Reasoning using Domain
Knowledge (Example 1)
Question:
“Does IDE play a role in Alzheimer’s Disease (AD)?”
Retrieved Abstract (PMID:12161276):
“… The insulin degrading enzyme (IDE) is an attractive
candidate gene since previous studies have identified a possible
role that IDE plays in the degradation and clearance of human
amyloid beta from microglial cells and neurons …”
Domain knowledge:
AD is a nervous system disease.
Neurons are related to the nervous system.
Answer: Yes, IDE plays a role in AD because AD is a nervous
system disease and IDE plays a role in the degradation and
clearance of human amyloid beta from microglial cells and
neurons, which are related to the nervous system.
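The reasoning in this example can be written down as a lookup over a few hand-coded facts. The dictionaries and the single rule below are hypothetical: they encode only the two domain-knowledge statements above plus the fact read from the abstract, and they sidestep the open question of how such knowledge should really be represented or obtained.

```python
# Domain knowledge (hand-coded for this example only).
DISEASE_SYSTEM = {"AD": "nervous system"}        # AD is a nervous system disease
TISSUE_SYSTEM = {"neurons": "nervous system"}    # neurons relate to the nervous system

# Fact taken from the retrieved abstract: where IDE is reported to act.
GENE_ACTS_IN = {"IDE": ["microglial cells", "neurons"]}


def explain_role(gene, disease):
    """Return an explanation if the gene acts in a tissue tied to the disease's system."""
    system = DISEASE_SYSTEM.get(disease)
    if system is None:
        return None
    for tissue in GENE_ACTS_IN.get(gene, []):
        if TISSUE_SYSTEM.get(tissue) == system:
            return "%s acts in %s, %s relate to the %s, and %s is a %s disease" % (
                gene, tissue, tissue, system, disease, system)
    return None


print(explain_role("IDE", "AD"))
```

A result of None means only that no supporting chain was found in these facts, not that the answer is "no".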
Simple Reasoning using Domain
Knowledge (Example 2)
Question: Is MMS2 involved in cancer?
Domain Knowledge about MMS2
MMS2 is known to be involved in biological
processes such as cell proliferation and the
ubiquitin cycle, based on the Gene Ontology.
Cell Proliferation – cell growth
Ubiquitin cycle – regulating proteins' half-lives
Simple Reasoning using Domain
Knowledge (Example 2 cont.)
Domain Knowledge about cancer
Abnormal growth of tissues
Sometimes in cancer, we find that the ubiquitin cycle is
deregulated, leading to certain proteins having extra long
or extra short half-lives.
Answer: Yes. Since MMS2 is involved in
cell proliferation and the ubiquitin cycle, MMS2 is
possibly involved in cancer.
Challenges:
How to represent such knowledge
Where to get such domain knowledge
Potential Projects
Learning
Answer Presentation
Passage retrieval, extraction
Extraction
Structural Feedback
Rules for describing keywords in questions
gene-disease, gene-biological process relations
Sortal Resolution
Semantics of Words
References
Literature mining for the biologist: from information retrieval to biological
discovery. Lars Juhl Jensen, Jasmin Saric and Peer Bork. Nature Reviews
Genetics 7, 119-129 (February 2006).
Anaphora Resolution
Anaphora Resolution in Biomedical Literature. Jose Castano, Jason
Zhang, James Pustejovsky.
Extraction of Gene-Disease Relations
Association of genes to genetically inherited diseases using data mining.
Perez-Iratxeta C, Bork P, Andrade MA. Nature Genetics 31, 316-319
(2002).
G2D: A Tool for Mining Genes Associated to Disease. Perez-Iratxeta C,
Wjst M, Bork P, Andrade MA. BMC Genetics 6, 45 (2005).
Extraction of Gene-Disease Relations from Medline Using Domain
Dictionaries and Machine Learning. Hong-Woo Chun, Yoshimasa
Tsuruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata, Teruyoshi Hishiki,
and Jun'ichi Tsujii. PSB 2006.
Structural Feedback
[IBM] Rie Kubota Ando, Mark Dredze, Tong Zhang. TREC 2005
Genomics Track Experiments at IBM Watson.
References
Answer Presentation
[LCC02] Dan I. Moldovan, Mihai Surdeanu: On the Role of Information
Retrieval and Information Extraction in Question Answering Systems.
SCIE 2002: 129-147.
[Singapore05] Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan and Tat-Seng
Chua. Question Answering Passage Retrieval Using Dependency
Relations. In Proceedings of the 28th Annual International ACM SIGIR
Conference on Research and Development of Information Retrieval
(SIGIR 2005), Salvador, Brazil, August 15-19, 2005.
[PSB06] Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter.
Finding GeneRIFs via Gene Ontology Annotations. To appear in PSB
2006.
WordNet Resources
[WordNetSim] Pedersen, Patwardhan, and Michelizzi.
WordNet::Similarity - Measuring the Relatedness of Concepts. Appears
in the Proceedings of the Nineteenth National Conference on Artificial
Intelligence (AAAI-04), July 25-29, 2004, San Jose, CA (Intelligent
Systems Demonstration).
[SenseRelate] Michelizzi. Semantic Relatedness Applied to All Words
Sense Disambiguation. Master of Science Thesis, Department of
Computer Science, University of Minnesota, Duluth, July, 2005.