QuASI:
Question Answering using
Statistics, Semantics, and Inference
Marti Hearst, Jerry Feldman, Chris Manning, Srini Narayanan
Univ. of California-Berkeley / ICSI / Stanford University
TREC Task 1: Overview

Search 525,938 MEDLINE records
• Titles, abstracts, MeSH category terms, citation information
Topics:
• Taken from the GeneRIF portion of the LocusLink database
• We are supplied with gene names
• Definition of a GeneRIF:
For gene X, find all MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism. Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states.
TREC Task 1: Sample Query

3 2120 Homo sapiens OFFICIAL_GENE_NAME ets variant gene 6 (TEL oncogene)
3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
3 2120 Homo sapiens ALIAS_SYMBOL TEL
3 2120 Homo sapiens PREFERRED_PRODUCT ets variant gene 6
3 2120 Homo sapiens PRODUCT ets variant gene 6
3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene

The first column is the official topic number (1-50).
The second column contains the LocusLink ID for the
gene.
The third column contains the name of organism.
The fourth column contains the gene name type.
The fifth column contains the gene name.
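The column layout above can be read programmatically. The following is a minimal sketch, not the authors' code: it assumes the columns are whitespace-separated as shown on the slide, and that the gene name type is the first all-uppercase token after the two numeric columns. The names `TopicLine` and `parse_topic_line` are hypothetical.

```python
import re
from typing import NamedTuple


class TopicLine(NamedTuple):
    topic: int          # official topic number (1-50)
    locuslink_id: int   # LocusLink ID for the gene
    organism: str       # e.g. "Homo sapiens"
    name_type: str      # e.g. "OFFICIAL_SYMBOL"
    name: str           # the gene name itself


# Assumption: the name type is the first token made only of capitals and
# underscores; everything before it is the organism, everything after it
# is the gene name.
_LINE = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+([A-Z][A-Z_]+)\s+(.+)$")


def parse_topic_line(line: str) -> TopicLine:
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized topic line: {line!r}")
    topic, locus, organism, name_type, name = match.groups()
    return TopicLine(int(topic), int(locus), organism, name_type, name)


example = parse_topic_line(
    "3 2120 Homo sapiens OFFICIAL_GENE_NAME ets variant gene 6 (TEL oncogene)"
)
```

If the distributed topic file is tab-delimited, splitting on tabs would be simpler and more robust than this pattern.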
TREC Task 1: Approach

Two main components:

Retrieve relevant docs
• May miss many because of variation in how
gene names are expressed

Rank order them
TREC Task 1: Approach

Retrieval

Normalization of query terms
• Special characters are replaced with spaces in both queries and documents.

Term expansion
• A set of pattern-based rules is applied to the original list of query terms to expand it and increase recall.
• Lower-confidence rules get a lower weight in the ranking step.
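Normalization and weighted term expansion could look roughly like the sketch below. The slides do not list the actual expansion rules, so the two rules here (splitting at letter/digit boundaries, and dropping a trailing number) are purely hypothetical, as are the weights and the names `normalize` and `expand`. Collapsing repeated spaces is also an added assumption.

```python
import re

HIGH, LOW = 1.0, 0.5  # illustrative confidence weights, not from the slides


def normalize(text: str) -> str:
    # Replace special (non-alphanumeric) characters with spaces,
    # then collapse runs of whitespace (the collapsing is an assumption).
    return " ".join(re.sub(r"[^0-9A-Za-z]+", " ", text).split())


def expand(term: str) -> dict:
    """Map each variant of `term` to the confidence weight of its rule."""
    variants = {normalize(term): HIGH}

    # Hypothetical rule 1: insert a space at letter/digit boundaries.
    spaced = normalize(
        re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", term)
    )
    variants.setdefault(spaced, HIGH)

    # Hypothetical rule 2 (low confidence): drop a trailing number.
    stem = re.sub(r"\s*\d+$", "", spaced)
    if stem:
        variants.setdefault(stem, LOW)
    return variants


variants = expand("ETV6")  # {"ETV6": 1.0, "ETV 6": 1.0, "ETV": 0.5}
```

Keeping the weight next to each variant lets the ranking step treat matches from low-confidence rules separately, as the slide describes.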


Stop word removal
Organism identification
• Gene names are often shared across different organisms
• Developed a method to automatically determine which MeSH terms
correspond to LocusLink Organism terms
• Retrieved Medline docs indicated by LocusLink links corresponding to a given
organism
• Organism terms were the most frequent MeSH categories among the selected docs
• Used these terms to identify the organism term in Medline
• An example of playing two databases off each other.
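The organism-mapping idea above amounts to a frequency count. This is a minimal sketch under stated assumptions: the input is the list of MeSH term sets for the MEDLINE documents that LocusLink links to one organism, and the function name `organism_mesh_term` and the toy data are hypothetical.

```python
from collections import Counter


def organism_mesh_term(linked_docs_mesh):
    """Return the MeSH term that occurs in the most linked documents."""
    counts = Counter()
    for mesh_terms in linked_docs_mesh:
        counts.update(set(mesh_terms))  # count each term once per document
    if not counts:
        raise ValueError("no MeSH terms found for this organism")
    term, _ = counts.most_common(1)[0]
    return term


# Toy data: MeSH terms of documents linked to "Homo sapiens" entries.
homo_sapiens_docs = [
    ["Human", "Apoptosis"],
    ["Human", "Oncogenes"],
    ["Human", "Apoptosis", "Mice"],
]
mesh_for_human = organism_mesh_term(homo_sapiens_docs)
```

Repeating this for each LocusLink organism gives a lookup table from organism name to MeSH term, which can then be used to filter retrieved documents.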

MeSH concepts
• When an exact match is found between one of the query terms and a MeSH
term assigned to a document, the document is retrieved.
[Figure: gene name expansion and organism filtering]
TREC Task 1: Approach

Relevance ranking


IBM’s DB2 Net Search Extender was used as the text search engine.
Scoring:
• Each query is a union of 5 different sub-queries:
  • titles,
  • abstracts,
  • titles using low confidence expansion rules,
  • abstracts using low confidence expansion rules, and
  • MeSH concepts.
• Each sub-query returns a set of documents with a relevance score
from the text search engine (or a fixed value for MeSH matches)
• The aggregated score is the SUM of the individual scores, with an optional weight applied to each sub-query score.
• SUM performs better than MAX, since it gives higher confidence to
documents found in multiple sub-queries.
• Scores are normalized to the (0,1] range by dividing each score by the highest aggregated score achieved for the query.
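The scoring scheme can be illustrated with a short sketch. It assumes each sub-query returns a mapping from document ID to a positive score; the function name `aggregate`, the sub-query labels, and the example weights are hypothetical and not taken from the system itself.

```python
def aggregate(subquery_results, weights=None):
    """Weighted SUM of sub-query scores, normalized by the top score.

    subquery_results: {sub-query name: {doc_id: score}}
    weights: optional {sub-query name: weight}; missing names default to 1.0
    """
    weights = weights or {}
    totals = {}
    for name, hits in subquery_results.items():
        weight = weights.get(name, 1.0)
        for doc_id, score in hits.items():
            # SUM rewards documents that are found by several sub-queries.
            totals[doc_id] = totals.get(doc_id, 0.0) + weight * score
    if not totals:
        return {}
    top = max(totals.values())
    return {doc_id: total / top for doc_id, total in totals.items()}


results = {
    "titles": {"d1": 0.8, "d2": 0.4},
    "abstracts": {"d1": 0.6},
    "mesh": {"d2": 1.0, "d3": 1.0},  # fixed value for MeSH matches
}
scores = aggregate(results, weights={"mesh": 0.5})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `d1` is found by two text sub-queries and ends up ranked first with a normalized score of 1.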
TREC Task 1: Approach

GeneRIF classification
• A Naïve Bayes model is used to assign to each document
the probability it is a GeneRIF.
• MeSH terms are used as features.
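As an illustration of a Naïve Bayes classifier over MeSH features, here is a small self-contained sketch. The slides do not say which Naïve Bayes variant or smoothing was used, so the Bernoulli model with add-one smoothing below is an assumption, and the class name `MeshNaiveBayes` and the toy training data are hypothetical.

```python
import math
from collections import Counter


class MeshNaiveBayes:
    """Bernoulli Naive Bayes over MeSH terms (True = document is a GeneRIF)."""

    def fit(self, docs, labels):
        # docs: list of MeSH term sets; labels: list of bool.
        # Both classes must appear at least once in the training data.
        self.vocab = set().union(*docs)
        self.class_counts = Counter(labels)
        self.term_counts = {True: Counter(), False: Counter()}
        for terms, label in zip(docs, labels):
            self.term_counts[label].update(set(terms))
        return self

    def _log_joint(self, terms, label):
        n = self.class_counts[label]
        log_p = math.log(n / sum(self.class_counts.values()))
        for term in self.vocab:
            # Add-one smoothed P(term present | class).
            p = (self.term_counts[label][term] + 1) / (n + 2)
            log_p += math.log(p if term in terms else 1 - p)
        return log_p

    def predict_proba(self, terms):
        pos = self._log_joint(terms, True)
        neg = self._log_joint(terms, False)
        shift = max(pos, neg)  # subtract the max for numerical stability
        e_pos, e_neg = math.exp(pos - shift), math.exp(neg - shift)
        return e_pos / (e_pos + e_neg)


model = MeshNaiveBayes().fit(
    [{"Apoptosis", "Human"}, {"Apoptosis", "Mutation"}, {"Review"}, {"Review", "Human"}],
    [True, True, False, False],
)
p_generif = model.predict_proba({"Apoptosis"})
```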

Combination of text retrieval score and GeneRIF
classification score.
• We tried both an additive and a multiplicative approach.
Both behave similarly with a slightly better performance
achieved with the additive one.
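The two combination strategies can be compared on toy numbers. This sketch assumes both scores lie between 0 and 1; the mixing parameter `alpha` and the function names are hypothetical, since the slides do not give the exact formula or weights.

```python
def combine_additive(retrieval, generif, alpha=0.5):
    # Weighted sum; alpha is an illustrative mixing weight, not a reported value.
    return alpha * retrieval + (1 - alpha) * generif


def combine_multiplicative(retrieval, generif):
    return retrieval * generif


def rank(doc_scores, combine):
    """doc_scores: {doc_id: (retrieval score, GeneRIF probability)}."""
    combined = {doc: combine(r, g) for doc, (r, g) in doc_scores.items()}
    return sorted(combined, key=combined.get, reverse=True)


toy = {"d1": (1.0, 0.2), "d2": (0.7, 0.9)}
additive_order = rank(toy, combine_additive)
multiplicative_order = rank(toy, combine_multiplicative)
```

On this toy input both strategies promote `d2`, whose lower retrieval score is offset by a high GeneRIF probability.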
TREC Task 1: Results

Performance is measured using the standard
trec_eval program.

On training data:
• Best published result: 0.4125
• With GeneRIF classifier: 0.5101
• Without GeneRIF classifier: 0.5028

On testing data (turned in 8/4/03):
• With GeneRIF classifier: 0.3933
• Without GeneRIF classifier: 0.3768
TREC Task 2

Problem Definition:
• Given GeneRIFs formatted as:
  • 1 355 12107169 J Biol Chem 2002 Sep 13;277(37):34343-8. the death effector domain of FADD is involved in interaction with Fas.
  • 2 355 12177303 Nucleic Acids Res 2002 Aug 15;30(16):3609-14. In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis w

… reproduce the GeneRIF from the MEDLINE
record.