SEARCHING QUESTION AND
ANSWER ARCHIVES
Dr. Jiwoon Jeon
Presented by
CHARANYA VENKATESH KUMAR
Discussion
Current Information Retrieval systems?
OVERVIEW
Introduction
Q&A Retrieval
Test Collections
Translation Based Q&A retrieval
framework
Learning word-to-word translations
INTRODUCTION
Q&A Retrieval problem
Challenges
Semantically similar questions
Problem : Word mismatch problem
Solution : Machine translation-based
information retrieval model
Quality of the Answers
Problem : Many answers to a given question
Solution : Answer Quality Prediction Technique
What is New?
New Type of Information System
New Translation-based Retrieval Model
New Document Quality Estimation Method
Integration of Advances in Multiple research
Areas
New Paraphrase Generation Method
Utilizing Web as a Resource for Retrieval
Q & A RETRIEVAL
Question & Answer Archives
Websites with FAQ
Community-based question answering
services
Task Definition
Q & A Retrieval (Contd..)
Advantages
Handle natural language questions
Return answers instead of relevant
documents
Disadvantages
Can answer only previously answered
questions
Q & A RETRIEVAL SYSTEM
ARCHITECTURE
CHALLENGES
Finding relevant Question & Answer
Pairs
Importance of question parts
Word mismatch problem
Estimating Answer Quality
Importance
TEST COLLECTIONS
Components :
Set of documents
Set of information needs (queries)
Set of relevance judgments
Pooling Method
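A minimal sketch of pooling, assuming the usual definition: the top-ranked results of several retrieval systems are merged, and only the merged pool is judged for relevance. The function name `build_pool`, the document IDs, and the pool depth are illustrative and do not come from the slides.

```python
def build_pool(runs, depth):
    """Return the union of the top-`depth` documents from each ranked list."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:depth])
    return pool


# Two hypothetical systems, each returning a ranked list for the same query.
runs = [
    ["d1", "d2", "d3", "d4"],
    ["d3", "d5", "d1", "d6"],
]
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Only the documents in `pool` are shown to the assessors; anything outside the pool is treated as not relevant.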
WONDIR COLLECTION
Earliest community-based QA service in
the US.
1 million question and answer pairs
used from this service
Average question length = 27 words
Average answer length = 28 words
Examples
Queries
Closed-class questions that ask for
fact-based short answers.
E.g.: Where is Charlotte located?
Relevance Judgment
220 relevant Q&A pairs for 50 queries
using pooling method.
Relevance Judgment Criteria
WebFAQ COLLECTION
by Jijkoun and de Rijke
Collection of FAQs gathered using web crawlers and made public for research purposes.
Found web pages that contain the word
“FAQ”.
Used heuristic methods to automatically
extract question and answer pairs from
the web pages.
NAVER COLLECTION
Leading portal site in South Korea
Community-based answering service
Collection A :
Category information – To test category
specific translations
Collection B :
Non-Textual Information – To build answer
quality prediction technique
Naver Collection (Contd..)
Question – Title & Body
Naver Test Collection A
Naver Test Collection B
Relevance :
Question semantically related to query and
Question contains all query terms
Q&A pair was clicked multiple times for the
query.
Comparison of test Collections
Translation Based Q&A
Retrieval framework
Use of Machine Translation technique
for information retrieval
Word mismatch problem
Translation based approach
IBM Statistical Machine
translation Models
Do not require any linguistic knowledge
of the source or target language.
Exploits only co-occurrence statistics of
terms in training data.
IBM Models
Model 1
Treats every possible word alignment
equally
Model 2
Assumes only positions of terms are
related to the word alignment
Model 3
The first term and the second term
generated from the same term are
independent
IBM Models (Contd..)
Model 4
First order alignment model
Every word is dependent only on the
previous aligned word.
Model 5
Reformulation of Model 4
Advantages of Model 1
Efficient implementation is possible
using a form of query expansion.
Low-level translation models already
give a high performance gain.
Can be easily integrated into the query
likelihood language model
IBM Model 1 Equation
The probability that a query Q of length m is
the translation of a document D (of length n) is
given as
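In the standard formulation of IBM Model 1 (the slide's own notation may differ), the probability is P(Q|D) = ε / (n+1)^m · Π_{i=1..m} Σ_{j=0..n} P(q_i|d_j), where d_0 is an extra NULL word and ε is a constant. The sketch below computes this quantity; the function name, the layout of the translation table, and the toy probabilities are assumptions made for illustration.

```python
def model1_probability(query, doc, trans, epsilon=1.0):
    """P(Q|D) under IBM Model 1.

    `trans` maps (target_word, source_word) -> P(target | source).
    The NULL source word is represented by None (an illustrative choice).
    """
    source = [None] + list(doc)      # d_0 is the NULL word
    m, n = len(query), len(doc)
    prob = epsilon / (n + 1) ** m    # every alignment is equally likely
    for q in query:
        prob *= sum(trans.get((q, d), 0.0) for d in source)
    return prob


# Toy translation table, made up for this example.
trans = {("car", "automobile"): 0.5, ("car", None): 0.1}
p = model1_probability(["car"], ["automobile"], trans)
```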
Translation based Language
Models
Language model is a mechanism for
generating text.
Unigram language model
Assumes each word is generated
independently
Concerns only probabilities of sampling a
single word.
Language modeling approach
to IR
With the maximum likelihood estimator,
words unseen in a document have zero
probability.
Smoothing :
Transfers some probability mass from the
seen words to the unseen words.
Dirichlet smoothing – good performance
and cheap computational cost.
Language modeling approach
to IR (Contd..)
The ranking function for the query
likelihood language model with Dirichlet
smoothing can be written as
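The usual form of this ranking function is score(Q, D) = Σ_{q in Q} log[(tf(q, D) + μ · P(q|C)) / (|D| + μ)], where tf(q, D) is the count of q in D, P(q|C) is the collection language model, and μ is the Dirichlet prior; the slide may have shown a rank-equivalent variant. A minimal sketch, with an illustrative function name and toy collection statistics:

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood of `query` given `doc` with Dirichlet smoothing."""
    tf = Counter(doc)
    score = 0.0
    for q in query:
        p_coll = collection_tf.get(q, 0) / collection_len  # P(q|C)
        p = (tf[q] + mu * p_coll) / (len(doc) + mu)
        if p == 0.0:                 # word unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection statistics, made up for this example.
collection_tf = {"a": 1, "b": 1, "c": 2}
s_seen = dirichlet_score(["a"], ["a", "b"], collection_tf, 4, mu=2.0)
s_unseen = dirichlet_score(["c"], ["a", "b"], collection_tf, 4, mu=2.0)
```

The word "c" does not occur in the document, yet it still receives a non-zero probability from the collection model, which is the point of smoothing.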
IBM Model 1 vs. Query
Likelihood
Comparable components in the two models
Self Translation Model
Every word has some probability of
translating to itself.
This probability cannot be 1.
If it is too low, retrieval
performance deteriorates.
TransLM
Final ranking Function looks like
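One plausible form, not necessarily the exact function on the slide, replaces the document term count in the Dirichlet-smoothed query likelihood with a translated count: P(q|D) = [|D| · Σ_t T(q|t) · P_ml(t|D) + μ · P(q|C)] / (|D| + μ). The sketch assumes that the translation table T already contains the self-translation probabilities; all names and numbers are illustrative.

```python
import math
from collections import Counter


def translm_score(query, doc, trans, collection_prob, mu=2000.0):
    """Log-likelihood of `query` under a translation-based language model.

    `trans` maps (query_word, doc_word) -> T(query_word | doc_word),
    including self-translation entries such as ("car", "car").
    """
    tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for q in query:
        # Sum over document terms t of T(q|t) * P_ml(t|D).
        translated = 0.0
        if doc_len > 0:
            translated = sum(
                trans.get((q, t), 0.0) * count / doc_len
                for t, count in tf.items()
            )
        p = (doc_len * translated + mu * collection_prob.get(q, 0.0)) / (doc_len + mu)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


# Toy values, made up for this example.
trans = {("car", "automobile"): 0.4, ("car", "car"): 0.6}
score = translm_score(["car"], ["automobile", "price"], trans, {"car": 0.01}, mu=2.0)
```

The document never mentions "car", but it still scores well above the collection background because "automobile" translates to "car".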
Efficiency Issues and
Implementation of TransLM
Flipped Translation Tables
Term-at-a-time Algorithm
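A sketch of how these two ideas could fit together, under the assumption that the translation table is stored as source word -> target words and that retrieval needs the opposite lookup (query word -> source words that translate into it). Scores are then accumulated one query term at a time over an inverted index. The data layout and names are hypothetical.

```python
from collections import defaultdict


def flip_table(table):
    """Turn {source: {target: prob}} into {target: {source: prob}}."""
    flipped = defaultdict(dict)
    for source, targets in table.items():
        for target, prob in targets.items():
            flipped[target][source] = prob
    return dict(flipped)


def translated_mass(query_term, flipped, postings, doc_lens):
    """Accumulate sum_t T(q|t) * tf(t, D) / |D| for every document D.

    `postings` maps a term to {doc_id: term_frequency}.
    """
    acc = defaultdict(float)
    for source, prob in flipped.get(query_term, {}).items():
        for doc_id, tf in postings.get(source, {}).items():
            acc[doc_id] += prob * tf / doc_lens[doc_id]
    return dict(acc)


# Toy table and index, made up for this example.
table = {"automobile": {"car": 0.4, "automobile": 0.6}, "car": {"car": 0.7}}
postings = {"automobile": {"d1": 1}, "car": {"d2": 2}}
doc_lens = {"d1": 2, "d2": 4}
flipped = flip_table(table)
mass = translated_mass("car", flipped, postings, doc_lens)
```

Only documents that contain at least one source term of the query word are touched, so the cost resembles query expansion rather than a scan over the whole collection.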
Properties of Word
Relationships
Not Symmetric
Not fixed
Change depending on retrieval or
translation tasks.
Must be given as probability values.
Training Sample Generation
Key Idea
If two answers are very similar, then the
corresponding questions are semantically
similar.
Similarity Measures
Cosine Similarity
Query Likelihood scores between two
answers (LM SCORE)
LM-HRANK
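The key idea can be sketched with the simplest of the three measures, cosine similarity over raw term counts: whenever two answers are similar enough, their questions are emitted as a training pair. The threshold, the whitespace tokenization, and the example Q&A pairs are assumptions for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def similar_question_pairs(qa_pairs, threshold):
    """Pair up questions whose answers have cosine similarity >= threshold."""
    pairs = []
    for i in range(len(qa_pairs)):
        for j in range(i + 1, len(qa_pairs)):
            (q1, a1), (q2, a2) = qa_pairs[i], qa_pairs[j]
            if cosine_similarity(a1.split(), a2.split()) >= threshold:
                pairs.append((q1, q2))
    return pairs


# Hypothetical archive entries.
qa_pairs = [
    ("where is charlotte", "charlotte is in north carolina"),
    ("charlotte location", "charlotte is in north carolina usa"),
    ("best pizza", "try the place downtown"),
]
training_pairs = similar_question_pairs(qa_pairs, threshold=0.8)
```

The all-pairs loop is quadratic and only suitable for a toy example; on a large archive the comparison would more likely run through a retrieval engine.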
Word Relationship Types
P(Q|A)
Source – Answer ; Target – Question
P(A|Q)
Source – Question ; Target – Answer
P(Q|Q)
P(Q<->Q)
EM Algorithm
Find word relationships that maximize the
likelihood of sampling the target text from
the source text in training samples.
EM Algorithm (Contd..)
The translation probability from a source
word t to a target word w is given as
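A compact sketch of the standard EM procedure for IBM Model 1, which is the usual way such word-to-word probabilities are estimated. It omits the NULL word for brevity, and the function name, the iteration count, and the toy training pairs are assumptions rather than details taken from the slides.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate P(target_word | source_word) from (source, target) pairs.

    Returns a dict keyed by (target_word, source_word).
    """
    targets = {w for _, tgt in pairs for w in tgt}
    t = defaultdict(lambda: 1.0 / len(targets))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for w in tgt:
                z = sum(t[(w, s)] for s in src)
                for s in src:
                    c = t[(w, s)] / z
                    count[(w, s)] += c
                    total[s] += c
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return dict(t)


# Classic toy corpus, used here purely as an illustration.
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
]
t = train_model1(pairs)
```

Because "das" co-occurs with "the" in both pairs, the iterations shift probability mass toward that pairing, which in turn frees "haus" to align with "house".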
Examples
Examples (Contd..)
SUMMARY
Introduction
Q&A Retrieval
Test Collections
Translation Based Q&A retrieval
framework
Learning word-to-word translations
Coming Up Next…
Estimating Answer Quality
Experiments