slides - Stanford NLP Group

Query Segmentation and
Structured Annotation via NLP
Rifat Reza Joye
Panagiotis Papadimitriou
Problem
• Caloricious.com:
– Semantic search engine for food items
• Free-text queries over structured data
– Query: gluten free high protein bars
– Data: Each food item is a database record with attributes name,
brand, category, nutrients, allergens, ...
• Query segmentation and structured annotation
– Example: [gluten free] ALLERGEN, [high protein] NUTRIENT, [bars] CATEGORY
1st Approach
MEMM with Synthetic Training Data
• Looks like an instance of named entity recognition (NER)
• Problem: No labeled queries to train MEMM
• Solution: Generate synthetic labeled queries
– Query study on 100 queries
• 96% of queries contain 1–3 segments.
• In 98% of queries, one of the segments refers to Name,
Category, or Brand
– Algorithm
• Pick a food item at random
• Pick 1-3 attributes and generate a query
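The generation algorithm above could be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the records in FOOD_ITEMS, the attribute keys, and the function name synth_query are hypothetical, and the sketch always includes one of Name, Category, or Brand, which simplifies the 98% figure from the query study.

```python
import random

# Hypothetical toy records; the real database schema is not shown in the slides.
FOOD_ITEMS = [
    {"name": "protein crunch bar", "brand": "acme", "category": "bars",
     "nutrient": "high protein", "allergen": "gluten free"},
    {"name": "oat cookie", "brand": "bakewell", "category": "cookies",
     "nutrient": "low fat", "allergen": "nut free"},
]
CORE = ["name", "category", "brand"]


def synth_query(items, rng):
    """Return (tokens, labels) for one synthetic labeled query."""
    item = rng.choice(items)                 # pick a food item at random
    k = rng.randint(1, 3)                    # pick 1-3 attributes
    core = rng.choice(CORE)                  # assumption: always one core attribute
    others = [a for a in item if a != core]
    attrs = [core] + rng.sample(others, k - 1)
    rng.shuffle(attrs)                       # segment order is randomized
    tokens, labels = [], []
    for attr in attrs:
        for word in item[attr].split():
            tokens.append(word)
            labels.append(attr.upper())      # every token carries its segment label
    return tokens, labels


if __name__ == "__main__":
    print(synth_query(FOOD_ITEMS, random.Random(0)))
```

Each call yields token-level labels, which is the shape of training data a sequence tagger such as an MEMM expects.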
2nd Approach
Segmentation & MaxEnt Classification
Query Segmentation
• Train a language model on the structured data text
• Use the model to find segment probabilities
• Find the maximum-likelihood (ML) segmentation through dynamic programming (DP)
[Figure: the example query "gluten free high protein bars" split into segments]
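The segmentation steps above can be illustrated with a small dynamic program. This is a sketch under stated assumptions, not the model from the slides: a simple segment-frequency model stands in for the n-gram language model, the VALUES list is made-up attribute text, and the function names, the floor probability for unseen words, and max_len are all hypothetical choices.

```python
import math
from collections import Counter

# Hypothetical attribute values standing in for the structured data text.
VALUES = ["gluten free", "high protein", "bars",
          "nut free", "high fiber", "granola bars"]


def train_segment_model(values, max_len=3):
    """Count every contiguous word n-gram (n <= max_len) in the attribute values."""
    counts = Counter()
    for value in values:
        words = value.lower().split()
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                counts[" ".join(words[i:j])] += 1
    return counts, sum(counts.values())


def segment_logprob(seg, counts, total, floor=1e-6):
    """Log-probability of one candidate segment under the count model."""
    c = counts.get(seg, 0)
    if c:
        return math.log(c / total)
    if " " not in seg:
        return math.log(floor)   # unseen single word: small floor probability
    return float("-inf")         # unseen multi-word segment: disallowed


def segment(tokens, counts, total, max_len=3):
    """Most likely segmentation: best[i] is the top score for the first i tokens."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            seg = " ".join(tokens[start:end])
            score = best[start] + segment_logprob(seg, counts, total)
            if score > best[end]:
                best[end], back[end] = score, start
    segments, end = [], n
    while end > 0:               # follow back-pointers from the end of the query
        segments.append(" ".join(tokens[back[end]:end]))
        end = back[end]
    return segments[::-1]


counts, total = train_segment_model(VALUES)
print(segment("gluten free high protein bars".split(), counts, total))
# prints ['gluten free', 'high protein', 'bars']
```

The dynamic program considers every split point once per candidate segment length, so its cost grows linearly with query length for a fixed max_len.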
Segment Annotation
• Annotate each segment with an attribute using a MaxEnt classifier
• Training: for each attribute, the training examples come from the
corresponding entries of the products in the database
[Figure: the example query "gluten free high protein bars", with each segment annotated]
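To make the annotation step concrete, here is a minimal MaxEnt (multinomial logistic regression) classifier in plain Python, trained on attribute values the way the training bullet describes. It is a sketch only: the TRAIN pairs are invented, and the bag-of-words features, learning rate, regularization strength, and function names are assumptions rather than details from the slides.

```python
import math
from collections import defaultdict

# Hypothetical (attribute value, attribute) pairs drawn from database entries.
TRAIN = [
    ("gluten free", "ALLERGEN"), ("nut free", "ALLERGEN"), ("dairy free", "ALLERGEN"),
    ("high protein", "NUTRIENT"), ("low fat", "NUTRIENT"), ("high fiber", "NUTRIENT"),
    ("bars", "CATEGORY"), ("cookies", "CATEGORY"), ("granola bars", "CATEGORY"),
]


def features(seg):
    """Bag-of-words indicator features plus a bias feature (an assumed feature set)."""
    return ["w=" + w for w in seg.lower().split()] + ["bias"]


def predict_probs(weights, labels, feats):
    """Softmax over per-label sums of feature weights."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in labels}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train_maxent(examples, epochs=200, lr=0.5, l2=0.01):
    """Stochastic gradient ascent on the L2-regularized log-likelihood."""
    labels = sorted({y for _, y in examples})
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, gold in examples:
            feats = features(text)
            probs = predict_probs(weights, labels, feats)
            for y in labels:
                # Gradient for an indicator feature: observed minus expected count.
                grad = (1.0 if y == gold else 0.0) - probs[y]
                for f in feats:
                    weights[(f, y)] += lr * (grad - l2 * weights[(f, y)])
    return weights, labels


def classify(seg, weights, labels):
    """Return the most probable attribute for a segment."""
    probs = predict_probs(weights, labels, features(seg))
    return max(probs, key=probs.get)


weights, labels = train_maxent(TRAIN)
for seg in ["gluten free", "high protein", "bars"]:
    print(seg, "->", classify(seg, weights, labels))
```

Because the training examples come straight from the database, no hand-labeled queries are needed; each segment produced by the segmentation step is classified independently.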
Results
Accuracy of Segment Classification
[Figure: chart of segment classification accuracy (axis from 0 to 0.9) for the 1st Approach, the 2nd Approach (2-grams), and the 2nd Approach (3-grams)]
Conclusions – Future Work
• Combination of Language Model, Dynamic
Programming and MaxEnt classification
provides very good accuracy without labeled
data
• It would be interesting to compare with NER
on a big labeled set
• We also plan to compare with the state-of-the-art
algorithm in the context of a research
submission.
More Results…
• Evangelos
• March 12, 2011 @ 9:14 am
• 19.5 inches
• 6lbs 11oz