Transcript slides

Julius Information Extractor
June 14, 2006
Kyle Woodward
Lee-Ming Zen
The Problem
• There is a lot of text and information out there,
but very little of it is tagged. How can we
extract the information a user is interested in
without knowing anything beforehand?
Approach
• Based upon AT&T system
• Build up “spelling” and “context” rules
• Iteratively learn new rules: label the data with one set
of rules, examine the labels, then use them to learn rules
for the other set, alternating between the two
• Additional features
• Used a fixed-length prefix and suffix to augment
the context
• Substituted part-of-speech (POS) tags for a full grammar
parse as context
• Window bounds selection to determine tag size
• Web
• Use information from web search snippets
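
The alternation between spelling and context rules described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the seed rules, the precision threshold, the two-token window width, and the toy sentences are all assumptions made for the example, and rules here are unweighted feature-to-label mappings for brevity. POS tags and web snippets are left out.

```python
from collections import Counter, defaultdict

def spelling_features(tokens, start, end):
    """Features of the candidate span tokens[start:end] itself."""
    span = tokens[start:end]
    return {"full-string=" + " ".join(span)} | {"contains=" + t for t in span}

def context_features(tokens, start, end, width=2):
    """Fixed-length prefix and suffix around the candidate span (width is assumed)."""
    prefix = tokens[max(0, start - width):start]
    suffix = tokens[end:end + width]
    return {"prefix=" + " ".join(prefix), "suffix=" + " ".join(suffix)}

def learn_rules(feature_sets, labels, min_precision=0.9):
    """Keep a feature -> label rule when the feature almost always co-occurs with one label."""
    counts = defaultdict(Counter)
    for i, label in labels.items():
        for feat in feature_sets[i]:
            counts[feat][label] += 1
    rules = {}
    for feat, by_label in counts.items():
        label, hits = by_label.most_common(1)[0]
        if hits / sum(by_label.values()) >= min_precision:
            rules[feat] = label
    return rules

def apply_rules(feature_sets, rules):
    """Label each candidate on which all matching rules agree."""
    labels = {}
    for i, feats in enumerate(feature_sets):
        votes = {rules[f] for f in feats if f in rules}
        if len(votes) == 1:
            labels[i] = votes.pop()
    return labels

def bootstrap(candidates, seed_rules, rounds=3):
    """Jump between the two rule sets, starting from hypothetical seed spelling rules."""
    spell = [spelling_features(*c) for c in candidates]
    ctx = [context_features(*c) for c in candidates]
    spelling_rules, context_rules, labels = dict(seed_rules), {}, {}
    for _ in range(rounds):
        labels = apply_rules(spell, spelling_rules)       # label with spelling rules
        context_rules = learn_rules(ctx, labels)          # learn context rules from labels
        labels = apply_rules(ctx, context_rules)          # relabel with context rules
        spelling_rules = {**learn_rules(spell, labels), **seed_rules}
    return spelling_rules, context_rules, labels

# Toy data: (tokens, span start, span end). Entirely made up for illustration.
candidates = [
    ("spokesman for Acme Inc. said".split(), 2, 4),
    ("spokesman for Globex said".split(), 2, 3),
    ("according to Mr. Jones yesterday".split(), 2, 4),
    ("according to Smith yesterday".split(), 2, 3),
]
seed_rules = {"contains=Inc.": "ORG", "contains=Mr.": "PERSON"}
spelling_rules, context_rules, labels = bootstrap(candidates, seed_rules)
print(labels)
```

In this toy run the seeds only match "Acme Inc." and "Mr. Jones" directly; "Globex" and "Smith" pick up labels because they share a prefix and suffix with the seeded candidates, which is the jump from one rule set to the other.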
Rules
• A rule is a set of features for a particular
label, with a weight for each feature
• e.g. allcap, contains, full-string
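
One way to picture a rule as weighted features per label is the sketch below. The feature names allcap, contains, and full-string come from the slide; the weights, the labels ORG and PERSON, the additive scoring, and the function names are assumptions chosen for illustration only.

```python
from collections import defaultdict

def spelling_features(text):
    """Return the set of spelling features that fire for a candidate string."""
    feats = {"full-string=" + text}
    if text.isupper():
        feats.add("allcap")
    for token in text.split():
        feats.add("contains=" + token)
    return feats

def score(rules, feats):
    """Sum the weights of the firing features, separately for each label."""
    totals = defaultdict(float)
    for label, weights in rules.items():
        for feat in feats:
            totals[label] += weights.get(feat, 0.0)
    return dict(totals)

def classify(rules, feats):
    """Pick the highest-scoring label, or None when no feature carries weight."""
    totals = score(rules, feats)
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None

# Hypothetical rules: one weight table per label.
rules = {
    "ORG": {"allcap": 0.6, "contains=Inc.": 0.9},
    "PERSON": {"contains=Mr.": 0.95},
}

print(classify(rules, spelling_features("IBM")))
print(classify(rules, spelling_features("Mr. Smith")))
print(classify(rules, spelling_features("table")))
```

Summing weights is only one possible way to combine features; taking the single strongest feature per label would be an equally plausible reading of the slide.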
What’s Cool
• Generality
• No restrictions on the type of data it runs against
• No prior assumptions about the domain
• GUI tools
• Labeler
• Statistics viewer
• Works
• Works well on small data sets
What’s Not
• Fails on larger corpora
• Generality is a tradeoff: domain-specific
information cannot be exploited
• Web context does not necessarily help, because
search snippets are noisy