Hidden Markov Models
Applied to Information Extraction
Part I: Concept
HMM Tutorial
Part II: Sample Application
AutoBib: web information extraction
Larry Reeve
INFO629: Artificial Intelligence
Dr. Weber, Fall 2004
Part I: Concept
HMM Motivation
The real world has structures and processes which
have (or produce) observable outputs
Usually sequential (the process unfolds over time)
Cannot see the event producing the output
Example: speech signals
Problem: how to construct a model of the
structure or process given only observations
HMM Background
Basic theory developed and published in the 1960s and 70s
No widespread understanding and application until the late 80s
Why?
Theory was published in mathematics journals that were not widely read by
practicing engineers
Insufficient tutorial material for readers to understand and apply concepts
HMM Uses
Uses
Speech recognition: recognizing spoken words and phrases
Text processing: parsing raw records into structured records
Bioinformatics: protein sequence prediction
Financial: stock market forecasts (price pattern prediction)
Comparison shopping services
HMM Overview
Machine learning method
Makes use of state machines
Based on probabilistic models
Useful in problems having sequential steps
Can only observe output from states, not the
states themselves
Example: speech recognition
Observe: acoustic signals
Hidden states: phonemes (the distinctive sounds of a language)
(State machine diagram)
Observable Markov Model Example
Weather:
Once each day the weather is observed
State 1: rain; State 2: cloudy; State 3: sunny
Each state corresponds to a physical, observable event
What is the probability the weather for the next 7 days will be
sun, sun, rain, rain, sun, cloudy, sun?

State transition matrix:

             Rainy   Cloudy   Sunny
    Rainy     0.4     0.3      0.3
    Cloudy    0.2     0.6      0.2
    Sunny     0.1     0.1      0.8

(Observable Markov Model state diagram)
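With the weather directly observable, the probability of the 7-day sequence is simply the product of the successive transition probabilities. A minimal Python sketch of that calculation, assuming day 1 is known to be sunny:

    # Probability of a 7-day weather sequence under the observable Markov
    # model above, assuming the first day is known to be sunny.
    A = {  # state transition matrix from the slide
        "rain":   {"rain": 0.4, "cloudy": 0.3, "sunny": 0.3},
        "cloudy": {"rain": 0.2, "cloudy": 0.6, "sunny": 0.2},
        "sunny":  {"rain": 0.1, "cloudy": 0.1, "sunny": 0.8},
    }

    sequence = ["sunny", "sunny", "rain", "rain", "sunny", "cloudy", "sunny"]

    prob = 1.0   # P(day 1 = sunny) taken as 1
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= A[prev][cur]

    print(prob)   # 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 0.000192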
Hidden Markov Model Example
Coin toss:
Heads, tails sequence with 2 coins
You are in a room with a wall
A person behind the wall flips a coin and tells you the result
Coin selection and toss are hidden
You cannot observe the events, only their output (heads or tails)
The problem is then to build a model that explains the
observed sequence of heads and tails
HMM Components
A set of states (x’s)
A set of possible output symbols (y’s)
A state transition matrix (a’s): the probability of making a transition from one state to the next
An output emission matrix (b’s): the probability of emitting/observing a symbol at a particular state
An initial probability vector: the probability of starting in a particular state (not shown; sometimes assumed to be 1)
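A sketch of how these components might be written down for the two-coin example; all the numbers here are illustrative assumptions, not values from the slides:

    # Illustrative HMM for the two-coin example; every number is an assumption.
    states  = ["coin1", "coin2"]      # hidden states (x's)
    symbols = ["H", "T"]              # observable output symbols (y's)

    A = [[0.7, 0.3],                  # state transition matrix (a's)
         [0.4, 0.6]]                  # A[i][j] = P(next state j | current state i)

    B = [[0.5, 0.5],                  # output emission matrix (b's)
         [0.8, 0.2]]                  # B[i][k] = P(symbol k | state i)

    pi = [0.6, 0.4]                   # initial probability vector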
Common HMM Types
Ergodic (fully connected):
Every state of the model can be reached in
a single step from every other state
Bakis (left-right):
As time increases, states proceed from
left to right
HMM Core Problems
Three problems must be solved for HMMs to be
useful in real-world applications
1) Evaluation
2) Decoding
3) Learning
HMM Evaluation Problem
Purpose: score how well a given model matches
a given observation sequence
Example (Speech recognition):
Assume HMMs (models) have been built for words
‘home’ and ‘work’.
Given a speech signal, evaluation can determine the
probability that each model represents the utterance
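The evaluation problem is typically solved with the forward algorithm from Rabiner's tutorial. A minimal sketch for a 2-state, 2-symbol model; the model numbers are illustrative assumptions, not values from the slides:

    # Forward algorithm: P(observation sequence | model).
    # Toy 2-state model; all numbers are assumed for illustration.
    A  = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities
    B  = [[0.5, 0.5], [0.8, 0.2]]   # emission probabilities (symbols: H=0, T=1)
    pi = [0.6, 0.4]                 # initial state probabilities

    def forward(obs):
        n = len(pi)
        alpha = [pi[i] * B[i][obs[0]] for i in range(n)]              # initialization
        for o in obs[1:]:                                             # induction
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                     for j in range(n)]
        return sum(alpha)                                             # termination

    print(forward([0, 0, 1]))   # probability the model assigns to H, H, T

Scoring the same observation sequence against the 'home' and 'work' models and comparing the two results is the recognition decision described above.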
HMM Decoding Problem
Given a model and a set of observations, what
are the hidden states most likely to have
generated the observations?
Useful to learn about internal model structure,
determine state statistics, and so forth
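Decoding is normally done with the Viterbi algorithm, which finds the single most likely hidden state path. A minimal sketch using the same kind of toy 2-state model (all numbers are assumptions):

    # Viterbi algorithm: most likely hidden state sequence for an observation
    # sequence. Toy 2-state model; all numbers are assumed for illustration.
    A  = [[0.7, 0.3], [0.4, 0.6]]
    B  = [[0.5, 0.5], [0.8, 0.2]]   # symbols: H=0, T=1
    pi = [0.6, 0.4]

    def viterbi(obs):
        n = len(pi)
        delta = [pi[i] * B[i][obs[0]] for i in range(n)]   # best score ending in each state
        back = []                                          # backpointers per time step
        for o in obs[1:]:
            step, new_delta = [], []
            for j in range(n):
                best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
                step.append(best_i)
                new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
            back.append(step)
            delta = new_delta
        path = [max(range(n), key=lambda i: delta[i])]     # best final state
        for step in reversed(back):                        # walk the backpointers
            path.append(step[path[-1]])
        return list(reversed(path))

    print(viterbi([0, 0, 1, 1]))   # one hidden state index per observation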
HMM Learning Problem
Goal is to learn HMM parameters (training)
State transition probabilities
Observation probabilities at each state
Training is crucial:
it adapts the model parameters to observed training data,
producing the best models for the real-world phenomena being modeled
No known method for obtaining optimal parameters
from data – only approximations
Can be a bottleneck in HMM usage
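The usual approximation is iterative re-estimation (Baum-Welch, a form of expectation-maximization), which converges to a locally optimal model rather than a guaranteed global optimum. A simplified sketch of one re-estimation step for the transition matrix; the 2-state model and the training sequence are assumed toy values:

    # One Baum-Welch re-estimation step for the transition matrix A.
    # Toy 2-state, 2-symbol model; all numbers are assumed for illustration.
    A  = [[0.7, 0.3], [0.4, 0.6]]
    B  = [[0.5, 0.5], [0.8, 0.2]]
    pi = [0.6, 0.4]
    obs = [0, 1, 1, 0, 0, 1]                 # toy training observation sequence
    N, T = len(pi), len(obs)

    # Forward (alpha) and backward (beta) variables.
    alpha = [[0.0] * N for _ in range(T)]
    beta  = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
        beta[T - 1][i] = 1.0
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    prob = sum(alpha[T - 1][i] for i in range(N))   # P(obs | current model)

    # Expected counts -> re-estimated transition probabilities.
    A_new = [[0.0] * N for _ in range(N)]
    for i in range(N):
        gamma_i = sum(alpha[t][i] * beta[t][i] for t in range(T - 1)) / prob
        for j in range(N):
            xi_ij = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                        for t in range(T - 1)) / prob
            A_new[i][j] = xi_ij / gamma_i

    print(A_new)   # repeating this step (and the analogous B update) climbs to a local optimum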
HMM Concept Summary
Build models representing the hidden states of a
process or structure using only observations
Use the models to evaluate the probability that a model
generated a particular observation sequence
Use the evaluation information in an application to:
recognize speech, parse addresses, and many other
applications
Part II: Application
AutoBib System
Provide a uniform view of several computer science
bibliographic web data sources
An automated web information extraction system that
requires little human input
Web pages designed differently from site-to-site
IE requires training samples
HMMs are used to parse unstructured bibliographic
records into a structured format (a natural language processing task)
Web Information Extraction
Converting Raw Records
Approach
1) Provide seed database of structured records
2) Extract raw records from relevant Web pages
3) Match structured records to raw records
To build training samples
4) Train HMM-based parser
5) Parse unmatched raw records into structured records
6) Merge new structured records into database
AutoBib Architecture
Step 1 - Seeding
Provide seed database of structured records
Take small collection of BibTeX format records and
insert into database
Cleaning step normalizes record fields
Examples:
“Proc.” → “Proceedings”
“Jan” → “January”
Manual step, executed once only
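The cleaning step amounts to a lookup table of normalized forms. A rough sketch; every mapping beyond the two shown above, and the helper name, are assumptions:

    # Sketch of the field-normalization (cleaning) step.
    # Mappings other than "Proc." and "Jan" are assumed examples.
    NORMALIZE = {
        "Proc.": "Proceedings",
        "Jan": "January",
        "Feb": "February",
    }

    def clean_field(value):
        """Replace abbreviated tokens with their normalized forms."""
        return " ".join(NORMALIZE.get(tok, tok) for tok in value.split())

    print(clean_field("Proc. IDEAS, Jan 2004"))   # -> "Proceedings IDEAS, January 2004"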
Step 2 – Extract Raw Records
Extract raw records from relevant Web pages
User specifies
Web pages to extract from
How to follow ‘next page’ links for multiple pages
Raw records are extracted
Uses record-boundary discovery techniques
Subtree of Interest = largest subtree of HTML tags
Record separators = frequent HTML tags
Tokenized Records
(Replace all HTML tags with ^)
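A rough sketch of the tokenization described above: every HTML tag is replaced with a '^' separator and the record is split on it. The regular-expression approach and the sample record are assumptions, not AutoBib's actual code:

    import re

    # Tokenize a raw HTML record by turning each tag into a "^" separator.
    def tokenize(raw_html):
        text = re.sub(r"<[^>]+>", "^", raw_html)
        return [tok.strip() for tok in text.split("^") if tok.strip()]

    raw = "<li><b>J. Geng, J. Yang</b>: <i>Automatic Extraction of Bibliographic Information on the Web</i>, 2004</li>"
    print(tokenize(raw))
    # ['J. Geng, J. Yang', ':', 'Automatic Extraction of Bibliographic Information on the Web', ', 2004']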
Step 3 - Matching
Match raw records R to structured records S
Apply 4 heuristic tests:
1) At least one author in R matches an author in S
2) S.year must appear in R
3) If S.pages exists, R must contain it
4) S.title is ‘approximately contained’ in R
(approximate string matching via Levenshtein edit distance)
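A compact sketch of these four tests; the edit-distance threshold, the sliding-window check for 'approximately contained', and the record field names are all assumptions:

    # Heuristic matching of a raw record R (a string) against a structured record S (a dict).
    def edit_distance(a, b):
        """Levenshtein distance between strings a and b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def approx_contained(needle, haystack, max_ratio=0.2):
        """True if some window of haystack is within max_ratio edits of needle (threshold assumed)."""
        n = len(needle)
        windows = [haystack[i:i + n] for i in range(max(1, len(haystack) - n + 1))]
        return min(edit_distance(needle, w) for w in windows) <= max_ratio * n

    def matches(r, s):
        return (any(author.split()[-1] in r for author in s["authors"])   # 1) an author of S appears in R
                and s["year"] in r                                        # 2) S.year appears in R
                and (not s.get("pages") or s["pages"] in r)               # 3) S.pages, if present, appears in R
                and approx_contained(s["title"], r))                      # 4) S.title approx. contained in R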
Step 4 – Parser Training
Train HMM-based parser
For each pair of R and S that match, annotate
tokens in raw record with field names
Annotated raw records are fed into the HMM parser
to learn:
State transition probabilities
Symbol probabilities at each state
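Because the matched raw records arrive annotated with field names, the parser's probabilities can be learned by counting and normalizing. A minimal sketch; the field labels and the toy annotated record are made-up examples:

    from collections import Counter, defaultdict

    # Supervised HMM training by counting label transitions and token emissions.
    # The annotated record below is a made-up example.
    annotated = [("J. Geng, J. Yang", "author"), (":", "author-delimiter"),
                 ("Automatic Extraction of Bibliographic Information on the Web", "title"),
                 ("2004", "year")]

    trans = defaultdict(Counter)   # trans[prev_label][label] -> count
    emit  = defaultdict(Counter)   # emit[label][token]       -> count

    prev = "start"
    for token, label in annotated:
        trans[prev][label] += 1
        emit[label][token] += 1
        prev = label
    trans[prev]["end"] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    A = {s: normalize(c) for s, c in trans.items()}   # state transition probabilities
    B = {s: normalize(c) for s, c in emit.items()}    # symbol probabilities at each state
    print(A["start"])   # e.g. {'author': 1.0}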
Parser Training, continued
Key consideration is HMM structure for navigating
record fields (fields, delimiters)
Special states: start, end
Normal states: author, title, year, etc.
Best structure found:
Have multiple delimiter and tag states,
one for each normal state
Example: author-delimiter, author-tag
Sample HMM
(Method 3)
Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg
Step 5 - Conversion
Parse unmatched raw records into structured records
using the HMM parser
Matched raw records can be converted directly, without
parsing, because they were already annotated in the
matching step
Step 6 - Merging
Merge new structured records into database
Initial seed database has now grown
New records will be used for improved
matching on the next run
Evaluation
Success rate = (# of tokens labeled by HMM) / (# of tokens labeled by person)
DBLP (Computer Science Bibliography): 98.9%
CSWD (CompuScience WWW-Database): 93.4%
HMM Advantages / Disadvantages
Advantages
Effective
Can handle variations in record structure
Optional fields
Varying field ordering
Disadvantages
Requires training using annotated data
Not completely automatic
May require manual markup
Size of training data may be an issue
Other methods
Wrappers
Specification of areas of interest on Web page
Hand-crafted
Wrapper induction
Requires manual training
Not always accommodating to changing structure
Syntax-based; no semantic labeling
Application to Other Domains
E-Commerce
Comparison shopping sites
Extract product/pricing information from many sites
Convert information into structured format and store
Provide interface to look up product information and
then display pricing information gathered from many sites
Saves users time
Rather than navigating to and searching many sites, users
can consult a single site
References
Concept:
Rabiner, L. R. (1989). A Tutorial on Hidden Markov
Models and Selected Applications in Speech
Recognition. Proceedings of the IEEE, 77(2), 257-285.
Application:
Geng, J. and Yang, J. (2004). Automatic Extraction
of Bibliographic Information on the Web. Proceedings
of the 8th International Database Engineering and
Applications Symposium (IDEAS’04), 193-204.