Hidden Markov Models
Applied to Information Extraction

Part I: Concept - HMM Tutorial
Part II: Sample Application - AutoBib: web information extraction

Larry Reeve
INFO629: Artificial Intelligence
Dr. Weber, Fall 2004
Part I: Concept

HMM Motivation

The real world has structures and processes that have (or produce) observable outputs
  Usually sequential (the process unfolds over time)
  Cannot see the events producing the output
  Example: speech signals

Problem: how to construct a model of the structure or process given only the observations
HMM Background

Basic theory developed and published in the 1960s and 70s

No widespread understanding and application until the late 80s

Why?
  Theory was published in mathematics journals that were not widely read by practicing engineers
  Insufficient tutorial material for readers to understand and apply the concepts
HMM Uses

Speech recognition
  Recognizing spoken words and phrases
Text processing
  Parsing raw records into structured records
  Comparison shopping services
Bioinformatics
  Protein sequence prediction
Financial
  Stock market forecasts (price pattern prediction)
HMM Overview

Machine learning method
Makes use of state machines
Based on probabilistic models
Useful in problems having sequential steps
Can only observe output from states, not the states themselves

Example: speech recognition
  Observe: acoustic signals
  Hidden states: phonemes (the distinctive sounds of a language)

(State machine diagram)
Observable Markov Model Example

Weather
  Once each day the weather is observed
  State 1: rain
  State 2: cloudy
  State 3: sunny
  Each state corresponds to a physical observable event

State transition matrix:

           Rainy   Cloudy   Sunny
  Rainy     0.4     0.3      0.3
  Cloudy    0.2     0.6      0.2
  Sunny     0.1     0.1      0.8

What is the probability the weather for the next 7 days will be:
  sun, sun, rain, rain, sun, cloudy, sun?
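Because every state in this model is directly observable, the probability of the 7-day sequence is just a chain of transition probabilities, conditioned on the first observed day. A minimal sketch in Python, using the matrix above:

```python
# Observable Markov model: the sequence probability is a product of
# transition probabilities, conditioned on the first observed day.
A = {  # state transition matrix from the slide
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

sequence = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]

p = 1.0
for prev, cur in zip(sequence, sequence[1:]):
    p *= A[prev][cur]

print(f"P(sequence | day 1 is sunny) = {p:.6f}")  # 0.000192
```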
Hidden Markov Model Example

Coin toss: a heads/tails sequence produced with 2 coins
  You are in a room with a wall
  A person behind the wall flips a coin and tells you the result
Coin selection and tossing are hidden
  You cannot observe the events, only the output (heads, tails) of the events

The problem is then to build a model to explain the observed sequence of heads and tails
HMM Components

A set of states (x's)
A set of possible output symbols (y's)
A state transition matrix (a's)
  probability of making a transition from one state to the next
An output emission matrix (b's)
  probability of emitting/observing a symbol at a particular state
An initial probability vector
  probability of starting at a particular state
  Not shown, sometimes assumed to be 1

(HMM components diagram)
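To make the notation concrete, here is a minimal sketch of these components for the two-coin example above; every probability value is an illustrative assumption, not a number from the slides:

```python
import numpy as np

states  = ["coin1", "coin2"]   # hidden states (x's)
symbols = ["heads", "tails"]   # possible output symbols (y's)

A  = np.array([[0.7, 0.3],     # a[i][j]: P(next state j | current state i)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],     # b[i][k]: P(symbol k | state i)
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # initial probability vector

# Each row of A and B is a distribution, as is pi itself.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)
```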
Common HMM Types

Ergodic (fully connected):
  Every state of the model can be reached in a single step from every other state of the model
Bakis (left-right):
  As time increases, states proceed from left to right
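The two topologies differ only in which transitions are allowed. A sketch with illustrative numbers; the Bakis structure is simply a transition matrix with zeros below the diagonal:

```python
import numpy as np

# Ergodic: every entry is nonzero, so any state reaches any other in one step.
A_ergodic = np.array([[0.5, 0.3, 0.2],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.3, 0.5]])

# Bakis (left-right): zeros below the diagonal, so the state index can
# only stay the same or increase as time moves forward.
A_bakis = np.array([[0.6, 0.3, 0.1],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])
```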
HMM Core Problems

Three problems must be solved for HMMs to be useful in real-world applications:
1) Evaluation
2) Decoding
3) Learning
HMM Evaluation Problem

Purpose: score how well a given model matches a given observation sequence

Example (speech recognition):
  Assume HMMs (models) have been built for the words 'home' and 'work'
  Given a speech signal, evaluation can determine the probability that each model represents the utterance
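Evaluation is classically solved with the forward algorithm from the Rabiner tutorial. A minimal sketch; the two word models and the quantized observation symbols below are made-up stand-ins for real acoustic models:

```python
import numpy as np

def forward_probability(obs, A, B, pi):
    """Forward algorithm: returns P(obs | model).

    obs: list of symbol indices; A: (N, N) transition matrix;
    B: (N, M) emission matrix; pi: (N,) initial distribution.
    """
    alpha = pi * B[:, obs[0]]          # initialization with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction: advance one time step
    return alpha.sum()                 # termination: sum over final states

# Hypothetical 2-state word models for 'home' and 'work' (numbers made up).
pi = np.array([1.0, 0.0])
A_home = np.array([[0.8, 0.2], [0.0, 1.0]])
B_home = np.array([[0.9, 0.1], [0.2, 0.8]])
A_work = np.array([[0.5, 0.5], [0.0, 1.0]])
B_work = np.array([[0.3, 0.7], [0.6, 0.4]])

obs = [0, 0, 1]  # a quantized 'speech signal' (symbol indices)
scores = {"home": forward_probability(obs, A_home, B_home, pi),
          "work": forward_probability(obs, A_work, B_work, pi)}
print(scores, "->", max(scores, key=scores.get))
```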
HMM Decoding Problem

Given a model and a set of observations, what are the hidden states most likely to have generated the observations?

Useful to learn about internal model structure, determine state statistics, and so forth
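Decoding is classically solved with the Viterbi algorithm, also covered in the Rabiner tutorial. A minimal sketch, with an illustrative two-state model:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Viterbi algorithm: most likely hidden state path for obs."""
    N, T = A.shape[0], len(obs)
    delta = pi * B[:, obs[0]]              # best path probability so far
    back = np.zeros((T, N), dtype=int)     # backpointers
    for t in range(1, T):
        trans = delta[:, None] * A         # trans[i, j] = delta[i] * a[i][j]
        back[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]           # most probable final state
    for t in range(T - 1, 0, -1):          # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # illustrative two-state model
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
path, p = viterbi([0, 1, 1], A, B, pi)
print(path, p)                             # [0, 1, 1] and its probability
```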
HMM Learning Problem

Goal is to learn the HMM parameters (training)
  State transition probabilities
  Observation probabilities at each state

Training is crucial:
  it allows optimal adaptation of model parameters to observed training data using real-world phenomena

No known method obtains globally optimal parameters from data - only approximations

Can be a bottleneck in HMM usage
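The standard approximation is the Baum-Welch (forward-backward) re-estimation procedure from the Rabiner tutorial, an EM method that converges only to a local optimum, which is part of why training can be a bottleneck. A minimal sketch of one re-estimation step, with illustrative starting parameters:

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One Baum-Welch (EM) re-estimation step; returns updated A, B, pi."""
    N, M, T = A.shape[0], B.shape[1], len(obs)

    # Forward and backward passes.
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood      # P(state i at time t | obs)

    # xi[t, i, j] = P(state i at t and state j at t+1 | obs)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    xi /= likelihood

    # Re-estimate parameters from expected counts.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(M):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi

A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # illustrative starting guesses
B  = np.array([[0.6, 0.4], [0.3, 0.7]])
pi = np.array([0.5, 0.5])
for _ in range(10):                        # a few EM iterations
    A, B, pi = baum_welch_step([0, 1, 1, 0, 1], A, B, pi)
```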
HMM Concept Summary

Build models representing the hidden states of a process or structure using only observations

Use the models to evaluate the probability that a model represents a particular observation sequence

Use the evaluation information in an application to recognize speech, parse addresses, and many other tasks
Part II: Application

AutoBib System

Provides a uniform view of several computer science bibliographic web data sources

An automated web information extraction system that requires little human input
  Web pages are designed differently from site to site
  IE requires training samples

HMMs are used to parse unstructured bibliographic records into a structured format: NLP
Web Information Extraction
Converting Raw Records
Approach
1) Provide seed database of structured records
2) Extract raw records from relevant Web pages
3) Match structured records to raw records
   (to build training samples)
4) Train HMM-based parser
5) Parse unmatched raw records into structured records
6) Merge new structured records into database
AutoBib Architecture
Step 1 - Seeding

Provide seed database of structured records

Take a small collection of BibTeX-format records and insert them into the database

A cleaning step normalizes record fields
  Examples:
    "Proc." -> "Proceedings"
    "Jan" -> "January"

Manual step, executed once only
Step 2 - Extract Raw Records

Extract raw records from relevant Web pages

User specifies:
  Web pages to extract from
  How to follow 'next page' links for multiple pages

Raw records are extracted using record-boundary discovery techniques
  Subtree of Interest = largest subtree of HTML tags
  Record separators = frequent HTML tags

Tokenized records: replace all HTML tags with ^
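A sketch of the tokenization step as described on the slide; the regex-based tag replacement is an assumption about how this might be done, not AutoBib's actual tokenizer:

```python
import re

def tokenize_raw_record(html: str) -> str:
    """Replace every HTML tag with the '^' marker, per the slide."""
    return re.sub(r"<[^>]+>", "^", html)

raw = "<li><b>J. Smith</b>, <i>A Sample Title</i>, 1999.</li>"
print(tokenize_raw_record(raw))
# ^^J. Smith^, ^A Sample Title^, 1999.^
```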
Step 3 - Matching

Match raw records R to structured records S

Apply 4 heuristic tests:
1) Match at least one author in R to an author in S
2) S.year must appear in R
3) If S.pages exists, R must contain it
4) S.title is 'approximately contained' in R
   (Levenshtein edit distance - approximate string matching)
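A sketch of the 'approximately contained' test: slide a window the length of S.title across R and accept if some window is within a small edit distance of the title. The windowing strategy and the 10% threshold are assumptions, not the paper's exact procedure:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def approx_contained(title: str, raw: str, max_ratio: float = 0.1) -> bool:
    """True if some window of raw is within a small edit distance of title.

    The sliding-window strategy and the 10% threshold are assumptions.
    """
    n = len(title)
    windows = range(max(1, len(raw) - n + 1))
    best = min(levenshtein(title, raw[i:i + n]) for i in windows)
    return best <= max_ratio * n

print(approx_contained("A Sample Title", "^J. Smith^, ^A Sample Title^, 1999.^"))  # True
```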
Step 4 - Parser Training

Train the HMM-based parser

For each pair of R and S that match, annotate the tokens in the raw record with field names

Annotated raw records are fed into the HMM parser in order to learn:
  State transition probabilities
  Symbol probabilities at each state
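Because the matched records are fully annotated, both probability tables can be estimated directly from counts. A minimal sketch; the label set and record format are illustrative, and the actual AutoBib parser may smooth and structure these differently:

```python
from collections import Counter, defaultdict

def train_from_annotations(annotated_records):
    """Estimate transition and symbol probabilities from labeled tokens.

    annotated_records: list of [(token, field_label), ...] sequences.
    """
    trans = defaultdict(Counter)   # trans[s][s']: count of transitions s -> s'
    emit = defaultdict(Counter)    # emit[s][tok]: count of tok emitted in s
    for record in annotated_records:
        prev = "start"
        for token, label in record:
            trans[prev][label] += 1
            emit[label][token] += 1
            prev = label
        trans[prev]["end"] += 1    # special end state

    def normalize(counters):
        out = {}
        for state, counter in counters.items():
            total = sum(counter.values())
            out[state] = {k: v / total for k, v in counter.items()}
        return out

    return normalize(trans), normalize(emit)

A, B = train_from_annotations(
    [[("J.", "author"), ("Smith", "author"), ("1999", "year")]])
print(A["author"])   # {'author': 0.5, 'year': 0.5}
```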
Parser Training, continued

Key consideration is the HMM structure for navigating record fields (fields, delimiters)

Special states
  start, end
Normal states
  author, title, year, etc.

Best structure found:
  multiple delimiter and tag states, one for each normal state
  Example: author-delimiter, author-tag

Sample HMM (Method 3)
Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg
Step 5 - Conversion

Parse unmatched raw records into structured records using the HMM parser

Matched raw records can be converted directly without parsing because they were annotated in the matching step
Step 6 - Merging

Merge new structured records into database

Initial seed database has now grown

New records will be used for improved matching on the next run
Evaluation

Success rate = (# of tokens labeled by the HMM) / (# of tokens labeled by a person)

DBLP (Computer Science Bibliography): 98.9%
CSWD (CompuScience WWW-Database): 93.4%
HMM Advantages / Disadvantages

Advantages
  Effective
  Can handle variations in record structure
    Optional fields
    Varying field ordering

Disadvantages
  Requires training using annotated data
    Not completely automatic
    May require manual markup
  Size of training data may be an issue
Other methods

Wrappers
  Specification of areas of interest on a Web page
  Hand-crafted
Wrapper induction
  Requires manual training
  Not always accommodating to changing structure
  Syntax-based; no semantic labeling
Application to Other Domains

E-Commerce
  Comparison shopping sites
    Extract product/pricing information from many sites
    Convert the information into a structured format and store it
    Provide an interface to look up product information and then display pricing information gathered from many sites
  Saves users time
    Rather than navigating to and searching many sites, users can consult a single site
References

Concept:
  Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.

Application:
  Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS'04), 193-204.