Transcript Simple Text
Information Retrieval
CSE 8337
Spring 2005
Simple Text Processing
Material for these slides obtained from:
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
Text Processing TOC
Simple Text Storage
String Matching
String-to-String Correction
(Approximate matching)
CSE 8337 Spring 2005
2
Text storage
EBCDIC/ASCII
Array of character
Linked list of character
Trees- B Tree, Trie
Stuart E. Madnick, “String Processing
Techniques,” Communications of the ACM, Vol
10, No 7, July 1967, pp 420-424.
CSE 8337 Spring 2005
3
Pattern Matching(Recognition)
Pattern Matching: finds occurrences
of a predefined pattern in the data.
Applications include speech recognition,
information retrieval, time series
analysis.
CSE 8337 Spring 2005
4
Similarity Measures
Determine similarity between two objects.
Similarity characteristics:
Alternatively, distance measures measure
how unlike or dissimilar objects are.
CSE 8337 Spring 2005
5
String Matching Problem
Input:
Pattern – length m
Text string – length n
Find one (next, all) occurrences of
string in pattern
Ex:
String: 00110011011110010100100111
Pattern: 011010
CSE 8337 Spring 2005
6
String Matching Algorithms
Brute Force
Kknuth-Morris Pratt
Boyer Moore
P209 in text
CSE 8337 Spring 2005
7
Brute Force String Matching
Brute Force
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html
Space O(m+n)
Time O(mn)
00110011011110010100100111
011010
011010
011010
CSE 8337 Spring 2005
8
FSR
CSE 8337 Spring 2005
9
Creating FSR
Create FSM:
Construct the “correct” spine.
Add a default “failure bus” to state 0.
Add a default “initial bus” to state 1.
For each state, decide its attachments to
failure bus, initial bus, or other failure links.
CSE 8337 Spring 2005
10
Knuth-Morris-Pratt
Apply FSM to string by processing characters
one at a time.
Accepting state is reached when pattern is
found.
Space O(m+n)
Time O(m+n)
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
CSE 8337 Spring 2005
11
Boyer-Moore
Scan pattern from right to left
Skip many positions on illegal character
string.
O(mn)
Expected time better than KMP
Expected behavior better
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
CSE 8337 Spring 2005
12
String-to-String Correction
Measure of similarity between strings
Can be used to determine how to convert
from one string to another
Cost to convert one to the other
Transformations
Match: Current characters in both strings are the
same
Delete: Delete current character in input string
Insert: Insert current character in target string
into string
CSE 8337 Spring 2005
13
Distance Between Strings
CSE 8337 Spring 2005
14
Approximate String Matching
Find patterns “close to” the string
Fuzzy matching
Applications:
Spelling checkers
IR
Define similarity (distance) between
string and pattern
CSE 8337 Spring 2005
15