Transcript Simple Text

Information Retrieval
CSE 8337
Spring 2005
Simple Text Processing
Material for these slides obtained from:
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book

Text Processing TOC



Simple Text Storage
String Matching
String-to-String Correction
(Approximate matching)
CSE 8337 Spring 2005
2
Text storage





EBCDIC/ASCII
Array of character
Linked list of character
Trees- B Tree, Trie
Stuart E. Madnick, “String Processing
Techniques,” Communications of the ACM, Vol
10, No 7, July 1967, pp 420-424.
CSE 8337 Spring 2005
3
Pattern Matching(Recognition)


Pattern Matching: finds occurrences
of a predefined pattern in the data.
Applications include speech recognition,
information retrieval, time series
analysis.
CSE 8337 Spring 2005
4
Similarity Measures



Determine similarity between two objects.
Similarity characteristics:
Alternatively, distance measures measure
how unlike or dissimilar objects are.
CSE 8337 Spring 2005
5
String Matching Problem

Input:




Pattern – length m
Text string – length n
Find one (next, all) occurrences of
string in pattern
Ex:


String: 00110011011110010100100111
Pattern: 011010
CSE 8337 Spring 2005
6
String Matching Algorithms




Brute Force
Kknuth-Morris Pratt
Boyer Moore
P209 in text
CSE 8337 Spring 2005
7
Brute Force String Matching

Brute Force
 Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html


Space O(m+n)
Time O(mn)
00110011011110010100100111
011010
011010
011010
CSE 8337 Spring 2005
8
FSR
CSE 8337 Spring 2005
9
Creating FSR

Create FSM:




Construct the “correct” spine.
Add a default “failure bus” to state 0.
Add a default “initial bus” to state 1.
For each state, decide its attachments to
failure bus, initial bus, or other failure links.
CSE 8337 Spring 2005
10
Knuth-Morris-Pratt





Apply FSM to string by processing characters
one at a time.
Accepting state is reached when pattern is
found.
Space O(m+n)
Time O(m+n)
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
CSE 8337 Spring 2005
11
Boyer-Moore

Scan pattern from right to left
Skip many positions on illegal character
string.
O(mn)
Expected time better than KMP
Expected behavior better

Handbook of Algorithms and Data Structures




http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
CSE 8337 Spring 2005
12
String-to-String Correction




Measure of similarity between strings
Can be used to determine how to convert
from one string to another
Cost to convert one to the other
Transformations



Match: Current characters in both strings are the
same
Delete: Delete current character in input string
Insert: Insert current character in target string
into string
CSE 8337 Spring 2005
13
Distance Between Strings
CSE 8337 Spring 2005
14
Approximate String Matching



Find patterns “close to” the string
Fuzzy matching
Applications:



Spelling checkers
IR
Define similarity (distance) between
string and pattern
CSE 8337 Spring 2005
15