Transcript document
ISP 433/633 Week 4
Text Operations, Indexing and Search
Document Process Steps
Example Collection
Documents
D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.
Step 1: Parse Text Into Words
• break at spaces and punctuation
D1: IT IS A DOG EAT DOG WORLD
D2: WHILE THE WORLD SLEEPS
D3: LET SLEEPING DOGS LIE
D4: I WILL EAT MY HAT
D5: MY DOG WEARS A HAT
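A minimal Python sketch of this parsing step (the regex and the uppercasing are assumptions chosen to reproduce the example output, not part of the original slides):

    import re

    def parse_words(text):
        # Break at spaces and punctuation; uppercase to match the example.
        return [w.upper() for w in re.findall(r"[A-Za-z]+", text)]

    docs = {
        "D1": "It is a dog eat dog world!",
        "D2": "While the world sleeps.",
        "D3": "Let sleeping dogs lie.",
        "D4": "I will eat my hat.",
        "D5": "My dog wears a hat.",
    }
    for doc_id, text in docs.items():
        print(doc_id, parse_words(text))
    # D1 ['IT', 'IS', 'A', 'DOG', 'EAT', 'DOG', 'WORLD'], etc.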
Step 2: Stop Word Elimination
• Remove non-distinguishing words
  – Pronouns, prepositions, articles, …
  – Forms of to be, to have, to do
• I, MY, IT, YOUR, …, OF, BY, ON, …, A, THE, THIS, …, IS, HAS, WILL, …
D1: DOG EAT DOG WORLD
D2: WORLD SLEEPS
D3: LET SLEEPING DOGS LIE
D4: EAT HAT
D5: DOG WEARS HAT
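Stop word elimination reduces to a set-membership filter. A sketch, using an abbreviated stop list chosen to reproduce the example above (a real system would load the full 250-300 word list):

    STOP_WORDS = {"I", "MY", "IT", "YOUR", "OF", "BY", "ON", "A", "THE",
                  "THIS", "IS", "HAS", "WILL", "WHILE"}  # abbreviated

    def remove_stop_words(tokens):
        # Keep only the distinguishing (non-stop) words.
        return [t for t in tokens if t not in STOP_WORDS]

    print(remove_stop_words(["IT", "IS", "A", "DOG", "EAT", "DOG", "WORLD"]))
    # ['DOG', 'EAT', 'DOG', 'WORLD']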
Stop Words List
• The 250-300 most common words in English account for 50% or more of a given text.
  – Example: “the” and “of” represent 10% of tokens; “and”, “to”, “a”, and “in” another 10%; the next 12 words another 10%.
• Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens).
  – The top 65 types cover 1132 tokens (> 50%).
  – Token/type ratio: 2256/859 = 2.63
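Statistics like these are easy to reproduce on any text with a frequency counter. A sketch using the simplified tokenization from Step 1 (the filename is hypothetical):

    import re
    from collections import Counter

    def type_token_stats(text):
        tokens = [w.upper() for w in re.findall(r"[A-Za-z]+", text)]
        counts = Counter(tokens)
        n_tokens, n_types = len(tokens), len(counts)
        top65 = sum(c for _, c in counts.most_common(65))
        print(f"types={n_types} tokens={n_tokens} "
              f"token/type={n_tokens / n_types:.2f} "
              f"top-65 coverage={top65 / n_tokens:.0%}")

    # e.g. type_token_stats(open("moby_dick_ch1.txt").read())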
Step 3: Stemming
• Goal: “normalize” similar words
D1: DOG EAT DOG WORLD
D2: WORLD SLEEP
D3: LET SLEEP DOG LIE
D4: EAT HAT
D5: DOG WEAR HAT
Stemming and Morphological Analysis
• Morphology (the “form” of words)
  – Inflectional morphology
    • E.g., inflected verb endings and noun number
    • Never changes grammatical class
      – dog, dogs
  – Derivational morphology
    • Derives one word from another
    • Often changes grammatical class
      – build, building; health, healthy
Simple “S” Stemming
• IF a word ends in “ies”, but not “eies” or “aies”
  – THEN “ies” → “y”
• IF a word ends in “es”, but not “aes”, “ees”, or “oes”
  – THEN “es” → “e”
• IF a word ends in “s”, but not “us” or “ss”
  – THEN “s” → NULL
Harman, JASIS 1991
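Harman's three rules translate directly into code. A sketch (the rules are tried in order and at most one fires):

    def s_stem(word):
        # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
        if word.endswith("ies") and not word.endswith(("eies", "aies")):
            return word[:-3] + "y"
        # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
        if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
            return word[:-1]
        # Rule 3: "s" -> NULL, unless the word ends in "us" or "ss".
        if word.endswith("s") and not word.endswith(("us", "ss")):
            return word[:-1]
        return word

    for w in ["ponies", "dogs", "sleeps", "wears", "glass", "corpus"]:
        print(w, "->", s_stem(w))
    # ponies -> pony, dogs -> dog, sleeps -> sleep, wears -> wear,
    # glass -> glass, corpus -> corpus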
Porter’s Algorithm
• An effective, simple and popular English stemmer
• Official URL: http://www.tartarus.org/~martin/PorterStemmer/
• A demo: http://snowball.tartarus.org/demo.php
Porter’s Algorithm
• 1. The measure, m, of a stem counts sequences of vowels followed by a consonant. If V is a sequence of vowels and C is a sequence of consonants, then every stem matches
  C(VC)^m V
  where the initial C and final V are optional and m is the number of VC repeats.
  – m=0: free, why
  – m=1: frees, whose
  – m=2: prologue, compute
• 2. *<X> – the stem ends with the letter X
• 3. *v* – the stem contains a vowel
• 4. *d – the stem ends in a double consonant
• 5. *o – the stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
Porter, Program 1980
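The measure m can be computed by mapping letters to v/c and counting VC transitions. A simplified sketch (Porter's context-dependent treatment of "y" is ignored here):

    import re

    def measure(stem):
        # Map each letter to 'v' (vowel) or 'c' (consonant); a,e,i,o,u only.
        form = "".join("v" if ch in "aeiou" else "c" for ch in stem.lower())
        # Collapse runs (e.g. "ccvvc" -> "cvc"), then count VC pairs.
        collapsed = re.sub(r"(.)\1+", r"\1", form)
        return collapsed.count("vc")

    for w in ["free", "why", "frees", "whose", "prologue", "compute"]:
        print(w, measure(w))   # 0 0 1 1 2 2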
Porter’s Algorithm
• Suffix conditions take the form current_suffix == pattern
• Actions take the form old_suffix -> new_suffix
• Rules are divided into steps that define the order in which the rules are applied. Some example rules:

  STEP  CONDITION  SUFFIX  REPLACEMENT  EXAMPLE
  1a    NULL       sses    ss           stresses -> stress
  1b    *v*        ing     NULL         making -> mak
  1b1   NULL       at      ate          inflat(ed) -> inflate
  1c    *v*        y       i            happy -> happi
  2     m>0        aliti   al           formaliti -> formal
  3     m>0        icate   ic           duplicate -> duplic
  4     m>1        able    NULL         adjustable -> adjust
  5a    m>1        e       NULL         inflate -> inflat
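NLTK ships an implementation of this stemmer. A usage sketch (outputs can differ slightly from the single rules in the table above, because the full rule cascade is applied):

    # pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for w in ["stresses", "happy", "formaliti", "duplicate", "adjustable"]:
        print(w, "->", stemmer.stem(w))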
Problems of Porter’s Algorithm
• Unreadable results
• Does not handle some irregular verbs and adjectives
  – take/took
  – bad/worse
• Possible errors:

  TOO AGGRESSIVE        TOO TIMID
  organization/organ    relatedness/related
  executive/execute     create/creation
Step 4: Indexing
• Inverted Files
Vocabulary  Occurrences
DOG         D1, D3, D5
EAT         D1, D4
HAT         D4, D5
LET         D3
LIE         D3
SLEEP       D2, D3
WEAR        D5
WORLD       D1, D2
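Building this inverted file is a single pass over the stemmed collection. A sketch:

    from collections import defaultdict

    def build_inverted_index(docs):
        # Map each vocabulary term to the set of documents it occurs in.
        index = defaultdict(set)
        for doc_id, terms in docs.items():
            for term in terms:
                index[term].add(doc_id)
        return {t: sorted(ids) for t, ids in sorted(index.items())}

    # The collection after stop word removal and stemming (Step 3).
    docs = {
        "D1": ["DOG", "EAT", "DOG", "WORLD"],
        "D2": ["WORLD", "SLEEP"],
        "D3": ["LET", "SLEEP", "DOG", "LIE"],
        "D4": ["EAT", "HAT"],
        "D5": ["DOG", "WEAR", "HAT"],
    }
    for term, postings in build_inverted_index(docs).items():
        print(term, postings)
    # DOG ['D1', 'D3', 'D5']
    # EAT ['D1', 'D4'], etc.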
Inverted Files
• Occurrences can point to
– Documents
– Positions in a document
– Weights (a positional sketch follows this list)
• Most commonly used indexing method
• Based on words
– Queries such as phrases are expensive to solve
– Some data does not have words
• Genetic data
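A sketch of positional postings, which make phrase queries answerable by intersecting positions (the representation is an assumption; the slides only name the idea):

    from collections import defaultdict

    def positional_index(docs):
        # Each occurrence points to a (document, position) pair.
        index = defaultdict(list)
        for doc_id, terms in docs.items():
            for pos, term in enumerate(terms):
                index[term].append((doc_id, pos))
        return dict(index)

    idx = positional_index({"D1": ["DOG", "EAT", "DOG", "WORLD"]})
    print(idx["DOG"])   # [('D1', 0), ('D1', 2)]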
Suffix Trees
1234567890123456789012345678901234567890123456789012345678901234567
This is a text. A text has many words. Words are made from letters.
Patricia tree
[Figure: the Patricia tree for the text above; internal nodes branch on letters (t, e, x, w, o, r, d, l, m, n, a, …) and the leaves store the text positions 11, 19, 28, 33, 40, 50, 60 where the indexed suffixes begin.]
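A suffix array, the sorted list of suffix start positions, is a compact cousin of the suffix tree and easy to sketch (this naive construction is for illustration only; real systems also index only word-start positions, as in the figure):

    def suffix_array(text):
        # Sort every suffix start position by the suffix it begins.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find(pattern, text, sa):
        # Linear scan for clarity; a real implementation binary-searches,
        # since all suffixes starting with pattern form one contiguous run.
        return sorted(i for i in sa if text[i:i + len(pattern)] == pattern)

    text = "This is a text. A text has many words. Words are made from letters."
    print(find("text", text, suffix_array(text)))
    # [10, 18]  (0-based; the figure's ruler is 1-based: 11 and 19)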
Text Compression
• Represent text in fewer bits
• Symbols to be compressed are words
• Method of choice
– Huffman coding
Huffman Coding
• Developed by David Huffman (1952)
• Averages about 5 bits per character on English text
• Based on frequency distributions of
symbols
• Idea: assign shorter code to more
frequent symbols
• Algorithm: iteratively build a tree of
symbols starting with the two least
frequent symbols
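A sketch of the algorithm using a min-heap, fed with the symbol frequencies from the example that follows (tie-breaking, and therefore the exact codes, may differ from the tree shown there):

    import heapq

    def huffman_codes(freqs):
        # Heap entries: (total frequency, tie-breaker, {symbol: code-so-far}).
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            # Merge the two least frequent subtrees, prefixing 0/1 to codes.
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
             "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
    print(huffman_codes(freqs))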
An Example
Symbol  Frequency
A       7
B       4
C       10
D       5
E       2
F       11
G       15
H       3
I       7
J       8

[Figure: Huffman tree built from these frequencies; edges labeled 0 and 1, with leaves a-j for the ten symbols.]
Example Coding
Symbol  Code
A       0110
B       0010
C       000
D       0011
E       01110
F       010
G       10
H       01111
I       110
J       111

[Figure: the same Huffman tree, with each symbol's code read off the 0/1 edge labels on the path from the root to its leaf.]
Exercise
• Consider the bit string:
  011011011110001001100011101001110001101011010111
• Use the Huffman code from the example to decode it.
• Try inserting, deleting, or flipping a few bits at random locations, then decode again.
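A decoding sketch for the exercise, driven by the code table above; thanks to the prefix property, the first codeword match is always the right one:

    CODE = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
            "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}
    DECODE = {bits: sym for sym, bits in CODE.items()}

    def decode(bits):
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in DECODE:          # prefix property: match is final
                out.append(DECODE[buf])
                buf = ""
        return "".join(out), buf       # leftover bits signal corruption

    print(decode("011011011110001001100011101001110001101011010111"))
    # ('AIJGBADGGEDFIGJ', '')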
Huffman Code
• Prefix property
  – No codeword is a prefix of any other codeword
• Random access
  – Decompression can start from anywhere
• Not the fastest method
Sequential string searching
• Boyer-Moore algorithm
• Example: search for “cats” in “the catalog of all cats”
• Some preprocessing of the pattern is needed.
• Demo: http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
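A sketch of the idea using Horspool's simplification of Boyer-Moore (bad-character shifts only; the full algorithm adds a good-suffix rule):

    def horspool_search(pattern, text):
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return -1
        # Preprocessing: for each pattern character (except the last), how
        # far the pattern may shift when that character sits under its end.
        shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
        i = 0
        while i <= n - m:
            if text[i:i + m] == pattern:
                return i                        # match at position i
            i += shift.get(text[i + m - 1], m)  # skip ahead via the table
        return -1

    print(horspool_search("cats", "the catalog of all cats"))   # 19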