Text operations
Prepared By : Loay Alayadhi
425121605
Supervised by: Dr. Mourad Ykhlef
Document Preprocessing
• Lexical analysis
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms
• Construction of term categorization structures (thesaurus)
Logical View of a Document
[Figure: logical view of a document — the full text is reduced to index terms through structure recognition (document text + structure), handling of accents and spacing, stopword removal, detection of noun groups, and stemming; indexing may be automatic or manual.]
1)Lexical Analysis of the Text
• Lexical analysis converts an input stream of characters into a stream of words.
– The major objective is the identification of the words in the text. How?
• Digits: ignoring numbers is a common choice
• Hyphens: e.g., state-of-the-art
• Punctuation marks: remove them. Exception: 510B.C
• Case: usually folded to a single case (case folding)
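The decisions above can be sketched as a minimal lexical analyzer (the regex and the digit/hyphen policies are illustrative assumptions, not the chapter's algorithm; note that exceptions like "510B.C" would need special-case handling this sketch omits):

```python
import re

def tokenize(text):
    # case folding, then split on any run of non-alphanumeric characters
    # (this also breaks hyphenated forms such as "state-of-the-art");
    # pure digit tokens are dropped, since ignoring numbers is common
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and not t.isdigit()]
```

For example, `tokenize("State-of-the-art systems, in 2024!")` yields `["state", "of", "the", "art", "systems", "in"]`.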
2) Elimination of Stopwords
• Words that appear too often are not useful for IR.
• Stopwords: words that appear in more than 80% of the documents in the collection are filtered out as potential index words.
• Problem
– Search for “to be or not to be”?
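Stopword elimination is a simple filter; the list below is a tiny illustrative set (real systems use lists of a few hundred words, or the more-than-80%-of-documents criterion above):

```python
# toy stopword list for illustration only
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "not", "be", "is"}

def remove_stopwords(tokens):
    # keep only tokens that are not stopwords
    return [t for t in tokens if t not in STOPWORDS]
```

This also shows the problem: `remove_stopwords("to be or not to be".split())` returns `[]`, so the famous query cannot be answered at all.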
3) Stemming
• A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
• Example
– connect: connected, connecting, connection, connections
• Removal strategies
– affix removal: intuitive, simple
– table lookup
– successor variety
– n-gram
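A naive affix-removal stemmer along these lines (the suffix list and minimum-stem-length rule are illustrative assumptions; real stemmers such as Porter's algorithm apply ordered rules with conditions on the stem):

```python
# suffixes tried longest-first so "ions" wins over "ion" and "s"
SUFFIXES = ["ions", "ion", "ings", "ing", "ed", "s"]

def stem(word):
    for suffix in SUFFIXES:
        # strip the first matching suffix, keeping a stem of >= 3 letters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, connected, connecting, connection, and connections all reduce to the stem "connect".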
4) Index Terms Selection
• Motivation
– A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
– Most of the semantics is carried by the nouns.
• Identification of noun groups
– A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold.
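The threshold rule can be sketched as follows (a sketch under the assumption that part-of-speech tags are already available from some tagger; the tag set and threshold value are illustrative):

```python
def noun_groups(tagged_tokens, threshold=2):
    # tagged_tokens: list of (word, tag) pairs; nouns whose distance
    # in the token stream is within `threshold` join the same group
    groups, current, last_pos = [], [], None
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag != "NOUN":
            continue
        if last_pos is not None and i - last_pos > threshold:
            groups.append(current)   # gap too large: close the group
            current = []
        current.append(word)
        last_pos = i
    if current:
        groups.append(current)
    return groups
```

For instance, in "the information retrieval system ran quickly over large collections", the first three nouns form one group and "collections" starts another.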
5) Thesaurus Construction
• Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in the list, there is a set of related words.
• A controlled vocabulary for indexing and searching. Why?
– normalization,
– concept indexing,
– noise reduction,
– identification, etc.
The Purpose of a Thesaurus
• To provide a standard vocabulary for
indexing and searching
• To assist users with locating terms for
proper query formulation
• To provide classified hierarchies that
allow the broadening and narrowing of
the current query request
Thesaurus (cont.)
• Unlike a common dictionary
– which lists words with their explanations
• May contain all the words of a language
• Or only the words of a specific domain
• With a lot of other information, especially the relationships between words
– Classification of words in the language
– Word relationships such as synonyms and antonyms
Roget’s Thesaurus
Example: cowardly, adjective (Arabic: جبان, “coward”)
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
• http://www.thesaurus.com
• http://www.dictionary.com/
Thesaurus Term Relationships
• BT: broader term
• NT: narrower term
• RT: related term (non-hierarchical)
Use of Thesaurus
• Indexing
Select the most appropriate thesaurus entries for
representing the document.
• Searching
Design the most appropriate search strategy.
– If the search does not retrieve enough
documents, the thesaurus can be used to expand
the query.
– If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.
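Query expansion with a thesaurus can be sketched as a simple lookup (the tiny hand-built thesaurus here is purely illustrative; a real controlled vocabulary would distinguish BT/NT/RT relations):

```python
# toy thesaurus mapping each term to its related terms
THESAURUS = {
    "cowardly": ["craven", "faint-hearted", "gutless"],
    "car": ["automobile", "vehicle"],
}

def expand_query(terms, thesaurus=THESAURUS):
    # broaden a query that retrieved too few documents by
    # appending the related terms of each original query term
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded
```

Narrowing a query that retrieved too many items would walk NT (narrower term) links instead.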
Document clustering
• Document clustering: the operation of grouping similar documents together in classes.
• Global vs. local
• Global: whole collection
– At compile time; a one-time operation
• Local
– Cluster the results of a specific query
– At runtime, with each query
Text Compression
• Why is text compression important?
– Less storage space
– Less time for data transmission
– Less time to search (if the compression method allows direct search without decompression)
Terminology
• Symbol
– The smallest unit for compression (e.g.,
character, word, or a fixed number of
characters)
• Alphabet
– A set of all possible symbols
• Compression ratio
– The size of the compressed file as a
fraction of the uncompressed file
Types of compression models
• Static models
– Assume some data properties in advance
(e.g., relative frequencies of symbols) for all
input text
– Allow direct (or random) access
– Poor compression ratios when the input text
deviates from the assumption
Types of compression models
• Semi-static models
– Learn data properties in a first pass
– Compress the input data in a second pass
– Allow direct (or random) access
– Good compression ratio
– Must store the learned data properties for
decoding
– Must have whole data at hand
Types of compression models
• Adaptive models
– Start with no information
– Progressively learn the data properties as the
compression process goes on
– Need only one pass for compression
– Do not allow random access
• Decompression cannot start in the middle
General approaches to text compression
• Dictionary methods
– (Basic) dictionary method
– Ziv-Lempel’s adaptive method
• Statistical methods
– Arithmetic coding
– Huffman coding
Dictionary methods
• Replace a sequence of symbols with a pointer to a dictionary entry.
[Figure: the input “aaababbbaaabaaaaaaabaabb” is compressed against a dictionary containing entries such as “aaa” and “bb”; a dictionary may be suitable for one text but unsuitable for another.]
Adaptive Ziv-Lempel coding
• Instead of dictionary entries, pointers point to the previous occurrences of symbols.
aaababbbaaabaaaaaaabaabb
Parsed into numbered phrases (1 to 10):
a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb
Encoded as (previous phrase number, new symbol):
0a | 1a | 0b | 1b | 3b | 2a | 3a | 6a | 2b | 9b
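The parse above can be reproduced with a short LZ78-style encoder (a sketch; real implementations emit the phrase number in a fixed or growing number of bits rather than a Python tuple):

```python
def lz78_compress(text):
    # dictionary maps each seen phrase to its number; 0 is the empty phrase
    phrases = {"": 0}
    output = []
    current = ""
    for ch in text:
        if current + ch in phrases:
            current += ch            # keep extending the current phrase
        else:
            # emit (number of the longest known prefix, new symbol)
            output.append((phrases[current], ch))
            phrases[current + ch] = len(phrases)
            current = ""
    if current:
        # flush a trailing phrase; its prefix is already in the dictionary
        output.append((phrases[current[:-1]], current[-1]))
    return output
```

Running it on the example input reproduces exactly the phrase numbering shown on the slide.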
Adaptive Ziv-Lempel coding
• Good compression ratio (about 4 bits/character)
– Suitable for general data compression and widely used (e.g., zip, compress)
• Does not allow decoding to start in the middle of a compressed file, so direct access is impossible without decompressing from the beginning
Arithmetic coding
• The input text (data) is converted to a real number between 0 and 1, such as 0.328701
• Good compression ratio (about 2 bits/character)
• Slow
• Cannot start decoding in the middle of a file
Symbols and alphabet for textual data
• Words are more appropriate symbols for natural language text
• Example: “for each rose, a rose is a rose”
– Alphabet
• {a, each, for, is, rose, ‘ ’, ‘,’}
– Always assume a single space after a word unless there is another separator
• {a, each, for, is, rose, ‘,’}
Huffman coding
• Assign shorter codes (bits) to more
frequent symbols and longer codes (bits)
to less frequent symbols
• Example: “for each rose, a rose is a rose”
Example
symb  freq  code
each  1     100
,     1     101
for   1     110
is    1     111
a     2     00
rose  3     01
[Figure: the Huffman tree for “for each rose, a rose is a rose”, built by repeatedly merging the two lowest-frequency nodes; edges to the left carry 0 and edges to the right carry 1.]
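The construction above can be sketched compactly (tie-breaking between equal frequencies differs between implementations, so the exact bit patterns may differ from the slide while the code lengths are the same):

```python
import heapq

def huffman_codes(freqs):
    # build the tree by repeatedly merging the two lowest-frequency
    # nodes; a unique counter breaks ties so trees are never compared
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: 0 left, 1 right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: record the symbol's code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

For the example frequencies, "a" and "rose" receive 2-bit codes and the four frequency-1 symbols receive 3-bit codes, matching the slide.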
Canonical tree
- The height of the left subtree of any node is never smaller than that of the right subtree
- All leaves are in increasing order of probabilities (frequencies) from left to right
[Figure: the canonical Huffman tree for the example]
Advantages of canonical tree
• Smaller data for decoding
– Non-canonical tree needs:
• Mapping table between symbols and codes
– Canonical tree needs:
• (Sorted) list of symbols
• A pair of number of symbols and numerical value
of the first code for each level
– E.g., {(0, NA), (2, 2), (4, 0)}
• More efficient encoding/decoding
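The per-level decoding data can be derived by assigning canonical codes from the code lengths alone (a sketch using the common shortest-codes-first convention; the order of symbols within one length is an arbitrary tie-break, so it may differ from the tree on the slide):

```python
def canonical_codes(lengths):
    # lengths: symbol -> Huffman code length; codes at each length are
    # consecutive integers, shifted left whenever the length increases
    code, prev_len, codes = 0, 0, {}
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= ln - prev_len
        codes[sym] = format(code, "0%db" % ln)
        code += 1
        prev_len = ln
    return codes
```

A decoder then only needs the sorted symbol list plus, for each level, the number of codes and the first code value, exactly as listed above.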
Byte-oriented Huffman coding
• Use whole bytes instead of binary coding
[Figure: non-optimal vs. optimal byte-oriented Huffman trees over 256-way levels; the non-optimal tree leaves 254 empty nodes at a level holding 256 symbols, while the optimal tree places 254 symbols at the upper level and the remaining 2 symbols one level below.]
Comparison of methods
[Table: comparison of compression methods]
Compression of inverted files
• An inverted file is composed of:
– a vector containing all distinct words in the text collection;
– for each word, a list of the documents in which it occurs.
• Types of code for compressing the lists:
– Unary
– Elias-γ
– Elias-δ
– Golomb
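Document lists compress well because they can be stored as d-gaps (differences between consecutive document ids), and small gaps get short codes. A sketch of the gap transform and the Elias-γ code:

```python
def gaps(postings):
    # store a document list as d-gaps: first id, then differences
    return postings[:1] + [b - a for a, b in zip(postings, postings[1:])]

def elias_gamma(n):
    # n >= 1: a unary prefix of (len-1) zeros, then the binary value;
    # the prefix tells the decoder how many bits of the value follow
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b
```

For example, the list [3, 7, 11] becomes the gaps [3, 4, 4], and elias_gamma(1) is just "1" while elias_gamma(4) is "00100".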
Conclusions
• Text transformation: meaning instead of strings
– Lexical analysis
– Stopwords
– Stemming
• Text compression
– Searchable
– Random access
– Model + coding
– Inverted files
Thanks…
Any questions?