java LZW compress MyTextFile.txt MySmallFile.lzw

Download Report

Transcript java LZW compress MyTextFile.txt MySmallFile.lzw

Algorithm Programming 1
89-210
Some Topics in Compression
Bar-Ilan University
2006-2007 ‫תשס"ז‬
by Moshe Fresko
Huffman Coding



Variable-length encoding
Works on probabilities of symbols (characters,
words, etc.)
Build a tree






Get two least frequent symbols/nodes
Join them into a parent node
Parent node’s frequency is sum of child nodes’
Continue until the tree contains all nodes and symbols
The path of a leaf indicates its code
Frequent symbols are near the root giving them short
codes
LZ77





Introduced in 1977 by Abraham Lempel and Jacob
Ziv
Dictionary based
Works in a window size n
Decoding is easy and fast (but not Encoding)
Produces a list of tuples (Pos,Len,C)



Pos : Position backwards from the current position
Len : Number of symbols to be taken
C : Next character
LZ77

Based on strings that repeat themselves
An outcry in Spain is an outcry in vain
An outcry in Spa(6,3)is a(22,12)v(21,3)
aaaaaaaaaa
a(1,9)
LZ77 - Example
Window size : 5
 ABBABCABBBBC
NextSeq
Code

A
B
BA
BC
ABB
BBC
(0,0,A)
(0,0,B)
(1,1,A)
(3,1,C)
(3,2,B)
(2,2,C)
LZ77 - Some Variations



LZSS - A flag bit for distinguishing pointers
from the other items.
LZR - No limit on the pointer size.
LZH - Compress the pointers in Huffman
coding.
LZ78


Instead of a window to previously seen text, a
dictionary of phrases will be build
Both encoding and decoding are simple


From the current position in the text, find the
longest phrase that is found in the dictionary
Output the pair (Index,NextChar)



Index : The dictionary phrase of that index
NextChar : The next character after that phrase
Add to the dictionary the new phrase by
appending the next character
LZ78 - Example

ABBABCABBBBC
Input
A
B
BA
BC
AB
BB
BC

Output Add to dictionary
(0,A)
1 = “A”
(0,B)
2 = “B”
(2,A)
3 = “BA”
(2,C)
4 = “BC”
(1,B)
5 = “AB”
(2,B)
6 = “BB”
(4,EOLN)
Dictionary size
LZW


Produces only a list of dictionary entry indexes
Encoding
1.
Starts with initial dictionary

2.
3.
4.
5.
For example, possible ascii characters (0..255)
From the input, find the longest string that exists in the
dictionary
Output this string’s index in the dictionary
Append the next character in the input to that string and
add it into the dictionary
Continue from that character on from (2)
LZW - Example

ABBABCABBBBC
Initial dictionary 0=“A”, 1=“B”, 2=“C”
Input
NextChar
Output
A
B
0
B
B
1
B
A
1
AB
C
3
C
A
2
AB
B
3
BB
B
4
B
C
1
C
2


Dictionary size : ?
Add to dictionary
3 = “AB”
4 = “BB”
5 = “BA”
6 = “ABC”
7 = “CA”
8 = “ABB”
9 = “BBB”
10 = “BC”
-
LZW – Encoding Example

T=ababcbababaaaaaaa

Initial Dictionary Entries :1=a 2=b
Input
a
b
ab
c
ba
bab
a
aa
aaa
a
Output NextSymbol
1
b
2
a
4
c
3
b
5
b
8
a
1
a
10
a
11
a
1
-
3=c
Add To Dictionary
4 = ab
5 = ba
6 = abc
7 = cb
8 = bab
9 = baba
10 = aa
11 = aaa
12= aaaa
-
LZW – Encoding Algorithm
w = Empty
while ( read next symbol k ) {
if wk exists in the dictionary
w = wk
else
add wk to the dictionary;
output the code for w;
w = k;
}
LZW – Decoding Algorithm
read a code k
output dictionary entry for k
w=k
while ( read a code k ) {
entry = dictionary entry for k
output entry
add w + entry[0] to dictionary
w = entry
}
LZW – Decoding

There is a special case problem with the previous
algorithm







It can be confronted on every decoding process of a big
file
It is the case where the index number read is not in the
dictionary yet
Example : ABABABA
Initially : A=1,B=2
Output=1 2 3 5
In decoding above algorithm will not find the dictionary
entry ABA=5
An additional small check will solve the problem

Be careful to do it in the Exercise 3
LZW – Dictionary Length

Dictionary length


Typically : 14 bits = 16384 entries (first 256 of them are
single bytes)
What if we are out of dictionary length
1.
2.
3.
4.
5.
Don’t add to the dictionary any more
Delete the whole dictionary (This will be used in the exercise)
LRU : Throw those that are not used recently
Monitor performance, and flush dictionary when the performance
is poor.
Double the dictionary size
Exercise 3 – Compression Algorithms
1.
Define an interface “Compression” in “Compression.java” as:
interface Compression {
void compress(InputStream is, OutputStream os) throws IOException ;
void decompress(InputStream is, OutputStream os) throws IOException ;
void stop() ;
}
// Note that these definitions are as general as possible to let flexibility in the
input and output media chosen.
2.
Define the LZW compression algorithm in file LZW.java
3.
This class must implement the above Compress interface.
Exercise 3 – Compression Algorithms
1.
2.
Write an application program “main()” in “LZW.java” that will run in the
following form
java LZW compress/decompress InputFile OutputFile
It will run the given Algorithm for compressing/decompressing the given
input file into the given output file.
Example:
java LZW compress MyTextFile.txt MySmallFile.lzw
will compress MyTextFile.txt into MySmallFile.lzw
java LZW decompress MySmallFile.lzw MyNewTextFile.txt
will uncompress the given file into a new file
Exercise 3 – Compression Algorithms

You can use the following two classes to keep a dynamic list and an associative
array



ArrayList (with interface List)
HashMap (with interface Map)
Example:
// For keeping a dynamic list
List myList = new ArrayList() ;
// For keeping an associative array
Map myMap = new HashMap() ;

List




Map




boolean add(Object)
int size()
Object get(int)
Object put(Object key,Object value)
boolean containsKey(Object key)
Object get(Object key)
For more information, look at the Java API specification page.
Exercise 3 – Compression Algorithms

Algorithm Parameters:




LZW: The maximal initial Dictionary size is 512 entries which requires 9
bits for writing. First 256 entries are the ascii characters.
After reaching the end of the dictionary it will double the dictionary size.
Which means from that point on the representation of an entry will be 10 bits
and will be a dictionary of max size 1024.
This will continue until 14 bits. After reaching 214 entries, it will clean the
dictionary by starting from the beginning with 9-bit represenations.
Please take the two numbers 9 and 14 (start Bit Count, last Bit Count) as a
parameter in the constructor. The default constructor will start them with 9
and 16.



LZW() { this(9,14); }
LZW(int startNumOfBits, int endNumOfBits) { /*Initialization*/ }
You have to use BitInputStream/BitOutputStream from the 1st Exercise to
write the bits into an output or to read from input.
Exercise # 2
Compression Algorithms

Important:

You submit via submitex.









Course number: 89-210
Exercise: ex2
Two files will be delivered. Compression.java and LZW.java
It must compile/work under Java 1.4/1.5. You can try it on the Unix
environment (on Sunshine).
Do not write debugging information to the console or any other place.
Write enough comments. Write wherever you think there is a need for
the understanding of the code.
Write your code according to the OOP principles.
Everybody must do it alone.
Deadline:

14 Dec 2006