The Ingenious Ideas That Drive Today's Computers


Algorithms: The Ingenious Ideas That Drive Today's Computers
Hongfei Yan
2015/4/1
Outline
• Huffman coding
• Greatest common divisor
https://docs.python.org/3/library/collections.html#module-collections
8.3. collections — Container datatypes
• This module implements specialized container datatypes providing
alternatives to Python’s general purpose built-in containers, dict, list,
set, and tuple.
• Counter: dict subclass for counting hashable objects
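As a quick illustration, Counter can tally symbol frequencies directly from a string (a minimal sketch with made-up input):

```python
from collections import Counter

# Count character frequencies in a sample string (hypothetical data).
freq = Counter("this is an example")
print(freq.most_common(2))  # the two most frequent characters with their counts
```

Frequency counts like these are exactly the input Huffman's algorithm starts from.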
Priority queue
• In computer science, a priority queue is an abstract data type which
is like a regular queue or stack data structure, but where additionally
each element has a "priority" associated with it. In a priority queue,
an element with high priority is served before an element with low
priority. If two elements have the same priority, they are served
according to their order in the queue.
• While priority queues are often implemented with heaps, they are
conceptually distinct from heaps. A priority queue is an abstract
concept like "a list" or "a map"; just as a list can be implemented with
a linked list or an array, a priority queue can be implemented with a
heap or a variety of other methods such as an unordered array.
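A minimal sketch of a priority queue backed by Python's heapq module, using (priority, item) tuples so that lower numbers are served first (the task names are made up):

```python
import heapq

# Entries are (priority, item) tuples; heapq pops the smallest tuple first.
pq = []
heapq.heappush(pq, (2, "write code"))
heapq.heappush(pq, (1, "design"))
heapq.heappush(pq, (3, "test"))

while pq:
    priority, task = heapq.heappop(pq)
    print(priority, task)  # served in priority order: design, write code, test
```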
POJ 4080: Huffman Encoding Tree
Description: Construct an extended binary tree with n external nodes, where each external node Ki has an associated weight Wi. The tree must minimize the total weighted external path length over its leaves:
Min( W1 * L1 + W2 * L2 + W3 * L3 + … + Wn * Ln)
Wi: the weight of each external node.
Li: the distance from the root to the i-th external leaf node.
Write a program that computes the minimum total external path length.
Input: The first line contains an integer n, the number of external nodes. The second line contains n integers, the weights of the external nodes. 2 <= n <= 100.
Output: The minimum total external path length.
Sample input
4
1 1 3 5
Sample output
17
Greedy Algorithms
Chapter 10 Algorithm Design Techniques,
10.1 Greedy Algorithms
Data Structures and Algorithm Analysis in C++ (4th Edition),by
Mark A. Weiss, Prentice Hall, June 11, 2013.
• Greedy algorithms work in phases.
• In each phase, a decision is made that appears to be good, without regard for future consequences.
Generally, this means that some local optimum is chosen.
• This “take what you can get now” strategy is the source of the name for this class of algorithms.
• When the algorithm terminates, we hope that the local optimum is equal to the
global optimum.
• If this is the case, then the algorithm is correct;
• otherwise, the algorithm has produced a suboptimal solution.
• If the absolute best answer is not required,
• then simple greedy algorithms are sometimes used to generate approximate answers,
• rather than using the more complicated algorithms generally required to generate an exact answer.
Huffman Codes
• An application of greedy algorithms, known as file compression.
• The normal ASCII character set consists of roughly 100 “printable”
characters.
• In order to distinguish these characters, ⌈log 100⌉ = 7 bits are required.
• Seven bits allow the representation of 128 characters, so the ASCII character
set adds some other “nonprintable” characters.
• An eighth bit is added as a parity check. The important point, however, is that
if the size of the character set is C, then ⌈log C⌉ bits are needed in a standard
encoding.
Using a standard coding scheme
• Suppose we have a file that contains only the characters a, e, i, s, t, plus blank spaces and
newlines.
• Suppose further that the file has ten a’s, fifteen e’s, twelve i’s, three s’s, four t’s, thirteen blanks,
and one newline.
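With the frequencies above, a quick check (a sketch; 3 bits per character suffice to distinguish seven symbols) gives the file size under a standard fixed-length code:

```python
# Character frequencies from the example above.
freqs = {"a": 10, "e": 15, "i": 12, "s": 3, "t": 4, "space": 13, "newline": 1}

total_chars = sum(freqs.values())   # 58 characters in the file
bits_per_char = 3                   # ceil(log2(7)) = 3 bits for seven symbols
print(total_chars * bits_per_char)  # 174 bits in a standard encoding
```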
In real life, files can be quite large
• Many of the very large files are the output of some program, and there is
usually a big disparity between the most frequent and least frequent
characters.
• We might be interested in reducing the file size in the case where we
are transmitting it over a slow phone line.
• Also, since on virtually every machine, disk space is precious, one
might wonder if it would be possible to provide a better code and
reduce the total number of bits required.
The general strategy
• is to allow the code length to vary from character to character and to
ensure that the frequently occurring characters have short codes.
• Notice that if all the characters occur with the same frequency, then
there are not likely to be any savings.
A trie
• The representation of each character can be found by starting at the root and recording the path,
• using a 0 to indicate the left branch and a 1 to indicate the right branch.
• For instance, s is reached by going left, then right, and finally right. This is encoded as 011. This data structure is
sometimes referred to as a trie.
• If character ci is at depth di and occurs fi times, then the cost of the code is equal to Σ di · fi, summed over all characters.
A slightly better tree
• A better code than the one given in Figure 10.9 can be obtained by noticing that
the newline is an only child.
• By placing the newline symbol one level higher at its parent,
we obtain the new tree in Figure 10.10. This new tree has a cost of 173, but is still far from
optimal.
A Full Tree
• a full tree: All nodes either are leaves or have two children.
• An optimal code will always have this property, since otherwise, nodes with only one child could move
up a level.
• If the characters are placed only at the leaves, any sequence of bits can always be decoded
unambiguously.
• For instance, suppose 0100111100010110001000111 is the encoded string.
• 0 is not a character code, 01 is not a character code,
but 010 represents i, so the first character is i.
• Then 011 follows, giving an s. Then 11 follows, which is a newline.
• The remainder of the code is a, space, t, i, e, and newline.
• Thus, it does not matter if the character codes are different lengths, as long as no character code
is a prefix of another character code. Such an encoding is known as a prefix code.
• Conversely, if a character is contained in a nonleaf node, it is no longer possible to guarantee that the decoding will
be unambiguous.
Huffman’s Algorithm: find the full binary
tree of minimum total cost
• Assume that the number of characters is C. Huffman’s algorithm can be described
as follows:
• We maintain a forest of trees. The weight of a tree is equal to the sum of the frequencies of
its leaves.
• C − 1 times, select the two trees, T1 and T2, of smallest weight, breaking ties arbitrarily, and
form a new tree with subtrees T1 and T2.
• At the beginning of the algorithm, there are C single-node trees—one for each character.
• At the end of the algorithm there is one tree, and this is the optimal Huffman coding tree.
• The reason that this is a greedy algorithm is that at each stage we perform a
merge without regard to global considerations. We merely select the two
smallest trees.
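The merging procedure above can be sketched with heapq, tracking only the total cost, which is exactly what POJ 4080 asks for:

```python
import heapq

def huffman_cost(weights):
    """Total weighted path length: repeatedly merge the two smallest trees."""
    heap = list(weights)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        t1 = heapq.heappop(heap)  # smallest tree
        t2 = heapq.heappop(heap)  # second smallest
        cost += t1 + t2           # each merge pushes both subtrees one level down
        heapq.heappush(heap, t1 + t2)
    return cost

print(huffman_cost([1, 1, 3, 5]))  # 17, matching the POJ 4080 sample
```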
8.5. heapq — Heap queue algorithm
• To create a heap, use a list initialized to [], or you can transform a
populated list into a heap via function heapify().
• heapq.heapify(x) Transform list x into a heap, in-place, in linear time.
• heapq.heappop(heap) Pop and return the smallest item from
the heap.
https://docs.python.org/3/library/heapq.html
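For example, a minimal sketch of both functions:

```python
import heapq

data = [5, 1, 9, 3]
heapq.heapify(data)         # transform the list into a heap, in place, in O(n)
print(heapq.heappop(data))  # 1, the smallest item
print(heapq.heappop(data))  # 3, the next smallest
```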
A Huffman encoding can be computed by first
creating a tree of nodes:
• Create a leaf node for each symbol and add it to the priority queue.
• While there is more than one node in the queue:
• Remove the node of highest priority (lowest probability) twice to get two nodes.
• Create a new internal node with these two nodes as children and with probability
equal to the sum of the two nodes' probabilities.
• Add the new node to the queue.
• The remaining node is the root node and the tree is complete.
• Traverse the constructed binary tree from root to leaves assigning and
accumulating a ‘0’ for one branch and a ‘1’ for the other at each node. The
accumulated zeros and ones at each leaf constitute a Huffman encoding for
those symbols and weights.
http://rosettacode.org/wiki/Huffman_coding#Python
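The steps above can be sketched in Python along the lines of the Rosetta Code approach; huffman_codes and the sample text are illustrative names and data:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table following the steps above (a sketch)."""
    # Each heap entry: [weight, tie_breaker, [symbol, code], [symbol, code], ...]
    heap = [[w, i, [sym, ""]] for i, (sym, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)    # lowest-weight node
        hi = heapq.heappop(heap)    # next lowest
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]  # prefix a '0' on one branch
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]  # prefix a '1' on the other
        count += 1                   # unique tie-breaker for equal weights
        heapq.heappush(heap, [lo[0] + hi[0], count] + lo[2:] + hi[2:])
    return dict(heap[0][2:])

codes = huffman_codes("this is an example of a huffman tree")
```

By construction no code is a prefix of another, so the result is a valid prefix code.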
Sorting Mini-HOW TO
• https://wiki.python.org/moin/HowTo/Sorting/
• Python lists have a built-in sort() method that modifies the list in place, and a sorted() built-in function that builds a new sorted list
from an iterable.
• Sorting Basics
• Key Functions
• The same technique works for objects with named attributes.
• Operator Module Functions
Sorting basics
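A minimal illustration of the two:

```python
# sorted() builds a new list from any iterable.
print(sorted([5, 2, 3, 1, 4]))  # [1, 2, 3, 4, 5]

# list.sort() sorts in place and returns None.
a = [5, 2, 3, 1, 4]
a.sort()
print(a)                        # [1, 2, 3, 4, 5]
```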
Key Functions
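For example, the key parameter is a function applied to each element before comparison (the sample data here is illustrative):

```python
# Case-insensitive string comparison via a key function.
print(sorted("This is a test string from Andrew".split(), key=str.lower))

# Sorting tuples by one field with a lambda key.
student_tuples = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
print(sorted(student_tuples, key=lambda s: s[2]))  # sort by age
```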
The same technique works for objects with named attributes.
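A sketch with a hypothetical Student class:

```python
class Student:
    def __init__(self, name, grade, age):
        self.name, self.grade, self.age = name, grade, age
    def __repr__(self):
        return repr((self.name, self.grade, self.age))

students = [Student('john', 'A', 15), Student('jane', 'B', 12), Student('dave', 'B', 10)]
print(sorted(students, key=lambda s: s.age))  # sorted by ascending age
```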
Operator Module Functions
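The operator module provides itemgetter and attrgetter as faster, more readable alternatives to lambda keys (the sample tuples are illustrative):

```python
from operator import itemgetter

student_tuples = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
print(sorted(student_tuples, key=itemgetter(2)))     # sort by the age field
print(sorted(student_tuples, key=itemgetter(1, 2)))  # by grade, then by age
```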
Greatest common divisor
• In mathematics, the greatest common divisor (gcd) of two or
more integers, when at least one of them is not zero, is the largest
positive integer that divides the numbers without a remainder.
• The Euclidean algorithm calculates the greatest common divisor (GCD)
of two natural numbers a and b.
• Formally, the algorithm can be described as: gcd(a, b) = a if b = 0; otherwise gcd(a, b) = gcd(b, a mod b).
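A minimal iterative sketch of the Euclidean algorithm:

```python
def gcd(a, b):
    """Euclid's algorithm: gcd(a, b) == gcd(b, a % b), until b reaches 0."""
    while b:
        a, b = b, a % b
    return a

print(gcd(48, 36))  # 12
```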
Get the GCD of any number of integers
• Since GCD is associative, GCD(a,b,c,d) is the same as
GCD(GCD(GCD(a,b),c),d). In this case, Python's reduce function is a good
candidate for reducing the case where len(numbers) > 2 to repeated
2-number computations.
• functools.reduce(function, iterable[, initializer]),
https://docs.python.org/3/library/functools.html
• Apply function of two arguments cumulatively to the items of the
iterable, from left to right, so as to reduce the iterable to a single
value.
• For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5).
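In the same way, reduce can fold a two-argument gcd over a whole list (gcd_many is a hypothetical helper name):

```python
from functools import reduce

def gcd(a, b):
    # Euclid's algorithm for two numbers.
    while b:
        a, b = b, a % b
    return a

def gcd_many(numbers):
    """GCD of any number of integers: fold gcd pairwise, left to right."""
    return reduce(gcd, numbers)

print(gcd_many([24, 36, 60, 12]))  # 12
```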