Sets, Maps and Hash tables

Download Report

Transcript Sets, Maps and Hash tables

Sets, Maps and Hash Tables
Sets
• We have learned that different data structures have different advantages – and
drawbacks
• Choosing the proper data structure
depends on typical usage patterns
• Array- and list-oriented data structures are
appropriate when the order of elements
matter – but that is not always the case
RHS – SOC
2
Sets
• A Set is a data structure which can hold an
unordered collection of elements
• Not having to worry about ordering can
improve performance of other operations
• On a Set, we want to be able to
– Insert an element
– Delete an element
– Check if a given element is in the Set
RHS – SOC
3
Sets
public interface Set<T>
{
void add(T element);
void remove();
boolean contains(T element);
Iterator<T> iterator();
}
RHS – SOC
4
Sets
• It turns out that insertion, deletion and
check for containment can be done in
O(log(n)), or even faster!
• Depends on the underlying implementation of the interface
• In Java, implementation is either
– HashSet (based on Hash Tables)
– TreeSet (based on Trees)
RHS – SOC
5
Sets
• A Set iterator is ”simpler” than e.g. a List
iterator
– Elements will occur in ”random” order
– No add method – we just call add on the Set
itself
– No previous method – does not make sense
• The Set iterator does however have a
delete method (why?)
RHS – SOC
6
Sets – Quality tip
• When using a Set, we must choose a specific implementation (HashSet or TreeSet)
• However, the definition should look like:
Set<Car> cars = new HashSet<Car>();
RHS – SOC
7
Sets – Quality tip
Set<Car> cars = new HashSet<Car>();
• Why…? We should in general only refer to
the interface, not the implementation
• Easy to switch implementation!
RHS – SOC
8
Maps
• A Map is a data structure which stores
associations between
– A collection of keys
– A collection of values
• All keys map to a value
• Keys are unique (values are not)
RHS – SOC
9
Maps
K1
V1
K2
V2
K3
V3
K4
RHS – SOC
10
Map
public interface Map<K,V>
{
void put(K key,V value);
V get(K key);
void remove(K key);
Set<K> keySet();
}
RHS – SOC
11
Map
• The keySet method
returns a Set containing
all keys in the Map
• You must then iterate
through this Set, in order
to get all values stored in
the Map
RHS – SOC
12
Map
Map<String,Car> carMap = new HashMap<String,Car>();
...
Set<String> regNumbers = carMap.keySet();
for (String regNo : regNumbers)
{
Car aCar = carMap.get(regNo);
... // Do something with the Car object
}
RHS – SOC
13
Exercises
• Review: R16.1, R16.4, R16.6
• Programming: P16.4, P16.12
RHS – SOC
14
Hash Tables
• A Set and a Map are both abstract data
types – we need a concrete implementation in order to use them
• In the Java library, two implementations
are available:
– Sets: HashSet, TreeSet
– Maps: HashMap, TreeMap
RHS – SOC
15
Hash Tables
• The implementations HashSet and
HashMap are based on a Hash Table
• A Hash Table is based on the below ideas:
– Create an array of length N, which can store
objects of some type T
– Find a mapping from T to the interval [0; N-1]
(a Hash Function f)
– Store an object t of type T in the position f(t)
RHS – SOC
16
Hash Tables
f(Car1) = 3
Car3
Car1
0
f(Car2) = 0
Car2
1
f(Car3) = 2
2
3
RHS – SOC
4
17
Hash Tables
• A Hash Table is thus ”almost” an array
• Instead of having an index directly
available, we must calculate it
• If calculation can be done in constant time,
then all basic operations (insert, delete,
lookup) can be done in constant time!
• Better than tree-based implementations,
which have O(log(N))
RHS – SOC
18
Hash Tables
• However, there are
some issues:
– How do we define a
good mapping from the
objects to [0; N-1]?
– What happens if we try
to store two objects at
the same position?
RHS – SOC
19
Hash Functions
• Before finding a good mapping – i.e. a
good hash function – we must consider
the size of the array
• For good performance, the array should at
least be as large as the maximal number
of objects stored
• Rule of thumb is about 30 % larger
• Size should be a prime number (???)
RHS – SOC
20
Hash Functions
• What if the expected number of objects is
unknown in advance?
• We can expand a hash table dynamically
• If the hash table in running out of space,
double the capacity
• Start out with a reasonably large array
(space is cheap…)
RHS – SOC
21
Hash Functions
• Having handled the choice of N, how do
we define a proper hash function?
• Properties of a hash function:
– Must map all objects of type T to the interval
[0; N-1]
– Should map objects as uniformly as possible
to the interval [0; N-1]
RHS – SOC
22
Hash Functions
• We can enforce the mapping to [0;N-1] by
using the modulo operator:
f(t) = g(t) % N
• g(t) can then produce any integer value
• How do we achieve a uniform distribution?
• Theory for this is complicated, but there
are some general rules to follow
RHS – SOC
23
Hash Functions
• A good hash function
should be ”almost random”, but deterministic
– ”Almost random” –
values are well distributed in the interval
– Deterministic – always
produce the same output
for the same input
RHS – SOC
24
Hash Functions
• In Java, all objects have a hashCode
method
– Defined in Object class
– Can be overrided
– Returns an integer (the Hash Code)
– We must use modulo on the value ourselves
RHS – SOC
25
Hash Functions
• Hash function for integers:
– The number itself…
• Hash function for strings:
final int HASH_MULTIPLIER = 31;
int h = 0;
for (int i = 0; i < s.length; i++)
h = (HASH_MULTIPLIER * h) + s.charAt(i);
RHS – SOC
26
Hash Functions
• Hash code for an object
can be calculated by
combining hash codes
for instance fields
• Combine values in a
way similar to the
algorithm used to find
string hash codes
RHS – SOC
27
Hash Functions
public int hashCode()
{
final int MULTIPLIER = 31;
int h1 = regNo.hashCode();
int h2 = mileage;
int h3 = model.hashCode();
int h = h1*MULTIPLIER + h2;
h = h*MULTIPLIER + h3;
return h;
}
RHS – SOC
28
Hash Functions
• But wait…what about numeric overflow?
• We multiply a ”random” integer value with
a number…?
• Does not really matter…
• As long as the algorithm is deterministic,
overflow is not a problem
• Just helps ”scrambling” the value 
RHS – SOC
29
Hash Functions
• Common pitfalls:
– Remember to define a hashCode function
– If you forget, the hashCode implementation in
Object is used
– Based solely on memory location of object
– Two objects with the same value of instance
fields will produce different hash codes…
RHS – SOC
30
Hash Functions
• Common pitfalls:
– The hashCode function must be
”compatible” with your equals function
– If a.equals(b) it must hold that
a.hashCode() == b.hashCode()
– If not, duplicates are allowed!
– The reverse condition is not required; two
different objects may have the same hash
code
RHS – SOC
31
Hash Functions
• In general, you must
remember to:
– Either define the
hashCode and the
equals method
– Or not define any of
them!
RHS – SOC
32
Handling collisions
• Even with a good hash function, we will
still experience collisions
• Collision: two different objects t1 and t2
have the same hash code
• We will then try to store both objects in the
same position in the array
• Now what…?
RHS – SOC
33
Handling collisions
• What we store in each position in the array
is not the objects themselves, but a linked
list of objects
• Objects with the same hash code h are
stored in the linked list in position h
• With a good hash function, the average
length of non-empty lists is less than 2
RHS – SOC
34
Handling collisions
Car6
Car4
Car2
0
Car3
1
2
Car1
3
RHS – SOC
Car5
4
35
Handling collisions
• Basic operations (insert, delete, lookup)
follow this structure:
– Calculate hash code for the object
– Find the corresponding position in the array
• Insert: Insert element at the end of list
• Delete/Lookup: Iterate through list until element is
found, or end of list is reached
RHS – SOC
36
Handling collisions
• Basic operations are thus not done in truly
constant time
• However, if a proper hash function is used,
running time is constant in practice
• Use hash-based implementations unless
special circumstances apply
– Hard to define hash/equals function
– More functionality required
RHS – SOC
37
Exercises
• Review: R16.8, R16.10
• Programming: P16.6
RHS – SOC
38