Softwaring in Clojure


Persistent data structures
Yoav Rubin
About me
• Software engineer at IBM Research, Haifa
– Worked on
• Everything from large-scale products to small-scale research projects
– Domains
• Software tools
• Development environments
• Simplified programming
– Technologies
• Frontend engineering
• Java, Clojure
• Lecturer of the course "Functional programming on the JVM" at Haifa University

{:name "Yoav Rubin"
 :email "[email protected]"
 :blog "http://yoavrubin.blogspot.com"
 :twitter "@yoavrubin"}
Roadmap
• Why
• What
• How
Why
A few assumptions
• Modern software uses different kinds of
data
• Modern software requires various ways to
work with data
• We’re in a multi-core world
• Mutability and concurrency don’t get along
Modern software:
• Working with different kinds of data + requiring different ways to work with data ⇒ data structures
• Concurrency and mutability don't get along ⇒ immutability
Data structures + immutability ⇒ persistent data structures
What
What is a data structure
• A way to organize data
– Provides contracts for read / find
– Provides contracts for update
• Adding data elements
• Removing data elements
• A data structure may contain other data
structures
What’s in a contract
• Information given by the requester
– For reads:
• Nothing
• Some identifier of the data
– For writes:
• The data element itself
• The data element alongside additional info
• What is returned
– For reads:
• The data elements / not-found-identifier
– For writes:
• Nothing
• Some identifier of the data
• The data structure itself
• The cost of the operation
E.g.,
• Hashtable
– add(data-element, key), O(1)
– read(key), O(1)
– find(data-element), O(n)
• Balanced search tree
– add(data-element), O(log n)
– find(data-element), O(log n)
– remove(data-element), O(log n)
E.g.,
• LIFO
– push(data-element), O(1)
– pop(), O(1)
– contains(data-element), O(n)
• FIFO
– enqueue(data-element), O(1)
– dequeue(), O(1)
– contains(data-element), O(n)
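These contracts can be exercised directly at a Clojure REPL; a small sketch, using a list as the LIFO and Clojure's built-in persistent queue as the FIFO:

```clojure
;; LIFO: a persistent list; conj/peek/pop all work at the front in O(1)
(def stack (conj '() :a :b :c))   ; => (:c :b :a)
(peek stack)                      ; => :c
(pop stack)                       ; => (:b :a)

;; FIFO: clojure.lang.PersistentQueue enqueues at the rear,
;; dequeues at the front
(def queue (into clojure.lang.PersistentQueue/EMPTY [:a :b :c]))
(peek queue)                      ; => :a  (the dequeue side)
(seq (pop queue))                 ; => (:b :c)

;; a contains-style lookup is O(n) for both
(some #{:b} stack)                ; => :b
```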
What is the data in a data structure
• Values
– Or other data structures
• At the leaves – only values
• A value is something that cannot change
– 7, \a, nil, "abc"
Is a data structure a value
• In a mutable world – No
• In an immutable world – Yes
– It cannot change!!!
Persistence
• A persistent data structure is a data structure that acts as a value
– It cannot be changed
• While still providing contracts for reading and writing
• Without affecting other users of the data structure
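A minimal sketch of this behavior with a plain Clojure map:

```clojure
(def m1 {:a 1})
(def m2 (assoc m1 :b 2))   ; the "write" returns a new structure

m1   ; => {:a 1}       — unchanged: other users of m1 are unaffected
m2   ; => {:a 1, :b 2} — the requester perceives an updated structure
```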
Writing to an immutable structure
• You don't write to an immutable structure
• Copy-on-write?
– Breaks performance guarantees
• Maintaining persistence
– Create a structure that is perceived as updated by the update requester
– And perceived as immutable by everyone else
Writing "together with"
• There are two "participants" in the write
– The "write-to" structure
– The added value
• Extract as much as possible from the "write-to" structure
– The shared part
– The non-shared part
• Complete the new structure with the added value and duplicates of the non-shared part
How
Persistent data structures
• Persistent list
• Persistent vector
• Persistent map
Persistent list – conj

(def list2 (conj list1 x))

[Diagram: list1 is (a b c d); list2 is a new cell holding x whose tail is list1's cells (a b c d)]

Note that list2 uses all of list1
Persistent list – next (/ pop)

(def list3 (next list2))

[Diagram: list3 points past the x cell of list2, directly at (a b c d)]

Note that list3 is list1
Persistent list
• Complete structural sharing upon
“modification”
• Write (conj) – just adding the new value
• Delete (pop) – returning a pointer to the
next element
Persistent lists
• O(1) for insertion (at the front)
• O(n) for going over the list
• O(1) for popping
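The sharing can be observed directly at the REPL; a small sketch:

```clojure
(def list1 '(a b c d))
(def list2 (conj list1 'x))   ; => (x a b c d): only one new cell is created
(def list3 (next list2))      ; => (a b c d): a pointer past the x cell

;; for persistent lists, next returns the shared tail itself,
;; so list3 is list1 – not merely equal to it
(= list1 list3)          ; => true
(identical? list1 list3) ; => true
```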
Persistent vectors and maps
Behind the scenes
• A trie
– A tree in which each node has an "alphabet map" that routes navigation to its children
• Over the alphabet 0–31
• Vector – a balanced, dense trie
• Map – a sparse trie
Why trie
• A trie allows holding both the data and the metadata of an element
• The data is at the leaves
• The metadata is derived from the structure of the path to the leaf
– Deriving information from structure is a very powerful mechanism
• Neural networks work that way
– In persistent vectors the metadata is the index
– In persistent maps the metadata is the hash code
Persistent vectors
• Values are at the leaf level
• Each level can hold up to 32 elements
• If that number is exceeded, a new level is created, and the previous root is pointed to by entry 0
Adding elements
[Diagram: elements 1–12 added one by one; whenever a level fills up, a new level is created and the previous one becomes its entry 0]
Finding an element
• Look at the bit representation of the index
• Each group of 5 bits corresponds to a position at a specific level
• We know the trie's height
• The height tells us which bit quintet to start the find process with
Finding an element - example
• Assume that the trie height is 3
[Diagram: the index's bits split into three 5-bit quintets, one per level]
• At the top level we would look here
• At the next level we'd look here
• At the leaves level we'd look here
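The quintet extraction can be sketched like this (level-index is a hypothetical helper name for illustration; the real logic lives inside clojure.lang.PersistentVector):

```clojure
;; extract the 5-bit quintet that selects the child at a given level
;; (level 0 = leaves)
(defn level-index [i level]
  (bit-and (unsigned-bit-shift-right i (* 5 level)) 0x1f))

(level-index 100 0) ; => 4   (100 = 2r1100100; the low quintet is 00100)
(level-index 100 1) ; => 3   (the next quintet is 00011)
```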
Persistent vector
• Very efficient
– Almost O(1)
• Find, add
– O(near-constant-time)
• O(log32 n)
– A very strong narrowing factor
– 1M elements => only 4 levels to handle
– For all practical uses – think of it as O(1)
• subvec – O(1)
– No dependency on the size of the sub-vector!!
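A quick sketch of these contracts at the REPL (timings omitted; the point is the shape of the operations):

```clojure
(def v (vec (range 1000000)))   ; 1M elements => a 4-level trie

(nth v 123456)     ; => 123456, near-constant time: at most 4 hops
(subvec v 10 15)   ; => [10 11 12 13 14], O(1): a view over the same trie
```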
Persistent map
• A special trie – a Hash Array Mapped Trie
– Based on the work of Phil Bagwell
• Over the alphabet 0–31
• Not dense
• Similar to the way the persistent vector works
– Instead of indices, we use a 32-bit hash value of the key
What happens when modifying
• We want to do (assoc m x v1)
• The big arrows represent the non-participating subtrees (each arrow can stand for several real arrows to several nodes)
[Diagram: the path m → r → n down to the leaf (x, v) is copied into m' → r' → n' with a new leaf (x', v1); everything off that path is shared between the old and new tries]
Path copying
• Performing an action on a structure does not create an entirely new structure
• A new structure is created that shares a large portion of the old structure
What should be created
• Anything that is related to the new node
• Remember – information about that node is found at the leaf (the content) and on the path (the metadata)
• So we recreate the path to the changed node, and the changed node itself
Path copying - price
• The number of nodes that need to be created is O(tree-height)
• Using a very wide tree
– 32 children for each parent
– Tree height is log32 n
• n is the number of nodes in the tree
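Path copying is observable with plain vectors; a small sketch:

```clojure
(def v1 (vec (range 100)))       ; 100 > 32, so this is a 2-level trie
(def v2 (assoc v1 50 :changed))  ; copies only the path to index 50

(nth v2 50)   ; => :changed
(nth v1 50)   ; => 50 — v1 is untouched; all off-path nodes are shared
```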
Performance concerns
• Very sparse structure
• Each node should only point to existing
children
– Not be structured as a full node
• Still, need to locate child in an efficient
way
– Constant time
Processing within a node
• Each node holds the following fields
– A list of up to 32 "links" that point from the parent node to its children
• Fewer cache misses
– A bitmap of 32 bits (an integer)
• A 1 in the ith bit means that an entry whose value in the segment is i is pointed to by the links list
• Its index in the links list is the number of ones to the right of the ith bit
• In a very sparse tree – most of the links lists will be very small
Example
• A node is pointed to by entry 0 in level 0 and has 4 children
– In positions 2, 5, 14, 22
• How do we assoc keys with the following hash codes?
– 00 00000 00000 00000 00100 00000
– 00 00000 00000 00000 01110 00000
The node (in level 1):
bitmap: 00 00000 00100 00000 10000 00001 00100
links: [2 5 14 22]
First hash code:
00 00000 00000 00000 00100 00000
• The first quintet is 0, so entry 0 in level 0 brought us to this node
• Looking for entry 4 (the next quintet) in this level, checking the bitmap and seeing that it has 0 there, therefore returning not-found
Second hash code:
00 00000 00000 00000 01110 00000
• Again, entry 0 in level 0 brought us to this node
• Looking for entry 14 (the next quintet) in this level, checking the 14th bit in the map: there's a 1 there
• Counting the number of 1s to the right of that bit – getting 2
• Continue on the link in cell 2 (remember – zero-based indexing)
How many actions taken
• Looking at the ith bit
– A simple masking
• Counting the number of 1s
– Using the processor instruction CTPOP
• Population count
• Found on many processors
• Counts the number of set bits in a word
– Masking and counting
• A constant number of bit operations
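The masking-and-counting step can be sketched as follows (child-index is a hypothetical helper name for illustration; Clojure's actual version lives inside clojure.lang.PersistentHashMap):

```clojure
;; index into the links list for 5-bit segment i, or nil if absent
(defn child-index [bitmap i]
  (when (bit-test bitmap i)                          ; is the ith bit set?
    (Integer/bitCount                                ; CTPOP on the JVM
      (bit-and bitmap (dec (bit-shift-left 1 i)))))) ; ones below bit i

;; the node from the example: children at segments 2, 5, 14, 22
(def bitmap (reduce bit-or (map #(bit-shift-left 1 %) [2 5 14 22])))

(child-index bitmap 4)  ; => nil — bit 4 is 0: not found
(child-index bitmap 14) ; => 2  — two ones (bits 2 and 5) below bit 14
```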
That's not all
• We traveled down the link
• If there's a node with a bitmap there
– Continue the same way with the next segment
• If there's a node with a value there
– Return that value
Disclaimer
• There are other ways to implement the internal processing within a node
• This specific way is a simplification of the process presented in the original "Ideal Hash Trees" paper
– Which was once used in Clojure
– And was later replaced with several further optimizations
Why trie
• A linear structure would need to copy the entire structure
– Too much of a performance hit
• O(n) time for each action
• Too much memory consumption
• In a trie, path copying needs to create only a few nodes for each change
– The wider the trie, the fewer nodes
• 32 children per parent is a very wide trie
Zippers
Zippers
• Generic tree handling API
– Walking
– editing
• Purely functional data structure
– Persistent
– Excellent performance
• Found in clojure.zip
• First described by Gérard Huet in 1997
Zippers - rationale
• In functional data structures, we replaced the O(1) for updates with O(log n)
– Gained immutability
• If we assume that the updates occur near a cursor on the structure, we can regain the O(1) for updates
– Think of a text editor (a tree of lines, each holding characters)
• All the changes occur at the cursor
What is a zipper
[Diagram: a tree with a focus node and its surrounding context]
Zippers - creation
• Can be based on any tree structure
• Need to provide the following three
functions
– branch? – can the given node have children
– children – get the children of the given node
– make-node – creation of a new node
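A sketch of creation with clojure.zip, building a zipper over nested vectors; these are exactly the three functions zip/zipper takes:

```clojure
(require '[clojure.zip :as zip])

(def z (zip/zipper
         vector?                 ; branch? – can this node have children?
         seq                     ; children – get the children of the node
         (fn [_node children]    ; make-node – build a new node
           (vec children))
         [1 [2 3] 4]))           ; the root becomes the initial cursor

(zip/node z)                       ; => [1 [2 3] 4]
(-> z zip/down zip/right zip/node) ; => [2 3]
```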
Zippers - creation
• What does the zipper have upon creation
– The three functions
– An empty bookkeeping data
– The root of the tree (as the initial cursor)
• Creation with practically no performance
cost
What is possible
• Going around
– left, right, up, down
• Going to the root
• Getting the current node
• DFS preorder traversal (in Clojure)
– With prev or next
• Each movement returns a new zipper!!!
The zipper is drawn as four parts – the Focus (current node), the Path back to the root, the Left context, and the Right context.
[Diagram sequence: a tree containing nodes n1–n5, x, y, a, b, c; caption: "We want to get here" (node b). As the focus moves step by step from the root n1 down toward b, each move pushes the nodes passed over onto the path and into the left (L) and right (R) contexts, so the whole tree can always be rebuilt from the focus and its context.]
What is possible
• Editing
– In O(creation-of-new-node)
– Not O(log n + creation-of-new-node)
– Treating the focus as the root of the tree
• Note – this is a functional data structure
– Nothing is changed
– A new element is created
– Think of path copying
• But without the path
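Editing with clojure.zip, sketched on a small vector tree:

```clojure
(require '[clojure.zip :as zip])

(def zv (zip/vector-zip [1 [2 3] 4]))

;; edit the focused node: a new zipper (and, via root, a new tree) is created
(-> zv zip/down (zip/edit inc) zip/root)   ; => [2 [2 3] 4]

;; nothing was changed in place – the original tree is intact
(zip/root zv)                              ; => [1 [2 3] 4]
```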
More info
• Phil Bagwell's paper about hash array mapped tries
– http://lampwww.epfl.ch/papers/idealhashtrees.pdf
• State, value and identity – Rich Hickey
– http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
• Zippers: Making Functional "Updates" Efficient
– http://scienceblogs.com/goodmath/2010/01/13/zippers-making-functional-upda/
• Clojure Zippers
– http://www.youtube.com/watch?v=6c4DJX2Xr3k
• Gérard Huet's paper about zippers
– http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf
• Designing data structures (Phil Bagwell)
– http://www.youtube.com/watch?v=K2NYwP90bNs
• Functional structures in Scala (Daniel Spiewak)
– http://www.youtube.com/watch?v=pNhBQJN44YQ
• Persistent maps and vectors inner workings in Clojure
– http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvectorimplementation/
– http://blog.higher-order.net/2009/09/08/understanding-clojures-persistenthashmap-deftwice/
• Purely functional data structures by Chris Okasaki
– http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf