Transcript Ranking

Ranking
Ida Mele
Introduction
• The set of software components for the management of
large sets of data is made of:
•
•
•
•
•
•
MG4J
Fastutil
the DSI Utilities
Sux4J
WebGraph
the LAW software
• These software components have been developed by the
DSI of the University of Milan
Ida Mele
Ranking
1
Fastutil
• Fastutil 6 is a free software, developed in Java
• Technical requirement:
• Java >= 6
• Useful links:
• http://fastutil.di.unimi.it/
• http://fastutil.di.unimi.it/docs/
Ida Mele
Ranking
2
Fastutil
• Fastutil extends Java Collections, and it
provides:
• Type-specific maps, sets, and lists
• Priority queues with a small memory footprint and
fast access and insertion
• 64-bit arrays, sets, and lists
• Fast I/O classes for text and binary files
Ida Mele
Ranking
3
Fastutil: Advantages
• Advantages in using Fastutil:
• Classes of Fastutil are implemented in order to
work on huge collections of data in an efficient
way
• Fastutil provides a new set of classes to deal with
collections whose size exceeds 231
Ida Mele
Ranking
4
Fastutil: Advantages
• Advantages in using Fastutil:
• There are additional features (e.g., bidirectional
iterators) that are not available in the standard
classes
• Classes can be plugged into existing code, because
they implement their standard counterpart (e.g.,
Map is used for maps)
Ida Mele
Ranking
5
Fastutil: Big Arrays
• BigArrays: This class provides static methods and
objects for working with big arrays
• Big arrays are arrays-of-arrays. For example, a big
array of integers has type int[][]
• Methods handle these arrays-of-arrays as if they are
monodimensional arrays with 64-bit indices
• The length of a big array is bounded by
Long.MAX_VALUE rather than Integer.MAX_VALUE
Ida Mele
Ranking
6
Fastutil: Big Arrays
• Given a big array a, a[0], a[1], … a[n] are called segments. Each
one has length SEGMENT_SIZE (the last segment can have a
smaller size)
• Each index i is associated with a segment and a displacement
into the segment
• Methods segment/displacement compute the
segment/displacement associated with a given index
• Method index receives the segment and the displacement and
returns the corresponding index
• Methods get/set allow to return/set the value of a given
element in the big array
Ida Mele
Ranking
7
Fastutil Big Arrays – example
• We want to scan the big array a
• First solution:
for( int s = 0; s < a.length; s++ ) {
final int[] t = a[ s ];
for( int d = 0; d < t.length; d++ ) {
//do something with t[ d ]
}
}
Ida Mele
Ranking
8
Fastutil Big Arrays – example
• Second solution:
for( int s = a.length; s-- != 0; ) {
final int[] t = a[ s ];
for( int d = t.length; d-- != 0; ) {
//do something with t[ d ]
}
}
Ida Mele
Ranking
9
Fastutil Big Arrays – example
• Third solution:
for( int s = a.length; s-- != 0; ) {
final long[] t = a[ s ];
for( int d = t.length; d-- != 0; )
t[d] = index( s, d );
}
• We can use the index method, which returns the
index associated with a segment and displacement
Ida Mele
Ranking
10
Fastutil: Big Data Structures
• Fastutil provides classes also for other data
structures:
• BigList: a list with indices. The instances of this
class implement the same semantics of traditional
List
• HashBigSet: the instances of this class use a hash
table to represent a big set. The number of
elements in the set is limited only by the amount
of core memory
Ida Mele
Ranking
11
Dsiutils
•
•
•
•
The DSI utilities are a mishmash of classes
Free software
Developed in Java
Useful links:
• http://dsiutils.di.unimi.it/
• http://dsiutils.di.unimi.it/docs/
Ida Mele
Ranking
12
Dsiutils: MultipleString
• In large-scale text indexing we want to use a
mutable string that, once frozen, can be used in
the same optimized way of an immutable string
• In Java we have String and StringBuffer, which
can be used for immutable and mutable strings
respectively
• The solution is MultipleString
• MultipleString does not need synchronization
Ida Mele
Ranking
13
Dsiutils: Packages
• Some important packages:
• it.unimi.dsi.bits contains main classes for
manipulating bits. For example, the class BitVectors
provides static methods and objects that do useful
things with bit vectors
• it.unimi.dsi.compression provides word-based
compression/decompression classes
• it.unimi.dsi.util offers implementations of
BloomFilters, PrefixMaps, StringMaps, BinaryTries and
others
Ida Mele
Ranking
14
WebGraph
• WebGraph is a framework for graph
compression
• It exploits modern compression techniques to
manage very large graphs
• Useful links:
• http://webgraph.di.unimi.it/
• http://webgraph.di.unimi.it/docs/
Ida Mele
Ranking
15
WebGraph
• WebGraph provides:
• ζ-codes, which are suitable for storing web graphs
• Algorithm for compressing the graph that exploit
gap compression as well as ζ-codes. The
parameters provide different tradeoffs between
access speed and compression ratio
• Algorithms to access to compressed graphs
without decompression. The lazy techniques delay
the decompression until it is necessary
Ida Mele
Ranking
16
WebGraph: Classes
• Some important classes:
• ImmutableGraph is an abstract class representing
an immutable graph
• BVGraph allows to store and access web graphs in
a compressed form
• ASCIIGraph is used to store the graph in a humanreadable ASCII format
Ida Mele
Ranking
17
WebGraph: Classes
• Some important classes:
• ArcLabelledImmutableGraph is an abstract
implementation of a graph with labeled arcs
• Transform returns the transformed version of an
immutable graph. We can use the transpose
method of this class if we want to create the
transpose graph
Ida Mele
Ranking
18
LAW
• Java software developed by the Laboratory for
Web Algorithms
• It is free and contains several implementations
of the Pagerank algorithm
• Useful links:
• http://law.di.unimi.it/software.php
• http://law.di.unimi.it/software/docs/index.html
Ida Mele
Ranking
19
LAW: PageRank
• PageRank of the package it.unimi.dsi.law.rank is an
abstract class that defines methods and attributes
for the PageRank algorithm
• Provided features:
•
•
•
•
•
we can set the preference vectors
we can set the damping factor
we can program stopping criteria
step-by-step execution
reusability
Ida Mele
Ranking
20
Exercise
• Download the archive with libraries: lib.zip
• Download the files:
• set-classpath.sh
• example
• Text2ASCII.class and PrintRanks.class
available at:
http://www.dis.uniroma1.it/~mele/WebIR.html
• Set the classpath using the command:
source set-classpath.sh
Ida Mele
Ranking
21
Build the Graph: Step1
• Step 1 - Create the file in the format ASCIIGraph
using the command:
java Text2ASCII example
• Output:
example.graph-txt: the first line contains the number of
nodes (e.g., n). The following n lines contain the list of
out-neighbours of the nodes. In particular, the line i-th
contains the successors of the node i, sorted in an
increasing order and separated by a space
Ida Mele
Ranking
22
Build the Graph: Step1
input:
example
0
0
0
1
1
1
2
2
2
2
…
1
8
9
4
7
9
1
3
4
5
Ida Mele
output:
example.graph-txt
Text2ASCII
10
1 8
4 7
1 3
1 4
1
1
1
5
0
Ranking
9
9
4 5 6 7 8 9
5 6 9
2
2 3 4 5
9
1 3 4 6
23
Build the Graph: Step1
more example.graph-txt
Node id
.
.
.
Ida Mele
0
1
2
3
4
5
6
7
8
9
10
1 8
4 7
1 3
1 4
1
1
1
5
0
Num of nodes
9
9
4 5 6 7 8 9
5 6 9
Lists of successors
2
2 3 4 5
9
1 3 4 6
Ranking
24
Build the Graph: Step2
• We can use the main method of the BVGraph class to
load and compress an ImmutableGraph
• The compressed graph is described by:
• basename.graph: the graph file. It contains the successor
lists, one for each node. Each list is a sequence of natural
number that are coded as sequence of bits in a efficient
way
• basename.offsets: the offset file. It stores the offset for
each node of the graph
• basename.properties: the file with properties and
statistics
Ida Mele
Ranking
25
Build the Graph: Step2
• Step 2 - Conversion from the ASCIIGraph to
the BVGraph:
java it.unimi.dsi.big.webgraph.BVGraph -g
ASCIIGraph example exampleBV
• Output:
• exampleBV.graph
• exampleBV.offsets
• exampleBV.properties
Ida Mele
Ranking
26
Build the Graph: Step2
more exampleBV.properties
…
compratio=1,89
bitsforblocks=22
residualarcs=15
version=0
…
nodes=10
compressionflags=
intervalisedarcs=10
bitspernode=16,8
arcs=34
…
Ida Mele
Ranking
27
Compute PageRank
• To compute the PageRank we can use the following
implementations:
•
•
•
•
PowerMethod
PageRankPowerSeries
GaussSeidel
Jacobi
• The output is made of 2 files:
• basename.ranks: binary file with the results of
computation
• basename.properties: text files with general info
Ida Mele
Ranking
28
Compute PageRank: Step 1
• Step 1 - We use the main method of the class
PageRankPowerMethod by issuing the following
command:
java it.unimi.dsi.law.rank.PageRankPowerMethod
exampleBV examplePR
• Output:
• examplePR.ranks
• examplePR.properties
Ida Mele
Ranking
29
Compute PageRank: Step 1
more examplePR.properties
rank.alpha = 0.85
rank.stronglyPreferential = false
method.numberOfIterations = 12
method.norm.type = INFTY
method.norm.value = 8.396275630317973E-7
graph.nodes = 10
graph.fileName = example
Ida Mele
Ranking
30
Compute PageRank: Step 2
• Step 2 – Print the scores
• The file .ranks is a binary file with the scores of
the nodes, so we can print PageRank scores by
using the class PrintRanks:
java PrintRanks examplePR.ranks > ranks
• Output:
ranks: file with n lines, one for each node. The i-th
line contains the score of node number i
Ida Mele
Ranking
31
Compute PageRank: Step 2
more ranks
Node id
.
.
.
Ida Mele
0
1
2
3
4
5
6
7
8
9
0.0515659940361598
0.20197850631669495
0.07982657817906964
0.07587785830476211
0.14600457683651308
0.08608501191896127
0.07294688611466064
0.0931194920828582
0.05050241152172527
0.14209268468859523
Ranking
PageRank values
32
Compute PageRank: sorting
sort –r ranks
0.20197850631669495
0.14600457683651308
0.14209268468859523
0.0931194920828582
0.08608501191896127
0.07982657817906964
0.07587785830476211
0.07294688611466064
0.0515659940361598
0.05050241152172527
Ida Mele
Ranking
PageRank values
sorted in
decreasing order
33
Homework
1. Repeat the exercise with the graphs:
• WikiIT
• WikiPT
available at:
http://www.dis.uniroma1.it/~mele/WebIR.html
2. Create a new graph by using synthetic or real data,
and repeat the exercise
Ida Mele
Ranking
34