SamudraManthan
Popular terms
Dinesh Bhirud
Prasad Kulkarni
Varada Kolhatkar
1
Architecture
[Architecture diagram: a MANAGER distributes work to WORKER PROCESSORS; each worker creates data structures, prunes Ngrams, and performs intra-process reduction; a final cross-process REDUCTION stage finds the top Ngrams.]
2
Data Distribution
[Diagram: the MANAGER (P0) talks to WORKERs P1, P2, ..., PN through a HANDSHAKE MODULE.]

Handshake protocol:
1. Signal ready (W -> M)
2. Data msg (M -> W)
3. Next ready signal (W -> M)
   ...
4. Terminate msg (M -> W)
3
Data Distribution (contd…)

• The manager (processor 0) reads an article and passes it on to the other processors (workers) in a round-robin fashion.
• Before sending a new article to the same worker, the manager waits until that worker is ready to receive more data.
• A worker processes the article and creates its data structures before receiving the next article.
• Sends and receives are synchronous (see the MPI sketch below).
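A minimal MPI sketch of this handshake in C (the tag names, buffer size, and the read_article/process_article helpers are illustrative assumptions, not the original code):

#include <mpi.h>

#define TAG_READY 1            /* W -> M: worker can accept an article */
#define TAG_DATA  2            /* M -> W: article text                 */
#define TAG_DONE  3            /* M -> W: no more articles             */
#define MAX_ART   (1 << 20)

extern int read_article(char *buf);              /* returns length, 0 at EOF */
extern void process_article(char *buf, int len); /* build suffix array etc.  */

void manager(int nworkers) {
    static char buf[MAX_ART];
    int len, dummy, w = 1;
    MPI_Status st;
    while ((len = read_article(buf)) > 0) {
        /* Round-robin: wait until worker w signals ready, then send. */
        MPI_Recv(&dummy, 1, MPI_INT, w, TAG_READY, MPI_COMM_WORLD, &st);
        MPI_Send(buf, len, MPI_CHAR, w, TAG_DATA, MPI_COMM_WORLD);
        w = (w % nworkers) + 1;                  /* cycle over workers 1..N */
    }
    for (w = 1; w <= nworkers; w++) {            /* terminate every worker */
        MPI_Recv(&dummy, 1, MPI_INT, w, TAG_READY, MPI_COMM_WORLD, &st);
        MPI_Send(buf, 0, MPI_CHAR, w, TAG_DONE, MPI_COMM_WORLD);
    }
}

void worker(void) {
    static char buf[MAX_ART];
    int ready = 1, len;
    MPI_Status st;
    for (;;) {
        MPI_Send(&ready, 1, MPI_INT, 0, TAG_READY, MPI_COMM_WORLD);
        MPI_Recv(buf, MAX_ART, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_DONE) break;       /* 4. terminate msg */
        MPI_Get_count(&st, MPI_CHAR, &len);
        process_article(buf, len);
    }
}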
4
Suffix Array, LCP Vector and Equivalence Classes

• A suffix array is a sorted array of all suffixes of an article.
• The LCP vector keeps track of terms that repeat across adjacent suffixes.
• We use the suffix array and LCP vector to partition an article's Ngrams into classes.
• Each class represents a group of Ngrams.
• Together, the classes represent all Ngrams in the article, and no Ngram is represented more than once.
5
Example

A ROSE IS A ROSE

S   Suffix              LCP   Trivial classes   Non-trivial classes
0   A ROSE               0    <0,0>             <0,1>
1   A ROSE IS A ROSE     2    <1,1>
2   IS A ROSE            0    <2,2>
3   ROSE                 0    <3,3>
4   ROSE IS A ROSE       1    <4,4>
5   (end)                0                      <3,4>

(<i,j> denotes the range of suffix-array rows that forms a class.)
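A small self-contained C sketch of the word-level suffix array and LCP vector for this example (the naive comparison sort is an assumption made for brevity; the actual implementation may build these structures differently):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char **words;               /* the article as an array of words */
static int nwords;

/* Compare two word-level suffixes lexicographically. */
static int cmp_suffix(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    while (i < nwords && j < nwords) {
        int c = strcmp(words[i], words[j]);
        if (c) return c;
        i++; j++;
    }
    return (i < nwords) - (j < nwords);   /* a proper prefix sorts first */
}

int main(void) {
    char *text[] = { "A", "ROSE", "IS", "A", "ROSE" };
    int sa[5], lcp[5];
    words = text;
    nwords = 5;

    for (int k = 0; k < nwords; k++) sa[k] = k;
    qsort(sa, nwords, sizeof sa[0], cmp_suffix);

    lcp[0] = 0;                            /* LCP with the previous row */
    for (int k = 1; k < nwords; k++) {
        int l = 0;
        while (sa[k] + l < nwords && sa[k-1] + l < nwords &&
               strcmp(words[sa[k] + l], words[sa[k-1] + l]) == 0)
            l++;
        lcp[k] = l;
    }

    for (int k = 0; k < nwords; k++)
        printf("S=%d  LCP=%d  suffix at word %d: %s ...\n",
               k, lcp[k], sa[k], words[sa[k]]);
    return 0;
}

Running it reproduces the S and LCP columns of the table above.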
6
Advantages of Suffix Arrays

• Time complexity
  – There can be at most 2N-1 classes, where N is the number of words in an article.
  – Ngrams of any size can be identified, along with their term frequencies, in linear time.
  – These data structures enable us to represent Ngrams of every size without actually storing them.
7
Intra-Processor Reduction
Problem

• The suffix array gives us article-level unique Ngrams with term frequencies.
• Each processor processes multiple articles.
• We need to identify unique Ngrams across articles.
• We need a unique identifier for each word.
8
Dictionary – Our Savior

• The dictionary is a sorted list of all unique words in the Gigaword corpus.
• Dictionary ids form a unified basis for intra- and inter-process reduction.
9
Intra-Processor Reduction

• We use a hash table to store unique Ngrams with their tf and df.
• Hashing function
  – A simple mod hashing function:
  – H(t) = ( ∑ t(i) ) mod HASH_SIZE, where t(i) is the dictionary id of word i in Ngram t
• Hash data structure

struct ngramstore {
    int *word_id;              /* dictionary ids of the Ngram's words */
    int cnt;                   /* term frequency (tf)                 */
    int doc_freq;              /* document frequency (df)             */
    struct ngramstore *chain;  /* next entry in this hash bucket      */
};
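A sketch of how insertion into this table might look, assuming chained hashing as in the struct above (HASH_SIZE, NGRAM_N, and the insert_ngram interface are illustrative assumptions, not the original code):

#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1000003
#define NGRAM_N   3                     /* current Ngram size */

static struct ngramstore *table[HASH_SIZE];

/* H(t) = (sum of the dictionary ids) mod HASH_SIZE */
static unsigned hash_ngram(const int *word_id) {
    unsigned long s = 0;
    for (int i = 0; i < NGRAM_N; i++) s += word_id[i];
    return s % HASH_SIZE;
}

/* Merge one article's Ngram (with its in-article term frequency tf);
 * df counts the number of articles containing the Ngram. */
void insert_ngram(const int *word_id, int tf) {
    unsigned h = hash_ngram(word_id);
    struct ngramstore *p;
    for (p = table[h]; p; p = p->chain)
        if (memcmp(p->word_id, word_id, NGRAM_N * sizeof(int)) == 0) {
            p->cnt += tf;               /* accumulate term frequency */
            p->doc_freq += 1;           /* one more article has it   */
            return;
        }
    p = malloc(sizeof *p);              /* first sighting: new entry */
    p->word_id = malloc(NGRAM_N * sizeof(int));
    memcpy(p->word_id, word_id, NGRAM_N * sizeof(int));
    p->cnt = tf;
    p->doc_freq = 1;
    p->chain = table[h];                /* push onto the bucket chain */
    table[h] = p;
}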
10
Inter-Process Reduction
Binomial Tree

Steps (n = 8 processes):
  i = 0:  1 -> 0,  3 -> 2,  5 -> 4,  7 -> 6
  i = 1:  2 -> 0,  6 -> 4
  i = 2:  4 -> 0

• i varies from 0 to log2(n) - 1.
• In iteration i, the sender-to-receiver distance is 2^i.
• In any iteration, a process receives if id % 2^(i+1) == 0; otherwise, if still active, it sends.
• max_recv = (reductions - 1) * (int)pow((double)2, i+1);
• Processors enter the next iteration by calling MPI_Barrier().
[Binomial tree diagram over processors 0–7]
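A compact MPI C sketch of this reduction loop (send_hash and recv_and_merge are hypothetical placeholders for serializing one process's hash table and merging a received one):

#include <mpi.h>

extern void send_hash(int dest);        /* serialize own hash, MPI_Send it */
extern void recv_and_merge(int src);    /* MPI_Recv a hash, insert entries */

void binomial_reduce(int id, int n) {
    for (int i = 0; (1 << i) < n; i++) {        /* i = 0 .. log2(n) - 1     */
        int step = 1 << i;                      /* sender-receiver distance */
        if (id % (2 * step) == step)
            send_hash(id - step);               /* e.g. i = 0: 1->0, 3->2   */
        else if (id % (2 * step) == 0 && id + step < n)
            recv_and_merge(id + step);
        MPI_Barrier(MPI_COMM_WORLD);            /* next round in lockstep   */
    }
    /* After log2(n) rounds, process 0 holds all unique Ngrams. */
}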
11
Inter-Process Reduction using Hashing

• We reuse our hashing technique and code from intra-process reduction.
• All processes use the binomial-tree collection pattern to reduce unique Ngrams.
• After log n steps, process 0 has the final hash with all unique Ngrams.
12
Scaling up to GigaWord?

• Goal
  – Reduce the per-processor memory requirement.
• Cut-off term frequency
  – Ngrams with a low tf are not going to score high.
  – Observation: 66% of all trigrams in 1.2 GB of data have a term frequency of 1.
  – It is unnecessary to carry such Ngrams.
  – Solution: eliminate Ngrams with a very low term frequency (see the sketch below).
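A sketch of this low-tf pruning pass over the hash table from slide 10 (the MIN_TF cutoff value is an assumption):

#define MIN_TF 2                        /* drop Ngrams with tf below this */

void prune_low_tf(void) {
    for (unsigned h = 0; h < HASH_SIZE; h++) {
        struct ngramstore **pp = &table[h], *p;
        while ((p = *pp) != NULL) {
            if (p->cnt < MIN_TF) {      /* e.g. trigrams with tf == 1   */
                *pp = p->chain;         /* unlink from the bucket chain */
                free(p->word_id);
                free(p);
            } else {
                pp = &p->chain;
            }
        }
    }
}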
13
Pruning – Stoplist Motivation

• Similarly, Ngrams with a high df are not going to score high.
• Memory hotspot
  – This elimination can be done only after intra-process collection.
  – That defeats the goal of reducing per-processor memory.
  – We need an adaptive elimination scheme.
14
Pruning – Stoplist

• Ngrams such as "IN THE FIRST" scored high using the TF*IDF measure.
• We eliminate such Ngrams to extract the truly interesting terms.
• The stoplist is a list of commonly occurring words such as "the", "a", "to", "from", "is", "first".
• The stoplist is based on our dictionary.
• It is still evolving and currently contains 160 words.
• We eliminate Ngrams whose words all come from the stoplist (see the sketch below).
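A sketch of the all-stopwords test, assuming the stoplist is stored as a sorted array of dictionary ids (the helper names and the binary search are assumptions):

#include <stdbool.h>

extern int stoplist[];                  /* sorted dictionary ids of stopwords */
extern int stoplist_len;                /* currently about 160 entries        */

static bool is_stopword(int id) {
    int lo = 0, hi = stoplist_len - 1;
    while (lo <= hi) {                  /* binary search in the sorted list */
        int mid = (lo + hi) / 2;
        if (stoplist[mid] == id) return true;
        if (stoplist[mid] < id) lo = mid + 1; else hi = mid - 1;
    }
    return false;
}

/* An Ngram is eliminated only when every one of its words is a stopword. */
bool all_stopwords(const int *word_id, int n) {
    for (int i = 0; i < n; i++)
        if (!is_stopword(word_id[i]))
            return false;               /* one content word keeps the Ngram */
    return true;
}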
15
Interesting 3-grams on GigaWord
16
Performance Analysis: Speedup
17
Space Complexity

• The memory requirement increases for higher-order Ngrams.
• Why?
  – Suppose there are n unique Ngrams in each article, and m such articles.
  – For higher-order Ngrams, the number of unique Ngrams increases.
  – We store each unique Ngram in our hash data structure.
  – In the worst case, all Ngrams across articles are unique, so we have to store mn unique Ngrams per processor.
18
Current Limitations

• Static dictionary
• M-through-N interesting Ngrams
  – Our hash data structure is designed to handle a single Ngram size at a time.
  – We provide M-through-N functionality by repeatedly rebuilding all the data structures.
  – This is not a scalable approach.
19
Thanks!
20