Transcript Slide 1
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):
Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
How to accelerate search? Same results as sequential
Ideas:
Quick-and-dirty rejection of bad objects, 100% recall
Fast data structure for search (based on clustering)
Careful check of all found candidates
Solution: mapping into fewer-D feature space
Condition: lower-bounding of the distance
Assumption: skewed spectrum distribution
Few coefficients concentrate energy, rest are less important
2
Previous Chapter: Research topics
Object detection (pattern and image recognition)
Automatic feature selection
Spatial indexing data structures (more than 1D)
New types of data.
What features to select? How to determine them?
Mixed-type data (e.g., webpages, or images with
sound and description)
What clustering/IR methods are better suited for
what features? (What features for what methods?)
Similar methods in data mining, ...
3
The problem
Very large document collections
Google: 4,000,000,000 pages
Slow response?
Solution: parallel computing
Google: 10,000 computers
4
Parallel architectures
Data stream
Instruction stream
Single
Multiple
Single
SISD
classical
SIMD
simple
Multiple
MISD
(rare)
MIMD
many SISD
5
MIMD architecture
The most common
Can be
tightly coupled
loosely coupled
Distributed
Many computers interacting via network
PC Clusters
Similar to MIMD computers, but greater cost of
communication
very loosely coupled
More coarse-grained programs
6
Performance improvement
Time: speedup S
Ideally, N times (number of processors)
In practice impossible
The problem does not decompose into N equal parts
Communication and control overhead
< 1 / f, where f is the largest separable fraction of the
problem
Cost
Per processor: S / N
7
Two approaches to parallelism
Build new algorithms
E.g., neural nets
Naturally parallel
Problem: to define the retrieval task
Adapt the existing techniques to parallelism
Allows relying on well-studied approaches
We will consider this option
8
Ways to use parallelism
Multitasking
N search engines
Good for processing many queries
Problems:
A single query is not speeded up
Bottleneck: disk access (index)
Possible solution: replicating (part of) data. RAIDs
Parallel algorithms
IR = data. Main question: how to partition the data
Document / index term matrix
(terms can be LSI dimensions, signature bits, etc)
9
Possible partitionings
Horizontal: document partitioning. Union of results
Vertical: term partitioning. Basically, intersect results
10
Inverted files: Logical partitioning
Logical vs. physical document partitioning
Logical: for each term, use pointers into inverted file data for
each processor, to indicate its portion
11
Inverted files: Logical partitioning
Construction and updating
Also parallel
Construction
Assign docs to processors
Order docs such that each processor has an interval
Process in parallel
Merge. Each piece is ordered already
12
Inverted files:
Physical document partitioning
Several separate collections, one per processor
Separate indices
Then the lists are merged (they are already ordered)
Priority queue is used
The result is not sorted; Insertion is quick
The maximal element can be found quickly
First k elements can be found rather quickly
Details in the book
Consistent scores are needed
Global statistics is needed. Can be computed at index time
13
Logical or physical partitioning?
Logical requires less communication
Faster
Physical is more flexible. Simpler implementation
Simpler conversion of existing systems
14
Inverted files: Term partitioning
Each processor processes a part of the inverted file
The results are intersected (for AND)
(or as appropriate for Boolean operations, OR and NOT)
When term distribution in user queries is skewed,
then document partitioning is better
When uniform, term partitioning is better.
Twice for long queries, 5 – 10 times for short (Web-like)
15
Suffix arrays
Array construction can be parallelized
merges are parallel
Document partitioning is applied straightforwardly
Each processor maintains its own suffix array
Term partitioning can be applied
Each processor owns a branch of the tree (lexicographic
interval)
Bottleneck: all processors need access to the entire text
16
17
Signature files
Document partitioning: straightforward
Create query signature, distribute to each processor
Merge results (using Boolean operations if needed)
Term partitioning: shorter signatures
Merging and eliminating false drops is slow
This method is not recommended
18
SIMD computers
Single Instruction, Multiple data
Uncommon
Good for simple operations
Bit operations in signature files
Details in the book
Ranking is supported in hardware in some computers
If signature file does not fit into memory, can be
processed in batches
I/O overhead
Use multiple queries with the same batch
This improves throughput, but not response time
19
… SIMD computers
Inverted files are difficult to adapt to SIMD
The inverted file is restructured
Details in the book
20
Distributed IR
MIMD with
Slow communication
Not all nodes are used for a given query
Encryption issues
Document partitioning is usually used
Term partitioning imposes greater communication
overhead
Document clustering can be useful (to distribute docs
by processors)
Index clusters and then search only the best ones
Another approach: use training queries, then similarity of
the user query to these
21
Research topics
How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
Meta search engines
Creating large collections with judgements
Is recall important?
22
Conclusions
Parallel computing can improve
response time for each query and/or
throughput: number of queries processed with same speed
Document partitioning is simple
good for distributed computing
Term partitioning is good for some data structures
Distributed computing is MIMD computing with slow
communication
SIMD machines are good for Signature files
Both are out of favor now
23
Thank you!
Till May 17? 18?, 6 pm
24