Transcript Slide 1

Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):
Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
 How to accelerate search? Same results as sequential
 Ideas:
 Quick-and-dirty rejection of bad objects, 100% recall
 Fast data structure for search (based on clustering)
 Careful check of all found candidates
 Solution: mapping into fewer-D feature space
 Condition: lower-bounding of the distance
 Assumption: skewed spectrum distribution
 Few coefficients concentrate energy, rest are less important
2
Previous Chapter: Research topics




Object detection (pattern and image recognition)
Automatic feature selection
Spatial indexing data structures (more than 1D)
New types of data.
 What features to select? How to determine them?
 Mixed-type data (e.g., webpages, or images with
sound and description)
 What clustering/IR methods are better suited for
what features? (What features for what methods?)
 Similar methods in data mining, ...
3
The problem
 Very large document collections
 Google: 4,000,000,000 pages
 Slow response?
 Solution: parallel computing
 Google: 10,000 computers
4
Parallel architectures
Data stream
Instruction stream
Single
Multiple
Single
SISD
classical
SIMD
simple
Multiple
MISD
(rare)
MIMD
many SISD
5
MIMD architecture
 The most common
 Can be
 tightly coupled
 loosely coupled
 Distributed
 Many computers interacting via network
 PC Clusters
 Similar to MIMD computers, but greater cost of
communication
 very loosely coupled
 More coarse-grained programs
6
Performance improvement
Time: speedup S
 Ideally, N times (number of processors)
 In practice impossible
 The problem does not decompose into N equal parts
 Communication and control overhead
 < 1 / f, where f is the largest separable fraction of the
problem
Cost
 Per processor: S / N
7
Two approaches to parallelism
 Build new algorithms
 E.g., neural nets
 Naturally parallel
 Problem: to define the retrieval task
 Adapt the existing techniques to parallelism
 Allows relying on well-studied approaches
 We will consider this option
8
Ways to use parallelism
 Multitasking
 N search engines
 Good for processing many queries
Problems:
 A single query is not speeded up
 Bottleneck: disk access (index)
 Possible solution: replicating (part of) data. RAIDs
 Parallel algorithms
 IR = data. Main question: how to partition the data
 Document / index term matrix
(terms can be LSI dimensions, signature bits, etc)
9
Possible partitionings
 Horizontal: document partitioning. Union of results
 Vertical: term partitioning. Basically, intersect results
10
Inverted files: Logical partitioning
 Logical vs. physical document partitioning
 Logical: for each term, use pointers into inverted file data for
each processor, to indicate its portion
11
Inverted files: Logical partitioning
Construction and updating
 Also parallel
Construction
 Assign docs to processors
 Order docs such that each processor has an interval
 Process in parallel
 Merge. Each piece is ordered already
12
Inverted files:
Physical document partitioning




Several separate collections, one per processor
Separate indices
Then the lists are merged (they are already ordered)
Priority queue is used




The result is not sorted; Insertion is quick
The maximal element can be found quickly
First k elements can be found rather quickly
Details in the book
 Consistent scores are needed
 Global statistics is needed. Can be computed at index time
13
Logical or physical partitioning?
 Logical requires less communication
 Faster
 Physical is more flexible. Simpler implementation
 Simpler conversion of existing systems
14
Inverted files: Term partitioning
 Each processor processes a part of the inverted file
 The results are intersected (for AND)
 (or as appropriate for Boolean operations, OR and NOT)
 When term distribution in user queries is skewed,
then document partitioning is better
 When uniform, term partitioning is better.
 Twice for long queries, 5 – 10 times for short (Web-like)
15
Suffix arrays
 Array construction can be parallelized
 merges are parallel
 Document partitioning is applied straightforwardly
 Each processor maintains its own suffix array
 Term partitioning can be applied
 Each processor owns a branch of the tree (lexicographic
interval)
 Bottleneck: all processors need access to the entire text
16
17
Signature files
 Document partitioning: straightforward
 Create query signature, distribute to each processor
 Merge results (using Boolean operations if needed)
 Term partitioning: shorter signatures
 Merging and eliminating false drops is slow
 This method is not recommended
18
SIMD computers
 Single Instruction, Multiple data
 Uncommon
 Good for simple operations
 Bit operations in signature files
 Details in the book
 Ranking is supported in hardware in some computers
 If signature file does not fit into memory, can be
processed in batches
 I/O overhead
 Use multiple queries with the same batch
 This improves throughput, but not response time
19
… SIMD computers
 Inverted files are difficult to adapt to SIMD
 The inverted file is restructured
 Details in the book
20
Distributed IR
 MIMD with
 Slow communication
 Not all nodes are used for a given query
 Encryption issues
 Document partitioning is usually used
 Term partitioning imposes greater communication
overhead
 Document clustering can be useful (to distribute docs
by processors)
 Index clusters and then search only the best ones
 Another approach: use training queries, then similarity of
the user query to these
21
Research topics




How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
 Meta search engines
 Creating large collections with judgements
 Is recall important?
22
Conclusions
 Parallel computing can improve
 response time for each query and/or
 throughput: number of queries processed with same speed
 Document partitioning is simple
 good for distributed computing
 Term partitioning is good for some data structures
 Distributed computing is MIMD computing with slow
communication
 SIMD machines are good for Signature files
 Both are out of favor now
23
Thank you!
Till May 17? 18?, 6 pm
24