integrating network storage into information retrieval applications

INTEGRATING NETWORK
STORAGE INTO INFORMATION
RETRIEVAL APPLICATIONS
Svetlana Y. Mironova
The University of Tennessee,
Knoxville
Spring 2003
Topics of Discussion
Motivation
 General Text Parser (GTP)
 Network Storage Stack
 GTP with Network Storage
 Implementation Challenges
 Performance
 Future Work

2
Motivation
The amount of text-based information stored
on our computers and on the Web is growing
rapidly.
 Researchers and scientists need storage to
run simulations and store outputs.
 Data mining and information retrieval
professionals need a tool capable of creating
an index from a document collection, storing
it on the network, and sharing it with others.

3
General Text Parser (GTP)
Two modules: GTP and GTPQUERY
Text/document parsing and indexing
Constructs sparse matrix data structures
Creates a vector-space model in which documents and queries are vectors in a low-dimensional subspace
Term-by-document matrix defines relationships between documents and distinct terms
Underlying model is Latent Semantic Indexing (LSI)
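
As an illustration of the term-by-document matrix described above, here is a minimal sketch that stores raw term counts sparsely. It is not GTP's parser: the whitespace tokenizer, the function name `term_by_document`, and the sample documents are assumptions made for the example.

```python
from collections import Counter

def term_by_document(docs):
    """Return (terms, matrix), where matrix[(i, j)] is the raw count of
    term i in document j. Only nonzero entries are stored (sparse)."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    index = {term: i for i, term in enumerate(terms)}
    matrix = {}
    for j, doc_counts in enumerate(counts):
        for term, freq in doc_counts.items():
            matrix[(index[term], j)] = freq
    return terms, matrix

# Hypothetical three-document collection.
docs = ["nissan motor plant", "russia embassy", "nissan motor nissan"]
terms, matrix = term_by_document(docs)
```

Each row index corresponds to a distinct term and each column index to a document, so most entries are absent and a sparse layout keeps storage small.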
4
Versions of GTP
C++ (original): runs on Solaris (Unix) and Linux
 Parallel C++ using MPI (for SVD
computation): runs on Solaris only
 Java (GUI recently developed): runs on
Solaris, Linux, and Mac OS X

5
GTP Process
Filter documents (optional)
 Create database of keys, IDs and
weights
 Perform matrix decomposition (SVD) on
the term-by-document matrix
 Clean up
 Write out summary

6
Query Process
Filter queries (optional)
Parse first query
Generate query vector
Scale query vector by singular values (optional)
Perform cosine matching
Write results to file for this query
Repeat for more queries
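
The query steps above can be sketched as follows. This is a toy version under stated assumptions, not GTPQUERY's code: the query is projected with a term matrix `U_k`, "scaling" is taken to mean dividing by the singular values (as in the standard LSI formulation; GTP's own convention may differ), and all names and numbers are hypothetical.

```python
import math

def project_query(q, U_k, sigma=None):
    """Project a sparse query {term_index: weight} into the k-dimensional
    factor space. U_k has one row per term and k columns."""
    k = len(U_k[0])
    q_hat = [sum(w * U_k[i][f] for i, w in q.items()) for f in range(k)]
    if sigma is not None:
        # Optional scaling by the singular values (assumed: division).
        q_hat = [x / s for x, s in zip(q_hat, sigma)]
    return q_hat

def cosine(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(q_hat, doc_vectors):
    """Return (score, document_index) pairs, best match first."""
    scores = [(cosine(q_hat, d), j) for j, d in enumerate(doc_vectors)]
    return sorted(scores, reverse=True)

# Toy data: 3 terms, 2 factors, 3 documents (values are made up).
U_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_hat = project_query({0: 1.0}, U_k)
ranked = rank_documents(q_hat, doc_vectors)
```

Because all of the expensive decomposition work is done beforehand, each query costs only one projection and one cosine per document.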
7
Network Storage Stack
Framework for storing and transferring
data over network
 Modeled after Internet Protocol (IP)
Stack
 Designed to add storage resources to
the Internet in a sharable and scalable
manner

8
Network Storage Stack
Applications (GTP, etc)
Logistical File System
Logistical Tools
L-Bone
exNode
IBP
Local Access
Physical Layer
9
IBP
Internet Backplane Protocol
Foundation of the Network Storage Stack
Shares storage resources across networks
Uses local storage to create a global storage service
Echoes advantages of IP: abstraction of datagram delivery, scalability, simple fault detection (discard faulty datagrams)
Storage is temporary and “unreliable”
10
IBP Client Calls
Allocate
 Store
 Load
 Copy
 Mcopy
 Manage

11
exNode
IBP capabilities are hard to manage by hand
The exNode automates this
exNodes are pointers to IBP allocations
Allows network files with stronger properties (fault tolerance, longer duration, etc.) to be built from unreliable IBP allocations
Two major components: metadata and mappings
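
To make the "metadata and mappings" idea concrete, here is a toy in-memory model. It is a sketch only: real exNodes are serialized as XML (.xnd) files whose schema is not shown in these slides, so every field name, the capability strings, and the depot names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Mapping:
    offset: int        # where this piece starts in the logical file
    length: int        # number of bytes covered
    depot: str         # depot holding the allocation (hypothetical name)
    capability: str    # placeholder for an IBP capability string

@dataclass
class ExNode:
    metadata: dict
    mappings: list = field(default_factory=list)

    def is_recoverable(self, size, live_depots):
        """True if every byte in [0, size) is covered by at least one
        mapping stored on a depot that is currently reachable."""
        spans = sorted((m.offset, m.offset + m.length)
                       for m in self.mappings if m.depot in live_depots)
        covered = 0
        for start, end in spans:
            if start > covered:      # gap: no live copy of these bytes
                return False
            covered = max(covered, end)
        return covered >= size

# A 100-byte file stored as two fragments with two copies each.
xnd = ExNode(metadata={"filename": "keys", "size": 100})
xnd.mappings += [
    Mapping(0, 50, "A", "cap-1"), Mapping(0, 50, "C", "cap-2"),
    Mapping(50, 50, "B", "cap-3"), Mapping(50, 50, "C", "cap-4"),
]
```

Replicating each fragment on different depots is what lets a file survive the loss of any single unreliable allocation.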
12
L-Bone
The Logistical Backbone
 Resource discovery service
 Maintains list of public depots and
metadata about them
 Uses Network Weather Service (NWS)
to monitor throughput between depots
 http://loci.cs.utk.edu/lbone

13
LoRS
Logistical Runtime System
 Automates finding IBP depots via the L-Bone
and creating and managing IBP capabilities
and exNodes
 C API and command-line interface tool
set

14
LoRS Functions
Upload
 Download
 Augment
 Trim
 Refresh
 List

15
GTP with Network Storage
Creating an index is a dynamic process
Large document collection => large output files => large storage requirements
Results need to be shared with others (across the globe)
If the user is not satisfied with the result, the stored files expire automatically
If the user is happy with the collection, the files can either be kept on IBP longer or stored locally (burned to CD, etc.)
16
GTP and Upload
GTP parses the collection
GTP creates output files (keys and output)
Files are uploaded to remote network storage (IBP)
Upload accepts optional information from the user
This information helps optimize performance
Capabilities are returned in the form of XML files (.xnd extension)
17
GTP and Upload (contd)
Location (Null, hostname, zip, state,
city, country, airport)
 Duration in days
 Fragments
 Copies

18
Download and GTPQUERY
The keys and output files are downloaded
using information from the .xnd files
 Download is multithreaded
 Adaptive algorithm: takes into account
throughput to the client
 “Faster” depots provide more blocks of
data
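
One way to see why "faster" depots end up serving more blocks is a shared work queue: each depot fetches the next outstanding block as soon as it finishes the previous one. The simulation below is a generic model of that idea, not the LoRS download algorithm itself; the function name and the per-block transfer times are invented for the example.

```python
import heapq

def assign_blocks(num_blocks, seconds_per_block):
    """Simulate depots pulling blocks from a shared queue.
    seconds_per_block maps depot name -> time to deliver one block.
    Returns depot name -> number of blocks served."""
    # Heap of (time the depot becomes free, depot name).
    free_at = [(0.0, name) for name in sorted(seconds_per_block)]
    heapq.heapify(free_at)
    served = {name: 0 for name in seconds_per_block}
    for _ in range(num_blocks):
        now, name = heapq.heappop(free_at)   # next depot to become idle
        served[name] += 1
        heapq.heappush(free_at, (now + seconds_per_block[name], name))
    return served

# Hypothetical throughputs: TN is fastest, FR slowest.
served = assign_blocks(14, {"TN": 1.0, "CA": 2.0, "FR": 4.0})
```

No depot is assigned a fixed share in advance, so the split adapts automatically if a depot slows down partway through the transfer.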

19
Download + GTPQUERY
[Figure: representation of the binary file output for the 5K collection; labels: 5K, 100]
20
Implementation Challenges
GTP is written in Java, while the LoRS tools
are written in C
 Java code goes through a server (first
xnd_server, then lors_server)
 Adapt to changes – both GTP and LoRS
tools are constantly evolving
 Threading to optimize performance
 User friendliness

21
Performance
All results were achieved using the Java
version of GTP
 Three sub collections of FBIS (Foreign
Broadcast Information Service) were used to
produce benchmarks
 Server located in Tennessee
 Upload/download to/from Tennessee (TN),
California (CA), France (FR)

22
Run Specifications
By default, GTP uses 100 SVD factors, i.e., all
term and document vectors are of length 100
 The weighting scheme used was log entropy
 For the query only the first 15 singular triplets
were used
 Three queries were used on each collection:

Yugoslavia Croatia Bosnia-Herzegovina
Russia embassy FIS
Nissan Motor
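
For reference, the log-entropy weighting named above is commonly defined as a local weight log2(1 + f_ij) multiplied by a global weight 1 + (sum over j of p_ij log2 p_ij) / log2(n), where f_ij is the frequency of term i in document j, p_ij = f_ij / (sum over j of f_ij), and n is the number of documents. The sketch below follows that textbook definition; GTP's exact variant (for example, the log base) is an assumption here, as are the function name and sample data.

```python
import math

def log_entropy(matrix, n_docs):
    """matrix maps term -> {doc_index: raw frequency} (nonzero entries only).
    Returns the same structure with log-entropy weights."""
    weighted = {}
    for term, row in matrix.items():
        total = sum(row.values())
        # Sum of p * log2(p) over documents containing the term (<= 0).
        entropy = sum((f / total) * math.log2(f / total) for f in row.values())
        global_w = (1.0 + entropy / math.log2(n_docs)) if n_docs > 1 else 1.0
        weighted[term] = {j: math.log2(1 + f) * global_w
                          for j, f in row.items()}
    return weighted

# Hypothetical counts over 4 documents.
counts = {
    "the": {0: 1, 1: 1, 2: 1, 3: 1},   # spread evenly: carries no information
    "nissan": {0: 3},                  # concentrated in one document
}
weights = log_entropy(counts, n_docs=4)
```

A term spread evenly over every document receives a global weight of 0, while a term confined to a single document keeps its full local weight.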
23
FBIS 5K
[Figure: two bar charts of time in seconds by location (FR, CA, TN). Legend: GTP, Upload, GTP + Upload; GTPQUERY, Download, Download + GTPQUERY.]
Name      Size     Docs    Terms    output   keys
FBIS 5K   17.8 MB  5,000   22,558   11 MB    2.78 MB
24
FBIS 10K
[Figure: two bar charts of time in seconds by location (FR, CA, TN). Legend: GTP, Upload, GTP + Upload; GTPQUERY, Download, Download + GTPQUERY.]
Name      Size     Docs    Terms    output   keys
FBIS 10K  32 MB    10,000  31,667   18 MB    3.5 MB
25
FBIS 20K
[Figure: two bar charts of time in seconds by location (FR, CA, TN). Legend: GTP, Upload, GTP + Upload; GTPQUERY, Download, Download + GTPQUERY.]
Name      Size     Docs    Terms    output   keys
FBIS 20K  63 MB    20,000  46,488   28 MB    5.8 MB
26
Performance Analysis
GTP + Upload
GTP time is directly proportional to the
collection size
 Additional overhead for upload is not
significant compared to the total time
 Upload time depends on multiple factors:
location, network bandwidth, time of day, size
of file, number of copies requested, and
status of depots at the time of the upload

27
Performance Analysis
Download + GTPQUERY
All “heavy-duty” preprocessing of the collection was
done by GTP
 Query process simply projects the query into the
term-by-document vector space
 Dimension of the vector space and number of factors
used affect query time
 Number of queries requested affects query time
 Download takes up the greater portion of the total time
 Download is affected by location of fragments and
network conditions

28
Future Work
Optimize Java performance
 Incorporate fully with GUI
 Incorporate network storage into the
other (C++, parallel) versions of GTP
 Streaming data directly while it is
generated?
 Avoid local file generation
 User friendliness

29