INTEGRATING NETWORK
STORAGE INTO INFORMATION
RETRIEVAL APPLICATIONS
Svetlana Y. Mironova
The University of Tennessee,
Knoxville
Spring 2003
Topics of Discussion
Motivation
General Text Parser (GTP)
Network Storage Stack
GTP with Network Storage
Implementation Challenges
Performance
Future Work
Motivation
The amount of text-based information stored
on our computers and on the Web is growing
rapidly.
Researchers and scientists need storage to
run simulations and store outputs.
Data mining and information retrieval
professionals need a tool capable of creating
an index from a document collection, storing
it on the network and sharing with others.
General Text Parser (GTP)
Two modules: GTP and GTPQUERY
Text/document parsing and indexing
Construct sparse matrix data structures
Create vector-space model where documents
and queries are vectors in low-dimensional
subspace
Term-by-document matrix defines
relationships between docs and distinct terms
Underlying model is Latent Semantic Indexing
(LSI)
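
The model on this slide can be written compactly. What follows is the standard LSI formulation, offered as a sketch: the symbols A (the m-by-n term-by-document matrix), k (the number of factors), U_k, Σ_k, V_k (the truncated SVD factors), q (a query as a term vector) and d_j (row j of V_k, the vector for document j) are notation assumed here and do not appear in the slides.

```latex
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
\hat{q} = q^{T} U_k \Sigma_k^{-1}, \qquad
\cos\theta_j = \frac{\hat{q} \cdot d_j}{\lVert \hat{q} \rVert \, \lVert d_j \rVert}
```

Documents are ranked for a query by decreasing cosine.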
Versions of GTP
C++ (original)
Parallel C++ using MPI (for SVD
computation)
Java (GUI recently developed)
C++ version: Solaris (Unix) and Linux
Parallel version: Solaris only
Java version: Solaris, Linux and Mac OS X
GTP Process
Filter documents (optional)
Create database of keys, IDs and
weights
Perform matrix decomposition (SVD) on
the term-by-document matrix
Clean up
Write out summary
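
The "keys, IDs and weights" step can be sketched as follows. This is a minimal illustration, not GTP's code: the function name build_index, the tokenizer, and the sparse dictionary layout are all assumptions. The weighting is log entropy, the scheme named on the Run Specifications slide, written here in its usual textbook form, which is assumed to match what GTP computes. The SVD step is not shown.

```python
import math
import re
from collections import Counter


def build_index(docs, stop_words=frozenset()):
    """Return (keys, weights).

    keys maps each distinct term to an integer ID.
    weights maps (term_id, doc_id) to a log-entropy weight; this is a
    sparse form of the term-by-document matrix (hypothetical layout).
    """
    counts = []  # one Counter of raw term frequencies per document
    for text in docs:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in stop_words]
        counts.append(Counter(tokens))

    # Assign IDs to the distinct terms in sorted order.
    keys = {term: i for i, term in enumerate(sorted(set().union(*counts)))}
    n_docs = len(docs)

    weights = {}
    for term, tid in keys.items():
        freqs = [c[term] for c in counts]
        total = sum(freqs)
        # Global weight: 1 + sum_j p_ij * log2(p_ij) / log2(n_docs)
        entropy = sum((f / total) * math.log2(f / total) for f in freqs if f)
        g = 1.0 + entropy / math.log2(n_docs) if n_docs > 1 else 1.0
        for did, f in enumerate(freqs):
            if f:
                # Local weight log2(1 + f) times the global weight;
                # terms absent from a document are simply not stored.
                weights[(tid, did)] = math.log2(1 + f) * g
    return keys, weights


docs = ["storage network storage", "network index", "index query"]
keys, weights = build_index(docs)
```

A term concentrated in one document keeps a global weight of 1, while a term spread evenly over every document is weighted down to 0.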
Query Process
Filter queries (optional)
Parse first query
Generate query vector
Scale query vector by singular values
(optional)
Perform cosine matching
Write results to file for this query
Repeat for more queries
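
The query steps above can be sketched in a few lines. The function names and the toy numbers are assumptions made for illustration; the term and document vectors below are hand-picked, not the output of a real SVD, and the way the optional scaling is applied is a guess at what the slide means.

```python
import math


def project_query(query, term_vectors, singular_values, scale=False):
    """Project a sparse query (term_id -> weight) into the k-factor space."""
    k = len(singular_values)
    # q_hat = q^T U_k Sigma_k^{-1}
    q_hat = [
        sum(w * term_vectors[tid][f] for tid, w in query.items())
        / singular_values[f]
        for f in range(k)
    ]
    if scale:
        # Optional step: scale the query vector by the singular values.
        q_hat = [x * s for x, s in zip(q_hat, singular_values)]
    return q_hat


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(q_hat, doc_vectors):
    """Return (score, doc_id) pairs, best match first."""
    return sorted(((cosine(q_hat, d), j) for j, d in enumerate(doc_vectors)),
                  reverse=True)


# Toy data: 4 terms, 3 documents, k = 2 factors (made-up values).
term_vectors = [[1, 0], [0, 1], [1, 1], [0, 0]]
singular_values = [2.0, 1.0]
doc_vectors = [[1, 1], [1, 0], [0, -1]]

q_hat = project_query({0: 1.0, 2: 1.0}, term_vectors, singular_values)
ranking = rank_documents(q_hat, doc_vectors)
```

Each further query repeats only the projection and the cosine pass; the index itself is not touched.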
Network Storage Stack
Framework for storing and transferring
data over network
Modeled after Internet Protocol (IP)
Stack
Designed to add storage resources to
the Internet in a sharable and scalable
manner
Network Storage Stack
Applications (GTP, etc)
Logistical File System
Logistical Tools
L-Bone
exNode
IBP
Local Access
Physical Layer
IBP
Internet Backplane Protocol
Foundation of Network Storage Stack
Share resources across networks
Use of local storage to create global storage
service
Echoes advantages of IP: abstraction of
datagram delivery, scalability, simple fault
detection (discard faulty datagrams)
Temporary and “unreliable”
IBP Client Calls
Allocate
Store
Load
Copy
Mcopy
Manage
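
To make the calls concrete, here is a toy in-memory model of a depot with time-limited allocations. It is not the IBP client library: ToyDepot, its method signatures, and the single capability string per allocation are all simplifications invented for this sketch, and Mcopy is left out.

```python
import secrets
import time


class ToyDepot:
    """In-memory stand-in for a storage depot (hypothetical, not real IBP)."""

    def __init__(self):
        self._allocs = {}  # capability string -> allocation record

    def allocate(self, size, duration_s):
        """Reserve size bytes for duration_s seconds; return a capability."""
        cap = secrets.token_hex(8)
        self._allocs[cap] = {
            "data": bytearray(),
            "size": size,
            "expires": time.monotonic() + duration_s,
        }
        return cap

    def _get(self, cap):
        alloc = self._allocs.get(cap)
        if alloc is None or time.monotonic() >= alloc["expires"]:
            self._allocs.pop(cap, None)  # expired allocations simply vanish
            raise KeyError("allocation missing or expired")
        return alloc

    def store(self, cap, data):
        """Append data to the allocation (this toy is append-only)."""
        alloc = self._get(cap)
        if len(alloc["data"]) + len(data) > alloc["size"]:
            raise ValueError("allocation full")
        alloc["data"] += data

    def load(self, cap, offset=0, length=None):
        """Read length bytes from offset, or everything after offset."""
        data = self._get(cap)["data"]
        end = len(data) if length is None else offset + length
        return bytes(data[offset:end])

    def manage(self, cap, extend_s=0):
        """Optionally extend the lifetime, then report seconds remaining."""
        alloc = self._get(cap)
        alloc["expires"] += extend_s
        return alloc["expires"] - time.monotonic()


def copy(src, src_cap, dst, dst_cap):
    """Move the contents of one allocation into another depot."""
    dst.store(dst_cap, src.load(src_cap))


tn, ca = ToyDepot(), ToyDepot()
cap_tn = tn.allocate(size=16, duration_s=60)
tn.store(cap_tn, b"keys")
cap_ca = ca.allocate(size=16, duration_s=60)
copy(tn, cap_tn, ca, cap_ca)
```

The expiry check is the point of the model: storage is lent for a limited time, and a client that wants the data to persist has to extend or re-copy it.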
exNode
IBP capabilities are hard to manage by hand
The exNode automates this
exNodes are pointers to IBP allocations
They allow network files with stronger properties
(fault tolerance, longer duration, etc.) to be
built from unreliable IBP allocations
Two major components: metadata and
mappings
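
The two components can be pictured as a small data structure. This sketch is only an illustration: the class and field names are assumptions, the capability is a placeholder string, and the real exNode is serialized as XML rather than held as Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    offset: int      # where this extent starts in the logical file
    length: int      # number of bytes the extent covers
    capability: str  # placeholder for a reference to one IBP allocation


@dataclass
class ExNode:
    metadata: dict = field(default_factory=dict)
    mappings: list = field(default_factory=list)

    def replicas_for(self, offset):
        """Mappings whose extent contains offset; more than one is redundancy."""
        return [m for m in self.mappings
                if m.offset <= offset < m.offset + m.length]

    def is_complete(self, file_size):
        """True if every byte of the file is covered by some mapping."""
        covered = 0
        for m in sorted(self.mappings, key=lambda m: m.offset):
            if m.offset > covered:
                return False  # gap: no allocation holds these bytes
            covered = max(covered, m.offset + m.length)
        return covered >= file_size


xnd = ExNode(
    metadata={"filename": "keys"},
    mappings=[
        Mapping(0, 100, "cap-A"),
        Mapping(100, 100, "cap-B"),
        Mapping(0, 200, "cap-C"),  # a second copy of the whole file
    ],
)
```

Because byte 50 is covered by two mappings, losing either allocation still leaves the file readable, which is how unreliable allocations add up to a more fault-tolerant file.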
L-Bone
The Logistical Backbone
Resource discovery service
Maintains list of public depots and
metadata about them
Uses Network Weather Service (NWS)
to monitor throughput between depots
http://loci.cs.utk.edu/lbone
LoRS
Logistical Runtime System
Automates finding IBP depots via the L-Bone, and
creating and managing IBP capabilities and exNodes
C API and command-line tool set
LoRS Functions
Upload
Download
Augment
Trim
Refresh
List
GTP with Network Storage
Creating an index is a dynamic process
Large document collection => large output
files => require lots of storage space
Need to share produced results with others
(across the globe)
If the result is unsatisfactory, the stored files
expire automatically
If the collection is worth keeping, it can either stay on
IBP longer or be stored locally (burned to CD, etc.)
GTP and Upload
GTP parses the collection
GTP creates output files (keys and output)
Files are uploaded to remote network storage (IBP)
Upload accepts optional information from the
user
This information helps optimize performance
Capabilities are returned in the form of XML
files (.xnd extension)
GTP and Upload (contd)
Location (Null, hostname, zip, state,
city, country, airport)
Duration in days
Fragments
Copies
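
The Fragments and Copies parameters together determine how many allocations an upload needs. A minimal sketch of that bookkeeping follows; plan_upload is a hypothetical helper, and the even split with the remainder spread over the first fragments is an assumption, not a description of how the LoRS tools divide a file.

```python
def plan_upload(file_size, fragments, copies):
    """Return (copy_index, offset, length) for every allocation needed."""
    base, extra = divmod(file_size, fragments)
    plan = []
    for c in range(copies):
        offset = 0
        for i in range(fragments):
            # Spread the remainder over the earliest fragments.
            length = base + (1 if i < extra else 0)
            plan.append((c, offset, length))
            offset += length
    return plan


plan = plan_upload(file_size=10, fragments=3, copies=2)
```

Here a 10-byte file in 3 fragments and 2 copies needs 6 allocations, so more fragments and copies buy parallelism and redundancy at the cost of more capabilities to track.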
Download and GTPQUERY
The keys and output files are downloaded
using information from the .xnd files
Download is multithreaded
Adaptive algorithm: takes into account
throughput to the client
“Faster” depots provide more blocks of
data
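
One simple way to get the behavior described above is a shared work queue with one thread per depot: each thread takes the next outstanding block as soon as it finishes the previous one, so a depot with higher throughput ends up serving more blocks. The sketch below simulates depots with sleeps. It is an assumed design, not the LoRS algorithm, and every name in it is hypothetical; retries and failed depots are not handled.

```python
import queue
import threading
import time


def download(blocks, depots):
    """Fetch every block using one worker thread per depot.

    depots maps a depot name to a fetch(block_index) -> bytes callable.
    Returns (data_by_block, blocks_served_per_depot).
    """
    todo = queue.Queue()
    for b in blocks:
        todo.put(b)

    data = {}
    served = {name: 0 for name in depots}
    lock = threading.Lock()

    def worker(name, fetch):
        while True:
            try:
                b = todo.get_nowait()  # take the next outstanding block
            except queue.Empty:
                return
            payload = fetch(b)
            with lock:
                data[b] = payload
                served[name] += 1

    threads = [threading.Thread(target=worker, args=item)
               for item in depots.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data, served


def make_depot(delay_s):
    """Simulated depot whose per-block latency is delay_s seconds."""
    def fetch(block_index):
        time.sleep(delay_s)
        return f"block-{block_index}".encode()
    return fetch


depots = {"fast": make_depot(0.005), "slow": make_depot(0.05)}
data, served = download(range(30), depots)
```

No throughput estimate is needed up front: the split between depots falls out of how quickly each one returns its blocks.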
Download + GTPQUERY
[Figure: representation of the binary file output for the 5K collection]
Implementation Challenges
GTP is written in Java, while the LoRS tools are in C
GTP reaches the LoRS tools through a server (first
xnd_server, then lors_server)
Adapt to changes – both GTP and LoRS
tools are constantly evolving
Threading to optimize performance
User friendliness
Performance
All results were achieved using the Java
version of GTP
Three subcollections of FBIS (Foreign
Broadcast Information Service) were used to
produce benchmarks
Server located in Tennessee
Upload/download to/from Tennessee (TN),
California (CA), France (FR)
Run Specifications
By default, GTP uses 100 SVD factors, i.e., all
term and document vectors are of length 100
The weighting scheme used was log entropy
For the query only the first 15 singular triplets
were used
Three queries were used on each collection:
Yugoslavia Croatia Bosnia-Herzegovina
Russia embassy FIS
Nissan Motor
FBIS 5K
[Two bar charts: time in seconds by location (FR, CA, TN). Panel "GTP + Upload" shows GTP and Upload bars; panel "Download + GTPQUERY" shows GTPQUERY and Download bars. The GTP bar is labeled 284 s and the GTPQUERY bar 6.4 s for each location.]
Name      Size      Docs     Terms    output file   keys file
FBIS 5K   17.8 MB   5,000    22,558   11 MB         2.78 MB
FBIS 10K
[Two bar charts: time in seconds by location (FR, CA, TN). Panel "GTP + Upload" shows GTP and Upload bars; panel "Download + GTPQUERY" shows GTPQUERY and Download bars. The GTP bar is labeled 548 s and the GTPQUERY bar 21 s for each location.]
Name       Size    Docs      Terms    output file   keys file
FBIS 10K   32 MB   10,000    31,667   18 MB         3.5 MB
FBIS 20K
[Two bar charts: time in seconds by location (FR, CA, TN). Panel "GTP + Upload" shows GTP and Upload bars; panel "Download + GTPQUERY" shows GTPQUERY and Download bars. The GTP bar is labeled 1320 s and the GTPQUERY bar 23 s for each location.]
Name       Size    Docs      Terms    output file   keys file
FBIS 20K   63 MB   20,000    46,488   28 MB         5.8 MB
Performance Analysis
GTP + Upload
GTP time is directly proportional to the
collection size
Additional overhead for upload is not
significant compared to the total time
Upload time depends on multiple factors:
location, network bandwidth, time of day, size
of file, number of copies requested, and
status of depots at the time of the upload
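
The proportionality claim can be checked with a little arithmetic on the benchmark slides. The sketch assumes that the bar labels 284, 548 and 1320 on the FBIS 5K, 10K and 20K charts are the GTP times in seconds; the document counts and sizes come from the tables on those slides.

```python
# (documents, collection size in MB, GTP seconds read from the chart labels)
runs = {
    "FBIS 5K": (5_000, 17.8, 284),
    "FBIS 10K": (10_000, 32, 548),
    "FBIS 20K": (20_000, 63, 1320),
}

# Cost per document (ms) and per megabyte (s) for each collection.
rates = {
    name: (1000 * secs / docs, secs / mb)
    for name, (docs, mb, secs) in runs.items()
}

for name, (ms_per_doc, s_per_mb) in rates.items():
    print(f"{name}: {ms_per_doc:.1f} ms per document, {s_per_mb:.1f} s per MB")
```

Under that reading, the cost stays between roughly 55 and 66 ms per document across a fourfold increase in collection size, which is consistent with near-linear scaling.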
Performance Analysis
Download + GTPQUERY
All “heavy-duty” preprocessing of the collection was
done by GTP
Query process simply projects the query into the
term-by-document vector space
The dimension of the vector space and the number of factors
used affect query time
The number of queries requested affects query time
Download takes up the greater portion of the total time
Download is affected by location of fragments and
network conditions
Future Work
Optimize Java performance
Incorporate fully with GUI
Incorporate network storage into the
other (C++, parallel) versions of GTP
Streaming data directly while it is
generated?
Avoid local file generation
User friendliness