
GEDDM: Comparisons of OGSA-DAI and GridFTP for
access to and conversion of remote unstructured data in
legal data mining
Karen Loughran
The Queen’s University of Belfast
Introduction
Grid Enabled Distributed Data Mining
Industrial partner
Overview of GEDDM
GEDDM Common Semantic Model (CSM)
objectives
Grid enabled solution
Industrial Partner - Datactics
Northern Ireland based (formed 1999)
Specialising in grid enabled “data-centric”
matching across multiple sectors
Datactics technology is fully parallelised
Computationally intensive - need to compare
every record with every other record
Improve data quality by applying fuzzy matching
techniques
Data mining software being used in the real world
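To make the "every record with every other record" cost concrete, here is a minimal sketch of pairwise fuzzy matching. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the slides do not describe Datactics' actual matching techniques, and the `fuzzy_pairs` helper, the threshold and the sample names are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalised strings are identical."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_pairs(records, threshold=0.85):
    """Compare every record with every other record: n*(n-1)/2 comparisons."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            matches.append((i, j, round(score, 2)))
    return matches


# Hypothetical records with near-duplicate spellings.
records = ["Dede H Smith", "Dede H. Smith", "Annie Saunders", "Anne Saunders"]
for i, j, score in fuzzy_pairs(records):
    print(records[i], "~", records[j], score)
```

The number of comparisons grows quadratically with the number of records, which is why this kind of workload benefits from being parallelised.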
GEDDM Business Driver
Data sources
numerous structures, formats, locations, administrative
domains…
Client
US County Court: insider trading litigation case
45 TB
Variety of formats
Email, pdf, weblogs, DBMS, report text dumps …
How to interface to large volumes of data in
a common, structured, parallel approach
Common Semantic Model (CSM)
Objectives
Representation of unstructured data such as
email, weblog, report dumps.
Conversion to structured format.
Evaluation of Grid technologies for access
and conversion.
Secure, reliable and scalable.
Exploit high bandwidth.
CSM Grid Enabled Solution
Two Stages:
Represent and convert unstructured Flat File
Formats (FFF) to structured Common Output
Format File (COFF).
Investigate Grid technologies for the remote
access and conversion of unstructured data.
CSM Representation & Conversion
Data Description Language DDL - XSD
Data Description File DDF
Parser
Sample FFF data source & DDF
App      Account       Address                  Balance
IMP      343818        Dede H Smith             8600.76
                       181 Glen Rd
                       Earls Court, London
IMP      565777        Annie Saunders           9905.50
                       60 Newhaven St
                       Edinburgh, Scotland
___________________________________________________________________
<datasource>
<database>
<header><headertext>App Account Address Balance
</headertext></header>
<rectype eorecord="\n">
<pfield name="App" pos="1" length="3"/>
<pfield name="Account" pos="10" length="6"/>
<pfield name="Address" pos="24" length="23"
multiline="yes"/>
<pfield name="Balance" pos="49" length="8"/>
</rectype>
</database>
</datasource>
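As a rough illustration of how the positional fields in the DDF could drive a conversion, the sketch below extracts fixed-width fields from the sample FFF. The `PField` class and `parse_records` function are hypothetical names, not the project's parser, and the rule for `multiline` fields (a line with a blank first field continues the previous record) is an assumption, since the slide does not define it.

```python
from dataclasses import dataclass


@dataclass
class PField:
    name: str
    pos: int          # 1-based start column, as in the DDF
    length: int
    multiline: bool = False

    def extract(self, line: str) -> str:
        start = self.pos - 1
        return line[start:start + self.length].strip()


# Field definitions copied from the sample DDF.
FIELDS = [
    PField("App", 1, 3),
    PField("Account", 10, 6),
    PField("Address", 24, 23, multiline=True),
    PField("Balance", 49, 8),
]

# Sample FFF rebuilt with format widths so the columns match the DDF positions.
SAMPLE = [
    f"{'App':<9}{'Account':<14}{'Address':<25}Balance",
    f"{'IMP':<9}{'343818':<14}{'Dede H Smith':<25}8600.76",
    f"{'':<23}181 Glen Rd",
    f"{'':<23}Earls Court, London",
    f"{'IMP':<9}{'565777':<14}{'Annie Saunders':<25}9905.50",
    f"{'':<23}60 Newhaven St",
    f"{'':<23}Edinburgh, Scotland",
]


def parse_records(lines, fields=FIELDS, header_lines=1):
    """Return one dict per record.

    Assumption: a line whose first field is blank continues the previous
    record, and only fields marked multiline are appended to.
    """
    records = []
    for line in lines[header_lines:]:
        if not line.strip():
            continue
        values = {f.name: f.extract(line) for f in fields}
        if values[fields[0].name] == "" and records:
            for f in fields:
                if f.multiline and values[f.name]:
                    records[-1][f.name] += ", " + values[f.name]
        else:
            records.append(values)
    return records


for record in parse_records(SAMPLE):
    print(record)
```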
Parser Design
Object oriented component hierarchy
Each object represents an XML element
Encapsulates data relating to the flat file
component it describes
Encapsulates the all-important “parse” method
SAX parse performed on DDF to build up
internal OO representation of FFF
Parse called on top level object.
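The design above can be sketched with Python's standard `xml.sax` module: a SAX handler builds one object per DDF element, and calling `parse` on the top-level object delegates down the hierarchy. The class names `Component`, `PField` and `DDFHandler` are assumptions for illustration, as the slides do not show the real parser's classes, and the embedded DDF is a shortened, well-formed version of the sample.

```python
import xml.sax

# Shortened, well-formed copy of the sample DDF (attribute values quoted).
DDF = r"""<datasource>
  <database>
    <header><headertext>App Account Address Balance</headertext></header>
    <rectype eorecord="\n">
      <pfield name="App" pos="1" length="3"/>
      <pfield name="Account" pos="10" length="6"/>
      <pfield name="Address" pos="24" length="23" multiline="yes"/>
      <pfield name="Balance" pos="49" length="8"/>
    </rectype>
  </database>
</datasource>"""


class Component:
    """One object per XML element; holds its attributes and child components."""

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs.items())
        self.children = []

    def parse(self, line):
        """Delegate to the children and merge their results."""
        result = {}
        for child in self.children:
            result.update(child.parse(line))
        return result


class PField(Component):
    """Positional field: knows how to cut its own value out of a line."""

    def parse(self, line):
        start = int(self.attrs["pos"]) - 1
        end = start + int(self.attrs["length"])
        return {self.attrs["name"]: line[start:end].strip()}


class DDFHandler(xml.sax.ContentHandler):
    """Builds the component hierarchy while the DDF is being SAX-parsed."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def startElement(self, name, attrs):
        cls = PField if name == "pfield" else Component
        node = cls(name, attrs)
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)

    def endElement(self, name):
        self.stack.pop()


handler = DDFHandler()
xml.sax.parseString(DDF.encode("utf-8"), handler)

# First line of the first sample record, laid out to match the DDF positions.
line = f"{'IMP':<9}{'343818':<14}{'Dede H Smith':<25}8600.76"
print(handler.root.parse(line))  # parse is called on the top-level object
```

This sketch handles a single line only; multiline fields and record terminators would need extra logic in the `rectype` component.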
CSM Grid technologies
Transfer & conversion tools
OGSA-DAI (Version 4)
GridFTP (GT4.0.0)
GUI interfacing to both of these
technologies.
GUI interface – access & conversion
GUI Interface to sample remote FFF, DDF creation and conversion.
[Figure: workflow diagram. Steps: View Sample (unstructured FFF data), Describe (DDF), Convert (Conversion Module, Data Conversion Services), View Results (structured COFF data), then an "OK?" decision with No and Yes branches; Yes leads to Complete.]
Implementation under OGSA-DAI
OGSA-DAI 4.0.0
Globus Toolkit 3.2.1
New conversion activity designed &
implemented
Calls out to python scripts to perform
conversion
Implementation under GridFTP
Globus Toolkit 4.0.0
Data Storage Interface (DSI) creation to
perform conversion processing at server
Instead of original unstructured FFF, send
the COFF file back to client
Setup striped server architecture – multiple
nodes working together in parallel.
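The idea behind striping can be illustrated with a small sketch that divides a file into blocks and assigns them to server nodes in turn, so that each node only reads and sends its own share. The round-robin rule, the `stripe_layout` function and the sizes used are assumptions for illustration; the actual layout used by a striped GridFTP server depends on its configuration.

```python
def stripe_layout(file_size, n_nodes, block_size):
    """Assign consecutive blocks round-robin.

    Returns {node_index: [(offset, length), ...]}.
    """
    layout = {node: [] for node in range(n_nodes)}
    offset = 0
    block = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        layout[block % n_nodes].append((offset, length))
        offset += length
        block += 1
    return layout


# Hypothetical example: a 10-byte file, 3 nodes, 4-byte blocks.
for node, blocks in stripe_layout(10, 3, 4).items():
    print(f"node {node}: {blocks}")
```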
GridFTP Striped Architecture
[Figure: striped GridFTP servers at two sites. Belfast: Host A, Host B, Host C. London: Host X, Host Y, Host Z. Each host has its own RAID storage.]
GridFTP Machine Specifications
BELFAST
AMD4400 Dual Processor
4 GB RAM
1 Terabyte hard disk, serial ATA2
1 Gigabit ethernet
LONDON
Dual Opteron Processor
4 GB RAM
1 Terabyte hard disk
1 Gigabit ethernet
GridFTP Evaluation Tests
Attempted conversion and access to large
files across the network.
File sizes:
13 MB, 26 MB, 52 MB, 103 MB, 205 MB, 409 MB,
817 MB, 1634 MB
Buffer sizes:
Default, 4915, 409150, 785408
MTU 1400 - 8000
OGSA-DAI Benchmark Results
Currently no results available:
Socket Timeout Error and Engine receives a
terminate signal when Activity takes longer
than approximately 10 minutes to run.
DeliverToGridFTP activity would not work in
version 4; patches required. So far, unable to
get it working even with these patches.
Security setup issues.
GridFTP Network Topology
[Figure: network topology linking Queens (via the Queens BESC Router), Janet Bar, BBC NI, the BBC router and BBC London. Link labels: one 100 Mbit link and four 1 Gbit links.]
Results – GridFTP transfer
Throughput hindered by:
Physical infrastructure/service provider: 80 Mb/s
Routers/switches/NIC
808 Mb/s CPU to CPU (London to Belfast)
688 Mb/s disk to disk (BBC NI)
Striping with 2 BE servers: 60% improvement
Local 100 Mb/s switch:
Disk to disk: 82 Mb/s
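For a sense of scale, the following back-of-envelope sketch converts the quoted rates into transfer times for the largest test file. It assumes the rates are megabits per second and the file sizes are megabytes, and it ignores protocol overhead and conversion time, so the figures it prints are estimates derived from the slide's numbers, not measured results. The `transfer_seconds` helper is a hypothetical name.

```python
def transfer_seconds(size_mb: float, rate_mbit_s: float) -> float:
    """Seconds to move size_mb megabytes at rate_mbit_s megabits per second."""
    return size_mb * 8 / rate_mbit_s


# Rates quoted on the slide, assumed to be megabits per second.
RATES = {
    "CPU to CPU (London to Belfast)": 808,
    "Disk to disk (BBC NI)": 688,
    "Disk to disk, local 100 Mb/s switch": 82,
    "Service provider limit": 80,
}

LARGEST_FILE_MB = 1634  # largest test file, assumed to be megabytes

for label, rate in RATES.items():
    print(f"{label}: {transfer_seconds(LARGEST_FILE_MB, rate):.1f} s")
```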
OGSA-DAI Evaluation ….
DeliverToGridFTP not working in 4.0.0
Configuring GridFTP not possible (buffer sizes,
no. of streams, striped transfer etc.)
Some way to go in efficient transfer of large files.
Installation/runtime overheads
Conversion activity must be designed and coded, and perform
documents designed, for access/conversion
Timeouts converting large files. Threads may be a
solution.
Clear documentation
GridFTP Evaluation
Secure, reliable, fast and scalable
Lightweight installation
Optimum use of high bandwidth networks
Extra ERET/ESTO processing allows
tighter integration of conversion operations
through the definition of a DSI
Striping for much improved efficiency
GridFTP Evaluation
Extensive tuning required
No clear documentation for writing a DSI.
[email protected] useful source of info
Poor performance on NFS.
PVFS like filesystem recommended for striping.
1Gbit bandwidth in practice difficult to achieve
due to problems with:
Router
NIC
Physical Infrastructure
Conclusions
Investigated grid technologies for remote
access & conversion
OGSA-DAI disappointing due to lack of
support for large file transfer
GridFTP involved extensive configuration
and, due to network infrastructure problems,
it was difficult to get optimum performance in
remote transfer
Future work
Tighter integration of conversion services
within GridFTP DSI server module.
Extend the services under GridFTP to cope
with Distributed Query Processing.
COFF produced as XML, ready for XPath
queries.
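As an illustration of what querying XML output might look like, the sketch below runs a simple XPath-style lookup with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The slides do not show the COFF schema, so the `<coff>` and `<record>` element names and the field layout here are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical COFF document; the real COFF schema is not shown in the slides.
COFF = """<coff>
  <record><App>IMP</App><Account>343818</Account><Balance>8600.76</Balance></record>
  <record><App>IMP</App><Account>565777</Account><Balance>9905.50</Balance></record>
</coff>"""

root = ET.fromstring(COFF)

# Select the balance of the record whose Account child has the given text.
balance = root.find("./record[Account='565777']/Balance")
print(balance.text)

# Collect every account number in document order.
accounts = [node.text for node in root.findall("./record/Account")]
print(accounts)
```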
Questions ?