Supercomputing 2003, UK e-Science Booth

Download Report

Transcript Supercomputing 2003, UK e-Science Booth

First Data Investigation on the Grid:
FirstDIG
Terry Sloan, Paul Graham, Adam Carter
Edinburgh Parallel Computing Centre (EPCC)
Telephone: +44 131 650 5155
Email: [email protected]
[email protected]
[email protected]
1
Supercomputing 2003, UK e-Science Booth
Overview
The Project
Motivation
Methodology
Data Sources, Cleaning, Analysis
OGSA-DAI
Future Work
2
Supercomputing 2003, UK e-Science Booth
The Project
Supercomputing 2003, UK e-Science Booth
The Project
Two aims:
1. Demonstrate deployment of OGSA-DAI within
the First South Yorkshire bus operational
environment and learn from it
2. Short data analysis using OGSA-DAI service
enabled data sources to answer business
questions posed by First South Yorkshire
4
Supercomputing 2003, UK e-Science Booth
The Project (cont)
Partners
– First plc represented by First South Yorkshire
– National e-Science Centre represented by EPCC
Timescales
–
–
–
–
5
9 months
Start May 2003
End JanDec 2004
Nov 2003 = Project Month 7 (PM7)
Supercomputing 2003, UK e-Science Booth
Motivation
Supercomputing 2003, UK e-Science Booth
Motivation
First plc
– Few UK e-Science projects involve service companies such as First
plc
– Operate worldwide in variety of transport sectors
– Over 10000 vehicles in the UK, 23% of the market
– UK’s largest operator
– Challenge is meeting the needs of the travelling public whilst making
money
– Data Mining may assist but huge range of fragmented data sources
OGSA-DAI : Data Access and Integration
– Potentially provides a solution
– Need business users to make transition from science to commerce
7
Supercomputing 2003, UK e-Science Booth
Methodology
Supercomputing 2003, UK e-Science Booth
Methodology
Business questions
Data sources
Data cleaning/analysis
OGSA-DAI service-enabled data sources
Replicate data cleaning/data analysis
Feedback on OGSA-DAI suitability and areas for
improvement.
9
Supercomputing 2003, UK e-Science Booth
Data Sources, Cleaning & Analysis
Supercomputing 2003, UK e-Science Booth
Data Sources in the Bus
Industry
Many different kinds of data involved with running
a bus company
– Mileage, revenue, customer contact, schedule, fuel consumption,
vehicle maintenance, routes…
Many means to collect data
–
–
–
–
11
Manually entered data at depot
Data collected on buses from ticket machines
Data collected on buses from GPS systems
GPS system notes when bus passes through a predefined
“footprint” and records the time at which this happens
Supercomputing 2003, UK e-Science Booth
Disparate Databases
Data is typically stored in disparate databases
– Various reasons for this: Incremental construction of systems.
– Not a problem for day-to-day running and querying but…
Introduces challenges for Data Analysis
–
–
–
–
–
–
12
Systems introduced at different times
Different database engines
Different front-ends
Different operating systems
Different physical locations
Different ways of representing data
Supercomputing 2003, UK e-Science Booth
An Example Process
CLEAN
AGGREGATE
RE-FORMAT
JOIN
ANALYSE
RE-FORMAT
CLEAN
13
AGGREGATE
Supercomputing 2003, UK e-Science Booth
Cleaning and Reformatting
One Bus, Many Names
– e.g. Service 25A might be “025A”, “25A”, “25a”
– Sometimes referring to individual depots, and sometimes to
operating regions which may include various depots.
– Furthermore, if data is stored separately for each depot, data
might not explicitly include a reference to a depot – this has to be
added when the data is aggregated
Pre-processing can often be done with SQL after
some initial analysis
– e.g. Create tables with entries corresponding to the depot and
columns containing data on how this depot is labelled in the
different databases.
14
Supercomputing 2003, UK e-Science Booth
Cleaning and Reformatting 2
Pre-processing with SQL (continued)
– Harder for example of service names: Need larger table. Requires
effort, but need only be created once.
– Alternative:
• Read data from database
• Process data with other tools (Perl, SPSS, …)
• Load results to new table in database
15
Supercomputing 2003, UK e-Science Booth
Aggregation
Data can be aggregated in various ways
– e.g. By Service, By Day
SQL can do much of the simple aggregation:
SELECT Service, Region, SUM(Revenue) AS
TotalRevenue
FROM RevenueTable
GROUP BY Service, Region
In practice SQL can be somewhat more
complicated
16
Supercomputing 2003, UK e-Science Booth
Aggregation 2
As before, an alternative is:
– Read data from database
– Aggregate with external program (SPSS, Perl, even Excel)
– Load data back into database
Whether or not this is worth doing depends on
– Availability of Aggregation Functions in database engine
– Extent of processing required: If a database is stored on a small
or heavily-used machine, it may be quicker to export, process,
and import.
17
Supercomputing 2003, UK e-Science Booth
Joins
Can combine data from more than one database:
– Complaints versus Lateness
– Revenue versus Lost Miles
– Complaints versus Lost Miles
Often Joins are on data aggregated in some way:
– By Service
– By Day
Subsets of the data can also be considered
– e.g. no weekends
18
Supercomputing 2003, UK e-Science Booth
Hurdles: Non-Standard SQL
 Non-Standard SQL introduces some hurdles for
transparent integration of data
 Date Formats:
– No standard data format: DD/MM/YYYY or MM/DD/YY
– No standard date handling functions
– Compare MS Access and mySQL:
SELECT * FROM AccessTable WHERE
IncidentDate BETWEEN #11/30/2000# AND
#11/30/2002#
SELECT * FROM MySQLTable WHERE
IncidentDate BETWEEN '2000-11-30' AND
'20021130'
19
Supercomputing 2003, UK e-Science Booth
Hurdles: How representative
is Data?
Data available for mining can influence results
Representative data required for meaningful
results
Since data is not collected for the purposes of
data mining, it may be incomplete
For example, data might only be collected to
analyse a perceived problem with a particular
route
20
Supercomputing 2003, UK e-Science Booth
Required Datamining Tools
 SQL can be used for basic data
analysis but OGSA-DAI doesn’t
replace data mining tools
 More complicated data analysis
requires external tools: e.g. C5, Perl,
SPSS, Excel
 OGSA-DAI’s use here is to extract
data required for analysis and
deliver it to the system on which
analysis is to be performed in a
useful format
21
Supercomputing 2003, UK e-Science Booth
OGSA-DAI
GRID
OGSA-DAI
Machine
performing
analysis
OGSA-DAI
Supercomputing 2003, UK e-Science Booth
The problem
Access to databases at First
The databases:
–
–
–
–
Are located at different sites
Are hosted on different operating systems
Are not all available via the internal network
Have different DBMS
Require ability to analyse their contents in a
uniform manner and include cross-database
analysis
23
Supercomputing 2003, UK e-Science Booth
The solution
 OGSA-DAI
– Open Grid Services Architecture Data Access and Integration
– DAIS-WG at GGF
 Grid middleware:
– Assists with the access and integration of data from separate data
sources via the Grid
– Represents databases as Grid Services
– Enables access from other machines in a secure manner
 OGSA-DAI Partners
–
–
–
–
24
Funded under UK e-Science Core program
Universities of Edinburgh, Manchester and Newcastle
IBM and Oracle
http://www.ogsadai.org.uk
Supercomputing 2003, UK e-Science Booth
OGSA Data Access and
Integration
Based on Grid Services concept
– Stateful web services with an associated lifetime
– Has a set of behaviours, and conforms to a set of interfaces
through which a client may interact
Three main Grid Services:
– DAI Service Group Registry (DAISGR)
• Holds a list of …
– Grid Data Service Factory (GDSF)
• Associated with a single database
– Grid Data Service (GDS)
• A “session” with a database
25
Supercomputing 2003, UK e-Science Booth
OGSA-DAI typical
interaction 1/3
1.
Client
DAISGR
GDSF
Web Services
Container
26
Supercomputing 2003, UK e-Science Booth
OGSA-DAI typical
interaction 2/3
2.
GDS
Client
GDSF
Database
Web Services
Container
27
Supercomputing 2003, UK e-Science Booth
OGSA-DAI typical
interaction 3/3
3.
Client
GDS
Web Services
Container
28
Supercomputing 2003, UK e-Science Booth
Database
First and OGSA-DAI
Our remit:
– To evaluate the suitability of the use of OGSA-DAI in a
commercial environment
Need to find out if OGSA-DAI:
–
–
–
–
Is appropriate
Is secure
Is straightforward to deploy and use
Does what we need!
Feedback from project goes straight to OGSADAI team
29
Supercomputing 2003, UK e-Science Booth
Progress
Have a test deployment running at EPCC
Using two of the databases identified in the data
analysis WP
– The Customer Contact System
• Microsoft Access
• Information on customer complaints e.g. time, service, nature
– The Mileage database
• dBASE IV
• Information on bus mileage e.g. lost miles
30
Supercomputing 2003, UK e-Science Booth
Issue
OGSA-DAI currently does not officially support
Access or dBASE IV !
However, does support JDBC-accessible
databases
Solution
– Use the Microsoft provided ODBC driver
– Use the Sun provided JDBC-ODBC bridge
31
Supercomputing 2003, UK e-Science Booth
Set up
Using three machines within our firewall
– One to host the CCS database
– One to host the Mileage database
– One to act as the client
Client
DAISGR
GDSF
32
DAISGR
GDSF
Supercomputing 2003, UK e-Science Booth
Limitations
Data type support
– The BIT data type (Yes/No fields)
– The Date format
“Out of range” character codes
– Limitation of XML
Firewalls
– General Grid computing discussion
Usability
– Use of XML can be confusing
33
Supercomputing 2003, UK e-Science Booth
Future Work
Supercomputing 2003, UK e-Science Booth
Future work
Deploy at First
– And test within their network
A client tool
– To improve usability
Additional databases and DBMS
– First have other databases under different DBMS they want to
integrate
Single DAI Service Group Registry
– These databases should be registered centrally
More complex interactions
– Joins across databases …
35
Supercomputing 2003, UK e-Science Booth