Supercomputing 2003, UK e-Science Booth
Download
Report
Transcript Supercomputing 2003, UK e-Science Booth
First Data Investigation on the Grid:
FirstDIG
Terry Sloan, Paul Graham, Adam Carter
Edinburgh Parallel Computing Centre (EPCC)
Telephone: +44 131 650 5155
Email: [email protected]
[email protected]
[email protected]
1
Supercomputing 2003, UK e-Science Booth
Overview
The Project
Motivation
Methodology
Data Sources, Cleaning, Analysis
OGSA-DAI
Future Work
2
Supercomputing 2003, UK e-Science Booth
The Project
Supercomputing 2003, UK e-Science Booth
The Project
Two aims:
1. Demonstrate deployment of OGSA-DAI within
the First South Yorkshire bus operational
environment and learn from it
2. Short data analysis using OGSA-DAI service
enabled data sources to answer business
questions posed by First South Yorkshire
4
Supercomputing 2003, UK e-Science Booth
The Project (cont)
Partners
– First plc represented by First South Yorkshire
– National e-Science Centre represented by EPCC
Timescales
–
–
–
–
5
9 months
Start May 2003
End JanDec 2004
Nov 2003 = Project Month 7 (PM7)
Supercomputing 2003, UK e-Science Booth
Motivation
Supercomputing 2003, UK e-Science Booth
Motivation
First plc
– Few UK e-Science projects involve service companies such as First
plc
– Operate worldwide in variety of transport sectors
– Over 10000 vehicles in the UK, 23% of the market
– UK’s largest operator
– Challenge is meeting the needs of the travelling public whilst making
money
– Data Mining may assist but huge range of fragmented data sources
OGSA-DAI : Data Access and Integration
– Potentially provides a solution
– Need business users to make transition from science to commerce
7
Supercomputing 2003, UK e-Science Booth
Methodology
Supercomputing 2003, UK e-Science Booth
Methodology
Business questions
Data sources
Data cleaning/analysis
OGSA-DAI service-enabled data sources
Replicate data cleaning/data analysis
Feedback on OGSA-DAI suitability and areas for
improvement.
9
Supercomputing 2003, UK e-Science Booth
Data Sources, Cleaning & Analysis
Supercomputing 2003, UK e-Science Booth
Data Sources in the Bus
Industry
Many different kinds of data involved with running
a bus company
– Mileage, revenue, customer contact, schedule, fuel consumption,
vehicle maintenance, routes…
Many means to collect data
–
–
–
–
11
Manually entered data at depot
Data collected on buses from ticket machines
Data collected on buses from GPS systems
GPS system notes when bus passes through a predefined
“footprint” and records the time at which this happens
Supercomputing 2003, UK e-Science Booth
Disparate Databases
Data is typically stored in disparate databases
– Various reasons for this: Incremental construction of systems.
– Not a problem for day-to-day running and querying but…
Introduces challenges for Data Analysis
–
–
–
–
–
–
12
Systems introduced at different times
Different database engines
Different front-ends
Different operating systems
Different physical locations
Different ways of representing data
Supercomputing 2003, UK e-Science Booth
An Example Process
CLEAN
AGGREGATE
RE-FORMAT
JOIN
ANALYSE
RE-FORMAT
CLEAN
13
AGGREGATE
Supercomputing 2003, UK e-Science Booth
Cleaning and Reformatting
One Bus, Many Names
– e.g. Service 25A might be “025A”, “25A”, “25a”
– Sometimes referring to individual depots, and sometimes to
operating regions which may include various depots.
– Furthermore, if data is stored separately for each depot, data
might not explicitly include a reference to a depot – this has to be
added when the data is aggregated
Pre-processing can often be done with SQL after
some initial analysis
– e.g. Create tables with entries corresponding to the depot and
columns containing data on how this depot is labelled in the
different databases.
14
Supercomputing 2003, UK e-Science Booth
Cleaning and Reformatting 2
Pre-processing with SQL (continued)
– Harder for example of service names: Need larger table. Requires
effort, but need only be created once.
– Alternative:
• Read data from database
• Process data with other tools (Perl, SPSS, …)
• Load results to new table in database
15
Supercomputing 2003, UK e-Science Booth
Aggregation
Data can be aggregated in various ways
– e.g. By Service, By Day
SQL can do much of the simple aggregation:
SELECT Service, Region, SUM(Revenue) AS
TotalRevenue
FROM RevenueTable
GROUP BY Service, Region
In practice SQL can be somewhat more
complicated
16
Supercomputing 2003, UK e-Science Booth
Aggregation 2
As before, an alternative is:
– Read data from database
– Aggregate with external program (SPSS, Perl, even Excel)
– Load data back into database
Whether or not this is worth doing depends on
– Availability of Aggregation Functions in database engine
– Extent of processing required: If a database is stored on a small
or heavily-used machine, it may be quicker to export, process,
and import.
17
Supercomputing 2003, UK e-Science Booth
Joins
Can combine data from more than one database:
– Complaints versus Lateness
– Revenue versus Lost Miles
– Complaints versus Lost Miles
Often Joins are on data aggregated in some way:
– By Service
– By Day
Subsets of the data can also be considered
– e.g. no weekends
18
Supercomputing 2003, UK e-Science Booth
Hurdles: Non-Standard SQL
Non-Standard SQL introduces some hurdles for
transparent integration of data
Date Formats:
– No standard data format: DD/MM/YYYY or MM/DD/YY
– No standard date handling functions
– Compare MS Access and mySQL:
SELECT * FROM AccessTable WHERE
IncidentDate BETWEEN #11/30/2000# AND
#11/30/2002#
SELECT * FROM MySQLTable WHERE
IncidentDate BETWEEN '2000-11-30' AND
'20021130'
19
Supercomputing 2003, UK e-Science Booth
Hurdles: How representative
is Data?
Data available for mining can influence results
Representative data required for meaningful
results
Since data is not collected for the purposes of
data mining, it may be incomplete
For example, data might only be collected to
analyse a perceived problem with a particular
route
20
Supercomputing 2003, UK e-Science Booth
Required Datamining Tools
SQL can be used for basic data
analysis but OGSA-DAI doesn’t
replace data mining tools
More complicated data analysis
requires external tools: e.g. C5, Perl,
SPSS, Excel
OGSA-DAI’s use here is to extract
data required for analysis and
deliver it to the system on which
analysis is to be performed in a
useful format
21
Supercomputing 2003, UK e-Science Booth
OGSA-DAI
GRID
OGSA-DAI
Machine
performing
analysis
OGSA-DAI
Supercomputing 2003, UK e-Science Booth
The problem
Access to databases at First
The databases:
–
–
–
–
Are located at different sites
Are hosted on different operating systems
Are not all available via the internal network
Have different DBMS
Require ability to analyse their contents in a
uniform manner and include cross-database
analysis
23
Supercomputing 2003, UK e-Science Booth
The solution
OGSA-DAI
– Open Grid Services Architecture Data Access and Integration
– DAIS-WG at GGF
Grid middleware:
– Assists with the access and integration of data from separate data
sources via the Grid
– Represents databases as Grid Services
– Enables access from other machines in a secure manner
OGSA-DAI Partners
–
–
–
–
24
Funded under UK e-Science Core program
Universities of Edinburgh, Manchester and Newcastle
IBM and Oracle
http://www.ogsadai.org.uk
Supercomputing 2003, UK e-Science Booth
OGSA Data Access and
Integration
Based on Grid Services concept
– Stateful web services with an associated lifetime
– Has a set of behaviours, and conforms to a set of interfaces
through which a client may interact
Three main Grid Services:
– DAI Service Group Registry (DAISGR)
• Holds a list of …
– Grid Data Service Factory (GDSF)
• Associated with a single database
– Grid Data Service (GDS)
• A “session” with a database
25
Supercomputing 2003, UK e-Science Booth
OGSA-DAI typical
interaction 1/3
1.
Client
DAISGR
GDSF
Web Services
Container
26
Supercomputing 2003, UK e-Science Booth
OGSA-DAI typical
interaction 2/3
2.
GDS
Client
GDSF
Database
Web Services
Container
27
Supercomputing 2003, UK e-Science Booth
OGSA-DAI typical
interaction 3/3
3.
Client
GDS
Web Services
Container
28
Supercomputing 2003, UK e-Science Booth
Database
First and OGSA-DAI
Our remit:
– To evaluate the suitability of the use of OGSA-DAI in a
commercial environment
Need to find out if OGSA-DAI:
–
–
–
–
Is appropriate
Is secure
Is straightforward to deploy and use
Does what we need!
Feedback from project goes straight to OGSADAI team
29
Supercomputing 2003, UK e-Science Booth
Progress
Have a test deployment running at EPCC
Using two of the databases identified in the data
analysis WP
– The Customer Contact System
• Microsoft Access
• Information on customer complaints e.g. time, service, nature
– The Mileage database
• dBASE IV
• Information on bus mileage e.g. lost miles
30
Supercomputing 2003, UK e-Science Booth
Issue
OGSA-DAI currently does not officially support
Access or dBASE IV !
However, does support JDBC-accessible
databases
Solution
– Use the Microsoft provided ODBC driver
– Use the Sun provided JDBC-ODBC bridge
31
Supercomputing 2003, UK e-Science Booth
Set up
Using three machines within our firewall
– One to host the CCS database
– One to host the Mileage database
– One to act as the client
Client
DAISGR
GDSF
32
DAISGR
GDSF
Supercomputing 2003, UK e-Science Booth
Limitations
Data type support
– The BIT data type (Yes/No fields)
– The Date format
“Out of range” character codes
– Limitation of XML
Firewalls
– General Grid computing discussion
Usability
– Use of XML can be confusing
33
Supercomputing 2003, UK e-Science Booth
Future Work
Supercomputing 2003, UK e-Science Booth
Future work
Deploy at First
– And test within their network
A client tool
– To improve usability
Additional databases and DBMS
– First have other databases under different DBMS they want to
integrate
Single DAI Service Group Registry
– These databases should be registered centrally
More complex interactions
– Joins across databases …
35
Supercomputing 2003, UK e-Science Booth