Efficient Deployment of Predictive Analytics through Open Standards


Efficient Deployment of Predictive Analytics through Open Standards and Cloud Computing
ACM SIGKDD Explorations
Volume 11, Issue 1, July 2009
Presenter: 黃啟智
Student ID: 69821503
1
Outline
• Introduction
• Interoperability and Open Standards
• Putting Models to Work
• Performance
• Conclusion
2
Introduction
• Deployment and practical application of predictive models:
– Limited choice of options
– Often takes months for models to be integrated and deployed (time-consuming)
– Custom coding or proprietary processes (costly)
• Open standards and Internet-based technologies are available to provide a more effective end-to-end solution for deployment.
3
Introduction
• SOA: Service-Oriented Architecture
– For the design of loosely coupled IT systems (e.g., based on Web Services)
• SaaS: Software-as-a-Service
– A license model
– Vendors deliver software solutions as a cost-effective service
• PMML: Predictive Model Markup Language
– An open standard that allows users to exchange predictive models among various software tools
4
Interoperability and Open Standards
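• Cloud Computing
[Diagram: Cloud Computing (a computing architecture) and its service models SaaS, IaaS, and PaaS, shown together with Web Services and SOA. Labels in the diagram: SOAP, RPC, REST, WSDL, UDDI. Annotations: "SOA-related standards" and "access".]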
5
Interoperability and Open Standards
• Cloud Computing
– Reduces cost and management overhead for IT
– Shift in the geography of computation
– The Internet as a platform
– A set of services that provide computing resources
– A variety of services: storage capacity, processing power, business applications, …
– Cloud infrastructures: Amazon Web Services (AWS), Sector/Sphere, Hadoop, …
– The OCC, Open Cloud Consortium (www.opencloudconsortium.org)
6
Interoperability and Open Standards
• Web Service
http://zh.wikipedia.org
– W3C definition
– Provides the foundation of SOA
– Uses XML to encode and decode data
– Uses the SOAP (Simple Object Access Protocol) standard to transport data
– Data can be easily exchanged between different applications and platforms
– Can be described by a WSDL (Web Services Description Language) file
– UDDI (Universal Description, Discovery, and Integration): a platform-independent, XML-based registry for businesses to list themselves on the Internet
7
Interoperability and Open Standards
• A SOAP request for a PMML file
A JDM (Java Data Mining) call
(The file/model was previously uploaded to the service provider.)
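The request shown on this slide is not recoverable from the transcript, so the following is only a minimal sketch of what building a SOAP 1.1 scoring request could look like. The service namespace, the operation name applyModel, the element names modelName and field, and the model and field values are all hypothetical; they are not taken from JDM or from any particular scoring engine.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "http://example.com/scoring"  # hypothetical service namespace (assumption)


def build_score_request(model_name: str, record: dict) -> bytes:
    """Build a SOAP 1.1 envelope asking a service to score one record."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # "applyModel", "modelName" and "field" are illustrative names only.
    call = ET.SubElement(body, f"{{{SVC_NS}}}applyModel")
    ET.SubElement(call, f"{{{SVC_NS}}}modelName").text = model_name
    for name, value in record.items():
        field = ET.SubElement(call, f"{{{SVC_NS}}}field", {"name": name})
        field.text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Made-up model name and input fields, used only to exercise the function.
request_xml = build_score_request(
    "iris_model", {"sepal_length": 5.1, "petal_width": 0.2}
)
print(request_xml.decode("utf-8"))
```

The request only names the model because, as noted on the slide, the PMML file itself was uploaded to the provider beforehand. A real client would POST this body over HTTPS to the endpoint described by the provider's WSDL file.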
8
Interoperability and Open Standards
• SaaS: Software as a Service
– A license model: users access software via the Internet (they do not actually “buy and install” it)
– Users pay only for the right to use the software for a certain time period (e.g., NT$100 per hour)
– No upfront costs for setting up servers or software
– Minimizes the risk of purchasing costly software that may not provide an adequate return on investment
– E.g., Salesforce.com, Google Apps
9
Interoperability and Open Standards
• PMML: Predictive Model Markup Language
– Developed by the Data Mining Group (www.dmg.org)
– An open standard for representing data mining models
– An XML-based language
– Can describe data preprocessing and predictive algorithms
– Can represent input data and data transformations
10
Interoperability and Open Standards
PMML structure examples (a test data file)
Required (active)data fields
Predicted data field
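The PMML listing on this slide did not survive extraction. As a rough illustration of how active and predicted fields are declared, the sketch below parses a hand-written, abbreviated PMML 4.0 fragment. The field names are made up, and the fragment omits the header and the model body, so it is not a complete, schema-valid document.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated fragment for illustration; field names are made up
# and the Header and the tree nodes of the model are omitted.
PMML_DOC = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="age" usageType="active"/>
      <MiningField name="income" usageType="active"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
  </TreeModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_0"}


def fields_by_usage(pmml_text: str) -> dict:
    """Group the MiningField names of a PMML document by their usageType."""
    root = ET.fromstring(pmml_text)
    usage = {}
    for mining_field in root.iterfind(".//p:MiningSchema/p:MiningField", NS):
        # usageType defaults to "active" when the attribute is absent.
        kind = mining_field.get("usageType", "active")
        usage.setdefault(kind, []).append(mining_field.get("name"))
    return usage


print(fields_by_usage(PMML_DOC))
```

The DataDictionary lists every field and its type, while the MiningSchema states which of those fields a given model needs as input (active) and which one it produces (predicted).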
11
Interoperability and Open Standards
PMML Structure examples
12
Interoperability and Open Standards
PMML Structure examples
Array of counts of different
field values under different
class labels
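To show why a table of counts per field value and class label is enough to score a record, here is a minimal Naive Bayes sketch in plain Python. The counts are toy numbers invented for the example, and the handling of zero counts (substituting a small threshold probability) is a simplifying assumption rather than a reproduction of the model on the slide.

```python
# Toy counts, invented for illustration:
# pair_counts[field][value][class_label] = number of training records.
pair_counts = {
    "outlook": {
        "sunny": {"yes": 2, "no": 3},
        "overcast": {"yes": 4, "no": 0},
        "rainy": {"yes": 3, "no": 2},
    },
    "windy": {
        "true": {"yes": 3, "no": 3},
        "false": {"yes": 6, "no": 2},
    },
}
class_counts = {"yes": 9, "no": 5}


def naive_bayes_scores(pair_counts, class_counts, record, threshold=0.001):
    """Return normalized class probabilities for one record."""
    scores = {}
    for label, n_label in class_counts.items():
        # Start from the class count, which is proportional to the prior.
        score = float(n_label)
        for field, value in record.items():
            count = pair_counts[field].get(value, {}).get(label, 0)
            # Assumption: a zero count is replaced by a small probability
            # so that one unseen combination does not zero out the product.
            score *= count / n_label if count > 0 else threshold
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


probabilities = naive_bayes_scores(
    pair_counts, class_counts, {"outlook": "sunny", "windy": "true"}
)
print(probabilities)
```

For this record the unnormalized scores are 9 × (2/9) × (3/9) for "yes" and 5 × (3/5) × (3/5) for "no", so "no" receives the higher probability.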
13
Interoperability and Open Standards
• PMML model specifics (parameters, architecture) are defined under different model elements, including:
– Neural Networks
– Support Vector Machines
– Regression Models
– Decision Trees
– Association Rules
– Clustering
– Sequences
– Naïve Bayes
– Text Models
– Rules
14
Interoperability and Open Standards
• PMML On-The-Go
– PMML 4.0: time series, boolean data types, model segmentation, lift/gain charts, expanded range of built-in functions, …
– More applications support export and import functionality in PMML
– Open-source environments: KNIME (www.knime.org), the R project (www.R-project.org)
15
Putting Models to Work
• Amazon EC2
– Elastic Compute Cloud
– Powered by Amazon Web Services
• ADAPA scoring engine
– Uses JDM (Java Data Mining) Web Service calls and therefore allows automatic decisions to be virtually embedded into enterprise systems and applications
– Available as a service to minimize total cost
16
Putting Models to Work
• Model Verification and Execution
Typical tasks in the life cycle of a data mining project:
– Building, deploying, testing and using data mining models
(A cross-platform and multi-vendor environment)
17
Putting Models to Work
• Model Verification and Execution
– Model testing/verification
• To ensure that both the scoring engine and the model
development environment produce exactly the same
result
• It allows a test file, containing any number of records with all the necessary input variables and the expected result for each record, to be uploaded for score matching
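Score matching can be sketched as a loop that scores each test record and compares the result with the expected value. The column name expected, the toy scoring function, and the numeric tolerance are assumptions made for this example. The slide asks for exactly the same result, so a deployment might set the tolerance to zero or compare formatted output instead.

```python
import csv
import io
import math


def verify_scores(test_csv, score_fn, expected_column="expected", tolerance=1e-6):
    """Return (row_number, expected, actual) for every record that mismatches."""
    mismatches = []
    reader = csv.DictReader(io.StringIO(test_csv))
    for row_number, row in enumerate(reader, start=1):
        expected = float(row.pop(expected_column))
        inputs = {name: float(value) for name, value in row.items()}
        actual = score_fn(inputs)
        if not math.isclose(actual, expected, rel_tol=0.0, abs_tol=tolerance):
            mismatches.append((row_number, expected, actual))
    return mismatches


def toy_model(inputs):
    """Stand-in for the deployed scoring engine (hypothetical model)."""
    return 2.0 * inputs["x1"] + inputs["x2"]


# The third record carries a deliberately wrong expected value.
TEST_FILE = "x1,x2,expected\n1,2,4\n0.5,1,2\n3,0,5\n"

report = verify_scores(TEST_FILE, toy_model)
print(report)
```

An empty report means the scoring engine reproduces the results of the model development environment for every record in the test file.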
18
Putting Models to Work
• Model Verification and Execution
– Model execution
• Batch mode: via the web console, by uploading a data file containing records (in CSV format, optionally zipped)
• Real-time mode: via web services, using embedded calls (SOAP requests)
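For batch mode, the receiving side has to accept either a plain CSV file or a zipped one. The sketch below shows one way to read both; the assumptions that the archive holds a single CSV file and that the text is UTF-8 encoded are mine, not the slide's, and the sample records are made up.

```python
import csv
import io
import zipfile


def read_records(payload: bytes) -> list:
    """Read CSV records from an upload that is either plain CSV or a zip archive."""
    buffer = io.BytesIO(payload)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            # Assumption: the archive contains exactly one CSV file.
            first_name = archive.namelist()[0]
            text = archive.read(first_name).decode("utf-8")
    else:
        text = payload.decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Made-up records, uploaded once as plain CSV and once zipped in memory.
plain_upload = b"age,income\n35,52000\n61,38000\n"
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("records.csv", plain_upload)
zipped_upload = zip_buffer.getvalue()

print(read_records(plain_upload))
print(read_records(zipped_upload) == read_records(plain_upload))
```

Each returned record is a dictionary keyed by column name, which can then be passed to the scoring function one record at a time.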
19
Putting Models to Work
• Demo: Excel add-in
20
Putting Models to Work
• Demo: Excel add-in
21
Putting Models to Work
• Security on the Cloud
– Uploading proprietary information to a third-party service raises security and control questions
– The engine should not store any data
– An instance shares nothing with other instances
– An instance is private (via authentication)
– Access to an instance is only via HTTPS
– Models and data are deleted after an instance is terminated
22
Performance
Instance type reference: http://aws.amazon.com/ec2/
23
Performance
24
Conclusion
• Cloud computing
It offers a powerful, revolutionary way to put data mining models to work.
• Open standard (PMML)
It lets predictive models be easily accessed from anywhere in the enterprise (via web-service calls or uploaded data files).
• The combination of both accelerates the deployment of predictive models and makes it more affordable.
25
Questions
• Security (transmission via the Internet to a third-party vendor) and privacy
• High-dimensional data / large databases: transmission time + processing time
26