The Grand Challenge of Estimating One Billion Predictive Models
Data Quality Models for High Volume Transaction Streams: A Case Study
Joseph Bugajski
Visa International
Robert Grossman, Chris Curry, David Locke & Steve Vejcik
Open Data Group
The Problem: Detect Significant Changes in Visa’s Payments Network
[Diagram labels: Account, Issuing Bank, Merchant, Acquiring Bank, Visa Payment Network]
• Over 1.59 billion Visa cards in circulation
• 6800 transactions per second (peak)
• Over 20,000 member banks
• Millions of merchants
The Challenge: Payments Data is Highly Heterogeneous
• Variation from cardholder to cardholder
• Variation from merchant to merchant
• Variation from bank to bank
Observe: If Data Were Homogeneous, Could Use Change Detection Model
[Diagram labels: Baseline Model, Observed Model]
Sequence of events x[1], x[2], x[3], …
Question: is the observed distribution different from the baseline distribution?
Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests
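The change detection idea on this slide can be illustrated with a minimal sketch of a one-sided CUSUM test. The slides do not give the statistic or its parameters, so the function name `cusum_alarm`, the `drift` and `threshold` values, and the toy event sequence are all assumptions chosen for illustration. The GLR test is not shown.

```python
def cusum_alarm(events, baseline_mean, drift, threshold):
    """Return the index of the first event at which an upward shift
    from the baseline is signalled, or None if no alarm is raised.

    One-sided CUSUM statistic (illustrative form):
        S[t] = max(0, S[t-1] + x[t] - baseline_mean - drift)
    An alarm is raised when S[t] exceeds `threshold`.
    """
    s = 0.0
    for i, x in enumerate(events):
        s = max(0.0, s + (x - baseline_mean - drift))
        if s > threshold:
            return i
    return None


# Toy data (hypothetical): 20 events near the baseline, then 10 shifted events.
baseline_events = [0.1, -0.1] * 10
shifted_events = [1.0] * 10
events = baseline_events + shifted_events

alarm_index = cusum_alarm(events, baseline_mean=0.0, drift=0.5, threshold=2.0)
print(alarm_index)
```

The `drift` term keeps small fluctuations around the baseline from accumulating, so the statistic only grows once the observed values sit persistently above the baseline.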
Key Idea: Build 10^4+ Models, One for Each Cell in Data Cube
• Build separate model for each bank (1000+)
• Build separate model for each geographical region (6 regions)
• Build separate model for each different type of merchant (c. 800 types of merchants)
• For each distinct cube, establish separate baselines for each metric of interest (declines, etc.)
• Detect changes from baselines
[Figure: data cube with axes Bank, Geospatial region, Type of Transaction; 20,000+ separate baselines. Caption: Modeling using Cubes of Models (MCM)]
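One way to picture the cube of models is a dictionary of baselines keyed by cell. The sketch below is a rough illustration, not the Augustus implementation: the field names (`bank`, `region`, `merchant_type`, `declined`), the helper names, and the fixed `tolerance` comparison are assumptions. The comparison is a simple stand-in for the CUSUM and GLR tests named earlier.

```python
from collections import defaultdict


def build_baselines(transactions):
    """Estimate a decline-rate baseline for each (bank, region, merchant_type) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [declines, total]
    for t in transactions:
        cell = (t["bank"], t["region"], t["merchant_type"])
        counts[cell][0] += 1 if t["declined"] else 0
        counts[cell][1] += 1
    return {cell: declines / total for cell, (declines, total) in counts.items()}


def changed_cells(baselines, observed, tolerance):
    """Return the cells whose observed rate differs from its baseline by more than tolerance."""
    return sorted(
        cell
        for cell, rate in observed.items()
        if cell in baselines and abs(rate - baselines[cell]) > tolerance
    )


def txn(bank, declined):
    # Hypothetical record layout; region and merchant type are fixed for brevity.
    return {"bank": bank, "region": "region_1", "merchant_type": "grocery",
            "declined": declined}


history = [txn("bank_a", d) for d in (True, False, False, False)] + \
          [txn("bank_b", d) for d in (False, False)]
recent = [txn("bank_a", d) for d in (True, True, True, False)] + \
         [txn("bank_b", d) for d in (False, False)]

baselines = build_baselines(history)
alerts = changed_cells(baselines, build_baselines(recent), tolerance=0.2)
print(alerts)
```

Because each cell carries its own baseline, a decline rate that is normal for one bank or merchant type does not mask or trigger an alert for another.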
Greedy Meaningful/Manageable Balancing (GMMB) Algorithm
[Figure labels: Breakpoint; More alerts; Fewer alerts; Alerts more meaningful; Alerts more manageable; One model for each cell in data cube]
• To increase alerts, add breakpoint to split cubes, order by number of new alerts, & select one or more new breakpoints
• To decrease alerts, remove breakpoint, order by number of decreased alerts, & select one or more breakpoints to remove
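The slides describe the GMMB steps only in outline, so the following is one possible reading of the greedy loop rather than the authors' algorithm. The functions `add_breakpoints` and `remove_breakpoints`, the `count_alerts` callback, the `target` alert volume, and the toy alert counts are all hypothetical.

```python
def add_breakpoints(current, candidates, count_alerts, target):
    """Greedily add the breakpoint that yields the most alerts until
    the alert count reaches `target` or no candidate helps."""
    current = set(current)
    remaining = set(candidates) - current
    while remaining and count_alerts(current) < target:
        best = max(sorted(remaining), key=lambda b: count_alerts(current | {b}))
        if count_alerts(current | {best}) <= count_alerts(current):
            break  # no candidate adds alerts
        current.add(best)
        remaining.remove(best)
    return current


def remove_breakpoints(current, count_alerts, target):
    """Greedily remove the breakpoint whose removal cuts the most alerts
    until the alert count is at or below `target`."""
    current = set(current)
    while current and count_alerts(current) > target:
        best = min(sorted(current), key=lambda b: count_alerts(current - {b}))
        if count_alerts(current - {best}) >= count_alerts(current):
            break  # no removal reduces alerts
        current.remove(best)
    return current


# Toy alert model (hypothetical): each breakpoint contributes a fixed number of alerts.
GAINS = {"bank": 30, "merchant_type": 12, "region": 5}


def toy_count_alerts(breakpoints):
    return 3 + sum(GAINS[b] for b in breakpoints)


more = add_breakpoints(set(), GAINS, toy_count_alerts, target=40)
fewer = remove_breakpoints(set(GAINS), toy_count_alerts, target=20)
print(sorted(more), sorted(fewer))
```

In practice `count_alerts` would re-score historical data against the segmentation implied by the breakpoints. The additive toy model here only serves to make the greedy ordering visible.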
Augustus
Open source Augustus data mining platform was used to:
• Estimate baselines for over 15,000 separate segmented models
• Score high volume operational data and issue alerts for follow up investigations
Augustus is PMML compliant
Augustus scales with
• Volume of data (Terabytes)
• Real time transaction streams (15,000/sec+)
• Number of segmented models (10,000+)
Some Results to Date
System has been operational for 2.5 years
ROI
– Year 1 (over 6 months): 5.1x
– Year 2 (12 months): 7.3x
– Year 3 (12 months): 10.0x
Currently estimating over 15,000 individual baseline models
The system has issued alerts for:
• Merchants using incorrect Merchant Category Code (MCC) - account testing
• Sales channel variations
• Incorrect use of merchant city name field
• Incorrect coding of recurring payments
Summary
Used new methodology (Modeling using Cubes of Models) for modeling large, highly heterogeneous data sets
This project contributed in part to the development of a Baseline Model in the open source Augustus system
Integrated system to generate alerts using Baseline Models with manual investigation process
Project is generating over 10x ROI
Poster #20 in the Tuesday night Poster Session.