The Grand Challenge of Estimating One Billion Predictive Models


Transcript: The Grand Challenge of Estimating One Billion Predictive Models

Data Quality Models for High Volume Transaction Streams: A Case Study
Joseph Bugajski
Visa International
Robert Grossman, Chris Curry, David Locke & Steve Vejcik
Open Data Group
The Problem: Detect Significant Changes in Visa’s Payments Network
[Diagram: Account, Issuing Bank, Merchant, Acquiring Bank, and the Visa Payment Network]
• Over 1.59 billion Visa cards in circulation
• 6800 transactions per second (peak)
• Over 20,000 member banks
• Millions of merchants
The Challenge: Payments Data is Highly Heterogeneous
• Variation from cardholder to cardholder
• Variation from merchant to merchant
• Variation from bank to bank
Observe: If Data Were Homogeneous, Could Use Change Detection Model
[Diagram: a baseline model compared with an observed model]
• Sequence of events x[1], x[2], x[3], …
• Question: is the observed distribution different from the baseline distribution?
• Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests
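
As a rough illustration of the kind of change detection the slide refers to, here is a minimal one-sided CUSUM sketch in Python. The function name, the drift and threshold values, and the example stream are made up for illustration; they are not taken from the Visa system.

def cusum_alerts(observations, baseline_mean, drift=0.5, threshold=5.0):
    # One-sided CUSUM for an upward shift: accumulate evidence that the
    # observed level exceeds the baseline mean by more than the drift allowance.
    s = 0.0
    for i, x in enumerate(observations):
        s = max(0.0, s + (x - baseline_mean - drift))
        if s > threshold:
            yield i      # index at which the change is flagged
            s = 0.0      # reset the statistic after raising an alert

# Example: a stream whose level shifts upward halfway through.
stream = [10.1, 9.8, 10.3, 10.0, 12.9, 13.2, 13.5, 13.1]
print(list(cusum_alerts(stream, baseline_mean=10.0)))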
Key Idea: Build 10^4+ Models, One for Each Cell in Data Cube
[Diagram: data cube with dimensions Bank, Geospatial region, and Type of Transaction; callout: 20,000+ baselines; label: Modeling using Cubes of Models (MCM)]
• Build separate model for each separate bank (1000+)
• Build separate model for each geographical region (6 regions)
• Build separate model for each different type of merchant (c. 800 types of merchants)
• For each distinct cube, establish separate baselines for each metric of interest (declines, etc.)
• Detect changes from baselines
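
A minimal sketch of the "one model per cube cell" idea: baselines are kept in a dictionary keyed by the cube coordinates named on the slide (bank, geographic region, merchant type). The field names, the CellBaseline class, and the running-mean statistic are assumptions made for illustration, not the actual schema or models used.

from collections import defaultdict

class CellBaseline:
    # Running mean for one metric (e.g. a decline rate) in one cube cell.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

# One baseline per (bank, region, merchant type) cell of the data cube.
baselines = defaultdict(CellBaseline)

def cube_cell(txn):
    # Hypothetical field names for the three cube dimensions on the slide.
    return (txn["bank_id"], txn["region"], txn["merchant_type"])

def estimate_baselines(historical_txns, metric="decline"):
    for txn in historical_txns:
        baselines[cube_cell(txn)].update(txn[metric])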
Greedy Meaningful/Manageable Balancing (GMMB) Algorithm
[Diagram: a breakpoint separating "More alerts" from "Fewer alerts", "Alerts more meaningful", "Alerts more manageable"; one model for each cell in data cube]
• To increase alerts, add breakpoint to split cubes, order by number of new alerts, & select one or more new breakpoints
• To decrease alerts, remove breakpoint, order by number of decreased alerts, & select one or more breakpoints to remove
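
The GMMB step on the slide can be read as a greedy add/remove move driven by an alert budget. The sketch below is only an interpretation under that reading: the estimator callbacks (estimate_new_alerts, estimate_removed_alerts) and the target band are hypothetical placeholders, not the published algorithm.

def balance_breakpoints(breakpoints, candidates, alert_count,
                        target_low, target_high,
                        estimate_new_alerts, estimate_removed_alerts):
    # One greedy step. breakpoints: currently active splits of the cube;
    # candidates: splits that could still be added. Both are sets.
    if alert_count < target_low:
        # Too few alerts: rank candidate breakpoints by the number of new
        # alerts splitting on them would generate, and adopt the best one.
        best = max(candidates, key=estimate_new_alerts, default=None)
        if best is not None:
            candidates.discard(best)
            breakpoints.add(best)
    elif alert_count > target_high:
        # Too many alerts: rank active breakpoints by the number of alerts
        # their removal would eliminate, and drop the most effective one.
        best = max(breakpoints, key=estimate_removed_alerts, default=None)
        if best is not None:
            breakpoints.discard(best)
            candidates.add(best)
    return breakpoints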
Augustus
• The open source Augustus data mining platform was used to:
  • Estimate baselines for over 15,000 separate segmented models
  • Score high volume operational data and issue alerts for follow-up investigations
• Augustus is PMML compliant
• Augustus scales with:
  • Volume of data (Terabytes)
  • Real time transaction streams (15,000/sec+)
  • Number of segmented models (10,000+)
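
Augustus itself is configured through PMML model files rather than called like this; the snippet below is only a generic, hypothetical illustration of the scoring side described in the bullets, where per-segment observations are compared against their baselines and alert records are emitted. The segment keys, the z-score rule, and the threshold are assumptions for illustration.

def score_stream(observations, baselines, threshold=4.0):
    # observations: iterable of (segment_key, metric_value) pairs, e.g. hourly
    # decline rates per cube cell; baselines: segment_key -> (mean, std).
    for key, value in observations:
        if key not in baselines:
            continue                      # no baseline estimated for this cell
        mean, std = baselines[key]
        if std > 0 and abs(value - mean) / std > threshold:
            # Emit an alert record for manual follow-up investigation.
            yield {"segment": key, "observed": value, "baseline": mean}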
Some Results to Date
• System has been operational for 2.5 years
• ROI
  – 5.1x in Year 1 (over 6 months)
  – 7.3x in Year 2 (12 months)
  – 10.0x in Year 3 (12 months)
• Currently estimating over 15,000 individual baseline models
• The system has issued alerts for:
  • Merchants using incorrect Merchant Category Code (MCC) - account testing
  • Sales channel variations
  • Incorrect use of merchant city name field
  • Incorrect coding of recurring payments
Summary
• Used new methodology (Modeling using Cubes of Models) for modeling large, highly heterogeneous data sets
• This project contributed in part to the development of a Baseline Model in the open source Augustus system
• Integrated system to generate alerts using Baseline Models with manual investigation process
• Project is generating over 10x ROI
• Poster #20 in the Tuesday night Poster Session.