CS490D: Introduction to Data Mining Prof. Chris Clifton

April 14, 2004
Fraud and Misuse Detection
What is Fraud Detection?
• Identify wrongful actions
– Is right and wrong universal?
– If so, why not just prevent wrong actions?
• Identify actions by the wrong people
• Identify suspect actions
– Legal
– But probably not right
In Data Mining terms…
• Classification?
– Classify into fraudulent and non-fraudulent behavior
– What do we need to do this?
• Outlier Detection
– Assume non-fraudulent behavior is normal
– Find the exceptions
• Problems?
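The outlier-detection view can be sketched as follows, assuming "normal" is summarized by the mean and standard deviation of past transaction amounts. The function name `find_outliers`, the threshold, and the sample amounts are hypothetical and only illustrate the idea.

```python
from statistics import mean, stdev

def find_outliers(amounts, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        # All values identical: nothing stands out.
        return []
    return [x for x in amounts if abs(x - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
outliers = find_outliers(amounts)
print(outliers)
```

One problem this makes visible: a single population-wide threshold treats every user the same, so behavior that is normal for one user can look anomalous for another.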
Solution: Differential Profiling
• Determine individual behavior
– What is normal for the individual
– What separates one individual from another
• Gives profile of individual behavior
• How do we do this?
[Figure: sets of + and – examples feed "Classification Mining", which outputs one "Profile" per set]
Has this been done?
Intrusion Detection (Lane & Brodley)
• Profiled computer users based on command sequences
– Command
– Some (but not all) argument information
– Sequence information
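A minimal sketch of profiling a user from command sequences, assuming the profile is simply the set of command bigrams seen in past sessions. This is not the similarity measure Lane and Brodley used; the function names, the bigram choice, and the sample sessions are hypothetical, and argument information is left out.

```python
def ngrams(commands, n=2):
    """Set of consecutive command n-grams in one session."""
    return {tuple(commands[i:i + n]) for i in range(len(commands) - n + 1)}

def build_profile(sessions, n=2):
    """Union of the n-grams observed across a user's past sessions."""
    profile = set()
    for session in sessions:
        profile |= ngrams(session, n)
    return profile

def match_score(profile, session, n=2):
    """Fraction of the session's n-grams that already appear in the profile."""
    grams = ngrams(session, n)
    if not grams:
        return 0.0
    return len(grams & profile) / len(grams)

# Hypothetical command histories for one user.
history = [
    ["ls", "cd", "ls", "vi", "make"],
    ["cd", "ls", "vi", "make", "gdb"],
]
profile = build_profile(history)
usual = match_score(profile, ["cd", "ls", "vi", "make"])
unusual = match_score(profile, ["wget", "chmod", "sh", "rm"])
print(usual, unusual)
```

A low score on a new session suggests that someone other than the profiled user may be typing; where to set the alarm threshold trades accuracy against time to alarm.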
Results
[Figures: Accuracy; Time to Alarm]
Scaling Issues
• What happens with millions of users?
– Credit card
– Cell phone
• What about new users?
• Ideas?
Multi-user profiles
• Cluster users
• Develop profiles for clusters
– E.g., differential profiling
• Old customers: Do they match profile for their cluster?
– Allows wider range of acceptable behavior
• New customer: Do they match any profile?
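The two checks above can be sketched as follows, assuming users have already been clustered and each cluster profile is a centroid plus an acceptance radius. The feature choice, the `slack` factor, and all names are hypothetical.

```python
from math import dist

def cluster_profile(points, slack=1.5):
    """Profile = centroid plus a radius of slack * the largest member distance."""
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    radius = slack * max(dist(p, centroid) for p in points)
    return centroid, radius

def matches(profile, behavior):
    """Old customer check: is the behavior within the cluster's radius?"""
    centroid, radius = profile
    return dist(behavior, centroid) <= radius

def matches_any(profiles, behavior):
    """New customer check: does the behavior fit at least one cluster profile?"""
    return any(matches(p, behavior) for p in profiles.values())

# Hypothetical features: (calls per day, average call minutes).
clusters = {
    "light": [(2, 3), (3, 2), (2, 2), (3, 3)],
    "heavy": [(20, 10), (22, 12), (18, 11), (20, 11)],
}
profiles = {name: cluster_profile(members) for name, members in clusters.items()}

print(matches(profiles["light"], (3, 2.5)))   # old "light" customer, usual day
print(matches_any(profiles, (60, 1)))         # new customer unlike any cluster
```

Because the radius comes from the whole cluster rather than one user, an old customer gets a wider range of acceptable behavior than a per-user profile would allow.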
Data mining for detection and prevention
Matching known fraud/non-compliance
• Which new cases are similar to known cases?
• How can we define similarity?
• How can we rate or score similarity?
Anomalies and irregularities
• How can we detect anomalous or unusual behavior?
• What do we mean by usual?
• Can we rate or score cases on their degree of anomaly?
Techniques used to identify fraud
Predict and Classify
– Regression algorithms (predict numeric outcome): neural networks, CART, regression, GLM
– Classification algorithms (predict symbolic outcome): CART, C5.0, logistic regression
Group and Find Associations
– Clustering/Grouping algorithms: K-means, Kohonen, 2Step, Factor analysis
– Association algorithms: apriori, GRI, Capri, Sequence
Techniques for finding fraud:
• Predict the expected value for a claim and compare it with the actual value of the claim.
• Cases that fall far outside the expected range should be evaluated more closely.
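A minimal sketch of the expected-versus-actual comparison, assuming a single numeric predictor and an ordinary least-squares line as the model. The feature (days of treatment), the tolerance, and the data are hypothetical; any of the regression algorithms listed earlier could take the place of the line fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_claims(history, new_claims, tolerance):
    """Return new claims whose amount is more than `tolerance` from the prediction."""
    xs, ys = zip(*history)
    slope, intercept = fit_line(xs, ys)
    flagged = []
    for days, amount in new_claims:
        expected = slope * days + intercept
        if abs(amount - expected) > tolerance:
            flagged.append((days, amount))
    return flagged

# Hypothetical (days of treatment, claim amount) pairs.
history = [(1, 1000), (2, 2000), (3, 3000), (4, 4000)]
new_claims = [(2, 2100), (3, 9000)]
flagged = flag_claims(history, new_claims, tolerance=500)
print(flagged)
```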
Techniques for finding fraud:
Decision Trees and Rules
• Build a profile of the characteristics of fraudulent behavior.
• Pull out the cases that meet the historical characteristics of fraud.
Techniques for finding fraud:
Clustering and Associations
• Group behavior using a clustering algorithm
• Find groups of events using the association algorithms
• Identify outliers and investigate
Fraud detection using CRISP-DM
• Provides a systematic way to detect fraud and abuse
• Ensures auditing and investigative efforts are maximized
• Continually assesses and updates models to identify new emerging fraud patterns
• Leads to higher recoupments
Data mining in action: Fraud, waste and abuse case studies
Payment Error Prevention
The US Health Care Financing Administration needed to isolate the likely causes of payment error by developing a profile of acceptable billing practices and...
…used this information to focus their auditing effort
Payment error prevention solution
• Clementine™
• Using audited discharge records, built profiles of appropriate decisions such as diagnosis coding and admission
• Matched new cases
• Cases not matching are audited
Payment error prevention results
• Detected 50% of past incorrect payments, resulting in significant recovery of funding lost to payment errors
• PRO analysts are able to use the resulting Clementine models to prevent future errors
Billing and payment fraud
The US Defense Finance and Accounting Service needed to find fraud in millions of Dept of Defense transactions and...
Identified suspicious cases to focus investigations
Billing and payment fraud solution
• Clementine
• Detection models based on known fraud patterns
• Analyzed all transactions – scored based on similarity to these known patterns
• High scoring transactions flagged for investigation
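Scoring by similarity to known patterns can be sketched as follows, assuming each transaction and each fraud pattern is a set of categorical attributes compared with Jaccard similarity. This is not how Clementine or the case study modeled it; the attribute names, patterns, and the 0.6 cutoff are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def fraud_score(transaction, known_patterns):
    """Score a transaction by its best match against any known fraud pattern."""
    return max(jaccard(transaction, pattern) for pattern in known_patterns)

# Hypothetical fraud patterns and transactions.
known_patterns = [
    {"new_vendor", "round_amount", "weekend"},
    {"duplicate_invoice", "po_box_address"},
]
transactions = {
    "T1": {"new_vendor", "round_amount", "weekend"},
    "T2": {"established_vendor", "weekday"},
    "T3": {"duplicate_invoice", "po_box_address", "weekday"},
}

scores = {tid: fraud_score(t, known_patterns) for tid, t in transactions.items()}
# Flag high-scoring transactions for investigation.
flagged = sorted(tid for tid, s in scores.items() if s >= 0.6)
print(flagged)
```

A score of this kind only finds fraud that resembles past fraud, so it does not cover previously undetected patterns.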
Billing and payment fraud results
• Identified over 1,200 payments for further investigation
• Integrated the detection process
• Anomaly detection methods (e.g., clustering) will serve as ‘sentinel’ systems for previously undetected fraud patterns
Audit selection
The Washington State Department of Revenue needed to detect erroneous tax returns and...
Focused audit investigations on cases with the highest likely adjustments
Audit selection solution
• Clementine
• Using previously audited returns
• Model adjustment (recovery) per auditor hour based on return information
• Models will then score future returns showing highest potential adjustment
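A minimal sketch of this kind of audit selection, assuming the "model" is just the average recovery per auditor hour for each industry code in previously audited returns. The real models were built in Clementine from richer return information; the field names and figures here are hypothetical.

```python
from collections import defaultdict

def fit_yield_model(audited):
    """Average recovery per auditor hour for each industry in audited returns."""
    totals = defaultdict(lambda: [0.0, 0.0])  # industry -> [recovery, hours]
    for r in audited:
        totals[r["industry"]][0] += r["recovery"]
        totals[r["industry"]][1] += r["hours"]
    return {k: rec / hrs for k, (rec, hrs) in totals.items() if hrs > 0}

def rank_for_audit(returns, model):
    """Order new returns by predicted recovery per hour, highest first."""
    return sorted(returns, key=lambda r: model.get(r["industry"], 0.0), reverse=True)

# Hypothetical previously audited returns.
audited = [
    {"industry": "retail", "recovery": 1000.0, "hours": 10.0},
    {"industry": "retail", "recovery": 3000.0, "hours": 10.0},
    {"industry": "construction", "recovery": 9000.0, "hours": 20.0},
]
model = fit_yield_model(audited)

new_returns = [
    {"id": "R1", "industry": "retail"},
    {"id": "R2", "industry": "construction"},
    {"id": "R3", "industry": "farming"},  # unseen industry scores 0.0
]
ranked = [r["id"] for r in rank_for_audit(new_returns, model)]
print(ranked)
```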
Audit selection results
• Maximizes auditors’ time by focusing on cases likely to yield the highest return
• Closes the ‘tax gap’
Data mining - key to detecting and preventing fraud, waste and abuse
• Learn from the past
– High-quality, evidence-based decisions
• Predict
– Prevent future instances
• React to changing circumstances
– Models kept current, from latest data
