Transcript Slide 1

Multivariate Detection of
Aberrant Billing: An Evaluation
Maharaj Singh, Ted Wallace & Martin Schrager
National Government Services, Inc.
Outline of the study
• Outlier defined
• Multivariate method for detecting
outliers
• Detecting outlier billing providers
• An evaluation of the methodology
• Other factors considered for future
application
• Conclusion
2
An Outlier
• An outlier is not necessarily an
‘outlier’. Maybe we haven’t yet
found the ‘right’ distribution.
3
Multivariate Method
• We used the Mahalanobis
distance as a multivariate
measure computed for
each observation in the
data set.
4
Mahalanobis Distance
• Mahalanobis distance is a distance
measure introduced by Prasanta
Chandra Mahalanobis in 1936.
• It is based on the correlations between
variables, by which different patterns
can be identified and analyzed.
• It is a useful way of determining the
similarity of an unknown sample set to
a known one.
5
Mahalanobis distance
• Mahalanobis distance
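For reference, the standard definition of the squared Mahalanobis distance is given below. The symbols are notation assumed here, not taken from the slide: x is an observation's vector of variables, x̄ is the sample mean vector (centroid), and S is the sample covariance matrix.

```latex
D^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^{\top}\,\mathbf{S}^{-1}\,(\mathbf{x} - \bar{\mathbf{x}}),
\qquad
D(\mathbf{x}) = \sqrt{D^2(\mathbf{x})}
```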
6
Mahalanobis distance and
Multivariate Outliers
• Mahalanobis D² is a
multidimensional version of a z-score.
It measures the distance of
a case from the centroid
(multidimensional mean) of a
distribution, given the covariance
(multidimensional variance) of
the distribution.
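A minimal sketch of this computation, assuming NumPy is available. The function name `mahalanobis_d2` and the toy data are illustrative only and are not part of the study.

```python
import numpy as np

def mahalanobis_d2(X):
    """Squared Mahalanobis distance of each row of X from the sample centroid."""
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)                 # multidimensional mean
    cov = np.cov(X, rowvar=False)             # sample covariance (n - 1 denominator)
    diff = X - centroid
    # Solve cov @ z = diff.T instead of forming the explicit inverse.
    z = np.linalg.solve(cov, diff.T)
    return np.einsum("ij,ji->i", diff, z)

# Toy data: two variables that move together, except for the last row (made-up values).
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 1.0]])
d2 = mahalanobis_d2(X)
```

The last row is unremarkable on either variable taken alone, but it breaks the correlation pattern, so it receives the largest D².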
7
D²
• A case is a multivariate outlier if the
probability associated with its D² is 0.05
or less. D² follows a chi-square
distribution with degrees of freedom
equal to the number of variables included
in the calculation.
• Mahalanobis D² requires that the
variables be metric, i.e. interval level or
ordinal level variables that are treated as
metric.
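The decision rule above can be sketched with the standard library alone. The helper names `chi2_sf` and `is_outlier` are hypothetical. The tail probability is computed from a series for the incomplete gamma function, which is adequate for moderate D² values; in practice a statistics package (for example `scipy.stats.chi2.sf`) would usually be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series expansion of the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return min(1.0, max(0.0, 1.0 - lower))

def is_outlier(d2, n_vars, alpha=0.05):
    """Flag a case whose D² has a tail probability of alpha or less."""
    return chi2_sf(d2, n_vars) <= alpha
```

With three variables, for instance, `is_outlier(12.0, 3)` flags the case, while `is_outlier(2.0, 3)` does not.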
8
Mahalanobis Distance from Ellipsoid
• Mahalanobis distance measure is based
on correlations among the variables by
which different patterns can be
identified and analyzed.
• The region of constant Mahalanobis
distance around the mean forms an
ellipsoid when more than two variables
are used.
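In symbols, using notation assumed here (x̄ for the mean vector, S for the covariance matrix, and λᵢ, eᵢ for the eigenvalues and eigenvectors of S), the set of points at constant distance c is:

```latex
\left\{ \mathbf{x} : (\mathbf{x} - \bar{\mathbf{x}})^{\top}\,\mathbf{S}^{-1}\,(\mathbf{x} - \bar{\mathbf{x}}) = c^2 \right\}
```

This is an ellipse in two dimensions and an ellipsoid in higher dimensions, centered at x̄, with axes along the eigenvectors eᵢ and half-lengths c√λᵢ.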
9
Multivariate trimming …
• The χ² plot for multivariate data is not resistant to the
effect of outliers.
• A few discrepant observations can affect the mean
vector, and can potentially influence the outcome.
• To avoid the effect of a few discrepant
observations, we used multivariate trimming: an
iterative process in which the observations with the
largest squared distance are set aside and the
trimmed statistics are computed from the remaining
observations.
• At the end of this iterative process, the new squared
distance values are computed using the robust
statistics.
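One way the trimming loop could look is sketched below, assuming NumPy. The slides do not give the trimming fraction, stopping rule or iteration limit, so `trim_fraction`, `max_iter` and the function names are assumptions for illustration only.

```python
import numpy as np

def _d2(X, mean, cov):
    """Squared Mahalanobis distance of every row of X for a given mean and covariance."""
    diff = X - mean
    return np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))

def trimmed_d2(X, trim_fraction=0.1, max_iter=20):
    """Iteratively set aside the most distant rows and recompute mean and covariance."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_keep = n - int(trim_fraction * n)
    keep = np.arange(n)                       # start from all observations
    for _ in range(max_iter):
        d2 = _d2(X, X[keep].mean(axis=0), np.cov(X[keep], rowvar=False))
        new_keep = np.sort(np.argsort(d2)[:n_keep])
        if np.array_equal(new_keep, keep):    # retained set is stable: stop
            break
        keep = new_keep
    # Final distances for all observations, using the trimmed (robust) statistics.
    d2 = _d2(X, X[keep].mean(axis=0), np.cov(X[keep], rowvar=False))
    return d2, keep

# Synthetic example: 100 ordinary points plus 5 planted discrepant points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), np.full((5, 2), 8.0)])
d2_robust, keep = trimmed_d2(X, trim_fraction=0.1)
```

Because the planted points no longer contribute to the mean vector or covariance, their robust distances are much larger than distances computed from the untrimmed statistics.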
10
Chi-square (χ²) plot for the dataset
11
The Data Set: Paid Claims
12
Indices for the ID of
observations
– Billing providers
• Location
• Size
• Specialty
– HCPCS used
– Primary diagnoses
13
Matrix of Utilization Variables: Amount
• Cost
– Charges Billed
– Charges denied
– Reimbursement
14
Matrix of Utilization Variables: Rate
• Rate
– Reimbursement per
beneficiary
– Service Units per beneficiary
– Service units per service
dates per beneficiary
15
Matrix of Utilization Variables: Volume
• Volume
– Number of claims
– Number of beneficiaries
– Number of service units
rendered
– Number of service days
16
The cost: Medicare trust
fund $$$
• For each observation the amount
paid is a function of the rate and the
volume.
• However for each observation Id,
the rate and volume variables are
also highly inter-correlated.
17
Methodology for Paid Claim
Data Set
18
Data Steps
• The line-level (detailed) paid claim data were
summarized by ID (provider-HCPCS
combination), producing a summary of the utilization
variables (cost, rate and volume) for each ID.
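The summarization step can be sketched in plain Python as follows. All field names and records are hypothetical, and the last rate variable reflects only one possible reading of "service units per service dates per beneficiary".

```python
from collections import defaultdict

# Hypothetical line-level paid claim records; field names and values are illustrative.
claim_lines = [
    {"provider": "P1", "hcpcs": "99213", "claim": "C1", "bene": "B1",
     "service_date": "2007-10-01", "units": 1, "billed": 100.0, "denied": 40.0, "paid": 60.0},
    {"provider": "P1", "hcpcs": "99213", "claim": "C2", "bene": "B2",
     "service_date": "2007-10-02", "units": 2, "billed": 200.0, "denied": 80.0, "paid": 120.0},
    {"provider": "P2", "hcpcs": "99213", "claim": "C3", "bene": "B3",
     "service_date": "2007-10-01", "units": 4, "billed": 400.0, "denied": 0.0, "paid": 400.0},
]

def summarize(lines):
    """Roll line-level claims up to one record per (provider, HCPCS) id."""
    groups = defaultdict(list)
    for line in lines:
        groups[(line["provider"], line["hcpcs"])].append(line)

    summary = {}
    for key, rows in groups.items():
        benes = len({r["bene"] for r in rows})
        days = len({r["service_date"] for r in rows})
        units = sum(r["units"] for r in rows)
        paid = sum(r["paid"] for r in rows)
        summary[key] = {
            # Amount variables
            "charges_billed": sum(r["billed"] for r in rows),
            "charges_denied": sum(r["denied"] for r in rows),
            "reimbursement": paid,
            # Volume variables
            "claims": len({r["claim"] for r in rows}),
            "beneficiaries": benes,
            "service_units": units,
            "service_days": days,
            # Rate variables
            "paid_per_bene": paid / benes,
            "units_per_bene": units / benes,
            "units_per_day_per_bene": units / days / benes,
        }
    return summary

summary = summarize(claim_lines)
```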
19
Principal Components
• The variables in the matrix
of the paid claims dataset
were converted into
principal components.
• The squared distance was
computed as the sum of
squared standardized principal
component scores.
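The link between principal components and the squared distance can be checked numerically. The sketch below assumes NumPy and synthetic data. It relies on the fact that, when each principal component score is divided by the square root of its eigenvalue, the sum of squared scores equals the Mahalanobis D² computed directly.

```python
import numpy as np

# Synthetic correlated data (illustrative only).
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.2], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = rng.normal(size=(50, 3)) @ mix

centered = X - X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Principal components: eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)
scores = centered @ eigvecs                  # principal component scores
std_scores = scores / np.sqrt(eigvals)       # scale each component to unit variance
d2_pca = (std_scores ** 2).sum(axis=1)

# Direct computation for comparison.
d2_direct = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(cov), centered)
```

Working with principal components also sidesteps the high inter-correlation among the rate and volume variables, since the components are uncorrelated by construction.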
20
Multivariate Trimming
• The iterative process of multivariate trimming
was used.
21
D² and Expected Chi-Square (χ²)
Value
• For each squared distance, the
corresponding expected chi-square
value and its probability
were computed.
22
Outlier Observations
• The observations with probability <
.05 are treated as outliers and are
flagged.
• The flagged observations are treated
as candidates for probe by medical
review and/or treated as potential
CERT errors and referred to the
Provider Education Unit.
23
Outlier Observations
Prioritized
• Finally, the outlier observations were
prioritized by the magnitude of the distance
measure, the expected chi-square value and the
probability associated with the measure.
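A small sketch of such a prioritization, with made-up flagged observations. The ids and values are hypothetical, and the p-values shown correspond to 2 degrees of freedom. The larger the squared distance (equivalently, the smaller the probability), the higher the priority.

```python
# Hypothetical flagged observations: id is a (provider, HCPCS) pair.
flagged = [
    {"id": ("P1", "99213"), "d2": 14.2, "p": 0.0008},
    {"id": ("P2", "A0428"), "d2": 9.1, "p": 0.0106},
    {"id": ("P3", "99214"), "d2": 21.7, "p": 0.00002},
]

# Rank by descending squared distance, breaking ties by ascending probability.
ranked = sorted(flagged, key=lambda obs: (-obs["d2"], obs["p"]))
priority = {obs["id"]: rank for rank, obs in enumerate(ranked, start=1)}
```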
24
Evaluation of Outlier
Classification
• Once each observation in the
dataset had been classified as an
outlier or non-outlier using the
chi-square distribution, we used
logistic regression to estimate
the goodness of fit of
the model.
25
C Statistics
• To find out how accurately we were
able to identify the outlier observations, we used
the c statistic.
26
C Statistics…
• The value of the c statistic
varies from 0.5 (equivalent to randomly
assigning observations to one or the other
category) to 1.0 (all
observations correctly
assigned to their categories).
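The c statistic can be computed from predicted probabilities by comparing every outlier/non-outlier pair. The sketch below uses the common definition (concordant pairs plus half of the tied pairs, divided by all pairs); the function name and sample values are illustrative.

```python
def c_statistic(probs, labels):
    """Concordance (c) statistic from predicted probabilities and 0/1 labels."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    concordant = discordant = tied = 0
    for p1 in pos:
        for p0 in neg:
            if p1 > p0:
                concordant += 1      # outlier scored higher than non-outlier
            elif p1 < p0:
                discordant += 1
            else:
                tied += 1
    pairs = len(pos) * len(neg)
    return (concordant + 0.5 * tied) / pairs

# Made-up predicted probabilities and true classes (1 = outlier).
c = c_statistic([0.9, 0.8, 0.3, 0.3, 0.1], [1, 1, 1, 0, 0])
```

The pairwise loop is quadratic in the number of observations, so it serves only to illustrate the definition; for a data set of this size a statistical package would compute the statistic far more efficiently.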
27
An example from Paid
Claim Dataset
28
Think about Some
Random Numbers??
4 - 100 - 40 - 60
29
Outlier Proportion of NGS
Utilization
• Provider HCPC Lines: 3.46%
• Provider Counts: 99.49%
• Total Reimbursement: 37.90%
• Total Units: 58.78%
• Provider HCPC Benes: 44.21%
• Provider HCPC Claims: 48.21%
30
Model Evaluation
• The outlier model for NGS data was evaluated using a
goodness-of-fit test.
• The NGS combined data set has 930,260 Provider HCPC
lines.
• Of the total lines, 32,145 were outlier lines.
• The chi-square for the model was 344,144.04, with
probability < 0.0001.
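As a quick check, the outlier share of lines implied by these counts agrees with the 3.46% figure reported for Provider HCPC lines:

```latex
\frac{32{,}145}{930{,}260} \approx 0.03455 \approx 3.46\%
```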
31
C statistics
• Association of Predicted Probabilities and
Observed Responses
– Percent Concordant: 94.9%
– Percent Discordant: 3.6%
– Percent Tied: 1.5%
– c statistic: 0.956
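The reported c statistic can be related to the percentages above under the common definition, assumed here since the slides do not state the formula, that c is the concordant share plus half of the tied share:

```latex
c = \frac{n_c + 0.5\,n_t}{n_c + n_d + n_t} \approx 0.949 + 0.5 \times 0.015 = 0.9565
```

Here n_c, n_d and n_t denote the numbers of concordant, discordant and tied pairs. The result is consistent with the reported value of 0.956, given that the percentages are rounded to one decimal place.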
32
Past and Future Application
33
Current Application
• Used as a single factor in determining
multivariate statistical outliers
– Problem areas
– HCPC codes
– Individual Providers
34
Current Application
• Positives
– Confidence (statistically valid methodology)
– Consistent methodology regardless of problem
area
– Lack of clinical bias
• Negatives
– Difficult to interpret
– Volume of provider/HCPC combinations
required for valid analysis
– Lack of clinical bias
35
Future Application
• Using the squared distance as a factor
in determining outlier problem areas
• Using the squared distance as a factor
in determining the aberrancy index of a
provider
36
Future Application – Problem
Areas
Problem Area     CERT Factor  Factor 2  Factor 3  Squared Distance Factor  Weight 1  Weight 2  Weight 3  Weight 4  Aggregate Index  Problem Rank
Problem Area 5   10           10        8         0.33872                  4         1         2         4         6,735.49         1
Problem Area 10  8            6         9         0.86849                  4         1         2         4         5,947.40         2
Problem Area 3   10           8         1         0.75349                  4         1         2         4         5,301.40         3
Problem Area 7   10           4         3         0.06234                  4         1         2         4         5,024.94         4
Problem Area 9   9            6         3         0.18039                  4         1         2         4         4,872.16         5
Problem Area 8   8            1         1         0.43183                  4         1         2         4         3,672.73         6
Problem Area 2   6            6         2         0.99527                  4         1         2         4         3,798.11         7
Problem Area 6   4            10        1         0.74227                  4         1         2         4         3,096.91         8
Problem Area 4   8            2         2         0.22331                  4         1         2         4         3,889.32         9
Problem Area 1   2            1         3         0.66469                  4         1         2         4         1,765.87         10
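The aggregate index values shown for the problem areas are numerically consistent with 100 times the weighted sum of the four factor scores, with weights 4, 1, 2 and 4 applied to the CERT factor, Factor 2, Factor 3 and the squared distance factor respectively. The slides do not state the formula, so the sketch below is an assumption inferred from the numbers; the function name and the `scale` parameter are illustrative.

```python
def aggregate_index(factors, weights, scale=100.0):
    """Weighted sum of factor scores, multiplied by a scale (assumed formula)."""
    return scale * sum(f * w for f, w in zip(factors, weights))

weights = (4, 1, 2, 4)                     # Weight 1 to Weight 4
problem_area_5 = (10, 10, 8, 0.33872)      # CERT, Factor 2, Factor 3, squared distance factor
index = aggregate_index(problem_area_5, weights)
```

For Problem Area 5 this gives 100 × (40 + 10 + 16 + 1.35488) = 6,735.488, which rounds to the 6,735.49 shown.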
37
Future Application – Problem
Providers
Provider     CERT Factor  Factor 2  Factor 3  Squared Distance Factor  Weight 1  Weight 2  Weight 3  Weight 4  Aggregate Index  Provider Rank
Provider 4   10           10        8         0.55720                  4         1         2         4         6,822.88         1
Provider 6   10           5         7         0.98560                  4         1         2         4         6,294.24         2
Provider 2   7            2         10        0.12170                  4         1         2         4         5,048.68         3
Provider 10  4            6         10        0.89450                  4         1         2         4         4,557.80         4
Provider 7   6            9         6         0.13890                  4         1         2         4         4,555.56         5
Provider 1   6            1         8         0.23790                  4         1         2         4         4,195.16         6
Provider 5   4            3         5         0.79580                  4         1         2         4         3,218.32         7
Provider 8   2            2         5         0.20110                  4         1         2         4         2,080.44         8
Provider 3   3            1         2         0.55390                  4         1         2         4         1,921.56         9
Provider 9   2            5         1         0.69240                  4         1         2         4         1,776.96         10
38
The multivariate model
• Using the multivariate model, only 4% of the total
Provider-HCPC combination lines were
identified as outliers.
– However, this 4% of the total lines captured
almost 100% of the NGS providers and called into question
40% of their payments in Quarter 4 of 2007.
39
Testing of the model as
classifier
• Using the multivariate model with multivariate trimming, we were
able to classify each observation (provider-HCPCS combination)
as an outlier or non-outlier.
• Using this method, we were able to identify outliers with very high
concordance (94.6%).
40
Conclusion
• We used a multivariate statistical method to
identify aberrant billing and utilization in the
claim data set and tested the validity of the
method using logistic regression.
• We also noted that a statistical method alone is
not enough; other factors need to be added to
bring value to the process of identifying
problem areas as well as finding high-value
targets.
41