Data Mining in Civil Infrastructure

Automated Procedures for Improving
the Accuracy of Sensor-Based
Monitoring Data
Rebecca Buchheit
AIS Lab
Background
• sporadic use of KDD techniques in civil
infrastructure
• relative youth of data mining research
• difficult to systematically apply KDD process
• KDD process tools (CRISP-DM) still under
development
• KDD process highly domain dependent
• time consuming to teach data mining analysts
domain knowledge
Research Objectives
• develop a framework for systematically
applying KDD process to civil
infrastructure data analysis needs
– set of guidelines for inexperienced analysts
– checklist for more experienced analysts
• describe intersection of KDD process
characteristics and civil infrastructure
– what problems are well-suited to KDD?
– what characteristics are unique to
infrastructure?
Summary
• increased data collection => increased need
to intelligently analyze data
• KDD process as a “power tool” for analyzing
data for high-level knowledge
• civil infrastructure problems are well-suited to
data mining but will need to apply entire KDD
process to get good results
• proposed framework will help researchers to
systematically apply KDD process to their
data analysis problems
Data Quality
• What is it?
– in this talk, “accuracy”
– how close is the observed value to the true
value?
– “ground truth” is rare
– look for anomalous patterns
• Why is it important?
– poor quality data may taint analyses
– patterns of poor quality data may
overwhelm data mining/machine learning
algorithms
Mn/ROAD Data
• weigh-in-motion data
– axle spacings and weights, speed, lane, error codes
• derived quantities
– equivalent standard axle loads (ESALs)
– FHWA vehicle type
– gross vehicle weight
– total vehicle length
• trucks only (type >= 4)
• Jan 1, 1998 to Dec 31, 2000
• about 3 million vehicles
(image courtesy Mn/ROAD)
Sample Data
Overview of Approach
• use statistical analysis and data mining
algorithms to separate anomalies from
normal data
– clustering
– regression
– physical constraints
– statistical properties
• focus on differences between anomalies
and normal data to help discover
causation
Clustering
• group data into “natural classes”
• anomalies separated from normal data
• used Autoclass clustering algorithm
Clustering Results
Regression
∑ESAL = (3.531±0.176) ∑vehicles – (1.252±0.099) ∑axles
        + (0.066±0.003) ∑GVW – (139.000±79.813)
• confidence interval of 95%
• R-square (fit) = 0.923
• if error > 15% then identify as anomaly
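A minimal sketch of the regression check, using the point estimates from the fitted equation above. The function names are hypothetical, and two details are assumptions because the slide does not state them: the sums are taken over one aggregation period (for example, one day), and "error" is read as relative error with respect to the observed ∑ESAL.

```python
def predict_esal_sum(n_vehicles, n_axles, gvw_sum):
    """Predicted sum of ESALs for one aggregation period.

    Coefficients are the point estimates from the slide; the GVW units
    must match whatever units were used when the model was fitted.
    """
    return 3.531 * n_vehicles - 1.252 * n_axles + 0.066 * gvw_sum - 139.000


def is_regression_anomaly(observed_esal_sum, predicted_esal_sum, tolerance=0.15):
    """Flag the period when the relative error exceeds the tolerance.

    Assumption: error is measured relative to the observed value.
    """
    if observed_esal_sum == 0:
        # Relative error is undefined; treat any nonzero prediction as anomalous.
        return predicted_esal_sum != 0
    error = abs(observed_esal_sum - predicted_esal_sum) / abs(observed_esal_sum)
    return error > tolerance
```

Periods flagged this way are only candidates; the slide's 15% threshold says nothing about what caused the mismatch.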
Regression Results
Binary Constraints (1)
constraint                            # violations (3,068,384 total)
offscale hit error                    61,129 (1.99%)
significant weight difference error   11,107 (0.36%)
different axle counts error           69,521 (2.27%)
tailgating                            10,211 (0.33%)
speed >= 64.37 km/h                   51,114 (1.86%)
speed <= 128.74 km/h                  3,723 (0.12%)
Binary Constraints (2)
constraint                            # violations (3,068,384 total)
gross weight <= 45,359 kg             24,897 (0.81%)
length <= 22.86 m                     79,454 (2.59%)
unknown vehicle type                  190,191 (6.20%)
number of axles != 0                  47 (0.00%)
number of axles <= 8                  57,114 (1.86%)
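The numeric constraints in the two tables can be checked per vehicle record, as in the sketch below. The thresholds are the ones on the slides. The record layout (a dict with keys `speed_kmh`, `gross_weight_kg`, `length_m`, `n_axles`) and the function name are assumptions for illustration. The error-code, tailgating, and vehicle-type constraints are left out because the slides do not say how they are encoded.

```python
def violated_constraints(vehicle):
    """Return the names of the numeric binary constraints this record violates.

    Each entry maps a constraint (stated as the condition that must hold)
    to whether the record satisfies it.
    """
    checks = {
        "speed >= 64.37 km/h": vehicle["speed_kmh"] >= 64.37,
        "speed <= 128.74 km/h": vehicle["speed_kmh"] <= 128.74,
        "gross weight <= 45,359 kg": vehicle["gross_weight_kg"] <= 45359,
        "length <= 22.86 m": vehicle["length_m"] <= 22.86,
        "number of axles != 0": vehicle["n_axles"] != 0,
        "number of axles <= 8": vehicle["n_axles"] <= 8,
    }
    return [name for name, satisfied in checks.items() if not satisfied]
```

For example, a record with `speed_kmh=30.0` and `length_m=30.0` would be reported as violating both the low-speed and the length constraints.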
Constraint Interactions
c1                  c2                  % interactions
slow speed          length over limit   63.5%
length over limit   slow speed          45.7%
tailgating          unknown type        31.7%
high speed          unknown type        28.7%
overweight          diff axle counts    25.2%
tailgating          slow speed          21.1%
tailgating          length over limit   15.2%
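One plausible reading of "% interactions" is the share of vehicles violating c1 that also violate c2, which would explain why the slow speed / length over limit pair gives different percentages in each direction. The slide does not define the measure, so this interpretation, the function name, and the input layout (one set of violated constraint names per vehicle) are all assumptions.

```python
def interaction_pct(violations_per_vehicle, c1, c2):
    """Percentage of vehicles violating c1 that also violate c2.

    violations_per_vehicle: iterable of sets of constraint names,
    one set per vehicle record.
    """
    with_c1 = [v for v in violations_per_vehicle if c1 in v]
    if not with_c1:
        return 0.0
    both = sum(1 for v in with_c1 if c2 in v)
    return 100.0 * both / len(with_c1)
```

Because the denominator is the number of c1 violators, swapping c1 and c2 generally changes the result.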
Distribution Constraints
• use a goodness-of-fit test to compare distributions from the same day of week
– length
– gross weight
– ESALs
– lane
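The slide does not name the goodness-of-fit test that was used. As an illustration only, the sketch below computes a Pearson chi-square statistic comparing one day's binned counts (for example, a histogram of vehicle lengths) against baseline counts pooled from the same day of week. The function name and the binning are assumptions.

```python
def chi_square_statistic(day_counts, baseline_counts):
    """Pearson chi-square statistic of one day's histogram against a baseline.

    Both arguments are per-bin counts over the same bins. The baseline is
    rescaled to the day's total so that only the shape is compared.
    """
    if len(day_counts) != len(baseline_counts):
        raise ValueError("histograms must use the same bins")
    day_total = sum(day_counts)
    baseline_total = sum(baseline_counts)
    statistic = 0.0
    for observed, baseline in zip(day_counts, baseline_counts):
        expected = day_total * baseline / baseline_total
        if expected > 0:  # bins that are empty in the baseline are skipped
            statistic += (observed - expected) ** 2 / expected
    return statistic
```

A day would then be treated as unlikely to come from the baseline when the statistic exceeds the chi-square critical value for the chosen significance level and (number of bins minus 1) degrees of freedom.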
Anomaly Identification
• identify days with higher than normal
concentrations of binary constraint
violations
• identify days that are not likely to have
come from the baseline distributions for
length, ESALs, gross weight and lane
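The slide does not say how "higher than normal" is decided for binary constraint violations. One simple, outlier-resistant option is to flag days whose violation rate exceeds the median by more than k times the median absolute deviation (MAD). The rule, the default k, and the function name below are assumptions, not the procedure used in the study.

```python
from statistics import median


def flag_high_violation_days(rate_by_day, k=3.0):
    """Return the days whose violation rate is unusually high.

    rate_by_day: dict mapping a day label to the fraction of that day's
    vehicles that violate a given binary constraint.
    A day is flagged when its rate exceeds median + k * MAD.
    Note: if more than half the days share the same rate, MAD is 0 and
    every day above the median is flagged.
    """
    rates = list(rate_by_day.values())
    center = median(rates)
    mad = median([abs(r - center) for r in rates])
    threshold = center + k * mad
    return sorted(day for day, r in rate_by_day.items() if r > threshold)
```

The median and MAD are used instead of the mean and standard deviation so that a few extreme days do not inflate the threshold and hide themselves.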
Binary Constraints Results
Distribution Constraints Results
A Quick Refresher
• used four different procedures to detect
anomalies
– clustering
– regression
– binary (physical) constraints
– distribution constraints
• next up
– what is causing the anomalies?
– can we fix them?
Gross Vehicle Weight
Lane
What Happened?
• two vehicles traveling slowly and close
together (tailgating) may be recorded as
a single vehicle
• lightweight vehicles are tailgating cars
– cars not supposed to be in database
– mis-classified because of tailgating
– this causes the “high” vehicle counts
• very heavy vehicles are tailgating trucks
• lane 1 (right-hand side) data is missing
for all “low” vehicle count days
Can It Be Fixed? (1)
• removed all tailgating cars
– lightweight
– short
– 2 or 3 axles
– error code
• “halved” all tailgating trucks
– very long
– very heavy
– more than 9 axles
– error code
Can It Be Fixed? (2)
• inserted lane 1 vehicles from same time period in 2000
• “shifted” days to make sure day of week was constant
– Tuesday Sept 8 1998 => Tuesday Sept 5 2000
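The day "shift" can be sketched as picking the date in the replacement year that is closest to the same calendar date and falls on the same day of week. This reproduces the example on the slide, but the rule itself and the function name are assumptions; the slide gives only the one mapping.

```python
from datetime import date, timedelta


def same_weekday_in_year(d, target_year):
    """Nearest date in target_year to d's calendar date with d's weekday."""
    try:
        anchor = d.replace(year=target_year)
    except ValueError:
        # Feb 29 in a non-leap target year: fall back to Feb 28.
        anchor = d.replace(year=target_year, day=28)
    # Signed offset in the range -3..3 days that restores the weekday.
    offset = (d.weekday() - anchor.weekday() + 3) % 7 - 3
    return anchor + timedelta(days=offset)
```

For the slide's example, `same_weekday_in_year(date(1998, 9, 8), 2000)` gives `date(2000, 9, 5)`; both dates are Tuesdays.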
Summary
• statistical analysis and data mining
algorithms can be used to detect
systematic anomalies in data
– focus on differences between anomalies
and normal data to help discover causation
– need domain knowledge to understand
causation
Current Progress/Future Work
• integrate algorithms into data quality
assessment program == automation
– physical constraints
– distribution constraints
– other statistical characteristics of data
– clustering
– regression, neural networks
• will support infrastructure-related data
collection activities
• use algorithms to identify and “clean”
anomalies
Acknowledgements
• Minnesota Department of
Transportation, especially Maggi
Chalkline
• based upon work supported by the
National Science Foundation, under
Grant Numbers 9987871 and DGE
9553380