Analysis of Railroad Accident Investigation
Reports Using Probabilistic Topic Models
and K-Means Clustering
Trefor P. Williams, Rutgers University
John Betak, Collaborative Solutions LLC
Rutgers, The State University of New Jersey
INFORMS 2016
THE CONTEXT
• US Railroads have an enviable contemporary safety
history – a steadily decreasing rate of incidents
• However, there are still large numbers of:
• Derailments
• Grade crossing incidents
• Injuries and fatalities associated with derailment and
grade crossing incidents
• Significant property damage and losses associated
with derailment and grade crossing incidents
THE CONTEXT-II
• US Railroads gather and maintain large databases regarding:
– Rail weight, profiles, when rolled, laid, etc.
– Crosstie, ballast, sub-base, fasteners, etc.
– Rail and roadbed condition – rail wear, ballast fouling, etc.
– Maintenance of rail and roadbed, including what was done, when it was
done, what led to the maintenance, etc.
– Inspection reports, visual inspections by track inspectors, automated
inspections with various types of equipment, including today drone
surveillance using multiple sensor devices
– Equipment inspections, maintenance records, etc.
• To understand causes of derailment and grade crossing
incidents, traditional statistical models have been used
THE CONTEXT-III
• Notwithstanding all the data and analytics, the industry still has few
models that are able to predict pending rail or equipment failures
• The Issue
– Develop analytical tools that will allow rapid analyses of large databases
comprised of diverse data types – visual (photographic, LiDAR, etc.),
text descriptions, Track Geometry Car metrics, etc.
– Using these tools, identify clusters of factors most commonly associated
with railroad incidents to develop cost-effective interventions to alleviate
and reduce/eliminate such incidents
– Using these new tools, develop predictive models that accurately predict
equipment, track and roadbed failures before they occur
RUTGERS RAILROAD BIG DATA ANALYTICS
• Our Approach
– Identify automated Big Data analytics that could lend themselves to
addressing derailment and grade crossing incidents
– Identify publicly available datasets to test these analytical tools
– Conduct Proof of Concept analyses with various data visualization and
text analysis models
– Repeat the tests across various databases to look for explicability and
consistency in results
SOME EXAMPLES
• Grade Crossing Accidents – FRA Data:
– Topic Modeling:
• Analysis of Grade Crossing Comment Fields with Latent Dirichlet Allocation
(LDA)
– Data Visualization:
• JIGSAW text clustering and data visualization
• Railroad Operations and Equipment Accidents – FRA Data:
– Data Visualization:
• JIGSAW text clustering and data visualization
• Railroad Accident Investigation Reports – NTSB and TSBC
Data:
– Topic Modeling: LDA
– K-means text clustering
TEXT MINING AND TEXT CLUSTERING – AN EXAMPLE OF THE
POSSIBILITIES FOR USING DATA ANALYTICS TO ENHANCE
UNDERSTANDING OF MAINTENANCE ISSUES
Text Mining Accident Reports
• Accident reports of major railway accidents are produced by the
National Transportation Safety Board and the Transportation
Safety Board of Canada
• Applied probabilistic topic modeling and k-means clustering to
study the types of major accidents that occur
Accident Reports
• U.S. National Transportation Safety Board
– 167 accident reports
– 1993-2014
• Transportation Safety Board of Canada
– 1991-2014
• PDF Files
• Converted to text files using Adobe Acrobat
Latent Dirichlet Allocation
• The underlying assumption of LDA is that a text document
will consist of multiple themes.
• LDA is a three-level hierarchical Bayesian model where each
item of a collection of text is modeled as a finite mixture over
an underlying set of topics.
• Each topic is, in turn, modeled as an infinite mixture over an
underlying set of topic probabilities.
• Used Stanford Topic Modeling Toolbox
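As a sketch of the three-level structure described above, the standard LDA generative process can be written as follows. The notation (α, β, θ, z, w) follows the usual textbook presentation and is an assumption here; it does not appear in the slides.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the } n\text{-th word} \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}) && \text{word drawn from that topic}
\end{align*}
\[
p(\theta_d, \mathbf{z}_d, \mathbf{w}_d \mid \alpha, \beta)
  = p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta)
\]
```

Here the corpus-level parameters (α, β), the document-level mixture θ_d, and the word-level variables (z, w) are the three levels of the hierarchy.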
LDA Outputs
• Takes words from the accident reports and forms them into ranked
topics
• For each accident report, probabilities that the report is a member
of a topic are generated.
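To illustrate the two outputs listed above (ranked topic words and per-report topic probabilities), here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is not the Stanford Topic Modeling Toolbox used in the study; the function name `lda_gibbs`, the hyperparameter values, and the four toy "reports" are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns (topic_word, doc_topics):
    topic_word[k] counts the words assigned to topic k, and
    doc_topics[d] holds the topic probabilities of document d.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # all tokens in topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic probabilities for each document.
    doc_topics = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return topic_word, doc_topics

# Made-up miniature "reports" (not taken from NTSB or TSBC data).
reports = [
    "brake air brake pressure train brake".split(),
    "crossing truck crossing gate highway truck".split(),
    "train brake pressure air".split(),
    "truck highway crossing gate".split(),
]
topic_word, doc_topics = lda_gibbs(reports, num_topics=2)
for k, counts in enumerate(topic_word):
    print("topic", k, [w for w, n in counts.most_common(3) if n > 0])
for d, probs in enumerate(doc_topics):
    print("report", d, [round(p, 2) for p in probs])
```

Each row of `doc_topics` sums to one, so it can be read as the probability that a report belongs to each topic; the ranked word lists come from sorting each topic's word counts.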
LDA Topics from NTSB Reports
LDA Topics from TSBC Reports
Results of LDA Analysis
• Similar results between US and Canada
– Braking
– Wheels
– Signals
– Crew Training/Fatigue
– Grade Crossings
– Switches
– Yards (Marshaling)
• There is a topic with bridges as the main word in the Canadian data
K-means clustering
• K-means clustering aims to partition n observations into k clusters
• These clusters reflect some mechanism that is at work in the
domain from which instances are drawn.
• This mechanism causes some instances to bear a stronger
resemblance to each other than they do to the remaining instances
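The partitioning idea above can be sketched with a small k-means (Lloyd's algorithm) over word-frequency vectors. This is only an illustration in plain Python, not the clustering software used in the study; the helper names `vectorize`, `sq_dist`, and `kmeans`, the choice of relative word frequencies with squared Euclidean distance, and the toy reports are all assumptions.

```python
import random
from collections import Counter

def vectorize(docs):
    """Turn token lists into relative word-frequency vectors."""
    vocab = sorted({w for doc in docs for w in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] / len(doc) for w in vocab])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=100, seed=0):
    """Partition the vectors into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Use k randomly chosen observations as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up miniature "reports" (not taken from the accident databases).
reports = [
    "wheel bearing wheel defect".split(),
    "switch yard switch shoving".split(),
    "wheel defect bearing overheated".split(),
    "yard shoving switch".split(),
]
vocab, vectors = vectorize(reports)
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

The result depends on the random initial centroids, so in practice the algorithm is usually run several times and the number of clusters k is chosen by the analyst.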
Word Clusters from NTSB Accident Report Text
Canadian Accident Text Clusters
Clustering Results
• Accident types in both countries:
– Wheels
– Yards
– Grade Crossings
– Switches
– Signals
• Differences
– Water/subgrade, runaway cars and bridges in Canada
– Track Warrants in U.S.
Text Mining and Text Clustering
FRA RAILROAD EQUIPMENT
ACCIDENTS
Federal Railroad Administration Railroad
Equipment Accident Database
• Any collisions, derailments, fires, explosions, acts of God, or
other events involving the operation of railroad on-track
equipment (standing or moving) causing damage above a
threshold set by the FRA must be reported by U.S. railroads
to the FRA.
• The threshold amount is adjusted over time for inflation and
was $10,500 in 2014.
• Studied reports from 2005 to 2015
Ranked Topics and Main Words in Each Topic
Interesting LDA Topics
• Topic 1- Hump Yards
• Topic 2 and Topic 7- Switching Accidents, Shoving
• Topic 5- Brake related accidents
• Topic 8- Grade crossing accidents with highway trucks
• Topic 14- Hazardous material accidents in yards when
shoving or kicking
• Topic 9, 10, 19 and 20- Derailments
Text Clustering and Visualization
• Jigsaw is a visual analytic system that represents documents
visually to help analysts examine them more efficiently and
develop theories about potential actions more quickly.
• One of the capabilities of Jigsaw is the ability to cluster similar
text documents using a k-means clustering algorithm and then
produce a visualization of the clusters.
• Analysis was limited to 2,000 accidents from 2014-2015 because of
software limitations
• Similar Clusters to LDA
RAILROAD OPERATIONS AND
EQUIPMENT ACCIDENTS – FRA DATA: JIGSAW
VISUALIZATION
Text Clusters from RR Equipment Accident
Database
Grade Crossing Accidents in the Midwest by
User Type
CONCLUSIONS
• Text mining can be a useful tool to better understand the types
of railroad accidents that occur
• Different methods of analyzing the text produce similar results,
further confirming the existence of recurring railroad accident
types
• Text mining and data clustering allow cross database analyses
to identify recurring themes in different databases
• Data visualization allows policy makers to quickly grasp where
the predominant issues lie
• Potentially, sophisticated visualizations can be linked to
detailed information about track failures as an aid for railroad
managers
CONCLUSIONS CONTINUED
• Many of the accident categories indicate areas where
additional training can reduce accidents
• Main accident themes of rail and track defects, wheel defects,
grade crossing accidents and switching accidents were
identified – areas of concern that have been recognized
previously, but confirmed through analyses of text materials
heretofore not analyzed
• There are some differences found in the U.S. and Canadian
reports: accidents involving bridges were more prominent in
Canada; accidents involving runaway cars are prominent in
the Canadian clustering analysis
• Even though different railroads use different nomenclature in
the text sections of the accident reports, LDA analysis was
able to extract data that indicate common trends in grade
crossing accidents
NEXT STEPS
• LDA is a tool that could be utilized by the FRA and Class I
railroads for analyzing data where comment fields are
common, such as track maintenance and inspection reports
• LDA could be used to find relationships between procedures
used and problem recurrence
• Future research can study how the LDA topic modeling can be
combined with numeric data, like maintenance data collected
by railroads
• Perform additional analysis to understand the grade crossing
factors involving tractor-trailers by analyzing other data fields
in FRA database
NEXT STEPS CONTINUED
• Future research suggested by this work is to perform the text
mining and clustering for different time periods to see if major
accident groupings change over time
• The probabilities of word occurrence produced by the LDA
algorithm can be used in predictive models to predict the
number of expected accident types
• There is great potential to combine Data visualization and GIS
with data analytic output to develop systems that provide
increased insight into maintenance problems occurring in the
field.
Conclusions
• Both text mining methods produced results highlighting major
themes in railroad accidents
• Future research suggested by this work is to perform text mining
and clustering for different time periods to see if major accident
groupings change over time.