slides - Fangbo Tao

EventCube
Aviation Safety Data Analysis System
Fangbo Tao, Xiao Yu, Jiawei Han
08/10/13
The data we focus on:
Following a normal approach and landing to runway 4 in roc; aircraft was taxied clear of the end of runway to the gate. Several ground snow removal vehicles were operating to left of aircraft so we moved to the right side of ramp. …
Huge Collection of Logs
Each Document
Power of Text-Rich Data Cubes
Hierarchical Data Cube
Text Analysis
Power of Text-Rich Data Cubes
Data Cube
Efficient Summarization
Rich Text
Powerful Text Mining
Power of Text-Rich Data Cube
Other features
 Multi-gram summarization
 Hierarchical dimension selection
 Similar document finding
 Keyword frequency distribution
 Contextual search (diagram notes: "support multiple choices", "based on Contextual Search")
Contextual Search
 Motivation:
 Every word or concept may have an equivalent word or concept
 “SVM” = “Support Vector Machine”, “Alt” = “Altitude”
 Words are also connected to related words
 “Kernel Method” – “SVM”, “altitude” – “flight level”
Contextual Search
 We develop a contextual search framework to build the word-net
 It contains 4 different relationships between terms A and B:
 A “Use” B: Equivalent terms, B is the more common one
 A “RT” B: Related terms, not hierarchical
 A “BT” B: B is the broader word
 A “NT” B: B is the narrower word
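The four relationships can be held in a small in-memory structure. The sketch below is illustrative only: the `WordNet` class, its method names, and the choice to store inverse links automatically are assumptions, not the system's actual design. Only the relationship labels come from the slide.

```python
from collections import defaultdict

# Relationship labels from the slide; everything else here is hypothetical.
RELATIONS = ("USE", "RT", "BT", "NT")
# BT and NT are inverses of each other; RT is symmetric.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}


class WordNet:
    """Tiny thesaurus: term -> relation -> set of terms."""

    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, a, relation, b):
        """Record 'a <relation> b', e.g. add("Alt", "USE", "Altitude")."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        a, b = a.lower(), b.lower()
        self.edges[a][relation].add(b)
        if relation in INVERSE:
            self.edges[b][INVERSE[relation]].add(a)

    def preferred(self, term):
        """Follow a USE link to the more common equivalent term, if any."""
        term = term.lower()
        targets = self.edges[term]["USE"] if term in self.edges else set()
        return min(targets) if targets else term

    def related(self, term, relation):
        """Return the terms linked to `term` by `relation`, sorted."""
        term = term.lower()
        if term not in self.edges:
            return []
        return sorted(self.edges[term][relation])


wn = WordNet()
wn.add("Alt", "USE", "Altitude")          # equivalent, "Altitude" more common
wn.add("altitude", "RT", "flight level")  # related, not hierarchical
wn.add("B-737", "BT", "Boeing")           # "Boeing" is the broader word
print(wn.preferred("alt"))
print(wn.related("boeing", "NT"))
```

Storing the inverse link on insertion keeps broader/narrower lookups cheap in both directions, at the cost of a slightly larger structure.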
Contextual Search
 Step 1: Generate the word-net when a dataset is uploaded.
 Step 2: Return related terms as the user types a query.
 Step 3: Automatically include equivalent terms when searching.
 Step 4: Support the operators “AND”/“OR”/“NOT”.
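Steps 3 and 4 can be sketched as follows. This is a minimal illustration, assuming equivalence groups are already available as sets of terms and that reports are plain strings; the function names and the `all_of` / `any_of` / `none_of` parameters (standing in for AND / OR / NOT) are hypothetical.

```python
import re


def expand(term, equivalents):
    """Return the term together with every term in its equivalence group."""
    term = term.lower()
    for group in equivalents:
        if term in group:
            return set(group)
    return {term}


def matches(text, term, equivalents):
    """True if the text contains the term or one of its equivalents as a whole word."""
    text = text.lower()
    return any(
        re.search(r"\b" + re.escape(variant) + r"\b", text)
        for variant in expand(term, equivalents)
    )


def search(docs, equivalents, all_of=(), any_of=(), none_of=()):
    """all_of acts as AND, any_of as OR, none_of as NOT."""
    hits = []
    for doc_id, text in docs.items():
        if not all(matches(text, t, equivalents) for t in all_of):
            continue
        if any_of and not any(matches(text, t, equivalents) for t in any_of):
            continue
        if any(matches(text, t, equivalents) for t in none_of):
            continue
        hits.append(doc_id)
    return sorted(hits)


# Made-up reports for illustration.
docs = {
    1: "Climbed through assigned alt during departure.",
    2: "Altitude deviation while snow removal was in progress.",
    3: "Taxied to the gate past snow removal vehicles.",
}
equivalents = [{"alt", "altitude"}]

print(search(docs, equivalents, all_of=["altitude"]))
print(search(docs, equivalents, all_of=["alt"], none_of=["snow"]))
```

Searching for "altitude" also returns the report that only says "alt", because the query is expanded before matching.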
Hierarchical Dimension Support
 Multiple Choice Support
 Each Dimension can support several
levels
 Powerful examples:
 “B-737” VS. “B-747”
 “Boeing” VS. “Airbus”
Document List Result
 Uses MySQL's default natural language full-text search
 Extracts a title from the most relevant part of the document
 Shows tags of dimension values for the target dimensions
 Highlights the keywords
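A rough sketch of what such a query and the keyword highlighting could look like. The `MATCH ... AGAINST (... IN NATURAL LANGUAGE MODE)` clause is standard MySQL syntax and needs a FULLTEXT index on the column; the table name, column names, and both helper functions are hypothetical and not taken from the system.

```python
import re


def build_fulltext_query(table, text_column, limit=20):
    """Build a parameterized MySQL natural language full-text query.

    Identifiers are interpolated directly, so they must come from trusted
    code, never from user input. The two %s placeholders both take the
    user's search string.
    """
    return (
        f"SELECT id, {text_column}, "
        f"MATCH({text_column}) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        f"FROM {table} "
        f"WHERE MATCH({text_column}) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        f"ORDER BY score DESC LIMIT {int(limit)}"
    )


def highlight(text, keywords, left="[", right="]"):
    """Wrap whole-word, case-insensitive keyword matches in markers."""
    pattern = "|".join(
        re.escape(k) for k in sorted(keywords, key=len, reverse=True)
    )
    if not pattern:
        return text
    return re.sub(
        rf"\b({pattern})\b",
        lambda m: f"{left}{m.group(1)}{right}",
        text,
        flags=re.IGNORECASE,
    )


# "reports" and "narrative" are placeholder names.
sql = build_fulltext_query("reports", "narrative")
print(sql)
print(highlight("Snow removal vehicles near the runway", ["runway", "snow"]))
```

The query string would be passed to a database cursor together with the search text twice, for example `cursor.execute(sql, (query, query))` with a driver that uses `%s` placeholders.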
Similar Document
 Also based on contextual search
 Step 1: Extract meaningful terms from the original report.
 Step 2: Run a contextual search with these terms as input.
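The slides do not say how the meaningful terms are extracted. One common choice, used here purely as an assumption, is to rank the report's words by TF-IDF against the corpus and keep the top few. The stopword list, function names, and sample reports are made up for illustration.

```python
import math
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "to", "on", "near", "of", "a", "and"}


def tokenize(text):
    """Lowercase, split into alphabetic words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms(report, corpus, k=3):
    """Return the k words of `report` with the highest TF-IDF score."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    term_freq = Counter(tokenize(report))
    scores = {
        term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


corpus = [
    "the aircraft taxied to the gate",
    "the aircraft landed on the runway",
    "snow removal vehicles blocked the ramp near the gate",
]
terms = top_terms(corpus[2], corpus, k=3)
print(terms)
```

The resulting terms would then be passed to the contextual search as if the user had typed them.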
Top Cells
 Search all cells in the target dimensions and find the most relevant ones
 A multi-dimensional cell ranking
Single Dimension Distribution Based
on Keywords
 Use an offline + online framework to calculate the distribution.
 If purely offline:
 The number of keyword combinations is exponential.
 If purely online:
 The whole corpus must be retrieved for every query.
 Strategy:
 Store the single-keyword distributions in the database. [Offline]
 Combine the single-keyword distributions into a new distribution online. [Online]
Single Dimension Distribution Based
on Keywords
 Offline process:
 Step 1: Map equivalent terms to a single term.
 Step 2: Build both a keyword reverse (inverted) index and a cell reverse index over the reports.
 Step 3: Compare the two reverse indexes to calculate the single-term distribution.
 Online process (given a list of terms and dimensions):
 Step 1: Map each term to its equivalent term.
 Step 2: For each dimension, calculate the combined distribution under the independence assumption:
 $\mathrm{Val}(t_1, \dots, t_n) = 1 - \prod_{i=1}^{n} \bigl(1 - \mathrm{Val}(t_i)\bigr)$
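The offline and online steps can be sketched together. The slides do not define Val precisely; this sketch assumes Val(t) for a cell is the share of that cell's reports that contain term t, so the combined value is the chance that a report in the cell contains at least one of the terms if the terms occur independently. The report layout, the `canonical` mapping, and the function names are hypothetical.

```python
from collections import defaultdict


def build_term_distributions(reports, dimension, canonical):
    """Offline: compute Val(term) per cell from two reverse indexes."""
    term_index = defaultdict(set)  # term -> ids of reports containing it
    cell_index = defaultdict(set)  # dimension value (cell) -> report ids
    for rid, report in reports.items():
        cell_index[report[dimension]].add(rid)
        for term in report["terms"]:
            # Step 1: map equivalent terms to a single term.
            term_index[canonical.get(term, term)].add(rid)
    # Step 3: intersect the two indexes to get the single-term distribution.
    return {
        term: {
            cell: len(rids & members) / len(members)
            for cell, members in cell_index.items()
        }
        for term, rids in term_index.items()
    }


def combine(distributions, terms, canonical):
    """Online: Val(t1..tn) = 1 - prod(1 - Val(ti)), computed per cell."""
    terms = {canonical.get(t, t) for t in terms}
    cells = {c for t in terms for c in distributions.get(t, {})}
    combined = {}
    for cell in cells:
        miss = 1.0  # probability that none of the terms occurs
        for t in terms:
            miss *= 1.0 - distributions.get(t, {}).get(cell, 0.0)
        combined[cell] = 1.0 - miss
    return combined


# Made-up reports with one dimension ("weather") and pre-extracted terms.
reports = {
    1: {"weather": "snow", "terms": {"alt", "runway"}},
    2: {"weather": "snow", "terms": {"gate"}},
    3: {"weather": "clear", "terms": {"altitude"}},
    4: {"weather": "clear", "terms": {"runway", "altitude"}},
}
canonical = {"alt": "altitude"}

dist = build_term_distributions(reports, "weather", canonical)
result = combine(dist, ["alt", "runway"], canonical)
print(result)
```

Only one distribution per term is stored, so a query with n terms costs n lookups per cell instead of a pass over the corpus.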
Topic Distribution
 Based on Topic Cube
 Applies a topic model
 Supports comparison between different cells
Unigram/Multigram description
 Based on the paper by Qiaozhu Mei et al., “Automatic Labeling of Multinomial Topic Models”
 Find multi-gram candidates in the whole text
 Score each candidate based on its unigrams
 Adjust the score based on the candidate's length
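The three steps can be illustrated with a toy scorer. This is not the scoring function from the cited paper: the averaging of unigram scores, the logarithmic length bonus, the `min_count` filter, and the input `unigram_score` mapping (for example, word weights from a topic) are all assumptions chosen to keep the sketch short.

```python
import math
import re
from collections import Counter


def label_candidates(texts, unigram_score, max_n=3, min_count=2):
    """Rank n-gram candidates by a unigram-based score with a length bonus."""
    # Step 1: collect every n-gram (n = 1..max_n) from the whole text.
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    scored = {}
    for gram, count in counts.items():
        # Keep frequent candidates whose words all have a unigram score.
        if count < min_count or any(w not in unigram_score for w in gram):
            continue
        # Step 2: score the candidate from its unigrams (here: the mean).
        base = sum(unigram_score[w] for w in gram) / len(gram)
        # Step 3: adjust by length so longer phrases are not penalized.
        scored[" ".join(gram)] = base * (1 + math.log(len(gram)))
    return sorted(scored, key=lambda g: (-scored[g], g))


# Made-up texts and unigram weights for illustration.
texts = [
    "snow removal vehicles on the ramp",
    "snow removal delayed the departure",
]
unigram_score = {"snow": 0.4, "removal": 0.3, "ramp": 0.1}
labels = label_candidates(texts, unigram_score)
print(labels)
```

With these inputs the bigram "snow removal" outranks its individual words because of the length adjustment.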
Thinking
 Data Cube:
 Efficient Summary
 Highly Structured Data.
 Rich Text:
 Topic Analysis, keyword search
 Common: ASRS, IMDB, Publication-Net, News…
 Network (HIN)
 Good at mining, contains structural information.
 No information loss
Motivation of EventCube
 Combine Data Cube with Rich Text.
 Combine Summary with Keyword Search
 Build a general search/analysis system for rich text cube data.
 1. Aviation Safety Reporting Data
 Time, Weather, Location, Model…Flight logs
 2. Publication Data
 Author, Conf, Time, Field, Affiliation…Abstract
 3. IMDB
 Time, Country, Style, Director…Description
Thanks