
Apollo – Automated Content Management System
Srikanth Kallurkar
Quantum Leap Innovations
Work Performed under AFRL contract FA8750-06-C-0052
©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved.
Capabilities
• Automated domain relevant information gathering
– Gathers documents relevant to domains of interest from the web or proprietary databases.
• Automated content organization
– Organizes documents by topics, keywords, sources, time references, and features of interest.
• Automated information discovery
– Assists users with automated recommendations on related documents, topics, keywords, sources, …
Comparison to the existing manual information gathering method (what most users do currently)

The user performs a "keyword search" against a generalized search index built over the data, through a search engine interface:
1. Develop information need
2. Form keywords
3. Search (issue the query)
4. Results (returned by the search engine)
5. Examine results
6. Satisfied? If yes, the user takes a break.
7. If no, refine the query (conjure up new keywords) and search again, or (7a) give up.

The goal is to maximize the results for a user keyword query.
(Diagram legend: user tasks vs. search engine tasks vs. data.)
Apollo Information Gathering method (what users do with Apollo)

The user explores, filters, and discovers documents assisted by Apollo features. Apollo maintains a specialized domain model over the data, with features such as vocabulary, location, time, and sources, exposed through the Apollo interface:
1. Develop information need
2. Explore features
3. Filter
4. Results
5. Examine results
6. Satisfied? If yes, the user takes a break.
7. If no, discover new/related information via Apollo features and continue exploring.

The focus is on informative results seeded by a user-selected combination of features.
(Diagram legend: user tasks vs. Apollo tasks vs. data.)
Apollo Architecture

[Figure: Apollo architecture diagram]
Apollo Domain Modeling (behind the scenes)

1. Bootstrap domain
2. Define domain, topics, subtopics
3. Get training documents (Option A, B, or A+B):
   A. From the web: build representative keywords and query search engine(s)
   B. From a specialized domain repository (select a small sample)
   Curate the training set (optional)
4. Build domain signature:
   – Identify salient terms per domain, topic, and subtopic
   – Compute classification threshold
5. Organize documents (Option A, B, or A+B; A. from the web, B. from a specialized domain repository):
   – Filter documents
   – Classify into defined topics/subtopics
   – Extract features: vocabulary, location, time, …
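A minimal sketch (in Python) of how step 4 might be realized, assuming a simple bag-of-words signature with a tf-idf-style salience score; the names and weighting scheme here are illustrative assumptions, not Apollo's actual implementation:

from collections import Counter
import math
import statistics

def build_signature(training_docs, top_k=100):
    """Score terms by a tf-idf-style salience and keep the top_k
    as the domain signature (term -> weight)."""
    tf, df = Counter(), Counter()
    for doc in training_docs:
        terms = doc.lower().split()
        tf.update(terms)
        df.update(set(terms))
    n = len(training_docs)
    scored = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
    top = sorted(scored, key=scored.get, reverse=True)[:top_k]
    return {t: scored[t] for t in top}

def score(doc, signature):
    """Cross product (here: dot product) of a document with the signature."""
    counts = Counter(doc.lower().split())
    return sum(w * counts[t] for t, w in signature.items())

def classification_threshold(training_docs, signature, k=1.0):
    """Threshold from the mean and standard deviation of the training
    documents' scores; k depends on the distribution assumed."""
    scores = [score(d, signature) for d in training_docs]
    return statistics.mean(scores) - k * statistics.pstdev(scores)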
Apollo Data Organization

Snapshot of the Apollo process to collect a domain relevant document:
– A data source (e.g., a web site or proprietary database) yields a document (e.g., a published article, news report, or journal paper).
– The Apollo collection process checks: is the document in the domain? If no, discard.
– If yes, classify it into the defined domain topics/subtopics, extract features (domain relevant vocabulary, locations, time references, sources, …), organize it by those features, and store the document.

Snapshot of the Apollo process to evolve domain relevant libraries:
– Many data sources contribute many documents.
– The Apollo collection/organization process routes them into per-domain libraries (Domain A, Domain B, Domain C) of domain relevant documents (Doc 1, Doc 2, …, Doc N), organized by features.
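Continuing the sketch above (reusing its score function), one pass of this collect/classify/organize loop might look as follows; extract_features is a stand-in for Apollo's vocabulary/location/time extraction, which the slides do not detail:

def extract_features(document):
    """Stub: Apollo extracts vocabulary phrases, locations, time
    references, sources, etc. Here: distinct capitalized words."""
    return {w.strip(".,") for w in document.split() if w[:1].isupper()}

def collect(document, domains, libraries):
    """domains maps a domain name to (signature, threshold); a document
    is stored only if it clears some domain's threshold, filed under
    each feature extracted from it."""
    for name, (signature, threshold) in domains.items():
        if score(document, signature) < threshold:
            continue  # "Is in Domain?" -> No for this domain
        for feature in extract_features(document):
            libraries.setdefault(name, {}).setdefault(feature, []).append(document)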
Apollo Information Discovery

1. The user selects a feature via the Apollo interface (e.g., the user selects the phrase "global warming" from the domain "climate change").
2. Apollo builds the set of documents from the library that contain the feature (a set of n documents containing the phrase "global warming").
3. Apollo collates all other features from that set and ranks them by domain relevance.
4. The user is presented with the co-occurring features (e.g., the user sees the phrases "greenhouse gas emissions" and "ice core" co-occurring with "global warming" and explores documents containing those phrases).
5. The user can use the discovered features to expand or restrict the focus of the search based on driving interests.
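Steps 2–4 admit a compact sketch (reusing extract_features and the feature-indexed library from the sketch above); ranking by raw co-occurrence count stands in for Apollo's domain-relevance ranking, which the slides do not specify:

from collections import Counter

def discover(feature, feature_index, top_k=10):
    """Gather the documents filed under the selected feature, collate
    the other features they carry, and rank those features."""
    cooc = Counter()
    for doc in feature_index.get(feature, []):
        cooc.update(f for f in extract_features(doc) if f != feature)
    return cooc.most_common(top_k)

# e.g., discover("global warming", climate_library) might surface
# "greenhouse gas emissions" and "ice core" as co-occurring phrases.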
Illustration: Apollo Web Content Management Application for the domain "Climate Change"
“Climate Change” Domain Model

– Vocabulary (phrases, keywords, idioms) identified for the domain from training documents collected from the web.
– These are the building blocks of the model of the domain.
– Modeling error stems from noise in the training data; it can be reduced by input from human experts.
Apollo Prototype

[Screenshot callouts:]
– Keyword filter
– Extracted "Locations" across the collection of documents
– Extracted "Keywords" or phrases across the collection of documents
Inline Document View

[Screenshot callouts:]
– Domain document results of filtering
– Automated document summary
– Filter interface
– Additional features
Expanded Document View

[Screenshot callouts:]
– Features extracted only for this document
– Cached text of the document
Automatically Generated Domain Vocabulary

– Vocabulary collated across the domain library.
– Font size and thickness show domain importance.
– Importance changes as the library changes.
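A minimal sketch of the size-by-importance rendering; the linear mapping and point-size range are assumptions (the slides say only that size and thickness reflect importance):

def font_sizes(weights, min_pt=10, max_pt=48):
    """Map domain-importance weights (term -> weight) to font sizes.
    Recomputing on library updates makes the cloud shift over time."""
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0
    return {t: min_pt + (w - lo) / span * (max_pt - min_pt)
            for t, w in weights.items()}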
Apollo Performance
Experiment Setup
• The experiment setup comprised the Text REtrieval Conference (TREC) document collection from the 2002 filtering track [1]. The document collection statistics were:
– The collection contained documents from Reuters Corpus Volume 1.
– There were 83,650 training documents and 723,141 testing documents.
– There were 50 assessor topics and 50 intersection topics. The assessor topics had relevance judgments from human assessors, whereas the intersection topics were constructed artificially from intersections of pairs of Reuters categories; the relevant documents are taken to be those to which both category labels have been assigned.
– The main metrics were T11F (FBeta with a coefficient of 0.5) and T11SU, a normalized linear utility (both sketched below).
1. http://trec.nist.gov/data/filtering/T11filter_guide.html
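For reference, a sketch of the two metrics per my reading of the TREC-11 filtering guidelines; the constants, in particular the MinNU = -0.5 floor, should be verified against [1]:

def t11f(rel_ret, nonrel_ret, total_rel, beta=0.5):
    """FBeta with beta = 0.5, weighting precision over recall."""
    if rel_ret == 0:
        return 0.0
    p = rel_ret / (rel_ret + nonrel_ret)
    r = rel_ret / total_rel
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

def t11su(rel_ret, nonrel_ret, total_rel, min_nu=-0.5):
    """Normalized linear utility: T11U = 2*(relevant retrieved) -
    (non-relevant retrieved), scaled to at most 1 with a floor."""
    nu = (2 * rel_ret - nonrel_ret) / (2 * total_rel)
    return (max(nu, min_nu) - min_nu) / (1 - min_nu)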
Experiment
• Each topic was set up as an independent domain in Apollo.
• Only the relevant documents from the topic's training set were used to create the topic signature.
• The topic signature was used to output a vector – called the filter vector – comprising single-word terms weighted by their ranks.
• A comparison threshold was calculated from the mean and standard deviation of the cross products of the training documents with the filter vector (see the sketch after this list).
• Different distributions were assumed to estimate appropriate thresholds.
• In addition, the number of documents to be selected was set to a multiple of the training sample size.
• The entire testing set was indexed using Lucene.
• For each topic, documents were compared by cross product with the topic filter vector, in the document order prescribed by TREC.
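A sketch of this filtering run; the rank-to-weight function and the sign of the k term are assumptions (the slides say only "weighted by their ranks" and "based on the mean and standard deviation"):

import statistics

def filter_vector(ranked_terms):
    """Single-word terms weighted by rank: the top-ranked term gets
    the largest weight under this assumed linear scheme."""
    n = len(ranked_terms)
    return {t: (n - i) / n for i, t in enumerate(ranked_terms)}

def threshold(train_scores, k=1.0):
    """Mean/std-dev threshold over the training documents' cross
    products; k depends on the distribution assumed for the scores."""
    return statistics.mean(train_scores) + k * statistics.pstdev(train_scores)

def run_topic(fv, doc_stream, thr, max_select):
    """Filter the test stream in the prescribed order, selecting
    documents whose cross product with the filter vector clears thr,
    capped at a multiple of the training sample size."""
    selected = []
    for doc_id, term_counts in doc_stream:
        s = sum(w * term_counts.get(t, 0) for t, w in fv.items())
        if s >= thr and len(selected) < max_select:
            selected.append(doc_id)
    return selected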
Initial Results
50 Assessor Topics

                            Avg. Recall   Avg. Precision   Avg. T11F (FBeta)
Apollo                      0.35          0.63             0.499
TREC Benchmark KerMit [2]   -             0.43             0.495

• Initial results show that Apollo filtering effectiveness is very competitive with TREC benchmarks.
• Precision and recall can be improved by leveraging additional components of the signatures.
2. Cancedda et al., "Kernel Methods for Document Filtering", in NIST Special Publication 500-251: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), Gaithersburg, MD, 2002.
Topic Performance

[Chart: Recall vs. T11F (FBeta) across the 50 topics]
[Chart: Precision vs. T11F (FBeta) across the 50 topics]
Apollo Filtering Performance
• Apollo training time was linear in the number and size of the training documents (num training docs vs. avg. training time).
• On average, the filtering time per document was constant (avg. test time per document).
[Charts, per topic (1–49): number of training documents (0–160); average training time (0–4000 ms); average test time per document (0–0.5 ms)]