Noisy Text Analytics: An Exercise in Futility?
Rohini Srihari
Janya, Inc.
www.janyainc.com
8 January 2007
Overview: Noisy Text Analytics
• All Text is Noisy!
– Does not fit shrink-wrapped processing; adaptation is necessary
• Business and national security interests in processing:
– Open source data (e.g. web pages)
– Consumer generated media
(Blogs, newsgroups, chat, text messaging, etc.)
• Key is to identify analysis requirements clearly
– Not necessary to understand everything
Challenging Problems
• Mixed modalities
– Structured and unstructured; free text cannot be processed in a vacuum; need to correlate information from different sections
– Text with images, figures
• Improve within-document and cross-document information consolidation
• World models for discourse processing
– Need to bring in more context; relate text analytics to Semantic Web activities (DAML/OWL)
– Dynamic use of online resources
• Adaptive text analytics
– Extraction requirements are constantly changing, and so is the data!
– Corpus-based learning
• Flexible architectures
– Integrating additional preprocessing, handling streaming data, etc.
USMTF Document Structure
OPER/BRAVE CHILD//
MSGID/BDAREP PHASE2/NMJIC/F-0005//
BDAREPID/BEN:1111-22222/REPCOUNT:1//
ICOD/011630ZJAN2002//
BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT
CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS,
INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION
EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON
THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//
Sample Document
• Sets
• Fields
• Free-text field (the GENTEXT/PURPOSE set)
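As the sample message suggests, each set ends with `//` and its fields are separated by `/`. The following is a minimal sketch of splitting such a message into sets and fields, not a full USMTF parser. The function name `parse_sets` and the `free_text` flag are illustrative assumptions, and the sketch assumes that free text contains no `/` characters.

```python
from typing import Dict, List


def parse_sets(message: str) -> List[Dict[str, object]]:
    """Split a USMTF-style message into sets and their fields (illustrative only)."""
    # Undo line wrapping, then split on the end-of-set marker "//".
    flat = " ".join(line.strip() for line in message.splitlines())
    parsed = []
    for chunk in flat.split("//"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The first "/"-separated item is the set identifier; the rest are fields.
        set_id, *fields = [part.strip() for part in chunk.split("/")]
        parsed.append({
            "id": set_id,
            "fields": fields,
            # Assumption: GENTEXT sets carry the free-text narrative.
            "free_text": set_id == "GENTEXT",
        })
    return parsed


# Abridged copy of the sample message, used only to exercise the sketch.
sample = """OPER/BRAVE CHILD//
MSGID/BDAREP PHASE2/NMJIC/F-0005//
ICOD/011630ZJAN2002//
GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT
CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS.//"""

parsed = parse_sets(sample)
for s in parsed:
    print(s["id"], len(s["fields"]), s["free_text"])
```

Keeping the free-text flag on the set lets a downstream component route GENTEXT fields to text analytics while treating the remaining fields as structured data.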
Sample Document
Entity Description/Name Field
TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0
//
ELEMDMG/PHYDMG:SVR/CONF:CONF/FUNCDMG:DES/STCHG:Y/MINRECUP:3MON
/MAXRECUP:6MON//
GENTEXT/DAMAGE NARRATIVE/ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2
OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS
FUNCTIONALLY DESTROYED. EXTENSIVE SMOKE FROM INTERNAL FIRES IS
CLEARLY VISABLE. NUMEROUS FIRE TRUCKS ARE IN THE FACILITY. COCKPIT
VIDEO CONFIRMS FOUR WEAPONS IMPACTING, WITH AT LEAST ONE PENETRATING
TO THE BASEMENT OF THE BUILDING. ESTIMATE BIG COUNTRY WILL REQUIRE
SIGNIFICANT TIME, AND PROBABLE FOREIGN TECHNICAL ASSISTANCE TO
RECONSTITUTE C2 EQUIPMENT//
Sample Document
Reference to Structured Sets from Free Text
• The GENTEXT narrative refers back to the structured sets shown above: for example, "THE C2 OPERATIONS BUILDING" matches TGTEL:C2 OPERATIONS BLDG, "SEVERE INTERNAL DAMAGE" matches PHYDMG:SVR, and "FUNCTIONALLY DESTROYED" matches FUNCDMG:DES.
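One way to correlate the free-text narrative with the structured sets is to expand abbreviations in the field values and look for them in the narrative. The sketch below is a simplified illustration under that assumption; the `ABBREVIATIONS` table and the function names are hypothetical, and a real system would need far richer matching.

```python
from typing import Dict

# Hypothetical abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {"BLDG": "BUILDING"}


def expand(value: str) -> str:
    """Replace known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in value.split())


def find_field_mentions(fields: Dict[str, str], narrative: str) -> Dict[str, str]:
    """Return the fields whose expanded value occurs verbatim in the narrative."""
    text = " ".join(narrative.split())  # undo line wrapping
    hits = {}
    for name, value in fields.items():
        expanded = expand(value)
        if expanded in text:
            hits[name] = expanded
    return hits


# Field values and narrative taken from the sample message.
fields = {"TGTEL": "C2 OPERATIONS BLDG", "TMPAGE": "G3"}
narrative = """ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2
OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS
FUNCTIONALLY DESTROYED."""

hits = find_field_mentions(fields, narrative)
print(hits)
```

Exact string matching is the simplest possible choice here; it would miss paraphrases, which is part of why free text cannot be processed in isolation from the structured sets.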
Cross-Document Entity Profile
Corpus-Based Learning
• Training phase requires four inputs
– Document repository (unlabeled training data)
– Config file1 for DTL Context (how to create unlabeled training data)
– Seed file (how to label a small amount of unlabeled training data)
– Config file2 for Learning Tool
   • How to learn a model
   • How to use the learned model in Semantex
[Diagram components: Document Repository, Config File1, DTL Context, Training Data, Seed File, Learning Tool, Trainer, Learned Model, Config File2]
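The training flow described above can be illustrated with a toy sketch: contexts are generated from an unlabeled repository, a seed list labels a few of them, and a learner is trained on the labeled subset. This does not use Semantex or DTL, whose interfaces are not shown in the slides. All function names, the window size, and the toy data are assumptions, and a naive Bayes classifier stands in for whatever learner the second config file would select.

```python
import math
from collections import Counter, defaultdict


def make_instances(sentences, window=2):
    """Stand-in for the context-creation step: pair each token with nearby words."""
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            yield tok, toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]


def train(instances, seeds):
    """Label instances whose target word is in the seed list, then count features."""
    label_counts, feat_counts = Counter(), defaultdict(Counter)
    for tok, ctx in instances:
        label = seeds.get(tok)
        if label is not None:
            label_counts[label] += 1
            feat_counts[label].update(ctx)
    return label_counts, feat_counts


def classify(ctx, model):
    """Naive Bayes with add-one smoothing over context words."""
    label_counts, feat_counts = model
    vocab = {f for counts in feat_counts.values() for f in counts}
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for f in ctx:
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


# Toy repository and seed list, invented for illustration.
repository = [
    "the attack on the village",
    "the bombing of the bridge",
    "the table in the kitchen",
    "the chair in the office",
]
seeds = {"attack": "event", "bombing": "event", "table": "nonevent", "chair": "nonevent"}

model = train(make_instances(repository), seeds)
print(classify(["the", "on", "the"], model))
print(classify(["the", "in", "the"], model))
```

Separating context creation from learning mirrors the two config files: the first step can change what counts as a feature without touching the learner, and vice versa.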
Versatility of learning tool applied to different tasks
• Example: Nominal Event Classifier
– Seed file: 95 unambiguous event nominals, 295 unambiguous nonevent nominals
– Repository: News texts processed by Semantex
– Config file (DTL): Look at features surrounding nouns
– Config file (LearningTool): Learn using a mixture model
• Example: Disease Outbreak Classifier
– Seed file: 10 verb types representative of disease outbreak
– Repository: Medical reports processed by Semantex
– Config file (DTL): Look at features surrounding verbs
– Config file (LearningTool): Learn using distributional similarity
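To make the idea of learning by distributional similarity concrete, the sketch below ranks candidate verbs by how closely their surrounding words resemble those of a seed verb, using cosine similarity over context counts. It is only an illustration of the general technique, not the learning tool itself; the corpus, the seed verb, and the window size are invented for the example.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words that occur within +/- window tokens of the target."""
    vec = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok == target:
                vec.update(toks[max(0, i - window):i] + toks[i + 1:i + 1 + window])
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration.
corpus = [
    "cholera infected many villagers last month",
    "measles sickened many villagers last week",
    "officials visited many hospitals this week",
]

seed = context_vector("infected", corpus)  # "infected" plays the role of a seed verb
sim_sickened = cosine(seed, context_vector("sickened", corpus))
sim_visited = cosine(seed, context_vector("visited", corpus))
print(sim_sickened > sim_visited)
```

Verbs whose similarity to the seed verbs exceeds some threshold could then be added as new outbreak indicators; choosing that threshold is a tuning decision not covered here.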
Example: Name Disambiguation
• Are two instances of Tom Smith the same individual?
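A very simple way to approach this question is to compare the context in which each mention appears. The sketch below uses Jaccard overlap between sets of context terms. It is a hedged illustration only: the slides do not say which method is actually used, and the context sets, the function names, and the 0.3 threshold are all assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_individual(ctx_a: set, ctx_b: set, threshold: float = 0.3) -> bool:
    """Guess that two mentions co-refer when their contexts overlap enough."""
    # The threshold is arbitrary and would need tuning on labeled data.
    return jaccard(ctx_a, ctx_b) >= threshold


# Hypothetical context terms collected around three mentions of "Tom Smith".
mention_1 = {"buffalo", "professor", "linguistics"}
mention_2 = {"buffalo", "professor", "university"}
mention_3 = {"dallas", "quarterback", "football"}

print(same_individual(mention_1, mention_2))
print(same_individual(mention_1, mention_3))
```

In practice the context sets would come from extracted entity profiles (affiliations, locations, associated people) rather than raw words, which is where cross-document consolidation becomes relevant.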
Conclusions
• Dealing with noisy text is not a futile exercise!
– Commercial applications are already available
– Need to specify analysis requirements clearly
– Adapt IE technology appropriately