Content Extraction in Majordome

• Overall objective: quick detection of short information elements for message filtering and reporting to the user
• Functional position of this processing phase:
– Server-side, event-oriented background task
– Subsequent and/or parallel to speech recognition (voice messages) or image processing (faxes); prior to text summarization
Useful applications (1)
• Name/Date/Subject identification (especially useful for fax and voice messages, which have no standardized fields for storing this information)
– “You have 1 fax message from Mrs Diaconu about ‘attending the Barcelona meeting’…”
• Backup information: user’s address book (PABX info yields the sender’s phone number)
Useful applications (2)
• Message filtering:
– “You have received 14 personal E-mail messages, among which 3 messages from friends, 6 requests from students or colleagues, and 5 spam messages; you have received 26 mailing list messages, among which 3 calls for papers, 11 conference announcements, and 12 others.”
• Backup information: RFC-822 “From” and “Subject” fields.
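
As an illustration of the backup information mentioned above, here is a minimal sketch of reading the RFC-822 “From” and “Subject” fields with Python's standard `email` package. The sample message, the e-mail addresses, and the `backup_fields` helper are assumptions made for this example; the slides do not say how Majordome reads these fields.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample message; names and addresses are invented for illustration.
RAW_MESSAGE = """From: Ana Diaconu <ana@example.org>
To: user@example.org
Subject: Call for papers: Barcelona workshop

Message body goes here.
"""


def backup_fields(raw_message: str) -> dict:
    """Return the sender name, sender address and subject of a raw message."""
    msg = message_from_string(raw_message)
    # parseaddr splits 'Display Name <address>' into its two parts.
    name, address = parseaddr(msg.get("From", ""))
    return {
        "sender_name": name,
        "sender_address": address,
        "subject": msg.get("Subject", ""),
    }


fields = backup_fields(RAW_MESSAGE)
print(fields)
```

These fields only serve as a fallback or cross-check; the filtering itself would still rely on the content of the message.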
Techniques (1)
• Text statistics measures:
– Frequency of occurrence of certain words, morphological categories, or syntactic structures in different types of messages
E.g. the noun/verb frequency ratio is higher in technical texts; some style markers are specific to certain text genres (e.g. frequent use of ‘!’ or ‘$’ in advertisements; ‘loose style’ abbreviations like ‘CU’, ‘IMHO’ in English, or ‘A+’ in French)
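
The style-marker idea above can be sketched as a simple frequency count. This is only an illustration: the marker sets, the tokenizer, and the `style_profile` function are assumptions for this example, and the noun/verb ratio is left out because it would require a part-of-speech tagger.

```python
import re
from collections import Counter

# Marker inventories taken from the examples on the slide; real lists would be larger.
PUNCTUATION_MARKERS = {"!", "$"}
LOOSE_ABBREVIATIONS = {"cu", "imho", "a+"}


def style_profile(text: str) -> dict:
    """Compute the share of tokens that are style markers or loose abbreviations."""
    # A token is a run of letters (optionally followed by '+', to catch 'A+')
    # or a single '!' or '$' character.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+\+?|[!$]", text)]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    marker_hits = sum(counts[m] for m in PUNCTUATION_MARKERS)
    abbreviation_hits = sum(counts[a] for a in LOOSE_ABBREVIATIONS)
    return {
        "tokens": len(tokens),
        "marker_ratio": marker_hits / total,
        "abbreviation_ratio": abbreviation_hits / total,
    }


print(style_profile("Win $$$ now! Free!!!"))
print(style_profile("IMHO it is fine, CU"))
```

A high `marker_ratio` would hint at an advertisement and a high `abbreviation_ratio` at informal personal mail; any decision thresholds would have to be tuned on real messages.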
Techniques (2)
• Text skimming:
– Spotting “good candidates” for specific word types (e.g. proper names): selecting capitalized words…
– … comparing with entries in a database of common first names and family names, and/or…
– … using local grammars to disambiguate other cases.
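
The three skimming steps above can be sketched as follows. The name lists, the title list, and the single “title + capitalized word” rule are toy assumptions standing in for the real name database and local grammars, which the slides do not describe in detail.

```python
import re

# Toy stand-ins for the first-name / family-name database (assumed content).
FIRST_NAMES = {"ana", "john", "maria"}
FAMILY_NAMES = {"diaconu", "smith"}
# Toy local grammar: a capitalized word right after a title is taken as a name.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def name_candidates(text: str) -> list:
    """Return (word, evidence) pairs for capitalized words that look like names."""
    tokens = re.findall(r"[A-Za-z]+", text)
    results = []
    for i, token in enumerate(tokens):
        if not token[0].isupper():
            continue  # step 1: only capitalized words are candidates
        lowered = token.lower()
        if lowered in TITLES:
            continue  # the title itself is not a name
        previous = tokens[i - 1].lower() if i > 0 else ""
        if lowered in FIRST_NAMES or lowered in FAMILY_NAMES:
            results.append((token, "database"))  # step 2: database lookup
        elif previous in TITLES:
            results.append((token, "grammar"))  # step 3: local grammar rule
    return results


sample = "You have a fax from Mrs Popescu and a note from Ana about Barcelona."
print(name_candidates(sample))
```

Capitalized words that match neither the database nor the rule (here “You” and “Barcelona”) stay unclassified, which is where further clues would be needed.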
Techniques (3)
• Merging visual clues and textual clues for mutual reinforcement of identification probability.
E.g. the probability that an unidentified capitalized character string is the proper name of a fax’s sender increases if it stands alone on a line at the top of the image.
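
One possible way to express this mutual reinforcement, shown here only as a sketch, is to multiply prior odds by one likelihood ratio per observed clue. This assumes the clues are conditionally independent, and the prior and the ratios below are invented numbers; the slides do not state which combination rule is actually used.

```python
def combine_clues(prior: float, likelihood_ratios) -> float:
    """Update a prior probability with independent clues given as likelihood ratios.

    A ratio above 1 supports the hypothesis, a ratio below 1 weakens it.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Hypothetical figures: a capitalized unknown string is the sender's name
# with probability 0.2 before any layout information is considered.
PRIOR = 0.2
ALONE_ON_LINE = 4.0  # assumed ratio for "stands alone on a line"
TOP_OF_PAGE = 3.0    # assumed ratio for "located at the top of the image"

textual_only = combine_clues(PRIOR, [])
with_one_visual_clue = combine_clues(PRIOR, [ALONE_ON_LINE])
with_both_visual_clues = combine_clues(PRIOR, [ALONE_ON_LINE, TOP_OF_PAGE])
print(textual_only, with_one_visual_clue, with_both_visual_clues)
```

With these assumed values the probability rises from 0.2 to 0.5 with one visual clue and to 0.75 with both.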
Content Extraction: Current Developments
• Toolbox for text statistics (word frequency, contextual windows, co-occurrence frequency…)
• Tool for determining fuzzy membership in a given class of words
• Tool for determining document language and segmenting multilingual documents
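
To make the text statistics toolbox more concrete, here is a minimal sketch of word frequency and co-occurrence counting within a contextual window. The `cooccurrence` function, its tokenizer, and the default window size are assumptions for this example, not the interface of the actual toolbox.

```python
import re
from collections import Counter


def cooccurrence(text: str, window: int = 2):
    """Return (word frequencies, pair frequencies) for a text.

    Two words co-occur when the second appears at most `window` tokens
    after the first; pairs are stored in alphabetical order.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    frequencies = Counter(tokens)
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens to the right of position i.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return frequencies, pairs


freq, pairs = cooccurrence("call for papers call for participation")
print(freq.most_common(2))
print(pairs.most_common(3))
```

Pairs with high counts, such as “call” and “for” here, are the kind of evidence that could later feed the categorization of mailing list messages.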
Content Extraction: Future Developments
• Text categorization module for message sorting and filtering
• Text genre database with (user-controlled) learning capabilities