Ads tutorial

Some Commercial Text Mining Systems
Xuanhui Wang
UIUC
March 29th, 2007
Why Text Mining?
• A large portion of all available information today
exists in the form of unstructured texts (information
overload).
– Books, magazine articles, research papers, product
manuals, memorandums, e-mails, and of course the Web,
all contain textual information in the natural language form.
• A lot of critical information is in the textual format
– The voice of customers -- customer email, customer
complaints
– Product reviews
• Thus, making correct decisions often requires
analyzing large volumes of textual information –
Business Intelligence
Text Mining (From Wikipedia)
• Refers generally to the process of deriving high
quality information from text.
• High quality information is typically derived through
the divining of patterns and trends through means
such as statistical pattern learning.
• Process
– structuring the input text
– deriving patterns within the structured data
– finally evaluation and interpretation of the output
• Tasks
– text categorization, text clustering, concept/entity
extraction, sentiment analysis, document summarization,
and entity relation modeling
Structuring the input text → Information
Extraction
• Named Entity recognition (NE)
– Finds and classifies names, places, etc.
• Coreference resolution (CO)
– Identifies identity relations between entities.
• Template Element construction (TE)
– Adds descriptive information to NE results (using CO).
• Template Relation construction (TR)
– Finds relations between TE entities.
• Scenario Template production (ST)
– Fits TE and TR results into specified event scenarios
Dummy Example
“The shiny red rocket was fired on Tuesday. It is the
brainchild of Dr. Big Head. Dr. Head is a staff
scientist at We Build Rockets Inc.”
• NE discovers that the entities present are the
rocket, Tuesday, Dr. Head and We Build Rockets
Inc.
• CO discovers that it refers to the rocket.
• TE discovers that the rocket is shiny red and that it
is Head’s brainchild.
• TR discovers that Dr. Head works for We Build
Rockets Inc.
• ST discovers that there was a rocket launching
event in which the various entities were involved.
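A minimal sketch of the NE, CO and TR steps on the dummy example, assuming a hand-written gazetteer and two toy heuristics. The entity types, the function names and the "works_for" label are illustrative assumptions, not part of any system described in these slides.

```python
import re

TEXT = ("The shiny red rocket was fired on Tuesday. It is the "
        "brainchild of Dr. Big Head. Dr. Head is a staff "
        "scientist at We Build Rockets Inc.")

# Hypothetical gazetteer: surface form -> entity type (assumed labels).
GAZETTEER = {
    "rocket": "ARTIFACT",
    "Tuesday": "DATE",
    "Dr. Big Head": "PERSON",
    "Dr. Head": "PERSON",
    "We Build Rockets Inc.": "ORGANIZATION",
}


def named_entities(text):
    """NE step: return (start, surface, type) for each gazetteer hit, in text order."""
    hits = []
    for surface, etype in GAZETTEER.items():
        for match in re.finditer(re.escape(surface), text):
            hits.append((match.start(), surface, etype))
    return sorted(hits)


def resolve_pronoun(text, entities, pronoun="It"):
    """CO step (toy heuristic): link the pronoun to the closest preceding ARTIFACT."""
    match = re.search(r"\b%s\b" % pronoun, text)
    if match is None:
        return None
    candidates = [e for e in entities
                  if e[0] < match.start() and e[2] == "ARTIFACT"]
    return candidates[-1][1] if candidates else None


def works_for(text, entities):
    """TR step (toy pattern): '<PERSON> is a ... at <ORGANIZATION>'."""
    people = [s for _, s, t in entities if t == "PERSON"]
    orgs = [s for _, s, t in entities if t == "ORGANIZATION"]
    relations = []
    for person in people:
        for org in orgs:
            pattern = re.escape(person) + r" is an? [\w ]+ at " + re.escape(org)
            if re.search(pattern, text):
                relations.append((person, "works_for", org))
    return relations


entities = named_entities(TEXT)
print([surface for _, surface, _ in entities])
print(resolve_pronoun(TEXT, entities))
print(works_for(TEXT, entities))
```

The TE and ST steps are left out; they would attach descriptions such as "shiny red" to the rocket and fit the results into a launch-event template.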
Some Systems
• Attensity
• Inxight
• Anderson
• ClearForest
• TextAnalyst
• Linguamatics
Attensity
• http://www.attensity.com/
• Founded in early 2000
• Culmination of over a decade of research in
computational linguistics at the University of Utah
• The technology allows users to extract and analyze
facts like who, what, where, when and why
• Allows users to drill down to understand people,
places and events and how they are related
• It then creates output in XML and in a structured
relational data format that is fused with existing
structured data
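The idea of fusing extracted facts with existing structured data can be sketched as follows. This is only an illustration using Python's standard sqlite3 module: the technician notes, the "Replaced <part>" rule, and the table and column names are all made-up assumptions and do not reflect Attensity's actual output format.

```python
import re
import sqlite3

# Hypothetical free-text technician notes keyed by a vehicle ID.
NOTES = [
    ("V100", "Customer reports brake noise. Replaced brake pads."),
    ("V200", "Battery drained overnight. Replaced battery."),
    ("V300", "Replaced brake pads after squeal."),
]

# Toy extraction rule: "Replaced <part>" yields a (vehicle, action, part) fact.
REPLACED = re.compile(r"Replaced ([a-z ]+?)(?: after|\.)")


def extract_facts(notes):
    """Turn unstructured notes into relational rows."""
    facts = []
    for vehicle_id, text in notes:
        for part in REPLACED.findall(text):
            facts.append((vehicle_id, "replaced", part))
    return facts


def build_database(facts):
    """Load extracted facts next to an existing (assumed) structured table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vehicles (vehicle_id TEXT PRIMARY KEY, model TEXT)")
    conn.execute("CREATE TABLE facts (vehicle_id TEXT, action TEXT, part TEXT)")
    conn.executemany("INSERT INTO vehicles VALUES (?, ?)",
                     [("V100", "Sedan"), ("V200", "Truck"), ("V300", "Sedan")])
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", facts)
    return conn


def parts_replaced_by_model(conn):
    """Join text-derived facts with structured vehicle data."""
    query = """
        SELECT v.model, f.part, COUNT(*)
        FROM facts AS f JOIN vehicles AS v ON v.vehicle_id = f.vehicle_id
        WHERE f.action = 'replaced'
        GROUP BY v.model, f.part
        ORDER BY v.model, f.part
    """
    return conn.execute(query).fetchall()


conn = build_database(extract_facts(NOTES))
print(parts_replaced_by_model(conn))
```

Once the facts sit in a relational table, ordinary SQL aggregation answers questions that the raw notes could not.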
Architecture
Attensity: Information Extraction Engine
• The foundation of all the applications
• Target extraction
– When you know what you are looking for
– Entity and event definitions
– Creating rules and dictionaries specific to your particular
domain
– Graphical user interface that allows users to rapidly create
definitions
• Exhaustive extraction
– When you are trying to understand what is in your text and
you don't exactly know what you are looking for
Attensity: Applications
• Discovery
– Mining relations: uncover who, what, where, when, and why
• Analytics
– Support users to drill down
– Visualization tools to slice, dice and analyze important facts
– Aggregations of facts
• Text search
– Allow approximate matching of query words
– Seamlessly combined with the text analysis
• Classify
– Enable users to define document groups
• Alert
– Provide timely visibility to frequent and emerging issues
– Product problems, trigger emails or notifications
• http://www.attensity.com/www/products/applications.php
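Approximate matching of query words, as mentioned under text search, might look like the sketch below. It uses `difflib.get_close_matches` from the Python standard library as a stand-in similarity measure; the sample documents, the function name and the 0.8 cutoff are assumptions for illustration, not Attensity's method.

```python
import difflib
import re


def approximate_search(query, documents, cutoff=0.8):
    """Return indices of documents where every query word has a close match."""
    results = []
    for index, doc in enumerate(documents):
        vocabulary = sorted(set(re.findall(r"[a-z]+", doc.lower())))
        if all(difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
               for word in query.lower().split()):
            results.append(index)
    return results


# Hypothetical documents; the third one contains misspellings.
DOCUMENTS = [
    "brake pedal failure reported",
    "customer praised the display",
    "brakes pedal falure after rain",
]

print(approximate_search("brake failure", DOCUMENTS))
```

Lowering the cutoff tolerates more spelling variation at the cost of more false matches.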
Examples Using Attensity
• Attensity boasts customers within Global 2000 organizations
as well as government agencies
• Warranty Improvement
– reviewing warranty data contained in unstructured, text-based
sources such as technician reports, customer surveys and dealer
provided information (reduce warranty cost)
• Understand Voice of the Customer
– analyze both structured and unstructured data to detect product
problems and gauge customer satisfaction
• Government Intelligence
– identify suspicious activities and relationships, detecting threats
to improve homeland security and monitoring of the Internet to
uncover illegal activities
– improve the reliability and supportability of a variety of military
vehicles, weapons and components, by converting unstructured
data from service notes and repair logs into relational tables
Inxight
• http://www.inxight.com/
• Founded in 1997
• Spun out from Xerox PARC
• Based on 25+ years of research at Xerox PARC
• Inxight’s ability to “read” text in more than 30
languages
• Inxight takes information search, retrieval and
analysis to an entirely new level.
Components
• Federated & Desktop Search
– Support hundreds of high-value information sources through a
single, user-friendly interface.
– Search results are automatically clustered on-the-fly by
extracting and analyzing the most relevant people, places and
events
– Provide alert functionality of new information (Be alerted when
competitors' websites change, monitor a single web page to know
the change of a product’s price).
– Support different types of search functionalities ("More Like This"
Searching)
– Includes a Google Desktop search extender.
• Text Analysis
– Extracting the "who," "what," "where" and "when" in each
document. (more than 35 types of information)
– Automated entity, concept, event and relation extraction,
categorization and summarization
Components Cont’d
• Data Cleansing
– Human experts can review to clean the extracted data
• Visualization
– Relationship → StarTree
– Trend → TableLens
– Timeline → TimeWall
– Several demos: http://www.inxight.com/products/vizserver/
Examples Using Inxight
• Customers: More than 350 Global 2000 customers
• Financial Data Analysis
• Crime Analysis
• Pharmaceutical Research
Anderson
• Designed especially for customer behavior
• Market Research
– Collecting external business information (from customer,
competitor, and the market)
– Qualitative (answer the “why”) vs Quantitative (answer the
“how much/many”)
– Hybrid
• Business Intelligence
– Collecting and analyzing internal business information
– Focus on business transactions and communications
– Sales data, supply logs, financial records
ClearForest
• http://www.clearforest.com/
• Tagging Engine
– Information extraction
– Document categorization
• Analytics
– Improve Early Warning Visibility: Include text-based information
to better assess and trigger organizational responses.
– Discover Insights: Identify trends, patterns, and complex
inter-document relationships within large text collections.
– Create Links with Structured Data: Incorporation enhances
quality of business intelligence by forging links not previously
possible.
– Become an Expert: Rapidly comprehend and synthesize complex
issues before making key decisions
• See the simple demo
– Automatically identify the people, companies, organizations,
geographies and products on the web page
TextAnalyst
• Based on semantic network
– a list of the most important words from the text and
relations between them
• Functionalities
– Textbase Navigation: concepts in the semantic network are
connected to sentences, then to documents.
– Topic Structure: transform semantic network to tree-like
list of nested topics
– Clustering: eliminating those weak links in the topic
structure
– Summarization: using semantic network to score
sentences.
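The semantic-network idea can be illustrated with a toy sketch: words become weighted nodes, co-occurrence within a sentence becomes a weighted link, and sentences are scored by the nodes and links they contain. The stopword list, the scoring formula and all names are assumptions made for this example and are not TextAnalyst's actual algorithm.

```python
import re
from collections import Counter
from itertools import combinations

# Small assumed stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to", "was", "it", "on"}


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def build_network(sentences):
    """Nodes: words weighted by sentence frequency. Edges: co-occurrence counts."""
    nodes, edges = Counter(), Counter()
    for sentence in sentences:
        words = sorted(set(content_words(sentence)))
        nodes.update(words)
        edges.update(combinations(words, 2))
    return nodes, edges


def score_sentence(sentence, nodes, edges):
    """Score = sum of node weights plus sum of link weights inside the sentence."""
    words = sorted(set(content_words(sentence)))
    return (sum(nodes[w] for w in words)
            + sum(edges[pair] for pair in combinations(words, 2)))


def summarize(text, k=1):
    """Keep the k best-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    nodes, edges = build_network(sentences)
    ranked = sorted(sentences, key=lambda s: score_sentence(s, nodes, edges),
                    reverse=True)
    top = set(ranked[:k])
    return [s for s in sentences if s in top]


SAMPLE = ("Text mining extracts facts from text. "
          "Mining text needs extraction rules. "
          "The weather was nice.")
print(summarize(SAMPLE, k=2))
```

This scoring favors long sentences; a real system would normalize by length and prune weak links first, which corresponds to the clustering step listed above.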
Linguamatics
• Interactive information extraction (I2E)
– Powerful queries (John Smith is the chairman of which
company? )
– Graphical interface
– Structured output
– http://www.linguamatics.com/technology/ie/search_results
.html
• Can take existing ontologies
– Synonyms and Canonicalisation
– Class information: providing sub- and super-classes (In the
Life Science domain, relationships between protein
families can point to potential relationships between
specific proteins.)
– Balancing precision and recall: by moving up/down
hierarchy
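How an ontology lets a query trade precision against recall can be sketched with a toy class hierarchy. The ontology fragment, the synonym table and the documents below are small hypothetical examples; the functions are illustrative and not the I2E query language.

```python
# Hypothetical toy ontology: class -> direct sub-classes.
ONTOLOGY = {
    "protein": ["kinase", "phosphatase"],
    "kinase": ["MAPK1", "CDK2"],
    "phosphatase": ["PTEN"],
}

# Synonym -> canonical name (canonicalisation step).
SYNONYMS = {"ERK2": "MAPK1"}


def canonical(term):
    return SYNONYMS.get(term, term)


def expand(term):
    """Return the term together with all of its descendants in the ontology."""
    terms = [term]
    for child in ONTOLOGY.get(term, []):
        terms.extend(expand(child))
    return terms


def search(query_term, documents):
    """Match documents mentioning the query class or any of its sub-classes."""
    wanted = set(expand(canonical(query_term)))
    hits = []
    for index, doc in enumerate(documents):
        tokens = {canonical(tok) for tok in doc.replace(".", "").split()}
        if wanted & tokens:
            hits.append(index)
    return hits


DOCUMENTS = [
    "ERK2 phosphorylates its substrate.",
    "PTEN loss is common in tumours.",
    "CDK2 regulates the cell cycle.",
    "The assay used a generic kinase inhibitor.",
]

print(search("MAPK1", DOCUMENTS))    # narrow query: higher precision
print(search("kinase", DOCUMENTS))   # one level up: more recall
print(search("protein", DOCUMENTS))  # top of the hierarchy: highest recall
```

Moving the query term up the hierarchy widens the set of matching terms, and moving down narrows it.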
Commonalities
• Information extraction is very important for
commercial text mining systems
• Consider and combine both structured and
unstructured data for analysis
• Alerts are considered very important
• Search and mining are highly integrated
An IE Toolkit: GATE
• General Architecture for Text Engineering
– Developed at the University of Sheffield since 1995
– More than 10 years old
– Free open source software
– Implemented in Java
– Used in language analysis contexts including Information
Extraction in English, Greek, Spanish, Swedish, German,
Italian and French
– Easily pluggable and used in a lot of other projects
– Also provides an interface as a standalone application
– Pretty slow and memory consuming
IE in GATE
• Named as ANNIE: a Nearly-New Information
Extraction System (Show the pdf file for some
examples)
• Tokeniser
• Gazetteer
• Sentence Splitter
• Part of Speech Tagger
• Semantic Tagger
• Orthographic Coreference (OrthoMatcher)
• Pronominal Coreference
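To give a feel for how such stages chain together, here is a toy Python sketch of four of them (tokeniser, sentence splitter, gazetteer, semantic tagger) plus a simple orthographic matcher. It is not GATE code and does not use the GATE API; the gazetteer entries, the tagging rule and all function names are assumptions for illustration, and the part-of-speech and pronominal stages are omitted.

```python
import re

# Hypothetical gazetteer entries.
GAZETTEER = {"John": "first_name", "Acme": "organization"}


def tokenise(text):
    """Tokeniser: split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(tokens):
    """Sentence splitter: close a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def gazetteer_lookup(tokens):
    """Gazetteer: mark tokens that appear in the lookup list."""
    return [(i, GAZETTEER[t]) for i, t in enumerate(tokens) if t in GAZETTEER]


def semantic_tag(tokens, lookups):
    """Semantic tagger (toy rule): first name + capitalised word -> Person."""
    entities = []
    for i, kind in lookups:
        if kind == "first_name" and i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
            entities.append({"type": "Person", "start": i, "end": i + 2,
                             "text": " ".join(tokens[i:i + 2])})
        elif kind == "organization":
            entities.append({"type": "Organization", "start": i, "end": i + 1,
                             "text": tokens[i]})
    return entities


def ortho_match(tokens, entities):
    """Orthographic coreference (toy): a later bare surname refers to the full name."""
    links = []
    for entity in entities:
        if entity["type"] != "Person":
            continue
        surname = entity["text"].split()[-1]
        for i, token in enumerate(tokens):
            if token == surname and i >= entity["end"]:
                links.append((i, entity["text"]))
    return links


TEXT = "John Smith joined Acme. Smith now leads research."
tokens = tokenise(TEXT)
entities = semantic_tag(tokens, gazetteer_lookup(tokens))
print(len(split_sentences(tokens)))
print([(e["type"], e["text"]) for e in entities])
print(ortho_match(tokens, entities))
```

Each stage only reads the output of earlier stages, which is what makes a pipeline of this kind easy to extend with new components.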
Thanks