Kelly's slides on text mining


Information Retrieval course:
Information Management Technologies
Kalliopi Zervanou
Overview
- The need for information processing
- Structured vs. unstructured data (text)
- The challenges of text
- Textual information processing technologies

The need for info processing

Large amounts of data in electronic form

Need for large scale & fast info processing

Most information to be found in text
Types of Data

Structured data

Semi-structured data

Unstructured, free-text data
Structured Data: e.g. Databases
Title:      Introduction to Information Retrieval
Author:     C.D. Manning, P. Raghavan, H. Schütze
Doc type:   Book
Publisher:  Cambridge University Press
Pub date:   2008
Id:         CM20B
Location:   Computer Science section
Keywords:   Information Retrieval, Indexing, …
Semi-Structured Data
(e.g. XML)
<?xml version="1.0" encoding="utf-8" ?>
<cmsbwsa_iisg_nl>
<bwsa>
<path> bios/bymholt.html </path>
<voornaam> Berend </voornaam>
<achternaam> Bymholt </achternaam>
<geboortejaar> 1864 </geboortejaar>
<geboortedatum>07-09</geboortedatum>
<sterfjaar>1947</sterfjaar>
<sterfdatum>05-27</sterfdatum>
<extrainfo> socialistisch en anarchistisch publicist en auteur van de
Geschiedenis der Arbeidersbeweging in Nederland</extrainfo>
<id>77</id>
</bwsa>
...
</cmsbwsa_iisg_nl>
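A minimal sketch of why semi-structured data is easier to process than free text: the tags already name each field, so a record can be read into a dictionary. It uses Python's standard `xml.etree.ElementTree`; the shortened `SAMPLE` string and the `records` helper are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# One record in the style of the XML sample above (shortened for illustration).
SAMPLE = """<cmsbwsa_iisg_nl>
  <bwsa>
    <path> bios/bymholt.html </path>
    <voornaam> Berend </voornaam>
    <achternaam> Bymholt </achternaam>
    <geboortejaar> 1864 </geboortejaar>
    <sterfjaar>1947</sterfjaar>
    <id>77</id>
  </bwsa>
</cmsbwsa_iisg_nl>"""


def records(xml_text):
    """Yield one dict per <bwsa> element, mapping tag name to stripped text."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("bwsa"):
        yield {child.tag: (child.text or "").strip() for child in entry}


people = list(records(SAMPLE))
print(people[0]["voornaam"], people[0]["achternaam"], people[0]["geboortejaar"])
```

Note that the values are still strings (e.g. the year), and the free-text `extrainfo` field of the full record would still need text processing.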
Free-Text/ Unstructured data
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a
slight decline in nine-month operating profit due to start-up losses related
to new businesses.
Europe's largest media group on Thursday said it still expects its 2011
operating profit to decline slightly year-on-year. It had cut its outlook in
August due to costs for new projects and rising energy prices.
Bertelsmann owns publishers Gruner + Jahr and Random House as well as
European TV broadcaster RTL Group and Arvato, an outsourcing service
provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to
1.03 billion euros ($1.4 billion) in the first nine months of 2011,
Bertelsmann said.
Data Mining


- analysis of structured data
- detection of unknown, interesting patterns:
  - groups of data records (cluster analysis)
  - unusual records (anomaly detection)
  - data dependencies (association rule mining)
Text Mining / Text Analytics
- analysis of text (semi-/unstructured data)
- detection of unknown, interesting information:
  - group documents (classification/clustering)
  - extract information (content descriptors, concepts of interest)
  - associate/link information (e.g. concept relations)
  - discover previously unknown facts
The challenges of text

Full text understanding beyond current technology

Human understanding based on context

Context: text, but also world knowledge

Text: ambiguity
(syntactic, semantic, lexical, pragmatic)
[Diagram: textual information processing technologies. Legend: Process, Resource.
UNSTRUCTURED DATA: Doc Collection.
Processes and outputs: IR (Relevant Docs); Summarisation or Abstracting (Important Info); IE (Relevant Info: NE …, EVENT …); (Indexing) ATR (Index Terms, Terminology); Data Mining; Reasoning, etc…
STRUCTURED DATA: Data Bases (Structured Info); Derived Info.
Resources: Thesauri, Lexicons, Ontologies, Gazetteers.]
IR: Select relevant documents
- Query: “query term”
- Relevant: documents containing the “term”
- Methods: indexing or Automatic Term Recognition
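The idea of "relevant = documents containing the term" can be sketched with a small inverted index. This is an illustrative toy, not the course's implementation: the three documents, the tokeniser, and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini collection.
docs = {
    "d1": "Bertelsmann posted a slight decline in operating profit",
    "d2": "Introduction to Information Retrieval",
    "d3": "Operating profit of the media group",
}
index = build_index(docs)
print(sorted(search(index, "operating profit")))
```

The index is built once; each query then only intersects the posting sets of its terms instead of scanning every document.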
Automatic Term Recognition
Objective: detect words or phrases denoting specialised concepts, i.e. terms
- supervised/unsupervised task
- Methods: rule-based, statistics-based, machine learning, hybrid

ATR: example
C-value     Candidate term
338.13958   trade union          [trade union, Trades Union,…]
213.127     ernst papanek        [Ernst Papanek]
200.55471   new york             [New York]
143.48147   press clipping       [Press clippings, press-clippings,…]
139.07053   world war            [world war, world wars, World Wars,…]
134.47055   print material       [printed materials, Printed material,…]
131.19386   executive committee  [executive committee, …]
124.91502   communist party      [Communist party,…]
94.48066    second world war     [Second World War, …]
91.18482    spanish civil war    [Spanish Civil War, …]
90.80228    great britain        [Great Britain, Great-Britain]
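The slides show C-value scores but not how they are computed. As a rough sketch, the commonly cited C-value definition multiplies log2 of the term length (in words) by the term frequency, after subtracting the average frequency of the longer candidate terms that contain it. The frequencies below are hypothetical and do not reproduce the scores in the table; the function names are also assumptions.

```python
import math


def contains(longer, shorter):
    """True if word list `shorter` occurs contiguously inside the longer list."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq):
    """Simplified C-value for multi-word candidate terms.

    freq maps a candidate term to its corpus frequency.
    """
    scores = {}
    for term, f in freq.items():
        words = term.split()
        # Frequencies of the longer candidates in which this term is nested.
        nesting = [f_b for other, f_b in freq.items()
                   if contains(other.split(), words)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(words)) * f
    return scores


# Hypothetical frequencies, for illustration only.
freq = {"world war": 120, "second world war": 60, "trade union": 150}
scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:10.3f}  {term}")
```

The subtraction keeps a short term such as "world war" from being credited for occurrences that really belong to a longer term such as "second world war".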
Document clustering
Objective: group documents based on their content / semantic similarities
- unsupervised task: the groups (“clusters”) and their categories are unknown
- machine learning and statistics-based approaches
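A toy illustration of content-based grouping without predefined categories: documents become bag-of-words vectors, and each document joins the first cluster whose seed document is similar enough by cosine similarity. The greedy single-pass strategy, the threshold of 0.3, and the four sample texts are assumptions chosen for brevity; practical systems typically use methods such as k-means or hierarchical clustering.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering; returns lists of document indices."""
    vectors = [vectorize(d) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            # Compare against the first (seed) document of the cluster.
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "trade union strike committee",
    "media group operating profit",
    "union committee calls strike",
    "operating profit of media group declines",
]
clusters = cluster(docs)
print(clusters)
```

No labels are supplied: the two groups emerge only from word overlap, and naming them is left to a human.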
Document classification
Objective: classify documents based on their content / semantics
- supervised task: we know the classes/categories
- use of machine learning or statistics-based methods
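In contrast to clustering, classification learns from documents whose category is already known. The sketch below is a small multinomial Naive Bayes classifier with add-one smoothing, one common statistics-based choice; the slides do not prescribe a specific method, and the training sentences and the labels "finance" and "labour" are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Count words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def classify(model, text):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for w in tokens(text):
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
model = train([
    ("operating profit declined at the media group", "finance"),
    ("sales and earnings rose this quarter", "finance"),
    ("the union called a strike", "labour"),
    ("workers and the trade union met the committee", "labour"),
])
print(classify(model, "profit and sales rose"))
print(classify(model, "the trade union strike"))
```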
Summarisation or Abstracting
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a
slight decline in nine-month operating profit due to start-up losses related
to new businesses.
Europe's largest media group on Thursday said it still expects its 2011
operating profit to decline slightly year-on-year. It had cut its outlook in
August due to costs for new projects and rising energy prices.
Bertelsmann owns publishers Gruner + Jahr and Random House as well as
European TV broadcaster RTL Group and Arvato, an outsourcing service
provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to
1.03 billion euros ($1.4 billion) in the first nine months of 2011,
Bertelsmann said.
Information Extraction
Objective: detect specific types of info in documents, e.g. names, events, relations
- supervised, or unsupervised/generic task
- Methods: rule-based, machine learning

IE tasks

- Named Entity (NE): recognise entities/concepts of interest, e.g. persons,
  organisations, dates & times
- Co-reference (CO): recognise mentions of the same entity
- Template Relation (TR) & Scenario Template (ST): recognise relations among
  concepts, e.g. concept properties & entities involved in facts & events
  of interest
IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose
35 percent to 215 million euros ($272.1 million) compared with 2005, and
sales were up 17.3 percent at 4.5 billion euros.
Entity types marked in the text: ORGANISATION, PERCENT, DATE, AMOUNT

ORGANISATION=“Bertelsmann”, DATE=“2011-11-10”:
Europe's largest media group on Thursday said it still expects its 2011
operating profit to decline slightly year-on-year.
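A rule-based flavour of named entity recognition can be sketched with a gazetteer lookup plus regular expressions. The one-entry gazetteer, the two patterns, and the decision about which spans count as PERCENT or AMOUNT are assumptions for illustration; the slide does not spell out its rules, and DATE is left out here.

```python
import re

# Hypothetical one-entry gazetteer: known name -> entity type.
GAZETTEER = {"Bertelsmann": "ORGANISATION"}

# Hypothetical surface patterns for two entity types.
PATTERNS = [
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?\s+percent")),
    ("AMOUNT", re.compile(r"\d+(?:\.\d+)?\s+(?:million|billion)\s+euros")),
]


def tag_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), label, m.group()))
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    return [(label, span) for _, label, span in sorted(found)]


sentence = (
    "Bertelsmann said operating earnings before interest and tax (EBIT) "
    "rose 35 percent to 215 million euros ($272.1 million) compared with "
    "2005, and sales were up 17.3 percent at 4.5 billion euros."
)
entities = tag_entities(sentence)
for label, span in entities:
    print(f"{label:13} {span}")
```

Rules like these are precise but brittle: the dollar figure is missed because the pattern only knows euros, which is one reason machine learning methods are also used.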
IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose
35 percent to 215 million euros ($272.1 million) compared with 2005, and
sales were up 17.3 percent at 4.5 billion euros.
Europe's largest media group on Thursday said it still expects its 2011
operating profit to decline slightly year-on-year.

Extracted template:
SALES_of
Event_type:         sales
Organisation_type:  Company
Organisation_name:  Bertelsmann
Sector:             media
Sales_mode:         increase
Sales_amount:       4.500.000.000
Currency:           euros
Period:             ??
Date:               ??
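Filling such a template amounts to mapping text spans onto named slots and leaving unknown slots empty. The sketch below fills a few of the slots from the example sentence with a single hand-written pattern; the `SalesEvent` class, the pattern, and the "decrease" value are hypothetical, and the Period and Date slots stay empty, mirroring the "??" above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SalesEvent:
    """A few slots of the sales template; unfilled slots stay None."""
    event_type: str
    organisation_name: str
    sales_mode: Optional[str] = None
    sales_amount: Optional[int] = None
    currency: Optional[str] = None
    period: Optional[str] = None
    date: Optional[str] = None


# Hypothetical pattern for phrases like "sales were up 17.3 percent at 4.5 billion euros".
SALES = re.compile(
    r"sales were (up|down) [\d.]+ percent at ([\d.]+) (million|billion) (euros)"
)
SCALE = {"million": 10**6, "billion": 10**9}


def extract_sales(text, organisation):
    """Fill a SalesEvent from the first matching sales phrase, if any."""
    m = SALES.search(text)
    if m is None:
        return None
    direction, number, scale, currency = m.groups()
    return SalesEvent(
        event_type="sales",
        organisation_name=organisation,
        sales_mode="increase" if direction == "up" else "decrease",
        sales_amount=round(float(number) * SCALE[scale]),
        currency=currency,
    )


text = ("Bertelsmann said operating earnings before interest and tax (EBIT) "
        "rose 35 percent to 215 million euros ($272.1 million) compared with "
        "2005, and sales were up 17.3 percent at 4.5 billion euros.")
event = extract_sales(text, "Bertelsmann")
print(event)
```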
Sentiment analysis/Opinion mining
- Polarity classification (positive/negative)
- Objectivity/Subjectivity detection

Structured Data: Ontologies

Structure of concepts:
- Entities (concepts, objects)
- Properties (concept properties)
- Relations (links between concepts)
- Domain specific relations, e.g., “has_capital”

Objective:
- describe domain knowledge and reason about concepts & relations
Einstein's riddle
Source: http://en.wikipedia.org/wiki/Zebra_puzzle

- we have five houses in a row
- each house is painted a different colour
- each house has a single inhabitant
- each inhabitant
  - is of a different nationality
  - drinks a different beverage
  - owns a different pet
  - smokes a different brand of cigarettes
1. There are five houses.
2. The Englishman lives in the red house.
3. The Spaniard owns the dog.
4. Coffee is drunk in the green house.
5. The Ukrainian drinks tea.
6. The green house is immediately to the right of the ivory house.
7. The Old Gold smoker owns snails.
8. Kools are smoked in the yellow house.
9. Milk is drunk in the middle house.
10. The Norwegian lives in the first house.
11. The man who smokes Chesterfields lives in the house next to the man with the fox.
12. Kools are smoked in a house next to the house where the horse is kept.
13. The Lucky Strike smoker drinks orange juice.
14. The Japanese smokes Parliaments.
15. The Norwegian lives next to the blue house.

Who drinks water?

Who owns a zebra?
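The fifteen statements can be treated as constraints over house positions 1 to 5 (left to right) and solved by search. The sketch below tries permutations of positions for each attribute and prunes as soon as a statement fails; this brute-force approach is one simple way to reason over the facts, not the ontology-based reasoning the slides build towards.

```python
from itertools import permutations


def right_of(a, b):
    """House a is immediately to the right of house b."""
    return a - b == 1


def next_to(a, b):
    return abs(a - b) == 1


def solve():
    """Return (water drinker, zebra owner) or None if no assignment fits."""
    first, middle = 1, 3
    orderings = list(permutations((1, 2, 3, 4, 5)))
    for red, green, ivory, yellow, blue in orderings:
        if not right_of(green, ivory):                                   # 6
            continue
        for english, spaniard, ukrainian, japanese, norwegian in orderings:
            if (english != red or norwegian != first                     # 2, 10
                    or not next_to(norwegian, blue)):                    # 15
                continue
            for coffee, tea, milk, orange_juice, water in orderings:
                if (coffee != green or ukrainian != tea                  # 4, 5
                        or milk != middle):                              # 9
                    continue
                for (old_gold, kools, chesterfields,
                     lucky_strike, parliaments) in orderings:
                    if (kools != yellow or lucky_strike != orange_juice  # 8, 13
                            or japanese != parliaments):                 # 14
                        continue
                    for dog, snails, fox, horse, zebra in orderings:
                        if (spaniard == dog and old_gold == snails       # 3, 7
                                and next_to(chesterfields, fox)          # 11
                                and next_to(kools, horse)):              # 12
                            who = {english: "Englishman",
                                   spaniard: "Spaniard",
                                   ukrainian: "Ukrainian",
                                   japanese: "Japanese",
                                   norwegian: "Norwegian"}
                            return who[water], who[zebra]
    return None


water_drinker, zebra_owner = solve()
print("Water:", water_drinker)
print("Zebra:", zebra_owner)
```

Checking each statement as early as possible keeps the search small, even though the unpruned space has 120^5 combinations.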
Ontology: hierarchical structure
Thing/Root
  House: House-1, House-2, House-3, House-4, House-5
  Inhabitant: Englishman, Spaniard, Japanese, Norwegian, Ukrainian
  Colour: Red, Green, Blue, Ivory, Yellow
  Pet: Dog, Horse, Snails, Fox, Zebra
  Beverage
Ontology
- “is-a” or taxonomic relationships
- denote the “kind” of a concept
- but ontologies are more than taxonomic relationships!

Thing/Root
  House: House-1, House-2, ...
  Inhabitant: Englishman, Spaniard, ...
  Colour: Red, Green, ...
  Pet: Dog, Horse, ...
  Beverage
  Brand
Ontology: properties
Thing/Root: House, Inhabitant, Colour, Pet, Beverage, Brand

House-1
  Has_colour: [Colour]  (Colour>Is_ColourOf: [House])
  Has_inhabitant: [Inhabitant]  (Inhabitant>LivesIn: [House])
  Is_rightTo: [House]
Ontology: properties
Thing/Root: House, Inhabitant, Colour, Pet, Beverage, Brand

Spaniard
  LivesIn: [House]  (House>Has_inhabitant: [Inhabitant])
  Has_pet: [Pet]  (Pet>Has_owner: [Inhabitant])
  Drinks: [Beverage]  (Beverage>Drunk_by: [Inhabitant])
  Uses_brand: [Brand]  (Brand>Used_by: [Inhabitant])
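The two ingredients shown on these slides, "is-a" links and properties with inverses, can be sketched as a tiny in-memory structure. The property names come from the slides; the `Ontology` class and its methods are hypothetical and far simpler than a real ontology language such as OWL.

```python
# Property -> inverse property, using the names from the slides.
INVERSE = {
    "Has_colour": "Is_ColourOf",
    "Has_inhabitant": "LivesIn",
    "Has_pet": "Has_owner",
    "Drinks": "Drunk_by",
    "Uses_brand": "Used_by",
}
# Make the mapping symmetric, so LivesIn also implies Has_inhabitant, etc.
INVERSE.update({inv: prop for prop, inv in list(INVERSE.items())})


class Ontology:
    def __init__(self):
        self.is_a = {}      # concept or individual -> parent concept
        self.facts = set()  # (subject, property, object) triples

    def add_is_a(self, child, parent):
        self.is_a[child] = parent

    def kind_of(self, thing, concept):
        """Follow is-a links upwards to test taxonomic membership."""
        while thing in self.is_a:
            thing = self.is_a[thing]
            if thing == concept:
                return True
        return False

    def assert_fact(self, subject, prop, obj):
        """Store a fact and, if the property has an inverse, that too."""
        self.facts.add((subject, prop, obj))
        if prop in INVERSE:
            self.facts.add((obj, INVERSE[prop], subject))

    def objects(self, subject, prop):
        return {o for s, p, o in self.facts if s == subject and p == prop}


onto = Ontology()
for child, parent in [("House", "Thing/Root"), ("House-1", "House"),
                      ("Inhabitant", "Thing/Root"), ("Spaniard", "Inhabitant"),
                      ("Pet", "Thing/Root"), ("Dog", "Pet")]:
    onto.add_is_a(child, parent)

onto.assert_fact("Spaniard", "Has_pet", "Dog")  # statement 3 of the riddle
print(onto.objects("Dog", "Has_owner"))
print(onto.kind_of("House-1", "Thing/Root"))
```

Stating only "the Spaniard owns the dog" is enough to answer "who owns the dog?", because the inverse property is derived rather than stored by hand.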