Data Mining on Symbolic Knowledge Extracted from the Web

Changho Choi
Source: http://www.cs.cmu.edu/~dunja/WshKDD2000.html
Carnegie Mellon University, J.Stefan Institute
Abstract

This paper gives a case study of combining information from two kinds of sources:
- Unstructured information: an errorful source of large amounts of potentially useful information
- Structured information: less up-to-date, but reliable as facts

Using information from both kinds of sources improves the reliability of data-mined rules.

3/6/2001
Changho Choi, University at Buffalo

Introduction (#1/2)

Challenge
- not only gather and represent knowledge existing on the Web,
- but also use that knowledge for planning, acting, and creating new knowledge

Introduction (#2/2)

First stage
- integrating three types of information gathering:
  - Extracting propositional knowledge from highly-structured, automatically-generated web pages
  - Extracting propositional knowledge from free-form, unstructured data sources
  - Extracting relational knowledge existing on the Web through a combination of web pages and their hyperlink structure

Aim
- identify patterns of knowledge that were not explicitly represented as facts on the Web

Data sources and features

Extracted features
- come directly from crawling the company Web sites
- e.g. performs-activity, links-to, officers, sector, location, ...

Wrapper features from secondary sources
- rely on a mostly regular format
- e.g. hoovers-sector, hoovers-industry, hoovers-type, address, ...

Abstracted features
- describe relationships between companies
- discretize our continuous features
- e.g. same-state, same-city, share-officers, mentions-same, ...

Process of acquiring potentially interesting information about companies from the Web

[Figure: process diagram with the following elements]
- The Web: 4312 web sites, 50 pages on each site (e.g. www.3com.com)
- Extracting from corp. Web sites
- Company information from www.hoovers.com
- Wrapping from corp. info.
- KB
- Abstracting features
- Data Mining
- New knowledge

Extracted Features

Feature | Values | Description | Extracting Method
--- | --- | --- | ---
performs-activity | 8 | The types of activity this company engages in. | Looking for keywords associated with each type of activity.
links-to | | Companies whose web sites are pointed to by this company. | Simple text search on all the web pages.
mentions | | Companies whose name occurs on this company’s Web site. | Same as above.
officers | | Officers of this company. | On the pages containing “officer”, “director”.
sector | 200 | Naïve Bayes predicted economic sector of company. | Text classification by a Naïve Bayesian model.
coarse-sector | 12 | Naïve Bayes predicted coarse-grained economic sector. | Same as above.
locations | | Derived from a naïve Bayes classifier on small regions of text surrounding country names, and autoslog-based rules. | Advanced Information Extraction technique.
urlcountry | 39 | Inferred from the URL domain name where applicable. | Country domain of the URL.

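Two of the simpler extraction methods in the table (keyword lookup for performs-activity and the country domain of the URL for urlcountry) can be illustrated with a minimal sketch. The keyword lists and function names below are hypothetical; the slides do not give the actual keywords or implementation.

```python
from urllib.parse import urlparse

# Hypothetical keyword lists; the slides do not list the real keywords.
ACTIVITY_KEYWORDS = {
    "sell": ["sell", "order online"],
    "research": ["research", "laboratory"],
}


def performs_activity(page_texts):
    """Return the activities whose keywords occur in any crawled page."""
    text = " ".join(page_texts).lower()
    return {
        activity
        for activity, keywords in ACTIVITY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def url_country(url):
    """Return the two-letter country suffix of the host name, or None."""
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1]
    # Generic domains such as .com carry no country information.
    return suffix if len(suffix) == 2 and suffix.isalpha() else None
```

For a generic domain such as http://www.3com.com this sketch returns no country, which matches the "where applicable" qualifier in the table.
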
Wrapped Features

Feature | Values | Description
--- | --- | ---
hoovers-sector | 28 | Sector listed on the company’s Hoovers page.
hoovers-industry | 298 | Industry listed on the company’s Hoovers page.
hoovers-type | 18 | Public, private, school etc.
address | | Address as listed on Hoovers.
City, state | | Extracted from address.
competitor | | Companies that compete with this company.
subsidiary | | Companies listed as subsidiaries of this company.
products | 4648 | Product categories extracted from the products page.
officers | | Officers listed on the Hoovers page.
auditors | 266 | Company auditors.
revenue | | Revenue data for up to the last 10 years.
Net-income | | Net Income data for up to the last 10 years.
Net-profit | | Net Profit data for up to the last 10 years.
employees | | Number of employees each year for up to the last 10 years.

Abstracted Features

Feature | Values | Description
--- | --- | ---
Same-state | | Companies in the same state as this company.
Same-city | | Companies in the same city as this company.
Share-officers | | Companies that have officers in common with this company.
Mentions-same | | Companies that mention some company also mentioned by this company.
Links-to-same | | Companies that link to some company also linked to by this company.
Reciprocally-mentions | | Companies mentioned by this company, who link to this company.
Reciprocally-links | | Companies linked to by this company, who link to this company.
Reciprocally-competes | | Companies listed as a competitor of this company, who list this company as a competitor.
Revenue-binned | 10 | Revenues for each of up to 10 years binned into 10 equal sized bins.
Net-profit-binned | 10 | Net profits similarly binned.
Net-income-binned | 10 | Net income similarly binned.
Employees-binned | 10 | Employees similarly binned.

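As a rough illustration of how abstracted features could be derived from facts already in the knowledge base, the sketch below computes same-state, share-officers, and a binned value. The toy companies, the dictionary layout, and the choice of equal-width bins are all assumptions; the slides say only "10 equal sized bins" and do not say whether that means equal width or equal frequency.

```python
# Toy knowledge base; company names and facts are made up for illustration.
KB = {
    "acme": {"state": "NY", "officers": ["a. smith", "b. jones"]},
    "globex": {"state": "NY", "officers": ["c. wu"]},
    "initech": {"state": "TX", "officers": ["b. jones"]},
}


def same_state(company, kb):
    """Companies in the same state as `company`."""
    state = kb[company]["state"]
    return sorted(
        other for other, facts in kb.items()
        if other != company and facts["state"] == state
    )


def share_officers(company, kb):
    """Companies that have at least one officer in common with `company`."""
    mine = set(kb[company]["officers"])
    return sorted(
        other for other, facts in kb.items()
        if other != company and mine & set(facts["officers"])
    )


def equal_width_bin(value, low, high, n_bins=10):
    """Map a continuous value to a bin index in 0..n_bins-1 (assumed equal width)."""
    if high == low:
        return 0
    index = int((value - low) / (high - low) * n_bins)
    return min(max(index, 0), n_bins - 1)
```

The relational features return sets of companies rather than single values, which is what lets a relational learner later use them as links between companies.
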
Data mining algorithms

Discovering associations
- by applying the Apriori algorithm

Learning propositional rules
- by using the C5.0 algorithm, which generates a decision tree for the given dataset

Learning relational rules
- by using Quinlan’s FOIL system, which can use patterns in the relationship between companies

Experimental results

Apriori Experiments
- discover associations in the data using association rules

Decision Trees
- generate propositional rules using decision trees

FOIL Experiments
- generate first order rules using the first order rule learning system

Result: Apriori Experiments (#1/2)

Threshold
- minimal support: 10%, minimal confidence: 80%

Some examples
- Highest-confidence rules, which can be understood intuitively:
  performs-activity = sell :- locations = united-states, links-to = adobe-systems-incorporated (10.8%, 93.0%)
  performs-activity = sell :- performs-activity = technicalassistance, links-to = adobe-systems-incorporated (11.8%, 91.1%)

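The pair printed after each rule appears to be (support, confidence), since both example rules clear the 10% and 80% thresholds in that order. The sketch below shows how those two numbers are conventionally computed for a single rule; it is not the Apriori search itself, and the ten toy records are invented for illustration.

```python
def support_and_confidence(records, body, head):
    """Compute (support, confidence) for the rule `head :- body`.

    records: list of sets of "feature=value" strings, one set per company.
    support    = fraction of records containing both body and head
    confidence = fraction of records containing the body that also contain the head
    """
    body, head = set(body), set(head)
    n_body = sum(1 for record in records if body <= record)
    n_both = sum(1 for record in records if (body | head) <= record)
    support = n_both / len(records)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Ten made-up companies: four located in Japan, three of which sell.
records = (
    [{"locations=japan", "performs-activity=sell"}] * 3
    + [{"locations=japan"}]
    + [{"locations=united-states"}] * 6
)

support, confidence = support_and_confidence(
    records, body=["locations=japan"], head=["performs-activity=sell"]
)
```

On this toy data the rule has support 0.3 and confidence 0.75, so it would pass a 10% support threshold but fail an 80% confidence threshold.
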
Result: Apriori Experiments (#2/2)

Some examples
- Normal rules:
  performs-activity = sell :- locations = japan (14.5%, 90.8%)
  performs-activity = research :- locations = japan (14.5%, 90.8%)
- Rules with lower support or confidence (meaningful?):
  performs-activity = research :- locations = united-states (26.9%, 72.5%)
  hoovers-sector = food-beverage-&-tobacco :- competitor = conagra-inc (1.0%, 89.8%)
  hoovers-sector = retail :- competitor = kmart-corporation (1.0%, 75.0%)
  hoovers-sector = energy :- competitor = bp-amoco-p.l.c. (1.1%, 73.0%)

Result: Decision Trees

Example: Predict the economic sector

city atlanta
    revenue1996 <= 0.1 => Diversified Services (28, 0.179)
    revenue1996 > 0.1 => Computer Software & Services (20, 0.2)
city Houston
    coarse-sector [basic-materials, capital-goods, transportation] => Manufacturing (10, 0.3)
    coarse-sector [financial, healthcare, technology] => Computer Software & Services (21, 0.238)
    coarse-sector [conglomerates, consumer-cyclical, consumer-non-cyclical, energy, services, utilities] => Energy (49, 0.49)
city Dallas
    net_income1999 <= 19 => Health Products & Services (25, 0.2)
    net_income1999 > 19 => Leisure (25, 0.2)
...

Notes
- For different cities, the tree splits on different features.
- The coarse-sector feature is based on Naïve Bayes classification.

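The tree fragment above can be read as a set of propositional rules. The sketch below encodes only the three city branches shown on the slide as a plain function; the dictionary keys and the function name are assumptions, and the branches elided by "..." are left unhandled. The pair after each leaf looks like (cases covered, error rate), for example 0.179 is about 5/28, though the slide does not state this.

```python
def predict_sector(company):
    """Apply the three city branches shown on the slide; None for other cities."""
    city = company["city"].lower()
    if city == "atlanta":
        if company["revenue1996"] <= 0.1:
            return "Diversified Services"
        return "Computer Software & Services"
    if city == "houston":
        coarse = company["coarse-sector"]
        if coarse in {"basic-materials", "capital-goods", "transportation"}:
            return "Manufacturing"
        if coarse in {"financial", "healthcare", "technology"}:
            return "Computer Software & Services"
        if coarse in {"conglomerates", "consumer-cyclical",
                      "consumer-non-cyclical", "energy", "services", "utilities"}:
            return "Energy"
        return None
    if city == "dallas":
        if company["net_income1999"] <= 19:
            return "Health Products & Services"
        return "Leisure"
    return None  # branches not shown on the slide
```

Writing the tree this way makes the slide's point visible: each city branch tests a different feature (revenue, the Naïve Bayes coarse sector, or net income).
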
Result: FOIL Experiments
(First Order Inductive Learner)

Example

computer-software-&-services(A) :- hq-city(A,B), B<>fremont, competitor(A,C), hq-city(C, Islandia), not(employees_binned(A,?,?)).

It means that companies headquartered somewhere other than Fremont and competing with “Computer Associates International” are in the computer software & services sector.
(“Computer Associates International” is the only company in our knowledge base headquartered in Islandia.)

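To make the relational rule concrete, the sketch below checks its body against a tiny hand-made fact base. Only the rule structure and the Islandia headquarters of Computer Associates International come from the slide; the other company names, cities, and the way facts are stored are hypothetical. The rule's final literal is read here as "A has no employees_binned fact at all".

```python
# Toy facts; every company except computer-associates-international is made up.
HQ_CITY = {
    "computer-associates-international": "islandia",
    "rival-soft": "boston",
    "fremont-soft": "fremont",
    "big-soft": "boston",
    "lone-co": "austin",
}
# (A, C) means A lists C as a competitor.
COMPETITOR = {
    ("rival-soft", "computer-associates-international"),
    ("fremont-soft", "computer-associates-international"),
    ("big-soft", "computer-associates-international"),
}
# Companies that have at least one employees_binned fact.
HAS_EMPLOYEES_BINNED = {"big-soft"}


def rule_fires(a):
    """True if company `a` satisfies the body of the FOIL rule."""
    b = HQ_CITY.get(a)
    if b is None or b == "fremont":            # hq-city(A,B), B<>fremont
        return False
    if a in HAS_EMPLOYEES_BINNED:              # not(employees_binned(A,?,?))
        return False
    return any(                                # competitor(A,C), hq-city(C, Islandia)
        x == a and HQ_CITY.get(c) == "islandia"
        for x, c in COMPETITOR
    )
```

Unlike the propositional rules, this one joins two relations (competitor and hq-city) through the variable C, which is the kind of pattern a relational learner such as FOIL can express.
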
Discussion

Difficulties
- data cleaning: errorful nature of our facts
- feature selection

Pleasing result
- the interaction between the symbolic features and the statistically-derived (naïve Bayes) features

Further Work

This paper suggests
- a number of research directions, impacting each of information extraction, machine learning, and data-mining from text

Further work
- Extracting information from wrapped web-sites as a source of training data
- Automatic data-cleaning of extracted features
- Extending the information extraction

Reference (#1/2)

FOIL
- Three companions for first order data mining
  http://www.cs.kuleuven.ac.be/~ml/Doc/Tutorial_Summer/tutorial_summer.html

Reference (#2/2)

Feature | Sample URL
--- | ---
hoovers-sector | http://www.hoovers.com/sector/
hoovers-industry | http://www.hoovers.com/industry/list/
hoovers-type | http://www.hoovers.com/company/dir/0,2116,15694,00.html
address | http://www.hoovers.com/co/capsule/5/0,2163,12475,00.html
City, state | same as above
competitor | same as above
subsidiary | http://www.hoovers.com/premium/profile/5/0,2147,12475,00.html
products | same as above
officers | same as above
auditors | same as above
revenue | http://www.hoovers.com/hoov/join/sample_historical.html
Net-income | same as above
Net-profit | same as above
employees | same as above