7_class_296_Spr08_Web

Business Intelligence Technologies – Data Mining
Lecture 7: Link Analysis & Web Mining
1
Agenda
• Content (text) Mining
• Link Structure Mining
• Web Usage Mining
• Case Discussion
2
Three Forms of Web Mining
Data available on the web     Three forms of web mining
Content data                  Text mining
Link structure                Link analysis
Web usage data                Web usage mining
3
Why Text Mining?
A significant proportion of potentially valuable information is stored in documents:
• News stories pertaining to competition, customers, and the business environment at large
• Technical reports on new technology
• Email communications with customers, partners, and within the organization
• Corporate documents embodying corporate knowledge and expertise
• Legal documents: automatic reasoning
4
Opportunities
Finding patterns in text:
• Identify and track trends in industry
  • What are my competitors doing?
  • What relevant products are being developed?
  • What are the potential uses of my products?
• Identify emerging themes in collections of documents
  • Customer communications: cluster messages so that each segment identifies a common theme, such as complaints about a certain problem or queries about product features.
• Automated categorization of e-mails (spam filter!), web pages, and news stories
5
Text Mining as the Solution
• Information retrieval
  • Locating and ranking documents of interest
  • Interest is expressed via a set of keywords
• Deeper mining
  • Document categorization
6
Structuring Textual Information
• Many methods are designed to analyze structured data
• If documents can be represented by a set of attributes, existing data mining methods can be used
• How to represent a document?
[Diagram: documents → structured representation → apply DM methods to find patterns among documents]
7
Document Representation
• A document representation aims to capture what the document is about
• One possible approach:
  • Each entry in the table represents a document
  • Each attribute describes whether or not a term appears in the document
Example
Terms        Camera  Digital  Memory  Pixel
Document 1   1       1        0       1
Document 2   1       1        0       0
…            …       …        …       …
8
Document Representation
• Another approach:
  • Attributes represent the frequency with which a term appears in the document
Example: Term frequency table
Terms        Camera  Digital  Memory  Print
Document 1   3       2        0       1
Document 2   0       4        0       3
…            …       …        …       …
9
Document Representation
• But a term tends to be mentioned more times in longer documents
• Therefore, use relative frequency (% of document):
  relative frequency = No. of occurrences / No. of words in document
Terms        Camera  Digital  Memory  Print
Document 1   0.03    0.02     0       0.01
Document 2   0       0.004    0       0.003
…            …       …        …       …
10
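The three representations on the preceding slides (term presence, term frequency, relative frequency) can be sketched in a few lines of Python. This is only an illustration: the whitespace tokenizer, the function name `represent`, and the sample sentence are assumptions, not part of the lecture.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def represent(text, vocabulary):
    """Return (binary, term-frequency, relative-frequency) vectors over vocabulary."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    binary = [1 if counts[t] > 0 else 0 for t in vocabulary]
    freq = [counts[t] for t in vocabulary]
    rel = [counts[t] / total if total else 0.0 for t in vocabulary]
    return binary, freq, rel

# Hypothetical 10-word document chosen to give the counts 3, 2, 0, 1.
vocabulary = ["camera", "digital", "memory", "print"]
doc = "digital camera with digital zoom camera print camera bag sale"
binary, freq, rel = represent(doc, vocabulary)
print(binary, freq, rel)
```

A real pipeline would also strip punctuation, remove stop words, and possibly stem terms before counting.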
The TF/IDF Document Representation
• TF/IDF: Term Frequency and Inverse Document Frequency
• An approach for weighting terms in a document based on the term's frequency in the document and in the document corpus (used to filter out common words, e.g. "important").
• A term has a higher weight if it is a good descriptor for a particular document, i.e., if it appears frequently in the document but is infrequent in the entire corpus.
• Weights are determined by: W = tf * log(N/df)
  • tf: the term's frequency in the document
  • df: the number of documents in the corpus that contain the term
  • N: the number of documents in the corpus
Terms        Camera  Digital  Memory  Print  …
Document 1   …       …        …       …      …
Document 2   …       …        …       …      …
…
11
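A minimal sketch of the weighting W = tf * log(N/df), applied to the two documents from the earlier term frequency table. The slide does not state the base of the logarithm, so the natural log is an assumption here, as is the function name `tfidf_weights`. With only two documents, any term that appears in both gets weight 0.

```python
import math

def tfidf_weights(tf_table):
    """tf_table: {doc_id: {term: term frequency}}. Returns W = tf * log(N / df)."""
    n_docs = len(tf_table)
    # df: number of documents in which each term appears at least once
    df = {}
    for term_freqs in tf_table.values():
        for term, tf in term_freqs.items():
            if tf > 0:
                df[term] = df.get(term, 0) + 1
    weights = {}
    for doc_id, term_freqs in tf_table.items():
        weights[doc_id] = {
            term: tf * math.log(n_docs / df[term]) if tf > 0 else 0.0
            for term, tf in term_freqs.items()
        }
    return weights

tf_table = {
    "Document 1": {"camera": 3, "digital": 2, "memory": 0, "print": 1},
    "Document 2": {"camera": 0, "digital": 4, "memory": 0, "print": 3},
}
weights = tfidf_weights(tf_table)
for doc_id, row in weights.items():
    print(doc_id, {term: round(w, 3) for term, w in row.items()})
```

In this toy corpus only "camera" in Document 1 receives a positive weight, because "digital" and "print" occur in every document and so do not distinguish between them.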
Text Mining Application 1: Association Rules
• After proper representation, data mining techniques can be applied to text, e.g. association rules, clustering, classification.
• Keyword-based association rules: treat keywords as items.
• Example frequent keyword pair: Microsoft, Antitrust

Document No.  Item 1     Item 2     Item 3
100           France     Iraq       US
101           NASDAQ     NYSE       job
102           Iraq       US         UK
103           Microsoft  antitrust  OS
104           Microsoft  Antitrust  windows
…

OR

Doc No.  Microsoft  antitrust  France  …
100      0          0          1       …
101      0          0          0       …
102      0          0          0       …
103      1          1          0       …
104      1          1          0       …
…
12
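To make the keyword-as-item idea concrete, the sketch below computes support and confidence for keyword rules over the five documents listed in the table. Support and confidence are the standard association rule measures; they are not defined on this slide, so treat their use here, and the helper names, as assumptions. Only the five visible rows are used, and keywords are lowercased.

```python
docs = {
    100: {"france", "iraq", "us"},
    101: {"nasdaq", "nyse", "job"},
    102: {"iraq", "us", "uk"},
    103: {"microsoft", "antitrust", "os"},
    104: {"microsoft", "antitrust", "windows"},
}

def support(itemset, transactions):
    """Fraction of documents that contain every keyword in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base

print(support({"microsoft", "antitrust"}, docs))
print(confidence({"microsoft"}, {"antitrust"}, docs))
```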
Text Mining Application 2: Finding Clusters of Similar Documents
Example clusters:
• Requests for product information
• Complaints about recent upgrade
• Inquiries about complementary products
13
How to determine if two documents are similar?
• In order to retrieve documents similar to a given document we need a measure of similarity
• Euclidean distance:
  • The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as

    $$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

• Example:
  Document A: (PDA=0.3, wireless=0.02, commerce=0)
  Document B: (0.001, 0.004, 0)
  D(A,B) = sqrt[(0.3-0.001)^2 + (0.02-0.004)^2 + (0-0)^2] ≈ 0.299
• This can be used for document clustering (k-means) and classification (kNN)
14
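The Euclidean distance between documents A and B can be checked with a short Python sketch. The function name `euclidean_distance` is illustrative; the vectors are the ones given on the slide.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences between two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

doc_a = (0.3, 0.02, 0.0)      # PDA, wireless, commerce
doc_b = (0.001, 0.004, 0.0)
d = euclidean_distance(doc_a, doc_b)
print(round(d, 4))
```

Note that the distance is dominated by the PDA term, since that is where the two documents differ most.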
FYI: Basic Measures for Text Retrieval
• Most commonly used is the cosine measure of similarity between two documents X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn):

  $$\mathrm{sim}(X, Y) = \frac{X \cdot Y}{|X|\,|Y|}$$

  where

  $$X \cdot Y = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$$

  and

  $$|X| = \sqrt{X \cdot X} = \sqrt{x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2}$$

• Example: the similarity between X = (3, 2, 0, 1) and Y = (1, 4, 0, 0) is

  $$\mathrm{sim}(X, Y) = \frac{3 \cdot 1 + 2 \cdot 4 + 0 \cdot 0 + 1 \cdot 0}{\sqrt{3^2 + 2^2 + 0^2 + 1^2}\,\sqrt{1^2 + 4^2 + 0^2 + 0^2}} = \frac{11}{\sqrt{14}\,\sqrt{17}} \approx 0.713$$
15
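The cosine example can be verified with a small Python sketch. The function name is illustrative, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: an empty document is similar to nothing
    return dot / (norm_x * norm_y)

sim = cosine_similarity((3, 2, 0, 1), (1, 4, 0, 0))
print(round(sim, 3))
```

Unlike Euclidean distance, the cosine measure ignores vector length, so a long document and a short one with the same term proportions score as identical.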
Personalized Web Ad Delivery
• Objective:
  • Improve effectiveness of Web ads
  • Customize ad delivery so that the ad corresponds to the context the user is exploring
  • Web content is dynamic, so automated ad placement is needed
  • Example: Google Gmail
• Solution:
  • Represent each ad as a document with a set of keywords. For example, an ad for a hybrid car is represented by the following set of keywords: car, electric, environment, etc.
  • Then deliver ads to viewers of pages (i.e., documents) that resemble this description.
16
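One simple way to decide whether a page "resembles" an ad's keyword description is to score the fraction of the ad's keywords that appear on the page. The sketch below does that; the scoring rule, the second ad, and the sample page text are all assumptions made for illustration, and a production system would more likely use a weighted similarity such as cosine over TF/IDF vectors.

```python
def keyword_overlap(ad_keywords, page_text):
    """Fraction of the ad's keywords that occur in the page text."""
    page_terms = set(page_text.lower().split())
    ad_terms = {k.lower() for k in ad_keywords}
    if not ad_terms:
        return 0.0
    return len(ad_terms & page_terms) / len(ad_terms)

def pick_ad(ads, page_text):
    """Return the name of the ad whose keywords best match the page."""
    return max(ads, key=lambda name: keyword_overlap(ads[name], page_text))

ads = {
    "hybrid car": ["car", "electric", "environment"],   # keywords from the slide
    "digital camera": ["camera", "digital", "memory"],  # hypothetical second ad
}
page = "New electric car models promise a cleaner environment"
best = pick_ad(ads, page)
print(best)
```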
Text Mining Application 3: Text Classification
Doc No.  earnings  jump   miss   …  Class (positive vs. negative)
100      0.03      0      0.7    …  Negative
101      0.2       0.003  0.5    …  Negative
102      0.04      0.01   0.02   …  Positive
103      0.2       0.4    0.01   …  Positive
104      0.4       0.3    0.002  …  Positive
…
17
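As a rough illustration of classifying a new document from such a table, the sketch below applies a k-nearest-neighbour vote with Euclidean distance to the five labelled rows, using only the three visible term columns. The choice of k = 3 and the two query vectors are assumptions for the example, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

# (earnings, jump, miss) -> class, taken from the five rows in the table
training = [
    ((0.03, 0.0, 0.7), "Negative"),
    ((0.2, 0.003, 0.5), "Negative"),
    ((0.04, 0.01, 0.02), "Positive"),
    ((0.2, 0.4, 0.01), "Positive"),
    ((0.4, 0.3, 0.002), "Positive"),
]

def knn_classify(features, training, k=3):
    """Majority class among the k training documents closest to features."""
    by_distance = sorted(training, key=lambda row: math.dist(features, row[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical new documents
print(knn_classify((0.1, 0.0, 0.6), training))    # heavy on "miss"
print(knn_classify((0.3, 0.35, 0.0), training))   # heavy on "jump"
```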
Applications of Text Classification
• Business intelligence
  • Classifying news stories: competitors, new technologies, etc.
• Email messages
  • Email from friends vs. spam
• Classification of Web pages
  • E.g., customized delivery of news stories based on what the user considers interesting (i.e., has viewed): build a classifier to automatically classify incoming news stories into interesting and not-interesting classes.
• Personalized Web ads
18
Text Mining Application 4: Information Retrieval / Search Engine
• Locating and ranking documents of interest
• Interest is expressed via a set of keywords
• Which documents satisfy a query?
  • Query: Iraq US
  • Most relevant documents: the terms Iraq and US are central to their content (have a high weight in the TF/IDF representation)
19
Basic Measures for Information Retrieval
[Figure: Venn diagram within the set of all documents, showing relevant documents, retrieved documents, and their overlap (retrieved & relevant)]
• How to evaluate the quality of a search engine
  • Of the retrieved documents, some are relevant while others are not
  • Not every relevant document is retrieved
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
  Precision = (No. of documents retrieved and relevant) / (No. of retrieved documents)
• Recall: the percentage of documents relevant to the query that were, in fact, retrieved
  Recall = (No. of documents retrieved and relevant) / (No. of relevant documents)
20
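Precision and recall follow directly from set intersection. A minimal sketch, with made-up document ids; returning 0.0 when a set is empty is a convention assumed here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 5 actually relevant, 2 in common
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

The two measures pull in opposite directions: retrieving every document gives perfect recall but poor precision, which is why both are reported together.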
What we just described was used by the first generation of search engines
• TF/IDF-based technique, evaluated by precision and recall
  • The Yahoo/AltaVista generation
• Problems of the initial approaches?
21
Link Structure Analysis
- Using link structure to rank relevancy of Web pages
• Traditional IR methods only examine the appearance of relevant terms, and often fail to account for
  • The quality of the information in the retrieved documents
  • The reliability of the source
• From the retrieved documents, we want to rank authoritative documents higher
• Approach: mining the Web's link structure to identify authoritative web pages
22
Identify Authoritative Web Pages
• The Web includes pages and hyperlinks
• A lot of information is in the structure of web page linkages. Hyperlinks contain rich latent human information
  • An author creating a hyperlink pointing to another page can be viewed as an endorsement
  • The collective endorsement of a given page by different authors can help discover authoritative pages
• Google uses the link structure of the Web to rank documents (PageRank)
23
Using Hubs to Identify Authoritative Web Pages
• A hub is a page pointing to many good authorities.
  • E.g., a web page pointing to many good sources of information on business intelligence
  • A hub may not be an authority, and may have very few links pointing to it.
  • Yet a link from a hub to a page is valued more than a link from a regular page
• An authority is a page pointed to by many good hubs
[Figure: link diagram with nodes labeled Hub (2), Authority (2), and Page (6)]
24
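The mutual definition of hubs and authorities suggests an iterative computation. The slide does not give an algorithm, so the sketch below assumes the usual HITS-style update: a page's authority score is the sum of the hub scores of the pages linking to it, its hub score is the sum of the authority scores of the pages it links to, and both are renormalized each round. The toy graph and page names are hypothetical.

```python
import math

def hubs_and_authorities(links, iterations=20):
    """links: {page: [pages it points to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub: sum of authority scores of the pages this page links to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {
    "hub1": ["a1", "a2"],
    "hub2": ["a1", "a2"],
    "page3": ["a1"],
}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get), round(hub["hub1"], 3), round(hub["page3"], 3))
```

In this graph a1 ends up as the top authority, and the two hubs outrank page3 as hubs because they point to both authorities.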
Overview of Search Engine
Web Usage Mining - Data
• Usage data
  • Site-level usage data: Web log files
  • User-level usage data: panel data
30
Site-level Usage Data: Web Logfiles
• A Web server is a program that processes
incoming http requests
• Web servers send Web pages to clients that
request these pages
• Each time the server sends something out to a
client, the server stores “some details of what it
just did” in files called log files.
• What *can* these log files possibly contain?
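An Example of a Web Log File
sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] "GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278
Fields, in order:
• Host/IP: sniksnak.foobar.org
• Time stamp: [30/Feb/1996:06:03:24 -0800]
• Retrieval method: GET
• Path and file retrieved: /film/logos/the.movies.main.gif
• Protocol: HTTP/1.0
• HTTP completion code: 200
• Bytes: 278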
What can we learn from Web log data?
32
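To show how such a log line can be broken into its fields for usage mining, here is a small Python sketch using a regular expression. It assumes the line follows the Common Log Format layout of the example; the pattern and field names are illustrative. The time stamp is kept as a string because the example date is not a valid calendar date and would fail strict date parsing.

```python
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the size field means no body was sent; count it as 0 bytes
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

line = ('sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] '
        '"GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278')
entry = parse_log_line(line)
print(entry["host"], entry["path"], entry["status"], entry["bytes"])
```

Once lines are parsed this way, simple aggregations (requests per host, most requested paths, error rates by status code) are a natural first step.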
User Level Usage Data
• Web browsing data collected at a user level (i.e. all web sites visited by a specific user)
• Panels of users (market research companies)
  • Nielsen NetRatings, comScore
  • Millions of users on the panel
  • Tracking software installed on the users' computers
  • Market reports generated based on their users' data
    • E.g. search engines' market shares; social networking sites grow 47% year over year.
  • Sell reports and data sets
• What can we learn from user level usage data?
33
Case Discussion
• Google
  1. What are the differences between AdWords and AdSense in terms of techniques, revenue sources and issues?
  2. How can Google leverage its strength in other channels of advertising, e.g. print, radio, TV?
• MedNet
  1. What are the pros and cons for Windham to advertise on MedNet.com and Marvel?
  2. What are the pros and cons of click-per-thousand-impression and click-through-rate?
  3. Is there a win-win solution for all three players?
34
