Web Apriori Power Point

Apriori Algorithm and the World Wide Web
Roger G. Doss
CIS 734
The purpose of this presentation is to introduce
an application of the Apriori algorithm to perform
association rule data mining on data gathered from
the World Wide Web.
Specifically, a system will be designed that
gathers information (web pages) from a
user-specified site and performs association rule
data mining on that data.
We have already seen the Apriori algorithm applied
to textual data in class.
Given an implementation that can work with
textual data...
What we want to do is use Apriori in the following
manner:
Given an input of:
(url,N links,support,confidence,keywords)
* obtain the url
* traverse adjacent links, up to N in total
* format the data
* compute support and confidence levels
for each word in a user-supplied keyword set.
We can envision this system as divided
into four phases:
Phase 0: User input.
Phase 1: Data Acquisition.
Phase 2: Running Apriori on the data.
Phase 3: User output.
Data Acquisition:
Traverse Web Page(URL,N)
|
while N web pages not visited
|
Obtain WebPage via HTTP
|
Parse information
( look for keywords, adjacent links )
|
Store keywords in a file
Store adjacent links to visit
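The keyword part of the "Parse information" step could be sketched roughly as follows. The helper name extract_keywords is hypothetical, and the matching rules (words are runs of letters and digits, compared case-insensitively against lowercase keywords) are assumptions; the slides do not say how words are matched. HTML markup is not stripped here, so tag and attribute names are treated as words too.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper (not from the slides): returns the subset of
// `keywords` that occurs in `page`. Keywords are assumed to be lowercase.
std::set<std::string> extract_keywords(const std::string& page,
                                       const std::set<std::string>& keywords)
{
    std::set<std::string> found;
    std::string word;
    // A trailing space is appended so the final word is flushed too.
    for (char c : page + " ") {
        unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc)) {
            word += static_cast<char>(std::tolower(uc));
        } else if (!word.empty()) {
            if (keywords.count(word)) found.insert(word);
            word.clear();
        }
    }
    return found;
}
```

The returned set can be written out as one line of the keyword file, giving one transaction per page.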
Running Apriori on the Data:
If we treat the initial web page and each adjacent web page
as a transaction, then each occurrence
of a keyword is an element in that transaction.
At this point, the Apriori algorithm can be run
on the data, producing a set of Association Rules
based on desired Confidence and Support levels.
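As a rough illustration of the two measures, assuming each page's transaction is stored as a set of keywords, support and confidence could be computed as below. The names ItemSet, support and confidence are chosen for this sketch. It shows only the measures, not Apriori's candidate generation and pruning.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

typedef std::set<std::string> ItemSet;   // keywords found on one page

// Fraction of transactions (pages) that contain every item in `items`.
double support(const std::vector<ItemSet>& transactions, const ItemSet& items)
{
    if (transactions.empty()) return 0.0;
    int hits = 0;
    for (const ItemSet& t : transactions) {
        // std::includes needs sorted ranges; std::set iterates in order.
        if (std::includes(t.begin(), t.end(), items.begin(), items.end()))
            ++hits;
    }
    return static_cast<double>(hits) / transactions.size();
}

// Confidence of the rule lhs -> rhs:
// support(lhs union rhs) / support(lhs), or 0 if lhs never occurs.
double confidence(const std::vector<ItemSet>& transactions,
                  const ItemSet& lhs, const ItemSet& rhs)
{
    ItemSet both(lhs);
    both.insert(rhs.begin(), rhs.end());
    double s = support(transactions, lhs);
    return s == 0.0 ? 0.0 : support(transactions, both) / s;
}
```

A rule would then be reported only when both values meet the user's thresholds.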
Some modules that may be needed to implement the
system:
* HTTP Client.
Accessing a web page from a URL mechanically.
* Data Cleaning.
Extracting words that match keyword list.
Extracting hypertext references, i.e.,
href="http://www.njit.edu".
* Apriori Algorithm.
* Web traversal.
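For the hypertext-reference part of data cleaning, one possible sketch uses std::regex from the C++ standard library. The function name extract_links is hypothetical, and the pattern is an assumption: it only picks up double-quoted href values that start with http:// or https://, so relative links and single-quoted attributes are skipped. A real HTML parser would be more robust.

```cpp
#include <list>
#include <regex>
#include <string>

// Hypothetical helper (not from the slides): collects absolute
// http/https urls found in double-quoted href attributes.
std::list<std::string> extract_links(const std::string& webpage)
{
    // Capture group 1 is the url between the quotes; icase also
    // accepts HREF= and HTTP://.
    static const std::regex href("href\\s*=\\s*\"(https?://[^\"]+)\"",
                                 std::regex::icase);
    std::list<std::string> links;
    for (std::sregex_iterator it(webpage.begin(), webpage.end(), href), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}
```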
Building this system allows one to have a code base that can
be used for future research and work. An HTTP client is needed
to obtain data from the web, web traversal is important in
web crawling, and parsing HTML allows one to extract
information from web pages.
An interesting problem is how one could traverse a web page
and visit N links reachable from that web page.
We can view the WWW as a graph. Each URL is a node
on that graph. From each page, we have hyper-text references
that point to other resources, including other web pages.
We consider these other web pages as adjacent nodes.
Assume that you have the following primitives:
string get_webpage( string url );
list<string> get_adj_webpages( string webpage );
Using C++ Standard Template Library, implement
Breadth First Search to traverse all adjacent web pages
from an initial web page source.
Hint: The following containers might be useful:
map<string, bool> visited;
queue<string> q;
#include <list>
#include <map>
#include <queue>
#include <string>
using namespace std;

// get_webpage and get_adj_webpages are the primitives assumed above.
void bfs( string url )
{
    // Maps urls to boolean value indicating
    // if they were visited. A url that is not yet
    // in the map reads as false (not visited).
    map<string, bool> visited;
    // FIFO queue of urls.
    queue<string> q;
    // List of adjacent urls.
    list<string> adj;
    // Contains web page results.
    string data;
    // Insert into queue the initial url.
    q.push(url);
    // Traverse the web pages.
    while(!q.empty()) {
        // Take the next url from the front of the queue.
        // It is removed whether or not it was already visited,
        // otherwise the loop would never terminate.
        url = q.front();
        q.pop();
        if(visited[url] == false) {
            data = get_webpage(url);
            adj = get_adj_webpages(data);
            // Mark as visited.
            visited[url] = true;
            // Insert into queue all adjacent webpages.
            for(list<string>::iterator i = adj.begin(); i != adj.end(); ++i) {
                // If we did not already visit this page...
                if(visited[(*i)] != true) {
                    q.push((*i));
                }
            }
        }
    }
}// bfs
We have a given node/url, A, with adjacent nodes/urls B,C,D
as follows:
page A: adj B,C,D.
page B: adj A,E,F.
page C: adj G.
page D: adj A.
Or as a directed graph:
E <--- B <---> A <---> D
       |       |
       v       v
       F       C ---> G
(init) visit A
(from A) visit B,C,D
(from B) visit E,F
(from C) visit G
* We do not consider URLs already visited.
* Each time we visit a page, some processing can be done.
In this case, we obtain a list of words that we are interested
in.
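The traversal of the example graph can be reproduced with a small self-contained sketch. Here the web is replaced by an in-memory map from each url to its adjacent urls (the Web type, bfs_order and example_web are names invented for this sketch), and a max_pages limit stands in for the N of the user input. Unlike a version that marks a url when it is visited, this variant marks a url when it is first queued, so no url enters the queue twice.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Stand-in for the web: each url maps to its adjacent urls.
typedef std::map<std::string, std::list<std::string> > Web;

// Breadth-first traversal from `start` that stops once `max_pages`
// pages have been visited. Returns the urls in visiting order.
std::vector<std::string> bfs_order(const Web& web, const std::string& start,
                                   std::size_t max_pages)
{
    std::vector<std::string> order;
    std::set<std::string> seen;      // urls already queued or visited
    std::queue<std::string> q;
    q.push(start);
    seen.insert(start);
    while (!q.empty() && order.size() < max_pages) {
        std::string url = q.front();
        q.pop();
        order.push_back(url);        // "visit": process the page here
        Web::const_iterator page = web.find(url);
        if (page == web.end()) continue;   // no known outgoing links
        for (const std::string& next : page->second) {
            // insert().second is true only the first time a url is seen.
            if (seen.insert(next).second) q.push(next);
        }
    }
    return order;
}

// The example graph: pages A..D with their adjacency lists.
Web example_web()
{
    Web web;
    web["A"] = {"B", "C", "D"};
    web["B"] = {"A", "E", "F"};
    web["C"] = {"G"};
    web["D"] = {"A"};
    return web;
}
```

Calling bfs_order(example_web(), "A", 4) stops after the initial page and its three direct neighbours; a larger limit continues on to E, F and G.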
Given that we can extract a set of words from a web page,
we know what URL those words appeared on, and we
can produce support and confidence levels using Apriori,
design a simple database using SQL and a RDBMS
that allows one to model the following information:
keyword, site, url, support, confidence
and give an example query that, given
a keyword and support and confidence levels,
returns the sites and URLs that contain that
keyword with the desired support and confidence levels.
Site refers to the WWW address, such as www.njit.edu.