Ontological Classification of Web Pages

Download Report

Transcript Ontological Classification of Web Pages

Ontological Classification of Web Pages
Zafer Erenel
• Many users use search engines to locate and buy goods and
services (such as choosing a vacation).
• Web pages presented on the internet do not conform to any
data organization standard and search engines provide
primitive query capabilities for users to retrieve relevant data
[1].
• In addition to that, they do not list sites equally and are
inclined toward listing more popular pages. These tendencies
brush many web pages aside and leave a limited number of
alternatives to the users.
I have created a lightweight domain ontology that consists of
a taxonomic hierarchy and made use of it by an automated
agent to classify web pages on the internet.
• The automated agent discovers and classifies relevant pages
with the help of Yahoo and Google search engines.
Related Research
• Desai and Spink presented a clustering scheme that
groups documents into partially and substantially
relevant pages by using similarity measures and ranking
heuristics [2].
• They worked with the end-user queries (limited number
of terms) to obtain the relevance score. Instead, my
automated agent will act along with an established
ontology to discover and classify documents.
• Chiang, Chua, and Storey parsed snippets of returned
links to find the ratio of the number of matching terms to
rank the web pages for relevance [1].
• I believe snippets consist of a very few number of words
and we cannot judge the web page on the basis of
snippets. My agent scours the entire web page which is
more time-consuming but more effective
• Yahoo and Google search results contain scores of links.
• My lightweight domain ontology consists of 7 branches. Each
branch is comprised of predetermined terms.
• Score of the web page increases in a certain branch as the
agent comes across these predetermined terms on the html
code.
• I’ve chosen country ontology because internet users’ interest
in a certain country can be quite high.
• The ranks of web pages in each cluster
will clarify their content to the user.
• In addition to that, we can compare result
sets of different search engines (Yahoo
and Google) for the same queries and find
complement and intersection of their result
sets to have a clear understanding of
search engines’ behaviors.
• I’ve used web stream classes in C# Programming language to
create my agent.
• A WebRequest is an object that requests a Uniform Resource
Identifier (URI) such as the URL for a web page [3].
• You can use a WebRequest object to create a WebResponse
object that will encapsulate the object pointed to by the URI.
• Once you get the actual object (e.g., a web page) pointed to
by the URI, what you get back is a stream of the web page.
• I used this capability for reading a page from a site to extract the
information I need. I have created two web requests using search
syntax given below
http://www.google.com/search?q=cyprus+vacation+travel&lr=&start=0&sa=N
http://search.yahoo.com/search?p=cyprus+vacation+travel&b=1
• Google search engine has returned 200
URLs and I have created 200 web
requests to extract relevant information
from each web page.
• Yahoo search engine has been used in the
same manner to extract relevant
information.
• If we analyze price rankings, we come
across pages that have information about
student flights, travel insurances, vacation
package discounts, cheap flights and etc..
• If we analyze nature rankings, we come
across web pages that offer adventure and
etc.
• If I want to do scuba diving on my
vacation, I know that hawai and fiji are
among my options by looking at activities
rankings
• Interestingly enough, in the top 100 search
lists, the number of web pages that both
appear on Google and Yahoo is 19.
• In the top 200 search lists, the number of
web pages that both appear on Google
and Yahoo is 24.
• As a result, ontologically organized clusters of
web sites that are offering information about a
given country regarding vacation and travel
alternatives serve our objective to a greater
extent in finding what we are in search of.
• In my work, I have used 2 search engines and a
single ontology. Search Engines’ shortcomings
can be prevented by combining multiple engines
with multiple ontologies to ease the search for
most needed information on the internet.
• Venn Diagrams prove that a specific search
engine is not very effective by itself.
References
[1] R.H.L. Chiang, C.E.H. Chua, V.C. Storey, A smart web
query method for semantic retrieval of web data, Data &
Knowledge Engineering 38 (2001) 63-84.
[2] M. Desai, A. Spink, An algorithm to cluster documents
based on relevance, Information Processing and Management
41 (2005) 1035-1049.
[3] Liberty, J., Programming C#,3rd ed. O’REILLY, 2003.