Bring Colors to Life - University of North Texas


Building an Intelligent Web: Theory and Practice
Rajendra Akerkar
Pawan Lingras
- University of North Texas
- DSCI 5240 Fall 2012
- Graduate Presentation
- Option A
Slides Modified From 2008 Jones and Bartlett Publishers, Inc. Version
Tankertanker Design
OUTLINE
Introduction
Crawlers
Search Engine
Queries
INTRODUCTION
Web content mining
Uses of Web-content mining techniques
Problems with Web data
Two approaches to Web-content mining
INTRODUCTION
Web Content
Uses of Web-content mining techniques
o Web-content mining techniques are used to discover useful information from content on the Web.
o Some Web content is generated dynamically using queries to database management systems.
o Other Web content may be hidden from general users.
INTRODUCTION
Problems with Web data
Prob.1 Distributed data
Prob.2 Large volume
Prob.3 Unstructured data
Prob.4 Redundant data
Prob.5 Quality of data
Prob.6 High percentage of volatile data
Prob.7 Varied data
INTRODUCTION
Two approaches to Web-content mining
Agent-based
• Software agents perform the content mining
Database-oriented
• View the Web data as belonging to a database
CRAWLERS
Crawling process
A crawler is a computer program that navigates the hypertext structure of the Web.
It builds an index by visiting a number of pages and then replaces the current index.
- Begin with a group of URLs
- Traverse breadth-first or depth-first
- Extract more URLs
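The crawling steps above (start from a group of URLs, traverse breadth-first or depth-first, extract more URLs) can be sketched roughly as follows. This is a minimal illustration, not the method from the slides: the `TOY_WEB` link table, the `crawl` function, and the `toy_fetch_links` helper are all hypothetical names, and an in-memory dictionary stands in for real HTTP fetching and link parsing.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
# A real crawler would fetch pages over HTTP and parse links instead.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its outgoing links."""
    return TOY_WEB.get(url, [])

def crawl(seeds, fetch_links, breadth_first=True, max_pages=100):
    """Visit pages reachable from the seed URLs and return them in visit order."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        # A queue (popleft) gives breadth-first; a stack (pop) gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip URLs that are already queued or visited
                seen.add(link)
                frontier.append(link)
    return visited

bfs_order = crawl(["a.example/"], toy_fetch_links, breadth_first=True)
dfs_order = crawl(["a.example/"], toy_fetch_links, breadth_first=False)
print(bfs_order)
print(dfs_order)
```

The only difference between the two traversal orders in this sketch is which end of the frontier the next URL is taken from.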
Numerous crawlers
- Problem of redundancy
- Web partition: one robot per partition
Context Graph
- Focused crawling has proposed the use of context graphs, which in turn led to the context focused crawler (CFC).
- The CFC performs crawling in two steps
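One possible way to realize "one robot per partition" and avoid redundant visits is to assign each host name to exactly one robot. The sketch below assumes hash-based partitioning by host, which is only one option and is not stated in the slides; `partition_for` and `assign` are hypothetical helper names.

```python
import hashlib
from urllib.parse import urlsplit

def partition_for(url, num_robots):
    """Map a URL to a robot index using a stable hash of its host name."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_robots

def assign(urls, num_robots):
    """Group URLs so that each robot receives only the hosts in its partition."""
    buckets = {i: [] for i in range(num_robots)}
    for url in urls:
        buckets[partition_for(url, num_robots)].append(url)
    return buckets

urls = [
    "http://a.example/index.html",
    "http://a.example/about.html",
    "http://b.example/",
    "http://c.example/news",
]
buckets = assign(urls, num_robots=3)
```

Because the partition depends only on the host, two pages from the same site always go to the same robot, so no two robots crawl the same page.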
CRAWLERS
Focused Crawler
• Generally recommended because of the large size of the Web
• Visits pages related to topics of interest
Two major parts
• The focused crawler structure consists of two major parts: the distiller and the hypertext classifier
Priority-based structure
• The pages that the crawler visits are selected using a priority-based structure, managed by the priority that the classifier and the distiller associate with pages
Documents
• Sample documents are identified and classified based on a hierarchical classification tree
• These documents are used as the seed documents to begin the focused crawling
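The priority-based structure described above can be illustrated with a priority queue. This is a simplified sketch under stated assumptions: the toy `PAGES` table, the `TOPIC_TERMS` set, and the `relevance` function (a keyword-overlap stand-in for the hypertext classifier) are all hypothetical, and the distiller is not modeled at all.

```python
import heapq

# Hypothetical toy pages: URL -> (page text, outgoing links).
PAGES = {
    "seed": ("web mining overview", ["sports", "crawler", "index"]),
    "sports": ("football scores and results", ["scores"]),
    "crawler": ("web crawler mining techniques", ["focused"]),
    "index": ("search index for web pages", []),
    "focused": ("focused web crawler mining", []),
    "scores": ("latest football scores", []),
}
TOPIC_TERMS = {"web", "mining", "crawler"}

def relevance(text):
    """Stand-in for the classifier: fraction of topic terms present in the text."""
    words = set(text.split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)

def focused_crawl(seed, max_pages):
    """Visit up to max_pages pages, always taking the highest-priority URL next."""
    # heapq is a min-heap, so priorities are negated to pop the best page first.
    frontier = [(-1.0, seed)]
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        visited.append(url)
        score = relevance(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return visited

order = focused_crawl("seed", max_pages=4)
```

With a budget of four pages, this sketch reaches the on-topic pages first and never visits the off-topic "sports" branch.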
SEARCH ENGINE
Examples of search engines
Components of a search engine
Search engine mechanism
Responsibilities of Search Engines
SEARCH ENGINE
o Uses a ‘spider’ or ‘crawler’ that crawls the Web hunting for new or updated Web pages to store in an index.
o Basic components of a search engine:
• The spider: gathers new or updated information on Internet websites
• The index: stores information about many websites
• The search software: searches through the huge index to generate an ordered list of useful search results
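To make the index and search software components concrete, here is a minimal sketch of an inverted index with a simple ranked lookup. The `DOCS` dictionary is a hypothetical stand-in for pages gathered by the spider, and ordering results by raw term counts is an assumption chosen for brevity, not the ranking any particular search engine uses.

```python
from collections import defaultdict

# Hypothetical documents standing in for pages gathered by the spider.
DOCS = {
    "page1": "web mining finds useful information on the web",
    "page2": "a crawler visits web pages",
    "page3": "search software ranks results",
}

def build_index(docs):
    """Build an inverted index: term -> {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Return doc ids ordered by how often the query terms occur in them."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Sort by descending score, then by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(DOCS)
results = search(index, "web crawler")
```

The index is built once from the collected pages; each query then only touches the entries for its own terms instead of scanning every document.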
SEARCH ENGINE
Search engine mechanism
o The generic structure of all search engines is basically the same
o However, the search results differ from search engine to search engine for the same search terms
Responsibilities of Search Engines
o Document collection
• choose the documents to be indexed
o Document indexing
• represent the content of the selected documents
• often two indices are maintained
o Searching
• translate the user's information need into a query
• retrieval
o Document and query management
• present the results
• virtual collection
QUERIES
o Three-tier process of translating the user's need into a search engine query:
• First level: the user formulates the information need into a question or a list of terms, using experience and vocabulary, and enters it into the search engine.
• Second level: the search engine must translate the words, with possible spelling errors, into processing tokens.
• Third level: the search engine must use the processing tokens to search the document database and retrieve the appropriate documents.
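The second level, turning typed words with possible spelling errors into processing tokens, might look like the sketch below. It assumes a small hypothetical `VOCABULARY` of indexed terms and uses the standard library's `difflib.get_close_matches` for approximate matching; real search engines are likely to use more elaborate normalization and correction.

```python
import difflib
import re

# Hypothetical vocabulary of terms known to the index.
VOCABULARY = ["crawler", "engine", "index", "mining", "search", "web"]

def to_processing_tokens(query, vocabulary=VOCABULARY):
    """Lowercase and split the query, then map each word to its closest known term."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        if word in vocabulary:
            tokens.append(word)
            continue
        # get_close_matches returns the best matches above the similarity cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
        # Words with no close match are kept unchanged.
        tokens.append(matches[0] if matches else word)
    return tokens

tokens = to_processing_tokens("Web minning serch engine")
```

The resulting tokens are what the third level would then look up in the document database.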
QUERIES
Boolean Queries
Boolean logic queries connect words in the search using operators such as AND or OR.
Natural Language
In natural language queries, the user frames the search as a question or a statement.
Thesaurus Queries
In a thesaurus query, the user selects the term from a set of terms predetermined by the retrieval system.
Fuzzy Queries
Fuzzy queries reflect no specificity.
Term Searches
The most common type of query on the Web is when a user provides a few words or phrases for the search.
Probabilistic Queries
Probabilistic queries refer to the way in which the IR system retrieves documents according to relevance.
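As an illustration of Boolean queries, the operators AND and OR can be read as set intersection and union over the documents that contain each term. The `POSTINGS` table and the `boolean_query` function below are hypothetical and only sketch the idea; they do not handle nested expressions or other operators.

```python
# Hypothetical posting sets: term -> ids of documents containing that term.
POSTINGS = {
    "web": {1, 2, 4},
    "mining": {1, 3},
    "crawler": {2, 4},
}

def boolean_query(terms, operator, postings=POSTINGS):
    """Combine the posting sets of the terms with AND (intersection) or OR (union)."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    if operator == "OR":
        return set.union(*sets)
    raise ValueError(f"unsupported operator: {operator}")

both = boolean_query(["web", "mining"], "AND")       # documents containing both terms
either = boolean_query(["mining", "crawler"], "OR")  # documents containing at least one term
```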
Thank you for your attention!