Chapters 26-36: extra notes about web basics and searching
IBE110: HTML document processing
concepts and searching on the Web
2015
Judith A. Molka-Danielsen
Document Processing
Hypertext Processing: In the 1990s we saw the
development of internetworks, and ubiquitous
interfaces (windows).
Tim Berners-Lee at CERN, the European particle
physics laboratory, created HTML and the URL
(Uniform Resource Locator) scheme so that a simple
standardized form of markup, based on SGML,
could be used to describe documents and a
naming scheme would allow for the universal
identification of documents.
So documents could be viewed in graphical
format and large collections linked across
multiple internets. This is hypertext processing.
Properties of Documents
Syntax - can express structure, presentation style,
semantics, and external actions. It can be implicit in the
contents of a document or expressed in a language.
Structure - tells how the elements relate to each other
within the document. A structural element like a section
can have a Formatting Style associated with it.
Presentation Style - is how the document is displayed
or printed. It can be embedded in the document, as
in TeX, or expressed through macros, as in LaTeX. Or it
can be defined separately, as CSS is for HTML documents.
Presentation style can be determined by the author (in
applications or languages) or the reader (Web browser).
Semantics - the meaning within a language, can be
associated with use.
Characteristics continued...
Metadata - information about the organization of the
data. Data about the data. Such as, author,
publication date, subject codes, etc.
What is Markup?
•Markup is everything in a document that is not content.
Typesetters used procedural markup to lay out instructions of
how a document should look. (16 pt bold Helvetica)
•Word processing software like Microsoft Word uses procedural
markup. It has a specific set of markup codes. The codes
apply to a single physical way of presenting information, such as
on a printed page. They do not define the appearance on other media
like CD-ROM or the Internet.
•Descriptive markup, or generic markup, describes the
structure of the document rather than the appearance. Content is
separate from style. You can publish on all media using the same
structure instruction set.
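As a rough illustration of descriptive markup, the sketch below stores only the structure of a document and lets two separate functions decide the appearance. The element names and both renderers are made up for this example, and the HTML renderer does not escape special characters.

```python
# Descriptive markup: the document records structure only, not appearance.
DOCUMENT = [
    ("title", "Web basics"),
    ("paragraph", "Markup is everything that is not content."),
]

def render_plain(doc):
    """One possible presentation: plain text for a terminal or printer."""
    lines = []
    for kind, text in doc:
        lines.append(text.upper() if kind == "title" else text)
    return "\n".join(lines)

def render_html(doc):
    """Another presentation of the same structure: HTML for a browser."""
    tags = {"title": "h1", "paragraph": "p"}
    return "".join(f"<{tags[kind]}>{text}</{tags[kind]}>" for kind, text in doc)

print(render_plain(DOCUMENT))
print(render_html(DOCUMENT))
```

The content is written once; only the renderer changes per medium.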
SGML
SGML (Standard Generalized Markup Language, ISO
8879, 1986), specifies a standard method for describing
the structure of the document. Structural elements are
for example: title, chapter, paragraph. It is an extensible
Meta Language. It can support an infinite variety of
document structures like: information bulletins, technical
manuals, parts catalogs, design specifications, reports,
letters, memos.
The Document Type Definition (DTD) describes the
structure of the document. (like a database schema in a
database). The DTD provides a framework of elements
(chapters, headers). The DTD specifies rules for the
relationship between elements, i.e. a chapter header must
come after the start of a chapter. A document instance is a
document whose content is tagged in conformance with
a DTD. A DTD can be applied throughout the whole
organization.
SGML continued
SGML uses tagging to identify the content's position
within a DTD structure. So we insert tags around the
content. You can nest elements. A parser program
verifies that a document follows the rules of a DTD.
The parser checks if the document is structurally
correct.
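The idea of a parser checking a document against structural rules can be sketched with a toy content model. This is not SGML or DTD syntax and not a real SGML parser: the `CONTENT_MODEL` table and the `validate` function are hypothetical, and a document is represented as nested `(name, children)` tuples.

```python
# Toy content model: which child elements each element may contain.
# A hypothetical stand-in for a DTD, not real SGML syntax.
CONTENT_MODEL = {
    "book": ["chapter"],
    "chapter": ["header", "paragraph"],
    "header": [],
    "paragraph": [],
}

def validate(element, model=CONTENT_MODEL):
    """Return a list of rule violations for a nested (name, children) tuple."""
    name, children = element
    if name not in model:
        return [f"unknown element <{name}>"]
    errors = []
    for child in children:
        child_name = child[0]
        if child_name not in model[name]:
            errors.append(f"<{child_name}> is not allowed inside <{name}>")
        errors.extend(validate(child, model))
    return errors

good = ("book", [("chapter", [("header", []), ("paragraph", [])])])
bad = ("book", [("header", [])])  # header outside of any chapter
print(validate(good))
print(validate(bad))
```

An empty list means the document instance is structurally correct under these rules.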
Documents can be ported to different formats for
different output media (printer, screen, CD-ROM,
speaker, TV).
Style is usually handled separately by style sheets,
like Cascading Style Sheets (CSS).
HTML (first version in 1992) is a tagging language that can be used on
the World Wide Web for text formatting and linking documents. It adopts
the syntax of SGML and is an application of SGML described by a
particular DTD. HTML is not an extensible language. Authors cannot add
their own tags. HTML supports style sheets written in the CSS language
to define the look and layout of text and other materials (color, font,
layout for web pages).
HTML can embed scripts written in languages such as JavaScript
which affect the behavior of HTML web pages.
The World Wide Web Consortium (W3C), maintainer of both the HTML
and the CSS standards, has encouraged the use of CSS over explicit
presentational HTML since 1997.
HTML5: a cross-platform basis for mobile applications, with support for
more file types. The first public working draft appeared in 2008, and it
became a W3C Recommendation in October 2014.
http://www.w3schools.com/html/html5_intro.asp
Potential of HTML5: http://learningcircuits.blogspot.no/2011/12/what-dowe-mean-when-we-say-html5.html
Element reference list: http://www.w3schools.com/tags/
Positive comments on HTML
HTML uses tags to separate content (text)
from format (structure, appearance).
It lets amateurs control markup (good and
bad)
HTML tags were used for appearance
formatting, but little attention was paid to
content structuring.
Negative comments on HTML
HTML did not offer enough custom control over the
WYSIWYG environment.
Things looked different in different browsers (reader
interpreted, not author interpreted).
Navigating through hypertext requires user memory.
Designing hypertext (document collections) for easy
searching is hard to do.
Comments on CSS
Cascading Style Sheets helped HTML by
moving the format information once carried by
tags like <font> and <b> into the style
sheet.
That lets tags like <header> carry structure
information.
CSS is a styling tool that can work with other
markup languages like XML.
Current version is CSS3
Comments on CSS continued
The Document
Formatting
• Structure
• Appearance
Content
•Information
•Data
Structure – HTML does this a little bit.
Appearance – or presentation. Before, HTML did this
with tags like <b>, but now all appearance
control should be taken out of HTML
documents and put in CSS or XSL files.
XML
XML (XML 1.0, 1998, Extensible Markup Language)
is also a meta language in that it describes other
languages. There is no predefined list of elements.
Elements are specified using a DTD or Schema. Also,
style sheets can be used to specify the output format
of each element (XSL).
XML is based on SGML but it is a subset and is
considered easier to program. Most current versions
of browsers can also display XML.
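A small sketch of what "no predefined list of elements" means in practice: the element names below (`catalog`, `part`, `name`, `price`) are invented for this example, and Python's standard `xml.etree.ElementTree` module reads them like any other XML. Note that this module only checks that the text is well-formed; it does not validate against a DTD or Schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are made up by the author,
# which is what "extensible" means here.
SOURCE = """
<catalog>
  <part id="p1"><name>Bolt</name><price currency="NOK">4.50</price></part>
  <part id="p2"><name>Nut</name><price currency="NOK">1.25</price></part>
</catalog>
"""

def part_prices(xml_text):
    """Map each part name to its price, read from author-defined elements."""
    root = ET.fromstring(xml_text.strip())
    return {
        part.findtext("name"): float(part.findtext("price"))
        for part in root.iter("part")
    }

print(part_prices(SOURCE))
```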
More on XML in a later lecture.
How do search engines work?
They create directories in different ways:
Human-powered directories: In 2001, Yahoo depended on humans for
listings. You submitted a description to them. The search looked for
matches in the descriptions submitted. Changing your web page
had no effect on your listing. A good site could also get reviewed
by others.
Crawler-based search engines: Most search engines are of this kind.
They create listings automatically. Indexes change periodically when
the crawler is run again. The crawlers must (1) crawl through web
pages, (2) make an index, and (3) rank the results.
Hybrid search engines: use both humans and crawlers to
produce directories or listings.
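The three steps of a crawler-based engine (crawl, index, rank) can be sketched on a tiny in-memory "web". Everything here is an assumption for illustration: the `PAGES` table, the example URLs, and the ranking rule (count of word occurrences), which is far simpler than what a real engine uses.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("smart data on the web", ["b.example/", "c.example/"]),
    "b.example/": ("data about data is metadata", ["c.example/"]),
    "c.example/": ("smart smart searching", []),
}

def crawl(start):
    """Step 1: follow links from a start page, visiting each page once."""
    seen, queue = [start], deque([start])
    while queue:
        url = queue.popleft()
        for link in PAGES[url][1]:
            if link in PAGES and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen

def build_index(urls):
    """Step 2: map each word to the pages containing it, with counts."""
    index = defaultdict(dict)
    for url in urls:
        for word in PAGES[url][0].split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, word):
    """Step 3: rank matching pages by how often the word occurs."""
    hits = index.get(word, {})
    return sorted(hits, key=lambda url: (-hits[url], url))

index = build_index(crawl("a.example/"))
print(search(index, "smart"))
```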
Search engine features
Crawling features:
deep crawl, frame support,
image maps,
robots,
meta tags,
link popularity,
paid inclusion
Indexing features:
full body text,
stop words,
meta descriptions,
meta keywords,
ALT text,
comments,
stemming
Ranking features:
meta tags boost ranking,
link popularity boost ranking,
direct hit boost ranking.
Spam features:
meta refresh (target pages take
visitors automatically to other pages
in a web site),
invisible text (text is same color as
background),
tiny text.
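Several of the features listed above are things a crawler reads out of the HTML itself. The sketch below uses Python's standard `html.parser` module to pull out meta descriptions and keywords, ALT text, and body text, and to flag a meta refresh. The class name `IndexableText` and its field names are invented for this example; real engines decide for themselves which fields to index.

```python
from html.parser import HTMLParser

class IndexableText(HTMLParser):
    """Collect fields an indexer might read; field names are illustrative."""

    def __init__(self):
        super().__init__()
        self.meta = {}                 # e.g. description, keywords
        self.alt_text = []             # ALT text of images
        self.body_words = []           # text outside <script>/<style>
        self.has_meta_refresh = False  # possible spam signal
        self._skip = 0                 # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            if (attrs.get("http-equiv") or "").lower() == "refresh":
                self.has_meta_refresh = True
            elif attrs.get("name") and attrs.get("content"):
                self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "img" and attrs.get("alt"):
            self.alt_text.append(attrs["alt"])
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.body_words.extend(data.split())

PAGE = """<html><head>
<meta name="description" content="Notes on web searching">
<meta http-equiv="refresh" content="0; url=other.html">
</head><body><img src="x.png" alt="crawler diagram"><p>Smart data</p></body></html>"""

parser = IndexableText()
parser.feed(PAGE)
print(parser.meta, parser.alt_text, parser.body_words, parser.has_meta_refresh)
```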
Meaning of full text search types
Keyword search - Accepts a list of words as criteria and matches a document that contains any of the
words. E.g. a keyword search for smart data matches a document that contains either smart or data.
Boolean search - Accepts a Boolean expression that states rules for the presence or absence of words
in a document. Matches a document in which the required words are present and the forbidden words
are absent. For example, a Boolean search for smart & data matches only documents that contain both
smart and data.
Phrase search - Accepts a list of words as criteria and matches a document that contains the words in
the stated order as a complete phrase. For example, a phrase search for smart data matches only
documents that contain the complete phrase smart data.
Proximity search - Accepts a list of words as criteria and matches a document that contains the words
in any order in close proximity. For example, a proximity search for smart data would match a document
that contains the phrase the data is smart.
Fuzzy search - Also known as pattern search. Tunes one of the previous search strategies by matching
slight variations on the words in the criteria list. For example, a fuzzy phrase search for smart data could
match a document that contains the variant phrase a smart datum.
Ranking - Also known as weighting. In a fuzzy search, determines the relevance of the document based
on the similarity of the match to the criteria. Documents with a higher ranking appear earlier in the result
list. For example, a fuzzy phrase search for smart data would rank a document containing the exact
phrase smart data higher than a document containing the variant phrase a smart datum.
Stop Words - Also known as noise words. Words that should be ignored in matches, such as a, the,
some, and other articles and prepositions. For example, if in and the are stop words, a phrase search for
smart data would match a document that contains the phrase smart in the data.
Synonyms - Also known as a thesaurus. Words that are equivalent for the documents in the repository.
For example, if smart and intelligent are synonyms, a phrase search for smart data would match a
document that contains the phrase intelligent data.
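The search types defined above can be compared side by side in a minimal sketch. The function names, the stop-word list, and the proximity window of four words are assumptions made for this example. Tokenizing is a plain lowercase split on whitespace, so punctuation is not handled.

```python
STOP_WORDS = {"a", "the", "in", "is"}  # assumed list for this example

def tokens(text, drop_stop_words=False):
    """Lowercase and split on whitespace, optionally dropping stop words."""
    words = text.lower().split()
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

def keyword_match(query, doc):
    """Keyword search: any query word is present."""
    return bool(set(tokens(query)) & set(tokens(doc)))

def boolean_and_match(query, doc):
    """Boolean search, 'smart & data' case: all query words are present."""
    return set(tokens(query)) <= set(tokens(doc))

def phrase_match(query, doc, drop_stop_words=False):
    """Phrase search: query words appear consecutively and in order."""
    q = tokens(query, drop_stop_words)
    d = tokens(doc, drop_stop_words)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

def proximity_match(query, doc, window=4):
    """Proximity search: all query words, any order, within `window` words."""
    q = set(tokens(query))
    d = tokens(doc)
    return any(q <= set(d[i:i + window]) for i in range(len(d)))

print(keyword_match("smart data", "big data"))              # any word
print(boolean_and_match("smart data", "big data"))          # all words
print(phrase_match("smart data", "the data is smart"))      # exact order
print(proximity_match("smart data", "the data is smart"))   # close together
print(phrase_match("smart data", "smart in the data", drop_stop_words=True))
```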
Other search engines besides Google
Examples of search engines:
AltaVista (Now Yahoo!)
HotBot
NorthernLight
Excite
Search Engine User Interface
Many search engines have advanced features that the general searcher
does not know how to use. The most commonly used features are
quotation marks and capitalization. (Show example case study in class.)
Important issues:
Query Interface: different by engine. In AltaVista (Yahoo!) a sequence
of words is a logical union. In HotBot it is an intersection.
Interface for complex queries: Boolean, phrase, proximity, wild
cards, filtering, special qualifications via date, language, url, title,
internet domain, file types.
Response Interface: 10 entries per page. Entry contains information
on: url, size, date indexed, some text.
Return options: the number of pages returned, maybe sorting by url
or date.
Crawling the Web
A ranking algorithm like PageRank can be used to rank the
relevancy of documents in a hit set. This algorithm can be used
to decide which page to visit next by web crawler programs.
Crawlers can traverse up to 10 million Web pages per day.
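To give a feel for how a link-based ranking like PageRank works, here is a simplified sketch, not the production algorithm of any search engine. The damping factor of 0.85 and the iteration count are conventional choices assumed here, and the function expects every linked page to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages  # a page with no links shares evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: a and b both link to c, c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

A crawler could use such scores to pick the highest-scoring unvisited page next.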
Traversal approaches:
Breadth first is to look at all pages linked by the current page,
then all pages linked by those pages, and so on.
Depth first is to follow the first link on the page and successive
pages and return up recursively. This is a narrow but deep
search.
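The two traversal approaches differ only in the order in which pages are visited. A minimal sketch on a made-up link graph (the `LINKS` table and page names are hypothetical):

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to, in page order.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def breadth_first(start):
    """Visit all pages one link away, then two links away, and so on."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def depth_first(start, seen=None):
    """Follow the first link as far as possible, then back up recursively."""
    seen = seen if seen is not None else set()
    seen.add(start)
    order = [start]
    for link in LINKS[start]:
        if link not in seen:
            order.extend(depth_first(link, seen))
    return order

print(breadth_first("home"))
print(depth_first("home"))
```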
Crawlers can use much bandwidth. Priorities and restrictions
might be set on their use.
Crawlers are also referred to as Spiders.
Ranking: how is it used by search engines
Criteria: location, frequency, meta tags, number of web pages indexed,
spamming controls, "off the page" link analysis, click-through ratings
The difference between the Web and DBMS ranking is that the Web
ranking can use hyperlink information. It can use the number of links
coming into a site, or the number of outward pointing links to other sites.
Authorities are pages that have many links pointing to them. They are
likely to be good sources of information on the searched topic.
The number of inward reference links can indicate the popularity of
the site, and perhaps this is reflective of the quality of the
information there.
Hubs are pages that have many links outgoing to other servers. They point
to pages with similar or related information.
Better authority pages come from incoming edges from good hubs.
Better hub pages come from outgoing edges to good authorities.
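The mutual definition of hubs and authorities can be computed by iterating the two rules until the scores settle. The sketch below is one simple way to do that, assuming a small link graph held in a dictionary. The function name `hits`, the iteration count, and the example pages are illustrative.

```python
import math

def hits(links, iterations=20):
    """Iteratively score hubs and authorities on a small link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 point outward, x and y are pointed to.
hub, auth = hits({"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []})
print(max(hub, key=hub.get), max(auth, key=auth.get))
```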
How users can improve searching on the web
Given the User Problems
user does not understand the meaning of searching
user does not know the rules (case, stemming) used by the search
engine and gets unexpected answers.
users have problems with Boolean logic
users find the engines slow, answer sets too large, not very relevant,
not up to date.
Techniques for Users to improve information retrieval
Start with a relevant page, use the keywords from that page
Use authors' personal Web pages
Pages on the topic already contain relevant references and links
use web directories to select a category for a starting point.
use search engines to improve the query formulation on a relevant set
of answers.
On Web Query Languages: Structured searches (using SQL-type
queries) only work on domains where the data is structured.
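As a small illustration of why structured queries need structured data, the sketch below loads a made-up table of pages into an in-memory SQLite database and queries it by field. The table layout, names, and rows are invented for this example. Free-text web pages offer no such columns to query.

```python
import sqlite3

# In-memory database with a hypothetical, fully structured table of pages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("a.example/", "Hansen", 2014),
        ("b.example/", "Olsen", 2015),
        ("c.example/", "Hansen", 2015),
    ],
)

def pages_by(author, year):
    """Return URLs whose author and year fields match exactly."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE author = ? AND year = ? ORDER BY url",
        (author, year),
    )
    return [url for (url,) in rows]

print(pages_by("Hansen", 2015))
```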
Size of the Web
http://www.netcraft.com/Survey/
(Netcraft survey)