LIS650 Web Site Architecture and Design


LIS650 lecture 0
Introductory lecture
Thomas Krichel
2004-11-07
today
• Administrative introduction to the course
• Talk about you
• Substantive introduction to the course. The
subject matter is not just about HTML
– web
• servers
• client
– XML
– HTML
• Fairly general but abstract
• Probably the toughest lecture in the course
course resources
• Course home page is at
http://wotan.liu.edu/home/krichel/lis650w04a
• Subscribe to class mailing list
https://lists.liu.edu/mailman/listinfo/cwp-lis650krichel
• Me. Do not hesitate to ask. Send me email. I will
usually answer to the class mailing list.
• I plan to come here on several days to counsel
students. I will announce all times publicly.
Students who are in need of extra tuition should
ask.
general assessment
• First quiz next lecture.
• If you miss a lecture, let me know in advance.
• In addition to the quizzes, we have
– the web site assessment
– the final web site
• Final grade is calculated by computer. Quizzes
go through a complicated discounting scheme. It
disregards the worst performance.
Web site assessment
• Look at the web site of a university Library and
Information Science department.
• A list is at http://informationr.net/wl/
• Write a text not describing, but commenting on
the web site.
• State the site URL so that I can look at it.
• Please try to keep your text short, no more than 2
pages.
• Ask others for opinions if you want.
the final web site
• Contents should be equivalent to a student essay.
• It should be a contribution to knowledge on a
topic.
• Personal sites are no longer allowed.
• Deadline to finish web site: one week after the
end of the last lecture.
• You will not be able to change your web site
between the deadline and the time that the grade
is issued.
course history
• Course was first run as an institute 2002-05-13
to 2002-05-17
• Title was “Webmastering I: the static web site”.
• To the curriculum committee, this title did not
sound academic enough.
• Since “Web Site Architecture and Design” is now
the full title, WeSAD (pronounced like “wizard”)
is the official abbreviation.
• Webmastering is still what we want to learn.
teaching WeSAD
• WeSAD combines many aspects:
– Authoring pages
– Work on the organization of data to fit onto pages
– Set display style of different pages
– Organize the contribution of data
– Maintain a technical web installation
• Some of them can be learned in a course, but
others can not.
• Emphasis has to be on learnable elements.
teaching philosophy
• Pointing and clicking in a piece of software is not
enough
• Explain underlying principles
• Promote standards
– XHTML 1.0
– CSS level 2.1
• Avoid proprietary software
WeSAD contents
• Deals with the maintenance of a passive web
site. Such a web site remains the same
whatever the user does with it.
• Topics include
– (x)html
– css
– site usability and information architecture, as far as
relevant for static web sites
– http, uri, web server
things this course does not do
• Forms: allow you to design forms that users fill
in. But you do not have the programming skills to
do something with the form.
• Frames: allow you to put several documents into
one physical document. Most experts advise
against them.
• We do not cover image maps.
• We don’t do some advanced CSS properties.
• Some other exotic features of HTML are
overlooked.
Other courses: webmastering II
• Deals with building active web sites.
– Users fill in a form
– Users submit the form
– The web server returns a page that is specific to the
request of the user.
• Teaches a language called PHP, which is widely
used to generate such web sites.
– Gets you introduced to computer programming
– Gets you to train analytical thinking.
other courses: webmastering III
• It deals with XML
– XML is a syntax to encode any kind of data.
– XML can be constrained to only allow certain types of
data (XML Schema)
– XML can be transformed to render the data in various
ways (XSLT)
• The aim is to achieve a separation of contents
and presentation of a web page.
• This is an advanced course. It covers both
Schema and Transformation
literature
• I work from the text of the official standard at
http://www.w3.org/TR/html4/
• You can work from any HTML book.
• The W3C is the standard making body for the
Web. Anything that they say is the standard.
• But some people don't behave according to the
standard.
The world wide web
The World Wide Web (Web) is a network of
information resources. The Web relies on three
mechanisms to make these resources readily
available to the widest possible audience:
– A uniform naming scheme for locating resources on
the Web (i.e. URIs).
– Protocols, for access to named resources over the
Internet (e.g., http).
– Hypertext, for easy navigation among resources (e.g.,
HTML).
URI introduction
• Every resource available on the Web -- HTML
document, image, video clip, program, etc. -- has an
address that may be encoded by a
Uniform Resource Identifier, or "URI".
• URIs typically consist of three pieces:
– The name of the mechanism used
• to access the resource
• or to otherwise “resolve” it
– The name of the machine hosting the resource.
– The name of the resource itself, given as a path.
example URI
• http://openlib.org/home/krichel
This URI may be read as follows: There is a
document available via the HTTP protocol,
residing on the site openlib.org, accessible via
the path "/home/krichel".
• mailto:krichel@openlib.org
This URI may be read as follows: There is email
user krichel in a domain openlib.org to whom
email may be sent.
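The three pieces of a URI can be pulled apart mechanically. The following is a minimal sketch using Python's standard urllib.parse module; the helper name describe and the dictionary keys are made up for illustration and are not part of any standard.

```python
from urllib.parse import urlsplit

def describe(uri):
    """Split a URI into the three pieces named on the slide (hypothetical helper)."""
    parts = urlsplit(uri)
    return {
        "mechanism": parts.scheme,  # e.g. "http" or "mailto"
        "machine": parts.netloc,    # empty for URIs such as mailto:
        "path": parts.path,         # the resource itself
    }

web = describe("http://openlib.org/home/krichel")
mail = describe("mailto:krichel@openlib.org")
print(web)
print(mail)
```

Note that a mailto URI has no separate machine piece: the whole address ends up in the path.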
Internet application protocols
• On the Internet, machines use different
application-level protocols to do things.
• Common protocols include
– http
– dns
– telnet
– smtp
– ssh
– ftp
• All of the ones cited are client/server protocols
– client issues a request
– server gives a response
• Each of them uses a different port. A port is a number
that tells the machine which program should handle the
incoming stream of data.
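As an illustration of the port idea, here is a small lookup table of the conventional default ports for the six protocols listed above. The table and the function name port_for are only a sketch; a real server can be configured to listen on other ports.

```python
# Conventional default port numbers for the protocols named on the slide.
DEFAULT_PORTS = {
    "ftp": 21,
    "ssh": 22,
    "telnet": 23,
    "smtp": 25,
    "dns": 53,
    "http": 80,
}

def port_for(protocol):
    """Return the conventional default port of a protocol (hypothetical helper)."""
    return DEFAULT_PORTS[protocol.lower()]

for name in sorted(DEFAULT_PORTS, key=DEFAULT_PORTS.get):
    print(name, port_for(name))
```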
http
• The web operates mostly on http, the hypertext
transfer protocol.
• The client software is run on the local PC that
you are using, called
– a web browser (not politically correct)
– a user agent (that's better)
• Our server is a piece of hardware called
wotan.liu.edu, “wotan” for short
– It runs the Debian GNU/Linux operating system on an
Intel architecture.
– It provides http daemon software that serves http
requests. The particular software is called Apache.
main features of http
• http is insecure. The contents of http transactions
(requests/responses) can be observed.
• http is stateless. Each transaction is self-contained.
Each transaction has no relationship to
the previous one.
• http has a limited vocabulary of requests and
responses. It is no good, say, for operating a
machine remotely.
• We therefore cannot use it to work on the
server.
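To make "self-contained" concrete, the sketch below builds the text of one minimal http request without sending it anywhere. The helper name build_get_request is hypothetical; the point is that everything the server needs (what is wanted, and from which host) travels inside the single request.

```python
def build_get_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # the machine the request is meant for
        "Connection: close",     # ask the server to close the connection afterwards
        "",                      # an empty line ends the headers
        "",
    ]
    return "\r\n".join(lines)

request = build_get_request("openlib.org", "/home/krichel")
print(request)
```

Nothing in the request refers to an earlier request, which is what stateless means here.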
working with a remote machine
• There are two traditional ways to work with a
remote machine
– issue commands to it
• used to be done with “telnet”
– transfer files to and from it
• used to be done with “ftp”
• Telnet and ftp servers are not available on
wotan.liu.edu. Telnet and ftp do not encrypt the
communication stream. Therefore they are not
secure.
communication with wotan
• The protocol that we use for communicating with
the server is the secure shell, ssh for short. It is
based on public-key cryptography.
• There are two PC programs commonly used as
ssh clients
– putty for issuing commands
– winscp for file transfer.
• winscp is the one we will use. It offers a range
of other facilities besides file transfer.
• Mac users should investigate a program called
“fugu”.
registration time
• As part of the course, you are being provided
with web space on the server wotan.liu.edu, at
the URL
http://wotan.liu.edu/~username
where username is a user name that you will
choose now.
• It is my intention to maintain this web space for
you into the foreseeable future.
• You should also choose a password, now.
• I will now register you.
free software
• I maintain the wotan.liu.edu server, but you can build
your own server if
– you have Internet access
– you have an old PC to spare
• All the server software, as well as putty and
winscp, is free and open-source. It is one of my
fundamental beliefs that free information should
run on free software.
• The library community can learn a lot from the
free software community.
• See my talk at http://openlib.org/home/krichel/
presentations/new_york_2003-11-07.ppt
installing winscp
• http://winscp.sourceforge.net/eng/download.php
has
– “installation package”, for use if you have administrator
rights on the machine where you are installing
– “application”, for use otherwise, i.e. to just download
and run the application
• At installation time, when/if asked about the
default interface, I suggest you use “Windows
explorer style”, rather than the default “Norton
commander style”. You can change that later, so
don't panic.
other stuff: installing “user agents”
• Download and install a recent version of at least
two browsers. I suggest
– Mozilla Firefox at http://www.mozilla.org/products/firefox/
– Opera at http://www.opera.com
– Netscape Navigator at
http://channels.netscape.com/ns/browsers/download.jsp
open a wotan session with winscp
• the host name is “wotan.liu.edu”
• give your user name
• click on “save”, then “ok”; this will save the
session
• you will be led to the list of saved sessions
• double click to open the session
• at first connection you will see a warning you
can ignore
• note:
– you can save the password as part of the session
– it is risky to do that in a public classroom
initial remote files on wotan
• a set of files starting with a dot.
– These are places where Linux Masters exert their black
magic.
– Leave them alone.
• a directory called public_html
– This is the place where web masters exert their magic.
You can go into that directory to see the files that you
have on your web site at the moment.
– There should be two files
• empty.html
• validated.html
public_html
• Imagine your user name is user and you have a file
called file in public_html.
• The web server will map requests to
http://wotan.liu.edu/~user/file to show the file
public_html/file.
• Here user stands for your user id, and file is the
file name, and “/” is the directory separator.
• If file ends with “.html” or “.htm”, the web browser
will be told that the file is an HTML file. It will be
rendered accordingly by the browser.
index.html
• The web server on wotan will map requests to
http://wotan.liu.edu/~user to show the file
public_html/index.html
• If this file is not there, the server will prepare an
HTML document from the list of files that it finds
in the directory and send it to the user agent.
• Once you have a file index.html, the web user
can no longer see the individual files in your
directory.
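The mapping from request to file described on the last two slides can be sketched as a small function. This is only an illustration: the function name map_request is invented, and the location of home directories (/home) is an assumption, since the slides do not say where wotan keeps them. The sketch also leaves out the directory listing that the server prepares when index.html is missing.

```python
import posixpath

def map_request(url_path, home_root="/home"):
    """Map a request path such as /~user/file to a file under public_html.

    home_root is an assumption; the real server configuration may differ.
    """
    if not url_path.startswith("/~"):
        raise ValueError("not a request for a user's web space")
    # Split "/~user/file" into the user name and the rest of the path.
    user, _, rest = url_path[2:].partition("/")
    if rest == "":
        rest = "index.html"  # a request for the directory itself
    return posixpath.join(home_root, user, "public_html", rest)

print(map_request("/~user/file"))
print(map_request("/~user"))
```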
HTML and XHTML
• HTML is the hypertext markup language
• HTML is a markup language that is widely used
on the Web.
• The latest, and probably last version of HTML is
at http://www.w3.org/TR/html4/
• The W3C, the standard making body for the
Web, have issued XHTML, a replacement of
HTML that is compatible with XML.
• We will work with XHTML.
SGML HTML XML
• You will probably have come across these terms.
• SGML was developed first. HTML and XML are
developed from SGML in different ways.
– HTML is an SGML application, defined by a DTD.
– XML is a simplified subset of SGML.
• One common thing here is the ML. It stands for
Markup Language.
• Markup is everything in a document that is not
content.
procedural/descriptive
• Markup can be given in two ways
• 1: Procedural
– Codes identify point size, style, font, etc.
– Usually only understood by defining tool
– Example: Microsoft Word
• 2: Descriptive
– Describes purpose of text within the document
– Chapter head, Paragraph, Section Head, TOC
– Structure and Style are kept separate
– Example: LaTeX, SGML
SGML
• Standard Generalized Markup Language
• Descriptive approach with three separate layers
– structure: types of information in document
– content: the information itself
– style: defines how to typeset the document
• Developed for the publishing industry by a group
of consultants.
• So complicated that no software implements it
fully.
• But an important idea that remains of it is the
document type definition.
Document Type Definition (DTD)
• The DTD describes an SGML document type. It is
written in its own declaration syntax, not as
element markup.
• Describes information the document handles, e.g.
– title
– chapter
• Relationships between fields e.g.
– a chapter contains sections
– a title comes at the top of the document
XML
• Since SGML is so complicated, it is not good for
use on the Web.
• So the W3C has issued XML, the eXtensible
markup language.
• Every XML document is SGML, but not the
opposite.
• Thus XML is like SGML but with many features
removed.
• XML defines the syntax that we will use in the
course. We have to study that syntax in some
detail.
XML elements
• XML is based on elements. There are basically
three ways of writing an element.
• The first way is to write <element/>.
• Here element is the name of the element.
• Such an element is called an empty element.
• Example:
<bang/>
• This is an empty element, the name of which is
“bang”.
non-empty elements
• If name is the name of the element, you can give
the element some contents by writing
<name>contents</name>.
• Here <name> is called a start tag. </name> is
called the end tag. Both tags surround the
contents of the element.
• Remember the previous slide? Then note that
<name/> is just a shortcut for <name></name>.
Examples
• <greeting>bonjour</greeting>
• <greeting>здравствуйте</greeting>
• <sentence>She says <greeting>hello</greeting>
to you.</sentence>
• <examples> <example>I koh Glos essa, und es
duard ma ned wei.</example><example>Ja
mogu esti staklo, i ne boli me. </example>
<example>Kristala jan dezaket, ez det minik
ematen.</example></examples>
attributes to elements
• Elements can have attributes. Here is an
element with two attributes
• <name attribute_name_one="value_one"
attribute_name_two="value_two"/>
• Here attribute_name_one and
attribute_name_two are attribute names and
value_one and value_two are attribute values.
The element itself is empty.
• Example: <greeting
language="french">bonjour</greeting>
more on attributes
• There can be no two attributes to the same
element with the same names.
• Attribute values are simple strings. You can not
have an element inside an attribute value.
• Attribute names are separated from their values
by the = sign.
• Attribute values can be enclosed in single or
double quotes. It does not matter. Double quotes
are more common, so I suggest you use those.
more examples
<poet born="1799" died="1837">
<name lang="ru">Александр Сергеевич
Пушкин</name>
<name lang="en">Alexander S. Pushkin</name>
<name lang="fr">Alexandre Pouchkine</name>
</poet>
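Elements and attributes written this way can be read by any XML tool. As a sketch, the following uses Python's standard xml.etree.ElementTree module on a shortened copy of the example above (only two of the name elements are kept); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the poet example from the slide.
doc = """<poet born="1799" died="1837">
<name lang="en">Alexander S. Pushkin</name>
<name lang="fr">Alexandre Pouchkine</name>
</poet>"""

poet = ET.fromstring(doc)

# Attributes of an element are available as a dictionary of strings.
print(poet.tag, poet.attrib)

# Collect the contents of each <name> child, keyed by its lang attribute.
names = {name.get("lang"): name.text for name in poet.findall("name")}
print(names)
```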
XML document
• An XML document is a piece of data that is
written in XML.
• But sometimes the author of a document makes a
mistake, and, in fact the XML is wrong in some
ways.
• If there is no mistake, the document is called well-formed.
• If a document is not well-formed, it really is not an
XML document.
some rules for well-formedness
• All elements must be properly nested. You can
only close the outer element after all inner
elements are closed. Examples
– <a><b></a></b> not well-formed
– <a><b></b></a> well-formed
• An attribute must have a value. Thus you can not
write <result abstract>... </result>. The value may
be empty like in <result abstract=''>...</result> or
<result abstract="">... </result>.
• You can not have element contents in attributes.
Thus you can not have
<structure note="<b>something</b>"> ...
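These rules can be checked by machine. A minimal sketch, assuming Python's standard xml.etree.ElementTree parser as the checker: the hypothetical function is_well_formed simply reports whether the parser accepts the text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if an XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

examples = [
    "<a><b></a></b>",                                    # badly nested
    "<a><b></b></a>",                                    # properly nested
    "<result abstract>x</result>",                       # attribute without value
    "<result abstract=''>x</result>",                    # empty value is fine
    '<structure note="<b>something</b>">x</structure>',  # element in attribute
]
for example in examples:
    print(is_well_formed(example), example)
```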
more rules for well-formedness
• There must be exactly one outermost element in the
document.
– It is called the root element.
– All other elements are nested inside the root.
– Whitespace that surrounds the root element is ignored.
– The root element may be preceded by a prologue. A
prologue is anything before the root element.
• An XML document can also contain things that
are not elements.
other things: comments
• In an XML document, you can make comments
about your code. These are notes to yourself.
• Comments start with <!--
• Comments end with -->
• Example: <!-- this is a comment -->
• Comments can not be nested.
• They can appear almost anywhere in the document, but not inside a tag.
• They can enclose elements.
other things: XML declaration
• The XML declaration is a special line that says
that what follows is XML and gives some very
basic information about that XML. It is trendy to
use it.
• It is optional, but if it is there it has to be on the
first line.
• You will need to have an XML declaration if your
character encoding is not UTF-8. We will come
back to this point later.
other things: XML declaration
• Normally the XML declaration looks like
• <?xml version="1.0" encoding="encoding"?>
• where encoding is the character encoding. By
default, the character encoding is UTF-8, so if you
use that, you do not need to mention it.
• There is now a version "1.1" of XML around, but
– it is not widely deployed
– it is not much different from version 1.0
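The effect of the encoding part of the declaration can be tried out. The sketch below, using Python's standard xml.etree.ElementTree, saves the same word in ISO-8859-1 with and without a declaration, and in UTF-8 without one. The element name word and the helper try_parse are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def try_parse(data):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

latin1_bytes = "café".encode("iso-8859-1")  # the é is the single byte E9

# ISO-8859-1 bytes with a declaration that names the encoding.
declared = (b'<?xml version="1.0" encoding="ISO-8859-1"?><word>'
            + latin1_bytes + b"</word>")
# The same bytes without a declaration: the parser assumes UTF-8.
undeclared = b"<word>" + latin1_bytes + b"</word>"
# UTF-8 bytes need no declaration.
utf8_plain = b"<word>" + "café".encode("utf-8") + b"</word>"

print(try_parse(declared))
print(try_parse(undeclared))
print(try_parse(utf8_plain))
```

Without the declaration the ISO-8859-1 bytes are not valid UTF-8, so the parser rejects the document.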
other stuff: document type declaration
• XML documents, like any SGML documents,
accept document type declarations.
• A document type declaration tells us something
about the vocabulary of elements and attributes
used in the document.
• It should appear before the root element, after the
XML declaration, if you have one.
• It takes the form <!DOCTYPE mumbojumbo >
• We will come back to the document type
declaration later.
HTML
• HyperText Markup Language
• HTML is defined by an SGML DTD, with elements for
– Head, Title, Body, Paragraph, etc.
– Headings, Bold, Italic, etc.
– Table, List, Image, etc.
– Links to other documents
– Forms
– and many others
HTML history
• HTML was a very bare-bones language when
first invented by Tim Berners-Lee. It did not
describe pages with much of a visual appeal.
• In the 90s, successful browsers invented
“extensions” that aimed to stretch the visual
boundaries of HTML.
• Some of these extensions found their way into the
official HTML spec issued by the W3C.
• Later the W3C developed style sheets as a way
to accommodate display requirements
without having to extend HTML.
HTML versions
• HTML 4.01 is the last version of HTML. This
version has different DTDs, including
– the loose DTD
– the strict DTD
• I only cover the elements of the strict DTD.
• The loose DTD has more elements, but all the
functionality of these elements is best done with
style sheets.
• Thus, the pages created with HTML only will
look rather boring.
• But we do cover style sheets later.
XHTML
• XHTML is HTML written in an XML syntax.
• Every XHTML document has to be well-formed
XML.
• HTML documents that are not XHTML can violate some
well-formedness constraints, for example
– HTML element names are not case sensitive
– some HTML elements do not need closing.
– there is no need for a single root element in an HTML
document.
XHTML: pain without gain?
• In this course we study XHTML.
• When I say HTML in the following, I mean
XHTML.
• Reasons to study XHTML rather than HTML
– syntactic rules of XML are easier to understand.
– any tool that can work with XML can be applied to
XHTML, but can not be applied to HTML.
– in general XML documents are more computer
understandable. This is crucial in the age of the search
engine.
Example HTML snippet
<a href="http://openlib.org/home/krichel"
title="homepage of Thomas Krichel">Thomas
Krichel</a>
– the whole thing is an <a> element. It creates an
anchor. (I use < and > to surround element names.)
– “href” is an attribute name
– “http://openlib.org/home/krichel” is the value of the
"href" attribute
(I surround attribute names with straight quotes)
– 'Thomas Krichel' is character data.
Characters: concept
• A character set combines two things
– Character repertoire: a set of characters, e.g. "A", "ﺾ",
"‼", "₣"
– Character code positions: defines a number for each
character in the repertoire.
• Character encoding is a way to encode the code
positions in bytes.
• To correctly display a document, the user agent
needs to know both!
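The difference between a code position and an encoding shows up in a few lines of Python. This is only an illustration with one character; the variable names are arbitrary.

```python
ch = "é"

# The code position: one number per character in the repertoire.
code_position = ord(ch)

# Two different encodings turn the same code position into different bytes.
utf8 = ch.encode("utf-8")
latin1 = ch.encode("iso-8859-1")

print(code_position, utf8, latin1)

# Decoding with the wrong encoding produces the wrong characters.
garbled = utf8.decode("iso-8859-1")
print(garbled)
```

This is why the user agent has to be told the encoding: the same bytes read with a different encoding give different text.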
playing safe with characters
• Only use the characters on the US keyboard,
don't insert symbols.
• Save as ASCII or UTF-8. All ASCII files are also
UTF-8 files.
• Never save as "Unicode" within MS Notepad.
• If you encounter a character that is not on your
keyboard, use an SGML entity.
• The SGML entity is the last special SGML thing
that we have to study.
SGML entities
• SGML entities are something like a way to
represent non-ASCII characters when only ASCII
input is possible.
• Entities can be written as &code;
– Ex. &eacute;
• Inserts an e with acute accent.
– this is called a character entity
– Codes are often abbreviations of the character names
• Codes can also be given as numbers, decimal or hexadecimal
• Ex. &#38; (decimal) or &#x26; (hexadecimal) to insert an ampersand
• this is called a numeric entity
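To see what a user agent does with such entities, one can use the unescape function from Python's standard html module, which replaces named and numeric entities with the characters they stand for. The sample strings below are illustrative.

```python
from html import unescape

named = unescape("&eacute;")          # character entity
decimal = unescape("&#38;")           # numeric entity, decimal
hexadecimal = unescape("&#x26;")      # numeric entity, hexadecimal
sentence = unescape("je suis Fran&ccedil;ais")

print(named, decimal, hexadecimal)
print(sentence)
```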
XHTML entities
• They are officially defined in three files that are
maintained by the W3C
– http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
– http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
– http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
• A sample line is
<!ENTITY ccedil "&#231;"> <!-- latin small letter c
with cedilla, U+00E7 ISOlat1 -->
• <!ENTITY is DTD speak for defining an entity
• it is followed by the entity name and the numeric form of the
entity
• the rest of the line is a comment, of course
entities used in XML
• There are three that you need to know and use.
– &lt; stands for <
– &gt; stands for >
– &amp; stands for &
• Every time you want to insert <, > or & in the
documents, you have to use the entities instead.
• Examples:
– krichel&#64;openlib.org
– je suis Fran&ccedil;ais
– Marks &amp; Spencers
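Replacing these three characters by hand is error-prone, so most XML tools offer a function for it. A small sketch using escape and unescape from Python's standard xml.sax.saxutils module; the sample strings are illustrative.

```python
from xml.sax.saxutils import escape, unescape

# escape() replaces &, < and > with the three entities.
shop = escape("Marks & Spencers")
comparison = escape("a < b > c")

# unescape() goes the other way.
tag = unescape("&lt;bang/&gt;")

print(shop)
print(comparison)
print(tag)
```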
another look at empty.html
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title></title>
<meta http-equiv="content-type"
content="text/html; charset=UTF-8"/>
</head>
<body></body>
</html>
empty.html dissected
• the <!DOCTYPE ... > is an SGML document type
declaration. It says that the document contains
XHTML of the “strict” flavor.
• The document type declaration is the only thing
that we have in the prologue. We could have
placed an XML declaration before it but chose not
to do so.
• <html> is the root element. It contains some other
elements. Some of these we discuss now, others
later.
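The structure of empty.html can also be inspected by a program. The sketch below feeds a copy of the file's text to Python's standard xml.etree.ElementTree parser, which accepts the document type declaration without fetching the DTD. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A copy of empty.html as shown on the slide.
EMPTY_HTML = """<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title></title>
<meta http-equiv="content-type"
content="text/html; charset=UTF-8"/>
</head>
<body></body>
</html>"""

root = ET.fromstring(EMPTY_HTML)

# The root element and the elements directly inside it.
children = [child.tag for child in root]
print(root.tag, children)

# The <meta> element is empty but carries two attributes.
meta = root.find("head/meta")
print(meta.attrib)
```

Because the file is well-formed XML, a generic XML parser is enough to read it.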
special topic: images
• The appeal of the web to the masses has a lot to
do with its capability to transport images.
• Image formats are independent of the web, but
there are two classic formats that are widely
supported by user agents.
– GIF
– JPEG
• There is also a more recent one, the portable
network graphic, PNG.
GIF
• stands for graphics interchange format.
• developed by CompuServe.
• patent issues around its compression method make
the format abhorred by the free software community.
• 256 colors maximum
• uses a loss-less compression technique
GIF has three tricks
• interlacing:
– when downloading the file, the browser can show
every eighth row first and fill in the other rows later
– user gets an idea of the picture before it is sharp
• transparency
– some GIFs are transparent, so you can see them on
top of already existing content
– technically, the GIF has one color as the background
color, and pixels of that color are ignored by the user
agent
• animation
– some GIFs are in fact sequences of GIFs that can be
rendered one after the other.
JPEG
• The Joint Photographic Experts Group is a
standard-making body for images
• JPEG files can support millions of colors.
• The compression is lossy, i.e. the JPEG file will
look like the original image, but not be the same.
• The compression does not work well with
drawings.
• There are no copyright and patent problems with
JPEG
Homework
• Look at course home page.
• Install winscp and browsers at home.
• Prepare a one-page max summary of the type of
website that you want to build, bring printed copy
with you next week.
• Prepare for quiz at the beginning of next lecture.
http://openlib.org/home/krichel
Please shut down the computers when
you are done.
Thank you for your attention!