introduction to the course and to XML

LIS650 lecture 0
Introductory lecture
Thomas Krichel
2005-09-10
today
• Today's contents
– Administrative introduction to the course
– Substantive introduction to the course
– Talk about you!
– Introduction to the web
– Introduction to XML
– Introduction to character sets
– A few words about images
• Fairly general, abstract and tough lecture. Can
lead to serious angst.
course resources
• Course home page is at
http://wotan.liu.edu/home/krichel/lis650w05a
• The class mailing list
https://lists.liu.edu/mailman/listinfo/cwp-lis650krichel
• Me. Send me email. Unless you request privacy, I
answer to the class mailing list. I come here on
several days to counsel students. I announce all
times on the mailing list.
general assessment
• First quiz next lecture.
• If you miss a lecture, let me know in advance.
• In addition to the quizzes, we have
– the web site assessment
– the final web site
• Final grade is calculated by computer. Quizzes
go through a complicated discounting scheme. It
disregards the worst performance.
web site assessment
• Look at the web site of a university Library and
Information Science department.
• A list is at http://informationr.net/wl/
• Write a text not describing, but commenting on
the web site.
• State the site URL; I will look at it.
• Try to keep your text short please, no more than 2
pages.
• Feel free to ask others for opinions.
the final web site
• Contents should be equivalent to a student essay.
• Good contents and good architecture are
important to a straight A.
• It should be a contribution to knowledge on a
topic.
• Personal sites are not allowed.
• Deadline to finish web site: one week after the
end of the last lecture.
• You will not be able to change your web site
between the deadline and the time that the grade
is issued.
course history
• Course was first run as an institute 2002-05-13
to 2002-05-17
• Title was “Webmastering I: the static web site”.
• To the curriculum committee, this title did not
sound academic enough.
• In 2003 “Web Site Architecture and Design”
(WebSAD) became the full title.
• In 2005 “Passive Web Site Architecture and
Design” became the title.
• WebSAD is what we basically learn.
teaching WebSAD
• WebSAD combines many aspects:
– Authoring pages
– Work on the organization of data to fit onto pages
– Set display style of different pages
– Organize the contribution of data
– Improve look and feel
– Maintain a technical web installation
• Some of them can be learned in a course, but
others can not.
• Emphasis has to be on learnable elements.
teaching philosophy
• Pointing and clicking in a piece of software is not
enough.
• Explain underlying principles.
• Promote standards
– XHTML 1.0
– CSS level 2.1
• Avoid proprietary software.
• Provide a reasonably rigorous introduction to
digital information.
LIS650 contents
• Deals with the maintenance of a passive web
site. Such a web site remains the same
whatever the user does with it. There is no
customization for different users or times.
• Topics include
– (x)html
– css
– site usability and information architecture, as far as
relevant for passive web sites
– http, uri, web server
things this course does not do
• Forms: allow you to design forms that users fill
in. But you do not have the programming skills to
do something with the form.
• Frames: allow you to put several documents into
one physical document. Most experts advise
against them.
• We do not cover image maps.
• We don’t do some advanced CSS properties.
• Some exotic features of HTML are overlooked.
Other course: LIS651
• Deals with building active web sites.
– Users fill in a form
– Users submit the form
– Web server returns a page that is specific to the
request of the user.
• Teaches a language called PHP, that is widely
used to generate such web sites.
– Gets you introduced to computer programming
– Gets you to train analytical thinking.
• Teaches relational databases to store and retrieve
information.
– Gets you to think about the structure of information.
world wide web
The World Wide Web (Web) is a network of
information resources. The Web relies on four
standards to make these resources readily
available to the widest possible audience:
– A uniform naming scheme for locating resources on
the Web (i.e. URIs).
– Protocols, for access to named resources over the
Internet (e.g., http).
– Hypertext, for easy navigation among resources (e.g.,
HTML).
– Vocabularies for types of objects on the Web (i.e.
MIME types)
URI introduction
• Every resource available on the Web -- HTML
document, image, video clip, program, etc. -- has an address that may be encoded by a
Uniform Resource Identifier, or “URI”.
• URIs typically consist of three pieces:
– The name of the mechanism used
• to access the resource
• or to otherwise “resolve” it
– The name of the machine hosting the resource.
– The name of the resource itself, given as a path.
example URI
• http://openlib.org/home/krichel
This URI may be read as follows: There is a
document available via the HTTP protocol,
residing on the Internet host openlib.org,
accessible via the path "/home/krichel".
• mailto:krichel@openlib.org
This URI may be read as follows: There is email
user krichel in a domain openlib.org to whom
email may be sent.
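The pieces of the two example URIs can be made visible with a few lines of Python. This is only an illustrative sketch using the standard urllib.parse module; it is not part of the course software.

```python
from urllib.parse import urlsplit

# Split the example http URI into its pieces.
web = urlsplit("http://openlib.org/home/krichel")
print(web.scheme)   # the mechanism used to access the resource
print(web.netloc)   # the machine hosting the resource
print(web.path)     # the resource itself, given as a path

# A mailto URI names a mechanism but no host part;
# the email address is carried in the path.
mail = urlsplit("mailto:krichel@openlib.org")
print(mail.scheme, mail.path)
```

Note that not every URI has all three pieces: the mailto example has no separate machine name.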
Internet application protocols
• On the Internet, machines use different
application-level protocols to do things.
• Common protocols include
– http
– smtp
– dns
– ssh
– telnet
– ftp
• All of the ones cited are client/server protocols
– client issues a request
– server gives a response
client and server
• The web operates on a client/server model
• The client software is run on the local PC that
you are using, called
– a web browser (not politically correct)
– a user agent (that's better)
• Our server is a piece of hardware called
wotan.liu.edu, “wotan” for short
– It runs the Debian GNU/Linux operating system on an
Intel architecture.
– It provides http daemon software that serves http
requests. The particular software is called Apache.
the http protocol
• http is the most widely used application level
protocol on the web.
• http is stateless. Each transaction is self-contained.
Each transaction has no relationship to
the previous one.
• http has a limited vocabulary of requests and
responses. It is no good, say, to operate a
machine remotely.
• http is insecure. The contents of http transactions
(requests/responses) can be observed.
• We therefore cannot use http to build web pages;
we need another protocol to put files on the server.
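To get a feeling for how limited the vocabulary of http is, here is a minimal sketch in Python that writes out the text of a GET request and reads the first line of a response. The function names are made up for this illustration, and the sample response is invented; a real server sends more headers.

```python
def build_get_request(host: str, path: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # the one header that HTTP/1.1 requires
        "Connection: close",      # ask the server to close after one response
        "",                       # an empty line ends the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(response: str) -> tuple[str, int, str]:
    """Split the first line of a response into version, status code and reason."""
    first_line = response.split("\r\n", 1)[0]
    version, code, reason = first_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("wotan.liu.edu", "/~user/test.html")
print(request)

# An invented response, only to show the shape of what a server sends back.
sample_response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>...</html>"
print(parse_status_line(sample_response))
```

Everything is sent as readable text, and nothing in the request refers to an earlier request. That is what "insecure" and "stateless" mean in practice.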
working with a remote machine
• There are two traditional ways to work with a
remote machine
– issue commands to it
• used to be done with “telnet”
– transfer files to and from it
• used to be done with “ftp”
• Telnet and ftp servers are not available on
wotan.liu.edu. Telnet and ftp do not encrypt the
communication stream. Therefore they are not
secure.
communication with wotan
• The protocol that we use for communicating with
the server is the secure shell, ssh for short. It is
based on public-key cryptography.
• There are two PC programs commonly used as
ssh clients
– putty for issuing commands
– winscp for file transfer.
• winscp is the one we will use. It offers a range
of other facilities besides file transfer.
• Mac users should investigate a program called
“fugu”.
important rule
• When you compose web pages, you use winscp.
• When you look at your own web pages, you use a
common web user agent.
• Never use winscp to look at your own web pages.
You will not rot in hell, but you will be confused.
user name & password
• You can choose your user name as a short form
of your own name.
• It should be all lowercase and cannot have
spaces.
• Your final project pages can be placed in a
subdirectory, say at
http://wotan.liu.edu/~user/project, where user is
your user name.
• We will worry about that later.
registration time
• As part of the course, you are being provided
with web space on the server wotan.liu.edu, at
the URL
http://wotan.liu.edu/~user
where user is a user name that you will choose
now.
• You may wish to make the user name some
short form of your name. Remember you will be
able to have that site for many years to come.
free software
• I maintain the wotan.liu.edu server, but you can build
your own server if
– you have Internet access
– you have an old PC to spare
• All the server software, as well as putty and
winscp, is free and open source. It is one of my
fundamental beliefs that free information should
run on free software.
installing winscp
• http://winscp.net/eng/download.php has
– “installation package”, for use if you have administrator
rights on the machine where you are installing
– “application”, for use otherwise, i.e. to just download
and run the application
• At installation time, when/if asked about the
default interface, I suggest you use “Windows
explorer style”, rather than the default “Norton
commander style”. You can change that later, so
don't panic.
other stuff: installing “user agents”
• Download and install a recent version of at least
two browsers. I suggest
– Mozilla Firefox at http://www.mozilla.org/products/firefox/
– Opera at http://www.opera.com
open a wotan session with winscp
• The host name is “wotan.liu.edu”.
• Give your user name.
• Click on “save” and then “ok”; this will save the
session.
• You will be led to the list of saved sessions.
• Double-click to open the session.
• At first connection you will see a warning you
can ignore.
• You can save the password as part of the
session. It is risky to do that in a public
classroom. You may want to do it at home.
initial remote files on wotan
• A set of files starting with a dot.
– These are places where Linux Masters exert their black
magic.
– Leave them alone.
• A directory called public_html
– This is the place where web masters exert their magic.
You can go into that directory to see the files that you
have on your web site at the moment.
– There should be two files
• validated.html
• main.css
– Do NOT double-click any file!
validated.html
• This is your model web page. You should leave it
alone and never change it.
• To create a new web page, right click (remember
never double-click) on validated.html, and choose
"duplicate" from the menu. Do not choose "copy".
• You will be asked to supply a name for the file.
Erase any contents in the dialog box, and then enter
the file name you want to create (say test.html).
Always have that file name end with ".html".
• You may be asked to give your password again.
• Did I say you should not double-click in winscp?
test.html
• In your test.html file, look for the
<p id="validator">
• Right before that string, insert
<div>Hello, world!</div>
• Save your file.
• Do not double-click test.html!
• Open a web user agent, point it to the URL
http://wotan.liu.edu/~user/test.html where user is
your user name.
public_html
• Imagine your user name is user and you have a file
called file in public_html.
• The web server will map requests to
http://wotan.liu.edu/~user/file to show the file
/home/user/public_html/file.
• Here user stands for your user name, and file is
the file name, and "/" is the directory separator.
• If file ends with ".html" or ".htm" the web browser
will be told that the file is an HTML file. It will be
rendered accordingly by the browser.
• This is done using the MIME type text/html.
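The mapping from URL to file, and from file name to MIME type, can be sketched in Python as follows. This is not how the Apache software on wotan is written; the function name url_to_file_path is made up for the illustration, and it only restates the rule described on this slide.

```python
import mimetypes
from urllib.parse import urlsplit


def url_to_file_path(url: str) -> str:
    """Map http://host/~user/file to /home/user/public_html/file."""
    path = urlsplit(url).path            # e.g. "/~user/test.html"
    if not path.startswith("/~"):
        raise ValueError("not a per-user URL")
    # Everything between "/~" and the next "/" is the user name.
    user, _, rest = path[2:].partition("/")
    return f"/home/{user}/public_html/{rest}"


file_path = url_to_file_path("http://wotan.liu.edu/~user/test.html")
print(file_path)

# The MIME type is guessed from the file name extension.
mime_type, _encoding = mimetypes.guess_type(file_path)
print(mime_type)
```

The user agent never sees the path on the server. It only sees the URL and the MIME type that comes back with the response.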
index.html
• The web server on wotan will map requests to
http://wotan.liu.edu/~user/ to show the file
/home/user/public_html/index.html
• If this file is not there, the server prepares an
HTML document from the list of files that it finds
in the directory. Then it sends it to the user
agent.
• Once you have a file index.html, the web user
can no longer see the list of files in your
directory.
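The decision the server makes for a directory request can be sketched like this. It is a simplified illustration in Python, run against a temporary directory; the function name is made up, and the real server produces a much richer listing.

```python
import tempfile
from pathlib import Path


def respond_to_directory_request(directory: Path) -> str:
    """Return index.html if it exists, otherwise a simple list of the files."""
    index = directory / "index.html"
    if index.is_file():
        return index.read_text(encoding="utf-8")
    names = sorted(p.name for p in directory.iterdir())
    items = "".join(f"<li>{name}</li>" for name in names)
    return f"<ul>{items}</ul>"


with tempfile.TemporaryDirectory() as tmp:
    public_html = Path(tmp)
    (public_html / "validated.html").write_text("<p>model page</p>", encoding="utf-8")
    (public_html / "main.css").write_text("", encoding="utf-8")
    # No index.html yet: the response is a listing of the files.
    listing = respond_to_directory_request(public_html)

    # With index.html in place, the listing is no longer produced.
    (public_html / "index.html").write_text("<p>welcome</p>", encoding="utf-8")
    with_index = respond_to_directory_request(public_html)

print(listing)
print(with_index)
```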
HTML and XHTML
• HTML is the hypertext markup language
• HTML is a markup language that is widely used
on the Web.
• The latest, and probably last, version of HTML is
at http://www.w3.org/TR/html4/
• The W3C, the standard making body for the
Web, have issued XHTML, a replacement of
HTML that is compatible with XML.
• We will work with XHTML. But we will call it
HTML by abuse of language.
SGML HTML XML
• You will probably have come across these terms.
• SGML was developed first. HTML and XML are
developed from SGML in different ways.
– HTML is an SGML application, defined by a DTD.
– XML is a simplified subset of SGML.
• One common thing here is the ML. It stands for
Markup Language.
• Markup is everything in a document that is not
content.
SGML
• Standard Generalized Markup Language
• Descriptive approach with three separate layers
– structure: types of information in document
– content: the information itself
– style: defines how to typeset the document
• Developed for the publishing industry by a group
of consultants.
• So complicated that no software implements it
fully.
• But an important idea that remains of it is the
document type definition.
Document Type Definition (DTD)
• A DTD describes a type of SGML document. It is
written in its own declaration syntax, not in element markup.
• Describes information the document handles, e.g.
– title
– chapter
• Relationships between fields e.g.
– a chapter contains sections
– a title comes at the top of the document
XML
• Since SGML is so complicated, it is not good for
use on the Web.
• So the W3C has issued XML, the eXtensible
Markup Language.
• Every XML document is SGML, but not the
opposite.
• Thus XML is like SGML but with many features
removed.
• XML defines the syntax that we will use in the
course. We have to study that syntax in some
detail.
XML elements
• XML is based on elements. There are basically
three ways of writing an element.
• The first way is to write <element/>.
• Here element is the name of the element.
• Such an element is called an empty element.
• Example:
<bang/>
• This is an empty element, the name of which is
“bang”.
non-empty elements
• If name is the name of the element, you can give
the element some contents by writing
<name>contents</name>.
• contents is often simple character data.
• Here <name> is called a start tag. </name> is
called the end tag. Both tags surround the
contents of the element.
• Remember the previous slide? Then note that
<name/> is just a shortcut for <name></name>.
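To check that the two ways of writing an empty element really mean the same thing, you can ask an XML parser. Here is a small sketch using Python's standard xml.etree.ElementTree module; any XML parser would do.

```python
import xml.etree.ElementTree as ET

short = ET.fromstring("<bang/>")
long = ET.fromstring("<bang></bang>")
greeting = ET.fromstring("<greeting>bonjour</greeting>")

# Both spellings of the empty element give the same result:
# a name, no character data, no child elements.
print(short.tag, short.text, len(short))
print(long.tag, long.text, len(long))

# A non-empty element carries its character data as text.
print(greeting.tag, greeting.text)
```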
Examples
• <greeting>bonjour</greeting>
• <greeting>здравствуйте</greeting>
• <sentence>She says <greeting>hello</greeting>
to you.</sentence>
• <examples> <example>I koh Glos essa, und es
duard ma ned wei.</example><example>Ja
mogu esti staklo, i ne boli me. </example>
<example>Kristala jan dezaket, ez det minik
ematen.</example></examples>
elements within elements
• Elements can have character data contents, but
also other elements as contents.
• An element that is contained in another element is
said to be a child of that other element.
• Example
– <menu><choice>Bibbelsches Bohnesupp mit
Quetschekuche</choice> or
<choice>Dibbellabbes</choice></menu>
• This is what is known as a nested structure.
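The nested structure of the menu example can be inspected with a parser. This is a small sketch using Python's standard xml.etree.ElementTree module, shown only to make the parent and child relationship visible.

```python
import xml.etree.ElementTree as ET

menu = ET.fromstring(
    "<menu><choice>Bibbelsches Bohnesupp mit Quetschekuche</choice> or "
    "<choice>Dibbellabbes</choice></menu>"
)

# Iterating over an element yields its child elements in document order.
for child in menu:
    print(child.tag, "->", child.text)

# The character data between the two children (" or ") also belongs to menu.
# This particular library stores it as the "tail" of the first child.
print(repr(menu[0].tail))
```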
attributes to elements
• Elements can have attributes. Here is an
element with two attributes
• <name attribute_name_one="value_one"
attribute_name_two="value_two"/>
• Here attribute_name_one and
attribute_name_two are attribute names and
value_one and value_two are attribute values.
The element itself is empty.
• Example: <greeting
language="french">bonjour</greeting>
more on attributes
• Attribute names are separated from their values
by the = sign.
• There can be no two attributes to the same
element with the same name. So you cannot
have something like <trafficlight color="red"
color="green"/>
more on attributes
• Attribute values are simple strings. You can not
have an element inside an attribute value. Thus
you can not write, for example <meal
type="<cookie/>">chocolate</meal>
• Attribute values can be enclosed in single or
double quotes. It does not matter. Double quotes
are more common, so I suggest you use those.
more examples
<poet born="1799" died="1837">
<name lang="ru">Александр Сергеевич
Пушкин</name>
<name lang="en">Alexander S. Pushkin</name>
<name lang="fr">Alexandre Pouchkine</name>
</poet>
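A parser hands back attributes as simple name and value pairs. The following sketch reads the poet example with Python's standard xml.etree.ElementTree module; the variable names are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

poet = ET.fromstring(
    '<poet born="1799" died="1837">'
    '<name lang="ru">Александр Сергеевич Пушкин</name>'
    '<name lang="en">Alexander S. Pushkin</name>'
    '<name lang="fr">Alexandre Pouchkine</name>'
    "</poet>"
)

# The attributes of an element are a mapping from names to string values.
print(poet.attrib)

# Use the lang attribute of each child to pick the name in one language.
names_by_language = {name.get("lang"): name.text for name in poet}
print(names_by_language["fr"])

# Single quotes around an attribute value work just as well.
single = ET.fromstring("<greeting language='french'>bonjour</greeting>")
print(single.get("language"))
```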
XML document
• An XML document is a piece of data that is
written in XML.
• But sometimes the author of a document makes a
mistake, and the XML is in fact wrong in some
way.
• If there is no mistake, the document is called well-formed.
• If a document is not well-formed, it really is not an
XML document.
some rules for well-formedness
• All elements must be properly nested. You can
only close the outer element after all inner
elements are closed. Examples
– <a><b></a></b> not well-formed
– <a><b></b></a> well-formed
• An attribute must have a value. Thus you can not
write <result abstract>... </result>. The value may
be empty like in <result abstract=''>...</result> or
<result abstract="">... </result>.
more rules for well-formedness
• There must be one single element in the
document that all other elements are children of.
– It is called the root element.
– All other elements are called children of the root.
– Whitespace that surrounds the root element is ignored.
– The root element may be preceded by a prologue. A
prologue is anything before the root element.
• There can be other things in an XML document,
i.e. things that are not elements.
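The rules for well-formedness from the last two slides can be tried out with a parser, which refuses anything that is not well-formed. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper is_well_formed is a name made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<a><b></a></b>",                             # badly nested
    "<a><b></b></a>",                             # properly nested
    "<result abstract>...</result>",              # attribute without a value
    "<result abstract=''>...</result>",           # empty value is fine
    '<trafficlight color="red" color="green"/>',  # same attribute name twice
    "<a/><b/>",                                   # two roots
    "  <a/>\n",                                   # whitespace around the root
]
for sample in samples:
    print(is_well_formed(sample), repr(sample))
```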
other things: comments
• In an XML document, you can make comments
about your code. These are notes to yourself.
• Comments start with <!--
• Comments end with -->
• Example: <!-- this is a comment -->
• Comments can not be nested.
• Can appear anywhere in the document.
• They can enclose elements.
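Two of these points can be checked with a parser: a comment that encloses an element hides it, and a comment inside a comment is an error. This is a sketch using Python's standard xml.etree.ElementTree module, which by default drops comments when it builds the tree.

```python
import xml.etree.ElementTree as ET

# The comment encloses <b/>, so the parser never treats it as an element.
root = ET.fromstring("<a><!-- <b/> is switched off --><c/></a>")
tags = [child.tag for child in root]
print(tags)

# Comments cannot be nested: the parser rejects this document.
try:
    ET.fromstring("<a><!-- outer <!-- inner --> --></a>")
    nested_ok = True
except ET.ParseError:
    nested_ok = False
print(nested_ok)
```

Enclosing an element in a comment is a handy way to switch off a part of a page for a while.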
other things: document type declaration
• XML documents, like any SGML documents,
accept document type declarations.
• A document type declaration tells us something
about the vocabulary of elements and attributes
used in the document.
• It should appear before the root element, after the
XML declaration, if you have one.
• It takes the form <!DOCTYPE mumbojumbo >
• We will come back to the document type
declaration later.
nodes
• Elements, attributes, character data, etc. are all
things that are used in the XML document.
• "node" is a word to characterize everything that
can be put in the XML document.
• Thus an element is a node of type element. A
comment is a node of type comment. "Hello,
world!" is a node of type character data.
• Exercise: open the source code for your test file.
Show your neighbor all the nodes and tell her/him
what type they are.
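The same exercise can be done by a program. The sketch below uses Python's standard xml.dom.minidom module to walk a small document and name the type of each node; the function list_nodes and the sample document are made up for this illustration.

```python
from xml.dom import Node, minidom

NODE_TYPE_NAMES = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "character data",
    Node.COMMENT_NODE: "comment",
}


def list_nodes(node, depth=0, found=None):
    """Walk the tree and collect (depth, type, label) for every node."""
    if found is None:
        found = []
    kind = NODE_TYPE_NAMES.get(node.nodeType, "other")
    # Elements are labelled by their name, text and comments by their data.
    label = node.nodeName if node.nodeType == Node.ELEMENT_NODE else node.data
    found.append((depth, kind, label))
    for child in node.childNodes:
        list_nodes(child, depth + 1, found)
    return found


document = minidom.parseString("<div>Hello, world!<!-- a note --><br/></div>")
nodes = list_nodes(document.documentElement)
for depth, kind, label in nodes:
    print("  " * depth + kind + ": " + label)
```

In this library attributes are also nodes, but they are reached through the element they belong to rather than as its children, so the walk above does not list them.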
Characters: concept
• A character set combines two things
– Character repertoire: a set of characters e.g. "A", "ﺾ",
"‼", "₣"
– Character code positions: defines a number for each
character in the repertoire.
• Character encoding is a way to encode the code
positions in bytes.
• To correctly display a document, the user agent
needs to know both!
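The difference between a code position and an encoding can be seen in a few lines of Python. This is only an illustration; it uses the e with acute accent, written as an escape so that the example does not depend on how this text itself is saved.

```python
character = "\u00e9"   # e with acute accent

# The code position is a number assigned to the character.
code_position = ord(character)
print(code_position, hex(code_position))

# The encoding decides which bytes represent that number.
utf8_bytes = character.encode("utf-8")
latin1_bytes = character.encode("latin-1")
print(utf8_bytes, latin1_bytes)

# Reading the bytes with the wrong encoding displays the wrong characters.
wrong = utf8_bytes.decode("latin-1")
print(wrong)
```

The same character gives two bytes in UTF-8 and one byte in Latin-1, which is why the user agent has to be told which encoding was used.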
playing safe with characters
• Only use the characters on the US keyboard,
don't insert symbols.
• Save as ASCII or UTF-8. All ASCII files are also
UTF-8 files.
• Never save as "Unicode" within MS Notepad.
• If you encounter a character that is not on your
keyboard, use an SGML entity.
• The SGML entity is the last special SGML thing
that we have to study.
SGML entities
• SGML entities are a way to
represent non-ASCII characters when only ASCII
input is possible.
• Codes can be of the form &code;
– Ex. &eacute;
• Inserts an e with acute accent.
– this is called a character entity
– Codes are often abbreviations of the character names
• Codes can be in numeric form (decimal, or hex with an x)
• Ex. &#38; (decimal) or &#x26; (hex) to insert an ampersand
• this is called a numeric entity
XHTML entities
• They are officially defined in three files that are
maintained by the W3C
– http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
– http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
– http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
• A sample line is
<!ENTITY ccedil "&#231;"> <!-- latin small letter c
with cedilla, U+00E7 ISOlat1 -->
• <!ENTITY is DTD speak for defining an entity
• it is followed by the character form and the numeric form of the
entity
• the rest of the line is a comment, of course
entities used in XML
• There are three that you need to know and use.
– &lt; stands for <
– &gt; stands for >
– &amp; stands for &
• Every time you want to insert <, > or & in the
documents, you have to use the entities instead.
• Examples:
– krichel&#64;openlib.org
– je suis Fran&ccedil;ais
– Marks &amp; Spencers
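What the user agent does with these entities can be imitated in Python. The sketch uses the standard html.unescape function to resolve entities, and xml.sax.saxutils.escape to go the other way for the three characters that must always be written as entities.

```python
import html
from xml.sax.saxutils import escape

# Resolving entities: this is what the reader ends up seeing.
print(html.unescape("krichel&#64;openlib.org"))
print(html.unescape("je suis Fran&ccedil;ais"))
print(html.unescape("Marks &amp; Spencers"))

# Going the other way: replace &, < and > before putting text into XML.
print(escape("Marks & Spencers"))
print(escape("if a < b then b > a"))
```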
other examples
• Now look at two examples.
– Look at the source of your own validated.html. Interpret its
contents as XML.
– Look at
http://wotan.liu.edu/home/krichel/examples/xml/gradesheet.xml.
• First consider the rendered version. It illustrates the type of
XML data file that Thomas uses to compose his grades and
feed them into the computer.
• Second, consider the source code of the web page. Why are
there all these &lt; and &gt; ?
special topic: images
• The appeal of the web to the masses has a lot to
do with its capability to transport images.
• Image formats are independent of the web, but
there are three classic formats that are widely
supported by user agents.
– GIF
– JPEG
– PNG
• The resolution of the image is an important factor.
resolution
• On a pixel image the term resolution is often used
to say how many pixels are there horizontally and
vertically.
• The larger the number of pixels the wider it will
appear on the screen.
• But you will never know how large it is on the
screen because that depends on how many
pixels your user's screen draws per inch of
display.
• The web is a bad place for control freaks.
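The arithmetic behind this is a simple division. In the sketch below the image width of 720 pixels and the three screen densities are example values chosen for the illustration, not measurements of real screens.

```python
def width_on_screen_in_inches(width_in_pixels: int, pixels_per_inch: float) -> float:
    """Physical width of an image on a screen of the given pixel density."""
    return width_in_pixels / pixels_per_inch


# The same 720 pixel wide image on three different screens.
for pixels_per_inch in (72, 96, 144):
    width = width_on_screen_in_inches(720, pixels_per_inch)
    print(pixels_per_inch, "pixels per inch:", round(width, 2), "inches wide")
```

The author of the page fixes the number of pixels; the user's screen decides the inches.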
GIF
• stands for graphics interchange format.
• developed by CompuServe.
• unresolved copyright issues make the format
abhorred by the free software community.
• 256 colors maximum
• uses a loss-less compression technique
GIF has three tricks
• interlacing:
– when downloading the file, the browser can show
every fourth row first
– user gets an idea of the picture before it is sharp
• transparency
– some GIFs are transparent, so you can see them on
top of what already exists on the page
– technically, the GIF has one color as the background
color, and pixels of that color are ignored by the user
agent
• animation
– some GIFs are in fact sequences of GIFs that can be
rendered one after the other.
JPEG
• The Joint Photographic Experts Group is a
standard-making body for images
• JPEG images can support millions of colors.
• The compression is lossy, i.e. the JPEG file will
look like the original image, but not be the same.
• The compression does not work well with
drawings.
• There are no copyright and patent problems with
JPEG
Homework
• Look at the course home page.
• Install winscp and browsers at home.
• Prepare a one-page max web site plan. Bring a
printed copy with you next week.
• Prepare for quiz at the beginning of next lecture.
web site plan
• Who commissioned the web site?
• Whom is the site for?
• What pages will be on the site?
– Name each page.
– Establish hierarchy between pages.
• Any special technical challenges?
http://openlib.org/home/krichel
Please shut down the computers when
you are done.
Thank you for your attention!