URLs and Resources

Herng-Yow Chen
1
Outline


- Navigating the Internet’s resources
- URL syntax, and what the various URL components mean and do
- URL shortcuts that many web clients support: relative URLs and expanded URLs
- URL encoding and character rules
- Common URL schemes
- The future of URLs, including URNs
2
Navigating a resource by URL
A URL tells a web client three things:
1. URL scheme: how to access the resource
2. Server location: where the resource is hosted
3. Resource path: which particular local resource on the server is being requested

Example (a web page): http://english.csie.ncnu.edu.tw/demo/index.html
Scheme (how): http
Host (where): english.csie.ncnu.edu.tw
Path (what): /demo/index.html
3
URLs

URLs can direct you to resources available through protocols other than HTTP.
- An email account:
  mailto:[email protected]
- A file residing on an FTP server:
  ftp://ftp.ncnu.edu.tw/a_file.txt
- A video streamed by a video server:
  rtsp://www.cnn.com/headline.rm
Most URLs have the same “scheme://server location/path” structure.
4
URL Syntax

<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>
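As a rough illustration of the general syntax, the sketch below splits a URL into the nine components using Python's standard `urllib.parse` module. The URL itself (host `www.example.com`, user `joe`, and so on) is a made-up example, not one taken from these slides.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every component of the general syntax.
url = "http://joe:secret@www.example.com:8080/tools/list.html;type=d?item=12&color=red#prices"

parts = urlparse(url)

# Map the slide's component names onto the attributes urlparse exposes.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "frag": parts.fragment,
}

for name, value in components.items():
    print(f"{name:>8}: {value}")
```

Note that `urlparse` only separates the params of the last path segment; most real URLs carry only a few of these components.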
6
Scheme: what protocol to use

The scheme is really the main identifier of
how to access a given resource.

The scheme must start with an alphabetic
character, and it is separated from the rest
of the URL by the first “:” character.

Scheme names are case-insensitive.
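The scheme rules above can be sketched as a small helper. The function name `extract_scheme` is hypothetical, and the character set allowed after the first letter (letters, digits, "+", "-", ".") comes from the generic URI syntax rather than from this slide.

```python
import re
from typing import Optional

# A scheme starts with a letter, followed by letters, digits, "+", "-", or ".".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def extract_scheme(url: str) -> Optional[str]:
    """Return the lowercased scheme, or None if the URL has no valid scheme."""
    # The scheme is everything before the first ":" character.
    candidate, separator, _rest = url.partition(":")
    if not separator or not SCHEME_RE.fullmatch(candidate):
        return None
    # Scheme names are case-insensitive, so normalize to lowercase.
    return candidate.lower()


print(extract_scheme("HTTP://www.csie.ncnu.edu.tw/"))
print(extract_scheme("9http://www.csie.ncnu.edu.tw/"))
```

Lowercasing before comparison means "HTTP", "Http", and "http" all select the same protocol handler.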
7
Usernames and Passwords

Many servers require a username and password before you can access data through them.
For example:
- ftp://ftp.prep.ai.mit.edu/pub/gnu
- ftp://[email protected]/pub/gnu
- ftp://anonymous:[email protected]/pub/gnu
- http://joe:[email protected]/sales_info.txt
The default username and password, used when the URL supplies none:
- Username: “anonymous”
- Password: depends on the browser. Internet Explorer sends “IEUser”, while Netscape sends “mozilla”.
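A minimal sketch of how a client might apply these defaults, assuming a hypothetical helper `ftp_credentials` and a made-up default password; the example hosts are placeholders, not the ones on the slide.

```python
from urllib.parse import urlsplit


def ftp_credentials(url, default_password="guest@example.com"):
    """Return (username, password) for an FTP URL, applying anonymous defaults."""
    parts = urlsplit(url)
    if parts.username is None:
        # No user in the URL: fall back to anonymous access.
        return "anonymous", default_password
    # A password of None means the client would have to prompt for one.
    return parts.username, parts.password


print(ftp_credentials("ftp://ftp.example.com/pub/gnu"))
print(ftp_credentials("ftp://joe:secret@ftp.example.com/pub/gnu"))
```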
8
Hosts and Ports



The host component (an IP address or domain name) identifies the host machine on the Internet that has access to the resource.
The port component identifies the network port on which the server is listening.
Different services use different default ports on a machine:
- HTTP: 80
- FTP: 21
- Telnet: 23
- SMTP: 25
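The default-port rule can be sketched as a lookup that is used only when the URL omits an explicit port. The table mirrors the list above; `effective_port` is a hypothetical helper name, and port 8080 in the usage lines is an arbitrary example.

```python
from urllib.parse import urlsplit

# Default ports from the list above, keyed by lowercase scheme name.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23, "smtp": 25}


def effective_port(url):
    """Return the explicit port if present, otherwise the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Unknown schemes yield None rather than a guess.
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://www.csie.ncnu.edu.tw/course/1998.html"))
print(effective_port("http://www.csie.ncnu.edu.tw:8080/course/1998.html"))
```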
9
Paths


The path component of the URL specifies where on the server machine the resource lives.
The path often resembles a hierarchical filesystem path. For example:
http://www.csie.ncnu.edu.tw/course/1998.html
The path in this URL is “/course/1998.html”, which resembles a path on a UNIX filesystem.
The path component of an HTTP URL can be divided into path segments separated by “/”.
Each path segment can have its own params component (described later).
10
Parameters



For many schemes, a simple host and path to the object just aren’t enough.
Besides the port the server is listening on and the username and password that grant access to the resource, many protocols require more information to work.
For example:
ftp://ftp.ncnu.edu.tw/image.gif;type=a
ftp://ftp.ncnu.edu.tw/program.exe;type=i
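To illustrate how params attach to individual path segments, here is a small sketch that splits a path on "/" and then splits each segment on the first ";". The helper `split_path_params` and the second, multi-segment path are assumptions made for the example.

```python
from urllib.parse import urlsplit


def split_path_params(path):
    """Split a URL path into (segment, params) pairs, one per path segment."""
    pairs = []
    # Skip the empty string that precedes the leading "/".
    for segment in path.split("/")[1:]:
        name, _sep, params = segment.partition(";")
        pairs.append((name, params))
    return pairs


ftp_path = urlsplit("ftp://ftp.ncnu.edu.tw/program.exe;type=i").path
print(split_path_params(ftp_path))

# Hypothetical path in which every segment carries its own params.
print(split_path_params("/hammers;sale=false/index.html;graphics=true"))
```

For FTP URLs, "type=i" asks for a binary (image) transfer and "type=a" for an ASCII transfer.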
11
Query strings

Some resources, such as databases, can be queried according to input strings.
For example:
http://www.xxx.tw/a.cgi?id=123&name=abc
There is no required format for the query component, except that some characters are illegal. By convention, many gateways expect the query to be formatted as a series of “name=value” pairs, separated by “&” characters.
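The “name=value” convention is what the standard library's query helpers assume. A short sketch using the slide's example URL:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

url = "http://www.xxx.tw/a.cgi?id=123&name=abc"

# Everything after "?" (and before any "#") is the query component.
query = urlsplit(url).query

# Split the query into (name, value) pairs on "&" and "=".
pairs = parse_qsl(query)
print(pairs)

# The reverse direction: build a query string from name/value pairs.
rebuilt = urlencode(pairs)
print(rebuilt)
```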
12
Query Strings
[Figure: the client requests http://english.csie.ncnu.edu.tw/course/NWSMLViewer.php?lectureid=rctlee-20030909125212 over the Internet; the server passes the query “lectureid=rctlee-20030909125212” to the “viewer” gateway.]
13
Fragments

Some resources can be addressed at a finer granularity, such as individual sections of a large HTML document. For example:
http://engquiz.csie.ncnu.edu.tw/e-book/html/B001.html#page10
Because HTTP servers generally deal only with entire objects, not with fragments of objects, clients don’t pass fragments along to servers. That is, the whole object is retrieved, but only the relevant part is displayed.
Note that with the range request feature of HTTP/1.1, agents may request byte ranges of objects (covered in later lectures).
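A client-side sketch of this behavior: the fragment is stripped before the request is made and kept locally for positioning the display. It uses `urldefrag` from Python's standard library on the slide's example URL.

```python
from urllib.parse import urldefrag

url = "http://engquiz.csie.ncnu.edu.tw/e-book/html/B001.html#page10"

# Separate the part that goes to the server from the part the client keeps.
request_url, fragment = urldefrag(url)

print("sent to server:", request_url)
print("kept by client:", fragment)
```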
14
Fragments
(a) User selects a link to “http://www.csie.ncnu.edu.tw/~hychen/web_tech/#Resource”.
(b) Browser (client) makes a request over the Internet to http://www.csie.ncnu.edu.tw/~hychen/web_tech/ on www.csie.ncnu.edu.tw. The fragment is NOT sent to the server.
(c) Server returns the entire HTML page.
(d) Browser displays the HTML page, scrolling down to start at the named “Resource” fragment.
15
URL shortcuts






Web clients understand and use a few URL shortcuts:
- Relative URLs
  - Base URLs
  - Resolving relative references
- Expanded URLs
  Many browsers also support automatic expansion of URLs, where the user can type in a key (memorable) part of a URL, and the browser fills in the rest.
16
Relative URLs



URLs come in two flavors: absolute and relative.
So far, we have looked only at absolute URLs, which contain all the information you need to access a resource.
A relative URL, on the other hand, is incomplete. To get all the information needed to access a resource, a relative URL must be interpreted relative to another URL, called its base.
17
HTML snippet with relative URL
<HTML>
<HEAD> <TITLE> Joe’s Tools </TITLE> </HEAD>
<BODY>
<H1> Tools page </H1>
<H2> Hammers </H2>
<P> Joe’s HARDWARE online has the largest
selection of <A href="./hammers.html">
hammers </A> on earth.
</BODY>
</HTML>
18
Using a base URL
Base URL: http://www.joes-hardware.com/tools.html
Relative URL: ./hammers.html
New absolute URL: http://www.joes-hardware.com/hammers.html
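The conversion above can be reproduced with `urljoin` from Python's standard library. The first call uses the slide's own URLs; the `../images/logo.gif` reference and its base are made-up additions that show a parent-directory step.

```python
from urllib.parse import urljoin

base_url = "http://www.joes-hardware.com/tools.html"
relative_url = "./hammers.html"

# Resolve the relative reference against the base URL.
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)

# Hypothetical extra case: ".." climbs one directory level in the base path.
nested_base = "http://www.joes-hardware.com/tools/index.html"
print(urljoin(nested_base, "../images/logo.gif"))
```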
19
Base URLs

The first step in the conversion process is to find a base URL, which can come from a few places.
- Explicitly provided in the resource
  - Use the <BASE> tag to define the base URL.
- Base URL of the encapsulating resource
  - The resource does not explicitly specify a base URL.
  - Use the URL of the resource in which the document is embedded as the base, as in the example on the preceding slide.
- No base URL
  - In some instances, there is no base URL. This often means that you have an absolute URL; however, sometimes you just have an incomplete or broken URL.
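The first two sources of a base URL can be sketched with the standard library's HTML parser: look for a <BASE> tag, and fall back to the document's own URL when none is present. The class `BaseFinder`, the function `choose_base_url`, and the `/catalog/` base are hypothetical names and values for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseFinder(HTMLParser):
    """Record the href of the first <BASE> tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def choose_base_url(html_text, document_url):
    """Prefer an explicit <BASE> tag; otherwise use the document's own URL."""
    finder = BaseFinder()
    finder.feed(html_text)
    return finder.base_href or document_url


page = '<HTML><HEAD><BASE HREF="http://www.joes-hardware.com/catalog/"></HEAD></HTML>'
base = choose_base_url(page, "http://www.joes-hardware.com/tools.html")
print(urljoin(base, "./hammers.html"))
```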
20
Resolving relative references
21
Expanded URLs

Some browsers try to expand URLs automatically, either after you submit the URL or while you’re typing. This gives users a shortcut: they don’t have to type in the complete URL.
- Hostname expansion
  Ex: yahoo -> www.yahoo.com
- History expansion
  Ex: http://www.ncnu -> http://www.ncnu.edu.tw
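Real browsers use their own heuristics for this, so the following is only a toy sketch of the two kinds of expansion. Both function names and the history list are invented for the example.

```python
def expand_hostname(text):
    """Toy hostname expansion: wrap a bare word as www.<word>.com."""
    if "." in text:
        # Already looks like a full hostname; leave it alone.
        return text
    return f"www.{text}.com"


def expand_from_history(prefix, history):
    """Toy history expansion: return the first visited URL with this prefix."""
    for visited in history:
        if visited.startswith(prefix):
            return visited
    return None


# Hypothetical browsing history.
history = ["http://www.yahoo.com/", "http://www.ncnu.edu.tw"]

print(expand_hostname("yahoo"))
print(expand_from_history("http://www.ncnu", history))
```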
22
Shady characters in URLs



URLs were designed to be portable, to uniformly name all the resources on the Internet. This means that URLs will be transmitted through various protocols.
Because different protocols (schemes) use different transmission mechanisms, it is important that URLs can be transmitted safely, that is, without losing information, through any protocol on the network.
Some protocols, such as the Simple Mail Transfer Protocol (SMTP) for email, use a 7-bit encoding for messages; this can strip off certain characters if the source is encoded in 8 bits or more.
23
Shady characters in URLs



URLs are permitted to contain only characters from a relatively small, universally safe alphabet.
In addition to being transportable, URLs should be readable. Hence, invisible, nonprinting characters are also prohibited in URLs, even though these characters may pass through mailers.
To complicate matters further, URLs also need to be complete. People may want URLs to contain binary data or characters outside the universally safe alphabet, so an escape mechanism was added.
24
The URL Character Set



US-ASCII is very portable, due to its long legacy. It uses 7 bits to represent most keys available on an English typewriter and a few nonprinting control characters for text formatting and hardware signaling. But it doesn’t support the inflected characters common in European languages or the characters of non-Romanic languages.
URLs may also need to contain arbitrary binary data.
Escape sequences allow the encoding of arbitrary values using a restricted subset of the US-ASCII character set, yielding both portability and completeness.
25
Encoding mechanism


An unsafe character is represented by an “escape” notation: a percent sign (%) followed by two hexadecimal digits giving the character’s ASCII code.
For example:

Character  ASCII code  Example URL
~          0x7E        http://www.ncnu.edu.tw/%7Ehychen
Space      0x20        http://www.abc.com/web%20tools.html
%          0x25        http://www.abc.com/100%25satisfaction.html
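The escape notation is simple enough to write by hand, as the sketch below shows; `percent_encode_char` is a hypothetical helper. Python's standard `quote` and `unquote` do the same job for whole strings, although `quote` leaves "~" unescaped in recent Python versions because later URI specifications treat it as a safe character.

```python
from urllib.parse import quote, unquote


def percent_encode_char(ch):
    """Escape one character as '%' plus two uppercase hex digits per byte."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


# The three characters from the table above.
for ch in ["~", " ", "%"]:
    print(repr(ch), "->", percent_encode_char(ch))

# Library helpers applied to the paths from the example URLs.
print(quote("/web tools.html"))
print(quote("/100%satisfaction.html"))
print(unquote("/%7Ehychen"))
```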
26
Character Restrictions

Character          Reservation/Restriction
%                  Escape token
/                  Path delimiter
.                  Path component
..                 Path component
#                  Fragment delimiter
?                  Query-string delimiter
;                  Params delimiter
:                  Delimits the scheme, user/password, and host/port
$ , +              Reserved
@ & =              Reserved; special meaning in some schemes
{ } | \ ^ ~ [ ] `  Restricted; handled unsafely by various transport agents, such as gateways
< > "              Unsafe; should be encoded because they have meaning outside the scope of the URL
0x00-0x1F, 0x7F    Restricted; fall within the nonprintable range
>0x7F              Restricted; do not fall within the 7-bit range of US-ASCII
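As a rough check against the table above, the sketch below flags characters in a URL that would need escaping. The helper name `chars_needing_escape` is hypothetical, the space character is added as an assumption (it appears in the encoding examples rather than in the table), and "~" is included only because the table lists it; later specifications treat it as safe.

```python
# Restricted and unsafe punctuation from the table, plus the space character.
UNSAFE = set('{}|\\^~[]`<>" ')


def chars_needing_escape(url):
    """Return the distinct characters in url that should be percent-encoded."""
    flagged = []
    for ch in url:
        code = ord(ch)
        # Control characters, DEL, and anything outside 7-bit US-ASCII.
        out_of_range = code <= 0x1F or code == 0x7F or code > 0x7F
        if (out_of_range or ch in UNSAFE) and ch not in flagged:
            flagged.append(ch)
    return flagged


print(chars_needing_escape("http://www.abc.com/web tools.html"))
print(chars_needing_escape("http://www.ncnu.edu.tw/%7Ehychen"))
```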
27
Common scheme format







- http, https
- mailto
- ftp
- rtsp, rtspu
- file
- news
- telnet
28
The Future: URN?
STEP 1: Ask the resource resolver what the Joe’s Hardware URL is. Receive from the resolver the current location of the resource.
  Client -> purl.oclc.org (over the Internet): Get http://purl.oclc.org/jhardware/
  purl.oclc.org -> Client: Actual: http://www.joes-hardware.com/
STEP 2: Get the actual URL for the resource.
  Client -> www.joes-hardware.com (over the Internet): Get http://www.joes-hardware.com
29
URI
Universal Resource Identifier

URIs are defined in RFC 1630 (1994).
URI is a superset of URL and URN.

Full URI: proto://hostname/path (identifies the server)
  http://www.csie.ncnu.edu.tw:80/~hychen/

Partial URI: /path (no server mentioned)
  /~hychen/
30
URLs information

http://www.w3.org/Addressing/
  The W3C page about naming and addressing URIs and URLs.

http://www.ietf.org/rfc/rfc1738.txt
  RFC 1738, “Uniform Resource Locators (URL),” by T. Berners-Lee, L. Masinter, and M. McCahill.

http://www.ietf.org/rfc/rfc2396.txt
  RFC 2396, “Uniform Resource Identifiers (URI): Generic Syntax,” by T. Berners-Lee, R. Fielding, and L. Masinter.

http://www.ietf.org/rfc/rfc2141.txt
  RFC 2141, “URN Syntax,” by R. Moats.

http://purl.oclc.org
  The persistent uniform resource locator web site.

http://www.ietf.org/rfc/rfc1808.txt
  RFC 1808, “Relative Uniform Resource Locators,” by R. Fielding.
31