ppt - Duke Database Devils

HTTP and the Dynamic Web
How does the Web work?
The canonical example in your Web browser
Click here
“here” is a Uniform Resource Locator (URL)
http://www-cse.ucsd.edu
It names the location of an object on a server.
[courtesy of Geoff Voelker]
In Action…
[Figure: Client and Server exchanging HTTP messages for http://www-cse.ucsd.edu]
• Client uses DNS to resolve the name of the server (www-cse.ucsd.edu)
• Establishes an HTTP connection with the server over TCP/IP
• Sends the server the name of the object (null)
• Server returns the object
[Voelker]
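The four steps above can be sketched in Java with nothing but sockets. This is a minimal illustration, not a full client: the class name HttpGetSketch and its helpers are hypothetical, port 80 is assumed when the URL gives none, and the Host header is an addition that HTTP 1.0 does not require. An empty object name is sent as "/".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class HttpGetSketch {
    // Build the text of an HTTP/1.0 GET request for the object named by the URL.
    static String buildRequest(URI url) {
        String path = url.getRawPath();
        if (path == null || path.isEmpty()) path = "/";   // empty object name
        if (url.getRawQuery() != null) path += "?" + url.getRawQuery();
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + url.getHost() + "\r\n"          // optional in HTTP 1.0
             + "\r\n";
    }

    // Assumption: plain HTTP on port 80 unless the URL says otherwise.
    static int portOf(URI url) {
        return url.getPort() == -1 ? 80 : url.getPort();
    }

    // Performs a real network exchange; returns the raw response bytes.
    static byte[] fetch(URI url) throws IOException {
        InetAddress addr = InetAddress.getByName(url.getHost());      // 1. DNS lookup
        try (Socket s = new Socket(addr, portOf(url))) {               // 2. TCP connection
            OutputStream out = s.getOutputStream();
            out.write(buildRequest(url).getBytes(StandardCharsets.US_ASCII)); // 3. send object name
            out.flush();
            InputStream in = s.getInputStream();                       // 4. server returns the object
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n); // HTTP 1.0: read until close
            return buf.toByteArray();
        }
    }
}
```

Calling HttpGetSketch.fetch(URI.create("http://www-cse.ucsd.edu")) would run all four steps against the live server, assuming it still answers on port 80.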
Naming and URLs
How should objects be named?
• URLs name objects and the virtual locations for those objects.
Location is a DNS name, so there are two more levels of naming and
indirection under there.
Before hypertext we used to worry about access transparency.
• Object name interpretation is up to the server, but it’s often a
location in the local file tree.
If an object moves, the URL breaks (dangling reference).
Location-independent names seem like the obvious way to go
• Why don’t we use them (e.g., URNs)?
• How do we make them work, esp. in the face of mobility?
[from Voelker, with additions]
Protocols
What kind of transport protocol should the Web use?
HTTP 1.0
• One TCP connection/object
• Complaints: inefficient, slow, burdensome…
HTTP 1.1
• One TCP connection/many objects (persistent connections)
• Solves all problems, right? Huge amount of complexity
Clients, proxies, servers
How do they compare?
• Protocol differences [Krishnamurthy99], performance comparison
[Nielsen97], effects on servers [Manley97], overhead of TCP connections
[Caceres98]
HTTPS: HTTP with encryption
[Voelker]
HTTP in a Nutshell
[Figure: Client sends the request “GET /path/to/file/index.html HTTP/1.0”; Server(s) answer with a response carrying headers such as “Content-type: MIME/html, Content-Length: 5000,...”]
HTTP supports request/response message exchanges of arbitrary length.
Small number of request types: basically GET and POST, with supplements.
• object name, + content for POST
• optional query string
• optional request headers
Responses are self-typed objects (documents) with various attributes and tags.
• optional cookies
• optional response headers
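To make "self-typed" concrete, here is a small sketch that parses the head of a response into a status code and a header map. The class ResponseHeadSketch is hypothetical, and the parser is deliberately naive: it assumes CRLF line endings and ignores folded or repeated headers.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class ResponseHeadSketch {
    final String version;
    final int status;
    // Header names are stored in lower case because they are case-insensitive.
    final Map<String, String> headers = new LinkedHashMap<>();

    private ResponseHeadSketch(String version, int status) {
        this.version = version;
        this.status = status;
    }

    // Parse a status line followed by "Name: value" lines, up to the blank line.
    static ResponseHeadSketch parse(String head) {
        String[] lines = head.split("\r\n");
        String[] statusLine = lines[0].split(" ", 3);   // version, code, reason
        ResponseHeadSketch r =
            new ResponseHeadSketch(statusLine[0], Integer.parseInt(statusLine[1]));
        for (int i = 1; i < lines.length && !lines[i].isEmpty(); i++) {
            int colon = lines[i].indexOf(':');
            if (colon < 0) continue;                    // skip malformed lines
            r.headers.put(lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT),
                          lines[i].substring(colon + 1).trim());
        }
        return r;
    }
}
```

A client can then decide how to interpret the body from headers.get("content-type") and how many bytes to read from headers.get("content-length").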
Scalable Servers
Server
• Of course, you are not the only person accessing the server…
Web Caching
[Figure: Clients reach the Servers through a Proxy Cache]
• Gee, is there some way to offload those busy servers?
• Use caches to exploit reference locality among clients
[Voelker]
Caching
How should we build caching systems for the Web?
• Seminal paper [Chankhunthod96]
• Proxy caches [Duska97]
• Akamai hack [Karger99]
• Cooperative caching [Tewari99, Fan98, Wolman99]
• Popularity distributions [Breslau99]
[Voelker]
Issues for Web Caching
• binding clients to proxies, handling failover
manual configuration, router-based “transparent caching”, WPAD
(Web Proxy Automatic Discovery)
• proxy may confuse/obscure interactions between server and client
• consistency management
At first approximation the Web is a wide-area read-only file service...but
it is much more than that.
caching responses vs. caching documents
deltas [Mogul+Bala/Douglis/Misha]
• prefetching, request routing, scale, performance
Web caching vs. content distribution (e.g., Akamai)
A few weeks from now...
HTTP 1.1
Specification effort started in W3C, finished in IETF....much later.
A number of research works influenced the specification.
HTTP 1.0 shows the importance of careful specification.
• performance
persistent connections with pipelining
range requests, incremental update, deltas
• caching
cache control headers
• negotiation of content attributes and encodings
• content attributes vs. transport attributes
transport encodings for transmission through proxies
• the Trailer header, announcing header fields sent in the trailer of a chunked message
Persistent Connections
There are three key performance reasons for persistent connections:
• connection setup overhead
• TCP slow start: just do it and get it over with
• pipelining as an alternative to multiple connections
And some new complexities resulting from their use, e.g.:
• request/response framing and pairing
• unexpected connection breakage
Just ask anyone from Akamai...
• large numbers of active connections
How long to keep connections around?
These motivations and issues manifest in HTTP, but they are
fundamental for request/response messaging over TCP.
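The framing and breakage issues can be illustrated with a sketch that reads several pipelined responses off one connection. It assumes every response is framed by a Content-Length header (chunked encoding is not handled), and the class FramingSketch is hypothetical. Pairing is by order: the i-th body belongs to the i-th request sent.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FramingSketch {
    // Read one line ending in LF (CR is dropped); returns null at end of stream.
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') break;
            if (c != '\r') sb.append((char) c);
        }
        return (c == -1 && sb.length() == 0) ? null : sb.toString();
    }

    // Split a stream of back-to-back responses into bodies, in arrival order.
    static List<String> readBodies(InputStream in) throws IOException {
        List<String> bodies = new ArrayList<>();
        while (readLine(in) != null) {                  // status line of next response
            int length = 0;
            String header;
            while ((header = readLine(in)) != null && !header.isEmpty()) {
                if (header.toLowerCase(Locale.ROOT).startsWith("content-length:"))
                    length = Integer.parseInt(header.substring(15).trim());
            }
            byte[] body = in.readNBytes(length);        // exactly one framed body
            if (body.length < length)
                throw new IOException("connection broke mid-response");
            bodies.add(new String(body, StandardCharsets.US_ASCII));
        }
        return bodies;
    }

    // Convenience wrapper for experimenting without a socket.
    static List<String> readBodies(String wire) {
        try {
            return readBodies(new ByteArrayInputStream(wire.getBytes(StandardCharsets.US_ASCII)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With one connection per object (HTTP 1.0), end-of-stream marks the end of the body, so none of this bookkeeping is needed. That is the complexity persistent connections add.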
Cookies
HTTP cookies (RFC2109) have brought us a better Web.
• S optionally includes arbitrary state as a cookie in a response.
• Cookie is opaque to C, but C saves the cookie.
• C sends the saved cookie in future requests to S, and possibly to
other servers as well.
• Allows stateful servers for sessions, personalized content, etc.
But: cookies raise privacy and security issues.
• What did S put in that cookie? Can anyone else see it? How much
space does it take up on my disk that I paid soooo much for?
• Cookies may allow third parties who are friends of S1,..., SN to
observe C’s movements among S1,..., SN.
Unverifiable transactions, e.g., DoubleClick and other ad services.
Unverifiable Transactions
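[Figure: Client fetches page y from mycfo.com and page x from amazon.com. Each page references an ad hosted by a third party (doubleclick, akamai, etc.). The request “GET ad” with Referer mycfo.com returns the ad plus cookie c; the request “GET ad, cookie c” with Referer amazon.com/x returns the ad.]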
• Users may not know that they are interacting with DoubleClick.
Amazon and MyCFO trust DoubleClick, but client is ignorant.
• The user visits pages at many sites that reference DoubleClick.
• DoubleClick’s cookie allows it to associate all the requests from a given user.
• If the browser sends Referer headers, DoubleClick may gather information
about all the sites the user visits that reference DoubleClick.
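The bullets above amount to a very small amount of server-side state. The following sketch models an ad server that links requests by cookie and records the Referer of each. The class TrackerSketch, its method names, and the cookie format are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrackerSketch {
    // For each cookie value, the referring pages seen so far.
    private final Map<String, List<String>> sitesByCookie = new HashMap<>();
    private int nextId = 0;

    // Handle "GET ad". A null cookie means a first visit, so a fresh one is issued.
    // Returns the cookie the client will store and send with later ad requests.
    String serveAd(String cookie, String referer) {
        if (cookie == null) cookie = "c" + (nextId++);
        sitesByCookie.computeIfAbsent(cookie, k -> new ArrayList<>()).add(referer);
        return cookie;
    }

    // Everything the ad server has learned about one client.
    List<String> history(String cookie) {
        return sitesByCookie.getOrDefault(cookie, List.of());
    }
}
```

After serveAd(null, "mycfo.com") and then serveAd(c, "amazon.com/x") with the returned cookie c, history(c) lists both sites, even though the user never visited the ad server directly.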
Web Cache Consistency
“Requirements of performance, availability, and disconnected operation
require us to relax the goal of semantic transparency.”
- HTTP 1.1 specification
Any caching/replication framework must take steps to ensure that
the cache does not deliver old copies of modified objects.
Issues for cache consistency in the Web:
• large number of clients/proxies
• most static objects don’t change very often
• weaker consistency requirements
Stale information might be OK, as long as it is “not too stale”.
Cache Expiration and Validation
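[Figure: Clients send GET x to the Proxy, which forwards GET x to the Origin Server; the server returns x with Last-Modified m and Expires t. Once the entry has expired, the Proxy sends GET x with If-Modified-Since m, and the Origin Server answers 304: Not Modified.]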
HTTP 1.0 cache control
• Origin server may add a “freshness date” (Expires) response header.
...or the cache could determine expiration time heuristically.
• Proxy must revalidate cache entry if it has expired.
Last-Modified and If-Modified-Since
• Whose clock do we use for absolute expiration times?
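The two decisions in this scheme can be written down directly. In this sketch the class Http10CacheSketch and its method names are hypothetical, and both parties are assumed to compare timestamps on a shared clock, which is exactly what the last bullet questions.

```java
import java.time.Instant;

class Http10CacheSketch {
    enum Action { SERVE_FROM_CACHE, REVALIDATE }

    // Proxy side: an entry is fresh until its Expires time has been reached.
    static Action proxyDecision(Instant expires, Instant now) {
        return now.isBefore(expires) ? Action.SERVE_FROM_CACHE : Action.REVALIDATE;
    }

    // Origin side: answer a GET carrying If-Modified-Since.
    // 304 means "your copy is still good"; 200 means a full response follows.
    static int originStatus(Instant lastModified, Instant ifModifiedSince) {
        return lastModified.isAfter(ifModifiedSince) ? 200 : 304;
    }
}
```

If the proxy's clock runs behind the origin's, proxyDecision keeps serving an entry the origin already considers expired.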
Expiration and Validation in HTTP 1.1
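[Figure: Clients send GET x to the Proxy; the Origin Server returns x with ETag v and max-age t. While Age < t the Proxy answers GET x from its cache. Once the entry is stale, the Proxy sends GET x with If-None-Match v; the Origin Server answers 304: Not Modified, ETag v, and the entry's Age returns to 0.]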
HTTP 1.1 cache control allows origin server to:
• use relative instead of absolute expiration times (max-age);
• issue opaque validators (ETag for entity tag) instead of timestamps;
Origin server may specify which of several cached entries to use.
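The same two decisions look like this with relative expiration and opaque validators. This is a simplified sketch: Http11CacheSketch is a hypothetical name, the ETag comparison is plain string equality (weak validators and lists of tags are ignored), and only the max-age directive is parsed.

```java
import java.util.Locale;

class Http11CacheSketch {
    // Extract max-age (in seconds) from a Cache-Control value; -1 if absent.
    static long parseMaxAge(String cacheControl) {
        for (String directive : cacheControl.split(",")) {
            String d = directive.trim().toLowerCase(Locale.ROOT);
            if (d.startsWith("max-age=")) return Long.parseLong(d.substring(8));
        }
        return -1;
    }

    // Proxy side: fresh while Age < max-age. No absolute clock is compared.
    static boolean isFresh(long ageSeconds, long maxAgeSeconds) {
        return ageSeconds < maxAgeSeconds;
    }

    // Origin side: answer a GET carrying If-None-Match.
    static int originStatus(String currentETag, String ifNoneMatch) {
        return currentETag.equals(ifNoneMatch) ? 304 : 200;
    }
}
```

Because the tag is opaque, the origin is free to derive it from a version number or a content hash instead of a modification time.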
Other 1.1 Cache Control Features
• Client or server may specify that no caching is to occur.
Cache-Control: no-store (either side); private (server only: shared caches must not store the response)
• Vary headers allow server to specify that certain request headers
must also match before the proxy may treat a cached response as valid.
language, character set, etc.
• Server may specify that a response is not cacheable.
Pragma: no-cache header since HTTP 1.0
• Client may explicitly request the proxy to validate the response.
Pragma: no-cache
• Proxy may/should/must tell client the age of a cached response.
Age header
• Proxy may/should/must tell client that it could not validate a nonfresh cached response with the origin server.
Warning header
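One way to picture the Vary rule is as an extended cache key: the proxy indexes a response by its URL plus the values of the request headers that Vary names. The sketch below assumes header names in the request map are already lower-cased; the class VaryKeySketch and the key format are hypothetical.

```java
import java.util.Locale;
import java.util.Map;

class VaryKeySketch {
    // Build a cache key from the URL and the request headers listed in Vary.
    // A header missing from the request contributes an empty value.
    static String cacheKey(String url, String varyHeaderValue,
                           Map<String, String> requestHeaders) {
        StringBuilder key = new StringBuilder(url);
        for (String name : varyHeaderValue.split(",")) {
            String n = name.trim().toLowerCase(Locale.ROOT);
            if (n.isEmpty()) continue;
            key.append('|').append(n).append('=')
               .append(requestHeaders.getOrDefault(n, ""));
        }
        return key.toString();
    }
}
```

Two requests for the same URL with different Accept-Language values then map to different entries, so the proxy does not hand a French reader the cached English page.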
The Dynamic Web
[Figure: Client sends “GET program-name?arg1=x&arg2=y”; Server(s) execute the program and return its output with headers such as “Content-type: MIME/html, Content-Length: 5000,...”]
HTTP began as a souped-up FTP that supports hypertext URLs.
Service builders rapidly began using it for dynamically-generated content.
Web servers morphed into Web Application Servers.
Common Gateway Interface (CGI)
Java Servlets and JavaServer Pages (JSP)
Microsoft Active Server Pages (ASP)
Microsoft ASPs are not to be confused with Application Service Providers (ASPs).
Multi-tier Services
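[Figure: multi-tier service. Clients (HTML+forms, applets, JavaScript, etc.) reach the Web application server over HTTP. Behind it sit middle tiers (e.g., component “middleware” such as DCOM, EJB, CORBA, etc.; transaction monitors), relational databases, and file servers. Protocol labels in the figure: HTTP; RPC, RMI, IIOP; JNDI, JDBC, SQL.]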
From Servers to Servlets
Servlets are dynamically loaded Java classes/objects invoked by a
Web server to process requests.
• Servlets are to servers as applets are to browsers.
• Servlet support converts standard Web servers into extensible
“Web application servers”.
• designed as a Java-based replacement for CGI
Web server acts as a “connection manager” for the service body,
which is specified as pluggable servlets.
interface specified by JavaSoft, supported by major servers
• Servlets can be used in any kind of server (not just HTTP).
Invocation triggers are defined by server; the servlet does not
know or care how it is invoked.
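The "connection manager plus pluggable service body" split can be sketched without the real servlet API. MiniServlet and MiniContainer below are hypothetical stand-ins, not javax.servlet types, and dynamic class loading is left out: handlers are simply registered under a path and invoked by the container.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a servlet: it sees request parameters and returns a response body.
// It does not know which protocol or trigger caused the invocation.
interface MiniServlet {
    String service(Map<String, String> params);
}

// Stand-in for the server: owns the routing table and invokes the plug-ins.
class MiniContainer {
    private final Map<String, MiniServlet> servlets = new HashMap<>();

    void register(String path, MiniServlet servlet) {
        servlets.put(path, servlet);
    }

    String dispatch(String path, Map<String, String> params) {
        MiniServlet s = servlets.get(path);
        return s == null ? "404 Not Found" : s.service(params);
    }
}
```

A handler can be plugged in with container.register("/servlet/echo", p -> "<p>" + p.size() + " parameters"), and the container stays unchanged as handlers are added.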
Anatomy of a Servlet
[Figure: the servlet interfaces as seen by the network service (servlet container)]
Servlet (implemented by GenericServlet)
• init(ServletConfig config)
• String getServletInfo()
• service(....)
• destroy()
ServletContext
• String getServerInfo()
• Object getAttribute(name)
• String getMimeType(name)
• getResource*(name)
• log(string)
ServletConfig
• String getInitParameter(name)
• ServletContext getServletContext()
• Enumeration getInitParameterNames()
Invoking a Servlet
[Figure: the network service invokes service(ServletRequest, ServletResponse) on the Servlet]
ServletRequest
• getContentLength, getContentType,
• getRemoteAddr, getRemoteHost,
• getInputStream,
• getParameter(name),
• getParameterValues(name),
ServletInputStream
• readLine(...)
ServletResponse
• setContentType(MIME type)
• getOutputStream()
ServletOutputStream
• print(...)
• println(...)
HTTP Servlets
[Figure: the HTTP-specific classes extend the generic ones]
HttpServlet (extends GenericServlet)
• service(...)
• doGet()
• doHead()
• doPost()...
HttpServletResponse (extends ServletResponse)
• addCookie(),
• setStatus(code, msg),
• setHeader(name, value),
• sendRedirect(),
• encodeUrl()
HttpServletRequest (extends ServletRequest)
• getCookies(),
• getRemoteUser(), getAuthType(),
• getHeader(name),
• getHeaderNames(),
• HttpSession getSession()
HelloWorld Servlet
import java.io.*;
import javax.servlet.*;

public class HelloWorld extends GenericServlet
{
    public void service(ServletRequest request, ServletResponse response)
        throws ServletException, IOException
    {
        ...
    }

    public String getServletInfo()
    {
        return "Hello World Servlet";
    }
}
HelloWorld Servlet (continued)
public void service(ServletRequest request, ServletResponse response)
    throws ServletException, IOException
{
    ServletOutputStream output = response.getOutputStream();
    // "from" is an optional request parameter naming the sender.
    String fromWho = request.getParameter("from");
    // The content type is set before any output is written.
    response.setContentType("text/html");
    if (fromWho == null) {
        output.println("<p>Hello world!");
    } else {
        output.println("<p>Hello world from <em>"
                       + fromWho + "</em>");
    }
}
Example 1: Invoking a Servlet by URL
Most servers allow a servlet to be invoked directly by URL.
• client issues HTTP GET
e.g., http://www.yourhost/servlet/HelloWorld
• servlet specified by HTTP POST
e.g., with form data
<FORM ACTION="http://yourhost/servlet/HelloWorld" METHOD="POST">
From : <INPUT TYPE="TEXT" NAME="from" SIZE="20">
<INPUT TYPE="SUBMIT" VALUE="Submit"> </FORM>
generates a URL-encoded query
string, e.g., “<servletURL>?from=me”
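URL encoding of form fields can be reproduced with the standard library. The sketch below builds and parses strings of the form “from=me”; the class QueryStringSketch is hypothetical, UTF-8 is assumed as the character encoding, and the Charset overloads of URLEncoder and URLDecoder require Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringSketch {
    // Encode form fields as name=value pairs joined by '&'.
    static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Decode a query string back into fields; a repeated name keeps its last value.
    static Map<String, String> decode(String query) {
        Map<String, String> out = new LinkedHashMap<>();
        if (query.isEmpty()) return out;
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            out.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return out;
    }
}
```

Characters with special meaning, such as '&' and spaces, are escaped on the way out, so a value like "me & you" survives the round trip as a single field.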