Proxy Servers - Mark Lachniet
Proxy Servers
What they are, how they work, and why you want them
By: Mark Lachniet
Thursday, October 17, 1997
Introduction
Mark Lachniet
Director of Information Data Systems @
Holt Public Schools
Novell Master CNE in connectivity
Advocate of proxy servers as a means of
responsible Internet usage
Holt Public Schools
K-12 school district, 5300 students, 11 Novell Netware
servers, 11 LINUX proxy servers, and more than 700
networked computers (LCIII, Centris 610, and various PC
compatible 386, 486, and Pentium machines)
Recently completed the deployment of proxy servers
district-wide (Squid + LINUX) in a project conceived of
with Fred Trimble (formerly of Holt, but now at the
Traverse Bay Area ISD)
Member school of SCNC (Southeast Central Network
Consortium)
Have a 128k ISDN connection to the Internet through
MERIT
Because of this, we need to maximize our efficiency
Scope of the presentation
Describe how a proxy server works
Describe advantages of a proxy server
Describe disadvantages of a proxy server
Provide an overview of contemporary proxy
server software
Discuss why proxy servers are important
for the future of the Internet
A basic proxy server
Acts as an intermediary to retrieve external data or
connect to external services
Is application dependent! Many different types of proxy
exist: FTP, HTTP, Gopher, WAIS, Telnet, POP, etc.
Many of the contemporary proxy servers are a
combination of several different services such as
firewalling, VPN (Virtual Private Network), IPX-IP
gateways, HTTP proxy, etc.
Usually thought of in terms of an HTTP web proxy, which is
what this presentation will focus on
Generally used when security or efficiency is an important
factor
How a basic proxy server works
We will focus on the most popular form - the web proxy
1) The proxy server receives a request from a client
machine
2) The proxy server then fetches the data
3) Stores a copy of that data in storage, either in the hard
drive or in RAM
4) Returns that data to the client
Subsequent requests for that information are (usually)
retrieved from local storage instead of the Internet &
transmitted at Ethernet speed instead of WAN link speed
Allows you to increase the efficiency of your network and
decrease expensive WAN traffic
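The four steps above can be sketched in a few lines of Python. This is a toy model, not how any particular proxy product is written: the class name `CachingProxy`, the `fake_origin` function, and the example URL are all made up for illustration, and a plain dictionary stands in for RAM or disk storage.

```python
from typing import Callable, Dict


class CachingProxy:
    """Toy model of the four steps: receive, fetch, store, return."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch                  # stands in for the slow WAN fetch
        self.store: Dict[str, bytes] = {}   # stands in for RAM or disk storage
        self.wan_fetches = 0                # how often we really went out

    def handle(self, url: str) -> bytes:
        # 1) A request arrives from a client machine.
        if url in self.store:
            # A later request is answered from local storage.
            return self.store[url]
        # 2) Fetch the data from the external server.
        data = self.fetch(url)
        self.wan_fetches += 1
        # 3) Keep a copy in local storage.
        self.store[url] = data
        # 4) Return the data to the client.
        return data


def fake_origin(url: str) -> bytes:
    """Hypothetical stand-in for a real web server."""
    return ("contents of " + url).encode()


proxy = CachingProxy(fake_origin)
first = proxy.handle("http://example.com/index.html")
second = proxy.handle("http://example.com/index.html")
print(proxy.wan_fetches)
```

The second call never touches `fake_origin`, which is the whole point: only the first request pays the WAN cost.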
A proxy network diagram
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
This is the simplest kind of proxy server setup
It assumes that the proxy server is connected to a source
of “real” Internet access on one side, and a “fake” private
network on the other
Although this is one application, all three of these
computers could be on the “real” Internet and still work,
but without any security restrictions
A regular web request
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
Without a proxy server, a client workstation
would make a direct request from a web
server
A regular web response
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
The web server then returns the data
directly to the client workstation
If the client machine needs the document
again, it repeats this procedure
A proxy server request
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
When a proxy server is used, the client workstation sends
all of its requests to the proxy instead of the actual host
This is something that must be configured in the client
software.
Your software must support this feature. Microsoft
Internet Explorer and Netscape Navigator (among others)
make use of this
The client configuration
This is an example of how Netscape Navigator 2.0 can be
configured to use a proxy server (from preferences)
The proxy server information
The client configuration consists of an IP address and a
port number (in this case, 3128)
This is the IP and port number the proxy server software
is running on. Often, this is configurable
You can also configure some hosts NOT to use the proxy
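As a rough sketch of what the client does with these settings, the function below decides whether a request goes to the proxy or directly to the host. Only port 3128 comes from the slide; the IP address, the `.intranet.example` domain, and the function name `proxy_for` are hypothetical, and real browsers have their own matching rules for the exclusion list.

```python
from typing import Optional, Sequence, Tuple

ProxyAddress = Tuple[str, int]

# Hypothetical values for illustration; only port 3128 comes from the slide.
HTTP_PROXY: ProxyAddress = ("192.168.1.1", 3128)
NO_PROXY_HOSTS: Sequence[str] = ("localhost", ".intranet.example")


def proxy_for(host: str,
              proxy: ProxyAddress = HTTP_PROXY,
              no_proxy: Sequence[str] = NO_PROXY_HOSTS) -> Optional[ProxyAddress]:
    """Return the proxy to use for host, or None to connect directly."""
    host = host.lower()
    for entry in no_proxy:
        # An entry starting with "." matches any host in that domain.
        if host == entry or (entry.startswith(".") and host.endswith(entry)):
            return None
    return proxy


print(proxy_for("www.example.com"))
print(proxy_for("files.intranet.example"))
```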
Sending a request to the proxy
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
With this information set, the client workstation then uses
the appropriate proxy server instead of going directly to
the web server
You can configure a different proxy for each of the
different types of services: HTTP, FTP, WAIS, Gopher, etc.
For example, you might want a super-fast HTTP proxy on
one server, and a slower Gopher proxy on another
The proxy server checks its rules
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; marked with "?"]
The proxy server then has to decide if the
client is authorized to use its services
This could be based upon many different
rules
Proxy server rules
The criteria that a proxy server uses to decide if it will
grant access depends upon the type of software you are
using
Can be based upon the IP address of the client machine
Can be based upon the IP address of the web server you
are trying to get information from
Can require a valid username and password to access the
proxy server
Can limit access based upon the day, or time of day
Can limit based upon stopwords such as “sex” or “golf”
Can be based upon proprietary criteria such as a Netware
group, a SecurID card, etc.
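Several of these rules can be combined into one access check. The sketch below is only an illustration of the idea: the network range, the blocked server address, the opening hours, and the function name `is_allowed` are all assumptions, and real proxy software expresses such rules in its own configuration syntax.

```python
import ipaddress
from datetime import time

# All values below are hypothetical examples, not settings from any real site.
ALLOWED_CLIENTS = ipaddress.ip_network("192.168.0.0/16")
BLOCKED_SERVERS = {"203.0.113.5"}
STOPWORDS = ("sex", "golf")
OPEN_FROM, OPEN_UNTIL = time(7, 0), time(17, 0)


def is_allowed(client_ip: str, server_ip: str, url: str, now: time) -> bool:
    """Apply the rules in order; the first failing rule denies the request."""
    # Rule: IP address of the client machine.
    if ipaddress.ip_address(client_ip) not in ALLOWED_CLIENTS:
        return False
    # Rule: IP address of the web server being requested.
    if server_ip in BLOCKED_SERVERS:
        return False
    # Rule: time of day.
    if not (OPEN_FROM <= now <= OPEN_UNTIL):
        return False
    # Rule: stopwords anywhere in the URL.
    lowered = url.lower()
    return not any(word in lowered for word in STOPWORDS)


print(is_allowed("192.168.4.20", "198.51.100.7",
                 "http://example.com/science.html", time(10, 30)))
```

Note that a plain substring match is crude: it would also block a URL containing "Essex". That is one reason stopword rules need care.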
The proxy server gets the data
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
Once it is satisfied that the request is valid, it obtains the
data from the requested data source
As far as the web server itself is concerned, the request is
coming directly from the proxy server
This insulates the client workstation, and prevents it from
showing up in the server logs
This can be good if you are working from a computer with
sensitive material such as a database server, etc.
The web server returns data
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
The remote web server then returns the
data to the proxy server
In some cases, the “payload” is examined
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server (marked "?"), Client Workstation]
Some software will examine the data to determine if it is
appropriate
An example of this might be some kind of pornography
filter, or a filter which looks for obscene words
This may be important since a very bad site might have
a very nice name, and thus not get denied at an earlier
stage
The proxy decides if it should cache the data
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Proxy Server: “Should I keep a copy?”]
The proxy server must then evaluate the
incoming data
It checks its internal rules to decide if a copy of
this data should be committed to its local storage
Deciding to cache information
The criteria the proxy server uses to make this decision
depends upon the type of proxy server that you are using
Generally, any kind of dynamically generated content is not cached
This is because such information is subject to frequent
changes, and is almost never the same
An example of this is a Yahoo web search - the database
is added to every day, so caching it would not show any
new sites that were added since the last access
In general, anything with a ‘?’ or ‘cgi-bin’ in the URL will not get
cached - this is the common convention for dynamically generated
content. If you are not sure, take a look at your web
logs, and you can see examples of this
Other configurations are also possible for security, etc.
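The ‘?’ and ‘cgi-bin’ convention is easy to express as a check on the URL. This is a minimal sketch of that one rule only; the function name `looks_cacheable` is made up, and real proxy servers combine it with other criteria.

```python
def looks_cacheable(url: str) -> bool:
    """Apply the common convention: skip URLs with '?' or 'cgi-bin'."""
    return "?" not in url and "cgi-bin" not in url


print(looks_cacheable("http://example.com/logo.gif"))
print(looks_cacheable("http://example.com/search?p=proxy"))
print(looks_cacheable("http://example.com/cgi-bin/counter"))
```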
Caching the data to local storage
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Proxy Server: “I’ll keep a copy of this!”]
If the data matches its rules, a copy of the data is stored
to local storage
Local storage might be RAM or local disk (SCSI or IDE)
RAM is obviously the fastest, so having a lot of RAM will
make your proxy server run faster
Disk is second best, but you still want a FAST drive
The proxy returns the data
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
The proxy server then returns the data to
the client that originally requested it
The client uses the data
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Client Workstation: “Thanks!”]
The client then uses the data however it likes
In general, the data it gets from the proxy will be exactly
the same as if it had not used the proxy
In some cases, a machine might add a header or trailer to
the data saying that it was processed by the cache
An example of this is the Squid FTP proxy which gives a
time/date stamp and the name of the proxy at the bottom
Requesting the same data twice
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation]
Sometimes, a client workstation will want
to retrieve the same object twice
The proxy checks to see if it has the object in cache
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Proxy Server: “Do I have this already?”]
The proxy server then checks to see if it has a copy of the
requested data stored locally
In fact, this step occurred in the first instance of the
request, but was omitted for clarity
The same access rules as before apply to determine if the
client is authorized to use the cache
If the proxy doesn’t have it
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Proxy Server: “Nope, better go get it!”]
If the proxy server doesn’t have it in cache,
it goes out and retrieves the information
just like in the previous examples
If the cache does have it
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Proxy Server: “Yes, I do have it, but is it current?”]
If the proxy server does have the object in cache, it must
make sure that the information is still valid
Otherwise, any information which made it to the cache
would be used until it was purged
This would result in out-of-date information if the
document on the web server had changed
The cache checks the data
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Proxy Server: “Hey, is this still current?”]
The proxy server then contacts the web server to find out
if the information it has in its cache is still current
Determining if data is current poses an interesting
problem to web designers and proxy programmers
Some of these problems are addressed in the HTTP
version 1.1 standard
More about determining if an
object is up-to-date
Under HTTP 1.0, there are two ways to determine if a web document
is up to date
Expires: a header which specifies a time after which the document
becomes out of date
The Expires header is rarely used, and is relatively uninformative (what
about an unscheduled change?)
Last-Modified: a header which specifies when the object was last changed
This is more useful, as a cache could then employ an algorithm for
deciding when cached objects should be re-checked or pruned. One way of
doing this is to frequently refresh data that has a recent
“Last-Modified” date. The idea is that recently changed data is more likely to
change in the near future, and should thus be “checked up on”
Both of these approaches are somewhat limited by the HTTP
1.0 standard. Some of these issues will be addressed in the upcoming
HTTP 1.1 specification by use of “Cache-Control” headers
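One way to turn these two headers into a decision is sketched below. The rule "treat an object as fresh for a fraction of its age at fetch time" is a common heuristic, but the factor of 0.1 and the function name `fresh_until` are assumptions chosen for illustration, not the algorithm of any specific proxy.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumed tuning value: trust an object for 10% of its age when fetched.
LM_FACTOR = 0.1


def fresh_until(fetched_at: datetime,
                expires: Optional[str],
                last_modified: Optional[str]) -> datetime:
    """Return the time after which the cached copy should be re-checked."""
    if expires:
        # An explicit Expires header wins when the server sends one.
        return parsedate_to_datetime(expires)
    if last_modified:
        # Recently changed objects get a short lifetime, old ones a long one.
        age_at_fetch = fetched_at - parsedate_to_datetime(last_modified)
        return fetched_at + age_at_fetch * LM_FACTOR
    # No information at all: re-check on every request.
    return fetched_at


fetched = datetime(1997, 10, 17, 12, 0, tzinfo=timezone.utc)
print(fresh_until(fetched, None, "Tue, 07 Oct 1997 12:00:00 GMT"))
```

In this example the document was ten days old when fetched, so the sketch treats it as fresh for about one more day.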
The speed of checking freshness
Obviously, there is still traffic going from the proxy server
to the web server
However, doing a freshness check is a lot faster
The average HTML document is several kilobytes long
A freshness check uses the HTTP “HEAD” command
The “HEAD” command returns the document headers and
not the whole document
This is generally only a kilobyte or less of information, but
still returns the necessary freshness information
Thus, a freshness check is much faster than downloading
the whole document again
Graphics are slow to change and are stored on the proxy
server for a long time, so they are almost always fast
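A HEAD-based freshness check might look roughly like this in Python. The sketch only builds the request and compares headers, so it runs without a network connection; the helper names and the example URL are hypothetical, and comparing Last-Modified strings for equality is a simplification.

```python
from typing import Mapping
from urllib.request import Request


def build_freshness_check(url: str) -> Request:
    """Build a HEAD request: headers only, no document body."""
    return Request(url, method="HEAD")


def cached_copy_is_current(cached_last_modified: str,
                           head_headers: Mapping[str, str]) -> bool:
    """Compare the stored Last-Modified value with the server's answer."""
    return head_headers.get("Last-Modified") == cached_last_modified


request = build_freshness_check("http://example.com/index.html")
print(request.get_method())

# Headers as they might come back from the server (made-up values).
reply = {"Last-Modified": "Tue, 07 Oct 1997 12:00:00 GMT"}
print(cached_copy_is_current("Tue, 07 Oct 1997 12:00:00 GMT", reply))
```

To actually send the request you would pass it to `urllib.request.urlopen`, which is left out here so the example stays offline.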
The web server responds yes
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation; Internet Web Server: “Yes, it’s current”; Proxy Server: “Okay, I’ll send it from the cache”]
If the document is current, the proxy
server reads the object from local
storage and returns it to the client
workstation
The web server responds no
[Diagram labels: "The Internet", "The Private Network", Internet Web Server, Proxy Server, Client Workstation; Internet Web Server: “No, that’s old”; Proxy Server: “Okay, I’ll download it again”]
If the proxy server determines that its data is old, it
retrieves the complete information again
Just as before, it checks rules, commits the data to
storage, and sends it to the client workstation
All of this is just as in the previous examples, and is
invisible to the client workstation
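Putting the last few slides together, the lookup, freshness check, and re-download can be modeled as below. This is a simplified sketch: `Origin` and `ValidatingCache` are made-up names, the "stamp" stands in for a Last-Modified value, and the access rules are left out.

```python
from typing import Dict, Tuple


class Origin:
    """Stand-in for the web server: url -> (stamp, body)."""

    def __init__(self) -> None:
        self.docs: Dict[str, Tuple[str, bytes]] = {}

    def head(self, url: str) -> str:
        # Cheap call: returns only the freshness stamp.
        return self.docs[url][0]

    def get(self, url: str) -> Tuple[str, bytes]:
        # Expensive call: returns the whole document.
        return self.docs[url]


class ValidatingCache:
    def __init__(self, origin: Origin) -> None:
        self.origin = origin
        self.entries: Dict[str, Tuple[str, bytes]] = {}
        self.full_downloads = 0

    def request(self, url: str) -> bytes:
        entry = self.entries.get(url)
        # "Do I have this already?" and "Is it still current?"
        if entry is not None and self.origin.head(url) == entry[0]:
            return entry[1]
        # Missing or old: download it again and store the new copy.
        stamp, body = self.origin.get(url)
        self.full_downloads += 1
        self.entries[url] = (stamp, body)
        return body


origin = Origin()
origin.docs["/news.html"] = ("v1", b"old news")
cache = ValidatingCache(origin)
cache.request("/news.html")          # miss: full download
cache.request("/news.html")          # hit: only a freshness check
origin.docs["/news.html"] = ("v2", b"fresh news")
latest = cache.request("/news.html")  # stale: full download again
print(cache.full_downloads)
```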
Once stored, anyone can get it
[Diagram labels: "The Internet", Internet Web Server, "The Private Network", Proxy Server, Client Workstation, Client Workstation #2; Client Workstation #2: “Hey, I want that data too!”]
If any other client which uses the proxy server wants that
same piece of data, they can get it directly from the proxy
server’s local storage
This is what makes proxy servers so efficient - shared
caches
Benefits of sharing a proxy
Since data is available to everyone once it is in the cache,
proxies are wonderful for environments like school
laboratories
A teacher might “pre-load” the cache ahead of time by
viewing a site. Afterwards, an entire lab of computers
could view that site without having to clog the Internet
connection
Commonly-used web sites like the Netscape web search
screen will stay in the cache almost all of the time
It is possible to use proxy-aware programs such as wget
and web-whacker to pre-load an entire web site
recursively with a single simple command
Pre-loading the cache can be done at convenient times,
such as during the middle of the night
Caches are dynamic
Caches are dynamic: they adapt to usage patterns
Generally have a specified cache space (200 MB to
gigabytes of space) for objects
Objects from the cache are “trimmed” based upon last
use, probability of use, expiry dates, and other algorithms
These algorithms involve math and a lot of charts, and
are consequently not included in my presentation
Different cache algorithms work better in different
circumstances - for example, an instructional lab might
tend to view the same information repeatedly while a
general computer lab might view more diverse
information.
Finding the most efficient cache algorithm is a big
question
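The simplest trimming rule mentioned above, "last use", can be sketched as a least-recently-used store with a fixed byte budget. This is only one of the possible algorithms, and the class name `LruStore` and the tiny 10-byte capacity are made up to keep the example short.

```python
from collections import OrderedDict
from typing import Optional


class LruStore:
    """Fixed-size object store that trims the least recently used objects."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        self.items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, url: str) -> Optional[bytes]:
        if url not in self.items:
            return None
        # Reading an object marks it as recently used.
        self.items.move_to_end(url)
        return self.items[url]

    def put(self, url: str, body: bytes) -> None:
        if url in self.items:
            self.used -= len(self.items.pop(url))
        self.items[url] = body
        self.used += len(body)
        # Trim from the least recently used end until we fit again.
        while self.used > self.capacity and self.items:
            _, evicted = self.items.popitem(last=False)
            self.used -= len(evicted)


store = LruStore(capacity_bytes=10)
store.put("/a", b"aaaa")
store.put("/b", b"bbbb")
store.get("/a")            # /a is now more recently used than /b
store.put("/c", b"cccc")   # over budget, so /b is trimmed
print(list(store.items))
```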
Computing cache efficiency
Generally, a cache is measured by its “cache-hit”
percentage
This is the percentage of requested documents which were
served from the local cache area instead of
downloaded from the Internet
This number varies greatly depending on usage,
power of the cache, etc.
In some applications, this number could reach
20% to 50%
The “cache-hit” rating is the benchmark by which
caches should be tuned
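The calculation itself is simple. The sketch below assumes you can count hits and misses from your proxy logs; the function name `hit_rate` and the sample counts are made up.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Percentage of requests served from the local cache."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


# Example: 350 of 1000 requests were answered from the cache.
print(hit_rate(350, 650))
```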
Caches within a hierarchy
Caches can also be configured to work within a hierarchy
This allows them to share information between caches
without having to download
In general, the WAN links between proxy servers in an
installation are faster than the connection to the Internet
Thus, it is generally faster to get information from a cache
on the local WAN than from the source web server
One type of communication is known as ICP (Internet Cache
Protocol) and is used in Squid, Netscape Proxy Server,
and Border Manager
Version 2.0 of Microsoft Proxy server uses a proprietary
standard called CARP (Cache Array Routing Protocol)
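The idea of asking a nearby cache before going to the Internet can be modeled as below. This sketch does not implement ICP or CARP; it only shows a child cache falling back to its parent, with made-up names (`Cache`, `origin`) and a return value that reports where the data came from.

```python
from typing import Callable, Dict, Optional, Tuple


class Cache:
    """A cache that asks its parent (if any) before the origin server."""

    def __init__(self, name: str, parent: Optional["Cache"] = None) -> None:
        self.name = name
        self.parent = parent
        self.store: Dict[str, bytes] = {}

    def fetch(self, url: str,
              origin: Callable[[str], bytes]) -> Tuple[bytes, str]:
        """Return (body, source), where source names who supplied the data."""
        if url in self.store:
            return self.store[url], self.name
        if self.parent is not None:
            # Ask the parent cache over the (fast) local WAN.
            body, _ = self.parent.fetch(url, origin)
            source = self.parent.name
        else:
            # Top of the hierarchy: go out to the Internet.
            body = origin(url)
            source = "origin"
        self.store[url] = body
        return body, source


def origin(url: str) -> bytes:
    """Hypothetical stand-in for the real web server."""
    return b"page for " + url.encode()


parent = Cache("parent")
school_a = Cache("school-a", parent)
school_b = Cache("school-b", parent)

_, via_a = school_a.fetch("/index.html", origin)
_, via_b = school_b.fetch("/index.html", origin)
_, again_b = school_b.fetch("/index.html", origin)
print(via_a, via_b, again_b)
```

From each child's point of view the parent supplied the data; only the parent's own first request actually reached the origin server.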
A cache hierarchy
[Diagram: Squid Cache Hierarchy. Parent Object Cache (most powerful, and/or closest to the Internet) at the top; below it, three Child Object Caches local to a network (each a masquerading Proxy Server) and two Child Object Caches, possibly for heavy usage workgroups]
This particular example shows how you might configure
caches within a WAN for maximum efficiency
Transparent Proxy
Transparent proxy is a feature which (to my knowledge) is
unique to LINUX proxy servers
It allows you to run a proxy server without the client
knowing about it or being configured for it
It is used in the setup shown in this presentation, where
the proxy server is also a firewall
It monitors all traffic through the firewall/proxy,
intercepts all traffic on port 80, and routes it through the
proxy software
This can ease your migration to using a proxy server
Is a free program distributed by the author for the
general LINUX community
Using a proxy as a web server
accelerator
Most proxy servers will also allow you to “accelerate” a
web server
In essence, you are caching your own web server
This reduces the load on the web server itself, and
offloads some of the work to the proxy server
This can reportedly improve the speed of your web server
significantly
This is particularly handy for Intranet situations where
numerous people within a cache hierarchy access the
same web server on a daily basis
It is a good way to extend a “legacy” or overloaded web
server that is difficult to replace for whatever reason
Fault resistant proxies
Another useful feature for proxy servers is fault tolerance
Most caches that work within a hierarchy are able to
determine the status of a cache by trying to communicate
with it
However, if the cache goes down, the clients may still
have to be re-configured to use a different cache
Some proxy servers, such as Microsoft Proxy Server 2.0,
have roll-over and load balancing
This allows proxies that are particularly busy to share the
work with less busy machines
In the case of catastrophic failure such as a server failure,
another server would take over the workload
This is a feature that will certainly undergo much
development in the near future
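Roll-over can be pictured as "use the first proxy that answers". The sketch below shows only that idea, not how Microsoft Proxy Server or any other product does it; the function name `pick_proxy`, the host names, and the `is_up` health check are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_proxy(proxies: Sequence[str],
               is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first proxy in preference order that is reachable."""
    for proxy in proxies:
        if is_up(proxy):
            return proxy
    # Nothing answered: the caller could fall back to a direct connection.
    return None


# Pretend the first proxy has failed (made-up host names).
down = {"proxy1.example:3128"}
chosen = pick_proxy(["proxy1.example:3128", "proxy2.example:3128"],
                    lambda p: p not in down)
print(chosen)
```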
Automatic proxy configuration
The major browsers have the ability to configure a client
workstation to use a proxy server automatically
Instead of having to type in an IP address and port
number four times on each machine, it is done behind the
scenes
This makes it unnecessary to change the client software
every time a change is made in the network
Automatic configuration is usually given through a web
page, such as the opening “home” page for the browser
Automatic proxy configuration varies depending on the
browser you are using, so read up on the documentation
to find out how to do it
Squid Internet Object Cache
Runs on numerous different types of hardware and
variants of the UNIX operating system (including the
Freeware LINUX operating system)
caches HTTP, FTP, Gopher, and WAIS information
Is free!
Is the most-used proxy server worldwide
Is especially popular outside of the US where internet
connections are slow, and money is an issue
Is the core technology for Novell’s Border Manager object
cache program
Has numerous scripts and utilities to assist you in
administering the software and getting usage information
Can be downloaded directly from the Internet
Novell Border Manager
Runs on Novell Intranetware
Includes support for NDS rules & user authentication
Has support for dial-in PPP and WAN routing
Has support for site-to-site VPN (Virtual Private
Networking) circuits
Works within a hierarchy using ICP
Provides IPX support
Has firewalling support built in, including various
types of packet filtering
Comes with a free runtime version of the
Intranetware OS
Microsoft Proxy Server 2.0
Runs on Windows NT 4.0
Works in a hierarchy using CARP cache arrays
Caches FTP and HTTP
Has support for HTTP 1.1
Has firewalling support built in, including various
types of packet filtering
Has support for site-to-site VPN (Virtual Private
Networking) circuits
Provides for security features such as resistance
to probes and real time alerts and logging
Netscape Proxy Server 2.5
Runs on Windows NT 4.0 and some UNIX
variants
Works within a hierarchy using ICP
Has support for site-to-site VPN (Virtual
Private Networking) circuits and other basic
networking and firewalling functionality
under NT
Win-Proxy
Runs on Windows 95/NT with TCP/IP installed
FTP, Telnet, HTTP, POP, SMTP, SOCKS,
NNTP, SSL (HTTPS), GOPHER
DOD (Dial on Demand) & scheduled dialing
Configurable from a standard browser
2-user demo free for download from
http://www.lanprojekt.cz/winproxy/
Summing it up
Proxy servers come in many different types and
configurations - determine which proxy server makes the
most sense for your system
Proxy servers can greatly increase the speed and
efficiency of your network
Proxy servers allow you greater control of your network -
you can apply rules, perform logging, and improve your
security
Proxy servers scale well - you can deploy several of them
and still gain a performance advantage as they are added
Proxy servers are good for organizations, and good for
the Internet in general because they reduce traffic on the
Internet
Internet responsibility
The “Cache Now!” campaign
http://vancouver-webpages.com/CacheNow/
Advocates the use of Internet Caches
worldwide
Maintains a “cache friendly” check-list of web
design tips to help object caches work better
Has numerous links to other cache information
Building a free proxy server
If you are interested, I have put together a
HOWTO document which details how to make a
LINUX based proxy server
All of the software is 100% free, but you will
need a 486 or pentium computer to try it out
The URL is at: holt.k12.mi.us/~lachniet/proxy
right now, but will be moved to lachniet.com in
the future
Alternately, start reading up on Squid and LINUX
and you can get more information on how to put
one together on your own
Questions and Answers
?
[email protected]