Internet and WWW services - University of New Mexico


Internet and WWW Services
• Security
• Types of Services
• Vended versus Internally Provided
• Costs and Benefits
• Servers and Clients
• Potential Problems
• Stats
© Copyright 1997, The University of New Mexico
General Network Security
• Isolated Servers
• Restricted Subnets
• Firewalls
• Proxy Servers
WWW Application Security
• OS Level
• Server Level
• Program Level
Types of WWW Services
• Static Data
• Server Search Engines
• Dynamic Data
• Server Applications
• Java Enabled
Vended
• Which Vendor
• How Much Do They Do
– HTML
– Graphics
– Design & Layout
– Programming
• Bandwidth
– Total
– Dedicated
Internally Provided WWW Server
• For whom?
• How many services, how much traffic?
• For what use (scope the server)?
Cost of a WWW Service
• Server Usage
• Disk Space
• Network Bandwidth
• Router or LAN Load
• Application Development with Limited Capabilities
• Application Development with Limited Standardization
Benefits
• High-touch, High Impact Narrow-casting
• Kiosks
• Fast, Simple Apps From Central Server
• Built-in Protocols
• Potentially Large Installed Client Base
Shopping List
• Server Machine and O/S
• Network Access
• WWW Server
• WWW Client
• Server Programming Tools
• Data and/or Databases
Which Server Platform?
• Unix
• NT
Which Server?
• CREN
• Microsoft
• Netscape - Communications or Commerce
• O’Reilly
• WebForce
• Oracle WebServer
Client Compliance Level
• HTML 2.0
• HTML 3.0
• Netscape Enhancements
• Java
• Lynx (Text Browser)
CGI-BIN Risks
• Dangerous Programs or Scripts
• User-supplied Programs or Scripts
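One common way to reduce the risk from dangerous or user-supplied scripts is to check every user-supplied value against an allowlist before it reaches a shell or a file name. The sketch below is only an illustration: the class name CgiInputCheck and the exact character set and length limit are assumptions, not something taken from the slides.

```java
import java.util.regex.Pattern;

class CgiInputCheck {
    // Hypothetical allowlist: letters, digits, dot, underscore, hyphen; 1 to 64 characters.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9._-]{1,64}");

    // Returns true only when the value is made of allowlisted characters
    // and does not contain a ".." sequence that could climb directories.
    static boolean isSafeArgument(String value) {
        return value != null
                && SAFE.matcher(value).matches()
                && !value.contains("..");
    }
}
```

A CGI program would call isSafeArgument on each form field and refuse the request when the check fails, rather than trying to repair the input.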
Robots and Other Network Creatures
• Problems with “Automated Agents”
• Deterring Robots
• Reacting to Robots
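One widely used convention for deterring well-behaved robots is a robots.txt file listing path prefixes that robots are asked not to fetch. The slides do not say which mechanism was used, so the following is a simplified sketch from the robot's side: it reads only the "User-agent: *" group and treats each Disallow value as a plain prefix. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class RobotsRules {
    // Collects Disallow prefixes from the "User-agent: *" group of a robots.txt body.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');           // drop trailing comments
            if (hash >= 0) line = line.substring(0, hash).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;                // not a "field: value" line
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    // A path is allowed when it starts with none of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

This only helps against cooperative robots; reacting to robots that ignore the file still needs server-side measures such as access restrictions.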
WWW Server Stats
[Chart: WWW accesses per week, January through March, with separate series for UNM and Outside; vertical axis from 0 to 600,000]
WWW Server Stats
[Pie chart, March: UNM 35%, Outside 65%]
Web Mining
Web-based information extraction
Why the Web
(web = web browser)
• Ubiquitous:
– Web browsers are on every desktop, every PC, Mac, workstation, and terminal.
• Platform independence
– Use of Java and server side programs means clicking on a button does the same thing everywhere.
Data Cleansing
Text Mining
News Services
Decision Trees
Word frequency
Keyword Search
Tri-Letter Sets
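Two of the techniques listed above, word frequency and tri-letter sets, are simple enough to sketch directly. The code below is a minimal illustration, assuming ASCII text and treating a tri-letter set as the set of all three-character substrings of a word; the class name TextFeatures is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TextFeatures {
    // Counts how often each lowercase word occurs in the text.
    static Map<String, Integer> wordFrequency(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Returns every distinct three-letter substring of the word.
    static Set<String> triLetterSet(String word) {
        Set<String> grams = new TreeSet<>();
        String w = word.toLowerCase();
        for (int i = 0; i + 3 <= w.length(); i++) {
            grams.add(w.substring(i, i + 3));
        }
        return grams;
    }
}
```

Tri-letter sets tolerate small spelling differences: two words that share most of their three-letter pieces are probably variants of each other.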
[Diagram with boxes: Data, Cleansed Data, Extracted Data, Display Results]
What Kind of Data?
• Usenet News
– Most sites hold multiple gigabytes of news
• System accounting files
– Can tell who is doing what, when
• Misc. Web pages
– A variety of interesting information
• Listserver or public system email
– We keep email concerning system problems
Cleansing Data
• News article
– NNTP fields
– signatures
• Web Page
– HTML codes
– descriptions of links to other sites
– pattern fields (headers and trailers that appear on every page at the site)
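The cleansing steps above can be sketched in a few lines. This is a rough illustration under stated assumptions: the article's header block ends at the first blank line, the signature starts at the conventional "-- " delimiter line (many real posts, including hand-signed ones, do not follow it), and HTML codes are removed with a simple regular expression rather than a real parser. The class name Cleanser is hypothetical.

```java
class Cleanser {
    // Drops the header block (everything up to the first blank line) and
    // a trailing signature introduced by a "-- " delimiter line.
    static String cleanArticle(String article) {
        String text = article.replace("\r\n", "\n");
        int bodyStart = text.indexOf("\n\n");
        String body = bodyStart >= 0 ? text.substring(bodyStart + 2) : text;
        int signature = body.indexOf("\n-- \n");
        if (signature >= 0) body = body.substring(0, signature);
        return body.trim();
    }

    // Replaces every tag with a space, then collapses runs of whitespace.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

Pattern fields, such as a banner repeated on every page of a site, would need a further pass that compares several pages from the same site; that step is not shown here.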
Mining for data
• Test hypothesis
• Look for hidden information
• Find other similar information
Display of Information
• Graphical
• Text Listing
– Directories: human maintained categories
• e.g.: recreation, computers, finances, arts
– Computer generated list
• Customized
– User defined defaults
– Cookie defined defaults
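Cookie-defined defaults can be read from the Cookie request header, which carries "name=value" pairs separated by semicolons. The sketch below assumes a hypothetical preference named view with a built-in default of text; neither the name nor the default comes from the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DisplayDefaults {
    // Starts from built-in defaults and overrides them with cookie values.
    static Map<String, String> fromCookieHeader(String header) {
        Map<String, String> prefs = new LinkedHashMap<>();
        prefs.put("view", "text");                  // hypothetical built-in default
        if (header == null) return prefs;
        for (String pair : header.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                prefs.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return prefs;
    }
}
```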
[Diagram with boxes: Data and Services, Learning to use services, Learning to extract data from the answer, Compile and clean data, Display Results]
What Services?
• Search Engines
• Internet White Pages
– (information on individuals)
• Internet Yellow Pages
– (information on corporations)
• Usenet News repositories
• Online libraries
• Online periodicals
Learning to use Services
• Sample sets of data
– can derive a format if taught to.
• Machine learning (same as in Data Mining)
– look at every interpretation, find the one that conveys the most information.
Learning to interpret answers
• What format is information given in?
• What do the fields mean?
– Can identify unknown fields by matching the data with known information.
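Matching returned data against known information can be sketched as a column-scoring problem: query the service for people whose phone numbers are already known, then see which column of the answer agrees with those numbers most often. The code below is a minimal illustration; the class name, method name, and exact-match comparison are assumptions, and a real agent would need fuzzier matching.

```java
import java.util.List;

class FieldMatcher {
    // Returns the index of the column whose values most often equal the
    // known value for the same row, or -1 when no column matches at all.
    static int bestMatchingColumn(List<String[]> rows, List<String> knownValues) {
        int best = -1;
        int bestHits = 0;
        int columns = rows.isEmpty() ? 0 : rows.get(0).length;
        for (int c = 0; c < columns; c++) {
            int hits = 0;
            for (int r = 0; r < rows.size() && r < knownValues.size(); r++) {
                String[] row = rows.get(r);
                if (c < row.length && row[c].trim().equalsIgnoreCase(knownValues.get(r))) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = c;
            }
        }
        return best;
    }
}
```

Once the phone column is identified this way, the same label can be applied to answers about people whose numbers are not yet known.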
Compile and Clean Data
• Redundancies
• Duplicates
• Newer information has precedence
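The rule that newer information has precedence can be sketched as a merge keyed on whatever identifies a record, keeping the entry with the latest timestamp. The Fact class, its fields, and the use of a numeric timestamp are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecordMerger {
    // One piece of collected information; the fields are hypothetical.
    static final class Fact {
        final String key;
        final String value;
        final long timestamp;

        Fact(String key, String value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Keeps one fact per key; a later timestamp replaces an earlier one,
    // so duplicates collapse and newer information wins.
    static Map<String, Fact> merge(List<Fact> facts) {
        Map<String, Fact> latest = new LinkedHashMap<>();
        for (Fact fact : facts) {
            Fact seen = latest.get(fact.key);
            if (seen == null || fact.timestamp > seen.timestamp) {
                latest.put(fact.key, fact);
            }
        }
        return latest;
    }
}
```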
Security
• Server environment
– Use trusted CGI scripts and server side includes
• Client environment
– Restrict access by IP number or domain
– Restrict access by password
• Internet
– encrypt data (PGP)
– Certification authority
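Restricting access by IP number usually means testing whether the client address falls inside an allowed network. The sketch below shows that test for IPv4 addresses with a prefix length; the class name is hypothetical and the addresses used with it are documentation examples, not real campus networks. Most web servers provide this check in their own configuration, so code like this is only needed inside a custom program.

```java
class AccessRule {
    // Converts a dotted-quad IPv4 address into a 32-bit integer.
    static int toInt(String dottedQuad) {
        String[] parts = dottedQuad.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dottedQuad);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // True when the address shares the first prefixLength bits with the network.
    static boolean inSubnet(String address, String network, int prefixLength) {
        // A shift by 32 is a no-op in Java, so a zero-length prefix is handled separately.
        int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
        return (toInt(address) & mask) == (toInt(network) & mask);
    }
}
```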
Checking for hidden information
[Flowchart: decision "Data is in database?" with Y and N branches, and a Machine Learning step]
Article: 52151 of comp.lang.perl.misc
Path: lynx.unm.edu!pr1.plk.af.mil!tesuque.cs.sandia.gov!sloth.swcp.com!news.ironhorse.com!op.net!news.mathworks.com!enews.sgi.com!news.sgi.com!mr.net!news.mid.net!sbctri.tri.sbc.com!newspump.wustl.edu!newsfeed.rice.edu!rice!add
From: [email protected] (Arthur Darren Dunham)
Newsgroups: comp.lang.perl.misc,comp.infosystems.www.authoring.html
Subject: Re: WWW: web site "pre-processor" in perl ?
Date: 31 Oct 1996 00:20:06 GMT
Organization: Rice University
Lines: 23
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]>
<[email protected]> <[email protected]>
NNTP-Posting-Host: pecos.is.rice.edu
Xref: lynx.unm.edu comp.lang.perl.misc:52151 comp.infosystems.www.authoring.html:111886
In article <[email protected]>, Clay Shirky <[email protected]> wrote:
>
>Au contraire. HTML _is_ broken, relative to, say, SGML, but if you are
>careful with your tags and comment carefully, your data can be derived
>from your HTML files, not v-v.
>
>find . -name '*html' -exec perl -p -i.bak -e
>
's#(<body[^>]*bgcolor="?)oatmeal("?[^>]*>)#$1skyblue$2#i;' {} \;
or if you wanted perl to do all the work, rather than have find(1)
launch N perl executables for each .html files, you could do this....
find . -type f -name '*html' -print | xargs perl -p -i.bak -e
's#(<body[^>]*bgcolor="?)oatmeal("?[^>]*>)#$1skyblue$2#i;'
That way, perl happily iterates through all the lines in all the files
since we don't care which file we're in when we do the substitution.
-Darren Dunham
[email protected]
UNIX Sysadmin
Rice University
(This line currently in revision)
Houston, TX
Any resemblance between real opinions and my post is coincidental
<HTML>
<HEAD><TITLE>Information gathering</TITLE></HEAD>
<BODY>
<TABLE><TR><TH>
<IMG SRC="info.gif"></TH> <TH>
<font size="+3">Information Gathering</font>
<BR>
Just some sample text which might or might not be worthless.
You'd want to sort out which of this was just HTML tags and
other worthless junk and which was meaningful.
</TH></TR></TABLE>
<P>
<CENTER><H2>Links to</H2>
<A HREF="/sameplace/otherinfo"> A link to something on this site </A>
<A HREF="/otherplace/otherinfo"> A link to something on this another site </A>
</BODY></HTML>
Articles from sci.lang selected through webSOM
Re: Scots and English Gregory J Dalley, 30 May 1995, Lines: 18.
Re: Dutch and English accents Phil Rose, 15 Jun 1995, Lines: 28.
Re: ANY SIL'rs out there? A.K.A. Summer Institute of Linguistics. yomomma, 16 Jun 1995, Lines: 6.
Re: ANY SIL'rs out there? A.K.A. Summer Institute of Linguistics. yomomma, 16 Jun 1995, Lines: 6.
Conferences, Seminars-info wanted chris bowen, Mon, 03 Jul 1995, Lines: 7.
AIGH? Coby (Jacob) Lubliner, 8 Jul 1995, Lines: 8.
"Shall" and "Will" in Welsh English [email protected], Wed, 19 Jul 95, Lines: 14.
careers in linguistics scharle, 10 Sep 1995, Lines: 8.
job opportunities in computational linguistics? Sonny Xuan Vu, 30 Sep 1995, Lines: 14.
Re: job opportunities in computational linguistics? Miss Sarah Tiller, Wed, 4 Oct 1995, Lines: 27.
Re: What Is Singapore English? Zhong Qiyao, 11 Dec 1995, Lines: 28.
Re: What Is Singapore English? Chew Kim Swee Andrew, 14 Dec 1995, Lines: 41.
Re: What Is Singapore English? Pota alok Ashwin, 16 Dec 1995, Lines: 45.
Re: How to write in English ... Ann Weiner, Tue, 2 Jan 1996, Lines: 13.
Re: What Is Singapore English? Wing Luk, 7 Jan 1996, Lines: 27.
Linguistics Careers lebitz,stacey b, 23 Jan 1996, Lines: 14.
English Teaching Offering in China - offer2.doc [1/1] XIAOJUN ZHANG, 24 Jan 1996, Lines: 240.
TRYING TO PROTECT YOUR WORK? prepaid, Sun, 04 Feb 1996, Lines: 1.
Give me, please, one program for learn to speak english!! Please!! "Eugen I. Ivanov", 20 Feb 1996, Lines: 1.
Re: The English "R" for Germans Joerg Settemeyer, 8 Mar 1996, Lines: 5.
English Tutor Needed. Mua Tran, 23 Mar 1996, Lines: 20.
Re: old form of shorthand Fido, 1 Apr 1996, Lines: 9.
Re: Math as pornography Gordon Fitch, 17 May 1996, Lines: 7.
Re: Chain Shift Charles Lieberman, 26 Jul 1996, Lines: 10.
Re: Tendency of Inflections to Disappear - Why? Terrence Griffin, 28 Jul 96, Lines: 1.
Re: Concerning the number of esperantists Marc Bonnaud, Fri, 09 Aug 1996, Lines: 14.
Re: Concerning the number of esperantists Cheradenine Zakalwe, Fri, 9 Aug 1996, Lines: 16.
Re: Concerning the number of esperantists Alan Gould, Sat, 10 Aug 1996, Lines: 22.
Re: Concerning the number of esperantists Don HARLOW, Sun, 11 Aug 1996, Lines: 21.
Re: Kiom da E-istoj *ne* regas la anglan? Andrew McConnell, Fri, 30 Aug 1996, Lines: 19.
cohesion in CMC Per-Mikael Jansson ENGE, 22 Oct 1996, Lines: 10.
Limitations of the Web
• Some functionality/specialization was given up for ubiquity
• Transfer time
– Mass data transfer prohibitive
• External to machine
– Reliance on network
• Not inherently as secure as staying home
Why Data Mining
• There is a lot of data of unknown worth and purity
• Data mining uses the same underlying procedures as other knowledge discovery/data extraction systems
Automatic Customization to user preferences
• Web pages
– HotWired configures itself automatically based on the pages you visit
• News services
– usenet service custom.roy-corey.1
• Information display paradigm
– industry report style
– collegiate style
– Microsoft style
Methods for gathering data
• Extraction from documents
– data mining
– keyword searches
– similarity searches
• Extraction from services
– ILA: Internet Learning Agent
– Softbots
– MetaCrawler
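A similarity search needs some measure of how alike two documents are. One simple choice, shown here only as an example and not as the method behind any system named above, is the Jaccard coefficient: the number of words two texts share divided by the number of distinct words in either. The class name Similarity is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class Similarity {
    // Splits text into a set of lowercase words.
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>(Arrays.asList(text.toLowerCase().split("[^a-z0-9]+")));
        out.remove("");
        return out;
    }

    // Jaccard coefficient: shared words divided by all distinct words.
    static double jaccard(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        if (wordsA.isEmpty() && wordsB.isEmpty()) return 1.0;
        Set<String> common = new HashSet<>(wordsA);
        common.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return (double) common.size() / union.size();
    }
}
```

Comparing a subject line with its "Re:" follow-up gives a high score, while unrelated subjects score near zero, which is enough to group threads in a news archive.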
Data mining on the web?
• Transfer rate too slow to transfer most databases whenever you want
• Computation too intensive to let others mine your database whenever they want
• So: Use pre-collected data or pre-indexed database
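A pre-indexed database can be as simple as an inverted index: the expensive pass over the documents happens once, and each later query is a single lookup. The sketch below is one way to build such an index in memory; the class name and the integer document ids are assumptions for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class InvertedIndex {
    // Maps each word to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing cost is paid once per document, ahead of any query.
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A query is a map lookup; no document is re-read.
    Set<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }
}
```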
Java -- What is it?
• Programming Language
• Java Compiler
• Java Interpreter (Java Virtual Machine)
• For creating applets which run inside a browser
• For creating applications (stand alone programs)
Java Application Source Code
//
// Sample HelloWorld application
//
class HelloWorldApp {
    public static void main(String args[]) {
        System.out.println("Hello World!");
    }
}
Java Applet Source Code
//
// Sample HelloWorld applet
//
import java.awt.Graphics;
import java.applet.Applet;

public class HelloWorld extends Applet {
    public void paint(Graphics g) {
        g.drawString("Hello world!", 25, 25);
    }
}
How could you use it?
• Client applets or applications
• Server code
• Portable code
• Create via Developer Tools
Developer Tools
• Visual C++ (Visual Java?)
• Symantec
• Sun
• SGI - Cosmo Code
Developer Tools
• SourceCraft
• Powersoft - Fusion
• Quintessential Objects - Diva for Java (Javaside)
• Rogue Wave - JFactory
Advantages
• Object Oriented and event-driven
• Portable* bytecode
• Multi-threaded
• Integrated Network Abilities
• Built-in Multimedia Capabilities
• “Robust and Secure”
Drawbacks
• Few deployed clients
• Very C++ -like
• Not yet stabilized
• Very few Developer Tools
• Not all the class libraries exist (yet)
Class Structure
Class java.applet.Applet
java.lang.Object
   |
   +----java.awt.Component
           |
           +----java.awt.Container
                   |
                   +----java.awt.Panel
                           |
                           +----java.applet.Applet
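A hierarchy like the one above can be recovered at run time by following getSuperclass() until it returns null. The sketch below is a small helper written for illustration (the class name ClassChain is hypothetical); passing it the Applet class, on a runtime that still ships java.applet, should list the same five classes from Applet back to Object.

```java
import java.util.ArrayList;
import java.util.List;

class ClassChain {
    // Lists the class and each of its superclasses, ending with java.lang.Object.
    static List<String> superclassChain(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            names.add(c.getName());
        }
        return names;
    }
}
```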
Security
• OS security in applications
• “No Pointers” and no user memory management
• Compile-time and Run-time checking
• Client Data Security
– No access to disk from Netscape
– Directory-based security in HotJava
Security
• Network Security
– No Applets
– No Access
– Applet Host
– Firewall
– Any Host
Security Problems
• CERT 96.05 - Firewall Security
– ftp://info.cert.org/pub/cert_advisories/CA96.05.java_applet_security_mgr
• CERT 96.07 - Bytecode Verifier
– ftp://info.cert.org/pub/cert_advisories/CA96.07.java_bytecode_verifier
Alternative Options
• Visual Basic and browsers
• Visual Basic separate from WWW
• Web Server without Java
Books About Java
• Teach Yourself Java in 21 Days
• Java!
• Hooked On Java
• Presenting Java
• O’Reilly
Java WWW Sites
• Sun
– http://java.sun.com/
• The Internet Programming Page
– http://www.apexsc.com/vb/internet.html
• Rogue Wave Home Page
– http://www.roguewave.com/
• Symantec Café
– http://cafe.symantec.com/cafe/index.html
Java WWW Sites
• JavaSoft
– http://www.javasoft.com/
• The Java Directory (Gamelan)
– http://www.gamelan.com/
• IBM: Centre for Java Technology
– http://www.hursley.ibm.com/javainfo/
• News: comp.lang.java