Web 1.0 - Search


Introduction to Web Science
Web 1.0
Dr Alexiei Dingli
1
Introducing Web 1.0
• Packet switching network
• IP Addressing
• Internet Applications
• The WWW and markup
• Searching the WWW
• Intelligent Agents
• Internet Governance
2
Packet-Switched Networks (1)
• Local area network (LAN)
– Network of computers located close together
• Wide area networks (WANs)
– Networks of computers connected over greater
distances
• Circuit
– Combination of telephone lines and closed
switches that connect them to each other
3
Packet-Switched Networks (2)
• Circuit switching is used in telephone
communication
• The Internet uses packet switching
• Packet switching needs computers called
‘routers’ and programs called ‘routing
algorithms’
4
Packet-Switched Networks (3)
• Information is divided into packets
• It is passed from node to node
• It is recomposed as one chunk on the destination server
5
6
Routing Packets
• Routing computers
– Computers that decide how best to forward
packets
• Routing algorithms
– Rules contained in programs on router computers
that determine the best path on which to send
packets
– Programs apply their routing algorithms to
information they have stored in routing tables
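The idea of applying a routing algorithm to a routing table can be sketched as follows. This is only an illustration: the table entries and next-hop names are hypothetical, and the rule used here (longest-prefix match) is one common way to pick a route, not necessarily the one the slides have in mind.

```python
import ipaddress

# Hypothetical routing table: destination prefix -> next hop name.
ROUTING_TABLE = {
    "0.0.0.0/0": "default-gateway",
    "10.0.0.0/8": "router-a",
    "10.1.0.0/16": "router-b",
}


def next_hop(destination: str) -> str:
    """Return the next hop of the most specific route containing the address."""
    address = ipaddress.ip_address(destination)
    matches = []
    for prefix, hop in ROUTING_TABLE.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            matches.append((network.prefixlen, hop))
    # The longest prefix is the most specific match.
    return max(matches)[1]


print(next_hop("10.1.2.3"))
print(next_hop("8.8.8.8"))
```

The default route 0.0.0.0/0 matches every address, so the lookup always finds at least one entry.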
7
TCP/IP
• Communications protocol suite
– Packet switched protocol
• No end-to-end connection is required
• Each message broken down into small pieces called packets
• Packets possibly routed to destination over different paths
– Transmission Control Protocol (TCP)
• Breaks messages into packets
• Numbers packets in order
• Reorders packets at the destination
– Internet Protocol (IP)
• Routes packets to the proper destination
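The three TCP steps above (break into packets, number them, reorder at the destination) can be sketched in a few lines. This is a simplified illustration, not real TCP: the function names are made up for the example, and real TCP numbers bytes rather than whole packets.

```python
import random


def to_packets(message: bytes, size: int) -> list[tuple[int, bytes]]:
    """Break a message into (sequence_number, payload) packets."""
    return [
        (seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Sort packets by sequence number and join their payloads."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"Packets may take different paths."
packets = to_packets(message, size=8)
random.shuffle(packets)  # simulate packets arriving out of order
restored = reassemble(packets)
print(restored == message)
```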
8
Open Systems Interconnection (OSI) Model
OSI model layers (from the highest to the lowest), with examples from the
TCP/IP protocol suite, a separate model that maps onto these layers only roughly:
Layer                                      Examples
7 Application, 6 Presentation, 5 Session   HTTP, SMTP, FTP, Telnet, SSH, Whois, etc.
4 Transport                                TCP, UDP
3 Network                                  IP
2 Data Link                                Ethernet
1 Physical                                 Wire, Radio, Fibre Optic
9
IP Address
• Internet addresses are based on a 32-bit
number called an IP address
• IP addresses appear as a series of four numbers
(each from 0 to 255) separated by periods
• An address such as 126.204.89.56 uniquely
identifies a computer connected to the
Internet
• IP Subnetting conceptually divides a large
network into smaller sub-networks
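A short sketch of both ideas on this slide, using Python's standard `ipaddress` module. The address 126.204.89.56 comes from the slide; the network 192.168.0.0/24 and the choice of /26 subnets are assumptions made only for illustration.

```python
import ipaddress

# A dotted address is just a readable form of one 32-bit number.
address = ipaddress.ip_address("126.204.89.56")
as_integer = int(address)

# The same conversion by hand: each of the four numbers is one byte.
octets = [int(part) for part in "126.204.89.56".split(".")]
manual = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
print(as_integer == manual)

# Subnetting: divide one network (assumed here) into four smaller ones.
network = ipaddress.ip_network("192.168.0.0/24")
subnets = list(network.subnets(new_prefix=26))
for subnet in subnets:
    print(subnet, subnet.num_addresses)
```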
10
IP Classes (1)
11
IP Classes (2)
Class     Leading bits   Number of networks   Addresses per network
Class A   0              126                  16,777,214
Class B   10             16,384               65,534
Class C   110            2,097,152            254
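The figures in the table follow from how the 32 bits are split. The sketch below assumes the classic classful layout (8, 16 and 24 network bits for classes A, B and C, including the leading bits) and the usual convention that the all-zeros and all-ones host values are reserved.

```python
# (class name, number of leading bits, total bits in the network part)
CLASSES = [("A", 1, 8), ("B", 2, 16), ("C", 3, 24)]


def class_sizes(leading_bits: int, network_bits: int) -> tuple[int, int]:
    """Return (possible networks, usable addresses per network)."""
    networks = 2 ** (network_bits - leading_bits)
    # Two host values per network (all zeros, all ones) are reserved.
    hosts = 2 ** (32 - network_bits) - 2
    return networks, hosts


table = {name: class_sizes(lead, net) for name, lead, net in CLASSES}

# Class A conventionally excludes networks 0 and 127, which gives 126.
class_a_networks = table["A"][0] - 2

for name, (networks, hosts) in table.items():
    print(name, networks, hosts)
print("usable class A networks:", class_a_networks)
```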
12
Subnetting
13
Without subnetting …
• Explosion in size of IP routing tables.
• Every time more address space was needed, the
administrator would have to apply for a new block of
addresses.
• Any changes to the internal structure of a company's
network would potentially affect devices and sites
outside the organization.
• Keeping track of all those different Class C networks
would be a bit of a headache in its own right.
14
Benefits of Subnetting
• Better Match to Physical Network Structure
• Flexibility
• Invisibility To Public Internet
• No Need To Request New IP Addresses
• No Routing Table Entry Proliferation
15
IPv6 (or IP Next Generation)
• Network Layer
• Developed in 1994
• Will replace the IPv4 standard
– limits on network addresses will eventually lead to
exhaustion of available addresses (by 2023)
– IPv4 supports only 4,294,967,296 addresses (32 bits)
• Improvements include
– providing future cell phones and mobile devices their own
unique & permanent addresses
– support for about 3.4 × 10^38 addresses (128 bits)
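Both address counts are simply powers of two, which can be checked directly:

```python
ipv4_addresses = 2 ** 32    # 32-bit addresses
ipv6_addresses = 2 ** 128   # 128-bit addresses

print(ipv4_addresses)              # 4294967296
print(f"{ipv6_addresses:.1e}")     # 3.4e+38
```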
16
Domain Names
• A Uniform Resource Locator (URL) consists of names and
abbreviations that are much easier to remember than IP
addresses
• The HTTP protocol defines how an Internet resource is
accessed
• An address such as www.microsoft.com is called a domain
name
• Domain Name System (DNS)
– A database of Internet names
– DNS Servers convert Internet names to IP addresses
– Top level domains
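To make the URL, domain name and DNS relationship concrete, here is a toy sketch. The lookup table is hypothetical (real DNS is a distributed database queried over the network), and the name and address are reserved example values, not real records.

```python
from urllib.parse import urlsplit

# Hypothetical name-to-address records standing in for a DNS server.
TOY_DNS = {"www.example.com": "192.0.2.10"}


def resolve(url: str) -> dict:
    """Split a URL and look its domain name up in the toy table."""
    parts = urlsplit(url)
    host = parts.hostname
    return {
        "protocol": parts.scheme,                  # how the resource is accessed
        "domain": host,                            # the name people remember
        "top_level_domain": host.rsplit(".", 1)[-1],
        "ip_address": TOY_DNS.get(host),           # what the network uses
    }


info = resolve("http://www.example.com/index.html")
print(info)
```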
17
Top-Level Domain Names
• Internet Corporation for Assigned Names and
Numbers (ICANN)
– Responsible for managing domain names and
coordinating them with IP address registrars
18
Domain Name case study
• The web was not an ‘open’ place
• Only one company, Network Solutions, sold
.com, .net or .org domains
• A domain cost 100 dollars, with a two-year minimum
• Back then, there was a good chance you would be
able to buy a dictionary word as a .com
• In 2000, it lost its monopoly position and
domain prices dropped by over 95%
• Since then innovation halted and Network
Solutions became one of thousands of
anonymous domain registrars
19
Internet Applications
• E-Mail
• File transfers
• Instant messaging (IM)
• Newsgroups
• Streaming audio and video
• Internet telephony
• World Wide Web (WWW)
20
E-Mail
• Most popular and widely used Internet application
• 30 billion e-mails sent every day
– Spam – junk e-mail messages
– Spam costs corporate America $9 billion per year
• Every e-mail message contains a header that
describes the source and destination of the message
• E-mail messages are text, but may have
attachments of many types of digital data
– Viruses often transmitted via e-mail
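The header-plus-body structure with an attachment can be sketched with Python's standard `email` package. The addresses, subject and attachment bytes are placeholders invented for the example; nothing is actually sent.

```python
from email.message import EmailMessage

message = EmailMessage()
# Header fields describe the source and destination of the message.
message["From"] = "alice@example.com"
message["To"] = "bob@example.org"
message["Subject"] = "Lecture notes"

# The body is plain text.
message.set_content("The slides are attached.")

# Attachments can carry other kinds of digital data (placeholder bytes here).
message.add_attachment(
    b"%PDF-1.4 placeholder",
    maintype="application",
    subtype="pdf",
    filename="slides.pdf",
)

print(message["From"], "->", message["To"])
print(message.get_content_type())
```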
21
SMTP, POP, and IMAP (1)
• E-mail sent across the Internet is managed and stored
by mail servers
• Simple Mail Transfer Protocol (SMTP) is the standard for
sending mail to the server
• Post Office Protocol (POP) is the standard for retrieving mail
from the server
• The Internet Message Access Protocol (IMAP) is a newer
e-mail retrieval protocol
22
SMTP, POP, and IMAP (2)
23
Controlling Spam
• Use complex email addresses rather than name and surname
combination
– Why? Bots? Name Directories?
• Control exposure of email address
– How? JavaScript? JPEG?
• Use multiple email addresses for different purposes
– On what occasions?
• Use content-filtering software
– black list spam filter
– white list spam filter
– challenge response using graphical challenges?
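The three filtering ideas can be combined in a small sketch. The function name, the lists and the order of the checks are assumptions made for illustration; real filters also inspect message content.

```python
def classify_sender(sender: str, whitelist: set[str], blacklist: set[str]) -> str:
    """Decide what to do with a message based only on its sender."""
    address = sender.strip().lower()
    domain = address.rsplit("@", 1)[-1]
    if address in whitelist or domain in whitelist:
        return "deliver"      # white list: known good senders
    if address in blacklist or domain in blacklist:
        return "reject"       # black list: known spam sources
    return "challenge"        # unknown sender: ask them to solve a challenge


# Hypothetical lists of full addresses or whole domains.
WHITELIST = {"colleague@example.org"}
BLACKLIST = {"spam.example.net"}

print(classify_sender("colleague@example.org", WHITELIST, BLACKLIST))
print(classify_sender("offers@spam.example.net", WHITELIST, BLACKLIST))
print(classify_sender("stranger@example.com", WHITELIST, BLACKLIST))
```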
24
E-Mail Case Study
• Hotmail (1995)
• First place to get a free email address,
disconnected from an ISP
• 4 years later, 30 million people worldwide were
exchanging @hotmail email addresses
• Bought by Microsoft in 1998 for just 400 million
dollars
• 2007: the end of Hotmail
– transformation to “Live” mail to become an
integrated part of Microsoft’s “Live” family
25
File Transfers
• File transfer protocol (FTP)
– Protocol providing for transmission of a file between
an Internet server and a user’s computer
• Peer-to-peer (P2P) file sharing
– Share data from one computer to another
– Every user can be a server
– Examples: Napster, Kazaa, Gnutella, BitTorrent
– With P2P, every user on the network can make data
available to every other user on the network
26
Instant Messaging
• Allows a user to create a private chat session with
another user
• IM started with AOL
• IM is sneaking into corporate networks
• Many Web-based companies use IM technology
for customer service
– eBay
27
28
ICQ case study
• ICQ is a play on the phrase “I seek you”
• 1996: the first easy-to-use instant messenger program where you
could add friends to your list, and see if they were online
• Back then it was revolutionary for the masses and it became the
‘application’ everybody had installed
• Acquired by AOL in June 1998 for a whopping $287 million
• Eventually the program got too many additional features that
made the application heavy and unorganized
• Competition from AOL IM, Yahoo IM, and MSN Messenger
increased, and friends on your ICQ list left the application,
eventually resulting in a mass abandonment of the network
29
Usenet Newsgroups
• Online, bulletin board discussion forums
• Users post and read messages
• More than 100,000 newsgroups
• Millions of newsgroup readers
• Important information resource, especially for technical
issues and products
• Newsgroup messages distributed using open standard
– Many are uncensored
30
Streaming Audio and Video
• Creating and sending audio and video files
– Sports
• Basketball at sports.yahoo.com
• Major league baseball
– News
• Fox News
• CNN radio
– Business
• ZDNet
– Education
• Warriors of the Net
31
Internet Telephony
• Voice-over Internet Protocol (VoIP)
• Use your computer like a telephone
• Software connects computers via the Internet and
transmits voice data
• Savings come from eliminating toll charges
between locations
32
Internet TV
33
The World Wide Web
• Collection of hyperlinked computer files on the Internet
• Client-server application
– Web servers
– Web browsers as clients
• WWW standards
– Hypertext markup language (HTML)
• Current standard for writing Web pages
• Tags in HTML instruct the client browser how to format and display the
Web page content
– Hypertext transfer protocol (HTTP)
• Establishes a connection between Web server and client
– Extensible markup language (XML)
• A meta-markup language
• Gives meaning to the data enclosed within XML tags
34
Website case study
• GeoCities: create your own free homepage on the web
• 1997: fifth most popular website, with over
500,000 homepages created
• Yahoo bought GeoCities two years later for
$3.57 billion and started to actively
commercialize the homepages with various
types of advertising, which became their death
sentence
• As ‘real’ web hosting became affordable for
everybody, the need for free homepages in this
form vanished
35
Overview of Markup Languages
• SGML is a rich meta language that is useful for
defining markup languages
• HTML is particularly useful for displaying Web
pages
• XML defines data structures for electronic
commerce (and much more …)
36
http://www.w3.org/
Development of Markup Languages
37
Standard Generalized Markup
Language
• ISO adopted the SGML standard in 1986
• SGML is nonproprietary and platform-independent
• SGML supports user-defined tags and
architecture to complement the required
richness of documents
38
Extensible Markup Language
• XML is a descendant of SGML
• XML allows designers to easily describe and deliver
structured data from any application in a standard,
consistent way
• XML can be embedded within an HTML document
• XML allows you to create your own customized
markup language.
39
Learn XML in a slide
• Tag – a piece of markup
– An opening tag: <name>
– A closing tag: </name>
• Element – well-formed usage of tags
– <name>Alexiei</name>
• Attribute – properties
– <name length="7">Alexiei</name>
• Rules to keep XML well formed
1. Elements can be nested but must not overlap
2. Tag names are case sensitive
3. Attribute values must be quoted
4. Every opening tag requires an end tag
• Shorthand
– <abc></abc> is equivalent to <abc/>
40
Some XML examples
<book>E-Commerce</booK>
<book pages=100>E-Commerce</book>
<book pages="100"><title>E-Commerce</book></title>
<book pages="100"><title>E-Commerce</title></book>
<book pages="100">
<title>E-Commerce</title>
<author>
<name>Gary</name>
<surname>Schneider</surname>
</author>
</book>
41
Some XML examples
<book>E-Commerce</booK> ✗ not well formed: tag names differ in case
<book pages=100>E-Commerce</book> ✗ not well formed: attribute value is not quoted
<book pages="100"><title>E-Commerce</book></title> ✗ not well formed: elements overlap
<book pages="100"><title>E-Commerce</title></book> ✓ well formed
<book pages="100">
<title>E-Commerce</title>
<author>
<name>Gary</name>
<surname>Schneider</surname>
</author>
</book> ✓ well formed
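The well-formedness rules can be checked mechanically. The sketch below feeds the five examples (written with straight quotes) to the parser in Python's standard library, which raises an error on any document that is not well formed; the helper name is made up for the example.

```python
import xml.etree.ElementTree as ET

EXAMPLES = [
    '<book>E-Commerce</booK>',
    '<book pages=100>E-Commerce</book>',
    '<book pages="100"><title>E-Commerce</book></title>',
    '<book pages="100"><title>E-Commerce</title></book>',
    '<book pages="100"><title>E-Commerce</title>'
    '<author><name>Gary</name><surname>Schneider</surname></author></book>',
]


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


results = [is_well_formed(example) for example in EXAMPLES]
for example, ok in zip(EXAMPLES, results):
    print("well formed" if ok else "NOT well formed", "-", example[:45])
```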
42
Processing a Request for an XML Page
• Why go through all this hassle?
• How would you go about displaying HTML on a
– PC
– Handheld
– Mobile
43
Hypertext Markup Language
• Tim Berners-Lee invented HTML
• HTML is a document production language that
includes a set of tags that define the format and
style of a document
• HTML is based on SGML
• HTML is an application of SGML: its tags are
defined by an SGML Document Type Definition
(DTD)
44
HTML Tags
• An HTML document contains both document content and
tags
• The tags are the HTML codes inserted in a document to
specify the format on screen
• Each tag is enclosed in angle brackets (< >)
• Most tags are two-sided – opening and closing tags
• Well formed tags, bots, meta tags?? Why are they
important?
45
HTML Links
• Hyperlinks are bits of text that connect the current
document to:
– Another location in the same document
– Another document on the same host machine
– Another document on the Internet
– Can they link to a toaster at home?
• Hyperlinks are created using the HTML anchor tag
• Two popular link structures:
– Linear hyperlink structure
– Hierarchical hyperlink structure
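The three kinds of link target listed above can be told apart by resolving each anchor's `href` against the address of the current page. This is a minimal sketch using the standard library; the page address and the sample HTML are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor (<a>) tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def classify_links(html: str, base_url: str) -> list[tuple[str, str]]:
    collector = LinkCollector()
    collector.feed(html)
    base = urlsplit(base_url)
    result = []
    for href in collector.links:
        target = urlsplit(urljoin(base_url, href))
        if target.netloc != base.netloc:
            kind = "another document on the Internet"
        elif target.path == base.path:
            kind = "same document"
        else:
            kind = "same host"
        result.append((href, kind))
    return result


PAGE = '<a href="#top">Top</a> <a href="b.html">B</a> <a href="http://www.w3.org/">W3C</a>'
links = classify_links(PAGE, "http://www.example.com/a.html")
for href, kind in links:
    print(href, "->", kind)
```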
46
HTML Version History
• HTML version 1.0 was introduced in 1991
• HTML 2.0 was released in Sept. 1995
• HTML 3.2 was introduced in 1997
• HTML 4.0 was released by W3C in Dec 1997
• HTML 4.01 was released in Dec 1999
• XHTML 1.0 became a W3C recommendation in Jan 2000
47
HTML Editors (1)
• Low-end editors display HTML code on the
screen and allow you to insert HTML tag pairs by
clicking selected buttons
• High-end editors are Web site builder programs
that provide a rich environment displaying the
Web page, not the HTML code
• Microsoft FrontPage and Macromedia
Dreamweaver are examples of Web site builders
48
HTML Editors (2)
49
Static versus Dynamic Pages
• HTML and XML only display and exchange data
• No interactivity; no processing of data
• Scripting languages
– Provides basic interactivity
• Rollovers
• Crawling text
– JavaScript
– VBScript
• Full-featured Web programming
– Java
– Client side scripting or browser side scripting
– Applets
– J2EE
• Common Gateway Interface (CGI)
– Allows passing of data between a static HTML page and a
computer program
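A rough sketch of the CGI idea: the Web server passes the data from the page to a program through environment variables such as `QUERY_STRING`, and the program returns headers, a blank line, then the generated page. The function name and the `name` parameter are assumptions made for illustration.

```python
from html import escape
from urllib.parse import parse_qs


def handle_request(environ: dict) -> str:
    """Build a CGI-style response from the server-provided environment."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["world"])[0]
    # Escape user input before placing it in the page.
    body = f"<html><body><p>Hello, {escape(name)}!</p></body></html>"
    # A CGI program writes headers, an empty line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body


response = handle_request({"QUERY_STRING": "name=Ada"})
print(response)
```

In a real deployment the server would run the script and the environment would come from `os.environ`; here a plain dictionary stands in for it.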
50
Searching the WWW
• Most data on the Internet is part of the WWW
• Search engines – large databases that index
WWW content
• Building the search engine database
– Submit a site to the search engine administrator for
listing
– Spiders
• Metatags
– Google
– Yahoo
51
Search Engines
• A search engine is a special kind of Web
software that finds Web pages that match a
word or phrase you enter
• A Web directory is a listing of hyperlinks to Web pages
that is organized into hierarchical categories, e.g.
http://directory.google.com/
• Search engines contain three major parts: spider,
index, and utility
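The index and the search utility can be sketched as an inverted index, one common way to build such a database (the slides do not say which structure is meant). The pages below are hypothetical stand-ins for what a spider would have fetched.

```python
from collections import defaultdict

# Hypothetical pages, as if already gathered by the spider.
PAGES = {
    "http://www.example.com/a": "packet switching on the internet",
    "http://www.example.com/b": "markup languages on the web",
    "http://www.example.com/c": "searching the web with spiders",
}


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Index: map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Utility: return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(PAGES)
print(sorted(search(index, "the web")))
```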
52
Popular Search Engines
53
Spiders and Crawlers
54
Indexing
55
Search Engine case study
• Search engine AltaVista was the Google of the last millennium
• First real effort to index the World Wide Web
• One of the few search engines that actually came up with good
search results
• Had a hard time fighting spam listings in their results
• While spam grew rapidly in AltaVista, a company named
Google found a way to prioritize web pages more intelligently, and
thus keep spam out better
56
Case Study: Google’s PageRank
• PageRank relies on the uniquely democratic nature of
the web by using its vast link structure as an indicator
of an individual page's value
• Google interprets a link from page A to page B as a
vote, by page A, for page B
• But Google looks at more than the sheer volume of
votes, or links a page receives; it also analyzes the
page that casts the vote
• Votes cast by pages that are themselves "important"
weigh more heavily and help to make other pages
"important."
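The voting idea can be sketched as a simple iterative calculation. This is a textbook-style simplification, not Google's production algorithm: the four-page link graph is invented, and the damping factor of 0.85 and the number of iterations are assumptions.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Repeatedly let every page share its rank among the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A link is a vote; a page's vote is worth a share of its own rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all vote for C; C votes for A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this graph A receives only one vote, but it comes from the heavily linked page C, so A ends up ranked well above B and D, which receive no votes at all.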
57
Intelligent Agents
• An intelligent agent is a program that
performs functions such as
– information gathering,
– information filtering, or
– mediation,
running in the background on behalf of a person or
entity
• What agents can you think of?
58
Intelligent Agents (2)
• Search Agents
– Improve your information retrieval on the Internet
– Used to find pages on the Web easily and quickly
• Meta Agents, Specialised (MP3), etc
• Web Agents
– Improve browsing experience
• Automate form filling, off-line browsing, etc
• Monitoring Agents
– Monitor web sites or specific themes
– Used to get automatic alerts about the latest news
59
Intelligent Agents (3)
• Virtual Assistants
– Artificial life
– Characters, plants, animals or people living on your desktop
• Shop Bots
– Allow users to compare prices on the Internet
– Find the best price for books, CDs, movies, etc.
• Webmastering Agents
– Make it easy to manage a Web site and make it more effective
– Monitor broken links, content gathering etc.
60
Intelligent Agents (4)
• Other agents …
– Development agents
• Used to develop other agents
– Games agents
• Used in games
61
Ms Dewey: not your
ordinary search agent!
62
Internet Governance
• Internet Engineering Task Force (IETF)
– Works in groups to develop standards
• Internet Engineering Steering Group (IESG)
– Approves or disapproves standards developed by the
IETF
• Internet Architecture Board (IAB)
– The oversight authority for the standards development
process
• World Wide Web Consortium (W3C)
– Promotes the WWW and develops new web
technologies and standards
63
Conclusion
• We’re all very familiar with Web 1.0
• But what makes Web 2.0?
• Next lecture …
64
Questions?
65