Web Science1

Web Science: An Interdisciplinary
Approach to Understanding the
Web
ACM Paper
(James Hendler, Nigel Shadbolt,
Wendy Hall, Tim Berners-Lee, and
Daniel Weitzner)
Overview
Most computer
science departments have
been slow to adapt or
respond to the Web's
influence on computer
science. Surprisingly, its
standards, protocols, and
architectures are not studied in
great detail.
CS has for many years recognized the
importance of networking (TCP/IP
protocols, the Internet, etc.).
However, the Web, despite having its
own protocols, algorithms and
architectural principles, is often
viewed as just an application of the
network.
This may have been the correct point of view
at the dawn of the Internet; however,
the Web is clearly the most used and
one of the most transformative
applications in the history of
computing, even in human
communications. Irrefutably, the Web
has changed how the world
communicates. Openness and 24/7
availability have revolutionized the
communication grid.
The number of web
pages is greater than the
population of the
world...
New algorithms underlying
modern search are
fundamental to Web use.
Hadoop, an open-source
software framework that
supports data-intensive
distributed applications on
large clusters of commodity
computers, makes it possible
to explore these algorithms
and experiment with large-scale
Web-programming practices.
Human interaction on the
Web
• social networking
• tagging
• data integration
• information retrieval
• Web ontologies
These form the basis of a new area
called "social computing".
Whether in CS studies or in
information-school courses,
the Web is often studied
exclusively as the delivery
vehicle for content,
technical or social, rather
than as an object of study in
its own right.
Where physical science is
commonly regarded as an
analytic discipline that aims
to find laws that generate or
explain observed events,
CS is predominantly
synthetic (like
mathematics), in that
formalisms and algorithms
are created in order to
support specific desired
behaviors.
The Web needs to be
studied and
understood as a
phenomenon but
also as something
to be engineered for
future growth and
capabilities.
The Web is also an
infrastructure of
artificial languages
and protocols; it is a
piece of engineering.
Also the Web is the
interaction of human beings
creating, linking, and
consuming information that
generates the Web's
behavior overall.
The Web is part of a wider
system of human
interaction; it has profoundly
affected society, with each
emerging wave creating
new challenges and
opportunities in making
information more available
to wider sectors of the
population than ever before.
Web applications are not built
with a single user in mind,
or for a single computer. All
Web applications by default
are built for distributed use.
A popular Web application
can spread much like an
outbreak of a flu virus: its
use grows exponentially.
The Web, thought of as a macro
system, is the use of the micro
system by many users
interacting with one another in
often unpredicted ways.
One unexpected result came from
the gaming of the search
algorithms to improve search
rank, leading to a need for
better search technologies to
defeat the gaming.
The essence of our
understanding of what
succeeds on the Web and
how to develop better Web
applications is that we must
create new ways to
understand how to design
systems to produce the
effect we want.
How do we design and build
something at the micro level
and have it function in a
desirable way at the
macroscale?
How do we predict other side
effects and the emergent
properties of the macro?
Understanding the Web
requires not only an
analysis of technological
issues but also of the social
dynamics of perhaps millions
of users.
Because of the multi-user
(social) nature of the Web,
its science is inherently
interdisciplinary.
WEB GRAPH
One way to model the Web is to
define it in terms of a graph
where the nodes are Web
pages and the edges are
the hypertext links among
these nodes.
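The graph model above can be sketched in a few lines. This is only an illustration: the page names and links are invented, and a real Web graph would be built from crawl data rather than a hand-written list.

```python
from collections import defaultdict

# Hypothetical pages and links, invented for illustration only.
links = [
    ("a.example/index", "a.example/about"),
    ("a.example/index", "b.example/home"),
    ("b.example/home", "a.example/index"),
    ("c.example/blog", "a.example/index"),
]

def build_web_graph(edges):
    """Return an adjacency map: page -> set of pages it links to."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target]  # touch the key so pages without outgoing links are still nodes
    return dict(graph)

def in_degree(graph):
    """Count how many distinct pages link to each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts

web_graph = build_web_graph(links)
referrers = in_degree(web_graph)
```

Here `web_graph` holds the nodes and directed edges, and `referrers` gives the number of incoming links per page, the raw quantity that link-based ranking starts from.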
This graph grows on the order
of seven million new pages
a day
Algorithms developed to exploit
properties of the Web graph:
the HITS and PageRank algorithms
attempt to rate the validity
or importance of a web page by
the hyperlinks that refer
to it. The assumption that
referring links signal importance led
to the development of powerful
search engines for finding
content on the Web.
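As a rough illustration of link-based ranking, the sketch below runs a simplified PageRank-style power iteration on a toy graph. It is not the production algorithm of any search engine: the graph, the damping value of 0.85, and the fixed iteration count are assumptions chosen for the example. Note that it weights each referring link by the rank of the page it comes from, rather than simply counting links.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    graph maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not tuned values.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Toy graph, invented for illustration.
toy_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(toy_graph)
```

In this toy graph, page C is linked to by three pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.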
The edges of the Web graph
represent single
instantiations of the result of
calling the HTTP protocol
with a GET request, which
returns an HTML document
based upon the URI
(Uniform Resource
Identifier).
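To make the link between a URI and a GET request concrete, the sketch below splits a URI into its parts and assembles the text of a minimal HTTP/1.1 GET request. The URI is a made-up example, the helper name `build_get_request` is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlsplit

def build_get_request(uri):
    """Return the text of a minimal HTTP/1.1 GET request for the given URI."""
    parts = urlsplit(uri)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Accept: text/html\r\n"
        "\r\n"
    )

# Hypothetical URI, used only for illustration.
request_text = build_get_request("http://www.example.org/papers/web-science.html")
print(request_text)
```

Following a hyperlink amounts to issuing a request like this one; the server's response is the representation that the edge in the Web graph points to.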
Consider the fact that many HTML
documents are made up of a
variety of objects that are
embedded in the document.
These can be images on
different servers, or formatting
specifications such as cascading
style sheets or XML DTD
documents, for example. All of
these can be captured by a
crawler and then, in turn, used in
defining the Web graph.
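A crawler could separate hyperlinks from embedded objects roughly as follows. This is a minimal sketch using Python's standard `html.parser`; the sample document and its URLs are invented, and only a few tag types are handled.

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect hyperlinks and embedded resources from an HTML document."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []  # targets of <a href="...">
        self.embedded = []    # images, scripts, and style sheets

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.hyperlinks.append(attributes["href"])
        elif tag in ("img", "script") and attributes.get("src"):
            self.embedded.append(attributes["src"])
        elif (tag == "link" and attributes.get("rel") == "stylesheet"
              and attributes.get("href")):
            self.embedded.append(attributes["href"])

# Invented sample page with resources on three different hosts.
sample_html = """
<html><head>
<link rel="stylesheet" href="http://static.example.org/site.css">
</head><body>
<img src="http://images.example.net/logo.png">
<a href="http://www.example.com/next.html">Next</a>
</body></html>
"""

collector = ResourceCollector()
collector.feed(sample_html)
```

Whether embedded objects count as edges of the Web graph is a modeling choice; keeping them in a separate list, as here, leaves that choice open.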
Such a model would be static;
when studying
social interaction, this type of
model must be enhanced in
order to depict that interaction
accurately.
However, the design of the
Web's protocols and
services must also be
considered, since without
them the Web would not be
scale-free.
Core Design Components of
the Web:
• Identification of resources
• Representation of resource
state
• Protocols that support
interaction between agents
and resources in the space
The richness of the request
protocol implies that there is
an underlying attribute that
is yet to be incorporated into
the Web graph. That is, the
user-dependent state.
When considering the Web,
as an application of the
Internet, a static model just
will not do.
Consider an HTTP POST
request rather than an HTTP GET…
Sometimes sites generate
complex URIs that use GET
requests to pass on state,
thus obscuring the identity
of the actual resources.
URIs that carry state are used
heavily in Web apps. These
occurrences have yet to be
completely analyzed.
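One way to see how state obscures resource identity is to strip the state-carrying parameters from a URI and compare what remains. The sketch below assumes, purely for illustration, that parameters named `sessionid`, `sid`, or `token` carry per-user state; real sites use many other conventions, so this list is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names assumed to carry per-user state.
STATE_PARAMS = {"sessionid", "sid", "token"}

def split_state(uri):
    """Separate a URI into a state-free form and the state it carried."""
    parts = urlsplit(uri)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    resource_pairs = [(k, v) for k, v in pairs if k.lower() not in STATE_PARAMS]
    state = {k: v for k, v in pairs if k.lower() in STATE_PARAMS}
    canonical = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(resource_pairs), "")
    )
    return canonical, state

# Invented example URI.
canonical_uri, user_state = split_state(
    "http://shop.example.com/item?id=42&sessionid=abc123&sid=9"
)
```

Two users visiting the same item would produce different raw URIs but the same `canonical_uri`, which is the node a graph model would want to count once.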
Udi Manber at
Google commented that, on
average, 20%-25% of daily
searches have never been
submitted before. He points
out that this makes search
incredibly difficult.
Note: Google receives over
100 million queries per day.
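Combining the two figures above gives a sense of scale. The calculation below uses only the numbers quoted on the slides and treats "over 100 million" as a lower bound, so the results are minimums rather than measurements.

```python
# Figures quoted on the slides; "over 100 million" is treated as a lower bound.
QUERIES_PER_DAY = 100_000_000
NOVEL_SHARE_PERCENT = (20, 25)

def novel_queries_per_day(total, percent):
    """Number of never-before-seen queries for a given percentage share."""
    return total * percent // 100

low, high = (novel_queries_per_day(QUERIES_PER_DAY, p) for p in NOVEL_SHARE_PERCENT)
print(f"At least {low:,} to {high:,} previously unseen queries per day")
```

That is on the order of 20 to 25 million queries each day for which no earlier click or result history exists.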
Analyzing the Web solely as a
graph ignores many of its
dynamics.
The study of Web dynamics
must take into account the
growth of the Web and how the
creation and use of new
applications affect its
dynamics and
architecture.
Modern sophisticated Web sites
provide powerful user-interface
functionality by running large
script systems within the
browser. These applications
access the underlying remote
data model through Web APIs.
New forms of global systems are
appearing, relying on the user’s
computing power and the massive
storage available on Web
servers for easy data retrieval.
User-generated content
sites
• Store personal information
• Security issues
• Awareness of public access to
information
• Advances in three-party
authentication protocols to
secure data (Trust)
Standard mathematical
analysis of the Web has
been inconclusive.
Linguistic analysis through
tagging may provide some
insight, however complex.
The dynamics of any “social
machine” are highly
complex, and dozens of
academic papers from
multiple disciplines have
been written about them.
The idea of a social machine
was introduced in Weaving
the Web, which conjectured
that the architectural design
of the Web would allow
developers, and thus end
users, to use computer
technology to help provide
the management function
for social systems as they
were realized online.
Examples of Social Machines
• Blogging
• MySpace
• Facebook
The Web of Data is an
emerging area of study
that involves the heavy
use of tagging provided by
many of what are known as
Web 2.0 technologies.
What is tagging?
Articles, blogs, photos, videos,
and all manner of other Web
resources may be annotated
with user-generated keywords,
or tags, that can later be used
for searching or browsing these
resources.
Tagging can enhance metadata to
explain content or objects being
described.
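A tagging system of this kind can be sketched as an index from tags to resources. The class name `TagIndex`, the resource identifiers, and the tags below are all invented for illustration; real tagging services also record who applied each tag and when.

```python
from collections import defaultdict

class TagIndex:
    """Map user-generated tags to the resources they annotate."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, resource, *tags):
        """Annotate a resource with one or more tags (case-insensitive)."""
        for label in tags:
            self._by_tag[label.strip().lower()].add(resource)

    def search(self, *tags):
        """Return the resources that carry all of the given tags."""
        sets = [self._by_tag.get(label.strip().lower(), set()) for label in tags]
        return set.intersection(*sets) if sets else set()

# Invented resources and tags.
index = TagIndex()
index.tag("photo-17", "Paris", "sunset")
index.tag("blog-3", "paris", "travel")

paris_items = index.search("paris")
paris_sunsets = index.search("Paris", "sunset")
```

The index treats "Paris" and "paris" as the same tag, but it cannot tell a city from a person's name; that limitation is the ambiguity problem.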
Ambiguous tags
Example: A tag used in a
specific social context may
be useful since it can
designate a particular
individual. The use of a tag
as metadata depends on
such a context, and the
“network effect”. The deeper
meaning of the tag…
Use of Metadata
The recent application of semantic
Web technologies
represents an important
paradigm shift that is a
significant element of emerging
Web technologies.
The semantic Web will allow
programmers and users alike to
refer to real-world objects
without concern for the
underlying documents in which
these things, abstract and
concrete, are described.
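The idea of referring to things rather than documents can be illustrated with subject-predicate-object statements, the general shape of data used by semantic Web technologies. The sketch below uses plain Python tuples, not an RDF library, and every URI in it is hypothetical.

```python
# Hypothetical identifiers for a person, an organization, and two properties.
ALICE = "http://example.org/id/alice"
ACME = "http://example.org/id/acme"
NAME = "http://example.org/vocab/name"
WORKS_FOR = "http://example.org/vocab/worksFor"

# Two separate documents make statements about the same real-world person.
statements_from_doc_a = {(ALICE, NAME, "Alice")}
statements_from_doc_b = {(ALICE, WORKS_FOR, ACME)}

def describe(statements, subject):
    """Collect everything known about a subject, regardless of source document."""
    return {predicate: obj for s, predicate, obj in statements if s == subject}

# Because both documents use the same identifier, their statements merge cleanly.
merged = statements_from_doc_a | statements_from_doc_b
about_alice = describe(merged, ALICE)
```

A program asking about `ALICE` never needs to know which document each fact came from; the shared identifier does the work.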
The semantic Web arena
reflects two principal
nexuses of activity. One
tends to focus on data (and
the Web), and the other on
the domain (semantics).