IR in P2P, relational data, OpenURL and full

LIS618 lecture 6
Thomas Krichel
• Google
– news
– interfaces to non-web sources
• Usenet
• relational databases
• OpenURL
• file sharing
Google news
• Is a gathering of top stories from news sites.
• The entire page is built by computer. Which
stories make it to the top depends on
– how prominently the stories appear on news sites
– which sites the stories appear on
– when the articles were published
– how many articles cover the same story
• Note the sidebar with stories on different topics.
special syntax for news I
• source: gives news from a source only
– example “source:cnn” works
– examples “source:bbc”, “source:nytimes”,
“source:"new york times"” don’t seem to get results
• location: gives a location. Can be a two-letter state code or a country
– “location:ny”
– “location:russia”
special syntax for news II
• allintitle: searches for words in the title of the
article (not of the page)
– example “allintitle: dead injured”
• allintext: searches for words in the text
– example: “allintext: saarland government”
• allinurl: searches in article URLs
– example: “allinurl:bbc Wales”
• Restrictions
– One “allin???” special syntax only.
– Must come first in the query.
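The two restrictions can be checked mechanically before a query is sent. The following is a minimal sketch, not Google's own parser: the function name `check_news_query` is hypothetical, and it only knows the three “allin” operators listed on this slide.

```python
ALLIN_OPERATORS = ("allintitle:", "allintext:", "allinurl:")

def check_news_query(query: str) -> bool:
    """Return True if the query respects both restrictions from the slide."""
    text = query.strip().lower()
    # Count every occurrence of an "allin" operator anywhere in the query.
    count = sum(text.count(op) for op in ALLIN_OPERATORS)
    if count == 0:
        return True       # no special syntax, nothing to check
    if count > 1:
        return False      # only one "allin" operator is allowed
    # The single operator must come first in the query.
    return text.startswith(ALLIN_OPERATORS)

print(check_news_query("allintitle: dead injured"))       # True
print(check_news_query("dead allintitle: injured"))       # False
print(check_news_query("allintitle: dead allinurl:bbc"))  # False
```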
Google interfaces to 3rd party data
• Google Groups is an interface to Usenet
news.
• Google directory is an interface to the
Open Directory Project.
• In both cases Google is dependent on the
quality of these underlying data sources.
Usenet news
• Usenet is a collection of user-submitted notes on various
subjects that are posted to servers on a worldwide
network. Each subject collection of posted notes is
known as a newsgroup.
• A newsgroup is a discussion about a particular subject
consisting of notes written to a networked site and
distributed through Usenet.
• Newsgroups are hierarchical. Hierarchical levels are
separated by dots, for example comp.text.tex.
• alt, news, info, biz, rec, comp, sci, humanities, soc, misc,
talk are classic world-wide top-level hierarchies.
– alt is short for “alternative”; the joke is that it stands for
anarchists, lunatics and terrorists.
Usenet history
• The idea of network news was born in 1979
when two graduate students, Tom Truscott and
Jim Ellis, thought of using UUCP to connect
machines for the purpose of information
exchange among users. They set up a small
network of three machines in North Carolina.
• UUCP is ``UNIX to UNIX copy'', a protocol that is
used to copy files between machines running
some flavor of UNIX, without the need for the IP
protocol. Usenet is older than the Internet.
decline of Usenet
• essentially open to all (peer-to-peer)
• used by spammers for
– posting
– gathering addresses
• steady decline of quality of contribution
• steady decline of quantity of contributions
Usenet worth checking out
• independent reviews of products, often
written by experts.
• Example: interpretation of Beethoven
sonatas by Wilhelm Kempff.
• Sorting by date reveals that the
is still active. On a good day, you will find
no finer guide to records.
special syntax for Google Groups
• group: limits postings to a certain group
• intitle: limits to titles of postings
• author: searches for author name or email
• Mixing syntaxes works well.
• Example: “intitle:kempff”
the open directory project
• “The Open Directory Project is the largest, most
comprehensive human-edited directory of the
Web. It is constructed and maintained by a vast,
global community of volunteer editors.”
• They claim that there is a historic precedent in the
Oxford English Dictionary.
• Formerly known as ``GnuHoo'', then ``NewHoo'',
then acquired by Netscape, and called ``dmoz''.
• dmoz is maintained by volunteer ``net-citizens''.
No special qualifications are required, but the
volunteers are claimed to be experts.
• There are about 30,000 volunteers (they claim).
• Powers the core directory services for the
Web's largest and most popular search
engines and portals:
– Netscape Search
– Google
– HotBot
– AOL Search
• Headquarters run by Netscape.
Appearance of ODP
• If Google finds a relevant category it puts it into
the result.
• Remember a Google response is a list of results.
• Each result has
– title
– snippet
• Some results have optionally a category
attached. Following such categories is a winner
if your information need is broad.
full-text databases
• These databases have an emphasis on
providing full-text information in a web
interface.
• Their particular strength is the aggregation
of material from a range of publishers.
• This especially concerns scholarly
publishing, where the source material is
distributed among a large number of
publishers.
• Some of the is arranged via the Brooklyn
LIU campus. We can use the on-campus
access here.
• The databases have some full-text, but not
a lot.
• go into the database selection, delete
everything and then use the research
• We can search for Paul Levine. It appears that
– not all articles have full-text
– there is no distinction between different Paul
Levines
• Otherwise it appears straightforward to use.
• ProQuest and EBSCO work as aggregators.
They put different scholarly journals together in
one database, so you don’t have to
deal with publishers’ different interfaces.
• Publishers are reluctant to join and impose
moving-wall embargoes on full-text release.
• So you cannot access the full text via
them. But your library may have the text.
the library as aggregator
• typically, a library buys holdings from a
publisher, as well as cross-publisher
abstract and indexing data.
• when users find a reference in an
abstract and want to access the full text,
they are stuck
• Herbert Van de Sompel has been working
on this problem.
special effects (SFX)
• Herbert’s idea was to equip the interface
with a special effects button.
• When users press the button, the interface
would transmit metadata such as
– author name
– journal name
– title
– date
to a special database, called a resolver.
• The resolver examines the metadata and
makes a decision on what to show to the
user
– if the journal is subscribed to and the date is
recent, it may formulate a query to the
publisher’s database and fetch the record
and/or full text there.
– if the journal is not held, suggest ILL
– etc…
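The resolver's decision can be pictured as a small rule table. This is only a sketch under stated assumptions: the `Citation` class, the `SUBSCRIPTIONS` holdings and the service labels are all made up for illustration, and the middle case (journal held, but the date is too old) stands in for the “etc…” above.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    author: str
    journal: str
    title: str
    year: int

# Hypothetical local holdings: journal name -> first year with online full text.
SUBSCRIPTIONS = {"Example Journal": 1997}

def resolve(citation: Citation) -> list[str]:
    """Return the extended services to offer for one citation."""
    first_year = SUBSCRIPTIONS.get(citation.journal)
    if first_year is None:
        # The journal is not held: suggest interlibrary loan.
        return ["interlibrary loan request"]
    if citation.year >= first_year:
        # Subscribed and recent enough: point to the publisher's copy.
        return ["full text at publisher"]
    # Assumption: older volumes of a held journal go to the catalogue.
    return ["check local catalogue"]

print(resolve(Citation("Levine", "Example Journal", "A title", 2003)))
print(resolve(Citation("Levine", "Other Journal", "A title", 2003)))
```

A librarian would change only the holdings data and the rules, not the interface that sends the metadata.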
configuring the resolver
• librarians, who know the local setting, will
configure the server so that users are
given the appropriate extended services
given the local circumstance.
• Note that what is returned is a set of
extended services, not the response to a
specific query.
Bison Futé model
• This refers to further work by Herbert to
generalize the idea.
– On a web page, you find a link. It has been
made by the provider of the web page.
– But this link may not be appropriate. There
may be better technology that allows you to
move in the same direction but with your own
– In other words we talk about context-sensitive
linking.
• This is now a draft standard with NISO to
standardize the special effects request.
• The OpenURL is a transport architecture
for context objects.
• Context objects unite descriptions of
– the reference found
– the context in which it was found
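A context object can be pictured as a set of key/value pairs carried in a link to the resolver. The sketch below is illustrative only: the resolver address is hypothetical, and the key names (`rft.`-prefixed keys for the reference, `rfr_id` for the service where it was found) follow the key/value style of OpenURL as I understand it, so they should be checked against the standard before use.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical resolver address; each library would run its own.
RESOLVER = "http://resolver.example.edu/menu"

def make_openurl(referent: dict, referrer_id: str) -> str:
    """Encode a reference and the place it was found into one link."""
    params = {"rft." + key: value for key, value in referent.items()}
    params["rfr_id"] = referrer_id  # the context: which service shows the reference
    return RESOLVER + "?" + urlencode(params)

link = make_openurl(
    {"aulast": "Levine", "jtitle": "Example Journal", "date": "2001"},
    "info:sid/example.org:abstracts",
)
print(link)

# The resolver reads the same fields back out of the request.
fields = parse_qs(urlsplit(link).query)
print(fields["rft.aulast"])  # ['Levine']
```

The link points at the user's own resolver, not at a fixed target, which is what makes it context-sensitive.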
implications for information retrieval
• The implications for the library world are
already important.
– much library system software already
implements OpenURLs and provides resolvers
• But impact could be wider and could cover
a whole new structure for the web,
replacing static links with on-the-fly
dynamic ones.
• Databases are collections of data with
some organization to them.
• The classic example is the relational
database.
• But not all databases need to be relational.
Relational databases
• A relational database is a set of tables.
There may be relations between the
tables.
• Each table has a number of records. Each
record has a number of fields.
• When the database is being set up, we fix
– the size of each field
– relationships between tables
Example: Movie database
| title | director | date |
| Gone with the wind | F. Ford Coppola | 1963 |
| Room with a view | Coppola, F Ford | 1985 |
| High Noon | Woody Allan | 1974 |
| Star Wars | Steve Spielberg | 1993 |
| Alien | Allen, Woody | 1987 |
| Blowing in the Wind | Spielberg, Steven | 1962 |
• Single table
• No relations between tables, of course
Problem with this database
• I made up all the data. It is just for
illustration.
• Names are recorded inconsistently. There is no
way to find films by Woody Allan without
having to go through all spelling variations.
• Mistakes are difficult to correct. We have
to wade through all records, a masochist’s
Better movie database
| title | director | year |
| Gone with the wind | D1 | 1963 |
| Room with a view | D1 | 1985 |
| High Noon | D2 | 1974 |
| Star Wars | D3 | 1993 |
| Alien | D2 | 1987 |
| Blowing in the Wind | D3 | 1962 |

| id | director name | birth year |
| D1 | Ford Coppola, Francis | 1942 |
| D2 | Allan, Woody | 1957 |
| D3 | Spielberg, Steven | 1942 |
Relational database
• We have a one-to-many relationship between
directors and films
– Each film has one director
– Each director may have directed many films
• Here it becomes possible for the computer, and
then the user
– To know which films have been directed by Woody
Allan
– To find which films have been directed by a director
born in 1942
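Both questions become short queries once the two tables are linked. Here is a minimal sketch using Python's built-in `sqlite3` module and the made-up data of the better movie database. The table and column names are my own choice, and pairing the ids D1 to D3 with the directors in the order they are listed is an assumption.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE film (title TEXT, director_id TEXT REFERENCES director(id), year INTEGER);
""")
db.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
db.executemany("INSERT INTO film VALUES (?, ?, ?)", [
    ("Gone with the wind", "D1", 1963),
    ("Room with a view", "D1", 1985),
    ("High Noon", "D2", 1974),
    ("Star Wars", "D3", 1993),
    ("Alien", "D2", 1987),
    ("Blowing in the Wind", "D3", 1962),
])

def films_by_director(name: str) -> list[str]:
    """Films of one director; the name is spelled in exactly one place."""
    rows = db.execute(
        "SELECT film.title FROM film JOIN director ON film.director_id = director.id "
        "WHERE director.name = ? ORDER BY film.title", (name,))
    return [title for (title,) in rows]

def films_by_birth_year(year: int) -> list[str]:
    """Films whose director was born in the given year."""
    rows = db.execute(
        "SELECT film.title FROM film JOIN director ON film.director_id = director.id "
        "WHERE director.birth_year = ? ORDER BY film.title", (year,))
    return [title for (title,) in rows]

print(films_by_director("Allan, Woody"))  # ['Alien', 'High Noon']
print(films_by_birth_year(1942))
```

Correcting a misspelled director name now means changing one row in the director table instead of every film record.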
Many-to-many relationships
• Each film has one director, but many
actors star in it. The relationship between
actors and films is a many-to-many relationship.
• Here are a few actors
| sex | actor name | birth year |
|  | Brigitte Bardot | 1972 |
|  | George Clooney | 1927 |
|  | Marilyn Monroe | 1934 |
Actor/Movie table
| actor id | movie id |
|  | M4 |
|  | M3 |
|  | M2 |
|  | M5 |
|  | M3 |
|  | M6 |
|  | M4 |
… as many lines as required
• Once we have the relational database, we
can ask sophisticated questions:
– Which director has had the most female
actors working for him?
– In which years were films shot that
starred actors born between 1926 and 1935?
• Such questions can be encoded in a
language known as “structured query
language” or SQL. All relational database
vendors implement a dialect of SQL.
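As an illustration, the second question can be written in SQL against a link table that holds one row per actor/film pair. This is a sketch under loud assumptions: the actor ids A1 to A3, the numbering of the films as M1 to M6 in the order listed earlier, and the pairing of actors with films are all hypothetical, since the lecture data is made up anyway. It uses Python's built-in `sqlite3` module.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE film  (id TEXT PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE actor (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
-- One row per (actor, film) pair: this table carries the many-to-many relationship.
CREATE TABLE actor_film (actor_id TEXT REFERENCES actor(id),
                         film_id  TEXT REFERENCES film(id));
""")
db.executemany("INSERT INTO film VALUES (?, ?, ?)", [
    ("M1", "Gone with the wind", 1963), ("M2", "Room with a view", 1985),
    ("M3", "High Noon", 1974), ("M4", "Star Wars", 1993),
    ("M5", "Alien", 1987), ("M6", "Blowing in the Wind", 1962),
])
db.executemany("INSERT INTO actor VALUES (?, ?, ?)", [
    ("A1", "Brigitte Bardot", 1972),
    ("A2", "George Clooney", 1927),
    ("A3", "Marilyn Monroe", 1934),
])
# Hypothetical casting, for illustration only.
db.executemany("INSERT INTO actor_film VALUES (?, ?)", [
    ("A1", "M4"), ("A1", "M3"), ("A2", "M2"), ("A2", "M5"),
    ("A3", "M3"), ("A3", "M6"), ("A3", "M4"),
])

def film_years_for_actors_born_between(first: int, last: int) -> list[int]:
    """Years of films starring at least one actor born in the given range."""
    rows = db.execute(
        "SELECT DISTINCT film.year FROM film "
        "JOIN actor_film ON actor_film.film_id = film.id "
        "JOIN actor ON actor.id = actor_film.actor_id "
        "WHERE actor.birth_year BETWEEN ? AND ? ORDER BY film.year",
        (first, last))
    return [year for (year,) in rows]

print(film_years_for_actors_born_between(1926, 1935))
# [1962, 1974, 1985, 1987, 1993]
```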
importance of relational databases
• Relational databases dominate the world of
structured information. Examples
– employment and payroll in a company
– stock management
– e-commerce
• There are quite easy ways to get relational
databases to work with web interfaces.
Some are freely available. The most
common one is the LAMP (Linux Apache
MySQL PHP) architecture.
relational databases in libraries
• A 2004 enquiry on the LITA revealed that what many
respondents regretted most was not
having learned more about relational databases
in library school.
• But there are problems with relational databases
in libraries
– Slow on very large databases (such as catalogs)
– Library data has nasty ad-hoc relationships, e.g.
• Translation of the first edition of a book
• CD supplement that comes with the print version
These are difficult to deal with in a system where all relations and
fields have to be set up at the start and cannot be
changed easily later.
off-web Internet information
• Under this heading, I principally think
about activities known as ‘file-sharing’.
• They concern the (mostly illegal)
exchange of files between users. Such
files may encode
– music
– films
• There is a lot of it going on, but we are not
sure how much.
• Napster was the first prominent file-sharing service.
• Napster ran a central server. You
connected to that server and announced
what files you had to share.
• Every search was conducted on the
dataset assembled at the central server.
• Connections to download files were done
between peer machines only.
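The division of labour can be sketched in a few lines: the central server only keeps a list of who has which file names, and answers searches from that list. The class name, the peer addresses and the file names below are hypothetical; this is a toy model, not Napster's actual protocol.

```python
from collections import defaultdict

class CentralIndex:
    """Toy stand-in for the central server: it stores file names, never files."""

    def __init__(self):
        self.files_by_peer = defaultdict(set)

    def announce(self, peer: str, file_names: list[str]) -> None:
        """A peer connects and announces what it has to share."""
        self.files_by_peer[peer].update(file_names)

    def search(self, term: str) -> list[tuple[str, str]]:
        """Search the assembled dataset; return (peer, file name) pairs."""
        term = term.lower()
        return sorted(
            (peer, name)
            for peer, names in self.files_by_peer.items()
            for name in names
            if term in name.lower()
        )

index = CentralIndex()
index.announce("10.0.0.5", ["kempff_sonata_14.mp3", "notes.txt"])
index.announce("10.0.0.9", ["sonata_21.mp3"])

hits = index.search("sonata")
print(hits)
# [('10.0.0.5', 'kempff_sonata_14.mp3'), ('10.0.0.9', 'sonata_21.mp3')]
# The download itself would then go directly from the listed peer, not via the index.
```

If the index goes away, no peer can find any other peer, which is why the network depended on the central machine.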
end of Napster
• Napster argued that since it was only involved in
collecting the information about files available, it
was legal.
• Napster never shared any illegal file.
• The courts thought otherwise.
• It was shut down.
• Napster network died without a central machine.
• To enable true piracy, we need a truly distributed
gnutella protocol
• This protocol underlies much of the current
file-sharing activity on the Internet.
• It enables a peer-to-peer network between
machines. There, every machine is both a client
and a server, and is called a “servent”.
• To connect to a gnutella network, you
need the IP address of one single machine
that is already part of the network.
connection to the gnutella network
• Once you establish connection to the first
servent, you announce your presence.
• The first servent will pass on that message
to all the servents that it is connected to,
and so on.
• This quickly adds up to a lot of traffic!
time to live
• Every gnutella message has a time to live
(TTL). It is decremented every time the message
passes through a servent.
• The TTL is usually quite small. It can be
arbitrarily reduced by servents.
• Therefore you only talk to servents that
are close to you. But your software will
determine which servents to try to contact
first. That usually depends on previous
query results.
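The effect of the TTL can be seen in a small simulation. This is a sketch, not the gnutella wire protocol: the overlay in `NEIGHBOURS` is made up, and dropping a message that a servent has already seen is a simplification I add so that the flood terminates cleanly.

```python
from collections import deque

# Hypothetical overlay: each servent and the servents it is connected to.
NEIGHBOURS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D", "F"],
    "F": ["E"],
}

def flood(origin: str, ttl: int) -> set[str]:
    """Return the servents reached by a message sent from origin with this TTL."""
    reached = {origin}
    queue = deque((neighbour, ttl) for neighbour in NEIGHBOURS[origin])
    while queue:
        servent, remaining = queue.popleft()
        if servent in reached:
            continue              # already seen this message: drop the duplicate
        reached.add(servent)
        remaining -= 1            # decremented at every servent the message passes
        if remaining > 0:         # a message whose TTL has run out is not passed on
            queue.extend((n, remaining) for n in NEIGHBOURS[servent])
    return reached - {origin}

print(sorted(flood("A", 1)))  # ['B', 'C']
print(sorted(flood("A", 2)))  # ['B', 'C', 'D']
print(sorted(flood("A", 4)))  # ['B', 'C', 'D', 'E', 'F']
```

With a small TTL the message never reaches E or F, which is the sense in which you only talk to servents close to you.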
• When you do a search, it is passed on
from servent to servent through the p2p
network.
• Servents have their own rules for how to
respond to queries.
– Most of the time search strings are matched
against a file name.
– Some may try to match against the directory
– Some general queries may be rejected.
– Some result sets may be truncated.
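One servent's rules for answering a query might look like the sketch below. Everything here is hypothetical (the function name, the parameters, their default values and the shared paths); real servents each make their own choices.

```python
import posixpath

def answer_query(query: str, shared_paths: list[str],
                 match_directory: bool = False,
                 min_query_length: int = 3,
                 max_results: int = 2) -> list[str]:
    """Answer a search the way one particular servent might."""
    words = query.lower().split()
    # Reject queries that are too general (here: too few characters overall).
    if sum(len(word) for word in words) < min_query_length:
        return []
    hits = []
    for path in shared_paths:
        # Match against the file name only, unless this servent also looks at directories.
        target = path if match_directory else posixpath.basename(path)
        if all(word in target.lower() for word in words):
            hits.append(path)
    return hits[:max_results]  # truncate large result sets

shared = [
    "music/kempff/sonata14.mp3",
    "music/kempff/sonata21.mp3",
    "music/kempff/sonata23.mp3",
    "docs/readme.txt",
]
print(answer_query("sonata", shared))   # first two matches only
print(answer_query("kempff", shared))   # [] because directories are not matched
print(answer_query("kempff", shared, match_directory=True))
```

The same search can therefore return different results depending on which servents it happens to reach.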
• If you see a file that you would like to have, you
can try to download it.
• To implement downloads the servents use
http. Thus everyone who is connected to a
file sharing network runs a web server!
• However, there usually is a tight limit on
how many downloads a server will accept.
• Modern servents have the ability to
download from several servents.
ease to infringe
• Clearly all the traffic on gnutella, with
current technology, can be observed.
• But the infringement is so massive that it
appears difficult to clamp down on.
• The ease of infringing is technological.
• The RIAA has sued. They reach only the tip
of the iceberg, in the hope of dissuading others.
Thank you for your attention!