Transcript Google 2

LIS618 lecture 10
Thomas Krichel
2003-04-23
Structure
• some repeats from last week
• other special syntaxes
• usenet news in google
• open directory project in google
query language II
• * is a wildcard for any word
• +stopword requires the presence of the
stop word stopword. However, the list of stop
words has not been published.
• In fact, it varies from query to query
• There is a limit of 10 words, but a * does
not count towards the limit
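The word limit can be checked before a query is submitted. The following is a minimal sketch in Python, assuming that terms are separated by whitespace and that only a standalone * is exempt from the count. The function names and the default limit of 10 are illustrative; how Google itself splits a query into words is not published.

```python
def words_counted(query, wildcard="*"):
    """Count the words in a query that count towards the limit.

    Assumption: terms are separated by whitespace, and a standalone
    wildcard is the only token that is not counted.
    """
    return sum(1 for token in query.split() if token != wildcard)


def within_limit(query, limit=10):
    """Return True if the query stays within the word limit."""
    return words_counted(query) <= limit


print(words_counted("directory * * trees"))   # 2
print(within_limit("a b c d e f g h i j k"))  # False, 11 words
```

Under these assumptions, wildcards can pad a phrase without using up the word budget.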
special syntax I
• intitle: find in title only, "intitle:google"
• intext: find in text only. This will exclude
occurrences of the search term in anchor or
title data. "intext:html"
• inanchor: finds pages that another page
links to with the anchor text given in the
query. example:
inanchor:"a list of my courses" finds my
courses page because there is a link to it
with that text
special syntax II
• cache: shows a page as it is stored in the
Google cache, useful if the query result has
nothing to do with the query terms
cache:openlib.org/home/krichel will show
the cached version of the page.
• If you add further terms, they will be
highlighted.
daterange: special syntax
• limits the search to pages indexed
within a range of dates. When the crawler
visits a page, a changed page is reindexed,
an unchanged page is not.
• dates are expressed as Julian day numbers,
i.e. the number of days since noon UTC on
1 January 4713 BC of the Julian calendar.
Today, 2003-04-23, is 2452753
• example: daterange:2452640-2452739
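Julian day numbers are awkward to work out by hand, so a small conversion helps. The sketch below uses Python's standard datetime module; the function names are illustrative, and it assumes that daterange: expects whole Julian day numbers for ordinary Gregorian calendar dates. The range in the example above corresponds to 2002-12-31 through 2003-04-09.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (0001-01-01 is day 1)
# and the Julian day number that begins at noon of the same civil date.
JDN_OFFSET = 1721425


def julian_day_number(d):
    """Return the integer Julian day number for a calendar date."""
    return d.toordinal() + JDN_OFFSET


def daterange(start, end):
    """Build a daterange: term from two calendar dates."""
    return "daterange:%d-%d" % (julian_day_number(start),
                                julian_day_number(end))


print(daterange(date(2002, 12, 31), date(2003, 4, 9)))
# daterange:2452640-2452739
```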
mixing special syntax expressions
• The link: syntax does not mix with others.
• Other bad ideas:
– "site:openlib.org -inurl:openlib"
– "site:edu site:com"
• Things that work well:
– intitle:search
– intitle:biology inurl:help
Examples
• George Bush site:nytimes.com
• "Copyright * The New York Times"
"George Bush"
• intitle:"directory * * trees"
• Botany intitle:"directory of" site:edu
• "powered by blogger" OR site:blogspot.com
• "classical music" (inurl:mailman |
inurl:listserv)
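Queries like these can also be assembled programmatically. The following Python sketch joins query parts and URL-encodes them with the standard library. The helper names are illustrative, and the base URL with a q parameter is an assumption about the search form, not something taken from the slides.

```python
from urllib.parse import quote_plus

# Assumption: the search form takes the query in the q parameter.
SEARCH_URL = "http://www.google.com/search?q="


def any_of(*terms):
    """Combine alternatives with |, wrapped in parentheses."""
    return "(" + " | ".join(terms) + ")"


def search_url(*parts):
    """Join query parts with spaces and return a URL-encoded search URL."""
    query = " ".join(parts)
    return SEARCH_URL + quote_plus(query)


query_parts = ['"classical music"',
               any_of("inurl:mailman", "inurl:listserv")]
print(" ".join(query_parts))
# "classical music" (inurl:mailman | inurl:listserv)
print(search_url(*query_parts))
```

Encoding matters because quotes, colons, parentheses and the | sign are not safe to paste into a URL unescaped.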
phonebook: special syntax
• also rphonebook: for residential and
bphonebook: for businesses
• A location seems to be required; compare
phonebook:long island university
phonebook:long island university ny
• There is no support for
– wildcards
– exclusions
– OR
stocks on google
• stocks:ticker looks up the ticker symbol
ticker at http://finance.yahoo.com
• you can find ticker symbols there
• ticker symbols are useful to find financial
information about publicly traded
companies.
google images
• it has the following special syntaxes
– intitle: searches for images on a page with a
given title, "intitle: long island university"
– inurl: searches for images in pages that have
a certain url, inurl:liu.edu
– site: restricts the search to a certain site,
should be combined with a search term like
"site:liu.edu koenig"
Google interfaces to 3rd party data
• Google groups are an interface to usenet
news
• Google directory is an interface to the
Open Directory Project.
• In both cases Google is dependent on the
quality of these underlying data sources.
usenet news
• Usenet is a collection of user-submitted notes on
various subjects that are posted to servers on a
worldwide network. Each subject collection of
posted notes is known as a newsgroup.
• A newsgroup is a discussion about a particular
subject consisting of notes written to a
networked site and distributed through Usenet.
• Newsgroups are hierarchical. Hierarchical
levels are separated by dots, for example
comp.text.tex
• alt stands for "alternative"; the joke expansion is
anarchists, lunatics and terrorists.
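The dotted hierarchy of newsgroup names can be unpacked mechanically. A small Python sketch, with an illustrative function name, that lists each level of a group name from the top down:

```python
def hierarchy_levels(newsgroup):
    """Return the hierarchy prefixes of a newsgroup name, top level first."""
    parts = newsgroup.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


print(hierarchy_levels("comp.text.tex"))
# ['comp', 'comp.text', 'comp.text.tex']
```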
usenet history
• The idea of network news was born in 1979
when two graduate students, Tom Truscott and
Jim Ellis, thought of using UUCP to connect
machines for the purpose of information
exchange among users. They set up a small
network of three machines in North Carolina.
• UUCP is ``UNIX to UNIX copy'', a protocol that is
used to copy files between machines running
some flavor of UNIX, without the need for the
Internet Protocol (IP). Usenet is older than the Internet.
decline of usenet
• essentially open to all (peer-to-peer
system)
• used by spammers for
– posting
– gathering addresses
• steady decline of quality of contributions
• steady decline of quantity of contributions
usenet worth checking out
• independent reviews of products, often
written by experts.
• Example: interpretation of Beethoven
sonatas by Wilhelm Kempff.
• Sorting by date reveals that the
newsgroup rec.music.classical.recordings
is still active. On a good day, you will find
no finer guide to records.
special syntax for usenet
• group: limits the search to postings in a certain group
• title: limits to titles of postings
• author: searches for author name or email
address
• Mixing syntaxes works well
the open directory project
• "The Open Directory Project is the largest, most
comprehensive human-edited directory of the
Web. It is constructed and maintained by a vast,
global community of volunteer editors."
• The project claims a historical precedent in the
Oxford English Dictionary, which was also compiled
with the help of volunteers.
• Formerly known as ``GnuHoo'', then ``NewHoo'',
then acquired by Netscape, and called ``dmoz''.
dmoz.org
• dmoz is maintained by volunteer ``net-citizens''.
No special qualifications are required, but the
editors are claimed to be experts.
• There are about 30,000 volunteers (they claim).
• Powers the core directory services for the
Web's largest and most popular search
engines and portals
– Netscape Search
– Google
– HotBot
– AOL Search
– Lycos
– DirectHit
• Headquarters run by Netscape
http://openlib.org/home/krichel
Thank you for your attention!