The Evolving Internet: Some
Implications, Strategies, and
Techniques for More Effective
Research
MSU Product Center
September 26, 2007
Professor Larry G. Hamm
Presentation Outline
• Introduction
• Search Engine Basics
• Business Search with Google
• News Search
• Social Search
• Basic Information Trapping
• The Future??
QUESTIONS?
• Who is Tim Berners-Lee?
• What happened for “research” in 1990?
Current Number of Websites
July 2007 - 489,774,269
Top Global Web Properties Ranked by Total Unique Visitors (000)*
June 2007
Total Worldwide, Age 15+ - Home and Work Locations
Property                          Number (000's)   Percent Reach
Total Unique Internet Visitors    778,310          100%
Google Sites                      544,783          70
Microsoft Sites                   529,155          68
Yahoo! Sites                      471,924          61
Time Warner Network               266,367          34
eBay                              264,732          34
Wikipedia Sites                   208,120          27
Fox Interactive Media             163,545          21
Amazon Sites                      145,947          19
Apple Inc.                        123,554          16
Adobe Sites                       121,966          16
CNET Networks                     116,579          15
Ask Network                       115,655          15
Viacom Digital                    88,654           11
Lycos Sites                       77,517           10
The Mozilla Organization          70,850           9
Share of Online Searches by Engine
August 2007
Total U.S. Home, Work and University Internet Users
Source: comScore qSearch
Search Engine               Share of Searches, Aug 07 (%)
Total Internet Population   100%
Google Sites                56.5
Yahoo! Sites                23.3
Microsoft Sites             11.3
Ask Network                 4.5
Time Warner Network         4.5
* Excludes traffic from public computers such as Internet cafes or access from mobile phones or PDAs.
Share of Online Searches by Engine
August 2007
Total U.S. Home, Work and University Internet Users
Source: comScore qSearch
Search Engine               Searches, Aug 07 (Million)
Total Number of Searches    9820
Google Sites                5545
Yahoo! Sites                2290
Microsoft Sites             1106
Ask Network                 438
Time Warner Network         441
* Excludes traffic from public computers such as Internet cafes or access
from mobile phones or PDAs.
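The share percentages and the search counts on these two slides can be cross-checked with simple arithmetic. The sketch below is only an illustration: the figures are copied from the slides, while the function and variable names are assumptions introduced here.

```python
# Search counts (millions) copied from the August 2007 comScore qSearch slide.
searches_millions = {
    "Google Sites": 5545,
    "Yahoo! Sites": 2290,
    "Microsoft Sites": 1106,
    "Ask Network": 438,
    "Time Warner Network": 441,
}
TOTAL_SEARCHES_MILLIONS = 9820


def share_percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {
    engine: share_percent(count, TOTAL_SEARCHES_MILLIONS)
    for engine, count in searches_millions.items()
}

for engine, share in shares.items():
    print(f"{engine}: {share}%")
```

Ask Network and Time Warner Network differ by 3 million searches, yet both round to the same one-decimal share, which is why the percentage slide shows them as equal.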
Herbert Simon, Nobel Prize Economist:
“What information consumes is rather
obvious: it consumes the attention of
its recipients. Hence a wealth of
information creates a poverty of
attention”
SOURCE: “Designing Organizations for an Information-Rich World,” in Donald
M. Lamberton, ed., The Economics of Communication and Information
(Cheltenham, England: Edward Elgar, 1997).
The Source of Power?
• Knowledge is no longer the “scarce”
resource.
• Attention is the “limiting factor”!
• Implications:
– Global--- Decisions on what is brought into
global consciousness
– Research --- Discipline to direct and control
your attention
The Role of ATTENTION
THEREFORE:
• “The most important function of attention is
not taking information in, but screening it
out.”
Introduction
The Meaning of Relevance
Definition: The degree to which a search record (piece
of information) meets the researchers’ query.
• PROBLEM - Relevance to Search Engine and
Researcher Are DIFFERENT
• To a researcher: Does the result help answer the intent
of the query?
• To a Search Engine: Does the result meet the search
engine’s ranking algorithm?
Summary and Conclusion
• Precision searching requires the process of consciously
narrowing and eliminating the gap between
researcher’s and search engine’s RELEVANCY
• Knowledge of the search process and the
characteristics of information sources are required to
attack search engine relevance.
• Intuition is required by the researcher to focus on
formulating the search statements.
Search Engine Basics
• The Invisible versus the Visible Web
• Defining and Identifying Search Engines
• How Search Engines Work
• Why Google?
The Invisible Web
• Great amounts of information exist that are not
accessible via internet search engines
• Much was formatted digitally but not ‘indexed’
(see later lecture)
• “Google Books” project is the grandest attempt
to date to ‘shrink’ the invisible web.
• Invisible Web information is differentiated by:
– ACCESS
– MODE of creation
The Invisible Web
(continued)
• Information is differentiated by the nature
of ACCESS to it:
– 1. Publicly available --- Libraries
– 2. Semi-public --- ‘Private’ Libraries, e.g.
MSU Libraries
– 3. Private data --- Only available for
purchase or through reciprocity
The Invisible Web
(continued)
Types of Private Data –
• Private data sets open to anyone with a
checkbook (Mintel)
• Restricted private data sets --- to contributors
(Trade Association)
• Proprietary data of individual firms/public
institutions (Freedom of Information Act)
• Spy data (commercial and public)
• Private data interfaces with ‘Searchable’ data
when private data firms use “free sample” or
“versioning” marketing strategies
The Invisible Web
(continued)
Differentiated by MODE of creation:
• PRIMARY versus SECONDARY Data
– Primary Data is data collected/generated through
direct observation, survey, or poll
– Secondary Data is data that is ‘repackaged’ primary
data
– Secondary data results from an ‘editing’ process
– Evaluating secondary data requires an identification
and evaluation of the base source(s)
• Always go to the “ORIGINAL SOURCE”!!!
The Visible Web
Defining and Identifying Search Engines
What is a search engine?
• Definition – A search engine is an enormous
database of websites compiled by a software
robot that seeks out and indexes websites.
How does it work?
• Sends a ‘spider’ or ‘crawler’ to visit a Web
page and find the information on the page.
• The ‘crawler’ then sends its “finds” to an
indexer, which takes every word on a Web
page, logs it, categorizes it, and then stores the
results in a huge database.
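The crawl-then-index idea can be illustrated with a toy inverted index. This is a minimal sketch under stated assumptions, not how any real search engine is built: the pages are hypothetical stand-ins for what a crawler would fetch, and `tokenize` and `build_index` are names invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for what a crawler would bring back.
PAGES = {
    "example.com/apples": "Michigan apples and apple cider",
    "example.com/cherries": "Michigan tart cherries",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


index = build_index(PAGES)
print(sorted(index["michigan"]))  # both pages contain the word
print(sorted(index["cider"]))     # only the apples page does
```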
Defining and Identifying Search Engines
What types of search engines exist?
www.searchengineshowdown.com
www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html
• General All Purpose Search Engines (Big 4) –
Google; YahooSearch; Live.com; Ask.com
• Metasearch Engines – Search engines that search
other search engines (S.E. ‘bot’) –
• www.dogpile.com
• www.clusty.com
• www.kartoo.com
Defining and Identifying Search Engines
What types of search engines exist? (continued)
Specialized Search Engines (Vertical Search
Engines) – Search engines dedicated for
specific subject areas or specific purposes. For
research: www.lii.org
• “Customized Search Engine” – Now anyone
can create one: www.google.com/coop/cse/
See www.customsearchguide.com
The Visible Web
How Search Engines Work
Search Engine – RANKING ALGORITHMS
• WHAT? – Ranking Algorithms are used to ORDER the
search results
• WHY DOES ORDER MATTER? Answer - ATTENTION
because the researcher wants ‘help’ in deciding
relevance for the searcher's needs
• HOW? - Most ranking algorithms are and continue to
be ordered by the frequency of use of the searched
“WORDS”
• Google created a new addition to their Ranking
Algorithm
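Ordering results by how often the searched words appear can be sketched in a few lines. This is a simplified illustration, assuming a plain word count as the score; real ranking algorithms weigh many more signals, and the documents and function names below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical documents keyed by a short page name.
DOCS = {
    "a": "apple pie apple tart apple",
    "b": "apple orchard tours",
    "c": "cherry festival",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequency_score(query, text):
    """Count how many times the query words occur in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[word] for word in tokenize(query))


def rank(query, docs):
    """Return page names with a non-zero score, highest score first."""
    scored = [(frequency_score(query, text), name) for name, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


results = rank("apple", DOCS)
print(results)
```

Page "a" comes first because it repeats the searched word most often, which also shows why pure frequency ranking is easy to game.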
The Visible Web
How Google Works
1. The web server sends the query to the index servers. The
content inside the index servers is similar to the index in the back of
a book - it tells which pages contain the words that match the
query.
2. The query travels to the doc servers, which actually retrieve the
stored documents. Snippets are generated to describe each search
result.
3. The search results are returned to the user in a fraction
of a second.
The Visible Web
Conclusion
An Overview of a Basic Search
• Be very proficient with ONE search engine
• Remember because of different software
approaches and indexing, NO TWO SEARCH
ENGINES WILL PRODUCE THE SAME
RESULTS
• When very focused and search is narrowed,
identify and use other specific engines
• Should the “Product Center” create their
own?
Business Search with Google
• Translating Web Language
• Underlying Search Logic
• Understanding Google Search Features
• Conclusion
Translating Web Language
Reading URL’s – Uniform Resource Locator
• This is the Web site’s address, i.e. where a Web site lives
• Example:
http://online.wsj.com/article/SB114609925357637113.html
• http: - Transfer Protocol (hypertext)
– the way the information is transferred on the Web.
– HTML – Hypertext Markup Language is current Web
language
– XML (eXtensible Markup Language) is coming as the
vehicle for information trapping
Translating Web Language
(continued)
Reading URL’s (continued)
online.wsj.com (domain name) of
the server
• Domain Suffix (com) – Perhaps the first and most
important thing to examine
– Assigned by ICANN – Internet Corporation for
Assigned Names and Numbers – www.icann.org
– Country Codes (.uk) follow domain suffix
– (.us) not used by most U.S. sites except with
state/local government sites.
– Current Issues?
Translating Web Language
(continued)
Reading URL’s (continued)
Common DOMAIN SUFFIXES
• .com - commercial site
• .edu - educational institution
• .gov - government agency in the U.S.
• .net - network with most assigned to ISP networks
• .org - non-profit/non-commercial organization
(Caution: many companies are setting up “non-profits” to get
.org domain suffixes to disguise their agendas)
• OTHERS - .mil, .biz, .info, .coop, .pro
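Reading a URL piece by piece can also be done programmatically. The sketch below uses Python's standard `urllib.parse` module on the example address from the slides. The `describe_url` helper and the `SUFFIX_MEANINGS` table are assumptions made for illustration, with wording taken from the suffix list above.

```python
from urllib.parse import urlparse

# Meanings paraphrased from the domain suffix list in the slides.
SUFFIX_MEANINGS = {
    "com": "commercial site",
    "edu": "educational institution",
    "gov": "government agency in the U.S.",
    "net": "network, mostly ISP networks",
    "org": "non-profit/non-commercial organization",
}


def describe_url(url):
    """Break a URL into protocol, domain name, domain suffix, and path."""
    parts = urlparse(url)
    host = parts.hostname or ""
    suffix = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "protocol": parts.scheme,
        "domain": host,
        "suffix": suffix,
        "meaning": SUFFIX_MEANINGS.get(suffix, "other suffix or country code"),
        "path": parts.path,
    }


info = describe_url("http://online.wsj.com/article/SB114609925357637113.html")
for key, value in info.items():
    print(f"{key}: {value}")
```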
Underlying Search Logic
Boolean Logic Searches
•
Definition - Use of mathematical set
theory to retrieve search information.
• AND, OR, and NOT searches
• See following Venn diagrams:
Underlying Search Logic
(continued)
Boolean Logic Searches -
AND
Underlying Search Logic
(continued)
Boolean Logic Searches -
OR
Underlying Search Logic
(continued)
Boolean Logic Searches -
NOT
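The three Boolean operations correspond directly to set intersection, union, and difference. The sketch below assumes a tiny hypothetical index that maps each word to the set of pages containing it. The page names and function names are invented for this example.

```python
# Hypothetical index: word -> set of pages that contain the word.
INDEX = {
    "organic": {"page1", "page2", "page3"},
    "apples": {"page2", "page3", "page4"},
    "juice": {"page3", "page5"},
}


def pages_for(word):
    """Return the set of pages containing the word (empty if unknown)."""
    return INDEX.get(word, set())


def search_and(a, b):
    """Pages containing BOTH words (intersection)."""
    return pages_for(a) & pages_for(b)


def search_or(a, b):
    """Pages containing EITHER word (union)."""
    return pages_for(a) | pages_for(b)


def search_not(a, b):
    """Pages containing the first word but NOT the second (difference)."""
    return pages_for(a) - pages_for(b)


print(sorted(search_and("organic", "apples")))
print(sorted(search_or("organic", "apples")))
print(sorted(search_not("organic", "juice")))
```

AND narrows the result set, OR widens it, and NOT removes pages, which is the same behavior a Venn diagram of two overlapping circles would show.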
The Visible Web
Why Google?
Google Has Two Basic Strengths Over Other
Search Engines
• Popularity Ranking
• Number of and Breadth of Features
The Visible Web
Why Google? (continued)
“Popularity” Ranking – “The Google
Creation”
• A page’s ranking includes a score for how
many “other pages” link to it, i.e. how ‘popular’
it is with other Web sites
• This is done on multiple levels. For example: If
page X and Y both have 100 pages linked to
them, but the 100 Y pages have more links to
them than do the 100 X pages, Y gets a higher
score for ranking
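The X versus Y example can be made concrete with a two-level link count. This is a deliberately simplified sketch, not Google's actual algorithm: it assumes a score in which each linking page contributes one point plus one point for every page that links to it. The link graph and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical link graph: page -> pages it links to.
# X and Y each have three inbound links, but one of Y's linkers (y1)
# is itself linked to by two other pages.
LINKS = {
    "x1": ["X"], "x2": ["X"], "x3": ["X"],
    "y1": ["Y"], "y2": ["Y"], "y3": ["Y"],
    "z1": ["y1"], "z2": ["y1"],
}


def inbound_links(links):
    """Invert the graph: page -> set of pages that link to it."""
    inbound = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            inbound[target].add(source)
    return inbound


def popularity(page, links):
    """Assumed two-level score: each linker counts 1 plus its own inbound links."""
    inbound = inbound_links(links)
    return sum(1 + len(inbound.get(linker, set())) for linker in inbound.get(page, set()))


score_x = popularity("X", LINKS)
score_y = popularity("Y", LINKS)
print("X:", score_x)
print("Y:", score_y)
```

Both pages have the same number of direct links, yet Y scores higher because its linking pages are themselves more linked to.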
The Visible Web
Why Google? (continued)
“Popularity” Ranking – “The Google Creation”
(continued)
THE UNDERLYING ASSUMPTION:
• A Web page that has more pages linked
indirectly (like a pyramid scheme) to it implies
that more pages find it relevant, implying that it
will be more relevant to you.
• Analogy – Your popularity is ranked within high
school by how many friends your friends have
and how many friends those friends have and
so on.
The Visible Web
Why Google? (continued)
“Popularity” Ranking – “The Google Creation” (continued)
“THE GOOGLE BIAS”
• New pages won’t have as many links as
established pages; therefore a lower ranking.
• Analogy: New friends might be better than the
old friends.
The Visible Web
Why Google? (continued)
Google’s Breadth of Features
• Home Page Features – One of the
Cleanest/Clutter Free Pages
• Advanced Search Features
• Business research useful features are
highlighted here
The Visible Web
Why Google? (continued)
Google’s Advanced Search Features
• Advanced features allow searchers to narrow their
queries to very specific searches
• Narrowed searches allow the gaps between ‘researcher’
and ‘search engine’ RELEVANCY to close much quicker
• With precision query formulation, the search will be
faster and more useful
• 8 highlighted advanced features
The Visible Web
Why Google? (continued)
1. Google uses a modified Boolean Search
Searches can be done from Google Home Page
or from Advanced Features Page
The Visible Web
Why Google? (continued)
• “Phrase Searching”
• Google automatically “ANDS” words
• Accepts one or more “OR’s”
• Use a minus sign in front of term to
“NOT” it
• Google will not search on very common
“STOP” words like “a”, “it”, and “the”.
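The rules above (quotes for a phrase, implicit AND, explicit OR, and a minus sign for NOT) can be combined into one query string. The helper below is a hypothetical sketch for composing such a string. `build_query` and its parameter names are assumptions, and the search terms are made up for illustration.

```python
def build_query(all_words=(), phrase=None, any_words=(), exclude=()):
    """Compose a search string from AND words, a phrase, OR words, and NOT words."""
    parts = list(all_words)                      # plain words are ANDed automatically
    if phrase:
        parts.append(f'"{phrase}"')              # quotes force phrase searching
    if any_words:
        parts.append(" OR ".join(any_words))     # OR must be written in capitals
    parts.extend(f"-{word}" for word in exclude)  # minus sign "NOTs" a term
    return " ".join(parts)


query = build_query(
    all_words=["michigan"],
    phrase="tart cherries",
    any_words=["juice", "concentrate"],
    exclude=["recipe"],
)
print(query)
```

The resulting string can be typed into the Google home page search box as is.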
The Visible Web
Why Google? (continued)
• 2. Option to retrieve only a specific file
format
– (pdf), (ps), (xls), (ppt), (doc), (rtf)
– Very useful if searching for a certain ‘type’ of
data. For example: .xls and financial data.
The Visible Web
Why Google? (continued)
3. Date restrictions
4. Window to limit retrieval to title or URL
fields
5. Box for limiting to (or excluding) a
particular DOMAIN or URL
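Several of these restrictions can also be typed directly into the search box using Google's `filetype:`, `intitle:`, `inurl:`, and `site:` operators. The sketch below shows one way to assemble such a query and encode it into a search URL. The helper names, the search terms, and the `usda.gov` domain are illustrative assumptions, and date restrictions are left out.

```python
from urllib.parse import urlencode


def restricted_query(terms, filetype=None, in_title=None, in_url=None,
                     site=None, exclude_site=None):
    """Append file format, title/URL field, and domain restrictions to search terms."""
    parts = [terms]
    if filetype:
        parts.append(f"filetype:{filetype}")    # feature 2: specific file format
    if in_title:
        parts.append(f"intitle:{in_title}")     # feature 4: limit to title field
    if in_url:
        parts.append(f"inurl:{in_url}")         # feature 4: limit to URL field
    if site:
        parts.append(f"site:{site}")            # feature 5: limit to a domain
    if exclude_site:
        parts.append(f"-site:{exclude_site}")   # feature 5: exclude a domain
    return " ".join(parts)


def search_url(query):
    """Encode the query into a Google search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = restricted_query("dairy prices", filetype="xls", site="usda.gov")
url = search_url(query)
print(query)
print(url)
```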
The Visible Web
Why Google? (continued)
6. Page Specific Searches:
– for pages similar to the entered URL
– for pages that link to the entered URL
7. Links to “Topic-Specific Searches”
8. Domain specific searches for .gov, .mil, and .edu
Everything About Google??
• http://www.google.com/intl/en/help/refinesearch.html#domain
• http://www.google.com/intl/en/help/operators.html
• http://www.google.com/intl/en/help/cheatsheet.html
• http://www.google.com/intl/en/help/features.html
• http://www.google.com/options/
The Visible Web
The Greatest Google Feature??
• Skip the Title - Click the cache? WHY?
– Google ‘Highlights’ (different color)
keywords/phrases
– No pop-ups that are attached to Web pages
– Faster – Google’s servers are the best in the world
– Allows for ‘text only’ versions
– Allows access when the current site is ‘unavailable’
The Visible Web
The Greatest Google Feature??
• Further ‘Search’ Within the Result
Generated Sites
– If not in cache but titled page, use browser’s
“Find” button (Control+F) to show
keywords/phrases
– Use (Control+F) for NEW search with new
words and phrases
The Visible Web
Conclusions
• Is the desired information CONCEPTUAL or FACTUAL?
• If Conceptual: In-depth research (library,
books, scholarly journals, etc.) is most likely
necessary to effectively frame the search.
• If Factual: A search engine web search can
most likely proceed
• But always strive to find the “Original Source”
The Visible Web
Conclusions
• Set a time limit - ‘Web Surfing’ can be
addictive causing:
– Tendencies to wander off task
– Attention ‘fatigue’ resulting in overlooking
possible sources
– All other forms of destructive social and moral
behaviors.
News Searching
What Do You Want:
• Read news without ‘a paper, TV, or radio’?
• Just see last second’s headline?
• Find older stories?
• Monitor an industry?
• Other?
NEWS SOURCE GENERATED
INFORMATION
Introduction
• The evolution of news based information
– Story telling → town criers → news posters/papers →
electronic news divisions → the WEB
• News is now a ‘commodity’
– Minimal costs of distribution
– ‘Creation of news content’ is believed by many to be
unrestricted (text messages, cell phone pictures, etc.)
– Believed by many that with the ‘information
democracy’ they have the “right” to create news and
that their “news” is as legitimate as anyone else’s
NEWS SOURCE GENERATED
INFORMATION
Introduction (continued)
• News Differentiation → Attention → Merger of
News & Entertainment
– The Daily Show, The Colbert Report, etc.
NEWS SOURCE GENERATED
INFORMATION
Five Specific Types of Web News Outlets
• 1. Individual Online News Sites
• 2. Breaking News Aggregators
• 3. News Alert Services
• 4. Searchable News Data Bases
• 5. Industry News Sites
NEWS SOURCE GENERATED
INFORMATION
Web News Outlets (continued)
• 1. Individual Online News Sites
– Definition – Migrations of existing established media
outlets to Web based platform
– Examples: CNN.com, nytimes.com, online.wsj.com
– Usually have graphics, delivery methods similar to
parent outlet
– Mix of “free” and for fee services
– Most have archives, with most non-current material available
for fees (NYT’s recent decision!)
NEWS SOURCE GENERATED
INFORMATION
Web News Outlets (continued)
• 2. Breaking News Aggregators
– Definition – Sites that pull material from multiple
online news sources
– Usually limited to recent “Headline” material
– Use to do keyword search for relevant news articles
– Use when the individual site does not cover all
possible relevant (geographic/minor stories)
information
NEWS SOURCE GENERATED
INFORMATION
Web News Outlets (continued)
• 2. Breaking News Aggregators (continued)
• Personalize one of the general portal sites
(Google News, My Yahoo) and make it your
“start page”
• Go to a “news service” site like BBC,
CNN, MSNBC, etc.
• Go to favorite newspaper and set up an
RSS feed
NEWS SOURCE GENERATED
INFORMATION
Web News Outlets (continued)
• 3. News Alert Services
– Definition – Same as breaking news aggregators
except that a “User Profile” can be created
– Delivery method is via e-mail or Web site
– Useful if your particular interest is a company,
product, topic, etc.
– Issues include:
• Completeness of what is delivered (original source,
abstract)
• Search provisions and degree of advanced features
• Frequency of the Update
NEWS SOURCE GENERATED
INFORMATION
Web News Outlets (continued)
• 4. Searchable News Data Bases
– Definition – archive oriented (as opposed to
headline) multiple news source aggregators
– Best are “fee based” (MSU Library)
• Dialog
• LexisNexis
• Dow Jones (Factiva)
• ProQuest
• Others
NEWS SOURCE GENERATED
INFORMATION
Web News Outlets (continued)
• 4. Searchable News Data Bases (continued)
– Web “free” sites
– See Suggestions below
NEWS SOURCE GENERATED
INFORMATION
Web News Outlets (continued)
• 5. Industry News Sites
• Definition – Industry specific news sites
– Created by Trade Associations
– Trade Publications (their migration to the Web)
• Have combined features of several types
above:
– News Alert Service
– Breaking News Aggregators
– If ‘News’ source based, may have archives
• Examples: www.foodinstitute.com.
A Few Suggested News Sites
Google News Archive Search
news.google.com/archivesearch
• Claims to go back 300 years
• Time, WSJ, NYTimes, The Guardian, The
Washington Post
• Sources from ProQuest, Factiva, HighBeam, etc.
• Some full articles are free, most are fee
• Timelines
• Advanced Search features
A Few Suggested News Sites
See:
www.onstrat.com/news/newssearchchart.html
For a comparison of: Yahoo, Google, daypop,
rocketnews, findroy, feedster, topix
A Few Suggested News Sites
• www.monitor.bbc.uk/weekahead.shtml
• www.wn.com
• www.einnews.com
– Subscription business information and online news
service which draws from 35,000 sources
– Covers 240 countries categorized by country and
topic
– Headlines Only!! Use to identify and then go to library
sources
A Few Suggested News Sites
News Resource Guides:
• www.kidon.com/media-link Provides info and link to
sources and indicates the presence of streaming audio
and video
• www.abyznewslinks.com links to newspapers, broadcast
stations, internet services, etc.
• www.metagrid.com List of 8000 online magazines and
newspapers worldwide
• www.newswealth.com Unique categories of
miscellaneous ‘news’ sources
A Few Suggested News Sites
Front Pages:
www.newseum.org/todaysfrontpages
• 581 front pages from 54 countries
• Alphabetical main page with “Sort by Region”
geographic listing
• Thumbnail view
www.pressdisplay.com
• Front Page Free – 7 day free trial
• Full images of news pages of 500 newspapers
from 70 countries for SUBSCRIPTION
• Zoom and in paper search feature
A Few Suggested News Sites
Radio/TV Sources:
www.radio-locator.com
• Links to over 10,000 radio stations and over 2500 audio
streams from radio stations in 130 countries
www.tvradioworld.com
• From over 200 countries
Some Conclusions and Cautions
• There is great redundancy so be very selective
and methodical
• One way is to “personalize” your news (Self-confirmation bias)
• The nature of news creation and distribution
means that there will be more broken links
• Spend time becoming an “Information Trapper!”
Social Search
What is social search?
• No industry standard definition yet.
• “Internet wayfinding tools informed by
human judgment”
• “Informed” can mean many things,
including egregiously uninformed.
Social Search
Algorithmic Search is “Social”
• Algorithms are written by humans who
make choices
• Now, search engines observe human
behavior – click paths, popular URLs, etc. –
which is used to modify the algorithm
(Yahoo’s 14 tetragigs/day)
• “Personalization” efforts are becoming
more evident.
Social Search
Why now?
• Algorithmic search has plateaued
• Humans are still better at some things
• Rise in co-creation and collaboration via Web 2.0
• Recall status of Wikipedia
• Social Networking
– 69% of females (56% of males) ages 17-25 use
Facebook
– 38% of females (14% of males) ages 17-25 use MySpace
– 70% of ages 18-21 use social networks
Social Search
Issues
• Scale and scope issues – How to keep up and what is
the level of “control and policies”?
• Tagging – How do you get to common understanding?
– Folksonomy (also known as collaborative tagging, social
classification, social indexing, social tagging, and other
names) is the practice and method of collaboratively creating
and managing tags to annotate and categorize content.
– Ambiguity of language (‘orange’)
– Others?
• Social search will probably work best for non-text content
(photos, music, video, widgets, etc.)
Social Search
Some Selected Types of Social Search
• Shared bookmarks and Web Pages
• Tag Engines (blogs and RSS)
• Collaborative directories
• Personalized vertical search engines
Shared Bookmarks
• The most basic and probably least useful
type of social search
• http://del.icio.us/
• http://www.shadows.com/
• http://myweb2.search.yahoo.com/
• http://www.furl.net/
• http://www.diigo.com/community
Tag Engines
Sometimes called “taggregators”; they primarily
search blogs and RSS feeds
• http://technorati.com/ - The #1
• http://www.ask.com/?tool=bls – Could be
the best
• http://www.blogpulse.com/ - Monitors and
is a Nielsen firm
Collaborative directories
Directories created by teams of volunteers
• Open Directory Project (AOL) – Has become
dated and stale
• http://www.prefound.com/
• http://www.stumbleupon.com/ - Appears quite
good
• http://www.mahalo.com/ - Mostly currently
popular material
• http://www.linkedin.com/ - Professional
Networking
Personalized Verticals
It is no longer difficult or laborious to create
a specialized search engine –
• http://www.google.com/coop/cse/
• http://www.eurekster.com/
• http://rollyo.com/
Social Search
Conclusions
• Social Search will grow in importance
• People are less predictable than
algorithms – unlimited potential or
problems?
Basic Information Trapping
• Information Trapping is the process of
setting monitors – traps – to capture
information from the flow of the Web and
have it sent to you.
• Term coined by Tara Calishain
• http://www.researchbuzz.com
Basic Information Trapping
Info Trapping Pros
• Faster Results – As it happens, not weeks later
• More Results – Don’t have to remember to check
• Saves You Time – Not constantly duplicating searches
Info Trapping Cons
• The sheer volume can overwhelm you.
What Is Trappable?
• News Stories
• Web Sites
• Conversations
• Multimedia
• Tag Directories
• Blogs
• Anything with an RSS Feed
Basic Information Trapping
This is where ‘the action is’ for:
• Consumer research
• Image management
• Political planning and advertising
• Social profiling
• Etc.
How Do You ‘Trap’?
• RSS Feed Readers
• Web Page Monitors
• E-Mail Alerts
• ‘Trapline’ Allocation
– 70% RSS
– 20% Web pages
– 10% e-mail alerts
Basic Information Trapping
RSS Feeds
• Definition - an XML-based specification that
allows a Web site to instantly and automatically
distribute its content (news and now more) to
other sites
• Accessing - requires specialized software be
installed by the researcher
Basic Information Trapping
RSS Feeds (continued)
• What is the value of RSS Feeds?
Prequalification
• By setting the profile, the user ‘edits’ what
information comes into the attention space
• However, the researcher still has an obligation to
do the editing
– Guard against self-confirmation biases
– Must have a ‘focused’ relevancy strategy
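Setting a profile against an RSS feed amounts to reading the feed's XML and keeping only the items that match the researcher's keywords. The sketch below illustrates that idea with Python's standard `xml.etree.ElementTree`. The embedded feed, its URLs, and the `trap` function are hypothetical, and a real trap would download the feed rather than read a hard-coded string.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed standing in for a downloaded one.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Food Industry News</title>
    <item>
      <title>Tart cherry harvest forecast raised</title>
      <link>http://example.com/cherry-forecast</link>
    </item>
    <item>
      <title>New stadium opens downtown</title>
      <link>http://example.com/stadium</link>
    </item>
  </channel>
</rss>"""


def trap(feed_xml, keywords):
    """Return (title, link) pairs for feed items whose title mentions any keyword."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(keyword.lower() in title.lower() for keyword in keywords):
            hits.append((title, link))
    return hits


hits = trap(SAMPLE_FEED, ["cherry", "apple"])
for title, link in hits:
    print(title, "->", link)
```

The keyword list plays the role of the profile: a narrow list keeps the volume manageable, while a list that only echoes what the researcher already believes invites the self-confirmation bias noted above.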
Basic Information Trapping
WEBLOGS a.k.a. BLOGS
• Definition - a form of personal journalism where
an individual purporting to have knowledge of
and interest in a specific topic posts his/her
views on the topic on the Web.
• Typical Characteristics of Blogs include:
– daily postings
– recommended links
– often have “chatrooms” for forums and
discussions
– popular blogs now generate advertising
Basic Information Trapping
E-mail Alerts are straightforward –
• Most run on an RSS platform
• Are now readily available
Basic Information Trapping
Info Trapping is a separate training
session
Some possible tools include:
• http://www.aignes.com/ (WebSite Watcher) Free
trial, then fee service
• http://www.trackle.com/ Modest subscription fee
• http://www.rocketnews.com/info/portal.jsp
• http://www.boardtracker.com/ (Conversations)
• http://boardreader.com/ (Conversations)
• http://find.yuku.com/ Web 2.0 (Conversations)
Basic Information Trapping
Some possible tools (continued):
• http://www.everyzing.com/ (Multimedia/Podcasts)
• http://technorati.com/ (everything blogs)
• http://www.icerocket.com/ (blogs)
• http://www.zuula.com/ (Beta version of a Metasearch
engine for Info Trapping)
Basic Information Trapping
• Requires a fair amount of work
• Absolutely requires you have a very
specific search query
• Requires some advanced skills for
managing the “Trapline”
The Future (NOW) of Internet
Search?
• “Blended” or “Universal” Search is
becoming the norm
• “Personalization” of Search because of
algorithm interaction with “YOUR” actual
search actions
• “Mobilization” will take everything where
you are
• The battle between Web 1.0 vs. Web 2.0
philosophy