Don't Be Fooled: Trust and Accuracy Problems and Solutions in
The Development of Accurate Information
Retrieval Solutions in Early Search Engines
Kate Lopez
December 5, 2008
CS 349
Idea
Historical progression of ideas about the proper method of
ranking or scoring web pages.
Look at several early ideas for ranking and mitigating spam and
index manipulation.
The structure of the Web will inherently change the
fundamentals of the information retrieval systems that we
use. Those systems will have to continually adapt to changes
in content and user preference, while maintaining an
acceptable level of trust in their accuracy.
Marchiori
Security of World Wide Web Search Engines, Massimo Marchiori,
1996
The Quest for Correct Information on the Web: Hyper Search Engines,
Massimo Marchiori, 1997
Background
Economic motivation
Defense against attacks
Web Structure
Partial function from URL to sequence of bytes
Web Object
Pair of URL and sequence
Score function
Flattening phenomenon
Heavy competition situation
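The definitions above can be sketched in a few lines of Python. This is only an illustration, not Marchiori's formalism: the dictionary `web`, the example URLs, and the occurrence-counting `score` function are all assumptions chosen to make the terms concrete.

```python
from typing import Dict, Optional, Tuple

# Web structure: a partial function from URL to a sequence of bytes.
# A dict models "partial": URLs absent from it are simply undefined.
Web = Dict[str, bytes]

# Web object: a pair of a URL and the byte sequence it maps to.
WebObject = Tuple[str, bytes]


def web_object(web: Web, url: str) -> Optional[WebObject]:
    """Return the (url, bytes) pair, or None where the Web is undefined."""
    if url not in web:
        return None
    return (url, web[url])


def score(obj: WebObject, key: str) -> float:
    """Hypothetical score function: count occurrences of the key."""
    _, body = obj
    words = body.decode("utf-8", errors="ignore").lower().split()
    return float(words.count(key.lower()))


# Hypothetical example data.
web: Web = {
    "http://example.org/a": b"cheap flights and cheap hotels",
    "http://example.org/b": b"flights to Rome",
}

obj = web_object(web, "http://example.org/a")
missing = web_object(web, "http://example.org/none")
```

A real engine's score function is far more elaborate; the point here is only the shape of the three definitions.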
Search Engine Persuasion:
"Spamdexing"
Artificial repetition of relevant keys
Example fake commercial web object
Impact on reliability
Solution
Truncation
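A small sketch of why repetition works against a naive score, and how truncation blunts it. It assumes one plausible reading of truncation (occurrences of a key beyond a fixed cap are ignored); the cap of 3, the function names, and the sample pages are illustrative assumptions, not values from the paper.

```python
def raw_score(text: str, key: str) -> int:
    """Naive score: number of occurrences of the key in the text."""
    return text.lower().split().count(key.lower())


def truncated_score(text: str, key: str, cap: int = 3) -> int:
    """Count occurrences of the key, ignoring any beyond `cap` (assumed value)."""
    return min(raw_score(text, key), cap)


# Hypothetical pages: an honest one and one stuffed with a repeated key.
honest = "guide to used cars and how to inspect cars before buying"
spam = "cars " * 50 + "buy now"

raw_honest = raw_score(honest, "cars")        # 2
raw_spam = raw_score(spam, "cars")            # 50
cut_honest = truncated_score(honest, "cars")  # 2
cut_spam = truncated_score(spam, "cars")      # 3
```

Under the raw score the stuffed page wins by a factor of 25; with the cap its advantage shrinks to a single occurrence.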
Other Approaches to SEP defense
Probabilistic
Search engine post-processor
Effectiveness grows with market pressure
Clustering and shuffling
Unique-Top
Frequency implies relevance assumption
Percentage score function
Hyper
Advertise competitor web objects to score high
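The percentage score function can be illustrated as the share of a page's words that match the key. This is a sketch under that assumption; the function name and the sample texts are hypothetical.

```python
def percentage_score(text: str, key: str) -> float:
    """Fraction of the words in the text that equal the key (0.0 for empty text)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(key.lower()) / len(words)


# Hypothetical pages with the same number of key occurrences.
focused = "cars cars cars"
long_page = "cars " * 3 + "filler " * 27

focused_pct = percentage_score(focused, "cars")
long_pct = percentage_score(long_page, "cars")
```

Both pages contain the key three times, but the short page scores 1.0 and the long one about 0.1, so a page made of nothing but the key still reaches the maximum score.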
Hyper Search Engines
How do we properly classify objects in response to the user’s
needs?
New measure of informative content: hyper information
Hypertext vs. textual information
Visibility vs. quality
Types of Hypertext Evaluation
Single links
Link type
Local
Frame
Multiple links
Other
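The slides do not give a formula for hyper information. As a rough sketch for the single-link case, assume that a page's overall information is its textual information plus a discounted share of the textual information of the pages it links to. The discount `FADING`, the one-link depth, and the toy scores are all assumptions made here for illustration.

```python
from typing import Dict, List

FADING = 0.5  # assumed discount for information that is one link away


def overall_information(
    page: str,
    text_info: Dict[str, float],
    links: Dict[str, List[str]],
    fading: float = FADING,
) -> float:
    """Textual information of the page plus faded information of its link targets."""
    hyper = sum(fading * text_info.get(target, 0.0) for target in links.get(page, []))
    return text_info.get(page, 0.0) + hyper


# Hypothetical textual scores and outgoing links.
text_info = {"hub": 1.0, "article": 4.0, "isolated": 2.0}
links = {"hub": ["article"], "article": [], "isolated": []}

hub_info = overall_information("hub", text_info, links)            # 1.0 + 0.5 * 4.0
isolated_info = overall_information("isolated", text_info, links)  # 2.0
```

Here the hub page has little text of its own but outranks the isolated page because of what it points to, which is the kind of contribution a purely textual score misses.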
Testing Post-Processor Implementation
Randomly select 25 queries
Subjects search for relevant information given a topic, then
evaluate the results
Lynch
When Documents Deceive: Trust and Provenance as New Factors for
Information Retrieval in a Tangled Web, Clifford Lynch, 2001
Historical Assumptions of IR
Behavior, consistency, and admission
Accurate metadata
Database type
Full documents
Surrogates
Document passivity
In Conflict with the Web
Distributed information environment
Document inconsistency
Document presentation
The user
The crawler
Metadata manipulation
Creator
Source document vs. page viewed
Provenance of data and metadata
User trust preferences
Security Concerns: Indexing
Page manipulation to alter behavior
Index spamming
Page jacking
Selective response
Indexer countermeasures
Result spot checking
Page certification
Security Concerns: Metadata
Simple distinctions within searches
Accuracy
Who generated content?
Does it accurately reflect the object it describes?
Metadata use is uncommon for these reasons
Potential solutions
Indexer and content provider collaboration
Signature of assertion
Example RDF
Public key infrastructure systems
Pretty Good Privacy system
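To make "signature of assertion" concrete, the sketch below signs a small metadata assertion and detects tampering. Python's standard library has no public-key signing, so a keyed hash (HMAC) with a shared secret stands in for the digital signature that a PKI or PGP system would provide. The field names, the key, and the URL are hypothetical.

```python
import hashlib
import hmac
import json


def sign_assertion(assertion: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON form of the assertion."""
    payload = json.dumps(assertion, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_assertion(assertion: dict, signature: str, key: bytes) -> bool:
    """Check that the signature matches the assertion (constant-time compare)."""
    expected = sign_assertion(assertion, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical key and metadata assertion about a page.
key = b"hypothetical-shared-secret"
assertion = {
    "about": "http://example.org/a",
    "creator": "Example Author",
    "subject": "travel",
}
signature = sign_assertion(assertion, key)

# Someone alters the metadata after it was signed.
tampered = dict(assertion, subject="cheap flights")

original_ok = verify_assertion(assertion, signature, key)
tampered_ok = verify_assertion(tampered, signature, key)
```

With a real public-key scheme the indexer would verify using the signer's public key instead of a shared secret, which also ties the assertion to who made it.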
User Expectations
Formalization of expectations about behavior and trust in
behavior
Credentials
Personal preferences database
Levels of trust
Conclusions
Pre-Lycos:
Saw the development of web terminology and the first attempts
to defend against information manipulation.
Post-Lycos and Pre-Google:
Developers began to focus more on user preferences, which
led to progress in methods of page ranking.
Post-Google:
Users looked ahead to potential vulnerabilities and improvements
of the system.
Questions?