Don't Be Fooled: Trust and Accuracy Problems and Solutions in


The Development of Accurate Information
Retrieval Solutions in Early Search Engines
Kate Lopez
December 5, 2008
CS 349
Idea
 Historical progression of ideas about the proper method of
ranking or scoring web pages.
 Look at several early ideas for ranking and mitigating spam and
index manipulation.
 The structure of the Web itself reshapes the fundamentals of
the information retrieval systems we use: they must continually
adapt to changes in content and user preference while
maintaining an acceptable level of trust in their accuracy.
Marchiori
 Security of World Wide Web Search Engines, Massimo Marchiori,
1996
 The Quest for Correct Information on the Web: Hyper Search Engines,
Massimo Marchiori, 1997
Background
 Economic motivation
 Defense against attacks
 Web Structure
 Partial function from URL to sequence of bytes
 Web Object
 Pair of URL and sequence
 Score function
 Flattening phenomenon
 Heavy competition situation
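The formal model above can be sketched in code. This is a minimal illustration, not the paper's notation: the Web is a partial function from URLs to byte sequences, a web object is a pair of a URL and a sequence, and a score function rates an object against a search key. All names and data here are made up for the example.

```python
# Sketch of Marchiori's model (illustrative names, not the paper's).
from typing import Optional

# The Web as a partial function: URLs outside its domain map to None.
web: dict[str, bytes] = {
    "http://example.com/a": b"cheap flights cheap flights",
    "http://example.com/b": b"a short page about flights",
}

def lookup(url: str) -> Optional[bytes]:
    """Partial function from URL to sequence of bytes."""
    return web.get(url)

# A web object: pair of a URL and a sequence of bytes.
WebObject = tuple[str, bytes]

def score(obj: WebObject, key: str) -> float:
    """A naive frequency-based score function; exactly the kind of
    function that artificial keyword repetition can inflate."""
    _, content = obj
    words = content.decode("utf-8", errors="ignore").lower().split()
    return float(words.count(key.lower()))
```

A flattening phenomenon shows up directly in such a model: under heavy competition, many objects converge on near-identical top scores, and the score function stops discriminating.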
Search Engine Persuasion:
"Spamdexing"
 Artificial repetition of relevant keys
 Example fake commercial web object
 Impact on reliability
 Solution
 Truncation
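The truncation defense can be sketched as follows. The cap K is an illustrative parameter, not a value from the paper: occurrences of a key beyond the cap simply stop counting, so repeating relevant keys no longer pays off.

```python
# Truncation: cap each key's contribution to the score so artificial
# repetition of relevant keys stops paying off. K = 3 is illustrative.
K = 3

def raw_score(text: str, key: str) -> int:
    """Naive frequency score that spamdexing exploits."""
    return text.lower().split().count(key.lower())

def truncated_score(text: str, key: str) -> int:
    """Occurrences beyond the cap add nothing."""
    return min(raw_score(text, key), K)

spam = "flights " * 100                                   # fake commercial web object
honest = "book flights to Rome and compare flights fast"  # ordinary page

# raw_score ranks the spam page far above the honest one (100 vs. 2);
# truncated_score nearly levels them (3 vs. 2).
```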
Other Approaches to SEP defense
 Probabilistic
 Search engine post-processor
 Effectiveness grows with market pressure
 Clustering and shuffling
 Unique-Top
 Frequency implies relevance assumption
 Percentage score function
 Hyper
 Advertise competitor web objects to score high
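A percentage score function, one of the defenses listed above, can be sketched like this: a key is weighted by its share of the document's words rather than its raw count, which weakens the frequency-implies-relevance assumption that spammers exploit. The implementation details are illustrative.

```python
# Percentage score: a key's weight is its fraction of the document's
# words, not its absolute count.
def percentage_score(text: str, key: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(key.lower()) / len(words)
```

Note that a page consisting of nothing but the key still scores the maximum 1.0, which suggests why such functions are combined with other defenses rather than used alone.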
Hyper Search Engines
 How do we properly classify objects in response to the user’s
needs?
 New measure of informative content: hyper information
 Hypertext vs. textual information
 Visibility vs. quality
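The hyper information idea can be sketched as a fading sum over linked objects: an object's overall score adds the textual scores of objects reachable through its links, discounted by a fading factor per link step. The breadth-first traversal and the value F = 0.5 below are simplifying assumptions, not the paper's exact definitions.

```python
# Hyper information, simplified: overall(o) = textual(o) plus the
# textual scores of objects reachable by links, faded by F per step.
F = 0.5  # illustrative fading factor, 0 < F < 1

def overall_score(url: str,
                  textual: dict[str, float],
                  links: dict[str, list[str]],
                  depth_limit: int = 3) -> float:
    total = textual.get(url, 0.0)
    frontier, seen = [url], {url}
    for depth in range(1, depth_limit + 1):
        nxt = []
        for u in frontier:
            for v in links.get(u, []):
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
                    total += (F ** depth) * textual.get(v, 0.0)
        frontier = nxt
    return total

# a -> b -> c: overall(a) = 1 + 0.5*2 + 0.25*4 = 3.0
textual = {"a": 1.0, "b": 2.0, "c": 4.0}
links = {"a": ["b"], "b": ["c"]}
```

This also illustrates the attack mentioned earlier: because what you link to raises your own score, a page can advertise high-quality competitor web objects just to score high.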
Types of Hypertext Evaluation
 Single links
 Link type
 Local
 Frame
 Multiple links
 Other
Testing Post-Processor Implementation
 Randomly select 25 queries
 Subjects search for relevant information given a topic, then
evaluate the results
Lynch
 When Documents Deceive: Trust and Provenance as New Factors for
Information Retrieval in a Tangled Web, Clifford Lynch, 2001
Historical Assumptions of IR
 Behavior, consistency, and admission
 Accurate metadata
 Database type
 Full documents
 Surrogates
 Document passivity
In Conflict with the Web
 Distributed information environment
 Document inconsistency
 Document presentation
 The user
 The crawler
 Metadata manipulation
 Creator
 Source document vs. page viewed
 Provenance of data and metadata
 User trust preferences
Security Concerns: Indexing
 Page manipulation to alter behavior
 Index spamming
 Page jacking
 Selective response
 Indexer countermeasures
 Result spot checking
 Page certification
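Result spot checking, one of the countermeasures above, can be pictured as follows. This is a sketch under stated assumptions: `fetch_as_user` is a hypothetical callback that retrieves a page the way an ordinary browser would, and hashes of the indexed and served copies are compared to catch selective response.

```python
# Spot checking: re-fetch a random sample of indexed pages as a user
# would see them and flag pages that differ from the crawler's copy
# (a symptom of selective response / page jacking).
import hashlib
import random

def spot_check(index_copy: dict[str, bytes], fetch_as_user,
               sample_size: int = 5) -> list[str]:
    """fetch_as_user(url) -> bytes is a hypothetical user-view fetcher."""
    urls = random.sample(sorted(index_copy), min(sample_size, len(index_copy)))
    suspects = []
    for url in urls:
        indexed = hashlib.sha256(index_copy[url]).digest()
        served = hashlib.sha256(fetch_as_user(url)).digest()
        if indexed != served:
            suspects.append(url)
    return sorted(suspects)
```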
Security Concerns: Metadata
 Simple distinctions within searches
 Accuracy
 Who generated content?
 Does it accurately reflect the object it describes?
 Metadata use remains uncommon for these reasons
 Potential solutions
 Indexer and content provider collaboration
 Signature of assertion
o Example: RDF
o Public key infrastructure systems
o Pretty Good Privacy system
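Signing a metadata assertion can be sketched like this, so an indexer can check who made the assertion and that it was not altered. An HMAC with a shared secret stands in for the public-key signatures (PGP, PKI) the talk mentions, purely to keep the sketch dependency-free; the key, URLs, and fields are made up.

```python
# Sign a metadata assertion so an indexer can verify its origin and
# integrity. HMAC + shared secret is a stand-in for PGP/PKI signatures.
import hashlib
import hmac
import json

SECRET = b"demo-key"  # illustrative; real systems exchange key pairs

def sign(assertion: dict) -> str:
    # Canonicalize the assertion before signing so key order is stable.
    payload = json.dumps(assertion, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def verify(assertion: dict, signature: str) -> bool:
    return hmac.compare_digest(sign(assertion), signature)

meta = {"url": "http://example.com/a", "subject": "flights",
        "creator": "example.com"}
sig = sign(meta)
forged = dict(meta, subject="free money")  # tampered copy fails to verify
```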
User Expectations
 Formalization of expectations about behavior and trust in
behavior
 Credentials
 Personal preferences database
 Levels of trust
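Applying a personal preferences database at query time might look like the following sketch: each source carries a trust level, and results below the user's threshold are dropped. The hosts and levels are invented for illustration.

```python
# Filter results with a personal trust-preferences database; unknown
# hosts default to level 0 (untrusted).
trust_db = {"example.edu": 3, "example.com": 2, "spam.example": 0}

def filter_by_trust(results: list[dict], min_trust: int = 2) -> list[dict]:
    """Keep only results whose host meets the user's trust threshold."""
    return [r for r in results if trust_db.get(r["host"], 0) >= min_trust]

results = [
    {"host": "example.edu", "title": "IR survey"},
    {"host": "spam.example", "title": "FREE FLIGHTS"},
    {"host": "unknown.example", "title": "mystery page"},
]
```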
Conclusions
 Pre-Lycos:
 Saw the development of web terminology and the first attempts
to defend against information manipulation.
 Post-Lycos and Pre-Google:
 Developers began to focus more on user preferences, which
led to progress in page-ranking methods.
 Post-Google:
 Users looked ahead to potential vulnerabilities and improvements
to the system.
Questions?