CS315-L01-Overview


CS315-Web Search & Data Mining
A Semester in 50 minutes or less
The Web


History, Key technologies and developments
Its future
Information Retrieval (IR), including on the Web



How do you find the information you need, fast?
Web crawling and Indexing
Link Analysis, Quality of information
Data Mining and Machine Learning

How do you cluster and classify information (semi)automatically?
Introduction to “The Social Web”


Blogs, Twitter, FB, …
Social Networks
Web Search Engines
What are they?
How did they start?
How do they work?
How do they make money?
Should I care about privacy?
How high is the quality of their results?
Can they be improved?
[Figure: search results page with the paid results and the organic results labeled]
Problems of Search and Mining
The Web poses a number of difficulties




A populist medium
The information abundance and authority problem
Uniform access
Data with little structure
The Web: A populist medium
Anyone can be an author!
# of writers ~= # of readers

Because both ~= # of online members
The evolution of memes


Memes: ideas, theories, etc., that spread from person to person by imitation
Now more easily spread via the web
Easier to connect to people with similar interests

Gave rise to a plethora of online social networks
Info Abundance and Authority
Liberal and informal culture of content generation and dissemination
Redundancy
Non-standard form and content
Millions of qualifying pages for broad queries

E.g.: java, kayaking, panther
No authoritative information about the reliability or trustworthiness of content on a site

Your favorite urban legend?
Problems from uniform access
Little support for adapting to the background of specific users


Does your grandfather surf and search the web as easily as you do?
Personalized search might help (somewhat)
Commercial interests routinely influence the operation of Web search


“Search Engine Optimization”
AdSense
(Lack of) Structured Information
Hypertext refers to the ability to click and link, not to the structure of the data
Semi-structured or unstructured

No schema (precise description of data)
Large number of attributes

Each word is a potential feature
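
To make "each word is a potential feature" concrete, here is a minimal sketch in Python. It assumes a simple bag-of-words representation, which is one common way to turn schema-less text into attributes and is not necessarily the exact model used later in the course. The two page snippets and the helper name bag_of_words are made up for illustration.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two made-up page snippets (illustrative only, not real Web data).
pages = {
    "page_a": "Java is a programming language",
    "page_b": "Java is an island in Indonesia",
}

vectors = {name: bag_of_words(text) for name, text in pages.items()}

# With no schema, the attribute set is simply every distinct word seen so far.
vocabulary = sorted(set().union(*vectors.values()))

for name, counts in vectors.items():
    # One column per vocabulary word; most entries are zero for any given page.
    row = [counts[word] for word in vocabulary]
    print(name, row)

print(len(vocabulary), "features from two short snippets")
```

Even two one-line snippets already produce nine attributes, and both contain "java" with different meanings, which hints at why broad queries match so many pages.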
Major topics to cover
History of the Web
Relevant network protocols
Search Engines and Directories
Hyperlink analysis
Measuring and Modeling the Web
Quality of information
Clustering and classification
Social networks
The Future of the web
Reading for next time
Vannevar Bush: “As We May Think”
Tim Berners-Lee:

Chapters 1 (Enquire within) & 2 (Tangles, Links, Webs)
Berners-Lee et al.: “The Information Universe”