
CS315-Web Search & Data Mining
A Semester in 50 minutes or less
The Web



History
Key technologies and developments
Its future
Information Retrieval (IR)

How do you find the information you need, fast?
IR on the Web



Web crawling and Indexing
Link Analysis
Quality of information
Introduction to “The Social Web”


Blogs, Twitter, FB, …
Social Networks
Web’s Search Engines
What are they?
How did they start?
How do they work?
What do they really do?
How do they make money?
Should I care about privacy?
How high is the quality of their results?
Can they be improved?
[Figure: search results page, with regions labeled PAID RESULTS and ORGANIC RESULTS]
Problems of Search and Mining
The Web poses a number of difficulties




Populist medium
Abundance and authority problem
Uniform access
Data with little structure
The Web: A populist medium
Anyone can be an author!
# of writers ≈ # of readers

Because both ≈ # of online members
Anyone can be an author!
The evolution of memes


Memes: ideas, theories, etc., that spread from person to person by imitation
Now more easily spread via the web
Easier to connect to people with similar interests

Gave rise to a plethora of online social networks
Abundance of information
Liberal and informal culture of content generation and dissemination
Redundancy
Non-standard form and content
Millions of qualifying pages for broad queries

E.g.: java, kayaking, panther
No authoritative information about the reliability or trustworthiness of content on a site

Your favorite urban legend?
Problems from uniform access
Little support for adapting to the background of specific users


Does your grandmother surf and search the web as easily as you do?
Personalized search might help (somewhat)
Commercial interests routinely influence the operation of Web search


“Search Engine Optimization”
AdSense
(Lack of) Structured Information
Hypertext refers to the ability to click and link, not to the structure of data
Semi-structured or unstructured

No schema (precise description of data)
Large number of attributes

Each word is a potential feature
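
To make "each word is a potential feature" concrete, here is a minimal bag-of-words sketch. The tokenizer, the function name bag_of_words, and the two sample documents are illustrative assumptions, not part of the course material.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts; each distinct word is one feature."""
    # Hypothetical tokenizer: lowercase, then keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

docs = [
    "Java is an island and a programming language",
    "The panther is a large cat",
]
vectors = [bag_of_words(d) for d in docs]

# The feature space is the union of all words seen in the collection,
# so it grows with every new document that introduces a new word.
vocabulary = sorted(set().union(*vectors))
print(len(vocabulary), vocabulary)
```

Even two short sentences yield a dozen attributes, and there is no schema saying which of them matter.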
Major topics to cover
History of the Web
Relevant network protocols
Search Engines and Directories
Clustering and classification
Hyperlink analysis
Measuring and Modeling the Web
Quality of information
Social networks
The Future of the web
Reading for next time
Vannevar Bush: “As We May Think”
Tim Berners-Lee, Weaving the Web:

Chapters 1 (Enquire Within upon Everything) & 2 (Tangles, Links, and Webs)
Find online and watch the “now-famous video, which [TBL] didn’t see until 1994”

Make notes of your actions to find the video
A few more details
Search engines: Crawling, Indexing, Ranking
Crawl: quickly fetch a large number of Web pages into a local repository
Index: based on keywords
Rank: order the responses to maximize the chance that the first few satisfy the user’s information need
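
The index and rank steps above can be sketched on a toy scale. This is a minimal illustration, not how any real search engine works. The crawl step is replaced by a hard-coded repository, and the URLs, function names, and the raw term-count score are all assumptions.

```python
from collections import defaultdict

# Hypothetical "crawled" repository: URL -> page text (no network access here).
repository = {
    "a.example/kayak": "kayaking trips and kayaking gear",
    "b.example/java": "java programming tutorial",
    "c.example/mixed": "kayaking in java indonesia",
}

def build_index(pages):
    """Inverted index: keyword -> {url: number of occurrences}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Rank pages by total occurrences of the query words (a crude score)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

index = build_index(repository)
print(search(index, "kayaking"))
```

Counting keyword occurrences is easy to game and ignores authority, which is one reason link analysis comes up later.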
Early search engines: WebCrawler, Lycos (1994)





Search engines from the beginning.
Successful, even with the difficulties described
Started as university research projects with small infrastructure, yet eminently useful
Based in part on traditional IR techniques.
Had interesting ideas that are still useful
Web directories
Yahoo! directory

to locate useful Web sites
Efforts for organizing knowledge into ontologies


Centralized: (Yahoo!)
Decentralized:
About.COM
The Open Directory Project (dmoz)
Clustering and classification
Clustering

Discover groups in a set of documents such that documents within a group are more similar than documents across groups.
Subjective disagreements due to


Different similarity measures
Large feature sets
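
A small sketch of how different similarity measures can disagree about which documents belong together. Cosine similarity and Jaccard similarity are two common choices. The sample texts and function names are assumptions made up for this illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(a, b):
    """Jaccard similarity of the two word sets (counts are ignored)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

query = Counter("java java java coffee".split())
doc1 = Counter("java java java island".split())
doc2 = Counter("java coffee beans".split())

# Cosine rewards matching word frequencies; Jaccard rewards shared vocabulary.
print("cosine :", cosine(query, doc1), cosine(query, doc2))
print("jaccard:", jaccard(query, doc1), jaccard(query, doc2))
```

Under cosine the first document is the closer match, under Jaccard the second one is, so a clustering built on one measure need not agree with a clustering built on the other.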
Classification

For assisting human efforts in maintaining taxonomies (topic directories)
(Hyper)Link Analysis
Traditional IR insufficient


Short queries
Abundance and authority problems
Take advantage of the structure of the Web graph.



Indicators of prestige of a page (e.g., citations)
HITS & PageRank
Anchor text
Bibliometry

Bibliographic citation graph of academic papers.
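
As a rough illustration of using the Web graph as a prestige signal, here is a simplified PageRank computed by power iteration. The four-page graph, the damping factor of 0.85, the fixed iteration count, and the handling of pages without outlinks are all assumptions of this sketch, not details from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page Web graph; every link target is also a key.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C is linked from three pages and ends up on top. Page D has no incoming links and keeps only the baseline share.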
Measuring and Modeling the Web
Useful to better understand the structure of the Web
Can we characterize the Web?



Distribution of hyperlinks per page
Patterns of linkage within topic communities
Path lengths between pages
Can we build a generative model with the same characteristics?
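
Two of the measurements listed above, links per page and path lengths between pages, can be sketched on a toy graph. The graph and function names are hypothetical. A real study would run the same kind of computation over a crawl of many pages.

```python
from collections import Counter, deque

# Hypothetical toy Web graph: page -> pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def in_degree_distribution(links):
    """Return a Counter: in-degree -> number of pages with that in-degree."""
    in_degree = {page: 0 for page in links}
    for outlinks in links.values():
        for target in outlinks:
            in_degree[target] = in_degree.get(target, 0) + 1
    return Counter(in_degree.values())

def path_length(links, start, goal):
    """Fewest clicks from start to goal via breadth-first search, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == goal:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(in_degree_distribution(graph))
print(path_length(graph, "D", "B"))
```

A generative model would be judged by whether graphs it produces show similar distributions to those measured on the real Web.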
Structured vs Web data mining
Traditional data mining


data is structured and relational
Well-defined tables, columns, rows, keys, and constraints.
Web data


readily available data rich in features and patterns
spontaneous formation and evolution of
topic-induced graph clusters
hyperlink-induced communities
Our goal: to discover patterns which are spontaneously driven by semantics.