Mini-Project on Web Data Analysis
Daniel Deutch
Data Management
“Data management is the development, execution and
supervision of plans, policies, programs and practices
that control, protect, deliver and enhance the value of
data and information assets”
(DAMA Data Management Body of Knowledge)
A major success:
the relational model of databases
Relational Databases
• Developed by Codd (1970), who won the Turing Award for
the model
• Huge success and impact:
‒ The vast majority of organizational data today is stored
in relational databases
‒ Implementations include MS SQL Server, MS Excel,
Oracle DB, MySQL,…
‒ 2 Turing Award winners (Edgar F. Codd and Jim Gray)
• Basic idea: data is organized in tables (=relations)
• Relations can be derived from other relations using a set
of operations called the relational algebra
‒ On which SQL is largely based
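To make "relations derived from other relations" concrete, here is a minimal Python sketch of three relational algebra operations (selection, projection, natural join). It is an illustration only: relations are modeled as lists of dicts, the function names are hypothetical, and the `takes` data and course IDs are made up for the example.

```python
# Relations are modeled as lists of dicts (one dict per tuple).
# All names and sample data here are illustrative assumptions.

def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the given attributes, dropping duplicate tuples."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def natural_join(left, right):
    """Natural join: combine tuples that agree on all shared attributes."""
    result = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[a] == r[a] for a in shared):
                result.append({**l, **r})
    return result

students = [
    {"sid": 1, "name": "Charles", "category": "undergrad"},
    {"sid": 2, "name": "Dan", "category": "grad"},
]
takes = [
    {"sid": 1, "cid": "DB101"},
    {"sid": 2, "cid": "DB101"},
    {"sid": 2, "cid": "WEB201"},
]

# Derive a new relation: (name, cid) pairs for grad students.
grads = select(students, lambda row: row["category"] == "grad")
grad_courses = project(natural_join(grads, takes), ["name", "cid"])
print(grad_courses)
```

Each operation takes relations and returns a relation, which is why the operations compose freely; a SQL query of the Select-From-Where form roughly corresponds to such a composition.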
Research in Data(base) Management
• 1970- : Relational Databases (tables).
‒ Indexing, Tuning, Query Languages, Optimizations, Expressive
Power,….
• ~20 years ago: Emergence of the Web and research on Web
data
‒ XML, text database, web graph….
‒ Google is a product of this research
(by Stanford’s PhD students Brin and Page)
• Recent years: hot topics include distributed databases, data
privacy, data integration, social networks, web applications,
crowdsourcing, trust,…
‒ Foundations taken from “classical” database research
• Theoretical foundations with practical impact
Web 2.0
• “Old” web (“Web 1.0”): static pages
– News, encyclopedic knowledge...
– No, or very little, interactive process between the web-page
and the user.
• Web 2.0: A term very broadly used for web-sites that
use new technologies (Ajax, JS..), allowing interaction
with the user.
– “Network as platform” computing
– The “participatory Web”
Online shopping
Advertisements
Social Networks
Crowd Sourcing
Data is all around
• Web graph
• “Social graph”
• Pictures, Videos, notifications, messages..
• Data that the application processes
• Advertisements
• Even the application structure itself
(A small portion of) the web graph
Need to Analyze
• Huge amount of data out there
– Est. 13.68 billion web-pages and counting
– Half a billion tweets per day and counting
• An average user “sees” about 600 tweets per day
• Most of it is irrelevant for you, some is
incorrect
Filter, Rank, Explain
• Filter
– Select the portion of data that is relevant
– Group similar results
• Rank
– Rank data by trustworthiness, relevance, recency...
– Present highest-rank first
• Explain
– An explanation of why the data is considered
relevant/highly ranked
– An explanation of how the data has propagated
• “Why do I see this?”
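The three steps can be sketched as a tiny pipeline. This is a toy under stated assumptions, not a method from the course: the `Item` fields, the keyword filter, and the scoring formula (trust discounted by age) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Item:
    text: str
    trust: float      # assumed trust score in [0, 1]
    age_hours: float  # assumed time since publication

def filter_items(items, keyword):
    """Filter: keep only items that mention the keyword."""
    return [it for it in items if keyword.lower() in it.text.lower()]

def score(item):
    """Toy ranking score: trust discounted by age (an arbitrary choice)."""
    return item.trust / (1.0 + item.age_hours)

def rank_items(items):
    """Rank: highest score first."""
    return sorted(items, key=score, reverse=True)

def explain(item, keyword):
    """Explain: say why the item was selected and how it was scored."""
    return (f"matched keyword '{keyword}'; trust={item.trust:.2f}, "
            f"age={item.age_hours:.0f}h, score={score(item):.3f}")

items = [
    Item("New database course announced", trust=0.9, age_hours=2),
    Item("Database rumor, unverified", trust=0.2, age_hours=1),
    Item("Football results", trust=0.8, age_hours=1),
]

ranked = rank_items(filter_items(items, "database"))
for it in ranked:
    print(it.text, "->", explain(it, "database"))
```

The explanation here is just a record of which filter matched and which inputs fed the score; richer notions of explanation (such as provenance) track how the data itself was derived.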
Main topics
• Analysis of Tables and Links on the Web
• Trust Management
• Explanation (Provenance)
• Information Extraction
• Social Networks
• Crowd-sourcing
• Distributed Query Evaluation
Approach
• Leverage knowledge from “classic” database
research
• Account for the new challenges
• Do so in a generic manner
• Leverage unique features such as collaborative
contribution, distribution, etc.
Data model
Students
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…            …        …
Query language
Select… From… Where…
[Figure: query plan with operators sname, cid=cid, sid=sid, name=“Mary” over the relations Students, Takes, Courses]
Physical Storage
Indexing
Distribution
...
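The layers in this picture (a data model, a declarative Select-From-Where query, and joins plus indexing underneath) can be tried out with Python's built-in `sqlite3` module. The sketch below is only loosely modeled on the figure: the column names, the `Courses` contents, the index, and the selection constant are assumptions, not taken from the slide.

```python
import sqlite3

# In-memory database; the schema and data are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid INTEGER PRIMARY KEY, sname TEXT, category TEXT);
    CREATE TABLE Courses  (cid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Takes    (sid INTEGER REFERENCES Students(sid),
                           cid INTEGER REFERENCES Courses(cid));
    -- Physical-level choice: an index to speed up lookups by course.
    CREATE INDEX takes_by_cid ON Takes(cid);
""")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [(1, "Charles", "undergrad"), (2, "Dan", "grad")])
conn.executemany("INSERT INTO Courses VALUES (?, ?)",
                 [(10, "Databases"), (20, "Web Data")])
conn.executemany("INSERT INTO Takes VALUES (?, ?)",
                 [(1, 10), (2, 10), (2, 20)])

# Select-From-Where: names of students who take a given course.
# The two joins correspond to conditions of the form sid=sid and cid=cid.
query = """
    SELECT s.sname
    FROM Students s
    JOIN Takes t   ON s.sid = t.sid
    JOIN Courses c ON t.cid = c.cid
    WHERE c.name = ?
    ORDER BY s.sname
"""
names = [row[0] for row in conn.execute(query, ("Web Data",))]
print(names)
```

The query states what is wanted, not how to compute it; the engine decides the join order and whether to use the index, which is the separation between the query language and the physical layers.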
Foundations
• Model
• Query Language
• Query evaluation algorithms
• Prototype implementation and optimizations
• Getting Data and Testing
Project Requirements
• Read a paper (or a bunch of papers) in the area
– Likely to require that you follow citations and read
earlier papers!
• Think of an application based on the paper's ideas
– Does not have to be exactly the application described in the
paper!
– E.g. you do not have to use relational databases
• Think of how you would get/generate data
• Implement, test
• Submit an application + report
Report
• An integral part of the project submission
• Should include:
– A detailed description of the model and algorithms that you
have implemented
– A detailed description of the application
– Code design
– Use cases
– Difficulties that you have encountered and how you addressed
them
Timeline
• By 20/3 (1 week from now): send me an ordered list of
3 preferred papers
– Email title includes the words “mini-project”
– Body includes the names and IDs of the pair
• A bit after Passover (date TBA): each pair gives a
7-10 minute presentation on the expected project
– A slide on each of the issues mentioned in the requirements slide
• 1 week before the last week of the semester: short project
presentations (including screenshots or live demo)