Why do I see this?

Download Report

Transcript Why do I see this?

Web 2.0
Data Analysis
DA N IEL D E UTCH
Data Management
“Data management is the development, execution and
supervision of plans, policies, programs and practices
that control, protect, deliver and enhance the value of
data and information assets”
(DAMA Data Management Body of Knowledge)
A major success:
the relational model of databases
Relational Databases
•
Developed by Codd (1970), who won the Turing award for
the model
•
Huge success and impact:
‒
The vast majority of organizational data today is stored
in relational databases
‒
Implementations include MS SQL Server, MS excel,
Oracle DB, mySQL,…
‒
2 Turing award winners (Edgar F. Codd and Jim Gray)
•
Basic idea: data is organized in tables (=relations)
•
Relations can be derived from other relations using a set
of operations called the relational algebra
‒
On which SQL is largely based
Research in Data(base) Management
• 1970- : Relational Databases (tables).
‒ Indexing, Tuning, Query Languages, Optimizations, Expressive
Power,….
• ~20 years ago: Emergence of the Web and research on Web
data
‒ XML, text database, web graph….
‒ Google is a product of this research
(by Stanford’s PhD students Brin and Page)
• Recent years: hot topics include distributed databases, data
privacy, data integration, social networks, web applications,
crowdsourcing, trust,…
‒ Foundations taken from “classical” database research
• Theoretical foundations with practical impact
Web 2.0
• “Old” web (“Web 1.0”): static pages
– News, encyclopedic knowledge...
– No, or very little, interactive process between the web-page
and the user.
• Web 2.0: A term very broadly used for web-sites that
use new technologies (Ajax, JS..), allowing interaction
with the user.
– “Network as platform" computing
– The “participatory Web”
Web 2.0
• “Old” web (“Web 1.0”): static pages
– News, encyclopedic knowledge...
– No, or very little, interactive process between the web-page
and the user.
• Web 2.0: A term very broadly used for web-sites that
use new technologies (Ajax, JS..), allowing interaction
with the user.
– “Network as platform" computing
– The “participatory Web”
Online shopping
Advertisements
Social Networks
Collaborative Applications
Data is all around
• Web graph
• “Social graph”
• Pictures, Videos, notifications, messages..
• Data that the application processes
• Advertisments
• Even the application structure itself
(A small portion of) the web graph
News Feed
Need to Analyze
• Huge amount of data out there
– Est. 13.68 billion web-pages and counting
– Half a billion tweets per day and counting
• An average user “sees” about 600 tweets per day
• Most of it is irrelevant for you, some is
incorrect
Maybe we shouldn’t
Haaretz, 7/3/13
Static vs. Dynamic data
• For “static” web data:
problem addressed by search engines
– Allow users to specify criteria for search (e.g. keywords), and
rank the results.
• For “dynamic” web 2.0 data (tweets, notifications, web
applications):
– No satisfying solution yet.
– Data volume is constantly increasing.
Filter, Rank, Explain
• Filter
– Select the portion of data that is relevant
– Group similar results
• Rank
– Rank data by trustworthiness, relevance, recency...
– Present highest-rank first
• Explain
– An explanation of why is the data considered
relevant/highly-ranked
– An explanation of how has the data propagated
• “Why do I see this?”
Approach
• Leverage knowledge from “classic” database
research
• Account for the new challenges
• Do so in a generic manner
• Leverage unique features such as collaborative
contribution, distribution, etc.
Data model
Query language
Students
SSN
123-45-6789
234-56-7890
Name
Charles
Dan
…
Category
undergrad
grad
…
Select…
From…
Where…
sname
Physical Storage
Indexing
cid=cid
Distribution
...
sid=sid
name=“Mary”
Students
Takes
Courses
21
Foundations
• Model
• Query Language
• Query evaluation algorithms
• Prototype implementation and optimizations
• Getting Data and Testing
Two concrete examples
• Analysis of Web Applications
• Analysis of Social Networks
Analysis of Web Applications
Analysis Questions
• For the application owner:
– Can the user make a reservation without giving
credit card details?
– What is the probability that a user would choose a
category for which there are no available products?
• For the application user:
– How can I get the best deals (by navigating)?
– How can I navigate in an optimal way (minimal
number of clicks)
ShopIT-Shopping Assistant
26
ShopIT-Shopping Assistant
27
ShopIT-Shopping Assistant
28
Modeling
• Model should be declarative, intuitive
• Abstract representation
• Allows readability, portability, applicability,
optimizations…
• Tradeoff between expressivity and complexity
Application Model
Effects guiding the execution
• Dynamic effects such as the choice of which link to follow
– Probability/cost value can be associated with every such
choice
– Cumulative: if the same choice is made twice, double the
cost
• Static effects such as available products in the database
– Static in the sense that it is constant in a single execution
– Thus non-cumulative
Filter, Rank, Explain
• Filter
– Choose “interesting” navigation sequences
• Rank
– Rank different sequences by length, purchase
cost,…
• Explain
– Why are these the best sequences? What do
they entail in terms of purchase? How can
they be refined?
Query Language
• Desired executions/navigation sequences are typically
expressed in temporal logic
• Database analysis is typically done through a Firstorder-based query language (e.g. SQL)
• A query language for data-dependent application
should combine both.
• While allowing for efficient evaluation.
• LTL-FO
Query Evaluation
• Problem: Given a process specification and a
query, evaluate the query
– To find all executions that satisfy the criteria
– To find the probability that the criteria are satisfied
– To find the minimal cost execution realizing the
criteria
–…
Interesting problems are hard
(sometimes)
• The (corresponding boolean) problems are NP-hard
‒ Very unlikely that they can be solved efficiently for worst case
input
• This is often the case when modeling real-life problems
• To be considered, but not to be feared!
What do you do with a theoretically
hard problem?
• Find restricted, yet realistic, cases that are “easy”, i.e.
admit efficient worst case algorithms
‒ This also allows to have a better understanding of
“where is the hardness”
‒ Example: If there are only “few” static choices, the
problem becomes easy
• Design optimization techniques for practically efficient
implementations
Optimization
• One idea is based on the notion of data
provenance
– Originally for “classic” database queries, provenance is a
“concise” representation of the query computation
• Here we generalize the approach for the
temporal queries
– Allows for a generic approach and many optimizations
• A system prototype for provenance-aware
analysis of processes is under development
How do we get data?
• Some applications (e.g. Yahoo! Shopping)
allows a (limited) interface to their database.
• Application structure can be obtained by
crawling.
• Allows to prove that the model is realistic, and
to test feasibility of algorithms on real-life data
scale
Open problems
• Uncertainty and partial knowledge
• Data updates by the application
• Incremental maintenance
‒ How do modifications to the database can be
accounted for without restarting the analysis?
Analysis of Social Networks
• Social network common model of disseminating data is
in a “push” mode
• Facebook’s news feed, Twitter’s tweets…
• Huge load of information, and it continues to grow
• No standard way of searching and ranking such posts
A Machine Learning Approach
• “Learn” users preferences
– Ask them whether they like a particular post/tweet, or use
existing mechanisms
• Try to generalize by extracting characteristic features
(e.g. topics)
• Use the generalized function to predict future
preferences
A data management approach
• Allow users to tell the system their
preferences, through a “query” language.
• One family of such languages are rule-based
languages
– Prolog, Datalog,...
• Simpler than standard programming languages
• Allows for easier analysis
Rule based language
(Logic Programming)
descendent(X,Y):- parent(X,Y)
descendent(X,Y):parent(X,Z),descendent(Z,Y)
For tweeter
See(tweet):-Tweets(Tova,tweet)
See(tweet):-Tweets(x,tweet),
related(x,"basketball")
Attending(party):count[friend(x),like(x,party)]>10
Learning
• Unlike standard database querying, some of
the predicates may be difficult to evaluate
• How do we know that a post is related to basketball?
• What can we do?
What can we do
• Text analysis
– Problem: An ”I love basketball” tweet is not really a good match
• Use tagging if exists
• Ask other peers in the network!
– Can also be used to propose predicates to users
Influence
• Predict effects and allow users to influence friends
• "How do I get at least 5 of my friends to come with the
party with me?"
• Need them to (1) see the party, (2) click "attending"
• "How do I get at least 1000 people to go to a political
rally?"
• Can be answered by an analysis of the rules!
Interface for Specifying Rules
• Cannot expect Facebook/Twitter users to be
able to program
• Need to design an interface
• The interface components can be dynamically
designed according to answers by other peers
• An ongoing research
Conclusions
• Web 2.0 data poses new challenges in research
• Data is dynamic, access mode is unique, analysis goals
different than in classic databases
• Lots of research on theoretical questions with strong
practical applications
Questions?