pptx - Computer Science and Engineering
Presented by: Ashkan Malekloo
Fall 2015
Type: Demonstration paper
Authors:
Daniel Haas, Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Eugene Wu
VLDB 2015
Dirty data
Data cleaning is often specific to the domain, dataset, and eventual analysis; analysts report spending upwards of 80% of their time on data cleaning problems
Possible errors
What to extract
How to clean the data
Whether that cleaning will significantly change results
While the extraction operation can be represented at a logical level by its input and
output schema, there is a huge space of possible physical implementations of the
logical operators.
Rule-based
Learning-based
Crowd-based
Or a combination of the three
Let’s say we select the crowd-based operator as our extraction method
There are still many parameters that might influence the quality of the output (see the sketch below):
the number of crowd workers
the amount each worker is paid
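As a purely hypothetical illustration (not Wisteria's actual plan syntax), a declarative plan step for a crowd-based extraction operator might expose knobs like these:

    # Hypothetical sketch only -- not Wisteria's actual plan syntax.
    # A logical extraction step with one possible physical implementation
    # (crowd-based) and the quality-relevant knobs named on this slide.
    crowd_extract_step = {
        "operator": "extract",
        "implementation": "crowd",          # alternatives: "rule", "learning"
        "input_schema": ["raw_record"],     # hypothetical column names
        "output_schema": ["street", "city", "zip"],
        "params": {
            "num_workers": 3,               # number of crowd workers per task
            "pay_per_task_usd": 0.05,       # amount each worker is paid
        },
    }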
ETL (Extract-Transform-Load)
Constraint-driven tools
Wrangler
OpenRefine
Crowd-based
Wisteria: a system designed to support the iterative development and optimization of data cleaning plans end-to-end
It allows users to specify declarative data cleaning plans
Wisteria phases:
Sampling
Recommendation
Crowd Latency
Presented by: Ashkan Malekloo
Fall 2015
Type: Demonstration paper
Authors:
Eli Cortez, Philip A. Bernstein, Yeye He, Lev Novik
VLDB 2015
In large enterprises, data discovery is a common problem faced by users who need to
find relevant information in relational databases
Finding tables that are relevant
Finding out whether a table is truly relevant
In this paper, their sample involves:
29 databases
639 tables
4216 data columns
many frequently-used column names are very generic
Name
Id
Description
Field
Code
Column
These generic column names are useless for helping users find tables that have the
data they need.
a system that automatically generates candidate keywords to annotate columns of
database tables
mining spreadsheets
Spreadsheets are more readable
A method to automatically extract tables from a corpus of enterprise spreadsheets.
A method for identifying and ranking relevant column annotations, and an efficient
technique for calculating it.
An implementation of the method, and an experimental evaluation that shows its efficiency and effectiveness.
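A minimal sketch of the core idea, with assumed details (the overlap threshold and scoring are made up; the paper's actual ranking method is more elaborate): annotate a database column with keywords taken from the headers of spreadsheet columns whose values overlap with it.

    from collections import Counter

    def candidate_annotations(db_column_values, spreadsheet_columns):
        """spreadsheet_columns: (header, values) pairs mined from a corpus."""
        db_values = set(db_column_values)
        scores = Counter()
        for header, values in spreadsheet_columns:
            overlap = len(db_values & set(values)) / max(len(db_values), 1)
            if overlap > 0.1:               # assumed overlap threshold
                scores[header] += overlap
        return scores.most_common()         # ranked (keyword, score) list

    # A generic "Name" column matched against two mined spreadsheet columns:
    sheets = [("Employee", ["alice", "bob"]), ("Project", ["apollo"])]
    print(candidate_annotations(["alice", "bob", "carol"], sheets))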
Type:
Demonstration Paper
Authors:
Manas Joglekar, Hector Garcia-Molina (Stanford), Aditya Parameswaran (University of Illinois)
Presented by: Siddhant Kulkarni
Term:
Fall 2015
Drill Down -> Data exploration
Drawbacks of the traditional drill-down operation:
Too many distinct values
One column at a time
Simultaneously drilling down on several columns presents too many values (see the sketch below)
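To make the drawbacks concrete, here is a plain GROUP BY style drill-down on made-up data (pandas used purely for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "country": ["US", "US", "DE", "DE"],
        "city":    ["NYC", "SF", "Berlin", "Munich"],
        "sales":   [10, 20, 5, 8],
    })

    # Traditional drill-down: one column at a time.
    print(df.groupby("country")["sales"].sum())

    # Drilling down on several columns at once multiplies the number of
    # groups (one per distinct value combination) -- too many values to scan.
    print(df.groupby(["country", "city"])["sales"].sum())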
Interpretable and informative explanations of outcomes
User-adaptive exploration of multidimensional data
User-cognizant multidimensional analysis
Discovery-driven exploration of OLAP data cubes
Type:
Demonstration Paper
Authors:
Tobias Müller, Torsten Grust (Universität Tübingen, Tübingen, Germany)
Presented by: Siddhant Kulkarni
Term:
Fall 2015
Given a query
Record its control-flow decisions (WHY-PROVENANCE)
Data access locations (WHERE-PROVENANCE)
Without actual data values (VALUELESS!), determine the why-origin and where-origin of a query
DETERMINE THE I/O DEPENDENCIES OF REAL-LIFE SQL QUERIES
Step 1: Convert SQL query into Python code
Step 2: Apply Program Slicing
Step 3: Apply Abstract Interpretation
Demo with PostgreSQL
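A minimal sketch of what Step 1 might produce (an assumed form, not the authors' actual translator): the SQL query becomes Python loops whose control-flow decisions and data accesses can then be sliced and abstractly interpreted.

    # Assumed translation of: SELECT o.id FROM orders o WHERE o.total > 100
    def query(orders):
        result = []
        for o in orders:              # data access location (where-provenance)
            if o["total"] > 100:      # control-flow decision (why-provenance)
                result.append(o["id"])
        return result

    print(query([{"id": 1, "total": 50}, {"id": 2, "total": 150}]))  # [2]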
Presented by: Omar Alqahtani
Fall 2015
Nicole Bidoit
Université Paris Sud / Inria
Melanie Herschel
Universität Stuttgart
Katerina Tzompanaki
Université Paris Sud / Inria
Explanations to Why-Not questions:
data-based explanations
query-based explanations
Mixed.
The Explain and Fix Query (EFQ) platform enables users to execute queries, express a Why-Not question, and ask for:
Explanations to Why-Not questions.
Query-based
Why-Not Answer polynomials
Query refinements that produce the desired results.
cost model for ranking
Presented by:
Ranjan
Fall 2015
Database?
Spreadsheets?
Problem?
A spreadsheet containing course assignment scores and eventual
grades for students from rows 1–1000, columns 1–10 in one sheet,
and demographic information for the students from rows 1–1000,
columns 1–20 in another sheet.
The user wants to understand the impact of assignment grades on the course grade, e.g., for students having std_points > 90 in at least one assignment.
The user wants to plot the average grade by demographic group (undergrad, MS, PhD).
The course management software outputs actions performed by students into a relational database or a CSV file; since data is continuously added, there is no easy way for the user to study this data within the spreadsheet.
Schema
Addressing
Modifications
Computation: spreadsheets support value-at-a-time formulae to
allow derived computation, while databases support arbitrary SQL
queries operating on groups of tuples at once.
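A small contrast on made-up data: a per-cell, value-at-a-time computation in the spreadsheet style versus a single set-at-a-time SQL aggregate.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (student TEXT, points REAL)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)",
                     [("a", 95.0), ("b", 80.0)])

    # Spreadsheet style: a formula evaluated cell by cell, e.g. C2 = B2 * 1.1.
    curved = [(s, p * 1.1) for s, p in conn.execute("SELECT * FROM scores")]

    # Database style: one SQL query over the whole group of tuples at once.
    avg = conn.execute("SELECT AVG(points) FROM scores").fetchone()[0]
    print(curved, avg)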
a) Analytic queries that reference data on the spreadsheet, as well as data in other database relations.
b) Importing or exporting data from the relational database.
c) Keeping data in the front-end and back-end in sync during modifications at either end.
a) Use of spreadsheets to mimic relational database functionality: achieves the expressivity of SQL, but cannot leverage the scalability of databases.
b) Use of databases to mimic spreadsheet functionality: achieves the scalability of databases, but does not support the ad-hoc tabular management provided by spreadsheets.
c) Use of a spreadsheet interface for querying data: provides an intuitive interface to query data, but loses the expressivity of SQL as well as ad-hoc data management capabilities.
Overall, the aforementioned demonstration scenarios will convince attendees that the DATASPREAD system offers a valuable hybrid between spreadsheets and databases, retaining the ease of use of spreadsheets and the power of databases.
Presented by: Zohreh Raghebi
Fall 2015
Bilegsaikhan Naidan, Norwegian University of Science and Technology, Trondheim, Norway
Leonid Boytsov, Carnegie Mellon University, Pittsburgh, PA, USA
Eric Nyberg, Carnegie Mellon University, Pittsburgh, PA, USA
Nearest-neighbor searching is a fundamental operation employed in many applied areas, such as pattern recognition, computer vision, and multimedia retrieval
Given a query data point q, the goal is to identify the nearest data point x (the nearest neighbor)
A natural generalization is a k-NN search, where we aim to find k closest points
The most studied instance of the problem is an exact nearest-neighbor search in
vector spaces
where a distance function is an actual metric distance
Exact methods work well only in low dimensional metric spaces
Experiments showed that exact methods can rarely outperform the sequential scan
when dimensionality exceeds ten
This is a well-known phenomenon, “the curse of dimensionality”
Approximate search methods can be much more efficient than exact ones
but this comes at the expense of a reduced search accuracy
The quality of approximate searching is often measured using recall
the average fraction of true neighbors returned by a search method
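Recall is simple to compute; a minimal version for a single query:

    def recall_at_k(true_neighbors, returned_neighbors):
        """Fraction of the true k nearest neighbors that were returned."""
        true_set = set(true_neighbors)
        return len(true_set & set(returned_neighbors)) / len(true_set)

    print(recall_at_k([1, 2, 3, 4], [2, 3, 9, 4]))  # 0.75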
It is based on the idea that if we rank a set of reference points, called pivots, with respect to their distances from a given point, the pivot rankings produced by two near points should be similar
In these methods, every data point is represented by a ranked list of pivots sorted by
the distance to this point.
Such ranked lists are called permutations
the distance between permutations is a good proxy for the distance between original
points
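A sketch of the representation, with assumed details (e.g., the Spearman footrule as the permutation distance): each point is replaced by the ranking of pivots by distance, and permutation distance serves as a proxy for the original distance.

    import numpy as np

    def permutation(point, pivots):
        """Rank of each pivot when pivots are sorted by distance to `point`."""
        distances = np.linalg.norm(pivots - point, axis=1)
        order = np.argsort(distances)        # pivot indices, nearest first
        ranks = np.empty(len(pivots), dtype=int)
        ranks[order] = np.arange(len(pivots))
        return ranks

    def footrule(p1, p2):
        """Spearman footrule distance between two permutations."""
        return int(np.abs(p1 - p2).sum())

    pivots = np.random.rand(8, 2)            # 8 random pivots in the plane
    a, b = np.random.rand(2), np.random.rand(2)
    print(footrule(permutation(a, pivots), permutation(b, pivots)))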
However, a comprehensive evaluation that involves a diverse set of large metric and
nonmetric data sets is lacking
We survey permutation-based methods for approximate k-NN search, which examine only a tiny subset of data points, namely those whose permutations are similar to the permutation of the query
Converting the vector of distances to pivots into a permutation entails information loss
but this loss is not necessarily detrimental
our preliminary experiments showed that using permutations instead of vectors of original distances results in slightly better retrieval performance
Permutation-based methods are particularly attractive when:
(1) The distance function is expensive (or the data resides on disk)
(2) The indexing costs of k-NN graphs are unacceptably high
(3) There is a need for a simple, but reasonably efficient, implementation that operates on
top of a relational database
Presented by: Zohreh Raghebi
Tamraparni Dasu
AT&T Labs–Research
Vladislav Shkapenyuk
AT&T Labs–Research
Divesh Srivastava
AT&T Labs–Research
Data are being collected and analyzed today at an unprecedented scale
Data errors (or glitches) in many domains, such as medicine and finance, can have severe consequences
need to develop data quality management systems to effectively detect and correct
glitches in the data
Data errors can arise throughout the data lifecycle, from data entry through storage, data integration, and analysis
Much of the data quality effort in database research has focused on detecting and correcting errors in data once the data has been collected
This is surprising since data entry time offers the first opportunity to detect and correct
errors
We address this problem in our paper and describe principled techniques for online data quality monitoring in a dynamic feed environment
While there has been significant focus on collecting and managing data feeds
it is only now that attention is turning to their quality
Our goal is to alert quickly when feed behavior deviates from expectations
Data feed management systems (DFMSs) have recently emerged to provide reliable, continuous data delivery to databases and data-intensive applications that need to perform real-time correlation and analysis
In prior work we have presented the Bistro DFMS, which is deployed at AT&T Labs
responsible for the real-time delivery of over 100 different raw feeds,
distributing data to several large-scale stream warehouses.
Bistro uses a publish-subscribe architecture to efficiently process incoming data from a
large number of data publishers,
identify logical data feeds
reliably distribute these feeds to remote subscribers
FIT naturally fits into this DFMS architecture:
both as a subscriber of data and metadata feeds
as a publisher of learned statistical models and identified outliers
we propose novel enhancements to permit a publish-subscribe approach
to incorporate data quality modules into the DFMS architecture
Early detection of errors by FIT enables data administrators to quickly remedy any problems
with the incoming feeds
FIT’s online feed monitoring can naturally detect errors from two distinct perspectives:
(i) errors in the data feed processes
e.g., missing or delayed delivery of files in a feed
by continuously analyzing the DFMS metadata feed
(ii) significant changes in distributions in the data records present in the feeds
e.g., erroneously switching from packets/second to bytes/second in a measurement feed
by continuously analyzing the contents of the data feeds.
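As a toy illustration of perspective (i) only (FIT's actual statistical models are richer), one could baseline per-interval file counts from the metadata feed and alert on large deviations:

    import statistics

    def feed_alert(history, current, z_threshold=3.0):
        """Alert when the current interval's file count deviates strongly."""
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1.0   # guard against zero spread
        return abs(current - mean) / stdev > z_threshold

    # A sudden drop suggests missing or delayed files in the feed:
    print(feed_alert([100, 98, 103, 101, 99], 12))  # True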
Presented by: Shahab Helmi
Fall 2015
Authors:
Publication:
VLDB 2015
Type:
Industrial Paper
What has been done in this paper?
The first attempt to implement three basic DP architectures in the deployed
telecommunication (telco) big data platform for data mining applications (churn
prediction).
What is DP?
Differential Privacy (DP) is an anonymization technique.
What is Anonymization?
A privacy protection technique, which removes or replaces the explicitly sensitive
identifiers (ID) of customers, such as the identification number or mobile phone number,
by random mapping or encryption mechanisms in DB, and provides the sanitized dataset
without any ID information to DM services.
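For background (textbook DP, not the paper's specific architectures): the standard Laplace mechanism adds noise scaled by sensitivity/epsilon to a numeric answer; a smaller privacy budget epsilon means more noise and stronger privacy.

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        """Return the answer with Laplace(0, sensitivity/epsilon) noise added."""
        return true_answer + np.random.laplace(0.0, sensitivity / epsilon)

    # Counting query (sensitivity 1) under a small privacy budget:
    print(laplace_mechanism(true_answer=1000, sensitivity=1, epsilon=0.1))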
Who is a Churner?
A person who quits the service! Customer churn is one of the biggest challenges in the telco industry.
Telecommunication (telco) big data platform
Telecommunication (telco) big data records billions of customers’ communication behaviors over years, worldwide. Mining this big data to improve customer experience for higher profits has become one of the important tasks for telco operators.
Implementation of DP in the telco big data platform: Data Publication Architecture, Separated Architecture, and Hybridized Architecture.
Extensive experimental results on big data:
Influence of the privacy budget parameter on different DP implementations with industrial big data.
The accuracy and privacy budgets trade-off.
The performance of the three basic DP architectures in churn prediction.
How volume and variety of big data affect the performance.
Comparing the DP implementation performance between the simple decision tree and the
relatively complicated random forest classifiers in churn prediction.
Findings:
All DP architectures have a relative accuracy loss of less than 5% with a weak privacy guarantee, and of more than 15% (up to 30%) with a strong privacy guarantee.
Among all three basic DP architectures, the Hybridized architecture performs the best.
Prediction error:
increases with the number of features.
decreases with the growth of the training data volume.
Anonymization techniques, such as k-anonymity
DP is currently the strongest privacy protection technique; it needs no assumptions about the attacker’s background information. The attacker can be assumed to have maximum knowledge.
Studying DP in different scenarios:
Histogram queries
Statistical geospatial data queries
Frequent itemset mining
Crowdsourcing …
Dataset: collected from one of the biggest telco operators in China, comprising 9 consecutive months of behavior records for more than 2 million prepaid customers from 2013 to 2014 (around 2M users).
Experiments: checking the effect of the following properties on churn prediction accuracy:
Privacy budget parameter.
Number of features.
Training data volume
AUC: Area under ROC Curve
ROC is a graphical plot that illustrates the performance of a binary classifier system.
[Wikipedia]
The effect of number of features on prediction accuracy (1M training records)
The effect of training data volume on prediction accuracy
Decision Trees vs. Random Forests
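A hedged sketch of this comparison on synthetic data (the paper's dataset is proprietary telco data): fit both classifiers and score them with AUC, the metric from the previous slides.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for model in (DecisionTreeClassifier(random_state=0),
                  RandomForestClassifier(n_estimators=100, random_state=0)):
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(type(model).__name__, round(auc, 3))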
Presented by: Shahab Helmi
Fall 2015
Authors:
Publication:
VLDB 2015
Type:
Demonstration Paper
Data analysts often engage in data exploration tasks to discover interesting data
patterns, without knowing exactly what they are looking for (exploratory analysis).
Users try to make sense of the underlying data space by navigating through it. The
process includes a great deal of experimentation with queries, backtracking on the
basis of query results, and revision of results at various points in the process.
When the data size is huge, finding the relevant subspace and relevant results takes a long time.
AIDE is an automated data exploration system that:
Steers the user towards interesting data areas based on her relevance feedback on
database samples.
Aims to achieve the goal of identifying all database objects that match the user
interest with high efficiency.
It relies on a combination of machine learning techniques and sample selection
algorithms to provide effective data exploration results as well as high interactive
performance over databases of large sizes.
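A sketch of the exploration loop's shape under assumed details (AIDE's actual sample-selection algorithms are more sophisticated): fit a classifier on the user's labeled samples, then show next the unlabeled samples closest to its decision boundary.

    import numpy as np
    from sklearn.svm import SVC

    def next_samples(X_pool, X_labeled, y_labeled, batch=5):
        clf = SVC(kernel="rbf").fit(X_labeled, y_labeled)
        # Samples nearest the boundary are most informative for refining
        # the predicted user-interest region.
        margins = np.abs(clf.decision_function(X_pool))
        return np.argsort(margins)[:batch]

    rng = np.random.default_rng(0)
    X_pool = rng.random((1000, 3))                       # unexplored samples
    X_seed = np.vstack([rng.random((5, 3)) * 0.4,        # simulated "not relevant"
                        rng.random((5, 3)) * 0.4 + 0.6]) # simulated "relevant"
    y_seed = np.array([0] * 5 + [1] * 5)                 # user relevance feedback
    print(next_samples(X_pool, X_seed, y_seed))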
Datasets:
AuctionMark: information on auction items and their bids. 1.77GB.
Sloan Digital Sky Survey: This is a scientific data set generated by digital surveys of
stars and galaxies. Large data size and complex schema. 1GB-100GB.
US housing and used cars: available through the DAIDEM Lab
System Implementation:
Java: ML, clustering and classification algorithms, such as SVM, k-means, decision trees
PostgreSQL