pptx - Computer Science and Engineering
Type: Research Paper
Authors: David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, Sunil P. Chakkappen (Oracle)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Query Cost!
Estimating the “size” of the resultant table when a join is performed on two or more tables
Accommodating the filter predicates
The related work of this paper focuses on:
Sampling techniques such as Bifocal Sampling and End-Biased Sampling
Sketch-based methods for join size estimation
Correlated Sampling
A novel sketch-based approach to join size estimation
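The correlated-sampling idea can be illustrated with a small sketch. This is a toy illustration, not the authors' algorithm: the hash construction, sampling rate, and tables below are assumptions.

```python
import hashlib
from collections import Counter

def keep(key, p):
    """Deterministically map a join key to [0, 1) and keep it if it falls below p."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) / 16**32 < p

def correlated_sample(table, key_col, p):
    """Both tables use the SAME hash, so matching keys survive together."""
    return [row for row in table if keep(row[key_col], p)]

def estimate_join_size(r_sample, s_sample, key_col, p):
    """A joining pair (r, s) survives iff their shared key hashes below p,
    so E[|sample join|] = p * |true join|; scale the sample join count by 1/p."""
    s_counts = Counter(row[key_col] for row in s_sample)
    return sum(s_counts[row[key_col]] for row in r_sample) / p

# Toy tables R and S joining on "id" (hypothetical data).
R = [{"id": i % 100, "a": i} for i in range(10_000)]
S = [{"id": i % 100, "b": i} for i in range(5_000)]
p = 0.1
estimate = estimate_join_size(correlated_sample(R, "id", p),
                              correlated_sample(S, "id", p), "id", p)
r_hist, s_hist = Counter(r["id"] for r in R), Counter(s["id"] for s in S)
exact = sum(r_hist[k] * s_hist[k] for k in r_hist)
print(f"estimated ~ {estimate:,.0f}   exact = {exact:,}")
```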
Type: Demonstration Paper
Authors: Damian Bursztyn (University of Paris-Sud), François Goasdoué (University of Rennes), Ioana Manolescu (University of Paris-Sud)
Presented by: Siddhant Kulkarni
Term: Fall 2015
RDF = Resource Description Framework
You can query this data!
Processing queries and displaying results is called query
answering!
Two ways of query answering:
Saturation-based (SAT)
Reformulation-based (REF) – PROBLEM!
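A toy sketch of the reformulation idea, assuming only rdfs:subClassOf reasoning (the class hierarchy and triples are invented): instead of materializing ("saturating") all implied triples, the query itself is rewritten into a union over the relevant subclasses and evaluated on the raw data.

```python
# Minimal sketch of reformulation-based (REF) query answering over RDF.

SCHEMA = [            # (subclass, superclass)
    ("Student", "Person"),
    ("PhDStudent", "Student"),
]
DATA = [              # (subject, predicate, object) instance triples
    ("alice", "rdf:type", "Person"),
    ("bob", "rdf:type", "Student"),
    ("carol", "rdf:type", "PhDStudent"),
]

def subclasses_of(cls):
    """All classes whose instances are also instances of cls (reflexive, transitive)."""
    result, frontier = {cls}, {cls}
    while frontier:
        frontier = {sub for sub, sup in SCHEMA if sup in frontier} - result
        result |= frontier
    return result

def answer_instances_of(cls):
    """REF: rewrite 'instances of cls' into a union over cls and its subclasses,
    then evaluate against the *unsaturated* data triples."""
    targets = subclasses_of(cls)
    return sorted(s for s, p, o in DATA if p == "rdf:type" and o in targets)

print(answer_instances_of("Person"))   # ['alice', 'bob', 'carol']
```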
Showcases a large set of REF techniques (along with one the authors have presented in another paper)
What the demo system allows you to do:
Pick an RDF graph and visualize its statistics
Choose a query and an answering method
Observe the evaluation in real time
Modify the RDF data and re-evaluate
Authors: Xiaolan Wang, Mary Feng, Yue Wang, Xin Luna Dong, Alexandra Meliou (University of Massachusetts, University of Iowa, Google Inc.)
Presented by: Omar Alqahtani
Fall 2015
Retrieving high quality datasets from voluminous and diverse sources is crucial for
many data-intensive applications.
However, the retrieved datasets often contain noise and other discrepancies.
Traditional data cleaning tools mostly try to answer “Which data is incorrect?”
A demonstration of DATAXRAY, a general-purpose, highly scalable tool.
It explains why and how errors happen in a data generative process
It answers:
Why are there errors in the data? or
How can I prevent further errors?
It finds groupings of errors that may be due to the same cause. But how?
It identifies these groups based on their common characteristics (features).
Features are organized in a hierarchical structure based on simple
containment relationships.
DATAXRAY uses a top-down algorithm to explore the feature hierarchy:
To identify the set of features that best summarize all erroneous data
elements.
It uses a cost function based on Bayesian analysis to derive the set of features with the highest probability of being associated with the causes of the mistakes in a dataset.
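A highly simplified sketch of the "group errors by shared features" idea. The scoring here is a plain precision/coverage heuristic, not DataXRay's Bayesian cost function, and all elements and features are invented.

```python
from collections import defaultdict

# Each element has a set of features (e.g., source, extractor, page section)
# and a flag saying whether it was found to be erroneous.
elements = [
    ({"source:A", "extractor:regex"}, True),
    ({"source:A", "extractor:regex"}, True),
    ({"source:A", "extractor:dom"},   False),
    ({"source:B", "extractor:regex"}, False),
    ({"source:B", "extractor:dom"},   False),
    ({"source:B", "extractor:regex"}, True),
]

def feature_stats(elements):
    errors, totals = defaultdict(int), defaultdict(int)
    for features, is_error in elements:
        for f in features:
            totals[f] += 1
            errors[f] += is_error
    return errors, totals

def diagnose(elements, min_precision=0.6):
    """Return features whose associated elements are mostly erroneous,
    ordered by how many errors they explain."""
    errors, totals = feature_stats(elements)
    candidates = [(f, errors[f], errors[f] / totals[f]) for f in totals]
    good = [c for c in candidates if c[2] >= min_precision]
    return sorted(good, key=lambda c: -c[1])

for feature, covered, precision in diagnose(elements):
    print(f"{feature}: explains {covered} errors (precision {precision:.2f})")
```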
Presented by: Ranjan_KY
Fall 2015
Web scraping (or wrapping) is a popular means of acquiring data from the web.
Automatic wrapper generation has become scalable, enabling data acquisition processes that involve thousands of sources.
No scalable tools exist that support these tasks.
Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structures to knowledge bases and microdata.
Nevertheless, automatically generated wrappers often suffer from errors, resulting in under- or over-segmented data together with missing or spurious content.
Under- and over-segmentation of attributes are commonly caused by irregular HTML markup or by multiple attributes occurring within the same DOM node.
Incorrect column types are instead associated with a lack of domain knowledge, supervision, or microdata during wrapper generation.
The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions produce cleaner data.
WADaR takes as input a (possibly incorrect) wrapper and a target
relation schema, and iteratively repairs both the generated
relations and the wrapper by observing the output of the
wrapper execution.
A key observation is that errors in the extracted relations are
likely to be systematic as wrappers are often generated from
templated websites.
WADaR’s repair process
(i) Annotating the extracted relations with standard entity
recognizers,
(ii) Computing Markov chains describing the most likely
segmentation of attribute values in the records, and
(iii) Inducing regular expressions that re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.
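A minimal illustration of step (iii), re-segmenting under-segmented values with an induced regular expression. In WADaR the segmentation is learned from entity annotations and Markov chains; here the regex, target schema, and data are hard-coded assumptions.

```python
import re

TARGET_SCHEMA = ("title", "year", "price")

# Under-segmented values produced by a (hypothetical) broken wrapper: three
# attributes were extracted into a single column.
raw_rows = [
    ["The Hobbit (1937) - $12.99"],
    ["Dune (1965) - $9.50"],
]

# Induced re-segmentation rule: title, 4-digit year in parentheses, price.
SEGMENT_RE = re.compile(r"^(?P<title>.+?) \((?P<year>\d{4})\) - \$(?P<price>[\d.]+)$")

def repair(rows):
    repaired = []
    for (value,) in rows:
        m = SEGMENT_RE.match(value)
        if m:
            repaired.append(tuple(m.group(col) for col in TARGET_SCHEMA))
        else:
            repaired.append((value, None, None))  # leave unmatched rows for inspection
    return repaired

for record in repair(raw_rows):
    print(record)
# ('The Hobbit', '1937', '12.99')
# ('Dune', '1965', '9.50')
```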
Related work is not evaluated in detail in this paper; representative references include:
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of
partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB,
6(13):1486–1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record
alignment. In SIGMOD, pages 1713–1728. ACM, 2015.
Rahul Potharaju†, Joseph Chan†, Luhui Hu†, Cristina Nita-Rotaru∗, Mingshi Wang†, Liyuan Zhang†, Navendu Jain‡
†Microsoft, ∗Purdue University, ‡Microsoft Research
Presented by: Zohreh Raghebi
Configuration errors have a significant impact on system performance and availability
For instance, a misconfiguration in the user-authentication system caused login
problems for several Google services including Gmail and Drive
A software misconfiguration in Windows Azure caused a 2.5 hour outage in 2012
Many configuration errors are due to faulty patches
e.g., changed file paths causing incompatibility with other applications, empty Registry fields,
and failed uninstallations
Unfortunately, troubleshooting misconfigurations is time-consuming, hard, and expensive.
First, today's software configurations are becoming increasingly complex and large, comprising hundreds of parameters and their settings.
Second, many of these errors manifest as silent failures, leaving users clueless:
they either search online or contact customer service and support (CSS)
loss of productivity, time, and effort
Several research efforts have proposed techniques to identify, diagnose, and fix configuration problems.
Some commercial tools are also available to manage system configurations or to automate certain configuration tasks.
However, many of these approaches either assume the presence of a large set of configurations to apply statistical testing (e.g., PeerPressure),
periodically checkpoint disk state (e.g., Chronus), risking high overheads,
or use data-flow analysis for error tracing (e.g., ConfAid).
This paper presents the design, implementation and evaluation of ConfSeer
a system that aims to proactively find misconfigurations on user machines
using a knowledge base (KB) of technical solutions
ConfSeer focuses on addressing parameter-related misconfigurations, as they account
for a majority of user configuration errors
The key idea behind ConfSeer is
to enable configuration-diagnosis-as-a-service
by automatically matching configuration problems to their solutions described in free-form text.
First, ConfSeer takes snapshots of configuration files from a user machine as input.
These are typically uploaded by agents running on these machines.
Second, it extracts the configuration parameter names and value settings from the snapshots
and matches them against a large set of KB articles,
which are published and actively maintained by many vendors.
Third, after a match is found, ConfSeer automatically pinpoints the configuration error with its matching KB article,
so users can apply the suggested fix (a rough sketch of this matching step follows).
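The sketch below shows only the snapshot-versus-KB matching idea. The KB articles, parameter names, and constraint checks are fabricated for illustration; the real system uses full-text indexing, synonym expansion, and constraints extracted from free-form article text.

```python
KB_ARTICLES = [
    {"id": "KB001",
     "parameter": "MaxUserPort",
     "constraint": lambda v: int(v) >= 5000,
     "fix": "Increase MaxUserPort to at least 5000."},
    {"id": "KB002",
     "parameter": "EnableTLS",
     "constraint": lambda v: v.lower() == "true",
     "fix": "Set EnableTLS to true."},
]

def parse_snapshot(text):
    """Extract parameter=value settings from a simple key=value snapshot."""
    settings = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

def diagnose(snapshot_text):
    """Flag settings that violate the constraint described in a matching KB article."""
    settings = parse_snapshot(snapshot_text)
    findings = []
    for article in KB_ARTICLES:
        value = settings.get(article["parameter"])
        if value is not None and not article["constraint"](value):
            findings.append((article["parameter"], value, article["id"], article["fix"]))
    return findings

snapshot = "MaxUserPort = 1024\nEnableTLS = false\nTimeout = 30\n"
for param, value, kb, fix in diagnose(snapshot):
    print(f"{param}={value} violates {kb}: {fix}")
```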
ConfSeer is the first approach that combines traditional IR and NLP techniques (e.g., indexing, synonyms)
with new domain-specific techniques (e.g., constraint evaluation, synonym expansion with named-entity resolution)
to build an end-to-end practical system for detecting misconfigurations.
It is part of a larger system-building effort
to automatically detect software errors and misconfigurations
by leveraging a broad range of data sources, such as knowledge bases, technical help articles, and question-and-answer forums,
which contain valuable yet unstructured information for diagnosis.
Dan Olteanu
LogicBlox, Inc.
[email protected]
Presented by: Zohreh Raghebi
An increasing number of self-service enterprise applications require live programming in the database:
the traditional edit-compile-run cycle is abandoned in favor of a more interactive user experience,
with live feedback on a program's runtime behavior.
In retail-planning spreadsheets backed by scalable, full-fledged database systems,
users can define and change schemas of pivot tables, and formulas over these schemas, on the fly.
These changes trigger updates to the application code on the database server
the challenge is to quickly update the user spreadsheets
in response to these changes
To achieve interactive response times in the real world,
changes to application code must be quickly compiled and hot-swapped into the running program,
and the effects of those changes must be efficiently computed in an incremental fashion.
In this paper, we discuss the technical challenges in supporting live programming in the
database.
The workhorse architectural component is a “meta-engine” that:
incrementally maintains metadata representing application code,
guides its compilation into an internal representation in the database kernel,
and orchestrates maintenance of materialized results of the application code based on those changes.
In contrast, the engine proper works on application data and can incrementally maintain
materialized results in the face of data updates.
The meta-engine instructs the engine which materialized results need to be (partially or completely) recomputed.
Without the meta-engine, the engine would unnecessarily recompute all materialized results from scratch
every time the application code changes,
rendering the system unusable for live programming.
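A toy sketch of that idea: track which materialized results depend on which rules and recompute only the affected ones when code changes. The rule and view names are invented, and the real meta-engine operates on LogiQL metadata inside the database kernel, not on Python dictionaries.

```python
class MetaEngine:
    def __init__(self):
        self.depends_on = {}     # view -> set of rules it is derived from
        self.compute = {}        # view -> function that (re)materializes it
        self.materialized = {}   # view -> cached result

    def register_view(self, view, rules, compute):
        self.depends_on[view] = set(rules)
        self.compute[view] = compute
        self.materialized[view] = compute()

    def on_rule_change(self, rule):
        """Recompute only the views whose definition involves the changed rule."""
        affected = [v for v, rules in self.depends_on.items() if rule in rules]
        for view in affected:
            self.materialized[view] = self.compute[view]()
        return affected

# Hypothetical application data and derivation rules.
sales = [("store1", 10), ("store2", 7)]
engine = MetaEngine()
engine.register_view("total_sales", rules={"sum_rule"},
                     compute=lambda: sum(q for _, q in sales))
engine.register_view("store_count", rules={"count_rule"},
                     compute=lambda: len({s for s, _ in sales}))

# Changing only the summation rule leaves 'store_count' untouched.
print(engine.on_rule_change("sum_rule"))   # ['total_sales']
```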
The paper presents the meta-engine solution implemented in the LogicBlox commercial system.
LogicBlox offers a unified runtime for the enterprise software stack
LogicBlox applications are written in an extension of Datalog called LogiQL
LogiQL acts as a declarative programming model unifying OLTP, OLAP, and prescriptive and predictive analytics
It offers rich language constructs for expressing derivation rules
the meta-engine uses rules expressed in a Datalog-like language called MetaLogiQL
these operate on metadata representing LogiQL rules
Outside of the database context:
the design may even provide a novel means of building incremental compilers
for general-purpose programming languages.
Presented by: Dardan Xhymshiti
Fall 2015
Authors: Eli Cortez, Philip A. Bernstein, Yeye He, Lev Novik (Microsoft Corporation)
Conference: VLDB
Type: Demonstration
Data discovery of relevant information in relational databases.
Problem of generating reports.
To find relevant information, users have to find the database tables that are relevant to the task and, for each of them, understand its content to determine whether it is truly relevant, etc.
The schema’s table and column names are often not very descriptive of the
content.
Example: among 412 data columns in 639 tables from 29 databases used by Microsoft's IT organization, 28% of all columns had very generic names such as:
name, id, description, field, code
A typical corporate database table with generic column
names
Such non-descriptive column names make it difficult to
search and understand the table.
One solution: using data stewards to enrich the database tables and columns with textual descriptions.
Time consuming
Databases that are used less frequently tend to be ignored.
Barcelos: automatically annotates the columns of database tables.
It annotates even those tables that are not frequently used.
How does it work?
It works by mining spreadsheets.
Many of these spreadsheets are generated by queries.
It uses spreadsheet’s column names as candidate annotations for the
corresponding database columns.
For the example table, Barcelos produces the annotations:
TeamID and Team for the first column,
Delivery Team and Team for the second column,
Line of Business and Business for the third.
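The spreadsheet-mining idea might be sketched as follows: if a spreadsheet column's values largely overlap with a (generically named) database column's values, the spreadsheet header becomes a candidate annotation, ranked by how often it recurs. The data, names, and overlap threshold below are made up; Barcelos' actual matching and ranking are more sophisticated.

```python
from collections import Counter

db_columns = {                       # generic column names -> values
    "field1": {"Contoso", "Fabrikam", "Adventure Works"},
    "field2": {"T-101", "T-205", "T-309"},
}

spreadsheets = [                     # header -> values, one dict per spreadsheet
    {"Customer Name": {"Contoso", "Fabrikam"}, "Team ID": {"T-101", "T-205"}},
    {"Customer": {"Contoso", "Adventure Works"}, "Team ID": {"T-309"}},
]

def candidate_annotations(db_columns, spreadsheets, min_overlap=0.5):
    """Collect spreadsheet headers whose values overlap a database column,
    and rank them by how many spreadsheets support the match."""
    candidates = {col: Counter() for col in db_columns}
    for sheet in spreadsheets:
        for header, values in sheet.items():
            for col, col_values in db_columns.items():
                overlap = len(values & col_values) / len(values)
                if overlap >= min_overlap:
                    candidates[col][header] += 1
    return {col: counts.most_common() for col, counts in candidates.items()}

for col, ranked in candidate_annotations(db_columns, spreadsheets).items():
    print(col, "->", ranked)
# field1 -> [('Customer Name', 1), ('Customer', 1)]
# field2 -> [('Team ID', 2)]
```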
The authors provide a method for extracting relevant tables from an enterprise database.
A method for identifying and ranking relevant column
annotations.
An implementation of Barcelos and an experimental evaluation
that shows its efficiency and effectiveness.
Presented by: Dardan Xhymshiti
Fall 2015
Authors: Dinesh Das, Jiaqi Yan, Mohamed Zait, Satyanarayana R. Valluri, Nirav Vyas, Ramarajan Krishnamachari, Prashant Gaharwar, Jesse Kamp, Niloy Mukherjee
Conference: VLDB
Type: Industry paper
Database In-Memory (column-oriented)
Database On-Disk (row-oriented)
Oracle Database 12c In-Memory:
the industry's first dual-format database (In-Memory & On-Disk)
Problem: optimizing query processing.
Optimizations for on-disk query processing are not efficient for in-memory query processing.
Motivation:
Modify the query optimizer to generate execution plans optimized for the specific format – row-major or columnar – that will be scanned during query execution.
Various vendors have taken different approaches to generating execution plans for in-memory columnar tables:
Make no change to the query optimizer, expecting that queries on the different data format will simply perform better.
Use heuristic methods to allow the optimizer to generate different plans.
Limit optimizer enhancements to specific workloads like star queries.
The authors instead provide a comprehensive optimizer redesign to handle a variety of workloads on databases with varied schemas and different data formats (a toy illustration of format-aware costing follows).
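A very simplified illustration of format-aware scan costing: an in-memory columnar scan pays per referenced column, while a row-major disk scan pays per row regardless of how many columns are used. The constants are invented and bear no relation to Oracle's actual cost model.

```python
def columnar_scan_cost(num_rows, referenced_cols, cost_per_value=0.001):
    """In-memory columnar scan: cost grows with the number of referenced columns."""
    return num_rows * len(referenced_cols) * cost_per_value

def row_scan_cost(num_rows, cost_per_row=0.01):
    """Row-major disk scan: whole rows are read regardless of columns used."""
    return num_rows * cost_per_row

def choose_scan(num_rows, referenced_cols, in_memory):
    """Pick the cheaper scan plan given the formats available for the table."""
    plans = {"row-major disk scan": row_scan_cost(num_rows)}
    if in_memory:
        plans["in-memory columnar scan"] = columnar_scan_cost(num_rows, referenced_cols)
    return min(plans.items(), key=lambda kv: kv[1])

# Analytic query touching 2 of 50 columns on a 10M-row table loaded in memory:
print(choose_scan(10_000_000, ["sales", "region"], in_memory=True))
# ('in-memory columnar scan', 20000.0)
```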
Column-major tables date back to the 1980s:
Sybase IQ.
MonetDB and C-Store around the 2000s.
Presented by: Shahab Helmi
Fall 2015
Authors:
Publication: VLDB 2015
Type: Demonstration Paper
Data analysts often engage in data exploration tasks to discover interesting data
patterns, without knowing exactly what they are looking for (exploratory analysis).
Users try to make sense of the underlying data space by navigating through it. The
process includes a great deal of experimentation with queries, backtracking on the
basis of query results, and revision of results at various points in the process.
When the data size is huge, finding the relevant sub-space and relevant results takes a long time.
AIDE is an automated data exploration system that:
Steers the user towards interesting data areas based on her relevance feedback on
database samples.
Aims to achieve the goal of identifying all database objects that match the user interest
with high efficiency.
It relies on a combination of machine learning techniques and sample selection
algorithms to provide effective data exploration results as well as high interactive
performance over databases of large sizes.
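A toy exploration loop in the spirit of AIDE: repeatedly show the "user" a few sampled objects, learn a rough model of the relevant region from the feedback, and steer subsequent samples toward it. The 2-D data, simulated user, and bounding-box "model" are stand-ins for AIDE's decision-tree/SVM-based exploration over a real DBMS.

```python
import random

random.seed(0)
DATA = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(5000)]

def simulated_user(point):
    """Stand-in for the analyst: interested in objects with 20<x<40 and 50<y<70."""
    x, y = point
    return 20 < x < 40 and 50 < y < 70

def fit_box(relevant):
    """Crude 'model' of the interest region: bounding box of relevant samples."""
    xs, ys = zip(*relevant)
    return min(xs), max(xs), min(ys), max(ys)

def explore(rounds=5, per_round=50):
    relevant, candidates = [], DATA
    for _ in range(rounds):
        for p in random.sample(candidates, min(per_round, len(candidates))):
            if simulated_user(p):
                relevant.append(p)
        if relevant:  # steer the next round's samples toward the learned region
            x0, x1, y0, y1 = fit_box(relevant)
            pad = 5
            candidates = [(x, y) for x, y in DATA
                          if x0 - pad < x < x1 + pad and y0 - pad < y < y1 + pad]
    return fit_box(relevant) if relevant else None

print("learned region (xmin, xmax, ymin, ymax):", explore())
```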
Datasets:
AuctionMark: information on auction items and their bids. 1.77 GB.
Sloan Digital Sky Survey: a scientific data set generated by digital surveys of stars and galaxies. Large data size and complex schema. 1 GB–100 GB.
US housing and used cars: available through the DAIDEM Lab.
System implementation:
Java: ML, clustering, and classification algorithms such as SVM, k-means, and decision trees
PostgreSQL
Presented by: Shahab Helmi
Fall 2015
Authors:
Publication: VLDB 2015
Type: Industry Paper
This paper presents the work on the SAP HANA Scale-out Extension: a novel distributed
database architecture designed to support large scale analytics over real-time data.
High performance OLAP with massive scale-out capabilities.
Concurrently allowing OLTP workloads.
New design of core database components such as query processing, concurrency
control, and persistence, using high throughput low-latency networks and storage
devices.
Enables analytics over real-time changing data and allows fine-grained, user-specified service-level agreements (SLAs) on data freshness.
There are two fundamental paradigm shifts happening in enterprise data management:
Dramatic increase in the amount of data being produced and persisted by enterprises.
Need for businesses to have analytical access to up-to-date data in order to make
critical business decisions.
Enterprises want real-time insights from their data in order to make critical, time-sensitive business decisions -> ETL pipelines for offline analytical processing of day-, week-, or even month-old data do not work.
On one hand, a system must provide on-line transaction processing (OLTP) support to
have real-time changes to data reflected in queries.
On the other, systems need to scale to very large data sizes and provide on-line
analytical processing (OLAP) over these large and changing data sets.
Mixed transactional and analytical workloads.
Ability to take advantage of emerging hardware:
High core count processors, SIMD instructions, large processor caches, and large memory
capacities.
Storage class memories and high-bandwidth low-latency network interconnects.
Supporting cloud data storage.
Heterogeneous scale-out of OLTP and OLAP workloads.
Decoupling query processing from transaction management.
The ability to improve performance by scheduling snapshots for read-only OLAP transactions according to fine-grained SLAs (see the sketch after this list).
A scalable distributed log providing durability, fault tolerance, and asynchronous
update dissemination to compute engines.
Support for different compute engines:
e.g., SQL engines, R, Spark, graph, and text.
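A toy sketch of freshness-SLA-driven snapshot selection for read-only OLAP transactions: a query reuses the newest snapshot if it is within its freshness bound, otherwise a new snapshot is taken first. The class, timestamps, and policy are invented for illustration.

```python
import time

class SnapshotManager:
    def __init__(self):
        self.snapshots = []          # list of (created_at, snapshot_id)
        self._next_id = 0

    def take_snapshot(self):
        self._next_id += 1
        snap = (time.time(), self._next_id)
        self.snapshots.append(snap)
        return snap

    def snapshot_for(self, freshness_sla_seconds):
        """Reuse the newest snapshot if it is fresh enough for this query's SLA,
        otherwise create a new one."""
        now = time.time()
        if self.snapshots:
            created_at, snap_id = self.snapshots[-1]
            if now - created_at <= freshness_sla_seconds:
                return ("reused", snap_id)
        return ("created", self.take_snapshot()[1])

mgr = SnapshotManager()
print(mgr.snapshot_for(freshness_sla_seconds=60))   # ('created', 1)
print(mgr.snapshot_for(freshness_sla_seconds=60))   # ('reused', 1)
print(mgr.snapshot_for(freshness_sla_seconds=0))    # ('created', 2) - wants up-to-the-moment data
```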
Mixed OLTP/OLAP: HyPer, ConuxDB.
Scale-out OLTP Systems: Calvin, H-Store.
Shared Log: CORFU, Kafka, BookKeeper
Presented by: Ashkan Malekloo
Fall 2015
Type: Demonstration Paper
Authors: Quan Pham, Severin Thaler, Tanu Malik, Ian Foster, Boris Glavic
Publication: VLDB 2015
Recently, application virtualization (AV) has emerged as a lightweight alternative for sharing and efficient repeatability.
AV approaches:
Linux Containers
CDE (Using System Call Interposition to Automatically Create Portable Software Packages)
Generally, application virtualization techniques can also be applied to DB applications.
However, these techniques treat a database system as a black-box application process,
oblivious to the query statements or the database model supported by the database system.
LDV: a tool for creating packages of DB applications.
LDV package encapsulates:
Application
Relevant dependencies
Relevant data
LDV relies on data provenance:
its ability to create self-contained packages of a DB application that can be shared and run on different machine configurations, without the need to install a database system and set up a database.
Extracting a slice of the database accessed by an application (a toy illustration follows).
How LDV's execution traces can be used to understand how the files, processes, SQL operations, and database content of an application are related to each other.
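A rough illustration of the database-slicing idea: run the application's queries, record which rows they touch, and copy only those rows into the package database. LDV does this via provenance captured inside the DBMS plus OS-level monitoring; the sqlite-based sketch, schema, and queries below are made up.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, subject TEXT, score REAL)")
source.executemany("INSERT INTO experiments VALUES (?, ?, ?)",
                   [(1, "mouse", 0.9), (2, "rat", 0.4), (3, "mouse", 0.7)])

# The "application" only ever reads mouse experiments.
app_queries = ["SELECT id, subject, score FROM experiments WHERE subject = 'mouse'"]

package = sqlite3.connect(":memory:")   # stands in for the shareable package DB
package.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, subject TEXT, score REAL)")

touched = set()
for q in app_queries:
    for row in source.execute(q):
        if row[0] not in touched:        # row-level provenance, keyed by id
            touched.add(row[0])
            package.execute("INSERT INTO experiments VALUES (?, ?, ?)", row)

print(list(package.execute("SELECT * FROM experiments")))
# [(1, 'mouse', 0.9), (3, 'mouse', 0.7)] -- the slice the application actually needs
```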