Query processing – general issues


Data Warehouse Operational Issues
Potential Research Directions
Query processing
General issues
• Query processing for OLAP queries has been addressed extensively in the past, including standard OLAP operations, ad-hoc queries, etc.
Research Challenges:
1. There is a need for a “standardized” query language for OLAP, i.e., a cube algebra.
2. For several queries it would be sufficient to provide approximate answers. We need to identify approximation “measures” and formal techniques to evaluate them.
3. Integrate “what-if” analysis facilities in current OLAP tools, with particular emphasis on improved performance, in order to avoid using more than one tool for this purpose (as currently happens).
4. Support “holistic” aggregation (i.e., aggregation that cannot exploit differentials) efficiently. Examples involve the computation of top-K or median values in the presence of streaming insertions and deletions (see the sketch below).
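To make concrete why holistic aggregates are harder than distributive ones, here is a minimal Python sketch (mine, not from the transcript) of maintaining a median under streaming insertions and deletions: unlike SUM or COUNT, which can be updated in constant time from a differential, the whole multiset must be retained.

import bisect

class StreamingMedian:
    """Median of a multiset under insertions and deletions. The median is
    holistic: it cannot be updated from a differential alone, so the
    entire multiset must be kept (here, as a sorted list)."""

    def __init__(self):
        self._data = []  # sorted multiset of current values

    def insert(self, x):
        bisect.insort(self._data, x)

    def delete(self, x):
        i = bisect.bisect_left(self._data, x)
        if i < len(self._data) and self._data[i] == x:
            self._data.pop(i)

    def median(self):
        n = len(self._data)
        if n == 0:
            return None
        mid = self._data[n // 2]
        return mid if n % 2 else (self._data[n // 2 - 1] + mid) / 2

m = StreamingMedian()
for v in [5, 1, 9, 3]:
    m.insert(v)
m.delete(9)
print(m.median())  # -> 3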
Stream data
Streams are sequences of data continuously flowing from a data source, with the particular characteristic that, due to their volume, each tuple is available for querying only within a limited time window (a minimal sketch follows the examples below).
Stream examples include:
• Stock rates extracted from the web
• Packets going through a router
• Clickstreams from a web site
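A minimal sketch (names and assumptions are mine) of the limited-window property: a buffer that expires tuples older than a fixed span, so queries only ever see recent data.

import time
from collections import deque

class SlidingWindow:
    """Keeps only the tuples that arrived within the last `span` seconds;
    older tuples are expired and no longer available for querying."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.buf = deque()  # (arrival_time, item) in arrival order

    def append(self, item, now=None):
        now = time.time() if now is None else now
        self.buf.append((now, item))
        self._expire(now)

    def query(self, now=None):
        self._expire(time.time() if now is None else now)
        return [item for _, item in self.buf]

    def _expire(self, now):
        while self.buf and self.buf[0][0] < now - self.span:
            self.buf.popleft()

# e.g., a 60-second window over router packets:
w = SlidingWindow(60)
w.append({"src": "10.0.0.1", "bytes": 512})
print(w.query())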
Query processing
OLAP aggregation and stream data
• Example: the standard OLAP aggregation queries we already know (cube operator, etc.), now evaluated over stream data.
Research challenges:
1. Issues related to processing OLAP queries over streams; streams are eventually stored in the DW (which must be updated consistently) – an interesting issue is defining consistency between the stream and the DW.
2. How to split query processing between on-line queries on stream data and queries on the contents of the warehouse; it might also be necessary to store some of the results back in the DW to allow for future queries.
3. Synchronization of multiple incoming streams with different arrival rates, e.g., when we need to combine them with some join operation and aggregation, in the presence of consistency constraints (see the sketch below).
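As an illustration of challenge 3, here is a minimal sketch (mine, not from the transcript) of a time-window join between two streams with different arrival rates; for clarity the arrivals are merged offline, whereas a real stream join would interleave them as they arrive.

from collections import defaultdict

def windowed_join(left, right, key, window):
    """Join two timestamped streams on `key`, matching tuples whose
    timestamps differ by at most `window`. Inputs are iterables of
    (timestamp, record) pairs, each sorted by timestamp."""
    left_buf = defaultdict(list)   # key -> [(ts, record)] still in window
    right_buf = defaultdict(list)
    out = []
    merged = [(ts, "L", r) for ts, r in left] + [(ts, "R", r) for ts, r in right]
    merged.sort(key=lambda t: t[0])  # offline merge, for clarity only
    for ts, side, rec in merged:
        own, other = (left_buf, right_buf) if side == "L" else (right_buf, left_buf)
        k = key(rec)
        # Expire partners that have fallen out of the time window.
        other[k] = [(t, r) for t, r in other[k] if ts - t <= window]
        out.extend((rec, r) if side == "L" else (r, rec) for _, r in other[k])
        own[k].append((ts, rec))
    return out

# e.g., join stock quotes with trades on the same ticker within 5 time units:
quotes = [(1, {"sym": "ACME", "bid": 10}), (9, {"sym": "ACME", "bid": 11})]
trades = [(3, {"sym": "ACME", "qty": 100})]
print(windowed_join(quotes, trades, key=lambda r: r["sym"], window=5))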
Query processing
Subscription queries
• Example: a set of indicators defines whether or not business is going well, depending on whether stream values exceed some threshold.
• Subscription queries are defined in order to continuously monitor the DW. One way to do that is by defining ECA (event-condition-action) rules on the data warehouse – evaluation of triggers requires access to the DW; when an event takes place, queries are executed against the warehouse.
Research challenges:
1. Modeling of subscription queries (is a new language needed, or are standard trigger languages enough?)
2. What kinds of extensions do we need at the physical level to support this?
3. How to optimize trigger condition checking (difficult in light of heavy aggregation)? A minimal ECA sketch follows below.
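A minimal ECA sketch under stated assumptions: dw_query is a hypothetical stand-in for evaluating an aggregate against the warehouse, and the query and threshold are illustrative only.

class ECARule:
    def __init__(self, condition, action):
        self.condition = condition  # predicate evaluated against the DW
        self.action = action        # callback fired when the condition holds

    def on_event(self, event):
        # Evaluating the condition requires access to the DW.
        if self.condition(event):
            self.action(event)

def dw_query(sql):
    """Hypothetical placeholder for a real warehouse query; returns a scalar."""
    return 4200  # stub value, for illustration only

rule = ECARule(
    condition=lambda e: dw_query(
        "SELECT SUM(amount) FROM sales WHERE day = CURRENT_DATE") > 4000,
    action=lambda e: print("indicator raised: daily sales above threshold"),
)

# Each incoming stream tuple is an event that may fire the rule:
rule.on_event({"type": "sale", "amount": 250})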
Query processing
Cached queries
• Example: compute the sum per country and per month and save it, because based on this you can compute the sum per continent and per year (see the sketch below).
• This is an important component that should exist in any data warehousing architecture.
Research challenges:
1. How, and under what conditions, can you do query rewriting? [A cube algebra would be useful here.]
2. Work on all “caching” issues: updating/refreshing, deciding whether or not to cache, replacement policies in the case of bounded space, etc.
3. Work on the same problem but for streams (fast-arriving data) and for particular types of queries that are less sensitive to temporal changes or compute approximate answers (otherwise caching is of no interest, as the data changes too fast).
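A minimal sketch (data and mapping are illustrative, not from the slides) of answering a coarser query from a cached finer-grained aggregate: sums per (country, month) are rolled up to sums per (continent, year) without touching the base data.

CONTINENT = {"Germany": "Europe", "Italy": "Europe", "Japan": "Asia"}

# Cached aggregate: (country, "YYYY-MM") -> sum of sales
cached = {
    ("Germany", "2005-07"): 1200,
    ("Italy",   "2005-08"): 800,
    ("Japan",   "2005-08"): 500,
}

def rollup(cache):
    out = {}
    for (country, month), total in cache.items():
        key = (CONTINENT[country], month[:4])  # coarser grouping
        out[key] = out.get(key, 0) + total
    return out

print(rollup(cached))  # {('Europe', '2005'): 2000, ('Asia', '2005'): 500}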
Query processing
Personalization
• Example: a profile (query) set by the user denotes that she is interested in sales between 2k and 4k in August in Germany, and does not care about sales between 4k and 5k in July in Italy. She wishes to see, highlighted in the results, the regions of the cube that satisfy her profile (see the sketch below).
Research challenges:
1. Can personalization be used in the DW context? Does it make sense in the context of aggregation, and what kinds of predicates based on aggregations are interesting?
2. How to express profiles?
3. How to process “profiles”? Particular emphasis should be put on the scalability of profile management in the presence of multiple users.
4. How could profiles be used in the design of the DW so as to avoid unnecessary processing?
5. Is this topic really new research, or re-engineering of existing solutions for simple queries?
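A minimal sketch (representation and names are mine) of highlighting cube cells against a profile. A profile is a list of (predicate, interested?) pairs over cells; a cell is flagged only when it matches an “interested” predicate.

cells = [
    {"country": "Germany", "month": "Aug", "sales": 3000},
    {"country": "Italy",   "month": "Jul", "sales": 4500},
    {"country": "Germany", "month": "Aug", "sales": 5200},
]

profile = [
    # interested: sales in [2000, 4000] in August in Germany
    (lambda c: c["country"] == "Germany" and c["month"] == "Aug"
               and 2000 <= c["sales"] <= 4000, True),
    # not interested: sales in [4000, 5000] in July in Italy
    (lambda c: c["country"] == "Italy" and c["month"] == "Jul"
               and 4000 <= c["sales"] <= 5000, False),
]

def highlight(cells, profile):
    for cell in cells:
        flags = [interest for pred, interest in profile if pred(cell)]
        cell["highlight"] = any(flags)  # True only for "interested" matches
    return cells

for c in highlight(cells, profile):
    print(c)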
Query processing
Data mining
• Example: a user specifies an ECA rule that involves the identification of potentially interesting opportunities based on the values of a cube. The condition part has to do with the identification of a pattern occurring more than 5 times within the last 2 days; the pattern is computed by an online data mining algorithm (see the sketch below).
Research challenges:
• The integration of data mining algorithms in OLAP tools, especially in the presence of novel types of data and data streams. Particular emphasis is put on the efficiency of the algorithms as well as their smooth integration within the overall OLAP environment.
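A minimal sketch (mine, with the mining step stubbed out) of the condition part of such an ECA rule: count pattern occurrences within a sliding 2-day window and fire the action when the count exceeds 5.

from collections import deque

TWO_DAYS = 2 * 24 * 3600  # window length in seconds

class PatternMonitor:
    def __init__(self, threshold=5, window=TWO_DAYS):
        self.threshold = threshold
        self.window = window
        self.hits = deque()  # timestamps of pattern occurrences

    def observe(self, ts, cube_value):
        if self.matches_pattern(cube_value):  # online mining step (stubbed)
            self.hits.append(ts)
        # Expire occurrences older than the window.
        while self.hits and self.hits[0] < ts - self.window:
            self.hits.popleft()
        if len(self.hits) > self.threshold:
            self.action(ts)

    def matches_pattern(self, value):
        return value > 1000  # placeholder for a real mining algorithm

    def action(self, ts):
        print(f"opportunity detected at t={ts}")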
ETL
Traditional ETL
• Definition: traditional ETL processes are responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a DW. These tasks are repeated on a regular basis and, in most cases, they are asynchronous.
• Example: integration of data extracted from OLTP and legacy systems into a DW (a minimal pipeline sketch follows below).
• State of the art: so far, the research community has only partially dealt with the problem of designing and managing ETL processes. Typically, research approaches concern (a) stand-alone problems (e.g., the problem of duplicate detection) in an isolated setting, and (b) problems mostly related to web data. Recently, research on data streams has brought up the possibility of giving an alternative look to the problem of ETL. Nevertheless, for the moment, research in data streaming has focused on different topics, such as on-the-fly computation of queries.
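A minimal extract-transform-load sketch (names and schema are illustrative, not from the slides): rows are pulled from a source, cleansed and customized, then appended to the warehouse fact table.

def extract():
    # Stand-in for reading from an OLTP or legacy source.
    return [
        {"cust": " Alice ", "amount": "120.50", "country": "de"},
        {"cust": "Bob",     "amount": "n/a",    "country": "it"},
    ]

def transform(rows):
    for row in rows:
        try:
            amount = float(row["amount"])   # value conversion
        except ValueError:
            continue                        # cleansing: drop bad records
        yield {
            "customer": row["cust"].strip(),   # cleansing
            "amount_eur": round(amount, 2),    # customization
            "country": row["country"].upper(),
        }

def load(rows, warehouse):
    warehouse.extend(rows)  # stand-in for a bulk insert into the DW

dw_fact_sales = []
load(transform(extract()), dw_fact_sales)
print(dw_fact_sales)  # [{'customer': 'Alice', 'amount_eur': 120.5, 'country': 'DE'}]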
ETL
Traditional ETL (cont.)
Research Challenges
1. A formal description of ETL processes, with particular emphasis on an algebra (for optimization purposes) and a formal declarative language.
2. Optimization of ETL processes at the logical and physical levels. A challenge is the optimization of either the whole ETL process or any individual transformation. Parallel processing of ETL processes is of particular importance.
3. Propagation of changes back to the sources. Potential quality problems observed at the end-user level can lead to clean data being propagated back to the sources, in order to avoid the repetition of several tasks in future applications of the ETL process.
4. Standards-based metadata for ETL processes. There is no common model for the metadata of ETL processes.
5. Integration of ETL with XML wrappers, EAI (Enterprise Application Integration) tools, and data quality tools.
6. Extension of the ETL mechanisms to non-traditional data, like XML/HTML, spatial, and biomedical data.
7. Security issues in ETL; both source data and data in transit pose security risks.
ETL
Stream ETL
Definition:
Stream ETL is an ETL process involving the possible filtering, value conversion, and transformation of the incoming information into a desirable format.
Although streams in their entirety cannot always be stored (and we also do not want to go over them again and again during ETL processing), some patterns or snapshot aggregates of them can be stored for subsequent querying.
Research Challenges:
• An issue that arises concerns correctness guarantees.
• We also need cost models for the tuning of the incoming stream within a specified time window.
• Another issue is the auditing of the incoming stream data for several constraints or business rules, also with respect to stored data (e.g., primary key violations); a minimal sketch follows below.
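A minimal sketch (schema and rules are assumptions of mine) of auditing an incoming stream against a business rule and a primary-key constraint checked against stored data.

stored_keys = {"ord-001", "ord-002"}  # keys already in the warehouse

def stream_etl(stream):
    for rec in stream:
        # Filtering / value conversion.
        if rec.get("amount") is None:
            continue
        rec["amount"] = float(rec["amount"])
        # Audit against a business rule.
        if rec["amount"] < 0:
            print("rule violation:", rec)
            continue
        # Audit against stored data: primary-key violation check.
        if rec["order_id"] in stored_keys:
            print("primary key violation:", rec)
            continue
        stored_keys.add(rec["order_id"])
        yield rec

incoming = [
    {"order_id": "ord-002", "amount": "10"},    # duplicate key, rejected
    {"order_id": "ord-003", "amount": "25.5"},  # clean, passes through
]
print(list(stream_etl(incoming)))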
ETL
On-Demand ETL
Example:
Some users request data to be brought in from the web. The administrator/programmer is assigned the task of constructing an ETL process that extracts the data from the specified sites, transforms it, and ultimately stores it in the warehouse. Any time the user needs this data, this on-demand ETL process brings in the relevant information.
Definition:
An ETL process executed sporadically, manually initiated by some user demand. The process is responsible for retrieving external data and loading them into the DW after the appropriate transformations (a minimal sketch follows below).
Research Challenges:
• Since this process is mostly focused on web data, there is a need for the appropriate operators.
• Additional challenges involve:
• Minimum effort/time/resources for the construction of the process
• Easy adaptability to changes in the external data
• Efficient algorithms
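A minimal on-demand ETL sketch (the URL, field names, and schema are hypothetical): the process runs only when a user asks for the data, fetching an external source, transforming it, and loading it into the warehouse.

import json
import urllib.request

def on_demand_etl(url, warehouse):
    # Extract: fetch external data on user demand (URL is illustrative).
    with urllib.request.urlopen(url) as resp:
        records = json.load(resp)
    # Transform: keep and normalize only the fields the warehouse expects.
    rows = [
        {"symbol": r["symbol"].upper(), "rate": float(r["rate"])}
        for r in records
        if "symbol" in r and "rate" in r
    ]
    # Load: append to the warehouse table.
    warehouse.extend(rows)
    return len(rows)

# Usage: invoked manually, each time the user needs fresh data.
# dw_rates = []
# on_demand_etl("https://example.com/rates.json", dw_rates)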