
NoDB: Querying Raw Data
--Mrutyunjay
Overview
▪ Introduction
▪ Motivation
▪ NoDB Philosophy: PostgreSQL
▪ Results
▪ Opportunities
“NoDB in Action: Adaptive Query Processing on Raw Data”
Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos and Anastasia
Ailamaki. In the Proceedings of the VLDB Endowment (PVLDB) (Demo), 2012.
Introduction
Motivation
▪ DBMSs are rarely used for emerging applications such as scientific analysis and
social networks.
▪ This is due to the prohibitive initialization cost and complexity (loading the data,
configuring the physical design, etc.) and the increased "data-to-query" time.
▪ For example, a scientist needs to quickly examine a few terabytes of new data
in search of certain properties. Even though only a few attributes might be
relevant for the task, the entire data set must first be loaded into the database. For
large amounts of data, this means a few hours of delay.
▪ NoDB Philosophy: Make database systems more accessible to the user by
eliminating major bottlenecks of current state-of-the-art technology that
increase data-to-query time.
Querying Raw Data
▪ Straightforward Approaches:
▪ -- Run the loading process whenever the relevant query arrives. Store the data in a
temporary table and discard the table after the query.
▪ -- Integrate raw file access with query execution: the scan operator reads the raw
file from disk in chunks.
▪ Limitations:
▪ Not viable for extensive and repeated query processing.
▪ Does not use important database system functionality like indexing.
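The first straightforward approach above (load at query time, then discard) can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup used in the NoDB work: it uses Python's built-in sqlite3 module in place of PostgreSQL so the example is self-contained, and the function name query_by_loading, the table name staging and the comma-separated input are all hypothetical.

```python
import csv
import io
import sqlite3


def query_by_loading(raw_file, column_defs, sql):
    """Load the whole raw file into a temporary table, run one query, discard.

    raw_file:    open text file (or file-like object) with comma-separated rows
    column_defs: column definitions such as ["id INTEGER", "name TEXT"]
    sql:         query against the hypothetical temporary table "staging"
    """
    conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the DBMS
    try:
        conn.execute(f"CREATE TEMP TABLE staging ({', '.join(column_defs)})")
        marks = ", ".join("?" for _ in column_defs)
        # Every row and every attribute is loaded, even if the query needs few.
        conn.executemany(f"INSERT INTO staging VALUES ({marks})", csv.reader(raw_file))
        return conn.execute(sql).fetchall()
    finally:
        conn.close()  # the temporary table disappears with the connection


raw = io.StringIO("1,a\n2,b\n3,c\n")
rows = query_by_loading(raw, ["id INTEGER", "name TEXT"],
                        "SELECT name FROM staging WHERE id > 1 ORDER BY id")
print(rows)
```

Every call pays the full loading cost again and leaves nothing behind for the next query, which is the limitation the slide points out.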
NoDB Philosophy: PostgreSQL
▪ On-the-fly parsing
▪ Indexing
▪ Caching
▪ Updates
On-the-fly parsing
▪ Parsing and Tokenizing Raw Data:
▪ Load raw data file -> Identify Rows (tuples) and attributes
▪ Transform into proper binary values depending on the attribute type.
▪ Selective Tokenizing:
▪ Abort tokenizing when required attributes are found.
▪ If a query needs 4th and 8th attributes then tokenize up to 8th attribute only.
▪ No I/O benefit, but reduces CPU processing cost.
▪ Selective Parsing:
▪ Transform only the required attributes into binary values.
▪ Selective Tuple Formation:
▪ Tuples contain only the attributes required for a given query.
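The three selective steps above can be sketched for a single line of a delimited file. This is an illustrative sketch only: the function name scan_line, the comma delimiter, 0-based attribute indexes and the use of Python conversion functions as "binary transformation" are assumptions, not details from the slides.

```python
def scan_line(line, needed, types, delimiter=","):
    """Return {attribute index: typed value} for the attributes in `needed`.

    needed: non-empty set of 0-based attribute indexes used by the query
    types:  dict mapping each needed index to a conversion function (e.g. int)
    """
    line = line.rstrip("\n")
    last = max(needed)
    values = {}
    start = 0
    # Selective tokenizing: stop after the last attribute the query needs.
    for index in range(last + 1):
        end = line.find(delimiter, start)
        if end == -1:
            end = len(line)
        if index in needed:
            # Selective parsing: convert only the required attributes.
            values[index] = types[index](line[start:end])
        start = end + 1
    # Selective tuple formation: the result holds only the needed attributes.
    return values


# The query needs the 4th and 8th attributes (indexes 3 and 7).
print(scan_line("10,abc,3.5,7,x,y,z,42,rest,more", {3, 7}, {3: int, 7: int}))
```

The fields after the 8th attribute are never tokenized, and the fields before it are tokenized but not converted.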
Indexing (Adaptive Positional Map)
▪ Adaptive Positional Map:
▪ Metadata information on the structured flat file. Used to navigate and retrieve raw
data faster.
▪ Reduce parsing and tokenizing costs.
▪ Metadata refers to the positions of attributes in the raw file.
▪ The positional map is created on-the-fly during query processing. It is populated
during the tokenizing phase.
▪ The positional map stores positions for every tuple in the table, for the attributes
that queries access. This is needed because attribute lengths vary, i.e. the same
attribute appears at a different position in different tuples.
Indexing (Adaptive Positional Map)
▪ The positional map is implemented as a collection of chunks partitioned vertically and
horizontally.
▪ A higher-level data structure is maintained that records the order of attributes in
the map with respect to their order in the file.
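The idea of a positional map that is populated during tokenizing and then used to jump into a line can be sketched as below. This is a simplified illustration under stated assumptions: the class PositionalMap, the function read_attribute and the nested-dictionary layout are hypothetical, and the sketch does not model the chunked vertical and horizontal partitioning described above.

```python
class PositionalMap:
    """Hypothetical map: positions[row][attr] = offset of attr within its line."""

    def __init__(self):
        self.positions = {}

    def record(self, row, attr, offset):
        self.positions.setdefault(row, {})[attr] = offset

    def nearest(self, row, attr):
        """Closest known attribute at or before `attr`, or (0, 0) if none."""
        known = self.positions.get(row, {})
        candidates = [a for a in known if a <= attr]
        if not candidates:
            return 0, 0
        best = max(candidates)
        return best, known[best]


def read_attribute(line, row, attr, pmap, delimiter=","):
    """Return (raw text of attribute, number of delimiters scanned)."""
    index, start = pmap.nearest(row, attr)
    steps = 0
    while index < attr:
        # Tokenize forward from the nearest known position, recording
        # every attribute start that is discovered along the way.
        start = line.index(delimiter, start) + 1
        index += 1
        steps += 1
        pmap.record(row, index, start)
    end = line.find(delimiter, start)
    if end == -1:
        end = len(line)
    return line[start:end], steps


pmap = PositionalMap()
line = "10,abc,3.5,7,x,y,z,42"
print(read_attribute(line, 0, 7, pmap))  # first query tokenizes from the start
print(read_attribute(line, 0, 7, pmap))  # second query jumps straight to the offset
```

Because the map is only auxiliary, discarding it loses no data; later queries simply tokenize again and repopulate it.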
Caching and Updates*
▪ The cache holds previously accessed data, i.e. previously accessed attributes.
▪ Populated on-the-fly during query processing (Cache the binary data
immediately).
▪ Follows the format of positional map. Fixed size cache. LRU policy to drop and
populate the cache.
▪ A change in the position of an attribute in the data file might call for significant
reorganization. Nevertheless, being an auxiliary data structure, the positional
map can be dropped and recreated when needed again.
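The fixed-size cache with an LRU policy described above can be sketched with Python's collections.OrderedDict. This is a minimal sketch under stated assumptions: the class name AttributeCache is hypothetical, entries are whole attribute columns, and capacity is counted in columns rather than bytes, which the slides do not specify.

```python
from collections import OrderedDict


class AttributeCache:
    """Hypothetical fixed-capacity cache of parsed attribute columns (LRU)."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumption: capacity counted in columns
        self.columns = OrderedDict()  # attribute index -> list of parsed values

    def get(self, attr):
        if attr not in self.columns:
            return None               # miss: caller must parse the raw file
        self.columns.move_to_end(attr)  # mark as most recently used
        return self.columns[attr]

    def put(self, attr, values):
        self.columns[attr] = values
        self.columns.move_to_end(attr)
        while len(self.columns) > self.capacity:
            self.columns.popitem(last=False)  # evict the least recently used


cache = AttributeCache(capacity=2)
cache.put(3, [7, 8])
cache.put(7, [42, 43])
cache.get(3)          # attribute 3 becomes the most recently used
cache.put(5, [1, 2])  # evicts attribute 7
print(cache.get(7), cache.get(3), cache.get(5))
```

On a hit, the query can skip both tokenizing and parsing for that attribute, since the cached values are already in binary form.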
Results
▪ Raw data file: 11GB
▪ 7.5 x 10^6 tuples, 150 attributes each
Opportunities
▪ Flexible Storage: NoDB systems do not require a priori loading, which implies they also
do not require a priori decisions on how data is physically organized during loading.
▪ Updates: Immediate access on updated raw data files provides a major opportunity
towards decreasing even further the data-to-query time and enabling tight interaction
between the user and the database system.
▪ Information Integration: Query multiple data sources and support different file
formats.
▪ File System Interface: Unlike traditional database systems, data in NoDB systems is
always stored in file systems, such as NTFS or ext4. This provides NoDB the opportunity
to intercept file system calls and gradually create auxiliary data structures that speed up
future NoDB queries.
Questions