Big Data - School of Information and Communication Technology
Challenges and Opportunities in Efficient
Management of Big Data
Assoc. Professor Bela Stantic
Director IIIS Big Data and Smart Analytics Lab
Deputy Head of School of Information and Communication Technology
Institute for Integrated and Intelligent Systems - IIIS
Overview
Facts about traditional DBMS
Catch-phrase Big Data
What it is, forms, characteristics
Sources of Big Data
Classification into groups
Challenges and methods to overcome them for each group
Current achievements
Concluding remarks – Future of Big Data management
Traditional Database Management Systems
Disk based, SQL oriented
Data is stored in heavily encoded disk blocks
A main-memory buffer pool is used to improve performance
Blocks are moved between memory buffers and disk
Optimization of CPU usage and disk I/Os (the best way to execute SQL commands)
The fundamental operation is a read or an update of a table row.
[Diagram: traditional DBMS architecture, showing background processes, the System Global Area (SGA), server processes, redo log files, data files, control files, and user connections]
Traditional Database Management Systems
Indexing is done by B-trees, in clustered or unclustered form, or by hash indexes [Comer, 79].
Concurrency control: dynamic row-level locking (ACID).
Crash recovery: write-ahead log, ARIES [Mohan, 93].
Replication: mostly by updating the primary node, then shipping the log over the network to other sites and rolling forward.
Due to these proven capabilities, traditional database management systems were long considered to fit all purposes.
Commercial traditional DBMSs successfully absorbed new concepts and trends in the past (object databases, XML databases).
Even these brief details indicate robustness.
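The write-ahead logging rule behind ARIES ("log first, then apply; roll forward on recovery") can be sketched in a few lines of Python. This is a minimal illustration of the principle only, not the actual ARIES structures; all names here are invented for the example.

```python
# Minimal write-ahead logging sketch: every change is appended to the
# log BEFORE the in-memory data is updated, so a crash can be repaired
# by replaying the log ("rolling forward").

class WalStore:
    def __init__(self):
        self.log = []      # durable append-only log (in-memory here)
        self.data = {}     # the "table": row_id -> value

    def update(self, row_id, value):
        self.log.append(("UPDATE", row_id, value))  # log first (WAL rule)
        self.data[row_id] = value                   # then apply

    def recover(self):
        # After a crash, rebuild state by rolling the log forward.
        rebuilt = {}
        for op, row_id, value in self.log:
            if op == "UPDATE":
                rebuilt[row_id] = value
        self.data = rebuilt

store = WalStore()
store.update("r1", "alice")
store.update("r1", "bob")
store.data = {}          # simulate losing all in-memory state in a crash
store.recover()
print(store.data["r1"])  # -> bob
```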
One size does not fit all!
The growth of application domains, database sizes, and the variety of captured data, commonly called "Big Data", began to cause problems for this technology, which proved too heavyweight to answer the new requirements.
Three decades of commercial DBMS development can be summed up in a single phrase: "one size fits all".
Stonebraker argued in his 2006 paper that one size does not fit all.
Originally designed and optimized for business data processing, traditional DBMSs are not the best solution for data-centric applications.
The traditional DBMS architecture is no longer the best option for the whole database market, and the commercial world will need to venture into a collection of independent database concepts.
What is Big Data?
Image Source: http://pivotcon.com/three-ways-to-use-big-data-to-be-a-better-marketer/
Big Data
Dataset size: beyond the ability of the current system
to collect, process, retrieve and manage.
Collections in traditional legacy databases and data warehouses, social networking and media, mobile devices, logs, data streams generated by remote sensors and other IT hardware, and e-science (computational biology, bioinformatics, genomics, astronomy, etc.).
A typical feature of most Big Data is the absence of a schema, which causes problems with integration.
The value of information explodes when data can be linked.
Google used search queries to track the spread of influenza (Nature paper, 2009).
Image source: http://flickr.com/photos/53242483@N00/5839399412
Characteristics of Big Data
1. Volume
Data volume is increasing exponentially; over the next five years data volumes are expected to grow by a factor of 8.
eBay holds 90 PB; Facebook stores 50 billion photos.
The Large Hadron Collider produces 40 TB/sec.
The amount of information individuals create themselves, by writing documents, taking pictures, downloading music, etc., is far less than the amount of information being created about them.
Characteristics of Big Data
2. Velocity
How quickly data is being produced, and how quickly it must be processed to meet the demand for extracting useful information.
Sources of high-velocity data include website log files, devices that log events, etc.
Social media: Twitter sees 500 million tweets/day (peaks of 150,000 tweets/s); Facebook has 800 million active users and 4.5 billion likes/day.
The value of this data degrades over time: useful information must be extracted in a timely manner, otherwise it loses its meaning.
Characteristics of Big Data
3. Variety
A single application can generate and
collect many types of data.
Many formats and types: structured, unstructured, semi-structured, text (even in different languages), media, etc.
This creates problems not only for storage and efficient management but also for mining and analyzing the data.
To extract knowledge, all these types of data need to be linked together.
Characteristics of Big Data
4. Veracity
Relates to the reliability of data and the predictability of inherently imprecise data containing noise and abnormalities.
Data must be meaningful to the problem being analyzed.
The biggest challenge is to keep your data clean.
Characteristics of Big Data
Variety and Velocity work against the Veracity of the data: they decrease the ability to cleanse the data before information is extracted.
5th V: Value - is the data worthwhile, and does it have value for the business?
6th V: Variability - different meanings of data.
7th V: Volatility - how long data remains valid and relevant to the analysis and decisions.
Sources of Big Data
We are no longer hindered by the ability to collect data but by the
ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner.
Today's data can generally be divided into three main groups:
Online Transaction Processing (OLTP): essentially all forms of legacy systems driven by ACID properties.
Online Analytical Processing (OLAP), or Data Warehousing.
Real-Time Analytic Processing (RTAP): new types of data, together with huge volume and velocity, pose significant demands, and it is evident that traditional database management systems cannot answer them in real time. This group is starting to dominate, both in the size of the collected data and in the useful information it can provide.
Efficient Management of data in OLTP
Data is stored in the form of rows in relations
Each relation has a primary key
There can be one or more secondary indexes
B-tree indexes by default
Hash-based indexes
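The role of a secondary index can be sketched with a sorted lookup structure in Python. A real DBMS uses a B-tree, but a sorted list with binary search illustrates the same idea of going from an attribute value to row ids without a full table scan (the table contents here are invented for the example):

```python
# Minimal secondary-index sketch: the index maps an attribute value to
# primary keys, kept sorted so lookups avoid scanning the whole table.

import bisect

table = {1: ("Ada", 34), 2: ("Bob", 28), 3: ("Cyd", 34)}  # pk -> row

# Secondary index on age: a sorted list of (age, pk) pairs.
age_index = sorted((age, pk) for pk, (_, age) in table.items())

def lookup_age(age):
    """Return primary keys of rows whose age equals `age`."""
    lo = bisect.bisect_left(age_index, (age,))
    hi = bisect.bisect_right(age_index, (age, float("inf")))
    return [pk for _, pk in age_index[lo:hi]]

print(lookup_age(34))  # -> [1, 3]
```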
Efficient Management of data in OLTP
Spatial data: most often managed by the family of R-tree indexes [Guttman, 1984], while some external structures are also utilized.
Efficient management of temporal data represents a challenge:
» The RI-tree [Kriegel, 2000]: upper and lower composite indexes, with dedicated query transformations.
» The TD-tree [Stantic, 2010].
Multidimensional data in OLTP
Efficient management of multidimensional data in a commercial environment:
Multiple secondary indexes.
Compound indexes (a compound index outperforms secondary indexes when the result set is greater than 0.000015% of the relation's population).
At a high number of dimensions a full table scan is faster (the curse of dimensionality)!
UB-tree [Bayer, 2000]
VG-Curve [Stantic, 2013]
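The idea behind the UB-tree, mapping multidimensional points onto a one-dimensional space-filling curve that an ordinary B-tree can index, can be sketched with a Z-order (Morton) code. This illustrates the principle only, not the UB-tree algorithm itself:

```python
# Z-order (Morton) code sketch: interleave the bits of each dimension's
# key into a single integer, so multidimensional points get one ordered
# key that a standard one-dimensional index can store.

def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits on even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits on odd positions
    return code

# Nearby points in 2-D space tend to get nearby Z-order keys, so a key
# range roughly covers a rectangle - the basis for range queries.
points = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(sorted(points, key=lambda p: morton2d(*p)))
```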
Efficient Management of data in OLAP
At present OLAP also relies on traditional DBMSs; however, the requirement for ACID properties can be relaxed, since operations are mostly read-only.
The main attention is given to speed, and OLAP systems are therefore mostly based on indexes.
B-tree indexes and a variety of bitmap indexes [Maayan, 1985] with different compression schemes are utilized.
To improve response time, different methods of identifying and materializing views have also been proposed in the literature [Stantic, 2006].
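The bitmap-index idea can be sketched in a few lines of Python: one bit-vector per distinct value of a low-cardinality column, so a conjunctive predicate becomes a single bitwise AND. The column names and data are invented for the example:

```python
# Bitmap index sketch: for each distinct value in a column, keep a
# bit-vector marking the rows where it occurs. Predicates then become
# cheap bitwise operations, which suits read-mostly OLAP workloads.

def build_bitmap_index(column):
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << row      # set the bit for this row
    return index

region = ["EU", "US", "EU", "APAC", "US", "EU"]
status = ["open", "open", "closed", "open", "closed", "open"]
region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# WHERE region = 'EU' AND status = 'open'  ->  one bitwise AND
match = region_idx["EU"] & status_idx["open"]
rows = [r for r in range(len(region)) if match >> r & 1]
print(rows)  # -> [0, 5]
```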
Efficient Management of data in OLAP
Column stores can perform faster than row stores by a factor of 50 or even 100.
The fact table joins with dimension tables, which provide details about products, customers, stores, etc.; these tables are organized in a star or snowflake schema.
Compression is also much easier and more productive in a column store, as each block holds only one kind of attribute.
Since the advantages of column stores were identified, several column-oriented databases have become available, including Sybase IQ, Amazon, Google BigQuery, Vertica, Teradata, etc.
In 2012 IBM released DB2 (code name Galileo) with row and column access control, which enables fine-grained control.
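Why column blocks compress so well can be sketched with run-length encoding, one of the simplest schemes a column store can apply when a block holds a single, often sorted, attribute (the data is invented for the example):

```python
# Run-length encoding sketch: because a column-store block contains only
# one attribute, a sorted low-cardinality column collapses into a short
# list of (value, run_length) pairs.

def rle_encode(values):
    """Compress a column into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

country = ["AU"] * 4 + ["DE"] * 3 + ["US"] * 5   # one sorted column block
print(rle_encode(country))  # -> [('AU', 4), ('DE', 3), ('US', 5)]
```

A row store cannot do this, because adjacent bytes on disk belong to different attributes of the same row.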
Efficient Management of data in RTAP
In the past, a few main sources generated data and everyone else consumed it.
Today all of us generate data, and all of us are also consumers of this shared data.
This new pattern of data generation, and the number of users, requires more feasible scaling solutions than traditional database architectures offer.
To efficiently exploit this new resource, both infrastructures and techniques need to scale.
RTAP Data Management Architectures
One of the biggest issues is the scalability of traditional DBMSs.
• Vertical scaling (also called scale-up).
• Horizontal scaling (scale-out) can ensure scalability in a more effective and cheaper way.
• Data is distributed horizontally in the network ("shared nothing").
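Shared-nothing horizontal partitioning can be sketched by hashing each row key to the node that owns it. The node names and the choice of MD5 here are illustrative only, not a prescription:

```python
# Shared-nothing partitioning sketch: rows are hashed to nodes so each
# node owns a disjoint slice of the data, and any client can route a
# key without consulting shared state.

import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key: str) -> str:
    """Deterministically map a row key to the node that stores it."""
    digest = hashlib.md5(key.encode()).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]

# Every client computes the same owner for the same key.
for key in ["user:17", "user:42", "user:99"]:
    print(key, "->", owner(key))
```

Note that this naive modulo scheme reshuffles almost every key when a node is added; production systems typically use consistent hashing instead.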
Data Management Architectures
Hadoop is now used as a highly scalable data-intensive platform.
The open-source Hadoop software is based on the MapReduce framework and the Hadoop Distributed File System (HDFS).
• The complexity of data-processing tasks in such architectures is minimized by programming models like MapReduce.
• An effective implementation of the relational join operation in MapReduce requires a special approach to both data distribution and indexing.
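The join problem above can be sketched as a "reduce-side" join in the MapReduce model: both inputs are mapped to (join key, tagged record) pairs, the shuffle groups them by key, and the reducer pairs rows from the two sides. This is a plain-Python illustration of the pattern, not the Hadoop API, and the table contents are invented:

```python
# Reduce-side join sketch in the MapReduce style: map tags each record
# with its source, the "shuffle" groups records by join key, and the
# reducer emits the cross product of the two sides within each group.

from collections import defaultdict
from itertools import product

def map_phase(orders, customers):
    for cust_id, item in orders:
        yield cust_id, ("order", item)
    for cust_id, name in customers:
        yield cust_id, ("customer", name)

def reduce_phase(pairs):
    groups = defaultdict(list)            # the "shuffle": group by key
    for key, record in pairs:
        groups[key].append(record)
    for key, records in groups.items():
        names = [v for tag, v in records if tag == "customer"]
        items = [v for tag, v in records if tag == "order"]
        for name, item in product(names, items):
            yield key, name, item

orders = [(1, "book"), (2, "pen"), (1, "lamp")]
customers = [(1, "Ada"), (2, "Bob")]
print(sorted(reduce_phase(map_phase(orders, customers))))
```

The cost is that every record of both relations crosses the network during the shuffle, which is why data placement and indexing matter so much in this setting.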
NoSQL Databases
To accommodate data with no schema, and for Real-Time Analytic Processing, NoSQL databases have been utilized.
NoSQL means "not only SQL" or "no SQL at all".
The NoSQL concept dates back to the 1990s. These databases provide simpler scalability and improved performance compared to traditional relational databases.
The data model in NoSQL databases is described intuitively, without any formal foundations.
Institute for Integrated and Intelligent Systems - IIIS
22
NoSQL Databases
In general, NoSQL databases can be classified into the following groups:
» key-value stores: SimpleDB, Redis, Memcached, Dynamo, Voldemort
» column-oriented: BigTable, HBase, Hypertable, Cassandra, PNUTS
» document-oriented: MongoDB, CouchDB
To improve performance, some NoSQL databases are in-memory databases (Redis and Memcached).
ACID properties are not fully implemented; these databases may be only eventually consistent or weakly consistent.
List of NoSQL databases: http://nosql-database.org/
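The three data models above can be sketched with plain Python structures. This illustrates the shape of the data in each model, not any particular product's API; all keys and values are invented:

```python
# Key-value store (SimpleDB/Redis style): an opaque value behind one key.
kv = {"session:42": b"serialized-blob"}

# Column-oriented store (BigTable/HBase style): row key -> column family
# -> column -> value; rows need not share the same columns (no schema).
columns = {
    "row1": {"info": {"name": "Ada", "city": "Brisbane"}},
    "row2": {"info": {"name": "Bob"}},          # missing "city" is fine
}

# Document store (MongoDB/CouchDB style): self-describing nested documents.
doc = {"_id": 7, "user": "Ada", "likes": ["big data", "indexing"]}

print(columns["row2"]["info"].get("city"))  # -> None
```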
NewSQL Databases
A more general category of parallel DBMSs, called NewSQL databases, is designed to scale out horizontally on shared-nothing machines while ensuring transparent partitioning, still guaranteeing ACID properties, and employing lock-free concurrency control.
Applications interact with the database primarily using SQL.
NewSQL offers performance and scalability beyond what either traditional DBMSs or Hadoop provide.
For example, a comparison of Hadoop and Vertica showed that query times for Hadoop were 1-2 orders of magnitude slower.
In the context of Big Analytics, most NewSQL representatives are suited to real-time analytics (e.g., ClustrixDB, Vertica, and VoltDB) or columnar storage (Vertica).
But in general the performance of NewSQL databases remains a problem.
Concluding Remarks
It is evident that "one size does not fit all". Different types of data and different requirements pose different challenges and needs, which can only be answered efficiently with different techniques and different concepts.
This is an interesting time in databases: there are many new database ideas as well as products.
Commercial traditional database management systems in the past successfully added features to address different trends and needs, and as a result they became too large and too resource-demanding.
None of the newly proposed concepts will, in the near future, guarantee the requirements that OLTP application domains demand. A new product that satisfied all requirements would most likely also be too complex and resource-demanding.
Therefore traditional database management systems are here to stay for OLTP applications, at least for a while.
However, OLTP is expected to move toward main-memory systems, and current indexing methods will therefore need to evolve for main-memory search.
Concluding Remarks
In relation to OLAP, answering the needs of business intelligence, data warehouses will keep getting bigger.
To work efficiently with this volume of data, data warehouses will need to be column-store based.
We are already witnessing this change.
However, data warehouses will need to head in the direction of defining and building an execution engine that processes SQL without a MapReduce layer (similar to the Hive concept).
Such a concept could successfully compete in the data warehousing market.
Concluding Remarks
In Real-Time Analytic Processing, changes are expected to be driven by demand for complex analytics that can predict the future, not just provide valuable information from data.
This will become possible as more and more data containing useful information is captured from different sources.
One of the new sources will be the "Internet of Things" (RFID).
The big challenge will be to process all of that data efficiently and extract useful information in real time.
The current concept of NoSQL databases might survive, but it will only be utilized in applications with schema-later concepts and, obviously, where ACID properties are not essential (e.g., logs).
We are already witnessing the main NoSQL databases, such as Cassandra and MongoDB, moving to SQL; they are also moving toward ACID.
Initially NoSQL meant "No SQL", then "Not only SQL", and at present the most accurate definition would be "Not yet SQL".
Concluding Remarks
The Hadoop stack will need to change into something different, as it is only good for a very small fraction of applications, namely those that are highly parallel.
HDFS most likely will not survive, as it is very inefficient.
New data management architectures, e.g. distributed file systems and NoSQL databases, can solve Big Data problems only partially.
NewSQL systems are addressing these shortcomings; several promising systems in this category are VoltDB, Hana (SAP), and SQLFire.
Distributed multidimensional indexing will become more important as the number of nodes increases, and it will need to overcome the communication overhead, which is excessive when routing query requests between compute nodes.
Also, the current index frameworks cannot meet the velocity of Big Data, because the communication overhead of synchronizing local and global indexes is huge.
To overcome these problems, most likely a completely new index framework needs to be introduced, and we are working on one.