Transcript Lecture 8

IS6146 Databases for Management
Information Systems
Lecture 8: Working with unstructured data
Rob Gleasure
[email protected]
robgleasure.com
IS6146

Today’s session
 Technologies for analysis
 Technologies for storage
 NoSQL
 Distributed MapReduce architectures, e.g. Hadoop
Technologies and tools
[Data lifecycle diagram: Create → Capture → Store → Analyse → Present, with Business Intelligence and Tools highlighted. Visualisation credit: https://www.youtube.com/watch?v=IjpU0dLIRDI]
Tools for analysis and presentation

Massive range of software, depending on needs
 For visualising and sorting data
 Excel
 Pentaho
 For data mining, regressions, clustering, graphing, etc.
 SPSS
 R
 Gephi
 UCINET
 For reporting
 Excel
 Pentaho
Let’s get our hands data-y!
(I know. Sorry.)
Tools for analysis and presentation
Image from www.etsy.com
Data warehousing
[Data lifecycle diagram repeated: Create → Capture → Store → Analyse → Present, now with Data Warehousing highlighted.]
Data warehousing
[Diagram: operational OLTP databases (HR and payroll, sales and customers, orders, technical support, purchased data) feed an Extract-Transform-Load process into the data warehouse; the OLAP business intelligence database then supports data mining, visualisation, and reporting.]
OLTP vs. OLAP

Online transaction processing (OLTP) databases/data stores
support ongoing activities in an organisation
 Hence, they need to
 Manage accurate real-time transactions
 Handle reads, writes, and updates by large numbers of
concurrent users
 Decompose data into joinable, efficient rows (e.g. normalised to 3rd normal form)

These requirements are often labelled ACID database transactions (a minimal sketch follows below)
 Atomic: every part of a transaction works or it's all rolled back
 Consistent: the database is never left in an inconsistent state
 Isolated: transactions do not interfere with one another
 Durable: completed transactions are not lost if the system crashes
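To make "all or nothing" concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the simulated mid-transfer failure are invented for illustration.

```python
# Minimal ACID sketch (illustrative): a transfer either fully commits or fully rolls back.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # the with-block is one atomic transaction
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        raise RuntimeError("simulated crash mid-transfer")  # forces a rollback
except RuntimeError:
    pass

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 50} -- the half-finished transfer never became visible
```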
OLTP vs. OLAP

Online analytical processing (OLAP) databases/data stores are used
to support predictive analytics
 Hence, they need to
 Allow vast quantities of historical data to be accessed quickly
 Be updatable in batches (often daily)
 Aggregate diverse structures with summary data (a small sketch follows this slide's bullet points)

These requirements are often labelled BASE database transactions
 Basically Available: the system guarantees a response, though it may be stale
 Soft state: the state of the system may change over time, even without new input
 Eventual consistency: replicas converge once updates stop arriving
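As promised above, here is a minimal sketch of the kind of batch roll-up an OLAP store serves: detailed transaction records aggregated into daily summary rows. The records and field names are invented for illustration.

```python
# Illustrative batch aggregation: roll detailed transactions up into daily summaries.
from collections import defaultdict

transactions = [
    {"date": "2023-10-01", "region": "Munster", "amount": 120.0},
    {"date": "2023-10-01", "region": "Leinster", "amount": 80.0},
    {"date": "2023-10-01", "region": "Munster", "amount": 45.5},
]

daily_summary = defaultdict(float)
for t in transactions:
    daily_summary[(t["date"], t["region"])] += t["amount"]

for (date, region), total in sorted(daily_summary.items()):
    print(date, region, total)
# 2023-10-01 Leinster 80.0
# 2023-10-01 Munster 165.5
```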
NoSQL
[Data lifecycle diagram repeated: Create → Capture → Store → Analyse → Present, now with NoSQL highlighted alongside Data Warehousing.]
What is NoSQL?

What is NoSQL?
 Basically any database that isn’t a relational database
 Stands for ‘Not only SQL’
 It’s NOT anti-SQL or anti-relational databases
Image from www.improgrammer.net
What is NoSQL (continued)?

 It's not only rows in tables
  NoSQL systems store and retrieve data in many formats, e.g. text, csv, xml, graphml
 It's not only joins
  NoSQL systems let you extract data using simple interfaces, rather than necessarily relying on joins
 It's not only schemas
  NoSQL systems let you drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.
What is NoSQL (continued)?

 It's not only executed on one processor
  NoSQL systems let you store databases on multiple processors with high-speed performance
 It's not only specialised computers
  NoSQL systems let you leverage low-cost shared-nothing commodity processors that have separate RAM and disk
 It's not only logarithmically scalable
  NoSQL systems let you achieve linear scalability as you add more processors
 It's not only anything, really…
  NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)
What is NoSQL (continued)?
Four Data Patterns in NoSQL
Image from http://www.slideshare.net/KrishnakumarSukumaran/to-sql-or-no-sql-that-is-the-question
Key-value stores

A simple string (the key) returns a Binary Large OBject (BLOB) of
data (the value)
 E.g. the web, where a URL (key) returns a page (value)

The key can take many formats
 Logical path names
 A hash string artificially generated from the value
 REST web service calls
 SQL queries

Three basic functions
 Put
 Get
 Delete
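A minimal sketch of that put/get/delete interface, backed here by an ordinary Python dict; the class name and example keys are invented for illustration.

```python
# Illustrative key-value store: keys map to opaque values (BLOBs, strings, etc.).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value              # create or overwrite

    def get(self, key, default=None):
        return self._data.get(key, default)  # no search beyond exact key match

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("/images/logo.png", b"\x89PNG...")   # key as a logical path name
store.put("user:42", '{"name": "Aoife"}')      # key as an application-level id
print(store.get("user:42"))
store.delete("/images/logo.png")
```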
Key-value stores (continued)

Advantages
 Scalability, reliability, portability
 Low operational costs
 Simplicity

Disadvantages
 No real options for advanced search

Commercial solutions
 Amazon S3
 Voldemort
Column-family stores

Stores BLOBs of data in one big table, with four possible basic
identifiers used for look-up
 Row
 Column
 Column-family
 Time-stamp

More like a spreadsheet than an RDBMS in many ways (e.g. no
indices, triggers, or SQL queries)

Grew from an idea presented in a Google BigTable paper
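A minimal sketch of that look-up model in plain Python: each cell is addressed by row, column family, column, and time-stamp. The structure and example values are invented for illustration.

```python
# Illustrative column-family look-up: row -> column family -> column -> versions.
import time

table = {}  # {row_key: {family: {column: [(timestamp, value), ...]}}}

def put(row, family, column, value, ts=None):
    ts = ts if ts is not None else time.time()
    versions = table.setdefault(row, {}).setdefault(family, {}).setdefault(column, [])
    versions.append((ts, value))

def get_latest(row, family, column):
    versions = table.get(row, {}).get(family, {}).get(column, [])
    return max(versions)[1] if versions else None

put("customer#001", "profile", "name", "Aoife")
put("customer#001", "profile", "name", "Aoife Murphy")   # newer version of the same cell
put("customer#001", "orders", "last_order_id", "A-7731")
print(get_latest("customer#001", "profile", "name"))      # Aoife Murphy
```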
Column-family stores (continued)

Advantages
 Scales pretty well
 Decent search ability
 Easy to add new data
 Pretty intuitive

Disadvantages
 Can’t query BLOB content
 Not as efficient to search as some other options

Commercial solutions
 Cassandra
 HBase
Document stores

Stores data in nested hierarchies (typically using XML or JSON)

Keeps logical chunks of data together in one place
[Spectrum: flat tables (e.g. csv) → hierarchical docs (e.g. JSON) → mixed content (e.g. XML)]
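A minimal sketch of a document store in Python: each order is one nested, JSON-like document, and queries search within documents. The collection and field names are invented for illustration.

```python
# Illustrative document store: logically related data stays together in one document.
orders = [
    {
        "_id": "ord-1001",
        "customer": {"name": "Aoife", "city": "Cork"},
        "items": [
            {"sku": "B-17", "qty": 2, "price": 9.99},
            {"sku": "C-02", "qty": 1, "price": 24.50},
        ],
    },
]

# Within-document search: orders from Cork that contain item B-17, with their totals
for doc in orders:
    if doc["customer"]["city"] == "Cork" and any(i["sku"] == "B-17" for i in doc["items"]):
        total = sum(i["qty"] * i["price"] for i in doc["items"])
        print(doc["_id"], round(total, 2))   # ord-1001 44.48
```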
Document stores (continued)

Advantages
 Lends itself to efficient within-document search
 Very suitable for information retrieval
 Very suitable where data is fed directly into websites/applications
 Allows for structure without being overly restrictive

Disadvantages
 Complicated to implement
 Search process may require opening and closing files
 Analysis requires some flattening

Commercial solutions
 MarkLogic
 MongoDB
Graph stores

Model the interconnectivity of the data by focusing on nodes
(sometimes called vertices), relationships (sometimes called edges),
and properties
Image from http://savas.me/2013/03/on-graph-data-model-design-relationships/
Graph stores (continued)

Nodes and edges are stored in separate tables, meaning new types of search become possible
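A minimal sketch of separate node and edge tables and the kind of traversal they make possible; the labels and data are invented for illustration.

```python
# Illustrative graph store: nodes and edges in separate tables, queried by traversal.
nodes = {
    "p1": {"label": "Person", "name": "Aoife"},
    "p2": {"label": "Person", "name": "Brian"},
    "c1": {"label": "Company", "name": "Acme Ltd"},
}
edges = [
    ("p1", "WORKS_FOR", "c1"),
    ("p2", "WORKS_FOR", "c1"),
    ("p1", "KNOWS", "p2"),
]

def neighbours(node_id, rel):
    """Follow outgoing edges of one relationship type."""
    return [dst for src, r, dst in edges if src == node_id and r == rel]

# Who works at the same company as Aoife?
for company in neighbours("p1", "WORKS_FOR"):
    colleagues = [src for src, r, dst in edges
                  if r == "WORKS_FOR" and dst == company and src != "p1"]
    print([nodes[c]["name"] for c in colleagues])   # ['Brian']
```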
Graph stores (continued)

Advantages
 Fast network search
 Works with many public data sets

Disadvantages
 Not very scalable
 Hard to query systematically unless you use specialised
languages based on graph traversals

Commercial solutions
 Neo4j
 AllegroGraph
So, where to get the power needed for
these giant data stores?
Image from chucks-fun.blogspot.com
MapReduce
[Data lifecycle diagram repeated: Create → Capture → Store → Analyse → Present, now with MapReduce highlighted alongside NoSQL and Data Warehousing.]
Traditional (Structured) Approach
[Diagram: all of the big data is shipped to one high-power processor for analysis.]

The MapReduce Concept
[Diagram: the big data is split across data nodes attached to standard, low-cost slave processors; a master processor coordinates the slaves.]
The MapReduce Concept

 Two fundamental steps
 1. Map
  Master node takes the large problem and slices it into sub-problems
  Master node distributes these sub-problems to worker nodes
  A worker node may also subdivide and distribute (in which case, a multi-level tree structure results)
  Workers process their sub-problems and hand results back to the master
 2. Reduce
  Master node reassembles the solutions to the sub-problems in a predefined way to answer the high-level problem
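A minimal sketch of those two steps as plain Python functions, counting words across documents; the shuffle/grouping stage between them is shown explicitly. Input data is invented for illustration.

```python
# Illustrative MapReduce word count: map -> group by key -> reduce.
from collections import defaultdict

documents = ["big data is big", "data about data"]

# Map: each document (sub-problem) is turned into intermediate (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into the final answer
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```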
Issues in Distributed Model

How should we decompose one big task into smaller ones?

How do we figure out an efficient way to assign tasks to different
machines?

How do we exchange results between machines?

How do we synchronize distributed tasks?

What do we do if a task fails?
Apache Hadoop

Hadoop was created in 2005 by two Yahoo employees (Doug
Cutting and Mike Cafarella) building on white papers by Google on
their MapReduce process.

The name refers to a toy elephant belonging to Doug Cutting’s son

Yahoo later donated the project to Apache in 2006

Hadoop offers a framework of tools for dealing with big data

Hadoop is open source, distributed under the Apache licence
Hadoop Ecosystem
Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/
Hadoop Ecosystem
Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/
MapReducing in Hadoop
[Diagram: applications submit batched jobs to a queue on the master processor, which runs the Job Tracker and the Name Node; each slave processor runs a Task Tracker and a Data Node. The Name Node and Data Nodes are where HDFS comes in; the Job Tracker and Task Trackers are where MapReduce comes in.]
Fault Handling in Hadoop

 Distributing processing means that, sooner or later, part of the distributed processing network will fail
  A practical truth of networks: they are unreliable
 Hadoop's HDFS has fault tolerance built in for data nodes
  Three copies of each file are maintained by Hadoop
  If one copy goes down, the data is retrieved from another
  The faulty node is then updated with new (working) data from a backup copy
 Hadoop also tracks failures in task trackers
  The master node's job tracker watches for errors in slave nodes
  Tasks are reallocated to another slave if the slave responsible fails
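A minimal sketch of the replication idea behind that fault tolerance: each block is kept on three nodes, and when a node fails a surviving copy is used to restore the replication factor. The node names, block ids, and helper functions are invented for illustration.

```python
# Illustrative replication: three copies of each block, re-replicated after a failure.
REPLICATION = 3

nodes = {"node-a": {}, "node-b": {}, "node-c": {}, "node-d": {}}
block_map = {}   # block_id -> list of node names holding a replica

def write_block(block_id, data):
    targets = list(nodes)[:REPLICATION]
    for n in targets:
        nodes[n][block_id] = data
    block_map[block_id] = targets

def fail_node(name):
    nodes.pop(name)
    for block_id, replicas in block_map.items():
        if name in replicas:
            replicas.remove(name)
            source = replicas[0]                               # a surviving copy
            spare = next(n for n in nodes if n not in replicas)
            nodes[spare][block_id] = nodes[source][block_id]   # re-replicate
            replicas.append(spare)

write_block("blk_001", b"...file contents...")
fail_node("node-a")
print(block_map["blk_001"])   # still three replicas: ['node-b', 'node-c', 'node-d']
```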
Programming in Hadoop

 Programmers using Hadoop don't have to worry about
  Where files are stored
  How to manage failures
  How to distribute computation
  How to scale activities up or down
 A variety of languages can be used, though Java is the most common and arguably the most hassle-free
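Java is the usual choice, but Hadoop Streaming also lets mappers and reducers be written in other languages as programs that read lines from standard input and emit tab-separated key/value pairs. Below is a minimal word-count sketch in Python; the file names mapper.py and reducer.py are only illustrative.

```python
# mapper.py -- reads raw input lines from stdin and emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop sorts mapper output by key, so identical words arrive
# on consecutive lines; sum the counts for each run of identical keys.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These scripts would typically be passed to the Hadoop Streaming jar via its -mapper and -reducer options; the exact invocation depends on the Hadoop version and installation.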
Implementing a Hadoop System

Hadoop can be run in traditional onsite data centres using multiple
dedicated machines

Hadoop can also be run via cloud-hosted services, including
 Microsoft Azure
 Amazon EC2/S3
 Amazon Elastic MapReduce
 Google Compute Engine
Implementing a Hadoop System:
Yahoo Servers Running Hadoop
Image from http://thecloudtutorial.com/hadoop-tutorial.html
Applications of Hadoop

Areas of application include
 Search engines – e.g. Google, Yahoo
 Social media – e.g. Facebook, Twitter
 Financial services – Morgan Stanley, BNY Mellon
 eCommerce – e.g. Amazon, American Airlines, eBay, IBM
 Government – e.g. Federal Reserve, Homeland Security
Users of Hadoop

Just like RDBMS, Hadoop systems have different levels of users

Administrators handle
 Configuring of the system
 Updates and installation
 General firefighting

Basic users
 Run tests and gather data for reporting, market research,
general exploration, etc.
 Design applications to use data
Accessibility of NoSQL databases?
Image from spaaaawk.tumblr.com
Want to read more?

 Apache Hadoop Documentation:
  http://hadoop.apache.org/docs/current/
 Data-Intensive Text Processing with MapReduce:
  http://lintool.github.io/MapReduceAlgorithms/
 Hadoop: The Definitive Guide:
  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
Want to read more?

 Financial Services using Hadoop:
  http://hortonworks.com/blog/financial-services-hadoop/
  https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
 Hadoop at ND:
  http://ccl.cse.nd.edu/operations/hadoop/