Transcript Lecture 8
IS6146 Databases for Management
Information Systems
Lecture 8: Working with unstructured data
Rob Gleasure
[email protected]
robgleasure.com
IS6146
Today’s session
Technologies for analysis
Technologies for storage
NoSQL
Distributed MapReduce architectures, e.g. Hadoop
Technologies and tools
Data lifecycle
[Diagram] Create → Capture → Store → Analyse → Present, supported by Business Intelligence tools
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Tools for analysis and presentation
Massive range of software, depending on needs
For visualising and sorting data
Excel
Pentaho
For data mining, regressions, clustering, graphing, etc.
SPSS
R
Gephi
UCINET
For reporting
Excel
Pentaho
Let’s get our hands data-y!
(I know. Sorry.)
Tools for analysis and presentation
Image from www.etsy.com
Data warehousing
Data lifecycle
[Diagram] Create → Capture → Store → Analyse → Present, supported by Business Intelligence tools; this section: Data Warehousing
Data warehousing
[Diagram] OLTP → OLAP: operational databases (HR and payroll, Sales and customers, Orders, Technical support, Purchased data) feed an Extract-Transform-Load (ETL) process into the data warehouse; on the OLAP side, a business intelligence database supports data mining, visualisation, and reporting
OLTP vs. OLAP
Online transaction processing (OLTP) databases/data stores
support ongoing activities in an organisation
Hence, they need to
Manage accurate real-time transactions
Handle reads, writes, and updates by large numbers of
concurrent users
Decompose data into joinable, efficient rows (e.g. normalised to 3rd normal form)
These requirements are often summarised as ACID database transactions
Atomic: Every part of a transaction works, or it is all rolled back
Consistent: The database is never left in an inconsistent state
Isolated: Transactions do not interfere with one another
Durable: Completed transactions are not lost if the system crashes
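The atomicity property can be seen with Python's built-in sqlite3 module. This is a minimal sketch (the table and values are made up for illustration): a transfer that fails halfway is rolled back in full, so the database never shows a partial debit.

```python
# Minimal sketch of ACID atomicity using Python's built-in sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # this block commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        # Simulate a failure mid-transaction: the debit above is undone
        raise RuntimeError("transfer failed")
except RuntimeError:
    pass

balance = conn.execute("SELECT balance FROM accounts "
                       "WHERE name = 'alice'").fetchone()[0]
print(balance)  # alice's balance is unchanged: 100
```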
OLTP vs. OLAP
Online analytical processing (OLAP) databases/data stores are used
to support predictive analytics
Hence, they need to
Allow vast quantities of historical data to be accessed quickly
Be updatable in batches (often daily)
Aggregate diverse structures with summary data
These requirements are often summarised as BASE database transactions
Basic Availability: the store responds to every request, even if with stale data
Soft-state: the state of the system may change over time, even without new input
Eventual consistency: replicas converge on the same values once updates stop arriving
NoSQL
Data lifecycle
[Diagram] Create → Capture → Store → Analyse → Present, supported by Business Intelligence tools; storage technologies so far: Data Warehousing; this section: NoSQL
What is NoSQL?
What is NoSQL?
Basically any database that isn’t a relational database
Stands for ‘Not only SQL’
It’s NOT anti-SQL or anti-relational databases
Image from www.improgrammer.net
What is NoSQL (continued)?
It’s not only rows in tables
NoSQL systems store and retrieve data from many formats, e.g.
text, csv, xml, graphml
It’s not only joins
NoSQL systems mean you can extract data using simple
interfaces, rather than necessarily relying on joins
It’s not only schemas
NoSQL systems mean you can drag-and-drop data into a folder,
without having to organise and query it according to entities,
attributes, relationships, etc.
What is NoSQL (continued)?
It’s not only executed on one processor
NoSQL systems mean you can store databases across multiple
processors with high-speed performance
It’s not only specialised computers
NoSQL systems mean you can leverage low-cost shared-nothing
commodity processors that have separate RAM and disk.
It’s not only logarithmically scalable
NoSQL systems mean you can achieve linear scalability as you
add more processors
It’s not only anything, really…
NoSQL systems emphasise innovation and inclusivity, meaning
there are multiple recognised options for how data is stored,
retrieved, and manipulated (including standard SQL solutions)
What is NoSQL (continued)?
Four Data Patterns in NoSQL
Image from http://www.slideshare.net/KrishnakumarSukumaran/to-sql-or-no-sql-that-is-the-question
Key-value stores
A simple string (the key) returns a Binary Large OBject (BLOB) of
data (the value)
E.g. the web
The key can take many formats
Logical path names
A hash string artificially generated from the value
REST web service calls
SQL queries
Three basic functions
Put
Get
Delete
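The three basic functions can be sketched in a few lines of Python. The class and keys below are purely illustrative, not from any particular product; the point is that the whole interface is put, get, and delete against an opaque value.

```python
# Minimal in-memory sketch of a key-value store's three operations.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # value can be any BLOB-like payload

    def get(self, key):
        return self._data.get(key)  # None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user/42/avatar.png", b"\x89PNG...")  # logical path name as key
print(store.get("user/42/avatar.png"))  # returns the stored bytes
store.delete("user/42/avatar.png")
print(store.get("user/42/avatar.png"))  # None
```

Note there is no query over values here, which is exactly the "no real options for advanced search" disadvantage mentioned below.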
Key-value stores (continued)
Advantages
Scalability, reliability, portability
Low operational costs
Simplicity
Disadvantages
No real options for advanced search
Commercial solutions
Amazon S3
Voldemort
Column-family stores
Stores BLOBs of data in one big table, with four possible basic
identifiers used for look-up
Row
Column
Column-family
Time-stamp
More like a spreadsheet than an RDBMS in many ways (e.g. no
indices, triggers, or SQL queries)
Grew from an idea presented in a Google BigTable paper
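The four identifiers can be sketched as a lookup in the BigTable style: each cell is addressed by row, column-family, column, and timestamp, with the latest timestamp winning. All names below are illustrative, not from Cassandra or HBase APIs.

```python
# Sketch of column-family look-up: cells addressed by
# (row, column-family, column, timestamp).
table = {}

def put(row, family, column, timestamp, value):
    cell = table.setdefault(row, {}).setdefault((family, column), {})
    cell[timestamp] = value  # versions kept per timestamp

def get_latest(row, family, column):
    versions = table.get(row, {}).get((family, column), {})
    if not versions:
        return None
    return versions[max(versions)]  # most recent timestamp wins

put("user42", "profile", "name", 100, "Ada")
put("user42", "profile", "name", 200, "Ada Lovelace")  # newer version
print(get_latest("user42", "profile", "name"))  # Ada Lovelace
```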
Column-family stores (continued)
Advantages
Scales pretty well
Decent search ability
Easy to add new data
Pretty intuitive
Disadvantages
Can’t query BLOB content
Not as efficient to search as some other options
Commercial solutions
Cassandra
HBase
Document stores
Stores data in nested hierarchies (typically using XML or JSON)
Keeps logical chunks of data together in one place
[Diagram] Flat tables (e.g. CSV); hierarchical docs (e.g. JSON); mixed content (e.g. XML)
Document stores (continued)
Advantages
Lends itself to efficient within-document search
Very suitable for information retrieval
Very suitable where data is fed directly into websites/applications
Allows for structure without being overly restrictive
Disadvantages
Complicated to implement
Search process may require opening and closing files
Analysis requires some flattening
Commercial solutions
MarkLogic
MongoDB
Graph stores
Model the interconnectivity of the data by focusing on nodes
(sometimes called vertices), relationships (sometimes called edges),
and properties
Image from http://savas.me/2013/03/on-graph-data-model-design-relationships/
Graph stores (continued)
Separate tables are stored for nodes and for edges, meaning new types
of search (e.g. traversals along relationships) become possible
Graph stores (continued)
Advantages
Fast network search
Works with many public data sets
Disadvantages
Not very scalable
Hard to query systematically unless you use specialised
languages based on graph traversals
Commercial solutions
Neo4j
AllegroGraph
So, where to get the power needed for
these giant data stores?
Image from chucks-fun.blogspot.com
MapReduce
Data lifecycle
[Diagram] Create → Capture → Store → Analyse → Present, supported by Business Intelligence tools; technologies so far: Data Warehousing, NoSQL; this section: MapReduce
Traditional (Structured) Approach
[Diagram] A single high-power processor works directly against the entire big-data store
The MapReduce Concept
[Diagram] The big data is split across many data nodes; a master processor coordinates standard slave processors, each working on its own data nodes
The MapReduce Concept
Two fundamental steps
1. Map
Master node takes a large problem and slices it into sub-problems
Master node distributes these sub-problems to worker nodes
A worker node may also subdivide and distribute (in which case a
multi-level tree structure results)
Workers process their sub-problems and hand results back to the master
2. Reduce
Master node reassembles the solutions to the sub-problems in a
predefined way to answer the high-level problem
Issues in Distributed Model
How should we decompose one big task into smaller ones?
How do we figure out an efficient way to assign tasks to different
machines?
How do we exchange results between machines?
How do we synchronize distributed tasks?
What do we do if a task fails?
Apache Hadoop
Hadoop was created in 2005 by Doug Cutting and Mike Cafarella,
building on white papers Google published about its MapReduce process
The name refers to a toy elephant belonging to Doug Cutting’s son
The project moved to the Apache Software Foundation in 2006
Hadoop offers a framework of tools for dealing with big data
Hadoop is open source, distributed under the Apache licence
Hadoop Ecosystem
Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/
Hadoop Ecosystem
Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/
MapReducing in Hadoop
[Diagram] Applications submit batches of jobs to a queue on the master processor, where a Job Tracker assigns work (this is where MapReduce comes in) and a Name Node tracks data placement (this is where HDFS comes in); each slave processor runs a Task Tracker and a Data Node
Fault Handling in Hadoop
Distributed processing means that, sooner or later, part of the
processing network will fail
A practical truth of networks: they are unreliable
Hadoop’s HDFS has fault tolerance built-in for data nodes
Hadoop maintains three copies of each file (the default replication factor)
If one copy goes down, data is retrieved from another
Faulty node is then updated with new (working) data from backup
Hadoop also tracks failures in task trackers
The master node’s job tracker watches for errors in slave nodes
If the slave responsible for a task fails, the task is reallocated to a new slave
Programming in Hadoop
Programmers using Hadoop don’t have to worry about
Where files are stored
How to manage failures
How to distribute computation
How to scale up or down activities
A variety of languages can be used, though Java is the most
common and arguably most hassle-free
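One reason other languages work is Hadoop Streaming, where the mapper and reducer are plain scripts reading stdin and writing tab-separated key-value lines to stdout, while Hadoop handles distribution, shuffling, and failures. Below is a sketch of that style, simulated in-process (the function names and data are illustrative, not part of any Hadoop API):

```python
# Sketch of Hadoop Streaming-style programming: mapper and reducer
# exchange tab-separated "key\tvalue" lines; the sort between them
# plays the role of Hadoop's shuffle phase.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

pairs = sorted(mapper(["big data", "big tools"]))  # "shuffle" by sorting
print(list(reducer(pairs)))  # ['big\t2', 'data\t1', 'tools\t1']
```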
Implementing a Hadoop System
Hadoop can be run in traditional onsite data centres using multiple
dedicated machines
Hadoop can also be run via cloud-hosted services, including
Microsoft Azure
Amazon EC2/S3
Amazon Elastic MapReduce
Google Compute Engine
Implementing a Hadoop System:
Yahoo Servers Running Hadoop
Image from http://thecloudtutorial.com/hadoop-tutorial.html
Applications of Hadoop
Areas of application include
Search engines – e.g. Google, Yahoo
Social media – e.g. Facebook, Twitter
Financial services – e.g. Morgan Stanley, BNY Mellon
eCommerce – e.g. Amazon, American Airlines, eBay, IBM
Government – e.g. Federal Reserve, Homeland Security
Users of Hadoop
Just like RDBMS, Hadoop systems have different levels of users
Administrators handle
Configuring of the system
Updates and installation
General firefighting
Basic users
Run tests and gather data for reporting, market research,
general exploration, etc.
Design applications to use data
Accessibility of NoSQL databases?
Image from spaaaawk.tumblr.com
Want to read more?
Apache Hadoop Documentation:
http://hadoop.apache.org/docs/current/
Data Intensive Text Processing with Map-Reduce
http://lintool.github.io/MapReduceAlgorithms/
Hadoop Definitive Guide:
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
Want to read more?
Financial Services using Hadoop
http://hortonworks.com/blog/financial-services-hadoop/
https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
Hadoop at ND:
http://ccl.cse.nd.edu/operations/hadoop/