Transcript Adrian Tear
Drilling for Big Data
‘GIS with NoSQL’ Workshop
AGILE 2016, Helsinki, 14 June 2016
Adrian Tear
Department of Geography, University of Portsmouth
[email protected]
Introduction
•Past and Present projects
Traditional and Modern database systems
• Tried and Failed…
• Tried and Worked!
Demonstrations:
• Oracle 12c / Tableau 9.3 (local)
• Oracle 12c / Tableau 9.3 (remote supercomputer/local)
• MapR 5 / Tableau 9.3 (remote supercomputer/local)
Summary / Discussion
Past projects
•1997
UK General Election
2000 London Election
2001 UK General Election
Present projects
•PhD Researcher,
University of
Portsmouth
Honorary Fellow,
University of
Edinburgh
GIScience, Big
Data, Locative
Media
Present projects – PhD Working Title
•Geo-tagging matters!
The role and
significance of
geography in online
social network
interactions.
Why?
Past:
• Long-standing interest in politics
• Published three major political web sites in the Web 1.0 Era
• Could detect what people were looking at (sort of) from where
• No User Generated Content or any opinions…
Present:
• Still interested in political geography in the Web 2.0 Era
• No longer a publisher, but can consume public social media data
• Can detect what people are looking at (sometimes) from where
• Plenty of User Generated Content and many opinions!!!
•
What?
•2012 US Presidential
Election campaign:
2014 Scottish
Independence
Referendum campaign:
A lot of data!
• US2012_GEO – 146,424 records (1:5 sample, geo exists)
• US2012_NON_GEO – 1,560,967 records (1:50 sample)
• US2012_NON_GEO_HISP – 11,276 records (1:50, Lang=ES)
• SCOT2014 – 6,477,713 records
• ~8 million records – 90% Twitter/10% Facebook
• 0.86-1.44% geo-tagged; 3 and 12 month data runs US/SCOT
• >20GB raw CSV or JSON; >100GB in indexed database
How?
•Datasift:
• Curated Stream Definition Language (CSDL)
offers OSN filtering:
• Twitter Firehose
• Facebook
• Others
• Runs as a task
• Stores data on Datasift servers (in MongoDB?)
• Downloadable in CSV/JSON
• Used to offer ‘Pay as you go’ pricing model:
• Unfortunately now only open to Subscribers
Datasift CSDL for US2012_GEO Stream
•
tag "Democratic Party" { interaction.content any "democrat,democrats,democratic,barack obama,obama,joe biden,biden" }
tag "Republican Party" { interaction.content any "republican,republicans,mitt romney,romney,paul ryan,ryan" }
tag "Positive" { salience.content.sentiment > 0 }
tag "Neutral" { salience.content.sentiment == 0 }
tag "Negative" { salience.content.sentiment < 0 }
tag "Barack Obama" { interaction.content any "barack obama,barack,obama" }
tag "Mitt Romney" { interaction.content contains "mitt romney,mitt,romney" }
return {
interaction.content any "2012 election,presidential election,US president,president of the united
states,republican,republicans,mitt romney,mitt,romney,democrat,democrats,democratic,barack obama,barack,obama"
AND language.tag == "en"
AND salience.content.sentiment exists
AND interaction.geo exists
AND interaction.sample < 20
}
Set up the task and leave it running…
Download – Comma Separated Values (CSV)
•
US2012:
SCOT2014:
Aim:
• 3 CSV files one for each Stream
• Inconsistent set of fields in each file
• 3 CSV files for one (bigger) Stream
• Consistent themselves but inconsistent with US2012 files
• Read into Relational Database Management System (RDBMS)
• One table for all records with cross-consistent fields
• Tagged by Stream
Inconsistency in CSV fields
• Tear, A. (2014). SQL or NoSQL? Contrasting Approaches to the
Storage, Manipulation and Analysis of Spatio-temporal Online Social
Network Data. In B. Murgante, S. Misra, A. M. A. C. Rocha, C. Torre, J.
G. Rocha, M. I. Falcão, … O. Gervasi (Eds.), Computational Science and
Its Applications -- ICCSA 2014: 14th International Conference,
Guimarães, Portugal, June 30 -- July 3, 2014, Proceedings, Part I (pp.
221–236). Cham: Springer International Publishing.
http://doi.org/10.1007/978-3-319-09144-0_16
Download – JavaScript Object Notation (JSON)
US2012:
• 3 JSON files one for each Stream
• Self-describing variable numbers of key/value pairs (with
nesting); 146 key names
SCOT2014:
• 3 JSON files for one (bigger) Stream
• Self-describing variable numbers of key/value pairs (with
nesting); 411 key names
Aim:
• Read into a database system to allow querying preferably
without Extract/Transform/Load efforts required of CSV
• Two ‘tables’ for US2012 and SCOT2014 datasets
•
Traditional/Modern DB – Tried and failed…
•Microsoft SQL Server 2012 (CSV)
• Could not handle UTF-8 encoded character set (Emoticons etc.)
PostgreSQL 9.3 beta (JSON)
• Couldn’t get it to work at the time with my JSON data
Others:
• Google BigQuery – loaded CSV OK but JSON (at the time) required schema definition for a schema-less
dataset, puzzling?
• IBM InfoSphere BigInsights Quickstart VM – would load smaller JSON files into a Hadoop-under-the-hood
‘sheets’ system but not the whole dataset
• Teradata 15 – couldn’t get it to work
• Apache Drill 0.9 – wouldn’t handle text over 128k characters (Facebook text wider than Twitter!)
Traditional/Modern DB – Tried and worked!
•Oracle 12c
• CSV import successful with careful SQLLDR Control File editing
• JSON successfully imported into Constrained CLOB field
• US2012 dataset (~1.7m records) indexed correctly
• SCOT2014 dataset (~6.5m records) did not (411 key names some exceeding 64 byte name length; fix coming 12c R2)
MarkLogic 7
• JSON import successful (on US2012 dataset; but had to split 3 LF delimited files into ~1.7m atomic files at the time)
Endeca Information Discovery
• Ingested smaller JSON files but not the whole lot; would ingest large CSV files with processing to alter date formats etc
MapR Hadoop ecosystem with Drill
• Works locally in ‘single Drillbit’ mode on the laptop and recently installed on 5-node supercomputer; works on the
whole dataset including GZ compressed JSON
It’s a ‘moveable feast’…
•Software is changing
all the time:
• ‘Big Data’ a growth area for many vendors
• New releases bring new/improved functionality
Some software hasn’t
been tried:
• MongoDB and CouchDB tried briefly; hard to fathom
• Countless others…
Inertia:
• A definite factor; how many systems is it possible/desirable to learn
Infrastructure:
• Largely home-grown on old kit or using Virtual Machines; older
physical kit or VMs may not be adequate to run high-end software
SQL wins out?
•Familiar
Purpose of
Drill
Standard(ish)
Mainstay of
Oracle
Demonstration
Compute environments
•Laptop – Windows 10
• Dell Latitude 7440
• Intel i5, 16GB RAM, 128GB (OS) and 512GB (Data) SSDs
• Running Oracle 12c, Oracle SQL Developer, Tableau 9.3 and Oracle VirtualBox
Supercomputer – Scientific Linux 6
• SCIAMA supercomputer; cluster of 3,702 compute cores (commodity Dell boxes with Infiniband)
used by Institute of Cosmology and Gravitation, University of Portsmouth
• MapR Hadoop ecosystem (v5) on 5 nodes (of 12 cores, 24GB RAM, 2TB SATA disk each)
• Oracle 12c (12.1.0.2) on 1 node (of 12 cores, 24GB RAM, 2*500GB SSDs)
Shedputer – Windows Server 2012
• 2*IBM X3850M2 quad core Xeon, 128GB RAM, 4*146GB 10K SAS Magnetic and 4*200GB SAS
SSD drives (no speed increase; backplane too slow!!)
• 1*DELL PowerEdge 2950, 32GB RAM, 4*73GB 15K SAS Magnetic drives – too noisy to use!!
Software ‘stack’
•DBMS
• Oracle 12c and/or
• MapR/Drill
Graph analysis
• Oracle Spatial and Graph and/or
• Neo4j
Natural Language Processing
• GATEcloud
• AlchemyAPI
Visualisation
• Tableau
• Desktop GIS
Unattainability?
•Recent decades have witnessed an increased growth in data generated by
information, communication, and technological systems, giving birth to the
‘Big Data’ paradigm. Despite the profusion of raw data being captured by
social media platforms, Big Data require specialized skills to parse and analyze
— and even with the requisite skills, social media data are not readily
available to download. Thus, the Big Data paradigm has not produced a
coincidental explosion of research opportunities for the typical scholar.
• Zelenkauskaite, A., & Bucy, E. (2016). A scholarly divide: Social media, Big Data, and
unattainable scholarship. First Monday, 21(5). Retrieved from
http://ojphi.org/ojs/index.php/fm/article/view/6358
Summary
•Handling text-rich
un/semi-structured
data is difficult,
whatever software
vendors would have
you believe!
Fast I/O for large
datasets – in
exploratory mode – is
most useful; use Solid
State Disks if you can
(Oracle/Tableau 4min
to 40sec speed-up)
The number of RDBMSs
available has narrowed
but there are many
(perhaps too many?)
commercial/open
source NoSQL
databases
When choosing
systems consider your
own experience; e.g. if
you think JavaScript try
MongoDB, if you think
SQL try MapR and
Drill…
…Or CSV and Hive… Or
CSV and Spark… Or
some other
combination…
Discussion
•Familiarity/Unfamiliarity will
inevitably sway your decision
Once you have proof of concept get
IT buy-in, you will need support
The aim is to get results and not to
become a prisoner of technology
Resources
• JSON in Oracle database 12c
• Apache Drill
• MapR/Apache Drill: It’s drilliant to query JSON files from Tableau
• GATEcloud Natural Language Processing
• AlchemyAPI Natural Language Processing