Databases on Wrangler
Niall Gaffney, Christopher Jordan,
Tomislav Urban & David Walling
Outline
• Quick Introduction to Wrangler (Chris J)
• Database Technology Overview (Chris J)
– Short Break (5-10 minutes)
• Database Options on Wrangler (Chris J)
• Populating and Maintaining Data (David W)
• Wrangler Allocations for Databases (Niall G)
Mechanics
• Will move succinctly through presentations without pausing for questions
• Leave time for Q&A at the end of each section
• Online chat - TACC staff should be online to either answer questions or hold them for the presenter
Wrangler in 10 Minutes
Niall Gaffney, Christopher Jordan
Acknowledgments
• The Wrangler project is supported by the Division of Advanced Cyberinfrastructure at the National Science Foundation.
– Award #ACI-1447307, “Wrangler: A Transformational Data Intensive Resource for the Open Science Community”
Project Partners
• Academic partners:
– TACC: Primary system design, deployment, and operations
– Indiana U.: Hosting/operating the replicated system and end-to-end network tuning
– U. of Chicago: Globus Online integration, high-speed data transfer from user and XSEDE sites
• Vendors: Dell, DSSD (subsidiary of EMC)
Goals for Wrangler
• To address the data problem in multiple dimensions
– Data at large and small scales, reliable, secure
– Lots of data types: structured and unstructured
– Fast, but not just for large files and sequential access. Need high transaction rates and random access too.
• To support a wide range of applications and interfaces
– Hadoop, but not *just* Hadoop.
– Traditional languages, but also R, GIS, databases, and other workflows that do not perform in the classic HPC style.
• To support the full data lifecycle
– More than scratch
– Metadata and collection management support
Wrangler Hardware
[Diagram: Mass Storage Subsystem, 10 PB (replicated); IB interconnect, 120 lanes (56 Gb/s), non-blocking; Access & Analysis System, 96 nodes, 128 GB+ memory, Haswell CPUs, interconnect with 1 TB/s throughput; High Speed Storage System, 500+ TB, 1 TB/s, 250M+ IOPS]
Three primary subsystems:
– A 10 PB, replicated disk storage system
– An embedded analytics capability of several thousand cores
– A high-speed global file store
• 1 TB/s
• 250M+ IOPS
Wrangler At Large
[Diagram: At TACC, the 10 PB replicated Mass Storage Subsystem, the IB interconnect (120 lanes, 56 Gb/s, non-blocking), the Access & Analysis System (96 nodes, 128 GB+ memory, Haswell CPUs, interconnect with 1 TB/s throughput), and the High Speed Storage System (500+ TB, 1 TB/s, 250M+ IOPS); at Indiana, a second 10 PB replicated Mass Storage Subsystem and a 24-node Access & Analysis System (128 GB+ memory, Haswell CPUs); the sites are linked by 40 Gb/s Ethernet and a 100 Gbps public network, with Globus for transfers]
Analysis Hardware
• The high speed storage will be directly connected to 96 nodes for embedded processing.
– Each analytics node will have 24 Intel Haswell cores, 128 GB of RAM, 40 Gb Ethernet, and Mellanox FDR networking.
DSSD Storage
• The flash storage provides the truly “innovative capability” of Wrangler
• Not SSD; a direct-attached PCI interface allows access to the NAND flash
– Not limited by 40 Gb/s Ethernet or 56 Gb/s IB networking
• Flash storage not tied to individual nodes
– Not single PCI storage in a node
• More than half a petabyte of usable storage space once “RAIDed”
• Could handle continuous writes for 5+ years without loss due to memory wear
Wrangler Reservations
• Data motion is too expensive for many data-driven jobs to work within the HPC environment's 48-hour maximum job length and the shared flash storage system
• Many data investigations are interactive data analysis “campaigns”, but HPC environments do not allow working on the login nodes
• For this, we introduce the ability to reserve subclusters of Wrangler for data analysis in a cloud-like way
• An allocation is charged for all reserved nodes from the start of the reservation until it ends or is canceled
• Jobs run within a reservation are not charged
• Jobs run outside of a reservation are charged for node hours used
Long Term File Systems
• A standard /home directory for each user
• A globally mounted /work file system
– Your Wrangler work directory is stored in the $WORK environment variable
– Global Work (which is Stampede's $WORK) is saved in $STOCKYARD
• A /data file system for staging input files for processing, interim result files from computations, and preserving results from computations
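A quick way to orient yourself is to echo these variables from a login shell (a minimal sketch; the variables are as described above):

echo $HOME        # standard home directory
echo $WORK        # Wrangler work directory
echo $STOCKYARD   # global work (Stampede's $WORK)
ls /data          # staging area for inputs, interim files, and results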
Database Technology Overview
Christopher Jordan, Tomislav Urban
Database Options (and more options)
• Many, many database options now available
• Open source and startup worlds both
producing lots of innovation/development
• We will focus on robust, widely-used options
• Expansive internet resources available for the
interested
Data Models
• Relational Data is the most common model in
current databases
• Other data models (Object, Graph/Network)
may be appropriate for specific needs
• Will mostly focus on relational data here
Database Concepts
• Record – a single instance of the data
structure – a row in RDBMS
• Column – all rows of a single data element,
e.g. all the last names in an address book
• Table or Relation – a collection of records or
rows
Database Concepts 2
• Key – An element that occurs in multiple tables,
can be used to link rows across tables
• Index – A secondary data structure used for
accelerating access to records, usually a key
• Query – A command issued to the database
• Constraint – A limitation on a specific data
element, e.g. “integers less than 1000”
Structured Query Language
• SQL Standard is a very powerful language for
defining and querying structured data
• Appropriate when data is consistent, structured
• Relational Database Management Systems
– Typically an RDBMS is a SQL engine
– Postgres, MariaDB/MySQL are supported
– Others (MonetDB) as requested
What does SQL look like?
• CREATE TABLE tablename (id int, name varchar(80), address text);
• INSERT INTO tablename VALUES (6, 'Jane Doe', '123 Main St');
• SELECT name FROM tablename WHERE id = 6;
• Note that all operations act on one or more “rows”
Transactions and ACID
• Atomicity – Each transaction completes entirely or not at all
• Consistency – The database is in a valid state before and after each transaction
• Isolation – Concurrent transactions do not see each other's intermediate state
• Durability – Changed data stays changed
• In SQL, you can group operations together into a single “transaction” in ACID terms; see the sketch below
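As a minimal sketch of grouping operations into a single transaction (the accounts table and its columns are hypothetical), using the psql client:

psql mydb -c "BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;"
# either both updates become durable together, or neither does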
Stored Procedures and Triggers
• Complex/multi-part queries may be reused
• One change to data may imply or mandate
another change (or something different)
• Stored procedures are like functions, triggers
are policies or rules (when X happens, then Y
must be done as well)
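A minimal sketch of a trigger in Postgres (the tablename table and its modified column are hypothetical): the trigger enforces the policy "when a row is updated, its timestamp must be refreshed as well".

psql mydb -c 'CREATE FUNCTION touch_modified() RETURNS trigger AS $$
BEGIN
  NEW.modified := now();  -- refresh the timestamp on every update
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;'
psql mydb -c 'CREATE TRIGGER set_modified BEFORE UPDATE ON tablename
FOR EACH ROW EXECUTE PROCEDURE touch_modified();'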
Relational Data Model
Database Interaction Model
• Almost all SQL engines are client-server systems
(SQLite arguably an exception)
• One or more server processes on one or more
nodes, sometimes a “head” node
• Clients run anywhere with network access to the
server process
• Interact using standard protocols, SQL or a
database-specific language
Database Access
• Direct clients – command-line or graphical
shells allowing client-server interaction
• APIs – libraries for programming languages
• ODBC/JDBC – standard connectors allowing
interaction with many different database
types
SQL Example Applications
• Web backend – most common example, backend
storage for web application
– Can include both application and domain data
• Large tabular dataset storage
– SQL used to extract subsets for analysis
• Database as application engine
– SQL can be used for many patternmatching/comparison/data structure applications
SQL as Application Engine
• OrthoMCL Example
– Grouping of orthologous protein sequences
– Load output of BLAST sequence recognition tool into
MySQL
– SQL used to compare sequences globally, identify
similar sequences and generate weightings
– All significant computational effort besides BLAST is
done in the database
PostgreSQL
• Mature database with a focus on standards compliance, ACID properties, and reliability
• Single-node, threaded but with limitations
• Supports the widest array of SQL standard operations of the open-source options
MariaDB/MySQL
• MySQL - open-source database now “owned” by Oracle and subsequently forked
• MariaDB/Percona/others are all based on MySQL
• Relatively easy to install and use; plug-in architecture for storage/authentication
• Not as ACID/SQL-compliant as Postgres
NoSQL
• Trading one or more SQL characteristics for
reliability, performance, simplicity, etc
• Used when data is semi-structured or has a
very particular structure:
– Columnar data
– Key-Value pairs
– JSON and XML data
When to Choose a NoSQL Option
• In most cases a traditional RDBMS will be your
best choice (maturity, application support)
• If your data/queries don’t match the RDBMS
model well (e.g. graph/network data)
• If your application depends on a key-value store
or another specialized component
• If you need to apply node parallelism
MongoDB
• “NoSQL” – has a Mongo-specific query
language
• Document-centered – store and manipulate
JSON “documents” rather than rows
• Node-parallel – Can “shard” databases across
multiple nodes for capacity and performance
MongoDB and JSON
• JSON/BSON as native data type
• JSON, like XML, is structured but without an
inherent schema
• Can define/impose a schema in your
application
• Still need to pay attention to data structure
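A minimal sketch of the document model from the mongo shell (the database and collection names are hypothetical):

# insert a JSON document and read it back; no schema is declared anywhere
mongo mydb --eval 'db.specimens.insert({name: "sample-1", grade: 4, tags: ["fossil"]})'
mongo mydb --eval 'printjson(db.specimens.findOne({grade: 4}))'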
The GIS Software Stack
Tomislav Urban
What is GIS
• GIS stands for Geographic Information System
A geographic information system (GIS) is a system designed to capture,
store, manipulate, analyze, manage, and present all types of spatial or
geographical data.
(Wikipedia)
• So just as a normal relational database
tracks attributes of entities in the form of
strings, integers, Booleans, floating point
numbers, etc., in columns or fields; a GIS
adds the ability to store spatial data in the
form of points, lines or polygons alongside
the non-spatial attributes of those entities.
Spatial Data
• Storing spatial data is similar to any other data
type. Typically, the GIS will provide a geometry
data type that can often be subtyped as a point,
line or polygon. In a database this will be stored
in a native way:
• Here is a point:
0101000020E6100000D9A0F226633944401E338C5A0D172640
• But can also be shown in the WKT (Well Known
Text) format:
POINT(40.44833838317 11.0450237556593)
Spatial Data
• Spatial data exists in the real world on Earth and as such also must
be characterized with an SRID (Spatial Reference System Identifier)
– SRIDs are numbers that often refer to the codes used by the EPSG
(European Petroleum Survey Group)
– For example, GPS devices typically return readings using EPSG:4326
otherwise known as WGS 84 or World Geodetic System 1984, while
Google Maps uses EPSG:3857 referred to as “Web Mercator”
• The SRIDs may refer to non-projected (i.e. geographic coordinates), projected, or local coordinate systems.
Spatial Databases
• Many modern relational database systems
provide support for spatial data
– PostgreSQL with PostGIS
– MariaDB/MySQL
– Microsoft SQL Server
– Oracle Spatial
– SpatiaLite
– H2
Spatial Databases
• The database provides a storage facility for spatial
data, but also allows for queries and analysis.
– Here is a simple query (using PostGIS) that lists specimens together with the search locality in which they were found:
SELECT l.id AS locality_id, o.id AS specimen_id
FROM drp_occurrence o, drp_locality l
WHERE ST_Within(o.geom, l.geom)
– The ST_Within function determines whether one
geometry falls within another.
The Stack
• Once your spatial data have been stored in a database, analyzed, queried, and so forth, you may wish to make these data available. Here is a typical GIS software stack:
[Stack diagram: Desktop Client / Web Client → OGC Compliant GIS Server → Spatially enabled RDBMS]
The Stack
• We’ve covered a bit about the bottom part of the stack, the database, above.
• Now let’s address the server.
[Stack diagram: Desktop Client / Web Client → OGC Compliant GIS Server → Spatially enabled RDBMS]
The GIS Server
• The OGC (Open Geospatial Consortium) sets standards
for web services that allow GIS servers to communicate
with clients
• The two most important standards are:
– WMS – Web Map Service. WMS serves images that are
tiled to produce maps. You can think of WMS as the
“raster” service, though the underlying data is not typically
stored as a raster.
– WFS – Web Feature Service. WFS serves data providing the
client with the discrete points that make up the geometry.
This can be thought of as the “vector” service.
WMS
• Requests come in as parameterized URLs:
…/wms?LAYERS=stratigraphy%3Ausgs_strata&FORMAT=image%2Fpng&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A900913&WIDTH=256&HEIGHT=256
• Responses come back as image tiles (e.g. PNGs)
WMS
• The GIS Server will cache pre-computed tiles
in order to support performant zoom
capabilities for your maps
WFS
• Requests come in as parameterized URLs:
…/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=stratigraphy:usgs_strata&maxFeatures=50&outputFormat=application%2Fjson
• Responses come back as text (e.g. GeoJSON):
{"type":"FeatureCollection","totalFeatures":319822,"features":[{"type":"Feature","id":"usgs_strata.309165","geometry":{"type":"MultiPolygon","coordinates":[[[[1.2219492278393215E7,5573973.086145616],[1.2218750755323239E7,5574930.573897161],[1.2218400054957133E7,5574396.582388744],[1.2219140100366896E7,5573980.728155615],[1.2219312228531137E7,5573800.41573675],[1.2219492278393215E7,5573973.086145616]]]]},"geometry_name":"geom","properties":{"url":"http://mrdata.usgs.gov/geology/state/sgmcunit.php...
The GIS Server
• Some OGC Compliant GIS Servers
– GeoServer (Open source, Java)
– MapServer (Open source, C++)
– ArcGIS Server (Commercial, ESRI)
Web Clients
• Web clients that consume OGC services typically take the form of Javascript libraries such as OpenLayers or Leaflet
• They make consuming and displaying GIS data relatively easy within a web application.
[Stack diagram: Desktop Client / Web Client → OGC Compliant GIS Server → Spatially enabled RDBMS]
Web Clients
• This example shows three layers assembled using OpenLayers
– A Google Maps base layer showing terrain
– A partially transparent WMS layer showing stratigraphy
– A WFS layer showing specific points
Desktop Clients
• Desktop clients provide an interactive interface to spatial data, allowing editing, analysis, and high-quality presentation
• They can connect to databases directly, via OGC services, or with legacy file formats such as shapefiles.
[Stack diagram: Desktop Client / Web Client → OGC Compliant GIS Server → Spatially enabled RDBMS]
Desktop Clients
• Some desktop GIS clients
– QGIS (Open source)
– ArcMap (Commercial, ESRI)
Questions
Databases on Wrangler
Niall Gaffney, Christopher Jordan
Database Usage Modes
• Some “Applications” are primarily database
queries/processing on temporary datasets
• Persistent Databases
– Data Collections/Resources, Stream Processing
• “Transient” Databases
– Database used as temporary engine for processing
– SQLite, other node-local options
Databases for Collections
• Database may itself be a resource
• Can be accessed directly or via a web interface
• This is a valid allocation type for Wrangler
• Needs to utilize “persistent” usage mode
• ECSS may help with creating a collection
Databases for Persistent Services
• Database may be a component of a
networked data service (e.g. GIS layers)
• Wrangler persistent databases are appropriate
for part or all of such services
• Database component should be significant
factor (e.g. not just user accounts)
Current Persistent DB Support
• MySQL/MariaDB v10
– Includes native GIS data types
• Postgres v9.4.1
– PostGIS v2.1.7
• Replication for both coming soon
• Oracle and MonetDB under consideration
Persistent DB Storage
• Persistent databases live on disk
• The flash tier is treated as volatile and is not yet suited for long-term storage
• Flash will be provided as an option
– Reliability considerations for both users and admins
Persistent Database Provisioning
Persistent Database Options
• Type of DB: Postgres or Maria/MySQL
– This list will grow based on demand
• Database/Schema Name
• DBA – Administrative user
• Note that all persistent databases use TACC
authentication and require SSL
Accessing your Persistent DB
• From a Linux system with the Postgres client:
– psql "sslmode=require host=db1.wrangler.tacc.utexas.edu user=<username> dbname=db"
• MySQL/MariaDB:
– mysql --ssl -h db1.wrangler.tacc.utexas.edu -u <username> -p <dbname>
Configuring ODBC/JDBC
• Many tools such as SAS support databases via ODBC/JDBC
• Configuring these is OS- and application-specific, but the options are the same
• Two-step process – configure a data source, then select that data source in the application; see the sketch below
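As an example of the first step with unixODBC on Linux, a data source can be defined in ~/.odbc.ini (a minimal sketch; the DSN name and the Driver entry are assumptions that must match your installed driver):

cat > ~/.odbc.ini <<'EOF'
[wrangler]
Driver     = PostgreSQL
Servername = db1.wrangler.tacc.utexas.edu
Database   = db
Port       = 5432
EOF
# test the data source with unixODBC's command-line client
isql wrangler <username>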
Wrangler and NoSQL databases
• There are many, many NoSQL and SQL
database technologies in development
• Too many options to support them all
• Wrangler support will be based on demand
• Express interest in specific technologies by
contacting TACC or XSEDE
Transient Databases
• Although most databases require a “server”
process, this doesn’t have to be long-lived
• All the DB “servers” we have encountered run
in user-space, just as any other process
• This is also true of many “cluster” databases
• We encourage users to experiment
Running Transient Databases
• DO NOT run on the login node
• You will want a reservation in most cases
• Start an interactive job on one (or more) nodes
• Configure/start the server within the job, as sketched below
• Eventually, we will provide scripts for the most common options (e.g. Postgres, MongoDB)
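A minimal sketch of a transient Postgres server inside an interactive job (paths are hypothetical; run on the compute node, never the login node):

idev                                   # start an interactive job (see TACC docs for options)
initdb -D /data/$USER/pgdata           # one-time: initialize a data directory
pg_ctl -D /data/$USER/pgdata -l pg.log start
createdb mydb                          # create a database and connect locally
psql mydb
pg_ctl -D /data/$USER/pgdata stop      # stop the server before the job ends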
Using Transient Databases
• Very flexible execution model
• Can run clients directly on the compute node alongside the database server
• Or, can run clients on any XSEDE system, or “in the cloud”
• Details of connection and use will be application-specific
MongoDB Example
• MongoDB doesn’t support TACC auth
• Server installed on all compute nodes
• From idev or inside a job script:
– mongod --dbpath /data/<userpath>/mongodb (or put it in the background with nohup)
– mongo <dbname>
• Can put the data path on flash as well
Hadoop and Transient Databases
• Some database technologies now run on top
of Hadoop/MapReduce
• We have a Hadoop partition!
• Same procedure as for other databases, but
create a Hadoop reservation in the user portal
• Should be familiar with Hadoop, though…
Performance Expectations
• Particularly for transient databases, expect at least a 2X-3X improvement over disk-based hosts
• Can be significantly more or less than that, depending on data size and complexity
• The larger the fraction of the total dataset being accessed, the better the comparative performance
Staging Data In
• You can load your database directly from a
desktop/laptop/other cluster
• Performance may be an issue
• Consider staging your data to Wrangler ahead
of time
• Consider preparing your data
Staging Data Out
• Important to remember that transient databases
are transient
• Data will remain on disk, but flash will be purged,
and you need the server process
• Must consider how to retrieve your data before
the job/reservation ends
• Use database dump/backup tools, as sketched below
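A minimal sketch of dumping a database before the job or reservation ends (database names and paths are hypothetical):

pg_dump mydb > /data/$USER/mydb.sql              # Postgres: plain-SQL dump
mysqldump mydb > /data/$USER/mydb.sql            # MySQL/MariaDB equivalent
mongodump --db mydb --out /data/$USER/mongodump  # MongoDB: BSON dump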
Questions?
• Switch to ETL presentation
ETL to Analytics
David Walling
[email protected]
Motivation
Extract -> Transform -> Load
● Extract (SOURCE DATA)
– External source: CSV, database, web-scraped data, free text
– Data is messy
– Understanding + cleaning it is hard: ~80% of the effort
● Transform (CLEAN)
– Rename/map fields: Ex. ‘TX’::string -> 9::foreign_key
– Convert text to integer: Ex. 5’10’’::string -> 70::int
– Convert missing values: Ex. 99999 -> NA
● Load (DATA STORE)
– Often a database; the order of loading is important to comply with integrity constraints
– CSV, HDFS
ETL Tools
● Manual
– Just don’t!
● Unix tools + bash
– Glue together all the tried and true unix tools into a bash script
– grep, awk, sed, etc.
● Higher-level scripting languages
– R, Python, Perl, etc.
● GUIs
– Graphically build a network of sources, transformations, and destinations
– SSIS: part of the MS SQL Server stack
– Informatica: I hear it’s good, but expensive
– Pentaho Kettle: open source… yeah!!!
Example
● players.txt
● Information on basketball players, their position, and their grade (1-5) by NBA scouts
● Missing data
● Bad data (score=6 invalid)
● Data format issues (5’11’’)
Unix Command Line
● Pipe together simple programs using stdin/stdout to perform powerful manipulation
● Command line vs bash script
● Often very efficient
● Can be difficult to learn/remember. --help helps
Core Commands
● grep - find files/lines in file matching *pattern*
● find - find files/dirs with name matching *pattern*
● cat - send all contents to stdout
● less - scrollable file content
● head/tail - show the first/last (e.g. -n 10) lines of a huge file
Additional Tools
● awk - process delimited files line-by-line. Ex. extract the 2nd-5th and 37th-52nd columns of an 89-column CSV file
● sed - stream editor, similar to awk, works line by line. Ex. find/replace
● sort - order stdin data by line
● uniq - remove duplicates
● awk/sed - many Q/A online to get what you need; see the sketch below
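A minimal sketch of cleaning players.txt with these tools (the column layout of name, height, position, grade is an assumption for illustration):

# keep only rows whose grade (4th field) is a valid 1-5 score
awk -F, '$4 >= 1 && $4 <= 5' players.txt > valid.csv
# convert heights like 5'11'' to inches: strip the quotes, then feet*12 + inches
sed "s/''//; s/'/ /" valid.csv |
awk -F, -v OFS=, '{ split($2, h, " "); $2 = h[1]*12 + h[2]; print }' > clean.csv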
Regular Expressions
● regex - a standard…ish way of expressing complex pattern matching
● Used by MANY other tools, sometimes with slight variation
● Very powerful, but gets complex quickly
● Many resources exist to help ‘build’ your regex:
– http://regexr.com/
● At a minimum, get used to using wildcards/escaping, as they are ubiquitous
● Example: /regexr\.com.+foo/g matches
http://regexr.com/foo.html?q=bar
but not
http://regexr2.com/foo.html?q=bar
http://espn.com/foo.html
http://regexr.com/me.html
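The same pattern can be tested from the shell (a minimal sketch; urls.txt is a hypothetical file holding the URLs above, one per line):

# -E enables extended regexes; the escaped dot matches a literal '.'
grep -E 'regexr\.com.+foo' urls.txt
# prints only: http://regexr.com/foo.html?q=bar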
Scripting Languages
● Higher-level programming languages, do many things
● Support for stdin/stdout, more often data import/export through functions
● More complex cleaning logic, modularized code
● Database lookups
● Write unit tests!!!
Perl
● Pro: fast, can be used almost like a straight unix stdin/stdout tool
● Con: notoriously hard to read
Python
● Pro: easy to learn, easy to read, access to 3rd-party packages
● Con: a bit more cumbersome to get going
GUI Based Tools
● Build pipelines via ‘task’ drag/drop + link
● ‘Easy’ to learn, many things done for you
● Easy to understand how data moves through the pipeline
● Often many common tasks are quicker to implement. Ex. form-based setup of database connection information, field mapping, etc.
GUI Based Tools 2
● MS SQL Server Integration Services (SSIS)
– Part of the great MS stack
– Requires Windows
– Not free
● Pentaho Kettle
– Java-based, easy to install, access to other Java tools
– Open source: free + 3rd-party contributions
– Can be kludgey passing params around to ‘tasks’
Pentaho Kettle
Cronjobs
● Most ETL scripts need to be run on a periodic basis
● You could execute them manually on demand
● When impractical, set up a ‘scheduled task’ via crontab, as sketched below
● Add notifications to scripts for when things go wrong
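A minimal sketch of such a scheduled task (the script path and address are hypothetical): run crontab -e and add a line like

# min hour day-of-month month day-of-week command
# run the ETL script nightly at 2:15am; email only on failure
15 2 * * * /home/username/etl/run_etl.sh || echo "ETL failed" | mail -s "ETL failed" [email protected]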
Cronjobs and Wrangler
● Cronjobs are supported on Wrangler
● You *cannot* enter cronjobs on the login node
● Talk with us to set up an appropriate pipeline for your data
– http://portal.tacc.utexas.edu or http://portal.xsede.org
[email protected]
Streaming
● Though rare, it may be the case that your research workflow requires the capture of streaming data that would otherwise be lost
● More common in business/web: aggregating log files from all IT systems to drive quality control and business intelligence
● Open-source tools exist to handle these log-heavy workflows
● Examples: Apache Flume, Loggly, LogStash, Splunk
● If needed on Wrangler, talk with us
[email protected]
Database + Analytic Tools
● Databases and SQL are widely supported in analytic tools
● Every package has its own way of connecting to and interacting with the database
● However, they generally follow the same pattern:
– Manually open a connection
– Use the connection to send a SQL statement
– The client packages up the results, usually in an array/list/dataframe
– Manually close the connection (sometimes done for you)
R Example (RJDBC)
# Setup connection
library(RJDBC)
drv <- JDBC("com.mysql.jdbc.Driver",
            "/home1/0157/walling/drivers/mysql-connector-java-3.1.14-bin.jar",
            identifier.quote="`")
conn <- dbConnect(drv,
                  "jdbc:mysql://db1.wrangler.tacc.utexas.edu/schema_name",
                  "user", "pwd")

# Do stuff
> dbListTables(conn)
[1] "columns_priv"  "db"
[3] "func"          "help_category"
[5] "help_keyword"  "help_relation"
[7] "help_topic"    "host"

data(iris)
dbWriteTable(conn, "iris", iris, overwrite=TRUE)
dbGetQuery(conn, "select count(*) from iris")
d <- dbReadTable(conn, "iris")
Python Example (Postgres)
import psycopg2

def connect():
    connection = psycopg2.connect(host='db1.wrangler.tacc.utexas.edu',
                                  dbname='bigDB', user='me', password='pw')
    return connection

def query(sql):
    conn = connect()
    cursor = conn.cursor()
    cursor.execute(sql)
    results = cursor.fetchall()
    conn.close()  # manually close the connection
    return results

sql = """
select distinct user
from bigusertable;
"""
result = query(sql)
JDBC vs ODBC vs DBI vs ….
● Many applications rely on a particular protocol to abstract the type of database being used
● This makes code more portable, allowing you to swap backends if needed
● It is suggested to use one of these instead of a package specific to MariaDB, Postgres, etc.
WRANGLER ALLOCATIONS
Wrangler Allocation Model
• Projects need to think about two different
allocations
– Compute Allocations – Time used computing with
Wrangler and using Flash Storage
– Storage Allocations – Storage needed for importing
data, storing data between computations,
collaborating with data, sharing data, preserving
results
Introducing the Node Hour
• Computations use flash storage, CPUs, and core memory
• To simplify allocations and scheduling, we combine these into a “Node Hour” of allocation
– 1 Node Hour = use of 1 node with 2 CPUs (24 cores total) and 128 GB of memory PLUS 4 TB of DSSD flash storage
– Find the limiting factor for your computations and request an allocation based on it (e.g. a job needing 8 TB of flash consumes two nodes' worth of allocation even if it uses only one node's cores)
For Dedicated Databases
• A database has a nominal charge of 1 node hour per day
– Up to a 150 GB database on flash (a larger flash-based database will use more node hours per day)
– Databases that do not need high transaction rates can be hosted on the disk system (requires a storage allocation to cover their storage)
– Periodically back up any database to long-term storage (especially flash-hosted databases, which are not replicated)
For Transient Databases
• Typically one node hour per hour will be sufficient for most transient databases
• Exceptions:
– Needing more than 4 TB of flash storage to stage the input data, Extract, Transform, and Load the database, host the database, and extract results
– Needing a multi-node database solution for larger problems (e.g. a sharded Mongo cluster, a Neo4j cluster)
https://portal.xsede.org
Create an Account
Login
Learn About Allocations
Learn About Allocations - In Particular Startups
Submit/Review Requests
Current Opportunities - Quarterly allocation periods for full-scale requests
Guided Process
For startups…
• Minimal information needed… what you are going to be doing and some rationale for why you need Wrangler (e.g. “I need a database”)
• Startup requests are typically for 500 to 1000 node hours and ~1 TB of storage
• Turnaround should be within a week (typically a few days, depending on my email backlog)