Satellite Image Processing And Production With Apache Hadoop
U.S. Department of the Interior
U.S. Geological Survey
David V. Hill, Information Dynamics,
Contractor to USGS/EROS
12/08/2011
Overview
Apache Hadoop
Applications, Environment and Use Case
Log Processing Example
EROS Science Processing Architecture (ESPA) and
Hadoop
ESPA Processing Example
ESPA Implementation Strategy
Performance Results
Thoughts, Notes and Takeaway
Questions
Apache Hadoop – What is it?
Open source distributed processing system
Designed to run on commodity hardware
Widely used for solving “Big Data” challenges
Has been deployed in clusters with thousands of
machines and petabytes of storage
Two primary subsystems: the Hadoop Distributed File System (HDFS) and the MapReduce engine
Hadoop’s Applications
Web content indexing
Data mining
Machine learning
Statistical analysis and modeling
Trend analysis
Search optimization
… and of course, satellite image processing!
Hadoop’s Environment
Linux and Unix
Java-based, but relies on ssh for job distribution
Jobs can be written in any language executable from a shell prompt (see the sketch below)
Java, C/C++, Perl, Python, Ruby, R, Bash, et al.
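This language independence comes from Hadoop Streaming's contract: a mapper or reducer simply reads records on stdin and writes tab-separated key/value pairs on stdout. A minimal sketch of that contract in Python (the script name and tokenization are illustrative assumptions, not from the presentation):

    #!/usr/bin/env python
    # minimal_mapper.py -- minimal Hadoop Streaming mapper sketch (hypothetical name).
    # Streaming feeds each input record to the script on stdin; every line written
    # to stdout as "key<TAB>value" becomes mapper output.
    import sys

    for line in sys.stdin:
        tokens = line.split()
        if tokens:
            sys.stdout.write("%s\t1\n" % tokens[0])   # key = first token, value = 1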
Hadoop’s Use Case
A group of machines is configured into a Hadoop cluster
Each contributes:
Local compute resources to MapReduce
Local storage resources to HDFS
Files are stored in HDFS
File size is typically measured in gigabytes and terabytes
Job is run against an input file in HDFS
Target input file is specified
Code to run against input also specified
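A hedged sketch of how the "target input file" and "code to run" are specified when submitting a Hadoop Streaming job from Python; the jar path, HDFS paths and script names are assumptions for illustration:

    # submit_job.py -- hedged sketch of a Hadoop Streaming job submission.
    import subprocess

    subprocess.check_call([
        "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
        "-input", "/user/espa/access.log",       # target input file already stored in HDFS
        "-output", "/user/espa/access-report",   # HDFS output directory the job creates
        "-mapper", "mapper.py",                  # code to run against each input split
        "-reducer", "reducer.py",
        "-file", "mapper.py",                    # ship the scripts to the worker nodes
        "-file", "reducer.py",
    ])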
Hadoop’s Use Case
Unlike traditional systems, which move data to the code, Hadoop flips this and moves code to the data
A MapReduce job comprises two software functions:
Map operation
Reduce operation
Upon execution:
The “Map”: Hadoop identifies the input file chunk locations, moves the algorithms to those locations and executes the code
The “Reduce”: sorts the Map results and aggregates the final answer (single thread)
Log Processing Example
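As a hedged illustration of that map/reduce flow, a streaming log-processing job might be a pair of scripts like the following; the log layout, field position and script names are assumptions, not taken from the presentation.

    #!/usr/bin/env python
    # log_mapper.py -- hedged sketch: emit one count per HTTP status code.
    # Assumes a common web-log layout where the status code is the 9th field.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:
            sys.stdout.write("%s\t1\n" % fields[8])   # key = status code, value = 1

    #!/usr/bin/env python
    # log_reducer.py -- hedged sketch: sum the counts for each status code.
    # Streaming sorts mapper output by key before it reaches the reducer.
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                sys.stdout.write("%s\t%d\n" % (current_key, total))
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        sys.stdout.write("%s\t%d\n" % (current_key, total))

Locally, the same pair can be exercised with the pipe idiom shown near the end of the deck: cat access.log | ./log_mapper.py | sort | ./log_reducer.py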
ESPA and Hadoop
Hadoop map code runs in parallel on the input (log file)
Processes a single input file as quickly as possible
Reduce code runs on mapper output
ESPA processes satellite images, not text
Algorithms cannot run in parallel within an image
Cannot use satellite images as the input
Solution: use a text file with the image location as the input and skip the reduce step
Rather than parallelizing within an image, ESPA handles many images at once (see the sketch below)
ESPA Processing Example
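A hedged sketch of the map-only approach described on the previous slide: the HDFS input is a small text file with one scene location per line, each map task runs the full per-scene pipeline on its local node, and the reduce step is skipped (for example by submitting the streaming job with zero reduce tasks). The scene URL format and the process_scene.sh command are illustrative assumptions.

    #!/usr/bin/env python
    # espa_mapper.py -- hedged sketch of a map-only, ESPA-style job.
    # Each input line is assumed to hold the location of one scene, e.g. an ftp URL.
    import subprocess
    import sys

    for line in sys.stdin:
        scene_url = line.strip()
        if not scene_url:
            continue
        # Pull the scene to local storage and run the science algorithms against it.
        # "process_scene.sh" is a stand-in for the real per-scene pipeline.
        status = subprocess.call(["process_scene.sh", scene_url])
        sys.stdout.write("%s\t%d\n" % (scene_url, status))   # record per-scene exit status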
Implementation Strategy
LSRD is budget-constrained for hardware
Other projects regularly excess old hardware
upon warranty expiration
Take ownership of these systems… if they
fail, they fail
Also ‘borrow’ compute and storage from
other projects
Only network connectivity is necessary
Current cluster is 102 cores, minimal expense
Cables, switches, etc.
Performance Results
Original throughput requirement was 455
atmospherically corrected Landsat scenes per day
Currently able to process ~ 4800!
Biggest bottleneck is local machine storage input/output
Due to the implementation FTP'ing files instead of using HDFS as intended
Attempted to solve this with a RAM disk, but there was not enough memory
Currently evaluating solid state disk
Thoughts and Notes
The number of splits on the input file can be controlled via the dfs.block.size parameter
This in turn controls the number of map tasks run against an input file (see the sketch below)
Unlike other Hadoop instances, an ESPA-like implementation does not require massive storage
Input files are very small
Robust internal job monitoring mechanisms
are usually custom-built
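A hedged back-of-the-envelope sketch of the split arithmetic behind that dfs.block.size trick; the numbers and the -put command in the comment are illustrative assumptions, and very small block sizes may be rejected by some HDFS configurations.

    # splits_estimate.py -- hedged sketch of how block size maps to input splits.
    # Assumption: the scene-list file is written into HDFS with a deliberately small
    # block size, e.g.:  hadoop fs -D dfs.block.size=131072 -put scenes.txt /espa/scenes.txt
    import math

    scene_count = 4800          # one scene reference per line (illustrative)
    bytes_per_line = 100        # rough size of each reference (illustrative)
    block_size = 131072         # 128 KB blocks chosen at -put time

    file_size = scene_count * bytes_per_line
    # Each HDFS block becomes roughly one input split, and each split one map task.
    map_tasks = max(1, int(math.ceil(float(file_size) / block_size)))
    print("file size: %d bytes, approximate map tasks: %d" % (file_size, map_tasks))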
Thoughts and Notes
Jobs written for Hadoop Streaming may be tested and run without Hadoop
cat inputfile.txt | ./mapper.py | sort | ./reducer.py > out.txt
Projects can share resources
Hadoop is tunable to restrict resource utilization on a per-machine basis
Provides instant productivity gains versus
internal development
LSRD is all about science and science algorithms
Minimal time and budget for building internal systems
Takeaways
Hadoop is proven and tested
Massively scalable out of the box
Cloud-based instances available from Amazon and others
Shortest path to processing massive amounts of data
Extremely hardware failure tolerant
No specialized hardware or software needed
Flexible job API allows existing software skills to be
leveraged
Industry adoption means support skills available
Questions