
Nebula
A cloud-based back end for SETI@home
David P. Anderson
Kevin Luong
Space Sciences Lab
University of California, Berkeley
SETI@home
[Data-flow diagram: observation → signal detection → signal storage → back-end processing → re-observation]
● Back-end processing:
– RFI detection/removal
– persistent signal detection
Signal storage
● Using SQL database (Informix)
● Signal types
– spike, Gaussian, triplet, pulse, autocorrelation
● Database table hierarchy
– tape
– workunit group
– workunit
– result
– (signal tables)
Pixelized sky position
● HEALPix: Hierarchical Equal-Area isoLatitude Pixelization
● ~51M pixels; telescope beam is ~1 pixel (sketch below)
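
A minimal sketch of the pixel lookup, assuming the healpy library (not named in the slides) and nside = 2048, which gives 12 × 2048² ≈ 50M equal-area pixels:

# Sketch only: healpy and NSIDE are assumptions, not from the slides.
import healpy as hp

NSIDE = 2048                          # 12 * 2048^2 = 50,331,648 pixels (~51M)

def sky_pixel(ra_deg, dec_deg):
    """HEALPix pixel index for a signal's sky position (RA/Dec in degrees)."""
    return hp.ang2pix(NSIDE, ra_deg, dec_deg, lonlat=True)

print(hp.nside2npix(NSIDE))           # total number of pixels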
Current back end: NTPCKR
● As signals are added, mark pixels as “hot”
● Score hot pixels
– DB-intensive
● Do RFI removal from high-scoring pixels, flag for re-scoring
Problems with current back end
● Signal DB is large
– 5 billion signals, 10 TB
● Informix has limited speed
– NTPCKR can’t keep up with signal arrival
– > 1 year to score all pixels
● labor-intensive
● non-scalable
Impact on science
●
We haven’t done scoring/reobservation in 10
years
●
We wouldn’t find ET signal if it were there
●
We don’t have anything to tell volunteers
●
We don’t have basis for writing papers
Nebula goals
● Short-term
– RFI-remove and score all pixels in ~1 day for ~$100
– stop doing sysadmin, start doing science
– e.g. continuous reobservation, experiment with scoring algorithm
● Long-term
– generality; include other signal sources (SERENDIP)
– provide outside access to scoring, signals, raw data
● General
– build expertise in clouds and big-data techniques
– form relationship with cloud providers, e.g. Amazon
Design decisions
● Use Amazon cloud (AWS) for the heavy lifting
– for bursty usage, clouds are cheaper than in-house hardware
● Use flat files and Unix filesystem
– NoSQL DB systems don’t buy us anything
● Software
– C++ for compute-intensive stuff (use existing code)
– Python for the rest
AWS features
● Simple Storage Service (S3)
– disk storage by the GB/month
– accessed over HTTP
● Elastic Compute Cloud (EC2)
– VM hosting by the hour
– various “node types”
● Elastic Block Storage (EBS)
– disk storage by the GB/month
– attached (mounted) to 1 EC2 node
Interfaces to AWS
● Web-based
● Python APIs (example below)
– Boto3: interface to S3 storage
– Fabric: interface to EC2 nodes
[Diagram: script.py on the local host talks to AWS over HTTP]
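
For illustration, a minimal script.py showing what the two Python interfaces look like; the bucket name and EC2 host are made up, and the Fabric 2.x Connection API is assumed:

# script.py (sketch): bucket name and host below are hypothetical.
import boto3
from fabric import Connection        # assuming the Fabric 2.x API

# Boto3: move data to/from S3 over HTTP
s3 = boto3.client("s3")
s3.upload_file("signals.bin", "setiathome-nebula", "signals/signals.bin")

# Fabric: run commands on an EC2 node over SSH
node = Connection("ec2-user@ec2-node-example")
node.run("uname -a")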
Nebula: the basic idea
● Dump SETI@home database to flat files
● Upload files to S3
● Split files by pixel (~80M files)
– remove RFI, redundant signals in the process
– do this in parallel on EC2 nodes
● Score the pixels
– do this in parallel on EC2 nodes
Moving data from Informix to S3
● Informix DB unload: 1-2 days
● Nebula upload script (see the sketch after this slide)
– use Unix “split” to make 2 GB chunks
– upload chunks in parallel
– thread pool / queue approach, 8 threads
– S3 automatically reassembles the chunks
● Getting close to 1 Gb/s throughput
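
A sketch of the parallel chunk upload under the stated assumptions: 2 GB chunks already produced by Unix "split", a queue feeding 8 worker threads, and an S3 multipart upload doing the reassembly. Bucket, key, and chunk names are hypothetical.

# Sketch: bucket/key/chunk names are made up; error handling omitted.
import glob
import queue
import threading
import boto3

s3 = boto3.client("s3")
BUCKET = "setiathome-nebula"          # hypothetical bucket
N_THREADS = 8

def upload_chunks(chunk_glob, key):
    """Upload the chunks from Unix 'split' as parts of one S3 multipart
    upload; S3 reassembles them into a single object on completion."""
    chunks = sorted(glob.glob(chunk_glob))        # e.g. signals.unl.a*
    mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=key)
    parts = [None] * len(chunks)

    work = queue.Queue()
    for part_no, path in enumerate(chunks, start=1):
        work.put((part_no, path))

    def worker():
        while True:
            try:
                part_no, path = work.get_nowait()
            except queue.Empty:
                return
            with open(path, "rb") as f:
                resp = s3.upload_part(Bucket=BUCKET, Key=key,
                                      UploadId=mpu["UploadId"],
                                      PartNumber=part_no, Body=f)
            parts[part_no - 1] = {"PartNumber": part_no, "ETag": resp["ETag"]}

    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    s3.complete_multipart_upload(Bucket=BUCKET, Key=key,
                                 UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})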
Pixelization
● Need to:
– divide TB-size files into 16M files
– remove RFI, redundant signals
● Can’t do this sequentially
– a process can only have 1024 open files
– it would take too long
Hierarchical pixelization
● Level 1
– split flat files 512 ways based on pixel
– convert from ASCII to binary
– remove redundant signals
● Level 2
– split level 1 files 256 ways
– result: 130K level 2 files
● Level 3
– split each level 2 file 512 ways
– remove RFI
Pixelization on EC2
● Create N instances (t2.micro)
● Create a thread per node
● Create a queue of level 1 tasks
● To run a task (see the sketch after this slide):
– get input file from S3
– run pixelize program
– upload output files to S3
– create next-level tasks
● Keep going until all tasks done
● Kill instances
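
A minimal sketch of the thread-per-node master script, assuming the Fabric 2.x API; the host list, the per-task command, and the task encoding are placeholders. The real script also enqueues the level 2 and level 3 tasks as earlier levels finish.

# Sketch: hosts and the remote command are hypothetical.
import queue
import threading
from fabric import Connection        # assuming the Fabric 2.x API

hosts = ["ec2-node-%d" % i for i in range(64)]   # the N t2.micro instances

tasks = queue.Queue()
for i in range(512):                             # queue of level 1 tasks
    tasks.put(i)

def node_worker(host):
    """One thread per node; keep feeding it tasks until the queue is empty."""
    conn = Connection(host)
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        # the remote script fetches its input file from S3, runs the
        # pixelize program, and uploads its output files back to S3
        conn.run("./run_pixelize_task.sh %d" % task)

threads = [threading.Thread(target=node_worker, args=(h,)) for h in hosts]
for t in threads:
    t.start()
for t in threads:
    t.join()
# once everything is done, a separate call terminates the EC2 instances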
Removing redundant signals
● Old way: for each signal, walk up chain of DB tables
● New way (sketch below):
– create bitmap file, indexed by result ID, saying whether result is from a redundant tape
– memory-map this file
– given a signal, can instantly see if it’s redundant
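
A sketch of the memory-mapped bitmap lookup, assuming one bit per result ID; the file name and bit layout are illustrative, not taken from the slides.

# Sketch: file name and bit ordering are assumptions.
import mmap

class RedundantResults:
    """Bitmap, indexed by result ID, marking results from redundant tapes."""

    def __init__(self, path="redundant_results.bitmap"):
        f = open(path, "rb")
        self.bits = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def is_redundant(self, result_id):
        # constant-time check; no walking up the chain of DB tables
        return bool(self.bits[result_id >> 3] & (1 << (result_id & 7)))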
Pixel scoring
● Assemble signals in disc centered at pixel (sketch below)
● Compute probability that these are noise
● Can be done independently for each pixel
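
A sketch of gathering the signals for one pixel, assuming healpy (as above) and a hypothetical per-pixel signal index; the disc radius and the noise-probability statistic itself are not specified in the slides.

# Sketch: NSIDE, RADIUS, and signals_by_pixel are assumptions.
import numpy as np
import healpy as hp

NSIDE = 2048
RADIUS = np.radians(0.1)              # hypothetical disc radius in radians

def signals_in_disc(pixel, signals_by_pixel):
    """All signals falling within a disc centered on the given pixel."""
    vec = np.array(hp.pix2vec(NSIDE, pixel))      # unit vector to pixel center
    nearby = hp.query_disc(NSIDE, vec, RADIUS)    # pixel indices in the disc
    return [s for p in nearby for s in signals_by_pixel.get(p, [])]

# scoring = computing the probability that these signals are noise;
# each pixel can be scored independently of every other pixel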
Nebula scoring program
● Same code as NTPCKR
● Modified to get signals from flat files instead of DB
● First try: remove all references to Informix
– this failed; too intertwined
● Second try: keep Informix but don’t use it
Parallelizing scoring
● Need to score 16M pixels
● Use about 1K nodes
● Want to minimize file transfers; reuse signal files on a node
● Divide pixels into adjacent “blocks” of 4^n, say 1024 (see the sketch after this slide)
● Each block is a job (16K of them)
● Each job loops over pixels, fetches and caches files, creates and uploads output file (pixel, score)
● Master script instantiates EC2 nodes, uses thread/queue approach
– keeps nodes busy even if some pixels take longer than others
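
A sketch of the block/job decomposition; assuming nested HEALPix ordering, 4^n consecutive pixel indices cover a contiguous patch of sky, so the pixels in one job share most of their signal files.

# Sketch: the pixel-to-block mapping is assumed, not from the slides.
BLOCK = 1024                          # 4^5 adjacent pixels per block
NPIX = 16 * 1024 * 1024               # ~16M pixels to score

N_JOBS = NPIX // BLOCK                # 16K jobs, handed to ~1K nodes by the
                                      # same thread/queue master script

def job_pixels(job_id):
    """Pixels handled by one scoring job; the job fetches and caches the
    signal files it needs, then uploads one (pixel, score) output file."""
    start = job_id * BLOCK
    return range(start, start + BLOCK)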
Nebula user interface
● Configuration
– AWS, Nebula config files
– check out, build SETI@home software
● Scripts
– s3_upload.py, s3_status.py, s3_delete.py
– pixelize.py
– score.py
● Logging
● Amazon accounting tools
Status
● Mostly written, working
– doing performance, cost tests
– I think we’ll meet goals
● Code: seti_science/nebula
● Design docs are on Google
– readable to the ucb_seti_dev group
Future directions
● Flat-file-centric architecture
– assimilators write signals to flat files
– load into SQL DB if needed
● Amazon spot instances (auction pricing)
– instances are killed if price goes above bid
● Amazon Elastic File System (upcoming)
– shared mountable storage, at a price
● Incremental processing