SAS Grid Project - Texas Tech University
Grid Computing at Texas Tech University using SAS
Ron Bremer
Jerry Perez
Phil Smith
Peter Westfall*
Director, Center for Advanced Analytics and Business Intelligence
Texas Tech University
What is Grid Computing?
• Grid computing means using multiple
resources connected by the net to perform
demanding calculations.
• Example:
Economies of High Performance
Computing
• Current fastest machine: ~40 Tflops ($300M)
• 10 Tflops machines (~$50M)
• Fastest cluster at TTU: 0.1 Tflops (~$0.1M)
• Speed of a PC: 0.003 Tflops (~$0.001M)
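The price/performance gap implied by these figures can be made explicit with a little arithmetic. The Python sketch below uses only the speeds and costs quoted on the slide; the variable and function names are illustrative, and the quoted figures are themselves rough ("~") estimates.

```python
# Figures quoted on the slide: (label, speed in Tflops, cost in millions of USD).
machines = [
    ("Current fastest machine", 40.0, 300.0),
    ("10 Tflops machine", 10.0, 50.0),
    ("Fastest cluster at TTU", 0.1, 0.1),
    ("Single PC", 0.003, 0.001),
]

def cost_per_tflop(speed_tflops, cost_musd):
    """Return cost in millions of USD per Tflop of capacity."""
    return cost_musd / speed_tflops

cost_table = {label: cost_per_tflop(s, c) for label, s, c in machines}

for label, value in cost_table.items():
    print(f"{label:26s} ~${value:.2f}M per Tflop")
```

On these numbers a commodity PC delivers a Tflop at roughly a third of the cluster's price and a small fraction of the supercomputer's, which is the economic case for harvesting idle desktops.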
Underused Resources
• Computers are everywhere, mostly idle!
• Grid computing leverages unused
resources to create an effective
“Supercomputer”
• Teraflops = (N computers) x (Tflops per computer)
• For Free! (Almost)
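As a rough worked example of the formula above, the following Python sketch computes the peak aggregate speed of N idle PCs and the number of PCs needed to match a target. The per-PC and cluster speeds come from the earlier slide; the `idle_fraction` parameter and the function names are assumptions added for illustration, and the result is an upper bound that ignores communication overhead.

```python
import math

PC_TFLOPS = 0.003          # per-PC speed quoted in the slides
TTU_CLUSTER_TFLOPS = 0.1   # fastest TTU cluster quoted in the slides

def grid_tflops(n_computers, tflops_per_computer=PC_TFLOPS, idle_fraction=1.0):
    """Peak aggregate speed; idle_fraction (an assumption) discounts busy machines."""
    return n_computers * tflops_per_computer * idle_fraction

def pcs_needed(target_tflops, tflops_per_computer=PC_TFLOPS):
    """Smallest whole number of PCs whose combined peak speed reaches the target."""
    return math.ceil(target_tflops / tflops_per_computer)

if __name__ == "__main__":
    print(f"100 idle PCs: {grid_tflops(100):.2f} Tflops peak")
    print(f"PCs needed to match the TTU cluster: {pcs_needed(TTU_CLUSTER_TFLOPS)}")
```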
Grid Initiatives at TTU and in Texas
• HipCAT – High Performance Computing
Across Texas
• TIGRE – Texas Internet Grid for Research
and Education
• SORCER – Service-ORiented Computing
EnviRonment (TTU CS dept.)
• SAS/Connect grid
HipCAT
• Consortium of Texas institutions working together to use
– High performance computing
– Clusters
– Massive data storage
– Scientific visualization
– Grid computing
• Director: Phil Smith, Texas Tech University
• Members:
– Baylor College of Medicine
– Rice University
– Texas A&M University
– Texas Tech University
– University of Houston
– University of Texas
– University of Texas at Austin
– University of Texas at Arlington
– University of Texas at El Paso
– University of Texas Southwestern Medical Center
TIGRE
• Texas Internet Grid for Research & Education
• Two year project involving: UT, TTU, UH, Rice,
and TAMU
• Funding announced by the Governor in
September
• TIGRE will develop a grid software stack and
policies and procedures to facilitate Texas grid
computing efforts.
Grid Software Products
Used at TTU
• AVAKI
• Globus
• Jini Networking Technology
• SAS/Connect (MPConnect), %Distribute macro
Benefits of SAS
• Ease of Use (relative to other grid
products)
• Available and applicable for many
scientists in their respective fields
• Flexibility
– Database (DATA step, PROC SQL)
– Math/Optimization (SAS/IML, SAS/OR)
– Stat (SAS/STAT, SAS/ETS)
Problems Amenable to SAS Grid
• Work consists of replicates of a fundamental task
• Fundamental tasks are time-consuming,
with many replicates
• Examples
– Simulation
– Astrophysics
– Bioinformatics
– Ensembles of predictive models
Success Story
• Financial Event Studies
– Developed simulation tool to detect events
– Simulated its performance
– 25 hours finished in 40 minutes
– Published in J. Fin. Econometrics
• Old system: “Sneaker grid”
Another Success Story:
Portfolio Analysis
• 300 portfolios, 50 securities each, formed by randomly
sampling securities from the CRSP daily database
(7.23 gigabytes)
• 15 models created for each of 50 securities
(PROC AUTOREG of SAS/ETS), under 169
treatment settings
• 126,750 models and associated data steps per
portfolio
• 500 days of continuous computing time
reduced to two weeks
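The per-portfolio model count and the overall speedup follow directly from the figures on this slide. The short Python check below reproduces them; the constant names are illustrative, and "two weeks" is taken as 14 days.

```python
MODELS_PER_SECURITY = 15
SECURITIES_PER_PORTFOLIO = 50
TREATMENT_SETTINGS = 169
PORTFOLIOS = 300

# 15 models x 50 securities x 169 treatment settings
models_per_portfolio = (
    MODELS_PER_SECURITY * SECURITIES_PER_PORTFOLIO * TREATMENT_SETTINGS
)
total_models = models_per_portfolio * PORTFOLIOS

SERIAL_DAYS = 500   # continuous computing time quoted on the slide
GRID_DAYS = 14      # "two weeks", taken here as 14 days
speedup = SERIAL_DAYS / GRID_DAYS

print(f"models per portfolio: {models_per_portfolio:,}")
print(f"models across all portfolios: {total_models:,}")
print(f"approximate speedup: {speedup:.1f}x")
```

The product 15 x 50 x 169 matches the 126,750 stated on the slide, and the time reduction corresponds to a speedup of roughly 36x.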
Notoriety
• Web articles appeared in SAS, Grid Today,
and Next-Gen Data Forum
• Interviewed by Database Trends and
Applications
SAS Grid Structure
• Client connects to host machines
• Client sends replicates of fundamental
task (“chunks”) to hosts
• Hosts process chunks and send results back to client
• Client combines chunks and summarizes
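The four steps above can be sketched in a few lines of Python. This is not the SAS/Connect code used at TTU: worker threads stand in for host machines, the "fundamental task" is a placeholder (averaging uniform random draws), and all function and parameter names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fundamental_task(seed, n_draws):
    """One replicate ("chunk"): return (sum, count) for n_draws uniform draws."""
    rng = random.Random(seed)          # separately seeded stream per chunk
    total = sum(rng.random() for _ in range(n_draws))
    return total, n_draws

def run_on_grid(n_chunks, n_draws, n_hosts):
    """Client side: send chunks to hosts, collect results, combine and summarize."""
    with ThreadPoolExecutor(max_workers=n_hosts) as hosts:
        futures = [hosts.submit(fundamental_task, seed, n_draws)
                   for seed in range(n_chunks)]
        results = [f.result() for f in futures]   # chunks come back to the client
    grand_total = sum(t for t, _ in results)
    grand_count = sum(c for _, c in results)
    return grand_total / grand_count

if __name__ == "__main__":
    print(run_on_grid(n_chunks=20, n_draws=10_000, n_hosts=4))
```

Each chunk returns a small summary (a sum and a count) rather than raw draws, so the client's combine step stays cheap no matter how many chunks are farmed out.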
The SAS Grid
SAS Farm
• 100 SAS machines in student lab
• 2.66 GHz per node
• All have SAS software installed
• SAS “Spawner” must be started on all
• Avaki also installed - diagnoses problems
Student Lab
Load Balancing
• Automatically supports load balancing by
farming out independent tasks to the next
available resource.
• Students never noticed that their machines
were being used!
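Farming tasks out to the "next available resource" amounts to a shared work queue: each host takes a new task the moment it finishes its previous one, so faster or less busy hosts naturally do more. The Python sketch below illustrates that idea with threads standing in for lab machines; it is an assumption-laden illustration, not the mechanism inside SAS/Connect, and all names are hypothetical. Tasks must be hashable because they are used as dictionary keys.

```python
import queue
import threading

def run_balanced(tasks, n_hosts, work):
    """Each host pulls the next pending task as soon as it is free."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    results = {}
    handled_by = {}
    lock = threading.Lock()

    def host_loop(host_id):
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return                      # nothing left to do
            value = work(task)
            with lock:
                results[task] = value
                handled_by[task] = host_id  # record which host did the work

    hosts = [threading.Thread(target=host_loop, args=(i,)) for i in range(n_hosts)]
    for h in hosts:
        h.start()
    for h in hosts:
        h.join()
    return results, handled_by

if __name__ == "__main__":
    results, handled_by = run_balanced(range(12), 3, lambda t: t * t)
    print(results)
    print(handled_by)
```

Which host handles which task varies from run to run; only the combined results are fixed.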
Simulation-Based Methods
PROC MULTTEST of SAS/STAT (first hardcoded bootstrap?)
Simulation-Based Methods, II
• Adjust=simulate in GLM and MIXED
• Posterior simulation in MIXED
Toy Example – Testing Random
Number Generators
• Random number generators often fail to
provide independent numbers.
• Test case: U1, U2 are Uniform on (0,1).
• If independent, then E{6(U1-U2)^2} = 1.00.
• Check: Generate many pairs, report
average (should be 1.000000)
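The target value follows from E{(U1-U2)^2} = Var(U1) + Var(U2) = 1/12 + 1/12 = 1/6 for independent uniforms, so six times that is 1. The Python sketch below runs the check on Python's built-in generator and, for contrast, on a deliberately dependent pair (U2 = 1 - U1, an illustrative assumption) whose average tends to 2 instead. It is not the SAS program from the talk; the seed and names are arbitrary. An average near 1 is consistent with independence but does not prove it.

```python
import random

def independence_check(pair_source, n_pairs):
    """Average of 6*(u1-u2)**2 over n_pairs pairs; near 1 for independent uniforms."""
    total = 0.0
    for _ in range(n_pairs):
        u1, u2 = pair_source()
        total += 6.0 * (u1 - u2) ** 2
    return total / n_pairs

rng = random.Random(2005)   # arbitrary seed

def independent_pair():
    return rng.random(), rng.random()

def dependent_pair():
    u = rng.random()
    return u, 1.0 - u       # deliberately dependent: second value fixed by the first

good = independence_check(independent_pair, 200_000)
bad = independence_check(dependent_pair, 200_000)
print(f"independent pairs: {good:.4f}")
print(f"dependent pairs:   {bad:.4f}")
```

Because every pair is processed independently of the others, this check splits cleanly into chunks, which is what makes it a convenient toy problem for a grid.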
Code
Results
Startup (Windows)
1. Start Spawner:
C:\Program Files\SAS\SAS 9.1>spawner -i -comamid tcp
2. Activate Spawner:
3. Set batch log in permissions:
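Once the spawner has been started on each host, it can be useful to confirm from the client that the hosts accept TCP connections at all before submitting work. The Python sketch below is a generic reachability check, not part of SAS: the host names and the port are hypothetical and depend on how the spawner was configured at your site, and a successful connection shows only that something is listening, not that it is the spawner.

```python
import socket

def spawner_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_hosts(hosts, port, timeout=2.0):
    """Map each host name to whether its port accepts connections."""
    return {host: spawner_reachable(host, port, timeout) for host in hosts}

if __name__ == "__main__":
    # Hypothetical host names and port; substitute the values used at your site.
    lab_hosts = ["lab-pc-01", "lab-pc-02"]
    print(check_hosts(lab_hosts, port=5000, timeout=1.0))
```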
The %Distribute Macro
• Written by Cheryl Doninger and Randy
Tobias
• File:
http://support.sas.com/rnd/scalability/papers/distribute.zip
• Supporting document:
http://support.sas.com/rnd/scalability/papers/distConnect0401.pdf
Problems We Have Experienced
• Random crashes (client as well as hosts)
• Diagnosing errors
• I/O problems
• Windows Service Pack 2 Firewall
• Social issues (grid involves people!)
Future Plans
• Support from business and government:
– grid-enabled bioinformatics
– business intelligence/data mining
• Support HPC at TTU and in Texas