Select eXtreme Computing Group (XCG) Initiatives
Download
Report
Transcript Select eXtreme Computing Group (XCG) Initiatives
Tools and Services for Data Intensive Research
An Elephant Through the Eye of a Needle
Roger Barga, Architect
eXtreme Computing Group, Microsoft Research
Select eXtreme Computing Group (XCG) Initiatives
• Cloud Computing Futures
– ab initio R&D on cloud hardware/software infrastructure
• Multicore academic engagement
– Universal Parallel Computing Research Centers (UPCRCs)
• Software incubations
– Multicore applications, power management, scheduling
• Quantum computing
– Topological quantum computing investigations
• Security and cryptography
– Theoretical explorations and software tools
• Research cloud engagement
– Worldwide government and academic research partnerships
– Inform next generation cloud computing infrastructure
Data Intensive Research
• The nature of scientific computing is changing
– It is about the data…
• Hypothesis-driven research
– “I have an idea, let me verify it...”
• Exploratory
– “What correlations can I glean from everyone’s data?”
• Requires different tools and techniques
– Exploratory analysis relies on data mining, viz analytics
– “grep” is not a data mining tool, and neither is a DBMS…
• Massive, multidisciplinary data
– Rising rapidly and at unprecedented scale
Why Commercial Clouds are Important*
Research
Science Start-ups
Cloud Computing Model
1.
2.
3.
4.
1.
2.
3.
4.
5.
6.
1. Have good idea
2. Grab nodes from Cloud provider
3. Start Work
Have good idea
Write proposal
Wait 6 months
If successful, wait 3
months
5. Install Computers
Have good idea
Write Business Plan
Ask VCs to fund
If successful..
Install Computers
Start Work
4. Pay for what you used
also scalability, cost, sustainability
6. Start Work
*
Slide used with permission of Paul Watson, University of Newcastle (UK)
The Pull of Economics (follow the money)
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Unprecedented economies of scale
Enterprise moving to PaaS, SaaS, cloud computing
Opportunities for Analysis as a Service, multi-disciplinary data sets,…
This will drive changes in research computing and cloud infrastructure
Just as did “killer micros” and inexpensive clusters
Drinking from the Twitter Fire Hose
On the “input” end
• Start with the ‘twitter fire hose’, messages flowing inbound at specific rate.
• Enrich each element with significantly more metadata, e.g. geolocation.
• Assume the order of magnitude of the twitter user base is in the 1050MM range, let’s crank this up to the 500M range.
• The average Twitter user is generating a relatively low incoming
message rate right now, assume that a user’s devices (phone, car,
PC) are enhanced to begin auto-generating periodic Twitter
messages on their behalf, e.g. with location ‘pings’ and solving other
problems that twitterbots are emerging to address. So let’s say the
input rate grows again to 10x-100x what it was in the previous step.
Drinking from the Twitter Fire Hose
On the “input” end
On the “output” end: three different usage modalities
• Each user has one or more ‘agents’ they run on their behalf, monitoring this input
stream. This might just be a client that displays a stream that is incoming from
the @friends or #topics or the #interesting&@queries (user standing queries).
• A user can do more general queries from a search page. This query may have
more unstructured search terms than the above, and it is expected not just to be
going against incoming stream but against much larger corpus of messages from
the entire input stream that has been persisted for days, weeks, months, years…
• Finally, analytical tools or bots whose purpose is to do trend analysis on the
knowledge popping out of the stream, in real-time. Whether seeded with an
interest (“let me know when a problem pops up with <product> that will damage
my company’s reputation”) or just discovering a topic from the noise (“let me
know when a new hot news item emerges”), both must be possible.
Pause for Moment…
Defining representative challenges or quests to focus group
attention is an excellent way to proceed as a community
Publishing a whitepaper articulating these
challenges is a great way to allow others
to contribute to a shared research agenda
Make simulated and reference data sets available
to ground such a distributed research effort
Drinking from the Twitter Fire Hose
On the “input” end
On the “output” end: three different usage modalities
A combination of live data, including streaming, and historical data
Lots of necessary technology, but no single technology is sufficient
If this is going to be successful it must be accessible to the masses
Simple to use and highly scalable, which is extremely difficult
because in actuality it is not simple…
This Talk is About
Effort to build & port tools for data intensive research in the cloud
– None have run in the cloud to date or at scale we are targeting…
Able to handle torrential streams of live and historical data
– Goal is simplicity and ease-of-use combined with scalability
Intersection of four fundamental strategies
1. Distribute Data and perform Parallel Processing
2. Parallel operations to take advantage of multiple cores;
3. Reduce the size of the data accessed
– Data compression
– Data structures that limit the amount of data required for queries;
4. Stream data processing to extract information before storage
Microsoft’s Dryad
•
•
•
•
•
•
•
Continuously deployed since 2006
Running on >> 104 machines
Sifting through > 10Pb data daily
Runs on clusters > 3000 machines
Handles jobs with > 105 processes each
Used by >> 100 developers
Rich platform for data analysis
Microsoft Research, Silicon Valley
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
Pause for Moment…
Data-Intensive Computing Symposium, 2007
Dryad is now freely available
http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters
Commitment by External Research (MSR) to support research community use
Simple Programming Model
Terasort, well known benchmark, time to sort time 1 TB data [J. Gray 1985]
• Sequential scan/disk = 4.6 hours
• DryadLINQ provides simple but powerful programming model
• Only few lines of code needed to implement Terasort, benchmark May 2008
• DryadLINQ result: 349 seconds (5.8 min)
• Cluster of 240 AMD64 (quad) machines, 920 disks
• Code: 17 lines of LINQ
DryadDataContext ddc = new DryadDataContext(fileDir);
DryadTable<TeraRecord> records =
ddc.GetPartitionedTable<TeraRecord>(file);
var q = records.OrderBy(x => x);
q.ToDryadPartitionedTable(output);
LINQ
• Microsoft’s Language INtegrated Query
– Available in Visual Studio 2008
• A set of operators to manipulate datasets in .NET
– Support traditional relational operators
• Select, Join, GroupBy, Aggregate, etc.
• Data model
– Data elements are strongly typed .NET objects
– Much more expressive than SQL tables
• Extremely extensible
– Add new custom operators
– Add new execution providers
Dryad Generalizes Unix Pipes
Unix Pipes: 1-D
grep | sed | sort | awk | perl
Dryad: 2-D, multi-machine, virtualized
grep1000 | sed500 | sort1000 | awk500 | perl50
Dryad Job Structure
Channels
Input
files
Stage
sort
grep
grep
grep
Vertices
(processes)
sed
sed
Output
files
awk
perl
sort
awk
sort
Channel is a finite streams of items
• NTFS files (temporary)
• TCP pipes (inter-machine)
• Memory FIFOs (intra-machine)
Dryad System Architecture
job schedule
data plane
Files, TCP, FIFO, Network
NS
Job manager
control plane
V
V
V
PD
PD
PD
cluster
Dryad Job Staging
1. Build
2. Send
.exe
JM code
3. Start JM
7. Serialize vertices
Vertex
Code
5. Generate graph
6. Initialize vertices
Cluster
services
8. Monitor vertex execution
4. Query cluster resources
Dryad Scheduler is a State Machine
• Static optimizer builds execution graph
– Vertex can run anywhere once all its inputs are ready.
• Dynamic optimizer mutates running graph
– Distributes code, routes data;
– Schedules processes on machines near data;
– Adjusts available compute resources at each stage;
– Automatically recovers computation, adjusts for overload
o If A fails, run it again;
o If A’s inputs are gone, run upstream vertices again (recursively);
o If A is slow, run a copy elsewhere and use output from one that finishes first.
– Masks failures in cluster and network;
Combining Query Providers
.Net
program
(C#, VB,
F#, etc)
Query
Objects
LINQ provider
interface
Local Machine
Execution Engines
DryadLINQ
PLINQ
LINQ-to-IMDB
LINQ-to-CEP
Scalability
Cluster
Multi-core
Single-core
LINQ == Tree of Operators
• A query is comprised of a tree of operators
• As with a program AST, these trees can be analyzed, rewritten
• This is why PLINQ can safely introduce parallelism
q = from x in A where p(x) select x3;
• Intra-operator:
• Inter-operator:
• Both composed:
• Nesting queries inside of others is common
PLINQ can fuse partitions
var q1 = from x in A select x*2;
var q2 = q1.Sum();
Combining with PLINQ
Query
DryadLINQ
subquery
PLINQ
Combining with LINQ-to-IMDB
Query
DryadLINQ
Subquery
Subquery
Subquery
Subquery
Historical
LINQ-to-IMDB
Reference
Data
Combining with LINQ-to-CEP
Query
DryadLINQ
Subquery
Subquery
Subquery
Subquery
Subquery
LINQ-to-IMDB
‘Live’
LINQ-to-CEP
Streaming
Data
Cost of acquiring
data – negligible
Cost of storing data –
few cents/month/MB
Extracting insight while acquiring data - priceless
Mining historical data for ways to extract insight – precious
CEDR CEP – the engine that makes it possible
Consistent Streaming Through Time: A Vision for Event Stream Processing
Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong
In the proceedings of CIDR 2007
Complex Event Processing
Complex Event Processing (CEP) is the continuous and incremental processing of
event (data) streams from multiple sources based on declarative query and pattern
specifications with near-zero latency.
The CEDR (Orinoco) Algebra
Leverages existing SQL understanding
– Streaming extensions to relational algebra
– Query integration with host languages (LINQ)
Semantics are independent of order of arrival
– Specify a standing event query
– Separately specify desired disorder handling strategy
– Many interesting repercussions
Consistent Streaming Through Time: A Vision for Event Stream Processing
Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong
In the proceedings of CIDR 2007
CEDR (Orinoco) Overview
Currently processing over 400M events per day for internal application (5000 events/sec)
Reference Data on Azure
Ocean Science data on Azure SDS-relational
• Two terabytes of coastal and model data
• Collaboration with Bill Howe (Univ of Washington)
Computational finance data on Azure SDS-relational
• BATS, daily tick data for stocks (10 years)
• XBRL call report for banks (10,000 banks)
Working with IRIS to store select seismic data on
Azure. IRIS consortium based in Seattle (NSF)
collects and distributes global seismological data.
• Data sets requested by researchers worldwide
• Includes HD videos, seismograms, images, data from
major seismic events.
Summary
• Data growing exponentially: big data, with big implications…
• Implications for research environments and cloud infrastructure
• Building cloud analysis & storage tools for data intensive research
– Implementing key services for science (PhyloD for HIV researchers)
– Host select data sets for multidisciplinary data analysis
• Ongoing discussions for research access to Azure
– Many PB of storage and hundreds of thousands of core-hours
– Internet2/ESnet connections, w/ service peering at high bandwidth
– Drive negotiations with ISVs for pay-as-you-go licensing (MATLAB)
• Academic access to Azure through our MSDN program
• Technical engagement team to onboard research groups
– Tools for data analysis, data storage services, and visual analytics
Questions