Data-Intensive Computing Symposium


Data-Intensive Computing
Symposium: Report Out
Phillip B. Gibbons
Intel Research Pittsburgh
Data-Intensive Computing Symposium
• Held 3/26/08 @Yahoo! in Sunnyvale, CA
• Sponsored by:
– Yahoo! Research
– Computing Community Consortium (CCC), which supports the computing research community in creating compelling research visions and the mechanisms to realize these visions (http://www.cra.org/ccc/)
• ~100 invited attendees, ~12 invited talks
• Slides and video to be posted on CCC web site
• Blog: http://dita.ncsa.uiuc.edu/xllora (thanks!)
Phillip B. Gibbons, Data-Intensive Computing Symposium
Randy Bryant (CMU)
Data-Intensive Scalable Computing
• Local speaker; I’ll skip in the interest of time
• DISC has been renamed
ChengXiang Zhai (UIUC)
Text Information Management
ChengXiang Zhai (UIUC)
Proposal 1: Maximum Personalization
Dan Reed (Microsoft)
Clouds and ManyCore: The Revolution
• Big Data: Should focus more on the user experience
• How to manage resources
• Cloud computing can help organically orchestrate resources on demand
• Initiative to bring academics, business, and users together under the big data problem (PCAST NITRD review)
Jill Mesirov (Broad Institute)
Computational Paradigms for Genomic Medicine
• Broad has 4.8K processors, 1.4 PB of storage on site
• Big Data Problem: Mining genome expression arrays
– Row: patients; Column: genes; Value: expression values
– Example: classify leukemias based on expression arrays
– Solved by grad student over the weekend using web sources
• Challenge: Computation/Analysis/Provenance infrastructure needed
– Developed GenePattern 3.1: Software infrastructure for interoperable informatics
– Usable by biologists
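The expression-array layout described on this slide (rows are patients, columns are genes, values are expression levels) can be illustrated with a small sketch. The nearest-centroid rule, the toy numbers, and the class labels "A" and "B" below are all assumptions made for illustration; the slide does not say which classification method was used, and this is not GenePattern code.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroids(rows, labels):
    """Average the expression profiles of each class (one row per patient)."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(profile, class_centroids):
    """Return the class whose centroid is closest to the given profile."""
    return min(class_centroids, key=lambda label: dist(profile, class_centroids[label]))


# Hypothetical toy matrix: 4 patients (rows) x 3 genes (columns).
train = [
    [9.1, 1.2, 0.8],
    [8.7, 0.9, 1.1],
    [1.0, 7.9, 8.3],
    [1.4, 8.2, 7.7],
]
train_labels = ["A", "A", "B", "B"]  # hypothetical class labels

class_centroids = centroids(train, train_labels)
prediction = classify([8.5, 1.5, 1.0], class_centroids)
```

Real expression arrays have thousands of gene columns, so in practice a gene-selection step would normally come before any classifier of this kind.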
Garth Gibson (CMU)
Simplicity and Complexity in Data Systems at Scale
• Petascale Data Storage Institute
• Understanding disk failures, cfdr.usenix.org
• Another local speaker, so I’ll skip in the interest of time
Jeff Dean (Google)
Handling Large Datasets at Google
Jeff Dean (Google)
GFS Usage
Jon Kleinberg (Cornell)
Large-Scale Social Network Data
Diffusion in Social Networks
Why is chain letter diffusion so deep & narrow?
Example: Iraq war authorization protest chain letter diffusion (18K nodes)
Marc Najork (Microsoft Research)
Mining the Web Graph
Query-dependent link-based ranking algorithms (HITS, SALSA)
Scalable Hyperlink Store: used internally within MSR, for web graphs
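As a reminder of what a link-based ranking such as HITS computes, here is a minimal sketch of its hub/authority iteration. The dictionary-based graph is a stand-in chosen for illustration and is not the Scalable Hyperlink Store interface; in practice HITS is run on a query-specific subgraph, which this sketch leaves out. SALSA uses a different, random-walk style normalization and is not shown.

```python
def hits(graph, iterations=50):
    """Run the HITS iteration; graph maps each page to the pages it links to."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auths = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auths[dst] += hubs[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hubs = {n: sum(auths[d] for d in graph.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auths, hubs):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hubs, auths


# Hypothetical three-page graph: "a" and "b" both link to "c".
toy_graph = {"a": ["c"], "b": ["c"], "c": []}
hub_scores, auth_scores = hits(toy_graph)
top_authority = max(auth_scores, key=auth_scores.get)
```

In this toy graph the page that receives both links ends up as the top authority, and the two linking pages share equal hub scores.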
Joe Hellerstein (UC Berkeley)
“What” Goes Around
1. Industrial revolution of data: sensors, logs, cameras
2. Hardware revolution: datacenters/virtualization, many-core
3. Industrial revolution in software? Declarative languages in some domains
Why “What”:
– Rapid prototyping
– Pocket-size code bases
– Independent from the runtime
– Ease of analysis and security
– Allow optimization and adaptability
Joe Hellerstein (UC Berkeley)
• Sensor networks, mobile networks, modular robotics, computer games, program analysis
• Distributed inference (junction trees and loopy belief propagation), graphs upon graphs
• Evita Raced: Overlog metacompiler (compiler is written declaratively)
– Matches Datalog optimizations (dynamic programming), cycle tests
• Datalog with known extensions and tweaks
• Centrality of rendezvous & graphs
• Challenges:
– Performance beyond number of messages (e.g., memory hierarchy), availability, real programs, not Turing complete
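To make the "what, not how" idea concrete, consider the classic Datalog reachability program: path(X, Y) :- link(X, Y). path(X, Z) :- link(X, Y), path(Y, Z). The sketch below evaluates those two rules bottom-up in Python using semi-naive evaluation. It is an illustrative assumption about how such rules can be executed, not Overlog syntax and not the Evita Raced implementation.

```python
def transitive_closure(links):
    """Derive every path(X, Z) fact implied by the link(X, Y) facts."""
    path = set(links)   # rule 1: every link is a path
    delta = set(links)  # facts that were new in the previous round
    while delta:
        # Rule 2, joined only against the newly derived facts (semi-naive).
        derived = {(x, z) for (x, y) in links for (y2, z) in delta if y == y2}
        delta = derived - path
        path |= delta
    return path


# Hypothetical link facts forming a chain a -> b -> c -> d.
reach = transitive_closure({("a", "b"), ("b", "c"), ("c", "d")})
```

The loop stops when a round derives no new facts, so it also terminates on cyclic inputs. Joining against only the facts from the previous round is the sort of rewrite a Datalog optimizer can apply without any change to the rules themselves.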
Raghu Ramakrishnan (Yahoo! Res.)
Sherpa: Cloud Computing of the Third Kind
Alex Szalay (Johns Hopkins)
Scientific Applications of Large Databases
Phillip Gibbons (Intel Research)
Data-Rich Computing: Where It’s At
• Important, interesting, exciting research area
“I know where it’s at, man!”
• Cluster approach: computing is co-located where the storage is at
Focus of this talk:
• Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation
• Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at
Hierarchy-Savvy Parallel Algorithm Design (HI-SPADE) project
Goal: Support a hierarchy-savvy model of computation for parallel algorithm design
• Hierarchy-savvy:
– Hide what can be hidden
– Expose what must be exposed
– Sweet spot between ignorant and fully aware
• Support:
– Develop the compilers, runtime systems, architectural features, etc. to realize the model
– Important component: fine-grain threading
IrisNet’s Two-Tier Architecture
Two components:
– SAs: sensor feed processing
– OAs: distributed database
[Figure labels: User, Query, Web Server for the URL, OAs (each with an XML database), SAs (each running senselets), Sensors, Sensornet]
Jeannette Wing (CMU/NSF)
NSF Plans for Supporting
Data-Intensive Computing
Google/IBM Data Center
– ~2000 processors, large Hadoop cluster
– Allocate in units of rack weeks
– NSF will review proposals for use: Cluster Exploratory (CluE)
– Running Xen; Won’t open up performance monitoring
– Goal: Show applicability outside of computer science
Academic-Industry-Government partnership
Randy Bryant (CMU)
Big Data Computing Study Group
• Collection of ~20 people (looking for volunteers)
• Goals:
– Fostering educational activities
– Advocacy
– Building community
• CCC’s Big Data Computing Study Group seeks to foster collaborations between industry, academia, and the U.S. government to advance the state of the art in the development and application of large-scale computing systems for making intelligent use of the massive amounts of data being generated in science, commerce, and society