2010 Workshop on Massive Data Analytics on the Cloud (MDAC 2010)
April 26, 2010
Raleigh, NC, USA
In association with the 19th Annual World Wide Web Conference (WWW2010)
Making Sense of Mountains of Data
• Diagram labels: Search; Online Transaction Processing System; Feedback/Action; Embedded Analytics; Dashboards; Scorecards; Financial Planning; Auto/Cross Correlation Analytics; Predictive Analytics; Mash-ups
• Continuous arrival of high-volume information (evolving, highly variant; structured, semi-structured, and unstructured)
– Clickstream, CRM
– Claim data (text, picture, video)
– Call data records
– Location tracking (GPS), iPhone, vehicle use data
– $ transaction tracking (across borders & IP providers)
• Billions of mobile devices
• Feeds:
– Census Bureau data
– Market data, weather data
– Sensor data
– Web data (for search)
– Web buzz data (for reputation analysis)
• Petabytes -> Exabytes
• Deep & Wide Analytics
• Fine grained: individual product and customer at a time and place
Massive Data Analytic Platforms
• Google: Original MapReduce implementation
• Microsoft: Dryad
• Yahoo!, Facebook, and many others: Hadoop
• Ecosystems: Hive, Pig, Jaql, ZooKeeper, …
• Alternatives to Map/Reduce, e.g. Pregel
[Diagram: MapReduce data flow with stages labeled M, C, and R, connected through Partition and Sort]
• 1000's of processors, petabytes of data … and growing
• "Easy" parallelism
• Scalability
• Fault-Tolerance
• Elastic
• Flexibility
• Cost / Performance
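The M, Partition, Sort, and R labels on this slide correspond to the usual MapReduce stages. As a minimal single-process sketch of that flow, the Python below counts words. It is an illustration only, not the implementation used by any platform named above. All function names are hypothetical, and reading the diagram's C label as a local "combine" step is an assumption.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(document):
    """Map (M): emit one (word, 1) pair per word in the document."""
    return [(word.lower(), 1) for word in document.split()]


def combine(pairs):
    """Combine (assumed meaning of C): pre-aggregate one map task's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(pairs, num_reducers):
    """Partition: route every key to exactly one reducer bucket.

    A simple deterministic hash (sum of UTF-8 byte values) is used here
    because Python's built-in str hash is randomized per process.
    """
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[sum(key.encode("utf-8")) % num_reducers].append((key, value))
    return buckets


def reduce_phase(bucket):
    """Sort by key, then Reduce (R): sum the values of each key."""
    ordered = sorted(bucket, key=lambda kv: kv[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


def word_count(documents, num_reducers=2):
    """Run map, combine, partition, sort, and reduce over all documents."""
    reducer_inputs = [[] for _ in range(num_reducers)]
    for document in documents:  # each document stands in for one map task
        buckets = partition(combine(map_phase(document)), num_reducers)
        for index, bucket in enumerate(buckets):
            reducer_inputs[index].extend(bucket)
    result = {}
    for bucket in reducer_inputs:  # each bucket stands in for one reduce task
        result.update(reduce_phase(bucket))
    return result


if __name__ == "__main__":
    print(word_count(["big data on the cloud", "data analytics on the cloud"]))
```

Because each key lands in exactly one bucket, the per-reducer results can be merged without conflicts. In a real deployment the map and reduce tasks would run on separate machines, which is where the "easy" parallelism and fault-tolerance properties listed above come from.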
Chairpeople's Perspective
• Other parallel systems technology and customers
– Parallel Database – enterprise data warehousing
– Parallel ETL (extraction, transformation, load)
– Search and text analytics
• Hadoop and related technologies
– Finance, Telco, Healthcare, Retail, Government, …
Questions Posed in Call For Papers
• What kinds of problems are people trying to solve?
• How are existing massive-scaleout platforms used, and what extensions would be helpful?
• Other kinds of platforms for different problems?
• How to integrate with existing environments such as data warehouses?
• Challenges in managing massive datasets?
• Legal/moral challenges associated with mining these data sets?
Agenda (morning)
9:00 - 10:30: Session 1
Introduction and Welcome
Invited Talk: "Hadoop: An Industry Perspective"
Dr. Amr Awadallah, CTO, VP-Engineering, Cloudera
10:30 - 11:00: Coffee Break*
11:00 - 12:30: Session 2
Distributed Indexing of Web Scale Datasets for the Cloud
Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos, Nectarios Koziris;
National Technical University of Athens
Beyond Online Aggregation: Parallel and Incremental Data Mining with Online Map-Reduce
Joos-Hendrik Böse (Intl. Comp. Sci. Institute); Artur Andrzejak, Mikael Högqvist (Zuse Institute Berlin (ZIB))
Efficient Updates for a Shared Nothing Analytics Platform
Katerina Doka (National Technical University of Athens, Greece), Dimitrios Tsoumakos (University of Cyprus), Nectarios Koziris (National Technical University of Athens, Greece)
12:30 - 1:30: Lunch*
Agenda (afternoon)
1:30 - 3:30: Session 3
Invited Talk: "Large Scale Applications on Hadoop in Yahoo"
Dr. Vijay Narayanan, Yahoo! Labs Silicon Valley
Extracting User Profiles from Large Scale Data
Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass,
David Konopnicki; IBM Research, Haifa
A Novel Approach to Multiple Sequence Alignment using Hadoop Data Grids
Sudha Sadasivam, G. Baktavatchalam; PSG College of Technology
3:30 - 4:00: Coffee Break*
4:00 - 5:30: Session 4
Towards Scalable RDF Graph Analytics on MapReduce
Padmashree Ravindra, Vikas Deshpande, Kemafor Anyanwu; North Carolina State
University
SPARQL Basic Graph Pattern Processing with Iterative MapReduce
Jaeseok Myung, Jongheum Yeon, Sang-goo Lee; Seoul National University
Parallelizing Random Walk with Restart for Large-Scale Query Recommendation
Meng-Fen Chiang, Tsung-Wei Wang, Wen-Chih Peng; National Chiao Tung University, Hsinchu, Taiwan
Acknowledgements
Workshop Chairs
Ullas Nambiar, IBM India Research Lab, New Delhi, India
John McPherson, IBM Almaden Research Center, USA
David Konopnicki, IBM Haifa Research Lab, Israel
Steering Committee
Rakesh Agrawal, Microsoft Search Labs, Mountain View, CA, USA
Alon Halevy, Google Inc., Mountain View, CA, USA
Invited Speakers
Amr Awadallah, CTO, VP-Engineering, Cloudera, "Hadoop: An Industry Perspective"
Vijay Narayanan, Yahoo! Labs Silicon Valley, "Large Scale User Modeling on Hadoop"
Program Committee
Amr Awadallah, Cloudera, USA
Andrew McCallum, University of Massachusetts Amherst, USA
Assaf Schuster, Technion - Israel Institute of Technology
Gautam Das, University of Texas, Arlington, USA
Jimeng Sun, IBM Watson Research Center, USA
John Shafer, Microsoft Search Labs, USA
Kevin Chang, University of Illinois at Urbana-Champaign, USA
Kun Liu, Yahoo! Labs, USA
Louiqa Raschid, University of Maryland, College Park, USA
Michal Shmueli-Scheuer, IBM Haifa Research Lab, Israel
Michael Sheng, University of Adelaide, Australia
Mong Li Lee, National University of Singapore, Singapore
Rajeev Gupta, IBM India Research Lab, India
Vanja Josifovski, Yahoo Research, USA
Yannis Sismanis, IBM Almaden Research Center, USA
Yi Chen, Arizona State University, USA
Wen-syan Li, SAP, China