Challenges and New Trends in Data Intensive Science
Panel at Data-aware Distributed Computing (DADC) Workshop
HPDC Boston
June 24 2008
Geoffrey Fox
Community Grids Laboratory, School of Informatics
Indiana University
[email protected], http://www.infomall.org
HPDC DADC Panel Questions I
• What do you think is the most challenging research problem in data-intensive computing today? What will it be five years from now?
  – Same problem as for any distributed computing arena
    • Should we call it clouds?
    • What is a sustainable architecture and associated software?
    • What are the special requirements of scientific data processing?
    • How do they relate across sensors, “classic” astronomy/particle physics and “scholarly studies”?
    • How do these relate to commercial systems that perhaps look most like “scholarly studies”?
    • Is there any chance today’s 100 idiosyncratic data and metadata systems will survive?
    • The 5-year projection depends a lot on the evolution of compute, data and network systems/hardware
• The demand for data-intensive computing has also attracted the attention of the federal funding agencies. NSF, DOE and other agencies are introducing new programs focusing on data-intensive computing such as DataNet, INTEROP, CLuE, SciDAC Data Centers etc. Can we say data-intensive computing will be the next big/hot thing? How long do you think this trend will continue?
  – Note that 30 years ago I was solving the LHC problem on a few hundred 1600 BPI tapes (around 10 gigabytes of data). I used identical analysis software (as did those from 15 years before), but the volume of data was a factor of a million lower
  – The HPCC Initiative shifted emphasis to simulation in 1990
  – DataNet: note that the raw data and software were thrown away 12 years ago (without asking me; both were “saved” on data repository chipstore), but I do have the results of the analysis
  – So the problem is not so new; the scale is new
  – I want to stress that agencies should keep a good balance between research and deployment; I am not so sure the community has done very well on deployment
HPDC DADC Panel Questions II
• How do you think the new Petaflop scale systems and multicore machines will affect data-intensive science?
  – Multi-core will be very important as the high performance distributed data analysis system for the plethora of distributed data
  – Not clear what the computational complexity of important data mining algorithms will be
  – Unclear if Petaflops are very relevant; many smaller machines are better, and one is not allowed to use Petaflops in this fashion
• Could you comment on the vision of your institution on the "next generation data-intensive science & discovery"?
  – Very important in the usual sciences: Biology, Chemistry, Physics
  – Indiana University has no engineering school, so social science or “scholarly knowledge” is especially interesting
  – Education and Training: clearly identified as a serious problem
  – Propose (as chair of informatics department) certificates/minors in XInformatics (a wrinkle on Computational Science)
    • X = Geography, Library, Health, Business, Media, Sports
    • Already have a full set of offerings in Bio, Chem, Security, Music, Social, HCI …