Michael Drinkwater: Big Data in Science



Big Data in Science
(Lessons from Astronomy)
Michael Drinkwater, UQ & CAASTRO
1. Preface
Contributions by Jim Gray
Astronomy data flow
2. Past Glories
Why it was easy to be world-leading
3. Future Challenges
Why really big data makes us worry!
CSIRO Parkes radio telescope
1. Preface: Jim Gray (Microsoft eScience)
› Much of what I discuss was already said by the
late Jim Gray:
› Jim Gray on eScience, in The Fourth Paradigm, eds Hey,
Tansley & Tolle, 2009. (emphasis added)
› “I have been hanging out with
astronomers for about the last 10
years… I look at their telescopes…
$15-20M worth of capital equipment
with about 20-50 people operating the
instrument… millions of lines of code
are needed to analyse all this
information. In fact the software cost
dominates the capital expenditure!”
— Jim Gray
1. Preface: Astronomy Data Flow
[Diagram: data flow from raw images to output image]
2. Past Glories
› 20 years ago
- Easy to lead the world!
› UKST photographic all-sky survey
- 1 image = 1 GB
- All-sky image = 1 TB
- All-sky catalogue = 100 MB
- Put online with two summer
student projects
2. Past Glories
› Why did astronomy lead the way
with (old) big data?
› 1) Telescopes are expensive, so
there are only a few data sources
- Data are complex, so there are
only a few software packages,
especially for national projects
- => easy to adopt a common
data file format
› 2) Astronomers had strong
computing skills
- => easy to search a relatively
large discovery space
CSIRO's ASKAP radio telescope with its
innovative phased array receiver technology.
(Image: Dragonfly Media)
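The common data file format the slide alludes to is presumably FITS, astronomy's long-standing standard. A minimal sketch of why a shared format helps: FITS headers are fixed 80-character "cards", so every package can read every telescope's metadata the same way. This is an illustrative fragment, not a full FITS writer.

```python
# Minimal sketch of a shared data format: fixed-format FITS header cards.
# Every card is exactly 80 characters, so any reader can parse any header.

def fits_card(keyword: str, value, comment: str = "") -> str:
    """Format one fixed-format FITS header card (80 characters)."""
    if isinstance(value, bool):
        text = "T" if value else "F"           # FITS logicals are T / F
    elif isinstance(value, str):
        text = f"'{value:<8}'"                 # strings quoted, min 8 chars
    else:
        text = str(value)
    # keyword in columns 1-8, "= " in 9-10, value right-justified to col 30
    card = f"{keyword:<8}= {text:>20}"
    if comment:
        card += f" / {comment}"
    return card[:80].ljust(80)

header = [
    fits_card("SIMPLE", True, "conforms to FITS standard"),
    fits_card("BITPIX", 16, "bits per pixel"),
    fits_card("NAXIS", 2, "number of axes"),
]
for card in header:
    print(repr(card))
```

Because the layout is byte-for-byte fixed, software written for one survey reads another survey's files unchanged, which is exactly the economy of scale the slide describes.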
2. Past Glories
› Problems with the old
approach in astronomy
- Most team projects
underestimate or ignore
database budget
- Astronomers too
independent – skeptical of
computer science expertise
- Bespoke solutions not
scalable or sustainable
The Anglo-Australian Telescope
(Image: AAO) – used for many team projects
2. Past Glories
› WiggleZ Dark Energy Survey
- 5 year observing project
- $5M facility time + $1.5M grants
+ 20 team salaries
- Database $40k (donated by host
as not funded)
› Success!
- 4 tests confirming Einstein’s
General Relativity
- Many other results
- 1425 citations
› Failure!
- Database failed as not supported
3. Future Challenges
› New projects so large that astronomy must change its approach:
- 1995 Schmidt photographic survey: 1 TB
- 2006 Sloan Digital Sky Survey: 25 TB
- 2022-32 Large Synoptic Survey
Telescope: 130 PB in 10 years
- 2030-? Square Kilometre Array radio
telescope: 10 PB per day!
- More data per day than entire internet
per year
The LSST: 8.4 m
telescope mirror,
3.2 Gpixel camera
3. Future Challenges
› Challenges we know how to solve (Jim
Gray predicted most of these)
- Realistic funding
- Scalable database structure: how to
avoid I/O limits
- Must move the query to the data
- Efficient database design (Jim’s 20
questions to define functionality)
3. Future Challenges
› Nasty challenges we are yet
to solve…
- Complex data mining way
beyond SQL
- “Teaching software
engineering to the whole
community”¹
- Real-time analysis for
transient events
- Cross-matching different
large databases in different
locations
“The data collected by the SKA in
a single day would take nearly two
million years to play back on an
iPod.” skatelescope.org
1. Mario Juric, LSST Data Management Project Scientist
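Cross-matching, the last challenge above, means pairing each source in one catalogue with its counterpart in another by sky position. A brute-force sketch with made-up catalogues is below; real surveys replace the O(N·M) loop with spatial indexing (e.g. k-d trees or sky pixelisations), which is precisely what makes the problem hard at petabyte scale.

```python
import math

# Toy sketch of catalogue cross-matching: for each source in one survey,
# find the nearest counterpart in another within a tolerance.
# The coordinates here are invented for illustration.

def ang_sep(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula on the sphere)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2 +
         math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def cross_match(cat_a, cat_b, tol_deg):
    """Return {index_in_a: index_in_b} for pairs closer than tol_deg."""
    matches = {}
    for i, (ra_a, dec_a) in enumerate(cat_a):
        best = min(range(len(cat_b)),
                   key=lambda j: ang_sep(ra_a, dec_a, *cat_b[j]))
        if ang_sep(ra_a, dec_a, *cat_b[best]) < tol_deg:
            matches[i] = best
    return matches

optical = [(150.10, 2.20), (150.50, 2.80)]          # (RA, Dec) in degrees
radio   = [(150.50, 2.80005), (10.0, -45.0)]
print(cross_match(optical, radio, tol_deg=1/3600))  # 1 arcsec tolerance
```

Only the second optical source has a radio counterpart within one arcsecond; doing this between databases hosted at different sites, without moving either catalogue, is the unsolved part.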
Postscript: Jim Gray (Microsoft eScience)
› Jim Gray’s rules for large data design:
- Scientific computing is increasingly
data intensive
- Solution is a “scale-out” architecture
- Bring computations to the data,
rather than data to the computations
- Start the design with the top 20
questions
- Go from “working to working”
- From “Gray’s Laws: Database-centric Computing in
Science”, Szalay & Blakeley, in The Fourth
Paradigm, eds Hey, Tansley & Tolle, 2009.