Michael Drinkwater: Big Data in Science
Big Data in Science (Lessons from astrophysics)
Michael Drinkwater, UQ & CAASTRO
1. Preface
Contributions by Jim Gray
Astronomy data flow
2. Past Glories
Why it was easy to be world-leading
3. Future challenges
Why really big data makes us worry!
CSIRO Parkes radio telescope
1. Preface: Jim Gray (Microsoft eScience)
› Much of what I discuss was already said by the late Jim Gray:
› Jim Gray on eScience, in The Fourth Paradigm, eds Hey, Tansley & Tolle, 2009. (emphasis added)
research.microsoft.com
› “I have been hanging out with astronomers for about the last 10 years… I look at their telescopes… $15-20M worth of capital equipment with about 20-50 people operating the instrument… millions of lines of code are needed to analyse all this information. In fact the software cost dominates the capital expenditure!”
Jim Gray, Microsoft Research
1. Preface: Astronomy Data Flow
[Diagram: data flows from the Telescope as Raw Images into a Database, which delivers Output Images and Catalogues for Science]
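A minimal sketch of this flow in Python, using an in-memory SQLite database; the reduce_image function, the sources schema, and the toy data are illustrative assumptions, not the actual survey software. Raw images are reduced to catalogue rows, the rows are loaded into the database, and the science step queries the catalogue rather than the raw pixels.

import sqlite3

def reduce_image(raw_image):
    # Stand-in source extraction: one (ra, dec, mag) row per detected source.
    return [(s["ra"], s["dec"], s["mag"]) for s in raw_image["sources"]]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sources (ra REAL, dec REAL, mag REAL)")

raw_images = [  # toy "telescope output": two images, one source each
    {"sources": [{"ra": 150.1, "dec": 2.2, "mag": 17.5}]},
    {"sources": [{"ra": 150.3, "dec": 2.4, "mag": 19.1}]},
]
for image in raw_images:
    db.executemany("INSERT INTO sources VALUES (?, ?, ?)", reduce_image(image))

# Science step: query the catalogue, not the images.
bright = db.execute("SELECT ra, dec FROM sources WHERE mag < 18").fetchall()
print(bright)  # [(150.1, 2.2)]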
2. Past Glories
› 20 years ago
- Easy to lead the world!
› UKST photographic all-sky survey
- 1 image = 1 GB
- All-sky image = 1 TB
- All-sky catalogue = 100 MB
- Put online with two summer student projects
2. Past Glories
› Why did astronomy lead the way with (old) big data?
› 1) Telescopes are expensive, so only a few data sources
- Data complex, so only a few software packages, especially for national projects
- => easy to adopt a common data file format
› 2) Astronomers had strong computing skills
- => easy to search a relatively large discovery space
CSIRO's ASKAP radio telescope with its innovative phased array receiver technology. (Image: Dragonfly Media)
2. Past Glories
› Problems with the old approach in astronomy
- Most team projects underestimate or ignore the database budget
- Astronomers too independent – skeptical of computer science expertise
- Bespoke solutions not scalable or sustainable
The Anglo-Australian Telescope (Image: AAO) – used for many team projects
2. Past Glories
› WiggleZ Dark Energy Survey
- 5-year observing project
- $5M facility time + $1.5M grants + 20 team salaries
- Database $40k (donated by host, as it was not funded)
› Success!
- 4 tests proving Einstein’s General Relativity correct
- Many other results
- 1425 citations
› Failure!
- Database failed as it was not supported
3. Future Challenges
› New projects so large astronomy must change…
- 1995 Schmidt photographic survey: 1 TB
- 2006 Sloan Digital Sky Survey: 25 TB
- …
- 2022-32 Large Synoptic Survey Telescope: 130 PB in 10 years
- 2030-? Square Kilometre Array radio telescope: 10 PB per day!
- More data per day than the entire internet per year
The LSST: 8.4 m telescope mirror, 3.2-Gpixel camera
3. Future Challenges
› Challenges we know how to solve (Jim Gray predicted most of these)
- Realistic funding
- Scalable database structure: how to avoid I/O limits
- Must move the query to the data (sketched below)
- Efficient database design (Jim’s 20 questions to define functionality)
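A hedged sketch of “move the query to the data”, with an in-memory SQLite database standing in for a remote survey server (the sources schema and the generated rows are made-up assumptions). The anti-pattern hauls every row across the network and filters locally; Gray’s rule ships a small query to the data so only the answer comes back.

import sqlite3

# Stand-in for a remote survey database (schema and rows are made up).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE sources (ra REAL, dec REAL, mag REAL)")
remote.executemany("INSERT INTO sources VALUES (?, ?, ?)",
                   [(150.0 + i * 1e-4, 2.0, 15 + i % 10) for i in range(100_000)])

# Anti-pattern: move the data to the query -- fetch everything, filter locally.
rows = remote.execute("SELECT ra, dec, mag FROM sources").fetchall()
bright_local = [(ra, dec) for ra, dec, mag in rows if mag < 18]

# Gray's rule: move the query to the data -- only the answer crosses the wire.
bright_remote = remote.execute("SELECT ra, dec FROM sources WHERE mag < 18").fetchall()

assert sorted(bright_local) == sorted(bright_remote)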
3. Future Challenges
› Nasty challenges we are yet to solve…
- Complex data mining, way beyond SQL
- “Teaching software engineering to the whole community”1
- Real-time analysis for transient events
- Cross-matching different large databases in different locations (see the sketch after this slide)
“The data collected by the SKA in a single day would take nearly two million years to play back on an iPod.” skatelescope.org
1. Mario Juric, LSST Data Management Project Scientist
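One ingredient of the cross-matching problem, sketched below: nearest-neighbour matching of two catalogues on the sky with a k-d tree. The catalogues are random toy data and the 1-arcsec match radius is an assumed tolerance; the genuinely hard, unsolved part is doing this across large databases in different locations, which this single-machine sketch does not attempt.

import numpy as np
from scipy.spatial import cKDTree

def radec_to_xyz(ra_deg, dec_deg):
    # Unit vectors on the sphere, so 3-D chord distance tracks angular distance.
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.column_stack([np.cos(dec) * np.cos(ra),
                            np.cos(dec) * np.sin(ra),
                            np.sin(dec)])

rng = np.random.default_rng(0)
cat_a = rng.uniform([0.0, -5.0], [10.0, 5.0], size=(1000, 2))  # (ra, dec) in degrees
cat_b = cat_a + rng.normal(0.0, 1e-4, size=cat_a.shape)        # perturbed copy

tree = cKDTree(radec_to_xyz(cat_b[:, 0], cat_b[:, 1]))
dist, idx = tree.query(radec_to_xyz(cat_a[:, 0], cat_a[:, 1]))

# Convert chord length to angular separation; keep matches within 1 arcsec.
sep_arcsec = np.degrees(2 * np.arcsin(dist / 2)) * 3600
matched = idx[sep_arcsec < 1.0]
print(f"{matched.size} of {len(cat_a)} sources matched within 1 arcsec")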
Postscript: Jim Gray (Microsoft eScience)
› Jim Gray’s rules for large data design: research.microsoft.com
- Scientific computing is increasingly data intensive
- Solution is a “scale-out” architecture (sketched below)
- Bring computations to the data, rather than data to the computations
- Start the design with the 20 top questions
- Go from "working to working"
- From “Gray’s Laws: Database-centric Computing in Science”, Szalay & Blakeley, in The Fourth Paradigm, eds Hey, Tansley & Tolle, 2009.
Jim Gray, Microsoft Research
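A hedged illustration of the “scale-out” rule combined with bringing computation to the data: the catalogue is partitioned across shards (in-memory SQLite databases standing in for separate servers), the same small query is shipped to every shard, and only the per-shard answers are merged. The declination-band sharding rule and the schema are illustrative assumptions.

import sqlite3

N_SHARDS = 4

def shard_for(dec):
    # Illustrative sharding rule: declination band -> shard index.
    return min(int((dec + 90.0) / 180.0 * N_SHARDS), N_SHARDS - 1)

shards = [sqlite3.connect(":memory:") for _ in range(N_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE sources (ra REAL, dec REAL, mag REAL)")

for ra, dec, mag in [(10.0, -60.0, 17.0), (20.0, -10.0, 19.5),
                     (30.0, 15.0, 16.2), (40.0, 70.0, 17.9)]:
    shards[shard_for(dec)].execute("INSERT INTO sources VALUES (?, ?, ?)",
                                   (ra, dec, mag))

# Ship the same query to every shard; merge only the (small) answers.
query = "SELECT ra, dec FROM sources WHERE mag < 18"
answer = [row for db in shards for row in db.execute(query)]
print(answer)  # three bright sources, gathered from three different shards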