
#### Distributed Data Mining (DDM)

Kirk Borne, George Mason University, LSST-VAO discussion, March 24, 2011

#### The LSST Data Mining Challenges

1. Massive data stream: ~2 terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for ~100,000 events each night.

• Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
• Look at #2 and #3 in more detail ...

#### LSST data mining challenge #2

• Accurately characterize and classify 50 billion objects and 20 trillion source observations.
• Requires VO-accessible multi-wavelength data.
• Szalay's Law: astrophysical discovery potential grows as (number of data sources)².
• Benefits of very large datasets:
  – best statistical analysis of "typical" events
  – automated search for "rare" events

#### LSST data mining challenge #3

• Approximately 100,000 times each night for 10 years, LSST will obtain data on a new sky event, and we will be challenged with classifying these data.
• [Slide sequence: a flux-versus-time light curve is built up point by point; more data points help!]
• Characterize first! then Classify.
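The data rates quoted in challenges #1 and #2 can be sanity-checked with quick arithmetic. This sketch assumes roughly 10 observing hours per night, a figure not stated in the talk:

```python
# Back-of-envelope check of the LSST data rates quoted above.
TB_PER_HOUR = 2        # image stream rate from challenge #1
HOURS_PER_NIGHT = 10   # assumption, not from the talk
NIGHTS_PER_YEAR = 365
YEARS = 10

tb_per_night = TB_PER_HOUR * HOURS_PER_NIGHT          # 20 TB per night
tb_total = tb_per_night * NIGHTS_PER_YEAR * YEARS     # raw image stream over the survey
pb_total = tb_total / 1000                            # ~73 PB of raw images

print(f"{tb_per_night} TB/night, ~{pb_total:.0f} PB over {YEARS} years")
```

The ~20-petabyte database of challenge #2 is much smaller than the raw image total because it holds extracted catalog measurements, not pixels.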
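The "characterize first, then classify" workflow of challenge #3 can be sketched in a few lines. This is a toy illustration, not the LSST pipeline: the feature set (amplitude, median flux) and the burst threshold are invented for the example.

```python
import statistics

def characterize(times, fluxes):
    """Extract simple descriptors from a light curve (characterize first!)."""
    return {
        "amplitude": max(fluxes) - min(fluxes),   # peak-to-trough flux range
        "median": statistics.median(fluxes),      # typical (quiescent) flux level
        "n_points": len(fluxes),                  # more data points help!
    }

def classify(features, burst_threshold=5.0):
    """Toy rule: amplitude much larger than the median flux => transient."""
    if features["amplitude"] > burst_threshold * max(features["median"], 1e-9):
        return "transient candidate"
    return "quiet / periodic candidate"

# A rising burst caught mid-event:
lc_t = [0, 1, 2, 3, 4]
lc_f = [1.0, 1.1, 0.9, 12.0, 25.0]
print(classify(characterize(lc_t, lc_f)))   # transient candidate
```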
#### Characterization Use Case #1

• Feature detection and extraction:
  – Automated pipelines' tasks: Characterize!
    • Identify and describe features in the data
    • Extract feature descriptors from the data
    • Curate these features for scientific re-use
  – Human experts' tasks: Categorize and Classify!
    • Associate features with astrophysical processes
    • Find boundaries between feature sets and label them
  – Example: Star-Galaxy Separation

#### Characterization Use Case #2

• The clustering problem:
  – Finding clusters of objects within a data set
  – Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors
    • N is > 10¹⁰, so what is the most efficient way to sort?
    • Number of dimensions ~ 1000, so we have an enormous subspace search problem
  – Scientist: determine the significance of the clusters (statistically and scientifically) – categorize!

#### Characterization Use Case #3

• Outlier detection (unknown unknowns):
  – Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
  – These may be real scientific discoveries or garbage
  – Outlier detection is therefore useful for:
    • Novelty discovery – is my Nobel prize waiting?
    • Anomaly detection – is the detector system working?
    • Data quality assurance – is the data pipeline working?
  – How does one optimally find outliers in 10³-D parameter space? Or in interesting subspaces (in lower dimensions)?
  – How do we measure their "interestingness"?

#### Characterization Use Case #4

• The dimension reduction problem:
  – Finding correlations and "fundamental planes" of parameters
  – The number of attributes can be hundreds or thousands: the Curse of High Dimensionality!
  – Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
  – Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
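Use Case #1's star-galaxy separation example amounts to finding a boundary between labeled feature sets. A minimal sketch, with a hypothetical 1-D "concentration" feature (point-like stars concentrated, extended galaxies diffuse) and a brute-force threshold search standing in for the expert's boundary-finding task:

```python
def fit_boundary(concentrations, labels):
    """Pick the 1-D threshold that best separates labeled stars from galaxies."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(concentrations)):       # try each observed value as a cut
        preds = ["star" if c >= t else "galaxy" for c in concentrations]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Expert-labeled training examples (feature values are invented for illustration):
conc   = [0.9, 0.8, 0.75, 0.3, 0.2, 0.15]
labels = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]
t = fit_boundary(conc, labels)
print("decision boundary:", t)   # 0.75 separates the two groups perfectly
```

The pipeline then applies the learned boundary at scale; the expert's role is labeling and validating it.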
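The friends-of-friends clustering of Use Case #2 can be sketched with a union-find pass over pairwise distances. The O(N²) double loop below is exactly what does not scale to N > 10¹⁰, which is why the slide asks about efficient sorting and spatial indexing; the points and linking length are invented for the example.

```python
def friends_of_friends(points, linking_length):
    """Group points into clusters: two points are 'friends' if closer than the
    linking length; clusters are the transitive closure (friends of friends)."""
    n = len(points)
    parent = list(range(n))                      # union-find parent array

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path compression
            i = parent[i]
        return i

    for i in range(n):                           # O(N^2): fine for a sketch only
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            if d <= linking_length:
                parent[find(i)] = find(j)        # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

pts = [(0, 0), (0.4, 0), (0.8, 0), (5, 5), (5.3, 5)]
groups = friends_of_friends(pts, linking_length=0.5)
print(groups)   # two clusters: {0, 1, 2} linked transitively, and {3, 4}
```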
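For the outlier detection of Use Case #3, one simple family of methods scores each object by its distance to its k-th nearest neighbor: points far from every known cluster score highest. A 1-D sketch, with the score doubling as a crude "interestingness" measure (the data are invented):

```python
def outlier_scores(values, k=2):
    """Distance to the k-th nearest neighbour; large = far from all clusters."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(dists[k - 1])
    return scores

vals = [1.0, 1.1, 0.9, 1.05, 42.0]   # one obvious anomaly
scores = outlier_scores(vals)
top = vals[scores.index(max(scores))]
print("most 'interesting' point:", top)   # 42.0
```

Whether the top-scoring object is a discovery, a detector fault, or a pipeline bug is exactly the triage the slide describes.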
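The eigenvector search of Use Case #4 is, in its simplest form, principal component analysis. For two parameters the leading eigenvector of the covariance matrix has a closed form, which keeps this sketch dependency-free; the example data are invented:

```python
import math

def principal_axis(xs, ys):
    """Leading eigenvector of the 2x2 covariance matrix: the direction of the
    'fundamental plane' (here a line) along which two parameters co-vary."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Closed-form angle of the first principal component for a 2x2 covariance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

# Two "observational parameters" that are really one underlying quantity:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]     # ~ 2x, plus noise
vx, vy = principal_axis(xs, ys)
print(f"principal direction: ({vx:.2f}, {vy:.2f})")   # close to (1, 2)/sqrt(5)
```

In a real catalog with hundreds of attributes the same idea runs over the full covariance matrix, and the small-eigenvalue directions are the redundant dimensions to discard.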
#### The LSST Data Mining Challenges: What's the common theme?

• Need multi-wavelength data in all use cases!
• VO-accessible ancillary information is essential.
• Requirements for success:
  – Discovery of distributed data sources
  – Access to distributed data sources
  – Applying characterization and clustering (data mining) algorithms on distributed data: unsupervised and supervised machine learning

#### Data Bottleneck

• Mismatch:
  – Data volumes increase ~1000x in 10 years.
  – I/O bandwidth improves ~3x in 10 years.
• Therefore ... Distributed Data Mining.

#### Distributed Data Mining (DDM)

• DDM comes in two types:
  1. Mining of Distributed Data (MDD)
  2. Distributed Mining of Data (DMD)
• Type 1 takes many forms, with data being centralized (in whole or in partitions).
• Type 2 requires sophisticated algorithms that operate with data in situ: ship the code to the data.
• The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until a solution is converged upon.
• This can be pipeline-initiated or scientist end-user-initiated.
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/
• Ultimate goal: Knowledge Discovery through Data Discovery
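The Data Bottleneck slide's conclusion follows from one ratio. Taking the two growth figures at face value:

```python
# The mismatch on the "Data Bottleneck" slide, in one line each:
data_growth = 1000   # data volume growth over 10 years
io_growth = 3        # I/O bandwidth growth over the same period

# If moving a dataset takes time T today, in 10 years it takes ~T * 1000/3:
slowdown = data_growth / io_growth
print(f"moving the data becomes ~{slowdown:.0f}x more expensive, relatively")
```

A ~333x relative penalty on data movement is what makes shipping the code to the data, rather than the data to the code, the only viable option.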
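The "ship the code to the data" pattern described above can be sketched as a distributed k-means: each node runs the same local routine on its own partition and ships back only partial results (sums and counts), which the coordinator merges before the next iteration. This is a minimal single-process simulation of the idea, not any particular DDM framework:

```python
def local_stats(partition, centroids):
    """Runs *at* a data node: assign local points to the nearest centroid,
    return only partial results (sums, counts), never the raw data."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in partition:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        sums[k] += x
        counts[k] += 1
    return sums, counts

def distributed_kmeans(nodes, centroids, iterations=10):
    """Coordinator: ship the code to each node, merge the partial results,
    update the centroids, and iterate toward convergence."""
    for _ in range(iterations):
        totals = [0.0] * len(centroids)
        counts = [0] * len(centroids)
        for partition in nodes:                  # each call "runs remotely"
            s, c = local_stats(partition, centroids)
            totals = [t + si for t, si in zip(totals, s)]
            counts = [n + ci for n, ci in zip(counts, c)]
        centroids = [t / n if n else centroids[i]
                     for i, (t, n) in enumerate(zip(totals, counts))]
    return centroids

# Two data nodes, each holding part of two well-separated clusters:
node_a = [1.0, 1.2, 0.8, 10.1]
node_b = [0.9, 9.8, 10.0, 10.2]
centroids_final = distributed_kmeans([node_a, node_b], centroids=[0.0, 5.0])
print(centroids_final)   # converges to the two cluster means
```

Only centroid updates cross the network, so the traffic per iteration is proportional to the number of clusters, not the number of objects, which is the whole point when N is ~10¹⁰.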