New Progress On Astronomical
Cross-Match Research
Zhao Qing
Joint Laboratory for Astronomical Information Technology
Contents
• Our Previous Function
• New Improvements and Attempts
– Discussion of Adaptability on HTM-Indexed
Data
– New function based on Boundary Growing
Model
– Cross-match in distributed environment
based on MapReduce model
• Plan & Discussion
Our Previous Function
• PHIXmatch: Paralleled HEALPix-Indexing Xmatch
• Test dataset: SDSS (100 million) × 2MASS (470 million)
• Function: Spatial Join
• Results:
  SDSS_ID             Twomass_ID        Distance
  587731512617271364  02595905+0000200  5.243e-05
  587731512617271365  02595905+0000200  6.55e-05
  587731513154076828  02593768+0012219  3.2e-05
  587731513154077269  02593768+0012219  0.0025043169
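A spatial join of this kind pairs every object in one catalog with the objects of the other catalog that lie within a small angular radius. The following is a minimal brute-force sketch, not the PHIXmatch implementation: the function names, the sample coordinates, the radius, and the use of degrees are all assumptions made for illustration (the slide does not state the unit of the Distance column).

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees between two (RA, Dec) positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form: numerically stable for the very small angles typical of cross-matching.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def spatial_join(table_a, table_b, radius_deg):
    """Return (id_a, id_b, distance) for every pair closer than radius_deg (brute force)."""
    matches = []
    for id_a, ra_a, dec_a in table_a:
        for id_b, ra_b, dec_b in table_b:
            d = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if d <= radius_deg:
                matches.append((id_a, id_b, d))
    return matches

# Hypothetical rows: (id, RA in degrees, Dec in degrees).
table_a = [("a1", 10.0, 0.0), ("a2", 50.0, 20.0)]
table_b = [("b1", 10.0, 0.001), ("b2", 200.0, -30.0)]
matches = spatial_join(table_a, table_b, radius_deg=0.003)
```

The nested loop costs one distance evaluation per pair of rows, which is why catalogs of this size need a spatial index rather than brute force.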
HEALPix Index Function
• HEALPix: Hierarchical Equal Area isoLatitude Pixelization of a sphere.
• Quadtree pixel numbering
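Quadtree numbering can be illustrated with a few lines of bit arithmetic. The sketch below follows the convention of HEALPix's NESTED scheme, in which a pixel's index inside a base face interleaves the bits of its x and y coordinates; the function names are hypothetical, and the mapping from sky coordinates to (face, ix, iy) is deliberately left out.

```python
def spread_bits(v):
    """Insert a zero bit between consecutive bits of v (e.g. 0b11 -> 0b101)."""
    out = 0
    bit = 0
    while v:
        out |= (v & 1) << (2 * bit)
        v >>= 1
        bit += 1
    return out

def xy_to_nested(face, ix, iy, order):
    """Nested index: face offset plus x bits in even positions and y bits in odd positions."""
    nside = 1 << order
    if not (0 <= ix < nside and 0 <= iy < nside):
        raise ValueError("ix and iy must lie inside the face")
    return face * nside * nside + spread_bits(ix) + 2 * spread_bits(iy)

def parent(pix):
    """Index of the enclosing pixel one level up the quadtree."""
    return pix >> 2

def children(pix):
    """Indices of the four sub-pixels one level down the quadtree."""
    return [4 * pix + k for k in range(4)]

pix = xy_to_nested(face=0, ix=3, iy=3, order=2)
```

Because each level appends two bits, moving up or down the hierarchy is a shift by two, which is what makes the numbering cheap to manipulate.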
What we have resolved
• Resolved the border-block problem
A fast bitwise algorithm deduces the index numbers of neighboring blocks
• Realized parallel cross-match computation in a multi-core environment
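The neighbor lookup can be sketched with bit operations on a nested quadtree index. This is an illustration under stated assumptions rather than the algorithm from the talk: it assumes the bit-interleaved layout of HEALPix's NESTED scheme, uses hypothetical function names, and only resolves neighbors that stay inside the same base face (a neighbor across a face edge is reported as None).

```python
def spread_bits(v):
    """Insert a zero bit between consecutive bits of v."""
    out = 0
    bit = 0
    while v:
        out |= (v & 1) << (2 * bit)
        v >>= 1
        bit += 1
    return out

def compact_bits(v):
    """Inverse of spread_bits: keep every second bit of v, starting with bit 0."""
    out = 0
    bit = 0
    while v:
        out |= (v & 1) << bit
        v >>= 2
        bit += 1
    return out

def in_face_neighbors(pix, order):
    """Eight neighbors of pix; None where the neighbor lies in another base face."""
    nside = 1 << order
    npface = nside * nside
    face, local = divmod(pix, npface)
    ix = compact_bits(local)        # x bits sit in the even positions
    iy = compact_bits(local >> 1)   # y bits sit in the odd positions
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = ix + dx, iy + dy
            if 0 <= nx < nside and 0 <= ny < nside:
                result.append(face * npface + spread_bits(nx) + 2 * spread_bits(ny))
            else:
                result.append(None)  # crossing a face edge is not handled in this sketch
    return result

neighbors = in_face_neighbors(3, order=2)
```

A complete solution also has to map neighbors across the edges of the twelve base faces, which is the harder part of the border-block problem.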
Results & Performance Analysis
• Results
  Function            Table A         Data amount of A  Table B         Data amount of B  Time     Finished amount/sec
  PHIXmatch function  SDSS            100,106,811       2MASS           470,992,970       25 min   52,139
  GaoDan's function   Part of GSC2.3  295,832           Part of GSC2.3  295,832           5.6 min  880
• Conclusion
PHIXmatch markedly outperforms previous functions and is applicable to
large-scale cross-matching on multi-core systems
• Paper:
Qing Zhao, Jizhou Sun, Ce Yu, Chenzhou Cui, Liqiang Lv, and Jian Xiao, "A Paralleled
Large-Scale Astronomical Cross-Matching Function," International Conference on
Algorithms and Architectures for Parallel Processing (ICA3PP) 2009, LNCS 5574,
pp. 604-614
Adaptability Research on HTM-Indexed Data
• HTM: Hierarchical Triangular Mesh
• Resolved the border-data problem in HTM
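For comparison with the HEALPix numbering, HTM also uses a quadtree, but over spherical triangles ("trixels"): eight root triangles (S0 to S3, N0 to N3) are each split into four children per level. The sketch below encodes a trixel name as an integer following the commonly documented HTM convention (a leading 1 bit, one hemisphere bit, then two bits per level); whether the HTM version of the function uses exactly this encoding is an assumption, and the function names are hypothetical.

```python
def htm_name_to_id(name):
    """Convert a trixel name such as 'N012' to its integer ID."""
    if len(name) < 2 or name[0] not in "NS":
        raise ValueError("name must start with N or S followed by digits 0-3")
    # Leading 1 bit, then one bit for the hemisphere (N=1, S=0).
    htm_id = 0b10 | (1 if name[0] == "N" else 0)
    for digit in name[1:]:
        if digit not in "0123":
            raise ValueError("child digits must be 0, 1, 2 or 3")
        htm_id = (htm_id << 2) | int(digit)  # two bits per subdivision level
    return htm_id

def htm_level(htm_id):
    """Subdivision depth: 0 for the eight root trixels."""
    return (htm_id.bit_length() - 4) // 2

trixel_id = htm_name_to_id("N012")
```

As in the HEALPix case, a parent ID is obtained by shifting the child ID right by two bits.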
Results of HTM version Xmatch
• 42 min
• Why is the result poor compared with the HEALPix
version? Answer: the triangle shape!
New function based on Boundary
Growing Model
• Database reads are too time-consuming,
especially for border data!
MapReduce
• A software framework introduced by Google to support
distributed computing on large data sets on clusters of
computers.
– Huge datasets
– Distributable application
– Data stored either in a filesystem (unstructured) or within a
database (structured)
• Map step & Reduce step
– Map: The master node takes the input, chops it up into smaller
sub-problems, and distributes those to worker nodes. A worker
node may do this again in turn, leading to a multi-level tree
structure. The worker node processes that smaller problem,
and passes the answer back to its master node.
– Reduce: The master node then takes the answers to all the
sub-problems and combines them to produce the output: the
answer to the problem it was originally trying to solve.
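The two steps above can be sketched for cross-matching in plain Python, without any cluster framework. Everything here is an assumption made for illustration: the flat RA/Dec grid cell stands in for a real sky index such as a HEALPix pixel, the distance uses a small-angle flat-sky approximation, and pairs that straddle a cell border are simply missed (the border problem is not handled).

```python
import math
from collections import defaultdict

def cell_of(ra, dec, cell_deg):
    """Hypothetical partition key: a flat RA/Dec grid cell."""
    return (int(ra // cell_deg), int(dec // cell_deg))

def map_step(records, cell_deg):
    """Map: emit (cell key, record) so that nearby objects share a key."""
    for catalog, obj_id, ra, dec in records:
        yield cell_of(ra, dec, cell_deg), (catalog, obj_id, ra, dec)

def shuffle(pairs):
    """Shuffle/sort: group the mapped values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(values, radius_deg):
    """Reduce: match catalog A against catalog B inside one cell."""
    cat_a = [v for v in values if v[0] == "A"]
    cat_b = [v for v in values if v[0] == "B"]
    for _, id_a, ra_a, dec_a in cat_a:
        for _, id_b, ra_b, dec_b in cat_b:
            # Small-angle approximation; adequate only for tiny separations.
            dra = (ra_a - ra_b) * math.cos(math.radians(dec_a))
            d = math.hypot(dra, dec_a - dec_b)
            if d <= radius_deg:
                yield (id_a, id_b, d)

def cross_match(records, cell_deg, radius_deg):
    groups = shuffle(map_step(records, cell_deg))
    out = []
    for key in sorted(groups):
        out.extend(reduce_step(groups[key], radius_deg))
    return out

# Hypothetical records: (catalog, id, RA in degrees, Dec in degrees).
records = [("A", "a1", 10.2, 5.2), ("B", "b1", 10.2, 5.2005),
           ("A", "a2", 40.5, -3.5), ("B", "b2", 120.5, 30.5)]
matches = cross_match(records, cell_deg=1.0, radius_deg=0.001)
```

Each reduce call touches only one cell, so the calls are independent and can be spread across worker nodes.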
Map Step & Reduce Step
[Figure: MapReduce data flow: Input → Chop/replicate → Map → Shuffle/Sort → Reduce → Result]
Apache Hadoop
• A Java software framework inspired by
Google's MapReduce and Google File
System papers.
• What function does it perform?
– Easy programming, auto scheduling, error
detection & correction
• Who uses Hadoop?
– Yahoo! – web search; advertising businesses
– Amazon – S3, EC2
– IBM & Google – computation platform for universities
– Institute of Computing Technology, Chinese
Academy of Sciences – PBminer
• Scale figures (slide callout):
– Page links: 1 T
– Output: over 300 TB, compressed!
– Number of cores in a job: over 10,000
– Disk in the cluster: over 5 P
Hadoop Architecture
Why use MapReduce for Xmatch
• Near-linear speedup, compared with an MPI
cluster
• Suitable for data-intensive, compute-intensive applications, low-cost!
• Has been used in many data mining
applications; may be useful for more complex
cross-match functions.
Plan & Discussion
• Service for larger data sets (TB) and
various catalogs such as…
– Interfaces for more kinds of catalogs
– Additional measures to deal with TB-level
data
– Parallelizing other cross-match functions
Thank you!
We need your help!
Joint Laboratory for Astronomical Information Technology