O1.1 Rule-based Cross-matching of Very Large Catalogs

Rule-based Cross-matching of
Very Large Catalogs
Patrick Ogle and the NED Team
IPAC, California Institute of Technology
NASA Extragalactic Database (NED)
A fusion of multi-wavelength extragalactic data from journal articles and large catalogs
NED Holdings (October 2014)
2MASS PSC
And much more, including classifications, notes, images, spectra…
New Cross-matching Algorithm
• Very Large Catalogs
(VLCs, >10^7 sources)
• Find candidate matches in
NED
• Select best match
– Rule-based
– Statistical analysis
• Match data recorded in DB
• Reversible and iterable
GALEX ASC (NUV) vs. SDSS DR6 (gri, 6’x6’)
Cross-match Inputs
• VLC Source and NED Object Positions (RA, Dec, ±)
Source-Object Separation (s, ±σ)
• Source and Object Types
(galaxy, galaxy cluster, star, UV source, etc…)
• Background Object Density (measured for each source)
• Instrumental Beam Size
• Other: redshift, photometry, diameters
NED Pipeline for Very Large Catalogs
• Source Loader
– Load Very Large Catalog (VLC) source names and positions into NED.
• CSearch (PostgreSQL)
– Find match candidates with NED near position search
– Count background objects
– Spatial indexing will speed up search (e.g. Q3C, HTM)
• MatchExpert (python)
– Select best match from CSearch match candidates
– Object associations for no-matches
– Record match statistics for each match
– Match statistic distributions and integrals
– Code migration to DBMS for speed
• Object Loader (PostgreSQL)
– Create NED cross-IDs
– new objects
– associations
[Pipeline diagram: Source Loader, CSearch, MatchEx, Object Loader]
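The CSearch step can be illustrated with a brute-force sketch in Python, assuming small in-memory lists rather than the PostgreSQL tables and spatial indexes NED uses. The function names, the tuple layout, and the haversine formula are illustrative assumptions; the 7.5″ and 46.5″ defaults are the GALEX radii quoted on the Poisson Match Probability slide.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def csearch(source, objects, r_search=7.5, r_background=46.5):
    """Return (candidates, n_background) for one source.

    source: (ra_deg, dec_deg)
    objects: iterable of (name, ra_deg, dec_deg)
    candidates: (name, separation_arcsec) pairs within r_search, nearest first
    n_background: number of objects within r_background
    """
    candidates, n_background = [], 0
    for name, ra, dec in objects:
        sep = angular_sep_arcsec(source[0], source[1], ra, dec)
        if sep <= r_background:
            n_background += 1
        if sep <= r_search:
            candidates.append((name, sep))
    candidates.sort(key=lambda c: c[1])
    return candidates, n_background
```

A full scan like this costs one distance computation per source-object pair, which is why the slide points to spatial indexing (Q3C, HTM) for catalogs of this size.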
MatchEx Logic
[Flowchart labels. Input: Match List from CSearch. Thresholds: S<Scut, P>Pcut, Type Match, Name Prefix Match, S1/S2<0.33, Error Circles Overlap. Outcomes: Single Good Match, NED Cross-ID Match, NED dup., No Match, Create NED object and associations.]
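One way to read the flowchart labels is as a chain of threshold filters followed by an ambiguity test. The sketch below is an interpretation, not the NED implementation: the order of the tests, the hypothetical `Candidate` fields, the default `s_cut`, and the reading of S1/S2 as the ratio of the nearest to the second-nearest candidate separation are all assumptions. The name-prefix rule is left out because the slides do not define it.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    sep: float       # source-object separation S (arcsec)
    p_match: float   # Poisson statistic P for this candidate
    type_ok: bool    # True if source and object types are compatible

def select_match(candidates, s_cut=7.5, p_cut=0.90, ratio_cut=0.33):
    """Return the single good match, or None if no unambiguous match exists.

    Assumed rule order: S < s_cut, P > p_cut, type match, then S1/S2 < ratio_cut.
    """
    passing = sorted(
        (c for c in candidates
         if c.sep < s_cut and c.p_match > p_cut and c.type_ok),
        key=lambda c: c.sep,
    )
    if not passing:
        return None
    if len(passing) == 1:
        return passing[0]
    s1, s2 = passing[0].sep, passing[1].sep
    # Accept the nearest candidate only if it is much closer than the runner-up.
    if s2 > 0 and s1 / s2 < ratio_cut:
        return passing[0]
    return None
```

Returning None for ambiguous cases matches the stated preference for missing a match over making an incorrect one.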
Associations
• Where a match is not made to a nearby object, an
association record may be created.
• Association types:
– Source and object position error circles overlap ()
– Object is within the beam (PSF) of the source ()
[Flowchart labels: No Match; Error Circles Overlap: Create Error Overlap Association record; S<beam: Create In Beam Association record]
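The two association types can be sketched as a small classifier. This is a minimal illustration under stated assumptions: error circles are modeled as circles whose radii are the quoted position uncertainties, the overlap test is applied before the beam test, and the function name and return labels are hypothetical.

```python
def association_type(sep, src_err, obj_err, beam):
    """Classify an unmatched source-object pair; all arguments in arcsec.

    Returns "error_overlap", "in_beam", or None.
    """
    if sep < src_err + obj_err:   # position error circles overlap
        return "error_overlap"
    if sep < beam:                # object lies within the source beam (PSF)
        return "in_beam"
    return None
```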
Application to GALEX ASC Catalog
[Figure: GALEX ASC (NUV) vs. NED, with SDSS DR6 (g, r, i; 6’x6’). Labels: NED object, GALEX search region, Background region.]
• GALEX All-Sky Catalog
of ~40 million unique
NUV sources created by
M. Seibert (2012)
• Matched against ~180
million NED objects
(2013)
Poisson Match Probability
• Search radius: $r_s$ = 7.5″ for GALEX
• Background radius: $r_b$ = 46.5″ for GALEX
• Density of background NED objects: $n = N/(\pi r_b^2)$
• Expected number inside s: $\langle N_s \rangle = N (s/r_b)^2$, s = separation
• Poisson probability of x = k objects closer than s:
– $P_s(x=k) = \langle N_s \rangle^k \exp(-\langle N_s \rangle)/k!$
– For k=0, simplifies to: $P_s(x=0) = \exp(-\langle N_s \rangle) = \exp(-N (s/r_b)^2)$
• False-match probability: $P_f = 1 - P_s(0)$
Example: $N = 4$, $s/r_b = 0.08$, giving $P_s(0) = 0.975$ and $P_f = 0.025$
[Diagram labels: $r_b$, $s$, $r_s$]
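The formulas on this slide translate directly into a few lines of Python. The function names are illustrative, not taken from the NED code; the last lines reproduce the slide's worked example.

```python
import math

def expected_background(n_bg, sep, r_b):
    """<Ns> = N (s / rb)^2: expected background objects closer than sep."""
    return n_bg * (sep / r_b) ** 2

def poisson_prob(k, mean):
    """Ps(x=k) = <Ns>^k exp(-<Ns>) / k!"""
    return mean ** k * math.exp(-mean) / math.factorial(k)

def false_match_prob(n_bg, sep, r_b):
    """Pf = 1 - Ps(0)."""
    return 1.0 - poisson_prob(0, expected_background(n_bg, sep, r_b))

# Slide example: N = 4, s/rb = 0.08 (r_b set to 1 so that sep is the ratio)
p0 = poisson_prob(0, expected_background(4, 0.08, 1.0))
pf = false_match_prob(4, 0.08, 1.0)
print(f"Ps(0) = {p0:.3f}, Pf = {pf:.3f}")  # Ps(0) = 0.975, Pf = 0.025
```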
Optimizing Match Selection
• Optimize on 100K subsample in SDSS region
• False-positive rate decreases with increasing Poisson cutoff.
• False-negative rate increases with Poisson cutoff.
• Give 10x weight to false positives: it’s worse to make an incorrect match than to miss a match.
• Poisson cutoff value of 90% minimizes the combined, weighted error rate.
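The optimization described here amounts to a grid search over the cutoff with a weighted cost. In the sketch below, the cost form `fp_weight * fp + fn` is an assumed reading of "combined, weighted error rate", and the rates in the usage lines are made up for illustration; they are not the GALEX measurements.

```python
def best_cutoff(cutoffs, fp_rates, fn_rates, fp_weight=10.0):
    """Return (cutoff, cost) minimizing fp_weight * fp + fn over the grid."""
    costs = [fp_weight * fp + fn for fp, fn in zip(fp_rates, fn_rates)]
    i = min(range(len(costs)), key=costs.__getitem__)
    return cutoffs[i], costs[i]

# Made-up rates for illustration only
cutoffs = [0.50, 0.70, 0.90, 0.99]
fp_rates = [0.020, 0.010, 0.003, 0.002]
fn_rates = [0.005, 0.010, 0.025, 0.080]
cutoff, cost = best_cutoff(cutoffs, fp_rates, fn_rates)
print(cutoff, cost)
```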
GALEX ASC Match Results: Totals
• 39,570,031 input GALEX ASC UV sources
• NED (2013) contained ~180 million distinct objects
• 10,595,382 (26.8%) of the ASC sources matched NED
objects → Cross-IDs
• 28,974,649 (73.2%) are not matched → new NED objects
– 68.2% of GASC sources are in blank NED fields
– 5.0% have multiple match candidates
Image credit : GALEX
NASA/JPL-Caltech/SSC
GALEX ASC Match Results: Background
Rejection and False-Negative Rate
• Uncorrelated background
out to 15 arcsec fit by
straight line: dN/ds ~ s
• MatchEx is successful at filtering out this
background.
• False-negative rate fn = 2.4% estimated by
comparison to background-subtracted
match candidates (red line).
[Plot: x-axis Separation (arcsec); annotation: false negatives]
GALEX ASC Results: False Positive Rate
• The false-positive match rate is estimated by summing the
Poisson statistic (1-P) over all matches and dividing by the
total number of sources: fp = 0.25%
[Plot: y-axis Number]
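The estimate described on this slide can be written as a one-line function. This is a sketch: the function name is illustrative, and it assumes each accepted match carries its Poisson statistic P.

```python
def false_positive_rate(match_probs, n_sources):
    """Estimate fp as the sum of (1 - P) over accepted matches, per source."""
    return sum(1.0 - p for p in match_probs) / n_sources
```

For example, three matches with P = 0.99, 0.95, and 0.98 among 10 sources give fp = 0.08 / 10 = 0.8%.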
GALEX ASC Results: Position Error
Distribution
• The distribution of normalized separation r=s/σ
deviates from a Gaussian. The peak is at 0.9 instead of
1.0, and the tail is stronger.
[Plot: Number vs. r=s/σ, compared with the derivative of a Gaussian]
Important Lessons Learned:
1. Do not assume reported
catalog position errors are
correct.
2. Do not assume position error
distributions are Gaussian.
3. A 3.5σ threshold on match
separation rejected more
candidates than expected.
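For reference, if position errors were circular two-dimensional Gaussians, the normalized separation r = s/σ would follow a Rayleigh distribution, p(r) = r exp(-r²/2), which peaks at r = 1 and leaves a fraction exp(-r_cut²/2) beyond a cut at r_cut. The sketch below evaluates that baseline for the 3.5σ threshold. The Rayleigh form is a textbook assumption, not a fit to the GALEX data.

```python
import math

def rayleigh_pdf(r):
    """p(r) = r exp(-r^2 / 2) for normalized separation r = s / sigma."""
    return r * math.exp(-0.5 * r * r)

def rayleigh_tail(r_cut):
    """Fraction of true matches expected beyond r_cut."""
    return math.exp(-0.5 * r_cut * r_cut)

print(f"{rayleigh_tail(3.5):.4f}")  # 0.0022
```

Under this baseline only about 0.2% of true matches would fall outside 3.5σ, so a noticeably larger rejected fraction points to underestimated errors or non-Gaussian tails, as lessons 1 and 2 warn.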
Comparison to SDSS Photometry
• While no color criteria were used to select matches to
GALEX sources, the NUV-g colors of GALEX-SDSS matches
were checked:
Most matches have -7<NUV-g<7
• GALEX ASC range: 14<NUV<24
• Detection rate falls at NUV>21.7
Results by Object Type
• Object Types ordered by candidate match frequency
• Most GALEX sources matched to galaxies (G) and stars (*)
• QSO, Galactic star (!*), UV excess object (UvES), and WD* matches overrepresented,
as might be expected for a UV-selected catalog.
• Matches to RadioS, XrayS, GGroup, and GPair candidates were disallowed.
GALEX Photometry in NED
• GALEX ASC photometry added to NED spectral energy
distribution of 3C 382 (CGCG 173-014)
• Over 145 million GALEX ASC NUV and FUV photometry
records added to NED (2 extraction methods per band)
VLCs in NED, now and future


• GALEX ASC: ~40,000,000 UV sources loaded and matched (2013)
• GALEX MSC: ~22,000,000 UV sources loaded and matched (2014)
• Spitzer Source List: ~42,000,000 MIR sources (2014)
• 2MASS PSC: ~471,000,000 NIR sources loaded (2015 finish)
• AllWISE: ~748,000,000 MIR sources (2015 start)
• SDSS DR10: ~469,000,000 Vis sources (2015 start)
• SDSS DR6: ~154,000,000 Vis sources loaded and matched (out of 217M), excluding sources with undesirable flag values (2008)
NED aims to quadruple its object holdings in the next year!