
Organizing information interaction among
heterogeneous astronomical resources
for problem solving in a
virtual observatory
V.V. Vitkovskij, O.P. Zhelenkova (SAO RAS),
D.O. Briukhov, V.N. Zakharov, L.A. Kalinichenko (IPI RAS)
VAK-2004, Symposium 1
"Telescopes of the Future and Virtual Observatories"
"Astronomical data is ideal for use in the development of
this new type of science because it has no commercial
value or ethical constraints, there’s lots of it, it’s complex,
it’s heterogeneous and it’s real."
Jim Gray
Talk outline
Goals of the preliminary design project for the information infrastructure of the Russian Virtual Observatory (RVO)
Examples of problems solved with a VO
Methods for integrating heterogeneous sources
An example of a subject mediator
Possible architectural solutions
Organizational issues
Project goal

A preliminary design project supported by the RFBR (Russian Foundation for Basic Research).

Survey VO work worldwide; identify the classes of astrophysical
problems the project should target; identify the first-priority data
sources (archives) and software services to be included in the
RVO infrastructure.
Define the architecture and main components of the infrastructure,
their interfaces and technologies; define the conceptual schema of
subject mediators for solving the first-priority classes of problems.
The principal architectural decision of this VO information
infrastructure project is expected to be the use of subject mediator
technology.


Examples of digital observation archives
Archives of the Hubble Space Telescope, the Chandra X-Ray Observatory,
the 2-micron sky survey (the Two Micron All Sky Survey, 2MASS) and
the digitized Palomar survey (the Digitized Palomar All Sky).
Sloan Digital Sky Survey (SDSS): a sky survey (50% of the northern
hemisphere) in 5 spectral bands from the ultraviolet to the
infrared (http://www.sdss.org/).
The Strasbourg data centre CDS (http://cdsweb.u-strasbg.fr).
Archives holding astronomical observations of tens of millions of
astronomical objects, from both spectral and monitoring surveys,
are openly accessible on the web.
To solve the problems of integrated use of astronomical data, the
astronomical community is developing a new approach to working
with them: the creation of a virtual observatory (VO).
NVO: the project aims to bring together, for shared use, the existing
and planned US archives of observational data.
VO: General requirements



A Virtual Observatory (VO) is a collection of interoperating data archives and
software tools which utilize the internet to form a scientific research
environment in which astronomical research programs can be conducted. The
VO consists of a collection of data centres each with unique collections of
astronomical data, software systems and processing capabilities.
If large surveys and catalogues could be joined into a uniform and interoperating
"digital universe", entire new areas of astronomical research would become
feasible.
Astronomical data falls into two broad categories: catalog (hundreds of attributes
for billions of objects) and image (tens of TB of pixel data). The specific data
classes include source catalog, time series, event list, visibility data, image
(including the various image subclasses), and spectrum. A great many astronomical
queries have a spatial component, so a spatial indexing scheme is crucial for good
query execution performance.
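To make the point about spatial indexing concrete, here is a minimal sketch of one possible scheme: bucketing sources into declination zones so that a cone search scans only nearby zones. The slide does not name a particular scheme; the class name `ZoneIndex`, the zone height and the sample sources are all assumptions for illustration.

```python
import math
from collections import defaultdict


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


class ZoneIndex:
    """Hypothetical index: sources are bucketed by declination zone."""

    def __init__(self, zone_height_deg=1.0):
        self.zone_height_deg = zone_height_deg
        self.zones = defaultdict(list)

    def _zone(self, dec):
        return int(math.floor((dec + 90.0) / self.zone_height_deg))

    def add(self, name, ra, dec):
        self.zones[self._zone(dec)].append((name, ra, dec))

    def cone_search(self, ra0, dec0, radius_deg):
        """Return names of sources within radius_deg of (ra0, dec0)."""
        lo = self._zone(max(dec0 - radius_deg, -90.0))
        hi = self._zone(min(dec0 + radius_deg, 90.0))
        hits = []
        for zone in range(lo, hi + 1):  # only zones the cone can touch
            for name, ra, dec in self.zones.get(zone, ()):
                if angular_separation_deg(ra0, dec0, ra, dec) <= radius_deg:
                    hits.append(name)
        return hits


index = ZoneIndex()
index.add("A", 10.0, 20.0)
index.add("B", 10.1, 20.05)
index.add("C", 200.0, -40.0)
nearby = sorted(index.cone_search(10.0, 20.0, 0.5))
```

The index prunes only in declination; a production scheme would also prune in right ascension or use a hierarchical sky tessellation.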
IVOA

International Virtual Observatory Alliance (IVOA): established to
facilitate the international coordination and collaboration needed
to develop and deploy the tools, systems and organizational
structures that will make it possible to use astronomical archives
as an integrated, interoperable VO. IVOA participants:
AstroGrid: UK VO initiative www.astrogrid.org
Aus.VO: Australian Virtual Observatory
AVO: Astrophysical Virtual Observatory www.euro.vo.org
AVO: AVO Science Working Group
CDS: Centre de Donnees astronomiques de Strasbourg
China.VO: Chinese Virtual Observatory www.china.vo.org
CVO: Canadian Virtual Observatory
GAVO: German Virtual Observatory
GSC : UK Grid Steering Committee
India.VO: Indian Virtual Observatory
JVO: Japanese Virtual Observatory jvo.nao.ac.jp/index.e.html
NVO: National Virtual Observatory www.s.vo.org
RVO: Russian Virtual Observatory
IVOA plans to develop a Resource Registry. Metadata (FITS),
semantics (Unified Content Descriptors: a 4-level hierarchical
tree containing 1500 terms).
From Tera to Petabytes





Large Synoptic Survey Telescope (LSST) ranging from Earth's vicinity to the
edge of the optical universe.
It will reach 24th mag in 10 seconds, and will survey up to 14,000 square
degrees three times per month. Over a period of years, 30,000 square
degrees will be surveyed in multiple bands and the co-added images will
go to 27th magnitude.
High technology in microelectronics, large optics fabrication and metrology,
and software.
Comparing the LSST (8.4 m) telescope with the SDSS, and allowing also for
its increased pixel sampling and resolution, the advantage in figure of merit
is a factor of close to 200.
Data products will consist of photometric catalogs that are continuously
updated during the survey, a moving-object database, images in at least
5 bands (updated on a regular schedule), and the huge time-tagged
processed image database; the total will climb to around 15 petabytes.
Examples of problems
Subject Domain in Natural Science
Material System Def in NL Semantics
Domain Terminology and Concepts
(abstract, methodological, concrete)
Semantics of T1…Tn constituents
Interpretations
Theory (Model) 1. T1 Signature
(attributes, types, classes, processes)
T1 Measurable Characteristics
[simulators]
(attributes, types, classes, procs)
Concretization A of T1
Concretization B of T1
T2, … , Tn measurable characteristics
…
Simulation
Observations, simulations,
Explaining, forecasting
measurements for T1
Theories (Models)
T2, … , Tn
Observable/Measurable
Characteristics
Methods and Instruments for observation, experimentation, measurement,
data analysis, discovery
Problems, methods of solutions,
algorithms, programs, workflows
The problem of searching for distant objects
For a number of years SAO RAS, under the direction of Yu.N. Parijskij,
has been studying radio sources in the "Big Trio" program. Within
this program the problem of searching for distant galaxies is being
solved, and a methodology for such a search has been developed. It
includes the following stages:
Selection by radio properties
    angular size of the radio source
    morphology
    spectral index
Selection by optical properties
Selection by the presence of X-ray emission
Study of the environment of distant objects
NVO Astronomical Grid Applications

The Galaxy Morphology prototype is a highly-specialized analysis
service aimed at studying the morphological properties of galaxies in
rich clusters.

The Galaxy Morphology prototype needs to support the following
operations: find online catalogs of galaxies in clusters, obtain
images of the many hundreds of those galaxies, compute a set of
morphological parameters on those images using Grid
computing, and integrate the new results into the catalogs.

Correlation Functions of Galaxies: gravity naturally leads to a
highly clustered universe. Cosmologists have chosen to characterize
such clustering using n-point correlation functions. Precision
measurements of the higher-order correlations of galaxies are
now possible due to the availability of high quality data from the
SDSS survey.
VirtU - The Virtual Universe (UK)



VirtU is a computing infrastructure to enable direct and rigorous
comparisons of realistic simulations of cosmic structures, based on the
best current theoretical understanding, with real data.
The TVO is a completely novel concept. Member scientists, in close
collaboration with Europe’s Virtual Observatory programme, will build up
the infrastructure required to publish simulated data and analysis tools in
standardized formats. The best simulations will become readily
accessible to non-specialists, leading to entirely new science applications.
This model, now widely accepted as the standard cosmogony, is based on two
key assumptions: (i) that the Universe underwent an early period of
inflationary expansion during which its curvature was flattened and small
irregularities of quantum origin were imprinted and (ii) that these
irregularities grew into cosmological structures by gravitational evolution
driven by massive, weakly interacting elementary particles or cold dark
matter (CDM). This model agrees with the distribution of galaxies, as
mapped by the new generation of surveys
The relationship between the
TVO, TOI and AstroGrid
An example of an extragalactic
application of VirtU: TVO
An example of an extragalactic
application of VirtU: TOI
Requirements for scientific results publishing
To publish means to make data products in an archive available through services
that are accessible via a VO supplied internet site.






To allow independent checks of conclusions based on theoretical results,
including reproduction of certain results.
To allow comparisons with similar results/methodologies or with the
corresponding data by observers/theoreticians.
To make theoretical results more easily accessible and understandable
for observers.
Journals may require links to actual data products and/or software used
in published work.
To allow querying of publications and of real and simulated data products in a
uniform manner (joint queries on structured content items and on metadata,
covering both observations and publications).
Invariants for observable classes, observable classes as interpretations of
theories (models), triggers watching for inconsistencies between observations
and theoretical models.
Methods for integrating
heterogeneous sources



Virtual integration:
    The global schema is obtained by integrating a set of collection
    schemas fixed in advance (Global as View, GAV)
    The global schema is defined independently of the collections, as
    a subject domain schema (Local as View, LAV)
Materialization of integrated data
(data warehouses)
Combined methods (GLAV, partial
materialization)
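The difference between the two virtual integration styles can be sketched roughly as follows. All names here (`NVSS_ROWS`, `RC_ROWS`, `LavRegistry`, the field names) are hypothetical, and the LAV part is deliberately simplified: real LAV systems rewrite queries using the view definitions, whereas this sketch only shows that sources are described against an independently defined global schema and can be registered one by one.

```python
# Two hypothetical source collections with different local schemas.
NVSS_ROWS = [{"ra": 10.0, "dec": 5.0, "flux_mjy": 120.0}]
RC_ROWS = [{"ra": 11.0, "dec": 6.0, "flux_jy": 0.3}]


def radio_science_data_gav():
    """GAV: the global class is written as a view over a fixed set of sources.

    Adding a third source means editing this definition.
    """
    for row in NVSS_ROWS:
        yield {"ra": row["ra"], "de": row["dec"], "flux": row["flux_mjy"] / 1000.0}
    for row in RC_ROWS:
        yield {"ra": row["ra"], "de": row["dec"], "flux": row["flux_jy"]}


class LavRegistry:
    """LAV (simplified): the global schema exists first; each source is
    described in its terms and registered independently of the others."""

    def __init__(self, global_classes):
        self.views = {name: [] for name in global_classes}

    def register(self, global_class, rows, to_global):
        self.views[global_class].append((rows, to_global))

    def scan(self, global_class):
        for rows, to_global in self.views[global_class]:
            for row in rows:
                yield to_global(row)


registry = LavRegistry(["radioScienceData"])
registry.register(
    "radioScienceData", NVSS_ROWS,
    lambda r: {"ra": r["ra"], "de": r["dec"], "flux": r["flux_mjy"] / 1000.0})
registry.register(
    "radioScienceData", RC_ROWS,
    lambda r: {"ra": r["ra"], "de": r["dec"], "flux": r["flux_jy"]})

gav_rows = list(radio_science_data_gav())
lav_rows = list(registry.scan("radioScienceData"))
```

Both routes yield the same integrated rows here; the design difference is where a new source has to be declared.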
SkyQuery A Distributed Web-based
Query Service for Astronomy


SkyQuery provides a user-friendly interface to run distributed queries over
the federation of registered astronomical archives. SkyQuery will not
only provide location transparency, but will also take care of vertical
fragmentation of the data and will run the query efficiently to minimize
query execution costs.
Briefly, the technologies used are:




DATABASES: In principle, any database can be used. For this service, we will
use SQL Server 2000. Each database will be accessible through a .NET web
service (hereafter SkyNode)
PORTAL: The portal is another C# .NET web service that executes a
distributed query by splitting the job up between the SkyNode web services.
CLIENT: The client is an ASP web page.
It is planned to have data covering most of the sky in over 10 different
wavelengths. The astronomy data set today consists of over 50 surveys
of the sky, with a total data volume of 100TB.
Distributed Query Service


The OGSA-DAI (Distributed Access and Integration) Distributed Query
Processor (DQP) involves a single query referencing data held at multiple
sites. DQP requires the Grid’s capabilities for systematic access to remote
data and computational resources. DQP extends the core of OGSA-DAI
by defining a new portType, Grid Distributed Query (GDQ), and two new
services: Grid Distributed Query Service (GDQS) and Grid Query
Evaluator Service (GQES).
Query processing in DQP consists of the following five stages:






Logical optimisation. (rewriting in GAV)
Physical optimisation
Partitioning. (adding move operator)
Scheduling. Partitions are allocated to Grid nodes
Query Evaluation
Parallel DB machine is used.
Subject Mediator Concept
The mediator architecture (Wiederhold, 1992) deals with the
problem of integration of heterogeneous information. The
sources are "heterogeneous" on many levels.
A mediator provides a uniform query interface to the
multiple data sources, thereby freeing the user from having to
locate the relevant sources, query each one in isolation, and
manually combine the information from the different sources.
Mediator Definition as a Subject
Metainformation Consolidation
For the mediator's scalability, two separate phases of its
functioning are distinguished: consolidation and operational.
• During the consolidation phase, the efforts of the scientific community
are focused on defining the mediator's subject by declaring its
metainformation. The metainformation created in this phase
constitutes the definition of the mediator's subject domain.
• During the operational phase, arbitrary information collections can
be registered at the mediator, expressed in terms of the mediator.
Registration is autonomous and can be done by collection providers
independently of each other. Users of the mediator know only the
metainformation defining the mediator’s subject and formulate
their queries in terms of that subject.
Advantages of subject domain
mediation
1. Semantic integration of heterogeneous information collections
is achieved.
2. Users need to know only the subject definitions as defined by a
community.
3. Information providers can disseminate their information for
integration independently of each other and at any time.
4. Autonomous information collections are fully independent
of the mediator and its consolidated metainformation
definitions.
5. Users have integrated access to all information registered up to
the moment of a query.
6. Mediators form a recursive structure. Multiple subjects can be
semantically integrated by defining higher-level mediators.
Example description of a subject
mediator for the class of problems of
searching for distant objects
Mediator schema
[UML class diagram; types and attributes as labeled]
«type» Component
    -name : string
    -description : string
«type» CoordEQJ
    -ra : real
    -de : real
    -raError : real
    -deError : real
«type» SpRangeFreq
    -freqValue : real
«type» FluxJy
    -fluxValue : real
    -error : real
AstrObj
    -name : string
    -las : real
ScienceData
    -origin : string
    -las : real
    +match(in osd : ScienceData) : boolean
Association roles: spatialCoord (to CoordEQJ), spectralRange (to SpRangeFreq),
flux (to FluxJy), observes / observedAs (between ScienceData and AstrObj),
main, parts
Subtypes of AstrObj:
Galaxy
    -galaxyType : enum
    -morphologyType : enum
Star
    -starType : enum
    -spClass : enum
Nebula
    -nebulaType : enum
Subtypes of ScienceData:
OpticalScienceData
    -redShift : real
    -filter : enum
IRScienceData
    -filter : enum
RadioScienceData
    -spIndex : real
Examples of descriptors of the basic
concepts
UCD POS_EQ_RA_MAIN represents: Right Ascension
UCD POS_EQ_DEC_MAIN represents: Declination
UCD ERROR represents: Error or Uncertainty in Measurements
UCD ID_MAIN represents: Main Identifier of a Celestial Object
UCD CODE_MULT_INDEX represents: Multiplicity Index Code
UCD EXTENSION_DIAM represents: Angular Diameter or Size of the
Major Axis
UCD CLASS_OBJECT represents: Object Type Classification
UCD MORPH_TYPE represents: Morphological Type
UCD PHOT_FLUX represents: Flux
UCD OBS_FREQUENCY represents: Frequency of the observation
UCD SPECT_SP-INDEX represents: Spectral Index = -d(Log F)/d(Log nu)
UCD REDSHIFT_HC represents: Redshift (normally heliocentric)
Some collections selected for
registration at the mediator

rcCatalog(rSource/RCdata[spatialCoord, flux, origin, spIndex]) ⊆
radioScienceData (rsd/RadioScienceData[spatialCoord, flux, origin, spIndex])

nvss(nvssSource/NVSSdata[spatialCoord, flux, origin]) ⊆ radioScienceData
(rsd/RadioScienceData[spatialCoord, flux, origin])
Coordinates, which the RC catalog and NVSS give as strings in a
specific format, are converted to degrees; flux values are
converted from mJy to Jy; the frequency value is taken from the column names.

2mass(2massSource/2MASSdata[spatialCoord, flux, origin]) ⊆
irScienceData(irs/IRScienceData[spatialCoord, flux, origin])
The errors for this catalog are given in its description and have to be entered
at registration; the magnitude values of the objects must be converted
to the flux scale adopted in the mediator.
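The conversions an adapter would perform at registration can be sketched as below. This is an illustration under assumptions: the exact string formats of the RC catalog and NVSS coordinates are not given on the slide, so a whitespace- or colon-separated sexagesimal form is assumed, and the photometric zero point is left as a parameter because it depends on the band and must be taken from the survey documentation. The function names are hypothetical.

```python
def sexagesimal_to_degrees(text, is_hours=False):
    """Convert 'hh mm ss' or 'dd mm ss' (assumed format) to decimal degrees.

    Set is_hours=True for right ascension given in hours.
    """
    s = text.strip()
    sign = -1.0 if s.startswith("-") else 1.0
    parts = [float(p) for p in s.lstrip("+-").replace(":", " ").split()]
    parts += [0.0] * (3 - len(parts))  # tolerate missing minutes/seconds
    value = parts[0] + parts[1] / 60.0 + parts[2] / 3600.0
    return sign * value * (15.0 if is_hours else 1.0)


def mjy_to_jy(flux_mjy):
    """Convert a flux density from millijansky to jansky."""
    return flux_mjy / 1000.0


def magnitude_to_jy(mag, zero_point_jy):
    """Convert a magnitude to flux density, given the band's zero point in Jy.

    zero_point_jy is an input, not a constant supplied by this sketch.
    """
    return zero_point_jy * 10.0 ** (-0.4 * mag)


ra_deg = sexagesimal_to_degrees("12 30 00", is_hours=True)
dec_deg = sexagesimal_to_degrees("-05 30 00")
flux_jy = mjy_to_jy(1500.0)
```

Taking the sign from the string rather than from the degrees field keeps declinations such as "-00 30 00" negative.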
Example OQL query for searching for
radio sources
Select the coordinates and fluxes of radio sources whose spectral
index lies in a given range of values, whose fluxes do not exceed a
given value, and whose linear sizes do not exceed a given value.
The spectral index is computed by the function calcIndex over the
subset of objects of type RadioScienceData in the class
radioScienceData that have coinciding coordinates.
select scoord, sf
from
(select scoord: s.spatialCoord, sf: s.flux, spind: calcIndex(RA, DE, partition)
from (select r
from radioScienceData r
where r.flux.fluxValue < value3
and r.las < value4) as s
group_by RA: s.spatialCoord.ra
DE: s.spatialCoord.de)
where between (spind, value1, value2)
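What the query computes can be sketched in Python as follows. The slide names calcIndex but does not show its body, so the least-squares fit below is only one plausible implementation, using the definition Spectral Index = -d(Log F)/d(Log nu) from the UCD list. The dictionary keys (`ra`, `de`, `freq`, `flux`, `las`) are assumed stand-ins for the mediator attributes.

```python
import math
from collections import defaultdict


def calc_index(points):
    """Spectral index alpha = -d(log F)/d(log nu) from (freq, flux) pairs.

    Fitted by least squares in log-log space; returns None when fewer
    than two distinct frequencies are available.
    """
    xs = [math.log10(freq) for freq, _ in points]
    ys = [math.log10(flux) for _, flux in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return -slope


def select_sources(rows, alpha_min, alpha_max, max_flux, max_las):
    """Mirror of the query: filter, group by coordinates, then filter on index."""
    groups = defaultdict(list)
    for r in rows:
        if r["flux"] < max_flux and r["las"] < max_las:
            groups[(r["ra"], r["de"])].append((r["freq"], r["flux"]))
    selected = []
    for coord, points in groups.items():
        alpha = calc_index(points)
        if alpha is not None and alpha_min <= alpha <= alpha_max:
            selected.append((coord, [flux for _, flux in points]))
    return selected


rows = [
    {"ra": 10.0, "de": 20.0, "freq": 1000.0, "flux": 1.0, "las": 5.0},
    {"ra": 10.0, "de": 20.0, "freq": 4000.0, "flux": 0.25, "las": 5.0},
    {"ra": 30.0, "de": -5.0, "freq": 1000.0, "flux": 0.5, "las": 5.0},
    {"ra": 30.0, "de": -5.0, "freq": 4000.0, "flux": 0.5, "las": 5.0},
]
steep = select_sources(rows, 0.9, 1.1, max_flux=2.0, max_las=10.0)
```

Grouping on exactly equal coordinates follows the query's group_by; measurements with slightly different positions would first need a positional match.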
Identification of radio sources
Identify radio sources whose spectral index lies in a given range
of values, whose fluxes do not exceed a given value, and whose
linear sizes do not exceed a given value, with optical sources
whose object is a galaxy, and return those galaxies.
select o.observes
from radioScienceData r, opticalScienceData o
where between (r.spIndex, value1, value2)
and r.flux.fluxValue < value3
and r.las < value4 and match(r, o) and o.observes in galaxy
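The slides do not define the semantics of match. A minimal sketch, assuming that two observations match when their positions agree within a multiple of the combined coordinate errors (the raError and deError attributes of CoordEQJ), might look like this. The `n_sigma` parameter, the string-valued `observes` field and the sample data are assumptions; the filters on spectral index, flux and size from the query are omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class CoordEQJ:
    ra: float       # degrees
    de: float       # degrees
    raError: float  # degrees
    deError: float  # degrees


@dataclass
class ScienceData:
    spatialCoord: CoordEQJ
    observes: str = ""  # hypothetical stand-in for the observed AstrObj class


def match(r, o, n_sigma=3.0):
    """Assumed rule: offsets must lie within n_sigma combined errors."""
    a, b = r.spatialCoord, o.spatialCoord
    d_ra = (a.ra - b.ra + 180.0) % 360.0 - 180.0  # handle wrap at 0/360
    d_ra *= math.cos(math.radians((a.de + b.de) / 2.0))
    d_de = a.de - b.de
    tol_ra = n_sigma * math.hypot(a.raError, b.raError)
    tol_de = n_sigma * math.hypot(a.deError, b.deError)
    return abs(d_ra) <= tol_ra and abs(d_de) <= tol_de


def identify(radio, optical):
    """Return optical observations of galaxies that match a radio source."""
    return [o for r in radio for o in optical
            if match(r, o) and o.observes == "galaxy"]


radio = [ScienceData(CoordEQJ(150.0, 2.0, 0.001, 0.001))]
optical = [
    ScienceData(CoordEQJ(150.0005, 2.0005, 0.0005, 0.0005), "galaxy"),
    ScienceData(CoordEQJ(150.0002, 2.0001, 0.0005, 0.0005), "star"),
    ScienceData(CoordEQJ(151.0, 2.0, 0.0005, 0.0005), "galaxy"),
]
found = identify(radio, optical)
```

The nested loop compares every pair; for large catalogs a spatial index would be needed to limit the candidates.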
Possible architectural
solutions
Components of the RVO infrastructure

The main components of the VO architecture are:
the mediator's metainformation repository;
facilities supporting the registration of information
sources at the mediator;
facilities for compiling mediator queries and planning their
concurrent execution over multiple sources;
a database management system (object-relational)
used to compute the answer to a query;
an environment for uniform access to data sources and
services (the grid);
a problem-solving environment, including workflow
management and knowledge extraction facilities;
adapters for connecting specific information sources
to the mediator, and their interfaces;
digital library support facilities;
portals through which the various categories of users
interact with the VO.
Subject mediator facilities
[Figure: subject mediator architecture. Labels: Client; soap; OGSA-DAI + GT; Mediator (Query Processing, Collection Registration, Metainformation Management); Oracle 10g; jdbc; Data Repository; Metainformation Repository; Collection Adapters (soap, collection protocol) with Collections; Tool Adapter (soap, collection protocol) with Software Tools]
Components of the overall infrastructure
[Figure labels: Portal; Web Browser; Web Pages; Application Server (Java Servlets, EJB / WS); Problem Solving Environment; Mediator; Adapter; DL Open Source]
Mediators in OGSA-DAI
Data mining (knowledge extraction)
as part of the PSE
Two basic classes of models: predictive and descriptive.
Predictive: one of the observational features is chosen
as the target. The model provides a way of calculating the target as a function
of the rest of the features: Y = F(X1, … , Xn). There are two approaches:
classification (predicts the class to which an object may belong, with a certain
probability) and regression (predicts a value of the target).
Descriptive: a) clustering by certain similarity criteria (in contrast with
classification, the features and classes of the partitioning
are unknown); b) associative models (looking for stable associations, e.g.,
diapers and beer).
For each model many algorithms exist (classification and regression decision
trees, genetic algorithms, neural networks, discriminant analysis, etc.).
Technology of data mining: 1) problem statement, 2) data preparation, 3)
model development and choice of algorithm, 4) evaluation and
interpretation. Not all models allow interpretation (e.g., neural networks),
but models expressed as rules can be interpreted.
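As a small illustration of the predictive case Y = F(X1, … , Xn), the sketch below uses a nearest-centroid classifier. This technique is chosen here only for brevity and is not one of the algorithms named on the slides; the two features and the labels are invented sample data.

```python
def fit_centroids(samples):
    """Average the feature vectors of each class.

    samples: list of (features, label) pairs with equal-length features.
    """
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((c - x) ** 2
                              for c, x in zip(centroids[label], features)),
    )


# Hypothetical training set: two observational features per object.
training = [
    ([0.1, 0.2], "star"),
    ([0.0, 0.1], "star"),
    ([1.0, 1.2], "galaxy"),
    ([0.9, 1.0], "galaxy"),
]
centroids = fit_centroids(training)
label = predict(centroids, [0.95, 1.1])
```

Unlike the probabilistic classifiers mentioned above, this sketch returns only a label, with no class probability.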
Data Mining (2)
DARWIN (Thinking Machines Corp.) was acquired by Oracle in 1999.
The first release of Oracle DM appeared in 2001 (Oracle 9i). DM is
incorporated directly into the Oracle DB. Algorithms are implemented as
stored procedures. Parallel computation is used where possible. Windows is
not a proper environment. A dedicated repository contains information on
models, their applications and results. DM4J is a graphical data mining client.
Oracle provides a DM infrastructure, not DM instrument facilities. This
allows DM to be incorporated into applications; the DM infrastructure
provides a way of solving application problems.
Two kinds of interfaces: Java API and PL/SQL. JDM is a new standard
under development. PL/SQL packages: DBMS_DATA_MINING,
DBMS_MINING_TRANSFORMATION.
Predictive algorithms: classification (Naïve Bayes, Adaptive Bayes, Support
Vector Machines (SVM)), regression, and searching for essential attributes
(in effect creating new concepts; for example, decomposing an (animal × properties)
matrix into two matrices whose product yields the original one).
Descriptive algorithms: enhanced K-means, O-cluster, association search
(Apriori algorithm).
Unstructured data analysis (texts, bioinformation, maps, schemas, etc.).
Organizational issues
The RVO community
The RVO community (issues in forming a community of
scientists involved in creating the RVO and using it
in research).
RVO and education
RVO and international cooperation
The RVO project in organizational terms (structure,
management, funding, working groups,
symposia). Sustainable development.
IVOA Working Groups








Resource Registry
Data Modeling
Content Description (UCD)
Data Access Layer
VOTable
VO Query Language
Grid & Web Services
Standards & Processes
Interest groups




VO Architecture
VO Applications
VO Theory
GGF Astro-RG
IVOA Documents







UCD (Unified Content Descriptor) IVOA Working Draft 2004-04-26
Metadata Content within VO Resources, Version 1.9.9b IVOA
Working Draft, 2003-10-21
A unified domain model for astronomy, for use in the
Virtual Observatory, Version 0.9 IVOA Working Draft,
2003-11-04
Observation Coverage and Space-Time Coordinates IVOA DM
WG Internal Note, 2004-03-31
Data Model for Quantity IVOA DM WG Internal Draft, 2004-03-01
Data Model for Observation, Version 0.2 IVOA DM WG
Internal Draft, 2004-03-01
IVOA Astronomical Data Query Language, Version 0.7.1
IVOA Working Draft, 2004-01-23
IVOA Documents






IVOA SkyNode Interface, Version 0.7 IVOA Working Draft
2004-01-23
IVOA Data Access Layer (DAL) Work Package, July 2003
Resource Metadata for the Virtual Observatory Version 0.8
IVOA Working Draft 2003-07-09
IVOA Document Standards Version 0.1 IVOA Working Draft
2003-07-09
IVOA: Theory in the VO, 2004-02-26
Astro-RG: Proposed Global Grid Forum Research Group
Charter, The Astronomical Grid Community, 2003-07-23