Technology and Infrastructure Support for Large Scale Information

Marcio Faerman
The Brazilian National Education and Research Network - RNP
[email protected]
www.rnp.br
Generating Large Data Collections
• Large Data Volumes can be generated much faster than they can be analyzed
– Instrument Observations
• Particle Accelerators (CERN LHC)
• Telescopes, Satellites
• Sensor Networks
• Virtual Observatories
– Large Model Simulations
• High resolution, Very complex
• Scientific Experiments
– medical imaging (fMRI): ~ 1 GByte per measurement (day)
– Bio-informatics queries: 500 GByte per database
– Satellite world imagery: ~ 5 TByte/year
– Current particle physics: 1 PByte per year
– LHC physics (2007): 10-30 PByte per year
– LSST Astronomy (2012): 5 PBytes per year
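To put the yearly volumes above in perspective, the following minimal sketch converts them into the average network rate needed just to move the data as fast as it is produced. It assumes decimal units (1 PByte = 10**15 bytes) and a 365-day year; the function name `sustained_gbps` is illustrative and not part of the original slides.

```python
# Assumptions (not from the slides): decimal units and a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600
TBYTE = 10**12
PBYTE = 10**15


def sustained_gbps(bytes_per_year: float) -> float:
    """Average rate in Gbit/s needed to move a yearly volume as it is produced."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR / 1e9


# Yearly volumes quoted on the slide.
yearly_volumes = {
    "Satellite world imagery (~5 TByte/year)": 5 * TBYTE,
    "Current particle physics (1 PByte/year)": 1 * PBYTE,
    "LHC physics, low estimate (10 PByte/year)": 10 * PBYTE,
    "LHC physics, high estimate (30 PByte/year)": 30 * PBYTE,
    "LSST astronomy (5 PByte/year)": 5 * PBYTE,
}

for name, volume in yearly_volumes.items():
    print(f"{name}: {sustained_gbps(volume):.3f} Gbit/s sustained")
```

Under these assumptions, 1 PByte per year is about 0.25 Gbit/s and 30 PByte per year about 7.6 Gbit/s, sustained around the clock and before any re-reading for analysis.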
Challenges
Managing Large Volume Data
• Scalability
– What works for small datasets does not necessarily work for large collections
• Data Integrity
– At terabyte scale, failures and data corruption are very likely to occur
– Is data provenance reliable?
• Efficiency
– Data should be accessed at a rate which keeps work feasible
– More data – need for more speed
• Distributed Access
– Data can be at a remote (and possibly unknown) location
• Infrastructure Management
– Heterogeneous
– Distributed
– Prone to failures
– Very Complex
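One common way to address the data integrity challenge above is to record a checksum when a file is ingested and recompute it later. The sketch below assumes SHA-256 and reads the file in chunks so that large files do not need to fit in memory; the helper names `file_fingerprint` and `is_intact` are hypothetical and do not come from any of the tools named in this talk.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex):
    """True if the file still matches the fingerprint recorded earlier."""
    return file_fingerprint(path) == expected_hex


# Small demonstration on a temporary file standing in for a real dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"raw instrument data")
    demo_path = tmp.name

recorded = file_fingerprint(demo_path)       # stored at ingestion time
print(is_intact(demo_path, recorded))        # file unchanged

with open(demo_path, "ab") as handle:        # simulate silent corruption
    handle.write(b"\x00")
print(is_intact(demo_path, recorded))        # mismatch is detected

os.remove(demo_path)
```

A checksum only detects corruption; recovering from it still requires a replica or backup of the data.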
Challenges – Getting to Know your Data
• Extract knowledge from raw data files
– Data product derivation
• Visualization
• Relationships
• Patterns
• New derived quantities
– Cross-institutional and cross-disciplinary collaborations
• What-if experiments
– Your data with our model?
• Dataset Access
– Multiple formats
• Each sensor, simulation has its own storage format
– Federated collections
– Discovery by content
Technological Response
• Integration of compute, communication, storage and instrument resources into a powerful infrastructure – Information Grids
– Very powerful infrastructure
– Economy of scale
• Serves broad range of customers
– biologists, physicists, government, industry
• Infrastructure is heterogeneous, distributed, very complex
• Middleware and Data Oriented tools act as facilitators to tackle data management complexities
Open Access and Preservation Functionalities
• Federated Digital Libraries
– Integration of distributed repositories
– Access control – can decide who can see it
– Organize the data in collections
– Describe your data – Metadata
• Data Grids
– Access to efficient parallel I/O systems
– Hierarchical Systems
• Disk caches, tapes
• Often Distributed
– Analysis, Data Mining
– Visualization
– Workflow based systems
– Transaction based data ingestion
• Data provenance, Data fingerprinting
– What-if virtual lab
• End User Oriented Portals
– "I deal with the data in the way it makes sense to me"
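As a rough illustration of how the federated library functions listed above fit together (distributed repositories, collections, metadata, access control), here is a minimal in-memory sketch. Everything in it is hypothetical: the `Dataset` fields, the `discover` function, and the sample entries are invented for illustration and do not reflect the API of any real data grid system.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    repository: str                                # which federated repository holds it
    collection: str                                # logical grouping of related data
    metadata: dict = field(default_factory=dict)   # descriptive attributes
    readers: set = field(default_factory=set)      # empty set means public


def discover(catalog, user, **criteria):
    """Return datasets visible to `user` whose metadata matches every criterion."""
    results = []
    for ds in catalog:
        visible = not ds.readers or user in ds.readers
        matches = all(ds.metadata.get(k) == v for k, v in criteria.items())
        if visible and matches:
            results.append(ds)
    return results


# Hypothetical catalog spanning two repositories.
catalog = [
    Dataset("scan-001", "repo-a", "imaging", {"instrument": "fMRI"}, {"alice"}),
    Dataset("tile-042", "repo-b", "imagery", {"instrument": "satellite"}),
    Dataset("scan-002", "repo-b", "imaging", {"instrument": "fMRI"}),
]

# "bob" sees only the public fMRI dataset; "alice" also sees the restricted one.
print([d.name for d in discover(catalog, "bob", instrument="fMRI")])
print([d.name for d in discover(catalog, "alice", instrument="fMRI")])
```

The point of the sketch is that the user queries by what the data describes, not by where it is stored; the repository field is only consulted once a match is found.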
Middlewares and Tools
• Data Management
– Storage Resource Broker (SRB)
– Globus Data Management
– L-Store
– IBP
– Storage Resource Manager (SRM)
• Data Representation Libraries
– HDF5
– NetCDF
• Portals
– OGCE
– JSR 168
Today’s Reality
• Exceptional achievements by early adopters
• Integration between domain scientists – data users and producers – is still a challenge
– Need much more cross-disciplinary interaction
• Emphasis on scale and performance
• Failures are still a taboo
– Frustration factor should be addressed in partnership with users
– Focus on failure recovery and quality of service is getting more attention
Grid Initiatives around the World
e-Infrastructure Workshop, NUDI/USP, São Paulo, 07.05.2007
[Figure: grid initiatives labeled on the slide: UNAM, OurGrid, EELA, SINAPAD, SPRACE, HEPGrid, Ringrid, CL Grid, UCRAV]
Networking in Latin America
[Figure: national research and education networks labeled on the slide: CUDI-MX, REACCIUN-VE, RAAP-PE, RNP-BR, REUNA-CL]
Brazilian National Research And Education Network - RNP
• In November 2005 the RNP networking infrastructure was entirely renovated. It consists of:
– A multigigabit core connecting 10 capitals at 2.5 and 10 Gbps
– Connections at 34 Mbps to 11 capitals
– Connections up to 16 Mbps to 6 capitals
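To relate these link speeds to the data volumes discussed earlier, the sketch below estimates how long it would take to move 1 TByte over each class of connection. It is an idealized calculation: it assumes decimal units, a dedicated link, and no protocol overhead unless the hypothetical `efficiency` parameter is lowered, so real transfers would take longer.

```python
TBYTE = 10**12  # decimal terabyte, an assumption of this sketch


def transfer_hours(size_bytes, link_mbps, efficiency=1.0):
    """Hours needed to move `size_bytes` over a link of `link_mbps` Mbit/s.

    `efficiency` is the fraction of nominal capacity actually achieved
    (1.0 means an ideal, fully dedicated link).
    """
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


# Link classes from the slide, expressed in Mbit/s.
links = {"16 Mbps": 16, "34 Mbps": 34, "2.5 Gbps": 2500, "10 Gbps": 10000}

for name, mbps in links.items():
    print(f"1 TByte over {name}: {transfer_hours(TBYTE, mbps):.2f} hours")
```

Under these assumptions, 1 TByte takes roughly 139 hours at 16 Mbps but under 15 minutes at 10 Gbps, which is why the capacity of the access link matters as much as the core.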
Community Metropolitan Networks
• It is not enough to bring high speed connectivity to each city – it is necessary to bring it to the university campus / research lab as well.
• The metropolitan network is the solution
– Infrastructure sharing to support:
• Interconnection of the campuses of each partner institution
• Access to RNP national network backbone
– This sharing substantially reduces deployment costs
– Preferably, the infrastructure will be owned by the partners themselves (reducing operating costs)
• Pilot: The Metrobel project in the city of Belém do Pará in the Amazon region
Infrastructure for e-Science
Metrobel – Belém Metropolitan Network
Redecomep Project (2005-7)
• Following Metrobel, the Brazilian Ministry of Science and Technology is supporting the Community Networks for Education and Research (Redecomep) Project, with R$ 39.7 M (~ US$ 19.0 M) through Finep (Dec 2004)
• Goals:
– Extend the metropolitan optical network to 26 other cities with RNP points of presence
– Promote integration within each metropolitan area
– High speed access to RNP point of presence
Next steps
• Integration between network, data repositories, compute, storage resources and applications
– Identify who needs better connectivity
– Developing Brazilian cyberinfrastructure
– Generally uncoordinated funding for infrastructure resources
– Funding agencies and partners need a broad view of application requirements and cyberinfrastructure integration
• RNP is working with scientific communities and infrastructure providers to articulate an e-Science/Infrastructure initiative in Brazil
JRU-Brazil: 22 members in EELA-2
#   STATE  INSTITUTION                    E-SCIENCE COMMUNITIES
1   SP     CCE / USP                      (e-INFRASTRUCTURE only)
2   RJ     CEFET-RJ                       e-GOVERNMENT, E-INDUSTRY
3   RJ     FCM / UERJ                     BIOMED
4   RJ     FIOCRUZ                        BIOMED, e-EDUCATION
5   SP     IAG / USP                      CLIMATE
6   RJ     IME                            BIOMED
7   SP     INCOR / USP                    BIOMED
8   SP     INPE                           CLIMATE
9   RJ     LNCC                           BIOMED
10  RJ     ON                             PHYSICS
11  BR     RNP (NREN)                     (e-INFRASTRUCTURE only)
12  SP     SPRACE / UNESP                 PHYSICS
13  PB     UFCG                           CLIMATE, EARTH-SCIENCE
14  RJ     UFF                            (e-INFRASTRUCTURE only)
15  MG     UFJF                           BIOMED
16  MS     UFMS                           BIOMED
17  RS     UFRGS                          CLIMATE
18  RJ     UFRJ (coordinator for EELA-2)  BIOMED, PHYSICS, e-EDUCATION, CLIMATE
19  RS     UFSM                           CLIMATE
20  DF     UnB                            BIOMED
21  RJ     UNILASALLE                     e-EDUCATION
22  SP     UNISANTOS                      BIOMED, E-LEARNING, e-GOVERNMENT
Developing Together
• Information infrastructure is being redefined in Brazil and Latin America
• Now is the time to have as much cross-disciplinary interaction as possible to define needs, partnerships and investments
• Please contact us
THANK YOU!