ORACLE & VLDB
Download
Report
Transcript ORACLE & VLDB
ORACLE & VLDB
Nilo Segura
IT/DB - CERN
VLDB
• The real world is in the Tb range (British
Telecom - 80Tb using Sun+Oracle)
• Data consolidated from different sources to
build Data Warehouses and using Data
mining techniques to extract useful
information
• Data is always READ only
• Physics data has similar characteristics
VLDB
• Current size limits : Solaris 64bits + Oracle
64bits = 4Pb per database
• This is more or less 19240 SunStoredge
A1000 units (216Gb per unit today)
• The current technology does not allow us to
store on-line all this data in a manageable
way
VLDB
• A typical LHC experiment will get several
Petabytes of raw data
• But the end user ought not to have access to
all this.
• We need to process/group/summarize it
following certain criterias
• This also means more disk space (if we
want to keep everything on-line)
VLDB
• Not to mention backup…we need to keep
our data safe (have you devised your
backup strategy?)
• RAID technology to help us to increase the
performance and availability
• RAID 0+1(best) or RAID 5(cheaper)
• Today, we should have raw data on tapes
and reconstructed data on-line
VLDB
• Now some software tricks to deal with all
this amount of data
–
–
–
–
–
Partitioning
Parallel DML
Materialized Views
Parallel Server (Cluster configuration)
Bitmapped indexes?
VLDB
• Partitioning
• A large amount of data can be divided into
physically independent structures according to a
certain criteria
• However the user continues to see the same logical
structure
• Partition keys can be defined by range or using a
hash function
• The system can discard partitions based on the
user’s queries, reducing the amount of data to be
processed
VLDB
• Parallel DML
– A single select/insert/delete/update can be executed by
several processes (slaves) coordinated by a master
process
– Naturally leads to a better use of SMP machines
– The degree of parallelism can be set by the user of
automatically by the system (testing is a must)
– Parallel insert/update/delete does need partitioning
– You need to plan carefully your I/O system
VLDB
• Materialized views
– Normal views are just a name for a SQL query
with no real data associated (until runtime)
– This can be very costly if we run it regularly
– MV is just a view with all the data that satisfy
the query already there (like in a normal table)
– It can be refreshed (manually or automatically)
to reflect the dynamic nature of the view
• Parallel Server
– Several nodes are attacking the same database that is
on a disk system shared by all the nodes in the cluster
– Users can be assigned to different nodes (load
balancing)
– Intra parallelism – queries are executed across the
different nodes
– At CERN there are 3 Sun Clusters (2 for DB, 1 for
Web) and 1 Compaq
– There is no such thing for Linux (yet)
Others
• Another point is how to distribute data
amongst the different Institutes
• Network , Tapes + Post …
• It would be nice to have a plug-in plug-out
mechanism in the database
• This is called transportable tablespaces
• We may also use the database replication
option but….
Conclusion
• Do not despair by the size of problem
• We are trying to solve tomorrow’s problem using
today’s technology
• So keep an eye open and be VERY flexible in your
model to be able to adapt quickly and painlessly
• Do never declare you model frozen, it is a
mistake, try to improve it, adopt new technologies
(if it is a benefit)