Panel discussion summary

Panel Summary
Andrew Hanushevsky
Stanford Linear Accelerator Center
Stanford University
XLDB
23-October-07
State in High Energy Physics
A lot of data

15 PB/Year for LHC



Typically, write once data
Applications are CPU bound
A lot of institutes must be involved

Increase total resources
Necessity forces a Hybrid Model (RDBMS + Files)

Performance impact of consistency is high
Not required for LHC
Wide range of applications, DB expertise, environments

LHC Issues
Power and Cooling
Cheap hardware for scaling
Reliability problems
Patching issues
Distributed Deployment Issues
Needed to develop in-house tools
Multi-dimensional search requirements
Usually the reason for using “files” for data
LHC Questions
Database as a transactional system, efficient query engine, highly available storage?

Can one product do all of this?
Multi-Mode Storage
How do you measure scaling?

Size? Transactions/Second? Etc.
Shared everything or shared nothing architectures?
State in Astronomy (LSST)
A lot of data

Trillions or more of rows


14PB by 2024
Only data about the image

Actual images (write once) much larger!
Data is distributed

Telescope and archive physically separate
Time for database technology to catch up (12 years)

Some proprietary systems handle even more data today
Reliability and security requirements are loose

Can absorb some data loss, 98% uptime, public data
However, must be able to ingest the data

Telescope keeps going
Issues in LSST
Easy Scaling

Add resources on the fly
Dependable software sources

This is a long term project
Data has some unique needs


Distributed mining capabilities
Varied database data types



Not available today except in OO databases
Relaxed consistency requirements
Fault tolerant software not hardware
Human scaling must be low
Scientific Panel I
40% Pure Database

Otherwise 20-30% in DB, rest in files
Majority in the petabyte range

Everyone in the 10-100 TB range
Majority use commercial products
Though open source DBs are rampant
Few (at XL scale today) use homegrown systems
Sometimes driven by need, sometimes by legacy
Scientific Panel II
Wide range of user analytic needs
DBs have limited “express-ability”
Unlikely there is a common set of operators

Common Data Processing Model
Write once read many
But a lot of meta-data updates
Amenable to data parallelism
Approximate results are acceptable to 1st order
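
The write-once, data-parallel model above can be illustrated with a small sketch. It assumes the data is already split into immutable partitions (plain Python lists here) and that the query is a mean, which can be combined from per-partition summaries. The function names are hypothetical and not taken from any panelist's system. Threads are used only to keep the example self-contained; a real deployment would more likely spread partitions across processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(partition):
    """Summarize one immutable partition as (sum, count)."""
    return (sum(partition), len(partition))


def parallel_mean(partitions, max_workers=4):
    """Combine independent per-partition summaries into a global mean."""
    # Each partition is read-only, so workers need no locking or coordination.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("no data")
    return total / count


# Hypothetical example data: three partitions of a write-once dataset.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(parallel_mean(partitions))  # 3.5
```

Because the partitions never change after they are written, the per-partition step can run anywhere and in any order; only the small (sum, count) pairs need to be brought together.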

Scientific Panel III
Wish List
Approximate queries
Full spatial queries
Multiple availability levels



Mixture of real-time, interactive, background uses
The rest is yes
Scaling, performance, maintainability, etc.
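
One way to read the "approximate queries" wish is as sampling-based estimation: answer from a random subset and report an error bar instead of scanning everything. The sketch below is only an illustration of that idea under stated assumptions: the data fits in a Python list, the query is a mean, and the function name, sample size, and seed are hypothetical. The panel did not specify a technique.

```python
import math
import random


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, standard_error). The standard error ignores the
    finite-population correction, so it is slightly conservative.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(values))
    if n < 2:
        raise ValueError("need at least two sampled values")
    sample = rng.sample(values, n)  # sampling without replacement
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical data: 100,000 values whose true mean is 49.5.
data = [float(i % 100) for i in range(100_000)]
estimate, error = approximate_mean(data, sample_size=1_000)
print(f"mean is about {estimate:.2f} +/- {error:.2f} (from 1% of the rows)")
```

The trade-off is the one the panel describes: the answer is acceptable "to 1st order" while touching only a small fraction of the data.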
Industry Panel I
Primarily traditional DB use



Standard scaling techniques
Disallow certain types of queries
Availability is a must


Money and survivability are the issue
90% non-transactional queries
Wide range of sizes: several TB to several PB




1 Billion rows/hour ingest peak
Trillions of rows
25TB/Day is not unusual
Millions of queries a day
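
To put the ingest figures above in per-second terms, the following sketch converts them. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and that the daily volume is spread evenly over 24 hours, neither of which the panel stated. The function names are hypothetical.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def rows_per_second(rows_per_hour):
    """Convert an hourly row rate to rows per second."""
    return rows_per_hour / SECONDS_PER_HOUR


def megabytes_per_second(terabytes_per_day):
    """Convert TB/day to MB/s, assuming decimal units and an even daily load."""
    bytes_per_day = terabytes_per_day * 10**12
    return bytes_per_day / SECONDS_PER_DAY / 10**6


print(round(rows_per_second(1_000_000_000)))  # 277778 rows/s at the ingest peak
print(round(megabytes_per_second(25), 1))     # 289.4 MB/s sustained
```

So the quoted peak is roughly 280,000 rows per second, and 25 TB/day is close to 290 MB/s sustained around the clock.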
Industry Panel II
Some homegrown solutions

Depending on how it is used
Problem is I/O throughput

Minimize use of indexes


Some specialized systems used to increase performance
Dirty reads common
Transactional latency is a problem
Industry Panel III
Varied use patterns (business model driven)




Non-indexed data for mining purposes
Parallel Load and Query
Real time queries (currency is a must)
Designing for the unknown query
Customization motivation varies



Join inefficiency
Limited SQL expressiveness
Lack of sufficient parallelism
Common Industry/Science Issues
Performance issues


I/O throughput, transactional latency, etc.
Lack of effective parallelism
Usability

SQL expressiveness
Licensing

Industry more constrained but cost is an issue
Human power


Labor is the dominant cost
DBA costs are high and must be reduced
Final Perceptions
Science/Industry operate roughly on same scale

Size and throughput
Science & Industry “business models” differ

Drives each community in a different direction
Science is a long-term affair
Industry must be reactive

Discussion Points
What drives feature sets?

General feeling that scaling features are missing



Is it the architecture (e.g., Relational vs other)?
Is it the business model?
Something else?
What feature sets do you think are important?


Performance, Scalability, Usability, Reliability?
Do you see it as a tradeoff?
Open Software Presence


A question of customization possibilities or simply cost?
Is it considered a threat to your business model?
Is it time to rethink the nature and placement of databases?