Panel Summary
Andrew Hanushevsky
Stanford Linear Accelerator Center
Stanford University
XLDB
23-October-07
State in High Energy Physics
A lot of data
15 PB/Year for LHC
Typically write-once data
Applications are CPU bound
A lot of institutes must be involved
Increase total resources
Necessity forces a Hybrid Model (RDBMS + Files)
Performance impact of consistency is high
Not required for LHC
Wide range of applications, DB expertise, environments
LHC Issues
Power and Cooling
Cheap hardware for scaling
Reliability problems
Patching issues
Distributed Deployment Issues
Needed to develop in-house tools
Multi-dimensional search requirements
Usually the reason for using “files” for data
LHC Questions
Database as a transactional system, efficient query engine, highly available storage?
Can one product do all of this?
Multi-Mode Storage
How do you measure scaling?
Size? Transactions/Second? Etc.
Shared everything or shared nothing architectures?
State in Astronomy (LSST)
A lot of data
Trillions or more of rows
14PB by 2024
Only data about the image
Actual images (write once) much larger!
Data is distributed
Telescope and archive physically separate
Time for database technology to catch up (12 years)
Some proprietary systems handle even more data today
Reliability and security requirements are loose
Some data loss can be absorbed, uptime of 98% is enough, data is public
However, must be able to ingest the data
Telescope keeps going
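To put the 98% uptime figure in concrete terms, the short calculation below converts an uptime fraction into a yearly downtime budget. It is only an illustrative sketch: the function name is hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(uptime_fraction: float) -> float:
    """Hours per year a service may be unavailable at the given uptime fraction."""
    return (1.0 - uptime_fraction) * HOURS_PER_YEAR


hours = downtime_hours_per_year(0.98)
print(f"{hours:.1f} hours/year, about {hours / 24:.1f} days")
```

Under these assumptions the budget is roughly a week of downtime per year, during which ingest still has to keep up because the telescope keeps going.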
Issues in LSST
Easy Scaling
Add resources on the fly
Dependable software sources
This is a long term project
Data has some unique needs
Distributed mining capabilities
Varied database data types
Not available today except in OO databases
Relaxed consistency requirements
Fault-tolerant software, not hardware
Human scaling must be low
Scientific Panel I
40% Pure Database
Otherwise 20-30% in DB, rest in files
Majority in the petabyte range
Everyone in the 10-100 TB range
Majority use commercial products
Though open source DBs are rampant
Few (in XL scale today) use homegrown systems
Sometimes driven by need, sometimes by legacy
Scientific Panel II
Wide range of user analytic needs
DBs have limited “express-ability”
Unlikely there is a common set of operators
Common Data Processing Model
Write once read many
But a lot of meta-data updates
Amenable to data parallelism
Approximate results are acceptable to 1st order
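As an illustration of how approximate results and data parallelism fit together, the sketch below estimates a mean from a uniform random sample drawn independently from each partition. This is one possible technique, not something the panel prescribed; the function name, the synthetic data, and the 1% sample fraction are all assumptions made for the example.

```python
import random
import statistics


def approximate_mean(partitions, sample_fraction, seed=0):
    """Estimate the mean over all partitions from a uniform random sample.

    Each row is kept independently with probability sample_fraction, so
    partitions could be sampled in parallel; here they are processed
    sequentially for simplicity.
    """
    rng = random.Random(seed)
    sample = []
    for part in partitions:
        sample.extend(x for x in part if rng.random() < sample_fraction)
    if not sample:
        raise ValueError("sample is empty; increase sample_fraction")
    return statistics.fmean(sample)


# Hypothetical data: 4 partitions of 50,000 synthetic measurements each.
data_rng = random.Random(42)
partitions = [
    [data_rng.gauss(100.0, 15.0) for _ in range(50_000)] for _ in range(4)
]

exact = statistics.fmean([x for part in partitions for x in part])
approx = approximate_mean(partitions, sample_fraction=0.01)
print(f"exact={exact:.2f} approx={approx:.2f}")
```

The estimate reads only about 1% of the rows, which is the trade the slide points at: a first-order answer for a fraction of the I/O.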
Scientific Panel III
Wish List
Approximate queries
Full spatial queries
Multiple availability levels
Mixture of real-time, interactive, background uses
The rest is yes
Scaling, performance, maintainability, etc.
Industry Panel I
Primarily traditional DB use
Standard scaling techniques
Disallow certain types of queries
Availability is a must
Money and survivability are the issue
90% non-transactional queries
Wide range of sizes: several TB to several PB
1 Billion rows/hour ingest peak
Trillions of rows
25 TB/day is not unusual
Millions of queries a day
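For a sense of scale, the figures above can be converted to per-second rates. The calculation below is only a back-of-the-envelope sketch; it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and treats the 25 TB/day figure as an average spread evenly over the day.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Peak ingest: 1 billion rows per hour.
rows_per_second = 1_000_000_000 / SECONDS_PER_HOUR

# Daily volume: 25 TB per day, assuming 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
mb_per_second = 25 * 10**12 / SECONDS_PER_DAY / 10**6

print(f"{rows_per_second:,.0f} rows/s at peak")
print(f"{mb_per_second:,.0f} MB/s averaged over a day")
```

Under these assumptions the peak ingest is about 278 thousand rows per second, and 25 TB/day corresponds to a sustained rate of about 289 MB/s.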
Industry Panel II
Some homegrown solutions
Depending on how it is used
Problem is I/O throughput
Minimize use of indexes
Some specialized systems used to increase performance
Dirty reads common
Transactional latency is a problem
Industry Panel III
Varied use patterns (business model driven)
Non-indexed data for mining purposes
Parallel Load and Query
Real time queries (currency is a must)
Designing for the unknown query
Customization motivation varies
Join inefficiency
Limited SQL expressiveness
Lack of sufficient parallelism
Common Industry/Science Issues
Performance issues
I/O throughput, transactional latency, etc.
Lack of effective parallelism
Usability
SQL expressiveness
Licensing
Industry more constrained but cost is an issue
Human power
Labor is the dominant cost
DBA costs are high and must be reduced
Final Perceptions
Science/Industry operate roughly on same scale
Size and throughput
Science & Industry “business models” differ
Drives each community in a different direction
Science is a long-term affair
Industry must be reactive
Discussion Points
What drives feature sets?
General feeling that scaling features are missing
Is it the architecture (e.g., Relational vs other)?
Is it the business model?
Something else?
What feature sets do you think are important?
Performance, Scalability, Usability, Reliability?
Do you see it as a tradeoff?
Open Software Presence
A question of customization possibilities or simply cost?
Is it considered a threat to your business model?
Is it time to rethink the nature and placement of databases?