NoCOUG Persistance of Memory Issues
Download
Report
Transcript NoCOUG Persistance of Memory Issues
Root Cause and Other DBA
Urban Legends
Brian Hitchcock
OCP 10g DBA
Sun Microsystems
[email protected]
[email protected]
www.brianhitchcock.net
SunFed DBA
Brian Hitchcock October 19, 2006
Page 1
For DBA Issues
I'm told that I must find root cause,
–
Can't resolve the issue without root cause
Must have testing environment that is
–
–
Similar enough to production to recreate the issue
And the fix
Need extensive expertise to solve performance issues
–
–
–
–
–
–
10046 trace
Wait events
Extents size/number
Physical spindles
Rebuilding indexes
Undocumented init parameters
SunFed DBA
Brian Hitchcock October 19, 2006
Page 2
What is My Experience
Centralized DBA support team
2000+ databases
Database users open cases with DBA team
Cases randomly assigned to DBAs
–
–
Not assigned based on experience or expertise
My cases are probably typical of all the cases
SunFed DBA
Brian Hitchcock October 19, 2006
Page 3
My Experience
Finding root cause costs money
Building and maintaining test system(s) costs money
Test system
–
–
–
Needs to be able to recreate production issue
Database, network, apps and web servers, load balancers
Users around the world
Root cause and test system(s) are only worthwhile
–
If having them is less expensive than not having them.
Perhaps these are Urban Legends?
–
–
–
Root cause
Test system(s)
Specific expertise required
SunFed DBA
Brian Hitchcock October 19, 2006
Page 4
My Experiences
Review major cases I worked on over the last 2 years
–
Cases are presented in chronological order, oldest first
For each case
–
–
–
–
What worked, i.e. what was the real-world solution?
Was root cause identified?
Was a test system available to verify root cause
and the solution?
What extensive expertise was required?
SunFed DBA
Brian Hitchcock October 19, 2006
Page 5
Keeping Score
For each case, look for 3 things
–
–
–
Root cause
Test system(s)
Specific expertise
Listed earlier (10046, wait events, extents, etc.)
–
Were any of the following useful in resolving the issue?
Open Mind (anything can and will happen)
Communication (it's difficult)
Simple solutions
- Any active db users?
- Reboot one or more components?
SunFed DBA
Brian Hitchcock October 19, 2006
Page 6
Case 1 – CRM Local Language
Moved to UTF8, add local language columns
SQL ran slower
10046 trace, explain plan
Found the single step slowing execution
Existing index wasn’t updated to have new
local language column
Recreated index
–
Performance returned to normal
SunFed DBA
Brian Hitchcock October 19, 2006
Page 7
Case 1 – Three Things
Root cause
–
–
Existing indexes not optimal for new schema
Corrected indexes fixed the problem
Test system identified the issue
Specific Expertise
–
–
–
10046
Explain plan
Indexes
SunFed DBA
Brian Hitchcock October 19, 2006
Page 8
Case 2 – Storefront Slow
App support team reports database slow
Check database, 1-3 active users at most
Active users gone in 1-2 seconds
App users report 30 second response time
Have App support person use app
–
Watch database
See single active user connect, complete, disconnect
–
2 seconds maximum
Remainder of response time
–
Web servers, load balancers etc.
Application server was locked up – reboot fixed it
SunFed DBA
Brian Hitchcock October 19, 2006
Page 9
Case 2 – Three Things
Root cause
–
Not identified
Test Environment
–
–
–
Exists, but not even close to production
Complex production application environment
Accurate test environment
Expensive to build and maintain
–
–
Hard to find qualified testers
No one completely understands how production was built
Impact of layoffs, outsourcing
Hard to recreate what you don’t understand
No specific expertise required
SunFed DBA
Brian Hitchcock October 19, 2006
Page 10
Case 3 – Contracts Slow
App users report message
–
‘Problem contacting the database: No available resource…’
Sar shows CPU 90% idle
Statspack snapshot for last hour shows top wait
events
–
–
70% CPU time
20% db file read
App server rebooted, performance returns to normal
SunFed DBA
Brian Hitchcock October 19, 2006
Page 11
Case 3 – Three Things
Root cause unknown
–
How to determine what caused the app server to hang?
Test system does not have all the components of prod
–
–
Don’t know why production locked up
Tough to reproduce in test system
Simple solutions – no specific expertise required
–
–
–
–
Reboot app server
Low-risk, cheap, quick
No special experience required
If it doesn’t work, time to put more resources into the issue
SunFed DBA
Brian Hitchcock October 19, 2006
Page 12
Case 4 – Storefront
Users report app hanging
Few or no active users in database
Active users taking much longer than normal
Isolate single SQL statement that is running slow
Explain plans show good/bad execution plans
–
Changes from good to bad at random times
Various attempts to isolate give conflicting results
–
We assume that the problem is stable, it isn’t
10053 trace, watch optimizer choose execution plan
Optimizer changes from good to bad plan
–
one column of one table has 2 versus 3 distinct values
SunFed DBA
Brian Hitchcock October 19, 2006
Page 13
Case 4 – Three Things
Root cause
–
not sure…
App developers gone (outsourced)
–
–
–
–
–
No one knows how code really works
Why are these values appearing and disappearing?
App is inserting/deleting the rows with the 3rd distinct value
This change causes optimizer to choose bad plan
Bug or feature?
Fix?
–
Create rows so column always has 3 distinct values
SunFed DBA
Brian Hitchcock October 19, 2006
Page 14
Case 4 – Three Things (cont’d)
No test system
–
–
Test system has small percentage of production data
App system too complex to reproduce
Multiple strings, load balancers, app servers
Hardware, support, upgrades, patches
Difficult to get testing resources – people are expensive
–
–
Changes are tested for functionality
Changes aren’t tested for performance
Until released in production
Specific expertise
–
–
Explain plan
10053 trace
SunFed DBA
Brian Hitchcock October 19, 2006
Page 15
Case 5 – Customer Demo
Users report slow performance
Database shows 4 inactive sessions
Inactive sessions holding locks
Kill these 4 sessions
–
Performance is fine
SunFed DBA
Brian Hitchcock October 19, 2006
Page 16
Case 5 – Three Things
Root cause – unknown
–
–
–
–
Why some sessions inactive and holding locks?
App developers gone
Easier to kill sessions once in a while
Real root cause would be expensive to find
Test system
–
–
Don’t have test system – app developed on the cheap
Don’t know how app works
No specific expertise required
–
Kill database sessions
SunFed DBA
Brian Hitchcock October 19, 2006
Page 17
Case 6 – Executive Dashboard
User reports database is sorting dates incorrectly
–
Phone on mute – laugh
Users have really good drugs today!
Connect to db
–
Verify that indeed dates are being sorted backwards
From NLS experience, look at actual bytes of dates
–
–
–
–
There are extra bytes that I can’t explain
Go back to the docs for DATE type
Expand DATE format to include all the possible fields
Suddenly it all makes sense!
SunFed DBA
Brian Hitchcock October 19, 2006
Page 18
Case 6 – What’s the Story?
The dates are being correctly sorted – because
–
They are from BC!
OK, now what?
–
–
This is sales data from Fred Flintstone?
Is Barney setting up his own IT department?
Check the basics
–
–
App code uses 10g JDBC
User says this was a requirement
But the database is 9i – it just gets better and better
User finds Metalink note
–
–
10g JDBC issues with 9i database
Doesn’t describe our issue, but it’s a start…
SunFed DBA
Brian Hitchcock October 19, 2006
Page 19
Case 6 – What to Do?
Setup a 10g database and test the app code against it
–
–
–
No test system, in fact, no other system of any kind
Critical executive reporting system, can’t be down for long
Very limited disk space
Why not upgrade existing 9i database?
–
–
Upgrades can cause problems
If this isn’t a 10g to 9i issue, why risk the db upgrade?
Install 10g db, full export 9i db, import into 10g db
User tests app code against 10g database
–
No more dates from BC!
Remove 9i database, expand 10g database to match
User happy
–
Executive’s reports don’t show sales to the Flintstones
SunFed DBA
Brian Hitchcock October 19, 2006
Page 20
Case 6 – Three Things
Root cause
–
Not clear
Tried 10g database and issue went away
Did we really identify the root cause?
Was it a feature of 10g JDBC? A bug?
Test system
–
None
Expertise required
–
–
Minimal configuration control would have prevented this
Hardest part was accepting what Oracle was telling us
Dates from BC
–
Not at all the way things are supposed to be
SunFed DBA
Brian Hitchcock October 19, 2006
Page 21
Case 7 – Customer Demo
Users calls, application is slow
Blocking processes, restart database twice in one day
–
App still slow
Ask user to connect and start using app
–
–
–
–
–
I watch database for active users
Over 20 seconds before new db user appears
Db user is done and gone in less than a second
User reports 30 seconds before results appear in browser
I tried ping between the db server and the app server
200ms with packet loss
Ask network group to investigate
- Network switch in data center has failed
SunFed DBA
Brian Hitchcock October 19, 2006
Page 22
Case 7 – Three Things
Root cause
–
–
Network switch – no question
This was not a database problem
But we could have wasted a lot of time with SQL tracing etc.
Need to confirm that the database is the problem
–
Then work the issue as a db tuning issue
No test system available
–
How would you reproduce the data center network?
Expertise required
–
–
Look for active users in database
ping between db and app server
SunFed DBA
Brian Hitchcock October 19, 2006
Page 23
Case 8 – Data Warehouse
User reports application slow
–
–
User can’t truncate a table – command hangs
Loading data into warehouse is also hanging
Watch database while user starts load process
During load
–
–
User is using third-party app to do the load
No active users performing inserts
When no load is happening
–
Truncate table runs quickly
User contacts app vendor
–
–
Vendor “changes some parameters for the load process”
Data load runs normally, truncate table runs normally
SunFed DBA
Brian Hitchcock October 19, 2006
Page 24
Case 8 – Three Things
Root cause
–
–
Looking at the db showed no active inserts during data load
Needed to verify that db wasn’t the issue
Force vendor to perform
Test system
–
–
This was a dev database so there was a ‘test’ system
Data load issues identified, resolved in dev environment
Expertise required
–
–
–
Quickly check for database problems
Communicate with user to understand what is happening
Politics – dealing with vendor
Vendor really, really wants it to be a database problem
SunFed DBA
Brian Hitchcock October 19, 2006
Page 25
Case 9 – Software Registry
User trying to truncate table
–
Part of a data feed process
Other users in database
–
Performing deletes on same table
Blocking truncate
Kill delete processes
–
Truncate still blocked
Watch database processes
–
Delete process starts every hour on the hour
Cron job starts delete hourly
–
–
Part of data feed process
Why don’t the app owners know this?
SunFed DBA
Brian Hitchcock October 19, 2006
Page 26
Case 9 – Three Things
Root cause
–
–
Unknown
Why was cron in place that user didn’t know about?
Test system – exists
–
But doesn’t have the production data feeds
Expertise required
–
–
–
None – basic DBA skills
Not a database performance issue
Basic app configuration was the issue
SunFed DBA
Brian Hitchcock October 19, 2006
Page 27
Case 10 – CRM Reporting
User reports app is too slow
Specific selects are taking 10-15 seconds
STATSPACK snapshots are being taken
Check snapshots over same time for last week
Performance of the top SQL hasn’t changed
User agrees that performance hasn’t changed
Resources not available to make performance better
When is a performance problem not a problem?
–
When you don’t want to pay to fix it
SunFed DBA
Brian Hitchcock October 19, 2006
Page 28
Case 10 – Three Things
Root cause
–
User perception
Test system
–
None
Expertise required
–
–
Document that performance hasn’t changed
Offer to work the issue if user will get funding
SunFed DBA
Brian Hitchcock October 19, 2006
Page 29
Case 11 – Configurator
Users getting error
–
Can’t allocate memory in shared pool
What is causing this?
Spend some time looking at the database
–
Looks normal
Reboot database to clear shared pool
Watch database after restart
–
Shared pool doesn’t fill up
SunFed DBA
Brian Hitchcock October 19, 2006
Page 30
Case 11 – Three Things
Root cause
–
Don’t know why shared pool fills up
Test system
–
How to recreate problem in test system?
Expertise needed
–
–
Basic DBA skills
Flush shared pool
Problem didn’t reoccur
–
One-time or infrequent set of circumstances
SunFed DBA
Brian Hitchcock October 19, 2006
Page 31
Case 12 – Pricing Database
User reports slow performance
Check database server
–
–
–
–
CPU and iowait very high
Watch SQL executing
Explain plan for worst SQL shows full table scans
No indexes on tables being scanned
Test server
–
Same tables do have indexes
While recreating indexes
–
All the needed indexes reappear
Cron job for pricing data runs every two weeks
–
Rebuilds indexes and loads data
SunFed DBA
Brian Hitchcock October 19, 2006
Page 32
Case 12 – Three Things
Root cause
–
What caused indexes to be dropped?
Unknown
Test system
–
–
Used to verify that indexes were missing
Can’t reproduce whatever dropped the indexes
Expertise required
–
Basic DBA skills
SunFed DBA
Brian Hitchcock October 19, 2006
Page 33
Case 13 – Contracts Database
User reports error when selecting from table
–
Invalid row id
Some queries on same table don’t report error
Looking at table and row ids
–
–
Some queries use index
No errors
Other queries use table
Specific rows have invalid row ids
Rebuild table
–
Create table as select * from <table>
SunFed DBA
Brian Hitchcock October 19, 2006
Page 34
Case 13 – Three Things
Root cause
–
Why did row ids become invalid?
Unknown
Test system
–
Exists, can’t reproduce cause of invalid row ids
Expertise required
–
Basic DBA skills
SunFed DBA
Brian Hitchcock October 19, 2006
Page 35
Case 14 – Software Download
User reports application slow in Dev environment
–
–
SQL so slow, application server times out
Sar shows CPU near 0% idle
User executes problem SQL
–
–
–
Watch database
SQL completes, very slowly, database is working
Explain plan shows full table scans
Look at tables involved
–
–
Production – last analyzed NULL, indexes, runs fast
Test – recently analyzed, indexes, runs slow
Drop stats, rebuild indexes, reanalyze
–
Performance returns to normal
SunFed DBA
Brian Hitchcock October 19, 2006
Page 36
Case 14 – Three Things
Root cause
–
–
What happened to indexes, statistics in production?
Unknown
Test system
–
–
–
Exists
Doesn’t match production – which is correct?
Can’t reproduce dropped indexes and/or statistics
Expertise required
–
Basic DBA skills
SunFed DBA
Brian Hitchcock October 19, 2006
Page 37
Case 15 – BRIO
User reports SQL failing with error
–
Can’t allocate memory in shared pool
Watch database
–
–
Same SQL runs most of the time
Error occurs once in a while
Solutions
–
–
Reduce sort area size
Free up more memory for shared pool
Increase physical memory assigned to database
Error occurred so infrequently
–
Users decided not to change system
SunFed DBA
Brian Hitchcock October 19, 2006
Page 38
Case 15 – Three Things
Root cause
–
–
Unknown
Just too many users for a brief time?
Test system
–
Yes, but how to reproduce this issue (user load)?
Expertise required
–
Basic DBA skills
SunFed DBA
Brian Hitchcock October 19, 2006
Page 39
Case 16 – Revenue App
Automated alert
–
–
Report log switching hung
Database stopped for transactions
Examine database
–
Can’t find anything wrong
Restart database
–
–
–
Log switch works
No further alerts
No user problems
SunFed DBA
Brian Hitchcock October 19, 2006
Page 40
Case 16 – Three Things
Root Cause
–
Unknown
Test system
–
Exists, but no help
Expertise required
–
Basic DBA skills
SunFed DBA
Brian Hitchcock October 19, 2006
Page 41
Scoreboard
Item
Case
root cause test system 10046 wait event extents spindles indexes init.ora open mind
1 Siebel
xxxx
xxxx
xxxx
xxxx
xxxx
2 Storefront
xxxx
3 Contracts
xxxx
4 Storefront
xxxx
5 Cust Demo
xxxx
6 Dashboard
xxxx
7 Cust Demo
xxxx
xxxx
8 Warehouse
xxxx
xxxx
9 Registry
xxxx
10 CRM
xxxx
11 Configurator
xxxx
12 Pricing
xxxx
xxxx
13 Contracts
xxxx
14 SW Download
xxxx
15 BRIO
xxxx
16 PANDORA
xxxx
communication
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
xxxx
SunFed DBA
Brian Hitchcock October 19, 2006
Page 42
Observations
How many times did we
–
–
–
Identify root cause?
Have a complete test system?
Need specific expertise?
Un-scientific results
–
–
–
Root cause – 2 out of 16 = 13%
Test system – 3 out of 16 = 19%
Specific expertise – indexes 1/16 = 6%
Simple solutions worked 94% of the time
SunFed DBA
Brian Hitchcock October 19, 2006
Page 43
Conclusions
Brian’s off his meds
–
Ignore him – this isn’t typical
Urban legends?
–
–
–
You don’t have to find root cause
You don’t have to have a complete test environment
You don’t need extensive expertise in specific DBA areas
You do need
–
–
–
DBA experience
Open mind – strange stuff happens all the time
Communication skills – you don’t know what’s happening
Looking at training resources
–
–
Why become more expert at things that are rarely needed?
Why not become familiar with things you know little about?
SunFed DBA
Brian Hitchcock October 19, 2006
Page 44
Opinion
Specific expertise can be obtained when needed
Based on the results shown
–
–
94% of the time, don’t need expertise to be productive
15 of 16 cases solved with general DBA skills
Some expertise can be automated or outsourced
–
Tracing, wait events
Focus on skills that can’t be put into a GUI
–
–
Communication
Ability to solve problems, not just database issues
SunFed DBA
Brian Hitchcock October 19, 2006
Page 45