Ten Database Mysteries Part I

Download Report

Transcript Ten Database Mysteries Part I

Ten Database Mysteries
Chris Lawson
Database Specialists, Inc.
www.dbspecialists.com
[email protected]
Focus of Presentation
• Explore some “strange” database problems that have
baffled some DBAs
• Most of the mysteries occurred on critical production
systems, although some were on development systems
• ALL of the mysteries were eventually explained
• Depending on your personal experience, some of these
“mysteries” will seem trivial or commonplace; others
will indeed seem mysterious
• Most mysteries have a simple explanation
• Most mysteries have a simple fix
Why Spend Time on These
Database Mysteries?
• Each DBA has a unique set of experiences and biases.
What one DBA thinks is obvious, another will not.
• An Oracle “detective” is part scientist, part artist.
Many solutions require creativity, not just logic.
• A superior DBA will look for ways to “stretch” and
learn ways to handle difficult problems.
• Without working out difficult problems, you will not
advance as a DBA.
• You will be the “hero” if you encounter a mystery and
solve it; remember the solution--you may see it again!
A Word About Oracle Versions
• This presentation was originally written in
1998
• Most of these mysteries involve Oracle 7
databases
• Although some of the mysteries might not
apply directly to Oracle 8i, they still offer
insight into the problem-solving process
1
The Case of the
Berserk Application
Clue #1: “Big Phone Company” 1997
• Using HPUX, Oracle 7.3.2.3
• Help desk application (Vantive) that connects to Oracle
database suddenly goes berserk, creating thousands of
connections
• Program had worked normally for many months
• DBAs watch helplessly as CPU load driven from 1 to 50
• As DBAs kill extra processes, more take their place
• Alert log and recent trace files show nothing unusual
• DBAs are united in accusing the application as the culprit
Berserk Application
(continued)
Clue #2: “Big Publisher Ltd.” 1998
• Running Sun Solaris, Oracle 7.3.2.3
• Users complain that performance has degraded in recent
months
• Manager states that “something must be wrong with the
network”
• Application is CORIS, a document management/printing
application
• DBA investigates. Discovers that time to connect in
SQL*Plus is 30-45 seconds, even though server load is low
• Connect time is bad whether remote (PC) or directly on server
• Server load (file I/O and CPU) is generally low
Berserk Application: Solution
• OTRACE is the culprit. It is active by default on
many 7.3 Oracle versions
• Excerpt from Oracle Corporation Alert:
“Problems described here can occur when Oracle
Trace is not configured and is widely enabled.”
Berserk Application: Solution
(continued)
To Detect:
• Check directory ORACLE_HOME/rdbms/otrace: As
size of files process.dat and regid.dat approach 10mb,
problems arise
-rw-r--r-- 1 oracle dba 3161424 Jun 05 09:43 process.dat
-rw-r--r-- 1 oracle dba 263808 Jun 05 09:43 regid.dat
• To correct: simply remove these two files, then issue
command otrccref
Berserk Application: Solution
(continued)
To Prevent:
• Add line to listener.ora for each database (after
ORACLE_HOME):
( ENVS=’EPC_DISABLED=TRUE’)
• Set and export environment variable
EPC_DISABLED=TRUE for all users.
Put standard profile in /etc directory
• Restart all databases and restart listener
2
The Case of the Reluctant Patch
Background:
• To correct several bugs, decision is made to
upgrade from 7.3.2.2 to 7.3.2.3 (HPUX)
• Patch is obtained from Oracle and applied to test
server. DBA notes that patch ran very quickly and
runs again “just to be sure”
• Bug is now gone on test server
The Case of the Reluctant Patch
(continued)
Problem:
• Patch is similarly applied to production server-same operating system and version.
• Production application is tested, but bug is still
there!
• Another DBA reviews patch file, location, etc. All
seem correct.
The Case of the Reluctant Patch:
Solution
• DBA happens to notice that upon SQL*Plus
startup, database is 7.3.2.2!
• The patch was really only applied on the second
run. This is apparently a quirk in the patch readme
file.
• The command what oracle (then grep for patch#)
can be used to determine which patches are
applied
3
The Case of the Sleazy SQL
• “Big Publisher Ltd.” runs an MRP system called
“AVALON,” similar to Oracle Manufacturing.
Database stores inventory, part information,
vendors, etc.
• Server is ATT3555, running NCR UNIX.
Database is Oracle 7.1.6
• Issue: Users report that certain common operations
are very slow
The Case of the Sleazy SQL
(continued)
• DBA investigates and queries v$sqlarea using:
SELECT sql_text FROM v$sqlarea
WHERE disk_reads/executions > 1000;
• Query yields troublesome SQL statement, with
these stats:
DISK_READS PER EXECUTION
=5,000
BUFFER_GETS PER EXECUTION
=5,100
The Case of the Sleazy SQL
(continued)
• Statement has been accidentally designed to ensure worst
possible performance by making index usage impossible:
SELECT * FROM ABC
NVL (COL_W) = NVL (:1)
NVL (COL_X) = NVL (:2)
NVL (COL_Y) = NVL (:3)
NVL (COL_Z) = NVL (:4);
WHERE
AND
AND
AND
• TABLE ABC SIZE = 3mb, about 25,000 rows
• DBA requests developers to alter statement to eliminate
NVL (COL_N) functions
• DBA advised that no resources available to make change
• Problem: If code can’t be changed, what can be done to
improve performance?
The Case of the Sleazy SQL:
Solution
• CACHE the table! For example:
alter table xyz cache;
• Normally, blocks from full-table scans are designated for
rapid age-out; otherwise, they would “wipe-out” the db
cache. Cache of table causes blocks to be treated “normally.”
• Caching table disables rapid age-out of this table
• Logical reads will not be reduced, but disk reads approach
zero!
• Note: DB_BLOCK_BUFFERS was slightly increased to
compensate for the cached table that now consumes a few
megabytes of database cache
4
The Case of the
Non-Optimal Optimizer
• A large software company based in “Cedar Shores” has
designed a large financials application. Program has been
tuned for Rule Based Optimizer.
• The application runs very well, is a mature product, which
is used in thousands of companies around the world.
• Some users clamor for new features: more horns and
whistles
The Case of the
Non-Optimal Optimizer
(continued)
• The new development team, afraid to become obsolete,
wants to convert to Cost-Based Optimizer (CBO). They
also wisely consider that Oracle recommends using CBO
on new projects.
• The older developers, now nearing peaceful retirement,
predict disaster if the database is switched to CBO,
because the execution plans will change.
• Issue: How can Optimizer be selectively switched to CBO
without changing the code?
The Case of the Non-Optimal
Optimizer: Solution
• Simply substitute a view having a “hint” for the table
needing CBO
For example:
rename DEPT to DEPT_ORIG
create view DEPT as select / * + ALL_ROWS */
* from DEPT_ORIG;
• Now, application will use the VIEW when it looks for
DEPT
• All queries using DEPT will use CBO
• Note: Upon renaming a table, the indexes and constraints
will “move” with the table; however, synonyms and grants
may need to be reset.
5
The Case of the
Forgetful Memory
• A new internet-transaction application, ECXpert, and its
database have been installed on a Sun Ultra Enterprise Server
• Sun Solaris 2.5.1, Oracle 7.2.3
• Application appears to run smoothly for several months,
although it occasionally creates large dump files
• Trace files appear occasionally with ORA-4030 “Out of Process
Memory” and recommends “increase process memory quota”
• Server seems to hang occasionally. Server reboot fixes
• SysAdmin checks kernel parameters related to memory. All
correct and match other servers. Not using any large stored
procedures
• Problem: What is causing memory/hang problems?
The Case of the
Forgetful Memory: Solution
• DBA checks /tmp (swap area on server) and notes 99%
consumed
• Investigation reveals that application occasionally goes
berserk and consumes ENTIRE SWAP area with log
files
• Deletion of log files does not return disk space, since
application is still “holding” the files
• Reboot of server cleaned up /tmp area, thereby
correcting problem
• Suggestion: If memory-related error messages exist,
check swap area first
6
The Reluctant Index Affair
Background:
• DBA asked to analyze and tune Australia manufacturing
database. Database is running CBO. One particularly
bothersome SQL statement is identified
• The WHERE condition is perfect for a new index,
because of its excellent selectivity
• Index is quickly created. Table is also analyzed
Problem:
• Even though index is a “perfect” solution to the query, a
full table scan is used instead
The Reluctant Index Affair:
Solution
• The values in the table are very lopsided. Optimizer,
however, will assume uniform distribution, which is
incorrect in many cases
• Re-analyze and specify histogram:
ANALYZE TABLE XYZ FOR ALL INDEXED COLUMNS SIZE 75
• This creates histogram of 75 “buckets” for each indexed
column
• With these statistics, optimizer will “know” how values are
distributed, and will more often make right decision to use
an index or not
7
Mystery of the
Hanging Database
• At random intervals, a 7.3.2.3 database hangs. No trace
files, and nothing unusual in the alert log.
• When problem occurs, no response to new connections
requests; over 1200 existing connections “hang.”
• Oracle Support is alerted to priority 1 problem; experts
across the world investigate for weeks
• DBA is using OEM Lock Manager tool and notices user
who is blocking about 25 other users. The hang occurs
soon after.
• Oracle Australia recommends checking indexes. This
suggestion led to the solution.
Problem:
• How did index problems hang database?
Mystery of the
Hanging Database: Solution
• Application design flaw
• There are hundreds of foreign keys in the database; 99%
had indexes. A few did not, violating good design practice.
When batch program began updates, locking increased
rapidly.
• Without FK index, updates on parent table completely
block updates on child (vice versa for 7.1.6)
• Reference: Server Application Developers Guide
• Although not admitted as database “bug,” database was
overwhelmed by the locks
• Once indexes on all FK’s created, problems disappeared
8
The Case of the
Mysterious Package
• Manufacturing application was installed on a Sun Ultra
3000 server. A small database was created for testing
purposes. Oracle version 7.2.3. Shared pool size about 60mb.
• At first, all went well. Then, seemingly randomly, when
the users began to try new features, they would receive a
“funny” error message and the application failed.
• A trace file recommended increasing shared pool
Problem:
• How can application fail with such a sizable shared pool?
• Aside from massive increase in shared pool, what can be
done?
The Case of the
Mysterious Package: Solution
• The application uses about 20 massive PL/SQL packages.
Some are 5x the SYS.STANDARD package. When a
package load is attempted, it will not fit in the shared pool.
• Memory-intensive packages should be “pinned” or “kept”
in shared pool after database startup
EXECUTE SYS.dbms_shared_pool.keep ('OBJECT_NAME');
• But first, must find the “big” packages (will also list
SYS.STANDARD):
SELECT owner, name, sharable_mem FROM v$db_object_cache
WHERE sharable_mem > 100000;
The Case of the
Mysterious Package: Solution
(continued)
• Example script to find “big” packages and generate
SQL script to “pin” them in memory
SELECT 'EXECUTE SYS.dbms_shared_pool.keep('''||
owner||'.'||name||''');'
FROM
v$db_object_cache
WHERE sharable_mem > 100000
AND
type NOT IN ('VIEW', 'SYNONYM', 'TABLE')
AND
name NOT LIKE '%SHARED_POOL%'
AND
owner IS NOT NULL;
9
The Case of the
Uncooperative Rollback
• In mid-afternoon, DBA (running “OEM Top Sessions”)
notices many users “ACTIVE” but showing 0 file I/O. Lock
Manager reveals one user performing big update blocking all.
• Culprit tracked down--agrees to be terminated. DBA
disconnects session.
• Locks are not released, but user is “marked for kill.”
• Very little file I/O activity. Alert log shows very slow
switching of redo logs.
• DBA performs shutdown abort then startup. Database starts
up after 2 minutes. All is well.
Problem:
• Why did user not rollback and release locks?
The Case of the Uncooperative
Rollback: Solution
• If session is terminated, speed of rollback is proportional
to init.ora parameter
CLEANUP_ROLLBACK_ENTRIES
• If default value (20) is used, rollback of killed session
can take 50x time of original update. Alternatively,
shutdown abort/startup cleans up database much faster.
• Rationale: Parameter prevents rollback of one user from
hogging all the resources on a busy system
• Solution: Increase parameter to reduce rollback time
(since shutdown abort is usually not an option)
10
The Singular Case of the
Phantom Users
• A manufacturing database in Sydney, Australia, needed
performance tuning. SQL tuning on US databases had yielded
good results.
• The table v$sqlarea was queried to find resource-intensive
SQL statements. Several commonly-run statements were
isolated. Performance was improved through index additions.
• Statistics were re-examined over the next 4 hours, in order to
confirm improvements.
• However, repeated looks at execution statistics showed no
change.
• DBA puzzles over enigma for several hours, then realizes that
NOTHING is WRONG! What did he finally realize?
The Singular Case of the
Phantom Users: Solution
• Nothing is wrong because the users were
still asleep. It was only 5:00am in Sydney!
11
The Case of the Slow Physician
(Bonus Mystery)
• Health application is experiencing slow run times. Analysis shows
following SQL statement causing 3000 disk reads
SELECT * from COSIGN_VIEW WHERE doctor_id = 'DR. MCKENZIE'
• COSIGN_VIEW is a join of 2 tables (DOCS + COSIGN),
joined on patient_id (indexed)
• Search criteria ‘DR. MCKENZIE’ is very selective; thus,
nested loop IS expected choice for optimizer, with DOCS
as Driving table.
The Case of the Slow Physician
(continued)
• Even with index on DOCS(doctor_id), optimizer (CBO)
insists on using hash-join, and refuses to ever use index on
doctor_id!
• Repeated analyze table commands do not correct
• Substituting query not using a view yields expected NL
result
Problem:
• Why does using the view cause optimizer to make the
“wrong” choice?
The Case of the Slow Physician:
Solution
• Everything seemed to point to a problem with the view,
because all worked normally as long as the view was
excluded
• Finally, DBA compared view definition (in OEM Schema
Manager) to definition seen using “describe table” syntax.
The columns did NOT match!
• Examining the object-create script revealed that the view
switched column names, so that column DOC_ID in the
view did NOT match DOC_ID in the table!
• Once the correct column was indexed, a Nested-Loop Join
was selected by the optimizer
Contact Information
Chris Lawson
[email protected]
http://www.dbspecialists.com
Database Specialists, Inc.
388 Market Street, Suite 400
San Francisco, CA 94111