Wednesday, 20 March 2013

Diagnosing Oracle Database Issues

From studying for the Oracle Database Admin II OCP Exam

The Alert Log

There are a number of sources of information that the Oracle Database provides to help in getting to the bottom of issues. The first place in a situation where you do not know what has caused the issue is to check the alert log for the instance. This log file contains messages for significant events which will include general information and warnings/errors

·         All startup and shutdown commands, times taken for transitions from nomount to mount

·         All Log switches and archive logs

·         Details on tablespace changes

·         Changes to the database involving datafiles and redo logs.

·         Initialization parameters

·         Unable to open file ( if there is an issue with a datafile or control file during normal startup)

·         Corrupt data blocks

·         Deadlocks

·         Problems with archiving such as the destination being 100%

·         Trace file information for background processes which have encountered problems

These messages are in chronological order so checking the last number of changes or errors is as easy as going to the end of the file and scanning up. Many of the messages will also include the details of a trace file. These trace files are generated when a server or background process encounters an error.

The Location of the Alert Log is set in the database pararamter : BACKGROUND_DUMP_DEST
(This database parameter is dynamic and can be changed with shutting down the instance)

The naming convention for the alert log is  alert_<ORACLE_SID>.log

Viewing the alert log at the Operating System level can be achieved by navigating to its location and opening with any editor. You can also find it in Enterprise Manager on the database home page in the Diagnostic Summary section with an Alert log link.

Trace Files

As mentioned in relation to the alert log, when a background process encounters a problem it will write its output to a trace file in the same location as the alert log : BACKGROUND_DUMP_DEST.
When a server process encounters a problem it will go to: USER_DUMP_DEST

Trace file headers contain the version of Oracle and Operating System the trace file was generated on , instance name and process/thread id, followed by the actual error. When a trace file is generated then unlike message in the alert log which can be informational, an actual error has occurred.

You can limit the size a trace file will be (unlike the alert log which cannot be limited) with the database parameter MAX_DUMP_FILE_SIZE which by default is unlimited.

SQL> ALTER SYSTEM SET MAX_DUMP_FILE_SIZE = "50m" scope = both

Server Generated Alerts

Both the alert log and trace files looked at previously provide a starting point when looking at an issue, but out of the box, Oracle comes with an alert system which is very configurable. As well as using this to look at immediate problems, it can be configured to alert you before the issue becomes critical and effects the end users.

You can interact with the alert system through Database Control which will be the easiest way but you can also you setup changes through API’s and be notified by subscribing to the ALERT_QUE which will be covered in the Alert System Architecture.

A server generated alert will fire when a certain limit is reached or when certain events occur. These are looked after by the background process MMON. When we talk about limits we are talking about database metrics of which over 120 can be configured ( V$ALERT_TYPES).  A metric is a statistic converted into some meaningful figure such as disk reads(statistic) per second or space used(statistic) is in proportion to total space which would be metrics.

Two Types of Alerts

1.       Threshold (stateful) -  these can persist for some time until cleared. An example would be a warning when a tablespace it 87% full. Once action is taken these clear

2.       Nonthreshold(stateless) – unpredictable events for example snapshot to old errors which are resolved immediately

For threshold alerts, when the problem condition is cleared by fixing it, it’s moved to the alert history unlike Nonthreshold alerts which go there directly.

Again you can view alerts in enterprise manager on the home page under Alert and Related Alerts and under the Alert Log link. Each alert can be drilled down into to find out more information.
Alert System Architecture

The statistics used by the alert system are collected when the database parameter STATISTICS_LEVEL is set to TYPICAL(DEFAULT) or ALL. Setting this to BASIC will disable statistic gathering. MMON gathers these statistics and calculates the metrics as all as monitoring thresholds. If a metric breaks a given threshold then it will write it to the predefined ALERT_QUE(if there is a problem with this queue or any other it would write it to the alert log, this is the only interaction between the alert system and the alert log). The ALERT_QUE can have more than one consumer so third-party software could subscribe to it using the DBMS_AQADM package and CREATE_AQ_AGENT , ADD_SUBSCRIBER procedures.

A notification rule can be created that when an alert occurs, the DBA can take appropriate action sometimes even before the end user notices.

User Trace Files

Trace files for a user’s session which normally are generated on demand in helping to diagnose an issue locally.

Setting the SQL_TRACE value to TRUE for the session will generate performance statistics for the processing of all SQL in that session.

SQL> ALTER SESSION SET SQL_TRACE TRUE;

These files will be produced in : USER_DUMP_DEST

If the user is using an application then they probably will not be able to open a SQL session and run commands like the above. In this case you need to first find the SID from v$session that the user is running. This can be narrowed down using any of the columns in v$session.

Then you can monitor the session using the DBMS_MONITOR package like below

SQL> execute dbms_monitor.session_trace_enable( session_id=>101,serial_num=>99);

SQL> execute dbms_monitor.session_trace_disable( session_id=>101,serial_num=>99);

You can implement tracing through Database Control and indeed it maybe easier to do this, especially in identifying the session for the user be that through TOP CONSUMERS or something else.

System Log Files

You may or not also by the System Administrator or have to co-ordinate with another team who manage the operating system but these would need checked periodically and immediately if there was an issue. From experience I’ve spent time going through the whole above process only to find out the System Administrator never checked the system log files and in there found out the root cause ( possible mount point issue or SMTP issue) e.g. in Unix systems : /var/log/syslog or /var/adm/messages

Summary

For reactive situations, you will want to first check the alert log and then drill down into any trace files which look like candidates for the issue.

For proactive monitoring you will want to start playing with and configuring the server alert system.

If trying to identify unique issues,  you can trace the database by user , session or program 

No comments:

Post a Comment