What an AWR or Statspack report shows is not where the problem is or what it is, but simply what was happening in the database during the snapshot interval!
That means we need to interpret correctly what we are seeing in the report. Here I'd like to share an example that illustrates this point.
Some months ago, I got a request from a colleague to help him with an issue. They were running a query from a SAP server and it was very slow. They had taken an AWR snapshot of the period and wanted me to take a look at it. Here is the output:
Apparently, the issue could have something to do with index access (db file sequential read), since that is the event taking most of the DB Time (57.08%), while the direct path read event accounts for only 25.78%.
If we don't take a closer look, the diagnosis would be that there is an I/O problem related to indexes (wrong execution plan, wrong indexes used, indexes that should be rebuilt, etc.).
If we take a closer look, however, we can see a couple of odd details:
1. There are far more direct path read calls than db file sequential read calls (714,874 vs. 279,552).
2. The average wait time for direct path read is very good (5 ms), but the db file sequential read time is quite bad (26 ms). (A quick way to cross-check these figures is sketched right after this list.)
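By the way, if you want to cross-check those call counts and average wait times outside the report while the problem is happening, a query against v$system_event along these lines works (this is a generic sketch of mine, not taken from the original case):

-- instance-wide totals and average wait per call for the two events of interest
SELECT event,
       total_waits,
       ROUND(time_waited_micro / 1000000) AS time_waited_s,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 1) AS avg_wait_ms
  FROM v$system_event
 WHERE event IN ('db file sequential read', 'direct path read')
 ORDER BY time_waited_micro DESC;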
Let's take a look at another section of the report:
(To keep confidentiality, I have removed the modules and the other queries from this section; also, in the next paragraphs, some data have been changed. Sorry guys, but it had to be done.)
So, the query with SQL_ID gmgw5f1zggj8j is clearly the one at the top, accounting for 97.15% of the total reads from disk.
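If you want to build a similar ranking straight from the shared pool instead of the AWR report, something along these lines does the job (again a generic sketch; it simply ranks the cached statements by physical reads):

-- top 10 statements by physical reads, with their share of the total
SELECT *
  FROM (SELECT sql_id,
               disk_reads,
               executions,
               ROUND(100 * RATIO_TO_REPORT(disk_reads) OVER (), 2) AS pct_of_reads
          FROM v$sql
         ORDER BY disk_reads DESC)
 WHERE ROWNUM <= 10;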
But let's take a look at the execution plan of that statement:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID gmgw5f1zggj8j, child number 0
-------------------------------------
SELECT /*+ FIRST_ROWS (1) */ * FROM "XXXXX000" WHERE "YYYYY"=:A0 AND
"RRRRRR"=:A1 AND ROWNUM <=:A2
Plan hash value: 3808608472
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 8414 (100)| |
|* 1 | COUNT STOPKEY | | | | | |
|* 2 | TABLE ACCESS FULL| XXXXX000 | 1 | 206 | 8414 (1)| 00:02:02 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=:A2)
2 - filter(("RRRRRR"=:A1 AND "YYYYY"=:A0))
This is not an index access.....
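(In case you want to reproduce it: the listing above has the typical shape of DBMS_XPLAN.DISPLAY_CURSOR output, so pulling the plan of a cached cursor yourself is just a one-liner from SQL*Plus:)

-- plan of child cursor 0 of the offending statement, straight from the library cache
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('gmgw5f1zggj8j', 0));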
Let's go to another section:
OK, 825G read from the datafiles. Nothing unexpected so far...
822.3G of direct reads. That means that most of the physical access to disk is due to direct path reads (which actually matches what we saw in the first section regarding the number of calls), but it does not explain why db file sequential read appears first and has a greater average wait time.
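A quick back-of-the-envelope check (my own rough arithmetic, assuming an 8K block size and that practically all of the remaining read volume belongs to the single-block reads): 822.3G over 714,874 direct path read calls is roughly 1.2 MB per call, i.e. large multiblock reads, while the remaining ~2.7G over 279,552 db file sequential read calls is about 10 KB per call, consistent with single-block reads. So the direct path reads are moving almost all the data, even though each individual call looks cheap in the Top 5 list.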
At this point, I had a picture in my mind of what had possibly happened. Luckily, they could launch the slow query (one that used indexes, by the way) again. I used a wonderful script created by Tanel Poder (snapper) to sample v$session, and this is what I saw:
---------------------------------------------------------------------------------------------
Active% | SQL_ID | SQL_CHILD | EVENT | WAIT_CLASS
---------------------------------------------------------------------------------------------
96% | gmgw5f1zggj8j | 0 | direct path read | User I/O
4% | gmgw5f1zggj8j | 0 | ON CPU | ON CPU
-- End of ASH snap 1, end=2013-01-24 10:15:35, seconds=180, samples_taken=180
Different samples showed similar output. Some of them showed db file sequential read waits, but mostly there were no index-related waits, only direct path reads!
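For reference, output like the one above comes from a snapper "ash" snapshot; the exact parameters here are an approximation on my part (1234 is just a placeholder for the session's SID), sampling the session once for 180 seconds:

SQL> @snapper ash=sql_id+sql_child_number+event+wait_class 180 1 1234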
This was the situation of the disk:
product01> iostat -xd 1 30 | grep sdg
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdg 167.82 9.15 336.27 19.63 108573.17 808.71 307.34 2.52 7.08 1.40 49.71
sdg 324.00 57.00 482.00 2.00 283568.00 472.00 586.86 14.88 31.87 2.07 100.10
sdg 319.00 0.00 522.00 0.00 270560.00 0.00 518.31 14.76 28.52 1.92 100.20
sdg 264.00 5.00 597.00 17.00 241376.00 328.00 393.65 15.40 25.06 1.63 100.00
sdg 331.00 0.00 450.00 0.00 283824.00 0.00 630.72 13.79 30.44 2.22 100.10
sdg 368.00 20.00 510.00 4.00 323280.00 192.00 629.32 13.59 25.83 1.95 100.20
sdg 256.44 54.46 365.35 19.80 221813.86 974.26 578.45 15.51 40.04 2.58 99.41
sdg 352.00 0.00 482.00 41.00 303216.00 328.00 580.39 17.12 33.63 1.92 100.20
sdg 411.00 0.00 544.00 0.00 358112.00 0.00 658.29 16.91 30.02 1.84 100.30
sdg 342.00 3.00 425.00 13.00 286696.00 232.00 655.09 18.60 41.71 2.29 100.20
sdg 279.00 11.00 363.00 0.00 239896.00 0.00 660.87 18.23 49.15 2.76 100.20
sdg 287.13 56.44 343.56 3.96 244055.45 570.30 703.91 17.50 49.88 2.85 99.11
sdg 319.00 0.00 469.00 6.00 282400.00 176.00 594.90 17.33 39.20 2.11 100.20
sdg 315.00 0.00 433.00 0.00 264496.00 0.00 610.85 17.86 37.82 2.32 100.30
The last column (%util) shows that the disk was 100% busy.
In the second column from the right (svctm) we can see that the service time of the disk is fine, but the response time (await) is very bad, because of the high concurrency on the I/O subsystem.
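Roughly speaking, await ≈ queueing time + svctm. Taking one of the samples above (await = 41.71 ms, svctm = 2.29 ms, avgqu-sz ≈ 18.60), each request spent close to 39 ms just waiting in the queue behind other requests, which is the same order of magnitude as the 26 ms average db file sequential read waits we saw in the AWR report.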
When I asked about the query that appeared in the output, they told me it belonged to a batch job that was running and could be stopped, and so they stopped it.
This is the state of the disk after they stopped the job:
product01> iostat -xd 1 30 | grep sdg
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdg 171.85 9.17 339.17 19.49 112553.23 802.55 316.05 2.58 7.19 1.40 50.21
sdg 0.00 0.00 49.00 81.00 784.00 3888.00 35.94 0.17 1.33 0.87 11.30
sdg 0.00 6.00 83.00 3.00 1328.00 96.00 16.56 0.26 2.98 2.83 24.30
sdg 0.00 55.45 125.74 3.96 2011.88 475.25 19.18 0.59 4.53 4.04 52.38
sdg 0.00 0.00 99.00 16.00 1584.00 4112.00 49.53 0.29 2.59 1.64 18.90
sdg 0.00 0.00 102.00 1.00 1632.00 32.00 16.16 0.22 2.10 2.03 20.90
sdg 0.00 6.00 98.00 2.00 1568.00 64.00 16.32 0.22 2.16 2.10 21.00
sdg 0.00 0.00 119.00 29.00 1888.00 5440.00 49.51 0.33 2.19 1.36 20.10
sdg 0.00 45.00 111.00 7.00 1776.00 440.00 18.78 0.26 2.20 1.97 23.30
sdg 0.00 0.00 487.00 53.00 7792.00 5440.00 24.50 0.96 1.77 0.90 48.40
sdg 62.00 4.00 1227.00 52.00 39648.00 6624.00 36.18 9.77 7.63 0.73 3.80
sdg 253.00 0.00 300.00 1.00 29264.00 32.00 97.33 1.76 5.90 2.55 76.90
sdg 0.00 0.00 204.95 0.00 3279.21 0.00 16.00 0.62 3.03 2.50 51.29
sdg 0.00 70.00 169.00 23.00 2704.00 4256.00 36.25 0.54 2.82 2.10 40.30
sdg 0.00 8.00 167.00 7.00 2688.00 240.00 16.83 0.37 2.13 1.93 33.50
sdg 0.00 4.00 185.00 2.00 2960.00 48.00 16.09 0.45 2.42 2.09 39.00
sdg 0.00 0.00 173.00 13.00 2768.00 2608.00 28.90 0.47 2.50 1.91 35.60
As can be seen, the disk now has a very good response time.
So, to conclude, the problem was that the query doing the full scan was consuming all the throughput capacity of the disk, making the other queries (index access, db file sequential read waits) run slower.
That is the reason we see db file sequential read taking most of the time in the AWR report: those single-block reads were running far slower than they should because of the direct path reads. (And, of course, the direct path read average time is excellent, because those reads are literally "eating" all the capacity of the disk!)
This is one case where the output shown in the Top 5 Timed Foreground Events section of the report can be misleading: at first sight it looks as if db file sequential read is the murderer, when actually it is not the murderer but the victim!
Remember, an AWR/Statspack report is just a photo! By itself, it does not necessarily tell you where the problem lies.