Thursday, January 19, 2017

Troubleshooting a slow RMAN backup

(As always, the names of the databases and files are not the real ones, but the numbers are.)

Yesterday we got a complaint from a client. He said that the RMAN backup of his database was running during working hours, and that it should not be. I looked at the backup times over the week and saw that the backup was taking much longer than expected.



I compared this backup with another one from another database:

Server     Database  Date                 Status     MB Processed  Elapsed (minutes)
SERVEROK   BD1       17-01-2017 08:33:21  COMPLETED  482582        183
SERVERNOK  BD2       17-01-2017 08:10:07  RUNNING    532904        521



As can be seen, the difference in throughput (MB per minute) is very big. Both backups were full database backups, and both were cataloged to the same tape (virtual tape). The backups were taken through the Tivoli TDP API. The Oracle version is 11.2.0.4.
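To put a number on that difference, the throughput in MB per minute can be computed directly from the table above (a trivial check, using only the figures shown there):

```python
# Throughput (MB per minute) from the comparison table above.
bd1_rate = 482582 / 183   # SERVEROK  / BD1
bd2_rate = 532904 / 521   # SERVERNOK / BD2 (still RUNNING at that point)

print(round(bd1_rate))  # 2637
print(round(bd2_rate))  # 1023
```

BD1 was moving roughly two and a half times more data per minute, which matches what the elapsed times suggest.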

So, as a first step, I decided to check the backup-related data transfer rates for both databases and compare them.

Here I consulted two views: V$BACKUP_ASYNC_IO and V$BACKUP_SYNC_IO. I will not explain all their details, since that would make this article too long and boring; you can find all the explanations related to those views here:
http://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmtunin.htm#BRADV89862

In those views, you'll find a field called TYPE:
TYPE='INPUT' means from the database to memory, that is, reading.
TYPE='OUTPUT' means from memory to the external destination (tape or disk), that is, writing.
TYPE='AGGREGATE' means the aggregate rate for all the files in one operation. According to the notes in My Oracle Support and the manual, this is the value we should check.

In my databases, the entries in V$BACKUP_ASYNC_IO have the type INPUT (and AGGREGATE), so they correspond to reading the datafiles.
The view V$BACKUP_SYNC_IO has the type OUTPUT. This corresponds to writing the backup to disk or tape.

So, since the tape server, the API and the catalog were the same for both of my servers, and the rates were quite different even though both backup windows overlapped, I checked the view V$BACKUP_ASYNC_IO to take a look at the reading rates (considering a possible local I/O problem on the server).

I used the free program R to calculate the histograms and statistics (don't worry, I will show you how I did it; it's very easy!). You can get it from here:

https://www.r-project.org/


So, first I ran the query to get the data for the backups of BD1 and BD2:


select device_type "Device", type, filename, to_char(open_time, 'mm/dd/yyyy hh24:mi:ss') open,
to_char(close_time, 'mm/dd/yyyy hh24:mi:ss') close, elapsed_time ET, effective_bytes_per_second EPS
from v$backup_async_io
where close_time > sysdate - 2
order by close_time desc;



This is an example of the output (a part only).
The column we are interested in is EPS (EFFECTIVE_BYTES_PER_SECOND):

Device            TYPE      FILENAME                                           OPEN                CLOSE                       ET        EPS
----------------- --------- -------------------------------------------------- ------------------- ------------------- ---------- ----------
DISK              AGGREGATE                                                    01/18/2017 08:07:38 01/18/2017 08:07:39        100    7851184
DISK              INPUT     /opt/oracle/product/11.2.0.4/dbs/snapcf_SRVBD2.f   01/18/2017 08:07:38 01/18/2017 08:07:39        100   26607616
DISK              AGGREGATE                                                    01/18/2017 08:01:17 01/18/2017 08:07:30      37300    9356063
DISK              INPUT     /bdd/SRVBD2_arch/SRVBD2/datafile1.dbf              01/18/2017 08:01:17 01/18/2017 08:07:30      37300    2671838
DISK              INPUT     /bdd/SRVBD2_arch/SRVBD2/datafile2.dbf              01/18/2017 08:01:17 01/18/2017 08:07:28      37100    2596404
DISK              INPUT     /bdd/SRVBD2_arch/SRVBD2/datafile3.dbf              01/18/2017 08:01:18 01/18/2017 08:07:25      36700    2565406
DISK              INPUT     /bdd/SRVBD2_arch/SRVBD2/datafile4.dbf              01/18/2017 08:01:18 01/18/2017 08:07:24      36600    2558450
DISK              INPUT     /bdd/SRVBD2_arch/SRVBD2/datafile5.dbf              01/18/2017 08:07:23 01/18/2017 08:07:23          0
DISK              AGGREGATE                                                    01/18/2017 08:07:23 01/18/2017 08:07:23          0
DISK              INPUT     /bdd/SRVBD2_arch/SRVBD2/datafile6.dbf              01/18/2017 08:01:19 01/18/2017 08:07:23      36400    2566605
DISK              AGGREGATE                                                    01/18/2017 08:01:17 01/18/2017 08:07:11      35400   26607616


What I did was spool the result of the query to a file, delete the rows with an empty EPS, and format it so that it could be pasted into the R console, using this command:

cat file.txt | awk '{print $8}' | sed 's/$/,/' | awk 'BEGIN{ORS=" "} {print $1}'

The output of this command is a list of values separated by commas. I take this output, put it into parentheses and create an R vector called vect1 (just paste the output into the R console):

vect1=c(7851184, 9356063, 26607616, 8192000, 52428800, 33554432, 58254222, 29127111, 44739243, 58040099, 42107523, 58040099, 65075262,
6334760, 4946113, 5741935, 54597042, 18592932, 8992813, 5660329, 10737418, 55063683, 6848814, 7100424, 8635993,
 5724927, 7786455, 58355534, 9062887, 77864554, 83668194, 65590564, 10489256, 15471784, 12290845, 11581630, 60549351, 84546600, 11447141, 12563282, 7150334,
11574651, 10357638, 7110873, 7697074, 11907962, 13613312, 6052304, 10875389, 13833203, 11327687, 10882826, 10500620, 11737263, 9389810, 11581723)
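As a side note, if you prefer to skip the awk/sed juggling, the same extraction can be sketched in Python. This is only a sketch under my assumptions: it reads lines in the fixed layout shown above, where AGGREGATE rows carry no filename and therefore split into exactly 8 whitespace-separated fields, with EPS as the last one.

```python
# Minimal sketch: pull EFFECTIVE_BYTES_PER_SECOND from the spooled query output.
# Assumption: AGGREGATE rows (the values we should check, as explained earlier)
# have no FILENAME, so they split into exactly 8 fields, EPS last.
def aggregate_eps(lines):
    rates = []
    for line in lines:
        fields = line.split()
        if len(fields) == 8 and fields[1] == "AGGREGATE":
            rates.append(int(fields[7]))
    return rates

# Example with two rows from the listing above:
sample = [
    "DISK  AGGREGATE        01/18/2017 08:07:38 01/18/2017 08:07:39  100  7851184",
    "DISK  INPUT  /opt/oracle/product/11.2.0.4/dbs/snapcf_SRVBD2.f"
    "  01/18/2017 08:07:38 01/18/2017 08:07:39  100  26607616",
]
print(aggregate_eps(sample))  # [7851184]
```

Filtering on AGGREGATE also takes care of the rows with an empty EPS, since those split into fewer than 8 fields.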


The values in the vector are in bytes per second, so I divide them by 1024 to get KB per second (more readable):

BD2=vect1/1024

Then, I generate the statistical data:

summary(BD2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4830    8325   11310   24050   41760   82570


And generate the histogram:

hist(BD2)
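If R is not at hand, the same summary can be sketched with Python's standard library. This is just a sketch: it rounds to the nearest KB, while R's summary() prints 4 significant digits, so the numbers differ slightly in the last digit.

```python
import statistics

# EFFECTIVE_BYTES_PER_SECOND values for BD2, the same ones as in vect1 above.
vect1 = [7851184, 9356063, 26607616, 8192000, 52428800, 33554432, 58254222,
         29127111, 44739243, 58040099, 42107523, 58040099, 65075262, 6334760,
         4946113, 5741935, 54597042, 18592932, 8992813, 5660329, 10737418,
         55063683, 6848814, 7100424, 8635993, 5724927, 7786455, 58355534,
         9062887, 77864554, 83668194, 65590564, 10489256, 15471784, 12290845,
         11581630, 60549351, 84546600, 11447141, 12563282, 7150334, 11574651,
         10357638, 7110873, 7697074, 11907962, 13613312, 6052304, 10875389,
         13833203, 11327687, 10882826, 10500620, 11737263, 9389810, 11581723]

bd2 = [v / 1024 for v in vect1]  # KB/sec, as in R

print("min   :", round(min(bd2)))                # 4830
print("median:", round(statistics.median(bd2)))  # 11307 (R prints 11310)
print("mean  :", round(statistics.mean(bd2)))
print("max   :", round(max(bd2)))                # 82565 (R prints 82570)
```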

 

I did the same steps for BD1. Here is its histogram:





And the statistical data:

summary(BD1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10670   12330   12540   23880   16990   98700

>

Also, following the advice in Oracle's manual, I looked at the ratio of long waits to I/Os in the view V$BACKUP_ASYNC_IO (LONG_WAITS/IO_COUNT).
Long waits are the number of times that a buffer was not immediately available, and only became available after a blocking wait was issued.
This ratio (1 being the maximum value) indicates whether some sort of I/O bottleneck could be causing problems.

This is the query:

select io_count, ready, short_waits, long_waits, long_waits/io_count, filename from v$backup_async_io;

In my case, both on BD1 and on BD2, the column LONG_WAITS/IO_COUNT was always 0, so there was no I/O issue (as expected, looking at the previous graphs).

I also checked the view that, in my case, corresponded to writing to tape (V$BACKUP_SYNC_IO). In this view I used the field DISCRETE_BYTES_PER_SECOND,
since it is the average transfer rate for the corresponding file.

This is the query:

select device_type "Device", type, filename, to_char(open_time, 'mm/dd/yyyy hh24:mi:ss') open,
    to_char(close_time, 'mm/dd/yyyy hh24:mi:ss') close, elapsed_time ET, discrete_bytes_per_second DPS
    from v$backup_sync_io
    where close_time > sysdate - 2
    order by close_time desc;


And these are the graphs (I already explained earlier how to create them).





> summary(BD2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6522   13300   14970   19260   17690   64000

>





> summary(BD1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18690   19530   20030   26410   20890   82290
>



So, looking at the statistical data for the reading part, we see the following:

Server SERVEROK (database BD1, the one with the good time): total MB: 482582, elapsed time: 183 min. The median transfer rate was 12540 KB/sec, the mean 23880 KB/sec. 50% of the transfer rates are in the range between 12330 KB/sec and 16990 KB/sec. The maximum transfer rate was 98700 KB/sec and the minimum 10670 KB/sec.

Server SERVERNOK (database BD2, the one with the bad time): total MB: 532904, elapsed time: 521 min. The median transfer rate was 11310 KB/sec, the mean 24050 KB/sec. 50% of the transfer rates are in the range between 8325 KB/sec and 41760 KB/sec. The maximum transfer rate was 82570 KB/sec and the minimum 4830 KB/sec.

As we can see from the data and the graphs, the reading transfer rates are quite similar, with BD2 (the one with the bad backup times) even getting slightly better rates.

Looking at the statistical data for the writing part, we see the following:


Server SERVEROK (database BD1, the one with the good time): total MB: 482582, elapsed time: 183 min. The median transfer rate was 20030 KB/sec, the mean 26410 KB/sec. 50% of the transfer rates are in the range between 19530 KB/sec and 20890 KB/sec. The maximum transfer rate was 82290 KB/sec and the minimum 18690 KB/sec.

Server SERVERNOK (database BD2, the one with the bad time): total MB: 532904, elapsed time: 521 min. The median transfer rate was 14970 KB/sec, the mean 19260 KB/sec. 50% of the transfer rates are in the range between 13300 KB/sec and 17690 KB/sec. The maximum transfer rate was 64000 KB/sec and the minimum 6522 KB/sec.

There is some difference in the writing rates, but it is not big and should not justify the huge difference in the total backup time.

So, as a next step, I checked the backup scripts of the databases.

Surprise!

In server SERVEROK (database BD1), the script allocates 4 channels and 2 files per backup set. The database has 30 datafiles.
In server SERVERNOK (database BD2), the script allocates 2 channels and 1 file per backup set. The database has 51 datafiles.

So I checked the CPU and I/O usage of SERVERNOK during the backup hours. The CPU was 50% used and there were some I/O waits. As the server has 8 CPUs, I still had 4 to play with.
I decided to add 2 more channels (2 more CPUs) and accept a more intense I/O load during that window. I also modified the script to use 2 files per backup set instead of 1.
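For illustration, the change amounts to something like this in the RMAN script (a sketch only, not the real script: the channel names are made up, and the Tivoli TDP media-management PARMS are omitted):

```
run {
  # 4 channels instead of 2, going to the TDP virtual tape
  allocate channel ch1 device type 'sbt_tape';
  allocate channel ch2 device type 'sbt_tape';
  allocate channel ch3 device type 'sbt_tape';
  allocate channel ch4 device type 'sbt_tape';
  # 2 files per backup set instead of 1
  backup filesperset 2 database;
}
```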

Next day results:

Server     Database  Date                 Status     MB Processed  Elapsed (minutes)
SERVERNOK  BD2       18-01-2017 05:07:16  COMPLETED  664550        338



The backup time had dropped by more than 5 hours!


Here you can download a great document about RMAN performance:
 

http://www.oracle.com/technetwork/database/features/availability/br-optimization-132612.pdf

2 comments:

Foued said...

Thanks for sharing this post.
Foued

Unknown said...

Thanks for sharing. Good job