Wednesday, September 6, 2017

More is not always better. ORA-04031 provoked by parallel servers

I thought it was worth writing about this incident, which happened on one of my servers a couple of months ago. The reason is to show that adding or increasing resources is not always the right approach to solving a problem.


It began with the classic mail complaining about a job. The developer told me that on the former server (this was a migrated database), the job ran well, but on the new one it failed with this error:

ORA-04031: unable to allocate 65560 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")


Of course, the "PX msg pool" immediately caught my attention. And the fact that a lot of those messages appeared, complaining about both large and very small chunks of memory, made me think of a space issue in the pool.

First, I checked the trace file, which showed:

TOP 20 MEMORY USES ACROSS LARGE POOL 1 - 7
----------------------------------------------
"PX msg pool               "       3214 MB 98%
"free memory               "         65 MB  2%
"ASM map operations hashta "        384 KB  0%
TOTALS ---------------------------------------
Total free memory                    65 MB
Total memory alloc.                3215 MB
Grand total                        3280 MB
==============================================


98% usage by the PX msg pool. Obviously, a space issue. Subpools 1 to 7 all showed the same.
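
Just for reference, the same kind of breakdown can be pulled at any moment from a running instance with a quick query against v$sgastat (a generic check, nothing specific to this incident; the figures above came from the trace):

-- Large pool allocations, biggest consumers first
SELECT pool, name, ROUND(bytes/1024/1024) AS mb
FROM   v$sgastat
WHERE  pool = 'large pool'
ORDER  BY bytes DESC;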

My first thought, of course, was to make the pool bigger, but just in case, I decided to take a deeper look.

So, I went to one of the generated cdmp directories and ran this command to check the last wait events of the processes:


server> cd cdmp_20170727101202
server> cat * | awk '/Current Wait Stack:/ {getline; print $0}' | sort | uniq -c

This is the output:

      1
      1  0: waiting for 'ASM background timer'
     11  0: waiting for 'class slave wait'
      1  0: waiting for 'DIAG idle wait'
      5  0: waiting for 'EMON slave idle wait'
      1  0: waiting for 'GCR sleep'
      4  0: waiting for 'gcs remote message'
      1  0: waiting for 'ges remote message'
      1  0: waiting for 'JOX Jit Process Sleep'
    256  0: waiting for 'latch: parallel query alloc buffer'
      1  0: waiting for 'PING'
      1  0: waiting for 'PL/SQL lock timer'
      1  0: waiting for 'pmon timer'
      1  0: waiting for 'process diagnostic dump'
    128  0: waiting for 'PX Deq: Execution Msg'
    128  0: waiting for 'PX Deq: Table Q Normal'
     34  0: waiting for 'rdbms ipc message'
      2  0: waiting for 'SGA: allocation forcing component growth'
      1  0: waiting for 'smon timer'
      3  0: waiting for 'Space Manager: slave idle wait'
     14  0: waiting for 'SQL*Net message from client'
      1  0: waiting for 'Streams AQ: emn coordinator idle wait'
      1  0: waiting for 'Streams AQ: qmn coordinator idle wait'
      1  0: waiting for 'Streams AQ: qmn slave idle wait'
      1  0: waiting for 'Streams AQ: waiting for time management or cleanup tasks'
      1  0: waiting for 'VKTM Logical Idle Wait'
      1  0: waiting for 'wait for unread message on broadcast channel'
server>


256 processes waiting for a latch to allocate memory in the parallel pool, plus 256 more processes in the middle of parallel calls: that means around 500 parallel processes running! On top of that, the waits for 'SGA: allocation forcing component growth' gave me a hint about the space pressure in the SGA.
This database is not a big one; it cannot justify this amount of parallel processes.
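
To put a number on it from the database side, a query against gv$px_process shows how many PX servers each instance has spawned and how many are busy (again, a generic check; the counts during the incident came from the cdmp dump above):

-- Parallel execution servers per RAC instance and their status (IN USE / AVAILABLE)
SELECT inst_id, status, COUNT(*) AS px_servers
FROM   gv$px_process
GROUP  BY inst_id, status
ORDER  BY inst_id, status;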

At this point I already had a picture in my mind of what could be happening. I asked the former DBA to send me the SGA and parallel parameters of the old database to confirm my idea. And it turned out to be right. The SGA of the previous database was smaller than the current one.

So, putting it all together, my idea was this: we had migrated from a single-instance database to a two-node RAC with much more powerful machines, and the parallel parameters were left at their defaults.
According to the Oracle manual (http://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#i1007334) and MOS Note 1678111.1 (the 11.2-related section), the default DOP for a SQL statement (when the DOP is not specified by a hint, or on the table or index) is determined, among other parameters, by:

"The value of the parameter CPU_COUNT, which is, by default, the number of CPUs on the system, the number of RAC instances, and the value of the parameter PARALLEL_THREADS_PER_CPU"

So, the number of parallel servers used ranges from this value up to the value of PARALLEL_MAX_SERVERS.
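
These are the parameters I mean (the query is just a convenient way to see them; the numbers in the comment below are made up for illustration, the only real value from our system is PARALLEL_MAX_SERVERS = 940):

-- Parameters that drive the default DOP and the PX server limit (per instance)
SELECT name, value
FROM   v$parameter
WHERE  name IN ('cpu_count',
                'parallel_threads_per_cpu',
                'parallel_max_servers');

-- Illustration with made-up numbers: with cpu_count = 64, parallel_threads_per_cpu = 2
-- and 2 RAC instances, the default DOP would be 64 * 2 * 2 = 256, and a single query
-- using producer/consumer slave sets can grab 2 * 256 = 512 PX servers, which is
-- roughly what the wait-event dump above shows (256 + 128 + 128).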

Our new servers were far more powerful than the old one and, on top of that, formed a RAC. As the DOP was not set, Oracle decided to spawn a high number of parallel servers (our parameter PARALLEL_MAX_SERVERS was set to 940). As the SGA was not very big (it was set only a bit bigger than the old one), it could not handle this amount of servers, which provoked the errors.

Well, I did not know the old execution plan or whether it was using parallelism, so, since this was a batch job, I decided not to mess with the execution plan or the query itself at first, and instead lowered the parameter PARALLEL_MAX_SERVERS to a much lower value (I tried 80).
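
For the record, on a RAC the change itself is a one-liner (this is the generic form; adjust SCOPE and SID as needed):

-- Cap the number of PX servers per instance on all RAC nodes
ALTER SYSTEM SET parallel_max_servers = 80 SCOPE=BOTH SID='*';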

This was the end of the problem. The next day the job ran smoothly.

This is an example of how adding more resources, or increasing the current ones, is sometimes not the best way to deal with a performance problem. In this case, increasing the SGA to a very large size would have been a waste of memory and resources and, moreover, might not even have solved the issue.










