Tuesday, July 25, 2017

ASL - My thermometer for the database

I suppose most of you DBAs have been asked, at some point, questions like "How is the general performance of the database?", "I think there is a problem somewhere, is the database running well?", or "Yesterday, at some moments of the day, the database ran slow. Can you check what happened?".

Sure, those questions sound familiar to you. How could we answer them? Of course, we can look at the typical ratios, check AWR or Statspack reports, take a look at the OS, and so on.




Here I share the method I use. Sure, there are other ways, but this one has proven to be a fast and reliable way to take the "temperature" of the database, that is, to quickly check its general performance and to identify moments of the day with high load or performance issues.

The metric I use is ASL (Average Session Load). I first read about it some years ago in a paper by Kyle Hailey: Average Session Load (ASL), The Golden Metric?
(http://www.skillbuilders.com/download/download-resource.cfm?file=skillbuilders_oracle_performance_tuning_kyle_hailey_Jan_2008.pdf)

The paper explains the metric in more depth and gives very useful tips and ways to use it. Personally, I use it for a quick check.

Average Session Load also appears in OEM as Average Active Sessions. If you want all the details about the metric, I strongly recommend reading the paper. To make a long story short, ASL is the relationship between DB Time and Elapsed Time:

ASL = DB Time / Elapsed Time

Here, DB Time is the total time spent by Oracle sessions on CPU plus in non-idle waits, and elapsed time is the length of the measured period. Inactive sessions do not accumulate DB Time, so the ratio gives a good idea of the activity of the database.
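For instance, with hypothetical numbers: if a 60-minute snapshot interval accumulates 120 minutes of DB Time, then

ASL = 120 min / 60 min = 2

that is, on average two sessions were active at any given instant during that hour.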

What I do is get the ASL from AWR, put it in a spreadsheet and generate a chart (with Excel, R, or Python using Matplotlib); then I check the chart. There I can see whether there are spikes, patterns that reflect the usual load, periods of high load, and so on.
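As a sketch of the export step (assuming SQL*Plus 12.2 or newer for SET MARKUP CSV, and a hypothetical file name), the query output can be spooled straight into a CSV file that Excel, R or Matplotlib can read:

SET MARKUP CSV ON          -- available from SQL*Plus 12.2 onwards
SPOOL asl_last_day.csv     -- hypothetical file name
-- ... run the ASL query shown below ...
SPOOL OFF
SET MARKUP CSV OFF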

To interpret the values, these are the rough guidelines (a quick query to check the number of CPUs is shown after the list):

ASL < 1: the database is not blocked
ASL ~= 0: the database is basically idle
ASL < number of CPUs: CPU is available; the database is probably not blocked
ASL > number of CPUs: there could be performance problems
ASL >> number of CPUs: there is a bottleneck
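Since the thresholds are relative to the number of CPUs, it helps to check how many CPUs the database actually sees. A minimal sketch, using the standard V$OSSTAT view:

-- CPUs (and cores) as reported by the OS statistics view
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('NUM_CPUS', 'NUM_CPU_CORES');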

These are the queries I use:

SINGLE INSTANCE:

COL snap_B_time format A40
col snap_E_time format A40
set lines 200
set pages 500
SELECT kv.hora_inicio AS snap_B_time,
       kv.hora_fin    AS snap_E_time,
       kv.inicio      AS start_snap_id,
       kv.fin         AS end_snap_id,
       (s2.value - s1.value)/(kv.intervalo*1000000) AS ASL
FROM  (SELECT iv.start_snap_time AS hora_inicio,
              iv.end_snap_time   AS hora_fin,
              iv.start_snap_id   AS inicio,
              iv.end_snap_id     AS fin,
              extract(hour   FROM (iv.end_snap_time - iv.start_snap_time))*3600
            + extract(minute FROM (iv.end_snap_time - iv.start_snap_time))*60
            + extract(second FROM (iv.end_snap_time - iv.start_snap_time)) AS intervalo
       FROM  (SELECT lag(dbid) OVER (ORDER BY dbid, snap_id)              AS start_dbid,
                     dbid                                                 AS end_dbid,
                     lag(snap_id) OVER (ORDER BY dbid, snap_id)           AS start_snap_id,
                     snap_id                                              AS end_snap_id,
                     lag(end_interval_time) OVER (ORDER BY dbid, snap_id) AS start_snap_time,
                     end_interval_time                                    AS end_snap_time,
                     lag(startup_time) OVER (ORDER BY dbid, snap_id)      AS start_startup_time,
                     startup_time                                         AS end_startup_time
              FROM sys.wrm$_snapshot) iv
       WHERE iv.start_snap_id IS NOT NULL
       AND   iv.start_dbid = iv.end_dbid
       AND   iv.start_startup_time = iv.end_startup_time) kv,
      sys.wrh$_sys_time_model s1,
      sys.wrh$_sys_time_model s2,
      v$sysstat n
WHERE kv.inicio  = s1.snap_id
AND   kv.fin     = s2.snap_id
AND   n.name     = 'DB time'
AND   s1.stat_id = s2.stat_id
AND   s2.stat_id = n.stat_id
AND   kv.hora_inicio > trunc(sysdate - 1)
ORDER BY kv.hora_inicio;


Modify the predicate kv.hora_inicio > trunc(sysdate - 1) to suit the time range you need.
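For example, to look only at a specific window (the dates here are just hypothetical), that predicate can be replaced with something like:

kv.hora_inicio BETWEEN TO_DATE('2017-07-24 08:00', 'YYYY-MM-DD HH24:MI')
                   AND TO_DATE('2017-07-24 18:00', 'YYYY-MM-DD HH24:MI')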

RAC:

COL snap_B_time format A40
col snap_E_time format A40
set lines 200
set pages 50
SELECT substr(kv.hora_inicio,1,15) AS snap_B_time,
       substr(kv.hora_fin,1,15)    AS snap_E_time,
       kv.inicio                   AS start_snap_id,
       kv.fin                      AS end_snap_id,
       kv.instance_nr              AS instancia,
       (s2.value - s1.value)/(kv.intervalo*1000000) AS ASL
FROM  (SELECT iv.start_snap_time AS hora_inicio,
              iv.end_snap_time   AS hora_fin,
              iv.start_snap_id   AS inicio,
              iv.end_snap_id     AS fin,
              extract(hour   FROM (iv.end_snap_time - iv.start_snap_time))*3600
            + extract(minute FROM (iv.end_snap_time - iv.start_snap_time))*60
            + extract(second FROM (iv.end_snap_time - iv.start_snap_time)) AS intervalo,
              iv.start_inst_nr AS instance_nr,
              iv.end_inst_nr   AS instance_nr_end
       FROM  (SELECT lag(dbid) OVER (ORDER BY dbid, instance_number, snap_id)              AS start_dbid,
                     dbid                                                                  AS end_dbid,
                     lag(snap_id) OVER (ORDER BY dbid, instance_number, snap_id)           AS start_snap_id,
                     snap_id                                                               AS end_snap_id,
                     lag(instance_number) OVER (ORDER BY dbid, instance_number, snap_id)   AS start_inst_nr,
                     instance_number                                                       AS end_inst_nr,
                     lag(end_interval_time) OVER (ORDER BY dbid, instance_number, snap_id) AS start_snap_time,
                     end_interval_time                                                     AS end_snap_time,
                     lag(startup_time) OVER (ORDER BY dbid, instance_number, snap_id)      AS start_startup_time,
                     startup_time                                                          AS end_startup_time
              FROM sys.wrm$_snapshot) iv
       WHERE iv.start_snap_id IS NOT NULL
       AND   iv.start_dbid = iv.end_dbid
       AND   iv.start_inst_nr = iv.end_inst_nr
       AND   iv.start_startup_time = iv.end_startup_time) kv,
      sys.wrh$_sys_time_model s1,
      sys.wrh$_sys_time_model s2,
      v$sysstat n
WHERE kv.inicio  = s1.snap_id
AND   kv.fin     = s2.snap_id
AND   n.name     = 'DB time'
AND   s1.stat_id = s2.stat_id
AND   s2.stat_id = n.stat_id
AND   s1.instance_number = s2.instance_number
AND   kv.instance_nr = s2.instance_number
AND   kv.hora_inicio > trunc(sysdate - 1)
AND   kv.instance_nr = &instance_number
ORDER BY kv.instance_nr, kv.hora_inicio;



As before, modify the predicate kv.hora_inicio > trunc(sysdate - 1) to suit the time range you need, and enter the instance number when prompted for &instance_number.
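If you prefer not to be prompted, the substitution variable can be defined beforehand in SQL*Plus (instance 1 here is just an example):

DEFINE instance_number = 1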

An example of a generated chart:





That system had 8 CPUs, so, as can be seen, the usual ASL does not exceed 4 and is normally between 1 and 3, but at one point in time there was a spike (it actually reached 40!). Seeing that, we know which AWR reports to take in order to investigate the spike. We can also say that, apart from this spike, the database ran well: there was some load, but CPU was still available, so there were no performance issues from a general point of view.
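For reference, once the begin and end snapshot ids around the spike are known from the query output, the standard AWR report script shipped with the database can be run from SQL*Plus; it prompts for those snapshot ids:

@?/rdbms/admin/awrrpt.sql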

Here is an example of the patterns along a week:




In the next picture, you can see that every pattern corresponds to a day of the week (sorry, I cannot show the labels because they do not fit in the page). Each period actually goes from 5 AM to 4-5 AM of the next day.



 
Looking at those pictures, we can see the general behaviour of the database. Obviously, if a problem is reported and we see that the pattern changes at some point in time, we can deduce that something happened there and we should investigate.

As you can see, this metric can be quite useful. I use it very often whenever issues are reported to me, as a way to decide where to start my troubleshooting.

Sure, there are other methods, but this one has worked quite well for me. I hope you like it and that it also helps you in your troubleshooting!

