Signal to noise ratio – the AWR trap – Learning is not a spectator sport

Its never been easier to analyze an AWR report. With our new oracle skills repository, your favourite agent can do a respectable job giving some insights into the AWR report you created for your database.

In my case, once codex has churned through the report using its various skills, the driving recommendation was as below

and if I take a look at the original AWR report and hone in on the all important “DB Time”, you can see where this recommendation was most likely sourced from:

Half the overall wait time was lock contention, so that clearly should be my first avenue of investigation right?

When dealing with AWR reports, it is important to remember there are probably two reasons why you are looking at the AWR report in the first place:

Someone has asked you to build an overview about general performance/activity on the database system
Someone has experienced a potential problem with the database (or more likely the apps that run against that database) and has asked you to investigate what happened.

Moreover, it’s important to be aware as a database professional that when someone asks you for (1), then it is quite possible that the catalyst for this request was in fact related to (2), ie, something seems to be wrong. I’ve lost count of the number of times I’ve got a phone call or email along the lines of: “Hey, can you take a look at the server” which is polite-speak for “I think something might be wrong” 🙂

Back to the AWR at hand. If you are being asked to investigate a problem, then it is quite possible that the AWR report (and any recommendations you or an AI deduce from it) are totally wrong. Because contrary to popular opinion, the most important section of an AWR report is not

the DB time breakdown, or
the Wait time analysis, or
the SQL performance metrics.

The single most important part of an AWR is this line here:

namely, for what time period does the AWR report span.

If that is not appropriate for the problem you are trying to investigate, then your AWR report is at best useless, and at worst, going to lead you in the wrong direction.

By default, workload snapshots are taken every hour in your database. If you have an issue that lasts for a good portion on an hour, then using the standard AWR reporting will probably be fine. But often that is not the case. Often customers will report things like “Hey, at 9:20am, things went really bad for 5 mins but now they’re OK. Do you know what happened?“. If you grab an AWR report for 9am to 10am, then you might get lucky and make some discoveries, but you can also just a likely be lead to a wrong conclusion.

For example, back to the AWR report that I asked codex to assess. Let’s assume that was because someone called me with “Hey, our APEX apps were running really slow about 10mins ago“. Now, because I artificially created this example, I can tell you exactly what the cause was:

I created an APEX “hammer bot” that absolutely smashed my system with APEX requests to a page that returned a huge amount of content. This meant I had dozens of sessions all clamouring for huge amounts of PGA to hold the output buffer. That bot ran for 5 mins.
Then I did nothing for a while, which meant the only activity on my database was the scheduler.
Then I created a simple blocking wait. Several sessions all waiting to delete a common row from a common table.

Those blocking waits had zero impact on any APEX sessions. The Codex analysis of my AWR report is correct but is totally irrelevant to the issue at hand. That is not the fault of Codex – it is the standard premise of “garbage in, garbage out“.

Because I engineered the issue, I also took snapshots at the appropriate time. Here’s the AWR report for snapshots that surrounded the APEX hammer bot period.

Unsurprisingly, 60% of my time was lost on PGA competition because my bot was hammering the PGA 🙂 It’s a very accurate assessment of the activity because I snapshotted precisely around the bot workload

So does this mean we have to take AWR snapshots every 3 minutes to ensure we get good data? No. Well, if you want to, then you can, but you’ll probably find the busiest things in your AWR report are … the taking of AWR snapshots. But here’s some recommendations I generally apply on my databases:

Unless your system is running close to its resource capacity, I prefer 30min snapshots over 1 hour. It is just too easy for problems to get swallowed up in the noise with a 1 hour period. Obviously tailor this advice for your specific system. For example, if you have 1000s of sessions, then taking a snapshot might be more expensive than on a system with just a few hundred sessions.
When someone reports a problem, before heading straight to AWR, check out ASH data instead. It is far more granular, but has a smaller retention. Consider something like my infinite ASH routine to give you a longer retention. ASH data lets you hone in to precisely the timeframe of a reported issue.
Encourage your customers to be vocal when they have an issue. (This is normally not difficult :-)). It is far easier to diagnose a problem happening “now” than it it is to diagnose one that happened a fortnight ago.
Be proactive when customers are vocal. A lot of workplaces will respond to a customer concern by logging a ticket in a system, with that ticket eventually getting to an engineer days later. That doesn’t help anyone. If you’re engineers are already flat out, then build a process that automatically grabs snapshots or records the ASH data when the ticket is logged, so that an engineer (or your favourite AI agent) has some good data to work with.

Learning is not a spectator sport

Leave a ReplyCancel reply

Trending

The APEX_COLLECTIONS crash puzzler…

Not the post I was expecting

The “best” tool

Tip: Expanding a SQL statement