It’s just bad code or bad design … most of the time

Some years ago I wrote an article for the UKOUG magazine called “Want a faster database – Take a drive on the M25”.  For those not familiar with the United Kingdom, the M25 is one of its busiest roads (M = “motorway”) and because it moves so much traffic, and runs so close to capacity, it has often been referred to as “the world’s largest car park”.  Many people have probably spent a good part of their lives on the M25 Smile  I used the M25 as a metaphor for how database professionals can focus on the wrong things when trying to solve a performance problem, such as:

“I’m stuck in traffic…perhaps a faster car will help”

ie, throwing CPU at a problem that is not CPU bound will not help things, or

“I’m stuck in traffic…it must be the width of the paint on the lane markings”

ie, looking at the mutex hit rates is perhaps not the first place you should look at to solve a performance issue.

I concluded the article with the statement:

“…many Oracle professionals have forgotten the most important weapon in planning, designing, and implementing a successful Oracle database – that is, common sense.”

 

I wrote the article (I think) around 15 years ago, and yet our zeal for digging around metrics and internals when trying to solve performance issues before we examine the very basic components of a problem remain.

I saw a tweet recently that led to a thread on the database community forum recently that demonstrates this most effectively.  The were very early warning signs that the poster had jumped to the detail before looking holistically at the problem , simply due to the title of the post – “AWR Report High DB CPU Time and low SQL execute elapsed time”.

Because that is not a problem.

Namely, it is not a problem  in the sense that, I’ve never heard a customer call up a support department and say “I have some concerns over the ratio of DB CPU time to my SQL elapsed time” Smile  Customers simply say “My application is too slow”, or if that “customer” is the operations department, they will say “Our server is reaching its capacity limits” or similar.  In fact, it took 9 more posts before our poster finally described what the problem was: “We have a billing time approx. 20 hours currently. If we’ll reduce billing time to 8 hours, the problem will be solved”.  And when people start their description of an issue with the too low level of detail, it is like click-bait to the well intentioned people in the community that feel inspired to help.  What followed in the thread was a series of interactions along the lines of (paraphrased):

  • “What is the latch activity like?”
  • “What background waits are dominating?”
  • “What is the CPU consumption like?”
  • “What is the Oracle version to 4 decimal places?”
  • “How many cores?”
  • “Bare metal or virtualised”

And such detailed questions are then just catalysts for low level options being proposed as solutions.  In this case, there were discussions such as to whether the parameter filesystemio_options be altered, or whether some asynchronous I/O tunables should be examined.  And then discussions about whether the database needed to be patched or upgraded.  As I mentioned, it took some time before someone finally asked what the actual problem they were trying to solve was, and a full AWR report was made available.  Once the AWR report was visible, some people posted on the fairly obvious issue, an average over 25,000 calls to the database per second.  That might be reasonable depending on the size and activity on the database, but when combined with another statistic, namely 44,000 calls per transaction,  it starts to sing out as poor application design.  Because if the application was a busy but efficient OLTP system, we might see 25,000 calls per second, but small calls per transaction.  And if it was a busy but efficient batch system, we might see larger calls per transaction but unlikely to see so many calls per second.  Of course, maybe this system is both, but later in the AWR report we could see that a single SQL was executed over 75,000,000 times in the sample period so the evidence of poor design was mounting up fast.

Eventually some learned minds contributed to the thread pointing out the strong likelihood of an application doing row by row processing (or “slow by slow” as its often called) to achieve the requirements of the batch run.  It’s probably the most inefficient way of getting work done in an Oracle database.  The database is not the problem.  Now people may accuse me of bias on that statement (because I work for Oracle) but it is true.  The Oracle database software is astoundingly good at getting every last drop of performance out of a server.  And I’ve had that view for the 20 years I’ve been involved with the Oracle database software before I joined the company Smile Yes, there are some times when the core issue is due to a bug or defect in the product, but in the overwhelmingly majority of cases – if you write good code on top of a good design, the database layer will meet or exceed your most demanding of requirements.  Especially in these days where modern hardware is just so incredibly quick and scalable, both in the computing tier and storage tier, there are very few applications that will ever need any bespoke facilities over and above what comes with the delivered database product.

Similarly, I want to stress – I’m not trying to be critical of the the original poster of the thread, and the subsequent replies.  I’m just as guilty as anyone, in that it’s tremendously interesting to dive down into the nitty gritty, to swim in the mire of latches, buffers, mutexes, CPU cycles and wait events looking for some nugget of gold.  And I’m also not dismissing the importance of being knowledgeable in these detailed lower level areas of the database technology, because ultimately we may need to perform investigations at this level to solve some of the more niche problems we encounter in our jobs.  But it is a question of timing – these are not the first port of call.  We should always start with the simple:

  • What are you trying to achieve?
  • How are you trying to achieve it?
  • Are you doing so in a sensible way?
  • Do you even know the way in which you are trying to do it?

We didn’t need to look at the DB CPU to SQL elapsed ratio in this example.  We didn’t need to look at latches.  We just needed to consider the task that was being attempted (billing run, which by its very name sounds like a batch-style process) and then look at the how the task was being tackled, and 25,000+ user calls per second doesn’t sound “batch-like”.

So please, step back from the detail, at least initially.

3 thoughts on “It’s just bad code or bad design … most of the time

  1. The interesting thing here is that the OP for the discussion referenced stats “Unfortunately this is standard Oracle Customer Care and Billing functionality and design” …

  2. While I totally agree with what you say, subsequent additions to the thread show another forum communication issue:

    Sometimes you need to answer the question that is asked.
    Sometimes (as Tom and many others have said) you need to say “why do you ask?”

    The actual answer to the original question appears to be Oracle’s accounting of cpu on a VM may be misleading. As Jonathan put it, “When you’re under extreme CPU your virtual machine will behave as if it has16 CPUs (the core count) rather than the 32, and the CPU accounting will get completely messed up.” Is Oracle “wrong?” No, you just have to understand the level of detail and why.

    You can also think you are going faster in traffic by looking at kilometers per hour instead of miles per hour 🙂

    joel garry
    unable to login to wordpress for some reason.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s