SQL tuning with Hashing and Sorting

Posted by

Aggregation

SQL is not all just primary key lookups, although some applications sadly see SQL as only useful in that way. As soon as you have data, you will always have the need to aggregate it, and SQL is the perfect vehicle for that. I did a video recently which I will embed below that was discussing the new 19c COUNT DISTINCT facilities which can dramatically speed up query processing that have requirements including the DISTINCT clause. One of the things I covered was the difference between our traditional aggregation facilities of sorting the data versus our new improved hashing facilities.

This blog post is inspired by and AskTOM question which came in recently, which demonstrates that depending on your requirements, sometimes we might need to fall back onto the traditional style sorting mechanism

A request came in to improve the performance of the following query:


select col,
       avg(timestamp2-timestamp1),
       min(timestamp2-timestamp1),
       max(timestamp2-timestamp1),
       median(timestamp2-timestamp1)
from   [large table]
group by col

To replicate the example, I have my table called T which is just 200 copies of the DBA_OBJECTS view. I want to do some aggregation around the owner column to collect some typical metrics.


SQL> create table t tablespace largets nologging
  2  as select
  3  OWNER
  4  ,OBJECT_NAME
  5  ,SUBOBJECT_NAME
  6  ,OBJECT_ID
  7  ,DATA_OBJECT_ID
  8  ,OBJECT_TYPE
  9  ,cast(CREATED as timestamp) created
 10  ,cast(LAST_DDL_TIME as timestamp) LAST_DDL_TIME
 11  ,TIMESTAMP
 12  ,STATUS
 13  ,TEMPORARY
 14  ,GENERATED
 15  ,SECONDARY
 16  ,NAMESPACE
 17  ,EDITION_NAME
 18  ,SHARING
 19  ,EDITIONABLE
 20  ,ORACLE_MAINTAINED
 21  from dba_objects d,
 22   ( select 1 from dual connect by level <= 200 );

Table created.

SQL> select
  2    owner,
  3    max(created - last_ddl_time) as MIN_LAG,
  4    min(created - last_ddl_time) as MAX_LAG,
  5    avg(extract(second from created - last_ddl_time)+
  6        extract(minute from created - last_ddl_time) * 60 +
  7        extract(hour from created -  last_ddl_time) * 60 * 60 +
  8        extract(day from created - last_ddl_time) * 60 * 60 * 24) AVG_LAG,
  9    median(extract(second from created - last_ddl_time)+
 10              extract(minute from created - last_ddl_time) * 60 +
 11              extract(hour from created - last_ddl_time) * 60 * 60 +
 12            extract(day from created - last_ddl_time) * 60 * 60 * 24) MEDIAN_LAG
 13  from t
 14  group by owner;

37 rows selected.

Elapsed: 00:00:56.81

You can see that the query elapsed time is just under one minute but I was also surprised to see what the execution plan was:


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    37 |  1036 | 81466   (3)| 00:00:04 |
|   1 |  SORT GROUP BY     |      |    37 |  1036 | 81466   (3)| 00:00:04 |
|   2 |   TABLE ACCESS FULL| T    |    16M|   439M| 80134   (1)| 00:00:04 |
---------------------------------------------------------------------------

Aggregations of this type I would have expected to be a HASH GROUP BY not a SORT GROUP BY. The only “out of the ordinary” function being used here is the MEDIAN function so I removed that and reran the query:


SQL> select
  2    owner,
  3    max(created - last_ddl_time) as MIN_LAG,
  4    min(created - last_ddl_time) as MAX_LAG,
  5    avg(extract(second from created - last_ddl_time)+
  6        extract(minute from created - last_ddl_time) * 60 +
  7        extract(hour from created -  last_ddl_time) * 60 * 60 +
  8        extract(day from created - last_ddl_time) * 60 * 60 * 24) AVG_LAG
  9  from t
 10  group by owner;

37 rows selected.

Elapsed: 00:00:09.04

You can see the dramatic improvement in performance. Initially I thought this was just the CPU cost of evaluating a median, but when I looked at the execution plan we see a significant difference:


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    37 |  1036 | 81466   (3)| 00:00:04 |
|   1 |  HASH GROUP BY     |      |    37 |  1036 | 81466   (3)| 00:00:04 |
|   2 |   TABLE ACCESS FULL| T    |    16M|   439M| 80134   (1)| 00:00:04 |
---------------------------------------------------------------------------

We have now benefitted from HASH GROUP BY operation. Thus it appears that the MEDIAN function does not natively align with the hashing operation and thus we need to fall back to some heavy duty sorting of the input data. It is always worthwhile to check the execution plan for any form of aggregation that you are doing, because if you see a large sorting operation then the tuning mechanisms can often be focused around simply bumping up your PGA allocation

For the original AskTOM question due to the size of the source table, ultimately I offered a somewhat counter intuitive solution, this being to scan the table twice; once for all the operations that could be done with a HASH GROUP BY, and then a second time for the operations that required a sort, in this case, just the MEDIAN. By doing this second scan in isolation, we could keep to a the size of the data set that would need to be sorted down to a minimum in order to fit within PGA allocations for the session.

Replicating that approach in my test case I managed to get a performance improvement of over 50%.


SQL> with t1 as
  2   ( select /*+ materialize */
  3        owner,
  4        max(created - last_ddl_time) as MIN_LAG,
  5        min(created - last_ddl_time) as MAX_LAG,
  6        avg(extract(second from created - last_ddl_time)+
  7            extract(minute from created - last_ddl_time) * 60 +
  8            extract(hour from created -  last_ddl_time) * 60 * 60 +
  9            extract(day from created - last_ddl_time) * 60 * 60 * 24) AVG_LAG
 10      from t
 11      group by owner
 12  ),
 13  t2 as
 14  ( select /*+ materialize */
 15    owner,
 16    median(created - last_ddl_time ) MEDIAN_LAG
 17  from t
 18  group by owner
 19  )
 20  select *
 21  from t1, t2
 22  where t1.owner = t2.owner;

37 rows selected.

Elapsed: 00:00:19.44

(Footnote: This also goes to show the scanning a table in Oracle on modern storage is so fast nowadays that we can choose a method which scans it twice and still get benefits!)

Summary

Although most aggregation facilities now in Oracle can take advantage of some cool hashing algorithms, it is important to always be cognizant of which functions may not be able to do that and adapt your queries accordingly. This is easily detected via looking at the execution plan.

9 comments

  1. Hello Connor, Since we are talking about tunning, can you help to clarify this:

    I have read some articles saying that I should set OPTIMIZER_CAPTURE_SQL_PLAN_BASELINES=TRUE and other articles that I should NOT set OPTIMIZER_CAPTURE_SQL_PLAN_BASELINES=TRUE

    Thoughts?

    Regards,
    jorge

  2. Nice!
    I have a query that takes more than 45 minutes to complete. SELECT * from (select * from (select a.*, rownum DefaultSortColumn from(v_report_summary) a));
    and using /*+ ALL_ROWS */, takes less than 12 seconds, most of the info I found, said not to use /*+ ALL_ROWS */ in an OLTP.
    Do you see any issue if I use /*+ ALL_ROWS */?
    thanks
    regards,
    jorge

    1. Hard to say.

      I’m guessing that “v_report_summary” is a view. Most likely, something in the view is blocking any first row style optimizations. But if its runs faster than other alternatives, it would seem the logical choice

      1. That’s what I told my developers, but they keep saying that the view is fine.
        Thanks for getting back.

        regards,
        jorge

  3. Hello, nice to meet you.
    I’ve been through the same thing.

    You can use HASH GROUP BY using approximation functions in version 12c. Please note.

    APPROX_MEDIAN(DCOL_SPEC)
    —————————————————-
    | Id | Operation | Name |
    —————————————————-
    | 0 | SELECT STATEMENT | |
    | 1 | SORT ORDER BY | |
    | 2 | HASH GROUP BY APPROX | |
    | 3 | TABLE ACCESS STORAGE FULL| CONFIG_TABLE |
    —————————————————-

    But this may cause some difference in the result. because it’s an approximation.
    So I think your method is optimal. All right.

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.