Aggregation
SQL is not just primary key lookups, although some applications sadly treat it that way. As soon as you have data, you will have the need to aggregate it, and SQL is the perfect vehicle for that. I did a video recently, which I will embed below, discussing the new 19c COUNT DISTINCT facilities, which can dramatically speed up queries that include the DISTINCT clause. One of the things I covered was the difference between our traditional aggregation approach of sorting the data versus our newer, improved hashing facilities.
This blog post is inspired by an AskTOM question which came in recently, and it demonstrates that depending on your requirements, sometimes we need to fall back on the traditional sort-based mechanism.
A request came in to improve the performance of the following query:
select col,
avg(timestamp2-timestamp1),
min(timestamp2-timestamp1),
max(timestamp2-timestamp1),
median(timestamp2-timestamp1)
from [large table]
group by col
To replicate the example, I have a table called T which is just 200 copies of the DBA_OBJECTS view. I want to do some aggregation on the OWNER column to collect some typical metrics.
SQL> create table t tablespace largets nologging
2 as select
3 OWNER
4 ,OBJECT_NAME
5 ,SUBOBJECT_NAME
6 ,OBJECT_ID
7 ,DATA_OBJECT_ID
8 ,OBJECT_TYPE
9 ,cast(CREATED as timestamp) created
10 ,cast(LAST_DDL_TIME as timestamp) LAST_DDL_TIME
11 ,TIMESTAMP
12 ,STATUS
13 ,TEMPORARY
14 ,GENERATED
15 ,SECONDARY
16 ,NAMESPACE
17 ,EDITION_NAME
18 ,SHARING
19 ,EDITIONABLE
20 ,ORACLE_MAINTAINED
21 from dba_objects d,
22 ( select 1 from dual connect by level <= 200 );
Table created.
SQL> select
2 owner,
3 max(created - last_ddl_time) as MIN_LAG,
4 min(created - last_ddl_time) as MAX_LAG,
5 avg(extract(second from created - last_ddl_time)+
6 extract(minute from created - last_ddl_time) * 60 +
7 extract(hour from created - last_ddl_time) * 60 * 60 +
8 extract(day from created - last_ddl_time) * 60 * 60 * 24) AVG_LAG,
9 median(extract(second from created - last_ddl_time)+
10 extract(minute from created - last_ddl_time) * 60 +
11 extract(hour from created - last_ddl_time) * 60 * 60 +
12 extract(day from created - last_ddl_time) * 60 * 60 * 24) MEDIAN_LAG
13 from t
14 group by owner;
37 rows selected.
Elapsed: 00:00:56.81
You can see that the query elapsed time is just under one minute, but I was also surprised by the execution plan:
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 37 | 1036 | 81466 (3)| 00:00:04 |
| 1 | SORT GROUP BY | | 37 | 1036 | 81466 (3)| 00:00:04 |
| 2 | TABLE ACCESS FULL| T | 16M| 439M| 80134 (1)| 00:00:04 |
---------------------------------------------------------------------------
For aggregations of this type I would have expected a HASH GROUP BY, not a SORT GROUP BY. The only “out of the ordinary” function being used here is MEDIAN, so I removed it and reran the query:
SQL> select
2 owner,
3 max(created - last_ddl_time) as MIN_LAG,
4 min(created - last_ddl_time) as MAX_LAG,
5 avg(extract(second from created - last_ddl_time)+
6 extract(minute from created - last_ddl_time) * 60 +
7 extract(hour from created - last_ddl_time) * 60 * 60 +
8 extract(day from created - last_ddl_time) * 60 * 60 * 24) AVG_LAG
9 from t
10 group by owner;
37 rows selected.
Elapsed: 00:00:09.04
You can see the dramatic improvement in performance. Initially I thought this was just the CPU cost of evaluating a median, but when I looked at the execution plan I saw a significant difference:
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 37 | 1036 | 81466 (3)| 00:00:04 |
| 1 | HASH GROUP BY | | 37 | 1036 | 81466 (3)| 00:00:04 |
| 2 | TABLE ACCESS FULL| T | 16M| 439M| 80134 (1)| 00:00:04 |
---------------------------------------------------------------------------
We have now benefited from the HASH GROUP BY operation. It appears that the MEDIAN function does not align with the hashing mechanism, so we must fall back to some heavy-duty sorting of the input data. It is always worthwhile checking the execution plan for any form of aggregation you are doing, because if you see a large sort operation, the tuning effort can often be focused on simply bumping up your PGA allocation.
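As a rough guide (a sketch only; the exact figures and required privileges will vary from system to system), you can check whether your sort work areas are spilling to temp, and if so, consider a larger PGA target:

-- Look for recent sort work areas that spilled to disk
-- (LAST_TEMPSEG_SIZE is null when the sort fit in memory).
select operation_type, last_memory_used, last_tempseg_size
from   v$sql_workarea
where  operation_type like '%SORT%';

-- If spills are common, a larger PGA target may help (the size is illustrative).
alter system set pga_aggregate_target = 4G;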
For the original AskTOM question, due to the size of the source table, I ultimately offered a somewhat counter-intuitive solution: scan the table twice; once for all the operations that could be done with a HASH GROUP BY, and a second time for the operations that required a sort, in this case just the MEDIAN. By doing the second scan in isolation, we could keep the size of the data set that needed to be sorted to a minimum, so that it would fit within the PGA allocation for the session.
Replicating that approach in my test case I managed to get a performance improvement of over 50%.
SQL> with t1 as
2 ( select /*+ materialize */
3 owner,
4 max(created - last_ddl_time) as MIN_LAG,
5 min(created - last_ddl_time) as MAX_LAG,
6 avg(extract(second from created - last_ddl_time)+
7 extract(minute from created - last_ddl_time) * 60 +
8 extract(hour from created - last_ddl_time) * 60 * 60 +
9 extract(day from created - last_ddl_time) * 60 * 60 * 24) AVG_LAG
10 from t
11 group by owner
12 ),
13 t2 as
14 ( select /*+ materialize */
15 owner,
16 median(created - last_ddl_time ) MEDIAN_LAG
17 from t
18 group by owner
19 )
20 select *
21 from t1, t2
22 where t1.owner = t2.owner;
37 rows selected.
Elapsed: 00:00:19.44
(Footnote: This also goes to show that scanning a table in Oracle on modern storage is so fast nowadays that we can choose a method which scans it twice and still come out ahead!)
Summary
Although most aggregation facilities in Oracle can now take advantage of some cool hashing algorithms, it is important to always be cognizant of which functions cannot, and adapt your queries accordingly. This is easily detected by looking at the execution plan.
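For example, here is a minimal sketch of that check using DBMS_XPLAN against my test table from above:

-- Generate and display the plan without running the query; a SORT GROUP BY
-- here is the cue to think about PGA, or to restructure the query as shown above.
explain plan for
  select owner, median(created - last_ddl_time)
  from   t
  group  by owner;

select * from table(dbms_xplan.display);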
Hello Connor, since we are talking about tuning, can you help to clarify this:
I have read some articles saying that I should set OPTIMIZER_CAPTURE_SQL_PLAN_BASELINES=TRUE, and other articles saying that I should NOT set it.
Thoughts?
Regards,
jorge
For things like upgrades etc., it's invaluable to keep all plans the same throughout the upgrade period.
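A minimal sketch of that approach (these are system-level settings, so test them in your own environment first):

-- Capture a baseline for each repeatable statement while the workload runs...
alter system set optimizer_capture_sql_plan_baselines = true;

-- ...then turn capture off once the workload has been seen:
alter system set optimizer_capture_sql_plan_baselines = false;

-- Review what was captured:
select sql_handle, plan_name, enabled, accepted
from   dba_sql_plan_baselines;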
Thanks
Nice!
I have a query that takes more than 45 minutes to complete: SELECT * from (select * from (select a.*, rownum DefaultSortColumn from v_report_summary a));
and using /*+ ALL_ROWS */ it takes less than 12 seconds. Most of the info I found said not to use /*+ ALL_ROWS */ in an OLTP system.
Do you see any issue if I use /*+ ALL_ROWS */?
thanks
regards,
jorge
Hard to say.
I’m guessing that “v_report_summary” is a view. Most likely, something in the view is blocking any first-rows-style optimizations. But if it runs faster than the other alternatives, it would seem the logical choice.
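If you do stick with the hint, note that the optimizer-goal hints belong in the outermost query block; a sketch against your view:

-- A sketch only: ALL_ROWS asks the optimizer to optimize for total
-- throughput rather than first-rows response time.
select /*+ ALL_ROWS */ *
from   (select a.*, rownum DefaultSortColumn
        from   v_report_summary a);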
That’s what I told my developers, but they keep saying that the view is fine.
Thanks for getting back.
regards,
jorge
Hello, nice to meet you.
I’ve been through the same thing.
You can get a HASH GROUP BY by using the approximation functions in version 12c, for example:
APPROX_MEDIAN(DCOL_SPEC)
------------------------------------------------------
| Id | Operation                     | Name         |
------------------------------------------------------
|  0 | SELECT STATEMENT              |              |
|  1 |  SORT ORDER BY                |              |
|  2 |   HASH GROUP BY APPROX        |              |
|  3 |    TABLE ACCESS STORAGE FULL  | CONFIG_TABLE |
------------------------------------------------------
But this may cause some difference in the result, because it's an approximation. So I think your method is optimal.
Nice input. In particular, I reckon in many cases (for massive data sets) you could give someone an approximate median and they'd never know the difference.
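For anyone wanting to try it, here is a sketch of my earlier MEDIAN query rewritten with APPROX_MEDIAN (assuming a release where the function is available, and remembering that the result is approximate by design):

-- Approximate median of the lag in seconds; eligible for a HASH GROUP BY.
select owner,
       approx_median(extract(second from created - last_ddl_time) +
                     extract(minute from created - last_ddl_time) * 60 +
                     extract(hour   from created - last_ddl_time) * 60 * 60 +
                     extract(day    from created - last_ddl_time) * 60 * 60 * 24) approx_median_lag
from   t
group  by owner;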