No more stale statistics in 19c

There is an odd contradiction that we encounter with most databases, especially those used predominantly during the business day. It comes from the way we obtain and use optimizer statistics on those databases. The contradiction runs like this:

  • To minimize service disruption, we gather statistics at a quiet time, for example, in the middle of the night (the default automatic maintenance windows follow exactly this pattern; see the sketch after this list).
  • We then use those statistics during the business day whilst user activity is at its highest.
  • Highest user activity will typically mean the highest frequency of data changes.
  • Hence the statistics are at their peak accuracy when no-one is using them to optimize queries, and they are at their least accurate when everyone is using them to optimize queries!
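
If you want to see this schedule on your own system, here is a hedged sketch (standard data dictionary views, though double-check the names on your release) that lists the automatic optimizer statistics task and the maintenance windows it runs in, which by default are night-time and weekend windows:

-- A sketch only: check the automatic stats job and its maintenance windows
select client_name, status, window_group
from   dba_autotask_client
where  client_name = 'auto optimizer stats collection';

select window_name, repeat_interval, duration, enabled
from   dba_scheduler_windows
order  by window_name;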

We can demonstrate this easily with the following script run in 18c.


SQL> select banner from v$version where rownum = 1 ;

BANNER
--------------------------------------------------------------------------------
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

1 row selected.

SQL> create table t (
  2    id number(10),
  3    num number(10),
  4    constraint t_pk primary key ( id )
  5  );

Table created.

SQL> insert into t select rownum,rownum from dual connect by rownum<=10000;

10000 rows created.

SQL> exec dbms_stats.gather_table_stats(user,'t')

PL/SQL procedure successfully completed.

SQL> select stale_stats from user_tab_statistics
  2  where  table_name = 'T';

STALE_S
-------
NO

So far I’ve mimicked a table that has been populated with (say) a day’s worth of data (10,000 rows), and I’ve gathered statistics at the end of the business day so that my statistics reflect the current data in the table.



SQL> select table_name, num_rows, blocks
  2  from   user_tab_statistics
  3  where  table_name = 'T';

TABLE_NAME NUM_ROWS     BLOCKS
---------- -------- ----------
T             10000         20

1 row selected.

SQL> select table_name, column_name, low_value, high_value, num_distinct
  2  from   user_tab_col_statistics
  3  where  table_name = 'T';

TABLE_NAME COLUMN_NAME          LOW_VALUE       HIGH_VALUE      NUM_DISTINCT
---------- -------------------- --------------- --------------- ------------
T          ID                   C102            C302                   10000
T          NUM                  C102            C302                   10000

2 rows selected.

Hence, if I were to optimize and run a query now, I could expect an excellent estimate from the optimizer. The query below returns a COUNT of 1000 rows, and we can see from the PLAN_TABLE output that the estimated row count was also 1000. A perfect estimate by the optimizer!



SQL> select max(num),count(*)
  2  from t where id > 9000;

  MAX(NUM)   COUNT(*)
---------- ----------
     10000       1000

1 row selected.

SQL> select *
  2  from table(dbms_xplan.display_cursor(format=>'typical'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------

---------------------------------------------------------------------
| Id  | Operation                            | Name | Rows  | Bytes |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |       |       |
|   1 |  SORT AGGREGATE                      |      |     1 |     8 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |  1000 |  8000 |
|*  3 |    INDEX RANGE SCAN                  | T_PK |  1000 |       |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("ID">9000)


20 rows selected.

But of course, the next business day rolls on, and by mid-morning we may have added 1000 more rows to our table, bringing the total number of rows to 11000.


SQL> insert into t select rownum+10000,rownum+10000 from dual connect by rownum<=1000;

1000 rows created.

SQL> commit;

Commit complete.

Even with an explicit flush of the database monitoring information, we have not yet reached the threshold at which this table’s statistics would be considered stale, so even a GATHER STALE operation would have no effect if we ran one. The statistics still reflect the data from last night.


SQL> exec dbms_stats.flush_database_monitoring_info;

PL/SQL procedure successfully completed.

SQL> select stale_stats from user_tab_statistics
  2  where  table_name = 'T';

STALE_S
-------
NO

SQL> select table_name, num_rows, blocks
  2  from   user_tab_statistics
  3  where  table_name = 'T';

TABLE_NAME NUM_ROWS     BLOCKS
---------- -------- ----------
T             10000         20

1 row selected.

SQL> select table_name, column_name, low_value, high_value
  2  from   user_tab_col_statistics
  3  where  table_name = 'T';

TABLE_NAME COLUMN_NAME          LOW_VALUE       HIGH_VALUE
---------- -------------------- --------------- ---------------
T          ID                   C102            C302
T          NUM                  C102            C302

2 rows selected.
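
If you are wondering why 1000 new rows is not enough to flag the statistics as stale, the monitoring information and the staleness preference tell the story. Here is a hedged sketch (a standard dictionary view plus a documented DBMS_STATS call): the 1000 tracked inserts do not exceed the default STALE_PERCENT threshold of 10 percent of the 10000 rows recorded at the last gather.

-- Sketch: the tracked DML volume versus the staleness threshold
select table_name, inserts, updates, deletes
from   user_tab_modifications
where  table_name = 'T';

select dbms_stats.get_prefs('STALE_PERCENT', user, 'T') as stale_percent
from   dual;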

When I run the same query (and don’t forget, this would now be during the business day, when it is critical that I get a good optimizer plan), you can see that the result is now 2000 rows, but the plan estimate is left languishing at its original 1000 rows, because the statistics no longer reflect the current state of the table.


SQL> select max(num),count(*)
  2  from t where id > 9000;

  MAX(NUM)   COUNT(*)
---------- ----------
     11000       2000

1 row selected.

SQL> select *
  2  from table(dbms_xplan.display_cursor(format=>'typical'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------
SQL_ID  8juuu5warw2z6, child number 0
-------------------------------------
select max(num),count(*) from t where id > 9000

Plan hash value: 2053823973

---------------------------------------------------------------------
| Id  | Operation                            | Name | Rows  | Bytes |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |       |       |
|   1 |  SORT AGGREGATE                      |      |     1 |     8 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |  1000 |  8000 |
|*  3 |    INDEX RANGE SCAN                  | T_PK |  1000 |       |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("ID">9000)


20 rows selected.

Having great statistics that are not used, and poor statistics that are used all the time, seems a nonsensical way to do things, but there really hasn’t been any alternative unless you were prepared to consume precious server resources to collect statistics whilst your users were performing their transactions. I have known customers to do this in the past, but of course, it takes careful monitoring and management to ensure that the slicing up of the resource “pie” is done appropriately, keeping the right balance between business needs and background database tasks.

19c brings a nifty solution to this contradiction by having the database automatically maintain details about how data is changing in a table via a feature called Real Time Statistics. DML activities can be tracked, and that information can then be looped back as input into future query parsing. Here’s the same example from above, this time performed in 19c.

The same table is seeded with the same initial 10000 rows as before. I have included a new column in my data dictionary queries called NOTES. You will see the significance of that shortly.



SQL> select banner from v$version where rownum = 1;

BANNER
--------------------------------------------------------------------------
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

1 row selected.

SQL>
SQL> create table t (
  2    id number(10),
  3    num number(10),
  4    constraint t_pk primary key ( id )
  5  );

Table created.

SQL> insert into t select rownum,rownum from dual connect by rownum<=10000;

10000 rows created.

SQL> exec dbms_stats.gather_table_stats(user,'t')

PL/SQL procedure successfully completed.

SQL> select stale_stats from user_tab_statistics
  2  where  table_name = 'T';

STALE_S
-------
NO

1 row selected.

SQL> select notes, table_name, num_rows, blocks
  2  from   user_tab_statistics
  3  where  table_name = 'T';

NOTES                            TABLE_NAME NUM_ROWS     BLOCKS
-------------------------------- ---------- -------- ----------
                                 T             10000         20

1 row selected.

SQL> select notes, table_name, column_name, low_value, high_value, num_distinct
  2  from   user_tab_col_statistics
  3  where  table_name = 'T';

NOTES                            TABLE_NAME COLUMN_NAME          LOW_VALUE       HIGH_VALUE      NUM_DISTINCT
-------------------------------- ---------- -------------------- --------------- --------------- ------------
                                 T          ID                   C102            C302                   10000
                                 T          NUM                  C102            C302                   10000

And since this is a simple query, the optimizer performs exactly as per 18c for its estimate. Once again, at this point, where the data and statistics are in alignment, the estimate is perfect.


                                 
SQL> select max(num),count(*)
  2  from t where id > 9000;

  MAX(NUM)   COUNT(*)
---------- ----------
     10000       1000

1 row selected.                                 


SQL> select *
  2  from table(dbms_xplan.display_cursor(format=>'typical'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------

----------------------------------------------------------------------------------
| Id  | Operation                            | Name | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |       |       |     5 (100)|
|   1 |  SORT AGGREGATE                      |      |     1 |     8 |            |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |  1000 |  8000 |     5   (0)|
|*  3 |    INDEX RANGE SCAN                  | T_PK |  1000 |       |     3   (0)|
----------------------------------------------------------------------------------

The business day commences, and a fresh 1000 rows are added to the database.



SQL> insert into t select rownum+10000,rownum+10000 from dual connect by rownum<=1000;

1000 rows created.

SQL> commit;

Commit complete.

And just like before, this is not enough for the statistics on the table to be considered stale by the database.



SQL> exec dbms_stats.flush_database_monitoring_info;

PL/SQL procedure successfully completed.

SQL>
SQL> select stale_stats from user_tab_statistics
  2  where  table_name = 'T';

STALE_S
-------
NO

But here is where the 19c enhancements come into play. When we query the data dictionary, we now have TWO rows of statistics for this single table.



SQL> select notes, table_name, num_rows, blocks
  2  from   user_tab_statistics
  3  where  table_name = 'T';

NOTES                            TABLE_NAME NUM_ROWS     BLOCKS
-------------------------------- ---------- -------- ----------
                                 T             10000         20
STATS_ON_CONVENTIONAL_DML        T             11000         20

2 rows selected.

The first row is the original statistics we gathered. The second row is new statistics the database has derived from the normal DML activities that have occurred against the table. This is not a full GATHER operation; it simply takes advantage of basic arithmetic, e.g. if I have 10000 rows, then insert 50 and delete 20, I now have 10030 rows, and so on. The real-time statistics can also keep track of simple column-level statistics, such as whether the low and high values have changed.



SQL> select notes, table_name, column_name, low_value, high_value
  2  from   user_tab_col_statistics
  3  where  table_name = 'T';

NOTES                            TABLE_NAME COLUMN_NAME          LOW_VALUE       HIGH_VALUE
-------------------------------- ---------- -------------------- --------------- ----------
                                 T          ID                   C102            C302
                                 T          NUM                  C102            C302
STATS_ON_CONVENTIONAL_DML        T          ID                   C102            C302094A
STATS_ON_CONVENTIONAL_DML        T          NUM                  C102            C302094A
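
Those LOW_VALUE and HIGH_VALUE figures are stored in Oracle’s internal raw format, so they are not immediately readable. If you want to decode them, a small PL/SQL block along these lines should do the trick (a sketch using the documented DBMS_STATS.CONVERT_RAW_VALUE overload for numeric columns; pick the appropriate overload for other datatypes). Do not be surprised if the decoded high value on the real-time rows is close to, rather than exactly equal to, the true maximum, because, as discussed below, the tracking is deliberately lightweight.

set serveroutput on
declare
  l_value number;
begin
  for r in ( select column_name, high_value
             from   user_tab_col_statistics
             where  table_name = 'T'
             and    notes = 'STATS_ON_CONVENTIONAL_DML' )
  loop
    -- convert the internal raw representation of a numeric column to a NUMBER
    dbms_stats.convert_raw_value(r.high_value, l_value);
    dbms_output.put_line(r.column_name || ' real-time high value = ' || l_value);
  end loop;
end;
/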

The real test is whether these statistics can be beneficial to our sample query. I’ll run that through again.



SQL> select max(num),count(*)
  2  from t where id > 9000;

  MAX(NUM)   COUNT(*)
---------- ----------
     11000       2000

1 row selected.

SQL> select *
  2  from table(dbms_xplan.display_cursor(format=>'typical'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 2966233522

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |     7 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |  1895 | 15160 |     7   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("ID">9000)

Note
-----
   - dynamic statistics used: statistics for conventional DML

Notice that the estimate of 1895 is now much closer to the true value of 2000. You can expect a little variation from the true value because, as I mentioned before, real-time statistics is not about performing a full gather operation on the table; that would be far too resource hungry for every single DML. So some elements of the statistics (for example, the number of distinct values, or the histogram distribution) are not feasible to keep up to date every time a DML is performed. But even having the improved row counts and column extrema has yielded a benefit for the execution plan. When the normal nightly gather job comes along, these DML statistics are no longer relevant and will be expunged.
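
If you want to verify that for yourself before the re-gather, a query along these lines (a sketch against the same dictionary view used above) shows which column-level figures appear on the STATS_ON_CONVENTIONAL_DML rows compared to the conventionally gathered ones:

-- Sketch: compare what the real-time rows carry versus the gathered rows
select notes, column_name, num_distinct, histogram, last_analyzed
from   user_tab_col_statistics
where  table_name = 'T'
order  by column_name, notes nulls first;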


SQL> exec dbms_stats.gather_table_stats(user,'t')

PL/SQL procedure successfully completed.

SQL> select notes, table_name, num_rows, blocks
  2  from   user_tab_statistics
  3  where  table_name = 'T';

NOTES                            TABLE_NAME NUM_ROWS     BLOCKS
-------------------------------- ---------- -------- ----------
                                 T             11000         20

1 row selected.

SQL> select notes, table_name, column_name, low_value, high_value
  2  from   user_tab_col_statistics
  3  where  table_name = 'T';

NOTES                            TABLE_NAME COLUMN_NAME          LOW_VALUE       HIGH_VALUE
-------------------------------- ---------- -------------------- --------------- ----------
                                 T          ID                   C102            C3020B
                                 T          NUM                  C102            C3020B

2 rows selected.
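
Conversely, if you have an isolated bulk operation where you would prefer that even this lightweight tracking did not happen, the documentation describes the NO_GATHER_OPTIMIZER_STATISTICS hint as the statement-level switch. A hedged sketch (verify the behaviour on your own release and platform, since the feature itself is platform restricted, as noted below):

-- Sketch: suppress real-time statistics collection for a single bulk insert
insert /*+ no_gather_optimizer_statistics */ into t
select rownum+11000, rownum+11000
from   dual
connect by rownum <= 1000;

commit;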

There are more real-time demos of the 19c statistics improvements in the video below.

There are other enhancements in 19c to keep the statistics in the dictionary closer to the true data in database tables, which you can read about here.

Full disclosure: Some 19c new features are available on certain Oracle Database platforms only. Check the Oracle Database Licensing Guide for more information.

13 comments

  1. I think stale statistics are the great myth of most level 1 DBA support services. They get a call about poor performance and their first thought is “are there stale statistics?” rather than “are stale statistics relevant to the problem?”

    1. True, but a very common problem is queries above the low/high recorded values for columns, eg

      where sales_date > sysdate

      and of course, ‘sales_date’ continuously moves upwards as data comes in. Real time statistics could be a real winner here.

  2. I didn’t like that this feature is restricted to Exadata and Cloud services 😦
    Neither is MemOptimized RowStore available elsewhere; the most interesting features in 19c are very restricted.

  3. You could gather stats in triggers (not using DBMS_STATS though) and effectively deal with stale stats. The solution comes with some performance penalty.

    Here are the papers that explain why it is done and how it is done:

    Click to access 2013_nyoug_presentation.pdf

    http://iiotzov.files.wordpress.com/2013/09/2013_nyoug_whitepaper.docx

    And here is some code:
    http://iiotzov.files.wordpress.com/2012/03/just_stats_spec4.doc
    https://iiotzov.files.wordpress.com/2014/08/just_stats_body_patch_aug2014.docx
    https://iiotzov.files.wordpress.com/2012/03/readme.doc
    https://iiotzov.files.wordpress.com/2012/03/gathering_stats_in_trigger_example.doc

    Disclaimer:
    —————————————————————–
    I do not offer support for that package

    I do not accept any legal liability or responsibility for the content or the accuracy of the information so provided, or, for any loss or damage caused arising directly or indirectly in connection with reliance on the use of such information.

    Use at your own risk!
    —————————————————————–

  4. Cool.
    But I was wondering….

    When and how do these DML statistics get recorded into the data dictionary?

    At running the DML?
    It would feel like that means that statistics are recorded for data that is not (yet, and may never be) visible to other sessions.
    So, that would mean queries from other sessions may have different optimizer plans due to data that does not (and may never) exist as far as that session knows.

    At commit time?
    That would mean a session can’t benefit from the statistics in queries it fires before it ends the transaction.

    The only thing I can think of to get around both disadvantages would be some sort of user-level dml-statistics which get ‘promoted’ to globally visible dml-statistics at commit (or thrown away on rollback).

    I have no doubt this point has been well discussed and dealt with by Oracle, but I’m just trying to understand the end result.

  5. >When the normal nightly gather job comes along, these DML statistics are no longer relevant and will be expunged

    Does this mean that the nightly job will explicitly search for the presence of real-time statistics and re-gather them as if they were stale?

  6. An explicit re-gather clears them out as shown above. I haven’t checked yet, but my assumption is that the nightly job would look for stale stats as per normal. If they were not stale, the real-time stats would still be in use until the table became stale.

  7. I am trying this on Exadata with 19c (19.17), but I do not see the “STATS_ON_CONVENTIONAL_DML” line in the output. Is there any setting I need to check for on Exadata?
