12c Statistics on load–special cases

One of the cool features in 12c is the automatic collection of optimizer statistics when a table is either created or loaded via direct path from empty.  This makes a lot of sense because it saves us from what used to be the mandatory second step of gathering statistics whenever we loaded an empty table with data.

For example

11.2.0.4


SQL> create table t1 as
  2  select * from dba_objects;

Table created.

SQL>
SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
(null)

SQL> select count(*) from t1;

  COUNT(*)
----------
     86802

So at this point, we’d then have to scan the table we just loaded with a DBMS_STATS call in order to come up with some representative statistics. This was fixed in 12c with “statistics on load”.

12c 


SQL> create table t1 as
  2  select * from dba_objects;

Table created.

SQL> – no dbms_stats call
SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
     78170

But this facility has been blogged about extensively, so I wont labour the point. This post is more about just a couple of things to be aware of when taking advantage of the facility.

Partitioned tables

The 12c feature is for statistics on load of the table. So if you are creating/loading a partitioned table, whilst you will still get statistics collected, they are at the table level only. For example,


SQL> create table t1
  2  partition by range ( object_id )
  3  interval ( 20000 )
  4  (
  5    partition p1 values less than ( 20000 ),
  6    partition p2 values less than ( 200000 )
  7  )
  8  as select d.* from dba_objects d
  9  where object_id is not null;

Table created.

SQL>
SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
     78165

SQL>
SQL> select partition_name, num_rows
  2  from user_tab_partitions
  3  where table_name = 'T1';

PARTITION_   NUM_ROWS
---------- ----------
P1         (null)
P2         (null)

Notice that the partition level statistics are not collected. Interestingly, indexes however will be ok whether they are created during or after the load.


--
-- indexes created after the load on the table above
--
SQL> create index ix1 on t1 ( object_id ) local;

Index created.

SQL> create index ix2 on t1 ( object_name);

Index created.

SQL>
SQL> select index_name, leaf_blocks
  2  from user_indexes
  3  where index_name in ( 'IX1','IX2');

INDEX_NAME                     LEAF_BLOCKS
------------------------------ -----------
IX1                                    173
IX2                                    535

2 rows selected.

SQL>
SQL> select index_name, partition_name, leaf_blocks
  2  from user_ind_partitions
  3  where index_name in ( 'IX1','IX2');

INDEX_NAME                     PARTITION_ LEAF_BLOCKS
------------------------------ ---------- -----------
IX1                            P1                  43
IX1                            P2                 130

--
-- indexes created as part of the table creation process
--
SQL> create table t1 (
  2     owner
  3    ,object_name
  4    ,subobject_name
  5    ,object_id
  6    ,data_object_id
  7    ,object_type
  8    ,created
  9    ,last_ddl_time
 10    ,timestamp
 11    ,status
 12    ,temporary
 13    ,generated
 14    ,secondary
 15    ,namespace
 16    ,edition_name
 17    ,sharing
 18    ,editionable
 19    ,oracle_maintained
 20    ,application
 21    ,default_collation
 22    ,duplicated
 23    ,sharded
 24    ,created_appid
 25    ,created_vsnid
 26    ,modified_appid
 27    ,modified_vsnid
 28    ,constraint pk primary key ( object_id ) using index local
 29    ,constraint uq unique ( owner, object_id ) using index
 30  )
 31  partition by range ( object_id )
 32  interval ( 20000 )
 33  (
 34    partition p1 values less than ( 20000 ),
 35    partition p2 values less than ( 200000 )
 36  )
 37  as select d.* from dba_objects d
 38  where object_id is not null;

Table created.

SQL>
SQL>
SQL> select index_name, leaf_blocks
  2  from user_indexes
  3  where index_name in ('PK','UQ');

INDEX_NAME                     LEAF_BLOCKS
------------------------------ -----------
PK                                     163
UQ                                     263

2 rows selected.

SQL>
SQL> select index_name, partition_name, leaf_blocks
  2  from user_ind_partitions
  3  where index_name in ('PK','UQ');

INDEX_NAME                     PARTITION_ LEAF_BLOCKS
------------------------------ ---------- -----------
PK                             P1                  41
PK                             P2                 122

2 rows selected.

CTAS empty is still OK

A common trick for creating an empty table whilst copying the definition of an existing table is to use a predicate with an always false condition. For example,


SQL> create table t1 as
  2  select * from dba_objects
  3  where 1=0;

Table created.

Statistics on load will still take place for such a table, ie


SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
         0

Now you might be worried that since there are now statistics on the table, that perhaps a direct load using INSERT /*+ APPEND */,  which would normally perform statistics on load will no longer work. But there is no need to panic. We’ll still determine that the table is empty, and do another statistics on load operation when you perform that direct load.


SQL> insert /*+ APPEND */ into t1
  2  select * from dba_objects;

78171 rows created.

SQL> commit;

Commit complete.

SQL>
SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
     78171

But this also has implications if you want to retain some existing statistics on the table. Because when we truncate the table, that same reset of statistics collection can occur.


SQL> create table t1 as
  2  select * from dba_objects
  3  where rownum <= 10;

Table created.

SQL>
SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
        10

SQL>
SQL> truncate table t1;

Table truncated.

SQL>
SQL> insert /*+ APPEND */ into t1
  2  select * from dba_objects;

78171 rows created.

SQL> commit;

Commit complete.

SQL>
SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
     78171

Similarly, all you need is one transaction that “brings the table to life” for automatic statistics collection to be no longer active. Even if that transactions rolls back. For example:


SQL> create table t1 as
  2  select * from dba_objects
  3  where 1=0;

Table created.

SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
         0

SQL> -- a normal mode insert of a row
SQL> insert into t1
  2  select * from dba_objects where rownum = 1;

1 row created.

SQL>
SQL> rollback;

Rollback complete.

SQL>
SQL> insert /*+ APPEND */ into t1
  2  select * from dba_objects;

78171 rows created.

SQL> commit;

Commit complete.

SQL>
SQL> select num_rows
  2  from user_tables
  3  where table_name = 'T1';

  NUM_ROWS
----------
         0

All of these behaviour are pretty much how I would expect them to work (your opinion may differ of course Smile), and I know that some people have opted to disable the automatic collection altogether to avoid any ambiguity but that strikes me as overkill.

If you are really concerned about it, it is easy enough to suffix your load routines with a simple wrapper function to check the statistics and issue the appropriate DBMS_STATS call to make up the shortfall.   And it would be remiss of me not to mention the latest optimizer whitepaper which covers in detail other optimizer enhancements in 12c. 

Happy stats collecting !

AskTOM TV episode 8

On AskTOM episode 8, I’ve taken a look at locating the SQL Plan Directives used for a particular query.  Here is the script output from the video if you want to use this for your own exploration


SQL>
SQL> create table t as
  2  select *
  3  from dba_objects
  4  where owner = 'SYS' and rownum <= 20
  5  union all
  6  select *
  7  from dba_objects
  8  where owner = 'SYSTEM'
  9  and rownum <= 200;

Table created.

SQL>
SQL> create index ix on t ( owner);

Index created.

SQL>
SQL> select COLUMN_NAME,NUM_DISTINCT,AVG_COL_LEN, num_nulls, density
  2  from   user_tab_cols
  3  where  table_name = 'T'
  4  order by COLUMN_ID;

COLUMN_NAME                    NUM_DISTINCT AVG_COL_LEN  NUM_NULLS    DENSITY
------------------------------ ------------ ----------- ---------- ----------
OWNER                                     2           7          0         .5
OBJECT_NAME                             199          18          0 .005025126
SUBOBJECT_NAME                            1           3        199          1
OBJECT_ID                               220           4          0 .004545455
DATA_OBJECT_ID                          167           4         49 .005988024
OBJECT_TYPE                               9           8          0 .111111111
CREATED                                  12           8          0 .083333333
LAST_DDL_TIME                            16           8          0      .0625
TIMESTAMP                                13          20          0 .076923077
STATUS                                    1           6          0          1
TEMPORARY                                 2           2          0         .5
GENERATED                                 2           2          0         .5
SECONDARY                                 1           2          0          1
NAMESPACE                                 4           3          0        .25
EDITION_NAME                              0           0        220          0
SHARING                                   2          10          0         .5
EDITIONABLE                               1           2        206          1
ORACLE_MAINTAINED                         1           2          0          1
APPLICATION                               1           2          0          1
DEFAULT_COLLATION                         1           7        136          1
DUPLICATED                                1           2          0          1
SHARDED                                   1           2          0          1
CREATED_APPID                             0           0        220          0
CREATED_VSNID                             0           0        220          0
MODIFIED_APPID                            0           0        220          0
MODIFIED_VSNID                            0           0        220          0

26 rows selected.

SQL>
SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> commit;

Commit complete.

SQL>
SQL> select /*+ gather_plan_statistics */ count(created)
  2  from t
  3  where owner = 'SYS'
  4  and object_type = 'JAVA CLASS';

COUNT(CREATED)
--------------
        138424

1 row selected.

SQL>
SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  3qyuxjtjy92m5, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(created) from t where owner
= 'SYS' and object_type = 'JAVA CLASS'

Plan hash value: 2143077847

-------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |      1 |        |      1 |00:00:00.08 |    5991 |
|   1 |  SORT AGGREGATE                      |      |      1 |      1 |      1 |00:00:00.08 |    5991 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |     12 |    138K|00:00:00.07 |    5991 |
|*  3 |    INDEX RANGE SCAN                  | IX   |      1 |    110 |    207K|00:00:00.03 |    1218 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')


22 rows selected.

SQL>
SQL> select sql_id, child_number,is_reoptimizable  from v$sql where sql_id = '3qyuxjtjy92m5';

SQL_ID        CHILD_NUMBER I
------------- ------------ -
3qyuxjtjy92m5            0 Y

1 row selected.

SQL>
SQL> exec dbms_spd.FLUSH_SQL_PLAN_DIRECTIVE

PL/SQL procedure successfully completed.

SQL>
SQL> select /*+ gather_plan_statistics */ count(created)
  2  from t
  3  where owner = 'SYS'
  4  and object_type = 'JAVA CLASS';

COUNT(CREATED)
--------------
        138424

1 row selected.

SQL>
SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  3qyuxjtjy92m5, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ count(created) from t where owner
= 'SYS' and object_type = 'JAVA CLASS'

Plan hash value: 2143077847

-------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |      1 |        |      1 |00:00:00.06 |    5383 |
|   1 |  SORT AGGREGATE                      |      |      1 |      1 |      1 |00:00:00.06 |    5383 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |    138K|    138K|00:00:00.05 |    5383 |
|*  3 |    INDEX RANGE SCAN                  | IX   |      1 |    220 |    207K|00:00:00.02 |     610 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - statistics feedback used for this statement
   - performance feedback used for this statement
   - 1 Sql Plan Directive used for this statement


29 rows selected.

SQL>
SQL> select count(*) from dba_sql_plan_directives;

  COUNT(*)
----------
       354

1 row selected.

SQL>
SQL> set lines 60
SQL>
SQL> desc v$sql_plan
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ADDRESS                                RAW(8)
 HASH_VALUE                             NUMBER
 SQL_ID                                 VARCHAR2(13)
 PLAN_HASH_VALUE                        NUMBER
 FULL_PLAN_HASH_VALUE                   NUMBER
 CHILD_ADDRESS                          RAW(8)
 CHILD_NUMBER                           NUMBER
 TIMESTAMP                              DATE
 OPERATION                              VARCHAR2(30)
 OPTIONS                                VARCHAR2(30)
 OBJECT_NODE                            VARCHAR2(40)
 OBJECT#                                NUMBER
 OBJECT_OWNER                           VARCHAR2(128)
 OBJECT_NAME                            VARCHAR2(128)
 OBJECT_ALIAS                           VARCHAR2(261)
 OBJECT_TYPE                            VARCHAR2(20)
 OPTIMIZER                              VARCHAR2(20)
 ID                                     NUMBER
 PARENT_ID                              NUMBER
 DEPTH                                  NUMBER
 POSITION                               NUMBER
 SEARCH_COLUMNS                         NUMBER
 COST                                   NUMBER
 CARDINALITY                            NUMBER
 BYTES                                  NUMBER
 OTHER_TAG                              VARCHAR2(35)
 PARTITION_START                        VARCHAR2(64)
 PARTITION_STOP                         VARCHAR2(64)
 PARTITION_ID                           NUMBER
 OTHER                                  VARCHAR2(4000)
 DISTRIBUTION                           VARCHAR2(20)
 CPU_COST                               NUMBER
 IO_COST                                NUMBER
 TEMP_SPACE                             NUMBER
 ACCESS_PREDICATES                      VARCHAR2(4000)
 FILTER_PREDICATES                      VARCHAR2(4000)
 PROJECTION                             VARCHAR2(4000)
 TIME                                   NUMBER
 QBLOCK_NAME                            VARCHAR2(128)
 REMARKS                                VARCHAR2(4000)
 OTHER_XML                              CLOB
 CON_ID                                 NUMBER

SQL> set lines 200
SQL>
SQL>
SQL> select other from v$sql_plan
  2  where other is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

no rows selected

SQL>
SQL> select remarks from v$sql_plan
  2  where remarks is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

no rows selected

SQL>
SQL> select other_xml from v$sql_plan
  2  where other_xml is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

OTHER_XML
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<other_xml><info type="performance_feedback" note="y">yes</info><info type="cardinality_feedback" ...


1 row selected.

SQL>
SQL> select xmltype(other_xml) from v$sql_plan
  2  where other_xml is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

XMLTYPE(OTHER_XML)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<other_xml>
  <info note="y" type="performance_feedback">yes</info>
  <info note="y" type="cardinality_feedback">yes</info>
  <info type="db_version">12.2.0.1</info>
  <info type="parse_schema"></info>
  <info note="y" type="dynamic_sampling">2</info>
  <info type="plan_hash_full">1068910003</info>
  <info type="plan_hash">2143077847</info>
  <info type="plan_hash_2">1068910003</info>
  <spd>
    <cv>0</cv>
    <cu>1</cu>
  </spd>
  <outline_data>
    <hint></hint>
    <hint></hint>
    <hint></hint>
    <hint></hint>
    <hint></hint>
    <hint></hint>
    <hint></hint>
  </outline_data>
</other_xml>


1 row selected.

SQL>
SQL> explain plan for
  2  select count(created)
  3  from t
  4  where owner = 'SYS'
  5  and object_type = 'JAVA CLASS';

Explained.

SQL>
SQL> select * from table(dbms_xplan.display(format=>'all'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2143077847

---------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |     1 |    23 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |      |     1 |    23 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |   143K|  3226K|     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | IX   |   220 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$1
   2 - SEL$1 / T@SEL$1
   3 - SEL$1 / T@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=0) COUNT("CREATED")[22]
   2 - "CREATED"[DATE,7]
   3 - "T".ROWID[ROWID,10]

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - 1 Sql Plan Directive used for this statement

35 rows selected.

SQL>
SQL> select * from table(dbms_xplan.display(format=>'all +metrics'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2143077847

---------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |     1 |    23 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |      |     1 |    23 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |   143K|  3226K|     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | IX   |   220 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$1
   2 - SEL$1 / T@SEL$1
   3 - SEL$1 / T@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=0) COUNT("CREATED")[22]
   2 - "CREATED"[DATE,7]
   3 - "T".ROWID[ROWID,10]

Sql Plan Directive information:
-------------------------------

  Used directive ids:
    14906410523430420431

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - 1 Sql Plan Directive used for this statement

41 rows selected.

SQL>
SQL>

Subtle variations in optimizer stats

Subtle variances in the way you work with statistics can have a significant impact on how your optimizer plans work out…so you need to take care.

Let’s look at the following example


SQL> create table T (
  2    x varchar2(20) , y varchar2(100));

Table created.

SQL> insert into T
  2  select 'x' , rpad('z',100) from all_objects;

94117 rows created.

SQL> insert into T
  2  select 'X' , rpad('z',100) from all_objects;

94117 rows created.

SQL> exec dbms_stats.gather_table_stats(user,'T',cascade=>true)

PL/SQL procedure successfully completed.

SQL> create index TX on T ( upper(x) ) ;

Index created.

So you can see that in reality, ALL of the rows have a single value for UPPER(X), namely “X”. So let’s look at an execution plan.


SQL> exec dbms_stats.gather_index_stats(user,'TX')

PL/SQL procedure successfully completed.

SQL> explain plan for select * from T where upper(x) = :b1
  2  /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------
Plan hash value: 1501193905

--------------------------------------------------------------------------------------------
| Id  | Operation                           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |      |  1882 |   189K|   354   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T    |  1882 |   189K|   354   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | TX   |   753 |       |   342   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(UPPER("X")=:B1)

14 rows selected.

That seems incorrect. Bind peeking etc aside (because we are just using standard EXPLAIN), we would expect to not use an index when the there is only 1 value across the whole table.

Let’s try that experiment again.


SQL> drop table T;

Table dropped.

SQL>
SQL> create table T (
  2    x varchar2(20) , y varchar2(100));

Table created.

SQL>
SQL> insert into T
  2  select 'x' , rpad('z',100) from all_objects;

94117 rows created.

SQL>
SQL> insert into T
  2  select 'X' , rpad('z',100) from all_objects;

94117 rows created.

SQL>
SQL> create index TX on T ( upper(x) ) ;

Index created.

SQL>
SQL> exec dbms_stats.gather_table_stats(user,'T',cascade=>true)

PL/SQL procedure successfully completed.

SQL>
SQL> exec dbms_stats.gather_index_stats(user,'TX')

PL/SQL procedure successfully completed.

SQL>
SQL> explain plan for select * from T where upper(x) = :b1
  2  /

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   188K|    18M|   786   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T    |   188K|    18M|   786   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(UPPER("X")=:B1)

13 rows selected.

And this time it has got the estimate just right. So what did we do different ? If you look at the two scripts, in the first one, we did

“gather table stats, create index, gather index stats”

and in the second one, we did

“create index, gather table stats, gather index stats”

The subtle difference there is what statistics were calculated at column level on the hidden column (which got created as part of our function based index).

In the second (“correct”) example, statistics were gathered on ALL of the columns, including the hidden one


SQL> select COLUMN_NAME,NUM_DISTINCT,AVG_COL_LEN, num_nulls, density
  2  from user_tab_cols
  3  where  table_name = 'T'
  4  order by COLUMN_ID;

COLUMN_NAME                    NUM_DISTINCT AVG_COL_LEN  NUM_NULLS    DENSITY
------------------------------ ------------ ----------- ---------- ----------
X                                         2           2          0         .5
Y                                         1         101          0          1
SYS_NC00003$                              1           2          0          1

whereas in the first example, because the hidden column did not yet exist when we gathered the table stats, it got missed.


SQL> select COLUMN_NAME,NUM_DISTINCT,AVG_COL_LEN, num_nulls, density
  2  from user_tab_cols
  3  where  table_name = 'T'
  4  order by COLUMN_ID;

COLUMN_NAME                    NUM_DISTINCT AVG_COL_LEN  NUM_NULLS    DENSITY
------------------------------ ------------ ----------- ---------- ----------
X                                         2           2          0         .5
Y                                         1         101          0          1
SYS_NC00003$

So if you see execution plans that don’t seem right…double check your statistics – you might be missing some.

TOP-N histograms on 12c

I had an interesting question on AskTom today, with a nicely provided test case !!!,  so I thought I’d share it here with some additional commentary.

The basic premise of the TOP-N histogram in 12c is that when the number of distinct values in a column exceeded the number of available buckets for histograms, the historical fall-back position was to go to the height-balanced histograms, which is better than no histogram at all, but loses much of the precision we’d like to see from (say) a frequency based histogram. The TOP-N histogram represents a “halfway house” between the two histogram types. When we exceed the number of available histogram buckets, it might be the case that the overwhelming majority of rows do fall within the desired number of buckets, with just a few outliers being the proverbial “straw the broke the camels back”.

In such an instance, it makes more sense to perhaps store a frequency histogram for the most prevalent values, and discard (aka, lump into a single “nonpopular” bucket) the outliers on the premise that either they will not be queried anyway, or that even if they do, their “absence” from the histogram lets us come up with an appropriate cost of access anyway.

(* – I’m only using the term “nonpopular” to align with the documentation…perhaps an unpopular choice Smile )

Let’s look at an example of this in action. We’ll use ALL_OBJECTS as a base, and the OWNER column will be ideal for a skewed set of data.


SQL> create table t1 as select * from all_objects;

Table created.

SQL> select column_name,histogram,num_nulls,num_distinct,sample_size,density
  2  from user_tab_col_statistics
  3  where table_name='T1'
  4  and column_name='OWNER';

COLUMN_NAME          HISTOGRAM        NUM_NULLS NUM_DISTINCT SAMPLE_SIZE    DENSITY
-------------------- --------------- ---------- ------------ ----------- ----------
OWNER                NONE                     0           38       94322 .026315789

1 row selected.

SQL> select owner,count(*)
  2  from t1
  3  group by owner
  4  order by 2 desc;

OWNER                            COUNT(*)
------------------------------ ----------
SYS                                 42057
PUBLIC                              37138
APEX_050000                          3192
ORDSYS                               3157
APEX_040200                          3021
MDSYS                                1647
SYSTEM                                589
MCDONAC                               470
XDB                                   403
CTXSYS                                401
WMSYS                                 371
SH                                    300
DVSYS                                 292
ORDDATA                               274
LBACSYS                               237
ORDS_METADATA                         184
GSMADMIN_INTERNAL                     104
OE                                     93
DBSNMP                                 55
IX                                     48
SOE                                    47
SCOTT                                  39
HR                                     34
OLAPSYS                                25
OJVMSYS                                22
APEX_LISTENER                          22
DVF                                    19
FLOWS_FILES                            12
ORDPLUGINS                             10
PM                                     10
AUDSYS                                  9
SI_INFORMTN_SCHEMA                      8
OUTLN                                   8
BI                                      8
ORACLE_OCM                              6
APPQOSSYS                               5
TOP_DOG                                 3
JOE_SCHMO                               2

38 rows selected.

So we can see that SYS, PUBLIC et al are popular values, whereas JOE_SCHMO is not, and for example, APEX_PUBLIC_USER even less so (because it does not appear in the table at all, but is indeed a valid user). Now we’ll calculate a histogram with 25 buckets (which is less than the 38 distinct values in the column) but represents the majority of all data in the table.


SQL> begin
  2     dbms_stats.gather_table_stats(user,'T1',
  3             method_opt=>'for columns OWNER size 25');
  4  end;
  5  /

PL/SQL procedure successfully completed.

SQL> select
  2  column_name,histogram,num_nulls,num_distinct,sample_size,density
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  and column_name='OWNER';

COLUMN_NAME          HISTOGRAM        NUM_NULLS NUM_DISTINCT SAMPLE_SIZE    DENSITY
-------------------- --------------- ---------- ------------ ----------- ----------
OWNER                TOP-FREQUENCY            0           38       94322 .000005301

1 row selected.

We can see the new 12c histogram, the TOP-FREQUENCY which has done what was alluded to above, namely a frequency histogram the popular values rather than a generic height-balanced one.  Now lets count the number of rows in the table, and look at the end point values for the values in the histogram.  We’ll need those shortly.


SQL> col c new_value nvals
SQL> select count(*) c from t1;

         C
----------
     94322

1 row selected.

SQL> select endpoint_number,endpoint_actual_value
  2  from user_tab_histograms
  3  where table_name ='T1'
  4  and column_name ='OWNER';

ENDPOINT_NUMBER ENDPOINT_ACTUAL_VALUE
--------------- ---------------------
           3021 APEX_040200
           6213 APEX_050000
           6235 APEX_LISTENER
           6636 CTXSYS
           6691 DBSNMP
           6983 DVSYS
           7087 GSMADMIN_INTERNAL
           7121 HR
           7169 IX
           7406 LBACSYS
           7876 MCDONAC
           9523 MDSYS
           9616 OE
           9641 OLAPSYS
           9915 ORDDATA
          13072 ORDSYS
          13256 ORDS_METADATA
          50394 PUBLIC
          50433 SCOTT
          50733 SH
          50780 SOE
          92837 SYS
          93426 SYSTEM
          93797 WMSYS
          94200 XDB

25 rows selected.

Our first query now shows the benefit of this kind of histogram. When I query for a popular value (one in the histogram), I get very good cost estimation.  There are 403 XDB rows and the execution plan got it spot on.


SQL> set autotrace on explain
SQL> select count(*) from t1
  2  where owner ='XDB';

  COUNT(*)
----------
       403

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |   441   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |   403 |  2418 |   441   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"='XDB')


So what do we do with an nonpopular value ?  After all, it is not in the histogram.  The documentation tells us:

image

So let’s give that a go, and see if the costs align.


SQL> set autotrace off
SQL> select density
  2  from user_tab_col_statistics
  3  where table_name='T1'
  4  and column_name='OWNER';

   DENSITY
----------
.000005301

1 row selected.

SQL>
SQL>
SQL> select &&nvals*density
  2  from user_tab_col_statistics
  3  where table_name='T1'
  4  and column_name='OWNER';

94322*DENSITY
-------------
           .5

1 row selected.

So according to this, we would expect a cost value of 1 (ie, 0.5 rounded up), but look what happens:



SQL> set autotrace on explain
SQL> select count(*) from t1
  2  where owner ='APEX_PUBLIC_USER';

  COUNT(*)
----------
         0

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |   441   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     9 |    54 |   441   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"='APEX_PUBLIC_USER')

SQL>
SQL>

The challenge now becomes, where does the “9” come from ? The secret lies in the 10053 trace file.  When we look at this, some new information comes to light.



SINGLE TABLE ACCESS PATH 
  Single Table Cardinality Estimation for T1[T1] 
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#1): 
    NewDensity:0.000099, OldDensity:0.000005 BktCnt:94208.000000, PopBktCnt:94208.000000, PopValCnt:25, NDV:38
  Column (#1): OWNER(VARCHAR2)
    AvgLen: 6 NDV: 38 Nulls: 0 Density: 0.000099
    Histogram: Top-Freq  #Bkts: 94208  UncompBkts: 94208  EndPtVals: 25  ActualVal: yes


Notice the presence of a value “NewDensity”. If we use that value instead of the “density” from user_tab_cols, then we arrive at our estimate of 9 rows as per the explain plan. Since we are doing a costing for a nonpopular value, the NewDensity is a density designed to more closer reflect that of the nonpopular values, not all of rows.

“NewDensity” appears to be: number of nonpopular rows / ( number of distinct nonpopular values * num_rows )

Common GATHER_PLAN_STATISTIC confusion

Most people already know about the very cool GATHER_PLAN_STATISTICS hint.  If not, you can see an example here

But here’s a common cause of confusion when using it in SQL Plus:


SQL> select /*+ gather_plan_statistics*/ count(p) from t where x > sysdate - 30;

  COUNT(P)
----------
         0

SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'))

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
--------------------
SQL_ID  9babjv8yq8ru3, child number 0

BEGIN DBMS_OUTPUT.GET_LINES(:LINES, :NUMLINES); END;

NOTE: cannot fetch plan for SQL_ID: 9babjv8yq8ru3, CHILD_NUMBER: 0
      Please verify value of SQL_ID and CHILD_NUMBER;
      It could also be that the plan is no longer in cursor cache (check v$sql_plan)



So what has happened ? Why didn’t I get my nicely formatted output ?

Well, if you have “SERVEROUTPUT” set to “on” in SQL Plus, then when your SQL statement has completed, SQL Plus makes an additional call to the database to pick up any data in the DBMS_OUTPUT buffer.

Hence when you ask for statistics on the last call made to the database, it is not the one you were expecting.

Moral of the story: Turn off serveroutput before using GATHER_PLAN_STATISTICS in SQL Plus.

SAMPLE costing

People often think when using the SAMPLE clause, that because they are only dealing with a subset of the data, that immediately it should be a much faster operation. And whilst sometimes this is indeed the case, it is not a guarantee. The optimizer probably gives us a hint ( no pun intended) as to how we should interpret the SAMPLE clause.

Lets compare a standard scan of a table, with a sampled scan.



SQL> create table T as select * from dba_Objects;

Table created.

SQL>
SQL> set autotrace traceonly explain
SQL> select * from T;

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 96399 |    10M|   451   (1)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| T    | 96399 |    10M|   451   (1)| 00:00:01 |
--------------------------------------------------------------------------

SQL>
SQL> select * from T sample (1);

Execution Plan
----------------------------------------------------------
Plan hash value: 2767392432

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |   964 |   108K|   450   (1)| 00:00:01 |
|   1 |  TABLE ACCESS SAMPLE| T    |   964 |   108K|   450   (1)| 00:00:01 |
----------------------------------------------------------------------------

The “Rows” seems to make sense, ie, a one percent sample will return about one percent of the rows, but notice that the Cost is the same in both cases. Now at first thought, you might be thinking that this is wrong – after all, if I only have to produce 1/100th of the data, then why would it cost the same ? Well, its because we are sampling rows. To get 1/100th of the rows, it is still quite possible that we’d have to scan every single block in the table – that’s what sampling is, getting a pseudo-randomly distributed set of rows.

What most people think sampling is, is to pick a random subset of blocks, and then grab the rows from that. You can indeed so that, and the costs are reflected accordingly.




SQL> select * from T sample block (1);

Execution Plan
----------------------------------------------------------
Plan hash value: 2767392432

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |   964 |   108K|     6   (0)| 00:00:01 |
|   1 |  TABLE ACCESS SAMPLE| T    |   964 |   108K|     6   (0)| 00:00:01 |
----------------------------------------------------------------------------


Dont forget though, that as an empty table grows, then its quite possible that rows clumped together in blocks are not as random a sample of the data you might think Smile

Optimizer curiosity in 12.1.0.2

For almost as long as I can remember, the optimizer has had a nifty little trick when you (in effect) try to combine two different usage models within a single SQL.  To explain that, I’m referring to the common scenario of: “If a bind value is provided by the caller, then use it, otherwise it shouldn’t limit the result set.  So we commonly see queries like:

select *
from MY_TABLE
where COL1 = NVL(:mybindvar, COL1)

[For ease of discussion, we’ll assume COL1 is not nullable]

Anyway, the nice little optimizer trick was to optimize the query to handle the two separate use cases, so you see a CONCATENATION step in the execution plan, and two FILTER’s, one to handle the case when the bind variable is null, and one to handle the case where it is provided.



-----------------------------------------------
| Id  | Operation                             |
-----------------------------------------------
|   0 | SELECT STATEMENT                      |
|   1 |  CONCATENATION                        |
|*  2 |   FILTER                              |
            
|*  5 |   FILTER                              |
            
-----------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(:B1 IS NULL)
   5 - filter(:B1 IS NOT NULL)

In effect, the optimizer has taken care of the old SQL tuning advice we used to give to the developers to rewrite the SQL as a UNION ALL, or even split it into two distinct SQL statements to cater for each case.  With that in mind, I picked up a change to this behaviour in 12.1.0.2 (and some additional work by Jonathan Lewis suggests 11.2.0.4 as well), where bind peeking seems to create some confusion.

Let’s look at an example.  I’ll create a table, populate it with approx 4million rows, where two cols are of interest:

  • SEQ, 2m distinct values and hence highly selective
  • PSEQ, only 2 distinct values, and hence not a great option for an index path

The primary key is the composite of these two columns, and a secondary index on PSEQ



SQL> create table T (
  2    seq int, pseq int, blah char(30),
  3    constraint T_PK primary key (seq,pseq)
  4  )
  5  /
 
Table created.
 
SQL> insert into T
  2  select trunc(rownum/2) seq,
  3         mod(rownum,2) pseq,
  4         'x' blah
  5  from
  6    ( select 1 from dual connect by level < 1000 ),
  7    ( select 1 from dual connect by level < 4000 )
  8  /
 
3995001 rows created.
 
SQL> create index T_IX2 on T ( pseq );
 
Index created.
 
SQL> exec dbms_stats.gather_table_stats('','T',no_invalidate=>false);
 
PL/SQL procedure successfully completed.

Now we’ll execute an SQL in the form previously mentioned, and take the case where the bind variable in the NVL is null.



SQL> variable b1 number
SQL> variable b2 number
SQL> exec :b1 := null;
SQL> exec :b2 := 1706496;
 
SQL> select /*+ gather_plan_statistics */ *
  2  from T
  3  WHERE seq = :B2
  4  AND pseq = NVL(:B1 ,pseq)
  5  /
 
       SEQ       PSEQ BLAH
---------- ---------- ------------------------------
   1706496          0 x
   1706496          1 x
 
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
 
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0caq50rvfkub5, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ * from T WHERE seq = :B2 AND pseq
= NVL(:B1 ,pseq)
 
Plan hash value: 3837764478
 
---------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |       |      1 |        |      2 |00:00:00.01 |       6 |
|   1 |  CONCATENATION                        |       |      1 |        |      2 |00:00:00.01 |       6 |
|*  2 |   FILTER                              |       |      1 |        |      2 |00:00:00.01 |       6 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T     |      1 |      2 |      2 |00:00:00.01 |       6 |
|*  4 |     INDEX RANGE SCAN                  | T_PK  |      1 |      2 |      2 |00:00:00.01 |       4 |
|*  5 |   FILTER                              |       |      1 |        |      0 |00:00:00.01 |       0 |
|*  6 |    TABLE ACCESS BY INDEX ROWID BATCHED| T     |      0 |      1 |      0 |00:00:00.01 |       0 |
|*  7 |     INDEX RANGE SCAN                  | T_IX2 |      0 |      1 |      0 |00:00:00.01 |       0 |
---------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(:B1 IS NULL)
   4 - access("SEQ"=:B2)
       filter("PSEQ" IS NOT NULL)
   5 - filter(:B1 IS NOT NULL)
   6 - filter("SEQ"=:B2)
   7 - access("PSEQ"=:B1)
 
 
30 rows selected.

Because :B1 is null, you can see from the Predicate Information, and from the Actual Rows information, that we took the execution path in lines 3-4.  But take a peek (no pun intended) at lines 6-7.  That path, whilst not used, is proposing the use of index T_IX2, which as we know is 2 distinct keys across 4 million rows.  The problem is … we’ve now loaded that execution plan into our library cache.  So let’s see what happens when we exercise that part of the plan when we this time specify both bind variables



SQL> exec :b1 := 1;
 
SQL> select /*+ gather_plan_statistics */ *
  2  from T
  3  WHERE seq = :B2
  4  AND pseq = NVL(:B1 ,pseq)
  5  /
 
       SEQ       PSEQ BLAH
---------- ---------- ------------------------------
   1706496          1 x
 
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
 
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0caq50rvfkub5, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ * from T WHERE seq = :B2 AND pseq
= NVL(:B1 ,pseq)
 
Plan hash value: 3837764478
 
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |       |      1 |        |      1 |00:00:00.70 |   28124 |    178 |
|   1 |  CONCATENATION                        |       |      1 |        |      1 |00:00:00.70 |   28124 |    178 |
|*  2 |   FILTER                              |       |      1 |        |      0 |00:00:00.01 |       0 |      0 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T     |      0 |      2 |      0 |00:00:00.01 |       0 |      0 |
|*  4 |     INDEX RANGE SCAN                  | T_PK  |      0 |      2 |      0 |00:00:00.01 |       0 |      0 |
|*  5 |   FILTER                              |       |      1 |        |      1 |00:00:00.70 |   28124 |    178 |
|*  6 |    TABLE ACCESS BY INDEX ROWID BATCHED| T     |      1 |      1 |      1 |00:00:00.70 |   28124 |    178 |
|*  7 |     INDEX RANGE SCAN                  | T_IX2 |      1 |      1 |   1997K|00:00:00.63 |    3898 |    178 |
----------------------------------------------------------

Ouch…2000 million rows scanned in the index.  Since we provided both bind variables, the obvious access path should have been a simple primary key lookup.  But the path derived from our first execution (with :B1 being null) has left behind a “nasty legacy”.

The troubling thing about this (in terms of application stability) is that the order in which we run queries now impacts the performance of how they run.  Lets clear out the cursor by recalculating stats and then reverse the order of execution.



SQL> exec dbms_stats.gather_table_stats('','T',no_invalidate=>false);
 
PL/SQL procedure successfully completed.
 
SQL> exec :b1 := 1;
SQL> exec :b2 := 1706496;
 
PL/SQL procedure successfully completed.
 
SQL> select /*+ gather_plan_statistics */ *
  2  from T
  3  WHERE seq = :B2
  4  AND pseq = NVL(:B1 ,pseq)
  5  /
 
       SEQ       PSEQ BLAH
---------- ---------- ------------------------------
   1706496          1 x
 
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
 
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0caq50rvfkub5, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ * from T WHERE seq = :B2 AND pseq
= NVL(:B1 ,pseq)
 
Plan hash value: 933468988
 
--------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |      |      1 |        |      1 |00:00:00.01 |       4 |
|   1 |  CONCATENATION                        |      |      1 |        |      1 |00:00:00.01 |       4 |
|*  2 |   FILTER                              |      |      1 |        |      0 |00:00:00.01 |       0 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T    |      0 |      2 |      0 |00:00:00.01 |       0 |
|*  4 |     INDEX RANGE SCAN                  | T_PK |      0 |      2 |      0 |00:00:00.01 |       0 |
|*  5 |   FILTER                              |      |      1 |        |      1 |00:00:00.01 |       4 |
|   6 |    TABLE ACCESS BY INDEX ROWID        | T    |      1 |      1 |      1 |00:00:00.01 |       4 |
|*  7 |     INDEX UNIQUE SCAN                 | T_PK |      1 |      1 |      1 |00:00:00.01 |       3 |
--------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(:B1 IS NULL)
   4 - access("SEQ"=:B2)
       filter("PSEQ" IS NOT NULL)
   5 - filter(:B1 IS NOT NULL)
   7 - access("SEQ"=:B2 AND "PSEQ"=:B1)

When we optimize the query for the initial case of both bind variables provided, you can see that both sides of the CONCATENATION have yielded a sensible path.