Subtle variations in optimizer stats

Posted by

Subtle variances in the way you work with statistics can have a significant impact on how your optimizer plans work out…so you need to take care.

Let’s look at the following example


SQL> create table T (
  2    x varchar2(20) , y varchar2(100));

Table created.

SQL> insert into T
  2  select 'x' , rpad('z',100) from all_objects;

94117 rows created.

SQL> insert into T
  2  select 'X' , rpad('z',100) from all_objects;

94117 rows created.

SQL> exec dbms_stats.gather_table_stats(user,'T',cascade=>true)

PL/SQL procedure successfully completed.

SQL> create index TX on T ( upper(x) ) ;

Index created.

So you can see that in reality, ALL of the rows have a single value for UPPER(X), namely “X”. So let’s look at an execution plan.


SQL> exec dbms_stats.gather_index_stats(user,'TX')

PL/SQL procedure successfully completed.

SQL> explain plan for select * from T where upper(x) = :b1
  2  /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------
Plan hash value: 1501193905

--------------------------------------------------------------------------------------------
| Id  | Operation                           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |      |  1882 |   189K|   354   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T    |  1882 |   189K|   354   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | TX   |   753 |       |   342   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(UPPER("X")=:B1)

14 rows selected.

That seems incorrect. Bind peeking etc aside (because we are just using standard EXPLAIN), we would expect to not use an index when the there is only 1 value across the whole table.

Let’s try that experiment again.


SQL> drop table T;

Table dropped.

SQL>
SQL> create table T (
  2    x varchar2(20) , y varchar2(100));

Table created.

SQL>
SQL> insert into T
  2  select 'x' , rpad('z',100) from all_objects;

94117 rows created.

SQL>
SQL> insert into T
  2  select 'X' , rpad('z',100) from all_objects;

94117 rows created.

SQL>
SQL> create index TX on T ( upper(x) ) ;

Index created.

SQL>
SQL> exec dbms_stats.gather_table_stats(user,'T',cascade=>true)

PL/SQL procedure successfully completed.

SQL>
SQL> exec dbms_stats.gather_index_stats(user,'TX')

PL/SQL procedure successfully completed.

SQL>
SQL> explain plan for select * from T where upper(x) = :b1
  2  /

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   188K|    18M|   786   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T    |   188K|    18M|   786   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(UPPER("X")=:B1)

13 rows selected.

And this time it has got the estimate just right. So what did we do different ? If you look at the two scripts, in the first one, we did

“gather table stats, create index, gather index stats”

and in the second one, we did

“create index, gather table stats, gather index stats”

The subtle difference there is what statistics were calculated at column level on the hidden column (which got created as part of our function based index).

In the second (“correct”) example, statistics were gathered on ALL of the columns, including the hidden one


SQL> select COLUMN_NAME,NUM_DISTINCT,AVG_COL_LEN, num_nulls, density
  2  from user_tab_cols
  3  where  table_name = 'T'
  4  order by COLUMN_ID;

COLUMN_NAME                    NUM_DISTINCT AVG_COL_LEN  NUM_NULLS    DENSITY
------------------------------ ------------ ----------- ---------- ----------
X                                         2           2          0         .5
Y                                         1         101          0          1
SYS_NC00003$                              1           2          0          1

whereas in the first example, because the hidden column did not yet exist when we gathered the table stats, it got missed.


SQL> select COLUMN_NAME,NUM_DISTINCT,AVG_COL_LEN, num_nulls, density
  2  from user_tab_cols
  3  where  table_name = 'T'
  4  order by COLUMN_ID;

COLUMN_NAME                    NUM_DISTINCT AVG_COL_LEN  NUM_NULLS    DENSITY
------------------------------ ------------ ----------- ---------- ----------
X                                         2           2          0         .5
Y                                         1         101          0          1
SYS_NC00003$

So if you see execution plans that don’t seem right…double check your statistics – you might be missing some.

One comment

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.