Frequency histograms in any version of Oracle are pretty cool things. Whenever you have a distribution of data that is potentially skewed, and the number of distinct values falls under the limit on how many histogram buckets you can have (254 historically, and up to 2048 from 12c onwards), a frequency histogram is pretty much a no-brainer. In particular, if you don’t have a large number of distinct values, the nice thing is that you can nominate the largest bucket count possible, and Oracle will only create the buckets required to hold the frequency information.
For example, I’ll create a table with only 3 distinct values (1, 3 and 5), where the distribution of the data is skewed. Then I’ll ask for a 254-bucket histogram, but you can see by querying USER_HISTOGRAMS that only 3 buckets were required to hold the histogram.
SQL> create table t ( x int );
Table created.
SQL> insert into t select 1 from dual connect by level <= 100;
100 rows created.
SQL> insert into t select 3 from dual connect by level <= 1000;
1000 rows created.
SQL> insert into t select 5 from dual connect by level <= 2000;
2000 rows created.
SQL> exec dbms_stats.gather_table_stats('','T',method_opt=>'for all columns size 254');
PL/SQL procedure successfully completed.
SQL> select
2 endpoint_number,
3 endpoint_value,
4 endpoint_number -
5 nvl(lag(endpoint_number)
6 over ( order by endpoint_number),0) occurrences
7 from user_histograms
8 where table_name = 'T'
9 order by 1;
ENDPOINT_NUMBER ENDPOINT_VALUE OCCURRENCES
--------------- -------------- -----------
100 1 100
1100 3 1000
3100 5 2000
With the frequency histogram in place, the optimizer does an excellent job of predicting the row estimates for any value known to the histogram.
SQL> set autotrace traceonly explain
SQL> select * from t where x = 1;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 300 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 100 | 300 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=1)
SQL> select * from t where x = 3;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1000 | 3000 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 1000 | 3000 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=3)
SQL> select * from t where x = 5;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2000 | 6000 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 2000 | 6000 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=5)
But it is important to remember the old adage “garbage in, garbage out”. If the only data contained in my table T for column X is the values 1, 3 and 5, then there’s a good chance the optimizer is going to be our friend. But what if, after we gathered statistics, we then added some rows with a value of X=4? When you query for a value of 4, the optimizer has no information about how likely that value is to be present. So what does it do? Let’s take a look.
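(For illustration, assume we sneak in some X=4 rows without re-gathering statistics, with something like the SQL below. The row count of 500 is an arbitrary choice for this sketch, and at this point the optimizer knows nothing about it.)

insert into t select 4 from dual connect by level <= 500;
commit;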
SQL> select * from t where x = 4;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50 | 150 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 50 | 150 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=4)
The obvious question is: where did “50” come from? You could say that the optimizer takes an “optimistic” approach to its statistics, and to the person who gathered them. Its line of thinking is: “Surely … I mean SURELY … someone made sure that we gathered statistics when the data in the table best represented the state it would be in when user queries would be executed?”
Based on this assumption, if you present a query with a value not contained in the frequency histogram, this value “surely” must be an outlier, an anomaly, something unexpected and rare. For that reason, the optimizer assumes that the likelihood of such a value being in the table is half that of the least frequent known value. In this case, the optimizer knows that there are only 100 rows with X=1, so it decides that there will be 50 rows with the unknown value X=4.
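You can derive that figure yourself from the histogram data. Here is a quick sketch (reusing the occurrences calculation from the USER_HISTOGRAMS query earlier, and assuming a frequency histogram is in place) that halves the occurrence count of the least frequent endpoint; for table T it returns 50.

select min(occurrences) / 2 unknown_value_estimate
from (
  select endpoint_number -
         nvl(lag(endpoint_number)
         over ( order by endpoint_number),0) occurrences
  from user_histograms
  where table_name = 'T'
);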
Of course, the best solution here would be to collect statistics when the rows with X=4 are present, so that the optimizer has better information. Having better information all the time is one of the motivations for the real-time statistics and high-frequency statistics gathering features you’ll see in 19c.
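For example, once the X=4 rows exist, a fresh gather (the same call as before) would give X=4 its own bucket in the histogram, and the estimate for “where x = 4” would then come straight from the frequency data rather than from guesswork. (The demos that follow continue with the original statistics.)

exec dbms_stats.gather_table_stats('','T',method_opt=>'for all columns size 254');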
The optimizer also takes some other things into consideration when you lob an unknown value at it. The value of 4 sits between the two extrema that the optimizer knows about (X=1 and X=5). But when a query presents a value that sits outside the known range of values, the optimizer tackles things differently. Once again taking an optimistic view of your statistics gathering regime, it assumes that values outside the known range must be anomalous. If you query just outside the range, it will make an assumption similar to that for the value of 4: in this case, a little less than half the least frequent known value.
SQL> select * from t where x = 6;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 38 | 114 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 38 | 114 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=6)
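That figure of 38 appears to be a linear scaling of the “half the least frequent value” starting point: the known range is 1 to 5 (a width of 4), X=6 sits 1 unit past the top, and 50 * (1 - 1/4) = 37.5, which rounds to 38. I wouldn’t treat that internal arithmetic as gospel, but it matches the estimates observed here.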
But as the value gets further outside the known range, that estimate drops away. For example, for X=10 the optimizer assumes a solitary row.
SQL> select * from t where x = 10;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 1 | 3 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=10)
The “sliding scale” reduction of the estimate can be seen more easily if we increase the size of the demo. I’ll have values 1 to 100, each appearing 100 times, with a frequency histogram, and then explore values past 100.
SQL> create table t ( x int );
Table created.
SQL>
SQL> insert into t select p from
2 ( select 1 from dual connect by level <= 100),
3 ( select level p from dual connect by level <= 100);
10000 rows created.
SQL>
SQL> exec dbms_stats.gather_table_stats('','T',method_opt=>'for all columns size 254');
PL/SQL procedure successfully completed.
SQL> set autotrace traceonly explain
SQL> select * from t where x = 100;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 300 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 100 | 300 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=100)
SQL> select * from t where x = 101;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 49 | 147 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 49 | 147 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=101)
SQL> select * from t where x = 102;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 49 | 147 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 49 | 147 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=102)
SQL> select * from t where x = 103;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 48 | 144 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 48 | 144 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=103)
SQL> select * from t where x = 104;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 48 | 144 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 48 | 144 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=104)
SQL> select * from t where x = 105;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 141 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 47 | 141 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=105)
SQL> select * from t where x = 110;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 45 | 135 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 45 | 135 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=110)
SQL> select * from t where x = 150;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 25 | 75 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 25 | 75 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=150)
SQL> select * from t where x = 200;
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 1 | 3 | 7 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("X"=200)
SQL> set autotrace off
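If you want to see that sliding scale as a single result set, here is a sketch of the apparent arithmetic: start at half the least frequent value (50), scale it down linearly as the queried value moves from the known maximum (100) toward the maximum plus the width of the known range, and floor the result at a single row. It reproduces the estimates above, but keep in mind this is inferred from the observed numbers, not quoted from the documentation.

select x,
       greatest(1, round( 50 * (1 - (x - 100) / (100 - 1)) )) estimated_rows
from ( select column_value x
       from table(sys.odcinumberlist(101,102,103,104,105,110,150,200)) );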
Hopefully this explains why you might see unexpected estimates from the optimizer even when you have frequency histograms in place. When in doubt, you can query USER_HISTOGRAMS to see what data has been collected, and whether any values are missing. Often a fresh gather is all that is required to rectify things.
TL;DR: Frequency histograms are great, but do your best to ensure that all potential values are present in the table when you gather the statistics.
One follow-up question worth addressing: if a Full Table Scan is performed, what would the performance difference be if the estimate was 1 or 1,000,000?
On the scan itself… nothing. But if that scan was (say) the input into whether we should do a hash join or a nested loop to *another* table, then the mis-estimate could be catastrophic.