TOP-N histograms on 12c

I had an interesting question on AskTom today, complete with a nicely provided test case, so I thought I'd share it here with some additional commentary.

The basic premise of the TOP-N histogram in 12c is this: when the number of distinct values in a column exceeds the number of available histogram buckets, the historical fall-back position was a height-balanced histogram. That is better than no histogram at all, but it loses much of the precision we'd like to see from (say) a frequency-based histogram. The TOP-N histogram represents a "halfway house" between the two histogram types. When we exceed the number of available buckets, it may well be the case that the overwhelming majority of rows do fall within the desired number of buckets, with just a few outliers being the proverbial "straw that broke the camel's back".

In such an instance, it makes more sense to store a frequency histogram for the most prevalent values, and discard (that is, lump into a single "nonpopular"* bucket) the outliers, on the premise that either they will not be queried anyway, or that even if they are, their "absence" from the histogram still lets us come up with an appropriate cost of access.

(* – I'm only using the term "nonpopular" to align with the documentation…perhaps an unpopular choice 🙂 )
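As a side note, the eligibility rule the documentation describes for a top frequency histogram can be sketched as a simple check: the top n values must account for at least (1 - 1/n) of the rows. A minimal sketch (the helper function is hypothetical, not an Oracle API; the figures come from the example later in this post):

```python
def top_frequency_eligible(top_n_rows, total_rows, n_buckets):
    """Sketch of the documented eligibility rule for a top frequency
    histogram: the top n values must cover at least (1 - 1/n) of the rows.
    (Hypothetical helper, not an Oracle API.)"""
    return top_n_rows / total_rows >= 1 - 1 / n_buckets

# Figures from the OWNER example below: the top 25 values cover
# 94208 of the 94322 rows, i.e. ~99.9%, comfortably above the 96% threshold.
print(top_frequency_eligible(94208, 94322, 25))  # True
```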

Let’s look at an example of this in action. We’ll use ALL_OBJECTS as a base, and the OWNER column will be ideal for a skewed set of data.


SQL> create table t1 as select * from all_objects;

Table created.

SQL> select column_name,histogram,num_nulls,num_distinct,sample_size,density
  2  from user_tab_col_statistics
  3  where table_name='T1'
  4  and column_name='OWNER';

COLUMN_NAME          HISTOGRAM        NUM_NULLS NUM_DISTINCT SAMPLE_SIZE    DENSITY
-------------------- --------------- ---------- ------------ ----------- ----------
OWNER                NONE                     0           38       94322 .026315789

1 row selected.

SQL> select owner,count(*)
  2  from t1
  3  group by owner
  4  order by 2 desc;

OWNER                            COUNT(*)
------------------------------ ----------
SYS                                 42057
PUBLIC                              37138
APEX_050000                          3192
ORDSYS                               3157
APEX_040200                          3021
MDSYS                                1647
SYSTEM                                589
MCDONAC                               470
XDB                                   403
CTXSYS                                401
WMSYS                                 371
SH                                    300
DVSYS                                 292
ORDDATA                               274
LBACSYS                               237
ORDS_METADATA                         184
GSMADMIN_INTERNAL                     104
OE                                     93
DBSNMP                                 55
IX                                     48
SOE                                    47
SCOTT                                  39
HR                                     34
OLAPSYS                                25
OJVMSYS                                22
APEX_LISTENER                          22
DVF                                    19
FLOWS_FILES                            12
ORDPLUGINS                             10
PM                                     10
AUDSYS                                  9
SI_INFORMTN_SCHEMA                      8
OUTLN                                   8
BI                                      8
ORACLE_OCM                              6
APPQOSSYS                               5
TOP_DOG                                 3
JOE_SCHMO                               2

38 rows selected.

So we can see that SYS, PUBLIC et al are popular values, whereas JOE_SCHMO is not, and APEX_PUBLIC_USER even less so (it does not appear in the table at all, although it is a valid user). Now we'll gather a histogram with 25 buckets, which is fewer than the 38 distinct values in the column but covers the vast majority of the data in the table.


SQL> begin
  2     dbms_stats.gather_table_stats(user,'T1',
  3             method_opt=>'for columns OWNER size 25');
  4  end;
  5  /

PL/SQL procedure successfully completed.

SQL> select
  2  column_name,histogram,num_nulls,num_distinct,sample_size,density
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  and column_name='OWNER';

COLUMN_NAME          HISTOGRAM        NUM_NULLS NUM_DISTINCT SAMPLE_SIZE    DENSITY
-------------------- --------------- ---------- ------------ ----------- ----------
OWNER                TOP-FREQUENCY            0           38       94322 .000005301

1 row selected.

We can see the new 12c histogram type, TOP-FREQUENCY, which has done what was alluded to above: built a frequency histogram for the popular values rather than a generic height-balanced one. Now let's count the number of rows in the table, and look at the endpoint values in the histogram. We'll need those shortly.


SQL> col c new_value nvals
SQL> select count(*) c from t1;

         C
----------
     94322

1 row selected.

SQL> select endpoint_number,endpoint_actual_value
  2  from user_tab_histograms
  3  where table_name ='T1'
  4  and column_name ='OWNER';

ENDPOINT_NUMBER ENDPOINT_ACTUAL_VALUE
--------------- ---------------------
           3021 APEX_040200
           6213 APEX_050000
           6235 APEX_LISTENER
           6636 CTXSYS
           6691 DBSNMP
           6983 DVSYS
           7087 GSMADMIN_INTERNAL
           7121 HR
           7169 IX
           7406 LBACSYS
           7876 MCDONAC
           9523 MDSYS
           9616 OE
           9641 OLAPSYS
           9915 ORDDATA
          13072 ORDSYS
          13256 ORDS_METADATA
          50394 PUBLIC
          50433 SCOTT
          50733 SH
          50780 SOE
          92837 SYS
          93426 SYSTEM
          93797 WMSYS
          94200 XDB

25 rows selected.
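ENDPOINT_NUMBER in this listing is a cumulative row count, so subtracting consecutive endpoints recovers each value's frequency. A quick sketch using the tail of the listing above:

```python
# Last few (endpoint_number, value) pairs from the listing above.
endpoints = [(50780, 'SOE'), (92837, 'SYS'), (93426, 'SYSTEM'),
             (93797, 'WMSYS'), (94200, 'XDB')]

freq = {}
prev = 50733  # SH's endpoint, which immediately precedes SOE
for ep, owner in endpoints:
    freq[owner] = ep - prev   # cumulative count minus previous endpoint
    prev = ep

print(freq['SYS'], freq['XDB'])  # 42057 403, matching the GROUP BY output
```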

Our first query now shows the benefit of this kind of histogram. When I query for a popular value (one in the histogram), I get very good cardinality estimation. There are 403 XDB rows, and the execution plan got it spot on.


SQL> set autotrace on explain
SQL> select count(*) from t1
  2  where owner ='XDB';

  COUNT(*)
----------
       403

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |   441   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |   403 |  2418 |   441   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"='XDB')


So what do we do with a nonpopular value? After all, it is not in the histogram. The documentation tells us:

[Screenshot of the documentation: for a value not present in a top frequency histogram, the estimate is based on DENSITY, i.e. cardinality = num_rows * density.]

So let's give that a go, and see if the numbers align.


SQL> set autotrace off
SQL> select density
  2  from user_tab_col_statistics
  3  where table_name='T1'
  4  and column_name='OWNER';

   DENSITY
----------
.000005301

1 row selected.

SQL>
SQL>
SQL> select &&nvals*density
  2  from user_tab_col_statistics
  3  where table_name='T1'
  4  and column_name='OWNER';

94322*DENSITY
-------------
           .5

1 row selected.
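The same arithmetic can be checked outside the database, assuming (as is commonly observed) that the optimizer rounds and never estimates fewer than one row:

```python
density = 0.000005301   # DENSITY from user_tab_col_statistics
num_rows = 94322        # COUNT(*) from T1

est = num_rows * density
print(round(est, 1))        # 0.5
print(max(1, round(est)))   # 1 -- an estimate is never below one row
```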

So according to this, we would expect a cardinality estimate of 1 (ie, 0.5 rounded up), but look what happens:



SQL> set autotrace on explain
SQL> select count(*) from t1
  2  where owner ='APEX_PUBLIC_USER';

  COUNT(*)
----------
         0

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |   441   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     9 |    54 |   441   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"='APEX_PUBLIC_USER')

SQL>
SQL>

The challenge now becomes: where does the "9" come from? The secret lies in the 10053 trace file. When we look at it, some new information comes to light.



SINGLE TABLE ACCESS PATH 
  Single Table Cardinality Estimation for T1[T1] 
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#1): 
    NewDensity:0.000099, OldDensity:0.000005 BktCnt:94208.000000, PopBktCnt:94208.000000, PopValCnt:25, NDV:38
  Column (#1): OWNER(VARCHAR2)
    AvgLen: 6 NDV: 38 Nulls: 0 Density: 0.000099
    Histogram: Top-Freq  #Bkts: 94208  UncompBkts: 94208  EndPtVals: 25  ActualVal: yes


Notice the presence of a value "NewDensity". If we use that value instead of the "density" from user_tab_col_statistics, then we arrive at our estimate of 9 rows, as per the explain plan. Since we are doing a costing for a nonpopular value, NewDensity is designed to more closely reflect the density of the nonpopular values, not of all rows.

“NewDensity” appears to be: number of nonpopular rows / ( number of distinct nonpopular values * num_rows )
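Plugging the trace figures into that formula as a sanity check (my arithmetic, not official Oracle internals): 114 rows fall outside the histogram (94322 - 94208), spread over 13 distinct values (38 - 25), and the resulting cardinality matches the plan.

```python
num_rows = 94322      # rows in T1
bkt_cnt = 94208       # rows covered by the histogram (BktCnt in the trace)
ndv = 38              # distinct OWNER values
endpoint_vals = 25    # values stored in the histogram

nonpop_rows = num_rows - bkt_cnt        # 114
nonpop_distinct = ndv - endpoint_vals   # 13

new_density = nonpop_rows / (nonpop_distinct * num_rows)
card = round(new_density * num_rows)    # simplifies to nonpop_rows / nonpop_distinct

print(card)  # 9, the estimate the plan showed for APEX_PUBLIC_USER
```

The density computed this way (~0.0000930) is close to, though not exactly, the 0.000099 printed in the trace; the rounded cardinality nonetheless comes out at 9.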

5 comments

  1. Hello Connor,

    Depending on whether the min and max values of column OWNER are naturally included in the TOP-N values (or forced into the histogram because they must always be present), there are two formulas to use in order to calculate the NewDensity:

    NewDensity = (sample_size-TopNRows)/(num_distinct-num_buckets)/sample_size
    NewDensity = (sample_size- AdjustedTopNRows)/(num_distinct-num_buckets)/sample_size
    

    I have tried to explain the above formulas in the following article: http://allthingsoracle.com/12c-histogram-top-frequency/

    When applied to your case (I don’t have the same all_objects content though) I have the following:

    SQL> explain plan for
        select count(*) from t1_top
           where owner ='APEX_PUBLIC_USER';
    
    Explained.
    
    SQL> select * from table(dbms_xplan.display);
    
    
    Plan hash value: 222165550
    -----------------------------------------------------
    | Id  | Operation          | Name   | Rows  | Bytes |
    -----------------------------------------------------
    |   0 | SELECT STATEMENT   |        |     1 |     6 |
    |   1 |  SORT AGGREGATE    |        |     1 |     6 |
    |*  2 |   TABLE ACCESS FULL| T1_TOP |     4 |    24 |
    -----------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - filter("OWNER"='APEX_PUBLIC_USER')
    
    SQL> select
      2          column_name
      3         ,num_distinct
      4         ,num_buckets
      5         ,sample_size
      6         ,histogram
      7       from
      8          user_tab_col_statistics
      9       where table_name = 'T1_TOP'
     10       and column_name  = 'OWNER';
    
    COLUMN_NAME                    NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
    ------------------------------ ------------ ----------- ----------- ---------------
    OWNER                                    12          10       17032 TOP-FREQUENCY
    
    SQL> select
      2           sum (cnt) TopNRows
      3        from (select
      4                 owner
      5                ,count(*) cnt
      6              from t1_top
      7              group by owner
      8              order by count(*) desc
      9              )
     10       where rownum <= 10;
    
      TOPNROWS
    ----------
         17025
    
    
    SQL> select (17032-17025)/(12-10)/17032 NewDensity from dual;
    
    NEWDENSITY
    ----------
    .000205496
    
    
    SQL> select num_rows * (17032-17025)/(12-10)/17032 Card from user_tables where table_name = 'T1_TOP';
    
          CARD
    ----------
           3.5
    

    This is of course backed by the corresponding 10053 trace file:

    Column (#1): 
        NewDensity:0.000205, OldDensity:0.000029 BktCnt:17025.000000, PopBktCnt:17025.000000, PopValCnt:10, NDV:12
      Column (#1): OWNER(VARCHAR2)
        AvgLen: 6 NDV: 12 Nulls: 0 Density: 0.000205
        Histogram: Top-Freq  #Bkts: 17025  UncompBkts: 17025  EndPtVals: 10  ActualVal: yes
      Table: T1_TOP  Alias: T1_TOP
        Card: Original: 17032.000000  Rounded: 4  Computed: 3.500000  Non Adjusted: 3.500000	   
    

    Best regards
    Mohamed Houri

  2. "So according to this, we would expect a cost value of 1"

     It is not a cost; it is a cardinality (card), so that sentence should read:

     "So according to this, we would expect a CARD value of 1"
