I had an interesting question on AskTom today, with a nicely provided test case, so I thought I'd share it here with some additional commentary.

The basic premise of the TOP-N histogram in 12c is this: when the number of distinct values in a column exceeds the number of available histogram buckets, the historical fall-back position was a height-balanced histogram, which is better than no histogram at all but loses much of the precision we'd like to see from (say) a frequency histogram. The TOP-N histogram is a "halfway house" between the two histogram types. When we exceed the number of available buckets, it may well be the case that the overwhelming majority of rows fall within the desired number of buckets, with just a few outliers being the proverbial straw that broke the camel's back.

In such an instance, it makes more sense to store a frequency histogram for the most prevalent values, and discard the outliers (that is, lump them into a single "nonpopular"* category) on the premise that either they will not be queried anyway, or that even if they are, their "absence" from the histogram still lets us come up with an appropriate cost of access.

(* – I'm only using the term "nonpopular" to align with the documentation… perhaps an unpopular choice.)
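To make the idea concrete, here's a small, purely conceptual sketch in Python (this is not Oracle's actual implementation) of the Top-N selection idea: keep exact counts for the N most frequent values, and lump everything else into a single "nonpopular" remainder.

```python
from collections import Counter

def top_frequency_buckets(values, n_buckets):
    """Keep exact counts for the n_buckets most frequent values;
    everything else becomes the 'nonpopular' remainder."""
    counts = Counter(values)
    popular = dict(counts.most_common(n_buckets))
    nonpopular_rows = sum(counts.values()) - sum(popular.values())
    return popular, nonpopular_rows

# A tiny skewed data set, in the spirit of the OWNER column
data = ["SYS"] * 5 + ["PUBLIC"] * 4 + ["SCOTT"] * 2 + ["JOE_SCHMO"]
popular, leftover = top_frequency_buckets(data, 3)
print(popular)   # {'SYS': 5, 'PUBLIC': 4, 'SCOTT': 2}
print(leftover)  # 1 row (JOE_SCHMO) falls outside the histogram
```

The key property is that the popular values retain frequency-histogram precision, while the handful of leftover rows are accounted for separately.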

Let’s look at an example of this in action. We’ll use ALL_OBJECTS as a base, and the OWNER column will be ideal for a skewed set of data.

```
SQL> create table t1 as select * from all_objects;
Table created.
SQL> select column_name,histogram,num_nulls,num_distinct,sample_size,density
2 from user_tab_col_statistics
3 where table_name='T1'
4 and column_name='OWNER';
COLUMN_NAME HISTOGRAM NUM_NULLS NUM_DISTINCT SAMPLE_SIZE DENSITY
-------------------- --------------- ---------- ------------ ----------- ----------
OWNER NONE 0 38 94322 .026315789
1 row selected.
SQL> select owner,count(*)
2 from t1
3 group by owner
4 order by 2 desc;
OWNER COUNT(*)
------------------------------ ----------
SYS 42057
PUBLIC 37138
APEX_050000 3192
ORDSYS 3157
APEX_040200 3021
MDSYS 1647
SYSTEM 589
MCDONAC 470
XDB 403
CTXSYS 401
WMSYS 371
SH 300
DVSYS 292
ORDDATA 274
LBACSYS 237
ORDS_METADATA 184
GSMADMIN_INTERNAL 104
OE 93
DBSNMP 55
IX 48
SOE 47
SCOTT 39
HR 34
OLAPSYS 25
OJVMSYS 22
APEX_LISTENER 22
DVF 19
FLOWS_FILES 12
ORDPLUGINS 10
PM 10
AUDSYS 9
SI_INFORMTN_SCHEMA 8
OUTLN 8
BI 8
ORACLE_OCM 6
APPQOSSYS 5
TOP_DOG 3
JOE_SCHMO 2
38 rows selected.
```

So we can see that SYS, PUBLIC et al are popular values, whereas JOE_SCHMO is not, and APEX_PUBLIC_USER even less so (it does not appear in the table at all, although it is a valid user). Now we'll gather a histogram with 25 buckets, which is fewer than the 38 distinct values in the column but covers the vast majority of the data in the table.

```
SQL> begin
2 dbms_stats.gather_table_stats(user,'T1',
3 method_opt=>'for columns OWNER size 25');
4 end;
5 /
PL/SQL procedure successfully completed.
SQL> select
2 column_name,histogram,num_nulls,num_distinct,sample_size,density
3 from user_tab_col_statistics
4 where table_name='T1'
5 and column_name='OWNER';
COLUMN_NAME HISTOGRAM NUM_NULLS NUM_DISTINCT SAMPLE_SIZE DENSITY
-------------------- --------------- ---------- ------------ ----------- ----------
OWNER TOP-FREQUENCY 0 38 94322 .000005301
1 row selected.
```

We can see the new 12c histogram type, TOP-FREQUENCY, which has done what was alluded to above, namely built a frequency histogram on the popular values rather than a generic height-balanced one. Now let's count the number of rows in the table, and look at the endpoint values in the histogram. We'll need those shortly.

```
SQL> col c new_value nvals
SQL> select count(*) c from t1;
C
----------
94322
1 row selected.
SQL> select endpoint_number,endpoint_actual_value
2 from user_tab_histograms
3 where table_name ='T1'
4 and column_name ='OWNER';
ENDPOINT_NUMBER ENDPOINT_ACTUAL_VALUE
--------------- ---------------------
3021 APEX_040200
6213 APEX_050000
6235 APEX_LISTENER
6636 CTXSYS
6691 DBSNMP
6983 DVSYS
7087 GSMADMIN_INTERNAL
7121 HR
7169 IX
7406 LBACSYS
7876 MCDONAC
9523 MDSYS
9616 OE
9641 OLAPSYS
9915 ORDDATA
13072 ORDSYS
13256 ORDS_METADATA
50394 PUBLIC
50433 SCOTT
50733 SH
50780 SOE
92837 SYS
93426 SYSTEM
93797 WMSYS
94200 XDB
25 rows selected.
```

Our first query now shows the benefit of this kind of histogram. When I query for a popular value (one in the histogram), I get very good cardinality estimation. There are 403 XDB rows, and the execution plan got it spot on.

```
SQL> set autotrace on explain
SQL> select count(*) from t1
2 where owner ='XDB';
COUNT(*)
----------
403
1 row selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 441 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | TABLE ACCESS FULL| T1 | 403 | 2418 | 441 (1)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNER"='XDB')
```

So what do we do with a nonpopular value? After all, it is not in the histogram. The documentation tells us, in essence, that for a value not captured in the histogram, the optimizer estimates the cardinality using the column density.

So let's give that a go, and see if the estimates align.

```
SQL> set autotrace off
SQL> select density
2 from user_tab_col_statistics
3 where table_name='T1'
4 and column_name='OWNER';
DENSITY
----------
.000005301
1 row selected.
SQL>
SQL>
SQL> select &&nvals*density
2 from user_tab_col_statistics
3 where table_name='T1'
4 and column_name='OWNER';
94322*DENSITY
-------------
.5
1 row selected.
```

So according to this, we would expect a cardinality estimate of 1 (ie, 0.5 rounded up), but look what happens:

```
SQL> set autotrace on explain
SQL> select count(*) from t1
2 where owner ='APEX_PUBLIC_USER';
COUNT(*)
----------
0
1 row selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 441 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | TABLE ACCESS FULL| T1 | 9 | 54 | 441 (1)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNER"='APEX_PUBLIC_USER')
SQL>
SQL>
```

The challenge now becomes: where does the "9" come from? The secret lies in the 10053 trace file. When we look at it, some new information comes to light.

```
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#1):
    NewDensity:0.000099, OldDensity:0.000005 BktCnt:94208.000000, PopBktCnt:94208.000000, PopValCnt:25, NDV:38
  Column (#1): OWNER(VARCHAR2)
    AvgLen: 6 NDV: 38 Nulls: 0 Density: 0.000099
  Histogram: Top-Freq  #Bkts: 94208  UncompBkts: 94208  EndPtVals: 25  ActualVal: yes
```

Notice the presence of a value "NewDensity". If we use that value instead of the "density" from user_tab_cols, we arrive at the estimate of 9 rows seen in the execution plan. Since we are costing a nonpopular value, NewDensity is designed to more closely reflect the density of the nonpopular values, not of all rows.

“NewDensity” appears to be: number of nonpopular rows / ( number of distinct nonpopular values * num_rows )
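We can sanity-check that formula against the figures from the trace above. The sketch below plugs in the numbers from this post; Oracle's exact internal computation may differ slightly (the trace shows 0.000099 where this formula gives roughly 0.000093), but both round to the same 9-row estimate.

```python
# Worked check of the apparent NewDensity formula, using figures from this post
num_rows = 94322       # total rows in T1
bkt_cnt = 94208        # rows covered by the Top-Frequency histogram (BktCnt)
ndv = 38               # distinct values in OWNER
pop_val_cnt = 25       # values captured in the histogram (PopValCnt)

nonpopular_rows = num_rows - bkt_cnt      # 114 rows outside the histogram
nonpopular_ndv = ndv - pop_val_cnt        # 13 distinct nonpopular values

new_density = nonpopular_rows / (nonpopular_ndv * num_rows)
estimated_rows = round(new_density * num_rows)

print(new_density)     # ~0.000093, in the same ballpark as the trace's 0.000099
print(estimated_rows)  # 9 -- matching the Rows estimate in the plan
```

Note that density × num_rows collapses to nonpopular_rows / nonpopular_ndv, i.e. the average row count per nonpopular value (114 / 13 ≈ 8.8), which is an intuitively sensible guess for a value the histogram knows nothing about.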


Hi Connor,

It's always great to have Alberto's white paper as background (though of course all the 12c possibilities need to be verified): http://www.adellera.it/investigations/11g_newdensity/11gNewDensity.pdf

Regards

Stefan

Hello Connor,

Depending on whether the min and max values of the OWNER column are naturally included in the TOP-N values (or forced in, because they have to be in the histogram), there are two formulas to use when calculating NewDensity.

I have tried to explain these formulas in the following article: http://allthingsoracle.com/12c-histogram-top-frequency/

When applied to your case (I don't have the same all_objects content, though), I get the following:

This is of course backed by the corresponding 10053 trace file:

Best regards

Mohamed Houri

Hi Mohamed,

Thanks for stopping by and adding to the content.

Cheers,

Connor

Thanks Connor. Was it my question from AskTom that you found interesting and that triggered this blog post?

"So according to this, we would expect a cost value of 1"

It is not a cost, it is a cardinality (card), so that should be rewritten like this:

"So according to this, we would expect a CARD value of 1"