Faster DISTINCT operations in 19c

August 17, 2020

2–4 minutes

If I gave you a telephone book and asked you to tell me how many distinct street names are present in the book, then the most likely thing you would do is …Wave your mobile phone at me and ask what a “telephone book” is Smile . But assuming you’re old enough to remember telephone books, you’ll probably tackle the task by sorting the entire set in street name order and then work your way down the list and incrementing your count of distinct street names every time it changed.

In database terms, a little demo of that approach is as below



SQL> create table telephone_book
  2  as select 'Street'||trunc(dbms_random.value(1,30)) street
  3  from dual
  4  connect by level  select street
  2  from   telephone_book
  3  order by 1;

STREET
----------------------------------------------
Street1             == 1st street
Street1
Street10            == 2nd street 
Street10
Street11            == 3rd street 
Street11
Street11
Street11
Street11
Street11
...
...
Street8
Street9             == 28th street
Street9
Street9
Street9

100 rows selected.

which results in our 28 distinct streets.

But what if we were not allowed to sort the data. Perhaps a more realistic question is – what if it was ridiculously prohibitive in terms of time and effort to sort the data? I don’t know about you, but I have better things to do with my weekend than sort the telephone book! Smile

From my demo above, I can see that the highest possible number of distinct streets I could have is 30. So rather than sort the data, I can create a simple 30-character string, which each character represents the occurrence the partnering street. My string would start off as (the hyphens added for clarity)

NNNNNNNNNN-NNNNNNNNNN-NNNNNNNNNN

and as I read the telephone book, I simply change the string for each street I encounter. If the street for the first row is “Street7”, I change my string to be:

NNNNNNYNNN-NNNNNNNNNN-NNNNNNNNNN

If the second row is “Street22” then the string becomes:

NNNNNNYNNN-NNNNNNNNNN-NYNNNNNNNN

and so on until I have read the entire table. At the end of the exercise, I just count the number of “Y” and that is the distinct count of streets. I never had to sort the data. Here’s a simple code demo of that process:


SQL> set serverout on
SQL> declare
  2    bits varchar2(32) := rpad('N',32,'N');
  3    street_no int;
  4  begin
  5    for i in ( select rownum idx, t.* from telephone_book t )
  6    loop
  7      street_no := to_number(ltrim(i.street,'Street'));
  8      bits := substr(bits,1,street_no-1)||'Y'||substr(bits,street_no+1);
  9      dbms_output.put_line('After row '||i.idx||' map='||bits);
 10    end loop;
 11    dbms_output.put_line('Final='||bits);
 12    dbms_output.put_line('ndv='||(length(bits)-length(replace(bits,'Y'))));
 13  end;
 14  /
After row 1 map=NNNNNNNNNNNNNNNNNNNNNYNNNNNNNNNN
After row 2 map=NNNNNNNNNNNNNNNNNNYNNYNNNNNNNNNN
After row 3 map=NNNNNNNNYNNNNNNNNNYNNYNNNNNNNNNN
After row 4 map=NNNNNNNNYNNNNYNNNNYNNYNNNNNNNNNN
...
...
After row 97 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 98 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 99 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 100 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
Final=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
ndv=28

PL/SQL procedure successfully completed.

and we can verify the result with a standard SQL query:


SQL> select count(distinct street)
  2  from   telephone_book;

COUNT(DISTINCTSTREET)
---------------------
                   28

Of course, such a simple solution masks a lot of complexity in implementing something like this for an arbitrary set of data.

How do we know the upper limit on the potential number of distinct rows?
How do we rapidly count the number of “Y” or “hits” once we have scanned the data?
Have we just shifted the problem to an enormous memory structure?

You’ll be reassured to know that a lot of thought has gone into tackling these issues, and thus taking the simple demo above into a genuine robust implementation within version 19c of the database. Many queries requiring distinct counts can be achieved now without requiring an expensive sort.

Here’s the video walking through exactly what we’ve done in 19c, and what some of the benefits are:

10 responses to “Faster DISTINCT operations in 19c”

Brian Fitzgerald

August 18, 2020 at 12:38 pm

Baffled.
select version from v$instance;
19.0.0.0.0

with t as
(
select
level v
from dual
connect by level <= 64991
)
select count(distinct v) ndv
from t;

64991

with t as
(
select
level v
from dual
connect by level <= 64991
), bm as
(
select
bitmap_construct_agg(v) bca,
null
from
t
group by
bitmap_bucket_number(v)
)
select
bitmap_count(bitmap_or_agg(bca)) bndv
from bm
;

64983

Loading…

Reply

Brian Fitzgerald

August 18, 2020 at 12:39 pm

with t as 
(
select 
level v 
from dual
connect by level <= 64991
)
select count(distinct v) ndv
from t;

64991Loading...

Brian Fitzgerald

August 18, 2020 at 12:40 pm

with t as 
(
select 
level v 
from dual
connect by level <= 64991
), bm as
(
select 
bitmap_construct_agg(v) bca,
null
from
t
group by 
bitmap_bucket_number(v)
)
select 
bitmap_count(bitmap_or_agg(bca)) bndv
from bm
;

64983

Brian Fitzgerald

August 18, 2020 at 12:41 pm
```
with t as 
(
select 
level v 
from dual
connect by level <= 64992
), bm as
(
select 
bitmap_construct_agg(v) bca,
null
from
t
group by 
bitmap_bucket_number(v)
)
select 
bitmap_count(bitmap_or_agg(bca)) bndv
from bm
;

ORA-62600: Invalid value is passed to a bitmap operator.
62600. 00000 -  "Invalid value is passed to a bitmap operator."
*Cause:    An attempt was made to pass an invalid value to a bitmap operator.
*Action:   Pass only expressions that have integer values to BITMAP_BUCKET_NUMBER
           and BITMAP_BIT_POSITION and provide only positive integer values for
           BITMAP_CONSTRUCT_AGG.
```
Loading…
Reply
1. Connor McDonald
  
  August 19, 2020 at 10:55 am
  
  Is this what you had in mind ?
  
  SQL> with t as
  2 (
  3 select
  4 level v
  5 from dual
  6 connect by level <= 64992
  7 ), bm as
  8 (
  9 select
  10 bitmap_construct_agg(bitmap_bit_position(v)) bca,
  11 bitmap_bucket_number(v) bkt
  12 from
  13 t
  14 group by bitmap_bucket_number(v)
  15 )
  16 select bkt, bitmap_count(bca) bndv
  17 from bm
  18 ;
  
  BKT BNDV
  ———- ———-
  1 32760
  2 32232
  
  Good presentation here https://cdn.ymaws.com/ukoug.org/resource/collection/904331A4-A794-4730-AC6B-6AB8B1DD1E36/UKOUG_Count_Distinct.pdf
  
  Loading…
  
  Reply
Brian Fitzgerald

August 18, 2020 at 12:42 pm

Expected 64991. Got 64983.

Expected 64992. Got error.

Loading…

Reply
Brian Fitzgerald

August 18, 2020 at 12:53 pm

This select distinct technique only works on nonnegative integers.

Loading…

Reply

Brian Fitzgerald

August 20, 2020 at 7:46 am

Thank you Connor.

I get bucket size 16000.

with t as 
(
    select 
    level v 
    from dual
    connect by level <= 64992
), bm as
(
    select 
        bitmap_bucket_number(v) bkt,
        bitmap_construct_agg(bitmap_bit_position(v)) bca
    from
    t
    group by 
    bitmap_bucket_number(v)
)
select 
    bkt,
    sum(bitmap_count(bca)) bndv
from bm
group by 
grouping sets
(
    (),
    (
    bkt
    )
)
;

BKT	BNDV
1	16000
2	16000
3	16000
4	16000
5	992
	64992

Connor McDonald

August 20, 2020 at 9:55 am

I don’t know how we decide on size of bucket. I suspect its max_string_size.

Loading…

Reply

My favorite top 10 new features in Oracle database 19c – svenweller

October 13, 2021 at 12:02 am

[…] Connor McDonald explains how the logic works: https://connor-mcdonald.com/2020/08/17/faster-distinct-operations-in-19c/ […]

Loading…

Reply

Learning is not a spectator sport

Leave a ReplyCancel reply

Trending

The “best” tool

Tip: Expanding a SQL statement

WITH clause gets a 26ai boost!

The Goldilocks problem – Materialized Views