Faster DISTINCT operations in 19c

Posted by

If I gave you a telephone book and asked you to tell me how many distinct street names are present in the book, then the most likely thing you would do is …Wave your mobile phone at me and ask what a “telephone book” is Smile. But assuming you’re old enough to remember telephone books, you’ll probably tackle the task by sorting the entire set in street name order and then work your way down the list and incrementing your count of distinct street names every time it changed. 

In database terms, a little demo of that approach is as below



SQL> create table telephone_book
  2  as select 'Street'||trunc(dbms_random.value(1,30)) street
  3  from dual
  4  connect by level  select street
  2  from   telephone_book
  3  order by 1;

STREET
----------------------------------------------
Street1             == 1st street
Street1
Street10            == 2nd street 
Street10
Street11            == 3rd street 
Street11
Street11
Street11
Street11
Street11
...
...
Street8
Street9             == 28th street
Street9
Street9
Street9

100 rows selected.

which results in our 28 distinct streets.

But what if we were not allowed to sort the data. Perhaps a more realistic question is – what if it was ridiculously prohibitive in terms of time and effort to sort the data? I don’t know about you, but I have better things to do with my weekend than sort the telephone book! Smile

From my demo above, I can see that the highest possible number of distinct streets I could have is 30. So rather than sort the data, I can create a simple 30-character string, which each character represents the occurrence the partnering street.  My string would start off as (the hyphens added for clarity)

NNNNNNNNNN-NNNNNNNNNN-NNNNNNNNNN

and as I read the telephone book, I simply change the string for each street I encounter. If the street for the first row is “Street7”, I change my string to be:

NNNNNNYNNN-NNNNNNNNNN-NNNNNNNNNN

If the second row is “Street22” then the string becomes:

NNNNNNYNNN-NNNNNNNNNN-NYNNNNNNNN

and so on until I have read the entire table. At the end of the exercise, I just count the number of “Y” and that is the distinct count of streets. I never had to sort the data. Here’s a simple code demo of that process:


SQL> set serverout on
SQL> declare
  2    bits varchar2(32) := rpad('N',32,'N');
  3    street_no int;
  4  begin
  5    for i in ( select rownum idx, t.* from telephone_book t )
  6    loop
  7      street_no := to_number(ltrim(i.street,'Street'));
  8      bits := substr(bits,1,street_no-1)||'Y'||substr(bits,street_no+1);
  9      dbms_output.put_line('After row '||i.idx||' map='||bits);
 10    end loop;
 11    dbms_output.put_line('Final='||bits);
 12    dbms_output.put_line('ndv='||(length(bits)-length(replace(bits,'Y'))));
 13  end;
 14  /
After row 1 map=NNNNNNNNNNNNNNNNNNNNNYNNNNNNNNNN
After row 2 map=NNNNNNNNNNNNNNNNNNYNNYNNNNNNNNNN
After row 3 map=NNNNNNNNYNNNNNNNNNYNNYNNNNNNNNNN
After row 4 map=NNNNNNNNYNNNNYNNNNYNNYNNNNNNNNNN
...
...
After row 97 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 98 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 99 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 100 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
Final=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
ndv=28

PL/SQL procedure successfully completed.

and we can verify the result with a standard SQL query:


SQL> select count(distinct street)
  2  from   telephone_book;

COUNT(DISTINCTSTREET)
---------------------
                   28

Of course, such a simple solution masks a lot of complexity in implementing something like this for an arbitrary set of data.

  • How do we know the upper limit on the potential number of distinct rows?
  • How do we rapidly count the number of “Y” or “hits” once we have scanned the data?
  • Have we just shifted the problem to an enormous memory structure?

You’ll be reassured to know that a lot of thought has gone into tackling these issues, and thus taking the simple demo above into a genuine robust implementation within version 19c of the database. Many queries requiring distinct counts can be achieved now without requiring an expensive sort.

Here’s the video walking through exactly what we’ve done in 19c, and what some of the benefits are:

10 comments

  1. Baffled.
    select version from v$instance;
    19.0.0.0.0

    with t as
    (
    select
    level v
    from dual
    connect by level <= 64991
    )
    select count(distinct v) ndv
    from t;

    64991

    with t as
    (
    select
    level v
    from dual
    connect by level <= 64991
    ), bm as
    (
    select
    bitmap_construct_agg(v) bca,
    null
    from
    t
    group by
    bitmap_bucket_number(v)
    )
    select
    bitmap_count(bitmap_or_agg(bca)) bndv
    from bm
    ;

    64983

  2. with t as 
    (
    select 
    level v 
    from dual
    connect by level <= 64991
    ), bm as
    (
    select 
    bitmap_construct_agg(v) bca,
    null
    from
    t
    group by 
    bitmap_bucket_number(v)
    )
    select 
    bitmap_count(bitmap_or_agg(bca)) bndv
    from bm
    ;
    
    64983
    
  3. with t as 
    (
    select 
    level v 
    from dual
    connect by level <= 64992
    ), bm as
    (
    select 
    bitmap_construct_agg(v) bca,
    null
    from
    t
    group by 
    bitmap_bucket_number(v)
    )
    select 
    bitmap_count(bitmap_or_agg(bca)) bndv
    from bm
    ;
    
    ORA-62600: Invalid value is passed to a bitmap operator.
    62600. 00000 -  "Invalid value is passed to a bitmap operator."
    *Cause:    An attempt was made to pass an invalid value to a bitmap operator.
    *Action:   Pass only expressions that have integer values to BITMAP_BUCKET_NUMBER
               and BITMAP_BIT_POSITION and provide only positive integer values for
               BITMAP_CONSTRUCT_AGG.
    
    1. Is this what you had in mind ?

      SQL> with t as
      2 (
      3 select
      4 level v
      5 from dual
      6 connect by level <= 64992
      7 ), bm as
      8 (
      9 select
      10 bitmap_construct_agg(bitmap_bit_position(v)) bca,
      11 bitmap_bucket_number(v) bkt
      12 from
      13 t
      14 group by bitmap_bucket_number(v)
      15 )
      16 select bkt, bitmap_count(bca) bndv
      17 from bm
      18 ;

      BKT BNDV
      ———- ———-
      1 32760
      2 32232

      Good presentation here https://cdn.ymaws.com/ukoug.org/resource/collection/904331A4-A794-4730-AC6B-6AB8B1DD1E36/UKOUG_Count_Distinct.pdf

  4. Thank you Connor.

    I get bucket size 16000.

    with t as 
    (
        select 
        level v 
        from dual
        connect by level <= 64992
    ), bm as
    (
        select 
            bitmap_bucket_number(v) bkt,
            bitmap_construct_agg(bitmap_bit_position(v)) bca
        from
        t
        group by 
        bitmap_bucket_number(v)
    )
    select 
        bkt,
        sum(bitmap_count(bca)) bndv
    from bm
    group by 
    grouping sets
    (
        (),
        (
        bkt
        )
    )
    ;
    
    BKT	BNDV
    1	16000
    2	16000
    3	16000
    4	16000
    5	992
    	64992
    

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.