If I gave you a telephone book and asked you to tell me how many distinct street names are present in the book, then the most likely thing you would do is …Wave your mobile phone at me and ask what a “telephone book” is . But assuming you’re old enough to remember telephone books, you’ll probably tackle the task by sorting the entire set in street name order and then work your way down the list and incrementing your count of distinct street names every time it changed.
In database terms, a little demo of that approach is as below
SQL> create table telephone_book 2 as select 'Street'||trunc(dbms_random.value(1,30)) street 3 from dual 4 connect by level select street 2 from telephone_book 3 order by 1; STREET ---------------------------------------------- Street1 == 1st street Street1 Street10 == 2nd street Street10 Street11 == 3rd street Street11 Street11 Street11 Street11 Street11 ... ... Street8 Street9 == 28th street Street9 Street9 Street9 100 rows selected.
which results in our 28 distinct streets.
But what if we were not allowed to sort the data. Perhaps a more realistic question is – what if it was ridiculously prohibitive in terms of time and effort to sort the data? I don’t know about you, but I have better things to do with my weekend than sort the telephone book!
From my demo above, I can see that the highest possible number of distinct streets I could have is 30. So rather than sort the data, I can create a simple 30-character string, which each character represents the occurrence the partnering street. My string would start off as (the hyphens added for clarity)
and as I read the telephone book, I simply change the string for each street I encounter. If the street for the first row is “Street7”, I change my string to be:
If the second row is “Street22” then the string becomes:
and so on until I have read the entire table. At the end of the exercise, I just count the number of “Y” and that is the distinct count of streets. I never had to sort the data. Here’s a simple code demo of that process:
SQL> set serverout on SQL> declare 2 bits varchar2(32) := rpad('N',32,'N'); 3 street_no int; 4 begin 5 for i in ( select rownum idx, t.* from telephone_book t ) 6 loop 7 street_no := to_number(ltrim(i.street,'Street')); 8 bits := substr(bits,1,street_no-1)||'Y'||substr(bits,street_no+1); 9 dbms_output.put_line('After row '||i.idx||' map='||bits); 10 end loop; 11 dbms_output.put_line('Final='||bits); 12 dbms_output.put_line('ndv='||(length(bits)-length(replace(bits,'Y')))); 13 end; 14 / After row 1 map=NNNNNNNNNNNNNNNNNNNNNYNNNNNNNNNN After row 2 map=NNNNNNNNNNNNNNNNNNYNNYNNNNNNNNNN After row 3 map=NNNNNNNNYNNNNNNNNNYNNYNNNNNNNNNN After row 4 map=NNNNNNNNYNNNNYNNNNYNNYNNNNNNNNNN ... ... After row 97 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN After row 98 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN After row 99 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN After row 100 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN Final=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN ndv=28 PL/SQL procedure successfully completed.
and we can verify the result with a standard SQL query:
SQL> select count(distinct street) 2 from telephone_book; COUNT(DISTINCTSTREET) --------------------- 28
Of course, such a simple solution masks a lot of complexity in implementing something like this for an arbitrary set of data.
- How do we know the upper limit on the potential number of distinct rows?
- How do we rapidly count the number of “Y” or “hits” once we have scanned the data?
- Have we just shifted the problem to an enormous memory structure?
You’ll be reassured to know that a lot of thought has gone into tackling these issues, and thus taking the simple demo above into a genuine robust implementation within version 19c of the database. Many queries requiring distinct counts can be achieved now without requiring an expensive sort.
Here’s the video walking through exactly what we’ve done in 19c, and what some of the benefits are:
select version from v$instance;
with t as
connect by level <= 64991
select count(distinct v) ndv
with t as
connect by level <= 64991
), bm as
Is this what you had in mind ?
SQL> with t as
4 level v
5 from dual
6 connect by level <= 64992
7 ), bm as
10 bitmap_construct_agg(bitmap_bit_position(v)) bca,
11 bitmap_bucket_number(v) bkt
14 group by bitmap_bucket_number(v)
16 select bkt, bitmap_count(bca) bndv
17 from bm
Good presentation here https://cdn.ymaws.com/ukoug.org/resource/collection/904331A4-A794-4730-AC6B-6AB8B1DD1E36/UKOUG_Count_Distinct.pdf
Expected 64991. Got 64983.
Expected 64992. Got error.
This select distinct technique only works on nonnegative integers.
Thank you Connor.
I get bucket size 16000.
I don’t know how we decide on size of bucket. I suspect its max_string_size.