If I gave you a telephone book and asked you to tell me how many distinct street names are present in the book, then the most likely thing you would do is …Wave your mobile phone at me and ask what a “telephone book” is . But assuming you’re old enough to remember telephone books, you’ll probably tackle the task by sorting the entire set in street name order and then work your way down the list and incrementing your count of distinct street names every time it changed.
In database terms, a little demo of that approach is as below
SQL> create table telephone_book
2 as select 'Street'||trunc(dbms_random.value(1,30)) street
3 from dual
4 connect by level select street
2 from telephone_book
3 order by 1;
STREET
----------------------------------------------
Street1 == 1st street
Street1
Street10 == 2nd street
Street10
Street11 == 3rd street
Street11
Street11
Street11
Street11
Street11
...
...
Street8
Street9 == 28th street
Street9
Street9
Street9
100 rows selected.
which results in our 28 distinct streets.
But what if we were not allowed to sort the data. Perhaps a more realistic question is – what if it was ridiculously prohibitive in terms of time and effort to sort the data? I don’t know about you, but I have better things to do with my weekend than sort the telephone book!
From my demo above, I can see that the highest possible number of distinct streets I could have is 30. So rather than sort the data, I can create a simple 30-character string, which each character represents the occurrence the partnering street. My string would start off as (the hyphens added for clarity)
NNNNNNNNNN-NNNNNNNNNN-NNNNNNNNNN
and as I read the telephone book, I simply change the string for each street I encounter. If the street for the first row is “Street7”, I change my string to be:
NNNNNNYNNN-NNNNNNNNNN-NNNNNNNNNN
If the second row is “Street22” then the string becomes:
NNNNNNYNNN-NNNNNNNNNN-NYNNNNNNNN
and so on until I have read the entire table. At the end of the exercise, I just count the number of “Y” and that is the distinct count of streets. I never had to sort the data. Here’s a simple code demo of that process:
SQL> set serverout on
SQL> declare
2 bits varchar2(32) := rpad('N',32,'N');
3 street_no int;
4 begin
5 for i in ( select rownum idx, t.* from telephone_book t )
6 loop
7 street_no := to_number(ltrim(i.street,'Street'));
8 bits := substr(bits,1,street_no-1)||'Y'||substr(bits,street_no+1);
9 dbms_output.put_line('After row '||i.idx||' map='||bits);
10 end loop;
11 dbms_output.put_line('Final='||bits);
12 dbms_output.put_line('ndv='||(length(bits)-length(replace(bits,'Y'))));
13 end;
14 /
After row 1 map=NNNNNNNNNNNNNNNNNNNNNYNNNNNNNNNN
After row 2 map=NNNNNNNNNNNNNNNNNNYNNYNNNNNNNNNN
After row 3 map=NNNNNNNNYNNNNNNNNNYNNYNNNNNNNNNN
After row 4 map=NNNNNNNNYNNNNYNNNNYNNYNNNNNNNNNN
...
...
After row 97 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 98 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 99 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 100 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
Final=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
ndv=28
PL/SQL procedure successfully completed.
and we can verify the result with a standard SQL query:
SQL> select count(distinct street)
2 from telephone_book;
COUNT(DISTINCTSTREET)
---------------------
28
Of course, such a simple solution masks a lot of complexity in implementing something like this for an arbitrary set of data.
- How do we know the upper limit on the potential number of distinct rows?
- How do we rapidly count the number of “Y” or “hits” once we have scanned the data?
- Have we just shifted the problem to an enormous memory structure?
You’ll be reassured to know that a lot of thought has gone into tackling these issues, and thus taking the simple demo above into a genuine robust implementation within version 19c of the database. Many queries requiring distinct counts can be achieved now without requiring an expensive sort.
Here’s the video walking through exactly what we’ve done in 19c, and what some of the benefits are:
Baffled.
select version from v$instance;
19.0.0.0.0
with t as
(
select
level v
from dual
connect by level <= 64991
)
select count(distinct v) ndv
from t;
64991
with t as
(
select
level v
from dual
connect by level <= 64991
), bm as
(
select
bitmap_construct_agg(v) bca,
null
from
t
group by
bitmap_bucket_number(v)
)
select
bitmap_count(bitmap_or_agg(bca)) bndv
from bm
;
64983
Is this what you had in mind ?
SQL> with t as
2 (
3 select
4 level v
5 from dual
6 connect by level <= 64992
7 ), bm as
8 (
9 select
10 bitmap_construct_agg(bitmap_bit_position(v)) bca,
11 bitmap_bucket_number(v) bkt
12 from
13 t
14 group by bitmap_bucket_number(v)
15 )
16 select bkt, bitmap_count(bca) bndv
17 from bm
18 ;
BKT BNDV
———- ———-
1 32760
2 32232
Good presentation here https://cdn.ymaws.com/ukoug.org/resource/collection/904331A4-A794-4730-AC6B-6AB8B1DD1E36/UKOUG_Count_Distinct.pdf
Expected 64991. Got 64983.
Expected 64992. Got error.
This select distinct technique only works on nonnegative integers.
Thank you Connor.
I get bucket size 16000.
I don’t know how we decide on size of bucket. I suspect its max_string_size.