How to generate random unique values

When sequences won’t do

In an Oracle database if someone comes to you and says “I need to generate unique numbers” then anyone with any experience of Oracle will more likely than not suggest a sequence. And that is good advice because a sequence is incredibly fast, scales well with multiple users, is very easy to code and is guaranteed unique. It is the perfect tool for generating surrogate keys, that is, meaningless numbers for primary keys in particular, which of course by the very definition of primary key, must be unique


SQL> create table t ( pk number);

Table created.

SQL> create sequence seq cache 1000;

Sequence created.

SQL>
SQL> set timing on
SQL> begin
  2    for i in 1 .. 100000 loop
  3      insert into t values (seq.nextval);
  4    end loop;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:04.81

But sometimes those same unique identifiers that we want to generate, actually identifies a core piece of information that we reveal to the customers of our systems. The simplicity of a sequence is also a drawback in this instance.


SQL> select pk from t where rownum <= 20;

    PK
------
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20

because if I see a set of roughly contiguous numbers, and I know that they are, for instance, customer identifiers, then that is potentially a area of concern in terms of security. A a potential hacker may find an avenue to probe for customer identifiers simply by working their way through the contiguous pattern.

GUID to the rescue?

An alternative to sequences for these scenarios is to use a globally unique identifier. The database has a facility to do this as well via the SYS_GUID function It requires a touch more code but is very similar to the sequence solution. Commonly if you are using tools like QuickSQL then we convert the GUID to a number and it looks suitably random



SQL> set timing on
SQL> begin
  2    for i in 1 .. 100000 loop
  3      insert into t values (to_number(sys_guid(), 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'));
  4    end loop;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.99

SQL> select pk from t where rownum <= 20;

                                       PK
-----------------------------------------
  105854066610474458368090556750481295785
  186822031504444284720892382281585799957
  132888271924145856133313587357433340980
  276167378372574928342970152593385328900
   74086652057562125835746396515377682140
  286319886043592938247346128711951783544
  336167790885619983030212750918503752250
  288589581961303614692651099852065933187
  324920391859101420030979893594250781674
  126732679724678528065575262532843100163
  113307513485209808186387016050557945591
  200787479897466765534406064765323360878
  139368685100390784122367914412713398477
  238803160953268950614524266576466039309
  241991797837563579395004471847660812814
  120454289487617651156843055918296309789
  105272018464122283803281386350139492346
  198909665803908780010454156775156795329
  249866914728623919764347522241235284337
  339746714983781506209715409558673179734

20 rows selected.

Similarly, if we consider just the raw data type it also in this demo appears to be random:



SQL> drop table t purge;

Table dropped.

SQL> create table t ( pk raw(16));

Table created.

SQL> set timing on
SQL> begin
  2    for i in 1 .. 100000 loop
  3      insert into t values (sys_guid());
  4    end loop;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.83
SQL> set timing off
SQL>
SQL> select pk from t where rownum <= 20;

PK
--------------------------------
CFF50A624B4F456A8E429D12CA55C67B
4E6B43C7B661401580A78F39DDDBA100
DF7AC2A6DB4F47A2AF3B560DCE4BBA32
B276B27FD6F640DB9A5669D21B86D374
2D2079BFEA494A7C90374DB6955A1263
2E5F210976C1423899FF736108842D72
A6E03758A3B2436D9941C7CAB87406D4
DBC3033CB54544D19CC1CD35FF8B4691
8C4542D16FB342878C2A8B230C21EE1E
1A2BD7CCD5CE43CDB7AAF2DC4615889B
274C873A645748C9B886F20427AD0B94
5E509539B2D4449A919459D6507078E3
8E8F2E18A34544CDBA2D2703CBCADD47
DA2244E2CFF443E382E6CE86C09F8615
F52369F769A147CE918E9E98426A8493
58F5D7B3F2094A35BA7BEAC6DCD93B39
BA330C0BB236430CB3228EF3C96102C5
F003DE9F90714016923DC83D2977F593
5F3E03A0AFF04F88857778C102F6D39D
BC0B24E1BFA0433D94AED3A8B489A9D6

20 rows selected.

But all is not as it seems because the generation of the value for SYS_GUID is platform dependent and when I run the same demo on Linux then suddenly our unique identifier suffers the same issue as the sequence namely it could potentially be predicted by a malicious party. It is almost hard to even distinguish that these values are unique, but I’ve highlighted the critical nybble.



SQL> set timing on
SQL> begin
  2    for i in 1 .. 100000 loop
  3      insert into t values (sys_guid());
  4    end loop;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.22
SQL> set timing off
SQL>
SQL> select pk from t where rownum <= 20;

PK
--------------------------------
AA1252410FA30F15E0539314000A0C74
AA1252410FA40F15E0539314000A0C74
AA1252410FA50F15E0539314000A0C74
AA1252410FA60F15E0539314000A0C74
AA1252410FA70F15E0539314000A0C74
AA1252410FA80F15E0539314000A0C74
AA1252410FA90F15E0539314000A0C74
AA1252410FAA0F15E0539314000A0C74
AA1252410FAB0F15E0539314000A0C74
AA1252410FAC0F15E0539314000A0C74
AA1252410FAD0F15E0539314000A0C74
AA1252410FAE0F15E0539314000A0C74
AA1252410FAF0F15E0539314000A0C74
AA1252410FB00F15E0539314000A0C74
AA1252410FB10F15E0539314000A0C74
AA1252410FB20F15E0539314000A0C74
AA1252410FB30F15E0539314000A0C74
AA1252410FB40F15E0539314000A0C74
AA1252410FB50F15E0539314000A0C74
AA1252410FB60F15E0539314000A0C74

DBMS_RANDOM to the rescue?

Maybe a solution is to use the random number generation facilities inside the database. However the very definition of “random” runs counter to that of uniqueness. For a random number generator to be truly random, it could easily generate the same number 20 times in a row and still be valid in terms of randomness.

The lack of uniqueness when using a random number generator randomness this is easy to prove by doing a drop in replacement of our GUID code with DBMS_RANDOM. The demo runs to completion but we can easily detect that some duplicates occurred.



SQL> drop table t purge;

Table dropped.

SQL> create table t ( pk number);

Table created.

SQL> set timing on
SQL> begin
  2    for i in 1 .. 100000 loop
  3      insert into t values (abs(dbms_random.random));
  4    end loop;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.14
SQL> set timing off
SQL>
SQL> select pk from t where rownum <= 20;

                                       PK
-----------------------------------------
                                702163599
                                628903810
                               1723610093
                                 53889653
                                552375825
                                715776656
                               1023547593
                               1180087997
                               1276294049
                               2118855978
                               1482501921
                               1298798620
                                  2674977
                                 45293496
                               2111077922
                               1749692232
                               1800848612
                                150316819
                                753335341
                               1252549540

20 rows selected.

SQL> select pk from t
  2  group by pk
  3  having count(*) > 1;

                                       PK
-----------------------------------------
                                918416655
                               2871298471

2 rows selected.

So to generate a identifier that is both random and therefore hard to guess but unique as well, we need to combine some of the techniques above. Only a sequence or a GUID is a guarantee of uniqueness so an identifier must at a minimum contain a usage of one of those.

Here is a potential solution. I will grab the seconds and fractions of a second from the current timestamp, and also I append to that a fixed width sequence to guarantee the uniqueness. I am choosing fixed width because if not, if there is a “bleed” of values between the timestamp values and the sequence values, the potential for duplicates would be increased. For example, a timestamp/sequence pair of “123-87” could clash with “12-387” if they were not both fixed width. And just to add a little more unpredictability I will suffix the value with a call to DBMS_RANDOM. Duplicate random numbers are not an issue, because the sequence element guarantees uniqueness.



SQL> drop table t purge;

Table dropped.

SQL> create table t ( pk number);

Table created.

SQL> drop sequence seq;

Sequence dropped.

SQL> create sequence seq start with 1000000 minvalue 1000000 maxvalue 9999999 cycle;

Sequence created.

SQL> set timing on
SQL> begin
  2    for i in 1 .. 100000 loop
  3      insert into t values (
  4            to_number(trunc(dbms_random.value(1000,9999))||
  5                      to_char(systimestamp,'FFSS')||
  6                      seq.nextval));
  7    end loop;
  8  end;
  9  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:05.87
SQL> set timing off
SQL> select pk from t where rownum <= 20;

                                         PK
-------------------------------------------
                        3607951782351001708
                        9784951885351001709
                        8223951981351001710
                        2862952057351001711
                        9015952125351001712
                        9447952198351001713
                        7311952264351001714
                        6102952328351001715
                        2204952392351001716
                        6937952455351001717
                        1260952519351001718
                        9725952584351001719
                        2274952648351001720
                        1274953133351001721
                        7580953210351001722
                        4164953278351001723
                        3659953344351001724
                        1928953410351001725
                        8497953476351001726
                        6248953542351001727

20 rows selected.

Running this demo you can still see that there is a slight semblance of a pattern due in the values due to the sequence but I would contend that this is still an impossible to guess identifier. And similarly I would contend that this identifier is “always” unique because in order to encounter a violation we would need to generate the same sequence number as it cycles around, along with the same random value from our random number generator, and chance upon both of those repeated values at the same fraction of a second within a one minute.

Too much information

Of course I started this post discussing the need for an identifier that reveals, for instance, a customer account. This would suggest that this identifier will be potentially provided to users directly, for example, when they register with our application they may be told “Congratulations, your customer ID is 123456” etc . Whilst we have a solution for our random and unique identifier requirement, that of course is now a very large number for them to remember and/or note down. Ideally we want to give them something smaller and simpler to remember. A solution is to pass them the value as reduced in size by converting the number to a larger base than decimal . Here is a simple example where we convert the identifier to base 34.



SQL> create or replace
  2  function base34(p_num number) return varchar2 is
  3    l_dig varchar2(34) := 'AB0CD1EF2GH3JK4LM5NP6QR7ST8UV9WXYZ';
  4    l_num number := p_num;
  5    l_str varchar2(38);
  6  begin
  7    loop
  8      l_str := substr(l_dig,mod(l_num,34)+1,1) || l_str ;
  9      l_num := trunc(l_num/34);
 10      exit when l_num = 0;
 11    end loop;
 12    return l_str;
 13  end;
 14  /

Function created.

SQL>
SQL> select base34(pk) from t where rownum <= 20;

BASE34(PK)
------------------------------------------------------------------------
CNZT85YD4ZGSD
BJ996YX6FXSVB
0DFQMNTKX9XFY
BWRWL8RMD7SEX
BL96GSRMUBM02
B3V5H8WNRZ5EK
C1F3LF9KFZNDW
7LASBJFMWTLQ
0AVVDLXUBPJYM
B9FMN324WK11C
0459EMZWT883J
0K9PPRBPNB3T1
05J27N0G9AB5R
BDZAVV0FLSGD5
068AZPEECMDE0
BMFSBC4X80STK
CRWA0G2BBAPVV
0R0CTVZBD72H1
B1TE2B2ZJWW36
CN8M8ATS83U11

20 rows selected.

SQL>
SQL> create or replace
  2  function dec34(p_str varchar2) return number is
  3    l_dig varchar2(34) := 'AB0CD1EF2GH3JK4LM5NP6QR7ST8UV9WXYZ';
  4    l_num number := 0;
  5  begin
  6    for i in 1 .. length(p_str) loop
  7      l_num := l_num * 34 + instr(l_dig,upper(substr(p_str,i,1)))-1;
  8    end loop;
  9    return l_num;
 10  end;
 11  /

Function created.

SQL> select to_char(pk) from t where rownum = 1
  2  union all
  3  select base34(pk) from t where rownum = 1
  4  union all
  5  select to_char(dec34(base34(pk))) from t where rownum = 1;

TO_CHAR(PK)
-----------------------------------------------------------------
3607951782351001708
B5KS4X1SQVMMV
3607951782351001708

You may be pondering “I have never seen base 34 as a commonly used base” and you would be correct Smile . I have chosen 34 because it is the letters of the alphabet and the decimal digits but I have removed the letter “I” and letter “O” because when provided to customers, they could easily be misinterpreted as the number 1 or 0. Using my own customised base conversion adds a little more obfuscation, but mainly lets me provide a better service to customers (and probably my call centre employees) when they have to ask for this identifier.

Summary

If you need to generate identifiers that are both unique and random always make sure you at least include either a sequence or a GUID, and then augment or manipulate that value to provide the randomness you desire.

Update: Thanks for Oren for pointing out the missing “minvalue”

Learning is not a spectator sport