Do a quick Google search and you’ll find plenty of blog posts about why GUIDs are superior to integers for a unique identifier, and of course, an equal number of posts about why integers are superior to GUIDs. In the Oracle world, most people have been using sequence numbers since they were pretty much the only option available to us in earlier versions. But developers coming from other platforms often prefer GUIDs simply due to their familiarity with them.
I’m pretty much ambivalent when it comes to which one to use. In fact, a good example is the AskTOM database -which had exclusively sequence-based primary keys on inception, but as the database has evolved and developers of different backgrounds and ages have worked on it, there is now of mix of strategies. The older tables have sequence based primary keys, and many of the newer tables have GUIDs as primary keys. Don’t get me wrong – I’m not advocating that you should have a mix – for a given database schema I’d recommend picking one regime and sticking with it. But my point is, that even with the mix of approaches in the AskTOM schema, I’ve never encountered any problems or issues with either.
However, there is one use case where I would strongly recommend using some caution on the use of GUIDs, and that is in the arena of systems that load data in bulk (eg data warehouses).
GUIDs are not cheap to generate. A quick look at the history on the structure and generation of unique IDs at https://en.wikipedia.org/wiki/Universally_unique_identifier all come down to a common component – the need to generate a good random number, and “good” can be a struggle for computers because you need algorithms that ensure sufficient randomness and distribution of the generated numbers. That takes CPU cycles and whilst that is something that you will never notice when using a GUID for the 100 customers in your customer table, it definitely will be noticeable if you are going to attach a GUID to every one of your 10 million daily sales transactions, or telephone records, or similar.
Here’s a simple example where I’ll load 100 million rows into a table. First I’ll try the conventional (when it comes to Oracle) approach of using a sequence number to uniquely identify each row.
SQL> create sequence seq cache 50000; Sequence created. SQL> SQL> create table t1 2 ( pk int, 3 data1 int, 4 data2 varchar2(10), 5 data3 int, 6 data4 varchar2(10) 7 ) 8 tablespace demo 9 / Table created. SQL> SQL> set timing on SQL> insert /*+ APPEND */ into t1 2 select seq.nextval, int_val, char_val, int_val, char_val 3 from 4 ( select level int_val, to_char(level) char_val from dual connect by level <= 10000 ), 5 ( select 1 from dual connect by level <= 10000 ); 100000000 rows created. Elapsed: 00:03:31.42 SQL> SQL> commit; Commit complete. Elapsed: 00:00:00.01 SQL>
3 minutes 30 seconds for 100 million rows is pretty good performance for a laptop, although obviously the table structure here is very simple.
And now I’ll repeat the exercise with the same table structure, but using a raw column to hold the output of a call to SYS_GUID().
SQL> SQL> SQL> create table t2 2 ( pk raw(20), 3 data1 int, 4 data2 varchar2(10), 5 data3 int, 6 data4 varchar2(10) 7 ) 8 tablespace demo 9 / Table created. SQL> SQL> set timing on SQL> insert /*+ APPEND */ into t2 2 select sys_guid(), int_val, char_val, int_val, char_val 3 from 4 ( select level int_val, to_char(level) char_val from dual connect by level <= 10000 ), 5 ( select 1 from dual connect by level <= 10000 ); 100000000 rows created. Elapsed: 00:30:56.78 SQL> commit; Commit complete. Elapsed: 00:00:00.03
That’s right – we’ve blown out to 30 minutes. As you can see, there can be a large cost when the row volumes (and hence number of calls to generate a GUID) get large. We can even take the INSERT out of the equation here, and simply do a raw stress test to see how many GUIDs we can call from the SQL engine using the following test harness.
SQL> create table t ( sz int, dur interval day to second ); Table created. SQL> SQL> SQL> declare 2 ts_start timestamp; 3 ts_end timestamp; 4 iter int; 5 dummy raw(32); 6 begin 7 for i in 1 .. 8 loop 8 iter := power(10,i); 9 10 ts_start := systimestamp; 11 if iter <= 10000 then 12 select max(x) into dummy from 13 ( 14 select sys_guid() x from 15 ( select 1 from dual connect by level <= iter ) 16 ); 17 else 18 select max(x) into dummy from 19 ( 20 select sys_guid() x from 21 ( select 1 from dual connect by level <= iter/10000 ), 22 ( select 1 from dual connect by level <= 10000 ) 23 ); 24 end if; 25 26 ts_end := systimestamp; 27 insert into t values (iter, ts_end - ts_start ); 28 commit; 29 30 end loop; 31 end; 32 / PL/SQL procedure successfully completed. SQL> SQL> select * from t order by 1; SZ DUR ---------- ---------------------------------------------------------- 10 +00 00:00:00.000000 100 +00 00:00:00.000000 1000 +00 00:00:00.015000 10000 +00 00:00:00.172000 100000 +00 00:00:01.607000 1000000 +00 00:00:16.083000 10000000 +00 00:02:49.713000 100000000 +00 00:26:46.570000
I’m not trying to scare you off GUIDs – but like any functionality or feature you’re using to build applications, make sure you test it for business requirements you need to satisfy and make an informed decision on how best to use (or not use) it.