Gooey GUIDs

Posted by

Do a quick Google search and you’ll find plenty of blog posts about why GUIDs are superior to integers for a unique identifier, and of course, an equal number of posts about why integers are superior to GUIDs. In the Oracle world, most people have been using sequence numbers since they were pretty much the only option available to us in earlier versions. But developers coming from other platforms often prefer GUIDs simply due to their familiarity with them.

I’m pretty much ambivalent when it comes to which one to use.  In fact, a good example is the AskTOM database -which had exclusively sequence-based primary keys on inception, but as the database has evolved and developers of different backgrounds and ages have worked on it, there is now of mix of strategies. The older tables have sequence based primary keys, and many of the newer tables have GUIDs as primary keys. Don’t get me wrong – I’m not advocating that you should have a mix – for a given database schema I’d recommend picking one regime and sticking with it. But my point is, that even with the mix of approaches in the AskTOM schema, I’ve never encountered any problems or issues with either.

However, there is one use case where I would strongly recommend using some caution on the use of GUIDs, and that is in the arena of systems that load data in bulk (eg data warehouses).

GUIDs are not cheap to generate. A quick look at the history on the structure and generation of unique IDs at https://en.wikipedia.org/wiki/Universally_unique_identifier all come down to a common component – the need to generate a good random number, and “good” can be a struggle for computers because you need algorithms that ensure sufficient randomness and distribution of the generated numbers.  That takes CPU cycles and whilst that is something that you will never notice when using a GUID for the 100 customers in your customer table, it definitely will be noticeable if you are going to attach a GUID to every one of your 10 million daily sales transactions, or telephone records, or similar.

Here’s a simple example where I’ll load 100 million rows into a table.  First I’ll try the conventional (when it comes to Oracle) approach of using a sequence number to uniquely identify each row.



SQL> create sequence seq cache 50000;

Sequence created.

SQL>
SQL> create table t1
  2  ( pk int,
  3    data1 int,
  4    data2 varchar2(10),
  5    data3 int,
  6    data4 varchar2(10)
  7  )
  8  tablespace demo
  9  /

Table created.


SQL>
SQL> set timing on
SQL> insert /*+ APPEND */ into t1
  2  select seq.nextval, int_val, char_val, int_val, char_val
  3  from
  4   ( select level int_val, to_char(level) char_val from dual connect by level <= 10000 ),
  5   ( select 1 from dual connect by level <= 10000 );

100000000 rows created.

Elapsed: 00:03:31.42
SQL>
SQL> commit;

Commit complete.

Elapsed: 00:00:00.01
SQL>

3 minutes 30 seconds for 100 million rows is pretty good performance for a laptop, although obviously the table structure here is very simple.

And now I’ll repeat the exercise with the same table structure, but using a raw column to hold the output of a call to SYS_GUID().



SQL>
SQL>
SQL> create table t2
  2  ( pk raw(20),
  3    data1 int,
  4    data2 varchar2(10),
  5    data3 int,
  6    data4 varchar2(10)
  7  )
  8  tablespace demo
  9  /

Table created.

SQL>
SQL> set timing on
SQL> insert /*+ APPEND */ into t2
  2  select sys_guid(), int_val, char_val, int_val, char_val
  3  from
  4   ( select level int_val, to_char(level) char_val from dual connect by level <= 10000 ),
  5   ( select 1 from dual connect by level <= 10000 );

100000000 rows created.

Elapsed: 00:30:56.78
SQL> commit;

Commit complete.

Elapsed: 00:00:00.03

That’s right – we’ve blown out to 30 minutes. As you can see, there can be a large cost when the row volumes (and hence number of calls to generate a GUID) get large. We can even take the INSERT out of the equation here, and simply do a raw stress test to see how many GUIDs we can call from the SQL engine using the following test harness.



SQL> create table t ( sz int, dur interval day to second );

Table created.

SQL>
SQL>
SQL> declare
  2    ts_start timestamp;
  3    ts_end   timestamp;
  4    iter int;
  5    dummy raw(32);
  6  begin
  7   for i in 1 .. 8 loop
  8    iter := power(10,i);
  9
 10    ts_start := systimestamp;
 11    if iter <= 10000 then
 12       select max(x) into dummy from
 13       (
 14       select sys_guid() x from
 15       ( select 1 from dual connect by level <= iter )
 16       );
 17    else
 18       select max(x) into dummy from
 19       (
 20       select sys_guid() x from
 21       ( select 1 from dual connect by level <= iter/10000 ),
 22       ( select 1 from dual connect by level <= 10000 )
 23       );
 24    end if;
 25
 26    ts_end := systimestamp;
 27    insert into t values (iter, ts_end - ts_start );
 28    commit;
 29
 30   end loop;
 31  end;
 32  /

PL/SQL procedure successfully completed.

SQL>
SQL> select * from t order by 1;

        SZ DUR
---------- ----------------------------------------------------------
        10 +00 00:00:00.000000
       100 +00 00:00:00.000000
      1000 +00 00:00:00.015000
     10000 +00 00:00:00.172000
    100000 +00 00:00:01.607000
   1000000 +00 00:00:16.083000
  10000000 +00 00:02:49.713000
 100000000 +00 00:26:46.570000
 
 

I’m not trying to scare you off GUIDs – but like any functionality or feature you’re using to build applications, make sure you test it for business requirements you need to satisfy and make an informed decision on how best to use (or not use) it.

5 comments

  1. Thanks for the details.

    with sequence based primary key (sequentially unique values)- each block in the index leaf block is well utilized before moving into the next leaf block. of course the index leaf block will go into 90-10 splits.

    with GUID based primary key (randomly unique values) – GUID based values are spread across the breath of the leaf block. will lead to 50-50 block splits, will lead to more free space in each leaf block.

    details @ https://tinyurl.com/ycbcycut

  2. This is specific to Oracle. Oracle generated GUID’s using sys_guid() function, which has first few characters in sequence, but when you are generating through the external application like .net or Java they are totally random and if you decide to have this RAW GUID column as PK or FK than the index on that are expensive. This is due to it need to keep lot’s of Block in memory to query or insert rather than last few blocks. And also this make’s these indices mostly split to 50:50 rather than the right side split. Any other implication may be there but this might be just obvious one isn’t it.

  3. with a very simple test I can see better performance/time with nextval 😦

    nextval—————
    create sequence SEQ cache 50000;
    create table TEST_SEQ (pk number) tablespace admin;
    insert /*+ APPEND */ into TEST_SEQ select seq.nextval from ( select 1 from dual connect by level more than 2 minutes

    sys_guid()————
    create table TEST_SEQ ( pk raw(20)) tablespace admin;
    insert /*+ APPEND */ into TEST_SEQ select sys_guid() from ( select 1 from dual connect by level 18 seconds!

  4. nextval—————
    insert /*+ APPEND */ into TEST_SEQ select seq.nextval from ( select 1 from dual connect by level <= 10000000);
    10000000 rows created.
    Elapsed: 00:02:17.90

    sys_guid()————
    insert /*+ APPEND */ into TEST_SEQ select sys_guid() from ( select 1 from dual connect by level <= 10000000);
    10000000 rows created.
    Elapsed: 00:00:18.53

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.