LOBs are larger in multibyte databases

Posted by

Most customers I encounter nowadays are generally sticking to the defaults when creating their databases from 12c onwards. This means that they’ll be using the UTF8 class of charactersets when configuring their database. As a quick segue from the topic at hand, I think this is a good thing. Having a UTF8 characterset avoids a lot of hassles later when applications encounter a need to store extended characters, and let’s face it, the moment you need to store someone’s name, then at some stage you are going to hit this.

But back to large objects. It is important to realise that a UTF8 characterset has implications with how LOBs will be stored in the database. In a single byte characterset database, then if you need to store a 1 megabyte text file in a CLOB, then it will consume approximately 1 megabyte. But lets look at what happens when you are using a multi-byte database.


SQL> host dir /s /b c:\temp > c:\temp\file1.txt

SQL> host dir c:\temp\file1.txt
 Volume in drive C is OS
 Volume Serial Number is 66D0-4AB5

 Directory of c:\temp

22/03/2020  07:19 PM           635,750 file1.txt
               1 File(s)        635,750 bytes
               0 Dir(s)  54,663,430,144 bytes free

So I have a text file of about 630k, and I’ll now store that as a single CLOB in the database (which was built with AL32UTF8)


SQL> select property_name, property_value
  2  from   database_properties
  3  where  property_name like 'NLS%';

PROPERTY_NAME                  PROPERTY_VALUE
------------------------------ -----------------------------
NLS_RDBMS_VERSION              19.0.0.0.0
NLS_NCHAR_CONV_EXCP            FALSE
NLS_LENGTH_SEMANTICS           BYTE
NLS_COMP                       BINARY
NLS_DUAL_CURRENCY              $
NLS_TIMESTAMP_TZ_FORMAT        DD-MON-RR HH.MI.SSXFF AM TZR
NLS_TIME_TZ_FORMAT             HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_FORMAT           DD-MON-RR HH.MI.SSXFF AM
NLS_TIME_FORMAT                HH.MI.SSXFF AM
NLS_SORT                       BINARY
NLS_DATE_LANGUAGE              AMERICAN
NLS_DATE_FORMAT                DD-MON-RR
NLS_CALENDAR                   GREGORIAN
NLS_NUMERIC_CHARACTERS         .,
NLS_NCHAR_CHARACTERSET         AL16UTF16
NLS_CHARACTERSET               AL32UTF8
NLS_ISO_CURRENCY               AMERICA
NLS_CURRENCY                   $
NLS_TERRITORY                  AMERICA
NLS_LANGUAGE                   AMERICAN

SQL> create tablespace small_extents 
  2    datafile 'X:\ORACLE\ORADATA\DB19\PDB1\SMALL01.DBF' size 20m
  3    extent management local uniform size 128k;

Tablespace created.

SQL> create table t ( c clob )
  2  lob ( c )store as basicfile
  3  tablespace small_extents;

Table created.

SQL>
SQL> declare
  2    bf    bfile;
  3    cb    clob;
  4
  5    dest  int := 1;
  6    src   int := 1;
  7    cs    int := 0;
  8    ctx   int := 0;
  9    warn  int := 0;
 10  begin
 11    insert into t (c) values (empty_clob())
 12    returning c into cb;
 13
 14    bf := bfilename('TEMP', 'file1.txt');
 15    dbms_lob.fileopen(bf, dbms_lob.file_readonly);
 16    dbms_lob.loadclobfromfile (
 17      dest_lob      => cb,
 18      src_bfile     => bf,
 19      amount        => dbms_lob.lobmaxsize,
 20      dest_offset   => dest,
 21      src_offset    => src,
 22      bfile_csid    => cs ,
 23      lang_context  => ctx,
 24      warning       => warn);
 25    dbms_lob.fileclose(bf);
 26
 27    commit;
 28  end;
 29  /

PL/SQL procedure successfully completed.

So the CLOB has been stored. Let us take a look at the size of the CLOB in the database.


SQL> select segment_name, bytes
  2  from   user_segments
  3  where  segment_name = 'T'
  4   or segment_name = ( select segment_name from user_lobs where table_name = 'T' )
  5   or segment_name = ( select index_name from user_lobs where table_name = 'T' );

SEGMENT_NAME                        BYTES
------------------------------ ----------
SYS_IL0000082130C00001$           131072
SYS_LOB0000082130C00001$         1572864
T                                  131072

3 rows selected.

SQL>
SQL> select bytes
  2  from   user_extents
  3  where  segment_name = 'SYS_LOB0000082130C00001

Even without the LOB index, the LOB data is just over double the size of the original text file. This might come as a surprise, but it is documented in the Large Object Guide

CLOB data is stored in a format that is compatible with UCS-2 if the database character set is multibyte

Hence, if you are doing a database sizing estimate, and you intend to store text data in CLOBs, make sure to take this into consideration when planning your size. In case you are thinking that this is an unreasonable “tax” to pay, don’t forget that all we are really doing is catering to the potential of the file containing extended characters. For example, if I open the original file with a text editor and save that file in UTF16 format, then the file size is similar to that of the CLOB.



SQL> host dir c:\temp\file1*
 Volume in drive C is OS
 Volume Serial Number is 66D0-4AB5

 Directory of c:\temp

22/03/2020  07:19 PM           635,750 file1.txt
22/03/2020  07:33 PM         1,271,500 file1u.txt
               2 File(s)      1,907,250 bytes
               0 Dir(s)  54,639,943,680 bytes free

;

BYTES
———-
131072
131072
131072
131072
131072
131072
131072
131072
131072
131072
131072
131072

12 rows selected.

SQL>Even without the LOB index, the LOB data is just over double the size of the original text file. This might come as a surprise, but it is documented in the Large Object Guide

CLOB data is stored in a format that is compatible with UCS-2 if the database character set is multibyte

Hence, if you are doing a database sizing estimate, and you intend to store text data in CLOBs, make sure to take this into consideration when planning your size. In case you are thinking that this is an unreasonable “tax” to pay, don’t forget that all we are really doing is catering to the potential of the file containing extended characters. For example, if I open the original file with a text editor and save that file in UTF16 format, then the file size is similar to that of the CLOB.


Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.