The death of UTL_FILE

In a previous post I covered a technique to improve the performance of UTL_FILE, but concluded the post with a teaser: “you probably don’t need to use UTL_FILE ever again”.


Time for me to back that statement up with some concrete evidence.

UTL_FILE can read and write files. This blog post will cover the writing functionality of UTL_FILE and why I think you probably don’t need UTL_FILE for this. I’ll come back to UTL_FILE to read files in a future post.

There are two possibilities when it comes to writing a file:

  • The file is being requested by a client program and the results should be stored on a client machine. In that case, you do not use UTL_FILE anyway – you use the tools available on the client, for example, SQL Developer, SQLcl, SQL*Plus or a myriad of other tools, OR
  • the file is to be written by the database server to a location on the database server itself, or to a location that is accessible to the database server. This is where UTL_FILE has been used, the main motivations being that it can be part of existing database-resident PL/SQL code, and does not require clients to have direct SQL or OS level access to the database server.

So here is my assertion – most of the time, UTL_FILE is used to write out a file because we can’t make use of an easier tool like those mentioned above directly on the database server. After all, who wants to write code like:


SQL> declare
  2    f utl_file.file_type;
  3    line varchar2(255);
  4    s timestamp;
  5  begin
  6    f := utl_file.fopen('TMP','demo.dat','w');
  7    s := systimestamp;
  8    for i in ( select * from t )
  9    loop
 10      line := i.object_id||','||i.object_name||','||i.object_type||','||i.data_object_id;
 11      utl_file.put_line(f,line);
 12    end loop;
 13    dbms_output.put_line(systimestamp-s);
 14    utl_file.fclose_all;
 15  end;
 16  /

PL/SQL procedure successfully completed.

when most of the time, if I was doing this in a client tool, it would simply be:


SQL> set markup csv on
SQL> spool c:\tmp\demo.dat
SQL> select * from t;
SQL> spool off

But in recent versions of the database, you can do exactly this! The database scheduler has been enhanced to be able to run SQL*Plus-style scripts directly out of the database, without needing to give OS access to the database server or to the SQL*Plus executable. In the previous post I unloaded 40 million rows to a file in CSV format, and the responsibility for formatting the data into CSV fell to me – I had to do all of the “heavy lifting”. Using the scheduler and the SQL_SCRIPT job type, it is as simple as writing a SQL*Plus script and submitting it as a job.


SQL> declare
  2    l_script   VARCHAR2(32767) :=
  3  'conn /@db
  4  set markup csv on
  5  set arraysize 500
  6  set pages 0
  7  set lines 200
  8  set trimspool on
  9  spool c:\tmp\demo_sched.dat
 10  select * from t;
 11  spool off';
 12  begin
 13    dbms_scheduler.create_job(
 14      job_name        => 'UNLOAD_DATA',
 15      job_type        => 'SQL_SCRIPT',
 16      job_action      => l_script,
 17      credential_name => 'MY_ACCOUNT',
 18      enabled         => true
 19    );
 20  end;
 21  /

PL/SQL procedure successfully completed.
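
Once the job has run, you do not even need to go hunting around on the server to see how it fared – the scheduler run details views capture the script output (the OUTPUT column holds what the script wrote to the SQL*Plus session). A quick check along these lines does the trick:


select job_name, status, error#, output
from   user_scheduler_job_run_details
where  job_name = 'UNLOAD_DATA'
order  by log_date desc;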

You will want to ensure that you have some controls over the usage of this feature, and over what credentials the scripts will run under. Also, in my example I’ve got a connection wallet set up so that I do not have to code any passwords into the connection string for my scheduler job. But suddenly it has become easy to take the scripting tools we are used to on our own client machines and utilize them on the database server.
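
For reference, creating that credential is a one-off task. Here is a minimal sketch using DBMS_CREDENTIAL – the operating system account details are of course placeholders:


begin
  dbms_credential.create_credential(
    credential_name => 'MY_ACCOUNT',
    username        => 'some_os_account',    -- placeholder OS user
    password        => 'some_os_password');  -- placeholder password
end;
/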

“Slow down there cowboy…” I hear you exclaim. “…That might be fine for simple SELECT * scripts, but my UTL_FILE procedures have a lot of complex logic to construct the file data”.

Even if you are using UTL_FILE because you are performing some complicated algorithms to generate the data that will be written to file, you can still utilize this scheduler facility. After your complicated logic is completed, you will typically end up with a line of data that needs to be written to a file. And if you have a line of data, then it is trivial to port that procedure to become a pipelined function. And once you have a pipelined function, we have a simple query mechanism that can be used to spool the output. For example, if my original procedure is:


SQL> create or replace
  2  procedure my_procedure is
  3  begin
  4    for each_row in ( [some set] )
  5    loop
  6    ...
  7    ... my complex logic
  8    ...
  9
 10    ...
 11    ... writing each line to file with UTL_FILE
 12    ...
 13    end loop;
 14  end;
 15  /

then we change none of the logic – we need only replace all of that messy UTL_FILE code with a simple PIPE ROW call, which lets us query the function as if it were a source of rows, and then spool it to a file in the usual way.


SQL> create or replace
  2  function my_function return sys.odcivarchar2list pipelined is
  3  begin
  4    for each_row in ( [some set] )
  5    loop
  6    ...
  7    ... my complex logic
  8    ...
  9
 10    ...
 11    ... pipe row ( the_line );
 12    ...
 13    end loop;
 14    return;
 15  end;
 16  /

SQL> spool c:\tmp\demo.dat
SQL> select * from my_function();
SQL> spool off
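
To make that template concrete, here is a minimal runnable version of the pattern, reusing the table T from the first example, with the CSV concatenation standing in for the “complex logic”:


create or replace
function t_as_csv return sys.odcivarchar2list pipelined is
begin
  for i in ( select * from t )
  loop
    -- the complex logic would go here; we simply build a CSV line
    pipe row ( i.object_id||','||i.object_name||','||
               i.object_type||','||i.data_object_id );
  end loop;
  return;
end;
/

and the SPOOL commands above work unchanged, just querying T_AS_CSV instead.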

So next time you’re cranking out some cumbersome UTL_FILE code to write a file, take a moment to see if the scheduler can look after some of the work for you. I’ll talk about (not) using UTL_FILE to read files in a future post.

When WHEN went faster

Yeah…try saying that blog post title 10 times in a row as fast as you can.

But since we’re talking about doing things fast, this is just a quick post about a conversation I had on twitter yesterday about the WHEN clause in a trigger.

 


Whether the WHEN clause performs better than the equivalent test inside the trigger body is an easy benchmark to whip up – I just need a couple of tables, each with a simple trigger, differing only in their usage of the WHEN clause. Here is my setup:


SQL> create table t1 ( x int );

Table created.

SQL> create table t2 ( x int );

Table created.

SQL> create or replace
  2  trigger TRG1
  3  before insert on t1
  4  for each row
  5  begin
  6    if sys_context('USERENV','SID') = 0 then
  7       :new.x := 0;
  8    end if;
  9  end;
 10  /

Trigger created.

SQL> create or replace
  2  trigger TRG2
  3  before insert on t2
  4  for each row
  5  when ( sys_context('USERENV','SID') = 0 )
  6  begin
  7     :new.x := 0;
  8  end;
  9  /

Trigger created.

I’m using an always-false condition so the trigger body will never actually run – only the WHEN clause or the IF-test will be evaluated. Now I’m ready to slam a bunch of inserts into each table and measure the performance.


SQL> set timing on
SQL> insert into t1
  2  select 1 from dual
  3  connect by level <= 1000000;

1000000 rows created.

Elapsed: 00:00:02.52

SQL> insert into t2
  2  select 1 from dual
  3  connect by level <= 1000000;

1000000 rows created.

Elapsed: 00:00:00.41

There is an immediate urge here to go screaming from the rooftops that WHEN is the all-conquering hero of this benchmark. And yes, it is indeed faster, but make sure you keep the results in perspective. We gained back about 2 seconds for 1 million inserts. For me, once I’m inserting millions of rows into a table, I’m starting to think not about trigger performance but about whether this is the kind of table I want to be having triggers on at all.

Juicing up UTL_FILE

Think about your rubbish bin for a second. Because, clearly, this is going to be an oh-so-obvious metaphor leading into UTL_FILE, right? OK, maybe a little explanation is needed. I have a basket next to my desk into which I throw any waste paper. It is where I throw my stupid ideas and broken dreams.


Melancholy aside, once it is full I’ll take it out and empty it into the main recycling bin that is collected once a fortnight from my street front.


Metaphor make sense now? Still No? OK, let me bring it around to the database. When you make changes to data in the database, most of the time these changes are only made to data blocks in memory. Obviously we have to keep a permanent record of the changes in the redo log to ensure that database recovery is possible, but the changed data blocks themselves are only dumped out to disk on an intermittent basis. That way, the process of changing data in the database is nice and efficient.

That is similar to what I do with the waste paper basket. I don’t take each piece of paper out to the street front recycling bin; I wait until the waste paper basket is full and then just do the one trip. It is more efficient. (This is hopefully your light bulb moment on my metaphor skills.)

So back to UTL_FILE. You can take the same approach when you need to unload some data using UTL_FILE. There is an inherent overhead with every call you make to write out a line of output to a flat file using UTL_FILE. Hence, if you can minimize the number of calls you make, you’ll see some benefits.

Here’s an example of that in action. I’m building a simple CSV file based on a table called T which contains approximately 40 million rows. Here is my first cut at the solution, one which I’d wager most people have written at some point in their IT lives. We simply loop through each record and write it out to the file:


SQL> set serverout on
SQL> declare
  2    f utl_file.file_type;
  3    line varchar2(255);
  4    s timestamp;
  5  begin
  6    f := utl_file.fopen('TMP','demo.dat','w');
  7    s := systimestamp;
  8    for i in ( select * from t )
  9    loop
 10      line := i.object_id||','||i.object_name||','||i.object_type||','||i.data_object_id;
 11      utl_file.put_line(f,line);
 12    end loop;
 13    dbms_output.put_line(systimestamp-s);
 14    utl_file.fclose_all;
 15  end;
 16  /
+000000000 00:04:07.368000000

PL/SQL procedure successfully completed.

I’d contend that four minutes for 40 million records is pretty good, but maybe we want to go faster. Based on my metaphor it might not be immediately apparent how I could perform fewer UTL_FILE calls. Surely I need to call it once per line of data written? But don’t forget that a “line” in a file is merely data terminated with an end-of-line pattern specific to your platform. It could be ASCII 10 or an ASCII 13,10 pair, but either way, it is really just an extension of the data you need to write to the file. A “line” is merely our interpretation of that data, where that pattern means move to the next line.

So rather than call UTL_FILE for each line, I’ll build up multiple lines and then write them out with a single UTL_FILE.PUT_LINE call. I’m choosing a cap of around 300 lines, which for my data keeps the concatenation comfortably under the maximum size of a VARCHAR2 (for longer lines you would lower the cap accordingly), and as a result I should cut down on the number of UTL_FILE calls by a factor of 300.


SQL> set serverout on
SQL> declare
  2    f utl_file.file_type;
  3    line varchar2(255);
  4    big_line varchar2(32767);
  5    s timestamp;
  6    cnt pls_integer := 0;
  7  begin
  8    f := utl_file.fopen('TMP','demo2.dat','w',32767);
  9    s := systimestamp;
 10    for i in ( select * from t )
 11    loop
 12      line := i.object_id||','||i.object_name||','||i.object_type||','||i.data_object_id;
 13      cnt := cnt + 1;
 14      if cnt < 300 then
 15         big_line := big_line || line || chr(10);
 16      else
 17         utl_file.put_line(f,big_line||line);
 18         cnt := 0;
 19         big_line := null;
 20      end if;
 21    end loop;
 22    utl_file.put_line(f,big_line);
 23    dbms_output.put_line(systimestamp-s);
 24    utl_file.fclose_all;
 25  end;
 26  /
+000000000 00:02:23.297000000

PL/SQL procedure successfully completed.

That’s pretty cool. We reclaimed around 100 seconds of elapsed time just by reducing the number of UTL_FILE.PUT_LINE calls. In this particular instance that’s about 40%, but obviously your mileage may vary based on a myriad of factors – so make sure you do your own benchmarking on your own systems.

As you can see, with just a couple of lines of extra code, we can really make UTL_FILE sing from a performance perspective. Having said that, in an upcoming blog post I’ll make a bold assertion – that you probably don’t need to use UTL_FILE ever again!  Stay tuned for that one.

The simplest things….can be risky

Java and Oracle expert Lukas Eder tweeted yesterday about a potential optimization that could be done when reviewing database SQL code.

(The tweet suggested that two consecutive UPDATE statements against the same table could be consolidated into a single UPDATE.)

This looks to be a logical thing to do. Why scan the table T twice to perform an update when the same job could be done in a single pass? The benefits seem obvious:

  • less I/O work
  • less time the data is spent locked
  • less risk of an error between the two operations

So don’t get me wrong – the consolidation is going to be a good thing in the majority of cases.

And therein lies the rub – the “majority” of cases is not the same as “all” cases, and that is why I don’t think a tool should ever automatically perform this change. I’d be cool with a tool making a recommendation, but let’s see why you cannot just assume that the consolidation is correct.

Here’s our table with a single row and a single business rule implemented with a check constraint:



SQL> create table t ( a int, b int );

Table created.

SQL> alter table t add constraint chk check ( a < b ) ;

Table altered.

SQL> insert into t values (1,2);

1 row created.

SQL> commit;

Commit complete.

Now I’ll implement the application in the original “unoptimized” way:


SQL> update t set a = a + 1;
update t set a = a + 1
*
ERROR at line 1:
ORA-02290: check constraint (MCDONAC.CHK) violated


SQL> update t set b = b*2;

1 row updated.

You can see that the first update failed – it violated the check constraint. Of course, it is not definitively clear whether this should be the case based on the business requirements, because I haven’t elaborated on whether these two updates should be two transactions or a single transaction. But correctness is not really the point here; the point is that if I now choose to consolidate the updates, I end up with different application behaviour.

I’ll roll back the change above, and repeat the experiment using the consolidated update:


SQL> roll;
Rollback complete.
SQL> update t set a = a + 1, b = b*2;

1 row updated.

This time the update completes successfully. If a tool had automatically done this, then I would get a different behaviour in my application. That might be a good thing…it might not be. I could eliminate the difference by making the constraint deferrable, so that it is only validated at commit time, but now we’re departing even further from the existing implementation of the application code, which means more scrutiny and more regression testing.
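
For reference, the deferred approach would look something like this – the constraint is then validated at commit time rather than after each statement:


alter table t drop constraint chk;
alter table t add constraint chk
  check ( a < b ) deferrable initially deferred;

update t set a = a + 1;   -- no longer fails immediately
update t set b = b*2;
commit;                   -- the constraint is checked here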

So by all means, explore opportunities to improve the performance of your SQL by re-arranging it, consolidating it, and aiming to get more done with less work. But be careful that you do not unknowingly change the way your application works when you do so.

 

The AskTOM data model

I popped out a tweet yesterday, in Throwback Thursday style, showing the date of the first question we took on AskTOM – 18 years ago! Many entire computer systems don’t even last that long, and AskTOM hasn’t really needed to change that much in those 18 years. We’ve added a new skin, added the ability to have multiple AskTOM experts on the answer team, and of course our new-for-2018 Office Hours program, which gives everyone free access to experts inside Oracle Corporation.

John replied to my tweet asking if we could show the ERD.


So here it is – I’ve taken out a few things here and there that don’t have any impact on the public facing part of AskTOM. And I’ve included some notes underneath about each table’s purpose in AskTOM.

(ERD: the AskTOM data model)

ATE_ADMINS

The administrators on AskTOM. There are various privilege levels such as being able to see the admin pages, view unanswered questions and actually answer them.

ATE_HOME_MESSAGES

Any messages that we want to appear on the Questions home screen, such as upcoming conferences etc.

ATE_SUPPORTING_FILES

Files that support the answering of questions. We’ll sometimes link in scripts, images etc to better answer a question. They are all stored in here, in the database of course.

ATE_PRESENTATIONS

The presentations and other resources that you can download from the Resources page.


ATE_FEEDBACK

Our recently added facility to allow people to give us feedback on the AskTOM application.


ATE_SUBMITTED_QUESTIONS

The bread and butter of AskTOM. Your questions are stored here, along with our answer.

ATE_POSTING_STATUS


When you submit a question, we record it as “New, Never Read”. Once we start working on it, we mark it as “Read, Not Answered” so that another AskTOM administrator does not work on it. We might then have a little back-and-forth with the original submitter if we need more detail (which is a polite way of saying they didn’t read the Question Guidelines, and hence we didn’t get a test case or sample data or nicely formatted code). Hence the “Need More Info” and “More Info Supplied” statuses.

Once we’ve answered the question, if we think the general community will benefit it becomes “Answered, Published”, which puts it up on the home page; otherwise it’s “Answered, Not Published”, which means the original submitter will see it, but no-one else. To give a sense of scale (numbers taken today): for the 18,500 questions you can see on AskTOM, there’s another 5,500 we’ve answered that don’t make the front page.

“No Further Action” is what happens when we ask for more information and the submitter decides to ghost us, or when a submitter decides the best way to communicate with us is via a torrent of insults. Either way, we ain’t bothering with that question again.

ATE_QUESTION_REVIEWS

The review comments from the community, and our follow-ups for each question that we answer.


ATE_REVIEW_STATUSES

We read every single review that comes into AskTOM. They come in with a status of “New”, we’ll “Assign” them to an administrator, and once we’re done with it, we mark it “Complete”. Yup, that’s over 125,000 reviews we’ve taken.


ATE_QUESTION_STATUS_HIST

We track our own performance on AskTOM so that we can better serve you all. Every time we change a question status, e.g. from “New” to “Need More Info” to “Answered”, we log the time at which the status change occurred, so we can review it later to see how we’re tracking.
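
In case you are wondering how that kind of tracking can be wired up, a trigger in this spirit would do it. To be clear, this is purely an illustrative sketch – the column names are my assumptions for the example, not the real AskTOM schema:


create or replace
trigger trg_question_status_hist
after update of status_id on ate_submitted_questions
for each row
when ( old.status_id != new.status_id )
begin
  -- hypothetical columns: log every status transition with a timestamp
  insert into ate_question_status_hist
    ( question_id, old_status_id, new_status_id, changed_date )
  values
    ( :new.question_id, :old.status_id, :new.status_id, systimestamp );
end;
/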

ATE_OBSOLETE_QUESTIONS

18 years of taking and answering questions means that some of the answers may no longer be correct or up to date. If a reviewer brings this to our attention, we’ll record it here as an action for an administrator to look at and rectify. If you’re wondering why it is not just a flag on the question, that’s because a single question may become obsolete more than once as time marches on.


ATE_QUESTION_VIEWS

Every time you click on a question in AskTOM, we record that fact. Why? Because that drives the metrics on knowing which questions are most sought after by the community. We get hundreds of thousands of question views every week in AskTOM.

ATE_QUESTIONS_TOP20

This drives our “Popular Questions” page on AskTOM.
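
A table like that can be refreshed with a straightforward aggregation over the question views – again a sketch only, since the column names here are my assumptions rather than the real schema:


-- hypothetical column names for illustration
select question_id, count(*) as views
from   ate_question_views
group  by question_id
order  by views desc
fetch  first 20 rows only;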


ATE_QUESTIONS_SOCIAL_MEDIA

Whenever we tweet out about a particular AskTOM question, or pop it on Facebook, we log it here – because we don’t want to be spamming you.

And that’s all there is to it. All driven by SQL, PL/SQL and of course the latest and greatest version of Application Express.

“Call me!” Many many times!

Some readers might recall that classic Blondie track “Call Me”. Of course, some readers might be wishing that I wouldn’t harp on about great songs from the 80’s. But bear with me, there is a (very tenuous) link to this post. If you haven’t heard the song, you can jump to the chorus right here. Go on, I’ll wait until you get back.

This golden oldie is relevant when it comes to dealing with object types in PL/SQL, and in particular, when you are using them in a SQL-related context.  To set the scene, I’ll start with a trivial example – creating a simple object, utilizing that object in a function, and then using a simple SELECT-FROM-DUAL on that function to check that it works.


SQL> create or replace
  2  type three_values as object
  3   ( val_1 int,
  4     val_2 int,
  5     val_3 int
  6   );
  7  /

Type created.

SQL>
SQL> create or replace
  2  function f return three_values is
  3  begin
  4     return three_values(1,2,3);
  5  end;
  6  /

Function created.

SQL>
SQL> select f from dual;

F(VAL_1, VAL_2, VAL_3)
-----------------------------------------
THREE_VALUES(1, 2, 3)

So far so good. But as is often the case, the requirement is not to get the object as an entity in its own right, but to get at the scalar values within it. Hence we wrap that query as an inline view inside an outer query that extracts the individual columns:


SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

But let’s explore that a little more carefully. Observation of the SQL text might lead us to think that

  • we ran the inner query,
  • saved the result we’ve already seen into a temporary result of type THREE_VALUES,
  • then extracted the elements via the outer SELECT.

But now I’ll make a small amendment to the function. I’ll introduce a package variable so we can do some tracking of executions:


SQL> create or replace package pkg as
  2    x int := 0;
  3  end;
  4  /

Package created.

SQL>
SQL> create or replace
  2  function f return three_values is
  3  begin
  4     pkg.x:= pkg.x+1;
  5     return three_values(1,2,3);
  6  end;
  7  /

Function created.

As you would expect, a simple SELECT-FROM-DUAL results in the function being executed once.


SQL> select f from dual;

F(VAL_1, VAL_2, VAL_3)
-----------------------------------------------------
THREE_VALUES(1, 2, 3)

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

As before, I’ll now wrap that query in another SELECT to extract the column elements. I’ll reset my package variable to start the count again from zero.


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=3

PL/SQL procedure successfully completed.

Notice the difference. The function was called three times. If that function was doing some “heavy lifting” then those additional executions might be a cause for concern. The number of calls to a function referenced from SQL has always been indeterminate, so this is nothing new, but it still might catch you by surprise. References throughout the SQL to those function scalar values can bump the count up even more:


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x
  4  where x.c.val_1 = 1 and x.c.val_2 = 2;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=5

But conversely, you cannot simply assume that every reference will result in an additional execution. For example, adding an ORDER BY clause containing references does not increment the count.


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x
  4  where x.c.val_1 = 1 and x.c.val_1+10 = 11
  5  order by x.c.val_1, x.c.val_2, x.c.val_3;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=5

PL/SQL procedure successfully completed.

A 10053 trace on any of these queries reveals the reason why we can expect to see multiple executions. When I parse my initial query


select x.c.val_1, x.c.val_2, x.c.val_3
from
(select f c from dual ) x

then scrolling through the 10053 trace, I’ll end up with this:


Final query after transformations:******* UNPARSED QUERY IS *******
SELECT SYS_OP_ATG("SCOTT"."F"(),1,2,2) "C.VAL_1",SYS_OP_ATG("SCOTT"."F"(),2,3,2) "C.VAL_2",SYS_OP_ATG("SCOTT"."F"(),3,4,2) "C.VAL_3" 
FROM "SYS"."DUAL" "DUAL"

You can see that the query has been transformed to have explicit calls of the function for each column we requested.

Can the multiple calls be avoided? Yes, and there are a couple of simple options to explore here. The first is to use a scalar subquery, taking advantage of the query-level caching the database will do to avoid repeated executions of query sub-components. Here are the previous examples repeated, but with a scalar subselect to access the function.


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select ( select f from dual ) c
  4   from dual ) x;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL>
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL>
SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select ( select f from dual ) c
  4   from dual ) x
  5  where x.c.val_1 = 1 and x.c.val_2 = 2;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL>
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

Alternatively, since we are ultimately treating the object as a row source, we can reflect that in the design by adding a nested table collection type and returning it from the function:


SQL> create or replace
  2  type three_val_list as
  3   table of three_values
  4  /

Type created.

SQL>
SQL> create or replace
  2  function f return three_val_list is
  3  begin
  4     pkg.x:= pkg.x+1;
  5     return three_val_list(three_values(1,2,3));
  6  end;
  7  /

Function created.

SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select * from f();

     VAL_1      VAL_2      VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

Before you get carried away and start screaming “Performance Bug!” from the rooftops, the moral of the story here is probably to first ask yourself: is there a valid reason for referencing my object data types via SQL? Very often I see the instantiation of a variable (complex or scalar) being done with a SELECT-FROM-DUAL. Don’t do it – it is almost never needed.
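
If all you need is the object in a PL/SQL variable, a direct assignment does the job with no SQL at all:


declare
  v three_values;
begin
  -- no SELECT ... FROM dual round trip required
  v := three_values(1,2,3);
  dbms_output.put_line(v.val_1);
end;
/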

But, if you have truly embraced the Object-Relational features and are genuinely gathering object types from SQL statements, then consider some of the workarounds above to avoid excessive function calls.

Attribute clustering….super cool

I’ve spoken about attribute clustering before here, here and here. So from that you can probably glean that I’m a fan.

I recently spoke about an example of this as well during my AskTOM Office Hours session which you can watch below:

After posting that video, I had a follow-up question which I thought would be worth exploring. My examples so far on attribute clustering have either loaded a table from empty, or issued an “alter table move” on a table that already contained data. In both of those situations it is reasonable to assume that clustering is going to help, because it aligns conceptually with what we’d expect to happen, namely:

  • read all of the data
  • re-organize all of the data so that it is clustered by the nominated clustering key(s)
  • reload all of the data

But what if we do not have the luxury of moving all of the data around? What if I have a table that is already clustered, but now I need to load more data into it? As we know, we only see clustering in effect if we perform a direct load operation, but a direct load operation only creates new blocks. That would suggest we cannot cluster the new data, no?

Let’s explore this scenario with an example.

As a control to this experiment, I’ll perform some index lookups to a “conventional” table, that is, random data without any clustering attributes.

I’ll create a table with the data arranged in random order, and then add the same amount of data again. (I could do this in a single operation, but I’m keeping it in alignment with the experiments I’ll perform shortly, to mimic the concept of loading data and then appending new data.)


SQL> create table t as
  2  select rownum id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

Table created.

SQL>
SQL> insert /*+ APPEND */ into t
  2  select rownum+100000 id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

100000 rows created.

SQL>
SQL> commit;

Commit complete.

I’ve got 200,000 randomly distributed rows for 100 distinct CUST_ID values. Now I’ll add an index and perform a simple lookup to examine the cost. (Note: The statistics you see are not from an initial execution, but from subsequent executions to eliminate parsing and other costs)


SQL> create index ix on t ( cust_id );

Index created.

SQL>
SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1502  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

So our control figure is approximately 1500 logical I/Os.

In the previous blog posts on this topic, I then added attribute clustering definitions and performed an ‘alter table move’ to reorganize all of the data into a nicely clustered arrangement. This time, I’ll repeat the above experiment in a different manner. The sequence of steps will be:

  1. load the data,
  2. add the clustering attributes,
  3. reorganize the data,
  4. add an index,
  5. observe the expected benefits of clustering,
  6. add more data, which we might expect could not be clustered because the table is already clustered now,
  7. perform more index lookups and see what the cost is.

If clustering were only useful for an initial data load, then we’d expect to see a significant degradation when the final index lookups are performed in step (7). First of all, just to reinforce the coolness of attribute clustering one more time, here are steps (1) through (5). Remember – the control figure here is 1500 LIOs.


SQL> create table t as
  2  select rownum id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

Table created.

SQL>
SQL> create table t2 as select * from t;

Table created.

SQL>
SQL> alter table t ADD clustering by linear order(cust_id);

Table altered.

SQL> alter table t move;

Table altered.

SQL>
SQL> create index ix on t ( cust_id );

Index created.

SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         20  consistent gets
          2  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Ooooh yeahhhhh… 20 LIOs!!! But now the real test comes. You can see I also created a table called T2, being a clone of T in its randomised data state. Now I am going to append the contents of T2 to my (clustered) table T.


SQL> insert /*+ APPEND */ into t select * from t2;

100000 rows created.

SQL> commit;

Commit complete.

One thing that does look promising is the “SORT ORDER BY” step in the execution plan below. That suggests that even though this is appending new data to an existing set of data, some clustering might be taking place.


SQL> explain plan for insert /*+ APPEND */ into t select * from t2;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 528765404

-------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |   100K|    10M|       |  2905   (1)| 00:00:01 |
|   1 |  LOAD AS SELECT                  | T    |       |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |   100K|    10M|       |  2905   (1)| 00:00:01 |
|   3 |    SORT ORDER BY                 |      |   100K|    10M|    11M|  2905   (1)| 00:00:01 |
|   4 |     TABLE ACCESS FULL            | T2   |   100K|    10M|       |   442   (1)| 00:00:01 |
-------------------------------------------------------------------------------------------------

11 rows selected.

But the real test comes in the index lookup. Let’s repeat that.


SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         43  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

And we’re still awesomely good. The amount of data for CUST_ID = 17 has doubled, and the LIOs approximately doubled as well. So we are still seeing the benefits of clustering, even though we are appending to a table with previously clustered data.

Just how close to optimal is this? I’ll perform an ‘alter table move’ against the entire table, and re-measure:


SQL> alter table t move online;

Table altered.

SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
---------------------------------------------------------------------------
x

1 row selected.

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         37  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Hence optimal clustering would have yielded 37 LIOs for this data, so 43 is very close to that, and a huge step up in efficiency over the 1500 LIOs from the base experiment. In effect, we now have “clumps” of clustered data in the table – so larger, lower-frequency data loads will give better results than smaller, higher-frequency loads. But still…very cool stuff that you should be taking advantage of in your databases.