Juicing up UTL_FILE

Think about your rubbish bin for a second. Because clearly this is going to be an oh-so-obvious metaphor leading into UTL_FILE, right? OK, maybe a little explanation is needed. I have a basket next to my desk into which I throw any waste paper. It is where I throw my stupid ideas and broken dreams :-)


Melancholy aside, once it is full I’ll take it out and empty it into the main recycling bin that is collected once a fortnight from my street front.


Metaphor make sense now? Still no? OK, let me bring it around to the database. When you make changes to data in the database, most of the time those changes are only made to data blocks in memory. Obviously we have to keep a permanent record of the changes in the redo log to ensure that database recovery is possible, but the changed data blocks themselves are only dumped out to disk on an intermittent basis. That way, the process of changing data in the database is nice and efficient.

That is similar to what I do with the waste paper basket. I don’t take each piece of paper out to the street front recycling bin; I wait until the waste paper basket is full and then just do the one trip. It is more efficient. (This is hopefully your light bulb moment on my metaphor skills :-))

So back to UTL_FILE. You can take the same approach when you need to unload some data using UTL_FILE. There is an inherent overhead with every call you make to write out a line of output to a flat file using UTL_FILE. Hence, if you can minimize the number of calls you make, you’ll see some benefits.

Here’s an example of that in action. I’m building a simple CSV file based on a table called T which contains approximately 40 million rows.  Here is my first cut at the solution which I’d wager most people have written in their IT lives. We simply loop through each record and write it out to the file:


SQL> set serverout on
SQL> declare
  2    f utl_file.file_type;
  3    line varchar2(255);
  4    s timestamp;
  5  begin
  6    f := utl_file.fopen('TMP','demo.dat','w');
  7    s := systimestamp;
  8    for i in ( select * from t )
  9    loop
 10      line := i.object_id||','||i.object_name||','||i.object_type||','||i.data_object_id;
 11      utl_file.put_line(f,line);
 12    end loop;
 13    dbms_output.put_line(systimestamp-s);
 14    utl_file.fclose_all;
 15  end;
 16  /
+000000000 00:04:07.368000000

PL/SQL procedure successfully completed.

I’d contend that four minutes for 40 million records is pretty good, but maybe we want to go faster. Based on my metaphor, it might not be immediately apparent how I could make fewer UTL_FILE calls. Surely I need to call it once per line of data written? But don’t forget that a “line” in a file is merely the data terminated with an end-of-line pattern specific to your platform. It could be ASCII 10 or an ASCII 13,10 pair, but either way it is really just an extension of the data you need to write to the file. A “line” is merely our interpretation of that data, where that pattern means “move to the next line”.

So rather than call UTL_FILE for each line, I’ll build up multiple lines and then write them out with a single UTL_FILE.PUT_LINE call. I’m choosing a cap of around 300 lines to ensure that my concatenation does not exceed the maximum size of a VARCHAR2, but as a result I should cut down on the number of UTL_FILE calls by a factor of 300.


SQL> set serverout on
SQL> declare
  2    f utl_file.file_type;
  3    line varchar2(255);
  4    big_line varchar2(32767);
  5    s timestamp;
  6    cnt pls_integer := 0;
  7  begin
  8    f := utl_file.fopen('TMP','demo2.dat','w',32767);
  9    s := systimestamp;
 10    for i in ( select * from t )
 11    loop
 12      line := i.object_id||','||i.object_name||','||i.object_type||','||i.data_object_id;
 13      cnt := cnt + 1;
 14      if cnt < 300 then
 15         big_line := big_line || line || chr(10);
 16      else
 17         utl_file.put_line(f,big_line||line);
 18         cnt := 0;
 19         big_line := null;
 20      end if;
 21    end loop;
 22    utl_file.put_line(f,big_line);
 23    dbms_output.put_line(systimestamp-s);
 24    utl_file.fclose_all;
 25  end;
 26  /
+000000000 00:02:23.297000000

PL/SQL procedure successfully completed.

That’s pretty cool. We reclaimed around 100 seconds of elapsed time just by reducing the number of UTL_FILE.PUT_LINE calls. In this particular instance that’s about 40%, but obviously your mileage may vary based on a myriad of factors – so make sure you do your own benchmarking on your own systems.
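
If your rows vary a lot in width, a fixed cap of 300 lines is a little arbitrary. A variation on the same idea is to flush the buffer based on its accumulated length instead. Here is a minimal sketch of that approach, assuming the same table T and the TMP directory from above (note that LENGTH counts characters while the UTL_FILE line limit is in bytes, so LENGTHB may be the safer choice for multi-byte data):


declare
  f        utl_file.file_type;
  line     varchar2(255);
  big_line varchar2(32767);
begin
  f := utl_file.fopen('TMP','demo3.dat','w',32767);
  for i in ( select * from t )
  loop
    line := i.object_id||','||i.object_name||','||i.object_type||','||i.data_object_id;
    if big_line is null then
       big_line := line;
    elsif length(big_line) + 1 + length(line) <= 32766 then
       -- still room in the buffer (32766 leaves a byte for the line terminator):
       -- append an embedded newline plus the next line
       big_line := big_line || chr(10) || line;
    else
       -- buffer is full: write it out in a single call and start a new one
       utl_file.put_line(f,big_line);
       big_line := line;
    end if;
  end loop;
  -- write the final, partially filled buffer
  if big_line is not null then
     utl_file.put_line(f,big_line);
  end if;
  utl_file.fclose(f);
end;
/

The effect is the same as the fixed row-count version: far fewer UTL_FILE calls, with the grouping driven by size rather than an arbitrary row count.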

As you can see, with just a couple of lines of extra code, we can really make UTL_FILE sing from a performance perspective. Having said that, in an upcoming blog post I’ll make a bold assertion – that you probably don’t need to use UTL_FILE ever again!  Stay tuned for that one.

The simplest things….can be risky

Java and Oracle expert Lukas Eder tweeted yesterday about a potential optimization that could be done when reviewing database SQL code.


This looks to be a logical thing to do. Why scan the table T twice to perform an update when the same job could be done in a single pass? The benefits seem obvious:

  • less I/O work
  • less time the data spends locked
  • less risk of an error between the two operations

So don’t get me wrong – the consolidation is going to be a good thing in the majority of cases.

And therein lies the rub – the “majority” of cases is not the same as “all” cases, and that is why I don’t think a tool should ever automatically perform this change. I’d be cool with a tool making a recommendation, but let’s see why you cannot just assume that the consolidation is correct.

Here’s our table with a single row and a single business rule implemented with a check constraint:



SQL> create table t ( a int, b int );

Table created.

SQL> alter table t add constraint chk check ( a < b ) ;

Table altered.

SQL> insert into t values (1,2);

1 row created.

SQL> commit;

Commit complete.

Now I’ll implement the application in the original “unoptimized” way:


SQL> update t set a = a + 1;
update t set a = a + 1
*
ERROR at line 1:
ORA-02290: check constraint (MCDONAC.CHK) violated


SQL> update t set b = b*2;

1 row updated.

You can see that the first update failed – it violated the check constraint. Of course, it is not definitively clear whether this should be the case based on the business requirements, because I haven’t elaborated on whether these two updates should be two transactions or a single transaction. Correctness is not really the point I’m trying to make here; the point is that if I now choose to consolidate the updates, I end up with different application behaviour.

I’ll roll back the change above, and repeat the experiment using the consolidated update:


SQL> roll;
Rollback complete.
SQL> update t set a = a + 1, b = b*2;

1 row updated.

This time the update completes successfully. If a tool had automatically done this, then I would get different behaviour in my application. That might be a good thing… it might not be. I could eliminate the difference by making the constraint deferrable, but we’re starting to depart even further from the existing implementation of the application code, which means more scrutiny and more regression testing.
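
For completeness, here is a minimal sketch of that deferred-constraint option, using the same table T and constraint CHK as above. With a deferrable, initially deferred constraint the check is only validated at commit time, so the two-statement version no longer fails mid-transaction and behaves like the consolidated update:


-- sketch only: recreate the constraint as deferrable and initially deferred
alter table t drop constraint chk;
alter table t add constraint chk check ( a < b ) deferrable initially deferred;

update t set a = a + 1;   -- no error raised here; validation is deferred
update t set b = b*2;
commit;                   -- the constraint is checked at this point

But as noted above, that is itself a change to the application’s transactional behaviour, so it attracts the same scrutiny and regression testing.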

So by all means, explore opportunities to improve the performance of your SQL by re-arranging it, consolidating it, and aiming to get more done with less work. But be careful that you do not unknowingly change the way your application works when you do so.

 

The AskTOM data model

I popped out a tweet yesterday in Throwback Thursday style showing the date of the first question we took on AskTOM – 18 years ago! Many entire computer systems don’t even last that long, and AskTOM hasn’t really needed to change that much in those 18 years. We’ve added a new skin, added the ability to have multiple AskTOM experts on the answer team, and of course, our new-for-2018 Office Hours program, which gives everyone free access to experts inside Oracle Corporation.

John replied to my tweet asking if we could show the ERD.


So here it is – I’ve taken out a few things here and there that don’t have any impact on the public facing part of AskTOM. And I’ve included some notes underneath about each table’s purpose in AskTOM.

[ERD image: Relational_1]

ATE_ADMINS

The administrators on AskTOM. There are various privilege levels such as being able to see the admin pages, view unanswered questions and actually answer them.

ATE_HOME_MESSAGES

Any messages that we want to appear on the Questions home screen, such as upcoming conferences etc.

ATE_SUPPORTING_FILES

Files that support the answering of questions. We’ll sometimes link in scripts, images etc to better answer a question. They are all stored in here, in the database of course.

ATE_PRESENTATIONS

The presentations and other resources that you can download from the Resources page.


ATE_FEEDBACK

Our recently added facility to allow people to give us feedback on the AskTOM application.


ATE_SUBMITTED_QUESTIONS

The bread and butter of AskTOM. Your questions are stored here, along with our answer.

ATE_POSTING_STATUS


When you submit a question, we record it as “New, Never Read”. Once we start working on it, we will then mark it as “Read, Not Answered” so that another AskTOM administrator does not work on it. We might then have a little back-and-forth with our original submitter if we need more detail (which is a polite way of saying they didn’t read the Question Guidelines, and hence we didn’t get a test case or sample data or nicely formatted code etc etc) :-) Those are the “Need More Info” and “More Info Supplied” statuses.

Once we’ve answered the question, if we think the general community will benefit it will be marked “Answered, Published”, which puts it up on the home page; otherwise it’s “Answered, Not Published”, which means the original submitter will see it, but no-one else. As you can see from the numbers (taken today), for the 18,500 questions you can see on AskTOM, there’s another 5,500 we’ve answered that don’t make the front page.

“No Further Action” is what happens when we ask for more information and the submitter decides to ghost us, or if a submitter decides the best way to communicate with us is via a torrent of insults. Either way, we ain’t bothering with that question again :-)

ATE_QUESTION_REVIEWS

The review comments from the community, and our follow-ups for each question that we answer.


ATE_REVIEW_STATUSES

We read every single review that comes into AskTOM. They come in with a status of “New”, we’ll “Assign” them to an administrator, and once we’re done with it, we mark it “Complete”. Yup, that’s over 125,000 reviews we’ve taken.


ATE_QUESTION_STATUS_HIST

We track our own performance on AskTOM so that we can better serve you all. So every time we change a question status, eg from “New” to “Need More Info” to “Answered” etc, we log the time at which the status change occurred so we can review it later to see how we’re tracking.
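
As an aside, a status-change log like this lends itself nicely to analytic functions for measuring how long a question spends in each state. Here is a purely hypothetical sketch – the column names QUESTION_ID, STATUS and CHANGED_DATE are my invention for illustration, not the real AskTOM schema:


-- hypothetical illustration only: column names are assumed, not the real schema
select question_id,
       status,
       changed_date,
       lead(changed_date) over ( partition by question_id order by changed_date )
         - changed_date as days_in_status
from   ate_question_status_hist
order  by question_id, changed_date;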

ATE_OBSOLETE_QUESTIONS

18 years of taking and answering questions means that some of the answers may no longer be correct or up to date. If a reviewer brings this to our attention, we’ll record it here as an action for an administrator to look at and rectify. If you’re wondering why it is not just a flag on the question, that’s because a single question may become obsolete more than once as time marches on.


ATE_QUESTION_VIEWS

Every time you click on a question in AskTOM, we record that fact. Why? Because that drives the metrics on knowing which questions are most sought after by the community. We get hundreds of thousands of question views every week in AskTOM.

ATE_QUESTIONS_TOP20

This drives our “Popular Questions” page on AskTOM.
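
To give a feel for how the two tables above relate, here is a purely hypothetical sketch of the kind of aggregation that could feed a “top 20” list from the view data – again, the column names QUESTION_ID and VIEW_DATE are assumptions for illustration, not the real schema:


-- hypothetical illustration only: column names are assumed, not the real schema
select question_id,
       count(*) as views_last_week
from   ate_question_views
where  view_date >= sysdate - 7
group  by question_id
order  by views_last_week desc
fetch  first 20 rows only;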


ATE_QUESTIONS_SOCIAL_MEDIA

Whenever we tweet out about a particular AskTOM question, or pop it on Facebook, we log it here – because we don’t want to be spamming you :-)

And that’s all there is to it. All driven by SQL, PL/SQL and of course the latest and greatest version of Application Express.

“Call me!” Many many times!

Some readers might recall that classic Blondie track “Call me”. Of course, some readers might be wishing that I wouldn’t harp on about great songs from the 80s. But bear with me, there is a (very tenuous) link to this post. If you haven’t heard the song, you can jump to the chorus right here. Go on, I’ll wait until you get back :-)

This golden oldie is relevant when it comes to dealing with object types in PL/SQL, and in particular, when you are using them in a SQL-related context.  To set the scene, I’ll start with a trivial example – creating a simple object, utilizing that object in a function, and then using a simple SELECT-FROM-DUAL on that function to check that it works.


SQL> create or replace
  2  type three_values as object
  3   ( val_1 int,
  4     val_2 int,
  5     val_3 int
  6   );
  7  /

Type created.

SQL>
SQL> create or replace
  2  function f return three_values is
  3  begin
  4     return three_values(1,2,3);
  5  end;
  6  /

Function created.

SQL>
SQL> select f from dual;

F(VAL_1, VAL_2, VAL_3)
-----------------------------------------
THREE_VALUES(1, 2, 3)

So far so good. But as is often the case, the requirement is not to get the object as an entity in its own right, but to get at the scalar values within the object. Hence we wrap that query in an inline view and let an outer query pick out the individual columns:


SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

But let’s explore that a little more carefully. Observation of the SQL text might lead us to think that

  • we ran the inner query,
  • saved the result we’ve already seen into a temporary result of type THREE_VALUES,
  • then extracted the elements via the outer SELECT.

But now I’ll make a small amendment to the function as follows. I’ll introduce a package variable so we can do some tracking of executions:


SQL> create or replace package pkg as
  2    x int := 0;
  3  end;
  4  /

Package created.

SQL>
SQL> create or replace
  2  function f return three_values is
  3  begin
  4     pkg.x:= pkg.x+1;
  5     return three_values(1,2,3);
  6  end;
  7  /

Function created.

As you would expect, a simple SELECT-FROM-DUAL results in the function being executed once.


SQL> select f from dual;

F(VAL_1, VAL_2, VAL_3)
-----------------------------------------------------
THREE_VALUES(1, 2, 3)

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

As before, I’ll now wrap that query in another SELECT to extract the column elements. I’ll reset my package variable to start the count again from zero.


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=3

PL/SQL procedure successfully completed.

Notice the difference. The function was called three times. If that function was doing some “heavy lifting” then those additional executions might be a cause for concern. The number of calls to a function referenced from SQL has always been indeterminate, so this is nothing new, but it still might catch you by surprise. References throughout the SQL to those function scalar values can bump the count up even more:


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x
  4  where x.c.val_1 = 1 and x.c.val_2 = 2;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=5

But conversely, you cannot simply assume that every reference will result in an additional execution. For example, adding an ORDER BY clause containing references does not increment the count.


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select f c from dual ) x
  4  where x.c.val_1 = 1 and x.c.val_1+10 = 11
  5  order by x.c.val_1, x.c.val_2, x.c.val_3;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> set serverout on
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=5

PL/SQL procedure successfully completed.

A 10053 trace on any of these queries reveals the reason why we can expect to see multiple executions. When I parse my initial query


select x.c.val_1, x.c.val_2, x.c.val_3
from
(select f c from dual ) x

then scrolling through the 10053 trace, I’ll end up with this:


Final query after transformations:******* UNPARSED QUERY IS *******
SELECT SYS_OP_ATG("SCOTT"."F"(),1,2,2) "C.VAL_1",SYS_OP_ATG("SCOTT"."F"(),2,3,2) "C.VAL_2",SYS_OP_ATG("SCOTT"."F"(),3,4,2) "C.VAL_3" 
FROM "SYS"."DUAL" "DUAL"

You can see that the query has been transformed to have explicit calls of the function for each column we requested.

Can the multiple calls be avoided? Yes. There are a couple of simple options to explore here. The first is to use a scalar subquery to take advantage of some query-level caching that the database will do to avoid repeated executions of query sub-components. Here are the previous examples repeated, but with a scalar subquery to access the function.


SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select ( select f from dual ) c
  4   from dual ) x;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL>
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL>
SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select x.c.val_1, x.c.val_2, x.c.val_3
  2  from
  3  (select ( select f from dual ) c
  4   from dual ) x
  5  where x.c.val_1 = 1 and x.c.val_2 = 2;

   C.VAL_1    C.VAL_2    C.VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL>
SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

Alternatively, since we are effectively treating the object as a row in a table, we can reflect this by creating a nested table type of the object and having the function return that instead:


SQL> create or replace
  2  type three_val_list as
  3   table of three_values
  4  /

Type created.

SQL>
SQL> create or replace
  2  function f return three_val_list is
  3  begin
  4     pkg.x:= pkg.x+1;
  5     return three_val_list(three_values(1,2,3));
  6  end;
  7  /

Function created.

SQL> exec pkg.x := 0;

PL/SQL procedure successfully completed.

SQL> select * from f();

     VAL_1      VAL_2      VAL_3
---------- ---------- ----------
         1          2          3

1 row selected.

SQL> exec dbms_output.put_line('calls='||pkg.x);
calls=1

PL/SQL procedure successfully completed.

Before you get carried away and start screaming “Performance Bug!” from the rooftops, the moral of the story here is probably to first ask yourself: Is there a valid reason for referencing my OBJECT data types via SQL? Very often I see the instantiation of any variable (complex or scalar) being done with a SELECT-FROM-DUAL. Don’t do it – it is almost never needed.
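
As a trivial illustration of that last point, here is a sketch of the direct assignment that replaces the SELECT-FROM-DUAL pattern when all you need is to instantiate a variable (using the THREE_VALUES type from earlier):


declare
  v three_values;
begin
  -- no need for "select three_values(1,2,3) into v from dual"
  v := three_values(1,2,3);
  dbms_output.put_line(v.val_1);
end;
/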

But, if you have truly embraced the Object-Relational features and are genuinely gathering object types from SQL statements, then consider some of the workarounds above to avoid excessive function calls.

Attribute clustering….super cool

I’ve spoken about attribute clustering before here, here and here. So from that you can probably glean that I’m a fan.

I recently spoke about an example of this as well during my AskTOM Office Hours session which you can watch below:

After posting that video, I had a follow-up question which I thought would be worth exploring. My examples so far on attribute clustering were either loading a table from empty, or issuing an “alter table move” on a table that already contained data. In both those situations, it is reasonable to assume that clustering is going to help because it aligns conceptually with what we’d expect to happen, namely

  • read all of the data
  • re-organize all of the data so that it is clustered by the nominated clustering key(s)
  • reload all of the data

But what if we do not have that luxury of moving all of the data around? What if I have a table that is already clustered, but now I need to load more data into it? As we know, we only see clustering in effect if we perform a direct load operation. But a direct load operation only creates new blocks. That would suggest we cannot cluster the data, no?

Let’s explore this scenario with an example.

As a control to this experiment, I’ll perform some index lookups to a “conventional” table, that is, random data without any clustering attributes.

I’ll create a table with the data arranged in random order, and then add the same amount of data again. (I could do this in a single operation, but I’m keeping it in alignment with the experiments I’ll perform shortly, to mimic loading data and then appending new data.)


SQL> create table t as
  2  select rownum id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

Table created.

SQL>
SQL> insert /*+ APPEND */ into t
  2  select rownum+100000 id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

100000 rows created.

SQL>
SQL> commit;

Commit complete.

I’ve got 200,000 randomly distributed rows for 100 distinct CUST_ID values. Now I’ll add an index and perform a simple lookup to examine the cost. (Note: The statistics you see are not from an initial execution, but from subsequent executions to eliminate parsing and other costs)


SQL> create index ix on t ( cust_id );

Index created.

SQL>
SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1502  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

So our control figure is approximately 1500 logical I/Os.

In the previous blog posts on this topic, I then added attribute clustering definitions and performed an ‘alter table move’ to reorganize all of the data into a nice clustered arrangement. This time, I’ll repeat the above experiment in a different manner. The sequence of steps will be:

  1. load the data,
  2. add the clustering attributes,
  3. reorganize the data,
  4. add an index,
  5. observe the expected benefits of clustering,
  6. add more data, which we might expect could not be clustered because the table is already clustered now,
  7. perform more index lookups and see what the cost is.

If clustering is only useful for an initial data load, then we’d expect to see a significant degradation when the final index lookups are performed in (7). First of all, just to reinforce the coolness of attribute clustering one more time, here are steps (1) through (5). Remember – the control figure here is 1500 LIOs.


SQL> create table t as
  2  select rownum id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

Table created.

SQL>
SQL> create table t2 as select * from t;

Table created.

SQL>
SQL> alter table t ADD clustering by linear order(cust_id);

Table altered.

SQL> alter table t move;

Table altered.

SQL>
SQL> create index ix on t ( cust_id );

Index created.

SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         20  consistent gets
          2  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Ooooh yeahhhhh… 20 LIOs!!! But now the real test comes. You can see I also created a table called T2, being a clone of T in its randomised data state. Now I am going to append the contents of T2 to my (clustered) T table.


SQL> insert /*+ APPEND */ into t select * from t2;

100000 rows created.

SQL> commit;

Commit complete.

One thing that does look promising is the “SORT ORDER BY” step in the execution plan. That suggests that even though we are appending new data to an existing set of data, some clustering might be taking place.


SQL> explain plan for insert /*+ APPEND */ into t select * from t2;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 528765404

-------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |   100K|    10M|       |  2905   (1)| 00:00:01 |
|   1 |  LOAD AS SELECT                  | T    |       |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |   100K|    10M|       |  2905   (1)| 00:00:01 |
|   3 |    SORT ORDER BY                 |      |   100K|    10M|    11M|  2905   (1)| 00:00:01 |
|   4 |     TABLE ACCESS FULL            | T2   |   100K|    10M|       |   442   (1)| 00:00:01 |
-------------------------------------------------------------------------------------------------

11 rows selected.

But the real test comes in the index lookup. Let’s repeat that.


SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         43  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

And we’re still awesomely good. The amount of data for CUST_ID = 17 has doubled, and the LIOs approximately doubled as well. So we are still seeing benefits of clustering even though we are appending to a table with previously clustered data.

Just how close to optimal is this? I’ll perform an ‘alter table move’ against the entire table, and re-measure:


SQL> alter table t move online;

Table altered.

SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
---------------------------------------------------------------------------
x

1 row selected.

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         37  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Hence optimal clustering would have yielded 37 LIOs for this data, so 43 is very close to that, and a huge step up in efficiency over the 1500 LIOs from the base experiment. In effect, we have “clumps” of clustered data in the table now – so larger, lower-frequency loads will give better results than smaller, higher-frequency loads. But still… very cool stuff that you should be taking advantage of in your databases.
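
As a closing aside, you can be explicit about when clustering should be applied – on direct path loads, on data movement operations such as an ‘alter table move’, or both – and the data dictionary will show you what you have defined. A small sketch (for a table that does not already have a clustering clause; exact dictionary column names can vary between releases):


-- apply clustering on direct path loads and on data movement operations
alter table t add clustering by linear order (cust_id)
  yes on load yes on data movement;

-- see what clustering has been defined
select * from user_clustering_tables;
select * from user_clustering_keys;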

You need scaling huh? Maybe it’s just ego

I’ve just come back from OracleCode Singapore. It was a great event – the venue was awesome and the attendees were engaged and interested in the content. But there was one thing I found amusing (disturbing, perhaps?): the number of times people approached me on the topic of scaling. The conversation would typically run along the lines of:

“What is your recommendation for scaling?”

which almost suggests that scaling is, of itself, the end solution here. Not “Here is function X, and I need it to scale”, or “My business requirement is X, and it needs to scale”, but just “I need to scale”.

So I’d push back and ask more questions:

  • Scale what?
  • What data are you capturing?
  • Where is it coming from?
  • What speed? What volume?
  • What are your plans for the data you are capturing?
  • How do you intend to process the data?
  • Is it transient? Are you planning on storing it forever? Is it sensitive information?

And the scary thing is – more often than not, those were questions to which they did not have answers (yet?). I’d ask the questions, and very quickly the conversation would return to:

“Do I need sharding?”
“Should I use a NoSQL solution?”
“What ‘aaS’ option should I be using to achieve my scaling needs?”
“How many nodes do I need?”
“What server configuration is best?”

I’m seeing this more and more – the technological approach to achieving a business requirement is seen AS the business requirement. I hate to be brutal (well… that’s a lie, I like being brutal :-)) but here’s the thing – stop being so damn focussed on scaling until you have an idea of what your true performance requirements are!

Don’t get me wrong – there are systems out there that need to be architected from the ground up to deal with scaling challenges that have perhaps never been tackled before. But read those last few words again: “never been tackled before”. Do you know what that also means? It means it applies to an itsy-bitsy, tiny percentage of IT systems. If it wasn’t, then surprise surprise – those challenges have been tackled before. So why does everyone in the IT industry think that the system they are about to build will need the same architectural treatment as the 0.00001% of systems in the world that truly do?

Because in almost all of my time in IT, for the other 99.999% of systems out there, the two critical ingredients for scaling a system to meet (and well and truly exceed) the performance requirements of the business are pretty simple:

1) don’t write crappy code,

2) don’t store data in a crappy way

That’s it.  When you can definitively demonstrate that

a) your code is well written,

b) your data is being stored in a means to best serve business requirements

and your application still cannot meet performance needs, then yes, it’s time to talk about architectural options for scaling. But more and more I see folks ignoring (a) and (b), or worse, just assuming that they are implicit and guaranteed to happen, and leaping straight into “I need a 10,000-node, geo-dispersed, NoSQL, cached, compressed, mem-optimized, column-based, non-ACID, distributed blah blah blah” for their system to work.

Here’s a reality check – you don’t.  Save yourself a lot of hassles and start simple and focus on quality. You’ll find things will probably scale just fine.

If you’ve made it this far through the post and you think I’m just ranting… well, that’s true :-) but let me also answer the next obvious question:

“So how do we make sure we write good code? How do we make sure we store our data intelligently?”

That’s why we (developer advocates) are here. We’re here to help you succeed. So check out our resources, reach out to us via social media channels, and we’ll help you every step of the journey.

From file names to directory hierarchy

I had a fun request come in from a colleague the other day.  They had a simple list of fully qualified file names and they needed to present that data in the familiar hierarchical tree layout. 

To demonstrate, I took a little trip down memory lane :-) and grabbed a subset of presentations I’ve done over the years.


SQL> create table t ( fname varchar2(1000));

Table created.

SQL>
SQL> insert into t values ('C:\Users\Connor\Presentations\2002\scene_200205.pdf');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\scene_200205a.pdf');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\254old.ppt');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\254_full.ppt');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\257.ppt');

1 row created.
...
...
...

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\london\optimizer.pptx');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\london\optimizer_full.pptx');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\odtug\common_disasters_short.pptx');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\odtug\common_disasters_short_comp.pptx');

1 row created.

SQL> select count(*) from t;

  COUNT(*)
----------
       634

1 row selected.

So the challenge was to present this list of files in a nice tree format which mirrored my directory structure.

(Truth be told, my preference would be for the query to do a better job than I do at organizing my files, but that is a blog post for another day :-))

Step 1 was to use a little LATERAL trickery to parse out all of the elements of the file names into one row per “element”, where an “element” is either an unqualified file name or a single-level folder name.


SQL> select id, seq, token
  2  from
  3    ( select rownum id, fname from t ) t_base,
  4    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
  5            from dual
  6            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  7            );

        ID        SEQ TOKEN
---------- ---------- ------------------------------------------------------------------------------------
         1          1 C:
         1          2 Users
         1          3 Connor
         1          4 Presentations
         1          5 2002
         1          6 scene_200205.pdf
         2          1 C:
         2          2 Users
         2          3 Connor
         2          4 Presentations
         2          5 2002
         2          6 scene_200205a.pdf
         3          1 C:
         3          2 Users
         3          3 Connor
         3          4 Presentations
         3          5 2002
         3          6 254old.ppt
         4          1 C:
         4          2 Users
         4          3 Connor
         4          4 Presentations
         4          5 2002
         4          6 254_full.ppt
         ...
         ...

We can use that as input to a standard LAG function to associate each file/folder with its parent, using the SEQ column to provide sequencing within each ID


SQL> with data as
  2  (
  3  select id, seq, token
  4  from
  5    ( select rownum id, fname from t ) t_base,
  6    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
  7            from dual
  8            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  9            )
 10  )
 11  select
 12      id,
 13      seq,
 14      token,
 15      lag(token) over ( partition by id order by seq ) as parent
 16  from data
 17  order by id, seq desc;

        ID        SEQ TOKEN                                              PARENT
---------- ---------- -------------------------------------------------- -----------------------------------
         1          6 scene_200205.pdf                                   2002
         1          5 2002                                               Presentations
         1          4 Presentations                                      Connor
         1          3 Connor                                             Users
         1          2 Users                                              C:
         1          1 C:
         2          6 scene_200205a.pdf                                  2002
         2          5 2002                                               Presentations
         2          4 Presentations                                      Connor
         2          3 Connor                                             Users
         2          2 Users                                              C:
         2          1 C:
         3          6 254old.ppt                                         2002
         3          5 2002                                               Presentations
         3          4 Presentations                                      Connor
         3          3 Connor                                             Users
         3          2 Users                                              C:
        ...
        ...

That’s looking good, but we have duplicates, in the sense that multiple files roll up to a single folder (eg the “C:” root is repeated). So we DISTINCT it out to leave just the relationships we need:


SQL> with data as
  2  (
  3  select id, seq, token
  4  from
  5    ( select rownum id, fname from t ) t_base,
  6    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
  7            from dual
  8            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  9            )
 10  )
 11  select
 12    distinct
 13      token,
 14      lag(token) over ( partition by id order by seq ) as parent
 15  from data
 16  /

TOKEN                                              PARENT
-------------------------------------------------- --------------------------------------------------
C:
254old.ppt                                         2002
257.ppt                                            2002
plus2.ppt                                          2002
254.ppt                                            UKOUG2002_final
error_handling.pdf                                 2003
2004                                               Presentations
mcdonald1.pdf                                      2004
sun_storage_brief.ppt                              sun_storage_presn
ausoug05_memory_x.ppt                              2005
jul05_read_consistency.ppt                         2005
saoug_oracle_home.ppt                              2005
plsql_80_a58236.pdf                                sql_plsql_evo
jun07_tas_moving_data_short.ppt                    misc
ausoug_adel                                        2008
successful_development.pdf                         ausoug_adel
apr08_11g_developers.ppt                           ausoug_apr
apr08_11g_developers_print.ppt                     ausoug_apr
mcdonald_oct08_flashback.pdf                       perth
oct08_11g_developers.ppt                           template
nov08_read_consistency.pdf                         rwwa
perf101b_rwwa.pptx                                 rwwa
sqlectric_melb                                     2009
sql_electric2.pdf                                  sqlectric_melb
sql_electric1.pptx                                 sqlectric_melb
ukoug                                              2009
...
...
...

And we’re pretty much done. At this point, we have the raw data we need in a child/parent form that we could use to traverse with a standard hierarchy CONNECT BY query – I’ll use LPAD to show the hierarchy relationships


SQL> with data as
  2  (
  3  select id, seq, token
  4  from
  5    ( select rownum id, fname from t ) t_base,
  6    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
  7            from dual
  8            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  9            )
 10  ),
 11  hier as(
 12    select
 13      distinct
 14        token,
 15        lag(token) over ( partition by id order by seq ) as parent
 16    from data
 17  )
 18  select lpad(' ',level*2)||token hier_list
 19  from hier
 20  start with parent is null
 21  connect by prior token = parent
 22  order siblings by token;

HIER_LIST
--------------------------------------------------------------------------------------------------------
  C:
    Users
      Connor
        Presentations
          2002
            254_full.ppt
            254old.ppt
            257.ppt
            UKOUG2002_final
              254.ppt
              257.ppt
              prelim.ppt
            dec_9i_features2.ppt
            dec_legends.ppt
            dec_legends_prelim.ppt
            jul_sig.ppt
            jul_sig_printout.ppt
            june_sig.ppt
            logo.ppt
            march_sig.ppt
            march_sig2.ppt
            plaus.ppt
            plus2.ppt
            scene_200205.pdf
            scene_200205a.pdf
            sep_unix_sig.ppt
            sep_unix_sig2.ppt
          2003
            254.ppt
            abm_succesful_development.ppt
            apr03_perth.ppt
            dec03_perth_9i_nf.ppt
            dec03_perth_9i_nf_popquiz.pdf
            dec03_perth_9i_nf_popquiz.ppt
            error_handling.pdf
            error_handling.ppt
            ims_performance_examples.pdf
            ims_performance_examples.ppt
            sep03_perth_9i_hidden.pdf
            sep03_perth_9i_hidden.ppt
            sep03_perth_9i_nf.ppt
            sep03_perth_legends.pdf
            sep03_perth_legends.ppt
            sep03_perth_prelim.ppt
            sf_oow_03.pdf
            sf_oow_03.ppt
            sf_oow_03a.ppt
            slide.ppt
            succesful_development.pdf
            succesful_development.ppt
            tools_for_performance.pdf
            tools_for_performance.ppt
          2004
            10046_scene.pdf
            bind_var.pdf
            bind_var.ppt
            dec03_perth_book.ppt
            dec03_perth_book2.ppt
            generating_test_data.pdf
            generating_test_data.ppt
            mar04_hotsos_legends.ppt
            mar04_hotsos_nf_w2k.ppt
            mar04_hotsos_nf_w2k_edit.ppt
            mcdonald1.pdf
            mcdonald2.pdf
            nov04_reorgs_rebuild.ppt
            nov04_succesful_development.ppt
            
            ...
            ...
            ...
            
          2018
            apex_connect
              apex_connect_sql_plsql.pdf
              apex_connect_sql_plsql.pptx
              apex_connect_sql_plsql_v2.pptm
              apex_connect_sql_plsql_v2.pptx
              apex_connect_sql_plsql_v3.pptm
              apex_connect_sql_plsql_v3_subset.pptm
              extensions1.pptx
              extensions2.pptx
            code_china
              better_sql_oracle_code.pptx
              better_sql_oracle_code_v2.pdf
              better_sql_oracle_code_v2.pptx
            code_hyderabad
              better_sql_oracle_code.pptx
            hong_kong
              hong_kong_122.pdf
              hong_kong_122.pptx
              hong_kong_122_no_multi.pptx
              hong_kong_partitioning_ora_template.pptx
              hong_kong_partitioning_ora_template2.pdf
              hong_kong_partitioning_ora_template2.pptx
            london
              london.pdf
              london.pptx
              optimizer.pdf
              optimizer.pptx
              optimizer_full.pptx
            odtug
              common_disasters_short.pptx
              common_disasters_short_comp.pptx      

And there we have it. From flat file listing to a nice hierarchical layout just by tackling the problem piece by piece using the trusty WITH clause.
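
One refinement worth considering – my own tweak, not part of the original requirement – is that the child/parent pairs above are keyed on the bare folder name, so if the same folder name ever appeared under two different parents, their contents would be merged under both. Keying the hierarchy on the full path avoids that. A sketch, reusing the same parsing logic:


with data as
(
select id, seq, token,
       -- the full path up to and including this element, used as a unique key
       substr(fname, 1, instr(fname || '\', '\', 1, seq) - 1) as full_path
from
  ( select rownum id, fname from t ) t_base,
  lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
          from dual
          connect by level <= 1+length(fname)-length(replace(fname,'\'))
          )
),
hier as
(
  select
    distinct
      token,
      full_path,
      lag(full_path) over ( partition by id order by seq ) as parent_path
  from data
)
select lpad(' ',level*2)||token hier_list
from hier
start with parent_path is null
connect by prior full_path = parent_path
order siblings by token;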