Attribute clustering….super cool

Posted by

I’ve spoken about attribute clustering before here, here and here. So from that you can probably glean that I’m a fan.

I recently spoke about an example of this as well during my AskTOM Office Hours session which you can watch below:

After posting that video, I had a follow-up question which I thought would be worth exploring. My examples so far on attribute clustering were either loading a table from empty, or issuing an “alter table move” on a table that already contained data. In both those situations, it is reasonable to assume that clustering is going to help because it aligns conceptually with what we’d expect to happen, namely

  • read all of the data
  • re-organize all of the data so that it is clustered by the nominated clustering key(s)
  • reload all of the data

But what if we do not have that luxury of moving all of the data around? What if I have table that is already clustered, but now I need to load more data into it? As we know, we only see clustering in effect if we perform a direct load operation. But a direct load operation only creates new blocks. That would suggest we cannot cluster the data, no?

Lets explore this scenario with an example.

As a control to this experiment, I’ll perform some index lookups to a “conventional” table, that is, random data without any clustering attributes.

I’ll create a table with the data arranged in random order, and then add the same amount of data again. (I could do this in a single operation, but I’m keeping it in alignment with experiments I’ll perform shortly to mimic the concept of load data and then appending new data).


SQL> create table t as
  2  select rownum id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

Table created.

SQL>
SQL> insert /*+ APPEND */ into t
  2  select rownum+100000 id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

100000 rows created.

SQL>
SQL> commit;

Commit complete.

I’ve got 200,000 randomly distributed rows for 100 distinct CUST_ID values. Now I’ll add an index and perform a simple lookup to examine the cost. (Note: The statistics you see are not from an initial execution, but from subsequent executions to eliminate parsing and other costs)


SQL> create index ix on t ( cust_id );

Index created.

SQL>
SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1502  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

So our control figure is approximately 1500 logical I/O’s.

In the previous blog posts on this topic, I then added attribute clustering definitions and perform an ‘alter table move’ to reorganize all of the data into a nice clustered arragement. This time, I’ll repeat the above experiment in a different manner. The sequence of steps will be:

  1. load the data,
  2. add the clustering attributes,
  3. reorganize the data,
  4. add an index,
  5. observe the expected benefits of clustering,
  6. add more data, which we might expect could not be clustered because the table is already clustered now,
  7. perform more index lookups and see what the cost is.

If clustering is only useful for an initial data load, then we’d expect to see a significant degradation when the final index lookups are performed in (7). First of all, just to reinforce the coolness of attribute clustering just one more time, here are steps (1) through (5). Remember – the control figure here is 1500 LIO’s.


SQL> create table t as
  2  select rownum id, rpad('x',100) data, mod(rownum,100) cust_id
  3  from dual
  4  connect by level <= 100000
  5  order by dbms_random.value;

Table created.

SQL>
SQL> create table t2 as select * from t;

Table created.

SQL>
SQL> alter table t ADD clustering by linear order(cust_id);

Table altered.

SQL> alter table t move;

Table altered.

SQL>
SQL> create index ix on t ( cust_id );

Index created.

SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         20  consistent gets
          2  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Ooooh yeahhhhh… 20 LIOs!!! But now the real test comes. You can see I also created a table called T2, being a clone of T in its randomised data state. Now I am going to append the contents on T2 to my (clustered) T table.


SQL> insert /*+ APPEND */ into t select * from t2;

100000 rows created.

SQL> commit;

Commit complete.

One thing that does look promising is notice the “SORT ORDER BY” step in the execution plan. That suggests that even though this is appending new data to an existing set of data, some clustering might be taking place.


SQL> explain plan for insert /*+ APPEND */ into t select * from t2;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 528765404

-------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |   100K|    10M|       |  2905   (1)| 00:00:01 |
|   1 |  LOAD AS SELECT                  | T    |       |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |   100K|    10M|       |  2905   (1)| 00:00:01 |
|   3 |    SORT ORDER BY                 |      |   100K|    10M|    11M|  2905   (1)| 00:00:01 |
|   4 |     TABLE ACCESS FULL            | T2   |   100K|    10M|       |   442   (1)| 00:00:01 |
-------------------------------------------------------------------------------------------------

11 rows selected.

But the real test comes in the index lookup. Let’s repeat that.


SQL>
SQL> set autotrace on stat
SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
----------------------------------------------------------------------------------------------------
x

1 row selected.


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         43  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

And we’re still awesomely good. The amount of data for CUST_ID = 17 has doubled, and the LIOs approximately doubled as well. So we are still seeing benefits of clustering even though we are appending to a table with previously clustered data.

Just how close to optimal is this? I’ll perform an ‘alter table move’ against the entire table, and re-measure:


SQL> alter table t move online;

Table altered.

SQL> select /*+ index(t (cust_id)) */ max(data)
  2  from t
  3  where cust_id = 17;

MAX(DATA)
---------------------------------------------------------------------------
x

1 row selected.

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         37  consistent gets
          0  physical reads
          0  redo size
        641  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Hence optimal clustering would have yielded 37 LIOs for this data, so 43 is very close to that, and a huge step up in efficiency over the 1500 LIOs from the base experiment. In effect, we have “clumps” of clustered data in the table now – so larger, low frequency dataset loads will give better results than smaller, higher frequency loads. But still…very cool stuff that you should be taking advantage of in your databases.

3 comments

  1. Hello, Connor

    Many thanks for sharing attribute clustering (nice feature of Oracle 12.1 and later) with others. According to your SQL script, I’ve do some test on Oracle 12.2.0.1.0 for Windows x86_64 on my laptop. About “consistent gets”, yours is from 43 to 37, but mine is from 48 to 78 (particularly, after the step about “alter table t move online”).

    Why has my test been a little different from yours?

    Best Regards
    Quanwen Zhao

  2. Thanks a lot Connor for sharing insight into attribute clustering, just would like to update you that the part 2 and part 3 links of the topic is broken. Kindly correct it.

    Thanks,
    Rajesh

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.