Attribute clustering (part 2)

Posted by

In the previous post, I demonstrated attribute clustering by

  • creating a table of source data which contained data in randomised order, via

SQL> create table source_data  as
  2      select d.*
  3      from dba_objects d
  4      where object_id is not null
  5      order by dbms_random.random;



  • and then it loading into a table with attribute clustering defined to cluster the data into the desired order.

The obvious question is then “Why not just cluster the data by combining the two operations ?”, namely


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select d.* from dba_objects d
  4  order by dbms_random.random;

There was a reason I did my first demo in the way I did.  Let’s look at the execution plan for that CREATE statement.  Now, bear with me, for simplicity sake, rather than show the execution plan for DBA_OBJECTS access (which (which is a huge complex view), I’ll use SOURCE_DATA both with and without ORDER BY clauses as a representative.


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select * from source_data;

------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes |
------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |             | 99386 |    10M|
|   1 |  LOAD AS SELECT                  | T1          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |             | 99386 |    10M|
|   3 |    SORT ORDER BY                 |             | 99386 |    10M|
|   4 |     TABLE ACCESS FULL            | SOURCE_DATA | 99386 |    10M|
------------------------------------------------------------------------

SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select * from dba_objects
  4  order by dbms_random.random;
  
------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes |
------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |             | 99386 |    10M|
|   1 |  LOAD AS SELECT                  | T1          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |             | 99386 |    10M|
|   3 |    SORT ORDER BY                 |             | 99386 |    10M|
|   4 |     TABLE ACCESS FULL            | SOURCE_DATA | 99386 |    10M|
------------------------------------------------------------------------

Notice that the execution plans appear the same.  We can see a SORT ORDER BY in the first CTAS, even though there is no ORDER BY in the SELECT part.  We can hence reasonably assume this is to achieve the desired clustering.  But the second CTAS raises an interesting question.  If I have to sort the data for the clustering and  there was an ORDER BY clause in the query, which one of those, if any, does the SORT ORDER BY in the execution relate to ?  Well, lets run our CTAS command and find out.


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select * from source_data
  4  order by dbms_random.random;

Table created.

SQL>
SQL> select object_id, owner from t1 where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
     72256 SYS
     29101 PUBLIC
     26037 SYS
     47229 PUBLIC
     54002 SYS
     65203 PUBLIC
     49264 SYS
     71998 SYS
     91256 APEX_040200
     85311 PUBLIC
     72948 SYS
     68626 SYS
       337 SYS
     54564 SYS
     11256 SYS
     40694 SYS
     44193 PUBLIC
     31856 SYS
     26153 SYS
     25604 PUBLIC
  ...
  ...

Notice that the data is not clustered as per our expectation.  It looks like the final ORDER BY took precedence over our attempt to cluster the data.

So you need to be careful here when using attribute clustering. The database engine is quite particular on the syntax and ordering operations.

In the example above, simply in-lining the order by statement resolves the issue


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as
  4  select *
  5  from
  6  ( select * from source_data order by dbms_random.random );

Table created.

SQL>
SQL> select object_id, owner from t1 where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
         2 SYS
         3 SYS
         4 SYS
         5 SYS
         6 SYS
         7 SYS
         8 SYS
         9 SYS
        10 SYS
        11 SYS
        12 SYS
        13 SYS
        14 SYS
        15 SYS
        16 SYS
        17 SYS
        ...
        ...

And just in case you are wondering, there was no change the execution plan for the above statement. That is, you don’t see any view resolution and multiple SORT ORDER BY lines. It looks as before


------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes |
------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |             | 99386 |    10M|
|   1 |  LOAD AS SELECT                  | T1          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |             | 99386 |    10M|
|   3 |    SORT ORDER BY                 |             | 99386 |    10M|
|   4 |     TABLE ACCESS FULL            | SOURCE_DATA | 99386 |    10M|
------------------------------------------------------------------------

But the 1 level of in-lining was enough to fix the issue.   So always validate that you are getting the clustering you are expecting.

4 comments

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.