Attribute clustering (part 3)

So in part1 and part2, we looked at creating tables with clustered data.  If you’re ready to climb aboard the attribute clustering heading toward Fastville Smile you might want to take an existing table and cluster it.  In part 2 we saw how we had to be extra careful with syntax.  The same rule applies with altering a table to cluster it.  Lets start with our SOURCE_DATA table which was not clustered.



SQL> create table source_data  as
  2      select d.*
  3      from dba_objects d
  4      where object_id is not null
  5      order by dbms_random.random;

Table created.

SQL>
SQL> select object_id, owner from source_data where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
     37627 PUBLIC
     79081 ORDSYS
     26953 SYS
     74744 PUBLIC
     47975 PUBLIC
     94003 APEX_050000
     59014 SYS
     66409 PUBLIC
     36977 PUBLIC
     74886 PUBLIC
       565 SYS
       ...
       ...

As you see, the data has no particular ordering.  Let’s now run an ALTER command to cluster this table



SQL> alter table source_data clustering by linear order(object_id);

Table altered.

SQL> select object_id, owner from source_data where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
     37627 PUBLIC
     79081 ORDSYS
     26953 SYS
     74744 PUBLIC
     47975 PUBLIC
     94003 APEX_050000
     59014 SYS
     66409 PUBLIC
     36977 PUBLIC
     74886 PUBLIC
       565 SYS
     35456 SYS
     74656 SYS
     ...
     ...

As you can see, nothing seems to have happened. Now that is to be expected, because we have only set an attribute on the table, similar to (say) setting the compression attribute.

Let us now MOVE the data to “reload” the table data.



SQL> alter table source_data move;

Table altered.

SQL>
SQL> select object_id, owner from source_data where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
     37627 PUBLIC
     79081 ORDSYS
     26953 SYS
     74744 PUBLIC
     47975 PUBLIC
     94003 APEX_050000
     59014 SYS
     66409 PUBLIC
     36977 PUBLIC
     74886 PUBLIC
       565 SYS
     35456 SYS
     74656 SYS
     ...
     ...

And …. Bzzzzt! Still nothing has happened. The issue is here is not the MOVE command. We made a tiny error in our initial ALTER statement. Let’s try it again



SQL> alter table source_data ADD clustering by linear order(object_id);

Table altered.

SQL>
SQL> alter table source_data move;

Table altered.

SQL>
SQL> select object_id, owner from source_data where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
         2 SYS
         3 SYS
         4 SYS
         5 SYS
         6 SYS
         7 SYS
         8 SYS
         9 SYS
        10 SYS
        11 SYS
        12 SYS
        13 SYS
        14 SYS
        15 SYS
        16 SYS
     ...
     ...

As you can see, the keyword ADD makes all the difference.  As I said in the previous posts, always validate that your clustering is happening as you expect.

Attribute clustering (part 2)

In the previous post, I demonstrated attribute clustering by

  • creating a table of source data which contained data in randomised order, via

SQL> create table source_data  as
  2      select d.*
  3      from dba_objects d
  4      where object_id is not null
  5      order by dbms_random.random;



  • and then it loading into a table with attribute clustering defined to cluster the data into the desired order.

The obvious question is then “Why not just cluster the data by combining the two operations ?”, namely


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select d.* from dba_objects d
  4  order by dbms_random.random;

There was a reason I did my first demo in the way I did.  Let’s look at the execution plan for that CREATE statement.  Now, bear with me, for simplicity sake, rather than show the execution plan for DBA_OBJECTS access (which (which is a huge complex view), I’ll use SOURCE_DATA both with and without ORDER BY clauses as a representative.


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select * from source_data;

------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes |
------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |             | 99386 |    10M|
|   1 |  LOAD AS SELECT                  | T1          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |             | 99386 |    10M|
|   3 |    SORT ORDER BY                 |             | 99386 |    10M|
|   4 |     TABLE ACCESS FULL            | SOURCE_DATA | 99386 |    10M|
------------------------------------------------------------------------

SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select * from dba_objects
  4  order by dbms_random.random;
  
------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes |
------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |             | 99386 |    10M|
|   1 |  LOAD AS SELECT                  | T1          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |             | 99386 |    10M|
|   3 |    SORT ORDER BY                 |             | 99386 |    10M|
|   4 |     TABLE ACCESS FULL            | SOURCE_DATA | 99386 |    10M|
------------------------------------------------------------------------

Notice that the execution plans appear the same.  We can see a SORT ORDER BY in the first CTAS, even though there is no ORDER BY in the SELECT part.  We can hence reasonably assume this is to achieve the desired clustering.  But the second CTAS raises an interesting question.  If I have to sort the data for the clustering and  there was an ORDER BY clause in the query, which one of those, if any, does the SORT ORDER BY in the execution relate to ?  Well, lets run our CTAS command and find out.


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as select * from source_data
  4  order by dbms_random.random;

Table created.

SQL>
SQL> select object_id, owner from t1 where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
     72256 SYS
     29101 PUBLIC
     26037 SYS
     47229 PUBLIC
     54002 SYS
     65203 PUBLIC
     49264 SYS
     71998 SYS
     91256 APEX_040200
     85311 PUBLIC
     72948 SYS
     68626 SYS
       337 SYS
     54564 SYS
     11256 SYS
     40694 SYS
     44193 PUBLIC
     31856 SYS
     26153 SYS
     25604 PUBLIC
  ...
  ...

Notice that the data is not clustered as per our expectation.  It looks like the final ORDER BY took precedence over our attempt to cluster the data.

So you need to be careful here when using attribute clustering. The database engine is quite particular on the syntax and ordering operations.

In the example above, simply in-lining the order by statement resolves the issue


SQL> create table t1
  2  clustering by linear order(object_id)
  3  as
  4  select *
  5  from
  6  ( select * from source_data order by dbms_random.random );

Table created.

SQL>
SQL> select object_id, owner from t1 where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
         2 SYS
         3 SYS
         4 SYS
         5 SYS
         6 SYS
         7 SYS
         8 SYS
         9 SYS
        10 SYS
        11 SYS
        12 SYS
        13 SYS
        14 SYS
        15 SYS
        16 SYS
        17 SYS
        ...
        ...

And just in case you are wondering, there was no change the execution plan for the above statement. That is, you don’t see any view resolution and multiple SORT ORDER BY lines. It looks as before


------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes |
------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |             | 99386 |    10M|
|   1 |  LOAD AS SELECT                  | T1          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |             | 99386 |    10M|
|   3 |    SORT ORDER BY                 |             | 99386 |    10M|
|   4 |     TABLE ACCESS FULL            | SOURCE_DATA | 99386 |    10M|
------------------------------------------------------------------------

But the 1 level of in-lining was enough to fix the issue.   So always validate that you are getting the clustering you are expecting.

Attribute clustering (part 1)

One of the nice facilities in 12c is the ability to instruct the database to co-locate data during loading in order to facilitate reduced resource consumption for subsequent queries.

Here’s an example of it in use. First I’ll create table SOURCE_DATA to mimic some randomised incoming data feed



SQL> exec dbms_random.seed(0)

PL/SQL procedure successfully completed.

SQL> drop table source_data purge;
drop table source_data purge
           *
ERROR at line 1:
ORA-00942: table or view does not exist


SQL>
SQL> create table source_data  as
  2      select d.*
  3      from dba_objects d
  4      where object_id is not null
  5      order by dbms_random.random;

Table created.

So let’s now populate (create) our table T with a straight copy of the data from SOURCE_DATA and index a column of interest


SQL> select object_id, owner from t where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
     59002 PUBLIC
     50435 SYS
     37276 SYS
      2280 SYS
     65169 PUBLIC
     36054 SYS
     47375 SYS
     66267 PUBLIC
     34259 SYS
     74791 APEX_050100
     37002 SYS
     30269 SYS
     33423 SYS
     70990 MDSYS
     29479 SYS
     33347 SYS
     49426 SYS
     44222 SYS
     17796 PUBLIC
     44212 SYS
     35609 SYS
     45084 SYS
     29310 SYS
     55452 SYS
     30311 SYS
...

SQL> create index t_idx on t(object_id);

Index created.

You can see the data is in pseudo-random order on both OBJECT_ID and OWNER. (We only care about OBJECT_ID for the sake of this demo). Let us now see what the I/O of an index range scan query on OBJECT_ID costs us.


SQL> set autotrace traceonly stat
SQL> select /*+ index(t) */ * from t where object_id between 10000 and 15000;

4997 rows selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       5339  consistent gets
         12  physical reads
          0  redo size
     629403  bytes sent via SQL*Net to client
       4214  bytes received via SQL*Net from client
        335  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       4997  rows processed

SQL> set autotrace off
SQL>
SQL>

Just under 5400 consistent gets, which is not dissimilar from the rows we fetched, because those rows from OBJECT_ID 10,000 to 15,000 they are scattered throughout the table. Let us now repeat the exercise, this time using the clustering facilities available under 12c.


SQL>
SQL> drop table t purge;

Table dropped.

SQL>
SQL> create table t
  2  clustering by linear order(object_id)
  3  as select * from source_data;

Table created.

SQL> select object_id, owner from t where rownum < 50;

 OBJECT_ID OWNER
---------- ------------------------------
         2 SYS
         3 SYS
         4 SYS
         5 SYS
         6 SYS
         7 SYS
         8 SYS
         9 SYS
        10 SYS
        11 SYS
        12 SYS
        13 SYS
        14 SYS
        15 SYS
        16 SYS
        17 SYS
        18 SYS
        19 SYS
        20 SYS
        21 SYS
        22 SYS
        23 SYS
        24 SYS
        25 SYS
        26 SYS
        27 SYS
        28 SYS
        29 SYS
        30 SYS
        31 SYS
        32 SYS
        33 SYS
        34 SYS
        35 SYS
        36 SYS
        37 SYS
        38 SYS
        39 SYS
        40 SYS
        41 SYS
        42 SYS
        43 SYS
        44 SYS
        45 SYS
        46 SYS
...
...

SQL> create index t_idx on t(object_id);

Index created.

SQL>
SQL> set autotrace traceonly stat
SQL> select /*+ index(t) */ * from t where object_id between 10000 and 15000;

4997 rows selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        756  consistent gets
         12  physical reads
          0  redo size
     629403  bytes sent via SQL*Net to client
       4214  bytes received via SQL*Net from client
        335  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       4997  rows processed

SQL> set autotrace off
SQL>
SQL>

Before we even used the index, you could see from the sample of rows we queried from the table, the data has been stored in an ordered sequence. As a result, our index lookup is dramatically improved, dropping down to just ~750 consistent gets.  Clustering is covered in great detail in the Data Warehousing Guide, but I presented this example to help with my next blog post on the same topic (coming soon).

Some people have mentioned that Attribute Clustering is an engineered system feature only.  This is not correct – you’re free to Attribute Clustering in Enterprise Edition. It is only when you want to incorporate Zone Maps that licensing things get more complicated Smile