Datatype discipline

(This is based on some ‘issues’ a client was having, so the example is architected to make a particular point, but it is relevant nonetheless.)

In the distributed database world, we often have different names for the same piece of information.

Consider two tables that hold customer information. In one table the customer key is called “REF_NUM”, and in the other, “CUST_ID”.

That’s not ideal, but it’s not a big drama because at least the data types align.

        SQL> desc T1

                   Name                                        Null?    Type
                   ------------------------------------------- -------- ----------------------
                   REF_NUM                                     NOT NULL NUMBER(10)
                   T1_DATA                                              VARCHAR2(1000)

        SQL> desc T2

                   Name                                        Null?    Type
                   ------------------------------------------- -------- ----------------------
                   CUST_ID                                     NOT NULL NUMBER(10)
                   T2_DATA                                              VARCHAR2(1000)
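If you want to follow along at home, something like the following will give you comparable tables. This is a minimal sketch; the row counts and generated data are illustrative, and any reasonably sized pair of tables with matching keys will behave the same way:

        SQL> create table t1 ( ref_num number(10) not null, t1_data varchar2(1000) );

        Table created.

        SQL> create table t2 ( cust_id number(10) not null, t2_data varchar2(1000) );

        Table created.

        SQL> insert into t1 select rownum, rownum||'some data' from dual connect by level <= 100000;

        100000 rows created.

        SQL> insert into t2 select rownum, rownum||'some data' from dual connect by level <= 100000;

        100000 rows created.

        SQL> exec dbms_stats.gather_table_stats(user,'T1')

        PL/SQL procedure successfully completed.

        SQL> exec dbms_stats.gather_table_stats(user,'T2')

        PL/SQL procedure successfully completed.

For the second half of this post, the only change is recreating T2 with CUST_ID as VARCHAR2(10).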

When I do a join between these two tables, it costs me about 9400 I/Os:

        SQL> select *
          2  from  t1, t2
          3  where t1.ref_num = t2.cust_id
          4  and   t1.t1_data like '100%';

        Statistics
        ----------------------------------------------------------
                  0  recursive calls
                  1  db block gets
               9408  consistent gets
                  0  physical reads
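(The statistics above are the SQL*Plus autotrace output. If you want to see the same numbers for your own runs, turn it on before executing the query; this assumes your account has the PLUSTRACE role or equivalent access to the underlying V$ views.)

        SQL> set autotrace traceonly statistics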

Consider now the same two tables, with exactly the same data, but this time the data types are not aligned. In one system they used a numeric, and in the other they used a string:

        SQL> desc T1

                   Name                                        Null?    Type
                   ------------------------------------------- -------- ----------------------
                   REF_NUM                                     NOT NULL NUMBER(10)
                   T1_DATA                                              VARCHAR2(1000)

        SQL> desc T2

                   Name                                        Null?    Type
                   ------------------------------------------- -------- ----------------------
                   CUST_ID                                     NOT NULL VARCHAR2(10)
                   T2_DATA                                              VARCHAR2(1000)

Now look what happens to my SQL:

        SQL> select *
          2  from  t1, t2
          3  where t1.ref_num = t2.cust_id
          4  and   t1.t1_data like '100%';

        Statistics
        ----------------------------------------------------------
                  0  recursive calls
                  2  db block gets
              18428  consistent gets
                  0  physical reads

It costs DOUBLE the I/Os: 18,428 consistent gets versus 9,408. Same data, same result, twice the work.
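The extra work comes from implicit conversion. When Oracle compares a NUMBER with a VARCHAR2, it converts the character side to a number, so the join predicate the database actually evaluates is effectively:

        where t1.ref_num = TO_NUMBER(t2.cust_id)

Once CUST_ID is wrapped in a function, a normal index on it can no longer be used for the join. The exact plan change isn’t shown here, but a jump in consistent gets like the one above is a typical symptom.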

Moreover, it’s not just performance that is at risk…your whole application might just start to crash randomly.

Using the same tables above (with the conflicting data types), here’s a SQL that works fine “today”…

        SQL> select count(*)
          2  from  t1, t2
          3  where t1.ref_num = t2.cust_id
          4  and   t2.t2_data like '100%';

          COUNT(*)
        ----------
                71

Then a pesky user comes along and does what users do…adds data to one of the tables.

        SQL> insert into T2 values ('X123','100some data');

        1 row created.

And now let’s re-run that same query:

        SQL> select count(*)
          2  from  t1, t2
          3  where t1.ref_num = t2.cust_id
          4  and   t2.t2_data like '100%';

        ERROR: ORA-01722: invalid number

Ker-splat…your application is toast. The implicit TO_NUMBER on CUST_ID has finally met a value it cannot convert, and whether (and when) that happens depends on which rows the execution plan touches and in what order, which is why this class of bug appears to strike at random.

Not being disciplined with data type selection causes pain later.
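If you are stuck with the mismatch and you are on 12.2 or later, the conversion-error clause on TO_NUMBER can at least keep the query alive. Treat this as a band-aid rather than a fix, because rows that cannot be converted silently drop out of the join:

        SQL> select count(*)
          2  from  t1, t2
          3  where t1.ref_num = to_number(t2.cust_id default null on conversion error)
          4  and   t2.t2_data like '100%';

The real fix is, of course, to align the data types in the first place.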

2 comments

  1. Connor, you’re still thinking procedurally!
    Abstract the data type and store everything as a LOB.
    Pronto! No more data conversion issues.
    Now all you need is a “scalable” application server, clustered across 10000 nodes, to dynamically convert and interpret the data as needed so it can produce whatever results one might want.
    Simple, and always the correct result no matter what data is input!
    /sarcasm
