Parsing freeform data in flat files

SQL*Loader is a very useful utility that has existed for a long time within Oracle to load flat files into the database. However, some people find the control file syntax quite cryptic, and when it comes to parsing very complicated structures, this can mean control files that are hard to maintain. For me, the best solution here is an external table. That way we can combine the power of the SQL*Loader control file syntax embedded within the external table definition with the full power of SQL and PL/SQL for additional parsing of that data.

Here is an example where the data is spread across multiple lines and the task is to bring it all together into a natural form, namely an ID followed by its text.

So here is my file containing free-format text:


10,some data
that is really 
annoying because it
spans lots of rows.
20,and some more than does not.
30,and a mixture of stuff and we 
will assume that numerics then comm as 
is the thing that tells us it is a new line. 
40,otherwise we will keep on going.

What I'm going to do is use an external table to bring that data in exactly as it appears in the file, and then use a pipelined PL/SQL function to parse the data into its component attributes.


SQL>
SQL> create table freeform
  2  (
  3  line varchar2(200)
  4  )
  5  ORGANIZATION external
  6  (
  7  TYPE oracle_loader
  8  DEFAULT DIRECTORY temp
  9  ACCESS PARAMETERS
 10  (
 11  RECORDS DELIMITED BY NEWLINE
 12  FIELDS LDRTRIM
 13  (
 14  line char(200)
 15  )
 16  )
 17  location
 18  (
 19  'freeform.dat'
 20  )
 21  )REJECT LIMIT UNLIMITED ;

Table created.

SQL>
SQL> select * from freeform;

LINE
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10,some data
that is really
annoying because it
spans lots of rows.
20,and some more than does not.
30,and a mixture of stuff and we
will assume that numerics then comm as
is the thing that tells us it is a new line.
40,otherwise we will keep on going.

9 rows selected.

SQL>
SQL> create or replace
  2  type output_data as object (id int, str varchar2(500))
  3  /

Type created.

SQL>
SQL> create or replace
  2  type output_list as table of output_data
  3  /

Type created.

SQL>
SQL> create or replace
  2  function cleanup_that_sucker(rc sys_refcursor) return output_list pipelined is
  3    type strlist is table of varchar2(200) index by pls_integer;
  4    s strlist;
  5    o output_data := output_data(null,null);
  6    c pls_integer := 0;
  7  begin
  8    loop
  9      fetch rc bulk collect into s limit 500;
 10
 11      for i in 1 .. s.count loop
 12        c := c + 1;
 13        if regexp_like(s(i),'^[[:digit:]]+,') then
 14          if c > 1 then
 15             pipe row (o);
 16          end if;
 17          o.id  := substr(s(i),1,instr(s(i),',')-1);
 18          o.str := substr(s(i),instr(s(i),',')+1);
 19        else
 20          o.str := o.str || ' ' || s(i);
 21        end if;
 22      end loop;
 23      exit when rc%notfound;
 24    end loop;
 25    pipe row (o);
 26    return;
 27  end;
 28  /

Function created.

SQL> sho err
No errors.
SQL>
SQL> select *
  2  from cleanup_that_sucker(cursor(
  3          select line from freeform
  4          ));

        ID STR
---------- ------------------------------------------------------------------------------------------------------------------------
        10 some data that is really annoying because it spans lots of rows.
        20 and some more than does not.
        30 and a mixture of stuff and we will assume that numerics then comm as is the thing that tells us it is a new line.
        40 otherwise we will keep on going.

4 rows selected.

SQL>
SQL>

And there we have it. We have the full power of SQL and PL/SQL at our fingertips even though we are parsing flat file data. This avoids complicated control file syntax and makes ongoing maintenance far easier. Performance is still good because we fetch with BULK COLLECT. And because the function takes a cursor expression as input, we have plenty of flexibility in what data we pass in: we simply change the SQL inside the cursor.
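For readers who want to see the reassembly rule in isolation, here is the same logic sketched in Python, purely as an illustration of the algorithm the pipelined function implements (the function name and sample data mirror the example above; this is not part of the database solution):

```python
import re

def cleanup_that_sucker(lines):
    """Reassemble freeform lines: a line starting with digits followed by
    a comma begins a new (id, text) record; any other line is appended to
    the text of the current record."""
    record = None
    for line in lines:
        m = re.match(r'^(\d+),(.*)$', line)
        if m:
            if record is not None:
                yield record           # emit the completed record
            record = (int(m.group(1)), m.group(2))
        else:
            # continuation line: glue it onto the current record's text
            record = (record[0], record[1] + ' ' + line)
    if record is not None:
        yield record                   # emit the final record

lines = [
    "10,some data",
    "that is really",
    "annoying because it",
    "spans lots of rows.",
    "20,and some more than does not.",
]
for rec_id, text in cleanup_that_sucker(lines):
    print(rec_id, text)
```

The `if c > 1` guard in the PL/SQL version plays the same role as the `record is not None` check here: we only pipe a row once we know the previous record is complete.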

A new line on NEWLINE

Recently I was doing a simple external table load from a CSV file, but was getting an interesting error. My file looked simple enough:


"ID","EMAIL","TIMESTAMP","SUBJECT","STATUS","STATUS_TS"
"2012348048","john@anon.com","05/02/2000","Subject 1","5","09/04/2007"
"2412348048","mike@anon.com","05/02/2000","Subject 1","5","09/16/2002"
"348543169051","sue@anon.com","03/10/2001","Subject 1","5","03/24/2008"
"348396029762","mary@anon.com","03/10/2001","Subject 1","5","03/10/2001"
"1212348047","sam@anon.com","05/02/2000","Subject 1","5","05/02/2000"
"1612348048","vincent@anon.com","05/02/2000","Subject 1","5","06/02/2006"
...
...

So it should have been a fairly straightforward external table definition to access it:


SQL> create table T
  2        ( ID int
  3         ,EMAIL varchar2(255)
  4         ,TIMESTAMP date
  5         ,SUBJECT varchar2(512)
  6         ,STATUS int
  7         ,STATUS_TS date
  8      )
  9      organization external
 10      ( default directory TMP
 11        access parameters
 12        ( records delimited by newline 
 13          fields terminated by ',' optionally enclosed by '"'
 14          ( ID, EMAIL, "TIMESTAMP" DATE 'MM/DD/YYYY', SUBJECT, STATUS, STATUS_TS DATE 'MM/DD/YYYY'
 15          )
 16        )
 17        location ('my_report.csv')
 18    )
 19    ;

Table created.

At which point, things went downhill rapidly 🙂 When I queried the external table, I got the following error:


SQL> select * from T;
select * from T;
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-29400: data cartridge error
KUP-04020: found record longer than buffer size supported,
524288, in c:\temp\my_report.csv (offset=1048576)


The clue here is that a record was seen to be longer than allowed. Since I knew all my "records" were quite short, it had to be the definition of what ends a record that was causing the problem. In my case, the file had come from a system with a different character set, and the concept of "newline" differs between character sets and target platforms. Once I included that information in the external table definition, all was fine.
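If you are unsure which terminator a suspect file actually uses, a quick look at the raw bytes settles it before you touch the DDL. Here is a small sketch in Python (the function name is mine, not an Oracle utility; it simply samples the start of the file):

```python
def detect_terminator(path, sample_size=65536):
    """Inspect the first bytes of a file and report which line
    terminator appears: CRLF, LF, or CR."""
    with open(path, 'rb') as f:
        chunk = f.read(sample_size)
    # Check CRLF first, since a CRLF file also contains bare \n bytes.
    if b'\r\n' in chunk:
        return 'CRLF (\\r\\n)'
    if b'\n' in chunk:
        return 'LF (\\n)'
    if b'\r' in chunk:
        return 'CR (\\r)'
    return 'no terminator found in sample'
```

Knowing the real terminator (and character set) tells you what to put in the `records delimited by` clause.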


SQL> create table T
  2        ( ID int
  3         ,EMAIL varchar2(255)
  4         ,TIMESTAMP date
  5         ,SUBJECT varchar2(512)
  6         ,STATUS int
  7         ,STATUS_TS date
  8      )
  9      organization external
 10      ( default directory TMP
 11        access parameters
 12        ( records delimited by '\n' characterset AL32UTF8
 13          fields terminated by ',' optionally enclosed by '"'
 14          ( ID, EMAIL, "TIMESTAMP" DATE 'MM/DD/YYYY', SUBJECT, STATUS, STATUS_TS DATE 'MM/DD/YYYY'
 15          )
 16        )
 17        location ('my_report.csv')
 18    )
 19    ;

Table created.

SQL> select * from T where rownum < 10;

...

16,308 rows selected.


Slow external table access

We had an interesting issue on 12.1.0.1, where users were reporting very slow performance on queries against external tables. When I tried to replicate the problem, everything seemed just fine, so I initially reported back the familiar "Well, it works on my PC" 🙂 [Just kidding]

Anyway, connecting by proxy to one of their accounts did reveal the problem, which suggested it was something to do with privileges. A SQL trace showed that the time was going into a query that retrieves the list of directories and their privileges:

SELECT NAME, PATH, READ, WRITE, EXECUTE FROM SYS.LOADER_DIR_OBJS

The definition of that view was:

create or replace force view sys.loader_dir_objs
(name, path, read, write, execute)
bequeath definer
as
select o.name
      ,d.os_path
      ,'TRUE'
      ,'TRUE'
      ,'TRUE'
from   sys.obj$ o
      ,sys.x$dir d
where  o.obj# = d.obj#
and    (o.owner# = uid
or      exists
          (select null
           from   v$enabledprivs
           where  priv_number in (-177
                                 ,-178)))
union all
select 
       o.name
      ,d.os_path
      ,decode(sum(decode(oa.privilege#, 17, 1, 0)), 0, 'FALSE', 'TRUE')
      ,decode(sum(decode(oa.privilege#, 18, 1, 0)), 0, 'FALSE', 'TRUE')
      ,decode(sum(decode(oa.privilege#, 12, 1, 0)), 0, 'FALSE', 'TRUE')
from   sys.obj$ o
      ,sys.x$dir d
      ,sys.objauth$ oa
where  o.obj# = d.obj#
and    oa.obj# = o.obj#
and    oa.privilege# in (12
                        ,17
                        ,18)
and    oa.grantee# in (select kzsrorol
                       from   x$kzsro)
and    not (o.owner# = uid
or          exists
              (select null
               from   v$enabledprivs
               where  priv_number in (-177
                                     ,-178)))
group by o.name
        ,d.os_path;

Re-gathering dictionary and fixed object stats yielded no benefit, so I tinkered with the view definition to come up with a faster version, which was this:

create or replace force view sys.loader_dir_objs
(name, path, read, write, execute)
bequeath definer
as
select o.name
      ,d.os_path
      ,'TRUE'
      ,'TRUE'
      ,'TRUE'
from   sys.obj$ o
      ,sys.x$dir d
where  o.obj# = d.obj#
and    (o.owner# = uid
or      exists
          (select null
           from   v$enabledprivs
           where  priv_number in (-177
                                 ,-178)))
union all
select /*+ leading(d o oa) */
       o.name
      ,d.os_path
      ,decode(sum(decode(oa.privilege#, 17, 1, 0)), 0, 'FALSE', 'TRUE')
      ,decode(sum(decode(oa.privilege#, 18, 1, 0)), 0, 'FALSE', 'TRUE')
      ,decode(sum(decode(oa.privilege#, 12, 1, 0)), 0, 'FALSE', 'TRUE')
from   sys.obj$ o
      ,sys.x$dir d
      ,sys.objauth$ oa
where  o.obj# = d.obj#
and    oa.obj# = o.obj#
and    oa.privilege# in (12
                        ,17
                        ,18)
and    oa.grantee# in (select kzsrorol
                       from   x$kzsro)
and    not (o.owner# = uid
or          exists
              (select null
               from   v$enabledprivs
               where  priv_number in (-177
                                     ,-178)))
group by o.name
        ,d.os_path;

[Note: If you're having the same issue, before you race out and use the same hints, be aware that the hints above work for us because the number of directories in our database is very small. If you've got lots of directories defined, this might not be the best approach for you.]

The challenge now was getting this performance benefit without actually hacking the dictionary view (which means calls to MOS... and, well, that's no way to start the new year :-))

So initially I used the standard technique of applying a baseline, but encountered some dramas (more on that in a later post). Then, following a suggestion from an OakTable colleague (http://www.oaktable.net/users/lothar), I ventured down the path of a SQL patch.

I put my 'corrected' view in place, issued the problem query, and captured the full plan using DBMS_XPLAN with the +OUTLINE format option. I then put the original version of the view back in place, and attempted to patch in the altered plan like so:

declare
  l_sql  clob;
  l_hint clob :=
'[the full text of the plan outline]';
begin
  --
  -- '23ka1fq59wg0b' is the sqlid for my problem query.  
  --
  select sql_text into l_sql from v$sql where sql_id = '23ka1fq59wg0b';

  sys.dbms_sqldiag_internal.i_create_patch(
    sql_text=>l_sql,
    hint_text=>l_hint,
    name=>'patch_LOADER_DIR_OBJS');
end;
/

This failed with a PL/SQL error, because the hint text for a SQL patch is limited to 500 characters. So then it was a case of stripping out hints that were "redundant" [Note: this is generally a bad idea, because there's a strong argument that no hint in a plan outline is redundant]. But once within 500 characters, the patch was applied, and external table performance is now fine for all users.