Parsing freeform data in flat files

Posted by

SQL loader is a very cool utility that has existed for a long time within Oracle to load flat files into the database. However sometimes people find the control file syntax quite cryptic, and when it comes to passing very complicated structures, this can mean control files which are hard to maintain. For me the best solution here is to use an external table. That way we can combine the power of the SQL Loader control file syntax embedded within the external table definition, along with the full power of PL/SQL and SQL for additional parsing of that data.

Here is an example where the data is spread across multiple lines and the task is to bring all that data together into a natural form, namely an ID followed by text.

So here is my file that has free format text


10,some data
that is really 
annoying because it
spans lots of rows.
20,and some more than does not.
30,and a mixture of stuff and we 
will assume that numerics then comm as 
is the thing that tells us it is a new line. 
40,otherwise we will keep on going.

What I’m going to do is use an external table to bring that data in as it is within the file, and then use a PL/SQL function in pipelined mode to parse the data into the component attributes.


SQL>
SQL> create table freeform
  2  (
  3  line varchar2(200)
  4  )
  5  ORGANIZATION external
  6  (
  7  TYPE oracle_loader
  8  DEFAULT DIRECTORY temp
  9  ACCESS PARAMETERS
 10  (
 11  RECORDS DELIMITED BY NEWLINE
 12  FIELDS LDRTRIM
 13  (
 14  line char(200)
 15  )
 16  )
 17  location
 18  (
 19  'freeform.dat'
 20  )
 21  )REJECT LIMIT UNLIMITED ;

Table created.

SQL>
SQL> select * from freeform;

LINE
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10,some data
that is really
annoying because it
spans lots of rows.
20,and some more than does not.
30,and a mixture of stuff and we
will assume that numerics then comm as
is the thing that tells us it is a new line.
40,otherwise we will keep on going.

9 rows selected.

SQL>
SQL> create or replace
  2  type output_data as object (id int, str varchar2(500))
  3  /

Type created.

SQL>
SQL> create or replace
  2  type output_list as table of output_data
  3  /

Type created.

SQL>
SQL> create or replace
  2  function cleanup_that_sucker(rc sys_refcursor) return output_list pipelined is
  3    type strlist is table of varchar2(200) index by pls_integer;
  4    s strlist;
  5    o output_data := output_data(null,null);
  6    c pls_integer := 0;
  7  begin
  8    loop
  9      fetch rc bulk collect into s limit 500;
 10
 11      for i in 1 .. s.count loop
 12        c := c + 1;
 13        if regexp_like(s(i),'^[[:digit:]]+,') then
 14          if c > 1 then
 15             pipe row (o);
 16          end if;
 17          o.id  := substr(s(i),1,instr(s(i),',')-1);
 18          o.str := substr(s(i),instr(s(i),',')+1);
 19        else
 20          o.str := o.str || ' ' || s(i);
 21        end if;
 22      end loop;
 23      exit when rc%notfound;
 24    end loop;
 25    pipe row (o);
 26    return;
 27  end;
 28  /

Function created.

SQL> sho err
No errors.
SQL>
SQL> select *
  2  from cleanup_that_sucker(cursor(
  3          select line from freeform
  4          ));

        ID STR
---------- ------------------------------------------------------------------------------------------------------------------------
        10 some data that is really annoying because it spans lots of rows.
        20 and some more than does not.
        30 and a mixture of stuff and we will assume that numerics then comm as is the thing that tells us it is a new line.
        40 otherwise we will keep on going.

4 rows selected.

SQL>
SQL>

And there we have it. We have the full power of PL/SQL and SQL at our fingertips even though we are parsing flat file data. This avoids complicated control file syntax and makes ongoing maintenance far easier. Performance is still good because we are using the bulk collect feature in PL/SQL. Also because we are taking a cursor expression as input, we have lots of flexibility in terms of what data are we passing in. We simply change the SQL expression in the cursor.

3 comments

  1. In 12c and above with Pattern matching it goes like this.

    rajesh@ORA12C> set linesize 250
    rajesh@ORA12C> column listaggs format a130
    rajesh@ORA12C> select mno, listagg(line,”) within group(order by x) listaggs
    2 from (
    3 select *
    4 from freeform
    5 match_recognize(
    6 order by x
    7 measures
    8 match_number() mno
    9 all rows per match
    10 pattern( b1 b2*)
    11 define
    12 b1 as regexp_like(line,’^\d+’) ,
    13 b2 as regexp_like(line,’^\D+’) )
    14 )
    15 group by mno
    16 order by mno ;

    MNO LISTAGGS
    ———- ——————————————————————————————————————–
    1 10,some datathat is reallyannoying because itspans lots of rows.
    2 20,and some more than does not.
    3 30,and a mixture of stuff and wewill assume that numerics then comm asis the thing that tells us it is a new line.
    4 40,otherwise we will keep on going.

    rajesh@ORA12C>

  2. Or Perhaps like this,

    rajesh@ORA12C> column num# format a5
    rajesh@ORA12C> column listed_values format a140
    rajesh@ORA12C> select mno,
    2 regexp_substr( listagg(line,”) within group(order by x) ,’\d+’ ) as num#,
    3 regexp_replace( listagg(line,”) within group(order by x) ,’\d+,’ ) listed_values
    4 from (
    5 select *
    6 from freeform
    7 match_recognize(
    8 order by x
    9 measures
    10 match_number() mno
    11 all rows per match
    12 pattern( b1 b2*)
    13 define
    14 b1 as regexp_like(line,’^\d+’) ,
    15 b2 as regexp_like(line,’^\D+’) )
    16 )
    17 group by mno
    18 order by mno ;

    MNO NUM# LISTED_VALUES
    ———- —– ——————————————————————————————————————
    1 10 some datathat is reallyannoying because itspans lots of rows.
    2 20 and some more than does not.
    3 30 and a mixture of stuff and wewill assume that numerics then comm asis the thing that tells us it is a new line.
    4 40 otherwise we will keep on going.

    rajesh@ORA12C>

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.