APEX_DATA_PARSER turbo boost!

One of the most common recommendations I make to customers who post questions on AskTom is “Install APEX even if you do not plan to use APEX”. That may seem an odd recommendation, but APEX is a database-centric tool, and hence, even if you never use APEX itself, the utilities it provides for data processing within the database make it a useful addition to any developer’s toolkit.

In particular, for those database features that often require “heavy lifting” to handle all the possibilities (e.g. email, network communications, etc.), you will often find that the APEX development team has done that heavy lifting for you and wrapped these facilities in easy-to-use packages that mask all of the complexity. As the saying goes, “Why reinvent the wheel?”

A great example of this is parsing structured data. AskTom is littered with questions about extracting fields from a complex structure such as CSV data, XML or similar. One of the very cool facilities that arrived in APEX 19 was the APEX_DATA_PARSER package, which can take care of all of this for you. All of those idiosyncrasies that you would otherwise have had to deal with (nested quotes, compressed files, etc.) are handled for you, so that you can just focus on working with the extracted data in relational format.

You may be thinking that SQL*Loader or external tables satisfy this need, but these tools only work on data stored outside the database. That means having a staging area on the database server (or an object store in a cloud environment), plus all of the associated requirements of access rights, security, archival, and so on. Often people would much rather upload their raw data into the database as a CLOB or BLOB and parse it directly from there. APEX_DATA_PARSER fills that need perfectly where other tools do not. APEX team member Carsten Czarski has a great post on all of the features of APEX_DATA_PARSER.
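As a quick illustration of how little code this takes, parsing an uploaded file can be a single query. This is a hedged sketch: MY_UPLOADS and its columns are hypothetical names for a staging table holding the raw file as a BLOB.

```sql
-- Hypothetical staging table: my_uploads(file_name varchar2, file_content blob).
-- APEX_DATA_PARSER infers the file type (CSV, XLSX, XML, JSON) from p_file_name,
-- and exposes the parsed data as LINE_NUMBER plus generic columns COL001, COL002, ...
select p.line_number, p.col001, p.col002, p.col003
from   my_uploads f,
       table( apex_data_parser.parse(
                p_content   => f.file_content,
                p_file_name => f.file_name ) ) p
where  f.file_name = 'mydata.csv';
```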

However, there can be a price to pay when using a utility that must be “all things to all people”.

What defines “typical” usage of APEX_DATA_PARSER when it is such a flexible package? Will it be used mostly for CSV files? Or perhaps Excel spreadsheets? Will we parse 100 rows, or 100 million?

With those questions in mind, consider the following concern that came from a customer recently. They reported that APEX_DATA_PARSER was very slow for them, taking several minutes to parse a relatively small file of CSV data. That seemed at odds with my experience. I’ve always found APEX_DATA_PARSER to be very efficient, even taking into account that it is ultimately a PL/SQL-based parsing facility, which will always be a little slower than a tailor-made C routine sitting inside the database kernel.

To investigate the customer’s claims, I set up a simple benchmark that uses APEX_DATA_PARSER to parse simple CSV files of varying sizes. You can see from the first few lines of code below that the files are named according to their size:

  • data5mb.csv
  • data10mb.csv
  • data20mb.csv
  • data30mb.csv
  • data40mb.csv
  • data50mb.csv
  • data60mb.csv
  • data70mb.csv

Each file contains the same data, just replicated enough times to pad out the file size. Each file is loaded into a BLOB in the table MYCSVFILE, and then I use APEX_DATA_PARSER to parse the LOB into rows and columns that can be fetched with a PL/SQL cursor. Starting and ending timestamps record how long each parsing iteration takes.


SQL> create table mycsvfile ( blob blob);

Table created.

SQL>
SQL> set serverout on
SQL> declare
  2    l_filename sys.odcivarchar2list :=
  3         sys.odcivarchar2list(
  4             'data5mb.csv'
  5             ,'data10mb.csv'
  6             ,'data20mb.csv'
  7             ,'data30mb.csv'
  8             ,'data40mb.csv'
  9             ,'data50mb.csv'
 10             ,'data60mb.csv'
 11             ,'data70mb.csv'
 12             );
 13
 14      l_clob           clob;
 15      l_blob           blob;
 16      l_bfile          bfile;
 17      l_target_offset  number := 1;
 18      l_source_offset  number := 1;
 19      l_language_ctx   number := dbms_lob.default_lang_ctx;
 20      l_warning        number;
 21      l_lob_len        number;
 22
 23      l_ts             timestamp;
 24
 25  begin
 26    for i in 1 .. l_filename.count
 27    loop
 28      dbms_application_info.set_client_info(localtimestamp||' '||l_filename(i));
 29      dbms_output.put_line('File: '||l_filename(i));
 30
 31      execute immediate 'truncate table mycsvfile';
 32      insert into mycsvfile (blob) values (empty_blob()) returning blob into l_blob;
 33
 34      l_bfile := bfilename('TMP',l_filename(i));
 35      dbms_lob.fileopen(l_bfile, dbms_lob.file_readonly);
 36      dbms_lob.loadblobfromfile (
 37        dest_lob    => l_blob,
 38        src_bfile   => l_bfile,
 39        amount      => dbms_lob.lobmaxsize,
 40        dest_offset => l_target_offset,
 41        src_offset  => l_source_offset);
 42      dbms_lob.fileclose(l_bfile);
 43      commit;
 44
 45      l_ts := localtimestamp;
 46
 47      for j in (
 48        select col001,col002,col003,col004,col005,col006,col007,col008,col009,col010,
 49               col011,col012,col013,col014,col015,col016,col017,col018,col019,col020,
 50               col021,col022,col023,col024,col025,col026
 51        from mycsvfile f,
 52           table( apex_data_parser.parse(
 53                    p_content          => f.blob,
 54                    p_csv_col_delimiter => ',',
 55                    p_skip_rows         => 1,
 56                    p_file_name         => l_filename(i))
 57                ) p
 58      )
 59      loop
 60        null;
 61      end loop;
 62
 63      dbms_output.put_line(localtimestamp-l_ts);
 64    end loop;
 65  end;
 66  /
File: data5mb.csv
+000000000 00:00:06.923000000
File: data10mb.csv
+000000000 00:00:07.274000000
File: data20mb.csv
+000000000 00:00:13.320000000
File: data30mb.csv
+000000000 00:00:14.365000000
File: data40mb.csv
+000000000 00:00:12.873000000
File: data50mb.csv
+000000000 00:00:15.313000000
File: data60mb.csv
+000000000 00:11:22.875000000
File: data70mb.csv
+000000000 00:13:31.123000000

PL/SQL procedure successfully completed.

SQL>

This was an interesting result. Notice that the first six files all parsed swiftly. (For the record, the data50mb.csv file contained 280,000+ rows, so 15 seconds is an impressive result!) But parsing the 60MB file blew out to over 11 minutes, more than 40 times slower than the 50MB file.

After a chat with Carsten, the cause was revealed. I mentioned earlier that APEX_DATA_PARSER must be all things to all people. We don’t know in advance what the typical usage will be for any given customer, so we must be able to handle any and all requirements. The slowdown observed here is a consequence of this. In most cases, the fastest way to work with a LOB’s contents in the Oracle database is to reference it as a temporary LOB. Temporary LOBs get all of the advantages of the usual temporary data-processing structures in the database, keeping data in session PGA until it exceeds allowable limits and is staged out to temporary tablespace.

Hence the best way for APEX_DATA_PARSER to work efficiently is for it to copy your LOB to a temporary LOB and then work with that. But a LOB inside the Oracle Database can be terabytes(!) in size. You probably won’t be pleased if you see your PGA consumption go through the roof and your temporary tablespace explode when you run the parser against the 2TB LOB you loaded into the database.

Thus APEX_DATA_PARSER makes some sizing decisions on how to process your LOB, according to the following rules:

  • If you pass in a temporary LOB, we’ll go ahead and use that.
  • If you pass in a permanent LOB and it is less than 50MB, we’ll copy it to a temporary LOB and use that, to keep the processing as fast as possible.
  • If you pass in a permanent LOB that is more than 50MB, we’ll err on the side of safety and reference the LOB directly, rather than creating a potentially huge temporary segment.

There are a few other considerations here relating to character sets and the like, but the rules above are sufficient to give you an understanding of what is happening behind the scenes.
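You can inspect the two attributes these rules hinge on (whether a LOB is temporary, and how large it is) yourself with DBMS_LOB. A quick sketch, reusing the MYCSVFILE table from the benchmark:

```sql
declare
  l_lob  blob;
begin
  select blob into l_lob from mycsvfile;
  -- istemporary returns 1 for a temporary LOB, 0 for a permanent one
  dbms_output.put_line('temporary: '||dbms_lob.istemporary(l_lob));
  -- getlength returns the size in bytes; compare it against the 50MB threshold
  dbms_output.put_line('bytes    : '||dbms_lob.getlength(l_lob));
end;
/
```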

This does not mean you must accept the performance cost when dealing with LOBs larger than 50MB. These rules simply mean that the decision to use a large temporary segment now falls to you, the customer, rather than to APEX_DATA_PARSER. Armed with the rules above, you can explicitly choose to create a temporary LOB and pass that to APEX_DATA_PARSER to get the performance benefits.

Here’s the benchmark repeated, with a solution similar to what we provided to the customer. We explicitly create a temporary LOB from our stored LOB, and pass the temporary LOB to APEX_DATA_PARSER.


SQL> create table mycsvfile ( blob blob);

Table created.

SQL>
SQL> set serverout on
SQL> declare
  2    l_filename sys.odcivarchar2list :=
  3         sys.odcivarchar2list(
  4             'data5mb.csv'
  5             ,'data10mb.csv'
  6             ,'data20mb.csv'
  7             ,'data30mb.csv'
  8             ,'data40mb.csv'
  9             ,'data50mb.csv'
 10             ,'data60mb.csv'
 11             ,'data70mb.csv'
 12             );
 13
 14      l_clob           clob;
 15      l_blob           blob;
 16      l_bfile          bfile;
 17      l_target_offset  number := 1;
 18      l_source_offset  number := 1;
 19      l_language_ctx   number := dbms_lob.default_lang_ctx;
 20      l_warning        number;
 21      l_lob_len        number;
 22
 23      l_tmp_blob       blob;
 24
 25      l_ts             timestamp;
 26
 27  begin
 28    dbms_lob.createtemporary(l_tmp_blob, true);
 29
 30    for i in 1 .. l_filename.count
 31    loop
 32      dbms_application_info.set_client_info(localtimestamp||' '||l_filename(i));
 33      dbms_output.put_line('File: '||l_filename(i));
 34
 35      execute immediate 'truncate table mycsvfile';
 36      insert into mycsvfile (blob) values (empty_blob()) returning blob into l_blob;
 37
 38      l_bfile := bfilename('TMP',l_filename(i));
 39      dbms_lob.fileopen(l_bfile, dbms_lob.file_readonly);
 40      dbms_lob.loadblobfromfile (
 41        dest_lob    => l_blob,
 42        src_bfile   => l_bfile,
 43        amount      => dbms_lob.lobmaxsize,
 44        dest_offset => l_target_offset,
 45        src_offset  => l_source_offset);
 46      dbms_lob.fileclose(l_bfile);
 47      commit;
 48
 49      l_ts := localtimestamp;
 50
 51      dbms_lob.copy(
 52        dest_lob  => l_tmp_blob,
 53        src_lob   => l_blob,
 54        amount      => dbms_lob.lobmaxsize);
 55
 56     -- Notice above we have copied our permanent LOB
 57     -- into a temporary lob, and it is that LOB that
 58     -- gets passed into the APEX_DATA_PARSER call below
 59
 60      for j in (
 61        select col001,col002,col003,col004,col005,col006,col007,col008,col009,col010,
 62               col011,col012,col013,col014,col015,col016,col017,col018,col019,col020,
 63               col021,col022,col023,col024,col025,col026
 64        from
 65           table( apex_data_parser.parse(
 66                    p_content          => l_tmp_blob,
 67                    p_csv_col_delimiter => ',',
 68                    p_skip_rows         => 1,
 69                    p_file_name         => l_filename(i))
 70                ) p
 71      )
 72      loop
 73        null;
 74      end loop;
 75
 76      dbms_output.put_line(localtimestamp-l_ts);
 77    end loop;
 78  end;
 79  /
File: data5mb.csv
+000000000 00:00:06.787000000
File: data10mb.csv
+000000000 00:00:07.003000000
File: data20mb.csv
+000000000 00:00:13.086000000
File: data30mb.csv
+000000000 00:00:14.426000000
File: data40mb.csv
+000000000 00:00:12.928000000
File: data50mb.csv
+000000000 00:00:15.205000000
File: data60mb.csv
+000000000 00:00:16.307000000
File: data70mb.csv
+000000000 00:00:17.136000000

PL/SQL procedure successfully completed.

SQL>

Now our large LOBs over 50MB are processed as temporary LOBs, and we get similar performance figures across all of the file sizes. Note that this is not a recommendation to simply do this with every LOB you want to parse with APEX_DATA_PARSER. You need to weigh the cost/benefit decision here: we are potentially consuming more memory and more temporary storage, and there is an execution-time cost to copying the LOB data.
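If you do take this route, remember that a temporary LOB lives in your session until you release it. The benchmark above skips that step for brevity; a minimal lifecycle sketch (error handling omitted, and the parse call reduced to a placeholder comment) looks like this:

```sql
declare
  l_src  blob;
  l_tmp  blob;
begin
  select blob into l_src from mycsvfile;

  -- copy the permanent LOB into a session temporary LOB
  dbms_lob.createtemporary(l_tmp, true);
  dbms_lob.copy(
    dest_lob => l_tmp,
    src_lob  => l_src,
    amount   => dbms_lob.getlength(l_src));

  -- ... pass l_tmp to apex_data_parser.parse here ...

  -- release the PGA / temporary tablespace the copy consumed
  dbms_lob.freetemporary(l_tmp);
end;
/
```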

You might be thinking you can skip the copy and just SELECT your LOB into the created temporary LOB, but remember that when you query a LOB, you really just retrieve a locator to the LOB, not the LOB itself. For example, here’s the same benchmark using SELECT-INTO:


SQL> create table mycsvfile ( blob blob);

Table created.

SQL>
SQL> set serverout on
SQL> declare
  2    l_filename sys.odcivarchar2list :=
  3         sys.odcivarchar2list(
  4             'data5mb.csv'
  5             ,'data10mb.csv'
  6             ,'data20mb.csv'
  7             ,'data30mb.csv'
  8             ,'data40mb.csv'
  9             ,'data50mb.csv'
 10             ,'data60mb.csv'
 11             ,'data70mb.csv'
 12             );
 13
 14      l_clob           clob;
 15      l_blob           blob;
 16      l_bfile          bfile;
 17      l_target_offset  number := 1;
 18      l_source_offset  number := 1;
 19      l_language_ctx   number := dbms_lob.default_lang_ctx;
 20      l_warning        number;
 21      l_lob_len        number;
 22
 23      l_tmp_blob       blob;
 24
 25      l_ts             timestamp;
 26
 27  begin
 28    dbms_lob.createtemporary(l_tmp_blob, true);
 29
 30    for i in 1 .. l_filename.count
 31    loop
 32      dbms_application_info.set_client_info(localtimestamp||' '||l_filename(i));
 33      dbms_output.put_line('File: '||l_filename(i));
 34
 35      execute immediate 'truncate table mycsvfile';
 36      insert into mycsvfile (blob) values (empty_blob()) returning blob into l_blob;
 37
 38      l_bfile := bfilename('TMP',l_filename(i));
 39      dbms_lob.fileopen(l_bfile, dbms_lob.file_readonly);
 40      dbms_lob.loadblobfromfile (
 41        dest_lob    => l_blob,
 42        src_bfile   => l_bfile,
 43        amount      => dbms_lob.lobmaxsize,
 44        dest_offset => l_target_offset,
 45        src_offset  => l_source_offset);
 46      dbms_lob.fileclose(l_bfile);
 47      commit;
 48
 49      l_ts := localtimestamp;
 50
 51      select blob
 52      into   l_tmp_blob
 53      from   mycsvfile;
 54
 55      for j in (
 56        select col001,col002,col003,col004,col005,col006,col007,col008,col009,col010,
 57               col011,col012,col013,col014,col015,col016,col017,col018,col019,col020,
 58               col021,col022,col023,col024,col025,col026
 59        from
 60           table( apex_data_parser.parse(
 61                    p_content          => l_tmp_blob,
 62                    p_csv_col_delimiter => ',',
 63                    p_skip_rows         => 1,
 64                    p_file_name         => l_filename(i))
 65                ) p
 66      )
 67      loop
 68        null;
 69      end loop;
 70
 71      dbms_output.put_line(localtimestamp-l_ts);
 72    end loop;
 73  end;
 74  /
File: data5mb.csv
+000000000 00:00:06.901000000
File: data10mb.csv
+000000000 00:00:07.134000000
File: data20mb.csv
+000000000 00:00:13.053000000
File: data30mb.csv
+000000000 00:00:14.474000000
File: data40mb.csv
+000000000 00:00:14.016000000
File: data50mb.csv
+000000000 00:00:15.888000000
File: data60mb.csv
+000000000 00:11:55.209000000
File: data70mb.csv
+000000000 00:12:09.676000000

PL/SQL procedure successfully completed.

SQL>
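You can confirm why the SELECT-INTO version falls back to the slow path: the fetch assigns a locator for the permanent LOB to the variable, so the temporary LOB we created beforehand is simply no longer what the variable points at. A quick check (a sketch, again using MYCSVFILE):

```sql
declare
  l_tmp blob;
begin
  dbms_lob.createtemporary(l_tmp, true);
  -- at this point istemporary reports 1: l_tmp is a temporary LOB
  dbms_output.put_line('before select: '||dbms_lob.istemporary(l_tmp));
  select blob into l_tmp from mycsvfile;
  -- now istemporary reports 0: l_tmp is a locator to the permanent LOB,
  -- which is exactly what triggers the "reference directly" rule
  dbms_output.put_line('after select : '||dbms_lob.istemporary(l_tmp));
end;
/
```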

Now you can make informed decisions about the best usage model for APEX_DATA_PARSER for your business requirements. In all current releases of APEX, the threshold is 50MB, and this is noted in the APEX release documentation.
