From file names to directory hierarchy

May 14, 2018

3–5 minutes

I had a fun request come in from a colleague the other day. They had a simple list of fully qualified file names and they needed to present that data in the familiar hierarchical tree layout.

To demonstrate, I took a little trip down memory lane Smile and grabbed a subset of presentations I’ve done over the years.


SQL> create table t ( fname varchar2(1000));

Table created.

SQL>
SQL> insert into t values ('C:\Users\Connor\Presentations\2002\scene_200205.pdf');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\scene_200205a.pdf');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\254old.ppt');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\254_full.ppt');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2002\257.ppt');

1 row created.
...
...
...

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\london\optimizer.pptx');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\london\optimizer_full.pptx');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\odtug\common_disasters_short.pptx');

1 row created.

SQL> insert into t values ('C:\Users\Connor\Presentations\2018\odtug\common_disasters_short_comp.pptx');

1 row created.

SQL> select count(*) from t;

  COUNT(*)
----------
       634

1 row selected.

So the challenge was to present this list of files in a nice tree format which mirrored my directory structure.

(Truth be told, my preference would be that the query would do a better job than I do in trying to organize my files, but that is a blog post for another day )

Step 1 was to use a little LATERAL trickery to parse out all of the elements of the file names into one row per “element”, where “element” an unqualified file name, or single level folder name.


SQL> select id, seq, token
  2  from
  3    ( select rownum id, fname from t ) t_base,
  4    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
  5            from dual
  6            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  7            );

        ID        SEQ TOKEN
---------- ---------- ------------------------------------------------------------------------------------
         1          1 C:
         1          2 Users
         1          3 Connor
         1          4 Presentations
         1          5 2002
         1          6 scene_200205.pdf
         2          1 C:
         2          2 Users
         2          3 Connor
         2          4 Presentations
         2          5 2002
         2          6 scene_200205a.pdf
         3          1 C:
         3          2 Users
         3          3 Connor
         3          4 Presentations
         3          5 2002
         3          6 254old.ppt
         4          1 C:
         4          2 Users
         4          3 Connor
         4          4 Presentations
         4          5 2002
         4          6 254_full.ppt
         ...
         ...

We can use that as input to a standard LAG function to associate each file/folder with its parent, using the SEQ column to provide sequencing within each ID


SQL> with data as
  2  (
  3  select id, seq, token
  4  from
  5    ( select rownum id, fname from t ) t_base,
  6    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
  7            from dual
  8            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  9            )
 10  )
 11  select
 12      id,
 13      seq,
 14      token,
 15      lag(token) over ( partition by id order by seq ) as parent
 16  from data
 17  order by id, seq desc;

        ID        SEQ TOKEN                                              PARENT
---------- ---------- -------------------------------------------------- -----------------------------------
         1          6 scene_200205.pdf                                   2002
         1          5 2002                                               Presentations
         1          4 Presentations                                      Connor
         1          3 Connor                                             Users
         1          2 Users                                              C:
         1          1 C:
         2          6 scene_200205a.pdf                                  2002
         2          5 2002                                               Presentations
         2          4 Presentations                                      Connor
         2          3 Connor                                             Users
         2          2 Users                                              C:
         2          1 C:
         3          6 254old.ppt                                         2002
         3          5 2002                                               Presentations
         3          4 Presentations                                      Connor
         3          3 Connor                                             Users
         3          2 Users                                              C:
        ...
        ...

That’s looking good, but we have duplicates in the sense that multiple files will roll up to a single folder (eg the “C:” root is repeated).. So we DISTINCT it out to leave just the relationships we need


SQL> with data as
  2  (
  3  select id, seq, token
  4  from
  5    ( select rownum id, fname from t ) t_base,
  6    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') token
  7            from dual
  8            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  9            )
 10  )
 11  select
 12    distinct
 13      token,
 14      lag(token) over ( partition by id order by seq ) as parent
 15  from data
 16  /

TOKEN                                              PARENT
-------------------------------------------------- --------------------------------------------------
C:
254old.ppt                                         2002
257.ppt                                            2002
plus2.ppt                                          2002
254.ppt                                            UKOUG2002_final
error_handling.pdf                                 2003
2004                                               Presentations
mcdonald1.pdf                                      2004
sun_storage_brief.ppt                              sun_storage_presn
ausoug05_memory_x.ppt                              2005
jul05_read_consistency.ppt                         2005
saoug_oracle_home.ppt                              2005
plsql_80_a58236.pdf                                sql_plsql_evo
jun07_tas_moving_data_short.ppt                    misc
ausoug_adel                                        2008
successful_development.pdf                         ausoug_adel
apr08_11g_developers.ppt                           ausoug_apr
apr08_11g_developers_print.ppt                     ausoug_apr
mcdonald_oct08_flashback.pdf                       perth
oct08_11g_developers.ppt                           template
nov08_read_consistency.pdf                         rwwa
perf101b_rwwa.pptx                                 rwwa
sqlectric_melb                                     2009
sql_electric2.pdf                                  sqlectric_melb
sql_electric1.pptx                                 sqlectric_melb
ukoug                                              2009
...
...
...

And we’re pretty much done. At this point, we have the raw data we need in a child/parent form that we could use to traverse with a standard hierarchy CONNECT BY query – I’ll use LPAD to show the hierarchy relationships


SQL> with data as
  2  (
  3  select id, seq, token
  4  from
  5    ( select rownum id, fname from t ) t_base,
  6    lateral(select level seq, rtrim(regexp_substr(t_base.fname || '\', '(.*?)\\', 1, level),'\') toke
  7            from dual
  8            connect by level <= 1+length(fname)-length(replace(fname,'\'))
  9            )
 10  ),
 11  hier as(
 12    select
 13      distinct
 14        token,
 15        lag(token) over ( partition by id order by seq ) as parent
 16    from data
 17  )
 18  select lpad(' ',level*2)||token hier_list
 19  from hier
 20  start with parent is null
 21  connect by prior token = parent
 22  order siblings by token;

HIER_LIST
--------------------------------------------------------------------------------------------------------
  C:
    Users
      Connor
        Presentations
          2002
            254_full.ppt
            254old.ppt
            257.ppt
            UKOUG2002_final
              254.ppt
              257.ppt
              prelim.ppt
            dec_9i_features2.ppt
            dec_legends.ppt
            dec_legends_prelim.ppt
            jul_sig.ppt
            jul_sig_printout.ppt
            june_sig.ppt
            logo.ppt
            march_sig.ppt
            march_sig2.ppt
            plaus.ppt
            plus2.ppt
            scene_200205.pdf
            scene_200205a.pdf
            sep_unix_sig.ppt
            sep_unix_sig2.ppt
          2003
            254.ppt
            abm_succesful_development.ppt
            apr03_perth.ppt
            dec03_perth_9i_nf.ppt
            dec03_perth_9i_nf_popquiz.pdf
            dec03_perth_9i_nf_popquiz.ppt
            error_handling.pdf
            error_handling.ppt
            ims_performance_examples.pdf
            ims_performance_examples.ppt
            sep03_perth_9i_hidden.pdf
            sep03_perth_9i_hidden.ppt
            sep03_perth_9i_nf.ppt
            sep03_perth_legends.pdf
            sep03_perth_legends.ppt
            sep03_perth_prelim.ppt
            sf_oow_03.pdf
            sf_oow_03.ppt
            sf_oow_03a.ppt
            slide.ppt
            succesful_development.pdf
            succesful_development.ppt
            tools_for_performance.pdf
            tools_for_performance.ppt
          2004
            10046_scene.pdf
            bind_var.pdf
            bind_var.ppt
            dec03_perth_book.ppt
            dec03_perth_book2.ppt
            generating_test_data.pdf
            generating_test_data.ppt
            mar04_hotsos_legends.ppt
            mar04_hotsos_nf_w2k.ppt
            mar04_hotsos_nf_w2k_edit.ppt
            mcdonald1.pdf
            mcdonald2.pdf
            nov04_reorgs_rebuild.ppt
            nov04_succesful_development.ppt
            
            ...
            ...
            ...
            
          2018
            apex_connect
              apex_connect_sql_plsql.pdf
              apex_connect_sql_plsql.pptx
              apex_connect_sql_plsql_v2.pptm
              apex_connect_sql_plsql_v2.pptx
              apex_connect_sql_plsql_v3.pptm
              apex_connect_sql_plsql_v3_subset.pptm
              extensions1.pptx
              extensions2.pptx
            code_china
              better_sql_oracle_code.pptx
              better_sql_oracle_code_v2.pdf
              better_sql_oracle_code_v2.pptx
            code_hyderabad
              better_sql_oracle_code.pptx
            hong_kong
              hong_kong_122.pdf
              hong_kong_122.pptx
              hong_kong_122_no_multi.pptx
              hong_kong_partitioning_ora_template.pptx
              hong_kong_partitioning_ora_template2.pdf
              hong_kong_partitioning_ora_template2.pptx
            london
              london.pdf
              london.pptx
              optimizer.pdf
              optimizer.pptx
              optimizer_full.pptx
            odtug
              common_disasters_short.pptx
              common_disasters_short_comp.pptx

And there we have it. From flat file listing to a nice hierarchical layout just by tackling the problem piece by piece using the trusty WITH clause.

5 responses to “From file names to directory hierarchy”

Boneist

May 15, 2018 at 3:48 pm

You can avoid the use of the lateral join by tweaking the connect by, plus the additional connect by can be avoided by using analytic functions instead, e.g.:

WITH main_results AS (SELECT fname,
                             LEVEL lvl,
                             regexp_substr(fname, '[^\]+', 1, LEVEL, NULL) token,
                             MIN(LEVEL) OVER (PARTITION BY regexp_substr(fname, '[^\]+', 1, LEVEL, NULL)) min_lvl_per_token,
                             MIN(fname) OVER (PARTITION BY regexp_substr(fname, '[^\]+', 1, LEVEL, NULL)) min_fname_per_token,
                             lpad(' ', level*2) || regexp_substr(fname, '[^\]+', 1, LEVEL, NULL) hier_list
                      FROM   t
                      CONNECT BY PRIOR fname = fname
                                 AND PRIOR sys_guid() IS NOT NULL
                                 AND regexp_substr(fname, '[^\]+', 1, LEVEL) IS NOT NULL)
SELECT hier_list
FROM   main_results
WHERE  fname = min_fname_per_token
AND    lvl = min_lvl_per_token
ORDER BY fname, lvl;

Boneist

May 15, 2018 at 4:00 pm

Whether that is faster than your query is debateable though! *{:-)

Boneist

May 15, 2018 at 11:27 pm

Following Juri’s comment (well spotted!!), here’s a corrected version of my query above:

WITH main_results AS
  (SELECT fname,
          lvl,
          token,
          sub_path,
          MIN(lvl) OVER (PARTITION BY token, sub_path) min_lvl_per_token,
          MIN(fname) OVER (PARTITION BY token, sub_path) min_fname_per_token,
          lpad(' ', lvl*2) || token hier_list
   FROM   (SELECT fname,
                  LEVEL lvl,
                  regexp_substr(fname, '[^\]+', 1, LEVEL, NULL) token,
                  regexp_substr(fname, '([^\]+\\){'||(LEVEL -1)||'}', 1, 1, NULL) sub_path
           FROM   t
           CONNECT BY PRIOR fname = fname
                      AND PRIOR sys_guid() IS NOT NULL
                      AND regexp_substr(fname, '[^\]+', 1, LEVEL) IS NOT NULL))
SELECT hier_list
FROM   main_results
WHERE  fname = min_fname_per_token
AND    lvl = min_lvl_per_token
ORDER BY fname, lvl;

Juri Wolfenhaut

May 15, 2018 at 10:30 pm

Dear Connor,

many thanks for this example of using LATERAL.

When trying to rewrite your original query for use in Oracle 11g (without LATERAL) I discovered that it delivers wrong results if subfolders with the same name (like PROD and TEST in the test case below) appear in different places in the directory hierarchy.

So here the test case and corrected query:

create table t ( fname varchar2(1000));

insert into t (fname) values (‘C:\readme.txt’);

insert into t (fname) values (‘C:\PROD\prod_readme.txt’);
insert into t (fname) values (‘C:\TEST\test_readme.txt’);

insert into t (fname) values (‘C:\Data\generic_data_1.xls’);
insert into t (fname) values (‘C:\Data\generic_data_2.xls’);
insert into t (fname) values (‘C:\Data\PROD\prod_data_1.xls’);
insert into t (fname) values (‘C:\Data\PROD\prod_data_2.xls’);
insert into t (fname) values (‘C:\Data\TEST\test_data_1.xls’);
insert into t (fname) values (‘C:\Data\TEST\test_data_2.xls’);

insert into t (fname) values (‘C:\Docs\generic_doc_1.txt’);
insert into t (fname) values (‘C:\Docs\generic_doc_2.txt’);
insert into t (fname) values (‘C:\Docs\PROD\prod_doc_1.txt’);
insert into t (fname) values (‘C:\Docs\PROD\prod_doc_2.txt’);
insert into t (fname) values (‘C:\Docs\TEST\test_doc_1.txt’);
insert into t (fname) values (‘C:\Docs\TEST\test_doc_2.txt’);

insert into t (fname) values (‘C:\Programs\generic_prog_1.exe’);
insert into t (fname) values (‘C:\Programs\generic_prog_2.exe’);
insert into t (fname) values (‘C:\Programs\PROD\prod_prog_1.exe’);
insert into t (fname) values (‘C:\Programs\PROD\prod_prog_2.exe’);
insert into t (fname) values (‘C:\Programs\TEST\test_prog_1.exe’);
insert into t (fname) values (‘C:\Programs\TEST\test_prog_2.exe’);

commit;

with data as
(
select id, seq, token, parent_folder
from
( select rownum id, fname from t ) t_base,
lateral(select level seq, rtrim(regexp_substr(t_base.fname || ‘\’, ‘(.*?)\\’, 1, level),’\’) token
,case
when level > 1
then
substr(t_base.fname, 1, instr(t_base.fname, ‘\’, 1, level – 1))
end as parent_folder
from dual
connect by level <= 1+length(fname)-length(replace(fname,'\'))
)
),
hier as(
select
distinct
token,
lag(token) over ( partition by id order by seq ) as parent,
parent_folder
from data
)
select lpad(' ',level*2)||token hier_list
from hier
start with parent is null
connect by prior token = parent
and prior parent_folder||parent||'\' = parent_folder
order siblings by token;

Best regards,
Juri

Reply
Iudith Mentzel

May 16, 2018 at 3:22 am

Hello All,

Just for fun, here is another variant, using data from the post of Juri, above:

WITH data (fname, parent, child, lvl)
AS
(
SELECT fname,
REGEXP_SUBSTR(fname,'(.*?)\\([^\]*$)’,1,1,NULL,1) parent,
REGEXP_SUBSTR(fname,'(.*?)\\([^\]*$)’,1,1,NULL,2) child,
1
FROM t
UNION ALL
SELECT fname,
REGEXP_SUBSTR(parent,'(.*?)\\([^\]*$)’,1,1,NULL,1) parent,
REGEXP_SUBSTR(parent,'(.*?)\\([^\]*$)’,1,1,NULL,2) child,
lvl + 1
FROM data
WHERE INSTR(parent,’\’) > 0
),
hier
AS
(
SELECT DISTINCT parent, null child
FROM data
START WITH INSTR(parent,’\’) = 0
CONNECT BY parent = PRIOR parent || ‘\’ || PRIOR child
UNION ALL
SELECT parent, child
FROM data
WHERE lvl = 1
)
SELECT
CASE
WHEN child IS NULL THEN LPAD(‘.’, 2 * REGEXP_COUNT(parent,’\\’),’.’) || PARENT
ELSE LPAD(‘.’, 2 * (REGEXP_COUNT(parent,’\\’) + 1),’.’) || CHILD
END files_hierarchy
FROM HIER
ORDER BY parent, child NULLS FIRST
/

FILES_HIERARCHY
—————————-
C:
..readme.txt
..C:\Data
….generic_data_1.xls
….generic_data_2.xls
….C:\Data\PROD
……prod_data_1.xls
……prod_data_2.xls
….C:\Data\TEST
……test_data_1.xls
……test_data_2.xls
..C:\Docs
….generic_doc_1.txt
….generic_doc_2.txt
….C:\Docs\PROD
……prod_doc_1.txt
……prod_doc_2.txt
….C:\Docs\TEST
……test_doc_1.txt
……test_doc_2.txt
..C:\PROD
….prod_readme.txt
..C:\Programs
….generic_prog_1.exe
….generic_prog_2.exe
….C:\Programs\PROD
……prod_prog_1.exe
……prod_prog_2.exe
….C:\Programs\TEST
……test_prog_1.exe
……test_prog_2.exe
..C:\TEST
….test_readme.txt

33 rows selected.

Thanks a lot & Best Regards,
Iudith Mentzel

Reply

Learning is not a spectator sport