A SQL conundrum

May 20, 2024

5–7 minutes

I had an AskTOM question recently about finding occurrences of a set of words within sentences stored in a table.

For example, if you have two sentences,

The quick brown fox jumps over the lazy dog
Never trust a fox. Looks like a dog, behaves like a cat.

and I were given the phrase “fox,dog,cat” then I need to report that the 2 of the terms occur in the first sentence, and 3 of the terms occur in the second.

To test this on a slightly larger set of rows than just 2, I extracted the plain text from the 23ai SQL Language reference, preserving the new lines, and loaded this into a database table called T. For example, some random rows.


SQL> select id, txt
  2  from t
  3  where rownum <= 20
  4  and txt like '%match%';

        ID TXT
---------- --------------------------------------------------------------------------------------------------
       197 [e is friends] is an edge pattern that matches either incoming or outgoing
       268 ^ matches the position before the first row in the partition. this
       538 a mismatch domain_check_type raises an error.
       603 a path pattern specifies a linear pattern that matches a string of vertices and edges.
       611 a percent sign % in the pattern can match zero or more characters as opposed to
       612 a period . does not match the newline character.
       388 a dash  as an alternative to schema_name indicates that if match_string matches
       431 a foreign key constraint requires values in one table to match values in another table.
       450 a graph table shape defines how the result of pattern matching should be transformed into
       473 a is person is a vertex pattern that matches vertices labeled person and binds
       474 a java name, or a wildcard that can match any name.
      1490 addition, the current cdb name should match the target cdb name specified in
       704 a row pattern match consists of a set of contiguous rows in a row pattern partition. each row
       723 a semijoin returns rows that match an exists subquery without duplicating rows from
       894 a vertex pattern is a pattern that matches vertices in a graph. the result of such matching is
      1037 accept a sequence of multiple values matched by the rhs path expression. note
      1039 accept for a match. in this table, n and m represent unsigned integers.
      1820 aliases must match the number of expressions selected by the view. aliases must follow the
      1893 all of its graph element tables are defined with a primary key that matches an
      1660 after match skip to last up

20 rows selected.

Using a technique I saw on Stew Ashton’s excellent SQL blog, it’s easy to transform the list of search words into rows,


SQL> with words as
  2      ( select 'create,database,privilege,select,drop,view,table' terms
  3        from dual
  4      )
  5      select term from words w,
  6      json_table(
  7        replace(json_array(w.terms), ',', '","'),
  8       '$[*]' columns (
  9          term varchar2(50) path '$'
 10   )
 11   );

TERM
--------------------------------------------------
create
database
privilege
select
drop
view
table

and then of course the task becomes straightforward. To keep the SQL easy to read I’ve loaded the search terms into their own table TERMS using the technique above. I’ve also loaded both the SQL language text as well as the search terms in lower case to not bother with mixed case and the like. I’ve also not bothered with the potential intricacies of what defines a “word” in a word search, for example, Does it need to be a whole word or partial word match? What about plurals? etc etc. This is just a very simple “Does a sequence of characters exist in the larger set?” task.

With that in mind, one possible SQL to achieve this is below. To keep the output at a reasonable size, I’ve limited it to only those rows that contained 4 or more of the 7 terms.


SQL> select id,  listagg(val,',') within group ( order by val)  result, count(*) hits, txt
  2  from terms,
  3       t
  4  where instr(t.txt,terms.val) > 0
  5  group by id, txt
  6  having count(*) > 3
  7  order by 1;

        ID RESULT                                  HITS TXT
---------- --------------------------------- ---------- ------------------------------------------------------------
      1319 create,drop,privilege,table                4 add privileges create any table, drop any table;
      3938 create,privilege,select,table              4 any table, create any sequence, and select any sequence syst
                                                        em privileges.

      8058 create,database,table,view                 4 clause, then oracle database creates the materialized view i
                                                        n the default tablespace

     10604 create,database,privilege,table            4 create an index on the table. you need these privileges beca
                                                        use oracle database creates an

     10980 create,privilege,select,view               4 create materialized view system privilege, and the read or s
                                                        elect object privilege on

     11735 create,select,table,view                   4 created. a query that selects rows from two or more tables,
                                                        views, or materialized

     12700 create,database,table,view                 4 database creates the materialized view log in the schema of
                                                        its master table. you cannot

     12728 create,database,drop,table                 4 database drops the nested table segments of the current part
                                                        ition and creates new

     13014 database,drop,select,table                 4 database, drop tablespace, resumable, select any
     15527 create,drop,privilege,table                4 drop privileges create any table
     15528 create,drop,privilege,table                4 drop privileges create any table;
     22941 database,drop,table,view                   4 if the table has a materialized view log, then the database
                                                        drops this log and any

     22950 database,drop,table,view                   4 if the tablespace contains a materialized view log, then the
                                                         database drops the log and

     23150 create,database,table,view                 4 if you create views that refer to remote tables and views, t
                                                        hen the database links you specify

     23216 create,database,drop,table,view            5 if you drop a materialized view that was created on a prebui
                                                        lt table, then the database

     24223 create,select,table,view                   4 in order to create a materialized view whose defining query
                                                        selects from a master table

     33695 drop,privilege,select,table                4 on the table and the select privilege on the sequence. if th
                                                        e sequence is later dropped, then

     37829 database,select,table,view                 4 pl/sql object name. you can query a table or view in the oth
                                                        er database with the select

     38528 create,drop,privilege,table                4 privileges create any table and drop any table:
     38529 create,drop,privilege,table                4 privileges create any table, drop any table
     38530 create,drop,privilege,table                4 privileges create any table, drop any table;
     43999 privilege,select,table,view                4 select read or select privilege, insert, update, or delete r
                                                        ows from all the tables or views on

     49214 create,database,table,view                 4 sysdate. a materialized view log was created for the employe
                                                        e table, so oracle database

     49281 create,drop,privilege,table                4 system privileges create any table and drop any table in the
                                                         current container:

     52489 create,drop,privilege,table                4 the following statement adds the system privileges create an
                                                        y table and drop any table to

     52796 create,drop,privilege,table                4 the following statement drops the system privilege create an
                                                        y table from unified

     54243 create,privilege,table,view                4 the owner of the materialized view must have the create tabl
                                                        e system privilege. the

     57316 drop,privilege,table,view                  4 to drop a materialized view log, you must have the privilege
                                                        s needed to drop a table.

     57413 create,privilege,select,view               4 to grant select and update privileges on the view emp_view,
                                                        which was created in creating

     61995 drop,privilege,table,view                  4 view system privilege. you must also have the privileges to
                                                        drop the internal table and

     62596 create,database,table,view                 4 when you create a materialized view, oracle database creates
                                                         one internal table and

31 rows selected.

But here’s a question for the curious out there. What other SQL techniques could be employed here to solve this problem? I stress, this isn’t me withholding a different clever solution from you, which I’ll subsequently spring upon you later. This is just a genuine curiosity to see how other minds might tackle the problem, and not specifically with a mindset of “better” or “faster”, just “different”.

You can download the population scripts for both T and TERMS from my github repo here.

I hope to see some cool ideas in the comments.

7 responses to “A SQL conundrum”

iudithd5bf8e4d8d

May 20, 2024 at 6:52 pm

Hi Connor,
The only quick idea that comes to my mind is to maybe create an Oracle Text index on table T and then some more “fancy” queries
could be performed, using “various combinations” of the search terms.

Unfortunately, in real life, we never got experienced enough with the depth and richness of the Oracle Text features,
we employed them extremely rarely, though, each such solution proved itself very useful and I always remained with a sense of
“wish for more”, when I will just find time to study the entire documentation from cover to cover …

So, if such solutions do exist in your thoughts, I would be very glad if you could enrich our knowledge in this direction.

Cheers & Best Regards,
Iudith Mentzel

Reply
1. Connor McDonald
  
  May 20, 2024 at 8:47 pm
  
  We could something like
  
  create index t_ix on t ( txt) indextype is ctxsys.context;
  
  and then build a Text phrase, eg
  
  SQL> with
  2 phrase as
  3 ( select ‘create,database,privilege,select,drop,view,table’ x1 from dual ),
  4 ctx_phrase as
  5 ( select ‘{‘||replace(x1,’,’,’} or {‘)||’}’ x2 from phrase )
  6 select * from ctx_phrase; X2
  
  {create} or {database} or {privilege} or {select} or {drop} or {view} or {table}
  
  which would then let you do the following to shrink the candidate rows:
  
  with
  phrase as
  ( select ‘create,database,privilege,select,drop,view,table’ x1 from dual ),
  ctx_phrase as
  ( select ‘{‘||replace(x1,’,’,’} or {‘)||’}’ x2 from phrase )
  select *
  from t, ctx_phrase
  where contains(txt, x2, 99) > 0;
  
  but you would still need to then extract the hit count, probably as per my original solution or with some inline plsql
  
  Reply
phsalvisberg

May 20, 2024 at 8:19 pm

Here’s a variant using MLE: https://gist.github.com/PhilippSalvisberg/e0564e763d45ecf1c42d436f92d4f1a4#file-solution_alternative_1-sql-L66-L79

Reply
Mark Russell Brown

May 20, 2024 at 9:14 pm

This is not exactly what you wanted but shows how the APEX API can answer tricky questions with a minimum of complexity.

The source of the DBMS_STATS package is one of the largest.

DECLARE
l_phrases apex_t_varchar2 := apex_t_varchar2();
l_arr apex_t_varchar2 := apex_t_varchar2();
BEGIN
apex_string.push(l_phrases,’instead’);
apex_string.push(l_phrases,’executed’);
for c in (select text from all_source where name=’DBMS_STATS’ and type=’PACKAGE’) loop
l_arr := apex_string_util.find_phrases(l_phrases,C.text);
if (l_arr.count>0) then
for i in l_arr.first..l_arr.last
loop
dbms_output.put_line(l_arr(i)||’:’||regexp_count(C.text,l_arr(i),1,’i’)||’ – ‘||c.text);
end loop;
end if;
end loop;
END;
/

Reply
Russ

May 20, 2024 at 9:55 pm

My goto approach for such things is Regular Expressions, and I like Oracle’s implementation of the POSIX standard with a bit of PERL syntax thrown in.

So, I think the only thing I would have done different is to use RegEx pattern matching, but that is probably not any more clever than what you have done.

Cheers,
Russ

Reply
Jim Dickson

May 21, 2024 at 1:42 pm

Not an whole solution and uses some non-relational concepts

— split needs to cater for more separators
— result is object type / collection

with v_terms as ( select cast(collect(val) as apex_t_varchar2) as val_arr from terms )
, v_t as ( select id, txt, apex_string.split(txt,’ ‘) as txt_arr from t )
, v_cj as ( select v_t.id, v_t.txt, ( v_terms.val_arr multiset intersect v_t.txt_arr ) as result from v_terms cross join v_t )
select v_cj.id, v_cj.result, — , cast(v_cj.result as varchar2),
cardinality(v_cj.result) as hits, v_cj.txt from v_cj
where cardinality(v_cj.result) > 0
order by 1 ;

Results are not the same because I did not load all data in T.

Jim

Reply
Jim Dickson

May 21, 2024 at 1:58 pm

Small changes, same idea

— split needs to cater for more separators
— contains duplicates (because of table function) => distinct/group by

with v_terms as ( select cast(collect(val) as apex_t_varchar2) as val_arr from terms )
, v_t as ( select id, txt, apex_string.split(txt,’ ‘) as txt_arr from t )
, v_cj as ( select v_t.id, v_t.txt, ( v_terms.val_arr multiset intersect v_t.txt_arr ) as result from v_terms cross join v_t )
select DISTINCT
v_cj.id, listagg(r_t.column_value,’,’) within group (order by r_t.column_value) over (partition by v_cj.id) as result, cardinality(v_cj.result) as hits, v_cj.txt
from v_cj, table(v_cj.result) r_t
where cardinality(v_cj.result) > 1
order by 1 ;

Reply