I recently posted a video where I explain the difference between UNION and UNION ALL with three aims. It should:

  • be educational
  • be fun
  • be an excuse to get stuck into some candy 🤣

You can watch it below if you haven’t seen it.

Someone from the community reached out to me with the following comment:

“Isn’t this comparison flawed for the union? If bag A is Unioned to Bag B, you get all of Bag A plus the non duplicates from bag B, don’t you? Not just the unique values from each set.”

 

The easiest way to assess the correctness of this statement is with a simple test:


SQL> create table bag1 ( color varchar2(10));

Table created.

SQL> create table bag2 ( color varchar2(10));

Table created.

SQL>
SQL> insert into bag1
  2  select
  3    case trunc(dbms_random.value(0,4.99999))
  4    when 0 then 'red'
  5    when 1 then 'blue'
  6    when 2 then 'green'
  7    when 3 then 'orange'
  8    when 4 then 'yellow'
  9    end
 10  from dual
 11  connect by level <= 30;

30 rows created.

SQL>
SQL> insert into bag2
  2  select
  3    case trunc(dbms_random.value(0,6.99999))
  4    when 0 then 'red'
  5    when 1 then 'blue'
  6    when 2 then 'green'
  7    when 3 then 'orange'
  8    when 4 then 'yellow'
  9    when 5 then 'brown'
 10    when 6 then 'purple'
 11    end
 12  from dual
 13  connect by level <= 30;

30 rows created.


There are 30 rows in BAG1, so if the assertion above is correct, then the result of the UNION will be at least 30 rows.


SQL> select * from bag1
  2  union
  3  select * from bag2;

COLOR
----------
red
yellow
orange
green
blue
brown
purple

7 rows selected.

The execution plan also gives a clue that the removal of duplicates is the very last step performed.


SQL> set autotrace traceonly explain
SQL> select * from bag1
  2  union
  3  select * from bag2;

Execution Plan
----------------------------------------------------------
Plan hash value: 723047669

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |    60 |  1200 |     8  (25)| 00:00:01 |
|   1 |  HASH UNIQUE        |      |    60 |  1200 |     8  (25)| 00:00:01 |
|   2 |   UNION-ALL         |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| BAG1 |    30 |   600 |     3   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| BAG2 |    30 |   600 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------

Obviously tables typically do not have a single column, but the same principle applies. Whatever columns you select will be subject to the same cleansing of duplicates.


SQL> create table bag1 ( color varchar2(10), grams integer);

Table created.

SQL> create table bag2 ( color varchar2(10), grams integer);

Table created.

SQL>
SQL> insert into bag1
  2  select
  3    case trunc(dbms_random.value(0,4.99999))
  4    when 0 then 'red'
  5    when 1 then 'blue'
  6    when 2 then 'green'
  7    when 3 then 'orange'
  8    when 4 then 'yellow'
  9    end, trunc(dbms_random.value(1,4))
 10  from dual
 11  connect by level <= 30;

30 rows created.

SQL>
SQL> insert into bag2
  2  select
  3    case trunc(dbms_random.value(0,4.99999))
  4    when 0 then 'red'
  5    when 1 then 'blue'
  6    when 2 then 'green'
  7    when 3 then 'orange'
  8    when 4 then 'yellow'
  9    end, trunc(dbms_random.value(1,4))
 10  from dual
 11  connect by level <= 30;

30 rows created.

SQL>
SQL> select * from bag1
  2  union
  3  select * from bag2;

COLOR           GRAMS
---------- ----------
red                 3
blue                1
green               3
yellow              3
green               1
yellow              2
yellow              1
orange              3
red                 2
red                 1
blue                3
orange              1
orange              2
green               2
blue                2

15 rows selected.

This is another reason I’m not a huge fan of the common technique of using Venn diagrams to describe SQL functionality. Often a UNION will be described as something similar to below.

and whilst that is conceptually correct, it doesn’t reinforce that adequately enough that the contents of each circle is a set of data, and a set is a collection of distinct items. Thus people intuitively assume that each circle simply represents of the data in their tables which is not the case.

One response to “UNION versus UNION ALL”

  1. The proof can be shortened further – don’t populate bag2. The UNION to an empty table removes dups from bag1

Leave a reply to Duncan & Julie Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Trending