Learning is not a spectator sport

Mr DISTINCT might not be your friend

August 15, 2015

1–2 minutes

Whenever you have the need to use the DISTINCT keyword, its worth just pausing for a second, and making sure that you are not hiding just a larger issue. It actually might represent either incorrect use of SQL or incorrect assumptions from the data model.

Consider the following example

SELECT DISTINCT d.dname
FROM   emp e, dept d
WHERE  e.ename = 'SMITH'
AND    e.deptno = d.deptno

The query is certainly valid, but when I see “distinct” I ask myself the following questions:

Has the DISTINCT has been added in an attempt to only return a single row ?, ie, is someone working under the assumption being that an employee name can only refer to a single department ? Unless there is a unique constraint on the ENAME column, then we can still just as easily get multiple rows back (even with the DISTINCT), so the SQL will be a “sleeping problem” in the application until the data causes it to fail.

Because the DISTINCT keyword here:

doesn’t give you any guarantee that you’ll only get one row, and
might be making the database do excessive work (get all rows, sort them and remove duplicates), every single time you run this SQL

Alternatively, perhaps there is meant to be a single department for an ENAME, ie, ENAME is unique. If that is the case, then we should change the database model/design to reflect it and not use DISTINCT to “fake” it.

I’m not saying DISTINCT is always a bug – but it’s always worth just asking yourself if you were correct to be using it.

5 responses to “Mr DISTINCT might not be your friend”

Scott

September 10, 2015 at 9:04 pm

All good, I just use UNIQUE 😉

Loading…

Reply
1. Jeffrey Kemp
  
  September 11, 2015 at 7:47 am
  
  Amateur – I fetch all the rows and de-dup on the client with a home-grown hashmap like a real man 🙂
  
  Loading…
  
  Reply
Toon Koppelaars

September 7, 2016 at 6:48 pm

I’ll use an EXISTS with correlated subquery to EMP, add UNIQUE constraint on EMP(DEPTNO,ENAME), and then get rid of the DISTINCT….

😉

Loading…

Reply
connormcdonald

September 7, 2016 at 6:51 pm

Or shuffle it all into Hadoop in the cloud and clear out the duplicates there 🙂

Loading…

Reply
GROUP BY might be distinctly better than DISTINCT – Learning is not a spectator sport

February 11, 2020 at 11:41 am

[…] while back I posted about how sometimes the appearance of the DISTINCT keyword in a SQL statement is a red flag to […]

Loading…

Reply