Whenever you have the need to use the DISTINCT keyword, its worth just pausing for a second, and making sure that you are not hiding just a larger issue. It actually might represent either incorrect use of SQL or incorrect assumptions from the data model.
Consider the following example
SELECT DISTINCT d.dname
FROM emp e, dept d
WHERE e.ename = 'SMITH'
AND e.deptno = d.deptno
The query is certainly valid, but when I see “distinct” I ask myself the following questions:
Has the DISTINCT has been added in an attempt to only return a single row ?, ie, is someone working under the assumption being that an employee name can only refer to a single department ? Unless there is a unique constraint on the ENAME column, then we can still just as easily get multiple rows back (even with the DISTINCT), so the SQL will be a “sleeping problem” in the application until the data causes it to fail.
Because the DISTINCT keyword here:
- doesn’t give you any guarantee that you’ll only get one row, and
- might be making the database do excessive work (get all rows, sort them and remove duplicates), every single time you run this SQL
Alternatively, perhaps there is meant to be a single department for an ENAME, ie, ENAME is unique. If that is the case, then we should change the database model/design to reflect it and not use DISTINCT to “fake” it.
I’m not saying DISTINCT is always a bug – but it’s always worth just asking yourself if you were correct to be using it.
All good, I just use UNIQUE ๐
Amateur – I fetch all the rows and de-dup on the client with a home-grown hashmap like a real man ๐
I’ll use an EXISTS with correlated subquery to EMP, add UNIQUE constraint on EMP(DEPTNO,ENAME), and then get rid of the DISTINCT….
๐
Or shuffle it all into Hadoop in the cloud and clear out the duplicates there ๐