Whenever you have the need to use the DISTINCT keyword, its worth just pausing for a second, and making sure that you are not hiding just a larger issue. It actually might represent either incorrect use of SQL or incorrect assumptions from the data model.
Consider the following example
SELECT DISTINCT d.dname FROM emp e, dept d WHERE e.ename = 'SMITH' AND e.deptno = d.deptno
The query is certainly valid, but when I see “distinct” I ask myself the following questions:
Has the DISTINCT has been added in an attempt to only return a single row ?, ie, is someone working under the assumption being that an employee name can only refer to a single department ? Unless there is a unique constraint on the ENAME column, then we can still just as easily get multiple rows back (even with the DISTINCT), so the SQL will be a “sleeping problem” in the application until the data causes it to fail.
Because the DISTINCT keyword here:
- doesn’t give you any guarantee that you’ll only get one row, and
- might be making the database do excessive work (get all rows, sort them and remove duplicates), every single time you run this SQL
Alternatively, perhaps there is meant to be a single department for an ENAME, ie, ENAME is unique. If that is the case, then we should change the database model/design to reflect it and not use DISTINCT to “fake” it.
I’m not saying DISTINCT is always a bug – but it’s always worth just asking yourself if you were correct to be using it.
All good, I just use UNIQUE 😉
Amateur – I fetch all the rows and de-dup on the client with a home-grown hashmap like a real man 🙂
I’ll use an EXISTS with correlated subquery to EMP, add UNIQUE constraint on EMP(DEPTNO,ENAME), and then get rid of the DISTINCT….
Or shuffle it all into Hadoop in the cloud and clear out the duplicates there 🙂