Handling Duplicates

Shivani Kumari
Last Updated: May 13, 2022

Introduction

Sometimes, tables in databases contain duplicate records. To get rid of duplicate records in databases, we first need to identify them and remove them from the given table. It will also be great to prevent inserting duplicate records into our table. We will use different methods to deal with duplicates.

Identify Duplicates

To find duplicates in a table, we need to use two following clauses.

  1. The GROUP BY clause is used to group all rows of the target column. The target column is the column that we want to check for duplicate values.
  2. The COUNT function in the HAVING clause checks if any groups have more than one entry. If they have, it means those have duplicate values.

Syntax

SELECT column1, coulmn2, … COUNT(*)

FROM table_name

GROUP BY column1,coulmn2, ….

HAVING COUNT(*) > 1;

Example

A few duplicates are present in the OrderID column in the given example. In an ideal situation, each row should have a unique value for OrderID since each order is assigned its value. But this thing is not implemented here. We can use the given query to find duplicates.

SELECT OrderID, COUNT(OrderID)

FROM Orders

GROUP BY OrderID

HAVING COUNT(OrderID)>1;

Output

The above SQL code prints all entries whose duplicates are present in the table.

Number of Records: 2

OrderIDCOUNT(OrderID)
102512
102762

Prevent Duplicates

The UNIQUE constraint is used so that a column does not contain duplicates.

Both the PRIMARY KEYand UNIQUE constraints guarantee the uniqueness of a column or set of columns.

A PRIMARY KEY constraint intuitively has a UNIQUE constraint.

However, we can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table.

Syntax

CREATE TABLE table_name(

col1 datatype [NULL | NOT NULL],

col2 datatype [NULL | NOT NULL],

col2 datatype [NULL | NOT NULL],

……

CONSTRAINT constraint_name UNIQUE(uc_col1, uc_col2,..uc_col_n) 

);

  • Table_name is name of the table that we are going to create.
  • col1, col2 are the columns we wish to create in the table.
  • Constraint_name is name of the unique constraint.
  • uc_col1, uc_col2, ... uc_col_n are the columns that make up the unique constraint.

Example

In the example, we have created a unique constraint on the employees table named employees_unique. It has one field, which is the employee_number.

CREATE TABLE employees

( employee_id INT PRIMARY KEY,

  employee_number INT NOT NULL,

  last_name VARCHAR(50) NOT NULL,

  first_name VARCHAR(50),

  salary MONEY,

  CONSTRAINT employees_unique UNIQUE (employee_number)

);

Eliminate Duplicates

The distinct clause is used to eliminate duplicate records present inside a table.

The DISTINCT keyword will fetch unique records.

Syntax

SELECT DISTINCT expressions

FROM table

[WHERE conditions];

Example

In this example, we used a table called suppliers having following data:

We find all the unique states in the suppliers table using the SQL command below.

SELECT DISTINCT state

FROM suppliers

ORDER BY state ;

Output

This SQL command will return all unique state values from the suppliers table and eliminate duplicates. As we can see, the state of California only appears once in the result set instead of four times.

state
Arkansas
California
Georgia
Texas
Washington
Wisconsin

Frequently Asked Questions

Q1. Why is the DISTINCT keyword used?

The DISTINCT keyword is used to fetch distinct records from a database table.

It is used to remove duplicates from the output of the SELECT statement.

Q2.What is the Difference Between UNIQUE and DISTINCT in SQL?

Distinct keyword in SQL is used with SELECT Statement to remove duplicates from the result of SELECT Statement. But  The UNIQUE keyword in SQL is used to mark one of its database constraints. The UNIQUE constraint on a column can prevent duplicate values from being stored in this column or set.

Q3. Can we use the DISTINCT keyword with all aggregate functions in SQL?

With SUM(), AVG(), and COUNT(expression), DISTINCT eliminates duplicate values before the sum, average, or count is calculated. It is meaningless in the case of MAX() and MIN().

Key Takeaways

Duplicates in tables cause redundancy in databases. It requires extra memory to store these duplicate records. We first identify and then remove these duplicates. We used GROUP BY and COUNT clauses to identify duplicates and the DISTINCT clause to remove duplicates. It is also better to prevent these duplicate records from inserting into the database. We use the UNIQUE clause to prevent duplicates.

Visit here to learn more about different topics related to database and management systems. Ninjas, don’t stop here. Check out the Top 100 SQL Problems to master frequently asked questions in big companies and land your dream job. Also, try  CodeStudio to practice a wide range of DSA questions asked in many interviews.

Was this article helpful ?
1 upvote

Comments

No comments yet

Be the first to share what you think