It happens…
You’re under a time crunch to deliver data for a business project, and a few duplicates catch your eye. What now?
Luckily, Excel provides tools to dedupe data in a pinch.
WAIT!! Before you click that Remove Duplicates button, and before you delete any records, follow these critical steps:
Step 1: Make a copy of your data first!
Step 2: Standardize your data fields
Step 3: SORT your file
Step 4: View your duplicates!
Step 5: Choose the correct fields
(If this is overwhelming - we are a phone call away 877-236-9155)
Step 1: Make a copy of your data first!
It goes without saying, but we're saying it anyway:
Take that 1 minute to click FILE --> SAVE AS and make a copy. You will thank yourself (and us) later.
Step 2: Standardize your data fields
The Remove Duplicates functionality in Excel uses EXACT matches. This is very different than what a human eye will recognize as duplicates.
Here are the steps for standardizing the most common fields:
1. Standardizing Addresses:
Utilize CASS standardization or even better, NCOA. (Need help processing CASS or NCOA?) Attempting to fix a mixture of Address1 and Address2 without a postal solution requires a complex formula and may still result in inconsistencies.
2. Standardizing Split Names:
Names have become more creative over the years, making it nearly impossible to 'standardize' a person's name. Tom may not like to be called Thomas; maybe it is Tomas, or maybe it really is just Tom. Incorrectly modifying a person's name can be insulting.
To be safe, the most you should do is remove punctuation such as commas, periods, and apostrophes for matching purposes.
The following describes the steps of a file that has a split name field of First Name and Last Name. If your data is a mixture of Full Name and Split Name, we can help with a more sophisticated approach.
a. Copy the name columns:
The best option for deduping on names is to copy the name columns for matching while leaving the ORIGINAL names intact.
Highlight the "First Name" and the "Last Name" column and right-click
Select Copy on the drop-down menu
Scroll all the way to the right and highlight the first empty column
Right-click on the column header
Click on the first clipboard icon, which is the basic Paste option
(With the two name fields copied to the end of your file, you can now remove punctuation using those copied columns)
b. Highlight the two new name columns
c. Click on Home --> Find & Select --> Replace
In the "Find what:" box, type a period
Leave the "Replace with:" box BLANK
Click Replace All
The periods are now gone
d. Repeat this for each of the punctuation types: commas, apostrophes, and any other that exist
Final result after punctuation is removed
3. Standardizing Phone Numbers:
To Standardize a phone field, use the same steps as above to copy your phone fields.
a. Copy the phone number column
b. Highlight the new phone column
c. Click on Home --> Find & Select --> Replace
d. Repeat this for each of the punctuation types that exist:
This will take several steps to remove EACH one SEPARATELY. Here is the typical list of characters in a phone field:
(
)
-
,
.
(If your phone field has additional characters that are not numeric, continue to step through the "Find & Replace" command box.)
Step 3: SORT your file
This next step is CRITICAL. Sort your file to include the field(s) that are causing the duplication. In some cases, multiple fields will need to be sorted.
Example: ZIP Code, Address 1, and Address 2.
Why is sorting critical? When Excel removes duplicates, it will keep the first record in the set and remove the rest.
There may be cases where you sort a field in descending order to preserve the best record.
Sorting helps ensure the most desired record is first.
The below example prioritizes Vehicle Year to keep the record with the newest vehicle and deletes the rest.
1. Sorting Address Fields:
a. Highlight all columns including the headers.
(Caution - if you only highlight and sort a few columns, you will skew your file).
b. Click Home --> Sort & Filter --> Custom Sort
c. Choose the fields for sorting
In the Custom Sort dialog box, use the dropdown arrows to choose the fields for sorting. The sort order matters. For addresses, choose ZIP Code FIRST, then Address 1, then Address 2. Keep these in the default sort order ("A to Z").
d. Set the Priority
Choose the field within the duplicates that will determine the priority. In this case, we want to keep the record with the most recent vehicle the person owns.
Add the Vehicle Year to the sort box, and change "Smallest to Largest" to "Largest to Smallest"
Your file is now sorted in the correct order and ready for deduping by address.
2. Sorting Name Fields:
When deduping by name, follow the same steps as above:
a. Highlight all columns including the headers.
b. Click Home --> Sort & Filter --> Custom Sort
c. Choose the fields for sorting
Choose Last Name as your first sorting field, then, First Name.
d. Set the Priority
Again, choose the field that determines the record you would like to keep. In the above example, this is vehicle year.
3. Sorting Phone Fields:
Same steps as sorting address and name:
a. Highlight all columns including the headers.
b. Click Home --> Sort & Filter --> Custom Sort
c. Choose the fields for sorting
For sorting phones, simply choose the field containing the phone numbers.
d. Set the Priority
As in the above examples, choose the field that determines the record you would like to keep. in this case, vehicle year
Step 4: View your duplicates!
Now that you have standardized and sorted your data, take a few minutes to scroll through the duplicates. Do a spot check to determine if;
There is additional punctuation that needs to be removed
The duplicates contain exact matches
Further standardization is needed
Other anomalies exist that need to be addressed
Step 5: Choose the correct fields for deduping
1. Removing Duplicates Based on Addresses
a. Select the entire spreadsheet. Make sure all of the columns are highlighted.
b. Click Data --> Remove Duplicates
c. Click the My data has headers box if it's not already selected.
d. Resize the "Remove Duplicates" dialog box.
Drag down the bottom border to enlarge the dialog box. This will allow you to easily view the entire list of fields.
e. Click Unselect ALL so that NO boxes are selected.
By default, all fields are selected when the dialog box is opened.
f. Check ONLY the boxes that are causing the duplicates.
In this case, that is ZIP Code, Address1 and Address 2
g. Click OK to remove the duplicates
After clicking OK, a new dialog box will appear showing the number of records removed, and how many unique records remain.
You can always click UNDO, if needed, to revert back.
Below is the before and after comparison for the current example. You can see that the first record in the duplicate set remains, while the others are removed.
2. Removing Duplicates Based on Names
Use the standardized fields that were created for the names in Step 2, not the original.
(Once the dedupe is completed, you can then delete these extra columns if desired).
a. Select the entire spreadsheet. Make sure all of the columns are highlighted.
b. Click Data--> Remove Duplicates
c. Click the My data has headers box if is not already selected.
d. Resize the remove duplicates dialog box.
Drag down the bottom border to enlarge the dialog box. This will allow you to easily view the entire list of fields.
e. Click Unselect ALL so that NO boxes are selected.
By default, all fields are selected when the dialog box is opened.
f. Check ONLY the boxes that are causing the duplicates.
In this case, that is Last Name and First Name
g. Click OK to remove the duplicates
After clicking OK, a new dialog box will appear showing the number of records removed, and how many unique records remain.
You can always click UNDO, if needed, to revert back.
3. Removing Duplicates Based on Phones
Use the standardized phone number field, not the original.
(Once the dedupe is completed, you can then delete this extra column if desired).
a. Select the entire spreadsheet. Make sure all of the columns are highlighted.
b. Click Data--> Remove Duplicates
c. Click the My data has headers box if is not already selected.
d. Resize the remove duplicates dialog box.
Drag down the bottom border to enlarge the dialog box. This will allow you to easily view the entire list of fields.
e. Click Unselect ALL so that NO boxes are selected.
By default, all fields are selected when the dialog box is opened
f. Check ONLY the boxes that are causing the duplicates.
In this case, Phone Number.
g. Click OK to remove the duplicates
After clicking OK, a new dialog box will appear showing the number of records removed, and how many unique records remain.
You can always click UNDO, if needed, to revert back.
Finally...
Check your results. If the duplicates did not remove as expected, it is possible they were not exact matches and require further standardization, or perhaps advanced matching algorithms are required.
*Frequent question: Can you use Conditional Formatting to remove duplicates? NO, see our previous blog AVOID These 3 Deduping DIY Disasters