CSV duplicate word remover

This tool removes words that appear in both Column A and Column B of the same row in an uploaded CSV file. You simply choose the column in which the duplicated words should remain.

Here’s a 3-row example of input data and the expected output, assuming the user selected Column B:

Input Data:

Column A            | Column B
apple orange        | orange banana
cat dog squirrel    | dog rabbit squirrel
red blue green      | blue yellow

Output Data (with Column B retaining duplicates):

Column A            | Column B
apple               | orange banana
cat                 | dog rabbit squirrel
red green           | blue yellow

In this example, every input row contains at least one word that appears in both columns. After the script runs with Column B selected as the preferred column, those shared words are removed from Column A, while Column B keeps them.
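The per-row behaviour can be sketched in a few lines of PHP. This is a minimal illustration rather than the script's own code: strip_shared_words is a hypothetical name, and splitting on whitespace with exact, case-sensitive matching is an assumption.

```php
<?php
// Hypothetical helper: removes from the non-preferred cell every word that
// also appears in the preferred cell. Words are split on whitespace and
// compared exactly (case-sensitive); both choices are assumptions.
function strip_shared_words(string $colA, string $colB, string $preferred): array
{
    $wordsA = preg_split('/\s+/', trim($colA), -1, PREG_SPLIT_NO_EMPTY);
    $wordsB = preg_split('/\s+/', trim($colB), -1, PREG_SPLIT_NO_EMPTY);

    if ($preferred === 'B') {
        // Keep Column B intact; drop the shared words from Column A.
        $wordsA = array_diff($wordsA, $wordsB);
    } else {
        // Keep Column A intact; drop the shared words from Column B.
        $wordsB = array_diff($wordsB, $wordsA);
    }

    return [implode(' ', $wordsA), implode(' ', $wordsB)];
}

// The three example rows, processed with Column B as the preferred column.
$rows = [
    ['apple orange', 'orange banana'],
    ['cat dog squirrel', 'dog rabbit squirrel'],
    ['red blue green', 'blue yellow'],
];

foreach ($rows as [$a, $b]) {
    [$newA, $newB] = strip_shared_words($a, $b, 'B');
    echo $newA, ' | ', $newB, PHP_EOL;
}
```

Passing 'A' instead of 'B' reverses the roles, so the shared words stay in Column A and are removed from Column B.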


CSV Duplicates Remover

Handling CSV (Comma Separated Values) files is a common task in data processing and manipulation. In some cases, the same words appear in more than one column of a row, which can distort data analysis or processing. In this article, we will explain a custom PHP script that lets users upload a CSV file and automatically removes words that occur in both Column A and Column B of a row, whilst retaining them in the user’s preferred column.

Script Overview

The PHP script provided combines both the front-end and back-end functionalities in a single file. The front-end is a simple HTML form that allows users to upload a CSV file and choose a preferred column for retaining duplicates. The back-end handles the file processing and removal of duplicate words from the specified columns.
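As a rough idea of what such a front-end might look like, the sketch below builds a form with a file input and a column selector. The field names csv_file and preferred_column and the function render_upload_form are assumptions made for illustration, not names taken from the script.

```php
<?php
// Hypothetical form markup. The field names csv_file and preferred_column
// are assumptions chosen for this sketch.
function render_upload_form(): string
{
    // No action attribute: the form posts back to the same file, which
    // matches a script that keeps front-end and back-end together.
    return <<<HTML
<form method="post" enctype="multipart/form-data">
    <label>CSV file: <input type="file" name="csv_file" accept=".csv" required></label>
    <label>Keep duplicates in:
        <select name="preferred_column">
            <option value="A">Column A</option>
            <option value="B">Column B</option>
        </select>
    </label>
    <button type="submit">Remove duplicates</button>
</form>
HTML;
}

echo render_upload_form();
```

The enctype="multipart/form-data" attribute is what allows PHP to receive the uploaded file in $_FILES.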

Functionality Breakdown

  1. The script starts by defining two utility functions: remove_bom and remove_duplicates.
    • The remove_bom function is responsible for removing the Byte Order Mark (BOM) from the beginning of the text, which may be present in some UTF-8 encoded CSV files.
    • The remove_duplicates function takes a row from the CSV file, a boolean flag indicating whether it’s the first row, and the user’s preferred column. It then finds the words that appear in both columns and returns a modified row in which those words have been removed from the other column, whilst the preferred column keeps them.
  2. The script proceeds to check if the form has been submitted by the user. If so, it validates the uploaded file to ensure it is a CSV file and then moves the file to the server for processing.
  3. The script then opens the original CSV file and creates a new CSV file to store the amended data. It processes the original file row by row, calling the remove_duplicates function for each row and writing the modified row to the new CSV file.
  4. Once the entire file has been processed, the script displays a download link for the amended CSV file, allowing users to download it for further use.
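
The steps above can be sketched roughly as follows. Only the name remove_bom comes from the description; is_csv_upload and process_csv are hypothetical helpers, the extension-only validation is an assumption about what the check involves, and the row transform is passed in as a callable so the sketch does not depend on how remove_duplicates is written.

```php
<?php
// Strip a UTF-8 byte order mark (bytes EF BB BF) from the start of a string.
function remove_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

// Hypothetical check on one $_FILES entry: the upload succeeded and the
// client-supplied name ends in .csv (an extension check only; assumption).
function is_csv_upload(array $file): bool
{
    $ok = isset($file['error'], $file['name']) && $file['error'] === UPLOAD_ERR_OK;
    return $ok && strtolower(pathinfo($file['name'], PATHINFO_EXTENSION)) === 'csv';
}

// Hypothetical driver: read $inPath row by row, pass each row (plus a
// first-row flag) to $rowFn, and write the result to $outPath.
// Returns the number of rows written.
function process_csv(string $inPath, string $outPath, callable $rowFn): int
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file.');
    }

    $count = 0;
    while (($row = fgetcsv($in, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv reports a blank line as [null]
        }
        $isFirst = ($count === 0);
        if ($isFirst) {
            // A BOM, if present, is attached to the first field of the file.
            $row[0] = remove_bom((string) $row[0]);
        }
        fputcsv($out, $rowFn($row, $isFirst), ',', '"', '\\');
        $count++;
    }

    fclose($in);
    fclose($out);
    return $count;
}
```

In a real upload handler, a file that passes the check would first be moved with PHP's move_uploaded_file, and a call to process_csv would be followed by printing a download link to the output path.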

Conclusion

This custom PHP script removes duplicate words from two columns of a CSV file whilst allowing the user to select a preferred column for retaining duplicates. Because it combines the front-end and back-end functionalities in a single file, users can upload a file and download the amended version in one place. The script can be further customised and extended to handle more complex data manipulation tasks.