Remove CSV text qualifiers within a field using Python

Working with CSV data ought to be straightforward, but even these seemingly simple data inputs can cause problems. For example, have you ever tried to import a CSV into a database like SQL Server using the Microsoft SQL Server Import tool and found that it won’t load because one of your CSV fields has a text qualifier inside the field itself? Let’s imagine you have a set of addresses and each field is qualified using the double-quote mark (“).

A typical record might look like this:

ID,AddressLine1,AddressLine2,City,Postcode,Country
"1","20 Market Road","Madeupsham","Nowheresford","XY12 3ZA","England"

Now suppose that in the first line of the address there is a house name rather than a house number. Your record might look something like this:

ID,AddressLine1,AddressLine2,City,Postcode,Country
"1","31 "The Wotsit"","Wheresthat Lane","Noideaton","YZ32 1BZ","England"

The problem here is that part way through the AddressLine1 field, there is a double-quote mark (“) before the field value itself has finished. Data import tools like the ones used by SQL Server will break when they try to parse it, because they will interpret the record like this:

ID:  "1"
AddressLine1:  "31 "
AddressLine2: ERROR! Invalid record delimiter, encountered ('T')

That additional double-quote mark before the name “The Wotsit” has confused the import tool, since it treats it as indicating the end of the field, after which it then checks for a comma (since this is a CSV file). When instead it finds something else (the letter ‘T’) it will fail.

All is not lost, however! Python’s own CSV library does not parse CSV files in the same way, even when fields contain text qualifiers. This means that you can use Python to parse, strip out erroneous characters and then rewrite the CSV file cleanly. You can either throw away the double-quote mark wherever it appears within a field value or you can replace it with something else. The script below uses a function called charMunger to process each field in a CSV file, removing the specified character, and then the script rewrites the CSV file in full. Note that the script will expect the column headings to appear in the first line of the CSV file. If your file does not have column headers, you will need to adapt the script slightly by providing an additional input file that contains the headers.

import csv
import re

# define variables
csvFileIn       = "C:\\Temp\\Original.csv" # specify input CSV
csvFileOut      = "C:\\Temp\\Munged.csv"   # specify output CSV
textQualifier   = chr(34) # ascii code for the " character
replacementChar = ''      # change if replacement needed
cleanWhitespace = True    # set to False to preserve whitespace
outListRow      = []
outList         = []

# remove user-specified char
def charMunger(strIn,char):
    
    if cleanWhitespace == True:
        strIn  = re.sub(' +', ' ', strIn)
    
    # replace char with nothing
    if char in strIn:
        strOut = strIn.replace(char,replacementChar)
    else:
        strOut = strIn
    return strOut

# get column headings
with open(csvFileIn, 'r') as headerIn:
    reader = csv.reader(headerIn)
    header = next(reader)  # gets the first line

# append column headers to output file
outList.append(header)

# count number of columns
numCols = len(header)

# loop through input data and munge
with open(csvFileIn, 'r') as csvInF:
    csvFile = csv.DictReader(csvInF)
    # for each csv row, munge each attribute and append to list
    for row in csvFile:
        x = 0
        outListRow = []
        while x < numCols:
            outListRow.append(charMunger(row[header[x]],textQualifier))
            x+=1
        outList.append(outListRow)
       
        
# write out munged data
with open(csvFileOut, 'w', newline="") as csvOutF:
    writer = csv.writer(csvOutF, quotechar='"', quoting=csv.QUOTE_ALL)
    writer.writerows(outList)

Edit the variables to specify: your input CSV, your output CSV, your text qualifier (default is ASCII character 34, i.e. the double-quote mark) and the character you want to replace the text qualifier with where it appears inside a field (default is to just strip it out and replace with nothing). Set the cleanWhitespace variable to False if you don’t want the script to remove extraneous whitespace characters (default will be to replace any consecutive set of whitespace with a single whitespace character).

Run the script and your output file will be easily loadable into SQL Server or any other data import mechanism that can’t tolerate text qualifiers inside a field.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: