pandas read_csv with pound sign in column headers

I need to read data from a tab-delimited file where the first row contains the column headers, but the first character of that row is a pound sign/octothorpe/hashtag (#).

The data looks like this:

#   year-month-day  spam    eggs
1956-01-31  11  21
1985-03-20  12  22
1940-11-22  13  23

read_csv gets several things wrong here:

1. It includes a leading pound sign either as its own column or as the first character of the first column name, producing too many columns or keeping whitespace, tabs, and commas as part of the column name, even when it is told that character is the delimiter.
2. Tabs, spaces, commas, single quotes, and double quotes all seem to be candidates for the delimiter, picked by a priority system not defined in the docs, and the choice depends on whether the delimiter is adjacent to whitespace, e.g. 'abc','xyz' versus 'abc', 'xyz'.
3. If the escape character is defined as a backslash, escaped characters are not taken as literals.
4. If you ask pandas to infer any of the above (or the header), it infers them incorrectly.
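For instance, the leading # really does end up in the parsed header. A minimal repro (a sketch, assuming pandas; the raw string just mirrors the sample data above):

```python
from io import StringIO

import pandas as pd

# illustrative copy of the sample data, whitespace-delimited
raw = (
    "#   year-month-day  spam    eggs\n"
    "1956-01-31  11  21\n"
    "1985-03-20  12  22\n"
    "1940-11-22  13  23\n"
)

# splitting on runs of whitespace yields four header tokens,
# the first of which is the bare "#"
df = pd.read_csv(StringIO(raw), sep=r"\s+")
print(list(df.columns))
```

The "#" becomes its own column name, so the three data fields per row no longer line up with the header.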

Looks like the only viable option is to (1) roll your own function to read the header line, then (2) tell pandas read_csv to skip the header row.
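That two-step workaround can be sketched like this (a sketch, assuming whitespace-delimited data as shown; the raw string is illustrative):

```python
from io import StringIO

import pandas as pd

# illustrative copy of the sample data
raw = (
    "#   year-month-day  spam    eggs\n"
    "1956-01-31  11  21\n"
    "1985-03-20  12  22\n"
    "1940-11-22  13  23\n"
)

buf = StringIO(raw)
# step 1: roll your own header read -- strip the leading '#' and split
names = buf.readline().lstrip("#").split()
# step 2: hand the remaining lines to read_csv with no header of its own
df = pd.read_csv(buf, sep=r"\s+", header=None, names=names)
print(list(df.columns))  # ['year-month-day', 'spam', 'eggs']
```

read_csv happily continues from wherever the file-like object was left, so consuming the header line first keeps it out of the parse entirely.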

Is there a better way to do this?

Answer

Pass escapechar="#" so that read_csv consumes the pound sign instead of treating it as data. The header then parses with one extra field at the front, so you still have to shift the column names one position to the left to account for the empty column created by the removal of the # character.

Then remove the extra column, whose values are all NaN.

import numpy as np
import pandas as pd
from io import StringIO

def column_cleaning(frame):
    # shift the column names one position to the left to undo the offset
    frame.columns = np.roll(frame.columns, len(frame.columns) - 1)
    # then drop the leftover all-NaN column
    return frame.dropna(how='all', axis=1)

FILE_CONTENTS = """
#   year-month-day  spam    eggs
1956-01-31  11  21
1985-03-20  12  22
1940-11-22  13  23
"""

# note: delim_whitespace=True is deprecated in newer pandas releases;
# sep=r"\s+" is the modern spelling
df = pd.read_csv(StringIO(FILE_CONTENTS), delim_whitespace=True, escapechar="#")

column_cleaning(df)
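To see what column_cleaning does, here is a sketch on a hand-built frame shaped like the misparsed read_csv output (the extra " " column name and the values are illustrative):

```python
import numpy as np
import pandas as pd

# after the escapechar trick, the header has one extra name at the front
# and the last named column is all NaN
misparsed = pd.DataFrame({
    " ": ["1956-01-31", "1985-03-20", "1940-11-22"],
    "year-month-day": [11, 12, 13],
    "spam": [21, 22, 23],
    "eggs": [np.nan, np.nan, np.nan],
})

# roll the names one step left: [' ', ymd, spam, eggs] -> [ymd, spam, eggs, ' ']
misparsed.columns = np.roll(misparsed.columns, len(misparsed.columns) - 1)
# the all-NaN column (now named ' ') drops out
cleaned = misparsed.dropna(how="all", axis=1)
print(list(cleaned.columns))  # ['year-month-day', 'spam', 'eggs']
```

After the roll, each name sits over the data it originally labelled, and the orphaned all-NaN column carries the junk name, so dropna removes exactly the right column.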

