I need to read data from a tab-delimited file where the first row contains column headers, but the first character of that row is a `#`.
The data looks like this:

```
# year-month-day spam eggs
1956-01-31 11 21
1985-03-20 12 22
1940-11-22 13 23
```
`read_csv` gets this wrong in four ways:

1. It includes the leading pound sign either as its own column or as the first character of the first column name, so you either get too many columns or whitespace, tabs, and commas preserved as part of the column name, even when it's been told which character is the delimiter.
2. Tabs, whitespace, commas, single quotes and double quotes are applied as delimiters with a priority not defined in the docs, and the result depends on whether the delimiter is adjacent to whitespace.
3. If the escape character is defined as a backslash, escaped characters are not taken as literals.
4. If you ask pandas to infer any of the above, or the header, all of them are inferred incorrectly.
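The first issue, for instance, can be reproduced with a minimal snippet (the inline sample below mirrors the file above; behavior shown is what a recent pandas produces):

```python
import pandas as pd
from io import StringIO

# Tab-delimited sample whose header line starts with "#".
data = "# year-month-day\tspam\teggs\n1956-01-31\t11\t21\n"

df = pd.read_csv(StringIO(data), sep="\t")
# The "#" and the space after it stay glued to the first column name:
print(df.columns.tolist())
```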
It looks like the only viable option is to (1) roll your own function to read the header, then (2) tell pandas `read_csv` to ignore the header row.
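A minimal sketch of that workaround, assuming the tab-separated sample from above (the variable names are illustrative):

```python
import pandas as pd
from io import StringIO

data = """# year-month-day\tspam\teggs
1956-01-31\t11\t21
1985-03-20\t12\t22
1940-11-22\t13\t23
"""

buf = StringIO(data)
# Read the header line ourselves: strip the leading "#", split on tabs.
header = buf.readline().lstrip("#").strip().split("\t")
# The rest of the buffer is pure data; supply our own column names.
df = pd.read_csv(buf, sep="\t", header=None, names=header)
```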
Is there a better way to do this?
You still have to shift the column names one position to the left to account for the empty column created by the removal of the `#`.
Then, remove the extra column whose values are all NaN:
```python
import numpy as np
import pandas as pd
from io import StringIO

def column_cleaning(frame):
    # Shift the column labels one position to the left.
    frame.columns = np.roll(frame.columns, len(frame.columns) - 1)
    # Drop the leftover column that is entirely NaN.
    return frame.dropna(how='all', axis=1)

FILE_CONTENTS = """\
# year-month-day spam eggs
1956-01-31 11 21
1985-03-20 12 22
1940-11-22 13 23
"""

df = pd.read_csv(StringIO(FILE_CONTENTS), delim_whitespace=True, escapechar="#")
column_cleaning(df)
```