I am working on analyzing some very large files (~200 million rows)
The program runs for about half an hour before I get this error message:
MemoryError: Unable to allocate 3.25 GiB for an array with shape (7, 62388743) and data type object
I’m wondering if there is a way to bypass this memory error, or if there is a different function I can use that won’t require as much memory? I have split the file into pieces, but the issue with that is that I need all of the data in one dataframe so that I can analyze it as a whole.
You can limit the number of columns read with `usecols`, which reduces the memory footprint. You also seem to have some bad data in the CSV file that makes columns you expect to be `int64` come out as `object`. These could be empty cells or any non-digit values. Here is an example that reads the CSV and then scans for bad data. It uses commas rather than tabs because that's a bit easier to demonstrate.
```python
import io
import re

import numpy as np
import pandas as pd

test_csv = io.StringIO("""field1,field2,field3,other
1,2,3,this
4,what?,6,is
7,,9,extra""")

# matches values that consist entirely of digits
_numbers_re = re.compile(r"\d+$")

# on_bad_lines replaces the deprecated error_bad_lines parameter
df = pd.read_csv(test_csv, sep=",", on_bad_lines="skip",
                 usecols=['field1', 'field2', 'field3'])
print(df)

# columns that aren't int64
bad_cols = list(df.dtypes[df.dtypes != np.dtype('int64')].index)
if bad_cols:
    print("bad cols", bad_cols)
    for bad_col in bad_cols:
        col = df[bad_col]
        # non-numeric entries (NaN compares as "not True" here)
        bad = col[col.str.match(_numbers_re) != True]
        print(bad)
    exit(1)
```
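If the goal is to keep all the rows in one dataframe, it may also help to convert the offending `object` columns to a compact numeric dtype once the bad values are found. Here is a minimal sketch, assuming the bad entries can simply be discarded as `NaN` (the sample data is illustrative, not from your file):

```python
import io

import pandas as pd

test_csv = io.StringIO("""field1,field2,field3
1,2,3
4,what?,6
7,,9""")

df = pd.read_csv(test_csv, sep=",", usecols=['field1', 'field2', 'field3'])

# Coerce anything non-numeric to NaN, then downcast to the smallest
# integer type that fits. Columns left with NaN stay floating point,
# but either way they are far smaller than object columns.
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce", downcast="integer")

print(df.dtypes)
```

`errors="coerce"` turns values like `what?` or empty cells into `NaN`, and `downcast` picks the narrowest dtype that holds the data, which can shrink the memory use of a 200-million-row frame considerably compared to `object` columns.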