Compare sizes in a pandas column whose values are given in KB, MB and GB

I want to compare each row's size with the size of the previous row. For example, say the first row has a size of 6 KB and the second row has a size of 2 KB. If a row's size is less than 50% of the previous row's size, that row should be printed.

This is my dataframe:

   size     number    key       date
0  120 K    12345     Hello     20181002
1  119 K    12345     No        20181001
2  30 K     12345     Hello     20181003
3  90 K     12345     No        20181003
4  150 K    12345     Hello     20181004
5  180 M    12345     No        20181005
6  70 M     12345     Hello     20181006

In the dataframe above, the 2nd row is compared with the 1st; its size is not less than 50% of the 1st row's size, so it is ignored. The 3rd row's size is less than 50% of the 2nd row's size, so the 3rd row is printed. Likewise, the row at index 6 (70 M) is printed because its size is less than 50% of the previous row's size.

Answer

You can use .replace() to turn the magnitude suffixes in the size column (K, M, G, etc.) into scientific-notation exponents, so that each string can be parsed as a number scaled by its suffix:

K converted to e+03 in scientific notation

M converted to e+06 in scientific notation

G converted to e+09 in scientific notation

(This works for integers as well as for floats with any number of decimal places.)

Then convert the scientific-notation text to float and cast it to integer to get the final format:

size_val = df['size'].replace({' ': '', 'K': 'e+03', 'M': 'e+06', 'G': 'e+09'}, regex=True).astype(float).astype(int)
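
To make the two conversion steps easier to follow, here is a minimal sketch that runs them separately on the size values from the question. The names `sizes`, `as_text` and `as_float` are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Size column copied from the question's dataframe.
sizes = pd.Series(
    ["120 K", "119 K", "30 K", "90 K", "150 K", "180 M", "70 M"], name="size"
)

# Step 1: with regex=True the dict keys are matched as substrings,
# so "120 K" loses its space and becomes "120e+03".
as_text = sizes.replace(
    {" ": "", "K": "e+03", "M": "e+06", "G": "e+09"}, regex=True
)

# Step 2: the scientific-notation strings parse directly as floats.
as_float = as_text.astype(float)

print(as_text.tolist())
print(as_float.tolist())
```

The intermediate strings are never needed in practice; splitting the chain just shows what each call contributes.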

Then use df.loc to keep the rows where the ratio of the current row's size to the previous row's size is below 0.5 (.shift() supplies the previous row's values):

df.loc[(size_val / size_val.shift()) < 0.5]

Result:

   size  number    key      date
2  30 K   12345  Hello  20181003
6  70 M   12345  Hello  20181006
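
For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the dataframe from the question and applies the same filter. It skips the integer cast, on the assumption that the ratio only needs floats; the names `ratio` and `shrunk` are illustrative.

```python
import pandas as pd

# Dataframe rebuilt from the question.
df = pd.DataFrame({
    "size": ["120 K", "119 K", "30 K", "90 K", "150 K", "180 M", "70 M"],
    "number": [12345] * 7,
    "key": ["Hello", "No", "Hello", "No", "Hello", "No", "Hello"],
    "date": [20181002, 20181001, 20181003, 20181003,
             20181004, 20181005, 20181006],
})

# Translate the suffixes into exponents and parse as float.
size_val = (
    df["size"]
    .replace({" ": "", "K": "e+03", "M": "e+06", "G": "e+09"}, regex=True)
    .astype(float)
)

# Ratio of each row to the previous one. The first row has no
# predecessor, so its ratio is NaN, and NaN < 0.5 is False.
ratio = size_val / size_val.shift()

shrunk = df.loc[ratio < 0.5]
print(shrunk)
```

Because the first row's ratio is NaN, it is never selected, which matches the requirement that each row is compared only with the row before it.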

The values in size_val are the actual sizes, converted from text to integers and scaled by the magnitude suffix:

print(size_val)


0       120000
1       119000
2        30000
3        90000
4       150000
5    180000000
6     70000000
Name: size, dtype: int32
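
One caveat, offered as an assumption about the environment: the int32 dtype above suggests a platform where astype(int) produces 32-bit integers, which cannot hold values above 2,147,483,647, so a size such as 3 G would not fit. A possible workaround is to request a 64-bit integer explicitly. The input below is hypothetical and not taken from the question.

```python
import pandas as pd

# Hypothetical sizes including a gigabyte-scale entry and a decimal value.
sizes = pd.Series(["1.5 K", "3 G"], name="size")

size_val = (
    sizes
    .replace({" ": "", "K": "e+03", "M": "e+06", "G": "e+09"}, regex=True)
    .astype(float)
    .astype("int64")  # explicit 64-bit integers; 3e9 exceeds the int32 range
)

print(size_val)
```

If only the ratio is needed, staying with float and dropping the integer cast avoids the question entirely.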