i started to use python and i am trying to find outliers per year using the quantile my data is organized as follows: columns of years, and for each year i have months and their corresponding salinity and temperature

year=[1997:2021] month=[1,2...] SAL=[33,32,50,......,35,...]

this is my code:

#1st quartile Q1 = DF['SAL'].quantile(0.25) #3rd quartile Q3 = DF['SAL'].quantile(0.75) #calculate IQR IQR = Q3 - Q1 print(IQR) df_out = DF['SAL'][((DF['SAL'] < (Q1 - 1.5 * IQR)) |(DF['SAL'] > (Q3 + 1.5 * IQR)))]

i want to identify the month and year of the outlier and replace it with nan, let me know if you have any suggestions, thanks a lot

## Answer

You can use the following function. It uses the definition of an outlier that is below Q1-1.5IQR or above Q3+1.5IQR, such as classically done for boxplots.

import pandas as pd import numpy as np df = pd.DataFrame({'year': np.repeat(range(1997,2022), 12), 'month': np.tile(range(12), 25)+1, 'SAL': np.random.randint(20,40, size=12*25)+np.random.choice([0,-20, 20], size=12*25, p=[0.9,0.05,0.05]), }) def outliers(s, replace=np.nan): Q1, Q3 = np.percentile(s, [25 ,75]) IQR = Q3-Q1 return s.where((s > (Q1 - 1.5 * IQR)) & (s < (Q3 + 1.5 * IQR)), replace) # add new column with excluded outliers df['SAL_excl'] = df.groupby('year')['SAL'].apply(outliers)

## Checking that it works:

with outliers:

import seaborn as sns sns.boxplot(data=df, x='year', y='SAL')

without outliers:

sns.boxplot(data=df, x='year', y='SAL_excl')

NB. it is possible that new outliers appear as data has now new Q1/Q3/IQR due to the filtering.

How to retrieve rows with outliers:

df[df['SAL_excl'].isna()]

output:

year month SAL SAL_excl 28 1999 5 53 NaN 33 1999 10 7 NaN 94 2004 11 52 NaN 100 2005 5 38 NaN 163 2010 8 6 NaN 182 2012 3 25 NaN 188 2012 9 22 NaN 278 2020 3 53 NaN 294 2021 7 9 NaN