Using Query in Pandas to remove a vector of values

I work in R, and this operation would be easy in tidyverse; however, I’m having trouble figuring out how to do it in Python and Pandas.

Let’s say we’re using the gapminder dataset

import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
gapminder = pd.read_csv(data_url)

and let’s say that I want to filter out of the dataset all rows whose year is 1952 or 1957. I would think that something like this would work, but it doesn’t:

vector = [1952, 1957]
gapminder.query("year isin(vector)")

I realize that what I’ve called a vector here is really a list. When I try to pass those two year values into an array with vector = pd.array(1952, 1957), that doesn’t work either.

In R, for instance, you would just do something simple like

vector = c(1952, 1957)
gapminder %>% filter(year %in% vector)
#or
gapminder %>% filter(year %in% c(1952, 1957))

So really this is a two-part question: first, how can I create a vector of many values (if I were pulling these values from another dataset, I believe that I could just use pd.to_numpy), and second, how do I remove all rows that match that vector of values from a dataframe?

I’ve looked at a lot of different variations for using query like here, for instance, https://www.geeksforgeeks.org/python-filtering-data-with-pandas-query-method/, but this has been surprisingly hard to find.

*Here I am updating my question: I found that this isn’t working if I pull a vector from another dataset (or even from the same dataset); for instance:

vector = (1952, 1957)

#how to take a dataframe and make a vector
#how to make a vector

gapminder.vec = (
    gapminder
    .query('year == [1952, 1958]')
    [['country']]
    .to_numpy()
)

gap_sum = gapminder.query("year != @gapminder.vec")
gap_sum

Running this raises an error (shown as a screenshot in the original post).

Thanks much!

James

Answer

You can use in or even == inside the query string like so:

# gapminder.query("year == @vector") returns the same result
print(gapminder.query("year in @vector"))

          country  year        pop continent  lifeExp    gdpPercap
0     Afghanistan  1952  8425333.0      Asia   28.801   779.445314
1     Afghanistan  1957  9240934.0      Asia   30.332   820.853030
12        Albania  1952  1282697.0    Europe   55.230  1601.056136
13        Albania  1957  1476505.0    Europe   59.280  1942.284244
24        Algeria  1952  9279525.0    Africa   43.077  2449.008185
...           ...   ...        ...       ...      ...          ...
1669   Yemen Rep.  1957  5498090.0      Asia   33.970   804.830455
1680       Zambia  1952  2672000.0    Africa   42.038  1147.388831
1681       Zambia  1957  3016000.0    Africa   44.077  1311.956766
1692     Zimbabwe  1952  3080907.0    Africa   48.451   406.884115
1693     Zimbabwe  1957  3646340.0    Africa   50.469   518.764268

The @ symbol tells the query string to look up a variable named vector in the surrounding Python scope, rather than treating the name as a column of the dataframe.
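As a minimal sketch of how the two kinds of names are resolved, the snippet below uses a tiny made-up dataframe (df and its values are stand-ins, not the real gapminder data) so it runs without a download:

```python
import pandas as pd

# Tiny stand-in for the gapminder data (values are made up for illustration)
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1962],
})

# A plain Python list is enough; no special "vector" type is needed
vector = [1952, 1957]

# `year` is resolved as a column, `@vector` as the Python variable above
kept = df.query("year in @vector")
print(kept)
```

Without the @ prefix, query would try to find vector among the dataframe's columns and fail, since no such column exists.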


There are a couple of issues with the updated component of your question that I’ll address:

  1. The immediate error you’re seeing comes from using double square brackets to select a column. Double square brackets force the selection to be returned as a 2d table (i.e. a dataframe that contains a single column) instead of just the column itself. To resolve this, simply get rid of the double brackets. The to_numpy call is also not necessary.

  2. In your gap_sum variable, you’re checking where the values in "year" are not in your gapminder.vec, which is a pd.Series (or, more generically, an array) of country names. Comparing years against country names doesn’t really make sense.

  3. Don’t use . notation to create variables in Python. You’re not making a new variable; you’re attaching a new attribute to an existing object. Instead use underscores, as is common practice in Python (e.g. use gapminder_vec instead of gapminder.vec).

# countries that have years that are either 1952 or 1958
#   will contain duplicate country names
gapminder_vec = gapminder.query('year == [1952, 1958]')['country']

# This won't actually filter anything- because `gapminder_vec` is 
#  a bunch of country names. Not years. 
gapminder.query("year not in @gapminder_vec")
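To make the single versus double bracket point concrete, here is a small sketch on a made-up dataframe (df is hypothetical, not the gapminder data). It only inspects types and shapes, so it shows why the double-bracket version ends up two-dimensional:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [1952, 1957, 1962],
})

as_frame = df[["country"]]   # double brackets: a one-column DataFrame (2-D)
as_series = df["country"]    # single brackets: a Series (1-D)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# to_numpy() preserves that difference in dimensionality
print(as_frame.to_numpy().shape)
print(as_series.to_numpy().shape)
```

The 2-D result is what trips up the comparison inside query; the 1-D Series (or a plain list) is the shape that in and not in expect.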

Finally, to either keep or drop the rows whose year appears in a collection of values, use in or not in:

vec = (1952, 1958)

# returns a subset containing the rows that have a year in `vec`
subset_with_years_in_vec = gapminder.query('year in @vec')

# returns a subset containing the rows that DO NOT have a year in `vec`
subset_without_years_in_vec = gapminder.query('year not in @vec')
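For the first half of the question (building the collection of values from another dataset), one possible approach is sketched below. The dataframes df and other and the name years_to_drop are hypothetical stand-ins; the sketch also shows the boolean-mask form with isin, which should give the same rows as the query form:

```python
import pandas as pd

# Made-up stand-in for the gapminder data
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1962],
})

# Hypothetical second table that holds the years to exclude
other = pd.DataFrame({"year": [1952, 1957, 1957]})

# Distinct years as a plain Python list
years_to_drop = other["year"].unique().tolist()

# Two ways to drop the matching rows
without_query = df[~df["year"].isin(years_to_drop)]
with_query = df.query("year not in @years_to_drop")

print(with_query)
```

Converting to a list is optional here; a Series selected with single brackets could be passed to isin or referenced with @ directly.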