I work in R, and this operation would be easy in the tidyverse; however, I’m having trouble figuring out how to do it in Python and pandas.
Let’s say we’re using the gapminder dataset
import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
gapminder = pd.read_csv(data_url)
and let’s say that I want to filter out from the dataset all year values that are equal to 1952 and 1957. I would think that something like this would work, but it doesn’t:
vector = [1952, 1957]
gapminder.query("year isin(vector)")
I realize here that I’ve made a vector in what is really a list. When I try to pass those two year values into an array as `vector = pd.array(1952, 1957)`, that doesn’t work either.
In R, for instance, you would have to do something simple like
vector = c(1952, 1957)
gapminder %>% filter(year %in% vector)
# or
gapminder %>% filter(year %in% c(1952, 1957))
So really this is a two part question: first, how can I create a vector of many values (if I were pulling these values from another dataset, I believe that I could just use pd.to_numpy) and then how do I then remove all rows based on that vector of observations from a dataframe?
I’ve looked at a lot of different variations for using query like here, for instance, https://www.geeksforgeeks.org/python-filtering-data-with-pandas-query-method/, but this has been surprisingly hard to find.
*Here I am updating my question: I found that this isn’t working if I pull a vector from another dataset (or even from the same dataset); for instance:
vector = (1952, 1957)
# how to take a dataframe and make a vector
# how to make a vector
gapminder.vec = gapminder.query('year == [1952, 1958]')[['country']].to_numpy()
gap_sum = gapminder.query("year != @gapminder.vec")
gap_sum
I receive the following error:
Thanks much!
James
Answer
You can use `in` or even `==` inside the query string like so:
# gapminder.query("year == @vector") returns the same result
print(gapminder.query("year in @vector"))

          country  year        pop continent  lifeExp    gdpPercap
0     Afghanistan  1952  8425333.0      Asia   28.801   779.445314
1     Afghanistan  1957  9240934.0      Asia   30.332   820.853030
12        Albania  1952  1282697.0    Europe   55.230  1601.056136
13        Albania  1957  1476505.0    Europe   59.280  1942.284244
24        Algeria  1952  9279525.0    Africa   43.077  2449.008185
...           ...   ...        ...       ...      ...          ...
1669   Yemen Rep.  1957  5498090.0      Asia   33.970   804.830455
1680       Zambia  1952  2672000.0    Africa   42.038  1147.388831
1681       Zambia  1957  3016000.0    Africa   44.077  1311.956766
1692     Zimbabwe  1952  3080907.0    Africa   48.451   406.884115
1693     Zimbabwe  1957  3646340.0    Africa   50.469   518.764268
The `@` symbol tells the query string to look for a variable named `vector` outside of the context of the dataframe.
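As a small offline illustration of the `@` lookup, here is a minimal sketch on a made-up four-row frame. The `toy` frame and its values are assumptions for demonstration only, not gapminder data. It also shows what should be the equivalent boolean-mask form with `Series.isin`, which is probably the closest analogue to R's `%in%`.

```python
import pandas as pd

# Hypothetical toy data standing in for gapminder (values are made up).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1962],
})

# A plain Python list is enough here; no special "vector" type is needed.
vector = [1952, 1957]

# `@vector` makes query() look up the Python variable named `vector`
# in the calling scope instead of treating it as a column name.
via_query = toy.query("year in @vector")

# The same filter written as a boolean mask, similar in spirit to R's %in%.
via_isin = toy[toy["year"].isin(vector)]

# Check that both forms select the same rows.
print(via_query.equals(via_isin))
```

Which form to use is largely a matter of taste: `query` reads closer to `dplyr::filter`, while the `isin` mask avoids the string expression entirely.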
There are a couple of issues with the updated component of your question that I’ll address:
1. The direct error you’re receiving comes from using double square brackets to select a column. With double square brackets, you force the selected column to be returned as a 2D table (i.e. a dataframe that contains a single column) instead of just the column itself. To resolve this, simply get rid of the double brackets. The `to_numpy` call is also not necessary.
2. In your `gap_sum` variable, you’re checking where the values in "year" are not in your `gapminder.vec`, which is a `pd.Series` (an array, in more generic terms) of country names. So these don’t really make sense to compare.
3. Don’t use `.` notation to create variables in Python. You’re not making a new variable, but attaching a new attribute to an existing object. Instead use underscores, as is common practice in Python (e.g. use `gapminder_vec` instead of `gapminder.vec`).
# countries that have years that are either 1952 or 1958
# will contain duplicate country names
gapminder_vec = gapminder.query('year == [1952, 1958]')['country']

# This won't actually filter anything, because `gapminder_vec` is
# a bunch of country names. Not years.
gapminder.query("year not in @gapminder_vec")
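To make the single-versus-double bracket point concrete, the following sketch compares the shapes of the two selections on a made-up frame. The `toy` frame and its values are assumptions for illustration, not the real gapminder file.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not the real gapminder file).
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [1952, 1957, 1952],
})

as_series = toy["country"]    # single brackets: a 1-D pandas Series
as_frame = toy[["country"]]   # double brackets: a 2-D one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)

# Converting the one-column DataFrame to NumPy keeps it 2-D,
# which is why it does not behave like a flat list of values.
print(as_frame.to_numpy().shape)                  # (3, 1)
```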
Also, to exclude the matching rows rather than keep them, use `not in`:
vec = (1952, 1958)

# returns a subset containing the rows who have a year in `vec`
subset_with_years_in_vec = gapminder.query('year in @vec')

# return subset containing rows who DO NOT have a year in `vec`
subset_without_years_in_vec = gapminder.query('year not in @vec')
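For the first half of the question (building the list of values from another dataset), one possible approach is sketched below. The frames `toy` and `other`, and the name `years_to_drop`, are hypothetical and use made-up values. The idea is to pull a single column with single brackets, reduce it to its unique values, and pass that to `query` or `isin`.

```python
import pandas as pd

# Hypothetical toy frames (made-up values, not the real gapminder file).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1962],
})
other = pd.DataFrame({"year": [1952, 1957, 1957]})

# Pull the values to exclude from a column of another frame.
# .unique() drops duplicates; .tolist() gives a plain Python list.
years_to_drop = other["year"].unique().tolist()

# Keep only the rows whose year is NOT in that list.
kept = toy.query("year not in @years_to_drop")

# Equivalent boolean-mask form: ~ negates the isin() mask.
kept_mask = toy[~toy["year"].isin(years_to_drop)]

print(kept)
```

Note that the values being compared come from a `year` column on both sides, so the comparison is between like and like.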