R’s relevel() and factor variables in linear regression in pandas

Data:

a,b,c,d
1,5,9,red
2,6,10,blue
3,7,11,green
4,8,12,red
3,4,3,orange
3,4,3,blue
3,4,3,red

In R, if I want to construct a linear regression model that takes into account categorical data (I think they’re called factor variables in R), I can simply do:

df$d = relevel(factor(df$d), ref = 'green')

After this, when building the model, R creates a 0/1 indicator (dummy) column for each colour except the reference level, for example:

dblue
0
1
0
0
0
1
0

There is no column for green: when all the other colour columns are 0, the row is green (this is our reference level). Now, create a regression model:

mod = lm(a ~ b + c + d, data=df)
summary(mod)

Call:
lm(formula = a ~ b + c + d, data = df)

Residuals:
         1          2          3          4          5          6          7 
 4.708e-16 -7.061e-16  2.219e-31  2.354e-16 -1.233e-31  7.061e-16 -7.061e-16 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -1.600e+00  3.622e-15 -4.418e+14 1.44e-15 ***
b            1.600e+00  9.403e-16  1.702e+15 3.74e-16 ***
c           -6.000e-01  3.766e-16 -1.593e+15 4.00e-16 ***
dblue        8.829e-16  1.823e-15  4.840e-01    0.713    
dorange      1.589e-15  2.294e-15  6.930e-01    0.614    
dred         2.295e-15  1.631e-15  1.407e+00    0.393    

I am trying to achieve the same in Python with pandas. So far I’ve only come up with this:

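import pandas as pd

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], dtype='category')}
df = pd.DataFrame(d)
df['d'] = pd.Categorical(df['d'], ordered=False)
# Add one indicator column per colour, skipping the reference level 'green'
for r in df['d'].cat.categories:
    if r != 'green':
        df['d%s' % r] = df['d'] == r
# Drop the original categorical column (axis must be passed by keyword in pandas >= 2.0)
df = df.drop('d', axis=1)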

It works and yields the same results, but I’m wondering if there is a method in pandas for this.

Answer

You could use pd.get_dummies:

import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 
     'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], 
                    dtype='category')}
df = pd.DataFrame(d)
dummies = pd.get_dummies(df['d'], dtype=int)  # dtype=int gives 0/1 columns (pandas >= 2.0 defaults to bool)
df = pd.concat([df, dummies], axis=1)
df = df.drop(['d', 'green'], axis=1)
print(df)

yields

   a  b   c  blue  orange  red
0  1  5   9     0       0    1
1  2  6  10     1       0    0
2  3  7  11     0       0    0
3  4  8  12     0       0    1
4  3  4   3     0       1    0
5  3  4   3     1       0    0
6  3  4   3     0       0    1
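
pandas has no direct counterpart to relevel(), but the same effect can be sketched by moving the reference category to the front and letting get_dummies drop the first level. The helper name relevel below is hypothetical (it is not a pandas function), and the 'd' prefix is only there to mimic R's column names:

```python
import pandas as pd

colours = pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
                    dtype='category')

def relevel(series, ref):
    """Hypothetical helper: return a copy of a categorical Series
    with `ref` moved to the front of its categories."""
    others = [c for c in series.cat.categories if c != ref]
    return series.cat.reorder_categories([ref] + others)

releveled = relevel(colours, 'green')

# drop_first=True drops the first category, which is now the reference level.
# prefix='d' with prefix_sep='' produces R-style names such as 'dblue'.
dummies = pd.get_dummies(releveled, prefix='d', prefix_sep='',
                         drop_first=True, dtype=int)
print(dummies)
```

The resulting columns can be concatenated onto the other columns with pd.concat, as in the code above.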

Using statsmodels,

import statsmodels.formula.api as smf
model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
print(model.summary())

yields

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      a   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.149e+25
Date:                Sun, 22 Mar 2015   Prob (F-statistic):           1.64e-13
Time:                        05:57:33   Log-Likelihood:                 200.74
No. Observations:                   7   AIC:                            -389.5
Df Residuals:                       1   BIC:                            -389.8
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -1.6000   6.11e-13  -2.62e+12      0.000        -1.600    -1.600
b              1.6000   1.59e-13   1.01e+13      0.000         1.600     1.600
c             -0.6000   6.36e-14  -9.44e+12      0.000        -0.600    -0.600
blue         1.11e-16   3.08e-13      0.000      1.000     -3.91e-12  3.91e-12
orange      7.994e-15   3.87e-13      0.021      0.987     -4.91e-12  4.93e-12
red         4.829e-15   2.75e-13      0.018      0.989     -3.49e-12   3.5e-12
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.203
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.752
Skew:                           0.200   Prob(JB):                        0.687
Kurtosis:                       1.445   Cond. No.                         85.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Alternatively, you could use a patsy formula to specify the treatment (dummy) contrast and its reference level, so the dummy columns do not have to be built by hand:

import pandas as pd
import statsmodels.formula.api as smf

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 
     'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
df = pd.DataFrame(d)

model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
print(model.summary())
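
To check which level ended up as the reference, one option is to look at the fitted parameter names rather than the full summary. This is a minimal sketch on the same data; the variable colour_terms is just an illustrative name, and the exact label text of each term depends on the formula backend statsmodels uses:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})

model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()

# Print every fitted coefficient with its label.
for name, value in model.params.items():
    print(f'{name}: {value:.4f}')

# Terms generated from the categorical column d; the reference level
# should have no term of its own.
colour_terms = [name for name in model.params.index if name.startswith('C(d')]
print(len(colour_terms), 'colour terms')
```

If green is the reference, there should be one term each for blue, orange and red and none for green.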
