How to create a histogram that displays counts of a binary variable across specified ranges of another variable

It was difficult to explain in one sentence what I’m looking for, so I’ll provide a clearer explanation here:

Overview: I have a dataset of cell phone customer data and two columns to work with: Churn and Service Outage. I want to create a histogram that shows the count of customers who have churned, based on their total service outage time. I am grouping the service outage time into ranges, and each customer has either a Yes or a No value in addition to their outage time. Below is what the graph looks like in Excel. It includes the entire dataset of 10,000 points, which is why the counts are so much greater. The look of the graph is what I’m going for.

[Image: Excel histogram of churn counts by service outage range]

Goal: To be able to do this in Python.

Problem: Once the data is imported, I handle a few problems. I convert the Yes/No values to 1s and 0s, and I have been able to create a groupby dataframe that outputs the count of customers with outages in specified ranges, as shown below.

import pandas
import numpy

# create DF    
df = pandas.DataFrame({
'Churn':
    ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
'Outage_sec_perweek':
    [7.978323, 11.699080, 10.752800, 14.913540, 8.147417, 8.420993, 11.182725, 7.791632, 5.739006, 8.707824]})


df['Churn'] = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
# Which outputs
   Churn  Outage_sec_perweek
0      0            7.978323
1      1           11.699080
2      0           10.752800
3      0           14.913540
4      1            8.147417
5      0            8.420993
6      1           11.182725
7      1            7.791632
8      0            5.739006
9      0            8.707824

df1 = df.groupby(pandas.cut(df['Outage_sec_perweek'],
                            numpy.arange(0, df['Outage_sec_perweek'].max() + 5, 5))).count()
print(df1)

# Which outputs - Outage column is correct but the churn column is not
                    Churn  Outage_sec_perweek
Outage_sec_perweek                           
(0.0, 5.0]              0                  0
(5.0, 10.0]             6                  6
(10.0, 15.0]            4                  4

Obviously this result is not correct or ideal, since I don’t differentiate the Churn column by churn vs not churn.

The missing piece is having a count of 1’s and 0’s to associate with each outage count, so that the resulting dataframe would be something like:

                    Outage_sec_perweek  No_Churn  Yes_Churn
Outage_sec_perweek
(0.0, 5.0]                           0         0          0
(5.0, 10.0]                          6         4          2
(10.0, 15.0]                         4         2          2

The goal, of course, is to apply the ranges created by numpy to the counting of churn vs. no churn. I know how to count the number of people churning and not churning, but grouping them by how much outage time they experienced is something I’ve never had to do before in Python and pandas. I don’t want to resort to a verbose conditional such as df = df.loc[(df['Outage_sec_perweek'] >= 0) & (df['Outage_sec_perweek'] < 5)]... and so on.

Answer

Setup

print(df)
   Churn  Outage_sec_perweek
0      0            7.978323
1      1           11.699080
2      0           10.752800
3      0           14.913540
4      1            8.147417
5      0            8.420993
6      1           11.182725
7      1            7.791632
8      0            5.739006
9      0            8.707824

First categorize the column Outage_sec_perweek into discrete intervals with pd.cut, then use crosstab to create a frequency table showing how many customers with each Churn value fall within each outage interval. Finally, use the plot method to create a bar plot of the distribution.

import numpy as np
import pandas as pd

s = df['Outage_sec_perweek']
s = pd.cut(s, bins=np.r_[0 : s.max() + 5 : 5])

# Parentheses let the method chain span two lines
table = (pd.crosstab(s, df['Churn'])
           .reindex(s.cat.categories, fill_value=0))

#OR table = df.groupby([s, 'Churn'])['Outage_sec_perweek'].count().unstack()

Output of frequency table

print(table)

Churn               0  1
Outage_sec_perweek      
(0.0, 5.0]          0  0
(5.0, 10.0]         4  2
(10.0, 15.0]        2  2
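
To get the layout sketched in the question (a total per interval plus separate No_Churn and Yes_Churn columns), the frequency table can be renamed and extended. The following is a minimal sketch, assuming the ten-row sample printed in the Setup section; the variable names binned and summary are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Sample data copied from the printed frame in Setup (Churn already encoded as 0/1).
df = pd.DataFrame({
    "Churn": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    "Outage_sec_perweek": [7.978323, 11.699080, 10.752800, 14.913540, 8.147417,
                           8.420993, 11.182725, 7.791632, 5.739006, 8.707824],
})

# Bin the outage times into (0, 5], (5, 10], (10, 15].
outage = df["Outage_sec_perweek"]
binned = pd.cut(outage, bins=np.arange(0, outage.max() + 5, 5))

# Count each Churn value per interval; reindex so empty intervals
# and missing Churn values appear as zeros.
table = (
    pd.crosstab(binned, df["Churn"])
    .reindex(index=binned.cat.categories, columns=[0, 1], fill_value=0)
)

# Rename the 0/1 columns (names taken from the question's desired output)
# and put the per-interval total in front.
summary = table.rename(columns={0: "No_Churn", 1: "Yes_Churn"})
summary.insert(0, "Outage_sec_perweek", summary.sum(axis=1))
summary.columns.name = None
print(summary)
```

The total column is computed from the two churn columns rather than from a second groupby, so the three columns stay consistent with each other.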

Output of bar plot

table.plot(kind='bar')
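
When this runs as a script rather than in a notebook, it can help to label the axes and keep a handle on the figure. Below is a self-contained sketch of the plotting step, assuming matplotlib is installed (pandas plotting depends on it). The axis labels and the "No"/"Yes" legend entries are illustrative choices, not part of the original answer.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample data copied from the printed frame in Setup.
df = pd.DataFrame({
    "Churn": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    "Outage_sec_perweek": [7.978323, 11.699080, 10.752800, 14.913540, 8.147417,
                           8.420993, 11.182725, 7.791632, 5.739006, 8.707824],
})

# Rebuild the frequency table: outage intervals as rows, Churn values as columns.
outage = df["Outage_sec_perweek"]
binned = pd.cut(outage, bins=np.arange(0, outage.max() + 5, 5))
table = (
    pd.crosstab(binned, df["Churn"])
    .reindex(index=binned.cat.categories, columns=[0, 1], fill_value=0)
)

# One group of bars per interval, one bar per Churn value.
ax = table.rename(columns={0: "No", 1: "Yes"}).plot(kind="bar", rot=0)
ax.set_xlabel("Outage_sec_perweek interval")
ax.set_ylabel("Number of customers")
ax.legend(title="Churn")
ax.figure.tight_layout()
# In a script, call plt.show() here to display the figure.
```

Passing stacked=True to plot stacks the two counts into a single bar per interval, which may be closer to the look of the Excel chart.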

[Image: bar plot of the frequency table]