# How to create a histogram that displays counts of binomial values with respect to a specified range of another variable

It was difficult to explain in one sentence what I’m looking for, so I’ll provide a clearer explanation here:

Overview: I have a dataset of cell phone customer data and two columns to work with: Churn and Service Outage. I want to create a histogram that shows the count of customers who have churned, based on their total service outage time. I am grouping the service outage time into ranges, and each customer has either a Yes or a No churn value in addition to their outage time. Below is what the graph looks like in Excel; it includes the entire dataset of 10,000 points, which is why the counts are so much greater. The look of the graph is what I’m going for.

Goal: To be able to do this in Python.

Problem: Once the data is imported, I clean up a few things. I convert the Yes/No values to 1’s and 0’s, and I have been able to create a `groupby` dataframe that outputs the count of customers with outages in specified ranges, as shown below.

```
import pandas
import numpy

# create DF
df = pandas.DataFrame({
    'Churn':
        ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
    'Outage_sec_perweek':
        [7.978323, 11.699080, 10.752800, 14.913540, 8.147417,
         8.420993, 11.182725, 7.791632, 5.739006, 8.707824]})

df['Churn'] = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
print(df)
# Which outputs:
#    Churn  Outage_sec_perweek
# 0      0            7.978323
# 1      1           11.699080
# 2      0           10.752800
# 3      0           14.913540
# 4      1            8.147417
# 5      0            8.420993
# 6      1           11.182725
# 7      1            7.791632
# 8      0            5.739006
# 9      0            8.707824

df1 = df.groupby(pandas.cut(df['Outage_sec_perweek'], numpy.arange(0,
df['Outage_sec_perweek'].max() + 5, 5))).count()
print(df1)

# Which outputs (the Outage column is correct, but the Churn column is not):
#                     Churn  Outage_sec_perweek
# Outage_sec_perweek
# (0.0, 5.0]              0                   0
# (5.0, 10.0]             6                   6
# (10.0, 15.0]            4                   4
```

Obviously this result is not correct or ideal, since I don’t differentiate the Churn column by churn vs not churn.

The missing piece is having a count of 1’s and 0’s to associate with each outage count, so that the resulting dataframe would be something like:

```
                    Outage_sec_perweek    No_Churn   Yes_Churn
Outage_sec_perweek
(0.0, 5.0]                           0       0           0
(5.0, 10.0]                          6       4           2
(10.0, 15.0]                         4       2           2
```

The goal, of course, is to apply the ranges created by numpy to the counting of churn vs. no churn. I know how to count the number of people churning and not churning, but grouping them by how much outage time they experienced is something I’ve never had to do before in Python and pandas. I don’t want to resort to a verbose chain of conditionals such as `df = df.loc[(df['Outage_sec_perweek'] >= 0) & (df['Outage_sec_perweek'] < 5)]...` and so on.

Setup

```
print(df)
   Churn  Outage_sec_perweek
0      0            7.978323
1      1           11.699080
2      0           10.752800
3      0           14.913540
4      1            8.147417
5      0            8.420993
6      1           11.182725
7      1            7.791632
8      0            5.739006
9      0            8.707824
```

First, categorize the column `Outage_sec_perweek` into discrete intervals, then use `crosstab` to create a frequency table showing the count of each churn value within each outage interval. Then use the `plot` method to create a bar plot of the distribution.

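```
import numpy as np
import pandas as pd

# Bin the outage times into width-5 intervals: (0, 5], (5, 10], ...
s = df['Outage_sec_perweek']
s = pd.cut(s, bins=np.r_[0 : s.max() + 5 : 5])

# Count churn values per interval; reindex so empty intervals show up as 0
table = (
    pd.crosstab(s, df['Churn'])
    .reindex(s.cat.categories, fill_value=0)
)

# OR: table = df.groupby([s, 'Churn'])['Outage_sec_perweek'].count().unstack()
```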
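To see what the binning step produces on its own, here is a minimal sketch that runs `pd.cut` on the sample outage values from the Setup block. The variable names `outage`, `edges`, and `binned` are only illustrative and are not part of the code above.

```python
import numpy as np
import pandas as pd

# Outage times copied from the sample frame in the Setup block
outage = pd.Series(
    [7.978323, 11.699080, 10.752800, 14.913540, 8.147417,
     8.420993, 11.182725, 7.791632, 5.739006, 8.707824],
    name='Outage_sec_perweek',
)

# np.r_[0 : stop : 5] expands to the same edges as np.arange(0, stop, 5)
edges = np.r_[0 : outage.max() + 5 : 5]

# pd.cut labels each value with the right-closed interval that contains it
binned = pd.cut(outage, bins=edges)

print(edges)                            # bin edges
print(binned.cat.categories)            # the intervals built from the edges
print(binned.value_counts(sort=False))  # customers per interval, in bin order
```

Because the result is categorical, the empty `(0.0, 5.0]` interval remains a known category even though no customer falls in it, which is what lets `reindex(s.cat.categories, fill_value=0)` restore it in the frequency table.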

Output of frequency table

```
print(table)

Churn               0  1
Outage_sec_perweek
(0.0, 5.0]          0  0
(5.0, 10.0]         4  2
(10.0, 15.0]        2  2
```
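If you want the exact layout asked for in the question, with a total per interval plus separate No/Yes columns, one possible sketch is to rename the crosstab columns and insert a total. It assumes the sample data from the Setup block; the name `wide` is illustrative, and the column labels are taken from the question's desired output.

```python
import numpy as np
import pandas as pd

# Sample data from the Setup block (Churn already converted to 0/1)
df = pd.DataFrame({
    'Churn': [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    'Outage_sec_perweek': [7.978323, 11.699080, 10.752800, 14.913540, 8.147417,
                           8.420993, 11.182725, 7.791632, 5.739006, 8.707824],
})

s = pd.cut(df['Outage_sec_perweek'],
           bins=np.r_[0 : df['Outage_sec_perweek'].max() + 5 : 5])
table = pd.crosstab(s, df['Churn']).reindex(s.cat.categories, fill_value=0)

# Rename the 0/1 columns, then add the per-interval total as the first column
wide = table.rename(columns={0: 'No_Churn', 1: 'Yes_Churn'})
wide.insert(0, 'Outage_sec_perweek', wide['No_Churn'] + wide['Yes_Churn'])
wide.columns.name = None  # drop the leftover 'Churn' label on the column axis

print(wide)
```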

Output of bar plot

```
table.plot(kind='bar')
```
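For a chart closer to the Excel version, a self-contained sketch with labelled axes and a Yes/No legend might look like the following. It assumes matplotlib is installed and uses the sample data from the Setup block; the axis label text and legend names are my own choices, not something from the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample data from the Setup block (Churn already converted to 0/1)
df = pd.DataFrame({
    'Churn': [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    'Outage_sec_perweek': [7.978323, 11.699080, 10.752800, 14.913540, 8.147417,
                           8.420993, 11.182725, 7.791632, 5.739006, 8.707824],
})

s = pd.cut(df['Outage_sec_perweek'],
           bins=np.r_[0 : df['Outage_sec_perweek'].max() + 5 : 5])
table = pd.crosstab(s, df['Churn']).reindex(s.cat.categories, fill_value=0)

# One group of bars per outage interval, one bar per churn value
fig, ax = plt.subplots()
table.rename(columns={0: 'No', 1: 'Yes'}).plot(kind='bar', ax=ax, rot=0)
ax.set_xlabel('Outage_sec_perweek (binned)')
ax.set_ylabel('Number of customers')
ax.legend(title='Churn')
fig.tight_layout()

# Call plt.show() or fig.savefig(...) afterwards to display or save the figure
```

Passing `stacked=True` to `plot` would stack the Yes and No bars within each interval instead of placing them side by side.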