Python code does not show the desired output but keeps running

I was learning from the book “Introduction to Data Science” by Laura Igual.

Whenever I try to execute this code, my Jupyter notebook keeps showing ‘[*]’ but never shows the desired output. My laptop even starts to slow down and get choppy. The file is a CSV file with 26 columns and 10000 rows. Is it because the file is too large?

import pandas as pd
import numpy as np


file_name = r"ACCIDENTS_GU_BCN_2013.csv"

file = open(file_name, "r")

data = pd.read_csv(file)

data['Date'] = data[u'Dia de mes'].apply(lambda x: str(x)) + '-' + \
               data[u'Mes de any'].apply(lambda x: str(x))

data['Date'] = pd.to_datetime(data['Date'])
accidents = data.groupby(['Date']).size()

print(accidents.mean())

Answer

The best apply is no apply at all. Use vectorized code:

data['Date'] = data['Dia de mes'].astype('str') + '-' + data['Mes de any'].astype('str')

You can also drop the u prefix from the strings. It was necessary in Python 2 to indicate Unicode strings. Python 3 made it redundant, as all strings are Unicode by default.
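
For reference, here is a minimal sketch of how the vectorized line could fit into the rest of the question's code. The small DataFrame is made-up sample data standing in for the CSV. The year 2013 (guessed from the file name) and the explicit day-month-year `format` are assumptions of mine, not something the question states.

```python
import pandas as pd

# Hypothetical sample standing in for the CSV; column names come from the question.
data = pd.DataFrame({
    'Dia de mes': [1, 1, 15, 28],
    'Mes de any': [1, 1, 6, 12],
})

# Vectorized string concatenation: no apply, no lambda, no u prefix.
data['Date'] = data['Dia de mes'].astype('str') + '-' + data['Mes de any'].astype('str')

# Assumption: the year is 2013 and the order is day-month.
# An explicit format spares pandas from having to infer one.
data['Date'] = pd.to_datetime(data['Date'] + '-2013', format='%d-%m-%Y')

# Number of rows (accidents) per date, then the average per date.
accidents = data.groupby('Date').size()
print(accidents.mean())
```

If the real file covers a different year or uses a different day/month order, the appended year and the `format` string would need to change accordingly.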


What is vectorized code?

Simply put, vectorized code automatically applies an operation to every element of an array. Say you have a list of numbers and you want to add 1 to each element:

# Regular Python
a_list = [1, 2, 3, 4]

for i in range(len(a_list)):
    a_list[i] += 1


# Vectorized code
import numpy as np

an_array = np.array([1, 2, 3, 4])
an_array += 1

Aside from being more succinct, vectorized code is also much faster for long arrays, since it uses highly optimized C loops instead of native Python loops. Python is not a language known for its performance.
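
If you want to check the speed difference on your own machine, a rough sketch with `timeit` could look like this. The array size and repeat count are arbitrary choices for illustration, and the actual timings will vary by machine.

```python
import timeit

import numpy as np

n = 200_000  # arbitrary size chosen for illustration

a_list = list(range(n))
an_array = np.arange(n)


def python_loop():
    # Native Python: one interpreted iteration per element.
    return [x + 1 for x in a_list]


def vectorized():
    # NumPy: the element-wise loop runs in compiled code.
    return an_array + 1


# Both approaches compute the same values.
assert python_loop() == vectorized().tolist()

loop_time = timeit.timeit(python_loop, number=5)
vec_time = timeit.timeit(vectorized, number=5)
print(f"Python loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```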

Vectorized code is pervasive in pandas and NumPy. Learn how to use it effectively.
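
As a small illustration of what that looks like in pandas, here are a few whole-column operations. The DataFrame and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({'name': ['ana', 'pau', 'marta'], 'age': [31, 45, 27]})

# Arithmetic and comparisons apply to the whole column at once.
df['age_next_year'] = df['age'] + 1
df['over_30'] = df['age'] > 30

# The .str accessor offers vectorized string methods.
df['name_upper'] = df['name'].str.upper()

print(df)
```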