Iterate through 200 datasets

I have 200 datasets and I want to iterate through them, pick random rows from each, and add them to another (empty) dataset, using iloc and values. When I execute the code it does not give an error, but it also does not add anything to the empty dataset. However, when I run a single command to check whether the random row has any value, it gives this error: AttributeError: 'str' object has no attribute 'iloc'.

My code is given below:

Tdata = np.zeros([20, 6])
k = 0
for j in range(200):
    for j1 in range(0, 20):
        Tdata[k:k+1, :] = ('dataset'+j).iloc[random.randint(100)].values
        k += 1

('dataset'+j) is meant to select the different datasets. The names of my datasets are dataset0, dataset1, dataset2, … and they are already defined.

Answer

There are multiple issues with your code.

1. Using str in place of the actual DataFrame variable

You are trying to use .iloc on a string such as 'dataset1'. This won't work, since str has no attribute .iloc, which is exactly what the error tells you. (Note also that 'dataset'+j concatenates a string with an integer, which raises a TypeError; you would need 'dataset'+str(j).)

Since you want to look up DataFrames by their variable names, you can use eval() to interpret the string as a variable name. NOTE: BE EXTRA CAREFUL while using eval(), and never call it on untrusted input. Please read about the dangers of using eval() before relying on it.
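A minimal illustration of the difference, using a small stand-in DataFrame:

```python
import pandas as pd

dataset0 = pd.DataFrame({"a": [1, 2, 3]})

name = 'dataset' + str(0)   # this is just the string 'dataset0'
# name.iloc[0]              # AttributeError: 'str' object has no attribute 'iloc'

df = eval(name)             # evaluates the string as a variable name
print(df.iloc[0])           # now works: the first row of dataset0
```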

2. Sampling 20 rows from each DataFrame.

If you are trying to get 20 rows by combining for j1 in range(0, 20): with random.randint(100), there is a better way that avoids the loop entirely: use np.random.randint(0, 100, (n,)) to get n random indexes at once, in this case np.random.randint(0, 100, (20,)). (Note that the built-in random.randint takes exactly two arguments and returns a single integer; the array form only works with NumPy's np.random.randint.)

Or, even better, simply use df.sample(20) to sample 20 rows from a given DataFrame.

3. Forcing updates through views of the array

It's better to use a different approach than forcing an update through a slice with Tdata[k:k+1,:] = .... Since you want to combine DataFrames, just collect the samples in a list and pass them to pd.concat, which is much more useful.

Here is sample code in a simplified setting that should guide you toward what you are looking for.

import pandas as pd
import numpy as np

dataset0 = pd.DataFrame(np.random.random((100,3)))
dataset1 = pd.DataFrame(np.random.random((100,3)))
dataset2 = pd.DataFrame(np.random.random((100,3)))
dataset3 = pd.DataFrame(np.random.random((100,3)))

## Using np.random.randint
## samples = [eval('dataset'+str(i)).iloc[np.random.randint(0, 100, (3,))] for i in range(4)]

## Using df.sample()
samples = [eval('dataset'+str(i)).sample(3) for i in range(4)]

## Change -
## 1. the 3 to 20 for 20 samples per DataFrame
## 2. range(4) to range(200) to work with 200 DataFrames

output = pd.concat(samples)
print(output)
           0         1         2
42  0.372626  0.445972  0.030467
20  0.376201  0.445504  0.835735
56  0.214806  0.083550  0.582863
85  0.691495  0.346022  0.619638
24  0.290397  0.202795  0.704082
16  0.112986  0.013269  0.903917
51  0.521951  0.115386  0.632143
73  0.946870  0.531085  0.437418
98  0.745897  0.718701  0.280326
56  0.679253  0.010143  0.124667
4   0.028559  0.769682  0.737377
84  0.857553  0.866464  0.827472

4. Storing 200 DataFrames??

Last but not least, you should ask yourself why you are storing 200 DataFrames as individual variables, only to sample some rows from each.

Why not try to:

  1. Read each of the files iteratively
  2. Sample rows from each
  3. Store them in a list of dataframes
  4. pd.concat once you are done iterating over the 200 files

… instead of saving 200 DataFrames and then doing the same.
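The steps above can be sketched as follows. This assumes the files are CSVs named dataset0.csv, dataset1.csv, … (the demo writes three small stand-in files into a temp directory; swap in your own paths and reader):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Demo setup: write three small CSV files as stand-ins for your 200 real files
tmpdir = Path(tempfile.mkdtemp())
for i in range(3):
    pd.DataFrame(np.random.random((100, 3)), columns=["a", "b", "c"]).to_csv(
        tmpdir / f"dataset{i}.csv", index=False
    )

samples = []
for path in sorted(tmpdir.glob("dataset*.csv")):  # 1. read each file iteratively
    df = pd.read_csv(path)
    samples.append(df.sample(20))                 # 2./3. sample and collect in a list

output = pd.concat(samples, ignore_index=True)    # 4. concat once at the end
print(output.shape)
```

With your 200 files of 20 sampled rows each, the final shape would be (4000, 6) instead of the (60, 3) in this demo.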