I’m trying to convert columns of input data (of type Numpy Array) into different types (float and U30)

I have spent many many hours attempting this question, including asking tutors and other peers for their feedback. Nobody seems to be able to figure it out. I am quite desperate as this assignment is due today and i need this code in order to test my other code. The question is as follows.

“The datatype of the array created by NumPy in Task 1 is unstructured. This is because, in the default setting, NumPy decides the datatype for each value. Also, the output in Task 1 contains the header row that may not be required in our assignment. So, remove the header row and convert all the columns to type float (i.e., “float”) apart from the columns specified by the input parameter indexes (mentioned below). Also, the remaining columns which are not mentioned in indexes should be in Unicode of length 30 characters (i.e., “<U30”). Finally, every row is converted as a type tuple (e.g., tuple(i) for i in data).

Write a function data_type_format(data, indexes) that can complete the above-mentioned task, where, the input data is a NumPy array and indexes contains the column indices (in list) which are to be converted into <U30 data type, and the remaining columns in data which are not in indexes will be converted to type float.”

This is the code i used in Task 1:

    import numpy as np
    import pandas as pd
    def load_mydata(filename):
    "return"
    df = pd.read_csv(filename,delimiter=',',quotechar="",quoting=3,header=None)
    df = df.iloc[:,[0,1,2,5,8,9,10,11,12]]
    ndarray = np.array(df,dtype='U30')
    return ndarray

this is some of the codes i have tried:

    def data_type_format(data,indexes):
    "return"
    list=[float(element) if indx not in indexes else 
    str(ord(element)) if len(element)<30 else 
    element for indx,element in data] 
    return tuple(list)

    import csv
    def data_type_format(data,indexes):
    "return csv file data as tuple formatted"
    list=[]
    df = pd.read_csv(data,delimiter=',',quotechar="",quoting=3,header=None)
    df = df.iloc[:,[0,1,2,5,8,9,10,11,12]]
    ndarray = np.array(df,dtype='U30')
    for df in data:
        if df in indexes:
            df[indexes] = df[indexes].astype(float)
        else:
            df[indexes]= df[indexes].astype('U30')
    list.append(df)
    return tuple(list)

The code needs to be tested in a python shell with the following test.

    data = load_mydata("location_review_data.csv")
    data = data_type_format(data, [0, 1, 2, 3]) 
    print(data[0])   
    #gives result

(‘Thu Jan 30 06:58:27 +0000 2020’, ‘98675’, ‘1’, ‘22847’, 0.421, 0.442, 0.452, 0.397, 0.357)]1

i have tried many different versions and none are working ! the cvs file used : https://drive.google.com/file/d/1Qd-xFaK4z5ijSHTmmtR7VhJyFaBzjFvO/view?usp=sharing

Answer

With the downloaded csv I was able to get

In [181]: import pandas as pd
In [182]: df = pd.read_csv('../Downloads/location_review_data.csv')
In [183]: df
Out[183]: 
                           created_at  user_ID  review_ID   latitude  longitude  ...    sad  happy surprise  disgust    joy
0      Thu Jan 30 06:58:27 +0000 2020    98675          1  30.235909 -97.795140  ...  0.421  0.442    0.452    0.397  0.357
1      Thu Jan 30 06:03:24 +0000 2020    67730          2  30.269103 -97.749395  ...  0.469  0.408    0.488    0.377  0.350
2      Thu Jan 30 06:19:25 +0000 2020    11576          3  30.255731 -97.763386  ...  0.542  0.361    0.276    0.270  0.424
3      Thu Jan 30 06:16:38 +0000 2020    87911          4  30.263418 -97.757597  ...  0.418  0.499    0.352    0.367  0.291
4      Thu Jan 30 06:08:09 +0000 2020   148147          5  30.274292 -97.740523  ...  0.242  0.632    0.532    0.501  0.199
...                               ...      ...        ...        ...        ...  ...    ...    ...      ...      ...    ...
19995  Fri Feb 07 23:47:25 +0000 2020    21591      19996  42.357491 -71.058885  ...  0.437  0.431    0.609    0.446  0.260
19996  Fri Feb 07 23:21:33 +0000 2020    37809      19997  42.357437 -71.058470  ...  0.432  0.412    0.549    0.452  0.219
19997  Fri Feb 07 23:34:48 +0000 2020    39721      19998  42.357553 -71.057779  ...  0.597  0.314    0.311    0.304  0.428
19998  Sat Feb 08 00:18:22 +0000 2020    20873      19999  42.357142 -71.058455  ...  0.452  0.479    0.518    0.442  0.252
19999  Fri Feb 07 23:13:02 +0000 2020    33846      20000  42.357069 -71.058551  ...  0.391  0.490    0.548    0.459  0.199

[20000 rows x 13 columns]

A structured array from that:

In [184]: data = df.to_records()
In [185]: data.shape
Out[185]: (20000,)
In [186]: data.dtype
Out[186]: dtype((numpy.record, [('index', '<i8'), ('created_at', 'O'), ('user_ID', '<i8'), ('review_ID', '<i8'), ('latitude', '<f8'), ('longitude', '<f8'), ('location_ID', '<i8'), ('friend_count', 'O'), ('follower_count', 'O'), ('sad', '<f8'), ('happy', '<f8'), ('surprise', '<f8'), ('disgust', '<f8'), ('joy', '<f8')]))

The ‘O’ dtype fields contain strings as Python objects, which is pandas practice.

With numpy’s csv reader:

In [192]: data = np.genfromtxt('../Downloads/location_review_data.csv', names=True, dtype=None, delimiter=',', encoding=None, comments=None)
In [193]: data.shape
Out[193]: (20000,)
In [194]: data.dtype
Out[194]: dtype([('created_at', '<U30'), ('user_ID', '<i8'), ('review_ID', '<i8'), ('latitude', '<f8'), ('longitude', '<f8'), ('location_ID', '<i8'), ('friend_count', '<U8'), ('follower_count', '<U22'), ('sad', '<f8'), ('happy', '<f8'), ('surprise', '<f8'), ('disgust', '<f8'), ('joy', '<f8')])

The ‘#’ in some of the columns gave me problems, since that is the default comment character. I had to inactivate that.

first 3 lines/records:

In [195]: data[:3]
Out[195]: 
array([('Thu Jan 30 06:58:27 +0000 2020', 98675, 1, 30.23590912, -97.79513958,  22847, '#####', '############', 0.421, 0.442, 0.452, 0.397, 0.357),
       ('Thu Jan 30 06:03:24 +0000 2020', 67730, 2, 30.26910295, -97.74939537, 420315, '#####', '############', 0.469, 0.408, 0.488, 0.377, 0.35 ),
       ('Thu Jan 30 06:19:25 +0000 2020', 11576, 3, 30.25573099, -97.76338577, 316637, '#####', '##########', 0.542, 0.361, 0.276, 0.27 , 0.424)],
      dtype=[('created_at', '<U30'), ('user_ID', '<i8'), ('review_ID', '<i8'), ('latitude', '<f8'), ('longitude', '<f8'), ('location_ID', '<i8'), ('friend_count', '<U8'), ('follower_count', '<U22'), ('sad', '<f8'), ('happy', '<f8'), ('surprise', '<f8'), ('disgust', '<f8'), ('joy', '<f8')])