Error: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte in Google Colab

import PyPDF4
from google.colab import files
fileReader = PyPDF4.PdfFileReader('ITC-1.pdf')
for i in range(2, fileReader.numPages):

sentences = []
while s.find('.') != -1:
    index = s.find('.')
    s = s[index+1:]

text_ds = tf.data.TextLineDataset('ITC-1.pdf').filter(lambda x: tf.cast(tf.strings.length(x), bool))
inverse_vocab = vectorize_layer.get_vocabulary()

The last line in the code above raises the error. I read several posts to understand what it means, but none of the suggested solutions worked for me. I cannot use my local machine because I need access to GPUs. Please suggest a workaround. Thanks!

PS: I am following the code here; the only difference is the way I read the file. If there are better ways to do it, please let me know!


import pdfplumber
from tensorflow.keras.layers.experimental import preprocessing
import tensorflow as tf

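f = open('test.txt', 'w', encoding='utf-8')

# Extract the text of every page and write it to a plain-text file.
with pdfplumber.open('test.pdf') as pdf:
    for page in pdf.pages:
        # extract_text() may return None for pages without a text layer.
        f.write((page.extract_text() or '') + '\n')
f.close()

layer = preprocessing.TextVectorization()
# Build the dataset from the text file, skipping empty lines.
text_ds = tf.data.TextLineDataset('test.txt').filter(lambda x: tf.cast(tf.strings.length(x), bool))

# The layer has no vocabulary until it has been adapted to the data.
layer.adapt(text_ds.batch(1024))
inverse_vocab = layer.get_vocabulary()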

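A PDF is a binary file, so passing it straight to TextLineDataset yields raw bytes rather than text, which is most likely why decoding them as UTF-8 fails. Extract the text first. You could do something like this: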

  1. Read the PDF using pdfplumber.
  2. Write the text of each page to a text file.
  3. Create the dataset from that text file.
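
Step 2 is where things can still go wrong, so here is a minimal sketch of it using only the standard library. The helper names write_pages_to_text and read_non_empty_lines are made up for illustration, and the page contents are a hard-coded list standing in for whatever the PDF library returns per page. It skips pages with no extractable text and writes the file explicitly as UTF-8, so reading it back later should not depend on the platform default encoding.

```python
import os
import tempfile


def write_pages_to_text(pages, path):
    """Write each page's text to `path` as UTF-8, one page after another.

    `pages` is any iterable of strings or None values (a hypothetical
    stand-in for the per-page output of a PDF library).
    """
    with open(path, "w", encoding="utf-8") as f:
        for text in pages:
            if text:  # skip pages with no extractable text (None or "")
                f.write(text + "\n")


def read_non_empty_lines(path):
    """Return the non-empty lines of a UTF-8 text file, without newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Example with made-up page contents instead of a real PDF.
sample_pages = ["First page.\nStill first page.", None, "", "Second page."]
out_path = os.path.join(tempfile.mkdtemp(), "test.txt")
write_pages_to_text(sample_pages, out_path)
lines = read_non_empty_lines(out_path)
```

With pdfplumber, the same helper could be fed a generator such as (page.extract_text() for page in pdf.pages) inside the with block, and the resulting file can then be passed to TextLineDataset as in step 3.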