Python: Count word frequencies and put them into multiple dictionaries

I’m new to Python and have a text file like this:

Doc 1
aaa bbb ccc ddd ...

Doc 2
eee fff ggg hhh ...

Doc 3
aaa ggg iii kkk ...

...

Doc 11
eee ttt uuu zzz ...

Basically, I want to count the term frequency for each document and store the counts in 11 separate dictionaries (e.g. for Doc 1, {'aaa': 10, 'bbb': 5, …}), then build a term-document matrix at the end. My current code is as follows:

# split the text file into 11 documents (paragraphs)
f = open('filename.txt', 'r')
data = f.read()
docs = data.split("\n\n")

# create 11 tf dictionaries
dictstr = 'tf'
dictlist = [dictstr + str(i) for i in range(10)]

for i in range(10):
    for line in docs[i]:
        tokens = line.split()
        for term in tokens:
            term = term.lower()
            term = term.replace(',', '')
            term = term.replace('"', '')
            term = term.replace('.', '')
            term = term.replace('/', '')
            term = term.replace('(', '')
            term = term.replace(')', '')

            if not term in dict['tfi']:
                dict['tfi'][term] = 1

            else:
                dict['tfi'][term] += 1

The last if/else step fails and I’m stuck there. Can anyone tell me how to fix it? (I’d rather not use other packages such as pandas.) Thank you!
The txt resource’s here

Answer

This code reads the file you provided, removes the unwanted characters in a single pass (instead of creating a new string for every .replace call), and stores the word counts in a dict called result. The keys are derived from the document headers ('XXX9' -> 'tf9'), and the values are collections.Counter objects holding the word counts.

>>> import re
... from collections import Counter
... 
... with open('filename.txt', 'r') as f:
...     data = f.read().lower()
... 
... clean_data = re.sub(r'[,"./()]', '', data)
... 
... result = {}
... for line in clean_data.splitlines():
...     if not line:
...         continue  # skip blank lines
...     elif line.startswith('xxx'):
...         doc_num = 'tf{}'.format(line[3:])
...     else:
...         result[doc_num] = Counter(line.split())
... 
>>> list(result.keys())
['tf7', 'tf10', 'tf5', 'tf2', 'tf9', 'tf4', 'tf11', 'tf3', 'tf6', 'tf8', 'tf1']

>>> for k, v in list(result['tf1'].items())[:15]:
...     print("'{}': {}".format(k, v))
... 
'class': 1
'then': 1
'emerge': 1
'industry': 1
'common': 1
'ourselves': 2
'models': 1
'short': 1
'mgi': 1
'it': 1
'actionable': 1
'time': 1
'why': 1
'theory': 1
'equip': 2
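
If you would rather stay close to your original loop, here is a minimal sketch using plain dicts only. In your code, dict is the built-in type rather than a dictionary you created, 'tfi' is a literal string rather than a name built from i, and "for line in docs[i]" walks the string one character at a time. Keeping the dictionaries in a list avoids having to generate variable names. The sample string and the helper name count_terms are made up for illustration, and I'm assuming each block starts with a header line followed by the text.

```python
import re


def count_terms(text):
    """Return a plain dict mapping each term in `text` to its count."""
    counts = {}
    # Lowercase, then strip the same punctuation the question removes.
    cleaned = re.sub(r'[,"./()]', '', text.lower())
    for term in cleaned.split():
        if term not in counts:
            counts[term] = 1
        else:
            counts[term] += 1
    return counts


# Hypothetical sample standing in for the real file contents.
sample = "Doc 1\naaa bbb aaa\n\nDoc 2\neee fff (eee)\n\nDoc 3\nAaa, ggg."

tf_list = []  # tf_list[0] holds the counts for the first document, and so on
for block in sample.split("\n\n"):
    lines = block.strip().splitlines()
    if len(lines) < 2:
        continue  # skip empty or header-only blocks
    body = " ".join(lines[1:])  # assumption: the first line is the header
    tf_list.append(count_terms(body))

print(tf_list[0])
```

With the real file you would replace sample with the result of f.read(); the loop itself stays the same.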

Let me know if any changes need to be made to help answer your question!
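
For the term-document matrix mentioned in the question, one possible sketch without any extra packages is a list of rows, one row per term and one column per document. The small result dict below is a hypothetical stand-in for the one built from the file; the names doc_ids, vocabulary and matrix are my own choices.

```python
from collections import Counter

# Hypothetical stand-in for the `result` dict built from the file.
result = {
    'tf1': Counter({'aaa': 2, 'bbb': 1}),
    'tf2': Counter({'eee': 2, 'fff': 1}),
    'tf3': Counter({'aaa': 1, 'ggg': 1}),
}

# Sort document keys by their numeric suffix so 'tf10' comes after 'tf9'.
doc_ids = sorted(result, key=lambda k: int(k[2:]))

# One row per distinct term, in alphabetical order.
vocabulary = sorted(set().union(*result.values()))

# matrix[r][c] is the count of vocabulary[r] in document doc_ids[c].
# A Counter returns 0 for a missing term, so no membership check is needed.
matrix = [[result[doc][term] for doc in doc_ids] for term in vocabulary]

for term, row in zip(vocabulary, matrix):
    print(term, row)
```

Sorting the keys numerically matters because a plain string sort would put 'tf10' and 'tf11' before 'tf2'.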
