I am going to use nltk.tokenize.word_tokenize on a cluster where my account has a very limited disk quota. At home, I downloaded all NLTK resources with nltk.download(), but as I found out, that takes ~2.5 GB, which seems like overkill to me. Could you suggest the minimal (or near-minimal) set of dependencies for nltk.tokenize.word_tokenize? So far I have seen nltk.download('punkt'), but I am not sure whether it is sufficient and how large it is. What exactly should I run to make it work?
You are right: you need the Punkt Tokenizer Models. They take about 13 MB, and nltk.download('punkt') should do the trick.