I've run into a problem with Stanford CoreNLP's sentence annotator. My input text contains several sentences, but in some places there is no whitespace after the period. For example:
Dog loves cat.Cat loves mouse. Mouse hates everybody.
So when I use SentenceAnnotator, I get 2 sentences:
Dog loves cat.Cat loves mouse.
Mouse hates everybody.
Here is my code:
Annotation doc = new Annotation(t);
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(doc);
List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
I also tried adding the property
but it had no effect.
Maybe I’m missing something? Thanks!
UPD: I also tried tokenizing the text with PTBTokenizer:
// Type the tokenizer so tokenize() returns List<Word> rather than an unchecked List<String>
PTBTokenizer<Word> ptbTokenizer = new PTBTokenizer<>(
    new FileReader(classLoader.getResource("simplifiedParagraphs.txt").getFile()),
    new WordTokenFactory(),
    "untokenizable=allKeep,tokenizeNLs=true,ptb3Escaping=true,strictTreebank3=true,unicodeEllipsis=true");
List<Word> words = ptbTokenizer.tokenize();
but the tokenizer treats cat.Cat as a single word and doesn't split it.
This is a pipeline: the sentence splitter identifies sentence boundaries among the tokens provided by the tokenizer. It only groups adjacent tokens into sentences; it doesn't try to merge or split them.
As you found, I think the ssplit.boundaryTokenRegex property would tell the sentence splitter to end a sentence when it sees “.” as a token, but this doesn’t help in cases where the tokenizer hasn’t split the “.” from the surrounding text into a separate token.
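To make the point concrete, here is a toy model of a splitter that only groups tokens. It is a sketch for illustration, not CoreNLP's actual implementation; the class name ToySentenceGrouper and its group method are made up for this example. A sentence ends only when a whole token matches the boundary regex, so a fused token like cat.Cat never triggers a boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only (hypothetical class, not part of CoreNLP).
class ToySentenceGrouper {

    // Groups tokens into sentences; a sentence ends after any token
    // that matches boundaryRegex in full.
    static List<List<String>> group(List<String> tokens, String boundaryRegex) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.matches(boundaryRegex)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without a final boundary
        }
        return sentences;
    }

    public static void main(String[] args) {
        // "cat.Cat" arrives from the tokenizer as ONE token.
        List<String> fused = Arrays.asList(
                "Dog", "loves", "cat.Cat", "loves", "mouse", ".",
                "Mouse", "hates", "everybody", ".");
        System.out.println(group(fused, "\\.").size()); // 2

        // Once the period is its own token, the boundary is found.
        List<String> split = Arrays.asList(
                "Dog", "loves", "cat", ".", "Cat", "loves", "mouse", ".",
                "Mouse", "hates", "everybody", ".");
        System.out.println(group(split, "\\.").size()); // 3
    }
}
```

No choice of boundary regex changes the first result, because the regex is only ever tested against complete tokens.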
You will need to either:
- preprocess your text (insert a space after “cat.”),
- postprocess your tokens or sentences to split cases like this, or
- find/develop a tokenizer that can split “cat.Cat” into three tokens.
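The first option could be sketched roughly as follows, using only java.util.regex. The class name PeriodSpacer and the rule itself (a period directly between a lowercase letter and an uppercase letter) are assumptions chosen for this example, not anything CoreNLP provides. The rule is a heuristic and will also fire on things like mixed-case domain names.

```java
import java.util.regex.Pattern;

// Hypothetical preprocessing helper; not part of CoreNLP.
class PeriodSpacer {

    // A period immediately preceded by a lowercase letter and
    // immediately followed by an uppercase letter.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=\\p{Ll})\\.(?=\\p{Lu})");

    // Inserts a space after each such period; other text is left unchanged.
    static String insertMissingSpaces(String text) {
        return MISSING_SPACE.matcher(text).replaceAll(". ");
    }

    public static void main(String[] args) {
        String raw = "Dog loves cat.Cat loves mouse. Mouse hates everybody.";
        // Prints: Dog loves cat. Cat loves mouse. Mouse hates everybody.
        System.out.println(insertMissingSpaces(raw));
    }
}
```

The cleaned string can then be passed to new Annotation(...) as before. Decimals such as 3.5 and abbreviations such as U.S.A. are not touched, since the characters around their periods do not match the lowercase/uppercase pattern.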
None of the standard English tokenizers, which are typically intended to be used with newspaper text, have been developed to handle this kind of text.
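If you would rather post-process the tokens, a minimal sketch might look like the following. It works on plain strings rather than CoreNLP token objects, and the class name FusedTokenSplitter and its pattern are hypothetical. It splits at most one fused period per token (the last one), which is enough for cases like cat.Cat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of CoreNLP.
class FusedTokenSplitter {

    // Text ending in a lowercase letter, a period, then text
    // starting with an uppercase letter.
    private static final Pattern FUSED =
            Pattern.compile("(.*\\p{Ll})\\.(\\p{Lu}.*)");

    // Replaces each fused token with three tokens: left part, ".", right part.
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            Matcher m = FUSED.matcher(token);
            if (m.matches()) {
                out.add(m.group(1));
                out.add(".");
                out.add(m.group(2));
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [loves, cat, ., Cat, loves]
        System.out.println(split(Arrays.asList("loves", "cat.Cat", "loves")));
    }
}
```

With real CoreNLP tokens you would also need to carry over character offsets and other annotations, which this sketch does not attempt.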