Regex in Python to detect ellipsis

I have a large text corpus that I want to preprocess lightly before training a Word2Vec model on it. In some cases a word is omitted because of ellipsis, for example:

But seeing them playing to seven- and eight-year-olds is beautiful

or

The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous

Now I want to restore the omitted words (year-olds and independence, respectively). This is what I wrote:

re.sub(r'- (and|to|or)( [^ -]+?){1,2}-(.+?)( |$|\n)', r'-\3 \1\2-\3\4', text)

But it doesn’t work: if there is more than one word between and/or/to and the hyphenated word, only one of those words appears in the output. My desired outputs are:

But seeing them playing to seven-year-olds and eight-year-olds is beautiful

and

The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous
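
The symptom described above is consistent with how Python's re module treats a quantified capturing group: each repetition overwrites the previous capture, so only the last one survives. The following is a minimal sketch of that behavior; the shortened sample text and the variable names are illustrative assumptions, not part of the original attempt.

```python
import re

text = "pre- and then post-independence civil war"

# Quantifier applied to the capturing group itself:
# group 2 is overwritten on every repetition, so only the last word is kept.
repeated = re.search(r"- (and|to|or)( [^ -]+?){1,2}-(.+?)( |$|\n)", text)

# Quantifier moved onto a non-capturing group inside one capturing group:
# group 2 now spans all repetitions.
wrapped = re.search(r"- (and|to|or)((?: [^ -]+?){1,2})-(.+?)( |$|\n)", text)

print(repr(repeated.group(2)))  # ' post'
print(repr(wrapped.group(2)))   # ' then post'
```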

Answer

You can use

re.sub(r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))', r'\2\1', text)

Details:

  • \b- – a hyphen that is preceded by a word char
  • (\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*)) – Group 1:
    • \s+ – one or more whitespaces
    • (?:and|to|or) – one of the words "and", "to" or "or"
    • (?:\s+\w+)* – zero or more occurrences of one or more whitespaces followed by one or more word chars
    • \s+ – one or more whitespaces
    • \w+ – one or more word chars
    • (-\w[\w-]*) – Group 2: a hyphen, a word char and then zero or more word or hyphen chars.
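
To make the group layout concrete, here is a small sketch that prints what each group captures for one input and then applies a replacement that puts Group 2 before Group 1. The shortened sample text and the variable names are assumptions chosen for illustration.

```python
import re

# Group 1: everything after the dangling hyphen, up to and including the hyphenated word.
# Group 2: the shared suffix, starting at the hyphen of that word.
pattern = re.compile(r"\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))")

text = "pre- and then post-independence civil war"
match = pattern.search(text)

print(repr(match.group(0)))  # '- and then post-independence'
print(repr(match.group(1)))  # ' and then post-independence'
print(repr(match.group(2)))  # '-independence'

# The dangling hyphen is consumed by the match, and Group 2 supplies a new one.
print(pattern.sub(r"\2\1", text))  # pre-independence and then post-independence civil war
```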

See the Python demo:

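import re
texts = ['But seeing them playing to seven- and eight-year-olds is beautiful', 'The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous']
# Group 1: text after the dangling hyphen; Group 2: the shared suffix (e.g. "-year-olds")
rx = r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))'
for text in texts:
    # Emit the suffix (Group 2) right after the first word, then the rest (Group 1)
    print( re.sub(rx, r'\2\1', text) )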

Output:

But seeing them playing to seven-year-olds and eight-year-olds is beautiful
The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous
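
For corpus preprocessing, the same substitution could be wrapped in a helper and the pattern compiled once. This is only a sketch: the names SUSPENDED_HYPHEN and expand_suspended_hyphens are hypothetical, and the whitespace tokenization is an assumption standing in for whatever tokenizer feeds the Word2Vec training step.

```python
import re

# Compile once so the pattern is reused for every sentence in the corpus.
SUSPENDED_HYPHEN = re.compile(r"\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))")

def expand_suspended_hyphens(sentence: str) -> str:
    """Copy the shared suffix onto the word that ends with a dangling hyphen."""
    return SUSPENDED_HYPHEN.sub(r"\2\1", sentence)

corpus = [
    "But seeing them playing to seven- and eight-year-olds is beautiful",
    "A sentence without any hyphens stays the same",
]

# Whitespace splitting is an assumption; replace it with your own tokenizer.
tokenized = [expand_suspended_hyphens(s).split() for s in corpus]
print(tokenized[0])
```

Sentences that contain no dangling hyphen pass through unchanged, so the helper can be applied to every line of the corpus.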