Python: split a comment on emoticons (non-word characters)

I am trying to split comments that contain text, numbers, and emojis. I want generic code that breaks a comment into parts wherever an emoji occurs.

For example:

comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"

The output should be something like:

output_1 = ["This is", "my comment"]
output_2 = ["Another comment to", "parse"]

I have been thinking that I could split on special characters only, but that might leave behind the “O” in “:O” or the “v” in “:v”.


You may try matching on the pattern (?<!\S)\w+\S?(?: \w+\S?)*, which finds each run of space-separated words, where every word may be followed by one optional non-whitespace character (such as a punctuation mark).

import re

inp = ["This is :) my comment :O", ">:O Another comment to :v parse"]
for i in inp:
    matches = re.findall(r'(?<!\S)\w+\S?(?: \w+\S?)*', i)
    print(matches)

This prints:

['This is', 'my comment']
['Another comment to', 'parse']

Here is an explanation of the regex pattern being used:

(?<!\S)        assert that what precedes the word is either whitespace
               or the start of the string
\w+            match a word
\S?            followed by zero or one non-whitespace character
               (such as a punctuation symbol)
(?: \w+\S?)*   zero or more further space/word/symbol sequences
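
The breakdown above can also be written directly into the code. The following is a minimal sketch, not a definitive implementation: it spells the same pattern out with re.VERBOSE so that each piece carries its own comment. The names TEXT_RUN and split_on_emoticons are hypothetical, introduced here only for illustration, and the third sample comment is made up to show how trailing punctuation is handled.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so each piece can be
# commented. In verbose mode unescaped whitespace is ignored, so the literal
# space between words is written as [ ].
TEXT_RUN = re.compile(r"""
    (?<!\S)      # preceded by whitespace or the start of the string
    \w+          # a word
    \S?          # optionally one trailing non-whitespace character
    (?:          # then, zero or more times:
        [ ]\w+   #   a single space and another word
        \S?      #   with its own optional trailing character
    )*
""", re.VERBOSE)


def split_on_emoticons(comment):
    """Return the runs of plain words found in comment (hypothetical helper)."""
    return TEXT_RUN.findall(comment)


print(split_on_emoticons("This is :) my comment :O"))
# ['This is', 'my comment']
print(split_on_emoticons(">:O Another comment to :v parse"))
# ['Another comment to', 'parse']
print(split_on_emoticons("Nice work, really! :D thanks"))
# ['Nice work, really!', 'thanks']
```

Note that an emoticon made only of word characters, such as "XD", looks like an ordinary word to this pattern and is kept in the output, so inputs like that would need extra handling.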