I am trying to parse comments that contain text, numbers, and emojis, and I want to write generic code that splits each comment into parts wherever an emoji occurs.
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
The output should be something like:
output_1 = ["This is", "my comment"]
output_2 = ["Another comment to", "parse"]
I have been thinking that I could split on special characters only, but that might leave behind the “O” in “:O” or the “v” in “:v”.
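To make that concern concrete, here is a minimal sketch of the naive approach, assuming "special characters" means anything that is neither a word character nor whitespace. The helper name strip_special is only illustrative.

```python
import re

def strip_special(comment):
    # Hypothetical helper: remove every character that is neither
    # a word character (\w) nor whitespace (\s).
    return re.sub(r"[^\w\s]", "", comment)

print(strip_special("This is :) my comment :O"))
# This is  my comment O
print(strip_special(">:O Another comment to :v parse"))
# O Another comment to v parse
```

The letters that were part of the emoticons survive, so the comment cannot be split cleanly afterwards.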
You may try matching on the pattern
(?<!\S)\w+\S?(?: \w+\S?)*, which finds each run of space-separated words, where every word may end in one optional non-whitespace character (such as a punctuation mark).
import re

inp = ["This is :) my comment :O", ">:O Another comment to :v parse"]
for i in inp:
    # Collect every run of words that does not start inside an emoticon
    matches = re.findall(r'(?<!\S)\w+\S?(?: \w+\S?)*', i)
    print(matches)
This prints:
['This is', 'my comment']
['Another comment to', 'parse']
Here is an explanation of the regex pattern being used:
(?<!\S)        assert that what precedes the word is either whitespace or the start of the string
\w+            match a word
\S?            followed by zero or one non-whitespace character (such as a punctuation symbol)
(?: \w+\S?)*   zero or more further word/symbol sequences, each preceded by a single space
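If you want to reuse this, the breakdown above could be written as a compiled, commented pattern. This is only a sketch: the names EMOTICON_FREE_RUN and split_comment are made up for illustration, and the literal space is written as [ ] because re.VERBOSE ignores bare whitespace in the pattern.

```python
import re

# Hypothetical name; same pattern as above, spelled out with re.VERBOSE.
EMOTICON_FREE_RUN = re.compile(
    r"""
    (?<!\S)      # start of string, or preceded by whitespace
    \w+          # a word
    \S?          # optional single trailing non-whitespace character
    (?:          # then zero or more repetitions of:
        [ ]\w+   #   one space followed by a word
        \S?      #   optional single trailing non-whitespace character
    )*
    """,
    re.VERBOSE,
)

def split_comment(comment):
    # Hypothetical helper: return the word runs found between emoticons.
    return EMOTICON_FREE_RUN.findall(comment)

print(split_comment("This is :) my comment :O"))
print(split_comment(">:O Another comment to :v parse"))
```

Note that this relies on emoticons starting with a non-word character. An emoticon made only of letters, such as XD, would be treated as an ordinary word and stay in the output.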