Regular Expression Ending with a certain character should avoid multiple occurrence of that character before

I have data that looks something like this:

<workorder id = "124"
       issue = "broken hood"
       level = "minor"
       comment = " This will be some random text <imp>random text<imp>
         <role>Important<role> So this is goingto be fixed!"
>
</workorder> Some more random text

I need to capture everything from the opening ‘<workorder’ up to the closing ‘>’ of the tag. The problem is that my regular expression stops at the ‘>’ of the second imp tag. See the figure for more details.


I am testing my regular expression on regex101 with the Python flavor and the global, single line, and multiline flags. Single line means that the . operator also matches line breaks.
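For readers reproducing this outside regex101, here is a minimal sketch of what those two flags do in Python's `re` module, assuming "single line" corresponds to `re.DOTALL` and "multiline" to `re.MULTILINE`. The sample string is made up for illustration.

```python
import re

# Hypothetical two-line sample, not taken from the work order data.
text = "first>\nsecond>"

# Without DOTALL, "." does not match "\n", so the match cannot span lines.
no_dotall = re.search(r"first.*second", text)

# With DOTALL (regex101's "single line" flag), "." also matches "\n".
with_dotall = re.search(r"first.*second", text, re.DOTALL)

# With MULTILINE, "$" matches at the end of every line,
# not only at the end of the whole string.
line_ends = re.findall(r">$", text, re.MULTILINE)
string_end_only = re.findall(r">$", text)

print(no_dotall, with_dotall.group(0), line_ends, string_end_only)
```

With both flags on, `.*?>$` is free to cross line breaks and will accept any ‘>’ that happens to sit at the end of a line.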

Here is my regular expression

 *(<workorder.*?>$)(.?)

There is a space before the first asterisk. Is there a way to capture everything until the ‘>’ before the ?
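The behaviour can be reproduced in plain Python. This is only a sketch: the data is a shortened copy of the first sample, and the flags are assumed to map to `re.DOTALL` and `re.MULTILINE`.

```python
import re

# Shortened copy of the first sample (some attributes left out).
data = (
    '<workorder id = "124"\n'
    '       comment = " This will be some random text <imp>random text<imp>\n'
    '         <role>Important<role> So this is goingto be fixed!"\n'
    '>\n'
    '</workorder> Some more random text'
)

# The pattern from the question, including the leading space.
pattern = re.compile(r" *(<workorder.*?>$)(.?)", re.DOTALL | re.MULTILINE)

match = pattern.search(data)
captured = match.group(1)
print(captured)
```

The lazy `.*?` stops at the first ‘>’ that is followed by an end of line, and that is the one closing the second `<imp>`, well inside the comment attribute.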

The data may also look like this. Here the ‘>’ is right after the ” character:

<workorder id = "124"
       issue = "broken hood"
       level = "minor"
       comment = " This will be some random text <imp>random text<imp>
         <role>Important<role> So this is goingto be fixed!">
</workorder> Some more random text

Or like this. Here the ‘>’ is right after the / character:

<workorder id = "124"
       issue = "broken hood"
       level = "minor"
       comment = " This will be some random text <imp>random text<imp>
         <role>Important<role> So this is going to be fixed!"/> 
Some more random text

Or like this. Here the ‘>’ is right after the / character, but on the next line:

<workorder id = "124"
       issue = "broken hood"
       level = "minor"
       comment = " This will be some random text <imp>random text<imp>
         <role>Important<role> So this is going to be fixed!"
/> 
Some more random text

Answer

An XML/HTML parser may be a better fit for this. If you want a regex, you can try this:

(<workorder[\s\S]*?(?:</workorder>|/>))

Demo here.

Where

  • Outer (...) – Capture the result
  • <workorder – Match the starting string
  • [\s\S]*? – Match any characters, including line breaks, in a non-greedy way so that you won’t be spanning multiple workorder groups.
  • (?:</workorder>|/>) – Match the ending string whether it is </workorder> or />.
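As a rough check, the pattern can be run with Python's `re` module against two of the layouts from the question. The sample strings are shortened, and the helper name `first_workorder` is hypothetical, introduced only for this sketch.

```python
import re

# The pattern from the answer, with the backslashes in the character class.
WORKORDER = re.compile(r"(<workorder[\s\S]*?(?:</workorder>|/>))")

# Shortened samples modelled on the variants in the question.
with_closing_tag = (
    '<workorder id = "124"\n'
    '       comment = " text <imp>random text<imp>\n'
    '         <role>Important<role> fixed!"\n'
    '>\n'
    '</workorder> Some more random text'
)
self_closing = (
    '<workorder id = "124"\n'
    '       comment = " text <imp>random text<imp>\n'
    '         <role>Important<role> fixed!"\n'
    '/> \n'
    'Some more random text'
)

def first_workorder(text):
    """Return the first workorder block in text, or None if there is none."""
    m = WORKORDER.search(text)
    return m.group(1) if m else None

closed = first_workorder(with_closing_tag)
selfclosed = first_workorder(self_closing)
print(closed)
print(selfclosed)
```

Because `[\s\S]` matches line breaks on its own, no single line flag is needed here. One caveat: the match ends at the first `</workorder>` or `/>` it finds, so a comment that itself contains `/>` would cut the capture short.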