How to handle this ‘or’ situation with Python regex, if it’s even possible

I’m trying to implement a pattern in Python that recognizes either of two situations: the area code has parentheses around it (in which case any number of leading or trailing spaces is allowed), or there are no parentheses (in which case there is a hyphen between the area code and the next three digits). Specifically:

  • (123) 456-7890 is valid
  • 123-456-7890 is valid
  • (123) 456-7890 is valid

(123-456-7890 is NOT valid, because there is an opening ‘(’ but no closing parenthesis. Either both parentheses are present or neither is. If neither, a hyphen is required between 123 and 456. If both, there is no hyphen, but any number of spaces may appear between 123 and 456.

123 456-7890 is NOT valid, because without parentheses there must be a hyphen between 123 and 456.

I originally wrote: pattern = re.compile(r'^ *(\(?[0-9]{3}\)?)-? *([0-9]{3})-?([0-9]{4}) *$')

but this doesn’t work, because it can’t enforce the both-or-none rule for the parentheses.

I also tried an ‘or’ with groups, but I’m getting weird output in the results:

pattern = re.compile(r'^ *(([0-9]{3}-?)|(\([0-9]{3}\) *))([0-9]{3})-?([0-9]{4}) *$')

result = pattern.findall(input_string)

Much help appreciated!

Answer

The character | is the logical ‘or’ operator in regex, so you can try:

r"(?:\(\d{3}\) *?|\d{3}-)\d{3}-\d{4}"
  • The part \(\d{3}\) *?|\d{3}- matches either three digits in parentheses (possibly followed by spaces) or three digits followed by a dash.

  • (?: ) is a non-capturing group – it indicates that the pattern inside should be matched, but it should not be returned on its own by re.findall(). It is needed here, so that the operator | is applied only to the two subpatterns inside.
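Here is a minimal sketch of how this could be tested against the examples from the question. The names FIND_PATTERN, VALIDATE_PATTERN and is_valid_phone are made up for illustration. The surrounding " *" for leading and trailing spaces is an assumption taken from the question's own attempts, and fullmatch() is used so that the whole string has to fit the pattern. That matters for validation: without anchoring, findall() will happily pick the valid-looking tail out of an invalid string.

```python
import re

# The pattern from the answer, unanchored: finds phone numbers inside text.
FIND_PATTERN = re.compile(r"(?:\(\d{3}\) *?|\d{3}-)\d{3}-\d{4}")

# Assumed variant for validation: optional spaces on both sides, and
# fullmatch() below requires the entire string to match.
VALIDATE_PATTERN = re.compile(r" *(?:\(\d{3}\) *|\d{3}-)\d{3}-\d{4} *")


def is_valid_phone(text: str) -> bool:
    """Return True if the whole string is a phone number in either format."""
    return VALIDATE_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        "(123) 456-7890",    # parentheses, one space
        "123-456-7890",      # no parentheses, hyphen
        "(123)   456-7890",  # parentheses, several spaces
        "(123-456-7890",     # opening parenthesis only
        "123 456-7890",      # no parentheses, no hyphen
    ]
    for sample in samples:
        print(repr(sample), is_valid_phone(sample))

    # No capturing groups in the pattern, so findall() returns whole matches.
    print(FIND_PATTERN.findall("call (123) 456-7890 or 123-456-7890"))

    # Unanchored search still finds a number inside the invalid string.
    print(FIND_PATTERN.findall("(123-456-7890"))
```

The first three samples should be accepted and the last two rejected. Because the only group is non-capturing, findall() returns plain strings rather than tuples of groups, which is probably the source of the weird output in the question.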