Regex statement to replace spaces with underscore between words starting with Capital Letter

With input like:

Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic.

I am expecting an output like:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

A solution I’ve tried, using a positive lookbehind (with Python’s re package), is:

re.sub(r"(?<=\w)\s([A-Z])", r"_\1", above_string)

But here, because \w matches any word character, I get this output:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is_Novak_Djokovic.

Naturally, I can’t make it work using r"(?<=[A-Z]\w*)\s([A-Z])", because

error: look-behind requires fixed-width pattern

I have to apply this regex to a huge number of very diverse articles, so I can’t afford a loop or a brute-force str.replace. Could anyone suggest an efficient solution?

Answer

If you do not care about all Unicode uppercase letters, you can use

import re
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( re.sub(r"\b([A-Z]\w*)\s+(?=[A-Z])", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

Details:

  • \b – a word boundary
  • ([A-Z]\w*) – Group 1 (\1): an uppercase ASCII letter followed by zero or more word chars
  • \s+ – one or more whitespace characters
  • (?=[A-Z]) – a positive lookahead that matches a location immediately followed by an uppercase ASCII letter, without consuming it.
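To see why the lookahead matters for names with three or more parts, here is a small sketch that compares the pattern above with a consuming variant. The names LOOKAHEAD, CONSUMING, joined and partial are made up for this illustration, and the consuming pattern is only a hypothetical counterexample, not something suggested in the question.

```python
import re

# Pattern from the answer: the next capital letter is only peeked at,
# so it stays available as the start of the following match.
LOOKAHEAD = re.compile(r"\b([A-Z]\w*)\s+(?=[A-Z])")

# Hypothetical variant for comparison: the next capital letter is consumed,
# so the word it begins can no longer start a match of its own.
CONSUMING = re.compile(r"\b([A-Z]\w*)\s+([A-Z])")

text = "Rafael Nadal Parera is also a tennis player."

joined = LOOKAHEAD.sub(r"\1_", text)
partial = CONSUMING.sub(r"\1_\2", text)

print(joined)   # Rafael_Nadal_Parera is also a tennis player.
print(partial)  # Rafael_Nadal Parera is also a tennis player.
```

Compiling the pattern once with re.compile and reusing it across articles is a common way to keep the per-article overhead low; the substitution itself is a single pass over each string.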

If you need to support all Unicode letters, it is advisable to pip install regex and use

import regex
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( regex.sub(r"\b(\p{Lu}\w*)\s+(?=\p{Lu})", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

Here, \p{Lu} matches any Unicode uppercase letter.
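If installing a third-party package is not an option, a rough standard-library approximation is possible. The sketch below is a swapped-in technique, not the regex-module solution above: it matches every word followed by whitespace and lets a callback decide, via str.isupper(), whether both neighbours start with an uppercase letter. The names WORD_GAP, _join_if_capitalized and join_capitalized are hypothetical, and str.isupper() is not exactly equivalent to \p{Lu} (it also accepts a few non-letter characters that carry an uppercase property).

```python
import re

# Any word, the whitespace after it, and a peek at the first character
# of the next word (captured inside the lookahead, not consumed).
WORD_GAP = re.compile(r"\b(\w+)\s+(?=(\w))")

def _join_if_capitalized(match):
    """Replace the whitespace with "_" only when both words start uppercase."""
    word, next_char = match.group(1), match.group(2)
    if word[0].isupper() and next_char.isupper():
        return word + "_"
    return match.group(0)  # leave everything else untouched

def join_capitalized(text):
    return WORD_GAP.sub(_join_if_capitalized, text)

sample = "Émile Zola and Ángel Di María met Roger Federer."
print(join_capitalized(sample))
# Émile_Zola and Ángel_Di_María met Roger_Federer.
```

Because the callback runs for every word-plus-whitespace pair, this is likely slower than the pure-pattern versions on large inputs, so it is probably worth benchmarking before using it on a big corpus.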