How to retrieve the captured substrings from a capturing group that may repeat?

I’m sorry, I find it difficult to express this question in English, so let’s go directly to a simple example.

Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I’m using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".

The subject string doesn’t have to contain exactly 4 items. For example, it could be "one:two:three", in which case $1 and $2 will be "one" and "three", respectively. Again, the middle item is missing.

What is the correct pattern to use in this case? By the way, I’m going to use PCRE2 from C++ code, so Perl’s built-in split function is not available. Thanks.

Answer

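In PCRE, a capturing group that repeats keeps only the text matched by its last iteration, which is why $2 ends up as "durian". If the input consists strictly of the items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can instead match each item individually with the regex pattern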

[^:]+

which matches a run of consecutive characters other than :, that is, a substring up to the next : (or the end of the string). It may need a capturing group as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.

In C++ there are different ways to approach this. One is to use std::regex_iterator:

#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>

int main()
{
    std::string str{R"(one:two:three)"};
    std::regex r{R"([^:]+)"};

    std::vector<std::string> result{};

    auto it = std::sregex_iterator(str.begin(), str.end(), r);
    auto end = std::sregex_iterator();
    for(; it != end; ++it) {
        auto match = *it;
        result.push_back(match[0].str());
    }

    std::cout << "Input string: " << str << '\n';
    for(const auto& i : result)
        std::cout << i << '\n';
}

This prints the input string followed by one, two and three, each on its own line.

One can also use std::regex_search, even though it returns at the first match, by iterating over the string and moving the search start past each match:

#include <string>
#include <regex>
#include <iostream>

int main()
{
    std::string str{"one:two:three"};
    std::regex r{"[^:]+"};

    std::smatch res;

    // Start searching at the beginning, then resume after each match.
    std::string::const_iterator search_beg( str.cbegin() );
    while ( std::regex_search( search_beg, str.cend(), res, r ) )
    {
        std::cout << res[0] << '\n';
        search_beg = res.suffix().first;
    }
    std::cout << '\n';
}

(With this string and regex we don’t need raw string literals, so I’ve removed them here.)
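A third option from the standard library, not part of the original answer, is std::sregex_token_iterator, which can fill a container with all matches directly. The sketch below wraps it in a helper; the name split_items is hypothetical and chosen only for illustration, and the pattern is the same [^:]+ as above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (name is an assumption, not from the answer):
// returns every run of non-colon characters in `input`, in order.
std::vector<std::string> split_items(const std::string& input)
{
    static const std::regex item{R"([^:]+)"};

    // Submatch index 0 selects the whole match for each hit.
    std::sregex_token_iterator first(input.begin(), input.end(), item, 0);
    std::sregex_token_iterator last;  // default-constructed: end of sequence

    // Each sub_match converts to std::string when the vector is built.
    return std::vector<std::string>(first, last);
}
```

Calling split_items("one:two:three") would give a vector holding one, two and three. Since the items end up in a container, the number of items does not need to be known in advance, which is what the numbered captures $1, $2, ... could not provide.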


This question was initially tagged perl (and not c++), and its text explicitly mentioned Perl (it still does), so the original version of this answer used Perl:

/([^:]+)/g

The /g “modifier” is for “global,” to find all matches. The // are pattern delimiters.

When this expression is bound (=~) to a variable holding the target string and evaluated in list context, the whole expression returns the list of matches, which can thus be assigned directly to an array variable.

my @captures = $string =~ /[^:]+/g;

(When it is used exactly like this, the capturing () aren’t needed.)