Regular Expression in Java. Unexpected behaviour

I am trying to match numbers, but I need to treat them differently depending on the words that follow.

I want to match every number that is not followed by a temperature term like °C or by a time specification. My regular expression looks like this:

(((\d+?)(\s*)(-)(\s*))?(\d+)(\s*))++(?!minuten|Minuten|min|Min|Stunden|stunden|std|Std|°C| °C)

Here is an Example: http://regexr.com?33jeg

The behavior there is what I expected, but Java does the following. Each index is the capturing group number, shown here for the match on 4:

0: "4 "
1: "4 "
2: "0 - "
3: "0"
4: " "
5: "-"
6: " "
7: "4"
8: " "
9: "°C"

Note that I match every string separately. The match for the 5 looks like this:

0: "5 "
1: "5 "
2: "null"
3: "null"
4: "null"
5: "null"
6: "null"
7: "5"
8: " "
9: "null"

This is how I'd like the other match to look. The unwanted behavior only occurs when a "-" appears somewhere in the string before the match.

My Java Code is the following:

public static void adaptPortionDetails(EList<Step> steps, double multiplicator){
    
    // Backslashes must be doubled inside a Java string literal.
    String portionMatcher = "(((\\d+?)(\\s*)(\\-)(\\s*))?(\\d+)(\\s*))++(?!°C|Grad|minuten|Minuten|min|Min|Stunden|stunden|std|Std)";
    
    for (int i = 0; i < steps.size(); i++) {
        Matcher matcher = Pattern.compile(portionMatcher).matcher(
                steps.get(i).getDescription());
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            printGroups(matcher);
            String newValue1Str;
            if (matcher.group(3) == null){
                newValue1Str = "";
                System.out.println("test");
            }else{
                double newValue1 = Integer.parseInt(matcher.group(3)) * multiplicator;
                newValue1Str = Fraction.getFraction(newValue1).toProperString();
            }
            double newValue2 = Integer.parseInt(matcher.group(7)) * multiplicator;
            String newValue2Str = Fraction.getFraction(newValue2).toProperString();
            
            
            matcher.appendReplacement(sb, newValue1Str + "$4$5$6" + newValue2Str + "$8");
        }
        matcher.appendTail(sb);
        steps.get(i).setDescription(sb.toString());
    }
}

Hope you can tell what I’m missing.

Answer

This seems to be a bug (or feature?) in Java’s implementation. It doesn’t seem to reset the captured text for the capturing group when the matching has to be redone from the next index.

This test reveals the discrepancy in behavior between Java's regex engine and PHP's PCRE.

  • Regex: (\d+(-\d+)?){1}+(?!x)
  • Input: 34 34-43x 78 90
  • Java result: 3 matches (34, 78, 90). The 2nd capturing group of the 2nd match (78) contains -43, left over from the failed attempt at 34-43x. The 2nd capturing group captures nothing for the 1st and 3rd matches.
  • PHP result: The same 3 matches, but the 2nd capturing group captures nothing for any of them. In PHP's PCRE implementation, the captured text of the capturing groups is reset when the match has to be redone.

I tested this on JRE 6 Update 37 and JRE 7 Update 11.
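If you want to reproduce the test on your own JVM, a minimal harness could look like the sketch below. The class and method names are made up for illustration. I have not checked every Java release, so treat the value printed for group 2 as something to observe on your version rather than a guaranteed result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness; the names are not from the original post.
class CaptureResetDemo {
    // The test regex from above, with backslashes doubled for a Java string literal.
    static final Pattern TEST = Pattern.compile("(\\d+(-\\d+)?){1}+(?!x)");

    // Collects the full text of every match found in the input.
    static List<String> matchedTexts(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TEST.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        Matcher m = TEST.matcher("34 34-43x 78 90");
        while (m.find()) {
            // group(2) is the value to watch: a non-null value on a match
            // without "-" means a stale capture survived a failed attempt.
            System.out.println("match=" + m.group() + " group2=" + m.group(2));
        }
    }
}
```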

Same result for this, just to prove the point that captured text is not reset when matching has to be redone:

  • Regex: a(\d+(-\d+)?){1}+(?!x)
  • Input: a34 a34-43x a78 a90
  • PHP result

Some comments about your regex

I think the ++ should be {1}+, since it seems that you want to modify one number or one range of numbers at a time, while keeping the match possessive so that unwanted numbers are discarded instead of being partially matched through backtracking.
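To see the difference between the two quantifiers, here is a small sketch using a simplified pattern (not your full regex). The class name, method name, and the sample input "1 2 3" are made up for illustration. With ++ the whole run of numbers is swallowed as a single match, while {1}+ yields one match per number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and input are illustrative only.
class QuantifierDemo {
    // Returns the full text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Possessive "one or more": the group repeats across all numbers.
        System.out.println(findAll("(?:(\\d+)\\s*)++", "1 2 3"));
        // Possessive "exactly once": one number per match.
        System.out.println(findAll("(?:(\\d+)\\s*){1}+", "1 2 3"));
    }
}
```

With ++, a capturing group inside the repetition only keeps the text of its last iteration, so scaling each number separately in appendReplacement would not be possible.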

Workaround

The first group (the outermost capturing group), which captures everything (one number or a range of numbers), is always overwritten when a match is found, so you can rely on it. Check whether group 1 contains a - (with the contains method). If it does, capturing group 2 holds text captured by the current match, and you can use it. If it does not, ignore all the captured text in capturing group 2 and its nested capturing groups.
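A minimal sketch of that check, shown on the small test regex from above rather than your full pattern. The class and helper names are made up for illustration. In your regex the same test on group 1 would guard groups 2 to 6.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are not from the original post.
class RangeGroupWorkaround {
    static final Pattern TEST = Pattern.compile("(\\d+(-\\d+)?){1}+(?!x)");

    // Returns group 2 only when group 1 proves the current match contains
    // a range; otherwise group 2 may be stale and is treated as absent.
    static String trustedGroup2(Matcher m) {
        return m.group(1).contains("-") ? m.group(2) : null;
    }

    // Applies the check to every match in the input.
    static List<String> trustedGroup2ForAllMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TEST.matcher(input);
        while (m.find()) {
            out.add(trustedGroup2(m));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trustedGroup2ForAllMatches("34 34-43x 78 90"));
        System.out.println(trustedGroup2ForAllMatches("12-15 7"));
    }
}
```

Because the decision is based on group 1 alone, the result does not depend on whether the JVM resets the inner groups.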
