rlike regex not handling special characters

I am having difficulty in getting (java) spark’s rlike method to handle special characters. Normally a simple escape \ or quoting with Q and E would be sufficient for handling special characters in the regex but they don’t seem to work.

Take the example code below. It should show rows 2 & 3 as matching one of the items in alist ignoring case. However the code below only shows row 2 as matching.

How can I filter the rows of the dataset based on the “animal” columns match to one of the items in alist?

    StructType schema = new StructType(new StructField[]{
            new StructField("row_id", DataTypes.IntegerType, false, Metadata.empty()),
            new StructField("animal", DataTypes.StringType, false, Metadata.empty())
    });

    Dataset<Row> dataset = spark.createDataFrame(
            Arrays.asList(
                    RowFactory.create(1, "Bat"),
                    RowFactory.create(2, "Dog"),
                    RowFactory.create(3, "Cat (Type Not Stated)"),
                    RowFactory.create(4, "Other.")
            ), schema);


    List<String> alist = Arrays.asList(
            "\QDOG\E",
            "\QCat (Type Not Stated)\E");

    dataset = dataset.filter(dataset.col("animal").rlike(
            "(?i)\b("+String.join("|", alist)+")\b"
    ));

    dataset.show(5, false);

Answer

Your issue is not with the \Q or \E pattern, but with the \b word boundary pattern.

If I change your final rlike regexp by removing \b pattern as follow:

dataset = dataset.filter(dataset.col("animal").rlike(
  "(?i)("+String.join("|", alist)+")"
));

I get the two lines:

+------+---------------------+
|row_id|animal               |
+------+---------------------+
|2     |Dog                  |
|3     |Cat (Type Not Stated)|
+------+---------------------+