My problem is how to parse wildcard queries with Lucene so that the query term is passed through a TokenFilter.
I’m using a custom Analyzer with several filters (e.g. ASCIIFoldingFilter, but that’s only an example). My problem is that whenever Lucene’s QueryParser detects that one of the sub-queries is a WildcardQuery, it by design [1] ignores the Analyzer.
This means that a query for über is filtered correctly,
über -> uber
but a query for über* (with a wildcard) is not passed through a filter at all:
über* -> über*
Since all tokens are filtered on the index side, this obviously means that a wildcard query containing ü can never match anything.
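To make the mismatch concrete, here is a minimal sketch that needs no Lucene at all. It is only an illustration: `fold` is a hypothetical stand-in built on `java.text.Normalizer`, not the real ASCIIFoldingFilter (which handles many more characters), and `prefixMatches` only mimics a trailing-wildcard lookup against a single indexed term.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: this is NOT Lucene code. java.text.Normalizer stands in
// for LowerCaseFilter + ASCIIFoldingFilter (assumption made for this sketch).
class FoldingMismatchSketch {

    // Lowercase, decompose accented characters, then drop the combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // A query with a single trailing '*' matches when the term starts with the prefix.
    static boolean prefixMatches(String indexedTerm, String wildcardQuery) {
        String prefix = wildcardQuery.substring(0, wildcardQuery.length() - 1);
        return indexedTerm.startsWith(prefix);
    }

    public static void main(String[] args) {
        // Index side: every token goes through the filters ("\u00dc" is the letter Ü).
        String indexed = fold("\u00dcberraschung");

        // Query side, wildcard term left unfiltered ("\u00fc" is the letter ü).
        System.out.println(prefixMatches(indexed, "\u00fcber*"));

        // Query side, wildcard term filtered the same way as the index.
        System.out.println(prefixMatches(indexed, fold("\u00fcber*")));
    }
}
```

The unfiltered prefix can never line up with the folded term in the index, while the filtered one does; the wildcard character itself passes through the folding step untouched.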
Q: How do I force Lucene to filter the terms of wildcard queries, too? I’m looking for a way which would at least marginally re-use Lucene’s codebase 😉
Note: As an input I receive a query string, so building queries programmatically is not an option.
Note: I’m using Lucene 4.5.1.
[1] http://www.gossamer-threads.com/lists/lucene/java-user/14224
Context:
    // analyzer applies filters in Analyzer#createComponents(String, Reader)
    Analyzer analyzer = new CustomAnalyzer(Version.LUCENE_45);

    // I'm using org.apache.lucene.queryparser.classic.MultiFieldQueryParser
    QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_45, fields, analyzer);
    parser.setAllowLeadingWildcard(true);
    parser.setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

    // actual parsing of the input query
    Query query = parser.parse(input);
Answer
Ok, I found a solution: I’m extending QueryParser to override #getWildcardQuery(String, String). This way I can intercept and alter the term after a wildcard query is detected and before it is created:
    @Override
    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        String term = termStr;
        TokenStream stream = null;
        try {
            // we want only a single token and we don't want to lose special characters
            stream = new KeywordTokenizer(new StringReader(term));
            stream = new LowerCaseFilter(Version.LUCENE_45, stream);
            stream = new ASCIIFoldingFilter(stream);

            CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                term = charTermAttribute.toString();
            }
            stream.end(); // TokenStream contract: call end() once the stream is consumed
        } catch (IOException e) {
            // on failure, fall back to the unfiltered term
            LOGGER.debug("Failed to filter search query token {}", term, e);
        } finally {
            IOUtils.closeQuietly(stream);
        }
        return super.getWildcardQuery(field, term);
    }
This solution is based on similar questions:
Using a Combination of Wildcards and Stemming
How to get a Token from a Lucene TokenStream?
Note: in my code it’s actually a bit more convoluted, in order to keep all filters in a single location…
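One way the "single location" idea might be approached is to split the wildcard term on `*` and `?`, run only the literal chunks through whatever filter chain the index uses, and glue the pieces back together. The following is a rough sketch under loud assumptions: it is plain Java rather than Lucene code, `filterLiteralChunks` and `fold` are hypothetical names, `fold` is a `java.text.Normalizer` stand-in for the real filters, and backslash-escaped wildcards are not handled.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Illustration only: hypothetical helper, not part of Lucene.
class WildcardChunkSketch {

    // Applies chunkFilter to every run of literal characters and copies the
    // wildcard characters '*' and '?' through unchanged.
    // Assumption: escaped wildcards (e.g. "\*") are not treated specially.
    static String filterLiteralChunks(String term, UnaryOperator<String> chunkFilter) {
        StringBuilder out = new StringBuilder();
        StringBuilder chunk = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                if (chunk.length() > 0) {
                    out.append(chunkFilter.apply(chunk.toString()));
                    chunk.setLength(0);
                }
                out.append(c);
            } else {
                chunk.append(c);
            }
        }
        if (chunk.length() > 0) {
            out.append(chunkFilter.apply(chunk.toString()));
        }
        return out.toString();
    }

    // Stand-in for the shared filter chain (assumption): lowercase and strip diacritics.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00dc" is the letter Ü; only the literal chunks are passed to fold.
        System.out.println(filterLiteralChunks("\u00dcb?r*", WildcardChunkSketch::fold));
    }
}
```

In an overridden #getWildcardQuery, the `chunkFilter` argument would be the place to plug in the same filters the Analyzer applies, so that the chain is defined once. Whether that is cleaner than the KeywordTokenizer approach above depends on how the Analyzer is built.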
I still feel that there should be a better solution, though.