I’m trying to come up with the fastest way to make search suggestions. At first I thought a Levenstein UDF function combined with a mysql table would do the job. But using levenshtein, mysql would have to go over every row in the table (tons of words) which would make the query really slow.
Now I recently installed and started to use Sphinx (http://sphinxsearch.com/) for fulltext searching mainly because of its performance and tight mysql integration with SphinxSE.
So I asked myself if I can implement a “did you mean” algorithm using sphinx to boost performance somehow, and I think I found a simple one.
Basically i take all the keywords I want to correct, put a space between each letter, then put it in the sphinx index. If the word is ‘keyword’ it becomes ‘k e y w o r d’. Now when the user enters a word I split it in to letters and search in the sphinx index for a record (I just need one) that matches any of the letters provided. The best part is that sphinx is very good on calculating relevance (weight) of the matched rows, so the best match will always have the biggest weight (I think). It also accounts for word (letters in my case) positions so the best match will be in that order.
With the sphinx query I get the most similar word in my keywords list. Then I check it with php using the extended Levenshtain distance which accounts for rearranged letters https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance . If the string distance is lower than 2 (and != 0) then suggest the word. Otherwise don’t suggest anything.
Is there a problem with my idea? Something I didn’t think of? Any expected glitches with the sphinx query, and quirks with the sphinx relevance calculation which woudn’t give the best match? Please correct me if I’m mistaking somewhere.
I can’t see a problem with your idea. Go for it. Just to point out that your method is only relevant if you want to override the builtin behaviour that is very similar to LD.
For example, with sphinx 1.10-beta, you can specify min_infix_len and expand_keywords and use sphinx’s builtin weighting methods (BM25 and some proprietary code) for good results. http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/
Don’t forget to memcache these queries, and create a warm-up script.