[MarkLogic Dev General] Re: Double Metaphone in spellchecker and Levenshtein distance
Steve Mallen
Steve.Mallen at semantico.com
Thu May 15 02:44:51 PDT 2008
Hi Kelly,
Many thanks for the quick reply and info.
I guess that means that the dictionaries are only really suitable for
English words? If I create a French, German or Chinese dictionary
(based on content in those languages) then the Double Metaphone
algorithm isn't going to be as effective, unless the algorithm is
*based* on Double Metaphone but is more language-aware?
Does your reply also imply that there is no basic fuzzy search mechanism
in Mark Logic based on Levenshtein distance?
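
For concreteness, the kind of Levenshtein-based fuzzy lookup being asked
about can be sketched as below. This is an illustrative Python sketch only,
not XQuery and not anything MarkLogic is known to provide; the function
names `levenshtein` and `fuzzy_matches`, the `max_distance` parameter and
the sample word list are all made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, dictionary, max_distance=1):
    """Hypothetical helper: dictionary words within max_distance edits of term."""
    return [w for w in dictionary if levenshtein(term, w) <= max_distance]


# Hypothetical word list standing in for a dictionary or word lexicon.
words = ["foam", "roams", "roam", "room", "physics"]
print(fuzzy_matches("roam", words))  # ['foam', 'roams', 'roam', 'room']
```

This compares the term against every word, which is fine for a small
dictionary but would need some form of candidate pruning for a large
lexicon.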
Many thanks,
-Steve
Kelly Stirman wrote:
> The spell correction functionality in MarkLogic employs the Double
> Metaphone algorithm:
>
> http://en.wikipedia.org/wiki/Double_Metaphone
>
> This is a more modern and more sophisticated approach to phonetic
> matches than soundex.
>
> You can load one of the sample dictionaries on the developer site, your
> own, or use the word lexicon of your database to generate a list of
> terms that exist across your documents.
>
> Kelly
>
> -----Original Message-----
>
> Hi folks,
>
> I've been looking through the developer docs to try to find out if I can
> do fuzzy searching or any type of phonetic searching in XQuery with Mark
> Logic.
>
> Does anyone know if there are any functions to determine similarities and
> distance between strings - e.g. soundex, Levenshtein, metaphone?
>
> Specifically, I'd like to be able to do Lucene-style fuzzy searches
> based on Levenshtein distance (for example, in Lucene, a search for
> "roam~" will find words like "foam" and "roams"). The spellcheck module
> looks like it does something similar, but I'm not sure what the
> implementation is based on. How does it find words from a dictionary
> that are spelt similarly to the search term? Is there any developer
> control over this?
>
> I'd also like to be able to do phonetic searches, so that, for example,
> a search for "fiziks" would match "physics" since they are phonetically
> similar. A few relational databases support "soundex" searches, and
> Solr supports the use of various phonetic transcription algorithms. I
> guess that I could create an index of phonetic transcriptions during
> content load, and do lookups based on that, but it would be good if
> there was something I could use 'out-of-the-box'.
>
> Could anyone shed any light on this?
>
> Many thanks,
> -Steve
>
>
>
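
The idea raised in the original question, building an index of phonetic
transcriptions at content-load time and doing lookups against it, can be
sketched roughly as follows. This is an assumption-laden Python
illustration, not MarkLogic functionality: it uses classic American
Soundex rather than Double Metaphone, and the names `soundex`,
`build_phonetic_index` and `phonetic_lookup` are hypothetical.

```python
from collections import defaultdict

# Classic Soundex consonant groups; vowels, y, h and w carry no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Four-character American Soundex key, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def build_phonetic_index(words):
    """Map each phonetic key to the set of words that share it."""
    index = defaultdict(set)
    for w in words:
        index[soundex(w)].add(w)
    return index


def phonetic_lookup(index, term):
    """Return indexed words whose key matches the key of term."""
    return sorted(index.get(soundex(term), ()))


# Hypothetical content words gathered at load time.
index = build_phonetic_index(["Robert", "Rupert", "physics", "roam"])
print(phonetic_lookup(index, "Rubert"))  # ['Robert', 'Rupert']
```

Note that plain Soundex always keeps the first letter, so "fiziks" (F220)
and "physics" (P220) get different keys here; matching that pair needs a
more sophisticated phonetic algorithm than soundex.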