[MarkLogic Dev General] Re: Double Metaphone in spellchecker and Levenshtein distance

Steve Mallen Steve.Mallen at semantico.com
Thu May 15 02:44:51 PDT 2008


Hi Kelly,

Many thanks for the quick reply and info.

I guess that means the dictionaries are only really suitable for
English words?  If I create a French, German or Chinese dictionary
(based on content in those languages), then Double Metaphone isn't
going to be as effective, unless the implementation is only *based* on
Double Metaphone and is more language-aware?

Does your reply also imply that there is no basic fuzzy search mechanism
in Mark Logic based on Levenshtein distance?
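
For reference, Levenshtein distance counts the minimum number of
single-character insertions, deletions and substitutions needed to turn
one string into another. The following is a minimal Python sketch of that
idea, not MarkLogic's or Lucene's implementation; the names levenshtein,
fuzzy_matches and max_distance are illustrative assumptions, and the
fixed threshold of 1 only approximates what a "roam~" style query does.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert/delete/substitute)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, words, max_distance=1):
    """Hypothetical helper: words within max_distance edits of term."""
    return [w for w in words if levenshtein(term, w) <= max_distance]


if __name__ == "__main__":
    # Word list is made up for illustration, e.g. terms from a word lexicon.
    print(fuzzy_matches("roam", ["foam", "roams", "physics"]))
```

Scanning every word this way is linear in the size of the word list, so
this is only practical for small dictionaries; it also says nothing about
phonetic similarity, which is a separate problem.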

Many thanks,
-Steve

Kelly Stirman wrote:
> The spell correction functionality in MarkLogic employs the Double
> Metaphone algorithm: 
>
> http://en.wikipedia.org/wiki/Double_Metaphone
>
> This is a more modern and more sophisticated approach to phonetic
> matches than soundex.
>
> You can load one of the sample dictionaries on the developer site, your
> own, or use the word lexicon of your database to generate a list of
> terms that exist across your documents. 
>
> Kelly
>
> -----Original Message-----
>   
> Hi folks,
>
> I've been looking through the developer docs to try to find out if I can
> do fuzzy searching or any type of phonetic searching in XQuery with Mark
> Logic.
>
> Does anyone know if there are any functions to determine similarity and
> distance between strings - e.g. soundex, Levenshtein, metaphone?
>
> Specifically, I'd like to be able to do Lucene-style fuzzy searches
> based on Levenshtein distance (for example, in Lucene, a search for
> "roam~" will find words like "foam" and "roams").  The spellcheck module
> looks like it does something similar, but I'm not sure what the
> implementation is based on.  How does it find words from a dictionary
> that are spelt similarly to the search term?  Is there any developer
> control over this?
>
> I'd also like to be able to do phonetic searches, so that, for example, 
> a search for "fiziks" would match "physics" since they are phonetically 
> similar.  A few relational databases support "soundex" searches, and 
> SOLR supports the use of various phonetic transcription algorithms.  I 
> guess that I could create an index of phonetic transcriptions during 
> content load, and do lookups based on that, but it would be good if 
> there was something I could use 'out-of-the-box'.
>
> Could anyone shed any light on this?
>
> Many thanks,
> -Steve