[MarkLogic Dev General] Fwd: question about xdmp:encoding-language-detect

Mary Holstege mary.holstege at marklogic.com
Fri Mar 27 08:34:24 PDT 2015


On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <jakob.fix at gmail.com> wrote:

> Hello, I think this message got lost when the mailing list was down in
> February (or nobody has an answer ...)
>
> Thanks,
> Jakob.

The xdmp:encoding-language-detect function uses the ICU libraries to do
the detection. Serbian and Croatian are very closely related to each other
and share some orthographic features with Latvian (although, it must be
said, not a great deal of linguistic similarity). I think the ICU libraries
probably lack some of the linguistic sophistication of Google's backend.

It has nothing to do with the licensing options.
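
For anyone who wants to inspect the candidates directly, here is a minimal
sketch of calling the function and listing the first few results. The
document URI is hypothetical and the cutoff of five candidates is arbitrary;
the element names and namespace are taken from the output quoted below.

```xquery
xquery version "1.0-ml";

declare namespace eld = "xdmp:encoding-language-detect";

(: Hypothetical URI; point this at the document you want to test. :)
let $doc := fn:doc("/news/latvian-story.txt")
(: Each candidate is an encoding-language element holding an
   encoding, a language code and a score. :)
let $candidates := xdmp:encoding-language-detect($doc)
(: Look at the first five candidates only (arbitrary cutoff). :)
for $c in fn:subsequence($candidates, 1, 5)
return fn:concat(
  fn:string($c/eld:language), " ",
  fn:string($c/eld:encoding), " ",
  fn:string($c/eld:score)
)
```

Scores as close together as those in the output quoted below suggest that
the detector is not separating its candidates strongly.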

//Mary

>
> ---------- Forwarded message ----------
> From: Jakob Fix <jakob.fix at gmail.com>
> Date: Sat, Feb 28, 2015 at 10:59 PM
> Subject: question about xdmp:encoding-language-detect
> To: General Mark Logic Developer Discussion  
> <general at developer.marklogic.com>
>
>
> Hello,
>
> using ML7.0-3, the above function, given more than 3500 characters of
> Latvian news story text, returns Croatian twice and Serbian once in
> the top three results:
>
> <encoding-language xmlns="xdmp:encoding-language-detect">
>   <encoding>utf-8</encoding>
>   <language>hr</language>
>   <score>7.081</score>
> </encoding-language>
> <encoding-language xmlns="xdmp:encoding-language-detect">
>   <encoding>utf-8</encoding>
>   <language>hr</language>
>   <score>7.012</score>
> </encoding-language>
> <encoding-language xmlns="xdmp:encoding-language-detect">
>   <encoding>utf-8</encoding>
>   <language>sr</language>
>   <score>6.882</score>
> </encoding-language>
> ...
>
> and no Latvian in sight. Both Google Translate and detectlanguage.com
> return the correct result with sufficient confidence.
>
> Can someone explain the reason for the low confidence and the wrong
> detection? Do I need the right language pack (I'm playing around with
> the developer licence, which I thought was full-featured)? Is this
> something that needs training? The doc doesn't say so.
>
> Thanks!
>
> cheers,
> Jakob.



