[MarkLogic Dev General] Fwd: question about xdmp:encoding-language-detect

Jakob Fix jakob.fix at gmail.com
Fri Mar 27 08:44:35 PDT 2015


Thanks, Mary, for your quick reply. I understand the explanation, but
it doesn't resolve my initial problem.
Any idea how to work around this in the short term, and whether there
are improvements in the pipeline? Or is this not a high priority?

cheers,
Jakob.
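
PS: for anyone following the thread, here is a rough sketch of how the
result sequence quoted below can be inspected outside MarkLogic. It only
parses the three results from my original message and measures how close
their scores are; it does not call MarkLogic. The <results> wrapper
element and the helper names parse_results and relative_spread are my
own, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the quoted xdmp:encoding-language-detect output.
NS = {"d": "xdmp:encoding-language-detect"}

# The three results quoted in the original message, wrapped in a
# hypothetical <results> element so they parse as a single document.
SAMPLE = """<results>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>sr</language>
  <score>6.882</score>
</encoding-language>
</results>"""


def parse_results(xml_text):
    """Return a list of (encoding, language, score) tuples in document order."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.findall("d:encoding-language", NS):
        encoding = item.findtext("d:encoding", namespaces=NS)
        language = item.findtext("d:language", namespaces=NS)
        score = float(item.findtext("d:score", namespaces=NS))
        results.append((encoding, language, score))
    return results


def relative_spread(results):
    """Gap between the best and worst score, as a fraction of the best."""
    scores = [score for _, _, score in results]
    return (max(scores) - min(scores)) / max(scores)


if __name__ == "__main__":
    parsed = parse_results(SAMPLE)
    for encoding, language, score in parsed:
        print(encoding, language, score)
    print("relative spread:", relative_spread(parsed))
```

On the quoted data the top three scores lie within 3% of each other, which
is what I meant by a lack of confidence: no candidate clearly stands out.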


On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege
<mary.holstege at marklogic.com> wrote:
> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <jakob.fix at gmail.com> wrote:
>
>> Hello, I think this message got lost when the mailing list was down in
>> February (or nobody has an answer ...)
>>
>> Thanks,
>> Jakob.
>
> The xdmp:encoding-language-detect function uses the ICU libraries to do the
> detection. Serbian and Croatian are very closely related to each other and
> have some similar orthography to Latvian (although not a great deal of
> linguistic similarity, it must be said). I think the ICU libraries
> probably lack some of the linguistic sophistication of Google's backend.
>
> It has nothing to do with the licensing options.
>
> //Mary
>
>>
>> ---------- Forwarded message ----------
>> From: Jakob Fix <jakob.fix at gmail.com>
>> Date: Sat, Feb 28, 2015 at 10:59 PM
>> Subject: question about xdmp:encoding-language-detect
>> To: General Mark Logic Developer Discussion
>> <general at developer.marklogic.com>
>>
>>
>> Hello,
>>
>> Using ML 7.0-3, the above function, given more than 3500 characters of
>> Latvian news story text, returns Croatian twice and Serbian once in
>> the top three results:
>>
>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>   <encoding>utf-8</encoding>
>>   <language>hr</language>
>>   <score>7.081</score>
>> </encoding-language>
>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>   <encoding>utf-8</encoding>
>>   <language>hr</language>
>>   <score>7.012</score>
>> </encoding-language>
>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>   <encoding>utf-8</encoding>
>>   <language>sr</language>
>>   <score>6.882</score>
>> </encoding-language>
>> ...
>>
>> and no Latvian in sight. Google translate as well as
>> detectlanguage.com correctly and with sufficient self-assurance return
>> the correct result.
>>
>> Can someone explain the reason behind this lack of confidence and
>> the wrong detection? Do I need the right language pack (I'm
>> playing around with the developer licence, which I thought was
>> full-featured)? Is this something that needs training? The doc doesn't
>> say so.
>>
>> Thanks!
>>
>> cheers,
>> Jakob.
>> _______________________________________________
>> General mailing list
>> General at developer.marklogic.com
>> http://developer.marklogic.com/mailman/listinfo/general
>
>

