[MarkLogic Dev General] Fwd: question about xdmp:encoding-language-detect

Jakob Fix jakob.fix at gmail.com
Fri Mar 27 08:23:19 PDT 2015


Hello, I think this message got lost when the mailing list was down in
February (or nobody has an answer ...)

Thanks,
Jakob.


---------- Forwarded message ----------
From: Jakob Fix <jakob.fix at gmail.com>
Date: Sat, Feb 28, 2015 at 10:59 PM
Subject: question about xdmp:encoding-language-detect
To: General Mark Logic Developer Discussion <general at developer.marklogic.com>


Hello,

Using MarkLogic 7.0-3, xdmp:encoding-language-detect, given more than
3500 characters of Latvian news story text, returns Croatian twice and
Serbian once in the top three results:

<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>sr</language>
  <score>6.882</score>
</encoding-language>
...

and no Latvian in sight. Google Translate and detectlanguage.com both
confidently identify the text as Latvian.
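
To make the check above easy to repeat, here is a minimal Python sketch
that parses results of this shape and reports whether a given language
code appears. It only assumes the three results quoted above; the
<results> wrapper element and the helper names parse_results and
has_language are hypothetical, introduced so the snippet parses as one
XML document. It does not call MarkLogic itself.

```python
import xml.etree.ElementTree as ET

# Namespace carried by the result elements quoted above.
NS = {"el": "xdmp:encoding-language-detect"}

# The three quoted results, wrapped in a hypothetical <results> root
# (an assumption, needed only so the text is a single XML document).
SAMPLE = """<results>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>sr</language>
  <score>6.882</score>
</encoding-language>
</results>"""


def parse_results(xml_text):
    """Return (language, encoding, score) tuples, highest score first."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall("el:encoding-language", NS):
        rows.append((
            item.findtext("el:language", namespaces=NS),
            item.findtext("el:encoding", namespaces=NS),
            float(item.findtext("el:score", namespaces=NS)),
        ))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def has_language(rows, code):
    """True if any result row carries the given language code."""
    return any(lang == code for lang, _, _ in rows)


if __name__ == "__main__":
    rows = parse_results(SAMPLE)
    for lang, enc, score in rows:
        print(f"{lang}\t{enc}\t{score:.3f}")
    print("lv present:", has_language(rows, "lv"))
```

Run against the full result sequence rather than just the top three,
the same check would show whether Latvian is ranked low or missing
altogether.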

Can someone explain the reason for the low confidence scores and the
wrong detection? Do I need a specific language pack (I'm playing around
with the developer licence, which I thought was full-featured)? Is this
something that needs training? The documentation doesn't say so.

Thanks!

cheers,
Jakob.

