[MarkLogic Dev General] question about xdmp:encoding-language-detect

Jakob Fix jakob.fix at gmail.com
Fri Mar 27 09:09:19 PDT 2015


Thanks to you both for your answers. My concern is that I've tried two
other detection services on the exact same text sample that I used
with MarkLogic's language detection feature: Google's translation
service, the obvious choice, which detects the language automatically,
and detectlanguage.com, which provides an API. Both detected the
language correctly.
cheers,
Jakob.
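
A minimal sketch of the post-detection analysis Justin suggests in the
reply quoted below, in Python rather than XQuery. It parses the detection
output from the original post and discards candidates whose language the
source is known not to publish, falling back to the source's usual
language when nothing credible is left. The <results> wrapper element,
the function names, the allowed-language set, the fallback and the
min_score cutoff are all assumptions made for illustration; none of them
come from MarkLogic.

```python
import xml.etree.ElementTree as ET

# Namespace used by the elements in the output shown in the original post.
NS = {"d": "xdmp:encoding-language-detect"}

# Top three results copied from the original post. The <results> wrapper is
# an assumption added here so the sequence parses as a single XML document.
SAMPLE = """<results>
<encoding-language xmlns="xdmp:encoding-language-detect">
 <encoding>utf-8</encoding>
 <language>hr</language>
 <score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
 <encoding>utf-8</encoding>
 <language>hr</language>
 <score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
 <encoding>utf-8</encoding>
 <language>sr</language>
 <score>6.882</score>
</encoding-language>
</results>"""


def parse_candidates(xml_text):
    """Return (language, score) pairs in the order they appear."""
    root = ET.fromstring(xml_text)
    candidates = []
    for item in root.findall("d:encoding-language", NS):
        language = item.findtext("d:language", namespaces=NS)
        score = float(item.findtext("d:score", namespaces=NS))
        candidates.append((language, score))
    return candidates


def pick_language(candidates, allowed, fallback, min_score=0.0):
    """Pick the best-scoring candidate whose language is in `allowed`.

    `allowed` is the set of languages a given source is expected to publish;
    `fallback` is returned when no allowed candidate reaches `min_score`.
    All three are hypothetical, application-specific inputs.
    """
    kept = [(lang, score) for lang, score in candidates
            if lang in allowed and score >= min_score]
    if not kept:
        return fallback
    return max(kept, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    candidates = parse_candidates(SAMPLE)
    # Assumed: this news feed only ever carries Latvian or English.
    print(pick_language(candidates, allowed={"lv", "en"}, fallback="lv"))
```

With the three candidates from the original post and a feed assumed to
carry only Latvian and English, both hr and sr are filtered out and
pick_language returns the fallback, lv. This only helps when the source
really is that predictable; it does not make the detector itself any
more accurate.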


On Fri, Mar 27, 2015 at 5:01 PM, Justin Makeig
<Justin.Makeig at marklogic.com> wrote:
> Jakob,
> Are there any other markers specific to your domain that could help you triangulate? The built-in detection doesn't (and can't) know the context of your business. Some pre- or post-detection analysis might help you narrow down the candidates. For example, is a specific source known not to have Croatian or Serbian content, but might have Latvian? Are there entities (e.g. names or addresses) that are decent indicators of Latvian? I don't know the specifics of your app or content, but there might be other context that you could pull in to enhance the out-of-the-box identification.
>
> Justin
>
>
> --
> Justin Makeig
> Director, Product Management
> MarkLogic
> justin.makeig at marklogic.com
> +1 (650) 655-2387
>
>> On Mar 27, 2015, at 8:44 AM, Jakob Fix <jakob.fix at gmail.com> wrote:
>>
>> Thanks, Mary, for your quick reply. I understand the explanation,
>> but it doesn't resolve my initial problem.
>> Any idea how to solve this in the short term, and whether there are
>> improvements in the pipeline? Or is it not a high priority?
>>
>> cheers,
>> Jakob.
>>
>>
>> On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege
>> <mary.holstege at marklogic.com> wrote:
>>> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <jakob.fix at gmail.com> wrote:
>>>
>>>> Hello, I think this message got lost when the mailing list was down in
>>>> February (or nobody has an answer ...)
>>>>
>>>> Thanks,
>>>> Jakob.
>>>
>>> The xdmp:encoding-language-detect function uses the ICU libraries to do the
>>> detection. Serbian and Croatian are very closely related to each other and
>>> have some similar orthography to Latvian (although not a great deal of
>>> linguistic similarity, it must be said). I think the ICU libraries
>>> probably lack some of the linguistic sophistication of Google's backend.
>>>
>>> It has nothing to do with the licensing options.
>>>
>>> //Mary
>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Jakob Fix <jakob.fix at gmail.com>
>>>> Date: Sat, Feb 28, 2015 at 10:59 PM
>>>> Subject: question about xdmp:encoding-language-detect
>>>> To: General Mark Logic Developer Discussion
>>>> <general at developer.marklogic.com>
>>>>
>>>>
>>>> Hello,
>>>>
>>>> Using MarkLogic 7.0-3, xdmp:encoding-language-detect, given more than
>>>> 3500 characters of Latvian news story text, returns Croatian twice and
>>>> Serbian once in the top three results:
>>>>
>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>>  <encoding>utf-8</encoding>
>>>>  <language>hr</language>
>>>>  <score>7.081</score>
>>>> </encoding-language>
>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>>  <encoding>utf-8</encoding>
>>>>  <language>hr</language>
>>>>  <score>7.012</score>
>>>> </encoding-language>
>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>>  <encoding>utf-8</encoding>
>>>>  <language>sr</language>
>>>>  <score>6.882</score>
>>>> </encoding-language>
>>>> ...
>>>>
>>>> and no Latvian in sight. Google Translate and detectlanguage.com
>>>> both return the correct result, and with sufficient confidence.
>>>>
>>>> Can someone explain the reason behind this lack of confidence and
>>>> the wrong detection? Do I need the right language pack (I'm
>>>> playing around with the developer licence, which I thought was
>>>> full-featured)? Is this something that needs training? The doc doesn't
>>>> say so.
>>>>
>>>> Thanks!
>>>>
>>>> cheers,
>>>> Jakob.
>>>> _______________________________________________
>>>> General mailing list
>>>> General at developer.marklogic.com
>>>> http://developer.marklogic.com/mailman/listinfo/general

