[MarkLogic Dev General] question about xdmp:encoding-language-detect

Jakob Fix jakob.fix at gmail.com
Fri Mar 27 09:33:06 PDT 2015


Hello Alex, although I'm only a lowly member of this mailing list, I
feel honored that you've singled me out based on my last name and
asked me for help with unsubscribing, which I will promptly provide:

At the bottom of this email message you will find a link, which I will
reproduce here:

http://developer.marklogic.com/mailman/listinfo/general

Clicking on this link will lead you to a web page where, in the bottom
half, you will find a section dedicated to unsubscribing from this
list. It should be sufficient to enter your email address and click
"unsubscribe".

Actually I've just done it for you. You should have received a message
with a link that you should click in order to confirm your request to
unsubscribe.

It has been a pleasure helping ...

"Jakob'll Fix it" (tm)


On Fri, Mar 27, 2015 at 5:12 PM,  <alex.karman at karmancorp.com> wrote:
> Hello Jakob Fix,
>
> Can you please remove me from the list? I ask you personally, because I have
> asked generally for over two years, but I am still on it. Your last name is
> "Fix," so maybe you can actually "Fix" it :-)
>
> Thanks,
> --Alex
>
> -------- Original Message --------
> Subject: Re: [MarkLogic Dev General] question about
> xdmp:encoding-language-detect
> From: Jakob Fix <jakob.fix at gmail.com>
> Date: Fri, March 27, 2015 12:09 pm
> To: MarkLogic Developer Discussion <general at developer.marklogic.com>
>
> Thanks for your respective answers. My concern is that I've tried two
> other detection services: the obvious one, Google's translation
> service, which detected the language automatically, and another one,
> detectlanguage.com, which provides an API. Both correctly detected the
> language in the exact same text sample that I used with MarkLogic's
> language detection feature.
> cheers,
> Jakob.
>
>
> On Fri, Mar 27, 2015 at 5:01 PM, Justin Makeig
> <Justin.Makeig at marklogic.com> wrote:
>> Jakob,
>> Are there any other markers that are specific to your domain that could
>> help you triangulate? The built-in detection doesn't (and can't) know the
>> context of your business. Some pre- or post-detection analysis might help
>> you narrow the candidates. For example, is a specific source known not to have
>> Croatian or Serbian content, but might have Latvian? Are there entities
>> (e.g. names, addresses, etc.) that are decent indicators of Latvian? I don't
>> know the specifics of your app or content, but there might be other context
>> that you could pull in to enhance the out-of-the-box identification.
>>
>> Justin
>>
>>
>> --
>> Justin Makeig
>> Director, Product Management
>> MarkLogic
>> justin.makeig at marklogic.com
>> +1 (650) 655-2387
>>
>>> On Mar 27, 2015, at 8:44 AM, Jakob Fix <jakob.fix at gmail.com> wrote:
>>>
>>> Thanks Mary for your quick reply. It's an explanation that I
>>> understand, but this doesn't resolve my initial problem.
>>> Any idea how to solve this in the short term, and whether there are
>>> improvements in the pipeline? Or is it not a high priority?
>>>
>>> cheers,
>>> Jakob.
>>>
>>>
>>> On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege
>>> <mary.holstege at marklogic.com> wrote:
>>>> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <jakob.fix at gmail.com>
>>>> wrote:
>>>>
>>>>> Hello, I think this message got lost when the mailing list was down in
>>>>> February (or nobody has an answer ...)
>>>>>
>>>>> Thanks,
>>>>> Jakob.
>>>>
>>>> The xdmp:encoding-language-detect function uses the ICU libraries to do
>>>> the detection. Serbian and Croatian are very closely related to each
>>>> other and have some similar orthography to Latvian (although not a great
>>>> deal of linguistic similarity, it must be said). I think the ICU libraries
>>>> probably lack some of the linguistic sophistication of Google's backend.
>>>>
>>>> It has nothing to do with the licensing options.
>>>>
>>>> //Mary
>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Jakob Fix <jakob.fix at gmail.com>
>>>>> Date: Sat, Feb 28, 2015 at 10:59 PM
>>>>> Subject: question about xdmp:encoding-language-detect
>>>>> To: General Mark Logic Developer Discussion
>>>>> <general at developer.marklogic.com>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> Using ML 7.0-3, the above function, given more than 3500 characters of
>>>>> Latvian news story text, returns Croatian twice and Serbian once in
>>>>> the top three results:
>>>>>
>>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>>>   <encoding>utf-8</encoding>
>>>>>   <language>hr</language>
>>>>>   <score>7.081</score>
>>>>> </encoding-language>
>>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>>>   <encoding>utf-8</encoding>
>>>>>   <language>hr</language>
>>>>>   <score>7.012</score>
>>>>> </encoding-language>
>>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>>>   <encoding>utf-8</encoding>
>>>>>   <language>sr</language>
>>>>>   <score>6.882</score>
>>>>> </encoding-language>
>>>>> ...
>>>>>
>>>>> and no Latvian in sight. Google translate as well as
>>>>> detectlanguage.com correctly and with sufficient self-assurance return
>>>>> the correct result.
>>>>>
>>>>> Can someone explain the reason behind this lack of confidence and
>>>>> the wrong detection? Does this need a particular language pack (I'm
>>>>> playing around with the developer licence, which I thought was
>>>>> full-featured)? Is this something that needs training? The doc doesn't
>>>>> say so.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> cheers,
>>>>> Jakob.
>>>>> _______________________________________________
>>>>> General mailing list
>>>>> General at developer.marklogic.com
>>>>> http://developer.marklogic.com/mailman/listinfo/general
>
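
A rough sketch of the post-detection filtering Justin suggests in the
thread, written in Python rather than XQuery so that it runs outside
MarkLogic. It parses the three results quoted in the original message and
keeps the best-scoring language that is plausible for a given source. The
<results> wrapper element, the function names, the set of plausible
languages, and the fallback value are all assumptions made for
illustration; none of them come from MarkLogic.

```python
import xml.etree.ElementTree as ET

# Namespace used by the result elements quoted in the thread.
NS = {"d": "xdmp:encoding-language-detect"}

# The three results quoted in the original message, wrapped in a
# hypothetical <results> element so they form one well-formed document.
SAMPLE = """<results>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>sr</language>
  <score>6.882</score>
</encoding-language>
</results>"""


def parse_candidates(xml_text):
    """Return a list of (language, score) pairs in document order."""
    root = ET.fromstring(xml_text)
    candidates = []
    for el in root.findall("d:encoding-language", NS):
        language = el.findtext("d:language", namespaces=NS)
        score = float(el.findtext("d:score", namespaces=NS))
        candidates.append((language, score))
    return candidates


def pick_language(candidates, plausible, fallback=None):
    """Return the highest-scoring candidate whose language is in `plausible`.

    `plausible` is domain knowledge supplied by the caller (hypothetical
    here), e.g. the languages a given news source is known to publish in.
    If no candidate matches, return `fallback`.
    """
    for language, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if language in plausible:
            return language
    return fallback


if __name__ == "__main__":
    candidates = parse_candidates(SAMPLE)
    # Assumed: this source only publishes in Baltic languages.
    print(pick_language(candidates, {"lv", "lt", "et"}, fallback="lv"))
```

With the quoted results, none of the three candidates is in the assumed
set, so the call falls back to "lv". This only works when the source
language can be constrained up front; it does not improve the detector
itself.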

