[MarkLogic Dev General] question about xdmp:encoding-language-detect

Justin Makeig Justin.Makeig at marklogic.com
Fri Mar 27 09:01:37 PDT 2015


Jakob,
Are there any other markers specific to your domain that could help you triangulate? The built-in detection doesn't (and can't) know the context of your business. Some pre- or post-detection analysis might help you narrow down the candidates. For example, is a specific source known not to have Croatian or Serbian content, but might have Latvian? Are there entities (e.g. names, addresses) that are decent indicators of Latvian? I don't know the specifics of your app or content, but there might be other context that you could pull in to enhance the out-of-the-box identification.
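
To make the post-detection idea a bit more concrete, here is a minimal sketch in Python (outside MarkLogic, and not anything the product does). It takes the serialized candidates quoted further down in this thread and combines them with two pieces of domain context: a per-source set of plausible languages, and a count of letters that the Latvian alphabet has but Croatian and Latin-script Serbian do not. The function names, the `allowed` set, the sample sentence, and the 0.02 threshold are all assumptions made up for illustration; you would want to tune them against your own content.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Namespace of the elements shown in the quoted detection output.
NS = {"d": "xdmp:encoding-language-detect"}

# Letters of the Latvian alphabet not used by Croatian or Latin-script Serbian.
LATVIAN_ONLY = set("āēģīķļņū")

# The three candidates from the original question, copied as-is.
DETECTED = """
<encoding-language xmlns="xdmp:encoding-language-detect">
 <encoding>utf-8</encoding><language>hr</language><score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
 <encoding>utf-8</encoding><language>hr</language><score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
 <encoding>utf-8</encoding><language>sr</language><score>6.882</score>
</encoding-language>
"""

# Hypothetical sample sentence, only here to exercise the heuristic.
LATVIAN_SAMPLE = ("Rīgā šodien notika valdības sēde, "
                  "kurā ministri sprieda par nākamā gada budžetu.")


def parse_candidates(xml_fragments):
    """Turn serialized <encoding-language> elements into (language, score) pairs."""
    # Wrap the sequence in a dummy root so it parses as one document.
    root = ET.fromstring("<results>" + xml_fragments + "</results>")
    pairs = []
    for el in root.findall("d:encoding-language", NS):
        lang = el.findtext("d:language", namespaces=NS)
        score = float(el.findtext("d:score", namespaces=NS))
        pairs.append((lang, score))
    return pairs


def latvian_marker_ratio(text):
    """Share of letters in the text that are Latvian-only letters."""
    # Normalize to NFC so accented letters are single code points.
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = [c for c in normalized if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in LATVIAN_ONLY for c in letters) / len(letters)


def pick_language(candidates, text, allowed, marker_threshold=0.02):
    """Pick a language using detector candidates plus domain context.

    allowed: languages this source is known to carry (assumed to be known).
    marker_threshold: assumed cut-off for the Latvian-only letter ratio.
    """
    # Domain hint first: enough Latvian-only letters in a source that may carry Latvian.
    if "lv" in allowed and latvian_marker_ratio(text) >= marker_threshold:
        return "lv"
    # Otherwise take the best-scoring candidate that the source can plausibly contain.
    for lang, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if lang in allowed:
            return lang
    return None  # nothing plausible; leave it for manual review


print(pick_language(parse_candidates(DETECTED), LATVIAN_SAMPLE, {"lv", "lt", "et"}))
# prints: lv
```

The letter check is deliberately crude; the point is only that context the detector cannot see (which languages a source carries, which letters or entities are telling) can override or filter its ranking.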

Justin


--
Justin Makeig
Director, Product Management
MarkLogic
justin.makeig at marklogic.com
+1 (650) 655-2387

> On Mar 27, 2015, at 8:44 AM, Jakob Fix <jakob.fix at gmail.com> wrote:
> 
> Thanks Mary for your quick reply. It's an explanation that I
> understand, but this doesn't resolve my initial problem.
> Any idea how to solve this in the short term and whether there are
> improvements in the pipeline? Or that it's not a high priority?
> 
> cheers,
> Jakob.
> 
> 
> On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege
> <mary.holstege at marklogic.com> wrote:
>> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <jakob.fix at gmail.com> wrote:
>> 
>>> Hello, I think this message got lost when the mailing list was down in
>>> February (or nobody has an answer ...)
>>> 
>>> Thanks,
>>> Jakob.
>> 
>> The xdmp:encoding-language-detect function uses the ICU libraries to do the
>> detection. Serbian and Croatian are very closely related to each other and
>> have some similar orthography to Latvian (although not a great deal of
>> linguistic similarity, it must be said). I think the ICU libraries
>> probably lack some of the linguistic sophistication of Google's backend.
>> 
>> It has nothing to do with the licensing options.
>> 
>> //Mary
>> 
>>> 
>>> ---------- Forwarded message ----------
>>> From: Jakob Fix <jakob.fix at gmail.com>
>>> Date: Sat, Feb 28, 2015 at 10:59 PM
>>> Subject: question about xdmp:encoding-language-detect
>>> To: General Mark Logic Developer Discussion
>>> <general at developer.marklogic.com>
>>> 
>>> 
>>> Hello,
>>> 
>>> Using MarkLogic 7.0-3, the above function, given more than 3500 characters of
>>> Latvian news story text, returns Croatian twice and Serbian once in
>>> the top three results:
>>> 
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>  <encoding>utf-8</encoding>
>>>  <language>hr</language>
>>>  <score>7.081</score>
>>> </encoding-language>
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>  <encoding>utf-8</encoding>
>>>  <language>hr</language>
>>>  <score>7.012</score>
>>> </encoding-language>
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>  <encoding>utf-8</encoding>
>>>  <language>sr</language>
>>>  <score>6.882</score>
>>> </encoding-language>
>>> ...
>>> 
>>> and no Latvian in sight. Google Translate as well as
>>> detectlanguage.com return the correct result with sufficient
>>> self-assurance.
>>> 
>>> Can someone explain what the reason behind this lack of confidence and
>>> the wrong detection is? Do you need the right language pack (I'm
>>> playing around with the developer licence which I thought is
>>> full-featured)? Is this something that needs training? The doc doesn't
>>> say so.
>>> 
>>> Thanks!
>>> 
>>> cheers,
>>> Jakob.
>>> _______________________________________________
>>> General mailing list
>>> General at developer.marklogic.com
>>> http://developer.marklogic.com/mailman/listinfo/general