[MarkLogic Dev General] Diacritic search

Gnana Arasan gnanaarasanm at gmail.com
Wed Oct 1 10:51:27 PDT 2008


On Wed, Oct 1, 2008 at 1:35 PM, Dominic Mitchell <dom at semantico.com> wrote:

>   On 30 Sep 2008, at 20:08, Gnana Arasan wrote:
>
>    We are inserting the XML (UTF-8) content using
> session.insertContent(uri,inputstream,options). By default the encoding option is
> UTF-8 (ML version 3.5-2). For example, the person name José is stored. In cq, using
> doc(uri), the content appears as JosÃ©.
>
>
> The thing to do is to check the string-length() of "JosÃ©".
>
> If it's 4, then it's being stored correctly in MarkLogic.  This means that
> the issue is with output: something is interpreting the UTF-8 bytes as
> ISO-8859-1.
>
> If it's 5 then it's being stored incorrectly in MarkLogic.  This means that
> the input processes you thought were sending in UTF-8 are really
> interpreting the data as ISO-8859-1.  I'd guess from your input mail that
> you're using Java to read the content in.  I'd be *extremely* careful in
> Java, as it's all too easy to use the "system default encoding" by accident.
> This is normally cp-1252 on Windows, or MacRoman on a Mac, neither of which
> is particularly useful.
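
The 4-versus-5 check can also be reproduced outside MarkLogic. Here is a minimal Java sketch that decodes the UTF-8 bytes of "José" as ISO-8859-1 and compares the lengths. The class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later (on an older JVM the charset names "UTF-8" and "ISO-8859-1" would be passed as strings instead).

```java
import java.nio.charset.StandardCharsets;

public class MojibakeLength {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    // This reproduces the misreading described in the message above.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute; the escape keeps this source file pure ASCII.
        String correct = "Jos\u00e9";
        String garbled = misread(correct);

        System.out.println(correct.length()); // 4: one character per letter
        System.out.println(garbled.length()); // 5: the 2-byte e-acute became 2 characters
    }
}
```

If string-length() on the stored value gives the second number, the damage probably happened before the content reached the server.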
>
> Any time you read data in Java, you need to specify an encoding.
> Particular candidates to watch out for include FileReader
> <http://java.sun.com/j2se/1.5.0/docs/api/java/io/FileReader.html>
> and String.getBytes()
> <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()>.
> If you examine the code that's creating that inputStream, you may well find
> such an example.
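
As a rough illustration of naming the encoding explicitly, the sketch below uses InputStreamReader with a charset where FileReader would have used the default, and getBytes(charset) where the no-argument getBytes() would have. The class and helper names are hypothetical, nothing here is MarkLogic API, and StandardCharsets again assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // Decode a byte stream as UTF-8 explicitly, never via the platform default.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Encode a string as UTF-8 explicitly, instead of calling getBytes()
    // with no argument (which uses the platform default encoding).
    static InputStream toUtf8Stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String xml = "<person><name>Jos\u00e9</name></person>";
        String roundTripped = readUtf8(toUtf8Stream(xml));
        System.out.println(roundTripped.equals(xml)); // true
    }
}
```

A stream built this way carries UTF-8 bytes regardless of the machine it runs on, which is what an insert call declaring UTF-8 expects to receive.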
>
> -Dom
>
Hi Dom,
Thanks. I had made the same mistake you mentioned in Java.
I am now able to search on diacritics.
-Gnana Arasan.M