[MarkLogic Dev General] Diacritic search
gnanaarasanm at gmail.com
Wed Oct 1 10:51:27 PDT 2008
On Wed, Oct 1, 2008 at 1:35 PM, Dominic Mitchell <dom at semantico.com> wrote:
> On 30 Sep 2008, at 20:08, Gnana Arasan wrote:
> We are inserting XML (UTF-8) content using
> session.insertContent(uri, inputstream, options). By default, the encoding
> option is UTF-8 (ML version 3.5-2). For example, the person name José is
> stored. In cq, using doc(uri), the content appears as JosÃ©.
> The thing to do is to check the string-length() of "JosÃ©".
> If it's 4, then it's being stored correctly in MarkLogic. This means that
> the issue is to do with output: something is interpreting UTF-8 as
> ISO-8859-1.
> If it's 5 then it's being stored incorrectly in MarkLogic. This means that
> the input processes you thought were sending in UTF-8 are really
> interpreting the data as ISO-8859-1. I'd guess from your input mail that
> you're using Java to read the content in. I'd be *extremely* careful in
> Java, as it's all too easy to use the "system default encoding" by accident.
> This is normally cp-1252 on Windows, or MacRoman on a Mac, neither of which
> is particularly useful.
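
To make the 4-versus-5 check above concrete, here is a minimal sketch in Java of what happens when UTF-8 bytes are decoded as ISO-8859-1. The class and method names (MojibakeCheck, misread) are made up for illustration, and it assumes Java 7 or later for StandardCharsets. Note that Java's String.length() counts UTF-16 code units rather than characters, which gives the same numbers as string-length() for this particular name.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeCheck {
    // Hypothetical helper: encode as UTF-8, then decode the bytes the wrong
    // way (as ISO-8859-1), which is the mistake described in the thread.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "Jos\u00e9";      // "José", written with an escape so the
                                        // source file's own encoding cannot interfere
        String broken = misread(name);  // two-byte UTF-8 sequence C3 A9 becomes two chars

        System.out.println(name.length());   // 4: stored correctly
        System.out.println(broken.length()); // 5: bytes were misread on the way in
    }
}
```

If the stored value has length 5, the damage happened before the data reached the server, so the place to look is the code that builds the input stream.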
> Any time you read data in Java, you need to specify an encoding.
> Particular candidates to watch out for include FileReader<http://java.sun.com/j2se/1.5.0/docs/api/java/io/FileReader.html>
> and String.getBytes()<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()>.
> If you examine the code that's creating that inputStream, you may well find
> such an example.
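
As a rough illustration of the advice above, the following sketch names the encoding explicitly on both sides: InputStreamReader with a charset in place of FileReader, and getBytes(charset) in place of the no-argument getBytes(). The class and helper names (ExplicitEncoding, readUtf8, toUtf8Stream) are hypothetical, and it assumes Java 8 or later; on older JVMs Charset.forName("UTF-8") can stand in for StandardCharsets.UTF_8. It does not show the MarkLogic insert call itself, only how the stream handed to it might be built.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // Read a whole stream as text, always decoding as UTF-8 instead of
    // relying on the platform default encoding (as FileReader may do).
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Turn a String into a stream of UTF-8 bytes. The no-argument
    // getBytes() would use the platform default encoding instead.
    static InputStream toUtf8Stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String xml = "<name>Jos\u00e9</name>";
        String roundTripped = readUtf8(toUtf8Stream(xml));
        System.out.println(roundTripped.equals(xml));
    }
}
```

Because both directions name UTF-8, the result does not depend on whether the JVM runs on Windows, a Mac, or anything else.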
Hi Dom,
Thanks. I had made the same mistake you mentioned in Java.
I am now able to search on diacritics.