[MarkLogic Dev General] Diacritic search

Dominic Mitchell dom at semantico.com
Wed Oct 1 01:05:38 PDT 2008

On 30 Sep 2008, at 20:08, Gnana Arasan wrote:
> We are inserting the XML (UTF-8) content using
> session.insertContent(uri,inputstream,options). By default the option
> encoding is UTF-8 (ML version 3.5-2). For example, the person name josé is
> stored. In cq, using doc(uri), the content seems to be José.

The first thing to do is to check the string-length() of the stored
value "José" in MarkLogic (from cq, for example).

If it's 4, then it's being stored correctly in MarkLogic.  This means
that the issue is on the output side: something is interpreting the
UTF-8 bytes as ISO-8859-1.

If it's 5, then it's being stored incorrectly in MarkLogic: the two
UTF-8 bytes of "é" (0xC3 0xA9) have each been turned into a separate
character.  This means that the input processes you thought were
sending in UTF-8 are really interpreting the data as ISO-8859-1.  I'd
guess from your mail that you're using Java to read the content in.
I'd be extremely careful in Java, as it's all too easy to use the
"system default encoding" by accident.  This is normally Cp1252 on
Windows, or MacRoman on a Mac, neither of which is particularly useful.
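
To make the 4-versus-5 check concrete, here is a minimal Java sketch of
the failure described above.  It is only an illustration: the class and
method names are made up for this example, and it assumes Java 7 or
later for StandardCharsets.  It encodes "José" as UTF-8 and then
deliberately decodes those bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of MarkLogic or XCC.
class MojibakeDemo {
    // Encode as UTF-8, then decode the same bytes as ISO-8859-1
    // (wrong on purpose, to reproduce the symptom).
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "José", written with an escape so that the encoding of this
        // source file cannot interfere with the demonstration.
        String name = "Jos\u00e9";
        String broken = misdecode(name);

        System.out.println(name.length());   // 4
        System.out.println(broken.length()); // 5
    }
}
```

The extra character appears because "é" is two bytes in UTF-8, and
ISO-8859-1 maps every single byte to its own character.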

Any time you convert between bytes and characters in Java, you need to
specify an encoding.  Particular candidates to watch out for include
FileReader and String.getBytes() when used without a charset, as both
then fall back to the platform default.  If you examine the code that's
creating that InputStream, you may well find such an example.
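
As a rough sketch of what "specify an encoding" looks like in practice,
the following names UTF-8 explicitly at both conversion points.  The
class and helper names are invented for this example, and it assumes
Java 7 or later; how the resulting stream is handed to MarkLogic is
left out, since that depends on your own code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class ExplicitEncoding {
    // Bytes -> characters: name the charset on the InputStreamReader
    // instead of relying on the platform default.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Characters -> bytes: pass the charset to getBytes() rather than
    // calling the no-argument form.
    static InputStream toUtf8Stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String xml = "<name>Jos\u00e9</name>";
        String roundTripped = readUtf8(toUtf8Stream(xml));
        System.out.println(roundTripped.equals(xml)); // true
    }
}
```

The same idea applies to files: wrapping a FileInputStream in an
InputStreamReader with an explicit charset avoids the default-encoding
behaviour of a plain FileReader.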

