[MarkLogic Dev General] BOM char and UTF-16
geert.josten at dayon.nl
Wed Feb 8 22:31:52 PST 2012
Your first line doesn’t show where the BOM is located. It should be the
first two bytes of the file, before the XML declaration. Note: the encoding
attribute in the XML declaration doesn’t guarantee the file really is written
in that encoding, though it is usually a strong hint, particularly if the
file was written with XML tools/libraries. I’m not sure whether MarkLogic
handles the BOM well, but I thought it did: as far as I recall, I have
uploaded UTF-8 files with a BOM without problems.
But changing the encoding of the file on the fly to match that of the
MarkLogic app server setting is a good workaround too, I guess.
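A rough sketch of that workaround in plain Java, assuming the app server expects UTF-8 and the whole file fits in memory as a byte array. The class and method names are made up for illustration and are not part of XCC. It also rewrites the encoding name in the XML declaration so it no longer claims UTF-16; whether MarkLogic needs that step is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of XCC or MarkLogic): re-encodes UTF-16 bytes as UTF-8.
final class Utf16ToUtf8 {

    static byte[] transcode(byte[] raw) {
        // Assumption: input without a BOM is treated as big-endian.
        Charset source = StandardCharsets.UTF_16BE;
        int offset = 0;
        if (raw.length >= 2) {
            int b0 = raw[0] & 0xFF;
            int b1 = raw[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) {          // FF FE: little-endian BOM
                source = StandardCharsets.UTF_16LE;
                offset = 2;
            } else if (b0 == 0xFE && b1 == 0xFF) {   // FE FF: big-endian BOM
                offset = 2;
            }
        }
        // Decode everything after the BOM, so no U+FEFF ends up in the text.
        String text = new String(raw, offset, raw.length - offset, source);

        // Keep the XML declaration consistent with the new byte encoding.
        if (text.startsWith("<?xml")) {
            int end = text.indexOf("?>");
            if (end > 0) {
                String decl = text.substring(0, end).replaceFirst(
                        "encoding\\s*=\\s*([\"'])[Uu][Tt][Ff]-16\\1",
                        "encoding=\"UTF-8\"");
                text = decl + text.substring(end);
            }
        }
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

The returned bytes start directly with the XML declaration and carry no BOM, so they could be handed to whatever builds the Content object, with the encoding option set to UTF-8 instead of UTF-16.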
*From:* general-bounces at developer.marklogic.com [mailto:
general-bounces at developer.marklogic.com] *On behalf of* Josh Warner-Burke
*Sent:* Wednesday, 8 February 2012 22:49
*To:* general at developer.marklogic.com
*Subject:* [MarkLogic Dev General] BOM char and UTF-16
I emailed about a week ago about a problem I was having with XCC and large
files. I got some very good advice which said I needed to use
session.insertContent to get the file in. I'm done with that conversion
but am now dealing with the problems that resulted from the change.
What I'm looking at right now is a file that is UTF-16 and begins with the
two BOM bytes, which I have learned actually matter: they tell any string
parser/consumer what order the bytes in each pair will be in.
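To make the byte-order point concrete, here is a small plain-Java sketch (the class and method names are invented for illustration) that prints the bytes of a string in both UTF-16 byte orders, with the BOM character U+FEFF in front.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows how the BOM and the text look in each byte order.
final class Utf16Bytes {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes U+FEFF followed by the text in the requested byte order.
    static String withBom(String text, boolean littleEndian) {
        String marked = "\uFEFF" + text;
        return hex(marked.getBytes(littleEndian
                ? StandardCharsets.UTF_16LE
                : StandardCharsets.UTF_16BE));
    }
}
```

For the single character "<", withBom("<", true) gives FF FE 3C 00 and withBom("<", false) gives FE FF 00 3C. The BOM is the same character in both cases; only the order of its two bytes differs, and that is what a reader uses to tell the two layouts apart.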
I wrote some code that strips out the BOMs but it seems to screw the
encoding up altogether. I also put in code to set the encoding to UTF16 in
the ContentCreateOptions. Without stripping BOMs, I get this:
Invalid root text "ÿþ" at [uri] line 1
To deal with UTF-16 don't you *need* those BOMs? What am I missing here?
FYI the first line of the files looks like:
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
So it's clearly utf-16.
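A plain-Java sketch of what may be going on, using only JDK charsets (the helper names are invented, and none of this is confirmed MarkLogic behaviour). The two bytes FF FE read as a single-byte encoding such as ISO-8859-1 come out as "ÿþ", which matches the error text. Java's own "UTF-16" decoder uses the BOM to pick the byte order and falls back to big-endian when the BOM is missing, which would explain why stripping it scrambles a little-endian file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: how the same bytes decode under different assumptions.
final class BomDecodeDemo {

    // Reads each byte as one character, as a single-byte decoder would.
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // JDK "UTF-16": honours a leading BOM, otherwise assumes big-endian.
    static String asUtf16(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16);
    }

    // Explicit little-endian decoding; no BOM needed to choose the order.
    static String asUtf16Le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Drops the first two bytes, i.e. what a naive BOM stripper does.
    static byte[] dropFirstTwoBytes(byte[] raw) {
        return Arrays.copyOfRange(raw, 2, raw.length);
    }
}
```

With little-endian input that starts with FF FE, asUtf16 returns the expected text. After dropFirstTwoBytes, asUtf16 returns garbage because the byte pairs are read in the wrong order, while asUtf16Le still decodes correctly. So in Java terms the choice is either to keep the BOM and say "UTF-16", or to strip it and name the byte order explicitly; whether the server accepts "UTF-16LE" as an encoding name is an assumption to check.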
There is some leeway in terms of how I create the Content object to feed to
insertContent - currently I'm treating it as a byte array - but I could do
string conversion etc if that's what I need to do. Any help is
(e): jwburke at 42six.com