[MarkLogic Dev General] Trailing Spaces Removed from Attribute Values--Bug or Feature?

Mike Sokolov sokolov at ifactory.com
Tue Mar 11 06:08:37 PST 2008


Agreed; however, it's not clear that trailing whitespace needs to be
preserved in order to search for DITA tokens, as in the original
example.  I guess it might depend on just what the tokens consist of,
but a word or phrase search might be able to make use of the implicit
tokenization done by the indexer without needing the trailing
whitespace.
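
To illustrate the general idea (this is only a sketch in Python, not
MarkLogic code, and it says nothing about how the MarkLogic indexer
actually tokenizes a value like "topic/topic"): if matching is done per
blank-delimited token rather than per substring, the leading and
trailing spaces stop mattering. The function name and the sample class
values below are made up for the illustration.

```python
def has_class_token(class_value: str, token: str) -> bool:
    """Return True if token is one of the blank-delimited tokens in class_value."""
    # str.split() with no arguments splits on runs of whitespace and ignores
    # leading/trailing whitespace, so trimmed and untrimmed values behave alike.
    return token in class_value.split()


# Hypothetical DITA-style class values, as authored and after trimming.
original = "- topic/topic "
trimmed = original.strip()

print(has_class_token(original, "topic/topic"))                # True
print(has_class_token(trimmed, "topic/topic"))                 # True
print(has_class_token("- mytopic/topic-foo ", "topic/topic"))  # False

# A plain substring test is what produces the false positive.
print("topic/topic" in "- mytopic/topic-foo ")                 # True
```

Whether a word search in the server behaves like the token test or like
the substring test is exactly the open question here.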

E.g., cts:attribute-word-search(..."topic/topic") ought to match
"topic/topic" and not match "mytopic/topic-foo", I think.

-Mike

David Sewell wrote:
> Someone from Mark Logic really needs to weigh in on this. It appears
> that ML Server is doing attribute value normalization upon loading:
>
>  http://www.w3.org/TR/REC-xml/#AVNormalize
>
> However, the spec says "All attributes for which no declaration has been
> read SHOULD be treated by a non-validating processor as if declared
> CDATA." Meaning that unless a schema is associated with the file, the
> server should not be normalizing attribute whitespace, unless I'm not
> understanding something properly.
>
> I also confirmed this behavior with a simple XML file load.
>
> David S.
>
> On Mon, 10 Mar 2008, Eliot Kimber wrote:
>
>   
>> In storing some DITA documents into MarkLogic I discovered that the trailing
>> spaces in the DITA class= attributes are not preserved. I created a simple
>> test and got the same behavior, e.g.:
>>
>> xdmp:document-load("test.xml", <root foo=" bar "/>)
>>
>> <result>{string(doc("test.xml")/root/@foo)}</result>
>>
>> Returns:
>>
>> <result>bar</result>
>>
>> not
>>
>> <result> bar </result>
>>
>> The DITA standard requires the trailing spaces in the class= values because
>> the value is a sequence of blank-delimited tokens, where you need to be able
>> to match on " {token} " so you don't get false positives. For example, the
>> "topic" type is the token "topic/topic"; without the spaces, a search for
>> "topic/topic" would also match "mytopic/topic-foo", which would be bad.
>>
>> My question is: is this behavior unalterable or is it configurable?
>>
>> This behavior does make it impossible to use DITA documents stored in
>> MarkLogic with any normal DITA-aware processor (because they all expect there
>> to be a trailing space in the class= value) without some serious workaround
>> (essentially a post-fetch fixup to add back in the trailing space).
>>
>> Cheers,
>>
>> Eliot
>>
>>
>>     
>
>   