[MarkLogic Dev General] Sorting pinyin text?

Marc Moskowitz mmoskowitz at ifactory.com
Tue Apr 1 07:31:12 PST 2008


Mary Holstege wrote:
> On Tue, 01 Apr 2008 07:05:45 -0700, Marc Moskowitz 
> <mmoskowitz at ifactory.com> wrote:
>
>> I'm trying to sort transliterations of Chinese words by standard 
>> pinyin sorting (syllable alphabetically, then by tone, followed by 
>> the next syllable). Is there a collation in either English or Chinese 
>> that deals correctly with this? If not, is there some way of creating 
>> a user-defined sort order? I know that I can create a sortable form 
>> for each word that sorts correctly by codepoint, but I would rather 
>> do something more efficient if possible.
>> Marc Moskowitz
>> Interactive Factory
>> _______________________________________________
>> General mailing list
>> General at developer.marklogic.com
>> http://xqzone.com/mailman/listinfo/general
>
>
> The collation named "http://marklogic.com/collation/zh" ought to
> do what you want.  Pinyin is the default ordering for (mainland)
> Chinese.  There is no way of defining your own collation.  In
> theory you could write your own ordering function that operated
> on the strings, but it would be fairly painful and slow I imagine.
>
> //Mary
>
> Mary Holstege
> Lead Engineer
> Mark Logic Corporation
Mary,

The standard zh collation sorts Chinese characters correctly, but I'm 
trying to sort the pinyin transliterations. For example, this XQuery:

default collation="http://marklogic.com/collation/zh"
let $words := ('fù-bèi shòu dí', 'fùdi', 'fùgǎo', 'fūzi', 'fùtòng',
               'fùxiè', 'fù-mu')
for $x in $words
order by $x
return $x

returns

fù-bèi shòu dí
fù-mu
fùdi
fùgǎo
fùtòng
fùxiè
fūzi

which is in codepoint order (ù, U+00F9, sorts before ū, U+016B, and 
the hyphen sorts before any letter), instead of the correct order:

fūzi (1st tone comes before 4th)
fù-bèi shòu dí
fùdi
fùgǎo
fù-mu (hyphens should be ignored)
fùtòng
fùxiè

Am I correct that the supported way to sort this text is to create a 
sortable form for each of these strings at document load time?
-Marc
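
One possible shape for such a sortable form, offered as a rough sketch
and not as a supported MarkLogic feature: decompose each string to NFD
so that every tone mark becomes a separate combining character, replace
the four marks with the digits 1 to 4, and drop hyphens. The resulting
key can then be ordered by codepoint. The function name
local:pinyin-sort-key is made up for illustration. The code assumes
standard XQuery 1.0 syntax (not the older dialect used in the query
above) and assumes the server accepts the W3C codepoint collation URI.

```xquery
xquery version "1.0";

(: Hypothetical helper, not a MarkLogic built-in. It builds a key whose
   codepoint order approximates pinyin order. :)
declare function local:pinyin-sort-key($s as xs:string) as xs:string
{
  (: NFD splits a precomposed letter such as U+00F9 into "u" + U+0300 :)
  let $nfd := fn:normalize-unicode(fn:lower-case($s), "NFD")
  (: macron, acute, caron, grave become 1, 2, 3, 4. The hyphen has no
     counterpart in the third argument, so fn:translate removes it. :)
  return fn:translate($nfd, "&#x304;&#x301;&#x30C;&#x300;-", "1234")
};

let $words := ('fù-bèi shòu dí', 'fùdi', 'fùgǎo', 'fūzi', 'fùtòng',
               'fùxiè', 'fù-mu')
for $x in $words
order by local:pinyin-sort-key($x)
  collation "http://www.w3.org/2005/xpath-functions/collation/codepoint"
return $x
```

For the seven sample words the keys come out as fu1zi,
"fu4be4i sho4u di2", fu4di, fu4ga3o, fu4mu, fu4to4ng and fu4xie4, so
the query should return the words in the order listed above as correct.
The sketch has known gaps: the digit follows the marked vowel instead
of closing the syllable, so fǎng (fa3ng) would sort before fàn (fa4n),
and unmarked neutral-tone syllables get no digit at all. A complete
solution would need to split the text into syllables first. At load
time the same key could be stored alongside the original string and
used for ordering.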

