[MarkLogic Dev General] xpath string construction

Michael Blakeley michael.blakeley at marklogic.com
Wed Oct 15 10:43:30 PDT 2008


Yes, the check for uniqueness has a cost - but it's a small one, because 
this is exactly the kind of index lookup work that the server is 
designed for. Reading the sequence document performs a similar query, 
and then goes on to do even more work.

-- Mike

Eric Palmitesta wrote:
> Hey Michael,
> 
> When I read your earlier description, I got the impression that you'd 
> have to, for each randomly generated id, check every previously 
> generated id.  For a large (millions of documents) data store, where 
> unique ids are interspersed throughout your documents, won't this start 
> to chug?  Maybe I didn't understand your ideas, can you clarify this 
> point for me?
> 
> I see what you mean about scaling, and how locking/unlocking a single 
> file doesn't scale at all with a growing number of threads that need 
> unique ids.
> 
> Interesting idea of unique ids on a per-document basis, concatenated 
> with another sub-id, I hadn't thought of something like this.
> 
> I appreciate the debate, it's a quick way to learn.  :)
> 
> Much thanks,
> 
> Eric
> 
> Michael Blakeley wrote:
>> Eric,
>>
>> I don't understand why you say that random ids don't scale well. I'd 
>> argue that sequential ids don't scale at all, and that the recursive 
>> approach scales pretty well. Maybe we mean different things when we talk 
>> about scaling?
>>
>> For small numbers of ids per transaction, where small is less than a few 
>> hundred, the recursive random approach will perform well and will very 
>> rarely lock any extra documents at all. The queries to see if a document 
>> already exists will be indexed, and will never return more than 1 
>> fragment. So I would expect this approach to scale at O(log n) with the 
>> number of documents in the database.
>>
>> The sequential approach can do better with large number of ids per 
>> transaction. But this approach requires a lock on the sequence document, 
>> for every transaction, effectively serializing everything. You can 
>> mitigate this with multiple sequence documents, but then you don't, 
>> strictly speaking, have a sequence anymore.
>>
>> I suppose it depends on the workload you need to optimize: the random 
>> approach is better for highly-parallel workloads, while the sequential 
>> approach might do better for workloads that need thousands of ids per 
>> transaction and don't mind being single-threaded.
>>
>> In today's world, I'd rather use a solution that is capable of a high 
>> degree of parallelism. So I rarely, if ever, use sequential ids.
>>
>> To generate thousands of ids per transaction (say, 1 for each element in 
>> a document), I might start with a single random id. Any check would be 
>> highly unlikely to take more than zero iterations. Then I'd concatenate 
>> my own, in-memory sequence onto it, for each descendant element.
>>
>> So the root element might have an id "123af097324-1", its first child 
>> would be "123af097324-2", etc (apply string-pad if you want the ids to 
>> have a fixed length). This approach would preserve the scalability of 
>> random ids, without requiring extra work to check any but the first id.
>>
>> -- Mike
>>
>> Eric Palmitesta wrote:
>>> Wow, thanks for the reply, Michael.  I'll probably be using some 
>>> variation of one of your examples.
>>>
>>> Michael Blakeley wrote:
>>>> Many people ask about sequential ids. It is possible to model an id 
>>>> sequence as a database document. But as with RDBMS sequences, there 
>>>> are serialization penalties. I don't see the advantage of sequential 
>>>> ids, so I rarely, if ever, use this approach.
>>> Assuming the recursive check isn't feasible (it doesn't scale well), 
>>> the advantage of sequential ids is being able to sleep at night 
>>> knowing collisions are simply impossible, and are not reliant on a 
>>> 'good-enough' random() function.  I'm nit-picking of course, I'm sure 
>>> random() is fine.  :)
>>>
>>> Cheers,
>>>
>>> Eric
>>> _______________________________________________
>>> General mailing list
>>> General at developer.marklogic.com
>>> http://xqzone.com/mailman/listinfo/general
>> _______________________________________________
>> General mailing list
>> General at developer.marklogic.com
>> http://xqzone.com/mailman/listinfo/general
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general



More information about the General mailing list