[MarkLogic Dev General] xpath string construction

Eric Palmitesta eric.palmitesta at utoronto.ca
Wed Oct 15 11:13:44 PDT 2008


It does have work to do in terms of locking and unlocking the sequence 
document, but I don't see which similar query would be performed.  It 
reads an int, increments it, and writes it back.  At no point does it 
need to check an ever-increasing index for a matching (duplicate) id.

There's probably something silly I'm missing here.  Can you clarify why 
you consider reading a node from a sequence document a 'similar query'?

Thanks,

Eric

Michael Blakeley wrote:
> Yes, the check for uniqueness has a cost - but it's a small one, because 
> this is exactly the kind of index lookup work that the server is 
> designed for. Reading the sequence document performs a similar query, 
> and then goes on to do even more work.
> 
> -- Mike
> 
> Eric Palmitesta wrote:
>> Hey Michael,
>>
>> When I read your earlier description, I got the impression that you'd 
>> have to, for each randomly generated id, check every previously 
>> generated id.  For a large (millions of documents) data store, where 
>> unique ids are interspersed throughout your documents, won't this 
>> start to chug?  Maybe I didn't understand your ideas, can you clarify 
>> this point for me?
>>
>> I see what you mean about scaling, and how locking/unlocking a single 
>> file doesn't scale at all with a growing number of threads that need 
>> unique ids.
>>
>> Interesting idea of unique ids on a per-document basis, concatenated 
>> with another sub-id, I hadn't thought of something like this.
>>
>> I appreciate the debate, it's a quick way to learn.  :)
>>
>> Much thanks,
>>
>> Eric
>>
>> Michael Blakeley wrote:
>>> Eric,
>>>
>>> I don't understand why you say that random ids don't scale well. I'd 
>>> argue that sequential ids don't scale at all, and that the recursive 
>>> approach scales pretty well. Maybe we mean different things when we 
>>> talk about scaling?
>>>
>>> For small numbers of ids per transaction, where small is less than a 
>>> few hundred, the recursive random approach will perform well and will 
>>> very rarely lock any extra documents at all. The queries to see if a 
>>> document already exists will be indexed, and will never return more 
>>> than 1 fragment. So I would expect this approach to scale at O(log n) 
>>> with the number of documents in the database.
>>>
>>> The sequential approach can do better with large number of ids per 
>>> transaction. But this approach requires a lock on the sequence 
>>> document, for every transaction, effectively serializing everything. 
>>> You can mitigate this with multiple sequence documents, but then you 
>>> don't, strictly speaking, have a sequence anymore.
>>>
>>> I suppose it depends on the workload you need to optimize: the random 
>>> approach is better for highly-parallel workloads, while the 
>>> sequential approach might do better for workloads that need thousands 
>>> of ids per transaction and don't mind being single-threaded.
>>>
>>> In today's world, I'd rather use a solution that is capable of a high 
>>> degree of parallelism. So I rarely, if ever, use sequential ids.
>>>
>>> To generate thousands of ids per transaction (say, 1 for each element 
>>> in a document), I might start with a single random id. Any check 
>>> would be highly unlikely to take more than zero iterations. Then I'd 
>>> concatenate my own, in-memory sequence onto it, for each descendant 
>>> element.
>>>
>>> So the root element might have an id "123af097324-1", its first child 
>>> would be "123af097324-2", etc (apply string-pad if you want the ids 
>>> to have a fixed length). This approach would preserve the scalability 
>>> of random ids, without requiring extra work to check any but the 
>>> first id.
>>>
>>> -- Mike
>>>
>>> Eric Palmitesta wrote:
>>>> Wow, thanks for the reply, Michael.  I'll probably be using some 
>>>> variation of one of your examples.
>>>>
>>>> Michael Blakeley wrote:
>>>>> Many people ask about sequential ids. It is possible to model an id 
>>>>> sequence as a database document. But as with RDBMS sequences, there 
>>>>> are serialization penalties. I don't see the advantage of 
>>>>> sequential ids, so I rarely, if ever, use this approach.
>>>> Assuming the recursive check isn't feasible (it doesn't scale well), 
>>>> the advantage of sequential ids is being able to sleep at night 
>>>> knowing collisions are simply impossible, and are not reliant on a 
>>>> 'good-enough' random() function.  I'm nit-picking of course, I'm 
>>>> sure random() is fine.  :)
>>>>
>>>> Cheers,
>>>>
>>>> Eric
>>>> _______________________________________________
>>>> General mailing list
>>>> General at developer.marklogic.com
>>>> http://xqzone.com/mailman/listinfo/general
>>> _______________________________________________
>>> General mailing list
>>> General at developer.marklogic.com
>>> http://xqzone.com/mailman/listinfo/general
>> _______________________________________________
>> General mailing list
>> General at developer.marklogic.com
>> http://xqzone.com/mailman/listinfo/general
> 
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general


More information about the General mailing list