[MarkLogic Dev General] xpath string construction

Eric Palmitesta eric.palmitesta at utoronto.ca
Wed Oct 15 06:52:34 PDT 2008


Hey Michael,

When I read your earlier description, I got the impression that you'd 
have to, for each randomly generated id, check every previously 
generated id.  For a large (millions of documents) data store, where 
unique ids are interspersed throughout your documents, won't this start 
to chug?  Maybe I didn't understand your ideas, can you clarify this 
point for me?

I see what you mean about scaling, and how locking/unlocking a single 
file doesn't scale at all with a growing number of threads that need 
unique ids.

Interesting idea of unique ids on a per-document basis, concatenated 
with another sub-id, I hadn't thought of something like this.

I appreciate the debate, it's a quick way to learn.  :)

Much thanks,

Eric

Michael Blakeley wrote:
> Eric,
> 
> I don't understand why you say that random ids don't scale well. I'd 
> argue that sequential ids don't scale at all, and that the recursive 
> approach scales pretty well. Maybe we mean different things when we talk 
> about scaling?
> 
> For small numbers of ids per transaction, where small is less than a few 
> hundred, the recursive random approach will perform well and will very 
> rarely lock any extra documents at all. The queries to see if a document 
> already exists will be indexed, and will never return more than 1 
> fragment. So I would expect this approach to scale at O(log n) with the 
> number of documents in the database.
> 
> The sequential approach can do better with large number of ids per 
> transaction. But this approach requires a lock on the sequence document, 
> for every transaction, effectively serializing everything. You can 
> mitigate this with multiple sequence documents, but then you don't, 
> strictly speaking, have a sequence anymore.
> 
> I suppose it depends on the workload you need to optimize: the random 
> approach is better for highly-parallel workloads, while the sequential 
> approach might do better for workloads that need thousands of ids per 
> transaction and don't mind being single-threaded.
> 
> In today's world, I'd rather use a solution that is capable of a high 
> degree of parallelism. So I rarely, if ever, use sequential ids.
> 
> To generate thousands of ids per transaction (say, 1 for each element in 
> a document), I might start with a single random id. Any check would be 
> highly unlikely to take more than zero iterations. Then I'd concatenate 
> my own, in-memory sequence onto it, for each descendant element.
> 
> So the root element might have an id "123af097324-1", its first child 
> would be "123af097324-2", etc (apply string-pad if you want the ids to 
> have a fixed length). This approach would preserve the scalability of 
> random ids, without requiring extra work to check any but the first id.
> 
> -- Mike
> 
> Eric Palmitesta wrote:
>> Wow, thanks for the reply, Michael.  I'll probably be using some 
>> variation of one of your examples.
>>
>> Michael Blakeley wrote:
>>> Many people ask about sequential ids. It is possible to model an id 
>>> sequence as a database document. But as with RDBMS sequences, there 
>>> are serialization penalties. I don't see the advantage of sequential 
>>> ids, so I rarely, if ever, use this approach.
>>
>> Assuming the recursive check isn't feasible (it doesn't scale well), 
>> the advantage of sequential ids is being able to sleep at night 
>> knowing collisions are simply impossible, and are not reliant on a 
>> 'good-enough' random() function.  I'm nit-picking of course, I'm sure 
>> random() is fine.  :)
>>
>> Cheers,
>>
>> Eric
>> _______________________________________________
>> General mailing list
>> General at developer.marklogic.com
>> http://xqzone.com/mailman/listinfo/general
> 
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general


More information about the General mailing list