[MarkLogic Dev General] xpath string construction

Eric Palmitesta eric.palmitesta at utoronto.ca
Wed Oct 15 11:21:45 PDT 2008


Ah, ok.

Thanks for putting up with my questions, guys.

Eric

Ian Small wrote:
> What Michael is saying is that the query that finds the sequence
> document is of similar complexity to the query that would check for
> uniqueness.  So the cost of finding the sequence document = the cost of
> checking for uniqueness.  After you are done with that cost, then the
> sequence approach requires reading the doc (probably already somewhere
> in cache), then updating it and writing it back, updating indexes etc.
> 
> So "similar query" != "cts:similar-query()".  "similar query" = "a query
> of effectively the same complexity as the other one I am talking about"
> 
> ian 
> 
>> -----Original Message-----
>> From: general-bounces at developer.marklogic.com 
>> [mailto:general-bounces at developer.marklogic.com] On Behalf Of 
>> Eric Palmitesta
>> Sent: Wednesday, October 15, 2008 11:14 AM
>> To: General Mark Logic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] xpath string construction
>>
>> It does have work to do in terms of locking and unlocking the 
>> sequence document, but I don't see which similar query would 
>> be performed.  It reads an int, increments it, and writes it 
>> back.  At no point does it need to check an ever-increasing 
>> index for a matching (duplicate) id.
>>
>> There's probably something silly I'm missing here.  Can you 
>> clarify why you consider reading a node from a sequence 
>> document a 'similar query'?
>>
>> Thanks,
>>
>> Eric
>>
>> Michael Blakeley wrote:
>>> Yes, the check for uniqueness has a cost - but it's a small one, 
>>> because this is exactly the kind of index lookup work that 
>> the server 
>>> is designed for. Reading the sequence document performs a similar 
>>> query, and then goes on to do even more work.
>>>
>>> -- Mike
>>>
>>> Eric Palmitesta wrote:
>>>> Hey Michael,
>>>>
>>>> When I read your earlier description, I got the impression 
>> that you'd 
>>>> have to, for each randomly generated id, check every previously 
>>>> generated id.  For a large (millions of documents) data 
>> store, where 
>>>> unique ids are interspersed throughout your documents, won't this 
>>>> start to chug?  Maybe I didn't understand your ideas, can 
>> you clarify 
>>>> this point for me?
>>>>
>>>> I see what you mean about scaling, and how 
>> locking/unlocking a single 
>>>> file doesn't scale at all with a growing number of threads 
>> that need 
>>>> unique ids.
>>>>
>>>> Interesting idea of unique ids on a per-document basis, 
>> concatenated 
>>>> with another sub-id, I hadn't thought of something like this.
>>>>
>>>> I appreciate the debate, it's a quick way to learn.  :)
>>>>
>>>> Much thanks,
>>>>
>>>> Eric
>>>>
>>>> Michael Blakeley wrote:
>>>>> Eric,
>>>>>
>>>>> I don't understand why you say that random ids don't 
>> scale well. I'd 
>>>>> argue that sequential ids don't scale at all, and that 
>> the recursive 
>>>>> approach scales pretty well. Maybe we mean different 
>> things when we 
>>>>> talk about scaling?
>>>>>
>>>>> For small numbers of ids per transaction, where small is 
>> less than a 
>>>>> few hundred, the recursive random approach will perform well and 
>>>>> will very rarely lock any extra documents at all. The 
>> queries to see 
>>>>> if a document already exists will be indexed, and will 
>> never return 
>>>>> more than 1 fragment. So I would expect this approach to scale at 
>>>>> O(log n) with the number of documents in the database.
>>>>>
>>>>> The sequential approach can do better with large number 
>> of ids per 
>>>>> transaction. But this approach requires a lock on the sequence 
>>>>> document, for every transaction, effectively serializing 
>> everything.
>>>>> You can mitigate this with multiple sequence documents, 
>> but then you 
>>>>> don't, strictly speaking, have a sequence anymore.
>>>>>
>>>>> I suppose it depends on the workload you need to optimize: the 
>>>>> random approach is better for highly-parallel workloads, 
>> while the 
>>>>> sequential approach might do better for workloads that need 
>>>>> thousands of ids per transaction and don't mind being 
>> single-threaded.
>>>>> In today's world, I'd rather use a solution that is capable of a 
>>>>> high degree of parallelism. So I rarely, if ever, use 
>> sequential ids.
>>>>> To generate thousands of ids per transaction (say, 1 for each 
>>>>> element in a document), I might start with a single 
>> random id. Any 
>>>>> check would be highly unlikely to take more than zero iterations. 
>>>>> Then I'd concatenate my own, in-memory sequence onto it, for each 
>>>>> descendant element.
>>>>>
>>>>> So the root element might have an id "123af097324-1", its first 
>>>>> child would be "123af097324-2", etc (apply string-pad if you want 
>>>>> the ids to have a fixed length). This approach would preserve the 
>>>>> scalability of random ids, without requiring extra work 
>> to check any 
>>>>> but the first id.
>>>>>
>>>>> -- Mike
>>>>>
>>>>> Eric Palmitesta wrote:
>>>>>> Wow, thanks for the reply, Michael.  I'll probably be using some 
>>>>>> variation of one of your examples.
>>>>>>
>>>>>> Michael Blakeley wrote:
>>>>>>> Many people ask about sequential ids. It is possible to 
>> model an 
>>>>>>> id sequence as a database document. But as with RDBMS 
>> sequences, 
>>>>>>> there are serialization penalties. I don't see the advantage of 
>>>>>>> sequential ids, so I rarely, if ever, use this approach.
>>>>>> Assuming the recursive check isn't feasible (it doesn't scale 
>>>>>> well), the advantage of sequential ids is being able to sleep at 
>>>>>> night knowing collisions are simply impossible, and are 
>> not reliant 
>>>>>> on a 'good-enough' random() function.  I'm nit-picking 
>> of course, 
>>>>>> I'm sure random() is fine.  :)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Eric
>>>>>> _______________________________________________
>>>>>> General mailing list
>>>>>> General at developer.marklogic.com
>>>>>> http://xqzone.com/mailman/listinfo/general
>>>>> _______________________________________________
>>>>> General mailing list
>>>>> General at developer.marklogic.com
>>>>> http://xqzone.com/mailman/listinfo/general
>>>> _______________________________________________
>>>> General mailing list
>>>> General at developer.marklogic.com
>>>> http://xqzone.com/mailman/listinfo/general
>>> _______________________________________________
>>> General mailing list
>>> General at developer.marklogic.com
>>> http://xqzone.com/mailman/listinfo/general
>> _______________________________________________
>> General mailing list
>> General at developer.marklogic.com
>> http://xqzone.com/mailman/listinfo/general
>>
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general


More information about the General mailing list